Thursday, May 31, 2012

outage for more than two weeks, Docomo unresponsive

[On May 12th, 2012, the cellular modem went offline again.  This is an email I sent out on Monday, May 14th:]
This morning I noticed that it's offline again.  The modem's been unreachable since Saturday afternoon UTC (that's Saturday night Saipan time, Saturday morning Miami time, and early in the morning Saturday Honolulu time).
[This time, Ross had a lot of difficulty getting the attention of the Docomo people.  This is from his email on May 14th:]
I did in fact notice the outage. I just sent a note to Docomo to check.
[And then on May 21st:]
The modem is still unreachable and I have received no response from Docomo from my 5/14 email. I just emailed them again today.
[And finally, success, on Tuesday, May 29th:]
My request to Domoco finally came through (after a third email). The modem is back online now. It is still unclear why the modem connection keeps dropping. The only known way to remotely reset it is through Docomo. I'll continue keeping tabs on the network status.
[Later Ross speculated about the reason why Docomo took so long to respond to his May 14th message.  This is from an email dated May 30th:]
My impression of the Docomo situation is that we were put on the back-burner. My email yesterday contained a bit more pressure and that did the trick. Also, the account manager (Kevin McCale) was on leave for a couple weeks and the technician (Luigi Onglao) cc'd responded to him, but not me. Luigi thought that Kevin had responded. Just a miscommunication. They are both usually good about responding promptly.
[However, the modem connection continues to be spotty and error-prone following this reset until about June 22nd (for details, see later blog entry timestamped June 22nd).]

Mike J+

Tuesday, May 15, 2012

another Docomo outage (for a few days)

[From an email I sent out on Monday, May 14th, 2012:]
I noticed last week that the feed from the LLBP7 station had been interrupted again.  Last week I was hugely busy, however, and by the time I sat down on Friday to send out an alert email like this one, the data feed was working again.  I think it was probably offline for about two or three days.
[From a reply by Ross Timmerman, later that same day:]
Last week the modem was offline a few days and I reported it to Docomo. They responded fairly quick asking me to try connecting again, and it was fixed. I asked Docomo what did but they never elaborated, nor did they explain why the outage occurred. I will keep an eye out for connection issues and keep Docomo posted.

Friday, May 4, 2012

detailed discussion of PacIOOS/datalogger logic

On April 26th, Ross Timmerman contacted me about a "data alignment" problem he was experiencing with the PacIOOS data.  The PacIOOS CTD's data are recoverable in two ways.  The first way is from its real-time reports written from its serial port to the listening datalogger, which are then parsed, stored in datalogger memory, and included in data table downloads (via the cellular modem) from the datalogger.  These downloads are then written to a file that is monitored by a Java program supplied by PacIOOS.  Note that there is a known problem in this Java program that caused the loss of one or two CTD records per hour, although all of those records are archived here in Miami.

The second way to recover the PacIOOS CTD's data is by direct download from its on-board memory.  I never got confirmation of when those data were downloaded by David and Steven, but from a blip in the data feed on March 26th, 2012, I suspect the download may have happened on that day.

The problem as Ross reported it was that although the measurements from these two sources were exactly the same, the timestamps were not.  Timestamps on the data captured by the datalogger and accessed by modem were consistently ahead of those downloaded directly from the CTD, by seventeen minutes and one second.  Ross contacted me to ask whether this time offset might be caused by something in our data transfer or parsing routines.

My answer (sent May 3rd) was basically that I did not see how this could be possible.  The CTD reports a timestamp to the logger, and the logger breaks it up into numbers and stored these in separate memory locations with no "awareness" that they are date and time values.  In order for some step in the process to add 17:01 to each timestamp, it would need to correctly handle hour (and day, and so on) boundaries in multiple memory locations.

I am reproducing large portions of my email reply below, because they provide an insight into how the PacIOOS CTD reports its data and status messages, how these messages are parsed by the datalogger, and the form in which those data are recovered by our systems in Miami.  [editor's note: as of August 14th, 2012, there is still (to my knowledge) no explanation of the data alignment problem reported by Ross in April.]

[email excerpt begins here:]

I think now I at least understand the general description of the problem.

Unfortunately, I can't see how this could be happening on this end of the equation.  Let me line up some evidence for you by describing what our logger/processing is like.

This is a sample record from the PacIOOS CTD produced when you were visiting AOML in March of 2010:

# 23.2650,  0.00001,    0.031, 0.0658, 0.1418,   0.0114, 1491.963, 08 Mar 2010 13:12:29
The logger program distinguishes one of these "data" lines apart from the status message lines because it begins with # (status message lines begin with <). Then it "splits" the line into separate fields based on CR, LF, space, comma and colon characters (producing 13 fields in all).  The ninth field, a 3-character month, is converted to a number between 1 and 12 according to a program "case" statement (and it would get -1 if it didn't match any of them).

These thirteen values are then stored in thirteen different memory locations in the logger memory.  This is the only time that these date/time values are written to in the logger program.  Note, too, the that logger doesn't "know" that these are date and time fields, so it would take quite a lot of logic for it to apply a consistent time offset to these values.  In order to add seventeen minutes to every measurement, it would have to adjust the hour, day, month and even year values to compensate when the minute fields exceeded 60.  But the program is "dumb" enough that it only reports what it sees from the CTD reports.

Let's add some more evidence.  I'm focusing on that record highlighted in your first message, "22 Sep 2011 12:01:06".  I went back to the raw table-data downloads from the logger, and here is the relevant data record:

"2011-09-22 12:06:00",7461,2011,265,1206,12.69,12.69,12.7,4316,28.62,1,72,1,30,915,27.99,410.9,"2011-09-22 12:02:00",411,8.23,"2011-09-22 12:05:50",4.912,"2011-09-22 12:06:00",2.777,157.9,318.4,-0.001,0,-0.001,0.066,12.63,25.12,12,0.006,-0.002,0,0.372,12.58,26.06,12,56.44,29.34,4.13,1543,34.17,21.8,1,2011,9,22,12,1,6,29.3537,5.63757,2.333,0.0643,0.0784,34.1177,1543.372,2011,9,22,12,3,39,1606481,4,12.1,8.7,61.3,137,21.9,138662,7298,3455762,19,3,0,0,23,1,1,0,0,0,0,0,56.71,29.56,1544,34.2,0,6.8,4,"2011-09-22 12:00:05",2.853,285.6,27.9,28.1,77.47,410.4,"2011-09-22 12:00:55",410.4,0,0,0,0,0,0,0,0,0,0,12.73,3.516,6,6

Your CTD's data are this portion:
(data portion)
2011,9,22,12,1,6,29.3537,5.63757,2.333,0.0643,0.0784,34.1177,1543.372
(status portion)
2011,9,22,12,3,39,1606481,4,12.1,8.7,61.3,137,21.9,138662,7298,3455762,19,3
(logger counts)
0,0,23,1,1

The explanation of these fields is this:
(data portion)
PacIOOS Year
PacIOOS Month
PacIOOS Day
PacIOOS Hour
PacIOOS Minute
PacIOOS Second
PacIOOS Water Temperature
PacIOOS Conductivity
PacIOOS Water Pressure
PacIOOS Chloro Volts
PacIOOS Turbid Volts
PacIOOS Salinity
PacIOOS Sound Velocity

(status portion)
PacIOOS Status Year
PacIOOS Status Month
PacIOOS Status Day
PacIOOS Status Hour
PacIOOS Status Minute
PacIOOS Status Second
PacIOOS Status Serial Num
PacIOOS Status Num Events
PacIOOS Status Volts Main
PacIOOS Status Volts Lith
PacIOOS Status Curr Main
PacIOOS Status Curr Pump
PacIOOS Status Curr Ext
PacIOOS Status Mem Bytes
PacIOOS Status Samples
PacIOOS Status Sample Free
PacIOOS Status Sample Len
PacIOOS Status Headers

(logger counts)
PacIOOS Bad Count
PacIOOS Other Count
PacIOOS Stat Count
PacIOOS Rec Count
PacIOOS Rec SubCount


Just as the "data" values are as extracted from an individual data line, the "status" values come from the "getsd" command that the logger sends to the CTD once per hour, at three minutes past the hour.  Here is a sample "status" output from your visit [to Miami] in March of 2011:

<StatusData DeviceType = 'SBE16plus' SerialNumber = '01606481'>
  <DateTime>2010-03-09T09:33:51</DateTime>
  <LoggingState>logging</LoggingState>
  <EventSummary numEvents = '4'/>
  <Power>
     <vMain>13.5</vMain>
     <vLith>8.5</vLith>
     <iMain>61.2</iMain>
     <iPump> 0.4</iPump>
<Executing/>
<Executing/>
<Executing/>
<Executing/>
     <iExt01>20.5</iExt01>
  </Power>
  <MemorySummary>
     <Bytes>8949</Bytes>
     <Samples>471</Samples>
     <SamplesFree>3462589</SamplesFree>
     <SampleLength>19</SampleLength>
     <Headers>5</Headers>
  </MemorySummary>
</StatusData>


The last five values are my diagnostics from the parsing of the CTD output.  Record Counts and Record SubCounts are running counts of how many data records have been received from the CTD.  The Counts are reset to 0 on the hour and the SubCounts are reset to 0 every six minutes.  The Counts should cycle from 0 to 10 throughout the hour and the SubCounts should always be 0 or 1.

The Status Counts are the number of lines that were recognized as elements of a status message, by virtue of beginning with <.  There used to be 23 of these (as in the message above), consistently.  Sometime during the station outage (October - March), this number grew to 26.  More about this in a moment.

The "Bad" Counts are lines that were recognized as status lines but didn't match any expected string.  There used to be zero of these but sometime between October and March there began to be 3 such lines.  Along with the StatCount jump from 23 to 26, this is consistent with the CTD adding three new lines of unanticipated formatting to the same old status message.  Looking through the bits of datalogger parsing variables (via the cellular modem) I see signs of a line reading "<Executed/>".  Note that this line wasn't in my sample output when I wrote the program so I didn't match against it.  My guess is that all three of these "Bad" lines now appearing in every status message are this "<Executed/>" line.  I'm not overly worried about it.

"Other" Counts are lines that were not recognized as either status or data messages.  The program also throws away "s>" lines and lines that contain nothing but "sbe 16plus", as well as blank lines.  So the "Other" counts have always been zero, and remain so.

Whew.  Now let's look at the entire logger record again.  The first field is the logger timestamp for the record.  Picking apart the contents of this record, you see this series of events:

  • at 12:01:06 (data values 2011,9,22,12,1,6) the logger records the CTD data output.
  • at 12:03:39 (data values 2011,9,22,12,3,39) the logger records the CTD status output.  This strongly suggests that the CTD's clock is running about 39 seconds faster than the logger clock (which, though imperfect, seems reasonable to me), since the logger sends the "getsd" command at exactly 3 minutes past the hour.  Note that these values are repeated in the 6-minute data table output for 10 records, since the status values are updated only once per hour.  I didn't have to write these values into the 6-minute table at all but I thought you might like to receive some of them in your data feed, which is built from the 6-minute reports.  You decided against that, but they're not taking up much room so I've left them in.
  • at 12:06:00 (logger record timestamp "2011-09-22 12:06:00"), the logger takes a snapshot of a pre-determined subset of its system variables and writes their content into the 6-minute data table.  Those are all of the values you see above.  Note that some of them are logger timestamps, like the timestamps of maximum winds or minimum barometric pressures.  Your "timestamps" though, as I've explained, are just sets of six unrelated integers as far as the logger is concerned.

So the question is, in what form did the CTD output this data line on the serial line to the logger?  Your CTD downloads suggest that it must have been in this form:

# 29.3537,  5.63757,    2.333, 0.0643, 0.0784,  34.1177, 1543.372, 22 Sep 2011 11:44:05

My logger records say it must have been in this form:

# 29.3537,  5.63757,    2.333, 0.0643, 0.0784,  34.1177, 1543.372, 22 Sep 2011 12:01:06

My conclusion:  I cannot see how the logger could have "made up" the "12:01:06" timestamp.  It's just not that smart, and all of the timing is otherwise consistent as you can see.  I could maybe understand if the parsing routines were switching seconds with minutes, or something.  I could maybe understand if "real" timestamps from elsewhere in your download file were being misapplied to the wrong CTD data, maybe with a consistent offset in number of records.  But there's no player in this picture who is smart enough to read "11:44:05" and come with "12:01:06", particularly across hour/day/month boundaries.

I would suggest you investigate the integrity of the CTD download file a bit more.  My main thought is, what would happen if the laptop used to download the CTD data was 17 minutes off in its system clock?  Do you have the "raw" CTD data file or some kind of SeaBird-converted version of it?  What version of the SeaBird software was used to download the file (and might that specific version be vulnerable to a miss-set system clock)?  Was the file sent to you in UTC, ChST, HST, EST, EDT, or some other timezone?  Do you have a CTD in the lab that you can test against that version of the SeaBird software with a laptop that has the wrong system time?

I'm also wondering whether this same offset can been seen post-station-revival.  I never got confirmation on exactly when the CTD data were downloaded, but I suspect from a dip in the measurement counts that it might have happened on March 26th.  If so, then you should have both sets of data for the period from March 19th to March 26th, right?  I'm wondering if that week's data showed the same time offset that you're seeing in the August - October data.