Thursday, September 29, 2011

Using RSS to Monitor Data Transfers.

OceanDataRat.org
September 27, 2011 12:38 PM
by admin

I got this idea from a colleague down at Stennis Space Center about a year ago.  He said, "Wouldn't it be nice if we could know when data arrives on the server the same way we get notified about online news articles?"  The light bulb went on and pretty much exploded.  And why try to replicate the functionality?  Just use the same technology to publish data transfers to the web.  The technology I'm referring to is Really Simple Syndication (RSS), a dirt-simple way to publish information that lets anyone subscribe to updates on all sorts of platforms (browsers, news readers, email clients).

What is RSS?

The last line of the previous paragraph pretty much sums up what RSS does.  How it works is, as the name implies, really simple.  RSS is just an XML-based text file hosted on a web server.  The file must follow the RSS 2.0 format, and because that format is standardized, all kinds of programs have been written to interpret and display RSS articles.

Here's the basic layout of an RSS file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://tethys.gso.uri.edu/data/rss.xml" rel="self"
      type="application/rss+xml" />
    <title>EX Data Transfer RSS Feed</title>
    <link>http://tethys.gso.uri.edu/data/rss.xml</link>
    <description>This RSS feed provides updates to when files are synced to the
      Shore-based Redistribution Server</description>
    <language>en-us</language>
    <copyright>Copyright (C) 2011 Okeanos Explorer Program</copyright>
    <item>
      <title>EX1106 Data Upload Update - Tue, 27 Sep 2011 13:30:57 UTC</title>
      <description>
        <![CDATA[
        <p>Added new file: EX1106/CTD/XBT/EX1106_XBT94_110927.EDF</p>
        <p>   13 files updated in ./EX1106/SCSData/NAV</p>
        ]]>
      </description>
      <pubDate>Tue, 27 Sep 2011 13:30:57 UTC</pubDate>
    </item>
  </channel>
</rss>

The breakdown:

  • Line 1 is required, exactly as it appears.
  • Line 2 is the main container; everything within the <rss> container is interpreted as part of the RSS feed.
  • Line 3 is the container for an RSS channel, think TV channels.  The RSS 2.0 specification calls for a single <channel> per feed, and that is what I use in this article.
  • Lines 4-5 are an Atom-style link that gives the feed's own URL (rel="self"), which helps Atom-aware clients and feed validators.  Atom is an alternative syndication format.
  • Line 6 is the Title of the RSS feed
  • Line 7 is the URL of the feed (or it can be used to link to a parent site)
  • Lines 8-9 are the description of the feed, i.e. what the RSS feed is publishing.
  • Line 10 is the language of the feed
  • Line 11 is the copyright info.
  • Line 12 is the opening tag for an item (article).
  • Line 13 is the title of the item (article).
  • Line 14 is the opening tag for the meat of the article
  • Lines 15-18 are the meat of the article.  This RSS feed uses HTML-style tags for formatting the text.  This is not required, but for the RSS feed I set up it made things easier.  To use HTML-style formatting, add "<![CDATA[" at the beginning and "]]>" at the end of the text block.
  • Line 19 is the closing tag for the description.
  • Line 20 is the publishing date/time.  The date/time must be formatted as shown (the RSS 2.0 specification calls for RFC 822-style dates).
  • Line 21 is the closing tag for the item (article).  At this point additional items can be added.
  • Lines 22-23 are closing tags for the channel and the rss feed.
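
To make the item layout concrete, here is a minimal sketch of building one <item> in code. It is written in Python purely for illustration (my own script is bash), and the function name build_item is hypothetical. One difference from the sample feed: the standard library writes the time zone as "GMT" where the sample writes "UTC".

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from xml.sax.saxutils import escape


def build_item(title: str, html_body: str, published: datetime) -> str:
    """Return one RSS <item> as text, with the HTML body wrapped in CDATA.

    `published` is assumed to be a timezone-aware datetime.
    """
    # RFC 822-style date; usegmt=True writes the zone as "GMT".
    pub_date = format_datetime(published.astimezone(timezone.utc), usegmt=True)
    # A literal "]]>" in the body would end the CDATA section early, so split it.
    safe_body = html_body.replace("]]>", "]]]]><![CDATA[>")
    return (
        "<item>\n"
        f"<title>{escape(title)}</title>\n"
        "<description>\n"
        f"<![CDATA[\n{safe_body}\n]]>\n"
        "</description>\n"
        f"<pubDate>{pub_date}</pubDate>\n"
        "</item>\n"
    )
```

Calling build_item("EX1106 Data Upload Update", "<p>Added new file: ...</p>", some_utc_datetime) returns text that can be dropped between the <channel> tags.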

Now back to the original problem…

The Okeanos Explorer transfers all collected data (sans raw multibeam data and high-definition video) to shore via satellite every hour.  The collection, cataloging, checksum generation and upload all happen auto-magically.

The participants on shore depend on this data flow to keep up with the ship's findings and to take an active part in the exploration.  That dependency produced one of the most frequently asked questions: "Is it (the data) there yet?"

Enter RSS.  With an RSS feed driven by each successful transfer of data from ship to shore, shore-side participants are informed almost instantly when new data has arrived.

How I did it:

For each hourly transfer to shore (via rsync) there is a corresponding log file.  The log file is created by rsync using the "-i" flag.  This produces a list of all the files in the source directory and how each file was interpreted (i.e. as a new file, an updated file or unchanged).  I include the Cruise ID and the date/time of the transfer in the log file name (e.g. EX1104_Transfer_to_Shore_20110810T093000Z).  My script uses these to populate the <title> and <pubDate> fields in the RSS article.
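
As a rough illustration of how the log file name can drive those two fields, here is a small Python sketch (not my bash script). The regular expression, the function name title_and_pubdate and the exact title wording are assumptions modeled on the example file name and the sample feed; the date comes out with "GMT" as the zone.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed pattern: <CruiseID>_Transfer_to_Shore_<YYYYMMDD>T<HHMMSS>Z
LOG_NAME = re.compile(
    r"^(?P<cruise>[A-Za-z]+\d+)_Transfer_to_Shore_(?P<stamp>\d{8}T\d{6})Z"
)


def title_and_pubdate(log_name):
    """Return (title, pubDate) strings derived from an rsync log file name."""
    match = LOG_NAME.match(log_name)
    if match is None:
        raise ValueError(f"unexpected log file name: {log_name!r}")
    # The trailing Z in the file name marks the timestamp as UTC.
    when = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S").replace(
        tzinfo=timezone.utc
    )
    pub_date = format_datetime(when, usegmt=True)
    title = f"{match['cruise']} Data Upload Update - {pub_date}"
    return title, pub_date
```

For the example name above this yields a title beginning "EX1104 Data Upload Update" and a publishing date of 10 Aug 2011 09:30:00.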

After a successful data transfer I upload the corresponding rsync log files to a specific directory on the shore-based server.  Once a file arrives on the shore-side server, a bash script processes the log file into an RSS article (<item></item>) and adds the article to the beginning of the RSS feed.  Presto: within seconds of the data arriving, the users are made aware.  After the log file is processed it is moved to a backup directory so that it is not processed again.
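
Adding an article "to the beginning of the feed" means placing it ahead of any existing <item>, after the channel's own title, link and description. A minimal Python sketch of that step, using plain string handling and a hypothetical function name prepend_item, might look like this. It assumes the literal text "<item>" does not appear earlier in the channel header.

```python
def prepend_item(feed_xml: str, item_xml: str) -> str:
    """Insert item_xml ahead of the first <item>, or before </channel> if none exist."""
    first_item = feed_xml.find("<item>")
    # An empty (root) feed has no items yet, so fall back to the end of the channel.
    anchor = first_item if first_item != -1 else feed_xml.find("</channel>")
    if anchor == -1:
        raise ValueError("feed has neither <item> nor </channel>")
    return feed_xml[:anchor] + item_xml + feed_xml[anchor:]
```

The same function covers the first run against an empty root file and every later run against the most recent feed.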

In order to minimize the length of each article I only show what's new and what has changed.  For new files I list the file name and full path. For updated files I list the directory name and number of files that were updated.
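
Here is a hedged Python sketch of that summarizing step. It assumes the log holds rsync's standard itemized lines (a flag string such as "<f+++++++++" or "<f.st......", a space, then the path) and that a flag string made up entirely of "+" after the first two characters marks a new file. The function names summarize and to_html are hypothetical, and this is not my bash script.

```python
import posixpath
from collections import Counter
from html import escape


def summarize(log_lines):
    """Return (new_files, updated_dirs) from itemized rsync output lines."""
    new_files = []
    updated_dirs = Counter()
    for line in log_lines:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue
        flags, path = parts
        # Keep only regular files that were transferred ("<f..." or ">f...");
        # this also skips directories, unchanged files and rsync's status lines.
        if len(flags) < 3 or flags[0] not in "<>" or flags[1] != "f":
            continue
        if set(flags[2:]) == {"+"}:
            new_files.append(path)  # newly created on the receiving side
        else:
            updated_dirs[posixpath.dirname(path) or "."] += 1
    return new_files, updated_dirs


def to_html(new_files, updated_dirs):
    """Render the summary as the <p> lines used in the item description."""
    lines = [f"<p>Added new file: {escape(path)}</p>" for path in new_files]
    lines += [
        f"<p>{count} files updated in {escape(directory)}</p>"
        for directory, count in sorted(updated_dirs.items())
    ]
    return "\n".join(lines)
```

Counting updates per directory keeps an article short even when a navigation folder has dozens of files that change every hour.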

Once the RSS file is created I save it on the Okeanos Explorer's shore-side web server for the shore-side team to see.  Take a look.

Caveats

Satellite communications at sea can be flaky due to faulty equipment, tracking issues and weather.  This causes the data transfers to fail periodically.  As part of the Okeanos Explorer's hourly data transfer scripts, the rsync command is called repeatedly until the entire transfer completes or it has been tried five times, whichever comes first.  Each rsync call produces a new rsync log file.  At the end of a successful transfer, all of the logs are sent to shore.  To account for this I wrote a script that batch processes any and all rsync log files within a directory.
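
The batch step can be sketched in a few lines of Python (again, an illustration rather than my bash script). The function name process_pending_logs and the handle_log callback are hypothetical; handle_log stands in for whatever turns one log file into an RSS article. Because the timestamp in each file name sorts correctly as text, processing in name order handles the oldest log first, so the newest article ends up at the top of the feed.

```python
import shutil
from pathlib import Path


def process_pending_logs(incoming, backup, handle_log):
    """Run handle_log on every log in `incoming`, oldest first, then archive it."""
    incoming, backup = Path(incoming), Path(backup)
    backup.mkdir(parents=True, exist_ok=True)
    processed = []
    # Names embed YYYYMMDDTHHMMSSZ, so sorting by name is chronological
    # for logs from the same cruise.
    for log_path in sorted(p for p in incoming.iterdir() if p.is_file()):
        handle_log(log_path)
        # Move the log out of the way so it is never processed twice.
        shutil.move(str(log_path), str(backup / log_path.name))
        processed.append(log_path.name)
    return processed
```

If handle_log raises an error, the log stays in the incoming directory and is picked up again on the next run.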

The Code

Here is the bash script I use to process the rsync log files: download.

Here is the root RSS file that the articles are added to: download.  I use this file only the first time I run the script.  To add articles to an existing RSS feed you need to run the script against the most recent version of the feed.

Here's the script I used to batch process a directory of rsync log files: download.

Both of the scripts are heavily documented so if you have any issues running them please take a look at the comments.

I hope this helps.

Want to talk about this some more? Please post your questions in the Forums.

Data Management bash rss rsync script

