September 19, 2011 2:55 PM
Here's a technique I developed two years ago to streamline the process of consolidating datasets collected on multiple computers. Using some batch scripts, PHP code, and some open source tools, I created a simple web-based management system for controlling my regular data backup tasks.
Problem: The data collection landscape on the Okeanos Explorer is not the simplest. There are separate workstations for each of the major collection systems (CTD/XBT, SCS, Multibeam, EK60, etc.). At the beginning of each cruise the ship's survey techs create new folders on each of the collection workstations for that system's datatype (e.g. SBE SeaSave, SIS, SCS). For size reasons (e.g. multibeam data) some of the collection workstations store their collected data on a shared drive (e.g. the NetApp storage array). In these cases the folder is created on the shared drive. Storing data on the shared drive is not standard practice, however, because it creates dependencies on network resources. Typically data remains on the collection workstation to improve the performance of additional product development (e.g. creating maps, calculating SVPs, plotting SST) or for data comparison.
The ship needed a unified data management solution that met the following requirements:
- All the collected data needed to be consolidated consistently, on a cruise-by-cruise basis, to a single point.
- The solution had to be flexible enough to accommodate an ever-changing list of collection points.
- It needed to be simple enough that it could be managed with a minimum amount of the survey tech's time.
- It needed to be platform independent.
- It needed to enforce data management policies (e.g. naming conventions) where possible.
Solution: The first thing that was needed was a consolidated collection point, something the Okeanos Explorer calls the shipboard data warehouse (Warehouse). The Warehouse has enough storage for all datasets (sans raw High Definition (HD) video and raw multibeam data) for an entire field season (~800GB). The hardware is reasonably fault-tolerant: a rack-mount Dell PowerEdge 2950 server with dual NICs, dual power supplies, and 8 hot-swappable 150GB SAS drives connected to a hardware RAID controller. The server runs Debian 6 (Linux).
We used rsync batch scripts run as scheduled tasks (Linux/Mac workstations used BASH scripts and cron jobs) to transfer the data from each collection computer to the Warehouse (Refer to Using RSYNC to Efficiently Backup Data). The rsync jobs run every hour. We use rsync's --include/--exclude arguments to enforce naming conventions. The rsync jobs are tailored so that the data from each collection point is copied to a standardized directory location on the Warehouse regardless of the original directory name. A new directory structure containing the cruise ID (e.g. EX1104) is created on the Warehouse for each cruise. Error checks are performed in the scripts wherever possible. Any errors, as well as successes, are reported to the collection workstation as Growl notifications (Refer to Using GrowlNotify to Send System-wide Notifications From Scripts).
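To make the include/exclude idea concrete, here is a minimal sketch of what one of these jobs might look like on a Linux/Mac workstation. It is not the ship's actual script: the paths, the function name backup_collection_point, and the assumed naming convention (file names starting with the cruise ID) are all illustrative.

```shell
#!/bin/sh
# Sketch of one backup job. Paths and the naming pattern are assumptions,
# not the values used on the ship.

RSYNC_BIN="${RSYNC_BIN:-rsync}"   # overridable, e.g. to preview the arguments

# backup_collection_point CRUISE_ID LOCAL_DIR DEST_ROOT DATATYPE
# Copies LOCAL_DIR into DEST_ROOT/CRUISE_ID/DATATYPE, whatever LOCAL_DIR is called.
backup_collection_point() {
    # --include='*/'      descend into subdirectories
    # --include="$1_*"    copy only files that follow the (assumed) naming convention
    # --exclude='*'       skip everything else
    "$RSYNC_BIN" -a --prune-empty-dirs \
        --include='*/' \
        --include="${1}_*" \
        --exclude='*' \
        "${2}/" "${3}/${1}/${4}/"
}

# Example call (hypothetical paths):
#   backup_collection_point EX1104 /data/ctd /mnt/warehouse CTD
#
# Example crontab entry that runs a wrapper script at the top of every hour:
#   0 * * * * /usr/local/bin/backup_ctd.sh
```

Because rsync applies the first matching rule, the order of the include and exclude arguments matters: the catch-all exclude has to come last.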
Having all the data in one place was extremely useful to the ship's crew and scientists alike. This prompted us to make the consolidated datasets publicly available (read-only) via FTP, SMB and HTTP. To quell security concerns we moved the Warehouse to the visitor's network and altered the backup scripts to use SSH tunneling for the transfers (Refer to Setting Up SSH Public Key Authentication).
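One common way to run rsync over SSH with public key authentication is sketched below. The user name, host name, key path, and the function name push_over_ssh are placeholders, and the real scripts may set up the tunnel differently.

```shell
#!/bin/sh
# Sketch only: user, host and key file are hypothetical.

RSYNC_BIN="${RSYNC_BIN:-rsync}"

# push_over_ssh LOCAL_DIR REMOTE_USER REMOTE_HOST REMOTE_DIR KEY_FILE
push_over_ssh() {
    # -e tells rsync which remote shell to use; BatchMode=yes makes ssh fail
    # instead of prompting for a password, which suits unattended jobs.
    # Note: KEY_FILE must not contain spaces in this simple form.
    "$RSYNC_BIN" -a \
        -e "ssh -i ${5} -o BatchMode=yes" \
        "${1}/" "${2}@${3}:${4}/"
}

# Example call (hypothetical values):
#   push_over_ssh /data/ctd survey warehouse.example /warehouse/EX1104/CTD \
#       /home/survey/.ssh/id_rsa
```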
All of the backup scripts take their behavior from a centralized configuration file. The configuration file contains all of the local and remote (on the Warehouse) directory names. The configuration file lives on the Warehouse and is accessed by the collection workstations via HTTP using the wget utility (wget for Windows). Once the file is downloaded, the variables are loaded into the shell environment, the local copy of the file is immediately deleted (for security), and the script does its job. When the script completes (success or failure) the variables are erased (again for security).
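The download, load, delete, and erase cycle described above could look roughly like this in a BASH-style script. The URL is a placeholder, the variable names inside the configuration file (CRUISE_ID, CTD_LOCAL_DIR, and so on) are hypothetical, and the file is assumed to consist of plain NAME=value lines.

```shell
#!/bin/sh
# Sketch only. The URL and the variable names in the config file are assumptions.
CONFIG_URL="http://warehouse.example/backup.conf"
CONFIG_FILENAME="${TMPDIR:-/tmp}/backup.conf.$$"   # temporary local copy

# fetch_config URL FILE: download the central configuration file.
fetch_config() {
    wget -q -O "$2" "$1"
}

# load_config FILE: load the variables into the shell, then delete the local copy.
# FILE should include a directory part so the "." command does not search PATH.
load_config() {
    [ -r "$1" ] || return 1
    . "$1"
    rm -f "$1"
}

# clear_config: erase the loaded variables once the job is done.
clear_config() {
    unset CRUISE_ID CTD_LOCAL_DIR CTD_REMOTE_DIR BACKUP_ENABLED
}

# Typical use inside a backup script:
#   fetch_config "$CONFIG_URL" "$CONFIG_FILENAME" || exit 1
#   load_config "$CONFIG_FILENAME" || exit 1
#   trap clear_config EXIT      # runs on success or failure
#   ... rsync job goes here ...
```

Loading the file with "." executes whatever it contains, so this approach only makes sense when the workstations trust the server that hosts the file.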
Management of the configuration file is web-based. A secure website (via .htaccess) running on the Warehouse contains a web form for altering the master configuration file. At the beginning of each cruise the survey tech updates the directory information as required and hits a "save" button. The next time the scripts run the new variables will be applied.
Additional bells and whistles: While in port the ship's collection systems are turned off. The scheduled tasks used for the automatic backups, however, are not. There are two ways to handle this:
- Go around to each of the collection workstations and disable all the scheduled tasks.
- Use the central configuration file to disable the backup scripts.
We went with the latter approach. A variable in the central configuration file serves as the master switch that will prevent the script from reaching the rsync command.
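A master switch of this kind can be sketched in a few lines. BACKUP_ENABLED is a hypothetical variable name, assumed to have been loaded from the central configuration file before this check runs; the function names are also illustrative.

```shell
#!/bin/sh
# Sketch only: BACKUP_ENABLED is an assumed variable name from the config file.

# backups_enabled: succeeds only when the master switch is explicitly "1".
backups_enabled() {
    [ "${BACKUP_ENABLED:-0}" = "1" ]
}

# run_if_enabled COMMAND [ARGS...]: run the command only if the switch is on.
run_if_enabled() {
    if ! backups_enabled; then
        echo "Backups are disabled in the central configuration; skipping." >&2
        return 0
    fi
    "$@"
}

# Example (hypothetical paths):
#   run_if_enabled rsync -a /data/ctd/ /mnt/warehouse/EX1104/CTD/
```

In this sketch an unset variable counts as "off", so a workstation that fails to load the configuration skips the transfer rather than running it with stale settings.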
The configuration file also contains the cruise ID, which is used as part of the ship's naming convention and is the name of the top-level cruise directory.
Here's the code for the website (big thanks to friend and honorary datarat Eric Martin): download. Unzip this into the document root folder for your website (e.g. /var/www). You will need to open index.php and set the $batFile variable for your particular installation.
Here's a sample backup script that uses the method described in this article: download. You will need to change the HOMEPATH, CONFIG_FILENAME and CONFIG_URL variables for your particular installation.
I hope this helps.
Want to talk about this some more? Please post your questions in the Forums.
Dr. Art Trembanis