Thursday, 30 September 2010

Rubymatica as web application

With a working command line script, it was time to create a web interface. The first step was to modify my main processing method so it could be called from an external script. Initially, Rubymatica processed every ingest (each ingest is a folder) in the origin folder. That script was upgraded to accept a command line parameter and process only a single ingest. Internally, the code was refactored to abstract processing into a sub-step after initial setup.
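A minimal sketch of that refactoring follows. It is not the actual Rubymatica code: the method names (pending_ingests, process_one_ingest, main) are hypothetical, and the processing body is only a placeholder. It shows how one optional command line argument can select a single ingest while the processing step stays callable from other code.

```ruby
# Hypothetical names throughout; this is an illustration, not Rubymatica itself.

# List the ingest folders found in the origin folder.
def pending_ingests(origin_dir)
  Dir.children(origin_dir)
     .select { |name| File.directory?(File.join(origin_dir, name)) }
     .sort
end

# The processing sub-step, separated from setup so that both the command
# line wrapper and a web controller can call it. The body is a placeholder.
def process_one_ingest(origin_dir, ingest_name)
  path = File.join(origin_dir, ingest_name)
  raise ArgumentError, "no such ingest: #{ingest_name}" unless File.directory?(path)
  # ... extraction, file name cleaning, virus scan, METS creation ...
  path
end

# Command line wrapper: one argument means "process only this ingest";
# no argument falls back to the old behavior of processing every ingest.
def main(argv, origin_dir)
  names = argv.empty? ? pending_ingests(origin_dir) : [argv.first]
  names.map { |name| process_one_ingest(origin_dir, name) }
end

# Typical use from a script (the path is an assumption):
#   main(ARGV, "/path/to/origin")
```

Keeping the wrapper this thin means the web controller and the command line share one entry point.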

Ruby on Rails is a wonderful way to build web sites. Rails is convoluted, but there are many good examples on the Web. Basically, I followed one of the “hello world” examples, and added a method to the hello_world_controller to call the same entry point as the command line wrapper. This worked well, although it was processing the ingest in real time, so it was obvious that sooner or later the processing would have to be made asynchronous.

The web interface also needed code to report on the status and results of ingests in the system. The web site has been kept as simple as possible because we will eventually be using Rubymatica as a backend for a Hydra head. Even so, the simple web site needs several controller methods: offer_upload, do_upload, save_file, full_status, reset, get_file, get_log_xml, process_sip, show_logs, file_list, and report. As of this writing Rubymatica is unable to create a BagIt bag, so there will be at least one more method added to the controller.

Changing Rubymatica to become an asynchronous process was interesting (at least to a programmer). In this sense “asynchronous” means execution of a program as a background task with no window or terminal session. The main web page has links for pending ingests. Clicking one of these links starts the process. The program begins to run and is “forked” to a separate task. This forking happens very quickly. You see the main web page refresh with a status message that says (essentially) “Processing has started”. Meanwhile, the background task runs independently, taking as long as necessary. This is important for two reasons. First, web browsers will “time out” after a minute or two, giving you a mostly empty window with an error message. Web browsers can’t deal with long-running tasks. Second, programmers don’t want to lock the user into doing nothing while the independent background task runs. The user doesn’t need to stare at a blank page waiting for the task to complete. (As an aside, Linux programmers call tasks “processes”, but since the word “process” has many meanings in the mixed world of digital accessions, I’ve changed it to “task” throughout this blog.)

Normally (in Perl or Python with the Apache web server), programmers simply fork. However, Ruby is often run in a special web server, and forking actually forks the web server, not just the Ruby on Rails controller. That is bad. Fortunately, it was easy to “exec” the task, which means that a new task is created in the background. Using the ingest name supplied from the web request, I simply exec’d a task using the command line that I had created earlier. This is simple, elegant, and robust.
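One way to start such a background task from Ruby is sketched below. This is an assumption about the technique, not the code Rubymatica uses: it relies on Ruby's standard Process.spawn and Process.detach, and the method name start_background_task and the log file argument are hypothetical.

```ruby
# Start a command as a background task and return its process id right away.
# The command is an array such as ["ruby", "rubymatica_cli.rb", ingest_name]
# (script name hypothetical); passing an array avoids shell interpretation
# of an ingest name that arrived in a web request.
def start_background_task(command, log_path)
  pid = Process.spawn(*command,
                      :in  => File::NULL,       # no terminal input
                      :out => log_path,         # stdout goes to a log file
                      :err => [:child, :out])   # stderr follows stdout
  Process.detach(pid) # reap the child when it exits, so no zombie is left
  pid
end
```

The caller returns immediately, so a controller method could render its “Processing has started” page while the task keeps running on its own.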

You may be wondering how the web site is able to know the status of background tasks. To fill this need, the background task writes status messages (and some other administrative metadata) into a SQLite database which is present in each ingest. One of the web controller methods queries each ingest’s database, and displays the most recent status message in the main web page. Another controller method reports the full list of status messages for a given ingest.

Rubymatica is near maturity. It has (generally speaking) an architecture typical of most web applications: code, HTML templates, and a SQL database. The code is Ruby on Rails. The HTML templates used by Rails are Ruby erb files. The SQL is a SQLite database in each ingest directory. Eventually there might be a SQL database with system-wide Rubymatica information, although SQLite will happily open multiple databases so a system-wide database may not be necessary.

Rubymatica needs to gain a few features to be considered a SIP creation tool, and I’ll cover those in a later blog posting. However, there is one more interesting technical problem with checksums and ingests.

The normal practice for validating files is to generate a file that is a list of checksums for all the files in a directory tree. We typically put the checksum file in the root directory of the directory tree. However, a real-world ingest may include a pre-existing checksum file, changed file names (due to detox), removed files (due to anti-virus filtering), and new files from extracting .tar and .zip files. We need to detect an existing checksum file, modify it as file names change, modify it when files are moved or deleted, and add to it when we extract .tar and .zip files.
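The bookkeeping described above can be sketched with Ruby's standard library. Everything here is illustrative: the method names are hypothetical, the manifest is kept as an in-memory hash instead of a checksum file, and SHA-256 is an assumption, since the checksum algorithm is not settled here.

```ruby
require 'digest'
require 'find'
require 'pathname'

# Build { relative path => hex digest } for every regular file under root.
def build_manifest(root)
  base = Pathname.new(root)
  manifest = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    relative = Pathname.new(path).relative_path_from(base).to_s
    manifest[relative] = Digest::SHA256.file(path).hexdigest
  end
  manifest
end

# A file was renamed (for example by detox): keep the checksum, change the key.
def rename_entry(manifest, old_path, new_path)
  manifest[new_path] = manifest.delete(old_path) if manifest.key?(old_path)
  manifest
end

# A file was removed (for example by the virus filter): drop its entry.
def remove_entry(manifest, path)
  manifest.delete(path)
  manifest
end

# Return the recorded paths that are missing on disk or whose content changed.
def verify_manifest(root, manifest)
  bad = manifest.reject do |relative, digest|
    path = File.join(root, relative)
    File.file?(path) && Digest::SHA256.file(path).hexdigest == digest
  end
  bad.keys
end
```

Files that appear when an archive is extracted could be added by running build_manifest on the new directory and merging the result under the right relative prefix.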


Links:

http://en.wikipedia.org/wiki/Ruby_on_Rails

http://en.wikipedia.org/wiki/Fork_(operating_system)

http://en.wikipedia.org/wiki/Sqlite

https://wiki.duraspace.org/display/hydra/The+Hydra+Project

http://en.wikipedia.org/wiki/Checksum




Archivematica SIP ported to Rubymatica

I thought some of you might enjoy a look inside the mind of a software developer converting key scripts from the Archivematica project into Ruby. The impetus of the conversion was to have a Submission Information Package (SIP) creation tool written in Ruby and ready to be integrated with a suite of web applications. Ruby is the language of choice at the University of Virginia (UVA) Library and we are in the process of rolling out a Hydra/Hydrangea web application technology stack. Archivematica is written in Python and shell scripts, and while a web interface is planned, it was months away and (as far as I know) will not be using Hydra. At UVA, the processing steps beyond SIP creation will happen in other systems, so the scope of the conversion fit into a 2 to 4 week timeline. After checking Google for existing projects, we settled on the name “Rubymatica”.

I am very grateful to the Archivematica developers for creating a working product based on Linux. Archivematica uses a “desktop explorer” user interface where moving an ingest to specific folders causes the Archivematica processing scripts to run. This architecture is easy to understand, and fairly easy to port to another programming language.

For the first phase, I rewrote shell and Python scripts in Ruby as a single group of methods in one script with a single entry point. I ran my script from the command line. I had to work out origin and destination folders, because Archivematica moves the ingest into a temporary folder, and moves it a second time to a final destination. Since I would have a unique subfolder for each ingest, I didn’t need an intermediate temporary folder. I made additional modifications to the Archivematica directory structure for clarity and programming sanity. For instance, Archivematica creates some metadata files in the same directory as the ingested files. My code was simpler if the metadata for a given ingest always resided in its own folder. (Early versions of Rubymatica were processing the metadata files as part of the ingest because the files were mixed together in the same directory.)

Ingesting a collection of files involves traversing the directory tree of the ingest. This traversal happens several times. For example, .tar and .zip files need to be extracted, so the directory tree is traversed searching for files to extract. This process is iterative in that each newly-extracted directory must be traversed in turn. Archivematica uses a Python script called Easy Extract. I recoded Easy Extract as a recursive Ruby method, which was nontrivial and required a couple of days’ work. The directory tree also has to be traversed when calling “detox” to clean file names, and traversed again for virus checking with ClamAV. To keep my sanity, I created a method specifically to traverse the directory tree. In keeping with the theme of clarity and simplicity of algorithms, the directory tree is crawled several times rather than trying to do everything in one pass. This works well and fits the programmer mantra: avoid premature optimization.
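The shape of that traversal and of the recursive extraction can be sketched as follows. This is a simplified illustration under stated assumptions, not the Rubymatica code: each_file, archives_in, and extract_all are hypothetical names, and the extractor is a caller-supplied object that unpacks one archive into a new directory and returns that directory's path (or nil if nothing was unpacked).

```ruby
require 'find'

# Yield every regular file under root. Each processing step (extraction,
# file name cleaning, virus scan) makes its own pass with this method.
def each_file(root)
  return enum_for(:each_file, root) unless block_given?
  Find.find(root) { |path| yield path if File.file?(path) }
end

ARCHIVE_PATTERN = /\.(tar|zip)\z/i

# Collect the archives first, so extraction never changes a tree
# while that tree is still being walked.
def archives_in(root)
  each_file(root).select { |path| path.match?(ARCHIVE_PATTERN) }
end

# Extract every archive under root, then recurse into each newly
# created directory, since it may hold further archives.
def extract_all(root, extractor)
  archives_in(root).each do |archive|
    new_dir = extractor.call(archive)
    extract_all(new_dir, extractor) if new_dir
  end
end
```

Recursing only into the newly created directory keeps an archive that is left in place from being extracted twice.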

Creating the METS XML took several days of difficult work. I used Nokogiri’s Builder class since it has powerful XML tools. Nokogiri is a Ruby module, or a “gem” in Ruby parlance. The Builder class of Nokogiri is wonderfully powerful, but almost entirely undocumented. Archivematica uses a Python class to build the METS XML, but the parallels between Python’s XML tools and Nokogiri Builder are tenuous. With some help from the very bright, talented, and helpful Andrew Curley here at UVA, I was able to bend Nokogiri Builder to my will. It was a huge battle and Nokogiri nearly crushed me. I’ve created example code for Nokogiri Builder which can be found at:

http://defindit.com/readme_files/nokogiri_examples.html

Now that I have a working command line script, it is time to think about creating a web interface. I’ll cover that process in my next blog post.

Links:

http://archivematica.org/wiki/index.php?title=Main_Page

http://en.wikipedia.org/wiki/Ruby_(programming_language)

http://en.wikipedia.org/wiki/Tar_(file_format)

http://en.wikipedia.org/wiki/Python_(programming_language)

http://en.wikipedia.org/wiki/Shell_script

http://en.wikipedia.org/wiki/Clam_AV

http://en.wikipedia.org/wiki/Directory_tree