Thursday 30 September 2010

Archivematica SIP ported to Rubymatica

I thought some of you might enjoy a look inside the mind of a software developer converting key scripts from the Archivematica project into Ruby. The impetus for the conversion was to have a Submission Information Package (SIP) creation tool written in Ruby and ready to be integrated with a suite of web applications. Ruby is the language of choice at the University of Virginia (UVA) Library, and we are in the process of rolling out a Hydra/Hydrangea web application technology stack. Archivematica is written in Python and shell scripts, and while a web interface is planned, it was months away and (as far as I know) will not be using Hydra. At UVA, the processing steps beyond SIP creation will happen in other systems, so the scope of the conversion fit into a two-to-four-week timeline. After checking Google for existing projects, we settled on the name “Rubymatica”.

I am very grateful to the Archivematica developers for creating a working product based on Linux. Archivematica uses a “desktop explorer” user interface where moving an ingest to specific folders causes the Archivematica processing scripts to run. This architecture is easy to understand, and fairly easy to port to another programming language.

For the first phase, I rewrote the shell and Python scripts in Ruby as a single group of methods in one script with a single entry point. I ran my script from the command line. I had to work out origin and destination folders, because Archivematica moves the ingest into a temporary folder, and then moves it a second time to a final destination. Since I would have a unique subfolder for each ingest, I didn’t need an intermediate temporary folder. I made additional modifications to the Archivematica directory structure for clarity and programming sanity. For instance, Archivematica creates some metadata files in the same directory as the ingested files. My code was simpler if the metadata for a given ingest always resided in its own folder. (Early versions of Rubymatica were processing the metadata files as part of the ingest because the files were mixed together in the same directory.)
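To make the layout concrete, here is a minimal sketch of one way to give each ingest its own subfolder with the metadata kept apart from the ingested files. It is not the actual Rubymatica code: the method name prepare_ingest, the folder names "ingest" and "meta", and the use of a UUID as the unique subfolder name are all assumptions made for illustration.

```ruby
require 'fileutils'
require 'securerandom'
require 'tmpdir'

# Hypothetical helper (not from Rubymatica): copy source_dir into a fresh,
# uniquely named subfolder of base_dir. Ingested files and metadata get
# separate folders so later passes over the ingest never see metadata files.
def prepare_ingest(source_dir, base_dir)
  top        = File.join(base_dir, SecureRandom.uuid) # assumed naming scheme
  ingest_dir = File.join(top, 'ingest')               # assumed folder name
  meta_dir   = File.join(top, 'meta')                 # assumed folder name
  FileUtils.mkdir_p([ingest_dir, meta_dir])
  # 'source/.' copies the contents of source_dir rather than the folder itself.
  FileUtils.cp_r(File.join(source_dir, '.'), ingest_dir)
  { top: top, ingest: ingest_dir, meta: meta_dir }
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir('rubymatica_demo') do |tmp|
  source = File.join(tmp, 'incoming')
  FileUtils.mkdir_p(source)
  File.write(File.join(source, 'letter.txt'), 'Dear archive')

  dirs = prepare_ingest(source, File.join(tmp, 'sips'))
  puts "ingested files: #{Dir.children(dirs[:ingest]).inspect}"
  puts "metadata files: #{Dir.children(dirs[:meta]).inspect}"
end
```

Because the unique subfolder is created directly at the destination, there is no intermediate temporary folder to manage, and anything written to the metadata folder stays out of the way of the file traversal.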

Ingesting a collection of files involves traversing the directory tree of the ingest. This traversal happens several times. For example, .tar and .zip files need to be extracted, so the directory tree is traversed searching for files to extract. This process is iterative in that each newly extracted directory must be traversed in turn. Archivematica uses a Python script called Easy Extract. I recoded Easy Extract as a recursive Ruby method, which was nontrivial and required a couple of days’ work. The directory tree also has to be traversed when calling “detox” to clean file names, and traversed again for virus checking with ClamAV. To keep my sanity, I created a method specifically to traverse the directory tree. In keeping with the theme of clarity and simplicity of algorithms, the directory tree is crawled several times rather than trying to do everything in one pass. This works well and fits the programmer mantra: avoid premature optimization.
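The shape of that approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the Rubymatica source: the method names (ingest_files, extract_all, suspicious_names), the ".extracted" destination suffix, and the file name pattern are all hypothetical, and the real extraction step is replaced by a callable the caller passes in, so no tar, unzip, detox, or ClamAV commands are run here.

```ruby
require 'find'
require 'fileutils'
require 'tmpdir'

ARCHIVE_EXTENSIONS = %w[.tar .zip].freeze

# One shared traversal method: every regular file under root, sorted by path.
def ingest_files(root)
  Find.find(root).select { |path| File.file?(path) }.sort
end

def archive?(path)
  ARCHIVE_EXTENSIONS.include?(File.extname(path).downcase)
end

# Recursive extraction pass. `extractor` is any callable taking
# (archive_path, destination_dir); it stands in for the real unpacking step.
# Each newly extracted directory is traversed in turn for nested archives.
def extract_all(root, extractor)
  extracted = []
  ingest_files(root).each do |path|
    next unless archive?(path)
    dest = "#{path}.extracted"   # assumed naming convention
    next if File.exist?(dest)    # already unpacked on an earlier run
    FileUtils.mkdir_p(dest)
    extractor.call(path, dest)
    extracted << path
    extracted.concat(extract_all(dest, extractor))
  end
  extracted
end

# A separate, later pass over the same tree. The pattern below is only an
# illustration and is not the set of rules that detox applies.
def suspicious_names(root)
  ingest_files(root).select { |path| File.basename(path) =~ /[^A-Za-z0-9._-]/ }
end

# Demonstration with a fake extractor that pretends outer.tar contains
# inner.zip, and that inner.zip contains doc.txt.
fake_extractor = lambda do |archive, dest|
  inner = File.basename(archive) == 'outer.tar' ? 'inner.zip' : 'doc.txt'
  FileUtils.touch(File.join(dest, inner))
end

Dir.mktmpdir('traverse_demo') do |root|
  FileUtils.touch(File.join(root, 'outer.tar'))
  FileUtils.touch(File.join(root, 'my notes.txt'))

  unpacked = extract_all(root, fake_extractor)
  puts "extracted: #{unpacked.map { |p| File.basename(p) }.inspect}"
  puts "needs cleaning: #{suspicious_names(root).map { |p| File.basename(p) }.inspect}"
end
```

Each pass calls the same traversal method and does exactly one job, which costs a few extra walks over the tree but keeps every step easy to read and to test on its own.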

Creating the METS XML took several days of difficult work. I used Nokogiri’s Builder class since it has powerful XML tools. Nokogiri is a Ruby library, or a “gem” in Ruby parlance. The Builder class of Nokogiri is wonderfully powerful, but almost entirely undocumented. Archivematica uses a Python class to build the METS XML, but the parallels between Python’s XML tools and Nokogiri Builder are tenuous. With some help from the very bright, talented, and helpful Andrew Curley here at UVA, I was able to bend Nokogiri Builder to my will. It was a huge battle and Nokogiri nearly crushed me. I’ve created example code for Nokogiri Builder, which can be found at:

http://defindit.com/readme_files/nokogiri_examples.html

Now that I have a working command line script, it is time to think about creating a web interface. I’ll cover that process in my next blog post.

Links:

http://archivematica.org/wiki/index.php?title=Main_Page

http://en.wikipedia.org/wiki/Ruby_(programming_language)

http://en.wikipedia.org/wiki/Tar_(file_format)

http://en.wikipedia.org/wiki/Python_(programming_language)

http://en.wikipedia.org/wiki/Shell_script

http://en.wikipedia.org/wiki/Clam_AV

http://en.wikipedia.org/wiki/Directory_tree
