Thursday 30 September 2010

Rubymatica as web application

With a working command line script, it was time to create a web interface. The first step was to modify my main processing method so it could be called from an external script. Initially, Rubymatica processed every ingest (each one a folder) in the origin folder. The script was upgraded to accept a command line parameter and process only a single ingest. Internally, the code was refactored to separate processing into a sub-step that runs after initial setup.
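To make that refactoring concrete, here is a minimal sketch of a wrapper that handles one named ingest. It is not Rubymatica's actual code: the names setup_ingest, process_ingest, and main, the default origin folder name, and the script name process_one.rb are all assumptions made for illustration.

```ruby
# Hypothetical sketch; none of these names come from Rubymatica itself.

# Initial setup for a single ingest: resolve and check its folder.
def setup_ingest(origin, name)
  path = File.join(origin, name)
  raise ArgumentError, "no such ingest: #{name}" unless File.directory?(path)
  path
end

# Entry point shared by the command line wrapper and any external caller.
# Real processing would go here; this sketch only lists the ingest's files
# as paths relative to the ingest folder.
def process_ingest(origin, name)
  path = setup_ingest(origin, name)
  Dir.glob('**/*', base: path).select { |f| File.file?(File.join(path, f)) }.sort
end

# Command line wrapper: expects exactly one ingest name as its argument.
# Returns an exit status instead of exiting, so it can also be called
# from other code.
def main(argv, origin = 'origin')
  name = argv.first
  if name.nil?
    warn 'usage: process_one.rb INGEST_NAME'
    return 1
  end
  process_ingest(origin, name).each { |f| puts f }
  0
end

# In a script file, finish with: exit main(ARGV)
```

Keeping the entry point separate from argument parsing is what lets a web controller call process_ingest directly with the ingest name from the request.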

Ruby on Rails is a wonderful way to build web sites. Rails is convoluted, but there are many good examples on the Web. Basically, I followed one of the “hello world” examples, and added a method to the hello_world_controller to call the same entry point as the command line wrapper. This worked well, although it was processing the ingest in real time, so it was obvious that sooner or later the processing would have to be made asynchronous.

The web interface also needed code to report on the status and results of ingests in the system. The web site has been kept as simple as possible because we will eventually be using Rubymatica as a backend for a Hydra head. Even so, the simple web site needs several controller methods: offer_upload, do_upload, save_file, full_status, reset, get_file, get_log_xml, process_sip, show_logs, file_list, and report. As of this writing, Rubymatica is unable to create a BagIt bag, so there will be at least one more method added to the controller.

Changing Rubymatica to become an asynchronous process was interesting (at least to a programmer). In this sense, “asynchronous” means execution of a program as a background task with no window or terminal session. The main web page has a link for each pending ingest. Clicking one of these links starts the process. The program begins to run and is “forked” to a separate task. This forking happens very quickly. You see the main web page refresh with a status message that says (essentially) “Processing has started”. Meanwhile, the background task runs independently, taking as long as necessary. This is important for two reasons. First, web browsers will “time out” after a minute or two, giving you a mostly empty window with an error message. Web browsers can’t deal with long-running tasks. Secondly, programmers don’t want to lock the user into doing nothing while the independent background task runs. The user doesn’t need to stare at a blank page waiting for the task to complete. (As an aside, Linux programmers call tasks “processes”, but since the word “process” has many meanings in the mixed world of digital accessions, I’ve changed it to “task” throughout this blog.)

Normally (in Perl or Python with the Apache web server), programmers simply fork. However, Ruby is often run in a special web server, and forking actually forks the web server, not just the Ruby on Rails controller. That is bad. Fortunately, it was easy to “exec” the task, which means that a new task is created in the background. Using the ingest name supplied from the web request, I simply exec’d a task using the command line that I had created earlier. This is simple, elegant, and robust.
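The post does not show the actual call, so the following is only a sketch of one way to launch a background task from Ruby. It swaps in Process.spawn and Process.detach from the standard library, because Kernel#exec on its own replaces the calling process. The helper name start_background_ingest, the script path, and the log file are assumptions for illustration.

```ruby
require 'rbconfig'

# Hypothetical helper: run the command line script for one ingest as a
# separate background task and return its process id immediately.
def start_background_ingest(script, ingest_name, log_path)
  # Arguments are passed as separate strings, so no shell is involved and
  # an ingest name from a web request is never parsed as shell syntax.
  pid = Process.spawn(RbConfig.ruby, script, ingest_name,
                      in: File::NULL,          # no terminal input
                      out: [log_path, 'a'],    # append output to a log
                      err: [:child, :out])     # errors go to the same log
  # Detach so the finished task is reaped and does not linger as a zombie.
  Process.detach(pid)
  pid
end
```

A controller method could call this helper and then render the “Processing has started” message right away, since the call returns as soon as the task has been launched.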

You may be wondering how the web site is able to know the status of background tasks. To fill this need, the background task writes status messages (and some other administrative metadata) into a SQLite database which is present in each ingest. One of the web controller methods queries each ingest’s database and displays the most recent status message in the main web page. Another controller method reports the full list of status messages for a given ingest.

Rubymatica is near maturity. It has (generally speaking) an architecture typical of most web applications: code, HTML templates, and a SQL database. The code is Ruby on Rails. The HTML templates used by Rails are Ruby erb files. The SQL is a SQLite database in each ingest directory. Eventually there might be a SQL database with system-wide Rubymatica information, although SQLite will happily open multiple databases, so a system-wide database may not be necessary.

Rubymatica needs to gain a few features to be considered a SIP creation tool, and I’ll cover those in a later blog posting. However, there is one more interesting technical problem with checksums and ingests.

Normal practice when validating files is to generate a file that lists checksums for all the files in a directory tree. We typically put the checksum file in the root directory of that tree. However, a real-world ingest may include a pre-existing checksum file, changing file names (due to detox), removed files (due to anti-virus filtering), and new files from extracting .tar and .zip files. We need to detect an existing checksum file, modify it as file names change, modify it when files are moved or deleted, and add to it when we extract .tar and .zip files.
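The bookkeeping described above can be sketched in a few small functions. This is an illustration only, not Rubymatica's implementation: the function names are invented, MD5 is an assumed algorithm (the post does not name one), and the manifest is assumed to use plain md5sum-style “digest, two spaces, path” lines.

```ruby
require 'digest'

# Hypothetical manifest helpers. The manifest is held in memory as a Hash
# of relative path => hex digest.

# Read an existing checksum file if there is one; otherwise start empty.
def load_manifest(manifest_path)
  return {} unless File.exist?(manifest_path)
  File.readlines(manifest_path, chomp: true).each_with_object({}) do |line, m|
    digest, path = line.split(/\s+/, 2)
    m[path] = digest if digest && path
  end
end

# A renamed file keeps its digest; only the recorded path changes.
def rename_entry(manifest, old_path, new_path)
  manifest[new_path] = manifest.delete(old_path) if manifest.key?(old_path)
  manifest
end

# A removed file (for example, one caught by anti-virus) loses its entry.
def remove_entry(manifest, path)
  manifest.delete(path)
  manifest
end

# A newly extracted file gets a freshly computed digest.
def add_entry(manifest, root, path)
  manifest[path] = Digest::MD5.file(File.join(root, path)).hexdigest
  manifest
end

# Write the manifest back out, sorted by path for stable output.
def save_manifest(manifest, manifest_path)
  lines = manifest.sort.map { |path, digest| "#{digest}  #{path}\n" }
  File.write(manifest_path, lines.join)
end
```

Because a rename carries the old digest forward instead of recomputing it, a later verification pass can still catch a file whose contents changed somewhere between the original checksum and the end of processing.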


Links:

http://en.wikipedia.org/wiki/Ruby_on_Rails

http://en.wikipedia.org/wiki/Fork_(operating_system)

http://en.wikipedia.org/wiki/Sqlite

https://wiki.duraspace.org/display/hydra/The+Hydra+Project

http://en.wikipedia.org/wiki/Checksum




5 comments:

  1. Tom, this is great work. I look forward to playing around with it.

    Regarding the troubles you ran into with Rails: for this type of application, where you basically want to put an HTML/REST interface in front of an existing (command line) application, you might get more flexibility with less overhead by using Sinatra -- http://www.sinatrarb.com/

  2. Matt, the implementation issues with Rails were minor and not related to REST-ful behavior. WEBrick (the Ruby development web server) is not suited to forking processes, unlike Apache. Using an exec was a fine workaround for WEBrick's limitation.

    Rubymatica and SIP creation in general have only one or two states, so REST isn't much of an issue.

    Rails can be fairly simple, and I was able to keep the architecture of Rubymatica quite simple. In the long term, the core Rubymatica classes will be integrated into a Hydra head, so the current Rails application is a crude, temporary bridge.

  3. Tom,

    Is the development of Rubymatica ongoing and possibly in sync with the latest version of Archivematica?

    Thanks,

    Dean

  4. Useful site. It was very great stuff. Thanks for sharing it.

  5. Thank you for sharing. Now, I can do it easier because of this blog. Keep up your good work guys!
