With a working command line script, it was time to create a web interface. The first step was to modify my main processing method so it could be called from an external script. Initially, Rubymatica processed every ingest (each ingest is a folder) in the origin folder. The script was upgraded to accept a command line parameter and process only a single ingest. Internally, the code was refactored so that processing became a separate sub-step after initial setup.
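The refactoring described above can be sketched roughly as follows. This is a minimal illustration, not Rubymatica's actual code: the names ORIGIN_DIR, pending_ingests, process_ingest, and run are hypothetical, the RUBYMATICA_ORIGIN environment variable is invented for the example, and falling back to "process everything" when no parameter is given is an assumption.

```ruby
# Hypothetical names throughout; none of these come from Rubymatica itself.
ORIGIN_DIR = ENV.fetch("RUBYMATICA_ORIGIN", "origin")

# Initial setup: list every ingest (folder) in the origin folder.
def pending_ingests(origin_dir)
  return [] unless File.directory?(origin_dir)

  Dir.children(origin_dir)
     .select { |name| File.directory?(File.join(origin_dir, name)) }
     .sort
end

# The processing sub-step for a single ingest. A real version would do the
# work; this placeholder only reports which folder it would process.
def process_ingest(origin_dir, name)
  path = File.join(origin_dir, name)
  raise ArgumentError, "no such ingest: #{name}" unless File.directory?(path)

  "processed #{path}"
end

# Entry point shared by the command line wrapper and any external caller.
# With one argument it processes only that ingest; with none it processes
# every ingest (an assumed fallback to the original behavior).
def run(argv, origin_dir = ORIGIN_DIR)
  names = argv.empty? ? pending_ingests(origin_dir) : [argv.first]
  names.map { |name| process_ingest(origin_dir, name) }
end

puts run(ARGV) if __FILE__ == $PROGRAM_NAME
```

Because run takes its arguments as a plain array, a web controller can call it with a single ingest name in exactly the same way the command line wrapper does.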
Ruby on Rails is a wonderful way to build web sites. Rails is convoluted, but there are many good examples on the Web. Basically, I followed one of the “hello world” examples, and added a method to the hello_world_controller to call the same entry point as the command line wrapper. This worked well, although it was processing the ingest in real time, so it was obvious that sooner or later the processing would have to be made asynchronous.
The web interface also needed code to report on the status and results of ingests in the system. The web site has been kept as simple as possible because we will eventually be using Rubymatica as a backend for a Hydra head. Even so, the simple web site needs several controller methods: offer_upload, do_upload, save_file, full_status, reset, get_file, get_log_xml, process_sip, show_logs, file_list, and report. As of this writing, Rubymatica is unable to create a BagIt bag, so at least one more method will be added to the controller.
Changing Rubymatica to become an asynchronous process was interesting (at least to a programmer). In this sense, “asynchronous” means execution of a program as a background task with no window or terminal session. The main web page has a link for each pending ingest. Clicking one of these links starts the process. The program begins to run and is “forked” to a separate task. This forking happens very quickly. You see the main web page refresh with a status message that says (essentially) “Processing has started”. Meanwhile, the background task runs independently, taking as long as necessary. This is important for two reasons. First, web browsers will “time out” after a minute or two, giving you a mostly empty window with an error message. Web browsers can’t deal with long-running tasks. Second, programmers don’t want to lock the user into doing nothing while the independent background task runs. The user doesn’t need to stare at a blank page waiting for the task to complete. (As an aside, Linux programmers call tasks “processes”, but since the word “process” has many meanings in the mixed world of digital accessions, I’ve changed it to “task” throughout this blog.)
Normally (in Perl or Python with the Apache web server), programmers simply fork. However, Ruby is often run in a special web server, and forking actually forks the web server, not just the Ruby on Rails controller. That is bad. Fortunately, it was easy to “exec” the task, which means that a new task is created in the background. Using the ingest name supplied from the web request, I simply exec’d a task using the command line that I had created earlier. This is simple, elegant, and robust.
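One way to start a command line task in the background from Ruby is sketched below. It is an illustration under assumptions, not necessarily how Rubymatica does it: the helper name start_background_task and its log_path parameter are hypothetical, and it uses Ruby's standard Process.spawn and Process.detach calls.

```ruby
require "rbconfig"

# Hypothetical helper: start a command as a background task and return its
# process id immediately, without waiting for the task to finish.
def start_background_task(*command, log_path: File::NULL)
  pid = Process.spawn(*command,
                      in: File::NULL,         # no terminal input
                      out: [log_path, "a"],   # append output to a log file
                      err: [:child, :out],    # send errors to the same place
                      pgroup: true)           # run in its own process group
  Process.detach(pid) # reap the child when it exits so it does not linger
  pid
end
```

A controller method might call it as start_background_task(RbConfig.ruby, "rubymatica_cli.rb", ingest_name), where the script name is a placeholder. Passing the command as separate arguments rather than one string means no shell is involved, which matters when the ingest name arrives in a web request.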
You may be wondering how the web site is able to know the status of background tasks. To meet this need, the background task writes status messages (and some other administrative metadata) into a SQLite database which is present in each ingest. One of the web controller methods queries each ingest’s database and displays the most recent status message on the main web page. Another controller method reports the full list of status messages for a given ingest.
Rubymatica is near maturity. It has (generally speaking) an architecture typical of most web applications: code, HTML templates, and a SQL database. The code is Ruby on Rails. The HTML templates used by Rails are Ruby erb files. The SQL is a SQLite database in each ingest directory. Eventually there might be a SQL database with system-wide Rubymatica information, although SQLite will happily open multiple databases, so a system-wide database may not be necessary.
Rubymatica needs to gain a few features to be considered a SIP creation tool, and I’ll cover those in a later blog posting. However, there is one more interesting technical problem with checksums and ingests.
Normal practice for validating files is to generate a file that lists checksums for all the files in a directory tree. We typically put the checksum file in the root directory of that tree. However, a real-world ingest may include a pre-existing checksum file, changed file names (due to detox), removed files (due to anti-virus filtering), and new files from extracting .tar and .zip files. We need to detect an existing checksum file, modify it as file names change, modify it when files are moved or deleted, and add to it when we extract .tar and .zip files.
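To make the bookkeeping concrete, here is a minimal sketch of keeping a checksum file in step with those changes. It assumes a manifest of "digest, two spaces, relative path" lines and SHA-256 digests; the real format and algorithm of an incoming checksum file may differ, and all function names here are hypothetical.

```ruby
require "digest"

# Parse "digest  relative/path" lines into a { path => digest } hash.
def read_manifest(manifest_path)
  File.readlines(manifest_path, chomp: true).each_with_object({}) do |line, entries|
    digest, path = line.split(/\s+/, 2)
    entries[path] = digest if path
  end
end

# Write the entries back out, sorted by path.
def write_manifest(manifest_path, entries)
  lines = entries.sort.map { |path, digest| "#{digest}  #{path}\n" }
  File.write(manifest_path, lines.join)
end

# A file was renamed (for example by detox): keep the digest, change the path.
def rename_entry(entries, old_path, new_path)
  entries[new_path] = entries.delete(old_path) if entries.key?(old_path)
  entries
end

# A file was removed (for example by anti-virus filtering).
def remove_entry(entries, path)
  entries.delete(path)
  entries
end

# A file was added (for example extracted from a .tar or .zip file).
def add_entry(entries, root, path)
  entries[path] = Digest::SHA256.file(File.join(root, path)).hexdigest
  entries
end
```

Renaming keeps the original digest on purpose: the file contents have not changed, so the digest supplied with the ingest can still be used to verify the file under its new name.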
Links:
http://en.wikipedia.org/wiki/Ruby_on_Rails
http://en.wikipedia.org/wiki/Fork_(operating_system)
http://en.wikipedia.org/wiki/Sqlite
https://wiki.duraspace.org/display/hydra/The+Hydra+Project
http://en.wikipedia.org/wiki/Checksum