Friday, 19 November 2010
Surprise Use of Forensic Software in Archives
However, after attending a 2.5-day training course on AccessData FTK (a computer forensics software package), I started to see the potential of using forensic software to process digital archives. I found that the functions which help investigators organize the evidence they select (bookmarks, labels) are equally applicable to organizing a whole collection, and that the functions used to find particular pieces of evidence (pattern and full-text search) are equally applicable to searching for restricted materials. I can also use the software to replace a number of programs I have been using to process digital collections. Although 90% of the training was about cracking passwords, searching deleted files, identifying pornographic images, and the like, I found the remaining 10% of the course worth every cent Stanford spent on it. Of course, the ideal case would be a course tailored for the archival community, but unfortunately no such course exists.
Now, I am using AccessData FTK to replace the following software, which I used in the past to process digital archives:
Karen's Directory Printer - to create a comprehensive manifest of the electronic files in a collection
QuickView Plus - to view files with obsolete file formats
Xplorer - to find duplicate files, copy to folders
DROID, JHOVE - to extract technical metadata: file formats, checksums, creation/modification/access dates
Windows Search 4.0 - to perform full-text search on files in certain formats (Word, PDF, ASCII)
I am also using the following FTK functions to process digital archives; I have not found another software package that performs them in a user-friendly manner:
Pattern search (to locate files containing restricted information such as Social Security numbers, credit card numbers, etc.; a small illustrative sketch follows this list)
Assign bookmarks, labels to files (for arranging files into series/subseries, other administrative and descriptive metadata)
Extract email headers (To, From, Subject, Date, CC/BCC) from emails created in different email programs, for preparing correspondence listings.
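For readers wondering what a pattern search amounts to, here is a minimal Ruby sketch that scans a directory for strings shaped like Social Security and credit card numbers. The regular expressions are illustrative only, not FTK's internal patterns; real forensic patterns are more thorough, and any hits still need human review.

```ruby
require 'find'

# Illustrative patterns only -- these are NOT FTK's internal patterns.
SSN_PATTERN  = /\b\d{3}-\d{2}-\d{4}\b/
CARD_PATTERN = /\b(?:\d[ -]?){13,16}\b/  # crude: 13-16 digits with optional separators

# Walk a directory tree and report which files contain candidate matches.
def scan_for_restricted(root)
  hits = []
  Find.find(root) do |path|
    next unless File.file?(path)
    text = File.binread(path).encode('UTF-8', invalid: :replace, undef: :replace)
    hits << [path, :ssn]  if text =~ SSN_PATTERN
    hits << [path, :card] if text =~ CARD_PATTERN
  end
  hits
end

scan_for_restricted('/path/to/collection').each { |path, kind| puts "#{kind}: #{path}" }
```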
The cost of licensing the software seems high. But if you consider the total cost of learning several "free" programs, the lack of support for such software, and the integrated environment you get from using one package, you may find that commercial forensic software ends up costing less than the "free" alternatives.
Tuesday, 16 November 2010
Other Highlights from the DLF Fall Forum
I attended the working session on curation micro-services, led by Stephen Abrams of the California Digital Library, Delphine Khanna from the University of Pennsylvania, and Katherine Kott from Stanford University. (Patricia Hswe from Pennsylvania State University was supposed to be one of the discussion leaders, but she was unable to attend the Fall Forum.) The micro-services approach is a philosophical and technical methodology for the architecture of digital curation environments. This approach values simplicity and modularity, which allows "minimally sufficient" components to be recombined. Furthermore, the strength of curation micro-services is the relative ease with which they can be redesigned and replaced as necessary. The slides from the beginning part of the session can be found here.
There was also a reading session at the DLF Fall Forum on "Reimagining METS." The session's discussion revolved around ideas put forth in a white paper distributed in advance of the conference. The majority (if not all) of the METS Editorial Board facilitated the discussion, which was very high level and incredibly interesting. Much of the discussion seemed to proceed from the assumption that METS actually needed to change. The most interesting idea, and the one that seemed to get a fair amount of traction, was to consider whether METS should focus on its strength in packaging and abdicate some of its functionality to other standards that arguably do it better (e.g., OAI-ORE for structure).
On the last day, I went to the workshop on JHOVE2, the successor project to the JHOVE characterization framework. JHOVE2 has a notably different architecture and an expanded feature set, broadening characterization to include identification, validation, feature extraction, and assessment based on user-defined policies. Additionally, users will be able to define format characterization and validation files for complex digital objects, such as GIS shapefiles. The presenters stated that future development for JHOVE2 will include a GUI to assist in rule set development. From the standpoint of a digital archivist, this tool will be essential to much of our further work.
Wednesday, 10 November 2010
Donor survey web form is ready
I realized I could quickly adapt an existing application as a donor survey, if the application were flexible enough. A couple of years ago I created a mini Laboratory Information Management System (LIMS). The programmer can easily modify the fields in the forms, although users cannot add fields ad-hoc. The mini LIMS has its own security and login system, and users are in groups. For the purposes of the AIMS donor survey, one section of the LIMS becomes the “donor info”, another section becomes the “survey”.
Using the list of fields that Liz Gushee, Digital Archivist here at UVa, gathered while working with the other AIMS archivists, I put the donor’s name, archivist name and related fields into the “donor” web form. All the survey questions went into the “survey” web form. The survey will support distinct surveys from the same donor over time to allow multiple accessions.
Our next step will be to map donor survey fields to various standard formats for submission agreements and other types of metadata. While the donor survey data needs to be integrated into the SIP workflow, we haven’t written the transformation and mapping code. We are also lacking a neatly formatted, read-only export of the donor survey data. Our current plan is to use Rubymatica to build a SIP, and that process will include integration of the donor survey data. The eventual product will be one or more Hydra heads. Learn more about Hydra:
https://wiki.duraspace.org/display/hydra/The+Hydra+Project
Everyone is invited to test our beta-release donor survey. Please email Tom twl8n@virginia.edu to request an account. Include the following 3 items in your email:
1) Your name
2) A password that is not used elsewhere for anything important
3) Institutional affiliation or group so I know what group to assign you to, even if your group has only one person.
On the technical side, there were some interesting user interface (UI) and coding issues. Liz suggested comment fields for the survey, and we decided to offer archivists a comment for every question. I used Cascading Style Sheets (CSS) to minimize the size of comment fields so that the web page would not be a visual disaster. If one clicks in a comment, it expands. Click out of it, and it shrinks back down.
The original LIMS never had more than a dozen fields. The donor survey is very long and required some updates to the UI. By using the jQuery ajax() function, I was able to create a submit button that saves the survey without redrawing the web page. CSS code was required to make the save buttons remain in a fixed location while the survey questions scroll.
The mini LIMS relies on a Perl closure to handle the creation of the web forms and the saving of data to the database. Calling functions from the closure creates the fields. A closure is similar to a class in Java or Ruby, but somewhat more powerful. The data structures from the closure are passed to Perl's HTML::Template module to build the web pages from HTML templates.
The mini LIMS was originally created to use PostgreSQL (aka Postgres), but I've converted it to use SQLite. SQLite is a zero-administration database, easily accessed from all major programming languages, and the binary database file is directly portable across operating systems. Perl's DBI database connectivity is database agnostic. However, some aspects of the SQL queries are not quite portable. Postgres uses "sequences" for primary keys. Sequences are wonderful, and vastly superior to auto-increment fields. SQLite does not have sequences, so I had to write a bit of code to handle the difference. The calendar date functions are also quite different between Postgres and SQLite, so once again I had to generalize some of the SQL queries related to date and time.
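As a small aside for readers unfamiliar with the difference, here is a sketch (in Ruby with the sqlite3 gem, since the mini LIMS itself is Perl) of how SQLite's INTEGER PRIMARY KEY stands in for a Postgres sequence, and of one of the date-function differences. The table and column names are made up for illustration.

```ruby
require 'sqlite3'  # gem install sqlite3

db = SQLite3::Database.new('donor_survey_demo.db')

# Postgres would insert nextval('donor_seq') for the id; SQLite has no sequences,
# so an INTEGER PRIMARY KEY column hands out ids automatically instead.
db.execute <<SQL
  CREATE TABLE IF NOT EXISTS donor (
    id      INTEGER PRIMARY KEY,            -- auto-assigned, plays the role of a sequence
    name    TEXT,
    created TEXT DEFAULT (datetime('now'))  -- SQLite date function; Postgres would use now()
  );
SQL

db.execute("INSERT INTO donor (name) VALUES (?)", ['Example Donor'])
puts db.last_insert_row_id
```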
There was one odd problem. Postgres and SQLite are both fully transactional. However, because SQLite is an "embedded" database, it cannot handle database locks as elegantly as client-server databases. Normally this is not a problem, and all databases have some degree of locking. In this instance I got a locking error when preparing a second query in one transaction. Doing a commit after the first query appears to have fixed the problem. I've worked with SQLite for several years and never encountered this problem. It could be a bug in the Perl SQLite DBD driver. Using the sqlite_use_immediate_transaction option did not solve the problem.
The mini LIMS source code is available.
Friday, 29 October 2010
Digital Library Federation (DLF), Fall Forum, 2010
Duke University: Naomi Nelson
Emory University: Erika Farr, Peter Hornsby
Stanford University: Peter Chan, Glynn Edward, Michael Olson
Tuesday, 26 October 2010
Update on the Donor Survey
A few months have gone by, and the archivists have had the opportunity to think more about how we envision the donor survey fitting into both shared and institution-specific born-digital workflows. First of all, we all agreed that we wanted to move away, as much as possible, from continuing to create paper-based forms and records regarding donors and content. Moving the donor survey to a web-based tool, complete with an SQLite database back-end, seemed to be a good way to start (for technical specifics, please see Tom's forthcoming entry regarding the web form - coming up next!). In the web-based survey, we deliberately included a space for the archivist to record comments for each question and answer on the survey. We realized that by creating a place for the archivist to record their findings and/or elaborate on what was recorded by the donor/owner of the personal archive, we could make the process of determining the scope of the personal archive for transfer that much more transparent. As one of the senior archivists on the project pointed out, knowing what was excluded from transfer and why is as important as having a trail of documentation of what was transferred and why (especially if the processing of the collection follows many months later!). We hope that adding this feature to the survey will help record that process in a centralized location and perhaps serve as the digital equivalent of a donor file.
As to how the donor survey fits into our shared and institution-specific workflows, that is still a work in progress. Generally speaking, the intention is that the data collected from the survey can be mapped to a submission agreement, which, in turn, would become part of the SIP (Submission Information Package). We also intend to map portions of what has been collected from the survey and submission agreement into Archivists' Toolkit and CALM (collection management software from the UK) to form an accession record. Ideally, we want to enter/create data once and have it re-purposed as often as needed throughout our workflow.
We invite you to test out our web survey and give us your feedback. In our next entry, Tom will post a description of the technical side of the survey web form and include a link for access. Other folks have been working on their own surveys for electronic records as well. If you're not already familiar with Chris Prom's blog, Practical E-Records, get a readin'. Chris recently posted a version of a donor survey; check it out here.
Liz Gushee
University of Virginia
Thursday, 30 September 2010
Rubymatica as web application
Ruby on Rails is a wonderful way to build web sites. Rails is convoluted, but there are many good examples on the Web. Basically, I followed one of the “hello world” examples, and added a method to the hello_world_controller to call the same entry point as the command line wrapper. This worked well, although it was processing the ingest in real time, so it was obvious that sooner or later the processing would have to be made asynchronous.
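As a rough sketch of what that first, synchronous cut looked like, something along these lines would do the job. The controller, module, and method names here are hypothetical, not the actual Rubymatica code.

```ruby
# app/controllers/hello_world_controller.rb
# Hypothetical names; a minimal synchronous version that runs the ingest
# inside the web request, just as described above.
class HelloWorldController < ApplicationController
  def process_sip
    ingest_name = params[:name]
    # Call the same entry point the command line wrapper uses.
    result = Rubymatica.process_ingest(ingest_name)
    render :text => "Processed #{ingest_name}: #{result}"
  end
end
```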
The web interface also needed code to report on the status and results of ingests in the system. The web site has been kept as simple as possible because we will eventually be using Rubymatica as a backend for a Hydra head. Even so, the simple web site needs several controller methods: offer_upload, do_upload, save_file, full_status, reset, get_file, get_log_xml, process_sip, show_logs, file_list, and report. As of this writing Rubymatica is unable to create a BagIt bag, so there will be at least one more method added to the controller.
Changing Rubymatica to become an asynchronous process was interesting (at least to a programmer). In this sense "asynchronous" means execution of a program as a background task with no window or terminal session. The main web page has a link for each pending ingest. Clicking one of these links starts the process. The program begins to run and is "forked" off as a separate task. This forking happens very quickly. You see the main web page refresh with a status message that says (essentially) "Processing has started". Meanwhile, the background task runs independently, taking as long as necessary. This is important for two reasons. First, web browsers will "time out" after a minute or two, giving you a mostly empty window with an error message; web browsers can't deal with long-running tasks. Second, programmers don't want to lock the user into doing nothing while the independent background task runs. The user doesn't need to stare at a blank page waiting for the task to complete. (As an aside, Linux programmers call tasks "processes", but since the word "process" has many meanings in the mixed world of digital accessions, I've changed it to "task" throughout this blog.)
Normally (in Perl or Python with the Apache web server) programmers simply fork. However, Ruby is often run in a special web server, and forking actually forks the web server, not just the Ruby on Rails controller. That is bad. Fortunately, it was easy to "exec" the task, which means that a new task is created in the background. Using the ingest name supplied by the web request, I simply exec'd a task using the command line wrapper I had created earlier. This is simple, elegant, and robust.
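One way to launch such a detached background task from a controller, assuming the command line wrapper lives at script/process_ingest.rb (a hypothetical path), looks roughly like this. Rubymatica's actual call may differ.

```ruby
# Sketch: start the ingest as an independent background task and return
# immediately so the controller can render the "Processing has started" page.
def start_background_ingest(ingest_name)
  pid = Process.spawn('ruby', 'script/process_ingest.rb', ingest_name,
                      :out => '/dev/null', :err => '/dev/null')
  Process.detach(pid)  # reap the child; status is reported via SQLite, not the process
  pid
end
```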
You may be wondering how the web site is able to know the status of background tasks. In order to fill this need, the background task writes status messages (and some other administrative meta data) into a SQLite database which is present in each ingest. One of the web controller methods queries each ingest’s database, and displays the most recent status message in the main web page. Another controller reports the full list of status messages for a given ingest.
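The reading side might look roughly like the following; the database file name and table schema are assumptions for illustration, not necessarily what Rubymatica uses.

```ruby
require 'sqlite3'

# Return the most recent status message written by the background task
# into an ingest's own SQLite database.
def latest_status(ingest_dir)
  db = SQLite3::Database.new(File.join(ingest_dir, 'status.db'))
  db.get_first_value('SELECT message FROM status ORDER BY id DESC LIMIT 1')
ensure
  db.close if db
end
```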
Rubymatica is near maturity. It has (generally speaking) an architecture typical of most web applications: code, HTML templates, and a SQL database. The code is Ruby on Rails. The HTML templates used by Rails are Ruby erb files. The SQL is a SQLite database in each ingest directory. Eventually there might be a SQL database with system-wide Rubymatica information, although SQLite will happily open multiple databases so a system wide database may not be necessary.
Rubymatica needs to gain a few features to be considered a SIP creation tool, and I’ll cover those in a later blog posting. However, there is one more interesting technical problem with checksums and ingests.
Normal practice for validating files is to generate a file listing the checksums of all the files in a directory tree. We typically put this checksum file in the root directory of the tree. However, a real-world ingest may include a pre-existing checksum file, changing file names (due to detox), removed files (due to anti-virus filtering), and new files from extracting .tar and .zip files. We need to detect an existing checksum file, modify it as file names change, modify it when files are moved or deleted, and add to it when we extract .tar and .zip files.
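A sketch of the simplest piece of this is to regenerate a fresh manifest once the tree has settled after each step; this side-steps, rather than solves, the harder problem of reconciling a pre-existing donor manifest. MD5 is used here for brevity, and the manifest file name is an assumption.

```ruby
require 'digest/md5'
require 'find'

# Rebuild a checksum manifest for everything under root, skipping the
# manifest itself; run this again after detox, virus filtering, or extraction.
def write_checksum_manifest(root, manifest = 'manifest-md5.txt')
  lines = []
  Find.find(root) do |path|
    next unless File.file?(path)
    next if File.basename(path) == manifest
    relative = path.sub(%r{\A#{Regexp.escape(root)}/?}, '')
    lines << "#{Digest::MD5.file(path).hexdigest}  #{relative}"
  end
  File.open(File.join(root, manifest), 'w') { |f| f.puts(lines) }
end
```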
Links:
http://en.wikipedia.org/wiki/Ruby_on_Rails
http://en.wikipedia.org/wiki/Fork_(operating_system)
http://en.wikipedia.org/wiki/Sqlite
https://wiki.duraspace.org/display/hydra/The+Hydra+Project
http://en.wikipedia.org/wiki/Checksum
Archivematica SIP ported to Rubymatica
I am very grateful to the Archivematica developers for creating a working product based on Linux. Archivematica uses a “desktop explorer” user interface where moving an ingest to specific folders causes the Archivematica processing scripts to run. This architecture is easy to understand, and fairly easy to port to another programming language.
For the first phase, I rewrote the shell and Python scripts in Ruby as a single group of methods in one script with a single entry point. I ran my script from the command line. I had to work out origin and destination folders, because Archivematica moves the ingest into a temporary folder and then moves it a second time to a final destination. Since I would have a unique subfolder for each ingest, I didn't need an intermediate temporary folder. I made additional modifications to the Archivematica directory structure for clarity and programming sanity. For instance, Archivematica creates some metadata files in the same directory as the ingested files. My code was simpler if the metadata for a given ingest always resided in its own folder. (Early versions of Rubymatica were processing the metadata files as part of the ingest because the files were mixed together in the same directory.)
Ingesting a collection of files involves traversing the directory tree of the ingest. This traversal happens several times. For example, .tar and .zip files need to be extracted, so the directory tree is traversed searching for files to extract. This process is iterative in that each newly extracted directory must be traversed in turn. Archivematica uses a Python script called Easy Extract. I recoded Easy Extract as a recursive Ruby method, which was nontrivial and required a couple of days' work. The directory tree also has to be traversed when calling "detox" to clean file names, and traversed again for virus checking with ClamAV. To keep my sanity, I created a method specifically to traverse the directory tree. In keeping with the theme of clarity and simplicity of algorithms, the directory tree is crawled several times rather than trying to do everything in one pass. This works well and fits the programmer mantra: avoid premature optimization.
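The core idea can be sketched like this: keep rescanning and extracting until a pass finds no archives left. This is only an illustration of the approach, not the Easy Extract port itself, and it shells out to the system unzip and tar commands.

```ruby
require 'find'

# Repeatedly scan the tree and extract any .zip/.tar files found; extraction
# may expose new archives, so loop until a pass finds nothing to extract.
def extract_all(root)
  loop do
    archives = []
    Find.find(root) do |path|
      archives << path if File.file?(path) && path =~ /\.(zip|tar)\z/i
    end
    break if archives.empty?
    archives.each do |archive|
      dest = archive.sub(/\.(zip|tar)\z/i, '')
      Dir.mkdir(dest) unless File.directory?(dest)
      if archive =~ /\.zip\z/i
        system('unzip', '-q', archive, '-d', dest)
      else
        system('tar', '-xf', archive, '-C', dest)
      end
      File.delete(archive)  # remove the original so the next pass does not re-extract it
    end
  end
end
```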
Creating the METS XML took several days of difficult work. I used Nokogiri’s Builder class since it has powerful XML tools. Nokogiri is a Ruby module or a “gem” in Ruby parlance. The Builder class of Nokogiri is wonderfully powerful, but almost entirely undocumented. Archivematica uses a Python class to build the METS XML, but the parallels between Python’s XML tools and Nokogiri Builder are tenuous. With some help from the very bright, talented, and helpful Andrew Curley here at UVA, I was able to bend Nokogiri Builder to my will. It was a huge battle and Nokogiri nearly crushed me. I’ve created example code for Nokogiri Builder which can be found at:
http://defindit.com/readme_files/nokogiri_examples.html
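For anyone facing the same battle, here is a minimal Nokogiri::XML::Builder sketch of a bare METS skeleton. The structure and values are illustrative only; they are not the METS that Rubymatica actually emits.

```ruby
require 'nokogiri'

# Build a tiny METS document: a header identifying the creating agent and a
# fileSec pointing at one (hypothetical) object file.
builder = Nokogiri::XML::Builder.new do |xml|
  xml.mets('xmlns' => 'http://www.loc.gov/METS/',
           'xmlns:xlink' => 'http://www.w3.org/1999/xlink') do
    xml.metsHdr('CREATEDATE' => Time.now.utc.strftime('%Y-%m-%dT%H:%M:%SZ')) do
      xml.agent('ROLE' => 'CREATOR', 'TYPE' => 'OTHER') do
        xml.name_ 'Rubymatica'  # trailing underscore avoids clashing with Builder's own methods
      end
    end
    xml.fileSec do
      xml.fileGrp('USE' => 'original') do
        xml.file('ID' => 'file-0001') do
          xml.FLocat('LOCTYPE' => 'OTHER', 'xlink:href' => 'objects/example.txt')
        end
      end
    end
  end
end

puts builder.to_xml
```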
Now that I have a working command line script, it is time to think about creating a web interface. I’ll cover that process in my next blog.
Links:
http://archivematica.org/wiki/index.php?title=Main_Page
http://en.wikipedia.org/wiki/Ruby_(programming_language)
http://en.wikipedia.org/wiki/Tar_(file_format)
http://en.wikipedia.org/wiki/Python_(programming_language)
http://en.wikipedia.org/wiki/Shell_script
http://en.wikipedia.org/wiki/Clam_AV
http://en.wikipedia.org/wiki/Directory_tree
Wednesday, 25 August 2010
Can you read 5.25 inch double sided / high capacity IBM / MS DOS formatted diskettes?
I believe many archives have old computer storage media such as 5.25 and 3.5 inch floppy diskettes and Zip disks in their collections. Since you can still buy 3.5 inch floppy disk drives with a USB interface, reading 3.5 inch floppy diskettes is not a problem. Also, because some Zip drives use an ATAPI or USB interface, connecting a Zip drive to your PC is not a problem either (ATAPI is still widely used to connect CD/DVD drives). The problem is 5.25 inch floppy disk drives, which use a 34-pin floppy disk drive connector (there is no USB or ATAPI version). The 34-pin floppy disk drive connector doesn't exist on the motherboards of some modern personal computers (e.g. the FRED we have at Stanford). Even when a motherboard has a 34-pin floppy disk drive connector, some BIOSes recognize a 3.5 inch floppy disk drive but not a 5.25 inch one (e.g. the retired Dell PC in my office).
In order to read 5.25 inch double sided or high density IBM / MS DOS formatted diskettes, one option is to get a Catweasel card to put in a spare PCI slot in your PC. The Catweasel card has a 34-pin floppy disk drive connector to which you can connect a 5.25 inch floppy drive. According to the manual, it can be configured to read Commodore 64 disks, extended density (2.88MB) disks, CP/M format disks (8-inch floppies for PDP-11 machines), and IBM / MS DOS format disks. One limitation is that you have to use the "Imagetool" software that comes with the Catweasel card to read and write disk images. You cannot browse the contents of a disk to see what is on it before you make the disk image; you have to create the disk image and export the files from it to see the contents. Also, Imagetool does not have an option to create a logical disk image (slack space not captured, deleted files not copied), and at Stanford we create mostly logical disk images. Another option is to get a PC with a motherboard and BIOS capable of connecting to and recognizing a 5.25 inch floppy disk drive. I checked the specifications of several new PCs on the market and found no mention of 5.25 inch floppy disk drives (I would be very surprised if I had). I also checked the specifications of new motherboards to see whether they can connect to and recognize 5.25 inch floppy disk drives (if so, I could buy the motherboard and build a PC around it). The specifications will tell you whether a motherboard has a floppy disk connector, but give no details on whether it recognizes 5.25 inch floppy disk drives. One day, I opened a retired PC of mine and discovered that it has a 34-pin floppy disk connector on the motherboard. I brought the PC to my office, connected the 5.25 inch floppy disk drive to the motherboard, and it WORKED!!!! I can see the contents of 5.25 inch floppy disks using Windows Explorer, and I can use FTK Imager to create logical disk images. After that, I connected an ATAPI Zip drive (taken from the retired Dell PC in my office) to the same machine, and it became a standalone capture station for me!!
People who don't have a 5.25 inch drive in their office can try to get one from eBay. But keep in mind that the drives you buy from eBay may be more than 20 years old and may stop working at any time. In fact, the two 5.25 inch floppy disk drives Stanford bought from eBay last year have stopped working. Another source of 5.25 inch floppy disk drives is the people around you; they may have one in their garage. In fact, at a conference in July I mentioned to a manager from Konica Minolta that Stanford had many 5.25 inch floppy disks and described the problems we were facing. He told me that he had a 5.25 inch floppy disk drive in his office and asked whether I wanted it!! I received the drive last Friday and it is now working for me :)
Wednesday, 28 July 2010
CALM Digital Records meeting
Adrian Brown (Parliamentary Archives), convenor of the meeting, which was hosted by The National Archives, reported on the main findings from a survey of CALM users conducted at the end of 2009. It was clear from the meeting that many archivists were actively investigating the options and issues surrounding a digital repository, but that the lack of a digital repository within their organisation and the need for training were huge obstacles to overcome.
I gave a brief outline of the AIMS project and presented a diagram to highlight our current thinking about how Fedora and born-digital material can be integrated into our workflows. [This model is currently still conceptual, but we will be working with Axiell to progress this – comments welcomed.] Natalie Walters (Wellcome Library) highlighted their work and how she had found that many of the professional archival skills used to handle and manage paper archives still apply in the born-digital arena. Malcolm Todd (The National Archives) talked about four key aspects of digital repository technology: modularity, interoperability, sustainability and cost-effectiveness, all of which are being actively embraced by the AIMS project.
Malcolm Howitt and Nigel Pegg (Axiell) spoke about their plans to extend CALM to link to digital repositories and it is hoped that we can work closely with them on this.
The rest of the meeting was spent discussing and identifying issues surrounding cataloguing and metadata; accession and ingest; user access and best practice. A number of common themes emerged:
• That the differences between paper and digital archives are often exaggerated, with issues like provenance and integrity key to both
• That depositors' perception of digital archives is very different from that of paper, and that prompt action by archivists is the only way to avoid technological obsolescence and a digital dark hole in the historical record
• The need for archives staff to be actively involved in the digital repository and not leave it for ICT staff to develop/manage exclusively
• That born digital archives may open-up the archives to new audiences
• A desire to share experiences, documentation etc for the wider benefit of the profession
• A need for more opportunities for "hands-on" experience with born-digital archives and repositories to increase familiarity within the archives profession
Tuesday, 27 July 2010
Introduction
Born Digital Archives is the blog of the AIMS team. We hope to stimulate dialog about practical solutions for archiving materials that originate in digital form. We invite all interested archivists to chime in with questions and comments via the "comments" link on each post. AIMS is inclusive, with the intention of creating open-source solutions that are useful to both small and large institutions.
Friday, 16 July 2010
Surveying Born Digital Collections
The main differences between the AIMS Digital Material Survey and the Paradigm records survey are:
1. Additional
2. Division of the survey into two parts: Part I is designed as a prompt sheet for phone or face-to-face interviews with donors by curators or digital archivists; Part II is to be filled out by digital archivists and covers technical details of the tools used to create the digital material.
3. Use of non-technical terms.
I think the survey should be sent before the actual interview as "something for the donor to start thinking about". If the donor is willing to reply before the interview, it helps the digital archivist prepare as well. In fact, I sent the survey to a donor in July and she replied before the interview, mentioning that she used Eudora for her email. Since I was not familiar with Eudora, her answer helped me prepare for the interview.
Finally, I have to thank Susan Thomas, project manager of the Paradigm and futureArch projects, for her comments on the AIMS Digital Material Survey and for sharing her experience using the Paradigm records survey.
We would like your comments on the survey as well. If you are going to discuss personal digital archives with donors, why not download the survey and give it a try? Even if you are not collecting personal digital archives in the near future, take a look and tell us what you think.
Click below for the survey:
AIMS Digital Material Survey– Personal Digital Archives