Friday, 18 March 2011

Personal Digital Archiving Conference 2011

I had the good fortune to attend the 2011 Personal Digital Archiving conference at the Internet Archive, along with other colleagues on the AIMS project, including Michael Forstrom from Yale and Michael Olson, Peter Chan, and Glynn Edwards from Stanford. The conference was exceptional, with presentations ranging from fairly pragmatic topics to the highly theoretical. A number of other blogs have comprehensive notes on the conference, and the conference's organizers have already provided a detailed listing of those. Instead, I'd just like to focus on what I considered the highlights of the conference.
  • Cathy Marshall's keynote was excellent. I have seen her speak before, and this time she presented a survey of her ongoing research into personal digital archives.
  • Jeremy Leighton John presented on work undertaken since the Digital Lives project at the British Library.
  • Judith Zissman presented on "agile archiving", similar to agile development, wherein individuals can continually refine their archival practices.
  • Birkin Diana presented on how Brown University is working to make their institutional repository a space for personal materials, and strategies that allow users to work on adding metadata iteratively.
  • Daniel Reetz presented on his DIY Book Scanner project, but also brought in detailed technical analysis about how image sensors in digital cameras work and how our brains process image data.
  • Jason Zalinger introduced the notion of Gmail as a "story-world" and presented some prototype tools and games to help navigate that world.
  • Cal Lee presented on introducing education about digital forensics to the archival curriculum.
  • Kam Woods also presented on applying digital forensics to the archival profession.
  • Sam Meister presented on the complex ethics of using forensics in acquiring and processing the records from start-up companies.
In addition, I presented with Amelia Abreu on "archival sensemaking", which introduces the notion of personal digital archiving practice as an iterative, context-bound process.

Saturday, 12 March 2011

Processing Born Digital Materials Using AccessData FTK

To follow up on my previous blog entry, "Surprise Use of Forensic Software in Archives", I have prepared a YouTube video, "Processing Born Digital Materials Using AccessData FTK". I hope this video gives people more detail on how FTK is being used at Stanford University Libraries. Take a look and let me know what you think.

I would like to say a few words on discovery and access even though it is not the topic of the video. After we process the files in FTK, one way to deliver the files is to store them in a Fedora repository and let people access that repository with a web browser over the Internet. We have developed an alpha version of this model using files from the Stephen Jay Gould collection. Another way to provide access to the files is to let people use FTK to access the files in our reading room. I will write about that later.

Hope you enjoy the video.

Friday, 4 March 2011

File type categories with PRONOM and DROID

In order to assess a born digital accession, the AIMS digital archivists expressed a need for a report on the count of files grouped by type. The compact listing gives the archivist an overview that is difficult to visualize from a long listing. The category report supplements the full list of all files, and helps with a quick assessment after creation of a SIP via Rubymatica. (In a later post I’ll point out some reasons why pre-SIP assessment is often not practical with born digital.)

At the moment we have six categories. Below is a small example ingest:

Category summary for accession ingested files
moving image    1
still image     26

Some time ago we decided to use DROID exclusively as our file identification software. It works well to identify a broad variety of files, and is constantly being improved. We initially were using file identities from FITS, but the particular identity was highly variable. FITS gives a “best” identity based on metadata returned by several utility programs. We wanted a consistent identification, as opposed to some files being identified by DROID, some by the “file” utility, and some by Jhove. We are currently using the DROID identification by pulling the DROID information out of the FITS XML for each file. This is easy and required very little change to Rubymatica.
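For readers who want to try something similar, here is a minimal Python sketch of pulling the DROID identification out of a FITS record. It is not Rubymatica's code. The sample XML is a hand-written assumption modeled on FITS output (an identity element carrying an externalIdentifier with toolname "Droid" and type "puid"), and the function name droid_identity is hypothetical; check the element names against the output of your own FITS version.

```python
import xml.etree.ElementTree as ET

# Namespace used by FITS output documents (verify against your own FITS files).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hand-written sample, assumed to resemble a FITS record for one JPEG file.
SAMPLE_FITS = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="JPEG File Interchange Format" mimetype="image/jpeg">
      <tool toolname="Jhove" toolversion="1.5"/>
      <tool toolname="Droid" toolversion="3.0"/>
      <externalIdentifier toolname="Droid" type="puid">fmt/43</externalIdentifier>
    </identity>
  </identification>
</fits>"""


def droid_identity(fits_xml):
    """Return (puid, format_name) as reported by DROID, or None if absent."""
    root = ET.fromstring(fits_xml)
    for identity in root.findall("fits:identification/fits:identity", FITS_NS):
        for ext in identity.findall("fits:externalIdentifier", FITS_NS):
            # Only trust the PUID that DROID itself supplied.
            if ext.get("toolname") == "Droid" and ext.get("type") == "puid":
                return ext.text.strip(), identity.get("format")
    return None


if __name__ == "__main__":
    print(droid_identity(SAMPLE_FITS))
```

Ignoring every identity that lacks a DROID-supplied PUID is what keeps the result consistent: a file either has a DROID identification or is reported as unidentified, rather than falling back to another tool.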

PRONOM has the ability to have “classifications” via the XML element FormatTypes. However, there are a couple of issues. The first problem is that the PRONOM team is focused primarily on building new signatures (file identification configurations) and doesn’t have time to focus on low priority tasks such as categories. Second, the categories will almost certainly be somewhat different at each institution.

Happily, I was able to create an easy-to-use web page to manage DROID categories. It only took one day to create this handy tool, and the tool is built into Rubymatica. The Rubymatica file listing report now has three sections: 1) an overview using the categories; 2) a list of donor files in the ingest with the PRONOM PUID and human-readable format name; 3) the full list of all files (technical and donor) in the SIP.
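The overview section amounts to counting files per category once each PUID has been assigned one. A minimal Python sketch follows, again not Rubymatica's code: the PUID_CATEGORIES table, the "uncategorized" fallback, and the function name category_summary are all hypothetical stand-ins for whatever an institution configures through the category page.

```python
from collections import Counter

# Hypothetical PUID-to-category table; each institution would maintain its own.
PUID_CATEGORIES = {
    "fmt/43": "still image",
    "fmt/44": "still image",
    "fmt/5": "moving image",
}


def category_summary(puids, categories=PUID_CATEGORIES, default="uncategorized"):
    """Count files per category, given one PUID per donor file."""
    counts = Counter(categories.get(puid, default) for puid in puids)
    # Sort by category name so the report is stable between runs.
    return sorted(counts.items())


if __name__ == "__main__":
    # One PUID per file: 26 JPEGs and one video, as in the small example ingest.
    ingest = ["fmt/43"] * 26 + ["fmt/5"]
    for category, count in category_summary(ingest):
        print(f"{category:<16}{count}")
```

Files whose PUID has no category yet land in the fallback bucket rather than disappearing, which makes gaps in the category table visible in the report.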

This simple report seems anticlimactic, but processing born digital materials consists of many small details, which collectively can be a huge burden if not properly managed and automated. Adding this category feature to Rubymatica was a pleasant process, largely because the PRONOM data is open source, readily available, and delivered in a standard format (XML). My thanks and gratitude to the PRONOM people for their continuing work.

As I write this I notice that DROID v6 has just been released! The new version certainly includes a greatly expanded set of signatures (technical data for file identifications). We look forward to exploring all the new features.