Friday, 4 March 2011

File type categories with PRONOM and DROID

In order to assess a born digital accession, the AIMS digital archivists expressed a need for a report on the count of files grouped by type. The compact listing gives the archivist an overview that is difficult to visualize from a long listing. The category report supplements the full list of all files, and helps with a quick assessment after creation of a SIP via Rubymatica. (In a later post I’ll point out some reasons why pre-SIP assessment is often not practical with born digital.)

At the moment we have six categories. Below is a small example ingest:

Category summary for accession ingested files
moving image1
still image26

Some time ago we decided to exclusively use DROID as our file identification software. It works well to identify a broad variety of files, and is constantly being improved. We initially were using file identities from FITS, but the particular identity was highly variable. FITS gives a “best” identity based meta data returned by several utility programs. We wanted a consistent identification as opposed to some files being identified by DROID, some by the “file utility” and some by Jhove. We are currently using the DROID identification by pulling the DROID information out of the FITS xml for each file. This is easy and required very little change to Rubymatica.

PRONOM has the ability to have “classifications” via the XML element FormatTypes. However, there are a couple of issues. The first problem is that the PRONOM team is focused primarily on building new signatures (file identification configurations) and doesn’t have time to focus on low priority tasks such as categories. Second, the categories will almost certainly be somewhat different at each institution.

Happily I was able to create an easy-to-use web page to manage DROID categories. It only took one day to create this handy tool, and the tool is built-in to Rubymatica. The Rubymatica file listing report now has three sections: 1) overview using the categories 2) list of donor files in the ingest with the PRONOM PUID and human readable format name 3) the full list of all files (technical and donor) in the SIP.

This simple report seems anticlimactic, but processing born digital materials consists of many small details, which collectively can be a huge burden if not properly managed and automated. Adding this category feature to Rubymatica was a pleasant process, largely because the PRONOM data is open source, readily available, and delivered in a standard format (XML). My thanks and gratitude to the PRONOM people for their continuing work.

As I write this I notice that DROID v6 has just been released! The new version certainly includes a greatly expanded set of signatures (technical data for file identifications). We look forward to exploring all the new features.

No comments:

Post a Comment