Friday, 18 March 2011

Personal Digital Archiving Conference 2011

I had the good fortune to attend the 2011 Personal Digital Archiving conference at the Internet Archive, along with other colleagues on the AIMS project, including Michael Forstrom from Yale and Michael Olson, Peter Chan, and Glynn Edwards from Stanford. The conference was exceptional, with presentations ranging from fairly pragmatic topics to the highly theoretical. A number of other blogs have comprehensive notes on the conference, and the conference's organizers have already provided a detailed listing of those. Instead, I'd just like to focus on what I considered the highlights of the conference.
  • Cathy Marshall's keynote was excellent. I have seen her speak before, and she presented a survey of her ongoing research into personal digital archives.
  • Jeremy Leighton John presented on work undertaken since the Digital Lives project at the British Library.
  • Judith Zissman presented on "agile archiving", similar to agile development, wherein individuals can continually refine their archival practices.
  • Birkin Diana presented on how Brown University is working to make their institutional repository a space for personal materials, and strategies that allow users to work on adding metadata iteratively.
  • Daniel Reetz presented on his DIY Book Scanner project, but also brought in detailed technical analysis about how image sensors in digital cameras work and how our brains process image data.
  • Jason Zalinger introduced the notion of Gmail as a "story-world" and presented some prototype tools and games to help navigate that world.
  • Cal Lee presented on introducing education about digital forensics to the archival curriculum.
  • Kam Woods also presented on applying digital forensics to the archival profession.
  • Sam Meister presented on the complex ethics of using forensics in acquiring and processing the records from start-up companies.
In addition, I presented with Amelia Abreu on "archival sensemaking", which introduces the notion of personal digital archiving practice as an iterative, context-bound process.

Saturday, 12 March 2011

Processing Born Digital Materials Using AccessData FTK

To follow up on my previous blog entry on "Surprise Use of Forensic Software in Archives", I have prepared a YouTube video, "Processing Born Digital Materials Using AccessData FTK". I hope this video can give people more details on how FTK is being used at Stanford University Libraries. Take a look and let me know what you think.

I would like to say a few words on discovery and access, even though it is not the topic of the video. After we process the files in FTK, one way to deliver them is to store them in a Fedora repository and let people access the repository with a web browser over the Internet. We have developed an alpha version of this model using files from the Stephen Jay Gould collection. Another way to provide access is to let people use FTK to examine the files in our reading room. I will write about that later.

Hope you enjoy the video.

http://www.youtube.com/watch?v=hDAhbR8dyp8

Friday, 4 March 2011

File type categories with PRONOM and DROID

In order to assess a born digital accession, the AIMS digital archivists expressed a need for a report on the count of files grouped by type. The compact listing gives the archivist an overview that is difficult to visualize from a long listing. The category report supplements the full list of all files, and helps with a quick assessment after creation of a SIP via Rubymatica. (In a later post I’ll point out some reasons why pre-SIP assessment is often not practical with born digital.)

At the moment we have six categories. Below is a small example ingest:

Category summary for accession ingested files

data          3
moving image  1
other         2
sound         2
still image   26
textual       12
Total         46
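The summary above can be sketched as a simple tally over PUIDs. This is a minimal Ruby illustration, not the Rubymatica code; the PUID-to-category mapping here is invented for the example, and a real mapping would cover the full set of signatures an institution encounters.

```ruby
# Hypothetical mapping from PRONOM PUIDs to local categories;
# any PUID not listed falls into the "other" category.
CATEGORY_BY_PUID = {
  'fmt/40'  => 'textual',      # assumed: MS Word
  'fmt/43'  => 'still image',  # assumed: JPEG
  'fmt/134' => 'sound',        # assumed: MP3
}
CATEGORY_BY_PUID.default = 'other'

# Count ingested files by category, given the PUID for each file.
def category_summary(puids)
  counts = Hash.new(0)
  puids.each { |puid| counts[CATEGORY_BY_PUID[puid]] += 1 }
  counts
end

summary = category_summary(['fmt/43', 'fmt/43', 'fmt/40', 'fmt/999'])
summary.sort.each { |cat, n| printf("%-14s %d\n", cat, n) }
printf("%-14s %d\n", 'Total', summary.values.sum)
```

The point of the compact report is exactly this kind of reduction: a long file listing collapses to a handful of category counts an archivist can scan at a glance.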


Some time ago we decided to use DROID exclusively as our file-identification software. It works well to identify a broad variety of files, and is constantly being improved. We initially used file identities from FITS, but the particular identity chosen was highly variable: FITS gives a “best” identity based on metadata returned by several utility programs. We wanted consistent identification, as opposed to some files being identified by DROID, some by the “file” utility, and some by JHOVE. We currently obtain the DROID identification by pulling the DROID information out of the FITS XML for each file. This is easy and required very little change to Rubymatica.
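Pulling the DROID identity out of the FITS output might look something like the following sketch. The XML here is a simplified, hand-written stand-in for real FITS output (element names follow the general FITS layout but are not guaranteed to match it exactly), and the function is an illustration rather than the Rubymatica implementation.

```ruby
require 'rexml/document'

# Simplified stand-in for FITS output for one file: two tools report
# identities, and we want only the one reported by DROID.
FITS_XML = <<~XML
  <fits>
    <identification>
      <identity format="Portable Document Format">
        <tool toolname="Jhove"/>
      </identity>
      <identity format="Acrobat PDF 1.4">
        <tool toolname="Droid"/>
        <externalIdentifier type="puid">fmt/18</externalIdentifier>
      </identity>
    </identification>
  </fits>
XML

# Return [format_name, puid] from the identity reported by DROID, or nil.
def droid_identity(xml)
  doc = REXML::Document.new(xml)
  doc.elements.each('fits/identification/identity') do |identity|
    droid = identity.elements.to_a('tool').any? do |t|
      t.attributes['toolname'] =~ /droid/i
    end
    next unless droid
    puid = identity.elements['externalIdentifier']
    return [identity.attributes['format'], puid && puid.text]
  end
  nil
end

p droid_identity(FITS_XML)  # => ["Acrobat PDF 1.4", "fmt/18"]
```

Because FITS already aggregates per-tool results, selecting the DROID identity is just a matter of filtering one branch of the XML, which is why the change to Rubymatica was small.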

PRONOM supports “classifications” via the XML element FormatTypes. However, there are a couple of issues. First, the PRONOM team is focused primarily on building new signatures (file identification configurations) and doesn’t have time for low-priority tasks such as categories. Second, the categories will almost certainly differ somewhat at each institution.

Happily, I was able to create an easy-to-use web page to manage DROID categories. It took only one day to create this handy tool, which is built into Rubymatica. The Rubymatica file listing report now has three sections: 1) an overview using the categories, 2) a list of donor files in the ingest with the PRONOM PUID and human-readable format name, and 3) the full list of all files (technical and donor) in the SIP.

This simple report seems anticlimactic, but processing born digital materials consists of many small details, which collectively can be a huge burden if not properly managed and automated. Adding this category feature to Rubymatica was a pleasant process, largely because the PRONOM data is openly available and delivered in a standard format (XML). My thanks and gratitude to the PRONOM people for their continuing work.

http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

http://droid.sourceforge.net/

As I write this I notice that DROID v6 has just been released! The new version certainly includes a greatly expanded set of signatures (technical data for file identifications). We look forward to exploring all the new features.

Tuesday, 22 February 2011

Arrangement and Description of born-digital archives

For the last two months the Digital Archivists have been trying to define the requirements of a tool to enable archivists to arrange and describe born-digital archives. To do this we have stood back and reviewed the traditional skills and processes, asking whether changes are required or appropriate to accommodate the particular issues surrounding born-digital archives.

The components we identified were as follows:
• Graphical User Interface – needs to be clean and easy to use
• Intellectual Arrangement - must be easy and instinctive for archivists to use
• Appraisal – born-digital archives need to be appraised as much as their paper predecessors
• Rights and Permissions – to enable the management of access to the born-digital archives and also to demonstrate to third-party depositors that the material is safe in your care
• Descriptive Metadata – a term we have been using to relate to description information and to explicitly distinguish this from the technical metadata about each file
• Import/Export functionality – to import/export data with other tools
• Reporting – to provide a range of "views" for managing the digital assets

Through a series of user stories and scenarios we have sought to explain clearly each requirement and how it might relate to other functionality.

This work has been undertaken predominantly in Google Docs, giving us a document that we can all access and edit, with diagrams and screenshots added as necessary. Over the weeks hundreds of comments have been added, and the text has been subjected to a comprehensive review and refinement process by numerous staff across the four partners.

Each institution has now scored and prioritised these features, which, as befits a collaborative initiative like the AIMS project, allows us to identify a core group of features and functionality that we feel will be of greatest use to our institutions and the wider archival community.

With the exception of intellectual arrangement, most of these tasks and processes are not unique to archives, so there is already a body of knowledge and experience in how to approach them. For intellectual arrangement we have to be clear and precise about what we need and what we don't: for example, whether a single intellectual arrangement is sufficient when multiple arrangements would be possible in a digital environment.

Over the next few months we will be refining and reviewing these requirements, very much aware that there are only seven months of the project remaining. We also intend to discuss those aspects we identified as "critical" in future blog postings.

Tell us what tools you use with born-digital archives...

Friday, 19 November 2010

Surprise Use of Forensic Software in Archives

When I first heard of the use of computer forensics in archives, I was excited and wanted to learn how these law-enforcement techniques could help me do a better job of processing digital collections. After learning that people were using computer forensics to copy disk images (i.e. exact copies of disks, bit by bit) and to create comprehensive manifests of the electronic files in collections, I was a bit disappointed, because software engineers have been using the Unix dd command for many years to copy disk images, and there are already tools (e.g. Karen's Directory Printer) available to create comprehensive file manifests. Data recovery is another feature of forensic software that some people consider useful for archivists and researchers. In my opinion, data recovery may be useful for researchers but not for archivists. Without informed written consent from donors, archivists should NOT recover deleted files at all. Also, in some cases a deleted file doesn't appear as one file, but instead as tens or hundreds of file fragments. When most archivists don't do item-level processing of paper collections due to limited resources, I can't imagine archivists performing sub-item-level processing of digital collections. Computer forensics in criminal applications usually looks for particular pieces of evidence; organizing all the files on a disk drive is usually not of interest. In archives, we organize all the files on the disk drives; looking for particular items is not our duty. Computer forensics may be more useful for researchers when they want to look for particular items. All this led me to the conclusion that computer forensics may not be very useful for digital archivists.

However, after attending a 2.5-day training course on AccessData FTK (a computer forensics package), I started to see the potential of using forensic software to process digital archives. I found that the functions (bookmarks, labels) which help investigators organize the evidence they select are equally applicable to organizing a whole collection, and that the functions (pattern and full-text search) used to find particular evidence are equally applicable to searching for restricted materials. I can also use FTK to replace a number of programs I have been using to process digital collections. Although 90% of the training related to cracking passwords, searching deleted files, identifying pornographic images, etc., I found the other 10% of the course worth every cent Stanford spent on it. Of course, the ideal would be a course tailored to the archival community, but unfortunately no such course exists.

Now, I am using AccessData FTK to replace the following software I used in the past to process digital archives.
• Karen's Directory Printer – to create a comprehensive manifest of the electronic files of collections
• QuickView Plus – to view files in obsolete file formats
• Xplorer – to find duplicate files and copy files to folders
• DROID, JHOVE – to extract technical metadata: file formats, checksums, creation/modification/access dates
• Windows Search 4.0 – to perform full-text search on files in certain formats (Word, PDF, ASCII)
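The core of that technical metadata (checksums, sizes, and filesystem dates) is straightforward to compute; the sketch below shows the kind of manifest entry these tools produce. This is a bare illustration using Ruby's standard library, not what FTK or DROID actually record, and the field names are invented.

```ruby
require 'digest'

# Build one manifest entry for a file: checksums plus the filesystem
# dates that matter for accessioning. (Real tools capture much more,
# e.g. format identification; this shows only the basics.)
def manifest_entry(path)
  stat = File.stat(path)
  {
    path:     path,
    size:     stat.size,
    md5:      Digest::MD5.file(path).hexdigest,
    sha1:     Digest::SHA1.file(path).hexdigest,
    modified: stat.mtime,
    accessed: stat.atime,
  }
end

File.write('example.txt', 'hello archives')
entry = manifest_entry('example.txt')
puts entry[:md5]
```

One caveat worth noting: simply opening files on a mounted disk can update their access dates, which is one reason forensic workflows image the disk first and work from the image.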

I am also using the following FTK functions to process digital archives; I have not found another software package that performs them in a user-friendly manner.
• Pattern search – to locate files containing restricted information such as social security numbers, credit card numbers, etc.
• Bookmarks and labels – for arranging files into series/subseries and assigning other administrative and descriptive metadata
• Email header extraction (to, from, subject, date, cc/bcc) – from emails written in different email programs, for preparing correspondence listings
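To make the pattern-search idea concrete, here is a toy sketch of scanning text for restricted information. The regular expressions are deliberately simplistic and invented for illustration; FTK's pattern search is far more robust (real credit-card detection, for instance, would also apply a Luhn check to cut false positives).

```ruby
# Illustrative patterns only -- real forensic pattern search uses
# more careful expressions and validation.
SSN_RE  = /\b\d{3}-\d{2}-\d{4}\b/
CARD_RE = /\b(?:\d{4}[- ]?){3}\d{4}\b/

# Return all apparent SSNs and card numbers found in the text.
def restricted_hits(text)
  { ssn: text.scan(SSN_RE), card: text.scan(CARD_RE) }
end

sample = "Call me. SSN 123-45-6789, card 4111 1111 1111 1111."
p restricted_hits(sample)
# => {:ssn=>["123-45-6789"], :card=>["4111 1111 1111 1111"]}
```

In an archival workflow, hits like these would flag a file for restriction or redaction review rather than for use as evidence, which is exactly the repurposing described above.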

The cost of licensing the software seems high. But if you consider the total cost of learning several "free" programs, the lack of support for such software, and the integrated environment you get from using one package, you may find that commercial forensic software is cheaper than "free" software.

Tuesday, 16 November 2010

Other Highlights from the DLF Fall Forum

A few weeks ago I got the opportunity to attend the Digital Library Federation's Fall Forum in Palo Alto, California. This is the same conference for which Peter previously announced his session on born digital archives. In addition to Peter's session, there were a number of other sessions that were of interest to those working with digital archives.

I attended the working session on curation micro-services, led by Stephen Abrams of the California Digital Library, Delphine Khanna from the University of Pennsylvania, and Katherine Kott from Stanford University. (Patricia Hswe from Pennsylvania State University was supposed to be one of the discussion leaders, but she was unable to attend the Fall Forum.) The micro-services approach is a philosophical and technical methodology for the architecture of digital curation environments. This approach values simplicity and modularity, which allows "minimally sufficient" components to be recombined. Furthermore, the strength of curation micro-services is the relative ease with which they can be redesigned and replaced as necessary. The slides from the first part of the session can be found here.

There was also a reading session at the DLF Fall Forum on "Reimagining METS." The session's discussion revolved around ideas put forth in a white paper distributed in advance of the conference. The majority (if not all) of the METS Editorial Board facilitated the discussion, which was very high level and incredibly interesting. Much of the discussion seemed to proceed from the assumption that METS actually needed to change. The most interesting idea, and one that seemed to get a fair amount of traction, was to consider whether METS should focus on its strength in packaging and cede some of its functionality to other standards that arguably do it better (e.g., OAI-ORE for structure).

On the last day, I went to the workshop on JHOVE2, which is the successor project to the JHOVE characterization framework. JHOVE2 has a notably different architecture and an expanded feature set, which broadens characterization to include other processes: identification, validation, feature extraction, and assessment based on user-defined policies. Additionally, users will be able to define format characterization and validation files for complex digital objects, such as GIS shapefiles. The presenters stated that future development of JHOVE2 will include a GUI to assist in rule-set development. From the standpoint of a digital archivist, this tool will be essential to our further work.

Wednesday, 10 November 2010

Donor survey web form is ready

As part of our on-going work to ingest born-digital materials, we have implemented a donor survey web form. In the evolving AIMS work flow, the donor survey occurs pre-ingest. The data collected in the survey may become part of the Submission Information Package (SIP). This is our first attempt at testing several ideas. We expect to make changes and we invite comments. Our donor survey web site is open to archivists and other people interested in born-digital work flows. See the account request information below.

I realized I could quickly adapt an existing application as a donor survey, if the application were flexible enough. A couple of years ago I created a mini Laboratory Information Management System (LIMS). The programmer can easily modify the fields in the forms, although users cannot add fields ad-hoc. The mini LIMS has its own security and login system, and users are in groups. For the purposes of the AIMS donor survey, one section of the LIMS becomes the “donor info”, another section becomes the “survey”.

Using the list of fields that Liz Gushee, Digital Archivist here at UVa, gathered while working with the other AIMS archivists, I put the donor's name, the archivist's name, and related fields into the “donor” web form. All the survey questions went into the “survey” web form. The survey will support distinct surveys from the same donor over time, to allow multiple accessions.

Our next step will be to map donor survey fields to various standard formats for submission agreements and other types of metadata. While the donor survey data needs to be integrated into the SIP workflow, we haven’t written the transformation and mapping code. We are also lacking a neatly formatted, read-only export of the donor survey data. Our current plan is to use Rubymatica to build a SIP, and that process will include integration of the donor survey data. The eventual product will be one or more Hydra heads. Learn more about Hydra:

https://wiki.duraspace.org/display/hydra/The+Hydra+Project

Everyone is invited to test our beta-release donor survey. Please email Tom twl8n@virginia.edu to request an account. Include the following 3 items in your email:

1) Your name
2) A password that is not used elsewhere for anything important
3) Institutional affiliation or group so I know what group to assign you to, even if your group has only one person.

I'll send you the URL and further instructions.


On the technical side, there were some interesting user interface (UI) and coding issues. Liz suggested comment fields for the survey, and we decided to offer archivists a comment for every question. I used Cascading Style Sheets (CSS) to minimize the size of comment fields so that the web page would not be a visual disaster. If one clicks in a comment, it expands. Click out of it, and it shrinks back down.

The original LIMS never had more than a dozen fields. The donor survey is very long and required some updates to the UI. Using the jQuery function ajax(), I was able to create a submit button that saves the survey without redrawing the web page. CSS was required to keep the save buttons in a fixed location while the survey questions scroll.

The mini LIMS relies on a Perl closure to handle creation of the web forms, and saving of the data to the database. Calling functions from the closure creates the fields. A closure is similar to a class in Java or Ruby, but somewhat more powerful. The data structures from the closure are passed to Perl’s HTML::Template module to build the web pages from HTML templates.
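Since the post compares the Perl closure to a Ruby class, here is a rough Ruby analogue of the idea: the lambdas returned by the builder share the enclosed `fields` variable, much as the Perl closure shares its private state between form creation and rendering. All names here are invented for illustration; the mini LIMS itself is Perl and uses HTML::Template.

```ruby
# Closure-based form builder: add_field and render share the same
# enclosed `fields` array, with no class definition required.
def make_form_builder
  fields = []
  add_field = lambda do |name, label|
    fields << { name: name, label: label }
  end
  render = lambda do
    fields.map do |f|
      "<label>#{f[:label]}</label><input name=\"#{f[:name]}\">"
    end.join("\n")
  end
  [add_field, render]
end

add_field, render = make_form_builder
add_field.call('donor_name', 'Donor name')
add_field.call('survey_date', 'Survey date')
puts render.call
```

Each call to `make_form_builder` gets its own private `fields` state, which is the property that makes the pattern convenient for generating many independent forms.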

The mini LIMS was originally created to use PostgreSQL (aka Postgres), but I’ve converted it to use SQLite. SQLite is a zero-administration database, easily accessed from all major programming languages, and its binary database file is directly portable across operating systems. Perl’s DBI database connectivity is database-agnostic; however, some aspects of the SQL queries are not quite portable. Postgres uses “sequences” for primary keys. Sequences are wonderful, and vastly superior to auto-increment fields. SQLite does not have sequences, so I had to write a bit of code to handle the difference. The calendar date functions also differ between Postgres and SQLite, so once again I had to generalize some of the SQL queries related to date and time.
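The primary-key difference can be hidden behind a single function that emits backend-specific SQL, roughly as sketched below. This is a Ruby illustration of the idea rather than the Perl code in the mini LIMS, and the table and sequence names are invented.

```ruby
# Generate an INSERT statement appropriate to the backend:
# Postgres draws the id from a sequence, while SQLite is left to
# assign the id itself via its rowid/autoincrement behaviour.
def insert_sql(backend, table)
  case backend
  when :postgres
    "INSERT INTO #{table} (id, name) VALUES (nextval('#{table}_id_seq'), ?)"
  when :sqlite
    # Omit the id column so SQLite assigns it.
    "INSERT INTO #{table} (name) VALUES (?)"
  else
    raise ArgumentError, "unknown backend #{backend}"
  end
end

puts insert_sql(:postgres, 'donor')
puts insert_sql(:sqlite, 'donor')
```

The date and time queries can be generalized the same way: route each non-portable SQL fragment through one function per backend, and keep the rest of the application's SQL identical.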

There was one odd problem. Postgres and SQLite are both fully transactional. However, because SQLite is an “embedded” database, it cannot handle database locks as elegantly as client-server databases. Normally this is not a problem, and all databases have some degree of locking. In this instance I got a locking error when preparing a second query in one transaction. Doing a commit after the first query appears to have fixed the problem. I’ve worked with SQLite for several years and never encountered this problem. It could be a bug in the Perl SQLite DBD driver. Using sqlite_use_immediate_transaction did not solve the problem.

The mini LIMS source code is available.