Thursday, 6 October 2011

Day of Digital Archives – some personal reflections

To mark the Day of Digital Archives I thought I would add a personal note about the “journey” I have made in the last two years. It was about this time in 2009 that it was announced that the AIMS Project was being funded by the Andrew W. Mellon Foundation and that I would be seconded from my post as Senior Archivist to that of Digital Archivist for the project.

At the time I had considerable experience of digitisation but very little of digital archives. So I began reading a few texts and following references and links to other sources of information until I had a pile of paper several inches high of things to read. At first there was a huge amount to take in – new acronyms, especially the frightening OAIS, and plenty of projects such as the EU-funded Planets initiative. It seemed that the learning would never stop – there was always another link to follow, another article to read, and it was really difficult to judge how much was making sense.

Talking to colleagues who were already active in this field also revealed how little digital media we actually had at the University of Hull Archives – just over two years ago we literally had a handful of digital media, whilst others were already talking about terabytes of material. Fortunately the AIMS project sought to break the workflow down into four distinct sub-functions and placed emphasis on understanding the process in comparison with ‘traditional’ paper archives, which reduced the sense of being overwhelmed by it all.

Since then I feel I have come a long way – I have attended a large number of events, spoken at a fair few, and quickly became both familiar and comfortable with the language. I do appreciate the time I have been able to dedicate solely to the issue of digital archives, and that many colleagues are embracing this “challenge” without that luxury.

The biggest recommendation I can make is to start having a play with the software – many of the tools that we use at Hull University are free: Karen’s Directory Printer for creating a manifest, including checksums, of records that have been received; FTK Imager for disc images; and so on. Nor do you have to wait for digital archives to arrive, or risk changing key metadata whilst you are experimenting – you can use any series of digital files or old media lurking at the back of a drawer. We have also created a forensic workstation and shared our experiences via this blog.
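
If you would rather script it than use a dedicated tool, the same kind of manifest can be produced with a few lines of Python. The sketch below is purely illustrative (the function and file names are our own invention, and it is not what Karen’s Directory Printer actually does internally): it walks a directory tree and records each file’s relative path, size in bytes and MD5 checksum in a CSV file.

```python
import csv
import hashlib
import os

def build_manifest(root, out_csv):
    """Walk a directory tree and record path, size and MD5 for each file."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "bytes", "md5"])
        for dirpath, _dirs, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                md5 = hashlib.md5()
                with open(path, "rb") as f:
                    # Read in chunks so large files do not exhaust memory
                    for chunk in iter(lambda: f.read(65536), b""):
                        md5.update(chunk)
                writer.writerow([os.path.relpath(path, root),
                                 os.path.getsize(path),
                                 md5.hexdigest()])

# e.g. build_manifest("accession_2011_01", "manifest.csv")
```

The resulting CSV can be sent to the depositor as a receipt, and re-run later to confirm that nothing has changed in the meantime.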

Once we had started to experiment, we created draft workflows and documentation and refined this as we experimented further – all tasks from photography of media to using write-blockers do become less daunting the more frequently you do them. Having learnt from many colleagues we have started to add content to the born-digital archives section of the History Centre website. I have also used some of my own email to play with the MUSE visualisation tool to understand how it might allow us to provide significantly enhanced access to this material in the future.

Although the project funding has now finished and I have returned to my “normal” job, I do think that digital archives have now become part of my normal work: each depositor is now specifically asked about digital archives, and in public tours of the building we explicitly mention the challenges and opportunities of digital archives. We don’t have all of the answers yet – archiving e-mail in particular still scares me – but I don’t feel as daunted as I did two years ago.

Sunday, 4 September 2011

A Tale of Two Conferences

Last week I was fortunate to be part of the AIMS team presenting our work at the SAA Conference in Chicago. Despite the Saturday 8am start of our session and the impending threat of Hurricane Irene, well over 150 delegates turned out to hear our presentation, which included both an introduction to the AIMS framework and reports of our practical experiences through case studies. If you missed it, or want to relive it, the presentations are available online via Slideshare.

On Friday I spoke at the ARA conference in Edinburgh, the theme of which was advocacy. As part of a Data Standards Group session I spoke about the skill set I had acquired during my change of role from archivist to digital archivist as a result of the AIMS project.

Although the two presentations were different in content and context, they both carried the same message – an attempt to break down the perceptions and myths surrounding born-digital archives. In talking about skills in Edinburgh I sought to highlight the relevance of traditional archival skills in the digital age and to encourage more individuals to do something.

It also raised a question – one that arose at the AIMS unconference in Charlottesville and the UK workshop in London – of when digital archives will become “the norm”. We don’t know the exact answer, but I do know it is necessary if we are to successfully manage the challenges of born-digital archives and strive to meet the increasing expectations of our users.

Friday also marked the end of a six month contract during which Nicola Herbert has helped us with the practical elements of digital preservation at Hull. I would like to thank Nicola for her hard work and direct users to her guest blogs on photography of media and write-blockers.

Friday, 2 September 2011

AIMS@SAA Part Two: SAA Session 502

SESSION 502 - Born-Digital Archives in Collecting Repositories: Turning Challenges into Byte-Size Opportunities
SAA 2011
Chicago, IL 
Aug 27, 2011

As the endnote to their foray into the SAA 2011 Annual Meeting, the AIMS Digital Archivists delivered a presentation on the AIMS project on Saturday morning. Although we were competing with Hurricane Irene’s effect on travel schedules, an 8 a.m. Saturday timeslot, and presentations from our colleagues Michelle Light, Dawn Schmitz and John Novak on delivering born-digital materials online, as well as the Grateful Dead Archivist and a member of the band Phish, attendance was pretty darn good! We were pleased to be able to speak with some colleagues after the session and facilitate a few discussions during the question and answer portion of the session.

The presentation itself gave a brief overview of the project and then focused on the AIMS framework, or the four areas we’ve identified as key functions of stewardship for born-digital materials: Collection Development, Accessioning, Arrangement and Description, and Discovery and Access.

We’re very happy to share our slides here through Slideshare. Remember, this is just a taste of what’s to come in the white paper this fall, so keep checking the blog for updates!

Slides are posted here after the jump! 

AIMS@SAA Part One: CREW Workshop

CREW: Collecting Repositories and E-Records Workshop
SAA 2011
Chicago, IL 8/23/2011

The AIMS partners hosted a workshop in the run-up to the 2011 SAA Annual Meeting in August. 45 participants from the US and Canada joined us in exploring the challenges, opportunities and strategies for managing born-digital records in collecting repositories.

The workshop was organized around the 4 main functions of stewardship that the AIMS project has focused on: Collection Development, Accessioning, Arrangement and Description, and Discovery and Access. In addition to the AIMS crew (no pun intended) presenting on the research done through the AIMS project, several guest presenters showcased case studies from their own hands-on approaches to managing born-digital materials. Seth Shaw, from Duke University, discussed the evolution of electronic record accessioning at Duke and his development of the Duke Data Accessioner. Gabriela Redwine discussed work done in arrangement and description at the Harry Ransom Center at the University of Texas at Austin. Finally, Erin O’Meara showcased work done at the University of North Carolina at Chapel Hill to facilitate access to born-digital records through finding aid interfaces.

In between presentations the participants engaged in lively discussions around provocative questions and hypothetical scenarios. At the end of the event, the AIMS partners felt they had gained just as much from the day’s activities as they hoped the participants had. Ideas that were discussed and case study examples will help strengthen the findings of the white paper due out this fall.

See the workshop presentations after the jump! 

Tuesday, 30 August 2011

Forensic Workstation pt3

A guest posting from Nicola Herbert, Digital Project Preservation Assistant at Hull University Archives

Once we had the forensic workstation up and running (see part 1 and part 2 in this on-going series) we installed MS Office and Mozilla Thunderbird (for working with Outlook .pst files). We also installed FTK Imager, Karen’s Directory Printer, DROID and the MUSE e-mail visualisation tool (in beta, but provides a very interesting perspective on the data). We are also planning to purchase Quickview Plus, a piece of software that enables viewing a range of file formats without requiring the original software on your PC.

We had already played around with these tools on our normal PCs and had run them on files copied from digital media prior to setting up the workstation.

Having received our two Tableau write-blockers we were eager to combine the separate processes we had developed into an integrated workflow. We have two write-blockers, one for USB devices (T8-R2) and one for internal hard drives from PCs and laptops (T35es). Simon’s visit to Jeremy John at the British Library had whetted our appetite for getting our mini digital forensics lab in operation.

USB devices
After a thorough read-through of the instructions we tested the USB write-blocker first. Setting it up is relatively simple; the vital thing is to make the connections – device to write-blocker, write-blocker to forensic PC – before switching on power to the write-blocker. The forensic workstation then recognises the USB device as normal, and off you go.

We then run FTK Imager to create a logical image of the device. We tested the various formats and settings available and eventually decided that creating true forensic images would raise too many trust issues with potential depositors, given that we would be able to restore deleted files. For this reason we will create ‘Folder contents only’ images, which recreate the device as it would appear in normal use. From here we are exploring our options for exporting the files from the disk image, but we have found that the exported files display an altered Accessed date – any comments or suggestions on this issue would be gratefully received.
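
One workaround we have considered – sketched below as our own illustration, not a feature of FTK Imager – is to record the original timestamps before any copy operation that might alter them, and reapply them to the exported files afterwards. The sketch assumes the files are reachable on an ordinary filesystem; timestamps inside a forensic image would instead have to come from the imaging tool's own metadata.

```python
import os

def capture_times(path):
    """Record (accessed, modified) times, in seconds since the epoch."""
    st = os.stat(path)
    return (st.st_atime, st.st_mtime)

def restore_times(path, times):
    """Reapply recorded (accessed, modified) times to an exported copy."""
    os.utime(path, times)
```

Note that merely opening a file can itself update its Accessed date – which is exactly why write-blockers matter during capture.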

We also create directory listings of the contents with MD5 and SHA-1 checksums. From the disk image and directory listing we can start to consider the arrangement for the collection, using Quickview Plus to preview file contents.
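
Because reading a file twice – once per algorithm – doubles the time spent on large media, both digests can be computed in a single pass. A minimal sketch (the function name is our own, purely illustrative):

```python
import hashlib

def md5_and_sha1(path, chunk_size=65536):
    """Compute MD5 and SHA-1 digests of a file in one read."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)  # each chunk feeds both digests
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```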

Our second write-blocker can be used with IDE and SATA hard drives...but more of this in part 4!

Monday, 22 August 2011

Forensic Workstation pt2

When we moved from the University campus to our new joint facilities with Hull City Archives and the Local Studies Library we took the opportunity to upgrade many of our PCs – leaving a few older specimens “just in case” anybody was so desperate that they were willing to accept a machine that was reluctant to start-up!

Recently the library has been re-organising its stock and space-utilisation ahead of a major refurbishment. Our old PC was discovered in the basement and ear-marked for disposal (well, recycling really, but disposal is less ambiguous). It was at this point, and with a new-found digital archives perspective, that I realised the potential of this machine to become our first digital forensics workstation. With an internal 3.5” floppy drive, a CD drive and two USB ports, it seemed to promise possibilities for dealing with a range of media as well as the chance to transfer the files once they had been extracted. The PC, with its slightly grubby keyboard and monitor, was shipped to its new home at the History Centre.

I had, by this time, started to identify requirements for a new PC to act as a workstation for the capture of hard drives and other large volumes of material. This request intrigued a colleague, Tom, in ICT, and a visit was duly arranged. Tom was really interested in our work and offered to help. He took our PC and returned it a few days later with a clean version of the Windows XP image installed as well as an internal zip drive added.

Tom has also promised to put aside a couple of internal 3.5” floppy drives as an insurance policy against drive failure, as Jeremy Leighton John at the British Library had reported mixed results when using external USB floppy drives. Having two workstations, one old and one new, gives us options for dealing with a range of media formats, including a USB drive for 3.5" floppy disks and an external 250MB zip drive. The latter was found when clearing out an old cupboard and came with all its cables and even its original installation CD – proof that assembling a forensic workstation does not have to cost a fortune; I have heard several tales of kit assembled via eBay purchases.

Tuesday, 16 August 2011


Today's post is just a brief announcement...The AIMS team will be taking part in two events at next week's Society of American Archivists Annual Meeting. The first is a workshop we've developed to provide an opportunity for archivists and technologists to discuss issues related to collection development, accessioning, appraisal, arrangement and description, and discovery and access of born-digital materials. Unfortunately, space issues have required us to limit registration and it is now full. However, we promise to post a longer recap to this blog after the event.

No such limitations exist for our other SAA event, a presentation entitled Born-Digital Archives in Collecting Repositories: Turning Challenges into Byte-Size Opportunities, which will be given August 27th at 8 a.m. At this presentation the AIMS Digital Archivists will describe a bit of the high-level framework being developed by the AIMS project to characterize archival workflows for born-digital materials in archival repositories.

We hope to see you there!

Friday, 12 August 2011

Digital Forensics for Digital Archivists

I’ve been very fortunate here at UVa to have at my disposal some wonderful resources for getting up to speed with born-digital theory and practice. First and foremost, UVa is home to Rare Book School, which has offered a course on Born Digital Materials for the past two years (and, I’ve just learned, will offer it again in 2012). I was able to take this course in July along with 11 fellow classmates from around the country. A week and a half later I was then off to the headquarters of Digital Intelligence, Inc., makers of our Forensic Recovery of Evidence Device (FRED), for Computer Forensics with FRED. This was a two-day course covering basic digital forensic skills as well as the FRED system.

Mulder and Scully are concerned about the viability of this forensic evidence gathered next to UVa's FRED...

Given my great bounty, and my belief in professional karma, I’ve decided to give a brief overview of both of these classes here on the blog followed by my thoughts on a potential Digital Forensics for Archivists class/workshop that I’d really like to see developed, by myself or whomever! Two major classes out there that I have not taken are the DigCCurr Professional Institute and SAA’s electronic records workshop. Anyone with experiences in those classes, please add your comparisons in the comments.

RBS L95 — Born Digital Materials: Theory and Practice

Overall, I’d say this class has the perfect name: there’s an almost equal amount of theory and practice. That may sound like faint praise, but it’s really not. It’s something that too few workshops or classes get right. Instructors Naomi Nelson and Matt Kirschenbaum deserve much credit for a well constructed week that built practice on top of theory.

For someone new to the field of the born-digital it’s a great foundation. Concepts like metadata, preservation, “the cloud,” essential characteristics, physicality/materiality and digital humanities are combined with real-life examples from libraries, archives, and the university. This overview allowed us to attack the fundamental question of the class: what should we be trying to accomplish when we attempt to “save” (or steward, curate, safeguard, preserve, “archive”) born-digital materials?

On the practical side of things, digital forensics is covered and students get the opportunity to do a few lab exercises with emulators, floppy drives, and older models of equipment. The syllabus and reading list provide an excellent bibliography for further research.

It’s a relatively high-level class and therefore a great way to get started, or a great way to get administrators thinking intelligently about the issues they need to face. I think that a more practitioner-focused and thorough digital forensics curriculum in the archives or cultural heritage setting could complement the course very nicely.

Computer Forensics with FRED training

University of Virginia decided to invest in the FRED technology last year and has not regretted it. While the FRED can do lots of neat things, I feel it is important to note that many or all of the same things can be done with other hardware and software; it just takes a bit more persistence. Similarly, despite the name, a lot of this course dealt with basic data and file system concepts, as well as a little about some of the specific hardware most commonly found. In the future, DI is going to split this up into two classes: Digital Forensic Essentials and Digital Forensics with FRED. The first is a two-day course covering the hardware, data, and file system material. The second is a one-day class covering the specifics of FRED. Although the first class will be more expensive than the current combined class, it would be of more interest to those in the archival world.

As the course is geared towards law enforcement, a lot of time was spent on detecting deleted, fraudulent, or hidden material. While all the cops in the room thought this would be of no use to me, I disagreed. I need to know what I am collecting (whether inadvertently transferred or not), whether it is authentic, and how to communicate with donors to decide how to deal with it. In addition, if we can get donors to agree to let us transfer backup or deleted versions of manuscripts, we’ll gain a wealth of information about how the final version evolved. Knowing that such recovery is possible is one of the more glamorous promises of digital forensics.

We also learned how to create and navigate disk images. While some of this stuff was fairly easy for me to pick up beforehand from Peter Chan’s tutorials, the extra practice and insight was very useful.

Digital Forensics for Archivists

Based on my experiences in these two classes, I would propose a Digital Forensics for Archivists workshop geared specifically towards those interested in incorporating forensic techniques into the capture and processing of digital materials. The outline of topics I would expect to see on the syllabus, below, is probably a bit ambitious for a one-day workshop and would certainly have some hurdles to overcome in provisioning hardware for all. However, these are the areas I’ve come to think of as necessary for an archive to be prepared for the variety of media we will be collecting for the foreseeable future.

Digital Forensics for Archivists

  • Hardware basics

    • IDE, SCSI, SATA, USB, FireWire
    • Floppy drives
    • Optical disks
    • Hard drives
    • Internal basics (motherboard, PCI, power, etc.)

  • Operating Systems

    • DOS
    • Windows
    • Mac OS
    • Linux

  • File system basics

    • FAT
    • NTFS
    • HPFS

  • Forensic vs. logical copying

    • What happens to deleted data
    • How it can be recovered
    • Why you need to know…

  • Write blocking

    • How to achieve it

  • Image files

    • Types
    • Software
    • Uses

  • Emulation and Migration

    • Cost/benefit of each
    • Possible use cases for each

So what do you think? Pipe dream? Useful? Impractical? Let me know in the comments…

Monday, 25 July 2011

Forensic workstation pt 1

A key part of dealing with born-digital archives is the ability to receive and process material without making changes to the underlying metadata – date created, date accessed, etc. – data that researchers will be looking to use and rely on. As archivists we place considerable emphasis on our role as custodians, and with digital material it is important that we treat the material carefully and appropriately. Fortunately there are tools that help us with the authenticity of born-digital files, the most obvious of which is the checksum.

An important legacy of the AIMS project for us at Hull is working towards our ability to take born-digital material from depositors as a normal part of our work. A key component of this is a forensic workstation – by which I mean a PC (or two) through which material can be safely captured following a clear process, in effect replicating the isolation room for receiving paper material. This will allow us to undertake a forensic examination – to check the material is what we expected or agreed to take, including the ability to generate a manifest of the material to send to the depositor, and that it does not include viruses etc.

There seem to be two main routes. The first is to purchase a FRED, which stands for Forensic Recovery of Evidence Device (other digital forensic workstation solutions are available). The second, more organic solution – and the one we intend to adopt at Hull – is to start with a new PC and add appropriate hardware and software to provide the equivalent functionality. At the moment we are pondering a name for this machine, with current suggestions including:
- Hal - Hull Archives Laboratory
- Harold – Hull Archives Recovery Of Legacy Data
- Hilary - Hull Investigator for Library and Archives RecoverY
- Dawn – Digital Archives WorkstatioN
but we are open to other suggestions until the machine is installed and formally named!

We don’t want to become a computer museum with an extensive range of hardware, software and operating system environments for every possible eventuality. We do want a core ability to handle material we reasonably expect to receive – including material on 3.5” floppy disks, zip disks, hard drives etc. We intend to develop and extend our capacity as need dictates – if we receive material in a format we cannot handle, we will consider whether we need to support it ourselves or whether a suitable third party is more appropriate.

Central to this is the need for write-blockers, which prevent you from writing to or updating the files. Having read countless websites I felt I knew what they were supposed to do, but had a nagging doubt that my knowledge was incomplete.

A tour of the British Library eMss Labs courtesy of Jeremy Leighton John (as featured on the BBC Radio 4 programme 'Tales from the Digital Archives', broadcast in May but still available online) confirmed both the simplicity of the theory and the fragility of the media – just having the hardware isn’t enough; you also need some luck that you have the correct drivers to read the specific version of the media. In the next few weeks I hope to place our order for the various bits and pieces and will update you on this exciting journey!

Monday, 4 July 2011

Curator's Workbench workshop

I was fortunate enough to attend the Curator’s Workbench workshop at the British Library last week. It was a chance to see, play with and discuss the tool with its developers, Greg Jansen and Erin O’Meara from the University of North Carolina. The tool is designed to aid the accession, arrangement, description and staging of material prior to ingest into a digital repository. Essentially, it has an interface designed so that archivists can use it.

The session featured a walk-through and the chance to have a play, with experts on hand if you had a problem – only necessary because we had the latest ‘unstable’ release, including the latest enhancements to functionality and GUI. Stable versions are available for download via GitHub. I am especially smitten with the crosswalk feature, which provides a drag’n’drop interface for mapping metadata to METS. There is also the date recogniser, which allows you to map date formats to the ISO standard, though there could be issues if the data is in a variety of formats – e.g. a bare “1984” would be transformed to 1984-01-01.
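
The behaviour of such a date recogniser can be sketched in a few lines. The patterns below are hypothetical – not those used by Curator’s Workbench – but they show how a bare year silently gains a month and day during normalisation:

```python
from datetime import datetime

# Illustrative patterns only; a real recogniser would be configurable.
PATTERNS = ["%d/%m/%Y", "%Y-%m-%d", "%B %Y", "%Y"]

def to_iso(value):
    """Normalise a date string to ISO 8601 (YYYY-MM-DD)."""
    for pattern in PATTERNS:
        try:
            # strptime fills missing month/day with 1 - spurious precision
            return datetime.strptime(value.strip(), pattern).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError("unrecognised date: %r" % value)

# to_iso("1984") returns "1984-01-01" - precision the source never had
```

The order of the patterns matters too: an ambiguous string like "06/10/2011" will be read by whichever pattern matches first.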

It takes a different view of where arrangement and description occur in the workflow from that intended for Hypatia in the AIMS workflow, but it does raise some interesting questions that I hope to explore in more detail over the next few months.

It was also interesting to hear about the features and functionality on their wish-list, including disc images, multiple users, recording processing notes, PREMIS – and so the list goes on!

The discussion that followed was really enlightening, as it highlighted the different approaches that archives are currently adopting to the preservation of born-digital archives.

I picked up some useful pointers to software and tools I haven’t used before – Bulk Extractor and Google Refine – and came away determined to throw more material at Curator’s Workbench, to join the users’ discussion list (done), and to figure out some of the aspects we have avoided so far, things like PREMIS and METS!

Tuesday, 21 June 2011

Photographing the digital: creating images of Hull University Archives’ digital media

A guest posting from Nicola Herbert, Digital Project Preservation Assistant at Hull University Archives

Over the last few months I have been working with the AIMS team at Hull University. My role entails getting stuck into some practical processing of the born-digital collections in the Hull University Archives as well as planning aspects of digital preservation. A lot of our work so far has been to discover and document the material that we already hold in what we thought were purely paper collections and I have written a workflow for the discovery of these items and their preparation for ingest into Fedora. As part of this workflow we decided to photograph all of the removable media we currently have and create a process for photography of new deposits when they arrive.

Why bother?
By retaining photographs of the original media alongside the content we will be able to provide an image of the appearance of the original media to researchers if they request it. For the foreseeable future we are storing the image files on a shared drive, but they will eventually be stored as an element of metadata with the digital files in our Fedora Repository. We will be dealing with large numbers of media items, so we need to ensure consistency in the way the media are photographed and the information recorded from those images.

Having not previously numbered the discs, we decided on a simple running number within each accession. Despite our familiarity with labelling paper material, it seemed more complicated with digital media. Our conservator advised against sticking labels (even conservation grade) onto the plastic casing of a floppy or Amstrad disc. Though a specialist CD marker can be used to label CDs, we were reluctant to permanently mark the items. After a worryingly long thought process we decided to stick to the old faithful method of writing in pencil on the existing label or case.

I then started planning the process. Despite trying to anticipate the different elements of information to include for each media type, it was only trial runs photographing actual media that gave the full picture - i.e. that Amstrad discs have three aspects to photograph (Side A, Side B and the edge). Lots of seemingly trivial questions arose - like whether to photograph the case or whether to photograph a label if blank. Getting the process right from the start will save time in the long run.

We decided to create a ‘clapperboard’ to photograph with the items for a failsafe way to ensure easy identification. I decided on a reusable form printed on a transparency which we can label with a drywipe marker. Putting theory into practice needed several trial runs; after each one I adapted the form and the procedure.

In addition I wrote up detailed notes describing the procedure for each type of media we anticipate encountering. We worked out a sensible image quality, so as to ensure legibility of the labels without clogging up our servers with unnecessarily large images. Once the photographs have been taken they are renamed and filed. We also maintain an inventory of the items and record the media and label information alongside it. This ensures that if we send items (like our Amstrad discs) away to a third party we can match them to our records when they return.
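
A renaming-and-inventory step like this is easy to script. The sketch below uses a purely illustrative naming scheme – an accession reference plus a zero-padded running number, which is not necessarily the scheme we use – and appends a row per photograph to a CSV inventory:

```python
import csv
import os

def file_photos(photo_dir, accession, inventory_csv):
    """Rename media photographs to <accession>-<number>-<original name>
    and log old and new names in a CSV inventory (illustrative scheme)."""
    with open(inventory_csv, "a", newline="") as fh:
        writer = csv.writer(fh)
        # listdir is snapshotted once, so renaming while looping is safe
        for number, name in enumerate(sorted(os.listdir(photo_dir)), start=1):
            new_name = "%s-%03d-%s" % (accession, number, name)
            os.rename(os.path.join(photo_dir, name),
                      os.path.join(photo_dir, new_name))
            writer.writerow([accession, number, name, new_name])

# e.g. file_photos("photos/U-DX123", "U-DX123", "inventory.csv")
```

Keeping the original filename inside the new one means the clapperboard in the image and the inventory row can always be cross-checked.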

This process has been satisfying to complete and enables us to tick at least one thing off our to-do list. Anyone can complete this part of the process – even for material stored on a shared drive, photography of the original media is a useful step.

Wednesday, 18 May 2011

AIMS: the UnConference

Not two full weeks into my new job as Digital Archivist at UVa on the AIMS grant, I rolled up my sleeves to facilitate and host an unconference with my fellow Digital Archivists. Our unconference would be two full days of discussions, demonstrations, lightning talks, and networking with digital archivists from around the globe. At first the thought was a little terrifying – I’m not even fully sure I know what this job is yet, how could I actually lead discussions on the salient topics? But my fears were baseless: all the unconference attendees were thoughtful, articulate, and lively participants. I learned much more from them than they probably did from me.

The unconference was held on the 13th and 14th of May at the Omni Hotel in Charlottesville. The 27 participants represented libraries, archives, museums, and digital humanities centers across the US, Canada, and the United Kingdom. Despite the differences in our institutions, backgrounds, and training, we learned that we not only shared similar challenges, but also the same hopes for collaboration and innovation.

The first day started off with a round of lightning talks. Each participant had 5 minutes to present a topic, project, problem or idea that they were interested in talking about. The variety in the talks was remarkable to me, traversing the breadth and depth of all that can be thought of as “born-digital” and the many processes involved in managing it. The lightning talks were also a great way to get an introduction to each participant, as well as their perspective and the particular issues they were dealing with in their institution. A brief outline of each of the talks is available on the AIMS Unconference Wiki.

Thursday, 5 May 2011

Workshop on "Using FTK Imager and AccessData FTK to Capture and Process Born Digital Materials"

On April 22, I conducted a 2-hour workshop on "Using FTK Imager and AccessData FTK to Capture and Process Born Digital Materials.” The purpose of the workshop was to give staff a hands-on experience in using FTK Imager and AccessData FTK. Eight colleagues from the Stanford University Libraries attended the workshop – primarily from Special Collections and University Archives and the Humanities and Social Sciences Group.

The workshop covered the following:

FTK Imager – how to:
1. Download and install the software (free software).
2. Create a forensic image of a USB flash drive.
3. Create a logical image of the same flash drive.

AccessData FTK – how to:
1. Load an image – for this workshop we used a sampling from the Stephen Jay Gould papers.
2. View technical metadata generated by the software.
3. Arrange column settings to see specific file attributes (e.g. duplicate files).
4. Search for social security numbers using pattern search.
5. Test the full-text search function.
6. Flag files containing sensitive information (such as social security numbers) with the "privileged" tag.
7. Use the bookmark feature for hierarchical information and apply it to groups of files (e.g. series, subseries, etc.)
8. Label groups of files with user-defined labels (e.g. a controlled vocabulary for computer storage media or document type, as suggested in the workshop, or subject headings, access rights, etc.)
9. View files with specific bookmarks and labels.
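The pattern search for social security numbers (step 4) is, at heart, a regular-expression scan across file contents. As a rough illustration of the idea, and not of FTK's actual implementation, a minimal sketch in Python (the pattern here is deliberately simple; real forensic tools apply stricter validation rules):

```python
import re

# An SSN-like pattern: three digits, two digits, four digits,
# separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_candidates(text):
    """Return SSN-like strings found in a block of text."""
    return SSN_PATTERN.findall(text)
```

Candidate matches would still need human review before files are flagged as privileged, since number-like strings (invoice numbers, part codes) can produce false positives.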

Many incoming collections are hybrid collections – containing both analog and digital material. The digital component will become even greater as we move forward. Empowering all archivists to use a tool such as AccessData FTK to process the digital materials would be very useful.

Friday, 8 April 2011

Data Management Planning?

Guest blogger: Andrew Sallans

Following Tom's generous invitation to write a post for the AIMS partner blog, I am finally getting around to doing so. Tom and I have been holding monthly discussions about our respective projects since sometime last summer, and have talked at great length about the commonalities between what my group (the Scientific Data Consulting Group) is dealing with in regards to research data management versus what the AIMS group is dealing with in terms of born-digital archive material.

We have found that there are many areas of similarity, and that we face many of the same challenges, although we approach the problem quite differently and of course have entirely different terminology given our relative perspectives.

To get started, I have a pretty good understanding of the born-digital problem set, but have not been keeping detailed notes on the workflows and solutions that the AIMS group has identified as best practices throughout the life of this project. My intention for this post is to share the issues that we are dealing with in research data management and try to make some suggestions about areas where there may be overlap and opportunities for greater information sharing and collaboration.

Starting this past January 18, 2011, the National Science Foundation (NSF) put into effect a new implementation of its pre-existing data management planning requirement. This revision requires that researchers submit a 2-page data management plan (DMP) specifying the steps they will take to share the data that underlies their published results. The DMP will undergo formal peer review and must be reported on in interim/final reports and in all future proposals. In effect, what one says must then be done, or else one risks losing future funding opportunities or, worse, losing all funding for the institution from that particular agency. Although this requirement is focused on data sharing, it isn't possible for such an initiative to succeed without first addressing a mass of other data management issues, ranging from technical, to policy, to cultural. As we often point out these days, it is far easier to improve the process of data management up-front, in the operational phase, than it is to begin thinking about how to share the data at the end of the project. I would expect that those attacking the born-digital archive problem can fully relate.

Here in the Scientific Data Consulting (SciDaC) Group in the UVA Library, we have been collecting and developing our local set of data management best practices for some time and have served as advisors to researchers in both the research data management and DMP development areas (they are of course interrelated, but sometimes have different levels of urgency). In doing so, we have developed what we call a "data interview/assessment" (based in large part on the visionary work of others, such as Purdue's Data Curation Profile and work from the UK's Digital Curation Centre), which is a series of questions that address many different areas of data management, including context, technical specifications (formats, file types, sizes, software, etc.), policies, opinions, and needs. We meet with researchers to have a conversation, educate them on emerging trends and regulations in data sharing, and listen to their concerns and challenges. In the end, we try to make recommendations on how they can improve their data management processes, and then we offer to connect them with people who can help with the specific details (if it isn't us). For the DMPs, we have a series of templates that are specifically configured for the respective program requirements. Again in this case, we do some education, then offer some feedback and advice on what qualifies as good data management decisions for a particular community. Behind all of these efforts, we know we don't have all the answers, but we do know most of the questions to ask and who we need to pull together to figure out the solutions. That's our basic operating principle.

So, sound a bit familiar? Based on conversations with Tom, and reading some of the posts in the AIMS blog myself, it sounds like we are up against some very similar challenges in regards to the front-end of the issue, around education, conducting inventories and assessments, and figuring out how to manage processes before it comes down to managing the information itself and providing access to it for others. Appraisal and selection are incredibly important to us, but are usually driven more by the type of data. As an example, reproducible data generated by a big machine might not be important to keep, but the instructions and context in which it is generated would be invaluable. On the other hand, data from natural observations (e.g. climate data) would be critical to save. These considerations are not always apparent to researchers, as they often think within the context of their own work, rather than that of others. I would expect that the back-end is even more similar, as we are all ultimately dealing with bits and bytes, formats, standards, and figuring out how to decide what to keep and how to do it.

Lastly, for now, I would also like to mention that I had the opportunity to attend the annual Duke-Dartmouth Advisory Council meeting at the Fall CNI Forum several months ago.

As you'll read, this project aims to bring together stakeholders from all areas of digital information across the institution, to talk about and plan in a collaborative and strategic way. They aim to tackle the challenges of management, technology, policy, and, hardest of all, culture. I was incredibly impressed by the vision of this undertaking, and hope that we can continue to refine our efforts at developing a collaborative digital information management strategy as well. In practical terms, we all need to try and be attentive to how our effort plugs in with others around the institution. The issue of digital information management is undoubtedly a very big one, and requires coordination and collaboration across many experts in order to appropriately treat the various bits that we encounter. Doing so will hopefully also provide us with the ability to bring best practices from one challenge to another.


Andrew is currently the Head of Strategic Data Initiatives and the Scientific Data Consulting Group at the UVA Library.

Contact info: Andrew Sallans, Email:, Twitter: asallans

Friday, 1 April 2011

Digital Collaboration Colloquium

On Tuesday I attended the Digital Collaboration Colloquium event in Sheffield organised to mark the end of the White Rose Libraries LIFE-SHARE Project.

The day included a number of talks about how institutions can collaborate including an interesting account of the Wales Higher Education Libraries Forum (WHELF) and experiences from the Victoria & Albert Museum. Although the majority of examples focussed on digitisation the principles and lessons learnt were all equally appropriate to a born-digital context.

As part of the day I presented a Pecha Kucha session on the AIMS project and some of the digital collaboration tools that we have found to be effective, including Skype and GoogleDocs. If you are not familiar with this format, it involves a presentation of 20 slides that change automatically every 20 seconds, and despite cutting the content quite heavily I still found myself racing to keep up with the changes. Other sessions looked at digitisation in situ in a public setting (bringing behind-the-scenes work in front of the curtain), the knitting patterns project at Southampton, the Addressing History project based at EDINA and the Yorkshire Playbills project.

The afternoon included a presentation from our hosts on the LIFE-SHARE project and their experiences of the collaboration continuum, followed by a roundtable session that led to a good discussion between panel and audience. With a lot covered in a relaxed and friendly atmosphere, there was plenty of networking and I’m sure everybody took something from the day.

The presentations are available via SlideShare.

Friday, 18 March 2011

Personal Digital Archiving Conference 2011

I had the good fortune to attend the 2011 Personal Digital Archiving conference at the Internet Archive, along with other colleagues on the AIMS project, including Michael Forstrom from Yale and Michael Olson, Peter Chan, and Glynn Edwards from Stanford. The conference was exceptional, and had a great range of presentations ranging from those on fairly pragmatic topics to the highly theoretical. There are a number of other blogs with comprehensive notes on the conference, and the conference's organizers have already provided a detailed listing of those. Instead, I'd just like to focus on what I considered the highlights of the conference.
  • Cathy Marshall's keynote was excellent. I have seen her speak before, and she presented a survey of her ongoing research into personal digital archives.
  • Jeremy Leighton John presented on work undertaken since the Digital Lives project at the British Library.
  • Judith Zissman presented on "agile archiving", similar to agile development, wherein individuals can continually refine their archival practices.
  • Birkin Diana presented on how Brown University is working to make their institutional repository a space for personal materials, and strategies that allow users to work on adding metadata iteratively.
  • Daniel Reetz presented on his DIY Book Scanner project, but also brought in detailed technical analysis about how image sensors in digital cameras work and how our brains process image data.
  • Jason Zalinger introduced the notion of Gmail as a "story-world" and presented some prototype tools and games to help navigate that world.
  • Cal Lee presented on introducing education about digital forensics to the archival curriculum.
  • Kam Woods also presented on applying digital forensics to the archival profession.
  • Sam Meister presented on the complex ethics of using forensics in acquiring and processing the records from start-up companies.
In addition, I presented with Amelia Abreu on "archival sensemaking", which introduces the notion of personal digital archiving practice as an iterative, context-bound process.

Saturday, 12 March 2011

Processing Born Digital Materials Using AccessData FTK

To follow up on my previous blog entry, "Surprise Use of Forensic Software in Archives", I have prepared a YouTube video, "Processing Born Digital Materials Using AccessData FTK". I hope this video can give people more details on how FTK is being used at Stanford University Libraries. Take a look and let me know what you think.

I would like to say a few words on discovery and access even though it is not the topic of the video. After we process the files in FTK, one way to deliver the files is to store them in a Fedora repository and let people access the repository using a web browser over the Internet. We have developed an alpha version of this model using files from the Stephen Jay Gould collection. Another way to provide access to the files is to let people use FTK to access the files in our reading room. I will write about that later.

Hope you enjoy the video.

Friday, 4 March 2011

File type categories with PRONOM and DROID

In order to assess a born digital accession, the AIMS digital archivists expressed a need for a report on the count of files grouped by type. The compact listing gives the archivist an overview that is difficult to visualize from a long listing. The category report supplements the full list of all files, and helps with a quick assessment after creation of a SIP via Rubymatica. (In a later post I’ll point out some reasons why pre-SIP assessment is often not practical with born digital.)

At the moment we have six categories. Below is a small example ingest:

Category summary for accession ingested files:
• moving image: 1
• still image: 26

Some time ago we decided to exclusively use DROID as our file identification software. It works well to identify a broad variety of files, and is constantly being improved. We were initially using file identities from FITS, but the particular identity chosen was highly variable. FITS gives a “best” identity based on metadata returned by several utility programs. We wanted a consistent identification as opposed to some files being identified by DROID, some by the “file utility” and some by Jhove. We are currently using the DROID identification by pulling the DROID information out of the FITS XML for each file. This is easy and required very little change to Rubymatica.
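That extraction-and-grouping step can be sketched as follows. This is an illustration, not the Rubymatica code itself (which is Ruby): the element names follow typical FITS output, namespace handling is done loosely, and the PUID-to-category table is a hypothetical local mapping that each institution would maintain for itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical local mapping from PRONOM PUIDs to categories;
# the PUIDs shown are illustrative examples.
PUID_CATEGORIES = {
    "fmt/44": "still image",   # a JPEG variant
    "fmt/5": "moving image",   # AVI
}

def droid_puid(fits_xml):
    """Pull the DROID-assigned PRONOM PUID out of one FITS record."""
    root = ET.fromstring(fits_xml)
    for elem in root.iter():
        # endswith() sidesteps the FITS XML namespace prefix.
        if elem.tag.endswith("externalIdentifier") and elem.get("type") == "puid":
            return elem.text
    return None

def category_summary(fits_records):
    """Count files per category across a batch of FITS records."""
    counts = Counter()
    for xml in fits_records:
        counts[PUID_CATEGORIES.get(droid_puid(xml), "other")] += 1
    return dict(counts)
```

Running `category_summary` over an ingest's FITS records yields exactly the kind of compact overview shown above, with unmapped PUIDs falling into an "other" bucket for later review.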

PRONOM has the ability to have “classifications” via the XML element FormatTypes. However, there are a couple of issues. The first problem is that the PRONOM team is focused primarily on building new signatures (file identification configurations) and doesn’t have time to focus on low priority tasks such as categories. Second, the categories will almost certainly be somewhat different at each institution.

Happily I was able to create an easy-to-use web page to manage DROID categories. It only took one day to create this handy tool, and it is built into Rubymatica. The Rubymatica file listing report now has three sections: 1) an overview using the categories, 2) a list of donor files in the ingest with the PRONOM PUID and a human-readable format name, and 3) the full list of all files (technical and donor) in the SIP.

This simple report seems anticlimactic, but processing born digital materials consists of many small details, which collectively can be a huge burden if not properly managed and automated. Adding this category feature to Rubymatica was a pleasant process, largely because the PRONOM data is open source, readily available, and delivered in a standard format (XML). My thanks and gratitude to the PRONOM people for their continuing work.

As I write this I notice that DROID v6 has just been released! The new version certainly includes a greatly expanded set of signatures (technical data for file identifications). We look forward to exploring all the new features.

Tuesday, 22 February 2011

Arrangement and Description of born-digital archives

For the last two months the Digital Archivists have been trying to define the requirements of a tool to enable archivists to arrange and describe born-digital archives. To do this we have stood back and reviewed the traditional skills and processes, asking whether changes are required or appropriate to accommodate the particular issues surrounding born-digital archives.

The components we identified were as follows:
• Graphical User Interface – needs to be clean and easy to use
• Intellectual Arrangement - must be easy and instinctive for archivists to use
• Appraisal – born-digital archives need to be appraised as much as their paper predecessors
• Rights and Permissions – to enable the management of access to the born-digital archives and also to demonstrate to third-party depositors that the material is safe in your care
• Descriptive Metadata – a term we have been using to relate to description information and to explicitly distinguish this from the technical metadata about each file
• Import/Export functionality – to import/export data with other tools
• Reporting – to provide a range of "views" for managing the digital assets

Through a series of user stories and scenarios we have sought to clearly explain each requirement and how it might relate to other functionality.

This work has been undertaken predominantly through GoogleDocs, creating a document that we can all access and edit, with diagrams and screenshots included as necessary. Over the weeks hundreds of comments have been added, and the text has been subjected to a comprehensive review and refinement process by numerous staff across the four partners.

Each institution has now scored and prioritised these features which, as befits a collaborative initiative like the AIMS project, allows us to identify a core group of features and functionality that we feel will be of greatest use to our institutions and the wider archival community.

With the exception of intellectual arrangement, most of these tasks and processes are not unique to archives, so there is already a body of knowledge and experience in how to approach them. For intellectual arrangement we have to be clear and precise about what we need and what we do not: for example, whether a single intellectual arrangement is sufficient when multiple versions would be possible in a digital environment.
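One way to picture why multiple arrangements become possible digitally: if an arrangement only references files rather than physically containing them, any number of arrangement trees can point at the same files. A hypothetical sketch (all names and paths are illustrative, not part of any AIMS tool):

```python
from dataclasses import dataclass, field

@dataclass
class DigitalFile:
    file_id: str
    path: str

@dataclass
class ArrangementNode:
    """A series/subseries node. It references files by id, so the
    same file can appear in more than one arrangement."""
    title: str
    file_ids: list = field(default_factory=list)
    children: list = field(default_factory=list)

files = {
    "f1": DigitalFile("f1", "letters/1990-01-05.doc"),
    "f2": DigitalFile("f2", "drafts/novel_ch1.doc"),
}

# Two independent intellectual arrangements over the same files.
by_genre = ArrangementNode("By genre", children=[
    ArrangementNode("Correspondence", file_ids=["f1"]),
    ArrangementNode("Literary drafts", file_ids=["f2"]),
])
by_date = ArrangementNode("By date", children=[
    ArrangementNode("1990", file_ids=["f1", "f2"]),
])
```

Whether a tool should support such parallel arrangements, or constrain archivists to a single one, is exactly the kind of question the requirements work has to settle.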

Over the next few months we will be refining and reviewing these requirements, very much aware that there are only seven months of the project remaining. We also intend to discuss those aspects we identified as "critical" in future blog postings.

Tell us what tools you use with born-digital archives...