Deep Blue PDF Health Check

August 8, 2019

The Digital Preservation Unit, with much support from colleagues across the library, is embarking on a project to test our preservation commitment to the PDF file format in our institutional repository, Deep Blue Documents. Deep Blue Docs stores over 134,000 PDF files of one form or another, almost 75% of everything in the repository. The goals of this project are to examine the accessibility of these files using tools likely employed by the repository’s Designated Community and to identify characteristics of PDF files that may help us build a risk model for the format.

While we certainly need automated tools to help us deal with this large number of files, the core of the project consists of putting a subset of PDF files in front of actual humans to evaluate for usability. Our approach is to first use computing tools to assess the entire corpus of PDF files, then narrow it to a sample small enough for humans to test. Right now, we envision this project having five phases:

Phase 1: Metadata harvesting. Use tools such as JHOVE and PDFInfoTool to extract metadata from all the PDF files in Deep Blue Documents. We are currently testing tool outputs to determine exactly which ones we want to run.
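As a rough illustration of what harvesting a single characteristic might look like, the sketch below reads the version declared in each file's header (PDF files begin with a marker such as `%PDF-1.4`). It is a stand-in, not the output of JHOVE or PDFInfoTool, and the function names and directory layout are assumptions made for this example.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# A PDF file starts with a header such as b"%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version declared in the first 1024 bytes, or None if absent."""
    match = _HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def harvest_versions(root: Path) -> Dict[str, Optional[str]]:
    """Map every *.pdf path under root (hypothetical layout) to its header version."""
    results: Dict[str, Optional[str]] = {}
    for path in sorted(root.rglob("*.pdf")):
        with path.open("rb") as handle:
            results[str(path)] = pdf_header_version(handle.read(1024))
    return results
```

The declared header version is only one of many properties a dedicated tool reports, and a file can declare one version while using features of another, so this should be read as a toy example of the harvesting step rather than a substitute for it.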

Phase 2: File characteristics identification and sample size selection. Select characteristics of PDF files, such as version and creation date. Matching these to the data harvested in Phase 1 will determine the sample sizes needed to test each of these characteristics.
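One way to picture the sampling step is a stratified draw: group the harvested records by a characteristic and pull a fixed number of files from each group. The sketch below assumes each record is a dictionary with a key for the characteristic; the record shape, the `per_group` parameter, and the fixed seed are all illustrative choices, not decisions the project has made.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_sample(
    records: List[dict], key: str, per_group: int, seed: int = 0
) -> Dict[Hashable, List[dict]]:
    """Draw up to per_group records for each distinct value of records[key]."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample: Dict[Hashable, List[dict]] = {}
    for value, members in groups.items():
        # Small groups are taken whole rather than raising an error.
        sample[value] = rng.sample(members, min(per_group, len(members)))
    return sample
```

A fixed number per group is only the simplest option; sample sizes proportional to group size, or sized for a target confidence level, would follow the same grouping logic.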

Phase 3: Test file accessibility. Using tools that are commonly employed by our Designated Community, testers will attempt to open and use PDF files selected from the sample corpus. We are exploring ways to include testers who rely on assistive technology. Results of testing will be coded for further analysis.

Phase 4: Analyze findings to determine if there are certain characteristics that lead to higher rates of successful/unsuccessful access. 
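A minimal sketch of this kind of analysis, assuming each coded test result is a dictionary holding the file's characteristics plus a boolean outcome (the `opened` field name is hypothetical), is to compute the share of successful tests for each value of a characteristic:

```python
from collections import Counter
from typing import Dict, Hashable, List


def success_rates(results: List[dict], key: str) -> Dict[Hashable, float]:
    """Return the fraction of successful tests for each value of results[key]."""
    totals: Counter = Counter()
    successes: Counter = Counter()
    for row in results:
        totals[row[key]] += 1
        if row["opened"]:  # hypothetical field: True if the tester could use the file
            successes[row[key]] += 1
    return {value: successes[value] / totals[value] for value in totals}
```

Raw rates like these are only a starting point; with small groups, differences between characteristics may not be meaningful without a statistical test.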

Phase 5: Deeper dive into selected files. Initial analysis may lead to more testing of files with certain characteristics. This will be driven by the results from Phase 4.

Future study: These files, metadata, and testing results could lead to future related research projects, such as comparing our testing results against automated validation of those same files. 

We are in the early stages of this project and have a lot more to learn, but we think this type of real-world testing is an important facet of ongoing institutionalized preservation. One of the great tensions in digital preservation is the need to make decisions for long-term preservation without a lot of historical experience with digital material. We hope to build a testing program to check our assumptions and see what we got right and what we need to adjust.

We are very interested in engaging with the community as we proceed, so please feel free to leave comments or email us with thoughts, suggestions, and ways we could share lessons learned. Keep a lookout here for more project updates.