The Voyages of a Digital Collections Audit/Assessment: Episode 2 (The Pilot Group)

The landing party returns from the alien planets with valuable lessons for exploring the rest of the system

In preparation for the full audit of our digital collections (see the Episode 1 blog post about why we’re doing an audit of our 280+ collections), we conducted an audit of a pilot group of collections. Through this process, we learned a lot that has helped us initiate the full audit.

As we completed this audit, we came to the realization that we are really talking about an assessment of our existing collections. An audit would instead encompass the set of tools we could use after we've finished this primary assessment (and handled any resulting migration or deprecation of collections): tools we could run on collections to gather the information we need to keep assessing them moving forward. For clarity, we'll likely use the word "assessment" in our follow-up blog posts.

Conducting a Pilot

What do we know now, looking from the ship at the strange worlds before us? Even at a glance we can see that when digital collections are created over the span of several decades, they use different technologies, follow different best practices even when the technology is similar, carry robust or limited metadata, make different interface choices, and pay varying degrees of attention to aspects that may improve usability and accessibility. These choices made sense at the time based on available technology and standards. We value the unique qualities of each collection! But these differences will also prove problematic.

With over 280 digital collections, we didn't want to just jump right in and start the full assessment on everything. Instead of sending out the whole crew, we decided to first send a landing party to scope things out. Using around a dozen collections that we knew encompassed a range of concerns, we created a map of things to look for in the digital terrain. We looked at quantitative as well as qualitative aspects including, but not limited to, the following (see the sketch after this list for how these fields might fit together):

  • Identification: title, ID, URL
  • Size: number of objects within the collection, amount of data (MB)
  • Collection type (our collection classes encompass text, image, bibliographic and finding aids)
  • Content type (text, still image, audio, bibliographic, multimedia)
  • File format (PDF, JPEG, TIFF, XML, etc.)
  • Access: public, restricted to authorized users, restricted to access only on campus, etc.
  • Age of the collection: When was it created? When was it last updated?
  • Usage: internal and external usage by patrons
  • Importance to the university: Is the collection one of a kind, or does it serve a special purpose for the University or our reputation?
  • Quality of the items: at what resolution were they scanned or photographed?
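
To make these fields concrete, here is a minimal sketch of what a single assessment record might look like if modeled as a data structure. Our actual gathering happened in a spreadsheet; every field name, category, and example value below is an illustrative assumption rather than our real schema (Python):

    from dataclasses import dataclass, field
    from enum import Enum

    class Access(Enum):
        PUBLIC = "public"
        AUTHORIZED_USERS = "restricted to authorized users"
        ON_CAMPUS = "restricted to on-campus access"

    @dataclass
    class CollectionAssessment:
        # Identification
        title: str
        collection_id: str
        url: str
        # Size
        object_count: int
        size_mb: float
        # Classification (our collection classes: text, image, bibliographic, finding aid)
        collection_type: str
        content_types: list[str] = field(default_factory=list)  # e.g. ["text", "still image"]
        file_formats: list[str] = field(default_factory=list)   # e.g. ["PDF", "TIFF", "XML"]
        # Access and history
        access: Access = Access.PUBLIC
        created_year: int | None = None
        last_updated_year: int | None = None  # content update or interface update? define it!
        # Qualitative notes
        usage_notes: str = ""       # internal and external usage by patrons
        importance_notes: str = ""  # one of a kind? special purpose for the University?
        quality_notes: str = ""     # e.g. scan or photograph resolution

    # An invented example record, purely for illustration:
    pilot = CollectionAssessment(
        title="Example Papers",
        collection_id="example001",
        url="https://example.edu/collections/example001",
        object_count=1200,
        size_mb=5400.0,
        collection_type="text",
        content_types=["text", "still image"],
        file_formats=["TIFF", "XML"],
    )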

What We Learned

We sent crew members to explore the unknown depths of this pilot group of collections, and they brought back valuable knowledge for future work. We had already anticipated some of the issues, but some were unexpected.

What we’re trying to find out...

...is unclear, even to ourselves at first. While it was easy to start listing some of the quantitative data we could collect about a collection, it was difficult to fully understand how that information would be useful for the audit until we took a step back and discussed some of the items within the group gathering the information. Even now that the pilot collections have been assessed, we continue to discuss which aspects we're trying to capture in the next phase of the project.

...needs to be considered along with future planning. When talking with subject matter experts, there was a fear that once all of this work was done, we would never use it again and the information would become stale. Though we're gathering this information now specifically for a potential collection migration, it can be worthwhile beyond that. We need to make the processes for gathering the information repeatable so that fresh information can be collected in the future without as much effort and without starting from scratch.

...requires playing 'telephone' with other experts. After going through all of the pilot collections within our team, we ended up revising our wording when we consulted an expert who had formerly worked with the content. Because that person had handled the content in a different manner many years ago, their input added to our understanding of the collections. It was clear that parties with different perspectives on the projects, and from different points in their history, brought different information to the table.

...needs to be gathered in a carefully considered environment. The initial area for gathering audit information was set up in an individual's personal space. As we started the audit proper, we needed to create a fresh shared area, carry over the information, reshare it with the appropriate parties, and delete the old documentation to reduce later confusion. If we had not identified this issue, or had waited until later in the process to resolve it, the transfer would have been far more complicated.

...needs to have clear definitions and noted information locations. It proved useful to flesh out where to locate information, and exactly what was intended by a concept like "number of digital objects", in a separate document rather than in the main spreadsheet where we were gathering the information. If we had tried to keep the definitions entirely within the spreadsheet, it would have become unwieldy.

...needs to be carefully defined. Each word matters. When we initially wanted to capture "collection staleness: last update date", we hadn't considered the potentially negative connotation; how stale is a collection of Jane Austen's works, really, and was that worth capturing? In the same vein, when we asked for "last update date", were we referring to the last update of the content or of the interface?

...may change later, and that's okay. As we started on the pilot group, we gathered as much detail about "Size of the Collection (MB)" as we could: a specific number. We acknowledged, though, that we may want to switch to 'small', 'medium', and 'large' later for analysis purposes. We capture the most detailed information now and can generalize later.
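
The reasoning is that a precise figure can always be collapsed into coarse buckets afterward, while the reverse is impossible. A minimal sketch of that collapse, with invented placeholder thresholds rather than cutoffs we have actually chosen (Python):

    def size_bucket(size_mb: float) -> str:
        """Collapse an exact collection size into a coarse category.

        The thresholds are illustrative placeholders, not real cutoffs.
        """
        if size_mb < 1_000:    # under ~1 GB
            return "small"
        if size_mb < 100_000:  # under ~100 GB
            return "medium"
        return "large"

    # 'medium' can never be turned back into an exact figure like 42,515 MB,
    # which is why we record the precise number first.
    print(size_bucket(42_515.0))  # medium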

Our information sources...

... are less than ideal for long-term storage of information. Some of the information about collections could only be found in email. If this information is worth keeping, we need to examine our practices as an organization for storing it better, so that if a key individual leaves, we are not suddenly left without critical information, even rights information, for a collection.

...yielded unclear data even when we could locate it. Dates for one collection indicate that it was created in one year, 1998, and updated in another, 2000. It was essentially impossible to tell whether the update year reflected changes to the content, to the collection itself, or to something else. We need to consider whether there are changes we can make in the future that will clarify these sorts of issues within our databases and collection information. Relying on an individual's memory of what was going on at that time or with a particular collection is not ideal.

...surfaced technical issues. In looking for data within certain areas, we located bugs or broken links. In one instance, collection usage was supposed to be noted, but because a script was broken, usage appeared to be '0' across several collections. Working with others resolved these technical issues.

...had data options that we didn't realize existed when we initially set things up. When we created the main spreadsheet and noted options for "Format types: text collections," the options we listed were "Plain text", "PDF", "TIF", and "JPG". It was only when we shared the spreadsheet with a subject matter expert that they pointed out that "XML" was also an option.

The collections themselves...

...may be broken, with no fixes likely to come. We have both identified new issues and been aware of known issues within collections for some time, but because of the nature of the stakeholder, rights issues, or other problems, there may never be a fix. For instance, we may never be able to rescan pages to fix images in a particular collection, since a stakeholder has been historically non-responsive. Still, we note the issue as part of the audit so that we can circle back to such problems later, or at least so that we don't forget the history of the issue.

...have interesting usage that may be worth examining post-audit. In looking into the usage rates for a collection, we found that only a small portion of the usage was within the University of Michigan. While it isn't something that we're going to investigate now, it is of interest and could play a later role, such as in considering the collection's prestige or other qualitative aspects.

To Be Continued…

Stay tuned as we proceed with our digital collections auditing/assessment adventures! Please keep sharing your case studies and feedback.