The Digital Public Library of America (DPLA) has collected and made searchable a vast quantity of metadata from digital collections all across the country, and it has accomplished this with remarkably few resources. The Michigan Service Hub works with cultural heritage institutions throughout the state to collect their metadata, transform those metadata to be compatible with the DPLA’s online library, and send the transformed records to the DPLA. For background on DPLA and the Michigan Service Hub, please see Rick Adler’s excellent Library Tech Talk post from November 2016.
As of this writing, DPLA displays over half a million records contributed through the Michigan Service Hub and over 36 million records all told. While a large share of these records comes from individual ‘Content Hubs’ like the Smithsonian or HathiTrust, millions more are ingested through state-level Service Hubs from a wide range of smaller institutions across America.
CC-BY-2.0, posted to Flickr by cjuneau. Wheat Field - Cyclists at the Central Experimental Farm, Ottawa
The technologies we use to transform and validate XML records, like XSLT, are well-established and highly reliable, but software for handling records at this scale, and performing mass transformation and validation operations, is a little harder to come by. Europeana, which serves the same purpose as DPLA for the European Union, created an app called Repox. It does the job but isn’t as actively maintained as we would like and lacks built-in analysis tools for inspecting imported metadata.
Public Domain, Agriculture in Israel, 1945-1948
Wayne State University Libraries developed software called Combine in 2017 to replace Repox for the Michigan Service Hub. In mid-2019, I was hired to work on Combine here at the University of Michigan Library, on behalf of the Michigan Service Hub, and I got to work with the original developer at Wayne State for a few months until he left to pursue other exciting opportunities. This kind of cross-institutional collaborative work is pretty unusual, and it’s been a really neat experience. I’ve particularly enjoyed having this kind of outward-facing role as someone new not only to U of M but also to the library world as a whole.
Rick Crowley / Cereal Harvest in Somerset / CC BY-SA 2.0
Combine was built to offer flexibility and repeatability to users handling diverse streams of metadata. As a simple example, the Bentley Historical Library has a large number of collections that are available through the DPLA. Previously, if even a few of those collections needed reprocessing for any reason, we had to reprocess the entire set. With Combine, we can harvest each individual Bentley collection separately while applying the same pre-defined transformation, designed specifically for this group of collections. This means we can conserve resources when only a few collections have updates to process, and troubleshoot more easily when a single record is glitchy. To be a little more concrete: because all the Bentley metadata follows the same internal standard, we can write a single XSLT or similar transformation from the Bentley’s format to the MODS format that DPLA wants. For a small local museum, we can take their own internal standard, presumably different from the Bentley’s, and write a transformation that is applied automatically to all of their respective collections.
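To give a feel for what that mapping step looks like, here is a miniature sketch of applying a pre-defined XSLT to one record. The source record and stylesheet are toy stand-ins, not the Bentley’s actual schema or our production transformations, and the example assumes the third-party lxml library:

```python
# Illustrative sketch: applying a pre-defined XSLT transformation to a
# harvested record. The <record> format below is a made-up local
# standard; the stylesheet maps it to MODS-style elements.
from lxml import etree

# A toy source record in a hypothetical local format.
SOURCE_RECORD = b"""
<record>
  <title>Wheat Field Photograph</title>
  <creator>Unknown</creator>
</record>
"""

# A minimal stylesheet mapping the local fields to MODS elements.
STYLESHEET = b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mods="http://www.loc.gov/mods/v3">
  <xsl:template match="/record">
    <mods:mods>
      <mods:titleInfo>
        <mods:title><xsl:value-of select="title"/></mods:title>
      </mods:titleInfo>
      <mods:name>
        <mods:namePart><xsl:value-of select="creator"/></mods:namePart>
      </mods:name>
    </mods:mods>
  </xsl:template>
</xsl:stylesheet>
"""

def transform(record_bytes: bytes):
    """Apply the stylesheet to one record, returning the MODS tree."""
    xslt = etree.XSLT(etree.XML(STYLESHEET))
    return xslt(etree.XML(record_bytes)).getroot()

ns = {"mods": "http://www.loc.gov/mods/v3"}
mods = transform(SOURCE_RECORD)
print(mods.findtext("mods:titleInfo/mods:title", namespaces=ns))
# prints "Wheat Field Photograph"
```

The key point is that the stylesheet is written once per source format; every record in every collection that follows that format can then be run through it automatically.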
Wheat on Field - CC0 1.0 Universal Public Domain
We are working on incorporating several different types of ingestion process, so that pulling in a spreadsheet of metadata from a very small local history museum is as easy as calling out to the U-M Library’s OAI endpoint to fetch records from selected collections. We also support multiple technologies for record transformation and validation, so users can choose what works for them.
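The OAI side of that harvesting can be sketched with the standard library alone. The base URL and set name below are placeholders rather than the actual U-M Library endpoint, and this skips the resumption-token paging a real harvester needs for large sets:

```python
# Illustrative sketch of an OAI-PMH ListRecords harvest using only the
# Python standard library. Endpoint and set names are placeholders.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def parse_records(response_xml: bytes) -> list:
    """Return the <record> elements found in a ListRecords response."""
    tree = ET.fromstring(response_xml)
    return tree.findall(".//oai:record", OAI_NS)

def harvest(base_url: str, set_spec: str, prefix: str = "oai_dc") -> list:
    """Issue one ListRecords request against an OAI-PMH endpoint.

    Resumption tokens, needed to page through large sets, are omitted
    here for brevity.
    """
    query = urlencode({"verb": "ListRecords",
                       "metadataPrefix": prefix,
                       "set": set_spec})
    with urlopen(f"{base_url}?{query}") as resp:
        return parse_records(resp.read())
```

A production harvester would also follow resumption tokens and honor per-record deletion flags; the point here is just how little protocol machinery sits between an OAI endpoint and a batch of harvestable records.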
Combine is available on GitHub for download and development.