Unpacking Old Baggage: Re-Assessing How We Format Metadata and Better Understanding the Systems That Read It

What Went Wrong

When developing our digital archaeology workflows back in 2016, the lab was looking for a way to create consistent metadata for the disk imaging process. More specifically, we wanted this metadata to be included in the same package as the files themselves for easy access.

We took advantage of the ability to use metadata profiles in Bagger, which was the preferred bag-creation tool at the time. While we were developing our profile, a blog post from the Library of Congress introduced some new profiles that were pre-loaded in Bagger. One of these profiles was modeled after another developed by the Indiana Archives and Records Administration (IARA); we really liked this profile and modified it to meet the lab’s needs.

Some fields in the IARA’s metadata profile included a “/” or a space character, such as “accession/TransferNumber.” We adopted this formatting for some of our own fields, such as “BarcodeNumber/Identifier.” We were using this profile as a light implementation of PREMIS, with the idea that it would someday form the basis of a more formal schema implementation. Even so, we did not have the foresight to recognize that special characters like “/” and “ ” in our field names could be problematic when migrating to a new system.

Identifying the Problem

In 2024, we ran into this exact issue: in an attempt to transfer our first generation of bags into Archivematica storage, we discovered that some of the metadata fields were unsearchable. After digging into the METS file that Archivematica creates, we discovered that the original JSON field name and value pairs were being converted to XML elements; for example, “BagCreator”: “Abby” became <BagCreator>Abby</BagCreator>. Because an XML element name cannot contain a slash or a space, a field such as “BarcodeNumber/Identifier” had no valid element name to be converted to.

Screenshot of transferred metadata in Archivematica METS file.
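
To illustrate why those field names cannot survive a conversion like this, here is a minimal sketch using Python’s standard library. It is not Archivematica’s code; the helper names as_element and is_well_formed are made up for this example, and the only thing it demonstrates is the XML well-formedness rule itself.

```python
import xml.etree.ElementTree as ET


def as_element(field_name: str, value: str) -> str:
    """Naively wrap a value in an element named after the field (illustration only)."""
    return f"<{field_name}>{value}</{field_name}>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Compare one valid field name with the two problem patterns from our profile.
for name in ("BagCreator", "BarcodeNumber/Identifier", "Bag Creator"):
    print(name, is_well_formed(as_element(name, "Abby")))
```

Only the first name parses. The slash is read as the start of a self-closing tag and the space as the start of an attribute, so the parser rejects both.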

This mistake managed to fly under our radar for a long time because Archivematica never gave us an indication or error message when it could not convert the JSON fields; instead, the system ignored them completely. Additionally, when we tested searchability in Archivematica in the past, we had unknowingly tested only metadata fields that didn’t contain a slash or space, or fields whose values were also captured in a different way. For example, the barcode was always searchable despite having an invalid field name because it was also used as the name of a directory.

Given these factors, we were lucky to identify the issue within the first few test bags of the transfer process. As a result, we were able to put our heads together and brainstorm how to approach the issue while only having to backtrack and remove a handful of bags from storage.

Our Solution 

At the time of this discovery, we had over 1000 bags with the formatting issue that needed to be transferred. Instead of having to manually change these fields, we were able to write a script using the Library of Congress’ bagit-python library to loop through each bag’s metadata file, identify any invalid field names, and replace them with a new, correctly formatted name.

In general, we replaced any “/” with “-” and concatenated field names containing spaces using Pascal case: “BarcodeNumber/Identifier” became “BarcodeNumber-Identifier” and “Bag Creator” became “BagCreator.” We then used bag.save(manifests=True) from the same bagit-python library to regenerate the manifests.

Screenshot of the script we wrote using the bagit-python library.
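
For readers who cannot see the screenshot, the approach can be sketched roughly as follows. This is not the lab’s actual script: the function names normalize_field_name, rename_fields, and fix_bag are hypothetical, and the choice to leave single-word names untouched and to stop on a name collision is an assumption made for this example. The only bagit-python features it relies on are bagit.Bag(path), the bag.info dictionary, and bag.save(manifests=True).

```python
def normalize_field_name(name: str) -> str:
    """Replace "/" with "-" and join space-separated words in Pascal case."""
    words = name.replace("/", "-").split()
    if len(words) <= 1:
        # Nothing to join; leave the existing capitalization alone.
        return "".join(words)
    return "".join(word[:1].upper() + word[1:] for word in words)


def rename_fields(info: dict) -> dict:
    """Return a copy of the bag-info mapping with normalized field names."""
    renamed = {}
    for old_name, value in info.items():
        new_name = normalize_field_name(old_name)
        if new_name in renamed:
            # Assumption: two fields collapsing to one name should be
            # reviewed by a person rather than merged silently.
            raise ValueError(f"{old_name!r} collides with existing {new_name!r}")
        renamed[new_name] = value
    return renamed


def fix_bag(bag_dir: str) -> None:
    """Rewrite one bag's metadata field names in place."""
    import bagit  # third-party library: pip install bagit

    bag = bagit.Bag(bag_dir)
    bag.info = rename_fields(bag.info)
    bag.save(manifests=True)
```

Looping over a directory of bags is then a matter of calling fix_bag on each bag folder. Trying it on copies of a few bags first would be a sensible precaution.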

This method allowed us to edit the metadata text files without having to completely rebag from scratch. Phew! By the end of the week, we were able to run the script on all 1000+ bags and add them to Archivematica storage, with all metadata fields searchable.

Moving Forward

While we were writing the Python script, we researched and identified a whole list of XML formatting rules. Though we didn’t need to address them all during this process, we wanted to ensure that we could prevent something similar from happening again.
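
As an example of what checking against those rules can look like, here is a deliberately conservative sketch. It is not the full XML specification: it accepts only ASCII names that start with a letter or underscore and continue with letters, digits, hyphens, underscores, or periods, and it rejects names beginning with “xml,” which the specification reserves. The function name is_safe_xml_name is made up for this example.

```python
import re

# Conservative ASCII-only subset of the XML naming rules (an assumption for
# this sketch; the real specification also allows colons and many non-ASCII
# characters).
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_safe_xml_name(name: str) -> bool:
    """Return True if the field name is safe to use as an XML element name."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" are reserved
    return _SAFE_NAME.fullmatch(name) is not None


for candidate in ("BagCreator", "BarcodeNumber-Identifier",
                  "BarcodeNumber/Identifier", "Bag Creator", "2016Accession"):
    print(candidate, is_safe_xml_name(candidate))
```

A check like this is stricter than necessary, but for field names we control, being too strict costs very little.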

So, we built a metadata form that we could edit and re-use in the lab as needed. The form not only solved our XML problem by pre-formatting all field names, but it also sped up our metadata workflow by generating a text file in JSON format. As a result, we can efficiently write a metadata file and drop it into its corresponding bag directory before moving on to the next stage in our image processing workflow.

Screenshot of the new metadata form output with correctly formatted field names.
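
The idea behind the form can be sketched in a few lines of Python: field names are fixed in advance, so whoever fills in the form supplies only values. This is an illustration rather than our actual form. The FIELD_NAMES list is a hypothetical two-field profile, and render_metadata and write_metadata_file are made-up names.

```python
import json
from pathlib import Path

# Hypothetical profile: field names are pre-formatted once, up front.
FIELD_NAMES = ("BagCreator", "BarcodeNumber-Identifier")


def render_metadata(values: dict) -> str:
    """Return the metadata as JSON text, rejecting any unknown field name."""
    unknown = sorted(set(values) - set(FIELD_NAMES))
    if unknown:
        raise ValueError(f"Fields not in the profile: {unknown}")
    # Emit fields in profile order so every file has the same layout.
    ordered = {name: values[name] for name in FIELD_NAMES if name in values}
    return json.dumps(ordered, indent=2)


def write_metadata_file(values: dict, path: str) -> None:
    """Write the JSON text to a file that can be dropped into a bag directory."""
    Path(path).write_text(render_metadata(values) + "\n", encoding="utf-8")
```

Because a mistyped or badly formatted field name raises an error right away, the problem surfaces when the metadata is written instead of years later during a migration.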

Overall, the lesson learned is that we should be double- and triple-checking both ourselves and the requirements of any system that we’re working with. Our new metadata formatting is compatible with Archivematica now, but what about 10 or 20 years from now? How do we prevent mistakes like this from slipping through the cracks? We won’t always be able to predict the future or the tools that may emerge, but we can try our best to catch issues sooner rather than later and, from there, build solutions that are malleable.