A New Normal: Our Decision to Migrate DOC and DOCX Files

Huge thanks to Digital Preservation Production Manager Abby Sypniewski and Digital Preservation Lab Intern Luciana Qu, who completed the community scan of peer institutions, other research, and testing that informed the decisions and processes discussed in this post. 

The Digital Preservation Unit is relaunching our digital archaeology and born-digital ingest workflows. As part of that work, we developed a normalization workflow for Microsoft Word documents (DOC and DOCX formats) to produce a preservation package (an Archival Information Package, or AIP) that will include the following: 

  • The original, unaltered DOC and/or DOCX files
  • A copy of each DOC and/or DOCX file in the PDF/A-2u format to serve both access and preservation purposes
  • Metadata documenting the creation of the PDF/A-2u versions 

We are using several new, smallish acquisitions as pilot collections to develop and test the implementation of this new approach. I will focus here mostly on the factors that led to this decision and on how the early testing has gone. We hope it will be useful for folks to see not only what we are doing, but also how we decided to do it. 

Context is everything 

While we have been working to preserve born-digital material at production levels for almost a decade, we are still behind in developing complete pipelines from ingest to access. The library is not the University of Michigan's official archive, so we are not inundated with deposits of modern digital records. We are, instead, focused on preserving born-digital acquisitions for the special collections libraries under our umbrella. These libraries tend not to collect large numbers of born-digital materials (yet), so we typically see tens to hundreds of Word files in a collection, not thousands. Some other contextual info that might be helpful:

  • Access to our pilot born-digital collections will be a mix of fully open online access and access restricted to the reading room.
  • Any solution we come up with for access would have to be straightforward for our staff and users.
  • Our preservation storage solution is separate from our user access mechanisms. 

OK, boomer 

It feels like many in the community have moved from OAIS to simpler or clearer models, which is great. However, I still find compelling the argument that preservation decisions should be based on an archive’s Designated Community (Page 1-11). Thus, we have always felt that consulting our curators, who know our users (Consumers) best, is essential to our preservation and access strategies. So far, the universal message has been that users want to read the content of word processing docs as easily as possible. Maintaining the exact look, feel, and format of the original files is not a priority for the collections we have so far. This gives us some leeway to use normalization as a strategy for both access and preservation with this content. 

Word to your mother

Word formats for preservation are not a cut-and-dried discussion. If you are like me, you have a printout of the Library of Congress's Sustainability of Digital Formats website on your bedside table to help guide you through troubled times. Its entries for DOC and DOCX are generally favorable for preservation, though the DOC entry is definitely less enthusiastic than the DOCX one [1]. So, why are we making this difficult? It’s really about how access works. Sometimes local versions of Word can be picky about how you open a file or whether it opens at all. The software may try to convert an older file format to make it more “compatible”. In addition, as Word moves more into the cloud, access to older versions becomes less clear [2]. In other words (get it), it’s janky. Janky access conflicts with some of our user community's core needs and undermines our confidence in the format's suitability for preservation. Preservation is, at its core, long-term access. While the need to normalize Word files may not be universal, we decided to explore normalization to provide easier access and additional long-term stability to collection material, particularly if we could find both in a single format. 

I should briefly address emulation. Emulation could be a way to ensure continued access to the original versions. While the continued community development of emulation as a tool is vital to our work, we currently lack the organizational commitment or resources necessary to use it as a preservation strategy. We are retaining the original files as the source of access via emulation if we ever get to that point. For now, however, if you want to visit your old friend Clippy, you are on your own [3]. 

You lost me at “PDF”

Hearing that we are migrating files to the PDF format may bring out the pitchforks and torches from some of our digital preservation colleagues, and not without reason. Many of the problems associated with the PDF format in our space stem from the challenges of remediating thousands of files that were poorly created for numerous reasons, including sloppy implementation of the standard. That said, PDF/A-2u has several attractive features for our use case, including accessibility tagging and Unicode support [4]. We believe that, because we are creating PDF derivatives in a controlled environment and can perform QC, we can mitigate many of the negatives associated with the format. So far, our testing has borne this out. Our workflow using Adobe Acrobat produces valid PDF/A-2u files with formatting, fonts, and pagination identical to the source documents. We will continue to test this with more complex source documents. There is a lot to say about this topic, so we will write a future post detailing our selection of this format specifically for our collection material.

Accessibility 

We saw the normalization process as an opportunity to test whether we could create versions that are more accessible than the original files [5]. So far, we have had, frankly, surprisingly positive test results using Acrobat to convert Word to PDF/A-2u with automated accessibility tagging. We are hesitant to shout this from the rooftops at this point because, in the world of janky software, Acrobat is certainly no better than Word. We will likely need to use a different conversion tool as our program scales up. Also, we still need to conduct additional conformance testing and continue evaluating Acrobat's handling of more complex source files. 

Sustainability

One of the values in the library’s Digital Preservation Baseline, our agreed-upon approach to digital preservation, is to incorporate environmental sustainability into our program. Making a copy of every Word file to be stored alongside the original obviously results in larger packages that consume more storage resources. Compounding that, features that make PDF/A good for preservation, such as font embedding, can make those versions much larger than the original Word file. So far, we have seen PDF/A versions that are 2 to 10 times the size of the original file. Yikes. The smaller the original file, the greater the proportional increase, which we attribute to PDF/A overhead. 
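
As a rough illustration of why smaller files would show the largest proportional increases, here is a minimal Python sketch. It assumes a simple model that is not drawn from our testing: every PDF/A carries a fixed overhead (embedded fonts, metadata, and so on) plus content that scales with the size of the source file. The function name, the overhead value, and the scale factor are all hypothetical placeholders, not measurements.

```python
def size_ratio(source_bytes: int,
               fixed_overhead_bytes: int = 150_000,
               content_scale: float = 1.5) -> float:
    """Estimate PDF/A size relative to the source file.

    Hypothetical model: PDF/A size = fixed overhead + scaled content.
    Both default values are illustrative assumptions only.
    """
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    estimated_pdf_bytes = fixed_overhead_bytes + content_scale * source_bytes
    return estimated_pdf_bytes / source_bytes


if __name__ == "__main__":
    # Compare a small, medium, and larger source file under the same model.
    for source_kb in (20, 100, 500):
        ratio = size_ratio(source_kb * 1000)
        print(f"{source_kb} KB source -> about {ratio:.1f}x the original size")
```

Under this model the fixed overhead dominates for small files, so the ratio shrinks toward the scale factor as the source file grows. Real conversions will vary with fonts, images, and tagging.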

While we are not excited about this increase, the actual sizes make us feel a bit better. The majority of the PDF files are still under 500 KB, with the largest ones just over 1 MB. Some of this increase will be offset by our other approaches. Through consultation with our curator partners, we have reduced the number of photos we take of the media and stopped storing disk images in the AIP except in rare cases. We are also doing ongoing work as an organization to balance processes such as backups and fixity between preservation and sustainability. While this work helps, there is no getting around the fact that our normalization process will increase our storage footprint. Our emerging philosophy of sustainability centers on the link between selection and energy use. Just as with physical collection preservation, proper digital preservation procedures have costs in both monetary and carbon terms. Understanding these costs better helps all of us advocate for careful selection. We should take the necessary steps to properly preserve our collections, and those collections should be limited to material that warrants such preservation [6].

Conclusion

While this post focused on source Word files, this approach will likely be used for other word-processing documents to maintain consistency and address more at-risk formats. We have not yet considered normalization paths for other formats. We will do so on an as-needed basis, with a likely heightened bar for larger content types due to storage impacts. 

This will be the first in a series on the development of our new workflows. Look for a post soon with details on how we chose PDF/A-2u, as well as testing and metadata creation for the conversion process. There will be a post after that on the updating and relaunch of our digital archaeology and born-digital ingest workflows. Also, keep your eyes peeled for more details and data on approaching this work through a sustainability lens. 

---

[1] LOC Sustainability of Digital Formats’ discussion of DOC and DOCX.

[2] There are alternatives for opening Word docs, such as the open-source (and free) LibreOffice. Our experience with LibreOffice opening both formats has been pretty good, but we did not want to rely on that for future access. More on this in a future post.  

[3] More on Clippy. Also, speaking as someone who used Microsoft products during these years, everyone hated Clippy. 

[4] LOC Sustainability of Digital Formats’ discussion of PDF/A-2u.

[5] Our entire community is currently grappling with ADA Title II compliance, including discussions about how the requirements apply to archival material. This discussion is out of scope for this post. 

[6] For an overview of the importance of selection to sustainability, see Eira Tansey’s DPC white paper "Environmental Impact and Digital Preservation."