Quality in HathiTrust (Re-Posting)

May 20, 2015

Jeremy York of HathiTrust and Kat Hagedorn of the UM Library collaborated on a HathiTrust blog post regarding how HathiTrust handles reported quality problems for its volumes.

The full blog post is here on the HathiTrust site:


Excerpts from the post:

As reported in our monthly updates, we receive well over a hundred inquiries every month about quality problems with page images or OCR text of volumes in HathiTrust. That’s the bad news. The good news is that in most of these cases, there is something we can do about it. This blog post is intended to shed some light on our thinking and practices about quality in HathiTrust. We hope it will also encourage you to report any problems you find, so that we have the opportunity to fix them and deliver the highest quality collections we can for educational and research needs.

Libraries take a variety of approaches to digitization. The choice of approach may be influenced by the amount of material to be digitized, the time and resources available, and, significantly, the intended purpose of digitization. For instance, it could be important to preserve for users, as much as possible, the artifactual value of the print original: the texture and color of the pages, the wear and tear on the book, and so on. On the other hand, it may be important, or at least sufficient for the initial purpose, to preserve the intellectual content of the book rather than the intellectual content and artifact together. In the first case, manual review and manipulation of each digitized image may be needed to achieve the desired result. The second case, however, may lend itself well to larger-scale digitization, where a higher production rate can be achieved by focusing on, for instance, accurate capture of the printed text (not necessarily to the exclusion of the book's physical attributes) rather than on the fuller characteristics of the artifact, which may be more time-consuming to capture.

From the time HathiTrust was launched to the present, 6,499 volumes have been reported to have some kind of quality issue. As of May 4, 2015, we have managed to fix the problems in 2,310 of these. We prioritize problem reports that are known to come from end users (many problem reports come from staff at partner institutions engaged in copyright review of limited view materials). Of the 1,141 such reports on full view volumes, we were able to fix 913, just over 80%!
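
As a quick check on the arithmetic behind these figures, the fix rates can be recomputed from the counts quoted above. This is only an illustrative sketch: the variable names and the `fix_rate` helper are made up for this example and are not part of any HathiTrust tool.

```python
# Counts quoted in the post (as of May 4, 2015).
reported_volumes = 6499   # volumes reported with some kind of quality issue
fixed_volumes = 2310      # of those, volumes whose problems were fixed
end_user_reports = 1141   # end-user problem reports on full view volumes
end_user_fixed = 913      # of those, reports that were fixed


def fix_rate(fixed: int, total: int) -> float:
    """Return the share of fixed items as a percentage (hypothetical helper)."""
    return 100 * fixed / total


overall_rate = fix_rate(fixed_volumes, reported_volumes)
end_user_rate = fix_rate(end_user_fixed, end_user_reports)

print(f"All reported volumes: {overall_rate:.1f}% fixed")
print(f"End-user reports on full view volumes: {end_user_rate:.1f}% fixed")
```

The prioritized end-user reports come out at just over 80% fixed, as stated above, while the rate across all reported volumes is about 35.5%.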