Automatically Assessing OCR Quality in the HathiTrust

April 24, 2018, 12:30pm to 2:00pm

Open to: Faculty, Graduate Students


The rise of large-scale digitized book collections--such as those provided by Google Books, the HathiTrust and the Internet Archive--is enabling a fundamentally new kind of text analysis that exploits the scale of collections to ask questions not possible with smaller corpora. Our long-term goal is to make this kind of large-scale distant reading available to general researchers--both current experts in computational text analysis and the next generation of literary scholars who are currently learning empirical methods alongside traditional techniques for close reading. While these large-scale digital collections in many ways present the greatest opportunity for computational critical research, they present several important challenges as well.

In this talk, we'll discuss one such challenge: uncertainty about the words a book contains, and how that uncertainty propagates to later stages of analysis. Access to the books in large-scale digital collections is mediated by the noisy process of optical character recognition (OCR), in which a photograph of each page is converted into a sequence of characters. OCR is highly accurate for contemporary business documents but often struggles with historical works. While individual researchers may not be able to re-OCR millions of books, surfacing OCR quality for individual volumes can improve the robustness of analysis. Toward this end, we'll detail our efforts to automatically predict OCR accuracy for books, describe the impact that OCR errors have on downstream natural language processing (such as part-of-speech tagging and dependency parsing), and outline future work exploiting the structure of OCR'd books to learn about the organizational practices of indexing.

David Bamman - Assistant Professor, School of Information
Cody Hennesy - E-Learning and Information Studies Librarian, Doe Library

Digital Humanities

Series description

The Digital Humanities Fellows Lecture Series brings together the campus DH community for the scholarly presentation and informal discussion of specific aspects of digital humanities practice. At each meeting, a different Fellow presents their ongoing work before the conversation is opened to hands-on experimentation, questions, and comments. Designed to further the critical understanding and practice of the digital humanities at Berkeley, these lectures are intended for both existing and prospective DH practitioners.