Open to: Faculty, Graduate Students
The rise of large-scale digitized book collections--such as those provided by Google Books, HathiTrust, and the Internet Archive--is enabling a fundamentally new kind of text analysis, one that exploits the scale of these collections to ask questions not possible with smaller corpora. Our long-term goal is to enable this kind of work in large-scale distant reading by general researchers--both current experts in computational text analysis and the next generation of literary scholars, who are learning empirical methods alongside traditional techniques of close reading. While these large-scale digital collections in many ways present the greatest opportunity for computational critical research, they present several important challenges as well.
In this talk, we'll discuss one such challenge: uncertainty about the words a book contains, and how that uncertainty propagates to later stages of analysis. Access to the books in large-scale digital collections is mediated by the noisy process of optical character recognition (OCR), in which a photograph of each page is converted into a sequence of characters. OCR is highly accurate for contemporary business documents but often struggles on historical works. While individual researchers may not be able to re-OCR millions of books at scale, surfacing OCR quality for individual volumes can improve the robustness of analysis; toward this end, we'll detail our efforts to automatically predict OCR accuracy for books, describe the impact that OCR errors have on downstream natural language processing (such as part-of-speech tagging and dependency parsing), and outline future work exploiting the structure of OCR'd books to learn about the organizational practices of indexing.
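To make the idea of OCR noise concrete, here is a minimal, self-contained sketch (not from the talk itself) of how character-level confusions of the kind common in historical print can corrupt tokens before any downstream NLP ever sees them. The specific confusion pairs and example sentence are illustrative assumptions, not data from the speakers' work.

```python
# Illustrative sketch: simulating character-level OCR confusions.
# The confusion pairs below mimic well-known OCR error patterns
# (e.g., the letter pair "rn" misread as "m"); they are examples
# chosen for illustration, not an exhaustive or empirical list.

OCR_CONFUSIONS = {
    "rn": "m",   # "barn" -> "bam"
    "cl": "d",   # "clock" -> "dock"
}

def simulate_ocr_noise(text):
    """Apply every confusion wherever it occurs (a worst-case sketch)."""
    for src, dst in OCR_CONFUSIONS.items():
        text = text.replace(src, dst)
    return text

clean = "the modern barn clock"
noisy = simulate_ocr_noise(clean)
print(noisy)  # -> "the modem bam dock"
```

Every corrupted token here ("modem", "bam", "dock") is a valid English word, which is exactly why such errors silently mislead part-of-speech taggers and dependency parsers rather than being caught as obvious garbage.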
David Bamman - Assistant Professor, School of Information
Cody Hennesy - E-Learning and Information Studies Librarian, Doe Library
Register here: https://dhfellows-spring2018.eventbrite.com