From Sources to Data: Using OCR in the Classroom

March 16, 2017, 10:30am to 12:00pm

Open to: All faculty, graduate students, and staff


How do you go from materials in an archive to data that you can search, process, and analyze computationally? Students who are learning to analyze research materials are often given those materials in an easy-to-use, digitized state. That convenience skips over some of the crucial data acquisition and cleanup steps they’ll face when doing their own research.

This workshop will introduce ABBYY FineReader, optical character recognition (OCR) software that can turn scans of text into searchable, editable text. Compared with Adobe Acrobat, FineReader offers improved overall accuracy as well as better support for complex layouts and for non-English languages (including Arabic, Chinese, Japanese, Hebrew, and Russian).

Students and researchers can access FineReader through a user-friendly virtual Windows desktop managed by Research IT. Come learn how to become a pilot user for this service.

This workshop will also briefly address how students and researchers can OCR thousands of documents using the open-source Tesseract software on the Savio high-performance compute cluster.
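As a rough illustration of what batch OCR with Tesseract can look like, here is a minimal Python sketch. It assumes the `tesseract` command-line tool is installed and on the PATH, and that the scans are TIFF files in a single folder. The function names, the folder names, and the `dry_run` option are hypothetical and chosen only for this example. It is not the workflow taught in the workshop, and it does not cover job submission on Savio.

```python
from pathlib import Path
import subprocess


def build_tesseract_command(image_path, output_dir, lang="eng"):
    """Build the argument list for one Tesseract run.

    Tesseract's CLI takes an image, an output base name, and options.
    It appends ".txt" to the output base itself.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, lang="eng", pattern="*.tif", dry_run=False):
    """OCR every matching image in input_dir, one Tesseract call per file.

    With dry_run=True the commands are only collected, not executed,
    which is useful for checking the plan before a long run.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for image_path in sorted(Path(input_dir).glob(pattern)):
        cmd = build_tesseract_command(image_path, output_dir, lang)
        commands.append(cmd)
        if not dry_run:
            # check=True raises an error if Tesseract fails on a file.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical folder names; "rus" selects Tesseract's Russian language data.
    for cmd in ocr_directory("scans", "ocr_output", lang="rus", dry_run=True):
        print(" ".join(cmd))
```

Each page is processed independently, so on a cluster the list of commands could in principle be split across many workers. How that is done on Savio is left to the workshop itself.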


  • Quinn Dombrowski, Research IT
  • Stacy Reardon, Library
  • Adam Anderson, Digital Humanities


Registration closed on March 15, 2017, at 5:00pm.

