Award Date

8-1-2019

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi Gewali

Third Committee Member

Jan Pedersen

Fourth Committee Member

Emma Regentova

Number of Pages

184

Abstract

Optical Character Recognition (OCR) Post Processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated due to the flaws in the OCR system. This work is a report on our efforts to enhance the post processing for large repositories of documents.

The main contributions of this work are:

• Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of proposed techniques in our experiments. In particular, we will explain the alignment problem and tackle it with our de novo algorithm that has shown a high success rate.

• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.

• Applications of machine learning tools to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.

• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many of the past experiments in the field of OCR are not considered reproducible research questioning whether the original results were outliers or finessed.

Keywords

Google Web 1T; OCR; Optical Character Recognition; Post Processing; Reproducible Research; Text Alignment

Disciplines

Computer Sciences

Language

English


Share

COinS