Award Date

8-1-2012

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Ajoy K. Datta

Third Committee Member

Laxmi P. Gewali

Fourth Committee Member

Venkatesan Muthukumar

Number of Pages

49

Abstract

In this thesis, we describe a system for postprocessing text generated by Optical Character Recognition (OCR). A second-order Hidden Markov Model (HMM) is used to detect and correct OCR-related errors. The second-order HMM was chosen because it keeps track of bigrams, allowing the model to represent the system more accurately. In experiments with 159,733 characters of training data and 5,688 characters of test data, the model corrected 43.38% of the errors with a precision of 75.34%. However, the precision value indicates that the model introduced some new errors, decreasing the correction percentage to 26.4%.
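As an illustration of the idea summarized in the abstract, the following is a minimal sketch of second-order HMM decoding, in which each transition is conditioned on the previous two characters (a bigram context). It is not the thesis's implementation: the toy corpus, the confusion probabilities, the smoothing floor, and the names `train_trans2` and `viterbi2` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

PAD = "^"  # hypothetical start-of-word padding symbol


def train_trans2(words):
    """Estimate P(c | a, b) from character trigram counts (toy training)."""
    counts = defaultdict(Counter)
    for word in words:
        a, b = PAD, PAD
        for c in word:
            counts[(a, b)][c] += 1
            a, b = b, c
    return {
        ctx: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for ctx, cnt in counts.items()
    }


def viterbi2(obs, states, trans2, emit, floor=1e-6):
    """Most likely true-character sequence under a second-order HMM.

    delta maps the last two hidden states (a, b) to the best log score of
    any path ending in that pair; unseen events get the probability `floor`.
    """
    delta = {(PAD, PAD): 0.0}
    back = []  # back[t][(b, c)] = state a that preceded the pair (b, c)
    for o in obs:
        new_delta, ptr = {}, {}
        for (a, b), score in delta.items():
            for c in states:
                p_trans = trans2.get((a, b), {}).get(c, floor)
                p_emit = emit.get(c, {}).get(o, floor)
                s = score + math.log(p_trans) + math.log(p_emit)
                if (b, c) not in new_delta or s > new_delta[(b, c)]:
                    new_delta[(b, c)] = s
                    ptr[(b, c)] = a
        delta = new_delta
        back.append(ptr)

    # Backtrace from the best final pair of states.
    key = max(delta, key=delta.get)
    out = []
    for t in range(len(obs) - 1, -1, -1):
        out.append(key[1])
        key = (back[t][key], key[0])
    return "".join(reversed(out))


# Toy setup (assumed values): the OCR engine sometimes reads "l" or "i" as "1".
CORPUS = ["hello", "hill", "help"]
STATES = sorted(set("".join(CORPUS)))
TRANS2 = train_trans2(CORPUS)
EMIT = {s: {s: 0.9} for s in STATES}
EMIT["l"]["1"] = 0.1
EMIT["i"]["1"] = 0.1

print(viterbi2("he11o", STATES, TRANS2, EMIT))
print(viterbi2("h1ll", STATES, TRANS2, EMIT))
```

In this sketch the same observed character "1" is resolved differently depending on the two preceding characters, which is the kind of bigram context a second-order model can exploit. A real system would need smoothed estimates from a large training set and would also have to handle errors that change the string length, which this example does not.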

Keywords

Errors; Hidden Markov Models; Optical character recognition; Second Order HMM

Disciplines

Computer Sciences

Language

English

