Award Date

1-1-1994

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

First Committee Member

Junichi Kanai

Number of Pages

81

Abstract

A classifier that determines page quality from an Optical Character Recognition (OCR) perspective is developed. It classifies a given page image as either "good" (i.e., high OCR accuracy is expected) or "bad" (i.e., low OCR accuracy is expected). The classifier is based on measuring the amount of white speckle, the amount of broken pieces, and the overall size information in the page. Two sets of test data were used to evaluate the classifier: the Test dataset, containing 439 pages, and the Magazine dataset, containing 200 pages. The classifier classified 85% of the pages in the Test dataset correctly. However, approximately 40% of the low-quality pages were misclassified as "good." To solve this problem, the classifier was modified to reject pages containing tables or fewer than 200 connected components. On the Test dataset, the modified classifier rejected 41% of the pages, correctly classified 86% of the remaining pages, and did not misclassify any low-quality page as "good." On the Magazine dataset, it rejected no pages, correctly classified 86.5% of the pages, and likewise did not misclassify any low-quality page as "good."
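
The reject rule described in the abstract can be illustrated with a minimal sketch. Only the two reject conditions (a page contains a table, or has fewer than 200 connected components) come from the abstract. The names `PageStats`, `classify_with_reject`, and `base_classifier` are hypothetical. The good/bad decision itself is left as a caller-supplied function, because the abstract does not give the feature thresholds.

```python
from dataclasses import dataclass
from typing import Callable

# Threshold stated in the abstract: pages with fewer components are rejected.
MIN_COMPONENTS = 200


@dataclass
class PageStats:
    """Hypothetical container for per-page measurements (names are illustrative)."""
    connected_components: int
    contains_table: bool


def classify_with_reject(
    page: PageStats,
    base_classifier: Callable[[PageStats], str],
) -> str:
    """Return "reject" for pages the rule excludes, else the base label.

    base_classifier is an assumed stand-in for the good/bad classifier;
    it is expected to return "good" or "bad".
    """
    if page.contains_table or page.connected_components < MIN_COMPONENTS:
        return "reject"
    return base_classifier(page)
```

Placing the reject test before the quality decision means that sparse pages and pages with tables never receive a "good" or "bad" label at all, which is the trade-off the abstract reports: fewer pages classified, but no low-quality page labeled "good."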

Keywords

Evaluation; Features; Page; Quality; Simple

Controlled Subject

Computer science; Information science

File Format

pdf

File Size

3461.12 KB

Degree Grantor

University of Nevada, Las Vegas

Language

English

Permissions

If you are the rightful copyright holder of this dissertation or thesis and wish to have the full text removed from Digital Scholarship@UNLV, please submit a request to digitalscholarship@unlv.edu and include clear identification of the work, preferably with URL.

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

