Award Date

1-1-2007

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

First Committee Member

Kazem Taghva

Number of Pages

37

Abstract

The most common approach in classification systems is simply to treat all words as features and determine the probability of a document's category from those words. When given document images, sophisticated optical character recognizers can provide more than the simple text that traditional classification systems use. Using this metadata and extracting additional features from the document text can improve the classification of document images. We found a greater than 1% increase in recall when using font size metadata and extracting other features, such as words used in uppercased lines. Because our dataset can contain multi-page documents, taking only the words on the first page increased recall by at least 15%. Ensuring that 100 words of every document were used increased recall by approximately 2%; this can be explained by some documents having uninformative header pages that contain very few features.

Keywords

Classification; Document; Images; Type

Controlled Subject

Computer science

File Format

pdf

File Size

768 KB

Degree Grantor

University of Nevada, Las Vegas

Language

English

Permissions

If you are the rightful copyright holder of this dissertation or thesis and wish to have the full text removed from Digital Scholarship@UNLV, please submit a request to digitalscholarship@unlv.edu and include clear identification of the work, preferably with URL.

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

