Award Date


Degree Type


Degree Name

Master of Science (MS)


Computer Science

First Committee Member

Kazem Taghva

Number of Pages



The most common approach in classification systems is simply to treat all words as features and to determine the probability of a document's category from these words. When given document images, sophisticated optical character recognizers can provide metadata beyond the plain text that traditional classification systems use. Using this metadata and extracting additional features from the document text can improve the classification of document images: we found a greater than 1% increase in recall when using font size metadata and extracting other features, such as words used in uppercased lines. Because our dataset can contain multi-page documents, taking only the words on the first page increased recall by at least 15%. Recall increased by approximately 2% when we ensured that 100 words of every document were used; this can be explained by some documents having uninformative header pages with very few features.
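As an illustration only, the page-selection and uppercase-line ideas described in the abstract might be sketched as follows. This is not the thesis's implementation: the function names, the `UPPER_` feature prefix, and the choice to add whole pages until the 100-word minimum is reached are all assumptions made for the example, and font size metadata is left out because the OCR output format is not described here.

```python
from collections import Counter

MIN_WORDS = 100  # minimum word count mentioned in the abstract


def select_pages(pages, min_words=MIN_WORDS):
    """Start with the first page and keep adding whole pages until
    at least `min_words` words are covered (assumed policy)."""
    selected, count = [], 0
    for page in pages:
        selected.append(page)
        count += len(page.split())
        if count >= min_words:
            break
    return selected


def uppercase_line_words(pages):
    """Collect the words that appear on fully uppercased lines."""
    found = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            # str.isupper() requires at least one cased character,
            # so blank or purely numeric lines are skipped.
            if stripped.isupper():
                found.extend(stripped.lower().split())
    return found


def extract_features(pages):
    """Bag-of-words counts plus separate counts for uppercase-line words.
    The 'UPPER_' prefix is a hypothetical naming choice."""
    used = select_pages(pages)
    features = Counter(w.lower() for page in used for w in page.split())
    features.update("UPPER_" + w for w in uppercase_line_words(used))
    return features
```

For a document whose first page is a short header, `select_pages` would also pull in the second page, so the resulting counts could then be passed to any word-based probabilistic classifier.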


Classification; Document; Images; Type

Controlled Subject

Computer science

File Format


File Size

768 KB

Degree Grantor

University of Nevada, Las Vegas






IN COPYRIGHT. For more information about this rights statement, please visit