Master of Science (MS)
The most common approach in classification systems is simply to treat every word as a feature and to estimate the probability of a document's category from those words. When the input is a document image, sophisticated optical character recognizers can provide more than the plain text that traditional classification systems use. This metadata, together with additional features extracted from the document text, can improve the classification of document images: we found a greater than 1% increase in recall when using font-size metadata and extracting other features, such as the words used in uppercase lines. Because our dataset contains multi-page documents, taking only the words on the first page increased recall by at least 15%. Ensuring that 100 words of every document were used increased recall by approximately 2%; this can be explained by some documents having uninformative header pages that contain very few features.
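The feature choices described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`tokenize`, `extract_features`), the input layout (a list of pages, each a list of OCR text lines), the `word:`/`upper:` feature prefixes, and the rule of continuing onto later pages until 100 words are collected are all assumptions made for illustration. Font-size metadata is left out because the abstract does not say how the recognizer reports it.

```python
from collections import Counter


def tokenize(line):
    """Split a line on whitespace, strip surrounding punctuation, lowercase."""
    tokens = []
    for raw in line.split():
        word = raw.strip(".,;:!?()[]\"'").lower()
        if word:
            tokens.append(word)
    return tokens


def extract_features(pages, min_words=100):
    """Count word features from the first page of an OCR'd document.

    Hypothetical helper: `pages` is a list of pages, each a list of text lines.
    Later pages are read only while fewer than `min_words` words have been
    collected (an assumed way of handling near-empty header pages).
    Words on fully uppercase lines are also counted as separate features.
    """
    features = Counter()
    words_used = 0
    for page_index, lines in enumerate(pages):
        # The first page is always used in full.
        if page_index > 0 and words_used >= min_words:
            break
        for line in lines:
            # On later pages, stop as soon as the minimum is reached.
            if page_index > 0 and words_used >= min_words:
                break
            tokens = tokenize(line)
            # str.isupper() is True only if the line has cased characters
            # and all of them are uppercase.
            upper = line.isupper()
            for token in tokens:
                features["word:" + token] += 1
                if upper:
                    features["upper:" + token] += 1
            words_used += len(tokens)
    return features
```

The resulting counts could then be passed to any bag-of-words classifier. Keeping the uppercase-line words under a separate prefix lets the classifier weight them independently of ordinary body text.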
Classification; Document; Images; Type
University of Nevada, Las Vegas
Vergara, Jason Montgomery, "Document type classification from document images" (2007). UNLV Retrospective Theses & Dissertations. 2268.