Award Date
1-1-2007
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Committee Member
Kazem Taghva
Number of Pages
37
Abstract
The most common approach in classification systems is simply to treat every word as a feature and to estimate the probability of a document's category from those words. Given document images, sophisticated optical character recognizers can provide more than the plain text that traditional classification systems use. This metadata, together with additional features extracted from the document text, can improve classification of document images: we found a greater than 1% increase in recall when using font size metadata and extracting other features such as words used in uppercased lines. Because our dataset can contain multi-page documents, taking only the words on the first page increased recall by at least 15%. Recall increased by approximately 2% when we ensured that 100 words of every document were used; this can be explained by some documents having uninformative header pages that contain very few features.
Keywords
Classification; Document; Images; Type
Controlled Subject
Computer science
File Format
File Size
768 KB
Degree Grantor
University of Nevada, Las Vegas
Language
English
Permissions
If you are the rightful copyright holder of this dissertation or thesis and wish to have the full text removed from Digital Scholarship@UNLV, please submit a request to digitalscholarship@unlv.edu and include clear identification of the work, preferably with URL.
Repository Citation
Vergara, Jason Montgomery, "Document type classification from document images" (2007). UNLV Retrospective Theses & Dissertations. 2268.
http://dx.doi.org/10.25669/7fm1-s9xr
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/