Award Date
1-1-2007
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Committee Member
Kazem Taghva
Number of Pages
56
Abstract
A method of sequentially presented document determination using parallel analyses from various facets of structural document understanding and information retrieval is proposed in this thesis. Specifically, the method presented here intends to serve as a trainable system when determining where one document ends and another begins. Content analysis methods include use of the Vector Space Model, as well as targeted analysis of content on the margins of document fragments. Structural analysis for this implementation has been limited to simple and ubiquitous entities, such as software-generated zones, simple format-specific lines, and the appearance of page numbers. Analysis focuses on change in similarity between comparisons, with the emphasis placed on the fact that the extremities of documents tend to contain significant structural and lexical changes that can be observed and quantified. We combine the various features using nonlinear approximation (neural network) and experimentally test the usefulness of the combinations.
Keywords
Analysis; Boundary; Determination; Document; Lexical; Structural
Controlled Subject
Computer science
File Format
File Size
1413.12 KB
Degree Grantor
University of Nevada, Las Vegas
Language
English
Permissions
If you are the rightful copyright holder of this dissertation or thesis and wish to have the full text removed from Digital Scholarship@UNLV, please submit a request to digitalscholarship@unlv.edu and include clear identification of the work, preferably with URL.
Repository Citation
Cartright, Marc-Allen, "Document boundary determination using structural and lexical analysis" (2007). UNLV Retrospective Theses & Dissertations. 2155.
http://dx.doi.org/10.25669/8sj6-cjkl
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/
COinS