Award Date
5-2011
Degree Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science
First Committee Member
Kazem Taghva, Chair
Second Committee Member
Thomas Nartker
Third Committee Member
Laxmi Gewali
Fourth Committee Member
Ajoy Datta
Graduate Faculty Representative
Ashok Singh
Number of Pages
81
Abstract
In this dissertation, we investigate the effectiveness of information extraction in the presence of Optical Character Recognition (OCR) errors. It is well known that OCR errors have little to no effect on general retrieval tasks, mainly because of the redundancy of information in textual documents. Our work shows that the information extraction task, by contrast, is significantly influenced by OCR errors. Intuitively, this is because extraction algorithms rely on a small window of text surrounding the objects to be extracted.
We show that extraction methodologies based on Hidden Markov Models are not robust enough to deal with extraction in this noisy environment. We also show that both precise shallow parsing and fuzzy shallow parsing can be used to increase recall at the price of a significant drop in precision.
Most of our experimental work deals with the extraction of dates of birth and postal addresses. Both of these extraction tasks are part of general methods for identifying privacy information in textual documents. Identifying privacy information is particularly important when large collections of documents are posted on the Internet.
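The dependence on a small window of text, and the recall/precision trade-off of fuzzy matching, can be illustrated with a minimal sketch. This is not the dissertation's method: it substitutes a simple string-similarity score from Python's standard `difflib` module for approximate regular expressions, and the cue phrase, date pattern, function names, thresholds, and sample strings are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical cue phrase and date pattern, chosen for illustration only.
CUE = "date of birth"
DATE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_dob(text, threshold=1.0, window=20):
    """Return dates whose left context (a small window) resembles the cue.

    threshold=1.0 demands an exact cue match (precise matching);
    lower values tolerate OCR-style character errors (fuzzy matching).
    """
    results = []
    for m in DATE.finditer(text):
        left = text[max(0, m.start() - window):m.start()]
        # Slide a cue-sized window over the left context and keep the best score.
        starts = range(max(1, len(left) - len(CUE) + 1))
        best = max(similarity(left[i:i + len(CUE)], CUE) for i in starts)
        if best >= threshold:
            results.append(m.group(0))
    return results


clean = "Date of birth: 04/12/1961"
noisy = "Dafe of b1rth: 04/12/1961"   # simulated OCR errors in the cue
other = "Date of hire: 01/02/1999"    # a date that is not a date of birth

for label, text in [("clean", clean), ("noisy", noisy), ("other", other)]:
    print(label, extract_dob(text), extract_dob(text, threshold=0.75))
```

With these sample strings, exact matching misses the date behind the OCR-damaged cue, while lowering the threshold to 0.75 recovers it but also admits the "Date of hire" date: recall rises and precision falls.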
Keywords
Approximate regular expressions; Data mining; Hidden Markov models; Information extraction; Information retrieval; OCR; Optical character recognition
Disciplines
Computer Sciences | Theory and Algorithms
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Dissertation Defense Presentation
Repository Citation
Pereda, Ramon, "Information Extraction in an Optical Character Recognition Context" (2011). UNLV Theses, Dissertations, Professional Papers, and Capstones. 1061.
http://dx.doi.org/10.34917/2459072
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/
Comments
Attached file: 53 PowerPoint slides