Using the Google Web 1T 5-Gram Corpus for OCR Error Correction
Document Type
Conference Proceeding
Publication Date
5-23-2019
Publication Title
16th International Conference on Information Technology-New Generations (ITNG 2019)
Volume
800
First page number:
505
Last page number:
511
Abstract
In this paper we apply context-based orthographic error correction to the TREC-5 Confusion Track Data Set’s degrade5, attempting to correct errors generated during Optical Character Recognition (OCR). We do this by identifying all errors using OCRSpell, generating the 3-gram for each, and searching the Google Web 1T corpus of trigrams for its first and last words. We then select the candidates with the highest frequencies and a small Levenshtein edit distance. We report our accuracy and precision and discuss special situations and ways to improve performance. All source code is publicly available so that readers can extend or critique our work.
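The candidate-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `levenshtein` and `suggest`, the `max_distance` threshold, the ranking order (frequency first, then edit distance), the assumption that the erroneous word is the middle word of the 3-gram, and the toy trigram counts are all hypothetical and stand in for a lookup against the actual Google Web 1T trigram files.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(left, error, right, trigram_counts, max_distance=2):
    """Rank middle words of trigrams whose first and last words match.

    Assumption: the OCR error is the middle word of its 3-gram, and
    candidates further than max_distance edits away are discarded.
    """
    candidates = []
    for (w1, w2, w3), freq in trigram_counts.items():
        if w1 == left and w3 == right:
            dist = levenshtein(error, w2)
            if dist <= max_distance:
                candidates.append((w2, freq, dist))
    # Hypothetical ranking: highest frequency first, ties by smaller distance.
    candidates.sort(key=lambda c: (-c[1], c[2]))
    return [word for word, _, _ in candidates]


# Invented counts for illustration only; not real Web 1T frequencies.
TOY_TRIGRAMS = {
    ("the", "quick", "fox"): 120,
    ("the", "quiet", "fox"): 15,
    ("the", "red", "fox"): 300,
    ("a", "quick", "fox"): 80,
}

print(suggest("the", "qu1ck", "fox", TOY_TRIGRAMS))  # ['quick']
```

In this toy run "red" has the highest frequency but is rejected because it is five edits away from "qu1ck", which shows why frequency alone is not enough and an edit-distance filter is combined with it.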
Keywords
Google web1t; Optical character recognition; Context based corrections; Natural language processing; OCR post processing
Disciplines
Computer Sciences | Physical Sciences and Mathematics
Language
English
Repository Citation
Fonseca Cacho, J. R. (2019). Using the Google Web 1T 5-Gram Corpus for OCR Error Correction. 16th International Conference on Information Technology-New Generations (ITNG 2019), 800, 505-511.
http://dx.doi.org/10.1007/978-3-030-14070-0_71