Using the Google Web 1T 5-Gram Corpus for OCR Error Correction
16th International Conference on Information Technology-New Generations (ITNG 2019)
In this paper we apply context-based orthographic error correction by taking the TREC-5 Confusion Track Data Set’s degrade5 and attempting to correct errors generated during Optical Character Recognition (OCR). We do this by identifying all errors using OCRSpell, then generating the 3-gram for each error and searching for its first and last words in the Google Web 1T corpus of trigrams. We then select the candidates with the highest frequencies and a small Levenshtein edit distance. We report on our accuracy and precision, and discuss special situations and how to improve performance. All source code is publicly available for our readers to further our work or critique it.
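The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' published code. It assumes the erroneous word sits in the middle of the 3-gram, uses a small in-memory dictionary (the hypothetical `TOY_TRIGRAMS`) in place of the Google Web 1T trigram files, and ranks by frequency first and edit distance second. The function names, the `max_distance` threshold, and the `top_k` cutoff are all illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(left, error, right, trigram_counts, max_distance=2, top_k=3):
    """Return up to top_k middle words from trigrams matching (left, *, right).

    Assumption: candidates must be within max_distance edits of the OCR error
    and are ranked by descending frequency, then ascending edit distance.
    """
    candidates = []
    for (w1, w2, w3), freq in trigram_counts.items():
        if w1 == left and w3 == right:
            distance = levenshtein(error, w2)
            if distance <= max_distance:
                candidates.append((w2, freq, distance))
    candidates.sort(key=lambda c: (-c[1], c[2]))
    return [word for word, _, _ in candidates[:top_k]]


# Hypothetical toy counts standing in for the real trigram corpus.
TOY_TRIGRAMS = {
    ("the", "quick", "fox"): 500,
    ("the", "quiet", "fox"): 120,
    ("the", "red", "fox"): 900,
    ("a", "quick", "fox"): 300,
}

if __name__ == "__main__":
    print(suggest("the", "qnick", "fox", TOY_TRIGRAMS))
```

In this toy example, "red" is the most frequent middle word between "the" and "fox", but the edit-distance filter rejects it, which shows why frequency alone would not be enough to pick a correction.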
Google web1t; Optical character recognition; Context based corrections; Natural language processing; OCR post processing
Computer Sciences | Physical Sciences and Mathematics
Fonseca Cacho, J. R.
Using the Google Web 1T 5-Gram Corpus for OCR Error Correction.
16th International Conference on Information Technology-New Generations (ITNG 2019), 800