Using the Google Web 1T 5-Gram Corpus for OCR Error Correction

Document Type

Conference Proceeding

Publication Date

5-23-2019

Publication Title

16th International Conference on Information Technology-New Generations (ITNG 2019)

Volume

800

First page number

505

Last page number

511

Abstract

In this paper we apply context-based orthographic error correction to the degrade5 set of the TREC-5 Confusion Track Data Set, attempting to correct errors generated during Optical Character Recognition (OCR). We identify all errors using OCRSpell, then generate the 3-gram for each error and search the Google Web 1T corpus of trigrams for its first and last words. We then select the candidates with the highest frequencies and a small Levenshtein edit distance. We report our accuracy and precision, and discuss special situations and ways to improve performance. All source code is publicly available so that readers can extend or critique our work.
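
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the error sits in the middle of the 3-gram, uses a small in-memory dictionary (`TOY_COUNTS`, with invented counts) in place of the Google Web 1T trigram files, and ranks by frequency first and edit distance second. The names `levenshtein` and `suggest` and the parameters `max_distance` and `top_n` are hypothetical.

```python
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(trigram: Trigram,
            counts: Dict[Trigram, int],
            max_distance: int = 2,   # assumed threshold, not from the paper
            top_n: int = 3) -> List[str]:
    """Return replacement candidates for the middle word of `trigram`."""
    first, error, last = trigram
    candidates = []
    for (w1, w2, w3), freq in counts.items():
        # Keep only trigrams that share the first and last context words.
        if w1 == first and w3 == last:
            distance = levenshtein(error, w2)
            if distance <= max_distance:
                candidates.append((w2, freq, distance))
    # Assumed ranking: highest frequency first, then smallest edit distance.
    candidates.sort(key=lambda c: (-c[1], c[2]))
    return [word for word, _, _ in candidates[:top_n]]


# Invented counts standing in for the real corpus.
TOY_COUNTS: Dict[Trigram, int] = {
    ("the", "house", "was"): 900,
    ("the", "horse", "was"): 400,
    ("the", "weather", "was"): 1200,
    ("a", "house", "was"): 50,
}

print(suggest(("the", "hcuse", "was"), TOY_COUNTS))
```

In this toy run, "weather" is the most frequent middle word but is rejected because its edit distance from "hcuse" exceeds the threshold, which is the reason for combining frequency with a distance limit.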

Keywords

Google web1t; Optical character recognition; Context based corrections; Natural language processing; OCR post processing

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Language

English
