Computer Science Faculty Research

Using the Google Web 1T 5-Gram Corpus for OCR Error Correction

Jorge R. Fonseca Cacho, University of Nevada, Las Vegas

Document Type

Conference Proceeding

Publication Date

5-23-2019

Publication Title

16th International Conference on Information Technology-New Generations (ITNG 2019)

Volume

800

First page number:

505

Last page number:

511

Abstract

In this paper we apply context-based orthographic error correction to the degrade5 collection of the TREC-5 Confusion Track data set, attempting to correct errors generated during Optical Character Recognition (OCR). We identify all errors using OCRSpell, generate the corresponding 3-grams, and search the Google Web 1T corpus of trigrams for entries that match the first and last words. We then select the candidates with the highest frequencies and a small Levenshtein edit distance. We report our accuracy and precision, and discuss special situations and how to improve performance. All source code is publicly available for readers to extend or critique our work.
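The candidate selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the tiny `TRIGRAM_COUNTS` table stands in for the Google Web 1T trigram files, the function name `suggest`, the `max_distance` threshold, and the ranking order (frequency first, then edit distance) are all assumptions made for illustration, and the sketch assumes the OCR error is the middle word of the 3-gram.

```python
from typing import Dict, List, Tuple

# Hypothetical toy counts standing in for the Web 1T trigram data.
TRIGRAM_COUNTS: Dict[Tuple[str, str, str], int] = {
    ("the", "quick", "brown"): 5000,
    ("the", "quiet", "brown"): 40,
    ("the", "thick", "brown"): 300,
    ("the", "little", "brown"): 9000,
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(first: str, error: str, last: str,
            max_distance: int = 2, top_k: int = 3) -> List[str]:
    """Return middle-word candidates from trigrams matching first and last.

    Candidates farther than max_distance edits from the OCR token are
    dropped; the rest are ranked by frequency, then by edit distance.
    Both the threshold and the ranking are assumptions of this sketch.
    """
    candidates = []
    for (w1, w2, w3), count in TRIGRAM_COUNTS.items():
        if w1 == first and w3 == last:
            distance = levenshtein(error, w2)
            if distance <= max_distance:
                candidates.append((w2, count, distance))
    candidates.sort(key=lambda c: (-c[1], c[2]))
    return [word for word, _, _ in candidates[:top_k]]


print(suggest("the", "qu1ck", "brown"))
```

In this toy table, "little" has the highest frequency but is too far from the OCR token "qu1ck" to pass the distance filter, which shows why frequency alone is not enough to choose a correction.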

Keywords

Google Web 1T; Optical character recognition; Context-based corrections; Natural language processing; OCR post-processing

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Language

English

Repository Citation

Fonseca Cacho, J. R. (2019). Using the Google Web 1T 5-Gram Corpus for OCR Error Correction. 16th International Conference on Information Technology-New Generations (ITNG 2019), 800, 505-511.
http://dx.doi.org/10.1007/978-3-030-14070-0_71
