Computer Science Faculty Research

Generating Correction Candidates for OCR Errors Using BERT Language Model and FastText SubWord Embeddings

Mahdi Hajiali, University of Nevada, Las Vegas
Jorge Fonseca Cacho, University of Nevada, Las Vegas
Kazem Taghva, University of Nevada, Las Vegas

Document Type

Conference Proceeding

Publication Date

1-1-2022

Publication Title

Lecture Notes in Networks and Systems

Publisher

Springer

Publisher Location

New York, NY

Volume

283

First page number:

1045

Last page number:

1053

Abstract

In this paper, we present a method for identifying correction candidates for errors generated by Optical Character Recognition (OCR) during the post-processing stage, using the BERT language model and FastText subword embeddings. Our results show that, given an identified error, the model consistently generates the correct candidate for 70.9% of the errors.

Keywords

BERT; FastText; Language modeling; Natural language processing; OCR post processing

Disciplines

Other Computer Sciences | Programming Languages and Compilers

Repository Citation

Hajiali, M., Cacho, J. F., & Taghva, K. (2022). Generating Correction Candidates for OCR Errors Using BERT Language Model and FastText SubWord Embeddings. Lecture Notes in Networks and Systems, 283, 1045-1053. New York, NY: Springer.
http://dx.doi.org/10.1007/978-3-030-80119-9_69
