Generating Correction Candidates for OCR Errors Using BERT Language Model and FastText SubWord Embeddings

Document Type

Conference Proceeding

Publication Date

1-1-2022

Publication Title

Lecture Notes in Networks and Systems

Publisher

Springer

Publisher Location

New York, NY

Volume

283

First page number

1045

Last page number

1053

Abstract

In this paper, we present a method for generating correction candidates for errors produced by Optical Character Recognition (OCR). The method is applied during the post-processing stage and uses the BERT language model and FastText subword embeddings. Our results show that, given an identified error, the model consistently generates the correct candidate for 70.9% of the errors.
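
The abstract does not say how the two components are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes that a masked language model such as BERT proposes words that fit the context of an identified error, and that subword similarity to the erroneous token is then used to rerank them. Both parts are stand-ins: the context scores are hard-coded hypothetical values rather than real BERT output, and the similarity is a cosine over raw character n-gram counts rather than learned FastText vectors. The names `char_ngrams`, `cosine`, `rank_candidates`, and the `weight` parameter are all illustrative.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Count character n-grams of a word, using FastText-style < > boundary markers."""
    padded = f"<{word}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(error_token, context_scores, weight=0.5):
    """Rank candidates by a weighted mix of context score and subword similarity.

    The linear mix and the default weight are assumptions made for illustration.
    """
    error_grams = char_ngrams(error_token)
    scored = []
    for candidate, context_score in context_scores.items():
        similarity = cosine(error_grams, char_ngrams(candidate))
        combined = weight * context_score + (1 - weight) * similarity
        scored.append((candidate, combined))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores a masked language model might assign to the masked slot
# in "the [MASK] of the book"; these are NOT real BERT probabilities.
context_scores = {"author": 0.40, "title": 0.35, "cover": 0.25}

# "tit1e" is an OCR-style misreading of "title" (the letter l read as the digit 1).
ranked = rank_candidates("tit1e", context_scores)
for candidate, score in ranked:
    print(f"{candidate}: {score:.3f}")
```

In this toy example the context scores alone would favor "author", but the shared character n-grams between "tit1e" and "title" move "title" to the top of the ranking. A real system would replace the hard-coded scores with masked-token predictions and the count-based similarity with learned subword embeddings.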

Keywords

BERT; FastText; Language modeling; Natural language processing; OCR post processing

Disciplines

Other Computer Sciences | Programming Languages and Compilers
