Generating Correction Candidates for OCR Errors Using BERT Language Model and FastText SubWord Embeddings
Document Type
Conference Proceeding
Publication Date
1-1-2022
Publication Title
Lecture Notes in Networks and Systems
Publisher
Springer
Publisher Location
New York, NY
Volume
283
First Page Number
1045
Last Page Number
1053
Abstract
In this paper, we present a method for generating correction candidates for errors introduced by Optical Character Recognition (OCR). The method operates during the post-processing stage and uses the BERT language model and FastText subword embeddings. Our results show that, given an identified error, the model consistently generates the correct candidate for 70.9% of the errors.
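The abstract does not say how the two components are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' method. It assumes a context model has already proposed candidate words with probabilities (a hand-written dictionary stands in for BERT masked-token predictions here), and it re-ranks them by character n-gram similarity to the erroneous token (a simplified stand-in for FastText subword embeddings, which in reality are learned vectors). All function names, the `weight` parameter, and the example scores are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Count character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams[wrapped[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(error_token, context_scores, weight=0.5):
    """Rank candidates by a weighted mix of context score and subword similarity.

    context_scores is a hypothetical placeholder for masked language model
    probabilities; weight is an assumed mixing parameter, not from the paper.
    """
    error_grams = char_ngrams(error_token)
    scored = []
    for candidate, context_score in context_scores.items():
        similarity = cosine(error_grams, char_ngrams(candidate))
        scored.append((weight * context_score + (1 - weight) * similarity, candidate))
    scored.sort(reverse=True)
    return [candidate for _, candidate in scored]


# Hypothetical example: OCR read "model" as "rnodel" in "the language rnodel".
context_scores = {"model": 0.6, "system": 0.3, "modem": 0.1}  # made-up values
ranked = rank_candidates("rnodel", context_scores)
print(ranked)
```

In a real pipeline the dictionary would be replaced by the top predictions of a masked language model for the error position, and the n-gram counts by trained subword vectors. The weighted sum is just one of several reasonable ways to merge the two signals.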
Keywords
BERT; FastText; Language modeling; Natural language processing; OCR post processing
Disciplines
Other Computer Sciences | Programming Languages and Compilers
Repository Citation
Hajiali, M., Cacho, J. F., Taghva, K. (2022). Generating Correction Candidates for OCR Errors Using BERT Language Model and FastText SubWord Embeddings. Lecture Notes in Networks and Systems, 283, 1045-1053. New York, NY: Springer.
http://dx.doi.org/10.1007/978-3-030-80119-9_69