OCR Post Processing Using Support Vector Machines
Document Type
Conference Proceeding
Publication Date
7-4-2020
Publication Title
Science and Information Conference
Publisher
Springer
Publisher Location
London, United Kingdom
Volume
1229
First Page Number
694
Last Page Number
713
Abstract
In this paper, we introduce a set of detailed experiments using Support Vector Machines (SVM) to improve the accuracy of selecting the correct candidate word when correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test our SVM model, which then generalizes to the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.
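The three refinements named in the abstract (polynomial kernel, min-max normalization, SMOTE class balancing) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the abstract does not list the five features, so the two toy features and their values below are hypothetical, the function names are made up for illustration, and the SVM training step itself is omitted. Applying normalization before SMOTE is also an assumption about ordering.

```python
import random


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, nearest first, truncated to k neighbours.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> target
        synthetic.append([b + t * (q - b) for b, q in zip(base, target)])
    return synthetic


# Hypothetical toy data: two features per candidate word.
# Label 1 marks the correct candidate, which is the rare class.
features = [
    [10, 0.2], [200, 0.9], [50, 0.4], [400, 0.1], [75, 0.3],
    [30, 0.8], [120, 0.5], [90, 0.7],
]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

scaled = min_max_scale(features)
minority = [x for x, y in zip(scaled, labels) if y == 1]
majority = [x for x, y in zip(scaled, labels) if y == 0]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority))
balanced_X = scaled + synthetic
balanced_y = labels + [1] * len(synthetic)

print(balanced_y.count(0), balanced_y.count(1))  # 5 5
```

In practice one would more likely rely on established library implementations of these steps rather than hand-written versions; the sketch only shows what each step does to the data.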
Keywords
OCR; Support vector machines; SVM; OCR post processing; SMOTE
Disciplines
Computational Engineering | Systems and Communications
Language
English
Repository Citation
Fonseca Cacho, J. R., & Taghva, K. (2020). OCR Post Processing Using Support Vector Machines. Science and Information Conference, 1229, 694-713. London, United Kingdom: Springer.
http://dx.doi.org/10.1007/978-3-030-52246-9_51