OCR Post Processing Using Support Vector Machines
Science and Information Conference
London, United Kingdom
In this paper, we present a set of detailed experiments using Support Vector Machines (SVM) in an attempt to improve accuracy in selecting the correct candidate word for correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test our SVM model, which is then meant to generalize to the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.
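As a rough illustration of two of the preprocessing steps named in the abstract, the sketch below applies min-max normalization to a small feature matrix and then balances the classes with a simplified SMOTE-style interpolation. It is not the authors' implementation: the feature values, the labels, and the helper names `min_max_scale` and `smote_like` are hypothetical, and the oversampler only captures SMOTE's core idea (interpolating between a minority sample and one of its nearest minority neighbours). Training the polynomial-kernel SVM itself is left out.

```python
import random

def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: five features per candidate word,
# label 1 = correct candidate, label 0 = incorrect candidate.
features = [
    [3.0, 120.0, 1.0, 0.2, 5.0],
    [1.0, 900.0, 0.0, 0.9, 7.0],
    [2.0, 450.0, 1.0, 0.5, 6.0],
    [4.0, 60.0, 0.0, 0.1, 4.0],
    [1.0, 800.0, 1.0, 0.8, 7.0],
    [5.0, 30.0, 0.0, 0.3, 3.0],
]
labels = [0, 1, 0, 0, 1, 0]

scaled = min_max_scale(features)
minority = [x for x, y in zip(scaled, labels) if y == 1]
n_needed = labels.count(0) - labels.count(1)

# Append synthetic minority samples until both classes are the same size.
balanced_X = scaled + smote_like(minority, n_needed, k=1)
balanced_y = labels + [1] * n_needed
```

Scaling before oversampling keeps the nearest-neighbour distances from being dominated by the feature with the largest raw range (here, the hypothetical second column). In practice the scaling bounds and the synthetic samples would normally be derived from the training split only, so that the test data stays untouched.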
OCR; Support vector machines; SVM; OCR post processing; SMOTE
Computational Engineering | Systems and Communications
Fonseca Cacho, J. R.,
OCR Post Processing Using Support Vector Machines.
Science and Information Conference, 1229
London, United Kingdom: Springer.