OCR Post Processing Using Support Vector Machines

Document Type

Conference Proceeding

Publication Date

7-4-2020

Publication Title

Science and Information Conference

Publisher

Springer

Publisher Location

London, United Kingdom

Volume

1229

First page number

694

Last page number

713

Abstract

In this paper, we present a set of detailed experiments using Support Vector Machines (SVM) to improve accuracy in selecting the correct candidate word when correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test our SVM model, which will then generalize to the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.
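
The three refinements named in the abstract (min-max normalization, SMOTE-style class balancing, and a polynomial kernel) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The feature values are invented toy numbers, since the abstract does not name the five features, and the function names `min_max_scale`, `smote_like`, and `poly_kernel` are hypothetical. The oversampler is a simplified stand-in for SMOTE, and the kernel is shown on its own rather than inside a trained SVM.

```python
import random


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


# Toy data (assumption): five placeholder features per candidate word,
# label 1 = correct candidate (minority), label 0 = incorrect candidate.
X = [
    [1.0, 120.0, 0.2, 3.0, 10.0],
    [2.0, 80.0, 0.4, 1.0, 12.0],
    [1.0, 95.0, 0.3, 2.0, 11.0],
    [3.0, 15.0, 0.9, 0.0, 40.0],
    [4.0, 10.0, 0.7, 2.0, 35.0],
    [2.0, 30.0, 0.5, 5.0, 20.0],
    [5.0, 5.0, 0.1, 4.0, 50.0],
    [4.0, 12.0, 0.8, 1.0, 45.0],
]
y = [1, 1, 1, 0, 0, 0, 0, 0]

scaled = min_max_scale(X)
minority = [row for row, label in zip(scaled, y) if label == 1]

# Add enough synthetic minority points to match the majority class size.
synthetic = smote_like(minority, n_new=y.count(0) - y.count(1))
balanced_X = scaled + synthetic
balanced_y = y + [1] * len(synthetic)
```

In practice, a library such as scikit-learn (for scaling and the SVM) together with imbalanced-learn (for SMOTE) would likely replace these hand-written helpers. The sketch only makes the order of operations concrete: scale first, then oversample the minority class, then train a kernel classifier on the balanced set.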

Keywords

OCR; Support vector machines; SVM; OCR post processing; SMOTE

Disciplines

Computational Engineering | Systems and Communications

Language

English
