Building a Wikipedia N-GRAM Corpus

Document Type

Conference Proceeding

Publication Date

8-25-2020

Publication Title

Proceedings of SAI Intelligent Systems Conference

Publisher

IntelliSys 2020

Publisher Location

Virtual Event

Volume

1251

First page number

277

Last page number

294

Abstract

In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus data set, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
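
As a rough illustration of what a 1- to 5-gram count involves, the following minimal Python sketch tallies n-grams over a short piece of plain text. It is not the paper's algorithm: the names `tokenize` and `count_ngrams`, the lowercase regex tokenization, and the sample sentence are all assumptions made for this example, and it presumes the article text has already been extracted from the XML dump.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes.

    This tokenization rule is an illustrative assumption, not the paper's.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, max_n=5):
    """Return a dict mapping n (1..max_n) to a Counter of n-gram tuples."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample text standing in for one extracted article.
sample = "the cat sat on the mat and the cat slept"
ngram_counts = count_ngrams(tokenize(sample))
```

Keeping one counter per n-gram order makes each order easy to inspect separately. On a full dump the counts would not fit comfortably in memory this way, which is presumably where the space and time considerations mentioned in the abstract come in.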

Keywords

NGRAM; NLP; Wikipedia; OCR; Wiki

Disciplines

Databases and Information Systems | Software Engineering

Language

English
