Building a Wikipedia N-GRAM Corpus
Document Type
Conference Proceeding
Publication Date
8-25-2020
Publication Title
Proceedings of SAI Intelligent Systems Conference
Publisher
IntelliSys 2020
Publisher Location
Virtual Event
Volume
1251
First page number
277
Last page number
294
Abstract
In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus dataset, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
Keywords
NGRAM; NLP; Wikipedia; OCR; Wiki
Disciplines
Databases and Information Systems | Software Engineering
Language
English
Repository Citation
Fonseca Cacho, J. R., Cisneros, B., Taghva, K. (2020). Building a Wikipedia N-GRAM Corpus. Proceedings of SAI Intelligent Systems Conference, 1251, 277-294. Virtual Event: IntelliSys 2020.
http://dx.doi.org/10.1007/978-3-030-55187-2_23