"Building a Wikipedia N-GRAM Corpus" by Jorge Ramon Fonseca Cacho, Ben Cisneros et al.
 

Document Type

Conference Proceeding

Publication Date

8-25-2020

Publication Title

Proceedings of SAI Intelligent Systems Conference

Publisher

IntelliSys 2020

Publisher Location

Virtual Event

Volume

1251

First page number

277

Last page number

294

Abstract

In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus data set, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.

Keywords

NGRAM; NLP; Wikipedia; OCR; Wiki

Disciplines

Databases and Information Systems | Software Engineering

Language

English
