Award Date

12-2011

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Kazem Taghva, Chair

Second Committee Member

Laxmi Gewali

Third Committee Member

Ajoy Datta

Graduate Faculty Representative

Venki Mukhukumar

Number of Pages

85

Abstract

This thesis discusses feature selection algorithms for text categorization. Feature selection matters because it can make or break a categorization engine. The algorithms covered are Document Frequency, Information Gain, Chi Squared, Mutual Information, the NGL (Ng-Goh-Low) coefficient, and the GSS (Galavotti-Sebastiani-Simi) coefficient. The general idea of any feature selection algorithm is to score the importance of each word with some measure, keep the informative words, and remove the non-informative ones, which in turn helps the text categorization engine assign a document, D, to some category, C. Each method is explained and implemented, and results are reported for each. The thesis also describes how we gathered and constructed the training and testing data, along with the setup and storage techniques we used.
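As a rough illustration of the scoring idea described in the abstract, the sketch below computes Document Frequency and a Chi Squared score for each term over a tiny labeled corpus. It is not the thesis's implementation: the function names, the toy documents, and the tie-breaking rule are assumptions made for this example, and the Chi Squared formula is the standard 2x2 contingency-table form.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def chi_squared(docs, labels, category):
    """Chi Squared score of every term with respect to one category."""
    n = len(docs)
    n_cat = sum(1 for label in labels if label == category)
    df = document_frequency(docs)
    df_cat = document_frequency(
        [tokens for tokens, label in zip(docs, labels) if label == category]
    )
    scores = {}
    for term, df_t in df.items():
        a = df_cat.get(term, 0)  # in category, contains term
        b = df_t - a             # outside category, contains term
        c = n_cat - a            # in category, lacks term
        d = n - n_cat - b        # outside category, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A term that appears in every document carries no information.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["ball", "goal", "the"],
    ["goal", "team", "the"],
    ["vote", "law", "the"],
    ["law", "court", "the"],
]
labels = ["sports", "sports", "politics", "politics"]

df = document_frequency(docs)
scores = chi_squared(docs, labels, "sports")
top_terms = select_top_k(scores, 2)
```

In this toy corpus the word "the" occurs in every document and scores zero, while "goal" occurs only in the sports documents and scores highest. Chi Squared is symmetric, so "law", which occurs only outside the sports category, scores equally high; one-sided variants such as the NGL coefficient are meant to separate these two cases.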

Keywords

Applied sciences; Computational linguistics; Feature selection; Text categorization; Text processing (Computer science)

Disciplines

Computer Sciences | Databases and Information Systems | Systems Architecture

Language

English

