Award Date
12-2011
Degree Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science
First Committee Member
Kazem Taghva, Chair
Second Committee Member
Laxmi Gewali
Third Committee Member
Ajoy Datta
Graduate Faculty Representative
Venki Muthukumar
Number of Pages
85
Abstract
This thesis discusses feature selection algorithms for text categorization. Feature selection matters because the choice of features can make or break a categorization engine. The algorithms covered are Document Frequency, Information Gain, Chi-Squared, Mutual Information, the NGL (Ng-Goh-Low) coefficient, and the GSS (Galavotti-Sebastiani-Simi) coefficient. The general idea of any feature selection algorithm is to score the importance of words using some measure, keeping informative words and removing non-informative ones, which in turn helps the text-categorization engine assign a document, D, to some category, C. Each method is explained and implemented, and results are reported for it. The thesis also describes how the training and testing data were gathered and constructed, along with the setup and storage techniques used.
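As a rough illustration of the general idea in the abstract, the sketch below scores words with two of the named measures, Document Frequency and Chi-Squared, and keeps the top-ranked ones. It is not the thesis's implementation: the toy corpus, the function names (`document_frequency`, `chi_squared`, `select_features`), and the standard 2x2 contingency-table form of the Chi-Squared statistic are assumptions made here for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, category) pairs, invented for illustration.
docs = [
    (["ball", "team", "win"], "sports"),
    (["team", "score", "the"], "sports"),
    (["vote", "election", "the"], "politics"),
    (["vote", "senate", "the"], "politics"),
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def chi_squared(docs, term, category):
    """Chi-squared statistic for one (term, category) pair.

    Uses the 2x2 contingency table:
      a = docs in category with term,    b = docs outside category with term,
      c = docs in category without term, d = docs outside category without term.
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is constant across the corpus
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs, category, k):
    """Return the k terms with the highest chi-squared score for a category."""
    vocabulary = sorted({t for tokens, _label in docs for t in tokens})
    ranked = sorted(
        vocabulary,
        key=lambda t: chi_squared(docs, t, category),
        reverse=True,
    )
    return ranked[:k]
```

On this toy corpus, `select_features(docs, "sports", 2)` keeps both "team" and "vote": Chi-Squared is two-sided, so a word that signals absence from a category scores as high as one that signals membership. A very common word such as "the" has the highest document frequency but a low Chi-Squared score, which is the sense in which such measures separate informative from non-informative words.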
Keywords
Applied sciences; Computational linguistics; Feature selection; Text categorization; Text processing (Computer science)
Disciplines
Computer Sciences | Databases and Information Systems | Systems Architecture
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Dave, Kandarp, "Study of feature selection algorithms for text-categorization" (2011). UNLV Theses, Dissertations, Professional Papers, and Capstones. 1380.
http://dx.doi.org/10.34917/3274698
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/