Award Date

2009

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

Advisor 1

Kazem Taghva, Committee Chair

First Committee Member

Ajoy K. Datta

Second Committee Member

Laxmi P. Gewali

Graduate Faculty Representative

Muthukumar Venkatesan

Number of Pages

Abstract

Document clustering, or unsupervised document classification, is an automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. Many similarity functions, such as the dot product and cosine measures, have been proposed in the literature for this comparison.
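
As an illustration of the two similarity functions named above, the following is a minimal sketch that assumes documents are represented as raw term-count vectors over a fixed vocabulary. The helper names (`term_vector`, `dot_product`, `cosine_similarity`) and the sample vocabulary are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (assumed representation)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def dot_product(u, v):
    """Sum of element-wise products; grows with vector length as well as overlap."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalised by both vector lengths; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot_product(u, v) / (norm_u * norm_v)


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["cluster", "document", "query"]
doc_a = term_vector("document cluster document", vocabulary)
doc_b = term_vector("query document", vocabulary)

print(dot_product(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_b))
```

The cosine measure divides out vector length, so a long document and a short one with the same term proportions score as similar, whereas the raw dot product favours longer documents.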

In this thesis, we evaluate the effects a similarity function may have on clustering. We start by representing a document and a query as vectors in a high-dimensional space whose dimensions correspond to the keywords. We then use an appropriate distance measure in k-means to compute the similarity between the document vector and the query vector and form clusters. Based on these clusters, we decide the best distance metric for the document set used. Next, we compute time complexities for the different similarity functions, for the same model and document set, based on the number of iterations and the number of clusters.
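
The procedure described above can be sketched as k-means with a pluggable distance function. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names are hypothetical, the chi-square distance uses one common form (the sum of squared differences divided by the sum of the two components), centroids are initialised from the first `k` vectors for reproducibility, and the sample vectors are made up.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def canberra(u, v):
    # Terms whose denominator is zero are skipped.
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(u, v) if abs(a) + abs(b) > 0)


def chi_square(u, v):
    # Assumed form for non-negative vectors: sum (a - b)^2 / (a + b).
    return sum((a - b) ** 2 / (a + b) for a, b in zip(u, v) if a + b > 0)


def kmeans(vectors, k, distance, max_iterations=100):
    """Cluster vectors with the given distance; return labels, centroids, passes used."""
    # Assumption: deterministic initialisation from the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments unchanged, so the clustering has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, iteration


# Hypothetical term-count vectors, for illustration only.
vectors = [[5, 0, 1], [0, 4, 1], [4, 1, 0], [1, 5, 0]]
for name, distance in [("euclidean", euclidean),
                       ("canberra", canberra),
                       ("chi-square", chi_square)]:
    labels, _, iterations = kmeans(vectors, k=2, distance=distance)
    print(name, labels, iterations)
```

The returned pass count includes the final pass that confirms convergence, which gives a simple way to compare how many iterations each distance function needs on the same document set.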

Keywords

Canberra distances; Chi-Square; Data mining; Distances; Document clustering; Euclidean distances; Execution time; Information retrieval; K-means clustering algorithm; Similarity functions

Disciplines

Computer Sciences | Databases and Information Systems

File Format

pdf

Degree Grantor

University of Nevada, Las Vegas

Language

English

Repository Citation

Veni, Rushikesh, "Effects of similarity metrics on document clustering" (2009). UNLV Theses, Dissertations, Professional Papers, and Capstones. 71.
http://dx.doi.org/10.34870/1374214

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

Included in

Databases and Information Systems Commons

Digital Scholarship@UNLV

UNLV Theses, Dissertations, Professional Papers, and Capstones

Effects of similarity metrics on document clustering
