Award Date
2009
Degree Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science
Advisor 1
Kazem Taghva, Committee Chair
First Committee Member
Ajoy K. Datta
Second Committee Member
Laxmi P. Gewali
Graduate Faculty Representative
Muthukumar Venkatesan
Number of Pages
77
Abstract
Document clustering, or unsupervised document classification, is the automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. Many similarity functions, such as the dot product and the cosine measure, have been proposed in the literature for this comparison.
In this thesis, we evaluate the effects a similarity function may have on clustering. We start by representing a document and a query both as vectors in a high-dimensional space whose dimensions correspond to the keywords. We then use an appropriate distance measure in k-means to compute the similarity between the document vector and the query vector and to form clusters. Based on these clusters, we decide the best distance metric for the document set used. Next, we compute time complexities for different similarity functions for the same model and document set, based on the number of iterations and the number of clusters.
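As an illustration of the approach described above, the following is a minimal sketch of k-means with a pluggable distance function, not the thesis's actual implementation. The function names (`euclidean`, `canberra`, `chi_square`, `kmeans`), the `init` parameter, and the toy keyword-count vectors are all assumptions made for this example. The abstract does not give the exact formulas used, so the Chi-Square form below, sum of (x - y)^2 / (x + y), and the convention of skipping zero-denominator terms are common textbook choices rather than confirmed details.

```python
import math
import random
import time


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canberra(a, b):
    """Canberra distance; terms with a zero denominator are skipped (assumed convention)."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


def chi_square(a, b):
    """One common chi-square distance; assumes non-negative term weights."""
    total = 0.0
    for x, y in zip(a, b):
        denom = x + y
        if denom:
            total += (x - y) ** 2 / denom
    return total


def kmeans(vectors, k, distance, init=None, max_iter=100, seed=0):
    """Cluster `vectors` into k groups; returns (labels, iterations used).

    `init` is an optional list of k starting centroids (hypothetical
    parameter); if omitted, k input vectors are sampled at random.
    """
    if init is None:
        init = random.Random(seed).sample(vectors, k)
    centroids = [list(c) for c in init]
    labels = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, iteration


# Toy keyword-count vectors (made up for illustration): two documents per topic.
docs = [[3, 0, 1], [4, 0, 0], [0, 5, 2], [0, 4, 3]]

for name, dist in [("euclidean", euclidean),
                   ("canberra", canberra),
                   ("chi_square", chi_square)]:
    start = time.perf_counter()
    labels, iterations = kmeans(docs, 2, dist, init=[docs[0], docs[2]])
    elapsed = time.perf_counter() - start
    print(name, labels, iterations, f"{elapsed:.6f}s")
```

Passing the distance as a function keeps the clustering loop identical across metrics, so differences in cluster assignments, iteration counts, and elapsed time can be attributed to the metric alone. Fixing `init` makes runs comparable across metrics; with random initialization the results may differ between runs.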
Keywords
Canberra distances; Chi-Square; Data mining; Distances; Document clustering; Euclidean distances; Execution time; Information retrieval; K-means clustering algorithm; Similarity functions
Disciplines
Computer Sciences | Databases and Information Systems
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Veni, Rushikesh, "Effects of similarity metrics on document clustering" (2009). UNLV Theses, Dissertations, Professional Papers, and Capstones. 71.
http://dx.doi.org/10.34870/1374214
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/