Master of Science in Computer Science
First Committee Member
Second Committee Member
Third Committee Member
Fourth Committee Member
Number of Pages
Information retrieval is the process of finding information from an unstructured collection of data. The process of information retrieval involves building an index, commonly called an inverted file. As part of the inverted file, information retrieval algorithms often stem words to a common root. Stemming involves reducing a document term to its root. There are many ways to stem a word: affix removal and successor variety are two common categories of stemmers. The Porter Stemming Algorithm is a suffix removal stemmer that operates as a rule-based process on English words. We can think of stemming as a way to cluster related words together according to one common stem. However, sometimes Porter includes words in a cluster that are un-related. This experiment attempts to correct these stemming errors through the use of Formal Concept Analysis (FCA). FCA is the process of formulating formal concepts from a given formal context. A formal context consists of a set of objects, G, a set of attributes, M, and a binary relation I that indicates the attributes possessed by each object. A formal concept is formed by computing the closure of a subset of objects and attributes. Attribute selection is of critical importance in FCA; using the Cranfield document collection, this experiment attempted to view attributes as a function of word-relatedness and crafted a comparison measure between each word in the stemmed cluster using the Google Web 1T 5-gram data set. Using FCA to correct the clusters, the results showed a varying level of success for precision and recall values dependent upon the error threshold allowed.
Computer algorithms; Formal Concept Analysis; Information retrieval; Stemming
Computer Sciences | Library and Information Science
Hall, Guymon, "Using the Web 1T 5-Gram Database for Attribute Selection in Formal Concept Analysis to Correct Overstemmed Clusters" (2014). UNLV Theses, Dissertations, Professional Papers, and Capstones. 2089.