Electrical and Computer Engineering Faculty Presentations

A Method for Calculating Term Similarity on Large Document Collections

Wolfgang W. Bein, University of Nevada, Las Vegas
Jeffrey Coombs, University of Nevada, Las Vegas
Kazem Taghva, University of Nevada, Las Vegas

Meeting name

International Conference on Information Technology: Coding and Computing

Document Type

Conference Proceeding

Publication Date

4-28-2003

Abstract

We present an efficient algorithm, the Quadtree Heuristic, for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the expected mutual information measure (EMIM). Because the similarity lists are intended to improve information retrieval (IR), we report an experiment that measures the performance of an IR engine designed to use them. Two methods were used to generate the similarity lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with that of the brute-force list.
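
The abstract does not state the EMIM formula or describe the Quadtree Heuristic itself, so the following is only a minimal sketch of what a brute-force baseline might look like. It assumes the common document-level form of EMIM, in which each term is treated as a binary present/absent variable per document and the measure is the mutual information of the two variables. The function names `emim` and `brute_force_similarity_lists`, the parameter `k`, and the toy documents are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict
from itertools import combinations


def emim(n_ij, n_i, n_j, n_docs):
    """Mutual information of two binary term-occurrence variables.

    n_ij: documents containing both terms; n_i, n_j: documents containing
    each term; n_docs: collection size. (Assumed formulation, see lead-in.)
    """
    # Each cell: (joint count, marginal count for term i, marginal for term j).
    cells = [
        (n_ij, n_i, n_j),                                          # both present
        (n_i - n_ij, n_i, n_docs - n_j),                           # only i present
        (n_j - n_ij, n_docs - n_i, n_j),                           # only j present
        (n_docs - n_i - n_j + n_ij, n_docs - n_i, n_docs - n_j),   # neither present
    ]
    total = 0.0
    for joint, margin_i, margin_j in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            total += (joint / n_docs) * math.log(joint * n_docs / (margin_i * margin_j))
    return total


def brute_force_similarity_lists(docs, k=3):
    """Score every co-occurring term pair and keep the top-k partners per term."""
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc))
        for term in terms:
            doc_freq[term] += 1
        for a, b in combinations(terms, 2):
            pair_freq[(a, b)] += 1

    scored = defaultdict(list)
    for (a, b), n_ab in pair_freq.items():
        score = emim(n_ab, doc_freq[a], doc_freq[b], n_docs)
        scored[a].append((score, b))
        scored[b].append((score, a))

    # Highest score first; ties broken alphabetically for determinism.
    return {
        term: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for term, pairs in scored.items()
    }


docs = [
    ["car", "auto", "engine"],
    ["car", "auto"],
    ["tree", "leaf"],
    ["tree", "leaf", "engine"],
]
lists = brute_force_similarity_lists(docs, k=1)
print(lists["car"], lists["tree"])
```

This sketch scores only pairs that co-occur in at least one document, which is a simplification: mutual information is also large for terms that systematically avoid each other. The pairwise loop grows quadratically with the number of terms per document, which is the kind of cost a heuristic such as the one named in the abstract would aim to reduce.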

Keywords

Brute force technique; Code words; EMIM; Expected Mutual Information Measure; Heuristic algorithms; Information retrieval; IR engine; Keyword searching; Large document collections; Quadtree Heuristic; Quadtrees; Similarity lists; Synonyms; Term similarity

Permissions

Use Find in Your Library, contact the author, or interlibrary loan to garner a copy of the item. Publisher policy does not allow archiving the final published version. If a post-print (author's peer-reviewed manuscript) is allowed and available, or publisher policy changes, the item will be deposited.

Repository Citation

Bein, W. W., Coombs, J., Taghva, K. (2003, April). A Method for Calculating Term Similarity on Large Document Collections. Presentation at International Conference on Information Technology: Coding and Computing,

Available at: https://digitalscholarship.unlv.edu/ece_presentations/33
