Award Date
August 2018
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Committee Member
Kazem Taghva
Second Committee Member
Laxmi Gewali
Third Committee Member
Justin Zhan
Fourth Committee Member
Fatma Nasoz
Fifth Committee Member
Ashok Singh
Sixth Committee Member
Kathryn Hausbeck Korgan
Number of Pages
79
Abstract
This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work differs from past experiments in the size of the training data, the unbalanced distribution of data between the minority and majority classes, and the modified data cleaning procedures. These experiments are also based on the principles of tidy data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression.
Of particular interest were strategies to improve the recall level for the minority class, as the cost of misclassification is prohibitive. One of the main contributions of this work is that logistic regression with the proper predictors and class weight gives the highest precision/recall level for the minority class.
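As a rough illustration of how a class weight shifts the cost of misclassifying the minority class, the sketch below computes weights inversely proportional to class frequency and plugs them into a weighted log-loss. It is not the dissertation's code: the function names, the toy labels, and the choice of the `n_samples / (n_classes * class_count)` weighting rule are assumptions made for this example.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Class-weighted mean negative log-likelihood for binary labels (0/1).

    probs[i] is the predicted probability that labels[i] == 1.
    """
    eps = 1e-12  # guard against log(0)
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical toy labels: 1 marks the minority class, 0 the majority class.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
```

With these toy labels the minority class receives four times the weight of the majority class, so a model that assigns low probability to every minority example is penalized more heavily than it would be under uniform weights.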
In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to identify and omit correlated predictors. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are:
· Detailed process for data cleaning and binary classification of 338,596 breast cancer patients.
· Computational approach for omitting predictors and categorical predictors based on VIF and GVIF.
· Various applications of the Synthetic Minority Over-sampling Technique (SMOTE) to increase precision and recall.
· An application of Edited Nearest Neighbor to obtain the highest F1-measure.
In addition, this work provides precise algorithms and code for determining class membership and executing the competing methods. This code can facilitate the reproduction and extension of our work by other researchers.
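The VIF-based omission of predictors mentioned in the abstract can be sketched as follows. This is an illustrative example only, not the dissertation's procedure: it handles numeric predictors (the GVIF extension for categorical predictors is not shown), it uses the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix, and the function names, the toy data, and the cutoff of 10 are all assumptions.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_high_vif(named_columns, threshold=10.0):
    """Repeatedly drop the predictor with the largest VIF above the threshold."""
    kept = dict(named_columns)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda i: scores[i])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Hypothetical toy predictors: "a" and "b" are strongly correlated.
predictors = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 5.0],
    "c": [1.0, -1.0, -1.0, 1.0],
}
kept = drop_high_vif(predictors)
```

On this toy data the VIFs of "a", "b", and "c" are about 170, 175, and 6, so "b" is dropped first; the remaining two predictors are uncorrelated and both are kept. Dropping one predictor at a time and recomputing matters because removing a single member of a correlated group lowers the VIF of the others.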
Keywords
Cancer; Machine Learning; Sampling; SEER Dataset; SMOTE; Survivability
Disciplines
Computer Sciences
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Bozorgi, Mandana, "Application of Machine Learning in Cancer Research" (2018). UNLV Theses, Dissertations, Professional Papers, and Capstones. 3353.
http://dx.doi.org/10.34917/14139867
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/