Doctor of Philosophy (PhD)
Sixth Committee Member
Kathryn Hausbeck Korgan
This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work differs from past experiments in the size of the training data, the unbalanced distribution of data between the minority and majority classes, and its modified data cleaning procedures. The experiments also follow the principles of tidy data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression.
Of particular interest were strategies to improve recall for the minority class, since the cost of misclassifying these cases is prohibitive. One of the main contributions of this work is the finding that logistic regression with the proper predictors and class weights gives the highest precision and recall for the minority class.
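To make the minority-class metrics and the idea of class weighting concrete, here is a minimal sketch in plain Python. The labels are made-up toy data, the function names are hypothetical, and the "balanced" weighting rule (n_samples / (n_classes * class_count)) is only one common heuristic; the abstract does not state which weights the dissertation used.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rarer classes receive larger weights, so their errors cost more
    when the weights are passed to a weighted loss function.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels (assumption): 1 marks the minority class, 0 the majority.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

weights = balanced_class_weights(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(weights, precision, recall, f1)
```

Accuracy on these toy labels is 80%, yet only half of the minority cases are recovered, which is why the abstract reports precision and recall for the minority class instead of overall accuracy.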
In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to address the correlation problem. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are:
· Detailed process for data cleaning and binary classification of 338,596 breast cancer patients.
· Computational approach for omitting predictors and categorical predictors based on VIF and GVIF.
· Various applications of the Synthetic Minority Over-sampling Technique (SMOTE) to increase precision and recall.
· An application of Edited Nearest Neighbor to obtain the highest F1-measure.
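The two resampling ideas named in the list above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the toy points, and the parameter values are assumptions, and the specific SMOTE variants and Edited Nearest Neighbor settings used in the dissertation are not given in the abstract. SMOTE creates a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbors. Edited Nearest Neighbor, in the simple form shown here, drops any sample whose label disagrees with the majority of its k nearest neighbors.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nn
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic


def edited_nearest_neighbours(points, labels, k=3):
    """Return indices of samples whose label matches a strict majority
    of their k nearest neighbours; all other samples are edited out."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        others = [(q, l) for j, (q, l) in enumerate(zip(points, labels))
                  if j != i]
        nearest = sorted(others, key=lambda ql: math.dist(p, ql[0]))[:k]
        agree = sum(1 for _, l in nearest if l == y)
        if 2 * agree > k:
            keep.append(i)
    return keep


# Toy data (assumption): three minority points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
synthetic = smote(minority, n_new=4)

# Toy data (assumption): index 3 carries label 1 but sits inside the
# cluster of label-0 points, so the editing step should remove it.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.05, 0.05),
          (5, 5), (5.1, 5), (5, 5.1)]
labels = [0, 0, 0, 1, 1, 1, 1]
kept = edited_nearest_neighbours(points, labels)
print(synthetic, kept)
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them. Library implementations offer more options than this sketch, for example restricting the editing step to the majority class.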
In addition, this work provides precise algorithms and code for determining class membership and for executing the competing methods. This code can facilitate the reproduction and extension of our work by other researchers.
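As an illustration of the VIF screening mentioned in the abstract, the following minimal sketch computes the VIF of each numeric predictor as the corresponding diagonal entry of the inverse correlation matrix, which equals 1 / (1 - R²) for the regression of that predictor on the others. The function names and the toy columns are assumptions, and this is not the dissertation's own code. GVIF for categorical predictors is not covered here.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy predictors (assumption): x2 is nearly 2 * x1, x3 is not built from them.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 4.0, 5.9, 8.1, 10.0, 11.9]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]
vifs = vif([x1, x2, x3])
print(vifs)
```

In this toy example the nearly collinear pair x1 and x2 receives very large VIF values while x3 stays close to 1. A screening loop would typically drop the predictor with the largest VIF and recompute until every value falls below a chosen threshold; the threshold itself is a modeling choice.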
Cancer; Machine Learning; Sampling; SEER Dataset; SMOTE; Survivability
Bozorgi, Mandana, "Application of Machine Learning in Cancer Research" (2018). UNLV Theses, Dissertations, Professional Papers, and Capstones. 3353.