Award Date

August 2018

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi Gewali

Third Committee Member

Justin Zhan

Fourth Committee Member

Fatma Nasoz

Fifth Committee Member

Ashok Singh

Sixth Committee Member

Kathryn Hausbeck Korgan

Number of Pages

79

Abstract

This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work is distinguished from past experiments by the size of the training data, the unbalanced distribution of data between the minority and majority classes, and modified data cleaning procedures. These experiments are also based on the principles of tidy data and reproducible research. In order to fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression.

Of particular interest were strategies to improve the recall level for the minority class, as the cost of misclassification is prohibitive. One of the main contributions of this work is that logistic regression with the proper predictors and class weight gives the highest precision/recall level for the minority class.
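As a minimal illustration of the metrics discussed above, the following sketch computes precision, recall, and F1 for a single class from true and predicted labels. The function name, the label encoding (1 for the minority class), and the toy labels are assumptions made for this example; they are not taken from the dissertation's code or data.

```python
def minority_class_metrics(y_true, y_pred, minority_label=1):
    """Return (precision, recall, f1) for the class given by minority_label.

    The label encoding is an assumption for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: minority cases predicted as minority.
    tp = sum(1 for t, p in pairs if t == minority_label and p == minority_label)
    # False positives: majority cases predicted as minority.
    fp = sum(1 for t, p in pairs if t != minority_label and p == minority_label)
    # False negatives: minority cases predicted as majority.
    fn = sum(1 for t, p in pairs if t == minority_label and p != minority_label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 3 minority cases out of 10 (made up for the example).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
precision, recall, f1 = minority_class_metrics(y_true, y_pred)
```

On unbalanced data, overall accuracy can remain high even when minority recall is poor, which is why these per-class figures are reported separately.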

In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to address the correlation problem. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are:

· Detailed process for data cleaning and binary classification of 338,596 breast cancer patients.

· Computational approach for omitting predictors and categorical predictors based on VIF and GVIF.

· Various applications of Synthetic Minority Over-sampling Techniques (SMOTE) to increase precision and recall.

· An application of Edited Nearest Neighbor to obtain the highest F1-measure.
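To make the VIF-based omission of predictors more concrete, the sketch below computes the VIF of each numeric predictor as the corresponding diagonal entry of the inverse correlation matrix, then drops the predictor with the largest VIF until all remaining values fall under a threshold. This is only one common way to do it, not the dissertation's procedure: the function names, the threshold of 5.0, and the toy columns are assumptions, and GVIF for categorical predictors is not covered.

```python
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    centered = [[x - mean(c) for x in c] for c in columns]
    norms = [sum(x * x for x in c) ** 0.5 for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_high_vif(named_columns, threshold=5.0):
    """Repeatedly drop the predictor with the largest VIF above the threshold.

    The threshold is a rule-of-thumb assumption, not a value from the source.
    """
    kept = dict(named_columns)
    while len(kept) > 1:
        scores = dict(zip(kept, vif(list(kept.values()))))
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)


# Toy predictors (made up): "b" is nearly a copy of "a".
predictors = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
    "c": [2, -1, 0, 1, -2, 0],
}
remaining = drop_high_vif(predictors)
```

In this toy data, one of the two nearly identical predictors is removed while the weakly correlated one is kept. In practice a library routine, such as the VIF function in statsmodels, would normally replace the hand-written matrix inversion.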

In addition, this work provides precise algorithms and code for determining class membership and executing the competing methods. This code can facilitate the reproduction and extension of our work by other researchers.

Keywords

Cancer; Machine Learning; Sampling; SEER Dataset; SMOTE; Survivability

Disciplines

Computer Sciences

Language

English

