Award Date

May 2019

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Fatma Nasoz

Second Committee Member

Ajoy Datta

Third Committee Member

Kazem Taghva

Fourth Committee Member

Mira Han

Number of Pages

Abstract

Cancer is a group of diseases characterized by the uncontrolled growth and spread of abnormal cells. Genes control how cells manufacture proteins. Each gene must carry the correct instructions for making its protein so that the protein can perform its proper function in the cell. Cancer begins when one or more genes in a cell mutate and produce an abnormal protein. An abnormal protein provides different information than a normal protein, which can cause cells to multiply uncontrollably.

RNA sequencing (RNA-seq) can be used to identify the functional and structural changes that occur in genes. RNA-seq can profile the abundance and composition of the entire transcriptome, and it is a highly sensitive and accurate tool for measuring expression across the transcriptome. Because it can reveal the changes affecting genes, RNA-seq can be valuable for diagnosing and characterizing tumors.

To build models that classify transcriptome data by tissue, we used data from The Cancer Genome Atlas (TCGA). Using this data, we implemented and compared several machine learning classification algorithms, including decision trees, support vector machines, random forest classifiers, K-nearest neighbors, and a stochastic gradient descent classifier, to assign gene expression data to a particular tissue. Among all the classifiers used, the stochastic gradient descent classifier with the squared hinge loss function had the best performance on traditional machine learning metrics.
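As a rough illustration of what a stochastic gradient descent classifier with squared hinge loss does, the following is a minimal from-scratch sketch and not the implementation used in the thesis. It assumes a binary problem with labels in {-1, +1}, a plain linear model, a fixed learning rate, and a fixed sample order for reproducibility. The function names and the toy data are hypothetical. A multi-tissue classifier would typically combine several such binary models, for example one-vs-rest.

```python
def squared_hinge(y, score):
    """Squared hinge loss for a label y in {-1, +1} and a raw model score."""
    margin = 1.0 - y * score
    return margin * margin if margin > 0 else 0.0


def train_sgd(samples, labels, epochs=50, lr=0.05):
    """Fit a linear model w.x + b by per-sample gradient steps on squared hinge loss.

    Samples are visited in a fixed order to keep the sketch deterministic;
    practical SGD usually shuffles the data each epoch.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            margin = 1.0 - y * score
            if margin > 0:
                # Gradient of max(0, margin)^2 w.r.t. w is -2 * margin * y * x,
                # so the descent step adds 2 * lr * margin * y * x.
                step = 2.0 * lr * margin * y
                w = [wj + step * xj for wj, xj in zip(w, x)]
                b += step
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data (two hypothetical "expression" features).
TOY_X = [[2.0, 1.0], [0.5, 2.0], [3.0, 2.0], [1.0, 3.0], [2.5, 0.5], [0.0, 2.5]]
TOY_Y = [1, -1, 1, -1, 1, -1]

w, b = train_sgd(TOY_X, TOY_Y)
predictions = [predict(w, b, x) for x in TOY_X]
```

Unlike the plain hinge loss, the squared variant penalizes margin violations quadratically, so samples far on the wrong side of the boundary produce proportionally larger updates.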

RNA sequencing is an easy test to perform and is done in most labs, so the quality of the data varies greatly. To test whether the model built on TCGA transcriptome data generalizes well, it was tested on Genotype-Tissue Expression (GTEx) data, which is completely independent of the TCGA data. The results were lower than those on the training data, and the most evident misclassification was between esophagus and stomach. When only TCGA data was considered, the classification test accuracy was 95.46 percent; when GTEx data was also considered, the training accuracy was 91 percent and the test accuracy was 58 percent.
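The kind of evaluation described above, measuring accuracy on an independent dataset and finding which tissues are confused with each other, can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the evaluation code from the thesis. The label lists are made up for the example and do not reproduce the TCGA or GTEx results.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted tissue matches the true tissue."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs for misclassified samples only."""
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )


def most_confused_pair(true_labels, predicted_labels):
    """Return the unordered tissue pair with the most errors in either direction."""
    pair_totals = Counter()
    for (t, p), n in confusion_counts(true_labels, predicted_labels).items():
        pair_totals[frozenset((t, p))] += n
    if not pair_totals:
        return None
    pair, _ = pair_totals.most_common(1)[0]
    return tuple(sorted(pair))


# Made-up labels standing in for an independent test set.
TRUE = ["Esophagus", "Esophagus", "Stomach", "Stomach",
        "Lung", "Lung", "Liver", "Liver"]
PRED = ["Stomach", "Esophagus", "Esophagus", "Stomach",
        "Lung", "Liver", "Liver", "Liver"]

print(accuracy(TRUE, PRED))            # 0.625
print(most_confused_pair(TRUE, PRED))  # ('Esophagus', 'Stomach')
```

Counting errors per unordered pair, rather than per direction, makes a mutual confusion such as esophagus versus stomach stand out even when the errors are split between the two directions.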

Disciplines

Computer Sciences

File Format

pdf

Degree Grantor

University of Nevada, Las Vegas

Language

English

Repository Citation

Chintham Reddy, Lohitha, "Machine Learning Prediction of Primary Tissue Origin of Cancer from Gene Expression Read Counts" (2019). UNLV Theses, Dissertations, Professional Papers, and Capstones. 3589.
http://dx.doi.org/10.34917/15778420

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

Included in

Computer Sciences Commons

Digital Scholarship@UNLV

UNLV Theses, Dissertations, Professional Papers, and Capstones

Machine Learning Prediction of Primary Tissue Origin of Cancer from Gene Expression Read Counts
