Award Date

May 2019

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Fatma Nasoz

Second Committee Member

Ajoy Datta

Third Committee Member

Kazem Taghva

Fourth Committee Member

Mira Han

Number of Pages

57

Abstract

Cancer is a group of diseases characterized by the uncontrolled growth and spread of abnormal cells. Genes control how cells manufacture proteins: each gene must carry the correct instructions for making its protein so that the protein can perform its proper function in the cell. Cancer begins when one or more genes in a cell mutate and produce an abnormal protein. An abnormal protein conveys different information than a normal protein, which can cause cells to multiply uncontrollably.

RNA sequencing (RNA-seq) can be used to identify the functional and structural changes that occur in genes. RNA-seq can profile the abundance and composition of the entire transcriptome, and it is a highly sensitive and accurate tool for measuring expression across the transcriptome. Because it can reveal the changes affecting genes, RNA-seq can be valuable for diagnosing and characterizing tumors.

To build models that classify transcriptome data by tissue, we used data from The Cancer Genome Atlas (TCGA). Using this data, we implemented and compared several machine learning classification algorithms, including decision trees, support vector machines, random forest classifiers, k-nearest neighbors, and a stochastic gradient descent classifier, to assign gene expression data to a particular tissue. Among all the classifiers, the stochastic gradient descent classifier with a squared hinge loss function performed best on traditional machine learning metrics.
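
As an illustration of the technique named above, the following is a minimal sketch of a linear classifier trained by stochastic gradient descent on the squared hinge loss, max(0, 1 - y f(x))^2. It is not the thesis implementation: it handles only two classes, uses plain Python, and runs on synthetic two-feature data. The function names (`train_sgd`, `make_toy_data`), the learning rate, and the epoch count are all assumptions chosen for the example.

```python
import random


def squared_hinge_loss(score, label):
    """Squared hinge loss for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score) ** 2


def train_sgd(samples, labels, epochs=50, lr=0.01, seed=0):
    """Fit weights w and bias b by per-sample gradient steps (hypothetical helper)."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            slack = max(0.0, 1.0 - y * score)
            if slack > 0.0:
                # Gradient of slack**2 w.r.t. w is -2 * slack * y * x,
                # so the descent step adds lr * 2 * slack * y * x.
                step = lr * 2.0 * slack * y
                w = [wj + step * xj for wj, xj in zip(w, x)]
                b += step
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def make_toy_data(n_per_class=50, seed=1):
    """Two Gaussian clusters standing in for two tissue classes (synthetic)."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, center in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            samples.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            labels.append(label)
    return samples, labels


samples, labels = make_toy_data()
w, b = train_sgd(samples, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(samples, labels)) / len(samples)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

A multi-tissue problem would need one such classifier per class (one-vs-rest) or a library implementation; the sketch only shows how the squared hinge loss drives each update.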

RNA sequencing is an easy test to perform and is done in most labs, so the quality of the data varies greatly. To test whether the model built on TCGA transcriptome data generalizes well, we evaluated it on Genotype-Tissue Expression (GTEx) data, which is completely independent of the TCGA data. The results were lower than those on the training data, and the most evident misclassification was between esophagus and stomach. With TCGA data alone, the classification test accuracy was 95.46 percent; when GTEx data was also considered, the training accuracy was 91 percent and the test accuracy was 58 percent.
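
The kind of evaluation described above can be sketched as follows: compute overall accuracy on an independent test set, then count which pairs of tissues are confused most often. The labels below are made-up toy values used only to exercise the functions; they are not results from the thesis, and the helper names are assumptions.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted tissue matches the true tissue."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count each (true tissue, predicted tissue) pair."""
    return Counter(zip(true_labels, predicted_labels))


def misclassified_pairs(counts):
    """Keep only off-diagonal pairs, most frequent first."""
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common()


# Toy labels for illustration only (not thesis data).
true_labels = ["Esophagus", "Esophagus", "Stomach", "Stomach", "Lung", "Lung"]
predicted = ["Stomach", "Esophagus", "Esophagus", "Stomach", "Lung", "Lung"]

counts = confusion_counts(true_labels, predicted)
test_accuracy = accuracy(true_labels, predicted)
errors = misclassified_pairs(counts)

print(f"accuracy: {test_accuracy:.2f}")
for (true_tissue, predicted_tissue), n in errors:
    print(f"{true_tissue} predicted as {predicted_tissue}: {n}")
```

Listing the off-diagonal pairs separately from the overall accuracy makes it easier to see when most errors concentrate in one pair of related tissues.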

Disciplines

Computer Sciences

Language

English

