Award Date

5-1-2019

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science

First Committee Member

Fatma Nasoz

Second Committee Member

Ajoy Datta

Third Committee Member

Kazem Taghva

Fourth Committee Member

Mira Han

Number of Pages

73

Abstract

Cancer is one of the leading causes of death globally and was responsible for approximately 9.6 million deaths in 2018. Among the main reasons for deaths from cancer are late-stage presentation and inaccessible diagnosis and treatment. Cancer often spreads from the part of the body where it started (the primary site) to a different part of the body (the metastatic site). Identifying the primary site plays a key role because it directs the appropriate treatment: a cancer that has spread needs the same treatment as the cancer at its site of origin. Knowing the primary site can therefore help doctors decide on the type of treatment.

All cancers begin when one or more genes in a cell mutate and create abnormal proteins that cause cells to multiply uncontrollably. Genes are present in the DNA of each cell in the human body, and research shows that distinct, abnormal patterns of DNA methylation are observed in cancers. DNA methylation is also considered an early and fundamental step in the transformation of normal tissue. Since DNA methylation is tissue-specific and changes with cell differentiation, methylation sites are good markers for identifying tissues of origin.

In this thesis, we propose the use of machine learning techniques to identify the primary sites of cancers to increase the accuracy of diagnosis and treatment.

For this purpose, we implemented several machine learning classification algorithms (support vector machines, random forests, decision trees, and k-nearest neighbors) to classify tumor samples by their tissue of origin, and compared these models using traditional machine learning metrics. The models were trained and tested on features extracted from the DNA methylation datasets maintained by The Cancer Genome Atlas (TCGA). The experimental results showed that support vector machines could predict the primary sites with 95% training accuracy. The model gave 86% accuracy when tested on a completely independent dataset collected from the Gene Expression Omnibus (GEO).
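
As a rough illustration of the classification setup described above, the following minimal sketch applies a k-nearest-neighbors classifier to methylation-like features and reports held-out accuracy. It is not the thesis implementation: the data are synthetic, the tissue profiles and CpG values are made-up numbers, and the function names (knn_predict, accuracy, make_samples) are hypothetical and chosen only for this example.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, sample, k=3):
    """Return the majority label among the k training samples nearest to `sample`."""
    distances = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test samples whose predicted tissue matches the true tissue."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def make_samples(rng, profile, n, noise=0.05):
    """Draw n synthetic samples of beta values (clipped to [0, 1]) around a profile."""
    return [
        [min(1.0, max(0.0, rng.gauss(mean, noise))) for mean in profile]
        for _ in range(n)
    ]


rng = random.Random(0)

# Assumption: made-up mean methylation levels at four CpG sites per tissue.
profiles = {
    "breast": [0.9, 0.1, 0.8, 0.2],
    "lung": [0.1, 0.9, 0.2, 0.8],
    "colon": [0.5, 0.5, 0.9, 0.1],
}

train_x, train_y, test_x, test_y = [], [], [], []
for tissue, profile in profiles.items():
    train_x += make_samples(rng, profile, 20)
    train_y += [tissue] * 20
    test_x += make_samples(rng, profile, 5)
    test_y += [tissue] * 5

held_out_accuracy = accuracy(train_x, train_y, test_x, test_y, k=3)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The same train-then-evaluate pattern would apply to the other classifiers named above; only the prediction function changes. Real methylation data has far more features and much less separation between tissues than this toy example.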

Keywords

Cancer; DNA methylation; Machine learning; Primary tissue; Tissue

Disciplines

Computer Sciences

Language

English

Available for download on Friday, May 15, 2020

