

The human genome contains roughly 22,000 protein-coding genes, many of which play important roles in biological function. The proteins they encode fold into three-dimensional structures, and correct folding is usually required for function. A genetic variant can disrupt a protein's secondary structure (one level of its structure) or eliminate a site important for protein-protein interaction or post-translational modification. The resulting loss of function or deregulation can cause disease. Thus, there is great biomedical interest in identifying disease-causing single-nucleotide variants.

We hypothesize that we can accurately predict variant pathogenicity. We used machine learning to predict the pathogenicity of a set of 28,369 single-nucleotide variants across 10 genes. The data were obtained from publicly available saturation mutagenesis data sets, in which every possible amino acid substitution is generated at every position in a protein. Our approach employs a support vector machine with linear, polynomial, and radial basis function (RBF) kernels. We frame the task as binary classification, where a label of 1 indicates a disease-causing variant and a label of 0 indicates a benign variant. The model predicts pathogenicity from amino acid, post-translational modification, and secondary structure information. We cleaned and analyzed the data with custom Python scripts. Average balanced accuracy was approximately 57.9%, 60.3%, and 60.3% for the linear, polynomial, and RBF kernels, respectively. Because random guessing yields a balanced accuracy of about 50% in binary classification, the model improves on chance but leaves substantial room for improvement.
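To make the balanced accuracy figures above easier to interpret, the following minimal sketch computes the metric in plain Python. It is an illustration only and not the scripts used in this work: the function names and the toy labels are hypothetical, and the example assumes the label convention stated above (1 for disease-causing, 0 for benign).

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on label 1) and specificity (recall on label 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 2 disease-causing variants (1) and 8 benign variants (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that labels every variant benign.
always_benign = [0] * 10

# A classifier that finds one disease-causing variant and makes one false alarm.
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

for name, y_pred in [("always benign", always_benign), ("one hit", one_hit)]:
    print(name, accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

In this toy case both classifiers have the same plain accuracy, but the one that labels everything benign has a balanced accuracy of exactly 0.5. This is why 50% is the natural baseline against which the reported scores are read, even when the two classes are unequal in size.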

Publication Date

Fall 11-15-2021




Keywords

Machine learning; Saturation mutagenesis; Bioinformatics; Genetics; Support vector machines

File Size

986 KB


Faculty Mentor: Martin Schiller, Ph.D.

Predicting Variant Pathogenicity with Machine Learning