Award Date

August 2023

Degree Type


Degree Name

Doctor of Philosophy (PhD)


Epidemiology and Biostatistics

First Committee Member

Qing Wu

Second Committee Member

Jennifer Pharr

Third Committee Member

Lung-Chang Chien

Fourth Committee Member

Soumya Upadhyay

Fifth Committee Member

Mingon Kang

Number of Pages



Introduction: Around one in four adults worldwide suffer from arthritis. There are more than one hundred forms of arthritis; the two most common are rheumatoid arthritis (RA) and osteoarthritis (OA). RA is an autoimmune disease that can cause joint inflammation. Around 1.3 million adults in the US suffer from RA, representing 0.6%–1% of the population. RA is difficult to diagnose in its early stages because its signs and symptoms resemble those of other forms of arthritis. OA is the most common form of arthritis, affecting around 30.8 million people in the US. However, OA patients are usually diagnosed at the moderate to severe (late) stage, when the joint tissue has often already become irreversibly damaged. Therefore, the first study aims to derive a polygenic risk score (PRS) from the most comprehensive genome-wide association studies (GWAS) of RA and validate it in an independent multiethnic sample of postmenopausal women. The second study aims to use single nucleotide polymorphisms (SNPs) to develop a comprehensive predictive model for RA. The third study aims to create and validate a genome-wide polygenic score (PGS) for OA, assessing the performance of the PGS in OA risk stratification and the association between the PGS and other comorbidities. It is hypothesized that the newly developed PGS using whole-genome data will be significantly associated with OA risk.

Methods: The first study included participants from the Women’s Health Initiative (WHI). The best score was selected in a validation dataset. Odds ratios (ORs) were then calculated with logistic regression, and cumulative hazards were estimated in the testing dataset. The second study also included WHI participants. Four models were developed for RA prediction: model 1 (using the most common RA risk factors), model 2 (with PRS and five principal components), model 3 (with SNPs after feature reduction with FeatureWiz), and model 4 (with SNPs after feature extraction with kernel principal component analysis). Four algorithms were used: logistic regression (LR), random forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector machine (SVM). Performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive and negative predictive values (PPV and NPV), and F1-score. In the third study, the PGS was derived using seven thresholds, and the optimal score was selected based on the AUC. Prevalence and ORs were evaluated by PGS decile group. Additionally, a Cox proportional hazards model was used to assess the cumulative hazard of OA.

Results: In the first study, the prevalence of RA was around 7.8% in the group with the bottom 20% of PRS and around 12.7% in the group with the top 20%. After adjustment for race, age, BMI, physical activity, smoking status, and five principal components, individuals in the second through fifth groups had a significantly higher risk of RA than individuals in the bottom 20%. The cumulative hazard of RA at age 70 was 9.2% (95% CI, 7.3%-10.9%) among individuals in the first decile (bottom 10%) of the PRS distribution and 22.8% (95% CI, 19.4%-26.2%) among those in the top decile (top 10%). In the second study, model 4 had the highest AUC among the four models. The DeLong test showed significant differences in AUC between model 4 and model 3 (all p-values < 0.0001) for the three algorithms other than RF. Model 4 also had the highest F1-score of all models. Among the four algorithms, XGBoost performed best with model 4, achieving the best F1-score of 0.83, with corresponding sensitivity, specificity, PPV, and NPV of 0.73, 0.72, 0.95, and 0.25, respectively. In the third study, the prevalence and OR increased with increasing PGS quintile groups. High-risk carriers (top 5% PGS) had a significantly increased risk of OA among Caucasians (OR, 2.71; 95% CI, 1.51-4.91) and obese groups (OR, 2.11; 95% CI, 1.12-4.01) compared to individuals with the bottom 5% PGS. The cumulative hazard of OA at age 70 was significantly higher among individuals with the top 5% PGS (55.1%; 95% CI, 48.9-61.2) than among those with the bottom 5% PGS (22.6%; 95% CI, 18.6-26.5).

Conclusions: The first study's findings indicated that the newly developed PRS is significantly associated with RA. The PRS might significantly improve RA prediction and can be used to identify individuals at high risk of RA. The second study suggested that incorporating genomic information with XGBoost can efficiently predict RA risk among postmenopausal women. The results of the third study indicated that the genome-wide PGS developed in this study could help to stratify OA risk among postmenopausal women.
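As a quick consistency check on the reported metrics, the F1-score can be recomputed from the PPV (precision) and sensitivity (recall) given in the abstract for model 4 with XGBoost. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name is hypothetical and is not taken from the dissertation.

```python
def f1_from_ppv_and_sensitivity(ppv: float, sensitivity: float) -> float:
    """Return F1 as the harmonic mean of precision (PPV) and recall (sensitivity)."""
    if ppv + sensitivity == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Values reported in the abstract for model 4 with XGBoost.
reported_ppv = 0.95
reported_sensitivity = 0.73

f1 = f1_from_ppv_and_sensitivity(reported_ppv, reported_sensitivity)
print(round(f1, 2))  # prints 0.83
```

The recomputed value agrees with the reported F1-score of 0.83. F1 does not depend on specificity or NPV, so the low NPV of 0.25 is not reflected in it.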


machine learning; osteoarthritis; polygenic score; postmenopausal women; rheumatoid arthritis


Biostatistics | Epidemiology

File Format


File Size

1840 KB

Degree Grantor

University of Nevada, Las Vegas




IN COPYRIGHT. For more information about this rights statement, please visit

Available for download on Saturday, August 15, 2026