G-1464

2025-10-19 17:47

Written by ARCIMS 26 on Sunday, 2025-10-19 17:47

Designing supervised learning models for colon cancer screening

 Tayeb Ramim 1 ℗, Behnaz Varaminian 2 ©   

 1 Student Research Committee, Iran University of Medical Sciences, Tehran, Iran

 2 Assistant Professor of Hematology & Oncology, Department of Internal Medicine, School of Medicine, Semnan University of Medical Sciences

Email: tayebramim@yahoo.com
 
Abstract

Background and Objectives: Colorectal cancer (CRC) continues to be a major contributor to cancer-related mortality on a global scale. Early detection through effective screening mechanisms has been shown to significantly improve patient survival outcomes. However, current screening approaches, such as colonoscopy and fecal immunochemical test (FIT), are constrained by limitations including invasiveness, restricted accessibility, and inconsistent sensitivity. The objective of this study was to develop and evaluate supervised machine learning (ML) models using routinely collected clinical data to improve risk stratification methods. The ultimate aim was to identify high-priority individuals for colonoscopy referrals within CRC screening programs.
Methods: A retrospective cohort design was implemented, incorporating data from 15,320 asymptomatic adults who participated in CRC screening between 2018 and 2022. The dataset included input variables such as demographic attributes (age and gender), FIT results, family history of CRC, lifestyle indicators (smoking status and body mass index [BMI]), and comorbidities. The primary output variable was biopsy-confirmed advanced neoplasia, encompassing both adenocarcinoma and high-grade dysplasia. Data preprocessing steps included addressing missing values and standardizing feature scales. The dataset was divided into training (70%), validation (15%), and testing (15%) subsets. Six supervised ML algorithms were employed: Random Forest (RF), XGBoost, Support Vector Machines (SVM), Logistic Regression (LR), Neural Networks (NN), and Decision Trees (DT). Hyperparameter optimization was conducted through five-fold cross-validation. Model performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC), along with sensitivity, specificity, precision, F1-score, and calibration curve analyses.
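As an illustration of the 70/15/15 partition and the AUC-ROC metric described in the Methods, the following is a minimal sketch in standard-library Python. It is not the authors' code: the function names `split_70_15_15` and `auc_roc`, the random seed, and the toy scores are assumptions made for the example, and the abstract does not state whether the split was stratified by outcome (this sketch uses a plain shuffle).

```python
import random


def split_70_15_15(records, seed=0):
    """Shuffle records and partition them into train/validation/test subsets.

    Integer arithmetic is used for the cut points so the subset sizes are
    exact; any remainder from rounding goes to the test subset.
    """
    rng = random.Random(seed)  # hypothetical seed, chosen for reproducibility
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train, val, test = split_70_15_15(range(200))
    print(len(train), len(val), len(test))  # 140 30 30
    # Toy labels and model scores, invented for the example.
    print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise definition of AUC used here is quadratic in the number of cases, which is acceptable for a small illustration; a cohort of the size reported above would normally be scored with a rank-based implementation from a library.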
Results: The Random Forest model outperformed all other approaches, achieving an AUC-ROC of 0.92 (95% CI: 0.89–0.94), notably higher than FIT alone, which yielded an AUC of 0.76. The most influential predictive features in the RF model were FIT concentration (feature importance: 0.32), age (0.28), family history of CRC (0.15), and BMI (0.09). At its optimal decision threshold, the RF model demonstrated a sensitivity of 89.4%, effectively minimizing false negatives, and a specificity of 86.1%. XGBoost delivered comparable performance with an AUC of 0.90, while Logistic Regression was the most interpretable model. All ML models demonstrated superior precision-recall balance compared to conventional risk scoring systems such as the Asia-Pacific CRC Screening Score.
Conclusion: Supervised ML models, exemplified by the strong performance of Random Forest, present a promising non-invasive approach to CRC risk stratification using routinely collected clinical data. Their implementation has the potential to enhance the efficiency of colonoscopy resource deployment by prioritizing individuals at high risk while reducing unnecessary procedures in low-risk populations. Further research should focus on prospective validation of these models across diverse demographic cohorts and explore their integration into digital health platforms for widespread adoption in CRC screening programs.
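The Results report sensitivity and specificity at an "optimal decision threshold" without stating how that threshold was chosen. The sketch below shows one common way such figures can be derived from predicted risk scores, assuming Youden's J statistic (sensitivity + specificity - 1) as the selection criterion. That criterion, the function names, and the toy data are assumptions for illustration and are not taken from the study.

```python
def threshold_metrics(labels, scores, threshold):
    """Sensitivity, specificity, precision and F1 when every case with a
    score at or above the threshold is flagged as high risk."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def youden_threshold(labels, scores):
    """Return the candidate threshold that maximizes Youden's J.

    Candidates are the distinct observed scores; on ties the lowest
    threshold wins, which favors sensitivity."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        m = threshold_metrics(labels, scores, t)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


if __name__ == "__main__":
    # Toy outcome labels (1 = advanced neoplasia) and risk scores, invented here.
    labels = [0, 0, 1, 1, 1, 0]
    scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
    t = youden_threshold(labels, scores)
    print(t, threshold_metrics(labels, scores, t))
```

In a screening setting the threshold would be selected on the validation subset and then applied unchanged to the test subset, so that the reported sensitivity and specificity are not tuned on the data used to evaluate them.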


Keywords: Colorectal cancer screening, supervised machine learning, risk prediction, Random Forest
