
Predicting gastric cancer risk using machine learning based on lifestyle factors: A comparative analysis
Soheil Azarmi Giglou 1 ℗, Amirali Seyyedmatin 1, Elham Safarzadeh 2, Farhad Pourfarzi 3, Masoud Amanzadeh 4 ©
Abstract
Introduction: Gastric cancer is one of the leading causes of death globally. Current screening protocols rely on endoscopy and biopsy, which are invasive and costly. Primary prevention through lifestyle and dietary modification, or through the reduction of risk factors, is therefore of great importance. The aim of this study is to use machine learning (ML) algorithms to predict gastric cancer risk based on lifestyle factors, providing a non-invasive, more accessible, less costly, and lower-risk alternative that distinguishes high-risk individuals from those at lower risk.
Methods and Materials: Data were collected from 21,000 participants in the Persian Cohort Study and included demographic factors, clinical symptoms, dietary habits, and other lifestyle factors. The data were preprocessed using methods such as normalization, data cleaning, and noise removal. The dataset was randomly divided into training and test sets using ten-fold cross-validation. Models were built on the training data using Random Forest (RF), Neural Networks (NN), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR). The NumPy and Pandas libraries were used for data preprocessing, and the Scikit-Learn library was used to implement the ML algorithms and build the models. All analyses were carried out in Python within the Jupyter Notebook environment. The models were evaluated using the confusion matrix, and their performance was assessed with accuracy, precision, recall (sensitivity), and specificity, together with the area under the curve (AUC) and the receiver operating characteristic (ROC) curve.
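The evaluation procedure described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic (generated with `make_classification`, not the cohort data), the hyperparameters are arbitrary, and Scikit-Learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost so that the example depends only on Scikit-Learn. The helper name `evaluate` and the 0.5 decision threshold are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 600 rows, 12 numeric features, binary label.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)

# Candidate models; "GB" is a stand-in for XGBoost (assumption).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}


def evaluate(model, X, y, n_splits=10):
    """Ten-fold cross-validation; metrics are computed on out-of-fold predictions."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Normalization is fitted inside each training fold to avoid leakage.
    pipeline = make_pipeline(StandardScaler(), model)
    proba = cross_val_predict(pipeline, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y, proba),
    }


results = {name: evaluate(model, X, y) for name, model in models.items()}
for name, metrics in results.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

Placing the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold reaches the model. The values printed for the synthetic data have no bearing on the results reported in this study.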
Results: In the final ten-fold cross-validation evaluation, the XGBoost classifier achieved an accuracy of approximately 82%, a sensitivity of 84%, a specificity of 86%, and an AUC of 85%, outperforming all other algorithms in predicting gastric cancer risk. Among the lifestyle-related factors, Helicobacter pylori infection and a diet high in salty, smoked, or processed foods had the greatest impact on carcinogenesis.
Conclusion and Discussion: This study demonstrates the promising efficacy of ML models as auxiliary tools in the screening, diagnosis, and management of gastric cancer. However, further studies are needed that cover broader populations, larger sample sizes, and a wider range of ML methods.
Keywords: Gastric Cancer, Machine Learning, Lifestyle Factors