G-3652


Development and validation of a machine learning model for predicting iron deficiency anemia using structured laboratory data

 

Asma Ahmadi 1, Hesamaddin Kamal Zadeh Takhti, Ali Haghighat, Niloofar Choobin 1


1  Student Research Committee, Faculty of Paramedicine, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

2  Associate Professor of Information Technology, Department of Health Information Technology, School of Allied Medical Sciences, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

3  Information Systems Department, Computer Engineering, Shiraz University, Shiraz, Iran

Email: niloo.choobin.hit91@gmail.com
 

 


 

 
Abstract

Background: Iron Deficiency Anemia (IDA) is the most prevalent micronutrient disorder worldwide and a major cause of morbidity, particularly among women, children, and hospitalized patients. Despite its global burden, diagnosis often relies on specialized iron studies such as ferritin or serum iron, which may be unavailable in many low-resource clinical settings. This study introduces an intelligent, interpretable machine learning framework for non-invasive IDA prediction using routine complete blood count (CBC) data and textual flag messages generated by automated blood analyzers.

Methods: A total of 1019 clinical samples were extracted from laboratory databases, each comprising more than 20 structured hematological parameters (including RBC, HGB, HCT, MCV, RDW, RET%, PLT indices, and WBC differentials), along with free-text diagnostic messages (RBC/WBC/PLT flags). After data cleaning, de-duplication, and outlier correction using statistical (z-score, IQR) and clinical thresholds, all features were rescaled to a common range with MinMaxScaler. Gender encoding, semantic diagnosis grouping, and feature engineering were applied to enrich the dataset.
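A minimal preprocessing sketch along these lines is shown below. It is illustrative only: the file name, the column names (e.g., HGB, Gender), and the exact clipping rules are assumptions, not the authors' code.

# Illustrative preprocessing sketch; file and column names are assumed, not the study's actual schema.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("cbc_samples.csv")          # hypothetical export of the 1019 laboratory records
df = df.drop_duplicates()

numeric_cols = ["RBC", "HGB", "HCT", "MCV", "RDW", "RET", "PLT", "WBC"]   # subset, for illustration

# Outlier correction: clip to the 1.5*IQR fence, then drop rows with |z-score| >= 3
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[(np.abs(stats.zscore(df[numeric_cols], nan_policy="omit")) < 3).all(axis=1)]

# Encode gender and rescale all numeric features to [0, 1] with MinMaxScaler
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])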
Automatic labeling of IDA was performed using a hybrid rule-based strategy combining numerical clinical criteria (Ferritin < 30 ng/mL, or Iron < 50 μg/dL with TIBC > 400 μg/dL) with the presence of analyzer flags suggestive of microcytic, hypochromic anemia. Label leakage was carefully avoided by excluding all flag-related columns from the training features. To handle class imbalance (79% positive), the Synthetic Minority Oversampling Technique (SMOTE) was used.
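The labeling rule and the leakage-avoiding feature split could be sketched roughly as follows; the column names (Ferritin, Iron, TIBC, the flag columns) are assumptions, and SMOTE itself is applied inside the cross-validation pipeline in the next sketch.

# Hybrid rule-based labeling sketch; thresholds follow the text, column names are assumed.
flag_text = df[["RBC_flags", "WBC_flags", "PLT_flags"]].fillna("").agg(" ".join, axis=1)
flag_rule = flag_text.str.contains("microcyt|hypochrom", case=False, regex=True)

lab_rule = (df["Ferritin"] < 30) | ((df["Iron"] < 50) & (df["TIBC"] > 400))
df["IDA"] = (lab_rule | flag_rule).astype(int)

# Drop every column the label was derived from, so no label information leaks into the features
leak_cols = ["Ferritin", "Iron", "TIBC", "RBC_flags", "WBC_flags", "PLT_flags"]
X = df.drop(columns=leak_cols + ["IDA"])
y = df["IDA"]

print(y.value_counts(normalize=True))   # roughly 79% positive; SMOTE is applied within CV folds below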
Several supervised machine learning models were trained and compared, including CatBoost, Random Forest, LightGBM, XGBoost, SVM, MLP, and Logistic Regression. Models were evaluated via stratified 10-fold cross-validation and a hold-out test set. Performance metrics included Accuracy, Precision, Sensitivity (Recall), Specificity, F1-score, and Area Under the ROC Curve (AUC-ROC). Feature importance and model explainability were assessed using SHAP (SHapley Additive exPlanations) values.

Results: CatBoost achieved the best performance with AUC-ROC = 0.928, Accuracy = 90.3%, Sensitivity = 91.5%, Specificity = 85.7%, and F1-score ≈ 93%. SHAP analysis confirmed that HGB, HCT, RBC, RDW-CV, and MCV were the most influential predictors. The model successfully identified IDA patterns without relying on Ferritin or Iron in test data, enhancing generalizability. Other models, particularly Random Forest and LightGBM, also demonstrated strong results, though with slightly lower interpretability and overall performance.
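A training-and-evaluation sketch consistent with this setup (stratified 10-fold cross-validation with SMOTE applied inside each fold, a hold-out test split, and SHAP attributions for the CatBoost model) is given below. The pipeline, split sizes, and hyperparameters are illustrative assumptions rather than the authors' exact configuration.

# Training/evaluation sketch: stratified 10-fold CV with in-fold SMOTE, hold-out test, and SHAP.
import shap
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE sits inside the pipeline so oversampling only touches each fold's training portion
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("model", CatBoostClassifier(verbose=0, random_state=42))])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
print("CV AUC-ROC:", cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean())

pipe.fit(X_train, y_train)
print("Hold-out AUC-ROC:", roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, pipe.predict(X_test)))   # precision, recall (sensitivity), F1

# SHAP attributions on the fitted CatBoost model; HGB, HCT, RBC, RDW, and MCV are expected to rank highly
explainer = shap.TreeExplainer(pipe.named_steps["model"])
shap.summary_plot(explainer.shap_values(X_test), X_test)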

Conclusion: This study presents a practical, interpretable, and scalable approach for semi-automated IDA detection using widely available hematological data. The model eliminates dependence on costly biochemical tests, making it suitable for deployment in routine hospital laboratories or remote settings. CatBoost, supported by SHAP-based explanations, provides a clinically relevant, transparent, and high-performing tool for smart anemia screening.


Keywords: Iron Deficiency Anemia; Machine Learning; CatBoost; Non-Invasive Screening; SHAP; CBC.


 
