
Unsupervised Machine Learning for Early Detection of Hypertension: Cohort-Based Model Development and Validation Using Real-World Data from Bandar Kong
Niloofar Choobin 1 © ℗, Hesamaddin Kamal Zadeh Takhti 2, Farid Khorrami 3, Hossein Zeinali 4
Abstract
Introduction: Hypertension remains one of the most widespread yet underdiagnosed chronic diseases worldwide. Although early detection is critical to preventing its long-term complications, conventional screening methods often fail to identify at-risk individuals in the asymptomatic phase. This study aimed to develop and validate unsupervised machine learning models capable of uncovering latent hypertensive risk profiles using real-world cohort data from Bandar Kong, Iran.
Methods and Materials: Data were sourced from a regional cohort study conducted between 2016 and 2018, comprising 3,086 participants. A wide range of anthropometric, lifestyle, and baseline clinical features were extracted, and the dataset was preprocessed and normalized. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) were employed for dimensionality reduction. Clustering algorithms, including K-Means, Hierarchical Clustering, and DBSCAN, were applied to identify inherent subgroups in the data without labeled outcomes. Cluster quality was assessed using internal metrics such as the Silhouette Score and the Davies-Bouldin Index. A post hoc clinical validation was performed by mapping cluster assignments to participants' blood pressure status.
Results: Among the models evaluated, DBSCAN combined with PCA-derived features exhibited the best cluster coherence and revealed a distinct subgroup with subclinical risk features, such as elevated BMI, sedentary lifestyle, and borderline metabolic indicators. The model achieved a Silhouette Score of 0.71 and an Adjusted Rand Index of 0.68 when its cluster assignments were compared against known hypertensive outcomes. Notably, 22.8% of individuals flagged as high-risk by the model had not been clinically diagnosed during the cohort follow-up.
Conclusion: The findings highlight the potential of unsupervised learning to detect hidden risk patterns of hypertension in large-scale, real-world datasets. Beyond identifying known cases, the model exposed a silent high-risk subgroup that routine screening had not captured. This points to a promising direction for integrating such AI-driven systems into early warning platforms within primary care, especially in low-resource settings. By uncovering latent structures in unlabeled health data, unsupervised learning is moving from a data analysis tool toward a driver of proactive, predictive, and personalized medicine. Future studies should explore hybrid architectures that combine unsupervised clustering with temporal deep learning to enable both early detection and trajectory forecasting of chronic conditions.
Keywords: Hypertension, Unsupervised Learning, DBSCAN, PCA, Early Detection
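The workflow summarized under Methods and Materials (normalization, PCA, clustering, internal cluster metrics, and post hoc comparison with diagnosis status) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn and NumPy, uses synthetic data in place of the Bandar Kong cohort, and the feature count, the `eps` and `min_samples` values, the `diagnosed` array, and the helper `undiagnosed_share` are all hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort (assumption): two groups, four features.
rng = np.random.default_rng(0)
low_risk = rng.normal(loc=0.0, scale=1.0, size=(150, 4))
high_risk = rng.normal(loc=6.0, scale=1.0, size=(150, 4))
X = np.vstack([low_risk, high_risk])

# Hypothetical diagnosis flags, used only after clustering.
# 30 members of the high-risk group are left undiagnosed on purpose.
diagnosed = np.array([0] * 150 + [1] * 150)
diagnosed[150:180] = 0

# Normalize the features, then reduce dimensionality with PCA.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)


def undiagnosed_share(labels, flags):
    """Share of the highest-prevalence cluster that has no diagnosis."""
    clusters = [c for c in set(labels) if c != -1]
    flagged = max(clusters, key=lambda c: flags[labels == c].mean())
    return 1.0 - flags[labels == flagged].mean()


models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.8, min_samples=5),  # illustrative settings
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_pca)
    mask = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        continue  # internal metrics need at least two clusters
    results[name] = {
        "n_clusters": n_clusters,
        # Internal quality: no outcome labels are involved.
        "silhouette": silhouette_score(X_pca[mask], labels[mask]),
        "davies_bouldin": davies_bouldin_score(X_pca[mask], labels[mask]),
        # Post hoc validation: agreement with diagnosis status.
        "ari": adjusted_rand_score(diagnosed[mask], labels[mask]),
        "undiagnosed_share": undiagnosed_share(labels[mask], diagnosed[mask]),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

In this sketch the diagnosis flags never enter the clustering step. They are used only afterwards, which mirrors the post hoc validation described in the abstract. Excluding DBSCAN noise points before scoring is one possible convention; including them as their own group would give different metric values.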