Construction of a Diabetes Prediction Model Based on Machine Learning
Research Article
Open Access
CC BY

Ke Peng 1*
1 College of Information Engineering, China Jiliang University, Hangzhou, 310018, China
*Corresponding author: 23h034160213@cjlu.edu.cn
Published on 28 October 2025
ACE Vol.202
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-497-7
ISBN (Online): 978-1-80590-498-4

Abstract

This study identifies key predictors of diabetes risk across three categories (non-diabetes, prediabetes, and diabetes) and builds a prediction model by comparing five machine learning algorithms. Biomedical indicators such as HbA1c, urea, and creatinine were analyzed alongside demographic factors such as age and gender to evaluate their predictive value. The ensemble learning methods (CatBoost and XGBoost) outperformed the traditional models, with CatBoost achieving the highest accuracy and the most robust performance. Feature importance analysis identified HbA1c as the most influential predictor, followed by age and BMI, which is consistent with established medical knowledge; gender contributed minimally. The findings indicate that advanced machine learning models, particularly CatBoost, can deliver accurate and stable diabetes risk prediction, and they provide technical support for early screening, targeted intervention, and practical risk assessment in diabetes management.
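
The abstract does not state how feature importance was computed. As a rough illustration of the idea, the sketch below uses permutation importance, a model-agnostic technique: shuffle one feature column and measure how much accuracy drops. Everything here is an assumption made for the example and not taken from the study: the data are synthetic, the feature names `hba1c` and `gender` are hypothetical column names, the 5.7 and 6.5 cut-offs are illustrative, and a simple nearest-centroid classifier stands in for CatBoost so that the code runs with the Python standard library alone.

```python
import random

FEATURES = ["hba1c", "gender"]  # hypothetical column names for this sketch


def make_synthetic(n=600, seed=0):
    """Generate toy records; the class depends on HbA1c only (an assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        hba1c = rng.uniform(4.0, 10.0)  # HbA1c in percent
        gender = rng.randint(0, 1)      # encoded 0/1, unrelated to the label
        # 0 = non-diabetes, 1 = prediabetes, 2 = diabetes (illustrative cut-offs)
        label = 0 if hba1c < 5.7 else (1 if hba1c < 6.5 else 2)
        X.append([hba1c, float(gender)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, X, y):
    """Fraction of rows whose predicted class matches the true class."""
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, seed=1):
    """Accuracy drop when one feature column is shuffled; larger means more important."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    result = {}
    for j, name in enumerate(FEATURES):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        result[name] = base - accuracy(centroids, X_perm, y)
    return result


# Hold out the last third of the synthetic records for evaluation.
X, y = make_synthetic()
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

model = fit_centroids(X_train, y_train)
importance = permutation_importance(model, X_test, y_test)
for name, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: accuracy drop {drop:.3f}")
```

On this toy data the HbA1c column should show a much larger accuracy drop than the gender column, which mirrors the ranking reported in the abstract only because the synthetic labels were constructed that way. With the real dataset, the same loop could wrap any fitted classifier that exposes a prediction function.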

Keywords:

Diabetes Prediction, Machine Learning, CatBoost, HbA1c


Cite this article

Peng, K. (2025). Construction of a Diabetes Prediction Model Based on Machine Learning. Applied and Computational Engineering, 202, 39-46.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of CONF-MLA 2025 Symposium: Intelligent Systems and Automation: AI Models, IoT, and Robotic Algorithms

ISBN: 978-1-80590-497-7 (Print) / 978-1-80590-498-4 (Online)
Editor: Hisham AbouGrad
Conference date: 12 November 2025
Series: Applied and Computational Engineering
Volume number: Vol.202
ISSN: 2755-2721 (Print) / 2755-273X (Online)