Correlating Clinical Assessments for Substance Use Disorder Using Unsupervised Machine Learning
Abstract
1. Introduction
- Unsupervised machine learning suits this kind of study because the model derives its classifications from correlations in the data, rather than from classifications assigned beforehand.
- At the time of the study, NSDUH 2019 was a more recent database, with 56,137 entries; half of the dataset was used. DSM-5 is more current and is widely used by mental health professionals to classify SUDs.
- The result of our study is an automated SUD classification based on DSM-5 that is validated by a mental health professional.
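To make the idea of label-free classification concrete, the sketch below runs a plain k-means loop on a handful of made-up two-feature points. It is only an illustration under stated assumptions: the toy data, the function names, and the farthest-point initialisation are hypothetical choices made here, not the authors' pipeline or the NSDUH data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations: assign each point to its nearest centroid,
    then move each centroid to the mean of its members."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: six low-valued points and three high-valued points.
toy = [(0, 1), (1, 0), (1, 2), (0, 0), (2, 1), (1, 1), (8, 9), (9, 8), (9, 10)]
labels, centroids = kmeans(toy, k=2)
sizes = {c: labels.count(c) for c in set(labels)}
print(labels, sizes)
```

No labels are supplied to `kmeans`; the grouping into a larger and a smaller cluster comes entirely from the distances between points.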
2. Methodology
3. Materials and Methods
3.1. Dataset
3.2. Data Preprocessing
3.2.1. Data Cleaning
3.2.2. Feature Correlation
3.2.3. Feature Scaling
3.2.4. Dimensionality Reduction
3.3. Cluster Analysis
3.3.1. K-Means Clustering
3.3.2. Hierarchical Clustering
3.3.3. DBSCAN
3.4. Validation
4. Results and Discussion
4.1. Alcohol Dataset
4.2. Cocaine Dataset
4.3. Validation Results
5. Interpretation of Results
5.1. By Machine Learning
5.2. By a Mental Health Professional
5.3. Clinical Integration and Diagnostic Implications
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Summary of Interpretations (Table A1)
By Machine Learning | By a Mental Health Professional |
---|---|
1. Two major classifications are observed (major cluster, minor cluster) | |
2. Major and minor clusters of the alcohol dataset | |
3. Major and minor clusters of the marijuana dataset | |
4. Major and minor clusters of the cocaine dataset | |
5. Range of coverage of major and minor clusters of the alcohol dataset | |
6. Range of coverage of major and minor clusters of the marijuana and the cocaine datasets | |
7. Conclusions | |
References
- Robbins, T.W.; Clark, L. Behavioral addictions. Curr. Opin. Neurobiol. 2015, 30, 66–72.
- Ahn, W.-Y.; Vassileva, J. Machine-learning identifies substance-specific behavioral markers for opiate and stimulant dependence. Drug Alcohol Depend. 2016, 161, 247–257.
- Al Sukar, M.; Sleit, A.; Abu-Dalhoum, A.; Al-Kasasbeh, B. Identifying a drug addict person using artificial neural networks. Int. J. Comput. Inf. Eng. 2016, 10, 611–616.
- Saha, T.D.; Chou, S.P.; Grant, B.F. The performance of DSM-5 alcohol use disorder and quantity-frequency of alcohol consumption criteria: An item response theory analysis. Drug Alcohol Depend. 2020, 216, 108299.
- American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (DSM-5®); American Psychiatric Association Publishing: Washington, DC, USA, 2013.
- Alcohol Research: Current Reviews Editorial Staff. Drinking patterns and their definitions. Alcohol Res. Curr. Rev. 2018, 39, 17–18.
- Mak, K.K.; Lee, K.; Park, C. Applications of machine learning in addiction studies: A systematic review. Psychiatry Res. 2019, 275, 53–60.
- Srividya, M.; Mohanavalli, S.; Bhalaji, N. Behavioral modeling for mental health using machine learning algorithms. J. Med. Syst. 2018, 42, 88.
- Tarekegn, A.N.; Michalak, K.; Giacobini, M. Cross-validation approach to evaluate clustering algorithms: An experimental study using multi-label datasets. SN Comput. Sci. 2020, 1, 263.
- Stevens, E.; Dixon, D.R.; Novack, M.N.; Granpeesheh, D.; Smith, T.; Linstead, E. Identification and analysis of behavioral phenotypes in autism spectrum disorder via unsupervised machine learning. Int. J. Med. Inform. 2019, 129, 29–36.
- Jing, Y.; Hu, Z.; Fan, P.; Xue, Y.; Wang, L.; Tarter, R.E.; Kirisci, L.; Wang, J.; Vanyukov, M.; Xie, X.-Q. Analysis of substance use and its outcomes by machine learning I. Childhood evaluation of liability to substance use disorder. Drug Alcohol Depend. 2020, 206, 107605.
- Mannes, Z.L.; Shmulewitz, D.; Livne, O.; Stohl, M.; Hasin, D.S. Correlates of mild, moderate, and severe Alcohol Use Disorder among adults with problem substance use: Validity implications for DSM-5. Alcohol. Clin. Exp. Res. 2021, 45, 2118–2129.
- Mintz, C.M.; Hartz, S.M.; Fisher, S.L.; Ramsey, A.T.; Geng, E.H.; Grucza, R.A.; Bierut, L.J. A cascade of care for alcohol use disorder: Using 2015-2019 National Survey on Drug Use and Health data to identify gaps in past 12-month care. Alcohol. Clin. Exp. Res. 2021, 45, 1276–1286.
- Hayley, A.C.; Stough, C.; Downey, L.A. DSM-5 cannabis use disorder, substance use and DSM-5 specific substance-use disorders: Evaluating comorbidity in a population-based sample. Eur. Neuropsychopharmacol. 2017, 27, 732–743.
- Substance Abuse and Mental Health Services Administration. National Survey on Drug Use and Health. 2019. Available online: https://www.samhsa.gov/data/data-we-collect/nsduh-national-survey-drug-use-and-health/national-releases/2019 (accessed on 1 January 2023).
- Hasin, D.S.; O’Brien, C.P.; Auriacombe, M.; Borges, G.; Bucholz, K.; Budney, A.; Compton, W.M.; Crowley, T.; Ling, W.; Petry, N.M.; et al. DSM-5 criteria for substance use disorders: Recommendations and rationale. Am. J. Psychiatry 2013, 170, 834–851.
- Substance Abuse and Mental Health Services Administration. Key Substance Use and Mental Health Indicators in the United States: Results from the 2019 National Survey on Drug Use and Health. 2020. Available online: https://www.samhsa.gov/data/report/2019-nsduh-annual-national-report (accessed on 1 January 2023).
- Iniesta, R.; Stahl, D.; McGuffin, P. Machine learning, statistical learning and the future of biological research in psychiatry. Psychol. Med. 2016, 46, 2455–2465.
- Chowdhry, A.K.; Gondi, V.; Pugh, S.L. Missing data in clinical studies. Int. J. Radiat. Oncol. Biol. Phys. 2021, 110, 1267–1271.
- Pedersen, A.B.; Mikkelsen, E.M.; Cronin-Fenton, D.; Kristensen, N.R.; Pham, T.M.; Pedersen, L.; Petersen, I. Missing data and multiple imputation in clinical epidemiological research. Clin. Epidemiol. 2017, 9, 157–165.
- Valenti, G.D.; Craparo, G.; Faraci, P. The development of a short version of the internet addiction test: The IAT-7. Int. J. Ment. Health Addict. 2023, 23, 1028–1053.
- Dluzniewski, A.; Casanova, M.P.; Ullrich-French, S.; Brush, C.J.; Larkins, L.W.; Baker, R.T. Psychological readiness for injury recovery: Evaluating psychometric properties of the IPRRS and assessing group differences in injured physically active individuals. BMJ Open Sport Exerc. Med. 2024, 10, e001869.
- Bonfiglio, A.Y.; Munniksma, A.; Volman, M.; van Rooij, F.; Gaikhorst, L. Teachers’ attention to students’ funds of identity in Dutch primary school classrooms. Teach. Teach. Educ. 2024, 144, 104584.
- Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79.
- Hauke, J.; Kossowski, T. Comparison of values of Pearson’s and Spearman’s correlation coefficient on the same sets of data. Quaest. Geogr. 2011, 30, 87–93.
- Ali, P.J.M.; Faraj, R.H.; Koya, E. Data Normalization and Standardization: A Technical Report; Machine Learning Technical Reports; Koya University: Koy Sanjaq, Iraq, 2014; Volume 1, pp. 1–6.
- Kherif, F.; Latypova, A. Principal component analysis. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 2020; pp. 209–225.
- Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
- Chakraborty, S.; Paul, D.; Das, S.; Xu, J. Entropy regularized power k-means clustering. arXiv 2020, arXiv:2001.03452.
- Jiang, X.; Ma, J.; Jiang, J.; Guo, X. Robust feature matching using spatial clustering with heavy outliers. IEEE Trans. Image Process. 2019, 29, 736–746.
- Zhang, M. Use density-based spatial clustering of applications with noise (DBSCAN) algorithm to identify galaxy cluster members. IOP Conf. Ser. Earth Environ. Sci. 2019, 252, 042033.
- Birant, D.; Kut, A. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data Knowl. Eng. 2007, 60, 208–221.
- Bleeker, S.E.; Moll, H.A.; Steyerberg, E.W.; Donders, A.R.T.; Derksen-Lubsen, G.; Grobbee, D.E.; Moons, K.G.M. External validation is necessary in prediction research: A clinical example. J. Clin. Epidemiol. 2003, 56, 826–832.
- Waddell, J.T. Age-varying time trends in cannabis- and alcohol-related risk perceptions 2002-2019. Addict. Behav. 2022, 124, 107091.
- Dell, N.A.; Srivastava, S.P.; Vaughn, M.G.; Salas-Wright, C.; Hai, A.H.; Qian, Z. Binge drinking in early adulthood: A machine learning approach. Addict. Behav. 2022, 124, 107122.
- Wetherill, L.; Agrawal, A.; Kapoor, M.; Bertelsen, S.; Bierut, L.J.; Brooks, A.; Dick, D.; Hesselbrock, M.; Hesselbrock, V.; Koller, D.L.; et al. Association of substance dependence phenotypes in the COGA sample. Addict. Biol. 2015, 20, 617–627.
Author & Year | Title (First Three Words) | Mental Health Issue | Machine Learning Model Used | Dataset Used |
---|---|---|---|---|
Srividya et al., 2018 [8] | Behavioral Modeling Mental | State of mental health | Unsupervised learning, Supervised learning, SVM, Naive Bayes, KNN, Logistic Regression | Survey (20 questions), Population 1: 300 subjects, Population 2: 356 subjects |
Tarekegn et al., 2020 [9] | Cross-Validation Approach | No specific mental health issue analyzed | Unsupervised learning, K-means, K-fold cross-validation, RMSE, MAPE | Chronic diseases, Emotions, Yeast dataset |
Stevens et al., 2019 [10] | Identification Analysis Behavioral | Autism Spectrum Disorder | Unsupervised learning, Hierarchical clustering | 2400 children from community centers across the U.S. |
Jing et al., 2020 [11] | Analysis of Substance Use | Substance Use Disorder (SUD) | Supervised learning, Random Forest | Recruited via advertisement, public service announcements, random digital calls, posters; Age: 10–22 years |
Mannes et al., 2021 [12] | Correlates of Mild, Moderate | Alcohol Use Disorder (AUD) | Supervised learning, Statistical analysis, Multinomial Logistic Regression | 150 participants (suburban inpatient addiction treatment), 438 participants (urban medical center), Age: 18+ years |
Mintz et al., 2021 [13] | A Cascade of Care for Alcohol | Alcohol Use Disorder (AUD) | Supervised learning, Statistical analysis, Multinomial Logistic Regression | 150 participants (suburban inpatient addiction treatment), 438 participants (urban medical center), Age: 18+ years |
Hayley et al., 2017 [14] | DSM-5 Cannabis Use Disorder | Cannabis Use Disorder (CaUD) | Supervised learning, Logistic Regression | NESARC-III (n = 36,309), Age: 18+ years |
Algorithm | Metric | Alcohol (%) | Marijuana (%) | Cocaine (%) |
---|---|---|---|---|
K-Means | Silhouette score | 72.5 | 78.7 | 77.9 |
K-Means | 95% CI | ±0.002 | ±0.002 | ±0.002 |
Hierarchical | Silhouette score | 83.4 | 84.7 | 78.9 |
Hierarchical | 95% CI | ±0.001 | ±0.001 | ±0.001 |
DBSCAN | Silhouette score | 62.6 | 66.3 | 63 |
DBSCAN | 95% CI | ±0.002 | ±0.002 | ±0.002 |
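For readers unfamiliar with the silhouette score reported in the table above, the following sketch computes the standard mean silhouette coefficient from scratch. It is an illustration under assumptions: the four toy points and the helper name `silhouette_score` are hypothetical, Euclidean distance is assumed, and the conversion to a percentage (score × 100) is a guess at how the tabulated values were scaled, not something stated in the source.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For each point: a = mean distance to the other members of its own
    cluster, b = smallest mean distance to the members of any other
    cluster, s = (b - a) / max(a, b). Points in singleton clusters get s = 0.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster contributes 0
        # The distance from p to itself is 0, so dividing by len(own) - 1
        # gives the mean over the *other* members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight pairs that are far apart.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])  # labels follow the real grouping
bad = silhouette_score(pts, [0, 1, 0, 1])   # labels cut across the grouping
print(f"good: {100 * good:.1f}%  bad: {100 * bad:.1f}%")
```

A score near 1 (100%) indicates compact, well-separated clusters, while a negative score indicates points that sit closer to another cluster than to their own, which is why a higher value in the table is read as a better clustering.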
Algorithm | Metric | Alcohol (%) | Marijuana (%) | Cocaine (%) |
---|---|---|---|---|
K-Means++ | Silhouette score | 73.1 | 78.8 | 78.4 |
K-Means++ | 95% CI | ±0.002 | ±0.002 | ±0.002 |
BIRCH | Silhouette score | 68.7 | 70.7 | 67.1 |
BIRCH | 95% CI | ±0.003 | ±0.003 | ±0.003 |
HDBSCAN | Silhouette score | 40.8 | 58.6 | 81 |
HDBSCAN | 95% CI | ±0.001 | ±0.001 | ±0.001 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tlotleng, K.M.; Jamisola, R.S., Jr.; Brown, J.L. Correlating Clinical Assessments for Substance Use Disorder Using Unsupervised Machine Learning. BioMedInformatics 2025, 5, 54. https://doi.org/10.3390/biomedinformatics5030054