Exploring Heterogeneity with Category and Cluster Analyses for Mixed Data
Abstract
1. Introduction
2. Methodology
2.1. Morphisms and Functors from Category Theory
- i indicates the patient, ;
- j indicates the variable, ;
- k indicates the time point, .
2.2. Hierarchical Clustering Based on Gower Distance
- Quantitative variable: , where is the observed range of variable j;
- Nominal variable: if patients have the same value of the j-th variable at a given time, and 0 otherwise;
- Ordinal variable:, where r is the rank of each measurement, is the patient having the highest rank of variable j, and is the patient having the lowest rank of variable j. This is the Podani extension of the Gower formula [24], to include ordinal variables.
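
As a minimal sketch of the three per-variable scores listed above, the following Python computes a Gower-type dissimilarity between two patients at a fixed time point. The function names, the dictionary layout, the equal weighting of variables, and the conversion from similarity to distance as one minus the mean similarity are assumptions made for illustration. The ordinal score is a simplified rank-based version and omits the correction for tied ranks in Podani's extension; the patient values are made up.

```python
from typing import Hashable, Mapping, Tuple


def quantitative_similarity(x_a: float, x_b: float, value_range: float) -> float:
    """Range-scaled similarity: 1 - |x_a - x_b| / R_j (R_j = observed range)."""
    if value_range == 0:
        return 1.0  # all patients share the same value, so no dissimilarity
    return 1.0 - abs(x_a - x_b) / value_range


def nominal_similarity(x_a: Hashable, x_b: Hashable) -> float:
    """1 if both patients fall in the same category, 0 otherwise."""
    return 1.0 if x_a == x_b else 0.0


def ordinal_similarity(r_a: float, r_b: float, r_min: float, r_max: float) -> float:
    """Rank-based similarity: 1 - |r_a - r_b| / (r_max - r_min), ignoring ties."""
    if r_max == r_min:
        return 1.0
    return 1.0 - abs(r_a - r_b) / (r_max - r_min)


def gower_distance(
    a: Mapping[str, float],
    b: Mapping[str, float],
    kinds: Mapping[str, str],
    bounds: Mapping[str, Tuple[float, float]],
) -> float:
    """Equal-weight Gower dissimilarity, taken here as 1 - mean similarity."""
    scores = []
    for name, kind in kinds.items():
        if kind == "nominal":
            scores.append(nominal_similarity(a[name], b[name]))
        elif kind == "quantitative":
            low, high = bounds[name]
            scores.append(quantitative_similarity(a[name], b[name], high - low))
        elif kind == "ordinal":
            low, high = bounds[name]
            scores.append(ordinal_similarity(a[name], b[name], low, high))
        else:
            raise ValueError(f"unknown variable kind: {kind}")
    return 1.0 - sum(scores) / len(scores)


# Hypothetical patients and variable types (not taken from the study data).
kinds = {"bmi": "quantitative", "crp": "nominal", "potassium": "ordinal"}
bounds = {"bmi": (20.0, 40.0), "potassium": (1, 3)}  # observed min and max
patient_a = {"bmi": 30.0, "crp": 1, "potassium": 2}
patient_b = {"bmi": 35.0, "crp": 1, "potassium": 3}

distance = gower_distance(patient_a, patient_b, kinds, bounds)
print(round(distance, 4))  # prints 0.25
```

Applying `gower_distance` to every pair of patients at one time point yields the symmetric distance matrix that a hierarchical clustering routine takes as input.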
3. Results
3.1. The Case Study
3.2. Distances and Clusters of Patients
4. Discussion and Conclusions
- We first analyze those patients that are in the moderate-disease cluster at each time point: in cluster 1 at , in cluster 3 at , and in cluster 3 at . These patients are successfully treated with RASi only and with RASi + GLP1a. They are characterized by an initial mean diastolic pressure between 70 and 79 mm Hg, which slowly decreases; HbA1c starting from 7.9% and falling to 6.9% (indicating an improvement); and eGFR higher than 77 mL/min/1.73 m², well above the threshold value of 60 mL/min/1.73 m².
- Then, we focus on those patients that have a controlled disease at and (moderate disease, cluster 4). They mostly start with RASi, and some of them switch to RASi + SGLT2i or RASi + MCRa. The response to the therapeutic treatment appears slightly better at than at . The HbA1c of these patients is around 7.5% and slowly decreases to 7.3%, and their eGFR is constantly lower than the threshold value of 60 mL/min/1.73 m².
- Patients that are in cluster 5 (with poorly controlled disease and risk of kidney complications) at mostly start with RASi only and, in equal distribution, with the other three drug combinations. Patients that are treated with RASi only at then change to RASi + SGLT2i, RASi + MCRa, or RASi + GLP1a. Then, most of these patients go to clusters 1 and 3 at , and to clusters 2 and 3 at . This indicates a progressive disease improvement. These patients show a positive response when treated with RASi, RASi + SGLT2i, or RASi + GLP1a.
- The patients that are in cluster 2 (with an intermediate disease and at risk of metabolic complications) at mostly move to clusters 1 and 3 at and cluster 3 at , indicating a general improvement. Patients that start with RASi only at then are treated with RASi + SGLT2i or RASi + GLP1a.
- Patients that are in cluster 3 at (with an intermediate disease) are predominantly treated with RASi only; then, half of them change to RASi + SGLT2i, showing improvement. These patients move to clusters 1 and 2 at , and to cluster 1 at .
- Patients that are in cluster 4 at (at risk of kidney and metabolic complications) move to cluster 3 at and to cluster 2 at (a small part to cluster 1). They show a significant improvement.
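
The cluster-to-cluster movements described in this list can be tabulated as transition counts between two consecutive time points. The sketch below is only illustrative: the patient identifiers, the label dictionaries, and the function names `transition_counts` and `transition_shares` are hypothetical, and the toy labels are not taken from the study data.

```python
from collections import Counter
from typing import Dict, Hashable, Mapping, Tuple

Transition = Tuple[Hashable, Hashable]


def transition_counts(
    labels_before: Mapping[Hashable, Hashable],
    labels_after: Mapping[Hashable, Hashable],
) -> Counter:
    """Count patients moving from each cluster to each cluster at the next time point."""
    shared = labels_before.keys() & labels_after.keys()  # patients seen at both times
    return Counter((labels_before[p], labels_after[p]) for p in shared)


def transition_shares(counts: Mapping[Transition, int]) -> Dict[Transition, float]:
    """Share of each origin cluster that ends up in each destination cluster."""
    totals: Counter = Counter()
    for (origin, _), n in counts.items():
        totals[origin] += n
    return {(origin, dest): n / totals[origin] for (origin, dest), n in counts.items()}


# Toy cluster labels at two consecutive time points (made up for illustration).
before = {"p1": 5, "p2": 5, "p3": 5, "p4": 2}
after = {"p1": 1, "p2": 3, "p3": 3, "p4": 1}

counts = transition_counts(before, after)
shares = transition_shares(counts)
for (origin, dest), share in sorted(shares.items()):
    print(f"cluster {origin} -> cluster {dest}: {share:.0%}")
```

Reading the shares row by row gives statements of the form "most patients in cluster 5 move to clusters 1 and 3", which is how the bullet points above are phrased.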
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
- Serum triglycerides indicate the presence of lipids in the blood, and they are measured in mg/dL. High values of triglycerides (more than 200 mg/dL) are associated with increased cardiovascular risk. In type 2 diabetes, values of triglycerides are high, especially for patients with metabolic issues, obesity, or renal failure.
- The body mass index (BMI) is measured in kg/m². Values of BMI higher than 25 indicate overweight, and higher than 30 indicate obesity. High values of BMI are associated with increased diabetic and cardiovascular risk.
- Levels of diastolic pressure, jointly with systolic pressure information, provide an indication of cardiovascular risk. Values of diastolic pressure higher than 90 mm Hg overcome the threshold of high blood pressure.
- The glycated hemoglobin (HbA1c) gives information on the average blood glucose levels. It is measured in mmol/mol or as a percentage. Values of HbA1c lower than 7% indicate good kidney efficiency. Karpati et al. [12] build clusters of time trajectories of HbA1c, whose ranges of values are among the relevant indicators of DKD behavior.
- The ratio of urine albumin to creatinine (UACR) indicates the presence of albumin in the urine. Serum albumin is the main protein of human plasma, and high concentrations of serum creatinine in the blood indicate that the kidneys are not correctly filtering it. Creatinine is a product of creatine degradation (produced by muscles), which should normally be filtered out by the kidneys. Under normal conditions, only a small part of albumin is excreted in urine; high levels of UACR therefore denote poor kidney filtering efficiency. UACR values can be classified according to KDIGO staging (KDIGO gives international guidelines regarding kidney disease, https://kdigo.org/, accessed on 1 December 2020) as low (less than 3 mg/g), average (between 3 mg/g and 30 mg/g), and high (more than 30 mg/g). In our dataset, the mean UACR is considered: UACR fluctuates considerably through the day, and thus, in each visit, it is measured three times and the average of these values is taken.
- The estimated glomerular filtration rate (eGFR) is measured in mL/min/1.73 m²; the value of 60 mL/min/1.73 m² is considered the threshold for good kidney efficiency. The variation of eGFR is considered the response variable in our study. We computed the eGFR through the formula from the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) [48].
- The C-reactive protein (CRP) is produced in the liver in response to inflammation. It gives a measure of the progression of renal disease because inflammation is a strong renal and cardiovascular risk factor. It is measured in mg/dL. The presence of inflammation is characterized by a CRP value ≥0.5 (coded as 0); its absence by a value <0.5 (coded as 1).
- The serum potassium is absorbed with meals and filtered out by the kidneys. However, in renal failure, the amount of serum potassium increases. Serum potassium levels are also influenced by medication: for example, RASi or MCRa increase them. Here, the serum potassium is classified as low (<3.4), normal (3.4–4.5), or high (>4.5).
- The mean arterial pressure is measured as . Here, the mean arterial pressure is classified as low (<70), normal (70–100), or high (>100).
- The blood glucose is measured in mg/dL. As a laboratory value for diabetes mellitus, blood glucose fluctuates due to therapy and food intake. This problem can be addressed with multiple measurements on the same day. The diagnostic thresholds are a fasting level ≥126 mg/dL, or ≥200 mg/dL two hours after a standardized oral glucose load. Here, the blood glucose is coded as 1 (yes) if it is <130 mg/dL, and as 0 otherwise.
- The considered drug combinations are RASi only, RASi + SGLT2i, RASi + MCRa, and RASi + GLP1a. RASi stands for renin–angiotensin system inhibitors; they lower blood pressure, reduce adverse cardiovascular outcomes, and slow down the course of heart failure and chronic kidney disease. SGLT2i is the class of sodium-glucose co-transporter 2 (SGLT2) inhibitors; it includes anti-diabetic agents and lowers blood glucose. MCRa indicates the class of mineralocorticoid (aldosterone) receptor antagonists; it blocks the reabsorption of sodium, encourages water loss, and thus helps decrease blood pressure. GLP1a indicates the class of glucagon-like peptide 1 (GLP-1) receptor agonists; it improves blood sugar control and helps weight loss.
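
For reference, the 2009 CKD-EPI creatinine equation (Levey et al., the reference cited above for the eGFR) and the category coding described in this appendix can be sketched in Python as follows. The function and parameter names are illustrative assumptions, serum creatinine is assumed to be given in mg/dL, and the treatment of values falling exactly on a class boundary is a guess. Whether the study applied the race coefficient of the 2009 equation is not stated in the text, so it is left as an optional flag.

```python
def egfr_ckd_epi_2009(
    scr_mg_dl: float, age_years: float, female: bool, black: bool = False
) -> float:
    """2009 CKD-EPI creatinine equation, in mL/min/1.73 m^2."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    value = (
        141.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.209
        * 0.993 ** age_years
    )
    if female:
        value *= 1.018
    if black:
        value *= 1.159
    return value


def classify_three_levels(value: float, low_cut: float, high_cut: float) -> str:
    """Return 'low' below low_cut, 'high' above high_cut, 'normal' in between."""
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "normal"


def classify_potassium(value: float) -> str:
    """Serum potassium: low (<3.4), normal (3.4-4.5), high (>4.5)."""
    return classify_three_levels(value, 3.4, 4.5)


def classify_mean_arterial_pressure(value: float) -> str:
    """Mean arterial pressure: low (<70), normal (70-100), high (>100)."""
    return classify_three_levels(value, 70.0, 100.0)


def code_crp(crp_mg_dl: float) -> int:
    """CRP coding from the appendix: 1 if <0.5 (no inflammation), else 0."""
    return 1 if crp_mg_dl < 0.5 else 0


def code_blood_glucose(glucose_mg_dl: float) -> int:
    """Blood glucose coding from the appendix: 1 if <130 mg/dL, else 0."""
    return 1 if glucose_mg_dl < 130.0 else 0


# Hypothetical patient: 60-year-old woman with serum creatinine 1.1 mg/dL.
print(round(egfr_ckd_epi_2009(1.1, 60, female=True), 1))
print(classify_potassium(4.8), code_crp(0.2), code_blood_glucose(145.0))
```

The single threshold helper `classify_three_levels` is a design convenience: the potassium and mean arterial pressure classes differ only in their cut-offs, so both reuse it.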
References
- Mayer, G.; Heerspink, H.; Aschauer, C.; Heinzel, A.; Heinze, G.; Kainz, A.; Sunzenauer, J.; Perco, P.; Zeeuw, D.; Rossing, P.; et al. Systems Biology-Derived Biomarkers to Predict Progression of Renal Function Decline in Type 2 Diabetes. Diabetes Care 2017, 40, 391–397.
- Park, S.; Xu, H.; Zhao, H. Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data. J. Am. Stat. Assoc. 2021, 116, 14–26.
- Liu, L.; Lin, L. Subgroup analysis for heterogeneous additive partially linear models and its application to car sales data. Comput. Stat. Data Anal. 2019, 138, 239–259.
- Krakow, E.; Hemmer, M.; Wang, T.; Logan, B.; Arora, M.; Spellman, S.; Couriel, D.; Alousi, A.; Pidala, J.; Last, M.; et al. Tools for the Precision Medicine Era: How to Develop Highly Personalized Treatment Recommendations from Cohort and Registry Data Using Q-Learning. Am. J. Epidemiol. 2017, 186, 160–172.
- Goel, S.; Salganik, M. Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 2009, 28, 2202–2229.
- Fuchs, S.; Di Lascio, M.; Durante, F. Dissimilarity functions for rank-invariant hierarchical clustering of continuous variables. Comput. Stat. Data Anal. 2021, 159, 107201.
- Amiri, S.; Clarke, B.; Clarke, J. Clustering categorical data via ensembling dissimilarity matrices. J. Comput. Graph. Statist. 2017, 27, 195–208.
- Cunningham, N.; Griffin, J.; Wild, D. ParticleMDI: Particle Monte Carlo methods for the cluster analysis of multiple datasets with applications to cancer subtype identification. Adv. Data Anal. Classif. 2020, 14, 463–484.
- Doove, L.; Dusseldorp, E.; Van Deun, K.; Van Mechelen, I. A comparison of five recursive partitioning methods to find person subgroups involved in meaningful treatment–subgroup interactions. Adv. Data Anal. Classif. 2014, 8, 403–425.
- Molinari, M.; de Iorio, M.; Chaturvedi, N.; Hughes, A.; Tillin, T. Modelling ethnic differences in the distribution of insulin resistance via Bayesian nonparametric processes: An application to the SABRE cohort study. Int. J. Biostat. 2020, 17, 153–164.
- Boucquemont, J.; Loubère, L.; Metzger, M.; Combe, C.; Stengel, B.; Leffondre, K. Identifying subgroups of renal function trajectories. Nephrol. Dial. Transpl. 2017, 32, ii185–ii193.
- Karpati, T.; Leventer-Roberts, M.; Feldman, B.; Cohen-Stavi, C.I.R.; Balicer, R. Patient clusters based on HbA1c trajectories: A step toward individualized medicine in type 2 diabetes. PLoS ONE 2018, 13, e0207096.
- Perco, P.; Mayer, G. Molecular, histological, and clinical phenotyping of diabetic nephropathy: Valuable complementary information? Kidney Int. 2018, 93, 308–310.
- Mac Lane, S. Categories for the Working Mathematician; Cambridge University Press: Cambridge, UK, 1978.
- Grandis, M. Higher Category Theory; World Scientific: Singapore, 2020.
- Baez, J.; Lauda, A. A Prehistory of n-Categorical Physics. In Deep Beauty: Understanding the Quantum World through Mathematical Innovation; Cambridge University Press: Cambridge, UK, 2011.
- Spivak, D. Category Theory for the Sciences; MIT Press: Cambridge, MA, USA, 2014.
- Rosen, R. The Representation of Biological Systems from the Standpoint of the Theory of Categories. Bull. Math. Biophys. 1958, 20, 317–341.
- Varenne, F. The Mathematical Theory of Categories in Biology and the Concept of Natural Equivalence in Robert Rosen. Revue D’Histoire Des Sci. 2013, 66, 167–197.
- Ehresmann, A.; Gómez-Ramirez, E. Conciliating neuroscience and phenomenology via Category Theory. Prog. Biophys. Mol. Biol. (PBMB) 2015, 119, 347–359.
- Carlsson, G.; Mémoli, F. Classifying Clustering Schemes. Found. Comput. Math. 2013, 13, 221–252.
- Carlsson, G.; Mémoli, F. Multiparameter Hierarchical Clustering Methods. In Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 63–70.
- Bauer, U.; Botnan, M.; Oppermann, S.; Steen, J. Cotorsion torsion triples and the representation theory of filtered hierarchical clustering. Adv. Math. 2020, 369, 107171.
- Podani, J. Extending Gower’s General Coefficient of Similarity to Ordinal Characters. Taxon 1999, 48, 331–340.
- Gower, J. A general coefficient of similarity and some of its properties. Biometrics 1971, 27, 857–871.
- Hummel, M.; Edelmann, D.; Kopp-Schneider, A. Clustering of samples and variables with mixed-type data. PLoS ONE 2017, 12, e0188274.
- Distefano, V.; Mannone, M.; Silvestri, C.; Poli, I. Categories and Clusters to investigate Similarities in Diabetic Kidney Disease Patients. In Book of Short Papers, SIS 2021; Pearson: Pisa, Italy, 2021; pp. 1162–1168.
- Myers, D. Double categories of Open Dynamical Systems. Appl. Catego. Theory 2020, 154–167.
- Böhm, G. The Gray Monoidal Product of Double Categories. Appl. Categ. Struct. 2020, 28, 477–515.
- Den Teuling, N.; Pauws, S.; Heuvel, E. A comparison of methods for clustering longitudinal data with slowly changing trends. Commun. Stat. Simul. Comput. 2021, 52, 621–648.
- Oellgaard, J.; Gaede, P.; Rossing, P.; Persson, F.; Parving, H.; Pedersen, O. Intensified multifactorial intervention in type 2 diabetics with microalbuminuria leads to long-term renal benefits. Kidney Int. 2017, 91, 982–988.
- Aschauer, C.; Perco, P.; Heinzel, A.; Sunzenauer, J.; Oberbauer, R. Positioning of Tacrolimus for the Treatment of Diabetic Nephropathy Based on Computational Network Analysis. PLoS ONE 2017, 12, e0169518.
- Bauer, U.; Botnan, M.; Oppermann, S.; Steen, J. A comparative study of divisive and agglomerative hierarchical clustering algorithms. J. Classif. 2018, 35, 345–366.
- Everitt, B.; Landau, S.; Leese, M. Cluster Analysis; Oxford University Press: Oxford, UK, 2011.
- Miyamoto, S.; Abe, R.; Endo, Y.; Takeshita, J. Ward Method of Hierarchical Clustering for Non-Euclidean Similarity Measures. In Proceedings of the 2015 Seventh International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015), Fukuoka, Japan, 13–15 November 2015; pp. 60–63.
- Hirano, S.; Sun, X.; Tsumoto, S. Comparison of clustering methods for clinical databases. Inf. Sci. 2004, 159, 155–165.
- Egan, B.; Sutherland, S.; Tilkemeier, P.; Davis, R.; Rutledge, V.; Sinopoli, A. A cluster-based approach for integrating clinical management of Medicare beneficiaries with multiple chronic conditions. PLoS ONE 2019, 14, e0217696.
- Inohara, T.; Shrader, P.; Pieper, K.; Blanco, R.; Thomas, L.; Singer, D.; Freeman, J.V.; Allen, L.A.; Fonarow, G.C.; Gersh, B.; et al. Association of Atrial Fibrillation Clinical Phenotypes with Treatment Patterns and Outcomes: A Multicenter Registry Study. JAMA Cardiol. 2018, 3, 54–63.
- Aschenbruck, R.; Szepannek, G. Cluster Validation for Mixed-Type Data. Arch. Data Sci. Ser. A 2020, 6, 2.
- Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On Clustering Validation Techniques. J. Intell. Inf. Syst. 2001, 17, 107–145.
- Nieweglowski, L. Package ‘clv’: Cluster Validation Techniques. Available online: https://rdrr.io/cran/clv/ (accessed on 31 May 2023).
- Halkidi, M.; Vazirgiannis, M. Clustering Validity Assessment: Finding the optimal partitioning of a data set. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001.
- Neuen, B.; Weldegiorgis, W.; Herrington, W.; Ohkuma, T.; Smith, M.; Woodward, M. Changes in GFR and Albuminuria in Routine Clinical Practice and the Risk of Kidney Disease Progression. Am. J. Kidney Dis. 2021, 78, 350–360.
- Zaharia, O.; Strassburger, K.; Strom, A.; Bönhof, G.; Karusheva, Y.; Antoniou, S.; Bódis, K.; Markgraf, D.F.; Burkart, V.; Müssig, K.; et al. Risk of diabetes-associated diseases in subgroups of patients with recent-onset diabetes: A 5-year follow-up study. Lancet 2019, 7, 684–694.
- Vallati, M.; Gatta, R.; De Bari, B.; Magrini, S. Clinical Similarities: An Innovative Approach for Supporting Medical Decisions. Stud. Health Technol. Inform. 2013, 192, 1114.
- McIsaac, M.A.; Cook, R.J. Response-dependent sampling with clustered and longitudinal data. In ISS-2012 Proceedings Volume on Longitudinal Data Analysis Subject to Measurement Errors, Missing Values, and/or Outliers; Springer: Berlin/Heidelberg, Germany, 2013; pp. 157–181.
- Sheng, Y.; Yang, C.; Curhan, S.; Curhan, G.; Wang, M. Analytical methods for correlated data arising from multicenter hearing studies. Stat. Med. 2022, 41, 5335–5348.
- Levey, A.S.; Stevens, L.A.; Schmid, C.H.; Zhang, Y.L.; Castro, A.F.; Feldman, H.I. A new equation to estimate glomerular filtration rate. Ann. Intern. Med. 2009, 150, 9.

Continuous Variable | Mean (Standard Deviation) |
---|---|
body mass index (BMI), kg/m² | 31.77 (5.56) |
diastolic pressure, mm Hg | 79.29 (10.05) |
glycated hemoglobin (HbA1c), % | 7.34 (1.24) |
mean ratio of urine albumin to creatinine (UACR), mg/g | 73.27 (259.67) |
estimated glomerular filtration rate (eGFR), mL/min/1.73 m² | 63.38 (15.81) |
triglycerides, mg/dL | 182.1 (157.00) |

Nominal Variable | Distribution |
---|---|
serum potassium | 1 (1.3%), 2 (63.4%), 3 (35.3%) |
mean arterial pressure | 2 (63%), 3 (37%) |
blood glucose | (40.4% yes) |
C-reactive protein (CRP) | (68.9% yes) |

| Distance Matrices | Mean Values of Distances | Inter-Cluster Distance | Intra-Cluster Distance | Frobenius Norm |
|---|---|---|---|---|
| | 0.23 (0.09) | 0.25 | 0.11 | 58.25 |
| | 0.24 (0.09) | 0.27 | 0.14 | 59.76 |
| | 0.28 (0.10) | 0.31 | 0.18 | 71.16 |
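
The summary measures named in the header of this table can be computed from a square distance matrix and a list of cluster labels. The sketch below uses plain Python lists; the function names are hypothetical, and both the averaging over unordered pairs of distinct patients and the Frobenius norm taken over the full symmetric matrix are assumptions, since the exact definitions are not recoverable from the text. The toy matrix is made up.

```python
import math
from itertools import combinations
from typing import Hashable, List, Sequence, Tuple


def frobenius_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Square root of the sum of squared entries of the matrix."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def mean_intra_inter(
    matrix: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> Tuple[float, float]:
    """Mean distance over same-cluster pairs and over different-cluster pairs."""
    intra: List[float] = []
    inter: List[float] = []
    for i, j in combinations(range(len(labels)), 2):
        target = intra if labels[i] == labels[j] else inter
        target.append(matrix[i][j])

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy symmetric distance matrix for four patients in two clusters.
toy = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.4],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.4, 0.2, 0.0],
]
toy_labels = [0, 0, 1, 1]

intra_mean, inter_mean = mean_intra_inter(toy, toy_labels)
print(round(intra_mean, 3), round(inter_mean, 3), round(frobenius_norm(toy), 3))
```

A well-separated partition should show a mean intra-cluster distance clearly below the mean inter-cluster distance, which is the pattern in each row of the table.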

| Time | Cluster | Size | Within Distance | Triglycerides | BMI | Diastolic Pressure | HbA1c | Mean UACR | eGFR |
|---|---|---|---|---|---|---|---|---|---|
| | 1 | 21 | 0.10 | 218.24 | 33.80 | 86.71 | 7.90 | 41.22 | 71.81 |
| | 2 | 62 | 0.09 | 210.48 | 31.30 | 74.72 | 7.55 | 46.48 | 64.64 |
| | 3 | 33 | 0.09 | 217.63 | 32.08 | 88.48 | 7.70 | 67.84 | 66.33 |
| | 4 | 39 | 0.18 | 166.79 | 33.57 | 74.20 | 7.50 | 89.09 | 61.41 |
| | 5 | 48 | 0.09 | 143.56 | 30.32 | 74.02 | 6.87 | 50.90 | 61.67 |
| | 6 | 32 | 0.16 | 143.25 | 30.60 | 87.03 | 6.69 | 164.26 | 62.47 |
| | 1 | 66 | 0.10 | 195.91 | 30.99 | 76.64 | 7.62 | 55.02 | 63.57 |
| | 2 | 42 | 0.17 | 224.17 | 32.59 | 78.42 | 7.75 | 80.87 | 62.45 |
| | 3 | 99 | 0.17 | 154.08 | 30.72 | 76.72 | 6.72 | 46.43 | 65.71 |
| | 4 | 28 | 0.14 | 238.32 | 34.07 | 74.43 | 8.22 | 34.72 | 56.71 |
| | 1 | 65 | 0.23 | 187.57 | 31.31 | 84.09 | 7.62 | 74.82 | 62.36 |
| | 2 | 72 | 0.20 | 145.22 | 31.55 | 74.61 | 6.63 | 61.23 | 60.82 |
| | 3 | 57 | 0.13 | 224.84 | 31.46 | 72.74 | 7.61 | 27.93 | 66.47 |
| | 4 | 41 | 0.17 | 211.71 | 31.17 | 72.97 | 8.18 | 95.15 | 53.78 |

% of Controlled Response at

Drug before | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
---|---|---|---|---|
RASi | 60.00 | 56.00 | 62.90 | 43.75 |
RASi + SGLT2i | 59.09 | 75.00 | 57.14 | 66.67 |
RASi + MCRa | 50.00 | 66.67 | 45.45 | 0.00 |
RASi + GLP1a | 66.67 | 66.67 | 60.00 | 66.67 |

% of Controlled Response at

Drug before | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
---|---|---|---|---|
RASi | 51.11 | 47.50 | 70.00 | 56.52 |
RASi + SGLT2i | 63.64 | 55.56 | 72.73 | 83.33 |
RASi + MCRa | 25.00 | 66.67 | 0.00 | 28.57 |
RASi + GLP1a | 100.00 | 25.00 | 100.00 | 100.00 |

| Matrix/Measure | Mean | Frobenius Norm of | Chebyshev Distance between and |
|---|---|---|---|
| | 0.007 | 26.23 | 33.68 |
| | 0.047 | 29.15 | 41.50 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Distefano, V.; Mannone, M.; Poli, I. Exploring Heterogeneity with Category and Cluster Analyses for Mixed Data. Stats 2023, 6, 747-762. https://doi.org/10.3390/stats6030048