This section begins with a detailed description of the experimental dataset, followed by a comparison of the model evaluation metrics and an analysis of the results. Each combination of the five imbalanced-data processing methods and the six multi-label classification methods described above is applied in the data analysis and compared against the six multi-label classification methods applied to the original, unresampled data.
3.1. Data Description
This article examines eight complications after TAVR surgery: any stroke/TIA (AS), acute MI (AMI), PPM placement (PPMP), conversion to SAVR (CTS), cardiogenic shock (CS), cardiac arrest (Ca), major bleeding (MB), and acute kidney injury (AKI). The data were obtained from a hospital in the United States, and the occurrence of each complication was determined by matching ICD-9/ICD-10 codes, taking the value 1 if the complication occurred and 0 otherwise. The distribution of complications highlights the highly imbalanced nature of the data; the total sample size is 25,965. Details are shown in
Table 1 below.
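As an illustration of this encoding, the sketch below maps a patient's list of diagnosis codes to a 0/1 complication vector. The function name and the code sets are placeholders for illustration only, not the actual ICD-9/ICD-10 definitions used in the study.

```python
def encode_complications(patient_codes, code_map):
    """Turn a patient's list of diagnosis codes into a 0/1 complication vector,
    one entry per complication, in the order of code_map."""
    patient = set(patient_codes)
    return [int(bool(patient & codes)) for codes in code_map.values()]

# Placeholder code sets; the real mapping is defined by the study protocol.
code_map = {
    "MB":  {"code_mb_1", "code_mb_2"},
    "AKI": {"code_aki_1"},
}
```

A patient whose record contains `"code_aki_1"` but no major-bleeding code would be encoded as `[0, 1]`.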
The calculated MeanIR value of the complications of interest is 92.4, which indicates a very high degree of imbalance. To comprehensively portray the distribution characteristics and structural dependencies of the labels in the research dataset, this paper combines two visualisation tools, the UpSet plot and the chord diagram, to analyse the imbalance and co-occurrence patterns of the labels in depth.
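For reference, MeanIR is the average over labels of the per-label imbalance ratio IRLbl, i.e. the count of the most frequent label divided by the count of each label. A minimal sketch (the function and the toy label matrix are illustrative):

```python
import numpy as np

def mean_ir(Y):
    """MeanIR for a binary label matrix Y of shape (samples, labels).

    IRLbl(l) = count of the most frequent label / count of label l;
    MeanIR is the unweighted average of IRLbl over all labels."""
    counts = Y.sum(axis=0).astype(float)
    ir = counts.max() / counts      # IRLbl per label
    return float(ir.mean())

# Toy matrix with label counts 8, 4, and 2 -> IRLbl = [1, 2, 4], MeanIR = 7/3.
Y = np.array([[1, 1, 1]] * 2 + [[1, 1, 0]] * 2 + [[1, 0, 0]] * 4)
```

Perfectly balanced labels give MeanIR = 1; the value 92.4 reported above means that, on average, each complication is almost two orders of magnitude rarer than the most frequent one.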
Figure 1 shows the intersection distribution of labels and their combinations in the dataset. The UpSet plot shows the number of elements in every possible intersection of the label sets, where each bar on top represents the size of the specific intersection identified by the connected black dots below it. The distribution of labels is clearly unbalanced: individual labels (e.g., “PPMP”) appear much more frequently than other labels, and a large number of label combinations occur in only a very small number of samples, which is typical of a long-tailed distribution. The most common label combination contains only a single label, with a sample size of 4745, while a large number of low-frequency combinations have sample sizes below 10, indicating that the label-combination space is highly sparse. This feature poses a challenge to multi-label classification models: models based on label-combination learning are prone to overfitting to high-frequency combinations while failing to model rare labels and their joint expressions.
To further reveal the dependency structure between the labels, the co-occurrence relationship between the labels was visualised using a chord diagram in
Figure 2. The arcs in the figure represent the marginal frequencies of each label, and the connecting bands reflect the co-occurrence intensity between labels. “PPMP” not only has the highest marginal frequency but also co-occurs intensively with multiple labels (e.g., “AKI”, “AMI”, “AS”), forming a centralised structure and suggesting that these labels may follow a strongly correlated pattern in clinical settings. By contrast, labels such as “CTS” and “Ca” occur mostly in isolation or together with only a small number of other labels. The label co-occurrence network thus shows a clear hub-and-spoke pattern, in which a small number of core labels are unevenly coupled with a large number of peripheral labels.
For the demographic characteristics of this dataset, 56.16% of patients were male and 43.84% were female. The minimum age of the study population was 2 years, the maximum age was 90 years, and the median age was 80 years. The histogram of the age distribution is shown in
Figure 3; 97.4% of the sample was above 60 years old.
For hospital characteristics, we report statistical information on the region of the hospital’s census division (HOSP_DIVISION) and on the hospital’s location and teaching status (HOSP_LOCTEACH). HOSP_DIVISION has nine levels, and the region with the highest percentage is Middle Atlantic, with 17.08%. HOSP_LOCTEACH has three levels, with the largest share of patients, 70.2%, treated at hospitals in the Urban teaching category. In the subsequent modelling, the Rural and Urban nonteaching levels of HOSP_LOCTEACH were combined into one category, so that only teaching and nonteaching hospitals are distinguished, and only HOSP_LOCTEACH was included in the analysis.
The levels of independent variables used for modelling analysis are summarised in
Table 2:
3.2. Model Evaluation Indicators
To assess model effectiveness, a random 70% of the dataset was designated as the training set, and the remaining 30% constituted the validation set. For the multi-label classification problem above, TP, FN, TN, and FP denote, respectively, the number of samples that are actually in the positive class and predicted to be positive; actually in the positive class but incorrectly predicted to be negative; actually in the negative class and predicted to be negative; and actually in the negative class but incorrectly predicted to be positive.
A series of model evaluation metrics has been proposed for multi-label classification problems; in this paper, we mainly consider micro-F1, macro-F1, Hamming loss, ranking loss, and micro-AUC. In the multi-label classification task, micro-F1 and macro-F1 are two commonly used variants of the F1-score that measure the overall performance of a model in the presence of label imbalance and a large number of categories.
Macro F-measure: macro-F1 first calculates the F1 score separately on each label dimension and subsequently takes the arithmetic average, thus assigning the same weight to each label. This metric is more sensitive to the model’s performance across all labels and is more effective in revealing the model’s ability to learn minority labels:

$$\text{macro-F1} = \frac{1}{b}\sum_{j=1}^{b}\frac{2P_jR_j}{P_j + R_j},$$

where $P_j = \frac{TP_j}{TP_j + FP_j}$ and $R_j = \frac{TP_j}{TP_j + FN_j}$ denote the precision and recall on the $j$th label, and $b$ is the number of labels.
Micro F-measure: micro-F1 is an F1 value calculated from the confusion-matrix counts accumulated over all samples and labels; its precision and recall are derived from the overall true positives (TP), false positives (FP), and false negatives (FN), yielding a single unified performance metric:

$$\text{micro-F1} = \frac{2\sum_{j=1}^{b} TP_j}{2\sum_{j=1}^{b} TP_j + \sum_{j=1}^{b} FP_j + \sum_{j=1}^{b} FN_j}.$$

Micro-F1 therefore emphasises performance on common labels and is suitable for scenarios with uneven label distributions and large sample sizes.
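The two averaging schemes can be sketched as follows; the function is illustrative, not the implementation used in the study:

```python
import numpy as np

def macro_micro_f1(Y_true, Y_pred):
    """Macro- and micro-F1 for binary label matrices of shape (samples, labels)."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0).astype(float)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0).astype(float)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0).astype(float)

    # Macro: per-label F1 averaged with equal weight, so rare labels count
    # as much as common ones.
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    macro = per_label.mean()

    # Micro: pool TP/FP/FN over all labels first, so frequent labels dominate.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return float(macro), float(micro)
```

On a toy pair of matrices where one rare label is half-missed, macro-F1 drops much more sharply than micro-F1, which is exactly why both are reported here.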
Hamming loss: the proportion of sample–label entries that are misclassified by the classifier, with lower values indicating better classification:

$$\text{Hamming loss} = \frac{1}{rb}\sum_{i=1}^{r}\sum_{j=1}^{b}[\![\, y_{ij} \neq \hat{y}_{ij} \,]\!],$$

where $r$ is the number of samples, $b$ is the number of labels, $y_{ij}$ and $\hat{y}_{ij}$ are the true and predicted values of the $j$th label of the $i$th sample, and $[\![\cdot]\!]$ equals 1 if its argument is true and 0 otherwise.
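A direct transcription of this definition (illustrative only):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified (sample, label) entries; lower is better."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    r, b = Y_true.shape               # r samples, b labels, as in the text
    return float((Y_true != Y_pred).sum()) / (r * b)
```

For example, two wrong entries out of a 2 × 3 label matrix give a Hamming loss of 1/3.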
Ranking loss: the proportion of irrelevant labels ranked ahead of truly relevant labels; the lower the value, the better the performance of the classifier:

$$\text{ranking loss} = \frac{1}{r}\sum_{k=1}^{r}\frac{1}{|Y_k|\,|\bar{Y}_k|}\sum_{y_p\in Y_k}\sum_{y_q\in \bar{Y}_k}[\![\, f(x_k, y_p) \le f(x_k, y_q) \,]\!],$$

where the function $f(x_k, y_j)$, in the range [0, 1], gives the predicted probability of relevance of the $j$th label for the $k$th test instance; $|Y_k|$ and $|\bar{Y}_k|$ are the numbers of relevant and irrelevant labels in that instance, respectively; and $[\![\cdot]\!]$ equals 1 if its argument is true and 0 otherwise.
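This pairwise definition can be sketched directly; note that, as in the formula above, a tie is counted as a mis-ordering (other implementations may treat ties differently):

```python
import numpy as np

def ranking_loss(Y_true, scores):
    """Average over instances of the fraction of (relevant, irrelevant) label
    pairs that are mis-ordered; a tie counts as a mis-ordering here."""
    losses = []
    for y, s in zip(np.asarray(Y_true), np.asarray(scores, dtype=float)):
        rel, irr = s[y == 1], s[y == 0]
        if len(rel) == 0 or len(irr) == 0:
            continue                  # undefined for this instance; skip it
        bad = sum(1 for p in rel for q in irr if p <= q)
        losses.append(bad / (len(rel) * len(irr)))
    return float(np.mean(losses))
```

For an instance with relevant scores {0.9, 0.2} and irrelevant score {0.5}, one of the two pairs is mis-ordered, giving a per-instance loss of 0.5.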
Micro-AUC: it evaluates the proportion of (relevant, irrelevant) label pairs, pooled across all instances, in which the model assigns a higher score to the relevant label than to the irrelevant one. The higher the micro-AUC value, the better the performance of the classifier.
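Unlike ranking loss, the pairs are pooled over the whole dataset rather than averaged per instance. A minimal sketch, counting ties as half a win (the usual AUC convention; this tie handling is our assumption):

```python
import numpy as np

def micro_auc(Y_true, scores):
    """Fraction of (relevant, irrelevant) instance-label pairs, pooled over the
    whole dataset, in which the relevant entry receives the higher score;
    ties count as half a win."""
    y = np.asarray(Y_true).ravel()
    s = np.asarray(scores, dtype=float).ravel()
    pos, neg = s[y == 1], s[y == 0]
    wins = sum((1.0 if p > q else 0.5 if p == q else 0.0)
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

This brute-force version is O(|pos| · |neg|); production code would sort the scores instead, but the pooled-pairs definition is clearest this way.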
3.3. Result Analysis
The variable importance analysis, as shown in
Figure 4, intuitively reveals that the impact of different comorbidities on specific complications varies significantly. Taking dyslipidemia (Dys) as an example, it exhibits relatively high predictive importance for PPM placement (PPMP) and acute kidney injury (AKI), which likely relates to pathophysiological mechanisms such as the cardiac dysfunction associated with PPMP and the fluid retention and pulmonary edema resulting from AKI. However, the importance of Dys is significantly lower for complications such as acute MI (AMI), conversion to SAVR (CTS), and cardiac arrest (Ca), indicating that its predictive value is complication-specific. On the other hand, a prior implantable cardioverter-defibrillator (PICD), as an indicator of previous cardiac interventional therapy, demonstrates some predictive utility for AMI, cardiogenic shock (CS), and AKI, and its importance is particularly prominent for PPMP. This directly reflects the complexity and severity of cardiovascular disease in PICD patients, which increases their risk of developing other cardiac-related complications. In contrast, the predictive effect of PICD on any stroke/TIA (AS) and CTS is negligible. This refined, differentiated identification of the varying effects of different comorbidities on distinct complications further underscores the necessity of employing shrinkage estimation methods in imbalanced multi-label classification tasks: such an approach can precisely capture and quantify these specific associations, leading to more discriminative, clinically relevant, and personalized predictive models.
Before performing multi-label classification, this paper applies five resampling methods to preprocess the imbalanced dataset. The results after sampling with each method are displayed in
Table 3. Comparing the improvement of the five preprocessing methods on the imbalance level of the multi-label dataset, LPROS exhibits the best performance. The experimental data show that, after LPROS processing, the maximum imbalance ratio (MaxIR) is reduced from 507.15 to 40.1 (a relative reduction of 92.1%) and the mean imbalance ratio (MeanIR) is reduced from 92.4 to 13.4 (a reduction of 85.5%), which is significantly better than the other methods (LPRUS, MLROS, MLRUS). It is worth noting that LPRUS and MLRUS, which are based on random undersampling, show limited improvement in imbalance (MaxIR reduction < 5.5%), while the REMEDIAL method, which focuses on noise processing, does not change the original data distribution (a change of 0%).
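The idea behind LPROS can be sketched as follows: treat each complete labelset as one class and randomly clone instances of rare labelsets. This is a simplified illustration (the published algorithm additionally caps the total number of added clones, which is omitted here):

```python
import numpy as np
from collections import Counter

def lp_ros(X, Y, rng=None):
    """Simplified label-powerset random oversampling (LPROS): instances whose
    complete labelset occurs less often than the mean labelset frequency are
    randomly cloned until that labelset reaches the mean."""
    if rng is None:
        rng = np.random.default_rng(0)
    keys = [tuple(row) for row in Y]
    counts = Counter(keys)
    mean = int(np.mean(list(counts.values())))
    X_new, Y_new = list(X), list(Y)
    for labelset, c in counts.items():
        if c < mean:
            idx = [i for i, k in enumerate(keys) if k == labelset]
            for i in rng.choice(idx, size=mean - c, replace=True):
                X_new.append(X[i])
                Y_new.append(Y[i])
    return np.array(X_new), np.array(Y_new)
```

With two labelsets occurring 8 and 2 times, the mean frequency is 5, so the rarer labelset is cloned three times and the dataset grows from 10 to 13 instances.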
After resampling the data, the commonly used representative multi-label classification methods BR, LP, CC, CLR, HOMER, and MLKNN are applied. These methods are combined with the LocalGLMnet-like regularisation component described above. The corresponding model evaluation metrics are shown in the following tables. In
Table 4,
Table 5,
Table 6,
Table 7 and
Table 8, the first row, “Original”, indicates results that were not resampled and were derived using the original MLC method.
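Of these methods, binary relevance (BR) is the simplest: it trains one independent binary classifier per label and ignores label correlations (the limitation that CC and LP address). A self-contained sketch with a toy nearest-centroid base learner; both classes are illustrative, not the study's implementation, which pairs these methods with the LocalGLMnet-like component:

```python
import numpy as np

class NearestCentroid:
    """Tiny illustrative base learner: predicts the class of the nearer centroid."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return (d1 < d0).astype(int)

class BinaryRelevance:
    """Binary relevance (BR): one independent binary classifier per label."""
    def fit(self, X, Y):
        self.models = [NearestCentroid().fit(X, Y[:, j])
                       for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])
```

Any binary learner can be dropped in as the base model; the decomposition into per-label problems is what defines BR.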
Table 4 shows the Hamming loss performance of six classifiers (BR, LP, CC, CLR, HOMER, MLKNN) under different resampling methods. The results show that the resampling strategy has a significant effect on classification performance. Among the methods, LPRUS performs best and significantly reduces the Hamming loss of multiple classifiers, indicating that it is effective in optimising the sample distribution and improving classification accuracy. On the contrary, LPROS leads to a significant increase in Hamming loss, indicating that it may disrupt the original sample structure and thus weaken model performance. From the perspective of the classifiers, CLR and HOMER perform particularly well under LPRUS resampling, but their advantage is not obvious under the other sampling strategies; MLKNN is more sensitive to the sampling method, and its performance fluctuates considerably; the performance of BR, LP, and CC is relatively stable under all sampling strategies. Overall, the LPRUS resampling method combined with the CLR or HOMER classifiers is an effective strategy for improving the Hamming loss of multi-label classification on this dataset.
Table 5 shows the ranking loss performance. Overall, the LPRUS method achieves the lowest ranking loss across all classifiers, significantly outperforming the other resampling strategies, especially on CLR, MLKNN, and HOMER, suggesting that it is effective in optimising label ranking performance. On the contrary, the LPROS method generally leads to a significant increase in ranking loss, especially for classifiers such as LP and CC, reflecting that it may introduce redundant or misleading samples and disturb the label structure. The overall improvement from MLROS, MLRUS, and REMEDIAL is limited, although REMEDIAL shows some advantage on HOMER and MLKNN. In terms of the classifiers, CLR has the best overall performance, with the lowest ranking loss and low sensitivity to the sampling strategy, while MLKNN responds strongly to the sampling method, and an appropriate strategy can bring an obvious performance improvement. In summary, LPRUS combined with CLR or MLKNN is the recommended strategy for improving the ranking-loss performance of multi-label classification on this dataset.
Table 6 reports the micro F-measure. The results show that LPRUS sampling significantly improves the performance of most classification methods, achieving the highest micro F-measure for several of them and demonstrating its effectiveness in optimising the sample distribution and enhancing classification ability. In contrast, LPROS sampling results in a general decrease in performance. Among the classification methods, CLR and HOMER perform best under LPRUS sampling, but their advantage is not obvious under the other sampling conditions, while BR, LP, and CC perform more consistently across sampling methods, with performance affected by sampling in a similar pattern. In summary, LPRUS sampling combined with the CLR or HOMER classification methods is an effective strategy for improving performance on this dataset.
Table 7 reports the macro F-measure results. Overall, LPROS is the only resampling method that significantly improves the macro-F values, achieving substantial improvement especially for CC (0.2597), CLR (0.2325), and BR (0.2288). The values of BR, LP, CC, CLR, and HOMER under this sampling are significantly higher than under most other sampling settings, suggesting that LPROS is able to optimise the sample distribution to a certain extent and enhance the combined performance of the model across categories. For the LPRUS, MLROS, MLRUS, and REMEDIAL sampling methods, the macro F-measure values of the classification methods are generally lower and closer together, and the enhancement effect is not obvious. Among the classification methods, MLKNN performs outstandingly under the Original and LPROS settings, with a macro F-measure of 0.4498364, much higher than the other classification methods under the same sampling, showing that it is better balanced in predicting each category under these conditions. BR, LP, CC, CLR, and HOMER show relatively small differences across sampling conditions, with lower macro F-measure values in most cases.
Table 8 summarises the micro-AUC results, reflecting the overall discriminative ability of the models with respect to label relevance. In terms of overall performance, CLR and MLKNN are the two classifiers with the strongest discriminative ability, obtaining high AUC values under most resampling strategies; under the LPRUS strategy in particular, MLKNN (0.9242) and CLR (0.9234) achieve the highest micro-AUC values. Among the resampling methods, LPRUS is the only one that improves the AUC for all classifiers, indicating that its optimisation of the sample-space structure helps improve the overall ranking performance of the models, especially for classifiers such as CLR, MLKNN, and CC. MLRUS and MLROS also perform stably, maintaining AUC values close to or even better than those on the original data across a variety of classifiers, showing a certain generalisation ability. In contrast, the LPROS and REMEDIAL methods fail to improve the AUC for most classifiers, and even suffer significant degradation, suggesting that their resampling strategies may introduce noise or disturb the label discrimination boundaries.
It is worth highlighting that CLR maintains a high level under all sampling strategies (minimum 0.8710, maximum 0.9234), showing strong stability and robustness, while MLKNN reaches its optimum under LPRUS and MLRUS, with strong discriminative performance in the imbalanced label-ranking task. In summary, if micro-AUC is the main evaluation index, LPRUS is the most effective resampling method, CLR and MLKNN are the most discriminative multi-label classifiers, and their combination performs particularly well in the multi-label ranking task on this dataset.
Analysis of the performance metrics (micro F-measure, macro F-measure, Hamming loss, ranking loss, and micro-AUC) across the resampling and classification methods reveals distinct strengths among the approaches. LPRUS consistently demonstrates superior performance on the metrics reflecting overall accuracy and discrimination, achieving the highest micro F-measure and micro-AUC alongside the lowest Hamming loss and ranking loss for most classifiers. This indicates its efficacy in enhancing general predictive performance, reducing misclassification rates, and improving ranking quality. Conversely, LPROS excels in the macro F-measure, showcasing its capability to mitigate class imbalance and improve predictive performance for minority classes, albeit at the potential cost of slightly diminished overall accuracy. The performance of MLROS, MLRUS, and REMEDIAL generally falls between these two extremes, offering modest improvements over the “Original” approach but matching neither the consistent superiority of LPRUS on aggregate metrics nor the specialised strength of LPROS in handling class imbalance. The “best” model therefore depends on the specific objective: LPRUS-based models are optimal for achieving robust overall predictive accuracy and minimising errors, while LPROS is paramount when prioritising the accurate identification and recall of rare events or minority classes within imbalanced datasets.
In this study, most models exhibited high computational efficiency during the fitting phase, with the exception of the MLKNN-based methods, which were comparatively slower. This suggests that, once trained, these models can generate predictions rapidly, making them promising for quick deployment in real-world clinical settings. However, a detailed analysis of prediction latency and of specific optimisations for real-time deployment lies beyond the scope of this research.