This work is framed as a data-centric, quality-aware preprocessing algorithm that operates before segmentation. The proposed pipeline has two complementary components: (i) a lightweight visual enhancement stage that increases anatomical contrast while preserving structure, and (ii) an automatic slice-quality estimator that uses anatomically grounded attributes extracted by a YOLOv11 detector to reject unreliable slices prior to segmentation. The overall goal is to improve robustness and stability of downstream models (here, ViTUNeT) without increasing segmentation-network complexity.
2.2. Visual Enhancement Preprocessing
Cardiac MRI slices frequently exhibit low contrast and acquisition-related artifacts that obscure myocardial boundaries, particularly in basal and apical regions. To address these limitations, we apply a three-stage visual enhancement pipeline designed to improve trabecular visibility while preserving anatomical edge integrity.
- 1. Intensity rescaling. Each slice is linearly mapped to an 8-bit intensity range to standardize dynamic range across acquisitions and provide a consistent representation for subsequent processing. In our pipeline, MRI slices are already handled at 8-bit resolution; therefore, this operation does not reduce bit depth but homogenizes intensity ranges prior to OpenCV-based bilateral filtering and CLAHE. Subsequent z-score normalization ensures that all segmentation models operate on floating-point inputs.
- 2. Edge-preserving denoising. A bilateral filter [17] is applied to suppress impulsive noise and background artifacts while preserving the myocardium–blood pool boundary. The filter is configured with a fixed kernel diameter and fixed intensity-domain and spatial-domain standard deviations, chosen to attenuate noise without blurring anatomical boundaries.
- 3. Local contrast normalization. Contrast-limited adaptive histogram equalization (CLAHE) [18,19] enhances local contrast and improves the visibility of subtle intensity variations, particularly within trabeculated regions. CLAHE is applied with a fixed clip limit and a tile grid size of (2, 2) to avoid over-enhancement artifacts while improving local structural contrast.
The sequential order of these operations is intentional. Intensity rescaling first homogenizes the input intensity distribution. Bilateral filtering is then applied to reduce noise while preserving anatomical edges, preventing noise amplification during contrast enhancement. Finally, CLAHE is applied once noise has been attenuated, ensuring that contrast enhancement primarily highlights anatomically meaningful structures rather than acquisition artifacts.
As illustrated in Figure 2, the enhanced images present improved intensity balance and clearer delineation of both compacted myocardium and trabeculated regions, reducing boundary ambiguity and facilitating downstream segmentation tasks.
To complement this qualitative assessment, quantitative image quality metrics were computed across the full dataset (3459 slices). Shannon entropy [23], root-mean-square (RMS) contrast, and mean Sobel gradient magnitude [19] were measured before and after preprocessing. The aggregated results are summarized in Table 1.
The consistent increase in entropy reflects improved intensity dispersion, the rise in RMS contrast indicates enhanced global contrast, and the higher Sobel magnitude confirms stronger edge definition. Together, these quantitative results support the effectiveness of the proposed enhancement pipeline without evidence of excessive artificial amplification.
The enhancement pipeline was implemented in Python (version 3.10, Python Software Foundation, Wilmington, DE, USA) using OpenCV (version 4.8, OpenCV Foundation, Mountain View, CA, USA) and scikit-image (version 0.21, scikit-image developers, USA). Parameter values were determined through an interactive tuning procedure on a representative subset of 20 diastolic-phase slices from different patients. A dedicated Python interface with adjustable sliders was developed to explore parameter configurations. Two experienced cardiologists evaluated myocardial boundary clarity and trabecular definition while varying the parameters. The final configuration corresponds to the most consistently selected settings and was fixed for all experiments to ensure reproducibility.
All slices are resampled to a fixed spatial resolution (in pixels) prior to preprocessing and learning. Intensity images are resized using bicubic interpolation (order = 3). Segmentation masks are resized using label-preserving interpolation (nearest-neighbor, order = 0) when required, preventing the creation of spurious intermediate labels. This resolution was chosen as a practical compromise between preserving fine trabecular detail and computational cost, and to provide a sufficiently dense spatial representation for the transformer-based components of ViTUNeT (i.e., a higher number of spatial tokens/patches), which improves the ability to model small anatomical structures. Although resampling can alter physical scale if pixel spacing metadata are ignored, our evaluation is based on overlap metrics (Dice) computed consistently in the same resampled space for both predictions and ground-truth masks; thus, comparisons across pipeline variants remain valid.
2.3. Supervised Slice-Quality Labels from Segmentation Behavior
Even after the enhancement stage, some slices remain unsuitable for reliable segmentation (e.g., due to truncated anatomy, signal dropouts, or severe motion artifacts). Since such failures should ideally be detected and discarded before applying the segmenter at inference time, we formulated a supervised machine learning approach aimed at filtering defective slices prior to segmentation. The first step, therefore, consisted of generating supervised slice-quality labels using a quantitative proxy derived from segmentation performance.
To ensure methodological rigor, the segmentation models used to construct these labels were trained under a strict patient-level separation protocol. First, an 80/20 train–test split was performed independently within each diagnostic group, guaranteeing that all slices from a given patient were assigned exclusively to one of the two subsets. Subsequently, within the training set, a 5-fold stratified cross-validation scheme was applied, also at the patient level. Stratification was based on the clinically established trabeculated volume (TV%) threshold of 27.4% [6], which discriminates between healthy subjects and LVNC phenotypes, thereby ensuring balanced pathological representation across folds.
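A patient-level split of this kind can be sketched in plain Python as shown below. The data layout (a dictionary mapping a patient identifier to its diagnostic group and TV%), the function name, and the round-robin fold assignment are illustrative assumptions; only the 80/20 ratio, the five folds, and the 27.4% cutoff come from the text.

```python
import random
from collections import defaultdict

TV_THRESHOLD = 27.4  # TV% cutoff quoted in the text


def patient_level_split(patients, test_frac=0.2, n_folds=5, seed=0):
    """patients: dict patient_id -> (diagnostic_group, tv_percent).

    Returns (test_ids, folds), where folds is a list of n_folds lists of
    training patient ids. Whole patients are assigned, never single slices.
    """
    rng = random.Random(seed)

    # 80/20 split performed independently inside each diagnostic group.
    by_group = defaultdict(list)
    for pid, (group, _) in sorted(patients.items()):
        by_group[group].append(pid)
    train_ids, test_ids = [], []
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)
        n_test = round(test_frac * len(ids))
        test_ids += ids[:n_test]
        train_ids += ids[n_test:]

    # Stratified folds: deal patients round-robin within each TV% stratum.
    folds = [[] for _ in range(n_folds)]
    for above in (False, True):
        stratum = [p for p in train_ids
                   if (patients[p][1] >= TV_THRESHOLD) == above]
        rng.shuffle(stratum)
        for i, pid in enumerate(stratum):
            folds[i % n_folds].append(pid)
    return test_ids, folds
```

Slices are then assigned to a subset by looking up their patient identifier, which is what prevents slices of one patient from appearing on both sides of a split.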
For each fold, an independent segmentation model was trained from scratch. Final predictions were obtained through an ensemble strategy that averages, at inference time, the outputs of the five fold-specific models (Figure 3). This scheme reduces variance and improves robustness, providing more stable estimates of segmentation performance.
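The averaging step, together with the Dice coefficient used later as a quality proxy, can be illustrated as follows. The sketch assumes that each fold-specific model outputs per-class probability maps and that an empty prediction matched with an empty reference scores 1.0; both conventions, and the function names, are assumptions rather than details taken from the pipeline.

```python
import numpy as np


def ensemble_predict(fold_probs):
    """Average per-class probability maps from the fold-specific models.

    fold_probs: array-like of shape (n_models, n_classes, H, W).
    Returns the averaged probabilities and the per-pixel argmax label map.
    """
    mean_probs = np.asarray(fold_probs, dtype=np.float64).mean(axis=0)
    return mean_probs, mean_probs.argmax(axis=0)


def dice(pred_mask, true_mask):
    """Dice overlap between two boolean masks of the same shape."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    total = pred_mask.sum() + true_mask.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred_mask, true_mask).sum() / total)
```

A per-slice trabecular Dice would then be obtained by comparing the trabecular class of the ensemble label map with the corresponding ground-truth mask.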
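Let $D^{\mathrm{DL}}$ and $D^{\mathrm{ViT}}$ denote the ordered sequences (from lowest to highest) containing the trabecular Dice coefficient values obtained for each image in the dataset using the DL-LVTQ and ViTUNeT segmentation models, respectively, following the ensemble approach described above, and let $d_i^{\mathrm{DL}}$ and $d_i^{\mathrm{ViT}}$ be the corresponding values for slice $i$. Slices were considered low quality if their performance fell below the first quartile of the distribution in either model; that is, images belonging to the lowest 25% according to $D^{\mathrm{DL}}$ or $D^{\mathrm{ViT}}$:

$$\mathcal{Q}_{\text{low}} = \left\{\, i : d_i^{\mathrm{DL}} < Q_{0.25}\!\left(D^{\mathrm{DL}}\right) \ \lor\ d_i^{\mathrm{ViT}} < Q_{0.25}\!\left(D^{\mathrm{ViT}}\right) \right\},$$

where $Q_{0.25}(\cdot)$ denotes the first quartile of the corresponding distribution.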
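In code, the low-quality rule reduces to a logical OR of two first-quartile tests. The sketch below assumes the per-slice trabecular Dice values of both models are available as aligned arrays and uses NumPy's default (linearly interpolated) quantile; the quantile estimator actually used is not specified, so this choice is an assumption.

```python
import numpy as np


def low_quality_mask(dice_dl, dice_vit):
    """True for slices below the first quartile in either model."""
    dice_dl = np.asarray(dice_dl, dtype=np.float64)
    dice_vit = np.asarray(dice_vit, dtype=np.float64)
    q1_dl = np.quantile(dice_dl, 0.25)    # first quartile, DL-LVTQ
    q1_vit = np.quantile(dice_vit, 0.25)  # first quartile, ViTUNeT
    return (dice_dl < q1_dl) | (dice_vit < q1_vit)
```

Because the two tests are combined with OR, the low-quality set can contain more than 25% of the slices whenever the two models disagree on which slices are hardest.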
For the remaining slices, we need to define a new criterion to determine medium and high quality slices. To do this, we performed an inflection-point analysis of the cumulative Dice distribution in the trabecular region.
Specifically, the images that were not labeled as low quality were sorted in descending order according to their trabecular Dice score. Let $N$ denote the number of such slices and $d_{(1)} \ge d_{(2)} \ge \dots \ge d_{(N)}$ the sorted Dice values. We then computed the cumulative Dice curve

$$C(k) = \sum_{i=1}^{k} d_{(i)}, \qquad k = 1, \dots, N,$$

and normalized both axes as

$$x_k = \frac{k}{N}, \qquad y_k = \frac{C(k)}{C(N)}.$$

The resulting curve (Figure 4) represents the cumulative contribution of the top-performing slices relative to a uniform distribution.
To identify the structural transition point of this distribution, we applied a maximum-distance-to-diagonal criterion, computing $k^{*} = \arg\max_{k}\,(y_k - x_k)$, which corresponds to the classical knee-point detection in concave cumulative curves. This inflection point marks the transition from a steep accumulation region to a slower growth regime.
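The maximum-distance-to-diagonal criterion can be sketched in a few lines of NumPy. The function name and return convention (1-based rank plus normalized coordinates) are illustrative choices; the sketch assumes at least one strictly positive Dice value so that the normalization is well defined.

```python
import numpy as np


def knee_point(dice_scores):
    """Knee of the normalized cumulative Dice curve.

    Returns (rank, x, y): the 1-based rank of the knee slice in the
    descending ordering and its normalized coordinates on the curve.
    """
    d = np.sort(np.asarray(dice_scores, dtype=np.float64))[::-1]  # descending
    n = d.size
    cum = np.cumsum(d)              # cumulative Dice over the top-k slices
    x = np.arange(1, n + 1) / n     # fraction of slices included
    y = cum / cum[-1]               # fraction of total Dice accumulated
    k = int(np.argmax(y - x))       # largest vertical gap to the diagonal
    return k + 1, float(x[k]), float(y[k])
```

Since the values are sorted in descending order, the curve is concave and lies on or above the diagonal, so the vertical gap is non-negative everywhere.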
For ViTUNeT, Figure 4a shows the location of the elbow (inflection) point, whereas for DL-LVTQ the elbow is indicated in Figure 4b.
In addition to the x-coordinate itself, we analysed the percentile position of the slice that defines the elbow. For ViTUNeT, the elbow corresponds to the 61.2th percentile of the ordered quality distribution, while for DL-LVTQ it corresponds to the 62.5th percentile.
The mean percentile value is approximately 61.85%, which lies remarkably close to the inverse of the golden ratio, $1/\varphi = (\sqrt{5}-1)/2 \approx 0.618$. The golden ratio possesses well-established mathematical properties and naturally arises as the limiting ratio of consecutive terms in the Fibonacci sequence. Its appearance in natural growth processes and self-organizing systems has been extensively documented in mathematical and physical literature.
Given this empirical convergence and its theoretical significance, we adopted $1/\varphi \approx 0.618$ as a stable and model-consistent threshold to distinguish between medium- and high-quality slices.
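Building upon this principle, we further refined the stratification of the remaining slices by introducing a second cutoff defined through a golden-ratio quantile level $q = 1/\varphi \approx 0.618$ applied to the upper portion of the distribution, namely the interval between the first quartile and the maximum value. Since this interval contains 75% of the probability mass, the corresponding global quantile is

$$p = 0.25 + 0.75\,q \approx 0.7135.$$

Accordingly, we define

$$\tau^{\mathrm{DL}} = Q_{p}\!\left(D^{\mathrm{DL}}\right), \qquad \tau^{\mathrm{ViT}} = Q_{p}\!\left(D^{\mathrm{ViT}}\right),$$

where $Q_{p}(\cdot)$ denotes the $p$-th empirical quantile of the trabecular Dice distribution of the corresponding model ($D^{\mathrm{DL}}$ for DL-LVTQ, $D^{\mathrm{ViT}}$ for ViTUNeT) and the maximum value is implicitly included as the upper bound of the distribution.
We then define two intermediate-quality subsets, one per model:

$$\mathcal{M}^{\mathrm{DL}} = \left\{\, i : Q_{0.25}\!\left(D^{\mathrm{DL}}\right) \le d_i^{\mathrm{DL}} \le \tau^{\mathrm{DL}} \right\}, \qquad \mathcal{M}^{\mathrm{ViT}} = \left\{\, i : Q_{0.25}\!\left(D^{\mathrm{ViT}}\right) \le d_i^{\mathrm{ViT}} \le \tau^{\mathrm{ViT}} \right\},$$

where $d_i^{\mathrm{DL}}$ and $d_i^{\mathrm{ViT}}$ are the trabecular Dice values of slice $i$ under each model. Slices labeled as medium quality are defined as the intersection of these subsets:

$$\mathcal{Q}_{\text{medium}} = \mathcal{M}^{\mathrm{DL}} \cap \mathcal{M}^{\mathrm{ViT}}.$$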
The intersection criterion was adopted to resolve disagreement between models in a principled manner. Specifically, if a slice does not belong to the intersection of the medium-quality sets defined by DL-LVTQ and ViTUNeT, it means that at least one of the models assigns it a trabecular Dice value consistent with medium-to-high quality. In such cases, we deliberately promote the slice to the high quality category rather than keeping it in the medium group.
Slices exceeding the upper threshold in both models, i.e., those whose trabecular Dice lies above the golden-ratio cutoff for both DL-LVTQ and ViTUNeT, are labeled as high quality. This dual-model criterion reduces sensitivity to model-specific artifacts and promotes robust quality stratification.
Finally, we introduce an anatomical consistency safeguard using the Dice coefficient of the compact external layer. We define a threshold on this coefficient and downgrade any slice initially labeled as medium or high whose external-layer Dice falls below it.
Figure 5 summarizes the resulting labeling for supervised learning.
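The complete labeling rule can be sketched as below. Several points are assumptions made only for illustration: non-low slices outside the medium intersection are promoted to high (following the promotion rule described above), the safeguard downgrades directly to the low class, the external-layer threshold is passed in as a hypothetical parameter `ext_threshold`, and quantiles use NumPy's default estimator.

```python
import numpy as np

PHI_INV = (np.sqrt(5.0) - 1.0) / 2.0  # inverse golden ratio, about 0.618
P_GLOBAL = 0.25 + 0.75 * PHI_INV      # global quantile level, about 0.7135


def label_slices(dice_dl, dice_vit, dice_ext, ext_threshold):
    """Assign 'low' / 'medium' / 'high' labels to aligned per-slice Dice arrays."""
    dice_dl = np.asarray(dice_dl, dtype=np.float64)
    dice_vit = np.asarray(dice_vit, dtype=np.float64)
    dice_ext = np.asarray(dice_ext, dtype=np.float64)

    q1_dl, q1_vit = np.quantile(dice_dl, 0.25), np.quantile(dice_vit, 0.25)
    tau_dl, tau_vit = np.quantile(dice_dl, P_GLOBAL), np.quantile(dice_vit, P_GLOBAL)

    # Low: below the first quartile in either model.
    low = (dice_dl < q1_dl) | (dice_vit < q1_vit)
    # Medium: inside the intermediate band for both models (intersection).
    medium = ((dice_dl >= q1_dl) & (dice_dl <= tau_dl)
              & (dice_vit >= q1_vit) & (dice_vit <= tau_vit))

    labels = np.full(dice_dl.shape, "high", dtype="<U6")  # remaining slices
    labels[medium] = "medium"
    labels[low] = "low"

    # Safeguard (assumed to downgrade to 'low'): poor external-layer Dice.
    labels[(labels != "low") & (dice_ext < ext_threshold)] = "low"
    return labels
```

Under these assumptions every slice receives exactly one label, and the external-layer check can only lower a label, never raise it.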
2.4. YOLOv11-Based Anatomical Attribute Extraction
Segmentation quality metrics such as the trabecular Dice coefficient provide an objective measure of performance, but they depend on ground-truth annotations and are, therefore, unavailable prior to inference. To enable quality-aware filtering before segmentation, we instead rely on anatomical cues that are directly observable in the image and can act as proxies for structural visibility and completeness.
To extract such cues, we train a YOLOv11s detector [24] to localize two anatomical entities in each slice: the full left ventricle (LV) and its internal cavity (IC). The detector is not used for classification or diagnosis; its sole purpose is to provide a compact set of geometrical and confidence-based attributes that characterize anatomical clarity.
The YOLOv11s model corresponds to the standard implementation provided by the Ultralytics framework, initialized from the official pretrained weights. No architectural modifications were introduced; the default backbone, neck, and detection head configurations were preserved.
2.4.1. Training Data and Annotations
YOLOv11s is trained exclusively on slices labeled as high quality according to the procedure described in Section 2.3. All images are first enhanced using the preprocessing pipeline in Section 2.2 and resized to the fixed resolution used throughout the pipeline.
Bounding-box annotations are generated automatically from the available segmentation masks using OpenCV. For each slice, the largest external contour defines the LV bounding box, while the largest enclosed region corresponds to the IC (Figure 6). The resulting coordinates are converted to the standard YOLO format [class x y width height], in which the box center and size are normalized by the image dimensions.
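The conversion to YOLO format can be illustrated with the simplified sketch below. It deliberately swaps the contour-based extraction for a tight bounding box around all foreground pixels of a binary mask, which coincides with the largest-contour box only when the mask contains a single connected structure; the function name and the class indices (0 for LV, 1 for IC) are assumptions.

```python
import numpy as np


def mask_to_yolo_box(mask, class_id):
    """Tight box around a binary mask as (class, x_center, y_center, width, height).

    All four geometric values are normalized by the image size, as in the
    YOLO label format. Returns None when the mask is empty.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    x0, x1 = int(xs.min()), int(xs.max()) + 1  # half-open pixel extent
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    return (class_id,
            (x0 + x1) / 2 / w, (y0 + y1) / 2 / h,
            (x1 - x0) / w, (y1 - y0) / h)
```

Each slice would contribute one such line per structure to its label file, for instance one for the LV mask and one for the IC mask.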
The dataset is divided at the patient level into training and test subsets in an 80/20 ratio to avoid data leakage, using a fixed random seed to ensure reproducibility.
2.4.2. Model Training and Qualitative Assessment
The YOLOv11s model is trained for 75 epochs using the Ultralytics framework with a batch size of 4 and the RAdam optimizer.
Training is initialized from pretrained weights in detection mode, with a fixed input size and automatic mixed precision (AMP) enabled. Deterministic training is enforced with a fixed seed. The validation split is used during training, and training is configured with patience = 100, which exceeds the 75-epoch budget, so early stopping is never triggered. The optimization hyperparameters comprise the initial learning rate, the final learning-rate factor, and the warmup settings (including the warmup bias). Non-Maximum Suppression (NMS) during evaluation uses a fixed Intersection over Union (IoU) threshold, with a maximum of 300 detections per image. Data augmentation follows the Ultralytics configuration, including HSV jitter and mosaic augmentation, with mosaic disabled in the last 10 epochs. No vertical flips are applied, and mixup/copy-paste are disabled.
The training dynamics of the YOLOv11s model over the 75 epochs are shown in Figure 7.
Visual inspection of validation slices confirms that the trained model consistently localizes both LV and IC structures across a wide range of anatomical appearances and acquisition conditions (Figure 8). These results support the use of YOLOv11s as a reliable anatomical attribute extractor rather than as a diagnostic model.
2.4.3. Attribute Extraction
Once trained, the detector is applied to all MRI slices, independently of their assigned quality label. For each slice, four scalar attributes are extracted:
- Detection confidence for the left ventricle;
- Detection confidence for the internal cavity;
- Bounding-box area of the left ventricle;
- Bounding-box area of the internal cavity.
These attributes provide a compact and interpretable description of anatomical visibility and spatial extent. In the subsequent stage of the pipeline, they are used as input features for a supervised classifier that estimates slice quality prior to segmentation.
Although YOLOv11s is trained exclusively on high-quality slices to learn anatomically consistent localization patterns, it is intentionally applied to slices of all quality levels. When evaluated on medium- and low-quality slices, the detector typically produces lower confidence scores, unstable bounding boxes, or in extreme cases, missed detections. Rather than representing a failure of the pipeline, such degradation reflects reduced anatomical visibility and, therefore, constitutes informative signal for the slice-quality classifier. In this design, the quality-sensitive behavior of YOLOv11 becomes part of the discrimination mechanism itself, allowing unreliable slices to be identified without requiring additional adaptation.
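The mapping from raw detections to the four attributes can be sketched as follows. The input layout (a list of `(class_id, confidence, (x1, y1, x2, y2))` tuples), the class indices, the choice of the most confident box per class, and the use of zeros for missed detections are all assumptions made for illustration.

```python
def extract_attributes(detections, lv_class=0, ic_class=1):
    """Reduce one slice's detections to four scalar attributes.

    detections: list of (class_id, confidence, (x1, y1, x2, y2)) tuples.
    A structure that is not detected keeps confidence 0 and area 0
    (assumed convention), so missed detections remain informative.
    """
    attrs = {"lv_conf": 0.0, "ic_conf": 0.0, "lv_area": 0.0, "ic_area": 0.0}
    for name, cls in (("lv", lv_class), ("ic", ic_class)):
        candidates = [d for d in detections if d[0] == cls]
        if not candidates:
            continue
        # Keep the most confident box for this structure.
        _, conf, (x1, y1, x2, y2) = max(candidates, key=lambda d: d[1])
        attrs[f"{name}_conf"] = float(conf)
        attrs[f"{name}_area"] = float((x2 - x1) * (y2 - y1))
    return attrs
```

Encoding a missed detection as zero confidence and zero area keeps every slice in the feature table, which matches the idea that detector degradation is itself a quality signal.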
2.5. Attribute–Quality Relationship
To verify that the extracted attributes are informative for anticipating segmentation quality, we quantify their relationship with the trabecular Dice coefficient and include slice_number as a proxy for axial position within the volume.
Since the number of slices varies across patients, slice_number is defined as the normalized axial position of each slice within the patient-specific stack. Concretely, it is computed from the index $k$ of the slice in its ordered volume and the total number of slices $N$ for that patient, so that positions are expressed relative to the stack length.
This normalization ensures comparability across patients and prevents bias introduced by variable stack lengths.
Figure 9 reports the correlation matrix between the trabecular Dice coefficient, the YOLO confidences and areas for LV and IC, and slice_number. We observe moderate positive correlations between the trabecular Dice coefficient and detection confidence for both LV and IC, and weaker positive correlations with the bounding-box areas of both structures, supporting their use as predictors of slice quality. The negative association between box areas and slice number is consistent with anatomy, as apical slices typically contain smaller ventricular structures.
Global image-quality descriptors such as entropy or Signal-to-Noise Ratio (SNR) primarily quantify photometric dispersion and overall contrast, but do not encode anatomical completeness or structural coherence. In contrast, the YOLO-derived attributes reflect geometrical extent and detection confidence of clinically meaningful structures. The observed correlations with the trabecular Dice coefficient, therefore, suggest that the proposed features capture structurally relevant information beyond global intensity statistics.
2.6. Slice-Quality Classification and Model Training
Slice quality is estimated through supervised learning on tabular anatomical attributes extracted from YOLOv11 detections. Each MRI slice is represented by five predictors: detection confidence and bounding-box area for the left ventricle (LV) and its internal cavity (IC), together with the slice_number encoding the axial position within the volume. The target variable is the three-class quality label (low, medium, high) derived from segmentation behavior as described in Section 2.3.
The resulting dataset consists of 3459 samples with no missing values. The distribution of quality classes is shown in Figure 10, indicating a balanced class composition suitable for supervised learning.
To ensure fair comparison and reproducibility across learning paradigms, all candidate classifiers are trained and evaluated using a unified experimental protocol. The dataset is first split into stratified training (80%) and test (20%) subsets to preserve class proportions. Hyperparameter optimization is then performed exclusively on the training set using exhaustive grid search [25] combined with 5-fold stratified cross-validation. The macro-averaged F1-score is used as the primary selection criterion, as it balances performance across all classes irrespective of their frequency.
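The selection criterion can be made concrete with a small, dependency-free sketch of the macro-averaged F1-score. The function name and the convention of scoring a class with no true or predicted samples as 0 are assumptions; library implementations may handle that edge case differently.

```python
def macro_f1(y_true, y_pred, classes=("low", "medium", "high")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); an absent class scores 0 (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes equally to the mean, a classifier cannot obtain a high score by performing well only on the most frequent quality class.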
Prior to model training, input features were standardized using a zero-mean, unit-variance transformation. Since margin-based and neural models are sensitive to feature scale, this normalization ensures comparable feature magnitudes and stable optimization. To prevent information leakage, the standardization parameters (mean and standard deviation) were computed exclusively on the training data within each cross-validation fold and subsequently applied to the corresponding validation subset. The same protocol was followed for the final training–test evaluation.
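The leakage-free standardization described above amounts to fitting the statistics on the training part only and reusing them unchanged on the held-out part. The following NumPy sketch illustrates this; the function names are hypothetical, and replacing a zero standard deviation by 1 is an added safeguard not mentioned in the text.

```python
import numpy as np


def fit_standardizer(x_train):
    """Per-feature mean and standard deviation, from training rows only."""
    x_train = np.asarray(x_train, dtype=np.float64)
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return mean, std


def apply_standardizer(x, mean, std):
    """Transform any subset with statistics fitted elsewhere."""
    return (np.asarray(x, dtype=np.float64) - mean) / std
```

Within cross-validation, `fit_standardizer` would be called once per fold on that fold's training rows, and `apply_standardizer` on both the training and validation rows of the same fold.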
The complete training and selection procedure is summarized in Algorithm 1. Once the optimal configuration is identified for each model, the classifier is retrained on the full training set and evaluated on the held-out test set using F1-score, precision, and recall. The best-performing model is selected as the final slice-quality classifier and integrated into the preprocessing pipeline.
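| Algorithm 1 Training and Selection of the Slice-Quality Classifier |
| Require: Labeled dataset $\mathcal{D}$, candidate models $\mathcal{M}$, hyperparameter grids $\{\Theta_m\}_{m \in \mathcal{M}}$ |
| Ensure: Trained slice-quality classifier $f^{*}$ |
| 1: Split $\mathcal{D}$ into stratified training set $\mathcal{D}_{\text{train}}$ (80%) and test set $\mathcal{D}_{\text{test}}$ (20%) |
| 2: for each model $m \in \mathcal{M}$ do |
| 3: Initialize best score $s_m^{*} \leftarrow -\infty$ |
| 4: for each hyperparameter configuration $\theta \in \Theta_m$ do |
| 5: Perform 5-fold stratified cross-validation on $\mathcal{D}_{\text{train}}$ |
| 6: Compute macro-F1 score $s$ |
| 7: if $s > s_m^{*}$ then |
| 8: $s_m^{*} \leftarrow s$ |
| 9: $\theta_m^{*} \leftarrow \theta$ |
| 10: end if |
| 11: end for |
| 12: Train $m$ on $\mathcal{D}_{\text{train}}$ using $\theta_m^{*}$ |
| 13: Evaluate $m$ on $\mathcal{D}_{\text{test}}$ |
| 14: end for |
| 15: Select $m^{*}$ with the highest test macro-F1 score |
| 16: Train final classifier $f^{*}$ on $\mathcal{D}_{\text{train}}$ |
| 17: return $f^{*}$ |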
The candidate learning algorithms evaluated in this study are:
- Random Forest (RF): an ensemble of decision trees trained via bagging.
- Histogram Gradient Boosting (HGB): a boosting-based method optimized for tabular data using histogram-based splits.
- Support Vector Classifier (SVC): a margin-based classifier maximizing inter-class separation.
- Multilayer Perceptron (MLP): a fully connected neural network modeling nonlinear interactions among features.