1. Introduction
Early detection of left ventricular (LV) abnormalities such as severe LV hypertrophy (SLVH) is essential for cardiovascular risk management [1]. Echocardiography remains the clinical standard for assessing LV structure [2], but its use is typically confined to individuals with a high pretest probability, in part because it requires specialized equipment and operator expertise [3]. Cardiac magnetic resonance imaging (MRI) provides accurate and reproducible assessments of myocardial structure and function [4], yet its cost, limited availability, and long scan times preclude routine population-level screening. Chest X-rays (CXRs), in contrast, are widely available in clinical practice, noninvasive, inexpensive, and commonly used as a first-line imaging modality [5]. Although CXRs lack the three-dimensional detail and dynamic imaging needed for comprehensive cardiac anatomical assessment, recent deep learning models have demonstrated that they contain clinically meaningful information from which echocardiographic indices such as the interventricular septal diameter (IVSDd), left ventricular internal diameter (LVIDd), and left ventricular posterior wall diameter (LVPWDd) can be estimated [6].
However, directly classifying SLVH from CXRs remains challenging because the phenotypic changes are subtle and the projection of three-dimensional anatomy onto a two-dimensional image superimposes overlapping structures. Consequently, prior work [6] adopts an indirect approach: regression models first estimate intermediate anatomical variables such as IVSDd, LVIDd, and LVPWDd, and SLVH status is then determined by thresholding. Additionally, because clinical thresholds for defining SLVH depend on age and sex, demographic information is explicitly provided to guide a learnable thresholding process and improve alignment with clinical guidelines [7]. While this strategy can improve predictive performance, it also introduces architectural and conceptual drawbacks. First, it is vulnerable to error cascading, where inaccuracies in anatomical regression propagate through the model and compromise the final classification. Second, explicitly including demographic variables can introduce confounding, as these attributes may act as proxies for the outcome rather than truly independent predictors, reducing model transparency and interpretability. These issues make model behavior harder to understand and raise concerns for clinical deployment, where clarity and robustness are essential. Beyond the risks of error propagation and confounding, from a representation learning perspective it is preferable to train models directly on the end task [8]. A dedicated classification model should learn features that discriminate between SLVH-positive and SLVH-negative cases, potentially offering better alignment with the final diagnostic objective and fewer sources of modeling error while implicitly encoding relevant clinical and demographic attributes.
In this work, we propose a simplified yet effective alternative. We introduce a direct classification framework that predicts SLVH status (present or absent) from chest X-rays alone, without relying on intermediate structural predictions or demographic inputs. In this way, we avoid the risks associated with explicitly including anatomical or demographic variables, such as confounding and error propagation, while still aiming for a model that captures clinically meaningful patterns. To verify that our classifier remains aligned with relevant clinical attributes despite not receiving them as inputs, we apply Mutual Information Neural Estimation (MINE) [9] to quantify the relationship between internal feature representations and relevant attributes such as age, sex, IVSDd, LVIDd, and LVPWDd. MINE estimates mutual information by training a neural network discriminator to distinguish joint from independent samples using the Donsker–Varadhan representation [10], making it tractable in high-dimensional settings. This framework has been used to reveal how sensitive attributes are encoded in domains such as face recognition [11] and person re-identification [12]. We adapt it here to examine how clinical and demographic information is encoded across early, mid, and late layers in SLVH classifiers trained without access to those attributes. The resulting score, which we refer to as expressivity, reflects the degree to which each attribute is entangled in the learned representation.
Figure 1 shows the overall block diagram of our proposed approach.
This analysis quantifies the extent to which internal feature representations encode key variables such as age, sex, and anatomical measurements like IVSDd and LVPWDd. Although these factors are not part of the model input, they are strongly associated with SLVH and should be reflected in the features learned by an effective model. MINE helps us confirm that the model attends to this underlying structure implicitly, supporting both interpretability and clinical trust without compromising design efficiency or robustness. Our contributions are as follows:
Modeling: We present a direct SLVH classification framework using both convolutional and transformer backbones with only chest X-ray images, removing reliance on anatomical regressors and demographic inputs. This improves generalizability and streamlines model design.
Evaluation: We address limitations in prior work by constructing a balanced subset of the CheXchoNet dataset, enabling improved discriminative performance and reliable benchmarking. Performance is assessed using AUROC and AUPRC.
Interpretability: We apply MINE to estimate mutual information between internal features and clinical attributes, enabling quantitative analysis of attribute encoding without requiring explicit supervision. This supports a more interpretable and clinically aligned deployment of deep learning models.
To our knowledge, this is the first study to propose a direct classification framework for detecting SLVH from chest X-rays, bypassing the need for intermediate anatomical modeling or demographic inputs. Moreover, we are the first to introduce MINE as a tool for analyzing internal representations in deep learning-based cardiac imaging. Beyond the specific application to SLVH classification, MINE represents a generalizable framework for quantifying feature–attribute relationships in deep learning models. Its ability to uncover clinically meaningful correlations makes it particularly well suited for advancing interpretability in cardiac imaging and other domains of clinical computer vision.
2. Materials and Methods
2.1. Dataset and Preprocessing
We use the CheXchoNet dataset [6], which pairs CXRs with echocardiography-derived structural measurements. To mitigate the class imbalance that hindered previous regression-based approaches [6], we construct class-balanced subsets by sampling equal numbers of SLVH-positive and SLVH-negative cases while preserving the original train–validation–test proportions for comparison. The final dataset includes 11,190 CXRs (5595 per class) from 6021 patients for training, 658 CXRs (329 per class) from 361 patients for validation, and 534 CXRs from 310 patients for testing. This random sampling was performed with a fixed seed for reproducibility. All images are resized to 256 × 256 and normalized using ImageNet preprocessing statistics.
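The following is a minimal sketch of the balancing and preprocessing step, assuming a hypothetical metadata table with columns patient-level columns "slvh" and "split" and image file paths; these names, the seed value, and the grayscale-to-three-channel conversion are illustrative assumptions rather than the released CheXchoNet schema.

import pandas as pd
import torch
from torchvision import transforms
from PIL import Image

SEED = 42  # fixed seed for reproducible subsampling (illustrative value)

def balance_split(meta: pd.DataFrame, split: str) -> pd.DataFrame:
    """Sample equal numbers of SLVH-positive and SLVH-negative CXRs within one split."""
    df = meta[meta["split"] == split]
    n = df["slvh"].value_counts().min()  # size of the minority class
    pos = df[df["slvh"] == 1].sample(n=n, random_state=SEED)
    neg = df[df["slvh"] == 0].sample(n=n, random_state=SEED)
    return pd.concat([pos, neg])

# Resize to 256 x 256 and normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.Grayscale(num_output_channels=3),  # replicate the single CXR channel
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def load_image(path: str) -> torch.Tensor:
    return preprocess(Image.open(path))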
Figure 2 shows the training data distribution for key demographic and anatomical attributes.
2.2. SLVH Classification Architectures
We investigate two representative deep learning architectures for direct classification of severe left ventricular hypertrophy (SLVH) from chest radiographs: (i) a ResNet-18 convolutional neural network and (ii) a Vision Transformer (ViT) encoder pretrained using masked autoencoding. These models are selected to contrast convolutional and transformer-based paradigms in terms of spatial feature extraction and global context modeling, respectively.
2.2.1. ResNet-18: Convolutional Baseline
ResNet-18 [13] serves as our convolutional baseline. It comprises a stack of residual blocks that hierarchically extract increasingly abstract visual features by aggregating local patterns through convolutional filters. The model is initialized with weights pretrained on the ImageNet dataset, which contains over 14 million labeled images; this initialization aids convergence and facilitates generalization.
For SLVH classification, we fine-tune the entire ResNet-18 model end-to-end using the binary cross-entropy loss. A custom classification head, consisting of two fully connected layers interleaved with ReLU activation and dropout regularization, is appended to the global average pooled feature vector. This head transforms the learned spatial features into a scalar probability representing SLVH likelihood.
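As a concrete illustration, the following PyTorch sketch sets up this fine-tuning configuration, interpreting the (128 → 64 → 1) head of Section 2.4 as mapping ResNet-18's 512-dimensional pooled features through 128 and 64 hidden units; the dropout rate and weight decay shown are illustrative placeholders, not the exact values used in our experiments.

import torch
import torch.nn as nn
from torchvision import models

class ResNetSLVH(nn.Module):
    """ResNet-18 backbone with a two-layer classification head for SLVH."""
    def __init__(self, p_drop: float = 0.3):  # dropout rate is a placeholder
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()  # keep the 512-d global average-pooled features
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 1),  # scalar logit for SLVH likelihood
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x)).squeeze(-1)

model = ResNetSLVH()
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the logit; labels are floats in {0, 1}
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # weight decay value is an assumption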
2.2.2. Vision Transformer: Masked Autoencoder Pretraining
Our second model employs a foundational Vision Transformer (ViT) encoder, pretrained using the Masked Autoencoder (MAE) framework [14], to capture rich global representations from chest radiographs in a self-supervised manner. Unlike convolutional networks, which primarily model local spatial dependencies, ViTs operate on non-overlapping image patches and utilize self-attention mechanisms to learn long-range interactions across the entire image. This is particularly advantageous for cardiothoracic pathologies like SLVH, where diagnostic cues may be spatially diffuse and globally distributed.
The MAE pretraining paradigm is designed to enhance the semantic abstraction capabilities of the encoder. During pretraining, a large fraction of the input patches (90% in our setup, found to be optimal) is randomly masked, and the model is tasked with reconstructing the missing patches using only the visible subset. The encoder processes only the unmasked patches, while a lightweight decoder reconstructs the full image from the encoded tokens. This design encourages the encoder to compress high-level semantic information efficiently, as it must infer the global context of the image from sparse visual cues.
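A minimal sketch of the random patch masking at the core of MAE is shown below; the patch size and embedding dimension are illustrative assumptions, and only the roughly 10% of visible patch tokens are passed to the encoder.

import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Keep a random subset of patch tokens; tokens has shape (batch, num_patches, dim)."""
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))            # e.g. ~10% of patches stay visible
    noise = torch.rand(b, n, device=tokens.device)  # random score per patch
    ids_keep = noise.argsort(dim=1)[:, :n_keep]     # lowest scores are kept
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep                        # the encoder sees only `visible`

# e.g. a 256 x 256 image with 16 x 16 patches gives 256 patch tokens,
# of which roughly 25 remain visible at a 90% masking ratio.
patch_tokens = torch.randn(4, 256, 768)
visible, ids_keep = random_masking(patch_tokens)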
In our setup, the ViT encoder is pretrained on a corpus of 300,000 chest radiographs from the MIMIC-CXR [15], NIH ChestX-ray14 [16], and Stanford CheXpert [17] datasets, allowing it to internalize domain-specific structural priors. For downstream SLVH classification, we freeze the pretrained encoder to preserve its learned representations and append a lightweight task-specific MLP head on top of the [CLS] token. The classification head, trained using the binary cross-entropy loss, maps this global representation to a probability score indicating SLVH presence.
Freezing the encoder offers several benefits: it reduces computational overhead, prevents catastrophic forgetting of prelearned thoracic features, and isolates the contribution of the classification head in downstream adaptation. Moreover, this setup allows us to evaluate how well the self-supervised ViT embeddings, learned independently of any diagnostic label, encode clinically salient cues relevant to left ventricular hypertrophy.
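The sketch below illustrates this frozen-encoder setup, assuming a ViT-Base encoder with 768-dimensional [CLS] embeddings and interpreting the (512 → 128 → 1) head of Section 2.4 as a 768 → 512 → 128 → 1 MLP; the `mae_encoder` object and its token-sequence interface are placeholders for the MAE-pretrained backbone, not a specific library API.

import torch
import torch.nn as nn

class SLVHHead(nn.Module):
    """MLP head trained on the frozen encoder's [CLS] embedding (768-d for ViT-Base)."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1),  # scalar logit for SLVH presence
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return self.mlp(cls_token).squeeze(-1)

# `mae_encoder` stands in for the MAE-pretrained ViT; it is assumed to return
# the token sequence (batch, 1 + num_patches, embed_dim) with [CLS] first.
def extract_cls(mae_encoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():  # the encoder stays frozen
        tokens = mae_encoder(images)
    return tokens[:, 0]    # the [CLS] token

head = SLVHHead()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(head.parameters())  # only the head is optimized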
2.3. Expressivity Analysis via Mutual Information Estimation
To quantify the extent to which a model’s internal representations encode clinically relevant information beyond the supervision provided during training, we estimate their expressivity using Mutual Information (MI). Expressivity refers to how much the internal features retain signal about auxiliary attributes that are used neither as inputs nor as targets, yet are highly relevant to the clinical phenotype. We focus on five such attributes linked to left ventricular hypertrophy: age, sex, interventricular septal diameter (IVSDd), posterior wall diameter (LVPWDd), and left ventricular internal diameter (LVIDd). Analyzing the expressivity of these factors enables post hoc interpretability by revealing the degree to which clinically meaningful associations are implicitly learned.

MI quantifies statistical dependence between random variables. Given a learned feature vector $z$ and an attribute $a$, the mutual information $I(z; a)$ measures how much knowledge of one reduces uncertainty about the other. It is formally defined as the Kullback–Leibler divergence between the joint distribution $P_{(z,a)}$ and the product of marginals $P_z \otimes P_a$:
\[
I(z; a) = D_{\mathrm{KL}}\!\left(P_{(z,a)} \,\|\, P_z \otimes P_a\right),
\tag{1}
\]
where $D_{\mathrm{KL}}$ quantifies how distinguishable the joint distribution is from one assuming independence.
Since direct estimation of Equation (1) is intractable in high dimensions, we adopt the Donsker–Varadhan (DV) lower bound, implemented via MINE [9]. Here, a neural network $T_\theta$ is trained to distinguish between true (joint) and mismatched (marginal) feature–attribute pairs:
\[
I(z; a) \;\geq\; \sup_{\theta}\; \mathbb{E}_{P_{(z,a)}}\!\left[T_\theta(z, a)\right] - \log \mathbb{E}_{P_z \otimes P_a}\!\left[e^{T_\theta(z, a)}\right].
\tag{2}
\]
This setup effectively implements a “flip-and-check” logic: $T_\theta$ learns to output higher values for correct pairs and lower scores for mismatched combinations. The better it separates the two distributions, the tighter the bound, indicating stronger dependence.

We estimate the expectations in Equation (2) using a mini-batch of size $b$. For the joint term,
\[
\mathbb{E}_{P_{(z,a)}}\!\left[T_\theta(z, a)\right] \;\approx\; \frac{1}{b} \sum_{i=1}^{b} T_\theta(z_i, a_i),
\tag{3}
\]
and for the marginal term,
\[
\log \mathbb{E}_{P_z \otimes P_a}\!\left[e^{T_\theta(z, a)}\right] \;\approx\; \log\!\left(\frac{1}{b} \sum_{i=1}^{b} e^{T_\theta(z_i, a_{\sigma(i)})}\right),
\tag{4}
\]
where $\sigma$ denotes a random permutation of the batch indices used to form mismatched pairs. The final loss minimized is the negative DV bound:
\[
\mathcal{L}(\theta) \;=\; -\left(\frac{1}{b} \sum_{i=1}^{b} T_\theta(z_i, a_i) - \log\!\left(\frac{1}{b} \sum_{i=1}^{b} e^{T_\theta(z_i, a_{\sigma(i)})}\right)\right).
\tag{5}
\]
We implement $T_\theta$ as a multi-layer perceptron with two hidden layers and ReLU activations. To stabilize training, we apply an exponential moving average to the marginal term. MINE is trained separately for each attribute and repeated over 10 random seeds to ensure robustness. Expressivity is evaluated at early, mid, and final layers of both the ResNet-18 and ViT classifiers. The overall procedure is outlined in Algorithm 1.
Algorithm 1 Expressivity computation on learned representations
Require: Layer L, set of n images I, attribute vector a
Ensure: Expressivity measure E
1: Initialize E ← ∅                                   ▹ To store expressivity values
2: Extract features F from L after a particular epoch for all images in I
3: Concatenate the features and attributes: X ← [F, a]   ▹ Augmentation step
4: for m = 1 to M do
5:     Initialize the MINE network based on the dimensions of X
6:     Compute the expressivity score: e_m ← MINE(F; a)
7:     Append the score: E ← E ∪ {e_m}
8: end for
9: return the average of the scores in E
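A minimal PyTorch sketch of the DV-bound objective and the per-attribute expressivity estimation follows; the statistics network mirrors the configuration in Section 2.4 (hidden sizes 256 and 64, ELU activations, Xavier initialization, Adam, batch size 100), while the learning rate, number of steps, and EMA rate shown are illustrative placeholders.

import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """MINE statistics network T_theta scoring (feature, attribute) pairs."""
    def __init__(self, feat_dim: int, attr_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, 256), nn.ELU(),
            nn.Linear(256, 64), nn.ELU(),
            nn.Linear(64, 1),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)  # Xavier initialization

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

def expressivity(features: torch.Tensor, attribute: torch.Tensor,
                 steps: int = 2000, batch: int = 100,
                 lr: float = 1e-4, ema_rate: float = 0.99) -> float:
    """Estimate I(z; a) via the DV bound; attribute has shape (N, 1)."""
    t_net = StatisticsNetwork(features.size(1), attribute.size(1))
    opt = torch.optim.Adam(t_net.parameters(), lr=lr)
    ema = None
    for _ in range(steps):
        idx = torch.randint(0, features.size(0), (batch,))
        z, a = features[idx], attribute[idx]
        a_shuffled = a[torch.randperm(batch)]         # mismatched pairs ~ product of marginals
        joint = t_net(z, a).mean()                    # Eq. (3)
        exp_marg = t_net(z, a_shuffled).exp().mean()  # inside the log of Eq. (4)
        ema = exp_marg.detach() if ema is None else ema_rate * ema + (1 - ema_rate) * exp_marg.detach()
        # EMA on the marginal term stabilizes the gradient of the log in Eq. (5).
        loss = -(joint - exp_marg / ema)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        a_shuffled = attribute[torch.randperm(attribute.size(0))]
        return (t_net(features, attribute).mean()
                - t_net(features, a_shuffled).exp().mean().log()).item()

In practice, this estimator is run separately for each attribute and layer and repeated over the random seeds, after which the scores are averaged as in Algorithm 1.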
This expressivity analysis provides a principled lens into the structure of latent representations. A higher estimated MI implies that the model captures meaningful attribute-relevant information without direct supervision, supporting interpretability and informing downstream fairness or calibration strategies.
2.4. Implementation Details
For the convolutional baseline, we use ResNet-18 [13], initialized with ImageNet-pretrained weights. It is fine-tuned end-to-end with an appended classification head comprising two fully connected layers (128 → 64 → 1), each followed by ReLU activation and dropout. Training is performed using the Adam optimizer with weight decay and the binary cross-entropy loss. For the transformer-based model, we use a ViT encoder pretrained via masked autoencoding (MAE) on around 300,000 images from MIMIC-CXR, Stanford CheXpert, and NIH ChestX-ray14 [14]. The encoder weights remain frozen, and a task-specific MLP head (512 → 128 → 1) is trained on the [CLS] token representation. Optimization uses linear learning-rate warmup followed by cosine decay, with early stopping based on validation AUROC.
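A minimal sketch of such a warmup-plus-cosine schedule is given below; the warmup length, total step count, and base learning rate are illustrative assumptions, since the exact schedule lengths are not restated here.

import math
import torch

def warmup_cosine(optimizer: torch.optim.Optimizer,
                  warmup_steps: int = 500, total_steps: int = 10000):
    """Linear warmup to the base learning rate, then cosine decay toward zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Example: scheduler.step() is called once per optimization step;
# early stopping monitors validation AUROC separately.
head = torch.nn.Linear(768, 1)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
scheduler = warmup_cosine(optimizer)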
The MINE network is implemented as a multilayer perceptron with two hidden layers of sizes 256 and 64, using ELU activations and Xavier initialization. It is trained using the Adam optimizer with a batch size of 100. For each attribute, we extract feature vectors from the early, mid, and final layers of the classification models, form the feature matrix, and concatenate it with the corresponding attribute vector to obtain the MINE input. The expressivity is computed using the MINE objective, and the result is averaged across 10 random seeds for robustness.
All training and MI estimation procedures were conducted using NVIDIA A5000 GPUs.
3. Results
We demonstrate that severe left ventricular hypertrophy (SLVH) can be directly classified from chest radiographs using deep learning models, without requiring intermediate echocardiographic regressors or demographic covariates. Our approach bypasses multi-stage processing pipelines and enables efficient end-to-end classification. For reference, the current clinical baseline proposed in [6] employs a two-step approach: (i) regression of anatomical markers such as IVSDd from chest X-rays, followed by (ii) diagnostic thresholding. While this method achieves a reasonably high AUROC of 0.79 [95% CI: 0.76–0.81], it suffers from poor class discrimination under severe class imbalance, yielding an AUPRC of just 0.19 [95% CI: 0.15–0.22].
In contrast, fine-tuning direct classification models on our curated class-balanced dataset yields strong performance across both convolutional and transformer-based architectures, as seen from Table 1 and Figure 3. The ViT-Base achieves the highest classification metrics, with an AUROC of 0.816 [95% CI: 0.781–0.850] and an AUPRC of 0.803 [95% CI: 0.755–0.849], indicating both high sensitivity and reliable precision across decision thresholds. ResNet-18 also performs competitively, with an AUROC of 0.760 [95% CI: 0.718–0.802] and an AUPRC of 0.731 [95% CI: 0.669–0.786].
Attribute-Level Expressivity
To probe whether these models learn latent clinical representations despite training only on binary SLVH labels, we apply MINE to quantify the degree to which intermediate features encode five clinically salient attributes: interventricular septal diameter (IVSDd), left ventricular posterior wall diameter (LVPWDd), left ventricular internal diameter (LVIDd), age, and sex. These variables are known factors or diagnostic correlates of SLVH.
Figure 4 summarizes MINE-based expressivity across layers for the ViT and ResNet-18 architectures. For the ViT, expressivity increases progressively through the transformer blocks and stabilizes in the final layers, suggesting a hierarchical abstraction of clinical attributes across attention layers. In contrast, ResNet-18 shows a sharp increase only at the final global average pooling layer, consistent with its local-to-global convolutional design. These architectural differences directly influence where and how clinically relevant signals are encoded.
Notably, age and sex exhibit consistently high expressivity in the final layers of both models, despite not being explicitly provided as input during training. This suggests that the models implicitly infer demographic context from imaging features, likely because these attributes are clinically established contributors to SLVH pathophysiology and diagnosis. While this capability reflects the models’ ability to internalize important diagnostic cues, it also highlights the need for careful evaluation of potential fairness and bias concerns associated with unintended attribute encoding. Among anatomical features, IVSDd and LVPWDd, which are established indicators of myocardial thickening, are consistently encoded with high expressivity, whereas LVIDd shows the lowest expressivity, consistent with its limited diagnostic relevance for hypertrophy. These findings reveal a consistent expressivity ordering for our best model, which further holds for all ablation cases described in Section 4:

age > sex > IVSDd > LVPWDd > LVIDd.

This hierarchy reinforces that deep learning classifiers are capable of recovering clinically meaningful structure from chest radiographs using only image-level supervision.
4. Discussion
The superior performance of the ViT model is likely driven by its global self-attention mechanism, which enables modeling of long-range anatomical dependencies across the thoracic cavity, a capability that convolutional architectures lack due to their reliance on localized receptive fields. These findings establish that SLVH-relevant cues are sufficiently encoded in chest radiographs to support high-accuracy direct classification when leveraging appropriate architectural priors and pretraining strategies. To further strengthen the interpretability of model decisions, we show exemplar Grad-CAM visualizations in Figure 5. Across multiple examples, we observe that the model consistently attends to central cardiac regions, particularly the mediastinum and left ventricular silhouette, with high-intensity activations (red regions). This localization aligns with the expected anatomical correlate of severe left ventricular hypertrophy, where cardiac enlargement and changes in ventricular wall thickness manifest most prominently in these areas. Importantly, the activations are concentrated around the cardiac contour rather than diffuse pulmonary fields or image borders, suggesting that the network is not relying on spurious features such as rib patterns, background artifacts, or acquisition markers. These interpretability results reinforce the clinical plausibility of the learned features and provide additional transparency. They suggest that the model’s predictions are informed by relevant cardiothoracic structures, in agreement with radiological and cardiological understanding of hypertrophy, thereby increasing confidence in its potential utility.
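The following is a minimal sketch of how such Grad-CAM maps can be produced for a convolutional classifier using forward and backward hooks; the choice of the last residual block as target layer and the hypothetical ResNetSLVH wrapper follow common practice and are assumptions, not the exact configuration used for Figure 5.

import torch
import torch.nn.functional as F

def grad_cam(model, image: torch.Tensor, target_layer) -> torch.Tensor:
    """Return an (H, W) activation map for one preprocessed image of shape (1, C, H, W)."""
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, inp, out: activations.update(a=out))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gin, gout: gradients.update(g=gout[0]))

    model.eval()
    logit = model(image)           # scalar SLVH logit
    model.zero_grad()
    logit.sum().backward()
    h1.remove(); h2.remove()

    acts, grads = activations["a"], gradients["g"]            # (1, K, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)            # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # weighted combination
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam.squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Example with the hypothetical ResNetSLVH wrapper sketched earlier:
# heatmap = grad_cam(model, image, model.backbone.layer4)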
For the MINE module, we ablate over different architectural choices. In addition to the architecture used for our best model (the ViT backbone), we evaluated both a shallower design (feature_dim → 256 → 1) and a deeper configuration (feature_dim → 512 → 256 → 64 → 1). Importantly, across all tested variants, the relative ordering of attribute expressivity remained the same (age > sex > IVSDd > LVPWDd > LVIDd), which aligns with clinical expectations confirmed by a radiologist, as seen from Figure 6 and Figure 7. This consistency underscores the robustness of our conclusions, while also demonstrating that our findings are not sensitive to the precise architectural instantiation of MINE.
5. Limitations and Future Work
The major limitation of expressivity is that, as an approximation of mutual information (MI), it depends on entropy, which in turn depends on the distribution of attribute labels; this could, in principle, affect absolute comparisons between different attributes. This is true for any MI-based technique. However, in our dataset the categorical attributes (sex, SLVH) are well balanced, while the continuous attributes (age, IVSDd, LVPWDd, LVIDd) follow physiologically plausible Gaussian-like distributions without extreme skew. Moreover, our analysis focuses on relative changes across network layers rather than absolute comparisons between unrelated attributes. Thus, the observed trends remain valid, and we additionally verified robustness through ablations. This study highlights the dual importance of (i) foundation models as powerful representation learners, which allow cardiac structural abnormalities defined by echocardiographic metrics to be detected directly from chest X-rays, and (ii) MINE-based expressivity analysis as a tool to investigate what these models encode. A natural extension of our framework is to incorporate a broader set of clinical and demographic attributes, such as comorbidities, race, medication history, or lifestyle factors, to obtain a more exhaustive ordering of attribute relevance for detecting cardiac structural abnormalities from chest radiographs. Another promising avenue is attribute suppression, where attributes that are less clinically informative (e.g., LVIDd in the context of hypertrophy, as suggested by both our MINE analysis and clinical guidelines) are down-weighted or actively suppressed during representation learning. Such targeted interventions may improve the specificity of deep models, mitigate spurious correlations, and bring the learned features closer to clinically actionable biomarkers.