Hybrid Classical–Quantum Transfer Learning for Cardiomegaly Detection in Chest X-rays

Cardiovascular diseases are among the major health problems that are likely to benefit from promising developments in quantum machine learning for medical imaging. The chest X-ray (CXR), a widely used modality, can reveal cardiomegaly, even when performed primarily for a non-cardiological indication. Based on pre-trained DenseNet-121, we designed hybrid classical–quantum (CQ) transfer learning models to detect cardiomegaly in CXRs. Using Qiskit and PennyLane, we integrated a parameterized quantum circuit into a classic network implemented in PyTorch. We mined the CheXpert public repository to create a balanced dataset with 2436 posteroanterior CXRs from different patients distributed between cardiomegaly and the control. Using k-fold cross-validation, the CQ models were trained using a state vector simulator. The normalized global effective dimension allowed us to compare the trainability in the CQ models run on Qiskit. For prediction, ROC AUC scores up to 0.93 and accuracies up to 0.87 were achieved for several CQ models, rivaling the classical–classical (CC) model used as a reference. A trustworthy Grad-CAM++ heatmap with a hot zone covering the heart was visualized more often with the QC option than that with the CC option (94% vs. 61%, p < 0.001), which may boost the rate of acceptance by health professionals.


Introduction
According to the WHO's Global Health Estimate [1], ischemic heart disease is the world's biggest killer with 8.9 million victims in 2019, accounting for 16% of the world's total deaths. Each year, about 700,000 persons in the United States and 1,700,000 persons in the 57 ESC member countries die from heart disease, i.e., about one in every five deaths [2,3]. Heart disease cost the United States about 229 billion USD each year from 2017 to 2018 and cost the European Union 210 billion EUR in 2015.
For this reason, cardiology has been one of the branches of medicine where the most progress has been made in recent years. The high prevalence of cardiovascular diseases has led to the constant development of new technologies for diagnosis and treatment. In recent decades, there has been growth in advanced technological developments applied to solving this situation addressing cardiovascular disease assessment, such as the application of Machine Learning (ML) algorithms for the diagnosis of heart failure [4] or cerebrovascular diseases [5], or for the analysis of arrhythmias [6,7] Several deep learning techniques have been presented to assess cardiac diseases using both tabular data and images. This manuscript presents an approach that merges contributions from deep learning techniques with quantum ML techniques for detecting cardiomegaly in chest X-rays (CXR).
CXR is a radiological modality widely used in inpatient and outpatient settings. It is part of the tools used for assessment and follow-up in suspected or known cardiac patients. On the other hand, CXRs are more often performed for a purpose other than heart disease, e.g., in environments where they are first protocolized by on-duty physicians. Some CXR findings may indicate an unsuspected underlying heart condition, unrelated to the patient's actual problem. Besides edema, pleural effusions and pulmonary opacity often revealing heart failure, cardiomegaly is the most relevant indicator of a cardiac problem. Pericardial effusion is another concerning cause of the characteristic enlargement of the heart silhouette, but its presence never excludes concomitant cardiomegaly. The two conditions can thus be aggregated in an early triage. Cardiomegaly is usually a sign of dilated or hypertrophic cardiomyopathy, which is observed in a large range of congenital and acquired diseases [8,9]. One prominent cause of cardiomyopathy is coronary artery disease [10].
Cardiomyopathy is a severe condition leading to heart failure, but also to sudden cardiac death, malignant arrhythmias and thromboembolism from cardiac origin. This represents a major cause of morbidity and mortality all over the world [11]. In summary, the initial radiological discovery of cardiomegaly may indicate, depending on the clinical context, the worsening of a known cardiac condition, an unsuspected cardiac disease or a cardiac complication during another disease. In all cases, a more in-depth cardiological assessment is indicated to clarify the diagnosis and establish appropriate therapeutic measures.
In this context, ML algorithms are being developed for the detection of cardiomegaly on radiographs. These studies are facilitated by the availability of large CXR databases, such as ChestX-ray8, CheXpert or MIMIC-CXR-JPG [12][13][14]. Encouraging results are reported. For instance, when comparing sixteen different deep learning architectures of CNN, ROC AUC scores up to 0.84 were found for the detection of cardiomegaly on a CXR validation dataset of 202 images from CheXpert [15].
Automated analyses of written protocols mention cardiomegaly in 9 to 17% of cases in these large CXR data sets. However, patients treated for heart disease are more likely to present cardiomegaly and undergo repetitive CXRs. For instance, in the CheXpert dataset, cardiomegaly is reported in 11% of the first studies against 13% of the follow-up studies. For medical applications, the performance indicators of a classification method must be such that there is little chance of missing a case while avoiding costly and resource-consuming additional diagnostic procedures. These considerations imply that a balanced dataset is a reasonable option for training the ML algorithm in such an application. Undersampling the majority class is indeed a good strategy that keeps the computational cost within reasonable limits [16], which was necessary to carry out the present study.
In less than a decade, deep learning has revolutionized the field of ML imaging, and automated CXR analyses can now benefit from these advances. Convolutional neural networks (CNN) have been the go-to pattern classification algorithms due to their welldemonstrated approximation and generalization capabilities. However, training a deep CNN in big data applications with a rich feature set is resource-intensive.
CNNs are based on the multiple perceptron model, in which a perceptron or an artificial neuron mimics the neurons in the human brain. In the human brain, specific regions of the brain are dedicated to responding to specific signals, e.g., images with faces trigger a stronger response than images without faces in a region called the fusiform face area (FFA), which is present on the underside of the temporal lobe. Further, in the middle layer of the brain's visual cortex, specific neurons respond to specific shapes, edges, angles and layers [17]. This idea of specific neurons responding to specific features is used in the CNN model, in which each convolutional layer learns a specific feature/characteristic of the image using the image kernels and filters.
A CNN model can be viewed as two parts. The first part acts as a feature transformer that contains one or more layers that learn to identify and assign weights to the features of the image, and the second part acts as a classifier layer that classifies the images into different classes. Transfer learning is a useful technique that takes advantage of this twopart design, in which a model that was trained on a set of images is reused for training and classifying a completely different set [18]. The idea is to retain the convolutional layers that learn the common features of any image, such as lines, edges and curves, and to replace the classifier layers that are image-specific, thus reducing the resources required to fully train the model from the beginning.
This technique makes it possible to download CNNs pre-trained on thousands of image categories with millions of parameters. Using these models with their pre-trained weights, one can classify specific sets of images by retraining only the last parts of the model, where layers can be modified. The models pre-trained on colored images work for grayscale medical images, such as those of CT, mammography, MRI or CXR [19,20], and they have been developed for several conditions, including Alzheimer's disease, brain tumors, breast tumors and COVID-19. For the automated detection of a set of abnormal findings in CXRs, transfer learning based on DenseNet-121 allows discrimination with satisfactory metric scores for most features [12,13]. Using 952 images from the National Institute of Health relabeled as "normal" or "heart failure" (defined as "cardiomegaly or congestion") by two cardiologists, Matsumoto et al. [21] reported an accuracy of 82% in a model based on pre-trained VGG16.
Quantum neural networks were shown theoretically and experimentally to offer an advantage over their classical counterpart in terms of trainability [22], which justifies their use in medical applications [23]. To engineer a hybrid neural network, one can encode the images in an initial quantum layer, as in a classical-quantum model aimed at COVID-19 prediction presented in [24]. Another option is to insert the quantum layer instead of the classical one in the last part of a transfer learning model [25]. This CQ approach, proposed for detecting Alzheimer's disease from brain CT scans [26] and stenosis detection in X-ray coronary angiography [27], was chosen to conduct this research.
A common problem with any diagnostic method based on deep learning is the opacity of the decision-making process, which hinders healthcare professionals' and patients' trust and acceptance [28,29]. Improved visual explanation by saliency methods, such as Grad-CAM++ [30], allows for locating the region of interest of the automated search process on a heatmap. For medical imaging, it is possible to check whether anatomically credible heatmaps can be produced by the algorithm. Cardiomegaly detection is particularly wellsuited for this approach.
To the best of our knowledge, no description of a hybrid transfer learning CQ model for the detection of cardiomegaly in CXRs has been published so far. Our aim was to verify whether efficient CQ models can be designed using currently available software development kits (SDKs) and whether, compared to the CC approach, they can have advantages in terms of trainability, performance and trustworthiness. The objectives for carrying out this project include mining and balancing an adequate CXR dataset, selecting a performing reference model, inserting a parameterized quantum circuit (PQC) [31] in CQ models in place of the classical classifier layer and establishing a common training protocol. Two approaches were considered to assess the predictive performance. A 70/30 train-test split was used for model selection from numerous prototypes, for the comparison of training loss curves, and for the credibility assessment for saliency zones on GradCAM++ heatmaps. K-fold cross-validation, which is more computationally demanding, was used for possible overfitting detection, reference selection and QC vs. CC c statistical comparison of perfor-mance metrics. We also investigated in two Qiskit-based models the normalized global effective dimension (NGED), a parameter related to trainability and performance [22,32].

Dataset and Data Curation
The dataset was mined from the Stanford CheXpert shared dataset [33] consisting of 224,316 chest radiographs of 65,240 patients, reflecting the general CXR flow of a large Health Care Facility with both inpatient and outpatient centers. We aimed to establish an algorithm able to detect cardiomegaly in such a flow. The CheXpert dataset was selected because of the improved performance of the automated label extraction from free-text reports, especially for cardiomegaly [13]. The posteroanterior (PA) CXR view is recommended and has specific diagnostic criteria for cardiomegaly [34]. Therefore, we selected the images corresponding to the first chronological PA view in the source dataset. In the resulting reduced set of CXRs all from different patients, a subset of 1218 was labeled positive for cardiomegaly. A balanced dataset was formed by drawing, in the same reduced set, 1218 additional patients labeled negative (control subset).
One of us (Pierre Decoodt), a cardiologist with forty years of experience in the interpretation of CXR for heart disease, blindly reviewed the 2436 randomly presented images and requalified 111 of them, which were mislabeled "cardiomegaly" for 59 and "no cardiomegaly" for 52. In addition to clinical judgment, a cardiothoracic index lower or equal to 0.45 was required to requalify cardiomegaly as the control and greater or equal to 0.55 to requalify the control as cardiomegaly. For cases judged as doubtful, the labels remained intact. The rate of correct labels before requalification that we found was consistent with the labeler performance reported for the global CheXpert dataset, which was 0.973 for cardiomegaly and 0.909 for no cardiomegaly [13]. Table 1 describes the final dataset. All CXRs were from the general patient population studied at the Stanford Health Care Facility. This ensured that the control group was representative of the disease group, with no bias that could be introduced if the two classes were sampled from different populations. It can be observed that patients with positive labels for cardiomegaly were older. Male patients were the majority, but this was more marked in the control subset. In the whole dataset, 61% of patients presented one or several other abnormal CXR findings. Overall, the predominant conditions in the cardiomegaly subset included lung opacity, edema, pneumonia, atelectasis, pleural effusion or other pleural findings, and the presence of support devices, whereas pneumothorax was more frequent in control cases. Typical examples of CXR are illustrated in Figure 1.
J. Imaging 2023, 9, x FOR PEER REVIEW 5 of 22 Significant differences between cardiomegaly and control by independent two-sample t-test for age and chi-squared contingency for the other features are indicated in the right column. * p < 0.05, ** p < 0.01, ***: p < 0.001. Figure 1. Images from the cardiomegaly subset along with their counterpart from the control subset. First column: no positive label for any other finding. Three last columns: cases of pleural effusion, edema and lung opacity, which were the findings most frequently associated with cardiomegaly in the dataset.

General Design of the Models
Both models share the same architecture. The preprocessed and vectorized images were injected in a pre-trained CNN, used for extracting informative features. They then followed a trainable binary classifier and finally a sigmoid layer ( Figure 2   Significant differences between cardiomegaly and control by independent two-sample t-test for age and chisquared contingency for the other features are indicated in the right column. * p < 0.05, ** p < 0.01, ***: p < 0.001.

General Design of the Models
Both models share the same architecture. The preprocessed and vectorized images were injected in a pre-trained CNN, used for extracting informative features. They then followed a trainable binary classifier and finally a sigmoid layer ( Figure 2). The CQ version differed only by the presence of a PQC between two layers of the trainable classifier.

Image Preprocessing and Vectorization
The CXR images were resized to 256 × 256 pixels and then center-cropped to 224 × 224 pixels. They were then converted to tensor and normalized using the mean and standard deviation. The single-channel X-ray images originally in grayscale were transformed into a three-channel format by repeating the values in a particular image across all channels. This ensured compatibility with the tested pre-trained CNN.

Feature Extractor
A large selection of CNNs pre-trained on ImageNet 1 K V1 can be downloaded from PyTorch. After some preliminary tests with 8 training epochs, we selected AlexNet [35] and DenseNet-121 [36]. These two CNNs, renowned for their performance, differ greatly in their architecture, number of layers (8 versus 121), number of parameters (61.1 million vs. 8 million) and delivered feature vector f (4096 vs. 1024).

Trainable Classifier
In Figure 3 the classifiers of the CC and CQ models are depicted.  The trainable classical classifier consists of a sequence of linear (f, 512), ReLU and linear (512, 2) layers.
The trainable quantum classifier consists of a sequence of linear (f, n), tanh, n-qubit PQC and linear (n, 2) layers. For the Qiskit-based experiments, we tested 2 and 4 as the dimension output D for the classifier. Similarly, we tested 4, 6, 8 and 10 as n and D values for PennyLane-based models.

Parameterized Quantum Circuit
To implement the hybrid classical-quantum models, we used the built-in methods of The trainable classical classifier consists of a sequence of linear (f, 512), ReLU and linear (512, 2) layers.
The trainable quantum classifier consists of a sequence of linear (f, n), tanh, n-qubit PQC and linear (n, 2) layers. For the Qiskit-based experiments, we tested 2 and 4 as the dimension output D for the classifier. Similarly, we tested 4, 6, 8 and 10 as n and D values for PennyLane-based models.

Parameterized Quantum Circuit
To implement the hybrid classical-quantum models, we used the built-in methods of Qiskit and PennyLane, allowing us to integrate the PQC into the Pytorch framework. In Qiskit, we employed the CircuitQNN and TorchConnector functions from the Qiskit Machine Learning package. A four-qubit PQC is presented in Figure 4. The ansatz consists of repeated blocks of 2-qubit entanglement gates with a linear entanglement strategy and 1-qubit rotation gates (rotation angles being set by the circuit parameters).

Training
A 70/30 train-test split was applied to the dataset. We applied RandAugment and 50% RandomAutocontrast to the training data for each epoch.
All models were trained using the same random seed, with the learning rate and weight decay set to 0.0001, Cross-Entropy loss as the cost function and Adam [37] as the optimizer. We compared a standard and an enhanced protocol for training. The first consisted of 20 epochs without "freezer". In the latter, this phase was preceded by training for 2 epochs with "freezer", which froze the layer with pre-trained parameters acting like a weight initializer so that the further 18-epoch training was improved. The batch size was set to 8.
In Qiskit-based models, the computation was conducted in the Aer simulator statevector device with ten shots, using the default gradient settings [38]. In PennyLane-based models, we used as parameters the default qubit device, the torch interface and the "best" differentiation method. The latter depends on the device and interface, being in the most likely case Standard Backpropagation with Pauli-Z expectation values for each qubit [25].

Performance Metrics
The metrics we considered are listed in Table 2. All metric values mentioned in the following tables were for the test subsets. They were calculated using the corresponding functions from the metrics module of the open-access scikit-learn package [39]. The tables of results obtained from 10-fold cross-validation show the mean over the folds as well as, in brackets, the 95% confidence interval for the mean. For the comparison of a metric in two models, we performed a paired t-test on the values observed for each fold's test dataset.

Training
A 70/30 train-test split was applied to the dataset. We applied RandAugment and 50% RandomAutocontrast to the training data for each epoch.
All models were trained using the same random seed, with the learning rate and weight decay set to 0.0001, Cross-Entropy loss as the cost function and Adam [37] as the optimizer. We compared a standard and an enhanced protocol for training. The first consisted of 20 epochs without "freezer". In the latter, this phase was preceded by training for 2 epochs with "freezer", which froze the layer with pre-trained parameters acting like a weight initializer so that the further 18-epoch training was improved. The batch size was set to 8.
In Qiskit-based models, the computation was conducted in the Aer simulator statevector device with ten shots, using the default gradient settings [38]. In PennyLane-based models, we used as parameters the default qubit device, the torch interface and the "best" differentiation method. The latter depends on the device and interface, being in the most likely case Standard Backpropagation with Pauli-Z expectation values for each qubit [25].

Performance Metrics
The metrics we considered are listed in Table 2. All metric values mentioned in the following tables were for the test subsets. They were calculated using the corresponding functions from the metrics module of the open-access scikit-learn package [39]. The tables of results obtained from 10-fold cross-validation show the mean over the folds as well as, in brackets, the 95% confidence interval for the mean. For the comparison of a metric in two models, we performed a paired t-test on the values observed for each fold's test dataset.

Image Interpretation
The image interpretation was visualized with a gradient-weighted class activation map using Grad-CAM++. To provide an idea of the acceptance by clinicians of the CC and CQ models, we grouped all the heatmaps obtained in the test set for each model and submitted them shuffled to one of us (Pierre Decoodt, a cardiologist), who blindly classified them as trustworthy or non-trustworthy. A heatmap was considered trustworthy if the hot zone comprised most of the cardiac silhouette. Heatmaps were classified as non-trustworthy when the hot zone was either outside the cardiac area, extremely extensive, or not visible. Heatmaps with multiple hot zones of similar size were also classified as non -trustworthy.

Normalized Global Effective Dimension
The trainability was assessed for the quantum layer on Qiskit by using the four-step algorithm that the qiskit-machine-learning package provides for assessing the NGED [40].

Selection of Models and Training Protocols
Given the wide spectrum of possible combinations, we adopted a step-by-step heuristic to test model prototypes. We started by calculating the metrics using the 70/30 train-test split approach in thirteen models (Table 3).
As detailed in Appendix A, the results obtained by the single split approach (Table A1) were used for a preliminary analysis of the train and test ROC and loss curves. The calculation times are given in Table A2. The ROC curves for the train set in CC models led to suspecting a possible effect of overfitting that could affect the metric values observed in the test set. However, additional experimentation based on 10-fold cross-validation (Table A3) did not confirm this issue. The single-split results were used to select ten models for statistical comparison, to generate GradCAM++ heatmaps from the test subset and to perform NGED analysis.
The selected CC and CQ models were subjected to ten-fold cross-validation, which allowed us to obtain estimates of the mean and its 95% confidence interval for the metrics. This more computationally demanding technique was used to choose the CC model used as a reference, to investigate a possible overfitting effect (Table A3) and to establish the performance of CQ models that might challenge the reference. Table 4 compares the versions of CC models with and without the freezer option. The metric values for the two DenseNet 121-based models are similar, with lower bounds for accuracy and balanced accuracy greater than 0.85. For models based on AlexNet, the values of the metrics are slightly lower, which is only statistically significant for precision 1 and recall 0. The corresponding ROC curves are shown in Figure 5. as a reference, to investigate a possible overfitting effect (Table A3) and to establish the performance of CQ models that might challenge the reference. Table 4 compares the versions of CC models with and without the freezer option. The metric values for the two DenseNet 121-based models are similar, with lower bounds for accuracy and balanced accuracy greater than 0.85. For models based on AlexNet, the values of the metrics are slightly lower, which is only statistically significant for precision 1 and recall 0. The corresponding ROC curves are shown in Figure 5.    Table 5 lists the metric values for CQ models, all with the freezer option. Compared with the F-Dnet-C model, the AUC score is lower for F-Dnet-P-10q, F-Dnet-Q-4q-2D and F-Dnet-Q-4q-4D (p-value < 0.01). The corresponding ROC curves are shown in Figure 6.

Grad-CAM++ Analysis
The blind examination for classification into trustworthy and non-trustworthy concerns 2190 (3 × 730) heatmaps obtained with the three models: F-Dnet-C, F-Dnet-Q-4q-2D and F-Dnet-P-4q. The results of this classification are presented in Table 6. Overall, trustworthy heatmaps were found in 61% of cases with the CC model, in 94% of cases with the Qiskit-based CQ model (chi-square: p < 0.001 vs. CC) and in 92% of cases with the PennyLane-based CQ model (p < 0.001 vs. CC). Examples of comparisons between models are given in Figure 7 with one case of a non-trustworthy heatmap.   5 lists the metric values for CQ models, all with the freezer option. Compared with the F-Dnet-C model, the AUC score is lower for F-Dnet-P-10q, F-Dnet-Q-4q-2D and F-Dnet-Q-4q-4D (p-value < 0.01). The corresponding ROC curves are shown in Figure 6. Differences from F-Dnet-C (paired t-test) are indicated by ** p < 0.01.  cerns 2190 (3 × 730) heatmaps obtained with the three models: F-Dnet-C, F-Dnet-Q-4q-2D and F-Dnet-P-4q. The results of this classification are presented in Table 6. Overall, trustworthy heatmaps were found in 61% of cases with the CC model, in 94% of cases with the Qiskit-based CQ model (chi-square: p < 0.001 vs. CC) and in 92% of cases with the Pen-nyLane-based CQ model (p < 0.001 vs. CC). Examples of comparisons between models are given in Figure 7 with one case of a non-trustworthy heatmap.   Figure 8a shows the NGED curves calculated in a function of the amount of data in the Qiskit-based QC models. We can see that the four-dim output leads to a much higher effective dimension over a wide range. In Figure 8b, the training loss curves observed in these models with and without a freezer are shown. Without the freezer, the two-dim  Figure 8a shows the NGED curves calculated in a function of the amount of data in the Qiskit-based QC models. We can see that the four-dim output leads to a much higher effective dimension over a wide range. In Figure 8b, the training loss curves observed in these models with and without a freezer are shown. Without the freezer, the two-dim model has higher loss values than those of the four-dim model. With the freezer, the loss values are decreased much more for the two-dim model than for the four-dim model. model has higher loss values than those of the four-dim model. With the freezer, the loss values are decreased much more for the two-dim model than for the four-dim model.

Discussion and Conclusions
In this manuscript, we aimed to study the effect of applying Quantum ML techniques to the detection of cardiomegaly. This is a healthcare problem based on image processing and learning techniques. In particular, we evaluated the performance and trainability of two CC models based on DenseNet-121 and AlexNet, and we modified the DenseNet-121 model by replacing a classical dense layer with a quantum layer leading to CQ models. Following transfer learning training, two different schemas were conducted. In one approach, we froze the weights from the initial layers of the architecture, and in the other approach, we let the weights be updated by our cardiomegaly dedicated data.
We compared the results between the different methods, studied the NGED on the Qiskit-based quantum layer and used Grad-CAM++ to compare the heatmaps obtained from classical and hybrid solutions.
Our protocol required as many common parameters as possible between CC and CQ models to compare them on equal footing. Tuning hyperparameters differently for each model was avoided. We chose twenty epochs after verifying that no sign of an overfitting effect on the test metrics appeared when training CC models up to this number of epochs. Extending this analysis to the CQ models did not seem necessary because the concern about overfitting was triggered essentially by the comparison of training and test ROC curves obtained in the CC models. In addition, this would necessitate considerable time given our computing resources.
The reasons for choosing the F-Dnet-C model as the reference were the generally better test metrics for the versions with the freezer than those without the freezer and the superiority of DenseNet121 over AlexNet for the same criterion via ten-fold cross-validation. Using this approach, the PennyLane-based CQ models up to eight qubits, revealing no statistical differences for any metric compared to the reference. The same was observed for the Qiskit-based models, except that the AUC score was lower. With the exception of the AlexNet-based CC models, the metric values we obtained from the 70/30 train-test split were, in general, very similar to those found via ten-fold cross-validation. Whereas the latter method is one of those recommended for comparing performance in the ML diagnosis of medical images [40], the former retains its interest in exploring a wide range of models with multiple options and for specific types of analysis, involving loss curves or salience maps, for example. Different factors can influence the efficiency of quantum transfer learning. Among them, the quality and compatibility of the pre-trained models can be cited. This factor was only studied here for choosing the classical reference model. The less-performing AlexNet option was not tested for CQ models. We acknowledge that this could be an interesting

Discussion and Conclusions
In this manuscript, we aimed to study the effect of applying Quantum ML techniques to the detection of cardiomegaly. This is a healthcare problem based on image processing and learning techniques. In particular, we evaluated the performance and trainability of two CC models based on DenseNet-121 and AlexNet, and we modified the DenseNet-121 model by replacing a classical dense layer with a quantum layer leading to CQ models. Following transfer learning training, two different schemas were conducted. In one approach, we froze the weights from the initial layers of the architecture, and in the other approach, we let the weights be updated by our cardiomegaly dedicated data.
We compared the results between the different methods, studied the NGED on the Qiskit-based quantum layer and used Grad-CAM++ to compare the heatmaps obtained from classical and hybrid solutions.
Our protocol required as many common parameters as possible between CC and CQ models to compare them on equal footing. Tuning hyperparameters differently for each model was avoided. We chose twenty epochs after verifying that no sign of an overfitting effect on the test metrics appeared when training CC models up to this number of epochs. Extending this analysis to the CQ models did not seem necessary because the concern about overfitting was triggered essentially by the comparison of training and test ROC curves obtained in the CC models. In addition, this would necessitate considerable time given our computing resources.
The reasons for choosing the F-Dnet-C model as the reference were the generally better test metrics for the versions with the freezer than those without the freezer and the superiority of DenseNet121 over AlexNet for the same criterion via ten-fold cross-validation. Using this approach, the PennyLane-based CQ models up to eight qubits, revealing no statistical differences for any metric compared to the reference. The same was observed for the Qiskit-based models, except that the AUC score was lower. With the exception of the AlexNet-based CC models, the metric values we obtained from the 70/30 train-test split were, in general, very similar to those found via ten-fold cross-validation. Whereas the latter method is one of those recommended for comparing performance in the ML diagnosis of medical images [40], the former retains its interest in exploring a wide range of models with multiple options and for specific types of analysis, involving loss curves or salience maps, for example. Different factors can influence the efficiency of quantum transfer learning. Among them, the quality and compatibility of the pre-trained models can be cited. This factor was only studied here for choosing the classical reference model. The less-performing AlexNet option was not tested for CQ models. We acknowledge that this could be an interesting part of further studies. Another factor is increased complexity. In models with an increasing number of qubits, the metric values remained the same, with the exception of a significantly lower AUC score for ten qubits. We did not test quantum hardware, an important and rapidly evolving factor. Using Qiskit, the dimension of the quantum classifier output was a factor that seemed to influence training. In this case, further experimentation, preferably on hardware, can be contemplated for analyzing the impact on quantum parallelism.
The details of the quantum part of the algorithm deserve some comments. Adding a PQC between two layers of the trainable classifier is a common technique, although not always necessary or effective in all cases of quantum transfer learning. We made this choice because it has proven its usefulness in other similar quantum applications. Using Qiskit, the PyTorch connection allowed a classification task on a simulator and hardware [22], and using Pennylane, the 'default.qubit' simulator was used for the implementation of the model in an application involving CT scan images [26]. For adapting activation functions to quantum circuits, one can use quantum gates, such as the SWAP test or the parametrized quantum circuit (PQC), to transform the quantum state before and after applying the activation function. In another option, the output of the quantum circuit is measured, and the measurement result is used as input to a classical activation function. We chose the approach that is integrated into the quantum SDKs that were available to us and that had demonstrated its effectiveness in other applications. For a proper gradient computation when using TorchConnector in Qiskit, the input_gradients parameter must be explicitly set to True. Equations for the resulting default gradients are given in the Opflow section of the Qiskit documentation. For PennyLane, the automatic "best differentiation method" that we used is a prominent built-in feature that is fully described in its documentation. Other options for the gradients could be considered and explored in further works, especially if convergence issues are detected. The simulator parameters of the SDKs can be tested in an initial phase. For example, 10, 100 and 1024 shots improve metric values to the same extent, well beyond 1 shot in Qiskit-based models. A value of 10 is preferred for these models, which require a long training time in the simulator. For PennyLane-based models, the default number of shots is 1024. We acknowledge that the difference in AUC scores between the models built on these different SDKs is most likely due to these shot number discrepancies.
For heatmap analysis, our blinded human classification into trustworthy and nontrustworthy cases was based on well-defined criteria described in the Materials and Methods section. These criteria had to be adapted for the detection of a single large organ inside the thoracic cage, whereas other published ML studies on CXRs address multiple classes of possibly concomitant abnormal findings [12,21,41,42].
The differences observed in the Grad-CAM++ heatmaps obtained from the three models we examined can have several explanations. First, we used the test set CXRs in the review process, which is more representative of what would be obtained in real applications. The trade-off is that non-trustworthy heatmaps are more likely to show up in test cases than in training cases. Second, the classical algorithm is inherently different from those based on quantum SDKs. Moreover, the Qiskit and Pennylane algorithms are not entirely similar, being based on different code and program functions, which explains the difference in heatmaps in the two CQ models. Third, more aberrantly localized salient zones, or multiple ones, are noted in control cases. Indeed, the target, the heart, is by definition larger in cardiomegaly and therefore less often missed. Determining why precisely the quantum approach performs better, in general, would require further investigations. In the face of large amounts of data and parameters [43], satisfactory visualization can be an emergent property of a CQ model confronted with medical images. A high rate of trustworthy heatmaps is a strong argument for the acceptance of quantum ML in diagnostic applications.
We also examined the possible gain in the trainability of the quantum approach. A supervised classifier maps a higher dimensional input feature space to a lower dimensional output class space. This mapping is largely determined by how well the model recognizes the relations between the most relevant latent features and the features that do not contribute to the decision classes [32]. A model learns feature mapping based on the training data, and therefore, measuring the performance of a model on unseen data is extremely important in assessing the risk involved in the model prediction. The measure of how well the model performs on unseen data is termed as the expressibility or the model's ability to generalize [44].
Expressive power or the generalization bound is a measure that plays an important role in identifying an overfitting or underfitting model and the risk involved in the model prediction. The generalization bound provides the bounding value (threshold) for the risk assessment. For efficient assessment, the bounding value should consider the training sample size and the richness of the class predictors output by the model [45]. As QNN combines the principles of classical neural networks and PQC, the generalization bounds applied to the classical networks can be adapted to QNN as well [46].
The NGED is a generalization bound that uses Fisher information. Fisher information describes the geometry of the model parameter space and, in turn, the data distribution [47][48][49]. Fisher information also captures the model's output sensitivity to changes in the input parameter space, thus providing the relationship mapping between the input data and the model output [18].
In QNN models, the cost function or the network gradient is evaluated in the quantum simulator or hardware; then, a classical optimization algorithm trains the parameters of the PQC to minimize the loss or cost of the function [50]. A flat loss landscape or the barren plateau effect, therefore, largely affects the parameter optimization and, hence, the trainability of the QNN models. A barren plateau could be induced either by the noisy quantum hardware or could be circuit-induced (due to random parameter initialization) [51].
In our experiments, we found that the best training performance was obtained in the case of the two-dimensional CQ model, even if the NGED was smaller when compared with the four-dimensional mode. This counterintuitive behavior could be explained based on the fact that we only measured the effective dimension from the quantum layer, or it could be based on the model complexity from the four-dimensional architecture. Nonetheless, we are planning to study the NGED influence on model performance and trainability. In particular, this would help to evaluate the barren plateau effect. Graphical visualization of the algorithm convergence and a comparison of loss curves obtained from simulator vs. real quantum devices would complete such experimental work. This would help set up and evaluate strategies to mitigate the barren plateau effect, e.g., by modifying the quantum ansatz or acting on the training parameters.
As a limitation, the high number of trainable parameters in the classical classifier currently does not allow us to compare the CC and CQ approaches in terms of trainability based on the global effective dimension. Indeed, this would require a very powerful computational resource.
Our findings help define future directions for algorithm improvement, hardware testing, and implementation in a real-life medical environment.
We believe the models can be improved by applying fine-tuning before/after the first two steps with the freezer for a better accuracy outcome [52], tuning the hyperparameters and having a better-fitted quantum circuit in the quantum layer [22]. Another possibility is applying quantum convolutional and max pool [53]. One can also consider testing the hybrid approach on smaller pre-trained models, such as Resnet34 or Swin Transformer [26,54].
A further step would be to test the CQ models on noisy simulators and NISQ devices. The first approach is easily accessible but remains a delicate task [55] that does not necessarily reflect the performance when using the hardware, which improves itself rapidly. Although even more challenging [56], the evaluation of these models on real quantum computers in terms of performance, resilience and credible visualization would allow us to better affirm their feasibility in real situations. In this process, another full bench would be advised to further study the usability of the models.
Additional research exploring the impact of these CC and QC models in a medical setting may be contemplated. The more nuanced probability of cardiomegaly provided by these types of transfer learning algorithms could have a different clinical meaning from that of the simple reference label or the binary output. This probability is a feature that should be tested against established diagnoses and that should be possibly added to tabular data used by statistical and ML applications in heart failure [4]. We can also use proposed publicly available datasets and open-source software [57] to compare heatmaps generated by different saliency methods in our models to confirm that some hybrid options exhibit better localization performance. Acknowledgments: We acknowledge free access to the software development kits Qiskit [58], Pen-nyLane [59] and Pytorch [60]. We thank Pei Pei Chong, Director of Digital Health and Innovations Impact Lab, Taylor's University, for editing and checking the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest. Table A1 lists the metric values observed in the thirteen experiments. Figures A1 and A2 display the ROC curves observed in these models for the training and test sets.

Model Name
Processing Unit Calculation Time per Epoch F-Dnet-C/N-Dnet-C GPU 1 min F-Axnet-C/C-Axnet-C GPU 5 s F-Dnet-P-4q/N-Dnet-P-4q GPU 3 min F-Dnet-Q-4q-2D/N-Dnet-Q-4q-2D GPU and CPU 11 min F-Dnet-Q-4q-4D/N-Dnet-Q-4q-4D GPU and CPU 11 min F-Dnet-P-6q GPU 4 min F-Dnet-P-8q GPU 5 min F-Dnet-P-10q GPU 7 min When comparing the freezer version to the standard version (ten first experiments listed in Table A1), we observed higher metric values with the freezer option in all cases, except when the feature extractor was the AlexNet CNN. Three additional models based on PennyLane, with 6, 8 and 10 qubits, were tested in the freezer version only.

Model Name
Processing Unit Calculation Time per Epoch F-Dnet-C/N-Dnet-C GPU 1 min F-Axnet-C/C-Axnet-C GPU 5 s F-Dnet-P-4q/N-Dnet-P-4q GPU 3 min F-Dnet-Q-4q-2D/N-Dnet-Q-4q-2D GPU and CPU 11 min F-Dnet-Q-4q-4D/N-Dnet-Q-4q-4D GPU and CPU 11 min F-Dnet-P-6q GPU 4 min F-Dnet-P-8q GPU 5 min F-Dnet-P-10q GPU 7 min When comparing the freezer version to the standard version (ten first experiments listed in Table A1), we observed higher metric values with the freezer option in all cases, except when the feature extractor was the AlexNet CNN. Three additional models based on PennyLane, with 6, 8 and 10 qubits, were tested in the freezer version only.

Model Name Processing Unit Calculation Time per Epoch
GPU and CPU 11 min F-Dnet-Q-4q-4D/N-Dnet-Q-4q-4D GPU and CPU 11 min F-Dnet-P-6q GPU 4 min F-Dnet-P-8q GPU 5 min F-Dnet-P-10q GPU 7 min When comparing the freezer version to the standard version (ten first experiments listed in Table A1), we observed higher metric values with the freezer option in all cases, except when the feature extractor was the AlexNet CNN. Three additional models based on PennyLane, with 6, 8 and 10 qubits, were tested in the freezer version only.
To better shed light on the superiority of the freezer-enhanced protocol over the standard protocol, we compared the loss curves. In the CC models based on AlexNet ( Figure A3a), the shortest pre-trained feature extractor, the freezer slowed down the process, whereas the opposite was true in models based on DenseNet-121. In two of the three fourqubit CQ models (six first curves of Figure A3b), all based on pre-trained DenseNet-121, the process started lower and was accelerated with the freezer in two cases: the Qiskit version with two-dimensional output and the PennyLane version. No notable freezer effect was observed for the Qiskit version with four-dimensional output.
J. Imaging 2023, 9, x FOR PEER REVIEW 18 of 22 To better shed light on the superiority of the freezer-enhanced protocol over the standard protocol, we compared the loss curves. In the CC models based on AlexNet (Figure A3a), the shortest pre-trained feature extractor, the freezer slowed down the process, whereas the opposite was true in models based on DenseNet-121. In two of the three fourqubit CQ models (six first curves of Figure A3b), all based on pre-trained DenseNet-121, the process started lower and was accelerated with the freezer in two cases: the Qiskit version with two-dimensional output and the PennyLane version. No notable freezer effect was observed for the Qiskit version with four-dimensional output. For training ( Figure A1a), a very high AUC score was obtained in the F-Dnet-C model. Therefore, we looked at the 11 CXRs with a discrepancy between the label and prediction. (Figure A4). In all these cases, the diagnosis obtained by automated label extraction from free-text reports can certainly be disputed. For training ( Figure A1a), a very high AUC score was obtained in the F-Dnet-C model. Therefore, we looked at the 11 CXRs with a discrepancy between the label and prediction. ( Figure A4). In all these cases, the diagnosis obtained by automated label extraction from free-text reports can certainly be disputed.
To complete our investigations about possible overfitting, ten-fold cross-validation was performed in the CC models with different numbers of epochs (Table A3). In an analysis of variance with the fold, number of epochs and feature extractor as the dependent variables, no significant difference was observed among 10, 15 and 20 epochs for any metric value.  To complete our investigations about possible overfitting, ten-fold cross-validation was performed in the CC models with different numbers of epochs (Table A3). In an analysis of variance with the fold, number of epochs and feature extractor as the dependent variables, no significant difference was observed among 10, 15 and 20 epochs for any metric value.