The Performance of a Deep Learning-Based Automatic Measurement Model for Measuring the Cardiothoracic Ratio on Chest Radiographs

Abstract Objective: Prior studies of deep learning (DL)-based models for measuring the cardiothoracic ratio (CTR) on chest radiographs have lacked rigorous agreement analyses with radiologists or reader tests. We validated the performance of a commercially available DL-based CTR measurement model across various thoracic pathologies, and performed agreement analyses with thoracic radiologists and reader tests using a probabilistic-based reference standard. Materials and Methods: This study included 160 posteroanterior chest radiographs (n = 40 in each of four categories: no lung or pleural abnormalities, pneumothorax, pleural effusion, and consolidation) to externally test a DL-based CTR measurement model. To assess the agreement between the model and experts, intraclass or interclass correlation coefficients (ICCs) were compared between the model and two thoracic radiologists. In the reader tests with a probabilistic-based reference standard (Dawid–Skene consensus), we compared diagnostic measures for cardiomegaly, including sensitivity and negative predictive value (NPV), between the model and five other radiologists using the non-inferiority test. Results: For the 160 chest radiographs, the model measured a median CTR of 0.521 (interquartile range, 0.446–0.59) and a mean CTR of 0.522 ± 0.095. The ICC between the two thoracic radiologists and that between the model and the two thoracic radiologists were not significantly different (0.972 versus 0.959, p = 0.192), even across various pathologies (all p-values > 0.05). The model showed non-inferior diagnostic performance, including sensitivity (96.3% versus 97.8%) and NPV (95.6% versus 97.4%) (p < 0.001 in both), compared with the radiologists for all 160 chest radiographs. However, it showed inferior sensitivity in chest radiographs with consolidation (95.5% versus 99.9%; p = 0.082) and inferior NPV in chest radiographs with pleural effusion (92.9% versus 94.6%; p = 0.079) and consolidation (94.1% versus 98.7%; p = 0.173).
Conclusion: While the sensitivity and NPV of this model for diagnosing cardiomegaly in chest radiographs with consolidation or pleural effusion were not as high as those of the radiologists, it demonstrated good agreement with the thoracic radiologists in measuring the CTR across various pathologies.


Introduction
Chest radiographs are the most common and basic diagnostic examination for cardiothoracic and pulmonary diseases, accounting for 40% of all radiologic examinations [1,2]. The cardiothoracic ratio (CTR) measured on chest radiographs, defined as the ratio of the greatest transverse dimension of the heart to the greatest transverse dimension of the thoracic cavity, is a simple but critical parameter for assessing cardiomegaly [3,4]. Since various heart diseases are accompanied by cardiomegaly (e.g., hypertension, coronary artery disease, cardiac valve disease, and pulmonary hypertension), accurate CTR measurements can help initiate diagnostic workups for patients' underlying heart diseases, potentially leading to an improvement in patients' prognosis [5-9]. Despite the clinical importance of the CTR, the vast and growing number of chest radiographs in the modern medical workload means that measuring the CTR on all chest radiographs is considered a time-consuming and redundant step [2,10]. In addition, since human measurements of the CTR (e.g., by radiologists or cardiologists) are considered the gold standard, there are issues with intra-observer and inter-observer variability in terms of accurately measuring the CTR [3].
Deep learning (DL) has recently been applied to various medical tasks and achieved a superior or comparable diagnostic performance to experts [11]. Indeed, prior studies reported that automatic CTR measurement using DL, specifically U-Net, could provide accurate CTR measurements on chest radiographs [5,12-16]. However, those studies used reference standards constructed with only one radiologist or a consensus reading of two or three radiologists, did not evaluate model performance in the setting of various thoracic pathologies, or did not conduct reader tests with multiple readers; these methodological aspects of previous studies can limit the applicability of DL in measuring the CTR in real-world clinical settings [5,12-16]. Therefore, this study aimed to validate the performance of a commercially available DL-based CTR measurement model with various thoracic pathologies, and perform agreement analyses with thoracic radiologists and then reader tests using a probabilistic-based reference standard.

DL-Based Model Measuring the CTR on Chest Radiographs
A commercially available DL-based model measuring the CTR on chest radiographs (CTR-AI, version 1.0, HealthHub) was used in this study. Specifically, the model extracts boundaries of the lungs and heart on chest radiographs (posteroanterior view). The architecture of this system is presented in Figure 1. Using standard U-Net architecture, two DL algorithms segment the lungs and heart. These DL algorithms were trained with 633 chest radiographs and internally validated with 160 radiographs from the Japanese Society of Radiological Technology, the Montgomery public dataset, and an in-house dataset. Self-attention modules, consisting of channel and spatial attention blocks, were added to improve the ability to represent disparate features [14,17,18]. The channel attention block extracts the inter-channel connections of the input feature map, while the spatial attention block encodes the relative importance of each spatial location of the input feature map. These attention blocks can be located in the U-Net architecture at any place and in any number. In the best-performance experiments, the attention modules were applied in the first and second places in both directions of the U-Net architecture. The detailed architecture of the segmentation model is shown in Figure 1. After segmenting the lungs and heart, the DL-based model calculates the maximum horizontal distances for each segmented area using an image-processing algorithm to output the CTR value.
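The final step described above, turning two segmentation masks into a CTR value, can be illustrated with a minimal sketch. This is not the vendor's image-processing algorithm, which is not described in detail here. The function names are hypothetical, and the sketch assumes that the heart width is taken as the projected left-to-right extent of the heart mask and the thoracic width as the widest single-row span of the lung mask.

```python
import numpy as np


def projected_width(mask) -> int:
    """Horizontal extent (pixels) between the leftmost and rightmost foreground columns."""
    cols = np.flatnonzero(np.asarray(mask, dtype=bool).any(axis=0))
    if cols.size == 0:
        raise ValueError("mask is empty")
    return int(cols[-1] - cols[0] + 1)


def max_row_width(mask) -> int:
    """Largest left-to-right span found on any single image row."""
    mask = np.asarray(mask, dtype=bool)
    rows = mask[mask.any(axis=1)]  # keep only rows that contain foreground
    if rows.shape[0] == 0:
        raise ValueError("mask is empty")
    left = rows.argmax(axis=1)                                # first foreground column per row
    right = rows.shape[1] - 1 - rows[:, ::-1].argmax(axis=1)  # last foreground column per row
    return int((right - left + 1).max())


def ctr_from_masks(heart_mask, lung_mask) -> float:
    """Assumed definition: projected heart width / widest thoracic row (hypothetical helper)."""
    return projected_width(heart_mask) / max_row_width(lung_mask)


# Toy masks: the heart spans columns 8..14, the widest lung row spans columns 2..17.
heart = np.zeros((10, 20), dtype=bool)
heart[5, 8:12] = True
heart[7, 10:15] = True
lungs = np.zeros((10, 20), dtype=bool)
lungs[2, 2:9] = True
lungs[2, 12:18] = True
lungs[3, 1:10] = True
print(ctr_from_masks(heart, lungs))  # 7 / 16 = 0.4375
```

Using the projected extent for the heart mirrors the manual convention of adding the left and right heart diameters, which may lie on different rows, whereas the thoracic width is conventionally read along a single horizontal line.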

Study Sample
To validate the DL model in the setting of various pathologies, we collected posteroanterior chest radiographs with the following findings: no lung or pleural abnormalities (i.e., chest radiographs without any lung parenchymal or pleural abnormalities) (n = 40), pneumothorax (n = 40), pleural effusion (n = 40), and lung consolidation (n = 40). We randomly selected the study sample from chest radiographs taken between July 2018 and June 2021.

Measurement of the CTR by Thoracic Radiologists for Agreement Analyses
To assess the agreement between the DL model and human experts in measuring the CTR on chest radiographs, two thoracic radiologists measured the CTR of the 160 study sample radiographs. They independently measured the CTR twice at a 1-month interval (washout period). That is, four datasets for testing were obtained (two datasets from each of the two thoracic radiologists). When measuring the CTR, they were instructed to measure the maximum left heart diameter (MLD), the maximum right heart diameter (MRD), and the greatest transverse dimension of the thoracic cavity (GT). Then, the CTR was calculated as (MLD + MRD)/GT (Figure 2).
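As a small worked example of the formula above, the following sketch computes the CTR from the three manual measurements. The function name and the example values are illustrative assumptions, not data from the study.

```python
def cardiothoracic_ratio(mld: float, mrd: float, gt: float) -> float:
    """CTR = (MLD + MRD) / GT, with all three lengths in the same unit."""
    if gt <= 0:
        raise ValueError("thoracic dimension must be positive")
    return (mld + mrd) / gt


# Hypothetical measurements in millimetres: (90 + 50) / 280 = 0.5
print(cardiothoracic_ratio(mld=90, mrd=50, gt=280))
```

Because the CTR is a ratio, the unit of measurement cancels out, so pixel distances and calibrated lengths give the same value.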

Reader Tests
To compare the diagnostic performance of the DL model for diagnosing cardiomegaly to that of board-certified radiologists, five board-certified radiologists, who did not participate in the agreement analyses, independently measured the CTR of the test datasets in the same way as in the agreement analyses.
Since cardiomegaly is usually defined as a CTR value of more than 0.50 [4], we applied this cut-off value in the reader tests. Since the gold standard of measuring CTR is by human experts [4], we constructed the reference standard using the four measurement results obtained in the agreement analysis. To construct a reference standard for diagnosing cardiomegaly, the Dawid-Skene consensus method was used as a robust way of determining ground truth from various labeling data [19,20]. Specifically, we first categorized the four datasets from the two thoracic radiologists as a categorical variable (i.e., the presence or absence of cardiomegaly) using a cut-off value of 0.50. Then, a probabilistic generative model (Dawid-Skene consensus method) was used to fuse the labels from multiple annotation results by weighting reliable factors [19,20]. In addition, we also applied a cut-off value of 0.55 for cardiomegaly, since several prior studies referred to a CTR value of 0.55 for significant cardiomegaly [5,21,22].
As a sensitivity analysis, we set the reference standard with the median values of the four datasets per case.
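The label-fusion step described above can be sketched as a two-class Dawid-Skene model fitted with expectation-maximization. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the majority-vote initialization, the fixed iteration count, and the toy CTR values are all hypothetical.

```python
import numpy as np


def dawid_skene_binary(labels, n_iter=50, eps=1e-6):
    """Two-class Dawid-Skene EM (illustrative sketch).

    labels: (n_cases, n_annotations) array of 0/1 votes.
    Returns the posterior probability of cardiomegaly per case, plus each
    annotation set's estimated sensitivity and specificity.
    """
    labels = np.asarray(labels, dtype=float)
    # Initialise the posterior with the fraction of positive votes (majority vote).
    post = np.clip(labels.mean(axis=1), eps, 1 - eps)
    for _ in range(n_iter):
        # M-step: class prior and per-annotation reliability, weighted by the posterior.
        prior = np.clip(post.mean(), eps, 1 - eps)
        sens = (post[:, None] * labels).sum(axis=0) / post.sum()
        spec = ((1 - post)[:, None] * (1 - labels)).sum(axis=0) / (1 - post).sum()
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # E-step: log-likelihood of the observed votes under each true class.
        log_pos = np.log(prior) + (labels * np.log(sens)
                                   + (1 - labels) * np.log(1 - sens)).sum(axis=1)
        log_neg = np.log(1 - prior) + ((1 - labels) * np.log(spec)
                                       + labels * np.log(1 - spec)).sum(axis=1)
        post = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return post, sens, spec


# Hypothetical CTR values: rows are cases, columns are the four measurement sets.
ctr = np.array([
    [0.56, 0.57, 0.55, 0.58],
    [0.53, 0.52, 0.54, 0.51],
    [0.52, 0.51, 0.53, 0.49],
    [0.44, 0.45, 0.43, 0.46],
    [0.47, 0.48, 0.46, 0.45],
    [0.48, 0.49, 0.47, 0.51],
    [0.60, 0.61, 0.59, 0.62],
    [0.40, 0.41, 0.42, 0.39],
])
votes = (ctr > 0.50).astype(int)           # cardiomegaly defined as CTR > 0.50
posterior, sens, spec = dawid_skene_binary(votes)
reference = (posterior > 0.5).astype(int)  # consensus reference label per case
print(reference.tolist())
```

In contrast to a plain majority vote, the estimated sensitivity and specificity let an annotation set that disagrees more often with the others carry less weight in the consensus.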

Statistical Analysis
To calculate the sample size of the test dataset, the Bland-Altman plot was used with an expected mean of differences of 0.00175, an expected standard deviation of differences of 0.04108, and a maximum allowed difference between methods of 0.15. Type I error (alpha) and type II error (beta) were both set at 0.05. The resultant minimum sample size was 23.
Continuous variables are presented as mean with standard deviation (SD) or median with interquartile range (IQR). To evaluate the agreement in measurements of the CTR between thoracic radiologists (four datasets) and the DL-based model, we used the following methods: (a) Bland-Altman plots and mean absolute or relative differences with their limits of agreement (LOAs) [23,24]; (b) mean absolute error (MAE) and root mean square error (RMSE) between thoracic radiologists and the DL-based model; (c) intraclass or interclass correlation coefficients with a comparison of interclass correlation coefficients (ICCs) between the thoracic radiologists and between the thoracic radiologists and the DL-based model [25]. The p value was calculated from the empirical distribution from 1000 bootstrap samples. The MAE and RMSE were defined as
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - x_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i\right)^2},$$
where $y_i$ (e.g., DL model) and $x_i$ (e.g., the thoracic radiologists) represent two measurement values and $n$ is the number of paired measurements.
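The agreement metrics named above (Bland-Altman bias with limits of agreement, MAE, and RMSE) can be computed as in the following sketch. The function name and the example values are illustrative assumptions. The limits of agreement use the conventional bias ± 1.96 SD of the differences, which is assumed rather than stated in the text.

```python
import numpy as np


def agreement_summary(model, reader):
    """Bland-Altman bias and limits of agreement, plus MAE and RMSE (sketch)."""
    y = np.asarray(model, dtype=float)   # e.g., DL model measurements
    x = np.asarray(reader, dtype=float)  # e.g., radiologist measurements
    d = y - x
    bias = d.mean()
    sd = d.std(ddof=1)                   # sample SD of the paired differences
    return {
        "bias": bias,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),
        "mae": np.abs(d).mean(),
        "rmse": np.sqrt((d ** 2).mean()),
    }


# Hypothetical paired CTR measurements for four radiographs.
summary = agreement_summary(model=[0.50, 0.52, 0.48, 0.60],
                            reader=[0.49, 0.54, 0.48, 0.58])
print(summary)
```

The RMSE penalizes large individual discrepancies more heavily than the MAE, so a gap between the two suggests that a few radiographs account for much of the error.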
In the reader tests, diagnostic measures, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy of the DL-based model and five board-certified radiologists were calculated and compared using the non-inferiority test with a non-inferiority limit of 10%.
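The diagnostic measures and the idea of a non-inferiority comparison can be sketched as follows. The exact test statistic used in the study is not specified here, so the one-sided Wald test for two independent proportions below is an assumption chosen for simplicity; it ignores the fact that the model and readers assessed the same radiographs. All function names are hypothetical.

```python
from math import erf, sqrt


def diagnostic_measures(pred, truth):
    """Sensitivity, specificity, PPV, NPV, and accuracy from paired binary labels.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }


def noninferiority_p(p_model, n_model, p_ref, n_ref, margin=0.10):
    """One-sided Wald p value for H0: p_model - p_ref <= -margin (assumed test).

    Requires a non-zero standard error, i.e., not both proportions at 0 or 1.
    """
    se = sqrt(p_model * (1 - p_model) / n_model + p_ref * (1 - p_ref) / n_ref)
    z = (p_model - p_ref + margin) / se
    return 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail probability P(Z >= z)


# Hypothetical labels: 1 = cardiomegaly, 0 = no cardiomegaly.
measures = diagnostic_measures(pred=[1, 1, 1, 0, 0, 0, 1, 0],
                               truth=[1, 1, 0, 0, 0, 1, 1, 0])
print(measures)
```

Under this formulation, non-inferiority is concluded when the p value falls below the one-sided threshold, that is, when the data rule out the model being worse than the comparator by more than the margin.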
Subgroup analyses were performed in chest radiographs with no lung or pleural abnormalities, pneumothorax, pleural effusion, and lung consolidation.
All statistical analyses were performed using R version 4.1.0 (R Project for Statistical Computing). A p value < 0.05 was considered to indicate statistical significance, but a cut-off of a p value of 0.025 was used in the one-sided non-inferiority test.
The MAE and RMSE between the DL-based model and the two thoracic radiologists in two sessions were 0.019 and 0.028, respectively. In the subgroup analyses, the MAE and RMSE in chest radiographs without any lung or pleural abnormality, chest radiographs with pneumothorax, chest radiographs with pleural effusion, and chest radiographs with consolidation were 0.013 and 0.021, 0.013 and 0.02, 0.025 and 0.033, and 0.024 and 0.035, respectively (Table S1).
The intra-observer correlation coefficients of the thoracic radiologists were 0.988 (95% CI: 0.983, 0.991) and 0.977 (95% CI: 0.968, 0.983), respectively. The inter-observer correlation coefficient was calculated as 0.972 (95% CI: 0.953, 0.982) between the two thoracic radiologists and 0.959 (95% CI: 0.945, 0.971) between the DL-based model and the two thoracic radiologists, and these agreements were not significantly different (p = 0.192). In the subgroup analyses, there were no significant differences in agreement between the two thoracic radiologists and between the model and the two thoracic radiologists (p values > 0.05) (Table 2).

Discussion
In this study, we validated the performance of a commercially available DL-based CTR measurement model by assessing its agreement with thoracic radiologists, and then performed reader tests with five radiologists in various thoracic pathologies. The mean absolute and relative differences between the DL-based model and thoracic radiologists were 0.0074 and 1.3%, and their error ranges were from −0.0457 to 0.0605 and from −9.33% to 11.92%, respectively. The MAE and RMSE between the model and the two thoracic radiologists were 0.019 and 0.028, respectively. The ICC between the model and the thoracic radiologists was 0.959, comparable to that between the two thoracic radiologists (ICC = 0.972; p = 0.192). Finally, the DL-based model had comparable sensitivity, specificity, PPV, NPV, and accuracy for diagnosing cardiomegaly compared with five board-certified radiologists using both the reference standard constructed by Dawid-Skene consensus (sensitivity, 96.3% versus 97.8%; specificity, 83.3% versus 85.1%; PPV, 85.9% versus 87.4%; NPV, 95.6% versus 97.4%; and accuracy, 90.0% versus 91.6%; all p < 0.025) and the median CTR determined by the thoracic radiologists (sensitivity, 97.6% versus 97.9%; specificity, 86.8% versus 87.4%; PPV, 89.1% versus 89.5%; NPV, 97.1% versus 97.4%; and accuracy, 92.5% versus 92.9%; all p < 0.025).
We used the Dawid-Skene consensus method to construct reference standards. This is a statistical method for determining the ground truth based on a probabilistic generative model for fusing the labels from multiple voters in a coherent manner [19]. Specifically, the Dawid-Skene model generates ground truth from multiple voters by discounting unreliable factors' contributions while compensating them with reliable factors for model prediction [19,26]. In contrast, prior studies set the reference standards for measuring the CTR using only one radiologist or consensus readings of two or three radiologists, which are prone to inter- or intra-observer variability [12,14,16].
Another point to consider in measuring CTR is that various lung or pleural pathologies can exist in real-world clinical settings (e.g., pneumothorax, pleural effusion, and lung opacity). Although a prior study included these pathologies in its study sample, it reported only the measurement results of its DL-based model in these settings without reader tests, limiting the clinical validity of the model [14]. In contrast, we validated our DL-based model in these pathologic settings through agreement analyses with thoracic radiologists and reader tests with board-certified radiologists. The model showed agreement with thoracic radiologists that was equivalent to that between the two thoracic radiologists in all pathologic settings. Since DL-based models have the potential to be used as screening or triaging tools for cardiomegaly in a real-world clinical setting [5-9], sensitivity and NPV are key diagnostic factors in such a setting [27,28]. In the reader tests, the DL-based model achieved diagnostic measures of sensitivity and NPV that were comparable to the five board-certified radiologists for chest radiographs without any lung or pleural abnormality, as well as for chest radiographs with pneumothorax. In contrast, the model achieved an inferior NPV for chest radiographs with pleural effusion, and inferior sensitivity and NPV for chest radiographs with consolidation. These results are in line with a prior study, according to which a DL-based model exhibited low CTR measurement performance in chest radiographs with abnormal findings obscuring the margin of the thoracic cage (e.g., pleural effusion) and heart border (e.g., pneumonia) [14].
Two limitations should be noted in this study. First, we only assessed the measurement performance of the DL-based model for the CTR without evaluating the added value of the model compared to the human radiologists (e.g., improvement in diagnostic performance, reduced measurement time to diagnose cardiomegaly) [29,30]. Second, although we showed a comparable performance of the DL-based model in measuring the CTR and diagnosing cardiomegaly, its clinical applicability in a specific scenario was not investigated. A prime example would be applying this DL-based model in the emergency department to triage patients who should be seen by a cardiologist first. Further validation studies are warranted.
In conclusion, while the sensitivity and NPV of this DL-based model for diagnosing cardiomegaly in chest radiographs with pleural effusion or consolidation were not as high as those of radiologists, the model demonstrated good agreement with thoracic radiologists in measuring the CTR across various pathologies.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering10091077/s1, Figure S1; Table S1: Mean absolute error and root mean square error for measuring the cardiothoracic ratio on chest radiographs between a deep learning-based model and two thoracic radiologists; Table S2: Diagnostic performance for cardiomegaly on chest radiographs (threshold of 0.55) between a deep learning-based model and five board-certified radiologists with a reference standard derived using the Dawid-Skene consensus method; Table S3: Diagnostic performance for cardiomegaly on chest radiographs between a deep learning-based model and five board-certified radiologists with a reference standard derived using the median values.
Funding: This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI20C2092). However, the funder had no role in the study design; in the collection, analysis, and interpretation of the data; in the writing of the report; and in the decision to submit the article for publication.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Seoul National University Hospital, and the requirement for written informed consent was waived (IRB No.: H-2203-081-1308).
Informed Consent Statement: Patient consent was waived due to the retrospective nature and minimal risk of this study.

Figure 1. (A) Architecture of the computer-aided automatic measurement system of the cardiothoracic ratio (CTR) on chest radiographs. (B) Detail of the segmentation model using standard U-Net architecture with specific self-attention modules.

Figure 2. Measurement of the cardiothoracic ratio (CTR). The CTR measurement model and radiologists measured (1) the maximum left heart diameter, (2) the maximum right heart diameter, and (3) the greatest transverse dimension of the thoracic cavity. The CTR is calculated as (maximum left heart diameter + the maximum right heart diameter)/the greatest transverse dimension of the thoracic cavity.

Figure 4. Bland-Altman plots for cardiothoracic ratio (CTR) measurements between the deep learning-based model and thoracic radiologists in various thoracic pathologies. The mean absolute and relative differences in chest radiographs (A,B) without any lung or pleural abnormality, (C,D) with pneumothorax, (E,F) with pleural effusion, and (G,H) with lung consolidation. Detailed information for these is shown in Table 1.

Figure 5. Representative images of cardiothoracic ratio (CTR) measurements obtained from the deep learning-based model. The red lines represent the maximum left and right heart diameters, respectively, the yellow lines indicate the greatest transverse dimension of the thoracic cavity, and the blue lines represent the vertical lines passing through the midpoint of the vertebral bodies. (A) Chest radiographs without any lung or pleural abnormality. The deep learning-based model calculated the CTR as 0.588, which was determined as indicative of cardiomegaly. All five board-certified radiologists identified the radiograph as demonstrating cardiomegaly (CTR range: 0.586-0.599). (B) Chest radiographs with pneumothorax (arrowheads). The model calculated the CTR as 0.362 (CTR range of five radiologists, 0.356-0.375). (C) Chest radiographs with pleural effusion. The model calculated the CTR as 0.5. However, the chest cavity was measured to be shorter than its actual size due to the left pleural effusion (CTR range of five radiologists, 0.433-0.552). (D) Chest radiographs with lung consolidation. The CTR measured by the model was 0.593, which was determined as cardiomegaly, and all five board-certified radiologists read this radiograph as having cardiomegaly (CTR range, 0.554-0.595).

Plots of cardiothoracic ratio measurements of radiologists (X-axis) and deep learning-based model (Y-axis). (A-E) Reference standard for cardiomegaly constructed by Dawid-Skene consensus method: (A) study sample, (B) chest radiographs without any lung or pleural abnormality, (C) chest radiographs with pneumothorax, (D) chest radiographs with pleural effusion, and (E) chest radiographs with consolidation. (F-J) Reference standard for cardiomegaly constructed by median values: (F) study sample, (G) chest radiographs without any lung or pleural abnormality, (H) chest radiographs with pneumothorax, (I) chest radiographs with pleural effusion, and (J) chest radiographs with consolidation. Green solid circle: cardiomegaly by both radiologists and a deep learning-based model; Orange solid circle: cardiomegaly only by a deep learning-based model; Orange open circle: cardiomegaly only by radiologists; Green open circle: normal by both radiologists and deep learning-based model.

Table 1. Mean differences and mean relative differences between measurements of the cardiothoracic ratio by a deep learning-based model and two thoracic radiologists.
LOA: limit of agreement; CI: confidence interval.

Table 2. Intraclass or interclass correlation coefficient (ICC) analysis for measurements of the cardiothoracic ratio on chest radiographs between a deep learning-based model and two thoracic radiologists.
ICC: intraclass or interclass correlation coefficients. * p-values were estimated by 1000 rounds of bootstrapping.

Table 3. Diagnostic performance for cardiomegaly on chest radiographs for a deep learning-based model and five board-certified radiologists with a reference standard derived using the Dawid-Skene consensus method.
* One-sided non-inferiority test with a cut-off of a p value of 0.025.