Considerations on Baseline Generation for Imaging AI Studies Illustrated on the CT-Based Prediction of Empyema and Outcome Assessment

For AI-based classification tasks in computed tomography (CT), a reference standard for evaluating the clinical diagnostic accuracy of individual classes is essential. To enable the implementation of an AI tool in clinical practice, the raw data should be drawn from clinical routine data using state-of-the-art scanners, evaluated in a blinded manner and verified with a reference test. Three hundred and thirty-five consecutive CTs, performed between 1 January 2016 and 1 January 2021, with reported pleural effusion and pathology reports from thoracocentesis or biopsy within 7 days of the CT, were retrospectively included. Two radiologists (4 and 10 PGY) blindly assessed the chest CTs for pleural CT features. If needed, consensus was achieved using an experienced radiologist’s opinion (29 PGY). In addition, diagnoses were extracted from written radiological reports. We analyzed these findings for a possible correlation with the following patient outcomes: mortality and median hospital stay. For AI prediction, we used an approach consisting of nnU-Net segmentation, PyRadiomics features and a random forest model. Specificity and sensitivity for CT-based detection of empyema (n = 81 of n = 335 patients) were 90.94% (95% CI: 86.55–94.05) and 72.84% (95% CI: 61.63–81.85), respectively, in all effusions, with moderate to almost perfect interrater agreement for all pleural findings associated with empyema (Cohen’s kappa = 0.41–0.82). The highest accuracies were found for pleural enhancement and thickening (87.02% and 81.49%, respectively). For empyema prediction, AI achieved a specificity and sensitivity of 74.41% (95% CI: 68.50–79.57) and 77.78% (95% CI: 66.91–85.96), respectively. Empyema was associated with a longer hospital stay (median = 20 versus 14 days), and findings consistent with pleural carcinomatosis impacted mortality.


Introduction
Artificial intelligence offers multiple new possibilities for quantitative image analysis in radiology. AI-aided anatomical segmentation, such as lung segmentation for quantification of lung infiltrates [1], is already established in clinical routines. AI also holds great promise in classifying different pathologies [2]. However, there are major challenges regarding the classification of diseases: In order to train and evaluate an algorithm, high diagnostic accuracy is required for disease classification, but CT-based radiological diagnosis often provides only moderate diagnostic accuracy, depending on the clinical question. Conversely, AI-based quantification or classification is of particular interest for those diagnoses with only moderate radiological diagnostic accuracy. For the training, validation and testing of an AI model, high demands should be made of the reference standard ("ground truth"). Since the primary goal of the development of AI tools should be the application of these tools in routine clinical practice, a classifier should be developed on a heterogeneous dataset that is as independent from the "inclusion" and "exclusion criteria" as possible. This need for generalizability is often in contrast with published data. A large proportion of published diagnostic accuracy and outcome studies [3] shows limiting exclusions of diagnostically challenging cases.
Empyemas are pleural effusions with pus in the pleural space and are most commonly secondary to pneumonia [4]. Empyema-related hospitalizations are increasing [5], and empyemas are associated with worse outcomes, such as prolonged admission, more complications [6] and therefore more invasive management [7], compared to parapneumonic effusions. Distinguishing empyema from other forms of pleural effusion is radiologically challenging; an AI-based classification of effusions could help in the image reading process. CT is an integral part of routine clinical diagnostics for the timely diagnosis of empyema; however, published diagnostic accuracy measures are highly heterogeneous [8]. These differences might be explained by small sample sizes and by differences in reference standards, CT acquisition, and publication date. Currently, there is no diagnostic accuracy study for the diagnosis of empyema itself (rather than for individual CT features), nor is there an investigation of outcome measures based on radiological reporting.
The main objective of this study is to generate a dataset for an AI classifier for detecting empyema in pleural effusions based on routinely performed CTs with pathological confirmation in combination with outcome predictors. The aims are (1) to determine the diagnostic accuracy of "empyema" and the reported pleural CT features in routinely acquired radiological reports; (2) to ascertain the diagnostic accuracy of "pleural CT features" in a blinded manner; (3) to define a consensus based on routine radiological findings and the blinded interpretation as the reference standard and to evaluate this consensus based on sensitivity and specificity; (4) to assess pleural features for their prediction of length of hospital stay and mortality; and (5) to develop a prototype for automated empyema prediction.

Materials and Methods
This study was approved by the local ethics committee (Project ID: 2021-00946) and is registered on the German Clinical Trials Register (DRKS00025201). No protocol deviation occurred.

Eligibility Criteria
Eligible patients were retrospectively identified based on the presence of pleural effusion in the radiological report between 1st January 2016 and 1st January 2021. All routine chest CTs were included regardless of contrast phase, with pathological reports within 7 days. Patients without pathologic reports and follow-up examinations were excluded. To avoid an inappropriate exclusion, patients who had already received a chest tube for volume decompression prior to CT were not excluded. Additionally, hospital stay time, final diagnosis, and presence of death until April 2021 were extracted from patient records.

Intended Sample Size
We calculated the sample size with an estimated empyema prevalence of 10% in parapneumonic effusions (power = 0.8; p < 0.5; H0: 0.7; H1: 0.9), with a minimum total number of 310 patients for sensitivity (min. empyema: 31) and 34 for specificity (min. empyema: 3). A total of 335 patients with pathological correlation could be identified in the hospital database for the study period between January 2016 and January 2021, and we decided to include the entire consecutive cohort in the study.

CT and Acquisition
Scans were acquired using three different CT scanners: Somatom Definition Flash (n = 95, 2 × 128-slice system), Somatom Definition AS+ (n = 182, 128-slice system), and Somatom Definition Edge (n = 58, 128-slice system; all scanners: Siemens Healthineers, Erlangen, Germany). The peak kilovoltage was 120 kVp and an automatic tube current modulation was performed. A contrast agent was administered in 208 of the 335 CT studies, with routine flow rates of between 2 and 4 mL/s (84 biphasic). Soft tissue kernels (30f), with 1 mm acquisition and 5 mm reconstructions in the coronal, axial, and sagittal planes, as well as 0.7-1 mm lung kernels (70f) were used for image interpretation on EIZO RX350 (EIZO, Ishikawa, Japan) diagnostic monitors.

Pleural CT Features
The empyema-associated pleural CT features described in the literature are pleural thickening, pleural enhancement, microbubbles, extrapleural fat stranding, and loculation. Figure 1 shows an example of the typical features of an empyema.
Figure 1. [...] and visceral (lung) pleura, consistent with a "split pleura sign" associated with pleural thickening (red dash). Pleural fat stranding (bold green arrows, compared to the normal contralateral side, thin green arrows) and microbubbles (empty arrows) are also present. Pleural empyema on the right side is loculated (green *) in contrast to the simple pleural effusion on the contralateral side. There is reactive hilar and mediastinal lymphadenopathy (blue arrows).

Radiological Report-Based CT Feature Extraction
Text-based, anatomically structured radiology reports, blinded by definition to the reference standard, were prospectively generated in consensus by a radiology resident and a board-certified specialist. The radiological diagnosis was routinely made based on image findings and knowledge of the clinical information. R.S. (4th post-graduate year, PGY; for details see Appendix A) extracted pleural CT features and test results for pleural empyema.

Prespecified CT-Based Feature Extraction
All CTs were interpreted independently by R.S. (4 PGY) and N.S. (10 PGY) concerning the following pleural CT features. The interpretation was blinded to radiological reports, clinical information, and pathological diagnosis.
Based on the literature, the aforementioned pleural CT features were divided into the following groups: Pleural thickening was defined as a visible pleural line and classified based on location (circumferential, lung-, rib-, mediastinal involvement) and morphology (smooth, nodular (>2 mm, round), or pleural mass (>3 cm)).
Visible pleural enhancement was also scored as pleural thickening. Thus, the descriptors for location and morphology also apply to any pleural enhancement present. Visible pleural thickening without visible enhancement was not considered enhancement. Pleural enhancement was divided into split pleura sign (visible enhancement of both visceral and parietal pleura) and hemisplit pleura sign (visible enhancement of either visceral or parietal pleura).
Microbubbles were defined as gas surrounded by pleural fluid. The extrapleural fat stranding was defined as having higher HU values compared to the contralateral side and the subcutaneous fat tissue.
According to Tsujimoto et al., a cutoff of 3 cm (the maximum measured distance on an axial slice) was used for the amount of pleural effusion [9]. Rib destruction was defined as osteolysis adjacent to pleural effusion. Mediastinal/hilar lymphadenopathy was defined by a short axis > 1 cm.
After evaluation of the interrater agreement, the non-consensus was resolved by J.B. (29 PGY).
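The interrater agreement between the two independent readers can be quantified per feature with Cohen's kappa. The following is a minimal sketch of that statistic, not the study's R code; the function name `cohens_kappa` and the toy ratings are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of cases on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy ratings (1 = feature present, 0 = absent), not study data.
reader_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the readers agree on 8 of 10 cases, while 52% agreement would be expected by chance, so kappa is lower than the raw agreement rate.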

Reference Standard
As the reference standard, we used the final pathology report within 7 days of the CT. Reports describing macroscopic pus or fibropurulent changes were rated as positive for empyema according to the literature [10][11][12]. Additionally, macroscopic or microscopic pleural tumor manifestations were defined as pleural carcinomatosis. Clinical information and index test results were available via the hospital information system.

Possible Applications
We used an nnU-Net architecture [13] for 3D pleural effusion segmentation of the dataset. To evaluate the extent to which radiomic features could be suitable for predicting an empyema, we used Python (version 3.7) and the PyRadiomics package for feature extraction [14]. In the preselection phase, we selected the 50 features with the highest importance among all the extracted radiomic features. Then a random forest model with bootstrap sampling and 100 decision trees was trained using leave-one-out cross-validation, balanced 1:1 regarding pathologically confirmed empyema (n = 81) and randomly selected negatives. Finally, the model was applied to all (n = 335) cases to evaluate prediction performance.
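The sampling scheme described above (1:1 balancing followed by leave-one-out cross-validation) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the helper names `balanced_subset` and `leave_one_out`, the fixed seed, and the placeholder case order are hypothetical, and feature preselection and random forest fitting are deliberately left out.

```python
import random


def balanced_subset(labels, seed=0):
    """Return sorted case indices: all positives plus equally many random negatives."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(negatives) < len(positives):
        raise ValueError("not enough negatives for 1:1 balancing")
    rng = random.Random(seed)  # seed is an assumption, for reproducibility only
    return sorted(positives + rng.sample(negatives, len(positives)))


def leave_one_out(indices):
    """Yield (train_indices, held_out_index) pairs, one per case."""
    for k, held_out in enumerate(indices):
        yield indices[:k] + indices[k + 1:], held_out


# Cohort sizes from the text: 81 empyema cases among 335 patients.
labels = [1] * 81 + [0] * 254  # case order is a placeholder, not study data
subset = balanced_subset(labels)
folds = list(leave_one_out(subset))
print(len(subset), "balanced cases,", len(folds), "leave-one-out folds")
```

Each fold trains on 161 cases and holds out one. Whether feature preselection was repeated inside each fold is not stated in the text, so that step is not modeled here.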

Analysis
To test for normal distribution, we used the Kolmogorov-Smirnov test (e.g., patients' age). We used a t-test to compare the differences in patients' ages between the positive and negative collectives. Interrater variability was assessed with Cohen's kappa, and disagreements were resolved by consensus with a third rater (J.B.). Test results were organized into 2 × 2 contingency tables, displaying true positives, true negatives, false positives, and false negatives. Pearson's chi-squared test (with Cramér's V), sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), accuracy, diagnostic odds ratio (DOR), and area under the curve (AUC), as well as 95% confidence intervals, were calculated for each pleural CT feature. We used the Mann-Whitney U test for the analysis of length of hospital stay and performed Kaplan-Meier analysis for survival time. All statistical analyses were performed with R 4.0.5 (R Core Team, Vienna, Austria).
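The accuracy measures derived from a 2 × 2 contingency table can be illustrated with a short sketch. The function name `diagnostic_metrics` is hypothetical, and the counts are an assumption: they are back-calculated from the reported sensitivity and specificity of the report-based empyema diagnosis (59/81 and 231/254) and are not taken from a published table. Confidence intervals are omitted because the interval method is not stated.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from a 2 x 2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # The diagnostic odds ratio is undefined when fp or fn is zero.
        "dor": (tp * tn) / (fp * fn),
    }


# Counts back-calculated from the reported percentages (an assumption):
# 59/81 = 72.84% sensitivity, 231/254 = 90.94% specificity.
report = diagnostic_metrics(tp=59, fp=23, fn=22, tn=231)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the NPV evaluates to 231/253, which agrees with the 91.30% reported in the Discussion.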

Study Population
A total of 2659 eligible patients were identified with pleural effusions between 01/2016 and 01/2021. Of these, 335 patients had pathology workup within 7 days of computed tomography regardless of effusion cause or underlying disease (see Figure 2) and their results are available for download [15]. Of the 335 patients included, 125 were female (37.3%). The mean age was 68.6 years (95% CI: 67.0-70.3, median: 71, range: 18-96). The primary etiologies (see Figure 3) of pleural effusion were empyemas (n = 81), pleural malignancy (n = 60, with malignant cells) and others (n = 194). Other leading causes of pleural effusion were pneumonia (n = 52), acute or chronic heart failure with pulmonary congestion (n = 50), and trauma (n = 18). The most common malignancy with associated pleural effusion was lung cancer. Pleural carcinomatosis was confirmed by pathology in 34 patients. In 20 cases, the etiology of pleural effusion remained unclear. Pathology diagnoses were based either on intra-operative samples (n = 61), biopsies (n = 42), or fine needle aspiration or thoracentesis (n = 231). A total of 82 patients with empyema were identified. In 14 empyema cases, malignant cells were additionally detected in the pathological specimen with known underlying malignancy.
The patients with empyema were slightly younger (mean age 64.4 versus 70.0, t = 2.87, p = 0.004). In the subset of patients with empyema, 33.8% were women versus 38% in the subset without empyema. Contrast medium was administered in 79% of the cases.

Possible Applications
In addition to the CT datasets, the nnU-Net-based segmentation masks were published as well and are freely available [15]. Figure 5B shows a corresponding example, with higher density values depicted within the segmentation mask. This suggests that density-based classification approaches could be useful. The random forest model based on radiomics features achieved a sensitivity of 77.78% (95% CI: 66.91-85.96) and a specificity of 74.41% (95% CI: 68.50-79.57) for the prediction of pleural empyema.

Discussion
The sensitivity and specificity of CT to diagnose an empyema in clinical practice are 72.84% and 90.94%, respectively. Interrater agreement was moderate to almost perfect. Sensitivities were 70.37% for pleural thickening, 78.13% for pleural enhancement, 46.91% for fat stranding, and 80.24% for loculation, with corresponding specificities of 85.04%, 90.97%, 90.94% and 78.74%, respectively. The automated detection of pleural empyema achieved an AUC of 0.80.
Since the nomenclature is heterogeneous, we have attempted to use clear definitions for pleural findings based on published studies. While Jimenez et al. and Leung et al. described the anatomic location (e.g., visceral, parietal) of pleural thickening, Tsujimoto et al. [9] used the term "split pleura" sign for visceral and parietal pleural thickening regardless of contrast media, whereas Porcel et al. [18] retained the "split pleura" as a threshold for pleural enhancement. As we understand every visible pleural enhancement as visible pleural thickening, we reserved "enhancement" and "split and hemisplit pleura sign" for contrast-enhanced CTs and added a more detailed anatomic description for pleural thickening. Our literature-based definitions might lead to a more standardized reporting nomenclature.
Additionally, this is the first study to evaluate the imaging-based diagnosis of empyema based on prospectively gathered reports. We found a high negative predictive value (NPV) of 91.30%, which is comparable to the NPV of CT-based diagnoses of COVID-19 pneumonia [25].
Pleural effusions are known negative outcome predictors in various disease entities, including in the ongoing COVID pandemic [26], and show high one-year mortality rates in both non-malignant (25%-57% [27]) and malignant diseases (e.g., 77% [28]). This is consistent with our results. As pleural carcinomatosis has a poor prognosis [29], the correlation of mortality with pleural findings associated with its radiological manifestation is not surprising. With improvements in patient management and adequate treatment of acute pleural diseases such as empyema, mortality rates have been reduced. However, pleural diseases still lead to longer hospitalizations, which might be shortened by early detection.
With this first radiomics-based study, we have shown that empyema is predictable from CT (AUC of 0.80) and that the translation of known CT features based on 3D segmentation might be reasonable for AI algorithms. As for other chest pathologies [30], tools for risk calculation and outcome prediction are promising.
A limitation of the study is that eligible patients were retrospectively selected on scanners of one vendor at a single institution. A second limitation is that the reference standard was only applied if clinically indicated, hence the empyema prevalence might be higher than expected. Third, in addition to avoiding inappropriate exclusion, patients with chest tubes in situ were also included, which increased the prevalence of iatrogenic microbubbles. However, in contrast to a general tendency to exclude these patients, this better reflects the routine clinical setting. According to the current guidelines [31], only CT was used as a reference test in the current study; nevertheless, it might be worthwhile to investigate the diagnostic accuracy of other modalities such as FDG-PET in the future, which has already proved to be useful for other pleural diseases [32].
With the development of AI-based algorithms for disease detection and classification, outcome evaluation on diagnostic images might become increasingly relevant. We showed that known radiological descriptors vary in their potential for prognostication and can be used as a benchmark for automated tools.

Conclusions
This study serves as an update of previous diagnostic accuracy studies in terms of developments in biomedical engineering, and the results can contribute to more structured reporting. With an AUC of 0.84, the radiological diagnosis of empyema can help to identify patients with longer hospital stays. We hope that the openly available, anonymized CT data, the consensus-based CT features and the pathological and outcome data [26] can be used as a baseline for further AI research.

Informed Consent Statement:
All patients were anonymized. Patients who did not consent were excluded.

Data Availability Statement:
The data presented in this study are openly available in zenodo.org at https://doi.org/10.5281/zenodo.5793365, reference number: 5793366.