Multi-Stage Harmonization for Robust AI across Breast MR Databases

Simple Summary Batch harmonization of radiomic features extracted from magnetic resonance images of breast lesions from two databases was applied to an artificial intelligence/machine learning classification workflow. Training and independent test sets from the two databases, as well as the combination of them, were used in pre-harmonization and post-harmonization forms to investigate the generalizability of performance in the task of distinguishing between malignant and benign lesions. Most training and independent test scenarios were statistically equivalent, demonstrating that batch harmonization with feature selection harmonization can potentially develop generalizable classification models. Abstract Radiomic features extracted from medical images may demonstrate a batch effect when cases come from different sources. We investigated classification performance using training and independent test sets drawn from two sources using both pre-harmonization and post-harmonization features. In this retrospective study, a database of thirty-two radiomic features, extracted from DCE-MR images of breast lesions after fuzzy c-means segmentation, was collected. There were 944 unique lesions in Database A (208 benign lesions, 736 cancers) and 1986 unique lesions in Database B (481 benign lesions, 1505 cancers). The lesions from each database were divided by year of image acquisition into training and independent test sets, separately by database and in combination. ComBat batch harmonization was conducted on the combined training set to minimize the batch effect on eligible features by database. The empirical Bayes estimates from the feature harmonization were applied to the eligible features of the combined independent test set. The training sets (A, B, and combined) were then used in training linear discriminant analysis classifiers after stepwise feature selection. The classifiers were then run on the A, B, and combined independent test sets. Classification performance was compared using pre-harmonization features to post-harmonization features, including their corresponding feature selection, evaluated using the area under the receiver operating characteristic curve (AUC) as the figure of merit. Four out of five training and independent test scenarios demonstrated statistically equivalent classification performance when compared pre- and post-harmonization. These results demonstrate that translation of machine learning techniques with batch data harmonization can potentially yield generalizable models that maintain classification performance.


Introduction
For a given medical imaging protocol, differences in the resulting medical images and the values extracted from them can arise when they are collected in different contexts. For example, different manufacturers can implement a protocol that is nominally the Cancers 2021, 13, 4809. https://doi.org/10.3390/cancers13194809 https://www.mdpi.com/journal/cancers same but then additionally apply proprietary algorithms which may affect the exported data, or some factors in the imaging protocol may be systematically different between two groups, such as magnetic resonance imaging at different clinical field strengths. In general, medical imaging conducted in different contexts may demonstrate several types of differences, deriving from factors such as patient biology, screening protocols, and imaging protocols [1]. Many artificial intelligence/computer-aided diagnosis (AI/CADx) models for diagnosis and prognosis of disease make use of medical images that are acquired within a single institution. This can potentially reduce some differences in factors, however, there is substantial interest in combining datasets to form potentially more generalizable models through the use of images from multiple institutions.
Harmonization is an area of investigation in AI that seeks the compatibility of data acquired from different contexts or sources. The concept of harmonization can have several different levels of application in medical imaging, such as at image acquisition, in which protocols are controlled to be implemented the same way in different contexts, or in the post-processing of acquired images to normalize them between two sources of data.
One level of combining datasets from different sources involves the batch harmonization of extracted features. Batch harmonization methods seek to reconcile what are termed batch effects in the field of genomics, the observation that data drawn from otherwise identical samples measured at different times can show some differences in values [2]. Harmonization can be applied to sets of radiomic features with the underlying assumption that the acquisition of features from images acquired with nominally similar protocols but from different sources may demonstrate these batch effects. In the context of AI/CADx, different databases can be considered as different batches. The harmonization method ComBat [2] has previously been used in a single stage on all cases in each database of radiomic features extracted from full-field digital mammography images [3], positron emission tomography images of breast cancer [4], dynamic contrast-enhanced magnetic resonance (DCE-MR) imaging of breast cancer [5] (Figure 1a), computed tomography images for lung cancer and phantom imaging [6,7], MR imaging of soft tissue sarcomas [8], measurements of diffusion tensor imaging [9] and cortical thickness [10] in the brain, and FLAIR and T1-weighted imaging of the brain and T2-weighted imaging of the prostate [11]. The application of ComBat harmonization to radiomic features of all cases in each database allows for the use of covariates, such that the characteristics of the batches, for example lesion type, are retained. These works and others have demonstrated that ComBat can potentially be useful when the goal is to harmonize, in one stage, entire datasets resulting from combined batches. A recent work summarizes the current status of investigations into harmonization of radiomic features [12].
A related potential application for batch harmonization is its application to an AI/CADx workflow involving training classifiers on a training set and applying them to a completely separate, independent test set. In such a scenario, batch harmonization would be conducted first on the training set alone, with the parameters from the harmonization subsequently applied to the test set, i.e., the data in the test set do not influence the harmonization parameters. Applying harmonization in this way can have potential impacts not relevant when batch harmonization is applied to all cases in each database. For example, batch harmonization may affect feature selection, i.e., the features selected for combined sets of features from different batches may be different by using pre-harmonization features compared with those selected by using post-harmonization features. This may in turn affect classification of cases in the independent test set. Thus, the use of batch harmonization on a training set and then application of parameters to an independent test set requires an additional stage of feature selection harmonization, i.e., the features used in the test set must be the same as those selected on the training set ( Figure 1b). Additionally, because theoretically such a workflow would use unlabeled test cases, covariates (such as, which cases are malignant or benign) cannot be used. The purpose of this study was to assess the performance of AI/CADx of breast cancer in lesions imaged with DCE-MR on two databases using extracted human-engineered radiomic features, batch harmonization, feature selection harmonization, and classification. In this study, the two databases were considered to be the two batches for the purposes of batch harmonization. First, we investigated the effects of batch harmonization of features in the training set that combined the two batches (i.e., the two training datasets) and the application of the associated batch harmonization parameters to the features in the independent test sets. Next, we conducted feature selection and classification using the training and independent test set paradigms on the two databases separately and on combined versions of them, in both their pre-harmonization and post-harmonization forms. We hypothesized that an independent training/test AI/CADx lesion classification pipeline that both harmonizes features by databases and allows feature selection to change after batch harmonization (that is, feature selection harmonization) will result in a predictive model that is generalizable across the two databases, i.e., multi-stage harmonization maintains classification performance.

Database
The study was performed retrospectively under IRB/HIPAA protocol, which waived requirement of informed consent. DCE-MR images of breast lesions from females were The purpose of this study was to assess the performance of AI/CADx of breast cancer in lesions imaged with DCE-MR on two databases using extracted human-engineered radiomic features, batch harmonization, feature selection harmonization, and classification. In this study, the two databases were considered to be the two batches for the purposes of batch harmonization. First, we investigated the effects of batch harmonization of features in the training set that combined the two batches (i.e., the two training datasets) and the application of the associated batch harmonization parameters to the features in the independent test sets. Next, we conducted feature selection and classification using the training and independent test set paradigms on the two databases separately and on combined versions of them, in both their pre-harmonization and post-harmonization forms. We hypothesized that an independent training/test AI/CADx lesion classification pipeline that both harmonizes features by databases and allows feature selection to change after batch harmonization (that is, feature selection harmonization) will result in a predictive model that is generalizable across the two databases, i.e., multi-stage harmonization maintains classification performance.

Database
The study was performed retrospectively under IRB/HIPAA protocol, which waived requirement of informed consent. DCE-MR images of breast lesions from females were collected from two medical centers. These databases were termed Database A and Database B in this study. Lesions in Database A had been imaged during the period of 2005-2017 while the lesions imaged in Database B had been imaged during the period of 2015-2017.
The lesions had been imaged using T1-weighted gradient spoiled sequences in use at the medical centers. Most lesions in Database A had been imaged using Philips Intera scanners, except for three benign lesions which had been imaged using GE scanners, and five and three cancerous lesions which had been imaged using GE and Siemens scanners, respectively. All lesions in Database B had been imaged using GE Discovery 750 scanners. Images from Database A had been acquired in the axial plane and images from Database B were acquired in the sagittal plane. The details of the imaging protocols are available elsewhere [5,[13][14][15]. Information regarding subtypes of benign lesions and cancers had been collected from pathology and imaging reports.
The training and test sets of Database A and Database B were determined based solely on year of image acquisition ( Table 1, Figures 2 and 3). These training and independent test sets from each database were also combined, respectively, yielding a combined training set and a combined test set. Thus, there were three separate training sets:

Lesion Segmentation and Feature Extraction
Lesions had previously been segmented on the MRIs using a computerized fuzzy Cmeans method [16]. In addition, thirty-two radiomic features had already been automatically extracted subsequently after lesion segmentation. The features were made up of five phenotypic feature categories: size, shape, morphology, enhancement texture, and kinetic curve assessment [17][18][19]. These techniques and performance levels had previously been reported.

Harmonization
The batch harmonization method used in this work, Combat [3], models feature data across two batches as (Equation (1)): where is the average value for feature g, X is a design matrix for the covariates of interest, is the vector of regression coefficients corresponding to each covariate, γ is the additive effect of group i on feature g, is a multiplicative group effect, and is an error term for each sample j. The data are standardized according to (Equation (2)): where and are estimators of and , respectively. The variables γ and are then estimated using empirical Bayes estimates. Then, each feature is transformed using the expression (Equation (3)): The ComBat harmonization method was applied to radiomic features extracted from lesions in the training set made of the combined databases, i.e., (A + B)tr. In that process, the empirical Bayes estimates needed to shrink the batch effect parameter estimates of mean and variance were identified. Subsequently, those parametric Bayes estimates ( * and * ) and the factors and from the training set were then used to harmonize

Lesion Segmentation and Feature Extraction
Lesions had previously been segmented on the MRIs using a computerized fuzzy C-means method [16]. In addition, thirty-two radiomic features had already been automatically extracted subsequently after lesion segmentation. The features were made up of five phenotypic feature categories: size, shape, morphology, enhancement texture, and kinetic curve assessment [17][18][19]. These techniques and performance levels had previously been reported.

Harmonization
The batch harmonization method used in this work, Combat [3], models feature data across two batches as (Equation (1)): where α g is the average value for feature g, X is a design matrix for the covariates of interest, β g is the vector of regression coefficients corresponding to each covariate, γ ig is the additive effect of group i on feature g, δ ig is a multiplicative group effect, and ε ijg is an error term for each sample j. The data are standardized according to (Equation (2)): whereα andβ are estimators of α and β, respectively. The variables γ ig and δ ig are then estimated using empirical Bayes estimates. Then, each feature is transformed using the expression (Equation (3)): The ComBat harmonization method was applied to radiomic features extracted from lesions in the training set made of the combined databases, i.e., (A + B) tr . In that process, the empirical Bayes estimates needed to shrink the batch effect parameter estimates of mean and variance were identified. Subsequently, those parametric Bayes estimates (γ * ig andδ * ig ) and the factorsα g andσ g from the training set were then used to harmo-nize the eligible features in the independent test set made of the combined database, i.e., (A + B) te . Batch harmonization determines harmonization factors from all features used in harmonization, so ComBat harmonization was conducted by feature category on collections of features deemed eligible for harmonization, namely, morphology, texture, and most kinetic curve features. That is, harmonization was conducted for morphology features, then for texture features, and then for most kinetic curve features. Kinetic curve features with semi-categorical variables (washout rate and curve shape index) did not undergo harmonization, nor did the kinetic curve feature of volume of most enhancing voxels ( Table 2). It is important to note that since the predictive modeling process would theoretically involve unlabeled test cases if used in a clinical workflow, no covariates were used in the harmonization. The complete compilation of the full feature sets included both harmonized features and the features that did not undergo harmonization, i.e., those that remained as is in the feature sets.
After harmonization, the training set comprised of both batches, (A + B) tr , was separated into its pre-harmonization and post-harmonization separate database forms (A tr and B tr ). After harmonization parameters from (A + B) tr were applied to (A + B) te , (A + B) te was separated into its pre-harmonization and post-harmonization separate database forms (A te and B te ).
As an intermediate step, the visualization of feature harmonization on the training and independent test sets was investigated using t-distributed stochastic neighbor embedding (tSNE) [20,21].

Feature Selection
To enable classifier training, stepwise feature selection was conducted on the three training sets in their pre-harmonization and post-harmonization forms, to yield six sets of selected features, i.e., three with pre-harmonization selected features for A tr , B tr , and (A + B) tr , and three with post-harmonization selected features for A tr , B tr , and (A + B) tr .

Lesion Classification
Linear discriminant analysis (LDA) was used as the classifier to yield an estimate of the posterior probability of malignancy (PM) for each lesion. The determination of the LDA weights was computed using the training sets for ultimate use in the independent evaluation on the test sets ( Figure 4). That is, six sets of LDA weights were calculated corresponding to the three pre-harmonization training sets of A tr , B tr , and (A + B) tr , and to the three post-harmonization training sets of A tr , B tr , and (A + B) tr .
The area under the receiver operating characteristic (ROC) curve (AUC) [22] for the task of classification of lesions as benign or malignant using the proper binormal model [23] served as the figure of merit with the PM as the input. Note that ROC analysis was conducted on each of the training/test scenarios in Figure 4.

Classification Performance Comparison on the Test Sets
Classification performance was compared using superiority testing for each of the combinations of training and testing shown in Figure 4 under both pre-harmonization and post-harmonization conditions. The Bonferroni correction of p-value for multiple comparisons [24] was utilized for each of the test sets, i.e., A te and B te . Thus, in these cases the difference in AUC was deemed to be statistically significant if p < 0.025 (i.e., p < 0.05/2 comparisons). p-value correction was not necessary for the independent test set made up of lesions from both countries (A + B) te since it was evaluated only once, so the difference in AUC was statistically significant if p < 0.05. Equivalence margins for conditions of similarity testing (equivalence and if necessary, non-inferiority) were identified a posteriori when a result failed to demonstrate significant difference [25], because the equivalence margin had not been identified for this classification task.

Feature Selection
Only when using the combined dataset under pre-harmonization and post-harmonization conditions did feature selection demonstrate some differences in the selected features ( Figure 6). This contrasted with no change in selected features when using either Dataset A or Dataset B between the pre-harmonization and post-harmonization features. Notably, most of the features that exhibited a change in selection or non-selection were texture features. Texture features T3, T7, T8, and T11 were not selected post-harmonization but were selected pre-harmonization. Texture feature T9 was selected post-harmonization after having not been selected pre-harmonization.  Table 2. Squares outlined with a dashed line indicate those for which harmonization was conducted.

Feature Selection
Only when using the combined dataset under pre-harmonization and post-harmonization conditions did feature selection demonstrate some differences in the selected features ( Figure 6). This contrasted with no change in selected features when using either Dataset A or Dataset B between the pre-harmonization and post-harmonization features. Notably, most of the features that exhibited a change in selection or non-selection were texture features. Texture features T3, T7, T8, and T11 were not selected post-harmonization but were selected pre-harmonization. Texture feature T9 was selected post-harmonization after having not been selected pre-harmonization.

Feature Selection
Only when using the combined dataset under pre-harmonization and post-harmonization conditions did feature selection demonstrate some differences in the selected features ( Figure 6). This contrasted with no change in selected features when using either Dataset A or Dataset B between the pre-harmonization and post-harmonization features. Notably, most of the features that exhibited a change in selection or non-selection were texture features. Texture features T3, T7, T8, and T11 were not selected post-harmonization but were selected pre-harmonization. Texture feature T9 was selected post-harmonization after having not been selected pre-harmonization.  Table 2. Squares outlined with a dashed line indicate those for which harmonization was conducted.  Table 2. Squares outlined with a dashed line indicate those for which harmonization was conducted.

Lesion Classification Performance and Comparison
As expected, within either Database A or Database B, classification performance (AUC: median, [95% CI]) was unchanged (within numerical calculation limits) when training and independent testing was performed on both pre-harmonization and post-harmonization feature sets ( Classification performance using features harmonized across the two databases demonstrated statistically significant difference when training of a classifier was conducted using lesions imaged in Database A and testing was conducted on lesions imaged in Database B, but not vice versa (Figure 7, Table 3).
In contrast, there was no statistical evidence for difference in classification performance when training was conducted using combinations of features from both databases and independent testing was conducted using features from one dataset or the combination of them (i.e., (A + B) te ), when comparing using pre-harmonization features to using postharmonization features (Figure 8, Table 3).
FOR PEER REVIEW 10 of 16
Classification performance using features harmonized across the two databases demonstrated statistically significant difference when training of a classifier was conducted using lesions imaged in Database A and testing was conducted on lesions imaged in Database B, but not vice versa (Figure 7, Table 3).
In contrast, there was no statistical evidence for difference in classification performance when training was conducted using combinations of features from both databases and independent testing was conducted using features from one dataset or the combination of them (i.e., (A + B)te), when comparing using pre-harmonization features to using post-harmonization features (Figure 8, Table 3).
Classification performance using features harmonized across the two databases demonstrated statistically significant difference when training of a classifier was conducted using lesions imaged in Database A and testing was conducted on lesions imaged in Database B, but not vice versa (Figure 7, Table 3).
In contrast, there was no statistical evidence for difference in classification performance when training was conducted using combinations of features from both databases and independent testing was conducted using features from one dataset or the combination of them (i.e., (A + B)te), when comparing using pre-harmonization features to using post-harmonization features (Figure 8, Table 3).

Discussion
The work presented here appears to be the first to demonstrate batch harmonization of radiomic features extracted from magnetic resonance images conducted in a training and independent test workflow and to investigate the impact of harmonization on classification performance in the context of both feature harmonization and feature selection harmonization. Statistically significant improvement in classification compared using pre-and post-harmonization features was observed in only one of the training and test combinations (A tr , B te ). This may be due to the fact that feature harmonization overcomes the otherwise limiting number of features selected in A tr. The AI/CADx framework used in this study utilized both feature harmonization and feature selection harmonization, demonstrating that implementing these in the training and independent test framework maintains the generalizability of the predictive model.
Batch harmonization via the ComBat method is being extensively utilized to harmonize entire databases of features in a single stage, as noted in the references mentioned above and in other recent works [26][27][28][29][30], but this restricts the application of that form of batch harmonization to cross-validation. In addition to the application of ComBat in its original form, variations are beginning to be investigated, such as M-Combat, which identifies one batch as the reference and harmonizes features in the second to the reference [31,32]. Our work described here is one of a few at this time that investigate the use of harmonization across batches (i.e., not to a reference set) in a completely separate training and test machine learning framework. Luo et al. used a separate training and test framework for various harmonization methods, including ComBat, in their study on microarray gene expression data [33]. The Matthews correlation coefficient (MCC) [34], a variation on the Pearson correlation coefficient, served as the figure of merit in the study, and a given batch harmonization method was determined to be better than no batch effect removal method if the difference in the MCC was greater than a pre-determined threshold. In that study, the ComBat method demonstrated utility in batch harmonization, but the conclusions were limited by the lack of statistical method to determine significance. Pszczolkowski et al. [35] used ComBat on separate training and test sets for a study that compared the use of radiomic features extracted from computed tomography images of the brain to using radiological signs (the blend sign, black hole sign, hypodensities, and island signs) and clinical factors to predict hematoma expansion and functional outcome with acute intracerebral hemorrhage. The authors demonstrated the effect of harmonization on feature distributions using t-SNE figures, resulting in a similar reduction in clustering across batches as presented here, but they did not report investigating the impact of Com-Bat in this way on classification using radiomic features alone, compared with not using ComBat. Radua et al. reported preparing software for the separate training and test framework [36] but their study did not implement it. Da-ano et al. [37] applied four variations of ComBat harmonization-the original version, M-Combat, B-Combat (a version that implements bootstrapping) and BM-Combat (a version that combines single reference and bootstrapping)-to radiomic features extracted from multiparametric magnetic resonance images (T2-weighted, apparent diffusion coefficient, DCE) and from positron emission tomography images of locally advanced cervical cancer for the prediction of local failure in 189 subjects. Their study compared the effect of harmonization on the entire set of features to harmonizing on a training set and applying it to a separate test set. Their results did not demonstrate statistically significant different performance in classification between these two scenarios, a finding which they believed to be favorable for the use of ComBat in a separate training and test framework. That study used only γ ig and δ ig as the parameters determined by the training set and applied to the test set. The two other previous studies which describe using ComBat harmonization in completely separate training and test frameworks [33,35] do not specify which parameters they applied to the test set. In our study, we applied four parameters (γ ig and δ ig as well asα g andσ g ) to the test set. We also studied applying γ ig and δ ig alone to the test set and found no statistical change in classification performance. This may be due to the relatively homogeneous nature of our database across the training and test sets, but it may be important to apply harmonization using the four parameters so that the determination of the harmonization parameters is completely separate from the test set, including those used for the standardization of test set features (as described in Equation (2)).
AI methods in MRI have been used to support decision making in breast cancer diagnosis and prognosis for several decades [38,39]. As the scope and sophistication of AI in medical imaging grows, some have noted that it would be beneficial to implement specific data science curriculum for radiology trainees [40,41]. Education in harmonization of data or feature selection could potentially be one element of such training, but at this time, the harmonization methods described here are in development and have not yet been approved for clinical use.
There were several limitations to this study. For example, we did not investigate correlation with imaging protocol (such as field strength of imaging or image resolution) or patient biology. Note that our results are limited to the clinical task of classification of breast lesions as malignant or benign. Other studies have described statistically significant improvement in classification when using harmonization on entire databases and covariates (e.g., [4,5]), but our study did not include covariates due to its training and independent test design. Additionally, these results may be specific to the types of features used in this study, especially since some features were determined to not be eligible for harmonization. We elected to investigate these features for classification since they have been utilized in several studies for this particular classification question [42,43].
MRI is being investigated as a screening method for patients with increased lifetime risk of breast cancer, history of chest or mantle radiation therapy, history of breast cancer diagnosis and dense breasts, and previous diagnosis of breast cancer before age 50 [44]. However, our work did not study correlations across different screening protocols. Note that the primary goal of our method in eventual clinical use would be to provide decisionmaking support to radiologists. However, further validation and regulatory clearance are necessary prior to clinical implementation. In addition, our databases did not include male subjects. The incidence of breast cancer in males is approximately 1% in the United States [45]. Some case reports have indicated that MR imaging of breast cancers in males can provide decision-making support when initial imaging by mammography and ultrasound imaging is equivocal [46,47]. Future studies will investigate the inclusion of male breast cancer cases in training and/or test sets.
Our study did not investigate the performance of batch harmonization on lesion classification when more than two batches were used. It would be interesting to investigate the classification performance of a third batch of lesions in the testing capacity when they, as was true for our test sets, have no influence on the determination of the harmonization parameters, but they are also not present in the training sets. However, the choices of study design in this work may helpfully demonstrate classification performance in the context of limited scope of training and independent testing frameworks and provide a foundation for further studies.
Finally, while the classification performances in the test sets were maintained when compared pre-and post-harmonization, the ranking of individual cases within the test set could have changed (i.e., the posterior probability of malignancy of individual cases changed). This may have implications for the case-based repeatability of individual cases [48][49][50][51][52] and will be a topic of study in the future.

Conclusions
Classification performance using pre-and post-harmonization radiomic features extracted from DCE-MR images of the breast, in the task of classification of lesions as malignant or benign, has been demonstrated in computer-aided diagnosis using two training and independent test sets. The results suggest that both batch harmonization and its impact on feature selection may not provide statistically significant improvement in classification in training and independent test AI/ML workflows, but rather serve best to preserve classification performance for fixed feature selection. These findings demonstrate relevant considerations when designing methods of classification using machine learning for predictive models applied to differently acquired test sets, especially when attempting to combine datasets collected from two different sources.