An Ensemble of Deep Learning Object Detection Models for Anatomical and Pathological Regions in Brain MRI

This paper proposes ensemble strategies for deep learning object detection models, combining the variants of a single model as well as different models, to enhance anatomical and pathological object detection performance in brain MRI. In this study, with the help of the novel Gazi Brains 2020 dataset, five anatomical parts and one pathological part observable in brain MRI were identified: the brain region of interest, eyes, optic nerves, lateral ventricles, third ventricle, and the whole tumor. Firstly, a comprehensive benchmark of nine state-of-the-art object detection models was carried out to determine their capabilities in detecting the anatomical and pathological parts. Then, four different ensemble strategies for the nine object detectors were applied to boost detection performance using a bounding box fusion technique. The ensemble of individual model variants increased anatomical and pathological object detection performance by up to 10% in terms of mean average precision (mAP). Moreover, considering the class-based average precision (AP) of the anatomical parts, an improvement of up to 18% AP was achieved. Similarly, the ensemble strategy combining the best of the different models outperformed the best individual model by 3.3% mAP. Additionally, up to 7% better FAUC, the area under the TPR vs. FPPI curve, was achieved on the Gazi Brains 2020 dataset, and a 2% better FAUC score was obtained on the BraTS 2020 dataset. The proposed ensemble strategies proved much more efficient at finding anatomical and pathological parts represented by a small number of objects, such as the optic nerve and third ventricle, and at producing higher TPR values, especially at low FPPI values, compared to the best individual methods.


Introduction
Imaging technologies are frequently used for disease detection and evaluation in the medical domain to overcome the uncertainty and limited detectability of human perception. Different imaging methods such as magnetic resonance imaging (MRI), positron emission tomography (PET), single-photon emission computed tomography (SPECT), ultrasound, computed tomography (CT), and X-ray are actively used for medical analysis. MRI is very frequently used in the quantitative assessment of brain-related disorders such as Alzheimer's disease, epilepsy, schizophrenia, multiple sclerosis, and brain cancer [1]. The anatomical examination of the brain and determination of the anatomical regions is an important research area since conditions such as deformity in the anatomical parts can be a symptom of a disease. For this reason, detecting anatomical regions is frequently used in subjects such as anatomical development, computer-assisted detection, computer-assisted diagnosis, and patient follow-up [2].
Brain-specific anatomical studies are carried out to reveal many different structural and functional properties: extraction of the brain atlas [3], clipping areas that do not belong to the brain, such as the eye and skull, that are visible in MRI [4,5], revealing degenerations in anatomical parts [6], quantitative analysis of diseases [6], and estimation of brain age [7]. Unlike natural images, however, medical images are subject to low contrast, noise, and other artifacts. Therefore, object detection models may result in poor performance when applied directly to the medical field.
Ensemble techniques are among the methods used to improve performance in brain MRI studies for several tasks, as in many other fields. According to the current ensemble studies summarized in Table 1, a model ensemble can be achieved through two different techniques, namely the feature-level ensemble and combining model outputs. In the feature-level ensemble technique, the features of various inputs are combined during model training, while in the output-combining technique, the outputs of multiple models are combined using various strategies without being limited to a single classifier/segmenter/regressor model, resulting in more accurate and reliable results. As seen in Table 1, these studies cover segmentation, classification, and regression tasks in many areas such as age estimation, Alzheimer's detection, tumor segmentation, and tumor classification.
In Aurna et al.'s study, a two-stage ensemble-based model was proposed for the tumor classification problem by combining the features extracted from deep learning architectures such as a custom CNN, EfficientNet-B0, and ResNet-50 and classifying them using classical machine learning methods such as the support vector machine (SVM) and random forest (RF) [24]. Similarly, Aamir et al. combined features extracted from the EfficientNet and ResNet50 architectures for ensemble learning to perform tumor classification [25]. Feature-level ensemble strategies have also been used for tumor segmentation tasks. In Liu et al.'s study, an architecture called PIF-Net was proposed, based on the fusion of features from different MRI modalities [26]. In Kua et al.'s study, brain age estimation, framed as a regression task, was performed using ridge regression and support vector regression (SVR) with ResNet [27].
In contrast to feature-level ensemble strategies, Dolz et al. improved brain tissue segmentation performance for infants by combining the results of customized 3D CNN variants using majority voting [28]. For the tumor segmentation task, Cabria et al. improved performance by combining the results of the potential field segmentation, FOR, and PFC methods with rule-based strategies [29]; Feng et al. averaged the results of 3D U-Net variants [30]; and Das et al. combined the results of the basic encoder-decoder, U-Net, and SegNet models according to their success rates [31]. For the classification task, Tandel et al. combined the outputs of the AlexNet, VGG16, ResNet18, GoogleNet, and ResNet50 models using majority voting [32], while Islam et al. combined the outputs of the DenseNet121, VGG19, and Inception V3 models [33], and Ghafourian et al. combined the outputs of SVM, naive Bayes, and KNN [34] in a similar manner. In Kang et al.'s study, the results of multiple models were averaged, resulting in improved tumor classification [35]. For Parkinson's detection, Kurmi et al. used a fuzzy logic-based ensemble strategy with the VGG16, ResNet50, Inception-V3, and Xception models [36], while for Alzheimer's detection, Chatter et al. combined the outputs of the SVM, logistic regression, naive Bayes, and k-nearest neighbor methods using majority voting [37]. Finally, in Zahoor et al.'s study, both feature-level and model-level ensemble strategies were employed using multiple models for tumor classification [38].
Upon reviewing the literature, we observed that most ensemble studies address tumor classification and segmentation. To the best of the authors' knowledge, there is currently no comprehensive ensemble study of the object detection task for brain MRI. The main contributions of this paper are as follows:
• A comprehensive ensemble-based object detection study for anatomical and pathological object detection in brain MRI.
• A total of nine state-of-the-art object detection models were employed to propose and evaluate four distinct ensemble strategies aimed at improving the accuracy and robustness of detecting anatomical and pathological regions in brain MRI. The efficacy of these strategies was empirically assessed through rigorous experiments.
• A comparative evaluation of current state-of-the-art object detection models for identifying anatomical and pathological regions in brain MRI was conducted as a benchmarking study on the novel Gazi Brains 2020 dataset.
• Five anatomical structures, namely the brain tissue, eyes, optic nerves, lateral ventricles, and the third ventricle, as well as a pathological object, the whole tumor seen in brain MRI, were detected simultaneously.

Dataset
Datasets used in brain MRI studies are generally subject-specific, small, and nonpublic [39][40][41]. Researchers have difficulty in making the brain MRI data open for interdisciplinary study due to the necessity for domain experts to label this data, the time taken to label data, ethical issues for data sharing, etc. Therefore, although the number of medical images taken for disease detection and follow-up is relatively high, the conversion rate of these images into datasets is low.
The Gazi Brains 2020 dataset [42] was used in this study because it contains rich labeling information for both abnormal and normal patients, including various anatomical structures. The dataset covers not only the anatomical parts of the brain itself but also structures such as the eye and optic nerve that are visible in any brain MRI. In addition, considering the changes in anatomical parts such as the lateral and third ventricles, especially in slices where the tumor is visible, it is a more challenging dataset for finding these anatomical parts. The Gazi Brains 2020 dataset includes 50 normal patients and 50 patients with histologically proven high-grade glioma (HGG), and it provides a total of 12 different types of label information annotated by medical experts. All 100 patients have FLAIR, T1w, and T2w sequences, while the 50 HGG patients and 12 of the normal patients have post-contrast T1 sequences. The 12 labels comprise anatomical structures and pathological entities. In this study, the brain ROI (brain tissue and orbital CSF), eye, optic nerve, lateral ventricle, third ventricle, peritumoral edema, contrast-enhancing part, necrosis of tumor, hemorrhage, and no contrast-enhancing part were used for the object detection models. The dataset statistics used in this study are given in Table 2. The labels for peritumoral edema, the contrast-enhancing part, necrosis of tumor, hemorrhage, and the no contrast-enhancing part were combined into whole tumor objects. The BraTS 2020 HGG dataset was also used, for pathological object detection only. In total, 19,176 slices with any tumor label (NCR/NET-label 1, ED-label 2, ET-label 4) were used from the BraTS 2020 HGG dataset [39,43,44]. Slices without any labels were excluded from the analyses.
The dataset preparation process is visualized in Figure 1. The same dataset preparation process was performed as proposed by Terzi et al. [45]. Accordingly, each independent mask was defined as an object in a slice. Thus, there could be several different objects belonging to an anatomical structure in a slice. For example, in the top row of Figure 1b, there are several independent whole tumor masks, highlighted in white. Similarly, there are two independent lateral ventricles, the optic nerve, and the eye in the bottom row of Figure 1b. They were all taken as different objects, and these objects were used in the model training process as shown in Figure 1c.

Deep Learning Architectures for Anatomical and Pathological Object Detection
Object detection models are broadly divided into two categories: one-stage and two-stage [46]. In two-stage techniques, the regions of objects are first identified; then, the model is fed with region proposals for object classification and localization. In contrast, one-stage techniques apply a single model that divides the image into regions and directly predicts the bounding box and label probabilities for each region.
In this study, all the models were trained using MMDetection [47], an object detection toolbox, to avoid various model implementation problems and to provide a standard training pipeline. A total of nine different state-of-the-art object detection models were used, as given below.
Two-stage object detection models consist of two stages: the region proposal network (RPN) and the detector. In the RPN stage, candidate regions are proposed, and in the detector stage, bounding box regression and classification are performed. Two-stage object detection models are also known as the R-CNN family, and there are many examples of this type of model.
Faster R-CNN is an improved version of the Fast R-CNN model with a faster run time. To realize this, it uses a CNN-based feature extractor to propose rectangular objects instead of a selective search in the proposal stage. The proposed object features are shared with the detector and used for bounding box regression and classification. A bounding box is defined by four coordinates b = (b_x, b_y, b_w, b_h), and the regression is performed using the smoothed L1 loss function (Equation (2)) between the ground truth bounding box and the candidate bounding box; the loss function learns the coordinates from the data by minimizing this distance. For the classification process, the classifier learns by minimizing the classification cross-entropy loss on the training set (Equation (1) for binary cross-entropy), where p represents the class probability and y the ground truth label. During learning, positive and negative detections are identified based on the IoU metric according to Equation (3).
where x represents the regression label, b the bounding box, G the ground truth, and T+ and T− the positive and negative IoU thresholds, respectively.
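As an illustration, the smoothed L1 regression loss of Equation (2) can be sketched in a few lines of NumPy; the β parameter and the example box coordinates below are illustrative assumptions, not values from the study:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smoothed L1 loss used for bounding box regression:
    0.5 * x^2 / beta if |x| < beta, else |x| - 0.5 * beta."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

# Element-wise loss between predicted and ground-truth coordinates
pred = np.array([10.0, 12.0, 50.0, 40.0])   # (b_x, b_y, b_w, b_h), illustrative
gt   = np.array([10.5, 12.0, 48.0, 41.0])
loss = smooth_l1(pred - gt).sum()
```

The quadratic region near zero makes the loss less sensitive to small coordinate errors than plain L1, while the linear region bounds the gradient for outliers.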
The Cascade R-CNN performs object detection by taking into account different IoU overlap levels compared to Faster R-CNN. The Cascade R-CNN architecture proposes a series of customized regressors for different IoU overlap threshold values, as given in Equation (4). N represents the total number of cascade stages, and d denotes the sample distribution in the equation.
The Dynamic R-CNN involves a dynamic training procedure compared to the Faster and Cascade R-CNN. In this method, instead of using predefined IoU threshold values in the second stage, where the regressor and classifier operate, a procedure that changes dynamically based on the distribution of region proposals is proposed, as given in Equation (3). For this purpose, the dynamic label assignment (DLA) process given in Equation (5) is first applied according to the threshold value T, which is updated based on the statistical distribution of the proposals. For the localization task (bounding box regression), the Dynamic Smooth L1 (DSL) loss given in Equation (6) is used. Like the DLA, the DSL also adapts the regression labels based on the statistical distribution of the proposals.
where T_now represents the current IoU threshold, updated according to the data distribution, and β_now represents the hyperparameter controlling the smooth L1 loss.
The main difference between one-stage object detection models and two-stage object detection models is that one-stage models do not involve a region proposal stage. Therefore, they generally provide faster detector performance compared to the two-stage methods. One-stage object detection models show diversity in many aspects. These models can be divided into various subcategories based on the differences in their architectures (such as anchor-based, anchor-free, feature pyramid, context, etc.). The following is a summary of the one-stage object detectors used in this study.
The most important component of the RetinaNet architecture is a loss function called the focal loss, given in Equation (7). This loss function addresses the problem of class imbalance during training: it allows rarer classes to be learned better than with the classic cross-entropy loss (Equation (1)), leading to balanced model training. RetinaNet is an anchor-based architecture that uses a ResNet-based FPN as its backbone. While the backbone extracts convolutional features from the input, two different subnets that take the backbone's outputs as inputs are used for object classification and bounding box regression.
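The down-weighting of easy examples performed by the focal loss of Equation (7) can be sketched as follows; the α and γ defaults are the commonly used values, not necessarily those used in this study:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p is the predicted foreground probability and y is the {0,1} label."""
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 0 and α_t = 1 the expression reduces to the ordinary cross-entropy of Equation (1); increasing γ shrinks the loss of well-classified examples so that rare foreground objects dominate training.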
YoloV3 is a Yolo variant that improves on YoloV2 in various aspects. The most important factor making Yolo models faster than other object detection models is that they do not contain a complex pipeline. In this architecture, the entire image is used as the input to the network, and bounding box regression and object detection are performed directly on it. The main differences from YoloV2 are that YoloV3 uses a deeper feature extraction architecture called Darknet-53 and predicts bounding boxes at three different scales.
FCOS is an anchor-free object detection model that performs object classification and bounding box regression without requiring overlap (IoU) calculations with anchors or detailed hyperparameter tuning. Anchor-based methods treat the centers of anchor boxes as locations in the input image and use them to regress the target bounding box. FCOS instead considers any point inside the ground truth box as a location and regresses the distances t* = (l*, t*, r*, b*) (given in Equation (8)). If a point falls inside multiple bounding boxes, it is considered an ambiguous example, and multilevel prediction is used to reduce such cases. FCOS also uses a centerness strategy to improve bounding box detection quality (given in Equation (9)). To do this, FCOS adds a stage parallel to the classification task to predict whether a location is a center. The centerness value is trained with binary cross-entropy loss and added to the FCOS loss function.
where l*, t*, r*, and b* are the distances from the location to the four sides of the ground truth bounding box.
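Assuming Equation (9) takes the standard FCOS form, the centerness target for a location can be sketched as:

```python
import numpy as np

def centerness(l, t, r, b):
    """FCOS centerness target for a location with distances (l*, t*, r*, b*)
    to the four sides of the ground-truth box: sqrt(min/max ratios per axis)."""
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

A location at the exact center of the box scores 1.0, while locations near a box edge score close to 0, so low-quality boxes predicted far from object centers are suppressed at inference time.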
The NAS-FPN model has made improvements to the FPN model architecture used in object detection models for extracting visual representations. Unlike other models that use manual FPN architecture design, this model narrows down a wide search space for FPN architecture through the neural architecture search algorithm to extract the effective FPN architectures.
The ATSS model proposes a method that performs adaptive sampling of training set examples by utilizing the statistical characteristics of positive and negative examples, based on the fact that they have a significant impact on the model's performance. This approach has led to significant improvements in both anchor-based and anchor-free detectors, filling the performance gap between these two different architectures.
VarifocalNet proposed a different approach from the other models for evaluating candidate objects during training by considering not just a classification score or a combination of classification and localization scores, but an IoU-aware classification score (IACS) that takes detection performance into account. This approach considers both localization accuracy and the object confidence score, producing successful results, especially in dense object detection scenarios. To estimate the IACS, the varifocal loss (given in Equation (10)) and a star-shaped bounding box representation were proposed.
where p represents the predicted IACS score, and q represents the target score.
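Assuming Equation (10) follows the published form of the varifocal loss, it can be sketched as below; the α and γ defaults are the values commonly reported for this loss, not necessarily those used in this study:

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss: for positives (target IACS q > 0) a binary
    cross-entropy weighted by q; for negatives a focal down-weighting
    of the predicted score p."""
    bce = -(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))
    return np.where(q > 0, q * bce, -alpha * p ** gamma * np.log(1.0 - p))
```

Unlike the symmetric focal loss, only the negative examples are down-weighted; positives are instead up-weighted by their target IoU-aware score q, emphasizing high-quality detections.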

Model Ensemble
The ensemble methods include combining the results of many models belonging to one object detection architecture, as well as combining the results of models with different architectures. In this study, four different ensemble strategies were presented as follows.

In Figure 2, an ensemble method is presented in which the deep learning architecture is kept constant, and the predictions of many model variants belonging to the relevant architecture are combined. Nine different object detection models with tenfold cross-validation were used to measure the efficiency of the ensemble strategy for each model. The numbers indicated in Figure 2 (e.g., A.1, A.10, etc.) represent the fold number of the relevant models. In Figure 2(a.1), the predictions of all model variants belonging to the tenfold cross-validated Model A are combined without any fold selection criteria. In Figure 2(a.2), instead of an ensemble of all tenfold predictions, the predictions of the best top-k cross-validated models for Model A are combined using a best-model selection criterion. In this study, the top-k models were selected as the folds that produced an mAP above the average mAP value of the tenfold cross-validation. These ensemble strategies produce a single ensemble result for the all-folds and top-k-folds cases, respectively.
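The top-k ("Mean+ Folds") selection rule described above can be sketched as follows; the per-fold mAP values are hypothetical, used only to illustrate the criterion:

```python
import numpy as np

# Hypothetical validation mAPs from tenfold cross-validation of one model
fold_maps = {f"fold_{i}": m for i, m in enumerate(
    [0.74, 0.76, 0.71, 0.78, 0.75, 0.73, 0.77, 0.72, 0.79, 0.70], 1)}

# Keep only the folds whose mAP exceeds the mean mAP across all folds
mean_map = np.mean(list(fold_maps.values()))
selected = [fold for fold, m in fold_maps.items() if m > mean_map]
```

The predictions of the `selected` folds would then be fused, while the all-folds strategy simply fuses every fold's predictions without this filter.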
An ensemble strategy of different models is presented in Figure 2b. In the figure, unlike the ensemble strategy in Figure 2a, the fold number is kept constant, and the predictions of the different models are combined. In Figure 2(b.3), the predictions of the models that come from tenfold cross-validation for each different model are combined fold-by-fold without any selection criteria. This ensemble strategy generated a total of ten ensemble results, with one ensemble result per fold. In Figure 2(b.4), for each fold, the best model among all models is selected, and the best models corresponding to each fold are combined to produce a single ensemble result in this strategy.
Predictions of the models and the bounding boxes for each object were combined using the weighted boxes fusion (WBF) [57] method for all ensemble strategies. For the WBF, the confidence score is first calculated for each model prediction as in Equation (11). The coordinates of the new bounding boxes are then recalculated using Equations (12) and (13). In the equations, C represents the confidence score, T the number of fused boxes, and X and Y the new bounding box coordinates. In this study, the confidence score for each model prediction was taken as the average score of the intersecting bounding boxes, and the new coordinates were calculated as the confidence-weighted sum of the box coordinates. Thus, boxes with higher confidence scores contribute more to the fused box.
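A minimal sketch of the fusion step corresponding to Equations (11)-(13) is given below, assuming boxes in (x1, y1, x2, y2) format; the clustering of overlapping boxes by IoU, which precedes this step in full WBF, is omitted:

```python
import numpy as np

def fuse_boxes(boxes, scores):
    """Fuse one cluster of overlapping boxes, WBF-style:
    fused score  = mean of the cluster's confidences (Eq. (11));
    fused coords = confidence-weighted average of coordinates (Eqs. (12)-(13))."""
    boxes = np.asarray(boxes, dtype=float)    # shape (T, 4): x1, y1, x2, y2
    scores = np.asarray(scores, dtype=float)  # shape (T,)
    fused_score = scores.mean()
    fused_box = (scores[:, None] * boxes).sum(axis=0) / scores.sum()
    return fused_box, fused_score
```

Unlike NMS, which discards all but the highest-scoring box, this averaging uses every model's localization evidence, which is why the fused boxes tend to be better localized.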

Evaluation Metrics
Object detection aims to detect objects in an image, their classes, and their locations. For this reason, the object detection task has its own evaluation metrics. Two different popular object detection metrics, the Intersection over Union (IoU) and the mean average precision (mAP), were used to evaluate the results of this study.
Considering the ground truth (A) and the predicted bounding box (B) as two different sets, the ratio of the intersection of these two sets to their union gives the IoU [58] value (Equation (14)).
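Equation (14) can be computed directly from box coordinates; the sketch below assumes (x1, y1, x2, y2) corner format:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

The value ranges from 0 (disjoint boxes) to 1 (identical boxes) and is the quantity thresholded (e.g., at 0.5) when deciding whether a detection counts as a true positive.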
Precision (P) and Recall (R), which are classical machine learning metrics, are used as the basis for calculating the mAP value. As can be seen from Equations (15) and (16), P indicates the model's accuracy in classifying a sample as positive, while R indicates the power of the model to detect positive samples. In the equations, TP (True Positive) refers to the model correctly detecting the object. FN (False Negative) means the object cannot be detected by the model, although there is an object in the image. FP (False Positive) means an object is detected, even though there is no object present. In the object detection task, these metrics are calculated according to the case that the IoU value between the predicted boxes and the ground truth boxes is above a certain threshold value.
Since neither metric alone is sufficient, the P-R curve, which shows the trade-off between P and R at different thresholds, was used. The average precision (AP) summarizes the P-R curve and was calculated according to Equation (17) (R_n = 0, P_n = 1, n = the number of thresholds). After calculating the AP value for each class, the mAP value was obtained as their average [59]. In this study, FROC curves were plotted using the FPPI (Equation (18)) and TPR (Equation (16)) to measure how changes in the confidence score affected both the false positives per image and the TPR performance.
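The AP summary of the P-R curve can be sketched as below; this uses the common all-points interpolation with a monotonic precision envelope, which is assumed to match the spirit of Equation (17):

```python
import numpy as np

def average_precision(recalls, precisions):
    """All-points interpolated AP: sum of (R_n - R_{n-1}) * P_n after
    making the precision envelope monotonically non-increasing.
    recalls must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

The mAP is then simply the mean of the per-class AP values, and FPPI is the total number of false positives divided by the number of images evaluated.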

Experimental Setup
In this study, the 2029 axial brain MRI slices belonging to the 100 patients in the Gazi Brains 2020 dataset were split on a patient basis. First, 202 slices belonging to 10 patients, five normal and five tumorous (Patient IDs: Sub-10, 14, 15, 19, 32, 53, 59, 61, 84, and 94), were selected randomly for testing purposes only. Then, the 1827 slices belonging to the remaining 90 patients were divided into training and validation sets according to the tenfold cross-validation method. Each fold included 1644 (±5) training slices and 182 (±5) validation slices. In each fold, the model with the best mAP value on the validation set was selected and tested. This process was repeated for all folds, and the benchmark results were reported. The BraTS 2020 HGG dataset was divided on a patient basis into 60% training, 20% validation, and 20% testing, randomly. All models were selected based on validation performance and then tested for reporting.
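The patient-wise splitting described above, which guarantees that no patient's slices leak between training and validation, can be sketched as follows; the patient IDs and per-patient slice counts are illustrative, not the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical slice-to-patient mapping for 90 training patients
patients = [f"sub-{i:02d}" for i in range(1, 91)]
slices = [(p, s) for p in patients for s in range(rng.integers(15, 26))]

# Tenfold split at the patient level, not the slice level
folds = np.array_split(patients, 10)
for val_patients in folds:
    val_set = set(val_patients)
    val = [x for x in slices if x[0] in val_set]
    train = [x for x in slices if x[0] not in val_set]
    # A patient never appears on both sides of the split
    assert not {p for p, _ in val} & {p for p, _ in train}
```

Splitting at the slice level instead would place near-identical neighboring slices of one patient in both sets and inflate the validation scores.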
The FLAIR, T1w, and T2w sequences, which were spatially aligned and common sequences for all patients as three channels, were prepared for the model training. The backbone, hyperparameters, and augmentation techniques used in the training process of the models are given in Table 3. All layers of the model were trained with the initial weights coming from the pretrained models that were obtained from the MMDetection GitHub repository. All augmentation techniques were applied with 0.5 probability, and finally, a multiscale flip augmentation technique was applied for the testing pipeline. The default values came from the MMDetection repository and were used for all the parameters except the hyperparameters specified in Table 3.
During the model testing process, non-maximum suppression (NMS) was applied to the model predictions by using 0.6 IoU and 0.1 score threshold values. After applying the NMS to the predictions, the WBF was applied for the ensemble strategies. The WBF weights were taken equally as 1. Moreover, 0.6 IoU and 0.16 skip boxes threshold were used as the WBF hyperparameters.

Experimental Results
In this section, two different benchmark results are given for the anatomical and pathological object detection problem on the Gazi Brains 2020 dataset. The first one aims to present the performance of each model individually, and the second one aims to present the ensemble performance of the models within the scope of the determined strategies.
The average values of the tenfold cross-validation results on the test set for each model are given in Table 4. As can be seen from the table, the best class-based AP(@0.5 IoU) values varied according to the models. In terms of the detection of each anatomical object, the benchmarking results were summarized as follows:

• All the models had similar success results for the brain ROI object.

The ensemble results of the different variants of each model are given in Table 5. To realize this strategy, the models from the tenfold cross-validation of the relevant model were used, and the efficiency of the ensemble strategies was reported relative to the best model. In the table, the best fold according to the mAP value (the Best Fold) is compared to the ensemble outputs of the folds above the average mAP (Mean+ Folds, Figure 2(a.1)) and all folds without any model selection criteria (All Folds, Figure 2(a.2)). According to the results reported in Table 5, the ensemble of the models increased the anatomical object detection success rate for each model. The ensemble strategies produced better mAP values by 2% for ATSS, 3% for Cascade R-CNN, 4% for Dynamic R-CNN, 2% for Faster R-CNN, 1% for FCOS, 2% for NAS-FPN, 3% for RetinaNet, 2% for VFNet, and 10% for YOLOv3. The ensemble strategies performed with each model's own variants also improved the detection of individual anatomical objects. The most notable results were the 5%, 9%, 11%, 6%, 7%, and 18% improvements in finding the optic nerve object for the ATSS, Cascade R-CNN, Dynamic R-CNN, Faster R-CNN, VFNet, and YOLOv3 models, respectively. Similarly, 5% and 8% improvements were observed for the NAS-FPN and RetinaNet models, while a 3% improvement in finding the tumor object was observed for the FCOS model.

The ensemble results of the different models are given in Table 6. In the table, the results obtained from the fold-by-fold ensemble strategy are given for each fold (All Folds-1, 2, …, 10).
In addition, the average results of the fold-by-fold ensemble strategy (All Folds-Mean, Figure 2(b.3)), the results of the best fold according to the mAP value in this strategy (All Folds-the Best, Figure 2(b.3)), and the ensemble results obtained by selecting the best model for each fold (the Best Folds, Figure 2(b.4)) were compared.
The ensemble of all the models in any fold gave better results than the best model in any fold. For example, the best NAS-FPN model produced an mAP value of 0.805, as indicated in Table 5; however, when the values in Table 6 are examined, the different-model ensemble strategy was more successful in all folds except Fold 9. Similarly, an ensemble of the different models based on tenfold cross-validation, with a 0.818 (±0.01) mAP, yielded about 6% better mAP results than the best individual model indicated in Table 4, the NAS-FPN with a 0.76 (±0.02) mAP value. Finally, an ensemble of the different best model variants, with a 0.838 mAP value as indicated in Table 6, was 1.8% and 1% better than the other ensemble strategies, namely the ensemble of the NAS-FPN model variants with a 0.82 mAP value in Table 5 and the fold-by-fold ensemble of the best different models with a 0.828 mAP value in Table 6. It was also 3.3% better than the best individual NAS-FPN model with a 0.805 mAP value in Table 5.
The comparative FROC curves, with FAUC values representing the area under the curves, are provided in Figure 3 for each anatomical and pathological region using the best individual models and the ensemble strategy (Strategy 4, Best Folds) specified in Table 6. Thus, the effect of changing the confidence score from 0 to 1 in intervals of 0.02 on the FPPI and TPR was measured on the Gazi Brains 2020 dataset. According to the results in the figure, the Best Folds ensemble strategy achieved approximately 1%, 2%, 1%, 2%, and 7% better FAUC performance for the RoI, optic nerve, eye, lateral ventricle, and third ventricle, respectively, compared to the best individual model.
In this study, experiments were also conducted to benchmark the models and the ensemble strategy of the best models for detecting the pathological regions on the BraTS 2020 HGG dataset. The anatomical regions were not considered because the BraTS 2020 HGG dataset lacks labels for them. The AP metric was also evaluated with a 0.2 confidence score.

Figure 4 presents the FROC curves of the models based on their TPR and FPPI values at different confidence scores. The ensemble strategy of different models achieved approximately 2% better FAUC performance than the best individual model (Faster R-CNN).

Figure 5 visualizes the efficiency of the model ensemble. For visualization purposes, we selected the best ATSS, NAS-FPN, Cascade R-CNN, Dynamic R-CNN, Faster R-CNN, FCOS, and YOLOv3 models, shown from left to right. The classification confidence threshold was set to 0.3 for each class. As shown in the figure, the ensemble of different models found anatomical parts that could not be found by a single model and reduced unnecessary predictions and false positives.

Discussion
Automatic anatomical object detection models based on deep learning, as computer-aided diagnostic tools, support the decision-making processes of clinicians by presenting the location, shape, and size of objects in medical images. However, the performance of these models sometimes cannot meet clinical expectations due to issues inherent to medical images. Anatomical objects do not always differentiate well in contrast and may appear similar to neighboring structures. Even when the characteristics of an anatomical object are known, they often vary from case to case. At the same time, many different types of objects have similar sizes and shapes. In addition, noise and artifacts can greatly alter the depicted objects. Furthermore, although valuable information is gained about anatomy, extensive expert knowledge is needed to fill the semantic gap between the depicted objects and the image data. Annotation is time-consuming and expensive, as it requires experts to interpret the medical images and combine them with other test results if necessary. Due to all these problems, medical object detection models may not be as robust as expected.
As in many of the studies listed in Table 1, this study was conducted with a single primary dataset. Deep learning models developed on limited datasets risk overfitting, which limits their generalization ability. Although it may be possible to combine data from different open sources, finding datasets with similar labels or similar tasks is almost impossible. In this study, the performance of the models was measured using tenfold cross-validation on the open Gazi Brains 2020 dataset and on the BraTS 2020 HGG dataset. As expected, ensemble strategies are more effective on relatively small datasets such as Gazi Brains 2020, but they also proved effective on larger datasets such as BraTS 2020 HGG. This is particularly important when data are limited, highlighting the value of ensemble strategies.
Therefore, this study aimed to combine decisions from multiple models using different ensemble strategies and to improve anatomical object detection model performance. In this study, nine different models were compared for the anatomical object detection problem, and four different ensemble strategies were proposed. The proposed ensemble strategies ensured that each model individually boosted the anatomical object detection capacity and achieved the highest success rates by combining different models.

Conclusions
According to the obtained results, the ensemble strategies applied to model variants improved the individual performance of the models, increasing the mAP values by between 1% and 10% compared to the relevant best model. Similarly, regarding the detection performance for anatomical parts, an improvement of up to 18% AP was observed. In particular, the ensemble strategies were found to be much more efficient at detecting anatomical parts with a small amount of data, such as the optic nerve and third ventricle. As seen in Table 5, the highest performance increase for eight of the nine models was for the optic nerve and third ventricle objects. It was also observed that the ensemble of different models outperformed the individual best model in almost all cases, and the ensemble of the different best models was 3.3% better than the best individual model, the NAS-FPN. Another important outcome of the ensemble strategies, as seen in Figures 3 and 4, is that they produced higher TPR values, especially at low FPPI values, compared to the best individual methods.
It is planned to implement ensemble strategies for different tasks such as classification and segmentation in future studies. It is also planned to use anatomical object detection methods in cascaded architectures to extract better anatomical patches for improved segmentation.
Funding: This study was funded by the Digital Transformation Office of the Presidency of the Republic of Türkiye.