Abstract
Object detection is an important tool in many areas, such as robotics or autonomous driving. Especially in these areas, a wide variety of object classes must be detected or interacted with. Models from the field of Open-Vocabulary Detection (OVD) provide a solution here, as they can detect not only base classes but also novel object classes, i.e., classes that were not seen during training. However, one problem with OVD models is their poor calibration, meaning that the predictions are often over- or under-confident. To improve the calibration, Temperature Scaling is used in this study. Using YOLO-World, one of the best-performing OVD models, the aim is to determine the optimal T-value for this calibration method. To this end, it is investigated whether there is a correlation between the logit distribution and the optimal T-value and how this relationship can be modeled. Finally, the influence of Temperature Scaling on the Expected Calibration Error (ECE) and the mean Average Precision (mAP) is analyzed. The results of this study show that similar logit distributions of different datasets result in the same optimal T-values. This correlation could be modeled best using Kernel Ridge Regression (KRR) and a Support Vector Machine (SVM). In all cases, the ECE could be improved by Temperature Scaling without significantly reducing the mAP.
1. Introduction
Object detection plays an essential role in areas such as robotics and autonomous driving. Without reliable object detection, objects cannot be perceived, nor can interaction with them take place. A major challenge in this context is the detection of unknown object classes, i.e., those classes that the model has not seen during training. Many deep learning (DL) and machine learning (ML) approaches such as YOLO [] or Faster R-CNN [] train models on specific object classes and can only detect these during inference. Despite the generally good detection rate, it is not possible to detect object classes outside the training dataset. Adding new classes costs time and resources, as a new dataset must be created for each object class and the model must be retrained.
This limitation has given rise to the field of Open-Vocabulary Detection (OVD) [], where models are being developed that can also detect object classes outside the training dataset. A model is therefore trained on base classes and can detect novel classes as well. There is currently a large number of Open-Vocabulary Detection models, such as mm-Grounding DINO [], DetCLIPv3 [], Grounding DINO [], Grounding DINO 1.5 Pro [] and YOLO-World [], to name a few. All these models have different structures and approaches to detecting unknown objects.
Grounding DINO and mm-Grounding DINO use a Swin Transformer [] backbone to extract image features and a BERT [] backbone to extract text features. In comparison, YOLO-World relies on a lightweight YOLOv8 backbone to extract image features, which greatly improves the inference speed compared to Grounding DINO and mm-Grounding DINO. In addition, a CLIP [] text encoder is used instead of BERT. DetCLIPv3 also trains a Swin-T and Swin-L backbone. One problem with the use and comparison of these models is the limited open-source access and the differing pre-training datasets. For example, the source code of DetCLIPv3 is not publicly available, just like that of Grounding DINO 1.5 Pro. For this reason, modifications to the source code, as required for this study, are not possible. Furthermore, Grounding DINO 1.5 Pro uses its own pre-training dataset (Grounding-20M), which is also not publicly available. This makes it nearly impossible to compare it with other models, such as YOLO-World, especially with regard to the detection of unknown objects under OVD conditions.
However, when comparing open-source models that use the same or similar pre-training datasets in terms of the accuracy of their predictions under OVD conditions, YOLO-World and mm-Grounding DINO currently prove to be the best-performing models. Accuracy is assessed using the mAP (mean average precision), and LVIS minival [] (a subset of LVIS []) is used as the evaluation dataset. mm-Grounding DINO was pre-trained using the Objects365v1 [] and GoldG [] datasets, while YOLO-World-V2.1 also includes CC-LiteV2 [] in addition to Objects365v1 and GoldG. The achieved mAP is 35.7% for mm-Grounding DINO [] and 35.5% for YOLO-World-V2.1-L []. However, in terms of inference speed, YOLO-World is significantly faster at around 52.0 FPS on one NVIDIA V100 GPU, while Grounding DINO only achieves 1.5 FPS [] (mm-Grounding DINO is based almost unchanged on the Grounding DINO model [], hence the inference speeds are in the same order of magnitude). Therefore, the investigations in this study focus on the open-source model YOLO-World-V2.1.
In addition to accuracy and inference speed, the calibration of these models is also essential for the use of OVD models in the real world. It was demonstrated in [] that YOLO-World is not well calibrated; the model is overconfident in its predictions. If many false positive (FP) predictions occur with a high confidence value, this suggests a trustworthy prediction, even though the object being searched for was not correctly detected. Many such predictions can result in personal injury and property damage, especially in safety-critical applications. It is therefore important that OVD models are well calibrated. There are numerous methods for correcting miscalibration, which are listed in [,] for neural networks. In [], the simple calibration method Temperature Scaling is applied to CLIP to reduce the calibration error. The study [] showed that Temperature Scaling can positively influence the calibration of the OVD model YOLO-World by reducing the calibration error. However, the optimal T-value depends heavily on the calibration dataset used, i.e., a suitable calibration dataset was able to reduce the Expected Calibration Error (ECE) by several percent, while an unsuitable dataset resulted in a suboptimal T-value and increased the ECE.
The objectives of this study can be derived from the findings in []. Since the calibration dataset has a decisive influence on the determination of the optimal T-value, the first step is to investigate whether there is a correlation between the logit distributions of different datasets and the optimal T-values of the Temperature Scaling calibration method. To this end, the logit distributions resulting from the YOLO-World predictions on a dataset will be analyzed. The optimal T-values will be determined with a view to minimizing the ECE. Subsequently, the extent to which the optimal T-values can be predicted from a given logit distribution will be investigated. Various machine learning models will be tested to model the relationship. Finally, the accuracy (mAP) will be examined after calibration, i.e., after optimization of the ECE, as the mAP should not decrease significantly despite improved calibration. In addition, the influence of temperature scaling on the ECE of the respective datasets is also analyzed.
The results of this investigation show that there is a correlation between the logit distribution and the optimal T-values, i.e., if two distributions are similar, they also result in the same optimal T-values. Different distributions have T-values that differ from each other. This makes it possible to predict the optimal T-values based on a given logit distribution. After calibration using temperature scaling, the mAP decreases slightly or remains unchanged when the T-value is less than 1 and increases in most cases when T is greater than 1. An improvement in ECE was achieved in all cases.
2. Related Work
2.1. YOLO-World
YOLO-World is a powerful object detection model with real-time Open-Vocabulary detection capabilities []. As input, YOLO-World receives an image I and a text F, from which the text embeddings w_j ∈ R^D (j = 1, …, C) and the object embeddings o_k ∈ R^D (k = 1, …, K) are derived. D reflects the embedding dimension, C is the number of classes and K is the number of objects. A frozen CLIP text encoder is used to extract the text embeddings. The image features are extracted by a YOLOv8 backbone, which are converted into object embeddings using the RepVL-PAN (Re-parameterizable Vision-Language Path Aggregation Network) and the text contrastive head. Subsequently, the text embeddings are L2-normalized (L2-Norm), while the object embeddings are batch-normalized (Batch-Norm) []. The logits (object-text similarity) s_{k,j} between the j-th text embedding w_j and the k-th object embedding o_k can be calculated by

s_{k,j} = α · L2-Norm(w_j)ᵀ · Batch-Norm(o_k) + β,   (1)
where α and β are two constants, which scale and shift the result. Detailed explanations of how YOLO-World works are provided in []. Finally, the logits are activated by a sigmoid function,

p_{k,j} = σ(s_{k,j}) = 1 / (1 + e^(−s_{k,j})),   (2)
which then yields the corresponding probabilities or confidence values of the predictions. Unlike a softmax activation function, all logits are considered independently of each other.
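The following sketch illustrates Equations (1) and (2) in NumPy; the array names, the example shapes, and the constants alpha and beta are illustrative assumptions and not the actual YOLO-World implementation.

```python
import numpy as np

def object_text_logits(obj_emb, txt_emb, alpha=1.0, beta=0.0, eps=1e-12):
    """Sketch of Eq. (1): logits between K object and C text embeddings.

    obj_emb: (K, D) object embeddings (assumed already batch-normalized)
    txt_emb: (C, D) text embeddings
    alpha, beta: illustrative scale/shift constants
    """
    # L2-normalize the text embeddings
    txt_norm = txt_emb / (np.linalg.norm(txt_emb, axis=1, keepdims=True) + eps)
    # Object-text similarity: (K, C) logit matrix
    return alpha * obj_emb @ txt_norm.T + beta

def sigmoid(x):
    """Eq. (2): each logit is activated independently."""
    return 1.0 / (1.0 + np.exp(-x))

# Example: 3 objects, 2 classes, embedding dimension 4
rng = np.random.default_rng(0)
logits = object_text_logits(rng.normal(size=(3, 4)), rng.normal(size=(2, 4)))
confidences = sigmoid(logits)  # per-class confidence of each object
```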
2.2. Temperature Scaling
Temperature scaling is a method designed to improve the calibration of a model. It scales the logits of a model using a constant value, the temperature value T [,]. Temperature scaling can therefore be applied after a model has been trained and has proven to be a simple but efficient method in the literature []. This method is applied before the logits are activated by a softmax function, but can also be applied to other activation functions, such as sigmoid. The logits after calibration can be calculated as

ŝ_{k,j} = s_{k,j} / T,   (3)
where s_{k,j} represents the logits before calibration. Usually, a calibration dataset is required to determine the optimal T-value.
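A minimal, self-contained sketch of Equation (3) applied before the sigmoid activation (the example logit values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def calibrated_confidences(logits, T):
    """Eq. (3): divide the logits by the temperature T before the sigmoid (Eq. (2))."""
    return sigmoid(np.asarray(logits) / T)

logits = np.array([-2.0, 0.5, 3.0])
probs_raw = calibrated_confidences(logits, T=1.0)   # original confidences (no calibration)
probs_cal = calibrated_confidences(logits, T=1.5)   # T > 1 softens over-confident predictions
```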
2.3. Expected Calibration Error
A commonly used metric for quantitatively determining the calibration error is the expected calibration error (ECE) [,].
After a prediction, a sample i consists of a confidence value p̂_i, a predicted class label, a predicted bounding box, a true class label and a true bounding box. Based on this information, a sample can be categorized as either a true positive (TP) or a false positive (FP). A detailed description of how to determine a true positive and a false positive can be found in []. According to the confidence values p̂_i, all samples i can be divided into M equally sized bins B_m, where the bin B_m (m ∈ {1, …, M}) includes all samples of the interval ((m − 1)/M, m/M]. The interval size is fixed and is 1/M.
To determine the ECE, the accuracy acc(B_m) and the confidence conf(B_m) must be calculated for each bin B_m. The accuracy represents the ratio of true positives to the total number of samples |B_m| of the bin and can be calculated using the equation

acc(B_m) = (1/|B_m|) · Σ_{i ∈ B_m} 1(i is TP).   (4)
The confidence is the average of all confidence values p̂_i that are in bin B_m and is defined as

conf(B_m) = (1/|B_m|) · Σ_{i ∈ B_m} p̂_i.   (5)
Finally, the expected calibration error can be derived based on the accuracy and confidence. The ECE is the weighted average of the difference between acc(B_m) and conf(B_m) across all bins B_m,

ECE = Σ_{m=1}^{M} (|B_m| / n) · |acc(B_m) − conf(B_m)|,   (6)
where n is the total number of all samples i. The ECE is equal to 0 for a perfectly calibrated model, i.e., acc(B_m) = conf(B_m) for all bins B_m.
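A sketch of the binned ECE of Equations (4)-(6); the inputs are the per-detection confidence values and a boolean true-positive flag, and the default number of bins M is only a placeholder, not the value used in this study.

```python
import numpy as np

def expected_calibration_error(confidences, is_tp, M=10):
    """Eq. (6): weighted average of |acc(B_m) - conf(B_m)| over M equal-width bins.

    confidences: (n,) predicted confidence of each detection
    is_tp:       (n,) True if the detection is a true positive, else False
    """
    confidences = np.asarray(confidences, dtype=float)
    is_tp = np.asarray(is_tp, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, M + 1)
    ece = 0.0
    for m in range(M):
        # Bin B_m covers ((m-1)/M, m/M]; confidences of exactly 0 fall into the first bin
        in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
        if m == 0:
            in_bin |= confidences == 0.0
        if not in_bin.any():
            continue
        acc = is_tp[in_bin].mean()          # Eq. (4)
        conf = confidences[in_bin].mean()   # Eq. (5)
        ece += in_bin.sum() / n * abs(acc - conf)
    return ece
```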
2.4. Limitations of Existing Methods
In comparison to other methods for determining the optimal T-value, such as cross-validation [,], one of the practical advantages of the proposed approach lies in the reduced time required. Once test data has been collected and an ML model has been trained, the optimal T-value can easily be predicted from a practically obtained logit distribution, without having to test numerous possibilities. The biggest advantage of the proposed approach is its real-world applicability, as no ground truth annotations/labels need to be created from the collected object detection data (e.g., from a robot) in order to determine an optimal T-value. Cross-validation cannot be used in this case.
In addition, temperature scaling was preferred over more powerful methods such as vector scaling and matrix scaling [], as only one optimal value needs to be determined, which is much easier to achieve with little data (logit distributions) when a post-hoc calibration improvement is the objective. In contrast, vector scaling and matrix scaling require a large number of values to be optimized, so more training data is needed, which is difficult to generate. Furthermore, the computational effort required to determine the optimal T-values per dataset is higher, as significantly more combinations have to be tested.
3. Methodology
3.1. Overall Approach
This study aims to determine whether there is a correlation between the logit distributions and the optimal T-values of different datasets in order to achieve the best possible calibration using the Temperature Scaling calibration method. Furthermore, the extent to which modeling is possible is investigated, i.e., whether the optimal T-values can be predicted based on a given logit distribution. Finally, the influence of the method on the accuracy (mAP) and calibration error (ECE) of the predictions is determined.
For all investigations carried out within the scope of this study, the model YOLO-World-V2.1-L (resolution: 1280) [] is used, which detects objects using bounding boxes. A maximum of 300 predictions per image will be accepted (default setting). The reasons for using this model are listed in Section 1. The literature source [] explains how false positives (FP) and true positives (TP) are determined.
Figure 1 provides a summary of the methodology. In the first step, the relationship between the logit distributions and the optimal T-values is investigated. To do this, the optimal T-values of different datasets must first be determined, which differ from each other depending on the dataset. In total, the optimal T-values are derived from 7 different datasets (LVIS minival [], Open Images V7 [,], Pascal VOC [], COCO [], EgoObjects [], Objects365v1 [] and a self-created dataset (see Section 3.2)) with different variants (see Section 4.1 for further explanations).
Figure 1.
The overall approach of this study.
As a metric that should be minimized, the ECE is used, which can be calculated according to Equation (6). This means that a T-value is optimal if the ECE is minimal. Based on [], two IoU (intersection over union) threshold values (IoU = 0.5 and IoU = 0.75) are considered for the calculation of the ECE, i.e., there are also two corresponding optimal T-values, one for the minimum ECE@0.5 and one for the minimum ECE@0.75. The temperature values are set to T ∈ {0.4, 0.45, …, 2.0} in accordance with [,]. The set contains T = 0.4 as the minimum and T = 2.0 as the maximum value. For each of these T-values contained in the set, the ECE@0.5 and the ECE@0.75 are calculated and the minimum is derived. For the calculation of the ECE, the same value of M applies to all investigations.
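A sketch of this search over the T-grid is given below; it assumes the expected_calibration_error helper sketched in Section 2.3 and, as a simplification, per-detection true-positive flags that do not change with T (in the study, each T-value requires its own evaluation run).

```python
import numpy as np

# Candidate temperatures: 0.4, 0.45, ..., 2.0 (33 values)
T_CANDIDATES = np.round(np.arange(0.4, 2.0 + 1e-9, 0.05), 2)

def optimal_temperature(logits, is_tp, M=10):
    """Pick the T that minimizes the ECE for one IoU threshold.

    logits: (n,) raw logits of all retained detections
    is_tp:  (n,) true-positive flags for the chosen IoU threshold
            (simplifying assumption: treated as independent of T here)
    """
    eces = []
    for T in T_CANDIDATES:
        conf = 1.0 / (1.0 + np.exp(-np.asarray(logits) / T))   # Eq. (2) + Eq. (3)
        eces.append(expected_calibration_error(conf, is_tp, M=M))
    best = int(np.argmin(eces))
    return T_CANDIDATES[best], eces[best]
```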
After the optimal T-values have been determined for each dataset, the logit distribution is extracted, which contains the logits of all predictions of a dataset; these are then filtered. Since YOLO-World uses an internal confidence threshold of 0.001 for all predictions, i.e., YOLO-World rejects all predictions whose confidence is less than or equal to 0.001, it follows from Equations (2) and (3) that

σ(s_{k,j} / T) = 1 / (1 + e^(−s_{k,j}/T)) > 0.001,   (7)
and rearranged to s_{k,j}, this results in

s_{k,j} > T · ln(0.001 / 0.999) ≈ −6.907 · T.   (8)
This means that only the logit values that satisfy the condition from Equation (8) are retained, which greatly reduces the number of logits, since these would not be considered anyway. The logit distribution is always determined at a T-value of T = 1, which corresponds to the prediction of YOLO-World without temperature scaling.
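A short sketch of the filtering from Equation (8) (with T = 1, the setting used to extract the distributions):

```python
import numpy as np

def filter_logits(logits, conf_thr=0.001, T=1.0):
    """Keep only logits whose sigmoid confidence exceeds YOLO-World's internal
    threshold of 0.001; at T = 1 this means s > ln(0.001/0.999) ≈ -6.907 (Eq. (8))."""
    logits = np.asarray(logits, dtype=float)
    return logits[logits > T * np.log(conf_thr / (1.0 - conf_thr))]
```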
To check the relationship between the optimal T-values and the logit distributions, selected distributions are first compared visually to obtain an initial assessment of whether the same optimal T-values are indeed based on similar distributions and whether unequal T-values correspond to dissimilar distributions. For the visual comparison, four diagrams are used: box plot, line histogram, Q-Q plot (quantile-quantile plot) and ECDF (empirical cumulative distribution function) plot. The box plot is shown here by default with the first quartile, median and third quartile, where the whiskers, according to Tukey [], extend at most 1.5 times the interquartile range (IQR) beyond the first and third quartile. Outliers (data points outside the whiskers) are marked as points []. For the Q-Q plot, a total of 10,000 quantiles are calculated per logit distribution, and these are plotted against each other for two distributions. The bin size of the histograms for the division of the logits is calculated using the Freedman-Diaconis Estimator [], whereby the bin size is calculated jointly from the two logit distributions to be compared. A line histogram is then derived from the bin histogram by replacing each bin with a point in the center of the respective bin; these points are connected to form a line. In addition, the frequency on the vertical axis is given as a percentage.
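A sketch of the joint Freedman-Diaconis binning and the derived line-histogram points (frequencies in percent), assuming two one-dimensional logit arrays; the plotting itself is omitted.

```python
import numpy as np

def line_histogram(a, b):
    """Joint Freedman-Diaconis bin edges for two distributions and the
    per-distribution line-histogram points (bin centers, frequency in %)."""
    both = np.concatenate([a, b])
    edges = np.histogram_bin_edges(both, bins="fd")   # Freedman-Diaconis estimator
    centers = 0.5 * (edges[:-1] + edges[1:])
    lines = []
    for x in (a, b):
        counts, _ = np.histogram(x, bins=edges)
        lines.append((centers, 100.0 * counts / counts.sum()))
    return lines
```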
After the visual examination, the logit distributions of selected datasets should be compared with each other using a statistical test. This is another way to show that similar distributions result in the same optimal T-values and that different distributions also have different T-values. As a statistical test, the Cramer-von-Mises two-sample test [] is used for this investigation. The null hypothesis of this test states that two samples come from the same distribution. This is considered to be fulfilled if p > α, where the standard value α = 0.05 is assumed in this study. In addition to the statistical test, the Earth Mover's Distance (EMD) [,] and the non-parametric effect size Cliff's Delta [] are calculated in order to draw further conclusions regarding a correlation between the optimal T-values and the logit distributions. If there is an exact match between two logit distributions, both EMD and Cliff's Delta are 0. So if two distributions have the same optimal T-values, then EMD and Cliff's Delta should be very small.
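Both the two-sample test and the (one-dimensional) EMD are available in SciPy; a minimal sketch is shown below (the bootstrap-based Cliff's Delta is treated separately in Section 4.2).

```python
from scipy.stats import cramervonmises_2samp, wasserstein_distance

def compare_distributions(a, b, alpha=0.05):
    """Cramer-von-Mises two-sample test and Earth Mover's Distance
    between two logit distributions a and b."""
    res = cramervonmises_2samp(a, b)
    emd = wasserstein_distance(a, b)   # 1-D EMD
    same = res.pvalue > alpha          # null hypothesis not rejected
    return res.pvalue, emd, same
```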
After the relationship between the logit distributions and the optimal T-values has been investigated, this relationship is modeled in the second step utilizing machine learning (ML) methods. The general scheme is illustrated in Figure 2. Since the number of logits depends on the number of classes and the number of images, the distributions have different sizes. Furthermore, a logit distribution sometimes consists of several million logits, which means it is not reasonable to use them directly as input for the ML model. Instead, the distribution is mapped using summary statistics, as suggested in [], which then serve as input for the ML model. As output/prediction, the optimal T-values for an IoU threshold of 0.5 and 0.75 are obtained, which minimize the respective ECE most, i.e., achieve the best calibration.
Figure 2.
Concept for modeling the relationship between the logit distribution and the optimal T-values—an ML model should predict the optimal T-values using a given logit distribution described by summary statistics.
To quantify the prediction errors, the three error metrics Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and coefficient of determination (R2) are calculated, as suggested in []. The bootstrap resampling method [,] is used to measure the uncertainty of these errors, where S corresponds to the number of bootstrap runs (S = 1000). The error metrics can be interpreted and calculated as follows, based on [,]. The MAE indicates the average deviation between the predicted T-values T̂_{i,s}^{IoU} and the true optimal T-values T_{i,s}^{IoU}, i.e., it is the average prediction error of the ML models. Here, N_s represents the number of true optimal T-values (or predicted T-values) for each bootstrap run s per IoU threshold. In addition, s ∈ {1, …, S} applies. The RMSE is also calculated from the mean of the difference between T_{i,s}^{IoU} and T̂_{i,s}^{IoU}; however, the difference is squared, which means that large deviations are weighted more heavily. Afterwards, the square root is taken of the result. With the help of R2, it is possible to describe how well the model can explain the variance of the dependent variable (the optimal T-values) based on the input data (summary statistics of the logit distribution) in comparison to the mean T̄_s^{IoU}. In general, the smaller the MAE and the RMSE are, the better the prediction of the model is, while an R2 value approaching 1 means that the variance could be explained perfectly. The MAE, RMSE, and R2 can be calculated as a function of the IoU thresholds and of the bootstrap run s as follows:

MAE_s^{IoU} = (1/N_s) · Σ_{i=1}^{N_s} | T_{i,s}^{IoU} − T̂_{i,s}^{IoU} |,   (9)

RMSE_s^{IoU} = sqrt( (1/N_s) · Σ_{i=1}^{N_s} ( T_{i,s}^{IoU} − T̂_{i,s}^{IoU} )² ),   (10)

R2_s^{IoU} = 1 − ( Σ_{i=1}^{N_s} ( T_{i,s}^{IoU} − T̂_{i,s}^{IoU} )² ) / ( Σ_{i=1}^{N_s} ( T_{i,s}^{IoU} − T̄_s^{IoU} )² ),   (11)
where,

T̄_s^{IoU} = (1/N_s) · Σ_{i=1}^{N_s} T_{i,s}^{IoU}.   (12)
In addition, all predicted T-values are rounded to the nearest multiple of 0.05 before the errors are calculated, since the true T-values are also only available in steps of 0.05. Subsequently, the median and the 95% percentile bootstrap confidence intervals [] are determined from the S bootstrap runs of the three error metrics. Therefore, the errors MAE, RMSE, and R2 are always given as the median and the 95% confidence interval (CI).
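A sketch of the per-run error computation and the percentile confidence intervals; the rounding grid of 0.05 and the 95% interval follow the text, while the function and array names are illustrative.

```python
import numpy as np

def round_to_grid(t, step=0.05):
    """Round predicted T-values to the nearest multiple of 0.05."""
    return np.round(np.asarray(t) / step) * step

def run_metrics(t_true, t_pred):
    """MAE, RMSE and R2 for one bootstrap run and one IoU threshold (Eqs. (9)-(11))."""
    t_true, t_pred = np.asarray(t_true, dtype=float), round_to_grid(t_pred)
    err = t_true - t_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((t_true - t_true.mean()) ** 2)
    return mae, rmse, r2

def summarize(values):
    """Median and 95% percentile bootstrap confidence interval over the S runs."""
    values = np.asarray(values)
    lo, hi = np.percentile(values, [2.5, 97.5])
    return np.median(values), (lo, hi)
```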
In the third and final step, the calibration error (ECE) and the accuracy (mAP) are discussed. On the one hand, it is investigated how much the ECE could be improved by the temperature scaling calibration procedure for the respective datasets, in order to assess the effectiveness of this procedure. On the other hand, the mAP is examined before and after the calibration, as it must not become significantly worse despite an improvement in the calibration error. Finally, the influence of temperature scaling on the metrics Brier Score [,] and Sigmoid Cross-Entropy [,] is briefly discussed. Temperature scaling only affects the confidence values and does not influence thresholding or post-processing.
3.2. Self-Created Object Detection Dataset
For these and further investigations, a separate dataset consisting of 10 different object classes was created. For simplicity, this will be referred to as mydataset. The dataset contains the object classes dustpan, light switch, trowel, USB flash drive, and drawing compass, which are not included in the datasets Objects365v1 [] (one of the pre-training datasets from YOLO-World) and COCO [], and the object classes cup, scissors, bottle, power outlet/socket, and computer keyboard, which appear in Objects365v1 [], LVIS [] and Open Images V7 [,].
Two different cameras were used to take the pictures: the 48-megapixel camera of the Galaxy A41 smartphone with an image resolution of 3000 × 4000 and a 5 MP Raspberry Pi camera with a 160° wide angle, whereby the resolution of the images is 640 × 480. mydataset consists of three different complexity levels: easy, medium, and hard. Half of the images in each level were taken with the Galaxy camera and the other half with the Raspberry Pi camera. Each object class occurs equally often in total and per complexity level. A summary of the most important attributes of all complexity levels can be found in Table 1.
Table 1.
Summary of the most important attributes of the different complexity levels that were considered when taking the images for the dataset mydataset.
The complexity level easy consists of a total of 800 images, i.e., 400 images were taken with the Galaxy camera and 400 with a Raspberry Pi camera. Each image contains exactly one object, i.e., each of the 10 object classes appears a total of 80 times. The easy level includes additional criteria according to which the images were taken. The background must be simple and must not contain any distractions from the object. The objects are centered in the middle of the image and not occluded. Different variants and angles of the object are allowed, as are slightly changing distances and slightly different lighting conditions, as these are practically unavoidable.
The second level of complexity (medium) consists of two different parts. In the first part, there is again exactly one object class per image, each class appearing a total of 50 times, whereby this time the distance to the object varies more strongly. This means that larger distances to the object are possible when taking the picture. In addition, the backgrounds are more complex (i.e., slightly less contrast between the object and the background) and contain additional objects, but none of the 10 object classes listed here. As in the easy complexity level, the lighting conditions are slightly variable and occlusion of objects is not allowed. The second part may contain several object classes per image, whereby the same criteria apply as in the first part. Here, each object class again appears 50 times, and each image contains a random number of the desired objects between 2 and 4. In total, each object class appears 100 times in this complexity level, which contains a total of 670 images.
In the last level (hard), partial occlusions of objects are allowed, the backgrounds become more complex, i.e., the object is no longer the focus of the image, distances become even larger and perspectives more extreme. As in the other levels, the lighting conditions differ slightly, as this cannot be avoided when taking the pictures. Each object class appears a total of 40 times, where one image contains between 4 and 8 classes. A total of 72 images are included. The complexity level hard also has an extension hard_aug, which includes all images from hard and augmented versions of all images from medium. To augment the medium images, Salt & Pepper noise is applied, where 2% of the pixels are randomly changed to black or white pixels (50% each), as well as a blurring using an 11 × 11 Gaussian kernel and a change in brightness by ±35%. This means that four augmented images are generated from each image; however, the non-augmented images of the medium level are not part of hard_aug. Thus, hard_aug comprises 2752 images.
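A sketch of the three augmentations with OpenCV/NumPy under the stated parameters; the function names are illustrative, and treating the ±35% brightness change as a multiplicative factor is an assumption.

```python
import cv2
import numpy as np

def salt_and_pepper(img, amount=0.02, seed=0):
    """Set 2% of the pixels to black or white (50% each)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape[:2]) < amount
    salt = rng.random(img.shape[:2]) < 0.5
    out[mask & salt] = 255
    out[mask & ~salt] = 0
    return out

def gaussian_blur(img, ksize=11):
    """Blur with an 11 x 11 Gaussian kernel."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def change_brightness(img, factor):
    """Scale brightness, e.g. factor = 1.35 (+35%) or 0.65 (-35%)."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def augment(img):
    """Four augmented images per medium image: noise, blur, +35%, -35% brightness."""
    return [salt_and_pepper(img), gaussian_blur(img),
            change_brightness(img, 1.35), change_brightness(img, 0.65)]
```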
An example of the images for each complexity level, which includes, among other things, the object class bottle, can be found in Figure 3. From these 4 complexity levels, two further variants can be derived. The variant all includes the 1542 images from easy, medium, and hard, whereas all_aug contains the 4222 images from easy, medium, and hard_aug. The annotation of the images was done using bounding boxes and the software used was Roboflow []. The images and annotations are available at [].
Figure 3.
Left: two images of the object class bottle from the easy complexity level, center: two images including the object class bottle from the medium complexity level, and right: one image including the object class bottle from the hard complexity level.
- Usage of GenAI
For this study, the following generative AI models were used for the following purposes:
- DeepL was used for the translation from German to English.
- ChatGPTo3 and ChatGPT5 were used to support code generation in Python 3.10.11.
4. Experiments & Results
4.1. Determination of the Optimal Ground-Truth T-Values for All Used Datasets
As already mentioned in Section 3.1, the seven datasets LVIS minival [], mydataset, Open Images V7 [,], COCO [], Pascal VOC [], EgoObjects [], and Objects365v1 [], including some variants of these, are used for the investigations in this study. Since T can take the values T ∈ {0.4, 0.45, …, 2.0}, 66 ECE values (33 values of ECE@0.5 and 33 values of ECE@0.75, respectively) must be calculated for each dataset, which corresponds to 33 evaluation runs per dataset (one run per T-value). Furthermore, the resulting optimal T-values must also be distinct from each other, which is why as many different datasets as possible should be used. Therefore, the generation of data is difficult and time-consuming.
Table 2 provides an overview of the different datasets, including their variants. The variations relate to the number of images (# images), the number of classes (# classes), whether the training, validation, or test dataset was used (train, val, test), and whether the images were selected at random (shuffle, seed). If the images were selected at random, the corresponding seed is also specified. For the Open Images variations, only a subset of the training and validation dataset is used; the same applies to EgoObjects and the COCO training dataset. Since not all images are fully annotated in LVIS minival and Open Images, the column “# images” always shows the actual number of annotated images (e.g., 4809 for LVIS minival) and the total number of images (e.g., 5000 for LVIS minival). In the case of Open Images and COCO, the images and annotations were extracted using the fiftyone library [], while for EgoObjects, the Python module random was used to draw images at random. The dataset mydataset_easy_half was generated by randomly selecting exactly half of the images from mydataset_easy, where each object class continues to be represented equally often. A total of 25 different datasets are used.
Table 2.
Overview of the used datasets and the corresponding optimal T-values.
Also shown in Table 2 are the optimal T-values for the two IoU thresholds 0.5 and 0.75, which result in the minimum ECE values ECE@0.5 and ECE@0.75 of the 33 evaluation runs per dataset. As can be seen, the optimal T-values differ between the different datasets. Each dataset also has the corresponding logit distribution, which was determined at T = 1 (see Section 3.1). Therefore, each of the 25 datasets can be assigned a data pair consisting of the two optimal T-values and the logit distribution.
4.2. Relationship Between the Logit Distribution and the Optimal T-Values
The first step in investigating the relationship between logit distributions and optimal T-values is to visually compare the distributions of selected datasets. Table 2 shows that the datasets mydataset_easy and mydataset_easy_half have exactly the same optimal T-values for both IoU thresholds. Since the optimal T-values of these two datasets are the same, the underlying logit distributions should also be similar. This is verified by the four plots, box plot, Q-Q plot, line histogram and ECDF plot, in Figure 4.
Figure 4.
Representation of the four plots between the logit distributions of the datasets mydataset_easy and mydataset_easy_half: (a) box plot, (b) line histograms, (c) Q-Q plot, (d) ECDF plot. As can be derived from the different plots, the logit distributions of the two datasets are very similar.
The two box plots of the logit distributions of mydataset_easy and mydataset_easy_half look similar because the median, IQR, and whiskers are in a comparable position. The line histograms also have a similar shape. The quantiles plotted on the Q-Q plot are almost all on the 45° line, and the ECDFs of the two distributions are also almost identical. From this visual analysis, it can be concluded that the two logit distributions are similar.
Looking at two datasets with different optimal T-values, mydataset_easy and Open Images train_3 (see Table 2), it is clear that the box plots, line histograms, and ECDF plots differ significantly from one another (see Figure 5). It should be noted that 500,000 samples were randomly selected (without replacement) from the logit distribution of Open Images train_3 to display the line histogram, as otherwise the line histogram of mydataset_easy would be much noisier. In the Q-Q plot, too, there is a clear deviation from the 45° line. This means that the two logit distributions of the datasets mydataset_easy and Open Images train_3 do not show any similarities. This simple analysis already provides an initial indication that similar logit distributions result in identical optimal T-values and that different distributions also have different T-values.
Figure 5.
Representation of the four plots between the logit distributions of the datasets mydataset_easy and Open Images train_3: (a) box plot, (b) line histograms, (c) Q-Q plot, (d) ECDF plot. As can be derived from the different plots, the logit distributions of the two datasets differ.
Using the Cramer-von-Mises two-sample test, further datasets are tested against each other with regard to equal logit distributions. As already mentioned in Section 3.1, there are no statistically significant differences in the distributions if p > 0.05. For this test, the logit distributions of some datasets, such as Open Images or LVIS minival, are not considered, as these consist of several million logits. The problem here is that according to [,], even the smallest deviations become statistically significant in large samples, and thus the p-value approaches 0, even if there are no practically significant differences. However, the Cramer-von-Mises two-sample test is sufficient for an initial assessment.
In Figure 6, the logit distributions of selected datasets were tested against each other. Among these, only the distributions of mydataset_easy and mydataset_easy_half and of Pascal VOC train and Pascal VOC trainval have the same optimal T-values (see Table 2). The results show that the Cramer-von-Mises test between the logit distributions of datasets with the same optimal T-values yields a p-value that is greater than 0.05 (these values are highlighted in bold). In all other cases, p is clearly below 0.05. The results of this comparison provide further evidence that there is a correlation between the logit distributions and the optimal T-values.
Figure 6.
Pairwise matrix showing the p-values of the Cramer-von-Mises two-sample test between the logit distributions of the selected datasets. All p-values greater than 0.05 are highlighted in bold, which means that there are no statistically significant differences between these distributions.
Further evidence that similar logit distributions can be assigned the same optimal T-values is provided by the EMD and Cliff's Delta in Figure 7. For all datasets that result in the same optimal T-values, the underlying logit distributions have very small EMD values. For example, the datasets Open Images train_2 and Open Images val_2 have the same optimal T-values for both IoU thresholds (see Table 2), and the resulting EMD has a value of 0.0195. This is very low compared to the EMD values for logit distributions with unequal optimal T-values. In general, the EMD of distributions with equal optimal T-values is less than 0.025, and these values are highlighted in bold in Figure 7. The remaining EMD values are above 0.1.

Figure 7.
Pairwise matrix of (a) EMD and (b) median and confidence interval of the m-out-of-n bootstrap of Cliff’s delta between the logit distributions of selected datasets. The smallest values, which indicate a high degree of correlation between the logit distributions, are highlighted in bold.
Since calculating Cliff’s delta is very computationally intensive, it was determined using m-out-of-n bootstrap [,]. To do this, 20,000 samples were drawn with replacement from each of the two distributions to be compared; afterwards, Cliff’s delta was calculated, and this procedure was repeated for 2000 bootstrap runs. Figure 7 shows the median and the 95% confidence intervals. This clearly shows that the median and confidence intervals for logit distributions with the same optimal T-values take the lowest values. The Cliff’s delta confidence interval for these distributions has the smallest width and the Cliff’s delta median is also very low. From this, it can be deduced that modeling this correlation is the next reasonable step.
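A sketch of the m-out-of-n bootstrap estimate of Cliff's delta with the parameters from the text (20,000 resamples per distribution, 2000 runs); the vectorized delta computation sorts one sample and counts greater/smaller pairs via searchsorted.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta between samples x and y: (#(x > y) - #(x < y)) / (|x|*|y|)."""
    y_sorted = np.sort(y)
    greater = np.searchsorted(y_sorted, x, side="left")                 # y-values below each x
    less = len(y_sorted) - np.searchsorted(y_sorted, x, side="right")   # y-values above each x
    return (greater.sum() - less.sum()) / (len(x) * len(y))

def bootstrap_cliffs_delta(a, b, m=20_000, runs=2_000, seed=0):
    """m-out-of-n bootstrap: draw m samples with replacement from each
    distribution, compute Cliff's delta, repeat; return median and 95% CI."""
    rng = np.random.default_rng(seed)
    deltas = np.empty(runs)
    for r in range(runs):
        xa = rng.choice(a, size=m, replace=True)
        xb = rng.choice(b, size=m, replace=True)
        deltas[r] = cliffs_delta(xa, xb)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return np.median(deltas), (lo, hi)
```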
4.3. Prediction of the Optimal T-Values
As already explained in Section 4.1, generating data pairs is difficult and time-consuming. As a reminder, a data pair of a dataset consists of the optimal T-values and the logit distribution at . For this reason, ML models that do not rely on huge amounts of training data are needed for modeling. Some models that meet this requirement are listed in []. These include Support Vector Machine (SVM) and Gaussian Process Regression (GPR), among others. In [], it is shown that Kernel Ridge Regression (KRR) is also a powerful model and can outperform SVM. Therefore, the three models SVM, GPR, and KRR were selected for this investigation.
According to Figure 2, the respective ML model is given a logit distribution as input, from which the optimal T-values are predicted. Bootstrap is used to evaluate the stability of the model performance of the predictions. The number of bootstrap runs is set to S = 1000. In each run, exactly 25 data pairs are randomly drawn with replacement from the data pairs in Table 2 and used as training data. The remaining data pairs are used for testing. After all runs, the errors MAE, RMSE, and R2 can be derived, as described in Section 3.1.
The accuracy of the predictions depends not only on the selection of a suitable ML model, but also on its parameters. For KRR and SVM, the kernel hyperparameter γ was defined according to []. The regularization hyperparameters α (KRR) and C (SVM) are varied over a common set of values. Furthermore, the parameter ε of the support vector machine can take several values. The determination of the optimal hyperparameters of KRR and SVM is carried out by Grid Search, i.e., all possible combinations of the defined hyperparameters are tested, in combination with Leave-One-Out cross-validation, in which exactly one data pair (from all data pairs in the training data) is used as the test sample and the remaining data pairs are used as training samples. Each data pair appears exactly once as a test sample. The models are optimized during training on the MAE; more precisely, the negative MAE is maximized. An RBF (radial basis function) kernel is used for all models (KRR, SVM, and GPR). Before the various ML models are trained, the x-values of the data pairs, i.e., the summary statistics of the logit distributions, are standardized using the mean and the standard deviation. The other settings and parameters of KRR, SVM and GPR are based on the scikit-learn [] standard implementation in Python. Further information on how the ML models KRR, SVM, and GPR work is listed in [].
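A sketch of this setup with scikit-learn (standardization, RBF kernels, Grid Search with Leave-One-Out cross-validation, negative MAE as the score); the hyperparameter grids are placeholders and not the grids used in the study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import GridSearchCV, LeaveOneOut

def fit_krr(X_train, y_train):
    """KRR with RBF kernel; alpha/gamma grids are illustrative placeholders."""
    pipe = make_pipeline(StandardScaler(), KernelRidge(kernel="rbf"))
    grid = {"kernelridge__alpha": np.logspace(-3, 1, 5),
            "kernelridge__gamma": np.logspace(-3, 1, 5)}
    return GridSearchCV(pipe, grid, cv=LeaveOneOut(),
                        scoring="neg_mean_absolute_error").fit(X_train, y_train)

def fit_svm(X_train, y_train):
    """Support vector regression with RBF kernel; grids are placeholders."""
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    grid = {"svr__C": np.logspace(-2, 2, 5),
            "svr__gamma": np.logspace(-3, 1, 5),
            "svr__epsilon": [0.01, 0.05, 0.1]}
    return GridSearchCV(pipe, grid, cv=LeaveOneOut(),
                        scoring="neg_mean_absolute_error").fit(X_train, y_train)

def fit_gpr(X_train, y_train):
    """GPR with RBF kernel and otherwise default scikit-learn settings."""
    pipe = make_pipeline(StandardScaler(), GaussianProcessRegressor(kernel=RBF()))
    return pipe.fit(X_train, y_train)
```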
To map the logit distributions, in an initial investigation based on [], the summary statistics mean, standard deviation (std), coefficient of variation (CV), skewness, kurtosis, IQR, and median are used. The results are listed in Table 3, where the best results are highlighted in bold. Based on the results, it is clear that SVM is the most powerful model, as its medians are very low and its confidence intervals (CI) are the narrowest. In general, the SVM predictions have the highest stability compared to the other models. KRR also performs well; even though its median is slightly better in some cases, its confidence intervals are usually larger. GPR performs the worst.
Table 3.
Error metrics RMSE, MAE, and R2 after predicting the T-values for the IoU thresholds 0.5 and 0.75 using bootstrap (S = 1000) and the summary statistics: mean, std, CV, skewness, kurtosis, IQR, and median. The best results are highlighted in bold.
However, the error metrics can be further reduced by using the summary statistics kurtosis, skewness, Q0.125, Q0.25, Q0.375, Q0.5, Q0.625, Q0.75, Q0.875, as suggested in [], to map the logit distribution. Table 4 shows the results, with KRR and SVM performing best. For an IoU threshold of 0.5, KRR is slightly better than SVM. In terms of MAE, KRR has a median error of 0.085, which is relatively low, with a CI of [0.027–0.396]. The CI is narrower compared to SVM and GPR, which means that KRR provides the most stable and reliable predictions. The same applies to the RMSE, even though the two confidence intervals of KRR and SVM are almost identical here. The median RMSE is also relatively low (0.139). However, the widths of the confidence intervals are generally relatively large, with a value of 0.369 for the MAE and 0.603 for the RMSE, considering that the step size of the T-values is 0.05 and T can take a minimum value of 0.4 and a maximum value of 2.0. The KRR model performance therefore varies between the different bootstrap datasets. The R2 score yields similar results. Here, the median of KRR is the best with a value of 0.784 compared to the other models, as this value should be as close to 1 as possible. Looking at the width of the confidence interval (4.137), relatively large fluctuations in the predictions are to be expected here as well. How well the model can explain the variance of the dependent variable based on the summary statistics, compared to simply using the mean, therefore varies. Similar results are obtained for an IoU threshold of 0.75, with the difference that SVM performs best there.
Table 4.
Error metrics RMSE, MAE, and R2 after predicting the T-values for the IoU thresholds 0.5 and 0.75 using bootstrap (S = 1000) and the summary statistics: kurtosis, skewness, Q0.125, Q0.25, Q0.375, Q0.5, Q0.625, Q0.75, and Q0.875. The best results are highlighted in bold.
In general, it can therefore be concluded that the median values of the error metrics indicate minor deviations between the predicted T-values and the true optimal T-values. However, the deviations vary depending on the bootstrap dataset.
Nevertheless, the results of this study show that there is a correlation between the logit distributions and the optimal T-values. Even though the results may vary depending on the data, a modeling of this correlation is possible using ML models, with SVM and KRR proving to be the best models.
- Which summary statistics (features) are most relevant?
The following analysis will examine which summary statistics (features) are most relevant for predicting T-values, whether and to what extent the features can be reduced, and which features worsen the prediction.
Table 4 shows that the nine summary statistics kurtosis, skewness, Q0.125, Q0.25, Q0.375, Q0.5, Q0.625, Q0.75, and Q0.875 provide the best results, i.e., the best medians and the best confidence intervals, after 1000 bootstrap runs. KRR and SVM perform similarly well, so only KRR is used for the following investigations, and due to the high computational effort, the number of bootstrap runs is reduced to S = 50. In a first step, all nine features are considered individually and their influence on the stability and accuracy of the prediction errors is examined. In addition, all error metrics for all features taken together are calculated again for 50 bootstrap runs as a reference. A first finding can already be derived from the results in Table 5. When viewed individually, all quantile features have relatively good median values and confidence intervals, which are significantly better than those of kurtosis and skewness. Particularly noteworthy are the features Q0.625 for the IoU threshold 0.5 and Q0.75 for the IoU threshold 0.75, whose confidence intervals are already better than those of the reference (all features).
Table 5.
Error metrics RMSE, MAE, and R2 after predicting the T-values for the IoU thresholds 0.5 and 0.75 using bootstrap (S = 50). The type and number of features (summary statistics) vary, with each feature being considered individually, except for the row all, in which all nine features are taken into account. All error metrics were calculated only for the predictions of KRR.
In the next step, the sequential forward selection (SFS) [,] method is applied to select the best feature combinations. SFS is applied without floating. Overall, this method determines the best 2, 3, …, 6 feature combinations for the respective IoU thresholds while minimizing the median MAE. Table 6 shows the best feature combinations for each IoU threshold. The best results in terms of the median and confidence interval are highlighted in bold and comprise three features for the IoU thresholds 0.5 (Q0.125, Q0.625, Q0.75) and 0.75 (Q0.125, Q0.25, Q0.75). It can be seen that the feature combination of Q0.125 and Q0.75 works particularly well. Compared to the reference (all nine features), the median of MAE, RMSE, and R2 has improved, as have the corresponding confidence intervals for both IoU thresholds.
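A simplified sketch of the forward selection loop (without floating), choosing at each step the feature whose addition most reduces a user-supplied median-MAE criterion; the evaluate callable is an assumed placeholder for the bootstrap procedure described above.

```python
def sequential_forward_selection(features, evaluate, max_features=6):
    """Greedy SFS: grow the feature set one feature at a time.

    features: list of candidate feature names (e.g. ["Q0.125", ..., "skewness"])
    evaluate: callable(list_of_features) -> median MAE over the bootstrap runs
    Returns the best subset found for each size 2..max_features.
    """
    selected, results = [], {}
    remaining = list(features)
    while remaining and len(selected) < max_features:
        scores = [(evaluate(selected + [f]), f) for f in remaining]
        best_score, best_feat = min(scores)          # minimize the median MAE
        selected.append(best_feat)
        remaining.remove(best_feat)
        if len(selected) >= 2:
            results[len(selected)] = (list(selected), best_score)
    return results
```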
Table 6.
Error metrics RMSE, MAE, and R2 after predicting the T-values for the IoU thresholds 0.5 and 0.75 using bootstrap (S = 50). SFS was used to determine the best 2, 3, …, 6 feature combinations in terms of minimizing the median MAE. The smallest errors and thus the best feature combinations for the respective IoU thresholds are highlighted in bold. All error metrics were calculated only for the predictions of KRR.
The two best feature combinations of the respective IoU threshold are finally examined for 1000 bootstrap runs in order to have a fair comparison with Table 4. The results in Table 7 show that these three features are sufficient to achieve similarly good or even better results compared to the error metrics of the KRR model in Table 4. From this, it can be concluded that the features can be significantly reduced without any deterioration in the predictions. In addition, Table 6 shows that the features kurtosis and skewness are not included in the best feature combinations, only the quantile features. Kurtosis and skewness are therefore of rather low relevance for mapping the logit distributions.
Table 7.
Error metrics RMSE, MAE, and R2 after predicting the T-values for the IoU thresholds 0.5 and 0.75 using bootstrap (S = 1000). The error metrics of the two best feature combinations, which were determined using SFS, are calculated. All error metrics were calculated only for the predictions of KRR.
4.4. Investigation of the mAP and the ECE
In order to finally determine the influence of temperature scaling, Table 8 shows the mAP and the ECE before and after the calibration with the optimal T-values. The influence of each optimal T-value on the mAP and ECE is considered individually, i.e., for each optimal T-value (for IoU = 0.5 and IoU = 0.75), the mAP and the ECE@[IoU = 0.5] and ECE@[IoU = 0.75] are calculated. For better clarity, only the short forms ECE@0.5 and ECE@0.75 are used. The calculation of the mAP was performed using mmdet.LVISMetric, which is why all annotation files of the datasets were converted to the LVIS format. For the datasets COCO, Pascal VOC and mydataset, the list neg_category_ids contains all class IDs for which there are no annotations in the respective image. Since the labeling for Open Images and Objects365 is not exhaustive, there is no entry in the list neg_category_ids for these datasets. In Table 8, the mAP values are shown in red where the mAP decreased, in green where it increased, and in blue where it remained unchanged. The ECE and the mAP are both expressed as percentages. The optimal T-values for each dataset can be found in Table 2.
Table 8.
ECE [%] and mAP [%] of the different datasets—the ECE and mAP are calculated once for T = 1.0, i.e., no temperature scaling was applied, and contrasted with the ECE and mAP of the respective optimal T-values of the IoU thresholds 0.5 and 0.75. A green highlighting of the mAP indicates an improvement in the mAP, a red highlighting indicates a deterioration, and blue indicates no change.
Based on Table 8, it can be seen that in all cases where the optimal T-value is less than 1 (T < 1), the mAP is slightly reduced or remains unchanged. The largest reduction is 1.4% for the Open Images train_1 dataset (IoU = 0.75), where the mAP before and after calibration is 42.6% and 41.2%, respectively. But in general, for the case T < 1, the influence on the mAP is rather negligible. For optimal T-values that are greater than 1 (T > 1), temperature scaling results in an increase in mAP, with the exception of the Pascal VOC val and Pascal VOC train datasets, where the mAP does not change. The strongest increase occurred for the mydataset_easy dataset, where the mAP improved by 2.4%. For most other datasets, the increase in mAP is rather small.
The ECE has been reduced by temperature scaling with the optimal T-values in all cases, i.e., a better calibration has been achieved. For the datasets LVIS minival, mydataset, Open Images, EgoObjects, and Objects365 the effect on the ECE is greatest, with a reduction of up to 3.54% (in the case of LVIS minival). The influence is slightly lower for Pascal VOC and COCO; however, it should be noted that the ECE is already relatively low at T = 1.0. It is also interesting to note that despite a sigmoid activation of the logits (see Equation (2)), in which all logits are considered independently of each other, the ECE can be improved and the mAP remains largely unaffected.
When looking at the mAP in Table 8, it can be seen that it varies greatly depending on the dataset. One possible reason for this, among others, could be the different annotations and labels used in the various datasets. Therefore, the annotation of objects is not trivial. For example, one could annotate a light switch with the frame, as it is usually done, but also without the frame. A multiple power outlet could be annotated completely as one power outlet or even each outlet individually. This is particularly interesting if a different annotation strategy was used for the pre-training dataset of the model. This inconsistency then contributes to part of the uncertainty (data uncertainty) of the model predictions.
As mentioned in Section 3.1, other metrics to indirectly measure the calibration of the model are the Brier Score (BS) and the Sigmoid Cross-Entropy (SiCE), which are briefly discussed below. The equations for calculating BS and SiCE are defined in Appendix A in Equations (A1) and (A2), respectively, based on [,] (BS) and [,] (SiCE). The corresponding notation is defined in Section 2.3. Table A1 and Table A2 show the changes in these metrics depending on the T-value. As already seen for the ECE in Table 8, the metrics are calculated once for T = 1.0 and once for the respective optimal T-values depending on the IoU thresholds. The optimal T-values are still those values that result in the lowest ECE. The calibration and reliability of the predictions are optimal if BS = 0 and SiCE = 0. The results show that BS and SiCE could always be reduced when the optimal T-value was greater than 1. In the case of T-values less than 1, the SiCE was improved in some cases and worsened in others. BS became worse in all cases. However, the T-values were also not optimized for these metrics.
5. Discussion and Conclusions
This paper examined how to determine the optimal T-value of the temperature scaling method for improving the model calibration of YOLO-World. To do this, the logit distributions of selected datasets were first compared with each other using visual representations (box plots, line histograms, Q-Q plots, and ECDF plots), a statistical test (Cramer-von-Mises two-sample test), the Earth mover’s distance (EMD) and the non-parametric effect size Cliff’s delta. The results showed that datasets with similar logit distributions also result in the same optimal T-values and datasets with different logit distributions result in different optimal T-values. The optimal T-value therefore depends on the underlying logit distribution, which was generated by YOLO-World. Based on this finding, the ML methods KRR, SVM, and GPR were utilized to model the relationship. Using the summary statistics kurtosis, skewness, Q0.125, Q0.25, Q0.375, Q0.5, Q0.625, Q0.75 and Q0.875, the logit distributions could initially be best described. The KRR and SVM models provide the best results with small median MAE and RMSE values and an R2 median close to 1 for both IoU thresholds. However, the confidence intervals are relatively wide, which means that the performance of the models varies in terms of predicting the optimal T-values. For some bootstrap variations, the predictions are very accurate, while in other cases they are rather inaccurate. In further investigations using the KRR model, the nine summary statistics could be reduced to three, achieving the same and in some cases even better results.
The temperature scaling with the optimal T-value was able to improve the calibration error ECE in all cases and this without significantly reducing the mAP. A slight negative influence on the mAP is present in most datasets in which the optimal T-value is less than 1. However, for T-values greater than 1, the mAP is increased in most cases. The extent to which the ECE can be reduced depends on the dataset (see Section 4.4), i.e., in some cases the impact is greater, while in others only a slight improvement could be achieved. Temperature scaling thus proves to be a simple and, in some cases, effective method for improving the calibration of YOLO-World.
This study also showed that this approach has some disadvantages. Various datasets with different optimal T-values are required for training the ML models, as there is no optimal T-value that is the same for all datasets. Furthermore, an evaluation run must be carried out for each dataset and for each T-value in order to determine the optimal T-value, which is time-consuming. This approach is therefore less practical for models such as mm-Grounding DINO, as these models have a significantly higher inference time.
In conclusion, this study shows that there is a correlation between the optimal T-values and the logit distributions generated by YOLO-World, which can be modeled using ML models. Through temperature scaling with the optimal T-values, the ECE could be reduced in all cases, with the extent depending on the dataset. The influence on the mAP is generally rather negligible; however, a T-value less than 1 can lead to a decrease and a T-value greater than 1 can lead to an increase in the mAP.
In future work, further methods and approaches for improving the determination of confidence values for OVD models from the fields of calibration and uncertainty estimation should be tested and evaluated.
Author Contributions
Conceptualization, M.A.I., R.M.S., I.C. and S.T.; methodology, M.A.I.; software, M.A.I.; validation, M.A.I.; formal analysis, M.A.I.; investigation, M.A.I.; resources, M.A.I.; data curation, M.A.I.; writing—original draft preparation, M.A.I.; writing—review and editing, M.A.I., R.M.S., I.C. and S.T.; visualization, M.A.I.; supervision, I.C. and S.T.; project administration, M.A.I.; funding acquisition, M.A.I., S.T. and I.C. All authors have read and agreed to the published version of the manuscript.
Funding
The APC was funded by the Open Access Publishing Fund of Anhalt University of Applied Sciences.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
We gratefully acknowledge the support of the Research, Transfer and Start-Up Center (Forschungs-, Transfer- und Gründerzentrum), the Anhalt University of Applied Sciences and the state of Saxony-Anhalt for our research. Furthermore, we acknowledge support by the Open Access Publishing Fund of Anhalt University of Applied Sciences. During the preparation of this study, the authors used DeepL for the translation of the text from German into English. In addition, the ChatGPTo3 and ChatGPT5 models were used to support the generation of the Python code used for this study. The authors have reviewed and edited the output and take full responsibility for the content of this publication.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| OVD | Open-Vocabulary Detection |
| ML | Machine Learning |
| ECE | Expected Calibration Error |
| mAP | Mean Average Precision |
| KRR | Kernel Ridge Regression |
| GPR | Gaussian Process Regression |
| SVM | Support Vector Machine |
| MAE | Mean Absolute Error |
| RMSE | Root Mean Squared Error |
| R2 | Coefficient of Determination |
| IoU | Intersection over Union |
| ECDF | Empirical Cumulative Distribution Function |
| EMD | Earth Mover’s Distance |
| BS | Brier Score |
| SiCE | Sigmoid Cross-Entropy |
Appendix A. Results of Brier Score and Sigmoid Cross-Entropy
Appendix A.1. Brier Score
The Brier Score can be calculated as follows,

BS = (1/n) · Σ_{i=1}^{n} (p̂_i − TP_i)²,   (A1)

where TP_i = 1 if sample i is a true positive and TP_i = 0 if it is a false positive.
The results of the Brier score depending on the T-value are shown in Table A1.
Table A1.
BS [%] of the different datasets—BS is calculated once for T = 1.0, i.e., no temperature scaling was applied, and contrasted with the BS of the respective optimal T-values of the IoU thresholds 0.5 and 0.75. The optimal T-values are the ECE-optimized ones.
| Dataset | T = 1.0 | opt. T-value@[IoU = 0.5] | opt. T-value@[IoU = 0.75] | |||
|---|---|---|---|---|---|---|
| BS@0.5 | BS@0.75 | BS@0.5 | BS@0.75 | BS@0.5 | BS@0.75 | |
| LVIS minival | 2.40 | 2.23 | 2.59 | 2.34 | 2.93 | 2.64 |
| mydataset_easy_half | 4.66 | 4.68 | 0.91 | 0.91 | 0.91 | 0.91 |
| mydataset_easy | 4.70 | 4.70 | 0.89 | 0.89 | 0.89 | 0.89 |
| mydataset_medium | 2.38 | 2.23 | 1.05 | 0.99 | 1.05 | 0.99 |
| mydataset_hard | 4.58 | 4.04 | 1.25 | 1.07 | 1.31 | 1.13 |
| mydataset_hard_aug | 2.48 | 2.32 | 0.95 | 0.88 | 1.14 | 1.06 |
| mydataset_all | 3.28 | 3.11 | 0.99 | 0.93 | 0.99 | 0.93 |
| mydataset_all_aug | 2.63 | 2.49 | 1.03 | 0.96 | 1.03 | 0.96 |
| Open Images val_1 | 1.81 | 1.52 | 2.09 | 1.67 | 2.39 | 1.90 |
| Open Images val_2 | 1.81 | 1.51 | 1.96 | 1.57 | 2.51 | 2.00 |
| Open Images val_3 | 1.86 | 1.56 | 1.99 | 1.60 | 2.54 | 2.02 |
| Open Images train_1 | 1.94 | 1.66 | 2.15 | 1.77 | 2.78 | 2.29 |
| Open Images train_2 | 2.15 | 1.79 | 2.30 | 1.84 | 2.87 | 2.29 |
| Open Images train_3 | 2.17 | 1.79 | 2.37 | 1.90 | 2.91 | 2.32 |
| COCO train_1 | 1.90 | 1.84 | 2.42 | 2.34 | 2.42 | 2.34 |
| COCO train_2 | 1.91 | 1.84 | 2.44 | 2.35 | 2.44 | 2.35 |
| COCO train_3 | 1.92 | 1.85 | 2.44 | 2.35 | 2.44 | 2.35 |
| COCO val | 1.95 | 1.87 | 2.47 | 2.38 | 2.47 | 2.38 |
| Pascal VOC val | 1.37 | 1.49 | 2.97 | 3.27 | 0.70 | 0.76 |
| Pascal VOC train | 1.37 | 1.50 | 2.98 | 3.31 | 0.64 | 0.70 |
| Pascal VOC trainval | 1.37 | 1.50 | 2.97 | 3.29 | 0.63 | 0.68 |
| EgoObjects val | 1.97 | 1.89 | 2.14 | 2.02 | 2.14 | 2.02 |
| EgoObjects train | 1.92 | 1.83 | 2.09 | 1.94 | 2.09 | 1.94 |
| Objects365 tiny val | 1.73 | 1.55 | 0.86 | 0.76 | 1.01 | 0.90 |
| Objects365 tiny train | 2.05 | 1.83 | 0.95 | 0.85 | 1.14 | 1.02 |
Appendix A.2. Sigmoid Cross-Entropy
The Sigmoid Cross-Entropy can be calculated as follows:

$$\mathrm{SiCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + \left(1 - y_i\right)\log\left(1 - \hat{p}_i\right)\right],$$

where, as above, $\hat{p}_i = \sigma\!\left(z_i/T\right)$ is the (temperature-scaled) sigmoid confidence of detection $i$ and $y_i \in \{0, 1\}$ the corresponding ground-truth label at the chosen IoU threshold.
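As a companion to the Brier Score sketch above, the following is a minimal NumPy implementation of this metric under the same naming assumptions; the temperature `T` defaults to 1.0, which corresponds to no temperature scaling.

```python
import numpy as np

def sigmoid_cross_entropy(logits: np.ndarray, labels: np.ndarray, T: float = 1.0) -> float:
    """Binary cross-entropy of temperature-scaled sigmoid confidences.

    logits: raw per-detection logits
    labels: 1 if the detection is a true positive at the chosen IoU threshold, else 0
    T: temperature; T = 1.0 means no temperature scaling
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    p = 1.0 / (1.0 + np.exp(-logits / T))   # temperature-scaled sigmoid confidence
    p = np.clip(p, 1e-12, 1.0 - 1e-12)      # guard against log(0)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))
```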
The results of the Sigmoid Cross-Entropy depending on the T-value are shown in Table A2.
Table A2.
SiCE [%] of the different datasets. SiCE is calculated once for T = 1.0, i.e., without temperature scaling, and contrasted with the SiCE obtained for the respective optimal T-values at the IoU thresholds 0.5 and 0.75. The optimal T-values are the ECE-optimized ones.
| Dataset | SiCE@0.5 (T = 1.0) | SiCE@0.75 (T = 1.0) | SiCE@0.5 (opt. T@IoU = 0.5) | SiCE@0.75 (opt. T@IoU = 0.5) | SiCE@0.5 (opt. T@IoU = 0.75) | SiCE@0.75 (opt. T@IoU = 0.75) |
|---|---|---|---|---|---|---|
| LVIS minival | 10.82 | 10.31 | 10.38 | 9.35 | 11.69 | 10.47 |
| mydataset_easy_half | 20.31 | 20.45 | 4.14 | 4.16 | 4.14 | 4.16 |
| mydataset_easy | 20.74 | 20.81 | 4.06 | 4.07 | 4.06 | 4.07 |
| mydataset_medium | 9.61 | 8.95 | 4.46 | 4.23 | 4.46 | 4.23 |
| mydataset_hard | 20.84 | 18.28 | 5.70 | 4.95 | 5.99 | 5.20 |
| mydataset_hard_aug | 10.27 | 9.56 | 4.22 | 3.91 | 4.96 | 4.61 |
| mydataset_all | 14.01 | 13.27 | 4.47 | 4.24 | 4.47 | 4.24 |
| mydataset_all_aug | 10.97 | 10.32 | 4.56 | 4.28 | 4.56 | 4.28 |
| Open Images val_1 | 8.55 | 7.61 | 9.01 | 7.26 | 10.24 | 8.16 |
| Open Images val_2 | 8.43 | 7.47 | 8.44 | 6.86 | 10.72 | 8.52 |
| Open Images val_3 | 8.63 | 7.65 | 8.54 | 6.95 | 10.82 | 8.61 |
| Open Images train_1 | 8.71 | 7.79 | 8.91 | 7.40 | 11.41 | 9.31 |
| Open Images train_2 | 9.52 | 8.42 | 9.47 | 7.70 | 11.87 | 9.42 |
| Open Images train_3 | 9.31 | 8.07 | 9.66 | 7.83 | 11.87 | 9.41 |
| COCO train_1 | 7.40 | 7.14 | 9.26 | 8.87 | 9.26 | 8.87 |
| COCO train_2 | 7.46 | 7.16 | 9.31 | 8.90 | 9.31 | 8.90 |
| COCO train_3 | 7.50 | 7.20 | 9.38 | 8.91 | 9.38 | 8.91 |
| COCO val | 7.60 | 7.26 | 9.47 | 9.01 | 9.47 | 9.01 |
| Pascal VOC val | 5.34 | 5.68 | 10.94 | 11.70 | 3.06 | 3.22 |
| Pascal VOC train | 5.32 | 5.70 | 10.86 | 11.79 | 2.86 | 3.02 |
| Pascal VOC trainval | 5.33 | 5.69 | 10.90 | 11.74 | 2.82 | 2.97 |
| EgoObjects val | 9.63 | 9.41 | 8.73 | 8.22 | 8.73 | 8.22 |
| EgoObjects train | 9.46 | 9.20 | 8.52 | 7.92 | 8.52 | 7.92 |
| Objects365 tiny val | 7.61 | 6.83 | 3.99 | 3.61 | 4.63 | 4.16 |
| Objects365 tiny train | 8.77 | 7.84 | 4.25 | 3.83 | 5.01 | 4.51 |
References
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975. [Google Scholar] [CrossRef]
- Zhao, X.; Chen, Y.; Xu, S.; Li, X.; Wang, X.; Li, Y.; Huang, H. An Open and Comprehensive Pipeline for Unified Object Grounding and Detection. arXiv 2024, arXiv:2401.02361. [Google Scholar] [CrossRef]
- Yao, L.; Pi, R.; Han, J.; Liang, X.; Xu, H.; Zhang, W.; Li, Z.; Xu, D. DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27391–27401. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 38–55. [Google Scholar]
- Ren, T.; Jiang, Q.; Liu, S.; Zeng, Z.; Liu, W.; Gao, H.; Huang, H.; Ma, Z.; Jiang, X.; Chen, Y.; et al. Grounding DINO 1.5: Advance the “Edge” of Open-Set Object Detection. arXiv 2024, arXiv:2405.10300. [Google Scholar] [CrossRef]
- Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16901–16911. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of Machine Learning Research, Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; JMLR, Inc.: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR-Modulated Detection for End-to-End Multi-Modal Understanding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1760–1770. [Google Scholar] [CrossRef]
- Gupta, A.; Dollar, P.; Girshick, R. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8429–8438. [Google Scholar] [CrossRef]
- Tencent AI Lab. YOLO-World (GitHub Repository). Available online: https://github.com/AILab-CVC/YOLO-World (accessed on 31 July 2025).
- Ingrisch, M.A.; Rajanayagam, S.; Chmielewski, I.; Twieg, S. Calibration of the Open-Vocabulary Model YOLO-World by Using Temperature Scaling. Hochsch. Anhalt. 2025, 13, 155–160. [Google Scholar] [CrossRef]
- Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.; et al. A survey of uncertainty in deep neural networks. Artif. Intell. Rev. 2023, 56, 1513–1589. [Google Scholar] [CrossRef]
- Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of Machine Learning Research, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; JMLR, Inc.: Cambridge, MA, USA, 2017; Volume 70, pp. 1321–1330. [Google Scholar]
- LeVine, W.; Pikus, B.; Raja, P.; Gil, F.A. Enabling Calibration in the Zero-Shot Inference of Large Vision-Language Models. arXiv 2023, arXiv:2303.12748. [Google Scholar] [CrossRef]
- Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Ser. B (Methodol.) 2018, 36, 111–133. [Google Scholar] [CrossRef]
- Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef]
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Duerig, T.; et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv 2018, arXiv:1811.00982. [Google Scholar] [CrossRef]
- Krasin, I.; Duerig, T.; Alldrin, N.; Ferrari, V.; Abu-El-Haija, S.; Kuznetsova, A.; Rom, H.; Uijlings, J.; Popov, S.; Kamali, S.; et al. OpenImages: A public Dataset for Large-Scale Multi-Label and Multi-Class Image Classification. 2017. Available online: https://storage.googleapis.com/openimages/web/index.html (accessed on 31 July 2025).
- Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Zhu, C.; Xiao, F.; Alvarado, A.; Babaei, Y.; Hu, J.; El-Mohri, H.; Culatana, S.; Sumbaly, R.; Yan, Z. EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 20110–20120. [Google Scholar]
- Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Boston, MA, USA, 1977. [Google Scholar]
- Nuzzo, R.L. The Box Plots Alternative for Visualizing Quantitative Data. PM&R 2016, 8, 268–272. [Google Scholar] [CrossRef]
- Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1992. [Google Scholar]
- Anderson, T.W. On the Distribution of the Two-Sample Cramer-von Mises Criterion. Ann. Math. Stat. 1962, 33, 1148–1159. [Google Scholar] [CrossRef]
- Rubner, Y.; Tomasi, C.; Guibas, L. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Mumbai, India, 4–7 January 1998; pp. 59–66. [Google Scholar] [CrossRef]
- Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
- Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 1993, 114, 494–509. [Google Scholar] [CrossRef]
- Zhu, J.Y.; Zhang, H.X.; Guo, J.L.; Feng, J.L. Data distributions automatic identification based on SOM and support vector machines. In Proceedings of the International Conference on Machine Learning and Cybernetics, Beijing, China, 4–5 November 2002; Volume 1, pp. 340–344. [Google Scholar] [CrossRef]
- Kumar, B.; Ramya, R.; Kurian, A.G.; J, J.; Prasad, R.; Fathima, S.; Suresh, A. Improving Performance of Supervised Machine Learning Algorithms on Small Datasets. In Proceedings of the 2024 4th International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS), Gobichettipalayam, India, 12–13 December 2024; pp. 669–674. [Google Scholar] [CrossRef]
- Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
- Horowitz, J.L. Bootstrap Methods in Econometrics. Annu. Rev. Econ. 2019, 11, 193–224. [Google Scholar] [CrossRef]
- Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
- Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans. In CBMS-NSF Regional Conference Series in Applied Mathematics; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1982. [Google Scholar] [CrossRef]
- Brier, G.W. Verification of Forecasts Expressed in Terms of Probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
- Roulston, M.S. Performance targets and the Brier score. Meteorol. Appl. 2007, 14, 185–194. [Google Scholar] [CrossRef]
- Janthakal, S.; Hosalli, G. A Binary Cross Entropy U-net based Lesion Segmentation of Granular Parakeratosis. In Proceedings of the 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 8 October 2021; pp. 1–7. [Google Scholar] [CrossRef]
- Terven, J.; Cordova-Esparza, D.M.; Romero-González, J.A.; Ramírez-Pedraza, A.; Chávez-Urbiola, E.A. A comprehensive survey of loss functions and metrics in deep learning. Artif. Intell. Rev. 2025, 58, 195. [Google Scholar] [CrossRef]
- Dwyer, B.; Nelson, J.; Hansen, T. Roboflow (Version 1.0) [Software]. Computer Vision. 2025. Available online: https://roboflow.com (accessed on 8 November 2025).
- Ingrisch, M.A. Galaxy-Raspberry Pi Dataset. 2025. Available online: https://universe.roboflow.com/annotation-mxzvl/galaxy-raspberry-pi-dtprq (accessed on 31 July 2025).
- Moore, B.E.; Corso, J.J. FiftyOne (GitHub Repository). Available online: https://github.com/voxel51/fiftyone (accessed on 31 July 2025).
- Lin, M.; Lucas, H.C.; Shmueli, G. Research Commentary: Too Big to Fail: Large Samples and the p-Value Problem. Inf. Syst. Res. 2013, 24, 906–917. [Google Scholar] [CrossRef]
- Demidenko, E. The p-Value You Can’t Buy. Am. Stat. 2016, 70, 33–38. [Google Scholar] [CrossRef] [PubMed]
- Bickel, P.J.; Götze, F.; van Zwet, W.R. Resampling Fewer Than n Observations: Gains, Losses, and Remedies for Losses. Stat. Sin. 1997, 7, 1–31. [Google Scholar]
- Lee, S.M.S. m-Out-of-n Bootstrap. In Wiley StatsRef: Statistics Reference Online; Wiley: Hoboken, NJ, USA, 2017. [Google Scholar] [CrossRef]
- Xu, P.; Ji, X.; Li, M.; Lu, W. Small data machine learning in materials science. npj Comput. Mater. 2023, 9, 42. [Google Scholar] [CrossRef]
- Mohapatra, P. Microarray Medical Data Classification Using Kernel Ridge Regression and Modified Cat Swarm Optimization Based Gene Selection System. Swarm Evol. Comput. 2016, 28, 144–160. [Google Scholar] [CrossRef]
- Stuke, A.; Rinke, P.; Todorović, M. Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization. Mach. Learn. Sci. Technol. 2021, 2, 035022. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
- Whitney, A. A Direct Method of Nonparametric Measurement Selection. IEEE Trans. Comput. 1971, C-20, 1100–1103. [Google Scholar] [CrossRef]
- Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett. 1994, 15, 1119–1125. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).