Teacher–Student Model Using Grounding DINO and You Only Look Once for Multi-Sensor-Based Object Detection

Abstract: Object detection is a crucial research topic in computer vision and artificial intelligence, involving the identification and classification of objects within images. Recent advancements in deep learning, such as YOLO (You Only Look Once), Faster R-CNN, and the SSD (Single Shot Detector), have demonstrated high object detection performance. This study utilizes the YOLOv8 model for real-time object detection in environments requiring fast inference speeds, specifically CCTV and automotive dashcam scenarios. Experiments were conducted using the 'Multi-Image Identical Situation and Object Identification Data' provided by AI Hub, consisting of multi-image datasets captured in identical situations using CCTV, dashcams, and smartphones. Object detection experiments were performed on these three types of multi-image datasets. Despite the utility of YOLO, its performance on the AI Hub dataset needs improvement, so Grounding DINO, a zero-shot object detector with high mAP performance, is employed. While Grounding DINO enables efficient auto-labeling, its processing speed is slower than YOLO's, making it unsuitable for real-time object detection scenarios. This study conducts object detection experiments using the publicly available labels and utilizes Grounding DINO as a teacher model for auto-labeling. The generated labels are then used to train YOLO as a student model, and performance is compared and analyzed. Experimental results demonstrate that using auto-generated labels for object detection does not degrade performance, while the combination of auto-labeling and manual labeling significantly enhances it. Additionally, an analysis of the dataset, which contains data from CCTV, dashcams, and smartphones, reveals how each device type affects recognition accuracy on the others.
Through Grounding DINO, this study demonstrates that auto-labeling technology contributes to efficiency and performance enhancement in object detection, showing its practical applicability.

This study compares the object detection performance of CCTV, dashcam, and smartphone images, analyzing the influence of each device type on the detection accuracy of the others. Real-time object detection is a key requirement for various applications, such as safety and security, traffic flow management, and emergency response, using CCTV, dashcam, and smartphone videos [22,23,26]. To address this, this experiment employs the YOLOv8 model, known for its superior performance in terms of speed and parameter count [30]. However, real-time object detection algorithms like YOLO, while efficient in speed and parameter count, may have lower detection rates than heavier models. This study therefore aims to improve performance while maintaining lightweight characteristics. Additionally, the field of object detection has seen significant progress with the advancement of computer vision and machine learning technologies. Nevertheless, accurate and reliable object detection relies heavily on meticulous labeling, a process that is difficult, time-consuming, and prone to human error. In response to these challenges, this study explores automated methods for efficient object detection.
This study proposes utilizing Grounding DINO [29] to enhance object detection performance in CCTV, dashcam, and smartphone videos and to alleviate the difficulties of manual annotation through auto-labeling technology. Grounding DINO serves as a high-performance zero-shot object detector capable of efficient auto-labeling. However, it may not be suitable for real-time object detection scenarios due to its slower processing speed compared to YOLO [26].
This study conducts object detection experiments using the publicly available labels and employs Grounding DINO as a teacher model for auto-labeling. The generated labels are then used to train YOLO as a student model for object detection experiments, allowing for a comparative analysis of performance.
In conclusion, this study confirms that there is no discernible performance degradation when employing automatically generated labels for object detection. The combined use of auto-labeling and manual labeling, in a mixed-method approach, proves to be an effective strategy for performance enhancement. This research examines how auto-labeling improves the accuracy and efficiency of object detection compared to manual labeling, and investigates how different device types influence detection accuracy across devices. The findings underscore the significant contribution of auto-labeling technology, specifically Grounding DINO, to efficiency and performance improvement in object detection. These advancements are anticipated to yield positive impacts across diverse applications, ranging from autonomous driving to intelligent surveillance systems.

Object Detection
Object detection has long been a fundamental topic in computer vision, and the advancement of deep learning technology has revolutionized this field. Continuous research and development have explored various methodologies and models. Here, we provide a brief overview of key methodologies and models in object detection.
Traditional approaches: Viola and Jones (2001) introduced foundational work on object detection using Haar-like features and a cascaded classifier. This method laid the groundwork for real-time face detection and inspired subsequent research [20].
Region-based CNNs: Girshick et al. (2014) introduced the Region-based CNN (R-CNN) [21], which changed the paradigm of object detection by using selective-search region proposals for CNN-based classification. Although effective, R-CNN had high computational costs. Fast R-CNN [22] introduced region-of-interest pooling layers for end-to-end training and faster computation, and Ren et al. (2015) proposed Faster R-CNN [23], which balances accuracy and computational efficiency by generating proposals with a learned region proposal network. Mask R-CNN [24] performs simultaneous object detection and instance segmentation (masking) in a single deep-learning-based model. EfficientDet [27] achieved top-level performance across various datasets by introducing a compound scaling method to optimize model parameters.
DINO (DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection): DINO [28] improves both performance and efficiency relative to earlier DETR-like models. This is achieved through contrastive denoising training, a mixed query selection method for anchor initialization, and a 'look forward twice' scheme for box prediction.
Grounding DINO [29]: This technology performs zero-shot object detection, meaning the model operates on new classes without labeled examples during training. Unlike conventional object detection with models trained on annotated data, Grounding DINO can detect objects even without annotated data for new classes. The core concept of Grounding DINO is the interaction ('grounding') between the visual features of objects and their class names. It utilizes a pre-trained DINO to detect objects of new classes and links the visual features of detected objects with their class names.

Research on Utilizing Various Video Sources
Utilizing various video sources for object detection is a crucial research area: fast and accurate detection requires accounting for the unique characteristics of each source.
In modern society, diverse video sources such as CCTV, dashcams, and smartphones are spreading rapidly. Each of these sources has distinct characteristics, requiring special considerations for object detection. CCTV is widely installed in public places, commercial facilities, and residential areas. Dashcams are used to record situations on the road, while smartphones serve as portable video recording devices for swiftly capturing events in one's surroundings.
Research on object detection in CCTV videos: Studies [8,9,11] address the detection and tracking of objects in real-time CCTV scenarios. One study [9] specifically explores detecting weapons in real-time CCTV footage using various models. Another study [10] introduces an approach that uses heterogeneous training data and data augmentation to maximize detection rates in CCTV scenes; it models and predicts the evolution of camera-specific parameters using the spatial transformation parameters of objects and optimizes the detector accordingly.
Research on object detection in dashcam videos: One study [11] presents an example of using state-of-the-art image processing algorithms on dashcam videos to safely detect traffic signals while driving. Another study [12] proposes a Dynamic Spatial Attention (DSA) recurrent neural network (RNN) and collects and publishes a dataset for predicting accidents from dashcam videos.
Additionally, another study [13] proposes utilizing dashboard cameras to develop a practical anomaly detection system. Focusing on driver safety issues such as lane departure and following distance, it notes that traditional model-based computer vision algorithms have limitations in addressing the diversity of risks on the road, emphasizing the importance of an approach specialized for dashcam data. Furthermore, another study [14] aims to detect road anomalies, such as potholes, in dashcam videos to alert drivers to road irregularities and reduce accidents.
Research on object detection in smartphone videos: Various studies explore object detection using smartphone videos. One study focuses on detecting pests using smartphone videos [15], while others [16,17] propose real-time object recognition systems optimized for speed and minimal performance degradation under the constrained resources and power consumption of smartphones.
Furthermore, research [6,7] aims to improve the accuracy of object detection by combining information from various devices at different viewpoints. The growing number of studies on object detection and tracking using videos collected from multiple cameras demonstrates continuous and effective development in the field. These diverse studies underscore ongoing efforts to understand and enhance object detection in various environments, from CCTV and dashcam to smartphone images.

Proposed Method
Dataset: This study used datasets from 'The Open AI Dataset Project (Multi-Video Same Situation and Object Identification Data)'. All data information can be accessed through AI-Hub (www.aihub.or.kr (accessed on 27 February 2024)). Figures 1-3 show images captured from CCTV, a dashcam, and a smartphone, respectively. The dataset was collected using CCTV, dashcam, and smartphone devices in the same scenarios, with 12 images captured per device in each scenario. It is currently a limited beta version with restricted public access, containing incidents related specifically to collisions between humans and bicycles. The images are provided in full HD (1920 × 1080) resolution. As the data were collected in identical scenarios, the frame rate is consistent, ensuring a uniform number of images across device types, although slight inconsistencies exist in the public release due to omissions and errors. The object detection classes are (1) person, (2) scooter, (3) vehicle, and (4) bicycle; however, the scooter class is not included in the publicly available dataset. The three types of data are denoted as CCTV (CT), dashcam (black box, BB), and smartphone (SP), respectively.
A brief examination of the dataset's device-specific characteristics shows that the CCTV footage is captured from an overhead perspective, as expected of actual CCTV. The dashcam footage appears dark due to vehicle window tinting, indicating a view from inside the vehicle. The smartphone footage shares a similar angle with the dashcam footage but presents a comparatively cleaner image.
The total dataset consists of 3811 images for CCTV, 3816 for the dashcam, and 3804 for the smartphone, showing a slight inconsistency in the total number of images per device.

Utilizing auto-labeling with Grounding DINO [29]: Although the YOLO model boasts excellent speed and parameter efficiency, its detection performance on this dataset did not meet expectations. Options for improvement included altering the model structure, employing more effective transfer learning, or enhancing label quality. Since the YOLO model is well developed and transfer learning has already been conducted effectively on the COCO dataset [31], changing the model posed difficulties. Therefore, the primary approach focused on enhancing the label quality of this dataset, with the expectation that this would improve performance.
Recently, the zero-shot object detection model Grounding DINO has demonstrated outstanding detection performance. However, at 8.37 FPS on an A100 GPU, significantly lower than YOLO's 300 FPS or higher, Grounding DINO is too slow for real-time object detection purposes. To address this, the results of its zero-shot object detection are instead used for auto-labeling. The advantages of auto-labeling include cost and time savings, as well as greater consistency and accuracy compared to manual labeling efforts.
Figure 5 illustrates the proposed method: images are provided to the Grounding DINO model without the training set labels, and it detects the four target classes (person, scooter, vehicle, and bicycle); the bounding box information of these results is then used as YOLO's training labels. To measure the reliability of the auto-labels, the mean average precision (mAP) against the training set labels was calculated, yielding a high score of 0.789. This indicates that auto-labeling effectively generated labels similar to the manual labels. The approach functions like a teacher-student relationship, in which the Grounding DINO model imparts knowledge to the YOLO model.

Combining auto-labels with manual labels: When object detection was trained using the auto-labels generated by Grounding DINO, performance did not degrade. While these auto-labels are meaningful from the perspective of dataset construction, they alone did not significantly improve object detection performance in this experiment. Therefore, a method was considered to enhance the quality of the training labels by using auto-labels and manual labels simultaneously. The combination was applied straightforwardly, without weighing the strengths and weaknesses of each method. Although auto-labels and manual labels each have advantages and disadvantages, simply combining them can produce overlapping coordinates for reliable, clearly visible objects, resulting in dual labeling for a single object, while less reliable objects are likely to appear in only one label. This approach thus effectively assigns greater weight to important objects.
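The reliability check described above can be sketched in a few lines: the teacher's auto-labels are matched against the manual boxes by intersection-over-union (IoU). This is a minimal illustration, not the paper's evaluation code; the corner-coordinate box format, the 0.5 threshold, and all names are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def agreement_rate(auto_boxes, manual_boxes, thr=0.5):
    """Fraction of manual boxes matched by at least one auto box at IoU >= thr."""
    if not manual_boxes:
        return 1.0
    matched = sum(
        1 for m in manual_boxes if any(iou(a, m) >= thr for a in auto_boxes)
    )
    return matched / len(manual_boxes)
```

A full mAP computation additionally ranks detections by confidence and averages precision over recall levels; the simple agreement rate above captures only the matching step.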
The fusion method produces the fusion label by adding together the object information of the manual label and the auto-label, as shown in Figure 6. Pairs with a high probability of being the same object are indicated with blue and red boxes based on the class coordinates, while classes that are not displayed indicate objects not detected by auto-labeling. For highly reliable objects, the fusion label can thus be interpreted as double-labeling.
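The fusion step above amounts to concatenating the two label sets, so objects found by both sources appear twice (the intended double-labeling that up-weights reliable objects). A minimal sketch, assuming YOLO-format tuples (class, x, y, w, h); the example values are illustrative, not taken from the dataset:

```python
def fuse_labels(manual, auto):
    """Concatenate manual and auto labels into one fusion label list."""
    return list(manual) + list(auto)

manual = [(0, 0.50, 0.50, 0.10, 0.20)]   # person, manual annotation
auto = [(0, 0.51, 0.49, 0.11, 0.21),     # same person, found by the teacher
        (3, 0.30, 0.60, 0.05, 0.10)]     # bicycle found only by auto-labeling
fused = fuse_labels(manual, auto)        # the person is now labeled twice
```

Writing `fused` out as a YOLO label file yields the fusion label of Figure 6.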

Dataset Group Splitting
To investigate the impact of cross-device training on testing, the training and validation data were divided into seven groups (Tables 2 and 3): group 1 trained on CT, BB, and SP together; group 2 on CT only; group 3 on BB only; group 4 on SP only; group 5 on CT and BB; group 6 on CT and SP; and group 7 on BB and SP. For result verification, the test data were divided into four groups (Table 4): group 1 tested on CT, BB, and SP together; group 2 on CT only; group 3 on BB only; and group 4 on SP only. Details of the group divisions for training and testing can be found in Tables 2-4.
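The group splitting above can be expressed as device-set combinations. A sketch under assumed names; the per-device file lists are placeholders standing in for the actual image lists:

```python
# Placeholder image lists per device type (CT = CCTV, BB = black box/dashcam,
# SP = smartphone); in practice these hold the real file paths.
DEVICES = {"CT": ["ct_0001.jpg"], "BB": ["bb_0001.jpg"], "SP": ["sp_0001.jpg"]}

# Seven training/validation groups (Tables 2 and 3).
TRAIN_GROUPS = {
    1: ("CT", "BB", "SP"), 2: ("CT",), 3: ("BB",), 4: ("SP",),
    5: ("CT", "BB"), 6: ("CT", "SP"), 7: ("BB", "SP"),
}
# Four test groups (Table 4).
TEST_GROUPS = {1: ("CT", "BB", "SP"), 2: ("CT",), 3: ("BB",), 4: ("SP",)}

def group_files(group, devices=DEVICES):
    """Collect the image list for a device-set group."""
    return [f for d in group for f in devices[d]]
```

Each (training group, test group) pair then yields one entry of the result tables.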

Manual Label-Based Object Detection Experiment
The first experiment involves conducting object detection using the manually labeled data provided by AI Hub. The experimental setup uses the YOLOv8n.pt model, which loads parameters pre-trained on the extensive COCO dataset. Training is performed with a batch size of 32 over 100 epochs.
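The setup above can be sketched with the Ultralytics API (pip install ultralytics). The dataset YAML path is a placeholder assumption; the batch size and epoch count follow the text, and "yolov8n.pt" loads the COCO pre-trained YOLOv8-nano checkpoint. This is an illustrative sketch, not the authors' training script.

```python
# Training settings taken from the text; "group1.yaml" is a placeholder
# pointing at the images/labels of the chosen data group.
HYPERPARAMS = {"data": "group1.yaml", "epochs": 100, "batch": 32}

def train_group():
    from ultralytics import YOLO  # imported here so the sketch stays self-contained
    model = YOLO("yolov8n.pt")    # COCO pre-trained YOLOv8-nano weights
    model.train(**HYPERPARAMS)    # fine-tune on the selected data group
    return model
```

Training from scratch (the later experiment without pre-trained weights) would instead start from a model YAML definition rather than a .pt checkpoint.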
Figure 7 presents the results of the object detection experiment trained with manual labels on the entirety of data group 1. Stable training is visually confirmed, as the loss decreases and the mAP values increase during training. Additionally, the confusion matrix allows an examination of the class distribution in the data and the number of correctly predicted classes. Figure 8 shows the confusion matrix for the object detection experiment trained with manual labels for group 1 (left: original matrix; right: normalized matrix). The class distribution is confirmed to be people, bicycles, and vehicles, with no scooters. In terms of precision, bicycles exhibit the highest results. Up to this point, we have examined performance metrics during training; from the next section, the focus shifts to results on the test dataset. Experiments were conducted by dividing the data into seven training groups and four testing groups, measuring precision, recall, mAP50, and mAP50-95. To facilitate comparison across groups, emphasis is primarily placed on mAP50, a commonly used key performance metric. While Table 3 represents the division of the validation groups, it is predominantly referred to as the training group in the result analysis.
Table 5 presents the test results for manual-label-based training across the different groups. The dashcam (black box, BB) and smartphone (SP) exhibit similar trends, possibly due to the similarity of their views. However, the overall performance of the dashcam is consistently lower than that of the smartphone, possibly because dashcam footage is darkened by window tinting. Additionally, for test group 2, a decline in performance is noted for training groups 1, 5, and 6 relative to training on CCTV alone, despite their larger datasets. This decline is attributed to training with data from other devices, whose viewing angles differ from those of CCTV footage. Training group 3 shows the lowest overall performance, suggesting potential issues with the quality of the dashcam data or difficulties in identification. The subsequent experiment was conducted without a pre-trained model. Although it is generally acknowledged that fine-tuning a model pre-trained on a large dataset yields superior performance, performance degradation is possible due to differences in data domains or issues such as overfitting.
The results are shown in Table 6: in all experiments, performance was consistently lower than with a pre-trained model (Table 5). To assess whether an insufficient amount of training data was a contributing factor, we trained group 1 for up to 200 epochs. The 200-epoch results were superior to the 100-epoch results but still markedly inferior to the model trained for 100 epochs from pre-trained weights. From this experiment, we conclude that fine-tuning a pre-trained model leads to superior performance.

Next, we trained and evaluated object detection using the auto-labels obtained through Grounding DINO; the results are given in Table 7. While performance was sometimes lower than with manual labels, it was higher for group 1, trained on the entire dataset. Although determining overall superiority is difficult, performance was generally satisfactory.

Finally, we trained and evaluated object detection using labels combining manual and auto-labels. The results are presented in Table 8: overall performance was highest with the combined labels, compared with manual labels or auto-labels alone. This suggests that assigning weights to important objects contributed to the superior performance of the combined labels. When training group 2 with auto-labels, the evaluation performance was lower overall than with manual labels. We interpret this to mean that the auto-labels generated by Grounding DINO, which was trained on a large-scale dataset, may not suit top-down viewing angles such as those of CCTV footage, so the quality of the auto-labels is presumably lower for CCTV videos. Detection test results trained with manual labels show a significant performance difference among device groups, whereas test results using auto-labels reduce the gap between groups. This indicates that auto-labels provide consistent labeling without significant bias compared to manual labels.
In summary, no performance degradation was observed when using auto-labels for object detection, and combining auto-labels with manual labels enhanced performance. To assess accuracy and efficiency, auto-labels were compared with manual labels, and the impact of the different device types on one another was analyzed.

Conclusions and Future Works
In this study, it was confirmed that there is no degradation in object detection performance when utilizing automatically generated labels. Furthermore, combining auto-labels with manual labels enhanced object detection performance. These results apply only when objects can be detected by Grounding DINO. Additionally, experiments and analyses were conducted on a dataset that includes data collected from various devices, namely CCTV, dashcams, and smartphones, to investigate the impact of each device type on object detection accuracy. The auto-labeling technology of Grounding DINO demonstrated efficiency and performance improvements in object detection, providing evidence of its practical applicability.
Future research should focus on the integration and effective utilization of images obtained from various devices, as they offer broader perspectives and rich information.Additionally, beyond simple combination, there is a need to explore more synergistic and efficient methods for integrating auto-labels and manual labels.
Overall, there are 11,431 image-label pairs in the dataset, manually split into training, validation, and test sets. According to Table 1, the training set comprises CCTV (2088 images), black box (2088 images), and smartphone (2087 images), for a total of 6263 images; the validation set comprises CCTV (576), black box (576), and smartphone (576), for a total of 1728 images; and the test set comprises CCTV (1147), black box (1152), and smartphone (1141), for a total of 3440 images.

Training the YOLO [26] model: To train the YOLOv8 object detection model, text files with the same names as the image files are required. These text files must contain the class number of each object and its bounding box information. The YOLO bounding box format requires normalized values in the order center coordinates (x, y), width (w), and height (h), i.e., [x, y, w, h]. Annotation information for the publicly available dataset is in JSON format, containing various detailed fields that can serve different research purposes. For this experiment, however, many details were unnecessary, and the bounding box format did not match the YOLO format. Therefore, as shown in Figure 4, a conversion was performed.
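The conversion described above can be sketched as follows, assuming the JSON annotations store boxes as a top-left corner plus width and height in pixels (the exact source field layout is an assumption); the output is YOLO's normalized [x, y, w, h] line format:

```python
def to_yolo_bbox(x, y, w, h, img_w=1920, img_h=1080):
    """Convert a pixel-space (x_top_left, y_top_left, w, h) box to
    YOLO's normalized [x_center, y_center, w, h]."""
    xc = (x + w / 2) / img_w
    yc = (y + h / 2) / img_h
    return [round(xc, 6), round(yc, 6), round(w / img_w, 6), round(h / img_h, 6)]

def to_yolo_line(class_id, box_px, img_w=1920, img_h=1080):
    """One line of a YOLO label .txt file: 'class xc yc w h'."""
    xc, yc, wn, hn = to_yolo_bbox(*box_px, img_w, img_h)
    return f"{class_id} {xc} {yc} {wn} {hn}"
```

The default 1920 × 1080 size matches the dataset's full HD resolution; one such line is written per object into the text file sharing the image's name.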

Figure 4 .
Figure 4. Conversion of bounding box format and labels for YOLO model.

Figure 5 .
Figure 5. Teacher-student model using Grounding DINO and YOLO.

Figure 6 .
Figure 6. Example of combining manual labels and auto-labels.

Figure 7 .
Figure 7. Training results for object detection using manual labels for group 1.

Figure 8 .
Figure 8. Confusion matrix for object detection training with manual labels for group 1.

Figure 9 illustrates the results of predicting bounding boxes for batches of training data, visually confirming the detection of all objects without any omissions.

Figure 9 .
Figure 9. Object detection prediction results on training data.
Mask R-CNN, based on the Faster R-CNN architecture, not only identifies the location and assigns a class to objects but also precisely segments each object's boundaries at the pixel level to generate masks. Single Shot Detectors (SSDs): Liu et al. (2016) proposed the SSD [25], a model that predicts bounding boxes and class scores directly from feature maps at multiple scales. The SSD achieved real-time object detection performance by passing through the network only once.

Table 1 .
Overall dataset composition and partitioning.

Table 2 .
Training dataset group splitting.

Table 3 .
Validation dataset group splitting.

Table 4 .
Test dataset group splitting.

Table 5 .
Group-wise testing results for manual labeling training outcomes.

Table 6 .
Results of manual labeling training without pre-trained models.

Table 7 .
Training results using auto-labels generated by Grounding DINO.

Table 8 .
Combined training results of auto-labels and manual labels.