Article

Overcoming Data Scarcity in Roadside Thermal Imagery: A New Dataset and Weakly Supervised Incremental Learning Framework

by Arnd Pettirsch and Alvaro Garcia-Hernandez *
Institute for Highway Engineering, RWTH Aachen University, 52062 Aachen, Germany
* Author to whom correspondence should be addressed.
Sensors 2025, 25(7), 2340; https://doi.org/10.3390/s25072340
Submission received: 18 March 2025 / Revised: 28 March 2025 / Accepted: 5 April 2025 / Published: 7 April 2025
(This article belongs to the Section Sensing and Imaging)

Highlights

  • Largest thermal roadside dataset introduced: 11,400 annotated images and 142 video clips.
  • Weakly supervised incremental learning framework adapts to new viewpoints without extra labelling.
  • Achieved 8.9-point improvement in mAP on unseen data, enhancing traffic monitoring accuracy.
  • Teacher-student network with pseudo-labelling and memory module prevents forgetting in incremental learning.
  • Open-access dataset and code provided for reproducibility and future research.

Abstract

Roadside camera systems are commonly used for traffic data collection, yet conventional optical systems are limited by poor performance in varying weather and light conditions and are often restricted by data privacy regulations. Thermal imaging overcomes these issues, enabling reliable detection across all conditions without collecting personal data. However, its widespread use is hindered by the scarcity of diverse, annotated thermal training data, especially since fixed cameras installed at the side of the road produce very similar images with the same backgrounds. This paper presents two key innovations to address these challenges: a novel dataset of 11,400 annotated images and 142 unannotated video clips, the largest and most diverse available for thermal roadside imaging to date, and a weakly supervised incremental learning framework tailored for thermal roadside imagery. The dataset supports the development of self-supervised algorithms, and the learning framework allows efficient adaptation to new camera viewpoints and diverse environmental conditions without additional labelling. Together, these contributions enable cost-effective and reliable thermal-based traffic monitoring across varied locations, achieving an 8.9-point increase in mean average precision for previously unseen viewpoints.

1. Introduction

Reliable traffic data collection is essential for road planning, traffic simulation, safety analysis, and road asset management [1]. Among the various sources for collecting traffic data [2], camera systems have proven to be particularly advantageous when the trajectories of road users are of interest, for example, in safety analysis with surrogate safety measures (SSM) [3]. However, optical cameras are susceptible to weather conditions, rely on external lighting, and their use may be restricted by data protection regulations. In such situations, thermal imaging technology is advantageous as it is insensitive to weather and light conditions such as precipitation, darkness, sunlight, or shadows [4] and does not capture personal information.
There are already field solutions detecting road users in thermal images. Two well-known systems for urban real-time traffic detection are the FLIR TrafiSense2 [5] and the FLIR TrafiSense AI [6]. These systems can detect the presence and class of objects in predefined zones. This is sufficient for many applications, for example, for traffic light control. Applications like more detailed road-user-behavior studies call for flexible detection and tracking of objects to obtain the road users’ trajectory.
Systems enabling flexible traffic data collection beyond predefined areas rely fundamentally on effective and accurate object detection algorithms. In this context, object detection refers to the classification and concurrent pinpointing of objects within a video frame [7]. In the literature, there is a notable emphasis on applying these algorithms to traffic detection in RGB camera images, often concentrating on real-time detection [8] or sensor fusion [9]. In thermal imagery, many studies focus on applications such as monitoring drivers [10] or autonomous driving [11]. Researchers have addressed the challenging task of object detection in infrared images, which have fewer features than RGB images, through methods like combining RGB and infrared images [12] or enhancing established architectures by compressing channels and optimizing parameters [13]. While there is some research on detecting pedestrians [14] and parked cars [15] with roadside thermal cameras, there is still a significant gap in research on algorithms for that specific application.
Traffic detection with thermal images from roadside cameras poses an additional challenge. In contrast to applications with moving cameras (e.g., autonomous driving), the images in the datasets are very similar due to the fixed cameras, which poses a significant risk of overfitting, mainly because large datasets such as COCO [16], typically used for initial training, consist of images from RGB and, therefore, non-thermal cameras. Although Danaci et al. [17] have compiled 109 thermal imagery datasets, it appears that only the datasets from Balon et al. [18,19] and the AAU RainSnow Traffic Surveillance Dataset [20] are specifically available for this purpose. Balon et al. achieved promising results using the YOLOv5 [21] and YOLOv7 [22] algorithms on their own dataset of infrared images from infrastructure cameras. However, both datasets contain only one location each and were recorded within half an hour and 45 min, respectively, on a single day [18,19], so these models can be expected to overfit to the viewing angles and environmental conditions used. In contrast, the AAU RainSnow Traffic Surveillance Dataset contains images from seven viewpoints [20]. However, due to the small number of about 2000 images, it is not expected to be sufficient for training generalizing models.
Data collection and annotation are costly, especially for infrastructure cameras, which require complex permits and potential road closures for installation. In RGB images, the scientific community is exploring self-learning approaches, often using larger teacher networks to generate pseudo-labels for training student networks [23], a method known as weakly supervised training [24]. These approaches often rely on image similarities [25] and sometimes on incorporating human knowledge [26]. Beyond new training, certain methods apply filtering [27] and tracking algorithms [28] to compensate for false-positive detections and misclassifications. Those algorithms rely on rules and constraints and may lack the generalization of retrained neural networks. Although works like Tang et al. [29] combine elements of both, there is limited research on solutions that fully leverage the unique properties of thermal roadside cameras, such as limited features and stationary cameras detecting primarily moving objects, within weakly supervised learning frameworks.
To develop a generalized algorithm for accurate traffic data collection across diverse locations, two key research areas require improvement in thermal image-based traffic detection. First, heterogeneous datasets are essential for effective viewpoint adjustment, necessitating distinct training and test data. Second, perspective adjustment methods are needed to address limited heterogeneous data, enabling universal application without additional human labeling.
This paper enables reliable 2D object detection in thermal imagery from infrastructure cameras by addressing these two key gaps. First, it provides a unique dataset to address the lack of heterogeneous data in infrared images from varied traffic viewpoints. Second, it combines weakly supervised learning, incremental learning, and pseudo-label enhancement into a novel weakly supervised incremental training framework specifically adapted to the challenges of roadside thermal imagery, allowing fast and robust adaptation to new camera positions and diverse traffic or environmental conditions.

2. Materials and Methods

2.1. Problem Description

A variety of models are trained on RGB images X_RGB with high detection performance. X_RGB consists of images x_RGB and labels y_RGB with y = (b, c), where b denotes the bounding box coordinates and c the class label. In this work, these algorithms are adapted to the domain of roadside infrared images X_t. Within this domain, a comparatively small amount of human-labeled data X_t^A exists. Transfer learning can achieve good results within X_t^A. However, sufficient heterogeneous test data are needed to evaluate how well these results generalize to different subdomains X_t^B with varying camera locations, perspectives, and traffic situations. This work addresses two problems: providing such testing data and adapting a model M trained with data from X_t^A to X_t^B without human annotation effort.

2.2. Thermal Traffic Dataset

2.2.1. Data Collection

Two thermal camera models, the AXIS Q1942-E 10 mm [30] and the AXIS Q1952-E 10 mm [31], were used to collect the data from the roadside. Both sensors record video at 30 FPS with a resolution of 640 × 480 pixels. The images are decoded in 8-bit format, where 255 is assigned to the hottest and 0 to the coldest pixel. The cameras were installed at 21 locations: 12 inner-city spots, 5 on rural roads, and 4 on the highway. At all locations except two highway sites, the cameras were mounted 5–8 m high on street lamps or specially designed masts, as shown in Figure 1. The street lamps or masts are located a few meters from the road. The other two highway cameras were mounted on road sign gantries at a similar height. The open-source tool LabelImg v1.8.0 [32] was used to create 2D bounding box annotations and class labels for every object of the classes motorcyclist, car, bus, truck, pedestrian, bicyclist, and e-scooter.
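For illustration, the following minimal Python sketch shows the min-max scaling implied by this 8-bit decoding; the function name and the assumption that raw frames arrive as 16-bit NumPy arrays are placeholders, not part of the camera firmware or the authors' pipeline.

import numpy as np

def to_8bit(raw_frame: np.ndarray) -> np.ndarray:
    # Map the coldest pixel to 0 and the hottest pixel to 255 (min-max scaling).
    raw = raw_frame.astype(np.float32)
    span = raw.max() - raw.min()
    if span == 0:  # uniform frame: avoid division by zero
        return np.zeros(raw.shape, dtype=np.uint8)
    return ((raw - raw.min()) / span * 255).astype(np.uint8)

# Example with a synthetic 480 x 640 16-bit frame
frame_8bit = to_8bit(np.random.randint(0, 2**16, size=(480, 640), dtype=np.uint16))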

2.2.2. Data Splits

The dataset is split into three main parts X_t^A, X_t^B1, and X_t^B2 (see Figure 2). Each part has unique camera locations and viewpoints. X_t^A supports general domain adaptation for models trained on RGB images. X_t^B1 is the target split with unseen camera locations and traffic. X_t^B2 is used to evaluate pseudo-label enhancement.
The first part X_t^A includes 9000 images from 9 inner-city cameras. Approximately every 15th image is labeled, with at least a 0.5 s gap between frames, ensuring variations due to traffic flow. About 60% of the images of each camera are used for training and about 20% each for validation and testing. These cameras were positioned on frequently used streets, capturing all considered kinds of road users, such as cars, pedestrians, bicycles, motorcycles, trucks, buses, and e-scooters. This variety of object classes and their frequent appearance provide a solid base for learning generalized object representations. Given the high cost of annotation, 1000 images per camera were used, with about 600 allocated for training and roughly 200 each for validation and testing. The second part X_t^B1 contains 107 short non-annotated video clips from 8 camera positions. X_t^B1 also includes a split X_t^B1_test, with 200 randomly selected images (0.5 s minimum interval) from additional videos of each camera, ensuring the same number of test images per camera as in X_t^A. X_t^B2 contains 35 video clips from 4 camera positions, with approximately every 15th frame annotated regardless of whether there is an object in the image, making it suitable for pseudo-label enhancement evaluation. X_t^B1 was chosen to contain as many different camera locations as possible, with X_t^B2 kept smaller, as its purpose is to evaluate the pseudo-label quality. To cover all traffic situations, X_t^B2 contains one camera from the inner city, one from the highway and, due to lower traffic flow, two from rural roads.
Table 1 illustrates the dataset's composition, highlighting the heterogeneity of X_t^B1_test. X_t^B1_test includes diverse viewing angles, times of day, and traffic situations, making it suitable for robust generalization evaluation. Figure 3 illustrates this variety with one example image from each camera position in X_t^B1_test, covering all weather conditions.

2.2.3. Data Evaluation

To determine whether performance differences are due to the composition of the test datasets, X_t^A_test and X_t^B1_test are compared by class composition. Assuming larger objects are easier to detect, objects are additionally classified as hard (small), medium, and easy (larger). Objects under 128 pixels are hard, those of 128–384 pixels are medium, and those of 384 pixels or larger are easy. The X_t^A_train and X_t^A_val splits are also analyzed to assess whether the composition of the initial transfer learning dataset affects model generalization.
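As a minimal illustration of this difficulty assignment (not the authors' evaluation code), the following Python sketch classifies an annotated object; interpreting the thresholds as bounding-box area in pixels is an assumption made here.

def difficulty(box_width: float, box_height: float) -> str:
    # Classify an object by its bounding-box size in pixels (Section 2.2.3).
    # Assumption: the 128/384-pixel thresholds refer to the box area.
    area = box_width * box_height
    if area < 128:
        return "hard"    # small objects
    if area < 384:
        return "medium"
    return "easy"        # 384 pixels or larger

print(difficulty(10, 10), difficulty(15, 20), difficulty(30, 30))  # hard medium easy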

2.3. Training Framework

2.3.1. Overview

The proposed training framework uniquely combines knowledge transfer between teacher and student networks, pseudo-label enhancement via motion-based filtering and temporal voting, and a remember module to prevent forgetting, all tailored to overcome the specific challenges of thermal roadside imagery. The framework combines soft labels, created by the teacher network and the remember module, with hard labels, which are not human-made but enhanced soft labels revised by the novel context module. As shown in Figure 4, the training method consists of four main parts: the teacher network, the student network, the context module, and the remember module (a frozen version of the student network). Following knowledge distillation [33], a large teacher model pre-trained on RGB images is fine-tuned on the annotated dataset portion. The teacher model is then fed unlabeled video clips from the X_t^B1 dataset. Detections on a subset of frames (every ~15th frame) are used as soft labels, leveraging inter-class relationships to train the student network [34]. The student is a smaller and more efficient version of the teacher network that learns from the soft labels provided by the teacher network and the hard labels provided by the context module. The student network is trained on new data but also benefits from the teacher's prior knowledge, which helps it generalize better to new, unseen environments. The student network also receives additional guidance from the remember module to avoid catastrophic forgetting. Teacher detections are also fed to the context module, which provides enhanced hard pseudo-labels, as studies like [35] show their potential in weak supervision. The context module leverages unique properties of roadside cameras: moving objects, a stable class across frames, and minimal traversal time. Using these heuristics, detections are filtered and classifications adapted. This module is crucial for reducing false positives and improving classification accuracy.
The parts of the framework described above ensure high-quality pseudo-labels, allowing flexible, rapid adaptation to different sites. To prevent the network from forgetting previous knowledge and enable fast training without full retraining, a frozen version of the student (trained on prior locations), the remember module, is included in the framework. Each training batch also includes a small, random subset of images from prior locations, with the remember module providing soft labels. Previous context module labels are optionally used during validation. This combination of old and new data allows the model to retain valuable features while adapting to new environments.

2.3.2. Basic Object Detection Architecture

Considering the practical demands of traffic analysis applications, which include real-time control of traffic management systems that require fast evaluation on edge devices or remote traffic studies that need battery-operated devices, an efficient architecture is essential. Thus, this work uses the YOLOv7-tiny model, which balances speed and accuracy, as the primary detection model [22]. Additionally, the YOLOv7-family contains widely recognized and validated models, ensuring the reproducibility and credibility of the presented results. For the teacher network, computational cost is less restrictive, allowing the use of the larger YOLOv7. Transfer learning, with a new detection head as suggested by the original authors of [22], was applied to adapt the pre-trained models to roadside thermal imagery. Nevertheless, the proposed method does not rely on any specific parts of YOLOv7 and can be adapted to other object detection architectures.

2.3.3. Weakly Supervised and Incremental Learning Loss

General Loss

Since each batch includes images from both new and previous locations, the student network's loss L has two main components: L_Pseudo, which uses the output of the teacher model and context module to help the model adapt to new locations, and L_Remember, which prevents the model from forgetting previously learned knowledge. A scaling factor α weights these two loss terms:
L = α · L_Pseudo + (1 − α) · L_Remember    (1)

Pseudo-Label Loss

L_Pseudo has three components, similar to the YOLOv7 loss: an objectness loss L_Obj, a classification loss L_Cls, and a bounding box regression term L_Reg. These parts are weighted by the factors β, γ, and δ as in the original paper. Since L_Reg uses only boxes with true labels and the context module is expected to provide fewer false positives and more true positives, only the context module's outputs are used for regression. Here, y is the network's output and y_Context is the output of the context module. The objectness and classification losses include terms based on the teacher's soft labels ỹ_teacher and the context module's hard labels y_Context, both weighted by objectness scores p_obj_i. The classification loss balances context module outputs and teacher outputs using the teacher's probability score p_i, the product of the objectness and classification scores obtained from the teacher's input x_teacher. Since y_Context and ỹ_teacher derive from teacher detections, each y_Context detection has a matching ỹ_teacher detection. When only a soft label is available, the loss is weighted by the probability score, with all loss terms averaged. This yields the following for L_Pseudo, L_Obj_Pseudo, L_Cls_Pseudo, and L_Reg_Pseudo [22]:
L_Pseudo = β · L_Obj_Pseudo + γ · L_Cls_Pseudo + δ · L_Reg_Pseudo    (2)
L_Reg_Pseudo = mean_{i=0..m}( L_Reg(y, y_Context) )    (3)
L_Obj_Pseudo = mean_{i=0..n}( p_obj_i(x_teacher) · L_Obj(y, ỹ_teacher) ) + mean_{i=0..n}( (1 − p_obj_i(x_teacher)) · L_Obj(y, y_Context) )    (4)
L_Cls_Pseudo = mean_{i=0..n}( p_i(x_teacher) · L_Cls(y, ỹ_teacher) ) + mean_{i=0..m}( ε · L_Cls(y, y_Context) ),  with ε = 1 − p_obj_i(x_teacher) if a matching y_Context detection exists and ε = 0 otherwise    (5)

Incremental Learning Loss

The L_Remember term uses the objectness, classification, and regression losses based on the soft labels ỹ_FStudent of the frozen student:
L_Remember = L_Obj(y, ỹ_FStudent) + L_Cls(y, ỹ_FStudent) + L_Reg(y, ỹ_FStudent)    (6)

Basic Loss Terms

Similar to Wang et al. [22], the objectness, classification, and regression loss terms for all losses are calculated as follows, with y_i being either hard or soft labels and t_Box the box coordinates passed by the context module. The objectness loss is balanced across the different detection levels, as in the original work:
L_Obj = −[ y_i · log(p) + (1 − y_i) · log(1 − p) ]    (7)
L_Cls = −Σ_{i=1..C} y_i · log(p_i)    (8)
L_Reg = Σ_{i=1..n} ( 1 − IoU(p_Box, t_Box) )    (9)
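To make the interplay of Equations (1) and (2) concrete, the following Python/PyTorch sketch assembles the total loss from pre-computed scalar terms. It is an illustration only: the argument names are placeholders, and the numeric weights in the example call are illustrative rather than the values used by the authors.

import torch

def total_loss(l_obj_pseudo, l_cls_pseudo, l_reg_pseudo,
               l_remember, alpha, beta, gamma, delta):
    # Eq. (2): weighted sum of the pseudo-label loss components
    l_pseudo = beta * l_obj_pseudo + gamma * l_cls_pseudo + delta * l_reg_pseudo
    # Eq. (1): balance adaptation to new locations against the remember module
    return alpha * l_pseudo + (1.0 - alpha) * l_remember

# Illustrative call with dummy scalar tensors
loss = total_loss(torch.tensor(0.8), torch.tensor(0.5), torch.tensor(0.3),
                  torch.tensor(0.6), alpha=0.5, beta=1.0, gamma=0.3, delta=0.05)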

2.3.4. Context Module

The context module was specifically designed to address the unique challenges of thermal roadside imagery, such as low feature density and the stationary nature of cameras. By leveraging motion-based filtering and temporal class voting, it enhances pseudo-label quality in a novel and effective manner.

Sources of Error

The context module works to remove errors in teacher network detection. The three primary error sources are bounding box regression errors (producing misaligned boxes), background confusion, and class confusion (misclassified objects).

Improve Bounding Box Regression

Within the YOLOv7 model group, Wang et al. [22] found that YOLOv7's bounding box regression is more accurate than YOLOv7-tiny's, so YOLOv7 serves as the teacher model. Research exists on additional regression enhancement networks [36] and on methods that bypass bounding box regression in weakly supervised training [37]. In the proposed method, only an adaptation to the different backgrounds is necessary, since object classes, sizes, and aspect ratios remain unchanged. Thus, using a larger teacher network is effective and sufficient.

Remove Background Confusions

A combination of two methods removes false positive detections. First, non-moving objects are eliminated based on the assumption that false positives mainly occur in background areas. To achieve this, background images were generated using a temporal median filter (TMF) [38] based on (10), where k is half the total frame count, B is the background pixel value, and I is the image pixel value at position (x, y). The teacher network is then applied to these background images. Detections that appear on the original image but not on the background image, using a matching threshold of 0.5 in IoU, are considered moving objects. In fixed-camera traffic monitoring, these moving detections are treated as true positives. There are limitations, however, when objects remain static due to traffic jams or parking. Such cases typically have high probability scores in the teacher network, reducing their impact on the overall loss calculation.
Second, non-moving object removal is combined with a minimum track-length filter that ensures each object is recorded for at least 30 frames. This work compares the proposed filter with box-density-based methods, such as the IoU-based approach of Kim et al. [37] and the distance-based approach of Li et al. [27].
B(x, y) = median_{i=−k..k}( I(x, y, i) )    (10)
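A minimal Python sketch of this background-based filter is given below, assuming detections are axis-aligned boxes in (x1, y1, x2, y2) format; the function names, the frame-stack layout, and the helper IoU routine are illustrative and not taken from the released code.

import numpy as np

def background_image(frames: np.ndarray) -> np.ndarray:
    # Eq. (10): per-pixel temporal median over a stack of shape (num_frames, H, W)
    return np.median(frames, axis=0).astype(frames.dtype)

def iou(a, b):
    # Intersection over union of two boxes (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def moving_detections(frame_boxes, background_boxes, match_thresh=0.5):
    # Keep detections that do not match any detection on the background image,
    # i.e. boxes assumed to belong to moving objects.
    return [box for box in frame_boxes
            if all(iou(box, bg) < match_thresh for bg in background_boxes)]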

Remove Class Confusions

There are two key components in the approach to reduce class confusion: the tracking algorithm and the voting process. While tracking algorithms are an extensive research area on their own, this work primarily focuses on the voting problem. The SORT [39] algorithm is chosen for tracking, as it has shown strong results in real-time multiple object tracking and does not require additional, computationally costly models. The main assumption in using tracking algorithms for pseudo-label improvement is that an object maintains the same class throughout its track. Voting then determines the common class for these tracks.
This work evaluates three voting strategies: majority vote, maximum-score vote, and soft vote. Majority vote ignores probability scores and assigns the most frequently detected class (11), maximum-score vote selects the class of the detection with the highest score (12), while soft vote calculates the average score for each class across all detections in the track (13) [40].
x_major = argmax_c Σ_{i=1..n} 1(y_i = c)    (11)
x_max = argmax_{i ∈ {1,…,n}} p_i(c)    (12)
x_soft = argmax_c (1/n) · Σ_{i=1..n} p_i(c)    (13)
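For clarity, a small Python sketch of the three voting rules follows; the representation of a track as a list of (class, score) detections and the function names are assumptions made for illustration.

from collections import Counter, defaultdict

def majority_vote(track):
    # Eq. (11): most frequently detected class, ignoring scores
    return Counter(cls for cls, _ in track).most_common(1)[0][0]

def max_score_vote(track):
    # Eq. (12): class of the single detection with the highest score
    return max(track, key=lambda det: det[1])[0]

def soft_vote(track):
    # Eq. (13): class with the highest score sum averaged over the track length
    sums = defaultdict(float)
    for cls, score in track:
        sums[cls] += score
    n = len(track)
    return max(sums, key=lambda cls: sums[cls] / n)

# One track as a list of (class, score) detections
track = [("car", 0.91), ("truck", 0.95), ("car", 0.60), ("car", 0.55)]
print(majority_vote(track), max_score_vote(track), soft_vote(track))  # car truck car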

2.3.5. Pseudo-Label Creation

In contrast to actual training, the context module processes videos rather than individual images. Its output remains consistent throughout training. Thus, the pseudo-labels from the context module are pre-calculated. For this, a file with pseudo-labels (similar to the output of [32]) is created every 15 frames, starting from the 15th frame (α-frames). If this frame contains objects, it is saved as an image along with the label file; otherwise, the next α-frame is used. Notably, labels are only created after all objects detected in a specific α-frame have exited the tracking process. This ensures that the voting process includes all appearances of each object. Objects leave tracking if they have not been detected for 5 consecutive frames. For training, 600 images per camera were used, with 200 for validation. The images were selected as uniformly as possible (limited by the annotated frames) from each camera’s videos. The entire process is outlined in Algorithm 1.
Algorithm 1. Pseudo code for the creation of pseudo labels from the context module.
# Pseudocode: video, detector, detection_filter, tracker, create_label_file,
# and save_image stand for the modules described above.
frame_num = 0
annotations = {}      # alpha-frame number -> finished pseudo-labels
frame_objects = {}    # alpha-frame number -> objects still being tracked
images = {}           # alpha-frame number -> stored image
# Iterate over all frames in the video
for curr_frame in video.all_frames():
    frame_num += 1  # Increment frame number
    # Detect objects, filter false positives, and update the tracker
    curr_detections = detector(curr_frame)
    filtered_detections = detection_filter(curr_detections)
    tracked_objects = tracker.add_detections(filtered_detections)
    # Register objects visible on the current alpha-frame (every 15th frame)
    if frame_num % 15 == 0 and tracked_objects:
        frame_objects[frame_num] = list(tracked_objects)
        images[frame_num] = curr_frame
    # Create pseudo-labels once an object has left the tracking process
    # (not detected for 5 consecutive frames)
    for curr_obj in tracked_objects:
        if curr_obj.last_detection() < frame_num - 5:
            for obj_frame_num in curr_obj.frame_numbers:
                if obj_frame_num % 15 == 0:  # only consider alpha frames
                    annotations.setdefault(obj_frame_num, []).append(curr_obj)
                    frame_objects[obj_frame_num].remove(curr_obj)
    # Save labels and image once all objects of an alpha-frame have left tracking
    for frame_num_to_check in list(frame_objects.keys()):
        if len(frame_objects[frame_num_to_check]) == 0:  # no remaining objects
            create_label_file(annotations[frame_num_to_check])
            save_image(images[frame_num_to_check])
            del frame_objects[frame_num_to_check]  # avoid saving the frame twice

2.4. Experimental Design

2.4.1. Evaluation Metrics

General Evaluation Procedure

The training framework in this work comprises four modules: the teacher network, the student network, the context module, and the remember module. The two object detection networks are tested on the X_t^A test set to assess adaptation to the thermal image domain and on the X_t^B1_test set to evaluate generalization, both before and after applying the training framework. The difference in performance before and after retraining the student network illustrates the proposed method's benefit.
The context module enhances the teacher network's output to produce high-quality pseudo-labels. Thus, the pseudo-label quality is evaluated per frame rather than evaluating the filtering and tracking algorithms separately. To ensure that the context module optimization does not overfit specific camera positions and traffic situations, the X_t^B2 set was used for evaluation.

Bounding Box Regression

Bounding box regression is assessed using the Intersection-over-Union (IOU) metric. IOU measures the area of intersection divided by the area of union between the detected and ground-truth (GT) bounding boxes (14). This work uses the mean IOU (mIOU), calculated as the mean of all class-wise mean IOUs. For the other metrics, ground truth (GT) and detection are assigned based on an IOU threshold of 0.5 [41].
IOU = Area of Intersection / Area of Union    (14)

Classification

The mean Average Precision (mAP) is calculated as the mean of all average precisions across classes, where the average precision represents the area under the curve in a precision–recall diagram. Precision (15) is the ratio of correct (true-positive) detections to all detections (true positives (TP) and false positives (FP)). Recall (16) is the ratio of true-positive detections to all ground-truth objects (true positives and false negatives (FN)). The precision–recall curve is derived using different recall levels (R) (17). At each recall level, only detections above a set score threshold (r) are included [16].
P = TP / (TP + FP)    (15)
R = TP / (TP + FN)    (16)
AP = (1/|R|) · Σ_{r ∈ {0, 0.1, …, 1}} p_interp(r)    (17)
The mAP has the issue of score-threshold variation at the recall level. In practice, precision and recall at a specific threshold are prioritized. Therefore, the F1-Score (18), the harmonic mean of precision and recall, is used as an additional evaluation parameter. Similar to the AP and the IoU, the mean F1-score (mF1) is calculated. The score threshold is set at 0.25, as suggested by the authors of [22,42].
F1 = 2 · P · R / (P + R) = 2 · TP / (2 · TP + FP + FN)    (18)
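The following minimal Python sketch evaluates Equations (15), (16), and (18) from matched detection counts at a fixed IoU threshold; it is a simplified illustration, not the full mAP computation used in the experiments.

def precision_recall_f1(tp: int, fp: int, fn: int):
    # Precision (15), recall (16) and F1-score (18) from matched detections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 true positives, 10 false positives, 20 missed ground-truth objects
print(precision_recall_f1(80, 10, 20))  # -> approximately (0.889, 0.800, 0.842)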

2.4.2. Implementation Details

The code used in this work builds on the YOLOv7 implementation by the original authors [22] and the SORT tracking implementation [39]. Algorithms for the training framework, pseudo-label generation in the context module, data processing, and evaluation were added to the base code. The code and hyperparameters for all experiments are available here: www.github.com/4rnd25/overcoming_data_scarcity_in_roadside_thermal_imagery.
Training and evaluation used an image size of 640 × 640, with resizing handled by the code from [22]. The hyperparameters β, γ, and δ used in (2) are set identical to the original YOLOv7 paper [22], maintaining consistency with the original authors' setup to ensure reproducibility. Incremental training was tested with two strategies for α in Equation (1): a fixed value of 0.5, giving equal weight to the remember module and the current training locations, and an adapted α, calculated as α = n_current_locations / (n_previous_locations + n_current_locations). For incremental training, the most recent version of the student network initialized the weights. The teacher network was not retrained; the weights trained on X_t^A were used for all training steps. For all soft labels (from the teacher network and the remember module) and for the input to the context module, a non-maximum suppression (NMS) step was applied to the teacher and frozen student outputs. The NMS was not class-agnostic and used a score threshold of 0.001 and an IoU threshold of 0.6. The context module only used detections above a 0.25 score threshold. All models were trained for 300 epochs, with the best models chosen based on the fitness score provided in [22] on the validation set. For incremental training, two strategies were tested for validation set selection: using the context module's output on the current training locations, and combining the current and all previous validation sets.
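As a small illustration of the adapted scaling factor described above (not the authors' code), the following Python sketch computes α from the numbers of current and previously seen locations; the function and variable names are placeholders.

def adapted_alpha(n_current_locations: int, n_previous_locations: int) -> float:
    # Share of current locations among all locations seen so far (Section 2.4.2)
    return n_current_locations / (n_previous_locations + n_current_locations)

# Adding one new location after seven already-trained locations
alpha = adapted_alpha(1, 7)  # 0.125 -> the remember module dominates the loss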
Training and inference times were measured on a server with an Nvidia Quadro RTX 5000 GPU and an Intel Xeon E5-2640 v3 CPU (2.60 GHz, 16 cores). Additionally, inference time was assessed on an Nvidia Jetson Xavier NX edge GPU in 15 W 6-core power mode with Jetson Clocks activated.

3. Results and Discussion

3.1. Dataset

The labeled part used for initial neural network transfer learning, X_t^A, and the evaluation sets X_t^A_test and X_t^B1_test are emphasized in the dataset evaluation. Figure 5 displays the class distribution for these parts and the distribution of objects categorized as easy, medium, and hard to detect.
Class distributions are generally uneven across all analyzed dataset parts, with the car class significantly more represented than others. This distribution reflects traffic data collection applications, as cars are the most common vehicle type. Notably, the test set X_t^B1_test, which includes heterogeneous viewpoints, contains fewer buses, pedestrians, and cyclists but more motorcyclists and trucks. However, differences across difficulty levels are minor, especially in the most represented class, cars.

3.2. Performance Analysis

Figure 6 provides a summary of key results across experiments. The mAP on the X_t^B1_test dataset is shown, with crosses indicating each model's performance at specific locations. Models were trained sequentially, location by location, in random order using data from X_t^B1 with the proposed framework, or collectively across all locations in X_t^B1. Training was conducted both without the remember module (red) and with it (blue and green). As described, two remember module options were tested: (1) α = 0.5 with validation data only from the current location or X_t^B1 (blue) and (2) an adapted α with validation data from all previous locations (green). Overall, performance on unseen data improved significantly. One observes that after adding five extra locations, the performance stabilizes and reaches similar results as when training on all locations. This supports the idea that more diverse locations lead to better detection quality and reduced overfitting. While Figure 6 reports the results during the process of incremental learning, Table 2 compares the AP for the baseline model (Yv7t) trained on X_t^A, the incrementally trained model with adapted α and mixed validation data (Yv7t-PRIA), and the model trained on the entire X_t^B1 dataset at once using the adapted α and mixed validation set (Yv7t-PRA) after training on all locations. An increase of 8.9 percentage points in mAP was observed, with similar regression precision (78% mIoU to 79% mIoU) and a notably improved mF1-score (up by 7.5 percentage points). The analysis in Figure 7 indicates that this improvement is mainly due to fewer false positives in the background and better classification. While nearly all classes showed improvement, detection performance remains low for some classes. For instance, the poor performance on the e-scooter class suggests that more objects of this class are needed in the weakly supervised training data and that the teacher network must initially detect all classes reliably.
Comparing the incrementally trained version with the one trained on the entire dataset at once, a slightly lower performance is observed. Additionally, the performance curve does not consistently rise, indicating that even with the remember module, some features are forgotten, especially since not every object class appears in all locations (e.g., no pedestrians in highway camera images).
Due to the high computational cost, mAPs are only reported for a single representative run per setup, consistent with common practice in object detection such as in [22]. Since the models were not trained from random initialization but from pretrained weights, remaining sources of randomness like the sample order are expected to have minor influence and would not explain the substantial improvement observed.

3.3. Comparison with State-of-the-Art Datasets

Table 3 presents the results of the YOLOv7-tiny model trained on the AAU RainSnow Dataset (AAU) and the Automotive Thermal Dataset (AT), evaluated on the test set of each specific dataset and on the X_t^B1_test set. Bahnsen et al. [20] did not report specific results for object detection models on the thermal part of their dataset and did not define a test set. Therefore, 25% of each camera viewpoint was used for testing. In contrast, Balon et al. [18] defined a test set and reported results with an mAP of 99.2, which was approximately reproduced here (99.7 mAP). Only the first dataset published by Balon et al. was considered, since the second is no longer available online [18].
When applying the models trained on their original datasets to the X_t^B1_test set, a significant performance decrease was observed. The performance drop was particularly notable for the model trained on the AT dataset. This decrease reflects the homogeneous composition of Balon et al.'s dataset, which resulted in poor detection quality on unknown data but exceptionally high performance on the original dataset. Such good performance indicates that both the training and test sets within this dataset were too similar, leading to severe overfitting. This drop in performance provides a clear quantitative measure of overfitting, demonstrating how models trained on less diverse locations fail to generalize to more varied data.
The AAU-trained model showed better performance on unseen data due to its more diverse dataset composition. However, the poor results on unknown data still suggest that the AAU dataset’s relatively small size of about 2000 images is insufficient for developing robust, generalizing algorithms.

3.4. Detailed Performance Analysis

3.4.1. Performance of the Base Models

Comparing the performance of the larger YOLOv7 model (Yv7) and the target model YOLOv7-tiny (Yv7t) in Table 4, using the dataset X_t^A for initial training and evaluating on X_t^A_test and the more heterogeneous dataset X_t^B1_test, leads to several conclusions. First, the significant drop in mAP and mF1 metrics highlights the generalization problem addressed by this work. Additionally, the use of the larger YOLOv7 model as a teacher network is justified, as it performs better on unseen data (32.8 mAP vs. 31.4 mAP). Furthermore, the similar or better mIoU on X_t^B1_test reinforces that no additional bounding box regression improvements are necessary, as the teacher model's bounding boxes, with a high mIoU of 0.78, effectively guide the student.
Several factors could explain the lower performance on the more heterogeneous test dataset X_t^B1_test. One reason may be the dataset composition itself, where the class distribution influences performance. False positives have a higher impact on underrepresented classes, such as person, bicycle, and bus. For more represented classes, like motorbike and truck, one would expect an opposite effect, but the lack of this effect, combined with a generally similar distribution across difficulty levels (easy, medium, and hard), suggests other factors, such as insufficient training data. The high performance of the car class in X_t^B1_test also indicates that generalization may strongly depend on the quantity of training examples. These results further demonstrate the overfitting issue, as models trained on a more homogeneous dataset show significant performance drops when tested on more diverse and challenging data.

3.4.2. Remember Module

Influence During Incremental Training

Figure 7 highlights the overall impact of the remember module during the incremental learning process. Generalization performance is notably higher than that of the framework without this module across all training steps. Additionally, even on specific current training data, performance is superior for most locations. This improvement likely stems from learning more robust features due to training with diverse data (different viewing angles, objects, etc.) labeled by the remember module. Furthermore, the influence of human-annotated data enhances performance, since X_t^A_test images are part of the training and the loss function is structured to match previous outputs on human-labeled data.

Influence During Training on All Locations at Once

Table 5 compares the results of training on all locations in X_t^B1 at once, with and without the remember module, on X_t^B1_test and X_t^A_test. Notably, similar performance is observed on X_t^B1_test, but there is a significant difference on X_t^A_test. This highlights the importance of the remember module, even with a larger, multi-location dataset, in achieving better performance on all data without compromising on new locations. Thus, performance even surpasses the baseline, which was trained only on X_t^A data. However, overall performance on X_t^B1_test is lower than on X_t^A_test, likely because strong supervision generally yields better results than weak supervision. Additionally, the diversity in weather, traffic, and backgrounds in X_t^B1 likely increases its complexity.

3.4.3. Incremental Learning

The comparison of the blue and green data points in Figure 6 shows that both the scaling factor α and the choice of validation dataset significantly affect incremental learning performance. Except for the first location, where only 1/8 of the X_t^B1 locations were included in the validation set, the model with adapted α and mixed validation sets performed best.
Incremental learning without a remember module results in a sharp performance drop, likely because unlearned features are not relearned, and no loss term manages forgetting. The remember module with α = 0.5 helps mitigate this, but generalization performance still declines at some training steps. First, α = 0.5 may overweight new data, leading to overfitting. Second, generalization properties may be missing from the validation set, preventing the model from saving the best generalizing training state.

3.4.4. Training and Inference Times

Table 6 shows the inference times with a batch size of 1 on the Nvidia Jetson Xavier NX and the GPU server. On the server, model size has minimal impact on inference speed, particularly for efficient YOLO models, and YOLOv7 operates well above real-time speed (30 FPS). However, on the edge device, the difference is more substantial, with only YOLOv7-tiny achieving adequate speed (~25 FPS). On the edge device, the model was executed with ONNX Runtime. Although further optimizations are possible, the general trend favors YOLOv7-tiny.
Examining training times, the teacher model and context module have little impact and, in some cases, even reduce training time. This is due to fast inference by the teacher model and prior labeling from the context module. Notably, only one measurement was taken, so shorter times may result from external factors, but the overall trend remains relevant. The remember module with both large and small validation datasets noticeably increases training time, likely due to an additional data loader for prior location images. The extended training time with larger validation sets indicates a bottleneck in data loading.
Incremental learning times per location are much faster on average, scaling with the volume of new data (images per new location, number of locations), which further demonstrates the advantages of this approach.

3.5. Evaluation of the Context Module

3.5.1. Filtering Algorithms

The larger teacher model, together with the tracking and filtering algorithms, was applied to the X_t^B2 dataset to analyze these algorithms' ability to improve pseudo-labels. Although X_t^B2 includes only four viewpoints, which raises overfitting risks, the dataset contains over 3000 objects, making the evaluations reasonably valid. The results do not rely on class-wise mean values of F1, precision, and recall when evaluating the filter and voting algorithms, as these metrics should remain unaffected by the class distribution in X_t^B2. For pseudo-labels, the focus is on overall counts rather than class-wise details.
All evaluations in Table 7 were conducted only on images with detections, as only these detections are used as pseudo-labels for model retraining. This leads to a different object count across algorithms. The first observation is that neither the PBDE method based on [27] nor the SRRS method based on [37] worked effectively, especially when applying the thresholds reported in [27,37] (PBDE threshold: 6; SRRS threshold: 0.08). PBDE resulted in no filtering, while SRRS filtered out nearly all detections at these thresholds. A likely reason is the generally lower feature density of thermal images. Both methods rely on box density, which may not suit thermal images. Presumably, the lower feature density and pretraining on optical images cause certain areas in the image (e.g., some bright spots, even in the background) to register a higher box density.
The TMF-based background filter and the minimum track length filter showed improved results, especially in precision, indicating that both algorithms significantly reduce false positive detections. However, both filters also reduce true positive detections, leading to a decrease in recall. As shown in Figure 8, true objects typically have a high probability score from the teacher network, so falsely filtered objects are unlikely to impact the student’s loss significantly. Since the background filter combined with the minimum track length filter achieves the highest F1 score, this combination is chosen as the filtering algorithm for this work.

3.5.2. Voting Strategies for Tracking Objects

Comparing the different voting strategies (see Table 8), it is observed that soft vote and maximum-score vote produce the worst results, likely due to a comparatively high number of false detections with high scores. Majority vote proves more reliable as it includes all detections in the decision process. However, the minor change in the number of TPs indicates a low number of identity switches in the dataset X_t^B2, so the comparison is not entirely conclusive. Combined with the chosen filtering algorithms described above, the overall F1 score improves significantly by 6.4 percentage points.

3.6. Future Application

This methodology enables reliable, scalable traffic detection through thermal imaging in diverse environments, addressing the traditional limitations of location-specific training data and high retraining costs. By introducing a framework that quickly adapts to new viewpoints, this approach achieves high detection accuracy and broad generalizability with minimal human intervention.
The system provides dual advantages: quick adaptation to novel sites without the need for retraining from scratch and long-term performance enhancement through cumulative learning. Critically, it supports rapid updates for deployed systems, accommodating new locations or changing environmental conditions without compromising existing performance—a significant advancement for maintaining system reliability over time. Using video data collected under varied conditions, this framework demonstrates consistently improved detection across both new and previously seen viewpoints. Future implementations could further enhance generalizability by training on data from multiple new locations simultaneously, comparable to training on the whole X t B 1 dataset. Additionally, the system’s ability to be deployed on edge devices such as the Jetson Xavier NX and powered by street lamps ensures low-power consumption, cost-effectiveness, and scalability, making it suitable for wide-scale urban deployment. This approach also holds promise for integration with smart city infrastructures, enabling real-time traffic monitoring and more efficient management of urban traffic systems.
Future applications could address the need for sensor failure detection, such as camera misalignment or thermal sensor issues, which are not automatically detected in the current framework. However, the architecture of the system is flexible, and these mechanisms can be incorporated in future versions.

4. Conclusions

The goal of this work is to advance robust object detection in thermal roadside imagery, aiming for reliable results across varied conditions. Our approach quickly adapts to new locations without human annotation and demonstrates significant benefits from this adaptability. The proposed training framework overcomes one of thermal imaging’s key limitations—the scarcity of publicly available training data—thus substantially lowering deployment efforts and enhancing reliability for traffic data collection. Key contributions include the following:
  • Problem of Viewpoint Adaptation: We demonstrated through performance comparisons that diverse datasets are essential for achieving generalization and for identifying the problem of missing generalization.
  • Novel large Thermal Dataset: We introduced a unique dataset with 9000 labeled images for transfer learning, 1600 for generalization evaluation, 800 for pseudo-label assessment, and 142 videos for weakly supervised learning—the most extensive thermal roadside dataset to date.
  • Innovative Training Framework: We proposed a weakly supervised incremental learning framework that uniquely integrates knowledge transfer between teacher and student networks, pseudo-label enhancement through motion-based filtering and temporal voting, and a remember module to prevent forgetting, specifically designed to address the challenges of thermal roadside imagery.
  • Effective Pseudo label enhancement: We conducted an in-depth analysis of false-positive filtering and compared voting strategies for pseudo-label enhancement tailored for roadside applications.
  • Incremental learning: We verified the framework's incremental learning benefit across eight different subsets of the proposed dataset.
  • Adaptation to new viewpoints: Overall, an improvement of 8.9 percentage points in mAP on unseen data was achieved without any human labeling effort.
Future work will focus on refining the proposed method's application in real-world settings, with particular attention to rare classes such as e-scooters, and on investigating the optimal number of images from new locations or situations needed. Furthermore, research on automatic monitoring of detection quality to recognize the need for retraining would enable the practical application of a real-time traffic analysis system which continuously improves without human intervention.

Author Contributions

Conceptualization, A.P.; methodology, A.P.; software, A.P.; validation, A.G.-H.; formal analysis, A.P.; investigation, A.P.; resources, A.G.-H.; data curation, A.P.; writing—original draft preparation, A.P.; writing—review and editing, A.G.-H.; visualization, A.P.; supervision, A.G.-H.; project administration, A.G.-H.; funding acquisition, A.P. All authors have read and agreed to the published version of the manuscript.

Funding

The study was part of the research project ‘SmarteAmpel’ funded by the German Federal Ministry of Economic Affairs and Climate Action (BMWK). Data were further collected in the project ‘BueLaMo—Muensterland’ funded by the German Federal Ministry of Education and Research (BMBF), in the contract research ‘Untersuchung von Gewöhnungseffekten beim Einsatz von fluoreszierenden Materialien’ for the German Federal Highway Research Institute, and in the project ‘Panoramstraße’ supported by the district administration of Dueren, Germany.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big Data Analytics in Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2019, 20, 383–398. [Google Scholar] [CrossRef]
  2. Tasgaonkar, P.P.; Garg, R.D.; Garg, P.K. Vehicle Detection and Traffic Estimation with Sensors Technologies for Intelligent Transportation Systems. Sens. Imaging 2020, 21, 29. [Google Scholar] [CrossRef]
  3. St-Aubin, P.; Miranda-Moreno, L.; Saunier, N. An automated surrogate safety analysis at protected highway ramps using cross-sectional and before–after video data. Transp. Res. Part C Emerg. Technol. 2013, 36, 284–295. [Google Scholar] [CrossRef]
  4. Alldieck, T.; Bahnsen, C.; Moeslund, T. Context-Aware Fusion of RGB and Thermal Imagery for Traffic Monitoring. Sensors 2016, 16, 1947. [Google Scholar] [CrossRef] [PubMed]
  5. Teledyne FLIR LLC. Product Information TrafiSense2. Available online: https://www.flir.com/products/trafisense2/?vertical=publicsafety&segment=solutions (accessed on 24 June 2024).
  6. Teledyne FLIR LLC. Product Information TrafiSense AI. Available online: https://www.flir.com/products/trafisense-ai/?vertical=publicsafety&segment=solutions (accessed on 24 June 2024).
  7. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A Survey of Modern Deep Learning based Object Detection Models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar] [CrossRef]
  8. Barthélemy, J.; Verstaevel, N.; Forehead, H.; Perez, P. Edge-Computing Video Analytics for Real-Time Traffic Monitoring in a Smart City. Sensors 2019, 19, 2048. [Google Scholar] [CrossRef] [PubMed]
  9. Vitols, G.; Bumanis, N.; Arhipova, I.; Meirane, I. LiDAR and Camera Data for Smart Urban Traffic Monitoring: Challenges of Automated Data Capturing and Synchronization. In Communications in Computer and Information Science; Springer International Publishing: Cham, Switzerland, 2021; pp. 421–432. [Google Scholar] [CrossRef]
  10. Liu, S.; Wang, Y.; Yu, Q.; Liu, H.; Peng, Z. CEAM-YOLOv7: Improved YOLOv7 Based on Channel Expansion and Attention Mechanism for Driver Distraction Behavior Detection. IEEE Access 2022, 10, 129116–129124. [Google Scholar] [CrossRef]
  11. Wang, J.; Song, Q.; Hou, M.; Jin, G. Infrared Image Object Detection of Vehicle and Person Based on Improved YOLOv5. In Communications in Computer and Information Science; Springer Nature: Singapore, 2023; pp. 175–187. [Google Scholar] [CrossRef]
  12. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO Architecture from Infrared and Visible Images for Object Detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef] [PubMed]
  13. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  14. Fukuda, T.; Arai, I.; Endo, A.; Kakiuchi, M.; Fujikawa, K. Benchmark of Deep Learning Visual and Far-Infrared Videos Toward Weather-tolerant Pedestrian Traffic Monitoring. In Proceedings of the 2023 IEEE International Conference on Smart Mobility (SM), Thuwal, Saudi Arabia, 19–21 March 2023. [Google Scholar] [CrossRef]
  15. Kasper-Eulaers, M.; Hahn, N.; Berger, S.; Sebulonsen, T.; Myrland, Ø.; Kummervold, P.E. Short Communication: Detecting Heavy Goods Vehicles in Rest Areas in Winter Conditions Using YOLOv5. Algorithms 2021, 14, 114. [Google Scholar] [CrossRef]
  16. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings, Part v 13, Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  17. Danaci, K.I.; Akagunduz, E. A survey on infrared image and video sets. Multimed. Tools Appl. 2024, 83, 16485–16523. [Google Scholar] [CrossRef]
  18. Balon, T.; Knapik, M.; Cyganek, B. New Thermal Automotive Dataset for Object Detection. In FedCSIS (Position Papers); ACSIS: Kobe, Japan, 2022; pp. 43–48. [Google Scholar] [CrossRef]
  19. Balon, T.; Knapik, M.; Cyganek, B. Real-Time Detection of Small Objects in Automotive Thermal Images with Modern Deep Neural Architectures. In FedCSIS (Communication Papers); ACSIS: Kobe, Japan, 2023; pp. 29–35. [Google Scholar] [CrossRef]
  20. Bahnsen, C.H.; Moeslund, T.B. Rain Removal in Traffic Surveillance: Does it Matter? IEEE Trans. Intell. Transp. Syst. 2019, 20, 2802–2819. [Google Scholar] [CrossRef]
  21. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. Ultralytics/Yolov5: v3.1-Bug Fixes and Performance Improvements; Zenodo: Geneva, Switzerland, 2020; Available online: https://ui.adsabs.harvard.edu/link_gateway/2020zndo...4154370J/doi:10.5281/zenodo.4154370 (accessed on 24 June 2024).
  22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  23. Chen, C.; Dong, S.; Tian, Y.; Cao, K.; Liu, L.; Guo, Y. Temporal Self-Ensembling Teacher for Semi-Supervised Object Detection. IEEE Trans. Multimed. 2022, 24, 3679–3692. [Google Scholar] [CrossRef]
  24. Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2017, 5, 44–53. [Google Scholar] [CrossRef]
  25. Munir, F.; Azam, S.; Jeon, M. SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September 2021–1 October 2021. [Google Scholar] [CrossRef]
  26. Karamanolakis, G.; Mukherjee, S.; Zheng, G.; Awadallah, A.H. Self-Training with Weak Supervision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 845–863. [Google Scholar] [CrossRef]
  27. Li, Z.; Jia, B.; Chen, M.; Xu, S.; He, Y.; Xie, L. PBDE: An Effective Method for Filtering False Positive Boxes in Object Detection. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020. [Google Scholar] [CrossRef]
  28. Bashar, M.; Islam, S.; Hussain, K.K.; Hasan, M.B.; Rahman, A.B.M.A.; Kabir, M.H. Multiple Object Tracking in Recent Times: A Literature Review. arXiv 2022, arXiv:2209.04796. [Google Scholar]
  29. Tang, K.; Ramanathan, V.; Fei-Fei, L.; Koller, D. Shifting Weights: Adapting Object Detectors from Image to Video. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012. [Google Scholar]
  30. Axis Communication AB. Product Information Axis Q1942-E Thermal Network Camera. Available online: https://www.axis.com/de-de/products/axis-q1942-e (accessed on 24 June 2024).
  31. Axis Communications AB. Product Information Axis Q1952-E Thermal Camera. Available online: https://www.axis.com/de-de/products/axis-q1952-e (accessed on 24 June 2024).
  32. Tzutalin. Git Code labelImg. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 24 June 2024).
  33. Hinton, G. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  34. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  35. Kaul, P.; Xie, W.; Zisserman, A. Label, verify, correct: A simple few shot object detection method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  36. Zhang, L.; Sun, Y.; Wei, W. Mind the Gap: Polishing Pseudo Labels for Accurate Semi-supervised Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3463–3471. [Google Scholar] [CrossRef]
  37. Kim, S.; Choi, J.; Kim, T.; Kim, C. Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  38. Laugraud, B. Motion-Aware Temporal Median Filtering for Robust Background Estimation. Ph.D. Dissertation, Université de Liège, Liège, Belgium, 2020. [Google Scholar]
  39. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016. [Google Scholar] [CrossRef]
  40. Boateng, E.A.; Bruce, J.W. Unsupervised Ensemble Methods for Anomaly Detection in PLC-based Process Control. arXiv 2023, arXiv:2302.02097. [Google Scholar]
  41. Cesare, S.; Xiang, Y. Software Similarity and Classification, 1st ed.; Springer: London, UK, 2012. [Google Scholar] [CrossRef]
  42. Hand, D.J.; Christen, P.; Kirielle, N. F*: An Interpretable Transformation of the F-Measure. Mach. Learn. 2021, 110, 451–456. [Google Scholar] [CrossRef]
Figure 1. Photo of one of the thermal imaging cameras used in this study, mounted at a rural road.
Figure 2. Dataset division: Each color represents a split. The first split is divided into training, validation, and test sets. The second split contains 107 non-annotated videos and 1600 annotated images from the same cameras at different times. The third split includes 35 videos and 800 annotated images extracted from them.
Figure 3. Different recording situations used in X_tB1. (a) Rain/wet surface at a highway. (b) Snow at a highway. (c) Sun, inner-city. (d) Fog, rural road. (e) Wet surface, highway. (f) Sun, inner-city. (g) Sun, rural road. (h) Sun, rural road.
Figure 4. Overview of the proposed training framework. The teacher processes videos from X_tB1, providing soft labels for key frames and input detections for the context module. The context module generates enhanced pseudo labels for the student network. The student trains on those key frames and on additional soft labels from the remember module on previous data.
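The interaction of teacher, context module, remember module, and student can be summarized in a few lines. The following is a minimal structural sketch only; every function name and the loss weighting below are illustrative placeholders and do not reproduce the released implementation.

```python
# Structural sketch of one training step of the framework shown in Figure 4.
# All functions are illustrative stubs, not the released code.

def teacher_detect(frame):
    # Frozen teacher network: soft labels (class, score, box) for a key frame.
    return [{"cls": "car", "score": 0.9, "box": (10, 20, 80, 60)}]

def context_module(detections):
    # Tracking-based refinement of the teacher detections into enhanced pseudo labels.
    return [d for d in detections if d["score"] > 0.5]

def student_pseudo_loss(frame, pseudo_labels):
    # Placeholder for the student's detection loss on the pseudo labels.
    return 0.42

def remember_loss():
    # Placeholder for the distillation loss on soft labels of previously seen data.
    return 0.17

def training_step(frame, alpha=0.5):
    pseudo_labels = context_module(teacher_detect(frame))
    # Weighted combination of the new-location loss and the remember loss.
    return alpha * student_pseudo_loss(frame, pseudo_labels) + (1 - alpha) * remember_loss()

print(training_step(frame=None))  # -> 0.295 with the dummy losses above
```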
Figure 5. Distribution of the human-labelled datasets X_tA (initial training) and X_tB1 (analysis of generalization ability) with respect to the classes encountered and the difficulty of recognizing the objects.
Figure 6. Performance of the different model versions. The x-axis shows the progress of incremental learning; the y-axis shows the mAP on the current data and on the test set X_tB1_test.
Figure 7. Example detections. (a) Yv7t with a false positive (FP) and a misclassification (MC). (b) Yv7t-PRA: no FP or MC after retraining. (c) Yv7t with multiple FPs and one MC. (d) Yv7t-PRA: correct classification and no FP after retraining. (e) MC with Yv7t. (f) Correct classification with Yv7t-PRA.
Figure 8. Example of the TMF filter combined with the minimum track length filter. The class name, the probability score, and an ascending counter number are shown above each object. Filtered objects are marked in red. The parked car is filtered out correctly, even though its score is fairly high at 0.78.
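For readers who want to experiment with this combination, a compact sketch of both filters is given below. The temporal-median background follows the idea in [38]; the box-difference threshold and the minimum track length are assumed example values, not the parameters used in this work.

```python
import numpy as np

def temporal_median_background(frames: np.ndarray) -> np.ndarray:
    # frames: stack of grayscale thermal frames with shape (T, H, W);
    # the per-pixel temporal median approximates the static background.
    return np.median(frames, axis=0)

def is_moving_object(frame, background, box, diff_thresh=15.0):
    # Keep a detection only if its region differs clearly from the background;
    # static objects such as the parked car in Figure 8 are suppressed.
    x1, y1, x2, y2 = box
    diff = np.abs(frame[y1:y2, x1:x2].astype(float) - background[y1:y2, x1:x2])
    return diff.mean() > diff_thresh

def filter_short_tracks(tracks: dict, min_length=5):
    # Minimum-track-length filter: drop short, unstable tracks.
    return {tid: dets for tid, dets in tracks.items() if len(dets) >= min_length}
```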
Table 1. Different recording situations, times of day, and weather conditions for the different dataset splits.
Split | Inner-City | Rural Road | Highway | Day | Night | Sun | Rain | Fog | Snow
X_tA | 9 | 0 | 0 | | | | | |
X_tB1_test | 2 | 3 | 3 | | | | | |
X_tB1 | 2 | 3 | 3 | | | | | |
X_tB2 | 1 | 2 | 1 | | | | | |
Table 2. Results on the diverse test data X_tB1_test. Yv7t is the baseline tiny YOLOv7 trained on X_tA. Yv7t-PRA was trained with the pseudo loss, the remember module, and an adapted alpha using mixed validation data; Yv7t-PRIA used the same set-up but was trained incrementally, location by location. Results are given in percentage points.
Model | Data | Motorcycle AP | Car AP | Truck AP | Bus AP | Person AP | Bicycle AP | E-Scooter AP | mAP | mF1 | mIoU
Yv7t | t-B1 | 13.1 | 83.9 | 47.1 | 31.3 | 18.5 | 23.4 | 2.4 | 31.4 | 67.7 | 78.0
Yv7t-PRA | t-B1 | 33.2 | 87.1 | 63.5 | 35.0 | 26.8 | 34.9 | 1.7 | 40.3 | 75.2 | 79.0
Yv7t-PRIA | t-B1 | 22.1 | 87.2 | 59.9 | 38.8 | 24.5 | 28.3 | 2.6 | 37.6 | 74.5 | 78.0
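The mAP column in Tables 2–5 appears to be the unweighted mean of the seven per-class AP values. A quick consistency check in Python, with the values copied from the Yv7t-PRA row above:

```python
# Per-class AP values for Yv7t-PRA on t-B1, copied from Table 2 (percentage points).
ap = {
    "motorcycle": 33.2, "car": 87.1, "truck": 63.5, "bus": 35.0,
    "person": 26.8, "bicycle": 34.9, "e-scooter": 1.7,
}
print(round(sum(ap.values()) / len(ap), 1))  # -> 40.3, matching the mAP column
```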
Table 3. Results of the YOLOv7-tiny model trained on the AAU RainSnow dataset (AAU) [20] and the AutomotiveThermal dataset (AT) [18], evaluated on the test sets of the original datasets and on the X_tB1 test set. F1, mF1, and mAP are given in percentage points.
Model | Data | Motorcycle AP | Car AP | Truck AP | Bus AP | Person AP | Bicycle AP | E-Scooter AP | mAP | mF1 | mIoU
Yv7t | t-A | 78.1 | 88.9 | 66.2 | 65.7 | 42.3 | 52.2 | 0.2 | 56.2 | 73.2 | 78.0
Yv7t-AAU | AAU | / | 77.5 | 36.9 | 4.2 | 0.2 | 32.3 | / | 30.2 | 67.1 | 0.76
Yv7t-AT | AT | / | 99.7 | 99.8 | 99.7 | 99.8 | / | / | 99.7 | 97.6 | 0.84
Yv7t | t-B1 | 13.1 | 83.9 | 47.1 | 31.3 | 18.5 | 23.4 | 2.4 | 31.4 | 67.7 | 78.0
Yv7t-PRA | t-B1 | 33.2 | 87.1 | 63.5 | 35.0 | 26.8 | 34.9 | 1.7 | 40.3 | 75.2 | 79.0
Yv7t-AAU | t-B1 | 0.0 | 12.0 | 2.3 | 0.0 | 0.0 | 0.5 | 0.0 | 2.1 | 8.9 | 0.69
Yv7t-AT | t-B1 | 0.0 | 28.9 | 4.6 | 1.2 | 0.4 | 0.0 | 0.0 | 5.0 | 15.3 | 0.71
Table 4. Performance of the baseline teacher and student networks on both main parts of the dataset. All results are given in percentage points.
Model | Data | Motorcycle AP | Car AP | Truck AP | Bus AP | Person AP | Bicycle AP | E-Scooter AP | mAP | mF1 | mIoU
Yv7 | t-A | 24.9 | 89.4 | 69.4 | 67.7 | 45.0 | 57.2 | 0.1 | 50.5 | 72.4 | 79.0
Yv7t | t-A | 78.1 | 88.9 | 66.2 | 65.7 | 42.3 | 52.2 | 0.2 | 56.2 | 73.2 | 78.0
Yv7 | t-B1 | 19.9 | 80.9 | 45.8 | 23.6 | 29.0 | 29.1 | 1.6 | 32.8 | 69.2 | 78.0
Yv7t | t-B1 | 13.1 | 83.9 | 47.1 | 31.3 | 18.5 | 23.4 | 2.4 | 31.4 | 67.7 | 78.0
Table 5. Comparison of the models trained on all locations of X_tB1 at once with the remember module (Yv7t-PRA) and without the remember module (Yv7t-P).
Model | Data | Motorcycle AP | Car AP | Truck AP | Bus AP | Person AP | Bicycle AP | E-Scooter AP | mAP | mF1 | mIoU
Yv7t | t-A | 78.1 | 88.9 | 66.2 | 65.7 | 42.3 | 52.2 | 0.2 | 56.2 | 73.2 | 78.0
Yv7t-P | t-A | 54.1 | 84.1 | 54.5 | 57.8 | 23.7 | 25.9 | 0.0 | 42.9 | 68.7 | 75.0
Yv7t-PRA | t-A | 77.3 | 89.4 | 64.1 | 65.4 | 48.5 | 58.4 | 0.1 | 57.6 | 75.6 | 77.0
Yv7t | t-B1 | 13.1 | 83.9 | 47.1 | 31.3 | 18.5 | 23.4 | 2.4 | 31.4 | 67.7 | 78.0
Yv7t-P | t-B1 | 35.4 | 88.0 | 60.8 | 31.5 | 22.9 | 36.0 | 0.8 | 39.3 | 75.9 | 78.0
Yv7t-PRA | t-B1 | 33.2 | 87.1 | 63.5 | 35.0 | 26.8 | 34.9 | 1.7 | 40.3 | 75.2 | 79.0
Table 6. Inference and training times for the different models and training set-ups. Inference times were measured with batch size 1. Inference was measured only for the baseline models, since the proposed framework affects only training, not inference.
Model | Inference Time (Edge Device) | Inference Time (Server) | Training Time (All) | Mean Training Time (Incremental)
Yv7 | 0.2606 s | 0.009 s | 16.83 h | /
Yv7t | 0.0393 s | 0.0074 s | 8.0 h | /
Yv7t-P | / | / | 7.35 h | 1.44 h
Yv7t-PR | / | / | 13.02 h | 2.36 h
Yv7t-PRA | / | / | 17.16 h |
Table 7. Results of the filtering algorithms. The table shows the F1, Precision, Recall, and the number of TP, FP, and FN for the different filtering algorithms on the X_tB2 dataset. F1, Precision, and Recall are given in percentage points.
Algorithm | F1 | Precision | Recall | TP | FP | FN
YOLOv7 | 64.8 | 65.7 | 63.9 | 2064 | 1076 | 1168
Background filter | 70.4 | 82.4 | 61.5 | 1942 | 415 | 1215
Min. track length filter | 64.6 | 68.6 | 61.0 | 1935 | 887 | 1238
PDBE Thresh 6 | 64.8 | 65.7 | 63.9 | 2064 | 1076 | 1168
PDBE Thresh 100 | 62.6 | 64.6 | 60.7 | 1960 | 1073 | 1271
SRSS Thresh 0.08 | 5.5 | 30.0 | 3.0 | 3 | 7 | 96
SRSS Thresh 0.004 | 63.3 | 65.1 | 61.7 | 1989 | 1065 | 1237
Background filter + min. track length filter | 71.0 | 89.0 | 59.1 | 1815 | 224 | 1256
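The F1, Precision, and Recall columns follow directly from the TP, FP, and FN counts. A minimal check in Python, reproducing the last row (background filter combined with the minimum-track-length filter):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 (in percentage points) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1

# Counts for the combined background + minimum-track-length filter, from Table 7.
p, r, f1 = prf(tp=1815, fp=224, fn=1256)
print(f"Precision {p:.1f}, Recall {r:.1f}, F1 {f1:.1f}")  # -> 89.0, 59.1, 71.0
```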
Table 8. Results of the voting strategies. The table shows the F1, Precision, Recall, and the number of TP, FP, and FN when applying the different voting strategies on the X_tB2 dataset.
Algorithm | F1 | Precision | Recall | TP | FP | FN
YOLOv7 | 64.8 | 65.7 | 63.9 | 2064 | 1076 | 1168
YOLOv7 + Major Vote | 64.9 | 65.8 | 64.0 | 2069 | 1074 | 1163
YOLOv7 + Max. score Vote | 64.7 | 65.6 | 63.7 | 2059 | 1079 | 1173
YOLOv7 + Soft Vote | 64.4 | 65.3 | 63.5 | 2052 | 1086 | 1180
Background filter w/ score thresh 0.5 + min. track length filter + Major Vote | 71.3 | 89.2 | 59.3 | 1822 | 221 | 1249
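As an illustration of the three voting rules, the sketch below assigns a single class to a tracked object from its per-frame (class, confidence) predictions. The track values are invented for the example, and the function names are illustrative, not taken from the released code.

```python
from collections import Counter, defaultdict

# Per-frame (class, confidence) predictions of one track; values are illustrative.
track = [("car", 0.62), ("truck", 0.91), ("car", 0.55), ("car", 0.58)]

def majority_vote(preds):
    # Most frequent class label along the track.
    return Counter(label for label, _ in preds).most_common(1)[0][0]

def max_score_vote(preds):
    # Class of the single most confident detection in the track.
    return max(preds, key=lambda p: p[1])[0]

def soft_vote(preds):
    # Class with the highest summed confidence over the track.
    scores = defaultdict(float)
    for label, conf in preds:
        scores[label] += conf
    return max(scores, key=scores.get)

print(majority_vote(track), max_score_vote(track), soft_vote(track))
# -> car truck car
```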
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
