Short Communication: Detecting Heavy Goods Vehicles in Rest Areas in Winter Conditions Using YOLOv5

Abstract: The proper planning of rest periods in response to the availability of parking spaces at rest areas is an important issue for haulage companies as well as traffic and road administrations. We present a case study of how You Only Look Once (YOLO)v5 can be implemented to detect heavy goods vehicles at rest areas during winter to allow for the real-time prediction of parking spot occupancy. Snowy conditions and the polar night in winter typically pose challenges for image recognition, hence we use thermal network cameras. As these images typically contain many overlapping and cut-off vehicles, we applied transfer learning to YOLOv5 to investigate whether the front cabin and the rear are suitable features for heavy goods vehicle recognition. Our results show that the trained algorithm can detect the front cabin of heavy goods vehicles with high confidence, while detecting the rear seems more difficult, especially when it is located far away from the camera. In conclusion, we firstly show an improvement in detecting heavy goods vehicles using their front and rear instead of the whole vehicle when winter conditions result in challenging images with many overlaps and cut-offs, and secondly, we show thermal network imaging to be promising for vehicle detection.


Introduction
To improve road safety, drivers of heavy goods vehicles must comply with strict rules regarding driving time and rest periods. Due to these regulations and contractual delivery agreements, heavy goods vehicle traffic is highly schedule-driven. Arriving at a crowded rest area after a long journey can lead to drivers exceeding the permitted driving time or having to rest outside of designated areas. As both increase traffic risk, the Barents Intelligent Transport System has initiated a pilot project with the aim of automatically reporting and forecasting the current and future availability of parking spaces at rest areas in the Barents region. The pilot project ran from January to April 2021 at two rest areas, one in northern Norway and one in northern Sweden. A crucial part of this pilot project was the detection of heavy goods vehicles in images from a thermal network camera. In this short communication, we propose a feasible solution for heavy goods vehicle detection. Computer vision algorithms have been used for various traffic monitoring tasks for many years, e.g., traffic sign recognition [1][2][3][4][5][6][7]; intelligent traffic light systems [8]; vehicle speed monitoring [9]; traffic violation monitoring [10]; vehicle tracking [11][12][13]; vehicle classification [14][15][16][17][18][19][20][21][22][23][24][25][26]; vehicle counting on streets and highways [27][28][29][30][31]; parking spot detection from the point of view of the car for parking assistants [32,33]; and parking spot monitoring [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49]. Most previous studies on parking spot monitoring use data from parking areas for passenger cars, which have marked parking spots for each car, or from settings in which the cars park in one row along a street [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48].
This differs from the setting of the two rest areas in our study, which are primarily used by heavy goods vehicles. Passenger cars also pass through these rest areas, but only a few and generally only for brief stops. In winter, the markings of the parking spots are covered by snow and ice and therefore not visible, so heavy goods vehicles do not park in a line or in marked parking spots. This leads to several challenges in detecting heavy goods vehicles: the vehicles face the camera from different angles (front, back, side); the size of the vehicles differs depending on their distance to the camera; there is a high overlap of vehicles in the camera image; and many vehicles are cut off (see Figure 1 for examples). In this paper, we used the latest version of the You Only Look Once (YOLO) object detection algorithm [50] to detect vehicles. As computer vision practitioners, our focus was on the application of the algorithm, data acquisition and data annotation. The remainder of this paper is organised as follows. Section 2 describes the selection of the algorithm and the dataset. The training and results are described in Section 3, ideas for further improvement and development are discussed in Section 4, followed by a conclusion in Section 5.

Selection of Algorithm
The decision to use convolutional neural networks was made due to their ease of use. There are a number of pre-trained models that can be tuned for a variety of tasks. They are also readily available, computationally inexpensive and show good performance metrics. Object recognition systems from the YOLO family [51,52] are often used for vehicle recognition tasks, e.g., [27][28][29][37], and have been shown to outperform other target recognition algorithms [53,54]. YOLOv5 has been shown to significantly improve on the processing time of deeper networks [50]. This attribute will gain in importance when moving forward with the project to bigger datasets and real-time detection. YOLOv5 was pre-trained on the Common Objects in Context (COCO) dataset, an extensive dataset for object recognition, segmentation and labelling. This dataset contains over 200,000 labelled images with 80 different classes, including the classes car and truck [50,55]. Therefore, YOLOv5 can be used as-is to detect heavy goods vehicles and can serve as a starting point for an altered model that detects heavy goods vehicle features such as their front and rear.

The Rest Area Dataset
At each rest area, a thermal imaging network camera is installed in a fixed position facing the main parking area. One of the cameras was installed in front of a pole, which appears as a grey area in the centre of the image. The thermal network cameras have an uncooled microbolometer image sensor with a thermal sensitivity (noise equivalent temperature difference) of <50 mK and a thermal sensor resolution of 640 × 480. The FFmpeg library [56] was used to capture images from the video streams with settings that capture frames with a scene change of more than 2%. This threshold is low enough that random sensor noise can also trigger a capture [57]. The captured frames have dimensions of 640 × 480. Figure 2 shows images from the camera under different light and weather conditions. Between 15 January 2021 and 22 February 2021, 100,179 images were collected from the cameras. During this period, data collection from both cameras was partially interrupted. These interruptions could last from a few hours to several days. The longest interruption occurred at rest area B, where the camera was offline for the first days of the data collection period. Therefore, less data were available from rest area B. One consequence of the sensitive setting of the motion detector was that it reacted to temperature changes caused by wind. Therefore, many of the images only differed from each other in their grey scale due to temperature changes and not due to changes in vehicle position. The two rest areas in this study are mainly used for long breaks (about 8 h), so there are long periods of inactivity in a specific parking space. A total of 427 images were selected for annotation and split into training, validation and test datasets. To prevent testing the model on images that are very similar to those in the training dataset, the datasets were split chronologically. Table 1 shows how the data were split.
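The scene-change-triggered capture described above can be sketched as follows. The exact FFmpeg invocation used in the pilot is not specified in this paper, so this is a plausible reconstruction using FFmpeg's `select` filter with its scene-change score; the stream URL and output pattern are placeholders.

```python
# Sketch (not the pilot's actual command) of building an FFmpeg invocation
# that keeps only frames whose scene-change score exceeds 2%.

def build_ffmpeg_capture_args(stream_url, out_pattern, scene_threshold=0.02):
    """Build an FFmpeg argument list for scene-change-triggered frame capture."""
    return [
        "ffmpeg",
        "-i", stream_url,                                 # camera video stream (placeholder URL)
        "-vf", f"select='gt(scene,{scene_threshold})'",   # keep frames with >2% scene change
        "-vsync", "vfr",                                  # write frames at a variable rate
        out_pattern,                                      # e.g. frames/img_%05d.png
    ]

args = build_ffmpeg_capture_args("rtsp://camera.example/stream", "frames/img_%05d.png")
print(" ".join(args))
```

Passing the arguments as a list (e.g. to `subprocess.run`) avoids shell quoting issues with the filter expression.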
The data were annotated using bounding boxes for three classes: truck_front, truck_back and car; see Figure 3 for an example. We chose to focus on the driver's cabin in front and side view (truck_front) and the head-on view of the rear of the truck (truck_back), because bounding boxes around whole vehicles overlapped too much. In addition, this makes it possible to recognise vehicles whose front or rear is cut off. In the 427 annotated images, there were 768 objects labelled as truck_front, 378 as truck_back and 17 as car. The 264 images from the training dataset were augmented to 580 images. For each image, a maximum of 3 augmented versions were generated by randomly applying horizontal mirroring, resizing (cropping from 19% minimum zoom to 67% maximum zoom) and changes in the grey scale (brightness variations between ±35%). Examples of augmented images are shown in Figure 4.
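The augmentation policy above can be summarised as random parameter sampling per augmented copy. The following sketch is our illustration of that policy (the actual pipeline was run in an annotation/augmentation tool); function and parameter names are ours.

```python
import random

# Sketch of the augmentation policy: up to three augmented copies per training
# image, each defined by a random horizontal flip, a random crop zoom between
# 19% and 67%, and a brightness (grey-scale) shift within +/-35%.

def sample_augmentation(rng):
    """Sample one set of augmentation parameters."""
    return {
        "flip_horizontal": rng.random() < 0.5,
        "zoom": rng.uniform(0.19, 0.67),         # crop zoom fraction
        "brightness": rng.uniform(-0.35, 0.35),  # grey-scale shift
    }

def augment_plan(image_ids, copies_per_image=3, seed=0):
    """Build a deterministic plan of augmented variants for each image."""
    rng = random.Random(seed)
    return {img: [sample_augmentation(rng) for _ in range(copies_per_image)]
            for img in image_ids}
```

Seeding the generator makes the plan reproducible, which helps when comparing training runs.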

Training
The model was trained using Google Colab, which provides free access to powerful GPUs and requires no configuration. We used a notebook developed by Roboflow.ai [58] which is based on YOLOv5 [50] and uses pre-trained COCO weights. We added the rest area dataset and adjusted the number of epochs to be trained as well as the batch size to train the upper layers of the model to detect our classes. Training a model for 500 epochs takes about 120 min. The improvement in our model can be seen in the graphs in Figure 5, which display different performance metrics for both the training and validation sets. There are three different types of loss shown in Figure 5: box loss, objectness loss and classification loss. The box loss represents how well the algorithm can locate the centre of an object and how well the predicted bounding box covers an object. Objectness is essentially a measure of the probability that an object exists in a proposed region of interest. If the objectness score is high, the image window is likely to contain an object. Classification loss gives an idea of how well the algorithm can predict the correct class of a given object.
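For reference, training YOLOv5 on a custom dataset requires a dataset configuration file; a minimal hypothetical `data.yaml` for our three classes could look as follows (the paths are placeholders, not those of the pilot project):

```yaml
# Hypothetical YOLOv5 dataset configuration for the rest area dataset.
# train.py reads this file to locate the image splits and class names.
train: ../rest_area/train/images
val: ../rest_area/valid/images

nc: 3                                      # number of classes
names: ["truck_front", "truck_back", "car"]
```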
The model improved swiftly in terms of precision, recall and mean average precision before plateauing after about 150 epochs. The box, objectness and classification losses of the validation data also showed a rapid decline until around epoch 150. We used early stopping to select the best weights.
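Selecting the best weights after training can be illustrated with a minimal sketch (our illustration; YOLOv5 itself saves a best checkpoint based on a fitness metric during training): the checkpoint kept is the one with the highest validation metric, here taken to be mean average precision.

```python
# Sketch of checkpoint selection: given one validation mAP value per epoch,
# return the epoch index and value that early stopping would keep.

def best_epoch(val_map_per_epoch):
    """Return (epoch_index, mAP) of the epoch with the highest validation mAP."""
    return max(enumerate(val_map_per_epoch), key=lambda e: e[1])
```

With metrics that plateau after about 150 epochs, as in our runs, the selected checkpoint typically lies near that point rather than at the final epoch.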

Experimental Analysis
After training our model, we made predictions for the new and unseen pictures in our test set. The examples in Figure 6 show that the algorithm can detect the front of a truck with a high degree of certainty. However, it has difficulty recognising the rear of a truck, especially when it is located far away from the camera. It also detects a car as a truck_front in two of the images.
It can be seen that the algorithm currently struggles to correctly differentiate between cars and cabins, and this becomes worse the more truck fronts are present in an image. It is also difficult for the algorithm to correctly recognise truck rears in an image. Strategies to overcome these shortcomings are proposed in Section 4.
To evaluate the model trained with the rest area dataset, we compared it to YOLOv5 [50] without any additional training, i.e., a baseline model using only COCO weights. This model contains, amongst other classes, the car and truck classes; however, it does not distinguish between truck_front and truck_back. Table 2 shows the accuracy of the baseline and the altered model for the four available classes. The baseline model, which is trained on heavy goods vehicles as a whole, had difficulties detecting them in the test images of the rest area dataset. It either did not recognise the trucks or did so with much less certainty than the altered model with the two new classes. The additional training also improved the detection of heavy goods vehicles in images in which the cabin was cut off. Some examples of detections on the test data by the two models are shown in Figure 7.

Discussion
We see the greatest potential for improving performance in adjusting the physical data collection and in improving the data annotation.
For most applications, the physical data collection setup cannot be changed. However, as this is a pilot project running on only two rest areas, the setup could be modified if more rest areas are added. We recommend continuing to use thermal network cameras: it is not possible to read number plates or identify detailed human characteristics in their images, so the data are automatically anonymised, and the cameras delivered usable images under all light and weather conditions that occurred during the project period. However, we suggest using a wider-angle camera to capture more vehicles (and more of each vehicle), avoiding obstacles in the camera's field of view, and increasing the resolution of the images.
The three classes in the dataset used in this paper are very unbalanced. Cars are highly underrepresented in the dataset, reflecting the fact that the rest areas are mainly used by trucks. One strategy to deal with this is to train on only two classes, truck_front and truck_back, or to give the car annotations more weight by adding more images with cars. The performance in recognising cars could be increased by adding images with cars from other publicly available datasets. However, there are also only half as many images with the truck_back label as with the truck_front label. We assume that performance can be increased after collecting and labelling more images, especially by balancing the number of images from both rest areas and increasing the number of images in the two smaller classes, car and truck_back.
In addition, we suggest reviewing the data augmentation strategies and using a higher augmentation rate to benefit more from the positive effects of augmentation [59].
One way to deal with a static obstacle, such as the pole located in the middle of the images of one rest area, could be to crop it out of the image, since a truck cabin with the obstacle removed has more features in common with unobstructed cabins than an obstructed cabin does (Figure 8B is more similar to Figure 8C,D than Figure 8A is to Figure 8C,D). Currently, heavy goods vehicles with both the cabin and the rear outside the image, or which are obscured by other vehicles, are rarely detected by our algorithm. To get closer to the goal of detecting all heavy goods vehicles in the picture, we first propose to further specialise our current model. Instead of training it to detect cabins in frontal and side view, it could be trained to detect only the frontal view (windscreen, front lights and number plate facing the camera). Secondly, we propose to add an additional model to the analysis. The additional model could either detect other characteristic features of heavy goods vehicles that are easily visible from the side, such as wheels, or it could classify the images into categories indicating the number of heavy goods vehicles. Knowing how many of the individual features of a heavy goods vehicle are detected in an image enables us to combine this information to estimate the number of heavy goods vehicles in the image and thus to predict occupancy rates.
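The combination step at the end of this proposal can be illustrated with a simple heuristic (our illustration, not the project's implementation): each vehicle shows at most one front and at most one rear, so the larger of the per-feature counts gives a lower bound on the number of vehicles in an image.

```python
# Sketch of combining per-feature detection counts into a vehicle count
# estimate. This lower-bound heuristic is our illustration only; the pilot
# project does not specify a particular combination rule.

def estimate_vehicle_count(n_front, n_back, n_other_features=0):
    """Lower-bound estimate of heavy goods vehicles in one image.

    Since each vehicle contributes at most one detection per feature class
    (front cabin, rear, or another characteristic feature such as wheels
    counted per vehicle), the maximum count across classes bounds the true
    number of vehicles from below.
    """
    return max(n_front, n_back, n_other_features)
```

A more refined estimator could weight the classes by their measured precision and recall, e.g. discounting the less reliable truck_back detections.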

Conclusions
Section 4 shows that there are many steps that still need to be taken to improve the detection of heavy goods vehicles in rest areas. However, we have already shown that, when analysing images from cameras with a narrow field of view to detect objects that occur in groups with many overlaps and cut-offs, the model can be improved by detecting certain characteristic features instead of the whole object. Furthermore, the usage of thermal network cameras has proven to be valuable given the purpose of the project and the dark and snowy winter conditions in northern Scandinavia. We are confident that with a bigger training set and the implementation of the changes suggested in Section 4, the algorithm can be improved even further.