1. Introduction
As a research hotspot in deep learning, target detection has a wide range of applications in computer vision. In particular, road target detection technology is gradually attracting the attention of relevant departments and companies around the world due to its important value in automated driving and the dynamic planning of road traffic flow [
1]. Currently, research in road target identification is primarily focused on the visible light perspective, with few investigations on the infrared perspective. Compared to visible light, infrared radiation can penetrate obstructions such as smoke, fog, and plastic, making infrared imagery advantageous when the view is obstructed or the weather is bad. However, infrared images have lower clarity and resolution and contain far less detail than visible light images [
2,
3]. In order to cope with the above problems, reducing background interference and ensuring sufficient information extraction are usually taken as the main research directions in the field of infrared target detection. Examples include sparse representation [
4], spatial filtering [
5], and frequency domain filtering [
6] for target detection in infrared images.
The traditional design concept for infrared target detection is to suppress background and noise while enhancing the target’s features. Zhao et al. [
5] first used a spatial filtering-based approach to detect infrared targets through background suppression. Anju et al. [
6] exploited the difference between the target and background frequencies, extracting different frequency components to achieve detection. In general, traditional infrared target detection methods use manually designed feature extractors to extract local features using sliding windows, and then use support vector machines to evaluate the detected targets [
7]. However, these algorithms suffer from high computational complexity, window redundancy, and poor robustness to multi-scale targets. To improve infrared target identification, Chen et al. [
8] created a background suppression module that enhances foreground characteristics. Li et al. [
9] presented the YOLO-CAN model, which employs several learning algorithms and enhances the loss function and convolution module to optimize feature extraction in infrared images. Zhou et al. [10] introduced the advanced classification network ConvNext and the coordinate attention mechanism, and proposed a channel and spatial attention mechanism tailored to infrared images, which significantly improves the network’s detection of small infrared targets.
Furthermore, since objects in road scenes frequently contain a significant number of regular features, researchers have suggested numerous algorithms for object detection based on feature data. Traditional road target detection approaches extract hand-crafted features from images based on the target’s attributes, which are subsequently used for target detection. However, these approaches rely on manually designed feature representations and shallow trainable architectures, and detection performance worsens when such low-level image features are combined with contextual data from the target detector or scene classifier [
11]. To improve road target detection, Zou et al. [
12] introduced an improved SFPN network into the SSD feature extraction network and used ResNet50 instead of VggNet16 [
13] as the main feature extraction network for the improved model, further increasing the depth of the network to improve performance. Ma et al. [
14] proposed an Improved Small Object Detection (ISOD) network that achieves fast and efficient detection of small targets by using an extended scale feature pyramid network and an efficient channel attention mechanism for backbone feature extraction. Luo et al. [
15] proposed a road small target detection method based on improved YOLOv3, which introduces DIOU Loss [
16] to improve the accuracy of localization and optimizes the clustering method in the YOLOv3 algorithm, thus significantly improving the accuracy and speed of detection. Liu et al. [
17] thoroughly investigated the model structure and parameter optimization, and proposed the RF-YOLOv3 network model, which was applied to road vehicle detection. The model uses a K-means clustering algorithm [18] to determine the number and aspect ratio of target candidate boxes from the characteristics of vehicles, then adjusts the model parameters according to the clustering results, which improves the detection accuracy and adaptability of the RF-YOLOv3 network. Gao et al. [
19] designed a road target detection method based on improved YOLOv8n, which significantly improves the detection performance of road targets by fusing the C2f and DBB modules, proposing the PA-AFPN feature fusion method and designing the SPPFT2_TA module.
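The anchor clustering step described above for YOLOv3-style detectors can be illustrated with a small sketch. It follows the general K-means idea used in the YOLO family since YOLOv2, with 1 - IoU between (width, height) pairs as the distance; it is not the exact procedure of Liu et al. The function names `wh_iou` and `kmeans_anchors`, the seed, and the toy box sizes are assumptions made for illustration.

```python
import random

def wh_iou(box, centroid):
    """IoU of two (width, height) pairs, assuming both boxes share a corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors; highest IoU means nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(b[0] for b in members) / len(members)
                h = sum(b[1] for b in members) / len(members)
                new_centroids.append((w, h))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Return anchors ordered from smallest to largest area.
    return sorted(centroids, key=lambda c: c[0] * c[1])

# Hypothetical box sizes in pixels: three small targets and three large ones.
toy_boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 52)]
anchors = kmeans_anchors(toy_boxes, k=2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when a dataset mixes small pedestrians with large vehicles.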
With the advancement of deep learning and computer vision technologies, many fields have turned to more advanced detection methods that focus on improving detection accuracy, robustness, and adaptability. Convolutional neural network-based target detection methods have become a research hotspot in recent years, and they fall into two main groups: one-stage regression-based methods and two-stage candidate region-based methods. In 2014, Girshick et al. [20] proposed the two-stage target detection algorithm R-CNN, the first to introduce a convolutional neural network to the field of target detection. Shortly after, Girshick et al. [21] further improved R-CNN by introducing SPPNet [22] and proposed Fast R-CNN. Later, Ren et al. [23] proposed Faster R-CNN, the first deep learning detection network that can actually be trained end-to-end. These two-stage target detection networks have high computational complexity and struggle to run in real time, making them difficult to use on embedded devices with high real-time requirements.
In contrast, one-stage target detection significantly improves detection efficiency by extracting features directly from the image and inferring bounding box location and category confidence. In 2016, Redmon et al. [
24] proposed the single-stage target detection algorithm YOLO, which became the most dominant method in target detection with high accuracy and detection speed. YOLO eliminates the region proposal step by predicting all bounding boxes simultaneously, unifying object detection into a single pass, and directly predicts bounding box locations and class probabilities at the image level using a deep convolutional neural network, striking a balance between detection accuracy and speed [
25]. Redmon then made a series of improvements and proposed YOLOv2 [
26] and YOLOv3 [
27], which further improved detection accuracy while maintaining detection speed. In 2020, Bochkovskiy et al. [
28] proposed YOLOv4. YOLOv4 experimented with a variety of backbone architectures, culminating in the most powerful backbone network CSPDarknet53, and used a modified version of spatial pyramid pooling from YOLOv3-spp and the same multi-scale predictions as YOLOv3. A few months later, the Ultralytics team presented YOLOv5 [
29]. YOLOv5 uses a modified CSPDarknet53 for the backbone and modules such as CSP-PAN and SPPF for the neck. Multiple versions are created by varying the network depth and width to meet the application and performance needs of various scenarios, considerably boosting the model’s performance, speed, and ease of use. In 2023, the Ultralytics team open-sourced the next major update to YOLOv5, calling it YOLOv8 [
30]. YOLOv8 replaces the C3 module in YOLOv5 with the lighter C2f module. YOLOv8 retains the SPPF module while fine-tuning the model at different scales instead of using a single parameter setting, significantly improving model performance. Furthermore, YOLOv8 adds an anchor-free detection head and VFL Loss for classification loss, while combining DFL Loss [31] and CIOU Loss as bounding box loss and Binary Cross Entropy as classification loss. These improvements significantly increase the detection performance and flexibility of the model. As a result, YOLOv8 is not only a first choice for target detection, but also excels in various tasks such as image segmentation and pose estimation.
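To make the bounding box terms above more concrete, the following is a minimal sketch of how IoU, DIoU, and CIoU relate to one another, following their standard published definitions. The `(x1, y1, x2, y2)` box format, the function name `ciou`, and the `eps` stabilizer are assumptions made here; this is not the Ultralytics implementation.

```python
import math

def ciou(box_a, box_b, eps=1e-9):
    """Return (IoU, DIoU, CIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: overlap area divided by union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union

    # DIoU: subtract the squared centre distance, normalised by the squared
    # diagonal of the smallest box enclosing both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    diou = iou - rho2 / c2

    # CIoU: additionally penalise a mismatch in aspect ratio.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou, diou, diou - alpha * v

# Two unit squares that do not overlap: IoU is zero, yet DIoU and CIoU
# still carry information through the centre-distance penalty.
iou_val, diou_val, ciou_val = ciou((0, 0, 1, 1), (2, 2, 3, 3))
```

The corresponding loss is 1 - CIoU. For non-overlapping boxes plain IoU is zero regardless of how far apart the boxes are, whereas the distance term still changes with the gap between the box centres.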
Infrared road target detection focuses on the labeling and localization of pedestrians, vehicles, and other road targets in infrared videos and images. Although the above methods achieve good results in road target detection, they struggle to reach the expected detection performance on low-resolution infrared images. To address the low-resolution and multi-scale problems of infrared road target detection, this paper proposes an infrared road target detection algorithm, YOLO-APDM, based on YOLOv8. It achieves highly accurate detection of infrared road targets while keeping model complexity under control.
This paper dedicates the first section to the problem of road target detection in infrared scenes and introduces the research progress of target detection based on convolutional neural networks. Then, Section 2 focuses on the main structure of YOLO-APDM proposed in this paper, and Section 3 details the various improvements of YOLO-APDM. Section 4 introduces the dataset used in this paper and shows the comparison results between YOLO-APDM and the original model. Finally, Section 5 briefly summarizes the work carried out in this paper.
The main contributions of this paper are briefly summarized as follows:
- Reconstruct the neck of the original model, introduce the P2 layer, optimize the network structure, and improve the multi-scale target detection capability of the model. 
- Improve the C2f module of the model to enhance the ability of the network to focus on the target region and reduce the complexity of the model. 
- Utilize the MSCA mechanism to direct the model’s resources toward the most salient regions of the image, thus improving the detection performance of the model.
2. The Proposed YOLO-APDM Model
Since its inception as a class of single-stage detectors, YOLO has been widely recognized by academics for its superior detection and real-time performance. In this paper, based on YOLOv8, we reconstruct the network structure, optimize feature extraction, improve the attention mechanism, and test model performance, and propose YOLO-APDM, a target detection algorithm for infrared road scenes that achieves high-precision target detection with controlled model complexity.
The overall structure of YOLO-APDM is shown in 
Figure 1. Similar to the overall structure of YOLOv8, it is mainly composed of backbone, neck, and head. To address the problem of large changes in the scale of road targets, the neck part of the algorithm is improved by using the idea of the fusion of attention scale sequences in ASF-YOLO [
32], and the network structure is optimized by introducing the P2 detection layer [
33] based on this idea. A detection head for multi-scale target detection is added and integrated with the original prediction heads to produce a four-prediction-head structure, which improves the network’s multi-scale detection capability. Deformable convolution v3 (DCNv3) [
34] is integrated with the C2f module, replacing C2f in the network, and increasing the network’s capacity to focus on the target region while reducing model complexity. At the same time, this paper adopts the multi-scale convolutional attention (MSCA) mechanism in SegNext [
35], which allows the model to concentrate its resources on detecting the most important regions in the image, thus improving the model’s detection performance.
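The effect of adding a P2 detection layer can be seen from the output grid sizes alone. The sketch below assumes the usual convention that pyramid level Pn has stride 2^n and a 640 × 640 input, which is a common YOLOv8 default and not necessarily the setting used in this paper's experiments; `head_grids` is a hypothetical helper name.

```python
def head_grids(img_size=640, levels=("P2", "P3", "P4", "P5")):
    """Map each pyramid level Pn to (stride, grid size), with stride = 2**n."""
    grids = {}
    for name in levels:
        stride = 2 ** int(name[1:])
        grids[name] = (stride, img_size // stride)
    return grids

# Print the grid each of the four prediction heads would operate on.
for name, (stride, size) in head_grids().items():
    print(f"{name}: stride {stride:>2}, grid {size}x{size}")
```

Under these assumptions, a target 16 pixels wide spans four cells on the P2 grid but only two on P3, which is why the extra high-resolution head is expected to help with small, distant road targets.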
5. Conclusions
To address the problems of low image resolution and multi-scale road targets in infrared road target detection, this paper proposes an improved high-precision infrared road target detection model, YOLO-APDM, based on YOLOv8n, with the design goal of improving detection accuracy while keeping the number of model parameters under control. Specifically, this paper adopts the idea of attention scale sequence fusion in ASF-YOLO to reconstruct the neck of the model, and then introduces the P2 detection layer to form a four-prediction-head structure, which effectively improves the detection performance for multi-scale targets on the road. In addition, replacing the C2f module with a version that integrates deformable convolution v3 both reduces the number of network parameters and enhances the network’s ability to focus on the target area, thereby improving the network’s flexibility and adaptability. Adding the MSCA mechanism concentrates the detection network’s resources on the key areas of road targets, thereby improving the accuracy and robustness of detection. Compared with YOLOv8n, YOLO-APDM shows significant improvements in the major indicators. On the FLIR_ADAS_v2 dataset, restricted to the main road target classes, YOLO-APDM improves mAP@0.5 and mAP@0.5:0.95 by 6.6% and 5.0%, respectively. On the M3FD dataset, mAP@0.5 and mAP@0.5:0.95 increase by 8.1% and 5.9%, respectively. The number of model parameters and the model size are reduced by 8.6% and 4.8%, respectively. The proposed method thus achieves higher detection accuracy while also reducing the number of model parameters and the model size, which is conducive to subsequent deployment on embedded devices.