An Improved YOLOX Model and Domain Transfer Strategy for Nighttime Pedestrian and Vehicle Detection

Aimed at the vehicle/pedestrian visual sensing task under low-light conditions and the problems of small, dense objects and line-of-sight occlusion, a nighttime vehicle/pedestrian detection method was proposed. First, a vehicle/pedestrian detection algorithm was designed based on You Only Look Once X (YOLOX). The model structure was re-parameterized and lightened, and a coordinate-based attention mechanism was introduced into the backbone network to enhance the feature extraction efficiency of vehicle/pedestrian targets. A feature-scale fusion detection branch was added to the feature pyramid, while a loss function was designed, which combines Complete Intersection Over Union (CIoU) for target localization and Varifocal Loss for confidence prediction to improve the feature extraction ability for small, dense, and low-illumination targets. In addition, in order to further improve the detection accuracy of the algorithm under low-light conditions, a training strategy based on data domain transfer was proposed, which fuses the larger-scale daylight dataset with the smaller-scale nighttime dataset after low-illumination degrading. After low-light enhancement, training and testing were performed accordingly. The experimental results show that, compared with the original YOLOX model, the improved algorithm trained by the proposed data domain transfer strategy achieved better performance, and the mean Average Precision (mAP) increased by 5.9% to 82.4%. This research provided effective technical support for autonomous driving safety at night.


Introduction
Traffic accidents are more likely to occur at night due to drivers' poor vision and tired eyes. Statistics show that the probability of an accident at night is 1-1.5 times higher than during the day, and the traffic death rate per kilometer at night is approximately three times higher than during the day [1]. Additionally, there is not enough light at night to clearly see the color details of the vehicles and people ahead. The inability of current vision-based detection algorithms to reliably identify targets under these conditions adversely affects the safety of autonomous driving systems and advanced driver-assistance systems. Therefore, it is crucial to increase the precision of object detection algorithms in low-light conditions at night. To cope with this problem, researchers have used different approaches, such as radar [2], lidar [3], and the Global Navigation Satellite System (GNSS) [4]. However, vision-based solutions have been the preferred choice of many researchers, owing to the ease of deployment and affordability of visual sensors.
Object detection is one of the most widely studied topics in computer vision, and many useful algorithms exist for this task. However, there are still two difficulties in vehicle/pedestrian detection that current techniques are unable to fully solve. The first is that performing vision-related tasks under low-light conditions in natural environments is challenging because of short exposure times, images lacking the features needed for target detection, and the noise that direct image light enhancement can generate, which interferes with vision tasks. The second is the detection of small vehicle/pedestrian targets and the multi-scale variation of vehicle/pedestrian targets. The features extracted by the detector are more susceptible to noise from the environmental background when vehicles and pedestrians of different sizes appear in the scene at the same time, or when some targets are small and of low clarity. This causes missed and false detections and presents significant challenges to the accuracy of the detection results.
Traditional vehicle/pedestrian detection algorithms recognize targets using manually designed feature extraction methods [5,6], but in natural application environments, target features are complex and varied, which makes it difficult for handcrafted features to abstract and generalize successfully. The current common approach for detecting vehicles and pedestrians uses deep-learning-based target detection algorithms, which learn feature representations from data and offer improved detection accuracy and resilience. Deep-learning-based object detection algorithms fall into two categories: one-stage and two-stage detection methods. To achieve a trade-off between speed and accuracy, one-stage detection techniques such as the You Only Look Once (YOLO) family of algorithms [7,8] and the Single-Shot multibox Detector (SSD) [9] directly classify and regress the target. The detection accuracy of two-stage detection techniques, such as the Faster Region with Convolutional Neural Networks (R-CNN) [10] and Cascade R-CNN [11], is often greater than that of one-stage detectors, but their detection speed is significantly slower and they cannot perform online detection in real time.
The authors of [12] improved the feature pyramid and the feature learning capabilities of the YOLOv3 model for use in actual applications. They also included an attention mechanism and optimized the loss function to further increase the accuracy of vehicle/pedestrian identification. By increasing the number of detection scales of YOLOv3 from 3 to 4 and establishing a feature fusion target detection layer down-sampled by 4×, Moran Ju et al. [13] were able to extract more object attributes and improve tiny target recognition. Yixing Zhu et al. [14] improved the Cascade R-CNN algorithm. In the first step, the outline of the object is estimated using a Locally Sliding Line-based Point Regression (LocSLPR) method, in which the outline is defined by the intersections of the sliding lines with the object's bounding box, so as to fully utilize information. In the second step, performance is further enhanced by gradually regressing the target object using a Rotated Cascade Region-based Convolutional Neural Network (RCR-CNN).
Although the performance of these detection algorithms has been improved, missed and false detections still occur in the face of small, dense targets, line-of-sight occlusions, and low-illumination conditions at night. In addition, although the detection accuracy of two-stage detection algorithms is higher, their detection speed makes it difficult to meet the real-time requirements of intelligent driving systems, while the one-stage YOLOv3 [13] and YOLOv4 [15] models have larger files and lower detection accuracy, which makes them unsuitable for deployment on devices with limited storage.
To address the above issues, a lightweight YOLOX [16] is chosen as the baseline model to adapt to the nighttime vehicle/pedestrian detection task by simultaneously improving the algorithm and training strategy.
The main contributions of this paper are as follows:
(1) An improved algorithm based on YOLOX was proposed for small-target pedestrian and vehicle detection at night. The main improvements include 1. re-parameterization of the model structure using the Re-parameterization Visual Geometry Group (RepVGG) technique; 2. the introduction of a coordinate-based attention mechanism; 3. the addition of a new feature-scale fusion branch; and 4. improvement of the loss function.
(2) A domain transfer training strategy was proposed that allows the model to be trained more efficiently using daytime datasets. The large-scale daytime dataset is fused with the much smaller nighttime dataset after low-illumination degrading. The models were then trained and tested separately after unified low-illumination enhancement to fully extract the data features of the existing daytime dataset and remedy the shortage of nighttime data.
(3) The proposed improved YOLOX and domain transfer training strategy were validated on a real-world dataset. The experimental results showed that the improved YOLOX algorithm produced fewer errors than the original algorithm, and that it was more accurate for nighttime vehicle/pedestrian detection when combined with the domain transfer training strategy.
If today's deep-learning-based object detection is regarded as a technical aesthetic, then looking back 20 years reveals "the wisdom of the cold weapon era". The majority of the early object detection algorithms were built on handcrafted features. By incorporating three crucial techniques, "integral image", "feature selection", and "detection cascades", the Viola-Jones (VJ) detector [17,18] significantly increased its detection speed. To reconcile feature invariance (such as translation, scaling, and illumination) and nonlinearity (regarding discriminating different object categories), N. Dalal and B. Triggs developed the Histograms of Oriented Gradients (HOG) feature descriptor in 2005 [19]. P. Felzenszwalb [20] first suggested the Deformable Part Model (DPM) as an extension of the HOG detector in 2008. Since then, R. Girshick [21][22][23] developed a number of modifications. Although the detection accuracy of modern object detectors has grown well beyond that of the DPM, many of them are still greatly influenced by its insights, including mixture models, hard negative mining, and bounding box regression.

Deep-Learning-Based Detection Methods
Due to the continuous development of computers, deep-learning-based object detection has become the mainstream of machine vision. Object detection techniques now in use can be broadly split into two groups: proposal-free and proposal-based. Proposal-free approaches treat object detection as a bounding-box regression problem. For instance, YOLO directly predicts detection results by jointly classifying categories and regressing the positions of predetermined anchors. To handle objects of various scales, SSD integrates predictions computed from hierarchical feature maps with different receptive fields. A number of extensions [24][25][26] have been suggested for more effective proposal-free object detectors. Proposal-based algorithms first create region proposals and then classify them to produce the final detection results. For example, R-CNN [27] extracts dense region proposals using a hierarchical grouping method and then classifies these proposals to produce detection results. By adding a Region Of Interest (ROI) pooling layer to share features across proposals, Fast R-CNN [28] accelerates R-CNN. Faster R-CNN replaces the hierarchical grouping used in Fast R-CNN [28] with a more accurate and efficient Region Proposal Network (RPN). For improved detection accuracy, some extensions [29][30][31][32][33] propose more powerful proposal-based detectors. Most existing object detection methods require large amounts of annotated data to train the model, and such data usually take time and effort to produce, which makes domain-adaptive detectors a popular choice for many researchers.

Domain Adaption-Based Object Detection Method
As previously mentioned, the research presented in this paper is closely related to the fields of knowledge transfer and domain-adaptive object detection, both of which focus on learning a high-performing detector in an unlabeled target domain without accessing any annotation of the target domain. A sizable number of Domain Adaptive (DA) detectors have been presented. For example, in order to reduce domain bias through adversarial learning at both the image and instance levels, DA [34] introduces a domain-adaptive detector based on Faster R-CNN. By including object relations in teacher-student consistency regularization, Mean Teacher with Object Relations (MTOR) [35] reduces domain discrepancy. By strongly aligning local similar features and weakly aligning global dissimilar features, Strong-Weak Distribution Alignment (SWDA) [36] offers a powerful cross-domain detection approach. The Image-Instance Full Alignment Network (IFAN) [37] aligns feature distributions at both the image and instance levels in a coarse-to-fine fashion. To reduce the possibility of collapse brought on by parameter sharing between the source and target domains, Asymmetric Tri-way Faster-RCNN (ATF) [38] introduces a tri-stream Faster R-CNN. In Collaborative Training between Regions (CTR) [39], the object detection classifier and the RPN are trained through collaborative self-training. To reduce the domain discrepancy, Graph-induced Prototype Alignment (GPA) [40] aligns graph-induced prototype representations in two steps. Categorical Regularization for Domain Adaptive (CRDA) [41] proposes an effective classification regularization module that can further improve DA and SWDA.

Low Illumination Datasets
For low-light object detection tasks, several datasets have been proposed. The NightOwls dataset was proposed by the authors of [42] for the detection of pedestrians at night. In order to account for different unfavorable conditions such as rain, snow, haze, and low illumination, the authors of [43] gathered an Unconstrained Face Detection Dataset (UFDD). Recently, there have been several tracks for vision tasks in various low-visibility environments in the UG2+ challenge [44]. The DARK FACE dataset contains 10,000 images, including 6000 labeled images and 4000 unlabeled ones. An exclusively dark (ExDark) dataset with 7363 images and 12 object classes was proposed by the authors of [45] for a multi-class dark object detection task. The BDD100K [46] dataset, which was released by the Berkeley AI Lab, is the largest and most varied open-source video dataset in computer vision to date. The dataset consists of 100,000 videos with an average length of 40 s at 720p and 30 frames per second, adding up to over 1100 h. The videos came from a variety of American locations. The dataset covers a variety of weather conditions, such as clear, cloudy, and rainy days, at various times of the day and night.

Low-Illumination Enhancement and Restoration Methods
When faced with a low-light image, the first thing that comes to mind is illumination enhancement and restoration of the image to support subsequent tasks. Low-light vision tasks recover image detail and correct color shifts based on the human visual experience. Early attempts relied on Retinex theory approaches [47][48][49] or histogram equalization (HE)-based approaches [50,51]. Nowadays, thanks to the development of deep learning, Convolutional Neural Network (CNN)-based methods and Generative Adversarial Network (GAN)-based methods have significantly improved performance in this task. The authors of [52] combined the Retinex theory with a deep network for low-light image enhancement. The authors of [53] used an unsupervised GAN to solve this problem. A self-supervised learning strategy for images with abnormal illumination was recently proposed by the authors of [54]. Existing low-light image enhancement methods mainly focus on correcting low contrast to increase visibility, while the high noise is usually addressed with a post-processing module. To solve this problem, the authors of [55] suggest a technique for improving low-light images based on simultaneous illumination and noise adjustment with unpaired data. A Structure- and Texture-Aware Network (STAN) for low-light image enhancement was proposed by the authors of [56] based on the observation that the representations of structure and texture are highly separated in the frequency spectrum. STAN is made up of a structure sub-network and a texture sub-network. The methods mentioned above have worked well for low-light enhancement, but feeding the enhanced images directly into existing object detection networks usually does not give the best detection performance.

Different Methods Applied to Low-Illuminated Detection
The authors of [57] proposed a method of domain adaptation for object detection in a low-light situation. In this method, models pretrained in different domains are merged using glue layers and a generative model, which feeds latent features to the glue layers to train them without an additional dataset. The authors of [58] suggest an active object detection approach and brightness control strategy based on reinforcement learning. Without having to retrain the detector, low-quality photos can be enhanced into high-quality images with the aid of the pretrained models, and overall performance is increased. In [59], a hyperbolic tangent curve is first used to map the image brightness to the desired level. Second, unsharp filtering and block-matching and 3-D filtering algorithms are applied in the YCbCr color space. Finally, the nighttime surveillance task is concluded with pedestrian detection using a convolutional neural network model. With only modest processing resources and no supervised training, the authors of [60] described a Flash-No-Flash (FNF)-controlled illumination acquisition methodology that enables reliable object detection. The technique depends on the simultaneous acquisition of two images, one with and one without strong artificial illumination (flash/no flash). The authors of [61] provided a dataset of unprocessed short-exposure low-light photos together with corresponding long-exposure reference images to aid in the development of learning-based pipelines for low-light image processing. Using this dataset, they created a pipeline for processing low-light pictures based on end-to-end training of a fully convolutional network that works directly with the raw sensor input. However, the performance of these existing methods in truly dark environments is not very satisfactory.

YOLOX Model
YOLOX is a new generation of the YOLO series of algorithms following YOLOv5 [62], which improves on YOLOv3 and YOLOv5 and further improves the detection performance. YOLOX is mainly divided into three parts: a backbone network (Backbone), a feature pyramid (Neck), and a prediction module (Head). The backbone network adopts Cross Stage Partial (CSP) Darknet as the feature extraction network of YOLOX. It mainly comprises Focus, Cross Stage Partial Network (CSPnet) [63], and Spatial Pyramid Pooling (SPP) components and extracts three effective feature layers from the input image. The feature pyramid is constructed by CSP-PAFPN (PAFPN, Path Aggregation Feature Pyramid Networks), which makes full use of the three effective feature layers obtained by the backbone network to improve the model performance through bottom-up and top-down feature fusion. The prediction module replaces the detection head with a Decoupled Head, which improves the training convergence speed and detection accuracy of the model, and adopts an anchor-free approach, which removes the preset prior boxes and predicts the edges of the target directly to reduce computational redundancy.
Although the original YOLOX model achieves better performance than other methods for the detection task in this paper, many missed and false detections still occur for small targets, occlusions, and low-illumination environments at night. Therefore, the model structure and loss function of the algorithm are improved in this paper to increase the detection accuracy of nighttime vehicle/pedestrian targets. A schematic diagram of the improved model structure is shown in Figure 1.

Structural Re-Parameterization
In general, the more complex the structure and the more parameters of a deep convolutional neural network, the more expressive the features that can be extracted, which helps to achieve better results in the training phase. However, more complex models can adversely affect the inference speed. In order to trade off detection accuracy and speed, the training and inference structures of the YOLOX model are decoupled, drawing on the RepVGG re-parameterization idea [64]. The specific method is shown in the subfigure of the structural re-parameterization module in Figure 1: the original 3 × 3 convolutions in the YOLOX backbone network are replaced with the RepVGG module. In the training phase, a multi-branch structure is used to improve the detection performance, i.e., a 3 × 3 convolution + a 1 × 1 convolution + an Identity branch. In the inference phase, the re-parameterization is performed and a single 3 × 3 convolutional structure is used to accelerate inference, which is consistent with the original YOLOX backbone network structure. By improving the model structure with re-parameterization, both the performance advantage of a complex model in the training phase and the speed advantage of a single-branch structure in the inference phase are obtained.
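The equivalence between the multi-branch training structure and the single 3 × 3 inference convolution can be checked numerically. The following is a minimal sketch, not the implementation used in this paper: it assumes stride 1, equal input and output channel counts, and omits the batch normalization layers that RepVGG [64] additionally folds into each branch. The function names (`conv2d`, `train_time_block`, `fuse_branches`) are hypothetical.

```python
import random

def conv2d(x, kernel):
    """Naive stride-1 'same' convolution.
    x: [C_in][H][W]; kernel: [C_out][C_in][k][k] with odd k."""
    h, w = len(x[0]), len(x[0][0])
    k = len(kernel[0][0])
    pad = k // 2
    out = []
    for k_out in kernel:
        plane = [[0.0] * w for _ in range(h)]
        for ci, k_in in enumerate(k_out):
            for i in range(h):
                for j in range(w):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:
                                plane[i][j] += k_in[di][dj] * x[ci][ii][jj]
        out.append(plane)
    return out

def train_time_block(x, k3, k1):
    """Multi-branch form: 3x3 conv + 1x1 conv + identity (BN omitted)."""
    y3, y1 = conv2d(x, k3), conv2d(x, k1)
    return [[[y3[c][i][j] + y1[c][i][j] + x[c][i][j]
              for j in range(len(x[0][0]))]
             for i in range(len(x[0]))]
            for c in range(len(x))]

def fuse_branches(k3, k1):
    """Fold the 1x1 and identity branches into a single 3x3 kernel."""
    fused = [[[row[:] for row in k_in] for k_in in k_out] for k_out in k3]
    for o in range(len(k3)):
        for i in range(len(k3[0])):
            fused[o][i][1][1] += k1[o][i][0][0]  # 1x1 kernel acts at the centre tap
            if o == i:
                fused[o][i][1][1] += 1.0         # identity: centre tap of 1, same channel
    return fused

def rand():
    return random.uniform(-1.0, 1.0)

random.seed(0)
C, H, W = 2, 4, 5
x = [[[rand() for _ in range(W)] for _ in range(H)] for _ in range(C)]
k3 = [[[[rand() for _ in range(3)] for _ in range(3)] for _ in range(C)] for _ in range(C)]
k1 = [[[[rand()]] for _ in range(C)] for _ in range(C)]

y_train = train_time_block(x, k3, k1)        # three branches, as in training
y_infer = conv2d(x, fuse_branches(k3, k1))   # one 3x3 conv, as in inference
max_diff = max(abs(y_train[c][i][j] - y_infer[c][i][j])
               for c in range(C) for i in range(H) for j in range(W))
print(max_diff)  # differs only by floating-point round-off
```

Because convolution is linear, the three branches sum into one kernel without changing the output, which is why the inference-time network keeps the plain 3 × 3 structure.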

Lightweight Design of the Model
Compared with large models, small models require less storage space, which saves resources and enables the models to be deployed on devices with limited storage. In view of this, this paper optimizes the number of Neck channels and the Head structure of the YOLOX algorithm to reduce the number of model parameters. First, the number of Neck channels is reduced so that the output dimension of all components (including Conv and CSP) in the PAFPN is changed to 128, which corresponds to the input dimension of the decoupled head in YOLOX (128). Subsequently, the now-redundant 1 × 1 convolution in the Head is removed (this convolution served to scale the dimensionality of the feature maps output from PAFPN and convert them uniformly to 128 dimensions). Finally, the depth of the CSP components in PAFPN is increased (from the default 1 to 3) so that decreasing the output dimensionality of all features in PAFPN does not reduce the detection performance of the model. These structural improvements reduce the size of the YOLOX model to 13.5 M, which is lower than the size of the YOLOv5 model (13.7 M).

Introduction of Attention Mechanism
In order to obtain richer image semantic information in vehicle/pedestrian target features, an attention mechanism that fuses coordinate information (CA, Coordinate Attention) [65] is introduced into the YOLOX backbone network and added after the CSP component. The specific structure is shown in the subdiagram of the coordinate attention module in Figure 1. The CA module decomposes the channel attention into two parallel 1D feature encoding operations, which effectively fuses spatial coordinate information into the generated attention maps, enhances the utilization of cross-channel, direction-aware, and position-aware information, and helps the model to locate and identify targets more accurately.
Within the YOLOX model, the CA module aggregates the input features along the horizontal and vertical directions into two independent direction-aware feature maps using pooling kernels of dimensions (H, 1) and (1, W). Then, the two feature maps embedded with direction-specific information are encoded into two attention maps by concatenating, convolving, splitting, and convolving again. In this way, long-range dependencies can be captured along one spatial direction while precise positional information is preserved along the other. Finally, the two attention maps are multiplied onto the input feature map to enhance its representation. The final output of the CA module can be expressed as shown in Equation (1).
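For reference, the output of the CA module is written below in the standard form given in [65]. The symbols are assumptions taken from that formulation rather than from this text: x_c is the input feature map of channel c, z_c^h and z_c^w are the pooled direction-aware features, and g_c^h and g_c^w are the two attention maps obtained after the convolutions and Sigmoid activation.

```latex
% Directional pooling with kernels (H, 1) and (1, W)
z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad
z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)

% Output of the CA module: input re-weighted by the two attention maps
y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)
```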

Feature Pyramid Improvement
The label-scale information of the nighttime vehicle/pedestrian dataset is visualized as a scatter plot in Figure 2. The normalized widths and heights of the targets are mainly concentrated below 0.07, i.e., most of the detected targets are narrower than 89.6 pixels (1280 × 0.07) and shorter than 50.4 pixels (720 × 0.07). Most of the detected targets are therefore comparatively small, and it is necessary to ensure the detection accuracy of the algorithm for small, dense targets.
In the whole feature extraction and fusion process of the YOLOX algorithm, the input image is down-sampled five times in the backbone network, feature maps sized 320 × 320, 160 × 160, 80 × 80, 40 × 40, and 20 × 20 are generated in sequence, and the last three feature maps are input into PAFPN for feature fusion. Large-scale feature maps pay more attention to detailed information and retain a large amount of small-target information, while small-scale feature maps pay more attention to semantic information. Although small-scale feature maps are better for understanding complex objects, their loss of information on small targets may be more serious owing to the lower resolution. Therefore, the model was improved by adding a new branch with a feature scale of 160 × 160 (where the depth of the Cross Stage Partial (CSP) component is 1) into PAFPN, and feature fusion is performed on four different scales of feature maps to obtain more shallow image feature information while retaining the deep image semantic information. The specific improved PAFPN network structure is shown in the fusion detection branch module in the Neck part of Figure 1.
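The scale arithmetic above can be reproduced with a short calculation. This is an illustrative sketch only: the 640-pixel input side is an assumption inferred from the 320 × 320 size of the first feature map, and `feature_map_sizes` is a hypothetical helper name.

```python
def feature_map_sizes(input_size=640, num_downsamples=5):
    """Side length of the feature map after each stride-2 down-sampling step."""
    return [input_size // 2 ** k for k in range(1, num_downsamples + 1)]

sizes = feature_map_sizes()      # [320, 160, 80, 40, 20]
original_scales = sizes[-3:]     # the three maps fed to PAFPN in the original YOLOX
improved_scales = sizes[-4:]     # with the additional 160 x 160 branch

# A target whose normalized extent is 0.07 covers only a few cells on the
# coarse maps, which is why the finer 160 x 160 branch helps small targets.
norm_extent = 0.07
for s in improved_scales:
    stride = 640 // s
    print(f"{s}x{s} map: stride {stride}, target spans about {norm_extent * s:.1f} cells")
```

On the 20 × 20 map such a target covers little more than one cell, whereas on the 160 × 160 map it covers roughly eleven.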

Confidence Loss Function Design
In nighttime vehicle and pedestrian detection based on the YOLOX model, the anchor-free mechanism and the strong imbalance between foreground and background pixels in the experimental data leave the positive and negative samples in the prediction stage extremely unbalanced, which makes it difficult for the model to be fully trained. In addition, the targets in the scenes of the experimental dataset are concentrated in location and most of them are occluded, which easily leads to missed detections.
To address the above two problems, Varifocal Loss [66] was introduced as the confidence prediction loss function. Varifocal Loss is mainly used to train dense object detectors to regress an Intersection Over Union (IoU)-aware classification score, which improves the detection accuracy. Target occlusion is one of the characteristics of dense targets, so using this loss function can alleviate the missed detections caused by occlusion. Varifocal Loss also builds on the weighting method of Focal Loss [24], which mitigates the imbalance between positive and negative samples during training. The formula of the Varifocal loss function is shown in Equation (2).
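A reconstruction of Equation (2), assuming it follows the standard Varifocal Loss definition in [66] with the symbols defined below:

```latex
\mathrm{VFL}(p, q) =
\begin{cases}
-q \left( q \log p + (1 - q) \log (1 - p) \right), & q > 0, \\
-\alpha \, p^{\gamma} \log (1 - p), & q = 0,
\end{cases}
```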
where p is the predicted confidence after the Sigmoid activation function; q is the IoU between a positive sample and the ground truth (q = 0 for negative samples); α balances positive and negative samples and is set to 0.25; and γ balances easy and difficult samples and is set to 1.5.
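The behavior of this loss for a single prediction can be sketched as follows. This is a scalar illustration under the standard definition in [66], not the batched implementation used for training; the function name `varifocal_loss` and the clamping constant `eps` are assumptions added for numerical safety.

```python
import math

def varifocal_loss(p, q, alpha=0.25, gamma=1.5, eps=1e-12):
    """Varifocal loss for one prediction.
    p: predicted confidence in (0, 1); q: target IoU (0 for negatives)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if q > 0:
        # Positive sample: binary cross-entropy against the soft target q,
        # weighted by q so that well-localized positives count more.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p**gamma so that the many
    # easy negatives (small p) contribute very little.
    return -alpha * (p ** gamma) * math.log(1.0 - p)

easy_negative = varifocal_loss(0.05, 0.0)
hard_negative = varifocal_loss(0.90, 0.0)
good_positive = varifocal_loss(0.80, 0.8)
poor_positive = varifocal_loss(0.20, 0.8)
print(easy_negative, hard_negative, good_positive, poor_positive)
```

Only negatives are scaled by p^γ, so easy background samples are suppressed while positives keep their full weight q.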

Bounding Box Regression Loss Function Design
The bounding box regression loss in the YOLOX model is mainly calculated from the IoU. The larger the overlap between the predicted box and the ground truth box, the larger the IoU and the better the localization. However, IoU loss has the following drawbacks. When the predicted box and the ground truth box do not overlap, the IoU is 0 and does not reflect the distance between the two boxes; the gradient of the IoU loss is also 0, so the loss cannot be optimized. In addition, the IoU depends only on the overlap area, so for a given overlap area the IoU is the same however the two boxes are aligned, which slows convergence.
To address these problems, CIoU loss [67] was introduced as the bounding box loss function to improve the convergence speed and detection accuracy. CIoU loss integrates the overlap area, center-point distance, and aspect ratio of the predicted box and the ground truth box; it directly minimizes the normalized distance between the two boxes and accelerates convergence. This resolves the problem that IoU loss does not accurately reflect how the two boxes overlap, and makes the regression faster and more accurate when the predicted box overlaps or is even contained in the target box. The CIoU loss is calculated as shown in Equations (3)-(5).
Appl. Sci. 2022, 12, 12476
$$ L_{ciou} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad IoU = \frac{S_i}{S_u} \qquad (3) $$
$$ \alpha = \frac{v}{(1 - IoU) + v} \qquad (4) $$
$$ v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \qquad (5) $$
where ρ denotes the Euclidean distance between two center points; b and b^{gt} denote the center points of the predicted box and the ground truth box, respectively; c denotes the diagonal length of the smallest box enclosing the predicted box and the ground truth box; S_i is the overlap area of the two boxes; S_u is the union area of the two boxes; w and w^{gt} denote the widths of the predicted box and the ground truth box, respectively; h and h^{gt} denote their heights; v measures the consistency of the aspect ratios; and α is the corresponding trade-off weight.
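The three CIoU terms can be made concrete with a small sketch. It implements the standard CIoU definition from [67] for axis-aligned boxes given as (x1, y1, x2, y2) with positive width and height; the box format and the function name are assumptions made for illustration.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def ciou_loss(pred: Box, gt: Box) -> float:
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # Overlap area S_i and union area S_u
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    s_i = inter_w * inter_h
    s_u = w * h + w_gt * h_gt - s_i
    iou = s_i / s_u

    # Squared distance between box centers (rho^2)
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box (c^2)
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Aspect-ratio consistency v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v
```

For two non-overlapping boxes the IoU term is constant, but the ρ²/c² term still grows with the center distance, which is what gives the loss a useful gradient in that case.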

Combined Loss Function
The final combined loss function consists of the bounding box regression loss L_ciou, the confidence prediction loss L_vf, the category prediction loss L_cls, and the L_1 loss, and is calculated as shown in Equations (6)-(8).
where λ is the loss weight, which is set to 5; n is the number of candidate boxes; p is the true category probability; p̂ is the predicted category probability; and σ(x) is the Sigmoid function, which maps values to (0, 1). In addition, the L_1 loss is used only after Mosaic data enhancement is turned off; P is the predicted value, and T is the ground truth label value.
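As Equations (6)-(8) are not reproduced here, the following is only a sketch of how such a combined loss might be assembled. It assumes that the four terms are added with λ applied to the CIoU term, that L_cls is a binary cross-entropy on Sigmoid outputs summed over the candidate boxes, and that L_1 is a sum of absolute differences; all function names are illustrative.

```python
import math
from typing import Sequence


def bce_with_logits(targets: Sequence[float], logits: Sequence[float]) -> float:
    """Sum of -[p*log(s) + (1-p)*log(1-s)] with s = sigmoid(z), in a stable form."""
    total = 0.0
    for p, z in zip(targets, logits):
        total += max(z, 0.0) - z * p + math.log1p(math.exp(-abs(z)))
    return total


def l1_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Sum of |P - T| over the regression outputs."""
    return sum(abs(P - T) for P, T in zip(predicted, target))


def combined_loss(l_ciou: float, l_vf: float, l_cls: float,
                  l_1: float = 0.0, lam: float = 5.0,
                  mosaic_on: bool = True) -> float:
    """Assumed weighted sum; the L1 term is only added once Mosaic is off."""
    total = lam * l_ciou + l_vf + l_cls
    if not mosaic_on:
        total += l_1
    return total
```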

Training Strategy for Data Domain Transfer
To further improve the generalization capability of the improved YOLOX algorithm in nighttime scenarios, data enhancement was performed on the existing dataset. First, low-illumination enhancement was applied to the nighttime data to strengthen the color characteristics of the targets; then, the existing larger-scale daytime vehicle/pedestrian images were used for data expansion to compensate for the small amount of nighttime vehicle/pedestrian detection data in the publicly available autonomous driving dataset, thus improving the generalization capability of the model.
However, daytime and nighttime data differ greatly in color features and lie in different data domains, so if they are mixed for training directly without domain transfer, the model easily deviates from the target data domain and the performance gain is not significant. To reduce the color feature differences between the two domains, a data domain transfer approach was used to process the nighttime and daytime datasets separately. First, the raw nighttime dataset is low-illumination enhanced to generate a fake-day dataset, and the raw daytime dataset is low-illumination degraded to generate a fake-night dataset. Then, the fake-night dataset is in turn low-illumination enhanced to generate a second fake-day dataset. Finally, the two generated fake-day datasets are mixed for training, which achieves the data expansion. Note that because the data domain was transformed in the training phase, the test set must also be low-illumination enhanced in the test phase to reduce the color feature differences between the training and test sets. The whole training and testing process of the detection algorithm is shown in Figure 3.
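The data flow of this strategy can be summarized in a few lines. In the sketch below, `enhance` and `degrade` are placeholders for the low-light enhancement and low-illumination degrading steps described in the following subsections; the function names and the list-based dataset representation are assumptions made for illustration.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def build_training_set(night: List[Image], day: List[Image],
                       enhance: Callable[[Image], Image],
                       degrade: Callable[[Image], Image]) -> List[Image]:
    """Mix the two fake-day datasets produced by the domain transfer."""
    fake_day_from_night = [enhance(img) for img in night]    # night -> fake day
    fake_night = [degrade(img) for img in day]               # day -> fake night
    fake_day_from_day = [enhance(img) for img in fake_night] # fake night -> fake day
    return fake_day_from_night + fake_day_from_day


def prepare_test_set(night_test: List[Image],
                     enhance: Callable[[Image], Image]) -> List[Image]:
    """The test images receive the same enhancement as the training images."""
    return [enhance(img) for img in night_test]
```

Passing the daytime images through degradation and then enhancement, rather than using them directly, means that both halves of the training set have gone through the same enhancement step as the test images.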

Low-Light Enhancement
To enhance the color features of vehicle and pedestrian targets at night, this paper uses the Zero-Reference Deep Curve Estimation (Zero-DCE) algorithm for low-light enhancement. The processing pipeline of the algorithm is shown in Figure 4. The algorithm designs image-specific curves and applies them iteratively to approximate pixel-wise higher-order curves, producing images with improved brightness and contrast.
The curve equation for the brightness enhancement is shown in Equation (5) below.
$$ LE_n(x) = LE_{n-1}(x) + A_n(x)\, LE_{n-1}(x)\,\bigl(1 - LE_{n-1}(x)\bigr), \qquad LE_0(x) = I(x) $$
Here n = 1, where n is the number of iterations, A is a matrix of curve parameters of the same size as the input image, I(x) is the input low-illumination image, and LE(x) is the enhanced image. Furthermore, compared with other enhancement algorithms such as Low-Light Net (LLNet) [68] and EnlightenGAN, this algorithm has fewer model parameters and a faster processing speed, which makes it suitable for deployment on low-end devices while preserving the enhancement effect. In practical tests on the experimental dataset in this paper, the algorithm takes only 2 ms of inference time, and the model size is only 313 K. The performance of different low-light enhancement algorithms is shown in Table 1.
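The iterative curve can be illustrated for a single pixel. The sketch below applies the quadratic light-enhancement curve of Zero-DCE to a scalar intensity in [0, 1]; in the actual algorithm the curve parameters are predicted per pixel by a network, whereas here they are passed in directly, and the function name is illustrative.

```python
from typing import Sequence


def light_enhancement_curve(pixel: float, curve_params: Sequence[float]) -> float:
    """Apply LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1}) once per parameter."""
    le = pixel
    for a in curve_params:
        # For a in [-1, 1] the result stays within [0, 1]
        le = le + a * le * (1.0 - le)
    return le
```

For example, a dark pixel of intensity 0.25 with a single curve parameter of 1.0 is mapped to 0.25 + 0.25 × 0.75 = 0.4375, while pixels at 0 or 1 are left unchanged.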

Low-Illumination Degrading Transformations
Furthermore, the low-illumination degrading stage proposed by Cui et al. [69] was redesigned in this section so that the daytime dataset generates images closer to the target domain. The entire low-illumination degrading transformation is shown in Figure 5. Compared with the original method, a new domain adaptation estimation module was added, and the inverse tone mapping and quantization modules were removed. Experimentally, these two modules degrade the generated results, which makes it difficult for the model to explore the target features and is not conducive to training.

Domain Adaptation Estimation
The module first converts the input daytime image to grayscale and estimates the ratio of dark pixels in it, and then assigns a corresponding suppression factor to each range of ratios. The factor then acts on the low-light corruption process so that fake-night data with more stable illumination are generated from daytime data of different brightness levels.

Gamma Correction and Inverse Gamma Correction
Gamma correction compensates for the nonlinearity of human perception in dark areas. The standard gamma curve [70] and its inverse process (inverse gamma correction) are shown in Equations (11) and (12).
where the parameter γ follows the uniform distribution γ ∼ U(2, 3.5) and is randomly sampled; ε is a very small value that ensures numerical stability during the conversion.
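Since Equations (11) and (12) are not reproduced here, the sketch below assumes the common power-law form, in which gamma correction raises the clamped intensity to 1/γ and the inverse raises it to γ. The function names and the default value of ε are illustrative assumptions.

```python
import random


def sample_gamma(rng: random.Random) -> float:
    """Draw gamma from U(2, 3.5), as stated in the text."""
    return rng.uniform(2.0, 3.5)


def gamma_correct(x: float, gamma: float, eps: float = 1e-8) -> float:
    """Assumed forward curve: brightens dark values of a linear intensity in [0, 1]."""
    return max(x, eps) ** (1.0 / gamma)


def inverse_gamma(x: float, gamma: float, eps: float = 1e-8) -> float:
    """Assumed inverse curve: maps a display-referred intensity back to linear."""
    return max(x, eps) ** gamma
```

Clamping with ε keeps the base strictly positive, so the power is well defined and has a finite derivative at x = 0.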

Color Space Conversion
Two color space conversions are included in the low-illumination degrading stage. The first converts sRGB to the camera's internal color space (cRGB), and the second converts the white-balanced signal from cRGB back to sRGB; these conversions are needed because the camera's internal color space is not identical to sRGB [71,72]. The converted signal y_sRGB is obtained from the color correction matrix (CCM) M_ccm, as shown in the following Equation (13).
In addition, its inverse process is shown in Equation (14).

White Balance and Inverse White Balance
White balance simulates the color constancy of the Human Visual System (HVS) by mapping "white" colors onto white objects [72].The color of the image is determined by the color of the light and the reflectivity of the material.The white balance step in the camera pipeline estimates and adjusts the gain of the red channel and the blue channel so that the image appears to be illuminated under "neutral" lighting.The process is shown in the following Equation (15).
where the red gain g_r is sampled from (1.9, 2.4) and the blue gain g_b is sampled from (1.5, 1.9), each uniformly and independently; the inverse process uses the reciprocal gains 1/g_r and 1/g_b.
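A minimal sketch of the gain sampling and its inverse is given below. It assumes that the gains are applied as per-channel multiplications on an (R, G, B) pixel with the green gain fixed at 1; this form and the function names are illustrative assumptions rather than a reproduction of Equation (15).

```python
import random
from typing import Tuple

RGB = Tuple[float, float, float]


def sample_gains(rng: random.Random) -> Tuple[float, float]:
    """Independent uniform draws for the red and blue gains."""
    return rng.uniform(1.9, 2.4), rng.uniform(1.5, 1.9)


def white_balance(pixel: RGB, g_r: float, g_b: float) -> RGB:
    """Scale the red and blue channels by their gains (green assumed unchanged)."""
    r, g, b = pixel
    return (r * g_r, g, b * g_b)


def inverse_white_balance(pixel: RGB, g_r: float, g_b: float) -> RGB:
    """Undo white balance by applying the reciprocal gains 1/g_r and 1/g_b."""
    r, g, b = pixel
    return (r / g_r, g, b / g_b)
```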

Low-Light Corruption
When the lens focuses photons onto the capacitor array, the charge produced by each capacitor is proportional to the lux of the ambient illumination for a given exposure time, aperture size, and automatic gain control. The random arrival of photons at the sensor causes shot noise, which is a fundamental limitation of the camera. Since photon arrivals follow a Poisson distribution, the uncertainty in the number of photons collected in a fixed period is δ_s = √S, where δ_s is the shot noise and S is the signal from the sensor. In the output amplifier, read noise is generated during the conversion of charge to voltage, and it can be simulated with a zero-mean Gaussian random variable with fixed variance.
In camera imaging systems, shot noise and read noise are the most common noise sources; therefore, we model the noise measurements on the sensor [73] as in Equations (16) and (17).
where the true intensity of each pixel x is obtained from the preceding inverse-processing steps and is linearly attenuated by the parameter k. To simulate different lighting conditions, the light intensity parameter k is randomly sampled from a truncated Gaussian distribution with a mean of 0.1 and a variance of 0.08, truncated to the range (0.01, 1.0). In addition, the parameter ranges of δ_r and δ_s follow the literature [74], as shown in Equations (18) and (19).
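The sampling of k and the combined shot/read noise can be sketched as follows. Because Equations (16)-(19) are not reproduced here, the sketch assumes a Gaussian approximation whose variance is δ_r² plus δ_s times the attenuated signal, and it leaves the values of δ_r and δ_s to the caller; the function names and this variance form are assumptions, not the authors' exact model.

```python
import math
import random


def sample_light_intensity(rng: random.Random, mean: float = 0.1,
                           variance: float = 0.08,
                           low: float = 0.01, high: float = 1.0) -> float:
    """Rejection-sample k from a Gaussian truncated to (low, high)."""
    std = math.sqrt(variance)
    while True:
        k = rng.gauss(mean, std)
        if low < k < high:
            return k


def corrupt_pixel(x: float, k: float, delta_s: float, delta_r: float,
                  rng: random.Random) -> float:
    """Attenuate the clean intensity x by k, then add signal-dependent noise."""
    signal = k * x
    # Assumed variance: read-noise term plus shot-noise term proportional to signal
    noise_variance = max(delta_r ** 2 + delta_s * signal, 0.0)
    return signal + rng.gauss(0.0, math.sqrt(noise_variance))
```

With both noise parameters set to zero the function reduces to pure linear attenuation, which is a convenient way to test the darkening step in isolation.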

Experimental Dataset
The experimental data come from the public BDD100K dataset, which contains 100,000 images of road scenes under different weather conditions and at different times of the day, with an image resolution of 1280 × 720. It is the largest and most diverse autonomous driving dataset in terms of content.
Since the task studied in this paper is nighttime vehicle/pedestrian detection, a total of 4800 nighttime images with vehicle and pedestrian targets were manually screened, including 3800 images in the training set and 1000 images in the validation set. For the data domain transfer study, 9399 daytime images with vehicle and pedestrian targets were prepared. Data enhancement processes such as random horizontal flipping, color transformation, and Mosaic were also applied to the training data to further expand the dataset and improve the generalization capability of the model.

Evaluation Metrics
To evaluate the improvement achieved by the algorithm in this paper and its difference from other detection algorithms, six metrics are used: mean Average Precision (mAP), F1 score, Recall, Precision, inference time, and model weight size.
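For reference, Precision, Recall, and F1 can be computed from detection counts as in the sketch below. It uses the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN); the function name is illustrative, and mAP, which additionally requires ranking detections by confidence, is not covered.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard detection metrics; zero denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```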

Experimental Parameter Setting
The model training framework in this section was based on PyTorch 1.8, running on an i7-9700k CPU and an NVIDIA GeForce RTX 2070 SUPER GPU with 8 GB of video memory. The network input size is 640 × 640, and Stochastic Gradient Descent (SGD) is used as the optimizer. The learning rate is set to 0.01 and the weight decay is 0.0005. The momentum of SGD is 0.937, the batch size is 8, and the total number of training rounds is 200 epochs. The first 3 epochs use the WarmUp learning rate strategy, and Mosaic data enhancement is turned off for the last 15 epochs. The rotation angle is set to 0, random multi-scale training and Mixup data enhancement are removed, and the rest of the settings are left at their default values.
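These settings can be collected in a single configuration object, as in the sketch below. The dictionary keys and the helper function are hypothetical names chosen for readability and do not correspond to option names in the YOLOX code base; epochs are assumed to be counted from 0.

```python
TRAIN_CONFIG = {
    "input_size": (640, 640),
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "momentum": 0.937,
    "batch_size": 8,
    "epochs": 200,
    "warmup_epochs": 3,       # WarmUp learning rate strategy
    "no_mosaic_epochs": 15,   # Mosaic is turned off for the last 15 epochs
    "rotation_degrees": 0,
    "multiscale": False,
    "mixup": False,
}


def mosaic_enabled(epoch: int, cfg: dict = TRAIN_CONFIG) -> bool:
    """True while Mosaic is still active (epoch index assumed to start at 0)."""
    return epoch < cfg["epochs"] - cfg["no_mosaic_epochs"]
```

Under these assumptions Mosaic is active for epochs 0 to 184 and disabled for epochs 185 to 199.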

Training Evaluation Process and PR Curve
In this paper, the training evaluation process of YOLOv5, YOLOX, and the improved YOLOX algorithm was visualized, and the results are shown in Figure 6a. The improved YOLOX algorithm (the algorithm in this paper) achieves higher mAP values (at an IoU threshold of 0.5) than YOLOv5 and YOLOX, and the results converge. In addition, to further improve the detection robustness of the algorithm in nighttime scenarios, the improved YOLOX algorithm was trained again with the data domain transfer strategy proposed in this paper, and the mAP improved by a further 2.8%, which demonstrates the effectiveness of the strategy. Finally, the trained models were analyzed using Precision-Recall (PR) curves (at an IoU threshold of 0.5), and the results are shown in Figure 6b. The PR curve of the proposed algorithm completely envelops those of the other algorithms, indicating that the proposed algorithm performs better.

Ablation Studies
The improved YOLOX algorithm proposed in this paper comprises five improvements: model structure reparameterization and light weighting, the coordinate attention (CA) mechanism, the feature pyramid improvement, the confidence loss function based on Varifocal Loss, and the CIoU-based bounding-box regression loss function. To verify the effectiveness of each improvement, an ablation study was conducted on the nighttime vehicle/pedestrian dataset, and the results are shown in Table 2.
Among them, Baseline represents the original YOLOX algorithm trained with default parameters, with an mAP of 76.5%. After model structure reparameterization and light weighting, the model weight size was reduced by approximately 4 MB while the accuracy was almost unchanged, with an mAP of 76.6%. After three-layer CA attention modules were added after the CSP components at positions P3, P4, and P5 in the YOLOX backbone network, the mAP increased by 0.9% to 77.5% with almost the same model size, but the inference time increased because of the additional layers. After the fusion detection branch was added to the feature pyramid, the mAP rose by 1.8% to 79.3%, and the inference time also increased because the model structure became more complex. Then, with Varifocal Loss as the confidence loss, the inference time was almost unchanged while the mAP improved by 0.4% to 79.7%. Finally, with CIoU loss as the bounding-box regression loss, the mAP improved by another 0.4% to 80.1%.
We further investigated the effect of the position and number of layers of the CA modules in the backbone network. As shown in Table 3, a total of eight comparative experiments were conducted, covering four position schemes and three layer-number settings. In Table 3, "√" indicates that the CA module was used at that position in the network, and bold indicates the best results. The results in Table 3 show that the CA module achieves the best accuracy of 80.1% when it is added after the CSP components at positions P3, P4, and P5 in the backbone network. When the number of CA layers is further reduced from three to one, the mAP decreases by only 0.5%, while the inference time is reduced by 5.3 ms and the model size by 0.2 MB, which is a better trade-off than three CA layers. The final mAP of the improved YOLOX algorithm is 79.6%. Compared with the original YOLOX model, the mAP is improved by 3.1% and the model size is reduced by 1.6 MB.

Comparison of Different Testing Methods
To further illustrate the effectiveness of the proposed method, its detection performance was compared with the one-stage detectors YOLOv3, YOLOv4, and YOLOv5 and the two-stage detectors Faster R-CNN and Cascade R-CNN, using the same experimental configuration on the nighttime vehicle/pedestrian dataset. The results are shown in Table 4. Table 4 shows that the improved algorithm proposed in this paper achieves the best mAP, F1, and Recall values among the YOLO series and the two-stage algorithms: 79.6%, 75.8%, and 72.3%, respectively. Meanwhile, the algorithm meets the requirements of real-time detection in terms of inference speed, and the model size is only 15.6 MB, so it can be deployed on devices with limited storage. Although the algorithm has no advantage over YOLOv5 in inference speed or model size, its mAP is 5.8% higher and its Recall is 8.7% higher. Nighttime object detection is thus improved, and the model size and inference time can be further reduced by model pruning and quantization. Therefore, the algorithm in this paper is more advantageous in practical applications.
In addition, to illustrate the effectiveness of the data domain transfer training strategy more clearly, the improved algorithm trained with this strategy was compared with the improved algorithm trained without it, and the comparison results are also shown in Table 4 above.
The results show that mAP, F1, Recall, and Precision all improve further after the improved algorithm is trained with the data domain transfer strategy, and that the mAP is higher than when additional daytime training data are simply added for mixed training. Compared with the original YOLOX algorithm, the final improved algorithm raises the mAP by 5.9% to 82.4%, the F1 value by 4.3%, Recall by 4.9%, and Precision by 3.3%.
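The reported gains can be cross-checked by simple subtraction. The sketch below uses only the mAP values quoted in this section; the variable names are illustrative, and the differences are absolute percentage points.

```python
baseline_map = 76.5          # original YOLOX
improved_map = 79.6          # improved YOLOX, nighttime data only
domain_transfer_map = 82.4   # improved YOLOX with data domain transfer

gain_from_model = round(improved_map - baseline_map, 1)
gain_from_strategy = round(domain_transfer_map - improved_map, 1)
total_gain = round(domain_transfer_map - baseline_map, 1)
```

The model changes account for 3.1 points and the training strategy for 2.8 points, which together give the overall gain of 5.9 points.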

Effectiveness Analysis
In order to test the detection effectiveness of the proposed method in practice, the detection results of the original YOLOX algorithm, the improved algorithm in this paper and the improved algorithm trained by the data domain transfer strategy are visualized using a real nighttime vehicle/pedestrian dataset, as shown in Figure 7.
Figure 7 shows that the improved YOLOX algorithm has lower false negative and false positive rates than the original algorithm, and that it is more effective for nighttime vehicle/pedestrian detection when combined with the data domain transfer training strategy.

Conclusions
In this work, a vehicle/pedestrian detection algorithm was designed based on YOLOX. In addition, to further improve the detection accuracy of the algorithm under low-light conditions, a training strategy based on data domain transfer was proposed. We trained the improved YOLOX with the proposed domain transfer strategy and achieved satisfactory results. The main findings are as follows:
(1) The ablation experiments revealed that the improvement to the feature pyramid of the YOLOX model contributed the most, with a 1.8% improvement in mAP. By adding a fusion detection branch at the large-scale feature map in the backbone network and fusing it with the original three smaller-scale feature maps, more shallow image feature information is obtained while the deep semantic information is retained, which effectively enhances the detection of small, dense targets.
(2) Introducing the coordinate-based attention mechanism in the YOLOX backbone network can improve the feature extraction capability of the deep model, but the extra computation increases the model inference time by 7.7 ms; by simplifying the attention module from a three-layer structure to a one-layer structure, the inference time can be reduced by 5.3 ms at the cost of a 0.5% decrease in mAP.
(3) A model training strategy based on data domain transfer was proposed, in which the nighttime and daytime datasets are domain transferred by combining low-light enhancement and low-illumination degrading methods and then mixed for training. After the improved algorithm is trained with the domain transfer strategy, the detection performance for both nighttime vehicle and pedestrian targets improves, and the mAP increases by 2.8%.
(4) The improved algorithm, after being trained with the domain transfer strategy, raises the mAP for nighttime vehicle/pedestrian targets by 5.9% to 82.4%.
Our proposed method can effectively improve the target detection accuracy of self-driving vehicles in nighttime environments and is also relevant to other target detection tasks in low-light environments. However, it is still a long way from the highly reliable sensing method needed to guarantee safe nighttime driving. In the future, we will further explore target detection in fog conditions and the fusion of visual detection with other techniques such as radar and lidar.

Figure 2. Distribution of sample label scale.

Figure 3. General training and testing framework for algorithms.

Figure 5. The pipeline of low-illumination degrading transformations.

Figure 6. (a) Comparison of mAP for different algorithms. (b) Comparison of PR curve for different algorithms.

Table 1. Performance comparison of different low-light enhancement algorithms.

5.3.3. Effect of CA Module Location and Number of Layers

Table 3. Comparison of CA module location and number of layers.

Table 4. Performance comparison of different algorithms.