Phased Feature Extraction Network for Vehicle Search Tasks Based on Cross-Camera for Vehicle–Road Collaborative Perception

The objective of vehicle search is to locate and identify vehicles in uncropped, real-world images, which combines two tasks: vehicle detection and re-identification (Re-ID). As an emerging research topic, vehicle search plays a significant role in the future perception of cooperative autonomous vehicles and road driving, and it has become a trend in the development of intelligent driving. However, there is no suitable dataset for this study. The Tsinghua University DAIR-V2X dataset is therefore used to create the first cross-camera vehicle search dataset, DAIR-V2XSearch, which combines the vehicle-side and roadside cameras in real-world scenes. Existing search networks are primarily designed for pedestrian search. Because the task scenarios differ, the network must be redesigned to handle the large differences in perspective that arise in vehicle search. A phased feature extraction network (PFE-Net) is proposed as a solution to the cross-camera vehicle search problem. First, the anchor-free YOLOX framework is selected as the backbone network, which not only improves the network’s performance but also eliminates the ambiguity in which multiple anchor boxes correspond to a single vehicle ID in the Re-ID branch. Second, for the vehicle Re-ID branch, a camera grouping module is proposed to effectively address issues such as sudden changes in perspective and disparities in shooting under different cameras. Finally, a cross-level feature fusion module is designed to enhance the model’s ability to extract subtle vehicle features and the precision of Re-ID. Experiments demonstrate that the proposed PFE-Net achieves the highest precision on the DAIR-V2XSearch dataset.


Introduction
A vehicle search involves locating and identifying vehicles in uncropped images of the real world. It has a wide range of applications in intelligent transportation systems and, with the continuous development of technology, has become essential to the realization of autonomous driving. For the vehicle search task, a comprehensive, trustworthy, and objective dataset is needed to evaluate the performance of an algorithm fairly, which is one of the most important aspects of the entire task. However, there is no suitable vehicle search dataset.
Existing pedestrian search datasets [1] are frequently acquired with roadside cameras to address security concerns. In accordance with the adage "stand high, see far", the data captured by a roadside camera is typically less occluded and more stable. Nevertheless, in autonomous driving scenarios, roadside cameras have two deficiencies: (1) a roadside camera can only capture the vehicle target from a single angle; (2) a roadside camera cannot track a target over a long period, so it is challenging to fully extract the features of the foreground target. Moreover, vehicle cameras are the primary means of data acquisition in modern autonomous driving [2]. However, the vehicle-side camera frequently encounters issues, such as occlusion, that prevent it from perceiving the environment without blind spots. Therefore, Bishop et al. [3] argue that single-vehicle intelligence is not an effective solution to the autonomous driving problem. Consequently, numerous vehicle-road collaboration technologies [4,5] have emerged. In vehicle-road collaboration, the infrastructure provides vehicles with information that extends well beyond their current field of view so that they can complete tasks such as target detection and trajectory prediction, which helps ensure that future control decisions are correct and safe. If vehicle-road collaboration is added to the vehicle search task to provide more comprehensive perception, the acquisition of vehicle targets and the training precision of the model will be improved.
Based on previous research, the DAIR-V2XSearch vehicle search dataset has been developed. It is compiled from the vehicle-road collaboration dataset DAIR-V2X [6] proposed by Tsinghua University. Using the data collected at both ends of DAIR-V2X and the annotated vehicle boxes, vehicles are selected and matched, and the vehicle ID and camera ID are then labeled. The DAIR-V2XSearch dataset is advantageous in the following ways: (1) With the vehicle camera as the mobile end and the roadside camera as the fixed end, the vehicle camera compensates for the roadside camera, resulting in a more comprehensive view of the same vehicle. (2) The two devices together produce diverse backgrounds. Additionally, the two devices are installed at different heights, so the same vehicle captured from the same vantage point may appear slightly different. (3) Unlike large-scale, well-annotated datasets generated in virtual scenarios (with simulators such as Sim4CV [7] and CARLA [8]), this dataset is obtained in real-world scenarios, which avoids the errors introduced by virtual scenarios and facilitates subsequent work.
Existing vehicle search algorithms [9][10][11][12][13] continue to face extremely challenging retrieval and fine-grained recognition problems. In vehicle search, the vehicle must not only be accurately located in the image but also distinguished from the background and identified. Currently, there are two categories of methods: one-step and two-step. As shown in Figure 1a, the two-step method [9] treats detection and Re-ID as two independent tasks. First, an existing detection model is used to locate the vehicle; the cropped vehicle box is then passed to the Re-ID network to extract the subtle differences between vehicles. The two-step method can achieve high precision, but it is time-consuming and computationally intensive. As a result, the one-step method [10][11][12][13] was developed. This approach combines detection and Re-ID end to end, as depicted in Figure 1b. The majority of current one-step models use the Faster R-CNN framework [14] for detection, with Re-ID branches added to complete the search task.
The accuracy of the one-step method is limited for two reasons. The first is the anchor boxes. Anchor boxes were originally designed for target detection [14] and are employed in Faster R-CNN. However, anchor boxes are poorly suited to extracting Re-ID features. Many ambiguous features are introduced into Re-ID training because, during anchor-based training, one anchor box frequently corresponds to multiple vehicle IDs or multiple anchor boxes correspond to one vehicle ID. The second is the sharing of features between the two tasks. The detection task is the classification of a single class, while the Re-ID task is the classification of multiple IDs that belong to the same class. If the two tasks use identical features, the performance of each may suffer.
Therefore, a new network, the phased feature extraction network (PFE-Net), is proposed to address the aforementioned problems. This network is based on the YOLOX [15] one-stage detection network, which is designed without anchor boxes and has high detection accuracy. Unlike previous "detection first" [16,17] or "Re-ID first" [18] frameworks, the detection and Re-ID tasks are treated equally in our architecture. The network includes two homogeneous branches, one for detection and one for Re-ID feature extraction. The detection branch is implemented in an anchor-free manner, and SimOTA is adopted as the label assignment strategy. The Re-ID branch computes a Re-ID feature at each pixel to represent the object centered at that pixel. To better adapt to the vehicle search dataset produced here, a camera grouping module and a cross-level feature extraction module are also proposed.
The three most significant contributions of this paper are as follows:
1. To address the insufficiency of vehicle search datasets, a collaborative vehicle search dataset for real-world vehicle scenarios, DAIR-V2XSearch, is developed.
2. To complete the vehicle search more efficiently, a network for phased feature extraction is designed. Based on the characteristics of vehicles themselves, two modules are also designed.
3. To validate the performance of the model, extensive experiments are conducted on the DAIR-V2XSearch dataset, where it achieves the highest performance. In addition, experiments are conducted on the pedestrian search dataset PRW to validate the generalization of the model, achieving high accuracy.

Vehicle Search
The objective of vehicle search is to locate and identify the same vehicle, given a query vehicle, in a set of uncropped, real-world images, which combines the two tasks of vehicle detection and Re-ID. In recent years, pedestrian search has developed rapidly and achieved remarkable results [11][12][13], and vehicle search is slowly evolving as well. As there are few studies on vehicle search, pedestrian search is the primary point of reference here. Current pedestrian search frameworks can be divided into two-step and one-step modes. Zheng et al. [1] employed a two-step procedure: pedestrians were first detected, the resulting detection boxes were then fed into the Re-ID network, and the final result was obtained. Although the precision of the final search was high, the model was large and complex, and computation was slow.
Xiao et al. [12] created the online instance matching (OIM) loss for Re-ID and proposed the first one-step model, based on Faster R-CNN. A new Re-ID cut layer was added after the detection features to perform Re-ID matching and calculate the loss. In this way, both computation speed and accuracy were improved. Chen et al. [11] proposed norm-aware embedding, which decomposes the pedestrian embedding into a norm used for detection and an angle used for Re-ID. Despite this improvement, the search framework still relied on the original two-stage anchor-based detection network, and its speed remained slow. Subsequently, Yan et al. [18] proposed the first one-stage anchor-free model, with an aligned feature aggregation module designed to follow the Re-ID-first principle, which improved efficiency without sacrificing accuracy. Inspired by previous research, a similar one-stage anchor-free framework is designed for the vehicle search model, which trains the detection and Re-ID tasks simultaneously. In addition, two new modules are designed based on the characteristics of vehicles to improve the suitability of the model for vehicle feature extraction.

Vehicle Search Dataset
In recent years, numerous pedestrian search datasets have been published. PRW [1] consists of data collected by six roadside cameras, with bounding box positions and pedestrian IDs manually labeled. VeRi776 [19] is a vehicle Re-ID dataset obtained by photographing a one-square-kilometer area over 24 h, with vehicles restricted to predefined bounding boxes. Recent research on autonomous driving reveals that single-vehicle perception is plagued by occlusions, but these shortcomings can be compensated for by the cooperative perception of vehicle and road. DAIR-V2X [6] is the first real-world vehicle-road collaboration dataset annotated with category information and bounding boxes. Overall, there is no dataset dedicated to the vehicle search task. Hence, a cross-camera vehicle search dataset, DAIR-V2XSearch, is created to complete the task more effectively.

Vehicle Re-ID
Vehicle Re-ID refers to the process of learning embedding features from cropped vehicle images, which is a significant distinction from the vehicle search task. As shown in Figure 2, (a) is the form of a vehicle Re-ID dataset, and (b) is the form of a vehicle search dataset. In recent years, vehicle Re-ID has been extensively studied. Some methods [20,21] extract easily identifiable vehicle features with a high degree of precision. However, each easily identifiable feature must be annotated for training, which requires a significant amount of manual labor. Other feature extraction methods achieve high accuracy by designing metric models [22], adding attention mechanisms [23], using generative adversarial networks [24], etc. In this work on vehicle search, the conflict between the detection and Re-ID tasks is analyzed, the two tasks are processed in parallel, and a hierarchical feature extraction module is designed to improve the training accuracy of the Re-ID branch.

Vehicle Detection
Existing vehicle detection techniques can be classified in two ways. Firstly, based on the process, they can be divided into: (1) Two-stage methods: This approach typically involves an intermediate region proposal step, as in Faster R-CNN [14], Mask R-CNN [25], etc. As it requires calculating candidate regions, it consumes a lot of memory and reduces detection speed. (2) One-stage methods: This approach, which includes YOLO [26], SSD [27], etc., outputs the detection results directly without generating region proposal boxes. This method is fast, and, with continuous improvement, its accuracy can compete with that of two-stage methods.

Secondly, based on how anchor boxes are used, they can be divided into the following. (1) Anchor-based methods: To obtain the best detection performance, clustering analysis typically needs to be performed on the boxes in the dataset before training to determine a set of optimal anchor boxes. This is a complex process that introduces prior knowledge into the network, and each dataset typically requires many experiments to determine its optimal anchor boxes. (2) Anchor-free methods: Anchor-free detection methods [15,28], such as CornerNet and YOLOX [15], do not require anchor boxes and have a simple structure and fast calculation speed. Because DAIR-V2XSearch is a new dataset, an anchor-based method would require clustering analysis on the dataset to obtain prior boxes, which complicates the process. An anchor-free single-stage detection network does not require this step. To simplify the task and make the designed model more suitable for different datasets, the anchor-free detection network YOLOX [15] is selected as the basic framework for vehicle search.
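To make the extra preprocessing step of anchor-based methods concrete, the following is a minimal sketch of one common way to derive prior boxes: k-means clustering of box widths and heights with the distance d = 1 - IoU. The choice of this particular clustering scheme, the function names, and the toy box sizes are assumptions for illustration and are not taken from the paper; an anchor-free detector such as YOLOX skips this step entirely.

```python
import random

def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height), both placed at the origin."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs using distance 1 - IoU; returns k anchor shapes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # initialise from k distinct dataset boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # highest IoU is the same as smallest 1 - IoU distance
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:  # keep the old centroid for an empty cluster
                new_centroids.append(centroids[i])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_centroids.append((w, h))
        if new_centroids == centroids:  # assignments have stabilised
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])

# Hypothetical box sizes in pixels (not from DAIR-V2XSearch).
boxes = [(10, 12), (11, 13), (12, 11), (60, 40), (64, 44), (58, 42), (200, 120), (210, 130)]
anchors = kmeans_anchors(boxes, k=3)
```

The result depends on the dataset and on the initialisation, which is why anchor-based pipelines usually repeat this analysis for every new dataset.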

Data Acquisition
In autonomous driving, vehicle cameras are used for a variety of purposes. However, numerous studies have found that single-vehicle perception is frequently hampered by occlusion and other issues. A vehicle search dataset is therefore created under vehicle-road cooperation to enhance the performance of vehicle search. This dataset is a modification of the DAIR-V2X dataset proposed by Tsinghua University. This section provides an overview of the dataset, which targets the requirements of cooperative over-the-horizon perception between vehicle and road.
(1) Sensors: The dataset is collected at 28 intersections selected from Beijing's autonomous driving demonstration zone, with four pairs of high-resolution cameras deployed as roadside devices at each intersection to collect data from various perspectives. In addition, a front-view camera is installed on the vehicle as the vehicle-end device to acquire data simultaneously.

(2) Data processing: Because the two devices jointly perform vehicle search, the data collected by the two devices must be time-matched. If the time difference between the two devices' data is less than 10 ms, the data are selected, and the synchronization time is recorded. Keyframes within this 10 ms time difference are then extracted from the captured video data.
(3) Data labeling: An ID is assigned to each vehicle identified in the cropped images. A camera ID annotation is also included: the vehicle camera ID is set to 0, and the roadside camera ID is set to 1. Each vehicle ID is associated with at least one camera device. In total, 492 vehicle IDs are annotated, each of which is annotated at least twice. Figure 3 shows the production process and a visualization of the dataset. First, the timestamps of the two devices are matched, and the vehicle targets are cropped. Then, identical vehicles are identified and assigned IDs. Finally, the information, including the bounding box and vehicle ID, is written to a JSON file. Following the existing convention for pedestrian search datasets, the images are divided into two sub-datasets, train and gallery, with a ratio of 1:2, and one cropped box is randomly selected for each vehicle ID contained in the gallery to form the query dataset. The train dataset is used for training, while the gallery and query datasets are used for testing.
Sensors 2023, 23, 8630
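The processing and labeling steps described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names, the annotation field names ("bbox", "vehicle_id", "camera_id"), the nearest-timestamp pairing, and the random per-image split are all hypothetical. Only the 10 ms threshold, the camera ID convention (0 for vehicle, 1 for roadside), the 1:2 train-to-gallery ratio, and the one-query-per-gallery-ID rule come from the text.

```python
import random

def match_frames(vehicle_ts, road_ts, max_diff_ms=10.0):
    """Pair each vehicle-side timestamp with its nearest roadside timestamp.

    Both inputs are sorted lists of timestamps in milliseconds. A pair is
    kept only if the absolute difference is below max_diff_ms.
    """
    pairs, j = [], 0
    for tv in vehicle_ts:
        # advance j while the next roadside frame is at least as close to tv
        while j + 1 < len(road_ts) and abs(road_ts[j + 1] - tv) <= abs(road_ts[j] - tv):
            j += 1
        if road_ts and abs(road_ts[j] - tv) < max_diff_ms:
            pairs.append((tv, road_ts[j]))
    return pairs

def split_dataset(image_ids, annotations, seed=0):
    """Split images into train and gallery at about 1:2, then draw one query box per gallery ID."""
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    n_train = len(shuffled) // 3          # one third train, two thirds gallery
    train, gallery = shuffled[:n_train], shuffled[n_train:]

    boxes_by_id = {}
    for img in gallery:
        for box in annotations[img]:
            boxes_by_id.setdefault(box["vehicle_id"], []).append((img, box))
    # one randomly chosen (image, box) pair per vehicle ID present in the gallery
    query = {vid: rng.choice(cands) for vid, cands in sorted(boxes_by_id.items())}
    return train, gallery, query

# Toy annotations: camera_id 0 is the vehicle camera, 1 is the roadside camera.
annotations = {
    f"img{i}": [{"bbox": [0, 0, 10, 10], "vehicle_id": i % 3, "camera_id": i % 2}]
    for i in range(6)
}
pairs = match_frames([0, 100, 205], [3, 98, 150, 230])
train, gallery, query = split_dataset(sorted(annotations), annotations)
```

In this sketch a roadside frame may be paired with more than one vehicle-side frame; a real pipeline would likely enforce one-to-one matching.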

Dataset Contributions
(1) The first vehicle search dataset: By using data from two viewpoints, the dataset supports research that improves the applicability of vehicle search technology to the field of autonomous driving. It effectively addresses both the occlusion that affects data acquired by the vehicle camera and the limited shooting range caused by the roadside camera's fixed field of view.
(2) Complex environmental information: The vehicle search dataset contains complex environmental information. Data are collected by two devices from different angles, which results in data with variable backgrounds, resolutions, and perspectives. This improves the robustness of models trained on the dataset and makes them more suitable for tasks such as cross-camera vehicle tracking [29], trajectory prediction [30], and others.
The dataset's annotations and additional specific information have been added to the website. The dataset can be downloaded at https://github.com/Niuyaqing/DAIR-V2XSearch.git (accessed on 26 February 2023).


Methodology

Review
To meet the requirements of vehicle search under vehicle-road collaboration, the phased feature extraction network for vehicle search is introduced in this section. The network structure is shown in Figure 4. In Section 4.2, the benefits and drawbacks of anchor-free and anchor-based networks are analyzed, and YOLOX is chosen as the most suitable backbone network for vehicle search. Then, in Section 4.3, the detection branch is designed, and the detection head is decoupled in order to improve the vehicle detection accuracy. Finally, in Section 4.4, the Re-ID branch is introduced, which uses the camera grouping module and the feature stratification module to extract small, fine-grained differences between vehicles in order to improve the precision of the vehicle search.

YOLOX Network
YOLOX is one of the most popular one-stage anchor-free object detection methods because it detects both large and small objects without anchors. For a moving vehicle captured at a distance by the roadside camera, the box size changes drastically, which makes an anchor-free detection method more applicable. In addition, in anchor-based methods, a single anchor box may correspond to multiple IDs, or multiple anchor boxes may correspond to a single ID, which introduces a great deal of ambiguity during the training of Re-ID features and is not optimal for training the model. YOLOX also provides excellent detection accuracy. Because object detection focuses on acquiring inter-class information while Re-ID focuses on differentiating intra-class information, there is a conflict between the two tasks that makes learning them simultaneously challenging. However, a more precise box for the detected sample gives a higher detection accuracy, which can in turn result in a higher Re-ID accuracy.
Afterward, the two tasks, vehicle detection and vehicle Re-ID, are performed simultaneously. The YOLOX detection head is used for detection, and excellent accuracy is achieved. The two designed modules are then added to complete the vehicle search task. The specific network model framework is shown in Figure 4.


Detection Branch
In object detection, classification and regression tasks frequently conflict with each other, which is a well-known issue [31]. In this section, the YOLOX detection head is employed. The detection head is set to a decoupled structure, and the regression and classification are output separately, which significantly accelerates the model's convergence.

(1) Creating Corresponding Alignment Entities
In the original YOLOX model, different levels of features are used to detect objects of different sizes, which significantly improves the detection accuracy. However, for the Re-ID task, the Re-ID features obtained at different stages are distinct and carry different background features, which has a significant impact on the learned discrimination ability. Using multiple stages for detection also increases the complexity of the model and slows training, neither of which is conducive to the subsequent Re-ID task. Even though the low-order feature has less semantic information, it contains sufficient location information. Therefore, the detection framework based on FPN [32] is modified: low-order and high-order features are combined, and detection is performed with a single detection head. The structure of the detection head is shown in Figure 5. To connect the two parts laterally, the $\{C_3, C_4, C_5\}$ feature maps from the ResNet-50 backbone are utilized, and each stage is then upsampled to obtain the $\{P_3, P_4, P_5\}$ feature maps. Here, a 3 × 3 deformable convolution is employed, which can better adapt and adjust the receptive field on the input feature map to produce increasingly precise feature maps.

$$P_3 = \mathcal{C}\big(\mathrm{conv}(P_4),\ \mathrm{conv}(C_3)\big) \qquad (1)$$

where two 1 × 1 convolutions are used at $P_4$ and 3 × 3 convolutions are used at $C_3$. $\mathcal{C}$ denotes the concatenation of two features for improved multi-level feature aggregation. In order to achieve a good balance between the performance of the two subtasks of detection and Re-ID, only the largest feature map, $P_3$, is used for detection, at the cost of some detection performance. The specific results are detailed in Section 4.3.
(2) Detection Loss Calculation

GIoU loss is used as the IoU loss when calculating the detection branch's loss:

$$IoU_{loss} = 1 - GIoU, \qquad GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \qquad IoU = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the boxes for which the IoU is calculated, and $C$ is the smallest box enclosing $A$ and $B$. BCE loss (binary cross-entropy loss) is used for the detection box position loss, $Obj_{loss}$, and the classification loss, $Cls_{loss}$:

$$BCE_{loss} = -\left[\, y \log r + (1 - y) \log (1 - r) \,\right]$$

where $r$ is the model output value, which must lie between 0 and 1, and $y$ is the ground-truth label.
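The two detection losses named above can be illustrated with a small, dependency-free sketch. It uses the standard GIoU and binary cross-entropy definitions for a single box pair and a single prediction; the function names, the `(x1, y1, x2, y2)` box layout, and the `eps` clamp are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math


def giou_loss(a, b):
    """GIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])

    # Intersection of A and B (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # C: the smallest box enclosing both A and B.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou


def bce_loss(r, y, eps=1e-12):
    """Binary cross-entropy for one output r in (0, 1) and label y in {0, 1}."""
    r = min(max(r, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
    return -(y * math.log(r) + (1 - y) * math.log(1 - r))
```

Identical boxes give a loss of 0, while disjoint boxes give a loss above 1, which is what lets GIoU provide a useful signal even when the IoU itself is zero.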

Re-ID Branch
As part of class-based feature comparison, the Re-ID branch is used to extract more discriminative features between vehicles. To this end, two modules are designed, each addressing a separate aspect of the Re-ID branch.

(1) Camera Grouping Module

Typically, the dataset for a search task is collected from multiple cameras, so multiple perspectives of the same vehicle can be obtained. However, because the cameras are installed at different positions, the pictures they capture differ significantly in color, saturation, and brightness. Therefore, a camera embedding module is proposed that uses the camera ID for simple grouping and injects camera information into the features for aggregation, in order to account for the differences between cameras. The insertion position of the camera grouping module is shown in Figure 4.

Specifically, the dataset contains $N$ cameras, denoted as $ID_r$, $r \in [1, N]$. The module is initialized with a randomly generated sequence. After initialization, the camera embedding is obtained as $E_c \in \mathbb{R}^{N_C \times A}$, where $A = H \times W$, and $H$ and $W$ are the height and width of the corresponding image in the current $V_0$ channel. The camera embedding feature for a photo $img_i$ captured by camera $ID_r$ can therefore be written as $Ec_i^{r}$. The camera embedding feature $E_c$ is passed to the backbone, where it is combined with the initial backbone feature $V_0$ under a balancing hyperparameter $\gamma$; the best results are obtained when $\gamma = 0.6$. By incorporating this module, camera clustering is performed so as to minimize the impact of camera differences.
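A minimal sketch of the camera embedding idea follows. The text says that $E_c$ is passed to the backbone and balanced against $V_0$ by $\gamma$; the sketch assumes this is an element-wise addition, $V_0 + \gamma \cdot E_c$, which is an assumption rather than the confirmed expression. The function names, the Gaussian initialization, and the flattened $H \times W$ layout are likewise hypothetical.

```python
import random


def init_camera_embedding(num_cameras, height, width, seed=0):
    """Randomly initialise one embedding of length A = H * W per camera."""
    rng = random.Random(seed)
    area = height * width
    # Assumed initialisation: small Gaussian noise.
    return [[rng.gauss(0.0, 0.02) for _ in range(area)] for _ in range(num_cameras)]


def add_camera_embedding(v0, camera_embedding, camera_id, gamma=0.6):
    """Combine a flattened backbone feature v0 with the embedding of one camera.

    camera_id is 1-based, matching r in [1, N].
    Assumed form: v0 + gamma * E_c[camera_id].
    """
    embedding = camera_embedding[camera_id - 1]
    if len(embedding) != len(v0):
        raise ValueError("feature and camera embedding must have the same length")
    return [v + gamma * e for v, e in zip(v0, embedding)]
```

With `gamma=0` the feature is returned unchanged, so the hyperparameter directly controls how much camera information is injected.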
(2) Cross-level Feature Extraction Module

The vehicle's center point coordinates $(x, y)$ are obtained through detection, and the object Re-ID feature centered at $(x, y)$ is then extracted from the feature map to obtain the vehicle's frame feature. Observation of the majority of vehicle frames shows that the most distinctive features (logo, headlights, etc.) are located near the center. As shown in Figure 6, as the receptive field expands, more of the vehicle's distinguishing characteristics are covered, but so is more background information, which is harder to distinguish. A novel form of progressive central pooling is therefore introduced to process the extracted features hierarchically.
To implement this, local features must first be arranged hierarchically. In Figure 6, pooling starts from the initial central region and then proceeds through the successive levels. Within the hierarchical modules, the information contained in the vehicle's features grows from less to more and from concentrated to generalized, resulting in more generalized training. Assuming that the lower-left corner is the origin of the image $I \in \mathbb{R}^{W \times H}$, the circular center mask region $M$ of the $k$th region is determined by $R_k$, the radius of the $k$th circle. The extracted mask features are then used to reproject the features, yielding the final Re-ID features.

Sensors 2023, 23, 8630
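The progressive central pooling described above can be sketched as a series of circular masks of increasing radius around the detected center. The sketch below assumes that the mask of level $k$ selects every location whose distance to the center is at most $R_k$ and that each level is summarized by average pooling; both choices, and all function names, are illustrative assumptions.

```python
def circular_mask(width, height, cx, cy, radius):
    """Binary mask that is 1 inside the circle of the given radius around (cx, cy).

    Rows are indexed by y and columns by x, both measured from the image origin.
    """
    return [
        [1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0 for x in range(width)]
        for y in range(height)
    ]


def masked_average(feature, mask):
    """Average of the feature values selected by the mask."""
    total, count = 0.0, 0
    for feature_row, mask_row in zip(feature, mask):
        for value, keep in zip(feature_row, mask_row):
            if keep:
                total += value
                count += 1
    return total / count if count else 0.0


def progressive_central_pooling(feature, cx, cy, radii):
    """Pool one single-channel feature map at each radius R_k, from centre outwards."""
    height, width = len(feature), len(feature[0])
    return [masked_average(feature, circular_mask(width, height, cx, cy, r)) for r in radii]
```

Small radii keep the pooled value concentrated on the central, most distinctive region, and larger radii progressively mix in the surrounding context.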
(3) Re-ID Loss Calculation

The network is optimized with a global-feature OIM loss [12] (Online Instance Matching loss) and a triplet loss [22]. OIM loss was proposed for pedestrian search tasks. It stores the feature centers of all labeled identities in a lookup table (LUT), $V \in \mathbb{R}^{D \times L} = \{v_1, \ldots, v_L\}$, which holds $L$ $D$-dimensional feature vectors. In addition, a circular queue stores the features of $Q$ unlabeled identities, $U \in \mathbb{R}^{D \times Q} = \{u_1, \ldots, u_Q\}$. The probability that $x$ is identified as the identity with ID $i$ is computed from the inner products of $x$ with the vectors in $V$ and $U$, where $T$ in $v_i^{T} x$ denotes the transpose. The objective of OIM is to minimize the expected negative log-probability.

The commonly used triplet loss [22] is then added to the Re-ID branch to distinguish detailed features between classes: it shortens the distance to the corresponding features stored in the LUT and pushes features outside the LUT far away. After detection, the candidate feature set is first obtained, and then the triplet set $\{a, p, n\}$ is constructed. In the triplet loss function $L_{tri}$, $f_a$ is the anchor feature, $f_p$ is a positive sample feature with the same ID as the anchor, and $f_n$ is a feature with a different ID from the anchor. Finally, the loss of the Re-ID branch combines the OIM loss and the triplet loss through a coefficient $\lambda$; the best results are obtained when $\lambda = 0.6$.
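As a rough illustration of the Re-ID losses discussed above, the following sketch computes an OIM-style probability as a softmax over inner products with the LUT and the unlabeled queue, together with a margin-based triplet loss. Several details are assumptions and are not confirmed by the text: the `temperature` parameter (set to 1 by default), the Euclidean distance and `margin` value in the triplet loss, and the combination `l_oim + lam * l_tri` in `reid_loss`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def oim_probability(x, lut, queue, i, temperature=1.0):
    """Probability that feature x belongs to labelled identity i (0-based)."""
    logits = [dot(v, x) / temperature for v in lut]
    logits += [dot(u, x) / temperature for u in queue]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    return exps[i] / sum(exps)


def oim_loss(x, lut, queue, i, temperature=1.0):
    """Negative log-probability of the true identity."""
    return -math.log(oim_probability(x, lut, queue, i, temperature))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(f_a, f_p, f_n, margin=0.3):
    """Pull the positive towards the anchor and push the negative away."""
    return max(0.0, euclidean(f_a, f_p) - euclidean(f_a, f_n) + margin)


def reid_loss(l_oim, l_tri, lam=0.6):
    """Assumed combination of the two losses with coefficient lambda."""
    return l_oim + lam * l_tri
```

Because the unlabeled queue only contributes to the denominator, unlabeled identities act purely as negatives for every labeled identity.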

Experiment Setting
Datasets: Extensive experiments were conducted on the DAIR-V2XSearch dataset. Since there is no other existing vehicle search dataset, the popular pedestrian search dataset PRW [1] was chosen to test the effectiveness and generalizability of the proposed method. The PRW dataset includes images captured by six roadside cameras on a college campus. The data are sampled from videos, and pedestrian identities and bounding boxes are manually labeled. The data annotations for the two datasets are displayed in Table 1.

Backbone: ResNet-50 [33] is the backbone for feature extraction. Weights pre-trained on ImageNet [34] are used for initialization; the layers after the pooling layer are reduced, and an IBN-a block is added [35].
Implementation Details: Resizing, random erasing, horizontal flipping, and mixup, among other techniques, are used for data augmentation. The network is trained for 80 epochs. The SGD optimizer is used, with momentum set to 0.9 and weight decay set to $1 \times 10^{-4}$. With cosine annealing, the learning rate rises from $7.7 \times 10^{-5}$ to $1 \times 10^{-2}$ over the first 20 epochs, stays at $1 \times 10^{-2}$ from epoch 20 to epoch 60, and then decreases to $7.7 \times 10^{-5}$ over the remaining epochs.

Evaluation index: Mean average precision (mAP) [36] and cumulative matching characteristics (CMC) [37] are used after training to measure how effectively the proposed network solves the vehicle search problem. mAP evaluates the overall Re-ID performance. CMC gives the probability that the query identity appears in candidate lists of various sizes. Recall and AP are used to evaluate the detector's performance. In addition, PRW is used to validate the generalization performance of the model.
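The Re-ID metrics named above can be illustrated with a simplified sketch. It assumes that, for each query, the gallery has already been ranked and each ranked entry has been marked as a correct or incorrect match; in a full search evaluation this marking also depends on the detector's boxes, which is omitted here. The function names and the input layout are hypothetical.

```python
def cmc_at_k(ranked_matches, k):
    """Fraction of queries with at least one correct match in the top k results."""
    hits = sum(1 for matches in ranked_matches if any(matches[:k]))
    return hits / len(ranked_matches)


def average_precision(matches):
    """Average precision of one ranked list of booleans (True = correct match)."""
    hits, precision_sum = 0, 0.0
    for rank, is_correct in enumerate(matches, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_matches):
    """Mean of the per-query average precision values."""
    return sum(average_precision(m) for m in ranked_matches) / len(ranked_matches)
```

Rank-1 is `cmc_at_k(ranked_matches, 1)`: it only asks whether the top result is correct, whereas mAP rewards ranking every correct gallery entry early.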
Training: All training experiments use the deep learning framework PyTorch 1.8 and an NVIDIA RTX 2080 Ti GPU. The training batch size is set to 4. On the same GPU, training on DAIR-V2XSearch takes four hours, while training on PRW takes six hours.
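The three-phase learning-rate schedule given in the implementation details (a rise over the first 20 epochs, a plateau until epoch 60, and a decay until epoch 80) can be sketched as a plain function of the epoch index. Applying a half-cosine curve to both the rising and the decaying phase is an assumption, since the text only states that cosine annealing is used; the function and constant names are hypothetical.

```python
import math

LR_MIN = 7.7e-5
LR_MAX = 1e-2


def learning_rate(epoch, warmup_end=20, hold_end=60, total_epochs=80):
    """Learning rate at a given (possibly fractional) epoch."""
    if epoch < warmup_end:
        # Assumed half-cosine rise from LR_MIN to LR_MAX.
        t = epoch / warmup_end
        return LR_MIN + (LR_MAX - LR_MIN) * 0.5 * (1.0 - math.cos(math.pi * t))
    if epoch < hold_end:
        return LR_MAX
    # Assumed half-cosine decay from LR_MAX back to LR_MIN.
    t = (epoch - hold_end) / (total_epochs - hold_end)
    return LR_MIN + (LR_MAX - LR_MIN) * 0.5 * (1.0 + math.cos(math.pi * t))
```

In a PyTorch training loop, a function of this kind could be applied by writing its value into each optimizer parameter group's `lr` entry at the start of every epoch.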

Ablation Experiments

(1) Performance Analysis of Each Module

As shown in Table 2, ablation experiments are conducted on the DAIR-V2XSearch and PRW datasets to determine the efficacy of each module.

Baseline: The baseline network is the YOLOX model with a Re-ID head added in parallel with the detection head. As shown in Table 2, each of the proposed modules improves on the baseline. With all modules combined, Rank-1 improves over the baseline by 4.95% and 3.16% on DAIR-V2XSearch and PRW, respectively, while mAP improves by 6.22% and 5.4747%, respectively.
Comparison of different FPN levels: To evaluate the impact of FPN scale alignment, feature maps at different levels are used, and the results are presented in Table 3. Specifically, $P_3$, $P_4$, and $P_5$ are evaluated with strides of 8, 16, and 32, respectively. Considering both detection accuracy and Re-ID accuracy, the largest feature map, $P_3$, gives the highest accuracy.

Comparison under varying numbers of FPN branches: To evaluate the impact of the number of FPN branches on the Re-ID task, several comparisons are designed. For $\{P_3, P_4\}$, the size ranges are set to [0, 128] and [128, ∞]; for $\{P_3, P_4, P_5\}$, they are set to [0, 128], [128, 256], and [256, ∞]. As shown in Table 4, increasing the number of FPN branches improves the detection recall but reduces the Re-ID accuracy to some degree.

Effect of the triplet loss coefficient: The impact of different coefficients on Re-ID precision is investigated. As shown in Figure 7, on both datasets the model's performance varies with the coefficient, but overall the differences are not significant. The best results are achieved when $\lambda = 0.6$.

(2) Visualized Analysis

Visualization of retrieval results: Figure 8 demonstrates the efficacy of the proposed network by displaying the Rank-1 results of the baseline and PFE-Net. Orange boxes mark the query target, green boxes mark correct results, and red boxes mark incorrect results. The results show that our method is more precise.

Visualizing perceptual effects across cameras: As depicted in Figure 9, the correct model results are drawn on the original images, captured simultaneously by both devices, for comparison. Figure 9a shows the data collected from the perspective of a single vehicle. Only two vehicles can be seen from this vantage point, and the road conditions ahead cannot be determined. With the addition of Figure 9b, however, the field of view of the road expands, and road conditions for more than two vehicles can be obtained. By matching the two devices, the perception limitations of a single vehicle are overcome, enabling tasks such as road condition evaluation and route planning.

search task. Considering the problems inherent in the vehicle itself, a cross-level feature aggregation module is also designed, which makes the model more sensitive to subtle vehicle features and improves the training accuracy of the model. Numerous experiments demonstrate the generalizability of the method. In the future, research will continue to improve the accuracy of the method, and the method will be put into practice to determine its practicability. We believe that this technology can be applied to subsequent perception tasks such as object tracking and trajectory prediction, and that it will be increasingly advantageous for autonomous driving tasks such as control and decision-making.

Figure 1. Search task framework diagram: (a) two-step model structure and (b) one-step model structure.

Figure 2. Comparison of the vehicle Re-ID and vehicle search datasets: (a) a vehicle Re-ID dataset and (b) a vehicle search dataset.

Figure 3. Dataset production process: (1) match the time of the two camera ends, (2) obtain the vehicle bounding boxes, (3) match the same vehicle ID, and (4) write the JSON file.

Figure 4. Structure of the phased feature extraction network, which uses two parallel branches for the two sub-tasks of detection and Re-ID. The specific structures of the two branches are shown subsequently. The camera grouping module is inserted in the backbone.

Figure 5. Structure diagram of the detection branch.

Figure 6. Flowchart of the cross-level feature extraction module.

Figure 7. Precision curves for the two datasets under different values of $\lambda$.

Figure 8. Rank-1 search results from the gallery in the DAIR-V2XSearch dataset corresponding to the query image. The yellow box represents the original data. The red box represents an incorrect output result. The green box represents a correct output result.

Figure 9. Visualization of the simultaneous perceptual effect of two cameras. The same vehicle is marked with the same box color. (a) Shot from the vehicle camera. (b) Shot from the roadside camera.

Table 1. Data comparison between two datasets.

Table 2. Comparison of the precision of distinct modules on the PRW and DAIR-V2XSearch datasets. √ indicates that the module is used; × indicates that the module is not used.

Table 3. Comparison of the FPN levels on the DAIR-V2XSearch dataset.

Table 4. Effect of the number of FPN branches on precision on the DAIR-V2XSearch dataset.

Influence of CGM at Various Stages: Table 5 examines how inserting the CGM at different stages of ResNet-50 affects precision. On both the PRW and DAIR-V2XSearch datasets, inserting it at stage 2 gives the best performance.

Table 5. Comparison of the Re-ID precision of CGM at various stages.