Article

Research on Cooperative Vehicle–Infrastructure Perception Integrating Enhanced Point-Cloud Features and Spatial Attention

1 Vehicle and Traffic Engineering College, Henan University of Science and Technology, Luoyang 471023, China
2 Yutong Bus Co., Ltd., Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2026, 17(4), 164; https://doi.org/10.3390/wevj17040164
Submission received: 7 February 2026 / Revised: 14 March 2026 / Accepted: 19 March 2026 / Published: 24 March 2026
(This article belongs to the Section Automated and Connected Vehicles)

Abstract

Vehicle–infrastructure cooperative perception (VICP) extends the sensing capability of single-vehicle systems by integrating multi-source information from onboard and roadside sensors, thereby alleviating limitations in sensing range and field-of-view coverage. However, in complex urban environments, the robustness of such systems—particularly in terms of blind-spot coverage and feature representation—is severely affected by both static and dynamic occlusions, as well as distance-induced sparsity in point cloud data. To address these challenges, a 3D object detection framework incorporating point cloud feature enhancement and spatially adaptive fusion is proposed. First, to mitigate feature degradation under sparse and occluded conditions, a Redefined Squeeze-and-Excitation Network (R-SENet) attention module is integrated into the feature encoding stage. This module employs a dual-dimensional squeeze-and-excitation mechanism operating across pillars and intra-pillar points, enabling adaptive recalibration of critical geometric features. In addition, a Feature Pyramid Backbone Network (FPB-Net) is designed to improve target representation across varying distances through multi-scale feature extraction and cross-layer aggregation. Second, to address feature heterogeneity and spatial misalignment between heterogeneous sensing agents, a Spatial Adaptive Feature Fusion (SAFF) module is introduced. By explicitly encoding the origin of features and leveraging spatial attention mechanisms, the SAFF module enables dynamic weighting and complementary fusion between fine-grained vehicle-side features and globally informative roadside semantics. Extensive experiments conducted on the DAIR-V2X benchmark and a custom dataset demonstrate that the proposed approach outperforms several state-of-the-art methods. 
Specifically, Average Precision (AP) scores of 0.762 and 0.694 are achieved at an IoU threshold of 0.5, while AP scores of 0.617 and 0.563 are obtained at an IoU threshold of 0.7 on the two datasets, respectively. Furthermore, the proposed framework maintains real-time inference performance, highlighting its effectiveness and practical potential for real-world deployment.

1. Introduction

Accurate environmental perception is a fundamental prerequisite for the development of autonomous driving technologies [1]. Currently, autonomous vehicles primarily depend on onboard sensors—such as cameras, LiDAR, and millimeter-wave radar—to acquire detailed three-dimensional information about their surroundings. However, as the demands for perception range and accuracy continue to grow, the limitations of the traditional single-vehicle perception paradigm have become increasingly apparent [2]. Restricted by sensor installation height and physical line-of-sight, single-vehicle sensing still struggles with long-distance detection, severe occlusions, and unavoidable perception blind spots [3,4]. These challenges are particularly pronounced in bus operation scenarios, where vehicles operate on fixed routes, make frequent stops, and interact with a high density of traffic participants. Under such conditions, perception deficiencies can significantly impair autonomous decision-making processes and compromise driving safety.
In response to the inherent limitations of single-vehicle perception, vehicle–infrastructure collaborative perception has recently attracted considerable attention as a promising paradigm for enhancing autonomous driving systems. By enabling information sharing between vehicle-mounted sensors and roadside sensing units, such cooperative perception frameworks can significantly expand sensing coverage and improve environmental awareness, thereby alleviating occlusions and perception blind spots that are difficult to address using onboard sensors alone [5]. Despite its significant potential, vehicle–infrastructure collaborative perception remains far from mature, particularly in complex urban environments. In scenarios characterized by frequent static occlusions, dynamic traffic interactions, and long-range perception requirements, point cloud data collected from heterogeneous sensors often become highly sparse and unevenly distributed, resulting in severe degradation of discriminative features. However, most existing cooperative perception methods are developed under idealized sensing assumptions and primarily focus on cross-source feature interaction, while paying relatively limited attention to the effective extraction and enhancement of sparse and degraded point cloud representations under challenging conditions. Consequently, the robustness and generalization capability of current systems remain limited in real-world urban scenarios [6]. The performance of cooperative perception systems largely depends on effective cross-source feature fusion. However, vehicle-mounted and roadside sensors differ substantially in viewing angle, sensing range, resolution, and operating conditions, leading to inconsistent feature distributions and significant domain discrepancies. Such heterogeneity complicates direct feature fusion and may introduce semantic conflicts that hinder perception performance. Several studies have attempted to address these fusion challenges. 
For example, F-Cooper [7] employs a max-selection mechanism to emphasize salient features, Yu et al. [8] utilize cross-source feature concatenation, and Ren et al. [9] introduce feature-weighting strategies to improve fusion effectiveness. Nevertheless, these approaches generally operate at a single scale or fixed resolution, which limits their ability to simultaneously capture fine-grained spatial details and global semantic context. Furthermore, most existing methods lack explicit modeling of cross-domain discrepancies. Naive concatenation or simple linear fusion often increases redundant feature channels and computational overhead, while also introducing biased cross-source interference that may degrade detection performance. Although PillarGrid [10] improves computational efficiency through max-pooling-based aggregation, informative signals may inevitably be discarded during the process, thereby compromising feature completeness and representation fidelity. Therefore, effectively mitigating feature sparsity and degradation while robustly integrating heterogeneous vehicle–infrastructure information and capturing complementary multi-scale semantic representations remains a critical and open research challenge for vehicle–infrastructure collaborative perception.
To address the challenges posed by complex traffic environments and local occlusions in vehicle–infrastructure cooperative perception, this paper proposes a novel 3D object detection framework for collaborative scenarios. The framework integrates point cloud feature enhancement with spatially adaptive cross-source feature fusion to improve perception robustness and detection accuracy.
The main contributions of this work are summarized as follows:
  • A point-cloud-enhanced feature modeling approach tailored for vehicle–infrastructure cooperative perception is proposed. By integrating a dual-dimension squeeze-and-excitation mechanism with a multi-scale feature pyramid, the representation capability of sparse point clouds is improved, particularly for long-range objects and heavily occluded regions.
  • A spatially adaptive feature fusion module is designed to explicitly encode feature sources and generate fusion weights using both max pooling and average pooling. Through this design, dynamic and balanced weighting between vehicle-side local features and infrastructure-side global semantic information is achieved, thereby effectively mitigating fusion bias caused by field-of-view discrepancies.
  • Extensive experiments are conducted on the DAIR-V2X dataset and an additional in-house dataset. The results demonstrate that, compared with mainstream cooperative perception approaches, the proposed method achieves a significant improvement in overall 3D detection accuracy and exhibits notably enhanced robustness for long-range targets, occluded regions, and scenarios with incomplete information.
The rest of this paper is organized as follows. Section 2.1 reviews related work on LiDAR-based 3D object detection and on fusion strategies for cooperative perception. Section 2.2 presents the proposed vehicle–infrastructure cooperative perception network. The experimental results are presented in Section 3. Finally, Section 4 contains the conclusion and future work.

2. Materials and Methods

2.1. Related Work

2.1.1. LiDAR-Based 3D Object Detection

LiDAR-based 3D object detection is a fundamental component of autonomous driving perception systems, aiming to accurately predict three-dimensional bounding boxes from point cloud data, including object location, geometric dimensions, orientation, and semantic category [11]. However, LiDAR point clouds are inherently unstructured and spatially non-uniform, and their density decreases rapidly as the sensing distance increases. These characteristics pose significant challenges for effective point cloud representation and the extraction of discriminative features. According to different point cloud representation paradigms, existing approaches can generally be categorized into two main types: point-based detectors and voxel- (or pillar-) based detectors. Each category involves different trade-offs between feature representation capability and computational efficiency.
Point cloud–based methods operate directly on raw point data, thereby preserving geometric structures to the greatest extent. PointNet [12] first introduced an end-to-end framework for point cloud processing, laying the foundation for deep learning–based point cloud analysis. Building upon this paradigm, PointNet++ [13] incorporated hierarchical sampling and multi-scale feature extraction to better capture local geometric relationships. Further extending these ideas, two-stage detectors such as PointRCNN [14] and PV-RCNN [15] integrate fine-grained point-level features with voxel-level contextual information, enabling more accurate 3D object localization. In parallel, one-stage approaches, including 3DSSD [16], streamline the detection pipeline to improve inference efficiency, while Transformer-based models [17,18] introduce global attention mechanisms to enhance contextual reasoning in complex scenes. Despite their strong representational capability, point-based methods typically incur high computational costs, which pose significant challenges for real-time deployment.
In contrast, voxel-based detection methods discretize point clouds into regular three-dimensional grids, converting unstructured data into structured representations that facilitate efficient feature extraction and object detection. VoxelNet [19] pioneered this direction by introducing a voxel feature encoder within an end-to-end learning framework; however, it suffered from substantial computational overhead. To address this limitation, SECOND [20] employed sparse convolution to significantly improve efficiency, establishing a widely adopted voxel-based detection paradigm. More recently, anchor-free detectors such as CenterPoint [21] have further improved localization accuracy and rotational robustness by predicting object centers rather than relying on predefined anchors. To better balance detection accuracy and real-time performance, PointPillars [22] collapses point clouds along the vertical axis into pillar representations and projects them into bird’s-eye-view feature maps, enabling lightweight 2D convolutional networks to efficiently process LiDAR data. Owing to its fast inference speed, low memory footprint, and strong practical applicability, PointPillars and its variants have been widely adopted in real-world autonomous driving perception systems.

2.1.2. Vehicle–Infrastructure Cooperative Perception

According to the stage at which information is fused, existing vehicle–infrastructure cooperative perception approaches can generally be categorized into early fusion, intermediate fusion, and late fusion schemes [23,24].
Early fusion directly aggregates raw sensor data, such as LiDAR point clouds, from multiple agents to enlarge the sensing range and improve detection accuracy. Representative studies merge point clouds captured from different viewpoints to enhance perception completeness and mitigate the field-of-view limitations of individual sensors [25,26]. However, transmitting large volumes of raw data leads to excessive communication overhead and high network bandwidth requirements, limiting the practicality of early fusion in real-world deployments.
Late fusion instead performs collaboration at the decision level by combining detection results independently generated by vehicles and roadside units [27,28]. Although this strategy significantly reduces communication costs, it heavily depends on the quality of local detection results. Consequently, valuable intermediate information may be lost during fusion, potentially resulting in degraded perception robustness.
To balance perception accuracy with communication efficiency and inference latency, intermediate fusion has emerged as a dominant paradigm in collaborative perception research [29]. In this framework, agents exchange intermediate feature representations instead of raw data or final detection results, enabling collaborative perception with manageable bandwidth consumption. Representative approaches include F-Cooper [7], which performs voxel feature fusion using a maxout operation, and V2VNet [30], which introduces a spatially aware graph neural network to enable iterative feature interaction between agents. OPV2V [31] further establishes a comprehensive benchmark for cooperative perception, while V2X-ViT [32] adopts Transformer-based heterogeneous attention with delay-aware positional encoding to address temporal misalignment. Where2comm [33] improves communication efficiency by selectively transmitting salient spatial regions based on confidence maps.
Subsequent studies have further explored advanced feature fusion strategies to improve collaborative perception robustness. For example, FETR [34] introduces Transformer-based future feature prediction to compensate for temporal asynchrony, while DI-V2X [35] proposes a domain-invariant distillation framework with domain-adaptive attention to mitigate cross-agent sensor discrepancies. TransIFF [36] develops an instance-level feature fusion framework to address instability caused by bandwidth limitations and domain differences. LRCP [37] focuses on latency-robust cooperative perception by fusing asynchronous BEV features through flow prediction.
More recently, attention-based mechanisms have been introduced to further enhance feature interaction and spatial representation learning in collaborative perception. For example, CoFormerNet [38] employs a transformer-based fusion framework that integrates temporal aggregation with spatially modulated cross-attention for vehicle–infrastructure feature interaction. Similarly, dynamic spatial attention mechanisms have been explored to improve multi-scale feature aggregation for pillar-based point cloud representations, enabling more robust object detection in occluded and dense traffic scenarios [39]. These studies demonstrate that attention mechanisms can effectively enhance cross-agent feature interaction and improve the quality of spatial representations. Beyond autonomous driving, deep learning frameworks have also been widely applied to model complex dynamic systems in various engineering domains. For example, recent studies have explored machine learning approaches for battery lifetime prediction under diverse degradation conditions [40,41]. Although these studies primarily focus on temporal degradation modeling, they demonstrate the effectiveness of deep learning in capturing complex nonlinear patterns. This observation further highlights the potential of data-driven approaches for addressing challenging perception tasks in intelligent transportation systems.
Despite the favorable trade-off between communication efficiency and detection performance offered by intermediate fusion schemes, their effectiveness can still degrade significantly under severe occlusion and highly sparse LiDAR conditions. In such scenarios, insufficient feature extraction on the vehicle side limits the quality of shared representations, ultimately constraining the achievable performance gains of collaborative perception.

2.2. Method

The overall framework of the proposed vehicle–infrastructure cooperative perception system is illustrated in Figure 1. To address blind-spot perception and insufficient feature representation in complex traffic environments, a multi-source point cloud cooperative perception framework is developed. During preprocessing, a timestamp-based synchronization mechanism is employed to temporally align LiDAR data collected from vehicle-mounted and roadside sensors. A global coordinate transformation is then applied to project all point clouds into a unified reference frame, providing a consistent basis for cross-source feature fusion. In the feature encoding stage, an R-SENet attention module is embedded into the PointPillars backbone to jointly model pillar-level and intra-pillar feature dependencies, thereby enhancing geometric feature representations in sparse and occluded regions. A multi-scale feature pyramid backbone network (FPB-Net) is further introduced to extract and aggregate hierarchical features across different spatial resolutions. To reduce communication overhead, roadside features are compressed through an encoder–decoder architecture for efficient cross-agent transmission. Finally, a Spatially Adaptive Feature Fusion (SAFF) module dynamically integrates vehicle-side local features with infrastructure-side global representations via feature expansion and spatial attention, and the fused features are fed into the detection head for accurate 3D object detection.

2.2.1. Point Cloud Data Preprocessing

In vehicle–infrastructure cooperative perception systems, roadside and onboard LiDAR sensors can achieve coarse temporal synchronization via GPS or Precision Time Protocol (PTP); however, precise alignment remains challenging due to heterogeneity in hardware architectures and sampling frequencies. In particular, mismatched frame rates introduce temporal misalignment between point cloud streams, which in turn degrades the effectiveness of multi-source point cloud fusion. To address this issue, this work adopts a timestamp-based frame matching synchronization strategy [42]. Specifically, temporal offsets between roadside and vehicle-side point cloud frames are computed, and an optimal frame pairing is determined to achieve consistent temporal alignment. This procedure ensures both temporal and spatial coherence across heterogeneous point cloud sources, thereby providing a reliable foundation for subsequent cross-source feature fusion. The matching process is formally defined as follows:
$$\mathrm{Frame}_i^{w}(n) = \mathrm{Frame}_i\left(j \mid \Delta t < \delta\right)$$
$$\mathrm{Frame}_v^{w}(n) = \mathrm{Frame}_v\left(k \mid \Delta t < \delta\right)$$
where $\mathrm{Frame}_i^{w}(n)$ and $\mathrm{Frame}_v^{w}(n)$ represent the roadside and vehicle-side point cloud of the $n$-th frame after matching, respectively. In addition, $\mathrm{Frame}_i(j)$ and $\mathrm{Frame}_v(k)$ denote the original $j$-th roadside frame and the $k$-th vehicle-side frame. The time difference between these two frames is defined as follows:
$$\Delta t = \left| t_i(j) - t_v(k) \right|$$
where $t_i(j)$ and $t_v(k)$ denote the timestamps of the roadside and vehicle-side frames, respectively, and $\delta$ represents the allowable threshold. When $\Delta t$ is minimized and falls below $\delta$, the $j$-th roadside frame and the $k$-th vehicle-side frame are considered successfully matched. This matching strategy bounds the residual temporal offset between paired frames by $\delta$, thereby establishing a reliable basis for subsequent multi-source feature fusion and 3D object detection.
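The matching rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `match_frames`, the greedy nearest-neighbour search, and the example frame rates are assumptions, and the sketch does not enforce a one-to-one pairing.

```python
from bisect import bisect_left

def match_frames(t_road, t_veh, delta):
    """Pair each roadside frame j with its nearest vehicle frame k when |dt| < delta.

    t_road, t_veh: sorted lists of timestamps in seconds.
    Returns a list of (j, k, dt) with dt = t_road[j] - t_veh[k].
    """
    pairs = []
    for j, tj in enumerate(t_road):
        pos = bisect_left(t_veh, tj)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [k for k in (pos - 1, pos) if 0 <= k < len(t_veh)]
        if not candidates:
            continue
        k = min(candidates, key=lambda c: abs(tj - t_veh[c]))
        dt = tj - t_veh[k]
        if abs(dt) < delta:  # accept the pair only inside the allowable threshold
            pairs.append((j, k, dt))
    return pairs

# Hypothetical streams: a 10 Hz roadside LiDAR and a 20 Hz vehicle LiDAR
# whose clock is offset by 12 ms.
road = [0.000, 0.100, 0.200, 0.300]
veh = [0.012 + 0.05 * n for n in range(8)]
matched = match_frames(road, veh, delta=0.02)
for j, k, dt in matched:
    print(f"roadside frame {j} <-> vehicle frame {k} (dt = {dt * 1000:.1f} ms)")
```

With these assumed rates, every roadside frame pairs with every second vehicle frame, and the intermediate vehicle frames are left unmatched.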
Due to the inherent spatial separation between the point clouds acquired by roadside and onboard sensors, performing coordinate transformation after feature extraction in traditional alignment workflows often introduces registration errors caused by the differing viewpoints of vehicle-mounted and infrastructure-based systems. Alternatively, frame-wise coordinate transformations based on relative poses incur substantial computational overhead in dynamic vehicle–infrastructure interaction scenarios. To address these issues, this work adopts a global coordinate transformation (GCT) strategy, which leverages the positional information of both LiDAR platforms to directly project each point cloud frame into a unified global reference frame. By avoiding repeated frame-to-frame transformations, the proposed approach not only improves computational efficiency but also enhances the accuracy and stability of cross-source feature fusion in cooperative perception.
In this work, point cloud data (PCD) are adopted as an illustrative example. We assume that both vehicle-mounted platforms and roadside units are equipped with LiDAR sensors, and that each point cloud is represented in a three-dimensional Cartesian coordinate system whose origin is defined at the geometric center of the corresponding sensor. The mathematical formulation is defined as follows:
$$P = \left\{ (x_i, y_i, z_i, r_i) \mid i = 1, 2, \ldots, N \right\}$$
In this formulation, $(x_i, y_i, z_i)$ denotes the spatial coordinates of a point, and $r_i$ represents its reflection intensity. The six-degree-of-freedom (6-DoF) pose of the LiDAR sensor can be expressed as follows:
$$I_{LP} = [X, Y, Z, R, P, \Theta]$$
where $X$, $Y$ and $Z$ denote the LiDAR’s position along the x-, y-, and z-axes in the global coordinate system, while $R$, $P$ and $\Theta$ represent the roll, pitch, and yaw angles, respectively.
The global alignment of point clouds from the LiDAR coordinate system can be achieved using the following formulation:
$$R_X = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(R) & -\sin(R) \\ 0 & \sin(R) & \cos(R) \end{bmatrix}$$
$$R_Y = \begin{bmatrix} \cos(P) & 0 & \sin(P) \\ 0 & 1 & 0 \\ -\sin(P) & 0 & \cos(P) \end{bmatrix}$$
$$R_Z = \begin{bmatrix} \cos(\Theta) & -\sin(\Theta) & 0 \\ \sin(\Theta) & \cos(\Theta) & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
$$T = [X, Y, Z, 0]^{T}$$
$$P_S^{G} = \begin{bmatrix} R_X & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R_Y & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R_Z & 0 \\ 0 & 1 \end{bmatrix} P_S + T$$
where $R_X$, $R_Y$, $R_Z$ denote the rotation matrices around the x-, y-, and z-axes, respectively, while $T$ represents the translation vector. $P_S$ and $P_S^{G}$ correspond to the point cloud data in the sensor coordinate frame and the global coordinate frame. By applying a global coordinate transformation, point clouds acquired from vehicle-mounted and roadside LiDAR sensors are consistently projected into a unified global reference frame. This unified representation facilitates efficient multi-source information fusion while eliminating redundant and computationally expensive coordinate conversions.
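As a rough illustration of the global coordinate transformation, the following sketch applies the rotation product R_X R_Y R_Z and the translation to a list of points. It assumes the usual right-handed rotation matrices and angles in radians; the helper names (`rot_x`, `to_global`, and so on) and the example pose are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def to_global(points, pose):
    """Project sensor-frame points (x, y, z, r) into the global frame.

    pose = (X, Y, Z, roll, pitch, yaw): sensor position and attitude.
    The reflection intensity r is carried through unchanged.
    """
    X, Y, Z, roll, pitch, yaw = pose
    # Same multiplication order as in the formulation: R_X, then R_Y, then R_Z.
    R = matmul(matmul(rot_x(roll), rot_y(pitch)), rot_z(yaw))
    out = []
    for x, y, z, r in points:
        gx = R[0][0] * x + R[0][1] * y + R[0][2] * z + X
        gy = R[1][0] * x + R[1][1] * y + R[1][2] * z + Y
        gz = R[2][0] * x + R[2][1] * y + R[2][2] * z + Z
        out.append((gx, gy, gz, r))
    return out

# Hypothetical roadside unit at (10, 5, 0) m, yawed by 90 degrees.
pose = (10.0, 5.0, 0.0, 0.0, 0.0, math.pi / 2)
global_pts = to_global([(1.0, 0.0, 0.0, 0.5)], pose)
```

Because each sensor is mapped to the global frame once from its own pose, no pairwise vehicle-to-roadside transform has to be recomputed per frame.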

2.2.2. Point Cloud Feature Extraction

(1) Feature Encoding with the Improved PointPillars Network
The conventional PointPillars framework represents sparse 3D point clouds using pillar-based structures and enables efficient object detection in the bird’s-eye-view (BEV) domain. However, its feature encoding stage primarily relies on local statistical descriptors computed within individual pillars, exhibiting limited capacity to capture inter-pillar contextual relationships as well as variations in intra-pillar point distributions. These limitations become more pronounced in vehicle–infrastructure cooperative perception scenarios, where severe environmental occlusion and long-range LiDAR signal attenuation substantially reduce point density around critical objects, leading to the degradation of fine-grained geometric information.
While standard attention mechanisms, such as the Squeeze-and-Excitation Network (SENet [43]), effectively capture inter-channel relationships by adaptively recalibrating feature responses, they are fundamentally designed for dense 2D image formats. In a conventional SE block, the squeeze operation performs global average pooling along the spatial dimensions, compressing each two-dimensional feature channel into a single scalar value. However, this mechanism is not directly suitable for point cloud representations. Specifically, the intermediate tensors and pseudo-images generated by pillar-based encoding possess unique, irregular, and sparse geometric structures. Naively compressing these representations along spatial dimensions in the same manner as traditional 2D images inevitably results in severe information loss, destroying critical spatial geometry.
To address this challenge, we incorporate a redesigned squeeze-and-excitation attention module, termed Redefined-SENet (R-SENet), into the feature encoding stage of PointPillars. Unlike standard SENet, which merely performs 1D channel-wise feature recalibration, our R-SENet introduces a dual-dimensional squeeze-and-excitation mechanism explicitly tailored for unstructured 3D point clouds. By explicitly modeling attention weights across both pillar-level and point-level dimensions, the proposed module enhances feature robustness and representation fidelity under complex traffic conditions. The architecture of the improved feature encoder is illustrated in Figure 2.
The proposed R-SENet module consists of two principal stages—squeeze and excitation—and comprises four operations: feature transformation (Ftr), squeezing (Fsq), excitation (Fex), and feature scaling (Fscale). The overall structure of the module is illustrated in Figure 3.
First, the input point cloud is formulated as a feature tensor $X \in \mathbb{R}^{P \times N \times D}$, where $P$ is the number of non-empty pillars, $N$ denotes the number of sampled points per pillar, and $D$ represents the dimensionality of point-wise features. The tensor $X$ is then projected into a higher-dimensional embedding space through the transformation function $F_{tr}(\cdot)$, which facilitates the extraction of local geometric characteristics and yields an intermediate feature representation $U \in \mathbb{R}^{P \times N \times D}$:
$$U = F_{tr}(X)$$
The transformation function $F_{tr}(\cdot)$ is realized through a shared multilayer perceptron or a channel-wise linear projection, which enables the extraction of more discriminative local geometric features.
Afterward, R-SENet performs the squeeze–excitation operation separately along the pillar dimension and within-pillar point dimension. Along the pillar dimension, the squeeze function $F_{sq}(\cdot)$ globally aggregates the intermediate feature map over the feature dimension $D$ and the point dimension $N$, thereby capturing geometric and semantic contextual information embedded in the overall spatial distribution of the point cloud and producing a statistical representation for each pillar. Meanwhile, along the within-pillar point dimension, the features are aggregated over dimensions $D$ and $P$ to obtain a global statistical descriptor for points residing inside each pillar. The computation is expressed as follows:
$$z_P = F_{sq}(u_p) = \frac{1}{D \times N} \sum_{d=1}^{D} \sum_{n=1}^{N} U_P(n, d)$$
$$z_N = F_{sq}(u_n) = \frac{1}{D \times P} \sum_{d=1}^{D} \sum_{p=1}^{P} U_N(p, d)$$
This multidimensional squeeze strategy enables the network to preserve the statistical correlation among both inter-pillar and intra-pillar features, even when the point cloud is sparse or unevenly distributed. By performing aggregation operations along specific dimensions, the network is able to capture global contextual information from two complementary perspectives: the overall spatial distribution and the fine-grained local geometric structure.
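To make the two squeeze directions concrete, the sketch below computes both statistics on a toy tensor stored as nested lists indexed `U[p][n][d]`. The function names and the toy values are assumptions for illustration; zero rows stand in for the padding that pillar encoders typically use for sparsely populated pillars.

```python
def squeeze_pillar(U):
    """z_P[p]: mean of pillar p over the point axis N and the feature axis D."""
    return [
        sum(sum(point) for point in pillar) / (len(pillar) * len(pillar[0]))
        for pillar in U
    ]

def squeeze_point(U):
    """z_N[n]: mean of point slot n over the pillar axis P and the feature axis D."""
    P, N, D = len(U), len(U[0]), len(U[0][0])
    return [sum(sum(U[p][n]) for p in range(P)) / (P * D) for n in range(N)]

# Toy tensor with P = 2 pillars, N = 3 point slots, D = 2 features.
U = [
    [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0]],  # pillar 0: two real points, one padded slot
    [[4.0, 2.0], [0.0, 0.0], [0.0, 0.0]],  # pillar 1: sparse, one real point
]
z_P = squeeze_pillar(U)  # one statistic per pillar
z_N = squeeze_point(U)   # one statistic per point slot
print(z_P, z_N)
```

In this toy case the sparse pillar yields a lower pillar-level statistic, while the point-level statistic falls to zero for the slot that is padding in every pillar.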
Subsequently, the excitation function $F_{ex}(\cdot)$ maps the compressed global feature vector $z_k$ ($k \in \{P, N\}$) to a new representation by employing a gating mechanism composed of two fully connected layers. This module learns the nonlinear dependencies among feature channels and generates the channel-wise weight vectors $s_P \in \mathbb{R}^{1 \times 1 \times D}$ and $s_N \in \mathbb{R}^{1 \times 1 \times D}$. The process is formulated as follows:
$$s_P = F_{ex}(z_P, W) = \sigma\left(W_2\, \delta(W_1 z_P)\right)$$
$$s_N = F_{ex}(z_N, W) = \sigma\left(W_2\, \delta(W_1 z_N)\right)$$
where $W_1$ and $W_2$ denote the learnable weight matrices of the fully connected layers, while $\delta(\cdot)$ and $\sigma(\cdot)$ represent the ReLU activation and the sigmoid function, respectively. Subsequently, the scaling function $F_{scale}(\cdot)$ multiplies the channel-wise weight vectors generated during the excitation phase with the intermediate feature tensor $U$. This operation yields two recalibrated feature tensors $U_P \in \mathbb{R}^{P \times N \times D}$ and $U_N \in \mathbb{R}^{P \times N \times D}$:
$$U_P = F_{scale}(U, s_P) = s_P \cdot U$$
$$U_N = F_{scale}(U, s_N) = s_N \cdot U$$
Afterward, R-SENet performs an element-wise addition of the feature outputs from the two branches, resulting in the final enhanced feature representation $\tilde{U}$, which can be expressed as:
$$\tilde{U} = U_P \oplus U_N$$
Finally, the R-SENet-enhanced features $\tilde{U}$ are scattered back to their original grid locations, producing a pseudo-image representation enriched with semantic information. The dual-branch attention fusion mechanism not only preserves the inherent spatial geometry of the raw point cloud, but also substantially strengthens key feature responses under sparse conditions caused by occlusion or long-range sensing, thereby providing more reliable feature inputs for vehicle–infrastructure cooperative perception.
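The excitation, scaling, and addition steps can be sketched as follows. This is one possible reading, not the authors' code: it assumes each gate is broadcast along the axis that its squeeze statistic indexes (pillars for the first branch, point slots for the second), whereas the text states the gate shapes as 1 × 1 × D, and the sketch does not try to reconcile the two. The weights are hypothetical fixed values, each branch has its own weight pair, and the pillar count is held fixed so that the fully connected layers have a defined input size.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def excite(z, W1, W2):
    """Two-layer gate: s = sigmoid(W2 * ReLU(W1 * z))."""
    hidden = [max(0.0, h) for h in matvec(W1, z)]
    return [1.0 / (1.0 + math.exp(-o)) for o in matvec(W2, hidden)]

def scale_and_add(U, s_P, s_N):
    """Compute s_P * U (per pillar) plus s_N * U (per point slot), element-wise."""
    return [
        [[s_P[p] * x + s_N[n] * x for x in U[p][n]] for n in range(len(U[p]))]
        for p in range(len(U))
    ]

# Toy tensor U[p][n][d] with P = 2, N = 3, D = 2, and its squeeze statistics.
U = [
    [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0]],
    [[4.0, 2.0], [0.0, 0.0], [0.0, 0.0]],
]
z_P = [8.0 / 6.0, 1.0]   # per-pillar means over N and D
z_N = [2.5, 1.0, 0.0]    # per-slot means over P and D

# Hypothetical weights; in training these would be learned.
W1_P, W2_P = [[1.0, -1.0]], [[2.0], [0.0]]              # 2 -> 1 -> 2
W1_N, W2_N = [[0.0, 0.0, 0.0]], [[0.0], [0.0], [0.0]]   # 3 -> 1 -> 3 (gate = 0.5)

s_P = excite(z_P, W1_P, W2_P)
s_N = excite(z_N, W1_N, W2_N)
U_tilde = scale_and_add(U, s_P, s_N)
```

Since the gates only rescale existing responses, zero-padded slots remain zero after recalibration, so the scatter step back to the grid is unaffected.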
(2) 2D Backbone Network
Following feature encoding, multi-scale semantic feature extraction from the pseudo-image is critical for detecting objects at varying distances, particularly in vehicle–infrastructure cooperative perception scenarios that involve both dense near-field targets and sparse far-field objects. The Feature Pyramid Network (FPN) [44] is a widely adopted architecture for multi-scale feature extraction in 2D object detection, employing a top-down pathway with lateral connections to merge high-resolution spatial details with high-level semantic information. However, the standard FPN is designed for dense RGB images and generates independent predictions at each pyramid level, making it less suitable for BEV-based point cloud detection. First, the nearest-neighbor or bilinear interpolation used for up-sampling in FPN is not well-suited for the sparse and irregular pseudo-image representations generated by pillar-based encoding. In such cases, spatial structures cannot be reliably recovered through interpolation alone. Second, maintaining separate prediction heads at each pyramid level introduces redundancy and limits the effective integration of complementary information across scales, particularly when LiDAR point clouds exhibit significant density variation between near-field and far-field regions.
To address these limitations, we propose a Feature Pyramid Backbone Network (FPB-Net) that is specifically tailored for sparse BEV pseudo-image representations. Unlike standard FPN, FPB-Net incorporates three key design differences: (1) learnable transpose-convolution-based up-sampling replaces simple interpolation, enabling the network to recover fine spatial structures lost during down-sampling of sparse feature maps; (2) a cross-layer aggregation stage concatenates all rescaled multi-scale features into a single unified tensor, rather than maintaining separate per-level outputs, ensuring that complementary information from all scales is jointly available for downstream detection; and (3) the entire architecture is jointly optimized with the pillar-based encoder, producing multi-scale features that are coherent with the underlying irregular point distribution.
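The difference between fixed interpolation and learnable up-sampling can be seen in a one-dimensional toy case. The sketch below is illustrative only: it assumes a transposed convolution whose kernel length equals its stride, and the "learned" kernel values are hypothetical.

```python
def transpose_conv1d(x, kernel, stride):
    """1D transposed convolution: each input value stamps a scaled kernel copy."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

def nearest_upsample(x, factor):
    """Nearest-neighbour up-sampling: repeat every value `factor` times."""
    return [v for v in x for _ in range(factor)]

sparse_row = [0.0, 2.0, 0.0, 1.0]  # one row of a sparse BEV feature map

# An all-ones kernel reproduces nearest-neighbour interpolation exactly.
fixed = transpose_conv1d(sparse_row, [1.0, 1.0], stride=2)

# A trained kernel can distribute each response unevenly over the up-sampled cells.
learned = transpose_conv1d(sparse_row, [0.75, 0.25], stride=2)

print(fixed)    # [0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0]
print(learned)  # [0.0, 0.0, 1.5, 0.5, 0.0, 0.0, 0.75, 0.25]
```

Nearest-neighbour up-sampling is therefore a special case of the transposed convolution, and learning the kernel lets the network choose how each coarse response is spread over the finer grid.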
The architecture of FPB-Net is illustrated in Figure 4. Taking the pseudo-image features as input, the network performs multi-scale feature modeling and cross-layer information fusion to generate high-level semantic representations for downstream object detection.
FPB-Net consists of three principal submodules: a top–down multi-scale feature extraction network, a transpose-convolution-based up-sampling network, and a cross-layer feature aggregation network.
First, the top–down feature extraction network takes low-resolution feature maps as input and progressively enriches the multi-scale semantic representation by capturing global contextual information through cascading convolutional layers and transmitting features across scales. Second, the up-sampling network restores the feature maps to the original spatial resolution using transpose convolutions, preparing them for subsequent feature fusion. Finally, the cross-layer feature aggregation network concatenates features from different stages to merge complementary semantic and spatial information, thereby producing more discriminative global representations. To facilitate efficient cross-layer fusion, we adopt a direct concatenation strategy, which preserves feature integrity across scales while minimizing semantic degradation. Moreover, since concatenation introduces no additional trainable parameters and incurs modest computational overhead, it effectively reduces model complexity and mitigates overfitting risks.
In the implementation, the convolution operations are abstracted as a sequence of basic computational units $\mathrm{Block}(S, L, F)$, where $S$ denotes the convolution stride, $L$ represents the number of convolutional layers, and $F$ indicates the number of output channels. Each $\mathrm{Block}$ consists of $L$ two-dimensional $3 \times 3$ convolutional layers, each followed by Batch Normalization and a ReLU activation function:
$$X^{(l)} = \mathrm{ReLU}\left(\mathrm{BN}\left(W^{(l)} \times X^{(l-1)} + b^{(l)}\right)\right)$$
where $X^{(l)}$ and $X^{(l-1)}$ denote the feature maps of the $l$-th and $(l-1)$-th layers, respectively, with $l = 1, 2, \ldots, L$, and $W^{(l)}$ and $b^{(l)}$ represent the convolution kernel and bias parameters of the $l$-th layer.
By adjusting the stride $S$, convolution kernel size, and padding configuration, the spatial dimensions of the input pseudo-image can be flexibly altered, thereby enabling feature representations at multiple scales. Finally, the feature maps from all scales are concatenated to construct the vehicle-side feature representation $F_V$ and the infrastructure-side feature representation $F_I$. These multi-scale fused features provide a more robust and efficient representation for subsequent detection tasks, particularly in complex environments and occlusion-prone scenarios.
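The effect of stride, kernel size, and padding on spatial resolution can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below is only illustrative: the input size, strides, up-sampling factors, and channel counts are assumptions in the style of a PointPillars backbone, not values reported for FPB-Net, and the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel


def pyramid_shapes(height, width, strides, up_factors, up_channels):
    """Trace the (C, H, W) shape of each scale after up-sampling.

    Assumed configuration (not from the paper): every Block uses 3x3
    convolutions with padding 1, only its first layer applies the stride,
    and every up-sampling layer uses kernel == stride.
    """
    shapes = []
    h, w = height, width
    for stride, factor, channels in zip(strides, up_factors, up_channels):
        # Strided first layer; the remaining stride-1 layers keep the size.
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        # Transposed convolution back to the common resolution.
        shapes.append((channels, deconv_out(h, factor, factor),
                       deconv_out(w, factor, factor)))
    return shapes


def concat_channels(shapes):
    """Shape after concatenating equally sized maps along the channel axis."""
    spatial = {(h, w) for _, h, w in shapes}
    if len(spatial) != 1:
        raise ValueError("all scales must share one spatial size")
    h, w = spatial.pop()
    return (sum(c for c, _, _ in shapes), h, w)


# Assumed example: a 400 x 352 pseudo-image processed at three scales.
scales = pyramid_shapes(400, 352, strides=(2, 2, 2),
                        up_factors=(1, 2, 4), up_channels=(128, 128, 128))
fused_shape = concat_channels(scales)
```

With these assumed settings the three scales are restored to one common resolution before concatenation, so the fused map simply stacks their channels. Concatenation itself adds no trainable parameters, which is the property the text relies on.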

2.2.3. Feature Compression and Transmission

To reduce the communication overhead incurred during feature sharing in vehicle–infrastructure cooperative perception, an encoder–decoder–based feature compression and transmission mechanism is adopted [8].
During the compression stage, the intermediate infrastructure-side feature map $F_I$ is encoded using a lightweight convolutional compression network:
$$F_I^C = \mathrm{Conv}(F_I)$$
where $F_I^C$ denotes the compressed feature representation and $\mathrm{Conv}(\cdot)$ represents the convolutional encoder composed of three Conv–BN–ReLU blocks with progressively increasing stride. This compressor reduces the feature size from [384, 200, 176] to a more compact representation of [24, 100, 88], corresponding to a 64× reduction in feature volume. When stored in FP16 precision, the original feature map occupies approximately 25.8 MB per frame, whereas the compressed representation requires only 0.4 MB, resulting in a bandwidth reduction of about 98.4%.
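The quoted compression figures follow directly from the two tensor shapes. The short check below reproduces them; it assumes 2 bytes per FP16 value and that "MB" means $2^{20}$ bytes, which is the convention that matches the 25.8 MB figure. The function name is illustrative.

```python
from math import prod

BYTES_PER_FP16 = 2
BYTES_PER_MB = 2 ** 20  # assumption: binary megabytes, as 25.8 MB implies


def feature_megabytes(shape, bytes_per_value=BYTES_PER_FP16):
    """Storage size in MB of a dense feature map with the given shape."""
    return prod(shape) * bytes_per_value / BYTES_PER_MB


original_shape = (384, 200, 176)
compressed_shape = (24, 100, 88)

# Ratio of element counts: 16x fewer channels, 2x2 fewer spatial cells.
volume_ratio = prod(original_shape) / prod(compressed_shape)
original_mb = feature_megabytes(original_shape)
compressed_mb = feature_megabytes(compressed_shape)
bandwidth_saving = 1 - compressed_mb / original_mb

print(f"{volume_ratio:.0f}x, {original_mb:.1f} MB -> "
      f"{compressed_mb:.1f} MB, saving {bandwidth_saving:.1%}")
```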
Upon reception on the vehicle side, the compressed feature representation $F_I^C$ is reconstructed using a transposed convolutional decoder:
$$F_I^D = \mathrm{Deconv}(F_I^C)$$
where $\mathrm{Deconv}(\cdot)$ denotes the transposed convolutional decoder that restores the feature map to its original resolution [384, 200, 176] for subsequent cooperative feature fusion, and $F_I^D$ denotes the recovered feature map. This compression strategy significantly reduces communication overhead while preserving the semantic information required for collaborative perception. Its impact on detection performance and communication cost is further analyzed in Section 3.7.

2.2.4. Spatially Adaptive Vehicle–Infrastructure Feature Fusion

In vehicle–infrastructure cooperative perception, feature fusion must integrate two heterogeneous feature maps: vehicle-side features capturing fine-grained local geometry and infrastructure-side features providing broader spatial coverage from an elevated viewpoint. A straightforward solution is to concatenate these feature maps along the channel dimension and apply a convolution for channel reduction. However, such naive concatenation treats all spatial locations and feature sources equally, ignoring the spatially varying complementarity between vehicle-side and infrastructure-side observations.
Standard spatial attention modules, such as the Convolutional Block Attention Module (CBAM) [45], enhance feature representation through pooling and convolution operations. However, these mechanisms are designed for single-source feature maps where all channels originate from the same sensing entity. When applied to concatenated vehicle–infrastructure features, the source identity is lost, making the attention weights unable to distinguish the distinct characteristics of each sensing source.
In cooperative vehicle–infrastructure perception, feature fusion is performed by concatenating the vehicle-side feature maps with the decompressed infrastructure feature maps to derive a unified feature representation. To address the parameter redundancy arising from multi-source concatenation and the information loss caused by single pooling–based fusion, we propose a Spatial Adaptive Feature Fusion (SAFF) module. SAFF adaptively integrates spatial features extracted by max pooling and average pooling, enabling efficient complementarity between vehicle- and infrastructure-side representations within a unified feature space. The overall architecture of the SAFF module is illustrated in Figure 5.
The SAFF module consists of three stages: feature dimensional expansion, feature concatenation, and spatial adaptive fusion. First, for the vehicle-side features $F_V \in \mathbb{R}^{C \times H \times W}$ and the infrastructure-side features $F_I^D \in \mathbb{R}^{C \times H \times W}$ obtained from different sensing entities, an expand-dimension (Exp_dim) operation is applied to explicitly introduce a source dimension into the feature tensors. This operation preserves the origin of each feature map while projecting them into a unified feature space, and can be expressed as follows:
$$F_V' = f_{\mathrm{Exp\_dim}}(F_V)$$
$$F_I^{D\prime} = f_{\mathrm{Exp\_dim}}(F_I^D)$$
where $f_{\mathrm{Exp\_dim}}(\cdot)$ denotes an operation that appends an additional dimension at the end of the feature tensor, expanding the original representation from $(C, H, W)$ to $(C, H, W, 1)$. Subsequently, the vehicle-side and infrastructure-side features are concatenated along the newly added dimension to form the fused feature representation $F_{\mathrm{fusion}} \in \mathbb{R}^{C \times H \times W \times 2}$, expressed as follows:
$$F_{\mathrm{fusion}} = \mathrm{Concat}(F_V', F_I^{D\prime})$$
This operation structurally aligns the spatial properties of the vehicle-side and infrastructure-side features, thereby establishing a unified input representation that facilitates subsequent adaptive fusion.
Building on this, a spatial attention mechanism [46] is further introduced to enhance the spatial modeling capability of the fused features. Specifically, the fused feature map $F_{\mathrm{fusion}}$ is processed with max pooling and average pooling to extract two spatial descriptors, $F_{\max} \in \mathbb{R}^{C \times H \times W \times 2}$ and $F_{\mathrm{avg}} \in \mathbb{R}^{C \times H \times W \times 2}$, respectively. These two descriptors are then concatenated to form the spatial feature representation $F_{\mathrm{spatial}} \in \mathbb{R}^{2 \times C \times H \times W}$:
$$F_{\mathrm{spatial}} = \mathrm{Concat}(F_{\max}, F_{\mathrm{avg}})$$
Thus, the proposed design effectively integrates the two types of spatial information embedded within the intermediate concatenated feature map. A 2D convolutional layer followed by a Sigmoid activation is subsequently applied to perform feature selection and dimensionality reduction. The refined representation is then compressed through a squeeze operation to obtain a feature map with dimensions $(C, H, W)$, yielding the final vehicle–infrastructure spatially adaptive fused feature $F_{\mathrm{Adafusion}} \in \mathbb{R}^{C \times H \times W}$. The corresponding computation is formulated as follows:
$$F_{\mathrm{Adafusion}} = \mathrm{Compress}\left(\sigma\left(\mathrm{Conv2D}(F_{\mathrm{spatial}})\right) \otimes F_{\mathrm{spatial}}\right)$$
In this formulation, $\sigma(\cdot)$ denotes the Sigmoid activation function, $\otimes$ represents the element-wise multiplication operation, and $\mathrm{Compress}(\cdot)$ indicates the reduction of the feature map along the first dimension.
Through the above design, the Spatial Adaptive Feature Fusion module maintains the consistency of spatial and channel representations while adaptively integrating fine-grained structural information from the onboard perception with global semantic cues from the roadside perception.

2.2.5. Detection Head

In the collaborative vehicle–infrastructure perception network, the detection head performs 3D object detection based on the fused feature representations. In this study, we adopt the standard detection head architecture from PointPillars [22] for classification and regression. Ultimately, the detection head predicts the class label, spatial location, three-dimensional size, and yaw angle for each candidate bounding box, thereby completing the 3D object detection task.

3. Results

The proposed method is evaluated on the DAIR-V2X-C real-world open-source dataset as well as a self-collected vehicle–infrastructure cooperative perception dataset acquired from real urban traffic scenarios, in order to comprehensively assess its cooperative perception performance.

3.1. Device Information

The experiments were performed on a system equipped with an Intel i5-12400F CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 3060 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 12 GB of memory, and 32 GB of RAM. The system ran on Ubuntu 20.04.6 LTS (Canonical Ltd., London, UK), with CUDA 11.3 (NVIDIA Corporation, Santa Clara, CA, USA) to accelerate GPU computation. All models were implemented using PyTorch 1.10.1 (Meta Platforms, Inc., Menlo Park, CA, USA) within a conda environment (Anaconda, Inc., Austin, TX, USA) configured with Python 3.7.12 (Python Software Foundation, Beaverton, OR, USA) and cuDNN 8.2.1 (NVIDIA Corporation, Santa Clara, CA, USA).

3.2. Experimental Datasets

3.2.1. DAIR-V2X Dataset

The DAIR-V2X dataset [28] is the first large-scale real-world benchmark designed for vehicle–infrastructure cooperative perception research. It was collected in the Beijing pilot zone for autonomous driving and contains more than 100 traffic scenarios with synchronized data from both vehicle-mounted and roadside sensors. The dataset includes multi-modal information such as images and point clouds acquired from cameras and LiDAR sensors deployed on vehicles and roadside infrastructure. On the vehicle side, a 40-line LiDAR operating at 10 Hz with a 360° horizontal field of view (FOV) is used to capture the surrounding environment. On the infrastructure side, a 300-line LiDAR with the same sampling frequency and a horizontal FOV of approximately 100° is employed to collect point cloud data from roadside viewpoints. The dataset provides large-scale annotations for 3D object detection, together with timestamps and calibration files for cross-sensor synchronization. The DAIR-V2X dataset contains several subsets designed for different cooperative perception settings. In this work, we use the DAIR-V2X-C (cooperative perception) subset, which contains 38,845 frames of synchronized vehicle–infrastructure LiDAR and camera data. The dataset is divided into training, validation, and test sets with a ratio of 50%, 20%, and 30%, corresponding to 19,423, 7769, and 11,653 frames, respectively.
In this study, only LiDAR point cloud data from the DAIR-V2X-C subset are used to evaluate the proposed cooperative perception framework. All compared methods are trained and evaluated using the official dataset split to ensure fair and consistent comparisons.

3.2.2. Self-Collected Dataset

To validate the effectiveness of the proposed method, we selected a typical urban road beneath the Wenzhi Road Overpass in Zhengzhou as the experimental site. Leveraging a Level 3 (L3) Yutong Intelligent Connected Bus operating on the B7 route, we conducted synchronized collection of vehicle-side and roadside point cloud data, based on which an experimental dataset was constructed. The data acquisition platform is illustrated in Figure 6. The dataset contains approximately 2000 pairs of synchronized vehicle- and infrastructure-side point cloud frames. This area exhibits dense traffic flow and mixed interactions between vehicles and pedestrians, effectively reflecting the dynamic occlusions and perception blind spots commonly encountered during bus operations. The main specifications of the self-collected dataset and the sensing system are summarized in Table 1.
Hardware configuration: The data collection for the self-collected dataset relies on a dedicated vehicle–infrastructure sensing setup. Both the roadside infrastructure unit and the vehicle platform are equipped with 16-line LiDAR sensors to capture three-dimensional point cloud data of the surrounding environment. The roadside LiDAR is deployed at a fixed location to provide a broader observation of the traffic scene, while the vehicle-mounted LiDAR captures the local perception data from the ego-vehicle perspective. More detailed specifications of the sensing devices are summarized in Table 1. To ensure precise temporal alignment among multi-source perception data, both the vehicle platform and the roadside infrastructure are equipped with time servers, enabling accurate synchronization across all sensing devices during data acquisition. In addition, an RTK-based high-precision positioning system was integrated on the vehicle platform to provide accurate vehicle pose information. Based on this setup, extrinsic calibration between the vehicle-side LiDAR and the roadside sensing device was carefully performed, enabling the transformation and fusion of vehicle-side and infrastructure-side point clouds within a unified coordinate frame for subsequent cooperative perception processing. This calibration ensures accurate spatial alignment between vehicle-side and infrastructure-side observations.
Software configuration: On the software side, high-precision time synchronization is achieved using dedicated time servers, enabling accurate temporal alignment of data collected from multiple sensing devices. In addition, precise calibration procedures are applied to ensure spatial alignment between the vehicle-side and infrastructure-side sensors. This calibration allows point cloud data captured from different viewpoints to be transformed and processed within a unified spatial coordinate system.
Data collection and annotation: Data are collected in real-world road environments through multiple driving experiments. During each experiment, synchronized point cloud data from both vehicle and infrastructure sensors are recorded. Professional annotators then perform high-quality manual annotations on traffic participants based on the point cloud data using the SUSTechPOINTS [47] platform (Southern University of Science and Technology, Shenzhen, China), which provides interactive visualization and efficient labeling capabilities for large-scale LiDAR point clouds. The annotation process focuses on three common traffic participant categories (car, pedestrian, and cyclist), which are widely adopted in autonomous driving perception tasks. For each labeled object, annotators generate a 7-degree-of-freedom (7-DoF) 3D bounding box, comprising the center coordinates (x, y, z), the dimensions (length, width, and height), and the yaw angle. Examples from the self-collected dataset are depicted in Figure 7. Subfigures (a) and (c) illustrate vehicle-side LiDAR point clouds with 3D annotations, whereas subfigures (b) and (d) illustrate infrastructure-side LiDAR point clouds with 3D annotations. Different object categories are represented using distinct colors, where green bounding boxes denote cars, yellow bounding boxes denote cyclists, and red bounding boxes denote pedestrians. The final dataset contains 2000 synchronized vehicle–infrastructure frame pairs, with a total of 12,473 annotated objects, averaging approximately six objects per frame pair. To facilitate reproducible experimental evaluation, the dataset is divided into training, validation, and test sets with ratios of 60%, 20%, and 20%, respectively. Due to data governance and privacy considerations related to real-world road infrastructure and operational vehicles, the current version of the dataset is not publicly released. Access to the dataset is restricted to internal research use and collaborative projects approved by the data provider.

3.3. Evaluation Metrics

3.3.1. Intersection over Union (IoU)

IoU is used to quantify the overlap between a predicted bounding box and the corresponding ground-truth box. It is defined as the ratio of the area of their intersection to the area of their union. The IoU calculation is expressed as follows:
$$\mathrm{IoU} = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$
where $B_p$ denotes the predicted bounding box and $B_g$ represents the ground-truth bounding box. IoU is used to determine whether a predicted object is considered a true positive by comparing it with a predefined IoU threshold. In this study, the Bird’s-Eye-View Intersection over Union (BEV IoU) is adopted to measure the overlap between the predicted bounding boxes and the ground-truth annotations in the top-down view.
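As a minimal illustration of the ratio above, the sketch below computes IoU for axis-aligned rectangles in the top-down plane and applies a threshold test. This is a deliberate simplification: BEV boxes in 3D detection are generally rotated by their yaw angle, which requires a polygon intersection instead. The box format and function names are assumptions made for the example.

```python
def iou_axis_aligned(box_p, box_g):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    overlap_x = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    overlap_y = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    intersection = overlap_x * overlap_y

    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(box_p, box_g, threshold):
    """A prediction counts as a true positive when IoU reaches the threshold."""
    return iou_axis_aligned(box_p, box_g) >= threshold


# A 4 m x 2 m prediction shifted 1 m along x from the ground truth.
predicted = (1.0, 0.0, 5.0, 2.0)
ground_truth = (0.0, 0.0, 4.0, 2.0)
overlap = iou_axis_aligned(predicted, ground_truth)  # 6 / 10
```

In this example the same prediction is accepted at a threshold of 0.5 but rejected at 0.7, which is why the stricter threshold rewards precise localization.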

3.3.2. Average Precision (AP)

Average Precision (AP) evaluates the overall detection performance as the area under the Precision–Recall (P–R) curve. Following the official protocol [28], we use AP as the main evaluation metric and report it under two commonly used IoU thresholds, 0.5 and 0.7, denoted as AP@0.5 and AP@0.7, respectively. To ensure consistency with autonomous driving benchmarks, AP in this study is calculated using the 40-point recall interpolation method. To maintain a fair comparison with existing cooperative perception methods, the quantitative evaluation focuses only on the vehicle category. Although the self-collected dataset includes additional annotations such as pedestrians, these categories are excluded from the quantitative evaluation to ensure consistency with existing baseline methods.
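The 40-point interpolation can be sketched as follows, in its textbook form: precision is interpolated at the 40 recall levels 1/40, 2/40, ..., 1 and averaged. This is not the official evaluation code, which may differ in details such as how score thresholds are sampled. The input format (one score and one true-positive flag per detection, already matched at a given IoU threshold) is an assumption.

```python
def average_precision_r40(detections, num_gt):
    """AP with 40-point recall interpolation.

    detections: iterable of (score, is_true_positive) pairs.
    num_gt: number of ground-truth objects of the evaluated class.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")

    # Rank detections by confidence and accumulate precision and recall.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)

    # Interpolated precision: best precision at any recall >= the level.
    total = 0.0
    for i in range(1, 41):
        level = i / 40
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / 40


# Two ground-truth vehicles, three detections, one false positive.
example_ap = average_precision_r40(
    [(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

In the example, the interpolated precision is 1.0 for recall levels up to 0.5 and 2/3 above that, so the result is 5/6.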

3.4. Experimental Setup

In the experiments, the detection ranges for the DAIR-V2X dataset were set to $x \in [-100.8, 100.8]$, $y \in [-40, 40]$, and $z \in [-3, 1]$, while those for the self-collected dataset were configured as $x \in [-50.4, 50.4]$, $y \in [-40, 40]$, and $z \in [-3, 1]$. The communication ranges between the vehicle and the infrastructure were set to 100 m for the DAIR-V2X dataset and 50 m for the self-collected dataset. Beyond these distances, the vehicle is unable to receive cooperative perception information from the infrastructure.
All models were trained and tested on a platform equipped with a GeForce RTX 3060 GPU. The training process spanned 45 epochs with a batch size of 2. The Adam optimizer was employed, with an initial learning rate of $1 \times 10^{-3}$ and a weight decay of $1 \times 10^{-4}$. To stabilize the early stages of training, a 10-epoch learning rate warm-up strategy starting from a learning rate of $2 \times 10^{-4}$ was employed, followed by a Cosine Annealing learning rate scheduler with a minimum learning rate of $2 \times 10^{-5}$ to dynamically adjust the learning rate during the remaining epochs. During the inference and post-processing stage, Non-Maximum Suppression with a BEV IoU threshold of 0.15 was applied to filter out redundant predicted bounding boxes. The loss weight for classification was set to 1.0, and the regression loss weight was set to 2.0. To enhance data diversity, multiple data augmentation techniques were applied during training, including random flipping along the X- and Y-axes, global scaling with a random factor in $[0.9, 1.1]$, global rotation within $[-\pi/4, \pi/4]$, and global translation within the range of $[0, 0.5]$ m. These augmentations improve the model’s robustness and feature learning capability. All compared methods were trained and evaluated using the same experimental configuration unless otherwise specified.
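The warm-up and cosine annealing schedule described above can be written as a single function of the epoch index. The sketch makes two assumptions that the text does not state: the warm-up rises linearly from the warm-up starting rate to the optimizer's learning rate, and the cosine phase spans exactly the remaining 35 epochs. The constant and function names are illustrative.

```python
import math

TOTAL_EPOCHS = 45
WARMUP_EPOCHS = 10
WARMUP_START_LR = 2e-4
PEAK_LR = 1e-3
MIN_LR = 2e-5


def learning_rate(epoch: float) -> float:
    """Learning rate at a (possibly fractional) epoch index."""
    if epoch < WARMUP_EPOCHS:
        # Assumed linear ramp from the warm-up start to the peak rate.
        fraction = epoch / WARMUP_EPOCHS
        return WARMUP_START_LR + fraction * (PEAK_LR - WARMUP_START_LR)

    # Cosine annealing from the peak rate down to the minimum rate.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS + 1)]
```

Under these assumptions the rate starts at $2 \times 10^{-4}$, peaks at $1 \times 10^{-3}$ when the warm-up ends, and decays smoothly to $2 \times 10^{-5}$ by the end of training.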

3.5. Quantitative Results

To evaluate the detection performance of the proposed method, we conduct comparative experiments on both the DAIR-V2X dataset and our self-collected dataset, benchmarking against several existing fusion approaches, including single-vehicle perception (baseline), early fusion, late fusion, and advanced intermediate collaborative perception methods.
To ensure a rigorously fair comparison and address the potential inconsistencies of heterogeneous settings, all baseline methods were re-implemented using their official open-source codebases and evaluated under a strictly unified experimental protocol. This protocol guarantees identical dataset splits, detection ranges, evaluation metrics, and training configurations. The experimental results are summarized in Table 2. The symbol “-” indicates that the corresponding result is not available.
As shown in Table 2, the single-vehicle perception method relies solely on the onboard sensors of the ego vehicle and does not exploit information from other cooperative vehicles or roadside infrastructure, leading to significantly lower detection accuracy compared with collaborative perception algorithms. This demonstrates that vehicle–infrastructure cooperation can effectively enhance detection performance and enable more comprehensive and accurate environmental perception. Specifically, the proposed method achieves an AP@0.5 of 0.762 and an AP@0.7 of 0.617 on the DAIR-V2X dataset, and an AP@0.5 of 0.694 and an AP@0.7 of 0.563 on our self-collected dataset. Compared with single-vehicle perception, the proposed method improves AP@0.5 by 28.1% on the DAIR-V2X dataset and by 33.5% on our self-collected dataset. It also consistently surpasses advanced intermediate fusion methods such as CoAlign and F-Cooper, confirming the superior feature representation capability of our architecture.
To explicitly address the trade-off between computational efficiency and detection accuracy—a critical factor for practical real-time deployability—we systematically analyzed the model complexity in terms of parameter count, floating-point operations (FLOPs), and on-device inference time. All measurements were conducted on a single NVIDIA GeForce RTX 3060 GPU (batch size = 1, FP32 precision) under the identical BEV feature resolution. The reported latency corresponds strictly to the algorithmic computational pipeline, excluding physical V2X communication delays.
As demonstrated in Table 2, the single-vehicle baseline is the fastest (25.36 ms) due to its lightweight architecture, whereas V2X-ViT incurs the highest cost (161.04 ms) owing to its heavy Transformer design. To explicitly quantify the incremental overhead of our proposed architecture, Table 3 provides a detailed breakdown.
As shown in Table 3, the Cooperative Baseline—which fuses dual-agent point clouds via naive feature concatenation—requires 50.23 ms for inference. Building upon this, the additional parameters introduced by our proposed modules are relatively small, indicating that the overall framework remains exceptionally lightweight. Specifically, R-SENet and SAFF add merely 0.04 M and 0.17 M parameters, respectively, owing to their lightweight attention and spatial pooling mechanisms. FPB-Net introduces a moderate 0.52 M parameters to capture critical multi-scale features.
Overall, the full proposed model introduces only 0.73 M additional parameters and 7.41 GFLOPs over the cooperative baseline, resulting in a total on-device inference latency of 60.67 ms. To clearly define the boundaries of this experiment, it should be emphasized that this 60.67 ms represents the full computational pipeline rather than just the neural network forward pass. Specifically, this latency encompasses: point cloud preprocessing and spatial coordinate transformations (10.20 ms), feature encoding with R-SENet and FPB-Net (34.50 ms), feature decompression alongside SAFF-based multi-agent fusion (8.30 ms), and the detection head with NMS post-processing (7.67 ms). All computational measurements were strictly conducted using a batch size of 1 in FP32 precision to accurately reflect individual frame processing in real-world deployment. Since LiDAR sensors operate at 10 Hz (requiring a latency margin of less than 100 ms), this comprehensive computational overhead confirms that our framework robustly satisfies real-time hardware deployment requirements while delivering substantial accuracy improvements.
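The latency figures above can be cross-checked with simple arithmetic. The snippet below sums the reported per-stage latencies, compares the total with the 100 ms frame period implied by a 10 Hz LiDAR, and derives the increment over the cooperative baseline. All numbers are taken from the text; the variable names are illustrative, and the headroom excludes V2X communication delay, as stated.

```python
# Per-stage on-device latency in milliseconds, as reported in the text.
STAGE_LATENCY_MS = {
    "preprocessing and coordinate transformation": 10.20,
    "feature encoding (R-SENet and FPB-Net)": 34.50,
    "feature decompression and SAFF fusion": 8.30,
    "detection head and NMS": 7.67,
}
LIDAR_RATE_HZ = 10
COOPERATIVE_BASELINE_MS = 50.23

total_ms = sum(STAGE_LATENCY_MS.values())
frame_budget_ms = 1000 / LIDAR_RATE_HZ       # time between LiDAR sweeps
headroom_ms = frame_budget_ms - total_ms     # compute-only margin
added_ms = total_ms - COOPERATIVE_BASELINE_MS

print(f"total {total_ms:.2f} ms, headroom {headroom_ms:.2f} ms, "
      f"added over baseline {added_ms:.2f} ms")
```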
Furthermore, it is important to note that the 60.67 ms latency solely accounts for on-device algorithmic computation. In real-world deployments, V2X data transmission over networks (e.g., 5G) inevitably introduces fluctuating communication delays. Such latency can cause temporal and spatial misalignment between cooperative agents, severely challenging feature fusion. To thoroughly investigate the system’s resilience to these practical deployment challenges, a comprehensive robustness analysis is presented in the following section.

3.6. Robustness Analysis

To evaluate the robustness of the proposed cooperative perception framework under realistic deployment conditions, we conduct a sensitivity analysis on the DAIR-V2X dataset by injecting synthetic pose noise and transmission latency into the collaborative perception pipeline. Localization errors and communication delays may introduce spatial and temporal misalignment between agents, thereby affecting feature fusion. Therefore, we analyze the impact of localization noise, heading noise, and transmission latency on detection performance. The evaluation results are illustrated in Figure 8.

3.6.1. Robustness Analysis to Localization and Heading Errors

To simulate realistic localization uncertainty, Gaussian noise with zero mean is added to the agent poses during the coordinate transformation stage. We follow [49] to simulate pose errors that occur during communication between agents. Localization perturbations are introduced with standard deviations $\sigma_{xyz} \in [0, 0.5]$ m, while heading perturbations are simulated with $\sigma_h \in [0, 1.0]$. These ranges are consistent with the typical localization accuracy of GPS/RTK systems in urban environments.
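A minimal sketch of this kind of noise injection is shown below: independent zero-mean Gaussian offsets are added to the position and heading of an agent pose before the coordinate transformation. The pose layout, the function name, and the choice to perturb all three position components independently are assumptions made for illustration; the heading is kept in whatever unit $\sigma_h$ is expressed in, since the text does not state it.

```python
import random


def perturb_pose(pose, sigma_xyz, sigma_h, rng):
    """Return a noisy copy of pose = (x, y, z, heading).

    sigma_xyz: standard deviation of the position noise, in meters.
    sigma_h: standard deviation of the heading noise (same unit as heading).
    rng: a random.Random instance, passed in for reproducibility.
    """
    x, y, z, heading = pose
    return (
        x + rng.gauss(0.0, sigma_xyz),
        y + rng.gauss(0.0, sigma_xyz),
        z + rng.gauss(0.0, sigma_xyz),
        heading + rng.gauss(0.0, sigma_h),
    )


# Sweep the localization noise over the range used in the analysis.
rng = random.Random(0)
clean_pose = (10.0, -2.0, 0.5, 0.0)  # hypothetical infrastructure pose
noisy_poses = {
    sigma: perturb_pose(clean_pose, sigma, 0.0, rng)
    for sigma in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)
}
```

Each noisy pose would then replace the clean one when transforming infrastructure-side features into the vehicle frame, so the resulting misalignment grows with the chosen standard deviation.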
As shown in Figure 8a, all collaborative perception methods experience performance degradation as localization noise increases due to spatial feature misalignment across agents. Nevertheless, the proposed method demonstrates the strongest robustness among intermediate fusion approaches. Specifically, the AP@0.5 decreases from 76.2% to 70.1% when the localization noise increases from 0 m to 0.5 m, indicating a relatively moderate degradation compared with other fusion strategies. Early fusion methods such as Cooper exhibit significantly larger performance drops because raw point clouds are directly fused under spatial misalignment. Other intermediate fusion approaches also show more pronounced degradation.
A similar trend can be observed for heading perturbations in Figure 8b. While detection accuracy gradually decreases for all methods as rotational noise increases, the proposed method maintains more stable performance across the entire noise range. This result indicates that the proposed framework is less sensitive to rotational pose errors during feature fusion.
The improved robustness mainly benefits from the Spatial Adaptive Feature Fusion (SAFF) module, which dynamically adjusts fusion weights through spatial attention while explicitly encoding feature-source information. This mechanism allows the network to mitigate feature inconsistencies caused by pose perturbations. In addition, the R-SENet module enhances feature representation before fusion, further improving robustness against spatial misalignment.

3.6.2. Robustness to Transmission Latency

Communication latency is another important factor affecting the performance of collaborative perception systems. To evaluate the temporal robustness of the proposed framework under realistic V2X communication conditions, transmission delays ranging from 0 ms to 500 ms are simulated in the feature transmission pipeline on the DAIR-V2X dataset to approximate the fluctuating latency commonly observed in practical 5G networks. As illustrated in Figure 8c, detection performance gradually decreases for all methods as transmission latency increases due to temporal misalignment between vehicle-side and infrastructure-side observations. However, the proposed method consistently achieves higher detection accuracy and exhibits a slower degradation rate compared with other intermediate fusion approaches. Even under a severe delay of 500 ms, the proposed framework still maintains an AP@0.5 of 62.0%, outperforming all competing intermediate fusion baselines.
The strong temporal robustness can be attributed to two complementary mechanisms. First, the timestamp-based frame synchronization strategy ensures that only temporally aligned frames are fused, preventing severely delayed observations from degrading the fusion process. Second, the spatial attention mechanism in the SAFF module dynamically suppresses unreliable features caused by temporal inconsistency. These results demonstrate that the proposed method can maintain stable perception performance under realistic V2X communication conditions.

3.7. Performance–Bandwidth Trade-off Analysis

A critical requirement for practical V2X deployment is the efficient utilization of communication bandwidth. To quantitatively evaluate this aspect, we analyze the trade-off between detection accuracy (AP@0.5) and communication volume (in megabytes, on a $\log_2$ scale) by varying the compression ratio of the encoder–decoder feature pipeline on the DAIR-V2X dataset.
Following the convention established in prior work, the communication volume is computed as:
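$$\mathrm{Comm} = \log_2\left(\frac{N_c \times C \times 16}{8 \times 2^{20}}\right)$$
where $N_c$ denotes the number of transmitted feature elements, $C$ is the channel dimensionality, and features are transmitted in 16-bit floating-point format. The factor $16/8$ converts 16-bit values to bytes, and the division by $2^{20}$ converts bytes to megabytes.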
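The formula can be evaluated directly, as in the sketch below. It interprets $N_c$ as the number of transmitted BEV cells ($H \times W$), an assumption that reproduces the value quoted for the uncompressed [384, 200, 176] feature map; the function name is illustrative.

```python
import math


def comm_volume_log2_mb(num_cells, channels, bits_per_value=16):
    """Communication volume on a log2 scale, in megabytes (2**20 bytes)."""
    size_bytes = num_cells * channels * bits_per_value / 8
    return math.log2(size_bytes / 2 ** 20)


# Uncompressed BEV feature map: C = 384, H = 200, W = 176.
uncompressed = comm_volume_log2_mb(200 * 176, 384)

# Compressed representation: C = 24, H = 100, W = 88.
compressed = comm_volume_log2_mb(100 * 88, 24)

print(f"uncompressed {uncompressed:.2f}, compressed {compressed:.2f}")
```

On this scale every halving of the transmitted volume lowers the value by one, so the 64× compression corresponds to a shift of exactly six units.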
For reference, the uncompressed BEV feature map (C = 384, H = 200, W = 176) stored in FP16 precision corresponds to approximately 25.8 MB, i.e., a $\log_2$ communication volume of approximately 4.69. Figure 9 presents the accuracy–bandwidth curves for four representative intermediate fusion methods (the proposed method, CoAlign, V2VNet, and Where2comm) across varying compression levels, alongside several baseline methods (including Cooper, V2X-ViT, CoBEVT, and the Cooperative Baseline) operating at fixed bandwidths, which are denoted by single-point markers. Among the baseline strategies, early fusion (Cooper) incurs the highest communication cost due to the transmission of raw point clouds, yet achieves only moderate detection performance (AP@0.5 = 0.617). In contrast, late fusion significantly reduces bandwidth requirements by transmitting only detection results; however, its accuracy remains relatively low (AP@0.5 = 0.561) because important intermediate feature information is not shared.
As illustrated in Figure 9, the proposed framework consistently achieves the most favorable trade-off between detection accuracy and communication cost across the entire bandwidth spectrum. Even under moderate compression (approximately 0.25 MB per frame, i.e., $\log_2(0.25) = -2$), the proposed approach achieves an AP@0.5 of 0.738, surpassing the peak performance of several competing intermediate fusion methods such as V2VNet and Where2comm operating at higher communication budgets. The performance advantage becomes even more pronounced under severely bandwidth-constrained scenarios. At a $\log_2$ communication volume of approximately −4, the proposed method still achieves an AP@0.5 of 0.618, outperforming CoAlign, Where2comm, and V2VNet. This robustness indicates that the R-SENet module produces highly compression-resilient feature representations by reinforcing salient geometric structures prior to encoding, while the SAFF module effectively preserves cross-agent feature interaction after compression. At the default operating point (64× compression, $\log_2(0.4) \approx -1.32$), the system transmits only 0.4 MB per frame, representing a reduction of approximately 98.4% in communication volume compared with uncompressed feature transmission, while maintaining an AP@0.5 of 0.754, which is within 1.1% of the peak performance. These results demonstrate that the proposed framework achieves a Pareto-optimal balance between detection accuracy and communication cost, making it well-suited for bandwidth-constrained V2X deployment scenarios.

3.8. Qualitative Results

Figure 10 provides an intuitive comparison of raw point clouds, the baseline PointPillars model, and the PointPillars model enhanced with R-SENet in terms of their capability to process point-cloud features. The illustrated scenario involves both occlusion and long-range perception challenges. Heatmap visualization is used to highlight the differences in feature representation quality among the compared methods.
As shown in Figure 10b, the baseline PointPillars model, without the enhancement mechanism, primarily captures continuous road boundaries and topographic contours. The resulting Bird’s-Eye View (BEV) feature maps exhibit broad, uniform responses, indicating a foundational capacity for perceiving the overall scene layout. However, in long-range or occluded areas, the significant reduction in point cloud returns leads to diminished feature activation around targets. Consequently, object boundaries appear blurred or fragmented, failing to form distinct structural features. This suggests that PointPillars struggles to extract robust semantic structures in sparse regions, resulting in an inadequate depiction of local contours for critical objects like vehicles and pedestrians, which limits its detection performance in complex traffic environments.
In contrast, as shown in Figure 10c, the PointPillars model augmented with R-SENet applies a dual-level squeeze-and-excitation operation on both pillar-wise and intra-pillar features to adaptively recalibrate feature channels. This mechanism effectively enhances feature sensitivity in sparse or structurally incomplete regions. As demonstrated in the heatmap, the enhanced model not only activates weak geometric patterns that the baseline model fails to recognize but also produces clearer object boundaries and semantic shapes, resulting in a more complete representation of potential targets. Overall, the comparative heatmap visualization confirms that integrating R-SENet substantially improves the ability of the model to extract salient features from sparse point-cloud data, thereby validating the effectiveness of the proposed enhancement.
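To make the dual-level recalibration concrete, the following is a minimal pure-Python sketch of one plausible reading of it: a point-wise gate computed from per-point statistics and a channel gate computed from per-pillar statistics, both applied multiplicatively. The layer sizes, the mean-pooling squeeze, the random (untrained) weights, and all function names are assumptions for illustration; this is not the authors' R-SENet implementation.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def linear(vec, weights):
    """Multiply a length-n vector by an n x m weight matrix (list of rows)."""
    return [sum(v * w for v, w in zip(vec, column)) for column in zip(*weights)]

def se_gate(descriptor, w_reduce, w_expand):
    """Excitation: FC -> ReLU -> FC -> sigmoid, one weight in (0, 1) per entry."""
    hidden = [max(h, 0.0) for h in linear(descriptor, w_reduce)]
    return [sigmoid(z) for z in linear(hidden, w_expand)]

def dual_se(pillar, params):
    """Recalibrate one pillar given as N points x C channels (list of lists)."""
    n_points, n_channels = len(pillar), len(pillar[0])
    # Squeeze along channels: one descriptor per point (intra-pillar branch).
    point_desc = [sum(point) / n_channels for point in pillar]
    # Squeeze along points: one descriptor per channel (pillar-wise branch).
    chan_desc = [sum(column) / n_points for column in zip(*pillar)]
    point_gate = se_gate(point_desc, params["wp1"], params["wp2"])
    chan_gate = se_gate(chan_desc, params["wc1"], params["wc2"])
    # Excite: scale every feature by its point weight and its channel weight.
    return [[x * pg * cg for x, cg in zip(point, chan_gate)]
            for point, pg in zip(pillar, point_gate)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(n_points, n_channels, reduction=2, seed=0):
    """Random placeholder weights, only to make the sketch runnable."""
    rng = random.Random(seed)
    hp, hc = max(n_points // reduction, 1), max(n_channels // reduction, 1)
    return {
        "wp1": random_matrix(n_points, hp, rng),
        "wp2": random_matrix(hp, n_points, rng),
        "wc1": random_matrix(n_channels, hc, rng),
        "wc2": random_matrix(hc, n_channels, rng),
    }

rng = random.Random(1)
pillar = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 points, 8 channels
recalibrated = dual_se(pillar, init_params(n_points=4, n_channels=8))
print(len(recalibrated), len(recalibrated[0]))  # 4 8
```

Because both gates lie in (0, 1), the output keeps the input shape and can only attenuate features; in a trained model the gates would learn to attenuate uninformative points and channels less than informative ones.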
To evaluate the adaptability of the proposed algorithm under dynamic and static occlusions as well as long-range perception scenarios, we conduct a comparative experiment against a single-vehicle perception method; the results are presented in Figure 11. Specifically, following the panel labels in the figure caption, panels (a) and (b) show single-vehicle and cooperative perception results on the DAIR-V2X dataset, while panels (c) and (d) show the corresponding results on our self-collected dataset. In all subfigures, green boxes denote ground-truth bounding boxes, red boxes represent detected objects, and yellow boxes indicate missed detections or inaccurate predictions caused by point-cloud sparsity due to occlusion or limited sensor range. Furthermore, “Ego” refers to the autonomous vehicle in the DAIR-V2X dataset, “Bus” denotes the intelligent connected bus used for data collection in our dataset, and “Infra” corresponds to the roadside infrastructure sensors.
Figure 12 and Figure 13 present the visualized detection results of V2X-ViT, CoAlign, and the proposed method on both the DAIR-V2X dataset and our self-constructed dataset, where Scene 1, Scene 2, and Scene 3 correspond to representative cases involving occlusion and long-range perception. As illustrated, V2X-ViT exhibits noticeable localization deviations and missing detections, while CoAlign demonstrates more stable detection completeness but still leaves room for improvement in bounding-box accuracy. In contrast, the proposed method substantially improves localization precision while effectively suppressing missed detections. The predicted bounding boxes align more closely with the ground-truth annotations, achieving the best overall detection performance among the compared methods. This performance gain primarily stems from the adaptive enhancement of key-region features enabled by the R-SENet module, as well as the spatially adaptive fusion introduced by the SAFF module when integrating features from vehicle and infrastructure perspectives. The combined effect of these modules enables the model to fully exploit spatial details and semantic cues from multi-source perception, yielding more stable and accurate detection under static/dynamic occlusions and long-range scenarios.
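The idea of spatially adaptive weighting between vehicle-side and infrastructure-side features can be illustrated with a small pure-Python sketch. It assumes two already aligned BEV grids, a hypothetical linear scoring head shared by both sources, an optional per-source bias standing in for the source encoding, and a softmax over the two sources at each cell. None of these details are taken from the paper; the actual SAFF module may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(cell, weights, source_bias):
    """Hypothetical scoring head: linear projection plus a per-source bias."""
    return sum(x * w for x, w in zip(cell, weights)) + source_bias

def fuse_cell(vehicle_cell, infra_cell, weights, vehicle_bias, infra_bias):
    """Fuse one BEV cell; returns the fused features and the two attention weights."""
    a_v, a_i = softmax([score(vehicle_cell, weights, vehicle_bias),
                        score(infra_cell, weights, infra_bias)])
    fused = [a_v * v + a_i * r for v, r in zip(vehicle_cell, infra_cell)]
    return fused, (a_v, a_i)

def fuse_maps(vehicle_map, infra_map, weights, vehicle_bias=0.0, infra_bias=0.0):
    """Fuse two aligned BEV maps given as H x W grids of C-dimensional cells."""
    return [[fuse_cell(v, r, weights, vehicle_bias, infra_bias)[0]
             for v, r in zip(v_row, r_row)]
            for v_row, r_row in zip(vehicle_map, infra_map)]

# Toy 1 x 2 grid with C = 2: the first cell is seen only by the vehicle,
# the second (a vehicle blind spot) only by the roadside unit.
vehicle_map = [[[2.0, 1.0], [0.0, 0.0]]]
infra_map = [[[0.0, 0.0], [1.5, 2.5]]]
weights = [1.0, 1.0]

fused = fuse_maps(vehicle_map, infra_map, weights)
_, (a_v, a_i) = fuse_cell(vehicle_map[0][0], infra_map[0][0], weights, 0.0, 0.0)
print(fused)
print(round(a_v, 3), round(a_i, 3))
```

In this toy case the weight shifts toward whichever source carries the stronger response at each location, which is the behavior a per-cell attention map is meant to provide, in contrast to concatenation, which treats both sources identically everywhere.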
Overall, the qualitative analysis demonstrates that the proposed method achieves superior object detection accuracy compared with single-vehicle perception approaches and several state-of-the-art collaborative perception methods.

3.9. Ablation Study

To evaluate the effectiveness of each component in the proposed collaborative perception framework, comprehensive ablation studies were conducted on both the DAIR-V2X dataset and our self-constructed dataset. Specifically, we evaluated the contributions of the R-SENet attention module, the feature pyramid backbone network (FPB-Net), and the spatial adaptive feature fusion (SAFF) module to overall detection performance. The results are summarized in Table 4.
We first consider a single-vehicle baseline, implemented using the PointPillars detector, which relies solely on ego-vehicle LiDAR data without any cooperative information. This baseline achieves an AP@0.5 of 0.481 on the DAIR-V2X dataset and 0.359 on the self-collected dataset, serving as the lower bound for perception performance. To establish a cooperative perception reference, we further construct a cooperative baseline (concatenation) that fuses vehicle and infrastructure features through simple concatenation. For simplicity, this configuration is referred to as “Coop Baseline” in the ablation settings shown in Table 4. In this setting, features extracted from the vehicle and roadside LiDAR sensors are directly concatenated without applying any of the proposed modules. This cooperative baseline significantly improves detection performance, achieving 0.641 and 0.498 in AP@0.5 and AP@0.7 on the DAIR-V2X dataset, and 0.572 and 0.451 on the self-collected dataset. This improvement demonstrates the effectiveness of cooperative perception in providing complementary sensing information. On top of this baseline, systematic ablation experiments were conducted to evaluate the performance contributions of the three core modules: R-SENet, FPB-Net, and SAFF.
Building upon the cooperative baseline (concatenation), we progressively integrate the proposed modules to analyze their individual contributions. Introducing the R-SENet module improves the detection performance to 0.701 and 0.556 in AP@0.5 and AP@0.7 on the DAIR-V2X dataset, and to 0.628 and 0.507 on the self-collected dataset, indicating that the attention mechanism enhances feature representation in sparse point cloud environments. After further integrating FPB-Net, the performance increases to 0.732 and 0.588 on the DAIR-V2X dataset, and to 0.659 and 0.534 on the self-collected dataset. This result demonstrates the effectiveness of the multi-scale feature pyramid fusion strategy in capturing contextual information across different spatial resolutions. To further investigate the contribution of the spatial fusion strategy, we also evaluate the configuration combining R-SENet and SAFF without FPB-Net. This configuration achieves 0.719 and 0.573 in AP@0.5 and AP@0.7 on the DAIR-V2X dataset, and 0.642 and 0.520 on the self-collected dataset, showing that the spatial adaptive fusion mechanism improves cross-source feature alignment. Finally, when all three modules are jointly applied, the proposed framework achieves the best performance, reaching 0.762 and 0.617 in AP@0.5 and AP@0.7 on the DAIR-V2X dataset, and 0.694 and 0.563 on the self-collected dataset. These results demonstrate that the proposed modules provide complementary benefits and jointly enhance the robustness and accuracy of vehicle–infrastructure cooperative perception.
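The incremental gains implied by these numbers can be tabulated with a few lines of Python. The values below are copied from the text; the configuration labels and the `gain` helper are illustrative names rather than the row labels of Table 4.

```python
# (AP@0.5, AP@0.7) pairs quoted in the text for each configuration.
ABLATION = {
    "Coop Baseline":              {"dair": (0.641, 0.498), "self": (0.572, 0.451)},
    "+ R-SENet":                  {"dair": (0.701, 0.556), "self": (0.628, 0.507)},
    "+ R-SENet + FPB-Net":        {"dair": (0.732, 0.588), "self": (0.659, 0.534)},
    "+ R-SENet + SAFF":           {"dair": (0.719, 0.573), "self": (0.642, 0.520)},
    "+ R-SENet + FPB-Net + SAFF": {"dair": (0.762, 0.617), "self": (0.694, 0.563)},
}

def gain(config: str, reference: str, dataset: str = "dair") -> tuple:
    """Absolute (AP@0.5, AP@0.7) improvement of `config` over `reference`."""
    return tuple(round(c - r, 3)
                 for c, r in zip(ABLATION[config][dataset], ABLATION[reference][dataset]))

for dataset in ("dair", "self"):
    print(dataset)
    for config in list(ABLATION)[1:]:
        print(f"  {config:<28} vs Coop Baseline: {gain(config, 'Coop Baseline', dataset)}")
```

On DAIR-V2X, for example, R-SENet alone adds 0.060 AP@0.5 over the cooperative baseline, and the full configuration adds 0.121; relative to R-SENet alone, adding FPB-Net contributes 0.031 and adding SAFF contributes 0.018.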
Furthermore, the R-SENet module operates as a fully data-driven mechanism that learns its attention weights adaptively, without requiring manual hyperparameter tuning. This self-adapting capability supports robust feature extraction for small targets, such as pedestrians, even under the extreme point cloud sparsity inherent in 16-beam LiDARs. Although the combination of R-SENet, FPB-Net, and SAFF achieves the best overall performance in the ablation study, certain challenging scenarios remain difficult for the complete system. In cases where objects are heavily occluded from both vehicle-side and infrastructure-side viewpoints, the cooperative fusion provides limited complementary geometric information, which may result in occasional missed detections. Extremely distant small objects represented by very sparse LiDAR points, particularly pedestrians, also remain challenging due to insufficient structural cues. In addition, although the evaluated datasets do not include severe weather conditions such as heavy rain or snow, scenes with strong reflectivity interference or partial sensor shadow regions may still reduce detection confidence. Nevertheless, the cooperative framework generally demonstrates improved robustness compared with single-vehicle perception, although perfect detection cannot be guaranteed under extreme sensing limitations.

4. Conclusions

To address the challenges encountered by intelligent buses and other large vehicles operating in complex urban environments (specifically, dynamic occlusion, long-range point cloud sparsity, and the efficient fusion of heterogeneous features), this study proposes a vehicle–infrastructure cooperative 3D object detection framework based on point-cloud feature enhancement and spatially adaptive fusion. First, an R-SENet-based attention mechanism is embedded into the conventional PointPillars encoder to reinforce the representation of sparse and occluded point clouds through dual-dimensional feature statistics and channel recalibration at both pillar and intra-pillar levels. Subsequently, the proposed FPB-Net feature-pyramid backbone is incorporated to enable unified multi-scale point-cloud modeling and accommodate the density variation of LiDAR observations over distance. Building upon these components, a spatially adaptive feature fusion module is introduced to dynamically integrate fine-grained vehicle-side structural information with global semantic cues from roadside sensors using spatial attention. This design mitigates the perception blind zones inherent in single-vehicle sensing and maintains low communication overhead via feature compression, thereby substantially improving cooperative perception in spatially heterogeneous scenarios. Experimental evaluations demonstrate that the proposed method significantly outperforms both single-vehicle perception baselines and state-of-the-art cooperative approaches on the DAIR-V2X and self-collected datasets, with particularly strong advantages in occluded and long-range scenarios where point clouds become sparse, while maintaining real-time inference. Future work will explore multimodal fusion strategies to further integrate heterogeneous sensing modalities and construct a more comprehensive perception framework.
In addition, more fine-grained evaluation protocols—such as performance analysis across different distance ranges and occlusion levels—will be investigated to further understand and improve the robustness of cooperative perception systems in complex urban environments.

Author Contributions

Conceptualization, S.Y. and C.X.; Methodology, S.Y., C.X. and Z.L.; Software, Y.W. and S.Y.; Validation, S.Y., Y.W., Z.L. and C.X.; Formal Analysis, Y.W.; Investigation, Y.W., Z.L. and C.X.; Resources, Y.W., C.X. and Z.L.; Data Curation, S.Y. and C.X.; Writing—Original Draft Preparation, S.Y.; Writing—Review and Editing, S.Y. and Y.W.; Visualization, S.Y. and Y.W.; Supervision, Y.W.; Project Administration, Y.W.; Funding Acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of the Henan Provincial Department of Transportation (No. 2023-5-1).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

Zhennan Liu is an employee of Yutong Bus Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469. [Google Scholar] [CrossRef]
  2. Huang, T.; Liu, J.; Zhou, X.; Nguyen, D.C.; Azghadi, M.R.; Xia, Y.; Sun, S. V2X cooperative perception for autonomous driving: Recent advances and challenges. arXiv 2023, arXiv:2310.03525. [Google Scholar] [CrossRef]
  3. Noor-A-Rahim, M.; Liu, Z.; Lee, H.; Khyam, M.O.; He, J.; Pesch, D.; Poor, H.V. 6G for vehicle-to-everything (V2X) communications: Enabling technologies, challenges, and opportunities. Proc. IEEE 2022, 110, 712–734. [Google Scholar] [CrossRef]
  4. Ye, X.; Shu, M.; Li, H.; Shi, Y.; Li, Y.; Wang, G.; Tan, X.; Ding, E. Rope3D: The roadside perception dataset for autonomous driving and monocular 3D object detection task. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 21341–21350. [Google Scholar]
  5. Wu, J.; Xu, H.; Tian, Y.; Pi, R.; Yue, R. Vehicle detection under adverse weather from roadside LiDAR data. Sensors 2020, 20, 3433. [Google Scholar] [CrossRef]
  6. Liu, S.; Gao, C.; Chen, Y.; Peng, X.; Kong, X.; Wang, K.; Wang, M. Towards vehicle-to-everything autonomous driving: A survey on collaborative perception. arXiv 2023, arXiv:2308.16714. [Google Scholar]
  7. Chen, Q.; Ma, X.; Tang, S.; Guo, J.; Yang, Q.; Fu, S. F-Cooper: Feature-based cooperative perception for autonomous vehicle edge computing system using 3D point clouds. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, Arlington, VA, USA, 7–9 November 2019; pp. 88–100. [Google Scholar]
  8. Yu, H.; Tang, Y.; Xie, E.; Mao, J.; Yuan, J.; Luo, P.; Nie, Z. Vehicle–infrastructure cooperative 3D object detection via feature flow prediction. arXiv 2023, arXiv:2303.10552. [Google Scholar]
  9. Ren, S.; Lei, Z.; Wang, Z.; Dianati, M.; Wang, Y.; Chen, S.; Zhang, W. Interruption-aware cooperative perception for V2X communication-aided autonomous driving. IEEE Trans. Intell. Veh. 2024, 9, 4698–4714. [Google Scholar] [CrossRef]
  10. Bai, Z.; Wu, G.; Barth, M.J.; Liu, Y.; Sisbot, E.A.; Oguchi, K. PillarGrid: Deep learning-based cooperative perception for 3D object detection from onboard-roadside LiDAR. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 1743–1749. [Google Scholar]
  11. Xiang, C.; Xie, X.; Feng, C.; Bai, Z.; Niu, Z.; Yang, M. V2I-BEVF: Multi-modal fusion based on BEV representation for vehicle–infrastructure perception. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 5292–5299. [Google Scholar]
  12. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  13. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  14. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  15. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
  16. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single-stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11040–11048. [Google Scholar]
  17. Liu, Z.; Zhang, Z.; Cao, Y.; Hu, H.; Tong, X. Group-free 3D object detection via transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2949–2958. [Google Scholar]
  18. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, C. Voxel transformer for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
  19. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully sparse VoxelNet for 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–23 June 2023; pp. 21674–21683. [Google Scholar]
  20. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  21. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  22. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  23. Xie, Q.; Zhou, X.; Qiu, T.; Zhang, Q.; Qu, W. Soft actor–critic-based multilevel cooperative perception for connected autonomous vehicles. IEEE Internet Things J. 2022, 9, 21370–21381. [Google Scholar] [CrossRef]
  24. Guo, A.; Zhang, S.; Tang, E.; Gao, X.; Pang, H.; Tian, H.; Chen, Z. When autonomous vehicle meets V2X cooperative perception: How far are we? arXiv 2025, arXiv:2509.24927. [Google Scholar] [CrossRef]
  25. Chen, Q.; Tang, S.; Yang, Q.; Fu, S. COOPER: Cooperative perception for connected autonomous vehicles based on 3D point clouds. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–9 July 2019; pp. 514–524. [Google Scholar]
  26. Arnold, E.; Dianati, M.; De Temple, R.; Fallah, S. Cooperative perception for 3D object detection in driving scenarios using infrastructure sensors. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1852–1864. [Google Scholar] [CrossRef]
  27. Mo, Y.; Zhang, P.; Chen, Z.; Ran, B. A method of vehicle–infrastructure cooperative perception based vehicle state information fusion using improved Kalman filter. Multimed. Tools Appl. 2022, 81, 4603–4620. [Google Scholar] [CrossRef]
  28. Yu, H.; Luo, Y.; Shu, M.; Huo, Y.; Yang, Z.; Shi, Y.; Nie, Z. DAIR-V2X: A large-scale dataset for vehicle–infrastructure cooperative 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 21361–21370. [Google Scholar]
  29. Feng, X.; Sun, H.; Zheng, H. LCV2I: Communication-efficient and high-performance collaborative perception framework with low-resolution LiDAR. arXiv 2025, arXiv:2502.17039. [Google Scholar]
  30. Wang, T.H.; Manivasagam, S.; Liang, M.; Yang, B.; Zeng, W.; Urtasun, R. V2VNet: Vehicle-to-vehicle communication for joint perception and prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 605–621. [Google Scholar]
  31. Xu, R.; Xiang, H.; Xia, X.; Han, X.; Li, J.; Ma, J. OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2583–2589. [Google Scholar]
  32. Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.H.; Ma, J. V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 107–124. [Google Scholar]
  33. Hu, Y.; Fang, S.; Lei, Z.; Zhong, Y.; Chen, S. Where2Comm: Communication-efficient collaborative perception via spatial confidence maps. arXiv 2022, arXiv:2209.12836. [Google Scholar]
  34. Yan, W.; Cao, H.; Chen, J.; Wu, T. FETR: Feature transformer for vehicle–infrastructure cooperative 3D object detection. Neurocomputing 2024, 600, 128147. [Google Scholar] [CrossRef]
  35. Li, X.; Yin, J.; Li, W.; Xu, C.; Yang, R.; Shen, J. Di-V2X: Learning domain-invariant representation for vehicle–infrastructure collaborative 3D object detection. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3208–3215. [Google Scholar] [CrossRef]
  36. Chen, Z.; Shi, Y.; Jia, J. TransIFF: An instance-level feature fusion framework for vehicle–infrastructure cooperative 3D detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–8 October 2023; pp. 18205–18214. [Google Scholar]
  37. Wang, J.; Nordström, T. Latency robust cooperative perception using asynchronous feature fusion. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 1–10. [Google Scholar]
  38. Li, B.; Zhao, Y.; Tan, H. CoFormerNet: A Transformer-Based Fusion Approach for Enhanced Vehicle–Infrastructure Cooperative Perception. Sensors 2024, 24, 4101. [Google Scholar] [CrossRef]
  39. Li, Y.; Dai, X.; Ge, B.; Song, Y.; Wang, J. Multi-Scale Dynamic Spatial Attention Module for Robust Point Cloud Perception in Cooperative Vehicle Infrastructure System. IEEE Access 2025, 13, 172895–172904. [Google Scholar] [CrossRef]
  40. Zhang, H.; Li, Y.; Zheng, S.; Lu, Z.; Gui, X.; Xu, W.; Bian, J. Battery lifetime prediction across diverse ageing conditions with inter-cell deep learning. Nat. Mach. Intell. 2025, 7, 270–277. [Google Scholar] [CrossRef]
  41. Zhang, H.; Gui, X.; Zheng, S.; Lu, Z.; Li, Y.; Bian, J. BatteryML: An open-source platform for machine learning on battery degradation. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  42. Wang, L.; Lan, J.; Li, M. PAFNet: Pillar attention fusion network for vehicle–infrastructure cooperative target detection using LiDAR. Symmetry 2024, 16, 401. [Google Scholar] [CrossRef]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  44. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  46. Mushtaq, H.; Deng, X.; Ullah, I.; Ali, M.; Malik, B.H. O2SAT: Object-oriented-segmentation-guided spatial-attention network for 3D object detection in autonomous vehicles. Information 2024, 15, 376. [Google Scholar] [CrossRef]
  47. Li, E.; Wang, S.; Li, C.; Li, D.; Wu, X.; Hao, Q. SUSTechPOINTS: A Portable 3D Point Cloud Interactive Annotation Platform System. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2020), Las Vegas, NV, USA, 19–23 October 2020; pp. 1108–1115. [Google Scholar]
  48. Xu, R.; Tu, Z.; Xiang, H.; Shao, W.; Zhou, B.; Ma, J. CoBEVT: Cooperative bird’s-eye-view semantic segmentation with sparse transformers. arXiv 2022, arXiv:2207.02202. [Google Scholar]
  49. Lu, Y.; Li, Q.; Liu, B.; Dianati, M.; Feng, C.; Chen, S.; Wang, Y. Robust collaborative 3D object detection in presence of pose errors. arXiv 2022, arXiv:2211.07214. [Google Scholar]
Figure 1. Framework of the proposed vehicle–infrastructure cooperative perception system. The pipeline consists of point cloud preprocessing (frame matching and coordinate conversion), feature extraction, feature compression and transmission, adaptive feature fusion, and 3D detection. Arrows indicate the data flow direction between modules. In the detection results, green boxes denote ground-truth objects, while red boxes represent detection results.
Figure 2. Schematic diagram of the improved PointPillars feature extraction module. The module consists of pillar feature encoding and a 2D backbone network with enhanced feature representation. Arrows indicate the data flow direction between different processing stages.
Figure 3. Schematic diagram of the R-SENet structure. The module applies channel-wise attention through squeeze and excitation operations to enhance feature representation. Arrows indicate the data flow direction between different processing steps.
Figure 4. Schematic diagram of the feature pyramid backbone network. The network constructs multi-scale feature representations by combining high-level semantic features with low-level spatial features through a top-down pathway and lateral connections. Arrows indicate the data flow direction between different layers.
Figure 5. Schematic diagram of the spatially adaptive feature fusion module. The module adaptively fuses features from different sources by learning spatial attention weights, enabling effective integration of vehicle-side and infrastructure-side information. Arrows indicate the data flow direction between different processing steps.
Figure 6. Self-Collected Dataset Acquisition Platform. (a) On-board data acquisition equipment. (b) Roadside infrastructure data acquisition units. (c) Roadside data collection scene. (d) Roadside point cloud visualization interface. (e) On-board point cloud visualization interface. (f) On-board data collection scene.
Figure 7. Examples of annotated LiDAR point clouds from the self-collected dataset: (a) vehicle-side scenario 1; (b) infrastructure-side scenario 1; (c) vehicle-side scenario 2; (d) infrastructure-side scenario 2. Green, yellow, and red bounding boxes denote cars, bicycles, and pedestrians, respectively.
Figure 8. Robustness analysis against localization errors, heading errors, and transmission latency on the DAIR-V2X dataset. (a) Detection performance under increasing localization noise (Gaussian noise with zero mean added to the LiDAR pose); (b) detection performance under increasing heading noise (yaw perturbation); (c) detection performance under different transmission delays (0–500 ms).
Figure 9. Performance–bandwidth trade-off on the DAIR-V2X dataset. The x-axis represents the communication volume in log2 scale (MB), and the y-axis represents AP@0.5.
Figure 10. Comparative visualization of point cloud feature extraction: (a) raw input point cloud; (b) BEV feature map generated by the baseline PointPillars model; (c) BEV feature map generated by the PointPillars model with the proposed R-SENet enhancement. The red boxes indicate selected regions for detailed comparison, and the corresponding enlarged views are shown on the right. “Ego” denotes the position of the ego-vehicle.
Figure 11. Visualization examples comparing single-vehicle perception and cooperative perception on the DAIR-V2X dataset and the self-collected dataset: (a) single-vehicle perception on DAIR-V2X; (b) cooperative perception on DAIR-V2X; (c) single-vehicle perception on the self-collected dataset; (d) cooperative perception on the self-collected dataset. Green, red, and yellow boxes denote ground-truth objects, detected objects, and missed or inaccurate predictions, respectively.
Figure 12. Visualization of detection results of different cooperative perception methods on the DAIR-V2X dataset: (a) V2X-ViT; (b) CoAlign; (c) the proposed method. The rows represent different scenes (Scene 1, Scene 2, and Scene 3). Green boxes denote ground-truth bounding boxes, red boxes represent detected objects, and yellow boxes indicate missed detections or inaccurate predictions caused by point-cloud sparsity due to occlusion or limited sensor range.
Figure 13. Visualization of detection results of different cooperative perception methods on the self-collected dataset: (a) V2X-ViT; (b) CoAlign; (c) the proposed method. The rows represent different scenes (Scene 1, Scene 2, and Scene 3). Green boxes denote ground-truth bounding boxes, red boxes represent detected objects, and yellow boxes indicate missed detections or inaccurate predictions caused by point-cloud sparsity due to occlusion or limited sensor range.
Table 1. Specifications of the Self-Collected Dataset and Data Acquisition System.
| Category | Device | Description |
| --- | --- | --- |
| Roadside Equipment | LiDAR Sensor | RoboSense (RoboSense Technology Co., Ltd., Shenzhen, China); 16-beam; 10 Hz; 360°/30° |
| Vehicle Equipment | LiDAR Sensor | RoboSense; 16-beam; 20 Hz; 360°/30° |
| | Positioning system | RTK-based high-precision localization |
| System Integration | Synchronization | Hardware-trigger via Time Server |
| | Calibration | Precise Extrinsic Calibration |
Table 2. Detection performance comparison of different perception methods on the DAIR-V2X dataset and the self-collected dataset.
| Method | Fusion Type | DAIR-V2X AP@0.5 | DAIR-V2X AP@0.7 | Inference Time (ms) | Self-Collected AP@0.5 | Self-Collected AP@0.7 |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (PointPillars [22]) | None | 0.481 | - | 25.36 | 0.359 | - |
| Late Fusion [28] | Late | 0.561 | - | 36.72 | 0.437 | - |
| Cooper [25] | Early | 0.617 | - | 69.86 | 0.561 | - |
| Cooperative Baseline | Intermediate | 0.689 | 0.531 | 50.23 | 0.607 | 0.488 |
| F-Cooper [7] | Intermediate | 0.734 | 0.559 | 35.17 | 0.712 | 0.546 |
| V2VNet [30] | Intermediate | 0.654 | 0.402 | 73.58 | 0.656 | 0.409 |
| CoBEVT [48] | Intermediate | 0.580 | 0.443 | 63.76 | 0.571 | 0.440 |
| V2X-ViT [32] | Intermediate | 0.585 | 0.449 | 161.04 | 0.564 | 0.453 |
| Where2comm [33] | Intermediate | 0.625 | 0.488 | 82.52 | 0.611 | 0.462 |
| CoAlign [49] | Intermediate | 0.741 | 0.594 | 97.41 | 0.668 | 0.547 |
| The proposed | Intermediate | 0.762 | 0.617 | 60.67 | 0.694 | 0.563 |
The bold font indicates the optimal value in the corresponding metric.
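One way to read the accuracy and latency columns of Table 2 together is to ask which methods are not beaten on both axes at once. The following Python sketch does this for DAIR-V2X AP@0.5 and inference time only. The values are transcribed from the table; the names `table2` and `pareto_front` are illustrative and are not part of the authors' code, and the non-dominance criterion is an assumption of this sketch rather than an analysis reported in the article.

```python
# (method, DAIR-V2X AP@0.5, inference time in ms), transcribed from Table 2.
table2 = [
    ("PointPillars", 0.481, 25.36),
    ("Late Fusion", 0.561, 36.72),
    ("Cooper", 0.617, 69.86),
    ("Cooperative Baseline", 0.689, 50.23),
    ("F-Cooper", 0.734, 35.17),
    ("V2VNet", 0.654, 73.58),
    ("CoBEVT", 0.580, 63.76),
    ("V2X-ViT", 0.585, 161.04),
    ("Where2comm", 0.625, 82.52),
    ("CoAlign", 0.741, 97.41),
    ("The proposed", 0.762, 60.67),
]


def pareto_front(rows):
    """Keep methods that no other method matches or beats on both
    accuracy (higher is better) and latency (lower is better),
    with a strict improvement on at least one of the two."""
    front = []
    for name, ap, ms in rows:
        dominated = any(
            (ap2 >= ap and ms2 <= ms) and (ap2 > ap or ms2 < ms)
            for _, ap2, ms2 in rows
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(table2)
print(front)
```

Under this criterion, and for these two columns only, the non-dominated methods are PointPillars, F-Cooper, and the proposed method. The picture may differ for AP@0.7 or for the self-collected dataset.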
Table 3. Computational Overhead Analysis of the Proposed Modules.
| Module Configuration | Params (M) | FLOPs (G) | Latency (ms) |
| --- | --- | --- | --- |
| Baseline | 4.82 | 63.52 | 50.23 |
| +R-SENet | 4.86 (+0.04) | 64.38 (+0.86) | 53.91 (+3.68) |
| +FPB-Net | 5.38 (+0.52) | 69.55 (+5.17) | 58.22 (+4.31) |
| +SAFF (Full Model) | 5.55 (+0.17) | 70.93 (+1.38) | 60.67 (+2.45) |
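As a rough check on Table 3, the Python sketch below recomputes the per-module increments shown in parentheses and expresses the total overhead of the full model as a percentage of the baseline. The values are transcribed from the table; the names `table3`, `stepwise_deltas`, and `relative_overhead` are illustrative and are not taken from the authors' code.

```python
# (configuration, params in M, FLOPs in G, latency in ms), from Table 3.
table3 = [
    ("Baseline", 4.82, 63.52, 50.23),
    ("+R-SENet", 4.86, 64.38, 53.91),
    ("+FPB-Net", 5.38, 69.55, 58.22),
    ("+SAFF (Full Model)", 5.55, 70.93, 60.67),
]


def stepwise_deltas(rows):
    """Increment of each row over the previous one, rounded to 2 decimals."""
    return [
        (cur[0], tuple(round(c - p, 2) for p, c in zip(prev[1:], cur[1:])))
        for prev, cur in zip(rows, rows[1:])
    ]


def relative_overhead(rows):
    """Percentage increase of the last row over the first, per metric."""
    base, full = rows[0], rows[-1]
    names = ("params", "flops", "latency")
    return {
        n: round(100.0 * (f - b) / b, 1)
        for n, b, f in zip(names, base[1:], full[1:])
    }


deltas = stepwise_deltas(table3)
overhead = relative_overhead(table3)
print(deltas)
print(overhead)
```

With these figures, the full model comes out at about 15.1% more parameters, 11.7% more FLOPs, and 20.8% more latency than the baseline, and the recomputed increments agree with the parenthesized values in the table.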
Table 4. Ablation study of the proposed modules on the DAIR-V2X and self-collected datasets.
| Ablation Setting | Fusion Type | DAIR-V2X AP@0.5 | DAIR-V2X AP@0.7 | Self-Collected AP@0.5 | Self-Collected AP@0.7 |
| --- | --- | --- | --- | --- | --- |
| Baseline (PointPillars) | None | 0.481 | - | 0.359 | - |
| Cooperative Baseline (Concatenation) | Intermediate | 0.641 | 0.498 | 0.572 | 0.451 |
| Coop Baseline + R-SENet | Intermediate | 0.701 | 0.556 | 0.628 | 0.507 |
| Coop Baseline + R-SENet + FPB-Net | Intermediate | 0.732 | 0.588 | 0.659 | 0.534 |
| Coop Baseline + R-SENet + SAFF | Intermediate | 0.719 | 0.573 | 0.642 | 0.520 |
| Coop Baseline + R-SENet + FPB-Net + SAFF | Intermediate | 0.762 | 0.617 | 0.694 | 0.563 |
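Because Table 4 reports FPB-Net and SAFF both separately and jointly on top of R-SENet, their marginal contributions can be compared directly. The Python sketch below does this for DAIR-V2X AP@0.5 only. The values are transcribed from the table; the dictionary keys, the function name `marginal_gains`, and the "interaction" term (joint gain minus the sum of the individual gains) are illustrative choices of this sketch, not quantities defined in the article.

```python
# DAIR-V2X AP@0.5 for the cooperative settings that include R-SENet (Table 4).
ap50 = {
    "R-SENet": 0.701,
    "R-SENet + FPB-Net": 0.732,
    "R-SENet + SAFF": 0.719,
    "R-SENet + FPB-Net + SAFF": 0.762,
}


def marginal_gains(ap):
    """AP gain of each add-on over the R-SENet-only setting, plus the
    difference between the joint gain and the sum of individual gains."""
    base = ap["R-SENet"]
    fpb = round(ap["R-SENet + FPB-Net"] - base, 3)
    saff = round(ap["R-SENet + SAFF"] - base, 3)
    both = round(ap["R-SENet + FPB-Net + SAFF"] - base, 3)
    return {
        "fpb_only": fpb,
        "saff_only": saff,
        "both": both,
        "interaction": round(both - (fpb + saff), 3),
    }


gains = marginal_gains(ap50)
print(gains)
```

On this metric the joint gain (0.061) exceeds the sum of the individual gains (0.031 + 0.018 = 0.049), which is consistent with the two modules being complementary. Since the table reports single values without variance estimates, this should be read as indicative only.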
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, S.; Wu, Y.; Liu, Z.; Xie, C. Research on Cooperative Vehicle–Infrastructure Perception Integrating Enhanced Point-Cloud Features and Spatial Attention. World Electr. Veh. J. 2026, 17, 164. https://doi.org/10.3390/wevj17040164
Chicago/Turabian Style

Yan, Shiyang, Yanfeng Wu, Zhennan Liu, and Chengwei Xie. 2026. "Research on Cooperative Vehicle–Infrastructure Perception Integrating Enhanced Point-Cloud Features and Spatial Attention" World Electric Vehicle Journal 17, no. 4: 164. https://doi.org/10.3390/wevj17040164