Article

PnPDA+: A Meta Feature-Guided Domain Adapter for Collaborative Perception †

1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 China Unicom Smart Connection Technology Limited, Beijing 100032, China
* Authors to whom correspondence should be addressed.
This paper is an extended version of our paper published in European Conference on Computer Vision (ECCV) 2024.
World Electr. Veh. J. 2025, 16(7), 343; https://doi.org/10.3390/wevj16070343
Submission received: 27 April 2025 / Revised: 12 June 2025 / Accepted: 19 June 2025 / Published: 21 June 2025

Abstract

Although cooperative perception enhances situational awareness by enabling vehicles to share intermediate features, real-world deployment faces challenges due to heterogeneity in sensor modalities, architectures, and encoder parameters across agents. These domain gaps often result in semantic inconsistencies among the shared features, thereby degrading the quality of feature fusion. Existing approaches either necessitate the retraining of private models or fail to adapt to newly introduced agents. To address these limitations, we propose PnPDA+, a unified and modular domain adaptation framework designed for heterogeneous multi-vehicle cooperative perception. PnPDA+ consists of two key components: a Meta Feature Extraction Network (MFEN) and a Plug-and-Play Domain Adapter (PnPDA). MFEN extracts domain-aware and frame-aware meta features from received heterogeneous features, encoding domain-specific knowledge and spatial-temporal cues to serve as high-level semantic priors. Guided by these meta features, the PnPDA module performs adaptive semantic conversion to enhance cross-agent feature alignment without modifying existing perception models. This design ensures the scalable integration of emerging vehicles with minimal fine-tuning, significantly improving both semantic consistency and generalization. Experiments on OPV2V show that PnPDA+ outperforms state-of-the-art methods by 4.08% in perception accuracy while preserving model integrity and scalability.

1. Introduction

In recent years, intelligent transportation systems (ITSs) have increasingly focused on cooperation among multiple autonomous vehicles to improve overall safety, maneuverability, and traffic efficiency. Cooperative behavior among such vehicles can significantly reduce congestion, shorten travel times, and enhance road safety. Recent studies highlighted the potential of vehicle-to-vehicle (V2V) communication to support these goals at a fine-grained level. For instance, Akopov et al. [1] demonstrated how maneuverability can be improved in a multi-agent fuzzy transportation system using a parallel bi-objective genetic algorithm. In parallel, Kamel et al. [2] reviewed formation control and cooperative strategies for multiple unmanned ground vehicles under nominal and faulty conditions. These works underscore the importance of robust inter-agent collaboration and real-time information sharing in future ITS deployments.
While notable progress has been made in high-level cooperative decision making, its effectiveness ultimately relies on accurate and consistent perception. Cooperative perception allows autonomous vehicles to expand their sensory coverage by exchanging observations with nearby agents, thereby overcoming occlusions and blind spots [3].
Most existing cooperative perception frameworks mainly adopt a feature-level fusion strategy. In this paradigm, each participating vehicle independently extracts intermediate features from its local sensory input and transmits them to others via V2V communication. The receiving vehicle then aggregates all incoming features using a fusion module and feeds the combined representation into downstream perception tasks (e.g., object detection and semantic segmentation). However, these frameworks commonly assume that the intermediate features of different vehicles are homogeneous and share the same distribution and structure. This assumption, while simplifying the fusion, does not hold in real-world scenarios.
In real-world deployments, vehicles often differ in terms of sensor configurations, encoder architectures, and manufacturer-specific perception models. Such heterogeneity leads to semantic inconsistencies in the extracted features, including differences in feature scale, focus, and granularity. These inconsistencies introduce a domain gap that hinders effective information fusion. Without proper semantic alignment, the direct fusion of these heterogeneous features can result in severe semantic conflicts and information loss, ultimately degrading the accuracy, robustness, and scalability of collaborative perception systems.
To bridge feature heterogeneity, prior studies have explored domain adaptation and feature transformation techniques. A notable approach is heterogeneous alliance (HEAL) [4], which aligns heterogeneous features by retraining the encoders of individual vehicles to project them into a unified semantic space, as shown in Figure 1a. However, a major drawback of HEAL is its assumption that the feature encoder can be modified, which is often infeasible in real deployments where perception backbones are fixed and cannot easily be replaced. Building on efforts to address this limitation, multi-agent perception domain adaptation (MPDA) [5] mitigates semantic heterogeneity by employing a learnable feature resizer and a sparse cross-domain transformer to align heterogeneous features with the semantic space of the ego vehicle in terms of spatial resolution, channel dimensions, and feature patterns. Although it does not require altering the original encoder, it demands that each vehicle maintain a one-to-one feature adapter for every type of collaborating vehicle, as shown in Figure 1b, resulting in high storage and computational costs as the collaboration network scales.
More recently, our previous work PnPDA [6] introduced a plug-and-play domain adapter that enables feature alignment across heterogeneous vehicles without modifying existing perception models. It defines a standard semantic space as an intermediate representation for the interaction of cross-vehicle features. Specifically, features from a collaborating vehicle are first projected into the standard semantic space, and then the ego vehicle transforms them into its own feature space for downstream fusion. Compared to one-to-one adaptation methods like MPDA [5], this two-stage transformation process offers improved scalability and alignment accuracy in multi-vehicle collaboration by leveraging a shared standard semantic space. However, such a decoupled mapping approach may incur additional information loss during the two-stage transformation, particularly in scenarios with significant domain gaps between vehicles.
Therefore, heterogeneous cooperative sensing still faces the challenge of simultaneously meeting three key requirements: (1) preserving the integrity of private models, as modifying them is often infeasible due to proprietary restrictions or hardware dependencies; (2) minimizing semantic loss during cross-agent feature alignment to ensure accurate and reliable information exchange; and (3) maintaining scalability to support the seamless integration of newly emerging vehicles into dynamic collaborative networks.
To address the above challenges, we propose a meta feature-guided domain adapter for collaborative perception. This adapter comprises two key components: a meta feature extraction network (MFEN) and a plug-and-play domain adapter (PnPDA) [6] module. MFEN is responsible for extracting transferable meta features from heterogeneous domain-specific inputs, providing domain-aware priors to guide the semantic alignment process in PnPDA. Leveraging these priors, PnPDA performs an informed feature conversion across domains, thereby enhancing its generalization ability to unseen or emerging domains with minimal fine-tuning. Through the joint design of MFEN and PnPDA, the proposed framework not only improves the accuracy of semantic alignment across heterogeneous features, but also achieves plug-and-play compatibility and scalability to accommodate emerging vehicles from diverse domains, ultimately benefiting the overall performance of collaborative perception systems. The main contributions of this work are as follows.
  • We propose a meta feature-guided domain adapter for collaborative perception, which serves as a plug-and-play architecture that enables the seamless integration of existing vehicles without modifying perception models.
  • We introduce a dual-level meta feature representation, categorizing meta features into domain-related meta features and frame-related meta features. The former captures the unique characteristics of different domains, while the latter encodes spatial–temporal information to enhance scene-level semantic understanding.
  • We achieve a favorable trade-off between performance and training cost by enabling efficient domain adaptation through lightweight meta feature extraction and minimal fine-tuning steps, thus ensuring the scalability of the collaborative perception network.
  • Experiments conducted on OPV2V [7] demonstrate that MFEN significantly improves generalization across diverse domains and enables the rapid integration of heterogeneous vehicles, outperforming state-of-the-art methods by 4.08% without requiring the retraining of existing perception models.
The remainder of this paper is organized as follows. Section 2 reviews related work in three areas: cooperative perception, domain adaptation, and meta-learning. Section 3 presents the general cooperative perception pipeline and illustrates how our proposed modules are integrated into this framework to address the challenges of heterogeneous agent collaboration. Section 4 details the proposed PnPDA+ framework, including MFEN and PnPDA. Section 5 describes the experimental setup and presents comprehensive quantitative and qualitative results, along with detailed analysis and discussion. Finally, Section 6 concludes the paper and discusses the broader implications of our work for enhancing safety and efficiency in ITSs.

2. Related Work

2.1. Collaborative Perception

Early collaborative perception systems primarily adopted sensor-level fusion [8], directly transmitting raw data (e.g., point clouds or images) among vehicles. While this approach offers rich information, it poses significant communication and synchronization challenges.
To mitigate these problems, feature-level fusion has emerged as the mainstream approach, offering a favorable balance between perception accuracy and communication efficiency. One of the early methods, F-Cooper [9], proposed extracting intermediate features from local point clouds and sharing them across vehicles for voxel-wise fusion. Building on this idea, V2VNet [10] incorporates a learned spatial-aware communication module to adaptively aggregate multi-agent features. With the emergence of Transformer architectures, attention-based fusion mechanisms were explored. For instance, V2X-ViT [11] leverages vision transformers to improve fusion quality by capturing inter-agent spatial relationships. CoBEVT [12] integrates a fused axial attention (FAX) module within a transformer-based architecture to effectively fuse multi-agent and multi-view camera features, capturing both local and global spatial dependencies to enhance bird’s eye view (BEV) semantic segmentation. Beyond general feature fusion, several recent works focus on enhancing fusion selectivity to better address redundancy in multi-agent scenarios. Where2comm [13] introduces spatial confidence maps to selectively fuse only the most informative regions across agents. DiscoNet [14], leveraging a teacher–student knowledge distillation framework, enables a lightweight student model to approximate the performance of a holistic-view teacher model. This design achieves a superior balance between perception accuracy and bandwidth efficiency, making it well suited for scalable multi-agent 3D object detection.

2.2. Heterogeneous Domain Adaptation

In the context of multi-vehicle cooperative perception, the presence of diverse sensor modalities and perception model architectures leads to different data distributions. This has made heterogeneous domain adaptation a crucial research area that aims at bridging domain gaps to address complex issues related to data fusion. Recent advances have addressed heterogeneity from three main perspectives: modality discrepancy, perspective misalignment, and semantic inconsistency.
To address modality-level heterogeneity, HM-ViT [15] focuses on resolving the data heterogeneity between point cloud data and image data. It employs a general heterogeneous 3D graph attention mechanism to extract heterogeneous interactions between and within vehicles, ensuring the effective processing of different types of data within the same framework. Perspective misalignment, often caused by different viewpoints or coordinate systems among vehicles, is addressed by TransIFF [16], which associates candidate object proposals from different perspectives using a cross-domain adaptation module. By generating proposals from pre-trained detectors and applying multi-head self-attention, TransIFF captures the global context and aligns multiview representations into a coherent spatial understanding. Beyond these representation-level discrepancies, recent work has increasingly focused on semantic-level domain adaptation. MPDA [5] pioneers this direction by leveraging heterogeneous features from neighboring agents as attention queries and employing a domain classifier to minimize feature inconsistency across domains. The semantic adaptation module is trained under supervision via a classification loss, promoting inter-agent feature alignment. HEAL [4] advances this idea by introducing a shared standard semantic space. Through pyramid-based feature aggregation, it produces unified representations that capture both local and global semantics. This design improves compatibility across agents and enhances forward scalability to unseen sensor domains and perception models.

2.3. Meta-Learning

Meta-learning enables models to rapidly adapt to new tasks with limited supervision by leveraging the knowledge learned across a distribution of related tasks. Early approaches primarily focused on optimizing parameter initialization or designing task-invariant representations to facilitate fast adaptation. For example, model-agnostic meta-learning (MAML) [17] introduced a bilevel optimization framework that learns a universal initialization requiring only a few gradient updates to adapt to downstream tasks. Prototypical networks [18] approached meta-learning from a metric-based perspective by computing class prototypes from the support set and classifying query samples based on their distances to these prototypes, demonstrating strong performance in few-shot classification.
Building upon these foundations, more recent studies emphasize learning high-level, transferable meta representations (commonly referred to as meta features), which encode task-agnostic priors across training tasks. These meta features are often extracted by meta-networks or shared backbones and serve as a powerful intermediate abstraction for unseen tasks. For example, Javed and White [19] propose a joint optimization scheme where a meta feature extractor is trained alongside task-specific heads, supporting continual learning while mitigating catastrophic forgetting through a multi-task loss.
In the vision-language domain, meta features have been further integrated into prompt-based adaptation frameworks. Zhou et al. [20] proposed conditional context optimization (CoCoOp), which conditions learnable prompts on image features to better guide CLIP-like models in zero-shot and few-shot classification. Prefix-tuning [21] adopts a similar strategy by injecting lightweight prefix embeddings into the transformer layers, enabling task adaptation without modifying the backbone. DenseCLIP [22] extends this idea to dense prediction by generating context-aware meta prompts and applying separate meta features during the training and inference stages.

3. System Model

3.1. Preliminary

To construct the heterogeneous cooperative perception system described above, we make several assumptions to facilitate design and analysis. (1) The transmitted intermediate features and vehicle poses are assumed to be temporally aligned through synchronized data collection; (2) The communication bandwidth is sufficient to transmit full intermediate BEV features without compression or loss; (3) The domain gap primarily arises from differences in encoder architectures while all vehicles perform the same downstream task (object detection). Based on these assumptions, we formalize the system as follows. In a heterogeneous cooperative system with n vehicles, we define the vehicle set as $V = \{V_1, V_2, \ldots, V_n\}$ and their corresponding sensing domains as $D = \{D_1, D_2, \ldots, D_n\}$, where each domain $D_i$ corresponds to vehicle $V_i$ equipped with its own feature encoder.
  • Feature Extraction: At a given timestamp, each vehicle V i collects raw sensory data I i through its onboard sensors and processes it via its local encoder Enc i to produce intermediate feature representations F i as shown in Equation (1).
    $$F_i = \mathrm{Enc}_i(I_i), \quad i = 1, 2, \ldots, n \tag{1}$$
  • Neighbor Selection: To enable cooperative perception, each vehicle $V_i \in V$ communicates with its neighboring vehicles $V_j$ located within a predefined communication range R. The neighbor set of $V_i$ is denoted as $N_i$, where $V_j \in N_i$ if the Euclidean distance between their poses $P_i$ and $P_j$ satisfies $\mathrm{dist}(P_i, P_j) \le R$.
    $$N_i = \{\, V_j \mid \mathrm{dist}(P_i, P_j) \le R,\ j \ne i \,\} \tag{2}$$
  • Spatial Transformation: Each neighboring vehicle $V_j \in N_i$ transmits its feature $F_j$ along with its pose $P_j$ to the ego vehicle $V_i$. The ego vehicle then applies a spatial transformation to map $F_j$ from $V_j$'s coordinate frame to its own, based on the relative pose between $P_i$ and $P_j$, yielding the spatially aligned feature $F_j^i$. This ensures that all received features are spatially consistent in the ego vehicle's coordinate frame.
    $$F_j^i = \mathrm{SpatialTransform}(P_i, P_j, F_j), \quad \forall V_j \in N_i \tag{3}$$
  • Semantic Alignment: Note that, due to variations in encoder architectures, features from different vehicles may differ semantically, even when observing the same scene. Therefore, a domain adapter should be introduced to map the spatially aligned feature F j i into the semantic space of the ego vehicle V i , producing the semantically aligned feature F ^ j i for consistent downstream fusion.
    $$\hat{F}_j^i = \mathrm{DomainAdapter}(F_i, F_j^i), \quad \forall V_j \in N_i \tag{4}$$
  • Fusion: Finally, vehicle V i aggregates its own feature F i with the semantically aligned features from all neighbors { F ^ j i } V j N i to form the fused representation A i .
    $$A_i = \mathrm{Fuse}\big(F_i, \{\hat{F}_j^i\}_{V_j \in N_i}\big) \tag{5}$$
  • Detection: The fused representation $A_i$ is then passed through a task-specific head $\mathrm{Head}_i$ (e.g., for object detection or trajectory prediction) to generate the final output $O_i$.
    $$O_i = \mathrm{Head}_i(A_i) \tag{6}$$
The overall cooperative perception process can be sequentially represented by Equations (1)–(6). This general formulation covers both homogeneous and heterogeneous cooperative perception settings. In the homogeneous case, where all vehicles share the same encoder architecture and feature semantics, Equation (4) is unnecessary, as the received features are semantically compatible and can be directly fused in Equation (5) after performing Equation (3). In contrast, in the heterogeneous case, semantic inconsistencies arise due to the differences in encoder structures and feature resolutions. Therefore, the domain adapter in Equation (4) becomes essential to transform F j i into the ego vehicle’s semantic space before effective fusion in Equation (5).
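To make the pipeline concrete, the following is a minimal sketch of Equations (1)–(6) in Python, assuming each stage (encoder, spatial transform, domain adapter, fusion, head) is available as an opaque callable; the function names, the simplified 2D poses, and the default communication range are illustrative placeholders rather than the reference implementation.

```python
import math
from typing import Any, Callable, Dict, List, Tuple

Pose = Tuple[float, float]  # simplified (x, y) position; a full 6-DoF pose in practice


def neighbors_within_range(ego_id: int, poses: Dict[int, Pose], comm_range: float) -> List[int]:
    """Equation (2): neighbors are vehicles within Euclidean distance R of the ego."""
    ex, ey = poses[ego_id]
    return [
        j
        for j, (x, y) in poses.items()
        if j != ego_id and math.hypot(x - ex, y - ey) <= comm_range
    ]


def ego_perception(
    ego_id: int,
    raw_inputs: Dict[int, Any],      # I_i: raw sensor data per vehicle
    poses: Dict[int, Pose],          # P_i: vehicle poses
    encoders: Dict[int, Callable],   # Enc_i: private, frozen encoders (Eq. 1)
    spatial_transform: Callable,     # warps F_j into the ego frame (Eq. 3)
    domain_adapter: Callable,        # PnPDA+ semantic alignment (Eq. 4)
    fuse: Callable,                  # feature fusion network (Eq. 5)
    head: Callable,                  # task-specific detection head (Eq. 6)
    comm_range: float = 70.0,        # communication range R (placeholder value)
) -> Any:
    # Eq. (1): every vehicle encodes its own raw sensor data locally.
    feats = {i: encoders[i](x) for i, x in raw_inputs.items()}
    f_ego = feats[ego_id]

    aligned = []
    for j in neighbors_within_range(ego_id, poses, comm_range):          # Eq. (2)
        f_j_to_i = spatial_transform(poses[ego_id], poses[j], feats[j])  # Eq. (3)
        aligned.append(domain_adapter(f_ego, f_j_to_i))                  # Eq. (4)

    fused = fuse(f_ego, aligned)  # Eq. (5)
    return head(fused)            # Eq. (6)
```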

3.2. Meta Feature-Guided Domain Adapter

The core functionality of PnPDA+ lies in achieving semantic alignment across heterogeneous features through meta feature-guided domain adaptation, as formulated in Equation (4). This module can be seamlessly integrated after coordinate transformation in Equation (3) and before feature fusion in Equation (5). The overall process of cooperative perception and the position of PnPDA+ are illustrated in Figure 2.
In the proposed PnPDA+ framework, MFEN is responsible for extracting meta features that characterize both the domain-related properties and frame-related semantics of heterogeneous features. These meta features provide a high-level abstraction of input features from different domains and serve as informative priors that help PnPDA understand how features vary across domains. By providing this semantic guidance, MFEN enables PnPDA to better generalize domain-invariant representations and align features from different sources.
Built upon the meta feature extracted by MFEN, the enhanced PnPDA module leverages these priors to dynamically adjust its feature alignment strategy according to the characteristics of each domain. This allows PnPDA to perform more accurate and flexible semantic adaptation without modifying the core network architecture. Moreover, when encountering emerging vehicles, the system only needs to extract the corresponding meta feature and reuse the existing semantic converter in PnPDA with minimal fine-tuning. This design ensures semantic consistency across heterogeneous agents and supports scalable, low-cost adaptation in dynamic cooperative scenarios.

4. Methodology

4.1. Adaptive Feature Alignment

Heterogeneous features often exhibit inconsistent spatial resolutions and channel dimensions across their respective feature maps. Therefore, before performing meta feature extraction, we adopt an adaptive alignment process to standardize both the spatial and channel dimensions of all feature maps. This ensures that subsequent computations are performed on a unified feature size.
Specifically, consider a feature map $F_j \in \mathbb{R}^{H_j \times W_j \times C_j}$ extracted from a heterogeneous domain $D_j$ associated with vehicle $V_j$, where $H_j$, $W_j$, and $C_j$ denote the height, width, and number of channels, respectively. In the spatial dimension, we employ an adaptive average pooling operation that computes the mean value within each pooling window. By setting the output size to match the spatial resolution of the ego vehicle's feature map $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, the pooling window size and stride are automatically adjusted, ensuring spatial alignment with $F_i$. This strategy enables the flexible handling of heterogeneous features with arbitrary spatial sizes. In the channel dimension, we first normalize the number of channels of $F_j$ to $2C_i$: when $C_j < 2C_i$, feature duplication is applied along the channel axis; when $C_j > 2C_i$, surplus channels are randomly dropped. A subsequent $1 \times 1$ convolution then projects the feature to the target dimension $C_i$. The resulting aligned feature is defined in Equation (7).
$$F_j^1 = \mathrm{Align}(F_j) \in \mathbb{R}^{H_i \times W_i \times C_i} \tag{7}$$
where H i and W i represent the spatial resolution of the ego feature F i , and  C i denotes its channel dimension.
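The following PyTorch sketch illustrates the alignment step in Equation (7) under the operations described above (adaptive average pooling for the spatial dimension, duplication or random dropping to $2C_i$ followed by a $1 \times 1$ convolution for the channel dimension); the module and argument names are our own, and the exact layer configuration is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class AdaptiveAlign(nn.Module):
    def __init__(self, ego_channels: int, ego_hw: Tuple[int, int]):
        super().__init__()
        self.ego_hw = ego_hw                  # (H_i, W_i) of the ego feature map
        self.mid_channels = 2 * ego_channels  # intermediate width 2 * C_i
        self.proj = nn.Conv2d(self.mid_channels, ego_channels, kernel_size=1)

    def forward(self, f_j: torch.Tensor) -> torch.Tensor:
        # f_j: (B, C_j, H_j, W_j) feature from a heterogeneous neighbor encoder.
        _, c_j, _, _ = f_j.shape

        # Spatial alignment: adaptive average pooling to the ego resolution (H_i, W_i).
        f = F.adaptive_avg_pool2d(f_j, self.ego_hw)

        # Channel alignment: duplicate up to, or randomly drop down to, 2 * C_i channels.
        if c_j < self.mid_channels:
            repeats = -(-self.mid_channels // c_j)  # ceiling division
            f = f.repeat(1, repeats, 1, 1)[:, : self.mid_channels]
        elif c_j > self.mid_channels:
            keep = torch.randperm(c_j, device=f.device)[: self.mid_channels]
            f = f[:, keep]

        # 1x1 convolution projects to the ego channel dimension C_i.
        return self.proj(f)                   # (B, C_i, H_i, W_i)
```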

4.2. MFEN

For each domain D i , MFEN is independently executed to extract a domain-specific meta feature M i . Each meta feature M i is further decomposed into two complementary components: a domain-related meta feature M d and a frame-related meta feature M f . The former captures the distinctive structural patterns and encoding styles of the domain, while the latter encodes the spatial-temporal context of the current scene. Together, these components provide complementary semantic priors that enhance the PnPDA’s ability to interpret heterogeneous features across vehicles. The architecture of MFEN is illustrated in Figure 3.

4.2.1. Domain-Related Meta Feature

The domain-related meta feature M d is designed to capture the overall style and semantic characteristics of each domain. For a specific domain D j associated with vehicle V j , the corresponding domain-related meta feature is denoted as M d j , and is shared across all features within D j . By modeling domain-specific attributes, M d j establishes clear semantic boundaries between domains, effectively mitigating misalignment caused by feature aliasing. This, in turn, enhances the accuracy and robustness of cross-domain adaptation. During subsequent semantic conversion and feature fusion, M d j provides essential context that helps the semantic converter better understand the relationships between different domains.
Before training, all domain-related meta features are randomly initialized and stored in a meta feature library. During training, a momentum-based update strategy is applied to learn an evolving average representation for each domain, enabling adaptation to intra-domain distribution shifts. The update rule is defined as Equation (8).
$$M_d^j \leftarrow \lambda\, M_d^j + (1 - \lambda)\, F_j^1, \qquad M_d^j \in \mathbb{R}^{1 \times C_i} \tag{8}$$
where $\lambda \in [0, 1]$ is the momentum coefficient, and $F_j^1$ denotes the aligned feature representation of domain $D_j$ after adaptive alignment.
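A minimal sketch of the momentum update in Equation (8) is given below. The text does not specify how the aligned feature map $F_j^1$ is reduced to the $1 \times C_i$ shape of $M_d^j$, so the spatial averaging used here is an assumption, as are the library layout and the default momentum value.

```python
import torch


class MetaFeatureLibrary:
    def __init__(self, num_domains: int, channels: int, momentum: float = 0.99):
        self.momentum = momentum
        # Randomly initialized domain-related meta features M_d^j, one per domain.
        self.library = {j: torch.randn(1, channels) for j in range(num_domains)}

    @torch.no_grad()
    def update(self, domain_id: int, f_aligned: torch.Tensor) -> torch.Tensor:
        # f_aligned: (B, C_i, H_i, W_i) aligned feature F_j^1 of this domain.
        # Assumed reduction: average over batch and spatial dimensions -> (1, C_i).
        obs = f_aligned.mean(dim=(0, 2, 3)).unsqueeze(0)
        m = self.library[domain_id]
        # Eq. (8): M_d <- lambda * M_d + (1 - lambda) * observation.
        self.library[domain_id] = self.momentum * m + (1.0 - self.momentum) * obs
        return self.library[domain_id]
```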

4.2.2. Frame-Related Meta Feature

In addition to the unique domain attributes captured by the domain-related meta feature, it is also crucial to model temporal variations. To this end, we introduce a frame-related meta feature to capture the temporal dynamics of the scene at specific moments. This type of meta feature provides fine-grained semantic cues for the later semantic conversion. Without such temporal cues, MFEN may lose the ability to distinguish temporal dynamics, leading to severe information loss.
The frame-related meta feature for the neighbor vehicle V j of domain D j is denoted as M f j . To extract such spatial-temporal information, we employ a negative attention mechanism that operates over heterogeneous features observed at different time steps. Specifically, the ego vehicle’s feature F i is used as the query, and the neighbor feature F j 1 serves as the key and value, as shown in Equation (9). In this mechanism, the dot product F i · F j 1 computes the pairwise similarity between features at corresponding spatial locations, serving as the raw attention score. The sign of this score indicates whether a given feature should be enhanced or suppressed.
$$M_f^j = \mathrm{Sign}(F_i \cdot F_j^1) \times \mathrm{Softmax}\big(\,|F_i \cdot F_j^1|\,\big) \cdot F_j^1 \tag{9}$$
Here, the Sign function indicates the direction of attention weights, defined as Equation (10).
$$\mathrm{Sign}(x) = \begin{cases} -1, & \text{if } x < 0 \\ 0, & \text{if } x = 0 \\ 1, & \text{if } x > 0 \end{cases} \tag{10}$$
Unlike conventional attention mechanisms that only assign non-negative weights, negative attention expands the model’s expressive power by allowing weights to take on negative values. This enables the suppression of irrelevant or noisy features in addition to enhancing informative ones. During training, this mechanism facilitates fine-grained filtering in the feature space, improving discriminative representation.
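The sketch below illustrates one plausible reading of the negative attention in Equations (9) and (10): each spatial location is scored by the channel-wise dot product with the ego feature, the Softmax is taken over the absolute scores across all locations, and the sign is then re-attached. The Softmax axis and the broadcasting details are assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F


def negative_attention(f_ego: torch.Tensor, f_neighbor: torch.Tensor) -> torch.Tensor:
    # f_ego, f_neighbor: (B, C, H, W), already spatially and channel aligned.
    b, _, h, w = f_ego.shape

    # Raw attention score per location: channel-wise dot product F_i . F_j^1.
    score = (f_ego * f_neighbor).sum(dim=1)  # (B, H, W)

    # Softmax over |score| across all spatial positions, then restore the sign
    # so that irrelevant or conflicting locations can be actively suppressed.
    weight = torch.sign(score) * F.softmax(score.abs().flatten(1), dim=1).view(b, h, w)

    # Frame-related meta feature: signed, re-weighted neighbor feature.
    return weight.unsqueeze(1) * f_neighbor  # (B, C, H, W)
```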

4.2.3. Enhanced Domain-Specific Feature

After obtaining the two types of meta feature, namely the domain-related meta feature M d j and the frame-related meta feature M f j , MFEN applies complementary operations to incorporate them into the neighbor feature F j 1 from vehicle V j in domain D j .
Specifically, since M d j captures the global domain characteristics of D j , we first perform a channel-wise multiplication between F j 1 and M d j to obtain F j 1 ¯ , as defined in Equation (11). Then, the frame-related meta feature M f j , which encodes fine-grained spatial-temporal context and shares the same shape as F j 1 ¯ , is integrated via element-wise addition to yield the final enhanced domain-specific feature F j 2 , as shown in Equation (12):
$$\bar{F}_j^1 = F_j^1 \cdot M_d^j, \qquad \bar{F}_j^1 \in \mathbb{R}^{H_i \times W_i \times C_i} \tag{11}$$
$$F_j^2 = \bar{F}_j^1 + M_f^j, \qquad F_j^2 \in \mathbb{R}^{H_i \times W_i \times C_i} \tag{12}$$
Finally, F j 2 serves as the enhanced domain-specific feature map, jointly guided by domain-related meta feature and frame-related meta feature. This enriched feature exhibits improved robustness and expressiveness, making it a more effective input for subsequent semantic conversion of PnPDA.
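A small sketch of the combination in Equations (11) and (12) follows, assuming a (B, C, H, W) tensor layout; the broadcasting of $M_d^j$ over spatial positions is the only implementation detail added here.

```python
import torch


def enhance_neighbor_feature(
    f_aligned: torch.Tensor,  # F_j^1: (B, C_i, H_i, W_i) aligned neighbor feature
    m_domain: torch.Tensor,   # M_d^j: (1, C_i) domain-related meta feature
    m_frame: torch.Tensor,    # M_f^j: (B, C_i, H_i, W_i) frame-related meta feature
) -> torch.Tensor:
    # Eq. (11): channel-wise multiplication broadcasts M_d over spatial positions.
    f_bar = f_aligned * m_domain.view(1, -1, 1, 1)
    # Eq. (12): element-wise addition injects spatial-temporal context.
    return f_bar + m_frame    # F_j^2
```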

4.3. PnPDA

PnPDA serves as a semantic converter that projects the neighbor feature $F_j^2$ into the semantic space of the ego feature $F_i$, thereby enabling efficient domain adaptation. During the pre-training phase, PnPDA consists of three components: a semantic translator, a semantic enhancer, and a semantic calibrator. The semantic translator performs initial cross-domain alignment via knowledge distillation, the semantic enhancer strengthens the expressiveness of the teacher model, and the semantic calibrator utilizes attention mechanisms to achieve fine-grained semantic matching. In the fine-tuning phase, the semantic calibrator is removed, and the transformed features from neighboring vehicles are directly fed into the ego vehicle's fusion module as Equation (5) for downstream perception tasks. The architecture of PnPDA is illustrated in Figure 4.

4.3.1. Semantic Translator and Enhancer

The input to this module includes two heterogeneous feature maps: the ego vehicle’s feature F T and the neighboring vehicle’s feature F S . Inspired by knowledge distillation [14,23], we treat the ego feature F T = F i as the teacher and the neighbor feature F S = F j 2 as the student, where F j 2 is the meta-guided feature produced by MFEN in Section 4.2.
To align their semantics, the student feature F S is first processed by a semantic translator Φ S , which projects it into the teacher’s semantic space. To better transfer the semantic knowledge from teacher to student, we introduce a semantic enhancer Θ T , which operates on F T to strengthen its representation. At this stage, we apply a domain-aware attention mechanism to semantically project both Θ T ( F T ) and Φ S ( F S ) into a shared semantic space. This projection jointly leverages local and global attention: local attention captures fine-grained spatial semantics, while global attention is applied to the teacher feature F T to reveal contextual cues in regions that may be occluded or degraded. Such processing allows the PnPDA to obtain more informative query embeddings and enhances its capacity for downstream fusion and perception tasks.
The semantic translator and enhancer adopt the same projection structure for semantic embedding. Therefore, the projection operations for both types of features are structurally identical. This process can be formulated as Equation (13).
$$H_S = \Phi_S(F_S) \in \mathbb{R}^{H_i \times W_i \times C_i}, \qquad H_T = \Theta_T(F_T) \in \mathbb{R}^{H_i \times W_i \times C_i}. \tag{13}$$
Next, the semantic translator Φ S is updated through backpropagation by minimizing the training loss. It learns a domain-adaptive projection function that maps heterogeneous features into the ego vehicle's semantic space. In contrast, the semantic enhancer Θ T is updated indirectly based on feedback from Φ S, using the exponential moving average (EMA) strategy [24]. This allows the teacher to retain stable semantic representations of the ego features and provide more reliable and consistent guidance to the student model. Such a design improves the generalization ability of the student and mitigates training instability, thereby reducing the risk of model collapse during pre-training.
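The following sketch shows one way to realize the translator/enhancer pair, assuming both share the projection structure of Equation (13) and that the enhancer's parameters are tracked as an exponential moving average of the translator's, in the spirit of [24]; the placeholder projection layers and the decay value are assumptions rather than the paper's released configuration.

```python
import copy
import torch
import torch.nn as nn


def make_projector(channels: int) -> nn.Module:
    # Placeholder projection head; the paper only states that translator and
    # enhancer are structurally identical, not their exact layers.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=1),
    )


translator = make_projector(64)       # Phi_S: trained by backpropagation
enhancer = copy.deepcopy(translator)  # Theta_T: EMA copy, receives no gradients
for p in enhancer.parameters():
    p.requires_grad_(False)


@torch.no_grad()
def ema_update(student: nn.Module, teacher: nn.Module, decay: float = 0.996) -> None:
    # Theta_T <- decay * Theta_T + (1 - decay) * Phi_S, applied parameter-wise.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```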

4.3.2. Semantic Calibrator

To further enhance the semantic consistency between the heterogeneous features $H_S$ and $H_T$, we introduce a semantic calibrator. In a feature map, each grid cell encodes spatial information about a specific region in the real world. Accordingly, we treat the ego feature map $H_T \in \mathbb{R}^{H_i \times W_i \times C_i}$ as a collection of query vectors for cross-attention. For each spatial location $i = (x, y)$, the query vector is denoted as $Q_i = H_T(i) \in \mathbb{R}^{1 \times C_i}$. The neighbor feature map $H_S \in \mathbb{R}^{H_i \times W_i \times C_i}$ serves as both the key and value in the cross-attention mechanism.
During pre-training, we follow standard practice by appending 2D absolute positional embeddings to both the ego query $Q_i$ and the neighbor feature map $H_S$. The semantic calibrator then applies cross-attention to align each ego query with its corresponding spatial region in $H_S$. Specifically, the attended feature $P_i \in \mathbb{R}^{1 \times C_i}$ is computed by measuring the similarity between $Q_i$ and all spatial positions in $H_S$, as shown in Equation (14); an assembly operator $\Psi_i$ then reconstructs the output $H_S'$ by arranging the attended results according to the original spatial positions of the queries, as shown in Equation (15). Finally, $H_S'$ is passed through a feed-forward network (FFN) in Equation (16) to generate the calibrated feature $G_S$.
$$P_i = \mathrm{Softmax}\!\left(\frac{Q_i H_S^{\top}}{\sqrt{d_{H_S}}}\right) H_S \tag{14}$$
where $d_{H_S}$ is the dimension of the feature map $H_S$.
$$H_S' = \Psi_i^{O}(P_i) \tag{15}$$
where $O$ denotes the number of queries in the ego view.
$$G_S = \mathrm{FFN}(H_S') \tag{16}$$
Note that the semantic calibrator is only used during the pre-training stage to help the semantic translator and enhancer learn more robust and discriminative feature representations.
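A compact sketch of the semantic calibrator in Equations (14)–(16) is given below, treating every ego grid cell as a query over the flattened neighbor map via multi-head cross-attention followed by an FFN; the positional embeddings are omitted, and the head count and FFN width are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SemanticCalibrator(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.ReLU(inplace=True),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, h_t: torch.Tensor, h_s: torch.Tensor) -> torch.Tensor:
        # h_t (ego, queries) and h_s (neighbor, keys/values): (B, C, H, W).
        b, c, h, w = h_t.shape
        q = h_t.flatten(2).transpose(1, 2)   # (B, H*W, C): one query per grid cell
        kv = h_s.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Eqs. (14)-(15): cross-attention aggregates H_S for each ego query.
        attended, _ = self.attn(q, kv, kv)

        # Eq. (16): feed-forward network produces the calibrated feature G_S.
        g_s = self.ffn(attended)
        return g_s.transpose(1, 2).view(b, c, h, w)
```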

4.4. Loss

4.4.1. PnPDA Loss

During pre-training, we employ an object-wise contrastive loss to enhance semantic consistency between $H_T$ and $G_S$. We treat two pillar-level features that represent the same object as a positive pair, while those from different objects are treated as negative pairs. The objective is to maximize the similarity of positive pairs and minimize that of negative pairs. To obtain object-level representations from feature maps $H_T$ and $G_S$, we apply a sampling operation to extract the pillar features corresponding to real objects in the scene. The resulting pillar feature sets are denoted as $S_T \in \mathbb{R}^{N \times V \times C_i}$ and $S_S \in \mathbb{R}^{N \times V \times C_i}$, where N is the number of objects in the scene and V is the number of pillar features per object.
For each object in $S_T$, we average its pillar features to obtain a compact object-level embedding $\bar{S}_T \in \mathbb{R}^{N \times C_i}$. For each object n, we compute the similarity between each student pillar feature $s_S^{n,v} \in \mathbb{R}^{C_i}$ and the teacher's average representation $\bar{s}_T^{m} \in \mathbb{R}^{C_i}$ of object m. The temperature-scaled cosine similarity is given by Equation (17).
$$\varphi_{n,v,m} = \frac{s_S^{n,v} \cdot \bar{s}_T^{m}}{\tau} \tag{17}$$
where τ is a temperature hyperparameter.
Next, a cross-entropy loss is used to maximize the similarity of matched pairs $\varphi_{n,v,n}$ while minimizing that of unmatched pairs $\varphi_{n,v,m},\ m \ne n$. The object-level contrastive loss is formulated as Equation (18).
$$\mathcal{L}_{\mathrm{PnPDA}} = -\sum_{n=1}^{N} \sum_{v=1}^{V} \log \frac{\exp(\varphi_{n,v,n})}{\sum_{m=1}^{N} \exp(\varphi_{n,v,m})} \tag{18}$$
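The object-wise contrastive loss in Equations (17) and (18) can be sketched as follows. Since the text describes the score as a temperature-scaled cosine similarity, the features are L2-normalized here before the dot product, and the loss is averaged rather than summed over pillars; both choices are assumptions about details not fixed by the equations.

```python
import torch
import torch.nn.functional as F


def pnpda_contrastive_loss(
    s_student: torch.Tensor,  # S_S: (N, V, C) pillar features sampled from G_S
    s_teacher: torch.Tensor,  # S_T: (N, V, C) pillar features sampled from H_T
    tau: float = 0.07,        # temperature hyperparameter (placeholder value)
) -> torch.Tensor:
    n, v, _ = s_student.shape

    # Average teacher pillars per object -> S_bar_T in R^{N x C}, then normalize.
    t_proto = F.normalize(s_teacher.mean(dim=1), dim=-1)        # (N, C)
    s_norm = F.normalize(s_student, dim=-1)                     # (N, V, C)

    # phi_{n,v,m}: similarity of every student pillar to every teacher prototype.
    logits = torch.einsum("nvc,mc->nvm", s_norm, t_proto) / tau  # (N, V, N)

    # Eq. (18): cross-entropy where the matching object index n is the target
    # (mean reduction over pillars instead of the sum written in the equation).
    targets = torch.arange(n, device=s_student.device).unsqueeze(1).expand(n, v)
    return F.cross_entropy(logits.reshape(n * v, n), targets.reshape(n * v))
```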

4.4.2. MFEN Loss

For the frame-related meta feature M f, our goal is to learn an effective representation that captures both the spatial and temporal information of each feature frame. However, we do not define a separate loss function to directly supervise the generation of M f. Instead, after M f is fused with the original heterogeneous feature maps, supervision comes from the classification and regression losses commonly used in object detection. These task losses guide the network to focus on the information that truly matters for perception, such as object position, size, and category. In this way, the frame-related meta feature indirectly acquires the spatial-temporal cues that are useful for the perception task, without explicitly defined auxiliary loss terms. This not only simplifies the training process but also encourages the model to prioritize final task performance over the quality of intermediate representations.
For the domain-related meta feature M d, we rely on the momentum-based update strategy rather than a dedicated loss function. This allows new information from a domain to be gradually integrated into the existing domain-related meta feature while retaining the knowledge mined from that domain's historical data. Because old information is not immediately discarded but progressively updated, this approach avoids the overfitting that direct loss-based optimization might cause and helps the system adapt to environmental changes. Specifically, whenever new data arrive, MFEN adjusts M d based on the current observation and the cumulative effect of historical observations; the weight of new observations is kept small, while the weight of historical observations remains large. Consequently, short-term anomalies in the environment do not immediately affect the overall behavior of the system, enhancing its stability and reliability. In addition, since updating M d does not require computing a loss or running a full backpropagation pass, it can respond more quickly to environmental changes, which is particularly beneficial in real-time applications where response speed is critical.

5. Experiments and Discussions

5.1. Experimental Settings

  • Dataset. We conducted experiments on OPV2V [7], a large-scale benchmark for cooperative perception in autonomous driving. The dataset consists of 73 diverse driving scenarios, covering six different road types and nine distinct cities, including eight built-in CARLA environments and one highly realistic digital city reconstructed from Los Angeles. Each scenario features multiple autonomous vehicles collecting 3D point cloud data and RGB camera images with synchronized timestamps. In total, the dataset contains approximately 12,000 LiDAR frames, 48,000 RGB images, and 230,000 annotated 3D bounding boxes. OPV2V [7] provides a realistic and diverse cooperative perception setting, making it well suited for evaluating feature alignment methods in heterogeneous multi-agent settings.
  • Evaluation Metrics. We evaluate 3D object detection performance using average precision (AP) at intersection over union (IoU) thresholds of 0.5 and 0.7, where the IoU is a unitless metric ranging from 0 to 1. Following standard practice [9,11,13] in the field, evaluation is conducted within a predefined spatial region of $x \in [-140.8, 140.8]$ m and $y \in [-40, 40]$ m.
  • Experimental Designs. To simulate realistic heterogeneity, we utilize three commonly used LiDAR encoders: PointPillar [25], VoxelNet [26], and SECOND [27], each available in multiple configurations. Table 1 summarizes the detailed parameters of the heterogeneous encoders and their performance under homogeneous settings.
During the pre-training phase, we adopt a multi-domain strategy to expose the model to diverse feature distributions and enhance its adaptability to heterogeneous encoders. Specifically, we construct two distinct multi-vehicle scenarios involving diverse encoder combinations in the format of (ego–neb1–neb2). Each neighbor adopts either encoder type neb1 or neb2, allowing the ego vehicle to observe a wide range of feature variations.
  • Hetero Scenario 1 (pp8–vn4–sd1): The ego vehicle uses a PointPillar [25] encoder (pp8), while neighboring vehicles use VoxelNet [26] (vn4) or SECOND [27] (sd1).
  • Hetero Scenario 2 (pp8–vn4–pp4): The ego vehicle uses a PointPillar [25] encoder (pp8), while neighboring vehicles use VoxelNet [26] (vn4) or PointPillar [25] (pp4).
In the fine-tuning phase, we deliberately reconfigure the ego–neighbor pairings (ego–neb) to introduce previously unseen neighbor encoder types that did not appear during pre-training. The model is adapted to these new heterogeneous combinations by updating only the meta features extracted by the MFEN for the newly introduced neighbor, while the rest of the model, including the core PnPDA module, remains frozen.
Finally, in the inference phase, we evaluate the model under the same ego–neighbor pairings used during fine-tuning. This setting allows us to assess the model’s ability to generalize to novel heterogeneous configurations and maintain robust performance even when faced with unfamiliar vehicle types.
  • Comparison Methods. We compare PnPDA+ against two methods that address feature heterogeneity. Specifically, MPDA [5] serves as a strong baseline that addresses domain discrepancies via attention-based semantic adaptation. We also implement a lightweight domain adaptation baseline, called HETE, which applies a simple max-pooling layer followed by a 1 × 1 convolution for spatial and channel alignment. It should be noted that although HEAL [4] is another method capable of handling sensor heterogeneity, it requires retraining the feature encoders of all agents, which contradicts the plug-and-play principle of our method and thus is excluded from direct comparison.
  • Implementation Details. All encoders, fusion networks, and detection heads follow their original configurations. During the pre-training stage, we adopt the Adam optimizer with an initial learning rate of 0.001 and apply a learning rate decay of 0.1 at the 10th and 50th epochs, as sketched below. A similar training strategy is used during the fine-tuning stage.
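For reference, the stated optimization schedule corresponds to a setup along the following lines, where the placeholder module stands in for the trainable MFEN and PnPDA parameters (the private encoders themselves remain frozen); this is a configuration sketch, not the training script used in the experiments.

```python
import torch
import torch.nn as nn

# Placeholder for the trainable adapter parameters (MFEN + PnPDA); the private
# encoders, fusion network, and detection head are kept frozen.
adapter = nn.Conv2d(64, 64, kernel_size=1)

optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 50], gamma=0.1)

for epoch in range(60):
    # ... forward pass, task loss + L_PnPDA, loss.backward() over the training set ...
    optimizer.step()   # update the adapter parameters
    scheduler.step()   # decay the learning rate by 0.1 at epochs 10 and 50
```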

5.2. Detection Performance

Table 2 presents the 3D object detection results across three heterogeneous ego–neighbor scenarios with varying degrees of semantic and structural differences; PnPDA+ consistently achieves superior or competitive performance compared to HETE and MPDA.
In the pp8–pp4 case, the ego and neighbor share PointPillar-based encoders with minor differences in voxel size and resolution, leading to a small domain gap. Based on average AP values, PnPDA+ outperforms HETE by +9.05/+10.6 and MPDA by +3.15/+1.3 at AP@0.5/0.7, showing reliable alignment under mild heterogeneity. The pp8–sd2 scenario introduces moderate architectural differences, as SECOND incorporates sparse 3D convolutions and multiscale features. Here, PnPDA+ achieves +15.2/+17.9 gains over HETE and +3.5/+2.95 over MPDA at AP@0.5/0.7, confirming its adaptability to encoder-level shifts. The largest discrepancy occurs in pp8–vn6, where VoxelNet’s dense voxelization and stacked voxel feature encoding (VFE) layers contrast sharply with the lightweight, pillar-based design of PointPillar. Even so, PnPDA+ still achieves the highest robustness, improving over HETE by +37.8/+36.45 and MPDA by +9.3/+6.15 at AP@0.5/0.7, demonstrating strong robustness under severe heterogeneity. The consistently larger margin over HETE stems from its limited capacity, as it performs only shallow alignment via resolution normalization. In contrast, MPDA is tailored for heterogeneous adaptation with semantic-level mechanisms. Still, our method builds upon a more flexible meta-guided approach, allowing it to generalize better across varying degrees of encoder disparity.
Beyond comparisons between methods, we also examine how different training configurations (* vs. +) affect the performance of PnPDA+ in the same test scenario. In the pp8–pp4 scenario, the “+” model achieves higher accuracy than “*” by +6.1/+2.0 at AP@0.5/0.7, as expected, since pp4 was seen during pre-training and the model benefits from direct exposure. In contrast, for pp8–sd2, the “*” configuration slightly outperforms “+” by +0.8/+2.5 at AP@0.5/0.7. This may seem counterintuitive, but is understandable: the “*” model had previously seen sd1 during training, which shares architectural similarity with sd2, whereas the “+” model was only exposed to pp4 and vn4. Such results suggest that prior exposure to architecturally similar encoders can facilitate better generalization. It is also worth noting that in the pp8–vn6 scenario, PnPDA+ performs slightly lower than MPDA at AP@0.5. This is likely due to MPDA’s explicit domain-specific adapter training, whereas our method relies on a unified pre-trained model to generalize to unseen domains.
Regardless of whether the test domain has been seen during training, our method consistently achieves the accurate semantic alignment of heterogeneous features, demonstrating its strong generalization ability under varying conditions.

5.3. Ablation Study

Evaluation on MFEN components. Table 3 presents the ablation study on the core components of MFEN, including the domain-related meta feature and frame-related meta feature. The model is first trained under the heterogeneous setting "pp8–vn4–sd1", and then fine-tuned with different ego–neighbor configurations (pp8–pp4, pp8–vn4, and pp8–sd2). We evaluate four variants of the proposed MFEN module by selectively including or excluding the domain-related meta feature and frame-related meta feature to assess their impact on enabling robust generalization under heterogeneous configurations.
As shown in Table 3, the model equipped with both domain-related and frame-related meta features, along with PnPDA, consistently achieves the highest performance across all three ego–neighbor configurations. Among the two types of meta features, the domain-related meta feature has a more significant impact on generalization. For example, in the pp8–pp4 setting, excluding the domain-related meta feature leads to a drop in AP@0.5 from 54.6 to 5.7, whereas removing the frame-related meta feature still retains a relatively higher score of 28.8. Similar trends can be observed in pp8–sd2, confirming that domain-specific information plays a dominant role in adapting to structural variations among encoders. When both meta features are removed, the model’s detection performance deteriorates significantly in unseen configurations. In particular, the pp8–pp4 and pp8–sd2 settings suffer drastic performance drops, with AP@0.5 falling to 4.3 and 1.2, respectively. In contrast, the model maintains a relatively high AP@0.5 of 49.7 in the pp8–vn4 setting. This discrepancy is due to the fact that the vn4 encoder type was already seen during training, whereas both pp4 and sd2 were not. This contrast underscores the model’s reliance on prior exposure in the absence of meta-level guidance, and conversely, highlights the essential role of both meta features in enabling robust generalization to novel agents in open heterogeneous scenarios.

5.4. Visualizations

To intuitively explain how the two types of meta features influence the neighbor features, we present a visualization analysis of their effects. We use a model trained on Hetero Scenario 1 and fine-tuned in the pp8–vn6 configuration, and evaluate its performance. The visualization results are shown in Figure 5.
Figure 5a represents the original neighbor features extracted from its encoder, Figure 5b represents the original ego-vehicle features, Figure 5c shows the neighbor features after incorporating frame-related meta feature, Figure 5d shows the neighbor features after incorporating the domain-related meta feature, Figure 5e shows the semantically aligned neighbor features via semantic converter, and Figure 5f represents the final fused features of neighbor feature and ego feature by the feature fusion network.
The visualization results demonstrate that, after incorporating the frame-related meta feature, the spatial-temporal information of the neighbor features becomes more prominent, particularly enhancing the representation of other vehicles on the road. After incorporating the domain-related meta feature, the neighbor features retain domain-specific semantics, ensuring that essential structural and stylistic information is preserved. These conversions facilitate the subsequent semantic alignment and feature fusion stages, enabling PnPDA to extract more effective and comprehensive perceptual information from the neighbor features.
Thus, PnPDA+ effectively distinguishes between different domains and dynamically performs semantic alignment accordingly. In addition, the domain-related meta features preserve sufficient domain information to enhance PnPDA's cross-domain adaptation and generalization capabilities. This ensures that PnPDA+ continues to perform efficiently when encountering new domains, making it more robust and adaptable in real-world heterogeneous cooperative perception scenarios.

6. Conclusions

In this paper, we present PnPDA+, a unified and modular framework designed to address semantic inconsistencies arising from heterogeneous features in multi-vehicle cooperative perception. The proposed architecture is composed of two tightly integrated components: MFEN and PnPDA. MFEN explicitly extracts domain-related and frame-related meta features, serving as transferable priors that encode domain semantics and spatial-temporal context. These meta features guide the downstream adaptation process in PnPDA, enabling accurate and efficient semantic alignment without modifying the private perception models of each vehicle.
This design allows PnPDA+ to support the seamless integration of emerging heterogeneous vehicles with minimal fine-tuning. By incorporating a dual-level meta feature and a semantic conversion mechanism, the framework effectively mitigates the impact of domain gaps across agents while preserving plug-and-play compatibility.
Extensive experiments on OPV2V validate the effectiveness of our approach. PnPDA+ consistently improves semantic alignment, robustness, and scalability over existing baselines, thus advancing the perception capabilities of heterogeneous multi-vehicle systems. With these improvements, we anticipate that the proposed framework will have significant potential for broader applications in ITS. In particular, by enabling more accurate and reliable environmental understanding across diverse agents, PnPDA+ is expected to support the improved maneuverability of vehicle platoons, foster smoother multi-agent cooperation under dynamic and occluded conditions, and ultimately contribute to safer and more efficient road traffic.

Author Contributions

Conceptualization, L.X. and X.F.; methodology, L.X.; software, T.L.; validation, G.Z. and Z.Y.; formal analysis, T.L.; investigation, T.L. and D.W.; resources, J.L.; data curation, T.L.; writing—original draft preparation, D.W.; writing—review and editing, X.F., D.W. and L.X.; supervision, X.F.; project administration, X.F. and L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. The OPV2V datasets analyzed during the current study are available at https://mobility-lab.seas.ucla.edu/opv2v/ (accessed on 18 April 2025).

Conflicts of Interest

Authors L.X., G.Z. and Z.Y. are employed by China Unicom Smart Connection Technology Limited. The authors declare no conflicts of interest.

References

  1. Akopov, A.S.; Beklaryan, L.A.; Thakur, M. Improvement of maneuverability within a multiagent fuzzy transportation system with the use of parallel biobjective real-coded genetic algorithm. IEEE Trans. Intell. Transp. Syst. 2021, 23, 12648–12664. [Google Scholar] [CrossRef]
  2. Kamel, M.A.; Yu, X.; Zhang, Y. Formation control and coordination of multiple unmanned ground vehicles in normal and faulty situations: A review. Annu. Rev. Control 2020, 49, 128–144. [Google Scholar] [CrossRef]
  3. Hu, S.; Fang, Z.; Deng, Y.; Chen, X.; Fang, Y. Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities. arXiv 2024, arXiv:2401.01544. [Google Scholar] [CrossRef]
  4. Lu, Y.; Hu, Y.; Zhong, Y.; Wang, D.; Wang, Y.; Chen, S. An extensible framework for open heterogeneous collaborative perception. arXiv 2024, arXiv:2401.13964. [Google Scholar]
  5. Xu, R.; Li, J.; Dong, X.; Yu, H.; Ma, J. Bridging the domain gap for multi-agent perception. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 6035–6042. [Google Scholar]
  6. Luo, T.; Yuan, Q.; Luo, G.; Xia, Y.; Yang, Y.; Li, J. Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 287–303. [Google Scholar]
  7. Xu, R.; Xiang, H.; Xia, X.; Han, X.; Li, J.; Ma, J. Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2583–2589. [Google Scholar]
  8. Chen, Q.; Tang, S.; Yang, Q.; Fu, S. Cooper: Cooperative perception for connected autonomous vehicles based on 3d point clouds. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 514–524. [Google Scholar]
  9. Chen, Q.; Ma, X.; Tang, S.; Guo, J.; Yang, Q.; Fu, S. F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3D point clouds. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, Washington, DC, USA, 7–9 November 2019; pp. 88–100. [Google Scholar]
  10. Wang, T.H.; Manivasagam, S.; Liang, M.; Yang, B.; Zeng, W.; Urtasun, R. V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 605–621. [Google Scholar]
  11. Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.H.; Ma, J. V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer. In Proceedings of the 17th European Conference on Computer Vision, ECCV 2022; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2022; pp. 107–124. [Google Scholar]
  12. Xu, R.; Tu, Z.; Xiang, H.; Shao, W.; Zhou, B.; Ma, J. CoBEVT: Cooperative Bird’s Eye View Semantic Segmentation with Sparse Transformers. In Proceedings of the Conference on Robot Learning, PMLR, Atlanta, GA, USA, 6–9 November 2023; pp. 989–1000. [Google Scholar]
  13. Hu, Y.; Fang, S.; Lei, Z.; Zhong, Y.; Chen, S. Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Adv. Neural Inf. Process. Syst. 2022, 35, 4874–4886. [Google Scholar]
  14. Li, Y.; Ren, S.; Wu, P.; Chen, S.; Feng, C.; Zhang, W. Learning distilled collaboration graph for multi-agent perception. Adv. Neural Inf. Process. Syst. 2021, 34, 29541–29552. [Google Scholar]
  15. Xiang, H.; Xu, R.; Ma, J. HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer vision, Paris, France, 2–3 October 2023; pp. 284–295. [Google Scholar]
  16. Chen, Z.; Shi, Y.; Jia, J. Transiff: An instance-level feature fusion framework for vehicle-infrastructure cooperative 3d detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18205–18214. [Google Scholar]
  17. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  18. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. arXiv 2017, arXiv:1703.05175. [Google Scholar]
  19. Javed, K.; White, M. Meta-learning representations for continual learning. arXiv 2019, arXiv:1905.12588. [Google Scholar]
  20. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825. [Google Scholar]
  21. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  22. Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18082–18091. [Google Scholar]
  23. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  24. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  25. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  26. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  27. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Comparison of different strategies for heterogeneous multi-vehicle cooperative perception. (a) Retraining the encoder; (b) One-to-One Adapter; and (c) Ours.
Figure 2. The overall architecture of PnPDA+. The plug-and-play design of PnPDA+ enables the scalable integration of heterogeneous agents by injecting a lightweight domain adapter module. It consists of two key components: MFEN, which extracts domain-aware and frame-aware meta features, and PnPDA, which performs adaptive semantic alignment to transform heterogeneous features into a unified representation for fusion.
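To make the data flow of Figure 2 concrete, the following PyTorch-style sketch traces one cooperative frame through the adapter: a received heterogeneous neighbor feature passes through MFEN to obtain meta features, PnPDA converts it into the ego semantic space, and the result is fused with the ego feature while the frozen perception models remain untouched. The layer configurations, tensor shapes, and the max-based fusion operator are illustrative assumptions, not the implementation used in the paper.

```python
# Minimal sketch of the PnPDA+ inference flow described in Figure 2.
# Module interiors, channel counts, and the fusion operator are assumptions.
import torch
import torch.nn as nn


class MFEN(nn.Module):
    """Extracts domain-related and frame-related meta features from a
    received neighbor feature map (hypothetical layer configuration)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.domain_branch = nn.Conv2d(channels, channels, kernel_size=1)
        self.frame_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_neighbor: torch.Tensor):
        domain_meta = self.domain_branch(f_neighbor)   # style/domain prior
        frame_meta = self.frame_branch(f_neighbor)     # spatial-temporal prior
        return domain_meta, frame_meta


class PnPDA(nn.Module):
    """Converts a heterogeneous neighbor feature into the ego semantic
    space, guided by the meta features (assumed concatenation-based conditioning)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.converter = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, f_neighbor, domain_meta, frame_meta):
        guided = torch.cat([f_neighbor, domain_meta, frame_meta], dim=1)
        return self.converter(guided)


# Frozen perception models stay untouched: only the adapter modules are trained.
mfen, adapter = MFEN(), PnPDA()
f_ego = torch.randn(1, 64, 100, 252)        # ego BEV feature (shape assumed)
f_neighbor = torch.randn(1, 64, 100, 252)   # received heterogeneous feature
domain_meta, frame_meta = mfen(f_neighbor)
f_aligned = adapter(f_neighbor, domain_meta, frame_meta)
f_fused = torch.maximum(f_ego, f_aligned)   # element-wise max fusion as a stand-in
```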
Figure 3. MFEN architecture. MFEN extracts domain-related and frame-related meta features from neighbor features. The domain feature encodes style-specific patterns via a momentum-updated feature library, while the frame feature captures spatial-temporal cues. These meta features are applied to the neighbor feature F_j^1, yielding the semantically enriched domain-specific representation F_j^2.
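As a rough illustration of the momentum-updated feature library mentioned in the Figure 3 caption, the sketch below keeps one style prototype per agent domain and blends newly received features into it with an exponential moving average. The library size, momentum value, pooling step, and class interface are assumptions for illustration, not the paper's settings.

```python
# Hedged sketch of a momentum-updated domain feature library.
import torch


class DomainFeatureLibrary:
    def __init__(self, num_domains: int, channels: int, momentum: float = 0.99):
        self.momentum = momentum
        self.library = torch.zeros(num_domains, channels)  # one prototype per domain

    @torch.no_grad()
    def update(self, domain_id: int, feature_map: torch.Tensor):
        # Global-average-pool the incoming BEV feature to a style vector,
        # then blend it into the stored prototype with exponential momentum.
        style_vec = feature_map.mean(dim=(-2, -1)).squeeze(0)
        self.library[domain_id] = (
            self.momentum * self.library[domain_id]
            + (1.0 - self.momentum) * style_vec
        )

    def lookup(self, domain_id: int) -> torch.Tensor:
        return self.library[domain_id]


lib = DomainFeatureLibrary(num_domains=3, channels=64)
f_j1 = torch.randn(1, 64, 100, 252)    # neighbor feature F_j^1 (shape assumed)
lib.update(domain_id=1, feature_map=f_j1)
domain_meta = lib.lookup(domain_id=1)  # domain-related meta feature for this agent type
```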
Figure 4. PnPDA architecture.
Figure 5. Meta feature extraction visualization: from raw to aligned features. (a) Neighbor heterogeneous feature. (b) Ego feature. (c) Neighbor + frame-related meta feature. (d) Neighbor + domain-related meta feature. (e) Transformed neighbor feature. (f) Fused feature.
Table 1. Detailed parameters of heterogeneous encoders. Each encoder is evaluated by the detection accuracy it achieves under the same scene configuration, using AP at IoU thresholds of 0.5 and 0.7 as metrics.
| Encoder | Type | Voxel Resolution (m) | 2D/3D Layers | Half Lidar Cropping Range [x, y] (m) | AP@0.5/AP@0.7 |
|---|---|---|---|---|---|
| PointPillar [25] | pp8 | 0.8, 0.8, 0.8 | 4/0 | [140.8, 64.8] | 85.9/62.9 |
| | pp6 | 0.6, 0.6, 0.4 | 19/0 | [153.6, 38.4] | 86.7/68.7 |
| | pp4 | 0.4, 0.4, 0.4 | 19/0 | [140.8, 64.8] | 87.2/77.7 |
| VoxelNet [26] | vn6 | 0.6, 0.6, 0.4 | 0/3 | [153.6, 38.4] | 85.7/71.2 |
| | vn4 | 0.4, 0.4, 0.4 | 0/3 | [140.8, 64.8] | 85.5/77.8 |
| SECOND [27] | sd2 | 0.2, 0.2, 0.4 | 12/12 | [140.8, 64.8] | 84.0/64.8 |
| | sd1 | 0.1, 0.1, 0.4 | 13/13 | [140.8, 64.8] | 65.1/52.9 |
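The heterogeneity summarized in Table 1 can be expressed as per-agent encoder configurations; the snippet below mirrors a few of the table's rows in a hypothetical config dictionary to show where the domain gaps (voxel resolution, layer depth, cropping range) originate. The key names and schema are illustrative, not the ones used in the authors' codebase.

```python
# Illustrative per-agent encoder configurations mirroring Table 1 (schema assumed).
HETEROGENEOUS_ENCODERS = {
    "pp8": {"backbone": "PointPillar", "voxel_size": [0.8, 0.8, 0.8],
            "layers_2d": 4, "layers_3d": 0, "crop_range_xy": [140.8, 64.8]},
    "pp4": {"backbone": "PointPillar", "voxel_size": [0.4, 0.4, 0.4],
            "layers_2d": 19, "layers_3d": 0, "crop_range_xy": [140.8, 64.8]},
    "vn4": {"backbone": "VoxelNet", "voxel_size": [0.4, 0.4, 0.4],
            "layers_2d": 0, "layers_3d": 3, "crop_range_xy": [140.8, 64.8]},
    "sd2": {"backbone": "SECOND", "voxel_size": [0.2, 0.2, 0.4],
            "layers_2d": 12, "layers_3d": 12, "crop_range_xy": [140.8, 64.8]},
}


def describe(name: str) -> str:
    """Summarize one encoder variant from the (hypothetical) config table."""
    cfg = HETEROGENEOUS_ENCODERS[name]
    return (f"{name}: {cfg['backbone']} with voxel size {cfg['voxel_size']} m "
            f"and cropping range {cfg['crop_range_xy']} m")


print(describe("vn4"))
```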
Table 2. Comparison with HETE and MPDA under two heterogeneous training scenarios for collaborative 3D object detection on the OPV2V dataset, evaluated using AP@0.5 and AP@0.7. PnPDA+ is pre-trained under two heterogeneous combinations (in the format of “ego–neb1–neb2”): “*” indicates training on “pp8–vn4–sd1” as Hetero Scenario 1, and “+” indicates training on “pp8–vn4–pp4” as Hetero Scenario 2. All methods are evaluated under new “ego–neb” combinations to assess the generalization to unseen vehicle types in heterogeneous cooperative perception.
| Scenario | PnPDA+ | HETE | MPDA |
|---|---|---|---|
| PP8–PP4 * | 54.6/40.5 | 44.0/23.0 | 54.5/40.2 |
| PP8–PP4 + | 60.7/42.5 | 53.2/38.8 | |
| PP8–SD2 * | 56.2/41.7 | 54.1/39.0 | 53.4/38.2 |
| PP8–SD2 + | 55.4/39.2 | 27.1/6.1 | |
| PP8–VN6 * | 52.5/38.1 | 10.5/1.2 | 52.9/38.0 |
| PP8–VN6 + | 53.7/39.3 | 20.1/3.3 | |
Note: The best results are highlighted in bold.
Table 3. Evaluation of the effectiveness of components in MFEN under Hetero Scenario 1. The model is trained on the hetero setting “pp8–vn4–sd1”, then fine-tuned with different “ego–neb” configurations, and evaluated using AP@0.5 and AP@0.7.
| PnPDA | Domain-Related Meta Feature | Frame-Related Meta Feature | pp8–pp4 | pp8–vn4 | pp8–sd2 |
|---|---|---|---|---|---|
| | | | 54.6/40.5 | 63.7/44.0 | 55.4/39.2 |
| | | | 28.8/19.3 | 50.0/36.0 | 53.6/37.3 |
| | | | 5.7/1.7 | 58.2/44.6 | 3.5/2.0 |
| | | | 4.3/1.1 | 49.7/21.5 | 1.2/0.5 |
Note: The best results are highlighted in bold.
