Applied Sciences
  • Article
  • Open Access

28 November 2025

Vehicle–Road–Cloud Collaborative Perception: Resource and Intelligence Optimization

1 China Unicom Smart Connection Technology Limited, Beijing 101599, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
This article belongs to the Topic Unmanned Vehicles Technology and Embodied Intelligence Systems for Intelligent Transportation

Abstract

Vehicle–road–cloud collaborative perception improves perception performance via multi-agent information sharing and data fusion, but it faces coupled trade-offs among perception accuracy, computing resources, and communication bandwidth. Optimizing agents’ intelligence or the underlying resources alone fails to resolve this conflict, limiting collaboration efficiency. We propose C4I-JO, a joint resource and intelligence optimization method for vehicle–road–cloud collaborative perception. We employ slimmable networks to achieve intelligent elasticity. Building on this, C4I-JO jointly optimizes four key dimensions to minimize resource consumption while meeting accuracy and latency constraints: collaborative mechanisms to cut redundant communication, resource allocation to avoid supply–demand bottlenecks, intelligent elasticity to balance accuracy and resources, and computation offloading to reduce local burden. We propose a two-layer iterative decoupling algorithm to solve the resulting optimization problem. Specifically, the outer layer utilizes a Genetic Algorithm (GA), while the inner layer leverages Second-Order Cone Programming (SOCP) solved with the interior-point method. Simulations show that C4I-JO outperforms baselines in both resource efficiency and perception quality.

1. Introduction

Vehicle–road–cloud collaborative perception is a key enabler of autonomous driving in complex traffic scenarios, as it significantly improves environmental perception by leveraging multi-agent information sharing and fusion [1,2,3]. While collaboration overcomes the limitations of single-vehicle sensors, such as occlusions and limited field of view, it simultaneously imposes coupled and stringent demands on system resources, specifically requiring high-performance computing power and substantial, low-latency communication bandwidth.
Existing research predominantly addresses these demands in isolation. Most works focus on optimizing communication protocols to reduce bandwidth usage [4,5] or managing edge computing resource allocation for fixed tasks [6]. However, a critical gap persists: Current methods often assume a static AI model structure. They lack “intelligent elasticity”—the ability to dynamically adjust the neural network’s complexity in response to real-time channel fluctuations and processing loads. Optimizing the underlying infrastructure without enabling the application layer to adapt restricts overall system efficiency and performance under dynamic V2X conditions.
To address this fundamental limitation, we introduce a joint resource and intelligence optimization method for vehicle–road–cloud collaborative perception, named C4I-JO. This method achieves joint optimization across multiple dimensions: collaborative mechanism, communication resource allocation, computing resource allocation, computation offloading, and intelligent elasticity. We construct an adaptive computing framework in which a slimmable network [7] is employed to achieve intelligent elasticity and meet the demands of perception computing. Meanwhile, we employ an accuracy-aware coordination method to minimize system resource consumption while satisfying accuracy and delay constraints.
In conclusion, we present the following contributions:
  • We use a slimmable network to construct an adaptive computing framework, which enables intelligent elasticity, thereby meeting the computational demands of perception tasks.
  • We propose an accuracy-aware collaborative mechanism. By leveraging deep supervised learning, a nonlinear mapping is established between perception accuracy, network structure, and data density.
  • We integrate MEC technology and utilize computation offloading technology to achieve the flexible offloading of computing tasks, thereby improving the overall computing efficiency.
  • We design and solve a multi-objective optimization problem that comprehensively considers a collaborative mechanism, resource allocation, intelligent elasticity, and computation offloading, aiming to minimize overall resource consumption while satisfying accuracy and delay constraints. Unlike traditional V2X approaches that treat the AI model as a static black box and only optimize transmission resources, our method uniquely introduces “intelligent elasticity.” By jointly optimizing the collaborative mechanism, resource allocation, and the internal structure of the slimmable network, we achieve a superior Pareto frontier between perception accuracy and latency, adapting to dynamic constraints that fixed-model approaches cannot handle.

2. Related Work

2.1. Collaborative Perception

As a core application scenario of multi-agent systems in the perception module of autonomous driving, collaborative perception effectively addresses the problems of sensor occlusion and limited perception range that afflict traditional single-agent perception through information sharing between CAVs and infrastructure, and it has become a research focus in the realm of intelligent transportation systems (ITSs). Based on the stage at which information is fused, existing collaborative perception frameworks can be systematically divided into three categories: early fusion, intermediate fusion, and late fusion. Early fusion emphasizes the sharing of raw data, late fusion concentrates on decision-level integration, while intermediate fusion improves perception capabilities through feature-level interaction. Among them, the intermediate collaboration mechanism has become a mainstream research direction due to its advantage in balancing performance and computational efficiency, and it has shown significant potential for technological breakthroughs in recent years. Within the methodological system of intermediate collaboration, V2VNet [4] is the first to introduce a spatially aware graph neural network (GNN) and constructs a topological modeling framework for inter-agent communication. Although this method improves collaborative performance through the GNN, its scalar collaborative weight mechanism struggles to accurately characterize heterogeneous attention in the spatial domain. To address this limitation, DiscoNet [5] proposes a matrix-valued edge weight modeling scheme, which captures fine-grained dependencies in inter-agent interactions through high-resolution attention maps and combines a teacher–student knowledge distillation framework to achieve the collaborative optimization of early–intermediate fusion, further advancing the performance boundary. AttFusion [8] adopts an encoder architecture, realizing feature compression and semantic preservation through multi-level 2D convolution and max-pooling operations. In terms of communication mechanism design, Who2com [9] establishes the first communication protocol under bandwidth constraints, enabling efficient information interaction through a three-stage handshaking process. Its follow-up work, When2com [10], introduces a scaled universal attention mechanism and constructs a dynamic communication method based on information completeness evaluation, allowing agents to initiate collaboration only when perception is insufficient, thus significantly reducing resource consumption. Where2comm [11] further proposes a spatial confidence-aware communication mechanism, which determines key perception regions by generating a spatial confidence map, realizing the collaborative optimization of feature selection and communication objectives and greatly saving bandwidth resources while ensuring perception accuracy.
Based on the category of participating agents, collaborative perception can be classified into vehicle-to-vehicle (V2V) collaboration, vehicle-to-infrastructure (V2I) collaboration, and vehicle-to-everything (V2X) collaboration, which involves more types of agents. Compared with V2V collaboration, V2I collaboration has the potential to provide more reliable collaborative information due to the stable communication capability and global observation perspective of infrastructure; however, related research is still in its initial stage. V2X-ViT [12] first achieves the unified modeling of V2V and V2I. It learns the interaction relationships between different types of agents through a heterogeneous multi-agent self-attention model and combines a multi-scale window self-attention mechanism to capture long-range spatial dependencies in high-resolution scenarios. VIMI (Vehicle-Infrastructure Multi-View Intermediate Fusion) [13] focuses on camera-based collaborative perception. It realizes cross-scale correlations between vehicle and infrastructure features via a multi-scale cross-attention module, where a deformable convolutional network is used for multi-scale feature extraction, and cross-attention operations generate attention weights for each scale.
While these mechanisms effectively filter irrelevant data to save bandwidth, they all rely on fixed-structure perception models. Our approach introduces the dimension of intelligent elasticity by dynamically adjusting the perception model’s complexity.

2.2. Resource Allocation

The core technology evolution of V2X resource allocation revolves around two main threads: dynamic scheduling of communication resources and efficient utilization of computing resources. Existing studies have achieved phased breakthroughs in addressing challenges such as high mobility and heterogeneous demands, yet key gaps remain in scenario adaptation and collaborative optimization.
In terms of communication resource allocation, Xu et al. [14] propose a PPO algorithm-driven joint power–bandwidth allocation scheme. Through the collaborative scheduling of discrete and continuous resources, this scheme achieves a balance between system capacity and delay constraints in dynamic scenarios for the first time; Jiang et al. [15] integrate NOMA technology with deep reinforcement learning (DRL) to construct an energy efficiency optimization model. This model addresses the issue of low resource utilization in traditional NOMA technology, improving system energy efficiency by more than 30%. However, these studies do not fully consider the interference coupling relationship under complex topologies. Liu et al. [16] introduce digital twin technology to build a virtual mapping space and optimize the linked configuration of communication-caching resources through edge node collaboration, reducing transmission delay by 25%. Nevertheless, the load balancing problem in high-density scenarios remains unsolved. Liu et al. [17] combine integrated sensing and communications technology to design a DRL-driven joint channel–power–bandwidth allocation method, which achieves a 15–20% improvement in spectral efficiency in spectrum-scarce scenarios. Ji et al. [18] innovatively integrate GNN with DRL, modeling the interference relationship of V2X links through dynamic graphs. This approach increases the V2V transmission success rate, but slot collision issues still exist in extreme scenarios such as complex urban intersections.
In terms of computing resource allocation, edge computing-based collaborative perception resource allocation with roadside units (RSUs) as core nodes is a key research direction for balancing the real-time performance, accuracy, and resource efficiency of autonomous driving systems. Moreno et al. [19] propose a collaborative ITS architecture based on distributed RSUs, which enables efficient communication between vehicles and infrastructure through collaboration among RSUs. To address the multi-task resource competition problem faced by RSUs, Zobaed et al. [20] propose the “Edge-multi AI” framework, which ensures the efficient execution of multiple artificial intelligence models on edge nodes through the dynamic allocation of resources and the scheduling of tasks. Gu et al. [21] combine federated self-supervised learning with DRL to design a task offloading and resource allocation method for vehicle edge computing. Ye et al. [6] study an “accuracy-aware” collaborative perception and computing framework for CAVs. By sharing sensor data and computing resources among vehicles, this framework balances perception accuracy and computing overhead, ensuring reliable decision-making in complex traffic environments.
However, these resource allocation strategies are typically decoupled from the deep learning model’s internal computational complexity. Our C4I-JO method bridges this gap by making the resource allocation decision directly dependent on the dynamically chosen network width ($w_q$).

2.3. Slimmable Network

As a dynamic resource adaptation framework, slimmable network achieves a refined trade-off between performance and computational efficiency through adaptively adjusting the network width. Its core value lies in replacing the traditional independent training paradigm of multiple models with a single shared-weight model, which reduces the training and deployment costs in multi-resource scenarios. To address the inefficiency of independent model training required in traditional multi-resource scenarios, Yu et al. [7] propose the core framework of slimmable networks, enabling multi-width shared training of a single network through switchable batch normalization. To expand the generality of the basic architecture, Yu et al. [22] propose a universal slimmable network (US-Net), which supports arbitrary width configurations and is compatible with networks without batch normalization. They innovatively design a sandwich rule and an in situ distillation method to solve the performance fluctuation problem in multi-width training. To tackle the issue that static width cannot adapt to the dynamic characteristics of inputs, Li et al. [23] propose a dynamic slimmable network (DS-Net), realizing real-time input-based width adjustment through a double-headed dynamic gate and verifying the efficiency advantages in dynamic scenarios. Wang et al. [24] address the high training cost of multi-resource adaptation in self-supervised learning by proposing DSP-Net. This model combines the slimmable concept with discriminative self-supervision and realizes the generation of multi-width subnets through a single pretraining process via in situ ensemble guidance and sandwich gate sparsification. Zhao et al. [25] tackle the problem of the poor performance of small models in contrastive self-supervision by introducing SlimCLR. It enhances the self-supervised performance of multi-width subnets through methods such as subnetwork slow start, online distillation, and loss weighting. For the memory constraints in MobileNet deployment, Bouguezzi et al. [26] propose Slim MobileNet. This model achieves a lightweight model and accuracy improvement by reducing the number of layers, replacing activation functions, and optimizing convolutional structures. Elminshawi et al. [27] integrate the slimmable mechanism into speech separation tasks, putting forward Slim-TasNet and a dynamic model. These two solutions respectively realize dynamic width adjustment based on speech features through an input utilization factor and a gating mechanism, thereby reducing computational complexity. Yu et al. [28] aim to improve the efficiency of channel pruning and propose AutoSlim. It uses slimmable networks to evaluate the performance of multi-channel configurations, iteratively prunes redundant channels based on prediction results, and converts the multi-width evaluation capability into a decision basis for automated pruning.

3. System Model

3.1. Vehicle–Road–Cloud Collaborative Perception Scenario

To improve readability and clarity, the key mathematical notations used in this paper are summarized in Table 1.
Table 1. Key notations and definitions.
As shown in Figure 1, we investigate a vehicle–road–cloud collaborative perception scenario on urban roads. This scenario consists of $Q$ CAVs equipped with sensors, one RSU with both sensing and computing capabilities, and one MEC unit with computing capabilities, where $\mathcal{Q} = \{0, 1, \ldots, Q-1\}$ denotes the set of CAVs.
Figure 1. Vehicle–road–cloud collaborative perception scenario.
CAVs and the RSU communicate via C-V2X, while the RSU and the MEC communicate through optical fibers. Specifically, the perception task is initiated by the RSU, along with $Q$ assisting CAVs capable of collaborative perception. Let $\mathcal{Q}^+ = \mathcal{Q} \cup \{Q\}$ represent the set of assisting nodes, which includes all CAVs and the RSU, and let $q \in \mathcal{Q}^+$ denote the index of an assisting node.
Assisting nodes perform perception tasks within a region of interest (ROI), detecting $G$ objects. Let $\mathcal{G} = \{0, 1, \ldots, G-1\}$ denote the set of objects to be perceived by assisting nodes, where $g \in \mathcal{G}$ represents the index of each object, and $\phi_{q,g}$ denotes the detection task of the $q$-th assisting node for object $g$. The perception tasks of assisting nodes include detecting the presence and positions of all $G$ objects and estimating the category of each individual detected object $g$.

3.2. Perception Data

Each CAV and the RSU are capable of scanning their surrounding environment. For every scan, the LiDAR sensor of the $q$-th assisting node produces a set of 3D point clouds as raw perception data, represented by the set $P_q = \{(x_q^i, y_q^i, z_q^i)\}$, where $(x_q^i, y_q^i, z_q^i)$ represents the 3D coordinates of the $i$-th point within the surroundings.
All assisting nodes share the same global coordinate system, since the local coordinate systems can be aligned via a coordinate mapping. The position of object $g \in \mathcal{G}$ is represented by its 3D bounding box, denoted as a six-dimensional vector $c_g = (x_g, y_g, z_g, b_g^{(x)}, b_g^{(y)}, b_g^{(z)})$. This vector encodes the center coordinates and the length, width, and height of the 3D bounding box of object $g$.
For each object $g$, the raw perception data $P_q(g)$ that assisting node $q$ captures around it can be extracted from the raw LiDAR data $P_q$.

3.3. Architecture

Bounding box detection refers to detecting whether there are objects within the ROI area of the RSU, and each object has a corresponding bounding box estimation. We adopt a two-stage collaborative perception method that combines late fusion and intermediate fusion. Late Coarse Fusion Stage: All CAVs and the RSU utilize a compact bounding box detection method to assess the 3D bounding box parameters of each potential object. Intermediate Fine Fusion Stage: Each assisting collaborative node then sends the detected bounding boxes together with intermediate features to the RSU for aggregation, thereby achieving more accurate bounding box detection and object classification results. The first stage is lightweight and involves very little communication and computing resources; therefore, we focus on the optimization of the second stage.
In the intermediate fine fusion stage, object detection based on shared bird’s-eye view (BEV) features usually uses an AI model to estimate the category probability of each object. The AI model requires intermediate feature transmission and fusion during the object detection process. Therefore, the entire process is divided into two sub-stages: the intermediate feature generation sub-stage and the feature fusion sub-stage, as shown in Figure 2. In the intermediate feature generation phase, the AI model leverages a slimmable network to dynamically adjust its structure based on the constraints of edge noise conditions, transforming point cloud data into intermediate features, thereby optimizing the efficiency of the model.
Figure 2. Perception architecture.
Intuitively, the slimmable network acts as a “gearbox” for perception data transmission in V2X scenarios, allowing for real-time adaptation. For instance, in an ideal line-of-sight (LOS) condition with abundant bandwidth, the network uses a “full-width” ( w q = 1.0 ) configuration to maximize accuracy. Conversely, in a congested intersection where bandwidth is scarce, the system instantly switches to a “slim” configuration (e.g., w q = 0.25 ). This elasticity significantly reduces the transmission data volume, preventing critical latency violations and packet loss, which is essential for maintaining safety in dynamic channel conditions.
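As a minimal illustration of this behaviour, the sketch below maps an estimated V2I data rate to one of the discrete width ratios used later in the experiments ({0.25, 0.5, 0.75, 1.0}). The threshold logic, the function name, and the assumption that the feature payload scales linearly with width are illustrative assumptions; in C4I-JO the width is actually chosen by the joint optimizer described in Section 4.

```python
# Hedged sketch: a rule-of-thumb width selector illustrating "intelligent
# elasticity". The scaling assumption is for illustration only; in C4I-JO the
# width w_q is decided by the joint optimizer, not by a fixed rule.
WIDTHS = [0.25, 0.5, 0.75, 1.0]  # discrete slimmable widths used in Section 5

def select_width(est_rate_mbps: float, feature_mb_at_full_width: float,
                 latency_budget_ms: float = 20.0) -> float:
    """Pick the largest width whose (assumed width-proportional) feature
    payload can be transmitted within the latency budget at the given rate."""
    for w in reversed(WIDTHS):                        # try the widest model first
        payload_mb = feature_mb_at_full_width * w     # assumed linear scaling
        tx_ms = payload_mb * 8 / est_rate_mbps * 1e3  # MB -> Mbit -> seconds -> ms
        if tx_ms <= latency_budget_ms:
            return w
    return WIDTHS[0]                                  # fall back to the slimmest model

print(select_width(est_rate_mbps=50.0, feature_mb_at_full_width=0.4))  # -> 0.25
```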
Let $W = \{w_q, q \in \mathcal{Q}^+\}$ represent the width ratios of the slimmable networks of the assisting nodes in the system. Network width refers to the number of channels per layer, a key determinant of a model’s computational complexity and representational capacity. The network width ratio is the fraction of these channels that are actually engaged in computation. The RSU can offload detection tasks to the MEC. Let $U$ be a binary decision matrix, where $u_g = 1$ indicates that the detection task for object $g$ is offloaded to the MEC for computation. In the feature fusion sub-stage, all collaborative nodes participating in collaborative perception transmit their perceived BEV features to the RSU. If $u_g = 1$, the RSU then transmits these features to the MEC for the corresponding computation.
In object detection for collaborative perception, the BEV features of potential object $g$ generated by assisting node $q$ are affected by the network width parameter $w_q$ of the feature generation network $f_1$. Therefore, the BEV feature generation relationship is as follows:
$F(g,q) = f_1(P_q(g), w_q)$
where $F(g,q)$ denotes the BEV features of potential object $g$ generated by assisting node $q$.
Meanwhile, to measure the quality of the BEV features $F(g,q)$, we propose a data quality evaluator $f_2$. Specifically, we model the quality of the BEV features using the density of the raw point cloud data and the width of the slimmable network. This design offers advantages in two key aspects. First, the density of point clouds directly reflects the granularity of perception data. A higher density provides richer spatial information, which helps capture subtle differences in spatial distribution and enhances the geometric completeness of BEV features. Second, the width of the network affects the representational capacity of the AI model. Therefore, we extract object-aware sensing data $P_q(g)$ from the raw point cloud data of assisting node $q$ and combine it with the slimmable network width $w_q$ to construct a quality evaluation metric for $F(g,q)$. This framework not only improves the generalization ability of AI models in diverse scenarios but also provides a more precise and controllable quality prior for subsequent vehicle–road–cloud cooperation. The relationship is given as follows:
$Z(g,q) = f_2(P_q(g), w_q)$
where $Z(g,q)$ represents the data quality of $F(g,q)$ and is a real-valued quantity.
In the object detection process, up to $Q+1$ assisting nodes ($Q$ CAVs and one RSU) can participate in collaborative perception for object $g$. Let $S$ denote the binary decision matrix: if $s_{q,g} = 1$ and $q \in \mathcal{Q}$, assisting node $q$ can perceive object $g$ and conduct collaborative perception with the RSU; if $s_{q,g} = 1$ and $q = Q$, the RSU can independently perceive object $g$. Therefore, the total perception data $F_g$ is the set of perception data $F(g,q)$ from all participating assisting nodes involved in the collaboration:
$F_g = \bigcup_{q \in \mathcal{Q}^+} s_{q,g} F(g,q)$
Similarly, the data quality $Z_g$ is defined as follows:
$Z_g = \sum_{q \in \mathcal{Q}^+} s_{q,g} Z(g,q)$
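As a minimal, runnable sketch of this data flow, the snippet below generates per-object features, scores their quality, and aggregates them according to the selection matrix $S$. The encoder $f_1$ and the quality evaluator $f_2$ are toy placeholders introduced here for illustration; the paper implements them with a slimmable BEV network and a learned evaluator.

```python
# Hedged sketch: per-object feature generation, quality scoring, and
# collaborative aggregation driven by the selection matrix S. f1/f2 are
# placeholder stand-ins (assumptions), not the paper's actual networks.
import numpy as np

def f1(points: np.ndarray, w_q: float) -> np.ndarray:
    """Placeholder BEV feature: pooled point statistics scaled by the width."""
    return np.concatenate([points.mean(axis=0), points.std(axis=0)]) * w_q

def f2(points: np.ndarray, w_q: float) -> float:
    """Placeholder quality score: grows with point density and network width."""
    return float(np.log1p(len(points)) * w_q)

def aggregate(P, S, W):
    """P[q][g]: points of object g seen by node q; S[q, g]: selection matrix;
    W[q]: slimmable width. Returns per-object feature sets F and qualities Z."""
    n_nodes, n_objects = S.shape
    F, Z = {}, np.zeros(n_objects)
    for g in range(n_objects):
        F[g] = [f1(P[q][g], W[q]) for q in range(n_nodes) if S[q, g]]
        Z[g] = sum(f2(P[q][g], W[q]) for q in range(n_nodes) if S[q, g])
    return F, Z

rng = np.random.default_rng(0)
P = [[rng.random((40, 3)) for _ in range(2)] for _ in range(3)]  # 3 nodes x 2 objects
S = np.array([[1, 0], [1, 1], [0, 1]])
F, Z = aggregate(P, S, W=[0.5, 1.0, 0.25])
print(Z)
```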

3.4. Computation Model

In the point cloud processing of object detection AI models, the overall computational demand increases with the volume of input data, thereby imposing higher requirements on computational resources. Let $f_q$ denote the available computational resources of collaborative node $q$ and $f_{q'}$ denote the available computational resources of computational node $q'$ ($q' \in \{Q, Q+1\}$), both measured in Hertz (Hz). Define $\alpha^1 = \{\alpha_q^1, q \in \mathcal{Q}^+\}$ as the proportions of computational resources allocated by the collaborative nodes in the intermediate feature generation stage, and $\alpha^2 = \{\alpha_{q'}^2, q' \in \{Q, Q+1\}\}$ as the proportions of computational resources allocated by the computational nodes in the feature fusion stage. Parameters $\epsilon_1$ and $\epsilon_2$ represent the complexity of the AI model in these two stages (in cycles per bit), i.e., the average number of CPU cycles required to process each bit of sensory data. For each task $\phi_{q,g}$, the computational demand at each stage is given as follows:
$e_{g,q}^1 = \epsilon_1 \cdot |F(g,q)|$
$e_g^2 = \epsilon_2 \cdot |F_g|$
where $|\cdot|$ denotes the size of the data. Therefore, the total computational demands of collaborative node $q$ and computational node $q'$ in their respective stages are denoted as $e_q^1$ and $e_{q'}^2$:
$e_q^1 = \sum_{g \in \mathcal{G}} e_{g,q}^1$
$e_{q'}^2 = \sum_{g \in \mathcal{G}} u_g \cdot e_g^2$
For each feature fusion task, data processing starts only after the data transmitted by the assisting collaborative nodes have been received. For simplicity of exposition, it is assumed that the RSU (or MEC) has received all required data, after which the AI model deployed on it begins processing. The computation times of the two stages on collaborative node $q$ and computational node $q'$ are denoted as $t_q^1$ and $t_{q'}^2$, respectively, and are calculated as follows:
$t_q^1 = \dfrac{e_q^1}{\alpha_q^1 \cdot f_q}$
$t_{q'}^2 = \dfrac{e_{q'}^2}{\alpha_{q'}^2 \cdot f_{q'}}$
Meanwhile, collaborative node $q$ and computational node $q'$ incur cumulative computing resource consumption $\eta_q^1$ and $\eta_{q'}^2$ in their respective stages. Since the consumption of computing resources varies with time during task processing, we assume the following consumption pattern:
$\eta_q^1 = t_q^1 \cdot k_e \cdot (\alpha_q^1 \cdot f_q)^3 = e_q^1 \cdot k_e \cdot (\alpha_q^1 \cdot f_q)^2$
$\eta_{q'}^2 = t_{q'}^2 \cdot k_e \cdot (\alpha_{q'}^2 \cdot f_{q'})^3 = e_{q'}^2 \cdot k_e \cdot (\alpha_{q'}^2 \cdot f_{q'})^2$
where $k_e$ represents the effective switching capacitance of the CPU on each node.
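A minimal numeric sketch of the computation model for one node and one stage is given below. The data volume, cycles-per-bit value, CPU frequency, and switching capacitance are illustrative assumptions, not the values used in the paper’s simulations (see Table 2 for those).

```python
# Hedged numeric sketch of the computation model above: demand, processing
# time, and energy for one stage on one node. All parameter values are
# illustrative assumptions.
def compute_stage_cost(data_bits: float, cycles_per_bit: float,
                       alloc_fraction: float, cpu_hz: float,
                       k_e: float = 1e-28):
    """Return (demand in cycles, time in s, energy in J) for one stage."""
    e = cycles_per_bit * data_bits        # e = eps * |F|
    f_used = alloc_fraction * cpu_hz      # allocated CPU frequency alpha * f
    t = e / f_used                        # computation time t = e / (alpha * f)
    eta = e * k_e * f_used ** 2           # energy eta = e * k_e * (alpha * f)^2
    return e, t, eta

# Example: 0.1 Mbit of BEV features, 500 cycles/bit, full use of a 3 GHz core.
e, t, eta = compute_stage_cost(0.1e6, 500, 1.0, 3e9)
print(f"demand={e:.2e} cycles, time={t * 1e3:.1f} ms, energy={eta:.3f} J")
```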

3.5. Communication Model

If assisting node $q$ generates object-aware sensing data $F(g,q)$ for object $g$, i.e., $s_{q,g} = 1$, but sub-task $g$ is offloaded to the MEC, i.e., $u_g = 1$, then $F(g,q)$ must be transmitted from assisting node $q$ to the RSU and subsequently from the RSU to the MEC. Define $\varphi$ as the data size of a single observation point (in bits). The overall size of the perception data sent from assisting node $q$ to the RSU (which may serve different sub-tasks) is denoted as $\rho_q$ and expressed as
$\rho_q = \varphi \sum_{g \in \mathcal{G}} s_{q,g} |F(g,q)|, \quad q \in \mathcal{Q}^+$
Orthogonal frequency division multiplexing (OFDM) is used for V2X transmission. We assume that all V2I links operate on a shared spectrum with total bandwidth $B_1$, while the RSU and the MEC share an optical transmission channel with a fixed rate. Let $\beta = \{\beta_q, q \in \mathcal{Q}^+\}$ represent the continuous bandwidth allocation vector in $\mathbb{R}^{Q+1}$, where $\beta_q$ indicates the proportion of bandwidth allocated to the link from assisting node $q$ to the RSU. Since no bandwidth needs to be allocated to the RSU itself, we have $\beta_Q = 0$, and the total proportion of allocated bandwidth must not exceed 1, which is given by
$\sum_{q \in \mathcal{Q}^+} \beta_q \leq 1$
The mean transmission rate from assisting node $q$ to the RSU is given by
$R_q = \beta_q B_1 \log_2\left(1 + \dfrac{P_q |h_q|^2 d_q^{-\gamma}}{\sigma^2}\right)$
where $P_q$ represents the transmission power of assisting node $q$, $h_q$ is the channel attenuation factor from assisting node $q$ to the RSU, $d_q$ is the distance between assisting node $q$ and the RSU, $\gamma$ is the path-loss exponent, and $\sigma^2$ represents the received noise power. We assume a fixed distance during the detection task and take into account its stringent delay requirement, e.g., $T = 20\ \text{ms}$. The mean transmission duration of all the perception data sent from assisting node $q$ to the RSU is denoted as $t_q$, which is given by
$t_q = \begin{cases} 0, & \beta_q = 0 \ \text{or} \ q = Q \\ \dfrac{\rho_q}{R_q}, & \beta_q > 0 \end{cases}$
The amount of data transmitted from the RSU to the MEC is denoted as $\rho$, which is
$\rho = \varphi \sum_{g \in \mathcal{G}} u_g |F_g|$
The mean transmission rate from the RSU to the MEC is denoted as $R$, and the transmission time of all sensing data sent from the RSU to the MEC is denoted as $t$:
$t = \dfrac{\rho}{R}$
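The following sketch evaluates the V2I transmission time of one assisting node according to the rate and delay expressions above; the channel parameters are illustrative assumptions (the values actually used in the simulations are listed in Table 2).

```python
# Hedged sketch: Shannon-rate V2I transmission time for one assisting node.
# All numeric channel parameters below are assumptions for illustration.
import math

def v2i_tx_time(rho_bits: float, beta_q: float, B1_hz: float, P_q_w: float,
                h_q: float, d_q_m: float, gamma: float, sigma2_w: float):
    """Transmission time t_q = rho_q / R_q, with R_q from the rate expression;
    returns 0 when no bandwidth is allocated (beta_q = 0)."""
    if beta_q == 0:
        return 0.0
    snr = P_q_w * abs(h_q) ** 2 * d_q_m ** (-gamma) / sigma2_w
    rate = beta_q * B1_hz * math.log2(1 + snr)   # bits per second
    return rho_bits / rate                        # seconds

# Example: 0.1 Mbit of features, 30% of a 20 MHz band, 0.2 W transmit power.
print(f"{v2i_tx_time(0.1e6, 0.3, 20e6, 0.2, 1.0, 50, 2.0, 1e-9) * 1e3:.2f} ms")
```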

3.6. Problem Formulation

The result of an object detection task is a confidence vector over multiple aspects, where higher confidence scores indicate a greater likelihood that the detected object exists and belongs to a specific class. Therefore, the predicted confidence values are used as a measure of the model’s accuracy. We fuse the data $F_g$ and process it with the AI model, and the perception accuracy $a_g$ is estimated from the data quality $Z_g$ using a collaborative accuracy prediction model. Smaller objects yield fewer observation points, so the bounding-box size of each object is also considered in the accuracy calculation. The inputs of the model are the data quality and the bounding box size, and the output is the accuracy:
$a_g = f_3(Z_g, b_g^{(x)}, b_g^{(y)}, b_g^{(z)})$
The accuracy of any object $g$ detected by the RSU can thus be predicted by the accuracy estimation learning model. We require that the observation accuracy of each object $g$ meets the accuracy threshold $A$:
$a_g \geq A, \quad \forall g \in \mathcal{G}$
Meanwhile, the time for assisting node $q$ to complete the object detection task must not exceed a threshold $T$. Since the data volume of the detection results is very small, the delay of transmitting results back from the MEC to the RSU is negligible, and the latency limitation is as follows:
$\max_{q \in \mathcal{Q}^+, \, q' \in \{Q, Q+1\}} \left( t_q^1 + t_q + t_{q'}^2 + t \right) \leq T$
Since assisting nodes accumulate computing resource consumption during object detection, we take the computing resource consumption incurred when the nodes use their maximum computing resources in both task stages as the overall benchmark value:
$E = \sum_{q \in \mathcal{Q}^+} e_q^1 k_e (f_q)^2 + \sum_{q' \in \{Q, Q+1\}} e_{q'}^2 k_e (f_{q'})^2$
We formulate the joint optimization problem P1 to minimize the total weighted system cost. The objective function is conceptually defined as the sum of two coupled resource costs:
  • Communication cost: the sum of allocated bandwidth fractions $\beta_q$, promoting efficient spectrum reuse.
  • Computation cost: the normalized energy consumption derived from the operating frequencies and processing times of all nodes.
This optimization is carried out under three crucial physical and performance constraints:
  • Accuracy constraint: every detected object $g$ must meet a minimum required confidence threshold $A$ (Equation (20)).
  • Latency constraint: the total time required for sensing, feature transmission, and fusion must not exceed the safety-critical threshold $T$ (Equation (21)).
  • Resource capacity constraint: the total allocated bandwidth and computing cycles must not exceed the physical limits of the RSU and MEC server.
Accordingly, the optimization problem P1 is formulated as follows:
$\mathbf{P1}: \min_{S, W, U, \alpha^1, \alpha^2, \beta} \ \omega \sum_{q \in \mathcal{Q}^+} \beta_q + (1-\omega) \cdot \dfrac{\sum_{q \in \mathcal{Q}^+} \eta_q^1 + \sum_{q' \in \{Q, Q+1\}} \eta_{q'}^2}{E}$
s.t. (14), (20), (21),
$s_{q,g} \in \{0,1\}, \ \forall g \in \mathcal{G}, \ q \in \mathcal{Q}^+$
$u_g \in \{0,1\}, \ \forall g \in \mathcal{G}$
$0 \leq w_q \leq 1, \ \forall q \in \mathcal{Q}^+$
$0 \leq \beta_q \leq 1, \ \forall q \in \mathcal{Q}^+$
$0 \leq \alpha_q^1 \leq 1, \ \forall q \in \mathcal{Q}^+$
$0 \leq \alpha_{q'}^2 \leq 1, \ \forall q' \in \{Q, Q+1\}$
Problem P1 is subject to constraints on topological structure, network structure, accuracy, latency, and resource capacity. Among these, both the topological constraint and the communication resource constraint include Equation (14), while the accuracy and latency constraints are given by Equations (20) and (21), respectively. $\omega \in (0, 1)$ is a weight balancing the consumption of communication and computing resources.
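As a small illustration, the weighted objective of P1 can be evaluated for a given allocation as follows; the helper function and the numbers are assumptions for demonstration only.

```python
# Hedged sketch of P1's weighted objective: communication cost is the sum of
# allocated bandwidth fractions, computing cost is the stage-wise energy
# normalized by the full-resource benchmark E. Numbers are illustrative.
def weighted_cost(beta, eta1, eta2, E, omega=0.5):
    """beta: bandwidth fractions per assisting node; eta1/eta2: per-node energy
    (J) in the feature-generation and fusion stages; E: benchmark energy."""
    return omega * sum(beta) + (1 - omega) * (sum(eta1) + sum(eta2)) / E

print(weighted_cost(beta=[0.3, 0.4, 0.2, 0.0], eta1=[0.05, 0.07, 0.04, 0.03],
                    eta2=[0.06, 0.02], E=0.5))
```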

4. Method

Since Problem P1 involves both discrete and continuous variables and the accuracy constraint derived from the DNN is a nonlinear constraint, it is relatively complex to solve. We observe that the accuracy constraint depends on the collaborative object selection matrix (S), the slimmable network width matrix (W), and the computation offloading matrix (U). The topological constraint depends on the collaborative object selection matrix (S), and both the latency constraint and resource constraint depend on the selection of collaborative objects, resource allocation, and changes in the slimmable network’s width.
Therefore, we propose a decomposition-based algorithm to solve Problem P1. The algorithm adopts a two-layer iterative structure. In the outer-layer module, a Genetic Algorithm (GA) iteratively optimizes the collaboration matrix $S$, the network width matrix $W$, and the computation offloading matrix $U$, subject to the accuracy and topological constraints; the result is denoted as the tuple $(S, W, U)$. This tuple is then passed to the inner-layer module, which solves a resource allocation sub-problem to obtain the solution vector $(S, W, U, \alpha^1, \alpha^2, \beta)$ satisfying the latency and resource constraints, together with the corresponding total resource consumption cost. Finally, the GA generates the next generation of candidates, and the optimization continues until the best solution is obtained.
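A hedged skeleton of this two-layer structure is sketched below; the callable interfaces (inner_solver, evolve), the population encoding, and the function names are assumptions introduced for illustration, not the paper’s implementation.

```python
# Hedged skeleton of the two-layer decoupling: the outer GA searches the
# discrete variables (S, W, U); for each candidate, the inner module returns
# the minimum resource cost of the convex sub-problem. The three callables
# are assumed interfaces, not the paper's exact code.
from typing import Any, Callable, Iterable, Tuple

def two_layer_optimize(init_population: Iterable[Any],
                       inner_solver: Callable[[Any], Tuple[Any, float]],
                       evolve: Callable[[list], list],
                       generations: int = 100):
    """Return the best (candidate, allocation, cost) found by the outer GA,
    where inner_solver((S, W, U)) -> (resource_allocation, cost)."""
    population = list(init_population)
    best = (None, None, float("inf"))
    for _ in range(generations):
        scored = []
        for cand in population:                 # outer layer: discrete (S, W, U)
            alloc, cost = inner_solver(cand)    # inner layer: SOCP via interior point
            scored.append((cand, cost))
            if cost < best[2]:
                best = (cand, alloc, cost)
        population = evolve(scored)             # selection, crossover, mutation
    return best
```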

4.1. Inner-Layer Module

Once the system is given $(S, W, U)$, we define the constants $C_1 = \dfrac{e_q^1}{f_q}$ and $C_2 = \dfrac{e_{q'}^2}{f_{q'}}$, which correspond to assisting node $q$ and computing node $q'$ devoting all of their computing resources to intermediate feature generation and feature fusion, i.e., $\alpha_q^1 = 1$ and $\alpha_{q'}^2 = 1$. Similarly, we define the constant $C_3 = \dfrac{\rho_q}{B_1 \log_2\left(1 + \frac{P_q |h_q|^2 d_q^{-\gamma}}{\sigma^2}\right)}$, which corresponds to the system using the entire communication bandwidth to transmit the perception information, i.e., $\beta_q = 1$. Substituting $(S, W, U)$ into Problem P1, we obtain the following:
$\mathbf{P2}: \min_{\alpha^1, \alpha^2, \beta} \ \omega \sum_{q \in \mathcal{Q}^+} \beta_q + (1-\omega) \cdot \dfrac{\sum_{q \in \mathcal{Q}^+} \eta_q^1 + \sum_{q' \in \{Q, Q+1\}} \eta_{q'}^2}{E}$
s.t. (14), (20), (21),
$C_1 \cdot \dfrac{1}{\alpha_q^1} + C_3 \cdot \dfrac{1}{\beta_q} + C_2 \cdot \dfrac{1}{\alpha_{q'}^2} \leq T, \ \forall q \in \mathcal{Q}^+, \ q' \in \{Q, Q+1\}$
$0 \leq \beta_q \leq 1, \ \forall q \in \mathcal{Q}^+$
$0 \leq \alpha_q^1 \leq 1, \ \forall q \in \mathcal{Q}^+$
$0 \leq \alpha_{q'}^2 \leq 1, \ \forall q' \in \{Q, Q+1\}$
The resource allocation sub-problem P2 is a constrained optimization problem whose objective function is convex. However, the latency constraints contain reciprocal terms such as $\frac{1}{\alpha_q^1}$, $\frac{1}{\beta_q}$, and $\frac{1}{\alpha_{q'}^2}$, which are nonlinear and prevent a direct solution using standard linear programming techniques. To address this issue, we adopt the framework of second-order cone programming (SOCP) to transform the latency constraints into convex standard forms.
We define the following new variables:
$x_q^1 = \dfrac{1}{\alpha_q^1}$
$x_q = \dfrac{1}{\beta_q}$
$x_{q'}^2 = \dfrac{1}{\alpha_{q'}^2}$
where $x_q^1 \geq 1$, $x_q \geq 1$, and $x_{q'}^2 \geq 1$.
Substituting the new variables into the original latency constraints, the nonlinear constraints containing reciprocal terms can be transformed into linear constraints:
$C_1 x_q^1 + C_3 x_q + C_2 x_{q'}^2 \leq T, \ \forall q \in \mathcal{Q}^+, \ q' \in \{Q, Q+1\}$
This linear constraint satisfies the convexity requirement; however, the variable substitution process introduces new nonlinear equality constraints:
$x_q^1 \alpha_q^1 = 1$
$x_q \beta_q = 1$
$x_{q'}^2 \alpha_{q'}^2 = 1$
To transform the nonlinear equality constraints of Equations (29)–(31) into convex constraints, we introduce the mean value inequality as a bridge.
For any positive real numbers $a$ and $b$, the mean value inequality states that $\sqrt{ab} \leq \dfrac{a+b}{2}$, where equality holds if and only if $a = b$. Then,
$\sqrt{x_q^1 \alpha_q^1} \leq \dfrac{x_q^1 + \alpha_q^1}{2}$
$\sqrt{x_q \beta_q} \leq \dfrac{x_q + \beta_q}{2}$
$\sqrt{x_{q'}^2 \alpha_{q'}^2} \leq \dfrac{x_{q'}^2 + \alpha_{q'}^2}{2}$
Combining with the constraints of Equations (29)–(31), they can be simplified to
$1 \leq \dfrac{x_q^1 + \alpha_q^1}{2}$
$1 \leq \dfrac{x_q + \beta_q}{2}$
$1 \leq \dfrac{x_{q'}^2 + \alpha_{q'}^2}{2}$
Multiplying both sides by two yields the linear inequality constraint:
$2 \leq x_q^1 + \alpha_q^1$
$2 \leq x_q + \beta_q$
$2 \leq x_{q'}^2 + \alpha_{q'}^2$
Equations (38)–(40) are all convex constraints; however, equivalence with the original equality constraints cannot be guaranteed by such linear inequalities alone. Therefore, it is necessary to further transform them into the standard constraint form of SOCP to ensure the equivalence of the constraints during the optimization process.
The core constraint form of SOCP requires that the Euclidean norm of a vector be no greater than a scalar. Specifically, for a $q$-dimensional second-order cone $\mathcal{K}_q = \{(u, t) \in \mathbb{R}^{q-1} \times \mathbb{R} \mid \|u\|_2 \leq t\}$, the constraint can be expressed as $(u, t) \in \mathcal{K}_q$, where $u$ is a $(q-1)$-dimensional vector, $t$ is a non-negative scalar, and $\|\cdot\|_2$ denotes the Euclidean norm.
For Equation (38), we construct a one-dimensional vector $u_q^1 = [2\sqrt{\alpha_q^1}]$, whose Euclidean norm is
$\|u_q^1\|_2 = \|2\sqrt{\alpha_q^1}\|_2 = 2\sqrt{\alpha_q^1}$
Meanwhile, we define the scalar $t_q^1 = \alpha_q^1 + x_q^1$. From Equation (38), we know that $t_q^1 \geq 2 > 0$, which satisfies the requirement of the second-order cone constraint for a non-negative scalar. At this point, Equation (38) can be rewritten as
$2\sqrt{\alpha_q^1} \leq \alpha_q^1 + x_q^1$
This inequality conforms to the standard form of the second-order cone constraint $\|u\|_2 \leq t$, that is,
$(u_q^1, t_q^1) \in \mathcal{K}_2$
where $\mathcal{K}_2 = \{(u, t) \in \mathbb{R} \times \mathbb{R} \mid \|u\|_2 \leq t\}$ is a two-dimensional second-order cone.
Similarly, for Equations (39) and (40), we respectively construct one-dimensional vectors $v_q = [2\sqrt{\beta_q}]$ and $w_{q'}^2 = [2\sqrt{\alpha_{q'}^2}]$ and define scalars $s_q = x_q + \beta_q$ and $r_{q'}^2 = x_{q'}^2 + \alpha_{q'}^2$, from which the corresponding second-order cone constraints are obtained:
$(v_q, s_q) \in \mathcal{K}_2$
$(w_{q'}^2, r_{q'}^2) \in \mathcal{K}_2$
The original non-convex resource allocation sub-problem P2 is finally converted into a standard SOCP problem:
$\mathbf{P2'}: \min_{\alpha^1, \alpha^2, \beta, x_q^1, x_q, x_{q'}^2} \ \omega \sum_{q \in \mathcal{Q}^+} \beta_q + (1-\omega) \cdot \dfrac{\sum_{q \in \mathcal{Q}^+} \eta_q^1 + \sum_{q' \in \{Q, Q+1\}} \eta_{q'}^2}{E}$
s.t. (14), (20), (21),
$C_1 x_q^1 + C_3 x_q + C_2 x_{q'}^2 \leq T, \ \forall q \in \mathcal{Q}^+, \ q' \in \{Q, Q+1\}$
$\left([2\sqrt{\alpha_q^1}], \ \alpha_q^1 + x_q^1\right) \in \mathcal{K}_2, \ \forall q \in \mathcal{Q}^+$
$\left([2\sqrt{\beta_q}], \ x_q + \beta_q\right) \in \mathcal{K}_2, \ \forall q \in \mathcal{Q}^+$
$\left([2\sqrt{\alpha_{q'}^2}], \ x_{q'}^2 + \alpha_{q'}^2\right) \in \mathcal{K}_2, \ \forall q' \in \{Q, Q+1\}$
$0 \leq \alpha_q^1 \leq 1, \ 0 \leq \beta_q \leq 1, \ 0 \leq \alpha_{q'}^2 \leq 1$
$x_q^1 \geq 1, \ x_q \geq 1, \ x_{q'}^2 \geq 1$
Subsequently, the interior-point method is used to solve it.
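For illustration, a simplified instance of the resource-allocation sub-problem can be solved with CVXPY, which performs an equivalent conic reformulation internally and hands the problem to an interior-point solver. This is a sketch under assumed constants and a simplified objective, not the paper’s exact manual SOCP construction.

```python
# Hedged sketch: a simplified inner sub-problem solved with CVXPY. The
# constants C1/C3/C2 follow the paper's definitions (full-resource compute
# time, full-bandwidth transmission time, full-resource fusion time); all
# numeric values and energy weights are illustrative assumptions.
import cvxpy as cp
import numpy as np

Q = 3                                            # assisting CAVs; index Q is the RSU
C1 = np.array([4.0e-3, 5.0e-3, 3.5e-3, 2.0e-3])  # s, feature generation at alpha1 = 1
C3 = np.array([3.0e-3, 3.5e-3, 2.5e-3])          # s, V2I transmission at beta = 1 (CAVs only)
C2 = 3.0e-3                                      # s, feature fusion at alpha2 = 1
T, omega, E = 20e-3, 0.5, 1.0
w_comp = np.array([0.2, 0.25, 0.18, 0.3])        # assumed per-node energy weights

alpha1 = cp.Variable(Q + 1, nonneg=True)         # compute shares, CAVs + RSU
beta = cp.Variable(Q, nonneg=True)               # bandwidth shares, CAVs only (beta_RSU = 0)
alpha2 = cp.Variable(nonneg=True)                # fusion compute share on the RSU

constraints = [alpha1 <= 1, beta <= 1, alpha2 <= 1, cp.sum(beta) <= 1]
for q in range(Q):                               # per-CAV end-to-end latency budget
    constraints.append(C1[q] * cp.inv_pos(alpha1[q])
                       + C3[q] * cp.inv_pos(beta[q])
                       + C2 * cp.inv_pos(alpha2) <= T)
constraints.append(C1[Q] * cp.inv_pos(alpha1[Q]) + C2 * cp.inv_pos(alpha2) <= T)

comm_cost = cp.sum(beta)
comp_cost = (w_comp @ cp.square(alpha1) + 0.3 * cp.square(alpha2)) / E
prob = cp.Problem(cp.Minimize(omega * comm_cost + (1 - omega) * comp_cost), constraints)
prob.solve()
print(prob.status, np.round(beta.value, 3), np.round(alpha1.value, 3))
```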

4.2. Outer-Layer Module

The outer optimization module adopts a GA to iteratively optimize the decision variables $S$, $W$, and $U$. Specifically, the algorithm first randomly initializes a population $P$ consisting of $I$ individuals, where each individual $P_i^0$ is composed of a set of decision variables $(S_i^0, W_i^0, U_i^0)$.
During the evolution of the $j$-th generation, the inner optimization module is invoked to calculate the minimum resource consumption $C(P_i^j)$ of each individual $P_i^j$, using the formula given in (41). The fitness function is then defined as $F(P_i^j) = -C(P_i^j)$; in other words, individuals with lower resource consumption have higher fitness. This transforms the resource consumption minimization problem into a fitness maximization problem that can be directly optimized by the GA.
Based on the fitness evaluation results, the algorithm employs elite selection to choose the two best-performing individuals as parents (denoted as $P_a^j$ and $P_b^j$, where $a$ and $b$ are the indices of the parent individuals in the $j$-th generation). Subsequently, a crossover operation is performed on the genes of the two parents with crossover probability $p_C$: a crossover point is randomly selected in the decision variable sequence, and the gene segments on both sides of the crossover point are exchanged to generate a candidate offspring $P_i^{j+1}$. For example, if the parents $P_a^j = (S_a^j, W_a^j, U_a^j)$ and $P_b^j = (S_b^j, W_b^j, U_b^j)$ undergo crossover between $W$ and $U$, the candidate offspring may be $P_i^{j+1} = (S_a^j, W_a^j, U_b^j)$.
To maintain population diversity, a mutation operation is applied to the candidate offspring $P_i^{j+1}$ with mutation probability $p_M$: a random decision variable is selected and its value is randomly perturbed (e.g., fine-tuning a real-coded value within a reasonable range or flipping a binary-coded bit) to obtain the mutated candidate offspring.
Afterwards, a constraint check is performed to verify whether $P_i^{j+1}$ satisfies all constraints (from P2). If it does, it is incorporated into the $(j+1)$-th generation population as a new individual; if not, the parent with higher fitness (either $P_a^j$ or $P_b^j$) is copied into the $(j+1)$-th generation by default to keep the population size stable.
The above processes of selection, crossover, mutation, and constraint checking are repeated until the number of iterations reaches the predefined maximum number of generations. At this point, the decision variables $(S, W, U)$ corresponding to the individual with the highest fitness in the population are returned as the final solution.
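A compact sketch of this outer GA loop is given below; the flat gene encoding, the evaluate() interface to the inner module (returning the minimum resource cost, or infinity if infeasible), and the probability values are simplifying assumptions.

```python
# Hedged sketch of the outer GA loop: elite selection, single-point crossover
# over the flattened (S, W, U) gene sequence, and mutation. evaluate() is an
# assumed interface to the inner SOCP module; encoding details are simplified.
import random

WIDTHS = [0.25, 0.5, 0.75, 1.0]

def crossover(parent_a, parent_b, p_c=0.9):
    """Swap gene segments after a randomly chosen crossover point."""
    if random.random() > p_c:
        return list(parent_a)
    point = random.randrange(1, len(parent_a))
    return list(parent_a[:point]) + list(parent_b[point:])

def mutate(genes, p_m=0.05):
    """Perturb one random gene: flip a binary S/U entry or resample a width."""
    child = list(genes)
    if random.random() < p_m:
        i = random.randrange(len(child))
        child[i] = random.choice(WIDTHS) if isinstance(child[i], float) else 1 - child[i]
    return child

def outer_ga(population, evaluate, generations=50):
    """population: list of flat gene lists; evaluate(genes) -> cost (inf if infeasible)."""
    for _ in range(generations):
        costs = [evaluate(ind) for ind in population]
        ranked = [ind for _, ind in sorted(zip(costs, population), key=lambda x: x[0])]
        pa, pb = ranked[0], ranked[1]                       # elite parents
        children = [mutate(crossover(pa, pb)) for _ in population]
        # keep a child only if feasible; otherwise copy the fitter parent
        population = [c if evaluate(c) < float("inf") else list(pa) for c in children]
    return min(population, key=evaluate)
```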

5. Experiments

5.1. Environmental Settings

To train the collaborative accuracy prediction model, we construct a cooperative perception scenario based on the OPV2V dataset [8]. This scenario contains $Q$ CAVs, one RSU, and one MEC unit, all of which are equipped with slimmable network models. The candidate network widths are set to $\{0.25, 0.5, 0.75, 1.0\}$ to represent different model architectures.
For object detection, we use the AI model employed in Where2comm for implementation. We collect the object detection results of each frame and the cooperative topological relationships, and we employ the mean squared error (MSE) loss function to compute the discrepancy between the predicted values and the true labels, thus facilitating the training of the accuracy-aware model.
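A minimal sketch of such an accuracy-aware regressor trained with the MSE loss is shown below; the MLP architecture, layer sizes, and synthetic tensors are assumptions for illustration, since the paper trains on detection outputs and topology information collected from OPV2V.

```python
# Hedged sketch: training an accuracy-aware model f3 with an MSE loss.
# Architecture and data are assumed placeholders, not the paper's setup.
import torch
import torch.nn as nn

f3 = nn.Sequential(                       # inputs: data quality Z_g + 3 bbox dims
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),       # output: predicted confidence in [0, 1]
)
optimizer = torch.optim.Adam(f3.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Synthetic stand-in batch: (Z_g, b_x, b_y, b_z) -> detection confidence label.
features = torch.rand(256, 4)
labels = torch.rand(256, 1)

for epoch in range(10):                   # illustrative training loop
    optimizer.zero_grad()
    loss = criterion(f3(features), labels)
    loss.backward()
    optimizer.step()
```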
The perception latency requirement is set to $T = 20\ \text{ms}$, and the accuracy requirement is set to $A = 0.9$. Other system parameters (e.g., bandwidth, transmit power, computing resources) are detailed in Table 2. For simplicity, we assume that all nodes have identical received noise power, transmit power, channel attenuation factor, and computing resources. By setting the weight coefficient $\omega = 0.5$, we treat the minimization of the total computing resource cost and the total communication resource cost as equally important in the optimization objective.
Table 2. System parameters in experiment.

5.2. Experimental Result

First, we demonstrate the necessity of incorporating bounding box dimension information into the input of the accuracy-aware perception model. Specifically, in the single-RSU perception scenario, we systematically evaluate the correlation between the number of point clouds corresponding to target objects and their perception accuracy, with a focus on the impact of different target sizes (i.e., geometric scales). We take a larger truck and a smaller cyclist as examples. Figure 3 presents the perception accuracy achieved for the truck and the cyclist using their respective perception data.
Figure 3. Performance of slimmable network when observing different targets. Note that perception accuracy (y-axis) improves as the network width (x-axis) increases. Crucially, the marginal gain diminishes at higher widths, suggesting that a dynamic trade-off strategy can save resources without significantly compromising performance.
We then demonstrate the difference in perception accuracy between collaborative perception and single-RSU perception. Specifically, under the same edge noise conditions, we collect perception data from the RSU and the assisting CAVs and test the perception accuracy of the different perception mechanisms for the same object. Among them, the cooperative perception mechanism adopts the slimmable network and computation offloading technology. As shown in Table 3, the perception accuracy under the cooperative perception mechanism is significantly superior to that under the single-RSU perception mechanism. This finding suggests that the feature fusion mechanism notably boosts the RSU’s object detection ability, thereby offering solid evidence for the validity of the presented cooperative perception mechanism.
Table 3. Differences in perception accuracy between cooperative perception mechanism and single-RSU perception mechanism.
Meanwhile, to investigate the impact of slimmable network width on perception accuracy, we test the influence of slimmable networks on the perception accuracy of a single RSU. As shown in Figure 3 and Figure 4, the experiments indicate that, under the same conditions, the larger the width of the slimmable network, the higher the achievable perception accuracy; however, this is accompanied by higher computing resource consumption. This further demonstrates that improving perception performance requires a trade-off between accuracy and resource consumption.
Figure 4. Application of slimmable network in collaborative perception scenarios. This diagram illustrates the heterogeneous network width configurations assigned to different agents (RSU and CAVs). For instance, while CAV 2 operates at full width (1.0) to maximize feature quality, CAV 3 is restricted to a width of 0.25, demonstrating the system’s granular control over “intelligent elasticity” to adapt to individual link conditions.
Next, we examine the performance of the proposed model using a trained accuracy estimation learning model. Figure 5 illustrates the variations in collaborative perception and computation offloading schemes across various scenarios. Every arrow in the figure represents the transmission of perception data.
Figure 5. Collaborative perception and computation offloading schemes across various scenarios. (a) Under low accuracy requirements ($A = 0.7$), the RSU operates independently. (b) When the accuracy demand increases ($A = 0.9$), collaboration is triggered (blue arrows). (c) Under high computational intensity, tasks are further offloaded to the MEC (orange arrows). These transitions visually confirm the system’s dynamic decision-making capability in response to changing task demands.
Figure 5a,b show the collaborative perception and computation offloading schemes under different accuracy requirements. For the lower accuracy requirement of $A = 0.7$, the RSU does not require collaborative perception to complete the target detection task. Figure 5b,c show the collaborative perception and computation offloading schemes under different computational intensities. To further verify the superiority of the proposed model in terms of resource consumption, we compare it with three representative benchmark methods. Table 4 summarizes the key characteristics of these methods, including differences in the granularity of the collaboration mechanism and slimmable network width allocation. Specifically, C4I-JO corresponds to our proposed method; SCFW selectively collaborates with some assisting CAVs but adopts a fixed network structure instead of a slimmable network; ACSW collaborates with all assisting CAVs within the perception range and incorporates the slimmable network; ACFW also collaborates with all assisting CAVs but does not use a slimmable network. Additionally, to verify the effectiveness of computation offloading technology, we further incorporate SCSW and conduct comparative experiments between SCSW and C4I-JO, considering C4I-JO in two scenarios: with and without the adoption of computation offloading technology.
Table 4. Experimental setup of baseline methods and our proposed method.
Table 5 provides a detailed comparison of the perception accuracy and processing latency of the different methods in the object detection sub-task. Specifically, under the same scenario, we evaluate the perception accuracy of three objects (indexed as 0, 1, and 2), as well as the overall processing latency. The experimental results show that all four methods achieve satisfactory perception accuracy; however, the latency performance of ACFW fails to meet the system’s real-time requirement (20 ms), so it is excluded from subsequent analysis. By comparing C4I-JO and SCSW, we find that computation offloading technology can effectively reduce task processing latency. Meanwhile, considering that SCSW consumes less communication and computing resources than C4I-JO, as it only uses computing resources on the RSU and does not communicate with the MEC, SCSW is also excluded from subsequent analyses. Specifically, the processing delay reported in Table 5 is estimated using the system models defined in Section 3. It comprises the computation time for feature extraction and fusion (calculated via Equations (9) and (10)) and the communication transmission time (calculated via Equations (16) and (18)). The results confirm that by offloading computation and adjusting model width, C4I-JO effectively minimizes the total latency defined in Equation (21). Through this quantitative comparison, we can comprehensively evaluate the trade-off performance of the proposed model between perception accuracy and resource consumption.
Table 5. Performance comparison of baseline methods and our proposed method. Note: Perception accuracy is measured as a confidence score (scale 0–1), and processing delay is measured in milliseconds (ms), calculated based on the summation of computation and transmission times defined in Equation (21).
Figure 6 shows the change in resource consumption of each method as the computational intensity increases from 10,000 to 30,000. Figure 7 illustrates the variation in resource consumption for each method as the number of perceived objects within the ROI increases. It can be observed that, with rising computational intensity and a greater number of perceived objects, all methods exhibit an increasing trend in both communication and computation resource consumption. Meanwhile, compared with the baseline methods, the proposed joint optimization method significantly reduces resource consumption while maintaining perception accuracy and latency performance. These results firmly demonstrate the validity and advantages of the presented framework in enhancing resource utilization efficiency.
Figure 6. Resource consumption with increasing computational intensity. (a) Total computing resource consumption in the intermediate feature generation phase. (b) Total computing resource in the feature fusion phase. (c) Total communication resource consumption. Observe that as the computational intensity rises, the proposed C4I-JO method (blue line) consistently maintains significantly lower resource consumption compared to the baselines (SCFW and ACSW), demonstrating the efficiency advantage gained by jointly optimizing intelligence elasticity and resource allocation.
Figure 7. Resource consumption with increasing perceived objects. (a,b) Total computing resource consumption in feature generation and fusion phases. (c) Total communication resource consumption. Observe that while resource demand scales with the number of objects, the proposed C4I-JO method (blue line) exhibits a significantly slower rate of increase compared to benchmarks, highlighting its superior scalability in dense traffic scenarios.

6. Conclusions

This paper proposes a joint resource and intelligence optimization method C4I-JO for multi-vehicle collaborative perception. By introducing intelligent elasticity, we break the bottleneck of static AI models in dynamic V2X environments. Quantitative results demonstrate the superiority of this approach: C4I-JO achieves a low processing latency of 6 ms, which is approximately 53% lower than the baseline SCFW method (13 ms) and 57% lower than ACSW (14 ms), while maintaining high perception accuracy (confidence score > 0.9). Furthermore, the simulation results confirm that under high computational intensity (intensity = 30,000), our method reduces the total computing resource consumption by approximately 30% compared to full-width collaboration schemes. These results firmly establish that jointly optimizing the collaborative mechanism, resource allocation, and network width significantly improves resource utilization efficiency.
Nevertheless, several limitations remain for future work. First, our current model assumes perfect channel state information (CSI), which may not hold true in highly dynamic V2X environments. Future work will focus on enhancing the system’s resilience against communication failures and packet loss. Second, regarding scalability, the centralized optimization approach may face computational bottlenecks in extremely high-density traffic. We plan to explore hierarchical offloading strategies to address this. Finally, addressing the practical implementation challenges of synchronizing heterogeneous hardware across different vehicle platforms remains a priority.

Author Contributions

Conceptualization, Z.Y.; Methodology, L.X.; Validation, X.F.; Resources, G.Z.; Data curation, X.F.; Writing—original draft, L.X. and H.Z.; Writing—review & editing, Q.Y.; Supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under Grant 2022YFB4300400 and Grant 2023YFB4301900.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Liang Xin, Guangtao Zhou and Zhaoyang Yu were employed by China Unicom Smart Connection Technology Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Bai, Z.; Wu, G.; Qi, X.; Liu, Y.; Oguchi, K.; Barth, M.J. Infrastructure-Based Object Detection and Tracking for Cooperative Driving Automation: A Survey. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 1366–1373. [Google Scholar] [CrossRef]
  2. Xiang, C.; Feng, C.; Xie, X.; Shi, B.; Lu, H.; Lv, Y.; Yang, M.; Niu, Z. Multi-Sensor Fusion and Cooperative Perception for Autonomous Driving: A Review. IEEE Intell. Transp. Syst. Mag. 2023, 15, 36–58. [Google Scholar] [CrossRef]
  3. Huang, T.; Liu, J.; Zhou, X.; Nguyen, D.C.; Azghadi, M.R.; Xia, Y.; Han, Q.-L.; Sun, S. V2X Cooperative Perception for Autonomous Driving: Recent Advances and Challenges. arXiv 2023, arXiv:2310.03525. [Google Scholar] [CrossRef]
  4. Wang, T.-H.; Manivasagam, S.; Liang, M.; Yang, B.; Zeng, W.; Tu, J.; Urtasun, R. V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction. arXiv 2020, arXiv:2008.07519. [Google Scholar] [CrossRef]
  5. Li, Y.; Ren, S.; Wu, P.; Chen, S.; Feng, C.; Zhang, W. Learning Distilled Collaboration Graph for Multi-Agent Perception. arXiv 2022, arXiv:2111.00643. [Google Scholar] [CrossRef]
  6. Ye, X.; Qu, K.; Zhuang, W.; Shen, X. Accuracy-Aware Cooperative Sensing and Computing for Connected Autonomous Vehicles. IEEE Trans. Mob. Comput. 2024, 23, 8193–8207. [Google Scholar] [CrossRef]
  7. Yu, J.; Yang, L.; Xu, N.; Yang, J.; Huang, T. Slimmable Neural Networks. arXiv 2018, arXiv:1812.08928. [Google Scholar] [CrossRef]
  8. Xu, R.; Xiang, H.; Xia, X.; Han, X.; Li, J.; Ma, J. OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication. arXiv 2022, arXiv:2109.07644. [Google Scholar] [CrossRef]
  9. Liu, Y.-C.; Tian, J.; Ma, C.-Y.; Glaser, N.; Kuo, C.-W.; Kira, Z. Who2com: Collaborative Perception via Learnable Handshake Communication. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6876–6883. [Google Scholar] [CrossRef]
  10. Liu, Y.-C.; Tian, J.; Glaser, N.; Kira, Z. When2com: Multi-Agent Perception via Communication Graph Grouping. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4105–4114. [Google Scholar] [CrossRef]
  11. Hu, Y.; Fang, S.; Lei, Z.; Zhong, Y.; Chen, S. Where2comm: Communication-Efficient Collaborative Perception via Spatial Confidence Maps. arXiv 2022, arXiv:2209.12836. [Google Scholar] [CrossRef]
  12. Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.-H.; Ma, J. V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer. arXiv 2022, arXiv:2203.10638. [Google Scholar] [CrossRef]
  13. Wang, Z.; Fan, S.; Huo, X.; Xu, T.; Wang, Y.; Liu, J.; Chen, Y.; Zhang, Y.-Q. VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection. arXiv 2023, arXiv:2303.10975. [Google Scholar] [CrossRef]
  14. Xu, S.; Hu, X.; Wang, L.; Wang, Y.; Wang, W. Joint Power and Bandwidth Allocation for Internet of Vehicles Based on Proximal Policy Optimization Algorithm. In Proceedings of the 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Shenyang, China, 20–22 October 2021; pp. 1352–1357. [Google Scholar] [CrossRef]
  15. Jiang, W.; Song, T.; Song, X.; Wang, C.; Jin, Z.; Hu, J. Energy-Efficient Resource Allocation for NOMA-Enabled Vehicular Networks. IEEE Trans. Veh. Technol. 2025, 74, 12042–12057. [Google Scholar] [CrossRef]
  16. Liu, T.; Tang, L.; Wang, W.; He, X.; Chen, Q.; Zeng, X.; Jiang, H. Resource Allocation in DT-Assisted Internet of Vehicles via Edge Intelligent Cooperation. IEEE Internet Things J. 2022, 9, 17608–17626. [Google Scholar] [CrossRef]
  17. Liu, C.; Xia, M.; Zhao, J.; Li, H.; Gong, Y. Optimal Resource Allocation for Integrated Sensing and Communications in Internet of Vehicles: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2025, 74, 3028–3038. [Google Scholar] [CrossRef]
  18. Ji, M.; Wu, Q.; Fan, P.; Cheng, N.; Chen, W.; Wang, J.; Letaief, K.B. Graph Neural Networks and Deep Reinforcement Learning-Based Resource Allocation for V2X Communications. IEEE Internet Things J. 2025, 12, 3613–3628. [Google Scholar] [CrossRef]
  19. Moreno, A.; Osaba, E.; Onieva, E.; Perallos, A.; Iovino, G.; Fernández, P. Design and Field Experimentation of a Cooperative ITS Architecture Based on Distributed RSUs. Sensors 2016, 16, 1147. [Google Scholar] [CrossRef]
  20. Zobaed, S.; Mokhtari, A.; Champati, J.P.; Kourouma, M.; Salehi, M.A. Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge. In Proceedings of the 2022 IEEE/ACM 15th International Conference on Utility and Cloud Computing (UCC), Vancouver, WA, USA, 6–9 December 2022; pp. 11–20. [Google Scholar] [CrossRef]
  21. Gu, X.; Wu, Q.; Fan, P.; Cheng, N.; Chen, W.; Letaief, K.B. DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing. Digit. Commun. Netw. 2024, 11, 12042–12057. [Google Scholar] [CrossRef]
  22. Yu, J.; Huang, T. Universally Slimmable Networks and Improved Training Techniques. arXiv 2019, arXiv:1903.05134. [Google Scholar] [CrossRef]
  23. Li, C.; Wang, G.; Wang, B.; Liang, X.; Li, Z.; Chang, X. Dynamic Slimmable Network. arXiv 2021, arXiv:2103.13258. [Google Scholar] [CrossRef]
  24. Wang, S.; Li, Z.; Gao, J.; Li, L.; Hu, W. DSPNet: Towards Slimmable Pretrained Networks based on Discriminative Self-supervised Learning. arXiv 2022, arXiv:2207.06075. [Google Scholar] [CrossRef]
  25. Zhao, S.; Zhu, L.; Wang, X.; Yang, Y. Slimmable Networks for Contrastive Self-supervised Learning. arXiv 2024, arXiv:2209.15525. [Google Scholar] [CrossRef]
  26. Bouguezzi, S.; Faiedh, H.; Souani, C. Slim MobileNet: An Enhanced Deep Convolutional Neural Network. In Proceedings of the 2021 18th International Multi-Conference on Systems, Signals & Devices (SSD), Monastir, Tunisia, 22–25 March 2021; pp. 12–16. [Google Scholar] [CrossRef]
  27. Elminshawi, M.; Chetupalli, S.R.; Habets, E.A.P. Slim-Tasnet: A Slimmable Neural Network for Speech Separation. In Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 22–25 October 2023; pp. 1–5. [Google Scholar] [CrossRef]
  28. Yu, J.; Huang, T. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers. arXiv 2019, arXiv:1903.11728. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
