1. Introduction
Intelligent Transportation Systems (ITS) are widely seen as a cornerstone of safe, efficient, and sustainable urban mobility [1]. By integrating sensing, communication, and computing technologies, these systems aim to reduce congestion, minimize accidents, and improve the overall quality of transportation services [2]. The proliferation of connected vehicles and roadside units, coupled with emerging applications such as autonomous driving, remote diagnostics, and in-vehicle entertainment, has led to explosive growth in data traffic across ITS infrastructures [3]. This surge in mobile data flows intensifies pressure on communication networks, exacerbating issues such as spectrum scarcity, limited bandwidth, and stringent latency constraints [4,5,6].
Despite advances in 4G/5G networks and vehicular ad hoc networks, traditional communication architectures largely follow Shannon's information theory, focusing on reliable symbol-level transmission rather than task-relevant content [7,8]. This approach often results in inefficient use of bandwidth, as large volumes of raw sensor data (including high-resolution images, LiDAR scans, and radar measurements) must be transmitted without accounting for their semantic relevance to downstream tasks [9,10]. Moreover, ITS applications impose strict latency requirements that make the real-time transmission of raw data increasingly impractical in dense urban environments [11]. Meanwhile, many ITS scenarios exhibit strong spatial and temporal regularities [12]. Static background elements such as buildings, road signs, and traffic lights remain largely unchanged over time, while the truly critical dynamic content involves moving agents such as vehicles and pedestrians [13]. Recognizing these regularities offers opportunities to reduce redundant transmissions and focus communication resources on semantically meaningful information [14]. Indeed, traffic management, incident detection, and cooperative perception all benefit from the ability to exchange concise, task-relevant representations rather than raw sensor streams [15].
Figure 1 illustrates a representative ITS intersection scenario. A Roadside Unit (RSU) deployed at the intersection is equipped with a camera that continuously captures raw traffic scenes. These images typically include static background information such as road geometry, surrounding buildings, and traffic signs, as well as dynamic elements such as vehicles, pedestrians, cyclists, and traffic signals. The RSU processes the captured images to identify dynamic, safety-critical objects that directly influence traffic flow and decision-making. In contrast, static background elements, which remain largely unchanged over time, are pre-stored at the Base Station (BS) as priors for reconstruction. The extracted dynamic content, which is significantly more compact than the full raw images, is transmitted over noisy wireless channels to the BS. At the receiving end, the BS fuses the transmitted information with the stored background priors to reconstruct a complete traffic scene. This approach avoids repeatedly transmitting redundant background data, thereby reducing bandwidth consumption, while ensuring that dynamic content such as the movement of vehicles or pedestrians is reliably conveyed even under adverse channel conditions. The example in Figure 1 thus shows how communication at the perception layer can reduce redundant load while reliably delivering dynamic, safety-critical information. Such reconstructed scenes are not only compact in transmission but also directly usable for ITS perception tasks, including cooperative perception between vehicles and RSUs, real-time safety monitoring, and incident detection. In this way, they connect communication-layer design with practical ITS applications, motivating the development of more advanced communication strategies capable of transmitting high-level, task-relevant semantics rather than raw data.
Semantic Communication (SC) has emerged as a promising paradigm to address these limitations by prioritizing the transmission of high-level task-relevant semantics rather than low-level symbols [16]. SC approaches can significantly improve spectral efficiency, robustness to channel noise, and latency by compressing redundant information while preserving the critical semantic cues needed for decision-making [17]. Recent work has shown that attention mechanisms, encoder–decoder frameworks, and joint source–channel coding (JSCC) can enhance SC systems by selectively focusing on important features and ensuring reliable transmission even under noisy channel conditions [18,19,20].
Although semantic communication has shown great potential for enhancing wireless transmission, existing methods applied to ITS still face key limitations. JSCC-based frameworks [21,22] leverage end-to-end learning to jointly optimize image compression and channel robustness; however, their large model sizes and complex encoder–decoder structures hinder deployment on edge devices. Attention-enhanced SC methods [23,24,25,26] improve semantic relevance by highlighting discriminative features, yet often rely on transformer [27] backbones and lack structured inductive bias, making them computationally intensive and less suitable for edge-side inference in ITS environments. Knowledge-driven designs [28,29] utilize external graphs or shared priors to enhance decoding, but their dependence on scenario-specific priors limits reusability across scenes. Task-adaptive SC strategies [30,31] adjust rate control based on task sensitivity, but are mostly built for general semantic compression and fail to account for the spatial regularities and dynamic/static saliency inherent to traffic scenarios. Even more recent routing-based approaches such as BRASC [32] perform in-channel semantic attention to prioritize relevant features, yet still rely on online segmentation and full-resolution attention without leveraging prior semantic knowledge to suppress redundant background content.
These limitations highlight a broader challenge: most existing methods have not been explicitly designed around ITS requirements, where efficiency, robustness, and semantic fidelity under noisy channels are essential. As a result, the communication layer in ITS still lacks solutions that can reliably deliver semantically meaningful visual information under realistic channel conditions. In this context, reliable semantic image reconstruction is particularly critical for higher-level tasks such as detection, tracking, and cooperative perception, since channel noise directly impacts the recognition of both static infrastructure and dynamic obstacles. As illustrated in Figure 1, much of the perceptual information in ITS is conveyed through camera-captured images, making robust semantic reconstruction under noisy channels a prerequisite for effective downstream processing.
To address these challenges, we propose a framework called Offline Knowledge Base and Attention-driven Semantic Communication (OKBASC). In OKBASC, the offline knowledge base precomputes semantic masks by segmenting training images in advance, constructing a compact and reusable semantic repository. At runtime, sparse attention integrates these masks with incoming visual data, providing task-relevant semantic priors that suppress redundant regions before transmission. Furthermore, our Bi-Level Routing Attention (BRA) module performs both global top-K channel selection and local spatial refinement, yielding enhanced feature representations with broader context and finer detail. Together, these innovations enable OKBASC to achieve fast and lightweight semantic communication with high fidelity, making it well suited for ITS applications.
The main contributions of this work are summarized in Table 1. It is important to note that this work focuses specifically on semantic communication for image transmission under noisy channels, rather than end-to-end ITS functions such as trajectory prediction or traffic management. Within this scope, the contributions listed in Table 1 reflect the achievements of this paper. These innovations enable OKBASC to achieve high-fidelity semantic communication, providing reliable semantic image reconstruction that supports downstream ITS tasks such as detection, tracking, and cooperative perception. Accordingly, the proposed contributions should be understood as enabling reliable image-level communication within the perception layer of ITS rather than directly addressing higher-level traffic management or control.
The remainder of this paper is organized as follows: Section 2 reviews related work on semantic communication and its applications in ITS; Section 3 details the architecture and components of the proposed OKBASC framework; Section 4 presents the experimental setup; Section 5 describes the results and corresponding analyses; finally, Section 6 concludes the paper and discusses future directions.
2. Related Work
Explosive data growth in ITS, driven by applications such as autonomous driving, remote diagnostics, and multimedia services, has pushed conventional communication technologies to their limits [33]. Traditional methods based on Shannon's information theory focus on bit-level transmission accuracy, yet struggle to handle task-relevant semantic content under constrained bandwidth and latency requirements [34,35]. To address these limitations, SC has emerged as a transformative paradigm that aims to transmit only essential semantic information rather than raw data, enhancing both spectral efficiency and transmission robustness [36].
Prior research on SC in vehicular networks has largely focused on JSCC frameworks that use convolutional or autoencoder-based architectures to achieve efficient image transmission and semantic preservation under noisy channel conditions [37]. For instance, convolutional semantic encoders and decoders have been utilized to transmit traffic sign semantics, improving classification and reconstruction performance, particularly in scenarios involving poor visibility and satellite assistance [38]. Reinforcement learning-based decision mechanisms have also been introduced at the receiver to enhance adaptive interpretation of transmitted semantics [38,39].
Multi-user SC frameworks have further extended this line of work by exploiting semantic correlations among users [40]. The cooperative SC framework jointly optimizes encoding and decoding across multiple transmitters to eliminate semantic redundancy while preserving distinct task-relevant features [41]. By sharing high-level semantic content rather than full-resolution data, this approach has demonstrated advantages in scenarios such as cooperative image retrieval and real-time tracking, where it can significantly reduce communication overhead [42].
Additionally, SC has shown strong potential in intelligent traffic management, where task-specific systems such as vehicle counting [43] and multi-task vehicle recognition [44] prioritize semantic features for accurate and efficient decision-making. Feature importance ranking and dynamic rate allocation strategies have also been explored to enhance robustness in low signal-to-noise environments.
While the aforementioned studies have explored incorporating external knowledge bases into SC systems, such approaches often remain limited in scope. Typically, they rely on predefined semantic priors without additional refinement, selective filtering, or task-driven structuring. This lack of adaptive organization and semantic compression mechanisms leads to redundant transmission of background information and suboptimal alignment with communication goals. Consequently, these methods fail to fully exploit the potential of knowledge bases to support efficient and robust semantic communication in the highly constrained and dynamic contexts characteristic of ITS environments.
In contrast, our proposed OKBASC framework adopts a structured knowledge-driven approach that moves semantic segmentation to an offline stage. By pre-segmenting training images and applying a Top-K Sparse Attention (TKSA) mechanism, OKBASC constructs a compact and reusable semantic knowledge base that encodes task-relevant dynamic elements (e.g., vehicles, pedestrians) in a structured form. During transmission, this offline repository provides strong priors for attention-guided feature encoding, enabling selective semantic preservation with reduced computational burden on edge devices. By combining the offline knowledge base with a BRA module for hierarchical semantic feature refinement, OKBASC offers a robust and scalable solution tailored to the bandwidth-limited and latency-sensitive demands of ITS communication.
Table 2 compares the OKBASC framework with representative SC research, highlighting the methods and limitations of previous research along with the novel advantages offered by our approach.
3. Offline Knowledge Base and Attention-Driven Semantic Communication
This section describes the proposed OKBASC framework, including its overall architecture, offline semantic knowledge base construction, and semantic encoding with BRA, along with the channel transmission and decoding process.
3.1. Overall Architecture
The overall architecture of the proposed OKBASC framework is illustrated in Figure 2. OKBASC is designed to enable efficient and robust semantic communication in ITS scenarios by combining an offline knowledge base with attention-driven encoding and decoding stages, as summarized in Algorithm 1. This design aims to reduce redundant transmission and preserve task-relevant information under constrained bandwidth conditions.
Algorithm 1: OKBASC Framework
Input: Input image $\mathbf{X}$; offline knowledge base $\mathcal{K}$
Output: Reconstructed semantic image $\hat{\mathbf{X}}$
1:  Phase 1: Offline Knowledge Base Construction
2:    Build semantic mask repository $\mathcal{K}$ from training data using Algorithm 2
3:  Phase 2: Semantic Feature Encoding
4:    Fuse input $\mathbf{X}$ with relevant masks from $\mathcal{K}$ using sparse attention
5:    $\mathbf{F} \leftarrow \mathrm{Fuse}(\mathbf{X}, \mathcal{K})$
6:    $\mathbf{F}' \leftarrow \mathrm{BRA}(\mathbf{F})$ (Algorithm 3)
7:  Phase 3: Channel Transmission
8:    $\mathbf{z} \leftarrow \mathrm{ChannelEncoder}(\mathbf{F}')$
9:    $\hat{\mathbf{z}} \leftarrow \mathbf{z} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ // simulate AWGN
10:   $\hat{\mathbf{F}} \leftarrow \mathrm{ChannelDecoder}(\hat{\mathbf{z}})$
11: Phase 4: Semantic Decoding and Reconstruction
12:   Predict spatial semantic mask: $\mathbf{M} \leftarrow \sigma(\mathrm{Conv}(\hat{\mathbf{F}}))$
13:   Apply mask to features: $\tilde{\mathbf{F}} \leftarrow \hat{\mathbf{F}} \odot \mathbf{M}$
14:   Decode masked features: $\hat{\mathbf{X}} \leftarrow \mathrm{Dec}(\tilde{\mathbf{F}})$
15:   Upsample $\hat{\mathbf{X}}$ to the original resolution via bilinear interpolation
16: return $\hat{\mathbf{X}}$
A central component of OKBASC is its semantic knowledge base, which is constructed offline in advance of communication. Training images are pre-segmented to extract semantic regions corresponding to dynamic task-relevant elements such as vehicles, pedestrians, and traffic signs. A TKSA mechanism is then applied to these segmentation masks to filter out background regions, resulting in a compact repository of reusable semantic masks. This offline knowledge base serves as a strong prior during runtime without imposing additional computational burden on edge devices.
During communication, the system begins with an input image that is fused with relevant masks retrieved from the offline knowledge base using sparse attention, generating a semantic-aware feature representation that selectively emphasizes important dynamic content while suppressing visual redundancy. This fused semantic representation is then refined by the BRA block, which employs a two-branch strategy: global top-K channel selection highlights the most discriminative semantic channels, while local sparse attention within spatial windows enhances fine-grained details. Notably, the top-K selection here differs in purpose from that used in offline knowledge base construction: whereas the offline TKSA mechanism selects spatially salient regions to build static semantic masks, the BRA module's global top-K routing dynamically filters semantic channels to adapt the encoding to each input image. In this way, BRA ensures that the final semantic vector is both compact and expressive, making it suitable for transmission over bandwidth-constrained and noisy channels.
The refined semantic features are subsequently processed by a semantic encoder and transformed into a transmittable latent code using a channel encoder. This code is sent over a simulated wireless channel modeled with Additive White Gaussian Noise (AWGN), then recovered by a channel decoder at the receiver. During reconstruction, the semantic decoder not only uses the transmitted features but also incorporates relevant entries from the offline knowledge base to guide and refine the recovered content, ensuring consistency with precomputed semantic priors. By leveraging both the transmitted latent features and the stored semantic knowledge, the decoder emphasizes task-relevant dynamic regions while suppressing irrelevant background content. The final output is then upsampled to the original resolution via bilinear interpolation, yielding a high-fidelity image aligned with ITS communication requirements while minimizing bandwidth usage.
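To make the runtime flow concrete, the PyTorch sketch below traces Phases 2–4 of Algorithm 1 under simplifying assumptions: the module names (OKBASCPipeline, semantic_encoder, mask_decoder), the layer shapes, and the fusion-by-elementwise-masking shortcut are illustrative stand-ins rather than the actual OKBASC implementation, whose components are detailed in Sections 3.2, 3.3, and 3.4.

```python
# A minimal, self-contained sketch of the OKBASC runtime pipeline (Phases 2-4
# of Algorithm 1). All module internals and dimensions here are assumptions
# made for illustration; they are not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OKBASCPipeline(nn.Module):
    def __init__(self, feat_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.semantic_encoder = nn.Conv2d(3, feat_dim, 3, stride=4, padding=1)
        self.channel_encoder = nn.Linear(feat_dim, latent_dim)    # Eq. (12)
        self.channel_decoder = nn.Linear(latent_dim, feat_dim)    # Eq. (14)
        self.mask_decoder = nn.Sequential(                        # Eq. (15)
            nn.Conv2d(feat_dim, 1, 3, padding=1), nn.Sigmoid())
        self.semantic_decoder = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, x, kb_mask, snr_db=10.0):
        # Phase 2: fuse the input with a retrieved knowledge-base mask
        # (elementwise masking stands in for the sparse-attention fusion).
        f = self.semantic_encoder(x * kb_mask)
        b, c, h, w = f.shape
        z = self.channel_encoder(f.flatten(2).mean(-1))   # compact latent code
        # Phase 3: AWGN channel with noise power set by the target SNR (Eq. (13)).
        sigma2 = z.pow(2).mean() / (10 ** (snr_db / 10))
        z_hat = z + torch.randn_like(z) * sigma2.sqrt()
        # Phase 4: decode, apply the predicted spatial mask, and reconstruct.
        f_hat = self.channel_decoder(z_hat)[..., None, None].expand(b, c, h, w)
        f_hat = f_hat * self.mask_decoder(f_hat)          # Eq. (16)
        x_hat = self.semantic_decoder(f_hat)
        return F.interpolate(x_hat, size=x.shape[-2:],    # Eq. (17)
                             mode="bilinear", align_corners=False)

# Usage: reconstruct a 256x256 image at SNR = 5 dB with an all-ones mask.
model = OKBASCPipeline()
x_hat = model(torch.rand(1, 3, 256, 256), torch.ones(1, 1, 256, 256), snr_db=5.0)
```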
3.2. Offline Knowledge Base Construction via TKSA
To support efficient semantic encoding for ITS, we construct a reusable semantic knowledge base using a TKSA mechanism. The idea is to move the computationally expensive process of semantic segmentation to an offline stage. In this stage, the system precomputes and stores compact semantic masks that capture task-relevant dynamic content such as vehicles, pedestrians, and traffic signs while ignoring largely static urban backgrounds such as buildings or road infrastructure [46]. This approach reduces edge-side computation and communication overhead by ensuring that only essential dynamic semantics are emphasized during transmission.
Unlike standard attention mechanisms that assign soft weights to all features, TKSA enforces hard sparsity by selecting only the top-K most relevant spatial positions for each query. This approach yields lightweight and interpretable attention maps: only the most task-relevant regions receive non-zero weights, making it easy to identify and isolate dynamic semantic content for mask generation while ignoring static background regions [47]. These binary masks, generated once offline, form a compact and reusable knowledge base that guides semantic fusion with input images during communication [48]. It is worth noting that our adaptation of TKSA differs from that proposed in image de-raining [49], which focuses on selective feature restoration within transformer-based de-raining networks. In contrast, we repurpose sparse attention for offline semantic mask extraction in intelligent transportation scenarios, targeting salient dynamic objects such as vehicles, traffic lights, and pedestrians in order to build a concise and reusable semantic knowledge base tailored to ITS communication needs.
As shown in Figure 3, our TKSA module operates on original training images $\mathbf{X} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size, $C$ denotes the number of channels, and $H \times W$ is the image size. The goal of TKSA is to identify key semantic regions in each image through a sparse attention mechanism, producing binary masks that highlight the most informative spatial areas. To perform attention over spatial positions, the input image is first reshaped into a sequence of tokens:

$$\mathbf{X}_{\mathrm{seq}} \in \mathbb{R}^{B \times N \times C}, \quad N = H \cdot W. \tag{1}$$

The query, key, and value matrices are computed via learned linear projections:

$$\mathbf{Q} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_V, \tag{2}$$

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{C \times d}$ are trainable parameters and $d$ is the attention embedding dimension. For each query token $\mathbf{q}_i$, the dot-product similarity with all keys is computed as

$$s_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}. \tag{3}$$

Instead of attending to all positions, TKSA performs top-$K$ selection to retain only the most relevant attention scores:

$$\Omega_i = \operatorname{TopK}_j\left(s_{ij}\right), \tag{4}$$

where $\Omega_i$ denotes the indices of the top-$K$ keys most relevant to query $\mathbf{q}_i$. Sparse attention weights are then computed by applying the softmax function only over the selected positions:

$$\alpha_{ij} = \begin{cases} \dfrac{\exp(s_{ij})}{\sum_{j' \in \Omega_i} \exp(s_{ij'})}, & j \in \Omega_i, \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$

The final output for each token is computed as a weighted sum of values:

$$\mathbf{o}_i = \sum_{j \in \Omega_i} \alpha_{ij}\mathbf{v}_j. \tag{6}$$

This yields $\mathbf{O} \in \mathbb{R}^{B \times N \times d}$, from which semantic importance maps are derived. These maps are then binarized (e.g., via thresholding) to produce final semantic masks that highlight dynamic task-relevant regions. By repeating this process for all training images, TKSA builds a compact offline semantic knowledge base of reusable masks that can be fused with runtime input to guide semantic encoding and improve communication efficiency in ITS scenarios. The procedure for constructing the offline semantic knowledge base using TKSA is summarized in Algorithm 2.
Algorithm 2: Offline Knowledge Base Construction via TKSA
Input: Semantic training dataset $\mathcal{D}$
Output: Semantic knowledge base $\mathcal{K}$
1: $\mathcal{K} \leftarrow \emptyset$
2: for each image $\mathbf{X} \in \mathcal{D}$ do
3:   Reshape $\mathbf{X}$ into a token sequence (Equation (1))
4:   Compute $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ via learned projections (Equation (2))
5:   For each query, select the top-$K$ keys and compute sparse attention (Equations (3)–(6))
6:   Derive a semantic importance map from the attention output $\mathbf{O}$
7:   Binarize the importance map via thresholding to obtain a semantic mask
8:   Add the mask to $\mathcal{K}$
9: end for
10: return $\mathcal{K}$
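As a concrete reading of Equations (1)–(6), the following PyTorch sketch implements single-head top-K sparse attention over spatial tokens and binarizes the resulting importance map into a semantic mask. The class name, the single-head formulation, and the median threshold are assumptions; the paper leaves these implementation choices unspecified.

```python
# A minimal sketch of TKSA mask extraction (Eqs. (1)-(6)); a single head and a
# median binarization threshold are assumed choices for illustration.
import torch
import torch.nn as nn

class TKSA(nn.Module):
    def __init__(self, channels: int, dim: int, top_k: int):
        super().__init__()
        self.q = nn.Linear(channels, dim, bias=False)   # W_Q in Eq. (2)
        self.k = nn.Linear(channels, dim, bias=False)   # W_K in Eq. (2)
        self.v = nn.Linear(channels, dim, bias=False)   # W_V in Eq. (2)
        self.top_k, self.scale = top_k, dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, N, C), Eq. (1)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        scores = q @ k.transpose(-2, -1) * self.scale    # Eq. (3)
        # Hard sparsity: keep the top-K scores per query (Eq. (4)); the rest
        # are set to -inf so the softmax assigns them zero weight (Eq. (5)).
        kth = scores.topk(self.top_k, dim=-1).values[..., -1:]
        attn = scores.masked_fill(scores < kth, float("-inf")).softmax(-1)
        out = attn @ v                                   # Eq. (6)
        imp = out.abs().mean(-1)                         # per-token importance
        thr = imp.quantile(0.5, dim=1, keepdim=True)     # assumed threshold
        return (imp > thr).float().reshape(b, 1, h, w)   # binary semantic mask

# Usage: derive a mask for one RGB training image.
mask = TKSA(channels=3, dim=32, top_k=16)(torch.rand(1, 3, 32, 32))
```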
3.3. Bi-Level Routing Attention for Semantic Encoding
To enhance semantic feature extraction, we introduce a BRA block that hierarchically captures both global discriminative semantics and local structural details [50]. Unlike traditional convolutional pooling that performs uniform spatial downsampling, BRA adopts a two-stage sparse attention mechanism to achieve selective information routing. This is particularly useful in semantic communication scenarios, where retaining task-relevant semantics while reducing redundancy is essential.
Our BRA block (as shown in Figure 4) is inspired by the bi-level sparse attention used in BiFormer [32], but is customized for semantic feature enhancement in offline knowledge-based communication systems. It operates on semantic-aware features $\mathbf{F} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, and $H \times W$ is the spatial resolution. The goal is to apply global token routing for coarse selection and local refinement within windows for fine-grained discrimination. To this end, the feature map is first partitioned into non-overlapping windows of size $w \times w$, yielding $N_w = (H/w) \times (W/w)$ windows per image. Each window is treated as a separate unit for local attention.

We compute a global attention token $\mathbf{t} \in \mathbb{R}^{C}$ by aggregating spatial statistics across each channel. In our case, global average pooling is replaced with a semantic-aware summary over the TKSA-enhanced features $\tilde{\mathbf{F}}$:

$$\mathbf{t} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \tilde{\mathbf{F}}_{:,i,j}. \tag{7}$$

The global token $\mathbf{t}$ is used as the query to compute similarity with learnable channel keys $\mathbf{K}_c \in \mathbb{R}^{C \times C}$:

$$\mathbf{s} = \mathbf{K}_c\,\mathbf{t} \in \mathbb{R}^{C}. \tag{8}$$

Then, the top-$K$ channel indices $\Omega_c$ are selected:

$$\Omega_c = \operatorname{TopK}(\mathbf{s}). \tag{9}$$

Only those channels in $\Omega_c$ are retained for the next step, effectively filtering out semantically irrelevant channels.

For each window $\mathcal{W}_m$ (after selecting the top-$K$ channels), we flatten the spatial locations and perform local multi-head attention. Given the reshaped window tokens $\mathbf{Z} \in \mathbb{R}^{w^2 \times K}$, with

$$\mathbf{Q} = \mathbf{Z}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{Z}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{Z}\mathbf{W}_V,$$

the attention score between each query $\mathbf{q}_i$ in the window and key $\mathbf{k}_j$ is computed as

$$s_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}. \tag{10}$$

Sparse attention is applied by selecting the top-$k$ spatial positions $\Omega_i$ for each $\mathbf{q}_i$:

$$\Omega_i = \operatorname{TopK}_j\left(s_{ij}\right). \tag{11}$$

To compute the sparse attention weights within each local window, we adopt the formulation in Equation (5). In this context, the equation is applied to semantic-aware feature tokens within each local window after global channel filtering, enabling the model to selectively attend to fine-grained spatial details and enhancing the local semantic representation. The procedure for applying BRA to semantic-aware features is summarized in Algorithm 3.
Algorithm 3: Bi-Level Routing Attention (BRA)
Input: Semantic-aware feature map $\mathbf{F}$
Output: Compact semantic representation $\mathbf{F}'$
1: Compute the global token $\mathbf{t}$ via the semantic-aware summary (Equation (7))
2: Score all channels against the learnable keys (Equation (8)) and select the top-$K$ channel indices $\Omega_c$ (Equation (9))
3: Retain the channels in $\Omega_c$ and partition the feature map into non-overlapping $w \times w$ windows
4: for each window do
5:   Compute local attention scores (Equation (10)) and select the top-$k$ spatial positions per query (Equation (11))
6:   Apply the sparse softmax of Equation (5) and aggregate the values
7: end for
8: return the refined representation $\mathbf{F}'$
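The PyTorch sketch below illustrates the two-stage routing of Equations (7)–(11): a global channel summary is scored against learnable keys to keep the top channels, then windowed attention with hard top-k sparsity refines the retained features. Window partitioning via unfold, a single attention head, and plain average pooling for the global token (in place of the TKSA-derived summary) are simplifying assumptions.

```python
# A minimal sketch of Bi-Level Routing Attention (Eqs. (7)-(11)). H and W are
# assumed divisible by the window size; reassembling the refined window tokens
# into a spatial map is omitted for brevity.
import torch
import torch.nn as nn

class BRABlock(nn.Module):
    def __init__(self, channels: int, top_c: int, window: int, top_k: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(channels, channels))  # channel keys
        self.qkv = nn.Linear(top_c, 3 * top_c, bias=False)
        self.top_c, self.window, self.top_k = top_c, window, top_k

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Global routing: channel summary (Eq. (7)) scored against learnable
        # keys (Eq. (8)); keep only the top channels (Eq. (9)).
        t = f.mean(dim=(2, 3))                                     # (B, C)
        idx = (t @ self.keys).topk(self.top_c, dim=-1).indices
        f_sel = f.gather(1, idx[:, :, None, None].expand(-1, -1, h, w))
        # Local refinement: partition into non-overlapping windows and flatten.
        ws = self.window
        z = (f_sel.unfold(2, ws, ws).unfold(3, ws, ws)
                  .permute(0, 2, 3, 4, 5, 1)
                  .reshape(b, -1, ws * ws, self.top_c))  # (B, Nw, ws*ws, C')
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        attn = q @ k.transpose(-2, -1) * self.top_c ** -0.5        # Eq. (10)
        # Top-k spatial sparsity per query (Eq. (11)) + masked softmax (Eq. (5)).
        kth = attn.topk(self.top_k, dim=-1).values[..., -1:]
        attn = attn.masked_fill(attn < kth, float("-inf")).softmax(-1)
        return attn @ v                                  # refined window tokens

# Usage: route a 64-channel feature map, keeping 16 channels and 8 positions.
out = BRABlock(64, top_c=16, window=4, top_k=8)(torch.rand(1, 64, 32, 32))
```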
3.4. Channel Transmission and Decoding
After extracting semantic features via the BRA module, we obtain a compact representation denoted as $\mathbf{f} \in \mathbb{R}^{B \times D}$, where $B$ is the batch size and $D$ is the flattened feature dimension. This representation is passed to the channel encoder, which performs a linear projection to map the semantic features into a lower-dimensional latent code $\mathbf{z} \in \mathbb{R}^{B \times d_z}$, as follows:

$$\mathbf{z} = \mathbf{f}\mathbf{W}_e + \mathbf{b}_e, \tag{12}$$

where $\mathbf{W}_e \in \mathbb{R}^{D \times d_z}$ and $\mathbf{b}_e \in \mathbb{R}^{d_z}$ are learnable parameters of the encoder.

To simulate realistic wireless environments, we adopt an AWGN channel. The transmitted code $\mathbf{z}$ is perturbed by zero-mean Gaussian noise $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, where $\sigma^2$ is determined by the SNR:

$$\hat{\mathbf{z}} = \mathbf{z} + \mathbf{n}, \quad \sigma^2 = \frac{P_z}{10^{\mathrm{SNR}/10}}, \tag{13}$$

where $P_z$ denotes the average power of the transmitted code. The received noisy code $\hat{\mathbf{z}}$ is then passed through the channel decoder, which performs a reverse linear projection to reconstruct the semantic features:

$$\hat{\mathbf{f}} = \hat{\mathbf{z}}\mathbf{W}_d + \mathbf{b}_d, \tag{14}$$

where $\mathbf{W}_d \in \mathbb{R}^{d_z \times D}$ and $\mathbf{b}_d \in \mathbb{R}^{D}$ are decoder parameters.

Finally, the recovered semantic vector $\hat{\mathbf{f}}$ is passed to the semantic decoder to reconstruct the semantic-aware image $\hat{\mathbf{X}}$. When adaptive semantic compression is enabled, the semantic decoder incorporates an auxiliary mask decoder to generate a spatial attention mask $\mathbf{M} \in [0, 1]^{H' \times W'}$. This mask guides the reconstruction process by emphasizing task-relevant spatial regions while suppressing irrelevant background content. Given the semantic feature map $\hat{\mathbf{F}}$, obtained by reshaping $\hat{\mathbf{f}}$ into spatial form, the mask decoder applies a CNN to predict the attention mask:

$$\mathbf{M} = \sigma\left(\mathrm{Conv}(\hat{\mathbf{F}})\right), \tag{15}$$

where $\mathrm{Conv}(\cdot)$ denotes a convolutional subnetwork and $\sigma(\cdot)$ is the sigmoid activation function that normalizes the output to the $[0, 1]$ range. The resulting attention mask is applied to the semantic feature map via element-wise multiplication:

$$\tilde{\mathbf{F}} = \hat{\mathbf{F}} \odot \mathbf{M}. \tag{16}$$

The masked features are then decoded and upsampled via bilinear interpolation to generate the final reconstructed image:

$$\hat{\mathbf{X}} = \mathrm{Upsample}\left(\mathrm{Dec}(\tilde{\mathbf{F}})\right). \tag{17}$$

This mask-guided decoding mechanism improves semantic reconstruction quality under bandwidth-limited conditions by focusing on semantically informative regions. This design follows previous deep learning-enabled semantic communication systems [35] while being specifically tailored to leverage our offline knowledge base and attention-guided features for ITS scenarios. The procedure for channel encoding, transmission with noise, decoding, and reconstruction is detailed in Algorithm 4.
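For clarity, a minimal sketch of the AWGN perturbation in Equation (13) is given below; calibrating the noise variance from the measured latent power is an assumed convention, and the tensor shapes are placeholders.

```python
# A minimal sketch of the AWGN channel in Eq. (13): the noise variance is set
# from the average latent power and the target SNR in dB (assumed convention).
import torch

def awgn(z: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Perturb a latent code with zero-mean Gaussian noise at a given SNR."""
    signal_power = z.pow(2).mean()                      # average power of z
    sigma2 = signal_power / (10.0 ** (snr_db / 10.0))   # sigma^2 = P / 10^(SNR/10)
    return z + torch.randn_like(z) * sigma2.sqrt()

# Usage: simulate the 5-25 dB conditions evaluated in Section 5.
z = torch.randn(8, 64)                                  # hypothetical latent codes
noisy = {snr: awgn(z, snr) for snr in (5, 10, 15, 20, 25)}
```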
It can be seen that the overall design of OKBASC addresses several key limitations in existing SC systems for intelligent transportation. By moving semantic segmentation to an offline stage through the TKSA module, our method avoids the high computational costs of real-time feature selection while retaining structured task-relevant priors that can be reused across diverse traffic scenarios. The BRA module then performs hierarchical routing attention during encoding, improving semantic discriminability with minimal computational overhead. In the decoding stage, the pre-shared masks from the TKSA knowledge base guide the reconstruction process by filtering redundant background, while the BRA-refined features provide discriminative semantic cues, jointly supporting accurate semantic reconstruction even under noisy channel conditions. Together, these modules form a lightweight yet robust SC pipeline that reduces redundancy, improves semantic fidelity, and adapts to bandwidth constraints.
Algorithm 4: Channel Transmission and Decoding
Input: Semantic representation $\mathbf{f}$
Output: Reconstructed image $\hat{\mathbf{X}}$
1: $\mathbf{z} \leftarrow \mathbf{f}\mathbf{W}_e + \mathbf{b}_e$ // channel encoding (Equation (12))
2: $\hat{\mathbf{z}} \leftarrow \mathbf{z} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ // AWGN channel (Equation (13))
3: $\hat{\mathbf{f}} \leftarrow \hat{\mathbf{z}}\mathbf{W}_d + \mathbf{b}_d$ // channel decoding (Equation (14))
4: Reshape $\hat{\mathbf{f}}$ into the feature map $\hat{\mathbf{F}}$ and predict the mask $\mathbf{M} \leftarrow \sigma(\mathrm{Conv}(\hat{\mathbf{F}}))$ (Equation (15))
5: $\tilde{\mathbf{F}} \leftarrow \hat{\mathbf{F}} \odot \mathbf{M}$ (Equation (16))
6: $\hat{\mathbf{X}} \leftarrow \mathrm{Upsample}(\mathrm{Dec}(\tilde{\mathbf{F}}))$ (Equation (17))
7: return $\hat{\mathbf{X}}$
4. Experiment Setup
This section describes the experimental setup, datasets, and parameter configurations.
4.1. Datasets
To comprehensively evaluate the effectiveness and generalizability of our proposed semantic communication system, we select two representative datasets spanning both general visual tasks and real-world intelligent transportation scenarios. Specifically, we use the VOC2012 dataset for baseline semantic compression benchmarking and the nuPlan dataset for validation in dynamic urban environments.
The VOC2012 dataset [51] is a widely used benchmark designed for evaluating object recognition and semantic segmentation in complex natural scenes. It contains over 11,000 annotated RGB images covering 20 diverse object categories such as people, animals, vehicles, and household items, often with multiple objects per image and challenging backgrounds. We directly use its standard official training/validation split without further partitioning. We use VOC2012 to validate the effectiveness of our model in preserving critical object-level information under different SNR and channel conditions, demonstrating its suitability for SC in general visual tasks.
The nuPlan dataset [52] is a large-scale autonomous driving benchmark designed to capture complex urban scenes. In our experiments, we use a subset of the nuPlan-v1.1 mini dataset containing 5100 semantic-aware keyframes from the front-facing camera (CAM_F0) stream, specifically extracted from the recording captured on 12 May 2021 at 22:00:38 for vehicle 35 (frames 01008–01518). The data are split into 80% for training and 20% for validation to ensure consistent evaluation. The data include dynamic interactions among vehicles, pedestrians, and traffic infrastructure, and exhibit challenges such as occlusion, variable lighting, and complex road geometries. This makes the selected subset particularly suitable for evaluating the ability of our system to preserve task-relevant semantics under bandwidth constraints and channel noise in the context of intelligent transportation systems.
This combination of datasets ensures that our evaluation spans both broad-domain visual understanding and domain-specific traffic scene challenges. VOC2012 was chosen not only for its wide adoption in the semantic segmentation community but also for its manageable size and well-curated annotations, which allow for controlled experimentation without excessive computational cost. Its diverse object categories and complex backgrounds enable us to test the model’s ability to retain fine-grained semantics in general visual tasks, providing a strong reference for cross-domain generalization. For nuPlan, we use a representative subset rather than the full dataset because the full corpus contains over 1500 h of driving data, which would demand prohibitively high computational and storage resources for training and evaluation. The selected subset still covers diverse road types, lighting conditions, and traffic compositions, ensuring experimental diversity while keeping the computation tractable and aligned with our focus on resource-constrained and edge-deployable semantic communication systems. Together, these datasets provide a balanced and efficient evaluation setting.
4.2. Parameter Configuration
To ensure fair comparisons, the implementation details and training configurations are specified as follows. All models are trained using the Adam optimizer with default momentum parameters. We adopt a fixed learning rate and batch size across all experiments, and both datasets are trained for the same number of epochs under identical training protocols. The detailed training hyperparameters are summarized in Table 3.
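For reference, the sketch below shows this training setup in PyTorch; the network stand-in and all numeric values are placeholders, not the hyperparameters reported in Table 3.

```python
# A minimal sketch of the training configuration described above: Adam with
# default momentum parameters and a fixed learning rate and batch size.
# The model stand-in and numeric values are placeholders, not Table 3's values.
import torch

model = torch.nn.Linear(64, 64)                       # stand-in for OKBASC
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                 # fixed learning rate
                             betas=(0.9, 0.999))      # default momentum terms
batch_size = 32                                       # fixed across experiments
```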
The experiments are conducted on a workstation equipped with an Intel Core i7-12700F processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM (Kingston Technology Corp., Fountain Valley, CA, USA), and two RTX 3060 GPUs (ASUSTeK Computer Inc., Taipei, Taiwan; 12 GB each), using CUDA 12.2 for computation. The implementation is based on Python 3.8.17 and PyTorch 1.9.0 running on the Windows 11 operating system.
5. Results and Discussion
This section describes the results used to evaluate the proposed OKBASC framework. We conduct two sets of experiments: (1) comparative evaluations against two baseline methods, BRASC [50] and LAMSC [45], under various SNR conditions, to demonstrate OKBASC's advantages in semantic compression accuracy, reconstruction quality, and robustness; and (2) ablation studies to analyze the individual contributions of key components such as TKSA and BRA, validating their roles in enhancing semantic representation and transmission performance in ITS scenarios.
5.1. Comparative Experiment
To evaluate the effectiveness of the proposed OKBASC framework, we conduct a series of comparative experiments against two representative baseline methods, namely, LAMSC and BRASC. These experiments are performed under varying channel conditions simulated by different SNR levels, with a focus on training loss convergence and semantic reconstruction performance. The goal is to assess how well each method handles noise and preserves task-relevant information during transmission and recovery. Specifically, we select SNR levels of 5, 10, 15, 20, and 25 dB in order to represent a wide range of realistic communication scenarios that are commonly encountered in ITS applications, from highly noisy channels (5 dB) to relatively clean wireless environments (25 dB).
Figure 5 shows the training loss trajectories of OKBASC, LAMSC, and BRASC under different SNR conditions (5, 10, 15, 20, and 25 dB). Across all noise levels, OKBASC achieves the lowest final training loss.
At low SNR levels (e.g., SNR = 5 dB), the loss curves show the greatest divergence among methods. OKBASC converges rapidly within the first 10–20 epochs, stabilizing at a substantially lower loss floor with minimal oscillations. In contrast, BRASC exhibits pronounced fluctuations and less stable training, while LAMSC converges more smoothly than BRASC but still reaches a higher loss value than OKBASC. These results highlight OKBASC's robustness under severe channel noise, benefiting from its offline knowledge base and attention-guided fusion that help to suppress irrelevant regions prior to transmission.
At moderate SNR levels (10 and 15 dB), all methods exhibit smoother convergence, reflecting reduced channel distortion; however, OKBASC maintains a clear advantage in final loss values and convergence speed. It consistently reaches lower loss than LAMSC and BRASC after early epochs, with fewer oscillations even as the amount of noise decreases. Notably, LAMSC narrows its gap with OKBASC at SNR = 15 dB, suggesting that methods without offline segmentation can still leverage moderate channel conditions but remain limited by less discriminative feature selection.
At high SNR levels (20 and 25 dB), the performance gap among methods narrows further. Loss curves become more stable overall, and all systems benefit from reduced channel perturbations. OKBASC retains its advantage with the lowest final loss, but with a smaller relative margin. BRASC and LAMSC both achieve smoother training under these cleaner conditions, though BRASC continues to show higher variance and worse minima. This indicates that even with improved channel quality, the absence of an explicit knowledge base in BRASC limits its ability to consistently suppress background noise and preserve task-relevant regions.
We further evaluate semantic reconstruction quality using the Structural Similarity Index Measure (SSIM) [53] under varying SNR conditions on both the nuPlan-v1.1 mini camera0 subset and the VOC2012 dataset.
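For reference, the snippet below shows how SSIM between a reference image and its reconstruction can be computed with scikit-image; this is an assumed stand-in for our evaluation code, and the images are synthetic placeholders.

```python
# A minimal SSIM computation using scikit-image as an assumed stand-in for the
# paper's evaluation pipeline; the inputs here are synthetic placeholders.
import numpy as np
from skimage.metrics import structural_similarity as ssim

ref = np.random.rand(256, 256, 3).astype(np.float32)          # original image
rec = (ref + 0.05 * np.random.randn(256, 256, 3)).astype(np.float32)  # noisy recon
score = ssim(ref, rec, channel_axis=-1, data_range=1.0)
print(f"SSIM: {score:.3f}")
```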
For the VOC2012 dataset, as shown in Figure 6, the SSIM results under varying SNR conditions demonstrate clear differences among the compared methods. Across all SNR levels from 5 dB to 25 dB, OKBASC consistently achieves the highest SSIM scores. Even under low SNR (5 dB), OKBASC maintains SSIM values above 0.94, while LAMSC drops to around 0.83 and BRASC remains much lower, close to 0.54. While all methods show gradual gains in SSIM as the channel SNR improves, their relative ranking stays consistent. This performance gap is especially important for VOC2012's varied scenes, where irrelevant background details increase the difficulty of reconstruction. OKBASC's higher SSIM indicates its ability to leverage offline semantic masks and attention-guided compression to filter redundancy while preserving essential object-level semantics. Compared to LAMSC, OKBASC offers better noise resilience and semantic consistency. BRASC's consistently lower SSIM across all conditions suggests that its simpler attention mechanism struggles to retain fine-grained structure in cluttered multi-object images.
For the nuPlan dataset, as shown in Figure 7, the SSIM results reveal a similar advantage for OKBASC. OKBASC consistently outperforms LAMSC and BRASC across all tested SNR levels, maintaining high SSIM values even under severe noise. At SNR = 5 dB, OKBASC achieves SSIM near 0.98, while LAMSC drops to around 0.86 and BRASC falls below 0.57. As SNR increases, OKBASC continues to deliver stable, high-quality semantic reconstructions with minimal degradation. LAMSC improves with SNR but shows a wider gap relative to OKBASC at lower SNRs, highlighting its greater sensitivity to noise. BRASC shows consistently lower SSIM values with limited improvement as SNR rises, reflecting its difficulty in preserving critical semantic details in dynamic traffic scenes with complex interactions among vehicles, pedestrians, traffic signals, and road geometry. These results confirm the robustness and effectiveness of OKBASC in preserving task-relevant semantic information under bandwidth constraints and noisy channels, particularly in real-world intelligent transportation settings.
In addition to the quantitative SSIM comparisons, Figure 8 presents qualitative reconstruction results for the VOC2012 and nuPlan datasets at SNR levels of 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB. Consistent with the SSIM trends, OKBASC preserves object contours and suppresses background noise more effectively than LAMSC and BRASC across all SNRs. LAMSC shows moderate noise suppression but exhibits noticeable blurring of fine details, especially under low SNR. BRASC suffers from severe noise artifacts and structural distortion, indicating its limited capability to maintain semantic fidelity under challenging channel conditions. These examples visually validate the quantitative findings and further demonstrate the robustness of OKBASC in preserving task-relevant semantics.
5.2. Ablation Study
To further analyze the contribution of each component in our OKBASC framework, we conduct ablation experiments on both the VOC2012 and nuPlan datasets under varying SNR conditions. Table 4 and Table 5 report the SSIM performance when selectively removing key modules, namely TKSA and the BRA block.
For the VOC2012 dataset, removing TKSA leads to the most pronounced SSIM degradation, particularly under low SNR (e.g., SSIM drops from 0.94 to 0.85 at 5 dB), and remains lower even at high SNR (from 0.87 to 0.82 at 25 dB). This highlights TKSA’s importance in enforcing hard sparsity during offline knowledge base construction to filter out redundant static background. Similarly, removing the BRA block causes a noticeable performance drop (e.g., 0.83 at 5 dB and 0.79 at 25 dB), confirming its role in refining global and local semantic features. When both modules are removed, the SSIM plummets further (e.g., 0.78 at 5 dB and 0.72 at 25 dB), demonstrating their complementary benefits in supporting semantic robustness under noisy conditions.
On the nuPlan dataset, OKBASC achieves relatively higher SSIM scores across all SNR levels due to the more consistent spatial layouts of traffic scenes. Yet, the absence of BRA still results in considerable degradation (e.g., SSIM drops from 0.88 to 0.79 at 5 dB and from 0.90 to 0.81 at 25 dB), highlighting BRA’s importance in preserving fine-grained dynamic features essential for understanding traffic participants and road geometry. TKSA also contributes significantly, with SSIM dropping to 0.81 at 5 dB and 0.84 at 25 dB when removed. The combined removal of both modules yields the lowest performance (0.65 at 5 dB and 0.67 at 25 dB), confirming their joint contribution to accurate semantic representation and resilience to channel noise.
These results validate the critical importance of both TKSA and BRA to OKBASC’s performance. TKSA ensures compact and focused semantic representations, while BRA enhances spatial detail through hierarchical refinement. Together, they form a robust semantic compression–reconstruction pipeline, enabling OKBASC to maintain high task-relevant fidelity across diverse SNR conditions in both complex natural scenes and structured driving scenes.