1. Introduction
Intelligent Transportation Systems (ITS) are widely seen as a cornerstone of safe, efficient, and sustainable urban mobility [1]. By integrating sensing, communication, and computing technologies, these systems aim to reduce congestion, minimize accidents, and improve the overall quality of transportation services [2]. The proliferation of connected vehicles and roadside units, coupled with emerging applications such as autonomous driving, remote diagnostics, and in-vehicle entertainment, has led to explosive growth in data traffic across ITS infrastructures [3]. This surge in mobile data flows intensifies pressure on communication networks, exacerbating issues such as spectrum scarcity, limited bandwidth, and stringent latency constraints [4,5,6].
Despite advances in 4G/5G networks and vehicular ad hoc networks, traditional communication architectures largely follow Shannon's information theory, focusing on reliable symbol-level transmission rather than task-relevant content [7,8]. This approach often results in inefficient use of bandwidth, as large volumes of raw sensor data (including high-resolution images, LiDAR scans, and radar measurements) must be transmitted without accounting for their semantic relevance to downstream tasks [9,10]. Moreover, ITS applications impose strict latency requirements that make the real-time transmission of raw data increasingly impractical in dense urban environments [11]. Meanwhile, many ITS scenarios exhibit strong spatial and temporal regularities [12]. Static background elements such as buildings, road signs, and traffic lights remain largely unchanged over time, while the truly critical dynamic content involves moving agents such as vehicles and pedestrians [13]. Recognizing these regularities offers opportunities to reduce redundant transmissions and focus communication resources on semantically meaningful information [14]. Indeed, traffic management, incident detection, and cooperative perception all benefit from the ability to exchange concise, task-relevant representations rather than raw sensor streams [15].
Figure 1 illustrates a representative ITS intersection scenario. A Roadside Unit (RSU) deployed at the intersection is equipped with a camera that continuously captures raw traffic scenes. These images typically include static background information such as road geometry, surrounding buildings, and traffic signs, as well as dynamic elements such as vehicles, pedestrians, cyclists, and traffic signals. The RSU processes the captured images to identify dynamic, safety-critical objects that directly influence traffic flow and decision-making. In contrast, static background elements, which remain largely unchanged over time, are pre-stored at the Base Station (BS) as priors for reconstruction. The extracted dynamic content, which is significantly more compact than the full raw images, is transmitted over noisy wireless channels to the BS. At the receiving end, the BS fuses the transmitted information with the stored background priors to reconstruct a complete traffic scene. This approach avoids repeatedly transmitting redundant background data, thereby reducing bandwidth consumption, while ensuring that dynamic content such as the movement of vehicles or pedestrians is reliably conveyed even under adverse channel conditions. The example in Figure 1 thus shows how communication at the perception layer can reduce redundant load while reliably delivering dynamic, safety-critical information. Such reconstructed scenes are not only compact in transmission but also directly usable for ITS perception tasks, including cooperative perception between vehicles and RSUs, real-time safety monitoring, and incident detection. In this way, they connect communication-layer design with practical ITS applications, motivating the development of more advanced communication strategies capable of transmitting high-level, task-relevant semantics rather than raw data.
Semantic Communication (SC) has emerged as a promising paradigm to address these limitations by prioritizing the transmission of high-level task-relevant semantics rather than low-level symbols [16]. SC approaches can significantly improve spectral efficiency, robustness to channel noise, and latency by compressing redundant information while preserving the critical semantic cues needed for decision-making [17]. Recent work has shown that attention mechanisms, encoder–decoder frameworks, and joint source–channel coding (JSCC) can enhance SC systems by selectively focusing on important features and ensuring reliable transmission even under noisy channel conditions [18,19,20].
Although semantic communication has shown great potential for enhancing wireless transmission, existing methods applied to ITS still face key limitations. JSCC-based frameworks [21,22] leverage end-to-end learning to jointly optimize image compression and channel robustness; however, their large model sizes and complex encoder–decoder structures hinder deployment on edge devices. Attention-enhanced SC methods [23,24,25,26] improve semantic relevance by highlighting discriminative features, yet often rely on transformer [27] backbones and lack structured inductive bias, making them computationally intensive and less suitable for edge-side inference in ITS environments. Knowledge-driven designs [28,29] utilize external graphs or shared priors to enhance decoding, but their dependence on scenario-specific priors limits reusability across scenes. Task-adaptive SC strategies [30,31] adjust rate control based on task sensitivity, but are mostly built for general semantic compression and fail to account for the spatial regularities and dynamic/static saliency inherent to traffic scenarios. Even more recent routing-based approaches such as BRASC [32] perform in-channel semantic attention to prioritize relevant features, yet still rely on online segmentation and full-resolution attention without leveraging prior semantic knowledge to suppress redundant background content.
These limitations highlight a broader challenge: most existing methods have not been explicitly designed around ITS requirements, where efficiency, robustness, and semantic fidelity under noisy channels are essential. As a result, the communication layer in ITS still lacks solutions that can reliably deliver semantically meaningful visual information under realistic channel conditions. In this context, reliable semantic image reconstruction is particularly critical for higher-level tasks such as detection, tracking, and cooperative perception, since channel noise directly impacts the recognition of both static infrastructure and dynamic obstacles. As illustrated in Figure 1, much of the perceptual information in ITS is conveyed through camera-captured images, making robust semantic reconstruction under noisy channels a prerequisite for effective downstream processing.
To address these challenges, we propose a framework called Offline Knowledge Base and Attention-driven Semantic Communication (OKBASC). In OKBASC, the offline knowledge base precomputes semantic masks by segmenting training images in advance, constructing a compact and reusable semantic repository. At runtime, sparse attention integrates these masks with incoming visual data, providing task-relevant semantic priors that suppress redundant regions before transmission. Furthermore, our Bi-Level Routing Attention (BRA) module performs both global top-K channel selection and local spatial refinement, yielding enhanced feature representations with broader context and finer detail. Together, these innovations enable OKBASC to achieve fast and lightweight semantic communication with high fidelity, making it well suited for ITS applications.
The main contributions of this work are summarized in Table 1. It is important to note that this work focuses specifically on semantic communication for image transmission under noisy channels, rather than end-to-end ITS functions such as trajectory prediction or traffic management. Within this scope, the contributions listed in Table 1 reflect the achievements of this paper. These innovations enable OKBASC to achieve high-fidelity semantic communication, providing reliable semantic image reconstruction that supports downstream ITS tasks such as detection, tracking, and cooperative perception. Accordingly, the proposed contributions should be understood as enabling reliable image-level communication within the perception layer of ITS rather than directly addressing higher-level traffic management or control.
The remainder of this paper is organized as follows: Section 2 reviews related work on semantic communication and its applications in ITS; Section 3 details the architecture and components of the proposed OKBASC framework; Section 4 presents the experimental setup; Section 5 describes the results and corresponding analyses; finally, Section 6 concludes the paper and discusses future directions.
2. Related Work
Explosive data growth in ITS, driven by applications such as autonomous driving, remote diagnostics, and multimedia services, has pushed conventional communication technologies to their limits [33]. Traditional methods based on Shannon's information theory focus on bit-level transmission accuracy, yet struggle to handle task-relevant semantic content under constrained bandwidth and latency requirements [34,35]. To address these limitations, SC has emerged as a transformative paradigm that aims to transmit only essential semantic information rather than raw data, enhancing both spectral efficiency and transmission robustness [36].
Prior research on SC in vehicular networks has largely focused on JSCC frameworks that use convolutional or autoencoder-based architectures to achieve efficient image transmission and semantic preservation under noisy channel conditions [37]. For instance, convolutional semantic encoders and decoders have been utilized to transmit traffic sign semantics, improving classification and reconstruction performance, particularly in scenarios involving poor visibility and satellite assistance [38]. Reinforcement learning-based decision mechanisms have also been introduced at the receiver to enhance adaptive interpretation of transmitted semantics [38,39].
Multi-user SC frameworks have further extended this line of work by exploiting semantic correlations among users [40]. The cooperative SC framework jointly optimizes encoding and decoding across multiple transmitters to eliminate semantic redundancy while preserving distinct task-relevant features [41]. By sharing high-level semantic content rather than full-resolution data, this approach has demonstrated advantages in scenarios such as cooperative image retrieval and real-time tracking, where it can significantly reduce communication overhead [42].
Additionally, SC has shown strong potential in intelligent traffic management, where task-specific systems such as vehicle counting [43] and multi-task vehicle recognition [44] prioritize semantic features for accurate and efficient decision-making. Feature importance ranking and dynamic rate allocation strategies have also been explored to enhance robustness in low signal-to-noise environments.
While the aforementioned studies have explored incorporating external knowledge bases into SC systems, such approaches often remain limited in scope. Typically, they rely on predefined semantic priors without additional refinement, selective filtering, or task-driven structuring. This lack of adaptive organization and semantic compression mechanisms leads to redundant transmission of background information and suboptimal alignment with communication goals. Consequently, these methods fail to fully exploit the potential of knowledge bases to support efficient and robust semantic communication in the highly constrained and dynamic contexts characteristic of ITS environments.
In contrast, our proposed OKBASC framework adopts a structured knowledge-driven approach that moves semantic segmentation to an offline stage. By pre-segmenting training images and applying a Top-K Sparse Attention (TKSA) mechanism, OKBASC constructs a compact and reusable semantic knowledge base that encodes task-relevant dynamic elements (e.g., vehicles, pedestrians) in a structured form. During transmission, this offline repository provides strong priors for attention-guided feature encoding, enabling selective semantic preservation with reduced computational burden on edge devices. By combining the offline knowledge base with a BRA module for hierarchical semantic feature refinement, OKBASC offers a robust and scalable solution tailored to the bandwidth-limited and latency-sensitive demands of ITS communication.
Table 2 compares the OKBASC framework with representative SC research, highlighting the methods and limitations of previous research along with the novel advantages offered by our approach.
3. Offline Knowledge Base and Attention-Driven Semantic Communication
This section describes the proposed OKBASC framework, including its overall architecture, offline semantic knowledge base construction, and semantic encoding with BRA, along with the channel transmission and decoding process.
3.1. Overall Architecture
The overall architecture of the proposed OKBASC framework is illustrated in Figure 2. OKBASC is designed to enable efficient and robust semantic communication in ITS scenarios by combining an offline knowledge base with attention-driven encoding and decoding stages, as summarized in Algorithm 1. This design aims to reduce redundant transmission and preserve task-relevant information under constrained bandwidth conditions.
Algorithm 1: OKBASC Framework
Input: Input image $\mathbf{X}$; offline knowledge base $\mathcal{K}$
Output: Reconstructed semantic image $\hat{\mathbf{X}}$
1:  Phase 1: Offline Knowledge Base Construction
2:    Build semantic mask repository $\mathcal{K}$ from training data using Algorithm 2
3:  Phase 2: Semantic Feature Encoding
4:    Fuse input $\mathbf{X}$ with relevant masks from $\mathcal{K}$ using sparse attention
5:    $\mathbf{F} \leftarrow \mathrm{Fuse}(\mathbf{X}, \mathcal{K})$
6:    $\mathbf{F}' \leftarrow \mathrm{BRA}(\mathbf{F})$ (Algorithm 3)
7:  Phase 3: Channel Transmission
8:    $\mathbf{z} \leftarrow \mathrm{ChannelEncoder}(\mathbf{F}')$
9:    $\hat{\mathbf{z}} \leftarrow \mathbf{z} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ // simulate AWGN
10:   $\hat{\mathbf{F}} \leftarrow \mathrm{ChannelDecoder}(\hat{\mathbf{z}})$
11: Phase 4: Semantic Decoding and Reconstruction
12:   Predict spatial semantic mask: $\mathbf{M} \leftarrow \sigma(\mathrm{Conv}(\hat{\mathbf{F}}))$
13:   Apply mask to features: $\tilde{\mathbf{F}} \leftarrow \hat{\mathbf{F}} \odot \mathbf{M}$
14:   Decode masked features: $\hat{\mathbf{X}} \leftarrow \mathrm{Dec}(\tilde{\mathbf{F}})$
15:   Upsample $\hat{\mathbf{X}}$ to the original resolution via bilinear interpolation
16: return $\hat{\mathbf{X}}$
A central component of OKBASC is its semantic knowledge base, which is constructed offline in advance of communication. Training images are pre-segmented to extract semantic regions corresponding to dynamic task-relevant elements such as vehicles, pedestrians, and traffic signs. A TKSA mechanism is then applied to these segmentation masks to filter out background regions, resulting in a compact repository of reusable semantic masks. This offline knowledge base serves as a strong prior during runtime without imposing additional computational burden on edge devices.
During communication, the system begins with an input image that is fused with relevant masks retrieved from the offline knowledge base using sparse attention, generating a semantic-aware feature representation that selectively emphasizes important dynamic content while suppressing visual redundancy. This fused semantic representation is then refined by the BRA block, which employs a two-branch strategy: global top-K channel selection highlights the most discriminative semantic channels, while local sparse attention within spatial windows enhances fine-grained details. Notably, the top-K selection here differs in purpose from that used in offline knowledge base construction: whereas the offline TKSA mechanism selects spatially salient regions to build static semantic masks, the BRA module's global top-K routing dynamically filters semantic channels to adapt the encoding to each input image. In this way, BRA ensures that the final semantic vector is both compact and expressive, making it suitable for transmission over bandwidth-constrained and noisy channels.
The refined semantic features are subsequently processed by a semantic encoder and transformed into a transmittable latent code using a channel encoder. This code is sent over a simulated wireless channel modeled with Additive White Gaussian Noise (AWGN), then recovered by a channel decoder at the receiver. During reconstruction, the semantic decoder not only uses the transmitted features but also incorporates relevant entries from the offline knowledge base to guide and refine the recovered content, ensuring consistency with precomputed semantic priors. By leveraging both the transmitted latent features and the stored semantic knowledge, the decoder emphasizes task-relevant dynamic regions while suppressing irrelevant background content. The final output is then upsampled to the original resolution via bilinear interpolation, yielding a high-fidelity image aligned with ITS communication requirements while minimizing bandwidth usage.
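To make the runtime flow concrete, the PyTorch sketch below traces Phases 2–4 of Algorithm 1 under simplifying assumptions: the module names (OKBASCPipeline, semantic_encoder, mask_decoder), the layer shapes, and the fusion-by-elementwise-masking shortcut are illustrative stand-ins rather than the actual OKBASC implementation, whose components are detailed in Sections 3.2, 3.3, and 3.4.

```python
# A minimal, self-contained sketch of the OKBASC runtime pipeline (Phases 2-4
# of Algorithm 1). All module internals and dimensions here are assumptions
# made for illustration; they are not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OKBASCPipeline(nn.Module):
    def __init__(self, feat_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.semantic_encoder = nn.Conv2d(3, feat_dim, 3, stride=4, padding=1)
        self.channel_encoder = nn.Linear(feat_dim, latent_dim)    # Eq. (12)
        self.channel_decoder = nn.Linear(latent_dim, feat_dim)    # Eq. (14)
        self.mask_decoder = nn.Sequential(                        # Eq. (15)
            nn.Conv2d(feat_dim, 1, 3, padding=1), nn.Sigmoid())
        self.semantic_decoder = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, x, kb_mask, snr_db=10.0):
        # Phase 2: fuse the input with a retrieved knowledge-base mask
        # (elementwise masking stands in for the sparse-attention fusion).
        f = self.semantic_encoder(x * kb_mask)
        b, c, h, w = f.shape
        z = self.channel_encoder(f.flatten(2).mean(-1))   # compact latent code
        # Phase 3: AWGN channel with noise power set by the target SNR (Eq. (13)).
        sigma2 = z.pow(2).mean() / (10 ** (snr_db / 10))
        z_hat = z + torch.randn_like(z) * sigma2.sqrt()
        # Phase 4: decode, apply the predicted spatial mask, and reconstruct.
        f_hat = self.channel_decoder(z_hat)[..., None, None].expand(b, c, h, w)
        f_hat = f_hat * self.mask_decoder(f_hat)          # Eq. (16)
        x_hat = self.semantic_decoder(f_hat)
        return F.interpolate(x_hat, size=x.shape[-2:],    # Eq. (17)
                             mode="bilinear", align_corners=False)

# Usage: reconstruct a 256x256 image at SNR = 5 dB with an all-ones mask.
model = OKBASCPipeline()
x_hat = model(torch.rand(1, 3, 256, 256), torch.ones(1, 1, 256, 256), snr_db=5.0)
```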
3.2. Offline Knowledge Base Construction via TKSA
To support efficient semantic encoding for ITS, we construct a reusable semantic knowledge base using a TKSA mechanism. The idea is to move the computationally expensive process of semantic segmentation to an offline stage. In this stage, the system precomputes and stores compact semantic masks that capture task-relevant dynamic content such as vehicles, pedestrians, and traffic signs while ignoring largely static urban backgrounds such as buildings or road infrastructure [46]. This approach reduces edge-side computation and communication overhead by ensuring that only essential dynamic semantics are emphasized during transmission.
Unlike standard attention mechanisms that assign soft weights to all features, TKSA enforces hard sparsity by selecting only the top-K most relevant spatial positions for each query. This approach yields lightweight and interpretable attention maps: only the most task-relevant regions receive non-zero weights, making it easy to identify and isolate dynamic semantic content for mask generation while ignoring static background regions [47]. These binary masks, generated once offline, form a compact and reusable knowledge base that guides semantic fusion with input images during communication [48]. It is worth noting that our adaptation of TKSA differs from that proposed in image de-raining [49], which focuses on selective feature restoration within transformer-based de-raining networks. In contrast, we repurpose sparse attention for offline semantic mask extraction in intelligent transportation scenarios, targeting salient dynamic objects such as vehicles, traffic lights, and pedestrians in order to build a concise and reusable semantic knowledge base tailored to ITS communication needs.
As shown in Figure 3, our TKSA module operates on original training images $\mathbf{X} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size, $C$ denotes the number of channels, and $H \times W$ is the image size. The goal of TKSA is to identify key semantic regions in each image through a sparse attention mechanism, producing binary masks that highlight the most informative spatial areas. To perform attention over spatial positions, the input image is first reshaped into a sequence of tokens:

$$\mathbf{X}_{\mathrm{seq}} \in \mathbb{R}^{B \times N \times C}, \quad N = H \cdot W. \tag{1}$$

The query, key, and value matrices are computed via learned linear projections:

$$\mathbf{Q} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}_{\mathrm{seq}}\mathbf{W}_V, \tag{2}$$

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{C \times d}$ are trainable parameters and $d$ is the attention embedding dimension. For each query token $\mathbf{q}_i$, the dot-product similarity with all keys is computed as

$$s_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}. \tag{3}$$

Instead of attending to all positions, TKSA performs top-$K$ selection to retain only the most relevant attention scores:

$$\Omega_i = \operatorname{TopK}_j\left(s_{ij}\right), \tag{4}$$

where $\Omega_i$ denotes the indices of the top-$K$ keys most relevant to query $\mathbf{q}_i$. Sparse attention weights are then computed by applying the softmax function only over the selected positions:

$$\alpha_{ij} = \begin{cases} \dfrac{\exp(s_{ij})}{\sum_{j' \in \Omega_i} \exp(s_{ij'})}, & j \in \Omega_i, \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$

The final output for each token is computed as a weighted sum of values:

$$\mathbf{o}_i = \sum_{j \in \Omega_i} \alpha_{ij}\mathbf{v}_j. \tag{6}$$

This yields $\mathbf{O} \in \mathbb{R}^{B \times N \times d}$, from which semantic importance maps are derived. These maps are then binarized (e.g., via thresholding) to produce final semantic masks that highlight dynamic task-relevant regions. By repeating this process for all training images, TKSA builds a compact offline semantic knowledge base of reusable masks that can be fused with runtime input to guide semantic encoding and improve communication efficiency in ITS scenarios. The procedure for constructing the offline semantic knowledge base using TKSA is summarized in Algorithm 2.
Algorithm 2: Offline Knowledge Base Construction via TKSA
Input: Semantic training dataset $\mathcal{D}$
Output: Semantic knowledge base $\mathcal{K}$
1: $\mathcal{K} \leftarrow \emptyset$
2: for each image $\mathbf{X} \in \mathcal{D}$ do
3:   Reshape $\mathbf{X}$ into a token sequence (Equation (1))
4:   Compute $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ via learned projections (Equation (2))
5:   For each query, select the top-$K$ keys and compute sparse attention (Equations (3)–(6))
6:   Derive a semantic importance map from the attention output $\mathbf{O}$
7:   Binarize the importance map via thresholding to obtain a semantic mask
8:   Add the mask to $\mathcal{K}$
9: end for
10: return $\mathcal{K}$
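As a concrete reading of Equations (1)–(6), the following PyTorch sketch implements single-head top-K sparse attention over spatial tokens and binarizes the resulting importance map into a semantic mask. The class name, the single-head formulation, and the median threshold are assumptions; the paper leaves these implementation choices unspecified.

```python
# A minimal sketch of TKSA mask extraction (Eqs. (1)-(6)); a single head and a
# median binarization threshold are assumed choices for illustration.
import torch
import torch.nn as nn

class TKSA(nn.Module):
    def __init__(self, channels: int, dim: int, top_k: int):
        super().__init__()
        self.q = nn.Linear(channels, dim, bias=False)   # W_Q in Eq. (2)
        self.k = nn.Linear(channels, dim, bias=False)   # W_K in Eq. (2)
        self.v = nn.Linear(channels, dim, bias=False)   # W_V in Eq. (2)
        self.top_k, self.scale = top_k, dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, N, C), Eq. (1)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        scores = q @ k.transpose(-2, -1) * self.scale    # Eq. (3)
        # Hard sparsity: keep the top-K scores per query (Eq. (4)); the rest
        # are set to -inf so the softmax assigns them zero weight (Eq. (5)).
        kth = scores.topk(self.top_k, dim=-1).values[..., -1:]
        attn = scores.masked_fill(scores < kth, float("-inf")).softmax(-1)
        out = attn @ v                                   # Eq. (6)
        imp = out.abs().mean(-1)                         # per-token importance
        thr = imp.quantile(0.5, dim=1, keepdim=True)     # assumed threshold
        return (imp > thr).float().reshape(b, 1, h, w)   # binary semantic mask

# Usage: derive a mask for one RGB training image.
mask = TKSA(channels=3, dim=32, top_k=16)(torch.rand(1, 3, 32, 32))
```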
3.3. Bi-Level Routing Attention for Semantic Encoding
To enhance semantic feature extraction, we introduce a BRA block that hierarchically captures both global discriminative semantics and local structural details [50]. Unlike traditional convolutional pooling that performs uniform spatial downsampling, BRA adopts a two-stage sparse attention mechanism to achieve selective information routing. This is particularly useful in semantic communication scenarios, where retaining task-relevant semantics while reducing redundancy is essential.
Our BRA block (as shown in Figure 4) is inspired by the bi-level sparse attention used in BiFormer [32], but is customized for semantic feature enhancement in offline knowledge-based communication systems. It operates on semantic-aware features $\mathbf{F} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, and $H \times W$ is the spatial resolution. The goal is to apply global token routing for coarse selection and local refinement within windows for fine-grained discrimination. To this end, the feature map is first partitioned into non-overlapping windows of size $w \times w$, yielding $N_w = (H/w) \times (W/w)$ windows per image. Each window is treated as a separate unit for local attention.

We compute a global attention token $\mathbf{t} \in \mathbb{R}^{C}$ by aggregating spatial statistics across each channel. In our case, global average pooling is replaced with a semantic-aware summary over the TKSA-enhanced features $\tilde{\mathbf{F}}$:

$$\mathbf{t} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \tilde{\mathbf{F}}_{:,i,j}. \tag{7}$$

The global token $\mathbf{t}$ is used as the query to compute similarity with learnable channel keys $\mathbf{K}_c \in \mathbb{R}^{C \times C}$:

$$\mathbf{s} = \mathbf{K}_c\,\mathbf{t} \in \mathbb{R}^{C}. \tag{8}$$

Then, the top-$K$ channel indices $\Omega_c$ are selected:

$$\Omega_c = \operatorname{TopK}(\mathbf{s}). \tag{9}$$

Only those channels in $\Omega_c$ are retained for the next step, effectively filtering out semantically irrelevant channels.

For each window $\mathcal{W}_m$ (after selecting the top-$K$ channels), we flatten the spatial locations and perform local multi-head attention. Given the reshaped window tokens $\mathbf{Z} \in \mathbb{R}^{w^2 \times K}$, with

$$\mathbf{Q} = \mathbf{Z}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{Z}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{Z}\mathbf{W}_V,$$

the attention score between each query $\mathbf{q}_i$ in the window and key $\mathbf{k}_j$ is computed as

$$s_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}. \tag{10}$$

Sparse attention is applied by selecting the top-$k$ spatial positions $\Omega_i$ for each $\mathbf{q}_i$:

$$\Omega_i = \operatorname{TopK}_j\left(s_{ij}\right). \tag{11}$$

To compute the sparse attention weights within each local window, we adopt the formulation in Equation (5). In this context, the equation is applied to semantic-aware feature tokens within each local window after global channel filtering, enabling the model to selectively attend to fine-grained spatial details and enhancing the local semantic representation. The procedure for applying BRA to semantic-aware features is summarized in Algorithm 3.
Algorithm 3: Bi-Level Routing Attention (BRA)
Input: Semantic-aware feature map $\mathbf{F}$
Output: Compact semantic representation $\mathbf{F}'$
1: Compute the global token $\mathbf{t}$ via the semantic-aware summary (Equation (7))
2: Score all channels against the learnable keys (Equation (8)) and select the top-$K$ channel indices $\Omega_c$ (Equation (9))
3: Retain the channels in $\Omega_c$ and partition the feature map into non-overlapping $w \times w$ windows
4: for each window do
5:   Compute local attention scores (Equation (10)) and select the top-$k$ spatial positions per query (Equation (11))
6:   Apply the sparse softmax of Equation (5) and aggregate the values
7: end for
8: return the refined representation $\mathbf{F}'$
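The PyTorch sketch below illustrates the two-stage routing of Equations (7)–(11): a global channel summary is scored against learnable keys to keep the top channels, then windowed attention with hard top-k sparsity refines the retained features. Window partitioning via unfold, a single attention head, and plain average pooling for the global token (in place of the TKSA-derived summary) are simplifying assumptions.

```python
# A minimal sketch of Bi-Level Routing Attention (Eqs. (7)-(11)). H and W are
# assumed divisible by the window size; reassembling the refined window tokens
# into a spatial map is omitted for brevity.
import torch
import torch.nn as nn

class BRABlock(nn.Module):
    def __init__(self, channels: int, top_c: int, window: int, top_k: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(channels, channels))  # channel keys
        self.qkv = nn.Linear(top_c, 3 * top_c, bias=False)
        self.top_c, self.window, self.top_k = top_c, window, top_k

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Global routing: channel summary (Eq. (7)) scored against learnable
        # keys (Eq. (8)); keep only the top channels (Eq. (9)).
        t = f.mean(dim=(2, 3))                                     # (B, C)
        idx = (t @ self.keys).topk(self.top_c, dim=-1).indices
        f_sel = f.gather(1, idx[:, :, None, None].expand(-1, -1, h, w))
        # Local refinement: partition into non-overlapping windows and flatten.
        ws = self.window
        z = (f_sel.unfold(2, ws, ws).unfold(3, ws, ws)
                  .permute(0, 2, 3, 4, 5, 1)
                  .reshape(b, -1, ws * ws, self.top_c))  # (B, Nw, ws*ws, C')
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        attn = q @ k.transpose(-2, -1) * self.top_c ** -0.5        # Eq. (10)
        # Top-k spatial sparsity per query (Eq. (11)) + masked softmax (Eq. (5)).
        kth = attn.topk(self.top_k, dim=-1).values[..., -1:]
        attn = attn.masked_fill(attn < kth, float("-inf")).softmax(-1)
        return attn @ v                                  # refined window tokens

# Usage: route a 64-channel feature map, keeping 16 channels and 8 positions.
out = BRABlock(64, top_c=16, window=4, top_k=8)(torch.rand(1, 64, 32, 32))
```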
3.4. Channel Transmission and Decoding
After extracting semantic features via the BRA module, we obtain a compact representation denoted as $\mathbf{f} \in \mathbb{R}^{B \times D}$, where $B$ is the batch size and $D$ is the flattened feature dimension. This representation is passed to the channel encoder, which performs a linear projection to map the semantic features into a lower-dimensional latent code $\mathbf{z} \in \mathbb{R}^{B \times d_z}$, as follows:

$$\mathbf{z} = \mathbf{f}\mathbf{W}_e + \mathbf{b}_e, \tag{12}$$

where $\mathbf{W}_e \in \mathbb{R}^{D \times d_z}$ and $\mathbf{b}_e \in \mathbb{R}^{d_z}$ are learnable parameters of the encoder.

To simulate realistic wireless environments, we adopt an AWGN channel. The transmitted code $\mathbf{z}$ is perturbed by zero-mean Gaussian noise $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, where $\sigma^2$ is determined by the SNR:

$$\hat{\mathbf{z}} = \mathbf{z} + \mathbf{n}, \quad \sigma^2 = \frac{P_z}{10^{\mathrm{SNR}/10}}, \tag{13}$$

where $P_z$ denotes the average power of the transmitted code. The received noisy code $\hat{\mathbf{z}}$ is then passed through the channel decoder, which performs a reverse linear projection to reconstruct the semantic features:

$$\hat{\mathbf{f}} = \hat{\mathbf{z}}\mathbf{W}_d + \mathbf{b}_d, \tag{14}$$

where $\mathbf{W}_d \in \mathbb{R}^{d_z \times D}$ and $\mathbf{b}_d \in \mathbb{R}^{D}$ are decoder parameters.

Finally, the recovered semantic vector $\hat{\mathbf{f}}$ is passed to the semantic decoder to reconstruct the semantic-aware image $\hat{\mathbf{X}}$. When adaptive semantic compression is enabled, the semantic decoder incorporates an auxiliary mask decoder to generate a spatial attention mask $\mathbf{M} \in [0, 1]^{H' \times W'}$. This mask guides the reconstruction process by emphasizing task-relevant spatial regions while suppressing irrelevant background content. Given the semantic feature map $\hat{\mathbf{F}}$, obtained by reshaping $\hat{\mathbf{f}}$ into spatial form, the mask decoder applies a CNN to predict the attention mask:

$$\mathbf{M} = \sigma\left(\mathrm{Conv}(\hat{\mathbf{F}})\right), \tag{15}$$

where $\mathrm{Conv}(\cdot)$ denotes a convolutional subnetwork and $\sigma(\cdot)$ is the sigmoid activation function that normalizes the output to the $[0, 1]$ range. The resulting attention mask is applied to the semantic feature map via element-wise multiplication:

$$\tilde{\mathbf{F}} = \hat{\mathbf{F}} \odot \mathbf{M}. \tag{16}$$

The masked features are then decoded and upsampled via bilinear interpolation to generate the final reconstructed image:

$$\hat{\mathbf{X}} = \mathrm{Upsample}\left(\mathrm{Dec}(\tilde{\mathbf{F}})\right). \tag{17}$$

This mask-guided decoding mechanism improves semantic reconstruction quality under bandwidth-limited conditions by focusing on semantically informative regions. This design follows previous deep learning-enabled semantic communication systems [35] while being specifically tailored to leverage our offline knowledge base and attention-guided features for ITS scenarios. The procedure for channel encoding, transmission with noise, decoding, and reconstruction is detailed in Algorithm 4.
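For clarity, a minimal sketch of the AWGN perturbation in Equation (13) is given below; calibrating the noise variance from the measured latent power is an assumed convention, and the tensor shapes are placeholders.

```python
# A minimal sketch of the AWGN channel in Eq. (13): the noise variance is set
# from the average latent power and the target SNR in dB (assumed convention).
import torch

def awgn(z: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Perturb a latent code with zero-mean Gaussian noise at a given SNR."""
    signal_power = z.pow(2).mean()                      # average power of z
    sigma2 = signal_power / (10.0 ** (snr_db / 10.0))   # sigma^2 = P / 10^(SNR/10)
    return z + torch.randn_like(z) * sigma2.sqrt()

# Usage: simulate the 5-25 dB conditions evaluated in Section 5.
z = torch.randn(8, 64)                                  # hypothetical latent codes
noisy = {snr: awgn(z, snr) for snr in (5, 10, 15, 20, 25)}
```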
It can be seen that the overall design of OKBASC addresses several key limitations in existing SC systems for intelligent transportation. By moving semantic segmentation to an offline stage through the TKSA module, our method avoids the high computational costs of real-time feature selection while retaining structured task-relevant priors that can be reused across diverse traffic scenarios. The BRA module then performs hierarchical routing attention during encoding, improving semantic discriminability with minimal computational overhead. In the decoding stage, the pre-shared masks from the TKSA knowledge base guide the reconstruction process by filtering redundant background, while the BRA-refined features provide discriminative semantic cues, jointly supporting accurate semantic reconstruction even under noisy channel conditions. Together, these modules form a lightweight yet robust SC pipeline that reduces redundancy, improves semantic fidelity, and adapts to bandwidth constraints.
Algorithm 4: Channel Transmission and Decoding
Input: Semantic representation $\mathbf{f}$
Output: Reconstructed image $\hat{\mathbf{X}}$
1: $\mathbf{z} \leftarrow \mathbf{f}\mathbf{W}_e + \mathbf{b}_e$ // channel encoding (Equation (12))
2: $\hat{\mathbf{z}} \leftarrow \mathbf{z} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ // AWGN channel (Equation (13))
3: $\hat{\mathbf{f}} \leftarrow \hat{\mathbf{z}}\mathbf{W}_d + \mathbf{b}_d$ // channel decoding (Equation (14))
4: Reshape $\hat{\mathbf{f}}$ into the feature map $\hat{\mathbf{F}}$ and predict the mask $\mathbf{M} \leftarrow \sigma(\mathrm{Conv}(\hat{\mathbf{F}}))$ (Equation (15))
5: $\tilde{\mathbf{F}} \leftarrow \hat{\mathbf{F}} \odot \mathbf{M}$ (Equation (16))
6: $\hat{\mathbf{X}} \leftarrow \mathrm{Upsample}(\mathrm{Dec}(\tilde{\mathbf{F}}))$ (Equation (17))
7: return $\hat{\mathbf{X}}$
4. Experiment Setup
This section describes the experimental setup, datasets, and parameter configurations.
4.1. Datasets
To comprehensively evaluate the effectiveness and generalizability of our proposed semantic communication system, we select two representative datasets spanning both general visual tasks and real-world intelligent transportation scenarios. Specifically, we use the VOC2012 dataset for baseline semantic compression benchmarking and the nuPlan dataset for validation in dynamic urban environments.
The VOC2012 dataset [51] is a widely used benchmark designed for evaluating object recognition and semantic segmentation in complex natural scenes. It contains over 11,000 annotated RGB images covering 20 diverse object categories such as people, animals, vehicles, and household items, often with multiple objects per image and challenging backgrounds. We directly use its standard official training/validation split without further partitioning. We use VOC2012 to validate the effectiveness of our model in preserving critical object-level information under different SNR and channel conditions, demonstrating its suitability for SC in general visual tasks.
The nuPlan dataset [52] is a large-scale autonomous driving benchmark designed to capture complex urban scenes. In our experiments, we use a subset of the nuPlan-v1.1 mini dataset containing 5100 semantic-aware keyframes from the front-facing camera (CAM_F0) stream, specifically extracted from the recording captured on 12 May 2021 at 22:00:38 for vehicle 35 (frames 01008–01518). The data are split into 80% for training and 20% for validation to ensure consistent evaluation. The data include dynamic interactions among vehicles, pedestrians, and traffic infrastructure, and exhibit challenges such as occlusion, variable lighting, and complex road geometries. This makes the selected subset particularly suitable for evaluating the ability of our system to preserve task-relevant semantics under bandwidth constraints and channel noise in the context of intelligent transportation systems.
This combination of datasets ensures that our evaluation spans both broad-domain visual understanding and domain-specific traffic scene challenges. VOC2012 was chosen not only for its wide adoption in the semantic segmentation community but also for its manageable size and well-curated annotations, which allow for controlled experimentation without excessive computational cost. Its diverse object categories and complex backgrounds enable us to test the model’s ability to retain fine-grained semantics in general visual tasks, providing a strong reference for cross-domain generalization. For nuPlan, we use a representative subset rather than the full dataset because the full corpus contains over 1500 h of driving data, which would demand prohibitively high computational and storage resources for training and evaluation. The selected subset still covers diverse road types, lighting conditions, and traffic compositions, ensuring experimental diversity while keeping the computation tractable and aligned with our focus on resource-constrained and edge-deployable semantic communication systems. Together, these datasets provide a balanced and efficient evaluation setting.
4.2. Parameter Configuration
To ensure fair comparisons, the implementation details and training configurations are specified as follows. All models are trained using the Adam optimizer with default momentum parameters. We adopt a fixed learning rate and batch size across all experiments, and both datasets are trained for the same number of epochs under identical training protocols. The detailed training hyperparameters are summarized in Table 3.
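For reference, the sketch below shows this training setup in PyTorch; the network stand-in and all numeric values are placeholders, not the hyperparameters reported in Table 3.

```python
# A minimal sketch of the training configuration described above: Adam with
# default momentum parameters and a fixed learning rate and batch size.
# The model stand-in and numeric values are placeholders, not Table 3's values.
import torch

model = torch.nn.Linear(64, 64)                       # stand-in for OKBASC
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                 # fixed learning rate
                             betas=(0.9, 0.999))      # default momentum terms
batch_size = 32                                       # fixed across experiments
```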
The experiments are conducted on a workstation equipped with an Intel Core i7-12700F processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM (Kingston Technology Corp., Fountain Valley, CA, USA), and two RTX 3060 GPUs (ASUSTeK Computer Inc., Taipei, Taiwan; 12 GB each), using CUDA 12.2 for computation. The implementation is based on Python 3.8.17 and PyTorch 1.9.0 running on the Windows 11 operating system.
5. Results and Discussion
This section describes the results used to evaluate the proposed OKBASC framework. We conduct two sets of experiments: (1) comparative evaluations against two baseline methods, BRASC [50] and LAMSC [45], under various SNR conditions, to demonstrate OKBASC's advantages in semantic compression accuracy, reconstruction quality, and robustness; and (2) ablation studies to analyze the individual contributions of key components such as TKSA and BRA, validating their roles in enhancing semantic representation and transmission performance in ITS scenarios.
5.1. Comparative Experiment
To evaluate the effectiveness of the proposed OKBASC framework, we conduct a series of comparative experiments against two representative baseline methods, namely, LAMSC and BRASC. These experiments are performed under varying channel conditions simulated by different SNR levels, with a focus on training loss convergence and semantic reconstruction performance. The goal is to assess how well each method handles noise and preserves task-relevant information during transmission and recovery. Specifically, we select SNR levels of 5, 10, 15, 20, and 25 dB in order to represent a wide range of realistic communication scenarios that are commonly encountered in ITS applications, from highly noisy channels (5 dB) to relatively clean wireless environments (25 dB).
Figure 5 shows the training loss trajectories of OKBASC, LAMSC, and BRASC under different SNR conditions (5, 10, 15, 20, and 25 dB). Across all noise levels, OKBASC achieves the lowest final training loss.
At low SNR levels (e.g., SNR = 5 dB), the loss curves show the greatest divergence among methods. OKBASC converges rapidly within the first 10–20 epochs, stabilizing at a substantially lower loss floor with minimal oscillations. In contrast, BRASC exhibits pronounced fluctuations and less stable training, while LAMSC converges more smoothly than BRASC but still reaches a higher loss value than OKBASC. These results highlight OKBASC's robustness under severe channel noise, benefiting from its offline knowledge base and attention-guided fusion that help to suppress irrelevant regions prior to transmission.
At moderate SNR levels (10 and 15 dB), all methods exhibit smoother convergence, reflecting reduced channel distortion; however, OKBASC maintains a clear advantage in final loss values and convergence speed. It consistently reaches lower loss than LAMSC and BRASC after early epochs, with fewer oscillations even as the amount of noise decreases. Notably, LAMSC narrows its gap with OKBASC at SNR = 15 dB, suggesting that methods without offline segmentation can still leverage moderate channel conditions but remain limited by less discriminative feature selection.
At high SNR levels (20 and 25 dB), the performance gap among methods narrows further. Loss curves become more stable overall, and all systems benefit from reduced channel perturbations. OKBASC retains its advantage with the lowest final loss, but with a smaller relative margin. BRASC and LAMSC both achieve smoother training under these cleaner conditions, though BRASC continues to show higher variance and worse minima. This indicates that even with improved channel quality, the absence of an explicit knowledge base in BRASC limits its ability to consistently suppress background noise and preserve task-relevant regions.
We further evaluate semantic reconstruction quality using the Structural Similarity Index Measure (SSIM) [53] under varying SNR conditions on both the nuPlan-v1.1 mini camera0 subset and the VOC2012 dataset.
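For reference, the snippet below shows how SSIM between a reference image and its reconstruction can be computed with scikit-image; this is an assumed stand-in for our evaluation code, and the images are synthetic placeholders.

```python
# A minimal SSIM computation using scikit-image as an assumed stand-in for the
# paper's evaluation pipeline; the inputs here are synthetic placeholders.
import numpy as np
from skimage.metrics import structural_similarity as ssim

ref = np.random.rand(256, 256, 3).astype(np.float32)          # original image
rec = (ref + 0.05 * np.random.randn(256, 256, 3)).astype(np.float32)  # noisy recon
score = ssim(ref, rec, channel_axis=-1, data_range=1.0)
print(f"SSIM: {score:.3f}")
```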
For the VOC2012 dataset, as shown in Figure 6, the SSIM results under varying SNR conditions demonstrate clear differences among the compared methods. Across all SNR levels from 5 dB to 25 dB, OKBASC consistently achieves the highest SSIM scores. Even under low SNR (5 dB), OKBASC maintains SSIM values above 0.94, while LAMSC drops to around 0.83 and BRASC remains much lower, close to 0.54. While all methods show gradual gains in SSIM as the channel SNR improves, their relative ranking stays consistent. This performance gap is especially important for VOC2012's varied scenes, where irrelevant background details increase the difficulty of reconstruction. OKBASC's higher SSIM indicates its ability to leverage offline semantic masks and attention-guided compression to filter redundancy while preserving essential object-level semantics. Compared to LAMSC, OKBASC offers better noise resilience and semantic consistency. BRASC's consistently lower SSIM across all conditions suggests that its simpler attention mechanism struggles to retain fine-grained structure in cluttered multi-object images.
For the nuPlan dataset, as shown in Figure 7, the SSIM results reveal a similar advantage for OKBASC. OKBASC consistently outperforms LAMSC and BRASC across all tested SNR levels, maintaining high SSIM values even under severe noise. At SNR = 5 dB, OKBASC achieves SSIM near 0.98, while LAMSC drops to around 0.86 and BRASC falls below 0.57. As SNR increases, OKBASC continues to deliver stable, high-quality semantic reconstructions with minimal degradation. LAMSC improves with SNR but shows a wider gap relative to OKBASC at lower SNRs, highlighting its greater sensitivity to noise. BRASC shows consistently lower SSIM values with limited improvement as SNR rises, reflecting its difficulty in preserving critical semantic details in dynamic traffic scenes with complex interactions among vehicles, pedestrians, traffic signals, and road geometry. These results confirm the robustness and effectiveness of OKBASC in preserving task-relevant semantic information under bandwidth constraints and noisy channels, particularly in real-world intelligent transportation settings.
In addition to the quantitative SSIM comparisons, Figure 8 presents qualitative reconstruction results for the VOC2012 and nuPlan datasets at SNR levels of 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB. Consistent with the SSIM trends, OKBASC preserves object contours and suppresses background noise more effectively than LAMSC and BRASC across all SNRs. LAMSC shows moderate noise suppression but exhibits noticeable blurring of fine details, especially under low SNR. BRASC suffers from severe noise artifacts and structural distortion, indicating its limited capability to maintain semantic fidelity under challenging channel conditions. These examples visually validate the quantitative findings and further demonstrate the robustness of OKBASC in preserving task-relevant semantics.
5.2. Ablation Study
To further analyze the contribution of each component in our OKBASC framework, we conduct ablation experiments on both the VOC2012 and nuPlan datasets under varying SNR conditions. Table 4 and Table 5 report the SSIM performance when selectively removing key modules, namely TKSA and the BRA block.
For the VOC2012 dataset, removing TKSA leads to the most pronounced SSIM degradation, particularly under low SNR (e.g., SSIM drops from 0.94 to 0.85 at 5 dB), and remains lower even at high SNR (from 0.87 to 0.82 at 25 dB). This highlights TKSA’s importance in enforcing hard sparsity during offline knowledge base construction to filter out redundant static background. Similarly, removing the BRA block causes a noticeable performance drop (e.g., 0.83 at 5 dB and 0.79 at 25 dB), confirming its role in refining global and local semantic features. When both modules are removed, the SSIM plummets further (e.g., 0.78 at 5 dB and 0.72 at 25 dB), demonstrating their complementary benefits in supporting semantic robustness under noisy conditions.
On the nuPlan dataset, OKBASC achieves relatively higher SSIM scores across all SNR levels due to the more consistent spatial layouts of traffic scenes. Yet, the absence of BRA still results in considerable degradation (e.g., SSIM drops from 0.88 to 0.79 at 5 dB and from 0.90 to 0.81 at 25 dB), highlighting BRA’s importance in preserving fine-grained dynamic features essential for understanding traffic participants and road geometry. TKSA also contributes significantly, with SSIM dropping to 0.81 at 5 dB and 0.84 at 25 dB when removed. The combined removal of both modules yields the lowest performance (0.65 at 5 dB and 0.67 at 25 dB), confirming their joint contribution to accurate semantic representation and resilience to channel noise.
These results validate the critical importance of both TKSA and BRA to OKBASC’s performance. TKSA ensures compact and focused semantic representations, while BRA enhances spatial detail through hierarchical refinement. Together, they form a robust semantic compression–reconstruction pipeline, enabling OKBASC to maintain high task-relevant fidelity across diverse SNR conditions in both complex natural scenes and structured driving scenes.