DeepSTAS: DL-assisted Semantic Transmission Accuracy Enhancement Through an Attention-driven HAPS Relay System

Pascal Nkurunziza; Daisuke Umehara

doi:10.3390/technologies13040137

¹ Graduate School of Science and Technology, Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, Kyoto 606-8255, Japan

² Department of Electrical and Electronics Engineering, University of Rwanda, Kigali P.O. Box 3900, Rwanda

* Authors to whom correspondence should be addressed.

Technologies 2025, 13(4), 137; https://doi.org/10.3390/technologies13040137

This article belongs to the Section Information and Communication Technologies


Abstract

Semantic communication extracts the meaning of source data and transmits only the relevant semantic information. It therefore has the potential to extend Shannon’s paradigm, which is concerned with reproducing a message from one location at another, regardless of its meaning. Nevertheless, some user terminals (UTs) may experience inadequate service because of their location relative to the base stations, which can degrade transmission accuracy and complicate deployment and implementation. A High-Altitude Platform Station (HAPS) serves as a key enabler for the deployment of wireless broadband in hard-to-reach areas, such as coastal, desert, and mountainous regions. This paper proposes a novel HAPS relay-based semantic communication scheme, named DeepSTAS, which leverages deep learning techniques to enhance transmission accuracy. The proposed scheme focuses on attention-based semantic signal decoding, denoising, and forwarding modes and is thus called a CSA-DCGAN SDF HAPS relay network. The simulation results reveal that the proposed system with attention mechanisms significantly outperforms the system without attention mechanisms in both peak signal-to-noise ratio (PSNR) and multi-scale structural similarity index (MS-SSIM): the proposed system achieves a 2 dB gain when leveraging the attention mechanisms, and a PSNR of 38.5 dB can be obtained, with an MS-SSIM exceeding 0.999, at an approximate SNR of only 20 dB. The system achieves a PSNR of more than 37 dB and a corresponding MS-SSIM close to 0.999 at an estimated SNR of 20 dB on the CIFAR-100 dataset, and an MS-SSIM of 0.965 at an approximate SNR of only 10 dB on the Kodak dataset. The proposed system holds promise to maintain consistent performance even at low SNRs across various channel conditions.

Keywords: semantic transmission; HAPS relay; attention mechanisms

1. Introduction

Various research studies have demonstrated that cooperative communication schemes contribute to transmission quality enhancement. Relaying is one of the most essential types of cooperative communication, offering an effective solution when a direct link experiences significant path loss and a line of sight between the source and the destination cannot be established. Meanwhile, as research progresses toward next-generation wireless networks [1], numerous transmission technologies have been developed. Among these, semantic communications have been acknowledged for their considerable potential for media transmission in applications where multimodal data processing is a critical task, such as augmented reality/virtual reality (AR/VR) and human sensing care systems, where the generated multimodal data are correlated in context [2].

To meet the growing demands for future wireless communications, semantic communications, typically implemented through joint source–channel coding (JSCC) [3], have attracted substantial research interest [4,5] as a promising approach that prioritizes the delivery of the essential semantic content instead of bit sequences [6].

Following technological advancements in artificial intelligence (AI), the semantic communications paradigm has re-appeared as a favorable future mobile technology. Lately, numerous semantic communications frameworks, such as semantic and goal-oriented communications [7], as well as semantic-aware networking based on federated edge intelligence [8], have been proposed. In addition to redefining semantic communication mathematically [9], innovative schemes have been introduced that employ neural networks (NNs) and semantic interpretation modules to replace traditional communication blocks.

Given its successful implementation in different areas, deep learning (DL) has emerged as a promising approach in communications to boost the performance of systems to higher levels and improved intelligence [10,11]. Specifically, DL has shown great potential for addressing the technical challenges present in physical-layer communications [12,13,14].

The data generated by network entities worldwide are anticipated to reach 175 zettabytes in 2025 [15]. To meet the requirements imposed by such tremendous data rates and connectivity scenarios, Shannon’s paradigm relies on ever-higher operating frequencies and ever-larger transceivers, which would lead to very high energy consumption and substantial hardware costs.

With its sustainable features, integrating semantic communication is essential. Cutting-edge applications such as smart cities and telemedicine, in their advanced roles, must support human-to-machine (H2M) and machine-to-machine (M2M) communications [16,17], enabling the receiver to comprehend the meaning of the source and perform an appropriate action. Therefore, the development of accurate and reliable semantic communication is not simply an option but a necessity for an intelligent and sustainable paradigm.

Given the aforementioned advantages, semantic communications-based research has been applied to the transmission of media such as texts, images, and videos [18,19,20]. However, numerous challenges must be addressed before the widespread adoption of semantic communications [21]. Most studies in the field of semantic communications [22,23,24] have focused, firstly, on relatively simple point-to-point (P2P) communication scenarios, making it difficult to design efficient communication systems for cooperative multi-point networks [25]. Secondly, while semantic communication has been applied to fixed relay channels [26,27], effective system performance enhancement by combining multiple semantic information flows is still a serious challenge. We propose a novel communication scheme to specifically address these challenges by providing new architecture for the source, the relay, and the destination.

Motivated by the technical challenges mentioned above, this paper seeks to address the following crucial issues: (1) how to effectively design an efficient DL-based semantic system based on HAPS relay channels, (2) how to efficiently exploit the potential of DL and develop robust semantic communication systems that can ensure the stream of semantic information flow in cooperative relay systems, and (3) how to guarantee the systematic evaluation and transmission enhancement of semantic communication systems with accurate information reconstruction at the destination.

Because various services require data transmission and accurate reconstruction at the destination, we consider an integrated network of cooperating terrestrial stations and a non-terrestrial HAPS relay for multiple-image semantic data transmission. Given the numerous challenges that existing communication paradigms face, the proposed scheme leverages combined DL techniques, namely, convolutional neural networks and channel and spatial attention mechanisms, at all the communicating entities to effectively extract, process, and reconstruct information features for reliable and accurate transmission and reception.

2. Related Works

Current research often leverages deep learning and knowledge graphs to boost semantic communication performance [28]. Knowledge graphs particularly proficiently compress substantial information within a compact data size, rendering them ideal candidates for representing semantic information, where semantic relationships among entities are represented [29]. Notably, DeepSC systems for speech and text transmission, as discussed in [30], integrate semantic and channel coding. These studies highlight how semantic communication can outperform traditional bit transmission, especially in low signal-to-noise ratio (SNR) and limited-bandwidth scenarios.

The authors in [31] further extended these research works by developing multimodal communication systems, known as U-DeepSC. The semantic rate, a novel performance metric that quantifies the quantity of semantic information efficiently transferred per second, was proposed in [32] to describe the efficiency of semantic communication.

An auto-encoder-based semantic communication strategy that aids in forwarding the source message at the semantic level was developed in [17] for text transmission over a relay channel. In [18], a relay-based image transmission system was proposed and proved to perform better than the baseline technique using polar codes and improved portable graphics compression.

However, effectively combining the information from multiple nodes at the semantic level is not straightforward. The authors in [33] studied end-to-end DL-based constellation design for amplify-and-forward (AF) relaying networks under Rayleigh fading channels. Under block fading channels, modulation coding for two-way amplify-and-forward (TWAF) relay networks and differential modulation coding for one-way amplify-and-forward (OWAF) relay networks were developed.

Designing resource management to maximize the performance of semantic communication systems has also received some early attention, in addition to system architecture design. The ideal resource allocation strategy for a heterogeneous semantic and bit transmission system, for instance, was examined by the authors in [34]. The authors described the limits of the semantic-versus-bit-rate region that various multiple-access techniques were able to reach. Furthermore, the authors in [35] proposed a quality-of-experience (QoE) conscious resource allocation strategy to optimize QoE by collaboratively planning the power allocation, channel assignment, and number of transmitted semantic symbols.

However, the majority of previous research has assumed that DL neural networks, such as the DeepSC receiver, are deployed and operated on mobile devices. This ignores the crucial reality, though, that certain mobile devices might not have enough processing and storage power to carry out DL-driven semantic communications. It is urgently necessary to make the coding and decoding processes easier to understand and interpret in terms of semantic information theory in order to construct DL-enabled semantic communication systems. This could also serve as a direction for future research.

This paper’s primary contribution is to provide a novel framework for semantic communication that takes into account performance evaluation and data recovery in data-limited environments due to terrestrial non-line-of-sight (NLOS) scenarios and considers the extraction and transmission of semantic information by using an attention-driven HAPS relay. The following is a summary of the main contributions:

To implement efficient feature extraction at the source for the semantic forwarding process, five main components are proposed for providing new insights into enhanced semantic communication systems. These components are a channel and spatial attention-based deep convolutional encoder (CSA-DCE), a deep convolutional channel encoder (DCCE), a relay network, a deep convolutional channel decoder (DCCD), and a channel and spatial attention-based deep convolutional decoder (CSA-DCD).
To improve semantic information transmission in NLOS environments while maintaining reasonable accuracy, a novel architecture for semantic forwarding that seeks to enhance the transmission accuracy and is able to perform in a range of channel conditions is proposed. Specifically, the HAPS relay decodes the received multiple-image signal, performs signal separation to obtain individual images, and semantically enhances the received source signal at its level to forward it to the destination after processing. This can prevent unwanted information feature loss while successfully enhancing information delivery. This is the first time such a system has been conceived, as far as we know.
Numerous tests and performance evaluations are carried out on the proposed attention-based system through the relay network and a system which does not exploit the potential of attention mechanisms. It is proved that the proposed attention-based scheme performs better in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity index (MS-SSIM), which define the reconstruction strength and similarities between the semantic transmitted and received data. Transmission, in both phases, considers the Rician fading channel model, and system performance is evaluated on various datasets, as detailed in Section 5.

The remainder of the paper is then structured in the following manner: Section 3 elaborates the system model, and the proposed relay-enabled semantic communication system is detailed in Section 4. Performance analysis and conclusion are given in Section 5 and Section 6, respectively.

3. System Model

Let us consider a relay network, where a High-Altitude Platform Station (HAPS) is employed as a fixed relay R to route information data from the source to the destination, denoted by S and D, respectively. S and D serve as terrestrial base stations (BSs) deployed on the ground at fixed locations, while the HAPS is at a fixed altitude.

Let us assume a non-line-of-sight (NLOS) link between S and D due to the very long distance between the two ground stations, as illustrated in Figure 1. Given that both S and D are semantic-enabled systems, data processing aims at understanding the meaning and preserving the semantic fidelity of both transmitted and received information. The model comprises two levels of semantic communication, namely, the transmission level and the semantic level. To extract and analyze semantic information, the semantic level has two layers, the semantic coding layer and the semantic decoding layer, which together enable accurate reconstruction at the destination.

Figure 1. Communication scenario illustration.

The model adopts the Rician channel model for transmission, and the transmission level ensures that semantic information can be conveyed accurately over the wireless channel. To simplify the analysis, we consider the system model illustrated in Figure 2. In this paper, the semantic paradigm under consideration is simultaneous multiple-image transmission, an approach for the efficient transmission of multiple images over a wireless channel.

Figure 2. Semantic communication system design.

The focus is on the image semantic content, where instead of transmitting raw images, the semantic encoder leverages the potential of the channel and spatial attention mechanism to focus on semantic features of interest. This cuts down the data size by ignoring irrelevant details while preserving the key information.

Practically, the semantic encoder receives $x = (x_{1}, x_{2}, \dots, x_{n})$, where $x_{i} \in \mathbb{R}^{H \times W \times C}$, $\mathbb{R}$ is the set of real numbers, and H, W, and C are the height, width, and number of color channels, respectively. It is worth noting that $x_{i}$ represents the i-th UT’s image data on the source’s ground side. After feature extraction, processing, and data size adjustment, the aggregated semantic image data are channel-wise encoded for further transmission to the relay, which processes and forwards the received data to the destination, as illustrated in the scenario of Figure 2.

Given that the model design and analysis focus mainly on the source–relay–destination (S-R-D) framework shown in Figure 2, it is worth noting that transmission is carried out in two phases, where phase 1 consists of data transmission from S to R and phase 2 of data transmission from R to D.

During phase 1 of the transmission process, the semantic encoder applies channel and spatial attention mechanisms for efficient semantic feature extraction. The data signal x is then mapped by both semantic and channel encoders, where the latter dynamically enhances the semantic features with reference to the current state of the channel, alleviating the wireless channel-caused adverse effects.

Considering the transmission scheme of our model, the data signal to be transmitted from S to R is an L-dimensional complex vector $x_{S}$ that can be expressed as

$$x_{S} = \mathcal{C}(\mathcal{S}(x), \mu_{1}), \tag{1}$$

where $\mathcal{C}$ and $\mathcal{S}$ represent the channel and semantic encoding processes, respectively: $\mathcal{S}$ efficiently extracts semantic information, and $\mathcal{C}$ dynamically enhances this information in response to the current channel state to mitigate wireless channel effects. $\mu_{1}$ denotes the signal’s SNR in dB of the S-R wireless channel, and $x_{S} = (x_{S,1}, x_{S,2}, \dots, x_{S,L}) \in \mathbb{C}^{L}$, where $\mathbb{C}$ denotes the set of complex numbers and $i \in \{1, 2, \dots, L\}$ indexes the elements of $x_{S}$. The coding rate can be expressed as $r = \frac{2L}{HWCn}$.
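As a quick numerical illustration of the coding rate, the sketch below evaluates $r = \frac{2L}{HWCn}$ for one set of dimensions. The values of L, H, W, C, and n are hypothetical and are not taken from the paper's experiments; the reading that the factor of 2 counts the two real values carried by each complex symbol is our assumption.

```python
def coding_rate(L: int, H: int, W: int, C: int, n: int) -> float:
    """Coding rate r = 2L / (H * W * C * n) for n images of size H x W x C."""
    if min(L, H, W, C, n) <= 0:
        raise ValueError("all dimensions must be positive")
    return 2 * L / (H * W * C * n)


# Hypothetical setting: n = 4 images of 32 x 32 x 3 values each,
# transmitted jointly as L = 1024 complex channel symbols.
r = coding_rate(L=1024, H=32, W=32, C=3, n=4)
print(f"r = {r:.4f}")  # 2048 / 12288, i.e. one sixth
```

A smaller r means that fewer channel symbols are spent per source value, so the encoder has to compress the semantic features more aggressively.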

The wireless channel transmits $x_{S}$ and produces $x_{SR}$, which is expressed as

$$x_{SR} = \sqrt{P_{S}}\, h_{SR} \circ x_{S} + n_{R}, \tag{2}$$

where $P_{S}$ represents the source’s transmit power, $h_{SR} = (h_{SR,1}, h_{SR,2}, \dots, h_{SR,L}) \in \mathbb{C}^{L}$ represents the S-R link channel gains, and $\circ$ stands for the element-wise product. Considering the Rician fading channel, the channel gain can be expressed as $h_{SR,i} = \sqrt{\frac{K}{1+K}} + \sqrt{\frac{1}{K+1}}\, f_{SR,i}$, where K is the Rician fading coefficient, representing the ratio of the power in the LOS component to the power in the multipath components, and $f_{SR,i}$ is a complex Gaussian variable with mean 0 and variance 1, i.e., $f_{SR,i} \sim \mathcal{CN}(0, 1)$. In Equation (2), $n_{R} = (n_{R,1}, n_{R,2}, \dots, n_{R,L}) \in \mathbb{C}^{L}$ represents the zero-mean complex Gaussian noise with variance $\sigma_{R}^{2}$ at the HAPS relay.
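The Rician gain model can be checked numerically. The following is a minimal NumPy sketch, not the authors' simulation code: it samples gains according to the expression for $h_{SR,i}$ and confirms that their average power is close to 1, since $\frac{K}{1+K} + \frac{1}{1+K} = 1$. The function name `rician_gains`, the value K = 3, and the sample size are illustrative assumptions.

```python
import numpy as np


def rician_gains(L: int, K: float, rng: np.random.Generator) -> np.ndarray:
    """Sample L gains h_i = sqrt(K/(1+K)) + sqrt(1/(1+K)) * f_i, f_i ~ CN(0, 1)."""
    # CN(0, 1): real and imaginary parts are independent N(0, 1/2).
    f = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2.0)
    los = np.sqrt(K / (1.0 + K))        # deterministic line-of-sight part
    scatter = np.sqrt(1.0 / (1.0 + K))  # weight of the multipath part
    return los + scatter * f


rng = np.random.default_rng(0)
K = 3.0  # illustrative value, not taken from the paper
h = rician_gains(L=200_000, K=K, rng=rng)

print("average |h|^2:", np.mean(np.abs(h) ** 2))  # close to 1
print("average h    :", np.mean(h))               # close to sqrt(K / (1 + K))
```

Setting K = 0 removes the LOS term and reduces the model to Rayleigh fading, while a large K makes the gain nearly deterministic.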

Based on (2), $\mu_{1}$ can be computed as

$$\mu_{1} = 10 \log_{10} \frac{P_{S}}{\sigma_{R}^{2}}. \tag{3}$$

The transmission of $x_{S}$, the output of the semantic and channel encoders, assumes shared knowledge of the transmission’s SNR, $\mu_{1}$, between the sender S and the relay R. During phase 2 of the transmission process, the data signal to be transmitted from R to D is power-normalized to suit the phase 2 channel characteristics; it can then be expressed as

$$x_{R} = \mathcal{R}(x_{SR}, \mu_{2}), \tag{4}$$

where $\mathcal{R}$ symbolizes the relay-related operations and $\mu_{2}$ represents the signal’s SNR in dB of the R-D wireless channel, which can be expressed as $\mu_{2} = 10 \log_{10} \frac{P_{R}}{\sigma_{D}^{2}}$.

Therefore, the signal at the destination can be written as

$$x_{RD} = \sqrt{P_{R}}\, h_{RD} \circ x_{R} + n_{D}, \tag{5}$$

where $P_{R}$ represents the relay’s transmit power and $h_{RD} = (h_{RD,1}, h_{RD,2}, \dots, h_{RD,L}) \in \mathbb{C}^{L}$ represents the R-D link channel gains. Similarly, the channel gains in phase 2 can be expressed as $h_{RD,i} = \sqrt{\frac{K}{1+K}} + \sqrt{\frac{1}{K+1}}\, f_{RD,i}$, where $f_{RD,i}$ represents a complex Gaussian random variable with mean 0 and variance 1. In Equation (5), $n_{D} = (n_{D,1}, n_{D,2}, \dots, n_{D,L}) \in \mathbb{C}^{L}$ is the zero-mean complex Gaussian noise with variance $\sigma_{D}^{2}$.

The transmission of $x_{R}$ to the destination D assumes shared knowledge of the transmission’s SNR, $\mu_{2}$, between the relay R and the destination D. We assume that the channel decoding process also benefits from the knowledge of $\mu_{2}$.
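The two transmission phases can be sketched end to end under simplifying assumptions. In the NumPy example below, the noise variance of each hop is obtained by inverting the SNR definition ($\sigma^{2} = P / 10^{\mu/10}$), the encoder output is replaced by random unit-power symbols, and the relay operation $\mathcal{R}$ is replaced by a hypothetical placeholder that only re-normalizes the power. None of these stand-ins reproduce the learned CSA-DCGAN processing; all names and parameter values are illustrative.

```python
import numpy as np


def complex_gaussian(size, variance, rng):
    """Zero-mean circular complex Gaussian samples with the given total variance."""
    scale = np.sqrt(variance / 2.0)
    return scale * (rng.standard_normal(size) + 1j * rng.standard_normal(size))


def normalize_power(x):
    """Scale x so that its average power per symbol equals 1."""
    return x / np.sqrt(np.mean(np.abs(x) ** 2))


def hop(x, power, snr_db, K, rng):
    """One hop: sqrt(P) * h * x + n, with sigma^2 = P / 10^(snr_db / 10)."""
    sigma2 = power / 10.0 ** (snr_db / 10.0)
    f = complex_gaussian(x.size, 1.0, rng)
    h = np.sqrt(K / (1.0 + K)) + np.sqrt(1.0 / (1.0 + K)) * f  # Rician gains
    return np.sqrt(power) * h * x + complex_gaussian(x.size, sigma2, rng)


def placeholder_relay(x_sr):
    """Hypothetical stand-in for the relay operation: power normalization only."""
    return normalize_power(x_sr)


rng = np.random.default_rng(0)
L = 4096
x_s = normalize_power(complex_gaussian(L, 1.0, rng))     # stand-in for Eq. (1)
x_sr = hop(x_s, power=1.0, snr_db=20.0, K=5.0, rng=rng)  # phase 1, Eq. (2)
x_r = placeholder_relay(x_sr)                            # placeholder for Eq. (4)
x_rd = hop(x_r, power=1.0, snr_db=20.0, K=5.0, rng=rng)  # phase 2, Eq. (5)

print(x_rd.shape, np.mean(np.abs(x_r) ** 2))
```

Because the placeholder relay forwards the noise of phase 1 unchanged, the sketch also makes clear why a denoising relay can be beneficial: without it, the noise of both hops accumulates at the destination.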

4. Proposed Semantic Communication Scheme Description

In this section, we focus on a one-way relaying scheme, where the system components, namely, semantic and channel encoders, relay network, and channel and semantic decoders, are carefully designed and described in the following sub-sections.

4.1. Attention-driven Semantic Source and Destination

The source consists of semantic and channel encoders to efficiently extract meaningful features from UTs’ image data, as shown in Figure 2. For model performance optimality, the semantic encoder leverages channel and spatial attention mechanism effectiveness to focus on the most important parts.

Channel and spatial attention mechanisms in convolutional neural networks fully leverage convolutional neural network characteristics to produce attentive image features with powerful knowledge of what (i.e., channel-wise) and where (i.e., spatial), thus achieving better performance in image processing applications [36].

The network subsystem in Figure 3 illustrates the architecture of the semantic encoder, which yields the aggregated semantic image data, $x_{sem}$, as output during the feature extraction process by exploiting a series of convolutional layers that help detect the image’s local patterns and hierarchies. Global average and maximum pooling help to reduce the spatial dimensions of the feature maps, leading to lower computational cost and greater feature robustness to variations in the input.

Figure 3. Architecture of (a) channel and spatial attention-based deep convolutional encoder, (b) channel attention block, and (c) spatial attention block.

Specifically, global maximum pooling can enhance the ability of the model to emphasize the most important features, potentially improving performance in tasks where the strongest signals are more relevant [37]. On the other hand, global average pooling enforces correspondence between feature maps and categories, making feature maps easily interpretable as confidence maps for each category.

The sigmoid function is applied to each block’s output to produce a channel-wise and spatial attention vector, which can then be used to scale the input feature maps, allowing the network to focus on the most informative features and improving the representational power of the network by accentuating features of great importance and suppressing features of less relevance.
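The scaling described above can be illustrated with a small NumPy sketch of channel attention followed by spatial attention. It follows the widely used pattern of pooling, a small shared MLP, and a sigmoid gate; the exact layer configuration of Figure 3 is not reproduced here. The spatial branch uses a simple weighted sum of the average and maximum maps instead of a convolution, and the weights `w1`, `w2`, `a`, and `b` are random or arbitrary placeholders rather than trained parameters.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def channel_attention(x, w1, w2):
    """x has shape (C, H, W). Returns x scaled by one weight per channel."""
    avg = x.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))    # global maximum pooling -> (C,)

    def mlp(v):
        # Shared two-layer perceptron with a ReLU hidden layer.
        return w2 @ np.maximum(w1 @ v, 0.0)

    weights = sigmoid(mlp(avg) + mlp(mx))  # values in (0, 1)
    return x * weights[:, None, None]


def spatial_attention(x, a, b):
    """Simplified spatial gate: mixes channel-wise average and maximum maps."""
    avg_map = x.mean(axis=0)  # (H, W)
    max_map = x.max(axis=0)   # (H, W)
    mask = sigmoid(a * avg_map + b * max_map)
    return x * mask[None, :, :]


rng = np.random.default_rng(0)
C, H, W, hidden = 8, 16, 16, 2
x = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((hidden, C))
w2 = 0.1 * rng.standard_normal((C, hidden))

y = spatial_attention(channel_attention(x, w1, w2), a=1.0, b=1.0)
print(y.shape)
```

Since both gates lie in (0, 1), attention in this form can only attenuate features; the relative emphasis arises because less relevant channels and locations are attenuated more strongly.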

Next, the softmax function is used to ensure that the data signal sent to the deep convolutional channel encoder is both interpretable and effective for further processing. This can enhance overall model performance, especially in tasks that require attention and focus on special features.

The CSA-DCE’s output is then fed into the DCCE, whose architecture is illustrated in Figure 4. The Rectified Linear Unit (ReLU) activation function introduces non-linearity properties to capture complex patterns, while maximum pooling is leveraged to reduce the spatial dimensions of the feature maps, which leads to lower computational costs and renders feature maps invariant to small translations.

Figure 4. Deep convolutional channel encoder architecture.

Essentially, in order to preserve the most noticeable features and downsample the remaining ones, maximum pooling extracts the largest value from each feature map’s patch. It is worth noting that BN and MLP, in Figure 3, are the batch normalization layer and the multi-layer perceptron, respectively, for the robust and efficient learning of complex patterns from data by the system.
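The patch-wise selection performed by maximum pooling can be shown on a small example. The sketch below assumes non-overlapping 2 × 2 patches on a single feature map; the pooling size and stride actually used in the DCCE are not stated in this section.

```python
import numpy as np


def max_pool2d(x: np.ndarray, k: int = 2) -> np.ndarray:
    """Non-overlapping k x k max pooling on an (H, W) feature map."""
    H, W = x.shape
    if H % k or W % k:
        raise ValueError("H and W must be divisible by k")
    # Split each axis into (blocks, k) and take the maximum inside each block.
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))


feature_map = np.array([[1, 2, 0, 1],
                        [3, 4, 1, 0],
                        [0, 0, 5, 6],
                        [1, 0, 7, 8]])
pooled = max_pool2d(feature_map)
print(pooled)  # [[4 1]
               #  [1 8]]
```

Each output entry keeps only the largest activation of its patch, which is why the result is unchanged when an activation moves to another position within the same patch.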

The power normalization in the DCCE stage fine-tunes feature maps before they proceed to the subsequent layer. By standardizing feature magnitudes, it ensures more balanced and stable signal transmission, enhancing the reliability of the information flow through the network. The semantic destination, on the other side of the network, referred to as D, leverages the potential of channel and spatial attention mechanisms for the performance enhancement of the deep convolutional semantic decoder (CSA-DCD).

In conjunction with the deep convolutional channel decoder (DCCD), the CSA-DCD captures feature patterns after channel decoding and reconstructs the important and meaningful patterns of the signal of interest, $\hat{x}$. The simplified architectures are shown in Figure 5 and Figure 6.

Figure 5. Deep convolutional channel decoder architecture.

Figure 6. Channel and spatial attention-based deep convolutional decoder architecture.

4.2. Attention-based Signal Enhancement in HAPS Relay Networks

Given that the data signal $x_{S}$ is transmitted in phase 1, as shown in Figure 2, $x_{SR}$, as expressed in (2), is received at the relay station, referred to as the channel and spatial attention-based deep convolutional generative adversarial network (CSA-DCGAN) SDF HAPS.

When a DCGAN is trained on a dataset of high-quality images, the network can learn to generate enhanced versions of transmitted images, improving image quality and making them more suitable for further processing and analysis. Because the relay R receives the multiple-image data signal, it first applies composite signal decoding, signal separation, and denoising in the signal preprocessing step, leveraging the attention-based modules to isolate each image’s corresponding features [38], as depicted in Figure 7.

Figure 7. Relay network architecture.

By separating images, the relay ensures that each component can be processed individually, reducing interference and improving the quality of the extracted features. Once the images are separated, the relay extracts relevant features from each individual image signal to ensure that the subsequent processing steps have accurate and distinct data to work with.

The resultant feature maps generated are combined into a unified representation. Concatenating features from separated image signals allows the relay to create a comprehensive representation that includes all relevant information from the original signal [39].

The combined features are further processed by using channel and spatial attention mechanisms to focus on the most important parts of the data. Leveraging the relay’s generator and discriminator networks, an enhanced data signal that is generated is forwarded to the destination.

The relay-related operations, as expressed in (4), ensure that the enhanced signal can reach the destination through a wireless channel without significant loss of quality, maintaining the integrity of the original information. At the relay R and destination D, $x_{SR}$ and $x_{RD}$, respectively, are received after transmission through a Rician fading channel characterized by its coefficient K.

Upon the reception of the $x_{RD}$ data signal at the semantic destination D, channel decoding is performed for error correction, enabled by the DCCD’s ability to learn from data, robustness, and end-to-end training capabilities, which render it effective for error correction in communication systems. The data signal resulting from channel decoding can be expressed as

$$x_{dec} = \mathcal{C}^{-1}(x_{RD}, \mu_{2}), \tag{6}$$

where $\mathcal{C}^{-1}$ and $\mu_{2}$ represent the channel decoding process and the signal’s SNR of the R-D wireless channel, respectively.

Channel decoding is followed by semantic decoding for the meaningful reconstruction of the received data signal, and the estimated signal can be obtained as

$$\hat{x} = \mathcal{S}^{-1}(x_{dec}), \tag{7}$$

where $\mathcal{S}^{-1}$ is the semantic decoding process. We consider that the channel decoder is designed to adapt to the shared knowledge of the channel characteristics for the accurate reconstruction of the received signal.

4.3. System Loss Function

Given that the proposed relayed semantic communication system is a composite system, component losses must be considered for efficient system design and analysis. The proposed system architecture assumes that each system component exhibits losses linked to deep learning techniques-related computations and transmission losses that occur during the whole communication process. Therefore, the system total loss can be expressed as

$$L_{sys} = \alpha L_{S} + \beta L_{C} + \gamma L_{G} + \delta L_{D} + \theta L_{COM}, \tag{8}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$, and $\theta$ are weights representing the relative importance of each component and are allocated during system training, such that $0 \leq \alpha, \beta, \gamma, \delta, \theta \leq 1$. $L_{S}$, $L_{C}$, $L_{G}$, $L_{D}$, and $L_{COM}$ are the semantic coding loss, channel coding loss, generator network loss, discriminator network loss, and communication loss, respectively.
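The weighted combination in Equation (8) amounts to a constrained weighted sum, as the following sketch shows. The weight and loss values are hypothetical, chosen only to make the arithmetic easy to follow; the paper does not report the weights used in training.

```python
def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum of the five loss terms; every weight must lie in [0, 1]."""
    terms = ("S", "C", "G", "D", "COM")
    for name in terms:
        if not 0.0 <= weights[name] <= 1.0:
            raise ValueError(f"weight for L_{name} must lie in [0, 1]")
    return sum(weights[name] * losses[name] for name in terms)


# Hypothetical values for alpha, beta, gamma, delta, theta and the five losses.
weights = {"S": 1.0, "C": 0.5, "G": 0.1, "D": 0.1, "COM": 0.2}
losses = {"S": 0.02, "C": 0.04, "G": 0.7, "D": 1.3, "COM": 0.5}

l_sys = total_loss(losses, weights)
print(l_sys)  # 0.02 + 0.02 + 0.07 + 0.13 + 0.10, approximately 0.34
```

With these illustrative weights, the adversarial terms contribute more to the total than the reconstruction terms even though their weights are small, which shows why the relative scale of each loss matters when the weights are chosen.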

The components’ losses can individually be expressed as follows.

The semantic coding loss can be represented by the deviation between the original semantic information x and the final reconstructed semantic information $\hat{x}$:

$$L_{S} = \mathbb{E}\left[\|x - \hat{x}\|^{2}\right]. \tag{9}$$

The semantic coding loss measures the average squared difference between the predicted and actual values. The channel coding loss can be represented as a mean squared error (MSE) loss and expressed as

$$L_{C} = \mathbb{E}\left[\|x_{sem} - x_{dec}\|^{2}\right], \tag{10}$$

where $x_{sem}$ is the semantic encoder output data signal.
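In practice, the expectations in Equations (9) and (10) are estimated over a batch. The sketch below is one way to do so, assuming that the first array axis indexes the samples; it computes the squared norm per sample and averages over the batch. Note that many MSE implementations instead average over all elements, which differs from this estimate only by a constant factor equal to the number of elements per sample.

```python
import numpy as np


def expected_squared_error(a, b) -> float:
    """Batch estimate of E[||a - b||^2]; axis 0 is assumed to index samples."""
    a = np.asarray(a)
    b = np.asarray(b)
    diff = (a - b).reshape(a.shape[0], -1)     # flatten each sample
    per_sample = np.sum(np.abs(diff) ** 2, axis=1)  # squared L2 norm per sample
    return float(np.mean(per_sample))


# Toy batch of two samples with two values each.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 0.0], [0.0, 4.0]])

loss = expected_squared_error(x, x_hat)
print(loss)  # squared norms are 4 and 9, so the batch average is 6.5
```

The same function applies to Equation (10) by passing the semantic encoder output and the channel decoder output; `np.abs` keeps the computation valid for complex-valued signals.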

In addition to channel and spatial attention mechanisms and given that the DCGAN comprises a generator and discriminator networks for the efficient enhancement of the received data signal at the relay stage, the corresponding losses can be expressed as

$$L_{G} = \mathbb{E}\left[\log\left(1 - D(G(z))\right)\right], \tag{11}$$

$$L_{D} = \mathbb{E}\left[\log D(x_{SR})\right] + \mathbb{E}\left[\log\left(1 - D(G(z))\right)\right], \tag{12}$$

where $x_{SR}$ is the real data signal, z is the noise input to the generator, and G(z) and D(G(z)) represent the fake image data generated by the generator and the discriminator’s output when given fake data, respectively.
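Equations (11) and (12) can be evaluated directly from batches of discriminator outputs, as in the sketch below. It assumes that the discriminator outputs probabilities in (0, 1) and clips them for numerical safety. In the standard GAN formulation, the expression in (12) is the quantity the discriminator seeks to maximize, so implementations usually minimize its negative; that sign convention is an implementation detail assumed here and is not stated in the text.

```python
import numpy as np


def generator_loss(d_fake, eps=1e-12):
    """Eq. (11): mean of log(1 - D(G(z))); the generator minimizes this."""
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(1.0 - d_fake)))


def discriminator_objective(d_real, d_fake, eps=1e-12):
    """Eq. (12): mean log D(x_SR) plus mean log(1 - D(G(z)))."""
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = np.full(4, 0.5)
print(generator_loss(undecided))                      # log(0.5)
print(discriminator_objective(undecided, undecided))  # 2 * log(0.5)

# A confident discriminator pushes the objective toward 0 from below.
print(discriminator_objective([0.9, 0.95], [0.05, 0.1]))
```

The undecided case, where both expressions reduce to multiples of log(0.5), corresponds to the equilibrium of the standard GAN game.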

The communication loss $L_{COM}$ measures the signal degradation during transmission, which can be caused by various factors, such as noise, interference, and signal attenuation. It can be mathematically expressed as

$$L_{COM} = H(x) - I(x; \hat{x}), \tag{13}$$

where H(x) is the entropy of the source message x, representing the total amount of information in the source, and $I(x; \hat{x})$ is the mutual information between x and the received message $\hat{x}$, representing the amount of information successfully received.
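For discrete variables, Equation (13) equals the conditional entropy $H(x \mid \hat{x})$, i.e., the uncertainty about the source that remains after reception. The sketch below computes it from a joint probability table, which is a toy assumption made for illustration; for images, the entropy and mutual information would have to be estimated, and the section does not specify how.

```python
import numpy as np


def entropy_bits(p) -> float:
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def communication_loss(joint) -> float:
    """H(x) - I(x; x_hat) in bits, where joint[i, j] = P(x = i, x_hat = j)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()   # normalize to a valid distribution
    p_x = joint.sum(axis=1)       # marginal of the source
    p_xhat = joint.sum(axis=0)    # marginal of the received message
    mutual_info = entropy_bits(p_x) + entropy_bits(p_xhat) - entropy_bits(joint)
    return entropy_bits(p_x) - mutual_info


perfect = np.array([[0.5, 0.0], [0.0, 0.5]])        # x_hat always equals x
independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # x_hat carries no information

print(communication_loss(perfect))      # 0 bits lost
print(communication_loss(independent))  # the full 1 bit of source entropy is lost
```

These two cases bound the loss: it is zero for an error-free link and equals the source entropy when the received message is independent of the source.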

The system optimization goal is to minimize the system loss and enhance the transmission accuracy, as described in Algorithm 1, for system overall performance improvement.

4.4. Complexity and Real-time Feasibility

The proposed semantic communication system presents a novel approach to feature extraction, processing, and reconstruction. The system components defined in the new architecture promise to enhance the feature extraction and reconstruction processes, as described in Section 3, Section 4.1, and Section 4.2.

Given that this system relies on the use of deep learning techniques, its architecture gives opportunities but, at the same time, poses challenges regarding computational and real-time constraints. The system’s computational cost is non-negligible because of the channel and spatial attention mechanisms incorporated in the semantic encoder, relay network, and semantic decoder.

Although these mechanisms improve feature representation, they must operate within a fixed processing time while performing additional matrix operations and non-linear transformations [40]. Moreover, the use of deep convolutional architectures across all components implies a high number of parameters and computations, leading to substantial memory and processing power requirements [41].

The integration of a GAN at the HAPS relay can add a further layer of complexity in terms of training requirements and potential instability [42]. Real-time operation would require substantial hardware resources, such as modern graphics processing units (GPUs), in line with recent semantic communication system implementations [43].

As the proposed system presents multiple processing stages from the source to the destination nodes, this can introduce significant latency, which is a challenge for applications requiring near-instantaneous communication [44]. Scalability remains another important issue, because it requires assessment of the capacity to manage numerous concurrent interactions or larger datasets in real time.

While the proposed system offers advanced semantic communication capabilities, its practical deployment may require a careful consideration of the trade-off between performance and resource utilization. System model compression or hardware acceleration can play a crucial role in further system optimization for more efficient real-time operation, especially in resource-constrained environments.

Algorithm 1 System loss optimization for transmission accuracy.

1: Define loss components, as in Equations (9)–(13).
2: Define the total loss function, as in Equation (8), where $α$, $β$, $γ$, $δ$, and $θ$ are weights assigned to each loss term, balancing their contribution to the total loss.
3: Initialize model parameters.
4: Forward pass:
(1) Perform a forward pass through the system: CSA-DCE → DCCE → Wireless Channel → CSA-DCGAN SDF HAPS → Wireless Channel → DCCD → CSA-DCD.
(2) Calculate the intermediate outputs needed to compute each loss component (semantic outputs, channel outputs, generator and discriminator outputs).
5: Compute the respective losses at each stage, as in Equations (9)–(13).
6: Backward pass and update parameters using backpropagation:
(1) Compute gradients for each loss component with respect to the model parameters.
(2) Update model weights for the semantic and channel encoders, generator, discriminator, and channel and semantic decoders using the Adam optimizer.
(3) Adjust loss weights to prioritize transmission accuracy.
7: Train generator and discriminator.
8: Iterate for multiple epochs, repeating Steps 4–6, optimizing the model based on the total loss function and progressively improving the transmission accuracy.
9: Monitor and adjust:
(1) Track the loss metric performance after each epoch.
(2) Adjust learning rates, loss weights, and model architecture if the desired accuracy is not reached.
10: If the accuracy stabilizes or if the loss plateaus, stop.
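The loss combination in Step 2 and the stopping rule in Step 10 can be sketched as follows. This is a minimal sketch under stated assumptions: the total loss is assumed to be a plain weighted sum of the five components, the weight values and the component names `l9` to `l13` are hypothetical labels for Equations (9)–(13), and the per-epoch losses are synthetic stand-ins for Steps 4–6 rather than outputs of the actual networks.

```python
def total_loss(components, weights):
    """Weighted sum of loss components; both arguments are dicts keyed by term name."""
    return sum(weights[name] * value for name, value in components.items())

def has_plateaued(history, patience=5, min_delta=1e-4):
    """True when the best loss of the last `patience` epochs improves on the
    best earlier loss by less than `min_delta`."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    best_recent = min(history[-patience:])
    return best_before - best_recent < min_delta

# Hypothetical weights standing in for alpha, beta, gamma, delta, and theta.
weights = {"l9": 1.0, "l10": 0.5, "l11": 0.1, "l12": 0.1, "l13": 0.2}

history = []
for epoch in range(100):
    # Stand-in for Steps 4-6: a synthetic loss that decays and then flattens.
    decay = 0.5 ** epoch
    components = {name: decay + 0.01 for name in weights}
    history.append(total_loss(components, weights))
    if has_plateaued(history):   # Step 10: stop once the loss plateaus
        break

print(f"stopped after {len(history)} epochs, final loss {history[-1]:.4f}")
```

A patience window is used instead of comparing only two consecutive epochs so that a single noisy epoch does not end training early.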

5. Performance Evaluation and Analysis

The performance of the proposed attention-based semantic relayed communication system is analyzed in this section. To provide further context and insights, the system is trained, as detailed in Algorithm 2, and simulated under the Rician fading channel model for different K values in dB, and a performance comparison of the system with and without attention mechanisms is conducted.

Algorithm 2 Training of the proposed semantic communication scheme

Input: The multiple-image data signal $x = (x_{1}, x_{2}, \dots, x_{n})$ and SNR $μ = μ_{1} = μ_{2}$.
Output: The reconstructed $\hat{x} = (\hat{x}_{1}, \hat{x}_{2}, \dots, \hat{x}_{n})$.
1: Initialize the parameters in CSA-DCE and DCCE.
2: Semantic Source:
(1) Perform the semantic encoding process: $S(x) = x_{sem}$.
(2) Execute channel encoding followed by power normalization: $C(x_{sem}) = x_{S}$.
(3) Transmit $x_{S}$ over the wireless channel.
3: Relay:
(1) Perform $x_{SR}$ preprocessing for decoding, separation, and feature extraction.
(2) Perform feature map concatenation.
(3) Apply attention-based data signal processing to enhance the resulting signal.
(4) Forward $x_{R}$, as in Equation (4), over the wireless channel towards channel decoding of $x_{RD}$.
4: Semantic Destination:
(1) Perform channel decoding, $C^{-1}$, as in Equation (6).
(2) Perform semantic decoding, $S^{-1}$, as in Equation (7).
5: Compute and optimize the system loss function, according to Algorithm 1 and Equation (8).
6: Train the system neural network functions $\{S, C, C^{-1}, S^{-1}\}$ using the Adam optimizer.
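The channel-related steps of Algorithm 2 (power normalization and the two wireless hops) can be sketched as below. This is an illustrative sketch only: it assumes unit-power complex symbols, a single block-fading Rician coefficient per hop, perfect channel knowledge at the receiver, and the same SNR on both hops ($μ_{1} = μ_{2}$). The relay here merely re-normalizes and forwards, so it does not model the CSA-DCGAN processing, and all function names are hypothetical.

```python
import math
import random

def power_normalize(symbols):
    """Scale complex symbols so that their average power equals 1."""
    avg_power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    scale = 1.0 / math.sqrt(avg_power)
    return [s * scale for s in symbols]

def rician_channel(symbols, snr_db, k_db, rng):
    """Apply one Rician fading coefficient plus complex Gaussian noise.

    Returns the equalized symbols, assuming the receiver knows the coefficient h.
    """
    k = 10 ** (k_db / 10)                     # K-factor in linear scale: LOS / scattered power
    los = math.sqrt(k / (k + 1))              # deterministic line-of-sight part
    nlos_std = math.sqrt(1 / (2 * (k + 1)))   # per-dimension std of the scattered part
    h = complex(los + rng.gauss(0, nlos_std), rng.gauss(0, nlos_std))
    noise_std = math.sqrt(10 ** (-snr_db / 10) / 2)   # unit signal power assumed
    received = [h * s + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
                for s in symbols]
    return [r / h for r in received]

rng = random.Random(0)
x_sem = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1024)]
x_s = power_normalize(x_sem)                               # stand-in for the DCCE output
x_sr = rician_channel(x_s, snr_db=20, k_db=10, rng=rng)    # source to relay hop
x_r = power_normalize(x_sr)                                # relay forwards at unit power
x_rd = rician_channel(x_r, snr_db=20, k_db=10, rng=rng)    # relay to destination hop
```

The split into a line-of-sight part and a scattered part keeps the average channel gain at 1 for every K value, so changing K alters the fading behavior without changing the average received power.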

We employ the PSNR and the MS-SSIM as evaluation metrics to quantify the system performance in terms of reconstruction strength and the similarity between $x$ and the reconstructed $\hat{x}$. Both are widely used metrics in image quality assessment. The PSNR measures the image reconstruction strength as the ratio between the maximum possible signal power and the power of the distorting noise, and it is typically expressed in decibels (dB).

We can mathematically express the PSNR as $PSNR = 10 \cdot \log_{10} \left( \frac{PV_{I}^{2}}{MSE} \right)$, where $PV_{I}$ is the maximum possible pixel value of the image and MSE is the mean squared error between the original and reconstructed images. In this context, a higher PSNR indicates better reconstruction strength and thus better image quality.
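The PSNR expression above translates directly into code. The following is a minimal sketch on flattened pixel lists; the function name, the `peak_value` parameter (standing in for $PV_{I}$), and the toy pixel values are assumptions for illustration.

```python
import math

def psnr(original, reconstructed, peak_value=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf   # identical images
    return 10 * math.log10(peak_value ** 2 / mse)

# Toy 2x2 image flattened to a list, pixel values in [0, 1].
x = [0.0, 0.5, 1.0, 0.2]
x_hat = [0.1, 0.4, 0.9, 0.3]   # every pixel is off by 0.1, so the MSE is 0.01
print(f"PSNR = {psnr(x, x_hat, peak_value=1.0):.2f} dB")
```

With a peak value of 1 and an MSE of 0.01, the result is 20 dB; halving the per-pixel error would raise it by about 6 dB.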

The MS-SSIM, an extension of the structural similarity index (SSIM), is a metric that considers image details at multiple scales and is designed to correlate better with human visual perception; thus, it is a convenient way to incorporate image details at different resolutions. Luminance ($l$), contrast ($c$), and structure ($s$) are the comparison measurements considered in the SSIM computation between the samples of the original and reconstructed images $x$ and $\hat{x}$. The structural similarity index can generally be expressed as $SSIM(x, \hat{x}) = l(x, \hat{x})^{υ} \cdot c(x, \hat{x})^{φ} \cdot s(x, \hat{x})^{ψ}$, where $υ$, $φ$, and $ψ$ are weighting factors.
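The product form above can be sketched for a single scale as follows. This is a simplified illustration, not the evaluation code behind the reported results: it uses global image statistics instead of a sliding window, a single scale instead of the multi-scale extension, and the stabilizing constants commonly used with SSIM. The function name and the default exponents of 1 are assumptions.

```python
import math

def ssim_global(x, y, peak_value=255.0, exponents=(1.0, 1.0, 1.0)):
    """Single-scale SSIM from global image statistics (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    # Small constants that keep the ratios stable when means or variances are near zero.
    c1 = (0.01 * peak_value) ** 2
    c2 = (0.03 * peak_value) ** 2
    c3 = c2 / 2

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)

    u, p, s = exponents   # the weighting factors of the three terms
    return luminance ** u * contrast ** p * structure ** s

x = [0.0, 0.5, 1.0, 0.2]
x_hat = [0.1, 0.4, 0.9, 0.3]
print(f"SSIM = {ssim_global(x, x_hat, peak_value=1.0):.4f}")
```

A multi-scale variant would repeat the contrast and structure comparison on successively downsampled copies of both images and combine the per-scale results.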

The evaluation is carried out using three datasets, namely CIFAR-10, CIFAR-100 [43], and Kodak [45]. CIFAR-100 is more challenging than CIFAR-10 because of its higher number of categories. The Kodak dataset, which consists of high-quality natural images, is significantly more complex. Using these three datasets provides a more comprehensive evaluation of the proposed system’s performance across different levels of complexity, image qualities, and task types. This diverse evaluation helps to assess the generalizability of the scheme to various real-world scenarios and applications.

The number of epochs is set to 100, and the batch size is 256 for the CIFAR-10 and CIFAR-100 datasets and 32 for the Kodak dataset, given that the latter has only 24 images. The simulation results show that the system can achieve a 2 dB gain when leveraging the attention mechanisms and a PSNR of 38.5 dB can be obtained with an MS-SSIM exceeding 0.999 at SNR = 20 dB, as shown in Figure 8.
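To put the 2 dB figure in perspective, the PSNR definition can be rearranged into a ratio of mean squared errors. The derivation below assumes that both systems are evaluated with the same maximum pixel value, so that it cancels; the subscripts "with" and "without" (attention) are labels introduced here for illustration.

```latex
\Delta PSNR
  = 10 \log_{10}\!\left(\frac{MSE_{\text{without}}}{MSE_{\text{with}}}\right)
  = 2\ \text{dB}
\quad\Longrightarrow\quad
\frac{MSE_{\text{with}}}{MSE_{\text{without}}} = 10^{-0.2} \approx 0.63
```

Under that assumption, a 2 dB gain corresponds to a reduction of roughly 37% in the mean squared reconstruction error.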

Figure 8. System performance results by using CIFAR-10 for various channel conditions: (a) Reconstruction strength evaluation. (b) Similarity evaluation.

The system achieves a PSNR of more than 37 dB and a corresponding MS-SSIM close to 0.999 at an SNR of 20 dB on the CIFAR-100 dataset, as shown in Figure 9. As shown in Figure 10, although the Kodak dataset is characterized by complex, higher-resolution images and diverse content, the system achieves an MS-SSIM of 0.965 at SNR = 10 dB.

Figure 9. System performance results by using CIFAR-100 for various channel conditions: (a) Reconstruction strength evaluation. (b) Similarity evaluation.

Figure 10. System performance results by using Kodak dataset for various channel conditions: (a) Reconstruction strength evaluation. (b) Similarity evaluation.

In Figure 11, a sample image from the Kodak dataset is used to illustrate the performance of the proposed system in terms of reconstructed image visualization. When the attention mechanisms are leveraged, increased luminance improves the capture of detail and enhances visualization without loss of information features, compared with the system without attention mechanisms.

Figure 11. Visual analysis of the reconstructed images. (a) Original image. (b) Reconstructed image with attention mechanisms. (c) Reconstructed image without leveraging attention mechanisms.

6. Conclusions

This paper proposes a novel approach to address robustness challenges in relayed semantic communication systems, enabling accurate image transmission under varying channel conditions. We leverage the combined potential of attention mechanisms and deep convolutional and generative adversarial networks to achieve enhanced transmission accuracy. By using the CSA-DCGAN SDF HAPS relay network, the source-transmitted image semantic data can be efficiently denoised, enhanced, and forwarded to the destination, where the received semantic data are successfully reconstructed.

The results of simulations demonstrate the advancement and effectiveness of the proposed approach with an achievable PSNR gain of 2 dB at SNR = 20 dB and strong reconstruction in terms of the MS-SSIM on various datasets for different Rician fading channel model factors. The performance evaluation reveals that the system becomes less dependent on the channel factor as the signal strength increases.

To further optimize the system, future research may focus on refining the component architectures to improve real-time feasibility while maintaining their advanced semantic communication capabilities. Establishing and evaluating security constraints on the framework, to preserve the privacy of image semantic data for secure semantic communications, may also be considered.

Furthermore, for a comparison of the proposed deep learning-assisted system with and without attention mechanisms, this paper is limited to the performance evaluation in terms of the PSNR, the MS-SSIM, and reconstructed image visualization. Hence, to contextualize this paper’s contributions within the broader landscape, we are planning to extend our analysis in future research works by including more expansive comparisons with supplementary benchmarks from the current literature. Additionally, it is vital for future work to report on additional metrics, such as the system compression ratio, computational complexity, and computation time analysis, for comprehensive evaluation.

Author Contributions

Conceptualization, P.N. and D.U.; methodology, P.N.; validation, P.N. and D.U.; formal analysis, P.N. and D.U.; writing—original draft preparation, P.N.; writing—review and editing, P.N. and D.U.; funding acquisition, D.U. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI, grant number JP24K02929.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CIFAR-10 and CIFAR-100 datasets used in this article are available online (https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 15 December 2024); and the Kodak dataset is also available online (https://r0k.us/graphics/kodak/, accessed on 24 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AF	amplify and forward
AI	artificial intelligence
BS	base station
AR	augmented reality
BN	batch normalization
CSA-DCD	channel and spatial attention-based deep convolutional decoder
CSA-DCE	channel and spatial attention-based deep convolutional encoder
CSA-DCGAN	channel and spatial attention-based deep convolutional generative adversarial network
DCCE	deep convolutional channel encoder
DCCD	deep convolutional channel decoder
DL	deep learning
GPU	graphics processing unit
HAPS	High-Altitude Platform Station
H2M	human to machine
JSCC	joint source–channel coding
LOS	line of sight
M2M	machine to machine
MLP	multi-layer perceptron
MSE	mean squared error
MS-SSIM	multi-scale structural similarity index
NLOS	non-line of sight
NN	neural network
OWAF	one-way amplify and forward
P2P	point to point
PSNR	peak signal-to-noise ratio
QoE	quality of experience
ReLU	Rectified Linear Unit
SC	semantic communication
SDF	Semantic Decoding and Forwarding
STAS	Semantic Transmission Accuracy enhancement System
TWAF	two-way amplify and forward
UT	user terminal
VR	virtual reality

References

1. Wang, C.-X.; You, X.; Gao, X.; Zhu, X.; Li, Z.; Zhang, C.; Wang, H.; Huang, Y.; Chen, Y.; Haas, H.; et al. On the road to 6G: Visions, requirements, key technologies, and testbeds. IEEE Commun. Surv. Tut. 2023, 25, 905–974.
2. Qin, Z.; Tao, X.; Lu, J.; Tong, W.; Li, G.Y. Semantic communications: Principles and challenges. arXiv 2022, arXiv:2201.01389v5.
3. Kwasinski, A.; Chande, V. Recent Applications and Emerging Designs in Source–Channel Coding. In Joint Source-Channel Coding, 1st ed.; John Wiley & Sons Ltd.: Chichester, UK; Wiley-IEEE Press: Hoboken, NJ, USA, 2023; pp. 335–379.
4. Yang, W.; Du, H.; Liew, Z.Q.; Lim, W.Y.B.; Xiong, Z.; Niyato, D.; Chi, X.; Shen, X.; Miao, C. Semantic Communications for Future Internet: Fundamentals, applications, and challenges. IEEE Commun. Surv. Tut. 2023, 25, 213–250.
5. Xu, J.; Tung, T.-Y.; Ai, B.; Chen, W.; Sun, Y.; Gündüz, D. Deep joint source-channel coding for semantic communications. IEEE Comm. Mag. 2023, 61, 42–48.
6. Gündüz, D.; Qin, Z.; Aguerri, I.E.; Dhillon, H.S.; Yang, Z.; Yener, A.; Wong, K.K.; Chae, C.-B. Beyond transmitting bits: Context, semantics, and task-oriented Communications. IEEE J. Sel. Areas Commun. 2023, 41, 5–41.
7. Strinati, E.C.; Barbarossa, S. 6G networks: Beyond Shannon towards semantic and goal-oriented communications. Comp. Net. 2021, 190, 1–52.
8. Shi, G.; Xiao, Y.; Li, Y.; Xie, X. From semantic communication to semantic-aware networking: Model, architecture, and open problems. IEEE Commun. Mag. 2021, 59, 44–50.
9. Bao, J.; Basu, P.; Dean, M.; Partridge, C.; Swami, A.; Leland, W.; Hendler, J.A. Towards a theory of semantic communication. In Proceedings of the IEEE Network Science Workshop, West Point, NY, USA, 22–24 June 2011.
10. Qin, Z.; Ye, H.; Li, G.Y.; Juang, B.-H.F. Deep learning in physical layer communications. IEEE Wirel. Commun. 2019, 26, 93–99.
11. Qin, Z.; Li, G.Y.; Ye, H. Federated learning and wireless communications. IEEE Wirel. Commun. 2021, 28, 134–140.
12. Gruber, T.; Cammerer, S.; Hoydis, J.; Brink, S.T. On deep learning-based channel decoding. In Proceedings of the IEEE 51st Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 22–24 March 2017.
13. Ye, H.; Li, G.Y.; Juang, B.-H.F. Power of deep learning for channel estimation and signal detection in OFDM systems. IEEE Wirel. Commun. Lett. 2018, 7, 114–117.
14. Samuel, N.; Diskin, T.; Wiesel, A. Deep MIMO detection. In Proceedings of the IEEE 18th International Workshop on Signal Processing and Advanced Wireless Communication, Sapporo, Japan, 3–6 July 2017.
15. Reinsel, D.; Gantz, J.; Rydning, J. The digitization of the world from edge to core. IDC Whitepaper 2018, 16, 1–28.
16. Yang, P.; Xiao, Y.; Xiao, M.; Li, S. 6G wireless communications: Vision and potential techniques. IEEE Netw. 2019, 33, 70–75.
17. Boswarthick, D.; Elloumi, O. Conclusions. In M2M Communications: A Systems Approach, 1st ed.; Boswarthick, D., Elloumi, O., Hersent, O., Eds.; John Wiley & Sons: Chichester, UK, 2012; pp. 297–298.
18. Tang, B.; Li, Q.; Huang, L.; Yin, Y. Text semantic communication systems with sentence-level semantic fidelity. In Proceedings of the IEEE Wireless Communications and Networking Conference, Glasgow, UK, 26–29 March 2023.
19. Jiang, P.; Wen, C.-K.; Jin, S.; Li, G.Y. Wireless semantic communications for video conferencing. IEEE J. Sel. Areas Commun. 2023, 41, 230–244.
20. Huang, D.; Tao, X.; Gao, F.; Lu, J. Deep learning-based image semantic coding for semantic communications. In Proceedings of the IEEE Global Communications Conference, Madrid, Spain, 7–11 January 2021.
21. Zhang, P.; Liu, Y.; Song, Y.; Zhang, J. Advances and challenges in semantic communications: A systematic review. Natl. Sci. Open 2024, 3, 1–36.
22. Xie, H.; Qin, Z.; Li, G.Y.; Juang, B.-H. Deep learning enabled semantic communication systems. IEEE Trans. Signal Process. 2021, 69, 2663–2675.
23. Jiang, P.; Wen, C.-K.; Jin, S.; Li, G.Y. Deep source-channel coding for sentence semantic transmission with HARQ. IEEE Trans. Commun. 2022, 70, 5225–5240.
24. Zhou, Q.; Li, R.; Zhao, Z.; Xiao, Y.; Zhang, H. Adaptive bit rate control in semantic communication with incremental knowledge-based HARQ. IEEE Open J. Com. Soc. 2022, 3, 1076–1089.
25. Ma, S.; Liang, W.; Zhang, B.; Wang, D. An investigation on intelligent relay assisted semantic communication networks. In Proceedings of the IEEE Wireless Communications and Networking Conference, Glasgow, UK, 26–29 March 2023.
26. Luo, X.; Chen, Z.; Xia, B.; Wang, J. Autoencoder-based semantic communication systems with relay channels. In Proceedings of the IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022.
27. Bian, C.; Shao, Y.; Wu, H.; Gunduz, D. Deep joint source-channel coding over cooperative relay networks. In Proceedings of the IEEE International Conference on Machine Learning for Communication and Networking, Stockholm, Sweden, 5–8 May 2024.
28. Kang, J.; Du, H.; Li, Z.; Xiong, Z.; Ma, S.; Niyato, D. Personalized saliency in task-oriented semantic communications: Image transmission and performance analysis. IEEE J. Sel. Areas Commun. 2023, 41, 186–201.
29. Zhouxiang, Z.; Zhaohui, Y.; Mingzhe, C.; Zhaoyang, Z.; Poor, H.V. A joint communication and computation design for probabilistic semantic communications. Entropy 2024, 26, 394.
30. Weng, Z.; Qin, Z.; Tao, X.; Pan, C.; Liu, G.; Li, G.Y. Deep learning enabled semantic communications with speech recognition and synthesis. IEEE Trans. Wirel. Commun. 2023, 22, 6227–6240.
31. Zhang, G.; Hu, Q.; Qin, Z.; Cai, Y.; Yu, G.; Tao, X. A unified multi-task semantic communication system for multimodal data. IEEE Trans. Commun. 2024, 72, 4101–4116.
32. Yan, L.; Qin, Z.; Zhang, R.; Li, Y.; Li, G.Y. Resource allocation for text semantic communications. IEEE Wirel. Commun. Lett. 2022, 11, 1394–1398.
33. Gupta, A.; Sellathurai, M. End-to-end learning-based amplify-and-forward relay networks using autoencoders. In Proceedings of the IEEE International Conference on Communications, Dublin, Ireland, 7–11 June 2020.
34. Mu, X.; Liu, Y.; Guo, L.; Al-Dhahir, N. Heterogeneous semantic and bit communications: A semi-NOMA scheme. IEEE J. Sel. Areas Commun. 2023, 41, 155–169.
35. Yan, L.; Qin, Z.; Zhang, R.; Li, Y.; Li, G.Y. QoE-Aware resource allocation for semantic communication networks. In Proceedings of the IEEE Global Communications Conference, Rio de Janeiro, Brazil, 4–8 December 2022.
36. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.-S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
37. Zhao, L.; Zhang, Z.A. An improved pooling method for convolutional neural networks. Sci. Rep. 2024, 14, 1589.
38. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
39. An, W.; Bao, Z.; Liang, H.; Dong, C.; Xu, X. A Relay System for Semantic Image Transmission Based on Shared Feature Extraction and Hyperprior Entropy Compression. IEEE Internet Things J. 2024, 11, 16158–16170.
40. Prabhath, S.; Yasith, G.; Thanuj, F.; Lahiru, T.; Anil, F. Semantic Communication Based Complexity Scalable Image Transmission System for Resource Constrained Devices. In Proceedings of the IEEE CTSoc Gaming, Entertainment and Media (GEM) Conference, Turin, Italy, 5–7 June 2024.
41. Hanju, Y.; Taehun, J.; Linglong, D.; Songkuk, K.; Chan-Byoung, C. Demo: Real-Time Semantic Communications with a Vision Transformer. In Proceedings of the IEEE International Conference on Communications Workshops (ICC Workshops), Seoul, Republic of Korea, 16–20 May 2022.
42. Liu, W.; Zeng, Q.; Lu, L.; Abdul, W. Intelligent Semantic Communication System Based on Kolmogorov–Arnold Networks Driven by Dynamic Terminal-Side Computing Power Network. Electronics 2024, 13, 4076.
43. CIFAR-10 and CIFAR-100 Datasets. Available online: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 15 December 2024).
44. Zhijin, Q.; Feifei, G.; Bo, L.; Xiaoming, T.; Guangyi, L.; Chengkang, P. A Generalized Semantic Communication System: From Sources to Channels. IEEE Wirel. Commun. 2023, 30, 18–26.
45. Kodak-Lossless-True-Color-Image-Suite. Available online: https://r0k.us/graphics/kodak/ (accessed on 24 December 2024).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
