Semantic Information Recovery in Wireless Networks

Motivated by the recent success of Machine Learning (ML) tools in wireless communications, the idea of semantic communication by Weaver from 1949 has gained attention. It breaks with Shannon’s classic design paradigm by aiming to transmit the meaning of a message, i.e., semantics, rather than its exact version and, thus, enables savings in information rate. In this work, we extend the fundamental approach from Basu et al. for modeling semantics to the complete communications Markov chain. Thus, we model semantics by means of hidden random variables and define the semantic communication task as the data-reduced and reliable transmission of messages over a communication channel such that semantics is best preserved. We consider this task as an end-to-end Information Bottleneck problem, enabling compression while preserving relevant information. As a solution approach, we propose the ML-based semantic communication system SINFONY and use it for a distributed multipoint scenario; SINFONY communicates the meaning behind multiple messages that are observed at different senders to a single receiver for semantic recovery. We analyze SINFONY by processing images as message examples. Numerical results reveal a tremendous rate-normalized SNR shift up to 20 dB compared to classically designed communication systems.

and multi-user transmission with multi-modal data [24]. Even knowledge graphs, i.e., a prior knowledge base, were incorporated into the transformer-based AE design to improve inference at the receiver side and, thus, text recovery [25].
Not considering Weaver's idea of semantic communication in particular, the authors in [26] show, for the first time, that task-oriented communications (Level C) for edge cloud transmission can be mathematically formulated as an Information Bottleneck (IB) optimization problem. Moreover, for solving the IB problem, they introduce a DNN-based approximation and show its applicability for the specific task of edge cloud transmission. The term "semantic information" is only mentioned once in [26], referring to Joint Source-Channel Coding (JSCC) of text from [19] using recurrent neural networks. In [19], the authors observe that sentences that express the same idea have embeddings that are close together in Hamming distance. However, they use the cross entropy between words and estimated words as the loss function and the word error rate as the performance measure; neither reflects whether two sentences have the same meaning, only whether they are exactly the same.
Semantic communication is thus still a nascent field: it remains unclear what exactly the term means [27] and, in particular, how it is distinguished from JSCC [19,28]. Consequently, many survey papers aim to provide an interpretation, see, e.g., [9][10][11][12][13]. We will revisit this issue in Section 4.

Main Contributions
The main contributions of this article are:
• Motivated by the approach of Bao, Basu et al. [16,17], we adopt the term semantic source. Inspired by Weaver's notion, we bring it to the context of communications by considering the complete Markov chain, including semantic source, communications source, transmit signal, communication channel, and received signal, in contrast to both [16,17]. Further, we also extend beyond the example of deterministic entailment relations between "models" and "messages" based on propositional logic in [16,17] to probabilistic semantic channels.
• We define the task of semantic communication in the sense that we perform data compression, coding, and transmission of the observed messages such that the semantic Random Variable (RV) at a recipient is best preserved. Basically, we implement joint source-channel coding of messages conveying the semantic RV, but without differentiating between Levels A and B. We formulate the semantic communication design either as an Information Maximization or as an Information Bottleneck (IB) optimization problem [29][30][31].
- Although the approach pursued here again leads to an IB problem as in [26], our article introduces a new classification and perspective of semantic communication and different ML-based solution approaches. Different from [26], we solve the IB problem by maximizing the mutual information for a fixed encoder output dimension that bounds the information rate.
- This publication also differs from approaches to semantic communication presented in the literature, e.g., [21,32], both in the interpretation of what is meant by semantic information and in the objective of recovering this semantic information.
• Finally, we propose the ML-based semantic communication system SINFONY for a distributed multipoint scenario in contrast to [26]: SINFONY communicates the meaning behind multiple messages that are observed at different senders to a single receiver for semantic recovery. Compared to the distributed scenario in [33,34], we include the communication channel.
• We analyze SINFONY by processing images as an example of messages. Notably, numerical results reveal a tremendous rate-normalized SNR shift up to 20 dB compared to classically designed communication systems.

Philosophical Considerations
Despite the much-renewed interest, research on semantic communication is still in its infancy, and recent work reveals differing understandings of the word semantics. In this work, we contribute our interpretation. To motivate it, we briefly revisit the birth of communications research from a philosophical point of view; its theoretical foundation was laid by Shannon in his landmark paper [1] in 1948.
He stated that "Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem". In fact, this viewpoint abstracts all kinds of information one may transmit, e.g., oral and written speech, sensor data, etc., and also lays the foundation for the research area of Shannon information theory. Thus, it found its way into many other research areas where data or information are processed, including Artificial Intelligence (AI) and especially its subdomain Machine Learning (ML).
Weaver saw this broad applicability of Shannon's theory back in 1949. In his comprehensive review of [1], he first states that "there seem to be [communication] problems at three levels" [2] already mentioned in Section 2. These three levels are quoted in recent works, where Level C is oftentimes referred to as goal-oriented communication instead [10].
But we note that, in his concluding section, he then questions this segmentation. He argues for the generality of the theory at Level A for all levels and "that the interrelation of the three levels is so considerable that one's final conclusion may be that the separation into the three levels is really artificial and undesirable".
It is important to emphasize that the separation is rather arbitrary. We agree with Weaver's statement because the most important point that is also the focus herein is the definition of the term semantics, e.g., by Basu et al. [16,17]. Note that the entropy of the semantics is less than or equal to the entropy of the messages. Consequently, we can save information rate by introducing meaning or context. In fact, we are able to add arbitrarily many levels of semantic details to the communication problem and optimize communications for a specific semantic background, e.g., an application or human.

Semantic Source and Channel
Now, we will define our information-theoretic system model of semantic communication. Figure 1 shows the schematic of our model. We assume the existence of a semantic source, described as a hidden target multivariate Random Variable (RV) $z \in \mathcal{M}_z^{N_z \times 1}$ from a domain $\mathcal{M}_z$ of dimension $N_z$, distributed according to a probability density or mass function (pdf/pmf) $p(z)$. To simplify the discussion, we assume it to be discrete and memoryless. For the remainder of the article, note that the domain $\mathcal{M}$ of all RVs may be either discrete or continuous. Further, we note that the definition of entropy for discrete and continuous RVs differs. For example, the differential entropy of continuous RVs may be negative, whereas the entropy of discrete RVs is always non-negative [35]. Without loss of generality, we will thus assume all RVs either to be discrete or to be continuous. In this work, we avoid notational clutter by using the expected value operator: replacing the integral by a summation, the equations stated for continuous RVs are also valid for discrete RVs and vice versa.
Our approach is similar to that of [16,17]. In [16,17], the semantic source is described by "models of the world". (Note that, in [17], the semantic information source is defined as a tuple $(z, s, p(z, s), L)$. In this original notation, $z$ is the model, $s$ the message, $p(z, s)$ the joint distribution of $z$ and $s$, and $L$ is the deterministic formal language.) In [17], a semantic channel then generates messages through entailment relations between "models" and "messages". We will call these "messages" the source signal and define it to be a RV $s \in \mathcal{M}_s^{N_s \times 1}$, as it is usually observed and enters the communication system. In the classic Shannon design, the aim is to reconstruct the source $s$ as accurately as possible at the receiver side. Further, we note that the authors in [17] considered the example of a semantic channel with deterministic entailment relations between $z$ and $s$ based on propositional logic. In this article, we go beyond this assumption and consider probabilistic semantic channels modeled by the distribution $p(s|z)$ that include the entailment in [17] as a special case, i.e., $p(s|z) = \delta(s - f(z))$, where $\delta(\cdot)$ is the Dirac delta function and $f(\cdot)$ is any generic function. Our viewpoint is motivated by the recent success of pattern recognition tools that advanced the field of AI in the 2010s and may be used to extract semantics [7]. Our approach also extends models as in [21]. There, the authors design a semantic communication system for the transmission of written language/text similar to [19] using transformer networks. In contrast to our work, [21] does not define meaning as RV $z$. The objective in [21] is to reconstruct $s$ (sentences) as well as possible, rather than the meaning (RV $z$) conveyed in $s$. Optimization is carried out with regard to a loss function consisting of two parts: the cross entropy between language input $s$ and output estimate $\hat{s}$, and a scaled mutual information term between transmit signal $x$ and receive signal $y$. After optimization, the authors measure semantic performance by some semantic metric $L(s, \hat{s})$.
We now provide an example to explain what we mean by a semantic source $z$ and channel $p(s|z)$. Let us imagine a biologist who has an image of a tree. The biologist wants to know what kind of tree it is by interpreting the observed data (image). In this case, the semantic source $z$ is a multivariate RV composed of a categorical RV with $M$ tree classes. For any realization (sample value) $z_i$ of the semantic source, the semantic channel $p(s|z)$ then outputs with some probability one image $s_i$ of a tree conveying characteristics of $z$, i.e., its meaning. Note that the underlying meaning of the same sensed data (message) can be different for other recipients, e.g., humans or tasks/applications, i.e., in other semantic contexts. Imagine a child, i.e., a person with different characteristics (personality, expertise, knowledge, goals, and intentions) than the biologist, who is only interested in whether he/she can climb this tree or whether the tree provides shade. Thus, we include the characteristics of the sender and receiver in RV $z$ and consider it directly in compression and encoding.
Compared to [16], we, therefore, argue that we also include Level C by semantic source and channel since context can be included on increasing layers of complexity. First, a RV $z_1$ might capture the interpretation, like the classification of images or sensor data. Moving beyond the first semantic layer, a RV $z_2$ might then expand this towards a more general goal, like keeping a constant temperature in power plant control. In fact, we can add or remove context, i.e., semantics and goals, arbitrarily often according to the human or application behind it, and we can optimize the overall (communication) system with regard to $z_1, z_2, \ldots, z_i$, respectively.
As a last remark, we note that we basically defined probabilistic semantic relationships, and the question remains of how exactly they might look. In our example, the meaning of the images needs to be labeled into real-world data pairs $\{s_i, z_i\}$ by experts/humans, since image recognition lacks precise mathematical models. This is also true for NLP [21]: how can we measure whether two sentences have the same meaning, i.e., what does the semantic space look like? In contrast, in [17], the authors are able to solve their well-defined technical problem (motion detection) by a model-driven approach. We can thus distinguish between model-driven and data-driven semantics, both of which can be handled within Shannon's information theory.

Semantic Channel Encoding
After the semantic source and channel in Figure 1, we extend upon [16] by differentiating between the "message"/source signal $s$ and the transmit signal $x \in \mathcal{M}_x^{N_{Tx} \times 1}$. Our challenge is to encode the source signal $s$ onto the transmit signal vector $x$ for reliable semantic communication through the physical communication channel $p(y|x)$, where $y \in \mathcal{M}_y^{N_{Rx} \times 1}$ is the received signal vector. We assume the encoder $p_\theta(x|s)$ to be parametrized by a parameter vector $\theta \in \mathbb{R}^{N_\theta \times 1}$. Note that $p_\theta(x|s)$ is probabilistic here, but assumed to be deterministic in communications, with $p_\theta(x|s) = \delta(x - \mu_\theta(s))$ and encoder function $\mu_\theta(s)$.
In summary, in contrast to both [16,17], we consider the complete Markov chain $z \leftrightarrow s \leftrightarrow x \leftrightarrow y$ including semantic source $z$, communications source $s$, transmit signal $x$, and receive signal $y$. In this way, we differ from [17], which only deals with semantic compression, and from [16], which is about joint semantic compression and channel coding (Level B). In [16], the authors consider the classic transmission system (Level A) as the (semantic) channel (not to be confused with the definition of the semantic channel in [17], which we make use of in this publication).
At the receiver side, one approach is maximum a posteriori decoding with regard to RV $s$ that uses the posterior $p_\theta(s|y)$, deduced from the prior $p(s)$ and the likelihood $p_\theta(y|s)$ by application of Bayes' law. Based on the estimate of $s$, the receiver then interprets the actual semantic content $z$ by $p(z|s)$.
Another approach we propose is to include the semantic hidden target RV $z$ into the design by processing $p_\theta(z|y)$. If the calculation of the posterior is intractable, we can replace $p_\theta(z|y)$ by the approximation $q_\phi(z|y)$, i.e., the semantic decoder, with parameters $\phi \in \mathbb{R}^{N_\phi \times 1}$. We expect the following benefit: We assume the entropy $H(z) = E_{z \sim p(z)}[-\ln p(z)]$ of the semantic RV $z$, i.e., the actual semantic uncertainty or information content, to be less than or equal to the entropy $H(s)$ of the source $s$, i.e., $H(z) \le H(s)$. There, $E_{x \sim p(x)}[f(x)]$ denotes the expected value of $f(x)$ with regard to a discrete or continuous RV $x$. Consequently, since we would like to preserve the relevant, i.e., semantic, RV $z$ rather than $s$, we can compress more strongly while still preserving the $z$ conveyed in $s$. Note that in semantic communication the relevant variable is $z$, not $s$. Thus, processing $p_\theta(s|y)$ without taking $z$ into consideration resembles the classical approach. Instead of using (and transmitting) $s$ for inference of $z$, we now want to find a compressed representation $y$ of $s$ containing the relevant information about $z$.
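To make the assumed relation between the entropy of the semantic RV and the entropy of the source concrete, the following minimal sketch evaluates both for a toy semantic channel. The two-class source and the eight-image alphabet are hypothetical values chosen purely for illustration and do not stem from the numerical example of this article.

```python
import math

def entropy(pmf):
    """Shannon entropy in nats of a pmf given as a list of probabilities."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

# Hypothetical toy semantic source: two equiprobable tree classes z.
p_z = [0.5, 0.5]
# Hypothetical semantic channel p(s|z): each class emits one of four
# distinct images with equal probability (rows: z, columns: s).
p_s_given_z = [
    [0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.25],
]

# Marginal p(s) = sum_z p(z) p(s|z).
p_s = [sum(p_z[z] * p_s_given_z[z][s] for z in range(len(p_z)))
       for s in range(len(p_s_given_z[0]))]

H_z = entropy(p_z)  # entropy of the semantic RV
H_s = entropy(p_s)  # entropy of the source signal
print(f"H(z) = {H_z:.4f} nats, H(s) = {H_s:.4f} nats")
```

In this toy setting, the semantic RV carries ln 2 nats while the source carries ln 8 nats, so a representation that preserves only z can, in principle, be compressed further than one that has to preserve s.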

Semantic Communication Design via InfoMax Principle
After explaining the system model and the basic components, we are able to approach a semantic communication system design. We first define an optimization problem to obtain the encoder $p_\theta(x|s)$ following the Information Maximization (InfoMax) principle from an information theoretic perspective [35]. Thus, we would like to find the distribution $p_\theta(x|s)$ that maps $s$ to a representation $x$ such that most information of the relevant RV $z$ is included in $y$, i.e., we maximize the Mutual Information (MI) $I(z; y)$ with regard to $p_\theta(x|s)$ [36]:
\begin{align}
& \arg\max_{p_\theta(x|s)} I_\theta(z; y) \tag{1} \\
={} & \arg\max_{\theta} E_{z,y \sim p_\theta(z,y)}\left[\ln \frac{p_\theta(z, y)}{p(z)\, p_\theta(y)}\right] \tag{2} \\
={} & \arg\max_{\theta} H(z) - H(p_\theta(z, y), p_\theta(z|y)) \tag{3} \\
={} & \arg\max_{\theta} E_{z,y \sim p_\theta(z,y)}\left[\ln p_\theta(z|y)\right]. \tag{4}
\end{align}
There, $H(p(x), q(x)) = E_{x \sim p(x)}[-\ln q(x)]$ is the cross entropy between two pdfs/pmfs $p(x)$ and $q(x)$. Note independence from $\theta$ in $H(z)$ and dependence in $p_\theta(z|y)$ and $p_\theta(z, y)$ through the Markov chain $z \to s \to y$. Problem (1) is concave with regard to the encoder $p_\theta(x|s)$ for fixed $p(s)$ [37], but not necessarily concave with regard to the encoder parameters $\theta$. For example, it is non-concave if the encoder function is non-convex with regard to its parameters, which is typically the case with DNN encoders. It is worth mentioning that we so far have not set any constraint on the variables we deal with. Hence, the form of $p_\theta(y|s)$ has to be constrained to avoid learning a trivial identity mapping $y = s$. We indeed constrain the optimization by our communication channel $p(y|x)$, which we assume to be given. If the calculation of the posterior $p_\theta(z|y)$ in (4) is intractable, we are able to replace it by a variational distribution $q_\phi(z|y)$ with parameters $\phi$. Similar to the transmitter, DNNs are usually proposed [21,38] for the design of the approximate posterior $q_\phi(z|y)$ at the receiver. To improve the performance-complexity trade-off, the application of deep unfolding can be considered, a model-driven learning approach that introduces model knowledge of $p_\theta(s, x, y, z)$ to create $q_\phi(z|y)$ [8,39].
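With $q_\phi(z|y)$, we are able to define a Mutual Information Lower Bound (MILBO) [36] similar to the well-known Evidence Lower Bound (ELBO) [7]:
\begin{align}
I_\theta(z; y) &\ge E_{z,y \sim p_\theta(z,y)}\left[\ln p_\theta(z|y)\right] \tag{5} \\
&\ge E_{z,y \sim p_\theta(z,y)}\left[\ln q_\phi(z|y)\right] \tag{6} \\
&= -E_{y \sim p_\theta(y)}\left[H(p_\theta(z|y), q_\phi(z|y))\right] \tag{7} \\
&= -L^{CE}_{\theta,\phi}. \tag{8}
\end{align}
The lower bound holds since $-H(p_\theta(z, y), p_\theta(z|y))$ itself is a lower bound of the expression in (3) and $E_{z,y \sim p_\theta(z,y)}\left[\ln \left(p_\theta(z|y) / q_\phi(z|y)\right)\right] \ge 0$, the latter being an expected Kullback-Leibler divergence. Now, we can calculate optimal values of $\theta$ and $\phi$ of our semantic communication design by minimizing the amortized cross entropy $L^{CE}_{\theta,\phi}$ in (7), i.e., marginalized across observations $y$ [8].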
Thus, the idea is to learn parametrizations of the transmitter discriminative model and of the variational receiver posterior, e.g., by AEs or reinforcement learning. Note that, in our semantic problem (1), we do not auto-encode the hidden $z$ itself, but encode $s$ to obtain $z$ by decoding. This can be seen from Figure 1 and by rewriting the amortized cross entropy (7) and (8):
\begin{align}
L^{CE}_{\theta,\phi} &= E_{y \sim p_\theta(y)}\left[H(p_\theta(z|y), q_\phi(z|y))\right] \tag{9} \\
&= E_{s,x,y,z \sim p_\theta(s,x,y,z)}\left[-\ln q_\phi(z|y)\right] \tag{10} \\
&= E_{s,z \sim p(s,z)}\left[E_{x \sim p_\theta(x|s)}\left[E_{y \sim p(y|x)}\left[-\ln q_\phi(z|y)\right]\right]\right]. \tag{11}
\end{align}
We can further prove the amortized cross entropy to be decomposable into
\[ L^{CE}_{\theta,\phi} = H(z) - I_\theta(z; y) + E_{y \sim p_\theta(y)}\left[D_{KL}\left(p_\theta(z|y) \,\|\, q_\phi(z|y)\right)\right]. \]
In the end, maximization of the MILBO with regard to $\theta$ and $\phi$ balances maximization of the mutual information $I_\theta(z; y)$ and minimization of the Kullback-Leibler (KL) divergence $D_{KL}(p_\theta(z|y) \,\|\, q_\phi(z|y))$. The former objective can be seen as a regularization term that favors encoders with high mutual information, for which decoders can be learned that are close to the true posterior.
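The decomposition of the amortized cross entropy into semantic entropy, mutual information, and expected KL divergence can be checked numerically for discrete RVs. The sketch below does so for a small joint pmf and a deliberately mismatched decoder; both tables are hypothetical numbers chosen only for illustration.

```python
import math

# Hypothetical joint pmf p(z, y) of a binary semantic RV z and a ternary
# received signal y (rows: z, columns: y); the entries sum to one.
p_zy = [[0.30, 0.15, 0.05],
        [0.05, 0.15, 0.30]]
# Hypothetical, mismatched decoder q(z|y), indexed as q[z][y];
# each column sums to one.
q = [[0.8, 0.6, 0.2],
     [0.2, 0.4, 0.8]]

n_z, n_y = len(p_zy), len(p_zy[0])
p_z = [sum(p_zy[z]) for z in range(n_z)]
p_y = [sum(p_zy[z][y] for z in range(n_z)) for y in range(n_y)]
# True posterior p(z|y) obtained by Bayes' law.
post = [[p_zy[z][y] / p_y[y] for y in range(n_y)] for z in range(n_z)]

H_z = -sum(p * math.log(p) for p in p_z)
I_zy = sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * p_y[y]))
           for z in range(n_z) for y in range(n_y))
# Amortized cross entropy E_{z,y}[-ln q(z|y)].
L_ce = -sum(p_zy[z][y] * math.log(q[z][y])
            for z in range(n_z) for y in range(n_y))
# Expected KL divergence E_y[D_KL(p(z|y) || q(z|y))].
kl = sum(p_zy[z][y] * math.log(post[z][y] / q[z][y])
         for z in range(n_z) for y in range(n_y))

print(f"L_CE = {L_ce:.4f}, H(z) - I(z;y) + KL = {H_z - I_zy + kl:.4f}")
print(f"I(z;y) = {I_zy:.4f} >= H(z) - L_CE = {H_z - L_ce:.4f}")
```

Both sides of the decomposition agree up to floating-point precision, and the gap between the mutual information and H(z) minus the cross entropy equals the expected KL divergence, which vanishes only if the decoder matches the true posterior.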

Classical Design Approach
If we consider classical communication design approaches, we would solve the problem
\[ \arg\max_{p_\theta(x|s)} I_\theta(s; y), \tag{14} \]
which relates to Joint Source-Channel Coding (JSCC). There, the aim is to find a representation $x$ that retains a significant amount of information about the source signal $s$ in $y$. Again, we can apply the lower bound (8). In fact, bounding (14) by (8) shows that approximate maximization of the mutual information justifies the minimization of the cross entropy in the AutoEncoder (AE) approach [6], often seen in recent wireless communication literature [6,19,28].

Information Bottleneck View
It should be stressed that we have not set any constraints on the variables in the InfoMax problem so far. However, in many applications, compression is needed because of the limited information rate. Therefore, we can formulate an optimization problem where we would like to maximize the relevant information $I_\theta(z; y)$ subject to the constraint that the compression rate $I_\theta(s; y)$ is limited to a maximum information rate $I_C$:
\[ \arg\max_{p_\theta(x|s)} I_\theta(z; y) \quad \text{s.t.} \quad I_\theta(s; y) \le I_C. \tag{15} \]
Problem (15) is an important variation of the InfoMax principle and is called the Information Bottleneck (IB) problem [10,29,40,41]. The IB method introduced by Tishby et al. [29] has been the subject of intensive research for years and has proven to be a suitable mathematical/information-theoretical framework for solving numerous problems, including in wireless communications [30,31,42,43]. Note that we aim for an encoder that compresses $s$ into a compact representation $x$, for discrete RVs by clustering and for continuous RVs by dimensionality reduction.
To solve the constrained optimization problem (15), we can use Lagrangian optimization and obtain
\[ \arg\max_{p_\theta(x|s)} I_\theta(z; y) - \beta \, I_\theta(s; y) \tag{16} \]
with Lagrange multiplier $\beta \ge 0$. The Lagrange multiplier $\beta$ allows a trade-off to be defined between the relevant information $I_\theta(z; y)$ and the compression rate $I_\theta(s; y)$, which indicates the relation to rate distortion theory [30]. With $\beta = 0$, we have the InfoMax problem (1), whereas for $\beta \to \infty$ we minimize the compression rate. Calculation of the mutual information terms may be computationally intractable, as in the InfoMax problem (1). Approximation approaches can be found in [44,45]. Notable exceptions are the cases in which all RVs are discrete or Gaussian distributed. We note that in [10,26] the authors already introduced the IB problem to task-oriented communications. But [10,26] do not address our viewpoint or classification. We compress and channel encode the messages/communications source $s$ for given entailment $p(s|z)$, in the sense of a data-reduced and reliable communication of the semantic RV $z$. Basically, we implement joint source-channel coding of $s$ such that the semantic RV $z$ is preserved, and we do not differentiate between Levels A and B, as indicated by Weaver's notion outlined in Section 2. Indeed, we draw a direct connection to IB compared to related semantic communication literature [19,21,38] that, so far, only included optimization with terms reminiscent of the IB problem.
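The trade-off between relevant information and compression rate can be illustrated with discrete RVs, for which both mutual information terms are computable exactly. The following sketch compares an identity encoder with a clustering encoder on a hypothetical toy model (two classes, eight images, noise-free channel); the model, the helper names, and the value of beta are assumptions made for illustration only.

```python
import math

def mutual_information(joint):
    """I(a; b) in nats from a joint pmf given as a nested list joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0.0)

# Hypothetical toy model: 2 equiprobable classes z, 8 equiprobable images s,
# where images 0..3 belong to class 0 and images 4..7 to class 1.
n_z, n_s = 2, 8
p_zs = [[0.125 if s // 4 == z else 0.0 for s in range(n_s)] for z in range(n_z)]

def joints_for_encoder(g, n_y):
    """Joint pmfs p(z, y) and p(s, y) for a deterministic, noise-free y = g(s)."""
    p_zy = [[0.0] * n_y for _ in range(n_z)]
    p_sy = [[0.0] * n_y for _ in range(n_s)]
    for z in range(n_z):
        for s in range(n_s):
            p_zy[z][g(s)] += p_zs[z][s]
            p_sy[s][g(s)] += p_zs[z][s]
    return p_zy, p_sy

def ib_objective(g, n_y, beta):
    """Return (I(z;y), I(s;y), I(z;y) - beta * I(s;y)) for encoder g."""
    p_zy, p_sy = joints_for_encoder(g, n_y)
    relevant = mutual_information(p_zy)
    rate = mutual_information(p_sy)
    return relevant, rate, relevant - beta * rate

beta = 0.5  # hypothetical Lagrange multiplier
identity = ib_objective(lambda s: s, 8, beta)
cluster = ib_objective(lambda s: s // 4, 2, beta)
print("identity encoder   (I(z;y), I(s;y), objective):", identity)
print("clustering encoder (I(z;y), I(s;y), objective):", cluster)
```

In this toy model, the clustering encoder keeps the full relevant information of ln 2 nats while reducing the compression rate from ln 8 to ln 2 nats, so it attains the larger Lagrangian objective for any positive beta.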

Semantic Information Bottleneck
This article differs from [26,34] not only on a conceptual but also on a technical level: we follow a different strategy to solve (15).
First, using the data processing inequality [46], we see that the compression rate is upper bounded by the mutual information of the encoder $I_\theta(s; x)$ and that of the channel $I(x; y)$:
\[ I_\theta(s; y) \le \min\left\{ I_\theta(s; x), I(x; y) \right\}. \tag{17} \]
In case of negligible encoder compression, $I_\theta(s; x) > I(x; y)$, the channel becomes the limiting factor of the information rate. For example, with a deterministic continuous mapping $x = \mu_\theta(s)$, the conditional differential entropy $H(x|s)$ tends to $-\infty$, so that $I_\theta(s; x) = H(x) - H(x|s)$ is unbounded. Using the chain rule of mutual information [46], we see that this upper bound on the compression rate grows with the dimension of $x$, i.e., the number of channel uses $N_{Tx}$:
\[ I(x; y) = \sum_{n=1}^{N_{Tx}} I(x_n; y \,|\, x_{n-1}, \ldots, x_1). \tag{18} \]
Assuming $y$ to be conditionally dependent on $x_n$ given $x_{n-1}, \ldots, x_1$, i.e., $p(y|x_n, \ldots, x_1) \neq p(y|x_{n-1}, \ldots, x_1)$, which is true, e.g., for an AWGN channel, it is $I(x_n; y \,|\, x_{n-1}, \ldots, x_1) > 0$ [46] and the sum in (18) indeed strictly increases. Replacing $y$ in $I(x; y)$ of (18) by $s$, the result also holds for the encoder compression $I_\theta(s; x)$. Hence, by increasing the encoder output dimension $N_{Tx}$, we can increase the possible compression rate $I_\theta(s; y)$. Interchanging $x$ and $y$ in (18), we see that the same holds for the receiver input dimension $N_{Rx}$. Furthermore, the mutual information of the channel and, thus, the compression rate are upper bounded by the channel capacity:
\[ I_\theta(s; y) \le I(x; y) \le C = \max_{p(x)} I(x; y). \tag{19} \]
For example, with an AWGN channel with noise standard deviation $\sigma_n$, we have $C = N_{Tx}/2 \cdot \ln\left(1 + 1/\sigma_n^2\right)$, again increasing with $N_{Tx}$. Now, let us assume the RVs to be discrete so that $H(x|s) \ge 0$. Indeed, this is true if the RVs are processed discretely with finite resolution on digital signal processors, as in the numerical example of Section 5. As long as $I_\theta(s; x) < C$, all information of the discrete RVs can be transmitted through the channel with arbitrarily low error probability according to Shannon's channel coding theorem [1].
Then, we can upper bound the encoder compression $I_\theta(s; x)$ and thus the compression rate $I_\theta(s; y)$ by the sum of entropies of any output $x_n$ [46] of the encoder $p_\theta(x|s)$, each with cardinality $|\mathcal{M}_x|$:
\[ I_\theta(s; y) \le I_\theta(s; x) = H(x) - H(x|s) \le H(x) \le \sum_{n=1}^{N_{Tx}} H(x_n). \tag{20} \]
Note that the entropy sum in (20) grows again with $N_{Tx}$ for discrete RVs since $0 \le H(x_n) \le \log_2(|\mathcal{M}_x|)$. Moreover, we can define an encoder capacity $C_\theta$ analogous to the channel capacity $C$ in (19) that upper bounds the encoder compression $I_\theta(s; x)$. It may be restricted by the chosen (DNN) model $p_\theta(x|s)$ and the optimization procedure with regard to $\theta$, i.e., the hypothesis class [7].
In summary, we have proven by (19) and (20) that there is an information bottleneck when maximizing the relevant information $I_\theta(z; y)$, either due to the channel distortion $I(x; y)$ or due to the encoder compression $I_\theta(s; x)$.
To fully exploit the available resources, we set the constraint $I_C$ to be equal to the upper bound, i.e., the channel capacity $C$ or the upper bound on the encoder compression rate $N_{Tx} \cdot \log_2(|\mathcal{M}_x|)$. In both cases, the upper bound grows (linearly) with the encoder output dimension $N_{Tx}$, and, thus, we can set the constraint $I_C$ higher or lower by choosing $N_{Tx}$.
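The linear growth of both upper bounds with the number of channel uses can be tabulated directly. The sketch below evaluates the AWGN capacity expression and the entropy bound from the text; the noise standard deviation and the alphabet size are hypothetical values, and the function names are introduced here for illustration only.

```python
import math

def awgn_capacity_nats(n_tx, sigma_n):
    """Capacity N_Tx / 2 * ln(1 + 1 / sigma_n^2) of N_Tx real AWGN channel
    uses with unit transmit power per use, in nats."""
    return n_tx / 2.0 * math.log(1.0 + 1.0 / sigma_n ** 2)

def entropy_bound_bits(n_tx, cardinality):
    """Upper bound N_Tx * log2(|M_x|) on the encoder compression, in bits."""
    return n_tx * math.log2(cardinality)

sigma_n = 0.5      # hypothetical noise standard deviation
cardinality = 256  # hypothetical alphabet size |M_x| (8-bit resolution)
for n_tx in (1, 2, 4, 8):
    c_bits = awgn_capacity_nats(n_tx, sigma_n) / math.log(2.0)  # nats -> bits
    print(n_tx, round(c_bits, 3), entropy_bound_bits(n_tx, cardinality))
```

Doubling the number of channel uses doubles both bounds, which is why the constraint can be set through the encoder output dimension alone.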
With fixed constraint $I_C$, we maximize the relevant information $I_\theta(z; y)$. By doing so, we derive an exact solution to (15) that maximizes $I_\theta(z; y)$ for a fixed encoder output dimension that bounds the compression rate. As in the InfoMax problem, we can exploit the MILBO to use the amortized cross entropy $L^{CE}_{\theta,\phi}$ in (9) as the optimization criterion.

Variational Information Bottleneck
In [26], however, the authors solve the variational IB problem of (16) and require tuning of $\beta$. Although they also use the MILBO as a variational approximation to the first term in (16), they introduce a KL divergence term as an upper bound on the compression rate $I_\theta(s; y)$, derived from $D_{KL}(p_\theta(y) \,\|\, q_\vartheta(y)) \ge 0$ with some variational distribution $q_\vartheta(y)$ with parameters $\vartheta$ [44]. Then, the variational IB objective function reads [44]:
\[ L^{VIB}_{\theta,\phi,\vartheta} = E_{s,z \sim p(s,z)}\left[ E_{y \sim p_\theta(y|s)}\left[-\ln q_\phi(z|y)\right] + \beta \cdot D_{KL}\left(p_\theta(y|s) \,\|\, q_\vartheta(y)\right) \right]. \tag{21} \]
Moreover, the authors use a log-uniform distribution as the variational prior $q_\vartheta(y)$ in [26] to induce sparsity on $y$ so that the number of outputs is dynamically determined based on the channel condition or SNR, i.e., $p_\theta(y|s, \sigma_n^2)$. The approach additionally necessitates approximation of the KL divergence term in (21) and estimation of the noise variance $\sigma_n^2$. With our approach, we avoid the additional approximations and the tuning of the hyperparameter $\beta$ in (21), possibly enabling better semantic performance as well as reduced inference and training complexity, at the cost of using all $N_{Tx}$ channel uses even when the channel capacity $C$ would allow a reduction. We leave a numerical comparison to [26] for future research as this is out of the scope of this paper.

Implementation Considerations
Now, we will provide important implementation considerations for optimization of (8)/(10) and (15). We note that computation of the MILBO leads to similar problems as for the ELBO [35]: if the expected value in (10) cannot be calculated analytically or is computationally intractable, as is typically the case with DNNs, we can approximate it using Monte Carlo sampling techniques with $N$ samples. For Stochastic Gradient Descent (SGD)-based optimization like, e.g., in the AE approach, the gradient with regard to $\phi$ can then be calculated by
\begin{align}
\frac{\partial L^{CE}_{\theta,\phi}}{\partial \phi} &= -E_{z,s,y \sim p_\theta(y|s) p(s|z) p(z)}\left[\frac{\partial \ln q_\phi(z|y)}{\partial \phi}\right] \tag{23} \\
&\approx -\frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ln q_\phi(z_i|y_i)}{\partial \phi}
\end{align}
with $N$ being equal to the batch size $N_b$ and by application of the backpropagation algorithm to $\frac{\partial}{\partial \phi} \ln q_\phi(z_i|y_i) = \frac{\partial}{\partial \phi} q_\phi(z_i|y_i) / q_\phi(z_i|y_i)$ in Automatic Differentiation Frameworks (ADF), e.g., TensorFlow and PyTorch. Computation of the so-called REINFORCE gradient with regard to $\theta$ leads to a high variance of the gradient estimate since we sample with regard to the distribution $p_\theta(y|s)$, which depends on $\theta$ [35].

Reparametrization Trick
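Leveraging the direct relationship between $\theta$ and $y$ in $\ln q_\phi(z|y)$ can help reduce the estimator's high variance. Typically, e.g., in the Variational AE (VAE) approach, the reparametrization trick is used to achieve this [35]. Here, we can apply it if we can decompose the latent variable $y \sim p_\theta(y|s)$ into a differentiable function $y = f_\theta(s, n)$ and a RV $n \sim p(n)$ independent of $\theta$. Fortunately, the typical forward model of a communication system $p_\theta(y|s)$ fulfills this criterion. Assuming a deterministic DNN encoder $x = \mu_\theta(s)$ and additive noise $n$ with covariance $\Sigma$, we can thus rewrite $y$ into $f_\theta(s, n) = \mu_\theta(s) + \Sigma^{1/2} \cdot n$ and, accordingly, the amortized cross entropy gradient into:
\begin{align}
\frac{\partial L^{CE}_{\theta,\phi}}{\partial \theta} &= -E_{z,s,n \sim p(n) p(s|z) p(z)}\left[\frac{\partial f_\theta(s, n)}{\partial \theta} \cdot \left.\frac{\partial \ln q_\phi(z|y)}{\partial y}\right|_{y = f_\theta(s, n)}\right] \\
&\approx -\frac{1}{N} \sum_{i=1}^{N} \frac{\partial f_\theta(s_i, n_i)}{\partial \theta} \cdot \left.\frac{\partial \ln q_\phi(z_i|y)}{\partial y}\right|_{y = f_\theta(s_i, n_i)}.
\end{align}
The reparametrization trick can be easily implemented in ADFs by adding a noise layer, typically used for regularization in ML literature, after the (DNN) function $x = \mu_\theta(s)$. Then, our loss function (10) amounts to
\[ L^{CE}_{\theta,\phi} \approx -\frac{1}{N} \sum_{i=1}^{N} \ln q_\phi\left(z_i \,|\, y_i = f_\theta(s_i, n_i)\right). \]
This enables the joint optimization of both $\theta$ and $\phi$, as demonstrated in recent works [6], treating unsupervised optimization of AEs as a supervised learning problem.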
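The effect of the reparametrization can be sketched without an ADF by writing out the chain rule by hand for a scalar toy system. In the sketch below, the linear encoder theta * s, the logistic decoder sigmoid(phi * y), the toy semantic channel, and all numerical values are assumptions made for illustration; a power constraint on the encoder output is omitted for brevity.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def softplus(a):
    """Numerically stable ln(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def loss_and_grads(theta, phi, batch, sigma):
    """Monte Carlo estimate of E[-ln q_phi(z|y)] and of its gradients for
    the reparametrized output y = f_theta(s, n) = theta * s + sigma * n."""
    loss = g_theta = g_phi = 0.0
    for z, s, n in batch:
        y = theta * s + sigma * n       # noise layer after the encoder
        a = phi * y                     # decoder logit, q(z=1|y) = sigmoid(a)
        loss += softplus(a) - z * a     # -ln q_phi(z|y)
        dl_dy = (sigmoid(a) - z) * phi  # derivative of -ln q with respect to y
        g_theta += dl_dy * s            # chain rule through df/dtheta = s
        g_phi += (sigmoid(a) - z) * y
    n_b = len(batch)
    return loss / n_b, g_theta / n_b, g_phi / n_b

rng = random.Random(0)
batch = []
for _ in range(256):
    z = rng.randint(0, 1)                        # semantic label
    s = (2 * z - 1) + 0.3 * rng.gauss(0.0, 1.0)  # toy semantic channel p(s|z)
    n = rng.gauss(0.0, 1.0)                      # noise sample, independent of theta
    batch.append((z, s, n))

theta, phi, sigma, lr = 0.1, 0.1, 0.5, 0.1
first_loss, _, _ = loss_and_grads(theta, phi, batch, sigma)
for _ in range(200):  # joint gradient descent on theta and phi
    loss, g_theta, g_phi = loss_and_grads(theta, phi, batch, sigma)
    theta -= lr * g_theta
    phi -= lr * g_phi
print(f"cross entropy: {first_loss:.3f} -> {loss:.3f}")
```

Because the noise samples are drawn independently of theta, the loss is a deterministic, differentiable function of both parameters for a given batch, so a single backward pass yields both gradients; an ADF performs the same chain rule automatically.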

Example of Semantic Information Recovery
In this section, we provide one numerical example of data-driven semantics to explain what we mean by a semantic communication design and to show its benefits: the task of image classification. In fact, we consider our example of the biologist from Section 4.2 who wants to know what type of tree it is.
For the remainder of this article, we will thus assume the hidden semantic RV to be a one-hot vector z ∈ {0, 1} M×1 where all elements are zero except for one element representing one of the M image classes. Then, the semantic channel p(s|z) (see Figure 1) generates images belonging to this class, i.e., the source signal s.
Note that for point-to-point transmission, as in [26], we could first classify the image based on the posterior $q_\phi(z|s)$, as shown in Figure 2, and transmit the estimate $\hat{z}$ (encoded into $x$) through the physical channel since this would be most rate or bandwidth efficient.
But if the image information is distributed across multiple agents, all (sub) images may contribute useful information for classification. We could thus lose information when making hard decisions on each transmitter's side. In the distributed setting, transmission and combination of features, i.e., soft information, is crucial to obtain high classification accuracy.
Further, we note that transmission of full information, i.e., raw image data $s$, through a wireless channel from each agent to a central unit for full image classification would consume a lot of bandwidth. This case is also shown in Figure 2 assuming perfect communication links between the output of the semantic channel and the input of the ResNet Feature Extractor. Therefore, we investigate a distributed setting shown in Figure 3. There, each of four agents sees its own image $s_1, \ldots, s_4 \sim p(s_i|z)$ being generated by the same semantic RV $z$. Based on these images, a central unit shall extract semantics, i.e., perform classification. We propose to optimize the four encoders $p_{\theta_i}(x_i|s_i)$ with $i = 1, \ldots, 4$, each consisting of a bandwidth efficient feature extractor (ResNet Feature Extractor $i$) and transmitter (Tx $i$), jointly with a decoder $q_\phi(z|y = [y_1, y_2, y_3, y_4]^T)$, consisting of a receiver (Rx) and concluding classifier (Classifier), with regard to the cross entropy (10) of the semantic labels (see Figure 3). Hence, we maximize the system's overall semantic measure, i.e., classification accuracy. Note that this scenario is different from both [33,34]; we include a physical communication channel (Comm. Channel $i$) since we aim to transmit and not only compress. For the sake of simplicity, we assume orthogonal channel access. The IB is addressed by limiting the number of channel uses, which defines the constraint $I_C$ in (15).
As a first demonstration example, we use the grayscale MNIST and colored CIFAR10 datasets with $M = 10$ image classes [47]. We assume that the semantic channel generates an image that we divide into four equally sized quadrants and that each agent observes one quadrant $s_1, \ldots, s_4 \in \mathbb{R}^{N_x \times N_y \times N_c}$, where $N_x$ and $N_y$ are the numbers of image pixels in the x- and y-dimension, respectively, and $N_c$ is the number of color channels. Although this does not resemble a realistic scenario, it allows us to show the basic working principle and eases implementation.
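The division of one image into four agent observations can be sketched as follows. The helper name, the quadrant ordering, and the synthetic 32 x 32 single-channel test image are assumptions made for illustration; with this image size, each agent would observe a 16 x 16 sub-image.

```python
def split_into_quadrants(image):
    """Split an image (nested list of rows) into four equally sized quadrants:
    top-left, top-right, bottom-left, bottom-right (assumed ordering)."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("image height and width must be even")
    h, w = height // 2, width // 2
    top, bottom = image[:h], image[h:]
    return ([row[:w] for row in top], [row[w:] for row in top],
            [row[:w] for row in bottom], [row[w:] for row in bottom])

# Hypothetical 32 x 32 single-channel image whose pixel value encodes its position.
image = [[32 * r + c for c in range(32)] for r in range(32)]
s1, s2, s3, s4 = split_into_quadrants(image)
print(len(s1), len(s1[0]))  # height and width of one agent's sub-image
```

For color images, each pixel entry would itself hold the color channel values; the slicing along the two spatial dimensions stays the same.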

ResNet
For the design of the overall system, we rely on a famous DNN approach for feature extraction that broke records at the time of its invention: ResNet [47,48]. The key idea of ResNet is that it consists of multiple residual units. Each unit's input is added directly to its output via a skip connection, and if the dimensions do not match, a convolutional layer is used. This structure enables fast training and convergence of DNNs since the training error can be backpropagated to early layers through these skip connections. From a mathematical point of view, usual DNNs have the design flaw that using a larger function class, i.e., more DNN layers, does not necessarily increase the expressive power. However, it does increase for nested function classes such as those of ResNet, which contain the smaller classes of early layers.
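To make the skip connection concrete, the following minimal sketch implements a single residual unit in NumPy. It is a simplification under stated assumptions: dense matrices stand in for the convolutional layers, batch normalization is omitted, and the names `residual_unit`, `w1`, `w2`, and `w_skip` are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def residual_unit(x, w1, w2, w_skip=None):
    """Compute relu(F(x) + skip(x)) with a two-layer residual mapping F."""
    f = relu(x @ w1) @ w2                        # residual mapping F(x)
    skip = x if w_skip is None else x @ w_skip   # projection only if dimensions differ
    return relu(f + skip)

rng = np.random.default_rng(0)
x = rng.random((4, 8))                           # non-negative inputs, batch of 4

# With F set to zero, the unit reduces to the identity for non-negative inputs,
# so the deeper network contains the function class of the shallower one.
y_identity = residual_unit(x, np.zeros((8, 8)), np.zeros((8, 8)))

# Dimension change from 8 to 16 features with a projection on the skip path
y_proj = residual_unit(x, rng.standard_normal((8, 16)),
                       rng.standard_normal((16, 16)),
                       rng.standard_normal((8, 16)))
```

The identity case illustrates the nesting argument: a residual unit can always fall back to passing its input through unchanged.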

Distributed Semantic Communication Design Approach
Our key idea here is to modify ResNet with regard to the communication task by splitting it at a suitable point where a low-bandwidth representation $r \in \mathbb{R}^{N_{Feat} \times 1}$ of the semantic information is present (see Figures 2 and 3). ResNet and CNNs in general can be interpreted as feature extractors; with full images, we obtain a feature map of size $8 \times 8 \times N_{Feat}$ after the last ReLU activation (see Table 1). These local features are aggregated by the global average pooling layer across the two spatial dimensions into $r$. Based on these $N_{Feat}$ global features in $r$, the softmax layer finally classifies the image. We note that the features contain the relevant information with regard to the semantic RV $z$ and are of low dimension compared to the original image or even its sub-images, i.e., 64 compared to $16 \times 16 \times 3 = 768$ for CIFAR10. Therefore, we aim to transmit each agent's local features $r_i \in \mathbb{R}^{N_{Feat} \times 1}$ ($i = 1, \ldots, 4$) instead of all sub-images $s_i$ and add the component Tx in Table 1 to encode the features $r_i$ into $x_i \in \mathbb{R}^{N_{Tx} \times 1}$ for transmission through the wireless channel (see Figure 3). We note that $x_i$ is analog and that its dimension $N_{Tx}$ defines the number of channel uses per agent/image. The less often we use the wireless channel (smaller $N_{Tx}$), the less information we transmit but the less bandwidth we consume, and vice versa. Hence, the number of channel uses defines the IB in (15). We implement the Tx module by DNN layers. To limit the transmission power to one, we constrain the Tx output by the norm along the training batch or the encoding vector dimension (dim.), i.e., $x_n = \tilde{x}_n / \sqrt{\mathrm{E}[\tilde{x}_n^2]}$ or the corresponding normalization along the encoding vector dimension, where $\tilde{x}$ is the output of the layer Linear from Table 1. For numerical simulations, we choose all Tx layers to have width $N_{Tx}$.
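The power constraint and the subsequent channel can be illustrated with the NumPy sketch below. The batch-wise normalization follows the stated formula. The exact form of the normalization along the encoding vector dimension is an assumption (unit average power per channel use), as are all function names. The AWGN model uses SNR = 1/σ², as defined in the optimization details.

```python
import numpy as np

def normalize_batch(x_tilde):
    """Normalize each dimension n along the batch: x_n = x~_n / sqrt(E[x~_n^2])."""
    power = np.mean(x_tilde ** 2, axis=0, keepdims=True)
    return x_tilde / np.sqrt(power)

def normalize_vector(x_tilde):
    """Assumed variant: normalize each encoding vector to unit average power."""
    power = np.mean(x_tilde ** 2, axis=1, keepdims=True)
    return x_tilde / np.sqrt(power)

def awgn(x, snr_db, rng):
    """Real-valued AWGN channel with unit signal power, i.e., SNR = 1/sigma^2."""
    sigma = np.sqrt(10.0 ** (-snr_db / 10.0))
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_tilde = 3.0 * rng.standard_normal((64, 16))   # batch of 64 vectors, N_Tx = 16
x_batch = normalize_batch(x_tilde)
x_vec = normalize_vector(x_tilde)
y = awgn(x_batch, snr_db=6.0, rng=rng)          # received signal of one agent
```

Both variants yield unit transmit power per channel use; they differ only in whether the expectation is taken over the batch or over the entries of one encoding vector.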
At the receiver side, we use only a single Rx module with shared DNN layers and parameters $\phi_{Rx}$ for all inputs $y_i$. This setting would be optimal if any feature is reflected in any sub-image and if the statistics of the physical channels are the same. Exploiting the prior knowledge of location-invariant features and assuming Additive White Gaussian Noise (AWGN) channels, this design choice seems reasonable. In our experiments, all layers of the Rx module have width $N_w$. A larger layer width $N_w$ is equivalent to more computing power.
The output of the Rx module can be interpreted as a representation of the image features $r_i$, with index $i$ indicating the spatial location. Thus, we have a representation of a feature map of size $2 \times 2 \times N_w$ that we aggregate across the spatial dimensions according to the ResNet structure. Based on this semantic representation, a softmax layer with 10 units finally computes the class probabilities $q_\phi(z|y)$, whose maximum is the maximum a posteriori estimate $\hat{z}$. In the following, we name our proposed approach Semantic INFOrmation traNsmission and recoverY (SINFONY).
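The receiver structure can be sketched as follows. This is a minimal illustration, assuming a single shared ReLU layer as the Rx module and global average pooling for the aggregation; the actual layer configuration is given in Table 1, and all names and random weights here are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)       # subtract maximum for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def receive_and_classify(y, w_rx, b_rx, w_cls, b_cls):
    """y has shape (batch, 4, N_Tx); the same Rx weights process every agent's signal."""
    h = relu(y @ w_rx + b_rx)                   # shared Rx layer, shape (batch, 4, N_w)
    pooled = h.mean(axis=1)                     # average over the 2 x 2 spatial positions
    return softmax(pooled @ w_cls + b_cls)      # class probabilities, shape (batch, 10)

rng = np.random.default_rng(0)
n_tx, n_w, n_classes = 16, 64, 10
y = rng.standard_normal((8, 4, n_tx))           # received signals of the four agents
w_rx, b_rx = rng.standard_normal((n_tx, n_w)), np.zeros(n_w)
w_cls, b_cls = rng.standard_normal((n_w, n_classes)), np.zeros(n_classes)

probs = receive_and_classify(y, w_rx, b_rx, w_cls, b_cls)
z_hat = probs.argmax(axis=-1)                   # maximum a posteriori class estimate
```

Because the Rx weights are shared and the aggregation is an average, the output of this sketch does not depend on the order of the agents, which matches the assumption of location-invariant features.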

Optimization Details
We evaluate SINFONY in TensorFlow 2 [49] on the MNIST and CIFAR10 datasets. The source code is available in [50], and the default simulation and training parameters are summarized in Table 2. We split the datasets into $N_{train}$ = 60 k (MNIST) or 50 k (CIFAR10) training samples and 10 k validation samples. For preprocessing, we normalize the pixel inputs to the range [0, 1], but in contrast to [47,48] we do not use data augmentation, which yields slightly worse accuracy. The ReLU layers are initialized with a uniform distribution according to He and all other layers according to Glorot [51]. In the case of CIFAR10 classification with central image processing and the original ResNet, we need to train $N_\theta + N_\phi = 273{,}066$ parameters. We would like to stress that although we divide the image input into four smaller pieces, this number grows more than four times to $4N_\theta + N_\phi = 1{,}127{,}754$ with $N_{Tx} = N_{Feat} = 64$ for SINFONY. The reason lies in the ResNet structure, whose parameter count depends only weakly on the input image size, and in the processing at four agents with an additional Tx module each. Only $N_\phi = 4810$ parameters belong to the Rx module and classification, i.e., the central unit. We note that the numbers of added Tx and Rx parameters, 33,560 and 3192, are relatively small. Since the number of parameters grows only weakly with the Rx layer width $N_w$ in our design, we choose $N_w = N_{Feat}$ as the default.
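The stated parameter counts can be cross-checked with simple arithmetic. The sketch below only uses the totals quoted in the text; the per-agent encoder size it derives is implied by these totals and is not stated explicitly in the source.

```python
# Totals quoted in the text (CIFAR10, N_Tx = N_Feat = 64)
n_central = 273_066      # central ResNet: N_theta + N_phi
n_sinfony = 1_127_754    # SINFONY: 4 * N_theta + N_phi
n_phi = 4_810            # Rx module and classifier of SINFONY (central unit)

# Implied parameter count of one SINFONY agent (feature extractor plus Tx module)
n_theta = (n_sinfony - n_phi) // 4

# Growth factor relative to central processing
ratio = n_sinfony / n_central
print(n_theta, round(ratio, 2))  # prints "280736 4.13"
```

The factor of roughly 4.13 is consistent with the statement that the number of parameters grows more than four times.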
For optimization of the cross entropy (10) or the loss function (28), we use the reparametrization trick from Section 4.6 and SGD with a momentum of 0.9 and a batch size of $N_b = 64$. We add $\ell_2$-regularization with a weight decay of 0.0001 as in [47,48]. The initial learning rate of 0.1 is reduced to 0.01 and 0.001 after 100 and 150 epochs for CIFAR10 and after 3 and 6 epochs for MNIST. In total, we train for $N_e = 200$ epochs with CIFAR10 and for $N_e = 20$ epochs with MNIST. In order to optimize the transceiver for a wider SNR range, we choose the training SNR to be uniformly distributed within $\mathrm{SNR}_{train} \in [-4, 6]$ dB, where $\mathrm{SNR} = 1/\sigma_n^2$ with noise variance $\sigma_n^2$.
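The learning rate schedule and the sampling of the training SNR can be sketched as below. The zero-based epoch indexing, the per-sample drawing of the SNR, and the function names are assumptions made for illustration; the published code [50] may organize this differently.

```python
import numpy as np

def learning_rate(epoch, boundaries=(100, 150), values=(0.1, 0.01, 0.001)):
    """Piecewise-constant schedule; defaults correspond to the CIFAR10 setting."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

def sample_noise_std(batch_size, rng, snr_db_range=(-4.0, 6.0)):
    """Draw one training SNR per sample uniformly in dB and convert it to sigma_n."""
    snr_db = rng.uniform(snr_db_range[0], snr_db_range[1], size=batch_size)
    return np.sqrt(10.0 ** (-snr_db / 10.0))    # SNR = 1 / sigma_n^2

rng = np.random.default_rng(0)
sigma = sample_noise_std(64, rng)               # one noise std per batch element
lr_mnist = learning_rate(4, boundaries=(3, 6))  # MNIST setting, epoch index 4
```

Drawing the SNR in dB rather than on a linear scale spreads the training examples evenly over the logarithmic SNR axis used in the plots.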

Numerical Results and Discussion
In the following, we investigate the influence of specific design choices on our semantic approach SINFONY. Then, we compare a semantic with a classical Shannon-based transmission approach. The design choices are as follows:
• Central: Central and joint processing of the full image information by the ResNet classifier, see Figure 2. It indicates the maximum achievable accuracy.
• … [19,21,28,32]. For details, see Section 5.4.5.
Since meaning is expressed by the RV $z$, we use classification accuracy to measure semantic transmission quality. For illustration in logarithmic scale, we show the complement of accuracy in all plots, i.e., the classification error rate.

MNIST Dataset
The numerical results of our proposed approach SINFONY on the MNIST validation dataset are shown in Figure 4 for $N_w = 56$. To obtain a fair comparison between transmit signals $x_i \in \mathbb{R}^{N_{Tx} \times 1}$ of different length $N_{Tx}$, we normalize the SNR by the spectral efficiency or rate $\eta = N_{Feat}/N_{Tx}$. First, we observe that the classification error rate of 0.5% of the central ResNet unit with full image information (Central) is smaller than the 0.9% of SINFONY - Perfect com., which assumes ideal communication links. However, the difference seems negligible considering that the local agents only see a quarter of the full images and learn features independently based on it.
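As a small worked example of the rate normalization, the sketch below converts an SNR in dB into a rate-normalized SNR. It assumes that "normalizing by the rate" means dividing the linear SNR by η, analogous to the conversion between SNR and $E_b/N_0$; this interpretation and the function name are assumptions.

```python
import math

def normalized_snr_db(snr_db, n_feat, n_tx):
    """Rate-normalized SNR in dB, assuming SNR_norm = SNR / eta with eta = N_Feat / N_Tx."""
    eta = n_feat / n_tx
    return snr_db - 10.0 * math.log10(eta)

same_rate = normalized_snr_db(10.0, n_feat=56, n_tx=56)   # eta = 1, no shift
compressed = normalized_snr_db(10.0, n_feat=56, n_tx=14)  # eta = 4, shift of about 6 dB
```

Under this assumption, encoding $N_{Feat} = 56$ features into $N_{Tx} = 14$ channel uses shifts the curve by about 6 dB to the left on the normalized axis.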
With noisy communication links (SINFONY - AWGN), the performance degrades, especially for SNR < 10 dB, and we can avoid this degradation only partly by training with noise (SINFONY - AWGN + training). Introducing the Tx module (SINFONY - Tx/Rx $N_{Tx} = 56$), we further improve classification accuracy at low SNR. If we encode the features from $N_{Feat} = 56$ to only $N_{Tx} = 14$ in the Tx module (SINFONY - Tx/Rx $N_{Tx} = 14$) to use fewer channel uses/less bandwidth (stronger bottleneck), the error rate is the lowest among the SINFONY examples with non-ideal links at low normalized SNR. At high SNR, we observe a small error offset, which indicates lossy compression. In fact, our system SINFONY learns a reliable semantic encoding to improve the classification performance of the overall system with non-ideal links. Every design choice in Table 1

CIFAR10 Dataset
Comparing these results to the classification accuracy on CIFAR10 shown in Figure 5, we observe a similar behavior, but a few main differences become apparent. Central performs much better with a 12% error rate than SINFONY - Perfect com. with 20%. We expect the reason to lie in the more challenging dataset with more color channels. Further, SINFONY - AWGN + training with $N_{Tx} = N_{Feat} = 64$ channel uses runs into a rather high error floor. Notably, even SINFONY - Tx/Rx ($N_{Tx} = 16$) with fewer channel uses performs better than both SINFONY - AWGN and SINFONY - AWGN + training over the whole SNR range and achieves channel encoding with negligible loss. This means that adding more flexible channel encoding, i.e., the Tx/Rx module, is crucial for CIFAR10.

Channel Uses Constraint
Since one of the main advantages of semantic communication lies in savings of information rate, we also investigate the influence of the number of channel uses $N_{Tx}$ on the MNIST classification error rate shown in Figure 6. From a practical point of view, we fix the information bottleneck by the output dimension $N_{Tx}$ and maximize the mutual information $I_\theta(z; y)$. Decreasing the number of channel uses from $N_{Tx} = 14$ to 2, and accordingly the upper bound $I_C$ on the mutual information $I_\theta(s; y)$, i.e., the compression rate, from (19) or (20), we observe that the error floor at high SNR increases. We assume that, since the channel capacity decreases with SNR and $N_{Tx}$, higher compression is required for reliable transmission through the channel in the training SNR interval. For $N_{Tx} = 56$, almost no error floor occurs at the cost of a smaller channel encoding gain. This means that compression and channel coding are balanced based on the channel condition, i.e., the training SNR region, to find the optimal trade-off that maximizes $I_\theta(z; y)$, which we also observed in simulations not shown here.

Semantic vs. Classic Design
Finally, we compare semantic and classic communication system designs. For the classic digital design, we first assume that the images are losslessly compressed and protected by a channel code for transmission and reliable overall image classification by the central unit based on $q_\phi(z|s)$ (Central image com.). We apply Huffman encoding to a block containing 100 images $s_i$, where each RGB color entry contains 8 bits.
For fairness, we also compare to a SINFONY version where the Tx and Rx modules of Table 1 are replaced by a classic design (SINFONY - Classic digital com.). We first compress each element of the feature vector $r_i$, which is computed in 32-bit floating-point precision in the distributed setting SINFONY - AWGN, to 16 bits. Then, we apply Huffman encoding to a block containing 100 feature vectors of length $N_{Feat}$.
Further, we use a 5G LDPC channel code implementation from [52] with interleaver, rates $R_C = \{0.25, 0.5, 0.25\}$, and long block lengths of $\{15360, 16000, 15360\}$, and modulate the code bits with {BPSK, BPSK, 16-QAM} such that we have, e.g., the parameter set {0.25, 15360, BPSK} in one simulation. For digital image transmission, we use a rate of $R_C = 0.25$ with a block length of 15360 and BPSK modulation. At the receiver, we assume belief propagation decoding, where the noise variance is perfectly known for LLR computation.
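To give a feeling for the cost of the classic digital chain, the sketch below quantizes a block of 100 feature vectors to 16-bit floats, computes the length of a Huffman encoding, and converts the bit count into channel uses for a given code rate and modulation. It is a rough sketch under several assumptions: random stand-in features, Huffman coding over byte symbols (the source does not specify the symbol alphabet), and no code-block padding. All names are hypothetical.

```python
import heapq
import math
from collections import Counter

import numpy as np

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} of a Huffman code for the sequence."""
    counts = Counter(symbols)
    if len(counts) == 1:                        # degenerate case: a single symbol
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def channel_uses(n_info_bits, code_rate, bits_per_symbol):
    """Channel uses needed for n_info_bits, ignoring code-block padding."""
    return math.ceil(n_info_bits / (code_rate * bits_per_symbol))

rng = np.random.default_rng(0)
# Stand-in for 100 ReLU feature vectors of length N_Feat = 64 in 32-bit precision
features = np.maximum(rng.standard_normal((100, 64)), 0.0).astype(np.float32)
symbols = features.astype(np.float16).tobytes() # 16-bit quantization, byte symbols

lengths = huffman_code_lengths(symbols)
counts = Counter(symbols)
n_bits = sum(counts[s] * lengths[s] for s in counts)

uses_bpsk = channel_uses(n_bits, code_rate=0.25, bits_per_symbol=1)
uses_16qam = channel_uses(n_bits, code_rate=0.25, bits_per_symbol=4)
```

Even after compression, each feature vector requires many more channel uses than the $N_{Tx}$ analog channel uses of SINFONY, which is one way to see where a large rate-normalized SNR gap can come from.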
The results in Figure 7 reveal tremendous information rate savings for the semantic design with SINFONY. We observe an enormous SNR shift of roughly 20 dB compared to the classic digital design with regard to both image (Central image com.) and feature transmission (SINFONY - Classic digital com.). Note that the classic design is already near the Shannon limit, and even if we improve it by ML, we are only able to shift its curve by a few dB. The reason may lie in the overall system optimization with SINFONY with regard to semantics and in the analog encoding of $x$.

SINFONY vs. Analog "Semantic" Autoencoder
To distinguish both influences, we also implemented the approach of (14) according to Shannon by analog AEs. The analog AE was introduced by O'Shea and Hoydis in [6]. From the viewpoint of semantic communication, it resembles the semantic approach from [19,21,28,32] without differentiating between semantic and channel coding, and the mutual information constraint $I(x; y)$ as in [21]. We trained the AE matching the Tx and Rx modules in Table 1 with the mean square error criterion for reliable transmission of the feature vector $r$ with the SINFONY training settings. The Rx module consists of one ReLU layer of width $N_w = N_{Tx}$ providing the estimate of $r$. We provide the results (SINFONY - Analog semantic AE) in Figure 7. Indeed, most of the shift is due to analog encoding. By this means, we further avoid the typical thresholding behavior of a classic digital system seen at 14 dB.
In conclusion, this surprisingly clear result justifies an analog "semantic" communications design and shows its huge potential to provide bandwidth savings. However, by introducing the semantic RV $z$ with SINFONY, we can further shift the curve by 2 dB and avoid a slightly higher error floor compared to the analog "semantic" AE. We expect a larger performance gap with more challenging image datasets, such as CIFAR10. More importantly, the main benefit of SINFONY lies in lower training complexity. We avoid separate and possibly iterative semantic and communication training procedures, in which the first step requires training SINFONY with ideal links, which are hard to achieve in practice.

Conclusions
Motivated by the approach of Bao, Basu et al. [16,17] and inspired by Weaver's notion of semantic communication [2], we brought the term of a semantic source to the context of communications by considering its complete Markov chain. We defined the task of semantic communication in the sense of a data-reduced and reliable transmission of communication sources/messages over a communication channel such that the semantic Random Variable (RV) at a recipient is best preserved. We formulated its design either as an Information Maximization or as an Information Bottleneck optimization problem, covering important implementation aspects like the reparametrization trick, and solved the problems approximately by minimizing the cross entropy, which upper bounds the negative mutual information. This article differs from related literature [16,17,21,26,32] in both its classification and perspective of semantic communication and in its ML-based solution approach.
Finally, we proposed the ML-based semantic communication system SINFONY for a distributed multipoint scenario: SINFONY communicates the meaning behind multiple messages that are observed at different senders to a single receiver for semantic recovery. We analyzed SINFONY by processing images as an example of messages. Notably, numerical results reveal a tremendous rate-normalized SNR shift up to 20 dB compared to classically designed communication systems.

Outlook
In this work, we contributed to the theoretical problem description of semantic communication and to data-based ML solution approaches with DNNs. There remain open research questions such as:
• Numerical Comparison to Variational IB: It remains unclear if solving the variational IB problem (21) holds benefits compared to our proposed approach.
• Implementation: Optimization with the reparametrization trick requires a known differentiable channel model and training at one location with dedicated hardware such as graphics processing units [53]. In addition, large amounts of labeled data are required with data-driven ML techniques, which can be expensive and time-consuming to acquire and process. Hence, further research is required to clarify how a semantic design can be implemented efficiently in practice.
• Semantic Modeling: Developing effective models of semantics is crucial, and thus we proposed the usage of probabilistic models. If the underlying problem can be described by a well-known model, e.g., a physical process to be measured and processed by a sensor network [32], a promising idea is to apply model-based approaches based on Bayesian inference for encoding and decoding, potentially combined with the technique of deep unfolding. In the context of NLP, the design of knowledge graphs such as ontologies or taxonomies is a promising modeling approach for human language.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are openly available in [50].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: