In-Network Learning: Distributed Training and Inference in Networks

In this paper, we study distributed inference and learning over networks modeled by a directed graph. A subset of the nodes observes different features, which are all relevant/required for the inference task to be performed at some distant end (fusion) node. We develop a learning algorithm and an architecture that combine the information from the observed distributed features using the processing units available across the network. In particular, we employ information-theoretic tools to analyze how inference propagates and fuses across a network. Based on the insights gained from this analysis, we derive a loss function that effectively balances the model's performance with the amount of information transmitted across the network. We study the design criterion of our proposed architecture and its bandwidth requirements. Furthermore, we discuss implementation aspects using neural networks in typical wireless radio access and provide experiments that illustrate benefits over state-of-the-art techniques.


Introduction
The unprecedented success of modern machine learning (ML) techniques in areas such as computer vision [1], neuroscience [2], image processing [3], robotics [4] and natural language processing [5] has led to increasing interest in their application to wireless communication systems in recent years.
Early efforts along this line of work fall into what is sometimes referred to as the "learning to communicate" paradigm, in which the goal is to automate one or more communication modules, such as the modulator-demodulator or the channel coder-decoder, by replacing them with suitable ML algorithms. Although important progress has been made for some particular communication systems, such as the molecular one [6], it is not yet clear whether ML techniques can offer a reliable alternative to model-based approaches, especially as typical wireless environments suffer from time-varying noise and interference.
Wireless networks have other important intrinsic features which may pave the way for more cross-fertilization between ML and communication, as opposed to applying ML algorithms as black boxes in replacement of one or more communication modules. For example, while in areas such as computer vision, neuroscience, and others, relevant data is generally available at one point, it is typically highly distributed across several nodes in wireless networks.
Examples include self-driving cars, where multiple sensors, both external and internal to the car, can be used to help the car navigate its environment; medical applications that diagnose a patient based on data from different medical institutions; and environmental monitoring to detect hazardous events or pollution; see [7,8] for more information. We give more details on the usefulness of such setups in Examples 1 and 2. A prevalent approach for the implementation of ML solutions in such cases consists of collecting all relevant data at one point (a cloud server) and then training a suitable ML model using all available data and processing power. However, because the volumes of data needed for training are generally large and network resources (e.g., power and bandwidth) are scarce, that approach may not be appropriate in many cases. In addition, some applications, such as autonomous vehicle driving, have stringent latency requirements that are incompatible with sharing the data. In other cases, it might be desired not to share the raw data in order to enhance the privacy of the solution, in the sense that infringing a user's privacy is generally easier from the raw data itself than from the output of a neural network (NN) that takes the raw data as input.
The above has called for a new paradigm in which intelligence moves from the heart of the network to its edge, which is sometimes referred to as "Edge Learning". In this new paradigm, communication plays a central role in the design of efficient ML algorithms and architectures because both data and computational resources, which are the main ingredients of an efficient ML solution, are highly distributed. A key aspect towards building suitable ML-based solutions is whether the setting assumes only the training phase involves distributed data, sometimes referred to as distributed learning, such as the Federated Learning (FL) of [9] or if the inference (or test) phase also involves distributed data.
The considered problem setup is strongly related to the problems of distributed estimation and detection (see, e.g., [10][11][12][13] and references therein). We differentiate ourselves from these problems in that we assume no prior knowledge of the distribution of the data. This is a common setup in many practical applications, such as image or speech processing or text analysis, where the joint distribution of the observed data and the target variable is unknown or too complex to model.
Among the works most closely related to this paper, a growing line focuses on developing distributed learning algorithms and architectures. The works of [14,15] address the problem of distributed learning using kernel methods when each node observes independent samples drawn from the same distribution. In our setup, however, the nodes observe correlated data, necessitating collaboration among all nodes during inference. On the other hand, works such as [16,17] focus on the narrower problem of detection and impose certain restrictions on the scope of their investigation. Perhaps most popular and most related to our work, however, is the FL of [9] which, as already mentioned, is most suitable for scenarios in which the training phase has to be performed distributively while the inference phase is performed centrally at one node. To this end, during the training phase, the nodes (e.g., base stations) that possess data are all equipped with copies of a single NN model, which they simultaneously train on their locally available data-sets. The learned weight parameters are then sent to a cloud or parameter server (PS), which aggregates them, e.g., by simply computing their average. The process is repeated, every time re-initializing with the obtained aggregated model, until convergence. The rationale is that, this way, the model is progressively adjusted to account for all variations in the data, not only those of the local data-sets. For recent advances on FL and applications in wireless settings, the reader may refer to [18][19][20] and references therein. Another relevant work is the Split Learning (SL) of [21] in which, for a multi-access type network topology, a two-part NN model, split into an encoder part and a decoder part, is learned sequentially. The decoder does not have its own data; in every round, the NN encoder part is fed with a distinct data-set and its parameters are initialized with those learned in the previous round.
The learned two-part model is then used as follows during the inference: one part of this model is used by an encoder, and the other one by a decoder. Another variation of SL, sometimes called "vertical SL", was proposed recently in [22]. The approach uses vertical partitioning of the data; in the special case of a multi-access topology, it is similar to the in-network learning solution that we propose in this paper.
Compared to both SL and FL, which consider only the training phase to be distributed, in this paper we focus on the problem in which the inference phase also takes place distributively. More specifically, we study a network inference problem in which some of the nodes each possess, or can acquire, part of the data that is relevant for inference on a random variable Y. The node at which the inference needs to be performed is connected to the nodes that possess the relevant data through a number of intermediate nodes. We assume that the network topology is fixed and known. This may model, e.g., a setting in which a macro BS needs to make inference on the position of a user on the basis of summary information obtained from correlated CSI measurements X_1, . . . , X_J that are acquired at nearby edge BSs. Each of the edge nodes is connected with the central node either directly, via an error-free link of given finite capacity, or via intermediary nodes. While in some cases it might suffice to process only a subset of the J measurements, we assume that processing any strict subset of the measurements cannot yield the desired inference accuracy and, as such, all J measurements X_1, . . . , X_J need to be processed during the inference or test phase.

Example 1. (Autonomous Driving) One basic requirement of autonomous driving is the ability to cope with problematic roadway situations, such as those involving construction, road hazards, hand signals, and reckless drivers. Current approaches mainly depend on equipping the vehicle with more on-board sensors. While this improves coverage of the navigation environment, it seems unlikely to cope successfully with the problem of blind spots due, e.g., to obstruction or hidden obstacles.
In such contexts, external sensors such as other vehicles' sensors, cameras installed on the roofs of nearby buildings, or wireless towers may help perform a more precise inference by offering a complementary, possibly better, view of the navigation scene. An example scenario is shown in Figure 1. The application requires real-time inference, which might be incompatible with current cellular radio standards, thus precluding the option of sharing the sensors' raw data and processing it locally, e.g., at some on-board server. When equipped with suitable intelligence capabilities, each sensor can identify and extract those features of its measurement data that are not captured by the other sensors' data. Then, it only needs to communicate those features, not its entire data.

Example 2. (Public Health) One of the early applications of machine learning is in the area of medical imaging and public health. In this context, various institutions can hold different modalities of patient data in the form of electronic health records, pathology test results, radiology, and other sensitive imaging data such as genetic markers for disease. A correct diagnosis may be contingent on being able to use all relevant data from all institutions. However, these institutions may not be authorized to share their raw data. Thus, it is desirable to distributively train machine learning models without sharing the patients' raw data, in order to prevent illegal, unethical or unauthorized usage of it [23]. Local hospitals or tele-health screening centers seldom acquire enough diagnostic images on their own; collaborative distributed learning in this setting would enable each individual center to contribute data to an aggregate model without sharing any raw data.

Contributions
In this paper, we study the aforementioned network inference problem in which the network is modeled as a weighted acyclic graph and inference about a random variable is performed on the basis of summary information obtained from possibly correlated variables at a subset of the nodes. Following an information-theoretic approach in which we measure discrepancies between true values and their estimated fits using the average logarithmic loss, we first develop a bound on the best achievable accuracy given the network communication constraints. Then, considering a supervised setting in which nodes are equipped with NNs whose mappings need to be learned from distributively available training data-sets, we propose a distributed learning and inference architecture, and we show that it can be optimized using a distributed version of the well-known stochastic gradient descent (SGD) algorithm that we develop here. The resulting distributed architecture and algorithm, which we herein name "in-network learning (INL)", generalize those introduced in [24] (see also [25,26]) for a specific, multi-access type, network topology. We investigate in more detail what the various nodes need to exchange during both the training and inference phases, as well as the associated bandwidth requirements. Finally, we provide a comparative study with (an adaptation of) the FL and SL algorithms, and experiments that illustrate our results. Part of the results of this paper has also been presented in [27,28]. However, in this paper we go beyond those works by offering a more comprehensive and detailed review of the state of the art. Additionally, we provide proofs for the theorem and lemmas presented in this paper, which were not included in the previous publications. Furthermore, we introduce additional insights and conclusions that further contribute to the overall understanding and significance of the research findings.

Outline and Notation
In Section 2 we describe the studied network inference problem formally. In Section 3 we present our in-network inference architecture, as well as a distributed algorithm for training it. Section 4 contains a comparative study with FL and SL in terms of bandwidth requirements, as well as some experimental results. Finally, in Section 5 we summarize the insights and results presented in this paper.
Throughout the paper, the following notation is used. Upper case letters denote random variables, e.g., X; lower case letters denote realizations of random variables, e.g., x; and calligraphic letters denote sets, e.g., X. The cardinality of a set X is denoted by |X|. For a random variable X with probability mass function P_X, the shorthand p(x) = P_X(x), x ∈ X, is used. Boldface letters denote matrices or vectors, e.g., X or x. For random variables (X_1, X_2, . . .) and a set of integers K ⊆ N, the notation X_K designates the vector of random variables with indices in the set K, i.e., X_K = (X_k : k ∈ K). In addition, for zero-mean random vectors x and y, the quantities Σ_x, Σ_{x,y} and Σ_{x|y} denote, respectively, the covariance matrix of the vector x, the cross-covariance matrix of the pair (x, y), and the conditional covariance matrix of x given y. Finally, for two probability measures P_X and Q_X over the same alphabet X, the relative entropy or Kullback-Leibler divergence is denoted D_KL(P_X||Q_X). That is, if P_X is absolutely continuous with respect to Q_X, then D_KL(P_X||Q_X) = E_{P_X}[log(dP_X/dQ_X)].
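As a small numerical illustration of the last definition (our own sketch, not from the paper), the Kullback-Leibler divergence for discrete alphabets reduces to a weighted log-ratio sum:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions.
    Assumes P is absolutely continuous w.r.t. Q (q(x) > 0 wherever p(x) > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # by convention, 0 * log(0/q) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# The divergence is zero iff the two distributions coincide,
# and strictly positive otherwise.
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
assert kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0.0
```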

Network Inference: Problem Formulation
We consider a distributed supervised learning setup in which multiple nodes observe different features relating to the same sample, sometimes referred to as distributed learning with a vertically partitioned dataset; see [8,29]. We additionally assume that the learning takes place over a communication-constrained network. Specifically, consider an N-node distributed network. Of these N nodes, J ≥ 1 nodes possess or can acquire data that is relevant for inference on a random variable (r.v.) of interest Y, with alphabet Y. Let X_J = Π_{j ∈ J} X_j, where Π designates the Cartesian product of sets; similar definitions hold for k ∈ [1 : N−1]/J. The ranges of the encoding functions {ω_i} are restricted in size, and node N needs to infer on the random variable Y ∈ Y using all incoming messages. In this paper, we choose the reconstruction set Ŷ to be the set of distributions on Y, i.e., Ŷ = P(Y), and we measure discrepancies between true values of Y ∈ Y and their estimated fits in terms of the average logarithmic loss, i.e., for (y, P̂) ∈ Y × P(Y), d(y, P̂) = log(1/P̂(y)).
As such, the performance of a distributed inference scheme ((ω_j)_{j ∈ J}, (ω_k)_{k ∈ [1:N−1]/J}, ψ) for which (3) is fulfilled is measured by its achievable relevance which, for a discrete set Y, is directly related to the error of misclassifying the variable Y ∈ Y.
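The logarithmic-loss distortion above can be computed directly from soft predictions. The following sketch (our own illustration; the function name and array layout are hypothetical) evaluates the empirical average of d(y, P̂) = log(1/P̂(y)) over a batch of samples:

```python
import numpy as np

def avg_log_loss(y, p_hat):
    """Empirical average logarithmic loss d(y, P-hat) = log(1/P-hat(y)).
    y: integer labels; p_hat: per-sample predicted distributions over Y."""
    p_hat = np.asarray(p_hat, dtype=float)
    # Pick out P-hat(y_i) for each sample i, then average -log of it.
    return float(np.mean(-np.log(p_hat[np.arange(len(y)), y])))

# A uniform guess over |Y| = 4 classes incurs exactly log(4) per sample,
# while a confident correct prediction incurs (near) zero loss.
uniform = np.full((2, 4), 0.25)
assert np.isclose(avg_log_loss([1, 3], uniform), np.log(4))
```

Since H(Y) is a constant of the problem, minimizing this quantity is equivalent to maximizing the relevance discussed in the text.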
It is important to note that H(Y) is a problem-specific constant and, as such, the relevance given by (6) is simply another form of the logarithmic loss. In practice, in a supervised setting, the mappings given by (1), (2) and (4) need to be learned from a set of training data samples. The data is distributed such that the samples x_j := (x_j,1, . . . , x_j,n) are available at node j for j ∈ J, and the desired predictions y := (y_1, . . . , y_n) are available at the end decision node N. We parametrize the possibly stochastic mappings (1), (2) and (4) using NNs; this is depicted in Figure 3. We denote the parameters of the NN that parameterizes the encoding function at node i ∈ [1 : (N − 1)] by θ_i, and the parameters of the NN that parameterizes the decoding function at node N by φ. Letting θ = [θ_1, . . . , θ_{N−1}], we aim to find the parameters (θ, φ) that maximize the relevance of the network under the network constraints of (3). Given that the actual distribution is unknown and we only have access to a dataset, the loss function needs to strike a balance between performance on the dataset, given by an empirical estimate of the relevance, and the network's ability to perform well on samples outside the dataset.
The NNs at the various nodes are arbitrary and can be chosen independently; for instance, they need not be identical as in FL. It is only required that the following mild condition, which, as will become clearer from what follows, facilitates back-propagation, be met. Specifically, for every j ∈ J and x_j ∈ X_j, under the assumption that all elements of X_j have the same dimension, the size of the first layer of NN j must match the dimension of x_j (condition (7)); similarly, for k ∈ [1 : N]/J, the size of the first layer of NN k must equal the sum of the sizes of the last layers of the NNs whose outputs it receives (condition (8)). Conditions (7) and (8) are imposed only for ease of implementation of the training algorithm; the techniques presented in this paper, including optimal trade-offs between relevance and complexity for the given topology, the associated loss function, the variational lower bound, how to parameterize it using NNs and so on, do not require (7) and (8) to hold. Alternative aggregation techniques, such as element-wise multiplication or element-wise averaging, can be employed to combine the information received by each node, in place of concatenation. The impact of these aggregation techniques has been analyzed in [22].
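The size-matching condition and the alternative aggregations can be made concrete with a small sketch (our own illustration; the sizes are hypothetical):

```python
import numpy as np

# Hypothetical last-layer outputs of J = 3 encoder NNs for one sample.
u = [np.array([0.1, 0.9]), np.array([0.4, 0.6]), np.array([0.7, 0.3])]

# Condition (8)-style matching: with concatenation, the fusion NN's input
# layer must equal the sum of the encoders' last-layer sizes.
concat = np.concatenate(u)          # input size = 2 + 2 + 2 = 6
assert concat.shape == (6,)

# Alternative aggregations mentioned in the text (all encoder outputs must
# then have equal size, and the fusion input keeps that size):
averaged = np.mean(u, axis=0)       # element-wise averaging
multiplied = np.prod(u, axis=0)     # element-wise multiplication
assert averaged.shape == multiplied.shape == (2,)
```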

Proposed Solution: In-Network Learning and Inference
For convenience, we first consider a specific setting of the network inference problem of Figure 3 in which J = N − 1 and all the nodes that observe data are connected directly to the end decision node, but not among themselves.

A Specific Model: Fusing of Inference
In this case, a possible suitable loss function was shown by [25] to be (9), where s is a Lagrange parameter and, for j ∈ J, the distributions P_θj(u_j|x_j), Q_φj(y|u_j), Q_φJ(y|u_J) are variational ones whose parameters are determined by the chosen NNs using the re-parametrization trick of [30], and the Q_ϕj(u_j) are priors known to the encoders. For example, denoting by f_θj the NN used at node j ∈ J whose (weight and bias) parameters are given by θ_j, for regression problems the conditional distribution P_θj(u_j|x_j) can be chosen to be a multivariate Gaussian whose mean and covariance are produced by f_θj. For discrete data, Concrete variables (i.e., Gumbel-Softmax) can be used instead.
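A minimal sketch of such a Gaussian encoder with the re-parametrization trick (our own illustration; the linear maps theta_mu and theta_sigma are hypothetical stand-ins for the NN f_θj, and a diagonal covariance is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_encoder(x, theta_mu, theta_sigma):
    """Stochastic encoder P_theta(u|x) as a Gaussian with diagonal covariance,
    sampled via the re-parametrization trick:
        u = mu(x) + sigma(x) * eps,  eps ~ N(0, I),
    so that gradients can flow through mu and sigma."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)  # exp ensures positive std deviations
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

x = np.ones(4)
u = gaussian_encoder(x, rng.standard_normal((2, 4)), np.zeros((2, 4)))
assert u.shape == (2,)
```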
The rationale behind the choice of the loss function (9) is that, in the regime of large n, it is optimal under some conditions if the encoders and decoder are not restricted to use NNs. The optimality is proved therein under the assumption that, for every subset S ⊆ J, the Markov chain X_S − Y − X_{S^c} holds. The RHS of (10) is achievable for arbitrary distributions regardless of such an assumption, however; the optimal stochastic mappings P_{U_j|X_j}, P_{Y|U_j} and P_{Y|U_J} are found by marginalizing the joint distribution that maximizes the Lagrange cost function of [25] (Proposition 2), where the maximization is over all joint distributions of the form P_Y Π_{j=1}^J P_{X_j|Y} Π_{j=1}^J P_{U_j|X_j}.

Inference Phase
During this phase node j observes a new sample x j . It uses its NN to output an encoded value u j which it sends to the decoder. After collecting (u 1 , . . . , u J ) from all input NNs, node (J + 1) uses its NN to output an estimate of Y in the form of soft output Q φ J (Y|u 1 , . . . , u J ). The procedure is depicted in Figure 4b.
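The inference procedure above can be sketched end to end as follows (our own toy illustration; linear maps stand in for the per-node NNs and softmax stands in for the decoder's soft output, all sizes hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
encoders = [rng.standard_normal((2, 3)) for _ in range(4)]   # J = 4 input nodes
decoder = rng.standard_normal((5, 8))                        # |Y| = 5 labels

def infer(x_list):
    """Each node j encodes its own sample x_j into u_j; node (J+1) collects
    (u_1, ..., u_J), concatenates them, and outputs the soft prediction
    Q(y | u_1, ..., u_J)."""
    u = [W @ x for W, x in zip(encoders, x_list)]
    return softmax(decoder @ np.concatenate(u))

q = infer([rng.standard_normal(3) for _ in range(4)])
assert q.shape == (5,) and np.isclose(q.sum(), 1.0)
```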

Remark 2.
One can combine our proposed technique with an appropriate transmission scheme and channel coding. One possible suitable practical implementation in wireless settings can be obtained using Orthogonal Frequency-Division Multiple Access (OFDMA). That is, the J input nodes are allocated non-overlapping bandwidth segments and the output layers of the corresponding NNs are chosen accordingly. The encoding of the activation values can be performed, e.g., using entropy type coding [31].

Training Phase
During the forward pass, every node j ∈ J processes mini-batches of size, say, b_j of its training data-set x_j. Node j ∈ J then sends a vector u_j whose elements are the activation values of the last layer of NN j; see Figure 4a. Due to (8), the sent activation vectors are concatenated vertically at the input layer of NN (J + 1), and the forward pass continues on that NN. Let W^[l]_{J+1}, b^[l]_{J+1} and a^[l]_{J+1} denote, respectively, the weights, biases and activation values at layer l ∈ [2 : L_{J+1}] of NN (J + 1), and let σ be the activation function. Node (J + 1) computes the error vectors and then updates its weight and bias parameters as in (11) and (12), where η designates the learning parameter; for simplicity, η and σ are assumed here to be identical for all NNs.

Remark 3.
It is important to note that, for the computation of the RHS of (11a), node (J + 1), which knows Q_φJ(y_i|u_1,i, . . . , u_J,i) and Q_φj(y_i|u_j,i) for all i ∈ [1 : n] and all j ∈ J, only requires the derivative of the loss with respect to the activation vector of its last layer. For instance, node (J + 1) does not need to know any of the conditional variationals P_θj(u_j|x_j) or the priors Q_ϕj(u_j).
The backward propagation of the error vector from node (J + 1) to the nodes j, j ∈ {1, . . . , J}, is as follows. Node (J + 1) horizontally splits the error vector of its input layer into J sub-vectors with sub-error vector j having the same size as the dimension of the last layer of NN j [recall (8) and that the activation vectors are concatenated vertically during the forward pass]. See Figure 4a. The backward propagation then continues on each of the J input NNs simultaneously, each of them essentially applying operations similar to (11) and (12).
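The horizontal split of the input-layer error vector can be sketched as follows (our own illustration; the per-node last-layer sizes are hypothetical):

```python
import numpy as np

# Hypothetical sizes: J = 3 encoder NNs with last layers of sizes 4, 2 and 3,
# whose outputs were concatenated vertically at NN (J+1)'s input layer
# during the forward pass.
last_layer_sizes = [4, 2, 3]
delta_input = np.arange(9, dtype=float)  # error vector at NN (J+1)'s input

# Node (J+1) splits the error vector horizontally into J sub-vectors, each
# matching the last-layer size of the corresponding encoder NN, and sends
# sub-vector j back to node j, which continues back-propagation locally.
splits = np.split(delta_input, np.cumsum(last_layer_sizes)[:-1])
assert [len(s) for s in splits] == last_layer_sizes
# The split exactly undoes the forward-pass concatenation.
assert np.concatenate(splits).tolist() == delta_input.tolist()
```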

Remark 4.
Let δ^[1]_{J+1}(j) denote the sub-error vector sent back from node (J + 1) to node j ∈ J. It is easy to see that, for every j ∈ J, this sub-vector is all that is needed for back-propagation at node j; this explains why node j ∈ J needs only the part δ^[1]_{J+1}(j), not the entire error vector at node (J + 1).

General Model: Fusion and Propagation of Inference
Consider now the general network inference model of Figure 2. Part of the difficulty of this problem lies in finding a suitable loss function which can be optimized distributively via NNs that each have access only to local data-sets. The next theorem provides a bound on the achievable relevance (under some assumptions) for an arbitrary network topology (E, N). The result of Theorem 1 is asymptotic in the size of the training data-sets, while the inference problem is a one-shot problem; one-shot results for this problem can be obtained, e.g., along the approach of [32]. For convenience, we define for S ⊆ [1, . . . , N − 1] and non-negative (C_ij : (i, j) ∈ E) the quantity C(S).

Theorem 1. For the network inference model of Figure 2, in the regime of large data-sets the following relevance is achievable, where the maximization is over joint measures for which there exist non-negative R_1, . . . , R_J that satisfy the stated rate constraints.

Proof. The proof of Theorem 1 appears in Appendix A. An outline is as follows. The result is achieved using a separate compression-transmission-estimation scheme in which the observations (x_1, . . . , x_J) are first compressed distributively using Berger-Tung coding [33] into representations (u_1, . . . , u_J), and then the bin indices are transmitted as independent messages over the network G using linear network coding [34] (Section 15.5). The decision node N first recovers the representation codewords (u_1, . . . , u_J) and then produces an estimate of the label y. The scheme is illustrated in Figure 5. Part of the utility of the loss function of Theorem 1 is that it accounts explicitly for the network topology in inference fusion and propagation. In addition, although, as seen from its proof, the setting of Theorem 1 assumes knowledge of the joint distribution of the tuple (X_1, . . . , X_J, Y), the result can be used to train NNs distributively from a set of available data-sets.
To do so, we first derive a Lagrangian function from Theorem 1, which can be used as an objective function to find the desired set of encoders and decoder. Afterwards, we use a variational approximation to avoid the computation of marginal distributions, which can be costly in practice. Finally, we parameterize the distributions using NNs. In essence, for a given network topology, the approach generalizes that of Section 3.1 to more general networks that involve hops. For simplicity, in what follows, this is illustrated for the example architecture of Figure 6. While the example is simple, it showcases the important aspect of any such topology: the fusion of the data at an intermediary node, i.e., a hop. First, we leverage Theorem 1 to establish a feasible trade-off between the performance of the network illustrated in Figure 6, quantified by its relevance, and the quantity of information that must be communicated between the nodes. Subsequently, employing the aforementioned approach, we derive a loss function tailored to the scenario in which the nodes are equipped with neural networks, as depicted in Figure 7.
where the maximization is over joint measures of the given form for which the following holds for some R_1 ≥ 0, R_2 ≥ 0 and R_3 ≥ 0: Let C_sum = C_15 + C_24 + C_34 + C_45; consider the region of all pairs (∆, C_sum) ∈ R^2_+ for which the relevance level ∆, as given by the RHS of (17), is achievable for some C_15 ≥ 0, C_24 ≥ 0, C_34 ≥ 0 and C_45 ≥ 0 such that C_sum = C_15 + C_24 + C_34 + C_45. Hereafter, we denote this region by RI_sum. Applying Fourier-Motzkin elimination to the region defined by (17) and (19), we obtain that the region RI_sum is given by the union of pairs (∆, C_sum) ∈ R^2_+ for which (20) holds (the time-sharing random variable is set to a constant for simplicity) for some measure of the form (21). The next proposition gives a useful parameterization of the region RI_sum as described by (20) and (21).

Proposition 1.
For every pair (∆, C sum ) that lies on the boundary of the region described by (20) and (21) there exists s ≥ 0 such that (∆, C sum ) = (∆ s , C s ), with and P * is the set of pmfs P := {P U 1 |X 1 , P U 2 |X 2 , P U 3 |X 3 } that maximize the cost function Proof. See Appendix B.
In accordance with the studied example network inference problem shown in Figure 6, let a random variable U_4 be such that U_4 − (U_2, U_3) − (X_1, X_2, X_3, Y, U_1) forms a Markov chain. That is, the joint distribution factorizes as (24). For given s ≥ 0 and conditional P_{U_4|U_2,U_3}, define the Lagrange term L^low_s(P, P_{U_4|U_2,U_3}). The following lemma shows that L^low_s(P, P_{U_4|U_2,U_3}) lower-bounds L_s(P) as given by (23).

Lemma 1.
For every s ≥ 0 and joint measure that factorizes as (24), we have Proof. See Appendix C.
For convenience, let P^+ denote the resulting augmented set of distributions. The optimization of (25) generally requires the computation of marginal distributions, which can be costly in practice. Hereafter, we derive a variational lower bound on L^low_s with respect to some arbitrary (variational) distributions, where Q_{Y|U_1,U_4} represents a variational (possibly stochastic) decoder and Q_{U_3}, Q_{U_2} and Q_{U_1} represent priors. The following lemma, the proof of which is essentially similar to that of [25] (Lemma 1), shows that for every s ≥ 0, the cost function L^low_s(P, P_{U_4|U_2,U_3}) is lower-bounded by L^v-low_s(P^+, Q) as given by (28).

Proof. See Appendix D.
From the above, we get the desired chain of bounds. Since, as described in Section 2, the distribution of the data is not known and only a set of samples {(x_1,i, . . . , x_J,i, y_i)}^n_{i=1} is available, we restrict the optimization of (28) to the family of distributions that can be parameterized by NNs. We thus obtain a loss function which can be optimized empirically, in a distributed manner, using gradient-based techniques, where s stands for a Lagrange multiplier and the distributions Q_φ5, P_θ4, P_θ3, P_θ2, P_θ1 are variational ones whose parameters are determined by the chosen NNs using the reparametrization trick of [30], and {Q_ϕi : i ∈ {1, 2, 3}} are priors known to the encoders. The parameterization of the distributions with NNs is performed similarly to that for the setting of Section 3.1.

Inference Phase
During this phase, nodes 1, 2 and 3 each observe (or measure) a new sample. Let x 1 be the sample observed by node 1 and x 2 and x 3 those observed by node 2 and node 3, respectively. Node 1 processes x 1 using its NN and sends an encoded value u 1 to node 5 and so do nodes 2 and 3 towards node 4. Upon receiving u 2 and u 3 from nodes 2 and 3, node 4 concatenates them vertically and processes the obtained vector using its NN. The output u 4 is then sent to node 5. The latter performs similar operations on the activation values u 1 and u 4 and outputs an estimate of the label y in the form of a soft output Q φ 5 (y|u 1 , u 4 ).
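The one-hop fusion described above can be sketched as follows (our own toy illustration of the Figure 6 topology; linear maps stand in for the NNs and all sizes are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
# Nodes 1-3 encode their own samples; node 4 fuses (u2, u3) at the hop;
# node 5 fuses (u1, u4) and outputs the soft prediction Q(y | u1, u4).
f1, f2, f3 = (rng.standard_normal((2, 3)) for _ in range(3))
f4 = rng.standard_normal((2, 4))     # input: vertical concat of u2 and u3
f5 = rng.standard_normal((5, 4))     # input: vertical concat of u1 and u4

x1, x2, x3 = (rng.standard_normal(3) for _ in range(3))
u1, u2, u3 = f1 @ x1, f2 @ x2, f3 @ x3
u4 = f4 @ np.concatenate([u2, u3])   # fusion at the intermediary node 4
q = softmax(f5 @ np.concatenate([u1, u4]))
assert q.shape == (5,) and np.isclose(q.sum(), 1.0)
```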

Training Phase
During the forward pass, every node j ∈ {1, 2, 3} processes mini-batches of size b_j of its training data-set x_j. Nodes 2 and 3 send the vectors formed of the activation values of the last layers of their NNs to node 4. Because the sizes of the last layers of the NNs of nodes 2 and 3 are chosen according to (8), the sent activation vectors are concatenated vertically at the input layer of NN 4. The forward pass continues on the NN at node 4 until its last layer. Next, nodes 1 and 4 send the activation values of their last layers to node 5. Again, as the sizes of the last layers of the NNs of nodes 1 and 4 satisfy (8), the sent activation vectors are concatenated vertically at the input layer of NN 5 and the forward pass continues until the last layer of NN 5.
During the backward pass, each of the NNs updates its parameters according to (11) and (12). Node 5 is the first to apply the back-propagation procedure in order to update the parameters of its NN. It applies (11) and (12) sequentially, starting from its last layer.

Remark 5.
It is important to note that, similar to the setting of Section 3.1, for the computation of the RHS of (11a) at node 5, only the derivative of L^NN_s(n) w.r.t. the activation vector a^[L_5]_5 is required, which depends only on Q_φ5(y_i|u_1,i, u_4,i); this distribution is computable at node 5 given only u_1,i and u_4,i.
The error propagates back until it reaches the first layer of the NN of node 5. Node 5 then splits the error vector of its input layer horizontally into two sub-vectors, the top sub-error vector having the size of the last layer of the NN of node 1 and the bottom sub-error vector having the size of the last layer of the NN of node 4; see Figure 7a. The two nodes 1 and 4 then continue the backward propagation in turn, simultaneously. Node 4 similarly splits the error vector of its input layer horizontally into two sub-vectors, the top one having the size of the last layer of the NN of node 2 and the bottom one having the size of the last layer of the NN of node 3. Finally, the backward propagation continues on the NNs of nodes 2 and 3. The entire process continues until convergence.

Remark 6.
Let δ^[1]_J(j) denote the sub-error vector sent back from node J to node j. It is easy to see that, for every j, this sub-vector suffices for back-propagation at node j; this explains why, for back-propagation, nodes 1, 2, 3 and 4 need only part of the error vector at the node they are connected to.

Bandwidth Requirements
In this section, we study the bandwidth requirements of our in-network learning. Let q denote the size of the entire dataset (each input node has a local dataset of size q/J), p the size of the input layer of NN (J + 1), and s the size in bits of a parameter. Since, as per (8), the outputs of the last layers of the input NNs are concatenated at the input of NN (J + 1), whose size is p, and each activation value takes s bits, one needs 2sp/J bits for each data point; the factor 2 accounts for both the forward and backward passes. For an epoch, our in-network learning thus requires 2pqs/J bits. Note that the bandwidth requirement of in-network learning does not depend on the sizes of the NNs used at the various nodes, but does depend on the size of the dataset. For comparison, notice that with FL one would require 2NJs bits, where N designates the number of (weight and bias) parameters of the NN at one node. For the SL of [21], assuming for simplicity that the NNs j = 1, . . . , J all have the same size ηN, where η ∈ [0, 1], SL requires (2pq + ηNJ)s bits for an entire epoch.
The bandwidth requirements of the three schemes are summarized and compared in Table 1 for two popular NN architectures, VGG16 (N = 138,344,128 parameters) and ResNet50 (N = 25,636,712 parameters), and two example datasets, of q = 50,000 and q = 500,000 data points. The numerical values are set as J = 500, p = 25,088, and η = 0.88 for ResNet50 and 0.11 for VGG16.
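The per-epoch expressions above can be sketched numerically. This is an illustrative script, assuming s = 32 bits per parameter/activation; the function names are hypothetical and the formulas follow the accounting in this section.

```python
def inl_bits(p, q, s, J):
    # In-network learning: q/J joint data points per epoch,
    # 2*p*s bits each (forward + backward activations).
    return 2 * p * q * s / J

def fl_bits(N, s, J):
    # Federated learning: each of the J nodes exchanges its N model
    # parameters twice (upload and download) per epoch.
    return 2 * N * J * s

def sl_bits(p, q, s, N, eta, J):
    # Split learning: activation traffic 2*p*q*s, plus handing the
    # client-side model (eta*N parameters) to each of the J clients.
    return (2 * p * q + eta * N * J) * s

# Example values from this section (VGG16 column of Table 1).
p, q, s, J = 25_088, 50_000, 32, 500
N, eta = 138_344_128, 0.11
usage = {
    "INL": inl_bits(p, q, s, J),
    "FL":  fl_bits(N, s, J),
    "SL":  sl_bits(p, q, s, N, eta, J),
}
```

Note how only the INL and SL totals grow with q, while the FL total grows with the model size N, consistent with the comparison in Table 1.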
Compared to FL and SL, INL has the advantage that all nodes work jointly also during inference to make a prediction, not just during the training phase. As a consequence, nodes need only exchange latent representations, not model parameters, during training.

Experimental Results
We perform two series of experiments in which we compare the performance of our INL with that of FL and SL. The dataset used is CIFAR-10, and there are five client nodes. In the first experiment, the three techniques are implemented such that the same NN is used to make predictions during the inference phase. In the second experiment, the aim is to spread the data across the five client nodes in the same manner for each technique.

Experiment 1
In this setup, we create five sets of noisy versions of the images of CIFAR-10. To this end, the CIFAR images are first normalized and then corrupted by additive Gaussian noise with standard deviation set respectively to 0.4, 1, 2, 3, 4. For our INL, each of the five input NNs is trained on a different noisy version of the same image. Each NN uses a variation of the VGG network of [35], with categorical cross-entropy as the loss function, L2 regularization, and Dropout and BatchNormalization layers. Node (J + 1) uses two dense layers. The architecture is shown in Figure 8. In the experiments, all five (noisy) versions of every CIFAR-10 image are processed simultaneously, each by a different NN at a distinct node, through a series of convolutional layers. The outputs are then concatenated and passed through a series of dense layers at node (J + 1). For FL, each of the five client nodes is equipped with the entire network of Figure 8. The dataset is split into five sets of equal size, with the split performed such that all five noisy versions of the same CIFAR-10 image are presented to the same client NN (distinct clients observe different images, however). For the SL of [21], each input node is equipped with an NN formed by all five convolutional branches (i.e., the entire network of Figure 8 except the part at node (J + 1)), and node (J + 1) is equipped with the fully connected layers of Figure 8. Here, the processing during training is such that each input NN concatenates vertically the outputs of all convolutional branches and passes the result to node (J + 1), which then propagates back the error vector. After one epoch at one client, the learned weights are passed to the next client, which performs the same operations on its share of the dataset.
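The INL forward pass described above (five convolutional branches whose latent outputs are concatenated at node (J + 1)) can be sketched in PyTorch. This is a minimal sketch, not the exact architecture of Figure 8; the layer widths, latent size of 64, and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Stand-in for one input node's NN: a small VGG-style
    convolutional stack producing a latent vector."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class FusionNode(nn.Module):
    """Node (J + 1): concatenates the J latent vectors and applies
    two dense layers, as in the experiment."""
    def __init__(self, J=5, in_dim=64, classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(J * in_dim, 128), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(128, classes),
        )

    def forward(self, latents):
        return self.head(torch.cat(latents, dim=1))

branches = [Branch() for _ in range(5)]
fusion = FusionNode()
views = [torch.randn(8, 3, 32, 32) for _ in range(5)]  # 5 noisy views
logits = fusion([b(v) for b, v in zip(branches, views)])  # shape (8, 10)
```

During backpropagation, the gradient at the fusion node's first dense layer would be split along the concatenation boundaries and returned to the corresponding branches, as described in the training section.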
The model depicted in Figure 8, which utilizes convolutional layers with a filter size of 3 × 3, comprises approximately seventy-four million parameters, with 99.5% of these parameters constituting the encoding part of the network. Table 2 presents the per-epoch bandwidth requirements of the three techniques, for the variation of the CIFAR-10 dataset used in the experiment as well as for a hypothetical dataset with ten times as much data. It is observed that increasing the data size results in higher bandwidth requirements for both SL and INL, whereas the bandwidth requirements of FL remain unaffected. Figure 9a depicts the evolution of the classification accuracy on CIFAR-10 as a function of the number of training epochs, for the three schemes. As visible from the figure, FL converges more slowly and to a less accurate final result. Figure 9b shows the amount of data that needs to be exchanged among the nodes (i.e., bandwidth resources) to reach a prescribed classification accuracy. Observe that both our INL and SL require significantly less data exchange than FL, and that INL outperforms SL, especially for small bandwidth values. This experiment showcases that the INL framework can save bandwidth, compared to SL and FL, when training large models, by exchanging latent representations as opposed to model parameters. This is particularly relevant as some works argue that overparameterizing models can result in better model performance [36].

Experiment 2
In Experiment 1, the entire training dataset was partitioned differently for INL, FL and SL (in order to account for the particularities of the three). In this second experiment, they are all trained on the same data. Specifically, each client NN sees all CIFAR-10 images during training, and its local dataset differs from those seen by the other NNs only by the amount of added Gaussian noise (standard deviation chosen as 0.4, 1, 2, 3, 4, respectively). Additionally, for the sake of a fair comparison between INL, FL and SL, the nodes use essentially the same NNs in all three cases (see Figure 10). The model shown in Figure 10, with convolutional layers using filters of size 3 × 3, has approximately fifteen million parameters, 97.6% of which form the decoding part of the network. Table 3 shows the per-epoch bandwidth requirements of the three techniques, for the variation of the CIFAR-10 dataset used in the experiment as well as for a hypothetical dataset with ten times as much data. It is observed that increasing the data size results in higher bandwidth requirements for both SL and INL, whereas the bandwidth requirements of FL remain unaffected. Figure 11b shows the performance of the three schemes during the inference phase in this case (for FL, inference is performed on an image whose quality is the average of the five noisy input images used by INL and SL). Again, observe the benefits of INL over FL and SL in terms of both achieved accuracy and bandwidth requirements. This experiment showcases INL's ability to exploit the correlations between the data observed by the different nodes, resulting in better network performance.
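The data preparation used in this experiment (every client sees the whole normalized dataset, corrupted with its own noise level) can be sketched as follows; the function name `make_client_views` and the toy array shape are illustrative assumptions.

```python
import numpy as np

# Per-client noise standard deviations, as in the experiment.
NOISE_STDS = [0.4, 1.0, 2.0, 3.0, 4.0]

def make_client_views(images, rng=None):
    """Return one noisy copy of the normalized dataset per client;
    all clients see every image, at different noise levels."""
    rng = rng or np.random.default_rng(0)
    normalized = (images - images.mean()) / images.std()
    return [normalized + rng.normal(0.0, s, images.shape)
            for s in NOISE_STDS]

# Toy stand-in for CIFAR-10: 8 images of shape 32x32x3.
views = make_client_views(np.random.rand(8, 32, 32, 3))
```

Unlike Experiment 1, no partitioning takes place: the five local datasets differ only in their noise realizations.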

Conclusions
In this paper, our focus is on addressing the problem of distributed training and inference. We introduce INL, a novel framework which enables multiple nodes to collaboratively train a model that can be utilized in a distributed manner during the inference phase. Unlike existing works on distributed estimation and detection, our framework does not require prior knowledge of the data distribution; instead, it only necessitates access to a set of training samples. Furthermore, while other approaches to distributed training, such as FL and SL, assume local decision-making during the inference phase, we consider a scenario where the nodes observe data associated with the same event, thus enabling a joint decision that can lead to improved accuracy. The proposed INL algorithm offers a loss function derived through theoretical analysis, aiming to achieve the best trade-off between prediction accuracy, measured by logarithmic loss, and the amount of information exchanged among the nodes in the communication network.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1
The proof of Theorem 1 is based on a scheme in which the observations {x_j}_{j∈J} are compressed distributively using Berger–Tung coding [33]; the compression bin indices are then transmitted as independent messages over the network G using linear network coding [34] (Section 15.4). The decision node N first decompresses the compression codewords and then uses them to produce an estimate Ŷ of Y. In what follows, for simplicity, we set the time-sharing random variable to a constant, i.e., Q = ∅. Let 0 < ϵ″ < ϵ′ < ϵ.
The decision node N then produces an estimate ŷ^n of y^n as ŷ^n = ŷ(u_1^n(l_1), . . . , u_J^n(l_J)). It can be shown easily that the per-sample relevance level achieved using the described scheme is ∆ = I(U_1, . . . , U_J; Y), which completes the proof of Theorem 1.

Appendix B. Proof of Proposition 1
For C_sum ≥ 0, fix s ≥ 0 such that C_s = C_sum, and let P* = {P_{U*_1|X_1}, P_{U*_2|X_2}, P_{U*_3|X_3}} be the solution to (23) for the given s. Making the substitution in (22) yields (A4), which holds since ∆ is the maximum of I(Y; U_1, U_2, U_3) over all distributions for which (20b) holds, a set that includes P*. Conversely, let P* be such that (∆, C_sum) lies on the boundary of RI_sum; then (A5) follows from (20b). Inequality (A6) holds since the maximization max_P L(P) is taken over all P, including P*. Since (A7) is true for any s ≥ 0, we take s such that C_sum = C_s, which implies ∆ ≤ ∆_s. Together with (A4), this completes the proof.