Update-Based Machine Learning Classification of Hierarchical Symbols in a Slowly Varying Two-Way Relay Channel

Abstract: This paper presents a stochastic inference problem suited to a classification approach in a time-varying observation model with continuous-valued unknown parameterization. The utilization of an artificial neural network (ANN)-based classifier is considered, and the classifier is trained via the backpropagation algorithm. The main objective is the minimization of resources required for the training of the classifier in the parametric observation model. To this end, it is proposed that the weights of the ANN classifier vary continuously with the change of the observation model parameters. This behavior is then used in an update-based backpropagation algorithm. The proposed idea is demonstrated on several procedures, which re-use previously trained weights as prior information when updating the classifier after a channel phase change. This approach saves resources needed for re-training the ANN. The new approach is verified via a simulation on an example communication system with a slowly fading two-way relay channel.


Introduction
After their great success in classical application domains such as computer vision and speech recognition, machine learning (ML) methods have recently proved successful in other areas as well. The tasks suited to ML techniques can be divided into three main categories. First is the classification of objects based on rules inferred from a training set of examples (supervised learning). Second is the classification of objects without access to any examples, where the rules are found based only on a given distance metric. This technique (unsupervised learning) typically results in a cluster system that minimizes the intra-cluster distances and maximizes the inter-cluster distances. The third category (reinforcement learning) deals with learning how to interact with an unknown environment in order to reach a given goal. This method jointly estimates the behavior of the environment and learns a policy, which dictates the actions to be taken in order to meet the goal. A good introduction to these ML topics can be found in [1,2].
In this paper, we focus on the first category, i.e., on supervised learning, implemented with an artificial neural network (ANN). The ANN mimics the structure of a brain in a sense, in that it consists of a number of individual neurons, each having multiple real inputs and one real output. Each neuron is parameterized by a set of weights and a bias. The operation performed by each neuron is a simple weighted sum of its inputs, the addition of the bias, and the application of a nonlinear activation function. Typically, the structure of an ANN consists of a number of layers of neurons. The output of a neuron is connected to the inputs of all neurons in the subsequent layer. The total number of layers and number of neurons in each layer are the design parameters of the network. A special role is given to the input and output layer, which are always present (ANNs with hidden layers are termed deep neural networks). In the input layer, there is one neuron for each dimension of the input data. In the output layer, we have typically a neuron for each individual classification class, where we can apply a hard metric (for example, a max function) for a hard decision or a soft metric (soft-max function) to get a soft output, which can be interpreted as a probabilistic classification output.
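The neuron operation and the two output metrics described above can be illustrated with a minimal NumPy sketch. The layer size, the `tanh` activation, and the function names are assumptions chosen for illustration only; the text does not specify them.

```python
import numpy as np

def dense_layer(inputs, weights, bias, activation=np.tanh):
    """Weighted sum of the inputs plus the bias, followed by a nonlinearity."""
    return activation(weights @ inputs + bias)

def softmax(z):
    """Numerically stable soft-max; outputs are positive and sum to one."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2])        # two-dimensional input vector
W = rng.normal(size=(3, 2))      # 3 neurons, each with 2 input weights (assumed size)
b = np.zeros(3)                  # one bias per neuron

outputs = dense_layer(x, W, b)             # real-valued neuron outputs
soft_output = softmax(outputs)             # soft metric: probabilistic interpretation
hard_decision = int(np.argmax(outputs))    # hard metric: index of the largest output
```

The soft-max preserves the ordering of the outputs, so the hard decision is the same whether it is taken before or after the soft-max.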
Learning is performed by applying examples from the training set and optimizing the parameters of individual neurons w.r.t. a fidelity metric between the ANN output and the desired output given by the training set. This optimization can be efficiently implemented by the backpropagation algorithm, which performs a stochastic gradient descent on the fidelity metric. This is the basic principle behind ML and ANNs in particular. One of the issues with such a simple approach is the large number of parameters that arise with high-dimensional input data. With a large number of parameters, the learning process takes longer and requires a larger training set. One way of eliminating this problem is to preprocess the data and reduce its dimensionality while maintaining the information needed for classification. This process is called feature extraction and is very dependent on the particular data source and application. Much work has been done to develop sophisticated methods for such feature extraction, for example convolutional layers for image recognition [3], cepstral coefficients for speech recognition [4] or modulation recognition [5], and the bag-of-words technique, developed for feature extraction of textual data [6], to name a few.
Another issue, which is also related to the number of parameters of the network, is the problem of over-fitting. The core of all ML methods is the ability to extract generalizing classification rules out of a set of particular examples: the training set. Over-fitting occurs when insignificant details of the training data are learned and degrade the performance of classifying data not present in the training set. To combat this problem, regularization techniques such as constraining neuron weights or random neuron dropout are used, to name a few.
The problems of stochastic inference interpreted as ML tasks have attracted considerable attention in the field of physical layer communication [7][8][9][10]. The classical mathematical formulations of detection, estimation, and signal processing algorithms usually require a precise analytical knowledge of the system and observation model and usually lead to provably optimal closed-form results. In such situations, alternative ML-based approaches provide only numerically solved approximations to those solutions. Although this has limited applicability in classical scenarios, there are situations where the system and observation model are not fully known and/or the resulting closed-form solutions are too complex or unknown. In that situation, the ML approach provides an answer. As a demonstrative example, a particularly obvious area where this applies in the context of stochastic graphs is WPLNC (wireless physical layer network coding).
Recently, many interesting applications of ML have been proposed for use in wireless communications. To provide several motivating examples, in [11], the authors used an ANN to perform joint OFDM channel estimation and symbol detection, where the learning phase was performed offline based on known channel statistics. In [8,9], a deep learning approach to jointly optimize the whole wireless communication chain was proposed, and this approach was successfully tested in an over-the-air transmission. Similarly, in the context of WPLNC, the authors in [10] addressed parts of the network chain as individual deep neural networks, divided into the source modulator, a relay node, and the demodulator. The approach to modulation classification in [12], where it was interpreted as an image classification problem, also attracted attention. These papers largely motivated our work. So far, the utilization of ML in WPLNC has not been deeply explored, although the concept of WPLNC itself has recently attracted much attention. A comprehensive tutorial on WPLNC techniques can be found in [13], where fundamental concepts, advantages, and challenges are covered. This paper investigates the applicability of the ML concept in parametric, slowly varying two-way relay channel (2WRC) WPLNC communication. This scenario lies at the edge of what can still be handled by traditional approaches; see [14]. This allows us to find the required reference scenarios to which we compare the ML solution and gives a hint about how the ML solution could be applied in more complex scenarios, which are beyond the capability of classical analytical solutions. As stated in [7], it is suitable to consider ML when (1) the physical system model is not known, (2) a training dataset is available, (3) a clear metric of the task can be defined, and (4) a detailed explanation of the obtained result is not required. These general conditions are easily met in our context of wireless communication networks.
Our contribution in this paper is an exploration of a novel approach to compensate the effects of the variable phase of the wireless channel in a fundamental WPLNC scenario without the need for explicit channel state estimation (which would generally require orthogonal pilots). Instead, we use an ML approach that inherently tracks (learns) the varying hierarchical channel relative phase with tractable computational complexity. This is achieved with a hypothesis class of the ANN, and the task is interpreted as a classification problem. We explore methods to re-use previously trained coefficients of the ANN, thus saving computational resources. Validation of the results was performed by means of a numerical analysis.
The rest of the paper is organized as follows. Section 3 introduces the notation and describes the system model. Section 4 outlines a generic approach of ANN utilization for the classification of hierarchical symbols. Section 5 extends the previous approach for more practical scenarios and forms the main contribution of this paper. Section 6 concludes the paper and outlines future research directions.

Materials and Methods
Numerical analysis was performed using MATLAB software. The developed source code, utilized to produce the results, may be re-utilized without any restrictions and is available at: https://www.dropbox.com/sh/vnc8j3ioksqhrmt/AAALaKeDHzYrBAJ-fnnFURfBa?dl=0.

System Model
Let us assume a three-node network with two source nodes $S_A$, $S_B$ and one intermediate relay node $R$. Further, let us assume perfect time synchronization of symbols among the three nodes and knowledge of the modulation pulse used. Such a network topology is referred to as a 2-way relay channel (2WRC) [13]. Let the information bits $b_A, b_B \in \{0, 1\}$ at source nodes $S_A$, $S_B$ be mapped into the constellation space using binary phase shift keying (BPSK) as $s_A, s_B \in \{\pm 1\}$. For the 2WRC topology, it is efficient in the context of WPLNC to utilize an XOR hierarchical network code map, such that only the hierarchical symbol $b = b_A \oplus b_B$ needs to be identified by the relay node. (Briefly, the reason is that the relayed value of $b$ is sufficient to recover information bit $b_A$ at node $S_B$, as $b_A = b_B \oplus b$, and vice versa for $b_B$ at $S_A$. See [13] for details.) A complex AWGN with variance $\sigma_w^2$ per dimension is denoted as $w \in \mathbb{C}$. A simplified channel model with a relative fading parameter $h \in \mathbb{C}$ is considered. For the considered analysis, the following constellation space system model is used:

$$x = s_A + h\,s_B + w, \tag{1}$$

where $x \in \mathbb{C}$ is the observation of node $R$. The sequence index is dropped for notational simplicity. The SNR is defined with respect to $S_A$ and denoted as $\mathrm{SNR} = E[|s_A|^2]/\sigma_w^2$, where $E[\cdot]$ denotes the expectation operator.
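For readers who want to experiment, the observation model can be simulated with a few lines of NumPy. This is only a sketch under stated assumptions: it takes the superposition model $x = s_A + h\,s_B + w$ with a relative fading parameter $h$, uses the BPSK mapping bit 0 to +1 and bit 1 to −1 (the text only states $s_A, s_B \in \{\pm 1\}$), and the function name `simulate_2wrc` is hypothetical.

```python
import numpy as np

def simulate_2wrc(num_symbols, h, snr_db, rng):
    """Return relay observations x and hierarchical bits b = b_A XOR b_B."""
    b_a = rng.integers(0, 2, num_symbols)
    b_b = rng.integers(0, 2, num_symbols)
    s_a = 1.0 - 2.0 * b_a          # assumed BPSK mapping: 0 -> +1, 1 -> -1
    s_b = 1.0 - 2.0 * b_b
    # SNR = E[|s_A|^2] / sigma_w^2, and E[|s_A|^2] = 1 for BPSK
    sigma_w = np.sqrt(1.0 / 10.0 ** (snr_db / 10.0))
    # complex AWGN with variance sigma_w^2 per real dimension
    w = sigma_w * (rng.standard_normal(num_symbols)
                   + 1j * rng.standard_normal(num_symbols))
    x = s_a + h * s_b + w
    return x, b_a ^ b_b

rng = np.random.default_rng(1)
h = np.exp(1j * np.pi / 4)         # example relative fading with a 45 degree phase
x, b = simulate_2wrc(1000, h, snr_db=10.0, rng=rng)
```

With this mapping, the observations belonging to $b = 0$ cluster around $\pm(1 + h)$ and those belonging to $b = 1$ around $\pm(1 - h)$, which is the many-to-one structure the relay has to classify.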

Artificial Neural Network-Based Classification of Hierarchical Symbols
In this section, we briefly outline a generic approach, which is later improved and forms our main contribution. The main task is to infer the value of hierarchical symbol b, based on the observation x.
Obviously, knowledge of the parameter h in (1) is crucial for this task. The expected channel phase drift over time opens a problem of ambiguity of deciding the hierarchical symbol b.
The basic idea described in this paper is to utilize a classical ML approach for this problem, such that a training dataset of size $D$ is utilized to train an ANN classifier for this task. Vector $\mathbf{x}_d = [\Re\{x\}, \Im\{x\}]^T$ corresponds to the coordinates in the constellation space (an alternative would be to take $[|x|, \angle x]^T$ as the input vector, but it would introduce the problem of phase discontinuity). Vector $\mathbf{t}_d$ with dimensionality 2 represents the desired output value of the ANN. It should be clear that even though we do not apply any explicit feature extraction, such preprocessing takes place in the form of a projection of the received signal onto a set of basis functions, thus giving the coordinates of $\mathbf{x}_d$ in the constellation space. For the purpose of simulations, however, we work directly in the constellation space.
Let us now describe how to obtain the training dataset. Consider that initially, nodes $S_A$ and $S_B$ simultaneously transmit a predetermined, pseudorandom sequence of $D$ symbols each. Furthermore, let node $R$ be aware of the resulting hierarchical target sequence $b_d = b_{A,d} \oplus b_{B,d}$, $d = 1, \dots, D$. For the purpose of training the ANN using the backpropagation algorithm, each element of this sequence is transformed to the expected ANN output vector $\mathbf{t}_d$. Thus, the observation $\mathbf{x}_d(x)$ together with the corresponding value $\mathbf{t}_d$ form the required training dataset for the supervised ML hypothesis class.
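A possible construction of the training pairs is sketched below. The inputs follow the real/imaginary representation described above; the one-hot form of the two-dimensional target vector is an assumption made for illustration, since the exact mapping from the hierarchical symbol to the target vector is not reproduced here. The name `make_training_set` is hypothetical.

```python
import numpy as np

def make_training_set(x, b):
    """Map complex observations and hierarchical bits to ANN inputs and targets."""
    features = np.column_stack([x.real, x.imag])   # x_d = [Re{x}, Im{x}]^T
    targets = np.eye(2)[b]                         # assumed one-hot encoding of b
    return features, targets

# three example pilot observations with their known hierarchical bits
x = np.array([1.5 + 0.5j, -0.5 + 0.5j, -1.5 - 0.5j])
b = np.array([0, 1, 0])
features, targets = make_training_set(x, b)
```

Each row of `features` is one input vector and the matching row of `targets` is the desired two-dimensional network output.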
In the case of the considered BPSK modulation, the hierarchical symbol is binary. See Figure 1 for a demonstrative example. Therein, two colors distinguish the classes corresponding to the two values of the hierarchical target symbol $b$. The phase of $h$, denoted $\angle h$, dictates the rotation of the superimposed constellation.
It is natural to visualize the trained classifier in terms of its decision regions, as shown in [13]. An example of trained decision regions is shown in Figure 2, together with two reference solutions. The trained decision regions are shown in the central subfigure. The true metric map is based on the analytically optimal metric; see [13] for details. The distance-based map is an approximation of the true metric map, based only on the Euclidean distance between the neighboring points, also derived in [13].
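To make the idea of a distance-based reference concrete, the following sketch decides the hierarchical symbol from the nearest superimposed constellation point when $h$ is perfectly known. It is one common reading of a minimum-Euclidean-distance rule and is not claimed to reproduce the exact metric derived in [13]; the superposition model $x = s_A + h\,s_B + w$, the bit-to-symbol mapping, and the function name are assumptions.

```python
import numpy as np

def min_distance_hierarchical(x, h):
    """Decide b = b_A XOR b_B from the superimposed point closest to x."""
    s = np.array([+1.0, -1.0])                     # assumed BPSK: bit 0 -> +1, bit 1 -> -1
    s_a, s_b = np.meshgrid(s, s, indexing="ij")
    points = (s_a + h * s_b).ravel()               # the four noiseless points s_A + h*s_B
    labels = (s_a != s_b).astype(int).ravel()      # XOR of the underlying bits
    dist = np.abs(np.asarray(x)[..., None] - points)
    return labels[np.argmin(dist, axis=-1)]

# example: classify four observations for a relative phase of 90 degrees
decisions = min_distance_hierarchical(
    np.array([0.9 + 1.1j, 1.2 - 0.8j, -1.0 + 0.9j, -1.1 - 1.0j]), h=1j)
```

Evaluating such a rule on a grid of points in the complex plane yields a decision-region map that can be compared visually with the regions learned by the ANN.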
With the above description of the nature of the training dataset, it is straightforward to design an ANN, train it using the backpropagation algorithm, and perform the classification of the hierarchical symbols; see [1]. The specific implementation is not the main objective of this paper. For more details, see [15].

Slowly Varying Fading Parameter in 2WRC
Above, we briefly addressed how an ANN might be simply utilized for the classification of hierarchical symbols in 2WRC. Therein, we considered the training dataset D to be obtained for a single fixed value of h. In this section, we address a more practical situation, when the value of h is slowly varying. Practically, effects such as slow user mobility, changes in the propagation environment, or drift of the internal oscillators might be considered as the causes of it.
Considering the ANN, its functionality is fully determined by its structure (number of layers and number of neurons) and by the weights obtained in training. The training dataset is obtained as a pilot signal known to both Tx and Rx, as described previously, and we shall focus on the minimization of resources required for training. The reason is two-fold. Firstly, the number of training samples directly affects the signaling overhead. Thus, it is desirable to minimize the number of training samples $D$. Secondly, the training process of an ANN using backpropagation is itself time-consuming. Besides the number of training samples $D$, its time complexity typically depends on the number of training epochs $E$ and the learning rate $\eta$. From a matrix implementation point of view, it can be easily seen that the time complexity of backpropagation scales quadratically with the number of nodes and linearly with the size of the training set. Therefore, we restrict ourselves to a relatively small ANN with 4 layers. The core of our contribution in this paper is exploring ideas for minimizing the resources required for training the ANN in the case of hierarchical symbol classification in the 2WRC.
Note that in the simulations, we consider a 2WRC with BPSK modulation, and the employed ANN consists of 2 hidden layers, each having 40 neurons. The network has 2 inputs, corresponding to the coordinates in the constellation space. Its two outputs are designed to indicate the binary value of the hierarchical symbols, which might be easily scaled using the soft-max approach to interpret the outputs as classification likelihoods.
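The network dimensions stated above (2 inputs, two hidden layers of 40 neurons, 2 outputs) can be written down as a short NumPy sketch, which also makes the number of trainable parameters explicit. The `tanh` hidden activation, the weight initialization, and the function names are assumptions; the text does not specify them.

```python
import numpy as np

def init_network(sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Propagate one input vector; tanh hidden layers, soft-max output."""
    a = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        a = np.tanh(W @ a + b)
    W, b = params[-1]
    z = W @ a + b
    e = np.exp(z - z.max())          # numerically stable soft-max
    return e / e.sum()

sizes = [2, 40, 40, 2]               # input, two hidden layers, output
params = init_network(sizes, np.random.default_rng(0))
num_parameters = sum(W.size + b.size for W, b in params)   # weights plus biases
likelihoods = forward(params, [0.7, -0.2])                 # one constellation point
```

For these layer sizes the count is (2·40 + 40) + (40·40 + 40) + (40·2 + 2) = 1842 parameters, most of them in the connection between the two hidden layers.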

Numerical Analysis of the Effect of Training Dataset Size
Unfortunately, no general rules are available to determine what size of the training dataset is required to achieve the desired accuracy or error rate. This issue is traditionally addressed heuristically. Luckily, for our problem, it is straightforward to perform a large number of Monte Carlo simulations to determine how the quality of the trained classifier depends on the ANN parameters.
The size of the training dataset D is hereby considered to be a key parameter, since it dictates the transmission overhead. In Figure 3, we present an exemplary result of such a numerical analysis. Therein, we observe that for more than D = 2000 training samples, the accuracy of classification no longer increases, and therefore, it is not desirable to waste valuable radio resources to obtain them.

Updates of Previously Trained Weights
As stated previously, training of the ANN-based classifier is further determined by parameters $E$ and $\eta$. A basic approach for training an ANN classifier is to randomly initialize its weights and then perform backpropagation. Assume this is done for the training dataset $\mathcal{D}(\angle h_0)$, parameterized by the channel phase. The trained weights are denoted by $\alpha_0$. Consider further that after the training process is performed, the ANN can be used to classify the received symbols for a certain time, while $\angle h \approx \angle h_0$. Due to the variability of the wireless channel, the phase eventually changes to $\angle h = \angle h_1 \neq \angle h_0$. Inevitably, the classifier needs to be updated according to $\mathcal{D}(\angle h_1)$, resulting in new weights $\alpha_1$. In general, we ask whether we can re-use previously trained results $\alpha_{i-1}$ for obtaining new weights $\alpha_i$. To provide an answer, we propose three procedures, which differ in the strategy of re-initialization, and compare their classification performance.
To minimize the number of operations required for training, we propose the following general scheme with two different epoch numbers and learning rates. The procedure repeats with period $I$. First, the network is trained on $\mathcal{D}(\angle h_0)$ over $E_1$ epochs with rate $\eta_1$. For the $I - 1$ subsequent sets $\mathcal{D}(\angle h_i)$, the network is trained over $E_2 < E_1$ epochs with rate $\eta_2 > \eta_1$. The motivation is that the larger number of training epochs $E_1$ combined with the smaller learning rate $\eta_1$ leads to a more precise initial training.

Procedure P1: Always Re-Initialize
The first procedure, P1, is visualized in Figure 4 in green. The weights of the ANN are initialized. A long training process is performed, lasting $E_1$ epochs with learning rate $\eta_1$, and subsequently, for $I - 1$ realizations of $\angle h_i$, the training is performed within $E_2$ epochs with $\eta_2$. Weights $\alpha_i$ are re-initialized before each training.

Procedure P2: Regular Re-Initialization
The second procedure, P2, is visualized in Figure 4 in blue. Compared to the previous routine, the weights are re-initialized only after every $I$-th training.

Procedure P3: Never Re-Initialize
The third procedure, P3, is visualized right-most in Figure 4 in red. Weights $\alpha_i$ are randomly initialized only before the first training process. For the subsequent values of $\angle h_i$, the previously trained weights are utilized as prior information and updated. The difference among the three procedures thus lies in the strategy of re-initialization of the ANN weights.
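The three re-initialization strategies can be summarized as a small schedule generator. This is a sketch of one reading of the procedures: it assumes that the long training ($E_1$, $\eta_1$) is used at the start of every period of $I$ channel phases in all three procedures and that P2 re-initializes exactly at those points. The function name `training_schedule` and the tuple layout are hypothetical.

```python
def training_schedule(procedure, num_phases, period, long_cfg, short_cfg):
    """Yield (phase_index, reinitialize, epochs, learning_rate) per channel phase."""
    for i in range(num_phases):
        start_of_period = (i % period == 0)
        # long training at the start of each period, short updates otherwise
        epochs, rate = long_cfg if start_of_period else short_cfg
        if procedure == "P1":
            reinit = True                  # always re-initialize
        elif procedure == "P2":
            reinit = start_of_period       # re-initialize once per period
        elif procedure == "P3":
            reinit = (i == 0)              # initialize only before the first training
        else:
            raise ValueError(f"unknown procedure: {procedure}")
        yield i, reinit, epochs, rate

# example with period I = 3 and illustrative (epochs, learning rate) pairs
for step in training_schedule("P3", num_phases=6, period=3,
                              long_cfg=(30, 0.05), short_cfg=(20, 0.07)):
    print(step)
```

A training loop would draw fresh random weights whenever `reinitialize` is true and otherwise continue backpropagation from the weights left by the previous channel phase.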

Numerical Analysis
The above-described procedures P1-P3 were implemented, and we provide numerical simulation results. The phase of the relative fading, $\angle h$, of consecutive pilot sequences used for the training process changes uniformly as $\angle h_{i+1} - \angle h_i = 4.5^\circ$. As justified by the results presented in Figure 3, a pilot sequence of length $D = 2000$ was considered. We experimentally determined $E_1 = 30$ and $E_2 = 20$ to be suitable numbers of training epochs with the corresponding learning rates $\eta_1 = 0.05$ and $\eta_2 = 0.07$. With respect to the block diagram in Figure 4, consider $I = 3$.

To provide an in-depth look, an analysis of the evolution of the ANN weights was performed. To evaluate the rates of change of all weights $\alpha$ in each layer of the ANN, a per-layer expression was used, in which the upper index $(j)$ specifies the layer, $A^{(j)}$ is the number of weights in layer $(j)$, $e$ is a counter of training epochs, and $i$ identifies the specific weights of the layer. For comparison, this expression is evaluated and graphically represented in Figure 5 for procedure P3, without re-initialization, and in Figure 6 for procedure P1, where the weights are always re-initialized, and thus no prior information is utilized (the inspection of procedure P2 gave results similar to those for P3, and the figure is therefore omitted). Note that in these figures, it is clearly observable how the training process converges. The peaks of the curves correspond to the changes of the parameter $\angle h_i$, as marked by arrows. These results are useful for optimizing the parameters $E_1$, $\eta_1$, $E_2$, $\eta_2$.

In Figure 7, we provide a comparison of the approaches in terms of the bit-error rate (BER) over different values of the SNR. Therein, two reference solutions are provided, as addressed in Section 4. Recall that these reference solutions are based on perfect knowledge of $h_i$ and therefore are naturally superior to the trained solutions.
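As an illustration of a per-layer rate of weight change such as the one plotted in Figures 5 and 6, the sketch below averages the absolute change of every weight in a layer between two consecutive epochs. The use of the mean absolute difference is an assumption (a squared difference would serve the same purpose), and the function name is hypothetical.

```python
import numpy as np

def layer_weight_change(weights_prev, weights_curr):
    """Mean absolute change of the weights of each layer between two epochs."""
    return [float(np.mean(np.abs(curr - prev)))
            for prev, curr in zip(weights_prev, weights_curr)]

# toy example: two "layers" before and after one training epoch
before = [np.zeros((2, 2)), np.ones(3)]
after = [np.full((2, 2), 0.5), np.array([1.0, 2.0, 0.0])]
change_per_layer = layer_weight_change(before, after)
```

Recording this quantity after every epoch gives one curve per layer; a jump in the curve right after a change of the channel phase followed by a decay indicates that the weights are adapting and then settling again.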
Procedures P1-P3, represented in Figure 4, were tested for two values of the parameter $I$, namely $I = 3$ and $I = 5$. In this result, we observe that the most efficient strategy is procedure P3. It is also better to perform the longer training more often, i.e., $I = 3$ is preferable to $I = 5$.
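The training cost of the two period lengths can be compared with a short calculation. It assumes that the same values $E_1 = 30$ and $E_2 = 20$ are used for both $I = 3$ and $I = 5$, and it counts only epochs, so it is a rough indication rather than a measured cost.

```latex
\begin{align*}
N_{\text{period}}(I) &= E_1 + (I-1)\,E_2,\\
I = 3:&\quad 30 + 2\cdot 20 = 70 \text{ epochs per period}, \qquad 70/3 \approx 23.3 \text{ epochs per channel phase},\\
I = 5:&\quad 30 + 4\cdot 20 = 110 \text{ epochs per period}, \qquad 110/5 = 22 \text{ epochs per channel phase}.
\end{align*}
```

Under this assumption, the longer period saves only about 6% of the epochs per channel phase, so the better BER observed for $I = 3$ comes at a modest increase in training effort.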

Discussion
In this paper, we considered the problem of the classification of received symbols in a 2WRC according to a many-to-one hierarchical map arising from a WPLNC scenario, with an ANN employed as the classifier. It was shown that for a slowly varying phase of the wireless channel, it is possible to utilize previously trained weights of the ANN as prior information. An effort was made to minimize the resources required for the training of the system, in terms of both the pilot length and the number of training epochs. The simulation results showed that the utilization of the prior information was beneficial and improved the overall classification results. Three different training procedures were proposed, implemented, and evaluated. The procedure denoted P3 in the text, where the weights of the ANN were not re-initialized with consecutive changes of the relative fading parameter, was identified to be the best one.

Future research will focus on (1) the optimization of the parameters of the proposed methods, where the challenge lies in the number of parameters that the ANN model contains and a more systematic approach would be desirable, (2) the implementation of the approach for over-the-air transmission, which seems straightforward using the concept of software-defined radio (however, on-line processing of data opens timing issues in practical networks), and (3) the exploitation of the principle for more complex networks, where new challenges will emerge. As noted previously, the presented scenario still lies at the edge of what can be handled by traditional, analytically derived approaches. However, these are tractable only for simple network topologies. We hope to extend the approach described in this paper to more complex scenarios. These extensions might focus on more complex modulation schemes, an increased number of network nodes, and relaxed knowledge of the network topology.
Author Contributions: Development of the algorithms and writing, J.K.; supervision and editing, J.S.; validation and editing, P.H. All authors read and agreed to the published version of the manuscript.

Acknowledgments:
The authors would like to thank the two anonymous reviewers for their helpful comments, which improved the text.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
2WRC: 2-way relay channel
WPLNC: wireless physical layer network coding
ANN: artificial neural network
ML: machine learning
SNR: signal-to-noise ratio
BPSK: binary phase shift keying
BER: bit-error rate