Experimental Demonstration and Performance Enhancement of 5G NR Multiband Radio over Fiber System Using Optimized Digital Predistortion

This paper presents an experimental realization of a multiband 5G new radio (NR) optical fronthaul (OFH) based radio over fiber (RoF) system using digital predistortion (DPD). A novel magnitude-selective affine (MSA) based DPD method is proposed for complexity reduction and performance enhancement of the RoF link, followed by its comparison with the canonical piecewise linearization (CPWL), decomposed vector rotation (DVR) and generalized memory polynomial (GMP) methods. Similarly, a detailed study is presented, followed by an implementation proposal of a novel neural network (NN) for DPD and its comparison with the MSA, CPWL, DVR and GMP methods. In the experimental testbed, a 5G NR standard signal at 20 GHz with 50 MHz bandwidth and a flexible-waveform signal at 3 GHz with 20 MHz bandwidth are used to cover enhanced mobile broadband and small-cell scenarios. A dual-drive Mach-Zehnder modulator driven by the two distinct radio frequency signals modulates a 1310 nm optical carrier from a distributed feedback laser for transmission over 22 km of standard single mode fiber. The experimental results are presented in terms of adjacent channel power ratio (ACPR), error vector magnitude (EVM), and the number of estimated coefficients and multiplications. The study shows that novel methods such as MSA DPD are good candidates for real-time DPD deployment in comparison to NN based DPD, which performs slightly better than the proposed MSA method but has higher complexity. Both proposed methods, MSA and NN, meet the 3GPP Release 17 requirements.


Introduction
With recent advances in 5G and beyond, the accelerating growth in base stations (BS) has led to the centralization of the radio access network (RAN) [1,2], which decreases capital expenditure as it simplifies network management [2]. To facilitate C-RAN, a fronthaul (FH) connects baseband units (BBU) to remote radio heads (RRH) (see Figure 1). With 5G in the roll-out stage in most parts of the developed world, microwave photonics-based solutions such as Radio over Fiber (RoF) have a higher significance in connecting the BBUs with RRHs [3][4][5] due to advantages such as cost effectiveness, immunity to electromagnetic disturbance, broader bandwidth, and the extension of wireless link reach for all types of distances ranging from short to long. There have been various versions of RoF, such as Analog Radio over Fiber (A-RoF) [6,7], Digital Radio over Fiber (D-RoF) [8,9], Sigma Delta Radio over Fiber (SD-RoF) [10][11][12], and other variants that have been proposed recently (see Figure 2). Up to an extent, A-RoF links are the simplest and most economical solution; however, they suffer from nonlinearities arising from signal impairments and the devices involved, such as laser modules, fibers and photodiodes. The other solutions comprise utilizing D-RoF or SD-RoF. Considering D-RoF systems, the required analog-to-digital (ADC) and digital-to-analog (DAC) converters make the process very expensive. In addition, due to the high data rate capacity and high bandwidth requirement, common public radio interface (CPRI) restrictions are faced. The CPRI bottleneck can be overcome by exploiting SD-RoF. Here, ADCs and DACs are not required, as it is based on sigma-delta modulation, which utilizes a one-bit ADC, but the method is complex and hence not preferred. Moreover, the quantization noise is high for 1 bit, which requires an additional band pass filter (BPF) at the RRH.
However, this additional complexity is not the only issue for SD-RoF implementation: the addition of the BPF results in additional amplitude and phase noise, which requires a further solution to remove these added nonlinearities [13].
From this, it is evident that exploiting the other schemes (D-RoF/SD-RoF) is cumbersome. Hence, owing to the legacy infrastructure and cost effectiveness of A-RoF systems, they are the better choice for the optical fronthaul (OFH). The best alternative is therefore to counter the nonlinearities of the A-RoF system; provided this can be done in a simple and practical manner, the RoF technique proves extremely advantageous.
Mitigating the nonlinearities of RoF transmission is essential to utilize the system to its best potential and has become an important subject. Across these different domains, many techniques have been exploited to counter the prevalent issues. Amongst them, the ones that have been utilized extensively are discussed in Section 2 under the literature review.
The contribution of nonlinearities from the laser and, to some extent, the photodiode is important, as the transmission quality decreases and interference with nearby channels is triggered. However, when considering long-range networks, the nonlinearities due to the combination of fiber chromatic dispersion and laser frequency chirp are usually the main cause of signal impairment [7]. Orthogonal Frequency Division Multiplexed (OFDM) signals, such as the emphasized fifth generation (5G) signal, are also liable to these distortions due to their high peak-to-average power ratio (PAPR).
To the best of the authors' knowledge, this paper introduces nonlinearity and signal impairment compensation for OFH based RoF systems utilizing 5G NR signals. Following a very detailed literature review on nonlinearity mitigation in Section 2, the novelties of this article are manifold:

1. Firstly, multiband 5G NR signals are employed in the experimental testbed to cover enhanced mobile broadband (eMBB) scenarios and small cells for 3 GHz and 20 GHz, respectively.

2. A robust DPD technique utilizing negative feedback iteration is shown. The proposed DPD identification method has a relatively lower computational complexity compared to other learning architectures. In the proposed method, a DPD signal is first identified, followed by the estimation of the DPD parameters.

3. The linearization performed is not limited to our previously proposed methods derived from the Volterra series but includes "out of the box" approaches, namely the CPWL and DVR methods. In addition, a novel magnitude-selective affine (MSA) method is proposed that reduces complexity overheads, such as the multiplications in the CPWL architecture, while achieving similar efficiency to CPWL.

4. In addition, a simple optimized neural network (ONN) based DPD algorithm is proposed, an upgrade of our previously proposed deep-learning based DPD method, to perform linearization of 50 MHz 5G New Radio (NR) based RoF links. The NN DPD method is trained differently, without utilizing the Indirect Learning Architecture (ILA): initially we emulate the generic RoF link using an RoF NN and then train the proposed DPD ONN through it by backpropagating the errors.
For the first time, a comparative experimental study has been conducted in which MP, GMP, DVR, CPWL, MSA and ONN are compared for a 5G NR multiband signal. The performance is evaluated in terms of Adjacent Channel Power Ratio (ACPR), Error Vector Magnitude (EVM) and complexity in terms of multiplication and coefficient requirements. The summary of this work is shown in Figure 3, where an overall summary of each respective section is given.

Literature Review
In this section, the linearization methodologies that have been applied to RoF systems are discussed. The linearization methodologies consist of Electrical, Optical and Machine Learning methods. Electrical methods are subdivided into analog and digital methods, and digital methods are further subdivided into predistortion and postdistortion methods. Optical methods largely consist of dual wavelength, singular polarization, mixed polarization, etc., while a newer avenue of linearization belongs to machine learning. A high-level schematic of these methods is summarized in Figure 4.

A detailed literature review on mitigating the impairments of RoF systems is categorized in Table 1. The table summarizes the method employed, type of linearization, category, parameters evaluated, and the advantages and disadvantages of each method. Along with the techniques listed there, mitigation of the impairments of RoF systems has also been studied previously [24][25][26][27][28][29][30][31][32][33][34][35]. A feedforward scheme has also been analyzed [26,27] but is complex; to counter these problems, optical methods such as dual parallel modulation [28] and mixed polarization [29,30] have been presented.
In addition to the nonlinearities of the optical fiber/modulators, RoF is liable to other issues such as amplified spontaneous emission noise and phase noise [32,33], which further limit RoF performance and grow with increasing RF carrier frequency.

Neural Network Based DPD Architecture
Neural Networks (NN), such as the suggested NN based DPD model, are sophisticated networks that require extensive training data. This DPD model is cascaded with the RoF link, but its desired output is not known. In the case of the RoF link itself, however, the output is known; hence, we form an RoF NN model and train it to mirror the original RoF link. Once this RoF NN is formulated, we can backpropagate through it and update the parameters of the suggested NN DPD model.
Supposing that the considered RoF link has a transfer function H(n) and an output signal y(n), and a baseband signal x(n) must be sent through it, DPD aims to calculate the inverse transfer function of this RoF link, denoted by Ĥ⁻¹, whose output is denoted by x̂(n). This can be expressed by:

x̂(n) = Ĥ⁻¹(x(n)),

while

y(n) = H(x̂(n)) = G·x(n),

where G is the gain. The NN here calculates the Ĥ⁻¹ utilized for predistortion. Direct training cannot be implemented to construct the NN for DPD, as the ideal x̂(n) is not known. The option analyzed for performance assessment is illustrated in Figure 5 [1].
Initially, the RoF link is emulated by the second NN. The generic RoF link has x(n) as input and y(n)/G as output data, with which training is executed for the regression-based RoF NN model. This results in learning of the NN, which can then identify an approximate transfer function Ĥ. Upon formation of the RoF NN model, the model weights are fixed, after which we connect it with the NN DPD model. Now, we use the original input x(n) and output as training data to calculate the error using a loss function. We then backpropagate the error via Ĥ to train Ĥ⁻¹.
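The two-step training flow above can be sketched numerically. In the toy example below, a hypothetical memoryless tanh nonlinearity stands in for the RoF link and low-order polynomials stand in for both the regression RoF model and the DPD; none of these stand-ins are the paper's measured link or trained networks. The point is only the mechanics: fit a surrogate of the link from x(n)/y(n) data, freeze it, and update the predistorter by backpropagating the error y − G·x through the frozen surrogate.

```python
import numpy as np

rng = np.random.default_rng(0)

def rof_link(x):
    # Hypothetical memoryless compressive nonlinearity standing in for the
    # laser/modulator response (not the paper's measured link).
    return np.tanh(1.5 * x)

G = 1.0  # target end-to-end linear gain

# Step 1: fit a surrogate RoF model from input/output pairs. A degree-5
# polynomial stands in here for the regression-based RoF NN.
x_train = rng.uniform(-1.0, 1.0, 2000)
rof_model = np.polynomial.Polynomial.fit(x_train, rof_link(x_train), deg=5)
rof_deriv = rof_model.deriv()

# Step 2: train the DPD (a polynomial predistorter here) by backpropagating
# the error y - G*x through the FIXED surrogate model.
dpd = np.zeros(6)
dpd[1] = 1.0                 # start from the identity predistorter
lr = 0.05
for _ in range(3000):
    x = rng.uniform(-0.8, 0.8, 256)
    basis = np.stack([x ** k for k in range(6)])   # (6, N) monomials
    x_pd = dpd @ basis                             # predistorted signal
    err = rof_model(x_pd) - G * x                  # surrogate-link error
    dpd -= lr * ((err * rof_deriv(x_pd)) @ basis.T) / x.size

# The cascade DPD -> link should now be closer to the linear response G*x.
x_test = np.linspace(-0.8, 0.8, 1000)
y_dpd = rof_link(dpd @ np.stack([x_test ** k for k in range(6)]))
raw_err = np.mean((rof_link(x_test) - G * x_test) ** 2)
lin_err = np.mean((y_dpd - G * x_test) ** 2)
print(lin_err < raw_err)  # True: predistortion reduces the distortion error
```

In practice the two models are the NNs described above and the gradient step is performed by a deep-learning framework, but the error path through the frozen RoF model is the same.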


Features of the NN Model
This section discusses the features of neural networks that are essential for their use in a DPD based RoF system. The section is organized as follows: Section 3.1.1 discusses loss functions, Section 3.1.2 covers optimizers, Section 3.1.3 covers activation functions, Section 3.1.4 covers hyperparameters, Section 3.1.5 discusses regularization methods, and Section 3.1.6 describes the characteristics of the NN.
NNs are layered structures inspired by the human nervous system and are an adequate choice due to their powerful accuracy at approximating any nonlinear function and learning the relationship between their inputs and outputs. The basic operation involves an input signal provided at the input layer, processing that takes place in the hidden layer(s), and an output produced by the last (output) layer. Each layer represents a cluster of neurons which are not connected to one another within the layer but are connected to the neurons of the next layer.

Loss Function
Prior to optimizing, it is essential that the error is estimated for the current state of the model. For this, an error function, commonly known as a loss function, must be selected, as it is used to calculate the loss of the model. After this, the weights of the model are updated to minimize this loss function in further evaluations. Loss functions broadly classify into:
i. Regression Loss
ii. Classification Loss

Mean Squared Error (MSE)
MSE calculates the average of the squared differences between the actual and predicted values. Outcomes are always positive, irrespective of the signs of the actual and predicted values, and the minimum value it can take is 0.0. Note that larger differences cause more error due to the squaring of the value, penalizing the model for them.

Mean Squared Logarithmic Error (MSLE)
As the name suggests, here the natural logarithm of each of the predicted values is calculated first, and then the MSE is computed. MSLE relaxes the punishing effect of large differences in large predicted values.

Mean Absolute Error (MAE)
The MAE is more robust to outliers. It calculates the average of the absolute difference between the predicted and actual values.
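The three regression losses can be stated compactly in plain NumPy. The sketch below uses the common log1p form for MSLE; this is an assumption, as some texts apply the logarithm to the raw values instead.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences; large errors are penalized quadratically.
    return np.mean((y_true - y_pred) ** 2)

def msle(y_true, y_pred):
    # MSE on log(1 + value); softens the penalty on large predicted values.
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def mae(y_true, y_pred):
    # Mean of absolute differences; more robust to outliers than MSE.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
print(mse(y_true, y_pred))   # (0 + 0.25 + 1) / 3 = 0.4166...
print(mae(y_true, y_pred))   # (0 + 0.5 + 1) / 3 = 0.5
```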

Classification Loss Functions:
Classification loss functions are divided into the following:

Binary Cross-Entropy
This function is used when our target values belong to the set {0, 1}. A score summarizing the average difference between the predicted and actual probability distributions is calculated and then minimized. The least score that can be achieved by this function is 0.
Hinge Loss
This function is used when our target values belong to the set {−1, 1}. It promotes correctness of the sign of values and allots more error for differences in sign between predicted and actual values.
Squared Hinge Loss
This function is obtained by modifying the hinge loss function: it merely calculates the square of the hinge loss score. This smoothens the surface of the error function and makes it easier to work with numerically. As in the case of hinge loss, values must belong to the set {−1, 1}.
Multi-Class Cross-Entropy
Just as hinge loss is considered for binary classification, multi-class cross-entropy loss is considered for multi-class classification. Every class is given a distinct integer value. A score for all classes is summarized by calculating the average difference between the predicted and actual probability distributions. The minimum score that can be obtained is 0.
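The four classification losses above can be sketched in a few lines of NumPy; the function names and the probability clipping are illustrative implementation choices.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true in {0, 1}; p_pred is the predicted probability of class 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge(y_true, y_pred):
    # y_true in {-1, 1}; penalizes wrong-signed or low-margin predictions.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def squared_hinge(y_true, y_pred):
    # Square of the hinge term; a numerically smoother error surface.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)

def multiclass_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    # One row per sample; p_pred rows are probability distributions.
    return -np.mean(np.sum(y_true_onehot * np.log(np.clip(p_pred, eps, 1.0)), axis=1))

y = np.array([1.0, -1.0])
print(hinge(y, np.array([2.0, -2.0])))   # 0.0: confident and correct signs
print(hinge(y, np.array([0.5, 0.5])))    # 1.0: one weak margin, one wrong sign
```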

Optimizers
The process of optimizing loss functions improves the performance of the model. The procedure of minimizing/maximizing a mathematical function is called optimization and utilizes optimizers. Optimizers work by changing the parameters of NNs, such as the learning rates or weights. The types of optimizers are:

Gradient Descent (GD)
The GD algorithm works on the principle that the direction opposite to that of the calculated gradient will locate the lower point/surface of the considered function, i.e., the potential minima. It then repeatedly takes steps in that direction with every iteration. In the case of GD, we require a large memory, as we calculate the gradient over the entire dataset before the updates are performed, meaning that for huge datasets convergence can take an extremely long time.

Stochastic Gradient Descent (SGD)
To make up for the problem faced by the GD algorithm, SGD was developed. The SGD algorithm works by computing derivatives by considering only one point at once. This solves the large memory requirement problem; however, time is still an issue as one epoch of SGD takes more time when compared to the GD algorithm.

Mini Batch Stochastic Gradient Descent (MB-SGD)
This algorithm was developed to overcome time complexity problems. MB-SGD as the name suggests, works by considering a small batch of points or a small subset of the whole data at once to compute the derivatives. It takes longer to converge than GD does and additionally, its weight updates are noisy.
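The three variants above differ only in how many samples feed each gradient step, which a small least-squares example makes concrete; the data, learning rate and step counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless linear data for clarity

def grad(w, Xb, yb):
    # Gradient of the mean squared error over the given batch.
    return Xb.T @ (Xb @ w - yb) / len(yb)

def descend(batch_size, steps=300, lr=0.1):
    # batch_size == len(X): full-batch GD; 1: SGD; in between: MB-SGD.
    w = np.zeros(3)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

for bs in (len(X), 1, 32):           # GD, SGD, MB-SGD
    print(bs, np.round(descend(bs), 2))
```

All three converge towards true_w here; the trade-off the text describes shows up in memory (full-batch needs all samples per step) and in the noisiness of the per-step updates (largest for batch size 1).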
SGD with Momentum
SGD with momentum (momentum for short) calculates another parameter, the momentum, in addition to the gradient in every step. This momentum is the cumulative movement from all previous steps, and it overcomes the issue of noisy weight updates in MB-SGD by denoising the gradients. It moves faster owing to the accumulated momentum, which helps it escape local minima and plateau regions. In addition, the Nesterov Accelerated Gradient (NAG) algorithm is similar to momentum except that it looks ahead: the cost function is evaluated using future parameters rather than the current ones, to prevent the chance of skipping the minima in the case of a high momentum value.
Adaptive Gradient Descent (AdaGrad)
AdaGrad works by accumulating the sum of the squared gradients. As the name suggests, the learning rate is adaptive and does not need to be tuned manually. The adaptive learning rate helps escape typical complexities of non-convex functions, such as saddle points, by taking a straighter path than methods such as GD, SGD, MB-SGD and even momentum, which risk becoming stuck at saddle points and never converging to the minima. Unfortunately, AdaGrad becomes very slow, as the accumulated sum continues to grow, causing the learning rate to become very small; this consequently leads to vanishing of the gradient.

Adaptive Delta (AdaDelta)
As noted, AdaGrad converges slowly due to the large accumulated sum of squared gradients. AdaDelta was developed to resolve this by modifying certain aspects of AdaGrad. Here, rather than storing all the past gradients inefficiently, the sum of gradients is defined recursively as a decaying average of all past squared gradients. This allows AdaDelta to continue learning even after many updates.

RMSprop (Root Mean Square Propagation)
RMSprop, similar to AdaDelta, was developed to counter AdaGrad's problem of vanishing learning rates. RMSProp fixes the issue by adding a decay factor that emphasizes recent gradients and neglects older ones: the learning rate is divided by an exponentially decaying average of the squared gradients. AdaGrad might keep up with RMSProp initially, but AdaGrad's sum of squared gradients accumulates and becomes huge, until AdaGrad practically stops moving; RMSProp, in contrast, discounts the old squared gradients through the decay rate. This makes it faster than AdaGrad, and it impedes search in the direction of oscillations.
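The contrast between AdaGrad's ever-growing accumulator and RMSProp's decaying average can be seen on a one-dimensional quadratic; this is a toy setup, not an RoF model, and the learning rates are illustrative.

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    cache = cache + g ** 2                          # monotonically growing sum
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, g, cache, lr=0.1, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * g ** 2    # decaying average instead
    return w - lr * g / (np.sqrt(cache) + eps), cache

# Minimize f(w) = w^2 (gradient 2w) with both update rules.
w_a, c_a = np.array([5.0]), np.zeros(1)
w_r, c_r = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w_a, c_a = adagrad_step(w_a, 2 * w_a, c_a)
    w_r, c_r = rmsprop_step(w_r, 2 * w_r, c_r)
print(float(w_a[0]), float(w_r[0]))  # AdaGrad stalls far from 0; RMSProp gets close
```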

Adaptive Moment Estimation (ADAM)
The Adam algorithm combines the features of both momentum and RMSProp, meaning that along with accumulating the exponentially decaying average of the previous squared gradients, it also holds an exponentially decaying average of the previous gradients. Adam is an efficient optimizer due to the speed and gradient-adaptation capabilities acquired from momentum and RMSProp, respectively. This is also the reason it is the most used optimizer.
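The combination described above can be sketched on the same kind of one-dimensional quadratic, with the standard bias-corrected first (momentum) and second (RMSProp) moment estimates; the β values are the conventional defaults and the learning rate is illustrative.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # momentum: decaying mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # RMSProp: decaying mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2, whose gradient is 2w.
w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(float(w[0]))  # close to the minimum at 0
```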

Comparison of Optimizers
There are two metrics to determine the efficacy of an optimizer: the speed of convergence to the global minimum, and generalization (the model's performance on new data). The performance of the optimizers, and hence their choice, also depends on the type of data provided; based on that, two types of functions exist, convex and non-convex (Figure 4). In the case of the latter, optimizers must be carefully chosen, as such functions have multiple hurdles where our algorithm might become stuck, some of which we discuss ahead. The performance of the mentioned optimizers on different obstacles faced on non-convex functions is summarized in Table 2. From the table, we see that ADAM has performed comparatively better than the others, especially in terms of convergence speed, which is why it is the most used and most efficient optimizer.

Activation Functions
The activation, or response capability, of a node is not pre-defined; it is deduced with the help of an activation function. The activation function builds a relationship between the various weights and biases affecting the node and applies this relationship to the node as a function, hence determining its response. It also helps the network learn the complex patterns of the data. Activation functions transform the incoming signals at a node into an output signal, which is then used in the next layer of the network or becomes the output. The different types of activation functions are summarized in Table 3.
Table 3. Activation Functions.

Activation Function | Mathematical Representation | Brief Description
Binary Step | f(x) = 0 for x < 0; f(x) = 1 for x ≥ 0 | Only has the values 1 or 0 as output.
Linear | f(x) = x | Output is proportional to the input.

When considering ReLu, the possibility of a vanishing gradient is lower, in contrast with the gradient of sigmoids, which grows smaller with an increase in the absolute value of x. ReLu has a constant gradient, helping it learn faster and making it the better choice, provided we can work in the positive range of input values.
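The gradient argument above can be checked numerically. The sketch below defines the common activations and shows how quickly the sigmoid gradient σ'(x) = σ(x)(1 − σ(x)) shrinks as |x| grows, while the ReLu gradient stays at 1 for any positive input.

```python
import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Constant gradient of 1 for x > 0, which counters vanishing gradients.
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))          # [0. 0. 3.]
print(binary_step(x))   # [0. 1. 1.]

# Sigmoid gradient at x = 5: already nearly vanished.
sig_grad = sigmoid(5.0) * (1 - sigmoid(5.0))
print(sig_grad < 0.01)  # True
```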

Hyperparameters
The implementation of algorithms on data has certain prerequisites. Several hyperparameters must be defined for the functionality of our network. These are overall functionality-defining components and are not to be confused with the internal model parameters. This section presents the hyperparameter definitions in Table 4.
Note that there are no pre-defined fixed values for these hyperparameters. There are some conventionally used values, but they are not mandatory, and attaining a value suitable to a given problem is a trial-and-error process.

Different Methods of Regularization
All datasets are divided into two sections, namely the training data and the testing data. Sometimes models do not perform well on the testing data despite performing well on the training data. These situations arise due to underfitting and overfitting, illustrated in Figure 6. Regularization is the procedure in which the weight matrices of the nodes are penalized, hence overcoming these problems.
L2 and L1 Regularization
L1 and L2 are simple and widely known types of regularization. A regularization term is added to the cost function in this method. This term causes a reduction in the values of the weight matrices, as smaller weights result in a simpler model, significantly reducing the problem of overfitting.
In L2, we have:

Cost function = Loss + λ Σ‖w‖², (3)

where λ is the regularization parameter (which is tuned for more precise outcomes) and w is a weight. The L2 method is referred to as a weight decay procedure because it compels the weights to decay towards zero, although never exactly becoming zero.
In L1, we have:

Cost function = Loss + λ Σ‖w‖, (4)

As observed in Equation (4), the absolute weight value is considered here. Contrary to the L2 method, the weights here can become zero, which is beneficial for the compression of models.
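The two penalties can be compared on a toy regression in which only the first feature matters. For L1, the sketch uses a proximal (soft-threshold) step, a standard way to realize the exact zeros described above; the data, λ and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)   # only feature 0 matters

def fit(penalty, lam=0.5, lr=0.05, steps=2000):
    w = np.zeros(5)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)        # gradient of the data loss
        if penalty == "l2":
            grad = grad + lam * w                # from the lam * sum(w^2) term
        w = w - lr * grad
        if penalty == "l1":
            # Proximal (soft-threshold) step for lam * sum(|w|): the
            # mechanism that drives weights to exactly zero.
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w_l2 = fit("l2")
w_l1 = fit("l1")
print(np.round(w_l2, 3))   # all weights shrunk towards zero, none exactly zero
print(np.round(w_l1, 3))   # irrelevant weights driven exactly to zero
```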

Dropout
Dropout is the most used type of regularization due to its efficient results. It performs several iterations on the NN and at every iteration, random nodes are selected and removed along with all of their input and output signals/connections as seen in Figure 7. This results in selection of distinct nodes and hence distinct outputs per iteration. More randomness is expressed by splitting one network into its subsets leading to better performance when compared to a single dense model. The method can be implemented on input as well as hidden layers making it an adequate choice for large NNs.
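A minimal sketch of this node-removal mechanism is given below. It uses "inverted" dropout, in which surviving activations are rescaled by 1/(1 − rate) so the expected output is unchanged; the rescaling convention is an implementation choice not spelled out in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, train=True):
    # During training, zero each node with probability `rate` and rescale
    # the survivors so the expected layer output stays the same.
    if not train:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones((4, 10))           # a batch of hidden-layer activations
out = dropout(a, rate=0.5)
print((out == 0).any())        # True: some nodes were removed this iteration
print(out[out != 0][0])        # survivors scaled by 1/(1 - 0.5) = 2.0
```

At inference time (`train=False`) the layer is left untouched, matching the idea that dropout is only applied during the training iterations.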


Characteristics of the NN Models
Based on the discussion made so far, we have built a foundation of all components of a Neural Network and established that the two neural network models we require here are, first, the DPD NN, used to predistort our original RoF link and second, the emulated RoF NN, which is essential for training this DPD NN. The NN employed here contains N hidden-layers, K neurons per hidden-layer and is a feedforward fully connected network. The symbolic structure of the employed NN is shown below in Figure 8. The NN has two inputs and two outputs, in view of the complex nature of baseband signals representing both the real and imaginary parts of the signal. The ReLu activation function has been utilized for at least one of the multiple hidden layers (owing to its lower complexity).

Characteristics of the NN Models
Based on the discussion made so far, we have built a foundation of all components of a Neural Network and established that the two neural network models we require here are, first, the DPD NN, used to predistort our original RoF link and second, the emulated RoF NN, which is essential for training this DPD NN. The NN employed here contains hidden-layers, neurons per hidden-layer and is a feedforward fully connected network. The symbolic structure of the employed NN is shown below in Figure 8. The NN has two inputs and two outputs, in view of the complex nature of baseband signals representing both the real and imaginary parts of the signal. The ReLu activation function has been utilized for at least one of the multiple hidden layers (owing to its lower complexity). We represent the O/P for the first hidden-layer as follows: where, expresses the first hidden O/P layer, is the activation function (nonlinear) We represent the O/P for the first hidden-layer as follows: where, l 1 expresses the first hidden O/P layer, f is the activation function (nonlinear) and W 1 is the weight and b 1 is the bias for the first output layer in the network. We represent the general output for the ith layer as: where, i N : 2 ≤ i ≤ N. Then, the final output after N hidden layers will be: Training Algorithm The algorithm used to train the NN DPD model is given below. The algorithm utilizes MSE as the loss function, the optimizer used is ADAM and backpropagation is used for updating the weights. The process is repeated for a certain number of iterations (Z) to refine performance.
As mentioned earlier, we train our emulated RoF NN model using the I/P and O/P of the original RoF link and once this model is acquired, we connect the NN DPD model to it. Then, upon convergence of this training, we connect the actual RoF link with this DPD NN and proceed with its predistortion (see pseudocode in Algorithm 1).
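The two training phases can be illustrated with a toy, linear-in-parameters stand-in for both networks; the cubic link model, the polynomial bases and all values below are hypothetical (the actual work uses NNs trained with ADAM), but the flow is the same: fit an emulated link from input/output pairs, then fit a predistorter so that the cascade approximates the identity:

```python
import numpy as np

rng = np.random.default_rng(1)

def rof_link(x):
    """Stand-in for the physical RoF link: mild cubic compression (hypothetical)."""
    return x - 0.1 * x ** 3

# Phase 1: fit the emulated RoF model from observed link input/output pairs.
x = rng.uniform(-1, 1, 2000)
y = rof_link(x)
B = np.vstack([x, x ** 3]).T                 # polynomial basis (stand-in for the NN)
h = np.linalg.lstsq(B, y, rcond=None)[0]     # emulator coefficients

# Phase 2: fit the predistorter so that link(dpd(x)) ~= x; with a
# linear-in-parameters model this reduces to fitting on swapped data.
Binv = np.vstack([y, y ** 3]).T
g = np.linalg.lstsq(Binv, x, rcond=None)[0]

def dpd(u):
    return g[0] * u + g[1] * u ** 3

# After training, the cascade dpd -> real link is close to the identity.
err = np.max(np.abs(rof_link(dpd(x)) - x))
```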

Comparison with Volterra Method
It is interesting to compare the NN DPD methodology with the conventional GMP method that has been validated in [1,6,34]. MP/GMP are the most viable solutions that have been used for DPD.
Volterra series-based models are very commonly used. Upon conversion of signals from the electrical to the optical domain and vice versa, certain memory effects are introduced; the Volterra series takes these effects into account.
We represent the Volterra series as

y(n) = Σ_{m=1..∞} Σ_{q1=0..Q} … Σ_{qm=0..Q} h_m(q1, …, qm) Π_{i=1..m} x(n − qi),

where y(n) represents the RF output, x(n) represents the RF input signal and h_m is the mth-order Volterra kernel. The RF signal is down-converted to baseband, and we obtain the envelope through a low-pass filter. Then, with input x(n), the discrete-time baseband model is truncated to nonlinearity order K and memory depth Q.

Memory Polynomial Model (MP)
The memory polynomial (MP) model is capable of resolving both memory effects and nonlinearities simultaneously. It is also referred to as the diagonal Volterra model, as only its diagonal terms are non-zero. It is the middle ground between a memoryless model and a full Volterra model due to the existence of diagonal memory (non-diagonal terms are zero). The MP model is generally used because it is less complex than the full Volterra series. It can also emulate the nonlinear behavior of power amplifiers to an extent, which is why it has been used to model power amplifiers previously. The model is as follows [24-35]:

y(n) = Σ_{k=1..K} Σ_{q=0..Q} c_kq x(n − q)|x(n − q)|^(k−1),

where K is the nonlinearity order, Q is the memory depth, x(n) represents the baseband input, y(n) is the output sequence of the predistorter and c_kq are the model coefficients.
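The MP sum transcribes directly into numpy; the coefficients below are illustrative (only the linear, zero-delay term is active, so the model passes the input through unchanged):

```python
import numpy as np

def mp_output(x, c, K, Q):
    """Memory polynomial: y(n) = sum_{k=1..K} sum_{q=0..Q} c[k-1,q] x(n-q)|x(n-q)|^(k-1)."""
    y = np.zeros_like(x, dtype=complex)
    for k in range(1, K + 1):
        for q in range(Q + 1):
            xq = np.roll(x, q)
            xq[:q] = 0                      # zero-pad the delayed samples
            y += c[k - 1, q] * xq * np.abs(xq) ** (k - 1)
    return y

# Example: K = 2, Q = 1, with only c[0,0] (linear, zero delay) set.
x = np.array([1 + 0j, 0.5j, -0.25])
c = np.zeros((2, 2), dtype=complex)
c[0, 0] = 1.0
y = mp_output(x, c, K=2, Q=1)
```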

Generalized Memory Polynomial (GMP)
GMP has been productively utilized for DPD linearization of power amplifiers. Here, we use it to aid the linearization of RoF links via digital predistortion. As the equation below shows, the GMP model, in contrast with MP, holds the memory of the diagonal terms along with the cross terms as well, which is why it outperforms MP; in the coming sections, having established its superiority over MP [1,24-35,62-71], we evaluate comparisons with GMP only:

y(n) = Σ_{k=0..Ka} Σ_{q=0..Qa} c_kq x(n − q)|x(n − q)|^k
     + Σ_{k=1..Kb} Σ_{q=0..Qb} Σ_{r=1..Rb} d_kqr x(n − q)|x(n − q − r)|^k
     + Σ_{k=1..Kc} Σ_{q=0..Qc} Σ_{r=1..Rc} e_kqr x(n − q)|x(n − q + r)|^k    (11)

where x(n) is the DPD input and y(n) is the DPD output. The complex coefficients c_kq, d_kqr and e_kqr denote the signal and envelope, the signal and lagging envelope, and the signal and leading envelope, respectively. K_a, K_b, K_c are the maximum nonlinearity orders and Q_a, Q_b, Q_c are the memory depths; q and r are the memory indices, k is the nonlinearity index, and R_c and R_b are the leading and lagging delay tap lengths, respectively.
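For illustration, the GMP regression matrix of Equation (11) can be assembled as follows; the delays are realized as zero-padded shifts, and the parameter values in the example are arbitrary:

```python
import numpy as np

def gmp_basis(x, Ka, Qa, Kb, Qb, Rb, Kc, Qc, Rc):
    """Build the GMP regression matrix: aligned, lagging-envelope and
    leading-envelope term groups, one column per basis function."""
    def delay(sig, d):
        out = np.roll(sig, d)
        if d > 0:
            out[:d] = 0
        elif d < 0:
            out[d:] = 0
        return out
    cols = []
    for k in range(Ka + 1):                  # c_kq: x(n-q)|x(n-q)|^k
        for q in range(Qa + 1):
            xq = delay(x, q)
            cols.append(xq * np.abs(xq) ** k)
    for k in range(1, Kb + 1):               # d_kqr: x(n-q)|x(n-q-r)|^k
        for q in range(Qb + 1):
            for r in range(1, Rb + 1):
                cols.append(delay(x, q) * np.abs(delay(x, q + r)) ** k)
    for k in range(1, Kc + 1):               # e_kqr: x(n-q)|x(n-q+r)|^k
        for q in range(Qc + 1):
            for r in range(1, Rc + 1):
                cols.append(delay(x, q) * np.abs(delay(x, q - r)) ** k)
    return np.column_stack(cols)

# (2+1)*(1+1) + 1*2*1 + 1*2*1 = 10 basis columns for these small orders
Z = gmp_basis(np.ones(8, dtype=complex), Ka=2, Qa=1, Kb=1, Qb=1, Rb=1, Kc=1, Qc=1, Rc=1)
```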

Models Based on Canonical Piece Wise Linearization
DPD linearization has been achieved with Volterra-based methods, as discussed in Section 3. However, an "out of the box" approach that can achieve better performance is always an interesting addition to the topic. Recently, it was shown in [26-28] that, of the possible architectures, the CPWL method outperforms models such as the memory polynomial (MP) and the generalized memory polynomial (GMP). CPWL is an obvious choice due to the performance enhancement that it brings; however, it carries considerable complexity and overheads. The CPWL model can be expressed as in [28], where the baseband input is represented by x(n), the output baseband signal by y(n), K is the FIR filter length, M is the memory depth, L is the number of partitions in the CPWL, β_l are the thresholds, and c(1)_{m,k,l}, c(2)_{m,k,l}, c(3)_{m,k,l}, c(4)_{m,k,l} are the model coefficients.
In the CPWL model of Equation (13), there are many orders of multiplications and additions, which add considerable overhead in terms of complexity and utilization of hardware resources during DPD implementation, most importantly dedicated hardware adders and multipliers.

Decomposed Vector Rotation (DVR) Model
It is derived from the canonical piecewise-linear (CPWL) function and is a modified form of it. The model's nonlinear function is constructed from piecewise vector decomposition and is entirely different from the Volterra series discussed previously. The model is more flexible and can perform better even with a comparatively small number of coefficients than the conventional models. The DVR model is expressed in terms of the input x(n) and output y(n), where K_DVR is the number of elements in the partition, Q_DVR is the memory depth and β_k represents the breakpoints.

Magnitude Selective Affine (MSA) Based Linearization
Extending the previous work reported for DVR and CPWL, the objective of this work is to further reduce the overheads and complexity of the CPWL method by proposing a magnitude selective affine (MSA) function-based model. The advantage of this method is that it requires only a single linear operation for the selected zone leading to a lower complexity and simpler structure.
In order to optimize the operations, the coefficients in a zone that have similar magnitude can be coupled together [72,73]. Comparing the magnitude of the input samples against the threshold function selects the zone into which the samples fall and, hence, which affine function is utilized. This simplification reduces the complex operations of CPWL. Therefore, the first term of the CPWL function in Equation (13) can be rewritten in a simplified way as the MSA function of Equation (15). An example of the hardware implementation is shown in Figure 9. The simplification in Equation (15) means that power terms of the input can be compared directly with the thresholds to select the offset and linear gain of the MSA function, which removes the square-root calculation operation. The overall model of Equation (12) can then be written in terms of the MSA function.
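A minimal scalar sketch of the zone-selection idea: the sample magnitude is compared against the thresholds and a single affine map (gain and offset) is applied per zone. The thresholds, gains and offsets below are illustrative values chosen for continuity at the zone boundaries, not trained coefficients:

```python
import numpy as np

def msa_term(mag, thresholds, gains, offsets):
    """Magnitude-selective affine: pick the zone by comparing |x| with the
    thresholds, then apply that zone's single affine map gain*|x| + offset."""
    zone = np.searchsorted(thresholds, mag)   # zone index for each sample
    return gains[zone] * mag + offsets[zone]

thresholds = np.array([0.25, 0.5, 0.75])      # zone boundaries (beta_l)
gains = np.array([1.0, 0.9, 0.8, 0.7])        # one affine pair per zone
offsets = np.array([0.0, 0.025, 0.075, 0.15])
y = msa_term(np.array([0.1, 0.6, 0.9]), thresholds, gains, offsets)
```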

Negative Feedback Iteration based Modelling Approach
For the computation of the DPD model coefficients with the negative feedback iteration technique shown in Figure 10, the predistortion coefficients are calculated in the training phase. z(n) is the input of the predistorter, derived from the output y(n) of the RoF link as z(n) = y(n)/G, where G is the link's gain. There are two main steps in this technique. The first step establishes a negative feedback iteration to obtain an input signal that can be regarded as the DPD signal; the second step calculates the DPD model parameters. A shared feedback path is adopted to observe the output information in both frequency bands. In this case, the negative feedback iteration is performed on the lower band and the upper band in turns.
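The first step can be sketched in numpy with a hypothetical cubic link model and gain G = 2 (the real coefficients come from the measured link); the observed output, normalized by G, is compared with the desired signal and the error is fed back into the input until convergence:

```python
import numpy as np

def link(u):
    """Stand-in nonlinear RoF link with gain G = 2 (hypothetical model)."""
    return 2.0 * (u - 0.1 * u ** 3)

G, mu = 2.0, 0.5
x = np.linspace(-0.8, 0.8, 101)     # desired (linearized) output divided by G
u = x.copy()                        # initial predistorted signal
for _ in range(50):                 # negative feedback iteration
    z = link(u) / G                 # normalized observed output
    u += mu * (x - z)               # push the error back into the input

# upon convergence link(u)/G ~= x, and u serves as the DPD training target
err = np.max(np.abs(link(u) / G - x))
```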

Estimation Algorithm
Estimation is initiated by collecting the coefficients c_kq, d_kqr and e_kqr into an R × 1 vector v, where R is the total number of coefficients and v is related to a signal sampled over the same period. For example, the coefficient c_21 is associated with the signal x(n − 1)|x(n − 1)|². Z, an N × R matrix, collects all such basis vectors. Upon convergence, the predistorter training block output will be z_p(n) = x(n), hence z(n) = u(n).
For N samples, the output is u = Zv. The solution minimizing the least-squares cost function ||u − Zv||² is

v̂ = (Z^H Z)^(−1) Z^H u.
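The least-squares step can be sketched with numpy on synthetic data; the basis matrix here is random for illustration, whereas in practice its columns are the GMP regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 200, 6
Z = rng.standard_normal((N, R)) + 1j * rng.standard_normal((N, R))  # basis matrix
v_true = rng.standard_normal(R) + 1j * rng.standard_normal(R)
u = Z @ v_true                                  # desired predistorter output

# normal-equations solution v = (Z^H Z)^(-1) Z^H u
v_hat = np.linalg.solve(Z.conj().T @ Z, Z.conj().T @ u)
```

In production DPD trainers a QR- or SVD-based solver (e.g. `np.linalg.lstsq`) is usually preferred over the normal equations for numerical robustness.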

Experimental Setup
For the validation of this technique, a multiband 5G NR scenario for outdoor/indoor environments working at 3 GHz (20 MHz bandwidth) and 20 GHz (50 MHz bandwidth) is considered, which was discussed in our previous work [35], where no DPD was implemented. As an upgrade of this architecture, the setup is integrated with a multiband DPD block to enhance the performance of the link. In the setup shown in Figure 11, a 1310 nm optical carrier from a DFB laser is modulated by an MZM driven by two distinct RF signals. VSG1 provides RF1, a 5G NR waveform at 20 GHz, while the 5G transceiver provides RF2, a 3 GHz flexible-waveform (O/G/F-FDM) signal. The DPD process can be divided into three main phases. In the first phase, the signals are upconverted to their respective carrier frequencies of 3 and 20 GHz and passed through 22 km of standard single-mode fiber (SSMF); a photodetector (0.71 A/W responsivity, 40 GHz bandwidth) receives the signals and converts them back to the electrical domain. Since the bands need to be isolated separately, an amplification stage is added, followed by a diplexer (DPX) that separates the 20 GHz and 3 GHz signals. The signals then go to distinct vector signal analyzers (VSAs), whose outputs are fed to the post-processing block for performance evaluation. This first step is carried out without DPD, i.e., the output is evaluated without any DPD processing.
In the second phase, called the DPD training phase, the DPD operation depicted in Figure 10 is applied, and training is carried out until the error converges.
In simple words, DPD ensures that the amplitude and phase responses are inverse to those obtained at the electrical amplifiers EA1 and EA2, respectively. Architectures such as GMP, CPWL, MSA and ONN can be utilized according to the user's performance and comparison requirements.
In order to synchronize the received and reference waveforms, we exploit the positioning reference signal (PRS) defined in the 5G NR framework. The PRS bandwidth is taken to be 20 MHz (106 resource blocks). The received signal and the transmitted reference signal are correlated in the time domain, and the resulting power delay profile (PDP) is passed through a maximum block to find the strongest path of arrival.
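The correlation-based timing search can be sketched as follows; a random real sequence stands in for the PRS, and the delay and noise level are arbitrary:

```python
import numpy as np

def coarse_sync(ref, rx):
    """Coarse timing sync: correlate the received signal with the reference
    (a stand-in for the 5G NR PRS) and pick the strongest-path lag."""
    corr = np.abs(np.correlate(rx, ref, mode="full"))   # power-delay-profile magnitude
    return int(np.argmax(corr)) - (len(ref) - 1)        # maximum block -> strongest path

rng = np.random.default_rng(3)
prs = rng.standard_normal(64)
rx = np.concatenate([np.zeros(37), prs]) + 0.01 * rng.standard_normal(101)
lag = coarse_sync(prs, rx)      # recovers the 37-sample delay
```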
Moving on to the third phase, the baseband signals are predistorted in the DPD block, upconverted to their carrier frequencies by the respective VSGs and fed into the optical link. The signal received at the photodiode is passed through the diplexer (DPX) to isolate the respective bands at the VSAs and forwarded to the DPD training stage. In the DPD validation phase, the switches are moved to the opposite position, and the 5G NR frames are evaluated after being predistorted and passed to the VSGs. The RoF link's nonlinearities fluctuate slowly owing to thermal effects and component ageing, from which we deduce that real-time processing in the adaptation is inessential. The parameters utilized are summarized in Table 5 and have been used previously in [35] and other state of the art [20-28]. The parameters of the different architectures utilized in this study are given in Tables 6-8, respectively. Table 6 lists the parameters that result in the optimized performance of the NN; the parameters discussed in Section 3.1 define the structure of the proposed optimized neural network, and the last section of the table evaluates the complexity of the NN in terms of its coefficients. For a comparative analysis, the parameters of GMP and DVR are given in Table 7. The proposed MSA-DPD technique and unmodified CPWL are used with M = 3 and K = L = 4. Similarly, for comparison, the GMP method previously used in [34,65] is applied with K = Q = R = 3. For the NN, N = 10 and K = 30 are utilized.

Experimental Results and Discussion
In this section, results are presented for the experimental setup described above. The mean square error (MSE) is one way to estimate the accuracy of the coefficient estimation for the different architectures. Without DPD, the MSE is measured to be −27 dB, while for GMP it is −30 dB; the value reaches as low as −35 dB for CPWL and MSA, while the NN achieves an MSE of −39 dB.
In addition to the MSE, the quality of the proposed methods is compared in terms of the adjacent channel leakage/power ratio (ACLR/ACPR) and the error vector magnitude (EVM).

Error Vector Magnitude
The error vector magnitude is the most common performance metric utilized for this study item in 3GPP. EVM quantifies the difference between the expected value of the demodulated symbol and the actual value of the demodulated received symbol. EVM can be written mathematically as [5]

EVM = sqrt( Σ_{m=1..M} |S_{0,m} − S_m|² / Σ_{m=1..M} |S_m|² ) × 100%,

where M is the number of constellation symbols, S_m is the ideal constellation symbol associated with symbol "m" and S_{0,m} is the received symbol associated with S_m. The 3GPP has set the EVM limit for 256-QAM at 3.5% [74].
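The definition maps directly to code; the QPSK symbols and the fixed offset error below are illustrative:

```python
import numpy as np

def evm_percent(rx_symbols, ref_symbols):
    """EVM (%): RMS error between received and reference constellation
    symbols, normalized by the RMS reference power."""
    err = np.mean(np.abs(rx_symbols - ref_symbols) ** 2)
    ref = np.mean(np.abs(ref_symbols) ** 2)
    return 100.0 * np.sqrt(err / ref)

ref = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])   # QPSK corners
rx = ref + 0.02 * (1 + 1j)                           # small fixed offset error
# |error|^2 = 0.0008, |ref|^2 = 2  ->  EVM = 100*sqrt(0.0004) = 2%
```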
In Figure 12a, the EVM is represented by sweeping the RF input power. It is evident that MSA-DPD reduces the EVM to <3%, compared with 5% obtained with GMP. MSA-DPD gives a slight improvement over CPWL, but the significant contribution is reaching a similar improvement with smaller complexity. In addition, the DPD NN outperforms MSA-DPD by about 1%. The DPD NN performs better than MSA between 0 and 5 dBm, whereas MSA has the best overall performance from −15 to 0 dBm compared with all other techniques. Similarly, in Figure 12b, the EVM is reported for all the comparative methodologies for the different flexible waveforms used in the 5G NR framework at 0 dBm RF input power. Clearly, MSA-DPD performs almost identically to CPWL while reducing the complexity, and in fact achieves about 1% better reduction than CPWL. Compared with MSA, the NN performs better and clearly shows the best performance overall.

Adjacent Channel Leakage Ratio
ACLR, also called the adjacent channel power ratio (ACPR), quantifies the distortion components outside the useful signal bandwidth. It is expressed as [5]

ACLR = ∫_{ab_l}^{ab_u} T(f) df / ∫_{ub_l}^{ub_u} T(f) df,

where T(f) denotes the power spectral density (PSD) of the output signal, ab_u and ab_l are the upper and lower frequency limits of the adjacent channel, and ub_l and ub_u are the frequency bounds of the useful band. In Figure 13a, the ACLR for the given input power variations and for the different methodologies is presented. At an RF input power of 0 dBm, the ACPR without DPD is around −28 dBc; for DPD-GMP it is around −41 dBc, for DPD-CPWL around −44 dBc, for MSA-DPD around −45 dBc, and for DPD-NN around −48 dBc, keeping the ACPR below the −45 dBc limit set by 3GPP [74]. Thus, DPD-NN achieves about 3 dB lower ACPR than MSA-DPD, which in turn has the best performance among the Volterra-series-based methods.
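A numpy sketch of the ratio above, integrating an FFT-based PSD over the useful and adjacent bands; the single-tone test signal, sample rate and band placement are illustrative (a clean tone has essentially no adjacent-channel power, so the ACPR is strongly negative):

```python
import numpy as np

def acpr_db(x, fs, f_center, bw, offset):
    """ACPR (dBc): mean adjacent-channel power at +/- offset from the carrier,
    relative to the in-band power, both integrated from the FFT-based PSD."""
    n = len(x)
    psd = np.abs(np.fft.fft(x)) ** 2 / n
    f = np.fft.fftfreq(n, d=1.0 / fs)

    def band_power(fc):
        mask = (f >= fc - bw / 2) & (f < fc + bw / 2)
        return psd[mask].sum()

    p_adj = 0.5 * (band_power(f_center - offset) + band_power(f_center + offset))
    return 10 * np.log10(p_adj / band_power(f_center))

rng = np.random.default_rng(0)
fs = 1024.0
t = np.arange(4096) / fs
tone = np.exp(2j * np.pi * 100.0 * t)                 # clean carrier on an FFT bin
noise = 1e-5 * (rng.standard_normal(4096) + 1j * rng.standard_normal(4096))
val = acpr_db(tone + noise, fs, f_center=100.0, bw=20.0, offset=25.0)
```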
Similarly, Figure 13b shows the power spectral density (PSD) with and without DPD at 3 GHz. The MSA technique achieves good performance compared with CPWL, while the NN DPD has a small advantage over MSA, with spectral regrowth suppression in the 10 to 15 MHz zones.
Similarly, Figure 13c shows the electrical spectra, with two main components representing the carrier signals at 3 and 20 GHz, respectively. It illustrates that the proposed MSA-DPD technique further reduces the complexity of the DPD implementation and enhances the performance of the link compared with CPWL, GMP and the case without DPD. The ACPR is thus reduced with the proposed MSA-DPD method by a good margin compared with the CPWL and GMP methods, keeping the ACPR below the −45 dBc limit set by 3GPP [74]. It is important to observe that MSA-DPD performs slightly better than CPWL; however, the performance gain is not the only benefit, as the complexity reduction is the most important advantage of the proposed MSA-DPD technique (discussed in Section 7.3).


Complexity Considerations
The complexity reduction that MSA-DPD achieves while matching the performance of the CPWL method is a significant contribution. The construction complexity of the DPD models is shown in Table 9, measured mainly by the required number of real multipliers, as multipliers take up most of the hardware resources. Table 9 shows that MSA-DPD (220 multiplications) has far lower complexity than CPWL (880 multiplications). More advanced Volterra-series variants can be obtained by increasing the memory depth and nonlinearity order; however, the computational complexity has to be considered, as shown in Equation (11). This means that, when selecting a DPD model, a smart trade-off between complexity and performance can be made accordingly. For a comparative evaluation of the NN and GMP methods, we evaluate expressions for the complexity of each method. The numbers of coefficients and multiplications in the CPWL architecture are higher than in GMP; however, this additional complexity in CPWL yields better performance than GMP (as seen in the results in Figures 12 and 13). The proposed MSA method brings the complexity down: it achieves performance similar to CPWL while reducing the number of coefficients from 520 to 260 and the multiplications from 880 to 220. Looking at the NN architecture, its performance (see Figures 12 and 13) is comparable to MSA; however, the number of coefficients rises to 8526 and the number of multiplications to 8224. NN complexity is a challenging issue, and, as shown in this work, it can be limited by employing a very small number of hidden layers and neurons per layer.
The multiplications and complexity of the NN increase rapidly with the number of hidden layers and neurons per layer; depending on the application, MSA-based DPD, with much lower complexity, can be employed instead, as it achieves a similar performance to the NN. Table 9. Complexity Comparisons.
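The scaling of the NN cost can be approximated with a quick count of weights and biases; the bookkeeping below is a simplification, so the totals land close to, but not exactly at, the 8526 coefficients and 8224 multiplications reported in Table 9 (the exact figures depend on how bias and input/output terms are counted):

```python
def nn_counts(n_hidden, neurons, n_in=2, n_out=2):
    """Weights (one real multiply each per forward pass) and total
    coefficients (weights + biases) of a fully connected feedforward NN."""
    sizes = [n_in] + [neurons] * n_hidden + [n_out]
    mults = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    coeffs = mults + sum(sizes[1:])            # biases: one per non-input neuron
    return coeffs, mults

coeffs, mults = nn_counts(10, 30)   # N = 10 hidden layers, K = 30 neurons
# the K*K inner-layer products dominate: mults grow quadratically in K
```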

Real Time Implementation
In a realistic NN scenario, the linearization methodology is carried out at the central office (CO), where the BBUs are placed, and a periodical re-training of the DPD system is in this case necessary, requiring however negligible time with respect to the normal operation time of the RoF system. Recently, a Xilinx DPD kit has been developed that can be used for this purpose [75].
Bringing the feedback signal from the BS to the RAU is one of the main challenges in the adaptive compensation of the RoF link. This is due to the possible nonlinearity of the feedback link; in fact, it can be as nonlinear as the RoF link being compensated. The present work is based on the fact that the predistorter sees only the non-idealities that it needs to compensate; if the nonlinear feedback connection were left uncompensated, it would destroy the performance of the compensation. In simple words, an approach is utilized where the RoF link is first compensated using a post-distorter and a known training signal from the RAU. After that, the already compensated downlink RoF link can be used as a feedback connection for the compensation [76]. A possible feedback scenario is shown in Figure 14. Indeed, higher modulation formats and higher bandwidths, similar to multiple LTE carriers or the 5G new radio (NR) waveforms discussed, would lead to higher complexity of the DPD operation due to the stronger PAPR. Concomitantly, the increase in bandwidth will increase the overall baseband memory of the system model. Nevertheless, the evaluated models remain valid, although higher values of Q and K will be indispensable compared to the considered case.
To summarize the discussion above, Table 10 lists the values of ACPR, MSE and EVM for all the utilized methodologies.

Conclusions
This work presents a successful realization of a 5G NR multiband OFH with novel DPD solutions for reducing RoF nonlinearities, using conventional Volterra-based methods and further exploiting deep learning with neural networks to improve the results. Firstly, a novel MSA-DPD method has been proposed, which reduces the complexity of the CPWL method and reaches similar performance while cutting the complexity overheads and multiplications by 75%. The article also explains the theoretical foundations and elements required for building a neural network. The 5G NR multiband signals at 3 GHz and 20 GHz are transmitted over a 22 km fiber length. The proposed MSA-DPD method reduces the ACPR from −28 dBc to −45 dBc, while the NN reduces the ACPR to −48 dBc. Similarly, the EVM is reduced from 11% to 3.5% with the MSA method, and the NN brings it to 2.7% at an RF input power of 5 dBm. The results signify that the proposed MSA-DPD method reduces signal impairments more effectively than the GMP and CPWL methods. The estimated multiplication operations are reduced from 880 (CPWL) to 220 (MSA), leading to much lower complexity and overheads while meeting the standardization requirements set by 3GPP Release 17. MSA, with much lower complexity and performance similar to the NN, can be employed for DPD depending on the application scenario. It should be noted that DPD works as a black box: it counteracts the overall nonlinearities of the system, including those of the MZM, laser, fiber and photodiode. The combined effect of laser chirp and fiber dispersion becomes a major nonlinearity issue after tens of km [5]; therefore, the laser and possibly the photodiode are the primary sources of nonlinearity mitigated in the proposed bench. In future, it will be interesting to increase the fiber length and linearize the link by mitigating fiber nonlinearities such as the Kerr effect.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.