Harnessing machine learning for fiber-induced nonlinearity mitigation in long-haul coherent optical OFDM

Coherent optical orthogonal frequency division multiplexing (CO-OFDM) has attracted considerable interest in optical fiber communications due to its simplified digital signal processing (DSP) units, high spectral-efficiency, flexibility, and tolerance to linear impairments. However, CO-OFDM’s high peak-to-average power ratio makes it highly vulnerable to fiber-induced non-linearities. DSP-based machine learning has been considered a promising approach for fiber non-linearity compensation without a large penalty in computational complexity. In this paper, we review the existing machine learning approaches for CO-OFDM in a common framework and survey the progress in this area, with a focus on practical aspects and comparison with benchmark DSP solutions.


Introduction
Nowadays, the majority of transmitted digital data is carried by optical fibre cables, which form the major part of the telecommunications infrastructure worldwide. However, due to the explosive growth of the Internet and a number of bandwidth-hungry new services such as online gaming, 3D or high-definition TV and cloud computing, this infrastructure will eventually fail to satisfy future capacity needs. As Gartner Research states: "4.9 billion connected things in use in 2015, and will reach 20.8 billion by 2020". This growing demand for more capacity will lead to an imminent capacity crunch unless innovative solutions for faster data transmission are found [1,2].
Current optical networks are based on conventional single-mode fibre (SMF) cables and high-order modulation formats such as 16-quadrature amplitude modulation (16-QAM), with which more digital information can be carried. This could form the most plausible route towards the desired increase in bandwidth capacity. However, the reason why this solution has not been adopted so far lies in the very cause of the capacity crunch itself, which originates from the optical fibre Kerr effect [3]. The Kerr effect is a nonlinear phenomenon that distorts the propagated optical signal in proportion to its power [4], resulting in a reduction of the data transmission rate [5]. Few-mode fibers (FMFs) are naturally more prone to nonlinearity than SMFs due to the increased crosstalk distortions between the spatial modes. On the other hand, the drive towards higher-order modulation formats, such as 16-QAM, and spectral-efficient techniques, such as orthogonal frequency division multiplexing (OFDM), leads to greater transmission impairments, reducing the maximum distance over which increased capacity can be provided. More specifically, denser constellation diagrams render higher-order modulation formats more susceptible to circularly-symmetric Gaussian noise, as generated by Erbium-doped fiber amplifiers (EDFAs) along the transmission link [6]. Even though the launch power per wavelength channel can be increased to improve the signal-to-noise ratio (SNR) at the receiver, transmission is limited by nonlinear distortions due to the Kerr effect, which have a more severe impact on higher-order modulation formats and spectral-efficient modulation schemes [5,7].
Moreover, the transmission of more than two signal wavelengths (wavelength-division multiplexing, WDM) through an optical fibre generates four-wave mixing (FWM), a process caused by the power dependence of the refractive index of the optical fibre [3]. FWM is related to fibre nonlinearity and gives rise to new wavelengths which significantly degrade the signal quality, especially at high optical powers and when signals are spectrally close to each other. FWM is one of the most dominant nonlinear effects in optical networks and a primary root of the capacity crunch [3]. Since nonlinear noise such as FWM is highly correlated with the signals themselves, nonlinearity can be mitigated by special treatment of the signals or by post-transmission digital signal processing (DSP) on the received signals [5,7].
On the other hand, coherent optical OFDM (CO-OFDM) [8] has attracted a lot of interest in optical fiber communications due to its simplified DSP units, high spectral-efficiency, flexibility, and tolerance to linear impairments. However, CO-OFDM's high peak-to-average power ratio (PAPR) makes it highly vulnerable to fiber-induced nonlinearities [8]. Attempts to combat nonlinearities in CO-OFDM have relied on deterministic nonlinearity compensators, which take advantage of the fact that light scattering within a fibre is a deterministic process. Key techniques for nonlinearity compensation (NLC) include mid-span optical phase conjugation (MS-OPC) [9], phase-conjugated subcarrier-coding (PCSC) [10], digital back-propagation (DBP) [11,12], and the inverse-Volterra series-transfer function (IVSTF) [13]. All of these techniques, however, result in modest improvements because the interaction between nonlinearity and random noise in the network, such as the noise originating from optical amplifiers, adds significant stochastic nonlinear distortion. Moreover, MS-OPC reduces the flexibility of an optically routed network, IVSTF presents a marginal performance benefit, and DBP is so complex that it forbids real-time implementation. To enhance the signal capacity of PCSC, modified versions have been proposed in [14,15], which, however, offer marginal performance benefits or still sacrifice spectral-efficiency. More drawbacks of these techniques are summarized in Section 2.
Machine learning is the combination of pattern recognition [16,17] and the theory that computers can learn without being programmed to perform specific tasks. The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the provided examples. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly. Machine learning has recently been under the spotlight for many photonic-related applications [18,19]. In long-haul CO-OFDM, several supervised and unsupervised machine learning algorithms (MLAs) have been harnessed, mainly to perform DSP-based fiber-induced nonlinearity compensation, including artificial neural networks (ANNs) [20-26], support vector machines (SVMs) [27-34], and machine learning clustering such as Fuzzy-logic C-means (FL or FLC) [35], K-means [35] or affinity propagation (AP) [36].
In this paper, we review the aforementioned MLAs for CO-OFDM, showing key results for single-polarization and standard SMF (SSMF)-based long-haul transmission and comparing them with full-step DBP (FS-DBP) and IVSTF. We also briefly discuss the operation of popular deterministic nonlinearity cancellation techniques and present, for the first time, a full computational complexity analysis among key MLAs, FS-DBP and IVSTF.

Drawbacks and Deficiencies of Benchmark Fiber Non-Linearity Compensation Schemes
MS-OPC: This technique inverts the spectrum at the mid-point of the fiber link and uses the inverted spectrum to propagate over the remaining half of the distance. By doing so, the non-linearity accumulated in the first half of the link is automatically cancelled by that accumulated in the second half [37]. The main drawback of this technique is that generating the "inverted spectrum" at mid-span is complex and its application is limited to long-range point-to-point transmission, because in an optically routed network the mid-span point is hard to identify [38]. MS-OPC also cannot compensate 2nd order chromatic dispersion (CD) and requires a symmetric dispersion map, which can be partially achieved using expensive Raman amplification [39]. Multi-stage OPCs have recently been proposed to enhance flexibility and performance in next-generation flex-grid optical networks [40]; however, this approach inevitably adds cost and complexity.
PCSC: This technique has a similar fundamental principle to MS-OPC and is a variation of the phase-conjugated twin-waves (PCTW) scheme for single-carrier optical systems [41], where a signal on one polarization is multiplexed with its phase-conjugated copy on the other polarization and, after transmission, special DSP cancels the nonlinearity by superimposing the signals from the two polarizations. In PCSC, however, a portion of the OFDM subcarriers (up to 50%) is transmitted together with its phase conjugates, which are used at the receiver to estimate the nonlinear distortions in the respective subcarriers and in other subcarriers that are not accompanied by phase-conjugated subcarriers or pilots (PCPs) [42]. The nonlinearity cancellation is very effective and adds little complexity to the whole system. However, this method sacrifices spectral efficiency for both single- and dual-polarization CO-OFDM. Modified PCTW-based approaches have been proposed in [14,15], either by modulating one of the conjugated signals with additional bits or by diplexing the twin waves. However, in [14] spectral-efficiency is still sacrificed (from 20% to 50%), while in [15] the performance enhancement is not impressive (a maximum of 1.2 dB in quality (Q)-factor).
DBP: As the name suggests, this is a DSP method that attempts to re-wind the non-linear channel. In this method, the optical channel is rigorously modelled numerically and the received signals are digitally back-propagated through a modelled 'virtual' channel with the help of the split-step Fourier (SSF) method, as shown in Figure 1 [12]. In Figure 1, the α, β, and γ terms refer to the loss, 2nd order CD and fiber non-linearity, respectively. In this way, part of the non-linearity can be cancelled. However, the problem associated with this method is that the channel cannot be modelled very accurately due to random parameters during transmission, such as random polarization mode dispersion (PMD) and the interaction between amplified spontaneous emission (ASE) noise from optical amplification and fiber non-linearity (also known as parametric noise amplification), which can only be characterized statistically. Additionally, its complexity is impractically high for real-time applications, since a huge number of computation steps is needed to undo the non-linear interactions. For the latter, it has been shown in [11,12] that a minimum of 40 steps/span is required to eliminate non-linear distortions (this is also called FS-DBP). It is worth mentioning that DBP has recently been modified to account for PMD [43], and a stochastic DBP was also designed to partly account for the ASE noise from optical amplification [44]. In [44], however, a maximum a posteriori principle was introduced with the help of Bayesian graphical models (a machine learning based approach) and combined with the deterministic DBP.
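The back-propagation idea described for DBP can be illustrated with a minimal split-step Fourier sketch. This is not the implementation of [11,12]: the function names, the sign convention of the linear operator, the symmetric ordering of the sub-steps and the fiber parameters below are illustrative assumptions, and noise, PMD and amplification are ignored.

```python
import numpy as np

def ssf_span(a, alpha, beta2, gamma, length, n_steps, dt):
    """Forward split-step Fourier propagation of field samples `a` over one span.

    Assumed units: alpha [1/km], beta2 [ps^2/km], gamma [1/(W km)],
    length [km], dt [ps]; `a` is in sqrt(W).
    """
    h = length / n_steps
    omega = 2 * np.pi * np.fft.fftfreq(a.size, d=dt)
    linear = np.exp((-alpha / 2 + 0.5j * beta2 * omega**2) * h)
    for _ in range(n_steps):
        a = np.fft.ifft(np.fft.fft(a) * linear)           # loss + CD in frequency domain
        a = a * np.exp(1j * gamma * np.abs(a)**2 * h)     # Kerr phase in time domain
    return a

def dbp_span(a, alpha, beta2, gamma, length, n_steps, dt):
    """Digital back-propagation: the same steps with negated parameters, in reverse order."""
    h = length / n_steps
    omega = 2 * np.pi * np.fft.fftfreq(a.size, d=dt)
    linear_inv = np.exp((alpha / 2 - 0.5j * beta2 * omega**2) * h)
    for _ in range(n_steps):
        a = a * np.exp(-1j * gamma * np.abs(a)**2 * h)    # undo Kerr phase
        a = np.fft.ifft(np.fft.fft(a) * linear_inv)       # undo loss + CD
    return a

# Hypothetical SSMF-like parameters; 40 steps per span as quoted for FS-DBP.
fiber = dict(alpha=0.2 * np.log(10) / 10, beta2=-21.7, gamma=1.3,
             length=80.0, n_steps=40, dt=25.0)
rng = np.random.default_rng(0)
tx = np.sqrt(0.5e-3) * (rng.standard_normal(256) + 1j * rng.standard_normal(256))
rx = ssf_span(tx, **fiber)
recovered = dbp_span(rx, **fiber)
```

In this noise-free toy channel the back-propagated field matches the transmitted one to numerical precision. Once ASE noise or PMD is added inside the forward model, the inversion is no longer exact, which is the limitation discussed above.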
IVSTF: The deterministic IVSTF algorithm (also simply called V-non-linear equalization, V-NLE) was introduced to relax the complexity of DBP by eliminating the need for the SSF method, which is computationally inefficient. The VSTF provides an analytical tool for representing the fiber non-linear effects, and the inverse channel is constructed from it in a similar way to DBP; in contrast to DBP, however, its complexity depends on the number of spans in the long-haul network and not on the fiber length. The compensation uses non-linear Kernel functions and, similarly to DBP, the CD and fiber non-linearity are compensated in the frequency and time domain, respectively. Typically, up to 2nd order kernels are used to account for 2nd order CD, since higher orders do not significantly improve the system performance in single-channel CO-OFDM [12,13,45]. In Figure 2, we show the recent implementation of IVSTF for CO-OFDM [13], which is typically placed in the time domain before OFDM demodulation and performs non-linearity compensation per span with parallel processing. Such an implementation offers significantly reduced complexity compared to FS-DBP and inherits some of the features of the hybrid time-and-frequency domain implementation, such as the absence of frequency aliasing and simple implementation.

Sources of Stochastic Noises
There are various sources of stochastic noise in an optical network that affect deterministic non-linearity compensation. The form of the stochastic noise from the three main sources in a long-haul coherent optical system is detailed below:

A. Advanced modulation formats: These have become a key ingredient in the design of modern optically routed networks, as a signal is modulated in amplitude, frequency and phase, enabling the information carrying capacity to be doubled. Such signal formats include high-order single-carrier formats (e.g., 16/64-QAM) or multi-carrier modulation schemes (e.g., OFDM) [8], which cope better with 'linear' channel distortions. Unfortunately, high-order signal formats are vulnerable to fiber non-linearities, to the point that, when multiple signals are transmitted spectrally close to each other, the resultant non-linear deterministic noise is so 'dense' that it appears stochastic [20,21]. In multi-carrier modulation schemes such as CO-OFDM, this phenomenon is more prominent due to the high PAPR and the fact that subcarriers are spectrally very close to each other, causing inter-carrier interference [8,20,21].

B. Optical Amplifiers: In long-range optical communications, multi-span amplification keeps the signal power levels high enough, but the excess noise of the amplifiers beats with the incoming signal. This noise originates from quantum mechanical uncertainties in the number of photons added at each amplifier and is ultimately limited by the Heisenberg uncertainty principle [3,7]. The amplifier excess noise can be interpreted as resulting from unavoidable spontaneous emission into its amplified mode (i.e., ASE). The interaction of ASE noise with fiber non-linearity is called parametric noise amplification (PNA).

C. Optical Fibers: Conventional fibers include SMFs, which generally exhibit stochastic noise from polarization rotation. The other form of stochastic noise is due to the interplay between linear CD and Kerr non-linearity when signal-noise interaction is considered.

Machine Learning for Fiber-Induced Non-Linear Noise Suppression in Coherent Optical Orthogonal Frequency Division Multiplexing (CO-OFDM)
MLAs have been widely applied to solve various problems in different areas, such as data mining, pattern recognition, medical imaging, etc., while in telecommunications they have covered a wide range of applications, such as channel modelling and prediction, equalization, demodulation/modulation recognition, and spectrum sensing [18]. MLAs are based on the cross-pollination of optimization theory, statistical learning, Kernel theory and algorithmics. MLAs can predict solutions to a problem when deterministic ones are not feasible. There are three main situations in which MLAs make good candidates:
1. when closed-form solutions do not exist, and trial and error methods are the only approaches to solving the problem at hand,
2. when the application requires real-time performance, and
3. when faster convergence rates and smaller errors are required in the optimization of large systems.
In CO-OFDM, MLAs have indicated that stochastic noise can be combated without knowledge of the fiber link parameters. Below we describe the structure of the two main supervised MLAs that have been applied in long-haul CO-OFDM as NLEs, i.e., the ANN and SVM algorithms. It should be noted that in both cases a non-repeating pseudorandom sequence with a length of $2^{19}-1$ was employed, generated with a period of approximately $2^{19937}-1$ (Mersenne twister) [46]. In contrast to the work reported in [47], which showed that ANNs will most likely overestimate the system performance when short pseudorandom sequences (with lengths of $2^{7}$ and $2^{15}$) are employed, the pseudorandom sequence adopted here has a much longer period. Furthermore, the training process is applied to a data-set of $2^{19}-1$ symbols which is not repeated over and over and is split into three separate classes: (i) an actual training set (dependent on the number of iterations or epochs for ANN); (ii) a validation set; and (iii) testing data, using 70%, 10%, and 20%, respectively. The ANN/SVM algorithm is iteratively updated until the error on the validation data set converges to a given rate, while different amounts of training data are tested, ranging from 1% up to 70%. As indicated in [20-22], the optimal amount of training data, corresponding to the maximum achievable Q-factor, is 10% for both quaternary phase-shift keying (QPSK) and 16-QAM formats, above which there is saturation (i.e., no Q-factor improvement was noticed).
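The 70%/10%/20% partition of a $2^{19}-1$ long data-set can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_indices`, the seed and the random shuffling of the indices are hypothetical choices, since the text does not say how the symbols were assigned to the three sets. The only library fact relied upon is that Python's standard `random` module uses the Mersenne Twister generator, whose period is $2^{19937}-1$.

```python
import random

def split_indices(n_symbols, train=0.70, val=0.10, seed=1):
    """Shuffle symbol indices and split them into training/validation/test sets."""
    rng = random.Random(seed)   # CPython's generator is the Mersenne Twister (MT19937)
    idx = list(range(n_symbols))
    rng.shuffle(idx)            # assumption: random assignment to the three sets
    n_train = int(train * n_symbols)
    n_val = int(val * n_symbols)
    # Whatever remains (about 20%) is kept for testing.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

N_SYMBOLS = 2**19 - 1           # data-set length quoted in the text
train_idx, val_idx, test_idx = split_indices(N_SYMBOLS)
```

To study smaller training fractions, such as the 10% reported as sufficient, one could simply truncate `train_idx` while leaving the validation and test sets untouched.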

Artificial Neural Network (ANN)
An ANN-NLE based on the multilayer perceptron (MLP) has been implemented in [21]. MLP-ANNs form a complex map with non-linear decision boundaries between the input and output spaces, which helps to invert the effects of non-linear distortion. ANN is an emerging technology applied in contemporary wireless communications for the reduction of OFDM-based non-linear distortion in power amplifiers. The ANN schematic diagram for m-QAM CO-OFDM is shown in Figure 3b; the ANN is placed directly after the fast Fourier transform (FFT) in the digital part of the CO-OFDM receiver (see Figure 3a). In summary, it comprises p sub-neural networks with a single hidden layer. Each sub-network is associated with one subcarrier k, and s(k) is the training vector. The received symbols, i, for each subcarrier x{k} are processed by the ANN neurons and subsequently multiplied by a weight value, $w_{k,i}$, for each subcarrier, after which the outputs of all subcarriers are summed. In the training stage, the minimum mean-square error (MMSE) algorithm determines the error signal and updates the weights, which are iteratively updated until the desired error value is reached, thus indicating the optimal match between the sub-network output and the transmitted CO-OFDM symbols. The error signal is given as $e(k) = s(k) - \hat{s}(k)$, where $\hat{s}(k)$ is calculated in terms of a non-linear activation function (NAF), $\varphi_{k,i}$, and is given by $\hat{s}(k) = \sum_{i} w_{k,i}\,\varphi_{k,i}$.
The chosen NAF is a differential sigmoid function and is a "split" complex NAF, where two conventional real-valued functions process the I-Q components, in contrast to our proposed approach, which processes the complex data simultaneously and thus accounts for the cross-information between real and imaginary data. The number of ANN neurons in every sub-neural network is equal to the number of points of the constellation. The 2D ANN is based on Riedmiller's resilient back-propagation (RR-BP) algorithm and performs an approximation to the global minimization achieved by steepest descent [20]. The training function updates the weights and bias values according to RR-BP, which minimizes the difference between the ANN output and the desired output by splitting the complex OFDM data into two real-valued data collections. The transfer functions for the hidden layer of the ANN are differentiable and similar to the hyperbolic tangent function. For the output layer, the linear function "purelin" was employed [20]. The MMSE in Figure 3a represents the subsystem that implements RR-BP to find the weights that minimize the error vector $E(n) = S(n) - \hat{S}(n)$, where $S(n)$ and $\hat{S}(n)$ are the desired and calculated output vectors, respectively. The weights are updated according to the steps described in Figure 4 by applying gradient descent on the cost function E(n) to reach a minimum. Finally, at the end-output of Figure 3b we introduced slack variables that allow some misclassified symbols but penalize them.
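To make the training loop described above more concrete, the following is a minimal sketch of a single sub-network: one hidden layer with tanh transfer functions, a linear output layer, complex symbols split into real-valued I and Q inputs, and weights updated with a simplified resilient back-propagation rule (sign-based step adaptation). It is not the implementation of [20,21]. The function names, step-size constants, initialization and the toy distortion (an amplitude-dependent phase rotation plus Gaussian noise on 16-QAM) are all assumptions made for illustration.

```python
import numpy as np

def train_mlp_rprop(x, s, n_hidden, epochs=200, seed=0):
    """Train a 2-input/2-output MLP (I and Q split) with a simplified Rprop rule."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([x.real, x.imag])       # received symbols, split I/Q
    S = np.column_stack([s.real, s.imag])       # desired (transmitted) symbols
    params = [0.5 * rng.standard_normal((2, n_hidden)), np.zeros(n_hidden),
              0.5 * rng.standard_normal((n_hidden, 2)), np.zeros(2)]
    steps = [np.full_like(p, 0.01) for p in params]   # per-weight step sizes
    prev = [np.zeros_like(p) for p in params]         # previous gradients
    losses = []
    for _ in range(epochs):
        W1, b1, W2, b2 = params
        H = np.tanh(X @ W1 + b1)                # hidden layer (tanh transfer function)
        err = H @ W2 + b2 - S                   # calculated minus desired output
        losses.append(float(np.mean(err**2)))   # mean-square error for this epoch
        dH = (err @ W2.T) * (1.0 - H**2)        # back-propagated error
        grads = [X.T @ dH, dH.sum(axis=0), H.T @ err, err.sum(axis=0)]
        for p, g, d, gp in zip(params, grads, steps, prev):
            change = g * gp
            d[change > 0] = np.minimum(d[change > 0] * 1.2, 0.1)   # same sign: grow step
            d[change < 0] = np.maximum(d[change < 0] * 0.5, 1e-6)  # sign flip: shrink step
            g = np.where(change < 0, 0.0, g)    # skip the update right after a sign flip
            p -= np.sign(g) * d                 # only the gradient sign is used
            gp[...] = g
    return params, losses

def mlp_equalize(params, x):
    """Apply the trained network to received complex symbols."""
    W1, b1, W2, b2 = params
    Y = np.tanh(np.column_stack([x.real, x.imag]) @ W1 + b1) @ W2 + b2
    return Y[:, 0] + 1j * Y[:, 1]

# Toy data for one subcarrier: 16-QAM with an amplitude-dependent phase rotation.
rng = np.random.default_rng(1)
levels = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(10.0)
s = rng.choice(levels, 2000) + 1j * rng.choice(levels, 2000)
noise = 0.05 * (rng.standard_normal(2000) + 1j * rng.standard_normal(2000))
x = s * np.exp(0.5j * np.abs(s)**2) + noise
params, losses = train_mlp_rprop(x, s, n_hidden=16)   # 16 neurons = constellation size
equalized = mlp_equalize(params, x)
```

The hidden-layer size follows the statement that the number of neurons equals the number of constellation points. In a full CO-OFDM receiver one such sub-network would be trained per subcarrier, and a validation set would be used to decide when to stop training.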

Support Vector Machine (SVM)
The SVM is placed in the same NLE block as in Figure 3a. Figure 5b shows the supervised support vector regressor (SVR)-NLE [28,30]. In contrast to other versions, such as that in [27], which only classify the data (i.e., support vector classifier, SVC), the SVR is considered more advanced because it performs both classification and regression; for simplicity it is called SVR. It is comprised of k hidden nodes (support vectors), with each node being associated with one subcarrier k. The procedure of SVR is similar to [28,30]. In summary, the received symbols for each subcarrier x{k} are processed by the NLE support vectors, which are scaled by weight values (Lagrange multipliers) for each subcarrier, $w_{k,i}$, after which the outputs for different k are summed. The distribution of noisy constellation points is learnt during an initial training process, similarly to ANN. Once the distribution is learnt, the detector can make decisions for new unknown observation symbols. A hyperplane is also obtained through approximation of a nonlinear function using a set of Kernels (sigmoid function) of the training dataset. SVR maps the data to a high-dimensional feature space, as shown in Figure 5a, using a nonlinear mapping $\phi$, and linear regression is then formulated by introducing the "ε-insensitive" loss function in the following form: $f(x, w) = \sum_{i=1}^{M} w_{k,i}\,\phi_{k,i}(x) + b$, where $f(x, w)$ is the target linear model, $\phi_{k,i}(x)$ denotes a set of nonlinear transformations of the input x, and b is the bias term. The number of vectors in every hidden node is equal to the number of points of the constellation; hence, for example, it is 4 for 4-QAM. The "ε-insensitive" loss function can be learnt through the training process by minimizing the error $\frac{1}{2}\|w\|^{2} + C\sum_{k}\left(\xi_{k} + \xi_{k}^{*}\right)$, where $\xi_{k}$ and $\xi_{k}^{*}$ are slack variables corresponding to the upper and lower bounds on the output function and C is the penalty parameter. Depending on how much loss is ignored, the latter equation can be approximated by the Lagrange loss function $L(y, f(x, w))$.
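The quantities named in this paragraph can be written down in a few lines. The sketch below is only a numerical illustration of the textbook ε-insensitive loss, of the model $f(x, w)$ and of a regularized objective with penalty parameter C. The function names and the scalar tanh "kernels" are hypothetical, and no SVR training (i.e., solving for the Lagrange multipliers) is attempted.

```python
import math

def eps_insensitive_loss(y, f, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y - f) - eps)

def svr_output(x, weights, kernels, bias):
    """f(x, w) = sum_i w_i * phi_i(x) + b for a list of nonlinear maps phi_i."""
    return sum(w * phi(x) for w, phi in zip(weights, kernels)) + bias

def svr_objective(weights, targets, outputs, eps, penalty_c):
    """0.5 * ||w||^2 + C * sum of eps-insensitive losses (the slack terms)."""
    slack = sum(eps_insensitive_loss(y, f, eps) for y, f in zip(targets, outputs))
    return 0.5 * sum(abs(w)**2 for w in weights) + penalty_c * slack

# Hypothetical sigmoid-type transformations of a scalar input.
kernels = [lambda x: math.tanh(0.5 * x), lambda x: math.tanh(2.0 * x)]
weights = [1.0, 2.0]
prediction = svr_output(0.0, weights, kernels, bias=0.3)
```

Errors smaller than ε do not contribute to the objective, which is what allows the regressor to ignore part of the loss while the penalty C controls how strongly larger deviations are punished.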
performs both classification and regression and for simplicity is called SVR. It is comprised of k hidden nodes (support vectors), with each node being associated with a subcarrier k. The procedure of SVR is similar to [28,30]. In summary, the received symbols for each subcarrier x{k} are processed by the NLE support vectors, which are scaled by weight values (Lagrange multipliers) $w_{k,i}$ for each subcarrier, after which the outputs for different k are summed. The distribution of noisy constellation points is learnt during an initial training process, similarly to ANN. Once the distribution is learnt, the detector can make decisions for new, unknown observation symbols. A hyperplane is also obtained through approximation of a nonlinear function using a set of kernels (sigmoid function) of the training dataset. SVR maps the data to a high-dimensional feature space, as shown in Figure 5a, using a nonlinear mapping φ, and then linear regression is formulated by introducing the "ε-insensitive" loss function, which in its standard form reads $L_\varepsilon(y, f(x, w)) = \max(0, |y - f(x, w)| - \varepsilon)$, where $f(x, w)$ is the regression output for input x and weights w.
It should be indicated that an unsupervised and a faster version of SVM were employed in [31,33] and [29], respectively. For the unsupervised SVM, the Sato's and Godard's-based constant modulus algorithm (CMA) cost functions were employed in the penalty term of an SVM-like cost function, which is minimized by iterative re-weighted least squares (IRWLS) [23,25]. Figure 6a depicts such an SVM algorithm in CO-OFDM for blind-NLE (BNLE) operation, and Figure 6b shows the corresponding IRWLS pseudocode. In this algorithm, the received OFDM symbols for each subcarrier x{k} are processed by the BNLE and scaled by the vector of filter coefficients (weights) $w_{k,i}$ for each subcarrier k (where i is the symbol) by means of a hybrid maximum likelihood and recursive least-square process [25]. In the IRWLS steps described in Figure 6b, w refers to the weights and y to the received symbols (reference sequence), while $L_\varepsilon(e_i)$ is the loss function, C is a penalty regulation parameter, $e_i$ is the penalization term for the ith symbol, and $N_s$ is the total number of subcarriers. Finally, $R_s$ and $R_P$ refer to the Sato's and Godard's constants, respectively.
For the fast version of SVM, a Newton SVM was implemented with an architecture as shown in Figure 6c. Newton SVM suppresses the input space features for a nonlinear programming formulation of supervised SVM classifiers. This stand-alone method can handle classification problems in very high dimensional spaces. In this algorithm, a Newton-based problem is solved via the Lagrangian multipliers of an SVM-based classifier, resulting in an effective iterative scheme [29] consisting of only a few steps. To process a high-order modulation format (and thus constellation mapper) with a large dimensional input, a fast finite Newton approach was considered. For the classification problem, this approach searches for a unique Lagrangian-based global minimum by solving a system of nonlinear equations a finite number of times. The aforementioned Newton-based algorithm steps and related equations are depicted in Figure 6d, which involves an Armijo step-size [29]. In the equations in Figure 6d, column vectors are considered except if transposed to a row vector (using a T superscript). Moreover, as depicted in Figure 6c, $\|x\|$ denotes the 2-norm of a vector x, while A is the matrix related to an OFDM received signal incorporating m complex symbols in the n-dimensional real space $\mathbb{R}^n$, where n expresses the modulation order level (i.e., 4 for 16-QAM).
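To make the "ε-insensitive" loss used by SVR concrete, the following minimal sketch evaluates it in its standard form, max(0, |y − f| − ε), on a few complex symbols. The function name, the symbol values and the choice ε = 0.1 are illustrative assumptions and are not taken from the reviewed experiments.

```python
def eps_insensitive_loss(y, f, eps):
    """Standard epsilon-insensitive loss: max(0, |y - f| - eps).

    Works for real or complex values; errors smaller than eps cost nothing.
    """
    return max(0.0, abs(y - f) - eps)


# Hypothetical reference symbols and equalizer outputs (illustration only).
targets = [1 + 1j, -1 + 1j, -1 - 1j]
outputs = [1.05 + 0.98j, -0.6 + 1.2j, -1 - 1j]

losses = [eps_insensitive_loss(y, f, eps=0.1) for y, f in zip(targets, outputs)]
```

Only the second symbol, whose error magnitude exceeds ε, contributes to the loss; this is what makes the regression insensitive to small deviations around each constellation point.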

Clustering
In this section, we briefly describe the structure of the three main unsupervised machine-learning-based clustering algorithms that have been applied in long-haul CO-OFDM as BNLEs, i.e., the K-means, fuzzy-logic C-means (named here as FL or FLC), and affinity propagation (named here as AP) clustering algorithms.
K-means: This is the most common clustering algorithm. It is based on an iterative data-partitioning process that assigns n observations to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts [35]. The algorithm proceeds as follows:
1. Choose k initial cluster centers (centroids).
2. Compute the point-to-cluster-centroid distances of all observations to each centroid, and assign each observation to the cluster with the closest centroid.
3. Compute the average of the observations in each cluster to obtain k new centroid locations.
4. Repeat steps 2 through 3 until cluster assignments do not change, or the maximum number of iterations is reached.
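The four K-means steps above can be sketched as follows for complex-valued received symbols. This is a minimal illustration, not the implementation used in the reviewed experiments: the function name `kmeans`, the toy QPSK points and the choice of the ideal constellation as initial centroids are assumptions.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd-style K-means on complex symbols; returns (labels, centroids)."""
    centroids = list(centroids)          # step 1: initial cluster centers
    labels = None
    for _ in range(max_iter):
        # Step 2: distance of every point to every centroid, nearest one wins.
        new_labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        if new_labels == labels:         # step 4: assignments no longer change
            break
        labels = new_labels
        # Step 3: each centroid becomes the average of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


# Hypothetical noisy QPSK symbols: each ideal point plus three small offsets.
ideal = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
offsets = [0.1 + 0.05j, -0.08 + 0.02j, 0.03 - 0.1j]
points = [c + o for c in ideal for o in offsets]

labels, centroids = kmeans(points, ideal)
```

Because every symbol is assigned to exactly one centroid, this is the hard (exclusive) clustering that the soft-clustering algorithms below relax.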
FL: This belongs to the probabilistic machine learning algorithms and permits the data membership degree (MD) of the symbols to fluctuate while they are allocated into several clusters, as shown in Figure 7, by minimizing an objective function that, in the standard fuzzy C-means form, reads $J_m = \sum_{i} \sum_{j=1}^{L} \mu_{ij}^{m} \| t_i - c_j \|^2$, with the outer sum running over all symbols. In this objective function, the terms m, L, N, and R correspond to the "fuzzy partition matrix exponent", the clusters, and the total number of subcarriers and symbols, respectively. The role of the "fuzzy partition matrix exponent" is to adjust the grade of overlapping between clusters. Here, $t_i$ is the ith symbol, $c_j$ is the center of the jth cluster, and $\mu_{ij}$ is the MD of $t_i$ in the jth cluster. FL is processed in 5 steps: 1. Enter the number of targeted clusters; 2. Initiate the cluster MD, $\mu_{ij}$; 3. Estimate the center per cluster, which in standard fuzzy C-means is $c_j = \sum_i \mu_{ij}^m t_i / \sum_i \mu_{ij}^m$; the remaining steps, as in standard fuzzy C-means, update the MD from the new centers and repeat until the objective function converges.
AP: Every symbol in AP is a potential exemplar: each symbol is viewed as a node that recursively transmits real-valued messages (separately for amplitude and phase) along the edges of the NLE network until a good set of exemplars and corresponding clusters emerges [36]. 'Messages' are updated by simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one symbol has for choosing another symbol as its exemplar. Let $x_1$ through $x_n$ be a set of complex data (symbols), with no assumptions made about their internal structure, and let S be a function that quantifies the similarity between any two symbols, such that $S(x_i, x_j) > S(x_i, x_k)$ if $x_i$ is more similar to $x_j$ than to $x_k$. For this example, the negative squared distance of two symbols was used, i.e., for points $x_i$ and $x_k$, $S(i, k) = -\| x_i - x_k \|^2$. The diagonal of S (i.e., S(i,i)) is particularly important, as it represents the input preference, meaning how likely a particular input is to become an exemplar. When this is set to the same value for all inputs, it controls how many classes the algorithm can produce. A value close to the minimum possible similarity produces fewer classes, whereas a value close to or larger than the maximum possible similarity produces many classes (it is initialized to the median similarity of all pairs of inputs). AP proceeds by alternating two message-passing steps to update the 'responsibility, R(i, k)' and 'availability, A(i, k)' matrices, where R quantifies how "well-suited" $x_k$ is to serve as the exemplar for $x_i$ compared to other candidate exemplars, while A shows how "appropriate" it would be for $x_i$ to pick $x_k$ as its exemplar, taking into account other points' preference. R and A are initialized to zero, being viewed as log-probability tables, and are then iteratively updated by:
$R(i, k) \leftarrow S(i, k) - \max_{k' \neq k} \{ A(i, k') + S(i, k') \}$,
$A(i, k) \leftarrow \min \{ 0, R(k, k) + \sum_{i' \notin \{i, k\}} \max \{ 0, R(i', k) \} \}$ for $i \neq k$.
The exemplars are extracted from the final updated matrices as the symbols for which 'responsibility + availability' is positive. Figure 8 shows the AP iterative result of R and A for a QPSK middle-channel in WDM CO-OFDM at 3200 km for an optimum launched optical power (LOP) per channel of −5 dBm, where 13 iterations are required for convergence [36]. AP is considered an advanced soft-clustering algorithm.
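The AP message passing described above can be sketched as follows. This is a minimal, unoptimized illustration rather than the implementation of [36]: the function name, the damping factor of 0.5, the fixed iteration count, the toy symbols, and the self-availability update A(k, k) = Σ max(0, R(i', k)) (taken from the standard formulation of affinity propagation) are assumptions. The similarity is the negative squared distance and the preference is the median pairwise similarity, as in the text.

```python
import statistics


def affinity_propagation(points, damping=0.5, n_iter=200):
    """Return (exemplar indices, exemplar index assigned to each point)."""
    n = len(points)
    # Similarity S(i, k) = -|x_i - x_k|^2; diagonal = preference (median).
    S = [[-abs(points[i] - points[k]) ** 2 for k in range(n)] for i in range(n)]
    pref = statistics.median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = pref
    R = [[0.0] * n for _ in range(n)]    # responsibilities
    A = [[0.0] * n for _ in range(n)]    # availabilities
    for _ in range(n_iter):
        # Responsibility: R(i,k) <- S(i,k) - max_{k' != k} [A(i,k') + S(i,k')]
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: A(i,k) <- min(0, R(k,k) + sum_{i' not in {i,k}} max(0, R(i',k)))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            col_sum = sum(pos) - pos[k]              # sum over i' != k
            for i in range(n):
                if i == k:
                    new = col_sum                    # assumed self-availability rule
                else:
                    new = min(0.0, R[k][k] + col_sum - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Exemplars: points whose responsibility + availability is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], []
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Hypothetical symbols forming two groups of three (illustration only).
symbols = [1 + 2j, 1 + 4j, 1 + 0j, 4 + 2j, 4 + 4j, 4 + 0j]
exemplars, labels = affinity_propagation(symbols)
```

Unlike K-means, the number of clusters is not passed in; it emerges from the preference value on the diagonal of S. The triple loop costs O(n³) per iteration and is kept only for readability.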
Experimental Setup and Performance of Machine Learning Algorithm in CO-OFDM
The experimental setup and parameters are shown in Figure 9 and Table 1 for both single-channel and WDM CO-OFDM at 2000 km and 3200 km of transmission, respectively, using EDFA-based recirculating loops and a standard single-mode fiber (SSMF). The set-up, procedures and parameters are identical to [21,22,28,31,33,35,36]. In summary, 400 OFDM symbols (20.48 ns length) were generated using a 512-point IFFT on 210 QPSK/16-QAM subcarriers. To eliminate inter-symbol interference from linear effects, a cyclic prefix (CP) of 2% was included. For the clustering algorithms, FS-DBP, V-NLE and without (w/o) NLE, the raw bit-rates were ~20 Gb/s (QPSK) and ~40 Gb/s (16-QAM). However, for supervised machine learning such as SVR and ANN (as indicated in Table 1), 10% of the data are sacrificed for training in both the single- and multi-channel cases, above which the quality (Q)-factor saturates, as depicted in Figure 10 for WDM CO-OFDM. Such training is performed separately for each LOP, requiring roughly the same amount of training data. At the receiver side, a coherent optical homodyne receiver was used, while the offline OFDM demodulator included timing synchronization, frequency offset compensation (due to the receiver local oscillator, LO), channel estimation and equalization with the help of an initial training sequence, as well as IQ imbalance and CD compensation using an overlapped frequency domain equalizer. For the WDM CO-OFDM system, 100 kHz-linewidth distributed feedback lasers (DFBs) on a 100 GHz grid were used, and the noise-loading channels were inserted using an ASE source and a wavelength selective switch (WSS). In the inset spectrum of Figure 10 (as well as in the inset of Figure 9), the received WDM lines are shown. The performance of the NLEs was assessed by the total subcarriers' bit-error-rate (BER) and Q-factor ($Q = 20\log_{10}[\sqrt{2}\,\mathrm{erfc}^{-1}(2\,\mathrm{BER})]$) measurements, averaging over 10 recorded traces (~$10^6$ bits) by error counting (hard-decision decoding, HDD).
In Figure 11, the Q-factor against the LOP per channel is plotted for various machine learning and deterministic NLEs for the QPSK middle-channel in WDM CO-OFDM. It is shown that AP tackles non-linearities more effectively than any other algorithm under test, since it compensates both the PNA and the accumulated inter-subcarrier FWM, which appears random (due to the impact of a high PAPR). This is corroborated in Figure 11b, where the Q-factor at the optimum LOP is plotted for the middle subcarriers, which suffer the most, mainly from inter-subcarrier FWM [48] (and secondarily from inter-subcarrier cross-phase modulation, XPM). However, some of the inter-channel nonlinearities, which are not as strong as the intra-channel ones (i.e., inter-subcarrier nonlinear distortions), are also compensated better by AP. In Figure 11f, the clear nonlinear decisions (soft/overlapping clustering), compared to the K-means hard decisions (exclusive clustering), are also depicted for a received QPSK constellation diagram at −7 dBm of LOP. In Figure 11c-e we show the performance of unsupervised, supervised and fast machine learning algorithms, i.e., Sato/Godard-CMA BNLEs, ANN/SVR, and Fast-Newton-SVM (F-SVM), respectively. The results in Figure 11c-e indicate that the AP clustering algorithm has the best performance in QPSK WDM CO-OFDM, followed by SVR and FL. Finally, Figure 11 shows that at low power the adopted machine learning algorithms provide a Q-factor improvement over the deterministic algorithms and linear equalization. This is due to the ability of machine learning NLEs to partially tackle the ASE noise accumulated from the concatenated optical amplifiers. This claim is supported by the negligible noise induced by the electrical components and the digital-to-analogue and analogue-to-digital converters (error vector magnitude below 7% for optical back-to-back).
In Figure 12, results are depicted for single-channel 16-QAM CO-OFDM using the same algorithms. Similarly, AP shows the best performance, reaching almost 15 dB in Q-factor and outperforming the FS-DBP. This means that the strong nonlinear phase noise of 16-QAM can be successfully tackled by AP, which seems to effectively compensate intra-channel deterministic and stochastic nonlinearities. However, it should be noted that all algorithms present poor performance at low powers due to the stronger PNA and ASE noise. It should also be noted that an ANN was implemented in [20,21]; however, its bit-rate and training/testing conditions were different from those of SVR, so it is not included in the comparison here. Nevertheless, as shown in [20,21], its performance is anticipated to be worse than that of SVR and, consequently, than that of AP.
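As a worked example of the BER-to-Q-factor conversion quoted above, Q = 20·log10[√2·erfc⁻¹(2·BER)], the sketch below uses only the Python standard library. It relies on the identity √2·erfc⁻¹(2·BER) = −Φ⁻¹(BER), where Φ⁻¹ is the standard normal quantile function; the function name and the example value are illustrative assumptions.

```python
import math
from statistics import NormalDist


def q_factor_db(ber):
    """Q-factor in dB from a bit-error rate, valid for 0 < BER < 0.5.

    Since BER = 0.5 * erfc(Q / sqrt(2)) = Phi(-Q), the linear Q equals
    -Phi^{-1}(BER), which is the same as sqrt(2) * erfcinv(2 * BER).
    """
    q_linear = -NormalDist().inv_cdf(ber)
    return 20 * math.log10(q_linear)


# Round trip: a linear Q of 3 corresponds to this BER under the same model.
ber_example = 0.5 * math.erfc(3 / math.sqrt(2))
q_example_db = q_factor_db(ber_example)
```

The conversion is monotonic, so a lower counted BER always maps to a higher Q-factor in dB.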

Complexity Analysis
In order to evaluate and compare the complexity of the different NLEs under test, we should first note that the nature of the deterministic equalizers based on DBP and IVSTF is essentially different from that of machine learning-based NLEs such as ANN and SVM. On the one hand, since DBP and IVSTF equalizers require an inversion of the propagation model, they are dependent on several link and bandwidth parameters. In particular, they depend on the number of spans, on the oversampling parameter and, in the case of DBP, on the chosen spatial step. The complexity of these equalizers, though, does not depend a priori on other signal parameters such as the modulation format. Machine learning-based NLEs, on the contrary, present a complexity that does not depend on the link parameters but on some signal parameters, for instance the number of constellation points and the number of OFDM subcarriers. Hence, the computation of the complexity of each NLE family requires special attention. The following subsections deal with the complexity analysis of deterministic and machine learning approaches. In addition, we shall discuss the complexity of some clustering algorithms. For the sake of clarity in the derivation of the complexity expressions, Table 2 lists the employed parameters and their associated variable names. We shall consider the number of operations to process an OFDM symbol. Hence, in order to obtain the number of operations per bit, the number of operations should be divided by the number of bits per OFDM symbol.
Both DBP- and IVSTF-based NLEs can be implemented fully in the time domain, fully in the frequency domain, or in the hybrid time-frequency domain [13]. In this work, we assume the latter, as this is a commonly adopted approach in both DBP- and IVSTF-based NLEs. As indicated in the previous section, in the case of DBP, the most widely employed method is the SSF with inverted parameter values. In this method, the CD is simulated in the frequency domain whereas the non-linear Kerr effect is simulated in the time domain. On the other hand, the IVSTF NLEs implemented in the hybrid time-frequency domain make use of the simpler calculation of high-dimensional convolution in the frequency domain. Both methods, therefore, require multiple conversions from the frequency domain to the time domain and vice versa. This conversion is performed using FFT/IFFT pairs that operate on data blocks of size $N_{block} = K \cdot N_{signal}$, since the data has to be oversampled with an oversampling constant K in order to account for the out-of-band non-linear components. When $N_{block}$ is a power of two, the split-radix algorithm is the implementation showing the lowest complexity [13], requiring the following number of real-valued floating-point operations (FLOPs): $N_{FFT} = 4 N_{block} \log_2 N_{block} - 6 N_{block} + 8$. (1)

Complexity of NLEs Based on Digital Back-Propagation
Each DBP segment requires a time-to-frequency and a frequency-to-time conversion, in addition to the operations to implement both the non-linear and linear compensation stages. The number of FLOPs for each DBP step is then $N_{linear} + N_{non-linear}$, where $N_{linear}$ and $N_{non-linear}$ are given by $8 N_{block} \log_2 N_{block} - 6 N_{block} + 16$ and $18 N_{block}$, respectively. On the other hand, the number of DBP steps is, assuming uniform step length, the number of spans multiplied by the number of steps per span. The total number of FLOPs required by the DBP is then given by the number of DBP steps multiplied by the number of FLOPs per step.
The procedure to calculate the number of FLOPs required by the IVSTF-based NLE is similar to that of the DBP-based NLE, since multiple time-to-frequency and frequency-to-time conversions are required. The number of operations, however, does not depend on the link spatial discretization but on the number of spans. Looking at the block diagram of Figure 2, we can observe that the IVSTF-based NLE requires a linear compensation block and a non-linear equalization block per span, that is, $N_{span}$ blocks of each type, so the total number of FLOPs scales with $N_{span}$. In the linear block, $N_{prod}$, the number of operations for linear equalization, is $6 N_{block}$, as it requires $N_{block}$ complex multiplications (6 FLOPs each). In the non-linear compensation block, on the other hand, the square operation can be seen as an element-by-element multiplication of $N_{block}$ data and, consequently, also requires $6 N_{block}$ FLOPs. The total number of operations for the IVSTF-based NLE then follows by summing the FLOPs of the linear and non-linear blocks over all spans.
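The DBP expressions quoted above can be checked numerically. The sketch below assumes the standard split-radix count of 4N·log2(N) − 6N + 8 real FLOPs per FFT, and shows that the quoted N_linear equals one FFT/IFFT pair plus N_block complex multiplications at 6 FLOPs each. Taking the per-step cost as N_linear + N_non-linear is also an assumption, as are the function names and the example values for the oversampling factor and span count; the 40 steps per span and the 512-point FFT are the values mentioned in the text.

```python
def fft_flops(n_block):
    """Real FLOPs of a split-radix FFT of power-of-two length (assumed formula)."""
    if n_block < 2 or n_block & (n_block - 1):
        raise ValueError("n_block must be a power of two")
    log2_n = n_block.bit_length() - 1
    return 4 * n_block * log2_n - 6 * n_block + 8


def n_linear(n_block):
    """Linear DBP stage: FFT + IFFT + one complex multiply per sample."""
    return 2 * fft_flops(n_block) + 6 * n_block


def n_nonlinear(n_block):
    """Non-linear DBP stage, as quoted in the text."""
    return 18 * n_block


def dbp_flops(n_signal, oversampling, n_span, steps_per_span):
    """Total DBP FLOPs per processed block, assuming uniform step length."""
    n_block = oversampling * n_signal
    per_step = n_linear(n_block) + n_nonlinear(n_block)
    return n_span * steps_per_span * per_step


# Hypothetical link: oversampling of 2 and 25 spans are illustrative values.
total = dbp_flops(n_signal=512, oversampling=2, n_span=25, steps_per_span=40)
```

Because the per-step cost is multiplied by the number of steps, the total grows linearly with link length for a fixed step size, which is why full-step DBP dominates the comparison for long links.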

Complexity Analysis of ANN and SVM-Based NLEs
As mentioned, in contrast to DBP and IVSTF-based equalizers, the complexity of ANN and SVM-based equalizers depends on the parameters of the modulation format.In the particular case of OFDM, it depends on the number of data subcarriers (N SC ) and the number of bits coded in each subcarrier (M).

Complexity of ANN
ANNs mimic natural neural systems, making use of massively parallel low-complexity nodes. Therefore, it is no surprise that their implementation requires fewer FLOPs than other approaches. After learning, and assuming that the non-linear activation function is implemented using a look-up table, the number of operations performed for processing each OFDM symbol can be obtained from $N_{SC}$ and M.

Complexity of SVM
In order to calculate the complexity of implementing the SVM-NLE, we can split the equalization process into two steps. The first is the ML-RLS estimation, whose complexity does not depend on whether the equalization is blind. The second is the IRWLS, whose complexity depends on the equalization type, i.e., blind or non-blind, and on the chosen cost function, in our case Sato's or Godard's cost function. We consider that the NLE algorithm uses an $N_w$-order filter operating on data blocks of $N_s$ samples. The complexity of each iteration within the ML-RLS estimation can be calculated step by step. The first step, the computation of $a_i$, requires $8N_w + 3$ operations. The second step, where $w_s$ is calculated using the least-squares method, requires $(64/3)N_s^3 + 18N_s^2$ operations. The cost of updating the w vector, carried out in the third step, depends on the particular implementation. For the supervised SVM, the number of operations is $4N_w + 6N_s - 1$, whereas for Sato's and Godard's based blind implementations the number of operations is $3N_s^2 + 2N_s$ and $(3p + 2)N_s^2 + (p + 2)N_s$ (where p represents the power of the norm), respectively.
Step four and step five a priori do not affect the FLOP count since they do not require any extra arithmetic manipulation.
It is important to note that in all the studied cases the computational complexity is $O(N_s^3)$, with the least-squares calculation being the limiting stage. The computational complexity of the cost function, on the other hand, is $O(N_s)$ for non-blind equalization and $O(N_s^2)$ for both Sato's and Godard's cost functions; consequently, blind equalization does not entail a significant increase in computational cost compared to the supervised approach.
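To illustrate why the least-squares step dominates, the following sketch adds up the three per-step counts quoted above for the supervised, Sato and Godard variants. Summing the three steps into a single per-iteration figure is an assumption made for illustration, as are the function name and the example values of N_w, N_s and p.

```python
def svm_iteration_flops(n_w, n_s, variant="supervised", p=2):
    """Per-iteration operation count built from the three quoted step costs."""
    step1 = 8 * n_w + 3                        # computation of a_i
    step2 = 64 * n_s ** 3 / 3 + 18 * n_s ** 2  # least-squares solution for w_s
    if variant == "supervised":
        step3 = 4 * n_w + 6 * n_s - 1
    elif variant == "sato":
        step3 = 3 * n_s ** 2 + 2 * n_s
    elif variant == "godard":
        step3 = (3 * p + 2) * n_s ** 2 + (p + 2) * n_s
    else:
        raise ValueError("unknown variant: " + variant)
    return step1 + step2 + step3


# Hypothetical sizes (illustration only): 10 filter taps, blocks of 30 samples.
counts = {v: svm_iteration_flops(10, 30, v) for v in ("supervised", "sato", "godard")}
overhead = counts["godard"] / counts["supervised"] - 1
```

For these example sizes the blind variants add only about one percent to the supervised count, consistent with the cubic least-squares term being the limiting stage.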

Complexity of Clustering Algorithms
The complexity of the clustering algorithms depends on how many clusters are to be found. Since in most cases each cluster corresponds to a constellation point, the complexity will depend on the modulation format. However, in contrast to ANN, the complexity of the clustering algorithms also depends on the number of samples employed for clustering and on their dispersion, which significantly impacts the convergence of the algorithm. Therefore, it is not possible to deterministically compute the number of required operations. A commonly adopted criterion is the worst-case scenario, which represents a pessimistic upper bound on the complexity.

K-means
For K-means clustering, the number of operations in the worst-case scenario, if Lloyd's algorithm is used, is given by $O(nkdi)$, where n is the number of data points to be clustered, k is the number of clusters ($2^M$), d is the dimensionality (in our case, 2), and $i = 2^{\Omega(\sqrt{n})}$ is the number of required iterations.

Affinity Propagation
In the case of AP, the number of operations is $O(ikn^2)$, where i, k and n are the number of iterations, clusters, and elements, respectively. While the numbers of clusters and elements are trivial to determine, the number of iterations i is difficult to predict due to the complexity of the algorithm and its interplay with the dispersion structure of the data.

Impact of High-Order Modulation Format Levels on Computational Complexity
In this section, we investigate the impact of high-order modulation formats (up to 128-QAM) on the computational complexity of the most widely adopted machine learning algorithm, ANN. We also provide a comparison in Table 3 between full-step DBP, IVSTF and ANN, for two systems with different total lengths and different numbers of constellation points. Since the ANN is independent of the link parameters, the ANN complexity is the same for both systems. From Table 3, it is evident that DBP is the technique with the highest computational complexity, exceeding that of ANN for both low- and high-order modulation formats. However, when comparing IVSTF with ANN for large numbers of constellation points, it can be seen that for System A (transmission at 2000 km) IVSTF has lower complexity than ANN for M = 64 and above, while for System B (transmission at 3200 km) IVSTF has lower complexity than ANN only for M = 128. We note that the calculations assume 40 steps per span for the full-step DBP. With regard to the number of subcarriers, we considered 512 for IVSTF and DBP (as this is used to calculate the total occupied band) and 210 for the ANN (because only the subcarriers carrying data are processed by the NLEs).

Conclusions
We reviewed the most commonly used machine learning algorithms for receiver-based NLE in CO-OFDM, which include both unsupervised and supervised algorithmic designs for blind and non-blind NLE processing, respectively. We identified the main sources of noise in a coherent fiber-optic telecommunication system and analyzed the limitations of benchmark deterministic solutions (e.g., MS-OPC, PCSC, DBP), highlighting that their main obstacle is the inability to tackle stochastic non-linear distortions in long-haul CO-OFDM such as the PNA effect. We showed the performance of machine learning-based NLEs over 2000 km and 3200 km of SSMF transmission for a 16-QAM (∼40 Gb/s) and a QPSK-WDM (∼20 Gb/s middle-channel) CO-OFDM system, respectively. The machine learning algorithms included clustering algorithms such as K-means (deterministic hard-clustering), FL (probabilistic soft-clustering) and AP (advanced soft-clustering); supervised machine learning algorithms such as ANN (classification) and SVR (classification and regression); and unsupervised and fast SVMs. These were compared with the deterministic benchmark approaches of FS-DBP and IVSTF. We also presented, for the first time, a computational complexity analysis among machine learning and deterministic algorithms. Our review indicated that AP offers the best performance for both 16-QAM and QPSK-WDM CO-OFDM, at the cost, however, of higher computational complexity than other supervised/unsupervised machine learning algorithms. Machine learning reveals a much lower complexity and a significant performance benefit over deterministic approaches, especially for QPSK-WDM, due to its ability to tackle both intra- and inter-channel non-linearities (including inter-subcarrier non-linear crosstalk distortions), mainly on the middle subcarriers, which suffer the most from inter-subcarrier FWM. We believe that, due to their impressive performance and low complexity, machine learning algorithms could play a key role in long-haul coherent optical communications.

Figure 2. Inverse-Volterra series-transfer function (IVSTF) block diagram for coherent optical orthogonal frequency division multiplexing (CO-OFDM), where m is the number of spans in a long-haul network and k is the kernel order [13]. (I)FFT: (Inverse) fast Fourier transform.

Table 1. Transmission and transceiver OFDM parameters.

Table 2. Parameters employed in the calculation of complexity.

Table 3. Computational complexity comparison between full-step DBP, IVSTF and ANN for different modulation format orders (M) and transmission distances.