Nonlinear Impairment Compensation Using Transfer Learning-Assisted Convolutional Bidirectional Long Short-Term Memory Neural Network for Coherent Optical Communication Systems

Abstract: By combining the nonlinear impairment features derived from first-order perturbation theory, we propose a nonlinear impairment compensation (NLC) scheme based on a transfer learning-assisted convolutional bidirectional long short-term memory (CNN-BiLSTM) neural network structure. By considering the correlation of the nonlinear impairment between the current symbol and its preceding and succeeding consecutive adjacent symbols, and by integrating the multidimensional feature extraction and time memory characteristics of CNN-BiLSTM, the nonlinear impairment information contained in the input feature can be fully utilized to accurately predict the nonlinear impairment, yielding a significant compensation effect. Meanwhile, transfer learning (TL) is introduced to greatly reduce the complexity of the scheme while maintaining high compensation performance. To verify the effectiveness of the proposed scheme, we construct single-channel (SC) and 5-channel 28 GBaud polarization division multiplexing 16 quadrature amplitude modulation (PDM-16QAM)/85 GBaud PDM-64QAM simulation systems, and SC and 3-channel 28 GBaud PDM-16QAM experimental systems. The experimental results show that, compared with simple recurrent neural network (SRNN) NLC and digital back propagation with 20 steps per span (DBP20StPs), the Q-factor gain of our scheme at the optimal launch power is about 1 dB and 1.7 dB in the SC system, and about 1.1 dB and 1.5 dB in the 3-channel system, respectively. Notably, by applying TL to the simulation and experimental systems, our scheme achieves compensation performance comparable to or better than retraining at various launch powers while using only 5% of the training samples.


Introduction
The optical transmission network is the underlying infrastructure of the global communication network. Nowadays, with the rapid increase in network traffic, the optical transmission network is evolving towards a high-speed, long-distance and large-capacity dynamic optical transmission network [1]. However, the transmission capacity and distance of current optical fiber communication systems are jointly limited by the linear and nonlinear impairments of the optical fiber, especially the nonlinear impairment. Increasing the fiber input power causes a significant accumulation of the Kerr nonlinear effect in the fiber channel, resulting in serious nonlinear phase modulation and an obvious degradation in signal quality [2]. At present, the linear impairment of optical fiber can be effectively compensated, while the nonlinear impairment has become the main obstacle to the further development of optical fiber communication because it is difficult to compensate [3]. Therefore, effective compensation of the nonlinear impairment is considered crucial to improving the capacity and maximum transmission distance of optical fiber communication systems.
To solve the problems caused by nonlinear impairment, researchers have proposed a series of classical NLC algorithms, mainly including digital back propagation (DBP) [4], Volterra series-based nonlinear equalizers (VSNE) [5,6], and perturbation-based nonlinear equalization and its improvements [7][8][9]. Among these approaches, both DBP and VSNE involve many fast Fourier transform (FFT) operations. Moreover, as the transmission distance increases, their complexity must be further increased to maintain the compensation performance. Although perturbation-based nonlinear equalization requires only one sample per symbol in the compensation process, the calculation of the nonlinear perturbation coefficients limits the flexibility of the algorithm. All of the above algorithms perform NLC under the condition that the nonlinear impairment model is known. However, it is often difficult to obtain accurate link parameters in complex dynamic optical networks.
With the rapid development of machine learning, neural network (NN)-based compensation algorithms have been widely used in the NLC field for coherent optical communication systems, taking advantage of their strong nonlinear fitting ability and fast calculation speed. Compared with classical compensation algorithms, an NN-based NLC algorithm can adaptively evaluate the degree of nonlinear impairment by learning from a data set composed of nonlinearly impaired symbols and obtain an NN model that is independent of the system parameters, thereby achieving NLC. Moreover, NN-based compensation algorithms can provide almost the same or even better compensation performance at a lower complexity. One approach is to use NNs to emulate the linear and nonlinear steps of the traditional DBP algorithm. Some researchers have proposed the learned DBP (LDBP) algorithm and its improvements [10][11][12][13], which perform better than DBP. Nevertheless, the LDBP algorithm requires at least two samples per symbol in the compensation process, which adds complexity. A second approach treats NNs as black-box models that learn directly from the received data to realize NLC. For example, the convolutional neural network (CNN) [14,15], bidirectional long short-term memory (BiLSTM) neural network [16], bidirectional gated recurrent unit (Bi-GRU) neural network [17] and center-oriented long short-term memory (Co-LSTM) neural network [18] have been used successively. The third approach combines perturbation theory with NNs, using the triplet feature vector, which characterizes the intra-channel cross-phase modulation (IXPM) and intra-channel four-wave mixing (IFWM), to provide the nonlinear impairment features. The triplet feature vector can be used as the input of a deep neural network (DNN) [19,20], complex-valued DNN [21], CNN [22], simple recurrent neural network (SRNN) [23] and embedded bidirectional long short-term memory (BiLSTM) neural network [24] for equalization, thereby predicting the nonlinear impairment of the fiber directly and achieving a better compensation effect. However, the nonlinear impairment between consecutive adjacent symbols is correlated, whereas these studies only consider the influence of the current symbol or the previous symbol on the current symbol. As a result, the nonlinear impairment information carried in the triplet feature vector is still not fully utilized.
Considering the above research findings, an NLC scheme based on a transfer learning-assisted convolutional bidirectional long short-term memory (CNN-BiLSTM) neural network is proposed in this paper. This scheme takes into account the nonlinear influence of the k preceding and k succeeding consecutive adjacent symbols on the current symbol and combines the powerful information extraction ability of the CNN with the time memory characteristics of the BiLSTM. We argue that the input feature does not need to include all the nonlinear impairment information of the current symbol: an input feature carrying only a small amount of nonlinear impairment information can be adequately utilized to estimate the nonlinear impairment more accurately for NLC. Importantly, the scheme introduces transfer learning (TL), which is expected to achieve fast power transfer with fewer input features and training epochs at other launch powers. The complexity of the scheme is thus further reduced while the compensation performance is maintained.

NLC Principle of Transfer Learning-Assisted CNN-BiLSTM Neural Network
The NLC module proposed in this paper, which is based on the CNN-BiLSTM neural network, is located at the end of the digital signal processing (DSP) chain, that is, after the carrier phase recovery module. As shown in Figure 1, the symbols received by the coherent receiver are first processed by a series of DSP steps, such as resampling, frequency-domain dispersion compensation, the multi-modulus algorithm (MMA) for polarization demultiplexing, the M-th power feedforward algorithm for frequency offset estimation and the blind phase search (BPS) algorithm for carrier phase recovery. Under these conditions, the received symbols are considered to be affected only by nonlinearity. Subsequently, the feature vector $T_{x/y}$ is obtained by calculating the IXPM and IFWM triplets from the received symbols. To capture the nonlinear effects between consecutive adjacent symbols, we reconstruct the triplet feature vector into a multidimensional input feature that can be fed into the CNN-BiLSTM neural network for learning. Then, the nonlinear impairment learned by the NN is subtracted from the received symbols to obtain the NLC symbols. Finally, we save the NN model trained at high launch power. The nonlinear impairment at other, lower launch powers is acquired by transfer learning, with the aim of accomplishing power transfer, saving training cost, and reducing implementation complexity. In brief, our scheme consists of three parts: the construction of the multidimensional input feature, the learning of the nonlinear impairment by the CNN-BiLSTM neural network, and the introduction of transfer learning. These three parts are described in detail below.

Construction of Multidimensional Input Feature
According to first-order perturbation theory, the nonlinear term in the Manakov equation of the polarization multiplexing system can be directly regarded as the first-order perturbation $\Delta u_{x/y}$ [25], which is expressed in Equation (1), where $\beta_2$ is the group velocity dispersion, $\gamma$ is the nonlinear coefficient, and $u_{x/y,0}$ is the linear solution for the x/y polarization. Assuming that the accumulated dispersion is much larger than the symbol duration, the nonlinear perturbation term for the symbol at $t = 0$ can be approximately expressed by Equation (2), where $P_0$ is the launch power, $A_{x/y}$ are the received symbols of the x and y polarizations, $m$ and $n$ are the symbol indexes, and $C_{m,n}$ is the nonlinear perturbation coefficient, which generally has to be calculated from known link parameters.
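For orientation, the first-order perturbation sum commonly quoted in the literature can be written in the notation of this section as shown below. This is a sketch under the assumption that it matches the form the text refers to as Equation (2); prefactor and index conventions vary between references, so it should not be read as the exact expression of this paper.

```latex
\Delta u_{x/y}(t=0) \approx P_0^{3/2} \sum_{m,n} C_{m,n}
\left( A_{x/y,n}\, A^{*}_{x/y,m+n}\, A_{x/y,m}
     + A_{y/x,n}\, A^{*}_{y/x,m+n}\, A_{x/y,m} \right)
```

In this form each summand is one triplet weighted by its coefficient $C_{m,n}$, which is why the triplets alone can serve as link-independent features once the weighting is left to the NN.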
To build an NN-based compensation scheme that is independent of the link parameters yet retains the nonlinear impairment features, we directly extract the triplets $T_{x/y} = A_{x/y,n+t} A^{*}_{x/y,m+n+t} A_{x/y,m+t} + A_{y/x,n+t} A^{*}_{y/x,m+n+t} A_{x/y,m+t}$, which represent the IXPM and IFWM information. The values of $m$ and $n$ follow the rule $|mn| \le C$, $|m|, |n| \le L$. The values of $C$ and $L$ can be chosen according to the contribution of the triplets to the nonlinear perturbation, while keeping a balance between performance and algorithm complexity. The number of triplets is determined by the range of the symbol indexes $m$ and $n$, so each symbol contains $N_t$ triplets at the t-th time once the values of $m$ and $n$ are fixed. According to perturbation theory, the contribution of a triplet is mainly determined by the nonlinear perturbation coefficient $C_{m,n}$ in Equation (2). Figure 2a,b show the normalized perturbation coefficients $C_{m,n}$ for 28/85 GBaud PDM-16/64QAM after 1600/400 km transmission, respectively, with a cutoff threshold of −35 dB, namely $20\log_{10}(|C_{m,n}|/|C_{0,0}|) \ge -35$. The closer $m$ and $n$ are to the center position, the greater the contribution to the nonlinear perturbation. When the cutoff threshold is −10 dB, all triplets that significantly contribute to the nonlinear perturbation are included. In this case, the corresponding maximum value of $L$ is 13, as indicated by the black box in the figure. Therefore, we set the maximum value of $L$ to 13 to balance the performance and complexity of the proposed scheme. At the same time, we define three selection rules for the triplets, as shown in Figure 2c-e, respectively, to verify the influence of the number of triplets on the NLC performance.
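The selection rule and the triplet expression above can be sketched in a few lines of Python. The function names `triplet_indices` and `triplets_at` are illustrative and not taken from the paper, and the constant test sequences are made up. The resulting counts agree with the values quoted later in the text: $(L, C) = (7, 5)$ gives 69 triplets and $(11, 7)$ gives 109.

```python
def triplet_indices(L, C):
    """Return all (m, n) with |m| <= L, |n| <= L and |m*n| <= C."""
    return [(m, n)
            for m in range(-L, L + 1)
            for n in range(-L, L + 1)
            if abs(m * n) <= C]


def triplets_at(t, a_x, a_y, pairs):
    """IXPM/IFWM triplets of the x polarization at symbol index t.

    a_x, a_y are sequences of complex received symbols. The caller must keep
    t far enough from both ends that every index t+m, t+n, t+m+n is valid
    (negative Python indices would silently wrap around).
    """
    return [a_x[t + n] * a_x[t + m + n].conjugate() * a_x[t + m]
            + a_y[t + n] * a_y[t + m + n].conjugate() * a_x[t + m]
            for m, n in pairs]


pairs = triplet_indices(7, 5)
a_x = [1 + 0j] * 41   # made-up constant sequences, for illustration only
a_y = [1j] * 41
features = triplets_at(20, a_x, a_y, pairs)
print(len(pairs), len(triplet_indices(11, 7)))  # prints "69 109"
```

The triplets of the y polarization follow by swapping the roles of `a_x` and `a_y`.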
Since the nonlinear impairment between consecutive adjacent symbols is correlated, we need to consider the influence of the k preceding and k succeeding consecutive adjacent symbols on the current symbol at the t-th time. The triplets $T_{x/y}$ are therefore reconstructed into $T_{t,M} = [T_{t-k}, \ldots, T_{t-1}, T_t, T_{t+1}, \ldots, T_{t+k}]$ as the multidimensional input feature of the NN, where $M = 2k + 1$ is the memory size. The nonlinear impairment information in the multidimensional input feature can be adequately extracted by the NN, which is expected to further improve the estimation accuracy of the nonlinear impairment and to compensate for it more effectively.
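A minimal sketch of this windowing step is given below, assuming the per-symbol triplet vectors are already available as a list of rows. The name `build_windows` is hypothetical, and symbols within k positions of either end are simply skipped, which is one possible edge treatment rather than the one used in the paper.

```python
def build_windows(triplet_rows, k):
    """Stack the triplet vectors of the k preceding and k succeeding symbols.

    triplet_rows[t] is the triplet vector T_t of symbol t. The result holds
    one window [T_{t-k}, ..., T_t, ..., T_{t+k}] of length M = 2k + 1 for
    every symbol t that has k neighbours on both sides.
    """
    return [triplet_rows[t - k:t + k + 1]
            for t in range(k, len(triplet_rows) - k)]


rows = [[i] for i in range(7)]      # toy data: one "triplet" per symbol
windows = build_windows(rows, k=2)  # M = 5 time steps per window
```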

Nonlinear Impairment Learning Principle Analysis of CNN-BiLSTM Structure
As shown in Figure 3, by combining a 1D-convolutional layer with a BiLSTM layer [26], we construct the CNN-BiLSTM neural network structure to fully extract the information in the multidimensional input feature and make full use of the correlation of the nonlinear impairments between the k preceding and k succeeding consecutive adjacent symbols, so as to achieve an accurate estimation of the nonlinearity.
As demonstrated in the literature [22], the CNN is good at extracting high-dimensional features. Consequently, for longer triplet input features, a 1D-convolutional layer can be introduced as a preprocessing step that performs convolution calculations on the input data, and an activation function is then applied to extract the main local feature information for further processing. The recurrent neural network (RNN) can use its memory ability to deal with time series problems. However, during back propagation the gradient of an RNN may become too large or too small, which prevents it from retaining what it has learned over longer sequences [27]. The LSTM improves the internal structure of the RNN by introducing "gates" that control the transmitted information, which solves the long-term sequence dependence problem [28]. Figure 4a shows the internal structure of the LSTM unit, and the BiLSTM layer structure adopted in this paper is shown in Figure 4b. The local feature information extracted by the 1D-convolutional layer is further processed by the forward and backward structures of the BiLSTM, whose combined result forms the final output of the BiLSTM layer, to deal with the nonlinear interference between adjacent symbols. After the output of the BiLSTM layer is converted into a 1D vector by a flatten layer, it is fed to a fully connected layer containing two neurons, which outputs the real and imaginary parts of the estimated nonlinear impairment by regression. To prevent overfitting of the network, a dropout layer is added after both the 1D-convolutional layer and the BiLSTM layer.
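To make the data flow through the layers concrete, the following sketch traces the tensor shapes through the structure described above. It rests on several assumptions that are not stated in the text: the complex triplets are split into real and imaginary channels, the BiLSTM returns the full sequence with forward and backward outputs concatenated, and the layer sizes `n_f`, `n_k` and `n_h` used in the example are placeholders. The convolution output length uses the relation $n_l = n_s - n_k + 1$ given later in the text.

```python
def cnn_bilstm_shapes(n_s, n_in, n_f, n_k, n_h, n_o=2):
    """Trace (time steps, channels) through Conv1D -> BiLSTM -> Flatten -> Dense."""
    n_l = n_s - n_k + 1  # valid convolution with stride 1 and dilation 1
    return {
        "input": (n_s, n_in),
        "conv1d": (n_l, n_f),
        "bilstm": (n_l, 2 * n_h),   # forward and backward outputs concatenated
        "flatten": (n_l * 2 * n_h,),
        "dense": (n_o,),            # real and imaginary parts
    }


# n_s = 11 time steps and N_t = 69 triplets come from the text;
# n_f = 16, n_k = 3 and n_h = 32 are placeholder values.
shapes = cnn_bilstm_shapes(n_s=11, n_in=2 * 69, n_f=16, n_k=3, n_h=32)
```

Dropout layers do not change the shapes and are therefore omitted from the trace.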


The specific internal operation process of the CNN-BiLSTM neural network structure consists of the 1D convolution followed by the forward and backward LSTM passes shown in Figure 4.
Finally, the mean square error (MSE) is used as the loss function of the NN, and the Adam optimizer with a learning rate of 0.001 is applied to minimize the error between the estimated nonlinear impairment and the label. In the loss function, $B$ is the batch size and $A_{x/y,\mathrm{label}}$ represents the label of the CNN-BiLSTM neural network, which is equal to the difference between the received symbols and the transmitted symbols.
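The label definition, the loss and the final compensation step can be illustrated as follows. This is a sketch under the assumption that the MSE averages the squared magnitude of the complex error over the batch; the function names and the two example symbols are made up.

```python
def make_labels(received, transmitted):
    """Label = received symbol minus transmitted symbol (the nonlinear impairment)."""
    return [r - s for r, s in zip(received, transmitted)]


def mse_loss(estimated, labels):
    """Mean over the batch of |estimate - label|^2 (real and imaginary parts together)."""
    return sum(abs(e - l) ** 2 for e, l in zip(estimated, labels)) / len(labels)


def compensate(received, estimated):
    """Subtract the estimated nonlinear impairment from the received symbols."""
    return [r - e for r, e in zip(received, estimated)]


transmitted = [1 + 1j, -1 + 1j]          # made-up symbols
received = [1.1 + 0.9j, -0.8 + 1.0j]
labels = make_labels(received, transmitted)
loss_untrained = mse_loss([0j, 0j], labels)   # estimator that outputs zero
recovered = compensate(received, labels)      # ideal estimator
```

With a perfect estimate the compensated symbols coincide with the transmitted ones, which is the target the network is trained towards.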

Transfer Learning Simplified CNN-BiLSTM Structure
The nonlinear effects of the optical fiber influence the system performance differently at different launch powers. When the launch power changes, the NN model needs to be retrained, which inevitably brings additional training overhead. Hence, transfer learning is introduced into the training process in this work. As shown in Figure 5, parameter transfer based on the network model is adopted: the knowledge learned by the NN in the source domain is identified and transferred to a different but related target domain. As demonstrated in [29,30], a small amount of sample data is then sufficient to rapidly reconstruct the NN parameters and save training costs. In this study, we save the NN model trained at high launch power to ensure that the NN has adequately learned the influence of nonlinearity on the data. When the launch power changes, it is only necessary to load the saved network model and freeze the 1D-convolutional layer and the BiLSTM layer so that they do not participate in the new training. Only the fully connected output layer, which outputs the real and imaginary parts of the nonlinear impairment, remains trainable. With a very small amount of training data and few iterations, satisfactory compensation is achieved.
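The freeze-and-retrain idea can be sketched without any deep learning framework. In the toy example below, a fixed tanh projection stands in for the frozen 1D-convolutional and BiLSTM layers (it is not the network of this paper), and only a two-output linear layer is trained by full-batch gradient descent on a small made-up data set. All names, sizes and data are assumptions for illustration.

```python
import math
import random


def frozen_features(x, w_frozen):
    """Stand-in for the frozen layers: a fixed tanh projection."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_frozen]


def predict(feats, w_out, b_out):
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(w_out, b_out)]


def mse(data, w_frozen, w_out, b_out):
    total = 0.0
    for x, y in data:
        p = predict(frozen_features(x, w_frozen), w_out, b_out)
        total += sum((pi - yi) ** 2 for pi, yi in zip(p, y))
    return total / len(data)


def fine_tune_output_layer(data, w_frozen, w_out, b_out, lr=0.05, epochs=50):
    """Update only w_out and b_out; w_frozen is read but never modified."""
    n = len(data)
    for _ in range(epochs):
        gw = [[0.0] * len(row) for row in w_out]
        gb = [0.0] * len(b_out)
        for x, y in data:
            f = frozen_features(x, w_frozen)
            p = predict(f, w_out, b_out)
            for o in range(len(w_out)):
                err = 2.0 * (p[o] - y[o]) / n
                gb[o] += err
                for j, fj in enumerate(f):
                    gw[o][j] += err * fj
        for o in range(len(w_out)):
            b_out[o] -= lr * gb[o]
            for j in range(len(w_out[o])):
                w_out[o][j] -= lr * gw[o][j]
    return w_out, b_out


random.seed(0)
w_frozen = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
w_frozen_before = [row[:] for row in w_frozen]
data = []
for _ in range(20):  # small target-domain data set (made up)
    x = [random.uniform(-1, 1) for _ in range(3)]
    data.append((x, [0.5 * x[0] - 0.2 * x[1], 0.3 * x[2]]))
w_out = [[0.0] * 4 for _ in range(2)]
b_out = [0.0, 0.0]
loss_before = mse(data, w_frozen, w_out, b_out)
fine_tune_output_layer(data, w_frozen, w_out, b_out)
loss_after = mse(data, w_frozen, w_out, b_out)
```

Because only the output layer is updated, each fine-tuning step touches far fewer parameters than full retraining. In a framework such as Keras or PyTorch, the same effect is obtained by marking the earlier layers as non-trainable before continuing training.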

Description of the Simulation System
To validate the effectiveness and performance of our scheme, we construct SC and 5-channel wavelength division multiplexing (WDM) numerical simulation platforms based on Virtual Photonics Inc. (VPI) Transmission Maker 11.1, as shown in Figure 6. At the transmitter, 28 GBaud PDM-16QAM and 85 GBaud PDM-64QAM signals are generated, respectively. The data sequence in each polarization is produced by pseudo random binary sequence (PRBS) generators with a length of $2^{15}$. The corresponding channel spacings in the WDM systems are 50 GHz and 100 GHz. The linewidth and frequency offset of each laser are set to 100 kHz and 100 MHz, and root-raised-cosine pulse shaping with a roll-off factor of 0.1 is applied. Optical switches are used to switch between the SC and 5-channel WDM systems. The modulated signals output by the optical transmitters are sent into the fiber loop, which consists of 80 km of standard single mode fiber (SSMF) with attenuation, chromatic dispersion, polarization mode dispersion and nonlinear coefficient of 0.2 dB/km, 16 ps/(nm·km), 0.1 ps/√km and 1.3 W⁻¹/km, respectively. An erbium-doped fiber amplifier (EDFA) is added to each loop to compensate for the loss in the link, which introduces amplified spontaneous emission (ASE) noise. For the PDM-16QAM signal, the transmission distance is 1600 km with an EDFA noise figure of 6 dB. For the PDM-64QAM signal, the transmission distance is only 400 km with a noise figure of 4 dB. In addition, an optical band-pass filter (OBPF) is placed in the loop to suppress the accumulation of out-of-band ASE noise. In the 5-channel WDM system, at the receiver, we select the third channel, with a center wavelength of 1550 nm, for the simulation because it is the most affected by nonlinearity. The coherent receiver performs photoelectric conversion and outputs the four electrical signals $I_x$, $Q_x$, $I_y$ and $Q_y$. After a series of digital signal processing (DSP) modules, they are sent to our proposed NLC module.
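A few derived link quantities follow directly from the parameters above and can be checked with a short calculation. The helper name `link_budget` is hypothetical, and the sketch assumes that every loop pass is exactly one 80 km span whose loss is fully compensated by the EDFA.

```python
SPAN_KM = 80
ATTENUATION_DB_PER_KM = 0.2
DISPERSION_PS_PER_NM_KM = 16


def link_budget(total_km):
    """Derived quantities for a link built from identical 80 km spans."""
    return {
        "spans": total_km // SPAN_KM,
        "span_loss_db": ATTENUATION_DB_PER_KM * SPAN_KM,
        "accumulated_dispersion_ps_per_nm": DISPERSION_PS_PER_NM_KM * total_km,
    }


pdm16qam_link = link_budget(1600)  # 20 spans, 25600 ps/nm
pdm64qam_link = link_budget(400)   # 5 spans, 6400 ps/nm
```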


Results and Discussion
Since the complexity of the activation functions, adders, look-up tables and comparators is much smaller than that of the multipliers, we only consider the complexity of the multipliers; that is, the number of real multipliers per symbol (RMPS) is used as the computational complexity metric. Table 1 shows the complexity results of the proposed scheme, SRNN NLC and the DBP algorithms. The complexity of our proposed scheme arises from three aspects: the input structure, the NN and the introduced transfer learning. However, transfer learning mainly reduces the complexity of the training process by reducing the number of training epochs and the number of symbols involved in NN training. Hence, only the complexity of the input structure and the NN needs to be considered, and transfer learning is not included in the RMPS analysis of this scheme. For the input structure, each triplet requires four complex multipliers, and each complex multiplier is equivalent to four real multipliers, so a total of 16 real multipliers are required per triplet of the multidimensional input feature. With $16N_t$ representing the complexity of the input structure, the input structure complexities of the CNN-BiLSTM and SRNN neural networks are $16N_{t,\mathrm{CNN\text{-}BiLSTM}}$ and $16N_{t,\mathrm{SRNN}}$, respectively. For the NN, the number of neurons in the input layer of the two NLC schemes is denoted as $n_i$, and the number of neurons in both the SRNN and the BiLSTM hidden layers is denoted as $n_h$. The number of output layer neurons of the two NNs is $n_o = 2$. We use the default convolutional layer configuration, with padding, stride and dilation equal to 0, 1 and 1, respectively. Here, $n_s$ is the number of time steps, which is equal to $M$ in Section 2.1. The number of filters is $n_f$, the kernel size is $n_k$, and the output size of each filter of the 1D-convolutional layer is $n_l = n_s - n_k + 1$ [26]. Likewise, only the complexity of real multipliers is considered for the DBP algorithm. The RMPS of DBP is determined by the number of fiber spans $n_{\mathrm{span}}$, the number of steps per span $n_{\mathrm{step}}$, the oversampling ratio $n_{\mathrm{up}}$ and the FFT size $N_{\mathrm{FFT}}$ [18].
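The two closed-form quantities stated above are easy to check numerically. The sketch below evaluates the input structure cost $16N_t$ and the convolution output length; the general length formula with padding, stride and dilation is the standard one for 1D convolutions and reduces to $n_l = n_s - n_k + 1$ for the default configuration. The kernel size $n_k = 3$ in the example is a placeholder, not a value from the paper.

```python
def input_structure_rmps(n_t):
    """4 complex multipliers per triplet, 4 real multipliers each: 16 * N_t."""
    return 16 * n_t


def conv1d_output_length(n_s, n_k, padding=0, stride=1, dilation=1):
    """Output length of a 1D convolution (standard formula)."""
    return (n_s + 2 * padding - dilation * (n_k - 1) - 1) // stride + 1


rmps_cnn_bilstm = input_structure_rmps(69)   # N_t = 69 -> 1104
rmps_srnn = input_structure_rmps(109)        # N_t = 109 -> 1744
n_l = conv1d_output_length(n_s=11, n_k=3)    # 11 - 3 + 1 = 9
```

Note that these figures cover the input structure only; the NN and DBP terms of the RMPS depend on expressions that are not reproduced here.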
To analyze the effective range of the nonlinear effects, we need to decide how many consecutive adjacent symbols to use by varying k, which is equivalent to varying the number of time steps $n_s$ ($n_s = 2k + 1$). Therefore, we first discuss the Q-factor performance at the optimal launch power of the SC and WDM systems versus $n_s$, together with the corresponding RMPS, in Figure 7. The Q-factor is used as the system performance metric in this paper and is calculated from the bit error rate (BER). When k is set to 0, 1, 2, ..., 10, the number of time steps $n_s$ is 1, 3, 5, ..., 21. As shown in Figure 7, increasing the number of time steps $n_s$ improves the Q-factor performance. In particular, when $N_t$ is equal to 69 and the number of time steps $n_s$ exceeds 11, the Q-factor performance tends to converge. To balance performance and complexity, we choose an appropriate value of $n_s$ by defining a measurement parameter P, namely the ratio of the Q-factor performance to the complexity in RMPS. The larger the value of P, the more suitable the corresponding $n_s$. The red curve in Figure 7a and the black curve in Figure 7b are taken as examples for illustration. In Figure 7a, when $n_s$ is 11, the corresponding P is $2.3 \times 10^{-4}$, and when $n_s$ is 15, the value of P is $1.7 \times 10^{-4}$. Similarly, in Figure 7b, when $n_s$ is 11, the corresponding P is $3.5 \times 10^{-4}$, and when $n_s$ is 15, the value of P is $2.7 \times 10^{-4}$. These findings show that $n_s = 11$ is a more suitable choice than $n_s = 15$. Consequently, after comprehensive consideration, we choose 11 as the optimal value of $n_s$ for the subsequent simulations.
When comparing the performance of the selected schemes, we set the SRNN input features to contain 109 triplets, for which L and C are equal to 11 and 7, respectively. To reflect the ability of the CNN-BiLSTM neural network to learn from a small amount of feature information between consecutive adjacent symbols, the number of triplets contained in the multidimensional input feature is reduced. Accordingly, we set the number of triplets $N_t$ to 69 by setting L and C to 7 and 5, respectively. The performance of our proposed scheme is then compared with SRNN NLC at equivalent complexity by adjusting the parameters involved.
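The conversion from BER to Q-factor mentioned above is commonly written as $Q = 20\log_{10}\left[\sqrt{2}\,\mathrm{erfc}^{-1}(2\,\mathrm{BER})\right]$ in dB. The sketch below assumes this standard relation is the one intended and evaluates it with the Python standard library, using the identity $\sqrt{2}\,\mathrm{erfc}^{-1}(2p) = \Phi^{-1}(1 - p)$ for the standard normal distribution. The function names are illustrative.

```python
import math
from statistics import NormalDist


def q_factor_db(ber):
    """Q-factor in dB from the BER, valid for 0 < ber < 0.5."""
    q_linear = NormalDist().inv_cdf(1.0 - ber)  # equals sqrt(2) * erfcinv(2 * ber)
    return 20.0 * math.log10(q_linear)


def ber_from_q_db(q_db):
    """Inverse relation: BER = 0.5 * erfc(Q / sqrt(2))."""
    q_linear = 10.0 ** (q_db / 20.0)
    return 0.5 * math.erfc(q_linear / math.sqrt(2.0))


q_example = q_factor_db(1e-3)  # roughly 9.8 dB
```

The measurement parameter P described above is then simply this Q-factor divided by the RMPS of the corresponding configuration.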
The sample dataset is divided into 50% for training, 20% for validation, and 30% for testing with unseen data. To improve the generalization ability of the NN, we use "early stopping", which halts training when the validation accuracy has not improved for 10 successive epochs. In the simulation results of the PDM-16QAM/64QAM SC system in Figure 8, the proposed scheme outperforms SRNN NLC at equivalent complexity. For the 28 GBaud PDM-16QAM signal, the optimal launch power of this scheme is increased from −1 dBm to 0 dBm. Compared with the scheme without nonlinear compensation (w/o NLC) and with SRNN NLC, the SNR is improved by about 4.7 dB and 0.5 dB, and the Q-factor gain is about 1.36 dB and 0.28 dB, respectively. Compared with DBP20StPs, this scheme performs better in all three scenarios with different numbers of triplets, and its complexity is considerably lower. In the lowest-complexity case, namely $N_t = 69$, the SNR of the proposed scheme is about 0.7 dB higher than that of DBP20StPs and the Q-factor gain is about 0.37 dB. For the 85 GBaud PDM-64QAM signal, the optimal launch power of this scheme is increased from 3 dBm to 4 dBm. Compared with w/o NLC and SRNN NLC, the SNR is improved by about 3 dB and 0.5 dB, and the Q-factor gain is about 0.7 dB and 0.29 dB, respectively. When the complexity of our scheme ($N_t = 69$) is equivalent to that of DBP20StPs, the performance is also similar. However, DBP20StPs must compensate separately at each launch power, whereas in our scheme the NN model trained at a high launch power can be transferred, with a small number of samples, to realize NLC at low launch powers, which greatly reduces the implementation complexity with similar performance.
Figure 9 shows the simulation results comparing transfer learning and retraining in the PDM-16QAM/64QAM SC system. The constellation diagrams inserted in the figure are the results of transfer learning and retraining at the optimal launch power and at the highest launch power. Interestingly, with only 5% of the training samples, transfer learning achieves compensation performance comparable to or better than retraining at each launch power.
We also verify the effectiveness of the proposed scheme in the 5-channel WDM system, with results shown in Figure 10. The proposed scheme shows a performance improvement over SRNN NLC at equivalent complexity. For the 28 GBaud PDM-16QAM WDM signal, the optimal launch power of this scheme is increased from −1 dBm to 0 dBm. Compared with w/o NLC and SRNN NLC, the SNR of this scheme is improved by about 3.7 dB and 0.4 dB, and the Q-factor gain is about 1.14 dB and 0.2 dB, respectively. Similar to the results of the SC system, this scheme ($N_t = 69$) improves the SNR and the Q-factor by 1.4 dB and 0.55 dB compared with DBP20StPs, while its complexity is lower than that of DBP20StPs. For the 85 GBaud PDM-64QAM WDM signal, the optimal launch power of this scheme is increased from 2 dBm to 3 dBm. Compared with w/o NLC and SRNN NLC, the SNR of this scheme is improved by about 2.7 dB and 0.1 dB, respectively. At a complexity equivalent to that of DBP20StPs, the performance of this scheme is similar at low launch power and significantly better at high launch power. In the 5-channel WDM system, transfer learning can also be introduced to achieve effective transfer across launch powers. Figure 11 shows the simulation results comparing transfer learning and retraining for the PDM-16QAM/64QAM WDM system. Their performance trends are similar to those of the SC system.
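The 50%/20%/30% data split, the early-stopping rule with a patience of 10 epochs, and the 5% fine-tuning subset used for transfer learning in the simulations above can be sketched as follows. This is a minimal illustration, not the training code of this work: the function names, the random shuffling, the seed, and the choice of taking the leading samples as the 5% subset are all assumptions.

```python
import random


def split_dataset(samples, train=0.5, val=0.2, seed=0):
    """Shuffle and split into train/validation/test; the remainder is the test set."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def early_stopping_epoch(val_scores, patience=10):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation score has failed to improve on its
    best value for `patience` consecutive epochs; otherwise all epochs run.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores)


def transfer_subset(train_set, fraction=0.05):
    """Leading `fraction` of the training samples, used for fine-tuning."""
    return train_set[:max(1, int(len(train_set) * fraction))]


if __name__ == "__main__":
    train_set, val_set, test_set = split_dataset(range(1000))
    print(len(train_set), len(val_set), len(test_set))
    print(len(transfer_subset(train_set)))
    print(early_stopping_epoch([0.1, 0.2, 0.3] + [0.3] * 10))
```

In a transfer-learning run, the model trained at a high launch power would be fine-tuned only on the output of `transfer_subset` for each lower launch power, rather than on the full training set.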
Table 2 shows the specific parameters used and the corresponding RMPS complexity of the different compensation schemes in the simulation system. The two RMPS values correspond to the two modulation formats (PDM-16QAM/64QAM). In the second and third rows of our scheme, $N_t$ is expanded to 109 and 169 to observe the influence of the number of triplets on NLC performance. The upper and lower rows of the last column of each NLC scheme correspond to the Q-factor gain $\Delta Q$ of the 16/64QAM SC and WDM systems, respectively.
To further verify our scheme, we carry out 28 GBaud PDM-16QAM experimental transmissions in both the SC and 3-channel WDM systems; the corresponding schematic diagrams are shown in Figure 12. At the transmitter, the center frequency of the external cavity laser (ECL) in the SC system is 193.4 THz. In the 3-channel WDM system, the channel spacing is set to 50 GHz, and ECL1, ECL2 and ECL3, with center frequencies of 193.35 THz, 193.4 THz and 193.45 THz, are coupled by a polarization-maintaining optical coupler (PM-OC). The linewidth and frequency offset of the lasers are approximately 100 kHz and 100 MHz, respectively. We construct bit sequences of length $2^{15}$, which are PRBS generated by a MATLAB built-in function. PDM-16QAM signals are generated by a 65 GSa/s arbitrary waveform generator (AWG, Keysight M8195A) to drive the IQ modulator. The signal is pulse shaped using a root-raised-cosine filter with a roll-off factor of 0.1. The IQ-modulated PDM signal is adjusted to an appropriate optical power by an EDFA with a noise figure of 6.5 dB and a variable optical attenuator (VOA), and is then sent into the fiber loop. The loop consists of 100 km/span SSMF, an EDFA, an OBPF and an optical loop controller, and the transmission distance is set to 800 km.
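As a quick consistency check on the transmitter parameters above, the following sketch recomputes the WDM channel grid, the AWG oversampling ratio and the gross line rate. The helper names are hypothetical, and the gross line rate ignores any FEC or pilot overhead, which is not specified here.

```python
import math


def wdm_center_frequencies_thz(center_thz, spacing_ghz, n_channels):
    """Center frequencies of an odd number of channels placed symmetrically."""
    half = (n_channels - 1) // 2
    return [round(center_thz + i * spacing_ghz / 1000.0, 6)
            for i in range(-half, half + 1)]


def samples_per_symbol(sample_rate_gsa, symbol_rate_gbaud):
    """Oversampling ratio of the waveform generator."""
    return sample_rate_gsa / symbol_rate_gbaud


def gross_bit_rate_gbps(symbol_rate_gbaud, qam_order, n_polarizations=2):
    """Symbol rate x bits per symbol x polarizations, without overhead."""
    return symbol_rate_gbaud * int(math.log2(qam_order)) * n_polarizations


if __name__ == "__main__":
    print(wdm_center_frequencies_thz(193.4, 50, 3))
    print(round(samples_per_symbol(65, 28), 2))
    print(gross_bit_rate_gbps(28, 16))   # 28 GBaud PDM-16QAM
    print(gross_bit_rate_gbps(85, 64))   # 85 GBaud PDM-64QAM
```

The grid reproduces the three ECL frequencies, and the 85 GBaud PDM-64QAM case gives 1020 Gb/s, consistent with the 1.020 Tbps figure quoted for the simulation system.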

Results and Discussion
As in the simulation, we first test the Q-factor performance at the optimal launch power of the SC and WDM systems versus $n_s$, together with the corresponding RMPS. When $n_s$ exceeds 11, the Q-factor curve tends to flatten out. In Figure 13, on the black curve, $P$ is $4.1 \times 10^{-4}$ when $n_s$ is 11 and $2.9 \times 10^{-4}$ when $n_s$ is 17. On the red curve of the same figure, $P$ is $3.9 \times 10^{-4}$ when $n_s$ is 11 and $2.7 \times 10^{-4}$ when $n_s$ is 17. These results indicate that $n_s = 11$ is a more appropriate choice than $n_s = 17$. Therefore, in the experimental systems, $n_s$ is chosen as 11 to trade off complexity and performance.
Since the transmission distance of the experimental system is shorter than that of the simulation system, the correlation of nonlinear impairment between consecutive adjacent symbols is reduced accordingly. Additionally, the triplet feature vector in the multidimensional input feature can break the pattern of the PRBS, so that the performance is not overestimated [23]. The transmission performance results are shown in Figure 14, where only the results for $N_t = 69$ and $N_t = 109$ are used for the experimental performance comparison. Figure 14a,b show that the optimal launch power of this scheme ($N_t = 69$) is increased by 1 dB in both the SC and 3-channel WDM experimental systems. In the SC system, compared with w/o NLC, DBP20StPs and SRNN NLC, the Q-factor gain of this scheme ($N_t = 69$) is about 2.2 dB, 1.7 dB and 1 dB, respectively. In the 3-channel WDM system, the Q-factor gain is about 1.8 dB, 1.4 dB and 1.1 dB, respectively. In addition, for both the SC and 3-channel WDM experimental systems, the performance is further improved when $N_t = 109$.
Table 3 shows the key parameters and corresponding RMPS complexity of the different NLC schemes for the experimental system. The second row of the proposed scheme is the result obtained when the number of triplets in the multidimensional input feature is increased to further improve the performance. The upper and lower rows of the last column of each NLC scheme correspond to the Q-factor gain $\Delta Q$ of the SC and WDM systems, respectively. It can be seen that our scheme ($N_t = 69$) has clear performance advantages over the other two NLC schemes at low complexity.
Figure 15 shows the Q-factor comparison of transfer learning and retraining in our experimental system. To validate the effectiveness of transfer learning from a high launch power to lower launch powers, we use the model trained at a launch power of 3 dBm in the experimental system. The constellation diagrams inserted in the figure are the results of transfer learning and retraining at a launch power of 2 dBm. Under our experimental conditions, transfer learning in the experimental system achieves effects similar to those in the simulation system.
In addition, we roughly analyze the running time of the proposed scheme, which is a simple and intuitive measure of efficiency. In this analysis, the computer processor is an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz, and the random access memory (RAM) is Kingston DDR4 3200 MHz with a capacity of 32 GB. The generation time of the input features corresponding to each symbol is 0.016107 s, the test time of the neural network with retraining is 0.8546 s, and the test time after transfer learning is reduced to 0.09075 s. Therefore, the total running time of the proposed NLC scheme is 0.106857 s.
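The running-time figures above combine as in the short check below. The variable names are hypothetical; the ratio between the two test times is derived here from the quoted values and is not a figure reported in the measurements.

```python
# Timings quoted in the running-time analysis, in seconds.
feature_time_s = 0.016107        # input-feature generation per symbol
retrain_test_time_s = 0.8546     # test time with retraining
transfer_test_time_s = 0.09075   # test time after transfer learning

# Total running time of the scheme when transfer learning is used.
total_with_transfer_s = feature_time_s + transfer_test_time_s

# Ratio of the two test times (derived, not reported in the source).
test_time_ratio = retrain_test_time_s / transfer_test_time_s

if __name__ == "__main__":
    print(f"total with transfer learning: {total_with_transfer_s:.6f} s")
    print(f"test-time ratio: {test_time_ratio:.1f}x")
```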

Conclusions
In this study, we propose a fiber NLC scheme based on a transfer learning-assisted CNN-BiLSTM neural network structure. The input of the CNN-BiLSTM neural network is a multidimensional feature carrying nonlinear impairment information, so that the network can make full use of the correlation between the nonlinear impairment of the current symbol and that of the preceding and succeeding consecutive adjacent symbols. The scheme is verified by both simulation and experiment in SC and WDM systems. By investigating the influence range of nonlinear effects, we observe the effect of the number of time steps on the Q-factor performance and find the value that best trades off performance and complexity in the simulation and the experiment. For the 28 GBaud PDM-16QAM signal, this scheme shows a significant improvement in SNR and Q-factor gain compared with SRNN NLC and DBP20StPs. Our scheme is also suitable for the 1.020 Tbps ultra-high-speed optical communication system, where it shows compensation performance similar to that of DBP20StPs, with certain performance advantages at higher launch powers. To further reduce the complexity of the scheme, transfer learning is introduced, and it is demonstrated that this scheme can achieve compensation performance comparable to or better than retraining with only 5% of the training samples. Therefore, in comparison with the other two NLC schemes, our scheme achieves a significant improvement in compensation effect with lower complexity. Furthermore, based on the design of this paper, we expect to identify more valuable inter-channel nonlinear impairment features in our future research work, thus providing suitable options for inter-channel NLC.

Figure 1. Schematic diagram of NLC scheme based on transfer learning-assisted CNN-BiLSTM.

Figure 2. (a,b) The normalized perturbation coefficients $C_{m,n}$ for 28/85 GBaud PDM-16/64QAM after 1600/400 km transmission; (c-e) Three different selection rules of triplets below the cutoff threshold of −10 dB in this scheme.


Figure 13. Q-factor versus the number of time steps $n_s$ with the corresponding RMPS for 28 GBaud PDM-16QAM SC and 3-channel WDM systems.
$W_c$ and $W_j$ are the weight matrices of the 1D-convolutional layer and of the forget gate $f_t$, input gate $i_t$, output gate $o_t$, cell state of the BiLSTM layer and the fully connected layer, respectively. Furthermore, $b$ denotes the corresponding bias vectors, $x_t$ is the output of the 1D-convolutional layer, and $c_t$ and $h_t$ represent the memory cell state and output state of the BiLSTM layer. $\phi$ is the LeakyReLU activation function, tanh is the hyperbolic tangent activation function, and $\sigma$ is the logistic sigmoid function. $\hat{A}_{x/y,\mathrm{NLC}}$ represents the nonlinear impairment estimated by the CNN-BiLSTM neural network structure.
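To make the roles of the gates and states named above concrete, the following is a minimal sketch of one LSTM step and of a bidirectional pass over $n_s$ time steps. It uses the textbook LSTM gate equations with a single weight matrix per gate acting on the concatenation of $h_{t-1}$ and $x_t$; this parameterization, the zero initial states and all function names are assumptions and may differ from the exact formulation used in this work. The 1D-convolutional and fully connected layers are omitted.

```python
import math


def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, vector, bias):
    """Compute W @ vector + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps "f", "i", "o", "c" to (W, b)."""
    z = list(h_prev) + list(x_t)  # concatenated input [h_{t-1}, x_t]

    def gate(name, activation):
        weights, bias = params[name]
        return [activation(v) for v in affine(weights, z, bias)]

    f_t = gate("f", sigmoid)        # forget gate
    i_t = gate("i", sigmoid)        # input gate
    o_t = gate("o", sigmoid)        # output gate
    c_hat = gate("c", math.tanh)    # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def bilstm(sequence, fwd_params, bwd_params, hidden_size):
    """Forward and backward passes, concatenated per time step."""

    def run(seq, params):
        h, c = [0.0] * hidden_size, [0.0] * hidden_size
        outputs = []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, params)
            outputs.append(h)
        return outputs

    forward = run(sequence, fwd_params)
    backward = run(sequence[::-1], bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Each output of `bilstm` has length `2 * hidden_size`, so the state at the current symbol combines information from both the preceding and the succeeding time steps.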

Table 1. Complexity analysis results of different schemes.
SRNN: … $+\, n_h n_h + n_h n_o + 16N_{t,\mathrm{SRNN}}$
CNN-BiLSTM: $n_i n_f n_k n_l + 2 n_l n_h (4n_f + 4n_h + 3 + n_o) + 16N_{t,\mathrm{CNN\text{-}BiLSTM}}$

Table 2. Comparison results of specific complexity of different schemes in simulation system.
The attenuation, chromatic dispersion, polarization mode dispersion and nonlinear coefficient of the SSMF in the experimental loop are 0.19 dB/km, 16.7 ps/(nm·km), 0.2 ps/√km and 1.27 /W/km, respectively. At the receiving end, a waveform shaper (WS, Waveshaper 4000s) is applied to the 3-channel WDM system for demultiplexing, which equalizes the optical power of each channel and extracts the intermediate channel signal. The received optical signals are detected by a coherent receiver and sampled by a real-time oscilloscope at 80 GSa/s for off-line DSP processing.

Table 3. Comparison results of the specific complexity of different schemes for experimental system.