1. Introduction
In order to tackle the explosive growth of wireless data traffic and emerging application services, Sixth-Generation (6G) communication research is widely expected to shift towards the higher-frequency spectrum, since the current radio frequency (RF) band is becoming increasingly crowded [1,2]. The millimeter-wave and terahertz spectrum can be exploited to fulfill this demand; however, the corresponding equipment is extremely costly. Visible light communication (VLC) is expected to provide a potential supplement for 6G, since it relies on the unlicensed spectrum spanning from 400 to 800 THz and offers the benefits of electromagnetic interference resistance, green technology, safety, and low cost. In addition, VLC can be integrated into common lighting systems to allow simultaneous illumination and communication. During the last decade, various research works have been conducted to establish the theoretical foundation and application paradigms for high-speed VLC systems [3,4].
The spectral efficiency of VLC can be improved with the help of high-order modulation schemes [5,6]. Nevertheless, the undesirable nonlinearities introduced by the electro-optical and photoelectric conversions contaminate the useful signal [7], and the diffuse optical channel brings in inevitable inter-symbol interference (ISI). These channel impairments significantly degrade the signal quality and hinder high-speed VLC transmission, and therefore must be relieved. Traditional schemes and algorithms generally estimate the transfer function of the communication channel and remove the channel impairments by constructing a nonlinear post-equalizer (NPE) [7,8]. However, these methodologies still fall short of the ideal case and face certain restrictions and requirements in different application scenarios.
Deep learning (DL) has shown great success in pattern identification, image recognition, and data mining. It has already been applied to physical layer communication due to its strong ability to learn unknown or complex communication blocks [9,10], especially for modeling nonlinear phenomena. With the development of advanced network structures and optimized training algorithms, DL-based NPE shows clear superiority over traditional approaches in channel impairment compensation. A comprehensive introduction and overview of DL-based methods can be found in [11–25]. In [12–14], the deep neural network (DNN) was employed to learn the channel characteristics and demodulate the output signals directly. In [15,16], Gaussian-kernel-aided DNNs were proposed as the pre-equalizer and post-equalizer, respectively, to mitigate the nonlinear degradation in high-order modulated VLC systems. In [17], a low-complexity memory-polynomial-aided neural network was created to replace the traditional post-equalization filters of carrierless amplitude and phase (CAP) modulation. These DNN-based schemes can mitigate the linear and nonlinear distortion of the VLC channel and exhibit better bit error rate (BER) performance than some existing methods. However, the learning ability of these DL models is limited in high-speed VLC, since the system is mainly restricted by the inherent memory nonlinearity of the light-emitting diode (LED), resulting in a slow convergence speed and relatively poor generalization of the DNN.
For a nonlinear VLC channel with memory, the recurrent neural network (RNN) with long short-term memory (LSTM) cells is a more natural choice for memory sequence prediction, because the long-term memory parameters can store the channel characteristics. In [18], a memory-controlled LSTM equalizer was proposed to compensate both the linear and nonlinear distortions. In [19], an LSTM network was proposed to handle the nonlinear distortions of a pulse amplitude modulation (PAM) system with an intensity-modulation and direct-detection (IM/DD) link over 100 km of standard single-mode fiber. These LSTM models outperform the conventional model-solving-based equalizers; nevertheless, their equalization accuracy is not robust to noise variation, which degrades the learning efficiency. In order to learn more suitable channel features, the convolutional neural network (CNN) can be used for memory sequence prediction from raw channel outputs [20], since a function can be learned that maps a sequence of past observations at the input to an output observation. In [21], a CNN-based scheme was proposed for direct equalization in an orthogonal-frequency-division-multiplexing (OFDM)-based VLC system. In [22], a novel blind algorithm based on the CNN was introduced to jointly perform equalization and soft demapping for M-ary quadrature amplitude modulation (M-QAM). The results showed that the proposed CNN schemes outperform existing equalization algorithms and maintain an excellent BER performance in both linear and nonlinear channels. However, for dynamic and deep-memory scenarios, the optimal equalization performance comes at the cost of computational complexity, because the dimension of the input spatial information increases sharply and the convolution layers have to undertake a tremendous computational load, which increases the network complexity [23]. In order to shrink the network complexity, specific architecture combinations were proposed to distribute the learning task [24,25]. However, the original input data undergo little effective transformation, so a large amount of training time is spent extracting the implicit features contained in the samples. Hence, the trade-off among computational complexity, training time, robustness, and generalization is one of the critical challenges to be addressed in practical VLC applications. Besides, it is also necessary to consider how to transform the raw data into alternative forms by changing their value, structure, or format so that the data may be easily parsed by the machine.
In this paper, inspired by the approaches in [15–25], the channel impairment compensation is formulated as a spatial memory pattern prediction problem, and an efficient impairment compensation scheme based on a model-driven CNN-LSTM is proposed to undo the memory nonlinearity of VLC. The underlying idea is that the Volterra structure is applied to pre-emphasize the original sequence, and the appropriate patterns are formed as the spatial input accordingly. Then, a hybrid CNN-LSTM neural network is elaborately designed to learn the implicit features of the nonlinearity and predict the memory sequence directly, which speeds up the convergence process and improves the equalization accuracy. The main contributions of this work can be summarized as follows:
The structural information of the Volterra model is involved in the proposed DL equalizer to pre-emphasize the raw data, which is favorable for memory feature learning. Therefore, it can relax the learning pressure and reduce the structural complexity and training time.
Based on the traditional model-solving procedure, the channel impairment compensation is formulated as a spatial memory pattern prediction problem, and the proposed DL model is used to achieve accurate prediction.
Both the memory nonlinearity of the LED and the dispersive effect of the optical channel in a VLC system are simultaneously considered during the training stage.
The proposed scheme can still provide an excellent BER performance under mismatched training and testing conditions, showing good robustness.
Numerical simulations in terms of learning and generalization show that the proposed scheme is able to predict the original transmitted signal and compensate the impairments with high accuracy and resolution. In addition, it converges relatively fast to a better normalized mean-squared error (NMSE), which confirms its superiority over some existing methods.
The remainder of this paper is organized as follows. In Section 2, the overall channel nonlinearity is analyzed and the impairment compensation is formulated as a spatial memory pattern prediction problem. The corresponding network architecture and training specification are illustrated in Section 3. Simulation results and discussions are demonstrated in Section 4, and conclusions are given in Section 5.
Notations: Matrices and column vectors are denoted by upper- and lower-case boldface letters, respectively. $[\mathbf{a}]_n$ denotes the n-th element of $\mathbf{a}$. The set of real numbers is denoted by $\mathbb{R}$. In addition, $*$, $\otimes$, $(\cdot)^{T}$, and $|\cdot|$ are employed to represent the convolution, the Kronecker product, the transpose, and the absolute value operators, respectively. Let $\|\cdot\|_2$ denote the $\ell_2$-norm and $\hat{a}$ be an estimate of the parameter of interest $a$. $\mathcal{N}(\mu,\sigma^2)$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
  2. System Nonlinearity
A typical VLC system employing an IM/DD structure is illustrated in Figure 1. The end-to-end VLC channel includes an electrical modulator, digital-to-analog converter (DAC), bias tee, LED, optical transmission channel, analog-to-digital converter (ADC), and electrical demodulator. Numerous modules generate nonlinearities; however, the overall nonlinearity of the VLC channel is mainly introduced by the LED and by the multipath propagation of the optical link. In addition, the memory nonlinearity becomes more significant as the signal bandwidth increases. The LED behavior is usually described by the Wiener model, which is a cascade of a linear block and a memoryless nonlinear block. Let $f_c$ denote the 3 dB cut-off frequency; then the memory (linear) block of the LED can be expressed as
$$h_{\mathrm{L}}(t)=e^{-2\pi f_c t},\quad t\ge 0. \qquad (1)$$
The memoryless nonlinearity block can be modeled by
$$z(t)=\sum_{q=0}^{Q}a_q\big(x(t)*h_{\mathrm{L}}(t)\big)^{q}, \qquad (2)$$
where $x(t)$ is the input real-valued transmitted signal, $a_q$ is the polynomial coefficient, and $Q$ is the polynomial order. The channel impulse response (CIR) of the multipath propagation effect in VLC can be expressed as
$$h_{\mathrm{c}}(t)=\sum_{i=1}^{N_p}P_i\,\delta(t-\tau_i), \qquad (3)$$
where $P_i$ is the optical power, $\tau_i$ is the propagation time of the i-th light ray, and $N_p$ is the number of received rays at the photodetector (PD), respectively. In fact, the PD also exhibits nonlinear behavior when the optical intensity of the injected signal is very large, leading to the saturation of the PD. However, the optical intensity can be lowered with the help of an optical attenuator. Therefore, the PD can be regarded as a linear component, which is usually modeled by a Dirac function. Note that the quantization effect in the ADC is not considered here. After optical-to-electrical conversion in the PD, the received electrical signal can be expressed as
$$y(t)=\gamma\,h_{\mathrm{c}}(t)*\big(z(t)+I_{\mathrm{DC}}\big)+w(t), \qquad (4)$$
where $\gamma$ denotes the responsivity of the PD, $I_{\mathrm{DC}}$ is the DC bias, and $w(t)$ is Gaussian noise following $\mathcal{N}(0,\sigma^2)$.
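As a rough illustration only, the following NumPy sketch mimics the impairment chain described above, assuming an exponential impulse response for the LED linear block, a third-order memoryless polynomial, a two-tap multipath CIR, and additive Gaussian noise; every numerical value (cut-off frequency, coefficients $a_q$, tap powers and delays, responsivity, bias, and noise level) is an illustrative assumption rather than a parameter taken from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

fs, fc = 100e6, 20e6           # sampling rate and assumed LED 3 dB cut-off (Hz)
N = 4096
x = rng.uniform(-1.0, 1.0, N)  # sampled real-valued transmitted signal

# (1) LED memory block: sampled exponential impulse response exp(-2*pi*fc*t)
t = np.arange(64) / fs
h_led = np.exp(-2 * np.pi * fc * t)
h_led /= h_led.sum()           # normalize the DC gain of the linear block
u = np.convolve(x, h_led)[:N]

# (2) memoryless polynomial nonlinearity with illustrative coefficients a_q
a = [0.0, 1.0, -0.15, -0.05]   # a_0..a_Q with Q = 3 (assumed values)
z = sum(a_q * u**q for q, a_q in enumerate(a))

# (3) multipath CIR: two received rays with assumed powers and delays (in samples)
h_c = np.zeros(16)
h_c[0], h_c[5] = 0.8, 0.2

# (4) PD output: responsivity * (biased signal convolved with the CIR) + noise
gamma, I_dc, sigma = 0.5, 0.3, 0.01
y = gamma * np.convolve(z + I_dc, h_c)[:N] + sigma * rng.standard_normal(N)
```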
At the receiver, the sampled signal $y(n)$ is fed into the Volterra-based NPE. Then, the corresponding output can be expressed as
$$\hat{x}(n)=\sum_{p=1}^{P}\sum_{l_1=0}^{L-1}\cdots\sum_{l_p=l_{p-1}}^{L-1}w_p(l_1,\dots,l_p)\prod_{j=1}^{p}y(n-l_j)+e(n), \qquad (5)$$
where $L$ denotes the memory length, $P$ is the nonlinear order, $w_p(l_1,\dots,l_p)$ is the p-th order Volterra kernel, and $e(n)$ is the modeling error. Let
$$\mathbf{y}(n)=[y(n),y(n-1),\dots,y(n-L+1)]^{T} \qquad (6)$$
represent the truncated samples of length $L$, which contain both the current and the past channel outputs.
As seen from (5), the calculation of $\hat{x}(n)$ is mainly related to $\mathbf{y}(n)$ and the kernels $w_p$. As is known, the main goal of the NPE is to produce the desired $\hat{x}(n)$ from $\mathbf{y}(n)$ so as to minimize the error with respect to the transmitted $x(n)$, which indicates that the useful information of $x(n)$ is contained in $\mathbf{y}(n)$. In other words, we can infer that $x(n)$ can be predicted from $\mathbf{y}(n)$ once all the kernels $w_p$ are well obtained. Therefore, from the perspective of learning and classification, both the kernels $w_p$ and $x(n)$ can be learned from the training sample set $\{\mathbf{y}(n),x(n)\}$, and the implementation of the NPE can be formulated as a prediction problem, for which the DL approach is very appropriate.
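To make this prediction formulation concrete, the short sketch below stacks truncated channel outputs as in (6) and pairs them with the corresponding transmitted samples, which is exactly the kind of supervised data a learning-based NPE is trained on; the memory length and the placeholder signals are assumptions, not values from this work.

```python
import numpy as np

def truncated_samples(y, L):
    """Build y(n) = [y(n), y(n-1), ..., y(n-L+1)]^T for every valid n, as in (6)."""
    return np.stack([y[n - L + 1:n + 1][::-1] for n in range(L - 1, len(y))])

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 2048)                     # placeholder transmitted signal
y = np.convolve(x, [0.8, 0.2], "same") - 0.1 * x**2  # placeholder distorted channel output

L = 9                              # assumed memory length
Y = truncated_samples(y, L)        # inputs  y(n), shape (len(y)-L+1, L)
X = x[L - 1:]                      # targets x(n), aligned with each y(n)
# Any post-equalizer is then a predictor f with f(Y[n]) ≈ X[n].
```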
  3. The Proposed Scheme
As we know, the LSTM is powerful in dealing with memory sequence prediction problems, since it can handle long-term dependencies and store the memory parameters that are related to the channel characteristics. In the VLC system, $x(n)$ experiences the complex electrical-to-optical and optical-to-electrical conversions, and the complicated overall channel nonlinearity contained in $\mathbf{y}(n)$ is implicit and far from intuitive, which greatly increases the computational complexity and learning difficulty. Furthermore, it leads to a slow convergence speed and a decrease in equalization performance. In order to improve the learning ability and accelerate the convergence speed, we therefore propose a novel impairment compensation scheme, which utilizes the Volterra structure to construct the spatial features and feeds them into a CNN-LSTM network to extract the characteristics of the memory nonlinearity. In the following analysis, we assume that system synchronization has already been achieved at the receiver.
  3.1. Input Preprocessing Based on Volterra Feature
The composition and structure of the raw input data can directly affect the performance of deep learning. Due to the complexity of the VLC channel, it is necessary to transform or encode $\mathbf{y}(n)$ so that it can be easily parsed by the machine. The main requirement for the proposed model to be accurate and precise in its predictions is that the algorithm should be able to easily interpret the data's features. As demonstrated in (5), for $p=1,\dots,P$, $\hat{x}(n)$ can be considered as the sum of the responses of each $\mathbf{y}_p(n)$ and $\mathbf{w}_p$, shown as
$$\hat{x}(n)=\sum_{p=1}^{P}\mathbf{w}_p^{T}\mathbf{y}_p(n)+e(n), \qquad (7)$$
where $\mathbf{y}_p(n)=\underbrace{\mathbf{y}(n)\otimes\cdots\otimes\mathbf{y}(n)}_{p}$, and $\mathbf{w}_p$ denotes the corresponding kernel coefficients for $\mathbf{y}_p(n)$. Let $\mathbf{v}(n)=[\mathbf{y}_1^{T}(n),\mathbf{y}_2^{T}(n),\dots,\mathbf{y}_P^{T}(n)]^{T}$. (7) can be further formed as
$$\hat{x}(n)=\mathbf{w}^{T}\mathbf{v}(n)+e(n), \qquad (8)$$
where $\mathbf{w}$ is the Volterra kernel vector and contains the corresponding kernel coefficients $\mathbf{w}_p$, which are arranged sequentially for the index $p$.
Therefore, $\mathbf{y}(n)$ should firstly be stacked for the last $L$ points, as shown in (6), and then transformed into the sequence $\mathbf{v}(n)$ by using the above approach based on the Volterra structure. In order to shrink the computational complexity and speed up the learning progress, the sequence $\mathbf{v}(n)$ is truncated to a length $K$, yielding $\tilde{\mathbf{v}}(n)$. Then, the first pattern can be formed in the following way:
$$\mathbf{V}_1=[\tilde{\mathbf{v}}(1),\tilde{\mathbf{v}}(2),\dots,\tilde{\mathbf{v}}(m)]. \qquad (9)$$
With the time sliding window moving forward one step, the second pattern can be generated by
$$\mathbf{V}_2=[\tilde{\mathbf{v}}(2),\tilde{\mathbf{v}}(3),\dots,\tilde{\mathbf{v}}(m+1)]. \qquad (10)$$
Continuing up to the last N-th point, multiple patterns $\{\mathbf{V}_1,\mathbf{V}_2,\dots\}$ can be obtained subsequently, which enter the neural network as the features of the input layer. Note that the step of the sliding window was set to 1 in this paper, and m also denotes the time step used in the following DL model.
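One possible NumPy realization of this preprocessing is sketched below: each truncated vector of (6) is expanded into its Volterra terms, truncated to a fixed length, and m consecutive feature vectors are grouped by a stride-1 sliding window into one spatial pattern. The memory length L, nonlinear order P, truncation length K, and time step m are illustrative assumptions.

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra_features(y_win, P=2, K=None):
    """Expand one truncated sample vector (6) into Volterra terms up to order P,
    i.e. all products y(n-l1)*...*y(n-lp) with l1<=...<=lp, then keep the first K."""
    feats = []
    for p in range(1, P + 1):
        for idx in combinations_with_replacement(range(len(y_win)), p):
            feats.append(np.prod(y_win[list(idx)]))
    v = np.asarray(feats)
    return v if K is None else v[:K]

def spatial_patterns(y, L=9, m=8, P=2, K=32):
    """Slide a stride-1 window over m feature vectors to build the input patterns."""
    V = np.stack([volterra_features(y[n - L + 1:n + 1][::-1], P, K)
                  for n in range(L - 1, len(y))])
    return np.stack([V[i:i + m] for i in range(len(V) - m + 1)])  # (num_patterns, m, K)

y = np.random.default_rng(2).standard_normal(1024)   # placeholder received sequence
patterns = spatial_patterns(y)                        # shape (num_patterns, m, K)
```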
  3.2. Network Structure
The architecture of the proposed model is depicted in Figure 2, which is composed of subnet $\mathcal{N}_1$ with $C_1$ convolution layers, subnet $\mathcal{N}_2$ with $C_2$ LSTM layers, and subnet $\mathcal{N}_3$ with $C_3$ dense layers.
The samples $\mathbf{V}_i$ are first fed into $\mathcal{N}_1$ for feature extraction. The structure of $\mathcal{N}_1$ is composed of convolution layers, pooling layers, a flatten layer, and a dense layer. Each convolution layer applies a series of two-dimensional convolution filters (2D-Conv) to $\mathbf{V}_i$ so as to extract different feature maps of the received signals. Let $k_l$ and $n_l$ denote the size of the convolutional kernel and the number of filters of the l-th convolutional layer, respectively. For simplicity, the stride was fixed to 1, and the ReLU activation function and same padding were employed in the 2D-Conv. After the convolution calculation, a max-pooling layer is used to extract the invariant features with nonlinear downsampling, which eliminates the non-maximal values. The same signal processing is implemented in the next 2D-Conv and max-pooling layer. After that, a flatten layer is employed to reshape the data, and a dense layer is linked behind it accordingly.
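Under assumed hyperparameters (two 2D-Conv/max-pooling stages, 3×3 kernels, 16 and 32 filters, and a 128-unit dense layer, none of which are taken from this work), a minimal Keras sketch of subnet $\mathcal{N}_1$ could look as follows.

```python
import tensorflow as tf
from tensorflow.keras import layers

m, K = 8, 32                        # time steps and truncated feature length (assumed)

n1 = tf.keras.Sequential([
    tf.keras.Input(shape=(m, K, 1)),                   # one spatial pattern as a 2D map
    layers.Conv2D(16, (3, 3), strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),              # dense layer closing N1
])
n1.summary()
```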
The CNN outputs are first transformed into a three-dimensional tensor and fed into $\mathcal{N}_2$, where the tensor size is matched to the cell number of the first layer of $\mathcal{N}_2$. The structure of $\mathcal{N}_2$ is made up of cascaded sub-layer blocks composed of multiple LSTM cells, and different numbers of LSTM cells can be deployed in different sub-layers. The internal structure of a single LSTM cell, as shown in Figure 2, contains three Sigmoid gates: the forget gate, the input gate, and the output gate. These gates selectively influence the model state at each time step. The forget gate is the core of a single LSTM cell, since it determines which information should be retained or discarded according to the current input $x_t$ and the previous output $h_{t-1}$. The output of the forget gate can be expressed as
$$f_t=\sigma\big(W_f[h_{t-1},x_t]+b_f\big), \qquad (11)$$
where $\sigma(\cdot)$ is the Sigmoid function and $W_f$ and $b_f$ represent the parameter matrix and bias of the forget gate, respectively. After forgetting part of the previous state, the input gate picks up some new information and adds it to the former cell state $c_{t-1}$. Therefore, the new cell state is formed as
$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t, \qquad (12)$$
where $\tilde{c}_t$ denotes the temporary cell state and $i_t$ is the output of the input gate. Furthermore, the expression of $\tilde{c}_t$ is given as
$$\tilde{c}_t=\tanh\big(W_c[h_{t-1},x_t]+b_c\big), \qquad (13)$$
and the output $i_t$ can be expressed as
$$i_t=\sigma\big(W_i[h_{t-1},x_t]+b_i\big), \qquad (14)$$
where $W_c$, $W_i$, $b_c$, and $b_i$ denote the corresponding parameter matrices and biases, respectively. Therefore, the values of $f_t$ and $i_t$, which lie between 0 and 1, indicate the proportions of important information in $c_{t-1}$ and $\tilde{c}_t$, thereby determining which information is to be updated. Then, the output of the LSTM cell can be calculated by
$$h_t=o_t\odot\tanh(c_t),\qquad o_t=\sigma\big(W_o[h_{t-1},x_t]+b_o\big), \qquad (15)$$
where $o_t$ is the output of the output gate with parameter matrix $W_o$ and bias $b_o$.
As a result, the outputs of this layer are fed into the corresponding LSTM cells of the next layer. As for the last layer, only the output $h_t$ at the last time step is selected and composed as the final output vector, whose dimension equals the cell number of the last layer in $\mathcal{N}_2$. This output is then reshaped into a column vector and fed into the dense net $\mathcal{N}_3$ to refine the results. Note that the linear activation function is deployed in $\mathcal{N}_3$ without normalization. Finally, the equalized $\hat{x}(n)$ is directly obtained at the output of $\mathcal{N}_3$, and the overall VLC nonlinearity is efficiently compensated by the proposed scheme.
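Assembling the three subnets, a hedged end-to-end Keras sketch is given below; the reshape that turns the $\mathcal{N}_1$ output into a (time step, feature) tensor, the number of LSTM cells per sub-layer, and the dense sizes are assumptions chosen only to make the pipeline concrete, not the configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

m, K = 8, 32                                   # pattern size (assumed)
lstm_units = (64, 32)                          # cells per LSTM sub-layer (assumed)

inp = tf.keras.Input(shape=(m, K, 1))
# Subnet N1: convolutional feature extraction
h = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(inp)
h = layers.MaxPooling2D((2, 2))(h)
h = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(h)
h = layers.MaxPooling2D((2, 2))(h)
h = layers.Flatten()(h)
h = layers.Dense(m * 16, activation="relu")(h)
# Reshape the CNN output into a (time steps, features) tensor for N2
h = layers.Reshape((m, 16))(h)
# Subnet N2: stacked LSTM layers; only the last time step of the last layer is kept
h = layers.LSTM(lstm_units[0], return_sequences=True)(h)
h = layers.LSTM(lstm_units[1], return_sequences=False)(h)
# Subnet N3: dense refinement with linear activations, one equalized sample per pattern
h = layers.Dense(32)(h)
out = layers.Dense(1)(h)

model = tf.keras.Model(inp, out)
model.summary()
```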
  3.3. Complexity
For the computational complexity, it is worth noting that the calculations of $\mathcal{N}_1$ and $\mathcal{N}_2$ are dominant in each time step. Let $M_l$ be the spatial size of the output feature map in the l-th convolutional layer, which can be calculated by
$$M_l=M_{l-1}-k_l+2P_l+1, \qquad (16)$$
where $M_{l-1}$ is the input size and $P_l$ is the padding length. Furthermore, we define the cell number of each layer in $\mathcal{N}_2$ to be equal to $n_c$. Accordingly, the overall complexity of the proposed model per time step can be approximately expressed as
$$\mathcal{O}\Big(\sum_{l=1}^{C_1}M_l^{2}k_l^{2}n_{l-1}n_l+4C_2\big(n_c^{2}+n_c n_{\mathrm{in}}\big)\Big), \qquad (17)$$
where $n_{l-1}$ and $n_l$ are the numbers of input and output feature maps of the l-th convolutional layer and $n_{\mathrm{in}}$ is the input size of each LSTM layer.
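To illustrate how these terms add up, the small sketch below evaluates the stride-1 feature-map recursion of (16) and tallies the dominant multiplications of the convolutional and LSTM layers per time step; all layer sizes are the same illustrative assumptions used in the earlier sketches, and pooling is ignored for brevity.

```python
# Rough per-time-step multiply count under assumed layer sizes.
conv_layers = [        # (kernel size k_l, input maps n_{l-1}, output maps n_l, padding P_l)
    (3, 1, 16, 1),
    (3, 16, 32, 1),
]
M = 32                 # assumed input spatial size M_0 (pooling omitted for brevity)
conv_ops = 0
for k, c_in, c_out, pad in conv_layers:
    M = M - k + 2 * pad + 1                    # Eq. (16) with stride 1
    conv_ops += (M ** 2) * (k ** 2) * c_in * c_out

n_in, n_c, C2 = 128, 64, 2                     # LSTM input size, cells per layer, layers
lstm_ops = 4 * n_c * (n_in + n_c) + (C2 - 1) * 4 * n_c * (2 * n_c)   # four gates per cell

print(f"conv ≈ {conv_ops:.2e} multiplications, lstm ≈ {lstm_ops:.2e} per time step")
```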
  3.4. Training Strategy
The proposed scheme was trained by viewing the VLC channel as a black box. Fortunately, researchers have developed several reference channel models for indoor VLC environments [26]. Therefore, the training data can easily be obtained by simulations [12]. To collect the training set, the receiving plane is divided into several equally spaced grid units used as potential locations for the PD. After VLC transmission and optical-to-electrical conversion, the received signals are collected at the different PD locations. With the preprocessing of the received signal described above, every spatial pattern and the corresponding sample of the transmitted signal are combined as one training example. Practically, a diverse and abundant training set, covering the potential PD locations, should be collected to enhance the parameter learning ability of the proposed scheme.
The direct-current-biased optical (DCO)-OFDM signal, containing 512 sub-carriers in total with 16-QAM constellation mapping, is adopted as the training symbol, and only five symbols are randomly generated in each training epoch. Moreover, the NMSE between the raw $x(n)$ and the equalized $\hat{x}(n)$ is employed as the training loss function, given by
$$\mathrm{NMSE}=\frac{\|\hat{\mathbf{x}}-\mathbf{x}\|_2^{2}}{\|\mathbf{x}\|_2^{2}}. \qquad (18)$$
Note that the DC gain is removed from the training set so that the training loss of the proposed scheme can be fairly evaluated. Furthermore, the training procedure was implemented in TensorFlow on a workstation with an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU); adaptive moment estimation (Adam) was adopted as the optimizer, and the learning rate was fixed to 0.0001. In the testing stage, only several specific links were adopted to evaluate the system performance for simplicity of demonstration.
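A hedged TensorFlow training sketch consistent with this description (Adam optimizer at a 1e-4 learning rate and an NMSE loss as in (18)) is shown below; the DCO-OFDM generation and the indoor channel simulation are replaced by random placeholder arrays, and `model` refers to the architecture sketch given earlier.

```python
import numpy as np
import tensorflow as tf

def nmse_loss(x_true, x_pred):
    """Normalized MSE between the raw and equalized signals, as in (18)."""
    num = tf.reduce_sum(tf.square(x_pred - x_true))
    den = tf.reduce_sum(tf.square(x_true)) + 1e-12
    return num / den

# `model` is the CNN-LSTM sketch above; the arrays below are random placeholders
# standing in for the preprocessed DCO-OFDM patterns and their transmitted samples.
patterns = np.random.randn(1024, 8, 32, 1).astype("float32")
labels = np.random.randn(1024, 1).astype("float32")

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=nmse_loss)
history = model.fit(patterns, labels, epochs=50, batch_size=256, validation_split=0.1)
```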