LSTM Attention Neural-Network-Based Signal Detection for Hybrid Modulated Faster-Than-Nyquist Optical Wireless Communications

In order to improve the accuracy of signal recovery after transmission over the atmospheric turbulence channel, a deep-learning-based signal detection method is proposed for a faster-than-Nyquist (FTN) hybrid modulated optical wireless communication (OWC) system. It takes advantage of the long short-term memory (LSTM) network, a variant of the recurrent neural network (RNN), to alleviate the interdependence problem of adjacent symbols. Moreover, an LSTM attention decoder is constructed by employing the attention mechanism, which alleviates the shortcomings of the conventional LSTM. The simulation results show that the bit error rate (BER) performance of the proposed LSTM attention neural network is 1 dB better than that of the back propagation (BP) neural network and 2.5 dB better than that of the maximum likelihood sequence estimation (MLSE) detection method.


Introduction
Compared with traditional radio frequency (RF) communication, OWC has the advantages of high system capacity, interference immunity, good security, flexible and rapid deployment, and low cost [1]. However, the transmission of optical wireless signals is affected by atmospheric turbulence, atmospheric absorption, scattering, and refraction. It is difficult for the receiver to obtain an accurate signal under the variation in the refractive-index structure constant caused by random variations in atmospheric temperature and pressure. The multiband carrierless amplitude phase (CAP) modulation technique was proposed in [2] to improve the system capacity and frequency band utilization of OWC under the Gamma-Gamma atmospheric turbulence channel. The experimental results demonstrate that the phase shift is well compensated and the inter-symbol interference (ISI) is effectively suppressed using the multi-modulus algorithm (MMA). However, to achieve ISI-free transmission in digital communication systems, the symbol rate must follow the Nyquist criterion. This limits further improvement in the spectral efficiency of OWC systems.
In 1975, Mazo proved that higher transmission rates could be achieved using FTN technology [3]. However, FTN introduces ISI in exchange for spectral efficiency, which increases the difficulty of signal detection. In recent years, studies on FTN have mainly focused on model-driven detection algorithms. Detection algorithms such as linear detection based on minimum mean square error (MMSE) estimation [4] and zero forcing (ZF), the maximum a posteriori (MAP) algorithm [5], and nonlinear detection algorithms perform unsatisfactorily for FTN signals with a high acceleration factor, and their implementation complexity is extremely high. It is interesting that FTN signal detection can instead be treated as a data-driven sequence recovery task suited to deep learning (DL). Introducing the attention mechanism not only benefits the network, but also improves the discriminative power of the features. Therefore, the attention mechanism is introduced into the LSTM network to build an LSTM attention decoder for the signal detection of a pulse position modulation (PPM) and quadrature phase shift keying (QPSK) hybrid modulated FTN OWC system, to improve the system performance while ensuring spectrum efficiency.

System Model
Traditionally, intensity modulation/direct detection based on an on-off keying (OOK) scheme is widely accepted in OWC owing to its easy implementation and lower cost [20]. Considering the low BER performance and spectrum efficiency of OOK, PPM has been considered for OWC. Compared with OOK, PPM greatly increases energy utilization. In addition, QPSK modulation has the characteristics of high spectrum utilization and strong anti-interference capability [21]. Therefore, combining PPM and QPSK can improve the data transmission rate and the system reliability [22,23]. Figure 1 shows the schematic of the 4PPM and QPSK hybrid modulated OWC system with FTN technology. The user data, after Gray encoding, are first mapped into 4PPM and QPSK symbols, respectively. Thereafter, the QPSK signal is loaded into the time slot of the 4PPM signal to form the 4PPM-QPSK hybrid modulated signal. Afterwards, the formatted 4PPM-QPSK signal is sent to the FTN shaping filter for FTN signal forming. Subsequently, after digital-to-analogue conversion (DAC), the data are launched into the atmospheric channel. At the receiver end, the optical signal transmitted over the atmospheric channel is first detected by a photodiode (PD) and then sent for analog-to-digital conversion (ADC), matched filtering, and sampling. Thereafter, the signal is sent to the DL module for data recovery.
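To make the hybrid mapping concrete, the following minimal Python sketch maps 4 user bits to one 4PPM-QPSK symbol: 2 Gray-coded bits select the active PPM slot and 2 Gray-coded bits select the QPSK phase loaded into that slot. The specific Gray maps and the unit-energy constellation are illustrative assumptions, not the exact mapper used in the system.

```python
import numpy as np

# Hypothetical 4PPM-QPSK mapper: 2 bits -> Gray-coded slot index,
# 2 bits -> Gray-coded QPSK phase placed in the active slot.
GRAY2 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}        # 2-bit Gray map
QPSK = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))  # unit-energy QPSK

def map_4ppm_qpsk(bits):
    """Map 4 bits to one 4PPM-QPSK symbol: a length-4 complex slot vector."""
    assert len(bits) == 4
    slot = GRAY2[tuple(bits[:2])]    # which of the 4 PPM slots is lit
    phase = GRAY2[tuple(bits[2:])]   # QPSK symbol carried in that slot
    sym = np.zeros(4, dtype=complex)
    sym[slot] = QPSK[phase]
    return sym

# Example: 8 random bits -> 2 hybrid symbols (frames)
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 8)
frame = np.concatenate([map_4ppm_qpsk(bits[i:i + 4]) for i in range(0, 8, 4)])
print(frame)
```

Each 4PPM frame thus carries 2 + 2 = 4 user bits, which is the source of the rate and reliability advantage over plain 4PPM noted above.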
The S_4PPM-QPSK signal formed by the FTN shaping filter can be expressed as

$$S_{4PPM\text{-}QPSK}(t)=\sqrt{E}\sum_{\rho}a_{\rho}\,r(t-\rho\tau T),$$

where E is the pulse power; r(t) is the pulse shape of FTN; τ is the time acceleration factor (0 < τ < 1), which is the parameter characterizing the Nyquist compression ratio; a_ρ is the information carried by the ρ-th symbol of S_4PPM-QPSK; and T is the symbol period. When the signal passes through the atmospheric channel, the received optical signal can be expressed as

$$Y(t)=hS(t)+Z_n(t),$$

where Z_n(t) is the channel additive noise, S(t) is the transmitted optical signal, and h is the channel fading coefficient, which follows the Gamma-Gamma distribution. Therefore, the probability density function of h can be expressed as

$$f(h)=\frac{2(\alpha\beta)^{(\alpha+\beta)/2}}{\Gamma(\alpha)\Gamma(\beta)}\,h^{\frac{\alpha+\beta}{2}-1}\,Q_{\alpha-\beta}\!\left(2\sqrt{\alpha\beta h}\right),$$

where Q_{α−β}(·) is the modified Bessel function of the second kind of order α − β; Γ(·) is the Gamma function; and α and β are the large- and small-scale scattering coefficients, respectively. After electro-optic conversion, the obtained electrical signal can be expressed as

$$R(t)=\eta G_{OC}\,hS(t)+Z_n(t),$$

where η denotes the electro-optic conversion ratio, G_OC denotes the average transmitted power, and Z_n(t) denotes the overall interference carried in the signal. Sampling is carried out after the ADC and matched filter, and its output can be written as

$$S_{PD}(k)=\sum_{n}a_{n}\,g\big((k-n)\tau T\big)+z_{k},\qquad g(t)=\int r(u)\,r^{*}(u-t)\,du,$$

where a_n denotes the data sequence, r*(t) denotes the complex conjugate of r(t), and z_k is the filtered noise sample.
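A minimal numpy sketch of the FTN shaping step is given below, assuming a root-raised-cosine pulse for r(t) and an integer sample grid; the oversampling factor, filter span, and normalization are illustrative choices rather than the paper's parameters. The key point is that symbols are spaced τT apart instead of T, which is what introduces the controlled ISI.

```python
import numpy as np

def rrc_pulse(span, sps, beta):
    """Root-raised-cosine pulse, `span` symbol periods long, `sps` samples/symbol."""
    t = np.arange(-span * sps / 2, span * sps / 2 + 1) / sps  # time in symbol periods
    h = np.empty_like(t)
    for i, ti in enumerate(t):
        if np.isclose(ti, 0.0):
            h[i] = 1.0 + beta * (4.0 / np.pi - 1.0)
        elif np.isclose(abs(ti), 1.0 / (4.0 * beta)):
            h[i] = (beta / np.sqrt(2)) * ((1 + 2 / np.pi) * np.sin(np.pi / (4 * beta))
                                          + (1 - 2 / np.pi) * np.cos(np.pi / (4 * beta)))
        else:
            num = (np.sin(np.pi * ti * (1 - beta))
                   + 4 * beta * ti * np.cos(np.pi * ti * (1 + beta)))
            h[i] = num / (np.pi * ti * (1 - (4 * beta * ti) ** 2))
    return h / np.sqrt(np.sum(h ** 2))  # unit energy

def ftn_shape(symbols, tau, sps=10, beta=0.6, span=10):
    """FTN pulse shaping: place symbols every tau*sps samples instead of sps."""
    step = int(round(tau * sps))  # compressed symbol spacing (tau < 1)
    x = np.zeros((len(symbols) - 1) * step + 1, dtype=complex)
    x[::step] = symbols
    return np.convolve(x, rrc_pulse(span, sps, beta))

# Example: shape 100 random QPSK symbols at tau = 0.8
rng = np.random.default_rng(0)
syms = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, 100)))
tx = ftn_shape(syms, tau=0.8)  # sps = 10 so that tau * sps is an integer
```

In practice, sps would be chosen so that τ·sps is an integer (here sps = 10 for τ = 0.8, giving a pulse every 8 samples instead of 10).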
Then, S_PD is sent to the DL module for training and testing. Eventually, the mapping relationship corresponding to the original signal is determined.

LSTM Attention Decoder
RNN is an important branch of DL that can be used not only for processing time series data, but also to capture the temporal structure of the feature model. In addition, it is useful for processing sequence data in which earlier inputs affect later outputs [24]. However, traditional RNNs suffer from gradient disappearance and gradient explosion as the timeline expands. Gradient disappearance occurs because of the Sigmoid function. The Sigmoid function is usually employed in the output layer, but its derivative ranges from 0 to 0.25. When the BP algorithm is utilized to calculate the gradient, the gradient of each layer is reduced to at most 1/4 of the original. If there are many network layers, the gradient becomes vanishingly small. The initial network weights need to be set larger than 1 to avoid this phenomenon, but this leads to gradient explosion [25]. Consequently, traditional RNNs have great limitations in the prediction of long time series data. Figure 2 shows the diagram of the RNN network unfolded along the time line.
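The 1/4-per-layer shrinkage can be checked numerically; the short sketch below (illustrative only) evaluates the sigmoid derivative and the resulting worst-case gradient factor after unrolling a number of time steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)) peaks at 0.25 (at x = 0)
x = np.linspace(-6.0, 6.0, 1001)
d = sigmoid(x) * (1.0 - sigmoid(x))
print(f"max sigmoid derivative: {d.max():.4f}")

# Worst-case gradient attenuation after T unrolled steps is 0.25**T
for steps in (10, 50, 100):
    print(f"{steps} steps: gradient factor <= {0.25 ** steps:.3e}")
```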
In order to solve the problems existing in RNN, the LSTM network is proposed to address the ubiquitous long-term dependence problem, and it has been proved effective in solving the gradient disappearance and gradient explosion problems of RNN [24]. The biggest difference between LSTM and RNN is that RNN has only one state inside a single recurrent structure, while LSTM has four states, and each structure is composed of an input gate, forget gate, output gate, and cell state. The diagram of the LSTM network is shown in Figure 3. ⊗ denotes the element-wise multiplication of vectors and ⊕ denotes the element-wise addition of vectors. Both the input and output gates open and propagate signals only when previous information is needed. In this way, previous information can be saved selectively. The function of the forget gate is to receive the error from the memory unit and "forget" the value stored in the memory unit when needed, so as to achieve control of the network weights.

LSTM prevents the gradient disappearance problem by defining "gate" operations (i.e., f_t, i_t, o_t) as follows:

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$
$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$
$$\tilde{C}_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$$
$$C_t=f_t*C_{t-1}+i_t*\tilde{C}_t \qquad (9)$$
$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$
$$h_t=o_t*\tanh(C_t)$$

where σ denotes the sigmoid() activation function; * denotes element-wise multiplication; f_t denotes the probability of forgetting the previous information, and ranges from 0 to 1; h_{t−1} denotes the output of the previous moment; x_t denotes the input at the current moment; W_f and b_f denote the weight and bias of the forget gate, respectively; i_t denotes the retained probability of information from the input gate; W_i and b_i denote the weight and bias of the input gate, respectively; C̃_t denotes the candidate information from the input gate, with the tanh activation function normalizing the values to the range −1 to 1; W_c and b_c denote the weight and bias of the cell state, respectively; C_t denotes the cell state; o_t denotes the probability of information being sent from the output gate; and W_o and b_o denote the weight and bias of the output gate, respectively. As shown in Equation (9), f_t multiplied by C_{t−1} denotes the information selectively forgotten from the previous moment. The second term, C̃_t multiplied by i_t, denotes the information selectively retained at the present moment. At this moment, the cell state C_t is updated. Thereafter, C_t is scaled by tanh and multiplied by o_t to obtain the final output h_t. Through the above operations, LSTM saves the learned features as memory and selectively retains or forgets the saved memories during training. After several iterations, the important feature information is retained, which gives the network better performance in tasks with long-time dependence.
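The gate equations translate directly into a few lines of numpy. The sketch below is a generic single-cell step with toy dimensions and random weights, an illustration of the recurrence rather than the trained decoder itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the gate equations above.
    W: weight matrices over the concatenated [h_prev, x_t]; b: biases."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])    # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])    # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat      # Equation (9): cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])    # output gate
    h_t = o_t * np.tanh(c_t)              # hidden output
    return h_t, c_t

# Toy dimensions: 4 inputs (one received 4PPM-QPSK slot vector), 8 hidden units
rng = np.random.default_rng(0)
n_in, n_h = 4, 8
W = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```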
Both RNN and LSTM networks are designed to handle long time series. However, when LSTM deals with the gradient explosion problem, the network may still forget previously useful information because of the forget gate. This deteriorates the effect of long-sequence training and the system performance. Fortunately, the attention mechanism is a great solution to this problem [26]. Variants of attention mechanisms include multi-head attention, hard attention, structured attention, and key-value pair attention [27]. Multi-head attention utilizes multiple queries computed in parallel to select multiple pieces of information from the input, with each head focusing on a different part of the input. Hard attention can be implemented in two ways: one selects the input information with the highest probability; the other randomly samples from the attention distribution. Structured attention picks out task-relevant information from the input. Key-value pair attention employs a key-value pair format to represent the input information, where the "key" is utilized to calculate the attention distribution and the "value" is utilized to generate the selected information. Considering its excellent performance, the key-value pair attention mechanism is employed in our proposal to construct the LSTM attention decoder, and it can be utilized to process the received FTN hybrid signals. The diagram of the LSTM attention decoder is shown in Figure 4.

As shown in Figure 5, the whole calculation process of attention can be summarized in three steps [28-30]. Firstly, S_PD is sent to the LSTM network, the outputs of the hidden layer are transmitted to the key module, and the time series signal of S_PD is sent to the query module. Thereafter, the key (K) and query (Q) modules perform a similarity calculation to obtain the weight of the attention module. The similarity calculation can be written as

$$s(Q,K)=QK^{T},$$

where s denotes the similarity calculation, T denotes the transpose operation, Q denotes the target matrix to be obtained, and K denotes the actual detected matrix. It should be noted that the goal of using a neural network for training and testing is to determine the mapping relationship corresponding to the original signal. However, when S_PD is sent to the network, the output is not unique.
It is difficult to determine the mapping relationship, which is not conducive to the back propagation of the network. Therefore, the Softmax function is employed to normalize the output of s(Q, K) to (0, 1). The probability formula of the Softmax function can be written as

$$a_q=\frac{e^{S_q}}{\sum_{y=1}^{f}e^{S_y}},$$

where f denotes the number of output states of the network. As mentioned above, the number of output states is 4. Therefore, each output signal can be denoted by a 1 × 4 matrix. S_y denotes the y-th output of S_PD and S_q denotes the current value to be calculated in S_y. Finally, the output of the Softmax function (a) and the actual output "value" of the LSTM network are combined in a weighted sum to obtain the attention value; ε denotes this weighted sum. This is the final result of the proposed LSTM attention decoder.
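The three-step computation (similarity, Softmax normalization, weighted sum) amounts to a few matrix operations. The numpy sketch below follows the 1 × 4 output matrices described above; the query, key, and value contents are random placeholders standing in for the S_PD time series and the LSTM hidden-layer outputs.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def kv_attention(Q, K, V):
    """Key-value attention as described: similarity s(Q,K) = Q K^T,
    Softmax normalization, then a weighted sum of the values."""
    a = softmax(Q @ K.T)  # attention weights over the f output states
    return a @ V          # epsilon: the weighted sum

# Toy example: 1 query over f = 4 output states (1 x 4 matrices per the text)
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4))   # target matrix (from the S_PD time series)
K = rng.normal(size=(4, 4))   # keys: LSTM hidden-layer outputs
V = rng.normal(size=(4, 4))   # values: actual LSTM outputs
print(kv_attention(Q, K, V))  # attention value, shape (1, 4)
```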

Simulation Analysis
The size of the training or test dataset depends on the complexity of the system and the DL algorithm. Using a small dataset may cause poor detection performance because the model would be incapable of fully learning the diverse characteristics of the system. Further, using a large dataset may result in increased computational complexity [30]. Thus, several simulations are conducted to determine the suitable dataset size and the parameters that offer the best BER performance. It should be noted that the accuracy of the neural network is affected by the DL algorithm itself, which plays an important role in solving some nonlinear problems. Therefore, its performance and robustness need to be evaluated. Without loss of generality, some common parameters are taken into consideration. The parameters used in the simulation are listed in Table 1.

Table 2 shows the accuracy of the network under different learning rates. A validated system needs an appropriate learning rate. If the learning rate is too large, the network cannot converge, while if it is too small, the network converges very slowly or may be unable to finish learning. Moreover, the network may change from underfitting to overfitting as the learning rate increases [31]. It is evident from the table that 0.002 gives the best performance.

Table 3 shows the effect of the cycle index on accuracy. The cycle index has a great impact on the network accuracy. If the cycle index is too low, the trained network's predictions will be less accurate. If the cycle index is too high, the computational complexity will increase dramatically. As shown in Table 3, the cycle index can be set at 50 for the best accuracy.

The selection of hidden layers is another key point. A low or high number of hidden layers leads to underfitting or overfitting, respectively [31]. The relationship between the number of hidden layers and accuracy is shown in Table 4. The accuracy increases gradually with the number of hidden layers and declines after it reaches a certain value. In addition, studies in [32,33] found that an increasing number of hidden layers results in a significant increase in computational complexity and overfitting. The causes of overfitting can be divided into three categories [34]. The first is a training dataset too small to reflect all possible situations, which leads to less accurate predictions by the trained network; therefore, the training dataset should cover all types of data as much as possible. The second is a network that cannot accurately estimate the relationship between input and output because of excessive interference in the training data. The third is the high complexity of the network: in this circumstance, it must fit many parameters to match every sample in the training dataset, so the trained network cannot generalize to the test dataset. Therefore, an appropriate number of hidden layers is crucial to the system performance. The simulation results in Table 4 show that the system has the best detection performance when the number of hidden layers is 8.

The comparison of the accuracy of the LSTM network and the LSTM attention network is shown in Table 5. It is clear that the accuracy of the LSTM attention network is significantly higher than that of the LSTM network.
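Putting the selected hyperparameters together, a hedged PyTorch sketch of the resulting training configuration is shown below (learning rate 0.002 from Table 2, 50 training cycles from Table 3, 8 stacked hidden layers from Table 4); the layer width, optimizer, loss, and placeholder data are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LSTMAttention(nn.Module):
    def __init__(self, n_in=4, n_hidden=32, n_layers=8, n_out=4):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, num_layers=n_layers, batch_first=True)
        self.attn = nn.Linear(n_hidden, 1)       # one attention score per time step
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):                        # x: (batch, time, n_in)
        h, _ = self.lstm(x)                      # hidden states act as keys/values
        a = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        ctx = (a * h).sum(dim=1)                 # weighted sum (the epsilon above)
        return self.out(ctx)

model = LSTMAttention()
opt = torch.optim.Adam(model.parameters(), lr=0.002)  # Table 2: best learning rate
loss_fn = nn.CrossEntropyLoss()
for epoch in range(50):                               # Table 3: cycle index = 50
    x = torch.randn(64, 16, 4)                        # placeholder batch
    y = torch.randint(0, 4, (64,))                    # placeholder labels
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```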
It is well known that rain, snow, sleet, fog, haze, pollution, and other atmospheric factors impact laser beams. Their presence causes reflection, refraction, scattering, and attenuation of optical signals. It has been proven that atmospheric turbulence follows the Gamma-Gamma distribution, and weak, moderate, and strong turbulence intensities can be expressed by refractive-index structure constants of C_n^2 = 2 × 10^−18, C_n^2 = 2 × 10^−15, and C_n^2 = 2 × 10^−12, respectively [35]. The curves of BER versus different atmospheric turbulence intensities are shown in Figure 6, where the roll-off factor is 0.6, τ = 0.8, and the transmission distance is 500 m. It is evident from the figure that the BER performance improves gradually with decreasing turbulence intensity. When BER = 3.8 × 10^−3, the performance under weak turbulence is about 2 dB and 5 dB better than that under moderate and strong turbulence, respectively.

Figure 7 shows the influence of the roll-off factor of the FTN shaping filter on BER with different decoders, where the acceleration factor is 0.8. As shown in Figure 7a, when BER = 10^−4, the LSTM attention decoder improves the BER performance by about 1 dB compared with the BP algorithm. Figure 7b shows that the LSTM attention decoder improves the BER performance by about 2.5 dB compared with the MLSE algorithm at BER = 10^−4. Therefore, the LSTM attention decoder is beneficial for improving the BER performance compared with traditional decoders.

Figure 8 shows the impact of the acceleration factor on the system BER performance. When the BER is 10^−4 and the acceleration factor decreases from 1 to 0.9 and to 0.8, the BER performance decreases by about 2 dB and 4 dB, respectively. When the BER is 10^−3 and the acceleration factor decreases from 1 to 0.9 and to 0.8, the BER performance declines by about 1 dB and 4.5 dB, respectively. It can be concluded from the figure that the BER curves degrade rapidly as the acceleration factor decreases. However, under the premise of improving the spectrum efficiency, the system can still ensure good communication quality when the acceleration factor is 0.8. Thus, the proposal is beneficial to improving the performance of the system.
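As an illustration of how the quoted C_n^2 values translate into channel realizations, the sketch below derives α and β from the plane-wave Rytov variance and draws Gamma-Gamma fading coefficients as the product of two independent Gamma variates. The 1550 nm wavelength and the specific α/β expressions are assumptions taken from the standard turbulence literature rather than from the paper.

```python
import numpy as np

def gg_params(Cn2, L, wavelength=1550e-9):
    """Gamma-Gamma alpha/beta from the refractive-index structure constant
    (plane-wave Rytov model from the standard turbulence literature)."""
    k = 2 * np.pi / wavelength
    sR2 = 1.23 * Cn2 * k ** (7 / 6) * L ** (11 / 6)  # Rytov variance
    alpha = 1 / (np.exp(0.49 * sR2 / (1 + 1.11 * sR2 ** (6 / 5)) ** (7 / 6)) - 1)
    beta = 1 / (np.exp(0.51 * sR2 / (1 + 0.69 * sR2 ** (6 / 5)) ** (5 / 6)) - 1)
    return alpha, beta

def gg_fading(alpha, beta, n, rng):
    """h = X * Y with X~Gamma(alpha), Y~Gamma(beta), scaled so E[h] = 1."""
    return rng.gamma(alpha, 1 / alpha, n) * rng.gamma(beta, 1 / beta, n)

rng = np.random.default_rng(0)
for label, Cn2 in [("weak", 2e-18), ("moderate", 2e-15), ("strong", 2e-12)]:
    a, b = gg_params(Cn2, L=500.0)  # 500 m link, as in Figure 6
    h = gg_fading(a, b, 100_000, rng)
    print(f"{label}: alpha={a:.2f}, beta={b:.2f}, var(h)={h.var():.3f}")
```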

Computational Complexity
In order to further illustrate the advantages of the proposed method, the running times of the LSTM attention and BP networks are compared, as shown in Table 6. The time complexity is tied to the hardware execution and includes the number of operations needed, the number of elements to process, and the path length needed to complete an operation. The simulation experiments are implemented in Matlab 2018a and PyCharm 2021.3.2, with an NVIDIA GeForce RTX 3050 Laptop GPU as the test platform. In the training process, 50,000 data samples are randomly generated, of which 80% are used as the training dataset and the remaining 20% as the test dataset. It is evident that the LSTM attention network outperforms the BP network; this is because the convergence speed of the BP neural network is slow.
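A minimal harness reproducing the described setup (50,000 samples, an 80/20 split, wall-clock timing of each detector) might look as follows; train_lstm_attention and train_bp are hypothetical stand-ins for the two networks under test.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50_000, 16))      # placeholder received sequences
labels = rng.integers(0, 4, size=50_000)  # placeholder symbol labels

# 80% training / 20% test split, as described above
idx = rng.permutation(len(data))
split = int(0.8 * len(data))
train, test = idx[:split], idx[split:]

def timed(fn, *args):
    """Return fn(*args) together with its wall-clock running time."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Hypothetical usage (the two trainers are stand-ins, not defined here):
# _, t_attn = timed(train_lstm_attention, data[train], labels[train])
# _, t_bp   = timed(train_bp, data[train], labels[train])
# print(f"LSTM attention: {t_attn:.1f} s, BP: {t_bp:.1f} s")
```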

Conclusions
In this paper, an LSTM attention decoder is proposed for signal detection in the hybrid modulated 4PPM-QPSK FTN OWC system. The LSTM attention network can alleviate the problems of gradient disappearance, gradient explosion, and interdependence between adjacent symbols. The simulation results show that our proposal has outstanding signal detection performance for hybrid modulated FTN signals: the received signal can be accurately predicted and quickly and correctly decoded. Hence, the scheme can effectively improve the BER performance while ensuring the spectrum efficiency.