FPGA Implementation of a BPSK 1D-CNN Demodulator

In this paper, we propose a field programmable gate array (FPGA) implementation of a one-dimensional convolutional neural network (1D-CNN) demodulator for binary phase shift keying (BPSK). The 1D-CNN demodulator includes two 1D-CNNs and a decision module. Discrete time series of BPSK signals are fed into the trained 1D-CNNs, which detect the moment and type of each phase shift, namely a shift from 0 to π or from π to 0. The decision module combines the results of the two 1D-CNNs and outputs the demodulated data. To improve the resource utilization efficiency and operation speed of the FPGA circuit, a time-delay network for the convolution calculation and a piecewise approximation structure for the activation function were designed. To enhance the performance of the 1D-CNN demodulator, general and diverse training data covering five impact factors were generated. Experimental results under different channel conditions show that the proposed demodulator adapts well to frequency offset and has short latency. The demodulation loss of the proposed demodulator can almost be kept within 2 dB.

To realize a BPSK demodulator, the cooperation of hardware platforms and algorithms is necessary. So far, the most frequently used hardware platforms include analog circuits using discrete components, application-specific integrated circuits (ASIC), and general-purpose programmable devices. Analog circuit platforms have been rarely used in recent years, due to the instability of the discrete components [15]. ASICs have a small volume and low power consumption, but the cost is much higher and the function cannot be modified once the chip tape-out has been completed [16]. Among the general-purpose programmable devices, FPGAs have the advantages of remarkable stability, repeatable programming, and high energy efficiency. In recent years, FPGA implementation in BPSK demodulation has become a popular trend [17].
The traditional algorithms of BPSK demodulation can be divided into two categories: coherent demodulation [18,19] and noncoherent demodulation [20]. The coherent demodulation regenerates a local carrier, which has the same frequency and phase as the modulated carrier by carrier

Basic Principle of 1D-CNN Demodulation
The modulated BPSK signal in the time domain can be expressed as $s(t) = \left[\sum_n a_n g(t - nT_s)\right] \cos(\omega_c t + \phi)$, where $a_n = \pm A$, $g(t)$ denotes a single rectangular pulse, $T_s$ is the pulse width, and $\omega_c$ is the angular frequency of the carrier. Information carried by the BPSK signal is contained in the phase shifts. Through the detection of the phase shifts, the carried information can be recovered. The process of 1D-CNN demodulation is shown in Figure 1. The BPSK signal is fed separately into two 1D-CNNs. 1D-CNN1 detects phase shifts from 0 to π, and 1D-CNN2 detects phase shifts from π to 0. Each 1D-CNN outputs a pulse when its type of phase shift occurs. A decision module handles the outputs of the 1D-CNNs. When the output of 1D-CNN1 is greater than the predefined threshold, the output of the decision module changes from 1 to 0. When the output of 1D-CNN2 is greater than the predefined threshold, the output of the decision module changes from 0 to 1. The two 1D-CNNs have the same structure; only the parameters of the convolution kernels and neurons differ. The structure of the 1D-CNN, shown in Figure 2, consists of four layers: an input layer, a convolution layer, a hidden layer, and an output layer. The input layer conveys segmented data as an input vector. The convolution layer convolves the input vector with a convolution kernel, and the result is passed to the hidden layer. The hidden layer aims to keep the network from being trapped in a local optimum and to make the 1D-CNN converge more easily during training. Neurons in the hidden layer are connected to the convolution layer. The weighted results from every neuron in the hidden layer are summed in the output layer and then passed to the decision module. The operation process of the 1D-CNNs includes forward propagation and backward propagation.
In the training process, backward propagation adjusts the network parameters according to the output and loss until the loss reaches its minimum. Once the network parameters are determined, forward propagation can infer the result on its own.
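The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the threshold value, the initial output state, and the priority given to 1D-CNN1 when both outputs exceed the threshold are assumptions, not values taken from the design.

```python
def decide(cnn1_out, cnn2_out, state, threshold=0.5):
    """Return the next output bit of the decision module.

    threshold=0.5 is a hypothetical value; the text only says "predefined".
    """
    if cnn1_out > threshold:   # phase shift 0 -> pi detected: output goes to 0
        return 0
    if cnn2_out > threshold:   # phase shift pi -> 0 detected: output goes to 1
        return 1
    return state               # no phase shift detected: hold the previous bit


def demodulate(cnn1_seq, cnn2_seq, initial_state=1, threshold=0.5):
    """Apply the decision rule to two equally long sequences of CNN outputs."""
    state = initial_state      # assumed starting value of the output
    bits = []
    for c1, c2 in zip(cnn1_seq, cnn2_seq):
        state = decide(c1, c2, state, threshold)
        bits.append(state)
    return bits
```

Between two detected phase shifts the output simply holds its last value, which is why the two networks only need to emit short pulses.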

FPGA Implementation
This paper implements the 1D-CNN demodulator in an FPGA. The implementation block diagram is shown in Figure 3.
Two 1D-CNNs are implemented in an FPGA. The input of each 1D-CNN is a discrete time series from an analog-to-digital converter (ADC). The input layer segments the input series with a sliding window and conveys each input vector to the convolution layer. If we suppose that the input series is $x_1, x_2, \ldots, x_n$ and that $\sigma$ is the length of the input vector, then the input vector $X_m$ can be indicated as . . .
Registers are cascaded to realize a sliding window. The convolution layer includes two convolution kernels, which store the convolution parameters in block read-only memory (ROM) and complete convolution calculations. A rectified linear unit (ReLU) function is selected as the activation function of the convolution layer. The number of neurons in the hidden layer is set to 20, and the activation function of the hidden layer is a sigmoid function. A single neuron is set in the output layer. The existence of a certain type of phase shift can be concluded directly by referring to the state of this neuron. Comparators are used in the decision module to compare the predefined threshold with the results of the two 1D-CNNs. The valid bus gives the enable signal according to the states of the ADC and each layer. All components are driven by the same clock, provided by the external crystal.
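As a software analogy for the cascaded registers, the sliding window can be sketched with a fixed-length queue. The function name and the ordering of samples inside the window (oldest first) are assumptions made for illustration.

```python
from collections import deque


def sliding_windows(samples, sigma):
    """Yield every window of `sigma` consecutive samples, oldest sample first."""
    registers = deque(maxlen=sigma)    # plays the role of the cascaded registers
    for x in samples:
        registers.append(x)            # a new sample pushes the oldest one out
        if len(registers) == sigma:    # emit only once the registers are full
            yield list(registers)
```

Each new ADC sample produces one new input vector, so the window advances by one sample per step.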
To improve the efficiency of resource utilization and operation speed of the FPGA, three methods were adopted as follows: (1) the use of a time-delay network for convolutional calculation; (2) the use of a look-up table (LUT) together with a piecewise function to achieve the activation function; (3) the use of a parallel structure within layers and a pipeline structure between layers.

Implementation of the Convolution Kernel
The convolution kernel is the core component of the convolutional layer. Repeated experiments show that the detection accuracy was highest when the length of the input vector was slightly larger than the number of samples per carrier period. In the design, the length of the input vector is $M + 1$, where $M$ denotes the number of samples per carrier period. In order to accurately detect the phase shifts in the input vectors, the convolution operation mode was always selected such that the input vector and the output vector had the same length. According to the rule of convolution calculation, an output vector of length $M + 1$ can be obtained only if the input vector is expanded to a length of $2M + 1$. We used the element 0 to expand the vector. The expanded input vector is convolved with the convolution kernel of length $M + 1$, and the result is shown in Figure 4.
As shown in Figure 4, $(M + 1)^2$ multiplication operations must be done to complete this convolution calculation. Some of the operations include the multiplier '0' and can be ignored. The calculation results of the data in the line frame had all appeared in previous steps or would appear in forthcoming steps. By feeding some of the results into a time-delay network composed of several time-delay queues, a large number of repeated calculations could be avoided. The structure of the time-delay network is shown in Figure 5.
Assuming that $[x_1, x_2, \ldots, x_{M+1}]$ is the current input vector, together with the data in the time-delay network, we can obtain the convolution result $[y_1, y_2, \ldots, y_{M+1}]$ with only $M + 1$ multiplication operations. The number of multiplications is thus reduced from quadratic to linear in $M + 1$. The cost of the time-delay queues is far less than that of the multipliers in FPGA implementation. Under the control of the FPGA clock, data can be delayed by an exact number of clock beats while occupying few hardware resources. The timing sequence of the time-delay network is shown in Figure 6.
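The time-delay network itself is defined by Figure 5, which is not reproduced here, so the following sketch does not claim to be that circuit. It contrasts a direct zero-padded convolution, which needs $(M + 1)^2$ multiplications per input vector, with a transposed-form FIR filter, a standard streaming structure in which each new sample is multiplied by every kernel weight once and the partial sums travel through delay registers. All names are hypothetical, and the even split of the zero padding is an assumption.

```python
def same_convolution(x, h):
    """Zero-padded convolution with output length len(x).

    Returns the output vector and the number of multiplications performed.
    """
    k = len(h)
    left = (k - 1) // 2                  # assumed split of the zero padding
    right = (k - 1) - left
    padded = [0] * left + list(x) + [0] * right
    y, mults = [], 0
    for i in range(len(x)):
        acc = 0
        for j in range(k):
            acc += padded[i + j] * h[k - 1 - j]   # the kernel is flipped
            mults += 1
        y.append(acc)
    return y, mults


class TransposedFir:
    """Streaming convolution: len(h) multiplications per incoming sample."""

    def __init__(self, h):
        self.h = list(h)
        self.delay = [0] * (len(self.h) - 1)      # partial-sum delay registers
        self.mults_per_step = len(self.h)

    def step(self, x):
        products = [c * x for c in self.h]        # the only multiplications
        y = products[0] + (self.delay[0] if self.delay else 0)
        # Shift the partial sums one register towards the output.
        for i in range(len(self.delay) - 1):
            self.delay[i] = products[i + 1] + self.delay[i + 1]
        if self.delay:
            self.delay[-1] = products[-1]
        return y
```

Feeding the samples `[1, 0, -1]` followed by two zeros through `TransposedFir([1, 2, 3])` yields the full convolution, and its middle three values equal the output of `same_convolution`. The direct form spends 9 multiplications on this input vector, while the streaming form spends 3 per new sample.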

Implementation of a Single Neuron
Neurons play an important role in the hidden layer. Assuming that $[p_1, p_2, \ldots, p_n]$ are the inputs of each neuron, $[W_1, W_2, \ldots, W_n]$ are the weights of the synapses, $b$ is the bias, and $f(x)$ is the activation function, the structure of the neuron can be indicated as in Figure 7. Each neuron is fully connected to every convolution kernel output through synapses, and each synapse has its own weight. The weighted results are successively summed, biased, activated, and finally output. In the FPGA implementation, a full adder circuit performs the summing operation, and the fixed-point multiplication IP core is used for multiplication.
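The sum, bias, and activation sequence can be written as a minimal floating-point sketch. The function names are illustrative, and the sketch ignores the fixed-point arithmetic used on the FPGA.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus bias, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * p for w, p in zip(weights, inputs))
    return activation(weighted_sum + bias)
```

For example, `neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)` has a weighted sum of zero and therefore returns 0.5.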

In this paper, two activation functions are used: the ReLU function, $f(x) = \max(0, x)$, and the sigmoid function, $f(x) = 1/(1 + e^{-x})$. The implementation of the ReLU function is very simple, so we mainly discuss the implementation of the sigmoid function. Two methods are combined to realize the sigmoid function: the LUT method and the piecewise linear function method.
In the implementation of the LUT method, we mainly used the internal block RAM in the FPGA. First, we made a table from the input to the output, according to the resolution. Then, letting the input of the function be the address, the corresponding function value was written to the memory cell of this address. In this way, the complicated calculation process was simplified to the straightforward addressing process. By using the symmetry of the sigmoid function, only the positive part of the function was written to the LUT, whereas the negative part could be obtained by some simple adjustments according to the input data. This method has the advantages of high precision and ultrashort delay. However, the disadvantage is that the hardware resource cost is too high.
In the implementation of the piecewise linear function method, we used several linear functions for piecewise fitting of the sigmoid function. The more segments are used, the higher the accuracy, but also the higher the resource consumption. This method is suitable for functions with good linearity. As for the sigmoid function, part of the interval can be replaced by linear functions, especially in sections near or far from the longitudinal axis.
In order to combine the advantages of the two methods, the LUT and the piecewise function were used together to realize the sigmoid function. In the three intervals with good linearity, $x \in (-1.3, 1.3) \cup (-7.5, -4) \cup (4, 7.5)$, three linear functions were used to replace the sigmoid function; in the intervals with poor linearity, i.e., $x \in (-4, -1.3) \cup (1.3, 4)$, the LUT was used; and in the other intervals, far away from the longitudinal axis, where the sigmoid function converges to constants, it was replaced by constants. The structure of the sigmoid function implementation is shown in Figure 8. The control block first analyzes the range of the input, then switches the multiplexers (MUXs) to the appropriate path and adjusts the output value of the ROM. The output first-in/first-out (FIFO) buffer is used to adjust the output delay of the LUT method. It aims to synchronize the two paths, and then synchronize the outputs of all neurons. This method takes into account both resource consumption and accuracy. The implementation result is shown in Figure 9.
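The interval split can be illustrated with a floating-point sketch. Only the boundaries 1.3, 4, and 7.5 and the use of symmetry come from the text. The LUT resolution, the choice of linear segments (chords through the exact endpoint values), and the handling of the boundary points are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


A, B, C = 1.3, 4.0, 7.5          # interval boundaries given in the text
LUT_STEP = 1.0 / 64              # hypothetical LUT resolution
LUT_SIZE = math.ceil((B - A) / LUT_STEP)
LUT = [sigmoid(A + i * LUT_STEP) for i in range(LUT_SIZE)]   # positive part only

# Hypothetical linear segments: chords through the exact endpoint values.
SLOPE_CENTER = (sigmoid(A) - 0.5) / A
SLOPE_TAIL = (sigmoid(C) - sigmoid(B)) / (C - B)


def sigmoid_positive(ax):
    """Approximate sigmoid(ax) for ax >= 0."""
    if ax < A:                                   # near the axis: linear
        return 0.5 + SLOPE_CENTER * ax
    if ax < B:                                   # poor linearity: table look-up
        return LUT[int((ax - A) / LUT_STEP)]
    if ax < C:                                   # far from the axis: linear
        return sigmoid(B) + SLOPE_TAIL * (ax - B)
    return 1.0                                   # saturated: constant


def hybrid_sigmoid(x):
    """Use the symmetry sigmoid(-x) = 1 - sigmoid(x) for negative inputs."""
    y = sigmoid_positive(abs(x))
    return y if x >= 0 else 1.0 - y
```

Because only the positive half is stored, the table covers a single interval of width 2.7, and the negative half costs one subtraction.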

Pipeline and Parallel Structure
The neural network runs in a pipelined mode between layers. The data are not cached from input to output. Layers are called every clock period. Each clock period, a set of data is fed into the 1D-CNN, and a set of results is output, which is delayed for several clock periods compared with the corresponding input. The pipelined mode maximizes the utilization of resources.
The pipelined mode between layers requires a parallel computing architecture inside each layer. The speed of the data stream is fixed in each layer, with no multiplexing of hardware resources between each step. In the convolution layer, two convolution kernels are routed respectively. Calculations of each element of the two output vectors are obtained synchronously; in the hidden layer, calculations of 20 parallel neurons are routed respectively. This parallel structure avoids the problem of routing across the clock domain in the implementation of the FPGA, and also avoids the problem of timing tension caused by frequent calls of the critical paths. The designed pipeline and parallel structure improves the stability of the circuit, and also provides the possibility to accelerate the speed of operations.

Precision and Quantization
In the training process of neural networks in a personal computer (PC), we chose the double precision floating-point type as the data type of neural network parameters, in order to obtain a network with high precision. In the FPGA implementation process, however, this kind of high-precision data type is neither feasible nor necessary. Taking this into account, the signed int type is chosen as the data type, meaning one bit for sign and 15 bits for data.
During the training process, the network input and parameters are normalized to unity; we treat the amplitude of the input signal as 1, and then obtain the network parameters of each layer. Here, the network parameters are represented as double precision floating-point, so we need to complete the data type conversion from floating-point to fixed-point. Differing from the floating-point numbers, fixed-point numbers have the problem of width expansion after multiplication. This means that the product of fixed-point numbers will become bigger and bigger, regardless of their actual value.
Assuming that $f_1$ and $f_2$ are two floating-point numbers, after fixed-point quantization with the coefficient $n$, their integer values are $I_1$ and $I_2$; therefore, $I_1 = [n f_1]$, $I_2 = [n f_2]$, and $I_1 I_2 = [n^2 f_1 f_2]$. However, the quantized actual value of the product of $f_1$ and $f_2$ is expected to be $R = [n f_1 f_2]$. It is clear that we can avoid the problem of width expansion through an additional division operation, that is, $R = I_1 I_2 / n$. According to this characteristic, this paper chooses an integer power of 2 as the quantization coefficient, such that division operations can be replaced by bit shift operations. The actual product value can be expressed as $R = (I_1 I_2) \gg \log_2 n$. Bit shift operation is well suited to the FPGA structure and occupies few hardware resources. This quantization method greatly reduced the resource usage of multiplication. In the experiment, $n$ was set to 2048, which corresponds to a right shift of 11 bits. Examination of the results shows that no overflow happened during operation.
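A short sketch of this scheme with $n = 2048$ follows. Rounding to the nearest integer and the explicit 16-bit range check are assumptions added for illustration; the function names are hypothetical.

```python
SHIFT = 11                 # n = 2**11 = 2048, the coefficient used in the experiment
N = 1 << SHIFT


def quantize(f):
    """Map a float to its fixed-point integer (rounding mode is an assumption)."""
    i = round(f * N)
    if not -32768 <= i <= 32767:
        raise OverflowError("value does not fit in a signed 16-bit word")
    return i


def fixed_mul(i1, i2):
    """Multiply two fixed-point integers and rescale by a right shift, not a division."""
    return (i1 * i2) >> SHIFT


def dequantize(i):
    """Convert a fixed-point integer back to a float for inspection."""
    return i / N
```

For instance, 0.5 and -0.25 quantize to 1024 and -512; their raw product is -524288, and shifting right by 11 bits gives -256, which is the fixed-point form of -0.125.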

Experimental Platform
In this chapter, we describe the implementation of the proposed 1D-CNN demodulator on the Xilinx KC705 evaluation board. The block diagram of the experimental platform is shown in Figure 10. As shown in this diagram, a PC with the Keras neural network toolkit was used to generate the network parameters, which were later provided to the Xilinx KC705 evaluation board. The training data set $X_i$ was provided in the neural network training process. In addition to the evaluation board itself, three other devices were used to generate modulated data: the BPSK generator, the FPGA mezzanine card (FMC) sampling subsystem, and the additive white Gaussian noise (AWGN) generator.
The experimental condition was set as follows: (1) carrier frequency was $f_c = 10$ MHz; (2) symbol rate was $r_b = 5$ Msps; (3) sampling frequency of the system was $f_s = 80$ MHz.

Training Data Sets
The demodulation performance of the 1D-CNN demodulator is obtained by training, so appropriate training data sets must be generated first. To enhance the adaptability of the 1D-CNN demodulator for a real channel condition, five nonideal factors were taken into consideration during the generation of training data X i : signal-to-noise ratio (SNR), carrier frequency offset o c , symbol rate offset o b , sampling frequency offset o s , and initial phase ϕ. Training data X i can be regarded as the function of these five variables, which is Empirically, we set the range of each variable, as shown in Table 1. MATLAB was employed to generate training data sets. Firstly, these five variables were assigned randomly in their respective range, and a set of values of the five variables was obtained. According to the carrier frequency, the symbol rate, and the initial phase, a BPSK-modulated waveform was generated, with the length fixed to 100 random symbols. Next, the noise signal of certain power was created referring to the SNR value, and the noise signal was added to the modulated waveform. Then, we extracted samples from the modulated waveform according to the sampling frequency. Finally, phase shifts labels were added to the sample sequence. In this way, the first data set X 1 was generated.
By repeating the above steps, a total of 1000 training data sets were generated. Next, the structures of the 1D-CNNs were built on a PC using the Keras neural network toolkit, and the generated training data sets $X_1$ to $X_{1000}$ were used to train the networks. The data sets were iterated over 100 times. The training loss curve is shown in Figure 11; it can be seen that the training process converged well. The trained network parameters were then imported into the FPGA.

Structure Parameters of the Network
In order to obtain the best network structure, several networks were designed by changing the two most important structure parameters: the length of the input vector and the number of neurons in the hidden layer. The modulated signals of different SNR were used to judge the performance of the network. As a representative case, the bit error rate (BER) results when SNR was 6 dB are shown in Figures 12 and 13.

The results in Figure 12 indicate that the network cannot converge when the input vector is too short (a length of less than 4). The performance of the network remained at an acceptable and stable level when the length of the input vector was slightly larger than the number of samples per carrier period. In the experiment, the number of samples per carrier period was 8, and the length of the input vector was set to 9.
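As a short worked check, the window length can be related to the stated experimental settings ($f_s$ = 80 MHz, $f_c$ = 10 MHz, $r_b$ = 5 Msps). The symbols $N_c$, $N_b$, and $L$ are introduced here only for illustration and do not appear in the original text.

\[
N_c = \frac{f_s}{f_c} = \frac{80\ \text{MHz}}{10\ \text{MHz}} = 8, \qquad
N_b = \frac{f_s}{r_b} = \frac{80\ \text{MHz}}{5\ \text{Msps}} = 16, \qquad
L = N_c + 1 = 9 .
\]

With these nominal values, an input vector of length $L = 9$ spans one full carrier period plus one sample. Because $L < N_b$, it contains at most one symbol boundary.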
The results in Figure 13 indicate that the network performance improves as the number of neurons in the hidden layer increases. However, the improvement is not obvious when the number exceeds 20. Considering that more neurons require a greater amount of computation, the number was set to 20 in the experiment.

Results of the Implementation
The Xilinx KC705 evaluation board is equipped with an xc7k325t FPGA. The resource occupancy of the 1D-CNN demodulator is shown in Table 2. The chosen FPGA meets the requirements of the implementation. However, the occupancy rate of DSP48E1s is very high, so it was necessary to simplify the multiplication.
To fully demonstrate the performance of the 1D-CNN demodulator, its complexity, power consumption, and latency were compared with those of a coherent demodulator. To eliminate the influence of the hardware platform, the coherent demodulation algorithm was implemented on the same FPGA chip. A Costas loop and the Gardner algorithm were employed in the coherent demodulation. The results are shown in Figure 14 and Table 3.
Hardware resource utilization is shown in Figure 14 to illustrate complexity. The 1D-CNN demodulator consumes fewer slice registers, slice LUTs, and block RAMs, but more DSP48E1s. This means that the 1D-CNN demodulator has simpler logic but a larger amount of calculation. Table 3 shows that the 1D-CNN demodulator reduced power consumption by 9.07%. As for latency, because no carrier synchronization is required, the 1D-CNN demodulator shortened the delay by 96%.
To illustrate the adaptability to frequency offset and symbol rate offset, the 1D-CNN demodulator and the coherent demodulator were tested under an offset channel. In the offset channel, two of the five factors mentioned above, $o_c$ and $o_b$, were set to 0.1 MHz and 5 kHz, respectively. The BER results are shown in Figure 15. They indicate that the 1D-CNN demodulator adapts better to offset, especially at low SNR.
BER performance under an AWGN channel is generally regarded as an evaluation criterion for a demodulator. The theoretical value, the simulated value, and the actual tested value under an AWGN channel were obtained. The results are shown in Figure 16. The experimental results show that under an AWGN channel, the demodulation loss of the 1D-CNN demodulator could be kept almost within 2 dB. This can be regarded as relatively good performance, which can meet most of the requirements in wireless communication.
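The notion of demodulation loss can be illustrated with a small Python sketch. It uses the standard theoretical BER of coherent BPSK in AWGN, $P_b = \tfrac{1}{2}\,\mathrm{erfc}(\sqrt{E_b/N_0})$, and takes the loss to be the extra $E_b/N_0$ that a measured point needs in order to reach the same BER as the theoretical curve. The text reports results against SNR without stating its relation to $E_b/N_0$, so treating the horizontal axis as $E_b/N_0$ is an assumption, and the function names are hypothetical. No measured values from Figure 16 are used.

```python
import math


def bpsk_ber_theory(ebn0_db):
    """Theoretical BER of coherent BPSK in AWGN at the given Eb/N0 in dB."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))


def required_ebn0_db(target_ber, lo=-5.0, hi=20.0, tol=1e-6):
    """Eb/N0 in dB at which the theoretical BER equals target_ber.

    Solved by bisection, which works because the theoretical BER decreases
    monotonically with Eb/N0. The target must lie inside the search bracket.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bpsk_ber_theory(mid) > target_ber:
            lo = mid   # BER still too high: more Eb/N0 is needed
        else:
            hi = mid
    return (lo + hi) / 2


def demodulation_loss_db(measured_ebn0_db, measured_ber):
    """Extra Eb/N0 in dB used by a measured point relative to theory."""
    return measured_ebn0_db - required_ebn0_db(measured_ber)


if __name__ == "__main__":
    # Synthetic example point: a demodulator that reaches at 8 dB the BER
    # that theory predicts for 6 dB shows a loss of about 2 dB.
    synthetic_ber = bpsk_ber_theory(6.0)
    print(round(demodulation_loss_db(8.0, synthetic_ber), 3))
```

Under this definition, a loss within 2 dB means that each measured BER point lies no more than 2 dB to the right of the theoretical curve.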

Conclusions
This paper presented an FPGA implementation of a 1D-CNN demodulator for BPSK. The demodulator contains two 1D-CNNs, which detect the types and moments of the phase shifts. A decision module combines the results of the 1D-CNNs to recover the information carried by the modulated signal. A time-delay network for convolutional calculation and a structure for piecewise approximation of the activation function were adopted, improving the efficiency of resource utilization and the operation speed. Universal and diverse training data were generated, strengthening the adaptability to real channel conditions. The complexity, power consumption, and latency of the 1D-CNN demodulator and a coherent demodulator were compared. The results show that the 1D-CNN demodulator had acceptable complexity and power consumption, and outstanding latency. The performance of the 1D-CNN demodulator was tested in different channels. In complicated channels where a high offset was introduced, the 1D-CNN demodulator showed better adaptability to frequency offset. In an AWGN channel, the demodulation loss of the 1D-CNN demodulator could be kept almost within 2 dB. Owing to its good performance in the AWGN channel, the proposed 1D-CNN demodulator can meet most of the requirements in wireless communication. However, the deficiencies of the 1D-CNN demodulator cannot be ignored: its complexity may be further reduced, and its adaptability to other nonideal factors, such as multipath effects, should be considered.