A Deep Learning Framework for Signal Detection and Modulation Classification

Deep learning (DL) is a powerful technique which has achieved great success in many applications. However, its usage in communication systems has not been well explored. This paper investigates algorithms for multi-signals detection and modulation classification, which are significant in many communication systems. In this work, a DL framework for multi-signals detection and modulation recognition is proposed. Compared to some existing methods, the signal modulation format, center frequency, and start-stop time can be obtained from the proposed scheme. Furthermore, two types of networks are built: (1) Single shot multibox detector (SSD) networks for signal detection and (2) multi-inputs convolutional neural networks (CNNs) for modulation recognition. Additionally, the importance of signal representation to different tasks is investigated. Experimental results demonstrate that the DL framework is capable of detecting and recognizing signals. And compared to the traditional methods and other deep network techniques, the current built DL framework can achieve better performance.


Introduction
Cognitive radio (CR) [1][2][3] has been used to refer to radio devices that are capable of learning and adapting to their environment. Due to the increasing requirements for wireless bandwidth of radio spectrum, automatic signal detection and modulation recognition techniques are indispensable. It can help users to identify the modulation format and estimate signal parameters within operating bands, which will benefit communication reconfiguration and electromagnetic environment analysis. Besides, it is widely used in both military and civilian applications, which have attracted much attention in the past decades [4][5][6][7].
Multi-signals detection is a task to detect the existing signals in a specific wideband, which is one of the essential components of CR. The most significant difference between signal and non-signal is energy. Hence, many wideband multi-signals detection algorithms are based on energy detector (ED). Some threshold-based wideband signal detection methods, such as [8][9][10][11][12][13], reduce the probability of false alarm or missed alarm. However, these methods are sensitive to noise changes and challenging to ensure the detection accuracy of all detection scenarios. Therefore, many non-threshold-based detection algorithms have been proposed [14][15][16][17]. However, these algorithms have high computational complexity, which results in poor online detection performance.
For automatic modulation recognition, algorithms based on signal phase, frequency, and amplitude have been widely used [18]. However, these algorithms are significantly affected by noise, and the performance can be substantially degraded in low SNR condition. High-order statistical-based

Communication Signal Description and Dataset Generation
In realistic communication processing, the signal may be distorted by the effect of non-linear amplifier and channel. In actual situation, the received signal in the communication system can be expressed as: s(n clk (t − τ))h(τ)dτ + n add (t) (1) where s(t) is the transmission signal, n Clk (t) is timing deviation, h(t) represents the transmitted wireless channel, n add (t) is additive white Gaussian noise. In this section, we will describe different modulated signals and their sample representation for our DL framework. We will also explain the reason why we use it and the method we enhance it.

Modulation Signal Description
For any digital modulation signal, the transmission signal can be presented as s(t) = n a n e j(w n t+φ) g(t − nT b ) (2) where w n is the signal angular frequency, φ is the carrier initial phase, T b is the symbol period, a n is the symbol sequence, g(t) is the shaping filter. For MFSK signal, it can be presented as a n = 1, For MPSK signal, it can be presented as a n = e j2πi/M , i = 0, 1, . . . , M − 1, w n = w 0 (4) For MQAM signal, it can be presented as a n = I n + jQ n I n , Q n = 2i − M 4 + 1, i = 0, 1, . . . , M 4 − 1, w n = w 0 MAPSK constellations are robust against nonlinear channels due to their lower peak-to-average power ratio (PAPR), compared with QAM constellations. Therefore, APSK was mainly employed and optimized over nonlinear satellite channels during the last two decades. As recommended in DVB-S2 [37], it can be presented as: a n = r k exp j 2π n k i k + θ k (6) where r k is the radius of the kth circle, n k is the number of constellations in kth circle, i k is the ordinal number of constellation points in the kth circle, θ k is the initial phase of the kth circle.

Signal Time-Frequency Description
For multi-signals detection task, we use the wideband signal time-frequency spectrum as the neural network input. To prove the feasibility of this method, we theoretically prove the time-frequency visual characteristic of each modulation. Here, we use the short-time Fourier transform [38] (STFT) to analyze the signal time-frequency characteristic.
When d = −T/2, the window completely spans to next symbol, and the minimum value is obtained From our analysis, we can easily get the characteristics of FSK modulation: (1) There will be sharp brightness changes in the time-frequency image at the frequency change moment. (2) The signal modulation number M and frequency spacing are important parameters for the MFSK time-frequency characteristics, which determine the value of w k .

Amplitude-Phase Modulation Signal Time-frequency Description
For MPSK, MAPSK, and MQAM signal, since they all belong to amplitude-phase modulation, the derivation processing of the signal time-frequency characteristics is the same as MPSK. Hence, we specify the time-frequency characteristics of the MPSK signal, and the STFT can be expressed as: As the derivation of MFSK signal time-frequency characteristics, when γ(t) is in a symbol duration, the Equation (14) can be simplified as: When γ(t) spans two symbols, the Equation (14) can be simplified as: where φ k+1 is the phase of the k+1-th symbol. And if φ k+1 = φ k , Equation (15) is equal to Equation (16). We take the modulus square of (15), and the result can be expressed as: And for (16), it can be expressed as: We take the partial derivative for Equation (18): From Equation (19), we can easily learn that SPEC s PSK (t, w c ) get the minimum value when φ k+1 = φ k or d = 0. But the minimum value is much greater than 0, which is greatly different for the MFSK signal.
Hence, for MPSK signal, there is only one wide frequency band in the time-frequency diagram, and the brightness fluctuation appears in a small range, which is different from MFSK. And from derivation processing, we can know that the MPSK time-frequency characteristics are less affected by M, so it is hard to distinguish PSK signals with different M. Figure 1 presents different modulation signals in the wideband.

Signal Eye Diagram and Vector Diagram Description
The function of the eye diagram is to observe the baseband signal waveform by an oscilloscope. Through the eye-diagram, we can adjust the receiver filter to improve system performance. Besides, due to the characteristics of the modulated signal itself, different modulation modes have apparent visual differences in the eye diagram. As shown in Figure 2, because of the different modulation scales, there are different eye numbers in each eye diagram. For OQPSK, since the two orthogonal signals stagger for half a symbol period, the eye-opening position is always staggered, while other

Signal Eye Diagram and Vector Diagram Description
The function of the eye diagram is to observe the baseband signal waveform by an oscilloscope. Through the eye-diagram, we can adjust the receiver filter to improve system performance. Besides, due to the characteristics of the modulated signal itself, different modulation modes have apparent visual differences in the eye diagram. As shown in Figure 2, because of the different modulation scales, there are different eye numbers in each eye diagram. For OQPSK, since the two orthogonal signals stagger for half a symbol period, the eye-opening position is always staggered, while other modulated signals always appear at the same time.

Signal Eye Diagram and Vector Diagram Description
The function of the eye diagram is to observe the baseband signal waveform by an oscilloscope. Through the eye-diagram, we can adjust the receiver filter to improve system performance. Besides, due to the characteristics of the modulated signal itself, different modulation modes have apparent visual differences in the eye diagram. As shown in Figure 2, because of the different modulation scales, there are different eye numbers in each eye diagram. For OQPSK, since the two orthogonal signals stagger for half a symbol period, the eye-opening position is always staggered, while other modulated signals always appear at the same time.
By reconstructing the signal IQ waveforms in the corresponding time, the signal vector diagram shows the symbol trajectory. From its formation mechanism, it is similar to the signal constellation diagram. However, unlike the constellation diagram, the vector diagram can reflect the signal phase information. For example, it can easily distinguish QPSK from OQPSK with the same initial phase, because there is no 180° phase shift in OQPSK, while it exists in QPSK. Meanwhile, compared with the constellation diagram, the vector diagram is more convenient to obtain and requires less prior information.  By reconstructing the signal IQ waveforms in the corresponding time, the signal vector diagram shows the symbol trajectory. From its formation mechanism, it is similar to the signal constellation diagram. However, unlike the constellation diagram, the vector diagram can reflect the signal phase information. For example, it can easily distinguish QPSK from OQPSK with the same initial phase, because there is no 180 • phase shift in OQPSK, while it exists in QPSK. Meanwhile, compared with the constellation diagram, the vector diagram is more convenient to obtain and requires less prior information. Figure 3 presents the processing of our dataset construction and annotation. To make samples more diverse, we set sampling phase offset, frequency offset, phase offset, and amplitude attenuation in sample generation processing.  Figure 3 presents the processing of our dataset construction and annotation. To make samples more diverse, we set sampling phase offset, frequency offset, phase offset, and amplitude attenuation in sample generation processing.  For signal detection, we need to determine the reconnaissance frequency range and set the signal number in the wideband at first. We set different frequency offset for each signal, and ensure that the signals do not overlap in the frequency domain. Then we perform the STFT on the wideband. Not only we record the modulation format of each signal, but also record the start-stop time, carrier frequency, and bandwidth. Then we convert them into the coordinates on  For signal detection, we need to determine the reconnaissance frequency range and set the signal number in the wideband at first. We set different frequency offset for each signal, and ensure that the signals do not overlap in the frequency domain. Then we perform the STFT on the wideband. Not only we record the modulation format of each signal, but also record the start-stop time, carrier frequency, and bandwidth. Then we convert them into the coordinates on time-frequency image, which are the label information for the network.

The Generation Processing of the Dataset
For modulation recognition, traditional eye diagram and vector diagram are binary images, which do not consider the signal aggregation degree at a particular location. Hence, we consider the signal aggregation degree and enhance the traditional eye diagram and vector diagram. Figure 4 presents the enhancement processing of the dataset. In our initial research, since CNNs are insensitive to edge information, the signal amplitude is quantified between [−1.05, 105] by 128 after normalizing the amplitude. Furthermore, the parameter settings are obtained by experiments. For example, we choose 800 symbols and 4 symbols as a waveform group to generate the eye diagram and the vector diagram, and related experiments will also be described in detail in subsequent chapters. Moreover, we perform the following operations on the images to make the image details more prominent, where Im 0 is the original image, Im 1 , Im 2 , Im 3 are the channels of the enhanced image and α, β are scaling factors.  Figure 3 presents the processing of our dataset construction and annotation. To make samples more diverse, we set sampling phase offset, frequency offset, phase offset, and amplitude attenuation in sample generation processing. For signal detection, we need to determine the reconnaissance frequency range and set the signal number in the wideband at first. We set different frequency offset for each signal, and ensure that the signals do not overlap in the frequency domain. Then we perform the STFT on the wideband. Not only we record the modulation format of each signal, but also record the start-stop time, carrier frequency, and bandwidth. Then we convert them into the coordinates on time-frequency image, which are the label information for the network. For modulation recognition, traditional eye diagram and vector diagram are binary images, which do not consider the signal aggregation degree at a particular location. Hence, we consider the signal aggregation degree and enhance the traditional eye diagram and vector diagram. Figure 4 presents the enhancement processing of the dataset. In our initial research, since CNNs are insensitive to edge information, the signal amplitude is quantified between [−1.05, 105] by 128 after normalizing the amplitude. Furthermore, the parameter settings are obtained by experiments. For example, we choose 800 symbols and 4 symbols as a waveform group to generate the eye diagram and the vector diagram, and related experiments will also be described in detail in subsequent

Deep Learning Framework for Signal Detection and Modulation Recognition
DL networks aim at learning different hierarchies of features from data. As one of the branches of DL techniques, the CNNs perform well in the field of image recognition. It performs feature learning via non-linear transformations implemented as a series of nested layers. Each layer consists of several kernels that perform a convolution over the input. Generally, the kernels are usually multidimensional arrays which can be updated by some algorithms [39]. Our DL framework achieves multi-signals detection and modulation recognition. We use different deep neural networks for different tasks. For signal detection, we use SSD networks. For modulation recognition, we design a multi-inputs CNNs.

SSD Networks for Signal Detection
We use SSD networks to achieve multi-signals detection. For DL target detection techniques, the existing algorithms are mainly divided into two kinds: algorithms based on region recommendation (two-stage methods) and algorithms based on regression (one-stage methods). Regression-based algorithms include YOLO series algorithms [40][41][42][43] and SSD series algorithms [43,44], while region recommendation-based algorithms include RCNN [45], Fast RCNN [46], and Faster RCNN [47]. In our research, since the regression-based algorithms are faster than region recommendation-based algorithms, we use SSD networks as our signal detection model. SSD networks can generate a series of fixed-size borders and the possibility of the containing target in each border. Finally, the final detection and recognition results are calculated by the non-maximum suppression algorithm [48]. The structure of SSD networks is shown in Figure 5, and it can be divided into four parts. the existing algorithms are mainly divided into two kinds: algorithms based on region recommendation (two-stage methods) and algorithms based on regression (one-stage methods). Regression-based algorithms include YOLO series algorithms [40][41][42][43] and SSD series algorithms [43,44], while region recommendation-based algorithms include RCNN [45], Fast RCNN [46], and Faster RCNN [47]. In our research, since the regression-based algorithms are faster than region recommendation-based algorithms, we use SSD networks as our signal detection model. SSD networks can generate a series of fixed-size borders and the possibility of the containing target in each border. Finally, the final detection and recognition results are calculated by the non-maximum suppression algorithm [48]. The structure of SSD networks is shown in Figure 5, and it can be divided into four parts.
Part 1: The networks for feature extraction. The initial part of SSD networks is the first layers of VGG16 network, which is used as the primary network to extract the deep features of the whole input image. Behind the primary network, the model structure is the pyramid networks, which contains a series of simple convolution layers to make feature maps smaller and smaller. With the pyramid network structure, we can get several feature maps with different scales.

Part 1:
The networks for feature extraction. The initial part of SSD networks is the first layers of VGG16 network, which is used as the primary network to extract the deep features of the whole input image. Behind the primary network, the model structure is the pyramid networks, which contains a series of simple convolution layers to make feature maps smaller and smaller. With the pyramid network structure, we can get several feature maps with different scales.
Part 2: The design of the default box. In this part, we will design several feature default box for different scales of feature maps. Each feature map at the top of the VGG16 networks is associated with a set of feature default box. As shown in Figure 6, there are dotted borders at each position of 4 × 4 and 8 × 8 feature maps. These fixed-size borders are default boxes, and their scale parameters are designed by the different feature maps scales. For example, assuming that we need M feature maps to predict, the scale parameters of the default box are as follows: where S min is the bottom scale, and S max is top scale. The length-width radio of feature default can be expressed as: a r ∈ {1, 2, 3, 1/2, 1/3}. So the feature default box length is W a k = S k √ a r and the width is Sensors 2019, 19, x FOR PEER REVIEW 9 of 22

Part 2:
The design of the default box. In this part, we will design several feature default box for different scales of feature maps. Each feature map at the top of the VGG16 networks is associated with a set of feature default box. As shown in Figure 6, there are dotted borders at each position of 4 × 4 and 8 × 8 feature maps. These fixed-size borders are default boxes, and their scale parameters are designed by the different feature maps scales. For example, assuming that we need M feature maps to predict, the scale parameters of the default box are as follows: where min S is the bottom scale, and max S is top scale. The length-width radio of feature default can be expressed as: Part 3: Detection and Recognition. In this part, we can predict the target category and location. We add a set of convolution kernels behind several different scales feature maps. Using these convolution kernels, we can get a fixed set of test results. For an m × n × p feature maps, a small convolution kernel with 3 × 3 × p size is used as the fundamental prediction element. Finally, Part 3: Detection and Recognition. In this part, we can predict the target category and location. We add a set of convolution kernels behind several different scales feature maps. Using these convolution kernels, we can get a fixed set of test results. For an m × n × p feature maps, a small convolution kernel with 3 × 3 × p size is used as the fundamental prediction element. Finally, the classification probability of each feature default box and the offsets are obtained.
Part 4: Non-maximum suppression. In the last part, we use non-maximum suppression to select the best prediction results. For the feature default boxes that are matched by each real target border, we calculate their intersection-parallel ratios. The expression is shown as follow where A and B are two borders. We will select the feature default box whose IoU are greater than 0.5 as best results, and then obtains the highest confidence degree feature default box by non-maximum suppression.
In offline train stage, the whole objective optimal function of the SSD networks includes two parts: confidence loss and location loss. The expression is shown as follow where x is used to indicate whether the feature default box is a target or not. N is the number of the feature default boxes that are matched to real target borders. Parameter α is used to adjust the radio between L con f and L loc , default α = 1. L con f is softmax loss function. L loc is used to measure the performance of the boundary box prediction, and in our initial research, we use the typical smooth L1 function to calculate L loc (x, l, g) = N i∈Pos m∈{cx,cy,w,h} where Pos and Neg represent all positive and negative borders, respectively. c p i represents the confidence degree for pth feature default matching ith target. l m i represents the prediction bias of the i-th feature default box. (cx, cy) is the box center coordinates and (w, h) is the box width and height. g m j represents the deviation between the real target border g m j and the default box d m i .ĝ m j is calculated as follow:

Multi-Inputs CNNs for Modulation Recognition
For signal modulation recognition task, the modulation set is {BPSK, QPSK, OQPSK, 8PSK, 16QAM, 16APSK, 32APSK, 64QAM}, because they are all belonging to amplitude-phase modulation and we cannot distinguish each other from time-frequency characteristic in SSD network. Hence, we use the eye diagram and vector diagram as the model inputs. The multi-inputs CNNs model is shown in Figure 7. The initial size of the samples is 128 × 128, and we use softmax as the output layer's active function and relu as all other network layers' active function.
The signal features extraction can be divided into three stages. On the first stage, we use 7 × 7 convolution kernels to convolute IQ eye diagram and vector diagram, respectively. To ensure the dynamic range unification of the feature maps, we apply the batch normalization (BN) [49] on first layer network outputs. We perform the max pooling operation on the BN feature maps. Then we connect the feature maps from IQ eye diagram inputs.
16QAM, 16APSK, 32APSK, 64QAM}, because they are all belonging to amplitude-phase modulation and we cannot distinguish each other from time-frequency characteristic in SSD network. Hence, we use the eye diagram and vector diagram as the model inputs. The multi-inputs CNNs model is shown in Figure 7. The initial size of the samples is 128 × 128, and we use softmax as the output layer's active function and relu as all other network layers' active function. The signal features extraction can be divided into three stages. On the first stage, we use 7 × 7 convolution kernels to convolute IQ eye diagram and vector diagram, respectively. To ensure the dynamic range unification of the feature maps, we apply the batch normalization (BN) [49] on first layer network outputs. We perform the max pooling operation on the BN feature maps. Then we connect the feature maps from IQ eye diagram inputs.
On the second signal feature extraction stage, we adopt the residual network structure to avoid the degradation caused by the network over-depth. The basic structure of ResNet [50] is shown in Figure 8. After the second feature extraction stage, each input feature maps are connected. After the On the second signal feature extraction stage, we adopt the residual network structure to avoid the degradation caused by the network over-depth. The basic structure of ResNet [50] is shown in Figure 8. After the second feature extraction stage, each input feature maps are connected. After the batch normalization in the third stage, we directly process the feature maps by global maximum sampling to reduce the network parameters. In the processing of network optimization, we adopt Adam algorithm [51] to solve the network parameters optimal solution. The categorical cross-entropy error is chosen as the loss function, which is represented as:

The Description for Deep Learning Framework
According to above introduction of the signal detection network and the modulation recognition network, we describe the use of our DL framework. Figure 9 presents the system model. The steps of the model used are as follows: Step 1: We construct the signal detection network and modulation recognition network and train each network with their appropriate samples.
Step2: In the online testing phase, we perform STFT for the wideband signals, and use the trained SSD networks to detect the signals in the time-frequency spectrum. In this step, we can In the processing of network optimization, we adopt Adam algorithm [51] to solve the network parameters optimal solution. The categorical cross-entropy error is chosen as the loss function, which is represented as:

The Description for Deep Learning Framework
According to above introduction of the signal detection network and the modulation recognition network, we describe the use of our DL framework. Figure 9 presents the system model. The steps of the model used are as follows:  Figure 9. System model.

Results
In section 2 and 3, we have discussed the methods which convert complex signal samples into images without noticeable information loss and introduced the structures of our DL framework. Table 1 shows the time complexity of our DL framework on the different process, in which N is the number of signals in the wideband range. It can be seen that our framework has low time complexity due to the evolution of GPUs, which is acceptable for many practical communication systems. We also have conducted several experiments to demonstrate the performances of the DL framework for joint signal detection and modulation recognition in wireless communication systems. Our experiments can be divided into two parts: (1) performances on multi-signals detection and (2) performances on modulation recognition. The rest of this section is organized as follows Multi-signals detection: First, we show some results of our detection network, and explain the reasons for these results. Then, we evaluate the model performances from three aspects: the modulation format, carrier frequency, and start-stop time. We also compare our network with other detection networks.
Signal modulation recognition: We evaluate our model recognition performances on each modulated signal. We also discuss the network performances when the frequency offset exists. Meanwhile, we compare our method with traditional methods and other DL based methods. Finally, we discuss the influence of symbol and eye numbers and compare the performance between signal-input networks and multi-inputs networks.

Performance on Signal Detection
For multi-signal detection, we need to know each signal carrier frequency, start-stop time and modulation format. Figure 10 shows some detection results from our model. From Figure 10a,b, it is indicated that our model is beneficial for multi-signals detection, and it can accurately estimate the relevant information about each signal. Moreover, our model has very a promising application prospect in engineering because it has a good visualization effect.
To some extent, our model is not perfect yet, and there are still some aspects that need to be improved. From Figure 10c and 10d, we can learn that once the signal length is large, the estimation of the signal start-stop time is not precise, while the estimation of the carrier frequency is precise. The cause of this phenomenon may be that the time-frequency spectrum has large deformation and extreme length-with radio, while the natural image is not. Therefore, we need to further optimize the default box in the SSD networks. Figure 10e,f show the network performance when there is no signal Step 1: We construct the signal detection network and modulation recognition network and train each network with their appropriate samples.
Step2: In the online testing phase, we perform STFT for the wideband signals, and use the trained SSD networks to detect the signals in the time-frequency spectrum. In this step, we can obtain the center frequency and start-stop time of each signal. And for MFSK signal, we can get its modulation format.
Step 3: For the amplitude-phase modulation signal, we can get the central frequency and start-stop time in step 2. With this knowledge, we filter the target signal and use the envelope spectrum to estimate the signal symbol rate. Then we down convert the signal and perform the matched filter by using the estimated symbol rate.
Step 4: If the timing deviation exists in the target signal, it is necessary to extract the sample value at the optimum sampling position for signal eye diagram and vector diagram. We use the non-data-aided timing estimation algorithm in [52]. The specific expression is as follows, where L 0 is the length of the signal symbols, T is the sampling period and N is the oversampling number: Step 5: We alter the target signal sampling rate, and obtain the baseband signal with a maximum delay 32 sampling period. Moreover, we generate the eye-diagram and vector diagram with the processed signal.
Step 6: We use the trained modulation recognition network to identify the signal by its eye diagram and vector diagram. Finally, we complete the signal detection and modulation recognition.

Results
In Sections 2 and 3, we have discussed the methods which convert complex signal samples into images without noticeable information loss and introduced the structures of our DL framework. Table 1 shows the time complexity of our DL framework on the different process, in which N is the number of signals in the wideband range. It can be seen that our framework has low time complexity due to the evolution of GPUs, which is acceptable for many practical communication systems. We also have conducted several experiments to demonstrate the performances of the DL framework for joint signal detection and modulation recognition in wireless communication systems. Our experiments can be divided into two parts: (1) performances on multi-signals detection and (2) performances on modulation recognition. The rest of this section is organized as follows Multi-signals detection: First, we show some results of our detection network, and explain the reasons for these results. Then, we evaluate the model performances from three aspects: the modulation format, carrier frequency, and start-stop time. We also compare our network with other detection networks.
Signal modulation recognition: We evaluate our model recognition performances on each modulated signal. We also discuss the network performances when the frequency offset exists. Meanwhile, we compare our method with traditional methods and other DL based methods. Finally, we discuss the influence of symbol and eye numbers and compare the performance between signal-input networks and multi-inputs networks.

Performance on Signal Detection
For multi-signal detection, we need to know each signal carrier frequency, start-stop time and modulation format. Figure 10 shows some detection results from our model. From Figure 10a,b, it is indicated that our model is beneficial for multi-signals detection, and it can accurately estimate the relevant information about each signal. Moreover, our model has very a promising application prospect in engineering because it has a good visualization effect.   (31) It can be deduced that with the increase of the SNR, the mAP value of the SSD network is increasing. When the SNR is 5 dB, the mAP value can reach 90% in IoU is 0.5. Different IoU thresholds can lead to different results. Although the increase of the threshold can obtain more reliable signal carrier frequency and start-stop time, it sacrifices the precision of signal detection. Besides, we can adopt some traditional methods to further estimate these signal parameters. Finally, we choose 0.5 as the threshold of IoU. To some extent, our model is not perfect yet, and there are still some aspects that need to be improved. From Figure 10c,d, we can learn that once the signal length is large, the estimation of the signal start-stop time is not precise, while the estimation of the carrier frequency is precise. The cause of this phenomenon may be that the time-frequency spectrum has large deformation and extreme length-with radio, while the natural image is not. Therefore, we need to further optimize the default box in the SSD networks. Figure 10e,f show the network performance when there is no signal exists. It can be observed that the model does not produce a false alarm, which is useful in engineering.
It can be deduced that with the increase of the SNR, the mAP value of the SSD network is increasing. When the SNR is 5 dB, the mAP value can reach 90% in IoU is 0.5. Different IoU thresholds can lead to different results. Although the increase of the threshold can obtain more reliable signal carrier frequency and start-stop time, it sacrifices the precision of signal detection. Besides, we can adopt some traditional methods to further estimate these signal parameters. Finally, we choose 0.5 as the threshold of IoU. Sensors 2019, 19, x FOR PEER REVIEW 15 of 22 Figure 11. The SSD networks performances.
Once we detected the signals, we need to evaluate the precision of the estimated parameters. We use the normalized offset of the estimated and the actual parameters as the criterion of measurement. They can be presented as follows:  (33) where pre f is the predicted value of the carrier frequency, real f is the actual value of the carrier frequency, R is the symbol rate,  Table 2 shows the carrier frequency and the start-stop time precision when the signal is detected. It can be seen that the precision of the carrier frequency is higher than start and stop time. These phenomena are consistent with Figure 10c,d. And in future research, we need to combine the prior information of the signal to design the default boxes and the networks.  Figure 12a, we can see that the mAP of the Fast RCNN and the RCNN is higher than the SSD networks, but the improvement is not significant. And from Figure 12b, we can infer that the SSD networks has considerable advantages in processing speed compared with the RCNN and Fast RCNN. Our model processing speed can reach 0.05 s for each time-frequency spectrum, and such a computational complexity is acceptable for many practical communications systems. Once we detected the signals, we need to evaluate the precision of the estimated parameters. We use the normalized offset of the estimated and the actual parameters as the criterion of measurement. They can be presented as follows: where f pre is the predicted value of the carrier frequency, f real is the actual value of the carrier frequency, R is the symbol rate, t pre_start and t pre_stop are the predict values of the start and the stop time, t true_start and t true_stop are the actual values of the start and the stop time, and T is the signal duration. Table 2 shows the carrier frequency and the start-stop time precision when the signal is detected. It can be seen that the precision of the carrier frequency is higher than start and stop time. These phenomena are consistent with Figure 10c,d. And in future research, we need to combine the prior information of the signal to design the default boxes and the networks. We also compare our model performances with the RCNN networks and the Fast RCNN. From Figure 12a, we can see that the mAP of the Fast RCNN and the RCNN is higher than the SSD networks, but the improvement is not significant. And from Figure 12b, we can infer that the SSD networks has considerable advantages in processing speed compared with the RCNN and Fast RCNN. Our model processing speed can reach 0.05 s for each time-frequency spectrum, and such a computational complexity is acceptable for many practical communications systems.

Performance on Modulation Recognition
For signal modulation recognition, we set a series of experiments to test the network performances. Figure 13a shows the recognition performances of each modulated signal under the different SNR. It can be seen that the algorithm can still achieve better performance when the SNR is very low. Because its modulation complexity, the performance of 64QAM signal is worse than other signals, but it still can achieve 94% accuracy at 7 dB. For BPSK and OQPSK signals, they have distinct visual characteristics from other modulated signals in the eye diagram and the vector diagram, which recognition accuracy can reach 100% even in 0 dB. And it is also obvious that the recognition performance of circular modulation signals {8PSK, 16APSK, 32APSK} is better than QAM modulation signal. To understand the results better, the confusion matrices in different SNR levels are presented in Figure 13b to 13d. It can be seen that the network shows excellent performance in discriminating BPSK, QPSK, OQPSK, 8PSK, and 16APSK. Moreover, in our experiments, it can be seen that 16QAM is more likely confused with 64QAM, while 16APSK is more likely confused with 64QAM.

Performance on Modulation Recognition
For signal modulation recognition, we set a series of experiments to test the network performances. Figure 13a shows the recognition performances of each modulated signal under the different SNR. It can be seen that the algorithm can still achieve better performance when the SNR is very low. Because its modulation complexity, the performance of 64QAM signal is worse than other signals, but it still can achieve 94% accuracy at 7 dB. For BPSK and OQPSK signals, they have distinct visual characteristics from other modulated signals in the eye diagram and the vector diagram, which recognition accuracy can reach 100% even in 0 dB. And it is also obvious that the recognition performance of circular modulation signals {8PSK, 16APSK, 32APSK} is better than QAM modulation signal. To understand the results better, the confusion matrices in different SNR levels are presented in Figure 13b-d. It can be seen that the network shows excellent performance in discriminating BPSK, QPSK, OQPSK, 8PSK, and 16APSK. Moreover, in our experiments, it can be seen that 16QAM is more likely confused with 64QAM, while 16APSK is more likely confused with 64QAM.

Performance on Modulation Recognition
For signal modulation recognition, we set a series of experiments to test the network performances. Figure 13a shows the recognition performances of each modulated signal under the different SNR. It can be seen that the algorithm can still achieve better performance when the SNR is very low. Because its modulation complexity, the performance of 64QAM signal is worse than other signals, but it still can achieve 94% accuracy at 7 dB. For BPSK and OQPSK signals, they have distinct visual characteristics from other modulated signals in the eye diagram and the vector diagram, which recognition accuracy can reach 100% even in 0 dB. And it is also obvious that the recognition performance of circular modulation signals {8PSK, 16APSK, 32APSK} is better than QAM modulation signal. To understand the results better, the confusion matrices in different SNR levels are presented in Figure 13b to 13d. It can be seen that the network shows excellent performance in discriminating BPSK, QPSK, OQPSK, 8PSK, and 16APSK. Moreover, in our experiments, it can be seen that 16QAM is more likely confused with 64QAM, while 16APSK is more likely confused with 64QAM.
(a) (b) Figure 13. Conts. For accuracy comparison, we consider four different modulation classification algorithms.
(1) Cumulant: A traditional signal processing algorithm using the fourth-order cumulant C40 as the classification statistics [19].
(3) CNNs for IQ waveform: A DL-based algorithm using the CNNs with the signal IQ waveform [4].
(1) Cumulant: A traditional signal processing algorithm using the fourth-order cumulant C40 as the classification statistics [19].
(3) CNNs for IQ waveform: A DL-based algorithm using the CNNs with the signal IQ waveform [4].
(4) CNNs for constellation: An DL-based algorithm using the CNNs with the signal constellation [35].   Considering the error of the carrier frequency estimation by SSD networks and FFT in practice, we research the network recognition performance in different frequency offsets. We set a series of frequency offset for signals, and the result is shown in Figure 15. It can be seen that the recognition accuracy of the signals with a frequency offset is lower than those without frequency offset. When the signals have a large frequency offset, the network is no longer suitable. We also collect some signals from a real satellite communication system, and the real-time wireless channel is performed in the received signals. And then, we use a signal playback device, a DSP card, and PCs to simulate signal reception process. From Figure 15a, we can obtain that the recognition accuracy on real data is lower on simulated data at same SNR level. It may be due to the training data, which not consider the actual channel environment clearly. But the recognition accuracy can still reach 90% when the SNR is 4 dB. And for further research, we will make full use of the real signal to make our model more robust. Considering the error of the carrier frequency estimation by SSD networks and FFT in practice, we research the network recognition performance in different frequency offsets. We set a series of frequency offset for signals, and the result is shown in Figure 15. It can be seen that the recognition accuracy of the signals with a frequency offset is lower than those without frequency offset. When the signals have a large frequency offset, the network is no longer suitable. We also collect some signals from a real satellite communication system, and the real-time wireless channel is performed in the received signals. And then, we use a signal playback device, a DSP card, and PCs to simulate signal reception process. From Figure 15a, we can obtain that the recognition accuracy on real data is lower on simulated data at same SNR level. It may be due to the training data, which not consider the actual channel environment clearly. But the recognition accuracy can still reach 90% when the SNR is 4 dB. And for further research, we will make full use of the real signal to make our model more robust. We also consider the influence of the symbol numbers and the eye number in eye diagram on the network performance. We obtain the best parameter settings of samples by grid search. The symbol number is set as 200, 400, 800, and 1000, respectively, while the eye number is set as 2, 3, 4, and 5. The results are shown in Figure 16. It can be seen that theses parameters do affect network performance. With the increase of symbol number and eye number, the overall accuracy of the model is gradually increasing. But we also can see that when the symbol number is 1000 and the We also consider the influence of the symbol numbers and the eye number in eye diagram on the network performance. We obtain the best parameter settings of samples by grid search. The symbol number is set as 200, 400, 800, and 1000, respectively, while the eye number is set as 2, 3, 4, and 5. The results are shown in Figure 16. It can be seen that theses parameters do affect network performance. With the increase of symbol number and eye number, the overall accuracy of the model is gradually increasing. But we also can see that when the symbol number is 1000 and the eye number is 5, the improvement of performance is not obvious. Therefore, we finally choose 800 symbols and 4 eye numbers to generate the eye diagram and the vector diagram. eye number is 5, the improvement of performance is not obvious. Therefore, we finally choose 800 symbols and 4 eye numbers to generate the eye diagram and the vector diagram. Finally, we compare the performance of the single input network with the multi-inputs network in this work. The results are shown in Figure 17. The modulation recognition algorithm based on a single eye diagram has poor performance. The performance of the I-eye diagram is lower than that of the Q-eye diagram, which may be due to the setting of the initial phase in the same modulation format. And the performance of the vector diagram based method is also inferior to our method, since it does not make full use of the signal waveform information.

Conclusions and Discussions
In our research, we have demonstrated our initial efforts to establish a DL framework for multi-signals detection and modulation classification problem. In our method, the time-frequency spectrums are exploited for multi-signals detection task, while the eye-diagrams and vector diagrams are exploited for the modulation classification task. The simulation results prove that DL technologies have the ability to solve the problems in the communication field and have higher performance than other methods. Finally, we compare the performance of the single input network with the multi-inputs network in this work. The results are shown in Figure 17. The modulation recognition algorithm based on a single eye diagram has poor performance. The performance of the I-eye diagram is lower than that of the Q-eye diagram, which may be due to the setting of the initial phase in the same modulation format. And the performance of the vector diagram based method is also inferior to our method, since it does not make full use of the signal waveform information. eye number is 5, the improvement of performance is not obvious. Therefore, we finally choose 800 symbols and 4 eye numbers to generate the eye diagram and the vector diagram. Finally, we compare the performance of the single input network with the multi-inputs network in this work. The results are shown in Figure 17. The modulation recognition algorithm based on a single eye diagram has poor performance. The performance of the I-eye diagram is lower than that of the Q-eye diagram, which may be due to the setting of the initial phase in the same modulation format. And the performance of the vector diagram based method is also inferior to our method, since it does not make full use of the signal waveform information.

Conclusions and Discussions
In our research, we have demonstrated our initial efforts to establish a DL framework for multi-signals detection and modulation classification problem. In our method, the time-frequency spectrums are exploited for multi-signals detection task, while the eye-diagrams and vector diagrams are exploited for the modulation classification task. The simulation results prove that DL technologies have the ability to solve the problems in the communication field and have higher performance than other methods.

Conclusions and Discussions
In our research, we have demonstrated our initial efforts to establish a DL framework for multi-signals detection and modulation classification problem. In our method, the time-frequency spectrums are exploited for multi-signals detection task, while the eye-diagrams and vector diagrams are exploited for the modulation classification task. The simulation results prove that DL technologies have the ability to solve the problems in the communication field and have higher performance than other methods.
However, in the future, we will do more rigorous analysis and more comprehensive experiments. Besides, for practical use, we will collect the samples generated from the real channels, and then retrain or fine-tune the model for better performance.