Single-Channel Blind Source Separation of Spatial Aliasing Signal Based on Stacked-LSTM

Aiming at the problem of insufficient separation accuracy for aliased signals in space Internet satellite-ground communication scenarios, a deep-learning separation method based on a stacked long short-term memory network (Stacked-LSTM) is proposed. First, the coding feature representation of the mixed signal is extracted. Then, the long sequence input is divided into smaller blocks by the Stacked-LSTM network with the attention mechanism of the SE module, and the deep feature masks of the source signals are trained; the Hadamard product of each source's mask and the coding feature of the mixed signal yields the coding feature representation of that source signal. Finally, the features of the source signals are decoded by 1-D convolution to obtain the original waveforms. The negative scale-invariant source-to-noise ratio (SISNR) is used as the loss function for network training, i.e., as the evaluation index of single-channel blind source separation performance. The results show that, in the single-channel separation of spatially aliased signals, the Stacked-LSTM method improves SISNR by 10.09∼38.17 dB compared with the two classic separation algorithms, ICA and NMF, and the three deep learning separation methods, TasNet, Conv-TasNet and Wave-U-Net. The Stacked-LSTM method has better separation accuracy and noise robustness.


Introduction
In recent years, satellite networking constellations (such as the OneWeb constellation) have developed rapidly. Satellite-to-ground and inter-satellite communications have further aggravated spectrum congestion, and co-frequency signal aliasing interference is inevitable [1][2][3]; signals in the electromagnetic environment of space communication exhibit time-frequency aliasing and spatial interleaving. In real space communication, robust signal processing usually requires automatic signal separation [4,5]. In dense constellation Internet communication, the signal received at the receiving end may be an aliased mixture of multiple channels, and because the number of ground station observation channels is limited, the underdetermined blind source separation scenario is extremely common. Conventional underdetermined separation methods such as the two-step method and sparse feature representation [6][7][8] perform poorly under single-channel conditions, so the blind source separation of single-channel communication signals urgently needs to be solved. Researchers have proposed many methods for the single-channel separation problem, such as ICA [9] and NMF [10], which expand the observation data into multiple channels and transform the task into multi-channel blind source separation, but their separation accuracy needs to be improved and their generalization is poor. TasNet [11], Conv-TasNet [12] and Wave-U-Net [13], based on deep learning, play an important role in the blind source separation of speech signals. Among the contributions of this work: (1) a mixed communication signal data set, 10-mixC, is constructed using the GNURadio platform. The data set includes 10 mixed signals obtained from five types of modulated communication signals and can provide data support for similar research work in the future.

Related Work
Some traditional blind source separation methods provide solutions for single-channel blind source separation, but each has disadvantages. Independent component analysis [9] performs well in over-determined and positive-definite blind source separation, but poorly in underdetermined cases, especially single-channel scenarios. Sparse component analysis (SCA) [18], based on clustering, requires the number of mixed signals to be known. The time-frequency mask method [15] improves separation accuracy, but amplitude and phase are decoupled, and the short-time Fourier transform requires a high-resolution frequency decomposition window, which limits its applicability in low-delay systems. Non-negative matrix factorization (NMF) [10] can decompose signals in the time domain, but its generalization ability is weak. Single-channel separation based on the Kalman filter [19], LCL-FRESH filter [20] and cyclic Wiener filter [21] incurs high computational complexity, and the practical effect needs to be improved.
With the development of big data and the improvement of computing power, deep learning has achieved great success in time series signal processing such as speech recognition, speech separation [12,15,[22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38], and communication signal modulation recognition [39]. These tasks demonstrate the powerful feature extraction and time-series processing capabilities of deep learning. However, the application of deep learning in communication signal processing is mostly limited to conventional modulation recognition and classification tasks, and is rarely involved in complex tasks such as single-channel communication signal separation; most research focuses on speech and EEG signals. TasNet [11] uses a traditional recurrent neural network, which is difficult to optimize and cannot effectively model such long sequences, so its separation effect is limited. Conv-TasNet [12] can model long sequences on the basis of a large receptive field to meet the long-term dependence requirement of blind source separation. However, the receptive field is limited, so the sequence length cannot be increased indefinitely: when the receptive field of a one-dimensional convolutional neural network (1-D CNN) is small, speech-level sequence modeling cannot be performed. Wave-U-Net [13] processes the time series context by repeatedly downsampling the feature maps and convolving, combining high-level and low-level features on different time scales. Each feature map generated by convolution uses the sampling rate of the original signal as its resolution, so memory consumption is high.

Blind Source Separation
In blind source separation (BSS), the waveform of the observed signal x(t) and the independence between the signal sources are used to make the estimated signal s*(t) as close to the source signal s(t) as possible. The source signal is expressed as s(t) = [s_1(t), s_2(t), ..., s_m(t)]^T, and the received observed mixed signal is x(t) = [x_1(t), x_2(t), ..., x_n(t)]^T. The mathematical model of blind source separation [12,15,23,24,[26][27][28][29][30][31]33,40] is the linear instantaneous mixture model x(t) = A s(t). In the formula, A is the n × m mixing matrix, m represents the number of source signals, and n represents the number of receiving antenna elements. When n < m, the problem is defined as underdetermined blind source separation; when n = 1, it is defined as single-channel blind source separation under underdetermined conditions. The single-channel underdetermined instantaneous mixing model is x(t) = Σ_{i=1}^{m} a_i s_i(t).
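The linear instantaneous mixing model can be sketched in a few lines. The sketch below assumes m = 2 sources and n = 1 receiving channel, i.e. the single-channel underdetermined case; the synthetic source waveforms and mixing coefficients are illustrative, not from the paper.

```python
import numpy as np

# Linear instantaneous mixing model x(t) = A s(t) with m = 2, n = 1.
t = np.arange(1024)
s = np.stack([np.sin(0.05 * t),                # source 1: sinusoid
              np.sign(np.sin(0.013 * t))])     # source 2: square-like wave
A = np.array([[0.7, 0.3]])                     # 1 x 2 mixing matrix (n < m)
x = A @ s                                      # 1 x T observed single-channel mixture
```

Because n = 1 < m = 2, `A` has no left inverse: `s` cannot be recovered from `x` by linear algebra alone, which is why mask-learning approaches are needed.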

LSTM
As is shown in Figure 1, the LSTM has three control gates: the input gate, the output gate, and the forget gate, which jointly control the unit state A. As is shown in Formula (3), the forget gate determines how much of the unit state at time t − 1, A_{t−1}, is retained in A_t. W_f and U_f are the weights of the forget gate, and σ is the sigmoid function [41]. As is shown in Formula (4), the input gate controls how much of the network input x_t at time t is retained in A_t.
As is shown in Formula (5), z t is used to describe the current input unit state.
As is shown in Formula (6), The current unit state A t is calculated as follows: the last unit state A t−1 is element-wise multiplied by the forgetting gate f t , and then the current input unit state z t is element-wise multiplied by the input gate i t , and the two products are added together. In this way, the current memory z t and the long-term memory A t−1 can be combined to form a new unit state A t . Because of the control of the forget gate, it can save information from a long time ago, and because of the control of the input gate, it can avoid the current unimportant content from entering the memory.
As is shown in Formula (7), the function of the output gate is to control how much the unit state at time t A t retains to the final output value h t at time t.
Finally, the current unit state A_t passes through the tanh activation function, and the result is multiplied by the output gate to get the output at the current moment, as shown in Formula (8). LSTM largely solves the gradient vanishing and explosion problems of the RNN. The RNN gradient vanishes or explodes because, when the network is very deep, the weight matrix W between the state and the hidden layer is repeatedly multiplied by itself. Suppose W has the eigendecomposition W = V diag(λ) V^{-1} (Formula (9)); then W^t = V diag(λ)^t V^{-1} (Formula (10)). Gradient vanishing or explosion is therefore governed by the eigenvalues: when an eigenvalue is greater than 1 the gradient explodes, and when it is less than 1 the gradient vanishes. These problems make the adjustment direction of the loss function unknowable and the learning process extremely unstable. Another problem of the RNN is that when the network is too deep, its memory function weakens, producing the long-term dependence problem. There are three remedies for the vanishing gradient of the recurrent neural network: first, ReLU can be selected as the activation function; second, Batch Normalization (BN) can be used; third, the network structure can be improved, as in the LSTM structure. LSTM can solve both the vanishing gradient and long-term dependence problems.
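A single LSTM step following Formulas (3)–(8) can be written out directly. This is a minimal NumPy sketch; the weight-dictionary keys ('f', 'i', 'z', 'o') and the tiny dimensions are illustrative names chosen here, not the paper's notation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, A_prev, W, U, b):
    """One LSTM time step per Formulas (3)-(8)."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate (3)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate (4)
    z_t = np.tanh(W['z'] @ x_t + U['z'] @ h_prev + b['z'])   # candidate state (5)
    A_t = f_t * A_prev + i_t * z_t                           # new unit state (6)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate (7)
    h_t = o_t * np.tanh(A_t)                                 # hidden output (8)
    return h_t, A_t

# Tiny usage example: hidden size 4, input size 3, random weights
rng = np.random.default_rng(1)
W = {k: rng.standard_normal((4, 3)) for k in 'fizo'}
U = {k: rng.standard_normal((4, 4)) for k in 'fizo'}
b = {k: np.zeros(4) for k in 'fizo'}
h, A = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, U, b)
```

Note how the additive update `f_t * A_prev + i_t * z_t` is what lets gradients flow through the cell state without being repeatedly multiplied by W, avoiding the eigenvalue problem described above.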

SE Module
As is shown in Figure 2, the SE (squeeze-and-excitation) module is an attention mechanism [42][43][44][45][46] that learns to use global information to selectively emphasize useful channel information and suppress useless channel information. It learns the correlation between channels and can perform dynamic channel feature recalibration to adjust the network and improve its expressive ability. The SE module is very flexible to use and can be added to an existing network without disturbing its main structure. The SE module comprises three parts: squeezing, excitation and scaling. W and H denote the width and height of the feature map, C denotes the number of channels, and the size of the input feature map is W × H × C. The first step is the squeezing operation, implemented through global average pooling; after it, the feature map is compressed into a 1 × 1 × C vector. The next step is the excitation operation, which consists of two fully connected layers, where ε is a scaling parameter whose purpose is to reduce the number of channels and the amount of calculation. The first fully connected layer has C × ε neurons; its input is 1 × 1 × C and its output is 1 × 1 × (C × ε). The second fully connected layer has C neurons; its input is 1 × 1 × (C × ε) and its output is 1 × 1 × C. The last step is the scale operation. After the 1 × 1 × C vector is obtained, the original feature map can be scaled: the weight of each channel calculated by the SE module is multiplied with the two-dimensional matrix of the corresponding channel of the original feature map, and the result is output.
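The three SE stages (squeeze, excitation, scale) map directly onto a few array operations. The sketch below is a plain-NumPy illustration with C = 16 and ε = 0.25; the function name, weight layout, and use of ReLU then sigmoid in the excitation follow the standard SE formulation and are assumptions, not taken verbatim from the paper.

```python
import numpy as np

def se_module(feature_map, W1, W2):
    """Squeeze-and-excitation sketch for an H x W x C feature map.
    W1: (C, C*eps) and W2: (C*eps, C) are the two FC layers' weights."""
    # Squeeze: global average pooling compresses the map to a 1 x 1 x C vector
    z = feature_map.mean(axis=(0, 1))
    # Excitation: two FC layers (ReLU, then sigmoid) yield per-channel weights
    hidden = np.maximum(z @ W1, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ W2)))
    # Scale: each channel of the original feature map is reweighted
    return feature_map * weights[None, None, :]

rng = np.random.default_rng(2)
C, eps = 16, 0.25                          # channel count and reduction ratio
fm = rng.standard_normal((8, 8, C))
W1 = rng.standard_normal((C, int(C * eps)))
W2 = rng.standard_normal((int(C * eps), C))
out = se_module(fm, W1, W2)
```

Because the sigmoid keeps each channel weight in (0, 1), the scale step can only attenuate channels, never amplify them, which is how the module suppresses uninformative channels.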

LSTM Block
There are three stages in the process of a sequence entering the LSTM block: the segmentation stage, the block processing stage, and the overlap-and-add stage. In the segmentation stage, the sequential feature input is divided into overlapping blocks, which are concatenated to form a 3-D tensor. As is shown in Figure 3, in the block processing stage, the 3-D tensor enters the Bi-LSTM in the time order of the segmented feature sequence, followed by the fully connected layer and the SE module. The output of the SE module is multiplied element-by-element with the output of the fully connected layer, and Group Normalization (GN) is performed on the result; the final result is then obtained through a residual connection. As described above, the SE module is an attention mechanism that captures global information with a simple structure and a small network scale. Global Average Pooling (GAP) is used in the SE module to regularize the entire network, avoiding the overfitting risk of full connection, and can replace the transformation function of the fully connected layer. In the overlap-and-add stage, the 3-D output of the last LSTM block is converted back to a sequential output by performing overlap-add on the blocks.
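The segmentation and overlap-and-add stages are exact inverses of each other when the blocks are not modified. The sketch below shows both, assuming a (T, N) feature sequence split into overlapping blocks of length 8 with hop 4; block length and hop are illustrative values, not the paper's configuration.

```python
import numpy as np

def segment(seq, block_len, hop):
    """Segmentation stage: split a (T, N) sequence into overlapping
    blocks stacked into a 3-D tensor (num_blocks, block_len, N)."""
    T = seq.shape[0]
    n_blocks = 1 + (T - block_len) // hop
    return np.stack([seq[i * hop:i * hop + block_len] for i in range(n_blocks)])

def overlap_add(blocks, hop):
    """Overlap-and-add stage: fold the 3-D tensor back to a (T, N)
    sequence, normalising by the per-sample overlap count."""
    n_blocks, block_len, N = blocks.shape
    T = (n_blocks - 1) * hop + block_len
    out = np.zeros((T, N))
    count = np.zeros((T, 1))
    for i in range(n_blocks):
        out[i * hop:i * hop + block_len] += blocks[i]
        count[i * hop:i * hop + block_len] += 1
    return out / count

x = np.arange(32, dtype=float).reshape(16, 2)   # toy (T=16, N=2) feature sequence
blocks = segment(x, block_len=8, hop=4)
y = overlap_add(blocks, hop=4)                   # reconstructs x exactly
```

In the network, the Bi-LSTM, FC layer, SE module, GN and residual connection all operate on the 3-D tensor between these two stages.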

Stacked-LSTM
In order to effectively extract the deep features of the communication signal and improve the separation accuracy, this paper proposes a Stacked-LSTM signal separation network. The network structure of Stacked-LSTM is shown in Figure 4, which consists of three parts: an encoding part, a separation part and a decoding part. The encoding part performs feature representation on the mixed signal. The separation part is trained through the stacked LSTM block to obtain the source signal mask, and the decoding part is used to restore the waveform. Accurate separation of sequence signals requires longer time window information, that is, long-term dependence. The superior performance of LSTM in sequence signal modeling and processing can meet the long-term dependence modeling of sequence signals such as voice signals and communication signals. The structure of the LSTM unit block is shown in Figure 3.

Linear Coding of Mixed Signals
1-D convolution is used to extract a linear coding feature representation from the one-dimensional aliased communication signal. Literature [12] shows that linear coding outperforms non-linear coding, so we adopt this coding method. The purpose of linear encoding is to encode the mixed signal for subsequent processing. A total of 512 convolution kernels are used to generate the multi-dimensional coding feature of the mixed signal, and the result is used as the input of the separation network. In the formula, h encoder (·) is the convolution operation, and w 1 and b 1 are the weight and bias of the convolution kernel.
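A strided 1-D convolution without nonlinearity is equivalent to framing the signal and projecting each frame onto the kernel bank. The sketch below illustrates this with 512 kernels as in the paper; the kernel length of 16, stride of 8, and the absence of bias are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def encode(mixture, kernels, stride):
    """Linear encoding sketch: frame the 1-D mixture and project each
    frame onto N convolution kernels of shape (N, L_kernel).
    Output: (num_frames, N) coding feature of the mixed signal."""
    L = kernels.shape[1]
    n_frames = 1 + (len(mixture) - L) // stride
    frames = np.stack([mixture[i * stride:i * stride + L]
                       for i in range(n_frames)])
    return frames @ kernels.T          # each row is an N-dim coding feature

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)          # toy mixed signal
w = rng.standard_normal((512, 16))     # 512 kernels, as in the paper
features = encode(x, w, stride=8)
```

Because the encoding is linear (no activation), the decoder can later invert it with a 1-D deconvolution to recover the waveform.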

Source Signal Mask Generation
The separation part is the process of obtaining the source signal masks; the mask of each source signal is obtained by training the network. The physical meaning of the mask corresponds to the mixing matrix A in the mathematical model of blind source separation. Traditional blind source separation methods obtain the mixing matrix through iterative calculation, whereas the deep learning method obtains a multidimensional mapping of the mixing matrix in the neural network by training the weight coefficients. The specific steps of the separation part are as follows: Step 1 First, perform Group Normalization (GN) on the input. In the group normalization operation, the channels are grouped, and the mean and variance are calculated within each group for normalization; the calculation of GN is independent of the batch size. As an alternative to batch normalization (BN), GN can transition naturally from pre-training to fine-tuning. Its performance is better than that of BN, enhancing the generalization ability of the model while avoiding gradient vanishing and explosion. Then 1-D convolution is used to extract features.
Step 2 Subsequently, the feature sequence enters the stacked Union-LSTM. A stacking block contains six LSTM blocks. Then the 1-D convolution is used to further extract features.
Step 3 PReLU is used as the activation function. Since the derivative of PReLU is nonzero over the negative interval, the problem of neurons not learning in that interval is prevented. Next, a 2-D convolution operation is performed to further extract features; then, through the tanh activation function, the multi-dimensional characteristics of the mixed source signal are obtained. The tanh output is the characteristic expression of the source signals, and the sigmoid output is the mask information of the two source signals. In this way, the time-domain masks of the two source signals are obtained after training. In the formula, f stacked−LSTM mask (·) is the separation module containing the stacked LSTM, which is used to generate a time-domain mask s mask.
Step 4 The coding feature of the mixed signal obtained in Section 4.2.1 actually contains the coding features of the two source signals. Each source signal has a latent time-domain mask [18], through which its coding feature can be extracted. The obtained time-domain mask of each source signal is multiplied with the encoding feature representation of the mixed signal to obtain the feature encoding of the two communication source signals. In the formula, • is the Hadamard product, i.e., the element-wise product of the two operands.
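Step 4 can be sketched as an element-wise product between each trained mask and the mixture's coding features. The mask values here come from a random sigmoid rather than a trained network, and the feature shape (127 frames × 512 channels) is an illustrative assumption.

```python
import numpy as np

# Sketch of Step 4: apply the per-source time-domain masks to the
# mixture's coding features via the Hadamard (element-wise) product.
rng = np.random.default_rng(4)
mix_features = rng.standard_normal((127, 512))   # encoder output (toy values)
raw = rng.standard_normal((2, 127, 512))
masks = 1.0 / (1.0 + np.exp(-raw))               # sigmoid masks, one per source
source_features = masks * mix_features[None]     # Hadamard product per source
```

Each slice `source_features[i]` is the coding feature estimate for source i, ready to be decoded back to a waveform by the 1-D deconvolution decoder.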

Source Signal Waveform Recovery
The 1-D deconvolution is used to perform 512-dimensional feature decoding on the separated communication source signal encoding to obtain a one-dimensional time-domain waveform: In the formula, h decoder (·) is the decoder.

Learning Process
Scale-invariant source-to-noise ratio (SISNR) [39] is usually used as the basic separation evaluation index to measure the performance of blind source separation. SISNR measures the ratio of the signal to the separation error. The higher the SISNR, the lower the separation error and the better the separation performance. Before the calculation, the source signal and the separated source signal are normalized to a zero mean value to ensure that the scale remains unchanged.
The gradient descent method is generally used during network training. This method needs to minimize the loss function. As is shown in Formula (15), a negative SISNR is used as the loss function. This ensures that in the end-to-end training, the loss is minimized and SISNR is maximized to ensure the accuracy of model training.
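The loss computation above can be made concrete. The SI-SNR formula below is the standard scale-invariant formulation (zero-mean both signals, project the estimate onto the source, compare target and error energies); the paper states the zero-mean normalization but not the full formula, so this exact form is an assumption, as are the toy signals.

```python
import numpy as np

def si_snr(estimate, source, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB. Both signals are
    zero-meaned first so the measure is invariant to scale."""
    estimate = estimate - estimate.mean()
    source = source - source.mean()
    # Project the estimate onto the source to split target and error parts
    s_target = (estimate @ source) * source / (source @ source + eps)
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

def loss(estimate, source):
    # Negative SISNR: minimising the loss maximises SISNR (Formula (15))
    return -si_snr(estimate, source)

t = np.linspace(0, 1, 1000)
s = np.sin(2 * np.pi * 5 * t)
scaled = 3.0 * s                        # perfect separation up to scale
noisy = s + 0.5 * np.random.default_rng(5).standard_normal(1000)
```

Note that `si_snr(scaled, s)` is very high even though the estimate is 3× the source: the projection step makes the metric blind to scaling, which is exactly why the signals must be zero-meaned first.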
The parameters are updated through the back-propagation gradient descent algorithm, which updates the parameters of the encoding part, the mask part and the decoding part, θ = {θ encoder , θ mask , θ decoder }.

Dataset
In the data generation and mixing part, communication data of five modulation modes, including BPSK, 8PSK, QAM16, QAM64 and PAM4, were generated through the software-defined radio platform GNURadio [47]; the sampling rate was 1 MHz, and the symbol rate was 125 K symbol/s. Referring to the existing research results [48][49][50][51][52][53] in Table 1, combined with the actual operating efficiency of the simulation platform, the selected signal-to-noise ratio range was 5-20 dB with a step size of 2.5 dB. In the simulation, it was assumed that aliased signals from different sources had the same frequency offset and timing deviation. This article focuses on the influence of the signal-to-noise ratio on the separation effect of the different algorithms.
To meet the long-term dependence requirement of the separation task, 1000 pure data signals were generated for each signal type at each signal-to-noise ratio, and each datum contained L = 32,768 sampling points, i.e., 32.768 ms. First, the amplitude of each pure signal was standardized. Then the linear instantaneous mixing model shown in Formula (1) was used to mix the signals. The signals of the five modulation modes were mixed in pairs to obtain data of 10 mixing modes; within each mixing mode, signals with the same signal-to-noise ratio were mixed. The 10 mixtures were BPSK-16QAM, 8PSK-64QAM, 8PSK-PAM4, 64QAM-PAM4, BPSK-8PSK, BPSK-64QAM, BPSK-PAM4, 8PSK-16QAM, 16QAM-64QAM and 16QAM-PAM4, which together formed a mixed data set of 70,000 samples.
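The pairwise mixing step above can be sketched as follows. The signals here are synthetic white-noise placeholders rather than actual GNURadio modulated waveforms, and peak normalization is one plausible reading of "amplitude standardization".

```python
import numpy as np

# Sketch of building one 10-mixC sample: two amplitude-standardised
# signals at the same SNR are combined with the linear model of Formula (1).
rng = np.random.default_rng(6)
L = 32768                                 # samples per datum (32.768 ms at 1 MHz)
s1 = rng.standard_normal(L)               # placeholder for, e.g., a BPSK waveform
s2 = rng.standard_normal(L)               # placeholder for, e.g., a 16QAM waveform
s1 /= np.max(np.abs(s1))                  # amplitude standardisation (peak = 1)
s2 /= np.max(np.abs(s2))
mixture = s1 + s2                         # single-channel linear mixture
```

Repeating this for each of the 10 modulation pairs, 7 SNR levels, and 1000 realizations yields the 70,000-sample data set.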

Experiment Process
The hardware resources used in the experiments were a Tesla K80 GPU and a 2.60 GHz Intel Xeon E5 CPU, and the deep learning framework was PyTorch 1.4. Five-fold cross-validation was used in all experiments.
As is shown in Figure 5, the experimental process consisted of two parts: the data generation and mixing part, and the signal separation part. The separation part consisted of three modules: a 1-D convolutional encoding module, a separation module, and a 1-D convolutional decoding module.

Case Study 1: Effect under Pure Environment
Experimental purpose: to compare the performance difference between the Stacked-LSTM network and ICA, NMF, TasNet, Wave-U-Net, and Conv-TasNet under ideal transmission conditions with a high signal-to-noise ratio of SNR = 20 dB. The basic parameter configurations of the six blind source separation methods are shown in Table 2. As traditional machine learning methods, ICA [9] and NMF [10] are two classic algorithms in the field of blind source separation. ICA uses dynamic embedding to convert single-channel observation data into multi-channel data for separation; it performs well in positive-definite and over-determined separation but has poor accuracy in single-channel underdetermined separation. NMF computes the basis matrix and coefficient matrix of the sources by minimizing the KL divergence to achieve signal separation. Such traditional algorithms are equivalent to shallow models and do not extract deep features of the signal. As three deep learning separation methods, TasNet [11], Conv-TasNet [12] and Wave-U-Net [13] can realize single-channel blind source separation. The TasNet network adopts a separation module with a three-level structure; a Long Short-Term Memory (LSTM) network is used in each block, and its large number of parameters significantly increases the computational cost. In the 1-D block of the Conv-TasNet network, depthwise separable convolution is used, and the kernels are dilated convolutions with different dilation rates to obtain features at different time scales. However, when the receptive field of the one-dimensional convolution is smaller than the sequence length, sequence-level modeling cannot be performed. The Wave-U-Net network uses time-domain convolution to process the time series context, combining high-level and low-level features on different time scales. The sampling rate of the original signal is used as the resolution of each feature map generated by convolution, so memory consumption is high.
The experimental results are shown in Table 3. The Stacked-LSTM method had the best performance, with a loss value of −18.67 dB, 2.62 dB lower than that of Conv-TasNet. Under ideal transmission conditions, the separation accuracy ranked as Stacked-LSTM > Conv-TasNet > Wave-U-Net > TasNet > ICA > NMF; the losses were −18.67 dB < −16.05 dB < −13.97 dB < 7.93 dB < 4.09 dB. In addition, the deep learning methods performed significantly better than traditional ICA and NMF at high signal-to-noise ratio, and Stacked-LSTM had the best separation performance among the deep learning methods. The Stacked-LSTM network had three main characteristics that account for its superior performance. First, the LSTM block only needed to be stacked six times, whereas the 1-D CNN block of Conv-TasNet was stacked eight times and this stack was repeated three times, for a total of 24 blocks; in terms of model size, the Stacked-LSTM network was smaller. Second, the Stacked-LSTM method built on the three-level structure of Conv-TasNet, replacing the CNN block with a Bi-LSTM-based block. Multi-dimensional long sequence features were segmented and stacked into a 3-D tensor as input, which could fully model the long-term dependence of blind source separation, whereas the CNN block in Stacked-TCN had a limited receptive field; the Stacked-LSTM network could therefore solve Stacked-TCN's restricted receptive field problem. Third, Stacked-LSTM applied the attention mechanism of the SE module to the LSTM block, using a simple mechanism to capture important global information.

Case Study 2: Effect under Noisy Environment
Experimental purpose: first, to verify the generalization performance and noise robustness of the Stacked-LSTM algorithm under different noise transmission conditions; the signal-to-noise ratio range was set to SNR = 5∼20 dB. Second, to calculate the time for each of the six algorithms to separate a single signal, in order to comprehensively compare running time and separation accuracy. Figure 6 shows the waveform results of the 10 mixed signals separated by the Stacked-LSTM algorithm under the condition of SNR = 15 dB. Most of the separation results were good and basically consistent with the source signals. Compared with the 1-D CNN block, the Bi-LSTM block could overcome the limitation of the receptive field and convert the sequential input into a 3-D tensor, giving it higher separation potential and a more adequate way to model the long-term dependence of the source signal. At the same time, the attention mechanism of the SE module added to the Bi-LSTM block could capture global information and learn the correlation between channels, thereby suppressing useless channel information, strengthening useful channel information, and dynamically adjusting the network. In addition, the excellent fitting ability of the deep LSTM network could better learn the time-domain masks of the source signals. Figure 7 shows the loss of the Stacked-LSTM algorithm over the 5-20 dB signal-to-noise ratio range, together with the loss line graphs of the other five comparison methods. The following experimental results were obtained: first, the deep learning methods were significantly better than the ICA and NMF machine learning methods. Second, the separation accuracy of the Stacked-LSTM method increased with SNR, and the trend was relatively stable.
Third, at low signal-to-noise ratio, the Stacked-LSTM and Stacked-TCN methods had basically the same separation accuracy for nine kinds of mixed data; for the remaining mixture, 64QAM-PAM4, Stacked-LSTM was superior. Compared with Stacked-TCN, Stacked-LSTM was more robust to noise interference. Fourth, over the whole 5-20 dB signal-to-noise ratio range, for the five mixtures BPSK-8PSK, BPSK-64QAM, BPSK-PAM4, 8PSK-16QAM and 16QAM-PAM4, the Stacked-LSTM and Stacked-TCN methods were basically the same; in the other five cases, the Stacked-LSTM method clearly had an advantage in separation accuracy. These results stem from the fact that NMF and ICA are model-driven machine learning methods, essentially shallow models suited to tasks with small samples and precise models. Single-channel blind source separation, however, is a difficult problem with very little prior knowledge: the shallow model cannot describe its essential characteristics, fails to make full use of the deep features and information of the signal, generalizes weakly to big data, and performs poorly under extreme ill-conditioned single-channel conditions. When a task is too complex to be accurately described by a model-driven method, a data-driven deep learning method can compensate, learning deep essential features from a large number of samples with strong fitting capability to meet the needs of tasks such as signal separation. Therefore, the separation effect of Stacked-LSTM, Wave-U-Net, TasNet, and Conv-TasNet was better than that of ICA and NMF. Among them, the TasNet network used LSTM as a block, and its large number of parameters increased the computational cost.
To keep the network model and calculation speed reasonable, the number of blocks had to be controlled, so the modeling accuracy for the timing signal was insufficient and the effect was poor. The Wave-U-Net network processes the time series context by repeatedly downsampling and convolving the feature maps, which gave good results, though with fluctuations. The Conv-TasNet method used dilated convolution in the separation module to reduce the number of parameters; for the same network model size, the number of block repetitions could be increased, but the sequence reception window was small and sequence-level modeling could not be performed. In Stacked-LSTM, organizing LSTM in a deep structure is a simple and effective method for modeling extremely long sequences; the attention mechanism of the SE module was used to capture global information, and the Union-LSTM was used instead of the 1-D CNN for time-domain separation. This sequence-level modeling network had the best accuracy and stability among the deep learning methods.

Separation Time Compare
In addition, the per-frame calculation time of the Stacked-LSTM time-domain mask method and of the time-frequency domain mask method was measured, as shown in Table 4. In most signal separation studies using time-frequency domain masking, the STFT window length is at least 256 points [40,49,54,55], so the single-frame duration in this experiment was 0.256 ms, with a calculation time of 1.24 ms. The longer time window and calculation time increase the minimum delay of the system. In contrast, the Stacked-LSTM time-domain mask method could reduce the single-frame length to 0.032 ms without reducing separation accuracy, and its calculation time was only ms. To successfully separate the source signal from the time-frequency representation, the time-frequency domain masking method requires a high-resolution frequency decomposition of the mixed signal, which demands a long STFT time window. This requirement increases the minimum delay of the system and limits its applicability in real-time and low-latency applications; therefore, more and more research has turned to time-domain methods [12,23,24,26,27,29].

Application Suggestions
The signals separated by the Stacked-LSTM network in this article are 10 mixed signals composed of five modulated signals. In fact, we can construct a training set with more modulation signals, which is expandable in the types of separated signals, and is not limited to the data set we use. The network structure and scale can be adjusted according to the amount of data, and the network can be trained to separate more kinds of signals.

Discussion
Stacked-LSTM has three advantages. First, compared with Conv-TasNet, it requires fewer stacked LSTM blocks to achieve the same separation accuracy, and even surpasses it on half of the mixed signals; in addition, the model is more concise. Second, it is superior to Stacked-TCN in the potential to improve separation accuracy, because the input of the Bi-LSTM block can be segmented and overlapped into a 3-D tensor, so the longest possible sequence can be used to model long-term dependence, whereas the receptive field of the Stacked-TCN block cannot be made too large. Third, GAP is embedded in the Bi-LSTM block of Stacked-LSTM instead of the FC layer, and the attention mechanism of the SE module determines the importance of different channel features by capturing global information to dynamically adjust the network. Despite these advantages, there is still room to improve the separation speed.

Conclusions
In this paper, we propose a deep learning model based on Stacked-LSTM to solve the single-channel blind source separation problem in which two source signals are mixed into one signal. The proposed model separates the mixture and restores the two source signals by learning useful information from big data. By training the time-domain mask, the mapping of the mixing matrix in the neural network is obtained, avoiding the high error of directly solving the mixing matrix iteratively. Experiments show that, compared with classical methods and other deep learning models, the proposed model generalizes well in both pure and noisy environments. The separation loss is reduced by 10.09-38.17 dB.