TB-NET: A Two-Branch Neural Network for Direction of Arrival Estimation under Model Imperfections

Abstract: For direction of arrival (DoA) estimation, the data-driven deep-learning method has an advantage over the model-based methods since it is more robust against model imperfections. Conventionally, networks are based singly on regression or classification and may lead to unstable training and limited resolution. Alternatively, this paper proposes a two-branch neural network (TB-Net) that combines classification and regression in parallel. The grid-based classification branch is optimized by binary cross-entropy (BCE) loss and provides a mask that indicates the existence of the DoAs at predefined grids. The regression branch refines the DoA estimates by predicting the deviations from the grids. At the output layer, the outputs of the two branches are combined to obtain the final DoA estimates. To achieve a lightweight model, only convolutional layers are used in the proposed TB-Net. The simulation results demonstrated that, compared with the model-based and existing deep-learning methods, the proposed method can achieve higher DoA estimation accuracy in the presence of model imperfections and only has a size of 1.8 MB.

Recently, with the rapid development of deep learning, neural-network-based algorithms have been proposed for DoA estimation. Thanks to the data-driven characteristics, these methods can be robust against model imperfections [6]. Generally, these methods can be divided into those based on regression networks or classification networks.
Under regression networks, different structures have been proposed to estimate the DoA values. Specifically, an end-to-end algorithm was proposed in [7], and a deep convolutional network was used to recover the spatial spectrum in [8]. However, the network structure depends heavily on the number of sources, which makes it difficult to extend to scenarios where the number changes. For a small number of snapshots, a neural network was utilized in [9] for signal denoising, and a deep neural network (DNN) was utilized in [10,11] to reconstruct the covariance matrix. In [12,13], bidirectional gated recurrent units (GRUs) and bidirectional long short-term memory (BiLSTM) were introduced to learn the dependencies of signals, and the DoAs were estimated by the regression layer. In [14], a DNN was proposed to map the received signals to those of a larger dimension, which is equivalent to adopting a larger antenna array, such that the DoA resolution is improved.
For classification networks, the most common architecture is the grid-based model, which divides the angular domain into several sectors, and then, for each sector, it is determined whether there exists an incoming signal [15]. In [6], an autoencoder with multilayer classifiers was proposed to build the spatial spectrum. In [16], two frameworks were proposed to separate coherent signals. In [17], a deep convolutional neural network (CNN) with 2D convolutional layers was proposed to improve the DoA estimation accuracy under a low signal-to-noise ratio (SNR). The grid-based model can improve the training stability, and the structure is universal in scenarios where the source number changes, but it is difficult to achieve a high resolution due to the limited number of grids.
In this paper, we propose a grid-based two-branch neural network (TB-Net). In particular, the proposed classification branch (C-Branch) and regression branch (R-Branch) work in parallel and share a feature extraction network. The C-Branch provides a grid-based mask to coarsely determine the DoAs, and the R-Branch provides a refinement of the DoA estimates. At the output layer, the DoA estimates are obtained by combining the masks and the corresponding deviations. The sharing of the feature extraction network leads to a lightweight network. Besides, to further reduce the number of weights and the computational complexity, the proposed TB-Net consists only of convolutional layers. The simulation results showed that, compared with conventional classification networks, the proposed TB-Net can achieve higher DoA estimation accuracy at a small cost in computational overhead. Additionally, compared with the model-based and existing deep-learning methods, TB-Net is more robust against model imperfections and requires less calculation.
The rest of this paper is organized as follows. In Section 2, the signal model and neural network are introduced. The proposed TB-Net is described in Section 3. Section 4 shows the simulation results. Section 5 concludes this paper.

Preliminaries
In this section, we first describe the signal model and then give a brief introduction to CNNs.

Signal Model
In this work, narrowband signals were used for training and testing. In addition, we took into account three kinds of model imperfections that can degrade the DoA estimation performance.
Denoting $s_k(t)$ as the k-th incoming signal and $\theta_k$ as its DoA, the received signal can be expressed as:
$$x(t) = \sum_{k=1}^{K} a(\theta_k)\, s_k(t) + n(t), \qquad (1)$$
where $n(t) \sim \mathcal{CN}(0, \sigma^2)$ is the Gaussian noise, $K$ is the number of sources, and $a(\theta_k)$ is the array response vector. In this paper, we assumed that a uniform linear array (ULA) is adopted. Hence, we have:
$$a(\theta_k) = \left[1,\; e^{j 2\pi d \sin\theta_k / \lambda},\; \ldots,\; e^{j 2\pi (M-1) d \sin\theta_k / \lambda}\right]^{T}, \qquad (2)$$
where $M$ denotes the number of antennas in the ULA, $d$ denotes the spacing between adjacent antennas, and $\lambda$ denotes the wavelength. Similar to [6], we considered three kinds of model imperfections, i.e., gain and phase inconsistency ($e_g$, $e_{pha}$), inter-sensor mutual coupling ($e_m$), and the deviation of the antenna position ($e_{pos}$). Hence, the i-th ($i = 1, \ldots, M$) element in $a(\theta_k)$ can be rewritten as:
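The ideal (imperfection-free) part of the signal model can be illustrated with a minimal sketch in plain Python. Several choices below are assumptions made only for illustration: the positive sign of the phase exponent, the unit-power complex Gaussian source symbols, and the helper names `steering_vector`, `complex_gaussian`, and `received_snapshot`. The imperfection terms are not modeled here.

```python
import cmath
import math
import random


def steering_vector(theta_deg, num_antennas, spacing_over_wavelength=0.5):
    """Ideal ULA response a(theta); element i has phase 2*pi*i*(d/lambda)*sin(theta)."""
    phase_step = 2.0 * math.pi * spacing_over_wavelength * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * i * phase_step) for i in range(num_antennas)]


def complex_gaussian(rng, variance):
    """One sample of CN(0, variance): real and imaginary parts each carry variance/2."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def received_snapshot(doas_deg, num_antennas, noise_variance, rng):
    """x(t) = sum_k a(theta_k) s_k(t) + n(t) for a single time instant."""
    # Start from the noise vector n(t), then add each source contribution.
    x = [complex_gaussian(rng, noise_variance) for _ in range(num_antennas)]
    for theta in doas_deg:
        s_k = complex_gaussian(rng, 1.0)  # assumed unit-power source symbol
        a = steering_vector(theta, num_antennas)
        for i in range(num_antennas):
            x[i] += a[i] * s_k
    return x


rng = random.Random(0)
x = received_snapshot([-10.0, 25.0], num_antennas=16, noise_variance=0.1, rng=rng)
```

Calling `received_snapshot` repeatedly with the same DoAs yields the $N$ snapshots from which a covariance estimate can be formed.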

Neural Network Model
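The covariance matrix of $x(t)$ can be approximated as:
$$R = \frac{1}{N} \sum_{t=1}^{N} x(t)\, x^{H}(t), \qquad (4)$$
where $N$ denotes the number of snapshots and $(\cdot)^{H}$ denotes the conjugate transpose. Then, the input to the proposed neural network is obtained from $\mathrm{Vec}(\mathrm{Triu}(R))$, where $\mathrm{Triu}(\cdot)$ represents the upper triangular area of the matrix and $\mathrm{Vec}(\cdot)$ reformulates the matrix as a vector.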
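As a rough illustration of this preprocessing, the sketch below computes the sample covariance and vectorizes its upper triangle. Two details are assumptions rather than statements from the text: the main diagonal is included in the upper triangular area, and the complex vector is split into two real channels (real and imaginary parts). The function names are hypothetical.

```python
def sample_covariance(snapshots):
    """R = (1/N) * sum_t x(t) x(t)^H for a list of N length-M complex snapshots."""
    n = len(snapshots)
    m = len(snapshots[0])
    cov = [[0j] * m for _ in range(m)]
    for x in snapshots:
        for r in range(m):
            for c in range(m):
                cov[r][c] += x[r] * x[c].conjugate()
    return [[v / n for v in row] for row in cov]


def upper_triangle_vector(matrix):
    """Vec(Triu(R)): row-major entries on and above the main diagonal (assumed)."""
    m = len(matrix)
    return [matrix[r][c] for r in range(m) for c in range(r, m)]


def network_input(snapshots):
    """Assumed layout: two real channels holding the real and imaginary parts."""
    v = upper_triangle_vector(sample_covariance(snapshots))
    return [[z.real for z in v], [z.imag for z in v]]


# Two snapshots from a two-element array, chosen so the result is easy to check by hand.
snapshots = [[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j]]
features = network_input(snapshots)
```

Under these assumptions, a 16-element array gives a vector of 16 · 17 / 2 = 136 entries per channel.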
The convolutional layer has been widely used in neural networks, and the output of the i-th layer can be expressed as:
$$y_i = f(W_i * u_i + b_i), \qquad (6)$$
where $u_i$ is the input feature, $W_i$ is the convolution kernel, $b_i$ is the bias, and $*$ denotes convolution. In (6), the nonlinear function $f(\cdot)$ (e.g., ReLU, Sigmoid, or Tanh) is used for space mapping. To train the neural network, $W_i$ and $b_i$ are updated via backpropagation under a certain loss function [18].
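The layer operation can be made concrete with a small single-channel, one-dimensional sketch. It assumes no padding, a stride of one, and the cross-correlation convention used by most deep-learning frameworks (the kernel is not flipped). The names `relu` and `conv1d_layer` are illustrative only.

```python
def relu(v):
    """ReLU activation: max(v, 0)."""
    return v if v > 0.0 else 0.0


def conv1d_layer(u, kernel, bias, activation=relu):
    """y = f(W * u + b) for one input channel and one output channel.

    Assumes stride 1 and no padding; '*' is implemented as cross-correlation.
    """
    k = len(kernel)
    out = []
    for start in range(len(u) - k + 1):
        acc = bias
        for j in range(k):
            acc += kernel[j] * u[start + j]
        out.append(activation(acc))
    return out


y = conv1d_layer([1.0, 2.0, -1.0, 0.5], kernel=[1.0, -1.0], bias=0.0)
# y == [0.0, 3.0, 0.0]: the negative pre-activations are clipped by ReLU.
```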
To reduce the internal covariate shift and accelerate the convergence rate, batch normalization (BN) can be performed before activation [19], for which $y_i$ can be rewritten as:
$$y_i = f\left(\mathrm{BN}(W_i * u_i + b_i)\right). \qquad (7)$$

Proposed TB-Net
Figure 1 shows the architecture of the proposed TB-Net, which can be divided into two parts: the feature extraction network and the parallel prediction network. The details of these two networks are presented in Sections 3.1 and 3.2. The parallel prediction network consists of the C-Branch, the R-Branch, and an output layer, which are described in Sections 3.2.1-3.2.3.
The detailed parameters of TB-Net are listed in Table 1, where C_IN denotes the number of input channels, C_OUT denotes the number of output channels, H denotes the kernel height, and W denotes the kernel width.

Feature Extraction Network
The feature extraction network extracts features from the covariance matrix and outputs them to the C-Branch and R-Branch, which realizes feature reuse and reduces the computational complexity.
The parameters of the network were determined by experiments. The results showed that the network consisting of five convolutional layers had the best mean absolute error (MAE) performance, and the parameters of the convolution kernel in each layer are listed in Table 1. Additionally, BN was utilized to accelerate the convergence. The experiments showed that adopting BN in the first five layers led to the best training stability and the lowest MAE.

Parallel Prediction Network
The prediction network consists of the C-Branch and the R-Branch, and these two networks work in parallel. Denoting $G$ as the number of grids, the output of the C-Branch is a mask vector $m = [m_1, m_2, \ldots, m_G]$, whose i-th element indicates the likelihood that a DoA is around the i-th grid. The output of the R-Branch is a deviation vector $d = [d_1, d_2, \ldots, d_G]$, where $d_i$ represents the DoA's deviation, or estimation refinement, with respect to the i-th grid.
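To make the two target vectors concrete, the following sketch builds a mask and a deviation vector from a list of true DoAs. It assumes a uniform grid of 121 points from −60° to 60° with a 1° spacing, with grid index 0 placed at −60°; this layout and the helper name `make_labels` are assumptions, not details confirmed by the text.

```python
import math


def make_labels(doas_deg, theta_min=-60.0, grid_size=1.0, num_grids=121):
    """Return (mask, deviation) label vectors for one sample.

    mask[i] is 1 when a source falls nearest to grid i, and deviation[i] is the
    offset of that source from grid i, which lies in [-grid_size/2, grid_size/2].
    """
    mask = [0.0] * num_grids
    deviation = [0.0] * num_grids
    for theta in doas_deg:
        # Nearest grid index (round half up), clipped to the valid range.
        idx = int(math.floor((theta - theta_min) / grid_size + 0.5))
        idx = min(max(idx, 0), num_grids - 1)
        mask[idx] = 1.0
        deviation[idx] = theta - (theta_min + idx * grid_size)
    return mask, deviation


mask, deviation = make_labels([56.47, -10.2])
```

For a source at 56.47°, the nearest grid is 56° (index 116 under the assumed layout), so the mask is set there and the stored deviation is 0.47°.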
For model optimization, the total loss was set as:
$$l = l_c + l_r,$$
where $l_c$ is the loss of the C-Branch and $l_r$ is the loss of the R-Branch.
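In the C-Branch, we used the Sigmoid function as the activation function of the output layer, which maps the result to the interval (0, 1). We used binary cross-entropy (BCE) as the loss function to optimize the neural network, i.e.,
$$l_c = -\frac{1}{G} \sum_{i=1}^{G} \left[ m_i \log \hat{m}_i + (1 - m_i) \log\left(1 - \hat{m}_i\right) \right],$$
where $m_i$ is the label and $\hat{m}_i$ is the output of the network.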
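A minimal sketch of the BCE loss over the grid mask is given below. Averaging over the grids and clipping the predictions away from 0 and 1 for numerical safety are assumptions of this sketch, and `bce_loss` is a hypothetical helper name.

```python
import math


def bce_loss(labels, outputs, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and sigmoid outputs, averaged over grids."""
    total = 0.0
    for m, m_hat in zip(labels, outputs):
        m_hat = min(max(m_hat, eps), 1.0 - eps)  # avoid log(0)
        total -= m * math.log(m_hat) + (1.0 - m) * math.log(1.0 - m_hat)
    return total / len(labels)


# A confident, correct prediction gives a smaller loss than a confident, wrong one.
good = bce_loss([1.0, 0.0], [0.9, 0.1])
bad = bce_loss([1.0, 0.0], [0.1, 0.9])
```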

Regression Branch
The proposed R-Branch consists of a convolutional layer containing 121 output channels, which is consistent with the C-Branch. For the i-th channel, the output is the deviation on the i-th grid in the C-Branch. Note that such a deviation is valid only when $m_i = 1$.
Since the grid size in the C-Branch is $\Delta\theta$, the deviation is restricted to $[-0.5\Delta\theta, 0.5\Delta\theta]$. Hence, the weighted Tanh function was used as the activation function, i.e.,
$$f(x) = 0.5\,\Delta\theta \tanh(x).$$
For training, we adopted $l_2$ as the loss function to optimize the neural network, i.e.,
$$l_r = \frac{1}{G} \sum_{i=1}^{G} \left(d_i - \hat{d}_i\right)^2,$$
where $d_i$ denotes the actual deviation and $\hat{d}_i$ denotes the output of the R-Branch. Most importantly, because the R-Branch is parallel with the C-Branch, there is no data dependency between the two networks, which implies that TB-Net can estimate the DoAs in one evaluation.
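The bounded activation and the regression loss can be sketched as follows. The scaling of Tanh by half the grid size follows from the stated deviation range; averaging the squared error over all grids without masking is an assumption of this sketch (restricting the sum to grids with a source present would be another plausible choice). Both function names are illustrative.

```python
import math


def weighted_tanh(x, grid_size=1.0):
    """Map a raw output to the deviation range [-grid_size/2, grid_size/2]."""
    return 0.5 * grid_size * math.tanh(x)


def l2_loss(true_dev, pred_dev):
    """Squared error between true and predicted deviations, averaged over grids."""
    return sum((d - d_hat) ** 2 for d, d_hat in zip(true_dev, pred_dev)) / len(true_dev)


# Large raw outputs saturate at half the grid size.
upper = weighted_tanh(50.0)                  # approximately 0.5
lower = weighted_tanh(-50.0, grid_size=2.0)  # approximately -1.0
```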

Output Layer
The output layer combines $m$ and $d$ and obtains the DoA estimates. It first finds $K$ (the number of sources) peak indexes $p = [p_1, \ldots, p_K]$ in $\hat{m}$ and obtains the coarse DoA estimates by multiplying them by the grid size. Then, the final DoA estimates are obtained by adding the deviations selected in $d$ according to $p$. The process is shown as:
$$\hat{\theta}_k = p_k \Delta\theta + d_{p_k}, \quad k = 1, \ldots, K,$$
where $\hat{\theta}_k$ denotes the DoA estimates and $\Delta\theta$ denotes the grid size.
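The decoding step can be sketched as below. The peak search (local maxima ranked by height) and the grid layout (index 0 at −60°, 1° spacing, hence the `theta_min` offset) are assumptions made for this illustration, and `decode_doas` is a hypothetical name.

```python
def decode_doas(mask_hat, dev_hat, num_sources, theta_min=-60.0, grid_size=1.0):
    """Combine the C-Branch mask and R-Branch deviations into DoA estimates."""
    g = len(mask_hat)
    # A grid is a peak if it rises above its left neighbour and is not below its right one.
    peaks = [
        i for i in range(g)
        if (i == 0 or mask_hat[i] > mask_hat[i - 1])
        and (i == g - 1 or mask_hat[i] >= mask_hat[i + 1])
    ]
    # Keep the num_sources highest peaks, then report them in ascending angle order.
    peaks.sort(key=lambda i: mask_hat[i], reverse=True)
    chosen = sorted(peaks[:num_sources])
    return [theta_min + i * grid_size + dev_hat[i] for i in chosen]


mask_hat = [0.0] * 121
dev_hat = [0.0] * 121
mask_hat[50], dev_hat[50] = 0.9, -0.2
mask_hat[116], dev_hat[116] = 0.8, 0.47
mask_hat[117] = 0.6  # a neighbouring grid also responds; the higher peak wins
estimates = decode_doas(mask_hat, dev_hat, num_sources=2)
```

In this example the two estimates are approximately −10.2° and 56.47°; the response on grid 117 is ignored because it is not a local maximum.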

Experimental Results and Discussion
A 16-element ULA with a half-wavelength inter-element spacing was used to generate the dataset. Two sources with equal power were randomly generated within [−60°, 60°]. The training, validation, and testing sets contained 100,000, 20,000, and 20,000 samples, respectively.
We implemented TB-Net in PyTorch. In the training process, we set the initial learning rate to 0.001 and multiplied it by 0.9 every 30 epochs. We used the Adam [20] optimizer to update the network parameters during training. The total number of training epochs was set to 300, and the candidate achieving the highest DoA estimation accuracy was selected as the final model.
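The step-decay schedule described above can be written in closed form. The sketch below assumes zero-based epoch indices and uses a hypothetical helper name; in PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=30` and `gamma=0.9`.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.9, step=30):
    """Learning rate at a zero-based epoch: multiplied by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# Over 300 epochs the rate is reduced nine times.
final_lr = learning_rate(299)  # 1e-3 * 0.9**9, roughly 3.87e-4
```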
We used MAE to measure the performance of the algorithms, i.e., where $N_T$ denotes the number of testing samples.
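One way such an MAE could be computed over a test set is sketched below. Pairing true and estimated DoAs by sorting both in ascending order, and averaging over both sources and samples, are assumptions of this sketch rather than details given in the text.

```python
def mean_absolute_error(true_doas, est_doas):
    """Average |theta - theta_hat| over all sources and all test samples.

    Each argument is a list of per-sample DoA lists (in degrees). Within a sample,
    true and estimated angles are paired after sorting (an assumed matching rule).
    """
    total = 0.0
    count = 0
    for truth, est in zip(true_doas, est_doas):
        for t, e in zip(sorted(truth), sorted(est)):
            total += abs(t - e)
            count += 1
    return total / count


mae = mean_absolute_error(
    [[10.0, -5.0], [0.0, 20.0]],
    [[-5.5, 10.2], [0.1, 19.7]],
)
```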

Classification Network
As shown in Figure 2, the output of the C-Branch indicates the likelihood that a source exists on each grid. In cases (a), (b), and (c), the two grids corresponding to the peaks are taken as the DoA estimates. In case (d), the direction of 56.47° causes closely located peaks on two grids, from which the one with the higher peak is taken as the DoA estimate (e.g., Point 3 in Figure 2d).
We compared the C-Branches that were optimized by $l_2$ and $l_{BCE}$ separately, and the results are shown in Figure 3. The network optimized by $l_{BCE}$ was more accurate than the one optimized by $l_2$. Under SNR = 10 dB, the improvement in accuracy was about 44.7%.
Figure 4 shows the impact of introducing the R-Branch. The DoA estimation accuracy improved with the SNR. When the SNR was low, the coarse estimates of the C-Branch were far from the true DoA values and thus degraded the MAE significantly. As the SNR increased, the coarse DoA estimates obtained by the C-Branch became almost error-free, so the deviation given by the R-Branch gradually dominated the estimation accuracy. This effect was evident when SNR > 2 dB. Compared with the C-Branch alone, TB-Net improved the accuracy by about 36.4% under SNR = 10 dB.

Complexity Analyses
The computational complexity of the DNN-based algorithms is compared in Table 2, where the weight denotes the model size and the calculation amount denotes the number of multiplication and addition operations. We chose the networks proposed in [6,17] for comparison, which are referred to as DNN_SF_SS and 2D-CNN, respectively. The results show that the CNN-based TB-Net had the smallest model size and the lowest computational complexity.

Experiments with Model Imperfections
The imperfections considered in this paper were modeled as: where the parameter $\rho \in \{0.1, 0.2, \ldots, 0.9\}$ was used to control the strength of the imperfections. For the experiments, the data were generated under SNR = 10 dB, and the number of snapshots was set to $N = 40$. We compared TB-Net with MUSIC [1], ESPRIT [2], and DNN-based models. For MUSIC, the searching step was set to 0.1°. To make a fair comparison, the number of training epochs for TB-Net was set to 300. For the 2D-CNN and the DNN_SF_SS, the training parameters were set according to [6,17].
Figure 5 shows that TB-Net performed well in all situations except mutual coupling, where MUSIC had the best performance. The prediction results of TB-Net did not fluctuate much with the increase of $\rho$, and the error was around 0.19°. In contrast, the performances of MUSIC and ESPRIT deteriorated significantly with the increase of $\rho$, especially in the cases of gain and phase inconsistency and the deviation of the antenna position. The error of the deep-learning-based algorithms did not deteriorate with the increase of $\rho$. However, the DNN_SF_SS [6] performed poorly across the conditions, especially under inter-sensor mutual coupling. The MAE of the 2D-CNN [17] fluctuated around 0.6°, which was limited by the resolution of the grid-based classification network. In comparison, TB-Net consistently achieved high DoA estimation accuracy under the various model imperfections.

Conclusions
In this paper, TB-Net, which combines classification and regression in parallel, was proposed to address DoA estimation. The DoA estimates were first coarsely obtained by the C-Branch and then further refined by the R-Branch. The experiments demonstrated that TB-Net had a higher DoA estimation accuracy in the presence of model imperfections.
Besides, the C-Branch and the R-Branch shared a feature extraction network to reduce the model size, and only convolutional layers were adopted to keep the neural network lightweight. Hence, the proposed TB-Net had a model size of 1.8 MB and a calculation amount of 0.78 million multiplication and addition operations, which are the smallest values to our knowledge. The proposed TB-Net has the limitation that the source number is assumed to be fixed and known. Therefore, DoA estimation for an arbitrary source number is a potential direction for future research.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: