A Deep Learning Method for DOA Estimation with Covariance Matrices in Reverberant Environments

Abstract: Acoustic source localization in the spherical harmonic domain under reverberation has hitherto not been extensively investigated. Deep learning frameworks can estimate the direction-of-arrival (DOA) with spherical microphone arrays in reverberant and noisy environments with low computational complexity and high accuracy. This paper proposes three different covariance matrices as input features and two different learning strategies for the DOA task. There is a progressive relationship among the three covariance matrices: the second matrix is obtained by processing the first and filters out the effects of the microphone array and the mode strength to some extent; the third matrix is obtained by processing the second and further removes information irrelevant to the source location. In terms of the strategies, the first is a regular learning strategy, while the second splits the task into three parts performed in parallel. Experiments on both simulated and real datasets show that the proposed method achieves higher accuracy and lower computational complexity than the conventional methods. Thus, the proposed method can effectively resist reverberation and noise.


Introduction
Direction-of-arrival (DOA) estimation has received much attention in the field of signal processing because it has a wide range of applications. It can be applied to wireless communication, speech source separation, video conferencing and so on [1][2][3][4][5][6]. Over the years, many conventional methods have been successfully developed for the estimation task. The subspace-based methods are the most popular among traditional methods for their outstanding performance. The multiple signal classification (MUSIC) [7] and the estimation of signal parameters via rotational invariance techniques (ESPRIT) [8] are eminent representatives of the subspace-based algorithms. In the MUSIC algorithm, the noise subspace is used to generate a spectrum, and the position where the spectral peak appears is the estimated DOA. Although it can guarantee relatively high accuracy, the peak search is very time-consuming. Conversely, the ESPRIT method utilizes the signal subspace for DOA estimation; ESPRIT is time-saving at the expense of accuracy. In addition, the beamforming-based approach, such as the minimum variance distortionless response (MVDR) [9], is another traditional method; however, the Rayleigh limit and low resolution are its shortcomings. The generalized cross-correlation phase transform (GCC-PHAT) is also a commonly used traditional method, which requires the time difference of arrival (TDOA) between microphone pairs [10,11]. The maximum likelihood method [12] is another conventional approach. It needs a multi-dimensional search, even though it can provide high estimation accuracy; even worse, its global convergence cannot be guaranteed. All these methods mentioned above can be used for the DOA estimation task.

Signal Model
An acoustic far-field signal $s_d(t)$ emitted from the location $\mathbf{r}_d = (r_d, \theta_d, \phi_d)$ is supposed.
The distance between the source and the center of the spherical microphone array is $r_d$, and $\theta_d$ and $\phi_d$ are the elevation and azimuth of the $d$-th source, respectively. Thus, the output vector $\mathbf{x}(t)$ of the spherical microphone array can be obtained as
$$\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) + \mathbf{v}(t), \quad (1)$$
where $\mathbf{x}(t) = [x_1(t), \cdots, x_L(t)]^T$, $\mathbf{A}$ is the steering matrix that represents the transfer functions from the source signals $\mathbf{s}(t) = [s_1(t), \cdots, s_D(t)]^T$ to the array microphones and holds the DOA information, $\mathbf{v}(t) = [v_1(t), \cdots, v_L(t)]^T$ is the noise vector, $t$ is the time index, and $D$ is the number of sources. When the source signals are wideband, the short-time Fourier transform (STFT) [27] must be considered to construct a signal model in the time-frequency domain as follows:
$$\mathbf{x}(\tau, k_h) = \mathbf{A}(k_h)\,\mathbf{s}(\tau, k_h) + \mathbf{v}(\tau, k_h), \quad (2)$$
where $\tau$ indicates the index of the time frame, $k_h = 2\pi f_h / c$ is the wavenumber corresponding to the frequency bin $f_h$, $h$ represents the frequency bin index, $c$ is the speed of sound, $\mathbf{x}(\tau, k_h) = [x_1(\tau, k_h), \cdots, x_L(\tau, k_h)]^T$, $\mathbf{v}(\tau, k_h) = [v_1(\tau, k_h), \cdots, v_L(\tau, k_h)]^T$, and $\mathbf{A}(k_h)$, the corresponding steering matrix [28], can be decomposed as
$$\mathbf{A}(k_h) = \mathbf{Y}(\Omega)\,\mathbf{B}(k_h r)\,\mathbf{Y}^H(\Phi), \quad (3)$$
where $\mathbf{Y}(\Omega) = [\mathbf{y}(\vartheta_1, \varphi_1), \cdots, \mathbf{y}(\vartheta_L, \varphi_L)]^T$ is the spherical harmonic matrix containing the information of the locations of all the microphones and $\mathbf{Y}(\Phi) = [\mathbf{y}(\theta_1, \phi_1), \cdots, \mathbf{y}(\theta_D, \phi_D)]^T$ is the spherical harmonic matrix dependent only on the DOAs of the $D$ sources, $[\cdot]^T$ indicates the transpose, $[\cdot]^H$ is the conjugate transpose, and the spherical harmonics vector $\mathbf{y}(\theta, \phi)$ is expressed as
$$\mathbf{y}(\theta, \phi) = [Y_0^0(\theta, \phi), Y_1^{-1}(\theta, \phi), Y_1^0(\theta, \phi), Y_1^1(\theta, \phi), \cdots, Y_N^N(\theta, \phi)]^T, \quad (4)$$
where $Y_n^m(\theta, \phi)$ is the spherical harmonic function with order $n$ and degree $m$ described as
$$Y_n^m(\theta, \phi) = \sqrt{\frac{2n+1}{4\pi}\,\frac{(n-m)!}{(n+m)!}}\; P_n^m(\cos\theta)\, e^{im\phi}, \quad (5)$$
where $i^2 = -1$ and $P_n^m(x)$ is the associated Legendre function of order $n$ and degree $m$.
Appl. Sci. 2022, 12, 4278

When $m$ is positive, $P_n^m(x)$ is defined as
$$P_n^m(x) = (1-x^2)^{m/2}\,\frac{d^m}{dx^m}P_n(x); \quad (6)$$
for the negative parts, it can be obtained from the positive parts as
$$P_n^{-m}(x) = (-1)^m\,\frac{(n-m)!}{(n+m)!}\,P_n^m(x), \quad (7)$$
where $P_n(x)$ is the Legendre polynomial given by
$$P_n(x) = \frac{1}{2^n n!}\,\frac{d^n}{dx^n}(x^2-1)^n. \quad (8)$$
$N$ is the highest order of the spherical harmonic function, $n = 0, \cdots, N$ and $m = -n, \cdots, 0, \cdots, n$. $\mathbf{B}(k_h r) = \mathrm{diag}\{b_0(k_h r), b_1(k_h r), b_1(k_h r), b_1(k_h r), \cdots, b_N(k_h r)\}$, where each $b_n(k_h r)$ appears $2n+1$ times and $b_n(k_h r)$ is the mode strength [29] defined, for a rigid sphere, as
$$b_n(k_h r) = 4\pi i^n \left( j_n(k_h r) - \frac{j_n'(k_h r)}{o_n'(k_h r)}\, o_n(k_h r) \right), \quad (9)$$
where $j_n(x)$ is the spherical Bessel function of the first kind, $o_n(x)$ is the spherical Hankel function of the second kind, and $j_n'(x)$ and $o_n'(x)$ are their derivatives, respectively. The model in (2) can be rewritten as
$$\mathbf{x}(\tau, k_h) = \mathbf{Y}(\Omega)\,\mathbf{B}(k_h r)\,\mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k_h) + \mathbf{v}(\tau, k_h). \quad (10)$$
The model can be transformed into the spherical harmonic domain by left-multiplying by $\mathbf{Y}^H(\Omega)$. Due to the orthogonality principle of the spherical harmonics for uniform or nearly uniform sampling, it can be known that $\mathbf{Y}^H(\Omega)\mathbf{Y}(\Omega) = \mathbf{I}_{(N+1)^2}$ [30], where $\mathbf{I}_{(N+1)^2}$ is an $(N+1)^2 \times (N+1)^2$ identity matrix. The new model in the spherical harmonic domain is
$$\mathbf{x}_{nm}(\tau, k_h) = \mathbf{B}(k_h r)\,\mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k_h) + \mathbf{v}_{nm}(\tau, k_h), \quad (11)$$
where $\mathbf{x}_{nm}(\tau, k_h) = \mathbf{Y}^H(\Omega)\mathbf{x}(\tau, k_h)$ and $\mathbf{v}_{nm}(\tau, k_h) = \mathbf{Y}^H(\Omega)\mathbf{v}(\tau, k_h)$. To remove the mode strength, (11) can be left-multiplied by $\mathbf{B}^{-1}(k_h r)$:
$$\tilde{\mathbf{x}}_{nm}(\tau, k_h) = \mathbf{B}^{-1}(k_h r)\,\mathbf{x}_{nm}(\tau, k_h) = \mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k_h) + \tilde{\mathbf{v}}_{nm}(\tau, k_h). \quad (12)$$
In this model, the steering matrix $\mathbf{Y}^H(\Phi)$ is independent of the microphone positions and the mode strength.
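The mode strength above can be evaluated numerically. The following is a minimal sketch using SciPy's spherical Bessel functions, assuming the rigid-sphere expression; the spherical Hankel function of the second kind is formed as $h_n^{(2)}(x) = j_n(x) - i\,y_n(x)$:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def mode_strength_rigid(n, kr):
    """Mode strength b_n(kr) of a rigid spherical array.

    Uses the spherical Hankel function of the second kind
    h_n^(2)(x) = j_n(x) - 1j * y_n(x) and its derivative.
    """
    jn = spherical_jn(n, kr)
    djn = spherical_jn(n, kr, derivative=True)
    hn2 = spherical_jn(n, kr) - 1j * spherical_yn(n, kr)
    dhn2 = (spherical_jn(n, kr, derivative=True)
            - 1j * spherical_yn(n, kr, derivative=True))
    return 4.0 * np.pi * (1j ** n) * (jn - (djn / dhn2) * hn2)

# Example: kr for f = 2 kHz, r = 4.2 cm, c = 343 m/s (kr is about 1.5)
kr = 2 * np.pi * 2000 / 343 * 0.042
b = [mode_strength_rigid(n, kr) for n in range(5)]
```

As expected, the magnitude of $b_n$ falls off sharply once $n$ exceeds $kr$, which is why the usable frequency range must be chosen to keep sufficient mode strength.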

The Proposed Method
This section introduces the input features of the proposed method for the DOA task. As shown before, three different covariance matrices are considered as the input features for the neural network. The first covariance matrix is the covariance matrix of the received signal of the microphone array in the time-frequency domain (TFD). The second one is the covariance matrix of the signal in the spherical harmonic domain (SHD). The third one is the covariance matrix of the signal in the azimuth-elevation domain (AED). All the covariance matrices are to be introduced in detail.

TFD Matrix
The received signal of the microphone array in the time-frequency domain is the output vector $\mathbf{x}(\tau, k_h)$ of the microphone array. The covariance matrix of $\mathbf{x}(\tau, k_h)$ can be obtained as
$$\mathbf{R}_{\mathrm{TFD}} = \frac{1}{N_F N_T}\sum_{h=1}^{N_F}\sum_{\tau=1}^{N_T} \mathbf{x}(\tau, k_h)\,\mathbf{x}^H(\tau, k_h), \quad (13)$$
where $N_F$ is the number of frequency bins and $N_T$ is the number of time frames considered. In practical applications, the covariance matrix can also be calculated in this way: under each frequency bin, the covariance matrix of $\mathbf{x}(\tau, k_h)$ is computed along the time index, and the final step is to average all the matrices over the frequency bins.
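The per-bin averaging described above can be sketched as follows; this is an illustrative implementation, with the STFT array layout `(L, N_T, N_F)` an assumption:

```python
import numpy as np

def tfd_covariance(stft):
    """Broadband covariance matrix averaged over time frames and frequency bins.

    stft: complex array of shape (L, N_T, N_F) -- microphone STFT coefficients.
    Returns an (L, L) complex covariance matrix.
    """
    L, N_T, N_F = stft.shape
    R = np.zeros((L, L), dtype=complex)
    for h in range(N_F):
        X = stft[:, :, h]               # (L, N_T) snapshots at one frequency bin
        R += X @ X.conj().T / N_T       # per-bin covariance averaged over time
    return R / N_F                      # average over the frequency bins

# Small sanity check with random data
rng = np.random.default_rng(1)
stft = rng.normal(size=(4, 10, 3)) + 1j * rng.normal(size=(4, 10, 3))
R = tfd_covariance(stft)
```

By construction the result is Hermitian, so stacking its real and imaginary parts yields a real-valued feature without losing information.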

SHD Matrix
It can be known that the data model in (12) has filtered out the effects of the microphone array and the mode strength to some extent; thus, this model contains the angle information more efficiently. The covariance matrix of the model in the spherical harmonic domain can therefore be obtained in the same way as the TFD matrix:
$$\mathbf{R}_{\mathrm{SHD}} = \frac{1}{N_F N_T}\sum_{h=1}^{N_F}\sum_{\tau=1}^{N_T} \tilde{\mathbf{x}}_{nm}(\tau, k_h)\,\tilde{\mathbf{x}}_{nm}^H(\tau, k_h). \quad (14)$$
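The transform into the spherical harmonic domain with mode-strength compensation can be sketched as below. This is a hedged illustration: a pseudo-inverse stands in for the exact orthonormality assumption of the paper, and the `sh_matrix` helper (using the legacy SciPy `sph_harm` argument order) is only one possible way to build the sampling matrix:

```python
import numpy as np
import scipy.special as sp

def sh_matrix(N, theta, phi):
    """Spherical harmonic sampling matrix Y of shape (L, (N+1)^2) for
    microphone elevations theta and azimuths phi (arrays of length L),
    with columns ordered by (n, m) as in the spherical harmonics vector."""
    cols = []
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # legacy SciPy convention: sph_harm(m, n, azimuth, polar)
            cols.append(sp.sph_harm(m, n, phi, theta))
    return np.stack(cols, axis=1)

def to_shd(x, Y, b):
    """Transform one array snapshot x (length L) into the mode-strength-
    compensated SHD signal. Y: (L, (N+1)^2) sampling matrix; b: mode-strength
    vector with each b_n repeated 2n+1 times. A pseudo-inverse replaces the
    exact orthonormality assumption Y^H Y = I used in the text."""
    x_nm = np.linalg.pinv(Y) @ x        # spherical harmonic coefficients
    return x_nm / b                     # remove the mode strength

# Round trip with a random full-rank sampling matrix (N = 1, L = 8):
rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
b = np.array([1.0, 0.5, 0.5, 0.5])      # b_0 once, b_1 three times
c = rng.normal(size=4) + 1j * rng.normal(size=4)
rec = to_shd(Y @ (b * c), Y, b)         # recovers c
```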

AED Matrix
A mapping matrix $\mathbf{E}$ between the spherical harmonic functions and the Fourier series was found in our previous work [31]. Its dimension is $(2N+1)^2 \times (2N+1)^2$. The mapping matrix transforms the steering vector in (12) into the Kronecker product of two Fourier series vectors, each with a Vandermonde structure dependent only on the azimuths or elevations, respectively:
$$\mathbf{E}\,\bar{\mathbf{y}}(\theta, \phi) = \mathbf{f}(\phi) \otimes \mathbf{f}(\theta) = \mathbf{g}(\theta, \phi), \quad (15)$$
where the operator $\otimes$ represents the Kronecker product. The $\mathbf{f}(\phi)$ and $\mathbf{f}(\theta)$ can be expressed as
$$\mathbf{f}(\phi) = [e^{-iN\phi}, \cdots, 1, \cdots, e^{iN\phi}]^T, \quad (16)$$
$$\mathbf{f}(\theta) = [e^{-iN\theta}, \cdots, 1, \cdots, e^{iN\theta}]^T. \quad (17)$$
Moreover, the values in matrix $\mathbf{E}$ do not change with the positions of the sound sources. Thus, (15) can be left-multiplied by the generalized inverse of matrix $\mathbf{E}$ to obtain
$$\bar{\mathbf{y}}(\theta, \phi) = \mathbf{E}^{\dagger}\,\mathbf{g}(\theta, \phi). \quad (18)$$
According to (12) and (18), the novel model can be obtained from (12) by left-multiplying by $\mathbf{E}$:
$$\mathbf{x}_{\theta\phi}(\tau, k_h) = \mathbf{E}\,\tilde{\mathbf{x}}_{nm}(\tau, k_h) = \mathbf{G}(\Phi)\,\mathbf{s}(\tau, k_h) + \mathbf{v}_{\theta\phi}(\tau, k_h), \quad (19)$$
where $\bar{[\cdot]}$ is the operation that conjugates all the elements in the matrix and $\mathbf{G}(\Phi) = [\mathbf{g}(\theta_1, \phi_1), \cdots, \mathbf{g}(\theta_D, \phi_D)]$. Compared with $\tilde{\mathbf{x}}_{nm}(\tau, k_h)$, $\mathbf{x}_{\theta\phi}(\tau, k_h)$, which is determined by the azimuth and elevation to the greatest extent, further removes information irrelevant to the source location; therefore, the signal is transformed from the time-frequency domain to the azimuth-elevation domain. Thus, one can obtain the covariance matrix of $\mathbf{x}_{\theta\phi}(\tau, k_h)$ with the same method:
$$\mathbf{R}_{\mathrm{AED}} = \frac{1}{N_F N_T}\sum_{h=1}^{N_F}\sum_{\tau=1}^{N_T} \mathbf{x}_{\theta\phi}(\tau, k_h)\,\mathbf{x}_{\theta\phi}^H(\tau, k_h). \quad (20)$$
The last step is to assemble the real parts and the imaginary parts to obtain the final input features. Table 1 indicates the dimensions of the three different covariance matrices. Figure 2 is the flow chart of acquiring the input features.
From the figure and theoretical analysis, it can be found that the AED matrix best carries the location information, which makes it the most efficient input feature at a theoretical level. Algorithm 1 shows the steps of the algorithm for clarity.

Algorithm 1: Acquiring the input features.
1. Get x(τ, k_h) by using the STFT of the received signal x(t).
2. Transform x(τ, k_h) into the spherical harmonic domain and further into the azimuth-elevation domain.
3. Compute the TFD, SHD and AED covariance matrices and assemble their real and imaginary parts as the input features.
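The dimensions in Table 1 follow directly from stacking the real and imaginary parts of each covariance matrix; a quick sketch, using the paper's values $L = 32$ and $N = 4$:

```python
import numpy as np

# Dimension check for the three input features (Table 1), with L = 32
# microphones and spherical harmonic order N = 4.
def assemble_feature(R):
    """Stack the real and imaginary parts of a complex covariance matrix."""
    return np.concatenate([R.real, R.imag], axis=0)

L, N = 32, 4
dims = {"TFD": L, "SHD": (N + 1) ** 2, "AED": (2 * N + 1) ** 2}
features = {k: assemble_feature(np.eye(d, dtype=complex)).shape
            for k, d in dims.items()}
print(features)   # {'TFD': (64, 32), 'SHD': (50, 25), 'AED': (162, 81)}
```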

Learning Strategies and Network Architectures
Two different learning strategies and their corresponding network structures are introduced in this section. As shown before, the DOA estimation is regarded as a classification task. The task can be finished with two strategies.

Regular Strategy
The estimation is treated as a classification task: the whole space is divided into different classes, and the network detects which class the emitted signal belongs to. The elevation space, ranging from 30° to 150°, is divided into 13 different classes with an interval of 10°. Similarly, the azimuth space, ranging from 0° to 360°, is divided into 36 classes with an interval of 10°. In total, 468 different classes can be obtained. Moreover, each class is granted a label to distinguish it. For each class, a large number of samples is generated for the network to learn, giving the network the ability to localize the sound source. For the classification, a convolutional neural network is utilized. The network is composed of three convolutional layers and two fully connected layers. The size of the convolution kernels in the three convolutional layers is 2 × 2, and the number of filters in all layers is fixed at 8. All the convolutional layers are activated by the rectified linear unit (ReLU) activation [32]. A flatten layer follows the last convolutional layer to reshape the output of the convolutional layers. The following components are two fully connected (FC) layers: the first layer, with N_H nodes, is activated by the ReLU activation; the second layer, with N_C nodes, uses softmax for the activation, where N_C is the number of classes for the task. The softmax function [33] helps to obtain the posterior probabilities of all the candidates. Finally, according to the principle of maximum posterior probability, the position where the maximum probability occurs is the estimated DOA. Figure 3 demonstrates the details of the network for the estimation task. During learning, the cross-entropy [34] is chosen as the loss function because its derivative is simple and it depends only on the probability of the correct class; furthermore, the gradient of the loss with respect to the input of the softmax activation layer is convenient to compute.
Adam [35] is adopted for the stochastic optimization. The details of the CNN are shown in Figure 3.
Figure 3. The process for DOA estimation and the architecture in detail. Conv(x, y, z) indicates that the size of the convolution kernel is y × z and the number of filters is x. N_H is the number of nodes in the first FC layer. N_C is the number of nodes in the second FC layer, which represents the number of classes.
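The 13 × 36 = 468 class grid can be encoded as a simple index mapping. The following is a hypothetical helper pair, since the paper does not specify the ordering convention; the row-major (elevation-first) layout is an assumption:

```python
# Hypothetical helpers for the 468-class grid: elevation 30..150 degrees and
# azimuth 0..350 degrees, both on a 10-degree grid (13 x 36 = 468 classes).

def doa_to_class(theta_deg, phi_deg):
    """Map a (theta, phi) pair to its class index; grid convention assumed."""
    el = round((theta_deg - 30) / 10)        # 0..12
    az = round(phi_deg / 10) % 36            # 0..35
    return el * 36 + az

def class_to_doa(c):
    """Inverse mapping: class index back to the grid DOA."""
    el, az = divmod(c, 36)
    return 30 + 10 * el, 10 * az

print(doa_to_class(30, 0), doa_to_class(150, 350))   # 0 467
```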

Segmentation Strategy
As shown in Section 4.1, there are 468 different classes in the regular strategy. Too many classes can bring interference to learning and testing. Thus, this part proposes the segmentation strategy for simplification. For the azimuth, the entire space is considered, which is a major contributor to the large number of classes. The azimuth space can therefore be processed with the segmentation strategy: the entire space is equally divided into Q sub-spaces, and the task is finished in parallel with Q independent networks. This strategy decomposes the excessive number of classes into Q parts so that the number of total classes in each part is reduced, thereby reducing the interference induced by the classes. Figure 4 shows an example of the space division when Q = 3, and the experiments in this paper are conducted in the case of Q = 3. The color-rendered part is the corresponding sub-space, and the white area is the irrelevant space compared to the colored space.

Take Figure 4a as an example. The pink part is the first sub-space used for the first CNN. The space range for the pink part is from 0° to 120°; thus, 169 classes can be obtained for the first CNN. As the other blank parts are irrelevant spaces, a new class is needed in this subtask. This class represents irrelevant sound, which means that the signal comes from irrelevant areas; its output tells the other networks that the signal does not appear in this sub-space. Thus, 170 classes are considered for the first sub-CNN. Similarly, the second sub-space, ranging from 120° to 240°, also has 170 classes. The third sub-space, whose range is from 240° to 360°, also has 170 classes. Figure 5 shows the architecture of the network in the segmentation strategy with Q = 3. The specific structure of each sub-CNN is the same as the network structure of the CNN in the regular strategy to better compare the effectiveness of the strategies. For testing, the input feature of one signal is sent to the three sub-networks for detection at the same time; the three networks output the posterior probabilities of the whole space, and the position of the signal is judged through the principle of the maximum posterior probability. In particular, when the maximum probability appears in the irrelevant class, the entire corresponding sub-space is no longer considered.
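The fusion of the parallel sub-network outputs can be sketched as below. This is a hypothetical routine, since the paper does not spell out the tie-breaking details; the assumption here is that each sub-network's last entry is its irrelevant class:

```python
import numpy as np

def fuse_subnetworks(probs):
    """Fuse the posterior probabilities of Q parallel sub-CNNs.

    probs: list of Q probability vectors; the last entry of each vector is
    assumed to be the 'irrelevant' class. If a sub-network's maximum falls on
    its irrelevant class, its whole sub-space is discarded; otherwise its
    in-space probabilities compete for the global maximum.
    """
    best_class, best_p, offset = None, -1.0, 0
    for p in probs:
        in_space = p[:-1]                       # drop the irrelevant class
        if int(np.argmax(p)) != len(p) - 1:     # irrelevant class did not win
            k = int(np.argmax(in_space))
            if in_space[k] > best_p:
                best_p, best_class = in_space[k], offset + k
        offset += len(in_space)
    return best_class

# Example with Q = 3 sub-spaces of 4 classes each (+1 irrelevant class):
p1 = np.array([0.1, 0.1, 0.1, 0.1, 0.6])    # irrelevant wins -> discarded
p2 = np.array([0.1, 0.7, 0.1, 0.05, 0.05])  # strongest in-space response
p3 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
selected = fuse_subnetworks([p1, p2, p3])
print(selected)   # -> 5 (class 1 of the second sub-space)
```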

Simulations and Evaluation
This section introduces the setups for the simulations and the evaluation of the performance. The audio settings are similar to those in [28]. A Hanning window of length 256 with an overlap of 75% between adjacent frames is used. The sampling frequency is 16 kHz, and a 256-point FFT is considered. The frequency bins ranging from 500 to 3875 Hz are taken into consideration to have sufficient mode strengths and avoid aliasing, which results in the number of frequency bins N_F being 55. Moreover, the highest order of the spherical harmonics is 4. A rigid spherical microphone array with L = 32 and r = 4.2 cm is considered. The highest order of the spherical harmonics and the number of sensors directly determine the dimensions of the input features: the TFD matrix is 64 × 32, the SHD matrix is 50 × 25, and the AED matrix is 162 × 81.
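The bin count N_F = 55 follows from the FFT parameters; a quick check:

```python
# Checking the audio settings: 16 kHz sampling with a 256-point FFT gives a
# bin spacing of 62.5 Hz; keeping the bins from 500 Hz to 3875 Hz yields 55.
fs, nfft = 16000, 256
spacing = fs / nfft                                   # 62.5 Hz per bin
lo, hi = round(500 / spacing), round(3875 / spacing)  # bins 8 .. 62
N_F = hi - lo + 1
print(spacing, N_F)                                   # 62.5 55
```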

Dataset
In this section, the simulated and real datasets for the algorithm are introduced. For training, a large amount of data is needed; however, in practice, it is time-consuming and less cost-effective to record real data. Thus, simulated data can be used for comprehensive training. A variety of simulated environments can be obtained by using the spherical microphone array impulse response (SMIR) generator [36]. The SMIR is determined by the size of the room, the locations of the array microphones and sound sources, and the reverberation time (RT_60). Figure 6 shows the flow chart of the simulated dataset generation. In terms of the room size, six rooms are considered. The six rooms have different sizes except for a fixed height of 3 m. Four rooms are used for training, and all the rooms are utilized for testing. RT_60 can be chosen between 0.2 and 1 s. Eigenmike [37], a rigid SMA of radius 4.2 cm with L = 32 and N = 4, is used as the array for generating the signals. The distance between each source and the center of the SMA is at least 1 m. Moreover, the source and SMA positions are at a minimum distance of 50 cm from all the walls in the room. Clean speech recordings are chosen for generating the dataset. Then, all the SMIRs and the source signals are convolved to obtain new signals. In terms of noise, white Gaussian noise is considered. Moreover, voice activity detection (VAD) [38] is performed on the clean speeches to filter out the silent frames and maintain only the speech frames. The method utilized for generating class labels is the same as that in [39], which treats DOA estimation as a classification task.


Training Dataset
In terms of the room size, four rooms are considered, with sizes measured in meters. The four rooms have different sizes except for a fixed height of 3 m. The room sizes are: Room 1 (3 m × 5 m × 3 m); Room 2 (6 m × 8 m × 3 m); Room 3 (7 m × 10 m × 3 m); Room 4 (8 m × 9 m × 3 m). As shown before, the elevation space and azimuth space are both divided with an interval of 10°. Each class is granted a label. For each DOA class, 125 SMIRs can be obtained by combining the different parameters. Eight different speech signals are randomly selected from the Librispeech dev corpus [40] and convolved with all the SMIRs. Thus, from the four rooms with different sizes, a total of 4000 samples for each DOA class can be acquired. The signal-to-noise ratio (SNR) is randomly chosen between 0 and 30 dB. A validation dataset is also needed, whose size is 10% of that of the training dataset.

Testing Dataset
The test dataset is divided into two different parts. The first part is the simulated test dataset, which is similar to the training dataset but differs in terms of SNR, RT_60 and the source locations. In total, 265 different DOA positions are selected randomly. In addition to the four rooms already mentioned in the training set, there are two additional rooms: Room 5 (4 m × 6 m × 3 m) and Room 6 (9 m × 7 m × 3 m). The rule for selecting the position of the SMA is the same as that for the training dataset. For better analysis of the relationship between the results and the parameters, RT_60 and SNR are no longer selected randomly: RT_60 varies from 0.2 s to 1 s in steps of 0.2 s, and the SNR changes from 0 dB to 30 dB in steps of 5 dB. Similarly, two speech recordings are selected for the convolution with the SMIRs; however, this time, the recordings are selected from the Librispeech test corpus [40], not the Librispeech dev corpus. The second part is the real test dataset: the LOCATA dataset [41], from the IEEE-AASP challenge on sound source localization and tracking. From this challenge, task 1 is chosen, which localizes the direction of a single static loudspeaker. The recordings are conducted in a room of 7.1 m × 9.8 m × 3 m with RT_60 = 0.55 s. Table 2 indicates the parameters for the simulated datasets. The parameters mentioned above are very important for generating the dataset. Different room sizes enrich the scene information and help the algorithm generalize to more rooms. The white Gaussian noise is used as the interference to verify the robustness of the algorithm. SNR and RT_60 are utilized to enrich the complexity of the environments; moreover, during the testing stage, these two parameters help to show the relationship between the results and the conditions, which is effective for verifying the feasibility of the algorithm.

Methods for Comparison
The spherical harmonic MUSIC (SH-MUSIC) method [42], the direct-path dominance test MUSIC (DPD-MUSIC) method [28,43] and the first-order Ambisonics (FOA) method [44] are used for the performance comparison.

SH-MUSIC
The SH-MUSIC algorithm is the MUSIC method conducted in the spherical harmonic domain. A spectrum is computed for all the possible DOA candidates, and the peak gives the corresponding DOA of the source:
$$(\hat{\theta}, \hat{\phi}) = \arg\max_{(\theta, \phi) \in \mathcal{L}} \frac{1}{\mathbf{y}^H(\theta, \phi)\,\mathbf{U}_n\mathbf{U}_n^H\,\mathbf{y}(\theta, \phi)}, \quad (21)$$
where $\mathcal{L}$ is the DOA search grid set containing all the possible DOA candidates and $\mathbf{U}_n$ is the noise subspace decomposed from the covariance matrix of the array signals using the eigenvalue decomposition [45]. The covariance matrix can be computed from (14).
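The noise-subspace search above can be sketched on a toy problem. This is an illustration only, using a generic uniform linear array steering vector on a hypothetical 1-D grid rather than the spherical harmonic vectors of the paper; the subspace machinery is the same:

```python
import numpy as np

# Generic narrowband MUSIC sketch: eigendecompose the covariance, take the
# noise subspace, and search the grid for the spectrum peak.
rng = np.random.default_rng(0)
M, true_idx = 8, 40                       # sensors, index of the true DOA
grid = np.linspace(0, np.pi, 181)         # hypothetical 1-D DOA grid

def steer(angle):
    """Half-wavelength ULA steering vector (stand-in for y(theta, phi))."""
    return np.exp(1j * np.pi * np.arange(M) * np.cos(angle))

a = steer(grid[true_idx])
S = rng.normal(size=200) + 1j * rng.normal(size=200)            # source
X = np.outer(a, S) + 0.01 * (rng.normal(size=(M, 200))
                             + 1j * rng.normal(size=(M, 200)))  # snapshots
R = X @ X.conj().T / 200

w, V = np.linalg.eigh(R)                  # eigenvalues in ascending order
Un = V[:, :-1]                            # noise subspace for one source
spec = [1.0 / np.real(steer(g).conj() @ Un @ Un.conj().T @ steer(g))
        for g in grid]
est = int(np.argmax(spec))                # estimated index, near true_idx
print(est)
```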

DPD-MUSIC
In order to effectively resist reverberation, a direct-path dominance (DPD) test is often applied to select those time-frequency (TF) bins that contain mostly direct components. The test can be expressed as
$$\frac{\sigma_1(\tau, k_h)}{\sigma_2(\tau, k_h)} > \lambda_{\mathrm{DPD}}, \quad (22)$$
where $\lambda_{\mathrm{DPD}}$ is the threshold for the DPD test to choose the TF bins, and $\sigma_1(\tau, k_h)$ and $\sigma_2(\tau, k_h)$ are the largest and second-largest singular values of the covariance matrix $\hat{\mathbf{R}}_{nm}(\tau, k_h)$ obtained using a rectangular window over the time and frequency indices. The window contains $J_\tau$ time frames and $J_\upsilon$ frequency bins. Based on the selected TF bins, the SH-MUSIC method is used for the sound localization task [28,43].
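The singular-value ratio test can be sketched as follows; the dictionary layout and threshold value are illustrative assumptions:

```python
import numpy as np

# Sketch of the DPD test: keep a TF bin when the ratio of the two largest
# singular values of its local covariance matrix exceeds a threshold.
def dpd_select(local_covs, lam=2.0):
    """local_covs: dict mapping (tau, h) -> local covariance matrix.
    Returns the list of TF-bin keys that pass the direct-path test."""
    keep = []
    for key, R in local_covs.items():
        s = np.linalg.svd(R, compute_uv=False)  # singular values, descending
        if s[0] / s[1] > lam:
            keep.append(key)
        # bins failing the test are dominated by reverberation and dropped
    return keep

# A rank-one (direct-path dominated) matrix passes; an identity (diffuse
# field) matrix fails.
a = np.ones((4, 1))
direct = a @ a.T + 0.01 * np.eye(4)
diffuse = np.eye(4)
selected = dpd_select({(0, 0): direct, (0, 1): diffuse})
print(selected)   # -> [(0, 0)]
```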

FOA
Ref. [44] shows that the first four channels of the FOA are considered. The signals of the last three channels are convolved with the one in the first channel, the convolved results are assembled, and the real and imaginary parts of the result are taken as the input. In order to better compare the effectiveness of the different inputs, all methods are performed using the same network structure.

Measure
In order to evaluate the performance, the gross error (GE) is used in the measurement stage. The GE is the percentage of estimates that do not fall within an allowed threshold of the true value. Thus, for the simulated dataset, the GE is used to evaluate the performance; for the LOCATA dataset, the mean deviation and standard deviation [41] are utilized for the evaluation. The GE is defined as
$$\mathrm{GE} = \frac{1}{N_K}\sum_{\kappa=1}^{N_K} \mathbb{1}\!\left[\Delta\big((\theta_\kappa, \phi_\kappa), (\hat{\theta}_\kappa, \hat{\phi}_\kappa)\big) > \lambda\right] \times 100\%, \quad (23)$$
where $N_K$ is the number of DOAs estimated, $(\theta_\kappa, \phi_\kappa)$ and $(\hat{\theta}_\kappa, \hat{\phi}_\kappa)$ are the actual and the estimated DOA of the $\kappa$-th source, $\lambda$ is the threshold that represents the maximum acceptable error, and $\Delta$ is the angular distance, defined as
$$\Delta = \arccos\big(\cos\theta_\kappa\cos\hat{\theta}_\kappa + \sin\theta_\kappa\sin\hat{\theta}_\kappa\cos(\phi_\kappa - \hat{\phi}_\kappa)\big). \quad (24)$$
As introduced before, for the simulated dataset, the GE is used for the evaluation, and λ = 10°, 15°, 20° is considered. Table 3 demonstrates the GE for the azimuth estimation with different thresholds, and Table 4 illustrates the GE for the elevation estimation with different thresholds. The symbol '-S' in the tables means that the task is finished with the segmentation strategy. The GE values shown in the two tables are averages over all the combinations of SNR and RT_60. From Table 3, it can be seen that the proposed input feature, the AED matrix, is the most efficient input feature for the azimuth estimation task. Furthermore, when the input features are fixed, the segmentation strategy has higher accuracy than the regular strategy for the azimuth estimation. Thus, the combination of the AED matrix and the segmentation strategy is the best choice for the azimuth.
Table 4 shows that the proposed input feature, the AED matrix, is also the most efficient input feature for the elevation estimation task. Moreover, when the input feature is fixed, the segmentation strategy again achieves higher accuracy than the regular strategy in the elevation estimation test. Although the improvement is not as remarkable as in the azimuth estimation task, the segmentation strategy still brings an improvement. Therefore, the proposed method can effectively improve the accuracy of both the azimuth estimation and the elevation estimation.
The relationships between the results and the parameters are shown in Figures 7-9. Since Tables 3 and 4 show that the SH-MUSIC method and the DPD-MUSIC algorithm perform worse than the deep learning methods, these two methods are excluded from the experiments exploring the relationship between the results and the influencing factors. Figure 7 shows the relationship between the results and RT_60 with the threshold λ = 20°: Figure 7a shows the GE for azimuth against RT_60 and Figure 7b the GE for elevation against RT_60. Figure 8 shows the relationship between the results and SNR with the threshold λ = 20°: Figure 8a shows the GE for azimuth against SNR and Figure 8b the GE for elevation against SNR. Figure 9 shows the relationship between the results and both parameters with the lower threshold λ = 10°: Figure 9a shows the GE against RT_60 and Figure 9b the GE against SNR. From Figure 7, it can be obtained that the proposed input feature, the AED matrix, is the best input feature for improving accuracy, and the segmentation strategy is also effective; thus, the combination of the AED matrix and the segmentation strategy helps to resist reverberation effectively. From Figure 8, it can be found that the AED matrix is also the best input feature for reducing the error rate, and the segmentation strategy remains effective; thus, their combination helps to effectively resist noise. From Figure 9, it can be seen that the proposed input feature still performs best even with the lower threshold. Thus, it can be concluded that the proposed method improves accuracy in all the simulated conditions.

Results in the Real Dataset
The first task in the LOCATA challenge is used for real data test. It deals with the localization of a single static loudspeaker using a static SMA. The mean and standard deviation represent the performance errors showed in Table 5. From the table, it can be seen that the proposed method has lower mean deviation and lower standard deviation for the real dataset test.

Computational Complexity
This section compares the computational complexity and the elapsed time of the algorithms. As shown before, a spherical microphone array is considered, so the spherical harmonic order must be taken into account; therefore, only DPD-MUSIC and the proposed method are compared.

DPD-MUSIC
Ref. [46] gives the run-time complexity of narrowband DOA estimation using the SH-MUSIC method in the spatial domain as O((N+1)^6 + (N+1)^2 N_c), where N_c is the total number of DOA search grid points. In terms of DPD-MUSIC, each rectangular window of time-frequency bins, of size J_τ × J_{k_h}, requires an eigenvalue decomposition. The number of windows is (N_T − J_τ + 1) × (N_F − J_{k_h} + 1), and since J_τ and J_{k_h} are typically small compared to N_T and N_F [21,28], the number of rectangular windows is O(N_T N_F). Thus, the complexity of DPD-MUSIC is O(N_T N_F (N+1)^6 + (N+1)^2 N_θ N_φ), where N_θ and N_φ are the numbers of search grid points in elevation and azimuth (N_c = N_θ N_φ).

Proposed Method
For the proposed method, it is known that in a traditional CNN about 90% of the parameters of the whole network lie in the FC layers [47]. Hence, the computational complexity is determined by the FC layers, while the influence of the convolutional layers can be ignored. The input feature is effectively passed to the first FC layer, so the first part of the computational complexity is determined by the size of the input feature and the number of nodes in the first FC layer; similarly, the second part is determined by the output of the first FC layer and the number of nodes in the second FC layer. Thus, the complexity of the proposed method is O(N_H N_I + N_H N_C), where N_H is the number of nodes in the first FC layer, N_C is the number of nodes in the second FC layer, and N_I is the dimension of the input feature. The proposed method has three different input features, whose dimensions can be obtained from Table 1. Thus, the complexities are O(N_H(2M^2) + N_H N_C) for TFD, O(N_H(2(N+1)^4) + N_H N_C) for SHD, and O(N_H(2(2N+1)^4) + N_H N_C) for AED. From the theoretical analysis, it can be seen that the proposed method has lower computational complexity than DPD-MUSIC.
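A back-of-envelope comparison of the dominant terms derived above can illustrate the gap. All concrete values below (N, M, N_T, N_F, N_H, N_C, grid size) are illustrative assumptions, not the paper's experimental settings.

```python
# Dominant-term operation counts for the complexities derived above.
# Every numeric value here is an assumed example, not a reported setting.
N = 4                       # spherical harmonic order
M = 32                      # number of microphones
N_T, N_F = 100, 256         # time frames and frequency bins
N_H, N_C = 512, 360         # FC hidden nodes and output classes
N_theta, N_phi = 180, 360   # DOA search grid

dpd_music = N_T * N_F * (N + 1) ** 6 + (N + 1) ** 2 * N_theta * N_phi
tfd = N_H * (2 * M ** 2) + N_H * N_C
shd = N_H * (2 * (N + 1) ** 4) + N_H * N_C
aed = N_H * (2 * (2 * N + 1) ** 4) + N_H * N_C

for name, ops in [("DPD-MUSIC", dpd_music), ("TFD", tfd),
                  ("SHD", shd), ("AED", aed)]:
    print(f"{name:9s} ~ {ops:.2e} ops")
```

Under these assumed values, all three network variants stay orders of magnitude below the DPD-MUSIC term, which is dominated by the per-window eigenvalue decompositions.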

Elapsed Time Comparison
The elapsed time comparison among the algorithms for the test stage is shown in this section. The elapsed time is measured from acquiring the feature to obtaining the DOA. For a fair comparison, all the experiments are conducted on the same system: an Intel(R) Core(TM) i5-4800S processor with a clock speed of 2.81 GHz, 16 GB of RAM, and a 64-bit instruction set. Figure 10 shows the elapsed time comparison in the form of a bar chart. It can be observed that the elapsed time is reduced. From the results of the bar chart and the theoretical analysis, it can be concluded that the proposed method reduces the computational complexity.
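A measurement of this kind can be sketched with a simple wall-clock harness; the helper below is an illustrative assumption about how the feature-to-DOA pipeline could be timed, not the paper's benchmarking code.

```python
import time

def timed(fn, *args, n_runs=50):
    """Average wall-clock time per call of fn(*args) over n_runs.

    fn stands in for the full pipeline (feature extraction through
    DOA output); n_runs is an assumed repetition count that smooths
    out scheduling jitter.
    """
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn(*args)
    return (time.perf_counter() - t0) / n_runs
```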

Conclusions
This paper proposed three different covariance matrices as the input features and two different learning strategies for the DOA task. There is a progressive relationship among the three covariance matrices: the second matrix is obtained by processing the first, and the third matrix is obtained by processing the second. Compared with TFD, SHD effectively filters out the effects of the microphone array and mode strength to some extent. Compared with SHD, AED more efficiently removes information irrelevant to the location. In terms of the strategies, the first is a regular learning strategy, while the second splits the task into three parts performed in parallel. Experiments were conducted on both the simulated and real datasets to show that the combination of the AED features and the segmentation strategy achieves the best performance. Moreover, the theoretical analysis and the elapsed-time comparison show that the proposed method has lower computational complexity. Thus, it can be concluded that the proposed method can effectively resist reverberation and noise.