Voice recognition-based systems, such as smart speakers and displays, recognize the voices of registered users and perform tasks by extracting semantic information from their speech. These systems allow users to search for information or run applications in situations where their hands are not free to control the devices. Because of this convenience, voice recognition has been widely adopted as a core function of recent smart devices, and customers actively use these devices for home surveillance, watching videos, and making video calls through installed apps. However, because these devices are static and their cameras and displays have limited viewing angles, users must stay within a certain range while making video calls or watching videos; otherwise, the device must be reoriented by hand, which defeats the purpose of these devices and reduces user satisfaction (Figure 1).
The abovementioned mobility limitation can be resolved by tracking the user's location relative to the device. This can be achieved by detecting the user's location with a camera or a multi-channel microphone array, mimicking human perception. The camera-based method appears to be the more intuitive and convincing approach to user location detection. In particular, owing to recent developments in convolutional neural networks (CNNs) [1,2,3], object detection accuracy has improved significantly, and user location detection is no longer a challenge. However, this method cannot provide a complete solution to the user location detection problem. First, users feel uncomfortable being observed, which raises a potential privacy issue. Second, if the user is in a different room or a space physically separated by a wall, the device cannot provide any service. Imagine a situation in which a user accidentally falls down in a room next to the one where the device is located. If the device relies only on a camera, it cannot determine the location of the accident. Finally, running CNN-based object detection algorithms on IoT devices without an expensive GPU is computationally costly, and equipping each device with one would be burdensome to the user.
The microphone array-based sound localization method can be implemented using an analytic sound source localization (SSL) algorithm such as those in [4,5,6,7,8,9]. Unlike the camera vision-based method, this method has the advantage that users need not worry about privacy issues or space limitations. The traditional sound source localization method locates the sound source by processing the phase and amplitude differences of the sound signal received at each microphone of the array. However, the performance of the traditional solution decays in an environment with noise and reverberation [4,5] owing to the difficulty of modeling noise and interference, which depend on the geometry of the user space. Recently, several studies have been conducted using deep neural networks (DNNs) [6,7,8,9]. In [6], estimating the sound source location was treated as a classification problem: the location of the sound source was estimated by dividing 360° into 360 classes. A spectrogram of multi-channel acoustic data was used as the input, and a DNN modified from ResNet [10], which showed excellent performance in image classification, was used. In [7], researchers introduced a method for finding the azimuth and elevation of a sound source using a stacked convolutional and recurrent neural network. The network was divided into two parts: the first receives a spectrogram of multi-channel acoustic data as input and generates a spatial pseudo-spectrum (SPS) as an intermediate output, and the second receives the SPS as input to estimate the azimuth and elevation of the sound source. Unlike previous studies, [8,9] used an end-to-end technique to extract features directly from multi-channel raw acoustic data. In [8], the 3D coordinates of the sound source were predicted using a simple CNN model, and in [9], the distance and azimuth of the sound source were predicted using the DNN originally proposed in [11] for audio classification on raw acoustic data. These studies demonstrated that DNN-based SSL significantly improves accuracy and robustness compared with traditional signal processing-based methods. Furthermore, [12] used a CNN to analyze speech intelligibility, which shows the suitability of CNNs for processing acoustic data.
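To make the contrast with the learning-based methods concrete, the sketch below illustrates one classical instance of the phase-difference approach, GCC-PHAT, which estimates the time difference of arrival (TDOA) between two microphones and converts it to an azimuth. It is a toy NumPy example under assumed array geometry and synthetic signals, not an implementation from the cited works.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time-difference-of-arrival (TDOA) between two microphone
    signals using generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                 # PHAT weighting: keep only phase information
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                # limit the search to physically possible delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Toy two-microphone example (assumed geometry): mic1 hears the source 3 samples later.
fs, d, c = 16000, 0.10, 343.0              # sample rate, mic spacing (m), speed of sound (m/s)
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)              # 1 s of white-noise "speech"
mic0, mic1 = src, np.roll(src, 3)
tau = gcc_phat(mic1, mic0, fs, max_tau=d / c)
azimuth = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(f"estimated delay {tau * 1e6:.0f} us -> azimuth {azimuth:.1f} deg")
```

As the paragraph above notes, such correlation-based estimates degrade under reverberation and interfering noise, which is the gap the DNN-based methods address.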
However, the DNN-based approaches mentioned thus far are unsuitable for practical IoT systems, which typically lack sufficient computation power. To run DNN-based SSL on IoT devices in real time, the number of parameters in the DNN must be limited while still satisfying the minimum performance requirement. Therefore, in this study, a novel CNN-based SSL model is proposed for real-time operation on low-power IoT devices. The proposed CNN model has multi-stream (MS) blocks that comprise convolution layers of various kernel sizes connected in parallel, which capture the low-, medium-, and high-frequency features of multi-channel acoustic data. Owing to the parallel structure of the MS block, the model can reduce the number of parameters and computations without sacrificing performance. For the model training and testing datasets, speech from the TIMIT database [13] was processed with a room impulse response (RIR) generator to construct multi-channel acoustic datasets covering various directions, heights, and distances.
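As a rough illustration of the parallel multi-stream idea, the following PyTorch sketch concatenates convolution branches with different kernel sizes and feeds the result to a small azimuth classifier. The layer sizes, kernel sizes, class count, and names (e.g., `MultiStreamBlock`, `TinySSLNet`) are illustrative assumptions, not the architecture proposed in this work.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Illustrative multi-stream (MS) block: parallel 1-D convolutions with small,
    medium, and large kernels capture high-, mid-, and low-frequency structure;
    their outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch_per_stream, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_ch, out_ch_per_stream, k, padding=k // 2),
                nn.BatchNorm1d(out_ch_per_stream),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):                       # x: (batch, channels, time)
        return torch.cat([s(x) for s in self.streams], dim=1)

class TinySSLNet(nn.Module):
    """Minimal end-to-end sketch: raw multi-channel audio -> azimuth class scores."""
    def __init__(self, n_mics=4, n_classes=72):  # hypothetical 5-degree resolution
        super().__init__()
        self.ms1 = MultiStreamBlock(n_mics, 16)   # 4  -> 3 x 16 = 48 channels
        self.ms2 = MultiStreamBlock(48, 32)       # 48 -> 3 x 32 = 96 channels
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(96, n_classes)

    def forward(self, x):
        x = self.ms2(self.ms1(x))
        return self.head(self.pool(x).squeeze(-1))

# Example: a batch of two 1-second, 16 kHz clips from a 4-microphone array.
model = TinySSLNet()
scores = model(torch.randn(2, 4, 16000))         # -> (2, 72) azimuth class scores
```

The intent of the parallel branches is that each covers a different receptive field at the same depth, so fewer stacked layers (and hence fewer parameters) are needed to span low- to high-frequency features than in a single-stream network of comparable coverage.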