Three-Dimensional Sound Field Reconstruction and Sound Power Estimation by Stereo Vision and Beamforming Technology

Abstract: The size of the sound field reconstruction area has an important influence on the beamforming sound source localization method and determines the speed of reconstruction. To reduce the sound field reconstruction area, stereo vision technology is introduced to continuously obtain the three-dimensional surface of the target and reconstruct the sound field on it. The fusion method can quickly locate the three-dimensional position of the sound source, and the computational complexity of this method is mathematically analyzed. The sound power level can be estimated dynamically by the sound intensity scaling method based on beamforming and the depth information of the sound source. Experimental results in a hemi-anechoic chamber show that this method can quickly identify the three-dimensional position of the moving source. When the depth of the moving sound source changes, the estimated sound power is more stable than the sound pressure on the microphone.

Author Contributions: Conceptualization, Y.C. and X.L.; methodology, Y.C., X.L. and Y.X.; software, Y.C. and Y.X.; validation, Y.C., X.L. and Y.X.; writing, review and editing, X.L.; supervision, X.W.; project administration, X.W.; funding, X.W.


Introduction
Beamforming is a method of processing array signals and has mature applications in many fields, such as sonar, mobile communications, medical imaging, and radio astronomy [1][2][3]. Davids and Billingsley applied this method to sound source localization [4]. The beamforming method based on the acoustic array can locate steady-state and transient sound sources at medium and long distances with medium and high frequencies [5]. The beamforming method has been widely used in mechanical structures such as airplanes [6], trains [7], and automobiles [8].
Acoustic array-based beamforming locates the sound source by reconstructing the sound field in 2D or 3D space [9]. The usual method is to assume the depth Z of the plane where the sound source is located, then grid this plane (the focus reconstruction surface) and calculate the sound field output of each grid point (reconstruction point) to complete the sound field reconstruction of the plane. Repeating this over planes of different depths achieves the sound field reconstruction in a spatial region. The grid point (reconstruction point) with the maximum output is taken as the position of the sound source. The cost of this method grows with the number of reconstruction points, so a full spatial search requires a large amount of computation.
Introducing the three-dimensional information of the object to be recognized and reducing the sound field reconstruction range is an effective way to reduce the amount of computation and achieve rapid positioning of the three-dimensional sound source. Döbler et al. [10] used the car's three-dimensional model as the sound field reconstruction area, which avoided reconstructing the sound field outside the car body and reduced the range. This approach has the advantages of fewer reconstruction points and fast sound source recognition, but the model information needs to be input into the computer in advance. Limited by the static characteristics of the model, it is difficult to identify a sound source in motion. Instead of obtaining structural information by importing a model, Legg [11] used structured light scanning technology to enhance acoustic beamforming. In that experiment, the structure information of a fixed plexiglass plate was obtained and the sound source inserted into it was located, which showed that the acoustic imaging error is an order of magnitude lower than that of a 2D plane.
Some scholars have tried to combine beamforming with vision. For example, Bub et al. [12] and Virginia et al. [13] used a camera to dynamically track the face of the speaker, and only searched in the face area to improve the recognition accuracy of beamforming. Jennings et al. [14] used a visually guided beamformer to improve listening in complex acoustic environments. José Novoa et al. [15] used visual tracking on a robot combined with beamforming to locate the voice source. O'Donovan et al. [16] developed a real-time audio camera to check the relationship between acoustic features and architectural details more intuitively. In recent years, the rapid development of binocular cameras has made them an effective way to obtain 3D information in space. They offer fast 3D measurement, high accuracy, mature technology, and continuous output of scene depth data [17,18]. Combining the binocular camera with beamforming technology makes it possible to obtain the three-dimensional surface information of an operating mechanical structure and to complete the sound field reconstruction on the surface of the mechanism.
The radiated sound power is an important parameter to describe the acoustic characteristics of a source [19,20]. For example, a mechanical failure produces noise, and accurate measurement of the radiated noise can indicate the degree of failure. The sound intensity scaling method estimates sound power accurately and quickly, provided that the depth of the sound source is known [21]. This method can be combined with the three-dimensional positioning of a moving sound source to monitor its sound power.
The major contributions of this study are as follows:
1. Binocular camera technology is introduced to continuously obtain the three-dimensional surface of the operating mechanism and to perform sound field reconstruction on it, which reduces the number of reconstruction points and improves the positioning speed. Avoiding imported 3D models and their static restrictions enables continuous 3D positioning of moving sound sources.
2. Combining the three-dimensional positioning results with the sound intensity scaling method completes the sound power monitoring of moving sound sources whose depth changes.
3. Through mathematical analysis, the amounts of computation of conventional beamforming and of the fusion method are compared. In the experiment, the average time taken by the two methods to reconstruct the sound field was recorded, which showed that the amount of computation can be effectively reduced.

Focused Beamforming in Three-Dimensional Space
Generally, conventional beamforming (CBF) searches on a reconstruction surface with a preset depth, which assumes that the Z coordinate of the reconstruction points is the same as the Z coordinate of the sound source. Three-dimensional spatial focused beamforming instead searches in a spatial region composed of multiple reconstruction planes of different depths, as shown in Figure 1.
A planar array containing M microphones is used to locate the point sound source S in space. The coordinate of the m-th microphone is $(x_m, y_m, z_m)$ (m = 1, 2, 3, …, M). The signal strength of the point sound source S received by different microphones attenuates as the distance increases [9]. The sound pressure signal measured by the m-th microphone is $p_m(t)$, with spectrum $P_m(\omega)$. The first microphone $(x_1, y_1, z_1)$ is taken as the reference. Point $f(x_f, y_f, z_f)$ is a reconstruction point in the sound pressure reconstruction area in three-dimensional space.

Let $\vec{r}_1$ be the position vector from the reference microphone to the reconstruction point f, and $\vec{r}_m$ the position vector from the m-th microphone to the reconstruction point f. The delay time $\Delta_m(x_f, y_f, z_f)$ of the m-th microphone relative to the reference microphone is then [9]:

$$\Delta_m(x_f, y_f, z_f) = \frac{|\vec{r}_m| - |\vec{r}_1|}{c} \tag{1}$$

In Formula (1), | • | is the vector modulus, and c is the speed of sound.

According to the principle of delay-and-sum beamforming, the phase delay caused by $\Delta_m(x_f, y_f, z_f)$ is compensated for each microphone signal and the compensated signals are summed. The delay-and-sum beamformer output at point f is as follows [9]:

$$B(x_f, y_f, z_f, \omega) = \frac{1}{M} \sum_{m=1}^{M} w_m P_m(\omega)\, e^{j \omega \Delta_m(x_f, y_f, z_f)} \tag{2}$$

where $w_m$ is the weighting coefficient of the m-th microphone [9]. The output is then calculated for the sound pressure reconstruction points in the whole 3D reconstruction area. The complex output of the sound pressure at each point, $B(x_f, y_f, z_f, \omega)$, constitutes the acoustic imaging map. The reconstruction point with the maximum sound pressure output $|B(x_f, y_f, z_f, \omega)|$ is the sound source position.
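The delay-and-sum search described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a single frequency bin, takes row 0 of `mic_xyz` as the reference microphone, uses uniform weights by default, and the names `das_beamform` and `simulate_point_source` are hypothetical helpers introduced here.

```python
import numpy as np

def simulate_point_source(mic_xyz, source_xyz, freq, c=343.0):
    """Ideal free-field spectra at the microphones for a unit point source (assumed model)."""
    r = np.linalg.norm(np.asarray(mic_xyz, float) - np.asarray(source_xyz, float), axis=1)
    return np.exp(-1j * 2.0 * np.pi * freq * r / c) / r

def das_beamform(mic_xyz, spectra, points, freq, c=343.0, weights=None):
    """Delay-and-sum output at each candidate point for one frequency bin.

    mic_xyz : (M, 3) microphone coordinates in metres; row 0 is the reference.
    spectra : (M,) complex microphone spectra at `freq`.
    points  : (N, 3) reconstruction points in metres.
    """
    mic_xyz = np.asarray(mic_xyz, dtype=float)
    points = np.asarray(points, dtype=float)
    n_mics = mic_xyz.shape[0]
    w = np.ones(n_mics) if weights is None else np.asarray(weights, dtype=float)
    # Distance from every reconstruction point to every microphone, shape (N, M).
    dist = np.linalg.norm(points[:, None, :] - mic_xyz[None, :, :], axis=2)
    # Delay of each microphone relative to the reference microphone.
    delay = (dist - dist[:, :1]) / c
    # Compensate the phase delay, then sum over the microphones.
    steer = np.exp(1j * 2.0 * np.pi * freq * delay)
    return (steer * (w * spectra)[None, :]).sum(axis=1) / n_mics
```

The estimated source position is the row of `points` at `np.argmax(np.abs(output))`. Because `points` is an arbitrary (N, 3) array, the same function works for a planar grid, a stack of planes, or a point cloud of a target surface.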

Binocular Stereo Vision
After the long-term development of stereo vision technology, many mature solutions are available. This study uses the Intel® RealSense™ depth camera (manufactured by Intel Corporation, Santa Clara, CA, USA), which computes depth on its built-in chip and outputs a 3D point cloud of the scene. The point cloud is presented as a depth image on the screen. The camera uses active infrared stereo vision technology, and the depth perception is realized by two image sensors and an infrared projector, as shown in Figure 2.
The infrared projector projects invisible structured infrared light to improve the depth accuracy of the scene. The two image sensors simulate human eyes: they capture two images of the measured area at the same time from different positions, and the spatial three-dimensional information of each pixel is obtained through the stereo vision algorithm [22,23].

Sound Intensity Scaling Theory Based on Beamforming Technology
The output of beamforming reflects the sound pressure contribution of the sound source on the plane of the microphone array. Hald [21] scales the beamforming output by constructing a scaling factor such that the integral of the scaled output over the main lobe area is equal to the sound power radiated by the sound source into the hemisphere on the side of the microphone array.
The scaled beamforming result has the meaning of sound intensity and is defined as the "scaled sound intensity"; the coefficient is defined as the "sound intensity scaling factor" α [21]. It satisfies:

$$\iint_{\text{main lobe}} I_s(R, \theta)\, R \,\mathrm{d}R \,\mathrm{d}\theta = P \tag{3}$$

In Formula (3), the polar coordinate system $(R, \theta)$ is used to describe each focus grid point on the sound source calculation plane, R is the distance from the focus grid point to the intersection of the sound source calculation plane and the Z-axis, and $\theta$ is the azimuth angle.
$I_s(R, \theta)$ is the sound intensity after scaling the beamforming output result, and P is the sound power of the point source on the side of the microphone array.
By constructing a fixed sound source recognition model and deriving mathematically, the sound intensity scaling factor α is obtained as given in [21] (Equation (4)), where D is the array diameter (m), ρ is the air density (kg/m^3), and λ is the wavelength (m).
The integral of the scaled sound intensity over the main lobe area is the sound power radiated into the hemisphere on the side of the array, given in [21] (Equation (5)), where L is the sound source depth (m), the effective amplitude of the microphone signal enters as a modulus, and k is the wave number in the propagation direction of the plane wave.
According to Equation (5), the depth of the sound source must be known to calculate the sound power of a moving sound source. In this paper, the stereo vision method is introduced to obtain the three-dimensional position of the sound source, after which the sound power can be estimated.

Sound Source Monitoring Combining Stereo Vision and Beamforming
The stereo vision system and the microphone array are used to collect optical and acoustic information at the same time in the space where the sound source is located. The target area obtained by the former is used as the search domain for the beamforming 3D sound source localization, thereby narrowing the search range and reducing the amount of computation. The specific steps are as follows:
1. Use the binocular camera and the microphone array to acquire images and sound signals at the same time.
2. Segment the sound signals according to the frame rate of the depth image, that is, match the image and the sound in time.
3. Segment the target from the depth image, and obtain the 3D information of the target surface.
4. Convert the 3D information of the target surface to the acoustic array coordinate system to complete the spatial matching of the two systems.
5. Use the 3D information of the target surface after coordinate conversion as the beamforming reconstruction surface, and perform focused beamforming to reconstruct the sound field on the 3D surface of the target. The sound pressure reconstruction point with the maximum output gives the three-dimensional coordinates of the sound source.
6. Import the depth of the three-dimensional coordinates into the sound intensity scaling method to obtain the sound power radiated into the hemisphere on the side of the array.

Three-Dimensional Coordinates Extraction of Target Machinery Surface
Stereo vision can obtain the depth image of the entire field of view including the target, and extracting the three-dimensional coordinates of the target machinery surface is one of the key points of this method.
The threshold method is used to segment the depth image to obtain the pixel set of the target machinery surface. The threshold method compares the value of each image pixel with a given threshold and divides the pixels into two classes, the background and the target. This method has high computational efficiency and stable performance. Assuming that f(x, y) is the original image and the threshold is T, the image can be divided into two parts by the following formula:

$$g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases} \tag{6}$$

This method uses 0 (black) and 1 (white) to binarize the image. In this study, the black area is the target. The pixel coordinates of all pixels in the black area are recorded, and the depth of these pixels is read from the depth image.
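The thresholding step can be illustrated with a short NumPy sketch. It assumes the depth image is a 2D array of depths and that pixels at or below the threshold belong to the target (value 0, black), matching the convention that the black area is the target; the function name `segment_target` is hypothetical.

```python
import numpy as np

def segment_target(depth, threshold):
    """Binarise a depth image: 0 marks the target (depth <= threshold), 1 the background.

    Returns the binary mask, the column (x) and row (y) pixel coordinates
    of the target pixels, and the depth of each target pixel.
    """
    depth = np.asarray(depth)
    mask = (depth > threshold).astype(np.uint8)
    rows, cols = np.nonzero(mask == 0)  # pixel coordinates of the target
    return mask, cols, rows, depth[rows, cols]
```

Many depth cameras report 0 for pixels without a valid measurement; in that case such pixels would also fall below the threshold and may need to be excluded first, for example with an extra `depth > 0` condition.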

Coordinate System Transformation
The camera coordinate system takes the stereo camera as the origin, whereas the coordinate system for beamforming usually takes the center of the array as the origin. Spatial matching is needed to make the two systems coincide. The coordinate system transformation is realized through rotation and translation between the two coordinate systems.

Before the coordinate system transformation, each pixel in the depth image output by the stereo camera needs to be transformed into the camera coordinate system. According to the imaging principle of the camera, the spatial transformation relationship between the pixel coordinates $(u, v)$ and the camera coordinate system $O_c X_c Y_c Z_c$ can be established, as shown in Equation (7):

$$x_c = \frac{2 l \tan(\theta_h / 2)}{U} \left(u - \frac{U}{2}\right), \quad y_c = \frac{2 l \tan(\theta_v / 2)}{V} \left(v - \frac{V}{2}\right), \quad z_c = l \tag{7}$$

where $(x_c, y_c, z_c)$ is the spatial coordinate of the pixel in the camera coordinate system, and l is the depth of the pixel. U × V is the resolution of the output depth image. $\theta_h$ and $\theta_v$ are the horizontal view angle and the vertical view angle of the camera, respectively.

The camera coordinate system $O_c X_c Y_c Z_c$ and the array coordinate system $O_a X_a Y_a Z_a$ are different spatial coordinate systems. Reconstructing the sound field by beamforming needs to be carried out in the array coordinate system, and their projection relationship is shown in Figure 3. Assume that a point has coordinates $(x_c, y_c, z_c)$ in the camera coordinate system and $(x_a, y_a, z_a)$ in the array coordinate system. The relationship between the two coordinate systems is:

$$\begin{bmatrix} x_a \\ y_a \\ z_a \end{bmatrix} = \mathbf{R} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix} \tag{8}$$

where $\mathbf{R}$ is the three-dimensional rotation matrix [24], which rotates the camera coordinate system around the $X_c$ axis by $\gamma$, around the $Y_c$ axis by $\beta$, and around the $Z_c$ axis by α, so that the two space coordinate systems are parallel to each other. The vector $(t_x, t_y, t_z)$ is the translation between the origins of the two coordinate systems.

After the coordinate system transformation, the target area in the depth map obtained by stereo vision is converted to the corresponding 3D point cloud, and the point cloud is transformed into the microphone array coordinate system.
The reconstruction points of the three-dimensional sound source position are replaced with the 3D point cloud, and the beamforming calculation is performed on the surface of the target to achieve the sound field reconstruction on the 3D surface.
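The two conversions in this section, from depth pixels to camera coordinates and from camera coordinates to array coordinates, can be sketched as follows. The sketch assumes an ideal pinhole camera whose optical axis passes through the image center and whose view angles span the full image width and height; the rotation order (X, then Y, then Z) and all function names are assumptions made here for illustration.

```python
import numpy as np

def pixels_to_camera(u, v, depth, width, height, fov_h_deg, fov_v_deg):
    """Map pixel coordinates (u, v) with depth l to camera coordinates (x, y, z)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    depth = np.asarray(depth, dtype=float)
    half_w = np.tan(np.radians(fov_h_deg) / 2.0)
    half_h = np.tan(np.radians(fov_v_deg) / 2.0)
    # Offset from the image center, scaled by depth and half the view angle.
    x = depth * half_w * (2.0 * u / width - 1.0)
    y = depth * half_h * (2.0 * v / height - 1.0)
    return np.stack([x, y, depth], axis=-1)

def rotation_xyz(rx, ry, rz):
    """Rotation matrix for rotations about X, then Y, then Z (angles in radians)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rot_z @ rot_y @ rot_x

def camera_to_array(points_cam, rotation, translation):
    """Apply the rigid transform p_array = R p_camera + t to an (N, 3) point cloud."""
    return np.asarray(points_cam, dtype=float) @ rotation.T + np.asarray(translation, dtype=float)
```

When the camera lies in the array plane with parallel axes, the rotation reduces to the identity and only the translation between the two origins remains.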

Time Matching
The signals are recorded by the visual acquisition system and the acoustic acquisition system, and the two signals are matched through time synchronization. In this study, simultaneous acquisition is achieved by setting an external trigger. Then, according to the frame rate fps of the camera, the shooting time $t_i$ of the i-th frame can be obtained. Next, the acoustic signal in the time interval $[t_i - \frac{1}{2\,\mathrm{fps}},\ t_i + \frac{1}{2\,\mathrm{fps}}]$ is used to reconstruct the sound field on the i-th frame.
When the fps is high, it can be considered that the reconstruction result of using each frame of depth image is the instantaneous sound field at that moment.
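The time matching can be sketched as a mapping from a frame index to a range of audio samples. The sketch assumes that frame 0 is captured at the trigger instant, so that frame i is captured at i / fps, and that the window is one frame period centered on that instant; the function name is hypothetical.

```python
def frame_sample_window(i, fps, sample_rate, n_samples):
    """Return (start, stop) audio sample indices matched to the i-th depth frame.

    The window spans one frame period centered on the capture time i / fps
    and is clipped to the recorded signal length.
    """
    samples_per_frame = sample_rate / fps
    start = int(round((i - 0.5) * samples_per_frame))
    stop = int(round((i + 0.5) * samples_per_frame))
    return max(0, start), min(n_samples, stop)
```

With the experimental settings of 10 fps and a 51.2 kHz sampling rate, each interior frame is matched to 5120 samples; the first frame only gets the half window that follows the trigger.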

Computational Complexity of 3D Sound Field Reconstruction
The three-dimensional surface of the target machinery is obtained by stereo vision technology, and the sound field is reconstructed on this surface. This avoids reconstructing the sound field in the redundant area outside the mechanism and reduces the computational complexity. At the same time, however, it adds computation in some steps. The amount of computation during positioning is analyzed below.
The computation for beamforming when reconstructing the sound field mainly comes from two aspects: (1) the number of reconstruction points, and (2) the amount of computation for each reconstruction point. Extracting the operating mechanism area with the binocular camera reduces the number of reconstruction points, but the method in this paper also adds computation for 3D information extraction and for the coordinate conversion of each reconstruction point.
In CBF, L is the number of data points of the acoustic signal, the reconstruction surface contains N reconstruction points, and the number of microphones is M. According to the beamforming principle in Section 2.1, the main computations for a single reconstruction point include:

The Computation Increase of the Combined Method
When combining the depth image of stereo vision, the increased computation mainly includes:
1. Target region extraction: the threshold method is used to segment the image and extract the target information. The segmentation uses logical operations.

In the above calculations, roots, trigonometric functions, and exponents are complex calculations. The root operation uses the Newton iteration method, and the trigonometric functions and exponents use the Taylor series to find their numerical solutions. Addition, subtraction, and logical calculations are simple calculations. The cost of complex calculations on a computer is far larger than that of simple calculations, so simple calculations can be ignored. Table 1 shows the number of calculations of each type for one reconstruction point of CBF, and the total amount of computation added by the combined method. R is the total number of reconstruction points on the acquired three-dimensional surface of the target.

Comparative Analysis
To facilitate the comparison between a single CBF reconstruction point and all the increased computations in the combined method, complex calculations are converted into multiplication and division types through numerical calculation principles.
On a computer, the Taylor series expansion method is used to calculate the approximate value of exponential and trigonometric functions, and the cost is counted in multiplication/division (MD) operations. Expanding the exponential to n terms (n > 2) requires $(n - 1)^2 + 1$ MD operations. Expanding the sine or cosine to n terms requires approximately $2n^2$ MD operations. The tangent is the ratio of sine and cosine, so it requires approximately $4n^2 + 1$ MD operations [25]. Root calculations use the Newton-Raphson method, which iterates:

$$x_{k+1} = x_k - \frac{x_k^2 - a}{2 x_k} \tag{9}$$

In this type of calculation, in addition to one division operation to find the initial value, approximately 3n + 1 MD operations are required for n iterations [25].
Generally, the numerical solution is required to be accurate to 10 decimal places. The exponential then needs to be expanded to about 15 terms, which corresponds to 197 MD operations; the sine or cosine needs about 8 terms, which corresponds to 128 MD operations; and the tangent corresponds to 257 MD operations. The Newton-Raphson method needs about 5 to 6 iterations for the same accuracy, and 6 iterations are approximately equivalent to 19 MD operations.
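The operation counts quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the exponential count (n - 1)^2 + 1 is inferred here from the stated pair of 15 terms and 197 operations, the Newton update is the standard one for a square root written so that each iteration costs three multiplications or divisions, and all function names are hypothetical.

```python
def md_exponential(n_terms):
    """MD operations for an n-term Taylor expansion of the exponential (inferred formula)."""
    return (n_terms - 1) ** 2 + 1

def md_sine_cosine(n_terms):
    """MD operations for an n-term Taylor expansion of sine or cosine."""
    return 2 * n_terms ** 2

def md_tangent(n_terms):
    """MD operations for the tangent: one sine, one cosine, and one division."""
    return 4 * n_terms ** 2 + 1

def md_newton_sqrt(n_iterations):
    """MD operations for n Newton iterations plus one division for the initial value."""
    return 3 * n_iterations + 1

def newton_sqrt(a, n_iterations):
    """Square root of a by Newton iteration, starting from a / 2."""
    x = a / 2.0  # one division for the initial value
    for _ in range(n_iterations):
        # x*x, 2*x, and the division: three MD operations per iteration.
        x = x - (x * x - a) / (2.0 * x)
    return x
```

Evaluating these functions at 15, 8, 8, and 6 reproduces the counts 197, 128, 257, and 19 used in the comparison.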
A reconstruction point in beamforming requires approximately times of multiplication/division operations. The overall increase of the algorithm in this paper is about T times the MD, then: The difference in computation amount mainly depends on the number of reconstruction points R, the number of sound data points L and the number of array channels M of the measured target area.
For example, suppose the beamforming scan reconstruction area contains 100 × 100 points, M = 16, and L = 25,600 Hz / 30 fps ≈ 853. If the number of reconstruction points after extracting the target is half of the scanning area (R = 5000), the added computation is equivalent to only about 2.1 reconstruction points, so the total amount of computation is reduced by nearly 1/2.

Experimental Setup
The purpose of the experiment is to locate the sound source on the experimental platform. The experiment was carried out in a hemi-anechoic room. The experimental bench contains a slowly rotating swing arm driven by a motor and a sound source. Under the combined action of the motor and gravity, the rotation speed of the swing arm gradually changes from slow while climbing to fast while descending. A Bluetooth speaker with a diameter of 10 cm is installed at the end of the swing arm as the sound source to be located on the experimental bench. The speaker steadily plays a single-frequency sound.
The depth images are obtained by an Intel® RealSense™ D415 depth camera (manufactured by Intel Corporation, Santa Clara, CA, USA). As shown in this section and the next, the camera can simultaneously output a normal image obtained by a red, green and blue (RGB) sensor and a depth image obtained by stereo vision technology. The key specifications of the depth camera are presented in Table 2. The resolutions used by the camera in this experiment are 1280 × 720 (U = 1280, horizontal view angle 69.4°) and 640 × 480 (U = 640, horizontal view angle 65°), and the frame rate is set to 10 fps. The depth accuracy of the reconstruction points required for beamforming depends on the camera accuracy; the absolute error e of the measured depth has a linear relationship with the distance [23].
A spiral array composed of 24 microphones was used in the experiment and the diameter of the array was 0.8 m. The layout of the microphone array is shown in Figure 4. The microphones were MPA416. The sound signal was collected through the data acquisition instrument SCM05 of the LMS company with a sampling rate of 51.2 kHz. The rising edge of an external direct current (DC) signal was used as a trigger so that the image and sound signals were collected at the same time.
The experiment used two layouts to test the effectiveness of the combined method. In layout 1, the plane of the microphone array was parallel to the plane of the test bench and the distance between them was 1.63 m. The camera and microphone array were placed on the lower right of the test bench. The resolution of the camera was 1280 × 720, and the speaker played a single-frequency sound of 1 kHz. In layout 2, the center of the microphone array was 1.99 m away from the center of the test bench. The angle between the plane of the moving sound source in the operating mechanism and the array plane was 30° to ensure that the sound source position changed in the X, Y, and Z directions. The resolution of the camera was 640 × 480, and the speaker played a single-frequency sound of 3 kHz. (To simplify the computation of the program, the camera was placed in the same plane as the microphone array at a horizontal distance of 0.16 m from the center of the array.) Figures 5 and 6 show the anechoic room experimental layout.

Localization Result
In the experiment of layout 1, both the video and the acoustic signals were recorded as the sound source rotated from below to above the experimental platform. In layout 2, the signals were recorded as the source moved from the right to the left of the experimental platform. Figure 7 shows groups of source localization images (a, b, c) at different positions in layout 1. Each group of images contains the normal image captured by the optical camera, the environment depth image taken by the stereo camera, and the acoustic beamforming result on the surface of the experimental platform. The set of images (a) was obtained at the start position of layout 1; the sound source localization obtained by the combined method is (599, 454, 1644). Table 3 shows the positioning results and the corresponding sound power levels of the two layouts. In the table, x and y are the pixel coordinates, and z is obtained by indexing their values in the depth image.
Since the experimental platform was separated from the background, the experimental platform area could be easily distinguished from the depth image. In the experiment of layout 2, the actual number of reconstructed points in beamforming was about 22,000 points, and the number of pixels in this area accounted for about 8% of the total area. The combination method in this article added about 2.06 points of calculation in total. At the same time, in this experiment, the average time for the combined method to reconstruct a sound field was recorded as 0.039 s. The average time to reconstruct a 2D sound field using traditional reconstruction methods was 0.523 s. It can be seen that the combined method can effectively reduce the amount of computation. The obtained sound source depth was combined with the sound intensity scaling method to achieve the real-time acquisition of sound power. From these two layouts and settings, the position of the camera and array could also be adjusted to select a more suitable resolution in order to obtain better results according to actual needs.

Sound Power Results
In the experiment, the sound field on the surface of the experimental platform at each frame was reconstructed. Figures 9-11 show the monitoring results of moving sound sources during operation. According to the rotation angle of the swing arm, the movement speed of the sound source is from slow to fast.
The change of the calculated depth of the sound source with time-angle is shown in Figure 9. The x-coordinate in these figures is the time and the corresponding angle of the swing arm calculated from the localization of the proposed method. The depth of the sound source varies from 1598 to 1665 mm in layout 1. The sound source depth value is stable at about 1630 mm, which is consistent with the layout. The depth of the sound source varies from 1651 to 2060 mm in layout 2; the sound source depth changes from far to near, and then from near to far. Although the swing arm shakes randomly and the camera has imaging errors, it can still be seen that this trend is consistent with the trend of the theoretical depth curve of the sound source.

Figure 10 shows the curve of the sound pressure level (SPL) of the channel at the center of the array versus time-angle. The SPL of layout 1 is within 27.36~29.43 dB, and the range of layout 2 is 53.42~59.48 dB. In layout 1, since the sound source depth is relatively stable, the value of SPL is also relatively stable. Due to the directivity of the source and the reflection from the floor, when the Bluetooth speaker at the end of the swing arm rotates to certain angles, the sound pressure level of some channels in the array changes significantly in layout 2 (e.g., the sound pressure level at 1.2 s). However, the fitted sound pressure level curve shows that the overall trend is a decrease with increasing distance.

By combining the sound source depth change and the sound intensity scaling method, the sound power level versus time-angle curve in Figure 11 is obtained. The sound power level range of layout 1 is stable at 81.02~82.48 dB. The sound power level is within 81.02~82.48 dB in layout 2, and is more stable than the sound pressure level.
The sound intensity scaling method scales the signal of the entire array, so the influence of abnormal signals from individual microphones on the sound intensity of the entire array is reduced.
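A rough way to see why the sound pressure level drifts with depth while the sound power does not is the inverse-distance law for an ideal point source in a free field. The sketch below is only that simplified illustration, not the sound intensity scaling method of [21]; it ignores source directivity and floor reflections, and it uses the measured depth as a stand-in for the source-to-microphone distance.

```python
import math

def spl_change_db(r_near, r_far):
    """SPL drop in dB of an ideal point source when the distance grows from r_near to r_far."""
    return 20.0 * math.log10(r_far / r_near)

# Depth range reported for layout 2, in millimetres.
drop = spl_change_db(1651.0, 2060.0)
```

For the layout 2 depth range of 1651 to 2060 mm, this gives a drop of roughly 1.9 dB from spreading alone, while the power of the source itself is unchanged. The larger spread of the measured SPL is consistent with the additional directivity and reflection effects mentioned above.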

Discussion
The accuracy of the three-dimensional surface of the target is the basis for the accurate reconstruction of the 3D sound field, so the measurement error of the stereo camera affects the reconstruction area. The error between the point cloud and the real surface of the target includes the offsets on horizontal, vertical and depth directions. The depth of each pixel is calculated from the horizontal pixels of the two cameras, and this ensures that the horizontal error is smaller than the depth error. Although the accuracy of modern stereo cameras has reached a very high level, these errors may still need to be considered in some scenes with high accuracy requirements. At the same time, the error relationship between the depth error and the actual distance is given in Equation (11). It is necessary to pay attention to these key camera parameters.
Moreover, since the reconstruction of the three-dimensional sound field is limited on the surface of the target, the sound source inside the target cannot be located accurately. A manual adjustment on the depth of the surface may have to be undertaken.
It is worth noting that the recognition area (both depth and field of view) of the combined method is limited by both the stereo camera and the beamforming systems. Taking the effective recognition depth in the experiment as an example, the recognizable depth of the camera is 0.2~10 m, and the recognition depth of the microphone array is 0.3~5 m, so the depth recognition range of the entire system is 0.3~5 m.
This study uses conventional beamforming (CBF) to eventually locate the sound source. At present, many advanced algorithms have been proposed based on CBF to achieve higher precision in the positioning or better interference suppression. The CBF in the proposed fusion method can also be replaced with those methods to take advantage of them.
Furthermore, the sound intensity scaling method used to estimate sound power in this study is derived mathematically based on the physical propagation model of the point sound source. It is not suitable for accurate sound power estimation of multiple sound sources or directional sound sources.
In practice, the camera can be built into the array to avoid errors from manually measuring the offset between the camera and the array. This also has the advantage that no additional visual and acoustic registration is required when reconstructing the three-dimensional sound field distribution.

Conclusions
Aimed at the current difficulties in spatial localization and monitoring of the sound power of mobile sound sources, this paper studies the sound source localization method combining a binocular camera and beamforming. This method scans scene information through binocular cameras and outputs depth images, and obtains the structural information of the target area through image segmentation. As the range of the reconstructed sound field is limited on the three-dimensional surface of the target, the combination method can not only realize the three-dimensional localization of the sound source with less computation, but also achieve the sound field reconstruction of the moving target by updating the three-dimensional information of the target continuously. Locating the moving sound source spatially provides the depth variation of the source, and the sound power of the source can be estimated by the sound intensity scaling method. The sound power level is more stable than sound pressure level to monitor the condition of the sound source.
It was proved by a mathematical derivation and experiment that the use of stereo technology to reduce the number of reconstruction points can effectively improve computational efficiency. With mature technology and a high precision of stereo vision, the accuracy of sound source focusing and reconstruction is improved, which is conducive to obtaining accurate source positioning and sound power value.
Funding: This research was funded by National Natural Science Foundation of China (51875272); Science and Technology Major Project of Yunnan Province (202002AC080001).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.