A Speech Enhancement Algorithm for Speech Reconstruction Based on Laser Speckle Images

In the optical system for reconstructing speech signals based on laser speckle images, the resonance between the sound source and nearby objects leads to frequency response problem, which seriously affects the accuracy of reconstructed speech. In this paper, we propose a speech enhancement algorithm to reduce the frequency response. The results show that after using the speech enhancement algorithm, the frequency spectrum correlation coefficient between the reconstructed sinusoidal signal and the original sinusoidal signal is improved by up to 82.45%, and the real speech signal is improved by up to 56.40%. This proves that the speech enhancement algorithm is a valuable tool for solving the frequency response problem and improving the accuracy of reconstructed speech.


Introduction
In the field of information security and the prevention of social crimes, covertly acquiring clear remote speech signals has become an important research topic [1,2]. For years, researchers have proposed a variety of methods for measuring vibration signals [3][4][5][6][7][8][9][10][11], many of which are used in speech signal reconstruction and are based on laser. There are two main methods based on remote vibration signals detection: one is based on laser Doppler vibrometer (LDV) [3][4][5][6][7][8] and the other is optical sensing and image processing technologies [9][10][11]. LDV has the advantages of non-contact, high spatial and temporal resolution, real-time processing. But LDV cannot fulfill the full field detection. The optical sensing and image processing technology is mainly divided into two types: natural light image processing [12] and coherent light image processing [13][14][15][16]. The former has simple system structure and easy to implement. However, it requires high computational cost, and its processing speed is slow. In the 1980s, researchers of [13,14] proposed the method to extract subtle motion from speckle images. Additionally, in recent years, researchers have also reconstructed speech based on speckle images [15,16]. The principle of this method is to irradiate the rough surface object near the sound source with laser, and the reflected light interferes to form secondary speckle [17]. Then the sound makes the surrounding objects vibrate slightly, which makes the speckle pattern shift slightly. The remote speech signal is reconstructed by extracting the small movement between speckle images. The system of remote speech reconstruction based on laser speckle pattern is mainly divided into two parts: the first is the construction of an optical system, which is used to collect continuous speckle image sequences; the second is the design of a reconstruction algorithm, which is used to detect movement from speckle images and reconstruct remote speech signal. As regards optical system construction, ref. [18] proposed a set of simplified optical equipment, it reduces the cost and realizes full-field non-contact detection. This optical device makes the technology more suitable for remote monitoring. The current laser technology and imaging technology can collect high-quality speckle images at a low price.
At present, the detection method based on laser speckle images is the most suitable for this task, but the problem of frequency response has been found in the actual experiments and applications. We found that the speech signal reconstructed by this optical device [18] has inhomogeneous enhancement or attenuation of speech components at different frequencies, and the frequency response of different vibration objects is different. Any object has a natural frequency, when the vibration sound wave of this natural frequency is transmitted to the object, the vibration amplitude of the object will have the maximum growth. Different objects have different natural frequencies, so the performance of frequency response is also different. This phenomenon greatly affects the accuracy of the reconstructed sound signal.
At present, many reconstruction algorithms ignore the frequency response, fewer algorithm is proposed to solve this problem. In this paper, a speech enhancement algorithm for different vibration objects is proposed to weaken the frequency response and improve the accuracy of speech reconstruction. This paper has two main contributions.

•
The frequency response of long-distance speech reconstruction based on laser speckle image is proposed.

•
Our algorithm is a speech enhancement algorithm designed to reduce the influence of frequency response, which greatly improves the accuracy of reconstructed speech signals.
The rest of this paper is organized as follows: Section 2 introduces the methodology of this paper, including DIC method and speech enhancement algorithm. Section 3 present the experimental setup. Section 4 introduces the experimental datasets and evaluation metrics. Section 5 is the result. Section 6 is the conclusion.

Digital Image Correlation Method
There is room for improvement in many speech reconstruction algorithms. For example, the intensity method runs fast but has low accuracy. The performance of the six algorithms is compared in [34], and the results show that the cross-correlation method is one of the best choices. In order to accurately detect the frequency response in this optical device, this paper uses the DIC method as the benchmark. Figure 1 shows the conversion between object vibration and speckle pattern displacement. In this figure, the transversal plane is regarded as the xoy plane, and the axial axis is the z-axis. The motion of this object can be divided into three directions: transverse, axial, and tilt. Ref. [18] proved that when strongly defocus the speckle image captured by the camera, only the tilt motion has the noticeable impact on the displacement of the speckle image, the influence of the other two motions on the shape and displacement of the speckle image can be ignored.
Refs. [35,36] deduced the speckle formation theory in detail, and ref [18] mentioned that making the imaging plane move from Z 1 to Z 2 . A(x o , y o ) is the amplitude distribution of speckle at Z 2 .
where λ is the optical wavelength and φ is the random phase generated by the rough surface. α x , α y are tilting angle in the x-axis and y-axis respectively. The inverse of the magnification of the imaging system is expressed in M.
where F is the focal length of the imaging lens. The speckles pattern displacement d is related to the tilting angle α of the object.
According to the previous description, we know that under appropriate conditions, there is only speckle movement between two consecutive speckle images, but no shape change. And the displacement is in the xoy plane. The relationship between two consecutive speckle images I t , I t+1 is: where ∆x, ∆y are the relative displacements in the x-axis and the y-axis direction. The calculation steps of DIC method are shown in Figure 2. Take two consecutive speckle images I t and I t+1 . Calculate the correlation coefficient between two images. At the maximum correlation, the corresponding displacement is the relative displacement between two images.
Corr2 is a function of finding the correlation coefficient between two matrices. The correlation coefficient r can be expressed as: where A and B are two-dimensional matrices of the same size, A is the mean of the matrix A, and B is the mean of the matrix B.

Speech Enhancement Algorithm
We use a sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant as the audio at the sound source ( Figure 3 shows the waveform and frequency spectrum of the sinusoidal signal), and collect the corresponding speckle image dataset. Then use the algorithm in Section 2.1 to reconstruct the sinusoidal signal when the vibration object is a carton ( Figure 4 shows the waveform and frequency spectrum of the reconstructed signal). By comparing Figure 3 with Figure 4, we found that the amplitude of the reconstructed signal is different at different frequencies, which seriously affects the accuracy of the reconstructed speech signal. Figure 5 is the reconstructed signal when the vibration object is a paper cup. By comparing Figure 4 with Figure 5, we found that the frequency responses of two different vibration objects are also different.

Speech Enhancement Algorithm
We use a sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant as the audio at the sound source ( Figure 3 shows the waveform and frequency spectrum of the sinusoidal signal), and collect the corresponding speckle image dataset. Then use the algorithm in Section 2.1 to reconstruct the sinusoidal signal when the vibration object is a carton ( Figure 4 shows the waveform and frequency spectrum of the reconstructed signal). By comparing Figure 3 with Figure 4, we found that the amplitude of the reconstructed signal is different at different frequencies, which seriously affects the accuracy of the reconstructed speech signal. Figure 5 is the reconstructed signal when the vibration object is a paper cup. By comparing Figure 4 with Figure 5, we found that the frequency responses of two different vibration objects are also different.    According to this phenomenon, we calculated the amplitude changes of several common vibration objects at different frequencies. According to the degree of change, the reconstructed speech signals of different frequencies are enhanced to different degrees. The above method can reduce the influence of the frequency response. The design flow of the algorithm is shown in Figure 6. To more clearly explain the steps involved in this speech enhancement algorithm, Figure 7 shows the results of each step of the speech enhancement algorithm when the vibration object is a carton.   According to this phenomenon, we calculated the amplitude changes of several common vibration objects at different frequencies. According to the degree of change, the reconstructed speech signals of different frequencies are enhanced to different degrees. The above method can reduce the influence of the frequency response. The design flow of the algorithm is shown in Figure 6. To more clearly explain the steps involved in this speech enhancement algorithm, Figure 7 shows the results of each step of the speech enhancement algorithm when the vibration object is a carton. According to this phenomenon, we calculated the amplitude changes of several common vibration objects at different frequencies. According to the degree of change, the reconstructed speech signals of different frequencies are enhanced to different degrees. The above method can reduce the influence of the frequency response. The design flow of the algorithm is shown in Figure 6. To more clearly explain the steps involved in this speech enhancement algorithm, Figure 7 shows the results of each step of the speech enhancement algorithm when the vibration object is a carton.   I.
Sinusoidal signal: Table 1 shows the common frequency range of speech. The sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant at 1. II. Capture laser speckle images: Use the high-speed camera to collect the corresponding laser speckle images. III. DIC method: Speech signals are reconstructed from speckle images using the DIC method described in Section 2.1. The discrete frequency domain sequence of the sinusoidal signal reconstructed is ( ). IV. Connect local highest point: Connect the local highest points of the amplitude signal in the reconstructed speech spectrum to obtain its envelope signal. The envelope signal is named ( ).
( ) Figure 7. The process of speech enhancement when the vibration object is a carton.

I.
Sinusoidal signal: Table 1 shows the common frequency range of speech. The sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant at 1. II.
Capture laser speckle images: Use the high-speed camera to collect the corresponding laser speckle images. III.
DIC method: Speech signals are reconstructed from speckle images using the DIC method described in Section 2.1. The discrete frequency domain sequence of the sinusoidal signal reconstructed is s( f ).

IV.
Connect local highest point: Connect the local highest points of the amplitude signal in the reconstructed speech spectrum to obtain its envelope signal. The envelope signal is named E( f ). V.
Enhancement signal: Equation (8) shows the calculation formula of the signal h( f ).
Note that E( f ) at some frequencies is very small, which will make h( f ) very large.
When h( f ) is greater than 1000, set it to 1.
VI. Real speech domain: Multiply the discrete frequency domain sequence of the reconstructed real speech r 1 ( f ) with the signal h( f ). The enhanced speech discrete frequency domain sequence is R( f ). According to the frequency range of speech, we process the speech with the frequency of 80-1400Hz.
After the following steps, the discrete frequency domain sequence R( f ) of the real speech signal will be obtained, and the speech waveform after speech enhancement can be obtained by using the inverse discrete Fourier transform (Equation (10)): where N is the total number of samples of the R sequence, R is the speech discrete frequency domain sequence after speech enhancement, and r is the speech discrete time domain sequences after speech enhancement.

Experimental Setup
According to previous research results, the formation and acquisition process of laser speckle image can be set (as Figure 8). Speckle is a three-dimensional ellipsoidal shape with its long axes facing the light propagation direction. Figure 9 shows the experimental platform. Figures 10 and 11 are the equipment simulation diagrams of two kinds of lasers, and Figure 12 is the equipment physical map. The equipment used mainly includes: 1.
The high-speed camera MVCAM AI-030U815M, with a maximum frame rate of 3200 frames per second (fps); 2.
He-Ne laser (detailed parameters are shown in Table 2); 3.
Fiber laser (detailed parameters are shown in Table 3); 4.
Machine vision experiment frame, with fine-tuning camera clip and universal clip; 5.
One personal computer (PC) with universal serial bus 3.0 (USB3.0) interface. In this experiment, the high-speed camera is used to collect speckle images. The frame rate of the camera is closely related to the exposure interval. The process of speckle image acquisition by high-speed charge coupled device (CCD) camera can be regarded as uniform sampling of continuous speckle video. The sampling frequency f s of the high-speed camera and the highest frequency of speech f m need to satisfy the Nyquist theorem (Equation (11)).
Equation (11) shows that the frame rate of the high-speed camera should be greater than or equal to two times of the highest frequency of speech. The data in Table 1 show the speech frequency range of male and female, and f m = 1200 Hz. This paper considers that all the frequency ranges in the speech can be restored to meet the actual use requirements, so the frame rate of the high-speed camera is greater than or equal to 2400 fps. The frame rate of the high-speed camera used in this experiment is 3200 fps, which meets the basic requirements of speech reconstruction.
In conjunction with Figure 1, we describe various physical parameters of speckle images. The first is the resolution of the speckle patterns in the Z 2 plane: where D is the diameter of the laser beam, Z 2 , Z 3 are different distances in Figure 1. F is the focal length of the lens. λ is the optical wavelength.
The optical system has requirements for the focal length F. The size of the pixel in the detector is ∆s. It is assumed that every speckle in this plane will be observed at least by K pixels. The requirements for F are as follows: The distance Z 2 needs to be satisfied: Finally, the number of speckles in every dimension of the spot is N: where φ is the is the diameter of the aperture of the lens, F # is the F number of the lens, Mδ x represents the speckle size obtained on the Z 2 -plane.     . Schematic diagram of speckle image formation and collection. The laser beam generated by laser S irradiates on object O with a rough surface near the sound source (P is a point on object O surface), and then the reflected light interferes to form speckle. C is the camera sensor (the size is exaggerated). Figure 9. Experimental setup diagram. The laser beam generated by laser A is projected onto object M on the N platform. The change of speckle is collected by B camera, and these images are transmitted to the D computer through the C interface to extract the movement between speckle images.   . Experimental setup diagram. The laser beam generated by laser A is projected onto object M on the N platform. The change of speckle is collected by B camera, and these images are transmitted to the D computer through the C interface to extract the movement between speckle images. Figure 8. Schematic diagram of speckle image formation and collection. The laser beam generated by laser S irradiates on object O with a rough surface near the sound source (P is a point on object O surface), and then the reflected light interferes to form speckle. C is the camera sensor (the size is exaggerated). Figure 9. Experimental setup diagram. The laser beam generated by laser A is projected onto object M on the N platform. The change of speckle is collected by B camera, and these images are transmitted to the D computer through the C interface to extract the movement between speckle images.    . Schematic diagram of speckle image formation and collection. The laser beam generated by laser S irradiates on object O with a rough surface near the sound source (P is a point on object O surface), and then the reflected light interferes to form speckle. C is the camera sensor (the size is exaggerated). Figure 9. Experimental setup diagram. The laser beam generated by laser A is projected onto object M on the N platform. The change of speckle is collected by B camera, and these images are transmitted to the D computer through the C interface to extract the movement between speckle images.

Sinusoidal Datasets
According to the speech frequency range in Table 1, we set the frequency of the sinusoidal signal to gradually change from 80 to 1600 Hz, with an amplitude of 1. Irradiate

Sinusoidal Datasets
According to the speech frequency range in Table 1, we set the frequency of the sinusoidal signal to gradually change from 80 to 1600 Hz, with an amplitude of 1. Irradiate the laser generated by the He-Ne laser on five vibration objects near the sound source: carton, A4 paper, plastic cup, paper cap, and leaf. Use the CCD camera to collect the speckle image sequence generated. The number of speckle images is shown in Table 4.

Two Laser Datasets
In order to verify the versatility of the algorithm under different laser systems, we use the He-Ne laser and the fiber laser to irradiate the carton near the sound source, and the number of speckle images collected is shown in Table 5.

Five Vibration Object Datasets
To verify the performance of the algorithm under different vibration objects, we use real speech as the sound source signal, project the laser onto different vibration objects, and collect the corresponding laser speckle image sequences respectively. The number of speckle images collected by five vibration objects is shown in Table 6.

Evaluation Metrics
This paper needs to calculate the correlation between the original audio signal and the speech signal reconstructed from speckle images. Since the start time of the two signals cannot be accurately aligned, the waveform correlation coefficient in the time domain is meaningless. To evaluate the accuracy of a reconstructed speech, we use the frequency spectrum correlation coefficient as the evaluation metrics. The frequency spectrum correlation coefficient here refers to the amplitude correlation coefficient of the frequency spectrum in detail.
As we know, scaling speech amplitude will affect the loudness of the speech, not the timbre. Therefore, using the frequency spectrum correlation coefficient can quantify the accuracy of speech reconstruction.
The value of the correlation coefficient is between -1 and 1. When it is 1, the two speeches are equal or proportional. When it is −1, one speech is equal to or proportional linearly to another negative speech.
where x and y represent the amplitude signal of discrete frequency spectrum signals with the same sampling points. N is the total number of samples. µ x , µ y is the mean value of x and y accordingly. σ x , σ y is the standard deviation of x and y accordingly. ρ is the correlation coefficient.

Performance on Sinusoidal Datasets
We recorded the changes of the frequency spectrum correlation coefficient between the reconstructed speech and the original speech after using the speech enhancement algorithm. We used the speech enhancement algorithm in this paper to process the sinusoidal datasets to verify whether the algorithm can increase the frequency spectrum correlation coefficient between the reconstructed sinusoidal signal and the original sinusoidal audio. Table 7 shows the sinusoidal signal reconstruction accuracy in detail. Figure 13 shows the line graph of the frequency spectrum correlation coefficient of reconstructed sinusoidal signal on different vibration objects. The existence of frequency response has a great impact on the reconstruction of sinusoidal signal, which will make the spectral correlation coefficient low. The sinusoidal signal dataset only verifies whether our proposed speech enhancement algorithm can reduce the frequency response; it cannot be verified whether the algorithm can also play a positive role in real speech. Therefore, the following experimental results are used to verify the enhancement effect of this algorithm on real speech.   Figure 13. Accuracy comparison of sinusoidal signal reconstruction with and without enhancement.

Performance on Two Laser Datasets
Using the same speech as the as the audio at the sound source, two lasers are used to irradiate the same vibration object (carton), and the CCD camera is used to collect the laser Figure 13. Accuracy comparison of sinusoidal signal reconstruction with and without enhancement.

Performance on Two Laser Datasets
Using the same speech as the as the audio at the sound source, two lasers are used to irradiate the same vibration object (carton), and the CCD camera is used to collect the laser speckle image sequences. Table 8 is the comparison of the frequency spectrum correlation coefficient with and without speech enhancement. Figure 14 shows the reconstructed speech and original speech waveforms using the He-Ne laser. Figure 15 shows the reconstructed speech and original speech waveforms using the fiber laser. In the waveform comparison, we can also clearly find that the waveform of the speech is significantly improved after the speech enhancement.    Table 9 shows the difference in frequency spectrum correlation coefficient between the reconstructed speech signal and the original speech using DIC and DIC + speech enhancement. The results show that after using the speech enhancement algorithm, the accuracy of speech signals reconstructed from different vibration objects is improved. This further proves that our algorithm has strong adaptability to different environments and has the practical application value.

Performance on Five Vibration Object Datasets
The results in Table 9 show that the accuracy of speech reconstruction for some datasets is improved slightly after using speech enhancement algorithm. We analyzed the reasons for this phenomenon as follows: when collecting these datasets, the surrounding  Table 9 shows the difference in frequency spectrum correlation coefficient between the reconstructed speech signal and the original speech using DIC and DIC + speech enhancement. The results show that after using the speech enhancement algorithm, the accuracy of speech signals reconstructed from different vibration objects is improved. This further proves that our algorithm has strong adaptability to different environments and has the practical application value.

Performance on Five Vibration Object Datasets
The results in Table 9 show that the accuracy of speech reconstruction for some datasets is improved slightly after using speech enhancement algorithm. We analyzed the reasons for this phenomenon as follows: when collecting these datasets, the surrounding environment has little interference, which ultimately makes the accuracy of speech reconstruction high. This can also be proved by the fact that the frequency spectrum correlation coefficient before speech enhancement is higher than that of other datasets. Figure 16 shows the line graph of the frequency spectrum correlation coefficient of reconstructed speech signals on different vibration objects. environment has little interference, which ultimately makes the accuracy of speech reconstruction high. This can also be proved by the fact that the frequency spectrum correlation coefficient before speech enhancement is higher than that of other datasets. Figure 16 shows the line graph of the frequency spectrum correlation coefficient of reconstructed speech signals on different vibration objects.

Discussions
Our speech enhancement algorithm is the speech process after the speech reconstruction algorithm, so whether the algorithm can be used in real time depends on the speed of the speech reconstruction algorithm. In the following content, we discuss some key factors that affect the performance of speech enhancement algorithm: (1) Loudspeaker performance: In order to provide a benchmark for frequency response, we use a sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant at 1 as the audio at the sound source. However, there will be errors in the sinusoidal signal played by the loudspeaker. These errors will be fewer if the loudspeaker performance is better. (2) Environmental noise and platform vibration: In the experiment, we find that the environmental noise and the vibration of the experimental platform seriously interfere with the sinusoidal signal reconstruction, and then affect the frequency spectrum correlation coefficient between the original sinusoidal signal and the reconstructed signal. The reconstructed sinusoidal signal contains environmental noise and vibration of the experimental platform, but the original signal is a clean and impurity-free

Discussions
Our speech enhancement algorithm is the speech process after the speech reconstruction algorithm, so whether the algorithm can be used in real time depends on the speed of the speech reconstruction algorithm. In the following content, we discuss some key factors that affect the performance of speech enhancement algorithm: (1) Loudspeaker performance: In order to provide a benchmark for frequency response, we use a sinusoidal signal which frequency varies from 80 to 1600 Hz and amplitude remains constant at 1 as the audio at the sound source. However, there will be errors in the sinusoidal signal played by the loudspeaker. These errors will be fewer if the loudspeaker performance is better. (2) Environmental noise and platform vibration: In the experiment, we find that the environmental noise and the vibration of the experimental platform seriously interfere with the sinusoidal signal reconstruction, and then affect the frequency spectrum correlation coefficient between the original sinusoidal signal and the reconstructed signal. The reconstructed sinusoidal signal contains environmental noise and vibration of the experimental platform, but the original signal is a clean and impurity-free speech, and the correlation coefficient between the two is low. Therefore, a quiet environment and a stable experimental platform are conducive to the verification of experimental results.

Conclusions
The main innovation of this paper is to propose a speech enhancement algorithm to solve the frequency response problem of speech signal reconstruction from laser speckle images. The experimental results in this paper show that our speech enhancement algorithm can effectively improve the accuracy of reconstructed speech, and the comparison of the reconstruction accuracy of different vibration object datasets verifies the practicability of the algorithm. In the experimental results, we found that when the ambient noise is small and the vibration interference of the experimental platform is small, the reconstructed speech accuracy is high. Our algorithm achieves a major advance of reconstructing speech from laser speckle images.
Although our speech enhancement algorithm has an obvious effect on reducing the frequency response problem, it also introduces some noise signals while increasing the accuracy of the reconstructed speech. In order to obtain higher-definition speech signals, we also need to make further improvements to the speech enhancement algorithm. In the future, we will focus on reconstructing speech with a small frequency response and high definition.