Non-Contact Speech Recovery Technology Using a 24 GHz Portable Auditory Radar and Webcam

: Language has been one of the most effective ways of human communication and information exchange. To solve the problem of non-contact robust speech recognition, recovery, and surveillance, this paper presents a speech recovery technology based on a 24 GHz portable auditory radar and webcam. The continuous-wave auditory radar is utilized to extract the vocal vibration signal, and the webcam is used to obtain the fitted formant frequency. The traditional formant speech synthesizer is selected to synthesize and recover speech, using the vocal vibration signal as the sound source excitation and the fitted formant frequency as the vocal tract resonance characteristics. Experiments on reading single English characters and words are carried out. Using microphone records as a reference, the effectiveness of the proposed speech recovery technology is verified. Mean opinion scores show a relatively high consistency between the synthesized speech and original acoustic speech. on a 24 GHz portable auditory radar and webcam, using the vocal fold vibration signal obtained by the continuous-wave auditory radar as the sound source excitation and the ﬁtted formant frequency obtained by the webcam as the vocal tract resonance characteristics to recover speech through the formant speech synthesizer model. The LSSVM ﬁtting model was utilized based on the mapping relationship between mouth shape motion characteristics and formant frequency when the speaker pronounces, with the motion characteristics of lip during pronunciation as the input training and testing samples and the ﬁrst four formants as the output training and testing samples, respectively. The output results were compared with the formant frequencies of the speech, and they were similar and feasible within a certain range. Then, the speech synthesis was conducted based on the formant speech synthesizer. Experiments and results were presented using the radar, webcam, and microphone.


Introduction
The speech signal, which plays a significant role in daily communication and information exchange, is one of the most vital physiological signals of the human body. Therefore, numerous attempts have been made to investigate speech signal detection technology. The microphone, recognized as one of the most widely used air conduction speech signal detection technologies, has greatly improved the efficiency of human communication [1]. This device can be used in places where speech acquisition is required and has been widely used in social life. The principle of a traditional microphone is to convert the varying pressure on the vibrating diaphragm caused by the acoustic wave into a detectable electrical signal. However, the microphone is easily interfered by various background noises and has a short detection range, which limits the development of air conduction speech detection technology to a certain extent [2].
Different from the traditional microphone, the throat microphone is a contact speech detector that can record clean speech even in the presence of strong background noise [3]. It is a transducer applied to the skin surrounding the larynx to pick up speech signals transmitted through the skin, and hence, it is relatively unaffected by environmental distortions. Another representative non-air conduction detector is a bone conduction microphone [4]. This device obtains the speech signal by picking up the vibration of the vocal cords that is transmitted to the skull. Although these devices can obtain high quality speech signals, they need to be in close contact with the skin of the human subject, which restricts the activities of the user and may even cause discomfort or skin irritation. In the research of non-contact speech signal detection, laser Doppler speech detection technology was proposed [5]. This optical speech detection technology has a good performance in long-distance speech detection [6]. Nevertheless, it is susceptible to environmental influences such as temperature. The respective shortcomings of the above speech detection technologies limits their applications for human speech detection.
In recent years, a new non-contact vital sign detection technology, biomedical radar, has gradually gained attention in fields such as medical monitoring and military applications [7][8][9]. The biomedical radar uses electromagnetic waves as the detection medium. When the electromagnetic waves reach the human body, their phase and frequency are modulated by the tiny movements on the body surface caused by the physiological activity of the human body. Human physiological signals can be obtained after demodulation. In 1971, Caro used continuous wave radar to monitor human respiration for the first time [10], and since then, researchers have begun to apply it to the monitoring of human vital signs. Therefore, the use of radar for speech signal detection has attracted the attention of many researchers.
In [11], a millimeter-wave (MMW) Doppler radar with grating structures was first proposed to detect speech signals. The operating principle was investigated based on the wave propagation theory and equations of the electromagnetic wave. An electromagnetic wave (EMW) radar sensor was developed in 1994 [12]. It was then named glottal electromagnetic microwave sensor (GEMS) and used to measure the motions of the vocal organs during speech such as vocal cords, trachea, and throat [13][14][15]. In [14], the speech phonation mechanism was discussed and a vocal tract excitation model was presented. In 2005, Holzrichter verified through a special set of experiments that the vibration source of the vocal organs detected by the EM radar sensor is mainly the vocal fold [16]. However, the GEMS also needs to be placed close to the mouth or the throat.
In 2010, a 925 MHz speech radar system with a coherent homodyne demodulator was presented for extracting speech information from the vocal vibration signal of a human subject [17]. The results showed that the measured speech radar signals had excellent consistency with the acoustic signals, which validated the speech detection capability of the proposed radar system. In [18], a novel 35.5 GHz millimeter-wave radar sensor with a superheterodyne receiver and high operating frequency was presented to detect speech signals. Based on this radar, this group enhanced the radar speech signals with the proposed Wiener filter method based on the wavelet entropy and bispectrum algorithm by accurately estimating and updating the noise spectrum in terms of whole signal segments [19,20]. Moreover, they proposed a 94 GHz MMW radar system to detect human speech in free space [21,22] and utilized it to detect the vibration signal from human vocal folds [23]. However, the experiments just demonstrated the similarity of the vocal fold vibration frequency detected by the radar, microphone, and vibration measurement sensor. In our previous work, we extracted the time-varying vocal fold vibration frequency of tonal and non-tonal languages. The low relative errors showed a high consistency between the radar-detected time-varying vocal fold vibration and acoustic fundamental frequency [24]. The variational mode decomposition (VMD) was used to obtain the time-varying vocal fold vibration [25]. However, the recovery of speech from vocal fold vibration frequency detected by radar needs to be further explored.
In this paper, a non-contact speech recovery technology based on a 24 GHz portable auditory radar and webcam is proposed: a formant speech synthesizer model is selected to recover speech, using the vocal fold vibration signal obtained by the continuous-wave auditory radar as the sound source excitation and the fitted formant frequency obtained from a webcam as the vocal tract resonance characteristics. We propose a method of extracting the formant frequency from visual kinetic features of lips in pronunciation utilizing the least squares support vector machine (LSSVM). The basic detection, speech synthesis theory, and overall system description are presented in Section 2, followed by the introduction of the radar system and experimental setup in Section 3. The results and discussion are demonstrated in Section 4. Finally, the conclusion is drawn in Section 5.

Overview
The technology proposed combines a continuous wave auditory radar and webcam system and then realizes speech recovery through the speech synthesis system. The system framework is illustrated in Figure 1. The system consists of a continuous wave auditory radar system that acquires the vibration of the vocal folds, a webcam that obtains lip motion information, and a speech synthesis system based on the formant speech synthesizer model. The continuous wave radar system emits a continuous wave of 24 GHz, which is reflected by the human throat and received by the receiver. After a series of digital signal processing, the frequency of the time-varying vocal fold vibration is extracted, which is consistent with the acoustic fundamental frequency of speech [24]. The lip motion is obtained by the webcam. There is a mapping relationship between the formant frequency and the motion characteristics of the mouth shape of the speaker [26]. The motion characteristics of the lips of the speaker obtained by the webcam and the formants extracted by the speech recorded by the microphone are used as the input and output training samples in the LSSVM fitting model [27]. The rest of the lip motion characteristics are used as the input testing samples in LSSVM, and the output can be used as the fitted formant frequency, so that the formant of the speech can be obtained without the microphone. Finally, based on the formant speech synthesizer model, speech synthesis is performed based on the vocal fold vibration frequency and formant parameters to achieve speech.

Radar Detection Theory
During speech, the vibration of vocal folds is a cyclical movement and can be detected by the auditory radar. The detection principle of the auditory radar is phase estimation based on the received signal modulated by the vibration of vocal folds. The fundamental mechanism of auditory radar detection is shown in Figure 2.
The auditory radar sends a normalized single-tone signal as follows: where f is the carrier frequency and Φ(t) is the phase noise. The echo signal captured by the continuous wave auditory radar can be written as [28]: where d 0 is the distance between the radar and the subject, x(t) is the vibration of the vocal folds, and λ = c/ f is the wavelength. Since the vibration of the vocal folds x(t) is much smaller than the detection distance d 0 , the above formula can be approximated as: Radar Computer x(t) Tx Rx The signal received is mixed with the local oscillator to obtain the fundamental frequency signal, and the fundamental frequency signal is orthogonally decomposed to obtain [29]: where θ = 4πx(t) λ + θ 0 is the constant phase determined by the distance d 0 and ∆Φ(t) = Φ(t) − Φ(t − 2d 0 /c) is the total residual phase noise.
In order to extract phase information, the complex signal demodulation (CSD) method [30,31] is used to process quadrature channel baseband signals as:

Formant Fitting Theory
The resonance frequency of a vocal tract is called a formant. The human vocal tract is a resonant cavity, which can be described by a distributed parameter system. The movements of tongue and lips during pronunciation affect the shape and size of the vocal tract, thus changing the resonance frequency. Therefore, it is suggested that kinetic features of lips have a mapping relationship with formants.
Generally speaking, five identifiable formants can be extracted from voiced sounds, but the first three formants are the most important to distinguish speech, which are called the first, second, and third formants in order from low to high, respectively. Formants, different for different sounds, refer to the position of the formant frequency and the bandwidth. The first formant F 1 is related to the height of the tongue position and the roundness of the contour of the lips, while the second formant F 2 is related to the position of the tongue and the roundness of the contour of the lips. The third formant F 3 is not significantly affected by the position of the tongue, but has a greater relationship with the movement of the tip of the tongue [26].
A simple lip model is selected to obtain visual kinetic features of lips from video images. As illustrated in Figure 3, the lip model is described by the periphery height (h) and width (w) of lip. Four kinetic features of lips in pronunciation are taken into account as follows: normalized lip height w e = w w max , normalized lip width h e = h w max , normalized change rate of lip height R w = ∂w e ∂t , and normalized change rate of lip width R h = ∂h e ∂t , with the maximum value of lip width w max as the normalization factor. These four features are arranged with the time of the image frames of the video to form a set of feature sequences and can be obtained by extracting the contour of lips using image processing algorithms. Since the uncertainty of the relationship between the formant frequency and the lip motion characteristics is assumed to be nonlinear, we utilized the least squares support vector machine (LSSVM). Support vector machine (SVM) is a machine learning method with a high fitting accuracy and thus more suitable for data processing with small samples. The standard SVM training samples need to solve the quadratic programming problem, making the training speed slower. Suykens J.A.K. proposed LSSVM, using a least squares linear system as the loss function to replace the quadratic programming method used by traditional SVMs [27], which greatly reduced the computational complexity. Furthermore, the model parameters of the LSSVM using the radial basis kernel function were one less than that of a standard SVM using the same kernel function, and the operation speed was significantly faster.
N training samples (x i , y i ) are given, where x i ∈ R P are the p-dimensional input training samples and y i are the output training samples, with i = 1, ..., N. In feature space, the SVM model is [32]: The objective function of LSSVM is: where e i is the error and γ is the regularization parameter. The constraint condition is: where φ(·) is the kernel function. The Lagrangian optimization method can be used to transform the above optimization problem into: Remote Sens. 2020, 12, 653 where α i is the Lagrange multiplier. It can be obtained through finding the stagnation point of the function that: By eliminating w and e, the optimization problem is transformed into solving linear equations: where y = [y 1 , ..., y N ], 1 = [1, ..., 1] and α = [α 1 , ..., α N ] T . α and b can be obtained by solving the linear equations and the LSSVM model used for function estimation is given as follows: where K(x, x i ) is the kernel function. The radial basis kernel function (RBF) is used as the kernel function in this paper, and its expression is presented as follows: where σ represents the width of the kernel parameter.

Speech Synthesis Theory
The discrete time-domain model of speech signal generation is given in Figure 4. This model, based on a digital signal processing method, is a short-term model because the vocal organs change over time during pronunciation. From the pronunciation mechanism and analysis of the pronunciation organ, we can know that the system of speech generation is divided into three parts, including the excitation system, the vocal tract system, and the radiation system [26]. The excitation system is below the vocal fold and responsible for generating a periodic vibration excitation source, which is also called the glottis subsystem. The vocal tract system is the vocal tract, which is the breathing passage from glottis to lips, and the radiation system is outside lips because speech radiates from lips.
As illustrated in Figure 4, a mathematical model of speech pronunciation is first established. The excitation source can simulate glottis, which is a periodic pulse during the pronunciation of voiced sounds and random noise during the pronunciation of unvoiced sounds, respectively. The acoustic principle shows that the resonance characteristics of the vocal tract model are completely dependent on the shape of the vocal tract. Changing the shape of the oral cavity and the vocal tract is equivalent to changing the parameters of the filter, which represent the resonance characteristics of the vocal tract. The accent and pitch of the synthesis can be changed by adjusting the intensity or period of the excitation source pulse. Therefore, by changing the excitation source and filter parameters, a suitable speech can be synthesized.  In this work, we select the formant model as the vocal tract model, which utilizes the formant parameters as the vocal tract parameters to describe the vocal tract. Within one frame, the excitation signal generated by the excitation source is modulated by several formant filters, which simulate the resonance characteristics of the vocal tract and are transmitted by the radiation model. The filters formed by these parameters can simulate the resonance characteristics. There are three practical models for the theory based on formants. In the cascade formant synthesis model, the formant filters are connected end-to-end. In the parallel model, the input signal is first adjusted by amplitude and then added to each formant filter and the output of each channel is superimposed. Based on these two models, people combined them and proposed a hybrid formant model. Here, we choose a cascade-parallel hybrid formant model with a relatively good synthesis effect, which converts the model into a series-parallel connection of several second-order digital filters. 
Second-order digital filters not only can accurately describe the characteristics of a single formant, but also have faster and more efficient calculation with the same spectral accuracy. The framework of the hybrid formant synthesizer system is shown in Figure 5. The formant frequency F refers to the area where the energy of the speech is relatively concentrated in the frequency spectrum, that is the resonance frequency of the vocal tract. During the pronunciation of voice sounds, the original resonance amplitude of the sound spectrum is partially enhanced and partially weakened by the modulation of the vocal cavity, forming a new undulating envelope. The positions of the new envelope peaks are the formants, which determine the sound quality. Therefore, the formant frequency is consistent with the resonance frequency of the vocal tract. The formant bandwidth B w refers to the 3 dB bandwidth of the formant peak position, which is the half-power bandwidth. The tone of speech is different because different formant modes determine different formant parameters. Each set of parameters determines a formant filter so that combining multiple formant filters can simulate the resonance characteristics of the vocal tract. In fact, the formant filter and combination method are fixed, but the parameters of the filter vary for every frame. For any set of formants, the transfer function of the second-order bandpass formant synthesis filter is shown as: where θ i = 2πTF i and r i = e −B i πT describe the poles of the denominator of the filter transfer function, F i is the frequency, B i is the bandwidth, and b 0 is the gain coefficient that makes the filter response one at the center frequency (z = e −jθ i ). b 0 is given as follows: Remote Sens. 2020, 12, 653 8 of 16 Therefore, for any set of formants, when the frequency F i and the bandwidth B i are known, a second-order bandpass formant synthesis filter can be calculated. 
By connecting several such second-order digital filters, the resonance characteristics of the vocal tract can be simulated

Experimental Setup
The photographs of the 24 GHz auditory radar from the front side and the right-hand side are illustrated in Figure 6. Figure 6a shows a pair of 4 × 4 antenna arrays designed to enhance directivity with an antenna directivity of 19.8 dBi. The antenna arrays included an RF front-end fabricated and integrated on a Rogers RT/duroid 5880 microwave substrate, thereby reducing the overall size of the device to 11.9 cm × 4.4 cm. The baseband board was integrated into the substrate, which supported the RF board as given in Figure 6b. A power interface was placed in the microcontroller unit (MCU) board, which cascaded the baseband board to power the entire probe. In order to ensure the high sensitivity of the vocal fold frequency detection of this system, this system used a continuous wave of 24 GHz as the transmitted signal waveform, which had a µm-scale motion detection sensitivity [33]. The key parameters of the auditory radar are shown in Table 1. The radio frequency signal was transmitted through the transmitting antenna and acted as a local oscillator (LO) for the mixer of the signal receiving chain, as well. At the receiver, the received echo signal was first amplified by two-stage low-noise amplifiers (LNAs). Compared with existing 24 GHz integrated mixer chips, a cost-effective six port structure was used here. The output of the six port down-converter was a differential quadrature signal, which was amplified by two differential amplifiers to produce a baseband I/Q signal. The received RF gain and baseband gain were 34 dB and 26 dB, respectively. The baseband signal was fed to a 3.5 mm audio jack, which could be easily connected to the audio interface of a laptop or smartphone to process the signal in real time. Ten volunteers including 4 females and 6 males (from 22 to 28 years old) without any phonation disorders were selected as experimental volunteers. The details of those volunteers are given in Table 2. The experimental setup is demonstrated in Figure 7. 
The volunteer was asked to sit on a chair, and the antennas of the 24 GHz auditory radar faced the throat of the volunteer at a distance of 50 cm. The radar data, sampled at a frequency of 44,100 Hz, were then transmitted over a wired link to a laptop and stored on it. At the same time, a cellphone (iPhone 6s) with webcam and microphone embedded was placed 80 cm away from the volunteer to capture videos of lips and the acoustic signal for comparison with the auditory radar.  To guarantee high quality signals, the volunteer was required to remain seated and read specified characters or words in a quiet indoor laboratory environment during the experiment. In this paper, 8 English letters including "A", "B", "C", "D", "E", "I", "O", and "U" and two words "boy" and "hello" were selected, and each of the words was recorded 10 times by both the radar and webcam. The experimental data were processed within MATLAB and Praat: doing phonetics by computer. Praat is a cross-platform multi-functional phonetics professional software for analyzing, labeling, processing, and synthesizing digital voice signals and generating various languages and text reports.

Vocal Fold Vibration Extraction
First, the auditory radar was used to detect the vibration frequency of vocal folds, which was verified as the fundamental frequency of the speech signals [24]. The radar-detected time-domain signal was decomposed by the VMD after filtering and segmentation as in our previous work [24]. Praat was used to extract the microphone-detected fundamental frequency for comparison, which is known as one of the most accurate methods in speech signal processing. Figure 8a presents the comparison between auditory radar-detected time-varying vocal fold vibration and the acoustic fundamental frequency values of the character "A". It was illustrated that the vocal cord vibration frequency detected by the radar was about 180 Hz, and the trend of the envelope was consistent with the acoustic fundamental frequency values. The comparative result of the word "boy" is shown in Figure 8b. Similarly, the radar-detected frequency closely matched the microphone-detected one. The observed fluctuation of instantaneous frequencies indicated the frequency deviation of the diphthong in this word. Here, we define the deviation degree of the acoustic fundamental frequency values as relative error: } × 100%, n = 1, 2, 3......N (15) where f r (t) means the radar-detected vibration frequency and f v [n] is the acoustic fundamental frequency at the moment t n , n = 1, 2, 3......N. The relative errors of the fundamental frequency are shown in Table 3. From this table, we find that the relative errors of the characters and words tested were below 10%. Compared with our previous work in [24], the number of participants in the experiments increased from two to 10, and the results were similar. The low relative errors showed a high consistency between the radar-detected vibration and acoustic fundamental frequency. In addition, the duration of these English characters and words are given in the table to illustrate the difference between characters and words.

Formant Frequency Fitting
Videos recorded of 10 subjects reading characters and words were segmented by the silent segments and the voiced segments of the audio signal. There is a simple relationship between three components of red-green-blue (RGB) color space to separate lip region from skin region [34]. We could perform a set of operations on the three components of RGB. The R (red) component and the B (blue) component were subtracted from the G (green) component and then added as: where f r , f g , and f b are the R, G, and B components in the RGB image, and as a result, f had a good performance in distinguishing lip and skin. Then, lip movement images obtained from voiced segments of videos were binarized, and miscellaneous points were wiped off to extract the contour of the outer lip. Video image processing can obtain visual information such as color, brightness, and outline from the image of each frame. In addition to visual information, motion information can also be extracted from the difference between continuous frames. Four sets of lip motion features (w e , h e , R w , R h ) during phonation were arranged with the time of the frames of the video to form a set of feature sequences. The set of feature sequence packets preserved the chronological relationship of the frames, as well as the global motion relationship of the video frames. Sixty percent of these features were used as the input samples, and the corresponding four microphone-detected formants were used as the output samples in LSSVM training. The remaining 40% of these lip movements features were used as the input testing samples, and the output was the fitted formants.
The performance and convergence of LSSVM model seriously depended on the input and parameters. When establishing the LSSVM prediction model, the regularization parameter γ and kernel parameter σ had a great influence on the performance of the model. If the value of the regularization parameter γ was small, the penalty on the sample data was small, which made larger training errors and strengthened the generalization ability of the algorithm. If the value of γ was large, the corresponding weight was small, and the generalization ability of the algorithm was poorer. When using a Gaussian kernel function, a small value of σ would cause over-learning of sample data, and a too large value would cause under-learning of sample data. In this paper, the determination of these two parameters was based on cross-validation and experience: Four sets of formants fitted by LSSVM were compared with acoustic formants extracted by speech signal processing software Praat, and the results of the word "a" and "boy" are presented in Figure 9. Figure 9 indicates that although some fitted values were not very close to the acoustic ones, the trend of fitted formants was consistent with the trend of the formant frequency of the original speech when the word was pronounced. Figure 9a shows the fitted formants of "a", and since it was a monophthong without any changes of tone and pitch, the fitter effect was slightly better than that of the word "boy" illustrated in Figure 9b. The acoustic formants of "boy" extracted by Praat had some discontinuities, and the LSSVM fitting results could not fit the drastic transition of them.
(a) (b) Figure 9. Comparison between the original acoustic formant and the synthesized formant frequency of the English letter and word (a) "a" and (b) "boy", respectively. The green, red, blue, and orange " * " symbols represent the first, second, third, and fourth original formant frequency, respectively. The green, red, blue, and orange lines represent the first, second, third, and fourth fitted formant frequency, respectively.
There are various reasons for the disagreement appearing in the comparisons of Figure 9. Although Praat is a classic and widely used tool to extract the formant frequencies in speech signal processing, there are inherent errors. As presented in Figure 9, the acoustic formants extracted by Praat had some discontinuities, and the formants in higher orders may have been mistaken as formants in lower orders, which showed the inaccuracy in extracting formants. Hence, the trained model may have had large errors locally when fitting the formants of the testing data. However, for a specific sound, the fitted formant frequencies were within a reasonable range. These errors would not have much effect on the final speech synthesis and the recognition of the synthesized sound. Furthermore, since the formant was not completely determined by the shape of lip, the formant fitting was relatively inaccurate due to the lack of information. The effect of fitting needed to be judged by the effect of speech synthesis.
It can be seen from Table 4 that there was a certain error between the frequency of the fitted formant and the frequency of the formant extracted from the original speech. The error of the first formant of "E" in the table is significantly larger than the others because when the tester was pronouncing the English letter "E", the first formant frequency was about 200-300 Hz, which was much lower than the first formant frequency of other letters. Compared with other letters, the denominator f o [n] was smaller when calculating the relative error, and therefore, the first formant of "E" was relatively larger than the others. The formant frequency was within a reasonable range, and the fitting effects needed to be evaluated by the speech synthesis results.

Speech Synthesis
Since the speech recovery technology introduced in this paper was based on the time-varying vocal fold vibration frequency obtained by auditory radar as the excitation source of speech synthesis, we chose the extracted time-varying vocal fold vibration period as the pitch period excitation, white noise as the consonant excitation, and the fitted formant frequency extracted by the webcam as the vocal tract resonance characteristics to synthesize speech [26]. Rabiner proposed to add a high-frequency peak as compensation when using the formant parameters for speech synthesis [35]. Here, the fourth formant was selected as a fixed value with a center frequency of 3500 Hz and a bandwidth of 100 Hz. For the bandwidths of the other three formants, we took fixed values B w1 = 150 Hz, B w2 = 200 Hz, B w3 = 250 Hz. From the resonance frequency and bandwidth, the second-order filter coefficients could be calculated. In order to make the synthesized sound more natural, we calculated the energy of each frame of speech to adjust the amplitude of the synthesized speech and make the energy of each frame of synthesized speech the same as the frame energy of the original speech. In the overlapping part between frames, a linear proportional overlap addition method was used. Several audio files of the synthesized speech are presented in the Supplementary Materials. Figure 10 shows the detection results of the microphone system and the radar system for the English character "A", which were compared with the synthesized speech. Figure 10a-c presents the time-domain waveform of the microphone, the time-domain waveform of the radar after filtering and segmentation, and the time-domain waveform of the synthesized speech, respectively. Figure 10d-f depicts the spectrogram of the signals detected by the microphone, radar, and synthesized speech, respectively. 
As shown in Figure 10a-c, compared to the microphone-detected and radar-detected signals, the synthesized speech lost part of the high-frequency components, which can also be seen in Figure 10d-f. The fundamental frequencies in Figure 10d-f are similar, while the high-frequency components and energy distributions differ. As illustrated, there are multiple frequency bands in both Figure 10d,f, with the energy mainly distributed at about 200-400 Hz, 2000-2200 Hz, and 3000-3800 Hz. The similar energy distributions in Figure 10d,f demonstrate the effect of the fitted formants. Unlike these two, Figure 10e contains only a few low-frequency components, so the radar-detected signal alone could not be distinguished as a letter. Although the synthesized speech could be successfully recognized as English letters, the timbre variation of the synthesized speech was not obvious, and the speech sounded somewhat unnatural, which may be ascribed to the difference between the high-frequency components of the microphone-detected result in Figure 10d and those of the synthesized speech in Figure 10f. Figure 11 depicts the detection results of the microphone system and the radar system for the English word "boy", compared with the synthesized speech. Like Figure 10, Figure 11a-c shows the loss of high-frequency components in the time-domain waveform of the synthesized speech. Figure 11d-f illustrates that the spectrograms of the microphone-detected, radar-detected, and synthesized speech were consistent in their distribution patterns. The similar change of the energy distribution in Figure 11d,f showed a high consistency between the acoustic and fitted formants. The rise and fall of frequencies in these distribution patterns reflect the frequency transition of the diphthong in this word. Nevertheless, there were still differences between Figure 11d,f at high frequencies.
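Spectrogram comparisons like those in Figures 10 and 11 can be reproduced with a standard short-time analysis; this is a generic sketch with a synthetic signal, not the authors' exact analysis settings (the window length and overlap below are assumptions):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(fs) / fs
# Synthetic stand-in for a recorded frame: a 220 Hz "fundamental" plus a
# weaker 2100 Hz component, mimicking low- and mid-band energy concentrations.
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 2100 * t)

# 25 ms Hann windows with 50% overlap (assumed analysis parameters)
f, tt, Sxx = spectrogram(x, fs=fs, window="hann", nperseg=400, noverlap=200)

# The dominant frequency bin should sit near the stronger low component,
# and the band around 2100 Hz should carry more energy than, say, 1000 Hz.
mean_power = Sxx.mean(axis=1)
peak_freq = f[np.argmax(mean_power)]
```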
The mean opinion score (MOS) is a subjective measurement and probably the most widely adopted and simplest procedure for evaluating speech quality [36]. In this paper, MOS tests were conducted in which 10 volunteers were asked to assess and compare the synthesized speech and the original speech. The full score was five points according to the following criteria: 1: unqualified; 2: qualified; 3: medium; 4: good; 5: excellent. Each assessor listened to the synthesized speech at five-minute intervals to avoid psychological cues caused by continuous testing. The evaluation results are shown in Figure 12, and the quality of the synthesized speech obtained by the speech recovery system of this paper was satisfactory.
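Aggregating such ratings into a per-utterance MOS is straightforward; the scores below are hypothetical placeholders, not the actual Figure 12 data:

```python
import statistics

# Hypothetical ratings from 10 volunteers for one synthesized utterance,
# on the 1 (unqualified) to 5 (excellent) scale used in the text.
ratings = [4, 4, 3, 5, 4, 3, 4, 5, 4, 4]

mos = statistics.mean(ratings)      # the reported MOS for this utterance
spread = statistics.stdev(ratings)  # inter-rater spread as a sanity check
```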

Conclusions
In our former work, the vibration of the vocal folds was extracted by the continuous-wave radar based on VMD, which was consistent with the fundamental frequency of the original speech. However, little had been done to recognize and recover the original speech directly using the vibration of the vocal folds. Therefore, in this article, we proposed a non-contact speech recovery technology based on a 24 GHz portable auditory radar and webcam, using the vocal fold vibration signal obtained by the continuous-wave auditory radar as the sound source excitation and the fitted formant frequency obtained by the webcam as the vocal tract resonance characteristics to recover speech through the formant speech synthesizer model. The LSSVM fitting model was utilized based on the mapping relationship between mouth shape motion characteristics and formant frequency during pronunciation, with the motion characteristics of the lips as the input training and testing samples and the first four formants as the output training and testing samples, respectively. The output results were compared with the formant frequencies of the speech; they agreed within a certain range, confirming the feasibility of the fitting. Then, the speech synthesis was conducted based on the formant speech synthesizer. Experiments and results were presented using the radar, webcam, and microphone.
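As a rough illustration of LSSVM regression of the kind used for the lip-to-formant mapping (the RBF kernel, its parameters, and the toy training data below are all assumptions for the sketch, not the paper's actual model or data):

```python
import numpy as np

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Train an LSSVM regressor with an RBF kernel.

    LSSVM replaces the SVM's inequality constraints with equalities, so
    training reduces to solving one linear system:
        [ 0   1^T         ] [ b     ]   [ 0 ]
        [ 1   K + I/gamma ] [ alpha ] = [ y ]
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K @ alpha + b

# Toy stand-in: map a 2-D "lip feature" (e.g., normalized mouth width and
# height) to a single formant frequency along a smooth synthetic surface.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(60, 2))
y = 500.0 + 800.0 * X[:, 0] + 300.0 * X[:, 1]   # synthetic "F1" in Hz
b, alpha = lssvm_fit(X, y)
pred = lssvm_predict(X, b, alpha, X)
```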