Watermarking Based on Compressive Sensing for Digital Speech Detection and Recovery

In this paper, a novel imperceptible, fragile and blind watermark scheme is proposed for speech tampering detection and self-recovery. The embedded watermark data for content recovery is calculated from the original discrete cosine transform (DCT) coefficients of host speech. The watermark information is shared in a frames-group instead of stored in one frame. The scheme trades off between the data waste problem and the tampering coincidence problem. When a part of a watermarked speech signal is tampered with, one can accurately localize the tampered area, the watermark data in the area without any modification still can be extracted. Then, a compressive sensing technique is employed to retrieve the coefficients by exploiting the sparseness in the DCT domain. The smaller the tampered the area, the better quality of the recovered signal is. Experimental results show that the watermarked signal is imperceptible, and the recovered signal is intelligible for high tampering rates of up to 47.6%. A deep learning-based enhancement method is also proposed and implemented to increase the SNR of recovered speech signal.


Introduction
With the development of Internet and communication technology, the utilization of multimedia data is becoming common in our daily life, but multimedia data can be tampered with easily, it may be distorted during spreading through the Internet, and may be manipulated by an adversary. It is important to develop algorithms to protect and recover data. The watermark algorithm is a powerful kind of tool for multimedia authentication and verification, as well as tampering recovery. One advantage of watermark algorithms is that the algorithm can localize the tampering area, and on that basis, a number of watermarking schemes aiming at recovering multimedia data have been developed [1,2]. Many watermark algorithms used for the self-recovery in the image literature have been proposed in [3,4]. However, in this paper, we propose an imperceptible, fragile, blind

Watermarking Embedding Procedure
The watermark algorithm consists of two parts: the watermark embedding part and data recovery part. In this study, we embed the watermark into the LSBs of signal. The reference data are generated from a quantization of a linear representation of original DCT coefficients. Figure 1 shows a sketch of the watermark embedding procedure.
We partition the speech signal into non-overlapping frames, each frame contains N data points. Then, DCT is performed in each frame to yield the DCT coefficients. where where x(n) is the original speech signal, X(n) is the DCT coefficients of the original signal. According to a secret key, pseudo-randomly permute the DCT coefficients in frame units. Assume that each frame group contains m frames. Combine the adjacent m frames into one vector, whose length is n × m where n equal to N, the coefficient-vector is And calculate k reference values where k is the number of reference values in one frame-group in the following linear manner.       r 1 r 2 . . .
where A is a pseudo-random matrix and the Euclidean norm of each row is one. For generating A, we may first produce a matrix A 0 sized k × m × n whose elements are derived from a secret key and satisfy an independent identical Gaussian distribution with zero mean. Then, the matrix A can be obtained.
We allocate the k reference value in m frames each group, k/m reference values each frame, and then we quantize the reference values in a uniform manner. where where R max represents the maximum data after quantification, q is the quantitative parameter. The quantization changes the float reference values into integers to meet the storage constraint. We calculate l check-bits for tampered areas localization. We collect the MSBs of one frame, the position of the frame, and the reference values, then import them into a hash function to produce l bits' hash-bits. Randomly, we generate l bits' label-bits. For all frames, the label-bits are identical. Then, we calculate the exclusive-or result between the hash-bits and label-bits. 31) are the random label-bits, and c i (1), c i (2), ...c i (31) are the check-bits. The result is l check-bits. We add the check-bits to the reference values, and regard these bits as our watermark. In our scheme, we set n equal to 64, m equal to 16, and k equal to 368; thus, each frame contains 23 reference values, l equal to 31, R max equal to 16,384 and q equal to 1500. By quantization, each reference value is converted into an integer within [−16,384, 16,383]. Therefore, eachr occupies 15 bits, so the number of bits in a frame for reference values is 345. Combine the 345 reference-value-bits, 31 check-bits and eight bits equal to zero. These 384 bits are used to replace the six planes of the frame, producing the watermarked speech.

Signal Recovery Procedure
A watermarked speech signal may be tampered with when transformed via the Internet. Our motivation is to recover the original content. For achieving this aim, we should localize the tampered areas, extract the correct reference values and employ the reference values to calculate the original speech data. For tampered areas localization, we compare the check-bits of every frame group. To recover the original content, we employ the compressive sensing technique to retrieve the DCT coefficients by exploiting their sparseness. Figure 2 shows a sketch of the watermark recovery procedure. The details of each part are as follows.

Tampered Area Localization
The first step of signal recovery is to localize which part of the signal has been tampered with. Dividing the received signal into non-overlapping frames in the same manner as watermark embedding, we then extract the 384 bits from the six LSB-layers, which consist of 345 reference value bits and 31 check-bits. For each frame, we feed the 640 bits in the MSB-layers, the 64 position-bits and the 345 reference bits into the hash function to calculate the 31 hash-bits. Then, we calculate the exclusive-or between the 31 hash-bits and the 31 check-bits.
where h i (1), h i (2), ...h i (31) are the obtained hash-bits, c i (1), c i (2), ...c i (31) are the extracted check-bits, and we regard the label-bits l i (1), l i (2), ...l i (31) as the result. Comparing the results among all frames, frames whose label-bits are identical to most label-bits are judged as being not tampered. Otherwise, we consider the frame to be tampered. Although a receiver does not know the original label-bits, he can compare the calculated label-bits of all frames. If more than 40% of the frames possess the same result, the receiver can judge them as "reserved" frames and the other frames as "tampered" frames. Clearly, an unmodified frame must be judged as "reserved" as long as the rate of tampered frames is less than 60%. The probability for a frame containing modified content but being falsely judged as "reserved" is 2 −31 , which is extremely low, indicating that false judgment is virtually impossible.

Data Recovery
If all frames within the concerned region are not tampered, or all frames are reserved, there is no need to recover. If some frames within the concerned region have been tampered, we would extract (16 − z) × 23 reference values for recovery, where z is the number of tampered frames. After removing the tampered z frames, Equation (4) is reduced to . . .
where vector α 1 , α 2 , ..., α M are the reference values extracted from reserving frames whose length is M = (16 − z) × 23, matrix A (E) is a matrix whose rows are taken from matrix A, corresponding to reserved frames. The vector v could be decomposed into two parts: the DCT coefficients v R from the reserved area, the DCT coefficients v T from tampered area. In addition, our purpose is to find the where matrix A (E,R) is a matrix whose columns are those in matrix A (E) corresponding to vector v R , and matrix A (E,T) is the counterpart of the vector v T . The vector v R could be obtained from the watermarked signal directly. Deal with the extractable reference values α 1 , α 2 , ..., α M to get the original reference-values which are not quantized.
We employ the compressive sensing technique to solve v T by exploiting the sparseness of the DCT coefficient. After v T is obtained, the original signal can be recovered.

Content Recovery by Compressive Sensing
Using the compressive sensing technique, a sparse signal can be retrieved from a relatively small number of measurements. Suppose that x is a column-vector with most elements close to zero and where is Φ a random matrix and y consists of the measurements. Even though the length of y is significantly less than that of x, with the knowledge of the measurements y and the matrix Φ, one can approximately reconstruct the original signal. Several algorithms have been proposed such as orthogonal matching pursuit [30] and gradient projection for sparse reconstruction (GPSR) [29]. Because the reference-values are calculated from the original discrete cosine transform (DCT) coefficients of host speech signal, the reference-values are sparse in DCT domain. Then, view A (E,T) in (16) as the matrix Φ in (17), and S(1), S(2), ...S(M) as a series of measurements of sparse signal. Then, employ the GPSR method in [29] to solve v T . Here, the more the available measurements, the more exactly the original coefficients can be retrieved. After getting the vector v T , we can calculate the content of tampered frames by inverse DCT. where where X(n) is the DCT coefficients of the original signal, x(n) is the original speech signal.

Subjective Experiment
We choose five sentences from the CASIA-863-speech synthesis database randomly, represented here by the numbers one to five. All speech signals last five seconds. In addition, they were resampled to 16-bit 8-kHz to meet the requirements. The subjective experiment is mainly used to test whether watermarked speech is imperceptible or not.
The subjective difference grade (SDG) is a subjective test method for evaluating the quality of watermarked speech signals. SDG formula is as follows: The Grade TS and Grade RS represent test signal scores and reference signal scores, respectively. The SDG ranges from 0.0 to −4.0 (imperceptible to very annoying as show in Table 1). In the subjective listening test, the original and watermarked speech signals were given to ten participants, who were asked to score the speech signals, the scoring criteria are from 1.0 to 5.0, as shown in Table 1. In addition, Formula (20) was then used to calculate the SDG score. Before the start of the experiment, participants needed to understand the entire experimental process. They were properly trained to effectively assess sound quality. The participants were males and females of different ages (22-25 years) with normal hearing. Table 2 shows the average SDG scores of different watermarked speech signals using the proposed method. From the test results, we observed that the SDG ranges from −0.07 to 0.0 for all watermarked sounds using the proposed method, indicating that original and watermarked speech signals are perceptually indistinguishable.  The ABX method is another technique for subjective quality assessment of watermarked speech. Participants were ten males and females with normal hearing. In the test, the original speech signal 'A' and the watermarked speech signal 'B' were presented to each participant. A third speech signal 'X' was randomly selected from 'A' or 'B' and presented to the participants. Participants were asked to identify whether 'X' is the same as 'A' or 'B'. One identification was considered as one trial and each participant performed five trials. The correction of identification is used to determine if the watermarked speech is perceptible. A detection percentage of 50% indicates that the difference between the original and the watermarked speech is imperceptible. The evaluation results are shown in Table 2. We observed that the correct detection scores ranged from 46 to 54%, indicating that the watermarked sounds was almost imperceptible.

Objective Experiment
The objective experiment is divided into two parts, the first part is used to test the perception of the watermarked speech signal. The second part is used to test the intelligibility of the recovered speech signal.
The objective quality of a watermarked speech signal is measured by SNR. The SNR values of all selected speech signals using the proposed method are shown in Table 2, and we can observe that all the values are above 20 dB, meeting the requirements of International Federation of the Phonographic Industry (IFPI) [31] for audio quality. Table 3 shows SNR and mean opinion score (MOS) results required for the comparison of the proposed method with several recent methods, which are based on the reported results in [31][32][33][34][35][36]. The MOS method allows the tester to directly compare the original speech with the test speech. The score is from 1.0 to 5.0, similar to the scoring criteria in Table 1. From the comparison of the results, we observed that our method outperforms others in terms of the imperceptibility of the watermarked audio signal. The MOS value of our method is close to 5.0 points, which is higher than other methods. The objective quality of a recovered speech signal is also measured by SNR. In the test, the signal of the tamper area is set to mute, and the ratio is 10%, 20%, 30%, 40% 50%, 60% respectively. After tampered area localization, all the parts that are set to mute are found. Subsequently, we extract reference-values from areas which are not tampered, and recover the tampered areas with the reference values. The SNR values of all recovered speech signals using the proposed method are shown in Table 4. In addition, Figure 3 is plotted according to Table 4. The red line represents signal 1, the black line represents signal 2, the green line represents signal 3, the blue-green line represents signal 4, the blue line represents signal 5. According to Figure 3, we can find that the tampering rate of the speech ranged from 10 to 60%. The SNR is almost down from nearly 25 to around 5. The value drops slowly. It can be seen that our algorithm has strong stability when recovering damaged speech signal. The smaller the tampered area, the better the quality of the recovered signal is. Manipulate the speech signal with a tampering percentage of 19.5%. Figure 4 shows the waveform and spectrogram of the original speech signal. Figure 5 shows the waveform and spectrogram of the watermarked speech signal. Figure 6 shows the waveform and spectrogram of the damaged speech signal. Figure 7 shows the waveform and spectrogram of the recovered speech signal. Comparing Figures 4 and 5, it can be found that the original speech signal and the watermarked speech signal are almost identical. This also shows that the watermarked speech signal is imperceptible.       Manipulate the speech signal with tampering percentage 47.6%. Almost half of the speech signal is tampered. Figure 8 shows the waveform and spectrogram of the damaged speech signal. Figure 9 shows the waveform and spectrogram of the recovered speech signal. From these figures, it can be seen that when the tampering percentage is as high as almost 50%, our algorithm can still recover the tampered or damaged speech, and the recovery effect is very good.   Manipulate the speech signal and make different parts of the signal tampered. Figure 10 shows the waveform and spectrogram of the damaged speech signal. Figure 11 shows the waveform and spectrogram of the recovered speech signal. The figures show that the recovery effect is good. In addition, we can see that whether we separately tamper the different parts of the signal or tamper one part of the speech signal, our algorithm can recover the damaged speech signal well.

Speech Enhancement
We built a deep neural network to continue post-processing of the signal. The overall research process can be roughly divided into three steps: The first step is to build a deep neural network. We use a combination of classical autoencoder and back propagation networks (BP) to build deep neural networks. The second step is to collect data for training and testing. The last step is to train and test the deep neural network model. At this stage, we conduct multiple tests, use the same data set for training under different parameters, and then use the same test set for testing. A comparative analysis of the test results for each experiment, and the results are visually represented by the frequency domain plot, the time domain plot and calculating the signal-noise ratio of the recovered speech signal and the output signal of the networks.
We build a program for the enhancement to increase the SNR of recovered speech signal. In order to test the influence of the model parameters on the recovery result, we conduct a comparative experiment on the different hidden layer nodes, hidden layers and the iterations of the deep neural network. First of all, we need to build a deep neural network for training with some fixed parameters.
After that, we just adjust the corresponding parameters what we need, and use the same training set to train the different network models, and finally use the test program to find the signal-noise ratio, the frequency domain figure and the time domain figure.

Comparison of the Number of Hidden Layer Nodes
The number of parameters of the seven-layer neural network is higher than that of the four layers. In order to save running time, we first use a four-layer deep neural network for comparison experiments, setting the number of iterations for fine-tuning to 200. The following four frequency domain figures are the experimental results. In Figure 12, (a) is the spectrogram of the watermarked but undamaged speech information, and (b) is the spectrogram of the recovered speech signal. In (c) we set the nodes of four hidden layers to 1000, 500, 100, 30. In (d), the nodes of the four hidden layers are 1000, 1000, 1000, 1000, respectively.
From Figure 12, we can see that the original watermarking signal has a good frequency domain figure. In addition, there is still obvious distortion after recovery. We use the network in which the nodes of each hidden layer is decreasing and another network in which each hidden layer has the same number of nodes to enhance it. In addition, the result is greatly improved compared to the watermark recovery effect. We compare the enhanced effect of these two network models in Figure 13.     Table 5 shows the signal-noise ratio values of the recovered speech signal with different numbers of nodes. The SNR of the recovered speech signal without any processing is the lowest. The neural network processing effectively improves the SNR of the signal. The processing results of the network with a decrement in the number of nodes is better, compared to the network with the same nodes for each layer. In addition, we choose the deep neural network model with a decreasing number of hidden layer nodes in the next experiments.

Comparison of the Number of Hidden Layers
We construct a deep neural network with four hidden layers and a deep neural network with seven hidden layers. In addition, we set the number of iterations for fine-tuning to 200. The number of hidden layers is different. We use the network in which the nodes of each hidden layer decreases to improve efficiency, and in order to facilitate comparison with the previous model recovery results, the number of the nodes of the four hidden layers is still 1000, 500, 100, and 30, and the number of the nodes of the seven hidden layers is 1000, 750, 500, 250, 100, 75, 50. Figure 14 is a frequency domain figure of the enhanced results of a four-layer network model and a seven-layer network model.  From the two graphs in Figure 14, we can clearly see that the recovery effect of the four-layer deep neural network model is better than the seven-layer network model. The structure of the training data we use is relatively simple, the multi-layered network model produces a certain degree of the over-fitting phenomenon, and the recovery effect is not as good as the network with fewer hidden layers. We can further see the difference from the time domain figure.
From (a) and (b) in Figure 15, we can clearly see that the signal energy is too high where the signal is distorted after recovery by the seven-layer network model, and does not reduce the apparent noise in the signal, as the four-layer network model does. In addition, compared with the original watermark signal in Figure 13, the speech recovery effect in the whole signal is insufficient. Table 6 shows the signal-noise ratio of the original watermark signal, the output signal of the four-layer network model and the seven-layer network model. The signal-noise ratio of the original watermark signal and the output signal of the seven-layer network model is obviously too low, and the signal is noisy. This result shows that based on the training data set and the test data set we used, the seven-layer deep neural network will produce overfitting phenomenon, and the recovery effect is not as good as a four-layer deep neural network, which has a smaller number of layers.

Comparison of Iterations
In the comparison experiments of iterative layers, we mainly compare the experimental effect of different numbers of iterations for fine-tuning. In the previous experiment, we used 200 iterations. Now we train and test the network model with 100, 200, and 500 iterations. We also use a 4-layer deep neural network model. The number of hidden layer nodes of the model is 1000, 500, 100, 30. Figure 16 shows the frequency domain figure of the network model for three different numbers of iterations. From the three spectrograms in Figure 16, we can see that as the number of iterations of the network model increases, the effect of the model on the recovery of the speech signal gradually becomes better. The distortion of the signal is more obvious when enhanced by the network model with 100 iterations. In addition, the distorted parts of the signals enhanced by the models with 200 and 500 iterations are similar to the original watermarking figure in Figure 12. Figure 17 is a time domain figure of the recovery signal for these three different numbers of iterations of the network model. In the time domain figure, compared with the original watermarking signal in Figure 13, we can clearly see that the recovery effect of the front part and the middle part of the signal increases as the number of network model iterations increases. Let's take a look at the signal-noise ratio results. As shown in Table 7, the signal-noise ratio of the output results of the 100-iteration and 200-iteration network models are improved, which proves that the recovery effect is better; however, the signal-noise ratio of the output results of the 500-iteration network model is lower than the 200-iteration model. Due to the fact that when the number of iterations is small, the network model is under-fitting, and the model recovery effect is insufficient, but when the number of iterations is too high, the training model is overfitted, resulting in a decrease in the effect of recovery again. Therefore, selecting a number of iterations between 100 and 500 will make the recovery effect better. The SNR of the recovery signal of 200-iteration model also proves this point.

Conclusions
In this paper, we proposed a novel speech watermarking scheme. This method embeds the DCT coefficients of the host speech signal into the LSBs of the host speech signal. When a part of the watermarked signal is tampered with, the watermark data in the reserved area can be extracted. Then we use the compressive sensing technique to retrieve the coefficients in the tampered area due to their sparseness. As a result, the smaller the tampered area, the better quality of the recovered content is. The watermark data information is shared in a frames-group instead of stored in one frame. It has the capacity to make a trade-off between the data waste problem and the tampering coincidence problem. The results indicated that speech could be recovered with reasonable intelligibility when the reference data is sufficient. In addition, the deep neural network can improve the signal-noise ratio of the recovered speech signal. However, there is some noise in the recovered speech signal. In addition, the watermarking in this paper is fragile. It is easy to attack. The fragile watermark lacks robustness. In the future, we will study the post-processing of the recovered speech signal to improve the quality of the recovered speech signal. In addition, we will also focus on semi-fragile watermarking schemes to improve the robustness of the speech signals.