An Encrypted Speech Integrity Authentication Method: Focus on Fine-Grained Tampering Detection and Tampering Recovery Under High Tamper Ratios

Xu, Fujiu; Li, Jianqiang; Xu, Xi

doi:10.3390/math13040573


by Fujiu Xu, Jianqiang Li * and Xi Xu

Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(4), 573; https://doi.org/10.3390/math13040573

Submission received: 17 December 2024 / Revised: 7 February 2025 / Accepted: 7 February 2025 / Published: 9 February 2025

(This article belongs to the Section E1: Mathematics and Computer Science)


Abstract

With the increasing number of cloud-based speech files, the privacy protection of speech files faces significant challenges. Integrity authentication of speech files is therefore crucial, and it raises two pivotal problems: (1) how to achieve fine-grained and highly accurate tampering detection and (2) how to perform high-quality tampering recovery under high tampering ratios. In existing speech integrity authentication, tampering detection methods and tampering recovery methods constrain each other, and most tampering recovery methods are evaluated under ideal tampering conditions. This paper proposes an encrypted speech integrity authentication method that addresses both problems simultaneously, and its main contributions are as follows: (1) A 2-least significant bit (2-LSB)-based dual fragile watermarking method is proposed to improve tampering detection performance. This method constructs correlations between encrypted speech sampling points through a 2-LSB-based fragile watermarking embedding method and achieves low-error detection of tampered sampling points based on four types of fragile watermarking. (2) A speech self-recovery model based on residual recovery-based linear interpolation (R2-Lerp) is proposed to achieve tampering recovery under high tampering ratios. This method constructs the model from the correlation between tampered sampling points and their surrounding sampling points and refines the scenarios of the model according to the tampering situation of the sampling points; experimental results show that the recovered speech exhibits improved auditory quality and intelligibility. (3) A scrambling encryption algorithm based on the Lorenz mapping is proposed as the speech encryption method. This method scrambles the speech sampling points several times through a 4-dimensional chaotic sequence; experimental results show that it not only ensures security but also slightly improves the effect of tampering recovery.

Keywords:

encrypted speech integrity authentication; tamper detection; tamper recovery; least significant bit; residual recovery-based linear interpolation

MSC:

94A12

1. Introduction

The rapid development and widespread application of digital technologies and cloud computing have greatly facilitated the use and transmission of speech files. This has also led to illegal activities related to speech files, such as unauthorized distribution, unlawful copying, content modification, piracy, and infringements on the copyright of multimedia content [1]. Cloud-based speech files face privacy protection issues such as privacy leakage and malicious tampering [1,2,3] despite the implementation of security measures like encryption. However, the development of privacy protection for encrypted speech lags behind, which imposes new requirements on integrity authentication of encrypted speech. Integrity authentication of encrypted speech aims to detect and recover: it detects whether the speech has been maliciously tampered with, determines the tampered location, and then recovers the information at the tampered location. Existing integrity authentication methods therefore typically consist of tampering detection and tampering recovery. Tampering detection methods can be categorized into feature-based methods (TFM) and watermark-based methods (TWM). Tampering recovery methods can be categorized into recovery information-based methods (TRM) and correlation-based methods (TCM). However, these two components are not independent: the accuracy of tampering detection affects the tampering recovery effect, and the choice of a tampering recovery method also influences the tampering detection process.

During the tampering detection process in integrity authentication, as shown in Figure 1a, TFM [4,5,6] relies on the databases for storing speech features. In the big data context, extra databases will give rise to problems such as cost, management, and security. Therefore, in the existing research on integrity authentication, there is a growing trend to utilize TWM as the tampering detection method of choice. Synchronized information such as perceptual hash [7,8,9], speech features [10,11], most significant bits (MSB) [12], and compression information [13] often contains original speech information. Researchers have commonly used methods such as Piecewise Linear Chaotic Map (PWLCM) [14], mask [15] and Binaries of Message Size Encoding (BMSE) [16] to encode synchronized information as fragile watermarking to ensure information security. However, as shown in Figure 1b, the TWM-based integrity authentication method often requires a portion of watermarking embedding space to be utilized for embedding additional information, resulting in insufficient watermarking embedding capacity for tampering detection watermarking. Consequently, existing TWM generally exhibits coarse-grained tampering detection, making it insufficient to locate the tampered sampling points, while fine-grained methods suffer from significant tampering detection errors.

For the tampering recovery process in integrity authentication, methods such as speech coding [17,18], compressed sensing [19] and speech features compression [20,21] are commonly used to construct compressed recovery information for TRM. However, as shown in Figure 2a, the compressed recovery information occupies a certain amount of watermarking embedding space, and when both the speech segments and their corresponding recovery information are damaged simultaneously, tampering recovery cannot be performed. Although methods such as misaligned embedding strategies [22] are used to address this problem, tampered speech can still be unrecoverable, especially under high tampering ratios. In light of these challenges, some researchers propose TCM, which does not rely on recovery information. TCM uses linear interpolation (Lerp) [23,24], fitting [6], prediction, and other methods to exploit the correlation between sampling points and recover tampered sampling points. However, as shown in Figure 2b, under high tampering ratios, consecutively tampered sampling points may exhibit poor correlation, making recovery impossible.

In summary, there are two pivotal problems in the integrity authentication of encrypted speech: (1) “Granularity and accuracy”: How to refine the tampering detection granularity to the level of sampling points and reduce the tampering detection errors? (2) “High tampering ratio”: How to perform tampering recovery and ensure its effect under high tampering ratios? Existing tampering detection methods are mostly limited by tampering recovery methods, which limits their ability to embed more tampering detection watermarking. Some tampering detection methods require the construction of additional databases, impacting the security of the speech. Tampering recovery methods are often evaluated under ideal tampering conditions [17,20,21] (the tampering process does not destroy the corresponding recovery information even when the tampering ratio reaches 50% [19]) or only discuss scenarios with tampering ratios below 30% [6,23,24], making them unable to meet the practical requirements. These problems simultaneously impose requirements on both tampering detection and tampering recovery, and the mutual influence between tampering detection methods and tampering recovery methods further increases the difficulty. Existing methods generally focus on improving only one of these problems and are unable to address both simultaneously. The presence of these problems significantly affects the practicality of integrity authentication methods and restricts the development of speech integrity authentication techniques.

In order to comprehensively and effectively address these problems, this paper proposes an encrypted speech integrity authentication method based on the principles of TWM and TCM. TCM provides sufficient watermarking embedding space to improve the effect of tampering detection, which imposes higher requirements on the tampering recovery effect under high tampering ratios (comparable to or even better than TRM). Meanwhile, TWM is expected to improve the performance of tampering detection methods by improving the processes of watermarking construction, embedding, and extraction when it has sufficient embedding space.

To address these requirements and challenges, this paper proposes an encrypted speech integrity authentication method. For problem (1), this paper proposes a 2-least significant bit (2-LSB)-based dual fragile watermarking method for tampering detection, and the process is shown in Figure 3. During watermarking generation, the frame number watermarking constructed from the frame number is used to confirm the length of the original speech, and the synchronization watermarking constructed from the correlation of encrypted speech sampling points is used to detect the tampered sampling points. During watermarking embedding, 2-bit watermarking information is embedded in each sampling point by 2-LSB to construct the correlation of encrypted speech sampling points. With this embedding method, the watermarking extracted during watermarking extraction can be constructed as tampering detection watermarking. During tampering detection, the extracted frame number watermarking and synchronization watermarking are separately constructed as tampering detection watermarking and tampering localization watermarking: the tampering detection watermarking detects whether speech frames have been tampered with, and the tampering localization watermarking locates the tampered sampling points.

For problem (2), this paper proposes a speech self-recovery model based on residual recovery-based linear interpolation (R2-Lerp). As shown in Figure 4, compared to existing TCM, this model allows for tampering recovery of the target sampling point even when the surrounding sampling points have been arbitrarily tampered with. Because the degree of tampering around the target sampling points varies under high tampering ratios, this model is refined into 11 scenarios based on the tampering situation of the sampling points. A corresponding tampering recovery formula is constructed for each scenario, with the aim of achieving better recovery performance in each tampering situation.

The main contributions of the proposed method can be summarized as follows:

(1): A 2-LSB-based dual fragile watermarking method is proposed to accurately locate the tampered sampling points. This method introduces a 2-LSB-based fragile watermarking embedding method to establish the correlation of encrypted speech sampling points. Frame number watermarking and synchronization watermarking are constructed as tampering detection watermarking and tampering localization watermarking during the tampering detection process, collectively enhancing the granularity and accuracy of tampering detection.
(2): A speech self-recovery model based on Lerp [23] and R2-Lerp [24] is proposed to perform tampering recovery even under a 50% tampering ratio. This method extends the correlation of tampered sampling points to the surrounding five sampling points and constructs a more refined speech recovery model. This method exhibits favorable tampering recovery results from different perspectives, such as SNR, auditory quality, and intelligibility under different tampering ratios.
(3): A scrambling encryption algorithm based on the Lorenz mapping is proposed to encrypt speech. By comparing the encryption performance and tampering recovery performance with six commonly used traditional scrambling encryption methods, the proposed encryption algorithm maintains a certain degree of security while slightly improving the tampering recovery effect.

The remainder of the work is organized as follows: Section 2 provides a detailed description of Lorenz mapping, LSB, and R2-Lerp. Section 3 introduces the encrypted speech integrity authentication method and describes the processing in detail. Section 4 shows the performance of our presented approach through the experimental results and the performance analysis. Finally, Section 5 concludes the work of this paper.

2. Basic Concepts

This section presents some basic concepts in the proposed method, including Lorenz mappings used for speech encryption, LSB for watermarking embedding, and R2-Lerp for tampering recovery.

2.1. Lorenz Mapping

Lorenz mapping [25], as a type of hyper-chaotic system with complex chaotic behavior, is commonly used as a scrambling sequence for encryption. State variables w₁, w₂, w₃, and w₄ are calculated as shown in Equation (1).

When the parameters are α = 10, γ = 28, and β = 8/3 and ρ lies in the range −1.52 < ρ < −0.06, the system exhibits hyper-chaotic behavior.

\begin{cases} \dot{w}_{1} = \alpha (w_{2} - w_{1}) + w_{2} w_{3} \\ \dot{w}_{2} = \gamma w_{1} - w_{2} - w_{1} w_{3} + w_{4} \\ \dot{w}_{3} = w_{1} w_{2} - \beta w_{3} \\ \dot{w}_{4} = - w_{1} z_{n} + \rho w_{4} \end{cases} .

(1)

2.2. Least Significant Bit Method

The LSB method embeds information by modifying the least significant bit of the carrier file and is widely applied in speech information hiding and audio steganography [26,27,28]. Compared with the digital watermarking method based on the frequency domain, this method is simple and efficient and provides high transparency, but it has poorer resistance to various audio content-preserving operations. The fragile watermarking method in this paper is mainly used to detect whether the speech has been maliciously tampered, and robustness is typically a key indicator to consider when evaluating robust watermarking for legitimacy authentication. However, the robustness of fragile watermarking can hinder the detection of malicious tampering. Therefore, the method in this paper does not need to consider the robustness of the watermarking method, thus effectively avoiding the problem of poor robustness in the LSB method.

2.3. Residual Recovery-Based Linear Interpolation

Lerp calculates the approximate value y₂ at a point x₂ within the interval [x₁, x₃] from two known sampling points (x₁, y₁) and (x₃, y₃), and the calculation formula is shown in Equation (2):

y_{2} = y_{1} + \frac{(x_{2} - x_{1}) y_{3} - (x_{2} - x_{1}) y_{1}}{x_{3} - x_{1}} = \frac{(x_{3} - x_{2}) y_{1} + (x_{2} - x_{1}) y_{3}}{x_{3} - x_{1}} .

(2)

Residual recovery-based linear interpolation [24] uses four sampling points (x_t₋₂, y_t₋₂), (x_t₋₁, y_t₋₁), (x_t₊₁, y_t₊₁), and (x_t₊₂, y_t₊₂) to optimize the approximate value y_t of the tampered sampling point x_t, and the calculation formula is shown in Equation (3):

y_{t}^{'} = y_{t} + \frac{y_{t - 1} - {y_{t - 1}}^{'} + y_{t + 1} - {y_{t + 1}}^{'}}{2} = \frac{3 y_{t - 1} + 3 y_{t + 1} - y_{t - 2} - y_{t + 2}}{4} .

(3)
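As a minimal sketch of Equations (2) and (3), the following Python functions compute Lerp and R2-Lerp for unit-spaced sampling points. The function names are illustrative only, and the stepwise variant assumes that y′_t₋₁ and y′_t₊₁ are the Lerp re-estimates of the two neighbours obtained from y_t₋₂, y_t, and y_t₊₂; under that assumption it reproduces the closed form on the right-hand side of Equation (3).

```python
def lerp(x1, y1, x3, y3, x2):
    """Equation (2): linear interpolation at x2 between (x1, y1) and (x3, y3)."""
    return ((x3 - x2) * y1 + (x2 - x1) * y3) / (x3 - x1)


def r2_lerp(y_m2, y_m1, y_p1, y_p2):
    """Equation (3), closed form, for unit-spaced samples t-2, t-1, t+1, t+2."""
    return (3 * y_m1 + 3 * y_p1 - y_m2 - y_p2) / 4


def r2_lerp_stepwise(y_m2, y_m1, y_p1, y_p2):
    """Equation (3), left-hand form: Lerp estimate plus the mean neighbour residual.

    Assumption: the primed neighbour values are Lerp re-estimates that use the
    Lerp estimate y_t of the tampered point.
    """
    y_t = (y_m1 + y_p1) / 2        # Lerp estimate of the tampered point
    y_m1_hat = (y_m2 + y_t) / 2    # Lerp re-estimate of the left neighbour
    y_p1_hat = (y_t + y_p2) / 2    # Lerp re-estimate of the right neighbour
    return y_t + ((y_m1 - y_m1_hat) + (y_p1 - y_p1_hat)) / 2
```

Both R2-Lerp forms return the same value for any four neighbours, and on a straight line they return the exact missing sample, as Lerp does.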

3. The Proposed Scheme

Figure 5 shows the flowchart of the encrypted speech integrity authentication method. The proposed method is mainly composed of four parts: speech encryption, a 2-LSB-based dual fragile watermarking method, integrity authentication, and tamper recovery. In this section, we will introduce the specific methods for these four parts.

The proposed method can be used for tampering detection and tampering recovery of important speech (involving economics, judiciary, etc.). Important speech may face malicious tampering that renders it unusable or misleading alterations during its use, often resulting in serious consequences such as economic losses. The proposed method can detect such malicious tampering, accurately locate the tampered sampling points even under high tampering ratios, and perform high-quality recovery, ensuring the normal use of important speech.

For example, there is an important speech of 60 s length that needs to be stored in the cloud. This speech is encrypted and stored in the cloud using the proposed algorithm as a security measure. Later, we suppose this encrypted speech in the cloud is attacked and the segment from 15 s to 45 s is deleted. When this speech is needed, it is downloaded from the cloud, and integrity authentication is performed by the proposed tampering detection algorithm. This method can detect that the speech has been subjected to the deletion attack and that the deleted sampling points are from 15 s to 45 s. Subsequently, the tampering recovery algorithm is used to restore the deleted sampling points. In this way, the user can obtain a speech that is semantically consistent with the original speech and has comparable auditory quality. This recovered important speech still meets relevant requirements.

3.1. The Speech Encryption Algorithm

The tampering recovery method in this paper requires that the encryption algorithm used should not affect the normal decryption of other sampling points when some encrypted speech sampling points are tampered. The proposed tampering recovery method requires several rounds of scrambling encryption to satisfy the requirements of correlation between the original speech sampling points.

Even after partial tampering of the encrypted speech, the untampered portions can still be decrypted.

Step 1: Speech division. The speech S with length L is cut into non-overlapping frames of length LF, giving LN = L/LF frames, and the speech matrix PS with dimensions LN × LF is generated. If the last frame of PS is shorter than LF, it is zero-padded.

Step 2: Chaotic sequences generation. After setting the initial values of key K and the parameters ρ, the chaotic mapping in Section 2.1 is used to generate the chaotic sequences W₁, W₂, W₃, and W₄. The elements of the chaotic sequences are sorted in ascending order, and the indexes WI₁, WI₂, WI₃, and WI₄ corresponding to the original position of each element after sorting are used as the new chaotic sequences.

Step 3: Scrambling encryption. After getting LN + LF − 1 groups of diagonal elements from (LN, 1) to (1, LF) of PS and dividing WI₁ into LN + LF − 1 groups based on the data size of each group, the elements in each group are sorted in ascending order to obtain the index sequence matrix I₁. Each group of diagonal elements is scrambled by I₁, and the scrambled diagonal elements are recombined to form matrix PS′₁. Then, in order to achieve sufficient scrambling of the sampling points near (LN, 1) and (1, LF), the above process in Step 3 is repeated using WI₂ from (LN, LF) to (1, 1) of PS′₁, and the encrypted speech matrix PS″₁ is obtained.

Step 4: Second-round scrambling encryption. To further enhance the global randomness of encrypted speech sampling points, WI₃ is divided evenly into LF groups, and the elements in each group are sorted in ascending order to obtain the index sequence matrix I₃. The elements of each column of PS″₁ are scrambled by I₃ to obtain matrix PS′₂. Then, the above process in Step 4 is repeated using WI₄ from (LN, LF) to (1, 1) of PS′₂, and matrix PS″₂ with dimensions LN × LF is obtained.

Step 5: Reconstructing the speech. PS₂″ is converted into a matrix of dimensions L × 1, and then, PS₂″ is reconstructed into the encrypted speech.

Step 6: Speech decryption. The decryption process is the inverse of the encryption process.
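The core idea of Steps 2, 3, and 6 can be sketched in a simplified form: a chaotic sequence is sorted, the resulting index sequence is used as a permutation, and decryption applies the inverse permutation. The sketch below is not the diagonal and column grouping described above, and it uses a logistic map as a stand-in key stream rather than the Lorenz system of Section 2.1; all function names are hypothetical. It only illustrates why tampering with one encrypted sampling point leaves the decryption of all others intact.

```python
def chaotic_sequence(key, n, r=3.99):
    """Stand-in key stream (logistic map), NOT the Lorenz system of Section 2.1."""
    x, out = key, []
    for _ in range(n):
        x = r * x * (1.0 - x)
        out.append(x)
    return out


def sort_index(seq):
    """Original positions of the elements after an ascending sort (cf. Step 2)."""
    return sorted(range(len(seq)), key=seq.__getitem__)


def scramble(samples, perm):
    """Position-only encryption: output slot k takes the sample at perm[k]."""
    return [samples[p] for p in perm]


def unscramble(scrambled, perm):
    """Inverse permutation: put each scrambled sample back at perm[k]."""
    out = [None] * len(perm)
    for k, p in enumerate(perm):
        out[p] = scrambled[k]
    return out


# Usage: tamper with one encrypted sample and decrypt.
samples = list(range(16))
perm = sort_index(chaotic_sequence(0.3141, len(samples)))
encrypted = scramble(samples, perm)
encrypted[5] = 999                      # simulated tampering
decrypted = unscramble(encrypted, perm)
changed = [k for k in range(16) if decrypted[k] != samples[k]]
# Only the single position perm[5] differs from the original.
```

Because scrambling only moves sampling points and never mixes their values, a tampered run in the encrypted speech is dispersed into isolated positions after decryption, which is the property the recovery model relies on.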

3.2. 2-LSB-Based Dual Fragile Watermarking Method

The proposed watermarking method consists of two parts: watermarking generation and embedding, and watermarking extraction, with the watermarking extraction taking place during the integrity authentication phase.

3.2.1. Watermarking Generation and Embedding

In this paper's speech library, the SNR between the speech truncated to three decimal places and the original speech is more than 45 dB, which means that modifying the fourth decimal place of each sampling point to embed watermarking has little impact on the speech quality.

Step 1: Frame number watermarking generation. Extract the frame number f₁ of the f-th frame from PS₂″ and calculate the difference between LN and f₁ to construct the frame number f₂. Arrange the digits of f₁ and f₂, from the thousands place to the ones place, into the 8 × 1 frame number matrix w_f, padding a frame number less than 1000 with leading zeros. Then convert each digit of w_f into 4-bit binary to construct the 32 × 1 frame number watermarking w_ff.

Step 2: Synchronous watermarking construction. Take LF = 128 as an example. Extract the fourth decimal place of the last 96 sampling points of f-th frame and convert it into a 4-bit binary matrix spb and extract the third dimension of spb to construct a matrix spb′ of 96 × 1. Taking two sampling points sp_i and sp_i₊₁ as a group, construct 96 sampling points with the synchronous watermarking w_s of 96 × 2. Construct the watermarking w_s as in Equation (4).

\begin{cases} w_{s}(i, 1) = spb(i, 1) \\ w_{s}(i, 2) = spb(i - 1, 1) \\ w_{s}(i + 1, 2) = spb(i, 1) \\ w_{s}(i + 1, 1) = w_{s}(i + 1, 2) \oplus 1 \end{cases} .

(4)

Step 3: Watermarking embedding. Extract the fourth decimal place of the sampling points in f-th frame and convert it into a 4-bit binary matrix bm. Replace the third dimension bm₃ and the fourth dimension bm₄ of bm with the watermarking to embed the watermarking. The watermarking is embedded in Equations (5) and (6). Convert bm′ back into the fourth decimal place of the sampling points in f-th frame. If the value of the fourth decimal place is 10 or 11, modify it to 6 or 7.

bm_{3}'(i) = \begin{cases} w_{ff}(i), & 1 \leq i \leq 32 \\ w_{s}(i, 1), & 33 \leq i \leq 128 \end{cases},

(5)

bm_{4}'(i) = \begin{cases} w_{ff}(i), & 1 \leq i \leq 32 \\ w_{s}(i, 2), & 33 \leq i \leq 128 \end{cases} .

(6)

Step 4: Reconstruct the watermarked speech. Repeat the above process for all frames of PS₂″ to construct the PSW and then reconstruct the watermarked encrypted speech WE from PSW using Step 5 in Section 3.1.
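Steps 1 and 3 above can be illustrated with a short sketch. It assumes that the 4-bit binary form of a digit is written most significant bit first, so that the third and fourth dimensions are the two least significant bits, and that frame numbers have at most four digits; the function names are hypothetical. The sketch also shows why mapping 10 or 11 to 6 or 7 is safe: subtracting 4 changes only the second bit and leaves the two embedded bits unchanged.

```python
def frame_number_watermark(f1, total_frames):
    """32-bit frame number watermarking w_ff for frame f1 of total_frames (Step 1)."""
    f2 = total_frames - f1
    digits = [int(c) for c in f"{f1:04d}{f2:04d}"]        # w_f: 8 decimal digits
    return [int(b) for d in digits for b in f"{d:04b}"]   # w_ff: 4 bits per digit


def embed_2lsb(digit, bit3, bit4):
    """Replace the two low bits of a fourth-decimal-place digit (Step 3)."""
    value = (digit & 0b1100) | (bit3 << 1) | bit4
    if value >= 10:   # 10 or 11 is not a decimal digit,
        value -= 4    # so map it to 6 or 7; the two low bits are preserved
    return value


def extract_2lsb(digit):
    """Read back the two embedded bits from a watermarked digit."""
    return (digit >> 1) & 1, digit & 1
```

For example, embedding the bits (1, 0) into the digit 9 first gives the binary value 1010 (decimal 10), which is then stored as 6 (binary 0110); extraction still returns (1, 0).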

3.2.2. Watermarking Extraction

The extracted watermarking in this paper mainly includes tampering detection watermarking and tampering localization watermarking. The tampering detection watermarking is used to detect whether the speech to be authenticated has been tampered and to identify the type of tamper; whereas, the tampering localization watermarking is used to locate the tampered sampling points. The tampering localization watermarking is an LF × 2 matrix extracted from the authenticated speech, which includes frame number watermarking and synchronization watermarking embedded during the watermarking embedding process. The tampering detection watermark is an LF × 1 matrix, obtained by performing XOR on the 2 bits of watermarking information in each dimension of the tampering localization watermark.

Step 1: Original tampering detection watermarking construction. The original tampering detection watermarking is constructed as shown in Figure 6, and the length of watermarking is 128 bits (equal to the value of LF), in which the first 32 bits are set to 0 and the remaining 96 bits are the alternations of null and 1 (the first 32 bits are the XOR of the embedded frame number watermarking, and the remaining 96 bits are the XOR of the corresponding position of the synchronized watermarking). The empty positions are skipped during the tampering detection.

Step 2: Watermarking extraction. The encrypted speech to be authenticated WE′ constructs WEM according to Step 1 in Section 3.1. The fourth decimal place of the WEM sampling points is extracted and converted into a 4-bit binary matrix bm″. The third dimension bm₃″ and the fourth dimension bm₄″ of bm″ are extracted to construct the tampering localization watermarking w_l, w_l = [w_lf, w_ls], in which w_lf is the extracted frame number watermarking and w_ls is the extracted synchronous watermarking. XOR is performed with the bm₃″ and bm₄″ to construct the tampering detection watermarking w_d.

3.3. Integrity Authentication

Integrity authentication includes two parts: tampering detection and tampering localization. The tampering detection method can determine whether the authenticated speech has been tampered, as well as the type of attack. If tampering is detected, the tampering localization process is used to identify the tampered sampling points.

3.3.1. Tampering Detection

Step 1: The frame tampering detection. After receiving the WE′ of length L′, construct the tampering detection watermarking according to Step 1 in Section 3.2.2 and extract the tampering detection watermarking according to Step 2 in Section 3.2.2. Measure the similarity between them by the normalized Hamming distance and the sliding window. The sliding window can detect tampered frames even after a desynchronization attack. The number of untampered frames is n_f.

Step 2: Determine the length of the original speech. Select an authenticated frame, extract the frame number watermarking from the tampering localization watermarking, and convert the 32-bit frame number watermarking, in groups of 4 bits, into an 8 × 1 digit matrix. Reassemble each group of four digits, from the thousands place to the ones place, into the frame numbers f′₁ and f′₂, and determine the original speech length L_o from f′₁ and f′₂ as in Equation (7), where n′_f is the number of original speech frames.

L_{o} = 128 {n^{'}}_{f} = 128 (f_{1}^{'} + f_{2}^{'}) .

(7)

Step 3: Judgment of tampering attack. If L′ = L_o and n′_f = n_f, WE′ has not been attacked; if L′ = L_o and n′_f > n_f, WE′ has suffered a mute or substitution attack; if L′ > L_o, WE′ has suffered an insertion attack; if L′ < L_o, WE′ has suffered a deletion attack.
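The three steps above can be sketched as follows. This is a simplified illustration with hypothetical function names: the sliding window is omitted, the reference watermarking is assumed to start its alternation with the null position, and null positions are represented by None and skipped in the distance.

```python
def reference_detection_watermark(frame_len=128):
    """Step 1 of Section 3.2.2: 32 zeros, then alternating null (None) and 1.

    Assumption: the alternation starts with the null position.
    """
    return [0] * 32 + [None, 1] * ((frame_len - 32) // 2)


def normalized_hamming(a, b):
    """Fraction of compared positions that differ; None positions are skipped."""
    pairs = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    return sum(p != q for p, q in pairs) / len(pairs)


def decode_original_length(w_lf, frame_len=128):
    """Equation (7): L_o = frame_len * (f1' + f2') from a 32-bit watermarking."""
    digits = [int("".join(str(b) for b in w_lf[k:k + 4]), 2) for k in range(0, 32, 4)]
    f1 = sum(d * 10 ** (3 - n) for n, d in enumerate(digits[:4]))
    f2 = sum(d * 10 ** (3 - n) for n, d in enumerate(digits[4:]))
    return frame_len * (f1 + f2)


def judge_attack(received_len, original_len, original_frames, untampered_frames):
    """Step 3: classify the attack from L', L_o, n'_f and n_f."""
    if received_len > original_len:
        return "insertion"
    if received_len < original_len:
        return "deletion"
    if original_frames > untampered_frames:
        return "mute or substitution"
    return "not attacked"
```

A frame whose extracted tampering detection watermarking has distance 0 from the reference is treated as untampered; the threshold used for nonzero distances is not specified here and would have to be chosen separately.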

3.3.2. Tampering Localization

Step 1: The watermarking of WE′ is extracted and the watermarking of the original speech is reconstructed. When WE′ fails authentication, the frame numbers sf and ef of the untampered frames at both ends of the unauthenticated frame t_f are extracted according to Step 2 in Section 3.3.1. The first frame of the tampered sampling point is sf′ = sf + 1 and the last frame of the tampered sampling point is ef′ = ef − 1. The frame number watermarking w′_ff and w″_ff of sf′ and ef′ are constructed according to Step 1 in Section 3.2.1, and the tampering localization watermarking w′_l₁ and w′_l₂ of sf′ and ef′ are extracted according to Step 2 in Section 3.2.2 in which w′_lf and w″_lf are the extracted frame number watermarking and w′_ls and w″_ls are the extracted synchronous watermarking.

Step 2: Tampering localization. Tampering is localized by similarity measurement between w′_lf and w′_ff and between w″_lf and w″_ff, and by checking w′_ls and w″_ls against Equation (8). The tampered start position x_sp and the tampered end position x_ep are located so that the tampered area can be determined as [x_sp, x_ep]. Tampering localization is then performed in the same way on all unauthenticated parts of WE′.

\begin{cases} w_{ls}'(i, 1) = w_{ls}'(i + 1, 2) \\ w_{ls}'(i, 2) = w_{ls}'(i - 1, 1) \\ w_{ls}'(i + 1, 1) = w_{ls}'(i + 1, 2) \oplus 1 \end{cases} .

(8)

Step 3: Tamper localization of deletion attack. For a deletion attack, tampering localization is performed on frames sf and ef as well as sf′ and ef′ according to Step 2.

3.4. Tampering Recovery Method

This paper proposes a speech self-recovery model that aims to recover tampered sampling points sequentially. Based on the correlation of the 5 sampling points x_i₋₂, x_i₋₁, x_i₊₁, x_i₊₂, and x_i₊₃ around the tampered sampling point x_i, combined with the tampering situation of these 5 sampling points, the tampering recovery of the tampered sampling point x_i is classified into 11 scenarios.

3.4.1. Speech Self-Recovery Model Based on R2-Lerp

The proposed tampering recovery scenarios are shown in Figure 7. To minimize the impact of the destroyed sampling points on the tampering recovery of x_i and to avoid scenarios in which both sampling points x_i₋₂ and x_i₋₁ are destroyed, the method in this paper does not recover runs of four or more consecutively destroyed sampling points in the first recovery process.

Scenario 1: When the sampling points (x_i₋₂, y_i₋₂), (x_i₋₁, y_i₋₁), (x_i₊₁, y_i₊₁), and (x_i₊₂, y_i₊₂) are untampered, the approximate value y_i of x_i is calculated by R2-Lerp through y_i₋₂, y_i₋₁, y_i₊₁ and y_i₊₂ as in Equation (9).

y_{i} = \frac{3 y_{i - 1} + 3 y_{i + 1} - y_{i - 2} - y_{i + 2}}{4} .

(9)

Scenario 2: When the sampling points (x_i₊₂, y_i₊₂) are tampered, the approximate value y_i₊₂′ of x_i₊₂ is calculated by Lerp through y_i₊₁ and y_i₊₃, and the approximate value y_i of x_i is calculated by R2-Lerp through y_i₋₂, y_i₋₁, y_i₊₁ and y_i₊₂′ as in Equation (10).

\begin{aligned} y_{i} &= \frac{3 y_{i - 1} + 3 y_{i + 1} - y_{i - 2} - y_{i + 2}'}{4} \\ &= \frac{3 y_{i - 1} + 3 y_{i + 1} - y_{i - 2} - \frac{y_{i + 1} + y_{i + 3}}{2}}{4} = \frac{3 y_{i - 1}}{4} + \frac{5 y_{i + 1}}{8} - \frac{y_{i - 2}}{4} - \frac{y_{i + 3}}{8} . \end{aligned}

(10)

Scenario 3: When the sampling points (x_i₊₂, y_i₊₂) and (x_i₊₃, y_i₊₃) are tampered, the approximate value y_i of x_i is calculated by Lerp through y_i₋₁ and y_i₊₁ as in Equation (11).

y_{i} = \frac{y_{i - 1} + y_{i + 1}}{2} .

(11)

Scenario 4: When the sampling points (x_i₊₁, y_i₊₁) are tampered, the approximate value y_i₊₁′ of x_i₊₁ is calculated by Lerp through y_i and y_i₊₂, and the approximate value y_i of x_i is calculated by R2-Lerp through y_i₋₂, y_i₋₁, y_i₊₁′ and y_i₊₂ as in Equation (12).

\begin{aligned} y_{i} &= \frac{3 y_{i - 1} + 3 y_{i + 1}' - y_{i - 2} - y_{i + 2}}{4} \\ &= \frac{3 y_{i - 1} + 3 \frac{y_{i} + y_{i + 2}}{2} - y_{i - 2} - y_{i + 2}}{4} = \frac{6 y_{i - 1}}{5} - \frac{2 y_{i - 2}}{5} + \frac{y_{i + 2}}{5} . \end{aligned}

(12)

Scenario 5: When the sampling points (x_i₊₁, y_i₊₁) and (x_i₊₂, y_i₊₂) are tampered, the approximate value y_i₊₁′ of x_i₊₁ is calculated by Lerp through y_i and y_i₊₂′, and the approximate value y_i₊₂′ of x_i₊₂ is calculated by Lerp through y_i₊₁′ and y_i₊₃; solving these two relations gives y_i₊₁′ = (2y_i + y_i₊₃)/3 and y_i₊₂′ = (y_i + 2y_i₊₃)/3. Then, the approximate value y_i of x_i is calculated by R2-Lerp through y_i₋₂, y_i₋₁, y_i₊₁′ and y_i₊₂′ as in Equation (13).

\begin{aligned} y_{i} &= \frac{3 y_{i - 1} + 3 y_{i + 1}' - y_{i - 2} - y_{i + 2}'}{4} \\ &= \frac{3 y_{i - 1} + 3 \frac{2 y_{i} + y_{i + 3}}{3} - y_{i - 2} - \frac{y_{i} + 2 y_{i + 3}}{3}}{4} = \frac{9 y_{i - 1}}{7} - \frac{3 y_{i - 2}}{7} + \frac{y_{i + 3}}{7} . \end{aligned}

(13)

Scenario 6: When the sampling points (x_i₊₁, y_i₊₁), (x_i₊₂, y_i₊₂), and (x_i₊₃, y_i₊₃) are tampered, the first tamper recovery is abandoned. Then, the second tampering recovery is carried out at the positions of four or more consecutively tampered sampling points. The approximate value y is calculated through (x_i₋₁, y_i₋₁) and (x_i₊_nt, y_i₊_nt) as in Equation (14), in which nt is the number of tampered sampling points, x_i₋₁ and x_i₊_nt are the untampered sampling points.

y = \frac{y_{i - 1} - y_{i + nt}}{x_{i - 1} - x_{i + nt}} x + \frac{y_{i + nt} x_{i - 1} - y_{i - 1} x_{i + nt}}{x_{i - 1} - x_{i + nt}} .

(14)

Scenario 7: When the sampling points (x_i₋₂, y_i₋₂), (x_i₊₂, y_i₊₂), and (x_i₊₃, y_i₊₃) are tampered, tampering recovery uses the recovery method of Scenario 3.

Scenario 8: When the sampling points (x_i₋₂, y_i₋₂) and (x_i₊₁, y_i₊₁) are tampered, the approximate value y_i₊₁′ of x_i₊₁ is calculated by R2-Lerp through y_i₋₁, y_i, y_i₊₂ and y_i₊₃, and the approximate value y_i of x_i is calculated by Lerp through y_i₋₁ and y_i₊₁′ as in Equation (15).

\begin{aligned} y_{i} &= \frac{y_{i - 1} + y_{i + 1}^{\prime}}{2} \\ &= \frac{y_{i - 1} + \frac{3 y_{i} + 3 y_{i + 2} - y_{i - 1} - y_{i + 3}}{4}}{2} = \frac{3 y_{i - 1}}{5} + \frac{3 y_{i + 2}}{5} - \frac{y_{i + 3}}{5} . \end{aligned}

(15)

Scenario 9: When the sampling points (x_i₋₂, y_i₋₂), (x_i₊₁, y_i₊₁), and (x_i₊₃, y_i₊₃) are tampered, the approximate value y_i₊₁′ of x_i₊₁ is calculated by Lerp through y_i and y_i₊₂, and the approximate value y_i of x_i is calculated by Lerp through y_i₋₁ and y_i₊₁′ as in Equation (16).

\begin{aligned} y_{i} &= \frac{y_{i - 1} + y_{i + 1}^{\prime}}{2} \\ &= \frac{y_{i - 1} + \frac{y_{i} + y_{i + 2}}{2}}{2} = \frac{2 y_{i - 1}}{3} + \frac{y_{i + 2}}{3} . \end{aligned}

(16)

Scenario 10: When the sampling points (x_i₋₂, y_i₋₂), (x_i₊₁, y_i₊₁), and (x_i₊₂, y_i₊₂) are tampered, the approximate value y_i₊₂′ of x_i₊₂ is calculated by Lerp through y_i₊₁′ and y_i₊₃, and the approximate value y_i₊₁′ of x_i₊₁ is calculated by Lerp through y_i and y_i₊₂′; then, the approximate value y_i of x_i is calculated by Lerp through y_i₋₁ and y_i₊₁′ as in Equation (17).

\begin{aligned} y_{i} &= \frac{y_{i - 1} + y_{i + 1}^{\prime}}{2} = \frac{y_{i - 1} + \frac{y_{i} + y_{i + 2}^{\prime}}{2}}{2} \\ &= \frac{y_{i - 1} + \frac{y_{i} + \frac{y_{i + 1}^{\prime} + y_{i + 3}}{2}}{2}}{2} = \frac{3 y_{i - 1}}{4} + \frac{y_{i + 3}}{4} . \end{aligned}

(17)

Scenario 11: When the sampling points (x_i₋₂, y_i₋₂), (x_i₊₁, y_i₊₁), (x_i₊₂, y_i₊₂) and (x_i₊₃, y_i₊₃) are tampered, tamper recovery uses the recovery method of Scenario 6.
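The closed forms in Equations (12), (13), and (15)–(17) arise because y_i appears on both sides once the primed neighbours are substituted, so each scenario reduces to a linear equation in y_i. The following Python sketch checks the coefficients with exact rational arithmetic. It is an illustration only, not the authors' implementation: the helper names (`sym`, `lin`, `lerp`, `r2_lerp`, `solve_center`) are hypothetical, and samples are keyed by their offset from i. For Equations (13) and (17), y_i₊₁′ = (2y_i + y_i₊₃)/3 and y_i₊₂′ = (y_i + 2y_i₊₃)/3 are assumed, which is the solution of the two coupled Lerp relations between y_i and y_i₊₃.

```python
from fractions import Fraction


def sym(offset):
    """Linear form representing the sample at index i + offset."""
    return {offset: Fraction(1)}


def lin(*terms):
    """Sum of coeff * form over (coeff, form) pairs."""
    out = {}
    for coeff, form in terms:
        for key, value in form.items():
            out[key] = out.get(key, Fraction(0)) + Fraction(coeff) * value
    return out


def lerp(a, b):
    """Midpoint interpolation between two neighbouring values."""
    return lin((Fraction(1, 2), a), (Fraction(1, 2), b))


def r2_lerp(m2, m1, p1, p2):
    """Four-point form used in the text: (3*m1 + 3*p1 - m2 - p2) / 4."""
    return lin((Fraction(3, 4), m1), (Fraction(3, 4), p1),
               (Fraction(-1, 4), m2), (Fraction(-1, 4), p2))


def solve_center(rhs):
    """Solve y_i = rhs for y_i when rhs itself contains a y_i term."""
    b = rhs.get(0, Fraction(0))
    return {k: v / (1 - b) for k, v in rhs.items() if k != 0}


# Equation (12): y'_{i+1} by Lerp, then y_i by R2-Lerp.
eq12 = solve_center(r2_lerp(sym(-2), sym(-1), lerp(sym(0), sym(2)), sym(2)))

# Equations (13) and (17): primed values between y_i and y_{i+3} (assumed).
y1p = lin((Fraction(2, 3), sym(0)), (Fraction(1, 3), sym(3)))
y2p = lin((Fraction(1, 3), sym(0)), (Fraction(2, 3), sym(3)))
eq13 = solve_center(r2_lerp(sym(-2), sym(-1), y1p, y2p))
eq17 = solve_center(lerp(sym(-1), y1p))

# Equation (15): y'_{i+1} by R2-Lerp, then y_i by Lerp.
eq15 = solve_center(lerp(sym(-1), r2_lerp(sym(-1), sym(0), sym(2), sym(3))))

# Equation (16): y'_{i+1} by Lerp, then y_i by Lerp.
eq16 = solve_center(lerp(sym(-1), lerp(sym(0), sym(2))))

for name, eq in [("12", eq12), ("13", eq13), ("15", eq15),
                 ("16", eq16), ("17", eq17)]:
    print(name, {k: str(v) for k, v in sorted(eq.items())})
```

Each printed dictionary maps a sample offset to its coefficient, so the entry for Equation (12) corresponds to 6/5 on y_i₋₁, −2/5 on y_i₋₂, and 1/5 on y_i₊₂.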

3.4.2. Tampering Recovery

If the frames at both ends of the decrypted speech are tampered, the tampered sampling points are set to zero because the beginning and ending of the speech often do not contain any information, and setting them to zero does not affect the semantics of the original speech.

Step 1: First tampering recovery. Based on the tampering situation of the surrounding sampling points, the tampered sampling points are divided into different scenarios and recovered according to the methods in Section 3.4.1.

Step 2: Second tampering recovery. After the first tampering recovery, the continuously tampered sampling points are sequentially detected, the second tampering recovery is performed using the recovery method in Scenario 6 of Section 3.4.1, and the speech after tampering recovery is reconstructed.
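The second tampering recovery can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' MATLAB code: the function name `recover_runs` and the boolean `tampered` mask are hypothetical, sample indices serve as the x coordinates, and any run that touches the start or end of the signal is simply zeroed as a simplification of the rule for tampered end frames. Each remaining run is filled from the straight line through its two untampered neighbours, which is what Equation (14) describes.

```python
def recover_runs(samples, tampered):
    """Fill each run of tampered samples by linear interpolation.

    samples  -- list of sample values (tampered entries are ignored)
    tampered -- list of booleans, True where a sample is flagged as tampered
    """
    out = list(samples)
    n = len(out)
    k = 0
    while k < n:
        if not tampered[k]:
            k += 1
            continue
        start = k
        while k < n and tampered[k]:
            k += 1
        left, right = start - 1, k  # nearest untampered neighbours
        if left < 0 or right >= n:
            # Assumption: runs touching either end of the speech are zeroed.
            for j in range(start, k):
                out[j] = 0.0
            continue
        slope = (out[right] - out[left]) / (right - left)
        for j in range(start, k):
            out[j] = out[left] + slope * (j - left)
    return out


recovered = recover_runs(
    [0.0, 9.0, 9.0, 9.0, 9.0, 5.0],
    [False, True, True, True, True, False],
)
print(recovered)
```

In the paper this step is applied only to the samples that remain unrecovered after the first pass (runs of four or more); the sketch accepts runs of any length so that it stays short.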

4. Experimental Results

This section presents the experimental results and analysis of the proposed encrypted speech integrity authentication method. Section 4.1 provides the payload capacity of the proposed method, Section 4.2 offers the transparency analysis, Section 4.3 showcases the accuracy of tampering detection, Section 4.4 demonstrates the effectiveness of tampering recovery through several metrics, and Section 4.5 presents an analysis of the proposed encryption method.

The experiments were conducted in MATLAB R2022a on a personal computer with an Intel(R) Core(TM) i7-12700H CPU @ 2.30 GHz and 16 GB of RAM, running Windows 11. The test speech is taken from the THCHS-30 speech library [29]; each file is a single-channel WAV file sampled at 16 kHz with 16-bit resolution. The parameters used in this experiment are as follows: K = (w₁(0) = 1.1, w₂(0) = 2.2, w₃(0) = 3.3, w₄(0) = 4.4), ρ = −1, LF = 128.

4.1. Payload Capacity

The payload capacity (Cap) is the maximum amount of data that the audio watermarking method can embed into a carrier speech. Cap is defined as in Equation (18) [24], in which t is the duration of the watermarked speech, v is the number of bits embedded in each frame of speech, and N is the number of frames during the watermarking embedding phase. In this paper, t = 20 s, v = 256 bit, and N = 2500, so Cap = 32,000 bit/s. At the 16 kHz sampling rate, this means that each sampling point embeds 2 bits of information.

\mathrm{Cap} = \frac{v N}{t} .

(18)
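As a quick arithmetic check of Equation (18), the short Python sketch below reproduces the reported capacity from the stated parameters. The function name `payload_capacity` is hypothetical; the 16 kHz sampling rate and the frame length LF = 128 are the experimental settings given at the start of this section.

```python
def payload_capacity(bits_per_frame, num_frames, duration_s):
    """Equation (18): embedded bits per second of carrier speech."""
    return bits_per_frame * num_frames / duration_s


SAMPLE_RATE_HZ = 16_000  # sampling rate of the test speech
FRAME_LENGTH = 128       # LF, samples per frame

cap = payload_capacity(bits_per_frame=256, num_frames=2500, duration_s=20)
bits_per_sample = cap / SAMPLE_RATE_HZ
total_samples = 2500 * FRAME_LENGTH

print(cap)              # 32000.0 bit/s
print(bits_per_sample)  # 2.0 bits per sampling point
print(total_samples)    # 320000 samples, i.e. 20 s at 16 kHz
```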

4.2. Inaudibility

Inaudibility refers to the degree of change between the watermarked speech and the original speech: the better the inaudibility, the less noticeable the watermarking. This paper tests the transparency of the watermarking in terms of the statistical characteristics and the auditory quality of the watermarked speech, using the signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ) [24]. SNR is calculated as in Equation (19) [24], in which L is the length of the speech, l is the index of the sampling point, x is the original speech, and y is the watermarked speech. In addition, to directly observe the changes in the original speech after embedding the watermarking, the waveforms and spectrograms of the original speech and watermarked speech are shown in Figure 8.

\mathrm{SNR} = 10 \log_{10} \frac{\sum_{l = 1}^{L} x^{2} (l)}{\sum_{l = 1}^{L} {(x (l) - y (l))}^{2}} .

(19)
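Equation (19) translates directly into code. The sketch below is a plain-Python illustration (the experiments themselves were run in MATLAB); the function name `snr_db` and the handling of the degenerate cases are assumptions made here.

```python
import math


def snr_db(original, processed):
    """Equation (19): SNR in dB between two equally long signals."""
    if len(original) != len(processed):
        raise ValueError("signals must have the same length")
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((x - y) ** 2 for x, y in zip(original, processed))
    if signal_energy == 0:
        raise ValueError("original signal has zero energy")
    if noise_energy == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy example: one sample out of four is perturbed by 0.1.
value = snr_db([1.0, -1.0, 1.0, -1.0], [0.9, -1.0, 1.0, -1.0])
print(round(value, 2))
```

The same function applies to the recovery experiments in Section 4.4, with the recovered speech taking the place of the watermarked speech.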

The SNR of the proposed method is 55.8897 dB, and the PESQ is 4.481. This is because our method only modifies the fourth decimal place of the sampling points, and the difference between the watermarked sampling point and the original sampling point is in the range of 0~0.001, which has little impact on the carrier speech sampling points. As shown in Figure 8, the speech waveform does not change noticeably, and the changes in the spectrogram are small. Therefore, the proposed fragile watermarking method shows high inaudibility, and the watermarked speech shows minimal differences in auditory quality compared to the original speech, making it difficult for the human ear to perceive any anomalies.

4.3. Tampering Detection Analysis

To test the tampering detection accuracy, this experiment tampers with 10 s of watermarked encrypted speech using ordinary attacks, such as mute and substitution, and desynchronization attacks, such as insertion and deletion. The tampered range is 20% of the test speech, so the number of tampered sampling points is 32,000.

4.3.1. Ordinary Attack

Mute attack realizes tampering by zeroing part of the target speech. In order to simulate the mute attack, the sampling points of the test speech [12345, 44344] were zeroed, and the speech waveform of the mute attack is shown in Figure 9a.

The tampering detection result of our method is shown in Figure 9b. The tampering detection method detected that the length of the authenticated speech was the same as the length of the test speech and located that 32,002 sampling points of the test speech [12345, 44346] were tampered, producing an error of 2 sampling points compared to the actual tampering situation.

The substitution attack realizes tampering by replacing part of the target speech with noise or irrelevant speech sampling points. In order to simulate the substitution attack, the sampling points of the test speech [123456, 155455] were replaced with sampling points from random speech signals in the speech library. The speech waveform of the substitution attack is shown in Figure 10a.

The tampering detection result of our method is shown in Figure 10b. The tampering detection method detected that the length of the authenticated speech was the same as the length of the test speech and located that 31,998 sampling points of the test speech [123457, 155454] were tampered, producing an error of 2 sampling points compared to the actual tampering situation.
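The reported localization errors can be reproduced from the interval endpoints alone. The paper does not state how the error is counted; the sketch below assumes it is the number of sampling points on which the actual and detected intervals disagree (their symmetric difference), with inclusive endpoints. The function names are hypothetical.

```python
def interval_size(first, last):
    """Number of sampling points in the inclusive interval [first, last]."""
    return max(0, last - first + 1)


def localization_error(actual, detected):
    """Points flagged wrongly plus points missed (symmetric difference)."""
    overlap = interval_size(max(actual[0], detected[0]),
                            min(actual[1], detected[1]))
    return interval_size(*actual) + interval_size(*detected) - 2 * overlap


mute_error = localization_error((12345, 44344), (12345, 44346))
substitution_error = localization_error((123456, 155455), (123457, 155454))
print(mute_error, substitution_error)  # 2 2
```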

4.3.2. Desynchronization Attack

The insertion attack realizes tampering by inserting noise or irrelevant speech in the target speech. In order to simulate the insertion attack, 32,000 sampling points from random speech signals in the speech library were inserted after the 123456-th sampling point of the test speech. The speech waveform of the insertion attack is shown in Figure 11a.

The tampering detection result of our method is shown in Figure 11b. The tampering detection method detected that the authenticated speech was longer than the test speech and located that 32,000 sampling points of the test speech [123457, 155456] were the inserted sampling points, producing an error of 2 sampling points compared to the actual tampering situation.

The deletion attack realizes tampering by deleting the sampling points from the target speech. In order to simulate the deletion attack, 32,000 sampling points of the test speech [12345, 44344] were deleted, and the speech waveform of the deletion attack is shown in Figure 12a.

The tampering detection result of our method is shown in Figure 12b. The tampering detection method detected that the authenticated speech was shorter than the test speech and located that 32,000 sampling points of the test speech [12345, 44344] were deleted.

As shown in Figures 9–12, our method can detect ordinary speech tampering attacks, such as mute and substitution, and desynchronization attacks, such as insertion and deletion. The tampering detection errors mainly occur because some tampered sampling points happen to satisfy the correlations constructed with the surrounding sampling points. However, this situation is accidental, and the detection errors have little impact on the subsequent tampering recovery. Compared with tampering localization methods that are accurate only to the speech frame, our method localizes tampering at the level of individual sampling points with high accuracy, which improves recovery quality in the subsequent tampering recovery process. Additionally, the proposed tampering detection method does not require any information about the original speech, such as length or compression information, which effectively prevents the leakage of the original speech information.

4.4. Tampering Recovery Analysis

To measure the recovery quality of the recovered speech more comprehensively, this paper selects SNR, PESQ, and the frequency-weighted segmental SNR (fwSNR_seg) for evaluation. fwSNR_seg is calculated as in Equation (20), where M is the number of segments, W(j, m) is the weight on the j-th sub-band in the m-th frame, KB is the number of sub-bands, X(j, m) is the spectrum magnitude of the j-th sub-band in the m-th frame, and X̂(j, m) is its distorted spectrum magnitude [30]. fwSNR_seg sets different weights for different speech bands on the basis of SegSNR so that the index can adapt to the short-term smoothness characteristics of speech, which is more in line with human auditory perception.

\mathrm{fwSNR}_{seg} = \frac{10}{M} \sum_{m = 1}^{M} \frac{\sum_{j = 1}^{KB} W (j, m) \log_{10} \frac{X^{2} (j, m)}{{(X (j, m) - \hat{X} (j, m))}^{2}}}{\sum_{j = 1}^{KB} W (j, m)} .

(20)
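A minimal Python sketch of Equation (20) is given below, assuming the sub-band magnitudes and weights have already been computed. The function name `fw_snr_seg`, the frame-major layout `X[m][j]`, and the small `eps` guard against division by zero are assumptions of this illustration; the sub-band analysis and the choice of weights W(j, m) are outside its scope.

```python
import math


def fw_snr_seg(X, X_hat, W, eps=1e-12):
    """Equation (20) for magnitudes X[m][j], X_hat[m][j] and weights W[m][j].

    m indexes frames (segments) and j indexes sub-bands.
    """
    num_frames = len(X)
    total = 0.0
    for m in range(num_frames):
        weighted_sum = 0.0
        weight_sum = 0.0
        for j in range(len(X[m])):
            error = (X[m][j] - X_hat[m][j]) ** 2
            ratio = (X[m][j] ** 2) / max(error, eps)
            weighted_sum += W[m][j] * math.log10(max(ratio, eps))
            weight_sum += W[m][j]
        total += weighted_sum / weight_sum
    return 10.0 * total / num_frames


# Toy example: a 10% magnitude error in every band gives 20 dB.
score = fw_snr_seg([[1.0, 2.0]], [[0.9, 1.8]], [[1.0, 1.0]])
print(round(score, 3))
```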

Table 1 illustrates the comparison of SNR between the proposed method and existing methods based on recovery information such as G723.1 coding (G723.1) [17], compressed sensing (CS) [19], DCT coefficients, and the DWT coefficient (DCT-DWT) [21], and based on correlation of sampling points such as Lerp [23], R2-Lerp [24], and the least squares method (LSM) [6]. The tampering ratios in this experiment are 1%, 5%, 10%, 20%, 30%, 40%, and 50%. When the above methods still have unrecoverable sampling points after tamper recovery, the unrecoverable sampling points are zeroed.

Table 1 shows that the recovery quality of our method is similar to or better than that of the other tampering recovery methods under different tampering ratios. Compared to the existing methods based on the correlation of sampling points, our method achieves a better tampering recovery effect under high tampering ratios while ensuring the performance of tampering recovery under low tampering ratios. Compared to the existing methods based on recovery information, our method achieves a similar recovery effect under high tampering ratios and a better recovery effect under low tampering ratios. Additionally, our method avoids the problem of unrecoverable speech due to the destruction of sampling points containing recovery information.

In order to comprehensively represent the impact of increasing tampering ratios on the quality of recovered speech, Table 2 illustrates the comparison of PESQ between the proposed method and other tampering recovery methods, which evaluates the recovery effect from the perspective of auditory perception.

Table 2 shows that the PESQ of our method is superior to that of the other tampering recovery methods at tampering ratios up to 20% and slightly lower than that of the method of [17] at tampering ratios from 30% to 50% (and than that of [19] at 50%). When 10% of the speech is tampered, the PESQ of our method is greater than 4, which means that the recovered speech has excellent auditory quality and clear semantics. When 20% of the speech is tampered, the PESQ of our method is greater than 3.5, which means that the auditory quality of the recovered speech meets the requirements of practical use, and our method maintains a clear advantage over the other methods in terms of auditory quality. At a 50% tampering ratio, the PESQ of our method is still greater than 1.5, and the semantics can still be clearly discerned from the recovered speech.

In order to test the intelligibility of the recovered speech, Table 3 illustrates the comparison of fwSNR_seg between the proposed method and other methods, which can reflect the intelligibility of the recovered speech. The higher the fwSNR_seg, the higher the speech intelligibility. Generally, fwSNR_seg for the same speech is lower than SNR.

Table 3 shows that the fwSNR_seg of our method is superior to other tampering recovery methods in all tampering ratios, which indicates that the speech recovered by our method has higher intelligibility. The frequency weighting in fwSNR_seg assigns a higher weight to the high-frequency part of speech, and the recovery effect of the high-frequency part has a greater impact on fwSNR_seg; higher fwSNR_seg also reflects that our method has a better recovery effect on the high-frequency part of speech.

To represent the quality of recovered speech after tampering more intuitively, Figure 13 shows the waveforms and spectrograms of the recovered speech in this paper under different tampering ratios, in which the red part of the waveform represents the sampling points after tampering recovery.

As shown in Figure 13, the recovered speech waveforms and spectrograms at different tampering ratios are similar to the original speech waveform and spectrogram. Even when the tampering ratio reaches 50%, the recovered speech envelope still closely resembles that of the original speech, and the spectrogram only displays slight differences in the high-frequency part.

To demonstrate the adaptability of the proposed tampering recovery method for real-time applications, Table 4 shows the tampering recovery effect under different noise attacks. The types of noise attacks include background noise (B. N), factory noise (F. N), and narrowband Gaussian noise (G. N) at 20 dB, 30 dB, 40 dB, and 50 dB.

From Table 4, it is evident that noise attacks at different decibels have a significant impact on tampering recovery, and different kinds of noise have roughly the same effect on tampering recovery. This is because the proposed tampering recovery method is based on the correlation of sampling points. When the sampling points used for recovery change, the points being recovered will also change, meaning that the recovered points are subjected to the same noise attack.

To more intuitively demonstrate the impact of noise on the recovered sampling points, this paper conducts experimental analysis by testing the SNR of the original sampling points at the tampered locations, as well as the recovered sampling points that were not subjected to noise attacks and those that were affected by noise attacks. Table 5 shows the influence of different types of noise attacks on the tampering recovery sampling points.

From Table 5, it can be observed that different types of noise attacks have a roughly similar impact on the tampered sampling points. Various levels of noise have some effect on the tampered sampling points, but this impact decreases as the tampering ratio increases. This indicates that while noise attacks affect the proposed tampering recovery algorithm, their influence on the recovered sampling points is relatively minor.

4.5. Security

As shown in Figure 14, the sampling points are evenly distributed throughout the speech after encryption, and the envelope of the encrypted speech differs significantly from that of the original speech. This means that it is difficult to obtain the correlation features and spectrogram features from the encrypted speech, which intuitively indicates that our encryption algorithm has a good encryption effect.

To test the key sensitivity, our method sets K = (w₁(0), w₂(0), w₃(0), w₄(0)), and K′ = (w₁(0) + 10⁻¹², w₂(0) + 10⁻¹², w₃(0) + 10⁻¹², w₄(0) + 10⁻¹²). As shown in Figure 15, the waveform of the decrypted speech with the correct key is consistent with that of the original speech, and there are no significant differences in the spectrogram compared to the original speech. However, even a slight change in the key has a significant impact on the chaotic sequence, making it impossible to correctly decrypt the encrypted speech and obtain any useful features from the incorrectly decrypted speech.

To verify the security and scrambling effect of our method, this experiment analyzed the SNR, PESQ, fwSNR_seg, spectral entropy (SE), and correlation coefficient (NC) of the encrypted speech, as well as the effect of tampering recovery at different tampering ratios. SE is calculated as in Equation (21), in which n_fft is the length of the FFT, k is the frequency component, and p(k) is the probability density corresponding to the k-th frequency component; the SE of the original speech is 9.7960. NC is calculated as in Equation (22), in which S_i is the original speech sampling point and ES_i is the encrypted speech sampling point.

\mathrm{SE} = - \sum_{k = 0}^{n_{fft} / 2} p (k) \log p (k) .

(21)

\mathrm{NC} = \frac{\sum_{i} S_{i} \, \mathit{ES}_{i}}{\sqrt{\sum_{i} S_{i}^{2}} \sqrt{\sum_{i} \mathit{ES}_{i}^{2}}} .

(22)
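The two metrics can be sketched in a few lines of Python. The function names are hypothetical, and two points are assumptions of this illustration rather than statements from the paper: p(k) is passed in as an already normalized spectral distribution, and the logarithm base, which the SE equation leaves unspecified, is exposed as a parameter with base 2 as the default.

```python
import math


def spectral_entropy(p, base=2.0):
    """Spectral entropy of a normalized spectral distribution p(k).

    Terms with p(k) = 0 contribute nothing and are skipped.
    """
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)


def normalized_correlation(s, es):
    """NC between original samples s and encrypted samples es."""
    numerator = sum(a * b for a, b in zip(s, es))
    denominator = (math.sqrt(sum(a * a for a in s))
                   * math.sqrt(sum(b * b for b in es)))
    return numerator / denominator


flat = spectral_entropy([0.25, 0.25, 0.25, 0.25])      # uniform over 4 bins
same = normalized_correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = normalized_correlation([1.0, 0.0], [0.0, 1.0])
print(flat, same, orthogonal)
```

A flat spectrum maximizes SE, while NC is 1 for identical signals and 0 for orthogonal ones, so effective scrambling is expected to push NC toward 0.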

Table 6 shows the comparison of the proposed method with traditional scrambling encryption methods such as the Lorenz map (Lorenz), Iterative map (Iterative), Cubic map (Cubic), Piecewise map (Piecewise), Henon map (Henon), and Logistic map (Logistic) [31]. Table 7 shows the comparison of the tampering recovery effect between the proposed method and traditional scrambling encryption methods at different tampering ratios. The traditional scrambling encryption methods used chaotic sequences to scramble the whole speech, while the multidimensional chaotic sequences encrypted the speech in each dimension separately.

Table 6 shows that although the NC of our method is relatively high compared to other encryption methods, SNR, PESQ, fwSNR_seg, and SE of our method are lower than most of the encryption algorithms. While the higher correlation coefficient indicates a certain correlation between the encrypted speech and the original speech, there are significant differences in auditory quality between the encrypted speech and the original speech, making it difficult to distinguish semantics. From the perspective of spectral entropy, the proposed method sufficiently disrupts the correlation between the original speech sampling points. It shows that the encryption effect of our method meets the security requirements for encrypted speech.

5. Conclusions

In this paper, an encrypted speech integrity authentication method is proposed to address the problems of “granularity and accuracy” and “high tampering ratio” through a 2-LSB-based dual fragile watermarking method and a speech self-recovery model based on R2-Lerp. The 2-LSB-based fragile watermarking embedding method constructs the correlation of encrypted speech sampling points while embedding the watermarking. Based on this correlation, the dual watermarking achieves accurate tampering detection of speech frames and then localizes tampering in the encrypted speech that failed authentication down to the individual sampling point. The speech self-recovery model classifies tampering recovery scenarios according to the destruction of the 5 sampling points around the tampered sampling point and constructs the recovery methods for tampered sampling points in each scenario by Lerp and R2-Lerp, realizing tampering recovery under high tampering ratios. In addition, the speech encryption algorithm in this paper sufficiently scrambles the original speech sampling points by Lorenz mapping.

The experimental results show that the encrypted speech integrity authentication scheme in this paper can detect whether speech has been tampered with by ordinary attacks, such as mute and substitution, or by desynchronization attacks, such as insertion or deletion, and then locate the tampered sampling points. The tampering localization error has minimal impact on the subsequent tampering recovery. The speech self-recovery model effectively improves the recovery effect of encrypted speech under a high tampering ratio. Even when up to 50% of the speech is tampered, it still exhibits good speech recovery; when the tampering ratio is below 50%, the semantics of the original speech can be clearly distinguished, which can satisfy the practical requirements of encrypted speech integrity authentication. The proposed speech encryption method spreads the continuously tampered sampling points in the encrypted speech uniformly over the whole speech during the decryption process while guaranteeing security.

Although the proposed tampering localization method can locate the tampered sampling points, it still generates certain errors. Future research work will focus on exploring methods to reduce the tampering localization error based on the correlation of speech, and we will consider tampering detection methods after the speech’s least significant bits are zeroed out. Additionally, the tampering recovery methods based on the recovery information still have certain advantages under high tampering ratios, and future research work will optimize the recovery effect under high tampering ratios through techniques such as prediction. The limited robustness of the LSB-based fragile watermarking method against noise attacks also restricts its usage. Future research will focus on exploring how to enhance robustness while maintaining the high transparency of the LSB method, thereby improving the adaptability of the algorithm for real-time applications.

Author Contributions

Conceptualization: F.X.; data curation: F.X. and X.X.; investigation: F.X.; methodology: F.X. and J.L.; project administration: J.L.; supervision: J.L. and X.X.; writing—original draft: F.X.; writing—review and editing: X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Singh, R.; Saraswat, M.; Ashok, A.; Mittal, H.; Tripathi, A.; Pandey, A.-C.; Pal, R. From classical to soft computing based watermarking techniques: A comprehensive review. Future Gener. Comput. Syst. 2023, 141, 738–754.
2. Makhdoom, I.; Abolhasan, M.; Lipman, J. A comprehensive survey of covert communication techniques, limitations and future challenges. Comput. Secur. 2022, 120, 102784.
3. Shehab, D.-A.; Alhaddad, M.-J. Comprehensive Survey of Multimedia Steganalysis: Techniques, Evaluations, and Trends in Future Research. Symmetry 2022, 14, 117.
4. Meng, X.; Li, C.; Tian, L. Detecting Audio Splicing Forgery Algorithm Based on Local Noise Level Estimation. In Proceedings of the 2018 5th International Conference on Systems and Informatics, Nanjing, China, 10–12 November 2018; pp. 861–865.
5. Wang, S.; Yuan, W.; Wang, J.; Unoki, M. Detection of speech tampering using sparse representations and spectral manipulations based information hiding. Speech Commun. 2019, 112, 1–14.
6. Zhang, Q.-Y.; Zhang, D.-H.; Xu, F.-J. An encrypted speech authentication and tampering recovery method based on perceptual hashing. Multimed. Tools Appl. 2021, 80, 24925–24948.
7. Narla, V.-L.; Gulivindala, S.; Chanamallu, S.-R.; Gangwar, D.-P. BCH encoded robust and blind audio watermarking with tamper detection using hash. Multimed. Tools Appl. 2021, 80, 32925–32945.
8. Hu, H.; Lee, T. Hybrid Blind Audio Watermarking for Proprietary Protection, Tamper Proofing, and Self-Recovery. IEEE Access 2019, 7, 180395–180408.
9. Zhou, S.-Y.; Song, M.-X.; Qian, Q.; Liao, W.-J.; Gong, X.-F. GRACED: A Novel Fragile Watermarking for Speech Based on Endpoint Detection. Secur. Commun. Netw. 2022, 2022, 496748.
10. Liu, Z.-H.; Cao, Y.; Lin, K.-J. A watermarking-based authentication and recovery scheme for encrypted audio. Multimed. Tools Appl. 2024, 83, 10969–10987.
11. Liu, Z.-H.; Huang, J.-W.; Sun, X.-M.; Qi, C.-D. A security watermark scheme used for digital speech forensics. Multimed. Tools Appl. 2017, 76, 9297–9317.
12. Lu, W.-H.; Chen, Z.-L.; Li, L.; Cao, X.-C.; Wei, J.-G.; Xiong, N.-X.; Li, J.; Dang, J.-W. Watermarking Based on Compressive Sensing for Digital Speech Detection and Recovery. Sensors 2018, 18, 2390.
13. Qian, Q.; Zhou, S.-Y.; Song, M.-X.; Cui, Y.-H.; Wang, H. VICTOR: An Adaptive Framing-based Speech Content Authentication and Recovery Algorithm. In Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, Hainan, China, 18–20 December 2022; pp. 649–1654.
14. Elshoush, H.-T.; Mahmoud, M.-M. Ameliorating LSB Using Piecewise Linear Chaotic Map and One-Time Pad for Superlative Capacity, Imperceptibility and Secure Audio Steganography. IEEE Access 2023, 11, 33354–33380.
15. Sultani, Z.-N.; Dhannoon, B.-N. Image and audio steganography based on indirect LSB. Kuwait J. Sci. 2021, 48, 1–12.
16. Mahmoud, M.-M.; Elshoush, H.-T. Enhancing LSB Using Binary Message Size Encoding for High Capacity, Transparent and Secure Audio Steganography—An Innovative Approach. IEEE Access 2022, 10, 29954–29971.
17. Qian, Q.; Cui, Y.-H.; Wang, H.-X.; Deng, M.-S. REPAIR: Fragile watermarking for encrypted speech authentication with recovery ability. Telecommun. Syst. 2020, 75, 273–289.
18. Fan, M.-Q. A source coding scheme for authenticating audio signal with capability of self-recovery and anti-synchronization counterfeiting attack. Multimed. Tools Appl. 2020, 79, 1037–1055.
19. Hu, Y.-X.; Lu, W.-H.; Ma, M.-D.; Sun, Q.-L.; Wei, J.-G. A semi fragile watermarking algorithm based on compressed sensing applied for audio tampering detection and recovery. Multimed. Tools Appl. 2022, 81, 17729–17746.
20. Liu, Z.-H.; Li, Y.-L.; Sun, F.; He, J.-J.; Qi, C.-D.; Luo, D. A Robust Recoverable Algorithm Used for Digital Speech Forensics Based on DCT. Cloud Comput. Secur. 2018, 11068, 300–311.
21. Liu, Z.-H.; Luo, D.; Huang, J.-W.; Wang, J.; Qi, C.-D. Tamper recovery algorithm for digital speech signal based on DWT and DCT. Multimed. Tools Appl. 2017, 76, 12481–12504.
22. Wang, S.-B.; Yuan, W.-T.; Zhang, Z.; Wang, L. Speech watermarking based tamper detection and recovery scheme with high tolerable tamper rate. Multimed. Tools Appl. 2024, 83, 6711–6729.
23. Qian, Q.; Wang, H.-X.; Sun, X.-M.; Cui, Y.-H.; Wang, H.; Shi, C.-H. Speech authentication and content recovery scheme for security communication and storage. Telecommun. Syst. 2018, 67, 635–649.
24. Zhang, Q.-Y.; Xu, F.-J. Encrypted speech authentication and recovery scheme based on fragile watermarking. Telecommun. Syst. 2023, 82, 125–140.
25. Shahna, K.-U. Novel chaos based cryptosystem using four-dimensional hyper chaotic map with efficient permutation and substitution techniques. Chaos Solitons Fractals 2023, 170, 113383.
26. Varghese, F.; Sasikala, P. A detailed review based on secure data transmission using cryptography and steganography. Wirel. Pers. Commun. 2023, 129, 2291–2318.
27. Pejas, J.; Cierocki, L. A systematic review of highly transparent steganographic methods for the digital audio. In Information Systems and Industrial Management, Proceedings of the CISIM 2022, Barranquilla, Colombia, 15–17 July 2022; Springer: Cham, Switzerland, 2022; Volume 13293, pp. 63–77.
28. AlSabhany, A.-A.; Ali, A.-H.; Ridzuan, F.; Azni, A.-H.; Mokhtar, M.-R. Digital audio steganography: Systematic review, classification, and analysis of the current state of the art. Comput. Sci. Rev. 2020, 38, 100316.
29. Wang, D.; Zhang, X. Thchs-30: A free Chinese speech corpus. arXiv 2015, arXiv:1512.01882.
30. Kondo, K. Estimation of forced-selection word intelligibility by comparing objective distances between candidates. Appl. Acoust. 2016, 106, 113–121.
31. Nidaa, A.-A.; Zeina, H. Review of dct and chaotic maps in speech scrambling. J. Theor. Appl. Inf. Technol. 2019, 97, 569–582.

Figure 1. Principles of existing tampering detection methods: TFM extracts the original speech features to construct the database and extracts the authenticated speech features for authentication during tampering detection. TWM constructs the synchronized information as original watermarking and, then, embeds and extracts the watermarking of the authenticated speech for authentication with original watermarking during tampering detection.

Figure 2. Principles of existing tampering recovery methods: TRM encodes the original speech as compressed recovery information and embeds and decodes the extracted watermarking in the tampered speech as recovered speech during tampering recovery. TCM recovers the tampered sampling points by correlation between sampling points during tampering recovery.

Figure 3. The 2-LSB-based dual fragile watermarking method.

Figure 4. Advantages of the proposed speech self-recovery model compared to existing methods [23,24].

Figure 5. The flowchart of the encrypted speech integrity authentication method.

Figure 6. Construction of the original tampering detection watermarking.

Figure 7. The proposed tampering recovery scenarios.

Figure 8. Waveforms and spectrograms of the original speech and watermarked speech: (a) the original speech waveform; (b) the original speech spectrogram; (c) the watermarked speech waveform; (d) the watermarked speech spectrogram.

Figure 9. Waveform and tampering localization after the mute attack on the test speech: (a) the speech waveform after the mute attack; (b) the localization of the tampered speech.

Figure 10. Waveform and tampering localization after the substitution attack on the test speech: (a) the speech waveform after the substitution attack; (b) the localization of the tampered speech.

Figure 11. Waveform and tampering localization after the insertion attack on the test speech: (a) the speech waveform after the insertion attack; (b) the localization of the tampered speech.

Figure 12. Waveform and tampering localization after the deletion attack on the test speech: (a) the speech waveform after the deletion attack; (b) the localization of the tampered speech.

Figure 13. Recovered speech waveforms and spectrograms at different tampering ratios: (a) 1% tampering; (b) 5% tampering; (c) 10% tampering; (d) 30% tampering; (e) 50% tampering.

Figure 14. The waveform and spectrogram of the encrypted speech: (a) the encrypted speech waveform; (b) the encrypted speech spectrogram.

Figure 15. Waveforms and spectrograms of speech decrypted with the right and wrong initial values: (a) the speech waveform after the right decryption; (b) the speech spectrogram after the right decryption; (c) the speech waveform after the wrong decryption; (d) the speech spectrogram after the wrong decryption.

Table 1. A comparison of SNR between the proposed and existing methods under different tampering ratios.

Tampering Ratio	Proposed	[17]	[19]	[21]	[23]	[24]	[6]
1%	48.1576	36.3753	32.1938	22.2401	39.8547	48.3656	41.0281
5%	38.9362	29.1062	25.0036	15.1230	23.1259	30.3133	23.3485
10%	33.8658	25.8159	22.1203	11.9880	17.6729	23.0334	17.8335
20%	27.0363	22.8468	18.8808	8.8222	11.4498	14.1896	11.5878
30%	21.6772	21.1041	17.0959	7.0877	8.2489	9.4517	8.3694
40%	17.3526	19.8173	15.8048	5.7669	5.8309	6.2714	5.9383
50%	13.6548	18.8655	14.7917	4.7982	4.1862	4.1308	4.2763
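
The SNR values in Table 1 compare a recovered signal against its original. The exact definition the authors use is not restated in this part of the article, so the following is only a minimal sketch that assumes the conventional energy-ratio form, SNR = 10·log10(Σx² / Σ(x − y)²), expressed in dB. The function name `snr_db` and its arguments are hypothetical and chosen for illustration.

```python
import math


def snr_db(reference, processed):
    """Signal-to-noise ratio in dB between a reference and a processed signal.

    Assumes the conventional definition (an assumption, not taken from the
    paper): 10 * log10(sum(x^2) / sum((x - y)^2)).
    """
    if len(reference) != len(processed):
        raise ValueError("signals must have the same length")

    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - p) ** 2 for r, p in zip(reference, processed))

    if error_energy == 0:
        # Identical signals: the ratio is unbounded.
        return math.inf
    if signal_energy == 0:
        # Silent reference with a nonzero error.
        return -math.inf
    return 10.0 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    original = [3.0, 4.0]
    recovered = [3.0, 4.5]
    print(snr_db(original, recovered))
```

In the small example, the signal energy is 25 and the error energy is 0.25, a ratio of 100, which corresponds to 20 dB. Under this definition, each 10 dB drop between rows of Table 1 means the recovery error energy has grown tenfold relative to the signal.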

Table 2. A comparison of PESQ between the proposed and existing methods under different tampering ratios.

Tampering Ratio	Proposed	[17]	[19]	[21]	[23]	[24]	[6]
1%	4.479	4.250	4.099	3.219	4.215	4.473	4.248
5%	4.348	3.836	3.640	2.601	3.049	3.741	3.057
10%	4.154	3.623	3.358	2.274	2.515	3.028	2.528
20%	3.762	3.372	3.025	1.886	1.874	2.219	1.891
30%	3.224	3.261	2.844	1.614	1.498	1.696	1.516
40%	2.845	3.179	2.698	1.383	1.147	1.268	1.163
50%	2.445	3.139	2.594	1.166	0.889	0.941	0.898

Table 3. A comparison of fwSNR_seg between the proposed and existing methods under different tampering ratios.

Tampering Ratio	Proposed	[17]	[19]	[21]	[23]	[24]	[6]
1%	31.0936	27.9630	26.7156	21.1031	30.6061	31.0687	30.7451
5%	30.4756	21.3127	19.7162	13.6814	23.3101	28.9864	23.7436
10%	29.3078	18.2835	16.7630	11.0999	15.1551	23.3571	15.4941
20%	25.8689	15.7910	13.9600	8.7983	8.6516	12.1001	8.7579
30%	22.0969	14.6129	12.5683	7.6573	6.1332	7.5465	6.2238
40%	18.6124	13.9312	11.6115	6.9918	4.6486	5.5280	4.7038
50%	15.4145	13.6664	10.8939	6.4905	3.7196	4.3177	3.7231

Table 4. The tampering recovery effect (SNR) under different noise attacks.

Tampering Ratio		10%	20%	30%	40%	50%
Proposed		33.8658	27.0363	21.6772	17.3526	13.6548
B. N	20 dB	19.4965	18.6640	17.3539	15.2725	12.6517
	30 dB	28.2712	25.0227	21.0080	17.0964	13.5439
	40 dB	32.8623	26.7843	21.6109	17.3272	13.6437
	50 dB	33.7580	27.0094	21.6723	17.3504	13.6537
F. N	20 dB	19.7208	19.0658	17.6346	15.4453	12.7931
	30 dB	28.4242	25.2046	21.0631	17.1249	13.5661
	40 dB	32.8925	26.8169	21.6142	17.3311	13.6477
	50 dB	33.7565	27.0147	21.6717	17.3510	13.6546
G. N	20 dB	19.7261	19.0418	17.6035	15.4154	12.7505
	30 dB	28.4396	25.1777	21.0419	17.1120	13.5509
	40 dB	32.9069	26.8051	21.6066	17.3269	13.6428
	50 dB	33.7622	27.0109	21.6692	17.3497	13.6531
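
The rows of Table 4 are indexed by noise levels of 20 to 50 dB. How the noise attacks were generated is not described in this part of the article, so the sketch below only illustrates one common way to apply a noise attack at a chosen level: scale a noise sequence so that the signal-to-noise ratio of the attacked signal equals a target value. White Gaussian noise, the function name `add_noise_at_snr`, and the `seed` parameter are all assumptions made for illustration.

```python
import math
import random


def add_noise_at_snr(signal, target_snr_db, seed=0):
    """Return `signal` plus Gaussian noise scaled to a target SNR in dB.

    Illustrative only: white Gaussian noise is an assumption, and the noise
    types used in the paper's experiments may differ.
    """
    if not signal:
        return []

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in signal]

    signal_energy = sum(s * s for s in signal)
    noise_energy = sum(n * n for n in noise)

    # Choose the scale so that signal_energy / (scale^2 * noise_energy)
    # equals 10^(target_snr_db / 10).
    target_ratio = 10.0 ** (target_snr_db / 10.0)
    scale = math.sqrt(signal_energy / (noise_energy * target_ratio))

    return [s + scale * n for s, n in zip(signal, noise)]


if __name__ == "__main__":
    tone = [math.sin(0.1 * i) for i in range(1000)]
    for level in (20.0, 30.0, 40.0, 50.0):
        attacked = add_noise_at_snr(tone, level)
        print(level, len(attacked))
```

With this construction, a 50 dB attack perturbs the signal far less than a 20 dB attack, which is consistent with the 50 dB rows of Table 4 staying close to the noise-free "Proposed" row.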

Table 5. The impact of noise attacks on the recovered sampling points.

Tampering Ratio		10%	20%	30%	40%	50%
Proposed		23.7079	20.0136	16.3712	13.3751	10.6563
B. N	20 dB	17.4420	16.2397	14.4602	12.3926	10.1379
	30 dB	22.5188	19.4410	16.1540	13.2690	10.6021
	40 dB	23.5843	19.9494	16.3545	13.3653	10.6510
	50 dB	23.6990	20.0060	16.3713	13.3744	10.6558
F. N	20 dB	17.5936	16.5358	14.5878	12.4899	10.2270
	30 dB	22.5284	19.5189	16.1635	13.2836	10.6171
	40 dB	23.5700	19.9643	16.3524	13.3677	10.6542
	50 dB	23.6924	20.0095	16.3701	13.3750	10.6567
G. N	20 dB	17.6392	16.4703	14.5371	12.4542	10.1815
	30 dB	22.5730	19.4825	16.1404	13.2706	10.6018
	40 dB	23.5878	19.9521	16.3448	13.3635	10.6493
	50 dB	23.6982	20.0057	16.3677	13.3736	10.6551

Table 6. A comparison of encryption performance between the proposed and existing scrambling encryption methods.

Evaluation Metric	Proposed	Lorenz	Henon	Logistic	Iterative	Cubic	Piecewise
SNR	−3.0099	−3.0001	−3.0157	−3.0000	−3.0153	−3.0040	−3.0220
PESQ	0.823	0.642	0.632	0.898	0.742	1.665	0.767
fwSNR_seg	2.0119	2.0480	2.1120	2.0287	2.0561	2.1054	2.0412
SE	11.5600	11.5596	11.5615	11.5611	11.5602	11.5595	11.5571
NC	0.0031	0.0023	−0.0013	0.0024	−0.0011	0.0015	−0.0027

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
