# A Low Distortion Audio Self-Recovery Algorithm Robust to Discordant Size Content Replacement Attack


## Abstract


## 1. Introduction

## 2. Description of the Watermarking Scheme for Audio Signals Self-Recovery

#### 2.1. Reed–Solomon

#### 2.2. Watermark Generation and Embedding Algorithm

- Watermark generation is the starting point. The input is the original audio signal X to be protected against the discordant size content replacement attack. Each audio signal has a duration of t seconds, a sampling frequency of 48 kHz, 16 bits per sample, and a length of $\lambda $ samples.
- Obtain x, a copy of the original signal, and decimate it by an integer factor $M=2$, i.e., $D\left[n\right]=x\left[nM\right]$, $n=0,1,2,\dots ,\frac{\lambda}{M}-1$. After this step, the original signal X is represented by only one-half of its samples, so that the compression operates on a smaller version of the signal and achieves a better compression rate.
- The decimated signal D is compressed with the OPUS audio codec at a rate of 64 kbps. The resulting compression quality is two times better than that obtained at the rate used in the scheme of [3]. The source coding output has length $\Gamma $, with 8 bits per symbol.
- The 8-bit symbols are concatenated in pairs into symbols of $m=16$ bits, forming a new source code of $k=\frac{\Gamma}{2}$ symbols. This step is necessary because the RS channel code operates on symbols of length m.
- Scramble the k source-code symbols with a secret key $\phi $. The key is chosen at random and must be shared between the transmitter and the receiver to provide the required security of the embedding algorithm. Scrambling spreads out the effect of tampering with a set of contiguous samples: the channel decoder sees it as small tampering at different positions, which eases the restoration of the individual portions.
- To protect the scrambled code, apply the RS coder $RS(n,k)$ over the Galois field $GF\left({2}^{m}\right)$. The coder adds $n-k$ parity symbols, where k is the number of original information symbols and $n=k\times 2$ is the number of symbols at the channel coding output. This yields the first part of the watermark.
- The watermark is then embedded in the original audio. First, decompose the original audio signal X into frames of 10 milliseconds; at 48 kHz, each frame consists of 480 samples.
- Then, compute the hash of the 14 most significant bits (MSBs) of every sample in each frame, i.e., of the ${b}_{MSB}=14\times 480$ bits per frame that are not modified. Use only 8 hash symbols, or ${b}_{h}=64$ bits. This yields the second part of the watermark.
- After that, distribute the 56 symbols, or ${b}_{ch}=896$ bits, of channel code and the 8 symbols, or ${b}_{h}=64$ bits, of hash by replacing the two LSBs of the samples of each frame, i.e., the ${b}_{LSB}=960$ bits that are available.
- Finally, insert the parameters $\lambda $ and $\Gamma $ in the LSBs of the frames that are not watermarked. They are represented by 20 bits and 16 bits, respectively, and are repeated 24 times per frame. These values are embedded in the last frames on the assumption that the information relevant to the attacker is contained in the body of the audio signal. Both parameters are needed for the extraction and restoration of the watermark. This produces the watermarked audio signal ${X}^{\prime}$.
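
The per-frame bit budget of the embedding steps can be checked with a short sketch. It is only an illustration under stated assumptions: samples are treated as unsigned 16-bit words, the hash is a truncated SHA-256 (the text does not name the hash function), and the channel-code bits are placed before the hash bits inside the frame (the text only says they are distributed). The names `frame_hash_bits`, `embed_frame` and `extract_frame` are hypothetical.

```python
import hashlib

FRAME_LEN = 480    # 10 ms at 48 kHz
LSB_BITS = 2       # watermark bits carried by each sample
HASH_BITS = 64     # b_h: 8 symbols of 8 bits
CODE_BITS = 896    # b_ch: 56 symbols of 16 bits


def frame_hash_bits(frame):
    """64 hash bits over the 14 MSBs of each sample (SHA-256 is an assumption)."""
    msb = b"".join(((s & 0xFFFF) >> LSB_BITS).to_bytes(2, "big") for s in frame)
    digest = hashlib.sha256(msb).digest()[: HASH_BITS // 8]
    return [(byte >> (7 - k)) & 1 for byte in digest for k in range(8)]


def embed_frame(frame, code_bits):
    """Replace the 2 LSBs of every sample with channel-code bits, then hash bits."""
    if len(frame) != FRAME_LEN or len(code_bits) != CODE_BITS:
        raise ValueError("expected 480 samples and 896 channel-code bits")
    payload = list(code_bits) + frame_hash_bits(frame)   # 896 + 64 = 960 bits
    marked = []
    for idx, s in enumerate(frame):
        two_bits = (payload[2 * idx] << 1) | payload[2 * idx + 1]
        marked.append(((s & 0xFFFF) & ~0b11) | two_bits)  # MSBs stay untouched
    return marked


def extract_frame(frame):
    """Read back (code_bits, hash_bits) from the 2 LSBs of a watermarked frame."""
    bits = []
    for s in frame:
        bits += [(s >> 1) & 1, s & 1]
    return bits[:CODE_BITS], bits[CODE_BITS:]


# Example: embed an arbitrary bit pattern in one synthetic frame.
frame = [(1234 * n) & 0xFFFF for n in range(FRAME_LEN)]
code = [n % 2 for n in range(CODE_BITS)]
marked = embed_frame(frame, code)
rec_code, rec_hash = extract_frame(marked)
```

Because only the two LSBs change, the hash recomputed from a watermarked frame equals the embedded one, and each sample moves by at most 3 quantization levels.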

#### 2.3. Mathematical Model That Describes the Attack

- Define the input parameters: the size i of the set of replaced samples, the size j of the set of replacement samples, the discordance a (the difference between the number of samples of the original signal and the number of samples of the attacked signal), and the start and end positions of the attack, $pos\_ini\_ataq$ and $pos\_fin\_ataq$, respectively.
- Before decimation, the set of replaced samples of size i must be set to zero. This prepares the region for the insertion of the set of replacement samples. The operation is applied to the received watermarked audio signal ${X}^{\prime}$ of size $\lambda $, where n denotes the n-th sample:$$xz\left[n\right]={X}^{\prime}\left[n\right]\times z\left[n\right],$$$$z\left[n\right]=\left\{\begin{array}{ll}0, & pos\_ini\_ataq\le n\le pos\_ini\_ataq+i\\ 1, & \mathit{otherwise}.\end{array}\right.$$
- Before applying decimation and interpolation to $xz$, compute a value, $decimal$, that depends on the discordance of the attack, i.e., on whether the replacement is of equal, larger or smaller size. The discordance represents the number of samples added or removed by the attack. Without this value, the integer factors used in the decimation and interpolation cannot be obtained:$$decimal=\left\{\begin{array}{ll}1, & a=0,\\ 1+\frac{a}{i}, & a=j-i,\\ 1-\frac{a}{i}, & a=i-j.\end{array}\right.$$
- Whatever value $decimal$ takes, it is converted to a fraction whose integer numerator L and denominator M are the interpolation and decimation factors, respectively:$$\frac{L}{M}=decimal.$$
- The first sampling technique, decimation by an integer factor M, is applied. It is performed only on the original set of size i, which was replaced with zeros using Equations (1) and (2). This set is contained in $xz$, and the process is described by:$$D\left[m\right]=xz\left[mM\right],\quad pos\_ini\_ataq\le m\le (pos\_ini\_ataq+i).$$
- The second sampling technique, interpolation by an integer factor L, is applied to the decimated block $D\left[m\right]$ of size $BD$:$$I\left[m\right]=\left\{\begin{array}{ll}D\left[\frac{m}{L}\right], & m=n\times L,\quad pos\_ini\_ataq\le n\le w2\\ 0, & \mathit{otherwise}.\end{array}\right.$$
- Once the decimation–interpolation process has finished, only the interpolated block $I\left[m\right]$ is tampered with, using a set of replacement samples of size j. The set of replacement audio samples $r\left[n\right]$ is added as follows:$$Y\left[n\right]=I\left[n\right]+r\left[n\right],$$$$r\left[n\right]=\left\{\begin{array}{ll}A\left[n\right], & pos\_ini\_ataq\le n\le pos\_fin\_ataq\\ 0, & \mathit{otherwise}.\end{array}\right.$$
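
The attack model above can be sketched in a few lines. Note that both non-trivial branches of $decimal$ reduce to $j/i$, since $1+\frac{j-i}{i}=\frac{j}{i}$ and $1-\frac{i-j}{i}=\frac{j}{i}$, so $L/M$ is simply $j/i$ in lowest terms. The sketch below is an illustration, not the authors' implementation: it treats the replaced region as exactly i samples starting at `pos_ini`, works on plain Python lists, and the function names are hypothetical.

```python
from fractions import Fraction


def resampling_factors(i, j):
    """Return (L, M) with L/M = decimal for replaced size i and replacement size j."""
    if i <= 0 or j <= 0:
        raise ValueError("i and j must be positive")
    if j == i:                         # equal size: a = 0
        decimal = Fraction(1)
    elif j > i:                        # larger replacement: a = j - i
        decimal = 1 + Fraction(j - i, i)
    else:                              # smaller replacement: a = i - j
        decimal = 1 - Fraction(i - j, i)
    return decimal.numerator, decimal.denominator


def replacement_attack(x_marked, pos_ini, i, replacement):
    """Replace i samples starting at pos_ini with the j samples in `replacement`."""
    j = len(replacement)
    L, M = resampling_factors(i, j)
    zeroed = [0] * i                           # replaced set put to zero (xz)
    decimated = zeroed[::M]                    # D[m] = xz[mM]
    block = [0] * (len(decimated) * L)         # interpolation by L (zero insertion)
    block[::L] = decimated
    assert len(block) == j                     # the block now has the replacement size
    tampered = [b + r for b, r in zip(block, replacement)]   # Y = I + r
    return x_marked[:pos_ini] + tampered + x_marked[pos_ini + i:]


# Example: 6 samples are replaced by 4, so the signal shrinks by 2 samples.
x = list(range(20))
y = replacement_attack(x, 5, 6, [100, 101, 102, 103])
```

Since the block is all zeros before the replacement samples are added, decimation and interpolation only change its length here; their role in the model is to resize the region from i to j samples.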

#### 2.4. Extraction and Reconstruction of the Watermark

- A tampered watermarked audio signal Y of size $\mu $ is received. Extract $\lambda $ and $\Gamma $ from the last 30 frames, each containing 480 samples, starting from the last frame of the signal. Both parameters were repeated 24 times per frame, i.e., a total of 864 bits was distributed in each frame. Both parameters can therefore be recovered from a single frame that has not been tampered with.
- A synchronization strategy is then applied. Once $\lambda $, the length of the original audio signal, has been extracted, compute the length error $\epsilon $ from the size $\mu $ of the received tampered signal:$$\epsilon =\mu -\lambda .$$
- Decompose the audio signal Y into frames of 480 samples (10 milliseconds) and extract the hash information from the LSBs of the samples of each frame, i.e., the eight embedded symbols or ${b}_{h}=64$ bits.
- Calculate the hash of all ${b}_{MSB}=14\times 480$ MSBs of the samples of each frame, obtaining ${b}_{h}=64$ hash bits per frame. Compare the extracted and calculated hash bits of each frame to determine the first tampered frame.
- Then, synchronize Y using only the first tampered frame. Zeros are added to this frame, or samples are removed from it, depending on the value of the length error $\epsilon $:
  - If $\epsilon <0$, add a set of zeros of size $\left|\epsilon \right|$, where $|\cdot |$ denotes the absolute value.
  - If $\epsilon >0$, remove a set of samples of size $\epsilon $.
  - If $\epsilon =0$, the tampered signal has the same length as the original signal and no synchronization is needed.
- Once the tampered signal has been synchronized in length, giving ${Y}^{\prime}$, compute the hash information and extract the channel code of each frame. As in steps 3 and 4, every frame must be classified: for each frame, the generated hash bits are compared with the embedded hash bits of the same frame. A frame is marked as healthy when the extracted and recomputed hash bits match, and as tampered otherwise. Each tampered frame therefore results in the loss of $\frac{{b}_{ch}}{m}$ channel code symbols. In the proposed method, the channel code symbol length is $m=16$, and every frame hosts 56 channel code symbols, or ${b}_{ch}=896$ bits of channel code.
- The channel coding output bits are collected from the watermark bits of the audio signal and passed to the channel decoding module. Collect all ${b}_{ch}=896$ bits (56 symbols) of channel code from the 2 LSBs of the samples of each frame. The number of watermarked frames is $block\_marked=\frac{\Gamma}{56}$, where $\Gamma $ is the value extracted in step 1, which represents the total number of symbols generated by the source coding.
- Pass the channel-coded symbols to the $RS(n,k)$ channel decoder, where $n=\Gamma $ and $k=\frac{n}{2}$. The decoding process recovers the k source-coded symbols, i.e., the compressed version of the audio to be recovered.
- If decoding succeeds, apply the inverse scrambling to the k output symbols of the channel decoder using the secret key $\phi $. This returns the symbols to their original positions.
- Convert the ordered 16-bit output symbols to 8-bit symbols, giving a new set of $\Gamma $ ordered symbols, i.e., the concatenation applied during watermark generation is undone.
- Apply the OPUS source decoder to the new set of ordered symbols to recover the decimated signal D that had been compressed.
- After source decoding, interpolate the output of the source decoder by an integer factor $L=2$ using the spline method [19,20] to restore the size and content of the signal that was decimated:$$I\left[m\right]=\left\{\begin{array}{ll}D\left[\frac{m}{L}\right], & m=n\times L,\quad n=0,1,2,\dots \\ spline, & \mathit{otherwise}.\end{array}\right.$$
- Replace the content of the tampered frames of ${Y}^{\prime}$ with the recovered interpolated audio signal I, i.e., with the reconstructed watermark.
- This yields the recovered audio signal ${X}^{\prime \prime}$.
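
The synchronization and frame-classification steps can be illustrated with the following sketch. It rests on several assumptions that the text does not fix: the hash is a truncated SHA-256, the embedded hashes are passed in as a ready-made list (`embedded_hashes`) instead of being extracted from the LSBs, and the zeros are added, or the samples removed, at the start of the first tampered frame. All function names are hypothetical.

```python
import hashlib

FRAME_LEN = 480   # samples per 10 ms frame at 48 kHz


def frame_hash(frame):
    """8-byte hash over the 14 MSBs of each sample (SHA-256 is an assumption)."""
    msb = b"".join(((s & 0xFFFF) >> 2).to_bytes(2, "big") for s in frame)
    return hashlib.sha256(msb).digest()[:8]


def first_tampered_frame(signal, embedded_hashes):
    """Index of the first frame whose recomputed hash differs from the embedded one."""
    for f, expected in enumerate(embedded_hashes):
        frame = signal[f * FRAME_LEN:(f + 1) * FRAME_LEN]
        if frame_hash(frame) != expected:
            return f
    return None   # every frame is healthy


def synchronize(signal, original_len, first_bad):
    """Restore the original length by padding or trimming at the first tampered frame."""
    eps = len(signal) - original_len            # epsilon = mu - lambda
    start = first_bad * FRAME_LEN
    if eps < 0:                                 # samples were lost: add |eps| zeros
        return signal[:start] + [0] * (-eps) + signal[start:]
    if eps > 0:                                 # samples were added: remove eps samples
        return signal[:start] + signal[start + eps:]
    return list(signal)                         # same length: nothing to do


# Example: 480 samples of a 5-frame signal are replaced by 300 samples.
original = [(37 * n) & 0xFFFF for n in range(5 * FRAME_LEN)]
hashes = [frame_hash(original[f * FRAME_LEN:(f + 1) * FRAME_LEN]) for f in range(5)]
attacked = original[:970] + [7] * 300 + original[1450:]
bad = first_tampered_frame(attacked, hashes)
synced = synchronize(attacked, len(original), bad)
```

After synchronization, the frames that follow the attacked region line up with their original positions again, so their hashes match and only the frames overlapping the attack are marked as tampered.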

## 3. Experimental Results

#### Comparative Results

## 4. Discussion and Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Cox, I.; Miller, M.; Bloom, J.; Fridrich, J.; Kalker, T. Digital Watermarking and Steganography, 2nd ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2008.
- Sarreshtedari, S.; Akhaee, M.A.; Abbasfar, A. A Watermarking Method for Digital Speech Self-recovery. IEEE/ACM Trans. Audio Speech Lang. Proc. **2015**, 23, 1917–1925.
- Gomez-Ricardez, J.J.; Garcia-Hernandez, J.J. An audio self-recovery scheme that is robust to discordant size content replacement attack. In Proceedings of the IEEE 61st International Midwest Symposium on Circuits and Systems, MWSCAS 2018, Windsor, ON, Canada, 5–8 August 2018; pp. 825–828.
- Gomez-Ricardez, J.J.; Parra-Michel, R.; Garcia-Hernandez, J.J. Mathematical models for the discordant size content replacement attack. In Proceedings of the 2019 7th International Workshop on Biometrics and Forensics (IWBF), Cancun, Mexico, 2–3 May 2019; pp. 1–5.
- National Forensic Science Technology Center (NFSTC). NFSTC: A Simplified Guide to Forensics Audio and Video Analysis; Technical Report; National Forensic Science Technology Center (NFSTC): Largo, FL, USA, 2010.
- Newton, H. Music Censorship: An Overview; George Washington University: Washington, DC, USA, 2012; Volume 1.
- Li, J.; Lu, W.; Zhang, C.; Wei, J.; Cao, X.; Dang, J. A Study on Detection and Recovery of Speech Signal Tampering. In Proceedings of the 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, 23–26 August 2016; pp. 678–682.
- Menendez-Ortiz, A.; Feregrino-Uribe, C.; García-Hernández, J.J.; Guzmán-Zavaleta, Z.J. Self-recovery scheme for audio restoration after a content replacement attack. Multimed. Tools Appl. **2017**, 76, 14197–14224.
- Hu, H.; Lee, T. Hybrid Blind Audio Watermarking for Proprietary Protection, Tamper Proofing, and Self-Recovery. IEEE Access **2019**, 7, 180395–180408.
- Fan, M.Q. A source coding scheme for authenticating audio signal with capability of self-recovery and anti-synchronization counterfeiting attack. Multimed. Tools Appl. **2019**, 79, 1037–1055.
- Hu, H.T.; Lu, Y.H. Frame-synchronous Blind Audio Watermarking for Tamper Proofing and Self-Recovery. Adv. Technol. Innov. **2020**, 5, 18–32.
- Jovanovic-Dolecek, G. Multirate Systems: Design and Applications; Idea Group Pub.: Hershey, PA, USA, 2001.
- Oppenheim, A.; Willsky, A.; Nawab, S. Signals & Systems; Prentice-Hall Signal Processing Series; Prentice-Hall International: Upper Saddle River, NJ, USA, 1997.
- Tan, L. Digital Signal Processing: Fundamentals and Applications; Digital Signal Processing SET; Elsevier Science: Amsterdam, The Netherlands, 2007.
- Blahut, R.E. Algebraic Codes for Data Transmission; Cambridge University Press: Cambridge, UK, 2003.
- Justesen, J.; Forchhammer, S. Two-Dimensional Information Theory and Coding: With Applications to Graphics Data and High-Density Storage Media; Cambridge University Press: Cambridge, UK, 2009.
- Proakis, J.; Salehi, M. Digital Communications, 5th ed.; McGraw-Hill Higher Education: New York, NY, USA, 2008.
- Xiph.Org Foundation. Opus Interactive Audio Codec. 2012. Available online: https://opus-codec.org (accessed on 10 December 2020).
- Faires, J.; Burden, R. Numerical Methods, 4th ed.; Cengage Learning: Boston, MA, USA, 2012.
- Gupta, S. Numerical Methods for Engineers; New Age International (P) Limited: New Delhi, India, 1995.
- Garcia-Hernandez, J.J. Replication Data for: “On a Key-Based Secured Audio Data Hiding Scheme Robust to Volumetric Attack with Entropy-Based Embedding” Submitted to Entropy; Harvard Dataverse: Cambridge, MA, USA, 2019.
- Thiede, T.; Treurniet, W.C.; Bitto, R.; Schmidmer, C.; Sporer, T.; Beerends, J.G.; Colomes, C. PEAQ—The ITU Standard for Objective Measurement of Perceived Audio Quality. J. Audio Eng. Soc. **2000**, 48, 3–29.
- Bosi, M.; Goldberg, R.E. Introduction to Digital Audio Coding and Standards; Engineering and Computer Science; Springer: New York, NY, USA, 2003.
- Furht, B.; Kirovski, D. Multimedia Watermarking Techniques and Applications (Internet and Communications Series); Auerbach Publications: Boston, MA, USA, 2006.

**Figure 4.** Simulation results for the audio signal self-recovery scheme: (**a**) the original audio signal; (**b**) the watermarked audio signal, the color line indicates the region to be attacked; (**c**) the attacked watermarked audio signal, the color line indicates the attacked region; and (**d**) the restored audio signal, the color line indicates the recovered region.

**Figure 5.** Simulation results for an audio signal self-recovery with a tampering equal to one-half of its total length: (**a**) the original audio signal; (**b**) the watermarked audio signal, the color line indicates the region to be attacked; (**c**) the attacked watermarked audio signal, the color line indicates the attacked region; and (**d**) the restored audio signal, the color line indicates the recovered region.

**Table 1.** Average value of ODG and PSNR in the case of recovering against the content replacement of equal size.

| Attack degree j−i (samples) | Recovery (%) | ODG (Orig vs. Rec) | PSNR (Orig vs. Rec) | ODG (WM vs. Rec) | PSNR (WM vs. Rec) |
|---|---|---|---|---|---|
| 10,000–10,000 | 100 | −0.5861 | 49.0191 | −0.4757 | 49.0365 |
| 15,000–15,000 | 100 | −0.6506 | 46.5534 | −0.5326 | 46.5791 |
| 19,000–19,000 | 100 | −0.7118 | 45.0146 | −0.5878 | 45.0289 |
| 24,000–24,000 | 100 | −0.7673 | 44.3166 | −0.6388 | 44.3185 |
| 38,000–38,000 | 100 | −0.8733 | 42.0566 | −0.7348 | 42.0578 |
| 55,000–55,000 | 45 | −1.0321 | 41.3916 | −0.8683 | 41.0578 |

**Table 2.** Average value of ODG and PSNR in the case of recovering against content replacement of larger size.

| Attack degree j−i (samples) | Recovery (%) | ODG (Orig vs. Rec) | PSNR (Orig vs. Rec) | ODG (WM vs. Rec) | PSNR (WM vs. Rec) |
|---|---|---|---|---|---|
| 40,000–20,000 | 100 | −0.7153 | 45.3185 | −0.5911 | 45.3381 |
| 60,000–24,000 | 100 | −0.7512 | 44.5893 | −0.6212 | 44.5906 |
| 80,000–28,000 | 100 | −0.7958 | 43.9882 | −0.6642 | 44.0045 |
| 100,000–32,000 | 100 | −0.8261 | 43.0283 | −0.6922 | 43.0293 |
| 120,000–45,000 | 100 | −0.9192 | 41.6639 | −0.7803 | 41.6658 |
| 160,000–55,000 | 56 | −1.0165 | 40.1697 | −0.8781 | 40.1699 |

**Table 3.** Average value of ODG and PSNR in the case of recovering against content replacement of smaller size.

| Attack degree j−i (samples) | Recovery (%) | ODG (Orig vs. Rec) | PSNR (Orig vs. Rec) | ODG (WM vs. Rec) | PSNR (WM vs. Rec) |
|---|---|---|---|---|---|
| 25,000–28,000 | 100 | −0.7971 | 44.1190 | −0.6654 | 44.1208 |
| 20,000–35,000 | 100 | −0.8519 | 42.6765 | −0.7167 | 42.6858 |
| 15,000–44,000 | 100 | −0.9189 | 41.5872 | −0.7771 | 41.5909 |
| 12,000–50,000 | 45 | −0.9753 | 40.6038 | −0.8367 | 40.6041 |
| 300–56,000 | 48 | −1.0327 | 40.7395 | −0.8809 | 40.7398 |
| 100–62,000 | 46 | −1.0149 | 39.2611 | −0.8980 | 39.2614 |

**Table 4.** Average value of ODG and PSNR in signal recovery against content tampering greater than or equal to one-half of its total length.

| Attack degree j−i (samples) | Recovery (%) | ODG (Orig vs. Rec) | PSNR (Orig vs. Rec) | ODG (WM vs. Rec) | PSNR (WM vs. Rec) |
|---|---|---|---|---|---|
| **Equal size** | | | | | |
| 120,000–120,000 | 22 | −1.4257 | 36.7428 | −1.3539 | 36.7428 |
| 130,000–130,000 | 14 | −1.4313 | 36.5323 | −1.3309 | 36.5321 |
| 140,000–140,000 | 9 | −1.3751 | 36.7690 | −1.2729 | 36.7690 |
| **Larger size** | | | | | |
| 240,000–120,000 | 21 | −1.4217 | 37.5442 | −1.2966 | 37.5443 |
| 260,000–130,000 | 19 | −1.4161 | 37.0987 | −1.3447 | 37.0987 |
| 280,000–140,000 | 10 | −1.6282 | 37.6108 | −1.5034 | 37.6109 |
| **Smaller size** | | | | | |
| 60,000–120,000 | 26 | −1.3611 | 35.2997 | −1.2475 | 35.2997 |
| 65,000–130,000 | 14 | −1.5525 | 38.3139 | −1.4093 | 38.3136 |
| 70,000–140,000 | 8 | −1.5404 | 35.0041 | −1.4728 | 35.0042 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gomez-Ricardez, J.J.; Garcia-Hernandez, J.J.
A Low Distortion Audio Self-Recovery Algorithm Robust to Discordant Size Content Replacement Attack. *Computers* **2021**, *10*, 87.
https://doi.org/10.3390/computers10070087
