Article

Audio Watermarking System in Real-Time Applications

by Carlos Jair Santin-Cruz * and Gordana Jovanovic Dolecek
Department of Electronics, National Institute of Astrophysics, Optics and Electronics, Puebla 72840, Mexico
* Author to whom correspondence should be addressed.
Informatics 2025, 12(1), 1; https://doi.org/10.3390/informatics12010001
Submission received: 3 November 2024 / Revised: 23 December 2024 / Accepted: 23 December 2024 / Published: 25 December 2024

Abstract
Watermarking is widely employed to protect audio files. Previous research has focused on developing systems that balance performance criteria such as robustness, imperceptibility, and capacity. Most existing systems are designed for pre-recorded audio signals, where the characteristics of the host signal are known in advance. In such cases, processing time is not a critical factor: these systems generally neither account for real-time signal acquisition nor report the elapsed time between signal acquisition and watermarked output, known as latency. However, the increasing prevalence of audio sharing through real-time streams and video calls creates a pressing need for low-latency systems. This work introduces a low-latency watermarking system that utilizes a spread spectrum technique, a method that spreads the watermark energy across a wide frequency band, with the watermark embedded additively in the time domain to minimize latency. The system's performance was evaluated by simulating real-time audio streams using two distinct methods. The results demonstrate that the proposed system achieves minimal latency during embedding, addressing the urgent need for such systems.

1. Introduction

The development of the internet has significantly increased multimedia traffic, with high-quality files being shared and utilized across various applications. This growth has heightened concerns over security and copyright, making the development of effective watermarking systems more crucial than ever. In this context, our research on a low-latency audio watermarking system is not only timely but also essential in addressing the challenges posed by this increasing multimedia traffic.
Watermarking consists of hiding information within the content of a cover signal, referred to as the host signal. The host signal can be any media file, including audio, video, images, or text. Audio watermarking aims to ensure that the embedded information remains imperceptible, thereby preserving the quality of the host signal. The literature [1,2] highlights numerous contexts in which watermarking has been applied. Considering the similarities among these applications, we can identify the following five as the primary ones, encompassing those mentioned in previous works:
  • Copyright protection: The watermark can identify ownership of an audio signal and prevent unauthorized actions, such as copying, playback, or tampering.
  • Authentication: In this application, the embedded information verifies the legitimacy of the audio signal after distribution.
  • Content identification and management: In scenarios where large volumes of content require automation, the watermark identifies and manages audio signals.
  • Monitoring: Watermarking facilitates the retrieval of information while sharing audio signals.
  • Second screen: The embedded information is used by another device for various purposes.
Audio watermarking presents unique challenges, particularly in maintaining imperceptibility, given the sensitivity of the human auditory system (HAS) to added noise. As noted in [2], an effective audio watermarking system should prioritize imperceptibility while balancing other performance criteria, including robustness, security, capacity, and computational complexity. Achieving this balance involves trade-offs; as observed in [3], the most studied of these is the so-called "magic triangle" formed by imperceptibility, capacity, and robustness. For instance, an audio watermarking system with acceptable performance should meet specific requirements: a minimum signal-to-noise ratio (SNR) of 20 dB for an adequate grade of imperceptibility, a minimum capacity of 20 bps, and robustness capable of withstanding attacks with minimal errors (ideally zero errors after an attack), as outlined in [4]. Our research addresses these complexities, aiming to design an effective watermarking system that meets these stringent criteria.
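To make the SNR requirement concrete, the short Python sketch below computes the host-to-watermark SNR of an additively watermarked signal; the helper name watermark_snr_db and the synthetic signals are illustrative assumptions, not part of any reviewed system.

```python
import numpy as np

def watermark_snr_db(host, watermarked):
    """SNR of the host relative to the added watermark component, in dB."""
    noise = watermarked - host                     # the additive watermark
    return 10 * np.log10(np.sum(host**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
host = rng.standard_normal(44100)                  # stand-in for 1 s of audio
watermarked = host + 0.05 * rng.standard_normal(44100)
print(f"SNR = {watermark_snr_db(host, watermarked):.1f} dB")  # ~26 dB, above the 20 dB minimum of [4]
```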
Several systems have been proposed to mitigate these trade-offs and enhance performance. In [5], methods are categorized by the domain in which the message is embedded: the temporal domain or the frequency domain. Recent advancements have enabled many systems to exceed operational requirements while significantly reducing trade-offs within the "magic triangle." However, as highlighted in [6] and explored in Section 2 of this work, the security and computational complexity criteria remain less comprehensively reported than the other performance criteria.
Audio files are no longer limited to prerecorded artistic content; many modern applications rely on real-time audio sources. With advancements in communication technologies, audio is now widely shared through mobile communications, broadcasting, and video calls, often in high-quality, real-time formats. Protecting such content demands a low-latency embedding process, which is critical when the audio source is live.
Latency is a key consideration in communication systems. As mentioned in [7], latency must be carefully controlled, with acceptable thresholds varying based on application, such as conferences, verbal communication, or music. For example, networked music performances (NMP) require latency values below 25 ms to maintain rhythmic interaction [8]. Low computational complexity and support for reduced buffer sizes at the transmitter side are essential to meet these stringent latency requirements. As noted in [8], buffering significantly contributes to delays, underscoring the importance of minimizing buffer size to achieve real-time performance.
The novelty of this work is as follows:
  • This work proposes a system designed to watermark live audio signals, where knowing the host signal a priori is unnecessary. The watermark embedding is additive in the time domain, using a spread spectrum (SS) technique and a small buffer size. The system achieves acceptable performance values, according to [4].
  • Two methods are used to measure latency: The first modifies the loop-based approach in [9], while the second utilizes an oscilloscope and an external audio source. In both scenarios, the results show that the embedding process does not introduce a measurable latency.

2. Literature Review

Audio watermarking remains an active area of research, and numerous methods have been proposed to enhance performance and reduce trade-offs. However, while the "magic triangle" performance criteria are extensively reviewed, security and computational complexity are often examined less comprehensively. A review of recent works published since 2016 reveals that only a few studies report computational complexity [4,10,11,12,13,14,15,16,17,18]. One possible reason for this gap is the advancement of the hardware on which audio watermarking algorithms are implemented, which can help mitigate trade-offs involving computational complexity. Nonetheless, computational complexity remains a critical factor for real-time applications due to the latency it can introduce.
In general, computational complexity is reported as the elapsed time required for the watermarking process on specific hardware, with measurements typically conducted after the entire audio signal is available for processing. Among the reviewed works, only ref. [14] provides evidence of real-time performance, with tests conducted during message recovery. In [17], the authors mention that "there is a limitation in exploring the proposed audio watermarking algorithm in real-time applications due to its computational complexity." The shortest time reported in that work is 1.0632 s for a 10 s host signal; taking this as a threshold, only ref. [13] reports embedding times below one second, although it can be inferred that the embedding process in [18] is also below this threshold, since the whole process takes less than one second. For extraction, refs. [10,11,12,16,18] report times below 1 s. These findings suggest that embedding generally involves greater computational complexity than extraction, posing challenges for real-time application on the transmission side.
A novel approach is presented in [19], which employs a parallel computing process implemented on 16-core Raspberry Pi hardware to achieve real-time performance. The system achieves an average execution time of 2.3197 s for a 60 s audio signal. However, this method requires the entire host signal to be available before processing, which limits its applicability in real-time streaming scenarios.
In [20], a real-time audio watermarking system is introduced for mobile communication applications. However, this system is designed exclusively for speech signals and uses a watermark embedded in the time domain to control tampering. It is not intended for embedding additional information or identifiers, thereby failing to meet the embedding capacity requirements outlined earlier.
Alternative methods for protecting real-time generated audio signals can be found in the literature, such as the method described in [21]. However, these methods fall into a different category of data hiding, such as cryptography [1]. Consequently, they have distinct applications and lack certain characteristics inherent to watermarking, such as the ability to protect the signal even after it has been decrypted.

3. Description of the Primary System

This system is based on a non-live watermarking system introduced in [22], where SS is used as a modulation technique. The block diagram of the system is presented in Figure 1.
The system can be summarized into four main processes:
  • Convolutional codification.
  • SS modulation and demodulation.
  • Gain modulation.
  • Wiener filtering.
The first three processes are implemented in the transmitter and receiver, while Wiener filtering is exclusive to the receiver.
The additive embedding process in the time domain can be represented as follows:
\[ h_w(n) = h(n) + \hat{w}(n), \tag{1} \]
where $h_w(n)$ is the watermarked signal, $h(n)$ is the host signal, and $\hat{w}(n)$ is the modulated and filtered watermark.
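As a minimal sketch of Equation (1), assuming the modulated and filtered watermark $\hat{w}(n)$ has already been produced by the preceding stages, the embedding itself reduces to a sample-wise addition (the names below are illustrative):

```python
import numpy as np

def embed_block(host_block, w_hat):
    """Equation (1): add the modulated, perceptually shaped watermark to the host."""
    return host_block + w_hat[: len(host_block)]

rng = np.random.default_rng(0)
host = rng.standard_normal(512)            # one 512-sample buffer of host audio
w_hat = 0.01 * rng.standard_normal(512)    # stand-in for the shaped watermark
watermarked = embed_block(host, w_hat)
```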

3.1. Convolutional Codification

Convolutional coding enhances noise tolerance, improving the bit error rate (BER) and imperceptibility. However, it increases the number of inserted bits, reducing the embedding rate. Additionally, the encoder must not be catastrophic [23] while still ensuring good performance. For this reason, the convolutional encoder from [22] was chosen here, with a code rate of Rc = 1/2, L = 7 shift registers, and generator polynomials, in octal numbers (OCT), of 171 OCT and 133 OCT.
The Viterbi decoder at the receiver requires knowledge of the number of shift registers and generator polynomials used at the transmitter. This dependency adds an additional layer of security to the watermarking system. The decoder operates using soft decision decoding with 8-bit quantization. The number of bits is chosen based on [22], which reports that increasing the number of bits beyond this value does not significantly enhance performance.
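For illustration, the following Python sketch implements a rate-1/2, constraint-length-7 convolutional encoder with the 171 OCT and 133 OCT generators described above; the implementation details are ours, and the Viterbi decoder is omitted:

```python
import numpy as np

# Rate-1/2, constraint-length-7 convolutional encoder with the generator
# polynomials 171 OCT and 133 OCT used in [22].
G1, G2, K = 0o171, 0o133, 7

def conv_encode(bits):
    state = 0  # six memory bits; the current input occupies the MSB position
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state               # 7-bit window: input + memory
        out.append(bin(reg & G1).count("1") % 2)   # parity against 171 OCT
        out.append(bin(reg & G2).count("1") % 2)   # parity against 133 OCT
        state = reg >> 1                           # shift: drop the oldest bit
    return np.array(out, dtype=np.uint8)

message = np.array([1, 0, 1, 1, 0, 0, 1], dtype=np.uint8)
coded = conv_encode(message)    # twice as many bits: Rc = 1/2
```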

3.2. SS Modulation and Demodulation

SS is a modulation technique commonly used in communications to maximize bandwidth utilization. This work employs direct sequence SS modulation, generating the modulated signal by multiplying the message signal with a pseudo-noise (PN) sequence. Each element of the PN sequence has a significantly shorter duration than each bit of the message, resulting in increased bandwidth and reduced power spectral density (PSD). This PSD reduction provides advantages in terms of imperceptibility and robustness. Additionally, since the PN sequence must be known during extraction for correct demodulation, it serves as a key, enhancing security.
The PN sequence used for SS modulation and demodulation in the proposed system is a Gold sequence, generated from two maximum-length sequences using 15 shift registers. With the sampling frequency of commercial audio signals (fs = 44.1 kHz), the period of each chip in the PN sequence is $T_c = 1/f_s \approx 22.7\ \mu s$.
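A simplified sketch of the spreading operation is given below. The LFSR taps shown are standard primitive trinomials used only as placeholders: a true Gold code requires a preferred pair of degree-15 polynomials, and the chip rate and function names are illustrative assumptions.

```python
import numpy as np

def lfsr_mseq(taps, nstages, length, seed=1):
    """Binary m-sequence from a Fibonacci LFSR with the given feedback taps."""
    state = [(seed >> i) & 1 for i in range(nstages)]
    out = []
    for _ in range(length):
        out.append(state[-1])              # output the oldest stage
        fb = 0
        for t in taps:
            fb ^= state[t - 1]             # XOR of the tapped stages
        state = [fb] + state[:-1]          # shift the register
    return np.array(out, dtype=np.uint8)

# Two degree-15 sequences XORed chip-by-chip give a Gold-like code; the taps
# below are placeholders, not a verified preferred pair.
N = 2**15 - 1
gold = lfsr_mseq([15, 14], 15, N) ^ lfsr_mseq([15, 1], 15, N, seed=0x5A5A)

def spread(bits, pn, chips_per_bit):
    """Direct-sequence SS: hold each bit over many chips, multiply by the PN."""
    symbols = 2 * np.repeat(bits, chips_per_bit).astype(np.int8) - 1  # 0/1 -> ±1
    chips = 2 * pn[: symbols.size].astype(np.int8) - 1
    return symbols * chips

w = spread(np.array([1, 0, 1]), gold, chips_per_bit=1024)  # baseband watermark
```

Demodulation multiplies the received signal by the same chips and integrates over each bit period, which is why the PN generator acts as a key.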

3.3. Gain Modulation

3.3.1. All-Pole Filter

To achieve imperceptibility, the system leverages the characteristics of the HAS, specifically using the masking threshold determined by the psychoacoustic model designed for MPEG-I Layer III [24] obtained over a time window of Wp = 512 samples. An all-pole filter with frequency response G(z) is designed as follows:
\[ G(z) = \frac{b_0}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_n z^{-n}}, \tag{2} \]
where $b_0$ is a gain factor that modifies the noise variance, and the $a_i$ are the coefficients that achieve the desired frequency response. The required number of poles $n$ in (2) should be at least 2B, where B is the bandwidth in kHz [22]. Since the HAS bandwidth is 20 kHz, a value of n = 50 is used. The filter response closely approximates the masking threshold; the coefficient values are obtained by solving the Yule–Walker equations formulated for this purpose, resulting in improved imperceptibility. Figure 2 illustrates this process, showing the PSD of the host signal, the masking threshold, and the filter response. Although the response does not perfectly match the masking threshold, it is sufficient to enhance imperceptibility, as demonstrated in Section 4.
Filtering at the transmitter modifies the frequency response of the embedded signal and introduces a non-linear phase. To compensate for these effects at the receiver, a zero-forcing equalizer is employed. The masking threshold and filter coefficients are determined similarly to the transmitter, and the inverse filter response (1/G(z)) is applied to compensate for the transmitter’s filtering effects.
It is important to note that the frequency response obtained from the watermarked signal G′(z) may differ from that processed in the transmitter due to modifications and attacks on the audio signal. Consequently, the accuracy of filtering and compensation depends on the extent of these attacks.
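The Yule–Walker fit can be sketched as follows, with a toy masking curve standing in for the threshold produced by the psychoacoustic model of [24]; the helper allpole_from_psd is an illustrative name, not the system's actual code:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def allpole_from_psd(target_psd, n_poles=50):
    """Fit G(z) = b0 / (1 + a1 z^-1 + ... + an z^-n) to a desired power
    spectrum by solving the Yule-Walker equations."""
    r = np.fft.irfft(target_psd)                        # autocorrelation from the PSD
    a = solve_toeplitz(r[:n_poles], -r[1:n_poles + 1])  # Yule-Walker solve
    b0 = np.sqrt(r[0] + np.dot(a, r[1:n_poles + 1]))    # prediction-error gain
    return b0, np.concatenate(([1.0], a))               # denominator = [1, a1, ..., an]

# Toy masking curve over 257 bins (512-point FFT grid), low-pass shaped
freqs = np.linspace(0, 1, 257)
mask_psd = 1.0 / (1.0 + (freqs / 0.1) ** 4)
b0, a = allpole_from_psd(mask_psd)
# White noise shaped with scipy.signal.lfilter([b0], a, noise) then follows
# the masking curve, which is what keeps the watermark inaudible.
```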

3.3.2. Wiener Filter

Since the power of the modulated signal w(n) is significantly smaller compared to the power of the audio host signal a(n), numerous errors can occur during the demodulation of the embedded signal. To mitigate these errors, there is an additional stage at the output of the all-pole filter in the receiver.
The power ratio can be improved by approximating the PSD of the transmitted signal to that of the modulated signal. This is achieved by modulating a random signal with the modulation PN sequence and filtering the watermarked signal with a Wiener filter so that it approximates the modulated signal. The filter coefficients h(i) are determined by solving the Wiener–Hopf equations in (3). This approach uses a symmetric FIR Wiener filter, as described in [25].
\[
\begin{bmatrix}
\phi_{ss}(0) & \phi_{ss}(1) & \cdots & \phi_{ss}(2p) \\
\phi_{ss}(1) & \phi_{ss}(0) & \cdots & \phi_{ss}(2p-1) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{ss}(2p) & \phi_{ss}(2p-1) & \cdots & \phi_{ss}(0)
\end{bmatrix}
\begin{bmatrix}
h(-p) \\ h(-p+1) \\ \vdots \\ h(p)
\end{bmatrix}
=
\begin{bmatrix}
\phi_{ww}(-p) \\ \phi_{ww}(-p+1) \\ \vdots \\ \phi_{ww}(p)
\end{bmatrix}, \tag{3}
\]
where $\phi_{ss}$ and $\phi_{ww}$ are the autocorrelation functions of $s(n)$ and $w'(n)$, respectively, and $p$ sets the filter order (the symmetric filter has $2p+1$ coefficients).
This system works with 50 coefficients calculated within the time windows of Wp = 512 samples, consistent with the gain modulation stage.
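A sketch of this solve is shown below, exploiting the symmetric Toeplitz structure of (3); the signals are synthetic stand-ins, and with p = 25 the filter has 51 taps, close to the 50 coefficients used here:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def symmetric_wiener_fir(s, w, p=25):
    """Solve (3): Toeplitz(phi_ss(0..2p)) @ h = phi_ww(-p..p) for a
    symmetric FIR Wiener filter with 2p+1 taps."""
    N = min(len(s), len(w))
    phi_ss = np.array([np.dot(s[: N - k], s[k:N]) for k in range(2 * p + 1)]) / N
    phi_ww = np.array([np.dot(w[: N - abs(k)], w[abs(k):N])
                       for k in range(-p, p + 1)]) / N
    return solve_toeplitz(phi_ss, phi_ww)

rng = np.random.default_rng(1)
w_ref = rng.standard_normal(4096)               # regenerated modulated reference
s_rx = w_ref + 5.0 * rng.standard_normal(4096)  # watermark buried in "audio"
h = symmetric_wiener_fir(s_rx, w_ref)           # 51 taps for p = 25
```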

4. System Working with Live Audio

This system is designed to process audio in real-time while minimizing latency, as all audio signal processing stages introduce delays. Figure 3 shows a block diagram of the processes involved in a real-time audio environment.
Although the audio signal is generated and transmitted in real-time, the watermark message itself is not created or recovered in real-time. The distribution process lies beyond the scope of this work, and the latency introduced by the driver, the analog-to-digital converter (ADC), and the digital-to-analog converter (DAC) is beyond the control of this system. Consequently, the system is specifically designed to achieve low latency during embedding and to operate with a minimal buffer size.

4.1. Working with Two Different Buffer Sizes

The system presented in [22] obtains the masking threshold through the psychoacoustic model used in [24]. To compute the masking threshold with adequate frequency accuracy, the model requires 512 samples, so the system can use a buffer of that size. However, to further reduce latency, the system can also work with a buffer size of 256 samples. In this case, the system stores the previous 256 samples and combines them with the current buffer before computing the filter response. The first frame carries no embedded message, which reduces the capacity, although this impact diminishes for longer audio signals; the use of delayed samples also affects imperceptibility. Figure 4 shows this process, and Figure 5 compares the masking thresholds obtained with the different buffer sizes.
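The buffering scheme can be sketched as follows, with a synthetic signal standing in for the live input and illustrative names throughout:

```python
import numpy as np

FS, HOP, WIN = 44100, 256, 512   # 256-sample buffers, 512-sample analysis window

def stream_frames(audio):
    """Yield (analysis_window, current_block): each new 256-sample block is
    paired with the previous one to form the 512 samples the psychoacoustic
    model needs. The very first block is buffered only, with no watermark."""
    prev = None
    for start in range(0, len(audio) - HOP + 1, HOP):
        block = audio[start:start + HOP]
        window = None if prev is None else np.concatenate([prev, block])
        yield window, block
        prev = block

audio = np.random.default_rng(2).standard_normal(FS)   # 1 s stand-in signal
for window, block in stream_frames(audio):
    if window is None:
        continue    # first 256 samples: buffer only, nothing embedded
    # masking threshold would be computed from `window`; watermark added to `block`
```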
Since the watermark's imperceptibility is a subjective characteristic, it is evaluated using the Perceptual Evaluation of Audio Quality (PEAQ) algorithm [26], which compares the original audio signal with the watermarked signal. The algorithm returns a parameter called the objective difference grade (ODG), which takes values from −4 to 0. A value of −4 indicates that changes to the audio signal are perceptible and annoying, while a value of 0 indicates that the changes are imperceptible.
Imperceptibility was tested on four host signals using different music genres. The results are shown in Table 1, where it can be observed that the difference in ODG values does not exceed 0.2, and no value below −1 is obtained.

4.2. Real-Time Test

The system was tested by simulating a real-time environment using MATLAB R2020b software. The simulation does not include the distribution process, as it lies beyond the scope of this work. An audio interface capable of simultaneous input and output was required. The M-Audio M-Track Solo interface, which connects via USB, was chosen for the simulation. The system operates with a sampling frequency of 44.1 kHz and supports two buffer sizes: 512 and 256 samples.
The hardware specifications used during testing include a laptop manufactured by HP and sourced from Mexico. The equipment runs the Windows 10 operating system and is equipped with an Intel Core i5-7200U processor @ 2.5 GHz and 8 GB of RAM.
Latency was measured using two different methods. The first method is based on [9], where latency is measured in a loop. The audio signal is read from a file stored in a buffer to simulate a live source. A block diagram of this process is presented in Figure 6.
After reading the audio signal file, the latency of the embedding process is determined using a correlation function between the input and output signals. Figure 7 shows an example of this process.
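The correlation-based measurement can be sketched as follows; the loopback capture is simulated here with a known delay, whereas the actual tests recorded the physical input and output through the audio interface:

```python
import numpy as np

def latency_ms(reference, capture, fs=44100):
    """Latency as the lag of the cross-correlation peak between the signal
    sent to the output (reference) and the signal captured at the input."""
    corr = np.correlate(capture, reference, mode="full")
    lag = np.argmax(corr) - (len(reference) - 1)    # delay in samples
    return 1000.0 * lag / fs

fs = 44100
reference = np.random.default_rng(3).standard_normal(8192)
delay = int(round(0.03061 * fs))                    # emulate a 30.61 ms loop
capture = np.concatenate([np.zeros(delay), reference])
print(f"{latency_ms(reference, capture, fs):.2f} ms")   # ~30.61 (cf. Table 2)
```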
The second method employs an external audio source. A square pulse from the input signal is used as a reference, and the difference between the input and output signals is measured using an oscilloscope. Figure 8 and Figure 9 present the connection diagram and an example of the oscilloscope measurement, respectively.
Latency was measured with and without the watermarking process using both methods. Table 2 shows the average values obtained from the tests. Since the measured values depend not only on the hardware but also on the tasks and workload assigned to the computing device, and given that some processes cannot be controlled by the user, 20 measurements were taken to determine each value. Of these, 10 measurements were performed consecutively, while the other 10 were taken after a significant time interval, during which other tasks were assigned to the computing device. For this reason, the standard deviation is presented alongside the results.
The latency values differ between the two methods. However, neither method shows a significant increase in latency due to the watermarking process, as the differences are negligible given the time resolution of both approaches.

5. Comparison of Results

Table 3 compares performance criteria results from the literature review. Computational complexity is not included in the comparison, as the reviewed methods are designed for scenarios where the audio signal is available a priori. This precondition makes it impossible to directly compare their results with the latency tests conducted in this work, which simulate real-time signal acquisition.
The perceptual quality metric for imperceptibility is the ODG obtained through the method described in [26]. The ODG values for all reviewed works are above −1, which, according to [26], indicates good quality. While the proposed system does not achieve the best ODG value, its average of −0.583 is slightly better than the mean of the reported works, which is approximately −0.625.
The embedding capacity is reported in bits per second (bps) that can be inserted into the host signal. As shown in Table 3, works such as [4,11,15,16] exhibit high embedding capacities exceeding 1000 bps. Although the proposed system does not prioritize high embedding capacity, it maintains a balance among design trade-offs. Its capacity remains well above 20 bps, which, according to [4], is the minimum advisable value.
Robustness testing was performed by calculating the bit error rate (BER) after an MP3 compression attack. This test was chosen because MP3 compression is one of the most common attacks on audio signals. The proposed system demonstrated robustness in this test, further validating its performance.
As noted in Section 1 and according to [6], security measures are not commonly reported in the literature. For this reason, the comparison focuses on whether the reviewed works report using a key for watermark recovery. The proposed system addresses security by using the polynomial generators of the PN sequence employed in spread spectrum modulation and demodulation as the key.

6. Conclusions

The tests, conducted using a combination of software and hardware, validated the results and revealed different latency values for the two methods. However, in both cases, the latency introduced by the proposed method was consistently minimal, never exceeding 3 ms. It is important to note that while latency is influenced by the computer’s processing capacity, the tests were performed on an average specification system, as detailed in the report. This underscores the adaptability of the proposed system, making it suitable for implementation on a wide range of currently available computing devices.
The latency results obtained through software measurement align closely with the recommendations for real-time musician interaction applications. Although the latency values obtained via hardware measurements were slightly higher, this system still holds significant potential for real-time audio systems. It can be effectively employed in less demanding applications, such as conferences or music distribution without interaction, opening up a wide range of practical applications.
The embedding process introduces minimal additional latency, which, in some cases, could not be detected due to the resolution limits of the test equipment.
The system's ability to function with a buffer size of 256 samples is a significant advantage, particularly for real-time performance. As demonstrated in the measurements and supported by the literature, the number of buffered samples plays a crucial role in latency. This gives confidence in the system's performance in real-time scenarios.
The proposed system exceeds the minimum requirements reported in the literature regarding imperceptibility, capacity, and robustness. While the comparison shows that, particularly for capacity, the proposed system’s values are lower than those of some other works, the primary goal of this study is real-time performance. Therefore, computational complexity and imperceptibility were prioritized in balancing trade-offs. This means that while the system excels in real-time performance, it may not be the best choice for applications that require high capacity.
Future work will focus on further reducing the buffer size to minimize latency without excessively compromising other performance criteria. The aim is to achieve latency values below 25 ms, as recommended for musician interaction applications. Additionally, given that the tests confirm that the algorithm introduces very low latency, future improvements could explore a trade-off by increasing computational complexity to enhance capacity and robustness. Efforts will be directed toward improving robustness against a broader range of attacks, particularly desynchronization attacks.

Author Contributions

Conceptualization, C.J.S.-C. and G.J.D.; methodology, C.J.S.-C.; formal analysis, C.J.S.-C.; investigation, C.J.S.-C.; resources, C.J.S.-C.; writing—original draft preparation, C.J.S.-C.; writing—review and editing, G.J.D.; visualization, C.J.S.-C.; supervision, G.J.D.; project administration, C.J.S.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nematollahi, M.A.; Vorakulpipat, C.; Rosales, H.G. Digital Watermarking Techniques and Trends, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2017.
  2. Lin, Y.; Abdulla, W.H. Audio Watermark: A Comprehensive Foundation Using MATLAB; Springer: Cham, Switzerland, 2015.
  3. Bajpai, J.; Kaur, A. A literature survey—Various audio watermarking techniques and their challenges. In Proceedings of the 2016 6th International Conference—Cloud System and Big Data Engineering (Confluence), Noida, India, 14–15 January 2016; pp. 451–457.
  4. Pourhashemi, S.M.; Mosleh, M.; Erfani, Y. A novel audio watermarking scheme using ensemble-based watermark detector and discrete wavelet transform. Neural Comput. Appl. 2021, 33, 6161–6181.
  5. Hua, G.; Huang, J.; Shi, Y.Q.; Goh, J.; Thing, V.L.L. Twenty years of digital audio watermarking—A comprehensive review. Signal Process. 2016, 128, 222–242.
  6. Yong, X.; Hua, G.; Yan, B. Digital Audio Watermarking: Fundamentals, Techniques and Challenges; Springer: Singapore, 2017.
  7. Moscatelli, R.; Stahel, K.; Kraneis, R.; Werner, C. Why real-time matters: Performance evaluation of recent ultra-low latency audio communication systems. In Proceedings of the 2024 IEEE 21st Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 6–9 January 2024; pp. 77–83.
  8. Rottondi, C.; Chafe, C.; Allocchio, C.; Sarti, A. An overview on networked music performance technologies. IEEE Access 2016, 4, 8823–8843.
  9. Measure Audio Latency. MathWorks. Available online: https://www.mathworks.com/help/audio/ug/measure-audio-latency.html (accessed on 11 May 2022).
  10. Hu, H.-T.; Chang, J.-R.; Lin, S.-J. Synchronous blind audio watermarking via shape configuration of sorted LWT coefficient magnitudes. Signal Process. 2018, 147, 190–202.
  11. Wu, Q.; Ding, R.; Wei, J. Audio watermarking algorithm with a synchronization mechanism based on spectrum distribution. Secur. Commun. Netw. 2022, 2022, 2617107.
  12. Pourhashemi, S.M.; Mosleh, M.; Erfani, Y. Audio watermarking based on synergy between Lucas regular sequence and Fast Fourier Transform. Multimed. Tools Appl. 2019, 78, 22883–22908.
  13. Wang, S.; Yuan, W.; Unoki, M. Multi-subspace echo hiding based on time-frequency similarities of audio signals. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2349–2363.
  14. Erfani, Y.; Pichevar, R.; Rouat, J. Audio watermarking using spikegram and a two-dictionary approach. IEEE Trans. Inf. Forensics Secur. 2017, 12, 840–852.
  15. Su, Z.; Zhang, G.; Yue, F.; Chang, L.; Jiang, J.; Yao, X. SNR-constrained heuristics for optimizing the scaling parameter of robust audio watermarking. IEEE Trans. Multimed. 2018, 20, 2631–2644.
  16. Elshazly, R.; Nasr, M.E.; Fouad, M.M.; Abdel-Samie, F.E. Intelligent high payload audio watermarking algorithm using color image in DWT-SVD domain. J. Phys. Conf. Ser. 2021, 2128, 012019.
  17. Suresh, G.; Narla, V.L.; Gangwar, D.P.; Sahu, A.K. False-positive-free SVD based audio watermarking with integer wavelet transform. Circuits Syst. Signal Process. 2022, 41, 5108–5133.
  18. Zhao, J.; Zong, T.; Xiang, Y.; Gao, L.; Hua, G.; Sood, K.; Zhang, Y. SSVS-SSVD based desynchronization attacks resilient watermarking method for stereo signals. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 448–461.
  19. Yamni, M.; Daoui, A.; Karmouni, H.; Sayyouri, M.; Qjidaa, H.; Motahhir, S.; Aly, M.H. An efficient watermarking algorithm for digital audio data in security applications. Sci. Rep. 2023, 13, 18432.
  20. Nita, V.A.; Stanomir, D. Real-time temporal audio watermarking for mobile communication. In Proceedings of the 2020 13th International Conference on Communications (COMM), Bucharest, Romania, 18–20 June 2020; pp. 81–84.
  21. Haridas, T.; Upasana, S.D.; Vyshnavi, G.; Krishnan, M.S.; Muni, S.S. Chaos-based audio encryption: Efficacy of 2D and 3D hyperchaotic systems. Frankl. Open 2024, 8, 100158.
  22. Cruz, C.J.S.; Dolecek, G.J. Exploring performance of a spread spectrum-based audio watermarking system using convolutional coding. In Proceedings of the 2021 IEEE URUCON, Montevideo, Uruguay, 24–26 November 2021; pp. 104–107.
  23. Sklar, B. Digital Communications: Fundamentals and Applications; Prentice Hall: Upper Saddle River, NJ, USA, 2001.
  24. ISO/IEC 11172-3:1993; Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1,5 Mbit/s—Part 3: Audio. ISO/IEC: Geneva, Switzerland, 1993.
  25. Dutoit, T.; Marqués, F. How could music contain hidden information? In Applied Signal Processing: A MATLAB-Based Proof of Concept; Springer: New York, NY, USA, 2009; pp. 223–262.
  26. ITU-R. Recommendation BS.1387-1: Methods for Objective Measurements of Perceived Audio Quality; International Telecommunication Union: Geneva, Switzerland, 2001.
Figure 1. Block diagram of the SS-based system presented in [22]. (a) Stages in embedding; (b) stages in detection.
Figure 2. PSD of the original audio signal, the filter response, and the masking threshold, using the psychoacoustic model in [24] and an all-pole filter of 50 coefficients.
Figure 3. The audio signal in the real-time block diagram.
Figure 4. Process using a buffer size of 256 samples. The purple segment represents the first 256 samples, which are stored as a buffer and do not contain a watermark. The yellow arrow indicates the 256 buffer samples used to compute the perceptual mask for the red audio segment. The psychoacoustic model requires 512 samples to calculate the masking thresholds across the entire bandwidth.
Figure 5. Comparison of masking thresholds obtained using the current 512 samples (blue) and using the previous 256 samples together with the current 256 samples (red).
Figure 6. Diagram of the method to measure latency based on [9]. The operations performed inside the simulation software are enclosed within the dashed line. The blue line represents the physical connection for acquiring input and output signals to measure latency.
Figure 7. Example of latency measurement by correlation. The first graph shows the time-domain representation, with the host signal in blue and the watermarked signal in red. The latency introduced by the process is visible. The second graph displays the correlation plot of the two audio signals, with the time at which the maximum correlation peak occurs taken as the latency measurement. The third graph provides a zoomed-in view for a more detailed observation of the latency between the signals.
Figure 8. Diagram of the method to measure latency with an external audio source. The light brown lines indicate the connections between the oscilloscope and the audio source for the host signal and the watermarked signal. The green lines represent the connection to the computer for performing the watermarking process, and the blue lines show the input of the host signal.
Figure 9. Example of latency measurement with the oscilloscope.
Table 1. Comparison of objective difference grade (ODG) values using the current 512 samples and using the delayed 256 samples.

Host Signal | Capacity (bps) | Samples Used | ODG | BER
Classical | 78.75 | 512 | −0.471 | 0
Classical | 78.75 | 256 | −0.501 | 0
Classical | 110.25 | 512 | −0.921 | 0
Classical | 110.25 | 256 | −0.921 | 0
Jazz | 78.75 | 512 | −0.530 | 0
Jazz | 78.75 | 256 | −0.530 | 0
Jazz | 110.25 | 512 | −0.561 | 0
Jazz | 110.25 | 256 | −0.749 | 0
Pop | 78.75 | 512 | −0.398 | 0
Pop | 78.75 | 256 | −0.539 | 0
Pop | 110.25 | 512 | −0.451 | 0
Pop | 110.25 | 256 | −0.468 | 0
Rock | 78.75 | 512 | −0.161 | 0
Rock | 78.75 | 256 | −0.161 | 0
Rock | 110.25 | 512 | −0.194 | 0
Rock | 110.25 | 256 | −0.194 | 0
Table 2. Results of the latency test.

Method | Buffer Size (Samples) | Watermarking | Latency (ms) | Standard Deviation (ms)
1 | 512 | No | 30.08 | 0
1 | 512 | Yes | 30.61 | 0
1 | 256 | No | 22.63 | 0
1 | 256 | Yes | 22.63 | 0
2 | 512 | No | 114.867 | 9.799
2 | 512 | Yes | 117.68 | 3.44
2 | 256 | No | 97.29 | 0.99
2 | 256 | Yes | 98.36 | 1.03
Table 3. Comparison of results in the performance criteria except for computational complexity.

Reference | Imperceptibility (ODG) | Capacity (bps) | Robustness to MP3 (BER) | Security
[4] | −0.354 | 1225 | 0 | NR
[9] | −0.568 | 6.13 | 0.08 | Yes
[10] | −0.456 | 40 | 0.007 | NR
[11] | −0.915 | 1000–8000 | 0.019 | NR
[12] | −0.902 | 4–128 | NR | Yes
[13] | −0.44 | 177 | 0 | Yes
[14] | −0.6422 | NR | NR | Yes
[15] | NR | 3843.2 | 0 | Yes
[16] | −0.52 | 6553.6 | NR | Yes
[17] | −0.85 | 70 | 0 | NR
[18] | NR | 196 | 0 | Yes
[19] | NR | 91.1926 | 0 | NR
Proposed | −0.583 | 110.25 | 0 | Yes

NR = not reported.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
