Synchronization of Acoustic Signals for Steganographic Transmission

Steganography is a technique that makes it possible to hide additional information (payload) in the original signal (cover work). This paper focuses on hiding information in a speech signal. One of the major problems with steganographic systems is ensuring synchronization. The paper presents four new and effective mechanisms that allow achievement of synchronization on the receiving side. Three of the developed methods of synchronization operate directly on the acoustic signal, while the fourth method works in the higher layer, analyzing the structure of the decoded steganographic data stream. The results of the research concerning both the evaluation of signal quality and the effectiveness of synchronization are presented. The signal quality was assessed based on both objective and subjective methods. The conducted research confirmed the effectiveness of the developed methods of synchronization during the transmission of steganographic data in the VHF radio link and in the VoIP channel.


Introduction
This paper is devoted to acoustic steganography, and more precisely to the hidden synchronization of acoustic steganographic channels. An analysis of scientific publications in this field shows that the topic remains relevant and that the algorithms of acoustic steganography are constantly being improved. However, in many cases the authors of published solutions ignore the significant problem of synchronizing the signal in which data is embedded, often assuming perfect synchronization. For practical implementations of steganographic systems, this approach is too great a simplification, because achieving synchronization is a necessary condition for the effective extraction of the payload [1,2].
Data transmission in a steganographic system is inextricably linked with the issue of synchronization. In the absence of synchronization mechanisms, the moment at which the steganographic data extraction procedure should start cannot be determined unambiguously. The received data then has a random character, which amounts to a low efficiency of hidden transmission.
The bit error rate (BER) was adopted as the measure of the efficiency of steganographic data transmission [3]. The use of hidden synchronization methods should therefore yield low BER values, which, in combination with error detection and correction codes, will enable error-free transmission of the payload.
The use of hidden synchronization methods may, but need not, be associated with a deterioration in the quality of the cover work. It is therefore reasonable to search for synchronization methods that do not significantly degrade the quality of the signal carrying the payload. The cover work (original signal) with an embedded payload is often called a Stego Object or Stego Work [1,2].
The paper presents four unique mechanisms that make it possible to achieve synchronization on the receiving side. Three of the developed methods of synchronization operate directly on the acoustic signal, while the fourth method works in the higher layer, analyzing the structure of the decoded steganographic data stream.

Embedding and Extraction Algorithm
In order to study synchronization methods, it is necessary to have a mechanism for embedding and extracting steganographic data. Among the many methods presented in Section 2, one algorithm was selected for further analysis. The algorithm is presented in [26,28] and described in detail in [45]. This method uses a narrowband speech signal as a carrier of steganographic data. Additionally, the authors showed that the method is resistant to a number of factors that degrade the steganographic signal during its transmission over the VoIP link.
For the purposes of this paper, minor changes were introduced to the algorithm. The signal frame size was set to 192 samples (24 ms). Moreover, the masking curve was determined using the psychoacoustic model of the MPEG-1 standard [46]. Hiding a single bit of information in the steganographic encoder consists of a bipolar modification of the amplitude spectrum of the original signal in two adjacent signal frames. The steganographic data transmission rate was 20.83 bit/s. The last modification of the original algorithm consisted of adding a feedback loop and a local decoder in the transmitting part, whose task is to constantly check whether the information bit embedded in the steganographic signal can be correctly extracted. In the steganography literature, such solutions are referred to as informed sender algorithms or "dirty paper codes" [47]. If an error is detected, a coefficient C_i is determined at the output of the local decoder for the curve SMR_i(k) (Equation (1)). In the feedback loop, the value of C_i is chosen such that the instantaneous signal value at the output of the local decoder, which is also the average value R of the previous instantaneous values, exceeds the specified threshold K_min.
It should be emphasized that the greater the value of the signal at the output of the local decoder, the greater the energy of the watermark signal, and therefore the more resistant it is to possible disturbances. At the same time, a higher energy of the watermark signal makes it "audible" to the user of the system.
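The stated transmission rate follows directly from the frame parameters given above. The short check below is only an arithmetic illustration; the constant names are chosen here for readability and do not come from the original algorithm.

```python
# Values quoted in the text
FRAME_SAMPLES = 192        # samples per signal frame
FRAME_DURATION_S = 0.024   # 24 ms per frame
FRAMES_PER_BIT = 2         # one bit is hidden in two adjacent frames

# Sampling rate implied by the frame size and duration
sample_rate_hz = FRAME_SAMPLES / FRAME_DURATION_S

# One bit every two frames, i.e. every 48 ms
bit_rate = 1.0 / (FRAMES_PER_BIT * FRAME_DURATION_S)

print(f"sampling rate: {sample_rate_hz:.0f} Hz")
print(f"bit rate: {bit_rate:.2f} bit/s")
```

The implied sampling rate of 8 kHz is consistent with the narrowband speech carrier mentioned above.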

Technique Development and Implementation
The synchronization methods, in conjunction with the data embedding and extraction procedure described in Section 3, should allow hidden data transmission in the selected telecommunications channel. The first method, Monotonic Phase Correction, and the second, Direct Spread Spectrum, consist of constructing a synchronizing signal and adding it to the steganographic signal. The third method, Pattern Insertion Detection, consists of inserting a synchronizing marker into the speech signal preceding the steganographic transmission. The fourth and last method, Minimal Error Synchronization, consists instead of the appropriate preparation of the steganographic information.

Monotonic Phase Correction
The synchronizing signal synthesis system is shown in Figure 1. The input signal here is a Stego Object (a signal in which steganographic information is embedded). The SMR(k) masking curve was determined based on the psychoacoustic model of the MPEG-1 standard [46] according to the procedure described in [48]. For each frame and each harmonic component, the value of the correction factor was determined in accordance with the relationship given in Equation (1),
where: i is the signal frame number, k is the harmonic number, SPL_i(k) is the sound pressure level for the i-th original signal frame, LT_min,i is the minimum masking threshold for the i-th original signal frame, and C_i is an additional optional correction factor.
In an OFDM (Orthogonal Frequency Division Multiplexing) block, a signal is formed which is the sum of 14 harmonic components. The OFDM signal is contained in the bands from 375 Hz to 500 Hz and from 3041.7 Hz to 3166.7 Hz. Figure 2 shows a single OFDM signal frame and the corresponding amplitude spectrum. The OFDM signal frame duration is 48 ms. The OFDM signal phase is set separately for two groups of harmonics: the pilot harmonics, k ∈ {1, 4, 7, 8, 11, 14}, and the synchronization harmonics, k ∈ {2, 3, 5, 6, 9, 10, 12, 13}.
Figure 3 shows the synchronization system. The input signal x_s,syn(n) is fed first to the input of the phase angle scanner. The task of the phase angle scanner is to determine the value of the phase angle jitter [49,50]. This jitter may arise from differences in accuracy between the clocks that drive the sampling circuits in the steganographic signal transmitter and receiver.
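The band limits and the 48 ms frame duration quoted above are mutually consistent with 14 orthogonal harmonics, which can be checked numerically. The sketch below assumes an 8 kHz sampling rate (implied by 192 samples per 24 ms) and places the harmonics on the DFT grid of a 384-sample frame. The unit amplitudes and zero phases are placeholders, not the values used by the authors.

```python
import numpy as np

FS = 8000                  # Hz; follows from 192 samples per 24 ms
FRAME_S = 0.048            # OFDM frame duration from the text
N = round(FS * FRAME_S)    # 384 samples per OFDM frame
DF = FS / N                # spacing of orthogonal harmonics, about 20.83 Hz

# DFT bins whose frequencies fall inside the two stated bands
# (0.1 Hz of slack absorbs the rounding of the band edges in the text)
bins = [b for b in range(N // 2)
        if 375 - 0.1 <= b * DF <= 500 + 0.1
        or 3041.7 - 0.1 <= b * DF <= 3166.7 + 0.1]
freqs = [b * DF for b in bins]

# One frame with placeholder amplitudes (1.0) and phases (0 rad)
n = np.arange(N)
frame = sum(np.cos(2 * np.pi * b * n / N) for b in bins)

print(len(bins), [round(f, 1) for f in freqs])
```

Seven harmonics fall in each band, which gives the 14 components mentioned in the text. The 384-sample frame length also matches the 0 to 383 sample search range used later for time synchronization.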
The next stage of the synchronizing system operation is the detection of pilot spectral lines. This procedure consists of checking whether a given pilot spectral line, after its phase angle has been corrected by the determined jitter correction, has a phase angle of zero. If the number of pilot spectral lines thus detected is greater than or equal to 4, the input is assumed to be a steganographic signal and the algorithm moves to the time synchronization step.
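The pilot detection rule can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the jitter correction is modeled as one common angle subtracted from every pilot phase, and the tolerance `tol` for "phase equal to zero" is a hypothetical parameter that the text does not specify. Only the threshold of 4 pilot lines comes from the text.

```python
import numpy as np

def count_pilots(phases, jitter, tol=0.1):
    """Count pilot lines whose jitter-corrected phase lies within tol radians of zero."""
    corrected = np.asarray(phases, dtype=float) - jitter
    wrapped = np.angle(np.exp(1j * corrected))   # wrap to (-pi, pi]
    return int(np.sum(np.abs(wrapped) < tol))

def is_stego(phases, jitter, min_pilots=4):
    """Decision rule from the text: at least 4 detected pilot lines."""
    return count_pilots(phases, jitter) >= min_pilots

# Six measured pilot phases (radians), four of them close to the jitter value
measured = [0.31, 0.29, 0.30, 0.33, 1.5, -2.0]
print(is_stego(measured, jitter=0.3))
```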
The time synchronization mechanism is based on the analysis of the cumulative phase of the signal. The cumulative phase is determined from a recursive equation in which i is the number of the analyzed signal frame, k is the harmonic number of the synchronization spectral line (the constant component has the index k = 0), and ϕ_i,k is the value of the phase angle of the k-th harmonic in the i-th frame. Figures 4 and 5 show the cumulative phase waveform for an exemplary steganographic signal in which a synchronizing signal was additionally embedded. The continuous line marks the course of the cumulative phase of the signal on the transmitting side (in the synthesis circuit), and the dashed lines mark the courses of the cumulative phases recorded on the receiving side (in the synchronizer circuit). Figure 4 shows the cumulative phase waveforms in the absence of synchronization, and Figure 5 shows them once synchronization is achieved. For the presented characteristics, the average ratio of the steganographic signal energy to the energy of the synchronizing signal, expressed in dB and determined segmentally (for 5 ms fragments), was 21.62 dB.
The time synchronization procedure consists of an iterative search for the signal shift at which the distance between the expected value of the cumulative signal phase and the measured value is smallest. Due to the periodicity of the OFDM signal, this minimum is searched for in the set of distances determined for offsets ranging from 0 to 383 samples. There are many different methods of determining the distance between data sets [51]. This work is limited to determining the synchronization using the Euclidean distance, the Mahalanobis distance [52,53], and the Fréchet distance [54,55]. Figures 6-8 show the total distance between the expected cumulative phase of the signal and the measured phase for the above-mentioned metrics. The figures also mark the minimum value of the determined distance. In the analyzed case, the minimum was reached for every metric at a signal shift of 184 samples.
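The offset search described above can be sketched on a synthetic signal. The example makes several assumptions that are not taken from the paper: the cumulative phase is modeled as a running sum of per-frame phase angles, only four hypothetical synchronization lines (`BINS`) are used, their reference phases are arbitrary, and only the Euclidean distance is implemented. The 184-sample delay is chosen to mirror the example in the text, not to reproduce it.

```python
import numpy as np

N = 384                               # samples per 48 ms OFDM frame at 8 kHz
BINS = np.array([19, 20, 22, 23])     # hypothetical synchronization lines (DFT bins)

def cumulative_phase(x, offset, n_frames):
    """Running sum over frames of the phase angles at BINS (assumed form)."""
    frames = x[offset:offset + n_frames * N].reshape(n_frames, N)
    phases = np.angle(np.fft.rfft(frames, axis=1)[:, BINS])
    return np.cumsum(phases, axis=0)

def find_shift(x, expected, n_frames):
    """Return the offset in 0..N-1 with the smallest Euclidean distance."""
    dists = [np.linalg.norm(cumulative_phase(x, d, n_frames) - expected)
             for d in range(N)]
    return int(np.argmin(dists))

# Synthetic check: a periodic multi-tone signal delayed by 184 samples
ref_phase = np.array([0.3, -0.5, 1.0, -1.2])   # arbitrary reference phases
n = np.arange(N)
frame = sum(np.cos(2 * np.pi * b * n / N + p) for b, p in zip(BINS, ref_phase))
received = np.roll(np.tile(frame, 5), 184)
expected = np.cumsum(np.tile(ref_phase, (3, 1)), axis=0)

shift = find_shift(received, expected, n_frames=3)
print(shift)
```

Because the test signal is periodic with the frame length, searching offsets 0 to 383 covers every possible alignment, which is the reason the text gives for limiting the search range.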

Direct Spread Spectrum
The synchronizing signal generation circuit is shown in Figure 9. The input signal here is a Stego Object.
The SMR(k) masking curve was determined as described in the Monotonic Phase Correction method.
In an OFDM block, a signal is formed which is the sum of 6 harmonic components. The OFDM signal is contained in the band from 416.7 Hz to 500 Hz and from 3083.3 Hz to 3166.7 Hz. The OFDM signal frame duration is 24 ms. All harmonics of the OFDM signal act as pilot spectral lines. In the implementation, the value of the phase angle was assumed to be equal to 0.
The second component of the synchronization signal, next to the OFDM signal, is the DSS signal (Direct Spread Spectrum). The block scheme of the DSS signal generation system is shown in Figure 10.

The first stage of DSS signal synthesis is the generation of a pseudo-random sequence with appropriate properties [56]. Gold sequences generated from primitive polynomials were used, with the initial condition [0000000000001]. The length of the Gold sequence used to generate the DSS signal was limited to 6096 symbols. The duration of a single symbol was set to T_c = 1 ms. The duration of the entire sequence is therefore T = 6.096 s.
In the next stage, the generated pseudo-random sequence is fed to the input of the filter block. It first passes through an interpolation filter with a root raised cosine (RRC) characteristic and then through a low-pass FIR (Finite Impulse Response) filter. These filters shape the pulses of the pseudo-random sequence and narrow the signal band. Figure 11 shows a fragment of the signal at the output of the low-pass filter. Additionally, the corresponding fragment of the pseudo-random sequence is marked (top picture, red dotted line).
The final step in generating the DSS signal is to shift the signal from the low-pass filter output to a higher range of audio frequencies, because frequencies below 300 Hz can be strongly suppressed during signal transmission in telecommunications links. The frequency of the carrier wave used is f_c = 2000 Hz.
Spread spectrum systems are characterized by two important parameters: the processing gain G and the interference margin M [57]. The processing gain determines the degree of spreading of the information signal spectrum:
$$G = 10 \log_{10}\left(\frac{T_b}{T_c}\right) \ \text{[dB]}$$
where: T_b is the duration of the data bit (synchronization bit) and T_c is the duration of the spreading sequence chip.
Assuming that the duration of the sync bit is equal to the duration of the spreading sequence, T_b = T = 6.096 s, the processing gain of the considered system is G = 37.85 dB. The obtained value of the processing gain meets the condition related to perceptual transparency.
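The quoted processing gain can be reproduced from the sequence parameters, using the standard definition of processing gain as ten times the base-10 logarithm of the ratio of bit duration to chip duration. The variable names below are illustrative only.

```python
import math

T_C = 1e-3          # chip duration: 1 ms (from the text)
N_CHIPS = 6096      # length of the spreading sequence (from the text)
T_B = N_CHIPS * T_C # sync bit lasts as long as the whole sequence

gain_db = 10 * math.log10(T_B / T_C)
print(f"T_b = {T_B:.3f} s, G = {gain_db:.2f} dB")
```

The result agrees with the value G = 37.85 dB given in the text.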
The interference margin M is a measure of the receiver's immunity to interference. It determines the maximum ratio of the noise power to the signal power at the receiver input at which the minimum ratio of bit energy to noise power spectral density, E_b/N_0, that ensures an acceptable error probability is still obtained [56].
After the DSS and OFDM signals are generated, they are fed to the input of the amplitude correction circuit and then summed with the steganographic signal. Both the OFDM signal and the DSS signal are corrected based on the SMR(k) masking curve. In addition, a condition related to the correction of the signal energy is imposed on the DSS signal. The correction factor C_i (Equation (1)) is chosen such that this condition is satisfied for each signal frame, where: x_s,i(n) is the i-th frame of the steganographic signal, which acts as a noise signal for the DSS signal, and DSS_i(n) is the i-th frame of the DSS signal.
In the receiving part, the synchronization procedure is based on a system similar to that shown in Figure 3. The difference lies in the operating principle of the time synchronization block.
The time synchronization procedure consists of determining the value of the cross-correlation function between the received signal (corrected by the determined phase angle correction) and the reference signal generated on the receiving side. The received signal is "shifted" relative to the reference signal. The signals are considered synchronized when the value of the cross-correlation function reaches a maximum for τ = 0 s. Figure 12 shows an example of the course of the cross-correlation function determined in the time synchronization block.
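The cross-correlation search can be sketched as follows. The example is deliberately simplified and rests on several assumptions: a random ±1 sequence stands in for the Gold sequence (the generator polynomials are not reproduced here), rectangular chips replace the RRC and FIR pulse shaping, the sequence is shortened to 512 chips, and the channel is modeled as a pure delay with additive Gaussian noise. Only the 1 ms chip duration, the 2000 Hz carrier, and the 8 kHz sampling rate follow from the text.

```python
import numpy as np

FS = 8000      # Hz; follows from 192 samples per 24 ms
SPC = 8        # samples per 1 ms chip at 8 kHz
F_C = 2000     # Hz, carrier frequency from the text

rng = np.random.default_rng(0)
chips = rng.choice([-1.0, 1.0], size=512)   # stand-in for the Gold sequence
baseband = np.repeat(chips, SPC)            # rectangular chips, no RRC shaping
n = np.arange(baseband.size)
reference = baseband * np.cos(2 * np.pi * F_C * n / FS)

def estimate_delay(received, reference):
    """Lag in samples at which the cross-correlation with the reference peaks."""
    corr = np.correlate(received, reference, mode="valid")
    return int(np.argmax(corr))

# Received signal: the reference delayed by 137 samples and buried in noise
delay = 137
received = np.concatenate([np.zeros(delay), reference, np.zeros(300)])
received += rng.normal(0.0, 1.0, received.size)

print(estimate_delay(received, reference))
```

Shifting the received signal back by the estimated lag moves the correlation peak to τ = 0, which is the synchronization condition stated above.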

Pattern Insertion Detection
The synchronization method consists of inserting a synchronizing marker into the speech signal preceding the steganographic transmission. This method was developed for the needs of a steganographic system, which was designed to embed and extract data in a VoIP stream.
The principle of operation is based on the analysis of the signal that will be the information carrier. The start of steganographic transmission is determined by the detection of the speech signal. The presence of speech in the signal is detected on the basis of the analysis of the values of two parameters [58], the signal power and the ZCR (Zero Crossing Rate), given by Formulas (8) and (9),
where: x_i(n) is the i-th frame of the signal, w(n) = 0.54 − 0.46 cos(2πn/(N − 1)) is the Hamming window, and N is the number of samples in the signal frame.
The duration of a single frame was set to 24 ms (192 samples). The implementation assumes that a speech signal is present when the power value exceeds −50 dB and the zero-crossing coefficient is less than 0.5.
If three consecutive signal frames meet the above conditions, these frames are corrected according to the attenuation pattern, the characteristics of which are shown in Figure 13. The characteristics of the attenuation pattern were established empirically based on preliminary research of the method. Document [59] states that the permissible IP (Internet Protocol) packet loss during a conversation should not exceed 3%. This value additionally depends on the speech signal codec used during communication. In addition, it assumes the use of the Packet Loss Concealment (PLC) mechanism. The adopted attenuation pattern shape reduces the power of the speech signal in an 11 ms window, a value similar to the typical frame lengths of speech codecs used in VoIP. In three consecutive signal frames (72 ms), this reduction of the signal power occurs twice, see Figure 14.
The next stage, after the signal correction procedure (inserting a synchronizing marker into the signal), consists of embedding a portion of steganographic data in the speech signal, according to the algorithm described in Section 3. The data portion size was set to 16 bits. The algorithm then starts again from the detection of the speech signal.
In the receiving part, the synchronization procedure consists of continuously checking whether the currently analyzed signal fragment includes a synchronizing marker in its structure. This process is based on the analysis of the signal energy value and the number of zero crossings according to Formulas (8) and (9).
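The speech detection step that triggers marker insertion can be sketched as below. The exact forms of Formulas (8) and (9) are not reproduced in this text, so the power measure (mean power of the Hamming-windowed frame in dB) and the ZCR measure (fraction of adjacent sample pairs with a sign change) are assumptions. The thresholds of −50 dB and 0.5, the 192-sample frame, and the rule of three consecutive frames come from the text. The function names are illustrative.

```python
import numpy as np

N = 192                   # samples per 24 ms frame at 8 kHz
WINDOW = np.hamming(N)    # 0.54 - 0.46*cos(2*pi*n/(N-1))

def frame_power_db(frame):
    """Mean power of the windowed frame in dB (assumed form of the power measure)."""
    p = np.mean((frame * WINDOW) ** 2)
    return 10 * np.log10(p + 1e-12)   # small offset avoids log of zero

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change (assumed form of ZCR)."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def is_speech(frame, power_thr_db=-50.0, zcr_thr=0.5):
    """Thresholds from the text: power above -50 dB and ZCR below 0.5."""
    return frame_power_db(frame) > power_thr_db and zero_crossing_rate(frame) < zcr_thr

def marker_positions(signal):
    """Indices of frames that complete a run of three consecutive speech frames."""
    hits, run = [], 0
    for i in range(len(signal) // N):
        run = run + 1 if is_speech(signal[i * N:(i + 1) * N]) else 0
        if run == 3:
            hits.append(i)
            run = 0   # start looking for the next run
    return hits

# Two silent frames followed by three frames of a 200 Hz tone
t = np.arange(3 * N) / 8000
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
signal = np.concatenate([np.zeros(2 * N), tone])
print(marker_positions(signal))
```

In the full method, the three frames ending at each reported index would then be shaped by the attenuation pattern before the next 16-bit data portion is embedded.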

Minimal Error Synchronization
This method was developed for the needs of a steganographic system, which was designed to embed and extract data in a VoIP stream. The method was inspired by the cell delineation mechanism used in ATM (Asynchronous Transfer Mode) networks [60,61]. The purpose of the MES method is to recognize the steganographic transmission solely on the basis of the decoded bitstream, without the use of additional tags or unique sequences. The method of embedding and extraction of steganographic data remains unchanged as described in Section 3. It was assumed that the steganographic data extraction procedure would not know whether steganographic information was being transmitted at a given moment and that the extraction would always return a certain bitstream. Moreover, it was assumed that the steganographic data would be formed into a frame constructed in such a way that it would be possible to unambiguously recognize it in the bitstream after extraction, and that it would be resistant to 5% RTP packet loss.
The problem of recognizing a data structure in a bitstream is often solved by using a unique preamble or flag. However, in conditions of significant losses, and thus also distortions, such a mechanism cannot be used because it would generate incorrect frame recognition too often. Moreover, it is desirable that the data organization used should provide redundancy to repair bits corrupted due to RTP packet loss.
The transmission errors caused by the loss of RTP packets can be detected and corrected using error correction codes (ECC). There are many variants of such codes. For the purposes of the paper, it was decided to use BCH (Bose-Chaudhuri-Hocquenghem) codes. The choice of the BCH code was conditioned, on the one hand, by the requirement to correct the errors caused by the assumed percentage of lost RTP packets and, on the other hand, by the need to ensure the lowest possible information overhead. In addition, the ease of implementation of the target steganographic system was of great importance here, because the BCH encoding and decoding procedures are included in the Linux kernel.
BCH codes have strictly defined parameter values (n, k, t), where: n specifies the length (in bits) of the code vector, n = 2^m − 1, with m an integer, m ≥ 3; k specifies the length (in bits) of the information vector; t is the corrective ability of the code, i.e., the number of bit errors that can be corrected in one code vector.
To determine the appropriate variant of the BCH code, which will enable the protection of steganographic transmission in the VoIP channel with RTP packet loss at the level of 5%, simulation tests were carried out. Two VoIP channel models were designed in the Matlab/Simulink environment, including a model with the iLBC codec in the 15.2 kbit/s variant. In each run, the input was a speech signal with a duration of about 2 min, containing more than 2000 bits of payload. Packet loss was varied in the range from 0 to 5% in steps of 1%. The payload was extracted on the receiving side. In the next step, the maximum number of errors recorded in a given observation window was determined. The observation window was shifted over the received vector one bit at a time. Table 1 shows the maximum number of errors found in a received vector with a size of d bits.
Due to the specific values of the BCH code parameters, the codes listed in Table 2 were selected for further analysis. In addition, this table shows the steganographic data rate R after taking into account the code rate, and the minimum duration T of the signal that allows n bits of the code vector to be embedded in the signal.
The next stage of work on the method was to estimate the probability of errors of the first type, i.e., of recognizing a random bit sequence as a valid frame. To this end, 10^7 random bit sequences of length n were generated for each variant of the BCH code, and then it was checked whether the BCH algorithm would qualify such a sequence as a BCH code vector. The results are presented in Table 3. For codes with a code vector length of n = 63, the probability of first-type errors was considered too high. Two variants of the code with a length of n = 127 were selected for further analysis:
• n = 127, k = 50, t = 13;
• n = 127, k = 15, t = 27.
For the purposes of transmission, the code vector was interleaved.
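Two steps of this selection procedure can be approximated in a short Python sketch. The first function reproduces the sliding observation window over a vector of per-bit error flags. The second is an analytic counterpart to the random-sequence experiment, not the authors' procedure: it assumes a bounded-distance decoder that accepts exactly those words lying within Hamming distance t of some codeword, so the acceptance probability of a uniformly random n-bit word is 2^k · Σ_{i=0..t} C(n, i) / 2^n. A real decoder may behave differently, and the function names are hypothetical.

```python
from math import comb


def max_errors_in_window(error_flags, d):
    """Largest number of bit errors seen in any window of d consecutive bits.

    error_flags[i] is 1 if received bit i differs from the sent bit, else 0.
    The window is shifted one bit at a time, as described in the text.
    """
    if d <= 0 or d > len(error_flags):
        raise ValueError("window size must be between 1 and the vector length")
    current = sum(error_flags[:d])
    worst = current
    for i in range(d, len(error_flags)):
        current += error_flags[i] - error_flags[i - d]
        worst = max(worst, current)
    return worst


def false_acceptance_probability(n, k, t):
    """Probability that a uniformly random n-bit word is accepted.

    Assumes bounded-distance decoding: the 2**k decoding spheres of
    radius t are disjoint, so their volumes simply add up.
    """
    sphere_volume = sum(comb(n, i) for i in range(t + 1))
    return (2 ** k) * sphere_volume / (2 ** n)


def code_rate(n, k):
    """Fraction of transmitted bits that carry payload."""
    return k / n


if __name__ == "__main__":
    for n, k, t in [(127, 50, 13), (127, 15, 27)]:
        print(n, k, t, code_rate(n, k), false_acceptance_probability(n, k, t))
```

The sketch makes the trade-off visible: the variant with larger t tolerates more errors per code vector at the cost of a lower code rate.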
In the receiving part, the synchronization procedure consists of continuously checking whether the BCH decoder can recognize the data frame in the extracted bitstream. If the BCH decoder finds no errors, or detects and corrects the errors, it is assumed that synchronization is achieved. Otherwise, if the BCH decoder signals a decoding failure, the speech signal is shifted by a certain number of samples and the steganographic data extraction and BCH decoding procedures are repeated.
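The receiver loop described above can be sketched in Python as a search over signal shifts. This is only an illustration under stated assumptions: `find_sync`, `toy_extract` and `toy_decode` are hypothetical names, the "signal" in the example is already a list of bits, and the decoder is a toy stand-in that accepts words within Hamming distance t of a single known 13-bit word. It is not a BCH decoder; in a real system the two callables would wrap the steganographic extraction of Section 3 and an actual BCH decoder.

```python
def find_sync(signal, extract, decode, max_shift, step=1):
    """Shift the analysis position until the decoder accepts a frame.

    extract(signal, shift) returns the bits extracted at this shift;
    decode(bits) returns the corrected frame, or None on decoding failure.
    Returns (shift, frame) on success, or None if no shift works.
    """
    for shift in range(0, max_shift + 1, step):
        frame = decode(extract(signal, shift))
        if frame is not None:
            return shift, frame
    return None


# Toy stand-ins used only to exercise the loop (not the paper's algorithms).
TOY_CODEWORD = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]


def toy_extract(signal, shift):
    """Pretend extraction: the 'signal' is already a list of bits."""
    return signal[shift:shift + len(TOY_CODEWORD)]


def toy_decode(bits, t=1):
    """Accept a word only if it is within distance t of the known codeword."""
    if len(bits) != len(TOY_CODEWORD):
        return None
    errors = sum(a != b for a, b in zip(bits, TOY_CODEWORD))
    return list(TOY_CODEWORD) if errors <= t else None


if __name__ == "__main__":
    received = list(TOY_CODEWORD)
    received[7] ^= 1                       # one bit error on the channel
    stream = [0] * 5 + received + [0] * 4  # frame starts at position 5
    print(find_sync(stream, toy_extract, toy_decode, max_shift=9))
```

Because extraction and decoding are repeated for every candidate shift, the cost of the search grows with the number of shifts tried, which matches the computational burden noted for the method.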
It should be emphasized that the main disadvantage of the presented method is the high computational complexity related to the continuous operation of the steganographic decoder and the BCH decoder. On the other hand, it should also be noted that this is a method that does not interfere with the steganographic signal in any way. Therefore, there will be no deterioration in signal quality.

Signal Quality Assessment
The methods of assessing the quality of audio signals can be divided into two main groups: subjective and objective methods.
The objective assessment of the signal quality was carried out based on the ITU-T P.862 PESQ (Perceptual Evaluation of Speech Quality) [62]. Measurements were carried out using a dedicated MultiDSLA tester [63,64]. Documents ITU-T P.862 PESQ [62] and ITU-T P.862.3 [65] describe a number of requirements related to the selection of signals constituting the research material. The ITU-recommended test signals are contained in Annex B to ITU-T P.501 [66]. Samples for the Polish language were used in the research.
The MultiDSLA tester assesses the quality of the signals and returns the results in the form of several values:
• "raw" data (PESQ raw score or PESQ score);
• PESQ LQ (Listening Quality).
Raw scores range from −0.5 to 4.5. These data are then transformed into the remaining results, the values of which range from 1 to 4.5 and correspond to the values on the MOS scale [67][68][69].
The subjective evaluation of the quality of the signals was based on the recommendations of ITU-R BS.1116-3 [70]. As with the objective tests, this document describes a number of considerations on how to conduct the test. The study was conducted on a group of 20 students. A set of 16 test signals was used [71]. In the test procedure, the listener heard the original (undistorted) signal, marked as signal A, and two further recordings, marked as signals B and C, of which, in random order, one was the reference signal A and the other was the distorted signal subject to evaluation. These types of tests are often referred to as "ABC (ABX) tests" or "forced choice tests". The result of the assessment is a value on the SDG (Subjective Difference Grade) scale; the range of possible scores is from −4 to 4. A positive value means that the listener has incorrectly determined which of the two signals is distorted.
Figures 15 and 16 show the results of the objective evaluation of signal quality carried out on the basis of the ITU-T P.862 PESQ recommendation [62]. The line plotted in the figures sets the reference level and represents the quality rating obtained by embedding steganographic information in the original signal. All of the developed methods, except for the MES method, reduce the signal quality rating. However, this is a slight reduction, amounting to about 6% for both analyzed values of the coefficient Kmin.
Summing up, it is worth adding that in all studies the mean value on the MOS scale was greater than 3, which in the case of special applications, especially military ones, is a highly satisfactory result [72].
Figures 17 and 18 show the results of the conducted listening tests. Again, the straight line defines the reference threshold and represents the SDG score obtained from tests that compared the original signal and a signal where only the steganographic information was embedded. This result is in line with the SDG assessment for the MES method. For Kmin = 500, the individual methods of synchronization cause a deterioration of the SDG assessment in the range from 34% to 90% compared to the assessment for the MES method. However, it is worth noting that the determined average values do not fall below −1, which should be perceived as an inaudible signal distortion.
For Kmin = 1250 we can observe a much smaller relative decrease in the SDG rating, ranging from 15% to 40%. However, in this scenario, the benchmark is an SDG score of around −0.8. As a result, the SDG score for the DSS method slightly exceeds the value of −1.

Hidden Transmission Effectiveness Assessment
The aim of the study was to check whether the developed mechanisms of synchronization of acoustic signals allow steganographic transmission to be implemented in a telecommunications channel in which there are signal degrading factors. The research was carried out using two different methods of steganographic signal transmission between the sender and the recipient: a teletransmission system based on radio waves and VoIP Internet telephony.
Figure 19 shows a laboratory stand for steganographic transmission. The stand was built on the basis of two computers and two RRC 9211 radio stations.

Steganographic Transmission on the VHF Radio Link
The research was carried out using three different radio modes: AFF, DFF and FFH. The signals recorded on the receiving side were subjected to the synchronization procedure in accordance with the adopted synchronization method. Then, after obtaining the synchronization, the steganographic information was extracted. The bit error rate (BER) was adopted as a measure of the hidden transmission efficiency. The mean BER value and the 95% confidence interval (Student's t-distribution) were determined. For the MES method, the BER value was determined based on the value of the BCH decoder syndrome. The study included one of the variants of the MES method, with k = 50 and t = 13.
Additive noises occurring in the considered communication channel caused the distortion of the received signals. These distortions were so significant that they prevented the correct operation of the PID synchronization procedure. The obtained results are shown in Figures 20 and 21.
Increasing the Kmin factor value from 500 to 1250 reduces the average BER value from 7% to 3% for AFF mode and from 12% to 6% for DFF and FFH modes. The higher BER values for digital modes may be due to the fact that in these operating modes there is lossy compression of the speech signal related to CVSD encoding.
Comparing the synchronization methods, we can see that the bit error rate values obtained in a given operating mode and for a fixed value of Kmin are similar to each other. Generally, the differences in values do not exceed 3.5% when analyzing both mean values and maximum values for the 95% confidence interval.
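For reference, the BER statistics reported in this section can be computed along the following lines. This is a minimal sketch with hypothetical function names; it assumes one BER value per test recording and uses a small lookup of standard two-sided 95% critical values of Student's t-distribution (only a few degrees of freedom are included here).

```python
import math
import statistics

# Two-sided 95% critical values of Student's t-distribution,
# keyed by degrees of freedom (standard table values).
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262, 15: 2.131, 19: 2.093}


def bit_error_rate(sent_bits, received_bits):
    """Fraction of positions at which the received bit differs from the sent bit."""
    if len(sent_bits) != len(received_bits):
        raise ValueError("bit sequences must have equal length")
    errors = sum(a != b for a, b in zip(sent_bits, received_bits))
    return errors / len(sent_bits)


def mean_with_ci95(values):
    """Return (mean, half_width) of the 95% confidence interval for the mean."""
    n = len(values)
    t_crit = T_CRIT_95[n - 1]  # KeyError if this sample size is not tabulated
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    ber_per_recording = [0.02, 0.04, 0.06]  # made-up illustrative values
    print(mean_with_ci95(ber_per_recording))
```

With 16 test recordings the table entry for 15 degrees of freedom would apply.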

Steganographic Transmission in VoIP Channel
Steganographic transmission in the VoIP channel was made using the free and open source PJSIP library [73]. The application was compiled and run under the XUbuntu GNU/Linux operating system. The research was carried out for two variants of the network: LAN and WAN.
The research was carried out using three different standards of speech signal coding: PCMA, Speex and G.729. The same test signals were used as in the radio link tests [71].
The mean BER value and the 95% confidence interval (Student's t-distribution) were determined. The obtained results are shown in Figures 22-24. For PCMA, the synchronization was achieved, allowing for the decoding of steganographic information for all methods. The use of Speex and G.729 codecs prevented synchronization in the PID method. As in the case of the radio link, an increase in the Kmin coefficient value from 500 to 1250 causes a decrease in the average BER value.
For PCMA and LAN, the BER value with the 95% confidence interval does not exceed 1.2% for Kmin = 500 and 0.8% for Kmin = 1250. WAN transmission increases BER. The obtained values do not exceed 8.5% for Kmin = 500 and 6% for Kmin = 1250. Signal encoding using the Speex codec results in a BER level not exceeding 3% in the LAN and 12% and 10% in the WAN for Kmin = 500 and Kmin = 1250, respectively. Signal coding based on the G.729 codec translates into the largest number of errors during steganographic transmission in the VoIP channel. In the LAN, the BER value together with the 95% confidence interval does not exceed 9% (for the MPC and DSS methods), or the number of errors is less than the BCH code correction capacity (for the MES method). In the WAN, the BER values obtained were below 16% for the MPC method and below 14% for DSS. In both cases, Kmin = 1250. In the case of the MES method for the BCH code with parameters (k = 50, t = 13), regardless of the value of the Kmin coefficient, problems with achieving synchronization were noticeable. Changing the coding variant to (k = 15, t = 27) significantly improved the efficiency of synchronization, and increasing the value of the Kmin coefficient to 1250 allowed for 100% synchronization efficiency.

Conclusions
The paper describes four new mechanisms that allow synchronization in acoustic steganography systems. All of these methods have been tested against transparency, robustness, and data rate.
The presented research results regarding the objective and subjective assessment of the quality of signals in relation to the developed methods of synchronization confirm the initial assumption that the use of hidden synchronization of acoustic signals will not significantly deteriorate the quality of the signal being the information carrier.
The presented research results on steganographic transmission in real telecommunications channels allow us to conclude that the use of hidden synchronization of acoustic signals increases the efficiency of steganographic data transmission in a telecommunications channel with signal degrading factors.
Machine learning algorithms can help increase the effectiveness of acoustic synchronization mechanisms. These algorithms build a mathematical model from sample data, called the training set. Machine learning methods that may prove helpful in the synchronization recovery process include: Decision Tree Learning, for acquiring knowledge based on examples with numerous variants; Bayesian Learning, as probabilistic inference; and Instance-based Learning, for modelling the synchronization procedure based on previous sample solutions. Machine learning has already been used for synchronization recovery in Forward Error Correction enabled channels [74] and for the problem of network time synchronization [75]. Further work on synchronization in acoustic steganographic channels should also cover the implementation of machine learning algorithms.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to project restrictions.