ECSS: High-Embedding-Capacity Audio Watermarking with Diversity Reception

Digital audio watermarking is a promising technology for copyright protection, yet its low embedding capacity remains a challenge for widespread applications. In this paper, the spread-spectrum watermarking algorithm is viewed as a communication channel, and the embedding capacity is analyzed and modeled with information theory. Following this embedding capacity model, we propose the extended-codebook spread-spectrum (ECSS) watermarking algorithm to heighten the embedding capacity. In addition, the diversity reception (DR) mechanism is adopted to optimize the proposed algorithm to obtain both high embedding capacity and strong robustness while the imperceptibility is guaranteed. We experimentally verify the effectiveness of the ECSS algorithm and the DR mechanism, evaluate the performance of the proposed algorithm against common signal processing attacks, and compare the performance with existing high-capacity algorithms. The experiments demonstrate that the proposed algorithm achieves a high embedding capacity with applicable imperceptibility and robustness.


Introduction
With the booming development of the mobile Internet, information dissemination has changed. Internet firms such as Netflix, YouTube, and Apple have replaced those outdated CD stores-fewer and fewer music is traded via CDs and tapes. This much easier dissemination has led to a prosperous music market. However, beneath the appearance of prosperity, copyright concerns have been getting worse-a piece of music can be copied and distributed too quickly on the Internet. According to the IFPI (International Federation of the Phonographic Industry) report [1], almost a third (29%) of respondents admitted using unlicensed or illegal sources to listen to or download music. There are laws in place to protect music copyrights, but determining infringement remains a challenge.
Engineers and researchers have also come up with several technologies, among which digital audio watermarking is quite representative [2], as a technology that can be used to protect the copyright of music or conference recordings. Digital audio watermarking embeds the copyright identifier, also known as a watermark, imperceptibly into the host audio. When a copyright conflict occurs, it can be verified by extracting the watermark from the watermarked audio.
The performance of a watermarking algorithm is measured by three main criteria: imperceptibility, robustness and embedding capacity. The imperceptibility refers to the data fidelity of the watermarked audio, indicating that the embedded watermarks can only introduce perceptually indistinguishable distortion to the host audio; the robustness refers to the reliability of the watermark that can be effectively extracted when watermarked audio has undergone some attacks, evaluated by the accuracy of the watermark extraction; the embedding capacity refers to the amount of watermark information that can be extracted 1.
We construct the capacity model of the SS-based watermarking algorithm, which reveals how the embedding capacity is dependent on codebook size and robustness. 2.
We propose the ECSS algorithm and analyze its embedding capacity and the accuracy of the watermark extraction. By extending the codebook, each sequence can carry more than one watermark bit, which improves the embedding capacity with applicable watermark robustness.

3.
We adopt a diversity reception mechanism for optimizing the ECSS. This mechanism repeatedly embeds the watermark into different regions of the host audio and aggregates the HSI from all repeated regions when extracting the watermark, reducing the randomness of the HSI and achieving a higher embedding capacity and stronger robustness.
The remainder of this paper is organized as follows. Section 2 describes the works related to embedding capacity on watermarking and information hiding. Section 3 introduces the communication model of SS-based algorithms and formulates the embedding capacity. Section 4 proposes the ECSS algorithm and analyzes the embedding capacity and the accuracy of the watermark extraction. Section 5 introduces the DR mechanism and Section 6 presents the specific watermarking system (In this paper, two terms, watermarking algorithm and system, are used. The former is concerned with the mathematical model of the embedding and extraction stages, while the latter is the concrete implementation, which also contains modules such as the embedding location selection and the domain transformation.) flow. Section 7 is devoted to experiments and related discussions. The work of this paper is summarized in Section 8.

Related Works
Researchers have proposed numerous algorithms to heighten the embedding capacity of audio watermarking, which focus on proposing novel multibit embedding approaches. These algorithms can be roughly categorized into QIM-based (quantization index modulation), and SS-based embedding approaches. QIM-based embedding approaches achieve the effectiveness of multibit embedding by delicately constructing the quantized features or using more quantization levels. SVD-A-QIM [12] converted audio clips into feature matrices and embedded the watermark into the singular values of matrices; Wang [13] observed that the phase of an audio frame distributed uniformly in [0, 2π] and applied the phase shifting for watermarking. SS-based watermarking algorithms construct an ensemble of pseudorandom sequences to express multibit watermarks or choose shorter sequences to heighten the embedding capacity. Xiang adopted circularly shifted pseudonoise codes [14] or orthogonal pseudorandom sequences [15] to implement multibit embedding approaches. Zhang [16] used a correlation-and-bit-aware embedding approach to shorten the vector length required for each bit. All these efforts to improve the embedding capacity have been successful from the technical level, but there is a lack of theoretical analysis of the concept of embedding capacity. Since the SS-based algorithms are more similar to the communication system in detail, this paper employs information theory to analyze the SS algorithm.
Information theory, which provides several theoretical foundations, is widely used in the design of watermarking algorithms [6,[17][18][19]. These watermarking algorithms model the HSI and attacks as a communication channel whose output can be characterized by a conditional probability. The distribution of the host data in the transform domain is often assumed to be a generalized Gaussian distribution [20] or a Weibull distribution [21]. One of the most influential works on embedding capacity [6] treated the embedding stage as encoding processing, while the attacks that audio undergoes were treated as a communication channel A(ŷ|y). If the attacks could be explicitly described as a conditional probability distribution, this would provide theoretical guidance for designing a suitable encoding (or embedding) processing. However, researchers often know little about the attacks, or the attacks are too complex to be modeled statistically. In addition, information theory is also broadly applied in the design of information hiding, and some entropy encodings are often used to improve the performance of information-hiding algorithms [22][23][24]. Meanwhile, communication channels have also been applied to model embedding and extraction approaches to analyze the information-hiding capacity. Fridrich [25] presented a capacity model for information hiding for a given embedding distortion and transmission distortion, which established three channels for embedding, extraction and transmission and derived the steganographic capacity as the difference of the mutual information between the codebook and the cover data. Due to the difficulty of calculating mutual information for data with a complex distribution, Fridrich's model finally provided only an approximate result, which made it hard for inspiring the design of an information-hiding algorithm. Moreover, several works investigated information-hiding algorithms and their capacity, and these works commonly used Gaussian distribution [26,27], quantized Gaussian distribution [28], generalized Gaussian model [29], Gaussian Markov field [30] or Gibbs distribution [31] to model the host data. These works focused on the probability of steganographic messages being discovered by the warden and ignored the effect of the accuracy of the extraction, thus making the conclusions of their analysis challenging to generalize to audio watermarking systems.
In this paper, the embedding capacity of the SS-based watermarking algorithm is analyzed using information theory. Based on the analysis, we propose an extended-codebook spread-spectrum watermarking (ECSS) algorithm which enhances the embedding capacity, and finally, we adopt a diversity reception mechanism to further optimize the embedding capacity and robustness of ECSS.

Embedding Capacity of SS-Based Algorithms
As the first work to apply spread spectrum to digital watermarking, vanilla spread spectrum (VSS) has had a significant influence thanks to the anti-interference capability of the spread spectrum [32]. It is assuming x, y and p are three equal-length row vectors. Among of them, x is the signal vector of the host audio in a particular domain, y is its corresponding signal vector of watermarked audio, and p is a pseudorandom sequence independent of x. Throughout this paper, the length of signal vector x is used to denote the length of these three vectors. The VSS embeds a sign ω ∈ {−1, +1} into the host signal x as where α, named embedding strength, is a small positive constant controlling the embedding distortion. Notice that the watermark bits usually consists of 0's and 1's. Typically, the 0's are replaced with −1's and then the embedding in Equation (1) is performed.
In the watermark extraction stage, according to the watermarked signal y and the pseudorandom sequence p, we can calculate where the superscript T denotes the transpose operator. Equation (2) can be rewritten as where xp T pp T is referred to as the host signal interference (HSI) since it affects the extraction results and is related to the host signal. If p is long enough to have xp T pp T ≈ 0, or if the embedding strength is large enough to have | xp T pp T | α, then R has the same sign as ω.
Therefore, the embedded watermark bits can be extracted from R bŷ

Channel Model and Embedding Capacity
As defined by information theory [33], a communication channel is a system consisting of an input codebook I = {I i }, output codebook O = {O i } and a probability transition matrix T = {p ij }, where an element p ij expresses the probability of observing the output O j given the input I i . The Figure 1a gives a schematic of a general communication channel with an input codebook size |I| of m, an output codebook |O| of n and a transition matrix T.
The channel capacity is the highest rate in bits that can be transmitted per channel use and the maximum mutual information between input and output for different input distributions. It can be expressed formulaically as [33], Specifically, if the channel in Figure 1a is weakly symmetric and the input codebook fit a uniform distribution, then its channel capacity is, and H(p i1 , p i2 , · · · , p in ) is the Shannon entropy of any row in transition matrix T, the logarithmic function is set to a base of 2.  According to the above introduction, the VSS watermarking algorithm can be considered a binary symmetric channel, as shown in Figure 1b, in which the input symbols are complemented with probability P e , and its channel capacity is It can be seen that the size of the output codebook corresponding to the VSS algorithm is 2, which is a limitation on the channel capacity. In addition, the error probability of the watermark extraction also dramatically influences the channel capacity. At the extreme, the channel capacity is 0 if the error probability is 0.5; the channel can deliver the message "accurately" if the error probability is dramatically equal to 1. However, there is a slight difference between channel capacity and embedded capacity. It can be seen from Equation (1) that embedding a symbol (watermark bit) requires a signal vector of length l x . Therefore, the watermarking embedding capacity also needs to take the length of the host signal into account. We define the embedding capacity E c as the bits that can be embedded per sample point, measured in bits/sample . It can be calculated from the channel capacity C, where c is a constant factor representing the loss of capacity in the watermarking system. In Equation (9), it can be found that the embedding capacity E c of the watermark is affected by several factors:

•
The output codebook size |O|; in a watermarking system, the input and output codebooks are generally the same in size; • The entropy of transition probability H(p i1 , p i2 , · · · , p in ), which is relevant to watermark robustness; • The length of the signal vector employed to embed a symbol l x ; • The domain transformation and location selection modules contained in the watermarking system. Parameter c in Equation (9) has varied values for different systems.
However, these factors may be mutually constraining, e.g., a short signal vector may heighten the embedding capacity and increase the transition probability's entropy. Some factors may also conflict with other performance criteria; for instance, a sufficiently large embedding strength reduces the transition probability's entropy, but it also destroys the imperceptibility of the watermark. The remainder of this paper does not involve the embedding location selection and domain transformation, so the constant factor c in Equation (9) is ignored.
In the next two sections, we present ECSS and DR to improve the two terms corresponding to the numerator in Equation (9), respectively.

Extended-Codebook Spread-Spectrum Watermarking Algorithm
The VSS watermarking algorithm adds the pseudorandom sequence p to the signal vector x with different signs according to the watermark bit ω. It has a codebook size of just 2. Hence, increasing the codebook size of the SS-based algorithm is a viable option.
There are m mutually orthogonal sequences {p i } m−1 i=0 and the sequences' norm are 1, i.e., Here, the length of sequence p i is equal to the signal vector as l x and l x ≥ m. Each sequence can represent two symbols of the channel input (refer to Equation (1), a pseudorandom sequence has two signs), then the size of the codebook is extended to |O| = 2m. In this section, we propose an extended-codebook spread-spectrum audio watermarking algorithm using the mentioned sequences {p i } m−1 i=0 and analyze its error probability and embedding capacity.

Embedding Approach
When using the sequences {p i } m−1 i=0 as the extended codebook to embed the watermark, we can embed M = log|2m| bits of the watermark once rather than embedding 1 bit in the VSS algorithm, where m is assumed to be an integer power of 2. First, the former M − 1 bits in the M-bit watermark are converted to decimal, where b i is the ith bit in the M-bit watermark. There are 0 ≤ k ≤ m − 1, so each M-bit can correspond to a sequence p k . Then, the last bit of the M-bit watermark b M−1 is employed to modulate the sign of the sequence p k . The embedding approach is as follows, The embedded M-bit watermark can be expressed as (b M−1 , k).

Extraction Approach and Error Probability
For watermark extraction, similar to Equation (2), a pre-extraction is performed on all sequences, The HSI term also exists in Equation (13). When the distribution of the HSI is studied, a Gaussian distribution [9] with mean 0 is usually assumed, and we use Q-Q (quantilequantile) plot and K-S (Kolmogorov-Smirnov) tests [34] to test this hypothesis on the distribution of the HSI. In the experiments, 5000 samples of the HSI were collected for each length of signal vectors of the VSS watermarking algorithm, and the corresponding assumption was verified using the Q-Q plot and p-values of the K-S test. Figure 2 shows a Q-Q plot and the results of the K-S tests. In Figure 2a, the points are evenly distributed around the reference line, indicating that their distribution is similar to the hypothetical distribution. Due to size limitation, the result is shown here only for the length of the signal vector of 32. In Figure 2b, it is shown that the K-S tests performed on different lengths of signal vector have p-values greater than 0.05, which means that we cannot reject the hypothesis that these sample points fit the Gaussian distribution.
We also estimated the HSI's mean and standard deviation, as shown in Figure 3. The mean of the HSI can also be considered as 0. The standard deviation σ is insignificantly related to the length of the signal vector.   When the length of p i is appropriate, or the embedding strength is large, the k corresponding to the watermark can be obtained by comparing the magnitudes of the preextraction, i.e.,k Thus, the M-bit watermark is extracted as (b M−1 ,k). However, the extracted watermark may still be incorrect due to the HSI. Generally, the embedded watermark of (1, k 0 ) is analyzed here as an example. The error occurs in two cases, i.e.,k = k 0 in Equation (14), ork = k 0 , but b M−1 = 1 in Equation (15). Whenk = k 0 , there is at least one pre-extraction R m greater in magnitude than R k 0 . It can be described as, Equation (16) can be analyzed in two scenarios. The first scenario is In this case, , and all the pre-extractions are independent and identically distributed. Then, the cumulative density function (CDF) of R m is, where Φ(x; σ) is the CDF of the HSI. The right side of Equation (17) is a rather tricky variable to analyze, and since the expectation of the HSI xp T k 0 is 0, Equation (17) is simplified here as follows, The probability that Equation (19) holds is Similarly, the second scenario of Equation (16) is, The corresponding error probability is Since the probability density function (PDF) of the HSI in Equation (3) is even, we have p e1 = p e2 . Therefore, the probability of the first case of error (Equation (16)) is (24) It is noted that Equation (16) does not hold at this point, so the corresponding probability is In a nutshell, the probability that Equations (14) and (15) can extract the correct watermark is Since the distribution of the HSI is symmetric, the mentioned probabilities still hold at b M−1 = 0, and they are all functions of the length of the signal vector l x , the number of sequences m and the embedding strength α.

Embedding Capacity
Two error cases were analyzed in Section 4.2, where the error probability p e was uniformly distributed to 2m − 2 corresponding symbols, while p e3 only corresponded to one symbol. Therefore, one row of the transition probability matrix can be written as, where there are a total of (2m − 2) terms p e 2m − 2 .
The embedding capacity of the ECSS algorithm can be calculated according to Equation (9), where the constant factor c is ignored here. Equation (28) shows that E c becomes lower when the signal vector becomes longer.
Since l x ≥ m, the maximum embedding capacity for a given m is obtained when Thus, the embedding capacity is simplified to a function of the number of sequences (the length of the signal vector) m and the embedding strength α.

Diversity Reception
It can be seen in Equation (28) that the transition probability of the channel, especially the accuracy of the watermark extraction, p c , has a significant influence on the embedding capacity E c . Improving the watermark extraction accuracy is also an effective way to heighten the embedding capacity. Moreover, a high accuracy is also beneficial for the implementation of a watermarking system.
There are three strategies to improve the accuracy of the watermark extraction shown in Equation (26).
(1) Reducing the number of sequences m; (2) Increasing the embedding strength α; (3) Modifying the distribution of the HSI.
Both strategies (1) and (2) involve a parameter selection, and in the next section, we experiment with different parameters.
In this section, we adopt the diversity reception (DR) mechanism to modify the HSI distribution, a method commonly used in signal detection. The DR mechanism can effectively improve the accuracy of the watermark extraction. In brief, the M-bit watermark (b M−1 , k) is repeatedly embedded into different signal vectors x γ , where γ is an integer in the interval [0, Γ), where Γ is the number of repetitions in DR. In the extraction phase, all the watermarked signal vectors are referenced. Equation (13) can be rewritten as, By substituting the pre-extractions of Equation (30) into Equations (14) and (15), the corresponding embedded watermark can be extracted. Since x γ p T i or x γ p T k are Γ independent and identically distributed random variables with variance σ 2 , the variance Hence, in the DR mechanism, the distribution of the HSI is modified, and the variance is smaller than that in the ECSS algorithm. It is indicated from Equation (26) that the smaller the standard deviation of the HSI is, the higher the accuracy of the watermark extraction is.
This method of repeatedly embedding the watermark and averaging the HSI in different watermarked signal vectors to reduce the randomness of the HSI is referred to as DR in this paper. Although DR improves the accuracy of the watermark extraction, on the other hand, DR may also reduce the embedding capacity because more signal vectors are used for embedding the watermark.
It seems that DR is a trade-off to balance the embedding capacity with the robustness. However, it is not the case. Even if the number of repetitions is small (e.g., Γ = 2), DR can still significantly improve the accuracy of the watermark extraction (in Section 7.3). In this case, the gain of an accuracy improvement on the embedding capacity outweighs the capacity cost of repeated embeddings.

Our Audio Watermarking System
An audio-watermarking-system-based ECSS and DR is provided here. It can be divided into two stages, the watermark embedding stage and the extraction stage, as shown in Figure 4.

The Embedding Stage
The watermark embedding stage is shown as Figure 4a and can be summarized as three steps.
Step 1: framing, DCT and selection of embedding regions The host audio is split into consecutive nonoverlapping frames and DCT is applied on each frame; the medium frequency components of the DCT coefficients are selected as the embedding regions and grouped as vectors x γ .
Step 2: watermarking modulation and embedding The sequences {p i } m−1 i=0 is an ensemble of standard orthogonal vectors satisfying Equation (10) and of length m. The M-bit watermark is represented in the form of (b M−1 , k) and repeatedly embedded into multiple signal vectors x γ according to Equation (12), resulting in watermarked vectors y γ . The detail of this step is described in Section 4.1.
Step 3: audio reconstruction Replace x γ correspondingly by the watermarked vector y γ . The DCT coefficients are reorganized, and then an inverse DCT is performed on the coefficients to obtain the audio frames. These audio frames are concatenated to yield the watermarked audio.

The Extraction Stage
The watermark extraction stage is shown in Figure 4b and can also be summarized as three steps.
Step 1: framing, DCT and the selection of embedding regions The received audio is split into consecutive nonoverlapping frames and DCT is applied on each frame; the medium frequency components of the DCT coefficients are selected as the embedding regions and grouped as vectors y γ .
Step 2: averaging and pre-extraction Average the y g ammas yieldingȳ, whose correlation with each sequence p i is the pre-extraction results R i 's, as Equation (30) in Section 5.
Step 3: extraction The watermark (b M−1 ,k) can be obtained from the magnitude and sign of the preextraction results, as shown in Equations (14) and (15) of Section 4.2.
The ECSS and DR correspond to Step 2 in the embedding stage and Steps 2 and 3 in the extraction stage of our watermarking system. Framing, DCT, and the selection of embedding regions can be replaced by any embedding regions' strategy. The watermarking system presented in this section is only used as an example to evaluate the performances of the ECSS and DR.

Experiments and Discussion
This section uses a series of experiments to validate the proposed algorithm. The accuracy of the watermark extraction of the ECSS/+DR algorithm was measured at different embedding strengths and lengths of signal vectors. The embedding capacity was calculated according to the accuracy. The performance of the ECSS/+DR algorithm was also compared with that of the VSS algorithm to validate the effectiveness of the ECSS algorithm.
The influence of the DR mechanism with different repetitions Γ on the performance of the audio watermarking system was also evaluated. This part of the experiment also verified the effectiveness of the DR mechanism.
In addition, we evaluated the robustness of the proposed algorithm against common signal processing attacks and discuss the influence of these attacks on the embedding capacity.
Finally, the performance of our algorithm was also compared with that of other existing watermarking algorithms.

Experimental Settings and Metrics
Our watermarking algorithm contains three settable parameters, namely the embedding strength α, the length of signal vectors m (equal to the number of sequences, as in Equation (29)) and the number of repetitions Γ. We chose different α, m and Γ for the experiments, evaluated the influence of attacks on the performance of the algorithm and compared the performance with other existing algorithms on suitable parameters.
We selected 1000 stereo audio tracks with a sample rate of 44.1 kHz and a duration of 30 s from the FMA dataset [35] for the experiments. In the selection of embedding regions, the framed audio segment had 2048 samples and we used the 256th to 511th DCT coefficients as the embedding region, which corresponded to the frequency interval [2756, 5512] Hz. All or part of these 256 DCT coefficients were divided into several signal vectors.
The embedding distortion SWR (signal-to-watermark ratio) was employed to evaluate the watermark imperceptibility. The SWR is calculated as, where x(i) and y(i) denote the ith sample of the host audio and the watermarked audio. SWR reflects the amount of change to the host audio by the watermark embedding. Although SWR does not reflect the perceptual quality of audio locally, it is still an important metric of imperceptibility. In this paper, the robustness of the watermarking algorithm was evaluated using the watermark group extraction precision (GEP). A watermark group consists of the watermark bits corresponding to a sequence, e.g., an M-bit watermark in the ECSS algorithm, while a watermark group is a single bit in the VSS algorithm. The GEP is calculated as follows, where GEP = 1 means that the watermark can be extracted entirely and correctly. The higher the GEP is, the more robust the watermarking algorithm is. When evaluating the robustness, we also used another metric similar to the GEP, the watermark bit detection rate (BDR). The BDR compares the embedded watermark bits with the extracted bits, and it is defined as, where BDR = 1 means that the watermark can be extracted entirely and correctly. The higher the BDR is, the more robust the watermarking algorithm is.
We used two different metrics to evaluate robustness, because (1) The BDR does not reflect the actual robustness of the algorithms if there is a use of multibit embedding, and in the case of the ECSS algorithm, e.g., when the four-bit watermark (1, 7) is incorrectly extracted as (1, 6), the BDR is 75%, while the GEP is 0%. The GEP reflects the fact more accurately. (2) However, not all algorithms embed the same quantity of bits once. For those algorithms that have small watermark groups, the GEP typically yields better results. The actual performance of the algorithms with various embedding bits is somewhat reflected in the BDR.
We used the GEP to evaluate the robustness of the ECSS algorithm and DR, and then used the BDR, which is more common in the literature, to evaluate various attacks and compare them with existing algorithms.

Effectiveness of the ECSS Algorithm
In this subsection, we statistically measure the GEPs of the ECSS algorithm and calculate the corresponding embedding capacity E c based on the GEPs. The relevant performance of the VSS algorithm is also evaluated as a comparison.
The GEP can only estimate the probability of correctly extracting the watermark group, but the complete probability transition matrix is required to calculate the embedding capacity. For this reason, we assigned the error probabilities uniformly to the other symbols to calculate the infimum of the embedding capacity. Figures 5-8 show the results of these experiments, where Figure 5 shows the GEP versus embedding strength α at different lengths of the signal vector; Figure 6 shows the embedding capacity versus embedding strength α at different lengths of the signal vector; Figure 7 shows the GEP versus the length of the signal vector at different embedding strengths; Figure 8 shows the embedding capacity versus the length of the signal vector at different embedding strengths.
As shown in Figure 5, the GEP of the VSS algorithm varies flatly and significantly outperforms that of the ECSS algorithm. As the GEP favors the single-bit embedding approaches, the performance difference between the proposed algorithm and the VSS algorithm is not as significant as shown in Figure 5. As the embedding strength increases, the GEP of ECSS gradually increases and finally saturates. In Figure 6, the embedding capacity of VSS is greater than that of the ECSS algorithm when the embedding strength is low. As the embedding strength becomes larger, the embedding capacity of both algorithms increases, but the increase is faster for the ECSS algorithm. It can also be seen that the embedding capacity of VSS is low when the embedding strength is greater than 0.3.
In Figure 7, it can be seen that the GEP of VSS is almost independent of the length of the signal vector. The GEP of VSS increases with the lengthening of the signal vector for a given embedding strength. However, the factor √ m was divided by the embedding strength to ensure that the SWR of the VSS and ECSS algorithms were equal. (Although α in both Equations (1) and (12) denotes the embedding strength, the influence on the embedding distortion is different since the norm of the pseudorandom sequence in Equation (1) is l x , while it is one in Equation (12). Thus, there is a factor √ l x difference between the embedding strengths corresponding to the same embedding distortion.) At the embedding strength α = 0.2 or α = 0.3, the GEPs of the ECSS algorithm do not reach saturation, and there is no obvious pattern indicating that the GEP is related to the length of the signal vector, which is similar to the standard deviation of the HSI in Figure 3b.
On the other hand, it is shown in Figure 8 that the embedding capacity decreases as the length of the signal vector grows, which can be concluded accordingly from Equation (28).

Effectiveness of the DR mechanism
In the above subsection, the GEP and embedding capacity of VSS and ECSS were analyzed. However, the role of the DR mechanism is not yet clear. In this subsection, the performance of ECSS is analyzed with different repetitions of the DR mechanism combined. The relevant experimental results are shown in Figures 9-12. Figures 9 and 10 show the curves of GEPs and embedding capacity versus the embedding strength at different lengths of the signal vector, where "w/o DR" means Γ = 1. In this paper, the length of the signal vector in the DR mechanism refers to the sum of the length of multiple repeated signal vectors, e.g., the length of 64 for the ECSS+DR algorithm in Figure 9 means that two signal vectors with length 32 were used to embed the same M-bit watermark. The DR mechanism with repetition Γ = 3 in Figures 9 and 10 actually used signal vectors with lengths 33, 63, 129, 255, which are somewhat different from the subplot titles. Figures 11 and 12 show the curves of GEPs and embedding capacity versus the length of the signal vector at different embedding strengths.
As the embedding strength α increase, the GEP gradually increases and eventually saturates, as shown in Figure 9. Furthermore, the smaller the embedding strength is required to saturate the GEP when there are more repetitions. As in Figure 10, the embedding capacity of the watermark is improved with the embedding strength. However, unlike the GEP, DR with different number of repetitions has different embedding capacities when it finally reaches saturation, and the embedding capacity at saturation is lower when there are more repetitions. At low embedding strength, the ECSS+DR algorithm achieves a better imperceptibility, robustness, and embedding capacity than the ECSS algorithm. Moreover, as the signal vector lengthens, the embedding capacity with different repetitions gets closer when saturated, which indicates that repeated embedding is less costly.
There is no clear relationship between the GEP and the length of the signal vector, as shown in Figure 11. Nevertheless, it can be seen in Figure 12 that the embedding capacity gradually decreases as the length of the vector gets longer. If the chosen embedding strength saturates the GEP, the embedding capacity is inversely proportional to the square root of the length of the signal vector.
In summary, the DR mechanism effectively improves the embedding capacity and the GEP at low embedding strength, balancing the proposed watermarking algorithm's imperceptibility, robustness and embedding capacity.

Robustness and Embedding Capacity against Attacks
In this subsection, we evaluate the BDR and embedding capacity of the proposed ECSS and ECSS+DR algorithms for some common data processing attacks. The attacks are described in Table 1. In order to demonstrate the influence of the attacks on the performance of the watermarking algorithm, the parameters α = 0.3 and Γ = 2 for which the GEP did not reach saturation were deliberately chosen. The evaluation was also performed using signal vectors of different lengths. The experimental results are shown in Figure 13, the top subplot is the BDR and the bottom half is the corresponding embedding capacity.  The effect of the signal processing attacks can be equivalent to adding a certain amount of noise to the audio, and the power of that noise is much lower than the power of the audio itself. The ECSS algorithm proposed in this paper resisted most of these attacks, where there was almost no influence in RQZ, ASC, RSP, LPF, and MP3 attacks, which reduced the average BDR of the ECSS algorithm by 0.1% to 0.5% and had negligible influence on the ECSS+DR algorithm. However, in the WGN attack, the BDR of the ECSS algorithm showed a substantial decline, where a 15 dB Gaussian noise would decrease the BDR of the ECSS algorithm by 4.1%, corresponding to a 1.0% decrease for the ECSS+DR algorithm, and a 10 dB noise would decrease the BDR of the ECSS algorithm by 11.5%, corresponding to a 5.5% decrease for the ECSS+DR algorithm.
The decrease of the BDR also reduced the embedding capacity of algorithms. Among these attacks, the influence of RQZ, ASC, RSP, LPF, and MP3 attacks on the embedding capacity of the watermarking algorithms was less than 3 × 10 −5 bits/sample, which is negligible. The WGN attack decreased the embedding capacity of the ECSS and ECSS+DR algorithms by 5 × 10 −4 bits/sample and 3 × 10 −4 bits/sample, respectively, at an SNR of 15 dB; 10dB noise decreases the embedding capacity by 1.2 × 10 −3 bits/sample and 1.5 × 10 −3 bits/sample correspondingly.
We also evaluated the performance of the ECSS/+DR algorithm against a short delay attack (DLY). The BDR was reduced by 8.11% for the ECSS algorithm and by 2.51% for the ECSS+DR algorithm. Due to the change in robustness, the embedding capacity was reduced by 9.1 × 10 −4 and 7.5 × 10 −4 , respectively. Algorithm DLY is not a signal processing attack, although adding other parts from the same audio is still equivalent to adding noise. Indeed, the ECSS+DR algorithm provided an anti-interference property.
The influence of various attacks on the performance and the equivalent noise intensity are summarized in Table 2, where the equivalent SNR indicates the intensity of the equivalent noise corresponding to the attack. In addition to signal processing attacks, advanced attacks such as crop, framing errors and playback scaling are also possible with watermarked audio. It is difficult to resist these attacks by considering only the embedding and extraction algorithms. Various strategies for selecting embedding regions have been proposed in the literature, among which RASE [36] and RFPS [37] can resist crop and framing errors, A WSOLA-based strategy [38] can resist playback. These strategies can replace Step 1 in Section 6 to resist the corresponding attacks. However, there is no watermarking algorithm that can resist all audio attacks.

Comparison with Other Existing Algorithms
The proposed ECSS+DR algorithm was also compared with existing high-embeddingcapacity algorithms [12,13,15,16]. These algorithms all have relatively high embedding capacity which have been presented as related works in Section 2, and the parameter settings of each comparison algorithm are referred to in the relevant papers. Here, the BDR was used to evaluate the robustness of these algorithms against various attacks.
It is still evident from the experimental results in Table 3 that our algorithm obtained outstanding performance. "Ours" represents the ECSS+DR algorithm, where the embedding strength was α = 0.35, the repetitions Γ = 2, and the length of the signal vector m = 16 (due to repeated embeddings, each M-bit watermark group actually used signal vectors with a total length of 32). These parameters provide a balanced performance for the ECSS+DR algorithm according to Sections 7.2 and 7.3.
The ECSS+DR algorithm had similar embedding distortion SWR as the algorithm in [12] and also had a comparable BDR in CLP, LPF, and RSP attacks. However, the embedding capacity of our algorithm was significantly higher, and our algorithm was also more robust against WGN and MP3 attacks. Compared with the algorithm in [15], the robustness of our algorithm was close to that of the latter, but our embedding capacity was better, and notably, there was a slighter SWR in our algorithm. Compared with the methods in [13,16], our embedding distortion was near to theirs, but our algorithm performed much better in both robustness and embedding capacity. In a nutshell, our algorithm achieved a significant advantage in embedding capacity, without any disadvantage in robustness and imperceptibility compared to other existing algorithms.

Summary
In this paper, we analyzed the spread-spectrum watermarking algorithm from the perspective of the communication channel and formularized its watermark embedding capacity. According to this analysis, we proposed a watermarking algorithm using extendedcodebook ECSS and optimized it by adopting the diversity reception (DR) mechanism, so that the ECSS+DR watermarking algorithm achieved a higher embedding capacity and stronger robustness while given a limited embedding strength.
However, the embedding capacity formulation in this paper was only limited to the spread-spectrum watermarking algorithm, and the derived formula did not take into account the various attacks that could be encountered with the watermarked audio. Due to the excellent anti-interference performance of the spread-spectrum approach, our algorithm still resisted common signal processing attacks. Our team will subsequently try to build a more generalized embedding capacity model that takes into account watermarking algorithms from other paradigms such as QIM and deep learning, as well as considering attacks.
In this paper, a Gaussian distribution was used to model the HSI, which is a heuristic method lacking a rigorous theoretical support. Although the Gaussian distribution passed the K-S test, we note that the sample points in the Q-Q plots formed an S-like curve. Therefore, there may exist a more suitable distribution for modeling the HSI.
Our algorithms are not resistant to white-box attacks in which the attackers knows this algorithm. Although the embedder and the extractor of the watermarking system are the same, the security of this algorithm still needs to be improved in future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: