Content Adaptive Lagrange Multiplier Selection for Rate-Distortion Optimization in 3-D Wavelet-Based Scalable Video Coding

Rate-distortion optimization (RDO) plays an essential role in substantially enhancing coding efficiency. Currently, rate-distortion optimized mode decision is widely used in scalable video coding (SVC). Among all the possible coding modes, it aims to select the one offering the best trade-off between bitrate and compression distortion. Specifically, this trade-off is tuned through the choice of the Lagrange multiplier. Despite the prevalence of the conventional method for Lagrange multiplier selection in hybrid video coding, the underlying formulation is not applicable to 3-D wavelet-based SVC, where explicit quantization step values are not available, and it takes no account of the content features of the input signal. In this paper, an efficient content adaptive Lagrange multiplier selection algorithm is proposed in the context of RDO for 3-D wavelet-based SVC targeting quality scalability. Our contributions are two-fold. First, we introduce a novel weighting method, which takes into account the mutual information, gradient per pixel, and texture homogeneity to measure the temporal subband characteristics after applying the motion-compensated temporal filtering (MCTF) technique. Second, based on the proposed subband weighting factor model, we derive the optimal Lagrange multiplier. Experimental results demonstrate that the proposed algorithm achieves more satisfactory video quality with negligible additional computational complexity.


Introduction
With the rapid development of video services in recent years, efficiently compressing video sequences has become a very challenging task for transmitting video data over heterogeneous networks. To meet this demand, scalable video coding (SVC) has become a simple and flexible solution enabling seamless delivery by offering three different kinds of scalability, namely, temporal scalability, spatial scalability, and quality scalability [1][2][3][4]. Generally speaking, there are two main categories of SVC: discrete cosine transform (DCT)-based hybrid video coding SVC and wavelet-based SVC.
Owing to the intrinsic localization and multiresolution features of the discrete wavelet transform (DWT), video codecs based on motion-compensated three-dimensional (3-D) DWT have been studied extensively for use in SVC [5][6][7][8]. 3-D wavelet-based SVC provides a natural way of producing embedded bitstreams with full scalability and fine granularity for in-network adaptation [9,10]. In 3-D wavelet-based SVC, temporal redundancy across frames is exploited by adopting the motion-compensated temporal filtering (MCTF) framework [11][12][13][14], and spatial redundancy inside a frame is exploited by a 2-D spatial transform. Such codecs do not suffer from the drift problem often encountered in closed-loop hybrid coders.

System Model
The representative motion compensated embedded zero block coding (MC-EZBC) scalable video coder [34] is a highly efficient member of the existing 3-D wavelet-based SVC schemes. However, the coding efficiency of MC-EZBC in the low bitrate range is far from satisfactory. To remedy this deficiency, Wu et al. put forward the well-known enhanced motion-compensated embedded zero block coding (ENH-MC-EZBC), which retains the excellent rate-distortion performance at high bitrates and achieves significant improvement at low bitrates and/or low resolutions [35].
In our work, the proposed algorithm is designed for the popular ENH-MC-EZBC considering quality scalability. Our choice of this codec system is motivated by the fact that ENH-MC-EZBC incorporates all the advanced encoding tools found in state-of-the-art video coding schemes and achieves excellent coding performance at both low and high bitrates. The ENH-MC-EZBC codec system model contains three parts: encoder, bitstream extractor, and decoder, as shown in Figure 1. In the encoder, a motion compensated 3-D subband/wavelet transform naturally partitions the input video sequence into a range of spatiotemporal resolutions. We then use 3D-EZBC to encode the resulting spatio-temporal subbands. All the motion fields are coded as side information by using lossless spatial differential pulse code modulation (DPCM) and adaptive arithmetic coding. In the bitstream extractor, the fully embedded bitstream is truncated based on both user preferences and network conditions to generate a highly flexible scalable bitstream that meets specific applications. At the decoder, the respective reverse operations are carried out to reconstruct the video sequences.

Lagrange Optimization and Lagrange Multiplier Selection
The Lagrange optimization technique provides a systematic way to solve the constrained RDO problem, which aims at selecting the optimal coding parameter that minimizes the overall distortion measure subject to a given target bitrate restriction.
More details on the Lagrange optimization technique are discussed in [16,20,36]. This technique is well known in optimization problems where the cost and objective functions are continuous and differentiable. Everett's contribution [37] demonstrated that the Lagrange optimization technique could also be used for discrete optimization problems, with no loss of optimality if a solution exists with the required budget; i.e., as long as there exists a point in the convex hull that meets the required budget.

The constrained RDO problem can be formulated as:

min Σ_{i=1}^{K} d_i subject to Σ_{i=1}^{K} r_i ≤ R_T, (1)

where d_i and r_i are the distortion and bitrate for the ith (i = 1, 2, ..., K) coding unit, respectively; K is the total number of coding units involved and R_T the available bitrate constraint.
In view of Lagrange optimization, the above constrained optimization problem (1) can be converted into an unconstrained form:

{Para}_opt = arg min_{Para} J, with J = D + λR, (2)

where J is the Lagrange cost function and λ is the so-called Lagrange multiplier that weights the relative importance between d_i and r_i. The optimal coding parameter set for all coding units is determined by minimizing the Lagrange cost function in Equation (2). Consequently, how to determine λ becomes a key problem in Lagrange optimization. Much effort has therefore been devoted to Lagrange multiplier selection. One may obtain λ using a bisection search [38,39]. In RDO for video coding, however, a more computationally efficient approach is usually preferable. Rather than empirically solving the Lagrange multiplier selection problem as in [38,39], λ in video coding can be determined from an R-D function. Owing to the convexity of the R-D curve, the optimal slope λ matched to the desired R can be obtained using standard convex search techniques [24,40]. For each coding unit, the point on the R-D curve that minimizes the Lagrange cost is the point at which the line of absolute slope λ is tangent to the convex hull of the R-D curve.
Note that the R-D curve is convex and non-increasing in video coding. If we assume that both R and D are differentiable everywhere, the minimum of the Lagrange cost function J is found by setting its derivative to zero:

∂J/∂R = ∂D/∂R + λ = 0, (3)

which yields:

λ = −∂D/∂R |_{R = R_T}, (4)

where R_T is the target bitrate. A given value of λ yields an optimal solution {Para}_opt(λ) to the original RDO problem (1) for a particular rate R({Para}_opt).
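As a concrete illustration of selecting λ to meet a rate budget, the following sketch (with made-up R-D points, not taken from any real codec) bisects on λ: for each candidate λ, every coding unit independently picks the R-D point minimizing d_i + λ·r_i, and λ is adjusted until the summed rate meets the target.

```python
# Hypothetical discrete (rate, distortion) points per coding unit; each unit's
# points lie on a convex, non-increasing R-D curve.
units = [
    [(1.0, 9.0), (2.0, 4.0), (4.0, 1.0)],
    [(0.5, 6.0), (1.5, 3.0), (3.0, 1.5)],
]

def allocate(lmbda):
    """Pick, per unit, the R-D point minimizing the Lagrange cost d + lambda*r."""
    total_r = total_d = 0.0
    for points in units:
        r, d = min(points, key=lambda p: p[1] + lmbda * p[0])
        total_r += r
        total_d += d
    return total_r, total_d

def find_lambda(r_target, lo=1e-6, hi=1e6, iters=60):
    """Bisection on lambda: a larger lambda spends fewer bits."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        r, _ = allocate(mid)
        if r > r_target:
            lo = mid          # spending too many bits: increase lambda
        else:
            hi = mid
    return hi

lam = find_lambda(r_target=4.0)
rate, dist = allocate(lam)    # operating point under the rate budget
```

The convexity of each unit's R-D set guarantees that the rate obtained by this per-unit minimization is monotonically non-increasing in λ, which is what makes the bisection valid.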

Problem Formulation
In 3-D wavelet-based SVC, the temporal decomposition is efficiently refined by adopting the MCTF technique. The ENH-MC-EZBC scalable video coder utilizes an adaptive lifting-based MCTF framework that switches between the LeGall and Tabatabai (LGT) 5/3 and Haar 2/2 filters to exploit the temporal redundancies between successive frames [41]. During the MCTF process, video frames are filtered into low-frequency (L) and high-frequency (H) subbands. The processes generating the temporal high- and low-frequency subbands are called the "prediction" and "update" steps, respectively. Figure 2 illustrates the prediction and update steps for the analysis stage of the three-level adaptive MCTF decomposition, where MC and IMC are the motion compensation and inverse motion compensation operators, respectively. Scene change information is reflected in the choice of the filter bank, which propagates to the lower temporal levels. The coefficients α_τ and β_τ (τ = 1, 2, 3) along the branches are filter-based weighting factors. Since 3-D wavelet-based SVC is usually based on an underlying open-loop MCTF structure, its R-D performance is further complicated by the inherent propagation of quantization errors along the temporal wavelet decomposition tree [42]. Therefore, the selection of the Lagrange multiplier for wavelet-based SVC should be addressed differently.


Proposed Lagrange Multiplier Selection Algorithm
In this section, we introduce the proposed content adaptive Lagrange multiplier selection algorithm for the 3-D wavelet-based SVC codec.

Lagrange Multiplier Selection Bottlenecks in 3-D Wavelet-Based SVC
Lagrange multiplier based mode decision is one of the most important technologies in SVC [43]. In the adaptive MCTF framework of 3-D wavelet-based SVC, each temporal subband frame in Figure 2 is assigned one of four frame modes: "bi-direction" (denoted as m_1), "uni-left" (denoted as m_2), "uni-right" (denoted as m_3), and "intra" (denoted as m_4). The ENH-MC-EZBC SVC codec incorporates an optional rate-distortion (R-D) optimized mode decision algorithm, whose goal is to choose, among the available coding frame modes, the mode that minimizes distortion subject to a rate constraint. Let F = (f_1, f_2, f_3, ..., f_K) denote a group of K frames. For a vector of coding mode allocations M = (M_1, M_2, ..., M_K), with each M_i taken from {m_1, m_2, m_3, m_4}, and a bitrate constraint R_T, this optimization problem can be expressed as:

M* = arg min_M D(F, M) subject to R(F, M) ≤ R_T, (5)

where M*, D(F, M), and R(F, M) denote the vector of best mode allocations, the distortion metric, and the coded bitrate, respectively. This may be written as an unconstrained problem using the Lagrange optimization method:

M* = arg min_M Σ_{i=1}^{K} J(F_i), (6)

where J(F_i) is the Lagrange cost for the ith frame, given as:

J(F_i) = D(F_i, M_i) + λ R(F_i, M_i), (7)

where D(F_i, M_i) is the distortion term between the current frame and its reference frame, and R(F_i, M_i) is the bitrate term representing the expected number of bits allocated to the ith frame, given a specific λ.
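The frame-mode decision of the Lagrange cost J = D + λR can be illustrated with a toy example; the per-mode distortion and rate numbers below are invented for illustration only.

```python
# Hypothetical per-mode (distortion, rate) measurements for one temporal
# subband frame; mode names follow the four frame modes in the text.
modes = {
    "bi-direction": (10.0, 1200.0),
    "uni-left":     (14.0, 900.0),
    "uni-right":    (13.5, 950.0),
    "intra":        (25.0, 600.0),
}

def best_mode(lmbda):
    """Return the frame mode minimizing the Lagrange cost J = D + lambda * R."""
    return min(modes, key=lambda m: modes[m][0] + lmbda * modes[m][1])

# A small lambda favors the low-distortion mode, a large lambda the low-rate one.
mode_low_lambda = best_mode(0.001)   # "bi-direction"
mode_high_lambda = best_mode(0.05)   # "intra"
```

This makes concrete how the choice of λ, rather than any threshold on D or R alone, controls which coding mode wins.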
An accurate R-D model is crucial for determining the Lagrange multiplier. It is worth mentioning that the R-D relationships can be quite different for the various temporal subbands (T-bands). Unfortunately, a fixed Lagrange multiplier at each temporal layer cannot maximize coding efficiency. Consequently, the conventional Lagrange multiplier selection method is suboptimal, as it does not consider the content characteristics of the T-bands.

Analyzing the Distortion Relationship between Temporal Subbands and Reconstructed Frames
The distortion fluctuation exhibited by 3-D wavelet-based SVC codecs can be better understood by analyzing the distortion propagation during MCTF. In the ENH-MC-EZBC codec, the adaptive MCTF framework is implemented with either the Haar or the 5/3 filter to improve the coding performance. As shown in Figure 2, the coefficients α_τ and β_τ (τ = 1, 2, 3) along the branches are filter-based weighting factors, which are given in Table 1.

Table 1. Coefficients α_τ and β_τ depending on the mode of the current frame.
In the Haar MCTF, the original frames are filtered temporally with a two-tap Haar filter along the motion trajectory. For the connected pixels, the low- and high-frequency subbands are generated by the following lifting structure:

H_i(m, n) = (1/√2) [I_{2i+1}(m, n) − Ĩ_{2i}(m − d_m, n − d_n)], (8)
L_i(m, n) = √2 I_{2i}(m, n) + H̃_i(m + d̄_m, n + d̄_n). (9)

For the 5/3-based MCTF, the temporal analysis can be written as:

H_i(m, n) = I_{2i+1}(m, n) − (1/2) [Ĩ_{2i}(m − d_m, n − d_n) + Ĩ_{2i+2}(m + d_m, n + d_n)], (10)
L_i(m, n) = I_{2i}(m, n) + (1/4) [H̃_{i−1}(m + d̄_m, n + d̄_n) + H̃_i(m + d̄_m, n + d̄_n)], (11)

where H_i and L_i denote the temporal high- and low-frequency subbands of the ith (i = 1, 2, ..., K) frame pair of a video sequence, respectively; I_{2i}(m, n) represents the pixel value at position (m, n) in frame f_{2i}; Ĩ_{2i}(m − d_m, n − d_n) and Ĩ_{2i+2}(m + d_m, n + d_n) denote the interpolated values of pixel (m, n) in frames 2i and 2i + 2, respectively; H̃ is the interpolated value of a pixel in the high-frequency frame; (d_m, d_n) is the motion vector in frame f_{2i}; d̄_m is the closest integer value to d_m, and d̄_n is defined in the same way.
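Ignoring motion compensation and interpolation (so that "connected pixels" line up trivially), the Haar and 5/3 temporal lifting steps can be sketched and checked for perfect reconstruction as follows; the frame data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Eight synthetic "frames" of 16 pixels each (motion compensation omitted).
frames = [rng.standard_normal(16) for _ in range(8)]

def haar_analysis(x):
    """Two-tap Haar temporal lifting: H = (odd - even)/sqrt(2), L = sqrt(2)*even + H."""
    n = len(x) // 2
    H = [(x[2*i + 1] - x[2*i]) / np.sqrt(2) for i in range(n)]
    L = [np.sqrt(2) * x[2*i] + H[i] for i in range(n)]
    return L, H

def haar_synthesis(L, H):
    """Inverse Haar lifting: undo the update, then the prediction step."""
    x = []
    for l, h in zip(L, H):
        even = (l - h) / np.sqrt(2)
        x += [even, np.sqrt(2) * h + even]
    return x

def mctf_53_analysis(x):
    """One level of 5/3 temporal lifting: prediction then update step."""
    n = len(x) // 2
    H, L = [], []
    for i in range(n):
        right = x[2*i + 2] if 2*i + 2 < len(x) else x[2*i]   # boundary extension
        H.append(x[2*i + 1] - 0.5 * (x[2*i] + right))
    for i in range(n):
        left = H[i - 1] if i > 0 else H[i]
        L.append(x[2*i] + 0.25 * (left + H[i]))
    return L, H

def mctf_53_synthesis(L, H):
    """Inverse 5/3 lifting: undo the update, then the prediction step."""
    n = len(L)
    even = [L[i] - 0.25 * ((H[i - 1] if i > 0 else H[i]) + H[i]) for i in range(n)]
    x = []
    for i in range(n):
        x.append(even[i])
        right = even[i + 1] if i + 1 < n else even[i]
        x.append(H[i] + 0.5 * (even[i] + right))
    return x

rec_haar = haar_synthesis(*haar_analysis(frames))
rec_53 = mctf_53_synthesis(*mctf_53_analysis(frames))
```

Both transforms are lossless by construction: each lifting step is inverted exactly by subtracting back the same prediction/update term, as the reconstruction check confirms.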
For mathematical convenience, we consider only the lifting structure of one-level 5/3 inverse MCTF in the following. According to Equations (10) and (11), the errors in the reconstructed frames f_{2i}(m, n) and f_{2i+1}(m, n) can be formulated as:

ε_{f_{2i}}(m, n) = ε_{L_i}(m, n) − (1/4) [ε̃_{H_{i−1}}(m + d̄_m, n + d̄_n) + ε̃_{H_i}(m + d̄_m, n + d̄_n)], (12)
ε_{f_{2i+1}}(m, n) = ε_{H_i}(m, n) + (1/2) [ε̃_{f_{2i}}(m − d_m, n − d_n) + ε̃_{f_{2i+2}}(m + d_m, n + d_n)], (13)

where ε_f denotes the reconstruction error of frame f. We assume that the distortion in each T-band can be modeled as independent and identically-distributed zero-mean additive white noise. Let σ²_L and σ²_H be the error variances of pixels in the low- and high-frequency subbands, respectively, and let σ²_{f_{2i}} and σ²_{f_{2i+1}} denote the error variances in the reconstructed video frames. Accordingly, under this independence assumption, the distortions in the reconstructed frames are expressed as:

σ²_{f_{2i}} = σ²_L + (1/8) σ²_H, σ²_{f_{2i+1}} = σ²_H + (1/4) (σ²_{f_{2i}} + σ²_{f_{2i+2}}). (14)

The distortion associated with different T-bands thus contributes differently to the distortion in the reconstructed frames. Typically, this difference is quantified through the weighting factor for inverse MCTF, which indicates how much a unit distortion in a specified subband contributes to the overall distortion in the reconstructed video. The one-level weighting factors are obtained by collecting the contributions of σ²_L and σ²_H in Equation (14) (Equations (15) and (16)). This results in distortion fluctuation across the reconstructed frames after one-level inverse MCTF. When multi-level MCTF is involved, the weighting factor of each temporal subband is obtained by iterating the one-level relation along the temporal decomposition tree (Equation (17)). That is, the distortion D of the original video sequence after a T-level MCTF decomposition can be derived as:

D = Σ_{t=1}^{T} Σ_i ω^(t)_{i_H} d^(t)_{i_H} + Σ_i ω^(T)_{i_L} d^(T)_{i_L}, (18)

where d^(t)_{i_H} and d^(T)_{i_L} are the reconstruction distortions (namely, error variances) of the high-frequency subbands at the tth (1 ≤ t ≤ T) temporal level and of the low-frequency subbands at the highest temporal level, respectively, and ω^(t)_{i_H} and ω^(T)_{i_L} represent the corresponding weighting factors, which depend on the motion estimation algorithm and the wavelet filter pair used for MCTF.
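A small Monte Carlo check of the even-frame error-propagation relation is sketched below, with assumed subband error variances and motion compensation omitted; the predicted value follows from treating the injected subband errors as i.i.d. white noise, as in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
trials, n = 20000, 34
sigma2_L, sigma2_H = 4.0, 2.0   # assumed subband error variances (illustrative)

# i.i.d. zero-mean white-noise errors injected into the L and H subbands.
eL = rng.normal(0.0, np.sqrt(sigma2_L), size=(trials, n))
eH = rng.normal(0.0, np.sqrt(sigma2_H), size=(trials, n))

# Error in the reconstructed even frames after one-level 5/3 inverse MCTF
# (interior frames, no motion compensation):
#   eps_f2i = eps_Li - 1/4 * (eps_H(i-1) + eps_Hi)
e_even = eL[:, 1:] - 0.25 * (eH[:, :-1] + eH[:, 1:])

measured = e_even.var()
# Independent terms: var = sigma_L^2 + 2 * (1/4)^2 * sigma_H^2 = sigma_L^2 + sigma_H^2 / 8
predicted = sigma2_L + sigma2_H / 8.0
```

The measured variance matches the closed-form weighting of the subband variances, illustrating how a unit distortion in the H subband contributes only 1/8 as much as one in the L subband to the even reconstructed frames.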

Adaptive Lagrange Multiplier Selection
As described in Subsection 2.2, the Lagrange multiplier λ is usually used to guide the bit allocation of each subband so that the overall distortion is minimized while balancing rate and distortion among the reconstructed frames. As is well known, a larger λ results in higher coding distortion with fewer coding bits, whereas a smaller λ leads to lower coding distortion with more coding bits. Therefore, we can allocate more bits to the T-bands containing more detailed information by decreasing the subband-level Lagrange multiplier.
We define a weighting factor vector Ω = (ω_1, ω_2, ..., ω_K), which indicates the contribution of a unit quantization error in each temporal subband to the overall distortion in the reconstructed video sequence. Obtaining the weighting factors of the MCTF process requires a novel derivation that involves the complicated motion compensation process and the statistics of the temporal subband content characteristics. To compute these weighting factors, we utilize the mutual information, gradient per pixel, and texture homogeneity to measure the temporal subband content features.
Mutual information (MI) is a measure of similarity between frames that can detect differences among successive frames [44]. In this paper, we utilize MI to measure the similarity between temporal subband frames. A large difference between frames (corresponding to high motion activity) leads to a low MI value, while a small change between frames yields a high MI value [45,46]. Let X be a discrete random variable with a set of possible outcomes A_X = {a_1, a_2, ..., a_L} with probabilities {p_1, p_2, ..., p_L}, where p_X(x = a_l) = p_l ≥ 0 and Σ_{x∈A_X} p_X(x) = 1. According to information theory, the entropy of X is:

H(X) = −Σ_{x∈A_X} p_X(x) log₂ p_X(x). (19)

The joint entropy of discrete random variables X and Y is:

H(X, Y) = −Σ_x Σ_y p_{XY}(x, y) log₂ p_{XY}(x, y). (20)

The MI between random variables X and Y is given by:

MI(X, Y) = Σ_x Σ_y p_{XY}(x, y) log₂ [p_{XY}(x, y) / (p_X(x) p_Y(y))]. (21)

The relation between MI and joint entropy is:

MI(X, Y) = H(X) + H(Y) − H(X, Y). (22)

In a YUV formatted video sequence, consider a gray-level video sequence with intensity values ranging from 0 to L − 1 (e.g., L = 256 for 8-bit depth). For the luminance component Y, P^Y_{i,i+1}(x, y) is the probability that a pixel with gray level x in frame f_i has gray level y in frame f_{i+1}. Let MI^Y_{i,i+1} be the MI of the luminance component, obtained as:

MI^Y_{i,i+1} = Σ_x Σ_y P^Y_{i,i+1}(x, y) log₂ [P^Y_{i,i+1}(x, y) / (P^Y_i(x) P^Y_{i+1}(y))], (23)

where P^Y_i(x) and P^Y_{i+1}(y) are the probabilities that a pixel has gray level x in frame f_i and gray level y in frame f_{i+1}, respectively.
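A minimal sketch of the frame-to-frame MI computation for 8-bit luminance planes, estimating the joint probability with a 2-D histogram:

```python
import numpy as np

def mutual_information(f1, f2, levels=256):
    """MI (in bits) between two 8-bit luminance planes via a joint histogram."""
    joint, _, _ = np.histogram2d(f1.ravel(), f2.ravel(),
                                 bins=levels, range=[[0, levels], [0, levels]])
    pxy = joint / joint.sum()                 # joint probability P(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal P(x), shape (levels, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal P(y), shape (1, levels)
    mask = pxy > 0                            # skip log(0) terms
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# A simple test pattern whose MI with itself equals its own entropy:
frame = np.tile(np.arange(256), (256, 1))     # uniform over 256 gray levels
mi_self = mutual_information(frame, frame)    # -> 8.0 bits
```

Identical frames give the maximum (the frame's entropy), while unrelated frames give a much smaller value, matching the motion-activity interpretation in the text.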
The gradient-based subband content complexity measure is also considered in our Lagrange multiplier selection scheme. Here, the gradient per pixel (GPP) of the temporal subband is defined by:

GPP_i = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} (|I_i(m, n) − I_i(m + 1, n)| + |I_i(m, n) − I_i(m, n + 1)|), (24)

where M and N are the width and height of the temporal subband, and I_i(m, n) is the pixel value at position (m, n).
Statistical analysis shows that a subband belonging to homogeneous regions requires fewer bits, and that the number of bits increases with the complexity of the texture. Hence, the texture homogeneity of each subband is measured by the ratio between the variance and the mean of the subband:

Ratio_i = Variance_i / Mean_i, (25)

where Variance_i and Mean_i are the variance and mean value of each temporal subband, respectively. Let W denote the synthesis gain matrix with regard to the target video sequence. The associated temporal subband weighting factor Ω = (ω_1, ω_2, ..., ω_K) for prediction can then be calculated as:

ω_i = ω_filter_i · W_{i→i+1} · (α MI_i + β GPP_i + γ Ratio_i), (26)

where α, β, and γ are model parameters updated by multiple linear regression analysis. For each temporal subband frame i, ω_filter_i is the filter-based weighting factor obtained from Equations (15)-(17); MI_i, GPP_i, and Ratio_i are computed by Equations (23)-(25), respectively. As pointed out in [31], W_{i→i+1} is the cross-subband error propagation, which reflects the importance of the coefficients in each subband. Similar to JPEG2000 [47], we assume that the distortions from the various temporal subbands are approximately additive; Equation (18) reveals that the total distortion is simply a linear combination of the distortions of all the temporal subbands. Noting that the wavelet coefficients within each temporal subband obey a Gaussian distribution [48,49], the distortion of the transform coefficients can be expressed as:

d_i = k Σ_{q=1}^{Q} σ²_q e^{−η r_i}, (27)

where d_i and r_i stand for the distortion and bitrate of the ith temporal subband frame, respectively; σ²_q is the variance of the qth coefficient, and Q is the number of coefficients concerned. Besides, k and η are parameters related to the temporal decomposition scheme used and the corresponding slope for SVC, respectively.
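The content features can be sketched as follows; `subband_weight` is only an illustrative placeholder for the regression-based combination, with `a`, `b`, `c` standing in for the fitted model parameters (they are not values from the paper).

```python
import numpy as np

def gpp(band):
    """Gradient per pixel: total absolute horizontal plus vertical
    first-difference, averaged over the subband area."""
    g = np.abs(np.diff(band, axis=0)).sum() + np.abs(np.diff(band, axis=1)).sum()
    return g / band.size

def texture_ratio(band, eps=1e-12):
    """Variance-to-mean ratio of the coefficient magnitudes; magnitudes are
    used because high-frequency subbands are roughly zero-mean, which would
    make the raw mean ill-behaved as a denominator."""
    mag = np.abs(band)
    return mag.var() / (mag.mean() + eps)

def subband_weight(w_filter, mi, grad, ratio, a=1.0, b=1.0, c=1.0):
    """Illustrative placeholder only: a linear feature combination scaled by
    the filter-based weight. a, b, c stand in for regression-fitted model
    parameters and are NOT the paper's values."""
    return w_filter * (a * mi + b * grad + c * ratio)
```

A flat subband yields zero for both features, while ramps and textured patches score higher, matching the intended "more detail, more bits" behavior.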
However, in practice, it is very difficult to determine the variance of a given coefficient, as required by Equation (27). Based on the observation that each coefficient's variance scales with that of the residual signal, it can be estimated by:

σ²_q = β_q σ²_f, (28)

where β_q is a parameter related to the wavelet transform, and σ²_f is the variance of the residual pixel values before the wavelet transform; Equation (27) can then be rewritten accordingly (Equation (29)). Moreover, σ²_f can be further approximated using the mean absolute difference (MAD) by σ²_f ≈ 2MAD², so that the distortion d_i for the ith subband can be expressed in terms of the MAD (Equation (30)). According to Equations (26)-(30), the Lagrange multiplier λ_i of the ith subband is then computed as the negative slope of this R-D model (Equations (31) and (32)), from which a more practical Lagrange multiplier expression is derived (Equation (33)), with ϕ = 6.2 an empirical constant suitable for different sequences. By means of the Lagrange multiplier λ_i, the RDO objective function in Equation (7) can be rewritten with weighted distortion terms, where ω_{i_H} D_H(R_i) denotes the weighted distortion term for the high-frequency subbands and ω_{i_L} D_L(R_i) that for the low-frequency subbands. Only the frame mode with the minimum Lagrange cost J(F_i) is finally chosen as the best frame mode.
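A hedged sketch of a subband-level Lagrange multiplier under an assumed exponential R-D model d(r) = 2k·MAD²·e^(−ηr), using the σ²_f ≈ 2MAD² approximation from the text; k and η are placeholder values here, not the paper's fitted parameters, and λ is taken as the magnitude of the slope −dd/dr.

```python
import math

def distortion(mad, rate, k=1.0, eta=0.4):
    """Assumed exponential R-D model: d(r) = 2*k*MAD^2*exp(-eta*r).
    k and eta are illustrative placeholders, not fitted values."""
    return 2.0 * k * mad * mad * math.exp(-eta * rate)

def lagrange_multiplier(mad, rate, k=1.0, eta=0.4):
    """lambda = -dd/dr, i.e. the magnitude of the R-D slope at the
    operating rate; for the exponential model this is eta * d(r)."""
    return eta * distortion(mad, rate, k, eta)
```

The qualitative behavior matches the text: λ shrinks as the allocated rate grows (flatter R-D slope) and grows with content activity (larger MAD), so detailed subbands end up with smaller multipliers and hence more bits.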

Summary of the Proposed Algorithm
The detailed procedure of the proposed Lagrange multiplier selection algorithm is outlined in the following Algorithm 1.
In order to assess the accuracy of the proposed R-λ model, R² is utilized as the quantitative metric, which measures the degree of data variation explained by a given model [50]:

R² = 1 − Σ_i (X_i − X̂_i)² / Σ_i (X_i − X̄)²,

where X_i and X̂_i are the actual and estimated values of the ith data point, respectively, and X̄ is the mean of all the data points. The maximum R² value is 1, which occurs when X_i = X̂_i for every i. The closer the value of R² is to 1, the more accurate the model. In our experiment, for each test sequence, the λ value of each MCTF level is fitted at the target bitrate by the proposed R-λ model. The R² statistics for four-level MCTF are tabulated in Table 2. From this table, we notice that the R² values are all close to 1. Consequently, the analytical R-λ model works well for sequences at the different temporal decomposition levels, which is a crucial point of our approach.
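The R² metric can be computed directly:

```python
import numpy as np

def r_squared(actual, estimated):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = ((actual - estimated) ** 2).sum()      # residual sum of squares
    ss_tot = ((actual - actual.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives exactly 1.0, and the value decreases as the estimates deviate from the data, which is how the fitted λ values per MCTF level are scored in Table 2.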

Algorithm 1 An efficient algorithm for content adaptive Lagrange multiplier selection
Step 1 Initialization: Input a video sequence with K frames for illustration. Obtain the frame rate, total number of frames, target bitrate, etc., from the configuration file or manual input.
Step 2 Distortion modeling: Model the distortion relationship between the temporal subbands and the reconstructed frames.
Step 3 Subband weighting factor computation: Measure the content features of each temporal subband (mutual information, gradient per pixel, and texture homogeneity) and compute the subband weighting factors.
Step 4 Lagrange multiplier selection: Carry out the adaptive Lagrange multiplier selection scheme by means of Equations (30)-(33).
Step 5 Model parameters update: Parameters update using multiple linear regression analysis.
Step 6 Temporal subband frame mode selection: (6-1) Calculate the Lagrange cost J(F_i) for temporal subband frame i over the candidate modes {m_1, m_2, m_3, m_4}. (6-2) Output the best frame mode M* with the minimum cost J(F_i).
Step 7 Loop until all frames are encoded: if frames remain, go to Step 2; otherwise, report the related encoding parameters and exit.

Experimental Results
In this section, extensive experiments have been conducted to verify the effectiveness of the proposed content adaptive Lagrange multiplier selection algorithm for 3-D wavelet-based SVC.


Experimental Setup
In this paper, all the algorithms are implemented in ANSI C in Microsoft Visual C++ 6.0 and in the MATLAB R2012b programming environment. Our experiments are conducted on a 4-core (i5-2400 @ 3.10 GHz) computer equipped with 8 GB of RAM, which is also used to measure the computational complexity of our method. In the simulation, each group of pictures (GOP) contains 16 frames. During the MCTF process, motion estimation is implemented using a full search with quarter-pixel accuracy on the dyadic wavelet coefficients. The block size varies from 4 × 4 to 128 × 128, and the search range in both the horizontal and vertical dimensions is [−16, 15]. The default Lagrange multiplier values are given in the rate-distortion optimized mode selection process. All the other encoder settings are identical for all methods.

Performance Evaluation
To evaluate the effectiveness of our algorithm, we implemented it in the 3-D wavelet-based SVC reference software ENH-MC-EZBC, configured with the common test conditions suggested in the configuration file. In the simulation, we evaluated five representative codecs with the same configuration; these codecs currently deliver the best coding performance for scalable coding of video datasets [53]. In the experiments, only the luminance component is taken into consideration, since the human visual system is less sensitive to color than to luminance. For brevity, the average peak signal-to-noise ratio (PSNR) (dB) and the standard deviation of PSNR (PSNR STD) on the luminance component are used as quality metrics, defined as follows:

PSNR_i = 10 log₁₀ (255² / MSE_i),
PSNR = (1/K) Σ_{i=1}^{K} PSNR_i,
PSNR STD = sqrt((1/K) Σ_{i=1}^{K} (PSNR_i − PSNR)²),

where MSE_i denotes the mean square error between the original and reconstructed frames, PSNR_i stands for the PSNR of the ith reconstructed frame, and K is the number of video frames in a sequence. Note that a higher PSNR means that a better R-D performance is achieved. Meanwhile, the smaller the PSNR STD value, the better the video quality perceived by the end user, and vice versa.
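The two quality metrics can be sketched as:

```python
import numpy as np

def psnr(orig, recon, peak=255.0):
    """Per-frame PSNR in dB from the mean square error."""
    mse = np.mean((orig.astype(float) - recon.astype(float)) ** 2)
    return 10.0 * np.log10(peak * peak / mse)

def psnr_stats(orig_frames, recon_frames):
    """Average PSNR and PSNR STD over a sequence of K frames."""
    vals = np.array([psnr(o, r) for o, r in zip(orig_frames, recon_frames)])
    return vals.mean(), vals.std()
```

Note that a constant per-pixel error of 1 gives MSE = 1 and hence 20·log10(255) ≈ 48.13 dB, a handy sanity check for the implementation.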

Comparison of Rate-Distortion Performance
To verify the overall rate-distortion (R-D) performance of the proposed Lagrange multiplier selection approach, we compare it with successful coding schemes built on the framework of motion-compensated subband coding (MCSBC): ENH-MC-EZBC, RWTH-MC-EZBC, RPI-MC-EZBC, and MC-EZBC. Let "Scheme 1", "Scheme 2", "Scheme 3", and "Scheme 4" denote our method compared to ENH-MC-EZBC, RWTH-MC-EZBC, RPI-MC-EZBC, and MC-EZBC, respectively. Figure 4 shows the R-D curves of the five codecs for six selected test sequences ("Soccer", "Crew", "Stockholm", "Basketball", "Park_joy", and "PeopleOnStreet") at different target bitrates. It is observed that the codec with the proposed algorithm achieves the best R-D performance among all codecs, with an average PSNR improvement of about 0.53-3.2 dB over the other methods. As shown in Figure 4a, for the "Soccer" sequence, the proposed algorithm yields on average 0.79, 1.15, 1.4, and 2.7 dB higher PSNR than ENH-MC-EZBC, RWTH-MC-EZBC, RPI-MC-EZBC, and MC-EZBC, respectively. Average PSNR gains for all the test sequences are also displayed in Figure 5; one can see that the proposed Lagrange multiplier algorithm achieves an average improvement of 0.5-3.57 dB in PSNR.
The PSNR gains are especially significant for test sequences with complex texture and/or fast-moving objects, such as the "Foreman", "Soccer", "City", "Stockholm", "Basketball", and "Park_joy" sequences. These sequences contain abrupt changes in video content characteristics across frames, with fast-moving objects and rich spatial detail, and tend to be encoded with various coding types. Thus, for all target bitrate ranges, our method shows remarkable R-D superiority, with higher PSNR gains obtained by performing content-adaptive Lagrange multiplier selection.
In addition, we also investigate PSNR variations during video reconstruction. Since high fluctuation of frame PSNR values may be perceptually annoying to viewers, the fluctuation of PSNR values is a vital factor for video coding applications. The standard deviation of PSNR (PSNR STD) is utilized to measure the smoothness of video quality: a smaller PSNR STD value indicates smoother PSNR variation and hence more consistent video quality over the video frames.
Figure 7 illustrates the average PSNR STD values over different target bitrates for the proposed algorithm and the other four methods. Compared to these methods, the proposed algorithm achieves more stable visual quality with substantially smaller PSNR fluctuations, reducing the PSNR standard deviation over all frames by up to 3.58 with respect to RPI-MC-EZBC. As shown in Figure 7, our method produces the smallest PSNR fluctuation over the entire sequence among the five codecs. Hence, we conclude that the proposed method helps control quality fluctuations in the reconstructed video and produces more stable visual quality than the others.
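The PSNR STD measure used above can be sketched as follows. This is a generic pure-Python illustration with frames represented as flat pixel lists, not the codec's actual evaluation code.

```python
import math

def frame_psnr(ref, rec, peak=255.0):
    # Mean squared error over all pixels of one frame.
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)

def psnr_std(ref_frames, rec_frames):
    # Population standard deviation of the per-frame PSNR values:
    # smaller values mean smoother, more consistent quality.
    psnrs = [frame_psnr(r, d) for r, d in zip(ref_frames, rec_frames)]
    mean = sum(psnrs) / len(psnrs)
    return math.sqrt(sum((p - mean) ** 2 for p in psnrs) / len(psnrs))
```

A codec that keeps per-frame PSNR nearly constant yields a PSNR STD close to zero even when its average PSNR is modest, which is exactly the smoothness property measured in Figure 7.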

Comparison of Subjective Performance
For subjective inspection of the reconstructed frames, we show the 8th reconstructed frame of the "City" sequence at 896 kbps in Figure 8. The visual quality of the frame reconstructed by the proposed method is conspicuously better than that produced by the other four methods: our codec yields fewer blocking artifacts in homogeneous regions, better-preserved textures, and a sharper appearance. In particular, the regions with high spatial detail (those enclosed by red rectangles on the buildings) are well preserved by the proposed algorithm, whereas they remain unclear in the outputs of the reference codecs.


Figure 8. Subjective visual quality comparisons of the 8th reconstructed frame of the "City" sequence at 896 kbps.

Comparison of Computational Complexity
To measure computational complexity, we define the encoding speed as the number of frames that can be encoded in one second on a hardware platform with an Intel Core i5 4-core 3.10 GHz CPU. The computational complexity is taken as the inverse of the encoding speed, which is measured without the use of assembly, threads, or other program optimization techniques [6,7,15,28]. Table 4 reports the encoding speed for the proposed algorithm, ENH-MC-EZBC, RWTH-MC-EZBC, RPI-MC-EZBC, and MC-EZBC. As shown in Table 4, the proposed algorithm encodes about 1.14, 3.65, and 4.65 times faster than RWTH-MC-EZBC, RPI-MC-EZBC, and MC-EZBC, respectively. The reason is that, for sequences with strong inter-frame dependencies, more bits are allocated to the reference frames, which leads to better prediction results. Thus, only small residual signals need to be processed in the subsequent encoding steps, which further reduces the computational complexity. However, ENH-MC-EZBC encodes about 1.11 times faster than our algorithm, mainly due to the additional operations required to gather statistics on the video content characteristics. In general, the experimental results demonstrate that the computing overhead introduced by the proposed Lagrange multiplier selection algorithm is negligible compared to the other four methods. Nevertheless, our algorithm encodes only 3.07 frames per second on average, which is insufficient for real-time video processing applications.
Since the ENH-MC-EZBC reference software is a non-optimized C++ implementation, developing an efficient low-complexity 3-D wavelet-based scalable video coding scheme remains an important practical problem, which we plan to address in future work.
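The encoding-speed measurement defined above (frames per second, with complexity as its inverse) can be sketched as follows; `encode_frame` is a purely hypothetical stand-in for the codec's per-frame encoding routine.

```python
import time

def encoding_speed(encode_frame, frames):
    """Frames encoded per second; complexity is taken as its inverse."""
    start = time.perf_counter()
    for f in frames:
        encode_frame(f)
    elapsed = time.perf_counter() - start
    fps = len(frames) / elapsed
    return fps, 1.0 / fps  # (encoding speed, computational complexity)

# Toy stand-in for a real encoder call.
speed, complexity = encoding_speed(lambda f: sum(f), [[1, 2, 3]] * 100)
```

Using a monotonic high-resolution clock such as `time.perf_counter` (rather than wall-clock time) avoids distortion from system clock adjustments during long encoding runs.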

Conclusions
In this paper, we have presented an efficient content adaptive Lagrange multiplier selection algorithm for RDO in 3-D wavelet-based SVC. The wavelet filter types, subband coupling in the MCTF process, and temporal subband content characteristics are incorporated into our algorithm to select the Lagrange multiplier adaptively. The simulation results demonstrate that the proposed algorithm clearly outperforms the reference methods in both objective R-D performance and subjective quality, with negligible additional computational complexity.
In future work, we plan to extend our algorithm to other scalabilities beyond quality scalability. The overall video quality can be further improved by employing perceptual features based on the human visual system. In addition, the development of a scalable low-complexity video codec based on the 3-D DWT is a main concern. We will pursue these directions to obtain more compelling results.