Pretreatment and Wavelength Selection Method for Near-Infrared Spectra Signal Based on Improved CEEMDAN Energy Entropy and Permutation Entropy

Noise and redundant spectral information in near-infrared spectra can reduce the accuracy of calibration and prediction models in near-infrared analytical technology. To address this problem, the improved Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and permutation entropy (PE) were combined into a new method for pretreatment and wavelength selection of near-infrared spectra signals. The near-infrared spectra of glucose solutions were used as the research object: the improved CEEMDAN energy entropy was used to reconstruct the spectral data and remove noise, and the useful wavelengths were selected based on PE after spectra segmentation. Firstly, the intrinsic mode functions of the original spectra were obtained by the improved CEEMDAN algorithm. The useful signal modes and noisy signal modes were then identified by the energy entropy, and the spectral signal was reconstructed as the sum of the useful signal modes. Finally, the reconstructed spectra were segmented and the wavelengths with abundant glucose information were selected based on PE. To evaluate the performance of the proposed method, support vector regression and partial least squares regression were used to build calibration models using the wavelengths selected by the new method, mutual information, the successive projections algorithm, principal component analysis, and full spectra data. The models were evaluated by the correlation coefficient and root mean square error of prediction. The experimental results showed that the improved CEEMDAN energy entropy can effectively reconstruct near-infrared spectra signals and that PE can effectively perform wavelength selection. Therefore, the proposed method can improve the precision of spectral analysis and the stability of the model for near-infrared spectra analysis.


Introduction
Diabetes, which is a kind of blood glucose metabolism disorder, causes serious health problems [1]. According to the statistical data from the International Diabetes Federation (IDF), the number of people with diabetes will reach 592 million in 2025 [2]. The foundations of diabetes treatment are regular blood glucose detection, diet plans, and injected or oral insulin. Therefore, blood glucose detection is the key step to an effective diabetes treatment. Non-invasive blood glucose detection is painless, convenient, and affordable. Given the development of computer technology and chemometrics in recent years, high-efficiency and low-cost near-infrared spectra technologies that can perform fast analysis are widely used in non-invasive blood glucose detection [3]. Near-infrared light, which lies between visible light and mid-infrared light, spans wavelengths from 700 to 2500 nm [4]. For example, glucose molecules contain C-H, N-H, and O-H groups, and the stretching vibration of these hydrogen groups [5] produces absorption bands of a certain strength at frequency-doubling and combined frequencies in the near-infrared wavelength region. The different numbers of hydrogen groups at different glucose concentrations affect the intensity at the peak position. Therefore, the glucose concentration can be quantitatively analyzed based on near-infrared spectra and the Beer-Lambert law. However, different hydrogen groups have varying near-infrared characteristics, and some groups have no absorption or only weak absorption capacity. Model quality declines when all wavelengths are used to build the calibration and prediction models. Moreover, near-infrared spectra themselves have some problems, such as numerous wavelength points, overlap of spectral information, and low absorption intensity. Therefore, spectra pretreatment and wavelength selection [6], applied before building the calibration model, are critical for simplifying the model and improving its predictive ability in near-infrared spectra analysis technology.
The acquired spectra contain not only useful information related to the glucose concentration, but also many uncorrelated noise signals. This noise affects spectral quality and model accuracy, so it needs to be removed. At present, Empirical Mode Decomposition (EMD) is widely used in the signal denoising domain [7][8][9][10]. EMD decomposes a time series into intrinsic mode functions (IMFs) and a residue that depends on the time scale. EMD is an adaptive signal processing method that can effectively analyze stationary and non-stationary signals. Ensemble Empirical Mode Decomposition (EEMD) solves the mode mixing problem of the EMD method by adding white Gaussian noise; however, this approach also introduces residue noise [11]. Complementary Ensemble Empirical Mode Decomposition (CEEMD) eliminates residue noise in the reconstructed signal by adding a pair of positive and negative signals [12]. However, this method produces false modes. The CEEMDAN method, which has an iteration number that is half of that of the EEMD method, accurately completes signal reconstruction [13]. However, false modes in the early stage of the CEEMDAN method and residue noise in the modes are still observed. Therefore, an improved CEEMDAN method [14] is used to denoise and reconstruct near-infrared spectra signals in this study.
A superior quantitative calibration model can be obtained by selecting characteristic wavelengths or wavelength intervals with a specific method. Wavelength selection can simplify the model and reduce modeling time. Irrelevant or nonlinear variables should be eliminated to obtain a calibration model with strong prediction ability and stability [15]. Therefore, wavelength selection procedures are particularly important when dealing with near-infrared spectra data. Common wavelength (variable) selection methods include the correlation coefficient [16], uninformative variable elimination [17], interval partial least squares [18], the successive projections algorithm (SPA) [19], simulated annealing, and genetic algorithms [20]. Permutation entropy (PE) was proposed by C. Bandt and B. Pompe as a method for detecting randomness in time series [21][22][23][24][25][26]. In this study, a new wavelength selection method based on PE is proposed.
This study proposes a new pretreatment and wavelength selection method. Firstly, the original near-infrared spectra signal is decomposed by the improved CEEMDAN to obtain IMFs. The critical point between useful signal modes and noisy signal modes is identified through the energy entropy of each IMF. The reconstructed signal is the sum of the useful signal modes and the residue. The characteristic wavelengths are then selected by comparing the PE of the same wavelength interval between the spectra of glucose solution and pure water. Finally, the performance of the proposed method is verified by quantitative models established with PLSR and SVR. The results show that the proposed pretreatment and wavelength selection method outperforms the other pretreatment and wavelength selection methods in near-infrared spectra analysis.

EMD Method
According to [27], the general steps of the EMD method are as follows:
(1) Find all the maxima and minima of the signal $s(t)$.
(2) Obtain the upper envelope composed of all the maxima and the lower envelope composed of all the minima using cubic spline interpolation, and define them as $u(t)$ and $v(t)$, respectively.
(3) Compute the mean of the upper and lower envelopes, $m(t) = \frac{u(t) + v(t)}{2}$.
(4) Compute the difference between the original signal and the mean of the envelopes, $h(t) = s(t) - m(t)$.
(5) If $h(t)$ satisfies the IMF conditions, then $h(t)$ is taken as $c_1(t)$. Otherwise, repeat steps (1)-(4) on $h(t)$ until $c_1(t)$ is obtained. An IMF must satisfy two conditions: the number of extrema and the number of zero crossings are equal or differ by at most one, and the mean of the upper and lower envelopes is zero at every point.
(6) Take $r_1(t) = s(t) - c_1(t)$ as a new signal to be analyzed, and repeat steps (1)-(5) to obtain the second IMF $c_2(t)$ and the residue $r_2(t) = r_1(t) - c_2(t)$.
(7) Repeat the above steps; the decomposition ends when the residue $r_n(t)$ is a monotonic function.
Finally, a set of IMFs, $c_1(t), c_2(t), \ldots, c_n(t)$, and the residue $r_n(t)$ are obtained. Therefore, the original signal is
$$s(t) = \sum_{i=1}^{n} c_i(t) + r_n(t) \tag{1}$$
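Steps (1)-(4) above form a single sifting pass, which can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study: the function name `sift_once` is hypothetical, and adding the two signal endpoints as spline knots is an assumed boundary treatment that the text does not specify.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema


def sift_once(s, t):
    """One sifting pass: return h(t) = s(t) - m(t) and the envelope mean m(t)."""
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    # Step (1): interior local maxima and minima of s(t)
    max_idx = argrelextrema(s, np.greater)[0]
    min_idx = argrelextrema(s, np.less)[0]
    if len(max_idx) < 2 or len(min_idx) < 2:
        raise ValueError("too few extrema to build envelopes")
    # Assumption: both endpoints are added as knots to limit spline overshoot
    last = len(s) - 1
    upper_knots = np.concatenate(([0], max_idx, [last]))
    lower_knots = np.concatenate(([0], min_idx, [last]))
    # Step (2): cubic-spline upper envelope u(t) and lower envelope v(t)
    u = CubicSpline(t[upper_knots], s[upper_knots])(t)
    v = CubicSpline(t[lower_knots], s[lower_knots])(t)
    # Step (3): envelope mean m(t); step (4): candidate mode h(t)
    m = (u + v) / 2.0
    return s - m, m
```

A full EMD would call this pass repeatedly until the IMF conditions in step (5) hold, then subtract the resulting mode and continue with the residue.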

Improved CEEMDAN Method
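According to Ref. [14], given $x^{(i)} = x + w_i$, the first mode of the CEEMDAN algorithm is
$$\mathrm{IMF}_1 = \langle E_1(x^{(i)}) \rangle = \langle x^{(i)} \rangle - \langle M(x^{(i)}) \rangle \tag{2}$$
where $x$ is the original signal, $w_i$ is a realization of zero-mean, unit-variance white Gaussian noise, $E_1$ is the function that extracts the first mode decomposed by EMD ($E_1(x) = x - M(x)$), $M(\cdot)$ is the operator that produces the local mean of the applied signal, and $\langle \cdot \rangle$ is the action of averaging over the realizations. If only the local mean is estimated and subtracted from the original signal, $\mathrm{IMF}_1 = x - \langle M(x^{(i)}) \rangle$. Based on the above, the improved CEEMDAN method is described as follows:
(1) Decompose the signal $x^{(i)} = x + \beta_0 E_1(w_i)$ using the EMD algorithm to obtain the first residue and the first mode:
$$r_1 = \langle M(x^{(i)}) \rangle \tag{3}$$
$$\mathrm{IMF}_1 = x - r_1 \tag{4}$$
(2) The second mode is
$$\mathrm{IMF}_2 = r_1 - r_2 = r_1 - \langle M(r_1 + \beta_1 E_2(w_i)) \rangle \tag{5}$$
(3) The k-th mode is
$$\mathrm{IMF}_k = r_{k-1} - r_k = r_{k-1} - \langle M(r_{k-1} + \beta_{k-1} E_k(w_i)) \rangle \tag{6}$$
where $x$ is the original signal, $\beta_0$ is the standard deviation of the added white Gaussian noise ($\beta_{k-1}$ plays the same role at stage $k$), $E_k(\cdot)$ is the operator that produces the k-th mode obtained by the EMD algorithm, and $i = 1, 2, \ldots, N$ indexes the noise realizations, with $N$ the total ensemble number.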

Energy Entropy of IMF
Entropy is used to describe the irregular and complex evolution of time series. Changes in the composition of a signal can be distinguished directly by comparing how certain characteristics of the signal entropy change [28]. The IMF components decomposed by improved CEEMDAN contain the local characteristics of the original signal and time-scale information with different characteristics. The joint distribution of signal energy entropy over frequency and time can be given accurately through the characteristic information of the signal expressed at different resolutions. The concept of information entropy is introduced into the energy distribution analysis of the IMFs to describe the difference. Information entropy measures the uncertainty in locating a system in a certain state; for a time series $(x_1, x_2, \ldots, x_n)$, it measures the degree of unknownness and can be used to estimate the complexity of a random signal. The entropy in this process is expressed by the following formula
$$H(x) = -\int p(x) \ln p(x)\, dx \tag{7}$$
where $p(x)$ is the joint probability density function of $(x_1, x_2, \ldots, x_n)$.
Each IMF component is equally divided into $N$ segments along the time axis. The energy of each segment is $W_i$ $(i = 1, 2, \ldots, N)$ and the energy of the whole timeline is $A$. The energy of each segment is normalized to obtain the normalized energy values $q_i = W_i / A$. With reference to the information entropy formula, the energy entropy of an IMF is defined as [29]
$$H(q) = -\sum_{i=1}^{N} q_i \ln q_i \tag{8}$$
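Equation (8) can be illustrated with a short Python sketch. The function name `energy_entropy` is hypothetical, and two details are assumptions because the text does not state them: the segment energy $W_i$ is taken as the sum of squared samples, and segments with zero energy are skipped so that $0 \ln 0$ is treated as 0.

```python
import numpy as np


def energy_entropy(imf, n_segments):
    """Energy entropy H(q) = -sum(q_i * ln q_i) of one IMF, following Equation (8)."""
    imf = np.asarray(imf, dtype=float)
    # Split the IMF into N (nearly) equal segments along the time axis
    segments = np.array_split(imf, n_segments)
    # Assumption: the segment energy W_i is the sum of squared samples
    w = np.array([np.sum(seg ** 2) for seg in segments])
    total = w.sum()  # A, the energy of the whole timeline
    if total == 0.0:
        return 0.0
    q = w / total    # normalized energies q_i = W_i / A
    q = q[q > 0.0]   # skip empty segments, treating 0 * ln 0 as 0
    return float(-np.sum(q * np.log(q)))
```

With this definition the entropy is largest, $\ln N$, when the energy is spread evenly over the segments, and it is zero when all the energy falls in a single segment.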

Permutation Entropy
According to [21], PE is defined as follows. Consider a time series $\{x(i), i = 1, 2, \ldots, N\}$ of length $N$. It is reconstructed in phase space to obtain the vectors
$$X(i) = \{x(i), x(i+\tau), \ldots, x(i+(m-1)\tau)\}, \quad i = 1, 2, \ldots, N-(m-1)\tau \tag{9}$$
where $m$ and $\tau$ are the embedding dimension and delay time, respectively. Afterward, an ordinal pattern probability distribution $P = \{p_j, j = 1, \ldots, m!\}$ can be obtained from the time series by computing the relative frequencies of the $m!$ possible permutations. The PE is just the Shannon entropy estimated by using this ordinal pattern probability distribution,
$$S(P) = -\sum_{j=1}^{m!} p_j \ln p_j \tag{10}$$
If some ordinal patterns appear more frequently than others, the PE decreases, indicating that the signal is less random and more predictable [30]. For convenience, the PE is typically normalized by $\ln m!$, namely,
$$H_p = \frac{S(P)}{S_{\max}} \tag{11}$$
where $S_{\max} = \ln(m!)$ is the value obtained from an equiprobable ordinal pattern probability distribution. Therefore, $H_p$ ranges between 0 and 1. The magnitude of $H_p$ represents the degree of randomness of the time series: the smaller the value of $H_p$, the more regular the time series; the larger the value, the more stochastic the time series. The change in $H_p$ reflects and amplifies the minute details of the time series.
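The definition above translates into a few lines of Python. The sketch below is illustrative only: the function name `permutation_entropy` is hypothetical, and ties between equal values are broken by order of appearance (a stable sort), a convention the text does not specify.

```python
import math
from collections import Counter

import numpy as np


def permutation_entropy(x, m=4, tau=1, normalize=True):
    """Permutation entropy of a 1-D series with embedding dimension m and delay tau."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors < 1:
        raise ValueError("series too short for the chosen m and tau")
    # Ordinal pattern of each embedded vector {x(i), x(i+tau), ..., x(i+(m-1)tau)}
    patterns = Counter(
        tuple(np.argsort(x[i:i + (m - 1) * tau + 1:tau], kind="stable"))
        for i in range(n_vectors)
    )
    # Relative frequencies p_j of the patterns that actually occur
    p = np.array(list(patterns.values()), dtype=float) / n_vectors
    s = float(-np.sum(p * np.log(p)))  # Shannon entropy of the distribution
    return s / math.log(math.factorial(m)) if normalize else s
```

A monotonic series produces a single ordinal pattern and therefore a PE of 0, whereas white noise visits all $m!$ patterns almost equally often and gives a normalized PE close to 1.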

Selection of Relevant Mode
The noisy signal $y(t)$ can be decomposed into several modes by the improved CEEMDAN algorithm as
$$y(t) = \sum_{i=1}^{I} \mathrm{IMF}_i(t) + r(t) \tag{12}$$
Equation (12) can also be expressed as the sum of noisy modes and useful signal modes as
$$y(t) = \sum_{i=1}^{k-1} \mathrm{IMF}_i(t) + \sum_{i=k}^{I} \mathrm{IMF}_i(t) + r(t) \tag{13}$$
where the first $(k-1)$ modes are noisy modes, and the remaining modes are the useful signal modes and the residue. The critical task is to find $k$ to reconstruct the signal. Signal reconstruction can also be understood as low-pass filtering: the first several high-frequency IMFs (noise modes) are removed, and the low-frequency IMFs (useful signal modes) are kept and summed to reconstruct the signal. Given that each IMF contains different frequency components and different energy, the energy of the IMFs is measured by energy entropy to select the relevant modes effectively. A large number of experiments show that the energy entropy of the noise modes lies around one value and that of the useful signal modes lies around another, with only small variation within each group. The maximum energy entropy appears at the first useful signal mode. Therefore, a mutational point exists between the two kinds of modes, which is the maximum of the energy entropies of all IMFs. The mode index that corresponds to the mutational point is $k$. The steps of the selection of the relevant mode are as follows:
(1) The noisy signal $y(t)$ is decomposed by the improved CEEMDAN algorithm to obtain the modes $\mathrm{IMF}_i$.
(2) The energy entropy of each $\mathrm{IMF}_i$ is calculated and denoted as $EE_i$ $(i = 1, 2, \ldots, I)$, where $I$ is the number of modes obtained by the improved CEEMDAN algorithm.
(3) The relevant mode is identified as
$$k = \arg\max_{1 \le i \le I} EE_i \tag{14}$$
(4) The reconstructed signal is
$$\hat{y}(t) = \sum_{i=k}^{I} \mathrm{IMF}_i(t) + r(t) \tag{15}$$
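Steps (3) and (4) can be sketched as follows, assuming the decomposition and the energy entropies from steps (1) and (2) are already available. The function name `reconstruct_from_entropy` and the array layout (one IMF per row, ordered from high to low frequency) are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def reconstruct_from_entropy(imfs, residue, entropies):
    """Select k as the mode with maximum energy entropy and sum modes k..I plus the residue.

    imfs      : array of shape (I, n_samples), one IMF per row (assumed layout)
    residue   : array of shape (n_samples,)
    entropies : energy entropy EE_i of each IMF, length I
    Returns the one-based mode index k and the reconstructed signal.
    """
    imfs = np.asarray(imfs, dtype=float)
    residue = np.asarray(residue, dtype=float)
    entropies = np.asarray(entropies, dtype=float)
    if imfs.shape[0] != entropies.shape[0]:
        raise ValueError("one entropy value is required per IMF")
    k0 = int(np.argmax(entropies))  # zero-based index of the mutational point
    # Discard the first k-1 (noisy) modes; keep the rest and the residue
    reconstructed = imfs[k0:].sum(axis=0) + residue
    return k0 + 1, reconstructed
```

If several modes share the maximum entropy, `np.argmax` returns the first of them, so the reconstruction then keeps more modes rather than fewer.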

Application
The periodic test signal y(t) has a data length of 1024 and combines sinusoidal components of the form sin(2πft) at two frequencies, f1 = 2 Hz and f2 = 4 Hz. White Gaussian noise of 3 dB is added to the signal y(t) (Figure 1). The signal is decomposed by improved CEEMDAN, where the ratio of the standard deviation of the added white noise is 0.2 and the ensemble number is 50. To illustrate the stability of the proposed reconstruction method, the test is repeated 10 times. Each time, the noisy signal y(t) is decomposed by the improved CEEMDAN algorithm, and the energy entropy of each IMF is then calculated. Figure 2 shows that the noisy signal is decomposed into nine IMFs and one residue. The energy entropy of each IMF is listed in Table 1. The maximum energy entropy corresponds to IMF8. Therefore, the index k of the mutational point is 8 (Figure 3), and the useful modes start with the eighth mode: the eighth and ninth modes are the useful signal modes, and the reconstructed signal is the sum of the last three components (IMF8, IMF9, and the residue, labeled IMF10). The results of the other nine tests are similar to those of the first test. The reconstructed signal is shown in Figure 4, which illustrates that the energy entropy can effectively distinguish the noisy modes from the useful modes.
To compare the reconstruction results, improved CEEMDAN energy entropy, Fourier transform (the cut-off frequency is 70 Hz), wavelet transform (the mother wavelet is db3, and the level of decomposition is 5), moving averaging (the size of the sliding window is 5), and median filtering (the dim is 2) are used to reconstruct the signal. The reconstruction performance is evaluated at various input signal-to-noise ratios (SNR), which range from 1 to 10 dB with a fixed step of 1 dB. The output SNR and mean square error (MSE) are calculated to quantify the reconstruction result.
$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{n=1}^{N} y^2(n)}{\sum_{n=1}^{N} [y(n) - \hat{y}(n)]^2} \right) \tag{16}$$
$$\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} [y(n) - \hat{y}(n)]^2 \tag{17}$$
where $y(n)$ is the pure signal, $\hat{y}(n)$ is the reconstructed signal, and $N$ is the data length. Tables 2 and 3 list the SNR and MSE of the different reconstruction methods. For this evaluation, the ratio of the standard deviation of the added white noise is 0.2 and the ensemble number is 100 in the improved CEEMDAN algorithm. The SNR and MSE values are averages over 10 tests. Based on Tables 2 and 3, the SNR of the reconstruction method based on improved CEEMDAN energy entropy is larger than that of the other methods, and its MSE is smaller. These results show that the proposed reconstruction method is superior to the other methods.
To verify the validity of the proposed method, the non-stationary ECG signal (from the MIT-BIH Normal Sinus Rhythm Database) and the Blocks signal [31], both with 5 dB white Gaussian noise, are introduced into the experiments. The reconstructed results were then compared with Fourier transform (the cut-off frequency is 110 Hz), wavelet transform (the mother wavelet is db5, and the level of decomposition is 10), moving averaging (the size of the sliding window is 5), and median filtering (the dim is 5). Table 4 shows the output SNR and MSE of the ECG signal and Blocks signal. In the table, the output SNR of the proposed method is higher, and its MSE smaller, than that of the other methods. The structure of the ECG signal is different from that of the Blocks signal. These results demonstrate the broad applicability of the proposed method based on improved CEEMDAN energy entropy.
To verify the validity of the proposed method for a different noise distribution, uniformly distributed noise between 0 and 1 is added to the periodic signal y(t), the ECG signal, and the Blocks signal. The reconstructed results were then compared with Fourier transform (the cut-off frequency is 110 Hz), wavelet transform (the mother wavelet is db5, and the level of decomposition is 10), moving averaging (the size of the sliding window is 5), and median filtering (the dim is 5). Table 5 shows the output SNR and MSE of the periodic signal y(t), ECG signal, and Blocks signal. In the table, the output SNR of the proposed method is higher, and its MSE smaller, than that of the other methods. These results demonstrate that the proposed method based on improved CEEMDAN energy entropy is also effective for uniformly distributed noise.
Overall, the method of selecting the relevant mode to distinguish the noise modes from the useful signal modes is explained in Section 3.1. In Section 3.2, three kinds of signals are introduced to illustrate the effectiveness of the proposed method. The periodic signal y(t) is a stationary signal, and the two signals with different structures, the ECG (electrocardiogram) signal and the Blocks signal, are non-stationary signals.
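The output SNR and MSE used to score the reconstructions in this section can be computed with the short sketch below. The function names are hypothetical, and the formulas follow the usual definitions, with the pure signal as the reference and the difference between the pure and reconstructed signals as the error.

```python
import numpy as np


def output_snr_db(pure, reconstructed):
    """Output SNR in dB: signal energy over the energy of the reconstruction error."""
    pure = np.asarray(pure, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    error_energy = np.sum((pure - reconstructed) ** 2)
    return float(10.0 * np.log10(np.sum(pure ** 2) / error_energy))


def mean_square_error(pure, reconstructed):
    """Mean square error between the pure and the reconstructed signal."""
    pure = np.asarray(pure, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((pure - reconstructed) ** 2))
```

For example, a pure signal of four ones reconstructed as four values of 0.9 gives an error energy of 0.04 against a signal energy of 4, that is, an output SNR of 20 dB and an MSE of 0.01.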

Near-Infrared Spectra Collection
The near-infrared spectra were measured on an Antaris II FT-NIR instrument (America Thermo Company, Shanghai, China) in the spectral range of 833 nm to 2630 nm at 4 cm⁻¹ resolution. The structure of the measurement system is shown in Figure 5. In the glucose concentration measurement experiments, all glucose solutions, with concentrations ranging from 50 to 1000 mg/dL, are continuous, evenly distributed liquids prepared uniformly under the same conditions. Each concentration is measured five times to reduce the statistical error, and the collected near-infrared spectra of the glucose solutions are shown in Figure 6.

Reconstruction of Near-Infrared Spectra
The noise of the collected near-infrared spectral data is removed with the improved CEEMDAN energy entropy method. The method is run with a standard deviation of the added white noise of 0.2 and an ensemble number of 100. The reconstruction efficiency of the proposed method was compared with that of the wavelet filter method (the mother wavelet is db5, and the level of decomposition is 10), the moving average filter method (the size of the sliding window is 5), and the median filter method (the dim is 2). The reconstructed results of the near-infrared spectra for a 700 mg/dL glucose solution are shown in Figure 7. To quantify the reconstructed results and verify the effectiveness of these methods, the SNR and MSE were calculated for the different methods. Given that the noisy signal was used in place of the pure signal y(n) in Equations (16) and (17), the evaluation is opposite to that of the simulation signals, i.e., the smaller the SNR (and the bigger the MSE), the better the reconstruction effect. The SNR and MSE values of the different methods are shown in Table 6. The SNR and MSE values generated by the improved CEEMDAN energy entropy method are 24.0355 and 0.0297, respectively. These values are better than those generated by the other methods. The results show that the reconstructed signal based on the improved CEEMDAN energy entropy is smooth and preserves the characteristics of the near-infrared spectra. The proposed method shows excellent performance in de-noising and signal reconstruction.
Entropy 2017, 19, 380

Wavelength Selection of Near-Infrared Spectra
The characteristic wavelengths are selected from the reconstructed near-infrared spectra data of the glucose solution. The full spectrum has a total of 1867 wavelength points, which are divided into wavelength intervals with a rolling window. The rolling window size W is chosen according to the rule W > 5m! [32,33], where m is the order of the ordinal patterns, i.e., the embedding dimension. The permutation entropy of each wavelength interval is calculated with an embedding dimension of 4 and a delay time of 1 in this experiment. Therefore, the window size must be larger than 5 × 4! = 120. However, with an extremely large rolling window, the permutation entropy of some spectral absorption peaks will be missed. Given these conditions, the window size is chosen as 130 for the wavelength selection of near-infrared spectra. To illustrate the proposed method, four solutions are used in the calculation: glucose solutions of 50, 500, and 1000 mg/dL, and pure water. The calculated results are shown in Figure 8. As shown in the figure, the PE values are substantially consistent in some wavelength intervals and significantly different in others. The latter wavelength intervals are therefore the characteristic wavelengths that contain abundant glucose concentration information. All of the non-overlapping intervals are taken as the final characteristic wavelengths (Table 7). Combining Figure 6 and Table 7 shows that the selected characteristic wavelengths contain the peak positions of the near-infrared spectra, which correspond to the peaks of glucose absorption.
To verify the effectiveness of the proposed method, the characteristic wavelengths of the reconstructed spectral data of glucose solutions selected by the proposed method, the mutual information (MI) method [34], the SPA method [35], and the PCA method [36], as well as the full spectral data, are used in calibration models established by PLSR [37] and SVR [38] ($\varepsilon \in [0, 0.2]$, $C \in [1, 10^8]$, $\gamma \in [0.01, 2]$). The correlation coefficient (R) and root mean square error of prediction (RMSEP) of the models are evaluated:
$$R = \sqrt{1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{18}$$
$$\mathrm{RMSEP} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{19}$$
where $n$ is the sample quantity of the calibration set, $y_i$ is the true value of the i-th sample, $\hat{y}_i$ is the predicted value of the i-th sample, and $\bar{y}$ is the average value of $y_i$ over all the samples in the calibration set.
The number of characteristic wavelength points selected based on the permutation entropy is 375, which is lower than the number of points in the full spectrum. The fewer the selected characteristic wavelength points, the shorter the modeling time. The experimental results of the PLSR and SVR calibration models (Table 8) show that the correlation coefficient (R) and RMSEP of the calibration models built with the characteristic wavelengths selected based on the improved CEEMDAN energy entropy method reach 0.9999/0.9998 and 0.9125/0.9089, respectively. This result is better than that of the calibration models built with the characteristic wavelengths selected by the MI method, SPA method, and PCA method, and with the full spectral data. The overall modeling results of SVR are more reliable than those of PLSR. The errors between the predicted values and the true values are shown in Figure 9.
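The interval-wise PE comparison between glucose and pure-water spectra used for wavelength selection in this section can be sketched as follows. This is a hedged illustration only: the function names, the use of non-overlapping windows, and the numeric `threshold` that stands in for "significantly different" are all assumptions, since the text does not give the window step or a decision threshold.

```python
import math
from collections import Counter

import numpy as np


def perm_entropy(x, m=4, tau=1):
    """Normalized permutation entropy of a 1-D segment."""
    n_vectors = len(x) - (m - 1) * tau
    counts = Counter(
        tuple(np.argsort(x[i:i + (m - 1) * tau + 1:tau], kind="stable"))
        for i in range(n_vectors)
    )
    p = np.array(list(counts.values()), dtype=float) / n_vectors
    return float(-np.sum(p * np.log(p)) / math.log(math.factorial(m)))


def min_window(m):
    """Lower bound of the rule W > 5 m! for the window size."""
    return 5 * math.factorial(m)


def select_intervals(sample, water, window=130, m=4, tau=1, threshold=0.05):
    """Return (start, stop) index pairs where sample and water PE differ markedly."""
    if window <= min_window(m):
        raise ValueError("window must exceed 5 * m!")
    sample = np.asarray(sample, dtype=float)
    water = np.asarray(water, dtype=float)
    selected = []
    # Assumption: windows are non-overlapping; a trailing partial window is ignored
    for start in range(0, len(sample) - window + 1, window):
        stop = start + window
        diff = abs(perm_entropy(sample[start:stop], m, tau)
                   - perm_entropy(water[start:stop], m, tau))
        if diff > threshold:  # hypothetical stand-in for "significantly different"
            selected.append((start, stop))
    return selected
```

With m = 4 the rule gives a lower bound of 120 points, so the window of 130 used in this study passes the check, while any window of 120 points or fewer is rejected.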

Conclusions
This study proposed a novel pretreatment and wavelength selection method for near-infrared spectra signals using the improved CEEMDAN energy entropy and permutation entropy. For signal reconstruction, the proposed method was compared with the Fourier transform, wavelet transform, moving average, and median methods for removing noise at different input SNRs. The reconstruction results show that the proposed method based on the improved CEEMDAN energy entropy works best. Taking the near-infrared spectral data of glucose solutions as the object, the full spectral data are reconstructed by the improved CEEMDAN energy entropy to remove noise. To select the characteristic wavelengths, the reconstructed near-infrared spectra are divided into intervals of a fixed number of points, and the PE value of each wavelength interval is then calculated. The PLSR and SVR models are introduced to establish calibration models with the characteristic wavelengths selected by the PE method, MI method, SPA method, and PCA method, and with the full spectral data. According to the correlation coefficient and RMSEP of the calibration models, the proposed wavelength selection method effectively reduces the redundancy of near-infrared spectral data. This approach also improves the robustness and predictive ability of the regression model. Therefore, the proposed method can remove useless noise information and reduce the effective range of data to establish stable, accurate, and practicable quantitative models.
Acknowledgments: The authors are grateful to the anonymous reviewers and the Associate Editor for their comments and suggestions, which significantly improved the quality of the paper. This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. HIT.IBRSEM.

Figure 2. The IMFs obtained by the improved CEEMDAN algorithm.

Figure 3. The energy entropy of each IMF.

Figure 4. The pure signal and reconstructed signal.

Figure 5. Diagram of the measurement system structure.

Figure 6. The near-infrared spectral data of glucose solution.

Figure 8. The PE of different segmented spectral data.

Figure 9. The errors and the predicted values of the two methods: (a) SVR model; (b) PLSR model.

Table 1. The energy entropy of each IMF.

Table 2. Values of SNR for different signal reconstruction methods.

Table 3. Values of MSE for different signal reconstruction methods.

Table 4. Values of SNR and MSE of different reconstruction methods for the ECG signal and the Blocks signal (input SNR = 5 dB).

Table 5. Values of SNR and MSE of different reconstruction methods for the signals y(t), ECG, and Blocks with uniformly distributed noise.

Table 6. Values of SNR and MSE for different signal reconstruction methods.

Table 7. Selection of characteristic wavelengths.

Table 8. R and RMSEP of the SVR model and the PLSR model.
