Over the past decades, near-infrared (NIR) spectroscopy has been successfully applied in many areas, such as agriculture, medicine, environment [1
]. Compared with traditional laboratory methods, NIR has the advantages of rapid speed and noninvasiveness. Usually, the NIR spectra collected from samples have the following several characteristics. Firstly, the number of necessary wavelengths is far higher than the number of samples, which, from the point of view of multivariate equations, means that the number of X factors (wavelengths) is far higher than the number of equations (samples). Hence, it is impossible to obtain a unique solution. Secondly, the contribution of each wavelength to quantitative analysis is different, since some wavelengths may be strongly correlated with the target content, while others may show little or no correlation. Obviously, it is not possible to consider the whole range of NIR spectra to perform subsequent qualitive or quantitative analysis. Hence, wavelength selection (also called “feature selection” and “variable selection”) is an essential preprocessing step to find the most representative wavelengths and eliminate uninformative wavelengths.
In recent years, many researchers have focused on the wavelength selection issue and proposed a series of algorithms which have been proven effective in many areas. For example, Norgaard et al. [2
] proposed the interval PLS (iPLS) method, which first divides the whole range of spectrum into several intervals and then adopts a forward/backward stepwise selection algorithm to choose the most effective interval combinations. However, in how many intervals should the whole range of the spectrum be divided? Should it be divided equally or non-equally? These factors have a great influence on the wavelength selection results. Centner et al. [3
] proposed a novel feature selection method called uninformative variable elimination (UVE), which brings in some random variables as the criterion of evaluating the correlation between wavelength and output target component. If the correlation coefficient of a certain wavelength is smaller than the random variables, the wavelength variable can be eliminated as an uninformative variable. The idea of the UVE algorithm is ingenious and intuitionistic. However, the disadvantage of this method is that the number of selected wavelengths is commonly large because this approach can only eliminate the uninformative wavelengths without selecting the most representative wavelengths [4
]. Our group previously proposed an L1 regularization-based wavelength selection method [5
], which considered the wavelength selection problem as a sparsity optimization issue. By adjusting the value of the sparsity parameter
, we can freely control the number of selected wavelengths, which is reduced while the value of
increases. Detailed information can be found in reference [5
Besides the above-mentioned methods, evolution and swarm optimization methods (such as genetic algorithm [6
], bat algorithm [7
], particle swarm optimization [1
], etc.) have also been applied to solve the wavelength selection problem. The main idea at the basis of these methods is similar. Firstly, they imply the generation of an initial population which is comprised of some individuals, each of which is a binary sequence. The length of each binary sequence is equal to the number of wavelengths, and the value of each point in the binary sequence is “1” or “0”, indicating if the corresponding wavelength is selected or not. Secondly, the iterative searching is implemented by a series of heuristic strategies (for example, selection, crossover, mutation operators in genetic algorithm). However, due to the fact that there are some random mechanisms in swarm optimization methods (such as roulette wheel selection, Lévy flight, etc.), the wavelength selection results of each optimization are variable; hence, it is often confusing which wavelength variables should be taken into account. In 2016, Mirjalili et al. [8
] proposed a novel swarm optimization method called “dragonfly algorithm (DA)” based on the observation, summary, and abstraction of the behaviors of the dragonfly in nature. Hence, the main contribution of this paper is to apply the binary dragonfly algorithm to solve the wavelength selection problem and improve its stability on the basis of the ensemble learning method.
The paper is organized as follows. Section 2
introduces the principles of the dragonfly algorithm and the basic idea of the transition dragonfly algorithm from continuous domain to binary domain. The proposed algorithm is introduced in detail in Section 3
. The experimental results and discussion are presented in Section 4
. Finally, Section 5
summarizes the contribution of this work and suggests some directions for future studies.
3. Wavelength Selection Framework Based on BDA and Ensemble Learning
The flowchart of BDA-based wavelength selection method is illustrated in Figure 2
. The main procedures are as follows.
Step 1: Mapping the wavelength selection problem to the BDA optimization problem. More precisely, initialize the binary dragonfly population. Without loss of generality, suppose there are K wavelengths in the whole range, hence the individuals in the initialized population are in binary series whose length is equal to K, in which the values of “1” and “0” indicate if the corresponding wavelength is selected or not.
Step 2: Evaluating the “goodness” or “badness” of each individual in the initialized population. For each individual, find the selected wavelength combinations and then establish the quantitative analysis model to predict the content of octane by using the partial least-squares (PLS) method. In this paper, the cost function (similar to the fitness function in the genetic algorithm) is defined as the root-mean-square error of cross validation (RMSECV) of the quantitative analysis model, which means that the individual with the smallest RMSECV has a good wavelength combination.
Step 3: Judging whether the stop criteria are satisfied or not. If yes, output the best individual in the population as the selected wavelength. If no, go to step 4. Generally, there are two approaches to design the stop criteria. The first takes the maximum iterations as the stop criteria, while the second takes the absolute error between two successive iterations as the stop criteria.
According to Table 1
, each individual in the population updates its position through five strategies, including separation, alignment, cohesion, attraction to food, and distraction from enemy. Finally, a new binary dragonfly population will be generated.
Step 5: Go back to Step 2, calculate the cost function of each individual in the new population, then execute the loop until the stop criteria are satisfied.
In this paper, we propose three typical wavelength selection methods based on BDA, as illustrated in Figure 3
Similar to traditional swarm optimization methods, the wavelength section algorithm was designed on the basis of the single BDA. Figure 3
a shows the procedure, which consists in using the BDA only once to search the most representative wavelengths. As mentioned above, because there are some random mechanisms in BDA, the wavelength selection results of each search are different.
To solve this problem, Figure 3
b illustrates a possible solution which was designed on the basis of the multi-BDA. Differently from the single-BDA method, in this case the BDA is used many times, and then all the resulting selected wavelengths are aggregated by a voting strategy; those wavelengths with high votes are selected as the final wavelengths. However, an issue needs to be considered, that is, what is the quantitative criterion of “high votes”? Should it be 60%, 70%, 80%, or higher? The higher the vote percentage, the lower the number of selected wavelengths. If the number of selected wavelengths is too small, the performance of the subsequent quantitative analysis model may be reduced. Therefore, reaching a tradeoff between them is a challenge.
Furthermore, we propose a novel wavelength selection framework based on BDA and ensemble learning, as shown in Figure 3
c. Traditionally, ensemble learning is often used to combine several models to improve the generalized performance of a model. However, in this paper, ensemble learning was not used to improve the model’s performance. Instead, the idea of traditional ensemble learning was introduced to improve the stability of the wavelength selection results. Hence, only one PLS model was used during the wavelength selection period. Compared with the multi-BDA method, before the BDA searching procedure, Bootstrap sampling is added to randomly generate a series of different subsets. Suppose there are N samples in the raw dataset: N samples are drawn from the raw dataset with replacement, and then replicated samples are eliminated; hence, usually, the size of the subset becomes smaller because of the elimination.
Similar to the traditional swarm optimization algorithms, the single-BDA method also suffers from instability. In contrast, the multi-BDA and ensemble learning-based BDA methods can improve the stability. The key technical skill is to reduce the randomness inherent in the BDA. Even if both multi-BDA and ensemble learning-based BDA can achieve this, their principles are different. Although the multi-BDA method has many BDA selectors, they are built on the whole raw dataset. In contrast, the BDA selectors in the ensemble leaning method are built on different subsets. The variety of subsets leads to the different wavelength selection results from each BDA. This is the essence of ensemble learning. Besides, the computational complexity of the ensemble learning-based BDA method is smaller than that of the multi-BDA method, which mainly reflects in the computation of the cost (fitness) function RMSECV.
First of all, let us look at the scenario of the multi-BDA method. For example, consider a raw dataset with 100 samples and suppose the raw dataset is uniformly divided into five folds, each of which contains 20 samples. Establish five PLS models, each of which is trained on the basis of four folds (80) samples and tested on the remaining (20) samples.
Next, let us look at the scenario of the ensemble learning-based BDA method. According to the theoretical analysis, the size of subsets generated by Bootstrap sampling is approximately 63.2% that of the raw dataset (the whole theoretical analysis process can be found in reference [12
]). Hence, there are about 63 samples in the subsets, and, similar to the multi-BDA method, the subset is uniformly divided into five folds, each of which contains about 12 samples. Establish five PLS models, each of which is trained on the basis of four folds (about 48) samples and tested on the remaining (12) samples.
The experimental results proved that, though the size of the samples in the subset became smaller, the performance was close to that of the multi-BDA method.
In terms of the wavelength selection problem in NIR spectroscopy, this paper proposes a novel method based on binary dragonfly algorithms, which includes three typical frameworks: single-BDA, multi-BDA, and ensemble learning-based BDA framework. The experimental results for a public gasoline NIR dataset showed that by using the proposed method, we could improve the stability of the wavelength selection results through the multi-BDA and ensemble learning-based BDA methods. With respect to the subsequent quantitative analysis modeling, the wavelengths selected with the multi-BDA and ensemble learning BDA methods outperformed those selected with the single-BDA method. When comparing the ensemble learning-based BDA and the multi-BDA methods, it can be seen that they can provide similar wavelength selection results with lower computational complexity. The results also indicated that the proposed method is not limited to the dragonfly algorithm but it can be combined with other swarm optimization algorithms. In addition, the ensemble learning idea can be applied to other feature selection areas to obtain more robust results.
Besides, during the experimental results analysis, we found that, for the same selected wavelengths, there were great changes of the generalized performance (RMSECV) between different subsets. This means that not only the wavelength variables, but also the samples influence the generalized performance of the quantitative analysis model. Currently, the majority of studies are focused on the wavelength selection problem; we suggest that sample selection using swarm optimization methods is an open problem that needs to be studied further.
Additionally, in this study, we found that the votes percentage should be set between 60% and 70%. However, we do not know whether this range is also suitable for other datasets. This should be validated further. Actually, in general, we should not be concerned by which wavelengths are selected. Instead, we may want to know how good the quantitative analysis model will be by using the selected wavelengths. Hence, improving the generalized performance of the quantitative analysis model is a hot topic in NIR spectroscopy. With the development of artificial intelligence, ensemble learning methods (such as random forest, adaboost, etc.) and deep learning (such as convolutional neural networks [14
]) will be mainstream in the future.