Variable Selection Using Adaptive Band Clustering and Physarum Network

Variable selection is a key step for eliminating redundant information in spectroscopy. Among various variable selection methods, the physarum network (PN) is a newly-introduced and efficient one. However, the whole spectrum has to be equally divided into sub-spectral bands in PN. These division criteria limit the selecting ability and prediction performance. In this paper, we transform the spectrum division problem into a clustering problem and solve the problem by using an affinity propagation (AP) algorithm, an adaptive clustering method, to find the optimized number of sub-spectral bands and the number of wavelengths in each sub-spectral band. Experimental results show that combining AP and PN together can achieve similar prediction accuracy with much less wavelength than what PN alone can achieve.


Introduction
Spectroscopy has been widely used for quantitative analysis of complex samples in various fields, such as petrochemical, pharmaceutical, agricultural, food, and biological sectors.It is a non-invasive and efficient analytic technology that can be deployed in on-line analysis.To predict the concentration of one or more samples, a mathematical model has to be built to relate the spectrum (spectra) of the sample (samples) with the concentration.A common problem in spectroscopy is that the large number of spectral variables makes the prediction unreliable and complicates the prediction model.To reduce the dimensionality of the spectral date, projection methods [1], variable selection methods [2], or a combination of both [3][4][5] are used.
Considering the high correlation between adjacent spectral variables (wavelengths) due to the characteristics of the spectrograph, Chen et al. [14] transformed the variable selection problem into a path finding problem, i.e., the whole spectrum was divided into many sub-spectral bands.Each sub-spectral band was regarded as a node in a maze, every wavelength inside the sub-spectral band provided a route to its neighbor node; the shortest route (least correlation) was found by using a physarum network (PN) [15][16][17][18][19], and, thus, one wavelength from each sub-spectral band was selected, and the selected wavelengths had the least correlation.The PN is a mathematical model simulating the foraging process of plasmodium in a maze.This model has been proven to be able to compute the shortest path in the maze [15], and, thus, finds applications in many other fields, such as network design [16], sensor The input required by the AP is the real valued similarities S(i, k) between data point i and point k, which indicates how suitable the data point k is for being the exemplar of data point i.The input S(k, k) for each data point k is called preference and is denoted as p k .The data point with a large preference has a higher chance of being selected as an exemplar.The values of the input preferences can influence the number of exemplars or clusters, i.e., the larger the p k is, the more exemplars the AP will give [20].
The messages passing through the edges are responsibility and availability.The responsibility r(i, k) reflects how well the point k acts as an exemplar to the point i.The availability a(i, k) reflects how well the point i belongs to a class centered on point k [20].An illustration of message exchange in an AP network is given in Figure 1.
Algorithms 2017, 10, 73 3 of 17 The input required by the AP is the real valued similarities ( , ) S i k between data point i and point k , which indicates how suitable the data point k is for being the exemplar of data point i .The input ( , ) S k k for each data point k is called preference and is denoted as k p .The data point with a large preference has a higher chance of being selected as an exemplar.The values of the input preferences can influence the number of exemplars or clusters, i.e., the larger the k p is, the more exemplars the AP will give [20].
The messages passing through the edges are responsibility and availability.The responsibility ( , ) r i k reflects how well the point k acts as an exemplar to the point i .The availability ( , ) a i k reflects how well the point i belongs to a class centered on point k [20].An illustration of message exchange in an AP network is given in Figure 1.The ( , ) r i k and ( , ) a i k are calculated by using the formulas below { , } ( , ) min{0, ( , ) max{0, ( , )}} The ( , ) r k k and ( , ) a k k are calculated using different rules below The responsibility and availability are updated constantly by using the rules below ( , ) ( , ) Where λ is the damping factor between 0 and 1 to avoid numerical oscillations, ( , ) new r i k and ( , ) old r i k are current and previous responsibility, respectively, and ( , ) new a i k and ( , ) old a i k are current and previous availability, respectively.The damping factor λ was set to a value between 0.5 and 0.9 following the suggestion in [21].Different values of λ in the suggested range were tested.It was found that λ in this range has no effect on the final number of groups (see Sections 3.1.2., 3.2.2., and 3.3.2for details), so the λ is set as 0.9, which is the default value suggested in AP Algorithm software [20].The r(i, k) and a(i, k) are calculated by using the formulas below The r(k, k) and a(k, k) are calculated using different rules below The responsibility and availability are updated constantly by using the rules below where λ is the damping factor between 0 and 1 to avoid numerical oscillations, r(i, k) new and r(i, k) old are current and previous responsibility, respectively, and a(i, k) new and a(i, k) old are current and previous availability, respectively.The damping factor λ was set to a value between 0.5 and 0.9 following the suggestion in [21].Different values of λ in the suggested range were tested.It was found that λ in this range has no effect on the final number of groups (see Sections 3.1.2,3.2.1 and 3.3.2for details), so the λ is set as 0.9, which is the default value suggested in AP Algorithm software [20].
To begin with the update of r(i, k) and a(i, k), all the availabilities are set to zero before the starting of Formula (1).
The exemplar of data point i is the data point k that makes r(i, k) + a(i, k) the maximum.The iteration of calculations can be stopped if a fixed step of iteration is reached or the exemplars stay constant for a fixed step of iteration.

Variable Selection Based on AP-PN
AP-PN is a sequential combination of AP and PN.It uses AP to automatically divide the whole spectrum into many sub-spectral bands that are not necessarily equal to each other and then selects one wavelength from each sub-spectral band by using PN.In AP-PN variable selection, a spectral response at a wavelength is regarded as a data point, a sub-spectral band is regarded as a cluster of data points.The input S(i, k) required by AP in this paper is the sample correlation coefficient [24] between wavelength i and wavelength k, which indicates to what extent the spectral information of the two wavelengths are correlated.It is calculated by using Formula ( 7) where x ij and x kj represent the spectral response of the j-th sample at wavelength i and k, respectively, x i and x k are the average spectral response of all samples at wavelength i and k, respectively, and N is the number of samples.The preference p k is the input for controlling the number of clusters [20].It is determined by maximizing the correlation function J(p k ) (shown in Formula ( 8)) or finding the point where the increasing trend of J(p k ) vanishes.
where L is the total number of clusters, c i and c j represent the bands in the n-th cluster, M is the number of wavelengths in the n-th cluster, and m is the total number of wavelengths.The steps of using AP-PN for variable selection are as follows: 1. Calculate the similarity matrix S according to the Formula (7); 2.
Set p k as a value between 0 to 1, staring from 0.00001 and updating by using Calculate the responsibility information and availability information, and update them according to the Formulas(1)-( 6); 5.
Determine the clusters, and go to step 2 until all p k are tried; 6.
Determine the p k according to Formula (8) and the corresponding clusters and exemplars; 7.
Bring the calculated clustering results into the PN for variable selection.

Complexity Analysis of AP-PN
The AP algorithm consists of two parts.The first part is to determine the value of p k .The time complexity of this part is O(max(n 2 , (1 + M) × ML/2)), where L is the number of bands selected in AP, M is the number of wavelengths in each cluster, and it is not greater than 4, and n is the number of total bands.In this paper, M and L are less than n, so the time complexity of the first part of AP is O(n 2 ).The second part of AP is to determine the clustering results.The time complexity of this part is O(n 2 ).The space complexity of AP is O(n 2 ).It is worth noting that the similarity matrix is calculated only once.
The time complexity of PN is O((n + 2) 2 ), and the space complexity is O(n 2 ).

Data Set
This database is available at [28].Three different Near Infrared (NIR) spectrometers were used to measure the corn samples.The whole spectrum was from 1100 to 2498 nm and the scanning resolution was 2 nm, namely, there were 700 wavelengths in total.There were 80 samples in the database.Each sample had four measured properties, i.e., water, oil, protein, and starch.The protein content was the property predicted in this paper.Numerical range of the protein content was from 7.6540 to 9.7110%.We used the hold-out method to select the training set and test set, i.e., we randomly selected the training and test sets.There were ten rounds of selection to generate the training set and test set.In each round, a training set and a test set were randomly produced.A training set consisted of 20 samples and a test set consisted of the remaining 60 samples.In addition, the performance of the model was assessed by the predicted root mean square error (RMSEP) and the correlation coefficient (R) between the predicted values and the real values in the test set.The RMSEP is defined in Formula ( 9) where ŷi are the predicted values, y i are the measured values, and N represents the number of the test sets.

Data Analysis
The AP-PN-GA-PLS, PLS, GA-PLS, PN-GA-PLS, and AP-GA-PLS were used for comparison.For PLS, the number of input wavelengths was the number of full-band variables.
For GA-PLS, the parameters of the GA-PLS model were set according to the literature [3], i.e., the population size was 30, the crossover probability was 50%, the mutation rate was 1%, and the iteration step was 100.The crossover is an operator used to generate child generation chromosome by exchanging the subsequences of the parent chromosomes.The crossover probability is the ratio indicating how many child generations are produced by crossover.The mutation is another operator to generate child generation chromosome by alerting one or more gene values in the parent chromosomes.The mutation rate is a ratio of the number of alerted genes to the whole number of genes.Before the GA-PLS was conducted, the ratio of the number of variables to the number of samples was checked by using the GAPLSOPT function in the PLS-Genetic Algorithm Toolbox provided by Leardi [3].Averaging adjacent variables was done if the ratio did not pass the check.
For PN-GA-PLS, the parameters of the PN were set according to the literature [14], i.e., the total network traffic was 6, the number of iterations was 2000, the threshold of stop was 0.001, and the initial network continuity was 0.00001.The whole spectrum with W wavelengths was equally divided into M sub-spectral bands, each of which has P wavelengths (W = M × P).If the sub-spectral band could not be equally divided (W/P is not an integer), each of the first to the (M-1)-th sub-spectral bands had P wavelengths, and the last sub-spectral band had W − (M − 1) × P wavelengths.Many sets of M and P were tried; M = 100 and P = 7 were determined because they gave the minimum RMSEP.
AP itself can give a representative (exemplar) for each sub-spectral band; it is not necessary to use PN after AP if every exemplar can be used directly to replace a corresponding sub-spectral band.Therefore, it is interesting to examine whether AP-GA-PLS can be used to replace AP-PN-GA-PLS.For the AP-GA-PLS, the settings were the same as those in AP-PN-GA-PLS.After the AP, the exemplars were used directly as the representative of the sub-spectral band, which were input to the GA-PLS.
For the AP-PN-GA-PLS, since we used the hold-out method to select the training set and the test set, the training set in each round was different and so too was the p k .In the first round, λ was set to 0.9, which is a default value suggested in the AP algorithm source code provided in [20], and the p k was then determined as 0.9996.We then fixed the value of p k and set λ to a value in the range of 0.5 to 0.9 to see whether the number of groups would change.It is shown in Table 1 that different values of λ produce the same number of groups (184).So we set λ to 0.9 in all the ten rounds.The other parameters of the PN network are the same as those in PN-GA-PLS.3.2.Diesel Data Set

Data Set
The database is available at [29].The dataset includes the near-infrared spectral data of the diesel samples and its corresponding property values.The whole spectrum was from 750 nm to 1550 nm, and the scanning resolution was 2 nm.
The property used in this research is the viscosity, whose numerical range is from 1.12 to 4.05.There were 395 data samples.We used the hold-out method to select the training set and test set.There were ten rounds for selecting the training set and test set.In each round, a training set and a test set were randomly produced.A training set consisted of 95 samples and a test set consisted of the remaining 300 samples.

Data Analysis
The AP-PN-GA-PLS, PLS, GA-PLS, PN-GA-PLS, and AP-GA-PLS were used for comparison.For the PLS, the number of input wavelengths was the number of the full-band variables.
For the GA-PLS, the parameters were set according to the literature [3].The population size was 30, the crossover probability was 50%, the mutation rate was 1%, and the iteration step was 100.Before the GA-PLS was conducted, the ratio of the number of variables to the number of samples was checked by using the GAPLSOPT function in the PLS-Genetic Algorithm Toolbox provided by Leardi [3].Averaging adjacent variables was done if the ratio did not pass the check.
For the PN-GA-PLS, the total network traffic was 6, the number of iterations was 2000, the threshold of filtering stop was 0.001, the initial network continuity was 0.00001, M = 120, and P = 3.
For the AP-PN-GA-PLS, λ was set to 0.9 in the first round, which is a default value suggested in the AP algorithm source code provided in [20], and the p k was then determined as 0.9953.We then fixed the value of p k and set λ to a value in the range of 0.5 to 0.9 to see whether the number of groups would change.It is shown in Table 2 that different value of λ produced the same number of groups (120), so we set λ to 0.9 in all the ten rounds.The other parameters of the PN network were the same as those in the PN-GA-PLS.For the AP-GA-PLS, the settings were the same as those in the AP-PN-GA-PLS.After the AP, the exemplars were used directly as the representatives of the sub-spectral band, which were input into the GA-PLS.

Data Set
The data set contained chlorophyll with different concentrations of sweet orange leaves and was collected by the hyperspectral imaging system (HSI) used by Chongqing Metrology and Quality Inspection [14].The HSI system consisted of a spectrometer and Charge-coupled Device (CCD) sensors, the spectral range was 400-1000 nm, scanning resolution was 0.74 nm-0.81 nm, and the channel was 761.
The R f is the reflection rate of sweet orange leaves, which can be represented as R f = (I − D)/(S − D).Where I is the image density, S represents the density of white light, and D denotes the density of dark light.After hyperspectral imaging, the mesophyll in the leaves was extracted from the veins.In order to extract chlorophyll from 0.02 g of mesophyll, we used 25 mL of 80% acetone, the concentration of the method described in [30].
One hundred thirty-three samples were included in this data set.We used the hold-out method to select the training set and the test set.There were ten rounds of selection to generate the training set and the test set.In each round, a training set and a test set were randomly produced.A training set consisted of 33 samples and a test set consisted of the remaining 100 samples.

Data Analysis
The AP-PN-GA-PLS, PLS, GA-PLS, PN-GA-PLS and AP-GA-PLS were used for comparison.For the PLS, the number of input wavelengths was the number of the full-band variables.
For the GA-PLS, the parameters were set according to the literature [3].The population size was 30, the crossover probability was 50%, the mutation rate was 1%, and the iteration step was 100.Before the GA-PLS was conducted, the ratio of the number of variables to the number of samples was checked by using the GAPLSOPT function in the PLS-Genetic Algorithm Toolbox provided by Leardi [3].Averaging adjacent variables was done if the ratio could not pass the check.
For the PN-GA-PLS, the total network traffic was 6, the number of iterations was 2000, the threshold of filtering stop was 0.001, the initial network continuity was 0.00001, M = 120, and P = 7.
For the AP-PN-GA-PLS, λ was set to 0.9 in the first selection, which is a default value suggested in the AP Algorithm source code provided in [20], and the p k was then determined as 0.99981.We then fixed the value of p k and set λ to a value in the range of 0.5 to 0.9 to see whether the number of groups would change.It is shown in Table 3 that different value of λ produce the same number of groups (234), so we set λ to 0.9 in all the ten rounds.The other parameters of the PN network were the same as those in the PN-GA-PLS.For the AP-GA-PLS, the settings were the same as those in the AP-PN-GA-PLS.After the AP, the exemplars were used directly as the representatives of the sub-spectral band, which were input into the GA-PLS.

Corn Data Set
The convergence diagram of the AP algorithm run in the first round is shown in Figure 2.With the increase in the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 562, fitness no longer changes.The AP algorithm converges in the first round as well as in the rest of the rounds.

Corn Data Set
The convergence diagram of the AP algorithm run in the first round is shown in Figure 2.With the increase in the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 562, fitness no longer changes.The AP algorithm converges in the first round as well as in the rest of the rounds.The convergence diagram of the PN algorithm run in the first round is shown in Figure 3.The D ij represents the rate of change of the network continuity.The convergence condition of the PN algorithm is that all the rates of change of the network continuity are smaller than a threshold (0.001). Figure 3 shows the relationship of one of the rates of change of the network continuity and iterations.With an increase in the number of iterations, the D ij gradually decreases.When the number of iterations increases to 10, all the D ij are smaller than the threshold of stop.The PN algorithm converges in the first selection as well as in the rest of the rounds.

Corn Data Set
The convergence diagram of the AP algorithm run in the first round is shown in Figure 2.With the increase in the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 562, fitness no longer changes.The AP algorithm converges in the first round as well as in the rest of the rounds.There were 184 wavelengths selected by the AP-PN in the first round, which are illustrated in Figure 4a.These 184 wavelengths were input into the GA-PLS.The times of each wavelength selected by GA-PLS are illustrated in Figure 4b.By using the 16 most selected wavelengths, the root mean square errors of cross-validation (RMSECV = n ∑ i=1 ( y i − y i ) 2 /n, where y i represents the predicted value of the concentration of the i-th sample in the cross validation set, y i represents the measured value of the concentration of the i-th sample in the cross validation set, and n is the number of cross validation sets) can reach the smallest value.These final 16 selected wavelengths found by using AP-PN-GA-PLS are illustrated in Figure 4c.The scatter plot of the predicted value vs. the measured value is given in Figure 4d.The curve that fits the data points best is also given in Figure 4d.The AP-PN-GA-PLS selected 16 wavelengths in the first round, which are shown in Figure 4c.
All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested on the ten rounds, and the average number of selected wavelengths, RMSEP, and Rare summarized in Table 4.
Algorithms 2017, 10, 73 9 of 17 There were 184 wavelengths selected by the AP-PN in the first round, which are illustrated in Figure 4a.These 184 wavelengths were input into the GA-PLS.The times of each wavelength selected by GA-PLS are illustrated in Figure 4b.By using the 16 most selected wavelengths, the root mean square errors of cross-validation ( , where i y  represents the predicted value of the concentration of the i-th sample in the cross validation set, i y represents the measured value of the concentration of the i-th sample in the cross validation set, and n is the number of cross validation sets) can reach the smallest value.These final 16 selected wavelengths found by using AP-PN-GA-PLS are illustrated in Figure 4c.The scatter plot of the predicted value vs. the measured value is given in Figure 4d.The curve that fits the data points best is also given in Figure 4d.The AP-PN-GA-PLS selected 16 wavelengths in the first round, which are shown in Figure 4c.All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested on the ten rounds, and the average number of selected wavelengths, RMSEP, and Rare summarized in Table 4.It is seen from Table 4 that the PN-GA-PLS can achieve similar prediction performance (RMSEP = 0.0597%) as that of GA-PLS (RMSEP = 0.0431%) but with fewer wavelengths (35 vs. 67).This result confirms the conclusion in the literature [14].However, the number of wavelengths selected or the prediction performance of PN-GA-PLS in this research is larger or better than those of PN-GA-PLS in the literature [14], as are those of the GA-PLS.These differences in the number of selected wavelengths or prediction performance are due to the way in which training sets, test sets, and the sub-spectral bands were selected.
AP-PN-GA-PLS can achieve a similar prediction performance (RMSEP=0.0397%)as that of PN-GA-PLS (RMSEP=0.0597%)but with fewer wavelengths.The number of selected wavelengths of AP-PN-GA-PLS and PN-GA-PLS was 25 and 35, respectively.This may suggest that by using AP before PN, AP-PN-GA-PLS can further eliminate the redundant information in variables.Different It is seen from Table 4 that the PN-GA-PLS can achieve similar prediction performance (RMSEP = 0.0597%) as that of GA-PLS (RMSEP = 0.0431%) but with fewer wavelengths (35 vs. 67).This result confirms the conclusion in the literature [14].However, the number of wavelengths selected or the prediction performance of PN-GA-PLS in this research is larger or better than those of PN-GA-PLS in the literature [14], as are those of the GA-PLS.These differences in the number of selected wavelengths or prediction performance are due to the way in which training sets, test sets, and the sub-spectral bands were selected.
AP-PN-GA-PLS can achieve a similar prediction performance (RMSEP = 0.0397%) as that of PN-GA-PLS (RMSEP = 0.0597%) but with fewer wavelengths.The number of selected wavelengths of AP-PN-GA-PLS and PN-GA-PLS was 25 and 35, respectively.This may suggest that by using AP before PN, AP-PN-GA-PLS can further eliminate the redundant information in variables.Different from PN, AP optimized both the number of the sub-spectral bands and the number of the wavelengths within each sub-spectral band, therefore, AP-PN-GA-PLS can achieve similar prediction performance with less wavelength.Table 4.The average values of the number of selected wavelengths, correlation coefficient (R), and predicted root mean square error (RMSEP) using PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS (corn data).It is also shown in Table 4 that AP-GA-PLS can achieve RMSEP of 0.0643% with 39 wavelengths.This wavelength selection performance is comparable to that of PN-GA-PLS but is better than GA-PLS in terms of achieving similar RMSEP with less wavelength.This result suggests that the exemplars selected by AP can be input into GA-PLS directly, which can improve the wavelength selection performance.

Method
However, AP-GA-PLS's wavelength selection performance (39 wavelengths) is worse than that of AP-PN-GA-PLS (25 wavelengths).Every exemplar selected by the AP is a representative of each sub-spectral band, but it only represents the local information of the corresponding sub-spectral band.These exemplars do not consider the global information of the whole spectrum.They do not guarantee that they have the least correlation in a whole.This limitation of AP can be overcome by using PN afterwards.The PN selects one wavelength from each sub-spectral band to ensure that all the wavelengths in a whole have the least correlation.By combining the advantages of both AP and PN, the AP-PN-GA-PLS can thus achieve the best performance.
There were 25 wavelengths selected by AP-PN-GA-PLS.The selected wavelengths 1778 nm and 1780 nm are similar to 1778 nm, which is the absorption wavelength of wheat gluten [31].The selected wavelengths 2154 nm, 2156 nm, 2176 nm, and 2178 nm are within the range of 2100 nm-2200 nm, which are the absorption wavelengths of wheat gluten [31].The selected wavelength 2220 nm is similar to 2230 nm, which is the local minimum absorption wavelength of wheat gluten [31].

Diesel Data Set
The convergence diagram of the AP algorithm is shown in Figure 5.With the increase of the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 386, and fitness no longer changes.The AP algorithm converges in the first round as well as in the rest of the rounds.
The convergence diagram of the PN algorithm is shown in Figure 6.The D ij represents the rate of change of the network continuity.With the increase of the number of the iterations, the D ij gradually decreases.When the number of iterations increases to 18, all the D ij are smaller than the threshold of stop.The PN algorithm is converged.
The number of wavelengths selected by using AP-PN in the first round was 120, which is illustrated in Figure 7a.These 120 wavelengths were input into GA-PLS.The times of each wavelength selected by GA-PLS are illustrated in Figure 7b.By using the 25 most selected wavelengths, the root mean square errors of cross-validation (RMSECV) can reach the smallest value.These final 25 selected wavelengths determined by using AP-PN-GA-PLS are illustrated in Figure 7c.The scatter plot of the predicted value vs. the measured value is given in Figure 7d.The curve that fits the data points best is also given in Figure 7d.
Figure 7c.The scatter plot of the predicted value vs. the measured value is given in Figure 7d.The curve that fits the data points best is also given in Figure 7d.All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested in the ten rounds, and the average number of selected wavelength, RMSEP, and R are summarized in Table 5.
It is seen from Table 5 that AP-PN-GA-PLS can give the largest R (0.9744) and the smallest RMSEP (0.1167%) among PN-GA-PLS (R = 0.9727, RMSEP = 0.1404%), AP-GA-PLS (R = 0.9722, RMSEP = 0.1356%), and PLS (R=0.9716,RMSEP = 0.1370%).It is also observed that the AP-PN-GA-PLS achieved good prediction performance with the least number of wavelengths among all methods.The number of selected wavelengths by AP-PN-GA-PLS was26.The number of Figure 7c.The scatter plot of the predicted value vs. the measured value is given in Figure 7d.The curve that fits the data points best is also given in Figure 7d.All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested in the ten rounds, and the average number of selected wavelength, RMSEP, and R are summarized in Table 5.
It is seen from Table 5 that AP-PN-GA-PLS can give the largest R (0.9744) and the smallest RMSEP (0.1167%) among PN-GA-PLS (R = 0.9727, RMSEP = 0.1404%), AP-GA-PLS (R = 0.9722, RMSEP = 0.1356%), and PLS (R=0.9716,RMSEP = 0.1370%).It is also observed that the AP-PN-GA-PLS achieved good prediction performance with the least number of wavelengths among all methods.The number of selected wavelengths by AP-PN-GA-PLS was26.The number of All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested in the ten rounds, and the average number of selected wavelength, RMSEP, and R are summarized in Table 5.
It is seen from Table 5 that AP-PN-GA-PLS can give the largest R (0.9744) and the smallest RMSEP (0.1167%) among PN-GA-PLS (R = 0.9727, RMSEP = 0.1404%), AP-GA-PLS (R = 0.9722, RMSEP = 0.1356%), and PLS (R = 0.9716, RMSEP = 0.1370%).It is also observed that the AP-PN-GA-PLS achieved good prediction performance with the least number of wavelengths among all methods.The number of selected wavelengths by AP-PN-GA-PLS was 26.The number of selected wavelengths by AP-GA-PLS, PN-GA-PLS, GA-PLS, and PLS were 66, 40, 142, and 401, respectively.The AP-PN-GA-PLS model may, thus, achieve the least complexity without degrading the prediction accuracy.The AP-GA-PLS can further reduce the redundant information in variables, compared with GA-PLS, but its wavelength selection performance (66 wavelengths selected, RMSEP = 0.1356%) is not as good as that of PN-GA-PLS (40 wavelengths selected, RMSEP = 0.1404%).
selected wavelengths by AP-GA-PLS, PN-GA-PLS, GA-PLS, and PLS were 66, 40, 142, and 401, respectively.The AP-PN-GA-PLS model may, thus, achieve the least complexity without degrading the prediction accuracy.The AP-GA-PLS can further reduce the redundant information in variables, compared with GA-PLS, but its wavelength selection performance (66 wavelengths selected, RMSEP = 0.1356%) is not as good as that of PN-GA-PLS (40 wavelengths selected, RMSEP = 0.1404%).The AP-PN-GA-PLS selected 26 wavelengths.In general, the polycyclic aromatic hydrocarbon (PAHs) relates to the viscosity of diesel [32].The final selected wavelengths of 942 nm and 1046 nm are similar to934 nm and 1053 nm, respectively, which are the absorption wavelengths of methylene that relate to octane number.The selected wavelengths of 1422 nm and 1426 nm are the absorption wavelengths of the aromatic ring that relates to viscosity [32].
Table 5.The average values of the number of selected wavelengths, R, and RMSEP using PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS (diesel data).

Sweet Orange Leaves Data Set
The convergence diagram of the AP algorithm in the first round is shown in Figure 8.With the increase of the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 486, fitness no longer changes.The AP algorithm converges in the first round as well as in the other rounds.The AP-PN-GA-PLS selected 26 wavelengths.In general, the polycyclic aromatic hydrocarbon (PAHs) relates to the viscosity of diesel [32].The final selected wavelengths of 942 nm and 1046 nm are similar to 934 nm and 1053 nm, respectively, which are the absorption wavelengths of methylene that relate to octane number.The selected wavelengths of 1422 nm and 1426 nm are the absorption wavelengths of the aromatic ring that relates to viscosity [32].

Sweet Orange Leaves Data Set
The convergence diagram of the AP algorithm in the first round is shown in Figure 8.With the increase of the number of iterations, the fitness (net similarity) of quantized intermediate solutions gradually increases.When the number of iterations increases to 486, fitness no longer changes.The AP algorithm converges in the first round as well as in the other rounds.The number of wavelengths selected by using AP-PN was234 in the first round, which is illustrated in Figure 10a.These 234 wavelengths were input into GA-PLS.The times of each wavelength selected by GA-PLS is illustrated in Figure 10b.By using the 25 most selected wavelengths, the RMSECV can reach smallest value.These final 25 wavelengths selected by using AP-PN-GA-PLS are illustrated in Figure 10c.The scatter plot of the predicted value vs. the measured value is given in Figure 10d.The curve that fits the data points best is also given in Figure 10d.The number of wavelengths selected by using AP-PN was234 in the first round, which is illustrated in Figure 10a.These 234 wavelengths were input into GA-PLS.The times of each wavelength selected by GA-PLS is illustrated in Figure 10b.By using the 25 most selected wavelengths, the RMSECV can reach smallest value.These final 25 wavelengths selected by using AP-PN-GA-PLS are illustrated in Figure 10c.The scatter plot of the predicted value vs. the measured value is given in Figure 10d.The curve that fits the data points best is also given in Figure 10d.The number of wavelengths selected by using AP-PN was 234 in the first round, which is illustrated in Figure 10a.These 234 wavelengths were input into GA-PLS.The times of each wavelength selected by GA-PLS is illustrated in Figure 10b.By using the 25 most selected wavelengths, the RMSECV can reach smallest value.These final 25 wavelengths selected by using AP-PN-GA-PLS are illustrated in Figure 10c.The scatter plot of the predicted value vs. the measured value is given in Figure 10d.The curve that fits the data points best is also given in Figure 10d.All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested in the ten rounds, and the average number of selected wavelength, RMSEP, and R are summarized in Table 6.
It is seen from Table 6 that the PN-GA-PLS can achieve similar prediction performance (RMSEP=1.3206%)as that of GA-PLS (RMSEP = 1.4998%) but with fewer wavelengths (49 vs. 125).This result confirms the conclusion in the literature [14].However, the number of wavelengths selected or the prediction performance of PN-GA-PLS in this research is larger or better than those of PN-GA-PLS in the literature [14], as are those of the GA-PLS.These differences in the number of selected wavelengths or prediction performance are due to the way in which training sets, test sets, and the sub-spectral bands were selected.
It is also observed that the AP-PN-GA-PLS achieved good prediction performance with the least number of wavelengths among all methods.The number of selected wavelengths by AP-PN-GA-PLS was27.The numbers of selected wavelengths by AP-GA-PLS, PN-GA-PLS, GA-PLS, and PLS were 56, 49, 125, and 761, respectively.The AP-PN-GA-PLS model may, thus, achieve the least complexity without degrading the prediction accuracy.
The AP-PN-GA-PLS selected 27 wavelengths.The final selected wavelengths of 564 nm and 571 nm are similar to 568 nm.The selected wavelengths 581 nm is similar to 582 nm.The isotropic absorption point of chlorophyll is 568 nm and that of chlorophyll b is 582 nm [33].The selected wavelength of 576 nm is the local maximum absorption wavelength of chlorophyll a [33].The selected wavelength of 615 nm is similar to 614 nm, which is the local maximum absorption wavelength of chlorophyll a [33].All five algorithms, i.e., PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS, were tested in the ten rounds, and the average number of selected wavelength, RMSEP, and R are summarized in Table 6.
It is seen from Table 6 that the PN-GA-PLS can achieve similar prediction performance (RMSEP = 1.3206%) as that of GA-PLS (RMSEP = 1.4998%) but with fewer wavelengths (49 vs. 125).This result confirms the conclusion in the literature [14].However, the number of wavelengths selected or the prediction performance of PN-GA-PLS in this research is larger or better than those of PN-GA-PLS in the literature [14], as are those of the GA-PLS.These differences in the number of selected wavelengths or prediction performance are due to the way in which training sets, test sets, and the sub-spectral bands were selected.
It is also observed that the AP-PN-GA-PLS achieved good prediction performance with the least number of wavelengths among all methods.The number of selected wavelengths by AP-PN-GA-PLS was 27.The numbers of selected wavelengths by AP-GA-PLS, PN-GA-PLS, GA-PLS, and PLS were 56, 49, 125, and 761, respectively.The AP-PN-GA-PLS model may, thus, achieve the least complexity without degrading the prediction accuracy.
The AP-PN-GA-PLS selected 27 wavelengths.The final selected wavelengths of 564 nm and 571 nm are similar to 568 nm.The selected wavelengths 581 nm is similar to 582 nm.The isotropic absorption point of chlorophyll is 568 nm and that of chlorophyll b is 582 nm [33].The selected wavelength of 576 nm is the local maximum absorption wavelength of chlorophyll a [33].The selected wavelength of 615 nm is similar to 614 nm, which is the local maximum absorption wavelength of chlorophyll a [33].Table 6.The average values of the number of selected wavelengths, R, and RMSEP using PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS (orange leaves data).

Conclusions
In spectroscopy, variable selection is important for the establishment of a prediction model.We proposed a method based on AP and PN for variable selection.The AP overcomes the PN's limitation in dividing spectrum for building a network.Instead of dividing the spectrum equally, the AP can find optimized numbers of sub-spectral bands and numbers of wavelengths in each sub-spectral band.
The AP-PN can achieve higher prediction accuracy with fewer wavelengths than PN.It can also be combined with other variable selection methods, such as GA-PLS, to further eliminate the redundant information in variables and, thus, simplify the final prediction model and reduce the computation load.
The proposed algorithm has been tested on three databases.However, the amount of the samples in the databases is limited.To gain a large database is always a challenge.In future, the algorithm could be tested on simulated databases, in which artificial data with changeable parameters exist.By performing this simulation, more behaviors of the algorithm can be observed, which will help to improve the algorithm further.

Figure 1 .
Figure 1.Two messages, responsibility ( ( , ) r i k , ( , ) r i k ′ ,and ( , ) r i k′ ), and availability ( ( , ) a i k , ( , ) a i k ′ , and ( , ) a i k′ ), are passing in an affinity propagation network.The responsibility ( , ) r i k and ( , ) r i k ′ reflect how well the point k acts as an exemplar to the point i and i′ , respectively.The availability ( , ) a i k and ( , ) a i k ′ reflect how well the point i and i′ belong to a class centered on point k .The responsibility ( , ) r i k′ reflects how well the point k′ acts as an exemplar to the point i′ .The availability ( , ) a i k′ reflects how well the point i belongs to a class centered on point k′ .

Figure 1 .
Figure1.Two messages, responsibility (r(i, k), r(i , k), and r(i, k ) ), and availability (a(i, k), a(i , k), and a(i, k ) ), are passing in an affinity propagation network.The responsibility r(i, k) and r(i , k) reflect how well the point k acts as an exemplar to the point i and i , respectively.The availability a(i, k) and a(i , k) reflect how well the point i and i belong to a class centered on point k.The responsibility r(i, k ) reflects how well the point k acts as an exemplar to the point i .The availability a(i, k ) reflects how well the point i belongs to a class centered on point k .

Figure 2 .
Figure 2. The convergence diagram of the affinity propagation (AP) algorithm (corn data) in the first round: Iterations vs. Fitness (net similarity) of quantized intermediate solutions.When the iteration is 562, the network is converged.

Figure 3
shows the relationship of one of the rates of change of the network continuity and iterations.With an increase in the number of iterations, the ij D gradually decreases.When the number of iterations increases to 10, all the ij D are smaller than the threshold of stop.The PN algorithm converges in the first selection as well as in the rest of the rounds.

Figure 3 .
Figure 3.The convergence diagram of the physarum network (PN) algorithm (corn data) in the first round: Iterations vs. ij D ( ij D is the rate of change of the network continuity).When the iteration is

Figure 2 .
Figure 2. The convergence diagram of the affinity propagation (AP) algorithm (corn data) in the first round: Iterations vs. Fitness (net similarity) of quantized intermediate solutions.When the iteration is 562, the network is converged.

Figure 2 .
Figure 2. The convergence diagram of the affinity propagation (AP) algorithm (corn data) in the first round: Iterations vs. Fitness (net similarity) of quantized intermediate solutions.When the iteration is 562, the network is converged.

Figure 3 .
Figure 3.The convergence diagram of the physarum network (PN) algorithm (corn data) in the first round: Iterations vs. ij D ( ij D is the rate of change of the network continuity).When the iteration is

Figure 3 .
Figure 3.The convergence diagram of the physarum network (PN) algorithm (corn data) in the first round: Iterations vs. D ij (D ij is the rate of change of the network continuity).When the iteration is 10, all the D ij are smaller than the threshold of stop, and the network converges.

Figure 4 .
Figure 4. Corn data (predicting protein content) wavelength selection result by using AP-PN-GA-PLS in the first round: (a) spectral responses vs. wavelengths, the 184 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-184), the wavelengths above the lower line were selected; (c)spectral response vs. wavelengths, the 16 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.PLS = partial least square; GA = genetic algorithm.

Figure 4 .
Figure 4. Corn data (predicting protein content) wavelength selection result by using AP-PN-GA-PLS in the first round: (a) spectral responses vs. wavelengths, the 184 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-184), the wavelengths above the lower line were selected; (c) spectral response vs. wavelengths, the 16 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.PLS = partial least square; GA = genetic algorithm.

Figure 5 .
Figure 5.The convergence diagram of the AP algorithm (diesel data): Iterations vs. Fitness (net similarity) of quantized intermediate solution in the first round.When the iteration is 386, the network is converged.

Figure 6 .
Figure 6.The convergence diagram of the PN algorithm (corn data): Iterations vs. ij D ( ij D is the rate

Figure 5 .
Figure 5.The convergence diagram of the AP algorithm (diesel data): Iterations vs. Fitness (net similarity) of quantized intermediate solution in the first round.When the iteration is 386, the network is converged.

Figure 5 .
Figure 5.The convergence diagram of the AP algorithm (diesel data): Iterations vs. Fitness (net similarity) of quantized intermediate solution in the first round.When the iteration is 386, the network is converged.

Figure 6 .
Figure 6.The convergence diagram of the PN algorithm (corn data): Iterations vs. ij D ( ij D is the rate

Figure 6 .
Figure 6.The convergence diagram of the PN algorithm (corn data): Iterations vs. D ij (D ij is the rate of change of the network continuity).When the iteration is 18, all the D ij are smaller than the threshold of stop, and the network is converged.

Figure 7 .
Figure 7. Diesel data (predicting viscosity content) wavelength selection result in the first round: (a) spectral responses vs. wavelengths, the 120 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-120); (c) spectral response vs. wavelengths, the 25 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.

Figure 7 .
Figure 7. Diesel data (predicting viscosity content) wavelength selection result in the first round: (a) spectral responses vs. wavelengths, the 120 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-120); (c) spectral response vs. wavelengths, the 25 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.

Figure 8 .
Figure 8.The convergence diagram of the AP algorithm (sweet orange leaves data) in the first round: Iterations vs. Fitness (net similarity) of quantized intermediate solutions.When the iteration is 486, the network is converged.The convergence diagram of the PN algorithm in the first round is shown in Figure 9.The ij D represents the rate of change of the network continuity.With the increase of the number of the iterations, the ij D gradually decreases.When the number of iterations increases to 132, all the ij D are smaller than the threshold of stop.The PN algorithm converges in the first round as well as in the other rounds.

Figure 9 .
Figure 9.The convergence diagram of the PN algorithm (corn data) in the first round: Iterations vs. ij D ( ij D is the rate of change of the network continuity).When the iteration is 132, all the ij D are smaller than the threshold of stop, and the network is converged.

Figure 8 . 17 Figure 8 .
Figure 8.The convergence diagram of the AP algorithm (sweet orange leaves data) in the first round: Iterations vs. Fitness (net similarity) of quantized intermediate solutions.When the iteration is 486, the network is converged.

Figure 9 .
Figure 9.The convergence diagram of the PN algorithm (corn data) in the first round: Iterations vs. ij D ( ij D is the rate of change of the network continuity).When the iteration is 132, all the ij D are smaller than the threshold of stop, and the network is converged.

Figure 9 .
Figure 9.The convergence diagram of the PN algorithm (corn data) in the first round: Iterations vs. D ij (D ij is the rate of change of the network continuity).When the iteration is 132, all the D ij are smaller than the threshold of stop, and the network is converged.

Figure10.
Figure10.Sweet orange leaves data (predicting viscosity content) wavelength selection results in the first round: (a) spectral responses vs. wavelengths, the 234 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-234); (c) spectral response vs. wavelengths, the 25 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.

Figure 10 .
Figure 10.Sweet orange leaves data (predicting viscosity content) wavelength selection results in the first round: (a) spectral responses vs. wavelengths, the 234 wavelengths selected by AP-PN are marked with circles; (b) times of selection (by GA-PLS) vs. wavelength identifier (1-234); (c) spectral response vs. wavelengths, the 25 wavelengths selected by AP-PN-GA-PLS are marked with circles; (d) scatter plot of predicted values vs. the measured values.

Table 1 .
The relationship between λ and the number of groups (corn data).

Table 2 .
The relationship between λ and the number of groups (diesel data).

Table 3 .
The relationship between λ and the number of groups (orange leaves data).

Table 5 .
The average values of the number of selected wavelengths, R, and RMSEP using PLS, GA-PLS, PN-GA-PLS, AP-GA-PLS, and AP-PN-GA-PLS (diesel data).