Concentration-Emission Matrix (CEM) Spectroscopy Combined with GA-SVM: An Analytical Method to Recognize Oil Species in Marine Environments

The establishment and development of methods for accurate oil recognition in different environments are of great significance to the effective management of oil spill pollution. In this work, the concentration-emission matrix (CEM) is formed by introducing the concentration dimension. Principal component analysis (PCA) is applied to extract the spectral features. Two classification methods, Probabilistic Neural Networks (PNNs) and a Support Vector Machine whose parameters are optimized by a Genetic Algorithm (GA-SVM), are used for oil identification, and their recognition accuracies are compared. The results show that GA-SVM combined with PCA has the highest recognition accuracy for different oils. The proposed approach has great potential for rapid and accurate oil source identification.


Introduction
The types and ratios of polycyclic aromatic hydrocarbons (PAHs) and their derivatives differ significantly among different kinds of spilled oils, which provides a theoretical basis for oil identification. Traditional methods such as gas chromatography-mass spectrometry (GC-MS) [1][2][3][4], gas chromatography-flame ionization detection (GC-FID) [5][6][7] and high-performance liquid chromatography (HPLC) [8][9][10] have high resolution and are considered reliable analytical methods. However, they involve complicated pretreatment procedures and require large amounts of solvents. In addition, traditional analytical instruments can only be operated in a laboratory environment, which makes it challenging to perform on-site analysis and quickly deliver identification results during oil spill emergency treatment [11][12][13][14].
The fluorescence technique is a sensitive, rapid and non-destructive screening method that can be used as a complement to traditional methods [15]. Oil-bearing samples contain a variety of polycyclic aromatic hydrocarbons, which produce fluorescence when excited, so the fluorescent nature of a petroleum product is related to the electronic structure of its aromatic compounds. Therefore, some researchers have studied the fluorescence properties of petroleum products to establish reliable fluorescence analysis methods for oil spills [16][17][18][19][20]. Mirnaghi et al. used EEM fluorescence spectroscopy, the PARAFAC method and principal component analysis (PCA) to identify and quantify unknown spilled oil, thereby delivering a preliminary evaluation of the petroleum products as soon as possible [21].
However, one major problem in the identification of oil samples by the fluorescence technique is that the shapes of fluorescence spectra are sensitive to concentration, as reported by several researchers [22][23][24]. At higher concentrations, the fluorescence of the low-ring PAHs is quenched, while the fluorescence information of the high-ring PAHs is retained. Spectral changes caused by this red shift behavior can affect oil species recognition results based on fluorescence spectral techniques [25][26][27][28]. Therefore, the PAH composition ratio of an oil cannot be adequately represented by only one particular concentration.
The other problem is how to choose appropriate feature extraction and recognition algorithms to identify the species of oils. Recently, machine learning has been widely used to solve pattern recognition, data mining and other problems, using methods such as Convolutional Neural Networks (CNNs) [29] and Probabilistic Neural Networks (PNNs) [30]. However, these neural network algorithms are not suitable for cases with small sample datasets. The Support Vector Machine (SVM) is a state-of-the-art data processing method with unique advantages in dealing with complex problems such as finite samples and high-dimensional nonlinear data [31][32][33][34][35]. Rios-Reina et al. distinguished Spanish protected designation of origin wine vinegars based on a support vector machine and obtained good classification results (>92% classification rate) [36]. The performance of the SVM is closely related to its penalty factor and kernel parameters; therefore, choosing appropriate parameters is the key to improving classification accuracy. Currently, there are many parameter optimization methods. Chang et al. used the particle swarm optimization (PSO) algorithm to optimize SVM parameters, which improved the classification of image textures [37]. Zhang et al. used the ant colony algorithm to optimize SVM parameters and improved the classification performance of SVM [38]. Li et al. applied cross validation (CV) and a genetic algorithm (GA) to optimize the parameters of SVM, respectively, and the results showed that GA-SVM can perform rapid and highly accurate recognition [39]. Nevertheless, there are few reports on the identification of fluorescence spectra based on SVM.
In order to solve the above problems, three-dimensional concentration-emission matrix (CEM) spectra are formed by introducing the concentration dimension. A CEM includes not only the wavelengths and intensities of the fluorescence peaks but also the variation of the fluorescence spectra with concentration, and can therefore express all the fluorescence information from low-ring PAHs to high-ring PAHs. To better extract the spectral information, PCA is used to extract the features of the CEM spectra. The GA-SVM algorithm and PNNs are used to recognize six kinds of oils under different environmental conditions, and the recognition accuracies of the two algorithms are compared. Our research provides a method for fast and accurate identification of oil samples.

The CEM Characteristics of Different Oil Samples
The three-dimensional concentration-emission matrix (CEM) is formed by obtaining the fluorescence emission spectra at an excitation wavelength of 266 nm for a series of 10 concentrations. The emission spectrum at each concentration is a matrix with 146 rows and 1 column, so the CEM composed of the emission spectra at 10 concentrations is a matrix with 146 rows and 10 columns. Figure 1 shows the CEM spectra of six kinds of oils.
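The construction of a CEM described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB code; the spectrum-reading function and its toy red-shift model are assumptions standing in for real fluorometer measurements.

```python
import numpy as np

# Illustrative sketch: a CEM is built by stacking the emission spectrum
# measured at each of the 10 concentrations as one column of a 146 x 10 matrix.
n_emission = 146   # emission wavelength channels
concentrations_ppm = [1.2, 2.4, 4.9, 9.8, 19.5, 39, 78, 156, 312, 625]

def measure_emission_spectrum(c_ppm):
    """Placeholder for a real fluorometer reading: returns a 146-point spectrum.
    The Gaussian peak and its concentration-dependent red shift are toy values."""
    wavelengths = np.linspace(280, 570, n_emission)
    peak = 340 + 0.05 * c_ppm                 # toy red shift with concentration
    return np.exp(-((wavelengths - peak) / 30.0) ** 2) * c_ppm

# Stack the 10 emission spectra column-wise into the 146 x 10 CEM matrix.
cem = np.column_stack([measure_emission_spectrum(c) for c in concentrations_ppm])
print(cem.shape)  # (146, 10)
```

With 6 oils, 5 weathering times and 6 water samples, 180 such matrices form the dataset described below.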

The CEM spectra of crude oil have one fluorescence peak, located at a concentration/λem of 39 ppm/420 nm, while heavy oil has two fluorescence areas, centered at 78 ppm/350 nm and 78 ppm/420 nm. 0#diesel and Shell Helix 10w-40 have similar spectral characteristics; both have one CEM peak, centered at 625 ppm/340 nm. 92#gasoline and motor oil 20w-40 both have two peaks: the components at 156 ppm/300 nm and 312 ppm/340 nm belong to 92#gasoline, and the components at 312 ppm/360 nm and 312 ppm/390 nm belong to motor oil 20w-40. As the concentration of an oil sample increases, its fluorescence spectra exhibit red-shift behavior. Different kinds of oils have different components, so their fluorescence red shift behavior differs vastly. Thus, this provides the basis for the identification of different types of oils using CEM spectroscopy.
The fluorescence characteristics of the water environment can differ among areas, and the fluorescence spectra of oils can change during the weathering process; both effects interfere with the fluorescence spectra of the oils. In this work, the weathering processes and the different seawater samples are taken into consideration to establish a spectral database for oil recognition. Five oil stock solutions with different weathering times (0, 10, 20, 30 and 40 days) are prepared for each oil through weathering experiments, giving 30 stock solutions in total. Water samples from six different locations (five different areas in the Yellow Sea-Bohai Sea and one from Dongpu Reservoir) are collected to prepare the solutions. The six kinds of water samples are added to the 30 weathered stock solutions to prepare a series of working solutions. The fluorescence spectra of the samples are obtained at different concentrations, and the CEM spectra of the different kinds of oils are formed, giving a total of 180 CEM spectra.

Features Extraction of Oil Samples Based on PCA
Feature extraction is essential to the performance of an oil spectra recognition system. PCA is often used for feature extraction and dimensionality reduction. PCA is generally applied to the first-order vector of each sample (such as an emission spectrum); for a second-order matrix (such as an EEM or CEM), the data must be reshaped to meet this requirement. In this work, each CEM spectrum is a 146 × 10 matrix, which is transformed into a row vector (1 × 1460) by unfolding the matrix end to end. The 180 CEM spectra are unfolded into vectors to form a 180 × 1460 matrix. Then, the dataset is standardized to a unit scale (mean = 0 and variance = 1) based on the mean and standard deviation of the original data. The spectral features are extracted using PCA, and the variance explained is shown in Figure 2. It can be seen that the cumulative variance contribution of the first six PCs (PC1: 56.8%; PC2: 14.5%; PC3: 9.8%; PC4: 8.1%; PC5: 4.0%; PC6: 3.0%) reaches 96%. The first six principal components (a 180 × 6 matrix) obtained by PCA are used as feature spectra and input into the different classifiers in the next stage of our algorithm design.
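The unfold-standardize-PCA pipeline can be sketched as follows. This is an illustrative Python/scikit-learn version, not the authors' MATLAB implementation; the random array merely stands in for the 180 measured CEMs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 180 measured CEM matrices, each 146 x 10.
rng = np.random.default_rng(42)
cems = rng.random((180, 146, 10))

# Unfold each 146 x 10 matrix end to end into a 1 x 1460 row vector.
unfolded = cems.reshape(180, -1)           # shape (180, 1460)

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(unfolded)

# Keep the first six principal components as the feature matrix.
pca = PCA(n_components=6)
features = pca.fit_transform(scaled)       # shape (180, 6)
print(features.shape, pca.explained_variance_ratio_.sum())
```

On the real spectra, the six retained components explain about 96% of the variance; on this random stand-in the explained ratio is of course much lower.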

Spectra Classification Results
The feature information of the 180 × 1460 matrix is extracted by PCA to obtain a principal component matrix with 180 rows and 6 columns. Crude oil, 0#diesel, heavy oil, motor oil 20w-40, 92#gasoline and Shell Helix 10w-40 are assigned labels 1 to 6, respectively. Each type of oil contains 30 samples, of which 20 are used as the training set and the remaining 10 as the testing set. Thus, the training set contains 120 samples and the testing set contains 60 samples.
The main parameters for GA-SVM and the PNNs are shown in Table 1. Five cross-validation experiments are carried out to improve the stability of the GA-SVM model, and the average classification accuracy is used as the performance indicator of each classifier. The experimental platform is MATLAB R2019a, using the LIBSVM software package developed by Professor Lin at National Taiwan University. As shown in Figure 3 and Table 2, the average classification rate of the PNNs for the six kinds of oils is 80% [48/60]. Among them, the classification accuracy for heavy oil is the lowest at 20% [2/10], and the classification accuracy for crude oil is the highest at 100% [10/10]. In contrast, the classification accuracy of GA-SVM for each oil sample is 100%, which indicates that the prediction results of GA-SVM are significantly better than those of the PNNs in spectral recognition. The poor prediction results of the PNNs may be due to their information processing paradigm, which makes them prone to falling into local minima, whereas SVM is a classification method based on the principle of structural risk minimization, which can effectively avoid falling into a local optimum and has unique advantages in dealing with limited samples and high-dimensional nonlinear datasets. In addition, the GA is chosen to optimize the kernel parameter and penalty factor of the SVM, which further improves the classification accuracy.
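The evaluation protocol (six labeled classes, a stratified 20/10 split per class, accuracy as the metric) can be sketched as follows. This Python/scikit-learn sketch is not the MATLAB/LIBSVM pipeline used in the paper; the synthetic features and the SVM parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 180 x 6 PCA feature matrix: six oil classes
# labelled 1..6, 30 samples each, with class-dependent means (toy data).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(1, 7), 30)
features = rng.normal(size=(180, 6)) + 2.0 * labels[:, None]

# Stratified split: 20 training and 10 testing samples per class.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=60, stratify=labels, random_state=0)

# An RBF-kernel SVM with assumed (not GA-optimized) parameters.
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```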


Samples Preparation
Six different types of oils were selected: 0#diesel, crude oil, heavy oil, 92#gasoline, Shell Helix 10w-40 and motor oil 20w-40. Each oil sample was weighed on an electronic balance and dissolved in isopropanol, and the mixture was placed in an ultrasonic oscillator until fully dissolved. After standing for 30 min, the supernatant was taken as the stock solution, with an oil concentration of 5000 ppm. Working solutions with a series of concentrations (1.2, 2.4, 4.9, 9.8, 19.5, 39, 78, 156, 312 and 625 ppm) were prepared by diluting the stock solutions.
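The dilution arithmetic follows C1·V1 = C2·V2. The sketch below computes the volume of 5000 ppm stock needed for each working concentration; the 10 mL final volume is an assumed value for illustration, not stated in the text.

```python
# Dilution arithmetic: C1 * V1 = C2 * V2, so V1 = C2 * V2 / C1.
stock_ppm = 5000.0
final_volume_ml = 10.0                      # assumed working-solution volume
targets_ppm = [1.2, 2.4, 4.9, 9.8, 19.5, 39, 78, 156, 312, 625]

stock_volumes_ml = [c * final_volume_ml / stock_ppm for c in targets_ppm]
for c, v in zip(targets_ppm, stock_volumes_ml):
    print(f"{c:>6.1f} ppm  ->  {v:.4f} mL stock, dilute to {final_volume_ml} mL")
```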

Weathering Experiments
During the 40-day weathering experiment, the lowest temperature was 19 °C and the highest was 31 °C; the weather was mainly sunny, with only 9 rainy days. The oil samples were weathered by placing a 5-cm-thick oil film in a beaker under natural conditions for 40 days. Stock solutions (5000 ppm) were prepared by dissolving the weathered samples after 10, 20, 30 and 40 days, respectively. Due to the volatility of 92#gasoline, its four weathered stock solutions were prepared after 12, 24, 36 and 48 h.

Water Samples
Using the scientific research ship "Haijian NO.101" as a platform, seawater samples were collected at five different points in the Bohai Sea and Yellow Sea areas, as shown in




Analytical Technique
Fluorescence measurements were performed on a Hitachi F-7000 spectrofluorometer (Hitachi, Japan). The excitation-emission matrices (three-way fluorescence spectra) of each sample were recorded with excitation wavelengths (EX) in the range of 250-450 nm at 2 nm intervals and emission wavelengths (EM) in the range of 260-550 nm at 2 nm intervals. The EEMSCAT toolbox was used in the MATLAB environment to remove Raman and Rayleigh scattering.

Principal Components Analysis (PCA)
PCA is a common multivariate statistical method and one of the most widely used data dimensionality reduction algorithms. The essence of PCA is to project data samples from a high-dimensional space into a low-dimensional space via the K-L transform while preserving the original data features as much as possible. PCA achieves dimensionality reduction of the original matrix by retaining the components with large variance, which carry most of the information, and removing the components with small variance and little information [40,41].
Assume X is a P-dimensional input dataset (X = [x1, x2, . . . , xn]); it is transformed into a set Y of lower dimension L (L < P), where Y represents the principal components of X. The process is as follows: (a) calculate the mean, variance and covariance of the spectral data; (b) calculate the eigenvalues and eigenvectors; (c) order the eigenvectors according to their eigenvalues from highest to lowest. The components whose cumulative variance contribution is greater than 90% are selected as spectral feature data to represent the original spectra.
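Steps (a)-(c) and the 90% cumulative-variance cut-off can be sketched directly with an eigendecomposition of the covariance matrix. This Python sketch uses toy data with known column variances; it is illustrative, not the authors' implementation.

```python
import numpy as np

# Toy data: 100 samples, 8 variables with decreasing scales so the
# leading components carry most of the variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8)) @ np.diag([5, 3, 2, 1, 0.5, 0.3, 0.2, 0.1])

# (a) centre the data and form the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# (b) eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# (c) order components from highest to lowest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# keep components until the cumulative variance contribution exceeds 90%
cum_ratio = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(cum_ratio, 0.90)) + 1
Y = Xc @ eigvecs[:, :L]                     # principal-component scores
print(L, Y.shape)
```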

Probabilistic Neural Networks (PNNs)
PNNs are forward-propagating neural networks that often learn more quickly than many back-propagation neural network models, because they combine Bayesian decision theory with density function estimation to determine the class of a sample. The network structure of a PNN consists of four layers: an input layer, a hidden (pattern) layer, a summation layer and an output layer [42][43][44][45].
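The four-layer structure can be sketched in a few lines: the pattern layer evaluates a Gaussian Parzen window at every training sample, the summation layer averages the kernel responses per class, and the output layer picks the class with the largest estimated density. This is a minimal illustrative sketch; the smoothing parameter sigma and the toy data are assumptions.

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.5):
    """Minimal PNN: Gaussian kernel density per class + Bayes-style decision."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        d2 = np.sum((X_train - x) ** 2, axis=1)             # squared distances
        k = np.exp(-d2 / (2 * sigma ** 2))                  # pattern layer
        scores = [k[y_train == c].mean() for c in classes]  # summation layer
        preds.append(classes[int(np.argmax(scores))])       # output layer
    return np.array(preds)

# Two well-separated toy clusters labelled 1 and 2.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y_train = np.repeat([1, 2], 20)
pred = pnn_predict(X_train, y_train, np.array([[0.1, 0.0], [2.9, 3.1]]))
print(pred)  # [1 2] for these well-separated clusters
```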

The Support Vector Machine Algorithm
The support vector machine (SVM) has been one of the most popular schemes for data classification over the past decades [46,47]. It is a margin-based classifier with excellent generalization properties. In a classification problem, the SVM transforms the input space into a high-dimensional space using a nonlinear transformation defined by an inner product function; in this space, the SVM can find an optimal hyperplane separating the different classes [48,49]. Data that are not linearly separable in low dimensions become easier to separate by a hyperplane when mapped into a high-dimensional space.




The parameters of the SVM algorithm, such as the penalty coefficient C and the Gaussian kernel parameter δ, have a significant impact on the classification results. Therefore, appropriate values of C and δ should be selected so that the model achieves its best performance.
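The sensitivity to C and the kernel parameter can be seen with a simple grid search, a common baseline against which GA optimization is compared. This Python/scikit-learn sketch is illustrative (scikit-learn's `gamma` plays the role of δ here); the grid values and toy data are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy 3-class data with class-dependent means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i, 0.5, (30, 4)) for i in range(3)])
y = np.repeat([0, 1, 2], 30)

# Exhaustive search over a small (C, gamma) grid with 5-fold cross-validation.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]
best_gamma = grid.best_params_["gamma"]
print(best_C, best_gamma, round(grid.best_score_, 3))
```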

The Flowchart of the GA-SVM Algorithm
The Genetic Algorithm (GA) is an adaptive heuristic search algorithm that evolves a population through repeated selection, crossover, mutation and other operations [50,51]. In this work, the GA is used to optimize the penalty parameter C and the kernel parameter δ of the SVM, and a GA-SVM prediction model based on PCA is established. In the GA-SVM algorithm, the correct classification rate of the SVM is used as the fitness function of the GA. The algorithm flow is shown in Figure 5 and includes the following steps:
Step 1: Feature spectra extraction. According to the mean value and standard deviation of the original data, the dataset is standardized to a unit scale, and PCA is used to extract the spectral features.
Step 2: Binary encoding and initial population. Determine the parameters of the initial population, including the population size, number of iterations, crossover rate and mutation rate. The penalty factor C and kernel parameter δ are represented using binary encoding.
Step 3: Fitness evaluation. The encoded parameters C and δ are decoded to obtain candidate parameter values. These SVM parameters are used for learning and training with the LIBSVM model, and the classification accuracy obtained on the test data with the trained model is used to calculate the fitness function.
Step 4: Termination criteria. If the termination criteria of the genetic algorithm are satisfied, the process ends and the optimal parameters are selected and input into the SVM model. Otherwise, selection, crossover and mutation operations are performed to generate a new generation of the population, until the optimal solution is determined.
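The loop above can be sketched as a minimal GA over (C, δ). This Python sketch deviates from the paper in two labelled ways: it uses real-valued log-scale genes instead of binary encoding, and scikit-learn's RBF `gamma` stands in for δ; population size, generation count and mutation scale are assumed values.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy 3-class data standing in for the PCA feature matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2 * i, 0.6, (30, 4)) for i in range(3)])
y = np.repeat([0, 1, 2], 30)

def fitness(ind):
    """Fitness = cross-validated classification rate of the decoded SVM."""
    C, gamma = 10 ** ind[0], 10 ** ind[1]          # decode log-scale genes
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

# Initial population: genes are log10(C) and log10(gamma) in [-2, 2].
pop = rng.uniform(-2, 2, size=(10, 2))
for _ in range(10):                                # generations
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-5:]]         # selection: keep best half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(5, size=2)]    # crossover: blend two parents
        child = (a + b) / 2 + rng.normal(0, 0.1, 2)  # mutation: small noise
        children.append(np.clip(child, -2, 2))
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
best_C, best_gamma = 10 ** best[0], 10 ** best[1]
print(round(fitness(best), 3), best_C, best_gamma)
```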


Conclusions
In the present work, the feasibility of identifying oil spill species based on the concentration-emission matrix (CEM) under weathering and different water environment disturbance conditions is studied. PCA is used to extract the spectral features and is combined with two pattern recognition methods to achieve rapid identification of different oil samples. The results show that GA-SVM has the highest classification accuracy for the six kinds of oils, reaching 100%; its average classification accuracy is 20% higher than that of the PNNs.
