Fluorescence Excitation–Emission Matrix Spectroscopy and Boosting Regression Tree Model to Detect Dissolved Organic Carbon in Water

In recent years, optical methods have proven to be a powerful tool for monitoring dissolved organic carbon (DOC) in natural waters. However, the effectiveness of these methods in marine systems with low DOC concentrations remains to be shown. Herein, a new method based on fluorescence excitation–emission matrix spectroscopy for seawater DOC quantification is proposed. A pre-processing method is investigated to achieve a high signal-to-noise ratio. A peak-picking operation is then performed to obtain feature peaks. In order to combine the information from sparsely located feature peaks, sparse principal component analysis is applied to identify the important variables used in the subsequent regression procedure. The regression result for a given data set is then obtained with a boosting regression tree. The method was tested on samples collected from the East China Sea. Compared with the parallel factor analysis–multivariate linear regression method, the proposed method achieved a more consistent regression output, which indicates that the boosting regression tree has potential for DOC quantification even at low concentrations.


Introduction
As an important carbon pool in the oceans, dissolved organic carbon (DOC) plays a critical role in the global carbon cycle and contributes to carbon sequestration [1][2][3][4]. In addition, it plays an important part in the biogeochemistry of many trace elements [5][6][7]; therefore, the detection and monitoring of DOC is meaningful for research on the oceanic carbon cycle [4].
One of the general methods for DOC analysis in the ocean is the high-temperature catalytic oxidation (HTCO) method [8,9], which is utilized by many modern TOC analyzers. In this method, all carbon is oxidized to carbon dioxide, and total organic carbon is then evaluated by measuring the weight of carbon dioxide. However, this type of method requires high temperatures for combusting samples. In addition, a series of specialized instruments is needed to remove interferences, collect the carbon dioxide, and weigh it. In practice, such an approach is time consuming, and the equipment is costly. An alternative and equally effective method is to monitor DOC with UV-visible spectroscopy [10,11], which establishes relationships between DOC quantity and spectral parameters such as the absorbance, peaks, and slopes [12][13][14], without having to perform any chemical manipulations. This optical method can be useful for informative analysis and quantification. For example, absorbance at 254 nm is linked to aromatic humic substances [12,15] and is often
A pre-processing method based on the incremental Delaunay triangulation algorithm is investigated to achieve a high signal-to-noise ratio. Feature extraction based on a peak-picking operation and sparse principal component analysis is applied to identify the important variables used in the subsequent regression procedure. The regression result for a given data set is then obtained with a boosting regression tree. The proposed method is shown in Figure 1, which includes the principal method name and the data dimension after each procedure.

Water 2021, 13, 3612

Incremental Delaunay Triangulation Algorithm
Rayleigh and Raman scatter peaks can create problems for quantitative analysis, particularly for samples with low DOC concentrations, so correction methods need to be used to eliminate scatter peaks from fluorescence data. In this work, the scattering correction proposed by Zepp et al. [33] was adopted, which can excise scatter peaks efficiently. Although three-dimensional interpolation is applied in this method, the remaining data still have a low signal-to-noise ratio. There are 'breaks' or 'peaks' at the connections of patches (the triangle facets in Zepp et al. [33]). Therefore, the results of scattering correction lack surface continuity and smoothness, which may influence regression analysis.
In order to achieve some surface continuity and smoothing for EEMs data, a method termed incremental Delaunay triangulation [34,35] was utilized in this work. This algorithm inserts data points into the original matrix sequentially, selecting the subset of points that contributes most. The Delaunay triangulation is then used to approximate the elevation at the subset points within a distance criterion. With these operations, a continuous and smooth result is relatively straightforward to obtain.
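The insertion-and-approximation idea described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: it assumes NumPy and SciPy are available, it uses a greedy rule (insert the grid point with the largest current error) as the selection criterion, and it rebuilds the triangulation at every step instead of updating it incrementally. The function name `greedy_delaunay_smooth` and the parameters `tol` and `max_points` are hypothetical.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator


def greedy_delaunay_smooth(z, max_points=200, tol=0.01):
    """Approximate a 2-D intensity grid by a piecewise-linear Delaunay surface.

    Starts from the four corner points and repeatedly inserts the grid point
    whose intensity is worst approximated, until the largest absolute error
    is at most `tol` or `max_points` points have been inserted.
    """
    n_rows, n_cols = z.shape
    rows, cols = np.mgrid[0:n_rows, 0:n_cols]
    grid = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
    values = z.ravel().astype(float)

    # Seed with the four corners so the triangulation covers the whole grid.
    selected = [0, n_cols - 1, (n_rows - 1) * n_cols, n_rows * n_cols - 1]

    while True:
        # LinearNDInterpolator triangulates the selected points (Delaunay)
        # and interpolates linearly inside each triangle.
        interp = LinearNDInterpolator(grid[selected], values[selected])
        approx = interp(grid)
        err = np.abs(approx - values)
        worst = int(np.nanargmax(err))  # NaN-safe in case of hull round-off
        if err[worst] <= tol or len(selected) >= max_points:
            return approx.reshape(z.shape), selected
        selected.append(worst)
```

In this sketch, `tol` plays the role of the distance criterion: fluctuations smaller than `tol` are not inserted, so the returned surface is continuous across triangle edges and free of small noise peaks.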

Feature Extraction
In EEMs, not all of the fluorescence intensities contribute to the final regression model. In this practical application, the fluorescence data should be pre-processed to transform them into a new space of variables where, it is hoped, the goal of DOC quantification will be easier to achieve. Herein, this pre-processing stage utilized a peak-picking approach based on a two-dimensional polynomial model and extremum detection. For a spectral matrix, a simple approximation to the surface is a two-dimensional polynomial model D(x, y) with parameters [a, ..., f]. The sum of squared differences (SSD) between the model and the data is then constructed. The two-dimensional polynomial model that fits the surface best is the one with the lowest value of SSD. Thus, the parameter values are obtained by differentiating SSD with respect to each of the parameters [a, ..., f] and solving the resulting equations with the derivatives set to zero. To compute the peak position (x, y) from the function D(x, y), we differentiate D(x, y) with respect to x and y:

$$\frac{\partial D(x, y)}{\partial x} = 0, \qquad \frac{\partial D(x, y)}{\partial y} = 0$$

In this research, the peak-picking approach is not applied to the whole spectral matrix. In EEM data, the peaks are associated with phenol-like, tryptophan-like, tyrosine-like, or humic-like organic compounds [36]. Therefore, the EEM can be divided into five regions: I, II, III, IV, and V. Then, sparse principal component analysis is performed on these five EEM regions to get the fluorescence peaks and their features. Having obtained the feature positions of the EEMs data, there remains the task of regression analysis with respect to DOC concentration.
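The peak-picking derivation can be made concrete with a worked example. The exact polynomial is not reproduced in the text here, so the block below assumes a full quadratic in x and y, which is one natural six-parameter model; the ordering of the coefficients a to f and the symbol z_i for the measured intensity at (x_i, y_i) are assumptions made for illustration.

```latex
% Assumed six-parameter quadratic surface
D(x, y) = a x^2 + b y^2 + c x y + d x + e y + f

% Sum of squared differences over the measured points (x_i, y_i, z_i)
\mathrm{SSD} = \sum_{i} \bigl[ D(x_i, y_i) - z_i \bigr]^2

% Least-squares conditions for the parameters
\frac{\partial\,\mathrm{SSD}}{\partial a} = \dots = \frac{\partial\,\mathrm{SSD}}{\partial f} = 0

% Stationary point of the fitted surface
\frac{\partial D}{\partial x} = 2 a x + c y + d = 0, \qquad
\frac{\partial D}{\partial y} = 2 b y + c x + e = 0

% Solution, valid when 4ab - c^2 \neq 0
x^{*} = \frac{c e - 2 b d}{4 a b - c^2}, \qquad
y^{*} = \frac{c d - 2 a e}{4 a b - c^2}
```

Under this assumed form, the stationary point is a local maximum (a peak) when a < 0 and 4ab − c² > 0.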

Boosting Regression Tree
The boosting tree, based on an additive model and the forward stage-wise algorithm, was developed for regression and classification applications [37]. It produces competitive, highly robust, interpretable procedures and is particularly appropriate for data with much disturbance. In this section, we review the basic boosting regression tree algorithm and then apply it to fluorescence spectroscopy.
Given a training set $[(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)]$ with inputs $x_i \in \mathbb{R}^n$ and outputs $y_i \in \mathbb{R}$, the goal is to obtain a function estimate $f(x)$ that minimizes the loss function $L(y, f(x))$. Here, the boosting tree model is derived by assuming that $f(x)$ can be expressed as a sum of basic decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$$

where $T(x; \Theta_m)$ is the decision tree, $\Theta_m$ is the parameter of the decision tree, and $M$ is the number of trees. This boosting tree model can be fitted by using the forward stage-wise algorithm:

$$f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$$

where $f_{m-1}(x)$ represents the current estimation model. The parameter $\Theta_m$ of the next decision tree can be determined by empirical risk minimization:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\bigl(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\bigr)$$

A common choice of loss function in regression problems is the squared loss, $L(y, f(x)) = (y - f(x))^2$, which gives:

$$L\bigl(y, f_{m-1}(x) + T(x; \Theta_m)\bigr) = \bigl[y - f_{m-1}(x) - T(x; \Theta_m)\bigr]^2 = \bigl[r - T(x; \Theta_m)\bigr]^2$$

where $r = y - f_{m-1}(x)$. In this case, $r$ is the residual of the current regression model. Therefore, our goal is to choose an appropriate regression tree $T(x; \Theta_m)$ that fits the residual $r$, that is, one that minimizes $[r - T(x; \Theta_m)]^2$.
In the case of the fluorescence data, a single EEM X is divided into five regions. A single EEM is therefore represented by a 5-dimensional vector, each element of which is a single fluorescence intensity, and is paired with the corresponding concentration of dissolved organic carbon. The training data set was used to construct a boosting regression tree, and the testing data set was then used to evaluate the regression model.
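The forward stage-wise procedure reviewed in this section can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this work: it assumes depth-one regression trees (stumps) as the base learner, starts from f_0(x) = 0, and uses no shrinkage; the function names `fit_stump`, `fit_boosting`, and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares fit of a depth-1 regression tree to the residuals.

    Returns (feature index, threshold, left mean, right mean).
    """
    best = None
    for j in range(len(xs[0])):
        values = sorted(set(x[j] for x in xs))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for x, r in zip(xs, residuals) if x[j] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[j] > thr]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        # All inputs identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    j, thr, lm, rm = stump
    return lm if x[j] <= thr else rm


def fit_boosting(xs, ys, n_trees=50):
    """Forward stage-wise fitting: each new tree is fitted to r = y - f_{m-1}(x)."""
    trees = []
    preds = [0.0] * len(ys)  # f_0(x) = 0
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + predict_stump(stump, x) for p, x in zip(preds, xs)]
    return trees


def predict(trees, x):
    """f_M(x) is the sum of the outputs of all fitted trees."""
    return sum(predict_stump(t, x) for t in trees)
```

With the five-dimensional feature vectors described above, `xs` would be a list of five-element lists and `ys` the measured DOC concentrations. Because each stump is a least-squares fit to the current residuals, the training error cannot increase when a tree is added.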

PARAFAC-MLR Model
PARAFAC [38,39] is one of several methods that have been successfully applied to the analysis of fluorescence data. Consider a three-way array described by three loading matrices, $A \in \mathbb{R}^{I \times F}$, $B \in \mathbb{R}^{J \times F}$, and $C \in \mathbb{R}^{K \times F}$, with elements $a_{if}$, $b_{jf}$, and $c_{kf}$. Each element $x_{ijk}$ of the array is expressed as the sum over the $F$ factors of the products $a_{if} b_{jf} c_{kf}$ plus a residual $e_{ijk}$, where $F$ is the number of sought factors [40]. By employing the Khatri–Rao product, the three-way array can be unfolded into a two-way array that is expressed in terms of the loading matrices. The loading matrices are then estimated with the alternating least squares method, which minimizes three loss functions in turn, one for each loading matrix, each measured in the Frobenius norm. The score matrix C is then combined with the MLR method to define a regression model. Such a model can directly accommodate multiple predictors, and it is possible to have an observation that is well within the range of each individual predictor's values.
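For reference, the PARAFAC model and the alternating least squares steps described above can be written out as follows. The unfolding convention below, with $X_{(n)}$ denoting the mode-n matricization and $\odot$ the Khatri–Rao product, is one common choice and is assumed here for illustration; the paper's own notation may order the modes differently.

```latex
% Trilinear model for each element
x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk}

% Matricized (unfolded) forms using the Khatri-Rao product
X_{(1)} = A\,(C \odot B)^{\mathsf{T}} + E_{(1)}, \qquad X_{(1)} \in \mathbb{R}^{I \times JK}
X_{(2)} = B\,(C \odot A)^{\mathsf{T}} + E_{(2)}, \qquad X_{(2)} \in \mathbb{R}^{J \times IK}
X_{(3)} = C\,(B \odot A)^{\mathsf{T}} + E_{(3)}, \qquad X_{(3)} \in \mathbb{R}^{K \times IJ}

% Alternating least squares: update one matrix while the other two are fixed
\min_{A} \bigl\| X_{(1)} - A\,(C \odot B)^{\mathsf{T}} \bigr\|_F^2, \quad
\min_{B} \bigl\| X_{(2)} - B\,(C \odot A)^{\mathsf{T}} \bigr\|_F^2, \quad
\min_{C} \bigl\| X_{(3)} - C\,(B \odot A)^{\mathsf{T}} \bigr\|_F^2
```

Each subproblem is an ordinary linear least-squares problem, which is why the three updates can be cycled until the fit stops improving.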

Assessing the Accuracy of the Model
It is natural to want to quantify the extent to which the model fits the data. The R² statistic provides an alternative measure of the quality of a linear regression fit [41]. It takes the form of a proportion, assuming a value between 0 and 1.
The statistic is computed from DOC_pre, the predicted value of DOC, and DOC_real, the real (measured) value of DOC. Hence, the R² statistic measures the proportion of variability in DOC that can be explained using fluorescence data.
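As a concrete illustration, the two accuracy measures used in this paper can be computed as below. The sketch assumes the usual definition R² = 1 − RSS/TSS, which may differ in detail from the formula used in this work; the function names are hypothetical. Note that with this definition, R² can fall below 0 on held-out data when the model predicts worse than the mean.

```python
def mse(doc_real, doc_pre):
    """Mean squared error between measured and predicted DOC values."""
    return sum((r - p) ** 2 for r, p in zip(doc_real, doc_pre)) / len(doc_real)


def r_squared(doc_real, doc_pre):
    """R^2 = 1 - RSS / TSS (assumed definition)."""
    mean_real = sum(doc_real) / len(doc_real)
    rss = sum((r - p) ** 2 for r, p in zip(doc_real, doc_pre))  # residual sum of squares
    tss = sum((r - mean_real) ** 2 for r in doc_real)           # total sum of squares
    return 1.0 - rss / tss
```

A perfect prediction gives R² = 1 and MSE = 0, while always predicting the mean of the measured values gives R² = 0.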

Site Descriptions and Sampling
Seawater samples for this investigation were taken from coastal waters of the East China Sea. The annual average temperature there is 16 °C and the annual precipitation is 927–1620 mm. Seawater samples were collected in 500 mL amber laboratory bottles and filtered through 0.45 µm cellulose nitrate membrane filters. As all the seawater samples were analyzed in the laboratory, the filtered samples were stored in a refrigerator at 4 °C for several days. Low temperature can inhibit the activity of undesirable microorganisms and slow down the chemical reaction rate. Sixty seawater samples were taken from nine different sample points. In order to obtain a gradient in concentration simulating the DOC of seawater far away from land, the samples at each point were diluted with pure water.

Fluorescence Measurements
Water samples were allowed to warm to room temperature prior to fluorescence measurements. The fluorescence measurements of seawater samples were performed using an F-7000 Fluorescence Spectrophotometer (Hitachi, Tokyo, Japan) and a quartz cuvette with a 1 cm path length. The excitation wavelength ranged from 200 nm to 700 nm with a 5 nm interval, and the emission wavelength ranged from 200 nm to 700 nm with a 5 nm interval as well. According to the common regions of fluorescence maxima [42], the fluorescence data used for analysis cover the excitation wavelength region of 200–360 nm and the emission wavelength region of 275–460 nm.
All calculations were carried out on a personal computer equipped with a Core i7 2.5 GHz processor with 16 GB RAM under Windows 10 operating system using MATLAB R2016a (MathWorks, Natick, MA, USA).

DOC Measurements
The concentration of dissolved organic carbon was analyzed with a TOC-L CPN (Shimadzu, Japan), which adopts the 680 °C combustion catalytic oxidation method. Sample acidification and sparging with purified air were carried out automatically. Because of the automatic dilution function, it has a wide measurement range of 4 µg/L to 30,000 mg/L.

Pre-Processing
The effects of the incremental Delaunay triangulation pre-processing applied to the EEMs data set are illustrated in Figure 2. The plot on the left shows the original data (after scattering correction). The plot on the right shows the result of the surface-smoothing operation. As shown, the fluorescence data resulting from scattering correction are noisy, so feature extraction that depends on feature peaks cannot be applied. However, the incremental Delaunay triangulation algorithm provides smooth surfaces and leads to a higher signal-to-noise ratio.
Furthermore, the R-square statistic between the fluorescence intensities of the EEM data and the measured DOC concentration was analyzed directly to estimate the effectiveness of the pre-processing results as indicators for DOC concentration. Figures 3 and 4 show the four contour plots of the R-square statistic for regression analysis. As shown in these figures, horizontal and vertical lines were drawn to divide the EEM into five regions [28,43]. For EEM Region I and Region II, which are related to tyrosine protein-like matter, the emission wavelengths are shorter than 350 nm and the excitation wavelengths are shorter than 250 nm. Region III has emission wavelengths longer than 350 nm and excitation wavelengths shorter than 250 nm, and represents fulvic acid-like matter. Region IV is assigned to soluble microbial by-product-like matter. It is located at emission wavelengths shorter than 380 nm with excitation wavelengths ranging from 250 to 280 nm. Excitation wavelengths shorter than 280 nm and emission wavelengths longer than 380 nm are defined as Region V. This region indicates the property of humic acid-like organics.
In Figure 3, the data points that have the strongest correlations between EEM and DOC concentrations are scattered across the fluorescence matrix rather than clustered together in distinct regions. Peaks caused by noise may negatively influence the later regression analysis. Consequently, the information in the fluorescence data can barely be detected, which corresponds to a substantially lower signal-to-noise ratio. Conversely, for the smoothed EEMs data in Figure 4, most of the data points that have the strongest correlations between EEM and DOC concentrations are clustered together. The feature peaks with high values of R² can therefore be extracted well and contribute to the regression analysis.

Boosting Regression Tree Approach
To illustrate the performance of the two DOC quantification methods, of the 60 samples collected, 30 were randomly chosen as testing data, while the remaining 30 were used for training. The boosting regression tree and PARAFAC-MLR approaches were fitted to the training set, and the resulting error was computed on the test set. The DOC concentrations of the training set ranged from 0.7128 to 5.6290 mg/L, while those of the testing set ranged from 1.7500 to 5.6650 mg/L. The R² statistic and the mean square error (MSE) were computed to measure the model fit and the fraction of variance explained.
The relationship between predicted and measured DOC for the training set is plotted in Figure 5a. Predicted DOC concentrations ranged from 1.2683 to 5.2708 mg/L. R² is equal to 0.982 and MSE is 0.003, which indicates that the boosting regression tree describes almost the whole variance in the training data set. As described above (Section 2.1), the boosting regression tree fits a decision tree to the residual of the model. At each step, a new decision tree is added to the fit function in order to fit the residual from the previously grown trees. Thus, the residual error of the regression model is driven toward zero.
The relationship between predicted and measured DOC for the testing set is plotted in Figure 5b, where the model trained with the training data was applied to the testing data set. Predicted DOC concentrations ranged from 2.0919 to 5.5083 mg/L. The R² statistic of the boosting regression tree model gives a measure of the linear relationship between measured and predicted DOC concentrations. R² was 0.914, so 91.4% of the variance in the measured DOC concentrations is explained by the boosting regression tree. The MSE is 0.146, which indicates that the predicted DOC concentrations are close to the measured DOC concentrations.

PARAFAC-MLR Approach
For the PARAFAC-MLR quantification method, the procedure is quite different from that of the boosting regression tree. This difference stems from the fact that, in the PARAFAC quantification case, the samples to be predicted need to be analyzed together with the measured samples. As the three-way array was decomposed into one score vector and two loading vectors, the concentration of the analytes was estimated using the score vector, which represents the relative concentration of each sample. However, the score vector is, in fact, an N × F matrix, which leads to some difficulties for regression analysis. Here, an alternative is to introduce multivariable linear regression. Therefore, the DOC concentrations of seawater samples can be obtained from the score vector of the PARAFAC solution combined with multivariable linear regression.
In this work, PARAFAC is based on the N-way toolbox [44], which is freely downloadable from the Chemometrics site at the University of Copenhagen (www.models.life.ku.dk, accessed on 6 October 2021). The PARAFAC-MLR result for the training data set is shown in Figure 6a. In this example, the R² value over the training data is 0.979 and the MSE is 0.05, which means that PARAFAC-MLR provides a good fit to the training data.
Since the success of PARAFAC decomposition is dependent on the signal-to-noise ratio [45,46], the regression results suffer from limitations because of possible noisy, nonlinear disturbances. All these factors can lead to a poor fit to the data set. Therefore, the PARAFAC-MLR method should be further optimized when applied for predicting DOC concentrations.

Discussion
The difference between these two results can be understood by noting that there is a considerable amount of information presented by an EEM. In general, the PARAFAC-MLR method is a comprehensive indicator of concentration measurement. As it offers a useful but not sufficient concentration analysis, feature extraction is needed as an essential pre-processing step for data analysis. Feature extraction can greatly reduce noise disturbances and make the subsequent regression analysis much easier. In addition, as shown in Figure 5b and Figure 6b, PARAFAC-MLR performs less efficiently than the proposed method because MLR is designed for linear situations. In a real-life situation in which the true relationship is unknown, the boosting tree-based method may still provide better results than MLR whether the true relationship is linear or non-linear. In particular, instead of fitting a linear regression model to the loading matrix, the boosting regression tree provides an improvement over the PARAFAC-MLR method in regression analysis. In this tree method, a number of decision trees are built based on the bootstrapped EEM samples. Then the training samples are separately fitted to different trees. Finally, all these trees are combined into a decision model.
In Figure 7, as the number of trees increased, the R-square in Figure 7a changed and the mean squared error (MSE) in Figure 7b decreased. This indicates that the boosting regression tree method utilizes the information of fluorescence intensities at different regions and tries to integrate this information by discovering and identifying mappings between different decision trees. In Figure 7a, R-square grew rapidly as the number of trees increased from 1 to 2, because simple tree models are very restricted in terms of the data relationships that they can represent. Moreover, R-square first decreased and then increased as the number of trees went from 10 to 25, which means that the proposed boosting regression tree was able to find better predicted values. Empirically, this ensemble method, which uses multiple models, tends to yield better results.
Although, when applied correctly, the boosting regression tree method can provide a relatively sufficient analysis of DOC concentration in fluorescence data, it is not always suitable. Samples with large biases can have a significant impact on the accuracy of the regression model. Because the boosting method highlights bias, samples with large biases will influence the weights among decision trees, which may yield unsatisfactory prediction results. In practice these issues are unavoidable, but with an increasing number of input data points they have less impact.

Conclusions
Dissolved organic carbon (DOC) has a relatively important role in the carbon cycle and in linking the marine carbon cycle to climate [47,48]. The development of an inexpensive and fast quantitative procedure for DOC is meaningful in practice. This paper described a new quantitative method for DOC in seawater based on fluorescence excitation-emission matrix spectroscopy. Samples collected from the East China Sea were analyzed with fluorescence spectroscopy in the laboratory. The incremental Delaunay triangulation algorithm ensures that patches are continuous at their connections, and thus it can provide smooth surfaces. Moreover, a peak-picking method for EEMs data was applied to identify the local maxima. However, the result of peak detection is sparse and not computationally efficient. In this work, sparse principal component analysis was used to achieve dimensionality reduction, which has the advantage of identifying important variables.
For the regression analysis in this study, a mapping model between DOC concentrations and the fluorescence intensities of EEMs was established on the basis of the boosting regression tree method. The comparison with the PARAFAC method further indicates that the boosting regression tree provided better performance in DOC estimation. The correlations between DOC concentrations and fluorescence EEMs suggest that fluorescence spectroscopy has potential for estimating the DOC concentrations of seawater. This emerging set of different fluorescence intensities can help in creating a common method for DOC quantification.