## 3. Results

The work of [19] shows that the root mean square error of the SVM varies more sharply with the σ parameter than with the C parameter. Our previous study [8] also shows that the σ parameter is much more sensitive than the γ parameter of the LS-SVM. Based on this prior knowledge, we conducted grid search experiments with the following settings: (1) the starting values of the C, γ, and σ parameters were set to 0.1, 0.1, and 0.01, respectively; (2) the steps to advance C and γ were 10 times their previous values, and the step to advance σ was 1.1 times its previous value; and (3) the optimal σ of the LS-SVM was used as the initial guess for the SVM to estimate the optimal ε, and then the optimal σ of the SVM was searched again. The last setting reflects the assumption that the optimal σ of the two models would be the same.
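The geometric stepping in settings (1) and (2) can be sketched as follows (a minimal sketch; the grid lengths are assumptions, since the upper bounds of the search are not stated here):

```python
def geometric_grid(start, factor, count):
    """Return `count` values, each `factor` times the previous one."""
    values = [start]
    for _ in range(count - 1):
        values.append(values[-1] * factor)
    return values

# Settings (1) and (2): start values and step factors from the text;
# the grid lengths (5 and 50) are assumed for illustration.
C_grid = geometric_grid(0.1, 10.0, 5)       # 0.1, 1, 10, 100, 1000
gamma_grid = geometric_grid(0.1, 10.0, 5)   # 0.1, 1, 10, 100, 1000
sigma_grid = geometric_grid(0.01, 1.1, 50)  # fine steps: 0.01, 0.011, 0.0121, ... up to ~1.07
```

The much smaller step factor for σ reflects the greater sensitivity of the models to that parameter.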

Figure 1 shows the correlation coefficient (R^{2}) between the LS-SVM outputs and the target values of the artificial dataset. In the experiments of Figure 1A,B, five random variables in (−1,1) were used as inputs and the results of Equation (11) as the target. Both the fitting and the validation become better with a larger γ. Equation (9) indicates that a larger γ means a more precise fit of the data, and the kernel function (6) indicates that a smaller σ yields a larger variability. Therefore, a smaller σ and a larger γ yield a better fitting for the noise-free data. The validation shows, however, that the best fitting did not generalize well. The optimal σ is 1.17 for γ = 1000 and did not stray far from this value for the other γ. The experiments in Figure 1C,D used four random variables as input and a random variable in the same range as the target. That the LS-SVM can produce a perfect fitting for unrelated variables is an example of overfitting. Although the correlation detected by the validation is weak, σ = 0.6 and γ = 1 are clearly the candidates for obtaining a better validation. The small correlations in the validation resulted from a few random points that incidentally fit Equation (11).

Using the optimal σ value of the LS-SVM, we conducted a grid search for the optimal C and ε of the SVM with the same artificial dataset. For the noise-free target, ε = 0.01 in the discrete set (0.01, 0.02, 0.05, 0.10, 0.20, 0.30) and C = 1000.0 in the discrete set (0.1, 1.0, 10.0, 100.0, 1000.0) yielded the best validation. This is understandable: as the target includes no noise, a smaller ε and a larger C include more data points as support vectors and thus produce a better fitting and validation. When the target was a random variable, ε = 0.1 and C = 1.0 became the optimal values.
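In outline, this grid search can be reproduced with scikit-learn's SVR, which wraps LibSVM. The dataset and target function below are stand-ins, since Equation (11) is not reproduced here, and the σ-to-gamma conversion assumes the RBF kernel form exp(−‖x − x′‖²/(2σ²)):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 5))   # five random variables in (-1, 1)
y = np.sin(X).sum(axis=1)               # stand-in smooth, noise-free target
X_tr, y_tr, X_va, y_va = X[:300], y[:300], X[300:], y[300:]

sigma = 1.17                            # optimal LS-SVM sigma reused as the initial guess
gamma_rbf = 1.0 / (2.0 * sigma ** 2)    # LibSVM parameterizes the RBF kernel by gamma

best = None
for C in (0.1, 1.0, 10.0, 100.0, 1000.0):
    for eps in (0.01, 0.02, 0.05, 0.10, 0.20, 0.30):
        model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma_rbf).fit(X_tr, y_tr)
        r2 = r2_score(y_va, model.predict(X_va))
        if best is None or r2 > best[0]:
            best = (r2, C, eps)
# For a noise-free target, a small eps and a large C tend to win, as described above.
```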

Figure 2 shows that the SVM behaved similarly to the LS-SVM. The grid search for the optimal C and σ was done with the optimal ε values above. When the target is noise free, the R^{2} of both fitting and validation increases monotonically with the C parameter (Figure 2A,B) for a given σ. In the fitting, the R^{2} tends to decrease monotonically with σ for a large C. This is because a large C makes LibSVM include most training samples as support vectors, and a smaller σ makes the hyperplane more elastic. With a small C, however, the optimal σ occurred in a narrow range around 1.0, which does not differ much from the standard deviation of the target (0.58). In the validation, the optimal σ appeared around 1.0 for all tested C values. Obviously, overfitting would have occurred with a small σ and a large C, resulting in excellent fitting but unacceptable validation. One may ask what the results would be for C > 1000. The outputs of LibSVM show that nearly all training data points were already included as support vectors; therefore, a larger C would yield results similar to those with C = 1000.
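This saturation can be checked directly by counting support vectors in the model output. A sketch with scikit-learn's SVR (which wraps LibSVM) on a noisy stand-in dataset, chosen so that residuals cannot fall inside a narrow ε-tube:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 5))
y = np.sin(X).sum(axis=1) + rng.normal(0.0, 0.1, size=200)  # stand-in noisy target

# With eps far below the residual level and a large C, nearly every sample stays
# outside the eps-tube and is kept as a support vector, so a still larger C
# can change the solution only marginally.
model = SVR(kernel="rbf", C=1000.0, epsilon=0.01, gamma=0.37).fit(X, y)
sv_fraction = len(model.support_) / len(X)
```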

We reassessed the optimal ε value of the SVM for the CO_{2} dataset as the target has a much larger variance. Based on the experiments with the random dataset, we set σ = 0.6 to evaluate ε in the discrete set (1, 2, 4, 8, 16, 32 µatm). Divided by the CO_{2} standard deviation of 32.6 µatm, these values correspond to ε of 0.03, 0.06, 0.12, 0.24, 0.49, and 0.98, respectively, for normalized CO_{2}.
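The conversion to normalized ε is a division by the target's standard deviation (a trivial check; truncating to two decimal places reproduces the listed values):

```python
import math

co2_std = 32.6  # µatm, standard deviation of the CO2 target
eps_uatm = [1, 2, 4, 8, 16, 32]

# eps for normalized CO2 = eps / std, truncated to two decimal places.
eps_norm = [math.floor(100 * e / co2_std) / 100 for e in eps_uatm]
# eps_norm -> [0.03, 0.06, 0.12, 0.24, 0.49, 0.98]
```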

Figure 3 shows the variation of the correlation coefficient and the bias (model − target) obtained from a validation. Obviously, one cannot have both a zero bias and the best correlation. Our priority is a zero bias, as a large bias may reverse the conclusion of whether the global oceans are a CO_{2} sink or a source. Since the bias crosses the zero line for all tested C values (Figure 3B), we selected C = 100 from Figure 3A and estimated ε ≈ 12 µatm from Figure 3B. This value corresponds to an ε of 0.37 for normalized CO_{2}. Note that it is not necessary to calculate the zero-bias ε precisely, as the bias would change with a different training and validation dataset. If one emphasizes having the best correlation, the best ε would be about 8 µatm, or 0.24 for normalized CO_{2}.
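Reading the zero-bias ε off the bias curve amounts to a linear interpolation between the two grid points where the bias changes sign. A sketch with hypothetical bias values (the real curve is in Figure 3B):

```python
def zero_crossing_eps(eps_grid, biases):
    """Linearly interpolate the eps value at which the bias crosses zero."""
    for i in range(len(biases) - 1):
        b0, b1 = biases[i], biases[i + 1]
        if b0 == 0.0:
            return eps_grid[i]
        if b0 * b1 < 0.0:  # sign change between grid points i and i+1
            e0, e1 = eps_grid[i], eps_grid[i + 1]
            return e0 + (e1 - e0) * (-b0) / (b1 - b0)
    raise ValueError("bias does not cross zero on this grid")

# Hypothetical biases (µatm) at eps = (1, 2, 4, 8, 16, 32) µatm for C = 100.
eps_hat = zero_crossing_eps([1, 2, 4, 8, 16, 32], [-3.0, -2.2, -1.1, -0.9, 0.9, 2.5])
# eps_hat -> 12.0, i.e. eps ≈ 12 µatm
```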

Figure 4 shows the grid search results with ε = 12 using a CO_{2} dataset. The optimal C of the SVM is 1000 for fitting and 100 for validation, and the optimal γ of the LS-SVM is 100 for fitting and 10 for validation. It is no surprise that the optimal σ values of the two models are similar: 0.611 for the SVM and 0.690 for the LS-SVM. Overall, the two models respond to parameter changes similarly. Further, the responses of both models with the CO_{2} dataset are similar to those with the noise-free artificial dataset (Figure 1A,B and Figure 2A,B).

Figure 5 presents a fitting and validation obtained using the optimal parameters. Besides having a larger R^{2} and a smaller standard error (SE), the LS-SVM visually shows a less dented blank area near the CO_{2} value of 320 µatm. The skewed distribution of data points around the regression line indicates unbalanced sampling of the measurements. The SVM yielded a smaller validation bias because the dataset was used to choose the ε giving a zero-bias validation.

We repeated the grid search for the optimal parameters on the 10 CO_{2} datasets prepared for Monte Carlo cross validation. We obtained the overall optimal C and σ of the SVM as 100 and 0.613, respectively, and the optimal γ and σ of the LS-SVM as 10 and 0.695, respectively. The biases in Table 1 and Table 2 were obtained using these parameter values. Overall, the LS-SVM performs better than the SVM in terms of both correlation and bias. The LS-SVM yielded a mean bias of 0.00 ± 0.00 µatm and −0.14 ± 0.14 µatm for fitting and validation, respectively, and a mean R^{2} of 0.801 ± 0.005 and 0.691 ± 0.002 for fitting and validation, respectively. The SVM, in contrast, yielded a mean bias of −0.05 ± 0.05 µatm and −0.17 ± 0.29 µatm, and a mean R^{2} of 0.761 ± 0.005 and 0.680 ± 0.001, for fitting and validation, respectively. The SVM yielded a larger variance for both bias and R^{2}, as no single ε can minimize the bias of all datasets, and the number of support vectors changes with different datasets even when the same C value is used.
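The mean ± standard deviation summaries above are computed over the 10 Monte Carlo splits; a sketch with hypothetical per-split validation biases (the actual values are in Table 2):

```python
import statistics

# Hypothetical per-split validation biases (µatm), constructed to average -0.14.
biases = [-0.30, -0.05, 0.10, -0.25, -0.40, 0.05, -0.20, -0.10, -0.15, -0.10]

mean_bias = statistics.mean(biases)   # -> -0.14
std_bias = statistics.stdev(biases)   # sample standard deviation over the splits
summary = f"{mean_bias:.2f} ± {std_bias:.2f} µatm"
```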

The two models did not differ much in computing time, evaluated on a PC with an Intel Xeon 3.20 GHz CPU and 32 GB of memory. The LS-SVM took 43 s to complete a training of 20,000 samples. The training time of the SVM increased linearly with C, from 7 s for C = 1 to 350 s for C = 1000. While the SVM takes more time to search for support vectors with a more relaxed constraint in Equation (4) or a larger C, the bottleneck of the LS-SVM is in solving Equation (9). In our experiments, the conjugate-gradient method of the LS-SVM software took about 200 to 300 iterations to obtain a solution with sufficient precision. The number of floating-point operations is about O(iterations × n^{2}) for an n × n matrix, which is much smaller than the O(n^{3}) operations of the LU decomposition method commonly used for solving linear systems of equations.
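A back-of-the-envelope check of these operation counts, taking n = 20,000 and an iteration count from the middle of the observed 200–300 range:

```python
n = 20_000        # training samples, giving an n x n kernel matrix
iterations = 250  # mid-range of the 200-300 CG iterations observed

cg_ops = iterations * n ** 2  # O(iterations * n^2) for the conjugate-gradient solver
lu_ops = n ** 3               # O(n^3) for LU decomposition
ratio = lu_ops / cg_ops       # n / iterations = 80, i.e. ~80x fewer operations for CG
```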

The LS-SVM has fewer parameters to tune than the SVM; obtaining the optimal parameters of the SVM is therefore harder. However, the memory-hungry character of the LS-SVM could limit its use with a large dataset. We tested the LS-SVM model with 60,000 training samples. It took 920 s to complete a training, which is still acceptable. However, it consumed 28.5 GB of memory, and increasing the sample size further halted our computer. Meanwhile, the SVM consumed only 130.5 MB of memory, and the training took 600 s for C = 100.