Estimation of Fugacity of Carbon Dioxide in the East Sea Using In Situ Measurements and Geostationary Ocean Color Imager Satellite Data

The ocean is closely linked to global warming and ongoing climate change because it regulates the amount of carbon dioxide through its interaction with the atmosphere. Monitoring ocean carbon dioxide is important for a better understanding of the role of the ocean as a carbon sink and of regional and global carbon cycles. This study estimated the fugacity of carbon dioxide (ƒCO2) over the East Sea, located between Korea and Japan. In situ measurements, satellite data and products from the Geostationary Ocean Color Imager (GOCI), and Hybrid Coordinate Ocean Model (HYCOM) reanalysis data were used in stepwise multi-variate nonlinear regression (MNR) and two machine learning approaches (i.e., support vector regression (SVR) and random forest (RF)). We used five ocean parameters, namely colored dissolved organic matter (CDOM; <0.3 m−1), chlorophyll-a concentration (Chl-a; <21 mg/m3), mixed layer depth (MLD; <160 m), sea surface salinity (SSS; 32–35), and sea surface temperature (SST; 8–28 °C), together with four band reflectance (Rrs) data (400 nm–565 nm) and their ratios as input parameters to estimate surface seawater ƒCO2 (270–430 μatm). Results show that RF generally performed better than stepwise MNR and SVR. The root mean square error (RMSE) of the validation results was 5.49 μatm (1.7%) for RF, compared with 10.59 μatm (3.2%) for stepwise MNR and 6.82 μatm (2.1%) for SVR. Ocean parameters (i.e., SSS, SST, and MLD) appeared to contribute more than the individual bands or band ratios from the satellite data. Spatial and seasonal distributions of monthly ƒCO2 produced from the RF model and of sea-air CO2 flux were also examined.


Introduction
Carbon dioxide, one of the greenhouse gases, has increased significantly since the industrial revolution due to economic and population growth. The increase of carbon dioxide concentration in the atmosphere accelerates global warming, which is directly related to ongoing climate change. Climate change has had significant impacts on human society and the natural environment all over the world, for example through more frequent extreme weather events and a rise in sea levels [1]. The ocean contains fifty times more carbon dioxide than the atmosphere, and twenty times more than terrestrial ecosystems. Although a substantial amount of atmospheric carbon dioxide is absorbed into the oceans, approximately half of anthropogenic carbon dioxide remains in the atmosphere, increasing its concentration [2]. Since the ocean acts as a buffer for carbon dioxide uptake, temporal and spatial changes of the sea-air carbon dioxide flux are crucial to understanding the global carbon cycle [3].
The increasing absorption of carbon dioxide from the atmosphere can cause severe damage to marine organisms [4]. When carbon dioxide dissolves in seawater, it forms carbonic acid and has a negative effect on the biochemical functions of organisms, in particular calcareous ones, such as lime, coral algae, clams, and oysters, whose shells are composed of calcium carbonate (e.g., [5] and references therein). Accumulated carbon dioxide in the ocean also accelerates global warming and climate change by decreasing the capacity of the ocean to absorb carbon dioxide from the atmosphere. Consequently, quantifying the distribution of ocean carbon dioxide is crucial in order to better understand the ocean carbon sink [6]. However, there has been limited exploration of ocean carbon dioxide around East Asia, especially in the East Sea located between Korea and Japan.
In situ field observations are relatively limited over the ocean because of technical and financial constraints [6]. In addition, in situ measurements are not typically continuous in the spatiotemporal domain. Satellite remote sensing can be a good alternative, as satellite sensors collect data over vast areas at high temporal resolution (~hours). Satellite data can be used to estimate various ocean parameters, including chlorophyll-a concentration (Chl-a), sea surface temperature (SST), and colored dissolved organic matter (CDOM), which are related to the distribution of carbon dioxide in the ocean.
In this study, we used multi-sensor satellite data and in situ measurements to quantify carbon dioxide over the East Sea of Korea. This study focuses on the fugacity of carbon dioxide (ƒCO2) in surface seawater in order to examine ocean carbon dioxide. Fugacity is expressed in pascals or in atmospheres, the same units that are used for partial pressure. Since it is difficult to retrieve the pressure (or fugacity) of carbon dioxide directly from satellite reflectance data, it is necessary to use satellite-derived ocean parameters that are related to carbon dioxide concentration [7]. Many studies have suggested that SST and Chl-a are the important parameters for estimating the partial pressure of carbon dioxide [8–19]. Gas solubility is strongly related to SST. SST also affects other carbon pumps, namely physical transport and biological photosynthesis and respiration [20]. SST can also act as an indicator of cooler water upwelled by vertical mixing [7]. In addition, sea surface salinity (SSS) [6–8,11,12,15], mixed layer depth (MLD) [10,11,16], CDOM [21], and wind speed [6,16,21] have been used to quantify the partial pressure of carbon dioxide in the ocean. SSS reflects the variability of carbon dioxide by expressing the mixing between seawater and freshwater [7]. As a tracer of water masses and water parcel history, SSS can capture characteristics of surface seawater carbon dioxide that are not determined by other ocean parameters (i.e., SST, MLD, and biological variables) [22]. MLD is the depth to which vertical mixing extends in the upper ocean. Deeper seawater has a high concentration of dissolved inorganic carbon, which can be brought up to the surface through vertical mixing, resulting in ƒCO2 increases [23,24]. CDOM and Chl-a are controlled by organisms that produce O2 and consume CO2. Active photosynthesis decreases surface water ƒCO2 [23,25–27]. Temperature is related to the seasonal thermal balance [7].
The East Sea, surrounded by the Korean peninsula, Japan, and Russia, is a mid-latitude marginal sea with an average depth of about 1750 m. It is connected to the western North Pacific through the Korea, Tsugaru, Soya/La Pérouse, and Tatar Straits. The inflow through the Korea Strait, which brings warm and salty water, and the outflow through the Tsugaru and Soya/La Pérouse Straits are mainly confined to the upper 150 m or so. Below this upper layer, there is a thermohaline circulation due to deep convection occurring in the northern part of the East Sea [28,29]. Deep convection carries anthropogenic carbon dioxide into the East Sea, making it a large reservoir for carbon dioxide. The specific column inventory of anthropogenic carbon dioxide in the East Sea is two to three times greater than that of the North Pacific, and this large accumulation of carbon dioxide results in a great change in the carbonate chemistry of the sea [23]. Within the East Sea, ƒCO2 varies even over a small area due to environmental variations [30]. In response to the increase in the atmospheric CO2 concentration, ƒCO2 in the East Sea steadily increased between 1995 and 2010 [20]. A numerical model with ocean carbon chemistry suggests that the inflow into the East Sea could also contribute to a change in ƒCO2 [31]. Given these conditions, quantifying ƒCO2 is crucial for monitoring carbon flux in the East Sea and improving our understanding of the regional carbon cycle. However, only a few studies have analyzed carbon dioxide in the East Sea using in situ data [20,23,30]. Park et al. [32] constructed a neural network model to estimate pCO2 in the East Sea based on satellite-derived SST and Chl-a and in situ pCO2 data from 2003 to 2012. The variability of pCO2 was large due to high primary production and mesoscale eddies, and the relationship between pCO2 and SST was stronger than that between pCO2 and Chl-a [32].
Most of the studies mentioned above estimated the pressure of ocean carbon dioxide using in situ and satellite data through statistical approaches, such as simple linear and multiple linear regression. Such linear approaches might not work well for the nonlinear behavior of carbon dioxide over the ocean with its dynamic spatiotemporal environment. More advanced algorithms that can handle the nonlinear behavior are required to examine the relationship between ocean carbon dioxide and ocean-related parameters [33]. Machine learning approaches and nonlinear analyses recently adopted in the field of remote sensing may be able to effectively model the pressure of ocean carbon dioxide. These techniques include various neural networks including self-organizing maps (SOM) [6,14,15,20,32,34,35], mechanistic nonlinear models [13], principal component analysis [9], mechanistic semi-analytic algorithms [7], and quadratic polynomial regression [9]. However, generalizing empirical models from one area to other areas is still challenging.
Most of the studies mentioned above used polar-orbiting satellite sensors (especially the Moderate Resolution Imaging Spectroradiometer (MODIS)). We used the Geostationary Ocean Color Imager (GOCI) satellite sensor, which has higher spatial and temporal resolutions than MODIS. This is the first study to estimate surface seawater ƒCO2 using GOCI satellite data.
The objectives of this study were (1) to estimate surface seawater ƒCO2 in the East Sea of Korea using satellite data and in situ measurements based on multi-variate nonlinear regression (MNR) and machine learning approaches, (2) to examine the ocean parameters contributing to ƒCO2 estimation, and (3) to investigate the spatial and temporal variation of surface seawater ƒCO2 and sea-air CO2 flux over the East Sea using satellite data. Two machine learning approaches, support vector regression (SVR) and random forest (RF), were used in this study. MNR, recently applied in a study [9] to estimate pCO2, was used for comparison with the two machine learning approaches.

In Situ Data
In situ measurements for this study were provided by the Korea Institute of Ocean Science and Technology (KIOST). KIOST conducted several field surveys (May 2014; August 2014; March 2015; April 2015; August 2015; October 2015; November 2015) in the East Sea and provided in situ data measured every minute. Each data point contains information about the date/time, location (latitude and longitude), SST (°C), SSS, and ocean ƒCO2 (µatm).
A LI-COR (LI-COR Biosciences, Lincoln, NE, USA) LI-7000 non-dispersive infrared analyzer was used to measure concentrations of carbon dioxide in the atmosphere and ocean following the standard operating procedure (SOP) [36]. The underway CO2 data obtained from the non-dispersive infrared analyzer were calibrated every three hours using three non-zero standard gases with known CO2 concentrations. SST and SSS were measured using a Sea-Bird Electronics 45 Micro TSG thermosalinograph (Sea-Bird Electronics Inc., Bellevue, WA, USA), which measures sea surface conductivity and SST in real time as seawater is pumped through it, and also computes SSS. Surface seawater was pumped aboard from a 2-m depth and a showerhead-type equilibrator was used. The underway system was operated on a cycle consisting of three standard gases, measurements of air, and measurements of a headspace equilibrated with flowing seawater. Air and seawater measurements were conducted at a 1-min interval. The system measured the mole fraction of CO2 in dry air (xCO2), which was converted into CO2 fugacity by correcting for the non-ideality of the gas and the water vapor level, as outlined in [37]. All post-cruise calculations of ƒCO2, including temperature correction, were done using the methods described in [37]. The maximum ƒCO2 in summer (August 2014) is higher than in the other seasons due to warming of surface seawater (Figure 1; Tables 1 and 2). Warming of surface seawater increases ƒCO2 thermodynamically [38]. Surface seawater ƒCO2 is higher in fall than in spring because of a higher SST and deeper MLD.
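The conversion from xCO2 to ƒCO2 described above can be sketched as follows. This is a minimal illustration, not the procedure of [37]: it assumes the widely used saturation vapor pressure relation of Weiss and Price (1980) and the virial-coefficient correction of Weiss (1974), treats (1 − xCO2)² as 1, and omits the temperature correction from equilibrator to in situ temperature. The function names are hypothetical.

```python
import math

R_GAS = 8.314462   # gas constant, J mol^-1 K^-1
ATM_PA = 101325.0  # pascals per standard atmosphere


def water_vapor_pressure_atm(temp_k, salinity):
    """Saturation water vapor pressure over seawater (atm).

    Assumed relation: Weiss and Price (1980).
    """
    return math.exp(24.4543
                    - 67.4509 * (100.0 / temp_k)
                    - 4.8489 * math.log(temp_k / 100.0)
                    - 0.000544 * salinity)


def xco2_to_fco2(xco2_ppm, temp_c, salinity, pressure_atm=1.0):
    """Convert dry-air mole fraction (ppm) to fugacity (uatm).

    temp_c and pressure_atm are the equilibrator temperature and pressure.
    """
    t = temp_c + 273.15
    # Partial pressure in water-saturated air (ppm * atm = uatm).
    pco2 = xco2_ppm * (pressure_atm - water_vapor_pressure_atm(t, salinity))
    # Virial coefficients of CO2 and CO2-air (cm^3 mol^-1), assumed from Weiss (1974).
    b = -1636.75 + 12.0408 * t - 3.27957e-2 * t ** 2 + 3.16528e-5 * t ** 3
    delta = 57.7 - 0.118 * t
    # Non-ideality correction; 1e-6 converts cm^3 to m^3.
    exponent = pressure_atm * ATM_PA * (b + 2.0 * delta) * 1e-6 / (R_GAS * t)
    return pco2 * math.exp(exponent)
```

Under these assumptions, the fugacity is slightly lower (a few tenths of a percent) than the partial pressure at typical sea surface temperatures.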
Remote Sens. 2017, 9, 821

GOCI Imagery
The Geostationary Ocean Color Imager (GOCI), launched in June 2010, is the first geostationary ocean color observation satellite sensor in the world. GOCI collects data hourly, eight times per day from 9 a.m. to 4 p.m. local time, at six visible bands (centered at 412 nm, 443 nm, 490 nm, 555 nm, 660 nm, and 680 nm) and two near-infrared bands (centered at 745 nm and 865 nm) at 500 m resolution. It covers a 2500 km × 2500 km area around the Korean peninsula, East China, and Japan (Figure 1). GOCI provides three basic products, the concentrations of Chl-a, suspended sediment, and CDOM, which are downloadable free of charge from the Korea Ocean Satellite Center (KOSC) website [39]. This study used the Chl-a and CDOM products (Table 1), which are related to the biochemical processes and carbon cycle at the ocean surface. Band reflectance (Rrs) data at four visible bands (bands 1–4; Table 3) among the eight bands were used in this study because most pixels in the red and NIR bands have very low reflectance, close to 0. Details of the products can be found on the KOSC website [40].

HYCOM Imagery
The HYbrid Coordinate Ocean Model + NRL Coupled Ocean Data Assimilation (HYCOM + NCODA) system provides reanalyzed ocean data based on multiple satellite data and in situ measurements through partnerships with various academic, governmental, and commercial entities, such as the Florida State University Center for Ocean-Atmospheric Prediction Studies (FSU/COAPS), the Naval Research Laboratory/Stennis Space Center (NRL/STENNIS), the National Oceanic and Atmospheric Administration (NOAA), and Planning Systems Inc. (Reston, VA, USA). Among the many variables that HYCOM provides, we used the daily 1/12 degree (around 8.3 km) MLD, SST, and SSS products (Table 1), which were downloaded from the HYCOM + NCODA homepage [41].

NOAA Greenhouse Gas Marine Boundary Layer Reference
The NOAA Greenhouse Gas Marine Boundary Layer (MBL) reference provides surface xCO2 data over the global ocean based on in situ measurements from the NOAA Earth System Research Laboratory (ESRL) carbon cycle cooperative global air sampling network [42]. Surface xCO2 data were used to calculate atmospheric ƒCO2 in order to analyze whether the East Sea acts as a carbon sink or source. The reference provides weekly data at latitude intervals of 0.05 in sine of latitude, and the data were downloaded from the NOAA ESRL Greenhouse Gas MBL reference homepage [43].
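Because the MBL reference is spaced evenly in sine of latitude rather than in degrees, a site has to be mapped to the nearest grid node before the weekly xCO2 value is read. The sketch below shows one way to do this; the function name is hypothetical, and nearest-node selection (rather than interpolation between nodes) is an assumption, as the text does not say which was used.

```python
import math


def nearest_sine_latitude_node(lat_deg, step=0.05):
    """Return (sine-of-latitude node, node latitude in degrees) closest to lat_deg."""
    s = math.sin(math.radians(lat_deg))
    node = round(s / step) * step
    node = max(-1.0, min(1.0, node))  # guard against rounding past the poles
    return node, math.degrees(math.asin(node))
```

For a latitude of 37°N, in the East Sea, the nearest node is sin(latitude) = 0.60, which corresponds to about 36.9°N.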

European Reanalysis (ERA)-Interim Data
The European Centre for Medium-Range Weather Forecasts (ECMWF) provides ERA-Interim data, a global atmospheric reanalysis based on data assimilation that covers the period since 1979 [44]. ERA-Interim data include various variables related to atmospheric conditions. We used daily 0.125 degree mean sea level pressure and 10 m U and V wind components to calculate atmospheric ƒCO2 and sea-air CO2 flux. ERA-Interim daily data were downloaded from the ECMWF homepage [45].

Methodology

Experimental Schemes
In this study, five major ocean parameters and individual band reflectance (Rrs) data were used to estimate ƒCO2 in surface seawater from GOCI satellite products and in situ measurements. We used ocean parameters (SST, SSS, MLD, CDOM, and Chl-a), as in the existing literature, and additionally individual band reflectance and band ratio data as input parameters. Satellite-derived ocean parameters are produced from band reflectance data and can carry high uncertainty, especially near coastal areas. Band reflectance data and their ratios were considered because satellite-derived CDOM and Chl-a products might not effectively reflect biological processes in the East Sea.
In this study, in situ samples that matched the satellite data in space and time were used to estimate ƒCO2. Some samples were excluded due to cloud cover in the corresponding satellite data, and, when multiple samples were located within one pixel, they were averaged. While in situ data were measured every minute, GOCI data were collected every hour between 9 a.m. and 4 p.m. at 500 m spatial resolution. Thus, in situ samples were selected considering the collection times and the spatial resolution of the satellite data. HYCOM data were resampled to 500 m, the same pixel size as GOCI, using bilinear interpolation. Samples were divided into two groups: eighty percent of the samples were randomly selected and used to train the machine learning algorithms, while the remaining twenty percent were used to validate the models for estimating ƒCO2.
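The pixel averaging and the random 80/20 split described above can be sketched as follows. The function names are hypothetical, and the regular latitude/longitude cell of 0.005 degrees is only a rough stand-in for a 500 m GOCI pixel; the actual GOCI grid and the random seed used by the authors are not given in the text.

```python
import math
import random
from collections import defaultdict


def average_within_pixels(samples, pixel_deg=0.005):
    """Average in situ values that fall in the same grid cell.

    samples: iterable of (lat, lon, value) tuples.
    Returns a dict mapping (row, col) cell indices to the mean value.
    """
    cells = defaultdict(list)
    for lat, lon, value in samples:
        key = (math.floor(lat / pixel_deg), math.floor(lon / pixel_deg))
        cells[key].append(value)
    return {key: sum(values) / len(values) for key, values in cells.items()}


def split_train_validation(items, train_fraction=0.8, seed=0):
    """Randomly split items into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]
```

Averaging before splitting keeps several one-minute samples from the same pixel from appearing in both the training and validation sets.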
We used four band reflectance data (excluding red and NIR bands; Table 3) and their ratios, and ocean parameters (CDOM, Chl-a, MLD, SSS, and SST) as input parameters to develop machine learning models to estimate ƒCO2 in surface seawater. The total number of samples was 843 (673 for calibration and 170 for validation). SST, SSS, and MLD were taken from HYCOM, while visible (i.e., blue and green) band reflectance (Rrs), their ratios, CDOM, and Chl-a were taken from GOCI. Then, the developed models were applied to GOCI satellite band reflectance, their ratios, CDOM, Chl-a, HYCOM MLD, SSS, and SST images to examine the spatiotemporal patterns of ƒCO2 in surface seawater. The MNR model used the log-transformed MLD because of its wide range of values [9]; the transformation was not applied in the machine learning approaches, which do not generally require data normalization or scaling. Unlike traditional statistical approaches, RF and SVR are not significantly influenced by the range distribution of parameters.
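A sketch of how the input vector for one matched sample could be assembled is given below. The function and key names are hypothetical, and using every pairwise ratio of the four bands is an assumption made for illustration; the text does not list which ratios were actually used.

```python
import math
from itertools import combinations


def build_features(sst, sss, mld, cdom, chl_a, rrs):
    """Assemble ocean parameters, band reflectance, and band ratios.

    rrs: dict mapping a band name (e.g., "band1") to its reflectance.
    """
    features = {
        "SST": sst,
        "SSS": sss,
        "MLD": mld,
        "log10_MLD": math.log10(mld),  # log-transformed MLD, used only by MNR
        "CDOM": cdom,
        "Chl_a": chl_a,
    }
    features.update(rrs)
    # Assumed here: all pairwise ratios between the available bands.
    for a, b in combinations(sorted(rrs), 2):
        features[f"{a}/{b}"] = rrs[a] / rrs[b]
    return features
```

With four bands this yields six ratios in addition to the four reflectance values and the ocean parameters.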

Multi-Variate Nonlinear Regression
Chen et al. [9] used stepwise multiple linear regression (MLR), principal component regression (PCR), multi-variate nonlinear regression (MNR), and stepwise MNR to estimate surface pCO2 from MODIS satellite data in the West Florida Shelf. They found that MNR performed the best among the four approaches, and we adopted MNR in the present study to compare its performance with those of our proposed machine learning approaches. ƒCO2 in surface seawater is calculated using Equation (1) based on MNR [9], where n is the number of input parameters, x_i is each input parameter, and k is the coefficient associated with each term (coefficients are not shown in this paper). Various combinations of input variables (a total of 16 combinations with 4–7 parameters, based on random selection and variable importance identified from RF) were evaluated, and the subset that resulted in the best performance was used in the subsequent analyses. MNR is easier to understand than machine learning approaches. However, it is time-consuming to identify the best combination of input parameters (through a trial-and-error approach), and MNR requires statistical assumptions on data distribution.
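To illustrate what a regression with one coefficient k per term looks like, the sketch below expands n inputs into the terms of a second-order polynomial and evaluates the model. The quadratic form, the inclusion of cross terms, and the function names are assumptions made for illustration; Equation (1) itself is not reproduced in this text and may differ.

```python
from itertools import combinations


def quadratic_terms(x, include_cross_terms=True):
    """Expand inputs x_1..x_n into intercept, linear, squared, and cross terms."""
    terms = [1.0]
    terms.extend(x)
    terms.extend(v * v for v in x)
    if include_cross_terms:
        terms.extend(a * b for a, b in combinations(x, 2))
    return terms


def predict(coefficients, x, include_cross_terms=True):
    """Evaluate the polynomial: sum of k * term over all terms."""
    terms = quadratic_terms(x, include_cross_terms)
    if len(terms) != len(coefficients):
        raise ValueError("one coefficient is required per term")
    return sum(k * t for k, t in zip(coefficients, terms))
```

Under this assumed form, seven input parameters already give 36 coefficients, which is one reason why searching over input subsets by trial and error is slow.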

Random Forest
RF uses multiple classification and regression trees (CART) [46], each of which is built from repeated binary splits of the data. Although CART and other decision trees have been widely used for various classification and regression tasks [47–50], a single tree is highly sensitive to the configuration of the training data, which frequently results in overfitting. RF mitigates this weakness of CART by introducing two randomizations to develop many independent trees: a random subset of training samples is used to construct each tree, and a random subset of input parameters is considered at each node. Such independent trees (typically more than 100) are combined to reach a solution through (weighted) majority voting for classification and (weighted) averaging for regression. To determine the variable used at each node, RF uses the Gini index, a measure of statistical dispersion, for classification; for regression, as in this study, splits are chosen to minimize the residual sum of squares within the resulting nodes. RF performs internal cross validation using unused samples (i.e., out-of-bag (OOB) samples) for each tree and provides relative variable importance, which is measured by the increase in mean squared error on the OOB samples when a variable is permuted. Breiman [46] provides more details about RF. RF has been widely used for various classification and regression tasks [51–62] and is known to be less prone to overfitting than simple decision trees such as CART. RF requires fewer parameter settings and is faster than SVM and other ensemble classifiers [57]. Although its results are straightforward, it is hard to interpret all trees when the number of trees is large (e.g., 500–1000). RF is also known to be often sensitive to sample size and quality [57].
In this study, RF was implemented in the R statistical software (version 3.1.3; Vienna, Austria) using the 'randomForest' package. We used 500 trees, which is the default number of trees in the package and the number recommended by [57] when using the RF classifier on remotely sensed data.
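The permutation-based variable importance described above can be illustrated with a model-agnostic sketch. It is a simplification: RF computes the increase in mean squared error tree by tree on each tree's OOB samples, whereas this example (with hypothetical function names) shuffles one column of a held-out set and scores any fitted model passed in as a callable.

```python
import random


def mean_squared_error(model, rows, targets):
    """Mean squared error of model(row) against the targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in MSE after shuffling the values of one input column.

    rows: list of lists of feature values; column: index of the variable.
    """
    baseline = mean_squared_error(model, rows, targets)
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mean_squared_error(model, permuted, targets) - baseline
```

A variable that the model ignores gets an importance of zero, while a variable the model relies on produces a clear increase in error once its values are shuffled.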

Support Vector Regression
SVR is a regression version of support vector machines (SVM) and has been used for estimating biophysical parameters in the remote sensing field [63]. SVR maps the input data into a higher-dimensional space using a kernel function and finds an optimum hyperplane there; in SVM classification the hyperplane separates the classes, whereas in SVR it is fitted to the samples. SVM and SVR are known to be good at modeling when the training sample size is small [64–69]. It is often challenging to select and parameterize an appropriate kernel. Commonly used kernel functions include linear, polynomial, Gaussian, sigmoid, spectral angle, and radial basis functions [58,70,71]. Since many studies have shown that the radial basis kernel function performs better than the others in various applications, it was also used in this study. We used the LIBSVM library for SVM [72]. A grid search was applied to optimize the two parameters of the radial basis kernel function (the selected values in LIBSVM were 8 for gamma and 512 for the penalty parameter). SVM/SVR is known to be less sensitive to sample size and quality, but selecting a proper kernel function and optimizing the associated parameters are crucial for successful performance of SVM/SVR [62]. Although SVM/SVR has been widely used for various remote sensing applications, it is difficult to interpret results from SVM/SVR to understand the mechanism of the underlying processes. A detailed explanation of SVM/SVR can be found in [63].
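The two tuned parameters can be made concrete with a short sketch of the radial basis kernel and of a candidate grid for the search. The function names are hypothetical, and the power-of-two grid is the one suggested in the LIBSVM practical guide; that the authors searched this particular grid is an assumption, although both reported values (gamma = 8 and penalty = 512) lie on it.

```python
import math


def rbf_kernel(x, y, gamma=8.0):
    """Radial basis kernel exp(-gamma * ||x - y||^2) between two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def power_of_two_grid(low_exponent, high_exponent, step=2):
    """Candidate values 2^low, 2^(low+step), ..., up to 2^high."""
    return [2.0 ** e for e in range(low_exponent, high_exponent + 1, step)]


# Assumed search ranges (LIBSVM guide): penalty 2^-5..2^15, gamma 2^-15..2^3.
penalty_candidates = power_of_two_grid(-5, 15)
gamma_candidates = power_of_two_grid(-15, 3)
```

A larger gamma makes the kernel decay faster with distance, so each training sample influences a smaller neighborhood; the penalty parameter controls how strongly deviations outside the SVR tolerance band are penalized.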

Cost Function
Cost functions (CF) were calculated to compare the accuracy of the models for each scheme (Equation (2)) [11,73], where N is the number of samples and σ is the standard deviation of the in situ measurements. CF indicates the closeness between each sample and the output of the optimized model; a lower CF means better performance.
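A commonly used form of such a cost function is the mean absolute difference between model output and observation, normalized by the standard deviation of the observations. The sketch below assumes that form, and the use of the sample standard deviation; Equation (2) is not reproduced in this text and may differ in detail. The function name is hypothetical.

```python
import math


def cost_function(observed, predicted):
    """Mean absolute error normalized by the standard deviation of the observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Sample standard deviation (n - 1 in the denominator) is an assumption.
    sigma = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / (n - 1))
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / (n * sigma)
```

Because the error is expressed in units of the observed variability, CF values can be compared between data sets with different ƒCO2 ranges.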

Sea-Air CO 2 Flux Calculation
Sea-air CO2 flux is an important factor in determining whether the ocean acts as a carbon sink or source with respect to the atmosphere. Sea-air flux can be calculated by Equation (3) [36,74], where U10 is the 10 m wind speed (m s−1), Sc is the Schmidt number, 660 is the Schmidt number of CO2 in seawater at 20 °C, and K0 is the solubility of CO2 as a function of SST and SSS. The flux is expressed in mol m−2 year−1.
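A bulk flux calculation using the quantities defined above can be sketched as follows. Several choices here are assumptions, because Equation (3) is not reproduced in this text: the quadratic wind-speed dependence with a coefficient of 0.251 (a value from the gas-exchange literature, giving a transfer velocity in cm h−1), K0 supplied in mol L−1 atm−1, and a sign convention in which positive flux is from sea to air. The function names are hypothetical.

```python
def gas_transfer_velocity(u10, schmidt, coefficient=0.251):
    """Gas transfer velocity (cm h^-1) from 10 m wind speed (m s^-1)."""
    return coefficient * u10 ** 2 * (schmidt / 660.0) ** -0.5


def sea_air_co2_flux(u10, schmidt, k0, fco2_sea, fco2_air, coefficient=0.251):
    """Sea-air CO2 flux in mol m^-2 year^-1 (positive = ocean is a source).

    k0: solubility in mol L^-1 atm^-1; fco2_sea and fco2_air in uatm.
    """
    k_m_per_year = gas_transfer_velocity(u10, schmidt, coefficient) * 0.01 * 24 * 365
    k0_mol_per_m3_atm = k0 * 1000.0               # L^-1 -> m^-3
    delta_atm = (fco2_sea - fco2_air) * 1e-6      # uatm -> atm
    return k_m_per_year * k0_mol_per_m3_atm * delta_atm
```

For example, with U10 = 7 m s−1, Sc = 660, K0 = 0.035 mol L−1 atm−1, and seawater ƒCO2 50 µatm below the atmospheric value, the sketch gives a flux of roughly −1.9 mol m−2 year−1, i.e., the ocean acts as a sink.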

Estimation of Surface Seawater ƒCO 2
Table 4 summarizes the correlation coefficients between input parameters and in situ surface seawater ƒCO2. SST shows the best correlation with surface seawater ƒCO2 among the input parameters. SST is strongly related to gas solubility, and it has been suggested as a crucial parameter in many studies [6–19]. Table 5 summarizes the calibration and validation results for each model. In the case of MNR (Equation (1) in the Multi-Variate Nonlinear Regression section), various combinations of input parameters were tested and the one yielding the best performance is presented in the table. The best combination of input parameters was SST, SSS, log10(MLD), band 1, band 2, band 3, and band 4.
Figure 2 depicts the scatterplots between in situ measured and predicted surface seawater ƒCO2. The two machine learning approaches produced high and similar calibration accuracy (root mean square error (RMSE) = 1.82–2.31 µatm; 0.6–0.7%). For the validation results, RF (5.49 µatm; 1.7%) and SVR (6.82 µatm; 2.1%) performed better than MNR (10.59 µatm; 3.2%), and RF produced slightly lower RMSE, mean bias error, and CF values than SVR (Table 5). We also tested another rule-based machine learning algorithm called Cubist [75], developed by RuleQuest Research Inc. (Empire Bay, Australia). Cubist performed better than SVR and worse than RF. The results of Cubist are not shown in this paper because Cubist is similar to RF in that both are rule-based.
We also evaluated MODIS satellite data (Chl-a, CDOM, and four band reflectance data) to justify the use of GOCI data. MODIS-based models produced higher uncertainty, for two possible reasons: (1) a much smaller training sample size for developing MODIS-based models when compared to GOCI-based models, and (2) a greater temporal discrepancy between in situ measurements and satellite-derived products for MODIS-based models (in situ data from 9 a.m. to 4 p.m. vs. one MODIS image per day around 1:30 p.m.) (results not shown).
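The accuracy measures reported for the calibration and validation results can be computed as in the sketch below. Expressing the relative RMSE as a percentage of the mean observed ƒCO2 is an assumption; the text gives the percentages without defining them. The function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (e.g., uatm)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def relative_rmse_percent(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    return 100.0 * rmse(observed, predicted) / (sum(observed) / len(observed))


def mean_bias_error(observed, predicted):
    """Mean of (predicted - observed); positive values indicate overestimation."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)
```

RMSE penalizes large individual errors, whereas the mean bias error shows whether a model systematically over- or underestimates ƒCO2; a model can have zero bias and still a large RMSE.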
Although the number of in situ measurements used in this study was relatively small, they cover spring (March, April, and May), summer (August), and fall (October and November), which implies that the proposed models can be used to estimate surface seawater ƒCO2 in any season except winter. In the case of SST and SSS, the difference between in situ measurements and HYCOM data might increase the uncertainty of the models, although the relationships between the two are moderately good, with R2 of approximately 0.87–0.9 (Figure 3). It should be noted that HYCOM data are daily while in situ data are time-specific (between 9 a.m. and 4 p.m.), and thus some discrepancies between the two exist. In addition, discrepancies occur due to the different spatial scales: while the spatial resolution of HYCOM is 1/12 degree, in situ samples are point-based. Thus, when multiple in situ samples were located in one HYCOM grid cell, they were averaged. Since there were no in situ Chl-a and CDOM data, it was not possible to compare them to the satellite-derived products.
SVR is known to be suitable for modeling especially when the training sample size is small [64–67], and the results of this study were consistent with this. As the selection and optimization of a kernel function is critical for successful modeling in SVR, the use of different kernel functions and parameters might improve its performance. RF uses an ensemble approach (i.e., numerous trees), which is efficient at modeling the nonlinear characteristics of surface seawater ƒCO2. By using 500 independent trees, the well-known overfitting problem of a single tree was effectively mitigated in this study [57]. RF showed the highest accuracy in this study. Both linear and nonlinear characteristics of surface seawater ƒCO2 were effectively modeled using the RF approach. Since RF combines the results of numerous independent trees, it appeared to be less sensitive to overfitting, resulting in higher validation accuracy when compared to the other approaches. More training samples may improve the validation accuracy of the two machine learning approaches in future studies.
Figure 4 summarizes the relative variable importance provided by RF. According to this ranking, the three ocean parameters (i.e., SST, SSS, and MLD) were more important than the individual band reflectance data, except for the ratio between bands 1 and 2. Although some of the ocean parameters were generated using satellite band reflectance data, the band reflectance data by themselves were not able to effectively model surface seawater ƒCO2 without other input variables in any of the three modeling approaches. MLD was identified as the most important variable in the RF model. MLD is the second least correlated with surface seawater ƒCO2 (Table 5), yet it is the most important variable in the RF model, which implies a nonlinear relationship between MLD and surface seawater ƒCO2. While MLD and ƒCO2 showed a relatively positive relationship in spring, no clear pattern (but some clusters) was found in summer and autumn (not shown). This nonlinear seasonality with clusters could be well addressed by the rules produced in RF. Since MLD is a result of vertical mixing, it is a parameter controlling surface seawater ƒCO2 [23]. By controlling stratification and subsequent vertical mixing, salinity also influences ƒCO2. As SST is related to the solubility of CO2 and directly affects biological processes [7], it has been commonly used to estimate the partial pressure of CO2 [10,11,14,17]. The various carbon pumps in the ocean carbon cycle are directly or indirectly related to SST.
Among the band reflectance data, band 4 and the ratio between bands 1 and 2 were identified as important in the RF model. The ratio between bands 3 and 4 is used to make the CDOM and Chl-a products, and band 2 is used to make CDOM for GOCI [76]. Results from the RF model showed that several other bands and band ratios, which are not used to make ocean parameters for GOCI, are important for estimating surface seawater ƒCO2. This implies that using only a few ocean parameters to predict surface seawater ƒCO2 might not be ideal, because they do not represent all major biological processes occurring in surface seawater in different environments. In that regard, individual band reflectance data can serve as good supplements for modeling ƒCO2.
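The "increase of MSE when a variable is perturbed" measure behind Figure 4 can be illustrated with a minimal sketch. This is not the study's implementation: a hypothetical fixed function (`toy_model`) stands in for the trained RF, the data are synthetic, and the whole sample is used instead of per-tree out-of-bag data. The names `permutation_importance`, `toy_model`, `rows`, and `targets` are assumptions introduced only for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean percent increase in MSE when each input column is shuffled."""
    rng = random.Random(seed)
    base = mse(targets, [predict(r) for r in rows])
    importance = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other inputs untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            perturbed = mse(targets, [predict(r) for r in shuffled])
            increases.append(100.0 * (perturbed - base) / base)
        importance.append(sum(increases) / n_repeats)
    return importance


# Synthetic, hypothetical data: column 0 mimics MLD (m), column 1 a reflectance.
data_rng = random.Random(42)
rows = [[data_rng.uniform(10, 150), data_rng.uniform(0.001, 0.01)] for _ in range(200)]
targets = [300 + 0.5 * mld + data_rng.gauss(0, 5) for mld, _ in rows]


def toy_model(row):
    """Stand-in for a fitted model; it uses column 0 and ignores column 1."""
    return 300 + 0.5 * row[0]


importance = permutation_importance(toy_model, rows, targets)
print(importance)
```

A variable the model relies on produces a large increase in error when shuffled, whereas a variable the model ignores produces none, regardless of how strongly either is linearly correlated with the target.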

Spatial and Temporal Distribution of Surface Seawater ƒCO2
We applied the RF model, which produced the best performance, to GOCI satellite and HYCOM reanalysis images, and examined the seasonal variability of surface seawater ƒCO2 averaged by month in 2015. The GOCI images collected around 1:30 p.m. were used to produce the distribution of surface seawater ƒCO2. The satellite-based surface seawater ƒCO2 maps show a spatial distribution similar to that of the in situ measurements (Figure 1). The spatial distribution of GOCI-estimated surface seawater ƒCO2 has a similar pattern to those of SST, SSS, and MLD (Figure 5), which are highly affected by ocean currents in the East Sea [77]. While the estimated surface seawater ƒCO2 generally shows a monthly pattern similar to those of SSS and SST, it is substantially influenced by MLD when MLD shows extreme variations. The estimated monthly surface seawater ƒCO2 in summer (June, July, and August) might be affected by an inflow of warm current from the south, and it shows a distribution similar to those of HYCOM SSS and SST. The low surface seawater ƒCO2 in coastal areas in July, August, and September appears to be related to biological activity, which implies that high biological activity reduces surface seawater ƒCO2 values. This agrees with in situ measurements in August showing low surface seawater ƒCO2 values in coastal regions (Figure 1). The vortex pattern of the estimated surface seawater ƒCO2 in fall is driven by the distributions of MLD, SSS, SST, and Chl-a (Figure 5). In particular, patches with relatively high ƒCO2 values found in the southern part of the East Sea in June are mainly associated with lower SSS values.
Many factors contribute to the uncertainty of the spatial distribution of ƒCO2 produced by the proposed approach. First, the input parameters, especially the ocean parameters from GOCI and HYCOM, contain their own uncertainties, which propagate into the estimated distribution of ƒCO2. This implies that more robust and reliable ocean parameter products may further reduce the uncertainty of the estimated ƒCO2. For example, the operational algorithms of the GOCI standard products (e.g., Chl-a and CDOM) will be upgraded in the GOCI Data Processing System (GDPS) and made available later in 2017, so the accuracy of ƒCO2 estimation may improve with the new products. In addition, it should be noted that, mainly due to cloud cover, a different number of daily ƒCO2 values was available for each pixel when generating the spatial distribution of ƒCO2 for each month. In particular, areas with large environmental changes in terms of currents and biological activity appeared to have relatively high uncertainty in the spatial distribution of ƒCO2.
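The effect of cloud cover on the monthly maps can be made concrete with a small sketch of per-pixel compositing. It assumes, for illustration only, that each daily ƒCO2 map is a nested list with `None` marking cloud-masked pixels; the function name `monthly_composite` and the sample values are hypothetical and do not come from the study.

```python
def monthly_composite(daily_maps):
    """Average daily maps pixel by pixel, skipping None (cloud-masked) values.

    Returns (mean_map, count_map); count_map holds the number of valid days
    that contributed to each pixel.
    """
    n_rows, n_cols = len(daily_maps[0]), len(daily_maps[0][0])
    mean_map = [[None] * n_cols for _ in range(n_rows)]
    count_map = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            valid = [day[i][j] for day in daily_maps if day[i][j] is not None]
            count_map[i][j] = len(valid)
            if valid:
                mean_map[i][j] = sum(valid) / len(valid)
    return mean_map, count_map


# Three hypothetical daily 2 x 2 maps (values in µatm); None marks cloud.
days = [
    [[350.0, None], [340.0, 330.0]],
    [[354.0, None], [None, 334.0]],
    [[352.0, None], [None, 332.0]],
]
mean_map, count_map = monthly_composite(days)
print(mean_map)
print(count_map)
```

Keeping the count map alongside the mean makes it easy to flag pixels whose monthly value rests on only one or two cloud-free days.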
Compared to the in situ measurement data, the estimated surface seawater ƒCO2 in November shows lower values. This is likely due to the temporal bias of the in situ measurements, which covered only one day (15 November). The surface seawater ƒCO2 values in summer are higher than in other seasons mainly due to higher SST values. An increase in SST raises surface seawater ƒCO2 thermodynamically [38]. Biological activity controlling surface seawater ƒCO2 is also lowest in summer, except in coastal areas, due to a lack of nutrients. Surface seawater ƒCO2 values in fall and winter are higher than those in spring, and variations in biological production, SST, and MLD can explain these seasonal differences [20]. A deeper MLD in winter brings subsurface waters with high CO2 concentrations to the surface, leading to higher surface seawater ƒCO2 values. Nutrients supplied from the subsurface to the surface by the deepening of the MLD cause an increase in biological production in the following spring, resulting in a decrease in ƒCO2 values. In addition to higher biological production, lower SST values in spring compared to those in fall are also responsible for the lower surface seawater ƒCO2 values in spring. Seasonal variations of the estimated ƒCO2 values correspond well to measured seasonal changes in surface seawater ƒCO2 in the East Sea [78]. Since the in situ data are limited, covering only spring, summer, and fall, the ƒCO2 estimates for the winter season might be highly uncertain.
Validation of the monthly distribution of ƒCO2 produced from GOCI and HYCOM data was conducted using the in situ measurements. In situ ƒCO2 measurements collected around the GOCI acquisition time (around 1:30 p.m.) were averaged by month and compared with the monthly surface seawater ƒCO2 estimates from GOCI and HYCOM data in each pixel (Figure 6). It should be noted, though, that there is a temporal discrepancy between the two: while the daily satellite-estimated ƒCO2 images were averaged over the cloud-free days of a month, in situ measurements from only a few days per month were averaged for comparison. Nonetheless, the results are very promising and support the use of machine learning and satellite data fusion approaches for the estimation of surface seawater ƒCO2.
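A comparison of this kind is usually summarized with a few error statistics. The sketch below computes mean bias, RMSE, and a relative RMSE for paired monthly values. It assumes that relative RMSE is RMSE divided by the mean observed value; that definition is consistent with the percentages quoted in the abstract but is not stated in this section. The function name `validation_metrics` and the sample numbers are hypothetical.

```python
import math


def validation_metrics(observed, estimated):
    """Mean bias, RMSE, and relative RMSE (% of the mean observed value)."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    mean_observed = sum(observed) / n
    return {
        "bias": bias,
        "rmse": rmse,
        "rrmse_percent": 100.0 * rmse / mean_observed,
    }


# Hypothetical monthly means in µatm: in situ versus satellite-based estimate.
in_situ = [300.0, 320.0, 340.0]
estimate = [305.0, 318.0, 343.0]
metrics = validation_metrics(in_situ, estimate)
print(metrics)
```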

Sea-Air CO2 Fluxes
Figure 7 shows the monthly distributions of surface seawater ƒCO2, delta (sea-air) ƒCO2, and sea-air CO2 flux for the reference year 2015 calculated using Equation (3), and Table 6 shows the monthly mean values of each factor over the whole study area (the East Sea) in 2015. The spatial distributions of delta (sea-air) ƒCO2 and sea-air CO2 flux have similar patterns to those of the predicted surface seawater ƒCO2 and SST (Figure 5). For delta (sea-air) ƒCO2 and sea-air CO2 flux, blue depicts absorption of carbon from the atmosphere (negative value; carbon sink) and red depicts efflux of carbon from the ocean (positive value; carbon source). Seasonal changes in the CO2 flux between the ocean and atmosphere are clearly visible in Figure 7. Overall, the East Sea absorbs CO2 from the atmosphere throughout the whole region, acting as a sink for atmospheric CO2, except for some areas in July and August (Figure 7 and Table 6). The monthly mean delta (sea-air) ƒCO2 shows a minimum in April and a maximum in August, in agreement with the result of [20] (Table 6). The largest CO2 flux to the ocean was estimated in winter and the lowest flux was found in summer. The annual mean CO2 flux of −1.53 mol m−2 year−1 estimated in this study is lower in magnitude than the CO2 flux (−2.47 ± 1.26 mol m−2 year−1) reported for the Ulleung Basin of the East Sea [78]. This is because the Ulleung Basin is the most productive area in the East Sea apart from the coastal regions, and our study area is much larger than that of [78].
The amount of absorbed CO2 was smaller from June to September than in the other months (delta ƒCO2 from June to September ranged from −29.50 to −4.14 µatm while the yearly mean was −48.92 µatm; CO2 flux from June to September ranged from −0.56 to −0.11 mol m−2 year−1 while the yearly mean was −1.53 mol m−2 year−1). This lower uptake of atmospheric CO2 in summertime is mainly due to higher SST than in the other seasons. An increase in SST decreases solubility, leading to an increase in surface seawater ƒCO2. Lower biological production compared to other seasons also resulted in higher surface seawater ƒCO2 and less negative, or even positive, delta (sea-air) ƒCO2 values. While the highest monthly mean surface seawater ƒCO2 value was observed in July, the highest monthly mean delta (sea-air) ƒCO2 (−4.14 µatm) and the lowest monthly mean CO2 flux to the ocean (−0.11 mol m−2 year−1) were found in August (Table 6). This is due to lower atmospheric CO2 values in August than in July. The seasonal cycle of atmospheric CO2 is in the opposite phase to that of surface seawater ƒCO2, following the intensity of land biosphere production. In addition, the month with the lowest monthly mean delta (sea-air) ƒCO2 (April) does not correspond to that with the highest monthly mean CO2 flux to the ocean (February), because wind speed is higher in winter than in spring. Variation in wind speed has a significant impact on the changes in sea-air CO2 fluxes (see Equation (3)). From July to August, some areas of the East Sea experienced weakly positive CO2 fluxes (red), implying a release of CO2 to the atmosphere. The areas showing positive flux values in August correspond well to those with relatively higher SSS values (Figure 5), where higher salinity decreases CO2 solubility and thus raises surface seawater ƒCO2.
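The sign convention and the role of wind speed can be illustrated with a small sketch of the widely used bulk formula, flux = k × K0 × (ƒCO2 sea − ƒCO2 air). Equation (3) is not reproduced in this section, so the exact form and parameterization used in the study may differ. The quadratic wind-speed relation for the gas transfer velocity (coefficient 0.251, after Wanninkhof 2014) is an assumption made here, and the function names and input values are hypothetical.

```python
def gas_transfer_velocity(u10, schmidt=660.0):
    """Gas transfer velocity k in cm/h from 10 m wind speed u10 in m/s.

    Assumed quadratic wind-speed form: k = 0.251 * u10^2 * (Sc / 660)^-0.5.
    """
    return 0.251 * u10 ** 2 * (schmidt / 660.0) ** -0.5


def sea_air_flux(k_cm_per_h, solubility, fco2_sea, fco2_air):
    """Sea-air CO2 flux in mol m^-2 year^-1 (negative means ocean uptake).

    k_cm_per_h: gas transfer velocity in cm/h
    solubility: CO2 solubility K0 in mol L^-1 atm^-1
    fco2_sea, fco2_air: fugacity in µatm
    """
    delta = fco2_sea - fco2_air                      # delta (sea-air), µatm
    k_m_per_year = k_cm_per_h * 0.01 * 24 * 365      # cm/h -> m/year
    k0_per_uatm = solubility * 1000.0 * 1e-6         # mol/L/atm -> mol/m^3/µatm
    return k_m_per_year * k0_per_uatm * delta


# Hypothetical inputs, not values from the study.
k_example = gas_transfer_velocity(10.0)
flux = sea_air_flux(12.0, 0.03, 360.0, 400.0)
print(k_example, flux)
```

Because k grows with the square of wind speed in this parameterization, a month with a moderate delta ƒCO2 but strong winds can show a larger flux than a month with the most negative delta ƒCO2, which matches the April versus February contrast described above.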

Conclusions
This study estimated the surface seawater ƒCO2 in the East Sea of Korea using in situ measurements, GOCI satellite data and derived products, and HYCOM reanalysis data through MNR and two machine learning approaches (RF and SVR). Results show that RF generally performed better than MNR and SVR. RF effectively modeled both linear and nonlinear characteristics of surface seawater ƒCO2 through an ensemble approach. Ocean parameters (i.e., SSS, SST, and MLD) appeared to contribute more than the individual bands or band ratios from the satellite data. Since MLD controls the amount of carbon dioxide moving to the surface from the subsurface, it may be very useful for estimating surface seawater ƒCO2. It should be noted, however, that HYCOM MLD uses a fixed threshold of change in the vertical temperature profile to identify the depth.
The monthly surface seawater ƒCO2 maps for 2015 provided valuable information on the seasonal spatial variations of surface seawater ƒCO2 in the East Sea. Strong saturation of CO2 in the ocean was observed in summer because of increased SST. Surface seawater ƒCO2 was higher in fall than in spring because of a higher SST and deeper MLD [20]. The spatial distribution of surface seawater ƒCO2 in the East Sea showed a strong link with SST, SSS, and MLD in the GOCI-based estimation. Overall, the East Sea is a sink for atmospheric CO2, although some areas act as a weak CO2 source in summer.
Compared to the existing literature that used traditional regression models [8,10–12,16,17,19], this study estimated surface seawater ƒCO2 with two machine learning approaches, which are more complicated but more advanced than conventional statistical algorithms. In addition, using data from the world's first geostationary ocean color satellite (i.e., GOCI), we were able to improve the model performance: the much more frequent data collection by GOCI allowed us to use more samples temporally matched between in situ measurements and satellite-derived data. However, this study has several limitations, for which further research is needed. The in situ measurements used in this study cover only a few months in spring, summer, and fall. Longer time series should be investigated to make the model more robust and reliable. The uncertainty of satellite-derived products should also be reduced, especially near the coastal areas.

Figure 1. The study area of this research and monthly ƒCO2 distribution based on in situ observations. The red box represents the specific study area in the East Sea where the in situ measurements were conducted.
models, and (2) a greater temporal discrepancy between in situ measurements and satellite-derived products for MODIS-based models (in situ data from 9 a.m. to 4 p.m. vs. one MODIS image per day around 1:30 p.m.) (results not shown).

Figure 3. The comparison of SST and SSS between in situ measurements and HYCOM data.

Figure 4. Variable importance identified by random forest (RF). The increase of mean squared error (MSE) of RF was calculated using out-of-bag (OOB) data when a variable is perturbed. More explanation about the increase of MSE (%) is provided in Section 3.3.1.

Figure 5. The monthly surface seawater ƒCO2 produced using the RF model (i.e., GOCI CDOM, Chl-a, and band reflectance data and HYCOM MLD, SSS, and SST), HYCOM SST, SSS, and MLD, and GOCI Chl-a and CDOM distribution maps in 2015. No data pixels in white in the ocean were due to GOCI CDOM and Chl-a data. Please note that Korean Institute of Ocean Science and Technology (KIOST) often provides slot-uncorrected images in L1B band reflectance data, which resulted in an artifact line along longitude 130°E, which is shown in some of the ƒCO2 maps.

Figure 6. Scatterplots between in situ surface seawater ƒCO2 averaged by month and GOCI-derived monthly ƒCO2.

Figure 7. Monthly distributions of surface seawater ƒCO2, delta (sea-air) ƒCO2, and sea-air CO2 flux for the reference year 2015.

Table 1. Summary of the data used in this study.

Columns: In Situ Start Date; In Situ End Date; Latitude; Longitude; In Situ Products; GOCI c Products; HYCOM d Products; Number of In Situ Data Collected in the Ship; Number of Data Matched with Satellite g
There are no initial data because of instability of GPS; f There is no latter data because of equipment failure; g The number of data matched with GOCI satellite and HYCOM reanalysis data. The number is different to the number of samples used in machine learning approaches because, when multiple samples fall into one pixel, they were averaged.

Table 3. Specification of Geostationary Ocean Color Imager (GOCI) bands used in this study. These four bands were used for input parameters of machine learning approaches.

Table 4. Correlation coefficients between input parameters and in situ surface seawater ƒCO2.

Table 5. A summary of calibration and validation results of each model.

Validation columns: R²; RMSE a (rRMSE); Mean Bias; Cost Function
a root mean square error; b multi-variate nonlinear regression; c random forest; d support vector regression.

Table 6. Monthly mean value of surface seawater ƒCO2, delta (sea-air) ƒCO2, and sea-air CO2 flux in the whole study area in 2015.