Sea Surface Salinity Estimation and Spatial-Temporal Heterogeneity Analysis in the Gulf of Mexico

: As an important parameter to characterize physical and biogeochemical processes, sea surface salinity (SSS) has received extensive attention. Cubist is a data mining model, which can be well-suited to estimate and analyze SSS in the Gulf of Mexico (GOM) because it can reﬂect the SSS internal heterogeneity in the GOM—overall circular distribution, and the seasonality related to temperature and river discharge changes. Using remote sensing reﬂectance (Rrs) at 412, 443, 488 (490), 555, and 667 (670) nm and sea surface temperature (SST), a cubist model was developed to estimate SSS with high accuracy with the overall performance demonstrates a root mean square error (RMSE) of 0.27 psu and correlation coefﬁcient of 0.97 of R2. The model divides the GOM area according to model rules into four sub-regions, which include estuary, nearshore, and open sea, reﬂecting the gradient distribution of SSS. The division of sub-regions and seasonal changes can be explained by the distribution of water bodies, river discharges, and local wind forces since it is obvious that the estuary region reaches the largest low-value area and spreads eastward with the monsoon in the spring when the river ﬂow increases to the highest value. While the east to west wind in the non-summer monsoon period guides the plume westward, and the lowest river discharge in winter corresponds to the smallest low value area. After comparison with other statistical models, the cubist model showed satisfactory results in independent veriﬁcation of cruise data, proving the estimation capability under different geographical conditions (such as estuaries and open seas) and seasons. Therefore, considering high accuracy and heterogeneity mining, the cubist-based model is an ideal method for coastal SSS estimation and spatial-temporal heterogeneity analysis, and can provide ideas for model construction for coastal areas with similar geographic environments.


Introduction
Ocean salinity controls the dynamic and thermodynamic behaviors of seawater, plays a role as a key parameter in oceanic and climate studies, and its distribution provides significant information for studying physical and biochemical marine processes [1,2]. Several processes govern the evolution of salinity, such as evaporation, precipitation, river runoff, formation and melting of sea ice, and internal ocean dynamics such as circulation and mixing of water masses [3]. Thus, changes in salinity can be used to indicate freshwater input to coastal oceans and therefore understand many physical and biogeochemical processes in coastal waters [4,5].
In this study, we will use the cubist model to predict SSS of the GOM with higher accuracy, using Rrs and sea surface temperature (SST) as input variables. At the same time, it analyzes the temporal and spatial heterogeneity of salinity in coastal areas on the basis of model zoning. The model divides the GOM area based on variable differences rather than geographic locations for the first time. This study can contribute to proposing a model with general applicability to estimate SSS from satellites for coastal areas, and facilitate the selection of variables and forms of SSS model construction for different regions of the GOM.

Study Area
The Gulf of Mexico (GOM) possesses an outer shoreline from the Florida Keys to the northwest coast of Cuba, and is the ninth largest body of water in the world [36] ( Figure 1). The largest river system in North America (the MARS) comprises a complex estuary in the northern GOM [44]. The GOM is a shallow basin that holds approximately 2.5 million km 3 of water. The average water depth is 1615 m, with its deepest point at 4383 m. Approximately 38% of the Gulf is less than 20 m deep-mainly intertidal areas. The continental shelf and slope comprise approximately 42% of the Gulf, and abyssal areas cover approximately 20% [45].
Remote Sens. 2021, 13, x FOR PEER REVIEW 3 of 17 time, it analyzes the temporal and spatial heterogeneity of salinity in coastal areas on the basis of model zoning. The model divides the GOM area based on variable differences rather than geographic locations for the first time. This study can contribute to proposing a model with general applicability to estimate SSS from satellites for coastal areas, and facilitate the selection of variables and forms of SSS model construction for different regions of the GOM.

Study Area
The Gulf of Mexico (GOM) possesses an outer shoreline from the Florida Keys to the northwest coast of Cuba, and is the ninth largest body of water in the world [36]( Figure  1). The largest river system in North America (the MARS) comprises a complex estuary in the northern GOM [44]. The GOM is a shallow basin that holds approximately 2.5 million km 3 of water. The average water depth is 1615 m, with its deepest point at 4383 m. Approximately 38% of the Gulf is less than 20 m deep-mainly intertidal areas. The continental shelf and slope comprise approximately 42% of the Gulf, and abyssal areas cover approximately 20% [45].
Both local wind stress and river flow have greater impact on the plume near the estuary. In the spring/summer with high river discharge, the southerly/southeast wind promotes eastward spread of the Mississippi River plume, while in the fall/winter-the low-flow periodthe northeast/northern wind transports the river's fresh water westward [46,47].

Field Data
The in-situ measured salinity data used in this study were downloaded from Ocean Carbon Data Systems (OCADS, https://www.nodc.noaa.gov/ocads/) (access on 1 January 2021). These data were collected by twelve cruises and covered the entire 2018 year with properties, including SSS and SST. The SSS and SST data were obtained ~3 m below the sea surface using a CTD (SBE-38 or SBE-45, Seabird Inc.) integrated in the underway pCO2 system. Data used for independent verification were distributed in the coastal area, open sea, and west of the Florida Current, guaranteeing that the typical region can be well verified (Table 1). Table 1. Underway surface salinity measurements used for the development, validation and independent validation of the sea surface salinity(SSS) model. Independent validation cruises were listed in blue font. The cruise GU1609Leg1-3 Fall Pelagic Trawl/Acoustic Survey was listed as Both local wind stress and river flow have greater impact on the plume near the estuary. In the spring/summer with high river discharge, the southerly/southeast wind promotes eastward spread of the Mississippi River plume, while in the fall/winter-the low-flow period-the northeast/northern wind transports the river's fresh water westward [46,47].

Field Data
The in-situ measured salinity data used in this study were downloaded from Ocean Carbon Data Systems (OCADS, https://www.nodc.noaa.gov/ocads/) (access on 1 January 2021). These data were collected by twelve cruises and covered the entire 2018 year with properties, including SSS and SST. The SSS and SST data were obtained~3 m below the sea surface using a CTD (SBE-38 or SBE-45, Seabird Inc.) integrated in the underway pCO 2 system. Data used for independent verification were distributed in the coastal area, open sea, and west of the Florida Current, guaranteeing that the typical region can be well verified (Table 1).

Satellite Data
Daily standard NASA Moderate Resolution Imaging Spectroradiometer (MODIS/ Aqua) level-2 data products were downloaded from NASA Goddard Space Flight Center (http://oceancolor.gsfc.nasa.gov/) (access on 1 January 2021), including SST and spectral remote sensing reflectance (Rrs) between 412 and 678 nm. All data products have been validated by in-situ data in the study area to ensure the availability of data in the region. Images with quality level > 1 were discarded from the daily level-2 SST data. Discarding low quality images (quality level >1) from the daily level-2 SST data to guarantee data accuracy. Conjugate matching was processed between the remote sensing images and in situ measurement, and daily images were matched up with in situ underway measurements after discarding low quality data ( Figure 2). The spatial resolution of the field and satellite data was reprocessed to approximately 1 km for conjugate matching. As the same consideration of Chen and Hu [34], we chose five visible spectral bands (412, 443, 490, 555, and 670 nm) based on the exponential decay of CDOM absorption from the blue to the red color spectra. SST was used as a model input to capture the difference in temperature between the ocean and rivers.

Cubist
Cubist is an algorithm based on M5, which is similar to regression trees and can be used in spatial data mining [48]. As a data partitioning algorithm, it enables exploration of the nonlinear relationship in the observed data [49]. The predicted variable will be fitted by composed linear equations under rules generated by the cubist model, which is different from the CART regression tree model [50] in which its terminal nodes are not predictions [51]. The heterogeneity caused by predictor variables produces different data division conditions, which means "rules", and will be used to divide the total dataset [52]. For each sub-dataset corresponding to each rule, the cubist model creates linear equations for sub-datasets, respectively, with the form: if {condition} then linear model. By using the linear regression model at each terminal node, a prediction is made and is "smoothed" by considering the prediction in the previous node [53]. The cubist algorithm was implemented in R with the "caret" package in this study.

Other Compassion Methods
Multilinear regression uses linear equations to model the relationship between predictor variables and response variables. It is one of the most widely used methods to express how the response variable depends on multiple independent variables. Similar to MLR, the multiple nonlinear regression (MNR) method uses nonlinear equations, the nonlinear functions include exponential functions, logarithmic functions, power functions, etc., which is also a commonly used empirical method [54]. The MLR and MNR models in this study were implemented in Python with the sklearn package.
The support vector machine (SVM) method is a kernel-based method proposed by Vladimir Vapnik in 1995 [55]. SVM is a general term, possessing two sub-categories support vector classification (SVC) and support vector regression (SVR). SVR uses strips to cover the sample points. The points on the boundary and the points that violate the margin within the two boundaries are regarded as support vectors. The support vector will affect predictions, while the point weight of the non-support vector is zero [56]. In this study, we chose the typical kernel radial basis function (RFB) as the core and implemented the SVM model in Python with the sklearn package.
The multilayer perceptron neural network (MPNN) is a feedforward neural network developed in the Gulf of Mexico and consists of an input layer, a hidden layer and an output layer. The Levenberg-Marquardt optimization and Bayesian regularization algorithm were used for backpropagation. The change of neuron number in the hidden layer will affect the effect of MPNN, because the neuron number of input and output is unchanged [34]. In this study, we used Matlab (R2019a) to implement the construction of MPNN.

Cubist
Cubist is an algorithm based on M5, which is similar to regression trees and can be used in spatial data mining [48]. As a data partitioning algorithm, it enables exploration of the nonlinear relationship in the observed data [49]. The predicted variable will be fitted by composed linear equations under rules generated by the cubist model, which is different from the CART regression tree model [50] in which its terminal nodes are not predictions [51]. The heterogeneity caused by predictor variables produces different data division conditions, which means "rules", and will be used to divide the total dataset [52]. For each sub-dataset corresponding to each rule, the cubist model creates linear equations for sub-datasets, respectively, with the form: if {condition} then linear model. By using the linear regression model at each terminal node, a prediction is made and is "smoothed" by considering the prediction in the previous node [53]. The cubist algorithm was implemented in R with the "caret" package in this study.

Other Compassion Methods
Multilinear regression uses linear equations to model the relationship between predictor variables and response variables. It is one of the most widely used methods to express how the response variable depends on multiple independent variables. Similar to MLR, the multiple nonlinear regression (MNR) method uses nonlinear equations, the nonlinear functions include exponential functions, logarithmic functions, power functions, etc., which is also a commonly used empirical method [54]. The MLR and MNR models in this study were implemented in Python with the sklearn package.
The support vector machine (SVM) method is a kernel-based method proposed by Vladimir Vapnik in 1995 [55]. SVM is a general term, possessing two sub-categories support vector classification (SVC) and support vector regression (SVR). SVR uses strips to cover the sample points. The points on the boundary and the points that violate the margin within the two boundaries are regarded as support vectors. The support vector will affect predictions, while the point weight of the non-support vector is zero [56]. In this study, we chose the typical kernel radial basis function (RFB) as the core and implemented the SVM model in Python with the sklearn package.
The multilayer perceptron neural network (MPNN) is a feedforward neural network developed in the Gulf of Mexico and consists of an input layer, a hidden layer and an output layer. The Levenberg-Marquardt optimization and Bayesian regularization algorithm were used for backpropagation. The change of neuron number in the hidden layer will affect the effect of MPNN, because the neuron number of input and output is unchanged [34]. In this study, we used Matlab (R2019a) to implement the construction of MPNN.

Model Performance
When choosing the suitable rule of the cubist model for this study, we used root mean square error (RMSE) and correlation coefficient (R 2 ) to assess model performance. As seen in Figure 3, the lowest RMSE and highest R 2 appeared when the number of rules was four; the RMSE increased and the R 2 decreased when it exceeded four. This indicates that four rules should be selected owing to the high accuracy and simplicity of the model.

Model Performance
When choosing the suitable rule of the cubist model for this study, we used root mean square error (RMSE) and correlation coefficient (R 2 ) to assess model performance. As seen in Figure 3, the lowest RMSE and highest R 2 appeared when the number of rules was four; the RMSE increased and the R 2 decreased when it exceeded four. This indicates that four rules should be selected owing to the high accuracy and simplicity of the model. Eighty percent of the dataset was used to build the prediction model, and the rest was used to validate the model. Before dividing, the total dataset was sorted and grouped, and for each group, the data were randomly divided by 8:2. Then two sub-datasets with the same range were created by composing group data. The performance of the cubist model in both the training and validation datasets is shown in Figure 4, colored by data density. R 2 is 0.97 and 0.95 for model development and validation, with RMSE of 0.24 and 0.38 psu, respectively. The other standard statistical measures, mean bias (MB), and mean absolute error (MAE), were also used to compare the cubist model with other methods. igure 4. Cubist model performance in estimating SSS in both training (a), validation (b) and total dataset (c); data pairs re color coded by data density.
As presented in Table 2, the performance of training and validation was stable in total, meaning that there was no overfitting in each method. The machine learning approaches were better than the regression methods obviously, since the performance of MLR was the worst with RMSE of 1.00 psu and R 2 of 0.64 for training, and RMSE of 1.04 psu and R 2 of 0.63 for validation. Despite the poor performance of MLR, the correlation coefficient was still higher than 0.6, indicating that a linear relationship can explain part of the SSS, and the regression method can be used as a simple SSS estimation algorithm. As a regression Eighty percent of the dataset was used to build the prediction model, and the rest was used to validate the model. Before dividing, the total dataset was sorted and grouped, and for each group, the data were randomly divided by 8:2. Then two sub-datasets with the same range were created by composing group data. The performance of the cubist model in both the training and validation datasets is shown in Figure 4, colored by data density.

Model Performance
When choosing the suitable rule of the cubist model for this study, we used root mean square error (RMSE) and correlation coefficient (R 2 ) to assess model performance. As seen in Figure 3, the lowest RMSE and highest R 2 appeared when the number of rules was four; the RMSE increased and the R 2 decreased when it exceeded four. This indicates that four rules should be selected owing to the high accuracy and simplicity of the model. Eighty percent of the dataset was used to build the prediction model, and the rest was used to validate the model. Before dividing, the total dataset was sorted and grouped, and for each group, the data were randomly divided by 8:2. Then two sub-datasets with the same range were created by composing group data. The performance of the cubist model in both the training and validation datasets is shown in Figure 4, colored by data density. R 2 is 0.97 and 0.95 for model development and validation, with RMSE of 0.24 and 0.38 psu, respectively. The other standard statistical measures, mean bias (MB), and mean absolute error (MAE), were also used to compare the cubist model with other methods. As presented in Table 2, the performance of training and validation was stable in total, meaning that there was no overfitting in each method. The machine learning approaches were better than the regression methods obviously, since the performance of MLR was the worst with RMSE of 1.00 psu and R 2 of 0.64 for training, and RMSE of 1.04 psu and R 2 of 0.63 for validation. Despite the poor performance of MLR, the correlation coefficient was still higher than 0.6, indicating that a linear relationship can explain part of the SSS, and the regression method can be used as a simple SSS estimation algorithm. As a regression As presented in Table 2, the performance of training and validation was stable in total, meaning that there was no overfitting in each method. The machine learning approaches were better than the regression methods obviously, since the performance of MLR was the worst with RMSE of 1.00 psu and R 2 of 0.64 for training, and RMSE of 1.04 psu and R 2 of 0.63 for validation. Despite the poor performance of MLR, the correlation coefficient was still higher than 0.6, indicating that a linear relationship can explain part of the SSS, and the regression method can be used as a simple SSS estimation algorithm. As a regression method, the MNR was better owing to the capacity of simulating the inherent nonlinear relationship, with RMSE of 0.78 psu and R 2 of 0.78, and RMSE of 0.90 psu with R 2 of 0.72 for training and validation, respectively. SVM processes the stable performance showing the little difference between training and validation, the R 2 of 0.85 and RMSE of 0.38 psu for training, and R 2 of 0.84 with RMSE of 0.39 psu for validation. The MPNN model was also developed in the GOM, and our results were similar to those of [34], with R 2 of 0.86 and MB of 0.00 psu for training, and R 2 of 0.85 with MB of 0.00 psu for validation. The complexity of the ocean makes it impossible to directly estimate SSS with high accuracy through linear or simple nonlinear combinations. However, since the use of CDOM to estimate salinity is based on remote sensing mechanisms, simple statistical methods can still fit salinity data. The task of the SVM regression method is to use a fixed-width strip to cover more sample point as possible, making the total error smaller, so there are some limitations for complex data. The internal fitting of the neural network is more complicated, and the learning of data features makes it perform better on large data sets. Comparing all the machine learning methods, the cubist model provided outstanding performance with all indices significantly better than those of other methods. Although the improvement in R 2 was small, the RMSE decreased by nearly half compared with that of the MPNN model. The year's SSS map estimated by the cubist model is shown in Figure 5. It can be seen that the overall trend of SSS showed a concentric inward value increase. The influence of wind force and GOM dynamics affects the distribution of SSS values [57]. In the northern region, due to the physical mixing of river water with low salinity, SSS was lower throughout the year. In the open sea area, the SSS was usually higher because of minimal river influence.
In summary, the cubist model was superior to other methods in terms of relevance and accuracy. Considering the good interpretability of the cubist model, for it was not a "black box" and can provide the equation of each rule, the cubist model was the most favorable method for estimating SSS within the range 22-38 in the GOM.

Rule Accuracy Validation
The cubist model divided the dataset into sub-datasets based on conditions and generated an equation for each sub-dataset. Based on the rule partition of the cubist model, it can be seen that the inputs Rrs(412), Rrs(555), and SST, played an important role in dataset division, since the conditions were composed of these three parameters ( Table 3). The difference between rule 2 and 4 was the range of SST, while the range of Rrs(555) was the reference to divide rule 1 and 3. Contrary to the rule condition, most model inputs participated in the composition of linear equations of each rule, except Rrs(667). The intercepts of the four equations were very close, but the coefficient of each parameter had great differences.
It is obvious that rule 4 delivered the best performance in each index among the subdatasets when evaluating the accuracy of each rule-divided dataset, although it possessed the largest amount of data (Table 4). Rule 4 was also the only sub-dataset that did not underestimate SSS. Comparing rules 2 and 4 with rules 1 and 3, it is worth noting that using more data can often make the model more stable. The large difference between training and validation datasets appeared in rule 3; the validation performance was significantly worse than that of the training dataset, with RMSE of 0.69 and R 2 of 0.80.
1 Figure 5. Annual SSS map generated by the cubist model, averaged from daily composed image within the cruise date range.

Rule-Based GOM Partition
Based on the rule conditions, the cubist model divided the total GOM area into four parts, which can indicate the spatial heterogeneity of SSS. Incorporating the annual data of model inputs (Rrs(412), Rrs(443), Rrs(488), Rrs(555), Rrs(667), and SST), and conditions applied to region division, the distribution of the four partitions can be distinguished ( Figure 6). Due to the strong correlation with the surrounding freshwater environment, the analysis should be combined with the GOM water depth and wetland distribution ( Figure 1).
On the whole, region 1 was mainly distributed in the nearshore aera, while region 2 and region 3 were scattered in estuaries and nearshore places with water depths less than 5 m. Region 4 occupied the largest area of GOM, covering a large area of open sea. In general, this distribution of the sub-regions corresponded to water depth, the area with depth of less than 30 m comprised most of region 1. The other areas of region 1, such as a small area on the south side of the GOM and west side of Yucatan, were consistent with the distribution of water bodies, which has great influence on the zoning and was also reflected in the distribution of region 2. Most areas of region 2 mainly correspond to the estuary area, where the river water is mixed with seawater, and the salinity was different from that of region 1. Region 4 covered most of the GOM area-mainly open sea areas. Referring to the division rules, we can see that the temperature in region 2 was higher than that in region 4. This may be because a larger amount of data was distributed in summer, and high temperature can also be used to explain that other areas in region 2 are mainly in the Loop Current range, corresponding to warm Florida currents. Region 3 corresponds to very few areas, which may be due to the small number of samples in rule 3.

Rule-Based GOM Partition
Based on the rule conditions, the cubist model divided the total GOM area into four parts, which can indicate the spatial heterogeneity of SSS. Incorporating the annual data of model inputs (Rrs(412), Rrs(443), Rrs(488), Rrs(555), Rrs(667), and SST), and conditions applied to region division, the distribution of the four partitions can be distinguished ( Figure  6). Due to the strong correlation with the surrounding freshwater environment, the analysis should be combined with the GOM water depth and wetland distribution ( Figure 1). On the whole, region 1 was mainly distributed in the nearshore aera, while region 2 and region 3 were scattered in estuaries and nearshore places with water depths less than 5 m. Region 4 occupied the largest area of GOM, covering a large area of open sea. In general, this distribution of the sub-regions corresponded to water depth, the area with

Seasonal Variations of Surface Salinity
The salinity of the GOM presents a gradient distribution throughout the year, with low values in the nearshore and high values in open seas. The areas near fresh water vary greatly, and sea water with high temperature and high salinity brought by the Loop Current can also be observed (Figure 7). It can be seen that the most obvious change occurs in the estuary area, reflecting the seasonal cycle, which may be related to the discharge of rivers, since salinity is largely controlled by fluxes of freshwater into and out of the Gulf [58]. The river discharge data were acquired to analyze the influence of river discharge (Figure 8), downloaded from U.S. Army Corps of Engineers website. Mississippi River at Tarbert Landing (Gate ID 01100Q) and Atchafalaya River at Simmesport (Gate ID 03045Q) capturing the water flow from the total Mississippi-Atchafalaya basin. In addition, wind force is an important factor affecting the distribution of salinity. Combining the effects of wind and river flow, the discharge of the Mississippi River increases and the wind plume gradually guides eastward in the spring [59]. Therefore, the SSS values related to the river plume are significantly different than those of the surrounding waters, extending eastward from the mouth of the Mississippi River since March [46,47,60]. Since the monthly image is composed during the cruise period, the data during the maximum flow period in March are not well shown on the monthly map. However, the largest low-value salinity area shown in April indicates the impact of continuous large flow. Although the discharge began to decline in May, the SSS value in the estuary still maintained a low trend. A low SSS plume in the east of the estuary can be clearly observed due to the strong winds out of the south and west during the summer, which drives river plumes eastward [61]. The force of wind directed downcoast from east to west in non-summer periods, and areas with low SSS values were distributed westward due to the effect of wind [62]. At the same time, since the effect of river water is offset by ocean dynamics, the distribution of salinity in autumn is stable with little monthly change [39]. The generally high summer value in open sea areas would have continued throughout the fall, but due to a large amount of precipitation, the value dropped. In addition, as river runoff will reach its lowest annual value, the lowvalue areas of the estuary contracted in winter. It can be seen that the most obvious change occurs in the estuary area, reflecting the seasonal cycle, which may be related to the discharge of rivers, since salinity is largely controlled by fluxes of freshwater into and out of the Gulf [58]. The river discharge data were acquired to analyze the influence of river discharge (Figure 8), downloaded from U.S. Army Corps of Engineers website. Mississippi River at Tarbert Landing (Gate ID 01100Q) and Atchafalaya River at Simmesport (Gate ID 03045Q) capturing the water flow from the total Mississippi-Atchafalaya basin. In addition, wind force is an important factor affecting the distribution of salinity. Combining the effects of wind and river flow, the discharge of the Mississippi River increases and the wind plume gradually guides eastward in the spring [59]. Therefore, the SSS values related to the river plume are significantly different than those of the surrounding waters, extending eastward from the mouth of the Mississippi River since March [46,47,60]. Since the monthly image is composed during the cruise period, the data during the maximum flow period in March are not well shown on the monthly map. However, the largest low-value salinity area shown in April indicates the impact of continuous large flow. Although the discharge began to decline in May, the SSS value in the estuary still maintained a low trend. A low SSS plume in the east of the estuary can be clearly observed due to the strong winds out of the south and west during the summer, which drives river plumes eastward [61]. The force of wind directed downcoast from east to west in non-summer periods, and areas with low SSS values were distributed westward due to the effect of wind [62]. At the same time, since the effect of river water is offset by ocean dynamics, the distribution of salinity in autumn is stable with little monthly change [39]. The generally high summer value in open sea areas would have continued throughout the fall, but due to a large amount of precipitation, the value dropped. In addition, as river runoff will reach its lowest annual value, the low-value areas of the estuary contracted in winter. Along with the estuary area, other surrounding areas with water bodies, including wetlands and small rivers, are also affected by fresh water and present seasonality related to river flux. The seawater entering with the Loop Current is also well reflected, usually bringing hightemperature and high-salinity seawater and flowing out to the east of the GOM.

Model Evaluation for Various Cases
Cruise data distributed in different areas of the GOM were collected to assess the versatility of the cubist SSS model (Table 5). Since, on the one hand, the chosen independent cruises are located in different geographical locations, reflecting the spatial heterogeneity. On the other hand, the independent cruises covered different seasons, which reflects the temporal heterogeneity. For each cruise, the field-measured dataset was independent from the others, and none of these datasets were used in the model development above. The verification data covered much of the northern GOM and the southeast area near the Loop Current. As seen in the overall results, the model has good generalization ability with an RMSE of 1.64 psu, indicating good predictability in remotely estimating SSS both spatially and temporally. The results based on the underway SSS data collected from cruise GM0606 between June 6 and 11, 2006, are shown in Figure 9. This cruise was collected in the Mississippi-Atchafalaya coastal areas (Figure 9b). For the entire dataset, the RMSE was 1.88, with MB of 0.4 and mean ratio (MR) of 1.02. The variation of satellite SSS along the cruise track agreed well with the field-measured SSS with RMSE of 1.49, MB of −0.02, and MR of 1.0 when the value was higher than 30. However, the model showed higher uncertainties (RMSE = 2.96, MB = 2.04, and MR = 1.07) in the area close to the coastline, especially in Along with the estuary area, other surrounding areas with water bodies, including wetlands and small rivers, are also affected by fresh water and present seasonality related to river flux. The seawater entering with the Loop Current is also well reflected, usually bringing high-temperature and high-salinity seawater and flowing out to the east of the GOM.

Model Evaluation for Various Cases
Cruise data distributed in different areas of the GOM were collected to assess the versatility of the cubist SSS model (Table 5). Since, on the one hand, the chosen independent cruises are located in different geographical locations, reflecting the spatial heterogeneity. On the other hand, the independent cruises covered different seasons, which reflects the temporal heterogeneity. For each cruise, the field-measured dataset was independent from the others, and none of these datasets were used in the model development above. The verification data covered much of the northern GOM and the southeast area near the Loop Current. As seen in the overall results, the model has good generalization ability with an RMSE of 1.64 psu, indicating good predictability in remotely estimating SSS both spatially and temporally. The results based on the underway SSS data collected from cruise GM0606 between June 6 and 11, 2006, are shown in Figure 9. This cruise was collected in the Mississippi-Atchafalaya coastal areas (Figure 9b). For the entire dataset, the RMSE was 1.88, with MB of 0.4 and mean ratio (MR) of 1.02. The variation of satellite SSS along the cruise track agreed well with the field-measured SSS with RMSE of 1.49, MB of −0.02, and MR of 1.0 when the value was higher than 30. However, the model showed higher uncertainties (RMSE = 2.96, MB = 2.04, and MR = 1.07) in the area close to the coastline, especially in four locations (marked A, B, C, and D in Figure 9a,b). The SSS values in these areas were lower than 30 psu and were overestimated by the cubist model. This may be due to the mixing with Mississippi-Atchafalaya water, and the wind directed eastward in summer, making the low value of SSS in the east of the estuarine area difficult to model. However, except for the cubist method, the RMSE of other methods is greater than 3 psu where the salinity value is lower than 30 psu (Table 6). It indicates that the cubist model has a good estimation potential in river dominated nearshore areas in summer. four locations (marked A, B, C, and D in Figure 9a,b). The SSS values in these areas were lower than 30 psu and were overestimated by the cubist model. This may be due to the mixing with Mississippi-Atchafalaya water, and the wind directed eastward in summer, making the low value of SSS in the east of the estuarine area difficult to model. However, except for the cubist method, the RMSE of other methods is greater than 3 psu where the salinity value is lower than 30 psu (Table 6). It indicates that the cubist model has a good estimation potential in river dominated nearshore areas in summer.  The validation result based on one cruise dataset (GU1609 Leg 1-3 Fall Pelagic Trawl/Acoustic Survey) collected in the northeastern GOM between September 2 and October 1, 2016 was shown in Figure 10. A comparison of field and satellite SSS was shown in Figure 10a. Although there were more overestimated values, mainly because a large number of the points were distributed in the open sea, the overall performance was good with an RMSE of 1.65 psu. There was no need for concern with the overestimation of the results with SSS > 30 (RMSE = 1.53, MB = 0.09, and MR = 1.0). The effect of seawater and freshwater mixing was well reflected in the results with overestimated values occurring in the areas near the water bodies (locations marked B and C in Figure 10a,b), and in the coastal area where there is no fresh water influence, underestimated values occurred (locations marked A and D in Figure 10a,b). The advantages of the cubist model in nearshore estimation were still reflected, but a large number of overestimated points in open sea areas makes the overall RMSE results not very prominent. Combined with the results of  The validation result based on one cruise dataset (GU1609 Leg 1-3 Fall Pelagic Trawl/Acoustic Survey) collected in the northeastern GOM between September 2 and October 1, 2016 was shown in Figure 10. A comparison of field and satellite SSS was shown in Figure 10a. Although there were more overestimated values, mainly because a large number of the points were distributed in the open sea, the overall performance was good with an RMSE of 1.65 psu. There was no need for concern with the overestimation of the results with SSS > 30 (RMSE = 1.53, MB = 0.09, and MR = 1.0). The effect of seawater and freshwater mixing was well reflected in the results with overestimated values occurring in the areas near the water bodies (locations marked B and C in Figure 10a,b), and in the coastal area where there is no fresh water influence, underestimated values occurred (locations marked A and D in Figure 10a,b). The advantages of the cubist model in nearshore estimation were still reflected, but a large number of overestimated points in open sea areas makes the overall RMSE results not very prominent. Combined with the results of MB, the estimation ability of the cubist model in the autumn nearshore was still satisfactory (Table 7). MB, the estimation ability of the cubist model in the autumn nearshore was still satisfactory (Table 7).  The results based on SSS data collected from cruise EQNX_20190209 in the southeastern GOM waters between February 9 and 16, 2019 can be seen in Figure 11. The fieldmeasurement SSS in this area was very stable, because in winter, there was no strong water mass mixing in the open sea areas. The model results were consistent with the previous performance in the open sea area, with primarily overestimated SSS (Table 4), but the most overestimated value did not exceed 0.5 psu. The RMSE of the result is 0.13, which is not significantly different from other machine learning methods results (Table 8)  The results based on SSS data collected from cruise EQNX_20190209 in the southeastern GOM waters between February 9 and 16, 2019 can be seen in Figure 11. The field-measurement SSS in this area was very stable, because in winter, there was no strong water mass mixing in the open sea areas. The model results were consistent with the previous performance in the open sea area, with primarily overestimated SSS (Table 4), but the most overestimated value did not exceed 0.5 psu. The RMSE of the result is 0.13, which is not significantly different from other machine learning methods results (Table 8), but it still proves the good prediction ability of the model in open sea areas. MB, the estimation ability of the cubist model in the autumn nearshore was still satisfactory (Table 7).  The results based on SSS data collected from cruise EQNX_20190209 in the southeastern GOM waters between February 9 and 16, 2019 can be seen in Figure 11. The fieldmeasurement SSS in this area was very stable, because in winter, there was no strong water mass mixing in the open sea areas. The model results were consistent with the previous performance in the open sea area, with primarily overestimated SSS (Table 4), but the most overestimated value did not exceed 0.5 psu. The RMSE of the result is 0.13, which is not significantly different from other machine learning methods results (Table 8)

Conclusions
In this study, we applied a cubist model to estimate SSS from MODIS images for the GOM area and obtained satisfactory performance with an RMSE of 0.38 psu and R 2 of 0.95. Through accuracy verification for each rule, we found that the linear equations given by the model have good accuracy in each region, which makes the overall salinity prediction more accurate. The model rules divided the GOM into four sub-regions, mainly including the continental shelf, estuary, and open sea area. The influences of water circulation and rivers and distribution of wetlands can be used to explain the rationality of model zoning. Seasonal changes are mainly affected by river discharge and local wind force, reflected in the area and direction of low-salinity-value plumes. Our model performs well in estimating salinity, and the auxiliary verification of other voyages proves the usability of the model under different geographical conditions. Additional estimation and research are needed to locally tune model parameters when extrapolating the model to other areas with similar environmental and geographic conditions. The model provides ideas for model construction in other coastal areas and also provides effective information for explaining the spatial heterogeneity of salinity in coastal areas and exploring seasonal changes in salinity.