Statistical Validation of MODIS-Based Sea Surface Temperature in Shallow Semi-Enclosed Marginal Sea: A Comparison between Direct Matchup and Triple Collocation

Validating remotely sensed sea surface temperature (SST) is a fundamental step in establishing reliable biological/physical models that can be used in different marine applications. Mapping SST using accurate models would assess in understanding critical mechanisms of marine and coastal zones, such as water circulations and biotic activities. This study set out to validate MODIS SSTs with a spatial resolution of 1-km in the Arabian Gulf (24–30° N, 48–57° E) and to assess how well direct comparison of dual matchups and triple collocation analyses perform. For the matchup process, three data sets, MODIS-Aqua, MODIS-Terra, and iQuam, were co-located and extracted for 1-pixel box centered at each actual in situ measurement location with a time difference window restricted to a maximum of ±3 h of the satellite overpass. Over the period July 2002 to May 2020, the MODIS SSTs (N = 3786 triplets) exhibited a slight cool night-time bias compared to iQuam SSTs, with a mean ± SD of −0.36 ± 0.77 °C for Aqua and −0.27 ± 0.83 °C for Terra. Daytime MODIS SST observations (N = 5186 triplets) had a lower negative bias for both Aqua (Bias = −0.052 °C, SD = 0.93 °C) and Terra (Bias = −0.24 °C, SD = 0.90 °C). Using extended triple collocation analysis, the statistical validation of system- and model-based products against in situ-based product indicated the highest ETC-based determination coefficients (ρt,X2 ≥ 0.98) with the lowest error variances (σε2 ≤ 0.32), whereas direct comparison underestimated the determination coefficients and overestimated the error estimates for all MODIS algorithms. The ETC-based error variances for MODIS Aqua/Terra NLSSTs were 0.25/0.19 and 0.26/0.32 in daytime and night-time, respectively. In addition, MODIS-Aqua was relatively more sensitive to the SST signal than MODIS-Terra at night and vice versa as seen in the unbiased signal-to-noise ratios for all observation types.


Introduction
Sea surface temperature (SST; see Table 1 for a list of abbreviations used in this paper) is an important component in marine ecosystems. It is measured commonly as an indicator of marine pollution, upwellings, and primary productivity in water bodies [1]. SST at the sea-atmosphere interface plays a key role in the cycling of compounds between the sea and the atmosphere [2,3]. Furthermore, sea temperatures of the near-surface layer are routinely used to provide an accurate estimation of surface flux, which often is of prime importance in energy exchange at the air-sea interface [4]. Such information is needed for numerical modeling of weather forecasting applications and ocean-atmosphere systems [5][6][7].
Sea temperatures are either obtained directly from floating or moored instruments, e.g., research vessels, offshore platforms, ships, and buoys, or derived remotely by satellites. Drifting buoys and ships measure the "bulk" sea temperature at few meters below the water's surface, while satellite sensors such as MODerate-resolution Imaging Spectroradiometer (MODIS) measure the "skin" (<1 mm) SST. The MODIS sensor was launched onboard the Terra and Aqua satellite systems in December 1999 and May 2002, respectively. Since then, SSTs derived from MODIS are not only used because of their large spatial coverage over long time spans, but also because of their near real-time (NRT) data and imagery. In general, MODIS-based SST is estimated from the brightness temperature (T B ), which is derived from the radiance (L b ) that leaves an extremely thin water surface [8]. Infrared algorithms covert the T B to skin SST using different spectral bands. Like most algorithms that rely on empirical coefficients, MODIS-standard algorithms are tuned based on regression analysis using buoys with thermometers and ship-based infrared radiometers [9] and are subjected to several continuous quality assurance practices to verify the sensor performance and report any large variations between satellite and in situ data. The empirical coefficients of SST algorithms are regularly reprocessed and improved by the Ocean Biology Processing Group (OBPG) [10]. Nevertheless, recent studies have demonstrated enhancement in the satellite SST retrievals with the use of deterministic methods, and principally, for MWIR measurements [11,12] compared with the traditional regression-based methods. Furthermore, several researchers working in different institutes have developed additional MODIS global algorithms [13], as well as region-specific algorithms [14][15][16] in order to better understand the problems that lead to errors in the SST retrieval.
A primary concern of the retrieval of remotely sensed SST data is their accuracy, especially in regions where SST determinations are expected to be less accurate than those obtained globally through the global buoys network (i.e., in situ observation data are sparsely distributed). As the satellite sensor measures the skin temperature of sea surface, satellite SST data are highly affected by the skin effect and diurnal warming [17]. Moreover, SST signal may be contaminated with clouds and atmospheric aerosols that can underestimate the SST retrieval [18]. For this reason, accurate mapping of satellite SST observations must be carefully considered and requires certain validation metrics (i.e., bias, uncertainty and precision) that assess the quality and the uncertainty fields. One of the most frequent statistical validation/calibration approaches is through conventional comparison (i.e., by regression analysis) of satellite data against in situ measurements, which could be collected from a fixed validation sites or with regular bulk measurements [19]. Several previous papers comparing the accuracy and precision of MODIS SST using the conventional comparison are given in Table 2. However, in fact, this approach is highly affected by representativeness errors [20] that are associated with sampling and calibration techniques, and therefore, they are difficult to account [21]. To minimize the effect of representativeness errors on validation metrics, triple collocation (TC) three-way error analysis is employed as a powerful technique. The advantage of the TC technique is that it estimates the uncertainties in independent co-located data sets more robustly [20] without the need for a ground truth data [22,23]. Even though TC was initially developed for wind products [22], it has also been applied to other data sets such as SST [24][25][26][27][28][29][30], sea surface salinity (SSS) [31], soil moisture [32][33][34][35], precipitation [36], and surface albedo [20].
There have been several studies validating satellite SSTs using TCA. For the global validation, results from earlier studies demonstrate a strong and consistent association between satellite-derived and in situ SSTs. Saha et al. [24] investigated globally the accuracy of Pathfinder SST retrievals by using Advanced Along-Track Scanning Radiometer (AATSR) and buoy data at day and night. They observed that the RMSE error of Pathfinder SST is in the range of 0.31 to 0.37 K with a corresponding unbiased signal to noise ratio of~0.98. However, the RMSE (true variability), when compared to SST derived from AATSR, were in general smaller than for those derived from Pathfinder. In the same vein, O'Carroll et al. [26] validated infrared atmospheric sounding interferometer (IASI) SSTs using buoys and AVHRR and found smaller cool errors from the drifting buoys (mean bias = -0.32 K), when compared to SST from AVHRR (mean bias = -0.35 K). After applying the TCA, they found that AVHRR SST exhibited a smaller standard deviation of errors than that of IASI and drifting buoys. A key study comparing infrared AATSR and Advanced Microwave Scanning Radiometer-Earth Observing System (AMSR-E) SST products against in situ buoy data is that of O'Carroll et al. [25], in which the environmental parameters and uncertainty characteristics were evaluated comprehensively. AMSR-E SST was found to have a standard deviation of error of 0.42 K, whereas AATSR and buoy SSTs were shown to have a smaller standard deviation of error of 0.16 K and 0.23 K, respectively. To further investigate the role of error between microwave and infrared SST products, Gentemann's [27] three-way error analysis showed that the errors of SST retrievals for AMSR-E v7 (SD = 0.28 • C) are smaller than that of MODIS (SD = 0.38 • C), whereas buoy SSTs exhibited the smallest error (SD = 0.20 • C). Herein, it is important to mention that microwave-based SSTs are not widely used, and especially, in coastal zones because of the sidelobe contamination near land [37][38][39]. Although extensive validation studies have been carried out in the Arabian Gulf on MODIS SST [40,54] and AVHRR SST [57,58], no single study exists which independently assess the uncertainties of a set of co-located data sets. Furthermore, previous studies were limited to constrained areas (i.e., not representative for the entire region) where the number of matchups was relatively small, and hence, caution must be applied as the error estimates might be biased.
The current study presents statistical validation of SST products using two approaches which are (a) dual comparison and (b) extended triple collocation. There are two primary aims of this study: 1. To analyze and understand the nature of the underlying uncertainties involved in MODIS SST data sets and consequently their uses and applications in the Gulf, and 2. To compare the performance of the two methods of validation at day and night. In this paper, we will use a set of co-located and concurrent SST data collected The paper is structured as follows: Section 2 provides brief characteristics of the study area; Section 3 describes the data sources; Section 4 explains the matchup protocol and the methodology followed to conduct the study; Section 5 draws together the validation results and the error assessment between in situ and satellite data in detail; ending with conclusions from this study in Section 6.

Study Area
The Arabian Gulf is a semi-enclosed marginal sea located between 24 • N and 30 • N latitude and 48 • E and 57 • E longitude ( Figure 1) and has a surface area and a total volume of approximately 238,797 km 2 and 6000 km 3 , respectively. The maximal length and width are 1000 km and 370 km, respectively, with an average depth of 35 m in the Arabian side and 60 m in the Iranian side [59]. It is connected to the Gulf of Oman through Hormuz Strait where the depth exceeds 100 m. Gulf sea temperature is characterized with wide fluctuations. For instance, subsurface water temperature in winter reaches around 20 • C and elevates in summer to 33 • C ( Figure 2). A major oceanographic fieldwork investigation was carried out in the Gulf and Sea of Oman during the Mt Mitchell Expedition from February to June 1992 [60]. By far, the most detailed account of environmental characteristics of the Arabian Gulf is found in the work of ROPME [61], B. H. Purser and Seibold [62], and Walters and Sjoberg [63].  . The section distance along the Gulf (from left to right) starts from the north west (red dot) to the south east of the Arabian Gulf (see Figure 1). The solid black line represents the contour line of water temperature. The color bar on the right depicts the water temperature in Celsius. Profile data for in situ measurements from Boyer et al. [64], and for bathymetry from GEBCO 2014.

Level-2 MODIS SST Products
MODIS passes the earth every one to two days with a swath width of 2330 km and acquires data in 36 spectral bands. The spectral bands of 11-12 µm and 3.7-4 µm are used mainly for skin SST estimation at a depth of ~10-20 µm [65], with a low noise-equivalent temperature difference (NE∆T) of 0.05 K for the bands between 11 µm and 12 µm (31 and 32) and 0.05-0.07 K for the bands located near 4 µm (20, 22 and 23) [9]. SST satellite images acquired daily from MODIS, after the processing chain at the Distributed Active Archive Center (DAAC), have a spatial resolution of ~1 km, comparable to those from SeaWiFS and AVHRR [66].  . The section distance along the Gulf (from left to right) starts from the north west (red dot) to the south east of the Arabian Gulf (see Figure 1). The solid black line represents the contour line of water temperature. The color bar on the right depicts the water temperature in Celsius. Profile data for in situ measurements from Boyer et al. [64], and for bathymetry from GEBCO 2014.

Level-2 MODIS SST Products
MODIS passes the earth every one to two days with a swath width of 2330 km and acquires data in 36 spectral bands. The spectral bands of 11-12 µm and 3.7-4 µm are used mainly for skin SST estimation at a depth of~10-20 µm [65], with a low noiseequivalent temperature difference (NE∆T) of 0.05 K for the bands between 11 µm and 12 µm (31 and 32) and 0.05-0.07 K for the bands located near 4 µm (20, 22 and 23) [9]. SST satellite images acquired daily from MODIS, after the processing chain at the Distributed Active Archive Center (DAAC), have a spatial resolution of~1 km, comparable to those from SeaWiFS and AVHRR [66].
To compare satellite-derived SSTs with field-measured sea temperatures, daily Level-2 (R2019.0 reprocessing version) MODIS-Aqua and MODIS-Terra data covering the entire missions were obtained from National Aeronautics and Space Administration (NASA) Ocean Color website (https://oceancolor.gsfc.nasa.gov (accessed on 20 May 2020)). Both mid-wave infrared (MWIR, λ = 3.7-4.2 µm) and long-wave, thermal, infrared (LWIR, λ = 10-12 µm) SST radiations were processed. The short-wave SST retrieval algorithm computed the night-time SST, commonly referred to as the "SST4", using the bands 22 and 23 as follows [10]: where c 1 , c 2 , c 3 , and c 4 are algorithm coefficients for the MODIS 3-4 µm, derived using the match-ups between the satellite brightness temperature (T B ) and in situ sea surface temperature, T 3.9 and T 4.0 are the brightness temperatures at 3.959 and 4.050 µm, respectively, and θ is sensor (satellite) zenith angle. The main advantage of using the mid-range window is the higher accuracy in deriving the SST than that at the long-wave radiation. However, the use of SST4 algorithm is limited to night-time use only owing to the sea signal contamination caused by the reflection of solar radiation in the daytime [9]. Modified version of nonlinear SST (NLSST) algorithm of Walton et al. [67] was applied to return the daytime and night-time long-wave sea skin temperature, commonly referred as "SST", using the bands 31 and 32, which is: where c 1 through c 4 are the MODIS 11-12 µm SST algorithm coefficients, T 11 and T 12 are the brightness temperatures in the 11 and 12 µm bands, respectively, and T sfc is the reference (Reynolds) sea surface temperature [10,68,69]. The ground resolution generated from processing MODIS Level-1B (L1B) files to Level-2 (L2) files was set to 1 km (at nadir). Each pixel was assigned with 32 flags (e.g., land, atmospheric correction failure, atmospheric correction warning, shallow water, high solar zenith angle, and high glint determined) and with five numeric quality levels ranging from quality Level 0, being the highest quality, to quality Level 4, being the worst. The cloud screening for MODIS (R2019) data was done based on the recent alternating decision trees (ADtrees) classifier that has shown a major improvement in the SST retrievals over the previous recursive binary decision trees (BDtrees) classifier [70].
MODIS-Terra passes over the Arabian Gulf at around 10:30 AM local time (descending overpass), and 3 h later, MODIS-Aqua passes at around 1:30 PM local time (ascending overpass) [71]. A total number of 24,502 and 27,418 SST images of Aqua and Terra, respectively, were provided for the matchup process (excluding the number of SST4 images) ( Figure 3). An example of the Level-2 MODIS satellite image for 18 November 2018 can be seen in Figure 4a, with the corresponding quality levels in Figure 4b. Water 2021, 13, x FOR PEER REVIEW 8 of 27

In Situ Observations
Although in situ sea temperature profiles of the near-surface layer (~10 m depth) can vary significantly during daytime and night-time based on wind speed and thermal stratification [72], as well as surface heat flux and cloud conditions [73], subsurface SST (SSTdepth) can be provided as skin SST ( Figure 5) and validated against satellite-derived SST [72]. For the purpose of validation, in situ SST data (v2.10) were acquired from in situ SST quality monitoring (iQuam). The quality assurance of iQuam temperature measurements relies on the procedure documented by Xu and Ignatov [74]. The NetCDF4 data files were accessed and filtered programmatically using a self-written program in MATLAB©. In situ observations with quality Level 5 (i.e., data for high-accuracy applications) were considered valid and were retained for further analysis. The spatial distribution of surface and subsurface data over the Arabian Gulf ( Figure 6) was visualized using Ocean Data View (ODV) software [75]. Prior the matchup process (outlined in Section 4.1), valid in situ measurements were 161,201 recorded from January 2000 to May 2020 ( Figure  7). The majority of in situ observations were collected by ships (63.3%) followed by drifting buoys (36.7%). Although different platform types can introduce different degree of uncertainty [76,77], the effect of simultaneous error partitioned by the platform type was not investigated in this study. Further studies, which take this variable into account, will need to be undertaken.

In Situ Observations
Although in situ sea temperature profiles of the near-surface layer (~10 m depth) can vary significantly during daytime and night-time based on wind speed and thermal stratification [72], as well as surface heat flux and cloud conditions [73], subsurface SST (SST depth ) can be provided as skin SST ( Figure 5) and validated against satellite-derived SST [72]. For the purpose of validation, in situ SST data (v2.10) were acquired from in situ SST quality monitoring (iQuam). The quality assurance of iQuam temperature measurements relies on the procedure documented by Xu and Ignatov [74]. The NetCDF4 data files were accessed and filtered programmatically using a self-written program in MATLAB©. In situ observations with quality Level 5 (i.e., data for high-accuracy applications) were considered valid and were retained for further analysis. The spatial distribution of surface and subsurface data over the Arabian Gulf ( Figure 6) was visualized using Ocean Data View (ODV) software [75]. Prior the matchup process (outlined in Section 4.1), valid in situ measurements were 161,201 recorded from January 2000 to May 2020 ( Figure 7). The majority of in situ observations were collected by ships (63.3%) followed by drifting buoys (36.7%). Although different platform types can introduce different degree of uncertainty [76,77], the effect of simultaneous error partitioned by the platform type was not investigated in this study. Further studies, which take this variable into account, will need to be undertaken.

Match-Up Protocol
SST matchups of co-located Level-2 MODIS-Aqua and MODIS-Terra pixels with iQuam observations were concurrently extracted to be within 1 km distance, e.g., in the pixel, using SeaWiFS Data Analysis System (SeaDAS7.5.3). To minimize the errors associated with the temporal variation of SST, a conventional time window of ±3 h of the satellite overpass has been applied as recommended by many studies [27,78,79] and to maintain a sufficient number of triplets. Pixels assigned with "best and good" quality (Levels 0 and 1, respectively) were used because of their high certainty and stability. Data affected by: (i) atmospheric correction failure; (ii) cloud, sun glint, or stray light conditions; (iii) land; (iv) high solar and viewing zenith angles; or (v) large biases between retrieved and referenced temperatures [27], were excluded. Following the matchup stage, an N × 3 matrix of temporally and spatially co-located triplets from the three measurement systems was generated by returning the data common to a unique system ID, with no repetitions. While single in situ measurement can be co-located and coincident with multiple satellite pixels (i.e., usually happens if the in situ point is located at the overlapped edge between two successive satellite images), only one MODIS pixel is generated for each in situ data point. However, it is noteworthy here to mention that without excluding the duplicated satellite pixels, the effect of errors is negligible [77]. Match-up triplets obtained for the validation analysis were 5186 and 3943 during daytime and night-time, respectively, (Figure 8). Figure 9 shows the SST triplet locations in different years after the matchup protocol.

Match-Up Protocol
SST matchups of co-located Level-2 MODIS-Aqua and MODIS-Terra pixels with iQuam observations were concurrently extracted to be within 1 km distance, e.g., in the pixel, using SeaWiFS Data Analysis System (SeaDAS7.5.3). To minimize the errors associated with the temporal variation of SST, a conventional time window of ±3 h of the satellite overpass has been applied as recommended by many studies [27,78,79] and to maintain a sufficient number of triplets. Pixels assigned with "best and good" quality (Levels 0 and 1, respectively) were used because of their high certainty and stability. Data affected by: (i) atmospheric correction failure; (ii) cloud, sun glint, or stray light conditions; (iii) land; (iv) high solar and viewing zenith angles; or (v) large biases between retrieved and referenced temperatures [27], were excluded. Following the matchup stage, an N × 3 matrix of temporally and spatially co-located triplets from the three measurement systems was generated by returning the data common to a unique system ID, with no repetitions. While single in situ measurement can be co-located and coincident with multiple satellite pixels (i.e., usually happens if the in situ point is located at the overlapped edge between two successive satellite images), only one MODIS pixel is generated for each in situ data point. However, it is noteworthy here to mention that without excluding the duplicated satellite pixels, the effect of errors is negligible [77]. Match-up triplets obtained for the validation analysis were 5186 and 3943 during daytime and night-time, respectively, (Figure 8). Figure 9 shows the SST triplet locations in different years after the matchup protocol. Water 2021, 13, x FOR PEER REVIEW 12 of 27

Direct Comparison
Since both SST variables from MODIS and iQuam contain error, model-II regression analysis, described by Legendre [80], was employed based on the major axis (MA) method (or principal axis) to calculate the slope ( ) of the regression line and the intercept ( ), such that

Direct Comparison
Since both SST variables from MODIS and iQuam contain error, model-II regression analysis, described by Legendre [80], was employed based on the major axis (MA) method (or principal axis) to calculate the slope ( ) of the regression line and the intercept ( ), such that

Direct Comparison
Since both SST variables from MODIS and iQuam contain error, model-II regression analysis, described by Legendre [80], was employed based on the major axis (MA) method (or principal axis) to calculate the slope (b MA ) of the regression line and the intercept (I), such that where y is the estimated variable (from the model) and x is the measured variable (in situ data). For model II regression, the squared Euclidean distances between the data points and the regression line are minimized. This minimization can be expressed theoretically as [21] wherex i andŷ i are the regressed values for both x and y, respectively, and σ 2 i is the uncertainty in the induvial measurements (x i and y i ).
The slope of the regression line is calculated according to [80] b MA = where s 2 x and s 2 y are the estimated variances of x and y, respectively, and s xy is their covariance. A closer b MA to one and a closer I to zero imply that the linear model compares well with in situ data. Type I least-square method tends to underestimate the slope of the linear line between the two variables [81]. However, from our trials assessing the error between the two regression models, we noticed that there is no significant difference affecting the RMSE between type I and type II models.
The correlation coefficient r (also called Pearson correlation coefficient) is calculated according to where N is the sample size and i is the match-up index. The correlation coefficient (r) assumes a normal distribution and ranges between −1.0 and 1.0.
To assess the average model-performance error [82], the root-mean-squared error (RMSE) is determined using where P and O represent, respectively, the predicted and observed SSTs from the two measurement systems. Conventional RMSE is affected by the mean bias error ∆. To overcome this shortcoming, we also calculated the unbiased (or centered) root-mean-square error (RMSE ub ) to quantify the non-systematic effects [83] according to

Extended Triple Collocation
In fact, the standard validation practices, such as linear regression, typically treats in situ observations as "true" values. However, this is not the case. Hence, we implemented the extended triple collocation (ETC) [84], which estimates the error variances of each measurement type. ETC uses the same assumptions as the triple collocation (TC) but provides an additional validation parameter, i.e., the coefficient of determination with respect to the unknown "true" value of the variable being measured (ρ 2 t,X i ). For any measurement system (X i ), the error model for the classical TC analysis can be expressed as where i ∈ {1, 2, 3} are three spatially and temporally co-located data sets (X 1 : systembased product, X 2 : model-based product, and X 3 : in situ-based product). t is the unknown true value (or true state); α i and β i are systematic additive and multiplicative biases of data set X i with respect to the true value, and ε i represents additive random error.
In this study, we adopted the principle of combinations of the covariances between the three measurement systems without rescaling [22,84]. These covariances are obtained through where σ 2 t = Var(t). The underlying assumptions for TC are: (i) the errors from the independent data sources have zero means (E(ε i ) = 0), (ii) the errors are uncorrelated with each other (Cov ε i , ε j = 0, f or i = j), and (iii) the errors are uncorrelated with t (Cov(t, ε i ) = 0). These covariences of the three measurement systems can then be written as where σ 2 ε i = Var(ε i ). We assert that for a set of data sets of dimension N × 3, there exists the 3 × 3 covariance matrix: The error variances for each observation type are obtained from the unique terms (Q 11, Q 12 , Q 13 , Q 22 , Q 23 , Q 33 ) as follow The RMSE could be determined from the square root of the error variace: The correlation coefficient, in terms of covariance values, between X and t is given by It is noteworthy that there is a sign ambiguity for the ρ t,X provided by the ETC; however, in practice, the ρ t,X always has a positive sign.
To describe the combined effect of the sensitivity of the measurement system (β i ), the variability of the true signal (σ t ), and the variability of the measurement error (σ ε ), the squared correlation coefficient (also referred to as unbiased signal-to-noise ratio SNR UB ) is computed according to

Systematic Errors
The bias (commonly referred to as mean bias error MBE) describes the over-or underestimated average error between two measurement systems according to The closer the bias ∆ is to a reference value (e.g., zero), the better the two measurement systems correspond with each other. The statistics on the mean differences and the standard deviations between the two MODIS sensors and iQuam SSTs are presented in Table 3. At night, MODIS-Terra and -Aqua NLSSTs are cooler than iQuam SSTs by, respectively, 0.27 • C and 0.36 • C. The standard deviation of their differences is at 0.77-0.83 • C. The night-time mean bias of MODIS-Aqua SSTs minus MODIS-Terra SSTs is very minimal. Compared to the NLSST algorithm, the mean biases (and the standard deviations) of SST4 algorithm are lower. This can be seen in the empirical probability density function (EPDF) of SST differences between MODIS and iQuam ( Figure 10). The relative frequency peak of the small differences (i.e., where values are concentrated) is much higher for SST4 than that of NLSST. In addition, the width of the probability density estimates for the SST4 algorithm is much narrower than the NLSST algorithm, which confirms its higher reliability. The daytime mean bias ± SD for MODIS-Aqua minus MODIS-Terra SSTs is 0.19 ± 0.66 • C. This apparent positive bias between Aqua and Terra is partly due to the overpass time of Aqua (~1:30 PM local time) which intensifies the stratification layer and increases MODIS-based SST [37,40,85]. Table 3. Statistical results of the match-up triplets between Aqua, Terra, and iQuam SSTs according to the time of retrieval. Values in parentheses are derived from using MODIS SST4 products.   The most prominent feature of the SST retrieval from both MODIS sensors is the negative mean bias. Several factors could explain this observation. Firstly, MODIS estimates the skin SST, and therefore, it is expected that MODIS SSTs are cooler than the bulk iQuam measurements. Moreover, as shown in Figure 11, the negative biases increase in their magnitude with the increase in measured SST. Secondly, the large negative SST biases are likely related to presence of water vapor and aerosols [86,87] and the difficult removal of cloud contamination with the cloud detection algorithm [42].

Data Independence
As stated earlier, the residuals in the three data sets should be independent of each other. However, this assumption does not ideally exist because of a combination of influences from the sensor, algorithm, and the cloud-mask that may simply stimulate some dependency between SST products [24]. Furthermore, even with different measurement techniques, an interaction between the measurement systems and the geophysical variable is noticed [88,89]. As an example, Saha et al.'s [24] three-way analysis found that the r 2 values of the residuals between AVHRR, AATSR, and drifters SST data were 0.255 and 0.338 for night-time and daytime, respectively, whereas in situ data from ships exhibited a larger r 2 of~0.66. Yet, the ETC assumption was still valid, and the data sets were acceptable. We refer the reader to a recent study that addresses the possible violations of TC analysis [23].
Plotted in Figure 11 is the scatter plot of errors using the probability density estimate (PDE) for each point. The density plots of NLSST errors during night and day (Figure 12a,b) depict a quite low r 2 value of 0.549 and 0.547, respectively. It is evident that some association exists between the errors of MODIS-Aqua (system) versus MODIS-Terra (model), but the error pairs are considerably scattered around the regression line. For night-time observations, this interdependency was further reduced to~0.47 by combining short-wave SSTs (system-based product) from one satellite with long-wave SSTs (model-based product) from the other satellite, while the in situ-based product remained unchanged (Figure 12c,d). However, MODIS SST residuals derived from SST4 algorithm alone (Figure 12e) had the highest r 2 (=0.795). This is probably due to the higher transparency [9] with the lower variability [10] at the MWIR window (3.5-4.1 µm) than that at the LWIR window (10-12 µm). In addition, the mid-IR, despite the small signal, is much more sensitive to SST changes [10]. Hence, Aqua and Terra MODIS SST data from the SST4 algorithm (system-and model-based products) were excluded from the ETC three-way analysis.

Data Independence
As stated earlier, the residuals in the three data sets should be independent of each other. However, this assumption does not ideally exist because of a combination of influences from the sensor, algorithm, and the cloud-mask that may simply stimulate some dependency between SST products [24]. Furthermore, even with different measurement techniques, an interaction between the measurement systems and the geophysical variable is noticed [88,89]. As an example, Saha et al.'s [24] three-way analysis found that the values of the residuals between AVHRR, AATSR, and drifters SST data were 0.255 and 0.338 for night-time and daytime, respectively, whereas in situ data from ships exhibited a larger of ~0.66. Yet, the ETC assumption was still valid, and the data sets were acceptable. We refer the reader to a recent study that addresses the possible violations of TC analysis [23].
Plotted in Figure 11 is the scatter plot of errors using the probability density estimate (PDE) for each point. The density plots of NLSST errors during night and day ( Figure  12a,b) depict a quite low value of 0.549 and 0.547, respectively. It is evident that some association exists between the errors of MODIS-Aqua (system) versus MODIS-Terra (model), but the error pairs are considerably scattered around the regression line. For night-time observations, this interdependency was further reduced to ~0.47 by combining short-wave SSTs (system-based product) from one satellite with long-wave SSTs (modelbased product) from the other satellite, while the in situ-based product remained unchanged (Figure 12c,d). However, MODIS SST residuals derived from SST4 algorithm alone (Figure 12e) had the highest (=0.795). This is probably due to the higher transparency [9] with the lower variability [10] at the MWIR window (3.5-4.1 µm) than that at the LWIR window (10-12 µm). In addition, the mid-IR, despite the small signal, is much more sensitive to SST changes [10]. Hence, Aqua and Terra MODIS SST data from the SST4 algorithm (system-and model-based products) were excluded from the ETC threeway analysis.

Direct Comparison versus Triple Collocation
To compare the validation efficacy of direct matchup versus triple collocation, two different scenarios were analyzed. Firstly, in situ data from iQuam were considered as the reference, and subsequently, MODIS SSTs from Aqua and Terra were compared against

Direct Comparison versus Triple Collocation
To compare the validation efficacy of direct matchup versus triple collocation, two different scenarios were analyzed. Firstly, in situ data from iQuam were considered as the reference, and subsequently, MODIS SSTs from Aqua and Terra were compared against the reference. Secondly, the three data sets from MODIS-Aqua, MODIS-Terra, and iQuam were validated simultaneously without choosing a reference system.
All linear regression models are seen to capture high amount of variability in the in situ SST data ( Figure 13). For both MODIS sensors, SST had a good agreement with in situ observations (r 2 > 0.94). MODIS-Aqua NLSST validation against iQuam data presented coefficients of determination r 2 of 0.9590 and 0.9430 for NSST and DSST models, respectively. MODIS-Terra NLSST showed a slightly similar r-squared values of 0.9532 and 0.9469 for NSST and DSST, respectively. SST4 algorithms had the highest r 2 values of 0.9728 and 0.9721 for Aqua and Terra, respectively. The MODIS r 2 values reported here are lower than those reported previously for MODIS [40,54] and for AVHRR [57,58] in the same study region.
For the full data set, the match-up pairs were close to the 1:1 (dashed red) line, except a few in situ points that are either overestimated or underestimated by the MODIS sensor. The MA linear fit (solid black line) slope of MODIS night-time NLSST is 0.98 for both Aqua and Terra. The MA slopes of SST4 algorithm were the closest to the unity. Nevertheless, the daytime SST slopes of MODIS-Aqua (= 0.97) and MODIS-Terra SST (= 0.96) were slightly lower than the unity.
Examination of Table 4 shows that, based on direct comparison, the RMSEs for NLSST and (in parenthesis for SST4) fell within the range of 0.85 • C (0.69 • C) during night-time and 0.93 • C during daytime. When systematic bias was removed from the conventional RMSEs, there was no indication that the unbiased RMSEs were significantly reduced for both MODIS sensors. In similar studies, the RMSEs of daytime SST matches were 0.53 • C (MODIS-Aqua) and 0.44 • C (MODIS-Terra) in the northern Gulf [40] and 0.7 • C (MODIS-Aqua) in Kuwait's waters [54]. These discrepancies presumably resulted from the difference in the number of matchups, the spatial-temporal sampling mismatch, the type of in situ measurements, and the difference between bulk and in situ temperatures. 0.9728 and 0.9721 for Aqua and Terra, respectively. The MODIS values reported here are lower than those reported previously for MODIS [40,54] and for AVHRR [57,58] in the same study region.
For the full data set, the match-up pairs were close to the 1:1 (dashed red) line, except a few in situ points that are either overestimated or underestimated by the MODIS sensor. The MA linear fit (solid black line) slope of MODIS night-time NLSST is 0.98 for both Aqua and Terra. The MA slopes of SST4 algorithm were the closest to the unity. Nevertheless, the daytime SST slopes of MODIS-Aqua (= 0.97) and MODIS-Terra SST (= 0.96) were slightly lower than the unity. Examination of Table 4 shows that, based on direct comparison, the RMSEs for NLSST and (in parenthesis for SST4) fell within the range of 0.85 °C (0.69 °C) during nighttime and 0.93 °C during daytime. When systematic bias was removed from the conventional RMSEs, there was no indication that the unbiased RMSEs were significantly reduced for both MODIS sensors. In similar studies, the RMSEs of daytime SST matches were 0.53 °C (MODIS-Aqua) and 0.44 °C (MODIS-Terra) in the northern Gulf [40] and 0.7 °C (MODIS-Aqua) in Kuwait's waters [54]. These discrepancies presumably resulted from the difference in the number of matchups, the spatial-temporal sampling mismatch, the type of in situ measurements, and the difference between bulk and in situ temperatures.
Comparing the two validation approaches, it can be seen that our ETC-based RMSE results differ significantly from the results reported earlier by direct comparison. The calculated RMSEs from ETC for Aqua and Terra are in the region 0.18 °C to 0.26 °C for SST4  Comparing the two validation approaches, it can be seen that our ETC-based RMSE results differ significantly from the results reported earlier by direct comparison. The calculated RMSEs from ETC for Aqua and Terra are in the region 0.18 • C to 0.26 • C for SST4 and 0.51 • C to 0.57 • C for NLSST (cases 1 and 2). During daytime (case 3), MODIS-Terra NLSST (model-based SST product) had slightly lower ETC-based RMSE (0.44 • C) than that of MODIS-Aqua (0.50 • C). It is clear that using in situ data as a reference increases the error estimates of MODIS SST by some factor (higher by~1.5-3 times than ETC-based RMSE). Even after removing the effect of the long-term mean bias from the conventional RMSE, the difference between ETC-based RMSE and conventional unbiased RMSE is high as the ETC approach is much less influenced by the representativeness error [20,90]. Nevertheless, the difference in the squared correlation coefficient between direct comparison and ETC is very minimal. Overall, ETC has the largest squared correlation coefficient compared to direct comparison approach (higher by~0.02 during night and~0.04 during day).
Focusing on MODIS ETC estimates, we see that, for both Aqua and Terra, the error variances are low, and the unbiased signal-to-noise ratios are high. For SST triplets retrieved from Aqua and Terra, the very high SNR ub reported here confirm that the σ 2 ε values are reasonable and realistic [24]. At night (cases 1 and 2), Aqua showed a slightly higher sensitivity (lower σ 2 ε ) than Terra when the same algorithms were compared against each other. Conversely, Terra daytime SST observations (model) had a higher SNR ub and lower σ 2 ε than that of Aqua (system). This suggests that, at night, MODIS-Terra estimates are noisier than those of Aqua and vice versa. For the iQuam measurement system, however, the σ 2 ε is remarkably high for all cases, and especially during daytime. The correlations showed the opposite behavior, with the poorest SNR ub . As noted above, the observed increase in the iQuam uncertainty could be attributed to the spatial mismatch between the point scale of in situ measurements and the large scale of satellite data [91]. Additionally, it seems possible that the use of ship data in the SST triplets degraded the overall quality of iQuam system since these measurements have the lowest accuracy among other in situ platform types [77]. As compared to the results from Saha et al. [24], their error estimates for iQuam (although using AVHRR instrument for validation) are fairly consistent with our findings. For all data sets (MODIS-Aqua, MODIS-Terra and iQuam), lower error variance corresponds to a higher unbiased signal-to-noise ratio. This can be seen in Figure  14. Unlike the MODIS-Aqua NLSST results, appreciable differences were found for ETC estimates from MODIS-Terra NLSST and iQuam between day and night. Although the TC technique takes into account the long-term effect of errors [90], it is important to bear in mind the possible bias in these findings as they do not rule out the influence of imprecise assumptions in the TC error model [84] (as presented in Section 5.2).
can be seen in Figure 14. Unlike the MODIS-Aqua NLSST results, appreciable differences were found for ETC estimates from MODIS-Terra NLSST and iQuam between day and night. Although the TC technique takes into account the long-term effect of errors [90], it is important to bear in mind the possible bias in these findings as they do not rule out the influence of imprecise assumptions in the TC error model [84] (as presented in Section 5.2).

Conclusions
Accurate mapping of SST is essential for monitoring the status of marine environments and developing reliable climate models and data sets. The present paper attempted to assess the quality of three co-located SST data sets (i.e., MODIS-Aqua, MODIS-Terra, and iQuam). In general, a reasonably good agreement was achieved when SST estimated remotely is compared with that measured in the field. At night, the SST mean error for MODIS-Aqua was more negative, along with a smaller standard deviation compared to the MODIS-Terra SST and vice versa. From direct comparison of dual matchups, we found that the unbiased RMSEs for NLSST algorithm were very high during day (0.90-0.93 °C) and to a lesser extent during night (0.78-0.83 °C). This approach, however, was highly affected by the errors compared to the TCA. When NLSST was used as the model-based product, night-time SSTs acquired from the SST4 algorithm have the lowest ETC-based error variance (0.034 and 0.066 for Aqua and Terra, respectively) and the highest determination coefficient (0.9977 and 0.9956 for Aqua and Terra, respectively). Conversely, we observed that the in situ data from the iQuam network, despite their high retrieval accuracy (QF = 5), suffered from the highest ETC-based error variance and the lowest determination coefficient due to the temporal and spatial representativeness errors.
It is believed that the selection of an appropriate validation approach depends greatly on the type of measurement system, the characteristics of random errors, and the practical considerations. Although regression analysis has successfully demonstrated the general agreement between satellite-derived and in situ SST observations, it has certain limitations in terms of error estimates. A reasonable approach to tackle this issue is to implement the three-way error analysis even though it requires several underlying assumptions such as the large number of matchups and the independence of the data sets. Figure 14. Extended triple collocation estimates of σ 2 ε and SNRub for SST triplets from Aqua, Terra, and iQuam during night-time and daytime. The colored bars are our ETC estimates, with colors corresponding to different cases as shown in the legend at top.

Conclusions
Accurate mapping of SST is essential for monitoring the status of marine environments and developing reliable climate models and data sets. The present paper attempted to assess the quality of three co-located SST data sets (i.e., MODIS-Aqua, MODIS-Terra, and iQuam). In general, a reasonably good agreement was achieved when SST estimated remotely is compared with that measured in the field. At night, the SST mean error for MODIS-Aqua was more negative, along with a smaller standard deviation compared to the MODIS-Terra SST and vice versa. From direct comparison of dual matchups, we found that the unbiased RMSEs for NLSST algorithm were very high during day (0.90-0.93 • C) and to a lesser extent during night (0.78-0.83 • C). This approach, however, was highly affected by the errors compared to the TCA. When NLSST was used as the model-based product, night-time SSTs acquired from the SST4 algorithm have the lowest ETC-based error variance (0.034 and 0.066 for Aqua and Terra, respectively) and the highest determination coefficient (0.9977 and 0.9956 for Aqua and Terra, respectively). Conversely, we observed that the in situ data from the iQuam network, despite their high retrieval accuracy (QF = 5), suffered from the highest ETC-based error variance and the lowest determination coefficient due to the temporal and spatial representativeness errors.
It is believed that the selection of an appropriate validation approach depends greatly on the type of measurement system, the characteristics of random errors, and the practical considerations. Although regression analysis has successfully demonstrated the general agreement between satellite-derived and in situ SST observations, it has certain limitations in terms of error estimates. A reasonable approach to tackle this issue is to implement the three-way error analysis even though it requires several underlying assumptions such as the large number of matchups and the independence of the data sets.
We recommend the use of remotely sensed SST data retrieved over the Arabian Gulf, bearing in mind the uncertainties and the quality flags of each gridded SST image. In higherlevel marine models; however, MODIS data must be retrieved with caution when imprecise satellite-derived SSTs of~1 • C error are applied as input. Therefore, new refined/revised regional atmospheric correction model is well worth considering and could be developed to further improve the quality of satellite SST.
Future research might explore the errors caused by different in situ platform types in addition to the effect of several environmental conditions (e.g., atmospheric aerosol contamination, water vapor, and wind speed) on the quality of retrieved SST in different seasons.