Is Ocean Reflectance Acquired by Citizen Scientists Robust for Science Applications?

Abstract: Monitoring the dynamics of the productivity of ocean water and how it affects fisheries is essential for management. It requires data on proper spatial and temporal scales, which can be provided by operational ocean colour satellites. However, accurate productivity data from ocean colour imagery is only possible with proper validation of, for instance, the atmospheric correction applied to the images. In situ water reflectance data are of great value due to the requirements for validation and reflectance is traditionally measured with the Surface Acquisition System (SAS) solar tracker system. Recently, an application for mobile devices, “HydroColor”, was developed to acquire water reflectance data. We examined the accuracy of the water reflectance measures acquired by HydroColor with the help of both trained and untrained citizens, under different environmental conditions. We used water reflectance data acquired by SAS solar tracker and by HydroColor onboard the BC ferry Queen of Oak Bay from July to September 2016. Monte Carlo permutation F tests were used to assess whether the differences between measurements collected by SAS solar tracker and HydroColor with citizens were significant. Results showed that citizen HydroColor measurements were accurate in red, green, and blue bands, as well as red/green and red/blue ratios under different environmental conditions. In addition, we found that a trained citizen obtained higher quality HydroColor data especially under clear skies at noon.


Introduction
Improved understanding of the long-term spatio-temporal productivity of coastal oceans is of fundamental importance for the management of natural resources, especially fisheries [1]. However, the scale of ocean productivity data is generally poor due to constraints related to research ship-based data acquisition. Operational ocean colour satellites such as MODIS-AQUA, VIIRS, and Sentinel-3 can provide data at the required scale; however, data validation is required [2,3]. These satellites measure water reflectance at different wavelengths, which is the basis for the models used to derive concentration of chlorophyll, a proxy for the ocean's productivity [4]. Limitations on the use of satellite-derived reflectance measurements to retrieve accurate chlorophyll concentrations exist [5]. These limitations are generally a function of inaccurate atmospheric correction of the satellite measured signal and consequently inability to relate this signal to chlorophyll concentrations [6][7][8][9].
The successful use of satellite data requires in situ data to validate both the atmospheric correction and the chlorophyll derived concentrations [10]. Validation of atmospheric correction models generally makes use of data acquired with the AERONET (atmospheric transmittance data) and AERONET-OC (water reflectance data) sensors, which are globally distributed, but do not offer sufficient data in the mainland of British Columbia and Vancouver Island [46].
The water productivity of the Strait of Georgia (SoG) is profoundly affected by the outflow of the Fraser River, which generally peaks in June [47,48]. Phytoplankton abundance in the SoG varies seasonally; it is generally highest in the spring, followed by a weaker bloom in the fall [9,46]. This variability is associated with the Fraser River plume, tidal mixing, wind, and cloud cover [48]. Turbidity in the SoG has been found to be high at the surface, particularly in the spring and summer, coinciding with the Fraser River freshet [49]. In this region, measurements of above-water reflectance (R rs ) at visible and near infra-red wavelengths span a wide range, from 0.001 to 0.027 sr⁻¹ at 400-750 nm, due to the variable concentrations of water optical constituents [21]. Spectrally, Phillips and Costa [21] have shown that R rs in the SoG is typically low in the blue wavelengths, high in the green, and low again in the red wavelengths, which is typical of coastal water bodies [50][51][52].

Dataset
R rs , turbidity, and chlorophyll-α (Chlα) measurements were acquired along the Queen of Oak Bay ferry route between 1 July and 5 September 2016. R rs was derived from data collected by ferry passengers with the mobile application HydroColor, and an autonomous hyperspectral radiometer system, SAS solar tracker system.

Citizen HydroColor Measurements
Citizen water reflectance data (R rs,H ) were collected with HydroColor, an application developed by Leeuw and Boss [41] and available on both Android and iOS mobile devices. This application uses the camera to detect the light intensity coming from the water and sky, and then calculates the red, green, and blue reflectance (denoted as R rs,H (R), R rs,H (G), and R rs,H (B), respectively). HydroColor leverages the GPS, compass, and inclinometer in mobile devices to ensure users obtain precise measurements with an azimuth angle of 135° from the Sun and a zenith angle of 40° from the nadir, according to the optimal geometry defined by Mobley [53]. To measure R rs,H , the user takes a photo of an 18% photography grey card, a photo of the sky, and a photo of the water in the correct direction denoted by green arrows (illustrated in Figure 2). These photos are used to derive downward plane irradiance, sky radiance, and water-leaving radiance, respectively. From these measurements, water reflectances of red, green, and blue wavelengths are calculated, and turbidity (0-80 NTU), concentration of SPM (g/m³), and the backscattering coefficient in the red spectra (m⁻¹) are estimated based on Gordon et al. [54] and Neukermans et al. [55]. HydroColor measurements were acquired from July to September 2016 every Friday to Monday during two ferry routes: morning (8:30-10:10 a.m.) with a solar zenith angle range from 40.5° to 67.6° and noon (12:50-2:30 p.m.) with a solar zenith angle range from 17.4° to 39.3° [56]. Three iPad mini 4 (Apple Inc., Cupertino, CA, USA) tablets with the HydroColor application were used to acquire data, and two 18% photography grey cards were installed on the railing of the ferry deck for photo taking (Figure 3).
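As a rough illustration of how the three photos can be combined into a reflectance value, the following sketch assumes the usual above-water relation R rs = (L t − ρ L s ) / (π L c / R ref ), where L t , L s , and L c are relative radiances from the water, sky, and grey card photos, R ref = 0.18 is the card reflectance, and ρ = 0.028 is the constant sky-reflectance factor reported for HydroColor [41]. The function name and the input values are hypothetical, and the image processing performed by the application itself (exposure normalisation, pixel selection) is not reproduced.

```python
import math

GREY_CARD_REFLECTANCE = 0.18  # 18% photography grey card
RHO_SKY = 0.028               # constant sky-reflectance factor reported for HydroColor [41]


def hydrocolor_rrs(water, sky, card, rho_sky=RHO_SKY,
                   card_reflectance=GREY_CARD_REFLECTANCE):
    """Return remote sensing reflectance (sr^-1) for one colour channel.

    water, sky, card: relative radiances from the three photos, in the same
    arbitrary units (hypothetical inputs; exposure normalisation is assumed
    to have been applied already).
    """
    if card <= 0:
        raise ValueError("grey card radiance must be positive")
    # Assumption: the grey card behaves as a Lambertian reflector, so the
    # downwelling irradiance is estimated as E_d = pi * L_card / R_ref.
    downwelling_irradiance = math.pi * card / card_reflectance
    # Remove the skylight reflected at the surface from the water signal.
    water_leaving = water - rho_sky * sky
    return water_leaving / downwelling_irradiance


# Hypothetical relative radiances for the red, green, and blue channels.
photos = {
    "R": {"water": 0.020, "sky": 0.30, "card": 0.25},
    "G": {"water": 0.035, "sky": 0.35, "card": 0.27},
    "B": {"water": 0.030, "sky": 0.45, "card": 0.26},
}
rrs = {band: hydrocolor_rrs(**values) for band, values in photos.items()}
```

With these made-up inputs the green reflectance is the largest of the three, which mimics the spectral shape described for coastal waters, but the numbers carry no scientific meaning.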
As part of the BC Ferries and Parks Canada Coastal Naturalist Program, a biologist (the same person for the entire sampling period), called a Coastal Naturalist (CN), was trained to use the HydroColor application (trained citizen) before working with ferry passengers (FP, untrained citizens). For quality control, only adults (age 18 or older) were invited to collect data. The established protocol was that the CN gave a 25-min presentation prior to inviting FP to collect data, so a minimum level of information was provided to FP ( Figure 4). A total of 446 FP were involved in data collection resulting in a total of 1270 HydroColor samples acquired by CN and FP. HydroColor data and images were first sorted by environmental conditions: (1) cloud cover; and (2) solar zenith (route time); and subsequently by (3) level of training (Table 1). Cloud cover conditions affect the quality of the R rs data because the reflectance of clouds is brighter than the background sky and therefore produces cloud glitter effects [53]. The time effect was considered because, as the solar zenith angle becomes larger, there is less radiance that is backscattered from the ocean water [57]; consequently, a lower reflectance signal is expected to be detected in the morning. Subsequently, HydroColor data were classified into three categories according to photo quality within the white square ( Figure 5): perfect, good, and bad. A sample was classified as "perfect" if all three photos (water, sky, and grey card) had no shadows or other contaminants ( Figure 5a); as "good" if only one of them was less than 1/3 contaminated (Figure 5b); and as "bad" if more than one photo was contaminated, or if any of the water reflectance values were zero (Figure 5c). Typical reasons for bad photos were: (1) shade covering grey card or water; (2) white foam or sun glint in the water photo; and/or (3) the photo was entirely different from the object, for example a sky image captured for the photo titled "water". 
Conditions (1) and (2) are generally considered as main sources of contamination for R rs measurements [41,58,59]. Only perfect and good samples were considered in the statistical analyses.
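The classification rules above can be expressed as a small decision function. This is only a sketch: in the study the photos were inspected by eye, so the contaminated fractions used as input here are hypothetical, and the handling of a single photo with one third or more contamination (not covered explicitly by the rules) is an assumption.

```python
def classify_sample(contaminated_fraction, rrs_values):
    """Classify one HydroColor sample as 'perfect', 'good', or 'bad'.

    contaminated_fraction: dict mapping 'water', 'sky', and 'card' to the
        fraction (0-1) of the white square affected by shadow, foam, glint,
        or a wrong target (hypothetical input).
    rrs_values: the red, green, and blue reflectances of the sample.
    """
    if any(value == 0 for value in rrs_values):
        return "bad"  # a zero reflectance in any band
    contaminated = [f for f in contaminated_fraction.values() if f > 0]
    if not contaminated:
        return "perfect"  # all three photos are clean
    if len(contaminated) == 1 and contaminated[0] < 1 / 3:
        return "good"  # a single photo with less than 1/3 contamination
    # More than one contaminated photo. Assumption: a single photo with
    # 1/3 or more contamination is also treated as bad.
    return "bad"


example = classify_sample({"water": 0.2, "sky": 0.0, "card": 0.0},
                          (0.003, 0.005, 0.004))
```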

Autonomous SAS Solar Tracker Measurements
The SAS solar tracker was installed aboard the ferry to obtain precise R rs measurements for comparison with HydroColor. The system was mounted in the front of the upper deck of the vessel to avoid capturing the vessel structure or its shadow in the measurements [23]. This instrument has three sensors to simultaneously acquire high precision hyperspectral measurements of water-surface radiance, L t (λ), sky radiance, L s (λ), and sky irradiance, E s (λ), where λ is the wavelength from 350 to 800 nm, at a 3-nm resolution. The SAS solar tracker automatically adjusts its sensors' pointing angles to the correct geometry setting with respect to the vessel's heading angle and the local sun azimuth angle according to Hooker and Morel [60]. Measurements were acquired continuously when the ferry was travelling and the solar zenith angle was less than 60° [56].
The raw SAS solar tracker data were processed at 90-s intervals by the automated raw data processing software PySciDon [61], and the above-water reflectance R rs,SAS (λ) was calculated from L t (λ), L s (λ), and E s (λ) according to Ruddick et al. [62] (Equation (1)). In this calculation, ρ sky is the proportion of sky radiance that is reflected off the surface of the water; it depends on wind speed W and on the proportion of cloud cover in the sky radiance measurements (Equations (2) and (3)). A similar approach has been effectively used by other researchers in this region [9,14,21].
To allow statistical comparison with R rs,H , R rs,SAS (λ) was convolved into three broader spectral bands (red, green, and blue) using the spectral sensitivity function (SSF) of the iPad (Equation (4)). Here, R rs,SAS (i) is the water reflectance of the i th band, where i = R, G, B represents the red, green, and blue channels, respectively; SSF i (λ) is the spectral sensitivity function of the i th band; and R rs,SAS (λ) is the above-water remote sensing reflectance acquired by the SAS solar tracker and computed by Equation (1).
The SSF of the iPad mini 4 was not available from the manufacturer, and therefore the SSF of the iPhone 5 [41] was used instead after examining the two devices for interchangeability. HydroColor data acquired with the iPad mini 4 and the iPhone 5 under clear sky and green water conditions between 11:30 a.m. and 12:30 p.m. on 10 October 2016 were compared. The magnitude and variability of the mean difference in R rs,H between these devices were small, particularly for the red band (Table 2). The mean absolute percentage differences of the bands were relatively small compared with those of a similar study (69%, 67%, and 77% for red, green, and blue, respectively) [45], which compared the three bands generated by the iPhone 5 camera's SSF and by the CIE1931 two-degree colour matching function. Therefore, we assumed that using the iPhone 5 camera's sensitivity function as an alternative was acceptable when the actual function was unavailable.
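The two processing steps described in this section can be sketched as follows. The sketch assumes the standard above-water form R rs,SAS (λ) = (L t (λ) − ρ sky L s (λ)) / E s (λ) and treats the band convolution as an SSF-weighted mean on a uniform wavelength grid, R rs,SAS (i) = Σ SSF i (λ) R rs,SAS (λ) / Σ SSF i (λ). Both forms are assumptions made for illustration; the dependence of ρ sky on wind speed and cloud cover is not reproduced (it is passed in as a number), and all spectra and function names below are hypothetical.

```python
def rrs_sas(l_t, l_s, e_s, rho_sky):
    """Above-water reflectance per wavelength: (L_t - rho_sky * L_s) / E_s."""
    return [(lt - rho_sky * ls) / es for lt, ls, es in zip(l_t, l_s, e_s)]


def band_average(rrs, ssf):
    """SSF-weighted mean of a hyperspectral R_rs on a uniform wavelength grid."""
    if len(rrs) != len(ssf):
        raise ValueError("rrs and ssf must share the same wavelength grid")
    total_weight = sum(ssf)
    if total_weight == 0:
        raise ValueError("spectral sensitivity function is zero everywhere")
    return sum(r * w for r, w in zip(rrs, ssf)) / total_weight


# Hypothetical five-point spectra at 450, 500, 550, 600, and 650 nm.
l_t = [0.9, 1.2, 1.5, 0.8, 0.4]      # water-surface radiance
l_s = [10.0, 9.0, 8.0, 6.0, 5.0]     # sky radiance
e_s = [100.0, 110.0, 120.0, 115.0, 105.0]  # sky irradiance
ssf_green = [0.1, 0.6, 1.0, 0.3, 0.0]  # made-up green-channel sensitivity

spectrum = rrs_sas(l_t, l_s, e_s, rho_sky=0.028)
rrs_green = band_average(spectrum, ssf_green)
```

Normalising by the sum of the SSF keeps the band value on the same scale as the hyperspectral R rs , so a spectrally flat reflectance is returned unchanged.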

Statistical Analyses
Data analyses included examinations of: (1) whether environmental conditions and training affect HydroColor photos' quality; and (2) whether R rs,H is accurate compared to R rs,SAS and, if not, whether R rs,H can be corrected. We used RStudio version 1.0.136 [63] for all analyses.
First, we used contingency tables and Pearson's chi-squared tests to analyse the quality of HydroColor photos in relation to environmental conditions (cloud cover and sun zenith) and training, to obtain a preliminary understanding of citizen data quality and to assess whether these factors were significant. Second, we analysed the qualified R rs,H and R rs,SAS to assess whether the citizen R rs,H was accurate enough for scientific purposes under all environmental conditions. This analysis considered the R rs values for the blue, green and red bands and the ratios of any two bands' R rs (red/green, blue/green, and red/blue) for both SAS solar tracker (denoted as R rs,SAS (R/G), R rs,SAS (B/G), and R rs,SAS (R/B), respectively) and HydroColor (denoted as R rs,H (R/G), R rs,H (B/G), and R rs,H (R/B), respectively). Band ratios are generally used in ocean colour chlorophyll algorithms, especially the blue/green ratio [8,64]. All bands and band ratios were considered to be independent and identically distributed random variables.
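For readers unfamiliar with the first step, a minimal pure-Python sketch of Pearson's chi-squared test of independence is given below. It is not the R code used in this study, and the counts in the example table are invented for illustration. The p-value uses the closed-form survival function that holds for an even number of degrees of freedom, which covers the 2 × 3 and 3 × 3 tables considered here.

```python
import math


def chi_squared_test(table):
    """Pearson's chi-squared test of independence for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, p_value). The p-value is computed
    from the closed form of the chi-squared survival function, valid only for
    an even number of degrees of freedom; otherwise None is returned.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    p_value = None
    if dof > 0 and dof % 2 == 0:
        half = statistic / 2.0
        p_value = math.exp(-half) * sum(
            half ** k / math.factorial(k) for k in range(dof // 2)
        )
    return statistic, dof, p_value


# Hypothetical counts: rows = trained / untrained, columns = perfect / good / bad.
counts = [[60, 25, 15],
          [40, 35, 25]]
statistic, dof, p_value = chi_squared_test(counts)
```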
To assess the accuracy of R rs,H , we evaluated the significance level of mean differences between time-matched R rs,H and R rs,SAS . For this analysis, the differences between R rs,H and R rs,SAS were denoted as: d ijk = R rs,SAS,ijk − R rs,H,ijk , i = 1, ..., 6; j = 1, ..., 6; k = 1, ..., n j , where R rs,SAS,ijk and R rs,H,ijk are the above-water reflectance acquired by SAS solar tracker and HydroColor under the i th band or ratio and the j th environmental condition, respectively. The index i = 1, ..., 6 represents the red, green, and blue bands and red/green, red/blue, and blue/green band ratios, respectively; index j = 1, ..., 6 represents clear sky morning, cloudy morning, overcast morning, clear sky noon, cloudy noon, and overcast noon, respectively; and index k represents the sample number of the j th environmental condition. For each band or band ratio, we first tested the normality of d ij· , and subsequently a Monte Carlo permutation F test was used to examine whether the mean of d ij· was significantly different from zero under all environmental conditions [65]. The permutation test is a nonparametric statistical test that does not require distributional assumptions, such as the normality assumed by standard statistical methods [66].
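The exact permutation scheme is not detailed here, so the following sketch shows only one common way to build such a test and should not be read as the procedure of [65]. It assumes that, under the null hypothesis, the differences within each environmental condition are symmetric about zero, so their signs can be flipped at random; an F-type statistic comparing the group means with zero is then recomputed for each random sign assignment. The function names and the example values are hypothetical.

```python
import random


def f_statistic(groups):
    """F-type statistic for H0: every group mean equals zero."""
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    means = [sum(g) / len(g) for g in groups]
    # Mean square of the group means about zero.
    between = sum(len(g) * m * m for g, m in zip(groups, means)) / n_groups
    # Pooled within-group mean square.
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (n_total - n_groups)
    return between / within if within > 0 else float("inf")


def permutation_f_test(groups, n_permutations=4999, seed=0):
    """Monte Carlo sign-flip test; returns (observed statistic, p-value)."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    exceed = 0
    for _ in range(n_permutations):
        flipped = [[x if rng.random() < 0.5 else -x for x in g] for g in groups]
        if f_statistic(flipped) >= observed:
            exceed += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return observed, (exceed + 1) / (n_permutations + 1)


# Hypothetical differences d for two environmental conditions (sr^-1).
differences = [
    [0.004, 0.006, 0.005, 0.007, 0.003, 0.005],
    [0.002, 0.004, 0.003, 0.005, 0.004, 0.003],
]
f_observed, p_value = permutation_f_test(differences)
```

In this made-up example every difference is positive, so almost no random sign assignment reproduces a statistic as large as the observed one and the p-value is small.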
For those bands and band ratios for which the mean of d ij· was significantly different from zero, we built piecewise models with environmental condition variables (Equation (5)) to correct the R rs,H based on the corresponding R rs,SAS for future HydroColor data acquisition. The correction error was examined with the leave-one-out cross-validation method, the idea and algorithm of which are well documented [67,68]. The leave-one-out error was expressed as the root mean squared error, which has the same scale as R rs .
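The leave-one-out procedure can be illustrated with a deliberately simplified model. The sketch below fits a straight line with a single predictor and omits the environmental condition variables of the actual correction model; treating R rs,H as the predictor and R rs,SAS as the response is an assumption, and the data values are invented.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_rmse(x, y):
    """Refit the line n times, each time predicting the held-out sample."""
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        prediction = intercept + slope * x[i]
        squared_errors.append((y[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical blue/green ratios: HydroColor (predictor) and SAS (response).
ratio_h = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
ratio_sas = [0.62, 0.66, 0.75, 0.78, 0.88, 0.90]
loo_rmse = leave_one_out_rmse(ratio_h, ratio_sas)
```

Because each prediction is made for a sample that was not used in the fit, the resulting error is a less optimistic estimate than the residual error of a single fit to all samples.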

HydroColor Photos Quality
A total of 1270 HydroColor citizen samples were collected, in which the number classified as perfect, good, and bad samples were 791, 276, and 203, respectively. The distributions of perfect, good, and bad categories versus training, sun zenith, and cloud cover are shown in Tables 3-5, respectively.
The training effect was significantly associated with the quality of HydroColor data (χ² = 236.3, df = 2, p < 0.001). The CN (trained) acquired a significantly higher percentage of perfect data than the FP (untrained) (Table 3). We also observed that data acquired at noon (lower sun zenith) showed slightly higher quality than data acquired in the morning (higher sun zenith) (65% versus 58% in the "perfect" category) (Table 4). The association between HydroColor data quality and the time of day was also significant (χ² = 14.2, df = 2, p < 0.001). Moreover, the association between HydroColor image quality and cloud cover was significant (χ² = 48.5, df = 4, p < 0.001). Table 5 shows that clear conditions had a higher percentage of perfect data than cloudy or overcast conditions. The highest percentage of bad data (28%) occurred under overcast conditions, because most of the R rs,H values (58 samples, 89% of the bad data) were zero under this condition.

Accuracy and Correction of R rs,H
For the perfect or good categories, R rs,SAS were first matched with the corresponding R rs,H data by time stamps; a total of n = 214 matches were considered in the analysis. Bad category R rs,H were removed from analyses.
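The matching rule is not specified beyond the use of time stamps. One plausible approach, sketched below, pairs each HydroColor sample with the nearest SAS record and discards pairs that are too far apart in time. The 45-s tolerance (half of the 90-s SAS processing interval) and the function name are assumptions made for illustration.

```python
from bisect import bisect_left


def match_by_time(hydro_times, sas_times, tolerance_s=45.0):
    """Pair each HydroColor time with the nearest SAS time within a tolerance.

    hydro_times, sas_times: timestamps in seconds; sas_times must be sorted.
    Returns a list of (hydro_index, sas_index) pairs.
    """
    if not sas_times:
        return []
    matches = []
    for h_index, t in enumerate(hydro_times):
        pos = bisect_left(sas_times, t)
        # The nearest record is either just before or just after position pos.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sas_times)]
        nearest = min(candidates, key=lambda i: abs(sas_times[i] - t))
        if abs(sas_times[nearest] - t) <= tolerance_s:
            matches.append((h_index, nearest))
    return matches


# Hypothetical timestamps (seconds since the start of a ferry run).
sas_times = [0.0, 90.0, 180.0, 270.0]
hydro_times = [10.0, 100.0, 400.0]
pairs = match_by_time(hydro_times, sas_times)
```

In this example the third HydroColor sample has no SAS record within the tolerance and is left unmatched.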
The descriptive statistics for matched R rs,SAS and R rs,H show that the mean and standard deviation of all bands of R rs,H were slightly higher than the corresponding statistics of R rs,SAS (Table 6). The range and trend of both R rs,SAS and R rs,H were similar to those of another study in this region [21]. The nonparametric Monte Carlo permutation F tests were deemed appropriate because the normality tests indicated that the differences between R rs,SAS and R rs,H for the perfect and good categories were not normally distributed in most bands and band ratios. The permutation F tests showed that the mean difference was not significantly different from zero under any environmental condition for the red and green bands and for the red/green and red/blue ratios; for the blue band and the blue/green band ratio, however, the mean difference was significant (Table 7). In addition, R rs,H tended to overestimate the R rs signals for all bands compared to R rs,SAS , although the overestimations were low (Table 7). When only perfect category R rs,H were evaluated, the difference between blue band R rs,H and R rs,SAS was not significant; however, the difference for the blue/green ratio was still significant (Table 8).
Given the statistically significant differences observed between R rs,H and R rs,SAS for the blue/green ratio, a linear model was built to correct R rs,H (B/G) based on R rs,SAS (B/G) (Equation (6)). The overcast condition was not included in this model because of the low number of samples (n = 1 for noon, n = 0 for morning). The model can also be written as separate piecewise models for each environmental condition. This linear model was significant (F(5, 207) = 16.18, p < 0.001) but had a weak adjusted R² (0.26). The prediction performance was assessed by the leave-one-out root mean squared error. Results showed that the magnitude of the error was small (0.074), but the percentage of this error compared with the mean was high (124%).
The correction model does solve the underestimation problem in the original R rs,H (B/G) with a fairly high improvement rate (42%). However, Figure 6b also shows a horizontal trend for the morning conditions. This discrepancy may be a result of the small sample sizes of these conditions for parameter estimation.

Discussion
The objective of this study was to examine the accuracy of water reflectance measurements acquired by citizens using the HydroColor application, aiming to provide recommendations for extending the use of HydroColor to "fisher scientists". Although other mobile sensing applications exist, HydroColor is the focus of this research. Water reflectance samples from HydroColor and SAS solar tracker were collected on the BC Ferry Queen of Oak Bay crossing the Strait of Georgia daily, from July to September 2016. The main findings show that the HydroColor citizen data are accurate compared with hyperspectral instrument data for most bands and band ratios; however, citizen level of training and environmental conditions play a role in the data quality.

Citizen Participation and Data Quality
During the study period, which corresponds to the regional tourist season, over 200,000 passengers per month travelled along the ferry route from Departure Bay to Horseshoe Bay (traffic data provided by BC Ferries: http://www.bcferries.com/about/traffic.html). The total number of passengers participating in the data collection was lower than the number attending the educational talk just before the measurements. In addition, the total sample number acquired by regular citizens (FP) was lower than the trained citizen (CN) ( Table 3, 446 versus 824). This result suggests that either: (1) volunteer participation for this type of data acquisition may not be as effective, similar to the findings by Kotovirta et al. [42] studying the surface algal bloom citizen monitoring data; or (2) the used methodology prevents larger volunteer participation. The first can be dealt with using an incentive mechanism, such as micro-payments [69]. The second, most likely in this case, could be prevented by having more trained citizens aboard the ferry to help passengers with data acquisition.
In our methodological framework, only one trained citizen, "the coastal naturalist", was allowed to guide passengers interested in participating in the experiment (a requirement from BC Ferries). The CN frequently had several people asking questions after the presentation, making it difficult to help all passengers who were willing to acquire data. It would have been useful to have more than one trained citizen (CN) to help with the data collection. In addition, only adults (over age 18) were allowed to participate in data collection. Interestingly, children showed a large interest in using HydroColor for data acquisition, and it would be a valuable experiment to obtain data quality measures for various age groups in future studies.
The CN (trained citizen) collected more samples, and samples of higher quality, than the passengers (untrained citizens) (77% versus 35%, Table 3). This result was not a surprise, as the CN had an educational background in environmental science and received training on the use of HydroColor before working on this program. When classifying the untrained citizens' samples, the main errors were associated with capturing the ship's shadow and white foam in water photos (Figure 5c). Given the required acquisition geometry (an azimuth angle of 135° from the Sun [41]) and the size of the ferry structure, the ship's shadow and foam are likely to appear in some of the water photos if care is not taken. Focused training, such as that given to the CN, can help improve the skills needed to avoid these errors. The training effect is highly significant (p < 0.001), which implies that a certain degree of training is necessary for high-quality water reflectance sample acquisition using HydroColor. This finding is similar to those of other studies working with plant species/phenophase [26] and terrestrial invertebrate biodiversity [28]. Volunteers with around six hours of previous training could correctly identify 91% of the plant phenophases for a variety of species compared with experts (Fuccillo et al., 2015). The sampling performance of volunteers trained in the field for invertebrates was similar to that of expert researchers (p > 0.05) [28]. Thiel et al. [34] indicated that appropriate training varies from study to study and should be considered before instructing volunteers.
Besides training, the patience of citizen scientists is also a concern for data quality; for instance, some volunteers did not finish crabs' size measurements in a study by Delaney et al. [32] since these volunteers thought this process was tedious. In our case, passengers needed the patience to find the correct direction for taking high quality samples. Another approach to improve crowdsourcing data quality suggested by Rogstadius et al. [70] is to emphasize intrinsic motivation such as helping other people.

Environmental Variables and Data Quality
Similar to many studies [53,57,60,71], our results demonstrated the importance of environmental variables (sun zenith and sky condition) on above-water reflectance (R rs ) measurement quality. HydroColor data quality was found to be better (p < 0.05) during the noon (time: 12:50-2:30 p.m., solar zenith: 17.4° to 39.3°) ferry run than the morning (time: 8:30-10:10 a.m., solar zenith: 40.5° to 67.6°) run (Table 4). A higher percentage of bad samples acquired in the morning (21% versus 13%) was most often due to contamination by the shadow of the ship on the water photos. As sun zenith angle increases, the shadow of the vessel becomes larger and consequently more likely to be detected as part of a sample [71]. In addition to the shadowing effect, the water-leaving radiance is highly related to the time of day or solar zenith [57,72]. Atmospheric attenuation of solar irradiance in the visible and near-infrared spectra decreases as the solar zenith angle decreases [72]. At lower zenith angles, higher irradiance reaches the ocean surface, and therefore more of the water-leaving radiance is detected by the sensor [57]. In our study, morning runs in late August and the beginning of September happened when zenith angles were between 60.0° and 67.6°. At these angles, less irradiance reaches the ocean surface, and consequently less radiance from the water is available for detection by the camera. As a result, it is difficult to detect and calculate satisfactory water reflectance samples in these runs. These findings concur with other studies, showing that a lower sun zenith angle reduces the effects of sun glint, low solar irradiance, weak water-leaving radiance, and wave-shadowing [53,57,71,72].
Sky condition also affected the quality of the data. Higher quality HydroColor data were collected under clear sky conditions than under cloudy and overcast conditions, and the cloudiness effect was significant (p < 0.05, Table 5). This finding is consistent with existing studies recommending a clear sky condition for R rs measurement [53,60,71]. In principle, atmospheric light attenuation in clear sky conditions, mostly due to Rayleigh scattering, is relatively more constant than in cloudy or overcast conditions [72]. Further, the reflection of clouds in the water is brighter than the reflectance of blue sky; therefore, more skylight reflected from the water surface was detected by the water-viewing sensor under cloudy conditions [53,71]. Hooker and Morel [60] show that the cloudiness effect is not systematically detected if cloud cover is under a certain threshold. Larger and lower-level cloud cover has a higher effect on the R rs measurements by increasing the magnitude of radiance reflected into the sensor [53].
Further supporting the importance of collecting data under a clear sky, we found that HydroColor data acquired under overcast conditions were overall of the poorest quality; in particular, a relatively large percentage of bad samples (89%) showed water reflectance values equal to zero. This is likely due to the lower sensitivity of the iPad mini 4 camera at low irradiance conditions. This result corroborates the work of Salisbury [72], which illustrated that less radiance signal can be detected under a hazy sky because of blurry shadows, and the work of Garaba and Zielinski [71], which showed that light is more diffuse in hazy skies than in clear skies.
Beyond assessing HydroColor data quality, we compared HydroColor with the gold standard SAS solar tracker to determine how accurate HydroColor can be as an optical instrument to measure R rs . The evaluation of the accuracy of qualified (classified as perfect and good quality) HydroColor citizen data revealed a larger mean and higher variability (Table 6) for the three observed bands in relation to the SAS solar tracker data (R rs,SAS ). A possible reason for this result is that different ρ sky values were used to calculate R rs,H and R rs,SAS . The ρ sky is used to remove the proportion of the sky radiance that is reflected off the water surface and detected by the water radiance sensor [53]. For a flat sea surface, the reflected sky radiance can be prescribed by viewing geometry alone, according to the Fresnel reflectance [6]. However, the sea surface is usually wavy due to wind, and therefore the surface reflects sunlight from a range of directions other than only the viewing angle [53]. The ρ sky value used to calculate R rs,H was set to a constant 0.028 in the HydroColor application [41], while the value used for R rs,SAS was computed by Equations (2) and (3), which relate it to wind speed and sky condition. The comparisons of HydroColor data and calibrated Water Insight Spectrometer (WISP) data by Leeuw and Boss [41] do not show that HydroColor tends to overestimate R rs when the same ρ sky is applied to calculate the above-water reflectance.
According to the results, the observed differences between R rs,H and R rs,SAS for red and green bands were not statistically significant (p > 0.05); however, the difference was significant for the blue band (p < 0.05, Table 7). The lower accuracy in the blue band is also shown in the work of Leeuw and Boss [41], which indicates that R rs,H in the blue band has the highest median percent error compared to the WISP among the three bands. Interestingly, when only perfect quality R rs,H was evaluated, the statistical analysis showed that the difference for the blue band was no longer significant (p > 0.05, Table 8). These results imply that minor contaminations in the HydroColor photos still introduce errors in the accuracy of the R rs,H for the blue band. The high sensitivity of the blue band to minor contaminations may be caused by: (1) comparably weak blue signals, which are therefore subject to more variability due to noise (low signal-to-noise) [73]; (2) relatively high difference and variability between the true spectral sensitivity function and the substitute one; or (3) the fixed ρ sky programmed in HydroColor, which makes it difficult to precisely correct the effect of skylight, a correction that matters most in the blue band because skylight is dominated by blue radiance due to Rayleigh scattering [74]. As such, for accurate retrievals of above-water reflectance at the blue spectra using HydroColor, perfect data quality is required.
The average values and variability (Table 7) of the difference between R rs,H and R rs,SAS were higher for band ratios than for individual bands. The results of permutation F tests for band ratios (Table 7) showed that the difference between R rs,H and R rs,SAS for the blue/green ratio was significant (p < 0.05), while the differences for the other two ratios were not. The significant difference in the blue/green ratio was first considered to be an effect of the poor performance of the blue band with minor contaminations. However, we note that, even after discarding samples with contaminations, the significance was not eliminated (p < 0.05, Table 8). Thus, a linear model with environmental factors as variables was developed to correct the bias (Equation (6)). Although this correction model was significant (p < 0.05), the low adjusted R² (0.26) and the high percentage of correction error (124%) illustrated that it was not satisfactory. Therefore, the magnitudes of individual bands are recommended for scientific purposes rather than band ratios, especially the blue/green ratio. The most likely explanation for this negative finding is that the blue/green band ratio R rs,H is highly sensitive to the differences between the R rs,H and R rs,SAS measurement protocols, which may result in a slight difference in measuring location (different field-of-view angle of the lens and different footprint on the water) and time. To clarify, the SAS solar tracker was installed on the same side of the ship where the citizens acquired data, but still at a horizontal distance of approximately 15 m. In addition, a HydroColor sample usually takes one minute to finish collecting the grey card, sky, and water photos, whereas the SAS solar tracker collected the data with three sensors simultaneously.
Atmospheric properties and water optical constituents are continually changing, and therefore measured reflectance might be slightly different depending on location and time; for instance, clear blue skies with fast cirrus clouds passing in front of the Sun may suddenly change the downwelling irradiance. The studies of Toole et al. [6] and Leeuw and Boss [41] have also mentioned that the deviations between two instruments are partly caused by imperfect matching on time and location.

Conclusions
We have conducted the first evaluation of the accuracy of HydroColor samples collected by citizens aboard a ferry based on a thorough statistical analysis. We have shown that water reflectance measurements acquired by HydroColor can be used for data acquisition with trained citizen scientists; however, care must be taken when using this method with untrained citizen scientists. We suggest that oral and written instructions on the use of HydroColor be provided to the "fisher scientists" before asking for their participation in data acquisition. These instructions should recommend acquiring data under clear skies at noon to obtain high-quality samples. General sources of error are shadows on the grey card, and white foam, shadows, and sun glint in water photos.
We showed that it is acceptable to use the spectral sensitivity function of the iPhone 5 as a substitute for the actual function of the iPad mini 4. Nevertheless, the difference between spectral sensitivity functions was likely a source of error. It would be better to acquire the true spectral sensitivity function of the user device if the "fisher scientists" water reflectance samples need to be compared with ocean colour imagery reflectance in the future.
Future work would include determining the quality of water reflectance samples acquired with HydroColor by "fisher scientists", especially if the samples will be used as ground-truth data [42]. All images of HydroColor samples in this study were carefully checked for data quality. However, it would be impractical for an individual to check large datasets manually. Thus, an automated data quality control system should be developed to enhance the reliability of "fisher scientists" data [75].
Systematic citizen science research should provide meaningful citizen involvement as well as use thorough statistical analyses to evaluate the reliability and associated errors of the acquired data [76].
The engagement of citizens not only provides more data needed in natural resource management, but it also closes the gap between scientists and citizens [35].
Author Contributions: M.C. conceived the idea of assessing citizen above-water reflectance data; M.C. and Y.Y. supervised the data collection; L.L.E.C. and Y.Y. constructed the statistical analysis; Y.Y. prepared the original draft; and M.C. and L.L.E.C. reviewed and edited this paper.