Global Land High-Resolution Cloud Climatology Based on an Improved MOD09 Cloud Mask

Clouds play an important role in the energy and moisture cycle of the earth–atmosphere system, which affects many important processes in nature and human societies. However, there are very few fine-grained and high-precision global cloud climatology data available for high-resolution models. In this paper, we produced a fine-grained (1 km resolution) global land cloud climatology (GLHCC) report based on MOD09 cloud masks from 2001 to 2016, with a temporal resolution of 10 days. The two improvements (short-wave infrared and Band 2/6 ratio threshold method) on the original MOD09 cloud mask have reduced the snow, ice, and bright areas mistakenly classified as clouds. The preliminary cloud products undergo the removal of orbital artifacts by Variational Stationary Noise Remover (VSNR) and the removal of abnormal albedo areas to generate the final cloud climatology data. The new product was directly validated by ground-based cloud observations collected from 3777 global weather stations. PATMOS-X from the Advanced Very High Resolution Radiometer (AVHRR) and MOD/MYD35 served as comparison products for consistency check of GLHCC. The assessment results show that GLHCC demonstrated a strong correlation with ground station observations, MOD/MYD35, and PATMOS-X. When the ground observations were taken as the truth value, GLHCC and MOD/MYD35 displayed higher accuracy than PATMOS-X. In most selected interested areas where the three behave differently, GLHCC matched the facts better than MOD/MYD35 and PATMOS-X. The GLHCC can well represent the cloud distribution over the past 16 years and will play an important role in the fine-grained demands of many aspects of nature and human society.


Introduction
Clouds play an important role in the energy and moisture cycle of the earth-atmosphere system [1][2][3][4][5][6][7]. Clouds directly affect the long and short-wave radiation absorbed and emitted by the earth's surface and the atmosphere, and the global and regional energy balance. The Intergovernmental Panel on Climate Change (IPCC) indicates that much of the uncertainty of climate model predictions comes from uncertainties in cloud descriptions. In addition, the characteristics of clouds affect many important processes in nature and human societies such as the dynamics of ecosystems, the use of solar energy, tourism, and resource planning [5,8]. Therefore, monitoring the global distribution and changes of clouds is of great significance to both nature and economic society.
Previous climate data, such as cloud cover, are mainly derived from ground observations [9]. Ground observations are greatly affected by local conditions; meanwhile, regional representation is not very ideal [10]. In addition, ground stations are distributed unevenly across the world. Satellite remote sensing data has the advantage of having a wide range of detection due to its global monitoring method, which can better reflect the macroscopic characteristics of cloud distribution. Therefore, it can be said that remote sensing provides the only way to observe and monitor clouds regularly on large spatial scales. Many cloud product data are obtained by satellite remote sensing at present [11], such as the National Oceanic and Atmospheric Administration (NOAA) series satellite cloud products [12][13][14][15][16], the International Satellite Cloud Climatology Project (ISCCP) products [17][18][19], the Earth Observing System (EOS) series satellite cloud products [20][21][22][23], and the Cloud Detecting Satellite (CloudSat) cloud products [24,25], etc. There are many comparisons and applications that depend on these products [26][27][28][29][30][31][32]. AVHRR data have been applied to NOAA climate assessment reports, long-term trends in aerosol optical thickness, water vapor variations in the tropical stratosphere, and cloud microphysical variations on decadal timescales [12,29]. Tang and Yang [31] produced a global high-resolution (10 km, 3 h) surface solar radiation dataset using an improved physical algorithm based on the latest international satellite cloud climate program, the Global High Resolution Series Cloud Product (ISCCP-HXG), reanalysis data (ERA5), and MODIS aerosol and albedo products. Sassen and Wang [32] used CloudSat satellite data from 2006 to 2007 to statistically analyze the number of various kinds of clouds in the world.
However, due to the coarse spatial resolution of these cloud products (such as AVHRR PATMOS-X ≈ 11 km and ISCCP-HXG ≈ 10 km), they are difficult to apply in models requiring fine-grained data extraction, such as the prediction of precipitation influenced by convective clouds with high spatial variability and the selection of optimal locations for solar power stations [6,33,34], which requires cloud frequency data with high spatial resolution as a reference. There are a few examples of finer-grain climatologies based on AVHRR data, such as GAC cloud data sets (4 km) and CLARA-A2 (4 km) [16]. In fact, AVHRR scenes have a nadir resolution of approximately 1 km. The problem is that there is no global historic archiving of AVHRR data with the finest resolution (even if regional datasets exist, e.g., over Europe). In addition, the Climate Change Initiative cloud project produces fine-resolution cloud data sets based on different sensors (AVHRR, MODIS, ATSR2, AATSR, and MERIS) [35,36]. Therefore, there is more than one data source that can create a high-resolution cloud climatology. However, MODIS cloud products are of very high quality because of their good data navigation and ability to provide more detailed spectral information compared to other datasets. Therefore, MODIS data can be used to make more detailed cloud products to meet the fine-grained demands.
To date, several studies have produced high-resolution (≤1 km) regional cloud climatologies from MODIS. Douglas and Dominguez [37] developed the derivative product with a 1 km resolution based on the MODIS "rapid response" system [38] by converting RGB "brightness" to "cloudiness" using user-defined thresholds. However, the product strongly depends on the brightness threshold and has problems with high reflectivity surfaces (such as urban areas or snow) [39]. In addition, this approach does not utilize the testing methods used in most cloud detection algorithms, which are only applicable to certain regions of the globe. Moreover, some studies have produced cloud climatologies based on cloud masks stored in MODIS products. For example, Mulligan [40] produced cloud climatology data for seven years based on the MOD35 250 m visible cloud mask (2000)(2001)(2002)(2003)(2004)(2005)(2006), which is spatially bounded to the tropics. The MOD35 cloud mask was developed by the MODIS atmospheric science team. The product indicates whether a pixel is cloudy or clear with several levels of confidence at 1 km and 250 m spatial resolution. While the 250 m version is based on the visible channels only, the algorithm of 1 km employs a broad range of spectral tests using 20 of MODIS's 36 spectral bands. Wilson and Parmentier [41] produced a cloud climatology for 2009 to compare the differences between the 1 km MOD35 cloud mask of Collection 5 and Collection 6. This study described serious deviations in land cover and processing path in Collection 5, which were reduced but not eliminated in Collection 6 [42]. The algorithm of MOD35 adopts different processing paths for different land covers, taking into account the difference of cloud identification threshold caused by the change of surface characteristics. However, the algorithm does not well solve the problem of huge cloud variations and obvious boundary between adjacent land covers caused by different processing paths. These erroneously categorized areas in Collection 6 of MOD35 are mainly associated with water surfaces and exposed surface areas (river channels, sparse grasslands, and cultivated land); see Sections 3.1 and 3.2.2 for details. Moreover, a high-resolution, near-global, and monthly cloud product based on Collection 5 of the MOD09 cloud mask was developed by Wilson and Walter [39] to predict the distribution of ecosystems and biodiversity. This mask is a product of the MOD09 level-2 processing chain (PGE11), which is also called the "internal" cloud mask of MOD09 reflectance products. It is less affected by land cover and more balanced in space [41,43]. In addition, the MOD09 cloud mask, together with the internal cloud shadow flags, form the basis for all composites related to the MOD09 surface reflectance products, demonstrating the usefulness of the MOD09 cloud mask. However, MOD09 cloud detection algorithms are still unable to fully distinguish between clouds and some features (snow, ice, and bright areas) [44]. Therefore, if the error of the MOD09 cloud mask can be corrected, the accuracy of multi-year cloud climatology will be greatly improved. Although Collection 6 of the MOD09 cloud mask is available, there are many anomalies similar to the anomalous cloud frequencies seen in Collection 5 of MOD35 when calculating long-term cloud frequencies. Please refer to Section 3.1 for details. Therefore, Collection 5 of MOD09 is more suitable, in a relative sense, for establishing long-term cloud climatology data. In addition, most of the current global cloud climatology data are annual, quarterly, and monthly products due to the huge amount of data, resulting in an application range that is limited by the temporal resolution. The improvement in temporal resolution can provide a more detailed interpretation of some rapidly changing natural scenes during the year, and can also be treated as a reference for applications such as satellite scheduling. Therefore, the goal of this article is to produce a higher precision and higher resolution global land cloud climatology.
In this paper, we produced a fine-grained (1 km resolution) global land cloud climatology (GLHCC) using MOD09 cloud masks of twice-daily observations from 2001 to 2016, with a temporal resolution of 10 days. Compared with the study by Wilson and Walter [39], in addition to improving the temporal resolution of cloud products, we also present two improvements for the original MOD09 cloud mask. On one hand, a short-wave infrared (SWIR) threshold method was used to identify the confusion of clouds, snow, and ice. On the other hand, a Band 2/6 ratio threshold method is intended to remove the high reflectivity areas mistakenly classified as clouds. Referring to the study by Wilson and Walter [39], the preliminary cloud products undergo the removal of orbital artifacts by the Variational Stationary Noise Remover (VSNR) and the removal of abnormal albedo areas to be generated by the final cloud climatology. The improved effect of the cloud mask was tested by a series of comparative experiments before and after improvement. The GLHCC product was validated by ground observations collected by a global network of 3777 stations. PATMOS-X from AVHRR and MOD/MYD35 were used as comparison products for a consistency check of the GLHCC.

Data and Preprocessing
Cloud statistics were derived from the MOD09 cloud mask from 1 January 2001 to 31 December 2016, as being provided as a cloud mask flag in the MOD/MYD09GA daily surface reflectance product. MOD/MYD09GA provides an estimate of the surface spectral reflectance of the Terra/Aqua Moderate Resolution Imaging Spectroradiometer (MODIS) Bands 1 through 7, corrected for atmospheric conditions such as gases, aerosols, and Rayleigh scattering. MOD09GA and MYD09GA provide twice-daily observation data at 1030 and 1330 local time, respectively. Provided along with the 500 m surface reflectance, observation, and quality bands are a set of ten 1 km observation bands and geolocation flags. The regions classified as "deep ocean" in the surface reflectance products are not processed in the MOD/MYD09 (surface reflection) algorithm; rather, an ocean reflection model is generally used for such regions. Therefore, all the bands and flags stored in MOD/MYD09 products lack the data of the deep ocean area. Two official cloud masks (MOD35 and MOD09 cloud mask) are available in MOD/MYD09GA, both of which are stored in the State Quality Scientific Dataset (QA SDS). The QA SDS also contains further information on the pixel's state, such as cloud shadow or aerosol quality flags. The internal cloud mask (MOD09) used in this study can be acquired from bit 10 in the State QA SDS. The MOD09 cloud mask is stored in one bit thus only having space for the two labels "cloudy" and "clear". This mask is a product of the MOD09 level-2 processing chain (PGE11) that relies on two reflective and one thermal test [44,45]. Although the processing steps leading to the level-2 reflectance products are well documented, no information on the functionality of the internal cloud detection algorithm is available. The MOD35 cloud mask is contained in bits 0-1 of this SDS, thereby having the capacity for the two additional labels: "mixed" and "not set". The MOD35 cloud mask algorithm employs a broad range of spectral tests using 20 of MODIS's 36 spectral bands.

Ground Observational Data
The ten-day GLHCC product was directly validated using global observational data from the "Integrated Surface Database" (ISD). National Centers for Environmental Information (NCEI), with U.S. Air Force and Navy partners, originated the effort in 1998 with the assistance of external funding from several sources. ISD consists of global hourly and synoptic observations compiled from numerous sources into a single common ASCII format and common data model. The outcome of this effort is a dataset containing data from more than 100 original data sources that collectively archived hundreds of meteorological variables. The primary data sources include the Automated Surface Observing System (ASOS), Automated Weather Observing System (AWOS), synoptic, Airways, METAR, Coastal Marine (CMAN), buoy, and various others, from military and civilian stations, including both automated and manual observations. ISD contains surface weather observations from more than 35,000 stations worldwide included in the archive (1900-present). Currently, there are over 14,000 "active" stations updated daily in the database. ISD includes numerous parameters such as wind speed and direction, wind gust, temperature, dew point, cloud data, sea level pressure, altimeter setting, station pressure, present weather, visibility, precipitation amounts for various time periods, and snow depth.
To ensure the continuity of data in time and the wide distribution of data in space, we extracted the cloud records, from 0900 to 1500 local time, from the hourly data of 16 years, and filtered them according to the conditions mentioned in Section 2.2.4. In the end, only 3777 valid stations around the world were retained.

PATMOS-X Cloud Climatology
The ten-day GLHCC product was checked using PATMOS-X cloud cover data from the Advanced Very High Resolution Radiometer (AVHRR), which is a sensor carried on the National Oceanic and Atmospheric Administration (NOAA) series of meteorological satellites. Since the launch of TIROS-N in 1979, AVHRR sensors on NOAA satellites have been continuously carrying out earth observation missions. PATMOS-X cloud cover data is a set of atmospheric and surface climate record products obtained by the University of Wisconsin using the existing AVHRR data. The generated products mainly include clouds, aerosols, surface radiation, etc., which are divided into 165,018 pixel points worldwide. The standard PATMOS-X products use the Level2b file format, which is a sampled (not averaged) product fit to a 0.1 • equal-angle global grid. PATMOS-X includes a full suite of cloud and atmospheric products, in which the cloud fraction can be used as the data source for the quality assessment in this paper to evaluate the accuracy of the MOD09 cloud product. Cloud fraction was defined as the ratio of cloudy to total pixels determined using a naïve Bayesian cloud detection algorithm. The algorithm uses different tests or classifiers to calculate the probability of a given pixel being cloudy or clear, and uses collocated CALIPSO overpasses for training. The algorithm uses infrared, near-infrared, and visible light bands, and quantifies the probability of each classifier. In this study, daily PATMOS-X data in 2010 were used as data sources to generate the average cloud fraction in 2010.

Methods
In this research, the fine-grained (1 km resolution) global cloud climatology (GLHCC) was produced by using the MOD09 cloud mask of twice-daily observations. The two improvements (the short-wave infrared and Band 2/6 ratio threshold method) introduced in this paper upon the original MOD09 cloud mask reduce the snow, ice, and bright areas mistakenly classified as clouds. The preliminary cloud products undergo the removal of orbital artifacts by the Variational Stationary Noise Remover (VSNR) and the removal of abnormal albedo areas to generate the final cloud climatology. The effectiveness of GLHCC was demonstrated by ground observations, MOD/MYD35, and AVHRR's PATMOS-X cloud data. The flowchart of production and quality assessment for the datasets in this study is shown in Figure 1.

Calculation of Cloud Frequencies
In this study, the goal is to produce a ten-day average global cloud climatology dataset. Cloud statistics were derived from the MOD09 cloud mask from 1 January 2001 to 31 December 2016 as being provided as a cloud mask flag in the MOD/MYD09GA daily surface reflectance product. For each of the MODIS tiles, the respective cloud information was extracted from bit 10 of the "state_1 km" Scientific Data Set (SDS). The MOD09 cloud mask is stored in one bit thus only having space for the two labels "cloudy" and "clear". We extracted the daily cloud flags and converted flags into separate cloud masks using the Google Earth Engine (GEE) application programming interface (http://earthengine. google.org/ accessed on 23 September 2021).
The MOD09 cloud mask algorithm includes two reflective ("middle infrared anomaly" index and the 1.38 microns reflectance) and one thermal test to maximize the reliability of cloud detection. The first reflective test uses the short-wave and middle infrared to calculate a "middle infrared anomaly" index, which efficiently detects low or high reflective clouds. The second test using reflectance at 1.38 microns effectively detects high clouds. The thermal test is used to identify pixels with high infrared radiance anomalies (e.g., fires, sun-glint, and high albedo surfaces) with respect to the near-surface (2 m) air temperature provided by the NCEP reanalysis [46]. After these two reflective and thermal tests, the MOD09 cloud mask still had some errors, such as bright rocks and sand which were misclassified as clouds. See Section 3.1 for details. The MOD09 cloud algorithm was designed to minimize the confusion over snow and ice by taking the surface air temperature into account. However, the surface temperature does not effectively distinguish snow and ice from clouds at high latitudes or altitudes. Especially in the multi-year averaged cloud products, the abnormal regions become more obvious, such as water surfaces in winter and bright deserts. The confusion over snow, ice, and high reflectance areas are the main components of the cloud error detected in the MOD09 cloud mask. Therefore, before performing cloud frequency calculations, we added spectral tests to minimize snow, ice, and bright areas which were misclassified as clouds.

• SWIR Threshold Method
Cloud detection is done by essentially finding a threshold to most accurately distinguish between clear sky and cloudy pixels. Figure 2 shows the change of average reflectance values of four types of pixels of ice, snow, thin cloud, and thick cloud obtained at visible to short-wave infrared wavelengths. In true-color images, clouds that completely block ground objects are defined as thick clouds, and clouds that obscure ground objects are defined as thin clouds. The distinction between snow and ice is similar to the distinction between thick and thin clouds. These four types of pixels were selected from true-color images by visual interpretation in 2001, 2005, 2009, 2013, and 2016. These four types of pixels cover areas of different latitudes, elevations, and land cover. The number of samples of one type of pixel in a year is about 70~80. Since it is difficult to distinguish snow and cloud effectively through human eyes in winter at high latitudes, sample selection is subjective to some extent. In Figure 2, the spectral range from left to right represents band 8, band 9, band 3, band 4, band 1, band 2, band 5, band 6, and band 7 of MOD09GA, respectively. Band 2 is the near-infrared (NIR) band, Bands 5, 6, and 7 are short-wave infrared (SWIR) bands, and the rest are of the visible (VIS) spectrum. It can be seen that there is not much difference in the VIS between the reflectance values of thick clouds and snow, as well as between ice and thin clouds. However, the reflectance of snow and ice drops sharply at the NIR and SWIR, which differs greatly from that of thick and thin clouds. In the spectral range of 2105 nm to 2155 nm, the reflectance of snow and ice decreased to the lowest value. Ice and snow generally show a high degree of reflection in the VIS (wavelength ca 400~750 nm), lower reflection in the NIR (wavelength ca 780~900 nm), and very low reflection in the SWIR (wavelength ca 1570~1780 nm). Therefore, due to the coupled effects of the strong absorption by snow and ice in the SWIR, and the high reflection of clouds in the SWIR [47][48][49][50][51][52], we can use MODIS band 7 to perform threshold segmentation for snow, ice, and cloud, as shown in Equation (1).
(1) In general, all kinds of clouds should have Band 7 reflectance larger than 0.03, referring to the LEDAPS (The Landsat Ecosystem Disturbance Adaptive Processing System) internal cloud masking algorithm. In this study, the threshold is set to 0.025 in order to prevent filtering out thin clouds. When the value of MODIS Band 7 is lower than 0.025, it should not be regarded as "cloudy".

•
Band 2/6 ratio threshold method In addition to the confusion of snow and ice, and bright rocks and sand often appear in cloud masks, especially in desert areas. We can use the ratio of Band 2 to Band 6 to distinguish between clouds and sand, which is similar to the Band 4/5 ratio commonly used in Landsat to eliminate highly reflective sands and rocks [47,48]. In principle, the sand and rocks tend to exhibit higher reflectance in Band 6 than that in Band 2, resulting in the ratio of Band 2 to Band 6 of sand usually being less than 1, whereas the reverse is true for clouds. However, sometimes the ratios of thin clouds are less than 1 due to the exposed sand and rock beneath thin clouds. As shown in Figure 3, the ratios of thin clouds are scattered between 0.85 and 1. The Band 2/6 test is shown in Equation (2).

Removal of Orbital Artifacts
The MODIS orbit results in systematic gaps in the daily global coverage near the equator that results in nearly longitudinal orbital artifacts (0 • due north, approximately 15 • for Terra, and 345 • for Aqua) in the long-term cloud frequencies, as shown in Section 3.1. The orbital artifacts are evenly distributed in the MOD35 cloud mask between 180 • W to 180 • E and 30 • S to 30 • N. Such orbital artifacts are more pronounced at sea than on land. In the MOD09 cloud mask, orbital artifacts are not as obvious as in the MOD35 cloud mask, only visible in some areas. Therefore, in the MOD09 cloud mask, the orbital artifacts are visible over eastern Australia, northwest Africa, and South Asia. In addition, the absence of short-wave infrared in the winter at high latitudes has resulted in horizontal fringe noise in the Arctic areas. To remove these features, we used the Variational Stationary Noise Remover (VSNR) for the long-term cloud frequency data produced in part in 2.2.1 [39,53]. It can be interpreted both as a restoration method in a Bayesian framework and as a cartoon + texture decomposition method. The VSNR is able to preserve the image details and remove the stripes and most of the white noise. Therefore, it is well-suited to remove these artifacts by setting the shape, scale, and angle of known artifacts.
Images with obvious fringe noise were screened from 37 cloud frequency data (i.e., the year is broken down in 37 ten-day periods), and the problematic regions in these images were clipped as VSNR input data. The Gabor filter in VSNR was set to y = 200, x = 5, and θ = 15 for Terra and θ = −15 for Aqua to minimize the orbital artifacts in eastern Australia and the central Pacific areas. The Gabor filter with y = 40, x = 5, and θ = 15 for Terra and θ = −15 for Aqua was used in Northwest Africa and the northern Indian Ocean areas. The output image with the fringe noise removed is reinserted into the long-term cloud frequency data.

Removal of Abnormal Points
Examination of the resulting ten-day climatology revealed an abnormally high cloud frequency in some areas with high albedo and annual variability (SD), especially at the land-water junction. Although the MOD09 cloud mask performs much better than MOD35, the misclassification of clouds at the land/water interface is still obvious. Even the contours of land and water can clearly be seen depending on the distribution of clouds. Albedo from the 1 km MODIS MCD43B3 product was used to identify these problematic regions. These problematic areas generally fall into three categories: (1) land with high albedo and high variability (LAV), (2) water with high variability (WS), and (3) water with high albedo (WA). Therefore, MODIS MOD44W water mask products and SRTM Digital Elevation Data are employed to assist in identifying these areas. Equations (3)-(5), and (7) are the identification Equations for abnormal regions. LAV = alb min ≥ 0.08 and alb cv ≥ 30 and slope ≤ 0 and water = 0 (3) WS = alb min ≥ 0.025 and alb cv ≥ 30 and water = 1 (4) WA = alb min ≥ 0.075 and water = 1 where, alb min represents the minimum albedo of the pixel over 16 years. The slope is calculated from SRTM DEM. For water, 0 and 1 represent a non-water surface and water surface, respectively. alb cv is the Coefficient of Variation (CV), which is obtained by Equation (6).
where, alb sd represents the standard deviation of the 16 years of albedo data, and alb mn represents the average of 16 years of albedo data. The threshold setting of the above equation is referenced from the research by Wilson and Walter [39]. Therefore, through Equation (7), a binary image can be obtained to mask out the problem regions.
Area prob = LAV or WS or WA where, Area prob represents the areas with abnormal points. The abnormal areas identified by this method are usually small areas such as narrow land and water boundaries. To avoid data loss after the deletion of abnormal areas, an interpolation method is used to fill these areas. The interpolation method may change the cloud frequency data of small regions, but it is ultimately more beneficial than harmful to remove abnormal regions in practical application. For example, an unusually high cloud frequency could lead to an inefficient location for a solar power station on water.

Quality Assessment Method
Due to the unfeasibility of obtaining precise simultaneous cloud information, the validation of a cloud product remains a challenging task. The available reference cloud products present their own inherent limitations and are often not as accurate as the products they are intended to evaluate. Given the implied lack of fully reliable datasets for cloud product validation, the quality assessment of a cloud product must be regarded as a product comparison rather than a product validation. The verification strategy in this article includes a direct validation and consistency check.

• Direct validation
Direct validation is a method to obtain the "relative truth value" on the pixel scale of the product by processing the ground observational data, to evaluate the accuracy of the remote sensing product by directly comparing observations with the remote sensing product. In the study by Wilson and Walter [39], monthly daytime ground-based cloud observations from 1997 to 2009 produced by Eastman and Warren [54] were used to validate their dataset from 2001 to 2015. The time difference between the ground-based cloud observations and the cloud dataset is huge both in terms of year span and daily observation time. To eliminate possible validation inaccuracies due to this large time difference, we used hourly data from ISD from 2001 to 2016. In this study, cloud observation data from the ISD were used to validate the 16-year average, ten-day cloud climatology. We extracted the hourly values of total clouds, which represent the percentage of the sky covered by all types of clouds at that time. The observations from 2001 to 2016 were collated to extract useful information (station ID, latitude and longitude, UTM time, and cloud record). The ground observations used to validate cloud products needed to meet three conditions. First, there should be at least one observational recorded between 0900 and 1500 local time to obtain the daily mean noon cloud frequency. Second, at least 6 out of every 10 days are observed to obtain the ten-day average cloud frequency. Third, observational data are available for at least 10 out of the 16 years to obtain the 16-year average ten-day cloud frequency. We relaxed the time conditions so that the remaining stations could be spread over as much of the world as possible. This time difference between satellite and ground observations affects the verification results to some extent. We filtered the stations that were available over the 37 time periods (i.e., the year was broken down into 37 ten-day periods), leaving 3777 stations.
To compare these observations with satellite data, it is important to consider that the sampling radius of these observations (the visible sky) depends on cloud height, cloud thickness, earth's curvature, and other factors, but the range of an observation is usually much larger than a single 1 km MODIS pixel [39]. For this reason, we calculated the average of the GLHCC in a circle with a radius of 16 km [55] centered on each station, and converted the GLHCC into the average cloud cover within the sample radius, making it equivalent to the observed value of the station. Finally, regression analysis was conducted to obtain the correlation coefficient (R) and root mean square error (RMSE) of the two groups of data.

• Consistency check
Consistency check is a method of using the same type of remote sensing products with known accuracy to verify the remote sensing products tested. Consistency check helps to evaluate the consistency between different products, and can comprehensively reflect the relative accuracy of cloud products. In this study, a visual comparison and correlation analysis were conducted between the GLHCC, MOD/MYD35 [41], and PATMOS-X [56] to analyze the advantages and disadvantages of the regions with differences between them. Through the regression analysis of the GLHCC, MOD/MYD35, PATMOS-X, and observational data, the correlation coefficient and standard error are calculated, and the relative accuracy of the three is analyzed.

Results
After the modification of the cloud mask and the production and improvement of products mentioned in Section 2, the 16-year average cloud climatology data for 37 time periods (i.e., the year is broken down into 37 ten-day periods) are finally obtained. Figure 4 is an example diagram of ten-day land cloud frequency, which shows a non-global region due to the absence of Band 7 at high latitudes in winter. In specific data sets, cloud frequency data for high latitude areas in summer are available. From left to right and from top to bottom, the graph shows the 2nd, 5th, 8th, 11th, 14th, 17th, 20th, 23rd, 26th, 29th, 32nd, and 35th ten-day period of the year, respectively. Figure 5 shows the average of the 37 images (annual mean cloud frequency). The statistical results of cloud distribution on the land, except the poles, show that the annual average cloud cover is about 50.4%. The mean annual cloud frequency confirms equatorial South America, the Congo River basin in Africa, and Southeast Asia to be the cloudiest regions of the world, with annual cloud frequencies (proportion of days with a positive cloud flag) ≥ 80%. The Sahara Desert and the Middle East are the least cloudy areas with annual cloud frequencies ≤ 10%. Figure 6 is a histogram of the global land average cloud frequency across the 37 periods. It can be seen from the graph that the global cloud cover changes in a wavy pattern, with higher cloud frequencies in October, November, December, and January, and lower frequencies in June and July.
Regions also vary strongly in the temporal variability of cloud cover, both within and between years, as shown in Figure 7. The largest inter-annual variability (standard deviation over 16 years of the mean cloud frequency for each individual 10-day period) outside Antarctica occurs in the tropical and subtropical savannas and the shrublands (SD ≥ 18%). The regions with large intra-annual variability (16-year mean of the annual standard deviation over all 37 ten-day periods for every individual year) are mainly distributed in the tropical monsoon climate zones of South Asia and Southeast Asia and savanna climate zones in Africa, South America (SD ≥ 35%). However, the inter-annual and intra-annual variations of tropical desert climate are relatively small (SD ≤ 5%).    In the specific experiment process, to prove the reliability of the research, we have done comparative experiments focusing on many aspects. Taking the annual average cloud frequency over a region of East Asia in 2001 as an example, Collection 5 and Collection 6 of the MOD09 and MOD35 cloud masks were compared visually, as shown in Figure 8. From left to right and top to bottom, the images are Collection 5 of the MOD09 cloud mask, Collection 6 of the MOD09 cloud mask, Collection 5 of the MOD35 cloud mask, and Collection 6 of the MOD35 cloud mask. By visual comparison, Collection 6 of MOD09 shows more anomalies, with areas of unusually high cloud frequency consistent with city and water locations. This may be due to cloud mask algorithms misclassifying bright areas as clouds, regardless of whether the brightness is caused by bright buildings, bright land, or aerosols. Collection 5 of the MOD35 cloud mask exhibits similar problems as Collection 6 of MOD09 does in urban areas. Compared with Collection 5 of MOD35, Collection 6 has less abrupt cloud variation between different land covers. However, this does not mean that the bias of land cover and processing path in Collection 6 of MOD35 have been completely eliminated. We can find from the four figures that the cloud frequency of MOD35 is higher than that of MOD09 on the whole, especially in the northern part of China. In fact, northern China receives much less rainfall than southern China, which is not reflected in MOD35. Therefore, we performed a visual examination of four cloud masks (Figures 9 and 10).   Overall, Collection 5 of MOD09 has a better performance with smoother transitions between adjacent areas and has no abrupt changes between land and water.
To further prove the usability of Collection 5 of MOD09, we have done comparative experiments with Collection 6 of MOD35, as shown in Figures 11 and 12. Figure 11 shows the annual mean cloud frequency of Taihu Lake in China in 2001, located in the southern plain area. Strangely, the MOD35 cloud mask (Figure 11) shows extremely low cloud frequency over the lake without the influence of topography. Figure 12 shows the annual average cloud frequency for a region of Santiago del Estero in 2010. Figure 12a shows an area of Santiago where woodland is mixed with cultivated land, some of which has been left bare for years without crops. Figure 12b,c show the cloud frequency of the MOD09 and MOD35 cloud masks, respectively. The cloud frequency displayed by the MOD35 cloud mask shows a high degree of consistency with the land cover. It can be seen that the MOD35 cloud mask is severely affected by land cover and processing paths.  Based on the above comparison, this study finally identified Collection 5 of the MOD09 cloud mask as the most suitable data source for multi-year cloud frequency calculations.
The improvement of the MOD09 cloud mask is evidenced by the correction of snow, ice, and high reflectivity areas, which were otherwise misclassified as clouds. The error in the original MOD09 cloud mask will become more obvious with the average of multiple images. Figure 13a shows the first 16-year ten-day average cloud frequency based on the uncorrected MOD09 cloud mask in GLHCC01. It is unreasonable that the cloud frequency over Qinghai Lake and Hala Lake in winter is abnormally higher than the surrounding areas; and that the boundary between land and water is unusually clear. GLHCC01 represents the image of the first time period among the 37 time periods (i.e., the year is broken down into 37 ten-day periods), i.e., the average cloud frequency from 1 January to 10 January. It is caused by the mistaken classification of snow and ice in Qinghai Lake and Hala Lake in winter as clouds over many years. Taking Qinghai Lake on 6 January 2001 as an example, the correction effect of the SWIR threshold method on the MOD09 cloud mask is illustrated (Figure 14). Figure 14a is the true-color image of Qinghai Lake, from which it can be found that the reflectivity of ice, snow, and clouds on Qinghai Lake is extremely high. According to the characteristics of cloud distribution, shape, and cloud shadow, it can be ascertained that the highlighted areas on the lake surface are probably ice and snow rather than clouds. We can test our idea in the SWIR band. It can be seen from Figure 14b that part of the lake with high reflectivity originally within the range of VIS now presents low reflectivity within the range of SWIR, while some areas still have high reflectivity. This is because clouds are still highly reflective in short-wave infrared wavelengths, whereas snow and ice have low reflectivity because of the absorption of short-wave infrared wavelengths. Figure 14c,d show cloud masks before and after the correction of the SWIR threshold, respectively. The threshold here is set to 0.025 to prevent the thin cloud from being removed altogether. Figure 13b shows the average cloud frequency in GLHCC01 based on the corrected MOD09 cloud mask. Although the modified cloud frequency over Qinghai Lake and Hala Lake is still slightly higher than the surrounding area, there is a notable improvement. Figure 13c shows the effect after the treatment of the removal of the albedo anomaly, upon which the obvious land-water boundary is removed.  Taking the Sahara Desert in Northern Africa on 1 January 2008 as an example, the Band 2/6 ratio threshold method is used to separate particularly bright areas from the clouds, as shown in Figure 15. Figure 15a is a true-color image of an area in the Sahara showing the distribution of clouds. In the original GLHCC cloud mask shown in Figure 15c, it can be seen that a large bright area is mistakenly divided into clouds. This is because clouds, along with some bright rocks and sand, have high reflectivity at visible wavelengths. The ratio of Band 2 to Band 6 in Figure 15b can effectively distinguish clouds from bright rocks and sand because bright rock and desert tend to exhibit higher reflectance in Band 6 than in Band 2. Therefore, areas with a ratio greater than 1 are generally regarded as clouds; however, the threshold ratio is lowered to 0.85 to prevent the removal of some thin clouds. Figure 15d is the cloud mask modified by the Band 2/6 ratio threshold method for the MOD09 cloud mask, upon which non-clouded "cloudy" pixels are modified to "clear".
The MODIS orbit causes systematic gaps in the daily global coverage near the equator, resulting in nearly longitudinal orbital artifacts (0 • due north, approximately 15 • for Terra and 345 • for Aqua) for the long-term cloud frequencies. Figure 16 shows the comparison of the east coast of Australia before and after removing fringe noise in GLHCC01. To clearly show the elimination effect of fringe noise, we added the ocean area with more obvious noise. As shown in Figure 16a, there are several longitudinal stripes, of approximately 15 • , at the point of the red arrow. The VSNR was adopted to eliminate these stripes, whose results after elimination were shown in Figure 16b. It can be seen that this technology can eliminate the fringe well and keep the image value close to the original value to ensure the accuracy of data. In this experiment, we set the x, y, and θ of the Gabor filter to 5, 200, and 15, respectively.  The observed cloud products reveal that the cloud frequency is unusually high in areas with a high albedo and annual variability (SD), especially at land-water intersections.
The albedo from the 1 km MODIS MCD43B3 product and other data was used to identify these problematic regions. Using an example from the east coast of Africa in GLHCC01, we demonstrated the elimination effect of abnormal values (Figure 17). In order to clearly show the removal effect of abnormal areas, we retained the cloud frequency of the ocean area. As can be seen from Figure 17a, along the east coast of Africa there is a distinct white curve at the point of the red arrow, representing an abnormally high cloud frequency. These abnormal areas can be effectively identified and removed by the albedo anomaly removal method, and then be filled by interpolation. The effect after filling is shown in Figure 17b.

Direct Validation
In direct validation, we have compiled total cloud cover records from 3777 observation stations around the world for over 16 years to verify the cloud products of this study. Figure 18 shows the global distribution of weather stations used for verification. The northern hemisphere retains more weather stations than the southern hemisphere. The available observational data in parts of Africa and South America that meet the verification conditions are sparse or even absent in some places. Therefore, the validation result is affected by the uneven distribution of the observation data. Figure 19 shows the global land average cloud frequency changes for the 37 time periods (i.e., the year broken is down in 37 ten-day periods) of ground observation data and the GLHCC. The "global" here refers to the areas covered by 3777 stations. Figure 19 shows that the frequency change trends of the two types of clouds are roughly the same, which are highly correlated. The observed cloud frequency is 1% to 5% higher than that of GLHCC products as a whole, which may be related to the observation and statistical methods of the two. Both types of data have higher average cloud frequencies in November, December, and January, and lower cloud frequencies in July, August, and September. Table 1 shows the specific values for comparison and validation of the two types of products over 37 time periods. The results show that the cloud products obtained in this study have a strong correlation with the observational data of meteorological stations. The average R of the verification results in 37 periods is 0.872 (Time 12: 0.8034~Time 30: 0.9105), and the RMSE is 9.3109 (Time 30: 7.9898~Time 19: 11.0265). In general, in October, November, and December, the GLHCC cloud products express the best degree of matching to the observation data. The correlation between the two is the smallest in April and May.

Consistency Check
In this study, PATMOS-X daily cloud data from 2010, with a resolution of 11 km, was selected as the consistency check comparison product. At the same time, in order to compare the advantages and disadvantages of different products, we added the annual average cloud frequency in 2010 based on the MOD/MYD35 cloud mask. Visual comparison and correlation analysis were conducted between GLHCC, MOD/MYD35, and PATMOS-X to analyze the relative merit of the regions with differences between them.    According to the statistics, the annual mean cloud frequencies over land, excluding polar regions, are 51.1%, 55.6%, and 52.8% for the GLHCC, MOD/MYD35, and PATMOS-X, respectively. The cloud frequency of MOD/MYD35 is significantly higher than that of the GLHCC and PATMOS-X in the middle and high latitudes of the Northern Hemisphere. Near the equator, the cloud frequency of PATMOS-X is significantly higher than GLHCC and MOD/MYD35. In addition, there are local differences in cloud frequency, mostly distributed in areas along a land-water boundary and regions with large terrain changes.
PATMOS-X has a much higher cloud frequency than the GLHCC and MOD/MYD35 on rivers and at the boundary of land and water, such as the coastline of Southeast Asia (Figures 21 and 22) and the Amazon River Basin (Figures 23 and 24). Figure 21 shows the cloud distribution of the GLHCC, MOD/MYD35, and PATMOS over southeast Asian islands in 2010. To show more clearly the cloud frequency along the island, the ocean area has been added here. Figure 21c shows an extremely anomalous land-water boundary in PATMOS-X, which is also observed near the equator in Africa and South America. In Figure 21a,b, the GLHCC and MOD/MYD35 show the opposite phenomenon to PATMOS-X; that is, the cloud frequency along the coast of the island is lower than the surrounding cloud frequency. This phenomenon is mainly due to the form of cloud distribution shown in Figure 22. At the boundary of land and sea, a long and narrow cloudless zone is formed due to the influence of topography, ocean currents, and other factors. Figure 23 is the cloud distribution of the GLHCC (Figure 23a), MOD/MYD35 (Figure 23b), and PATMOS-X (Figure 23c) over the Amazon River Basin. The GLHCC and MOD/MYD35 show lower cloud frequencies in the river region than in the surrounding region. However, PATMOS-X showed completely opposite results to the GLHCC and MOD/MYD35, with a higher cloud frequency over the river. By studying true-color images of the Amazon, it was determined that the cloud frequency of PATMOS-X was not consistent with the facts. Figure 24 shows an image of the Amazon on 1 June 2010, with cumulus clouds over the surrounding rainforest but not over the river. This phenomenon reveals transpiration in tropical rainforests, which is also common in tropical regions such as the Congo River Basin. This phenomenon is more obvious on MOD/MYD35 than in the GLHCC. It is difficult to determine which is more accurate, the GLHCC or MOD/MYD35. However, it is apparent that the river channel in the MOD35 cloud mask shows a very high cloud frequency, which is obviously inconsistent with the facts. The details are shown in an enlarged view in the lower right corner of Figure 23b, which is the area in the red box.     Cloud frequency has a strong relationship with topography in the regions with great undulating terrain, such as in the Eastern African region ( Figure 25) and Western South America ( Figure 26). Figure 25 shows the cloud frequency of East Africa in 2010 for the GLHCC (Figure 25a), MOD/MYD35 (Figure 25b), and PATMOS-X (Figure 25c). The cloud distributions of GLHCC and MOD/MYD35 over East Africa differ from that of PATMOS-X. In PATMOS-X, the cloud frequency over the Great Rift Valley is higher than that over the plateau region, while that over the coastal plain is lower than that over the plateau region. The lower cloud frequency for the Great Rift Valley in the GLHCC and MOD/MYD35 is more accurate than PATMOS-X because the Great Rift Valley is a typical dry-hot valley. In addition, the coastal plains and low foothills influenced by the tropical monsoon climate are wetter than the East African Plateau. Therefore, there is a high probability that the GLHCC and MOD/MYD35 are correct in their cloud estimation for this region. Figure 26 shows the cloud frequency of western South America in 2010 for the GLHCC (Figure 26a), MOD/MYD35 (Figure 26b), and PATMOS-X (Figure 26c). The GLHCC and MOD/MYD35 produce much higher cloud frequencies than PATMOS-X in the Andes Mountains. In the Southern Andes, the GLHCC showed a high cloud frequency, which was not seen in MOD/MYD35 or PATMOS-X. After investigation, it was found that the high cloud frequency displayed by the GLHCC in this area was mainly caused by the persistent snow cover throughout the year.
At the same time, we conducted regression analysis on the GLHCC, MOD/MYD35, PATMOS-X, and observed data, and the results are shown in Figure 27. The correlation coefficient (R) and Root Mean Square Errors (RMSE) of four cloud frequency data were obtained. Among them, the GHLCC and MOD/MYD35 obtained the largest R (0.9555) and the smallest RSME (4.5697%). PATMOS-X and the observations obtained the smallest R (0.7814) and the largest RSME (9.1718%). The R and RMSE of the GLHCC with observations and PATMOS-X were similar to those of MOD/MYD35 with observations and PATMOS-X. It can be seen that the GLHCC and MOD/MYD35 perform better, while PATMOS-X performs worst when the observations are taken as the truth value. In addition, the consistency between satellite data is higher than that between satellite data and observation data.

Discussion
The statistical results of cloud distribution on the land, except the poles, show that the annual average cloud cover is about 50.4%, which is slightly lower than the previous results of 52 to 56% of the annual average cloud cover over land [11,57,58]. To avoid the differences caused by different statistical regions, we calculated the cloud frequency of global land areas, except the poles, based on Collection 6 of the MOD35 cloud mask for 16 years, which is about 53.6%. At the same time, there was a nearly 4% difference in cloud frequencies between the GLHCC and MOD/MYD35 in 2010 (Section 3.2.2). There may be two main reasons for this result. First, the MOD09 cloud mask is only stored in one bit, a hybrid pixel with both "cloudy" and "clear" will be identified as "clear". Therefore, the MOD09 cloud mask is not good at identifying thin clouds with very low optical thickness, and certainly does not misclassify aerosols as clouds (Figures 9 and 10). Second, due to land cover deviation, the cloud cover of the MOD35 cloud mask is overestimated in some areas. Moreover, we found that the MOD35 cloud mask seems to be better at capturing objects with very low optical thickness (very thin clouds or aerosols) (Figures 9 and 10). This sensitivity to objects with very low optical thickness is more useful for downstream product filtering data than for long-term cloud frequency calculations. As it is obviously unfair to classify such an object with extremely low optical thickness as a 100% cloud. The difference in attitude between the MOD09 and MOD35 cloud masks to-wards objects with very low optical thickness is also the reason for the difference in cloud frequency between them. In practical application, identifying the object with extremely low optical thickness as "Clear" is a better indicator of the true cloud distribution than identifying as a 100% cloud. Therefore, it can be argued that the MOD35 cloud mask is more suitable for downstream products, but the MOD09 cloud mask is more suitable for computing long-term cloud frequencies.
In the process of consistency check, the experiment well proves that the product cannot be evaluated from a single aspect. We choose which product to use based on the specific application, because each product has its advantages and disadvantages. The misestimation of cloud frequency in PATMOS-X was mainly reflected in the water surface, coastline, and mountainous areas. The abnormal cloud frequencies in MOD/MYD35 were mainly low cloud frequency over some water surfaces and high cloud frequency over exposed land surfaces (river channels, sparse grasslands, and cultivated land). The disadvantage of the GLHCC lies in its misjudgment of snow in high-altitude areas. The reason why this type of snow was not successfully removed by SWIR threshold method is that its reflectivity in short-wave infrared band is higher than the conservative SWIR threshold. This error occurs only at very high elevations, such as the Andes and Himalayas. Compared with the large area of misestimation by MOD/MYD35 and PATMOS-X, the GLHCC has a smaller misjudgment area. The advantages of the GLHCC products mainly originate from the relatively accurate statistics of cloud amounts of the MOD09 cloud mask. Although each product has its own advantages and disadvantages, on the whole, GLHCC can better represent the real situation of cloud frequency distribution.

Conclusions
In this study, the MOD09 cloud mask was improved by using a short-wave infrared threshold method and a Band 2/6 threshold method. The global 16-year average ten-day cloud frequency with a resolution of 1 km was produced based on the improved cloud mask and improved by VSNR and albedo products. The accuracy of the final GLHCC product was assessed by ground observations, MOD/MYD35 and PATMOS-X cloud data.
The short-wave infrared threshold method and Band 2/6 threshold method can effectively reduce the confusion of snow and ice, high brightness regions, and clouds in the GLHCC cloud mask. VSNR can effectively remove the orbital artifacts in cloud products, and the removal of albedo anomalies makes the cloud frequency at the land and water interface more reasonable. These algorithms enable the GLHCC cloud mask and the final product to more accurately reflect the global cloud distribution.
In the quality assessment stage, the correlation and standard errors between cloud products and observed data in 37 time periods (i.e., the year is broken down into 37 ten-day periods) were obtained by direct verification. The cloud products in 37 time periods exhibit a strong correlation with the observed data along with reasonable RMSE. A consistency check showed that the cloud frequency of GLHCC was lower than that of MOD/MYD35 and PATMOS-X on the whole, but better than MOD/MYD35 and PATMOS-X concerning the details. By regression analysis, R and RMSE between the GLHCC, MOD/MYD35, PATMOS-X, and 2010 observation data were obtained. The GLHCC had the greatest correlation with MOD/MYD35 and the smallest RMSE, while PATMOS-X had the worst correlation with the observed data and the largest RMSE. When using observational data as truth values, the GLHCC and MOD/MYD35 performed similarly, both outperforming PATMOS-X.
The experiments and studies in this paper have proved the high accuracy and availability of GLHCC product data, but there are still some problems and deficiencies. Due to the huge amount of data and the difficulty of calculation, the single threshold in the algorithm cannot adapt to the uncertainties in the global region. In addition, our algorithm, operating based on the MOD09 cloud mask, only excludes the non-cloud pixels misclassified as clouds, and clouds misclassified as non-cloud pixels may be missed. Therefore, in future studies, the single threshold value will be promoted to the dynamic threshold value to obtain the best threshold segmentation boundary in the global scope. Moreover, the comprehensive use of various cloud masks should be considered to recognize cloud pixels globally. For example, by combining the MOD09 and MOD35 cloud mask, the advantages of both are integrated and the disadvantages are eliminated to produce a more accurate cloud climatology. The lack of cloud climatology data in ocean areas may be filled in with the improved MOD35 cloud masks or other cloud products.