1. Introduction
The explosion of remote sensing data and the advent of large-scale cloud computing have significantly advanced scientific discoveries across various fields [1,2,3,4]. A pivotal catalyst in these advancements lies in the ability of optical satellite sensors to trace the physical properties of land with higher spatial resolution and at shorter temporal intervals. Nonetheless, optical sensors have inherent limitations: most images are inevitably contaminated by clouds, which introduces biases into the analysis of the imagery. Thus, to mitigate these biases, the development of automated, accurate, and efficient methods for cloud and cloud shadow detection is crucial, especially in the preprocessing of large-scale remote sensing data. Sentinel-2 satellite imagery provides valuable datasets with a wide swath width, high spatial resolution, and frequent revisit times. It includes bands across the visible, near-infrared, and shortwave infrared ranges. However, cloud detection in Sentinel-2 imagery remains a significant challenge due to the lack of a thermal band [5].
Various cloud and cloud shadow detection algorithms have been developed for Sentinel-2 images, and they can be broadly classified into physical-rule-based algorithms, multi-temporal-based algorithms, and machine-learning-based algorithms.
Physical-rule-based algorithms exploit the large differences in spectral and spatial characteristics between clouds and ground objects, and cloud and cloud shadow masks are obtained by setting thresholds on single-band reflectance, band ratios, and band differences [6,7]. These methods are simple and effective for screening typical clouds and cloud shadows in simple scenes. They can operate directly on an image without relying on sample data or additional auxiliary data. However, their reliance on empirically determined parameters and their sensitivity to parameter selection can lead to the omission of thin clouds and the inclusion of bright, non-cloud objects (e.g., snow). Temporal and spatial variations make it even more difficult to find suitable thresholds.
Multi-temporal algorithms typically use cloud-free images or pixels as a reference and mark pixels whose reflectance differs significantly from the reference as potential cloud or cloud shadow pixels [8,9,10]. Cloud effects can also be identified by comparing observed values with the values predicted by a time-series change-trend model fitted to non-cloud pixels, which requires an initial cloud mask [11,12,13]. The 5-day revisit period and free availability of Sentinel-2 imagery facilitate the development of multi-temporal cloud detection algorithms. In comparison with single-temporal cloud detection algorithms, multi-temporal algorithms leverage temporal information to complement physical properties, and they can better distinguish cloud effects from ground objects [14]. However, their accuracy can be limited by the reference image, the initial cloud mask, and real land-cover changes.
Machine-learning-based algorithms, including traditional machine learning methods and deep learning methods, treat cloud and cloud shadow detection as a classification problem, optimizing the parameters of a designed model on a large amount of representative training data [15]. The success of machine-learning-based methods relies on the availability of a large number of labeled samples, and many cloud detection datasets have been published [16,17,18]. Traditional machine learning methods, such as the support vector machine (SVM) [19], random forest [20], Bayesian classifiers [17], fuzzy clustering [21], boosting [22], and neural networks [23], have achieved significant results in cloud detection. Compared to non-machine-learning methods, these approaches can automatically uncover complex nonlinear relationships between different patterns while also achieving end-to-end rapid data processing. However, traditional machine-learning-based approaches, together with the above physical-rule-based and multi-temporal algorithms, usually rely on manual feature selection and use limited spatial-context information; thus, they cannot effectively handle cloud and cloud shadow detection in complex scenes.
Recently, deep learning methods have been increasingly used in cloud detection and have achieved outstanding performance [24,25,26,27]. Deep-learning-based algorithms automatically learn spatial and semantic features directly from training data, avoiding the need for manual feature selection and reducing the reliance on subjective experience. Inspired by state-of-the-art deep learning architectures, such as U-Net [28], DeepLab v3+ [29], and ResNet [30], many cloud detection models have been proposed [16]. Techniques such as weakly supervised learning and domain adaptation have also been developed to address the challenge of insufficient labeled data in cloud detection [31,32,33,34]. Because they learn high-level feature representations automatically, deep learning methods can achieve more accurate results in complex scenes, such as snow/ice-covered areas and urban areas. However, deep-learning-based methods still rely heavily on representative training data, which may not be available in sufficient quantities, especially for thin clouds and cloud shadows; indeed, these two classes are difficult to define from a single scene.
Although a number of cloud detection algorithms are currently available for Sentinel-2 imagery, a comparison of their results shows that their performance varies depending on the reference dataset, and no algorithm has been able to achieve both a producer's accuracy and a user's accuracy greater than 90% [5]. Google introduced the Cloud Score+ (CS+) method just a few months ago [34]. This method skips cloud detection and employs a weakly supervised approach to directly quantify the usability of the data. CS+ performs better than existing methods at masking unusable data, such as clouds and cloud shadows, but it still relies on manual trade-offs between inclusions and omissions, and it does not delineate the boundaries among clouds, thin clouds, and cloud shadows. In general, an accurate and detailed cloud and cloud shadow mask for Sentinel-2 imagery remains elusive.
This study attempted to employ end-to-end spatial–temporal deep learning models to achieve high-quality cloud detection. The combination of spectral, spatial, and temporal features was expected to improve the results, especially for thin clouds and cloud shadows. We employed a classification system consisting of six categories to obtain more detailed cloud and cloud shadow detection results while reducing intra-class variance and simplifying the pattern complexity. Additionally, by utilizing dense time-series Sentinel-2 data, based on the six-class classification system, we constructed a time-series cloud detection dataset distributed across mainland China. This dataset can serve to support the construction and validation of time-series models.
2. Materials and Methods
This study focused on producing finer cloud and cloud shadow masks of Sentinel-2 imagery across mainland China. Spatial–temporal deep learning (STDL) architectures were tested for the task of series image classification. We also developed a new time-series dataset to support data-driven deep learning methods.
2.1. Satellite Data
We employed the Sentinel-2 Level-1C (L1C) product in this work to build the time-series cloud detection dataset, as it contains rich temporal and spatial information. The Sentinel-2 mission consists of two satellites, which makes the revisit period short: 5 days at the equator with both satellites and 2–3 days at mid-latitudes. The L1C product provides the scaled top-of-atmosphere (TOA) reflectance of 13 spectral bands, where the digital number (DN) value is the reflectance multiplied by 10,000. There are four bands with a resolution of 10 m (B2, B3, B4, and B8), six bands with a resolution of 20 m, and three bands with a resolution of 60 m. The details of the band definitions can be seen in Table 1. In addition, the Sentinel-2 L1C product also contains a true-color image (TCI), which is a composite of B4 (red), B3 (green), and B2 (blue). TOA reflectance values from 0 to 0.3558 are rescaled to a range from 1 to 255 in the TCI.
It is worth noting that the DN values were shifted by 1000 in PROCESSING_BASELINE ‘04.00’ or above after 25 January 2022. These shifted values were corrected during our data processing.
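As an illustration, a minimal sketch of this correction is given below. It assumes the scaling and offset described above (reflectance = DN / 10,000, with an additional shift of 1000 for PROCESSING_BASELINE '04.00' and later); the function name and the string-based baseline check are placeholders, not the exact processing code used in this study.

```python
import numpy as np

def dn_to_toa_reflectance(dn, processing_baseline="04.00"):
    """Convert Sentinel-2 L1C digital numbers (DN) to TOA reflectance.

    The DN is the reflectance multiplied by 10,000; for PROCESSING_BASELINE
    '04.00' and later (after 25 January 2022) the DN is additionally shifted
    by +1000, so this offset is removed before rescaling.
    """
    offset = 1000.0 if processing_baseline >= "04.00" else 0.0
    return (dn.astype(np.float32) - offset) / 10000.0

# Example: a DN of 2500 from a post-January-2022 product -> reflectance 0.15
print(dn_to_toa_reflectance(np.array([2500]), "04.00"))  # [0.15]
```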
We selected 36 tiles within the domain of mainland China to cover different climate zones and types when constructing the dataset. The spatial distribution of these tiles was the same as that in another Sentinel-2 cloud detection dataset [16], as illustrated in Figure 1. The complete time series of products was collected for each tile. The temporal scope for the majority of the tiles ranged from 2018 to 2019, while for some tiles, it extended from 2016 to 2022. In total, 13,306 L1C products were collected.
2.2. Class Schema
During dataset preparation and model inference, we classified cloud cover into six categories: clear sky, (opaque) clouds, thin clouds, haze, (cloud) shadow, and ice/snow. The definitions of these categories were primarily derived from the visual appearance of a 60 m × 60 m neighborhood at a specific location, also called the region of interest (ROI), in a time-series TCI, rather than from their physical characteristics:
Clear sky was defined as the land surface being clearly observable.
Opaque clouds were defined as the land surface being totally blocked by clouds.
Thin clouds were defined as surface features being discernible but located within semi-transparent regions with recognizable shapes.
Haze was defined as surface features being discernible but located within homogeneous semi-transparent regions (without recognizable shapes).
Cloud shadows were defined as a sudden decrease being observed in surface reflectance.
Ice/snow was defined as regions where the surface became white and brighter and exhibited texture or where a melting process was observable.
Thin clouds and haze could sometimes be challenging to distinguish visually when it was difficult to recognize the shape of the semi-transparent region. Both very slight and heavy semi-transparency effects were defined as haze/thin clouds, which meant that some samples in the haze and thin cloud classes could be similar to those in the clear sky or opaque cloud classes. While this categorization may introduce some imprecision, it was intentionally designed to reduce the intra-class differences of the clear sky and opaque cloud classes. When an ROI contained multiple targets, its category was assigned based on priority: the highest priority was given to “opaque cloud”, followed by “thin cloud”, “cloud shadow”, “haze”, and “ice/snow”, and, finally, “clear sky”.
For more details on category interpretation, refer to Section 2.4.
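For illustration, the priority rule above can be expressed as the following small sketch; the class identifiers are illustrative names, not those used in the dataset files.

```python
# Priority order used when an ROI contains multiple targets (Section 2.2):
# opaque cloud > thin cloud > cloud shadow > haze > ice/snow > clear sky.
PRIORITY = ["opaque_cloud", "thin_cloud", "cloud_shadow", "haze", "ice_snow", "clear_sky"]

def assign_roi_label(present_classes):
    """Return the single label for an ROI given the set of classes visible in it."""
    for cls in PRIORITY:
        if cls in present_classes:
            return cls
    return "clear_sky"

print(assign_roi_label({"haze", "cloud_shadow"}))  # cloud_shadow
```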
2.3. Overall Framework
Considering the difficulty of making semantic segmentation labels, especially for semi-transparent thin clouds, haze, and cloud shadows [5], we adopted a time-series pixel-wise classification framework based on the composites of the four bands with a 10 m resolution (B2, B3, B4, and B8). In this way, we could reduce the confusion in making labels and ensure consistency between the input features and the features from the TCIs used for interpretation.
STDL-based cloud detection model architectures take a sequence of image patches, denoted as $X \in \mathbb{R}^{t \times c \times h \times w}$, as input and yield a sequence of class labels $(y_1, y_2, \ldots, y_t)$ corresponding to the ROI at each timestamp. Here, $t$ denotes the number of timestamps, $c$ denotes the number of bands, and $h$ and $w$ denote the numbers of rows and columns of each image patch, respectively. The ROI is defined as an area of 6 × 6 pixels centered in each composite image, corresponding to an actual area of 60 m × 60 m.
In this setting, a cloud mask with a 60 m resolution could be generated using a sliding-window prediction approach, leading to a significant reduction in computational load compared to that with the original 10 m resolution. However, in practice, models with a short computation time are still required for pixel-wise classification when using deep learning methods. Therefore, in this research, we used shallow and simple models with fewer parameters; for details, see Section 2.5.
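As a rough illustration of the input/output arrangement, the sketch below shows the tensor shapes involved. The patch size of 18 pixels is only a placeholder (the exact patch dimensions are not restated here), whereas the 6 × 6 (60 m) ROI and the four 10 m bands follow the description above.

```python
import numpy as np

# Hypothetical sizes: t timestamps, 4 bands at 10 m, and a patch whose central
# 6 x 6 pixels (60 m x 60 m) form the ROI. The patch size of 18 is only a
# placeholder; the exact patch dimensions are not restated here.
t, patch, bands = 50, 18, 4
sequence = np.zeros((t, patch, patch, bands), dtype=np.float32)

# An STDL model maps the whole sequence to one label per timestamp, so a full
# tile can be covered by sliding this window with a 6-pixel (60 m) stride and
# writing each predicted label series into the corresponding 60 m output cell.
labels = np.zeros((t,), dtype=np.int64)  # one of the six classes per timestamp
print(sequence.shape, "->", labels.shape)  # (50, 18, 18, 4) -> (50,)
```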
2.4. Dataset Construction
First, 500 points were randomly generated for each tile, and all of these points were aligned to the pixel grid centers of the bands with a 60 m resolution (e.g., B10) to maintain consistency when comparing them with other products, since other cloud detection methods may use the cirrus band, which has a resolution of 60 m, as a feature.
Then, time-series image patches of two different sizes were cropped with each point as the center. Patches extracted from B2, B3, B4, and B8 served as the input TOA reflectance features for the model, and patches cropped from the TCI file were used for interpreting the class labels.
Figure 2 illustrates an example of labeled time-series patches, with the ROI marked by a red line. All ROIs that intersected with opaque clouds were labeled as opaque clouds. When interpreting samples of thin clouds, haze, and shadows, it helped to observe disturbances in brightness or contrast across the time-series images. For example, the ROI in the image block dated 10 November 2019 appeared slightly brighter and had lower contrast than the image blocks from adjacent times, indicating that it may have been obscured by semi-transparent objects. Additionally, it appeared homogeneous, with no distinct areas of cloud aggregation, leading to the determination that it belonged to the haze class. However, for the ROI in the image block dated 15 December 2019, although it also showed disturbances in brightness and contrast, the presence of areas of slight cloud aggregation around (above and below) the ROI led to the determination that it belonged to the thin cloud class. Other ROIs without an increase in brightness and a decrease in contrast compared to adjacent times could be classified as clear sky.
Despite the potential confusion between cloud and snow/ice samples due to brightness saturation, patches were interpreted as snow/ice if they were subsequently followed by a visible progression of snow/ice melting and were collected during the winter. Additionally, pixels with a normalized difference snow index (NDSI) greater than 0.6 could be considered ice/snow [10]. We further utilized this index to assist in identifying brightness-saturated ice/snow in the TCIs.
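The NDSI test mentioned above can be sketched as follows. For Sentinel-2, the index is commonly computed from the green (B3) and shortwave infrared (B11) bands, with B11 resampled to the B3 grid beforehand; the 0.6 threshold follows the criterion cited above, and the function name and sample values are illustrative only.

```python
import numpy as np

def ndsi(green_b3, swir_b11):
    """Normalized difference snow index for Sentinel-2 TOA reflectance:
    NDSI = (B3 - B11) / (B3 + B11); B11 (20 m) is assumed to have been
    resampled to the B3 grid beforehand."""
    green = np.asarray(green_b3, dtype=np.float32)
    swir = np.asarray(swir_b11, dtype=np.float32)
    return (green - swir) / np.maximum(green + swir, 1e-6)

b3 = np.array([0.55, 0.20])
b11 = np.array([0.08, 0.18])
print(ndsi(b3, b11) > 0.6)  # [ True False]: only the first pixel is a snow/ice candidate
```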
Finally, there were 17,292 groups of labeled time series in total, comprising 4,450,140 individual image patch and label pairs. The detailed sample numbers for each class are shown in Table 2. This dataset was stored in TFRecord format for better file input performance, and it has been shared at https://zenodo.org/records/10613705 (accessed on 4 February 2024).
2.5. Spatial–Temporal Deep Learning Models
The general architecture of an STDL model is shown in Figure 3. As mentioned in Section 2.3, small and simple models were adopted in this study. In the STDL models, the inputs were first normalized by the means and standard deviations of each band; then, the spatial features of each timestamp were separately extracted by the shared convolutional neural network (CNN) module, followed by global average pooling; finally, the classification module computed the logits of the different classes in order, and the prediction labels were generated with the argmax operation. In this study, four STDL models with different classification modules were tested.
2.5.1. CNN Module
Considering that the size of the input image patches is much smaller than the requirements of state-of-the-art image classification backbones, such as ResNet and EfficientNet [35], a shallow and simple CNN module was used. The CNN module consisted of five convolutional layers with 64, 128, 128, 128, and 128 filters, respectively. The kernel size of the first layer differed from that of the subsequent four layers, and the stride was 2 for the first two layers and 1 for the rest. Each CNN layer was followed by batch normalization and a rectified linear unit (ReLU) activation function. The TimeDistributed layer of Keras was utilized to apply the shared CNN to all timestamps concurrently. The CNN output for each time step was then reduced by global average pooling to a spatial feature vector of length 128.
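A minimal Keras sketch of such a CNN module is shown below. The kernel sizes are placeholders because the exact values are not reproduced here, while the filter counts, strides, batch normalization, ReLU activations, and global average pooling follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_module(kernel_sizes=(3, 3, 3, 3, 3)):
    """Shared per-timestamp feature extractor: five Conv2D layers with
    64, 128, 128, 128, 128 filters, stride 2 for the first two layers and
    1 afterwards, each followed by BatchNorm and ReLU, then global average
    pooling to a 128-d vector. Kernel sizes here are placeholders."""
    filters = (64, 128, 128, 128, 128)
    strides = (2, 2, 1, 1, 1)
    cnn = tf.keras.Sequential(name="shared_cnn")
    for f, k, s in zip(filters, kernel_sizes, strides):
        cnn.add(layers.Conv2D(f, k, strides=s, padding="same", use_bias=False))
        cnn.add(layers.BatchNormalization())
        cnn.add(layers.ReLU())
    cnn.add(layers.GlobalAveragePooling2D())  # -> (batch, 128)
    return cnn

# Applied to every timestamp of a (batch, t, h, w, 4) input:
# features = layers.TimeDistributed(build_cnn_module())(inputs)  # (batch, t, 128)
```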
2.5.2. Classification Modules
We tested four types of classification modules: dense, LSTM [36], Bi-LSTM, and transformer [37]. All classifiers took the feature vectors from the CNN as input and used a dense layer with six units as the output layer to obtain the logits of the six classes. The softmax activation function was used in the output layer to obtain the probability distribution across the six classes.
A dense classifier was used as a reference. It had only one dense layer with 128 units as a middle layer, with a ReLU activation function. It was implemented with the Dense layer from the Keras library; the same weights were shared among all time steps, which were processed separately by using the TimeDistributed layer of Keras. Thus, it was not able to handle any temporal dependencies.
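A minimal sketch of this reference head is given below, assuming the 128-dimensional per-timestamp features produced by the CNN module.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Per-timestep dense classifier (no temporal mixing): a shared 128-unit ReLU
# layer and a 6-way softmax output, both wrapped in TimeDistributed so that
# every timestamp is processed independently with shared weights.
dense_head = tf.keras.Sequential([
    layers.TimeDistributed(layers.Dense(128, activation="relu")),
    layers.TimeDistributed(layers.Dense(6, activation="softmax")),
])
# probs = dense_head(features)  # (batch, t, 128) -> (batch, t, 6)
```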
In the LSTM classifier, an LSTM layer with 128 units was used to encode temporal dependencies. LSTM is a type of RNN in which the output of every time step is affected by both the current and previous inputs. The LSTM layer of Keras was used; the full sequence of outputs was returned to support series classification, and all other parameters were set to their default values.
The Bi-LSTM classifier used a bidirectional LSTM layer with 128 units. This layer separately processed the input sequence backward and forward, and the output sequence was the sum of the two directional outputs. In this way, the output of each time step could be influenced by all inputs. A bidirectional wrapper of Keras was used to implement the bidirectional LSTM.
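The two recurrent heads can be sketched as follows; summing the two directional outputs of the Bi-LSTM corresponds to Keras's merge_mode="sum" option, and the output layer is the shared 6-way softmax described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# 128-unit LSTM head returning the full output sequence for series classification.
lstm_head = tf.keras.Sequential([
    layers.LSTM(128, return_sequences=True),
    layers.Dense(6, activation="softmax"),
])

# Bidirectional variant: forward and backward outputs are summed element-wise,
# so each timestamp's prediction can be influenced by the whole sequence.
bilstm_head = tf.keras.Sequential([
    layers.Bidirectional(layers.LSTM(128, return_sequences=True), merge_mode="sum"),
    layers.Dense(6, activation="softmax"),
])
# probs = bilstm_head(features)  # (batch, t, 128) -> (batch, t, 6)
```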
The transformer architecture has achieved excellent accuracy in recent natural language processing and sequence classification applications. In its attention mechanism, each time step can attend to certain parts of the whole sequence and thereby capture temporal dependencies. Since there was no generation step in our cloud detection task, only the encoder part of the original architecture was used. The transformer encoder is made up of identical encoder blocks, each composed of one self-attention layer and a feedforward network. We used one transformer encoder block containing eight self-attention heads. The feedforward layer in the encoder block had a size of 128 and used the ReLU activation function. We used the same positional encoding method as that in the original transformer: a mixture of sine and cosine functions with geometrically increasing wavelengths [37]. Positional encoding was added to the input of the transformer module to make it aware of positions. The implementation from the official TensorFlow models (https://github.com/tensorflow/models/tree/master/official (accessed on 4 February 2024)) was adopted. An overview of the hyperparameters of all of these classifiers can be seen in Table 3. Following common practices from ResNet, we employed a step-wise learning rate schedule, a cross-entropy loss function, and an SGD optimizer with a momentum of 0.9, and we incorporated a warm-up technique. The models were trained for 90 epochs with a batch size of 20.
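The following is a minimal, self-contained sketch of such an encoder block built from standard Keras layers rather than the official TensorFlow Models implementation adopted here; the residual connections and layer normalization are standard transformer components included for completeness, and the head/feedforward sizes follow the description above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def sinusoidal_encoding(length, depth):
    """Sine/cosine positional encoding with geometrically increasing wavelengths."""
    pos = np.arange(length)[:, None]
    i = np.arange(depth)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    enc = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return tf.constant(enc, dtype=tf.float32)

def transformer_head(features, num_heads=8, ff_dim=128, num_classes=6):
    """One encoder block (8-head self-attention + 128-unit ReLU feed-forward)
    followed by a 6-way softmax; `features` is (batch, t, 128)."""
    d_model = features.shape[-1]
    x = features + sinusoidal_encoding(features.shape[1], d_model)
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    x = layers.LayerNormalization()(x + ff)
    return layers.Dense(num_classes, activation="softmax")(x)  # (batch, t, 6)
```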
2.5.3. Hyperparameter Optimization
While we explored the hyperparameter space, limitations in computational resources necessitated focusing on a specific range. Because we aimed to obtain a lightweight model suitable for large-scale cloud detection, we started with a larger number of layers, larger layer sizes, and more channels, and we gradually reduced them until there was a significant decrease in the overall accuracy on the validation set.
The step-wise learning rate schedule starts with a base learning rate that is divided by 10 at specified epochs (the 30th, 60th, and 80th). After determining the model structure, we explored different base learning rates using the STDL model with the LSTM classifier and selected the best-performing value.
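A sketch of this training configuration is shown below. BASE_LR and the warm-up length are placeholders, since the specific values are not restated here; the decay epochs, SGD momentum, epoch count, and batch size follow the description above.

```python
import tensorflow as tf

BASE_LR, WARMUP_EPOCHS = 0.01, 5  # placeholders; not the values chosen in the paper

def lr_for_epoch(epoch, lr=None):
    """Linear warm-up followed by dividing the base rate by 10 at epochs 30, 60, 80."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    factor = sum(epoch >= e for e in (30, 60, 80))  # 0-3 decays applied so far
    return BASE_LR / (10 ** factor)

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_for_epoch)
optimizer = tf.keras.optimizers.SGD(learning_rate=BASE_LR, momentum=0.9)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_ds, epochs=90, batch_size=20, callbacks=[lr_callback])
```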
2.6. Model Evaluation
For each tile, 80% of the generated sample points were randomly selected as the training set, with the remaining points reserved for evaluation. The numbers of validation sample timestamps were as follows: 129,401 for clear sky, 172,500 for (opaque) clouds, 24,431 for thin clouds, 18,458 for haze, 4677 for (cloud) shadows, and 37,533 for ice/snow. The models' performance was assessed using confusion matrices, precision, recall, and the F1 score.
For clear sky, opaque clouds, and ice/snow, the typical metrics were calculated. However, this was relatively complex for thin clouds, haze, and cloud shadows. Both thin clouds and haze were semi-transparent, had similar effects on images, and had no quantifiable criteria. In addition, thin clouds and haze could cover any type of object, including cloud shadows, which made this more like a multi-label classification task. Thus, some special metrics were further designed in consideration of these characteristics to gain a more comprehensive understanding of the accuracy performance. Practical precision for thin clouds and haze (THPP) is expressed as follows:

$$\mathrm{THPP} = \frac{\sum_{i \in \{T,H,O,S\}} \sum_{j \in \{T,H\}} M_{ij}}{\sum_{i \in \{C,O,T,H,S,I\}} \sum_{j \in \{T,H\}} M_{ij}}$$

Practical recall for thin clouds and haze (THPR) is expressed as follows:

$$\mathrm{THPR} = \frac{\sum_{i \in \{T,H\}} \sum_{j \in \{T,H,O,S\}} M_{ij}}{\sum_{i \in \{T,H\}} \sum_{j \in \{C,O,T,H,S,I\}} M_{ij}}$$

The practical F1 score for thin clouds and haze (THPF1) is expressed as follows:

$$\mathrm{THPF1} = \frac{2 \times \mathrm{THPP} \times \mathrm{THPR}}{\mathrm{THPP} + \mathrm{THPR}}$$

Practical precision for shadows (SPP) is expressed as follows:

$$\mathrm{SPP} = \frac{\sum_{i \in \{S,T,H\}} M_{iS}}{\sum_{i \in \{C,O,T,H,S,I\}} M_{iS}}$$

Practical recall for shadows (SPR) is expressed as follows:

$$\mathrm{SPR} = \frac{\sum_{j \in \{S,T,H\}} M_{Sj}}{\sum_{j \in \{C,O,T,H,S,I\}} M_{Sj}}$$

The practical F1 score for shadows (SPF1) is expressed as follows:

$$\mathrm{SPF1} = \frac{2 \times \mathrm{SPP} \times \mathrm{SPR}}{\mathrm{SPP} + \mathrm{SPR}}$$

where $M_{ij}$ denotes the value in the row of the true label ($i$) and the column of the predicted label ($j$) of the confusion matrix ($M$). The short names of these classes are C for clear sky, O for opaque clouds, T for thin clouds, H for haze, S for cloud shadows, and I for ice/snow. In the definitions of THPP, THPR, and THPF1, the errors from opaque clouds and cloud shadows were ignored because (1) masking heavy thin clouds and haze as opaque is acceptable when avoiding interference, (2) thin clouds and haze are sometimes so similar that there may be potential ambiguities in their differentiation, (3) some thin clouds and haze are usually mixed with cloud shadows near opaque clouds, and (4) there could be further processing in downstream applications to remove this semi-transparency and enhance the ground information [38]. Similarly, we defined SPP, SPR, and SPF1, in which the errors from thin clouds and haze were also ignored for cloud shadows.
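A sketch of how these practical metrics can be computed from the confusion matrix is given below. It assumes the ignore rules described above (confusion with an ignored class is counted as correct in the corresponding numerator) and the class ordering C, O, T, H, S, I; the function and variable names are illustrative only.

```python
import numpy as np

# Row/column indices of the 6 x 6 confusion matrix M (rows = true, columns = predicted).
C, O, T, H, S, I = range(6)

def practical_prf(M, positives, ignored):
    """Precision/recall/F1 for the union `positives`, not counting confusion
    with the classes in `ignored` as errors."""
    M = np.asarray(M, dtype=np.float64)
    tolerated = list(positives) + list(ignored)
    precision = M[np.ix_(tolerated, positives)].sum() / M[:, positives].sum()
    recall = M[np.ix_(positives, tolerated)].sum() / M[positives, :].sum()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# THPP/THPR/THPF1: thin clouds + haze, ignoring opaque clouds and shadows.
# SPP/SPR/SPF1:    shadows, ignoring thin clouds and haze.
# thpp, thpr, thpf1 = practical_prf(M, positives=[T, H], ignored=[O, S])
# spp, spr, spf1 = practical_prf(M, positives=[S], ignored=[T, H])
```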
Furthermore, some metrics were specifically defined to assess the effectiveness of masking “poor” or “good” data in various situations and to adapt to different classification schemes. These metrics were groups of precision, recall, and F1 score. The positive class of each group was the union of fine classes for a specified purpose. The details of the positive classes and abbreviations for each metric group are listed in Table 4.
The semi-transparency data mask regarded thin clouds and haze as positive classes. The general invalid data mask regarded opaque clouds, thin clouds, haze, and cloud shadows as positive classes, while the stricter one ignored haze. The general usable data mask regarded clear sky, ice/snow, and haze as positive classes. The general cloud mask regarded opaque clouds, thin clouds, and haze as positive classes. The general non-cloud mask regarded clear sky, haze, cloud shadows, and ice/snow as positive classes. All of these stricter metrics excluded the haze class.
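For illustration, the class groupings described above can be encoded as follows; the group names are shorthand for this sketch and do not reproduce the exact abbreviations used in Table 4.

```python
# Positive classes of each metric group, as described in the text; the
# "stricter" variants exclude the haze class from their general counterparts.
POSITIVE_CLASSES = {
    "semi_transparent":  {"thin_cloud", "haze"},
    "general_invalid":   {"opaque_cloud", "thin_cloud", "haze", "cloud_shadow"},
    "stricter_invalid":  {"opaque_cloud", "thin_cloud", "cloud_shadow"},
    "general_usable":    {"clear_sky", "ice_snow", "haze"},
    "stricter_usable":   {"clear_sky", "ice_snow"},
    "general_cloud":     {"opaque_cloud", "thin_cloud", "haze"},
    "stricter_cloud":    {"opaque_cloud", "thin_cloud"},
    "general_non_cloud": {"clear_sky", "haze", "cloud_shadow", "ice_snow"},
    "stricter_non_cloud": {"clear_sky", "cloud_shadow", "ice_snow"},
}

def binary_mask(label, group):
    """True if a fine-class label belongs to the positive class of a metric group."""
    return label in POSITIVE_CLASSES[group]

print(binary_mask("haze", "general_usable"), binary_mask("haze", "stricter_usable"))  # True False
```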
2.7. Comparison of Methods
Due to the absence of a large-scale annotated time-series cloud detection dataset before this study, it was difficult to find supervised machine learning methods for time-series cloud detection to use for comparison. We compared the results of our method with those of the s2cloudless [22], MAJA (MACCS-ATCOR Joint Algorithm, where MACCS is Multi-Temporal Atmospheric Correction and Cloud Screening, and ATCOR is Atmospheric Correction) [10], and Cloud Score+ [34] methods. The s2cloudless method, an official algorithm designed for generating cloud masks with a resolution of 60 m, utilizes 10-band reflectance data from a single-period image and is based on a gradient-boosting algorithm. It is considered the state of the art and was identified on the Pareto front among 10 cloud detection algorithms by Skakun et al. [5]. The MAJA algorithm is a multi-temporal approach that generates cloud masks at a resolution of 240 m. The results obtained from MAJA served as training samples during the development of the s2cloudless algorithm. Cloud Score+, which was proposed by Google LLC, Mountain View, CA, USA, has been available on the Google Earth Engine since the end of 2023. It utilizes a weakly supervised deep learning methodology to assess the quality of individual image pixels in comparison with relatively clear reference images. We exported the official Cloud Score+ time-series values of the validation points at a resolution of 60 m from the Google Earth Engine for comparison.
This comparison may have been biased due to the different classification schemas used by the algorithms. The s2cloudless method only identified non-clouds and clouds. The MAJA algorithm was able to detect clouds, thin clouds, cloud shadows, and high clouds separately, and a single pixel could have multiple classification labels. In particular, the Cloud Score+ method ignored the definition of different classes and gave a quality score that indicated the degree to which a pixel was like a clear pixel.
To maintain consistency in the comparison, we remapped these classes and used the metrics for the usable masks and all cloud masks shown in Table 4. Due to the slow processing speed of MAJA, only a subset of the products was selected for comparison.
4. Discussion
4.1. Motivation
Our motivation for undertaking this work was to enhance cloud detection accuracy when using Sentinel-2 data, especially for cloud shadows and thin clouds, with the aim of ensuring that downstream remote sensing applications can minimize interference from clouds while avoiding excessive loss of usable data.
In our comparative experiments, we observed that the s2cloudless method, which is the state of the art and was provided by official Sentinel-2 sources, failed to offer any information for removing cloud shadows and, at the same time, missed some thin and low-altitude clouds.
Improving capabilities for the detection of thin clouds and shadows is particularly challenging, as their boundaries are difficult to quantify, and most existing datasets do not include information on thin clouds and shadows. Using time-series data can make thin clouds and shadows more distinguishable.
However, it is challenging to find datasets that are suitable for training and validating supervised deep learning models on time-series data. To address this, we also constructed a more detailed time-series cloud detection dataset.
4.2. Classification System
Defining various cloud types and cloud shadows based only on image features is difficult. When designing the classification system for the dataset and STDL models, we did not attempt to address the definitions related to clouds, thin clouds, etc.
From the perspective of utilizing deep learning models, our approach was to reduce intra-class variance, leading us to roughly design a classification system comprising six categories based on the visual characteristics of TCIs. In addition, we employed a time-series interpretation approach based on TCIs to enhance the visual distinctions among different cloud types and shadows.
In the results, it is evident that the six-category classification strategy was effective. As shown in Table 5, opaque clouds, clear sky, and usable data exhibited very high classification accuracy. Additionally, the confusion matrices in Figure 4 indicate that opaque clouds were almost never misclassified as clear sky or haze, and clear sky was almost never misclassified as clouds. This characteristic is useful in practice, as the cloud detection step is typically expected to remove invalid data.
Figure 6a further demonstrates that even against a bright surface background, thin clouds and their shadows could be effectively detected. The other patches in Figure 6 also indicate that we were able to achieve cloud detection results with rich details.
4.3. Input Feature Space
In this study, we primarily enhanced cloud detection accuracy through spatial and temporal features while utilizing only four spectral bands, namely blue, green, red, and near-infrared. This decision was based on the following reasons:
The model read the entirety of the time-series data at once during inference. Using more bands would increase the runtime, data loading workload, and memory usage, which would not be conducive to applying the model to inference on massive datasets. In addition, through our testing, we did not observe an improvement in accuracy by directly increasing the number of input channels based on the current model structure.
The different categories were defined based only on the visual features in TCIs. Introducing more bands may require the redefinition of categories. For example, some very thin, high-altitude cirrus clouds that were almost invisible in the TCIs were classified as clear sky, but they could be detected using cirrus bands.
According to Table 5 and Figure 4, even the STDL–dense model, which utilized only spatial features, demonstrated high accuracy in classifying opaque clouds and usable data, and its confusion matrix showed almost no misclassification between opaque clouds and clear sky. After introducing time-series information, the accuracy of the extraction of thin clouds and shadows improved significantly; in particular, the omission errors were substantially reduced. This confirms the effectiveness of incorporating temporal and spatial features to enhance cloud detection.
4.4. Limitations
4.4.1. Ice/Snow
The interpretation based on TCIs made it more challenging to distinguish between opaque clouds and snow/ice, as they both appeared to be saturated in the TCIs. Consequently, some confusion between opaque clouds and snow/ice was inevitable in this dataset. Although using shortwave infrared bands could help in the identification of snow/ice, it would further complicate the construction of the time-series dataset. Hence, they were not employed during the construction of the dataset.
The accuracy evaluation results also confirmed this issue: the confusion matrices in Figure 4 report approximately 3% of snow/ice being misclassified as clouds. The tile labeled ‘44SNE’ in Figure 5 also exhibited lower accuracy for usable data. This tile is located in the Qinghai–Tibet Plateau region, which contains some glaciers, has a significant presence of snow/ice in winter, and is often influenced by clouds.
4.4.2. Haze
We defined the ‘haze’ category from the perspective of image features, but it may actually encompass various types of aerosols. Due to differences in the color rendering of different screens during interpretation, confused labels are inevitable, especially for image patches that are only slightly affected by aerosols. In regions that are frequently influenced by aerosols, the lack of sufficient clear-sky image patches for reference increases the confusion. The spatial variation in the accuracy distribution shown in Figure 5 also confirms this issue: the difference between the F1 score for the general usable mask and that for the stricter usable mask was more pronounced in the relatively humid southeast region. Despite this, users can still flexibly choose the opaque cloud, thin cloud, and cloud shadow categories based on their needs to mask the data and avoid the interference of haze-type errors in local areas.
4.4.3. Single Labels
Unlike typical land cover classifications, the categories in this study were not entirely mutually exclusive. For example, an ROI can simultaneously contain thin clouds, cloud shadows, and snow. However, due to the significant workload involved in annotating time-series data, we only created a single-label dataset.
This may not fully capture the complexity of cloud and cloud shadow coverage, and the label annotations may be influenced by the annotators’ biases. Exploring methods for rapidly constructing multi-label datasets in the future, along with the utilization of appropriate loss functions, holds promise for further enhancing model accuracy and practical utility.
4.4.4. Features
The use of more bands can enrich the features of targets. However, as mentioned in Section 4.3, the model structure used in this study may not have effectively leveraged the information from additional bands. The reason could be the increased feature complexity: from experience in manual interpretation, for example, some thin clouds, haze, and clear skies may be challenging to distinguish in the shortwave infrared band, and some low-altitude opaque clouds are not apparent in the cirrus band.
Perhaps employing a more flexible structure, such as multi-scale branch inputs or channel attention, could allow for the better utilization of a larger feature space. We have not explored this avenue to date due to the trade-off between the costs and potential accuracy gains.
4.5. Model Comparison
The STDL model using Bi-LSTM as a classifier had the highest overall accuracy, but the model’s inference time was longer. On the other hand, the STDL model based on the transformer classifier performed similarly overall but had a faster inference speed.
For the goal of cloud detection, a higher recall is more practically valuable than precision. The s2cloudless method achieved a higher recall for the stricter cloud mask than MAJA and certain CS+ thresholds, indicating that it is an excellent method.
However, the goal of detecting usable data is more practically meaningful than detecting clouds. Although the MAJA method exhibited lower accuracy in cloud detection, it had a better ability to detect usable data. In particular, the precision values of MAJA for the general usable mask and the stricter usable mask were significantly higher than those of s2cloudless and certain CS+ thresholds, indicating that the usable data labels provided by the MAJA method were highly reliable. Of course, this came at the cost of recall, leading to the omission of some usable data.
In the process of preparing this study, we noticed that the Google team had introduced the CS+ algorithm, and we included it in the method comparisons in this study. The CS+ algorithm requires users to explore score thresholds for different data to balance the precision and recall. Increasing the CS+ score threshold decreases the recall of usable data and increases the precision. For the dataset used in this study, cs-0.5 had the highest F1 score for the general usable mask, and cs-0.65 had the highest F1 score for the stricter usable mask, both of which were lower than those of the simple supervised STDL models used here. Although this method can provide high-quality binary masks for usable data, it cannot distinguish thin clouds, haze, and cloud shadows for various post-processing applications, such as image enhancement in areas with thin clouds and shadows.
4.6. Spatial Generalization Ability
Using as much representative training data as possible is crucial for engineering applications. Therefore, the training data used in our previous experiments came from all 36 tiles. In Figure 5, it can be observed that the F1 scores are relatively low at certain locations, such as the tiles near the coast. We found that the direct reason is that the proportion of thin cloud and haze ROIs in these tiles is larger than in other regions, and the accuracies of these classes are lower than those of opaque clouds and clear sky, thus lowering the average F1 scores.
In this section, to further discuss the model’s spatial generalization ability, we split the training and testing sets based on tiles, then conducted training and evaluation.
The tiles used for testing needed to encompass a variety of terrains and landforms to ensure the reliability of the evaluation results. Because we had a relatively small number of tiles, we manually selected six of them for validation rather than randomly selecting tiles for cross-validation. The positions of these six validation tiles are marked with filled rectangles in Figure 1.
Then, we trained the STDL model using training points from the other 30 tiles in the dataset and evaluated it using validation points from these 6 validation tiles. As in Section 3.3, only the STDL model using the transformer classifier was tested. For convenience, this trained model is referred to as ‘STDL-val’ in the following.
The F1 scores for the general/stricter usable and cloud masks were used for comparison. We also recalculated the F1 scores of the other methods (s2cloudless, MAJA, and CS+) using validation points from these validation tiles. The F1 scores of the different mask types for the different methods are listed in Table 8.
Compared to the originally trained STDL model, the STDL-val model's F1 scores for all four masks decreased slightly. This could be attributed to a reduction in the correlation between the training and testing data, as points that are closer in space typically exhibit greater correlation [39]. Additionally, reducing the amount of training data may lead to a poorer model. Nonetheless, the F1 scores of the STDL-val model still surpassed those of the other existing methods. This result indicates that extending the STDL model constructed in this study to cloud detection in Sentinel-2 data over other regions of China is promising.
5. Conclusions
In this study, we explored the effectiveness of using a simple supervised spatial–temporal deep learning model for fine-grained cloud detection in time-series Sentinel-2 data. Simultaneously, we contributed the first long-time-series dataset specifically designed for cloud detection to facilitate both model development and accuracy assessment. To avoid boundary uncertainty, we designed the models and constructed the dataset based on a pixel-wise classification framework.
Drawing inspiration from existing research on deep learning models for time-series classification, we built four simple sequence-to-sequence classification models. All of these models took four bands with a resolution of 10 m as input, utilized multi-layer CNNs with shared weights as spatial feature extractors, and employed a fully connected dense layer, LSTM, Bi-LSTM, and transformer as classifiers.
The results showed that the supervised STDL models could produce a detailed cloud mask with thin clouds and cloud shadow information, while most cloud detection methods only produced a binary mask. The STDL models with the Bi-LSTM classifier and transformer classifier exhibited close performance and were better than the other two classifiers. Although the model accuracy of the transformer classifier was slightly lower than that of Bi-LSTM, it had higher computational efficiency. This cost-effectiveness enables the generation of high-quality usable data masks for large data volumes.
In this study, we only used the visual appearance of TCIs to define categories, designed the dataset and models solely for single-label classification, and used only the four 10 m bands, which may have hindered the learning of complex cloud and cloud shadow coverage patterns. In addition, the time interval between acquisitions in the overlap areas of adjacent scanning paths is half the normal revisit period; we temporarily ignored the potential effects of this varying time interval on model accuracy in this study.
Future research endeavors include extending the STDL models to more satellite imagery, such as that of the Chinese Gaofen series, as well as to multi-label classification tasks, and handling more spectral bands. In addition, building a deep learning model for computationally efficient single-period segmentation while leveraging the accurate cloud masks generated by an STDL model presents an intriguing avenue for further exploration.