Sentinel-2 Image Scene Classification: A Comparison between Sen2Cor and a Machine Learning Approach

Abstract: Given the continuous increase in the global population, food producers must either intensify cropland use or expand farmland, making the mapping of land cover and land use dynamics vital in remote sensing. In this regard, identifying and classifying high-resolution satellite imagery scenes is a prime challenge. Several approaches have been proposed, based either on static rule-based thresholds (limited in diversity) or on neural networks (with data-dependent limitations). This paper adopts an inductive approach to learning from surface reflectances. A manually labeled Sentinel-2 dataset was used to build a Machine Learning (ML) model for scene classification, distinguishing six classes (Water, Shadow, Cirrus, Cloud, Snow, and Other). This model was assessed and compared to the European Space Agency (ESA) Sen2Cor package. The proposed ML model presents a micro-F1 value of 0.84, a considerable improvement over the corresponding Sen2Cor performance of 0.59. Focusing on the problem of optical satellite image scene classification, the main research contributions of this paper are: (a) an extended manually labeled Sentinel-2 database, adding surface reflectance values to an existing dataset; (b) ensemble-based and Neural-Network-based ML models; (c) an evaluation of model sensitivity, bias, and ability to classify multiple classes over geographically diverse Sentinel-2 imagery; and, finally, (d) the benchmarking of the ML approach against the Sen2Cor package.


Introduction
In remote sensing, classifying parts of high-resolution optical satellite images into morphological categories (e.g., land, water, cloud, etc.) is known as scene classification [1]. Recently, the challenge of optical satellite image scene classification has been the focal point of many researchers. Scene classification plays a key role in urban and regional planning [2,3], environmental vulnerability and impact assessment [4,5], and natural disaster and hazard monitoring [6], for example. Further, given the current population growth and industrial expansion needs, the assessment of land-use dynamics is certainly required for the well-being of individuals.
Earth observation can be defined as gathering physical, chemical, and biological information about the planet using Earth surveying techniques, which encompass the collection of data [7]. Among such Earth surveying techniques, optical satellites play a major role, and one such satellite is Sentinel-2 [8]. Sentinel-2 is part of the Earth observation mission of the Copernicus Programme and systematically acquires optical imagery at high spatial resolution. The main research contributions of this paper are:
1. Focusing on the problem of optical satellite image scene classification, ensemble-based and Neural-Network-based ML models are proposed;
2. An extended manually labeled Sentinel-2 database is set up by adding surface reflectance values to a previously available dataset;
3. The diversity of the formulated dataset and the ML model sensitivity, bias, and generalization ability are tested over geographically independent L1C images;
4. The ML model is benchmarked against the existing Sen2Cor package, which was developed for calibrating and classifying Sentinel-2 imagery.

Sen2Cor
While capturing satellite images, the atmosphere influences the spatial and spectral distribution of the electromagnetic radiation from the Sun before it reaches the Earth's surface. As a result, the reflected energy recorded by a satellite sensor is affected and attenuated, requiring an atmospheric correction. Atmospheric correction is the process of removing the effects of the atmosphere on the Top-of-Atmosphere (TOA) reflectance values of original satellite images; TOA reflectance is a unitless measurement that provides the ratio between the radiation reflected and the incident solar radiation on a given surface. Bottom-Of-Atmosphere (BOA) reflectance, on the other hand, is defined as the fraction of incoming solar radiation that is reflected from Earth's surface for a specific incident or viewing case.
Sen2Cor is an algorithm whose pivotal purpose is to correct single-date Sentinel-2 Level-1C products for the effects of the atmosphere and deliver a Level-2A surface reflectance product. The Level-2A (L2A) output consists of a Scene Classification (SCL) image with eleven classes, together with Quality Indicators for cloud and snow probabilities, Aerosol Optical Thickness (AOT) and Water Vapour (WV) maps, and the surface (or BOA) reflectance images at different spatial resolutions (60 m, 20 m, and 10 m). Table 1 presents the eleven classes with their corresponding color representation in the SCL image. Each particular classification process is discussed next [34].

Table 1. Sen2Cor Scene Classification (SCL) classes and their color representation.

0: No Data (missing data on projected tiles) (black)
1: Saturated or defective pixel (red)
2: Dark features / shadows (very dark gray)
3: Cloud shadows (dark brown)
4: Vegetation (green)
5: Bare soils / deserts (dark yellow)
6: Water (dark and bright) (blue)
7: Cloud low probability (dark gray)
8: Cloud medium probability (gray)
9: Cloud high probability (white)
10: Thin cirrus (very bright blue)
11: Snow or ice (very bright pink)

Figure 1 describes the Sen2Cor cloud/snow detection algorithm: it performs six tests, and the result for each pixel is a cloud probability (ranging from 0 for high-confidence clear sky to 1 for high-confidence cloudy sky). After each step, the cloud probability of a potentially cloudy pixel is updated by multiplying the current pixel cloud probability by the result of the test. Snow detection follows the same procedure with five different tests, resulting in values from 0 for a high-confidence clear (no snow) pixel to 1 for a high-confidence snowy pixel.
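As a minimal sketch, the multiplicative probability update described above can be written as follows (the individual test outcomes themselves are defined in the Sen2Cor documentation and are not reproduced here):

```python
def cloud_probability(initial_prob, test_outcomes):
    """Sequentially update a pixel's cloud probability by multiplying it
    by the outcome of each filter test (each outcome assumed to lie in
    [0, 1]), mirroring the update rule described above."""
    prob = initial_prob
    for outcome in test_outcomes:
        prob *= outcome
    return prob
```

Note that a single high-confidence "clear sky" outcome (0.0) zeroes the probability, regardless of the other tests.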

Vegetation
Two filters, namely the Normalized Difference Vegetation Index (NDVI) [35] and a reflectance ratio (R), are used to identify vegetation pixels. The NDVI and R filters are expressed by Equations (1) and (2) (NIR (Near Infra-Red), RED, and GREEN refer to the corresponding band values). Thresholds of T1 = 0.40 and T2 = 2.50 are set for NDVI and R, respectively. If the NDVI and R values exceed the corresponding thresholds, the pixel is classified as vegetation in the classification map.
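The vegetation test can be sketched as below. The NDVI formula is the standard (NIR − RED)/(NIR + RED); the exact form of the ratio R is given by Equation (2) of the paper, and the NIR/GREEN ratio used here is only an illustrative assumption:

```python
T1, T2 = 0.40, 2.50  # thresholds quoted in the text

def is_vegetation(nir, red, green):
    """Classify a pixel as vegetation when both filters exceed their
    thresholds. NDVI is the standard index; the band ratio R is assumed
    here to be NIR / GREEN for illustration only."""
    ndvi = (nir - red) / (nir + red)
    ratio = nir / green
    return ndvi > T1 and ratio > T2
```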

Soil and Water
Bare soil pixels are detected when their reflectance ratio R (Equation (2)) falls below a threshold T = 0.55. If the ratio R1 (Equation (3)) exceeds a threshold T = 4.0, the pixel is classified as bright water.

Cirrus Cloud
Under daytime viewing conditions, the presence of thin cirrus cloud in the upper troposphere is detected by a Sentinel-2 band 10 (B10) reflectance threshold. In the first step, all B10 pixels with a value between T = 0.012 and T = 0.035 are considered as thin cirrus; in the second step, after generating a probabilistic cloud mask, if the cloud probability is below or equal to 0.35, the pixel is classified as a thin cirrus cloud.
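A minimal sketch of this two-step cirrus test, using the thresholds quoted above:

```python
def is_thin_cirrus(b10, cloud_prob):
    """Two-step cirrus test sketched from the description above:
    (1) the B10 reflectance must lie inside the [0.012, 0.035] window;
    (2) the pixel's cloud probability must not exceed 0.35."""
    return 0.012 <= b10 <= 0.035 and cloud_prob <= 0.35
```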

Cloud Shadow
The cloud shadow mask is constructed using (a) a "geometrically probable" cloud shadow derived from the final cloud mask, sun position and cloud height distribution and (b) a "radiometrically probable" cloud shadow derived from a neural network [36].

Materials and Methods
This study evaluates the performance of the developed machine learning models in classifying water, shadow, cirrus, cloud, snow, and other scenes over Sentinel-2 imagery. This section focuses on (a) Dataset creation; (b) Classification Algorithms; (c) Feature analysis and, finally, (d) Experimental setup and architecture modeling.

Dataset Creation
In supervised machine learning, input-label pairs are required by the learning function [37,38]. In this regard, the overall (input-label) dataset creation process is described in Figure 2, which comprises the following steps:

1. For each product ID in the original dataset, the L1C products were downloaded from the CREODIAS [39] platform;
2. For each downloaded L1C product, a corresponding L2A product was generated using Sen2Cor v2.5.5. Afterwards, for each L2A product, the Scene Classification was retrieved, resulting in an extended dataset for Sen2Cor assessment;
3. The downloaded L1C products were resampled to 20 m (allowing spatial analysis) and the 13 bands of imagery were retrieved, resulting in an extended dataset for the ML model.

Original Data
Hollstein et al. [40] created a database of manually labeled Sentinel-2 spectra using false-color RGB images. The database consists of images acquired over the entire globe (Figure 3) and comprises 6.6 million points (exactly 6,628,478 points), each classified into one of the six classes presented in Table 2. It includes a wide range of surface types and geometries from a total of 60 products from five different continents: Europe (22), Africa (14), America (12), Asia (6), and Oceania (6). The data is described by 4 attributes: product_id, latitude, longitude, and class. As mentioned, Hollstein et al. [40] used false-color RGB images to classify the images. First, the L1C products (all bands) were spatially resampled to 20 m. Afterwards, RGB bands 1, 3, and 8 were used to classify clouds and shadow; RGB bands 2, 8, and 10 were used to classify cirrus and water; and RGB bands 1, 7, and 10 were used to classify snow and other. Additionally, the authors used a two-step approach to minimize human error: the labeled images were revisited to re-evaluate past decisions.

Extended Data
Knowing L1C product ID and coordinates for individual data points, we added surface reflectance information to each entry to build and assess a Machine Learning classification model and further compare it to the Sen2Cor algorithm.
First of all, using the list of the 60 products from the original data, the relevant L1C products were downloaded using the CREODIAS [39] platform. Then, the L1C products were resampled to 20 m (allowing spatial analysis) and the 13 bands were retrieved for each product.
Extended Dataset for the ML model.
The extended data used to build the ML model is composed of 17 attributes: product_id, latitude, longitude, B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B10, B11, B12, and class. Here, B refers to a band; each band represents the surface reflectance (ρ) value at a different wavelength [41]. Surface reflectance is defined as the fraction of incoming solar radiation that is reflected from Earth's surface for a specific incident or viewing case. In general, the reflectance values (also known as TOA reflectance) range from 0.0 to 1.0. Figure 4 details the class-wise ρ value distribution using violin plots. Here, we can observe that, for each class, the ρ value of each band is different, meaning that each band has its own ρ value according to the type of surface/class. For example, the B10 ρ value is zero for all classes apart from the Cirrus class; this is because B10 is responsible for the detection of thin cirrus [34].
In Figure 4, we can see that there are points from some classes with ρ greater than 1.0; unlike negative ρ, values of ρ > 1 are not unnatural. Circumstances that can lead to such observations are: (a) nearby thunderstorm clouds that provide additional illumination from reflected solar radiation; (b) the area receiving solar radiation being directly perpendicular to the sun; and (c) surfaces such as shiny buildings, waves, or ice crystals that act as mirrors or lenses and reflect incoming direct sunlight in a concentrated way rather than diffusely. Table 3 presents the number of points of the dataset per band and class with ρ > 1. We can observe that the class with the highest proportion of ρ > 1 values is Snow. This can be explained by taking into account that snowy surfaces reflect incoming sunlight in a concentrated way rather than diffusely, acting as a mirror and producing the observed ρ > 1.

Extended Dataset for Sen2Cor assessment.
The data used for assessing the Sen2Cor algorithm is composed of 5 attributes: product_id, latitude, longitude, sen2cor_class, and class. Since there are eleven classes in the Sen2Cor scene classification image (Table 1) and only six classes in the original data (Table 2), a class mapping was performed. This mapping is shown in Table 4.
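A mapping of this kind can be sketched as a simple lookup. The authoritative assignments are those of Table 4 (not reproduced here); the dictionary values below are assumptions based only on the class names of Tables 1 and 2:

```python
# Illustrative lookup from the Sen2Cor SCL values (Table 1) to the six
# dataset classes (Table 2). The assignments are assumptions based on
# the class names; Table 4 of the paper is authoritative.
SCL_TO_SIX = {
    0: "Other",    # No Data
    1: "Other",    # Saturated or defective
    2: "Shadow",   # Dark features / shadows
    3: "Shadow",   # Cloud shadows
    4: "Other",    # Vegetation
    5: "Other",    # Bare soils / deserts
    6: "Water",    # Water
    7: "Cloud",    # Cloud low probability
    8: "Cloud",    # Cloud medium probability
    9: "Cloud",    # Cloud high probability
    10: "Cirrus",  # Thin cirrus
    11: "Snow",    # Snow or ice
}

def map_scl(scl_value):
    """Map a Sen2Cor SCL value to one of the six dataset classes."""
    return SCL_TO_SIX.get(scl_value, "Other")
```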

Classification Algorithms
In this study, we evaluated ensemble methods (Random Forest and Extra Trees) and a deep-learning-based method (Convolutional Neural Network) on the extended dataset built for satellite imagery scene classification. This section briefly introduces the classification algorithms used and provides details about feature analysis and the adopted experimental setup.
During the ML modeling process, the following statistical and CNN-based classification algorithms were used.

Decision Tree (DT)
Decision tree is an efficient inductive machine learning technique [42,43] where the model is trained by recursively splitting the data [44]. The data splitting consists of a tree structure (root-nodes-branches-leaf), which successively tests features of the dataset at each node and splits the branches with different outcomes. This process continues until a leaf (or terminal node) representing a class is found. Each split is chosen according to an information criterion which is maximized (or minimized) by one of the "splitters".
However, a decision tree is sensitive to where and how it splits. Generally, the bias-variance trade-off depends on the depth of the tree: a complex (e.g., deep) decision tree has a low bias and a high variance. Also, the tree makes almost no assumptions about the target function but is highly susceptible to variance in the data, making decision trees prone to overfitting [45].
Decision trees have been used in many applications, including the identification of land cover change [46], the mapping of global forest change [47], and the differentiation of five palustrine wetland types [48].

Random Forest (RF)
Random Forest (RF) is a tree ensemble algorithm [49] that aims to reduce the decision tree variance (at a small cost in bias). During the random forest learning process, bagging is used to minimize the variance by learning from several trees over various sub-samples of the data [50] (in terms of both the observations and the predictors used to train them). Bagging is a process where several decisions are averaged [51].
In the random forest learning process [45], m variables are randomly selected from a potential set of p variables at each split, choosing the split that separates the data in the best way possible. This process is repeated up to a maximum depth d, creating a forest of several individual trees. In the end, an observation x is classified by all the individual trees, and the final decision is obtained by aggregating their votes.

Extra Tree (ET)
The main difference between a Random Forest and Extra Trees [52] (also known as Extremely Randomized Trees) lies in the fact that, instead of computing the locally optimal feature/split combination (as the random forest does), a random value is selected for the split of each feature under consideration [53]. This leads to more diversified trees and fewer splitters to evaluate during training.
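Since Random Forests and Extra Trees share the same scikit-learn interface, swapping one for the other is a one-line change. The sketch below is illustrative only: the two features stand in for the 13 band reflectances, and the hyperparameters are not the tuned values reported in Table 8:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Two toy features stand in for the 13 band reflectances; the labels and
# hyperparameters are illustrative, not the tuned values of Table 8.
X = [[0.10, 0.20], [0.90, 0.80], [0.15, 0.25], [0.85, 0.75]]
y = ["Water", "Cloud", "Water", "Cloud"]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Both ensembles expose the same predict() interface.
rf_label = rf.predict([[0.12, 0.22]])[0]
et_label = et.predict([[0.12, 0.22]])[0]
```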

Convolutional Neural Networks (CNNs)
CNNs are Neural Networks that receive an input, assign importance (learnable weights and biases) to various aspects/objects within the input, and are able to differentiate one input from another. Due to sparse interactions and weight sharing, Neural Networks (NN) are well suited for processing large-scale imagery [54]. In CNNs, the limited connectivity between the previous layer and the next layer is referred to as sparse interaction, whereas, with weight sharing, layers share the same connection weights.
Recently, researchers have proposed complex and deeper structures such as AlexNet [55], VGGNet [56], and GoogLeNet [57], with depths of 8, 19, and 22, respectively [58]. In other words, a CNN exploits domain knowledge about feature invariances within its structure; CNNs have been successfully applied to various image analysis and recognition tasks [59], making them an effective technique for labeled tabular data classification [60].

Feature Analysis
Feature ranking helps to select the most useful features to build a model. The Scikit-Learn library [61] provides an implementation of the most useful statistical algorithms for feature ranking [62], including Chi-Squared (chi2) [63], Mutual Information (mutual info.) [64], ANOVA (anova) [65] and Pearson's Correlation Coefficient (pearson) [66]. We applied these measures over the full dataset and the ranking is shown in Table 5.
Sen2Cor uses nine bands (1, 2, 3, 4, 5, 8, 10, 11, and 12) to generate a scene classification map and ten bands (1, 2, 3, 4, 8, 8A, 9, 10, 11, and 12) to produce an L2A product [34]. Comparing this band information with the results shown in Table 5 and the surface reflectance (ρ) value distributions (Figure 4), the obtained statistical feature ranking produces contradictory results. For example, statistical feature ranking ranks B10 as the least important band, yet the sole purpose of B10 is to identify the presence of the Cirrus class, which is supported by Figure 4. Removing B10 from the training dataset could hinder the model performance when applied to an unseen (satellite) image acquired anywhere over the globe. This leads us to state that, although the dataset is a collection over different continents and represents a possible global distribution, ranking bands based on the dataset is not ideal, as each band has its own purpose/reflectance. Thus, in our experiments, we used all 13 bands to cover all the surface reflectance (ρ) value distributions.
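As an illustration of one of these statistical rankings, the sketch below orders features by their absolute Pearson correlation with numerically encoded labels (scikit-learn's SelectKBest with chi2, mutual_info_classif, or f_classif covers the remaining measures):

```python
import numpy as np

def pearson_rank(X, y):
    """Rank features (columns of X) by the absolute Pearson correlation
    with the numerically encoded labels y; best feature first. This
    mirrors the 'pearson' ranking of Table 5 in spirit only."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1])
```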

Experimental Setup
To assess the classifiers, 10 of the 60 products (covering all five continents: one each from Asia and Oceania, two each from Africa and America, and four from Europe) were randomly selected for the test set. Aiming to include all five continents in the test set led to a class distribution somewhat different from that of the full dataset. The test set distribution can be seen in Table 6. The information about the experimental setup used to build the classifiers is presented in Table 7. Precision, recall, and F1 score are performance measures that can be used to evaluate ML models. Precision is defined as the ratio between the number of correct positive results and all positive results, whereas recall considers all relevant samples (all samples that should have been identified as positive) instead of all positive results [67]; F1 is the harmonic mean of precision and recall. These measures are calculated per class (considering one class as positive and all the other classes as negative) using Equations (4):

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall) (4)

Here, TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.
When aiming at a unique performance value, precision, recall, and F1 are averaged; this average can be calculated over the per-class values (macro-average) or by summing the true positives, false positives, and false negatives over all classes and computing the performance measures from these counts (micro-average).
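The difference between the two averaging schemes can be sketched from per-class counts (the class names and counts below are illustrative):

```python
def per_class_f1(tp, fp, fn):
    """F1 from raw true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

def micro_macro_f1(tp, fp, fn):
    """tp, fp, fn are per-class dicts. Macro-F1 averages the per-class F1
    values; micro-F1 pools the counts across classes first."""
    macro = sum(per_class_f1(tp[c], fp[c], fn[c]) for c in tp) / len(tp)
    micro = per_class_f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

With an imbalanced rare class, the two averages diverge: micro-F1 is dominated by the frequent class, while macro-F1 weights every class equally.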
To fine-tune the classification algorithms, a RandomizedSearchCV [68] procedure with 5-fold cross-validation over the training set was used. The assessment was done using the micro-F1 measure. Table 8 shows the best parameter values for the Random Forest (RF) and Extra Trees (ET) algorithms. The CNN parameter values found were epochs = 8, batch size = 32, filters = 24, and kernel size = 4. Figure 5 shows the CNN model structure. Using the CNN architecture [69] as a base reference, the model consists of an input layer, a 1D convolutional layer, a dropout layer, and a max-pooling layer, followed by one flatten layer, two dense layers, and an output layer. Each layer is described next:
• Input layer: the input representation of this layer is a vector of the 13 band values;
• Convolutional-1D layer: this layer extracts features from the input. Multiple activation feature maps are produced by convolving the input with the learned kernels; in our architecture, we used a convolution kernel of size 4;
• Dropout layer: a random portion of the outputs for each batch is nullified to avoid strong dependencies between adjacent layers;
• Pooling layer: this layer reduces the dimension and abstracts the features by combining the feature maps, which helps prevent overfitting while increasing computation speed;
• Flatten layer: the (5 × 24) input from the previous layer is transformed into a single vector, giving a feature space of width 120;
• Dense layer: each neuron receives input from all neurons in the previous layer, making it a densely connected layer with a weight matrix W, a bias vector b, and the activation of the previous layer;
• Softmax activation: a normalized exponential function, also used in multinomial logistic regression. Using the softmax activation function, the last output vector of the CNN model represents a probability distribution over the sample classes (in our case, the output vector has length 6).
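The tensor shapes quoted above (a 13-value input reduced to a 120-wide flattened vector) follow from standard "valid" convolution and pooling arithmetic, which can be checked in a few lines:

```python
def conv1d_out(length, kernel_size):
    """Output length of a stride-1, unpadded ('valid') 1D convolution."""
    return length - kernel_size + 1

def maxpool1d_out(length, pool_size=2):
    """Output length after non-overlapping 1D max pooling."""
    return length // pool_size

# 13 band values -> Conv1D(24 filters, kernel 4) -> MaxPool(2) -> Flatten
steps = conv1d_out(13, 4)       # 10 time steps, 24 feature maps
steps = maxpool1d_out(steps)    # 5 time steps, 24 feature maps
flat_width = steps * 24         # 120-wide vector fed to the dense layers
```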

Results
As mentioned in the previous section, the classification models were compared over the test set that comprises 10 products from all five continents. Table 9 shows precision, recall and micro-F1 values for Random Forests (RF), Extra Trees (ET), Convolutional Neural Networks (CNN) along with Sen2Cor Scene Classification (SCL). Analysing Table 9, the following observations can be made:

1.
When looking at micro-F1, CNN performs similarly to Random Forest and Extra Trees. The difference in micro-F1 is small (almost zero), and we cannot state that CNN outperforms the others. Moreover, one can state that each algorithm performs better than the others on specific classes; for example, ET has a higher per-class F1 over the classes Cirrus, Cloud, and Other, whereas RF has a higher F1 over Water, and CNN over Snow.

2.
Looking at the precision and recall for the Cirrus and Shadow classes, it is noticeable that Sen2Cor has high precision but low recall. This means that Sen2Cor returns very few Cirrus and Shadow predictions (it has a very high rate of false negatives), although most of its predicted labels are correct (a low level of false-positive errors).

3.
Overall, the three Machine Learning algorithms generate models with similar performance, with differences that range from 0% to 7% between the "best" and the "worst" (for example, Cirrus has a "best" F1 of 0.79 with ET and a "worst" F1 of 0.72 with RF). With regard to the classes, there is great variation: precision values are above 90% for the classes Snow and Shadow and below 75% for the Other class; for recall, the highest values are obtained for the classes Cloud and Other (above 80%) and the lowest for the Cirrus and Shadow classes (between 67% and 77%). Regarding the F1 measure, the only class with values below 80% is Cirrus; the classes Snow and Water have values above 90%.

4.
Comparing the performance of the ML algorithms with Sen2Cor, especially for the Cirrus and Snow classes, the ML approaches are superior. For these classes, the Sen2Cor F1 values are below 50%; these low values are due to the large difference between precision and recall (for Cirrus, precision is above 90% while recall is 10%; for Snow, precision is above 85% and recall around 30%). Considering the micro-F1 measure, the ML models present an increase of about 25 percentage points (from 59% to 84%) when compared to the Sen2Cor Scene Classification algorithm.
To check whether there is a significant difference between Sen2Cor and the ML models (i.e., whether the difference in micro-F1 is significant or not), the McNemar-Bowker test [70,71] was performed. The McNemar-Bowker test is a statistical test used on paired nominal data (arranged in k × k contingency tables) to determine whether there is a difference between two related groups. Here, k is the number of categories/labels, and the McNemar-Bowker value B is calculated using Equation (5), where O_i,j is the count in the i-th row and j-th column of the crosstab (a table that cross-tabulates the k classes predicted by the two classifiers):

B = Σ_{i<j} (O_i,j − O_j,i)² / (O_i,j + O_j,i) (5)

The obtained B value approximately follows a chi-square distribution [72] with df = k(k − 1)/2 degrees of freedom. The probability density function is calculated using Equation (6), where Γ denotes the gamma function, which has closed-form values for integer df/2:

f(x) = x^(df/2 − 1) e^(−x/2) / (2^(df/2) Γ(df/2)), for x > 0 (6)
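A minimal sketch of the B statistic and its degrees of freedom, computed from a k × k crosstab given as a list of lists:

```python
def mcnemar_bowker(table):
    """Compute the McNemar-Bowker statistic B and its degrees of freedom
    for a k x k crosstab (list of lists). Cell pairs where both
    off-diagonal counts are zero are skipped, as is conventional."""
    k = len(table)
    b = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            s = table[i][j] + table[j][i]
            if s:
                b += (table[i][j] - table[j][i]) ** 2 / s
    return b, k * (k - 1) // 2
```

For k = 2 this reduces to the classic McNemar statistic with one degree of freedom.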
Using Equation (6) to compare Sen2Cor and the ML models, the null hypothesis (that there is no difference between the two groups) was rejected with a p-value below 0.05, showing that the difference is statistically significant. The same can be visualized in Figure 6. To have a better grasp of the differences between the ML model and Sen2Cor, we randomly selected an image (Figure 7a) from the test set with the classes Cloud, Shadow, and Other (shown as white, brown, and green, respectively) and generated classification images using the ET model (Figure 7b) and Sen2Cor (Figure 7c). The original image and the resulting classified images are presented in Figure 7.
After analysing Figure 7a closely, it is possible to see that for each cloud present in the image there is an equivalent shadow. Comparing the generated images, we can observe that the ML model (Figure 7b) classifies clouds and cloud shadows more accurately than Sen2Cor; the Sen2Cor classification (Figure 7c) misses the majority of cloud shadows, whereas the ML model captures them all.
We state that a static rule-based approach, like the one used by Sen2Cor, which relies heavily on surface reflectance (it uses a sensor-specific threshold-based method), tends to miss marginal variation between two surfaces, leading to misclassification. To support this claim and to assess the generalization ability of the ML model, we classified a specific image (Figure 8) with brighter surface reflectance values (note that this image does not belong to the dataset). Figure 8a shows an RGB image of the coastal area of Lisbon, Portugal; in it, there are three visible parts: water, coastal sand, and land surface. Figure 8b,c show the classification images generated using the ET model and Sen2Cor, respectively.

After analysing Figure 8 closely, it is possible to see that the bright coastal area/sand present in Figure 8a is classified as Cloud by Sen2Cor (represented as a white line); on the other hand, the ML algorithm classified the same bright coastal area/sand as Other, which is indeed the correct classification. Thus, the ML model is able to better capture bright coastal surface reflectance values than Sen2Cor.

Discussion
Our proposed system uses a dynamic and generalized approach based on surface reflectance, which does not rely on predefined threshold rules. Thus, we preserve the pivotal motivation of classifying any unseen Sentinel-2 image acquired anywhere over the globe.
Further, ensemble methods can provide useful information, such as the Gini index, to the end-user. The Gini index is a measure of statistical distribution intended to represent how different attribute variables influence the overall accuracy [73]. Using the Gini index, we identified that B11 and B12 have a substantial effect on the overall model accuracy (data not shown). Thus, the classification performance could be further increased by including additional indices [74] that use Band 11 and Band 12. For example, including the Global Vegetation Index (GVI) [75], Normalized Difference Salinity Index (NDSI) [76], and Soil Composition Index (SCI) [77] could help achieve better performance.
Overall, the Random Forest and Extra Trees algorithms produce fast and accurate predictions (with micro-F1 between 0.83 and 0.84). Nonetheless, when using these ensemble methods, the issues of overfitting and the bias/variance tradeoff should not be overlooked.
Since Figure 7b shows the shadows of all the clouds, we analyzed the sensitivity of the ML model. For that, we randomly selected three separate L1C image patches (not from the dataset), shown in Figure 9a. From Figure 9, it can be observed that, for each geographically independent region (Ballyhaunis, Sukabumi, and Béja), the ML model captures, with high precision, the shadows of the (low-, medium-, and opaque-probability) clouds, proving the generalization ability of the ML model. To this extent, we can say that the ML models are sensitive and can detect even minor shadows (from low- and medium-probability clouds). Moreover, the detection of shadows does not decrease the workable area, as the classifier generates a mask and the end-user can still use these areas, even though they are labeled 'shadow' or 'cloud'.
Further, we studied the bias of the model towards the achieved results. To do so, we selected 59 products for training and 1 for testing. The main reason to split the dataset in this way was to make sure that knowledge about a region is not essential to classify that region. This reasoning enables us to pose the following question: 'Will the system be able to classify a new, unseen product with high performance?' To evaluate this, a complete region is used as the test set while the rest of the points compose the training set.
We replicated this procedure for each of the 60 products (i.e., using 1 product for testing and the remaining 59 products for training). The F1_avg results are presented in Table 10. Equation (7) calculates the F1_avg value (over the 60 products) for each class, where F1_p is the F1 value of the particular class within product p, N_p is the number of points of the class within product p, and T is the total number of points of the class over all products, with p ∈ {1, …, 60}:

F1_avg = Σ_{p=1}^{60} (N_p / T) × F1_p (7)
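This weighted average can be sketched directly from per-product (F1_p, N_p) pairs:

```python
def f1_avg(per_product):
    """Weighted per-class F1 over products, as in Equation (7):
    F1_avg = sum_p (N_p / T) * F1_p, where per_product is a list of
    (f1_p, n_p) pairs and T is the total number of class points."""
    total = sum(n for _, n in per_product)
    return sum(f1 * n / total for f1, n in per_product)
```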
When compared to Sen2Cor, the ML approach achieved an overall improvement of 11 percentage points (77.54% with CNN versus 66.40% with Sen2Cor). This study shows that the achieved results stem from the learning done by the algorithms and that the proposed models are not biased towards the test set.
Regarding the neural network architecture, different CNN-based models have been proposed to classify cloud masks and land cover change using satellite imagery of different spectral and temporal resolutions [1,21,22]. These studies look at different datasets and present different CNN architectures but, to the best of our knowledge, none evaluates a CNN architecture on the dataset used in this work, making a direct comparison of the obtained results impossible.

Conclusions
The significant development of remote sensing technology over the past decade has enabled intelligent Earth observation, such as scene classification using satellite images. However, the lack of publicly available large and diverse data sets of remote sensing images severely limits the development of new approaches, especially Machine Learning methods. This paper first presents a comprehensive review of recent progress in the field of remote sensing image scene classification, including existing data sets and Sen2Cor. Then, by analyzing the limitations of the existing data sets, it introduces a surface-reflectance-based, freely and publicly available data set and provides a ready-to-use Python package (scripts) with a trained ML model (refer to Appendix A). These will enable the scientific community to develop and evaluate new data-driven algorithms to classify Sentinel-2 L1C images into six classes (Water, Shadow, Cirrus, Cloud, Snow, and Other).
Using the micro-F1 measure, we evaluated three representative ML algorithms (Random Forest, Extra Trees, and Convolutional Neural Network) for the task of scene classification using the proposed data set and report the results as a useful baseline performance for future research. The proposed ML model presents a micro-F1 value of 0.84, a considerable improvement over Sen2Cor, which reached a value of 0.59.
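The baseline evaluation of the tree-ensemble algorithms can be sketched with scikit-learn; the data below is a random stand-in for the reflectance dataset and the default hyperparameters are illustrative, not the paper's tuned models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the reflectance dataset (features = band values).
rng = np.random.default_rng(42)
X = rng.random((600, 9))
y = rng.integers(0, 6, 600)  # six classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

for name, clf in [("Random Forest", RandomForestClassifier(random_state=42)),
                  ("Extra Trees", ExtraTreesClassifier(random_state=42))]:
    clf.fit(X_tr, y_tr)
    # Micro-averaged F1 aggregates true/false positives over all classes.
    micro = f1_score(y_te, clf.predict(X_te), average="micro")
    print(f"{name}: micro-F1 = {micro:.2f}")
```

Micro-averaging is a natural choice here because it weights every pixel equally, so frequent classes such as 'Other' do not need separate per-class re-weighting to obtain an overall score.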
The results presented in Table 9 support our claim that the developed ML model can be used as a base tool for Sentinel-2 optical image scene classification. Moreover, when evaluating over one complete test set image (Figure 7), it is possible to conclude that while Sen2Cor misses the majority of cloud shadows, the ML model captures them. Additionally, model sensitivity (Figure 9) and bias (Table 10) tests were performed over different L1C images.
Being composed of several modules, each with a high level of complexity, our approach can certainly still be improved, and an overall higher performance is achievable. As future work, we intend not only to continue improving the individual modules, but also to extend this work to:
• Add more training scenes with the help of image augmentation (e.g., elastic transformation) [78] applied to the existing training data.
• Incorporate radar information and correlate the results and their impact on Water, Shadow, Cirrus, Cloud, and Snow detection.
Funding: This work was funded by Project NIIAA: Núcleo de Investigação em Inteligência Artificial em Agricultura (Alentejo 2020, reference ALT20-03-0247-FEDER-036981).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: Refer to Appendix A.
Acknowledgments: The authors would like to thank Amal Htait and Ronak Kosti for proofreading the article.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
Through this article, the following resources are made publicly available [79]: (1) an extended (train and test) dataset and (2) a ready-to-use Python package (scripts) with a trained ML model to classify Sentinel-2 L1C images. The Python package takes the L1C product path and produces an RGB image with six classes (Water, Shadow, Cirrus, Cloud, Snow, and Other) at 20 m resolution. A working example of the developed Sentinel-2 L1C image scene classification package is discussed below. Figure A1 shows the processing steps of the developed package. The path to a Sentinel-2 L1C product is passed as input, and an RGB image with six colors (each identifying one class) at 20 m resolution is produced as output. The authors used the GDAL library [80] to read and rescale images. During post-processing, neighbouring pixels are checked to minimize the classification error. Figure A2 shows a working example of the developed package, where an L1C product is classified into six classes. Figure A2a,b respectively present the corresponding RGB image of the L1C product and the classified image. Using our package, the average time to produce a scene-classified RGB image is 4 min; Sen2Cor v2.5.5 takes 18 min on the system specification detailed in Table 7 (it is worth mentioning that Sen2Cor performs many other operations apart from scene classification). For the sole purpose of scene classification, our model is 4 times faster than Sen2Cor when classifying Sentinel-2 L1C images into six classes (Water, Shadow, Cirrus, Cloud, Snow, and Other).
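The last two stages of the package's pipeline, colouring the class map and checking neighbouring pixels, can be sketched as follows. The class indices, the exact RGB values, and the 3×3 majority vote are assumptions for illustration (the caption of Figure A2 fixes only the colour names, and the paper does not detail the neighbour check):

```python
import numpy as np

# Hypothetical class indices with the caption's colour scheme (RGB).
PALETTE = {
    0: (0, 0, 255),      # Water  -> blue
    1: (139, 69, 19),    # Shadow -> brown
    2: (216, 191, 216),  # Cirrus -> light purple
    3: (255, 255, 255),  # Cloud  -> white
    4: (0, 255, 255),    # Snow   -> cyan
    5: (0, 128, 0),      # Other  -> green
}

def to_rgb(class_map: np.ndarray) -> np.ndarray:
    """Map a 2-D array of class indices to an RGB image."""
    rgb = np.zeros((*class_map.shape, 3), dtype=np.uint8)
    for cls, colour in PALETTE.items():
        rgb[class_map == cls] = colour
    return rgb

def smooth(class_map: np.ndarray) -> np.ndarray:
    """Majority vote over each pixel's 3x3 neighbourhood: one
    plausible reading of the 'check neighbouring pixels' step."""
    padded = np.pad(class_map, 1, mode="edge")
    out = np.empty_like(class_map)
    h, w = class_map.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3].ravel()
            out[i, j] = np.bincount(window).argmax()
    return out

# An isolated 'Water' pixel inside a 'Cloud' patch is voted out.
demo = np.array([[3, 3, 3], [3, 0, 3], [3, 3, 3]])
print(smooth(demo))
print(to_rgb(demo).shape)  # (3, 3, 3): H x W x RGB
```

In the actual package, the class map would come from the trained ML model applied to GDAL-read, rescaled band values, with `to_rgb` producing the final 20 m resolution output image.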

Figure A2. (a) L1C product; (b) RGB scene-classified image using the developed package. Labels: Water as Blue, Shadow as Brown, Cirrus as light Purple, Cloud as White, Snow as Cyan, and Other as Green.