Cloud and Cloud-Shadow Detection for Applications in Mapping Small-Scale Mining in Colombia Using Sentinel-2 Imagery

Small-scale placer mining in Colombia takes place in rural areas and involves excavations resulting in large footprints of bare soil and water ponds. Such excavated areas comprise a mosaic of challenging terrains for cloud and cloud-shadow detection of Sentinel-2 (S2A and S2B) data used to identify, map, and monitor these highly dynamic activities. This paper uses an efficient two-step machine-learning approach using freely available tools to detect clouds and shadows in the context of mapping small-scale mining areas, one which places an emphasis on the reduction of misclassification of mining sites as clouds or shadows. The first step is comprised of a supervised support-vector-machine classification identifying clouds, cloud shadows, and clear pixels. The second step is a geometry-based improvement of cloud-shadow detection where solar-cloud-shadow-sensor geometry is used to exclude commission errors in cloud shadows. The geometry-based approach makes use of sun angles and sensor view angles available in Sentinel-2 metadata to identify potential directions of cloud shadow for each cloud projection. The approach does not require supplementary data on cloud-top or bottom heights nor cloud-top ruggedness. It assumes that the location of dense clouds is mainly impacted by meteorological conditions and that cloud-top and cloud-base heights vary in a predefined manner. The methodology has been tested over an intensively excavated and well-studied pilot site and shows 50% more detection of clouds and shadows than Sen2Cor. Furthermore, it has reached a Specificity of 1 in the correct detection of mining sites and water ponds, proving itself to be a reliable approach for further related studies on the mapping of small-scale mining in the area. Although the methodology was tailored to the context of small-scale mining in the region of Antioquia, it is a scalable approach and can be adapted to other areas and conditions.


Introduction
Informal small-scale alluvial gold mining, also known as placer mining, has major social and environmental impacts and has been at the heart of complicated armed conflicts in various parts of the world. It is distinct from subsistence mining as it utilizes large machinery to excavate soil and river sediment [1]. When carried out on the riverbanks, it leaves large footprints of bare soil along with ponds of water that are utilized for on-site processing [2,3]. Such ponds are required to pump ore slurry and wash it through sluice boxes under pressure where the gold particles are collected. Prior to the 2018 mercury is also used for the detection of cirrus clouds as their high altitude can be detected using a band with high atmospheric absorption. Finally, filters applied on detected clouds are used to remove isolated pixels and to fill gaps within clouds [35]. On the other hand, cloud detection for L2A products utilizes several steps of threshold filtering using indices that involve land cover to avoid detecting false cloud pixels in regions of possible false detection, such as areas of bare soil [36]. Unfortunately, the cloud detection approach of Sen2Cor has been reported to result in the unsatisfactory detection of dense clouds and their shadows [14,24,37], and has been shown to result in false positives in small-scale mining areas [12]. This paper aims to provide improved cloud and shadow detection in an approach that is simple, efficient, and based on freely available tools. It aims at improving cloud and cloud shadow detection in the context of mapping small-scale mining where the areas of interest are bare soil and water ponds. This procedure consists of two consecutive machine-learning steps. First, a supervised classification detects candidate clouds and shadows; second, the solar-cloud-shadow-sensor geometry and a causality effect between cloud shadows and clouds are considered to reduce shadow commission error. There have been already various methods developed that include the reduction of cloud-shadow false positives. One "universal" method that can be used for Sentinel-2 data considers an object-based image analysis approach for shape spatial-matching of cloud and cloudshadows [22]. Another approach developed for MODIS data considers a geometry-based tool to detect potential shadows followed by classification to match the two outputs [13]. Other geometry-based approaches have been tailored for specific sensors that include thermal bands [28,38]. This paper proposes a simple pixel-based approach that provides a high-quality identification of clouds and their shadows for Sentinel-2 in the context of small-scale mining in Colombia. This work aims to efficiently provide a suitable tradeoff between omission errors leading to failure in excluding contaminated pixels and commission errors that result in masking out clear pixels. Although the methodology was tailored for the setting of the study area in the context of small-scale mining, it is scalable and can be a solid basis to develop a more generalized approach. The methodology is tested over an intensively excavated region through a mono-temporal approach due to the highly dynamic characteristics of the excavated areas and the rapid landcover change that needs to be depicted. A validation of the results using images acquired in different seasons was carried out on a well-studied pilot site in the vicinity of the town of El Bagre [5]. The success of this approach is a milestone for time series analysis of land cover around mining sites that will lead to an early warning system about the sprawl of excavations, especially Remote Sens. 2021, 13, 736 4 of 20 in the vicinity of protected or sensitive areas. Such important output is to be shared with stakeholders through MapX (https://mapx.org), an online information and engagement platform that would allow the consolidation of data, analysis, and spatial visualization [39]. MapX was developed by the United Nations Environment Program (UNEP) and UNEP/ GRID-Geneva (https://unepgrid.ch).

Study Area
The study area is in the department of Antioquia along the path of the Nechí river. This department is the main producer of gold in Colombia, and the abundance of placer mining in the area makes it an optimal site to test remote sensing applications. Figure 1a shows the location of the study area with respect to Antioquia and Colombia and (b) a Red-Green-Blue (RGB) view of the area using a Sentinel-2 (S2B) image acquired on 18 June 2019, with an indication of the pilot-site location around the town of El Bagre at latitude 7 • 36'17.88"N and longitude 74 • 48'32.32"W. The area includes water bodies (rivers, isolated bodies, etc.), non-vegetated regions (built-up areas, mining areas, bare soil, etc.), and vegetated regions (forests, shrubs, agriculture, etc.). Figure 1c shows the topography of the area using a Digital Elevation Model (DEM) from the Shuttle Radar Topography Mission (SRTM) (1 arc-second resolution), freely available through the United States Geological Survey (http://earthexplorer.usgs.gov/). The elevation ranges from 30 m to 500 m and is relatively smooth along the river with slightly rugged areas limited to the southern part. The average elevation along the riverbanks where the land excavations take place does not exceed 60 m. The study area has a tropical warm-humid climate with frequent cloud coverage. The region experiences a dry season from December to March and a rainy season the rest of the year. It has a relatively spatially homogeneous climate with an average annual temperature around 28 • C and seasonal temperature variability of approximately 5 • C ( Figure 2). The closest weather station within consistent topography and providing data through WeatherUnderground.com is at Los Garzones International Airport Station, about 175 km from the town of El Bagre and at 15 m of elevation ( Figure 2). A validation of the results using images acquired in different seasons was carried out on a well-studied pilot site in the vicinity of the town of El Bagre [5]. The success of this approach is a milestone for time series analysis of land cover around mining sites that will lead to an early warning system about the sprawl of excavations, especially in the vicinity of protected or sensitive areas. Such important output is to be shared with stakeholders through MapX (https://mapx.org), an online information and engagement platform that would allow the consolidation of data, analysis, and spatial visualization [39]. MapX was developed by the United Nations Environment Program (UNEP) and UNEP/ GRID-Geneva (https://unepgrid.ch).

Study Area
The study area is in the department of Antioquia along the path of the Nechí river. This department is the main producer of gold in Colombia, and the abundance of placer mining in the area makes it an optimal site to test remote sensing applications. Figure 1a shows the location of the study area with respect to Antioquia and Colombia and (b) a Red-Green-Blue (RGB) view of the area using a Sentinel-2 (S2B) image acquired on 18 June, 2019, with an indication of the pilot-site location around the town of El Bagre at latitude 7° 36, 17.88'' N and longitude 74° 48' 32.32'' W. The area includes water bodies (rivers, isolated bodies, etc.), non-vegetated regions (built-up areas, mining areas, bare soil, etc.), and vegetated regions (forests, shrubs, agriculture, etc.). Figure 1c shows the topography of the area using a Digital Elevation Model (DEM) from the Shuttle Radar Topography Mission (SRTM) (1 arc-second resolution), freely available through the United States Geological Survey (http://earthexplorer.usgs.gov/). The elevation ranges from 30m to 500m and is relatively smooth along the river with slightly rugged areas limited to the southern part. The average elevation along the riverbanks where the land excavations take place does not exceed 60m. The study area has a tropical warm-humid climate with frequent cloud coverage. The region experiences a dry season from December to March and a rainy season the rest of the year. It has a relatively spatially homogeneous climate with an average annual temperature around 28 o C and seasonal temperature variability of approximately 5 o C ( Figure 2). The closest weather station within consistent topography and providing data through WeatherUnderground.com is at Los Garzones International Airport Station, about 175 km from the town of El Bagre and at 15 m of elevation ( Figure 2).

Classification for Dense Cloud and Shadow Detection
Dense clouds have high reflectance in the visible part of the spectrum. This can cau misclassification of land-cover of high brightness as clouds [14,40]. In fact, this has be observed in the study area where bare soil and highly turbid shallow ponds of small-sc mining have been misidentified. On the other hand, areas shadowed by clouds are re tively dark due to lower irradiance, and thus can be misclassified as water bodies a

Classification for Dense Cloud and Shadow Detection
Dense clouds have high reflectance in the visible part of the spectrum. This can cause misclassification of land-cover of high brightness as clouds [14,40]. In fact, this has been observed in the study area where bare soil and highly turbid shallow ponds of smallscale mining have been misidentified. On the other hand, areas shadowed by clouds are relatively dark due to lower irradiance, and thus can be misclassified as water bodies and areas shaded by topographical features and vice versa [30,40]. Since the topography of the study area is generally smooth, such topographical impacts on alluvial mining sites can be considered minimal. Figure 3 shows examples of reflectance spectra, whereby it shows the mean spectrum (± standard deviation) of a selected cloud and its shadow along with a nearby mining site and water body. These spectra were extracted from the Sentinel-2 (S2B) image of the study area acquired on 18 June 2019. The reflectance of cloud and mine bare-soil pixels is relatively high with distinction depicted at band 1 and band 9 located in the water vapor absorption regions [19]. On the other hand, water and shadow pixels show low reflectance throughout the spectrum.
A supervised classification approach is used to identify three classes: clouds, cloud shadows, and clear pixels. The Sentinel-2 image acquired on 18 June 2019 ( Figure 1) is used to extract reference spectra because it includes clouds and shadows over various landcovers along the western and southern regions ( Table 2). These reference spectra are available as Supplementary Materials data with this manuscript. A Support-Vector-Machine (SVM) classifier is used as it has proven its suitability for landcover classification in diverse areas [41][42][43], for small-scale mining detection at the pilot site [5], and for cloudshadow detection [13]. SVM is implemented using "Scikit-learn: Machine Learning in Python" [44] where it aims to find an optimal hyperplane separating the data into the pre-specified classes, and "kernels" can be used to introduce new variables that improve class separability [45]. The commonly used kernel functions include Linear and Radial Basis Function (RBF-Gaussian) kernels, and they require optimization parameters. Both types of kernels use "C" (penalty for misclassification) that allows for modification in the rigidity of training data. The RBF kernel also requires "gamma" (reflecting the spread of the kernel) that impacts the smoothing of the hyperplane shape [42]. Larger values of "C" may lead to an over-fitting model, whereas increasing "gamma" will affect the Remote Sens. 2021, 13, 736 6 of 20 shape of the class-dividing hyperplane, which may affect the classification accuracy. To identify the most suitable parameters, the grid-search method is used, where "gamma" ∈ [1, 0.1, 0.01] and "C" ∈ [1, 50,100,200]. The parameter values are tested using a three-fold cross-validation approach, and those resulting in the highest classification accuracy are selected. Classification accuracy is reported as precision value Pr = T/(T + F), where T is the number of true positives and F the number of false positives. As SVM can handle the dimensionality of the Sentinel-2 data, the use of all 12 bands is possible. Furthermore, as water bodies are of concern in the analysis, the indices that have been proven to be powerful in the detection of water and distinguishing it from other landcovers are tested. The features include the Normalized Difference Vegetation Index (NDVI) (B8-B4)/(B8+B4) and Modified Normalized Difference Water Index (MNDWI) (B3-B11)/(B3+B11) [46][47][48]. As B1 and B9 reflectance is relatively higher for clouds than for mining areas, one more test is considered where features are reduced to only bands 1 and 9 along with NDVI and MNDWI,

Direction of Cloud Shadow with Respect to Cloud Projection
Sentinel-2's orbit is sun-synchronous where the twin satellites follow the same orbit at a mean altitude of 786 km but 180 degrees apart. They acquire the data at Mean Local Solar Time of 10:30 am at the descending mode [50]. Cloud shadow locations with respect to cloud projection in imagery are dependent on the direction of solar radiation represented by solar zenith and azimuth angles, and by the sensor viewing geometry along with cloud top and bottom height [28,29,40,46,51]. Except for cloud height, all parameters

Direction of Cloud Shadow with Respect to Cloud Projection
Sentinel-2's orbit is sun-synchronous where the twin satellites follow the same orbit at a mean altitude of 786 km but 180 degrees apart. They acquire the data at Mean Local Solar Time of 10:30 a.m. at the descending mode [50]. Cloud shadow locations with respect to cloud projection in imagery are dependent on the direction of solar radiation represented by solar zenith and azimuth angles, and by the sensor viewing geometry along with cloud top and bottom height [28,29,40,46,51]. Except for cloud height, all parameters are available in the Sentinel-2 image metadata. Accordingly, the direction of cloud shadow (Figure 4), referred to as Apparent Solar Azimuth (ϕ a ) can be estimated [40,51]: where ϕ s and θ s are the solar azimuth and zenith angles, respectively; ϕ v and θ v are the sensor's view azimuth and zenith angles, respectively. As ϕ a can have two possible angles with a difference of π radians, the angle is selected to be the one opposite to the sun's mean azimuth location on the image. As the images are acquired before noon, it is expected that ϕ a ∈ [180, 360]. Sentinel-2 metadata provide mean ϕ s and mean θ s . Yet, the view angles are reported per detector of the bushbroom sensor MSI along with their mean values per cell of a grid with a 5 km spacing.

Location of Shadow with Respect to Cloud Projection
To estimate the distance (d) between a pixel of a cloud projection on the image plane and its corresponding shadow (Figure 4), sun and viewing angles along with cloud height (h) are needed. As h is not available with the data, d/h [40] can be calculated to test possible locations of shadows depending on scenarios of cloud height.
As the top and base cloud height and cloud-top ruggedness are not available, it is essential to utilize an approach that does not require these important cloud characteristics. The approach considered in this work assumes that cloud-top and cloud-base height can vary in a predefined manner, an assumption that has been utilized successfully for cloud and shadow detection in MODIS data [12,28]. Various types of clouds develop in tropical areas and are expected at specific heights, with a maximum height of approximately 2 km assumed for the lower dense clouds (e.g. Cumulus, Cumulonimbus, Stratus, and Stratocumulus) [52,53] and 8 km for higher dense clouds (Nimbostratus, Altostratus, and Altocumulus) [53]. As cloud height is not available, a range of values is considered aiming to match clouds with their corresponding shadows.
Clouded scenes in tropical areas are highly likely to be dominated by low cumulus and cumulonimbus clouds that appear both in groups and as isolated entities [54]. Depending on meteorological conditions, such clouds are located at different heights above the ground surface [55,56]. A simplified approach to estimate the height of the base of a

Location of Shadow with Respect to Cloud Projection
To estimate the distance (d) between a pixel of a cloud projection on the image plane and its corresponding shadow (Figure 4), sun and viewing angles along with cloud height (h) are needed. As h is not available with the data, d/h [40] can be calculated to test possible locations of shadows depending on scenarios of cloud height.
As the top and base cloud height and cloud-top ruggedness are not available, it is essential to utilize an approach that does not require these important cloud characteristics. The approach considered in this work assumes that cloud-top and cloud-base height can vary in a predefined manner, an assumption that has been utilized successfully for cloud and shadow detection in MODIS data [12,28]. Various types of clouds develop in tropical areas and are expected at specific heights, with a maximum height of approximately 2 km assumed for the lower dense clouds (e.g., Cumulus, Cumulonimbus, Stratus, and Stratocumulus) [52,53] and 8 km for higher dense clouds (Nimbostratus, Altostratus, and Altocumulus) [53]. As cloud height is not available, a range of values is considered aiming to match clouds with their corresponding shadows.
Clouded scenes in tropical areas are highly likely to be dominated by low cumulus and cumulonimbus clouds that appear both in groups and as isolated entities [54]. Depending on meteorological conditions, such clouds are located at different heights above the ground surface [55,56]. A simplified approach to estimate the height of the base of a cumulus cloud in aviation has been as follows [57]: where h met is the cloud-base height estimated using meteorological data, T s is the surface temperature, and T dew is the dew point. Thus, for the entire study area, it is expected that a major part of cumulus clouds would be at a similar height from the ground surface due to the area's relatively homogenous topography and climate. As meteorological data at acquisition time are not available for the study area, measures such as h met cannot be used to guide the cloud-shadow detection. Thus, an iterative approach is used, aiming to empirically capture representative cloud heights.

Implementation of the Geometry-Based Improvement
Images of potential ϕ a and d/h are calculated using the 10m pixel size of view and sun angle data of each image using the SNAP-ESA Sentinel Application Platform (http: //step.esa.int). These images, in addition to the classification results, are the main input to the geometry-based improvement of the classified shadows that in turn is carried out using the python libraries Rasterio [58], Rasterstats [59], Shapely [60], Geopandas [61] along with Numpy, Pandas, and their dependencies. Figure 5 shows an overview of the geometry-based approach.
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 21 (http://step.esa.int). These images, in addition to the classification results, are the main input to the geometry-based improvement of the classified shadows that in turn is carried out using the python libraries Rasterio [58], Rasterstats [59], Shapely [60], Geopandas [61] along with Numpy, Pandas, and their dependencies. Figure 5 shows an overview of the geometry-based approach. As the terrain topography is mainly smooth, the distance (d) between a cloud projection and its shadow is expected to be consistent for small and sparse clouds ( Figure 6, Case A). Yet, once the clouds and their shadows are adjacent due to large cloud geometry, the shadow geometry in the image is restricted (Figure 6, Case B and Case C). Further- As the terrain topography is mainly smooth, the distance (d) between a cloud projection and its shadow is expected to be consistent for small and sparse clouds ( Figure 6, Case A). Yet, once the clouds and their shadows are adjacent due to large cloud geometry, the shadow geometry in the image is restricted (Figure 6, Case B and Case C). Furthermore, with the presence of neighboring or contiguous clouds, shadows can also be restricted ( Figure 6, Case C). A first clean-up of the classification results is carried out; the classified image is sieved with a 60m 2 threshold (i.e., 6x10m pixels), and holes are closed in clouds and shadow geometry. This removes any speckles resulting from the pixel-based classification and reduces the computational needs for the geometry-based process. Then, for each cloud projection, zonal statistics of φa and d/h are calculated and a mean value of each, per cloud, is provided to guide the matching between each cloud projection and its shadow.
A first iteration considers low and dense clouds corresponding to case A (individual isolated clouds) and certain scenarios of case B, described in Figure 6. A range of h is tested, and the height corresponding to the maximum number of detected shadows is considered the most representative empirically derived cloud height (hemp). Assuming a Euclidian plane 2 and N detected cloud projection geometries by SVM, for each cloud where i ∈ [1, N], mean potential cloud shadow characteristics are extracted using zonal statistics (φai and (d/h)i). The centroid of each cloud ( ) is determined and is translated to ( , ) using potential cloud shadow geometry parameters at intervals j of 50m such that hj ∈ [200, 2, 000]: . cos .

. sin
A spatial query of cloud shadow geometries containing the translated centroids at each is carried out and the number of resulting cloud shadows is calculated. hj corresponding to the maximum number cloud shadows is considered hemp. .The use of the cloud centroid to match clouds to their corresponding shadows provides computational efficiency and does not require cloud shape matching, as cloud shadow footprints can vary from the cloud projection footprints. The clouds and their shadows corresponding to hemp are considered as the first correctly identified set and are retained.  A first clean-up of the classification results is carried out; the classified image is sieved with a 60m 2 threshold (i.e., 6x10m pixels), and holes are closed in clouds and shadow geometry. This removes any speckles resulting from the pixel-based classification and reduces the computational needs for the geometry-based process. Then, for each cloud projection, zonal statistics of ϕ a and d/h are calculated and a mean value of each, per cloud, is provided to guide the matching between each cloud projection and its shadow.
A first iteration considers low and dense clouds corresponding to case A (individual isolated clouds) and certain scenarios of case B, described in Figure 6. A range of h is tested, and the height corresponding to the maximum number of detected shadows is considered the most representative empirically derived cloud height (h emp ). Assuming a Euclidian plane 2 and N detected cloud projection geometries by SVM, for each cloud C i where i ∈ [1, N], mean potential cloud shadow characteristics are extracted using zonal statistics (ϕ ai and (d/h) i ). The centroid of each cloud (c i ) is determined and is translated to (c i,j ) using potential cloud shadow geometry parameters at intervals j of 50 m such that h j ∈ [200,2000]: A spatial query of cloud shadow geometries containing the translated centroids at each h j is carried out and the number of resulting cloud shadows is calculated. h j corresponding to the maximum number cloud shadows is considered h emp . The use of the cloud centroid to match clouds to their corresponding shadows provides computational efficiency and does not require cloud shape matching, as cloud shadow footprints can vary from the cloud projection footprints. The clouds and their shadows corresponding to h emp are considered as the first correctly identified set and are retained.

Cirrus Clouds
Band 10, the cirrus band, was designed to aid in the detection of cirrus clouds [20]. The L2A Sentinel-2 data provides a cirrus cloud mask using B10, detected using threshold filtering tests by Sen2Cor [36]. As the elevation in the area is relatively low, i.e., less than 2km [62], this mask is can be considered suitable for the detection of cirrus clouds. In fact, Sen2Cor has been reported to perform much better in the detection of cirrus clouds than low clouds due to its high reliance on the B10 in the cloud detection procedure [14].

Assessment with Images from Different Seasons and Diverse Cloud Cover
The pilot site (location shown in Figure 1) has been intensively studied using cloudfree imagery obtained from Sentinel-2 from 2016 to 2019, accompanied by field visits that took place on 28 November 2018 and 18 February 2019 [5]. Continuously excavated areas from 2016 to 2019 were detected along with areas that were consistently classified as water bodies. These are used for validation of the results considering images acquired in different seasons, i.e., with different solar angles, ambient temperature, and cloud coverage. Figure 8 shows the pilot site and its location in the vicinity of the town of El Bagre in the department of Antioquia and a view of the areas affected by placer mining throughout the study period, revealing bare soil and water ponds used in the processing of the extracted material.

Cirrus Clouds
Band 10, the cirrus band, was designed to aid in the detection of cirrus clouds [20]. The L2A Sentinel-2 data provides a cirrus cloud mask using B10, detected using threshold filtering tests by Sen2Cor [36]. As the elevation in the area is relatively low, i.e., less than 2 km [62], this mask is can be considered suitable for the detection of cirrus clouds. In fact, Sen2Cor has been reported to perform much better in the detection of cirrus clouds than low clouds due to its high reliance on the B10 in the cloud detection procedure [14].

Assessment with Images from Different Seasons and Diverse Cloud Cover
The pilot site (location shown in Figure 1) has been intensively studied using cloudfree imagery obtained from Sentinel-2 from 2016 to 2019, accompanied by field visits that took place on 28 November 2018 and 18 February 2019 [5]. Continuously excavated areas from 2016 to 2019 were detected along with areas that were consistently classified as water bodies. These are used for validation of the results considering images acquired in different seasons, i.e., with different solar angles, ambient temperature, and cloud coverage. Figure 8 shows the pilot site and its location in the vicinity of the town of El Bagre in the department of Antioquia and a view of the areas affected by placer mining throughout the study period, revealing bare soil and water ponds used in the processing of the extracted material.

Input Uncertainty and Error Sources
The approach aims to make use of readily available atmospherically corrected imagery of the Copernicus Open Access Hub. The correction is carried out using Sen2Cor. The fact that Sen2Cor has limitations in cloud and shadow detection along with misclassification of mining sites and water, it can lead to uncertainty in the reflectance data used in this work. An assessment of such uncertainty could be empirically carried out in the future through analyzing L1C and L2A data considering areas of high and low classification accuracy by Sen2Cor. Since this is not within the scope of this paper, it is not addressed. This topic has nonetheless been discussed in the 2017 ESA workshop Uncertainty in Remote Sensing, where the need "to improve characterization of the error induced by undetected cloud, cloud-shadows and adjacency effects at the cloud edges" was identified [63]. This uncertainly could contribute to error in the classification procedure. If this error is in the classification of shadows where false positives result in the process, these issues would be addressed through the procedure described in the paper where the classified shadows are improved. Yet, if the error is in the clouds class, this would not be corrected by the procedure considered in this paper.

Classification and Selection of Suitable Features
Using a 10m × 10m pixel size of all utilized features, the classification of clouds and shadows was conducted. The results of the optimal three-fold grid search used to determine the suitable parameters and features are shown in Table 3. For each combination of features, the highest classification accuracy of the reference spectra is shown, identifying the optimal combination of parameters. The RBF kernel provides the best results, with this outcome being consistent with a previous study on cloud shadow detection [13]. Sentinel-2 bands with no additional indices provide one of the best classification results.

Input Uncertainty and Error Sources
The approach aims to make use of readily available atmospherically corrected imagery of the Copernicus Open Access Hub. The correction is carried out using Sen2Cor. The fact that Sen2Cor has limitations in cloud and shadow detection along with misclassification of mining sites and water, it can lead to uncertainty in the reflectance data used in this work. An assessment of such uncertainty could be empirically carried out in the future through analyzing L1C and L2A data considering areas of high and low classification accuracy by Sen2Cor. Since this is not within the scope of this paper, it is not addressed. This topic has nonetheless been discussed in the 2017 ESA workshop Uncertainty in Remote Sensing, where the need "to improve characterization of the error induced by undetected cloud, cloud-shadows and adjacency effects at the cloud edges" was identified [63]. This uncertainly could contribute to error in the classification procedure. If this error is in the classification of shadows where false positives result in the process, these issues would be addressed through the procedure described in the paper where the classified shadows are improved. Yet, if the error is in the clouds class, this would not be corrected by the procedure considered in this paper.

Classification and Selection of Suitable Features
Using a 10m × 10m pixel size of all utilized features, the classification of clouds and shadows was conducted. The results of the optimal three-fold grid search used to determine the suitable parameters and features are shown in Table 3. For each combination of features, the highest classification accuracy of the reference spectra is shown, identifying the optimal combination of parameters. The RBF kernel provides the best results, with this outcome being consistent with a previous study on cloud shadow detection [13]. Sentinel-2 bands with no additional indices provide one of the best classification results.

Cloud-Shadow and Cloud Geometry Illustration for Various Seasons and Cloud Cover
Let us now consider the reference image acquired on 18 June 2019. The mean viewing angle per tile over all channels ranged from θ v, from 1 to 10 degrees and ϕ v from 19 to 232 degrees while the sun angles varied in a much smaller range (Figure 9). Accordingly, potential ϕ a was calculated and ranged between 212 and 223 degrees while potential d/h ranged between 0.39 and 0.47 ( Figure 10). The SVM results are shown in Figure 10b, where a large area of false-positive cloud shadows can be identified on the eastern part of the image. For the first iteration of cloud shadow improvement, the most representative cloud height h emp from the ground level was 1050 m and confirmed the shadows of 156 low dense clouds. The second iteration retained 42 other polygon geometries as shadows. All remaining geometrical features in the Shadow class were discarded. Figure 10e shows the improved cloud shadows using the geometry-based approach. Let us now consider the reference image acquired on 18 June 2019. The mean viewing angle per tile over all channels ranged from θv, from 1 to 10 degrees and φv from 19 to 232 degrees while the sun angles varied in a much smaller range (Figure 9). Accordingly, potential φa was calculated and ranged between 212 and 223 degrees while potential d/h ranged between 0.39 and 0.47 ( Figure 10). The SVM results are shown in Figure 10b, where a large area of false-positive cloud shadows can be identified on the eastern part of the image. For the first iteration of cloud shadow improvement, the most representative cloud height hemp from the ground level was 1050m and confirmed the shadows of 156 low dense clouds. The second iteration retained 42 other polygon geometries as shadows. All remaining geometrical features in the Shadow class were discarded. Figures 10e shows the improved cloud shadows using the geometry-based approach. Three images of different seasons and various cloud cover were used to assess the methodology with diverse cloud cover and solar and view angle conditions. The SVM classification model built by the reference spectra of 18 June 2019 was used to classify the three images. An overview of parameters needed for potential shadow locations are shown in Table 4. A range of h and the corresponding retained shadow polygons is shown in Figure 11 where the value corresponding to the highest number of detected shadows h emp is considered and shown in Table 4. Figure 12 shows the results of the three images.  Three images of different seasons and various cloud cover were used to asse methodology with diverse cloud cover and solar and view angle conditions. The classification model built by the reference spectra of 18 June 2019 was used to class three images. An overview of parameters needed for potential shadow locatio shown in Table 4. A range of h and the corresponding retained shadow polygons is s  Table 4.   Figure 13. shows a close-up to a region of one of the classified images and shows a visual comparison between the results of the current approach and those of Sen2Cor. An illustration of shadow omission by Sen2Cor can be clearly viewed and is consistent with literature reporting low detection reaching lower than 30% of cloud shadows in imagery Cirrus clouds Figure 11. h values tested during the first iteration and the corresponding retained shadows. The maxima represent the h emp values in Table 4. Figure 11. h values tested during the first iteration and the corresponding retained shadows. The maxima represent the hemp values in Table 4.   From Weather Underground (www.weatherunderground.com), data from the Los Garzones International Airport Station are used to estimate cloud-base height, h met (Table 5), with temperature measured around image acquisition time. Even though the station is not located in the study area, it is at similar topography, climate, and without topographical obstruction from the study site. Thus, it is used for demonstration purposes. The estimated h met values are consistently lower that corresponding h emp , where the latter considers cloud projection on the image, and thus is affected by cloud-top and cloud-top ruggedness. Thus, cloud thickness could play an essential role in the difference between these two measures. This thickness has been shown to rapidly increase with increasing diameter for small cumulus tropical clouds, while increasing more slowly for larger clouds [64]. Figure 13. shows a close-up to a region of one of the classified images and shows a visual comparison between the results of the current approach and those of Sen2Cor. An illustration of shadow omission by Sen2Cor can be clearly viewed and is consistent with literature reporting low detection reaching lower than 30% of cloud shadows in imagery [65]. Furthermore, the cloud commission error Sen2Cor can be recognized through the river pattern classified as "cloud medium probability" (Figure 13c).  [65]. Furthermore, the cloud commission error Sen2Cor can be recognized through the river pattern classified as "cloud medium probability" (Figure 13c).

Validation over the Pilot Site
The data from the pilot site shown in Figure 8 were used to assess the results and compare them to those of Sen2Cor's clouds of high and medium probability and cloud shadow detection. Figure 14 shows the correct characterization (true positives) of visually identified dense clouds and shadows over the pilot site subset where there were only two images with visually detected clouds over the site. The results show an improved detection to the major omission of both clouds and shadows by Sen2Cor. In fact, Sen2Cor reached as low as 50% and 35% of cloud and cloud-shadow detection, respectively.
(a) (b) Figure 14. True positives of visually detected using the current approach compared to Sen2Cor (a) dense clouds (512 and

Validation over the Pilot Site
The data from the pilot site shown in Figure 8 were used to assess the results and compare them to those of Sen2Cor's clouds of high and medium probability and cloud shadow detection. Figure 14 shows the correct characterization (true positives) of visually identified dense clouds and shadows over the pilot site subset where there were only two images with visually detected clouds over the site. The results show an improved detection to the major omission of both clouds and shadows by Sen2Cor. In fact, Sen2Cor reached as low as 50% and 35% of cloud and cloud-shadow detection, respectively.
Even though clouds contaminated the pilot site on only two dates, the possible false positive detection of clouds and shadows can be present on all four images. As the interest is also in the reduction of misclassification of water bodies and mining sites as clouds and shadows, Tables 6 and 7 show the "total negative" mining and water pixels (clear pixels) and detail any false positive detection by the current approach or by Sen2Cor. Specificity is reported in Figure 15 where Specificity = Neg/(Neg + F), where Neg is the number of total negatives, and F is the number of false positives. Specificity using the current approach reaches 1 for most cases and is constantly higher than Sen2Cor's detection (considering high and medium probability clouds), even with Sen2Cor's low detection rate.
The data from the pilot site shown in Figure 8 were used to assess the results and compare them to those of Sen2Cor's clouds of high and medium probability and cloud shadow detection. Figure 14 shows the correct characterization (true positives) of visually identified dense clouds and shadows over the pilot site subset where there were only two images with visually detected clouds over the site. The results show an improved detection to the major omission of both clouds and shadows by Sen2Cor. In fact, Sen2Cor reached as low as 50% and 35% of cloud and cloud-shadow detection, respectively.    Even though clouds contaminated the pilot site on only two dates, the possible false positive detection of clouds and shadows can be present on all four images. As the interest is also in the reduction of misclassification of water bodies and mining sites as clouds and shadows, Tables 6 and 7 show the "total negative" mining and water pixels (clear pixels) and detail any false positive detection by the current approach or by Sen2Cor. Specificity is reported in Figure 15 where Specificity = Neg/(Neg + F), where Neg is the number of total negatives, and F is the number of false positives. Specificity using the current approach reaches 1 for most cases and is constantly higher than Sen2Cor's detection (considering high and medium probability clouds), even with Sen2Cor's low detection rate.

Limitations and Future Work
While cloud dilation is not considered in this work, it can be a suitable approach for Sentinel-2 data to include fuzzy cloud pixels in cloud masks and to overcome parallax errors [14]. An automated cloud dilation approach will be considered in the future to obtain an improved exclusion of pixels affected by clouds. Figure 15. Specificity in the correct negative identification with respect to dense clouds and shadows for the reference date of the pilot site (a) mining areas (b) water bodies.

Limitations and Future Work
While cloud dilation is not considered in this work, it can be a suitable approach for Sentinel-2 data to include fuzzy cloud pixels in cloud masks and to overcome parallax errors [14]. An automated cloud dilation approach will be considered in the future to obtain an improved exclusion of pixels affected by clouds.
The approach illustrated an efficient improvement to cloud and cloud-shadow detection for Sentinel-2 using freely available tools. However, the approach also has its limitations. When a true shadow is located in a relatively dark area and is classified as a shadow along with its surroundings in one geometry, the entire geometry is retained as a shadow after the geometry-based improvement. Thus, those commissions cannot be excluded. Furthermore, the matching in the second iteration can result in commission errors in the shadows when candidate shadows are located in between a couple of matching cloud and shadow. Yet, all these potential drawbacks occur around areas where true cloud and shadow contamination exist, thus limiting the area of uncertainty in the results and leaving room for localized refinement of the methodology.
Another limitation of the presented methodology is that it is intended for relatively non-rugged terrain and relatively spatially homogeneous meteorological conditions where one representative h emp for cumulus clouds is considered. As such, additional considerations are needed for topography and potential micro-climates that can impact the cloud height with respect to the ground surface. However, the approach is scalable as it can be adjusted to allow the search for multiple representative h emp through considering local maxima for h emp when considering heterogenous areas. These aspects can be considered in the future when needed for other study areas.
As h emp is an empirical measure based on surface reflectance values, it is of great interest to analyze its correspondence to physical cloud characteristics. A future prospect of the work is to assess this measure's link to cloud-top and cloud-base heights (thickness) at various scenarios of cloud-top ruggedness. This would require carrying out an analysis around areas where meteorological data are available or through the use of satellite data that allows for the extraction of cloud 3D geometry, such as geostationary data.

Conclusions
This paper addresses the important topic of cloud and cloud shadow detection over areas of Colombia where small-scale mining activities frequently occur. It presents a workflow of pixel-based classification followed by refinement of classes using solar-cloudshadow-sensor geometry. The approach results in an improved detection of clouds and their shadows along with a reduction in commission errors. It makes use of freely available tools and does not require supplementary data on cloud-top or bottom heights nor cloudtop ruggedness. The geometry-based approach makes use of sun angles and sensor view angles available in Sentinel-2 metadata to identify potential directions of shadows for each pixel. For each cloud, this potential shadow direction is extracted using zonal statistics. An iterative approach is utilized for the exclusion of false positive shadows, given that cloud height is not available. In the first iteration, the focus is on low and dense clouds such as cumulus clouds where an empirical representative cloud height at the time of acquisition is obtained. A second iteration considers shadows and clouds not retained in the first iteration and considers higher cloud elevations. Non-retained shadows from the second iteration are relabeled as clear pixels and excluded from the cloud shadow mask. Compared to Sen2Cor, the semi-empirical model utilized for the atmospheric correction of Sentinel-2 data at the Copernicus Open Access Hub, the approach has shown a better detection of cloud and shadows. Furthermore, it has shown a reduction in the misclassification of mining and water pixels as clouds or shadows. Thus, this approach will be used to extract valid pixels of time-series of Sentinel-2 imagery over Antioquia for the development of an early warning system for sensitive areas that will be potentially affected by the uncontrolled sprawl of small-scale land-based alluvial mining.