Evaluating Combinations of Temporally Aggregated Sentinel-1, Sentinel-2 and Landsat 8 for Land Cover Mapping with Google Earth Engine

: Land cover mapping of large areas is challenging due to the signiﬁcant volume of satellite data to acquire and process, as well as the lack of spatial continuity due to cloud cover. Temporal aggregation—the use of metrics (i.e., mean or median) derived from satellite data over a period of time—is an approach that beneﬁts from recent increases in the frequency of free satellite data acquisition and cloud-computing power. This enables the efﬁcient use of multi-temporal data and the exploitation of cloud-gap ﬁlling techniques for land cover mapping. Here, we provide the ﬁrst formal comparison of the accuracy between land cover maps created with temporal aggregation of Sentinel-1 (S1), Sentinel-2 (S2), and Landsat-8 (L8) data from one-year and test whether this method matches the accuracy of traditional approaches. hirty-two datasets were created for Wales by applying automated cloud-masking and temporally aggregating data over different time intervals, using Google Earth Engine. Manually processed S2 data was used for comparison using a traditional two-date composite approach. Supervised classiﬁcations were created, and their accuracy was assessed using ﬁeld-based data. Temporal aggregation only matched the accuracy of the traditional two-date composite approach (77.9%) when an optimal combination of optical and radar data was used (76.5%). Combined datasets (S1, S2 or S1, S2, and L8) outperformed single-sensor datasets, while datasets based on spectral indices obtained the lowest levels of accuracy. The analysis of cloud cover showed that to ensure at least one cloud-free pixel per time interval, a maximum of two intervals per year for temporal aggregation were possible with L8, while three or four intervals could be used for S2. This study demonstrates that temporal aggregation is a promising tool for integrating large amounts of data in an efﬁcient way and that it can compensate for the lower quality of automatic image selection and cloud masking. It also shows that combining data from different sensors can improve classiﬁcation accuracy. However, this study highlights the need for identifying optimal combinations of satellite data and aggregation parameters in order to match the accuracy of manually selected and processed image composites.


Introduction
Land cover maps help us better understand environmental processes, such as water and biochemical cycles, energy exchanges, or biodiversity alterations [1]. Land cover has been recognised as an Essential Climate Variable (ECV) [2] and has also been proposed as a Satellite Remote Sensing Here, we assessed the accuracy of land cover maps created using temporal aggregation of one-year of data derived from L8, S1, and Sentinel-2 (S2) satellites. The two main objectives were to (1) test whether land cover classifications, created using temporal aggregation of large amounts of automatically-processed satellite data, outperform those created with traditional non-aggregated two-date composites and (2) compare the classification accuracy of temporally aggregated data from different sensors. To do this, multiple datasets were created using the temporal aggregation of satellite-derived measurements taken over different time intervals within one year. GEE was used to apply automated cloud-masking and to temporally aggregate the data. All data were resampled to 30 metres to match the spatial resolution of the most coarsely resolved sensor (L8). This was done to isolate the effects of temporally aggregated measurements on the classification accuracy. A wide variety of datasets were created, including combined multi-sensor data, and supervised classifications were used to create land cover maps. The accuracy of the land cover maps was then assessed using a field-based land cover dataset. The methods were tested in the UK which is a relatively cloudy area of the world [21].

Study Area
The study area is in the UK and covers most of Wales (82.7%) and some bordering regions of England ( Figure 1). The area covers 200 × 100 km 2 and corresponds to tiles T30UVC and T30UVD of S2. The region is geographically diverse, with mountainous areas in the north and centre. It is characterised by a maritime climate, with mild temperatures and rain on more than 200 days a year (https://www. metoffice.gov.uk/), so is often cloudy. The lowlands are dominated by grasslands dedicated to animal grazing, with some cropland in the east and near the north coast. The main urban areas are located in the south. Forested patches are abundant and distributed throughout the region. Semi-natural lowland habitats, formed by heathlands, natural grasslands, and saltmarshes are dispersed and fragmented, and they are mainly found in the coastal areas. Uplands are formed by a mosaic of coniferous woodlands, bogs, heath, and semi-natural grasslands [22]. Here, we assessed the accuracy of land cover maps created using temporal aggregation of oneyear of data derived from L8, S1, and Sentinel-2 (S2) satellites. The two main objectives were to (1) test whether land cover classifications, created using temporal aggregation of large amounts of automatically-processed satellite data, outperform those created with traditional non-aggregated two-date composites and (2) compare the classification accuracy of temporally aggregated data from different sensors. To do this, multiple datasets were created using the temporal aggregation of satellite-derived measurements taken over different time intervals within one year. GEE was used to apply automated cloud-masking and to temporally aggregate the data. All data were resampled to 30 metres to match the spatial resolution of the most coarsely resolved sensor (L8). This was done to isolate the effects of temporally aggregated measurements on the classification accuracy. A wide variety of datasets were created, including combined multi-sensor data, and supervised classifications were used to create land cover maps. The accuracy of the land cover maps was then assessed using a field-based land cover dataset. The methods were tested in the UK which is a relatively cloudy area of the world [21].

Study Area
The study area is in the UK and covers most of Wales (82.7%) and some bordering regions of England ( Figure 1). The area covers 200 × 100 km 2 and corresponds to tiles T30UVC and T30UVD of S2. The region is geographically diverse, with mountainous areas in the north and centre. It is characterised by a maritime climate, with mild temperatures and rain on more than 200 days a year (https://www.metoffice.gov.uk/), so is often cloudy. The lowlands are dominated by grasslands dedicated to animal grazing, with some cropland in the east and near the north coast. The main urban areas are located in the south. Forested patches are abundant and distributed throughout the region. Semi-natural lowland habitats, formed by heathlands, natural grasslands, and saltmarshes are dispersed and fragmented, and they are mainly found in the coastal areas. Uplands are formed by a mosaic of coniferous woodlands, bogs, heath, and semi-natural grasslands [22].

Datasets
The datasets were created using images from three satellites: L8, S2, and S1. L8 and S2 carry optical sensors and are the most common current data sources for medium-high resolution land cover mapping, while S1 carries a radar instrument (Table 1). Data from October 2016 to September 2017 were used. Data spanning the course of one year is ideal for the identification of land cover types that have seasonality patterns, such as agricultural crops, while also allowing for the analysis of medium-term land cover change [23]. All datasets, with the exception of the "two-date composite" datasets, were created by the temporal aggregation of optical or radar data across different time intervals. For example, a one-interval dataset aggregated the data taken over the whole year into one image, while a two-interval dataset aggregated the data into two images, representing the six-month winter/summer season split, and so on ( Figure 2). Datasets with a higher number of intervals aggregated the data across smaller time intervals of equal length. The datasets were created using images from three satellites: L8, S2, and S1. L8 and S2 carry optical sensors and are the most common current data sources for medium-high resolution land cover mapping, while S1 carries a radar instrument (Table 1). Data from October 2016 to September 2017 were used. Data spanning the course of one year is ideal for the identification of land cover types that have seasonality patterns, such as agricultural crops, while also allowing for the analysis of medium-term land cover change [23]. All datasets, with the exception of the "two-date composite" datasets, were created by the temporal aggregation of optical or radar data across different time intervals. For example, a one-interval dataset aggregated the data taken over the whole year into one image, while a two-interval dataset aggregated the data into two images, representing the six-month winter/summer season split, and so on ( Figure 2). Datasets with a higher number of intervals aggregated the data across smaller time intervals of equal length. The image pre-processing (see the details for each dataset in the following sections) and the aggregation of the data across time intervals was performed using GEE [17], except in the case of the "two-date composite" datasets. For optical data (L8 and S2), automatic cloud and cloud shadow masking methods were used. Cloud-masking creates gaps with no-data in different locations for different image dates. This means that the temporal aggregation will be performed using a varying number of images (dates) for different pixels. In order to assess the effect of cloud gaps on temporal aggregations, maps and histograms were created representing the amount of cloud-free pixels for different time intervals.
All images were resampled to 30 metres to match the original spatial resolution of L8, which offers the coarsest resolution of the sensors used. Doing so minimises the effect of spatial resolution on the final classification accuracies, thus simplifying the comparison between datasets. Additionally, elevation, aspect, and slope bands were added to the datasets to increase the classification accuracy [24]. The datasets were divided according to the sensor or processing procedure. A summary of the datasets main features can be seen in Table 2. The acquisition dates of every image used for the temporal aggregations can be found in Supplementary Material S1. L8 data were extracted from the USGS Landsat 8 Surface Reflectance Tier 1 dataset provided by GEE. These data are derived from L8's OLI/TIRS sensors and have been orthorectified and atmospherically corrected to obtain surface reflectance. Bands one to seven, with an original spatial resolution of 30 metres, were used in this study. An automatic cloud masking procedure was applied using the C Function of Mask (CFMask; [25]) band included with the Landsat data.
Temporal aggregation was applied by calculating the mean, median, and variance of the reflectance values across all the available images for a specified time interval. A single seven-band image is, therefore, obtained for every interval for each aggregation function. One or two time intervals (with lengths of 12 and six months, respectively) were used for the Landsat 8 datasets. Temporal aggregation methods work best when there are sufficient cloud-free values for each pixel in the time interval to get a representative value from the aggregation function. Intervals that are too short should, therefore, be avoided for the aggregation function to work correctly. Based on the preliminary cloud-cover analysis for Landsat, shorter intervals would have resulted in images with a large number of no-data values or pixels with just a one cloud-free date. This would have increased the risk of using cloud-contaminated data, as the cloud-masking methods do not always work perfectly [26]. In total, six datasets were created using the mean, median, or a combination of median and variance for one or two intervals within the space of a year ( Table 2).

Indices
Three spectral indices, the NDVI [27], the Normalized Difference Moisture Index (NDMI; [28]), and the Normalized Difference Water Index (NDWI; [29]), were used to create the "indices" datasets. The NDVI is calculated using the red and near-infrared (NIR) bands, (ρ NIR − ρ RED )/(ρ NIR + ρ RED ), where ρ represents spectral reflectance and characterises the "greenness" of the surface. The NDMI is calculated using the NIR and the mid-infrared (MIR) bands, (ρ NIR − ρ MIR )/(ρ NIR + ρ MIR ) and identifies the moisture content of soil and vegetation. The NDWI is calculated using the NIR and the green bands, (ρ GREEN − ρ NIR )/(ρ GREEN + ρ NIR ), and identifies water bodies. The first two indices are related to the structure and cover of vegetation, and all three indices have been widely used in land cover characterisation [23].
L8 data, atmospherically corrected and cloud masked as described in Section 2.2.1, were used to derive the indices. The mean, median, and variance of the indices were calculated for either one or two intervals. The temporally aggregated images using NDVI, or a combination of NDVI and NDMI/NDWI, were used to create the datasets ( Table 2).

Sentinel-2
Sentinel-2 data with level 1C processing [30] provided by GEE were used. These data have been orthorectified and radio-corrected providing top-of-atmosphere reflectance values. Bands 2 to 8, 11, and 12 were used with original spatial resolutions of 10 or 20 metres. An automatic cloud masking procedure was applied using band QA60 of the S2 1C product, masking both opaque clouds and cirrus clouds.
Following the same methodology as the two previous sections, the mean, median, and variance of the S2 reflectance values were calculated for one, two, three, and four time intervals. Datasets using three and four time intervals were included because despite not being comparable to Landsat 8 datasets (with a maximum of two intervals), they may give some insights into the effects of adding additional time intervals onto the classification accuracies. These temporally aggregated images were then combined and resampled to 30 metres to create several "Sentinel-2" datasets ( Table 2). Radar data were analysed using the dual-polarised C-band data from the Synthetic Aperture Radar (SAR) instrument carried by the S1A and S1B satellites. The level-1 Ground Range Detected product (GRD, [31]) provided by GEE was used. The GRD images have been radiometrically calibrated and orthorectified, and the terrain correction has been applied using SRTM30 [32]. Two different polarisation modes were used: single co-polarisation with vertical transmit/receive (VV) and dual-band co-polarisation with vertical transmit and horizontal receive (VH). An extra pre-processing procedure, consisting of spatial filtering using a 7 × 7 Refined Lee speckle filter [33], was applied in order to eliminate the "speckle noise" characteristic of radar images and to make them functional for land cover detection at the spatial resolution used in this study. An extra band, VV-VH, was also created using the difference between the two polarisation modes. A three-band composite image was then created for each date, combining the three polarisation modes (VV, VH, and VV-VH), as this combination has been reported as optimal for land cover characterisation [34].
Radar data is not affected by clouds, so a considerable number of gap-free images can be obtained every month. However, radar data is affected by weather conditions (i.e., recent rainfall or wind) and with S1A and S1B can produce large data sets, so temporal aggregation may still be a valuable tool. For the Sentinel-1 datasets, the main focus was on analysing different numbers of time intervals for the temporal aggregation. Temporally aggregated images using the median values were obtained for each of the bands (VV, VH, and VV-VH), thus creating composites for one, two, four, six, and 12 intervals across the year. All datasets were resampled to 30 metres (Table 2).

Multi-Sensor
A fourth type of dataset was created by combining the best single-sensor type datasets. Three datasets were created by merging S2 with L8 data, S1 with S2 data, and S1 with S2 and L8 data. The selection of the best dataset for each of the single-sensor types was done a posteriori, once the classification accuracy had been estimated for the datasets in the previous sections ( Table 2). Misalignments between satellite images from different satellites were examined visually by a comparison with easily identifiable ground locations and showed that spatial mismatches were always less than one 30 metre pixel.

Two-Date Composite
One final type of dataset was created using single-date images but without applying temporal aggregation techniques. Two relatively cloud-free S2 images were downloaded from the ESA Sentinel Hub (https://scihub.copernicus.eu/): one winter image (5 January 2017) and one summer image (17 June 2017). The images were obtained with a Level-1C pre-processing (orthorectified and radio-corrected). Only bands one to eight, 11, and 12 were used. An atmospheric correction was applied using the sen2cor algorithm [35] in the SNAP Toolbox. A terrain correction was then applied using a Minnaert algorithm with a slope correction [36]. In this case, cloud-masking was manually applied using visual interpretation. The two pre-processed images were combined into a two-date composite and resampled to 30 metres. This use of summer and winter images to create a two-date composite is the "traditional" method used by the UK Land Cover Maps [37,38]. It is used in this paper to provide a baseline accuracy against which to compare the other classifications and to determine whether the temporal aggregations match the accuracy of existing "traditional" methods.
A second "two-date composite" dataset was created using the same two images but without any extra pre-processing (Level-1C processing only). By doing so, the effects of the terrain correction and manual cloud-masking on classification could also be evaluated.

Land Cover Classification
Land cover classifications were carried out for each dataset by applying a supervised classification. Thirteen land cover classes were selected based on the UK Broad Habitat (BH; [39]) classification (Table 3). Training areas were based on polygons that had been identified with the same BH in the UK Land Cover Map 2000 (LCM2000; [37]) and 2007 (LCM2007; [38]), thus mimicking the classification methodology used for the UK LCM2015 (Rowland et al., in prep.). In this way, only stable areas or areas with a high probability of belonging to the assigned class can be selected. These areas were complemented with manually added polygons for rarer classes or classes for which the LCM2000 and LCM2007 were more inconsistent, such as coastal or semi-natural classes (e.g., fen). From the final set of training polygons, 10,000 points were randomly selected for each land cover class and were used to train a classification algorithm. A Random Forest (RF; [40]) classifier with 200 trees was trained and applied to each dataset to create land cover classifications. Because the validation of the classifications was performed by using an independent dataset, all training polygons were fed to the RF classifier, leaving out one-third of the training data for each bootstrap sample. The RF training and classification was implemented using the WEKA Data Mining Software [41]. Grassland Managed grasslands and other semi-natural "Improved Grassland" grasslands (grasses and herbs) on "Calcareous Grassland" non acidic soils "Neutral Grassland" Acid grassland Grasses and herbs on soils derived from acidic bedrock "Acid Grassland" Bog and fen Wetlands with peat-forming vegetation "Bog" such as bog, fen, fen meadows, rush pasture, swamp, flushes and springs "Fen, Marsh and Swamp" Heather Vegetation that has more than a 25% cover of species from the heath family "Dwarf Shrub Heath" Inland rock Natural and artificial exposed rock surfaces "Inland Rock" Saltwater Sea waters "Saltwater" Freshwater Lakes, pools, rivers and man-made waters "Freshwater" Coastal Beaches, sand dunes, ledges, pools "Supralittoral Rock" and exposed rock in the maritime zone "Supralittoral Sediment" "Littoral Rock" "Littoral Sediment"

Saltmarsh
Vegetated portions of intertidal mudflats; species adapted to immersion by tides "Littoral Sediment" ** Built-up areas Urban and rural settlements "Built-up Areas and Gardens" * These classes are based on the UK Broad Habitats or Priority Habitats [39]. ** The Saltmarsh class is based on the "Saltmarsh" UK Priority Habitat, which is included in the "Littoral Sediment" Broad Habitat.

Accuracy Assessment
The Glastir Monitoring and Evaluation Programme (GMEP; [42]) dataset was used for an independent map validation. The GMEP is a field survey-based habitat and vegetation monitoring programme which provides high-resolution georeferenced data of the Welsh countryside. Polygon data for UK Broad Habitats, collected by the GMEP in 2013, 2014, 2015, and 2016 were used. Some BH were merged to match the 13 classes used in this study. National Forest Inventory (NFI; [43]) data from 2016 were used to filter broadleaved and coniferous woodland patches, which had been harvested between the GMEP monitoring campaign and the acquisition of the satellite data. The GMEP data show a general lack of saltmarsh and inland rock polygons. In order to fill this gap between the classifications and the validation data, information on these two classes was extracted from the Natural Resources Wales (NRW) Terrestrial Phase 1 Habitat Survey (hereafter referred to as the Phase 1 Survey; [44]). The Phase 1 Survey is a semi-natural habitat map of Wales and is based on field surveys conducted over the course of several decades. Despite the potentially large time gap between the creation of Phase 1 Survey data and the dates of the satellite images used in this study, mismatches between the classifications and the validation data in terms of the presence of saltmarsh and inland rock polygons are unlikely to be due to changes in the land cover class, as these two classes are very stable over time.
To avoid areas of change, the additional validation polygons were also manually reviewed against aerial photography.
The final set of validation polygons was buffered inwards by 30 metres to avoid selecting mixed pixels at the object boundaries. A total of ten thousand points were randomly selected from the validation set, and the differences between the validation data and the classifications were studied using confusion matrices. The overall accuracy (OA) estimation and the kappa coefficient [45] were used to compare the classification accuracy between datasets.
Finally, to understand the effects of the number of bands on the land cover predictions, the classification accuracies were plotted against the number of satellite data bands. The datasets were grouped by type, and a linear model with an adjusted R 2 was calculated.

Cloud Cover
For temporal aggregations, sufficient observations are required per interval to ensure a good estimate. To evaluate this, the availability of cloud-free pixels during the year studied was estimated for L8 and S2. In general, the S2 data showed a very low number of pixels with zero or one cloud-free image(s) when using one or two time intervals, while L8 data showed more than 10% of pixels with zero or one cloud-free image(s) when using two intervals.
The temporal aggregation over one year led to high numbers of cloud-free pixels throughout the whole study area for the L8 and S2 satellites ( Figure 3). Low values were observed for L8 for the summer and winter time intervals. Values lower than four are observed in coastal regions during summer and in central regions (corresponding to upland areas) during winter. Large areas with values of two or less were found for the winter interval in the centre of the map. No clear geographical patterns for low numbers of cloud-free images were found for S2. However, a triangular-shaped area in the southeast presented lower values for both the single-interval map and the summer/winter maps. This is because the study area is divided into two orbit paths of the S2 satellite, meaning that the date and number of image acquisitions for these two regions differ. In this case, the eastern orbit shows fewer cloud-free images, leading to the creation of this pattern.
All pixels had at least two cloud-free instances over the one-year interval for L8, with seven being the most common value (Figure 4a). However, almost 10% of pixels had just one cloud-free instance for the winter interval and about 10% of them had just two cloud-free values for summer. For S2, there were no pixels with less than seven cloud-free values over the one-year interval, with 26 being the most common value (Figure 4b). The winter interval did not present any pixels with less than four cloud-free values, while less than 5% of pixels in the summer interval had less than three. A bimodal shape was observed for the S2 histograms, which was due to the differences between the regions falling into the different S2 orbit trajectories [7]. maps. This is because the study area is divided into two orbit paths of the S2 satellite, meaning that the date and number of image acquisitions for these two regions differ. In this case, the eastern orbit shows fewer cloud-free images, leading to the creation of this pattern. All pixels had at least two cloud-free instances over the one-year interval for L8, with seven being the most common value (Figure 4a). However, almost 10% of pixels had just one cloud-free instance for the winter interval and about 10% of them had just two cloud-free values for summer. For S2, there were no pixels with less than seven cloud-free values over the one-year interval, with 26 being the most common value (Figure 4b). The winter interval did not present any pixels with less than four cloud-free values, while less than 5% of pixels in the summer interval had less than three. A bimodal shape was observed for the S2 histograms, which was due to the differences between the regions falling into the different S2 orbit trajectories [7]. maps. This is because the study area is divided into two orbit paths of the S2 satellite, meaning that the date and number of image acquisitions for these two regions differ. In this case, the eastern orbit shows fewer cloud-free images, leading to the creation of this pattern.

Classification Accuracy
The classification accuracy of each dataset scenario was assessed using confusion matrices and their derived accuracy indices. The confusion matrices can be found in Supplementary Material S2. The two traditional-style two-date composites, which were not temporally aggregated, obtained the highest classification accuracy, followed by the combined sensors datasets. The kappa index was very closely related to the overall accuracy (OA), meaning that datasets with a higher OA also obtained higher kappa values. A summary of the accuracy results can be found in Figure 5.

Classification Accuracy.
The classification accuracy of each dataset scenario was assessed using confusion matrices and their derived accuracy indices. The confusion matrices can be found in Supplementary Material 2. The two traditional-style two-date composites, which were not temporally aggregated, obtained the highest classification accuracy, followed by the combined sensors datasets. The kappa index was very closely related to the overall accuracy (OA), meaning that datasets with a higher OA also obtained higher kappa values. A summary of the accuracy results can be found in Figure 5. All of the "Landsat 8" datasets obtained very similar levels of accuracy, with the OA ranging from 68.6% to 70.8%. The most accurate "Landsat 8" dataset used the median and variance values taken over two intervals. The datasets based on just the L8-derived indices obtained the lowest levels of accuracy. The temporal aggregation over the mean or median values of NDVI obtained very similar levels of accuracy, with the datasets using two intervals or a combination of median and variance over one interval being the most accurate. The combination of NDVI with NDMI or NDWI obtained higher levels of accuracy than the temporal aggregations using only NDVI, with the datasets using median values over two time intervals being the most accurate (OA = 62.2%).
The OA of the "Sentinel-2" scenarios ranged between 69.4% and 73.3%. The most accurate classification used median values calculated over three (OA = 72.9%) or four (OA = 73.3%) time intervals. These were followed by the dataset created using median values over two intervals (OA = Figure 5. The classification accuracies: the error bars represent a 95 percent confidence interval for the overall accuracy, calculated using an exact binomial test. All of the "Landsat 8" datasets obtained very similar levels of accuracy, with the OA ranging from 68.6% to 70.8%. The most accurate "Landsat 8" dataset used the median and variance values taken over two intervals. The datasets based on just the L8-derived indices obtained the lowest levels of accuracy. The temporal aggregation over the mean or median values of NDVI obtained very similar levels of accuracy, with the datasets using two intervals or a combination of median and variance over one interval being the most accurate. The combination of NDVI with NDMI or NDWI obtained higher levels of accuracy than the temporal aggregations using only NDVI, with the datasets using median values over two time intervals being the most accurate (OA = 62.2%).
The OA of the "Sentinel-2" scenarios ranged between 69.4% and 73.3%. The most accurate classification used median values calculated over three (OA = 72.9%) or four (OA = 73.3%) time intervals. These were followed by the dataset created using median values over two intervals (OA = 72.7%). Figure 6a shows this dataset and its classification for a region that presented cloudy pixels on several image dates. This dataset was slightly more accurate than the equivalent dataset created using the L8 data (OA = 70.8%). The OA of the "Sentinel-1" datasets increased with the number of time intervals. The scenario that used mean values over 12 time intervals obtained the maximum level of accuracy, with an OA of 69.0%, which was slightly below the most accurate scenarios of the "Landsat 8" or "Sentinel-2" types. 72.7%). Figure 6a shows this dataset and its classification for a region that presented cloudy pixels on several image dates. This dataset was slightly more accurate than the equivalent dataset created using the L8 data (OA = 70.8%). The OA of the "Sentinel-1" datasets increased with the number of time intervals. The scenario that used mean values over 12 time intervals obtained the maximum level of accuracy, with an OA of 69.0%, which was slightly below the most accurate scenarios of the "Landsat 8" or "Sentinel-2" types. The classification accuracy of individual land cover classes was compared for the most accurate Sentinel-1, Sentinel-2, and Landsat 8 classifications (Figure 7). Differences in the accuracy between Sentinel-2 and Landsat 8 were less than 7% for all classes, except for inland rock. Arable, bog and fen, inland rock, and sea water were slightly more accurately classified with the Landsat 8 dataset. Sentinel-2 datasets obtained better accuracy than Sentinel-1 for all classes except for arable and fen and bog. The higher frequency of the Sentinel-1 data may help to capture the phenology of arable lands, while its capability to detect soil moisture may help with the detection of fen and bogs. Inland rock was validated using only 12 points, due to its scarcity in the landscape, so slight changes in classifications produce a high variability in accuracy between datasets for this class. Saltwater's accurate detection and its confusion with coastal classes is strongly dependent on the tidal state at the time of image acquisition. However, inland rock and saltwater have relatively few validation points, so their accuracy will not have a significant impact on the overall classification accuracies. The classification accuracy of individual land cover classes was compared for the most accurate Sentinel-1, Sentinel-2, and Landsat 8 classifications (Figure 7). Differences in the accuracy between Sentinel-2 and Landsat 8 were less than 7% for all classes, except for inland rock. Arable, bog and fen, inland rock, and sea water were slightly more accurately classified with the Landsat 8 dataset. Sentinel-2 datasets obtained better accuracy than Sentinel-1 for all classes except for arable and fen and bog. The higher frequency of the Sentinel-1 data may help to capture the phenology of arable lands, while its capability to detect soil moisture may help with the detection of fen and bogs. Inland rock was validated using only 12 points, due to its scarcity in the landscape, so slight changes in classifications produce a high variability in accuracy between datasets for this class. Saltwater's accurate detection and its confusion with coastal classes is strongly dependent on the tidal state at the time of image acquisition. However, inland rock and saltwater have relatively few validation points, so their accuracy will not have a significant impact on the overall classification accuracies.
The "Combined" datasets obtained better accuracy results than any of the single-sensor datasets using temporally aggregated data. The dataset that combined S1, S2, and L8 data obtained the highest accuracy with a OA of 76.5%, followed by the scenario combining S1 and S2 data (OA = 75.4%). The scenarios combining the two optical sensors, i.e., L8 and S2, obtained an appreciably lower level of accuracy (OA = 72.8%) than the other two combined scenarios. The land cover classification map using the S1/S2/L8 combined data-the most accurate temporally-aggregated scenario-can be seen in Figure 8. Figure 9 shows this classification, together with the classification of the least accurate scenario, as well as its discrepancies with the validation data.
Finally, the two traditional "two-date composite" datasets, which were based on non-temporally aggregated data, obtained the highest accuracy. The area of these classifications were, however, smaller than the rest of the classifications, as one of the satellite images covered only about 80% of the study area. The dataset that used manual cloud-masking obtained the highest accuracy, with an OA of 77.9%, although the dataset that used automatic cloud-masking was almost equal at 76.5%. Despite the fact that both datasets obtained similar classification accuracies, the automatic cloud-masking failed to identify large cloud patches in some regions (Figure 6b,c). This affected the classification of these cloudy areas but, due to the quantity and distribution of the validation areas, were not enough to substantially decrease the estimated overall classification accuracy. Remote Sens. 2019, 11, x FOR PEER REVIEW 13 of 21 The difference in accuracy between the Sentinel-2 and Sentinel-1 datasets: the negative values represent a higher accuracy for the Sentinel-1 dataset.
The "Combined" datasets obtained better accuracy results than any of the single-sensor datasets using temporally aggregated data. The dataset that combined S1, S2, and L8 data obtained the highest accuracy with a OA of 76.5%, followed by the scenario combining S1 and S2 data (OA = 75.4%). The scenarios combining the two optical sensors, i.e., L8 and S2, obtained an appreciably lower level of accuracy (OA = 72.8%) than the other two combined scenarios. The land cover classification map using the S1/S2/L8 combined data-the most accurate temporally-aggregated scenario-can be seen in Figure 8. Figure 9 shows this classification, together with the classification of the least accurate scenario, as well as its discrepancies with the validation data.  The difference in accuracy between the Sentinel-2 and Sentinel-1 datasets: the negative values represent a higher accuracy for the Sentinel-1 dataset.
The "Combined" datasets obtained better accuracy results than any of the single-sensor datasets using temporally aggregated data. The dataset that combined S1, S2, and L8 data obtained the highest accuracy with a OA of 76.5%, followed by the scenario combining S1 and S2 data (OA = 75.4%). The scenarios combining the two optical sensors, i.e., L8 and S2, obtained an appreciably lower level of accuracy (OA = 72.8%) than the other two combined scenarios. The land cover classification map using the S1/S2/L8 combined data-the most accurate temporally-aggregated scenario-can be seen in Figure 8. Figure 9 shows this classification, together with the classification of the least accurate scenario, as well as its discrepancies with the validation data.  Finally, the two traditional "two-date composite" datasets, which were based on non-temporally aggregated data, obtained the highest accuracy. The area of these classifications were, however, smaller than the rest of the classifications, as one of the satellite images covered only about 80% of the study area. The dataset that used manual cloud-masking obtained the highest accuracy, with an OA of 77.9%, although the dataset that used automatic cloud-masking was almost equal at 76.5%. Despite the fact that both datasets obtained similar classification accuracies, the automatic cloudmasking failed to identify large cloud patches in some regions (Figure 6b,c). This affected the classification of these cloudy areas but, due to the quantity and distribution of the validation areas, were not enough to substantially decrease the estimated overall classification accuracy.
The effect on the overall accuracy of increasing the number of bands varied between the different dataset types (Figure 10). The "Combined" and "Indices" datasets showed very high correlations between the number of bands and the OA. The increase in OA for the "Indices" datasets was very acute. However, the number of bands only ranged between one and four for this dataset type, while the range for the rest of the datasets was at least 20 bands. The correlation with the "Sentinel-1" type dataset was moderate, while correlations with the "Sentinel-2" and "Landsat 8" types were weak. The effect on the overall accuracy of increasing the number of bands varied between the different dataset types (Figure 10). The "Combined" and "Indices" datasets showed very high correlations between the number of bands and the OA. The increase in OA for the "Indices" datasets was very acute. However, the number of bands only ranged between one and four for this dataset type, while the range for the rest of the datasets was at least 20 bands. The correlation with the "Sentinel-1" type dataset was moderate, while correlations with the "Sentinel-2" and "Landsat 8" types were weak.

Discussion
One of the challenges in producing land cover maps of large areas is the inconsistency in the number of cloud-free pixels available over a period of time. In this study, the differences in the number of cloud-free images per pixel over a year were analysed to evaluate the effects on the temporal aggregation of L8 and S2 data. For the data assessed in this study, to ensure at least one cloud-free pixel per time interval, a maximum of two intervals per year were possible with L8, while more than two intervals could be used for Sentinel-2. This is due to Sentinel-2's higher revisit frequency during 2016 (10 days against 16 days for L8). With the recent launch of Sentinel-2B, the revisit frequency of both Sentinel-2A and 2B is expected to increase to an average of five days, with

Discussion
One of the challenges in producing land cover maps of large areas is the inconsistency in the number of cloud-free pixels available over a period of time. In this study, the differences in the number of cloud-free images per pixel over a year were analysed to evaluate the effects on the temporal aggregation of L8 and S2 data. For the data assessed in this study, to ensure at least one cloud-free pixel per time interval, a maximum of two intervals per year were possible with L8, while more than two intervals could be used for Sentinel-2. This is due to Sentinel-2's higher revisit frequency during 2016 (10 days against 16 days for L8). With the recent launch of Sentinel-2B, the revisit frequency of both Sentinel-2A and 2B is expected to increase to an average of five days, with that number becoming larger for higher latitudes [7]. Such high revisit frequencies could enable the creation of temporal aggregations over at least five or six intervals per year. This will allow researchers to better characterise the seasonality of certain land types and to use time-series analysis approaches [23]. Differences in cloud-free coverage are also affected by the differences in satellite coverage, an issue that should be considered for large areas. For the study area, Sentinel-2 data showed a different level of cloud-free coverage for a large triangular region in the southeast, which corresponded to a different orbit path. This could result in inconsistencies in accuracy [12], and it should be acknowledged when deciding the length of the interval to be used for temporal aggregation or when evaluating the appropriateness of gap-filling methodologies.
The highest OA for a temporally aggregated dataset was 76.5% (the dataset combining L8, S1, and S2 data). Issues that may have the potential to decrease classification accuracy were detected with the training data; they include the overestimation of certain widespread classes (i.e., acid grassland) and the scarcity of training polygons for common upland classes (i.e., bog). The quality of the training data is key to obtaining satisfactory classification results [46]. A large area of this image was covered with uplands that are a complicated mix of habitats occurring in a complex mix of mosaics and gradual transitions between different habitats. These habitats can be difficult for surveyors in the field to map reliably and do not really conform to the idea of discrete patches of land cover implicit in mapping the dominant land cover in the pixel [22].
Some studies that have applied innovative input-data approaches for land cover mapping have obtained higher classification accuracies [47,48]. In general, most of those studies classified fewer classes than our study or they used classes that have traditionally been easier to characterise from remotely sensed data (i.e., forests or water). Inglada et al. [12], on the other hand, used interpolation methods to create a land cover classification of France, characterising a wide variety of land cover classes and obtaining kappa values ranging between 0.82 and 0.86. The main difficulty of comparing the accuracies of these studies with those obtained in our study is that their validation data was mainly based on the visual interpretation of remotely sensed data (satellite or aerial photography). Gebhardt et al. [17] applied temporal aggregation methodologies to create a land cover map of Mexico and obtained overall accuracies of up to 76% using ground-based validation data. Although the use of stratified-sampled ground reference data for validation, as we do here, is not exempt from error, it is considered a more accurate representation of the land surface and it can avoid the biases inherent in validating against remotely sensed data in terms of spatial or thematic resolution [49]. In any case, the aim of this study was to compare the accuracies of different input datasets. To do this, the training samples and validation samples were kept constant for every dataset. Using stratified-sampled ground data for validation was key for our purposes, as reliable data for rare land cover classes such as different types of natural grasslands can be difficult to obtain accurately from remote sensed datasets.
The classification accuracies varied from 46% to 76.5%. The "Indices" datasets, which were based on NDVI/NDMI/NDWI estimates, obtained the lowest levels of accuracy. The dataset that combined NDVI and NDMI over two time intervals was the most accurate; however, its accuracy was much lower than for the datasets using reflectance measurements. Although spectral indices have been shown to be a simple and effective way of analysing time series of vegetation data [50] and may improve the characterisation of the land cover when combined with other reflectance measures [51], our indices alone did not obtain good results in terms of predicting a wide variety of land cover classes. This is probably due to the fact that these indices considerably reduce the amount of spectral information used to characterize land cover and that indices are mostly useful when combined with reflectance bands. The "Sentinel-1" datasets obtained, on average, worse accuracies than the other two optical sensors. However, the dataset that used 12 time intervals obtained almost the same level of accuracy as the most accurate L8-based classification. Previous studies have shown the potential of S1 to characterise certain land types, such as crops [52] or forests [53]. S1 has also shown good results for the land cover mapping of heterogeneous landscapes [34]. However, this study showed how S1's capabilities for characterising land cover are principally based on its image frequency, as it is able to use a considerable number of dates per year whilst maintaining a relatively low number of bands. The temporal aggregation of S1 data might provide a good strategy for including frequent radar data in classifications but without needing to include every image that can be acquired in a year.
The "Sentinel-2" datasets obtained higher accuracy than other single-sensor datasets with a maximum OA of 73%, which was significantly higher than the most accurate "Landsat 8" dataset. However, this S2-based dataset used four time intervals, while the L8 datasets used a maximum of two intervals due to the lower numbers of available cloud-free pixels. The accuracy of the datasets using two intervals did not differ between S2 and L8. It has to be noted that some authors have pointed out the poor performance of the cloud mask used for S2 [54]. Additionally, the S2 data currently available via GEE is not atmospherically corrected, whereas the L8 data is. These two limitations might have affected the potential of S2 and its additional bands to outperform L8. We used the L8 and S2 data sets from GEE despite their differences in processing because one of our aims was to determine to what extent pre-processing could be simplified by adopting workflows in GEE.
The "Combined" datasets, which used multi-sensor derived data, obtained the most accurate classifications of all the temporally aggregated datasets. The combination of L8 and S2 data produced lower levels of accuracy than those that combined optical and radar data (L8/S2 or L8/S2/S1). Known co-registration problems between Landsat 8 and Sentinel-2 [55,56] could have lowered the accuracy of the combined datasets. However, our methodology, avoiding boundaries between land cover types for classification training and validation pixels, should reduce the effects of spatial misalignment between images. The launch of the Harmonized Landsat and Sentinel Datasets [57] should minimize this issue in the near future. Our results support findings from previous studies, which have suggested the potential of combining optical and radar data to characterise and detect changes in land cover [58][59][60].
The effect of the number of selected time intervals on classification accuracy was studied. In general, a higher number of intervals resulted in a higher level of classification accuracy, and two-interval datasets outperformed one-interval datasets that included variance measurements. Increasing the number of intervals has a similar effect to increasing the number of single-date images per year and helps to characterise the land cover, as certain classes have distinctive phenological signatures that can be detected by the classification algorithm [61]. The accuracy obtained by the "Sentinel-1" datasets increased linearly according to the number of time intervals. The temporal aggregation of S1 data over a sufficient number of time intervals can help to improve the accuracy of land cover classifications, especially when combined with optical data, while avoiding the necessity of adding massive amounts of data.
A linear correlation between the number of bands and the classification accuracy was found for some dataset types. Yu et al. [62] showed in their review that accuracy increased according to the number of bands, especially if the datasets combined data from different sensors. This was also true for this study, except in the case of the "Sentinel-2" and "Landsat 8" datasets, probably because some of these datasets included variance data, which increased the number of bands without significantly increasing the accuracy. Whether additional bands increase accuracy will depend on whether they contain additional information, for example, by capturing a different point in time. The growth in the availability of remote sensed data and processing platforms over the last few years is facilitating the usage of larger volumes of data [5] and is, therefore, providing the potential for improving the accuracy of land cover classifications. However, the selection of optimal input data is still important, as the accuracy/data size trade-off has not disappeared. For example, GEE proved to be a great tool for selecting, cloud-masking, stacking, and extracting big satellite datasets. However, GEE memory limits on the size of arrays prevented the training of the RF algorithm with the large number of training samples and input bands assessed here. To counter this we created the input data stacks in GEE and then exported them and processed them locally using established RF classification work flows. Processing in GEE is also currently affected by the very different levels of pre-processing applied to the data collections by the agencies providing the data. So the L8 data was atmospherically corrected and had an established cloud-mask, whereas the S2 data was not atmospherically corrected and had a cloud-mask that still has significant issues [54]. This complicates analyses between the capabilities of different sensors and will make GEE unsuitable for some types of processing.
The highest levels of accuracy were obtained by datasets that used traditional non-temporally aggregated two-date composites. The accuracy of the two-date composite with automatic cloud-masking was not significantly lower than the manually cloud-masked one. The images for these datasets were selected based on the low amount of cloud cover, and thus, the effects of poor cloud-masking for the automatically cloud-masked composite did not significantly affect the accuracy, as the quality of the classification was only affected in a relatively small area. Only the most accurate of the temporally aggregated datasets, which combined optical and radar data, matched the accuracy of the two-date composites. This highlights the quality of the methods that have underpinned the UK's Land Cover Map series to date. Some manual pre-processing tasks, such as manual cloud-masking or the manual selection of cloud-free areas for gap-filling, can require large amounts of labour [63]. Adopting the temporal aggregation methods and automatic cloud-masking could help to reduce the time needed to produce land cover maps of large areas, especially for regions for which the acquisition of cloud-free images is particularly difficult. However, increased accuracy should not be presupposed if these methods are used. For example, Senf et al. [10] found that data fusion matched the accuracy of manually processed Landsat images, but a lower level of accuracy was obtained when there were not enough cloud-free images available per year.
Approaches that use all available satellite data are also capable of dealing with the gaps and capable of reducing cloud-masking inaccuracies and have produced promising results for land cover mapping and the land cover change detection of large areas [20,64]. Non-aggregated datasets can be, however, very difficult to manage, and most classification algorithms cannot handle the combination of a large volume of training data, a large number of classes, and a high number of input bands, even when using cloud-computing platforms. On the other hand, temporal aggregation maintains reasonably small-sized data to feed the classification algorithms, while having the potential to reduce the processing time needed for land cover mapping of large areas. However, an optimal choice of satellite data and aggregation parameters are crucial to maintain the accuracy levels of the more traditional, manually intensive approaches.

Conclusions
The recent availability of frequent satellite data and cloud computing platforms are stimulating the emergence of new approaches for the land cover mapping of large areas. This paper has analysed one of these approaches, i.e., the temporal aggregation of automatically pre-processed satellite data, comparing it with traditional methods and studying the classification accuracy of different temporally aggregated datasets. Temporal aggregation of all the available images, over the course of one year, only matched the manually selected and processed two-date composite when an optimal combination of optical and radar data was used. Combined datasets (S1/S2 or S1/S2/L8) outperformed single-sensor datasets, while datasets based on spectral indices obtained the lowest levels of accuracy. This study provides, to the best of our knowledge, the first formal comparison of the accuracy between multi-sensor data and single-sensor data using S1, S2, and L8 data for land cover mapping, as well as providing a wide range of combinations and parameters for the classification input data.
Cloud computing platforms such as GEE allow researchers to compensate for the lower quality of automatic pre-processing methods by using larger amounts of satellite data. However, identifying the optimal input datasets, in terms of the best combination of satellite data, temporal interval, and other data aggregation parameters, are needed in order to match the accuracy of manually selected and