Earth Observation Data Cubes for Brazil: Requirements, Methodology and Products

: Recently, remote sensing image time series analysis has being widely used to investigate the dynamics of environments over time. Many studies have combined image time series analysis with machine learning methods to improve land use and cover change mapping. In order to support image time series analysis, analysis-ready data (ARD) image collections have been modeled and organized as multidimensional data cubes. Data cubes can be deﬁned as sets of time series associated with spatially aligned pixels. Based on lessons learned in the research project e-Sensing, related to national demands for land use and cover monitoring and related to state-of-the-art studies on relevant topics, we deﬁne the requirements to build Earth observation data cubes for Brazil. This paper presents the methodology to generate ARD and multidimensional data cubes from remote sensing images for Brazil. We describe the computational infrastructure that we are developing in the Brazil Data Cube project, composed of software applications and Web services to create, integrate, discover, access, and process the data sets. We also present how we are producing land use and cover maps from data cubes using image time series analysis and machine learning techniques.


Introduction
The growing demands for natural resources and the need to deal with climate change's effects resulting from greenhouse gas emissions present significant challenges for our societies. To understand how these changes are affecting the environment, one of our primary sources of information is land use and cover change data obtained by Earth observation satellites [1]. Analysis of remote sensing images is the most effective way to produce up-to-date information about the Earth's surface status [2].
Nowadays, researchers have free access to an unprecedentedly large number of remote sensing images collected by different satellites and sensors with distinct spatial, temporal, and spectral resolutions. and techniques to substantially improve the extraction of land use and cover change information from big Earth Observation data sets in an open and reproducible way. Based on the know-how acquired in the e-Sensing project and on national demands on land use and cover monitoring, the following requirements were defined to build EO data cubes for Brazil.

Image Time Series Analysis and Machine Learning
In the e-Sensing project, different classification methods, based on satellite image time series analysis and machine learning algorithms, were proposed, tested, and implemented. One of these methods is the time-weighted dynamic time warping (TWDTW) algorithm [28]. The authors improved the dynamic time warping (DTW) method for land use and cover classification, by introducing a time constraint to account for seasonality of land use types. This allows users to include information about the agricultural calendar in the region of interest, thereby obtaining better results.
Besides TWDTW, other machine learning methods, such as the support vector machine (SVM) and random forest (RF), were applied to classify satellite image time series for the Mato Grosso state and other regions in the Cerrado biome, Brazil [8,10,11]. These results presented good accuracy and made explicit the potential of image time series analysis and machine learning methods for land use and cover mapping. Thus, the main motivation to create multidimensional data cubes from ARD remote sensing images for all Brazilian territory is to be able to apply these methods for national land use and cover mapping. An important requirement is to continue investigating and advancing in techniques based on machine learning and image time series analysis, exploiting methods based not only on the temporal dimension but also on spatial context, using spatiotemporal heterogeneity and correlation [29].

ARD and Data Cubes of Medium-Resolution Satellite Images
In the e-Sensing project, we mainly used the MODIS sensor products MOD13Q1 and MYD13Q1 that have 250 m of spatial resolution and 16 days of temporal resolution. These products were created using a temporal composition method that selected the best pixels during each period of 16 days of the Terra and Acqua satellites. That increased the chances of finding values without cloud cover and enabled us to obtain dense time series.
The spatial resolution of the MOD13Q1 and MYD13Q1 products is not adequate for mapping agricultural activity in regions of smallholdings and identification of small deforestation areas. In our experiments, areas smaller than~11 ha could not be classified accurately using MODIS data.
To improve the ability to identify small areas of land use and cover, an important requirement is to produce advancements in the use of medium-resolution images from CBERS-4 (64 m), Landsat-8 (30 m), and Sentinel-2 (10, 20, and 60 m) satellites. We need to create ARD image collections from the satellite imagery for all Brazilian territory and to model these collections as multidimensional data cubes to support time series analysis.

Land Use and Cover Samples and Data Sets
Based on our previous experiences, we confirmed that a classified map's accuracy depends directly on the quality of the training samples used by the machine learning methods. Thus, it is essential to advance the sample collection tools and methodology, and techniques used to assess and improve the quality of these samples, including feature selection and outlier removal [30].
Land use and cover samples are collected by different projects and individuals, using different methods, such as in situ gathering in fieldwork and visual interpretation of high-resolution satellite images. An important requirement is to be able to describe samples with proper metadata that characterize their differences and organize them in a shared database to facilitate the reproducibility of experiments. It is also important to develop tools to easily discover, query, access, and process this shared sample database.

Interoperability and Web Services
In the e-Sensing project, several technologies for data management, access, and processing were tested and used [19]. Initially, we organized and stored time series of data in binary files in the local file system, and developed proprietary technologies to access them. This solution proved inefficient when processing large volumes of data, especially when coordinating multiple simultaneous data access. To solve this problem, we adopted the SciDB open array database [31].
We also tested alternative solutions to SciDB, such as Apache Hadoop, and concluded that SciDB had better computational performance and a lower cost of adapting legacy software [19]. The SciDB solution had solved problems related to the coordination of data access. However, it presented a technical difficulty of interoperability with other systems, especially those developed under the e-Sensing project, such as the Satellite Image Time Series (sits) R package [32]. Thus, a fundamental requirement is to use open and consolidated technologies that facilitate interoperability with other systems.
One way to improve the interoperability between systems is to develop software components as Web services. Hence, we developed the Web Time Series Service (WTSS), a lightweight Web service to retrieve time series from big data sets in an open format and based on consolidated technologies [33]. WTSS proved to be suitable for dealing with big amounts of remote sensing images by providing server-side processing. The development of software components as Web services facilitates the integration with different programming languages, such as R, Python, and C++, and different types of client applications. It also enables the adoption of the moving code paradigm and server-side processing [34].

Cloud Computing and Distributed Processing Environments
Cloud computing and distributed processing environments have been supplied as services by different providers, such as Amazon Web Services (AWS) and Microsoft Azure Cloud Services. These environments provide development tools, storage, virtual machines, databases, and network services in a highly scalable way. Some providers also contain large repositories of Earth observation satellite image collections as open data, meaning that they can be used without restrictions and with minimum cost. As an example, the Open Data on AWS [35] program offers data buckets with Landsat-8, Sentinel-2, and CBERS-4 imagery.
Brazil is a huge country, with a territory of over 8,500,000 km 2 . To build ARD and data cubes of Sentinel-2 (A and B) images from 2015 to 2020 for the entire country, we need to process 650 terabytes of data. Therefore, it is necessary to exploit the potential of these cloud computing environments to handle large amounts of remote sensing images efficiently. These environments do not impose any API for data processing, allowing the development of open source technologies in non-proprietary scripting languages, and thereby reducing potential vendor lock-in effects. We tested the cloud services on AWS to access remote sensing images stored as open data and concluded that they are a good alternative to generate and process EO data cubes [36].

Methodology for EO Data Cube Generation
Driven by the requirements described in Section 2, the Brazil Data Cube (BDC) project has four main objectives: (1) create ARD image collections from medium-resolution remote sensing images (10 to 64 m) for all Brazilian territory; (2) model these ARD images as multidimensional data cubes with three or more dimensions that include space, time, and spectral-derived properties, mainly to support image time series analysis; (3) use, propose, and develop big data technologies, such as cloud computing and distributed processing environments, to create, store, and process these data cubes; and (4) create land use and cover information, for Brazil, from these data cubes using satellite image time series analysis, machine learning methods, and image processing procedures. Figure 1 presents a general overview of the BDC project.  The images are collected from their different original providers and pre-processed into surface reflectance products. The atmospheric correction procedure depends on each sensor's characteristics, and different algorithms and software are used. For instance, for Sentinel-2/MSI and Landsat-8/OLI we use the LaSRC software [37] and for CBERS-4/AWFI and MUX the MS3 software [38]. For each image, cloud and cloud shadow masks are computed. For Sentinel-2 and Landsat-8, the Fmask (version 4.2) is being used [39] because it has presented the best overall accuracy, compared with MAJA, Sen2Cor, and s2cloudless, on images of the Amazon region in Brazil [40]. For CBERS-4 we use the CMASK algorithm because it does not have spectral bands need by Fmask. Once this process is performed and the resulting data set is cataloged, the surface reflectance image collections are used to generate the data cubes.

Tiling System
The organization of data cubes is based on a spatial partition of the territory. This partition is represented by a grid of tiles where every pixel can be efficiently located within a tile. In the BDC project, the data cubes are created by tiles, and each tile of a grid generates a set of files in the disk. Each file is associated with a specific data cube and with a grid tile, and contains information about a specific property or band. Considering the spatial resolution of the images and aiming at maintaining files that can be easily manageable, three hierarchical grids were defined using an Albers equal area projection and SIRGAS 2000 datum. The three grids were generated by taking -54 longitude as the central reference and defining tiles of 6 × 4, 3 × 2, and 1.5 × 1 degrees, referred to as BDC_LG (large), BDC_MD (medium)m and BDC_SM (small), as shown in Figure 3 and described in Table 1.

Data Cube Generation
We are creating identity data cubes and regular data cubes. Currently, each data cube contains images from a single sensor. An identity data cube uses all available images in a period of time, tiled to a common grid, without applying a temporal compositing function, as illustrated in Figure 4. An identity data cube is not regular in the time dimension; it instead maintains all available images with their original acquisition dates.
The regular data cubes are created by using functions for temporal compositing. They are regular in the time dimension. These temporal compositing functions reduce the original time dimension to a longer unit (e.g., a month or 16 days), generating observations that are regularly spaced in time. Figure 4 shows the entire process to create a monthly regular data cube.
The process to generate data cubes illustrated in Figure 4 is composed of two steps: warping and temporal compositing. The warping process generates the identity data cubes. The input images are resampled to the same spatial resolution using a bilinear function for all bands and nearest neighbor for cloud masks. Then, the images are re-projected to the common projection, and images from the same day are merged into one mosaic. Finally, the mosaic is cropped into the grid tile extension. Regular data cubes require the definition of a regular time step (e.g., one month or 16 days) that will guide the temporal compositing of multiples images available in each time step. After the warping process, the time compositing function is applied to the identity data cubes. It defines the approach to selecting or generating a value representing the series of observations available during the time step. It should be noticed that temporal compositing considers only valid observations, which are not detected as clouds, cloud shadows, or snows, nor does it account for the quality assessment of each band.
Currently, BDC supports three functions for temporal compositing: (i) median, (ii) average, and (iii) stack, as illustrated in Figure 5. The median compositing function consists of, given a time step, for each image band, calculating the median value of the valid observations and using it as the observation for that time step. The median function uses the central value, while keeping an existing observation, removing outliers in a given time step. The average composition uses the mean of all valid observations in the time step. This also removes outliers, but in this case a new smoothed observation is generated.
The stack time compositing function is the best pixel approach; it consists of ranking the time step images by the number of valid observations and selecting the observation from the best-ranked image. The stack compositing ensures that observations coming from less cloud coverage have higher probabilities of being selected to represent the time step. The image collections and data cubes are distributed as Cloud Optimized GeoTIFF (COG) files, with an internal organization that enables efficient data access in the distributed and high-performance cloud environments. In the data cubes, each GeoTIFF file contains information of a band or property of a specific tile of the BDC tiling system and of a certain time period. Besides the spectral bands from original satellite images, the data cube files also contain: (1) cloud mask; (2) spectral indices (NDVI and EVI); (3) the total number of observations for each pixel before time compositing; (4) the number of valid observations (excluding cloud, cloud shadow, snow, or partial images) for each pixel before time compositing; (5) a data provenance band that indicates the image of origin of each pixel, when the stack compositing function is applied; (6) a thumbnail for each time step.

Data Cube Validation and Metadata
We are conducting data quality assessment routines over image collections (surface reflectance products) and data cube collections. For surface reflectance products, we evaluate radiometric and geometric characteristics, cloud mask algorithms [40], and uncertainties in pre-processing steps (e.g., comparing atmospheric correction algorithms, Sen2Cor to Sentinel-2A/2B surface reflectance products, and LaSRC to Landsat-8 OLI surface reflectance products). We are exploring data quality dimensions for the data cubes, such as data completeness, consistency, uniqueness, validity, and accuracy [41]. Additionally, we are looking into quality analysis (e.g., spatial redundancy pixel counting), validation of the time compositing, checking the consistency of mask quality, date provenance of the observations, and differences caused by different versions of processing algorithms.
The metadata about the image collections and data cubes are stored in a relational database. We are using the SpatioTemporal Asset Catalog (STAC) [42] as the language to describe the metadata. STAC is an open specification, based on JSON and RESTful, that has its origins in 14 different organizations working together to increase the interoperability of searching for geospatial data, including satellite imagery. STAC also provides API mechanics to search for and access geospatial data, which are being used by the applications using the BDC data sets.

Software and Data Products
The software products that are being developed in the BDC project are free and open-source and can be grouped into two categories, services and applications, as shown in Figure 1. All project data sets and metadata are accessed and processed through Web services. Applications refer to software products for final users, including systems with graphical interfaces and APIs (application programming interfaces) for high-level programming languages such as R and Python. Web services provide the interfaceS between the project data sets and applications.

Web Services
In the BDC project, we are developing Web service specifications defined by the project team, such as Web Time Series Service (WTSS) and Web Land Trajectory Service (WLTS), and by third-party entities, such as the Tile Map Service (TMS) by THE Open Source Geospatial Foundation; and the Web Map Service (WMS), Web Feature Service (WFS), and Web Coverage Service (WCS) by the Open Geospatial Consortium (OGC) and ISO [43]. Besides using standard Web services protocols defined by OGC and ISO to promote interoperability, we decided to develop other lightweight Web services tailored to specific purposes.
As described in Section 3.4, the image collections and data cubes metadata follow the STAC specification. Thus, the STAC service provides a standard interface that allows users and applications to query and discover all image collections and data cubes stored in the project databases. Besides the STAC service, we developed two clients based on the STAC API; the stac.py package in Python and the rstac package in R.
The WTSS is a lightweight Web service for handling time series data from remote sensing imagery. In this service, we use the term coverage to refer to image collections and multidimensional data cubes. Given a coverage, a spatial location, and a time interval, it retrieves the time series associated with them [33].
As shown in Figure 1, besides the image collections and data cubes, we are organizing land use and cover samples, and data sets and their metadata in database systems, and developing Web services to deal with them. These samples are heterogeneous. They are gathered by different specialists using specific land use and cover classification systems and distinct means such as in situ fieldwork and visual interpretation of high-resolution satellite images. All these details are stored as metadata in the database system.
We are also organizing data sets from distinct projects and institutions containing land use and cover information about the Brazilian territory, such as PRODES, DETER, and TerraClass. These data sets are also created by distinct groups using particular land use and cover classification systems. To deal with distinct classification systems associated with samples and data sets, we developed the Web Land Classification System Service (WLCSS). The WLCSS is responsible for describing and mapping distinct land use and cover classification systems.
Given a spatial location, the Web Land Trajectory Service (WLTS) returns how that location was classified by distinct land use and cover data sets over time. The term land trajectory refers to a sequence of land use and cover classes, ordered in time, associated with a spatial location. Figure 6 illustrates how the WLTS works. Based on the land use and cover data sets generated by TerraClass, PRODES, and DETER, this picture shows a land trajectory associated with a specific location in Para  The Web Sample Assessment Service (WSAS) is being implemented to evaluate and improve the quality of land use and cover samples. These samples are used to train machine learning methods. They can be noisy, causing negative effects on classification performance. The accuracy of classified maps depends directly on the quality of the training samples used.
The WSAS is based on a method developed in the sits R package that identifies data wrongly labeled or data with low discrimination when mixed with other land use and cover classes. This method gets the time series associated with samples from data cubes and uses the self-organizing map (SOM) neural network and Bayesian inference to calculate the probability of each sample belonging to each class.

Applications
Through the BDC Web portal, shown in Figure 7, users can access, visualize using a temporal slider, download, and analyze differences between two time periods of image collection and those between two data cubes. The green grid tiles represent tiles that have ready data cubes associated with them. Figure 7 shows the Landsat-8 data cubes that are ready for Brazil, created by using the stack compositing function and a temporal step of 16 days.
The Open Data Cube (ODC) is a framework that provides tools for indexing, visualization, processing, and analysis of EO data. It is an open-source framework that is used as an infrastructure for the management of data cubes for a growing number of projects, such as the Australian Geoscience Data Cube [12], Swiss Data Cube [14], and Digital Earth Africa [18].
At INPE, we have an ODC instance with all image collections and data cubes created in the BDC project. To integrate ODC framework with BDC data products, we developed a tool called stac2odc for accessing our data sets and automatically indexing them in the ODC framework. This tool collects the necessary metadata from the BDC STAC Web service, converts them to the appropriate formats, and then records the data sets in the INPE's ODC instance. This ODC instance has direct access to the data sets produced in the BDC project.
In the BDC project, we are using four components of the ODC framework: (1) ODC Explorer to visualize the footprint and metadata of the BDC data products; (2) ODC Open Web Services to create the OGC WMS and WCS services for the BDC data sets; (3) ODC Python API in a JupyterHub environment to access and process the BDC data sets in Python; and (4) ODC Temporal Statistics Tools. Figure 8 shows the ODC Explorer and the ODC Python API with the CBERS4 data cubes created in the BDC project.  The applications include the Satellite Image Time Series (sits) package, written in R language, which started to be developed in the e-Sensing project [32]. This package contains functions to retrieve time series from image collections and data cubes, different visualization methods for image time series, smoothing functions for noisy time series, and distinct clustering algorithms, including dendrograms and self-organizing maps (SOM). It also implements interfaces with other R packages that contain machine learning methods for time series classification, providing a unified environment for satellite image time series analysis. In the BDC project, we are improving the sits package with new methods and interfaces with the BDC data cubes and Web services. Forest Monitor is a Web application that uses the image collections and data cubes created by the BDC project to identify deforestation alerts. The DETER project team is using this application.

Land Use and Cover Mapping from EO Data Cubes
This section presents an annual land use and cover mapping method using the time series extracted from the data cubes created in the BDC project. Figure 9 describes the process of creating land use and cover maps from EO data cubes using satellite image time series analysis and deep learning methods.

Study Area and Data
For this mapping, we selected a study area located in the Bahia state, Brazil, between the Cerrado and Caatinga biomes, and used the monthly data cubes CBERS-4, Landsat-8, and Sentinel-2, based on stack compositing function as show in Figure 10. This region is known for the expansion of agriculture and livestock, which has been happening over the last few years in an intensive way [44]. In this experiment, we used time series of a agricultural calendar year, from September of 2018 to August of 2019, extracted from these three data cubes. For this study area, remote sensing specialists collected land use and cover samples using visual interpretation of high-resolution images and data cube time series. These data sets includes 922 samples of three classes: natural vegetation (422), pasture (258), and agriculture (242). Figure 11 shows the NDVI time series (September of 2018 to August of 2019) extracted from the CBERS-4 data cube of the these three classes.

Classification Process
The classification was made using sits R package running in an AWS EC2 service and using the following data cube bands: red, green, blue, and near-infrared (NIR) along with the indices NDVI and EVI. We used these bands because they are common to the three data cubes. First, we collected the time series associated with the sample locations from each data cube and used them to train the classifier.
To classify the data cubes, we trained three different multi-layer perceptrons for a deep learning classification network [45], one for each data cube, using the the time series associated with sample locations. We used sits R package to train the models and classify the time series. The models consisted of five hidden layers with 512 neurons in a complete network architecture, and was trained through back propagation using adaptive learning optimization (Adam) and a rectilinear activation unit (ReLu) function.
The classification output is a map of probabilities for each pixel belonging to a land use and cover class. The majority of methods using this approach select the most probable class. Instead, we use the Bayesian smoothing method to select the pixel class, as proposed by Simoes et al. [11]. This method considers the pixel's mean and variance of the neighbors to compute the posterior Bayesian probabilities and then reanalyze the most probable class to each pixel.

Results
The classification process was applied to the three data cubes, generating three land use and cover maps shown in Figure 12. We validated the accuracy of the classified maps using the good practice guidelines by Olofsson et al. [46]. The validation was done independently for each map using the PRODES Cerrado data of 2019 [25]. This data consisted of anthropized areas in the Cerrado biome. To label the validation samples, we use the WLTS service.
We merged our agriculture and pasture classes to make them compatible with the anthropic class of PRODES Cerrado. Therefore, we obtained a classification with two classes natural vegetation and anthropic (pasture and Agriculture). We used these classes as strata to validate the maps. We assumed a target standard error of overall map accuracy was 0.01 [46] and the user accuracy (UA) of 0.8 for both natural vegetation and anthropic, which resulted in 1600 samples. We allocated an equal number of samples, 800 for each class. The classifications' accuracy are shown in Table 2. Although the classifications are from the same location and use the same attributes, the validation results indicated that there are differences in the representations of the targets in each data cube. This suggests that each data cube should be classified with different machine learning algorithm parameters. Additionally, other methodological steps could be included to improve the performance of the classifications. For example, Santos et al. [30] proposes a method for the detection and selection of representative samples, excluding outliers or samples from different classes with similar time series profiles from the data set. Moreover, Chaves et al. [47] suggests that the use of SWIR and Redge-Edge bands produces better land use and cover classification results. This indicates that there is room to improve further the results of Sentinel-2 and Landsat-8 data cubes.

Final Remarks and Future Directions
The first version (1.0) of the ARD image collections and data cubes produced by the BDC project is already available in the project Web portal and Web services for the Brazilian Cerrado biome. The image collections consist of surface reflectance data from CBERS-4/AWFI, Landsat-8/OLI for the entire biome ranging from 2017 to 2020, and specific tiles for Sentinel-2/MSI (A and B) considering the same period. From these image collections, the data cubes were generated for the same areas and periods using a 16 days time step and the stack temporal composing function.
To generate the ARD and data cubes, we developed a set of Python scripts and an application with a graphical interface called Data Cube Builder. These scripts and the application can be used to create ARD and data cubes for other regions, including other countries in South America. We have two versions of this application, one that runs on AWS using Lambda services and another running in our computer network at INPE. Therefore, users can run these scripts and applications to create data cubes based on their specific needs, such as using other time steps or temporal composition functions. All software products described in this paper are available in our GitHub area https: //github.com/brazil-data-cube.
The ARD image collections, data cubes, and software products that we are producing and developing in the BDC project are useful for different applications that investigate the dynamics of an environment over time based on remote sensing images. Such applications include land use and cover change, deforestation, fire monitoring, and the urbanization process. In this paper, we present land use and cover mapping as an example of an application that benefits from the BDC data cubes and software products.
Other similar projects that generate ARD and data cubes for specific countries, such as Swiss Data Cube (SDC) [14], Digital Earth Australia (DE Australia) [12], and Digital Earth Africa (DE Africa) [18], are using the ODC framework. These other projects follow a similar approach for making data available, analysis, and processing. The data produced by these projects can be accessed without restrictions through visualization portals, Web catalogs, and OGC services. The SDC uses the Geoserver server, while the other two use ODC to provide OGC services. Both SDC and DE Australia do not offer public access and processing data via the ODC Python API. These resources are restricted to authorized researchers. The DE Africa, on the other hand, allows any registered user to have access to a cloud-based JupyterLab environment to perform their analysis and processing. This functionality is still provisional, and at the moment, offers few computational resources for each user. In the same direction, the BDC project provides a portal for viewing the data sets, a catalog for accessing the collections via the STAC protocol, and a JupyterHub environment for the project researchers. Besides that, we provide an interface with the ODC platform, as described in Section 4.2.
In this first version, each data cube is created from images of a specific sensor. However, we are investigating methods to create "harmonized" data cubes by mixing images from distinct sensors. Recently, efforts have been made to harmonize images of multiple multi-spectral sensors to enhance the number of Earth surface observations. A "harmonized" surface reflectance product is obtained when required steps (radiometric, spectral, geometric, spatial, and others) have been applied to create a consistent data set to time series analysis as though it came from a single sensor [48]. One of the most promising approaches has been the c-factor, since it can be used to generate Nadir BRDF adjusted reflectance (NBAR) data for Landsat [49] and Sentinel-2 [50]. Other data cube initiatives have also used this method, such as the Harmonized Landsat Sentinel (HLS) [51,52], to combine Landsat-8 and Sentinel-2 data. Based on these studies, the Brazil Data Cube project team is also evaluating the impact of atmospheric correction algorithms and started developing harmonization procedures to produce Data Cubes containing an image from both satellite sensors.
About future directions-we intend to create data cubes from synthetic aperture radar (SAR) data. SAR data have low atmospheric influence in the microwave spectrum range and thus are useful for environmental monitoring in tropical regions. Sentinel-1 satellite has generated free, open, regular, and standardized SAR images since 2014. These Sentinel-1 images have been used for land use and cover monitoring with good results [53,54]. We hope to improve the land use and cover mapping in the BDC project using SAR data cubes.