# On-Demand Processing of Data Cubes from Satellite Image Collections with the gdalcubes Library


## Abstract


## 1. Introduction

## 2. Representing Satellite Imagery as On-Demand Data Cubes with gdalcubes

#### 2.1. Data Cubes vs. Image Collections

**Definition 1.**

- (i) Spatial dimensions refer to a single spatial reference system (SRS);
- (ii) Cells of a data cube have a constant spatial size (with regard to the cube’s SRS);
- (iii) The spatial reference is defined by a simple offset and the cell size per axis, i.e., the cube axes are aligned with the SRS axes;
- (iv) Cells of a data cube have a constant temporal duration, defined by an integer number and a date or time unit (years, months, days, hours, minutes, or seconds);
- (v) The temporal reference is defined by a simple start date/time and the temporal duration of cells;
- (vi) For every combination of dimensions, a cell has a single, scalar (real) attribute value.
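Properties (iii) and (v) imply that locating the cell that contains a given coordinate needs only an offset and a constant cell size. The following is a minimal Python sketch of that idea, not part of gdalcubes: the names `RegularAxis` and `time_index` are hypothetical, and the temporal unit is restricted to days for simplicity (months and years would need calendar arithmetic).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RegularAxis:
    """Hypothetical regularly spaced axis: an offset plus a constant cell size."""
    start: float  # coordinate of the lower edge of cell 0
    size: float   # constant cell size along this axis
    count: int    # number of cells

    def index_of(self, coord: float) -> int:
        """Return the index of the cell containing coord, or raise if outside."""
        i = int((coord - self.start) // self.size)
        if not 0 <= i < self.count:
            raise ValueError(f"{coord} lies outside the axis")
        return i


def time_index(t0: date, dt_days: int, nt: int, t: date) -> int:
    """Index of the temporal cell containing t, for cells lasting dt_days days."""
    i = (t - t0).days // dt_days
    if not 0 <= i < nt:
        raise ValueError(f"{t} lies outside the temporal extent")
    return i


# 50 cells of 100 m starting at x = 500000 (illustrative values only)
x_axis = RegularAxis(start=500000.0, size=100.0, count=50)
print(x_axis.index_of(500250.0))                               # 2
print(time_index(date(2018, 1, 1), 16, 23, date(2018, 2, 5)))  # 2
```

Because the axes are regular, no per-cell coordinate table has to be stored; the index follows from one subtraction and one integer division.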

**Definition 2.**

#### 2.2. Constructing User-Defined Data Cubes from Image Collections

- Spatial reference system;
- Spatiotemporal extent;
- Spatial size and temporal duration of cells (resolution);
- Spatial image resampling method; and
- Temporal aggregation method.
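To illustrate how the properties listed above determine the shape of the resulting cube, here is a small Python sketch. It is an assumption-laden illustration rather than the gdalcubes API: the class `ViewSketch`, its field names, and the default method names are hypothetical, and the temporal cell size is expressed in days only.

```python
import math
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class ViewSketch:
    """Hypothetical container for the five properties of a data cube view."""
    srs: str                      # spatial reference system
    left: float                   # spatial extent in SRS units
    right: float
    bottom: float
    top: float
    t0: date                      # temporal extent (both dates inclusive)
    t1: date
    dx: float                     # spatial cell size
    dy: float
    dt_days: int                  # temporal duration of cells, in days
    resampling: str = "near"      # spatial resampling method (assumed name)
    aggregation: str = "first"    # temporal aggregation method (assumed name)

    def shape(self):
        """Number of cells along (t, y, x) implied by extent and resolution."""
        nx = math.ceil((self.right - self.left) / self.dx)
        ny = math.ceil((self.top - self.bottom) / self.dy)
        nt = math.ceil(((self.t1 - self.t0).days + 1) / self.dt_days)
        return nt, ny, nx


preview = ViewSketch("EPSG:3857", 0.0, 1000.0, 0.0, 500.0,
                     date(2018, 1, 1), date(2018, 12, 31),
                     dx=100.0, dy=100.0, dt_days=16)
# Same extent at ten times finer spatial resolution: only the view changes
detail = replace(preview, dx=10.0, dy=10.0)
print(preview.shape())  # (23, 5, 10)
print(detail.shape())   # (23, 50, 100)
```

The image collection itself is untouched in both cases; only the view decides how many cells are computed.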

1. Allocate and initialize an in-memory chunk buffer for the resulting chunk data (a four-dimensional bands, t, y, x array);
2. Find all images of the collection that intersect with the spatiotemporal extent of the chunk;
3. For all images found:
   - 3.1. Crop, reproject, and resample according to the spatiotemporal extent of the chunk and the data cube view and store the result as an in-memory three-dimensional (bands, y, x) array;
   - 3.2. Copy the result to the chunk buffer at the correct temporal slice. If the chunk buffer already contains values at the target position, update a pixel-wise aggregator (e.g., mean, median, min., max.) to combine pixel values from multiple images which are written to the same cell in the data cube.
4. Finalize the pixel-wise aggregator if needed (e.g., divide pixel values by n for mean aggregation).
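The chunk-filling steps above can be sketched in plain Python for the case of mean aggregation. This is a simplified illustration, not the library's C++ implementation: the function `build_chunk` and the image dictionary layout are hypothetical, the search for intersecting images and the cropping, reprojection, and resampling of each image are assumed to have happened already, and NaN marks missing values.

```python
import math

NAN = float("nan")


def build_chunk(images, n_bands, nt, ny, nx):
    """Fill a (bands, t, y, x) chunk from pre-warped images using a mean aggregator.

    Each image is a dict with a temporal slice index "t" and "data", a
    (bands, y, x) nested list already aligned with the chunk grid.
    """
    # Allocate and initialize the chunk buffer and a per-cell counter
    buf = [[[[NAN] * nx for _ in range(ny)] for _ in range(nt)] for _ in range(n_bands)]
    cnt = [[[[0] * nx for _ in range(ny)] for _ in range(nt)] for _ in range(n_bands)]

    # Copy every image into its temporal slice, updating the aggregator
    for img in images:
        t, data = img["t"], img["data"]
        for b in range(n_bands):
            for y in range(ny):
                for x in range(nx):
                    v = data[b][y][x]
                    if math.isnan(v):
                        continue  # missing pixels do not contribute
                    if cnt[b][t][y][x] == 0:
                        buf[b][t][y][x] = v   # first value for this cell
                    else:
                        buf[b][t][y][x] += v  # accumulate a running sum
                    cnt[b][t][y][x] += 1

    # Finalize: divide sums by the number of contributing images
    for b in range(n_bands):
        for t in range(nt):
            for y in range(ny):
                for x in range(nx):
                    if cnt[b][t][y][x] > 1:
                        buf[b][t][y][x] /= cnt[b][t][y][x]
    return buf


# Two images fall into temporal slice 0; slice 1 receives no data
images = [
    {"t": 0, "data": [[[1.0, NAN]]]},
    {"t": 0, "data": [[[3.0, 5.0]]]},
]
chunk = build_chunk(images, n_bands=1, nt=2, ny=1, nx=2)
print(chunk[0][0][0])  # [2.0, 5.0]
```

Other aggregators follow the same pattern but keep different state per cell, for example a running minimum, or all values for a median.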

#### 2.3. Data Cube Operations

#### 2.4. The gdalcubes Library

`create_image_collection()`, indexing available files on the local disk, then define a data cube view with `cube_view()`, and create the cube with `raster_cube()`. Calling this operation will, however, neither start any expensive computations nor read any pixel data from disk. Instead, the function immediately returns a proxy object that can be passed to data cube operations. We subset available bands of the data cube by calling `select_bands()` and apply a median reducer over time with `reduce_time()`. These functions also return proxy objects, containing the complete chain of operations and the dimensions of the resulting cube. Expressions passed as strings to data cube operations directly translate to C++ functions. In this case, the median reduction is fully implemented in C++ and does not need to call any R functions on the data. The `plot()` call finally executes the chain of operations and starts actual computations and data reads. The advantage of such lazy evaluation is that intermediate results need not be written to disk but can be streamed directly to the next operation, so that the order of operations can be optimized. In an example with 102 images from three adjacent grid tiles (summing to approximately 90 gigabytes), stored as original ZIP archives as downloaded from the Copernicus Open Access Hub [29] (see also Section 3, where we use the same dataset in the second study case), computations take around 40 s on a personal laptop with a quad-core CPU, 16 GB main memory, and a solid state disk drive. The resulting image is shown in Figure 4. The complete script has fewer than 20 lines of code, and if users want to apply the same operation at a higher resolution, possibly for a different spatial extent and time range, only the parameters that define the data cube view must be changed.
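The proxy-object pattern described in this paragraph can be illustrated with a short Python sketch. It mimics the idea only and is not how gdalcubes is implemented: the class `LazyCube`, its `compute()` method, and the toy pixel data are hypothetical, and a cube is reduced here to a dictionary mapping band names to lists of pixel time series.

```python
from statistics import median


class LazyCube:
    """Hypothetical proxy: records operations and runs them only on compute()."""

    def __init__(self, source, ops=()):
        self._source = source    # callable that performs the actual data read
        self._ops = tuple(ops)   # recorded (name, function) pairs

    def select_bands(self, bands):
        op = ("select_bands", lambda d: {b: d[b] for b in bands})
        return LazyCube(self._source, self._ops + (op,))

    def reduce_time(self, reducer):
        op = ("reduce_time",
              lambda d: {b: [reducer(series) for series in px] for b, px in d.items()})
        return LazyCube(self._source, self._ops + (op,))

    def describe(self):
        """Names of the recorded operations; no data is touched."""
        return [name for name, _ in self._ops]

    def compute(self):
        data = self._source()     # the only place where data is read
        for _, fn in self._ops:
            data = fn(data)       # results are passed on in memory
        return data


reads = []


def read_pixels():
    reads.append("read")          # track how often data is actually read
    return {"B02": [[1, 5, 3], [2, 2, 8]],
            "B03": [[4, 6, 5], [7, 9, 8]],
            "B08": [[0, 0, 0], [1, 1, 1]]}


cube = LazyCube(read_pixels).select_bands(["B02", "B03"]).reduce_time(median)
print(cube.describe(), len(reads))  # ['select_bands', 'reduce_time'] 0
result = cube.compute()
print(result, len(reads))           # {'B02': [3, 2], 'B03': [5, 8]} 1
```

Until `compute()` is called, building the chain costs almost nothing, which is what makes it cheap to redefine the view and rerun the same chain.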

## 3. Study Cases

#### 3.1. Constructing a Multi-Sensor Data Cube from Precipitation, Vegetation Data, and Land Surface Temperature Data

`join_bands()` function, which collects the bands from two identically shaped data cubes. Since the MOD13A2 product covers land areas only, we ignore any pixels in the combined cube without vegetation data by calling `filter_predicate()`. Expressions passed to the `apply_pixel` and `filter_predicate` functions are translated to C++ functions, with `iif` denoting a simple one-line if-else statement. Finally, we export the cube as a netCDF file. Figure 7 shows a resulting temporal subset of a cube derived at a 10 km spatial resolution. Computation times to execute the script varied between 40 min at 50 km spatial resolution and 240 min at 1 km spatial resolution, meaning that by reducing the number of pixels in the target data cube by a factor of 2500, we could reduce computation times by a factor of 6. In this case, data users would additionally need to reduce the area and/or time range of interest to try out methods and get interactive results within a few minutes.
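The two factors quoted above follow directly from the reported resolutions and run times. A short Python check, using only the numbers given in the text (the function name `pixel_reduction` is ours):

```python
def pixel_reduction(fine_res_km, coarse_res_km):
    """Factor by which the pixel count shrinks when cells grow in both x and y."""
    return (coarse_res_km / fine_res_km) ** 2


reduction = pixel_reduction(1.0, 50.0)  # 50 times coarser in each spatial axis
speedup = 240 / 40                      # minutes at 1 km divided by minutes at 50 km
print(reduction, speedup)               # 2500.0 6.0
```

The speedup is far smaller than the pixel reduction, which is why the text recommends also shrinking the area or time range for interactive experiments.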

#### 3.2. Processing Sentinel-2 Time Series

## 4. Discussion

#### 4.1. Interactive Analyses of Large EO Datasets

#### 4.2. Scalable and Distributed Processing in the Cloud

#### 4.3. Interfaces to Other Software and Languages

#### 4.4. Limitations

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Copernicus—The European Earth Observation Programme. Available online: https://ec.europa.eu/growth/sectors/space/copernicus_en (accessed on 14 June 2019).
2. Lillesand, T.; Kiefer, R.W.; Chipman, J. Remote Sensing and Image Interpretation; John Wiley & Sons: New York, NY, USA, 2015.
3. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. **2017**, 202, 18–27.
4. Lewis, A.; Oliver, S.; Lymburner, L.; Evans, B.; Wyborn, L.; Mueller, N.; Raevksi, G.; Hooke, J.; Woodcock, R.; Sixsmith, J.; et al. The Australian Geoscience Data Cube—Foundations and lessons learned. Remote Sens. Environ. **2017**, 202, 276–292.
5. Giuliani, G.; Chatenoux, B.; Bono, A.D.; Rodila, D.; Richard, J.P.; Allenbach, K.; Dao, H.; Peduzzi, P. Building an Earth Observations Data Cube: Lessons learned from the Swiss Data Cube (SDC) on generating Analysis Ready Data (ARD). Big Earth Data **2017**, 1, 100–117.
6. Lu, M.; Appel, M.; Pebesma, E. Multidimensional Arrays for Analysing Geoscientific Data. ISPRS Int. J. Geo-Inf. **2018**, 7, 313.
7. Warmerdam, F. The geospatial data abstraction library. In Open Source Approaches in Spatial Data Handling; Springer: Berlin/Heidelberg, Germany, 2008; pp. 87–104.
8. Baumann, P.; Dehmel, A.; Furtado, P.; Ritsch, R.; Widmann, N. The Multidimensional Database System RasDaMan. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98, Seattle, WA, USA, 1–4 June 1998; ACM: New York, NY, USA, 1998; pp. 575–577.
9. Stonebraker, M.; Brown, P.; Zhang, D.; Becla, J. SciDB: A Database Management System for Applications with Complex Analytics. Comput. Sci. Eng. **2013**, 15, 54–62.
10. Appel, M.; Lahn, F.; Buytaert, W.; Pebesma, E. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL. ISPRS J. Photogramm. Remote Sens. **2018**, 138, 47–56.
11. Open Data Cube. Available online: https://www.opendatacube.org (accessed on 23 May 2019).
12. Pangeo—A Community Platform for Big Data Geoscience. Available online: https://pangeo.io (accessed on 23 May 2019).
13. Hoyer, S.; Hamman, J. xarray: ND labeled Arrays and Datasets in Python. J. Open Res. Softw. **2017**, 5, 10.
14. Rocklin, M. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 126–132.
15. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2008; ISBN 3-900051-07-0.
16. Hijmans, R.J. raster: Geographic Data Analysis and Modeling, R Package Version 2.9-5; 2019. Available online: https://CRAN.R-project.org/package=raster (accessed on 27 June 2019).
17. Pebesma, E. stars: Spatiotemporal Arrays, Raster and Vector Data Cubes, R Package Version 0.3-1; 2019. Available online: https://CRAN.R-project.org/package=stars (accessed on 27 June 2019).
18. Baumann, P.; Rossi, A.P.; Bell, B.; Clements, O.; Evans, B.; Hoenig, H.; Hogan, P.; Kakaletris, G.; Koltsida, P.; Mantovani, S.; et al. Fostering Cross-Disciplinary Earth Science Through Datacube Analytics. In Earth Observation Open Science and Innovation; Mathieu, P.P., Aubrecht, C., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 91–119.
19. Nativi, S.; Mazzetti, P.; Craglia, M. A view-based model of data-cube to support big earth data systems interoperability. Big Earth Data **2017**, 1, 75–99.
20. Strobl, P.; Baumann, P.; Lewis, A.; Szantoi, Z.; Killough, B.; Purss, M.; Craglia, M.; Nativi, S.; Held, A.; Dhu, T. The Six Faces of The Datacube. In Proceedings of the 2017 Conference on Big Data from Space (BIDS’ 2017), Toulouse, France, 28–30 November 2017; pp. 28–30.
21. Gebbert, S.; Leppelt, T.; Pebesma, E. A Topology Based Spatio-Temporal Map Algebra for Big Data Analysis. Data **2019**, 4, 86.
22. Rew, R.; Davis, G. NetCDF: An interface for scientific data access. IEEE Comput. Graph. Appl. **1990**, 10, 76–82.
23. SQLite. Available online: https://www.sqlite.org (accessed on 24 May 2019).
24. Stenberg, D.; Fandrich, D.; Tse, Y. libcurl: The Multiprotocol File Transfer Library. Available online: http://curl.haxx.se/libcurl (accessed on 24 May 2019).
25. Tinyexpr. Available online: https://github.com/codeplea/tinyexpr (accessed on 24 May 2019).
26. Date. Available online: https://howardhinnant.github.io/date/date.html (accessed on 24 May 2019).
27. Tiny-process-library. Available online: https://gitlab.com/eidheim/tiny-process-library (accessed on 24 May 2019).
28. JSON for Modern C++. Available online: https://github.com/nlohmann/json (accessed on 24 May 2019).
29. Copernicus Open Access Hub. Available online: https://scihub.copernicus.eu (accessed on 24 May 2019).
30. Inglada, J.; Christophe, E. The Orfeo Toolbox remote sensing image processing software. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; Volume 4, pp. 76–82.
31. NumPy C-API. Available online: https://docs.scipy.org/doc/numpy/reference/c-api.html (accessed on 14 June 2019).
32. Neteler, M.; Bowman, M.; Landa, M.; Metz, M. GRASS GIS: A multi-purpose Open Source GIS. Environ. Model. Softw. **2012**, 31, 124–130.
33. Walt, S.V.D.; Colbert, S.C.; Varoquaux, G. The NumPy Array: A Structure for Efficient Numerical Computation. Comput. Sci. Eng. **2011**, 13, 22–30.


**Figure 1.** Data structure for image collections in gdalcubes. Geospatial Data Abstraction Library (GDAL) datasets refer to actual image data, which can be local or remote files, objects in cloud storage, sub-datasets in a more complex file format, or any other resources that GDAL can read.

**Figure 3.** Example R script to derive a mosaic preview of Sentinel-2 images by calculating the median of visible bands over pixel time series.

**Figure 4.** Output of R script in Figure 3, plotting median reflectances of visible Sentinel-2 bands over time.

**Figure 6.** R script to combine data cubes from three different data products. The construction of the image collection is omitted here.

**Figure 7.** Temporal subset of the combined data cube with NDVI measurements (**left**), average daytime land surface temperature (K) during the last 30 days (**center**), and average daily precipitation (mm) during the last 30 days (**right**).

**Figure 9.** Result map from water detection in the second study case. The right part illustrates the results at a high spatial resolution.

**Figure 10.** Computational results for the second study case. The left plot shows the achieved speedup factors depending on the reduction of pixels in the target data cube. For example, reducing the number of pixels by a factor of 100 resulted in a speedup of around 20, compared to computation times with a 10 m by 10 m spatial resolution. The center and right plots show computation times and pixel throughput, respectively, as a function of the number of used CPUs.

| Operator | Description |
|---|---|
| raster_cube | Create a raster data cube from an image collection and a data cube view |
| reduce_time | Apply a reducer function independently over all pixel time series |
| reduce_space | Apply a reducer function independently over all spatial slices |
| apply_pixel | Apply an arithmetic expression on band values over all pixels |
| filter_pixel | Filter pixels with a logical predicate on one or more band values |
| join_bands | Stack the bands of two identically shaped cubes in a single cube |
| window_time | Apply a reducer function or kernel filter over moving windows for all pixel time series |
| write_ncdf | Export a data cube as a netCDF file |
| chunk_apply | Apply a user-defined function over chunks of a data cube |
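To make the difference between the two reducers in the table concrete, the following Python sketch applies a mean over time and over space to a tiny single-band cube stored as a (t, y, x) nested list. The functions only borrow the operator names for readability; they are illustrative stand-ins and not the gdalcubes implementations.

```python
from statistics import mean


def reduce_time(cube, reducer):
    """Reduce each pixel time series; the result is a single (y, x) slice."""
    nt, ny, nx = len(cube), len(cube[0]), len(cube[0][0])
    return [[reducer([cube[t][y][x] for t in range(nt)]) for x in range(nx)]
            for y in range(ny)]


def reduce_space(cube, reducer):
    """Reduce each spatial slice; the result is one value per time step."""
    return [reducer([v for row in time_slice for v in row]) for time_slice in cube]


# Two time steps of a 2 x 2 image
cube = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
print(reduce_time(cube, mean))   # [[3.0, 4.0], [5.0, 6.0]]
print(reduce_space(cube, mean))  # [2.5, 6.5]
```

Reducing over time keeps the spatial dimensions and yields a map, whereas reducing over space keeps the temporal dimension and yields a time series.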

**Table 2.** Summary of the data products as used in the first study case. Definitions: GPM, Global Precipitation Measurement mission; NDVI, normalized difference vegetation index; liquid_accum, liquid daily accumulated precipitation; LST_DAY, daytime land surface temperature; SRS, spatial reference system; MODIS, Moderate Resolution Imaging Spectroradiometer.

| | MOD13A2 | GPM | MOD11A1 |
|---|---|---|---|
| Selected Variables | NDVI | liquid_accum | LST_DAY |
| Spatial Resolution | 1 km × 1 km | 0.1° × 0.1° | 1 km × 1 km |
| Area of Interest | global (land only) | global (60° N–60° S full) | Europe (land only) |
| Temporal Resolution | 16 days | daily | daily |
| Time Range | 2014-01-01–2019-01-01 | 2014-01-01–2019-01-01 | 2014-01-01–2019-01-01 |
| File Format | HDF4 | GeoTIFF (zip compressed) | HDF4 |
| SRS | MODIS sinusoidal | Lat/Lon grid | MODIS sinusoidal |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Appel, M.; Pebesma, E.
On-Demand Processing of Data Cubes from Satellite Image Collections with the gdalcubes Library. *Data* **2019**, *4*, 92.
https://doi.org/10.3390/data4030092
