Semantic Earth Observation Data Cubes

There is an increasing amount of free and open Earth observation (EO) data, yet more information is not necessarily being generated from them at the same rate despite high information potential. The main challenge in the big EO analysis domain is producing information from EO data, because numerical, sensory data have no semantic meaning; they lack semantics. We are introducing the concept of a semantic EO data cube as an advancement of state-of-the-art EO data cubes. We define a semantic EO data cube as a spatio-temporal data cube containing EO data, where for each observation at least one nominal (i.e., categorical) interpretation is available and can be queried in the same instance. Here we clarify and share our definition of semantic EO data cubes, demonstrating how they enable different possibilities for data retrieval, semantic queries based on EO data content and semantically enabled analysis. Semantic EO data cubes are the foundation for EO data expert systems, where new information can be inferred automatically in a machine-based way using semantic queries that humans understand. We argue that semantic EO data cubes are better positioned to handle current and upcoming big EO data challenges than non-semantic EO data cubes, while facilitating an ever-diversifying user-base to produce their own information and harness the immense potential of big EO data.


Introduction
The current Earth observation (EO) data pool is vastly different from that of a mere decade ago, but the main challenge remains: to produce information from data to generate knowledge [1,2]. We are surrounded by a growing ocean of EO data, but sensory data are not information and have no inherent meaning (i.e., lacking semantics) without some form of interpretation. At a minimum, this data pool is characterised by a rapidly growing data volume, accelerating data velocity (i.e., increasing data acquisition and processing speeds) and an increasingly diverse variety of sensors and products [3]. The term "data cube" is broadly understood as a multi-dimensional array organising data in a way that simplifies data storage, access and analysis compared to file-based storage and access [4]. Applying data cube technology to EO datasets attempts to address some of the challenges and opportunities rooted in these big data characteristics.
There is a growing number of implementations currently referred to as EO data cubes with the goal of lowering the barrier to store, manage, provide access to and analyse EO data in a more convenient manner. Data cubes of EO imagery are typically organised in three dimensions: latitude, longitude and time. The definitions or specifications of EO data cubes will not be discussed here, but they can be understood as a way of organising EO data using a logical view on them, either based on an existing archive (i.e., "indexing") or a specific, application-optimised, multi-dimensional data structure (i.e., "ingestion"). The logical view refers to the way of accessing EO data by using spatio-temporal coordinates in an application programming interface (API) or a query language instead of file names. The main advantage of ingesting data is that the data can be stored in a query-optimised way, and specific access patterns, such as time series analysis or spatial analysis, can be realised more efficiently.
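As a minimal, purely hypothetical sketch of such a logical view (not the API of any particular data cube implementation), observations can be addressed by spatio-temporal coordinates rather than file names; the class and method names below are invented for illustration:

```python
from datetime import date

class MiniDataCube:
    """Toy in-memory 'logical view': query by coordinates, not file names."""

    def __init__(self):
        # (date, lat, lon) -> observation value (e.g., a reflectance)
        self._store = {}

    def ingest(self, t, lat, lon, value):
        self._store[(t, lat, lon)] = value

    def load(self, lat_range, lon_range, time_range):
        """Return all observations inside a spatio-temporal query window."""
        return {
            key: val for key, val in self._store.items()
            if time_range[0] <= key[0] <= time_range[1]
            and lat_range[0] <= key[1] <= lat_range[1]
            and lon_range[0] <= key[2] <= lon_range[2 - 1]
        }

cube = MiniDataCube()
cube.ingest(date(2019, 3, 1), 36.5, 37.0, 0.12)
cube.ingest(date(2019, 3, 11), 36.5, 37.0, 0.34)
cube.ingest(date(2019, 3, 11), 40.0, 37.0, 0.55)  # outside the AOI queried below

subset = cube.load(lat_range=(36.0, 37.0), lon_range=(35.0, 39.0),
                   time_range=(date(2019, 3, 1), date(2019, 3, 31)))
# subset now holds only the two observations inside the query window
```

Ingestion-based systems additionally reorder the stored values so that such windows map to contiguous storage, which is what makes time series access patterns efficient.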
Various technical solutions to create these logical views on EO data have rapidly gained traction over the past few years. The first national-scale EO data cube was established in Australia [5], whose technology is now the basis of Digital Earth Australia [6] and the Open Data Cube (ODC) [7].
The free and open-source ODC technology is also behind other operational EO data cubes, such as in Switzerland [8], Colombia [9], Vietnam [10], the Africa Regional Data Cube [11] and at least nine other national or regional initiatives under development [7]. Rasdaman [12], an array database system that has been around since the mid-1990s, is another leading technology behind initiatives such as EarthServer [13] and the Copernicus Data and Exploitation platform for Germany (CODE-DE) [14]. Other software implementations exist, such as the Earth System Data Cube from the European Space Agency [15] and SciDB [16].
State-of-the-art EO data cubes simplify data provision to users by facilitating data uptake and aiming to provide analysis-ready data (ARD) [4]. While there is still an ongoing discussion about how ARD are defined and specified, they are usually understood as calibrated data and, in the case of CARD4L (Committee on Earth Observation Satellites ARD for Land), even contain masks as a target requirement specification, such as for cloud and water [17,18]. The intention is to shift the burden of pre-processing from users to data providers, who are often better equipped to consistently and reliably process large volumes of high-velocity data [6,17,19]. Processing steps with a high potential level of automation can be conducted centrally, where they only must be conducted once and are then available to all users. This contrasts with requiring every user to pre-process the data they would like to use on their own, and improves comparability of initial data conditions between users and applications.
Web-based access to these EO data cube implementations brings users closer to the data and implements a computation platform at the data location [20]. This is a different strategy than providing EO data to users as individual, downloadable images of a pre-determined spatial extent. Data cubes make data access much more efficient and effective by providing users with data tailored more specifically to their needs, reducing unnecessary data transfer [20]. Pairing data access from EO data cubes with computational environments (e.g., processing resources accessible using Jupyter notebooks) moves in the direction of other existing Web-based geospatial computation platforms, such as Google Earth Engine [21]. While these platforms are powerful, analyses sometimes have limited transferability to different geographic locations or points in time, or a low level of results or inferential reproducibility [22].
Even with tailored ARD access and Web-based processing capabilities, users of EO are still confronted with vast amounts of data rather than information, and with the ill-posed challenge of reconstructing a scene from one or more images [23]. In this case, a scene is understood as the content of an image, whereby the result of this challenge is some form of interpretation or classification map of an image. Images suffer from data dimensionality reduction and a semantic information gap. An image is a 2D snapshot of the 4D world (i.e., three spatial and one temporal dimension), whereby all the information required to reconstruct a comprehensive and complete descriptive scene is not available from one or multiple images over time [24].
Information production from EO images still generally relies on unstructured, application-specific algorithms or increasingly popular machine learning procedures. This often results in low to no semantic interoperability between workflows, sensors or images based on the findable, accessible, interoperable and reusable (FAIR) principles [25]. The FAIR principles refer to data as well as the algorithms, tools and workflows that produce them. If data-derived information is linked to the images used to generate it, provenance is maintained and accessible to users. Combining EO images with symbolic image-derived information in a collaborative analytics environment effectively facilitates increased semantic interoperability between workflows and analyses while extending machine-actionability [26].
If the image-derived information is semantically interoperable and consistent between locations and acquisitions, semantic interoperability is established at least at the starting point of further analysis.
We are introducing the concept of a semantic EO data cube as an advancement of state-of-the-art EO data cubes. Semantic EO data cubes move beyond data storage and provision by offering basic, interoperable building blocks of image-derived information within a data cube. This enables semantic analyses that can be incorporated into simple rule-sets in domain language, and users are able to develop increasingly expressive, comprehensive rule-sets and queries. Given semantic enrichment that includes clouds, vegetation, water and "other" categories, certain semantic content-based queries covering a user-defined area of interest (AOI) in a given temporal extent are possible, such as for the most recent observations excluding clouds (e.g., a user-defined cloud-free mosaic), or an observed moment in time with the maximum vegetation extent. These queries of the interpreted content of available images are independent of imposed spatial image extents and are made possible by including semantic enrichment. However, since the information is still tied to the EO images it is based on, it is also possible to search for and retrieve images based on their semantic content rather than only metadata (e.g., where and when each image was acquired). We argue that EO data cubes have the potential to offer much more than data and information product storage and access. They move towards reproducible analytical environments for user-driven information production based on EO images and allow non-expert users to use EO data in their specific context.
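A "most recent observations excluding clouds" query of the kind mentioned above can be sketched in a few lines; the label codes and the tiny (time, y, x) stack below are invented for illustration:

```python
import numpy as np

# Illustrative semantic label codes; a real enrichment would be richer.
CLOUD, VEGETATION, WATER, OTHER = 0, 1, 2, 3

# A (time, y, x) stack of per-pixel semantic labels, oldest first.
labels = np.array([
    [[CLOUD, VEGETATION], [WATER, CLOUD]],   # t0 (oldest)
    [[VEGETATION, CLOUD], [CLOUD, CLOUD]],   # t1
    [[CLOUD, CLOUD],      [WATER, OTHER]],   # t2 (newest)
])

def most_recent_valid(stack, invalid):
    """Per pixel, the time index of the newest observation not labelled `invalid`."""
    result = np.full(stack.shape[1:], -1)    # -1: never observed cloud-free
    for i in range(stack.shape[0]):          # oldest -> newest overwrites
        result[stack[i] != invalid] = i
    return result

idx = most_recent_valid(labels, CLOUD)
# idx now names, per pixel, which acquisition to take for a cloud-free mosaic
```

The same pattern, with the condition swapped (e.g., maximising a vegetation count per date), yields the "maximum vegetation extent" query.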
This paper focuses on the concept of a semantic EO data cube, assuming the basis is an EO data cube containing EO data together with a nominal interpretation for each observation. Multiple discussions and standardisation processes are currently underway to clarify what constitutes ARD and what minimum requirements constitute a data cube. However, this has little bearing on the base concepts presented here, which have implications for data access, data retrieval, semantic queries of data, semantic interoperability of different methods and results, and more. We argue that semantic EO data cubes are better positioned to handle current and upcoming big EO data challenges than non-semantic EO data cubes, while facilitating an ever-diversifying user-base to produce their own information and harness the immense potential of big EO data.

Theoretical Framework
Concepts under the same name sometimes differ between domains. The concepts essential for our understanding of semantic EO data cubes are described for clarity, and our definition of what constitutes a semantic EO data cube is explained.

Clarifying Concepts
Data are not the same as information, and we find ourselves increasingly collecting data, yet not producing more information from them at the same rate. Information can be understood in at least two different ways: as a quantifiable measure in the sense of the information content of a message or an image (e.g., bits and bytes representing something informative [27]), or as a subjective concept, an interpretation (i.e., knowledge produced from a process) [28,29]. Information is used to generate knowledge and understanding, which might lead to wisdom [1,2].
Two terms ought to be clarified before moving forward because they are not interchangeable from our perspective, nor in the domain of computer vision: images and scenes. An EO image is broadly understood as a pixel-discretised field representing measurements of radiation reflected or emitted from Earth in different wavelengths (e.g., thermal infrared, visible light, microwave). EO data are delivered as images or single measurements, depending on the design of a sensor. Here we refer to numerical observations represented by pixels and delivered as images. A scene, however, refers to the represented content of an image, meaning that which was observed [30].
The goal of most EO analysis is to produce actionable information to support decision-making processes. This requires transforming EO data into information, or digital numbers into subjective concepts that describe a scene. An EO image is a 2D representation of a 3D scene on Earth at a fixed moment in time, and multiple 2D images of the same 3D scene acquired over time move towards representing a snapshot of the 4D world (i.e., 3D space through time). In this context, what an image or set of images can tell you about a scene is information.
The challenge of reconstructing scenes or generating information about a scene from a mono-temporal 2D image or a set of images through time underpins any classification of remotely-sensed imagery and is inherently ill-posed. It is ill-posed in the Hadamard sense because a single, unique solution may not exist, or the solution does not depend continuously on the data [31,32]. The last criterion of data-dependence refers to stability, where small changes in the equation or conditions result in small changes in the solution. An ill-posed problem does not meet one or more of these criteria (e.g., there is a huge number of possible solutions when classifying imagery).
The ill-posed problem of reconstructing scenes from images stems primarily from what is known as the sensory gap [33]. For optical EO images, this gap exists between the 2D image that has been sensed (e.g., digital numbers) and the 4D world (e.g., objects, states, events, processes). This gap introduces uncertainty that inherently complicates the interpretation of images and reliable, consistent information production. One aspect of the sensory gap is the sensor transfer function, which relates to the resolvability of phenomena by the given sensor (e.g., spatial, temporal, spectral, radiometric resolution). Another aspect relates to the reduction of dimensionality inherent to images (i.e., 4D to 2D; reducing a flood event to a snapshot in time). These aspects together allow for multiple interpretations of the same or similar representations (e.g., a green pixel in a true-colour image might represent a vegetated rooftop, forest, pasture, football field or something else entirely).
In the context of EO image classification, multiple classifications are possible for any given EO image or collection of images, and many current classification methods are very sensitive to changes in input parameters or starting conditions. Certain methods even produce similar but non-identical results each time they are run on the same initial data. In the case of well-established approaches to supervised classification, different users generally use different sets of samples, even if using the same data and being interested in the same categories, which consequently produces different results.
What is known as the semantic gap also contributes to difficulties in producing information from images; it refers to the gap between something that exists and what it means, regardless of how it is observed or represented [33]. Semantics, more broadly, refers to the multi-domain study of meaning, which influences research in many fields, such as philosophy, linguistics, technology (e.g., the semantic Web [34,35], ontology-based data access [36]) and interoperability (e.g., sharing geographic information [37], processing EO data [26]).
When we speak of semantics in EO, this refers to what an EO image represents in terms of how it is interpreted, usually by an expert. An image can be described using a vast number of words and concepts, yet images do not have intrinsic meaning. Each person has their own definition or understanding of different concepts or symbols, not to mention what they find to be important in a given image or scene [38]. Images gain meaning through relations to other images and the interpretation by a viewer, which is influenced by cultural and social conventions as well as the viewer's intention. In the context of image databases, how users search for and interact with images creates additional meaning, especially if given an exploratory user interface [39].
Using the term semantic in relation to EO data cubes refers to how an existing EO data cube is semantically-enabled, meaning a user can interact with it using semantic concepts rather than digital numbers or reflectance values. The ability to search for and retrieve EO data using spatially-explicit semantic content-based information rather than metadata, keywords, tags or other linked data has strong implications for changing the way EO data are queried, accessed and analysed. However, to semantically-enable an EO data cube, some level of semantics needs to be available for every observation. In the case of EO imagery, this means semantics need to be available for each representation in space and time (i.e., pixel).

Our Definition of a Semantic EO Data Cube
A semantic EO data cube, or a semantics-enabled EO data cube, is a data cube where, for each observation, at least one nominal (i.e., categorical) interpretation is available and can be queried in the same instance. Interpreting an EO image (i.e., mapping data to symbols that represent stable concepts) results in semantic enrichment [23]. This data interpretation used in creating a semantic EO data cube may differ depending on the user and the intended purpose. Semantic variables are non-ordinal, categorical variables, but subsets of these variables may be ordinal (e.g., vegetation with sub-categories of increasing greenness or intensity) [40]. See Figure 1 (left) for a schematic illustration of a semantic EO data cube. Semantic enrichment included in a semantic EO data cube may be at a relatively low or higher semantic level. A lower semantic level means that symbols may be associated with or represent multiple semantic concepts, requiring further analysis or interpretation to align with more specific concepts. The concepts in a lower-level semantic enrichment can be considered semi-symbolic in that they are a first step to connecting sensory data to symbolic, semantic classes [41]. This could include information such as colour, or other ways of characterising the spatio-temporal context of each observation. A relatively high semantic level refers to explicit expert knowledge or existing ontologies. In the context of optical EO, one example of relatively high-level semantic information would be land cover, such as the land cover classification system (LCCS) developed by the Food and Agriculture Organisation of the United Nations [42].
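The definition can be illustrated with a minimal sketch: measurements and one nominal interpretation kept as aligned layers, so that a semantic and a numeric query can be answered in the same instance. All values and category names below are invented for the example:

```python
import numpy as np

# One date, one band of EO measurements (invented reflectance values)...
reflectance = np.array([[0.08, 0.45],
                        [0.31, 0.02]])

# ...and an aligned nominal interpretation for every observation.
labels = np.array([["water", "vegetation"],
                   ["vegetation", "water"]])

# A semantic query and a numeric query answered together in one instance:
veg_mask = labels == "vegetation"
mean_veg_reflectance = reflectance[veg_mask].mean()  # (0.45 + 0.31) / 2
```

The point is not the arrays themselves but the coupling: every pixel carries both a number and at least one queryable categorical interpretation.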
Other data and information may be combined with a semantic EO data cube to extend possible analyses, but what makes it semantically-enabled is that each observation in space over time has an interpretation. An interpretation that can be generated in an automated way with no user interaction is ideal for handling big EO data. It is also extremely beneficial if the resulting interpreted categories are transferable between different geographic locations, moments in time, images or sensors.
Only including well-known, data-derived indices for each observation (e.g., the normalised difference vegetation index (NDVI)) is not sufficient to semantically-enable an EO data cube. Most of these indices are not inherently semantic, in that they still need to be interpreted to have symbolic meaning (e.g., at what NDVI is a pixel considered to contain vegetation or some other interpreted category?). Indices can, however, contribute different quantitative insights to existing interpretations of an image in a stratified analysis (e.g., this collection of pixels is interpreted as being vegetation, but what was the average NDVI in June 2018 compared to June 2019 within this area?). While the indices can be calculated on the fly since the EO data are also present in a semantic EO data cube, it is up to the user as to whether calculating and incorporating such data-derived indices in a data cube reduces computational resources or has other benefits for further analyses.
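The distinction between a numeric index and a semantic interpretation can be sketched as follows; the reflectance values and the 0.3 threshold are illustrative assumptions, not a recommended rule:

```python
import numpy as np

# Invented near-infrared and red reflectances for four pixels.
nir = np.array([[0.60, 0.10], [0.50, 0.08]])
red = np.array([[0.10, 0.09], [0.12, 0.07]])

ndvi = (nir - red) / (nir + red)   # a numeric index: not yet semantic

# Interpretation step: a rule maps numbers to a symbol (threshold assumed).
is_vegetation = ndvi > 0.3

# Stratified use of the index: average NDVI only within interpreted vegetation.
mean_ndvi_in_vegetation = ndvi[is_vegetation].mean()
```

Only the thresholding line attaches meaning; the index itself remains a digital number until such a rule is applied.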
Including additional data or information that is not directly derived from EO data does not semantically enable an EO data cube, but might enable new query possibilities of EO data in space and time. Such data or information could concern the geographic area (e.g., a digital elevation model (DEM)), socio-economic data, or masks of various kinds (e.g., an urban area or forest mask). None of these data and information sources are derived directly from the EO data, which means that they: (1) do not add information about each EO image's content, but rather about the scene content or other characteristics pertaining to the time they were acquired; and (2) may no longer be true for the moment in time an EO image was captured (e.g., a DEM acquired before an earthquake). A DEM, for example, could be used as a spatial selection criterion, even if not specifically related to the semantic content of each image (e.g., selecting observations above a given elevation for alpine areas). Another example would be including an annual forest mask used by environmental regulatory bodies, although that annual mask may not be true even for the EO data available for that given year contained in the data cube.
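A DEM-based spatial selection of this kind might look as follows; the elevations, labels and the 1500 m threshold are invented for the example:

```python
import numpy as np

# Ancillary data: a DEM in metres (invented values). It does not
# semantically enable the cube, but it can constrain queries spatially.
dem = np.array([[2100, 900],
                [1600, 450]])

# The cube's own semantic enrichment for one acquisition (invented labels).
labels = np.array([["snow", "vegetation"],
                   ["snow", "water"]])

alpine = dem > 1500                                  # assumed elevation rule
snow_in_alpine_areas = np.logical_and(alpine, labels == "snow")
```

The DEM here only selects where to look; the "snow" interpretation still comes from the EO-derived enrichment.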
In semantic EO data cubes it is crucial that EO data be stored together with data-derived information for each acquisition. A data cube containing only data-derived interpretations could be considered semantic, but EO data have too much potential to be constrained to a single interpretation, especially since there is no single correct interpretation of image content. World ontologies are infinite. Multiple different perspectives and interpretations need to be possible to close the semantic gap [38], and users should be allowed to generate their own interpretations within a semantic EO data cube should those available not be suitable for their needs. The loss of connection to the original EO data constrains semantics to the available interpretation, eliminates access to the source of the data-derived information important for provenance and limits further analysis. Some users might benefit from incorporating reflectance values from specific bands (e.g., calculating an index), using the semantic information to generate composite images through time, or generating different information based on the data to augment existing semantic enrichment.
The focus of semantic EO data cubes is to facilitate ad hoc, flexible information generation from data that has the potential to lead to knowledge. Semantic EO data cubes combine concepts from EO, image processing, geoinformatics, computer vision, image retrieval and understanding, semantics, ontologies and more. Similar to how the semantic Web can be considered an extension of the Web [34], semantic EO data cubes offer a solution for combining EO data with meaning. This ultimately better enables people and computers to work together to access, retrieve and analyse EO data and data-derived information in a semantically-enabled and machine-readable way.

Examples from Existing Semantic EO Data Cubes
Three applied examples of semantic EO data cubes are presented; each of them uses the same relatively low-level, generic, data-derived semantic enrichment as its basis. This general-purpose semantic enrichment is application- and user-independent and thus can support multiple application domains. The semantic enrichment used in the following examples is automatically generated (i.e., without any user-defined parameterisation or training data) by the Satellite Image Automatic Mapper™ (SIAM™). This software is an expert system that applies a per-pixel, physical spectral model-based decision tree to images calibrated to at least top-of-atmosphere reflectance in order to accomplish automatic, near real-time multi-spectral discretisation based on a priori knowledge [43]. The decision tree maps each observation located within a multi-spectral reflectance hypercube to one multi-spectral colour name, which is stable and sensor-agnostic. It is sensor-agnostic in that data calibrated to at least top-of-atmosphere reflectance by optical sensors can be used to generate semantic enrichment comparable between sensors (e.g., Sentinel-2, Landsat). SIAM™'s output has been independently validated at a continental scale by [44].
This colour naming results in a discrete and finite vocabulary referring to hyper-polyhedra within a multi-spectral feature space, whereby the colour names create a vocabulary that is a mutually exclusive and totally exhaustive partitioning of the multi-spectral reflectance hypercube. These colour names have semantic associations using a knowledge-based approach and thus are considered semi-symbolic (i.e., semi-concepts). More broadly, this vocabulary of colour names can be thought of as stable, sensor-agnostic visual "letters" that can be used to build "words" (i.e., symbolic concepts) that have a higher semantic level using knowledge-based rules. The output may be considered sufficient for generating CARD4L masks as specified in the product family specification [18], but also offers building blocks for a complete scene classification map.
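The idea of a mutually exclusive, totally exhaustive colour-name mapping can be caricatured with a few hand-made rules; the thresholds and names below are invented and bear no relation to SIAM™'s actual decision tree:

```python
def colour_name(blue, green, red, nir):
    """Map one reflectance vector to exactly one invented colour name.

    The rules are checked in order, so every input falls into exactly one
    category; the final catch-all keeps the partition exhaustive.
    """
    if nir < 0.05 and nir < red:
        return "dark-blue-like"      # water-like spectra absorb NIR strongly
    if nir > 2 * red:
        return "strong-green-like"   # vegetation-like spectra: high NIR/red
    if min(blue, green, red) > 0.4:
        return "bright-white-like"   # cloud/snow-like: bright in all bands
    return "other"                   # everything else

example = colour_name(0.05, 0.12, 0.06, 0.45)  # a vegetation-like spectrum
```

Because each reflectance vector lands in exactly one named category, such colour names behave like the stable visual "letters" described above, regardless of which calibrated optical sensor produced the values.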
In the following examples, these data-derived information building blocks (i.e., semi-concepts) are based on Landsat 8 or Sentinel-2 images and are stored using either Open Data Cube or rasdaman technology to create semantic EO data cubes. While the semi-concepts themselves are at a lower semantic level than land cover classes, they are reproducible, transferable between images and geographic locations, and each colour has a semantic association. These implementations serve as the foundation for semantic content-based image retrieval (SCBIR) (Section 3.1) or other semantic queries (Sections 3.2 and 3.3). Spectral-based semi-concepts can serve as the basis for more expressive, automated scene classification, queries and analysis within each of these prototypical semantic EO data cubes using knowledge-based rules (see Section 4.4).

Semantic Content-Based Image Retrieval
The example of operational SCBIR has been prototypically implemented within a semantic EO data cube based on Landsat 8 data and the rasdaman array database system as the underlying data cube technology [45]. While this prototypical implementation (see Figure 1) did not cover a large database, it is designed for scalability by relying on parameter-free, fully automated and multi-sensor enabled semantic enrichment, as well as on a data cube technology proven to be scalable to PB sizes [13].
Unlike a traditional content-based image retrieval system, a SCBIR system is expected to cope with spatially-explicit (i.e., area of interest (AOI)-based), temporal, semantic queries (e.g., "retrieve all images in the database where the AOI does not contain clouds or snow"). Very few SCBIR system prototypes targeting EO images have been presented in the literature [46,47]. None of them is available in operating mode to date.
The implementation of SCBIR is urgently needed in today's big EO archives to overcome the limitations of currently implemented image data retrieval methods, which rely on image metadata (e.g., acquisition date, sensor, pre-processing level) and image-wide statistics like average cloud cover. The latter is especially problematic because the average cloud cover statistic is one of the most used pre-selection criteria for image retrieval of big EO data, yet it is an average over an entire image. Spatially-explicit AOI-based querying that makes use of the semantic information of each pixel in a data cube could help in making use of hidden or "dark" data in big EO databases. This could, for example, lead to retrieving more cloud-free time series or improving cloud-free mosaic composition, utilising data contained in images with high average cloud cover.
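The limitation of image-wide cloud statistics can be sketched directly; the array, the AOI window and the implied 10% filter are all illustrative:

```python
import numpy as np

# One acquisition's per-pixel semantic labels (invented): 1 = cloud, 0 = clear.
CLOUD, CLEAR = 1, 0
image_labels = np.zeros((10, 10), dtype=int)
image_labels[:, 6:] = CLOUD          # the right 40% of the image is cloudy

# The user's area of interest happens to lie entirely in the clear part.
aoi = image_labels[0:5, 0:5]

whole_image_cloud = image_labels.mean()   # 0.4 -> rejected by a 10% filter
aoi_cloud = aoi.mean()                    # 0.0 -> retrieved by an AOI query
```

A metadata-only filter on average cloud cover would discard this acquisition, even though the AOI itself is completely cloud-free; pixel-level semantic querying recovers such "dark" data.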
A SCBIR query based on the prototypical implementation is visualised in Figure 1. A query using the semantic information for low cloud cover combined with low snow cover in the selected AOI would only retrieve 2 of the 4 sample images in this example, making query results better posed for subsequent analyses. While our definition of a semantic EO data cube does not prescribe any particular level of semantic enrichment, SCBIR queries beyond cloud/snow cover are possible depending on the available image interpretation, e.g., searches for images where flooding occurred, containing a low tidal range, or where a peak in vegetation coverage occurs.

Flood Extent in Somalia Based on Landsat 8 Imagery
One of the first implementations of a semantic data cube was a study to extract surface water dynamics and the maximum flood extent as an indicator of flood risk using a dense temporal stack of 78 Landsat 8 images [48]. By using water observations from three years, areas prone to flooding are delineated, as illustrated in Figure 2 (originally published as CC-BY-ND by [48], modified). In this study, the array database system rasdaman was used to instantiate a semantic EO data cube with pre-processed Landsat imagery and semantic enrichment generated with SIAM™, which can be accessed using a self-programmed Web frontend that visually supports the design of semantic queries. In this system the analyses are automatically translated into database queries, which increases reproducibility, readability and comprehensibility for a human operator, and can be conducted within a few minutes. The study showed how a generic semantic EO data cube can be used for on-the-fly information production using a very simple ruleset.
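A minimal sketch of such a ruleset, with invented labels standing in for the SIAM™ semi-concepts: a pixel belongs to the maximum flood extent if it was interpreted as water-like at least once in the temporal stack.

```python
import numpy as np

# A (time, y, x) stack of per-pixel interpretations (invented labels).
WATER = "water-like"
stack = np.array([
    [["land", WATER], ["land", "land"]],
    [[WATER, WATER], ["land", "land"]],
    [["land", WATER], [WATER, "land"]],
])

# Maximum flood extent: water-like observed at least once over the stack.
max_flood_extent = (stack == WATER).any(axis=0)
```

The real study's ruleset operates on a 78-image Landsat 8 stack, but the logic reduces to this kind of per-pixel temporal aggregation over a semantic category.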

In March 2019, flash flooding was reported in various parts of Syria [50]. The worst flooding was reported in Idlib province, just south of the westernmost part of the study area (see Figure 3). While optical imagery is often hindered by cloud cover during rain events, a query for water-like pixels around the time of intense precipitation shows that certain flooded areas were observed by Sentinel-2 satellites. A normalised observed surface water occurrence (SWO) over time is calculated for two spatio-temporal extents of interest, namely 15 March to 15 April for the entire study area in 2018 and in 2019 (see Figure 4). The calculation of the result for each spatio-temporal extent took around 10 minutes using the same hardware and software as described by [49]. The algorithm, described in pseudocode in Figure 5, can be applied to any semantic concept that exists in the semantic EO data cube. This is demonstrated in Figure 6, where the same algorithm was applied to the semantic EO data cube for vegetation-like rather than water-like pixels over the same two spatio-temporal extents.
Figure 5. Pseudocode describing how the normalised observed surface water occurrence (SWO) over time is calculated based on semi-concepts, in addition to two other outputs necessary for its calculation. The pseudocode divides "total water observations" by "total clean observations", changes NaN to 0 in all arrays, returns three resulting 2D arrays ("normalised occurrence", "total water observations" and "total clean observations"), creates a GeoTiff of each returned array, and mosaics all cell-based GeoTiffs back together for each respective analysis. The array of "total clean observations" provides the number of observations over time per pixel after excluding cloud-like, snow-like and unknown pixels in the spatio-temporal extent of interest. Snow-like pixels are excluded in this case based on the knowledge that there is generally no snow within the spatio-temporal extent of interest. "Total water observations" refers to the number of observations over time per pixel in which water-like spectral profiles were observed. It is the ratio between these two outputs (i.e., water observations divided by clean observations per pixel) that results in the normalised observed SWO.
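A minimal sketch of this calculation in Python with NumPy may help make the pseudocode concrete. The category codes and array layout are simplified assumptions for illustration, not SIAM™'s actual encoding, and the GeoTiff/mosaicking steps of Figure 5 are omitted.

```python
import numpy as np

# Assumed category codes for illustration only (not SIAM's actual encoding).
UNKNOWN, WATER, CLOUD, SNOW, VEGETATION = 0, 1, 2, 3, 4

def normalised_swo(stack):
    """Normalised observed surface water occurrence (SWO) per pixel.

    stack: (n_acquisitions, rows, cols) array of per-acquisition category codes.
    Returns the three 2D arrays named in Figure 5: normalised occurrence,
    total water observations and total clean observations.
    """
    # A "clean" observation excludes cloud-like, snow-like and unknown pixels.
    clean = ~np.isin(stack, (CLOUD, SNOW, UNKNOWN))
    total_clean = clean.sum(axis=0).astype(float)
    total_water = ((stack == WATER) & clean).sum(axis=0).astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        occurrence = total_water / total_clean
    return np.nan_to_num(occurrence), total_water, total_clean  # NaN -> 0

# Toy time series: three acquisitions over a 1x2 pixel area.
stack = np.array([[[WATER, CLOUD]],
                  [[VEGETATION, SNOW]],
                  [[WATER, WATER]]], dtype=np.uint8)
occ, water, clean_n = normalised_swo(stack)
# pixel 0: water in 2 of 3 clean observations; pixel 1: water in 1 of 1
```

The ratio is only meaningful relative to the clean observations actually available per pixel, which is why all three arrays are returned rather than the occurrence alone.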

Discussion and Outlook
Semantic EO data cubes are interdisciplinary in their conceptualisation, combining concepts related to image retrieval, computer vision, human cognition, semantics, world ontologies, remote sensing and more. The applied examples presented in Section 3 are brought into the context of semantic EO data cubes, according to the definition and concepts provided in Section 2. Semantic EO data cubes also have the potential to be a foundational element in image understanding systems, which is discussed briefly in Section 4.4 and is a focus of on-going research.

Improvements to Data and Image Retrieval
Combining semantic enrichment with EO images has implications for EO archives, databases and the ways in which users can search for and select images [45,51]. EO data cubes already enable users to retrieve data independent of an image's spatial extent. The best-case scenario is when images processed to ARD specifications are used as the basis of an EO data cube, not just any images or quality indicators. Semantic EO data cubes enable users to search for and retrieve EO data in their spatio-temporal extent of interest based on their content, rather than image-wide statistics.
Since data and semantic enrichment are both available, SCBIR can improve ARD provision by expanding the possibilities users have to retrieve images that meet their requirements. Currently, a user may be interested in an area that occupies only 10% of an image. If this section of the image is cloud-free but the rest of the image is not, the image will not be returned when searching for low average image-wide cloud coverage statistics (Figure 1). Semantic EO data cubes can provide such AOI-specific retrieval, returning images that match content-based criteria within the user's area of interest rather than across the entire image.
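As a sketch of how such AOI-based retrieval could work, the cloud fraction can be computed only within the AOI instead of over the whole image. The category codes and in-memory array layout are assumptions for illustration; a real SCBIR implementation would query the database directly.

```python
import numpy as np

# Hypothetical category codes for the semantic enrichment layer.
CLOUD, WATER, VEGETATION = 1, 2, 3

def scbir_cloud_filter(enrichment_stack, aoi_mask, max_cloud_fraction=0.1):
    """Return indices of images whose AOI-specific cloud fraction is low.

    enrichment_stack: (n_images, rows, cols) array of category codes
    aoi_mask:         (rows, cols) boolean mask of the area of interest
    """
    selected = []
    for i, scene in enumerate(enrichment_stack):
        aoi_pixels = scene[aoi_mask]
        cloud_fraction = np.mean(aoi_pixels == CLOUD)
        if cloud_fraction <= max_cloud_fraction:
            selected.append(i)
    return selected

# Toy example: two 4x4 "images"; the AOI covers the left half.
stack = np.full((2, 4, 4), VEGETATION, dtype=np.uint8)
stack[0, :, 2:] = CLOUD  # image 0: cloudy only outside the AOI
stack[1, :, :] = CLOUD   # image 1: fully cloudy
aoi = np.zeros((4, 4), dtype=bool)
aoi[:, :2] = True
print(scbir_cloud_filter(stack, aoi))  # image 0 qualifies despite 50% image-wide cloud
```

Image 0 would be rejected by an image-wide cloud statistic (50% cloudy) but is returned here because its AOI is cloud-free, which is precisely the scenario described above.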

In the applied example along the Turkish/Syrian border, flooding had been reported, but it was unclear whether optical Sentinel-2 imagery was able to capture any of it, and if so, where. Applying a query for water or water-like pixels aggregated over time, such as shown in Figure 2, is a spatially explicit way to help answer that question. Additionally, such a query might be even more powerful if the user has limited spatially explicit precipitation or temperature information and is unaware of any flooding that may or may not have occurred in a given area at any point in time.
It is important to emphasise that any analysis of EO data is only relevant for the snapshots in time that are available. Information derived from them may not necessarily be valid for much of the time between acquisitions. For example, just because flooding is not observed or detected using Sentinel-2 data does not mean that flooding did not occur in a given spatio-temporal extent. Even big EO data with a high temporal sampling rate must always be interpreted with this in mind, and is best combined with additional information or domain knowledge.
Including semantic enrichment for each image enables semantic queries to be applied to EO data and derived information without requiring complex algorithms to process all data for a geographic area or given timespan. Even though the semantic level of the interpretations may vary amongst implementations, algorithms can access the reflectance values already associated with an interpretation, which can be referenced later in the workflow if necessary. Data-derived, content-based information is available for each existing observation and can be read in a machine-based way using categories that users understand.
Working with symbolic categories instead of reflectance values means that users can work with queries that are readily understandable, provided the vocabulary of a community or a standard set of classes such as the LCCS is used. However, using categories entails a non-reversible data reduction, i.e., a reduction of the feature space in comparison to a multitude of bands with a higher bit depth (e.g., 48 categories stored as 8-bit data in comparison to 13 bands of 12-bit data, such as for Sentinel-2). This data reduction benefits query performance in particular, but needs to be taken into account for every analysis. Based on our definition of a semantic EO data cube, the original data remain available and accessible should users require them.
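The magnitude of this reduction for the Sentinel-2 example can be checked with simple arithmetic (raw bit counts only, ignoring container formats and compression):

```python
# 48 categories fit into a single 8-bit value (2**8 = 256 >= 48),
# whereas the original observation uses 13 bands at 12-bit radiometric depth.
category_bits_per_pixel = 8
band_bits_per_pixel = 13 * 12  # = 156 bits
print(band_bits_per_pixel / category_bits_per_pixel)  # roughly 19.5-fold reduction
```

In practice the saving is even larger when the 12-bit bands are stored in 16-bit containers, as is common.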
Having the original data available along with categories also creates new possibilities for other applications, such as stratifying data analysis based on semantic enrichment. This could be relevant for improving sampling for machine-learning algorithms based on the frequency and distribution of certain categories through space and time. For example, samples could be stratified based on the occurrence of spectrally similar pixels by category within a study area in an attempt to mitigate sampling bias. Other analyses can also benefit from stratification by category, such as topographic correction (e.g., certain categories will be darker in terrain shadow than others, while clouds are unaffected) or calculating indices (e.g., first querying for vegetation before calculating the NDVI to avoid having to set a threshold to distinguish vegetation with the index alone).
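The index example could be sketched as follows: the NDVI is computed only for pixels the semantic enrichment already labels vegetation-like, so no global index threshold is needed. The category code is a placeholder for illustration, not an actual SIAM™ code.

```python
import numpy as np

VEGETATION = 3  # assumed category code for vegetation-like pixels

def ndvi_for_vegetation(red, nir, categories):
    """Compute NDVI only for pixels semantically labelled vegetation-like."""
    ndvi = np.full(red.shape, np.nan)
    mask = categories == VEGETATION
    ndvi[mask] = (nir[mask] - red[mask]) / (nir[mask] + red[mask])
    return ndvi

# Toy 1x2 scene: one vegetation-like pixel, one other pixel.
red = np.array([[0.05, 0.30]])
nir = np.array([[0.45, 0.32]])
cats = np.array([[VEGETATION, 0]])
result = ndvi_for_vegetation(red, nir, cats)  # NDVI only where vegetation-like
```

Because the semantic mask and the original reflectance values live in the same cube, this stratified calculation needs no reprocessing of the imagery.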

Automated, Generic Semantic Enrichment for Big EO Data
Semantic EO data cubes are most powerful when combined with semantically rich yet generic interpretations, because semantics differ between domains, applications, users and the targeted purpose of analysis. Closing the semantic gap when generating information from EO data is very difficult and goes beyond the focus of this paper (refer to [44] as a starting point on this topic), but even the simplest semantic enrichment better positions EO data cubes for analysis than no semantics at all. Any data-derived semantic information can be used as the basis of a semantic EO data cube, but generic semantic enrichment is highly extensible. It allows multiple domains to simultaneously benefit from EO data and derived information without having to reprocess huge amounts of data for every analysis. Workflows utilising the same generic, data-derived building blocks for analysis also support increased semantic interoperability. However, big EO data necessitate data-derived interpretations that can be generated without user parameterisation (i.e., automated), are reliable and acceptable in quality, and have reasonable processing times [20].
The semantic enrichment generated by SIAM™ and used in the applied examples was chosen because it is fast, fully automated, scalable to handle big EO data, sensor-agnostic and comparable between images captured at different locations and times. The limited semantic depth can be partly compensated for through the availability of the temporal dimension in dense time series, because the concepts are particularly stable (i.e., robust to changes in input data and imaging sensor specifications). Sensor-agnostic semantic categories mean that users can compare the semantic content of images acquired by different sensors using the same semantic concepts. Higher-level semantics can improve information generation but are generally limited to a specific theme or application. This may be beneficial in some cases, and those interested in generating information can decide what is necessary for them before processing massive amounts of EO data to create a semantic EO data cube.
The examples presented in Sections 3.2 and 3.3 both queried water-like pixels based on the low-level generic semantic enrichment available over time. Even with a semi-symbolic level of semantic enrichment, queries for water-like observations could be conducted for a single acquisition or aggregated over multiple acquisitions (Figures 2 and 5). The query results shown in Section 3.3 took the additional step of excluding cloud-like and snow-like pixels and normalising the results over time for increased comparability, given the spatio-temporal heterogeneity of available data. The same query for two different spatio-temporal extents, as shown in Figure 5, was generated within 10 minutes on relatively limited computing resources, as documented by [49]. Such generic implementations may be particularly useful in situations where timely information generation is critical. They can also serve to find spatio-temporal locations of interest for further analysis using available data. Figure 6 demonstrates two semantic queries on two spatio-temporal extents based on the same semantic EO data cube, showcasing the benefit of being able to conduct various semantic queries using generic semantic enrichment.
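The "same algorithm, different semantic concept" idea can be expressed as a single routine parameterised by the queried category (codes again assumed for illustration):

```python
import numpy as np

UNKNOWN, WATER, CLOUD, SNOW, VEGETATION = 0, 1, 2, 3, 4  # assumed codes

def normalised_occurrence(stack, concept_code):
    """Normalised occurrence of any semantic concept over a time series."""
    clean = ~np.isin(stack, (CLOUD, SNOW, UNKNOWN))  # usable observations
    hits = ((stack == concept_code) & clean).sum(axis=0)
    total = clean.sum(axis=0)
    # guard against division by zero where no clean observations exist
    return np.where(total > 0, hits / np.maximum(total, 1), 0.0)

# The same routine answers both the water and the vegetation query.
stack = np.array([[[WATER, VEGETATION]],
                  [[WATER, VEGETATION]],
                  [[CLOUD, WATER]]], dtype=np.uint8)
swo = normalised_occurrence(stack, WATER)
veg = normalised_occurrence(stack, VEGETATION)
```

Only the concept code changes between the two queries; the aggregation, cloud/snow exclusion and normalisation are identical, which is what makes generic semantic enrichment reusable across themes.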
Many other surface water occurrence algorithms and analyses for EO data exist, but they cannot necessarily be conducted ad hoc for user-defined spatio-temporal AOIs, are more computationally expensive, and their results cannot necessarily be queried. For example, work conducted at the European Commission's Joint Research Centre by [53] has generated various high-resolution global surface water information layers. These results provide valuable information based on EO data, but cannot be queried for content, are separate from the data they were derived from, and are limited to pre-defined temporal extents (e.g., annual). The surface water information generated by [54] or [55] for each EO observation and used in their surface water dynamics analyses could be the basis for a semantic EO data cube, but it would be semantically limited to the concept of water and does not seem to be continuously updated with newly available data (i.e., images acquired up to now) in an automated way. These implementations provide static layers and are not currently poised to provide more dynamic, near-real-time or continuously updated results, such as information about the maximum observed water extent in 2019 as it happens based on cloud-free/clean pixels.
In Figure 5 it can be seen that large, permanent water bodies sometimes returned less than 100% normalised observed surface water occurrence. This is because the semantic query does not take pixels associated with haze or very thin clouds into consideration, which are not necessarily water-like nor cloud-like. Queries can be improved, and more complex knowledge-based rules implemented. These proof-of-concept results demonstrate that even queries of low complexity based on low-level semantic enrichment can produce higher-level information that might be useful in certain scenarios.

Towards an Image Understanding System
While our definition does not specify applications and implementations of semantic EO data cubes, a prominent use-case is as part of an application-independent expert system, where the semantic EO data cube serves as a fact base. In an expert system, users connect rules stored within a knowledge base to a fact base to infer new information. In such a set-up, the knowledge base is continuously augmented with rules based on domain knowledge. This allows using already existing encoded expert knowledge or having users contribute their own knowledge. An overall architecture such as that proposed by [45] consists of an image understanding sub-system in addition to the semantic EO data cube, which both makes use of already existing interpretations and feeds the fact base with newly derived, true information.
A prototype of an expert-system-based architecture is currently under development for Austria, where a semantic EO data cube serves as the backbone for user-generated semantic queries [56]. This system combines a fully automated semantic enrichment of Sentinel-2 images, up to basic land cover types, with a semantic EO data cube and a Web interface for human-like queries based on semantic models of the spatio-temporal 4D physical-world domain. Although still under development, first results are promising and show that users are able to formulate even complex queries, using the semantic pre-processing as simple building blocks to derive information at a higher semantic level than the initial building blocks.

Conclusions
The aim of this paper was to define what semantic EO data cubes are and what they make possible in terms of image retrieval, analysis and information production potential. Vast amounts of EO data are being collected, yet proportionally less is being used to produce information; many domains are under-served relative to what EO could offer; and users of EO data need a high level of technical competence to produce information from them.
By combining EO data with an interpretation for each observation of a scene, semantic EO data cubes allow users to run queries on big EO data and time series that were not previously possible, and provide image-derived information building blocks for analysis that are more meaningful than measured surface reflectance. Semantic enrichment enables semantic content-based image retrieval, allowing users to retrieve specific observations based on what they contain rather than image-wide statistics. Semantic queries (i.e., queries that exist independent of EO images) can be run on EO data that are at least at the semantic level of the enrichment or higher, without necessarily having to run complex, application-specific algorithms for each analysis. Including semantics in an EO data cube also establishes a minimal level of semantic interoperability for different analyses conducted within the same semantic EO data cube, or in a different implementation using the same semantic enrichment. This has implications for improving the reproducibility of methods and results, especially when applying the same methods based on the same semantic enrichment to different spatio-temporal extents.
Semantic EO data cubes go beyond state-of-the-art EO data cubes by managing image-derived information together with the data, accessible for querying, and thus serve as initial building blocks for semantic queries. Instead of attempting to answer a specific question using EO data, semantic EO data cubes move towards exploring what questions can possibly be answered using the EO data available for a given spatio-temporal extent of interest. Analysis is only limited by the semantic enrichment included, and can be extended using transparently coded rule-sets or additional information and knowledge to produce information with a higher semantic level.
We believe that semantic EO data cubes are better positioned to serve big EO data than existing EO data cube implementations, especially when containing ARD and generic, sensor-agnostic semantic enrichment that can be automatically generated in a scalable way. The potential of semantic EO data cubes is only beginning to be explored, with much yet to be discovered. Semantic EO data cubes are the foundation for big EO data expert systems, where new information can be inferred automatically in a machine-based way using semantic queries that humans understand.

…categorical variables, but subsets of these variables may be ordinal (e.g., vegetation with subcategories of increasing greenness or intensity) [40]. See Figure 1 (left) for a schematic illustration of a semantic EO data cube.

Figure 1. Schematic illustration of a semantic Earth observation (EO) data cube (left) used for an exemplary semantic content-based image retrieval (SCBIR) query. Here, a query searches for images with low cloud and low snow cover within a user-defined area of interest (AOI), based on the associated semantic information. It retrieves images that match the semantic content-based criteria for the AOI instead of the entire image's extent. In a classic image-wide query, such AOI-specific semantic queries are not possible.



Figure 2. A flood mask generated from 78 semantically enriched Landsat 8 images over 9 months in Somalia (left), as an indicator for flood risk, is compared to a single-event analysis following a reported flood event in the year before (right). Both maps are the result of basic user queries using the semantic information only, without the use of additional parameters or calculations on the original data sets. Originally published as CC-BY-ND by [48], modified.


Figure 3. The spatial extent of the semantic EO data cube comprises three Sentinel-2 granules. (a) displays the true-colour Sentinel-2 images as processed by the European Space Agency (ESA); (b) shows the area as represented in OpenStreetMap.


Figure 4. Results of the semantic query for water-like observations for two spatio-temporal extents of interest. (a) Query for water-like observations from 15 March to 15 April 2018. (b) Query for water-like observations from 15 March to 15 April 2019. (c) Close-up of an area where water-like observations were present in 2019 but not in 2018.



Figure 6. Results of a different semantic query for the same two spatio-temporal extents of interest used in the query for water-like observations seen in Figure 5. (a) Normalised observed vegetation occurrence from 15 March to 15 April 2018. (b) Normalised observed vegetation occurrence from 15 March to 15 April 2019. (c) Normalised observed SWO from 15 March to 15 April 2019 overlaid on the normalised observed vegetation occurrence as represented in (b).


Author Contributions:
All authors were involved in the conceptualisation of this paper. The software used for generating semantic enrichment in the applied examples, SIAM™, was developed by A.B. Example 3.1 was provided by D.T., example 3.2 by M.S. and D.T., and example 3.3 by H.A. Original draft preparation was predominantly conducted by H.A., with review and editing prior to submission by D.T., M.S., S.L. and H.A.