Semantic Integration of Raster Data for Earth Observation on Territorial Units

Abstract. Raster is a common data format in satellite image processing. A raster allows us to model geographic phenomena as a regular surface in which each cell (or pixel) is associated with a phenomenon value. Many rasters can be provided for the same geographic area, for the same phenomenon at different dates or for different phenomena; they can be compared, combined, used to generate a new one, etc. A recurrent issue, however, is to transfer data from pixels to features in order to qualify territorial units, which requires complex aggregation processes. This paper addresses this issue through a semantic data integration process based on spatial and temporal properties. We propose i) a modular and generic ontology used for the homogeneous representation of data qualifying a geographical area of interest; and ii) a Semantic Extraction, Transformation, and Load (ETL) process that relies on the ontology and on data extracted from rasters, and that maps the aggregated data to the corresponding areas. We evaluate our approach in terms of (i) the adaptability of the proposed model and pipeline to accommodate different use cases, (ii) the added value of the generated datasets in helping decision making, and (iii) the scalability of the approach.

Earth Observation (EO) is a domain that has greatly evolved in recent years thanks to large-scale Earth monitoring programs, such as the US Landsat Program and the EU Copernicus Program. In particular, with the Copernicus program launched by the European Space Agency (ESA), EO satellites provide users with free, reliable, and up-to-date Earth image data and metadata. The availability of these data sources has opened the opportunity to better support existing domain-oriented applications and to foster emerging ones, from agriculture to forestry, environmental monitoring to urban planning, climate studies, and disaster monitoring. These sources of data, coupled with the development of machine learning algorithms, have boosted the image processing field and its application in those different domains.
One common data format in satellite image processing is raster. A raster models geographic phenomena as a regular surface in which each cell (or pixel) is associated with an indicator or a phenomenon value according to a predefined codification or classification. Such representations may be automatically built by different types of algorithms, including machine learning. Several rasters can be provided for the same geographic area to monitor the same phenomenon at different dates or to monitor different phenomena; they can be compared, combined, or used to generate a new one [19]. However, from a decision-making perspective, the interpretation of their content requires higher-level representations associated with features that bring meaning to the areas of interest on Earth.
This paper addresses the integration of data calculated from rasters as a way of qualifying geographic areas of interest, based on their spatio-temporal properties. These areas of interest are usually represented as geospatial features in a vector format. We are interested in studying (i) the kind of ontology required to support knowledge extraction from EO data and to describe homogeneously the different analysis results provided by the rasters; (ii) how to make rich EO data accessible and usable; and (iii) how to enable data traceability to enhance user confidence and data exploitation. The contributions of this paper are as follows: (i) a generic vocabulary that allows the semantic and homogeneous description of spatio-temporal data to qualify predefined areas; (ii) a configurable semantic ETL process based on the proposed model; and (iii) an EO ecosystem that allows for exploiting Sentinel images.
The rest of this paper is organized as follows. Section 1 presents the semantic model for integrating geographic areas and data extracted from rasters, through their spatial and temporal dimensions. Section 2 details the semantic data integration process. Section 3 discusses the experimental evaluation of our approach. The main related work is discussed in Section 4 and, finally, Section 5 concludes the paper and presents perspectives for future work.

Semantic model
We propose a semantic model to represent data from rasters that provide observed properties of a geographic area of interest (land parcel, administrative unit, forest, etc.) along with their spatio-temporal dimensions. This model also allows keeping track of integrated data and image metadata. To this end, it relies on a generic and modular ontology that extends vocabularies available as OGC standards, such as GeoSPARQL, and W3C recommendations, such as OWL-Time, SOSA, DCAT, and PROV-O.
The model is composed of several sub-models describing the various kinds of data without the need to instantiate the whole model. These sub-models are a) a territorial observation model (tom), for representing geospatial units and their observations; b) a model of Sentinel image metadata (eom); and c) an EO analysis model (eoam).
Territorial observation model (tom): The tom model (Figure 1) takes as input any output of EO analysis activities as long as it comes in raster format. A collection of observations (represented as tom:GeoFeatureObservationCollection) is composed of several observations (tom:GeoFeatureObservation), each observing a given property (sosa:ObservableProperty) on a given territorial unit. Territories, which have a footprint on Earth, are represented by the tom:GeoFeature class, which specializes the sosa:FeatureOfInterest and geo:Feature classes. They belong to a type (tom:GeoFeatureType), such as an administrative unit (village, county), land register parcel, agricultural parcel, forest, or geospatial data grid (such as Sentinel tiles). Each observation results in a percentage (sosa:hasSimpleResult) giving the share of the territorial unit covered by the property among all observed properties in the collection to which it belongs; for example, 40% of Mixed Forest and 60% of Coniferous Forest. The prov-o:Entity class allows keeping track (using the prov-o:wasDerivedFrom property) of the raster file used to create the collections of observations on the one hand, and of the vector file used to create the territorial units on the other hand (cf. the eoam module described below).
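To make the structure of the tom model more concrete, the following minimal sketch lists the triples that one observation collection on a single territorial unit could give rise to. It uses plain Python tuples rather than an RDF library. The `ex:` identifiers, the `tom:hasMember` link between a collection and its observations, and the function name `build_collection` are assumptions introduced for illustration; only the classes and the SOSA and PROV-O properties named in the text come from the model.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_collection(unit_id: str, raster_id: str,
                     shares: Dict[str, float]) -> List[Triple]:
    """Return triples for one observation collection on one territorial unit.

    `shares` maps an observed property (e.g. a land cover class) to the
    percentage of the unit it covers. All `ex:` identifiers and the
    `tom:hasMember` property are hypothetical names used for this sketch.
    """
    unit = f"ex:unit/{unit_id}"
    coll = f"ex:collection/{unit_id}/{raster_id}"
    triples: List[Triple] = [
        (unit, "rdf:type", "tom:GeoFeature"),
        (coll, "rdf:type", "tom:GeoFeatureObservationCollection"),
        # Provenance: the collection is derived from the raster file.
        (coll, "prov-o:wasDerivedFrom", f"ex:raster/{raster_id}"),
    ]
    for i, (prop, percent) in enumerate(sorted(shares.items())):
        obs = f"{coll}/obs/{i}"
        triples += [
            (coll, "tom:hasMember", obs),
            (obs, "rdf:type", "tom:GeoFeatureObservation"),
            (obs, "sosa:hasFeatureOfInterest", unit),
            (obs, "sosa:observedProperty", f"ex:property/{prop}"),
            (obs, "sosa:hasSimpleResult", f'"{percent}"^^xsd:decimal'),
        ]
    return triples


triples = build_collection("parcel-1", "landcover-2017",
                           {"MixedForest": 40.0, "ConiferousForest": 60.0})
for s, p, o in triples:
    print(s, p, o, ".")
```

Each observation carries a single percentage, so the results of a collection are expected to sum to 100 over the observed properties of the unit.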
Sentinel images metadata (eom): The eom (EO model) is the part of the ontology dedicated to representing metadata of Sentinel images. It mainly describes the product (an image) as a result of a sosa:Observation. Each observation is associated with temporal information, i.e., the date of capture (sosa:phenomenonTime), and spatial information, i.e., the area being captured (either by the geometry property of the eom:Product or through the sosa:hasFeatureOfInterest relation).
EO analysis model (eoam): Sentinel images are consumed in different kinds of analyses, such as machine learning algorithms that identify changes between two images. The eoam model (EO Analysis Model) provides information about the results of these activities, since they are or will be consumed by the semantic integration process. Thanks to the PROV-O vocabulary, it is possible to know which Sentinel images have been used as input of a process (eoam:EOAnalysis), and which agent (prov-o:wasAssociatedWith) carried it out. DCAT is used to catalog both the raster datasets and the vector datasets. In this way, both raster files (eoam:RasterFile) and vector files (eoam:GeoFeatureFile) are considered as distributions of these datasets. A raster distribution is described with its temporal and spatial coverage (i.e., the dct:temporal and dct:Location terms used by DCAT), as well as its spatial resolution (dcat:spatialResolutionInMeters). A vector distribution is described with the main attributes of the file, such as the file size (dcat:byteSize), file format (dct:format), or CRS used (dct:conformsTo).

EO data analysis process
The EO data analysis pipeline is depicted in Figure 2. This pipeline is composed of three main tasks:
- Satellite image processing and analysis: This task consists of processing and analyzing satellite images coming from a DIAS (Data and Information Access Services). The possible analyses range from simple ones, such as NDVI calculation, to sophisticated ones, such as change detection on time series or land cover annotation. The task generates raster files as a result.
- Semantic data integration: The task extracts data from the generated raster files using vector files that contain the territorial units to be observed. Vector sources can come from open data repositories or from our semantic database through Semantic search.
- Semantic search: The task aims to analyze the integrated data by querying the semantic database. The SPARQL query results can be used to perform the first two tasks once again for further analysis, either as parameters for guiding the process or as input data. The results can also be used by specialized GIS applications for detailed analyses of the whole integrated data.
The semantic data integration process can be divided into several steps, as in an ETL process. The main steps of the process are described in the following.
Data extraction from rasters and vector files
This step aims to extract and structure data according to different purposes. The process requires two sources: a vector file and a raster. First, metadata are extracted. Next, the vector file is processed to extract information about territorial units of a given type, which populate the tom:GeoFeature class. The raster is also processed to qualify each territorial unit contained in the vector file. Currently, we distinguish two types of raster, depending on the type of their pixel values:
- Categorical rasters (as for land cover): in this kind of raster, since a pixel value represents a class (vineyard, for example), additional information is needed to decode it. For example, the value 15 is decoded as a vineyard in a CESBIO land cover raster;
- Continuous rasters (as for NDVI or change indicators): a pixel value is automatically classified into a level (or class) such as Very low, Low, Middle, High, or Very high.
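The two raster types can be illustrated with a short sketch of how a single pixel value might be interpreted. The equal-width thresholds (0.2, 0.4, 0.6, 0.8) and the function names are assumptions made for this example, not the values used in the pipeline; the only nomenclature entry taken from the text is the code 15 for vineyards.

```python
from typing import Dict, Sequence

LEVELS = ("Very low", "Low", "Middle", "High", "Very high")


def classify_level(value: float,
                   thresholds: Sequence[float] = (0.2, 0.4, 0.6, 0.8)) -> str:
    """Map a continuous pixel value in [0, 1] to one of the five levels.

    The default thresholds are an assumption of this sketch and are meant
    to be overridden for each scenario.
    """
    for level, upper in zip(LEVELS, thresholds):
        if value < upper:
            return level
    return LEVELS[-1]


def decode_class(code: int, nomenclature: Dict[int, str]) -> str:
    """Decode a categorical pixel value with a raster-specific nomenclature."""
    return nomenclature.get(code, "Unknown")


# Partial, illustrative nomenclature: only the vineyard code is from the text.
landcover_nomenclature = {15: "Vineyard"}
print(decode_class(15, landcover_nomenclature))
print(classify_level(0.73))
```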
To qualify a territorial unit from a raster, either new properties (e.g., mean values) are extracted through pixel aggregation, or spatial masks are created to eliminate undesired areas.
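The aggregation can be sketched as follows, assuming that the polygon of the territorial unit has already been rasterized into a boolean mask aligned with the raster grid (this rasterization step is not shown). The function names and the list-of-lists representation are choices made for this example only.

```python
from collections import Counter
from typing import Dict, List, Optional

Grid = List[List[Optional[float]]]
Mask = List[List[bool]]


def masked_values(raster: Grid, mask: Mask) -> List[float]:
    """Collect the pixels that fall inside the mask, skipping missing values."""
    return [value
            for row, mask_row in zip(raster, mask)
            for value, keep in zip(row, mask_row)
            if keep and value is not None]


def mean_value(raster: Grid, mask: Mask) -> Optional[float]:
    """Mean of a continuous raster over the unit (None if no pixel is kept)."""
    values = masked_values(raster, mask)
    return sum(values) / len(values) if values else None


def class_shares(raster: Grid, mask: Mask) -> Dict[float, float]:
    """Percentage of the unit covered by each class of a categorical raster."""
    values = masked_values(raster, mask)
    counts = Counter(values)
    return {cls: 100.0 * n / len(values) for cls, n in counts.items()}


land_cover = [[15, 15, 3],
              [15, 3, 7]]
parcel_mask = [[True, True, True],
               [True, True, False]]  # the last pixel lies outside the parcel
print(class_shares(land_cover, parcel_mask))
```

The same mask mechanism can be used to exclude undesired areas: setting a mask cell to False simply removes the corresponding pixel from every aggregate.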
Data transformation
This step aims at transforming the processed data into semantic (RDF) data. Templates that define the mappings between the extracted data structure (in JSON) and the ontologies are used as a basis in this process. They are usually handwritten. While different data translation tools exist, such as D2RQ, Ultrawrap, Morph, Ontop, TripleGeo, or GeoTriples, we chose to adapt the mapping template and processing mechanism described in [3]. This choice is motivated by the fact that it provides functions that help to perform more sophisticated operations, especially feature masking. The output of this step is a set of RDF files.
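The idea of a handwritten mapping template can be illustrated with a deliberately simplified sketch based on Python's `string.Template`. It is not the template mechanism of [3]: the template syntax, the JSON field names (`id`, `unit`, `property`, `percent`), and the base IRI `http://example.org` are all assumptions made for this example.

```python
import json
from string import Template
from typing import Dict, List

# Hypothetical template: one Turtle-like triple pattern per line, with
# $placeholders filled from a JSON record produced by the extraction step.
OBSERVATION_TEMPLATE = [
    "<$base/obs/$id> a tom:GeoFeatureObservation .",
    "<$base/obs/$id> sosa:hasFeatureOfInterest <$base/unit/$unit> .",
    "<$base/obs/$id> sosa:observedProperty <$base/property/$property> .",
    '<$base/obs/$id> sosa:hasSimpleResult "$percent"^^xsd:decimal .',
]


def apply_template(record: Dict[str, object], base: str) -> List[str]:
    """Instantiate every template line with the fields of one JSON record."""
    return [Template(line).substitute(record, base=base)
            for line in OBSERVATION_TEMPLATE]


record = json.loads(
    '{"id": "o1", "unit": "parcel-1", "property": "Vineyard", "percent": 60.0}'
)
for line in apply_template(record, "http://example.org"):
    print(line)
```

A real mapping would additionally need prefix declarations and IRI escaping, which are omitted here.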

Data load
The final step consists of importing the RDF files into the triplestore, following a materialization approach. The advantage of such an approach is to facilitate future processing, analysis, or reasoning on the materialized RDF data.
3 Experimental evaluation

Application use cases
Two use cases of the CANDELA project have been chosen for demonstration:
- Vineyard use case: The objective of the use case is to retrieve changes in vineyards that were damaged by natural hazards such as frost or hail. The area of study is located in the Aquitaine region, in France. The vineyards of the area were reported heavily damaged by frost on the 20th of April 2017.
- Urban expansion use case: The use case aims at studying the changes related to urban expansion in agricultural areas. We study changes, between 2017 and 2020, in villages around Bordeaux, one of the largest cities in France, which is surrounded by agricultural areas.
Both use cases share a common set of raster types:
- Change indicator: Change indicators, representing the probability of change (between 0 and 1) of pixels between two Sentinel images, are obtained by executing tools from partners of the project.
- NDVI: NDVI information is obtained by processing the near-infrared and red bands of Sentinel images. The output values are between -1 and 1. Since the values between -1 and 0 represent elements composed of water, these values are set to 0 so that the rasters only contain values between 0 and 1.
- Land cover: These datasets provide land cover information for a given area on Earth. The CESBIO land cover datasets are used for our use cases. They cover the French territory with a spatial resolution of 10 m.
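The NDVI preparation described above can be sketched as follows, using the standard definition NDVI = (NIR - Red) / (NIR + Red) and clamping negative values to 0. The handling of pixels where both bands are zero is an assumption of this sketch, as are the function names.

```python
from typing import List


def ndvi(nir: float, red: float) -> float:
    """Standard NDVI, with negative values (water) set to 0 as in the text."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: a pixel with no signal is treated as 0
    return max(0.0, (nir - red) / total)


def ndvi_raster(nir_band: List[List[float]],
                red_band: List[List[float]]) -> List[List[float]]:
    """Apply the per-pixel NDVI to two aligned bands of the same shape."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


nir_band = [[0.5, 0.1], [0.0, 0.3]]
red_band = [[0.1, 0.5], [0.0, 0.3]]
print(ndvi_raster(nir_band, red_band))
```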
Regarding the territorial units (vector data), land register data is used for the first use case, while the second consumes administrative unit data:
- Land register: Land register data is available from the French government data website in GeoJSON format or as shapefiles.
- Administrative unit data: Information on the villages inside an area of interest can be obtained from the French government website. The datasets are available as shapefiles and are updated yearly. Administrative units are linked to the INSEE RDF database.

Model genericity
The genericity of the model is shown by the following facts: (i) the model treats all raster datasets, and their versions, in the same manner, as long as they are provided in the right format; (ii) different classifications can be used to observe the same type of EO property; (iii) the system can consume any vector source describing any kind of territorial division.

Pipeline genericity
Since the components of the pipeline are organized as services (Python functions and Docker images), users can customize the pipeline by choosing and chaining as many services as they like through the Jupyter notebooks of the platform.
- Vineyard use case: (i) We first obtain all parcels of the Saint Emilion village from the land register data, as well as the CESBIO land cover raster for 2017. Semantic data integration is then used to integrate land cover and parcel information.
The semantic data integration is configurable through a set of parameters that can be provided as raster metadata or as function parameters. Users can also provide custom classification thresholds, since the results of image processing and analysis may depend heavily on the scenario.

Use cases analysis
We evaluate the results obtained using the pipeline presented in 3.3, knowing that they highly depend on the precision of the algorithms provided by our partners.
- Vineyard use case: Two Sentinel-2 images collected on the T30TYQ tile are used for change detection and NDVI computation; they are dated 2017/04/19 and 2017/04/29, respectively. These images were chosen because they have very low cloud cover (0% and 15%) and because the interval between them covers the period of study. Figure 3 gives an overview of the change levels detected and of the degradation of NDVI. The Very low change level is eliminated since it is not very relevant. The NDVI degradation indicator represents the total percentage of degradation over the five NDVI levels. We also eliminated parcels with an NDVI degradation below 20%. Finally, 858 parcels are detected as having changed, 756 parcels as having an NDVI degradation above 20%, and 510 parcels in both cases.
- Urban expansion use case: For NDVI calculation and change detection, it is recommended to collect images in the same period of the year and in summer, to limit the cloud cover and the influence of vegetation growth. Two Sentinel-2 images, both with 0% cloud cover, were therefore collected on 2nd August 2017 and 6th August 2020. Figure 4 (right) gives an overview of the change levels detected and of the NDVI levels degraded between these two dates, together with the source Sentinel images (left). We can observe that: (i) the change levels detected match quite well with the NDVI degraded due to urbanization; (ii) the closer a village is to the city, the more it has changed. A next analysis could compare change, NDVI, and land cover information at parcel level for particular villages.
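The parcel selection of the vineyard use case can be sketched as two filters followed by a set intersection. The sketch assumes, for simplicity, one dominant change level and one NDVI degradation percentage per parcel; the parcel identifiers and values are invented for illustration. The last lines apply inclusion-exclusion to the counts reported above to derive the number of parcels flagged by at least one indicator.

```python
from typing import Dict, Set, Tuple


def flagged_parcels(change_level: Dict[str, str],
                    ndvi_degradation: Dict[str, float],
                    min_degradation: float = 20.0
                    ) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (changed, degraded, both) parcel sets after filtering.

    Parcels whose change level is "Very low" and parcels whose NDVI
    degradation is below `min_degradation` percent are eliminated.
    """
    changed = {p for p, level in change_level.items() if level != "Very low"}
    degraded = {p for p, pct in ndvi_degradation.items()
                if pct >= min_degradation}
    return changed, degraded, changed & degraded


# Hypothetical per-parcel values, for illustration only.
levels = {"p1": "High", "p2": "Very low", "p3": "Middle", "p4": "Low"}
degradation = {"p1": 45.0, "p2": 30.0, "p3": 10.0, "p4": 20.0}
changed, degraded, both = flagged_parcels(levels, degradation)
print(sorted(changed), sorted(degraded), sorted(both))

# Inclusion-exclusion on the reported counts (858 changed, 756 degraded,
# 510 in both): parcels flagged by at least one of the two indicators.
at_least_one = 858 + 756 - 510
print(at_least_one)
```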

Approach scalability
In terms of triplestore, we first opted for Strabon for its many advantages. It has a good overall performance thanks to particular optimization techniques that allow spatial operations to take advantage of PostGIS functionality instead of relying on external libraries [16]. For complex applications that include spatial joins or spatial aggregations, Strabon is the only RDF store that performs well [13]. While Strabon performed well with a reduced number of triples in the data store, we faced scalability problems when increasing the number of triples. The same remarks are reported in [17]. For these reasons, we examined query federation approaches where the data is stored in different triplestores. In our case, Strabon is used only for spatial data and GraphDB for the rest. Preliminary experiments showed several advantages of this option: (i) faster response time for non-spatial queries; (ii) scalability and robustness, as it ensures the result regardless of the amount of integrated data. In terms of the size of the generated RDF datasets, we populated 0.5M geofeatures (about 2.5M triples) in Strabon and 4M observations (97.5M triples) in GraphDB using their APIs. The processing and loading time on a virtual node (4-core CPU at 2.6GHz, 8GB of RAM, and SSD disk) was about 40 hours.
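A few orders of magnitude can be derived from the figures above. The following short calculation is only a reading aid: the throughput is an average over the whole 40 hours, which include processing as well as loading, so it should not be interpreted as the raw ingestion rate of either triplestore.

```python
# Figures reported in the text.
geofeatures, geofeature_triples = 0.5e6, 2.5e6
observations, observation_triples = 4e6, 97.5e6
hours = 40

triples_per_geofeature = geofeature_triples / geofeatures
triples_per_observation = observation_triples / observations
total_triples = geofeature_triples + observation_triples
# Average rate over processing and loading together.
throughput = total_triples / (hours * 3600)

print(f"{triples_per_geofeature:.1f} triples per geofeature")
print(f"{triples_per_observation:.3f} triples per observation")
print(f"{total_triples / 1e6:.0f}M triples, about {throughput:.0f} triples/s")
```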
4 Related Work

Semantic ETL and models for EO data integration
Different semantic ETL proposals have addressed the transformation and integration of (open) EO data to LOD. In [20], the authors model EO imagery as a data cube with a specific place and time, thanks to the W3C RDF Data Cube (QB) ontology [9]. This model combines standard vocabularies such as SSN, OWL-Time, SKOS, and PROV-O. In [14], QB was used to publish tabular time series data and to structure it into slices that support multiple views on the data. As a spatio-temporal data cube, the semantic EO data cube [5] contains EO data where, for each observation, at least one nominal interpretation is available. Following a semantic ETL approach along with ontology-based data access (OBDA), [7] extends the Data Cube, GeoSPARQL, and OWL-Time ontologies to offer access to Copernicus services information. Here, SOSA is adopted to represent observation collections, but alignments exist between SOSA and QB. Closer to our work, [11] defined an ETL process to integrate EO images and external data sources, such as Corine Land Cover, Urban Atlas, and Geonames. The process is carried out based on their SAR ontology. Another close work in terms of datasets is [18], where data is integrated and published as LOD based on an ontology called proDataMarket. Three data sources, the Spanish land parcel identification system, Sentinel-2 satellite images, and LiDAR flights, are integrated. In [1], satellite images are classified and enriched with additional semantic data to enable queries about what can be found at a particular location.
In our approach, the integrated data could also be exploited to facilitate image search. Finally, while we do not fully address the scalability of our approach, several works address the issue of managing large volumes of EO data.

Processing of raster data in a semantic framework
Raster data can be represented following two approaches: either by treating the entire raster grid as a coverage or by providing procedures to extract vector objects from the raster matrix. The first approach relies on constructing a semantic representation of the raster pixels so that the attributes of each pixel (geometry, values) are maintained. In [2], the RDF Grid coverage ontology is designed to allow a native integration of coverages in RDF triplestores. The gridded structure of the data is preserved and can be queried using SciSPARQL. Recently, the Ontop-spatial extension [8] has been developed to process raster data and create virtual geospatial RDF views above it.
The second approach consists of extracting entities from rasters and representing them as ontological features. These entities are sets of raster elements (i.e., sets of pixels) that meet a certain context-dependent definition. In [4], the approach integrates and processes vector and raster data from LOD repositories using vectorization and mathematical tools for geo-processing. First, bounding boxes for the input raster are used to query LOD endpoints for entities corresponding to a certain concept. The returned entities, along with their geometry, are then used to select raster pixels for supervised training based on content-based descriptors. Finally, the results can be vectorized and inserted back into the original repository. In [11], the approach proposes to restructure the images into patches of a fixed size, with which external information is associated. Each patch is directly transformed into a feature based on the ontology. [12] extends current standards to represent raster geo-data. A region of interest is first polygonized and then the data is transformed into a semantic representation using R2RML mapping rules. This workaround is arguably not a complete solution to represent raster files in RDF, as the original geometric source of the data is not preserved [15]. Here, the areas of interest are predefined, hence the geometries (polygons) are known. Another close work is [10], where a raster allows for modeling geographic phenomena as a regular surface in which each cell (or pixel) is associated with a phenomenon value. However, they store each value associated with a pixel as an observation, whereas we aggregate the values of a region. While their modeling is close to ours in terms of reused vocabularies, they neither represent metadata and provenance nor exploit Sentinel images. Finally, our work adapts the one presented in [6] in several ways, with a different focus on the pipeline that generates RDF data from EO rasters and other data sources. Moreover, we do not need to consider versions of administrative units, and we explicitly refer to satellite images, thanks to which we can compute new indices.

Conclusion
This paper presented an approach for the integration of data calculated from rasters as a way of qualifying territorial units, based on their spatio-temporal features. We proposed a modular and generic ontology for semantically and homogeneously describing spatio-temporal data that qualify predefined areas, together with the provenance of all sources of data. This ontology is generic enough to describe data that can be calculated from any source that conforms to a raster format. We defined a configurable semantic ETL process guided by this ontology. The process extracts data from rasters and links observations to territorial units through their spatio-temporal dimensions. This process produces a semantic database that can be exploited for different purposes. We illustrated the approach by integrating various raster datasets for two use cases of a joint project. As future work, we plan to explore big data scenarios with the management of Natura 2000 areas. We also consider extending the proposed approach to deal with data coming from CSV files, especially weather observation data. Finally, we would like to publish our knowledge base as linked open data for scientific purposes.

Fig. 1. Territorial Observation Model representing property values of territories calculated from EO rasters.
(ii) Vineyard parcels inside the villages are retrieved via Semantic search. (iii) Appropriate Sentinel-2 images are used for NDVI calculation. (iv) These images are also used for change detection. (v) The rasters generated in the 3rd and 4th steps, along with the vector from the 2nd step, are integrated into the semantic database. (vi) Finally, semantic search can be used for analyzing all integrated information related to the vineyard of interest.
- Urban expansion use case: (i) We first select suitable Sentinel images and execute NDVI calculation. (ii) These images are also used for change detection. (iii) We obtain vector data of all villages of the Gironde department. Semantic data integration is then launched using the rasters generated in the previous steps and the obtained vector files. (iv) Finally, we perform a semantic search to analyze the integrated data related to the villages of interest.