Semantic Boosting: Enhancing Deep Learning Based LULC Classification

The classification of land use and land cover (LULC) is a well-studied task within the domain of remote sensing and geographic information science. It traditionally relies on remotely sensed imagery and therefore models land cover classes with respect to their electromagnetic reflectances, aggregated in pixels. This paper introduces a methodology which enables the inclusion of geographical object semantics (from vector data) into the LULC classification procedure. As such, information on the types of geographic objects (e.g., Shop, Church, Peak, etc.) can improve LULC classification accuracy. In this paper, we demonstrate how semantics can be fused with imagery to classify LULC. Three experiments were performed to explore and highlight the impact and potential of semantics for this task. In each experiment CORINE LULC data was used as a ground truth and predicted using imagery from Sentinel-2 and semantics from LinkedGeoData using deep learning. Our results reveal that LULC can be classified from semantics only and that fusing semantics with imagery—Semantic Boosting—improved the classification with significantly higher LULC accuracies. The results show that some LULC classes are better predicted using only semantics, others with just imagery, and importantly much of the improvement was due to the ability to separate similar land use classes. A number of key considerations are discussed.


Introduction
Land cover classes or types can be defined and determined in multiple ways. This can lead to ambiguous understandings of their characteristics and consequently their spatial distribution. Such ambiguity can arise from different mapping project objectives and the fact that different entities may view a given type of land cover differently in terms of its physical properties as well as different conceptualisations of land cover classes [1]. In addition, land cover can be modelled and determined from different data sources, the most prominent of which is remotely sensed imagery. This is commonly used to determine the presence of different land covers with respect to their electromagnetic signatures. The characteristics of this data influence how land cover is captured as a function of both the pixel [2] and pixel size [3]. Thus, any knowledge (including data) about the spatial distribution of land cover is inherently linked to how land cover classes are defined and recorded, with clear implications for applications that depend on land cover data. Uncertainty in this knowledge can have profound effects on the results of land use and land cover (LULC, as the terms are frequently used interchangeably) data analyses, for example, as drivers of climate [4,5], the environment [6], on the allocation of land and resources [4,5,7], and on understanding biodiversity [8], with implications for decision making. Remotely sensed imagery allows LULC to be classified with high accuracy but only with respect to the aggregated electromagnetic reflectance as recorded in a pixel. This can exclude relevant information, such as how land is used, which is not captured by remotely sensed imagery. The aim of this work is to compensate for such limitations by including geospatial semantics [9] in the LULC classification process. The semantics describe the types of geo-objects (e.g., House, Bench, Peak, etc.) and therefore relate to social and economic activity describing how the land is used. This paper demonstrates how geospatial semantics (which describe the land use) can be combined with imagery (which describes the land cover) in order to improve LULC classification. The contribution of this work is threefold:
1. The development and application of a Semantic Boosting approach for fusing remotely sensed imagery with geospatial semantics (obtained from vector data) for LULC classification based on deep learning;
2. A quantitative analysis investigating the potential of geospatial semantics for LULC classification in depth;
3. A qualitative analysis focusing on understanding and explaining when and why Semantic Boosting can be beneficial for LULC classification.
Similar to other work [10][11][12][13], CORINE is used here as ground truth data on LULC. The deep learning model seeks to predict the CORINE LULC class for a large area and a fusion of semantics (vector data) and imagery (raster data) is used to enhance this classification. This was compared with the results of classification from two deep learning models, one using imagery only and the other using semantics only in order to generate important insights on the characteristics of Semantic Boosting. This research utilises the following datasets covering the area of Austria, serving as a case study: • Geospatial semantic data from the LinkedGeoData platform [14]. • CORINE LULC (Level 2) data (https://land.copernicus.eu/pan-european/corineland-cover/clc2018 accessed on 23 January 2021). • Remotely sensed imagery from Sentinel-2 (https://apps.sentinel-hub.com/mosaichub/#/ accessed on 23 January 2021).
The novelty of this work is that it demonstrates the utility of including local semantic information in classifying land cover and how geospatial semantics can improve the accuracy of land cover classification in a meaningful way. Section 2 reviews related research using local ancillary information for land cover classification and some of the assumptions associated with classifying remotely sensed imagery into land cover and land use. Section 3 describes the data and methodology for fusing semantics and remotely sensed imagery for LULC classification. It also describes the experiments which were carried out in order to assess the potential of this data fusion. The results are described in Section 4, with a discussion of the findings and methods in Section 5. Finally, some conclusions are drawn in Section 6.

Land Use and Land Cover Semantics
Data on land use and land cover (LULC) are important; the two concepts are rolled together in most classifications, including CORINE, as discussed below. They are used to understand environmental dynamics at global [4,5], regional [7,8,16], and local [17] scales for natural resource management, climate change, disease spread, air quality, and other ecosystem services. Different land covers and uses are associated with specific processes. For example, urban areas (land use) with lots of artificial surfaces (land cover) can result in heat islands and increase ozone levels [4]. Agricultural expansion (land use) can decrease water quality and the amount of carbon dioxide stored in the landscape [4,5,7,16]. The distribution of LULC has a significant impact on the global average surface temperature and the variability of the climate system [5]. Reliable land use and land cover data are important for many activities related to planning sustainable global development [18][19][20]. Furthermore, LULC change influences biodiversity [8], ecosystem services [21], carbon emissions [22], and land surface temperature [23,24]. It is modelled using LULC classification products, typically (although erroneously, see [25]) through some post-classification procedures (as reviewed in [26,27]). Many LULC change models therefore depend on the quality of the initial LULC classifications.
Regional LULC products have been created, such as CORINE (LULC classification for Europe), NALCD (LULC for North America), and AFRICOVER (LULC for Africa) [28]. These products were created using different methods and different types of remotely sensed imagery. The choice, definition, and number of LULC classes will impact how the features on the ground are represented as well as influence the classification accuracy [29,30]. Accuracy is influenced by the sensor type [29], the spatial resolution of the image data [31], and the number of LULC classes (negatively correlated with overall accuracy [32]). A final observation is that different LULC products have differing levels of accuracy, with global, continental, and national products having accuracies ranging from 66.9% to 98.0% [28]. Reference [28] compared different LULC products and found GeoWiki (https://www.geo-wiki.org/ accessed on 30 January 2021) to have the highest accuracy of global LULC products, with 10 LULC classes and a spatial resolution of 300 m × 300 m; a South American 30 m × 30 m product with 5 classes to have an overall accuracy of 89.0%; and a Russian 1 km × 1 km resolution dataset with 8 classes to have the highest accuracy (98.0%) amongst national products.
LULC classification is traditionally performed based on remotely sensed imagery under two inherent assumptions: (1) that LULC processes are captured by electromagnetic reflectances and can be differentiated [33] and (2) that the world can be described as a regular tessellation, i.e., a raster [2]. Reference [33] points out that LULC classes are delineated by subspaces within a feature space defined by numerical values retrieved from the electromagnetic reflectances captured remotely. However, it notes that electromagnetic signatures are not consistent for different scenes, sensors, landscape contexts, and spatial scales and that, in contrast to land cover, land use cannot be defined by electromagnetic signatures in a coherent and consistent manner. This is because land cover refers to the physical material at the surface of the earth, whereas land use is characterised by how people utilise the corresponding land. As a result, land use and land cover cannot be directly inferred from each other: a single land cover can have different land uses, and a single land use may be composed of different land covers. Thus, although land cover and land use are highly intertwined concepts, they can only partly be identified from their electromagnetic reflectance values in remotely sensed imagery. The second assumption is introduced by modelling the real world using a raster representation. In a short letter, ref. [2] unpicks the ubiquitous use of the tessellated pixel as the default mode for representing real world objects that are not pixel shaped and landscape processes that do not exhibit this regular characteristic. Additionally, Reference [34] notes that the pixel introduces a topological bias, which differs from the vector model, such that pixel representations do not correctly capture topological relationships. These assumptions can lead to inconsistencies which ultimately propagate error and can confuse the modelling of environmental processes [2].
Thus, the semantics, meaning, and concepts (and accuracy) of any LULC dataset are deeply linked to the methods used to generate the data, i.e., its epistemology [35]. This includes decisions over imagery (type, scale), how features are represented, the choice of training data, and the classification algorithm.

New Forms and Sources of LULC-Related Information
In addition to remotely sensed data, other data sources have been used to detect LULC, such as cell phone data [36], social media data [37], or volunteered geographic information (VGI) such as OpenStreetMap (OSM), with some limitations [38][39][40][41]. Reference [38] developed a LULC product for the city of Heidelberg, Germany, by using OSM data and remotely sensed imagery (Landsat) and harmonising OSM tags with Level 2 CORINE labels. Thus, a LULC class from CORINE was defined by a set of OSM tags, and empty OSM areas were filled with classified satellite imagery using a classifier trained on OSM. The resulting LULC data had an overall accuracy of 81% with significant variation in per-class accuracies. In [41], imagery was combined with POI data and raster data from a Chinese internet provider (usage per grid cell) to classify LULC. The imagery was transformed into visual continuous bags of words and combined with the labels of the other two datasets (also continuous bags of words); a latent Dirichlet allocation (LDA) and a random forest classifier were then applied to determine six different LULC classes within Shenzhen, China. The overall accuracy was 85.1% with a kappa coefficient of 0.812. However, both [38,41] restrict their work to one specific ROI and therefore do not show how their approach generalises to areas with different image signatures and contexts, such as rural areas, mountainous areas, industrial areas, or high-density built-up areas. We overcome this issue by choosing a ROI of the size of an entire country, i.e., Austria. In contrast to both works, we provide one single feature space for semantics and imagery. In addition, we employ labels from geospatial semantics which have an explicit subclass and superclass relationship to each other, using a Web Ontology Language (OWL) ontology (e.g., a Pet shop is also a Shop), while [38,41] use only thematic information on (POI) labels, which do not have an explicit relationship to each other. Reference [38] assigns OSM labels to LULC classes in a static way and [41] assigns POI labels by applying a series of steps. In contrast, we employ deep learning to determine the relationship between LULC classes and OWL classes. Reference [41] uses a continuous bag-of-words approach which considers the concepts in a grid cell; however, it ignores their geographical distribution within the grid cell. Our approach, in contrast, takes geographical distributions into account. Reference [41] predicts only six and [38] 10 LULC classes, while this work predicts 12 LULC classes.
Reference [40] illustrates how geospatial semantics from VGI can be used to predict urban growth with promising results. For this purpose they introduce a matrix which encodes the geospatial semantics, with respect to the local geospatial configurations of geographical objects, as a feature space. They denote this matrix as the Geospatial Configuration Matrix (GSCM). Finally, they use the GSCM to predict urban growth for Europe by means of deep learning. Their final urban growth prediction scores an overall accuracy of 88.6% for a time period of 3 years. We include the GSCM in our proposed method in an extended manner. The relationship between VGI and LULC data has been further explored by [42]. They used the OSM-derived LinkedGeoData to examine the associations between LinkedGeoData objects and CORINE areas. The results showed that LULC classes have significant associations with specific classes of LinkedGeoData objects and that certain classes (e.g., restaurant, tree, street) are more likely to appear in areas of specific CORINE classes. This research is a precursor to and informs the current study.

Summary
There are two inherent and important assumptions made in most LULC classifications of remotely sensed imagery: (1) that the LULC classes of interest can be derived from electromagnetic reflectance and (2) that LULC can be reliably represented in a tessellated manner using pixels. These assumptions have impacts on the final LULC model and the way that "reality" is represented. The vector model enables modelling reality beyond the limitation of the regular and tessellated pixel model. Additionally, its geo-object attributes, such as semantics, enable gaining information on how land is used rather than only how it looks (electromagnetic reflectance of a remotely sensed image). Preliminary work by others found that thematic information from VGI can be used for LULC classification but with limitations. In this work, we overcome these limitations and illustrate how a deep learning model can learn dynamic relationships between geospatial semantics and LULC classes within an entire country and furthermore how semantics boost image based LULC classifications.

Methodology
In this work we explore the benefit of incorporating geospatial semantics into the LULC classification by comparing three LULC classification experiments: • Geospatial semantics synthesised with remotely sensed imagery (Experiment 1); • Geospatial semantics only (Experiment 2); • Remotely sensed imagery only (Experiment 3).
All three experiments were applied to the Austrian case using data from CORINE as ground truth, LinkedGeoData (semantics) as input for Experiments 1 and 2, and Sentinel-2 imagery as input for Experiments 1 and 3. The workflow of the method is illustrated in Figure 1. In all experiments CORINE data (Level 2) is used as ground truth and the results of the experiments are compared with it in order to determine the classification accuracies. A final comparison uses both quantitative and qualitative assessments. The quantitative assessment is based on an accuracy assessment. The qualitative assessment visually examines selected samples and the spatial distribution of the classification errors. Both assessments are then used to inform the conclusions about the potential of Semantic Boosting.

Figure 1. A visualisation of the workflow of the methodology (panels: Experiments; Comparison of experiments; Quantitative analyses, based on accuracy assessments; Qualitative analyses, based on classified samples). Input data is passed to the three experiments. CORINE is used in each experiment as ground truth for evaluating the predicted class labels arising from each experiment. Experiment 1 uses imagery and semantics, Experiment 2 uses semantics only, and Experiment 3 uses imagery only. For each experiment, a GSCM is created and an optimal deep learning model is identified before an accuracy assessment is made and the results of the three experiments are compared.

Data
Three datasets were used in the analysis: (1) CORINE land cover (Level 2) data from 2018 for training and validating the models, (2) Sentinel-2 remotely sensed imagery, and (3) vector data obtained from LinkedGeoData, which contains geospatial semantics for its geo-objects. All three datasets cover the study area, Austria.

CORINE Land Cover
Sampled CORINE land cover data was used as ground truth. Specifically, it was used to allocate the LULC class label to each of the 156,000 samples (randomly selected). These were used to train and validate the performance of the deep learning classification models using a 10-fold cross-validation. CORINE was used as ground truth, as it is a well-studied data source and 13 out of the 15 available CORINE LULC classes are present in Austria (see Table 1). CORINE has a 100 m × 100 m resolution.

Sentinel-2 Imagery
A Sentinel-2 image was obtained from the S2 Global Mosaic Hub for the entire area of Austria. This platform provides Sentinel-2 images with cloud removal, based on the sen2cor toolbox (https://usermanual.readthedocs.io/en/stable/pages/MosaickingAlgorithms.html accessed on 24 January 2021). This allows a single image mosaic for the entire scene to be created from multiple single images acquired on different dates, so that clouds and similar artefacts can be removed. In this case, images were selected for a three-month period (July, August, and September 2018) in order to generate a single mosaic image over Austria. This period was chosen in order to capture the vegetation during its active phase in the phenological cycle during summer. The final mosaic contains 11 out of the 13 available channels: Channels 10 (short-wave infrared, cirrus) and 9 (water vapour) are excluded by the S2 Global Mosaic Hub. The spatial resolution of the image mosaic is 10 m × 10 m, with any channels of a lower resolution resampled to 10 m × 10 m using a nearest neighbour approach.
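The nearest neighbour resampling mentioned above can be illustrated with a minimal sketch. It assumes that the coarser grid is aligned with the 10 m grid and that the resolution ratio is an integer (for example, a factor of 2 for a 20 m channel); the function name `upsample_nearest` is hypothetical and does not describe how the S2 Global Mosaic Hub is implemented.

```python
def upsample_nearest(band, factor):
    """Nearest neighbour upsampling: repeat every pixel `factor` times per axis.

    band   -- list of rows of pixel values at the coarser resolution
    factor -- integer resolution ratio (assumed, e.g. 2 for 20 m -> 10 m)
    """
    out = []
    for row in band:
        # Repeat each value horizontally ...
        wide = [value for value in row for _ in range(factor)]
        # ... and repeat the widened row vertically.
        out.extend([list(wide) for _ in range(factor)])
    return out


coarse = [[1, 2],
          [3, 4]]
fine = upsample_nearest(coarse, 2)
print(fine)
```

Nearest neighbour resampling copies existing values rather than interpolating new ones, so the resampled channel contains only reflectances that were actually recorded.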

LinkedGeoData
LinkedGeoData is a framework which provides OSM data in a linked data format [14]. Here, data from OSM are augmented by linking them to other data via ontology matching supported by standardised ontology schemas. The ontology is defined by LinkedGeoData and not by us. A linked data endpoint is supported by linking specific SPARQL queries (DuCharme [43]), converting OSM data into linked data with semantic descriptions of each geo-object (e.g., streets, buildings, etc.), using classes defined in OWL. Consequently, each geo-object can be described by multiple classes; for example, Chinese restaurant is a subclass of Restaurant and furthermore of class Amenity. Thus, the geo-object is an instance of all three classes. There are 1300 (OWL) classes within the given ontology. Within this work, LinkedGeoData was set up on a local computer, storing all OSM data over Austria for December 2018 in a linked data format.
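The way a geo-object inherits all of its superclasses can be sketched as follows. The dictionary is a hypothetical excerpt that contains only the class chains mentioned in this paper; the class identifiers, and the simplification that every class has a single parent, are assumptions made for illustration and do not reproduce the LinkedGeoData ontology.

```python
# Hypothetical excerpt of a subclass hierarchy: child class -> direct parent.
# The identifiers and the single-parent simplification are assumptions.
PARENT = {
    "ChineseRestaurant": "Restaurant",
    "Restaurant": "Amenity",
    "PetShop": "Shop",
    "Shop": "Amenity",
}


def classes_of(owl_class, parent=PARENT):
    """Return the class itself followed by all of its ancestor classes."""
    chain = [owl_class]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


print(classes_of("ChineseRestaurant"))
```

A geo-object tagged with the most specific class would then be treated as an instance of every class in the returned chain.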

GSCM Construction
In order to train a deep learning model, input data has to be formatted as numerical values in vectors. Therefore, nominal descriptions provided by the semantics are transformed to a new feature space using a GSCM [40], where the records are the samples and the fields are the feature space. Once this matrix is computed, it can be linked to the remotely sensed imagery, extending the GSCM with information on the optical reflectances.
Consider a single grid cell provided by the CORINE LULC dataset. It has a geospatial extent of 100 m × 100 m and a label describing its LULC class. In addition to the grid cell, vector data from LinkedGeoData is present within and outside the grid cell. This vector data contains geo-objects with two attributes: a location (point geometry) and an OWL class. A feature vector for this grid cell is computed, containing descriptive statistics for each OWL class which is present within the grid cell as well as in a defined proximity $d_{max}$ around the grid cell centre (see Figure 2, left side): all geo-objects within a distance $d_{max}$ of the cell centre are described. Based on this subset of geo-objects, seven descriptive statistics are computed for each OWL class: (1) the minimum distance from the cell centre to a geo-object of this class, (2) the maximum distance from the cell centre to a geo-object of this class, (3) the standard deviation of all distances from the cell centre to geo-objects of this class, (4) the minimum azimuth from the cell centre to a geo-object of this class, (5) the maximum azimuth from the cell centre to a geo-object of this class, (6) the standard deviation of all azimuths from the cell centre to geo-objects of this class, and (7) the number of geo-objects of this class. As the OWL classes are structured in an ontology, each geo-object can be part of multiple classes. A geo-object of class Pet shop, for example, is also of class Shop. A geo-object is included in the calculation of the descriptive statistics for each OWL class it is part of. Thus, a geo-object of class Pet shop is included in calculating the descriptive values not only for class Pet shop but also for all of its parent classes (e.g., Shop and Amenity). The final feature vector contains these seven descriptive values for each class within the proximity $d_{max}$ of the grid cell centre. In cases where there is no geo-object within the proximity $d_{max}$, the corresponding grid cell is excluded from the procedure, as there is no data available around it. In cases where a specific OWL class is not present in the subset of geo-objects around the cell centre, its descriptive values are set to defaults: $d_{max}$ for descriptive values (1) and (2), and 0 for the other descriptive values. A matrix is formed when the procedure for creating a single vector based on semantics is undertaken for multiple grid cells (observations). This is the Geospatial Configuration Matrix (GSCM) [40]. Each row of the GSCM is a sample. It contains the descriptive values as well as the LULC class label which has to be classified correctly. The parameter $d_{max}$ defines the maximum distance within which geo-objects and their OWL classes are considered around a grid cell. Thus, $d_{max}$ parameterises the first law of Geography [44]. The GSCM can have up to 9100 columns (7 descriptive statistical values for 1300 OWL classes). One critical part of the experiments was to determine an optimal value for $d_{max}$. We used the thresholds suggested by [40]: 20 m, 50 m, 500 m, 1 km, 5 km, 10 km, and 30 km. For every $d_{max}$ value a separate GSCM was computed. These were used in Experiments 1 and 2 to assess the impact of different maximum distances used to extract geo-objects.
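The construction of a single GSCM row can be sketched as below. This is an illustrative reading of the description above rather than the implementation used in this work: the function name `gscm_features`, the planar coordinates in metres, the azimuth convention (degrees clockwise from north), the use of the population standard deviation, and the inclusive distance test are all assumptions.

```python
import math
import statistics


def gscm_features(centre, geo_objects, all_classes, d_max):
    """Seven descriptive statistics per OWL class for one grid cell.

    centre      -- (x, y) of the cell centre in metres (assumed planar CRS)
    geo_objects -- list of ((x, y), [OWL classes including parent classes])
    all_classes -- fixed, ordered list of OWL classes (the GSCM columns)
    Returns None when no geo-object lies within d_max (the cell is excluded).
    """
    per_class = {}
    for (x, y), classes in geo_objects:
        dx, dy = x - centre[0], y - centre[1]
        dist = math.hypot(dx, dy)
        if dist > d_max:
            continue
        # Azimuth in degrees, clockwise from north (an assumed convention).
        azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
        for owl_class in classes:
            per_class.setdefault(owl_class, []).append((dist, azimuth))
    if not per_class:
        return None

    row = []
    for owl_class in all_classes:
        observations = per_class.get(owl_class)
        if observations is None:
            # Absent class: d_max for min/max distance, 0 for the rest.
            row.extend([d_max, d_max, 0.0, 0.0, 0.0, 0.0, 0])
            continue
        dists = [d for d, _ in observations]
        azims = [a for _, a in observations]
        row.extend([
            min(dists), max(dists), statistics.pstdev(dists),
            min(azims), max(azims), statistics.pstdev(azims),
            len(observations),
        ])
    return row


# Two illustrative geo-objects; class names are placeholders.
objects = [((30.0, 40.0), ["PetShop", "Shop"]), ((0.0, 10.0), ["Shop"])]
row = gscm_features((0.0, 0.0), objects, ["PetShop", "Shop", "Peak"], d_max=1000.0)
print(len(row), row[:7])
```

Stacking such rows for many grid cells, together with the LULC label of each cell, would give a matrix with seven columns per OWL class.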

Linking Semantic and Image Information
The Sentinel-2 image information within a CORINE grid cell was clipped and attached to the corresponding GSCM (see Figure 2, right side). As the dimension of each CORINE grid cell is 100 m × 100 m and the imagery has a resolution of 10 m × 10 m, the clipped image has a size of 10 × 10 pixels. These clipped images were then vectorized, and the average and standard deviation of each channel were appended, yielding an ordered sequence of 1122 elements ((width × height + 2) × 11 channels).
Each vector was then appended to the corresponding GSCM row, resulting in a GSCM of 10,222 fields (9100 + 1122), composed of two sub-matrices: one containing the semantic information, denoted as S, and another containing the image information, denoted as I. This GSCM therefore had the form GSCM = [S|I]. In Experiment 1, both sub-matrices were used for the classification, thus GSCM = [S|I]. For Experiment 2, only S, the geospatial semantics, was used to classify the LULC, with I omitted, resulting in GSCM = [S]. Experiment 3 used only the image information, thus GSCM = [I]. For each LULC class, 12,000 samples were used for training and testing (156,000 samples in total). As seven different $d_{max}$ thresholds were used, seven different matrices were evaluated for Experiment 1 and Experiment 2 to assess the impact of each distance on the semantic information incorporated into the analyses. Experiment 3 used one matrix of the image information for all selected grid cells.
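The arithmetic behind the 1122 image features and the 10,222 fused fields can be checked with a short sketch. The ordering of the elements (pixel values of a channel followed by that channel's mean and standard deviation) and the names `image_features`, `semantic_part`, and `image_part` are assumptions for illustration; the text above fixes only the element counts.

```python
import statistics


def image_features(patch):
    """Flatten a clipped patch into (width * height + 2) values per channel.

    patch -- list of channels, each a list of rows of pixel values.
    Per channel: all pixel values, then their mean and standard deviation
    (this ordering is an assumption).
    """
    vector = []
    for channel in patch:
        pixels = [value for row in channel for value in row]
        vector.extend(pixels)
        vector.append(statistics.fmean(pixels))
        vector.append(statistics.pstdev(pixels))
    return vector


# Dummy 11-channel patch of 10 x 10 pixels (placeholder values).
patch = [[[float(c)] * 10 for _ in range(10)] for c in range(11)]
image_part = image_features(patch)      # sub-matrix I, one row
semantic_part = [0.0] * 9100            # stand-in for one row of S
fused_row = semantic_part + image_part  # one row of GSCM = [S|I]
print(len(image_part), len(fused_row))
```

Experiments 2 and 3 would then correspond to using only `semantic_part` or only `image_part` as the input row.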

Model Selection and Evaluation
After the GSCMs were constructed, deep learning was performed for LULC classification. First, however, a suitable deep learning model had to be found for each experiment. A multilayer perceptron (MLP) was applied in each experiment, as this was the optimal network type reported by [40]. Next, optimal hyperparameters were determined for the MLP model of each experiment in two steps: (1) finding optimal MLP models for each distance threshold $d_{max}$ by combining manual and random searches of the hyperparameters. As a result, an MLP architecture was obtained as well as a $d_{max}$ value for which the LULC classification worked best in the experiments; (2) given the optimal $d_{max}$ value, a second, more precise hyperparameter search for the MLP was performed using a nested cross-validation to ensure that the final MLP model was the most suitable. These steps are explained in detail below.
Step 1: The activation functions, number of neurons, and the optimiser function were searched for using a combination of manual and random search. Random search was used rather than systematic search, as the literature suggests its superiority [45]. The performance of every potential model was internally evaluated through a 10-fold cross-validation to avoid overfitting in the final model. Afterwards, the best model for each experiment was chosen, based on the overall accuracy and the kappa coefficient. For Experiments 1 and 2, this was done for each maximum distance threshold $d_{max}$. As Experiment 3 used remotely sensed imagery only, it did not depend on the threshold $d_{max}$. Therefore, the 10-fold cross-validation was done for each potential model in Experiment 3 but not for multiple $d_{max}$ threshold values. After this validation procedure, the $d_{max}$ value which yielded the highest classification accuracy was obtained, as well as the optimal MLP model for each experiment.
Step 2: A final 5-fold nested cross-validation was computed for each experiment, in order to gain confidence that the corresponding MLP models were optimal. In contrast to a normal cross-validation, the nested cross-validation computes an optimal classification model within every fold by applying a hyperparameter search. We applied a randomised search with an increased hyperparameter search space compared to Step 1. As a nested cross-validation can yield long runtimes due to its computational complexity, we employed it only for the optimal $d_{max}$ values for each experiment. Thus, a 5-fold nested cross-validation was undertaken for Experiments 1, 2, and 3, in which three hyperparameters were searched: (1) the number of layers {1, 2, 3, 4, 5}; (2) the number of neurons per layer {1400, 1300, 1200, 1100, 1000, 900}; and (3) the dropout rate {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. In this case, the 5-fold nested cross-validations did not identify classification models with higher overall accuracies or kappa coefficients than were already found. Future research will explore the hyperparameter search to a greater degree than undertaken here, as the focus of this work is to illustrate the potential of semantics and its fusion with remotely sensed imagery for LULC classification, rather than hyperparameter searching.
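The structure of a nested cross-validation with a randomised search over the three hyperparameters listed in Step 2 can be sketched as follows. The search space is taken from the text; everything else is an assumption made for illustration, including the number of inner folds, the number of random draws, the function names, and the caller-supplied `evaluate` function, which in a real run would train an MLP and return its score.

```python
import random

# Search space as listed in Step 2.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3, 4, 5],
    "n_neurons": [1400, 1300, 1200, 1100, 1000, 900],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
}


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nested_cv(n_samples, evaluate, outer_k=5, inner_k=5, n_draws=10, seed=0):
    """Return one (best_params, outer_score) pair per outer fold.

    evaluate(params, train_idx, test_idx) -> score is supplied by the caller.
    inner_k and n_draws are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    results = []
    for held_out in k_fold_indices(n_samples, outer_k, rng):
        held_out_set = set(held_out)
        train = [i for i in range(n_samples) if i not in held_out_set]
        # Inner folds are positions into the outer training part only.
        inner_folds = k_fold_indices(len(train), inner_k, rng)
        best_params, best_score = None, float("-inf")
        for _ in range(n_draws):
            params = {name: rng.choice(values)
                      for name, values in SEARCH_SPACE.items()}
            scores = []
            for fold in inner_folds:
                fold_set = set(fold)
                inner_test = [train[j] for j in fold]
                inner_train = [train[j] for j in range(len(train))
                               if j not in fold_set]
                scores.append(evaluate(params, inner_train, inner_test))
            mean_score = sum(scores) / len(scores)
            if mean_score > best_score:
                best_params, best_score = params, mean_score
        # The winning configuration is scored on the untouched outer fold.
        results.append((best_params, evaluate(best_params, train, held_out)))
    return results


def dummy_evaluate(params, train_idx, test_idx):
    # Placeholder score; a real run would train and score an MLP here.
    return -abs(params["dropout"] - 0.3)


results = nested_cv(50, dummy_evaluate)
print(len(results))
```

Because the hyperparameters are selected using only the inner folds, the score on each outer fold is not influenced by the search itself.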

Analyses
For each experiment, an accuracy assessment was made using overall accuracy, kappa, producer's accuracy (recall), and user's accuracy (precision). This allowed the different experiments to be compared quantitatively. These metrics are defined as follows:
Overall accuracy:
$$\text{overall accuracy} = \frac{\text{number of all correct predictions}}{\text{number of all predictions}} \tag{1}$$
Kappa [46]:
$$\kappa = \frac{p_0 - p_c}{1 - p_c} \tag{2}$$
where $p_0$ is defined as the proportion of correct predictions and $p_c$ as the expected proportion of correct predictions due to chance [46].
User's and producer's accuracy (precision and recall, respectively):
$$\text{user's accuracy} = \frac{t_p}{t_p + f_p} \tag{3}$$
$$\text{producer's accuracy} = \frac{t_p}{t_p + f_n} \tag{4}$$
where $t_p$ refers to true positives, $f_p$ to false positives, and $f_n$ to false negatives.
For the classification model with the highest overall accuracy and kappa coefficient for each experiment, a qualitative assessment was performed. Two major aspects were considered: (1) The geographical distribution of the classification error. Here, a grid covering the study area was used and the ratio of correctly versus incorrectly classified samples was computed for each grid cell. The grid cell size was set by the $d_{max}$ value which yielded the highest classification scores; (2) Selected samples and their surroundings were then visually explored. For this purpose, the Sentinel-2 image was extracted around the corresponding grid cells. This enabled insights to be gained on the characteristics of the input data used. For example, some samples were classified correctly using semantics only but not using imagery only. This might be due to the surrounding geo-objects as well as the imagery. The aim here was to examine classified samples and to determine potential characteristics in common. Four types of samples were defined:
(1) samples correctly classified in Experiment 2 (semantics only) but not in Experiment 3 (imagery only), to examine the potential advantages of using semantics only over using imagery only.
(2) samples classified correctly in Experiment 3 but not in Experiment 2. These samples illustrate cases where the imagery only approach provides higher classification accuracy than using semantics only.
(3) samples which were correctly classified in both Experiment 2 and Experiment 3.
(4) samples classified correctly in Experiment 1 but not in Experiments 2 and 3. These samples highlight situations in which neither semantics nor imagery alone was sufficient to classify correctly, but their fusion was.
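The four sample types can be expressed as a small decision function, shown here as a sketch. The function name `sample_type` and its boolean arguments are hypothetical; returning `None` for samples that no experiment classified correctly is an assumption, as such samples are not assigned a type above.

```python
def sample_type(correct_fused, correct_semantics, correct_imagery):
    """Map per-experiment correctness to one of the four sample types.

    correct_fused     -- sample classified correctly in Experiment 1
    correct_semantics -- sample classified correctly in Experiment 2
    correct_imagery   -- sample classified correctly in Experiment 3
    Returns None if the sample falls into none of the four types.
    """
    if correct_semantics and not correct_imagery:
        return 1
    if correct_imagery and not correct_semantics:
        return 2
    if correct_semantics and correct_imagery:
        return 3
    if correct_fused:
        return 4  # only the fusion of both sources succeeded
    return None


print(sample_type(True, False, False))
```

Types 1 to 3 depend only on Experiments 2 and 3, so the outcome of Experiment 1 matters only for type 4.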
The potential of using geospatial semantics for LULC classification as well as its synergies with remotely sensed imagery for this purpose were identified through these quantitative and qualitative assessments.
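The accuracy metrics used in the quantitative assessment can be computed from a confusion matrix, as in the following sketch. The function name `accuracy_metrics`, the orientation of the matrix (rows are true classes, columns are predicted classes), and the toy numbers are assumptions for illustration; the sketch also assumes that every class has at least one true and one predicted sample.

```python
def accuracy_metrics(confusion):
    """Overall accuracy, kappa, producer's and user's accuracy.

    confusion[i][j] = number of samples of true class i predicted as class j
    (this orientation is an assumption).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    # p_0: proportion of correct predictions (overall accuracy).
    p_0 = sum(confusion[i][i] for i in range(n)) / total
    # p_c: expected proportion of correct predictions due to chance.
    p_c = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (p_0 - p_c) / (1 - p_c)

    producers = [confusion[i][i] / row_sums[i] for i in range(n)]  # recall
    users = [confusion[i][i] / col_sums[i] for i in range(n)]      # precision
    return p_0, kappa, producers, users


# Toy two-class confusion matrix (placeholder values).
overall, kappa, producers, users = accuracy_metrics([[8, 2], [1, 9]])
print(overall, kappa, producers, users)
```

For the toy matrix, 17 of 20 samples are correct, which gives an overall accuracy of 0.85 and, with a chance agreement of 0.5, a kappa of about 0.7.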

Results and Analysis
The accuracy assessments of the three experiments are summarised in Table 2. The data fusion generated the highest accuracies (OA of 82.18% and a kappa coefficient of 0.8069). A d max of 1 km provided the highest accuracies for Experiments 1 and 2. The MLP model architectures can be seen in Figure 3. Experiments 1 and 2 had the same optimal MLP model architecture. The corresponding hyperparameters can be seen in Table 3. Of the two single data sources, semantics only provided the more accurate classification, with an overall accuracy of 76.11% and a kappa coefficient of 0.7412. Using remotely sensed imagery only, an overall accuracy of 65.52% and a kappa coefficient of 0.6264 were achieved. Observing the effect of the d max threshold in Table 2, it can be seen that the accuracies (both overall accuracy and kappa coefficient) first increase with d max but decrease once d max exceeds 1 km, for Experiments 1 and 2. Tables 4 and 5 show producer's and user's accuracy. They show that increasing d max increases the classification accuracy for single LULC classes. In addition, it can be observed that the user's accuracy is more homogeneously distributed than the producer's accuracy for Experiments 1 and 2. However, there is one exception, namely LULC class I, i.e., Urban fabric. Using semantics only (Experiment 2, see Table 5), producer's accuracy is highest when d max is set to 20 m and decreases with increasing d max. This stands in contrast to the user's accuracy of Urban fabric for this experiment, which increases with increasing d max. Observing the remaining LULC classes in Tables 4 and 5 for Experiments 1 and 2, it can be seen that LULC class IX (Forest) has the lowest producer's accuracies for most d max thresholds using semantics only.
This changes when the classification is based on fused semantics and imagery: for Forest, the majority of its producer's accuracy values are then no longer the lowest. In Experiment 3 (see Table 5, right side), LULC classes which have a higher producer's as well as user's accuracy seem to benefit the most from the data fusion. The confusion matrices can be seen in Figures 4 and 5. They reveal that a single data source generates higher classification accuracies for specific LULC classes, while the data fusion seems to combine these benefits. In particular, for Experiment 2 (Figure 4b), i.e., using semantics only, over 90% of the samples of LULC classes II, III, IV, VI, and XII were classified correctly. They correspond to Industrial, commercial, and transport units, Mine, dump, and construction sites, Artificial, non-agricultural vegetated areas, Permanent crops, and Inland wetlands, respectively. Only LULC class Forest is below 50% accuracy for Experiment 2; it was mostly confused with Arable land, Pastures, and Scrub and/or herbaceous vegetation associations. Urban fabric (LULC class I) was classified with an accuracy of 58% and was mostly confused with Pastures for Experiment 2. These values changed once the data fusion was used (Experiment 1, see Figure 4a), where LULC class Urban fabric was classified with an accuracy of 71%.
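The accuracy measures used in this section (overall accuracy, kappa coefficient, producer's accuracy, and user's accuracy) can all be derived from a confusion matrix. The following minimal sketch illustrates the standard definitions; it is not the code used in the experiments. It assumes that rows hold the reference (ground truth) classes and columns the predicted classes, and the function name `accuracy_metrics` and the toy counts are hypothetical.

```python
def accuracy_metrics(cm):
    """Derive accuracy measures from a square confusion matrix.

    Assumption: cm[i][j] is the number of samples whose reference class is i
    and whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                      # reference totals
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted totals
    diag = [cm[i][i] for i in range(n)]

    # Overall accuracy: share of correctly classified samples.
    overall = sum(diag) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    # Cohen's kappa: agreement beyond chance.
    kappa = (overall - expected) / (1 - expected)

    nan = float("nan")
    # Producer's accuracy (recall): correct samples / reference total per class.
    producers = [d / r if r else nan for d, r in zip(diag, row_sums)]
    # User's accuracy (precision): correct samples / predicted total per class.
    users = [d / c if c else nan for d, c in zip(diag, col_sums)]

    return {
        "overall_accuracy": overall,
        "kappa": kappa,
        "producers_accuracy": producers,
        "users_accuracy": users,
    }


# Toy two-class example with invented counts (not taken from Tables 2, 4, or 5).
toy_cm = [[40, 10],
          [5, 45]]
metrics = accuracy_metrics(toy_cm)
print(metrics)
```

For the toy matrix, 85 of 100 samples lie on the diagonal, so the overall accuracy is 0.85, while the chance agreement from the marginals is 0.5, giving a kappa of 0.7.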

[Table header: Producer's Accuracy (Recall) and User's Accuracy (Precision) per LULC class, each reported for d max values of 20 m, 50 m, 500 m, 1 km, 5 km, 10 km, and 30 km.]
The confusion matrix for Experiment 3, which was based on imagery only, can be seen in Figure 5. Here, the highest classification accuracy was obtained for class Inland waters, followed by classes Open spaces with little or no vegetation and Permanent crops. In this experiment, the highest confusions can be observed between Scrub and/or herbaceous vegetation associations and Artificial, non-agricultural vegetated areas, between Heterogeneous agricultural areas and Arable land, and between Pastures and Heterogeneous agricultural areas.
The geographical distribution of classification errors can be seen in Figure 6. Figure 6a-c show the geographical distributions of the classification accuracy of Experiment 1, Experiment 2, and Experiment 3, respectively, as coloured 1 km × 1 km grid cells over Austria, each illustrating the ratio of correctly classified samples. A ratio of 1 states that 100% of the samples within a grid cell were classified correctly, whereas 0.60 indicates that 60% of the samples within a grid cell were classified correctly. Grid cells with the highest ratio are coloured dark blue and, as the ratio decreases, the colour shifts to beige. A grid size of 1 km × 1 km was chosen for the visualisation here, as this corresponds to the optimal d max value we computed. Several differences in the geographical distributions of the classification errors can be observed in Figure 6. For Experiment 1, most grid cells are coloured dark blue and distributed homogeneously over the ROI, illustrating that most of the samples were classified correctly.
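The per-cell ratio described above can be computed by binning the classified samples into a regular grid and dividing the number of correct predictions by the number of samples in each cell. The sketch below assumes sample coordinates in a projected coordinate system with metre units; the function name `grid_accuracy` and the sample tuples are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def grid_accuracy(samples, cell_size=1000.0):
    """Ratio of correctly classified samples per grid cell.

    Assumption: each sample is (x, y, reference_label, predicted_label),
    with x and y in metres in a projected coordinate system.
    """
    counts = defaultdict(lambda: [0, 0])  # cell index -> [correct, total]
    for x, y, reference, predicted in samples:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][1] += 1
        if reference == predicted:
            counts[cell][0] += 1
    return {cell: correct / total for cell, (correct, total) in counts.items()}


# Invented samples: two fall into cell (0, 0), one into cell (1, 0).
toy_samples = [
    (100.0, 200.0, "Forest", "Forest"),
    (900.0, 950.0, "Forest", "Pastures"),
    (1500.0, 200.0, "Urban fabric", "Urban fabric"),
]
print(grid_accuracy(toy_samples))
```

A ratio of 1.0 for a cell corresponds to the dark blue end of the colour scale in Figure 6, and lower ratios to the beige end.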
Overall, in subfigure b, fewer grid cells are coloured dark blue than in subfigure a but more than in subfigure c, matching the observations of the accuracy assessment. Considering the geographical distributions of dark blue grid cells, Figure 6b,c exhibit differences: subfigure c shows clusters of dark blue grid cells, one in the east of Austria at a lake (location I). This confirms the high classification accuracy for inland waters using imagery only.
The second cluster lies within the Alps (location II), confirming the high classification accuracy for LULC class Open spaces with little or no vegetation (such as mountains) when using imagery only. In contrast, Figure 6b does not seem to exhibit such strong clusters. A series of areas are beige in both Figure 6b,c; however, the corresponding areas in subfigure a are dark blue. This indicates that both data sources complement each other efficiently once fused. Considering Figure 7, a series of insights can be obtained. Although the Sentinel-2 imagery used was subject to cloud removal, some clouds remained (Figure 7b). The LULC class of this scene is Artificial, non-agricultural vegetated areas, which was classified correctly when using semantics only but wrongly when using imagery only. Another aspect which can be seen in the remaining images of Figure 7 is that the LULC classes to be classified are related to man-made structures. For example, Figure 7 might appear as a forest at first glance; however, it is part of a ski slope in the mountains, making it a LULC of Artificial, non-agricultural vegetated areas. The remaining six subfigures show areas with man-made structures, such as Industrial, commercial, and transport units, Mine, dump, and construction sites, Urban fabric, or Permanent crops.
This behaviour changes in the scenes of Figure 8, which shows cases where the classification was correct when using imagery only but failed when using semantics only. Some grid cells (red squares) here are associated with LULC classes of natural green spaces but have a few man-made structures within their proximity (Figure 8b,c,e,g). For example, in Figure 8b, the grid cell to be classified is within a forest; however, it is surrounded by man-made structures such as streets and houses. Other grid cells are within a 1 km × 1 km area which contains a mixture of different man-made structures (Figure 8d,f,h). Finally, Figure 8a shows a grid cell which is in a homogeneous industrial area with a river in its proximity. In this case, the classification using semantics only yielded Inland waters, whereas the classification using imagery only predicted correctly that the grid cell is an instance of the LULC class Industrial, commercial, and transport units. The discriminative power of using semantics only seems to suffer from such mixed cases. In Figure 9a-h, eight scenes are shown in which both approaches, using semantics only and imagery only, yield correct classification results. Here, most images show a homogeneous surface (see Figure 9a-e). For example, Figure 9b shows forest only and Figure 9d shows mountainous area only. Furthermore, Figure 9g,h show scenes in which the imagery is consistent within the red square and its immediate surroundings and the semantic sources are evenly spread around the red square. In Figure 9g, forest is present in almost every direction around the red square, while in Figure 9h, an industrial compound is present within as well as around the red square. Figure 10a-h show cases where the fusion of semantics and imagery (Experiment 1) classified correctly, but classifications based on semantics only and imagery only yielded incorrect results.
In Figure 10a-h, two aspects can be observed. (1) Within the red squares the imagery is mixed: for example, in Figure 10a,c,d,f,g, the imagery within the red square mixes forest-like and grassland-like textures. (2) Figure 10c-g contain no semantic information (buildings, streets, etc.) within the red square but only outside of it.

Discussion
There are two major findings from this work. First, LULC can be classified using semantics only. In our experiments we found that the semantics of geo-objects provide meaningful information and enable the corresponding LULC class to be determined. Second, fusing semantics with imagery enhanced the classification results. Their combination increased the accuracy of the LULC classification compared to using either data source alone. Additionally, some LULC classes were predicted better using semantics only than using imagery only, which is reflected in the accuracy assessments and the qualitative analyses. This performance is discussed below from two perspectives: the first examines the overall performance of the LULC classifications and the second discusses the per class accuracies. The qualitative analyses are also discussed and highlight how semantics can be used as an information source in LULC classification.

Overall Classification Results
The overall accuracies as well as kappa coefficients suggest not only that LULC can be classified based on semantics but also that the fusion with imagery yields improved results. The impact of d max is important: an optimal value was found in Experiments 1 and 2, with accuracy decreasing if d max was higher or lower than this value. A potential explanation for this is that d max controls the area around a sample from which OWL class information of geo-objects is obtained. Thus, a low d max value results in too little local information about the types of nearby geo-objects. In contrast, a higher d max value results in the loss of valuable local information, as the computation of the feature vector relies on aggregation functions, such as the standard deviation and the maximum. However, despite the impact of d max , the fusion of imagery and semantics was always found to be superior to the classification using semantics only, for any d max value. This suggests that semantics, when used as auxiliary information to imagery, complement it in a meaningful way, independent of the d max value.
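The role of d max as a neighbourhood radius can be illustrated with a small sketch. It does not reproduce the GSCM feature computation used in this work; it only assumes, for illustration, that distances from a sample to the surrounding geo-objects are aggregated per OWL class with a count, the maximum, and the standard deviation. The function name `semantic_features`, the coordinates, and the choice of aggregating distances are all hypothetical.

```python
import math
import statistics


def semantic_features(sample, geo_objects, owl_classes, d_max):
    """Aggregate geo-objects within d_max of a sample, per OWL class.

    Assumptions (illustrative only): sample is (x, y) in metres, geo_objects
    is a list of (x, y, owl_class), and each class contributes three features:
    object count, maximum distance, and standard deviation of distances.
    """
    sx, sy = sample
    distances = {owl_class: [] for owl_class in owl_classes}
    for x, y, owl_class in geo_objects:
        d = math.hypot(x - sx, y - sy)
        # Only geo-objects inside the d_max radius contribute.
        if d <= d_max and owl_class in distances:
            distances[owl_class].append(d)

    features = []
    for owl_class in owl_classes:
        ds = distances[owl_class]
        if ds:
            features += [len(ds), max(ds), statistics.pstdev(ds)]
        else:
            features += [0, 0.0, 0.0]  # no information for this class
    return features


# Invented geo-objects at 50 m, 500 m, and 2000 m from the sample.
toy_objects = [(30.0, 40.0, "Shop"), (300.0, 400.0, "Shop"), (0.0, 2000.0, "Peak")]
for radius in (20.0, 1000.0, 5000.0):
    print(radius, semantic_features((0.0, 0.0), toy_objects, ["Shop", "Peak"], radius))
```

With a radius of 20 m the feature vector is empty of information, which mirrors the explanation above for low d max values. With very large radii, distant geo-objects enter the same aggregates as nearby ones, so local detail is smoothed out.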

Classifications of Single Classes
The results showed that geospatial semantics predict certain LULC classes better than others and that the fusion of semantics and remotely sensed imagery created a synergy, which yielded superior per class accuracies. For example, LULC class I (Urban fabric) was classified with a similar accuracy from single data sources (see Figures 4b and 5), but the fusion resulted in a superior classification accuracy (see Figure 4a). The same was true for classes V (Arable land), VIII (Heterogeneous agricultural areas), X (Scrub and/or herbaceous vegetation associations), XI (Open spaces with little or no vegetation), XII (Inland wetlands), and XIII (Inland waters). The biggest classification improvement using the fused data was found for LULC class Urban fabric. A potential explanation for this is that semantics complement the imagery well for this class. The semantics allow areas with a similar spectral signature but different underlying LULC class to be differentiated and vice versa. In general, the LULC classes were classified more accurately when the data fusion was used, overcoming the confusion within and between LULC classes that occurred when semantics only and imagery only were used. For example, while Urban fabric was mostly confused with class II (Industrial, commercial, and transport units) using imagery only, it was mostly confused with class VII (Pastures) using semantics only. Consequently, using the fused data, the classification model is able to better distinguish between Urban fabric and Industrial, commercial, and transport units when semantics are included and can differentiate better between Urban fabric and Pastures using information from imagery. For LULC classes Industrial, commercial, and transport units (class II), Mine, dump, and construction sites (class III), Artificial, non-agricultural vegetated areas (class IV), and Inland wetlands (class XII), semantics only was sufficient to achieve classification accuracies of over 90%.
A potential explanation for this is that the semantics used are based on LinkedGeoData, which itself is based on OSM data. Thus, some regions might have greater coverage (and thus more mapped objects), providing more semantics. As such, classes II-IV could potentially benefit from this fact, as they are related to man-made structures, increasing the likelihood that relevant local data is captured by OSM volunteers. Furthermore, specific OWL classes could improve the detection of these LULC classes. For example, residential houses and an industrial complex might look similar on satellite imagery, while OWL classes can describe them with meaningful concepts such as residential house and factory, allowing a distinction of areas based on their functions and usage. In contrast, two LULC classes were classified more accurately with imagery only than with semantics only, namely classes V (Arable land) and IX (Forest). A potential reason for the lower classification accuracy of semantics for these two classes could be that OWL classes from LinkedGeoData exhibit less significant associations to non-urban areas than to urban areas [42]. Thus, semantics in these areas might be too sparse to improve the classification.
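Statements of the form "class A was mostly confused with class B" can be read directly from the rows of a confusion matrix. The following sketch shows one way to extract, for each reference class, the most frequent wrong prediction and its share of that class's samples. The function name `most_confused` is hypothetical, the matrix is assumed to have reference classes as rows, and the counts are invented; they are not the values shown in Figures 4 and 5.

```python
def most_confused(cm, labels):
    """For each reference class, find the most frequent wrong prediction.

    Assumption: cm[i][j] counts samples of reference class i predicted as j.
    Returns {label: (confused_with, share_of_row)}; confused_with is None
    if the class was never misclassified.
    """
    result = {}
    for i, row in enumerate(cm):
        # Largest off-diagonal entry in this row and its column index.
        count, j = max((c, k) for k, c in enumerate(row) if k != i)
        row_total = sum(row)
        if count == 0 or row_total == 0:
            result[labels[i]] = (None, 0.0)
        else:
            result[labels[i]] = (labels[j], count / row_total)
    return result


# Invented counts for three classes (rows: reference, columns: predicted).
toy_labels = ["Urban fabric", "Industrial, commercial, and transport units", "Pastures"]
toy_cm = [
    [60, 10, 30],
    [6, 90, 4],
    [8, 2, 90],
]
for label, (other, share) in most_confused(toy_cm, toy_labels).items():
    print(f"{label}: mostly confused with {other} ({share:.0%})")
```

Comparing the output of such a function for the semantics-only and imagery-only models makes the complementary confusions discussed above explicit.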

Semantics for LULC Classification
Geospatial semantics exhibit different characteristics to conventional sensor data like optical imagery, when classifying LULC. For example, semantics rely on nominal values while optical imagery relies on ratios obtained from electromagnetic reflectance. As such, geospatial semantics reflect the meaning of geo-objects, such as Building or Bench and not their physical characteristics. In the case of LinkedGeoData, this is derived from OSM, which is created by volunteers. They capture and annotate the vector data, making themselves the sensors. This consequently enables the inclusion of a variety of different geo-object meanings into the LULC classification, as captured by the crowd of volunteers.
As such, OWL classes provide potentially specific and meaningful class descriptions. For example, OWL class Peak is typically recorded on mountains and can therefore help to find the corresponding LULC class Open spaces with little or no vegetation (see Figure 9d). Another example of such a characteristic OWL class is Chair-lift, which often occurs close to skiing areas/slopes. Here, the OWL class can help to identify slopes which are of LULC class Artificial, non-agricultural vegetated areas (see Figure 7). However, the advantage of having specific and meaningful OWL classes can become a disadvantage too: an industrial area can have OWL class River in its proximity, leading to a final LULC classification of Inland waters instead of Industrial, commercial, and transport units (see Figure 8a). In general, an even geographic distribution of characteristic geo-objects within the proximity of d max was found to foster a correct LULC classification when using semantics only. Figure 9c,e,f show such situations: here, the entire 1 km × 1 km scene is covered with geo-objects. By contrast, in some cases, geo-objects are present within the proximity of d max , but the sample grid cell belongs to a LULC class atypical for them. An example of such cases can be seen in Figure 8c,g: here, geo-objects such as houses might have led the classifier to compute that the samples are of LULC class Urban fabric, although they are of LULC class Pastures and Forest, respectively. This is likely to be due to these grid cells (the red squares) being dissimilar to their surroundings. If the grid cell of the LULC is similar to the geo-objects within the d max proximity, the classifications tend to be correct. Examples of such cases can be seen in Figure 7a,c-e,h as well as Figure 9h. Furthermore, an important aspect of semantics as a data source becomes apparent when looking at Figure 7: image effects such as clouds do not affect the semantics.
In general, geospatial semantics rely on observations made by volunteers on the ground which, unlike spaceborne or airborne observations, do not need atmospheric corrections.

Future Work
This work takes a first step towards a new research domain which aims at understanding the relationship between semantics and LULC classification. It has deepened our understanding of the potential use of semantics for this task. This could be extended further by examining the impact of the ontology, i.e., the structure of OWL classes, such as the one in LinkedGeoData. An ontology with a deeper or wider structure, providing more specific classes or more classes overall, respectively, could perhaps improve the accuracy even further. Thus, this research direction would focus on the ontology as a classification parameter. Semantics allow space to be described in terms of meaning, which could be used for novel analysis methods, for example through the use of explainable artificial intelligence (XAI) to determine which types of geo-objects (OWL classes) are relevant for specific LULC classes. This would enable one to relate LULC classes to meaningful and human understandable concepts (OWL classes) and therefore help one to understand LULC in a novel way. In addition to investigating the role of geospatial semantics for LULC classification, future research could explore advanced deep learning architectures in order to improve the classification accuracy even further. In particular, networks with residual connections, convolutional neural networks [47] (for the images, as a separate input branch of the ANN), or attention mechanisms [48] could be employed to achieve even higher classification accuracies. Furthermore, other types of machine learning algorithms, such as support vector machines, could be explored too. Here, semantics were fused with a single image (mosaic) in feature space; future work can research how to combine semantics with multi-temporal imagery in an effective manner. Additionally, other types of remotely sensed imagery could be used, such as hyperspectral [49] or synthetic aperture radar [50] imagery.
CORINE is itself associated with a certain classification accuracy; as we rely on it as the ground truth, our results come with the corresponding caveat. Other sources of LULC ground truth could therefore be used in future work. Furthermore, a combination of different ground truth sources could be used instead of a single one, increasing the overall reliability. In this work geospatial semantics were obtained from LinkedGeoData (http://linkedgeodata.org/ accessed on 5 January 2021) for the region of Austria. Data outside of Austria can be obtained from LinkedGeoData as well, providing the basis for future work which studies how Semantic Boosting works in other ROIs. Additionally, we plan on releasing the processed GSCM data for Austria as well as outside of Austria (for future studies).

Conclusions
The focus of this research was to investigate the inclusion of geospatial semantics within a LULC classification of remotely sensed imagery. For this purpose a GSCM was used and extended in order to combine the image information and semantics at a feature level. The results show that when geospatial semantics are fused with remotely sensed imagery, LULC classification accuracies are increased. In particular, LULC classes which relate to man-made structures, such as Urban fabric, are classified with higher accuracy once the combination is used. Furthermore, geospatial semantics alone were shown to support the classification of LULC classes with promising accuracy, especially for LULC classes which relate to specific land use, such as mines or industrial areas. The qualitative analysis showed that, in a series of cases, semantics enabled areas to be classified correctly which would otherwise have been confused with other LULC classes that have similar spectral signatures (e.g., Artificial, non-agricultural vegetated areas and Scrub and/or herbaceous vegetation associations). In addition to the accuracy assessment and the qualitative analysis, the geographical distribution of the classification accuracy was analysed. Here, it was found that the combination of both information sources (imagery and semantics) yields correct LULC classifications which are homogeneously spread over the study area, while the single sources yield correct LULC classifications which are more clustered in some regions. Overall, the results show that geospatial semantics are a fruitful source for LULC classification, especially once they are combined with imagery.

Funding: This research received no external funding.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.