SatImNet: Structured and Harmonised Training Data for Enhanced Satellite Imagery Classification

Automatic supervised classification of satellite images with complex modelling such as deep neural networks requires the availability of representative training datasets. While there exists a plethora of datasets that can be used for this purpose, they are usually very heterogeneous and not interoperable. This prevents the combination of two or more training datasets for improving image classification tasks based on machine learning. To alleviate these problems, we propose a methodology for structuring and harmonising open training datasets on the basis of a series of fundamental attributes we put forward for any such dataset. By applying this methodology to seven representative open training datasets, we generate a harmonised collection called SatImNet. Its usefulness is demonstrated for enhanced satellite image classification and segmentation based on convolutional neural networks. Data and open source code are provided to ensure the reproducibility of all obtained results and facilitate the ingestion of additional datasets in SatImNet.


Introduction
Data-driven modelling requires sufficient and representative samples that capture and convey significant information about the phenomenon under study.
Especially in the case of deep and convolutional neural networks (DNN-CNN), large training sets are a prerequisite for adequately estimating the high number of model weights (i.e. the strength of the connections between neural nodes) and for avoiding over-fitting. The lack of sizeable labelled training data may be addressed by transfer learning [1]. For instance, there exist large collections of pre-trained models dealing with image classification [2,3]. However, these models are trained on true-colour images showing humans, animals, or landscapes, and it remains questionable whether transfer learning improves the general-purpose classification or segmentation of satellite images with various spectral, spatial, and temporal resolutions.
The collection of good-quality training sets for supervised learning is an expensive, error-prone [4] and time-consuming procedure. It involves manual or semi-automatic label annotation, verification, and the deployment of a suitable sampling strategy, such as systematic, stratified, reservoir, cluster, snowball, or time-location sampling [5]. In addition, sampling and non-sampling errors and biases need to be taken into account and corrected for. In satellite image classification, factors such as the spectral, spatial, radiometric and temporal resolution, the type of sensor (active or passive [6]) and the data processing level (radiometric and geometric calibration, geo-referencing, atmospheric correction), to name a few, synthesize a manifold of concepts and features that need to be accounted for. The training data have been processed for the optimisation of the information retrieval process over a distributed disk storage system. We also demonstrate the blending of different information layers by employing deep neural network modelling and solving concurrently the tasks of image classification and segmentation. This work ties in with the framework of the European strategy for Open Science [7], which strives to make research more open, global, collaborative, reproducible and verifiable. Accordingly, all the data, models and programming code presented herein are provided under the FAIR (findable, accessible, interoperable and reusable) conditions.
The paper is structured as follows: Section 2 discusses the significant features that make open-source training sets for satellite image classification interoperable, and introduces the SatImNet collection, which organizes existing training sets in an optimized and structured way. Section 3 demonstrates CNN models that have been trained on blended training sets and satisfactorily solve a satellite image classification and segmentation task. Section 4 describes the computing platform on which the experimental process has been performed. Section 5 underlines the contribution of the present work and outlines the way forward.

SatImNet collection
In this section, we describe the initial edition of the SatImNet (Satellite Image Net) collection, a compilation of seven open-source training sets targeting various EO applications. Then, we define the minimal and necessary attributes a training set needs to comply with in order to be functional and ready to serve a constructive blending with other satellite-derived products. Next, we tabulate the training sets under consideration according to the defined attributes. Lastly, we elaborate on the rationale behind our choices for the data structuring. We note that the SatImNet collection is constantly being augmented by incorporating either open-source datasets from the web or EO-derived products created in-house by the JRC (see Section 4) and made available under an open data schema.

Description of the training sets
The initial edition of SatImNet consists of seven diverse training sets:
1. DOTA: A Large-scale Dataset for Object DeTection in Aerial Images, used to develop and evaluate object detectors in aerial images [8];
2. xView: contains proprietary images (DigitalGlobe's WorldView-3) from complex scenes around the world, annotated using bounding boxes [9];
3. Airbus-ship: combines Airbus proprietary data with highly-trained analysts to support the maritime industry and monitoring services [10];
4. Clouds-s2-taiwan: contains Sentinel-2 True Colour Images (TCI) and corresponding cloud masks [11], covering the area of Taiwan;
5. Inria Aerial Image Labeling: comprises aerial ortho-rectified colour imagery with a spatial resolution of 0.3 m and ground truth data for two semantic classes (building and no building) [12];
6. BigEarthNet-v1.0: a large-scale benchmark archive consisting of Sentinel-2 image patches, annotated with the multiple land-cover classes provided by the CORINE Land Cover database of the year 2018 [13];
7. EuroSAT: consists of numerous Sentinel-2 (L1C) patches provided in two editions, one with 13 spectral bands and one with the basic RGB bands; all the image patches refer to 10 classes and are used to address the challenge of land use and land cover classification [14].
With regard to satellite imagery, one of the discriminant features is the spatial resolution, which substantially determines the type of application. The high-resolution imagery provided by DOTA, xView, Airbus-ship, and Inria Aerial Image Labeling is suitable for object detection, localisation, and identification.
The remaining three datasets mostly fit applications involving both image-patch and pixel-wise classification.

Major features of an interoperable training set
Deep supervised learning and complex, multi-parametric modelling call for big training sets, the creation of which is a laborious and time-consuming task.
A potential solution to this issue is the collection of different open datasets and their exploitation in such a way that they complement each other and act in a synergistic manner. In the following, we define the minimal and essential attributes that should characterize a training set when examined under the prism of interoperability and completeness. For clarity, we have grouped the attributes into three categories.

1) Attributes related to the scope of the training data

• Classification problem: denotes the family/genre of supervised learning problem that the specific dataset can serve most effectively.
• Intended application: explains the primary purpose that the specific dataset serves according to the data designers.
• Definition of classes: signifies how the class annotations are originally provided by the dataset designer.
• Conversion of class representation: a feature related to the SatImNet collection, indicating whether the original type of class annotations has been converted into an image mask.
• Way of annotation: provides information about the class label annotation: whether it derives from a manual or automated procedure, or whether it is based on expert opinion, volunteering effort or organized crowd-sourcing; it serves as a qualitative indicator of the reliability of the provided data.
• Way of verification: linked with the former feature, it refers as well to the reliability and confidence of the information transmitted by the data.
• Licence: the existence of terms, agreements, restrictions as were explicitly stated by the data publishers.
• URL: the original external web link to the data.
2) Attributes related to the usage and sustainability of the training data

• Geographic coverage: reveals the morphological and object variability, as well as the potential irregularities, covered by the candidate dataset.
• Timestamp: the image sensing time or the time frame that covers the image acquisition is crucial information for change detection and seasonality-based applications. This piece of information is closely related to the concept of temporal resolution but cannot be used interchangeably.
• Data volume: helps to determine disk storage and memory requirements.
• Number of classes: shows the plurality and the exhaustiveness of the targets to be identified.
• Name of classes: the semantic name that describes the generic category at which one object/target belongs.
• Naming convention: whether the file name conveys additional information such as sensing time, class name, location and so on.
• Quality of documentation: a subjective qualitative annotation about the existence of sufficient explanatory material.
• Continuous development: a qualitative indicator about data sustainability, error correction and quality improvement.
3) Intrinsic image attributes, which can also be used for a proper viewing of the image raster

• Spatial resolution: as mentioned before, it determines the target object that is subject to detection or recognition, and indirectly points at the physical size of the training samples. Spatial resolution is often expressed in meters.
• Spectral resolution: refers to the capacity of a satellite sensor to measure specific wavelengths of the electromagnetic spectrum. In a data fusion context, spectral resolution helps to compare and match image bands delivered by different sensors.
• Temporal resolution: the amount of time that elapses before a satellite revisits a particular point on the Earth's surface. Although temporal resolution is an important attribute, there are currently no training sets that cover this aspect in detail.
• Type of original imagery: a piece of information with reference to the sensor type and source, the availability of masks, the existence of georeference and other auxiliary details.
• Orientation: information referring mainly to image or target rotation/positioning; in case of non-explicit statement, this feature contains basic photogrammetry information such as rectification and geometric correction.
• File format: indicates file compression, the file reader and encoding/decoding type, availability of meta-information (e.g. GeoTIFF, PNG, etc.), and the number of channels/bands.
• Image dimensions (rows × cols): a quick reference to estimate the batch size during the training phase.
• Number of bands: number of channels packed into a single file or number of separate single-band files belonging to a dedicated subfolder associated with the image name.
• Name of bands: standard naming of the image channels like RGB (Red/Green/ Blue) or specific naming that follows the product convention such as the naming of Sentinel-2 products.
• Data type per band: essential information about the range of band values that differentiates the data distributions and impacts the data normalization processes.
• No data value: specifies the presence of invalid values that affect processes such as data normalization and masking.
• Metadata: concerns mostly the geo-referenced and time-stamped images.
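The last few attributes (data type per band, no-data value) directly affect pre-processing. A minimal numpy sketch of per-band min-max normalisation that masks out a no-data value (function name and values are illustrative, not part of the SatImNet tooling):

```python
import numpy as np

def normalise_bands(img, nodata=0):
    """Min-max normalise each band of a (bands, rows, cols) array,
    ignoring pixels equal to the no-data value."""
    out = np.zeros(img.shape, dtype=np.float32)
    for b in range(img.shape[0]):
        band = img[b].astype(np.float32)
        valid = band != nodata          # mask out invalid pixels
        if not valid.any():
            continue
        lo, hi = band[valid].min(), band[valid].max()
        if hi > lo:
            out[b][valid] = (band[valid] - lo) / (hi - lo)
    return out
```

No-data pixels are left at zero, so they do not distort the per-band data distribution used for normalisation.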
In Tables 1, 2, and 3, the seven training sets are tabulated in accordance with the reported attributes. Additional informative features would be the purpose of dataset usage with respect to other research works, the number of citations, impact gauging, and the physical location (file path) on the local or remote storage system. In some cases, data publishers provide the date of the dataset release, but this should not be confused with the pivotal attribute of timestamp mentioned above.
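As a concrete illustration, the attributes of one dataset could be encoded as a json record of the kind the collection relies on. The values below are taken from the descriptions above (EuroSAT); the field names are our own illustration, not a normative SatImNet schema:

```python
import json

# Hypothetical record following the three attribute categories;
# field names are illustrative, not a fixed schema.
eurosat_record = {
    "name": "EuroSAT",
    "scope": {
        "classification_problem": "image patch classification",
        "definition_of_classes": "one class label per image patch",
        "licence": "open",
    },
    "usage": {
        "number_of_classes": 10,
        "naming_convention": "class name encoded in the file name",
    },
    "image": {
        "spatial_resolution_m": 10,
        "number_of_bands": 13,
        "file_format": "GeoTIFF",
        "image_dimensions": [64, 64],
    },
}

# Serialise to json, the database-free format chosen for SatImNet.
serialised = json.dumps(eurosat_record, indent=2)
```

Such records are lightweight, human-readable, and need no specialized software to parse, which matches the standalone character sought for the collection's modules.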

Organization of the SatImNet collection
This section explains the rationale behind the chosen data model. The SatImNet collection has been structured in a modular fashion, preserving the independent characteristics of each constituent dataset while providing a meta-layer that acts in a similar way as an ontology does, recording links and modelling relations among concepts and entities from the different datasets. Figure 1 displays the typical route a query follows across the central semantic meta-layer and each information module, which condenses the essential information that characterizes every file. All the structures that compose the data model are represented by nested short tree hierarchies, which have proven to be quite efficient in information retrieval tasks [15]. Since the entire collection, or part of it, is going to be transferred over the network, we selected a database-free solution for the tree hierarchies based on json files. This lightweight approach grants a standalone character to the modules of the collection, independent of specialized software and transparent to the non-expert end user. The lack of indexing, which critically impacts query speed, can be tackled (whenever feasible) by keeping the depth and breadth of every single tree at moderate sizes. A consequence of this is the creation of multiple json files. At this point, we underline the fact that our baseline system, upon which we optimize all the processes, is the EOS open-source distributed disk storage system developed at CERN [16], having as front-end a multi-core processing platform [17,18]. This configuration allows multi-tasking and is suitable for distributed information retrieval out of many files. Another bounding condition set by the baseline system is the avoidance of generating many small-sized files, given that EOS guarantees minimal file access latencies via the operation of in-memory metadata servers (MGMs) which replicate the entire file structure of the distributed storage system. For this reason, the files of
the training sets have been zipped into larger archives, the size of which has been optimized so as to not stress the metadata server whilst allowing efficient data transfer across the network. Reading individual files from zip archives can be achieved through various interfaces and drivers; in our case, we employ the open-source Geospatial Data Abstraction Library (GDAL) [19], which provides drivers for all the standard formats of raster and vector geospatial data. Jupyter notebooks demonstrating the execution of queries, as well as the respective information retrieval, are provided. From the ESM we considered all the pixels pointing at residential and non-residential built-up areas; the non-residential built-up areas refer to detected industrial buildings, big commercial stores and facilities. All the produced masks were resampled to 10 m spatial resolution using nearest-neighbour interpolation.
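GDAL reads such archive members through its `/vsizip/` virtual file system. The same access pattern, reading a single member without extracting the whole archive, can be sketched with Python's standard `zipfile` module (the archive and file names below are invented for illustration):

```python
import io
import zipfile

# Build a small in-memory archive standing in for a SatImNet zip file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patches/patch_0001.txt", "fake raster bytes")

# Read one member directly, analogous to opening
# /vsizip/archive.zip/patches/patch_0001.tif with GDAL.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read("patches/patch_0001.txt")
```

Bundling many small files into a few larger archives in this way keeps the number of entries seen by the EOS metadata servers low while still allowing random access to individual training samples.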

Convolutional Neural Network Modelling
Although CNNs have been experimentally proven more adequate for object detection [23,24] and image segmentation [25,26] at very high spatial resolution, a considerable number of recent works [27,28,29,30] demonstrate promising results at coarser scales. Nevertheless, at less fine spatial resolutions such as those of S2 products, most works focus on image/scene classification and not on pixel-wise segmentation. All the above-mentioned parametrisations, as well as the proposed CNN topologies, derived from an extensive, repetitive experimental process (3-fold grid search over the parameter set mentioned above). The input-output schema for the three models is depicted in Figure 2. We note here that the purpose of this case study is not to conduct a comparative analysis of widely accepted CNN-based classification or segmentation models against the proposed ones. The presented CNN topologies are lightweight modelling approaches that achieve satisfactory results, as Table 4 shows. In this Table we report the best overall accuracy achieved by the proposed models using the four S2 10 m spatial resolution bands. Figure 3 displays some indicative results.
In an attempt to improve accuracy by finding the best band combination as input to the model, we concluded that the four 10 m bands, Blue (B02), Green (B03), Red (B04) and NIR (B08), give the most consistent results in the greater part of the performed experiments. In fact, we found that there is an intense dissimilarity between the data distributions of the EuroSAT-

Discussion
In this experiment, we decided not to use a transfer learning approach in terms of pre-trained and configured NN layers. Instead, we wanted to check whether the amount of training data was sufficient, and we aimed at designing relatively light CNN topologies, keeping the number of model weights as low as possible while retaining the structural capacity of the model at adequate levels. The maximum 10-fold cross-validation accuracy reaches 99.37% for CNN-class and 95.41% for the 3-class output of the CNN-dual.
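The reported figures are maximum accuracies over 10-fold cross-validation. The fold bookkeeping behind such a figure can be sketched as follows (numpy, with a trivial stand-in classifier; the real models are the CNNs described in this paper):

```python
import numpy as np

def kfold_accuracies(X, y, train_and_eval, k=10, seed=0):
    """Shuffle, split into k folds, and return the per-fold accuracy
    produced by `train_and_eval(X_tr, y_tr, X_te, y_te)`."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append(train_and_eval(X[train_idx], y[train_idx],
                                   X[test_idx], y[test_idx]))
    return accs

# Stand-in "model": always predict the majority training class.
def majority_baseline(X_tr, y_tr, X_te, y_te):
    pred = np.bincount(y_tr).argmax()
    return float((y_te == pred).mean())
```

Reporting the maximum (rather than the mean) over the folds, as in Table 4, then amounts to `max(kfold_accuracies(...))`.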

Reproducibility and Computing Platform
Training deep neural networks requires hardware and libraries that are fine-tuned for array-based intensive computations. Multi-layered networks rely heavily on matrix operations and demand immense amounts of (mostly floating-point) computing capacity. For some years now, the state of the art in this type of computing, especially for image processing, has been shaped by powerful machinery such as Graphical Processing Units (GPUs) and their optimized architectures.
In this regard, the JRC (Joint Research Centre) Big Data Analytics project, whose major objective is to provide services for large-scale processing and data analysis to the JRC scientific community and its collaborative partners, is constantly increasing its fleet of GPU-based processing nodes, including NVIDIA Tesla K80, GeForce GTX 1080 Ti, Quadro RTX 6000 and Tesla V100-PCIE cards. Dedicated Docker images with the CUDA [35] parallel computing model, back-ends and deep learning frameworks such as TensorFlow, Apache MXNet and PyTorch [36], and adequate application programming interfaces have been configured to facilitate and streamline the prototyping and large-scale testing of working models.
The entire experimental setting presented here has been run on the aforementioned platform, the so-called JEODPP (JRC's high-throughput computing platform) [18]. Data, Jupyter notebooks and Docker images are open and accessible upon request. This decision is in conformity with the FAIR Guiding Principles for scientific data management and stewardship, and promotes Open Science.

Conclusion
This paper proposes a methodology to structure and harmonise open-source training datasets designed for satellite image classification, in view of fusing them with other Earth Observation (EO)-based products. It introduces the SatImNet collection of seven open training datasets, structured and harmonised along a series of fundamental attributes.

Table notes: (1) enables discovery of more object classes; improves detection of fine-grained classes; (2) combines public domain imagery with public domain official building footprints.

Figure 1: Schematic representation of the information retrieval task: the query passes first through the semantic meta-layer, which seeks conceptual similarities, and then searches across the json files, which retain the synoptic but necessary information about every individual file.
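The route sketched in Figure 1, a query resolved against nested per-module json trees, can be illustrated with a recursive depth-first search over json-like dictionaries (the tree content below is invented for illustration; this is not the SatImNet query engine itself):

```python
def search_tree(node, key, value, path=()):
    """Depth-first search of a nested json-like tree, yielding the
    paths of all sub-trees whose `key` attribute equals `value`."""
    if isinstance(node, dict):
        if node.get(key) == value:
            yield path
        for name, child in node.items():
            yield from search_tree(child, key, value, path + (name,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from search_tree(child, key, value, path + (i,))

# Invented module tree holding the synoptic per-file information.
module = {
    "dataset": "EuroSAT",
    "files": [
        {"name": "River_1.tif", "class": "river"},
        {"name": "Forest_2.tif", "class": "forest"},
    ],
}
hits = list(search_tree(module, "class", "river"))
```

Keeping each tree shallow and narrow, as the paper recommends, bounds the cost of this index-free traversal.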

The selected BigEarthNet-v1.0 image patches have been resized from their original size of 120 × 120 to 64 × 64 by using bi-linear interpolation. On the basis of both the EuroSAT and BigEarthNet-v1.0 geo-referenced image patches, we warped and clipped the GSW 2018 yearly classification layer, producing in that way the necessary water masks. Similarly, we clipped the 10 m up-scaled ESM layer.
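The two resampling operations mentioned above, bi-linear interpolation for the image patches and nearest-neighbour resampling for the class masks, can be sketched in plain numpy (a simplified stand-in for the GDAL warping tools actually used):

```python
import numpy as np

def resize_nearest(img, out_rows, out_cols):
    """Nearest-neighbour resampling, suitable for class masks
    because it never invents intermediate class values."""
    rows, cols = img.shape
    r_idx = (np.arange(out_rows) * rows / out_rows).astype(int)
    c_idx = (np.arange(out_cols) * cols / out_cols).astype(int)
    return img[np.ix_(r_idx, c_idx)]

def resize_bilinear(img, out_rows, out_cols):
    """Bi-linear interpolation, as used to shrink the 120x120 patches."""
    rows, cols = img.shape
    r = np.linspace(0, rows - 1, out_rows)
    c = np.linspace(0, cols - 1, out_cols)
    r0 = np.floor(r).astype(int); c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, rows - 1); c1 = np.minimum(c0 + 1, cols - 1)
    fr = (r - r0)[:, None]; fc = (c - c0)[None, :]
    img = img.astype(float)
    # Blend the four surrounding pixels by their fractional offsets.
    top = img[np.ix_(r0, c0)] * (1 - fc) + img[np.ix_(r0, c1)] * fc
    bot = img[np.ix_(r1, c0)] * (1 - fc) + img[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr
```

The distinction matters: bi-linear interpolation averages neighbouring values and is appropriate for continuous reflectance data, whereas masks must be resampled with nearest-neighbour so that class labels stay discrete.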

Figures 6 and 7 display two CNN-based approaches for the classification and segmentation of Sentinel-2 image patches. The former (named CNN-dual) is a two-branch, dual-output CNN architecture that segments the image according to two classification schemas: the left-branch output is the classification result of assigning to each pixel one of the 10 EuroSAT classes (annual crop, forest, herbaceous vegetation, highway, industrial, pasture, permanent crop, residential, river and sea lake), and the right-branch output is the pixel-wise classification result with reference to the aggregated classes water, built-up and other, as instructed by the GSW and ESM layers. The input to this model is a 5 rows × 5 columns × N bands image (Fig. 2). A very brief description of the basic parametrisation: number of trainable parameters: 1,259,277; activation function: relu (softmax at the last layer); dropout rate: 0.1; initial random weight definition: He uniform variance scaling [31]; batch normalization layer [32]; loss function: categorical cross-entropy; optimizer: stochastic gradient descent with 0.01 learning rate. The same classification task could be formulated as a 12-class problem modelled by a single-branch CNN; nevertheless, experimental results showed that CNN-dual consistently provides better results. There are three layer-couplings which intertwine the intermediate outputs across the two branches and provide higher structural capacity. This neural network architecture should not be confused with the twin (Siamese) neural network topology, which uses the same weights while pairing two different inputs to compute comparable outputs. The second CNN approach comprises two independent networks. The left network (CNN-class), as it appears in Figure 7, takes a 64 rows × 64 columns × N bands image as input.
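The loss named in the parametrisation above, categorical cross-entropy on a softmax output, amounts to the following computation (a numpy sketch, not the deep learning framework implementation used in the paper):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Mean negative log-likelihood of the true (one-hot encoded) classes."""
    return float(-(one_hot * np.log(probs + eps)).sum(axis=1).mean())
```

For the dual-output network, one such loss is computed per branch (10 classes on the left, 3 on the right) and the two are combined during training.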

Figure 2: Input and output with respect to the CNN-class model (top) and both the CNN-segm and CNN-dual models (bottom). The variable N signifies the number of bands. The variables m and n denote the number of rows and columns, respectively, of the output image.

Figure 3: Image patches (64 × 64) of the EuroSAT dataset kept out for validation. The first column displays the RGB composition, the middle column shows the output of the CNN-segm, and the last column displays the segmentation result of the CNN-dual model. The CNN-class classifies the three image patches correctly as river, sea lake and residential, respectively.

Figure 4: Transfer learning on geographic locations different from those from which the training samples were selected: two exemplar areas from China and the USA in the 1st and 2nd row respectively, whereas all the training samples were selected from Europe. The columns from left to right: i) RGB, ii) CNN-class, iii) 10-class CNN-dual, and iv) 3-class CNN-dual output.
The availability and plurality of well-organized, complete and representative datasets is critical for the efficient training of machine learning models (specifically of deep neural networks) in order to solve satellite image classification tasks in a robust and operative fashion. Working under the framework of Open Science, and very closely to policy support, which calls for transparent and reproducible processing pipelines in which data and software are integrated, open and freely available through ready-to-use working environments, the contribution of this paper is aligned with three goals: i) to define the functional characteristics of a sustainable collection of training sets, aiming at covering the various specificities that delineate the landscape of satellite image classification; ii) following this definition, to structure and compile a number of heterogeneous datasets; and iii) to demonstrate a potential fusion of training sets by using deep neural network modelling and solving concurrently an image classification and segmentation problem. Future work involves systematic harvesting of training sets across the Internet, automation of the quality control of the discovered datasets, and continuous integration of the distinct modules of the working pipeline. Apart from the accumulation of datasets which have been designed and provided by the research community, SatImNet will continue to integrate EO-derived products created in-house by the JRC.

Figure 7: Two independent CNN topologies: the left one performs an image patch-based classification and the right one a pixel-wise image segmentation.

Table 1: Attributes related to the scope of the training data.

Table 3: Intrinsic image attributes.

Table 4: Maximum overall accuracy achieved by the proposed CNN modelling approaches, computed via 10-fold cross-validation.