1. Introduction
Data-driven modelling requires sufficient and representative samples that capture and convey significant information about the phenomenon under study. Especially in the case of deep and convolutional neural networks (DNN–CNN), the use of large training sets is a prerequisite to adequately estimate the high number of model weights (i.e., the strengths of the connections between neural nodes) and avoid over-fitting. The lack of sizeable and labelled training data may be addressed by data augmentation [1], deep modelling techniques such as generative adversarial networks [2], transfer learning [3], and domain adaptation [4]. Regarding transfer learning, for instance, there exist large collections of pre-trained models dealing with image classification [5,6]. However, these models have been trained on true colour images showing humans, animals, or landscapes, and it remains an open question whether transfer learning, without meticulous domain adaptation and fine-tuning, improves the general-purpose classification or segmentation of satellite images with various spectral, spatial, and temporal resolutions.
The collection of good quality training sets for supervised learning is an expensive, error-prone [7], and time-consuming procedure. It involves manual or semi-automatic label annotation, verification, and the deployment of a suitable sampling strategy such as systematic, stratified, reservoir, cluster, snowball, or time-location sampling, among many other techniques [8]. In addition, sampling and non-sampling errors and biases need to be taken into account and corrected for. In satellite image classification, factors such as the spectral, spatial, radiometric, and temporal resolution, the type of sensor (active or passive [9]), and the data processing level (radiometric and geometric calibration, geo-referencing, atmospheric correction), to name a few, constitute a manifold of concepts and features that need to be accounted for.
On the other hand, deep supervised learning as well as complex and multi-parametric modelling require big training sets. A potential solution to this issue is the collection of different training data sets and their exploitation in such a way that they complement each other and act in a synergistic manner. However, the joint use of two or more existing training data sets requires careful examination of their individual features and organisation. To ease this process, we propose a methodology to organise open and freely available training data sets designed for satellite image classification in view of fusing them with other Earth observation (EO)-based products. The proposed methodology is then applied to a series of seven free and open data sets and leads to the SatImNet collection. The process of information retrieval has been optimised over a distributed disk storage system. We also demonstrate the blending of different information layers by exploiting deep neural network modelling approaches with the objective of solving concurrently the tasks of image classification and semantic segmentation. This work ties in with the framework of the European strategy for open science [10], which strives to make research more open, global, collaborative, reproducible, and verifiable. Accordingly, all the data, models, and programming code presented herein are provided under the FAIR [11] (findable, accessible, interoperable, and reusable) conditions.
The paper is structured as follows: Section 2 discusses the significant features that make the varying source training sets interoperable for satellite image classification; Section 3 introduces the SatImNet collection, which organises existing training sets in an optimised and structural way; Section 4 demonstrates CNN models that have been trained on blended training sets and solve satisfactorily a satellite image classification and semantic segmentation task; Section 5 describes the computing platform over which the experimental process has been performed; and Section 6 underlines the contribution of the present work and outlines the way forward.
2. Major Features of an Interoperable Training Set
In the following, we define the minimal and essential attributes that should characterise a training set when examined under the prism of interoperability and completeness. For clarity, we have grouped the attributes into three categories, enumerated with their associated attributes hereafter.
1. Attributes related to the scope of the training data:
   - Classification problem: denotes the family/genre of the supervised learning problem that the specific data set can serve in a more effective way;
   - Intended application: explains the primary purpose that the specific data set serves according to the data set designers;
   - Definition of classes: signifies how the class annotations are originally provided by the data set designer;
   - Annotation: provides information about the class label annotation, whether it derives from a manual or automated procedure, or is based on expert opinion, volunteering effort, or organised crowd-sourcing; it serves as a qualitative indicator of the reliability of the provided data;
   - Verification: linked with the former attribute, it likewise refers to the belief level and reliability of the information transmitted by the data;
   - Licence: the existence of terms, agreements, and restrictions as explicitly stated by the data publishers;
   - URL: the original web link to the data.
2. Attributes related to the usage and sustainability of the training data:
   - Geographic coverage: reveals the terrain characteristics, the morphological features of the objects that shape the surface, the landscape variability, and the potential irregularities covered by the candidate data set;
   - Timestamp: the image sensing time or the time window to which the image refers; crucial information for change detection and seasonality-based applications. This piece of information is closely related to the concept of temporal resolution but cannot be used interchangeably with it;
   - Data volume: helps to determine disk storage and memory requirements;
   - Data lineage: necessary information for precise reproducibility of the data processing workflow, including pre-processing transformations such as standardisation, normalisation, clipping of values, quantisation, projection, resampling, correction (atmospheric, terrain, cirrus), etc.;
   - Name of classes: the semantic name that describes the category to which a single pixel, a group of pixels, or a rectangular window (patch) belongs;
   - Number of classes: shows the plurality and the exhaustiveness of the targets to be detected or identified;
   - Naming convention: whether the file name conveys additional information such as sensing time, class name, location, and so on;
   - Quality of documentation: a qualitative annotation assigned by the users (in this case, by us) about the existence of sufficient explanatory material;
   - Continuous development: a qualitative indicator of data sustainability, error correction, and quality improvement, based on the information provided by the data designers.
3. Intrinsic image attributes:
   - File format: indicates file compression, the file reader and encoding/decoding type, availability of meta-information (e.g., GeoTIFF, PNG, etc.), and number of channels/bands;
   - Image dimensions: a quick reference (image height and width) to estimate the batch size during the training phase, expressed as image rows × columns;
   - Number of bands: the number of channels packed into a single file or the number of separate single-band files belonging to a dedicated subfolder associated with the image name;
   - Name of bands: standard naming of the image channels, like RGB (red/green/blue), or specific naming that follows the product convention, such as the naming of Sentinel-2 products;
   - Data type per band: essential information about the range of band values that differentiates the data distributions and impacts the data normalisation/standardisation operations;
   - No data value: specifies the presence of invalid values that affect processes such as data normalisation and masking;
   - Spatial resolution: determines what types of targets can be detected and indirectly points at the physical size of the training samples; often expressed in meters;
   - Spectral resolution: the capacity of a satellite sensor to measure specific wavelengths of the electromagnetic spectrum. In a data fusion context, spectral resolution helps to compare and match image bands delivered by different sensors;
   - Temporal resolution: the amount of time that elapses before a satellite revisits a particular point on the Earth's surface. Although temporal resolution is an important attribute, no training sets currently cover this aspect in detail;
   - Type of original imagery: information with reference to the sensor type and source, the availability of masks, the existence of geo-referencing, and other auxiliary details;
   - Orientation: information referring mainly to image or target rotation/positioning; in the absence of an explicit statement, this attribute contains basic photogrammetry information such as rectification and geometric correction;
   - Metadata: extra information that accompanies an image, concerning mostly geo-referenced and time-stamped images.
Additional informative features could be the different/alternative usage of a data set in other research works, the number of citations, impact gauging, and the physical location (file path) on the local or remote storage system. In some cases, data publishers provide the release date of the data set, but this should not be confused with the aforementioned, pivotal attribute of timestamp.
For clarification purposes, we also detail the image classification applications that a training set with the above-mentioned attributes could serve. Image classification is a generic term describing the process of labelling an entire image or part of it. It differs from image segmentation, wherein the image is partitioned into two or more connected sets of pixels; in this context, no semantics are needed since the segmentation can be based on pixel value similarities/differences and connectivity rules [12]. Depending on the granularity of classification (pixel, group of pixels, block/window/patch), which is strongly impacted by the spatial resolution of the input imagery (and can equally be extended to the temporal resolution), we distinguish the following applications, each of which can be considered as semantic segmentation (a label is assigned to every segment/part of the image) at a different level of detail:
Block/window/patch-based classification: In this case, the input image is divided into several overlapping or non-overlapping units (blocks/windows/patches), and all or part of these units receive a label from the classifier. The assumption under this configuration is that a considerable number of the pixels composing the unit belong to one class, the one assigned to the unit by the classifier. Labelling a unit signifies that the specific unit contains the target which is subject to detection. Target localisation consists in drawing a rectangle that frames the image area incorporating the target.
Group-of-pixels classification: A label is assigned to spatially connected pixels forming patterns with concrete dimensions and structure that resemble pre-defined models of physical or artificial objects. This type of application is known as instance segmentation or object delineation.
Pixel-wise classification: The target is every single pixel, which can form a distinct class. The pixel labelling is performed by the classifier without considering the classification of adjacent pixels. This type of operation occurs (i) when the spatial resolution is coarser than the real dimensions of the target (for instance, the finest resolution of the Sentinel-2 radiometric bands is 10 m, permissible for building detection but not admissible for the detection of cars), or (ii) for land cover classification or the identification of big areas such as the road or train network.
The term target conveys the abstract concept of a pattern that is detectable and sometimes identifiable. It is subject to the satellite sensor capacity according to the supported spatial, spectral, temporal, and radiometric resolution. Pixel targets are considered mostly for the classification of land cover types such as forest, urban, water, crops, bare soil, etc., i.e., classes that may have irregular morphological characteristics such as shape or compactness. Groups of pixels are considered for object detection or recognition. In this context, the pixel formation visually resembles real-world objects such as trees, houses, cars, boats, etc., usually targets with concrete structure, distinct boundaries, cohesion, and other characteristics that make them separable from their surrounding area. Patches (rectangular windows) can be employed in both of the aforementioned applications. In object detection, the image patch completely encloses the target, i.e., comprises the group of pixels that form the object. In land cover classification, the rectangular window surrounds an ample number of pixels that represent a single land cover class or a mixture of land cover types.
3. SatImNet Collection
In this section, we describe the initial edition of the SatImNet (Satellite Image Net) collection, a compilation of seven open and free training sets targeting various EO applications. Then, we elaborate on the rationale behind our choices for the data structuring. Lastly, we tabulate the training sets under consideration according to the defined attributes of Section 2.
3.1. Description of the Training Sets
The initial edition of SatImNet consists of seven diverse training sets:
1. DOTA: a large-scale Dataset for Object deTection in Aerial images, used to develop and evaluate object detectors in aerial images [13];
2. xView: contains proprietary images (DigitalGlobe's WorldView-3) from complex scenes around the world, annotated using bounding boxes [14];
3. Airbus-ship: combines Airbus proprietary data with highly trained analysts to support the maritime industry and monitoring services [15];
4. Clouds-s2-taiwan: contains Sentinel-2 True Colour Images (TCI) and corresponding cloud masks, covering the area of Taiwan [16];
5. Inria Aerial Image Labeling: comprises aerial ortho-rectified colour imagery with a spatial resolution of 0.3 m and ground truth data for two semantic classes (building and no building) [17];
6. BigEarthNet-v1.0: a large-scale Sentinel-2 benchmark archive consisting of Level-2A (L2A: bottom-of-atmosphere reflectance images derived from the associated Level-1C products) Sentinel-2 image patches, annotated with the multiple land-cover classes provided by the CORINE Land Cover database of the year 2018 [18];
7. EuroSAT: consists of numerous Level-1C (L1C: top-of-atmosphere reflectances in cartographic geometry) Sentinel-2 patches provided in two editions, one with 13 spectral bands and another with the basic RGB bands; all the image patches refer to 10 classes and are used to address the challenge of land use and land cover classification [19].
With regard to satellite imagery, one of the most salient features is the spatial resolution, which substantially determines the type of application. The high-resolution imagery provided by DOTA, xView, Airbus-ship, and Inria Aerial Image Labeling is suitable for object detection, localisation, and semantic segmentation. The remaining three data sets mostly fit applications relevant to both image patch and pixel-wise classification.
As a last note, we underline the fact that although the images of the xView and Airbus-ship data sets are proprietary, they are provided for free under the licences described in Table 1.
3.2. SatImNet Data Model
This section explains the rationale behind the chosen data model. The SatImNet collection has been structured in a modular fashion, preserving the unique characteristics of each constituent data set while providing a meta-layer that acts in a similar way as an ontology does, recording links and modelling relations among concepts and entities from the different data sets. The first two abstract layers of this meta-layer consist of the following keys: (1) built-up: residential, industrial, facilities, infrastructure, construction, areas; (2) transport means: vehicle, flying, vessel; (3) object: man-made; (4) natural areas: air, land, water. The third layer is composed of the leaf nodes (terminal nodes) representing all the classes of the seven data sets. While the intuitive choice was to consolidate the names of the classes, we decided to keep the original class names in order to retain backward compatibility from SatImNet to the seven data sets. Accordingly, we preserve class names such as "residential building" although there is another leaf node "building" and a parent node "residential". Another consequence is that leaf nodes have duplicates, as is the case of "damaged building", which belongs to more than one parent node, such as the "residential" and "industrial" nodes.
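As an illustration, the following Python sketch mimics the described three-layer organisation: the parent keys follow the enumeration above, while the leaf class names are only indicative examples (the actual leaves are the classes of the seven data sets), and duplicated leaves under several parents are allowed.

```python
# Illustrative sketch of the SatImNet meta-layer: two abstract layers of keys
# plus a third layer of leaf nodes keeping the original class names.
# Leaf names below are hypothetical examples, not the full nomenclature.
meta_layer = {
    "built-up": {
        "residential": ["residential building", "damaged building"],
        "industrial": ["industrial building", "damaged building"],
    },
    "transport means": {
        "vehicle": ["small vehicle", "large vehicle"],
        "vessel": ["ship"],
    },
    "object": {"man-made": ["building", "bridge"]},
    "natural areas": {"water": ["river", "sea lake"], "land": ["forest"]},
}

def parents_of(leaf):
    """Return every (layer-1, layer-2) path leading to a given leaf class."""
    return [(top, mid) for top, subtree in meta_layer.items()
            for mid, leaves in subtree.items() if leaf in leaves]

print(parents_of("damaged building"))
# [('built-up', 'residential'), ('built-up', 'industrial')]
```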
Apart from the meta-layer, which is formed as a lattice, the other structures that compose the data model are represented by nested short tree hierarchies, which have been proven to be quite efficient in information retrieval tasks [20]. Accordingly, a json file, content_public.json, has been created for each data set; it contains mostly the intrinsic attributes of the images of the specific data set in a hierarchical, tree-based format. These attributes represent information such as the physical path of the file, its size in bytes, the file type (genre), the image acquisition time or the time the image refers to, the class label (if any), and meta-information like projection, number of bands, and so on.
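The following minimal Python sketch illustrates how such a content_public.json file could be loaded and traversed; the field names ("class", "path") are illustrative assumptions, as the actual schema may differ.

```python
# Minimal sketch of querying a data set's content_public.json tree.
# The attribute keys used here ("class", "path") are hypothetical.
import json

with open("content_public.json") as f:
    tree = json.load(f)

def find_by_label(node, label, hits=None):
    """Walk the nested tree and collect file paths matching a class label."""
    hits = [] if hits is None else hits
    if isinstance(node, dict):
        if node.get("class") == label:
            hits.append(node.get("path"))
        for child in node.values():
            find_by_label(child, label, hits)
    elif isinstance(node, list):
        for child in node:
            find_by_label(child, label, hits)
    return hits

print(find_by_label(tree, "forest")[:5])
```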
Figure 1 displays a subset of the hierarchy and shows the typical route a query follows across the central semantic meta-layer and each information module which condenses the essential information that characterises every file.
Since the entire collection or part of it is accessible over the network, we selected a database-free solution for the tree hierarchies based on json files. This portable layout grants a standalone character to the modules of the collection, independent of specialised software and transparent to the non-expert end-user. The lack of indexing, which critically impacts query speed, is tackled (whenever feasible) by keeping the depth and breadth of every single tree at moderate sizes. A consequence of this is the creation of multiple json files for each data set (e.g., 13 files in the case of BigEarthNet-v1.0). At this point, we underline the fact that the baseline system upon which we optimise all the processes is the EOS open-source distributed disk storage system developed at CERN [21], having as front-end a multi-core processing platform [22,23]. This configuration allows multi-tasking and is suitable for distributed information retrieval out of many files. One bounding condition set by the baseline system is the avoidance of generating many small-sized files, given that EOS guarantees minimal file access latencies via the operation of in-memory metadata servers which replicate the entire file structure of the distributed storage system. For this reason, the files of the training sets have been zipped into larger archives, the size of which has been optimised so as to allow acceptable information retrieval whilst sustaining efficient data transfer across the network. Reading individual files from zip archives can be achieved through various interfaces and drivers. In our case, we employ the open-source Geospatial Data Abstraction Library (GDAL) [24], which provides drivers for all the standard formats of raster and vector geospatial data. Jupyter notebooks demonstrating the execution of queries, as well as the respective information retrieval from the json files and subsequently from the zip archives, are referred to in Section 5. Although the decision to zip the files was based on the technical characteristics of the multi-petabyte EOS open storage solution, the files can of course be unzipped if more suitable in other environments. To harmonise the class annotations provided in different file formats (json, geojson, text, and csv), binary or labelled image masks were created for every single training sample. Although the aforementioned data model has been optimised for a distributed storage system, it turns out to be a general-purpose model built on the principles of simplicity, speed, extensibility, customisation, and portability.
3.3. Characterisation of the Data Sets
The three attribute categories presented in Section 2 have been used as a basis to characterise the seven training sets of SatImNet. Table 1, Table 2 and Table 3 respectively show the results of this analysis. In Table 1, the feature Conversion of class representation has been added, indicating whether the original type of class annotations has been converted into an image mask upon creation of SatImNet. The dash "-" has been used to denote either non-existent information or a non-clear definition.
4. Fusion of Heterogeneous Training Sets: Experimental Results
This section shows the added value of SatImNet for enhanced information extraction from Earth observation products. The fusion of training sets from multiple sources is challenging because the data are created with different technical specifications: source imagery, pre-processing workflow, assumptions, manual/automatic refinement, different spatial and temporal resolution, etc. Within the SatImNet context, we investigate data fusion by conducting experiments demonstrating a satellite image classification and segmentation application based on CNN models. Working within the open and free data framework, we decided to employ satellite imagery provided by the Copernicus Sentinel-2 (S2) [25] mission as input to the models, since the S2 products are delivered for free and the experiments can be reproduced by anyone. The highest spatial resolution of S2 products is 10 m and fits better with applications such as land cover classification. This condition guided us to select the appropriate training data and exclude data sets that match better object detection in very high resolution imagery. We note here that the objective of the SatImNet collection is not to provide a single big data set for any type of image classification or segmentation application, but rather to offer varying source data that are homogenised to the greatest extent possible, thereby facilitating the user in selecting and combining different data sets that fit her purpose.
Two example applications are illustrated in this section. The first one deals with land cover classification formulated as a patch-based classification problem. From the SatImNet collection, we chose two sets: (i) EuroSAT, and (ii) BigEarthNet-v1.0 (see Table 1, Table 2 and Table 3 for the technical characteristics). We underline that the EuroSAT L1C data set has been used as a backbone set (main corpus). In some of the experiments, this set has been enriched with training samples selected from the BigEarthNet-v1.0 L2A data set: (i) we added 1000 samples from the BigEarthNet-v1.0 class annual crops associated with permanent crops to the EuroSAT class annual crop, and (ii) we augmented the size of the EuroSAT class forest with 1000 samples taken randomly from the BigEarthNet-v1.0 classes broad-leaved forest, coniferous forest, and mixed forest.
The second application is a semantic segmentation problem. In addition to the two above-mentioned data sets, we chose the following two products:
1. In connection with the water class, we used the Global Surface Water (GSW) [26], a collection of global and spatio-temporal water maps created from individual full-resolution 185 km² Global Reference System II scenes (images) acquired by the Landsat 5, 7, and 8 satellites and distributed by the United States Geological Survey (USGS). Two out of the ten EuroSAT classes refer to water variants (river and sea lake), depicting areas that are partially or totally covered by water. From the BigEarthNet-v1.0, we randomly selected image patches referring to the classes coastal lagoons and sea and ocean (1000 samples from each category).
2. With regard to the EuroSAT classes industrial and residential, the image segmentation was based on the European Settlement Map [27] (ESM 2015, R2019 dataset), a spatial raster data set mapping human settlements in Europe based on Copernicus Very High Resolution (VHR) optical coverage, with 2015 as the reference year. From the BigEarthNet-v1.0, we randomly selected image patches referring to the classes continuous urban fabric and discontinuous urban fabric (500 samples from each category).
The specific data fusion approach is just one of the many combinations and associations one could follow, and should be deemed a representative scheme rather than the optimal strategy.
The selected BigEarthNet-v1.0 image patches have been resized from their original size of 120 × 120 to 64 × 64 pixels by applying a Gaussian filter to smooth the data and then using bi-linear interpolation. On the basis of both EuroSAT and BigEarthNet-v1.0 geo-referenced image patches, we warped and clipped the GSW 2018 yearly classification layer, producing in that way the necessary water masks. Similarly, we clipped the 10 m up-scaled ESM and considered all the pixels pointing at residential and non-residential built-up areas. The non-residential built-up areas refer to detected industrial buildings, big commercial stores, and facilities. All the produced masks were resampled to 10 m spatial resolution using nearest-neighbour interpolation. These masks compose an auxiliary data set that has been added to the SatImNet collection under the name BDA.
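A minimal sketch of the described down-sizing step, assuming SciPy for both the Gaussian smoothing and the bi-linear interpolation (the sigma value is an assumption):

```python
# Gaussian smoothing followed by bi-linear resampling (order=1 in scipy)
# to bring a 120 x 120 band down to 64 x 64, as described above.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

patch = np.random.rand(120, 120).astype(np.float32)   # hypothetical single band
smoothed = gaussian_filter(patch, sigma=1.0)          # sigma is an assumption
resized = zoom(smoothed, 64 / 120, order=1)           # bi-linear interpolation
print(resized.shape)                                  # (64, 64)
```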
4.1. Convolutional Neural Network Modelling
Although CNNs have experimentally proven to be more adequate for object detection [28,29] and semantic segmentation [30,31] in very high spatial resolution (≤5 m) imagery, there is lately a considerable number of works [32,33,34,35] demonstrating promising results at coarser spatial resolutions such as those of the S2 imagery. Nevertheless, the majority of these works focus on image patch classification and not on pixel-wise classification, which is a more complex problem.
The experimental configuration in this study has been chosen to explore the challenging problem of image segmentation together with image classification at the 10 m spatial resolution of the S2 products. In order to better assess the added value of the data fusion, rather than using pre-trained models, we have designed and tested two customised lightweight CNN approaches, trained from scratch for 100 epochs. The input–output schema for the CNN models is depicted in Figure 2.
The first CNN architecture (named CNN-dual) is a two-branch, dual-output CNN that segments the image according to two classification schemas: the left-branch output is the classification result of assigning to each pixel one of the 10 EuroSAT classes (annual crop, forest, herbaceous vegetation, highway, industrial, pasture, permanent crop, residential, river, and sea lake), and the right-branch output is the pixel-wise classification result with reference to the aggregation classes water, built-up, and other, as instructed by the GSW and ESM layers. The input to such a model is a 5 rows × 5 columns × N bands image (Figure 2). Table 4 summarises the basic parametrisation of the model. The same classification task could be formulated as a 12-class problem modelled by a single-branch CNN; nevertheless, experimental results showed that CNN-dual provides consistently better results. Three layer-couplings intertwine the intermediate outputs across the two branches and provide sufficiently high structural capacity. This neural network architecture should not be confused with the twin (Siamese) neural network topology that uses the same weights while pairing two different inputs to compute comparable outputs.
Figure 3 displays the CNN-dual model for the segmentation of S2 image patches.
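To make the topology concrete, the following Keras sketch outlines a two-branch, dual-output network in the spirit of CNN-dual; the filter sizes, activations, and the exact form of the three couplings are illustrative assumptions rather than the authors' exact configuration, and each 5 × 5 × N block is assumed to receive one label per branch (e.g., for its central pixel).

```python
# Sketch of a two-branch, dual-output CNN: three couplings let each branch
# see the other's intermediate output; one softmax head per classification
# schema (10 EuroSAT classes and 3 aggregated classes).
from tensorflow.keras import layers, models

N_BANDS = 4
inp = layers.Input(shape=(5, 5, N_BANDS))

l = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)  # left branch
r = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)  # right branch

# Three layer-couplings intertwining the intermediate outputs.
for filters in (64, 64, 128):
    merged_l = layers.Concatenate()([l, r])
    merged_r = layers.Concatenate()([r, l])
    l = layers.Conv2D(filters, 3, padding="same", activation="relu")(merged_l)
    r = layers.Conv2D(filters, 3, padding="same", activation="relu")(merged_r)

out10 = layers.Dense(10, activation="softmax", name="eurosat_10")(layers.Flatten()(l))
out3 = layers.Dense(3, activation="softmax", name="aggregated_3")(layers.Flatten()(r))

model = models.Model(inp, [out10, out3])
model.compile(optimizer="adam",
              loss={"eurosat_10": "categorical_crossentropy",
                    "aggregated_3": "categorical_crossentropy"})
model.summary()
```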
The second CNN approach comprises two independent networks. The left network (CNN-class), as it appears in Figure 4, takes a 64 rows × 64 columns × N bands image and performs patch-based classification, i.e., it assigns one label from the 10 EuroSAT classes to all the 64 × 64 image pixels (Figure 2). The right network (CNN-segm) has been designed for image segmentation according to the three aggregated classes (water, built-up, and other). CNN-segm is applied solely to the 64 × 64 × N image patches classified by CNN-class as water or built-up. In this case, the 64 × 64 × N array is decomposed into blocks of size 5 × 5 × N following a sliding-window approach, as sketched below. Table 4 shows the choices for the basic parameters of the two models.
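The sliding-window decomposition can be sketched as follows with NumPy; the stride of 1 is an assumption.

```python
# Decompose a 64 x 64 x N patch into 5 x 5 x N blocks with a sliding window
# (stride 1), e.g., one block per valid centre pixel for CNN-segm.
import numpy as np

N_BANDS = 4
patch = np.random.rand(64, 64, N_BANDS).astype(np.float32)

windows = np.lib.stride_tricks.sliding_window_view(patch, (5, 5, N_BANDS))
# windows.shape == (60, 60, 1, 5, 5, N_BANDS): one block per valid centre
blocks = windows.reshape(-1, 5, 5, N_BANDS)
print(blocks.shape)   # (3600, 5, 5, N_BANDS)
```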
All the above-mentioned parametrisations, as well as the proposed CNN topologies, resulted from an extensive, repetitive experimental process (threefold grid search). We note here that the purpose of this case study is not to conduct a comparative analysis of widely accepted CNN-based classification or segmentation models against the proposed ones. The presented CNN topologies are lightweight modelling approaches useful for evaluating the impact an enriched training set has on the classification performance.
Table 5 depicts the classification accuracy in terms of two metrics, the F1-score and the Kappa score, while considering the four S2 bands with 10 m spatial resolution. Regarding the CNN-dual results, there are two values for each metric, corresponding to 10 (left) and 3 (right) classes, respectively. Figure 5 displays some indicative screenshots. We also concluded that the four S2 10 m bands, blue (B02), green (B03), red (B04), and NIR (B08), give the most consistent results in most of the performed experiments. In fact, we found that there is an intense dissimilarity between the data distributions of the EuroSAT-provided bands B09 and B10 (the data set creators claim that all the S2 images have been downloaded via Amazon S3) and the respective bands of the MSI Sentinel-2 products we downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu/).
Visual inspection of the classification results over (i) additional S2 MSI images (i.e., images not included in the collected data sets), (ii) geographic areas outside of the European continent, from which the training patches have been extracted, such as China and the USA (Figure 6), and (iii) Level-2A (surface reflectance in cartographic geometry) S2 products (Figure 7), although the majority of the training samples have been derived from L1C images, is in accordance with the results obtained from the L1C tests. The visual inspection confirmed an agreement of more than 60% with respect to the segmentation results and around 80% for the patch-based classification. There is a consistent confusion among the classes residential, industrial, and highway, as well as among the classes forest, herbaceous vegetation, and pasture. The latter confusion is attributed mostly to seasonality, whereas the former happens as a result of the similar radiometric signature of the built-up structures. We also observed a known problem with the misclassification of the shadow cast by mountains or forest trees into one of the water classes. This can potentially be solved by incorporating training samples which represent this pattern. To quantitatively affirm our findings derived from visual observation, we have included an indicative confusion matrix (Figure 8) that displays the agreement between the classification result of the CNN-dual model, when applied over an S2 product covering an area in the USA (Brawley, south region of the Salton Sea), and the global land cover product FROM-GLC [40]. We employ the term agreement and not accuracy since the class nomenclature differs significantly between the two layers. To bridge the two layers, we applied the re-classification schema of Table 6. The FROM-GLC classes tundra, bareland, and snow/ice have not been considered. The comparison has been performed at the spatial resolution of FROM-GLC (30 m) by up-scaling the segmentation output through the statistical mode operator, as sketched below.
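A minimal sketch of this mode-based up-scaling, assuming a 3 × 3 block aggregation (the 30 m/10 m ratio) and NumPy only:

```python
# Up-scale a 10 m label map to 30 m with the statistical mode operator:
# each 3 x 3 block of labels collapses to its most frequent value.
import numpy as np

labels_10m = np.random.randint(0, 3, size=(60, 60))   # hypothetical label map
blocks = labels_10m.reshape(20, 3, 20, 3).swapaxes(1, 2).reshape(20, 20, 9)
labels_30m = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 2, blocks)
print(labels_30m.shape)   # (20, 20)
```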
4.2. Transfer Model
To have a more complete picture with regard to patch-based classification, we show results (Table 7 and Figure 9) of a widely used CNN model, ResNet50 [41], with the following two adaptations:
M1: We kept the main topology of the model frozen; we added one 2D convolutional layer at the beginning of the model that merges the four-band input tensor into a standard RGB tensor, and two dense layers (512 and 10 nodes, respectively) at the end of the model's architecture, resulting in 23,587,712 non-trainable parameters and 1,054,233 trainable parameters. The non-trainable parameters have been tuned based on the ImageNet [42] data set.
M2: We adjusted the ResNet50 topology in such a way as to accept a four-band input tensor and, as in M1, we added two dense layers (512 and 10 nodes, respectively) at the end of the model's architecture, resulting in a new model with 53,120 non-trainable and 24,591,946 trainable parameters. The parameters of the model take as initial values the optimal values obtained from the ImageNet training.
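A sketch of the M1 adaptation, assuming Keras; the 1 × 1 kernel of the leading convolution is an assumption, although it reproduces the reported count of 1,054,233 trainable parameters (15 for the band-merging convolution plus 1,049,088 and 5,130 for the two dense layers).

```python
# M1-style adaptation: frozen ImageNet-trained ResNet50, a leading 2D
# convolution mapping four bands to a 3-channel tensor, and two trailing
# dense layers (512 and 10 nodes).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(64, 64, 3))
base.trainable = False                        # keep the main topology frozen

inp = layers.Input(shape=(64, 64, 4))         # four S2 10 m bands
x = layers.Conv2D(3, 1, padding="same")(inp)  # 4-band -> RGB-like tensor
x = base(x)
x = layers.Dense(512, activation="relu")(x)
out = layers.Dense(10, activation="softmax")(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```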
4.3. Data Augmentation
So far, data augmentation has been achieved by blending two different training sets. In this section, we investigate whether there is a gain in terms of accuracy when employing more conventional data augmentation by means of standard image transformations. Since the image patches of size 5 × 5 are already quite small and their amount reaches very high numbers, we applied the image transformations over the 64 × 64 image patches solely. More specifically, we applied step-wise rotation with a step of 90°, and flipping in the left-right and up-down directions, resulting in 162,000 samples from the initial 27,000 of EuroSAT only, and in 192,300 samples from the initial 32,050 of the mixed EuroSAT & BigEarthNet-v1.0 data set. The results are summarised in Table 8.
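A minimal NumPy sketch of this six-fold augmentation (the original patch plus three rotations and two flips), consistent with 27,000 patches becoming 162,000 samples:

```python
# Six-fold augmentation: original, 90/180/270-degree rotations,
# left-right flip, up-down flip.
import numpy as np

def augment(patch):
    """Return the original 64 x 64 patch plus its five transformed copies."""
    return [patch,
            np.rot90(patch, 1), np.rot90(patch, 2), np.rot90(patch, 3),
            np.fliplr(patch), np.flipud(patch)]

patch = np.random.rand(64, 64, 4)
print(len(augment(patch)))   # 6, hence 27,000 -> 162,000 samples
```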
4.4. Experimental Findings
One of the objectives of the experimental study was to validate whether the presence of a bigger amount of representative training samples impacts the classification performance comparably to, or even better than, the utilisation of pre-trained and adapted sophisticated models. To verify this hypothesis, we did not use a transfer learning approach in terms of pre-trained and pre-configured NN layers; instead, we designed relatively light CNN topologies, keeping the number of model weights as low as possible while retaining the structural capacity of the model at adequate levels. The maximum tenfold cross-validation overall accuracy reaches 99.37% for CNN-class and 95.41% for the 3-class output of CNN-dual. One remark here is that the same image patches, shifted by several pixels, may produce different results, but always in line with the most dominant class. We do not report the accuracy results of CNN-segm because they are slightly worse than the results given by CNN-dual. Nevertheless, the role of CNN-segm is to segment the patches already classified by the CNN-class model in order to further refine the classification (3-class). CNN-dual has more parameters than CNN-segm and has been designed to provide concurrently a 10-class and a 3-class segmentation of the image. Both CNN-dual and CNN-segm have been trained on 15,836,588 samples and validated on 3,959,147 samples.
The summarised results of Table 5 demonstrate comparatively the gain from the mixing of the training sets in terms of accuracy and robustness. The two measures, F1 and Kappa score, improve significantly when the training is based on both data sets rather than on a single one (EuroSAT in this case).
The results shown in Figure 8 corroborate the confusion between the "green" vegetation classes that is due to the seasonality effect. The fact that several cropland and grassland pixels have been classified into the impervious class can be explained by the fact that pixels reflecting rural roads or small settlements often pull in their surrounding pixels as well, resulting in thicker objects. This phenomenon is strongly dependent on the detection capacity of the sensor. Nevertheless, the agreement between the two layers is noteworthy if we consider that the selection of the 5 × 5 image patches has not undergone a refinement in terms of class labelling. More precisely, as the original 64 × 64 image patches have been divided into 5 × 5 blocks, the class label of each 64 × 64 image patch has likewise been assigned to all its constituent 5 × 5 blocks, causing several false class assignments since there is more than one class in most of the 64 × 64 image patches. We are currently working on this front, trying to find an automatic way for the precise assignment of class labels to the smaller blocks, something that will considerably improve the classification results.
The very low classification performance shown in Table 7 with respect to model M1 indicates that this type of transfer learning is not adequate, since the low number of trainable parameters is not sufficient to help the model adapt to the new domain. In regard to model M2, the same results show that the classification performance is similar to the results of the patch-based classification (CNN-class) as presented in Table 5, especially in the case of 100 epochs. Nevertheless, the lower number of parameters of CNN-class, adjusted from scratch on the combined EuroSAT and BigEarthNet-v1.0 training set, seems a more adequate option in terms of computational complexity and adaptability than using a more complex model trained on one of the two sets.
The positive effect of data augmentation through standard image transformations is apparent, as shown by the results of Table 8. However, if we compare the pair training: BigEarthNet-v1.0 & EuroSAT and testing: EuroSAT of Table 5 with the pair training: EuroSAT and testing: BigEarthNet-v1.0 & EuroSAT of Table 8, we acknowledge the fact that data augmentation via image transformations alone, when applied on EuroSAT, does not improve the classification result as much as the data augmentation derived from the blending of the two training sets BigEarthNet-v1.0 & EuroSAT.
We do not provide comparable results of transfer learning for image segmentation because, as already mentioned, the original purpose of both the EuroSAT and BigEarthNet-v1.0 data sets was patch-based classification, and the only way to convert the problem into a semantic segmentation problem is to deal with image blocks of a size much smaller than the 64 × 64 patch size supported by the data sets. Although some of the standard pre-trained segmentation models can operate with the original 64 × 64 image blocks for the 3-class image segmentation, they cannot address the 10-class image segmentation problem due to the lack of detailed reference masks. This challenging problem has been tackled by our customised CNN model (CNN-dual), which operates at pixel level and can be trained quickly from scratch on the fused data set. This also supports our hypothesis that the existence of big enough, representative, and enriched data sets, combined with customised modelling approaches, provides added value to the classification outcome compared to transfer learning.
5. Open Access to Data and Workflows
Training deep neural networks requires hardware and software libraries to be fine-tuned for array-based intensive computations. Multi-layered networks rely heavily on matrix math operations and demand immense amounts of computing capacity (mostly floating-point). For some years now, the state of the art in such type of computing and especially for image processing is shaped by powerful machinery such as the graphics processing units (GPUs) and their optimised architectures.
In this regard, the JRC (Joint Research Centre) Big Data Analytics project, having as a major objective to provide services for large-scale processing and data analysis to the JRC scientific community and collaborative partners, is constantly increasing its fleet of GPU-based processing nodes, including NVIDIA® Tesla K80, GeForce GTX 1080 Ti, Quadro RTX 6000, and Tesla V100-PCIE cards. Dedicated Docker images with the CUDA [43] parallel model, back-end, and deep learning frameworks such as TensorFlow, Apache MXNet, and PyTorch [44], together with adequate application programming interfaces, have been configured to facilitate and streamline the prototyping and large-scale testing of working models.
We mention these technical details in order to underline the fact that although operations such as transfer learning, domain adaptation, model customization, and hyper-parameter fine-tuning are much lighter than training and optimising a deep neural network from scratch, they also require dedicated hardware and software for the exploration of various scenarios and the shortening of the experimentation process.
The entire experimental setting presented here has been executed on the JRC's high-throughput computing platform, the so-called JEODPP [23]. Jupyter notebooks and Docker images are open and accessible upon request. This decision is in conformity with the FAIR Guiding Principles for scientific data management and stewardship, and promotes open science. Complete or partial download of the SatImNet collection can be done via the ftp service (https://jeodpp.jrc.ec.europa.eu/ftp/public/MachineLearning/SatImNet). In addition, individual files are directly accessible through the support of the vsizip and vsicurl drivers, as sketched below. The open repository (https://github.com/syrriva/SatImNet) contains the Python scripts (in the form of Jupyter notebooks) to access and query the SatImNet collection via ftp.
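A minimal sketch of such direct access, chaining GDAL's /vsizip/ and /vsicurl/ drivers; the archive and internal file names below are hypothetical.

```python
# Read one raster directly from a remote SatImNet zip archive over HTTP,
# chaining GDAL's /vsizip/ and /vsicurl/ virtual file systems
# (the archive and internal file names are hypothetical).
from osgeo import gdal

url = ("/vsizip//vsicurl/https://jeodpp.jrc.ec.europa.eu/ftp/public/"
       "MachineLearning/SatImNet/EuroSAT/example.zip/patches/patch_0001.tif")
ds = gdal.Open(url)
if ds is not None:
    print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)
```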
6. Conclusions
The availability and plurality of well-organised, complete, and representative data sets is critical for the efficient training of machine learning models (specifically of deep neural networks) in order to solve satellite image classification tasks in a robust and operative fashion. Working under the framework of open science, and very close to policy support, which calls for transparent and reproducible processing pipelines in which data and software are integrated, open, and freely available through ready-to-use working environments, the contribution of this paper is aligned with three goals: (i) to define the functional characteristics of a sustainable collection of training sets, aiming to cover the various specificities that delineate the landscape of automated satellite image classification; (ii) according to the defined attributes, to structure and compile a number of heterogeneous data sets; and (iii) to demonstrate a potential fusion of training sets by using deep neural network modelling and solving concurrently an image classification and segmentation problem.
Future work involves systematic harvesting of training sets across the Internet, automation of the quality control of the discovered data sets, and continuous integration of the distinct modules of the working pipeline. Apart from the accumulation of data sets which have been designed and provided by the research community, another scheduled activity concerns the methodical building of in-house training sets, targeting wide-scope Earth observation applications such as crop and surface water monitoring, deforestation, and crisis management.