Deep Learning for Land Use and Land Cover Classification Based on Hyperspectral and Multispectral Earth Observation Data: A Review

S.C


Motivation
The advances in remote sensing technologies and the resulting significant improvements in the spatial, spectral and temporal resolution of remotely sensed data, together with the extraordinary developments in Information and Communication Technologies (ICT) in terms of data storage, transmission, integration, and management capacities, are dramatically changing the way we observe the Earth. Such developments have increased the availability of data and led to a huge unprecedented source of information that allows us to have a more comprehensive picture of the state of our planet. Such a unique and global big set of data offers entirely new opportunities for a variety of applications that come with new challenges for scientists [1].
The primary application of remote sensing data is to observe the Earth, and one of the major concerns in Earth observation is the monitoring of land cover changes. Detrimental changes in land use and land cover are the leading contributors to terrestrial biodiversity losses [2], harm to ecosystems [3], and dramatic climate changes [4]. The proximate sources of change in land cover are human activities that make use of, and hence change or maintain, the attributes of land cover [5]. Monitoring the changes in land cover is highly valuable in designing and managing better regulations

Land Use and Land Cover Classification
Land mappings of Earth are traditionally categorised into land use classification and land cover classification. Although in many studies these two concepts are used interchangeably, or, as stated in [9], are confused with each other, their proper definitions make them different. According to the Food and Agriculture Organisation (FAO) [10] of the United Nations, "Land cover is the observed (bio)physical cover on the Earth's surface", while "Land use is characterised by the arrangements, activities and inputs by people to produce, change or maintain a certain land cover type". According to these definitions, land use and land cover are tightly related, and their joint classification is almost inevitable. Therefore, in recent studies "land use and land cover" (LULC) classification as a whole is considered a more general concept that also covers this relationship.
There are different taxonomies for LULC, based on the targeted applications; one of the most famous definitions belongs to FAO and offers a hierarchical land cover classification system (LCCS), which provides the ability to accommodate different levels of information, starting with structured broad-level classes, which allow further systematic subdivision into more detailed sub-classes [11] (Figure 1). This definition assures a high level of mappability that also covers the user-defined land use descriptors.
In general, studies approaching LULC classification consider a very small number of land cover or land use categories. Depending on the target application, these categories may be at the higher level of the hierarchy, distinguishing obvious land covers, or may focus on specific land cover sub-class categories. The classification of wetlands [12,13], urban land-use [14,15], agriculture [16], forest [17], and other vegetation mappings are some examples of the application-focused LULC classification approaches that are available in the literature. The earliest use of remotely sensed data for LULC classification goes back to the mid-1940s, when Francis J. Marschner began to map the entire United States by associating land uses with the Earth surface using aerial photography [18]. Years later, just after the launch of the Earth Resources Technology Satellite equipped with a multispectral scanner (MSS) in July 1972 and the start of the Landsat program, the studies using remotely sensed imagery data to classify LULC moved to a new level [19,20]. In fact, with the birth of the Landsat program and the (private) release of data, new challenges of multi-modal data fusion, land change detection on a temporal basis, and ecological applications of the satellite data were introduced to the field of LULC. Some of the early works on these topics are discussed by [21][22][23][24].
The studies on LULC classification and its further challenges are constantly and rapidly evolving as a result of the fast improvements in the processing and storage capacity of computers and the evolution of Artificial Intelligence (AI). Moreover, any advance in remote sensing technologies, and in the quality of data, comes with new opportunities for researchers to extract new information from the remote sensing data [25]. The growing trend of publications about the LULC classification of remote sensing data is pictured in Figure 2. The trend was captured by searching for a set of key terms in the title, abstract and keywords of all documents available in Scopus, grouped and filtered by five-year intervals. The trends in Figure 2 contain four different search results: the first one (Blue) is the count of publications on LULC classification/segmentation using all types of remote sensing data. The second one (Orange) restricts publications on LULC classification/segmentation to hyperspectral data: this emphasises an increasing number of studies working on such data in the last two decades. The third one (Green) shows the use of "deep learning" techniques in LULC classification with all types of remote sensing data, which emerged in recent years (interested readers can find a review of such publications in [26]). The last one (Red) restricts the latter to the use of multispectral and hyperspectral remote sensing data, which are increasingly getting attention due to their recent availability.
Hyperspectral imaging, being tied to the advances in digital electronics and computing capabilities, was embraced later by the Earth Observation community due to its intrinsic complexity and the computational limitations of the time. However, the great potential of such data, its availability, and the fast developments in computational technologies are increasingly attracting scientists interested in LULC classification. Moreover, the extraordinary achievements of deep learning since the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [27] encouraged remote sensing scientists to employ these techniques on remote sensing data as well, starting from 2015. Reference [28] is devoted to the challenges of hyperspectral imaging technologies and reviews the state-of-the-art of deep learning methodologies used for hyperspectral image classification. Reference [29] also presents an overview of deep learning methods for hyperspectral image classification and compares the effectiveness of these methods on common well-known datasets.
In this review paper, we explore this recent growing research trend in using deep learning techniques for LULC classification of hyperspectral and multispectral images, as both data types have significant common attributes that can be studied together. The main aim is to draw up a lively document that gives a framework about how to read the state-of-the-art of deep learning in the field of LULC classification of remote sensing data, with an emphasis on hyperspectral and multispectral images. The focus of this document is to provide a platform for the readers to extract proper methods and datasets to address the existing challenges of the field.
Considering Figure 2, this paper reviews the papers highlighted in red (LULC classification of hyperspectral and multispectral remote sensing images using deep learning techniques) obtained with the search query: TITLE-ABS-KEY("deep learning" OR "convolutional neural network" AND "land cover" OR "landcover" OR "land use" OR "landuse" OR "lulc" AND "multispectral" OR "multi spectral" OR "hyperspectral" OR "hyper spectral"). We also considered the documents cited by these selected articles and the works that cite them. Going through these sources helped us sketch the general schema of the state-of-the-art presented in the following sections, focusing on the position of deep learning in the whole picture. As a side note, we stress here a clarification for the reader about the use of the term "land cover classification" in the literature, as in several works it actually refers to "land cover segmentation". In other words, the classification term refers to the pixel level, hence the final targeted result is a segmented map. In some works, the aim of classification is instead a patch-based classification, where a fixed-size patch of an image is assigned to a specific class. In this review paper, for clarity, these approaches are referred to as "pixel-level classification" and "patch-level classification", respectively. For the sake of simplicity, we adopt the term "land cover classification" for pixel-level classification, when not explicitly specified.
From a formal point of view, the LULC classification process is defined as $f : X \to Y$, with input space $X \subseteq \mathbb{N}^{W \times H \times K}$, where $W$, $H$ and $K$ are respectively the width, height and number of spectral bands of each input image. The output space is $Y \subseteq C^{W \times H}$ for pixel-level land cover classification and $Y \subseteq C$ for patch-level land cover classification, where $C = \{\Omega_0, \Omega_1, \ldots, \Omega_k\}$ is the set of possible land use and land cover categories.
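As a minimal sketch of the two output spaces defined above, the following Python snippet uses a random array in place of real imagery. The sizes W, H, K, the number of classes, and the two placeholder functions are assumptions made purely for illustration; they are not methods from the reviewed literature.

```python
import numpy as np

W, H, K = 8, 6, 4          # hypothetical width, height, number of spectral bands
NUM_CLASSES = 3            # hypothetical size of the category set C

rng = np.random.default_rng(0)
# One input image x in X: non-negative integer digital numbers, shape (W, H, K).
x = rng.integers(0, 256, size=(W, H, K))

def pixel_level_classify(image):
    """Placeholder f: X -> C^(W x H); assigns one class index per pixel."""
    # Toy rule: bucket the mean intensity over the bands into NUM_CLASSES bins.
    mean_spectrum = image.mean(axis=2)
    bins = (mean_spectrum * NUM_CLASSES / 256).astype(int)
    return np.minimum(bins, NUM_CLASSES - 1)

def patch_level_classify(image):
    """Placeholder f: X -> C; assigns a single class index to the whole patch."""
    labels = pixel_level_classify(image)
    # Majority vote over the pixel labels.
    return int(np.bincount(labels.ravel(), minlength=NUM_CLASSES).argmax())

segmented_map = pixel_level_classify(x)   # element of C^(W x H), shape (W, H)
patch_label = patch_level_classify(x)     # element of C, a single class index
```

The only point of the sketch is the shape of the outputs: pixel-level classification returns one label per pixel (a segmented map), while patch-level classification returns one label per input patch.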

Multispectral and Hyperspectral Remote Sensing Data
Remotely sensed images are usually captured by optical, thermal, or Synthetic Aperture Radar (SAR) imaging systems. The optical sensor is sensitive to the spectral range from visible to mid-infrared of the radiation emitted from the Earth's surface, and it produces Panchromatic, Multispectral or Hyperspectral images. Thermal imaging sensors, capturing the thermal radiation from the Earth's surface, are instead sensitive to the range of mid to long-wave infrared wavelengths. Unlike thermal and optical sensors, which operate passively, the SAR sensor is an active microwave instrument that illuminates the ground with microwave radiation and captures the waves reflected from the Earth's surface.
The panchromatic sensor is a monospectral channel detector that captures the radiation within a wide range of wavelengths in one channel, while multispectral and hyperspectral sensors collect the data in multiple channels. Therefore, unlike the panchromatic products that are mono-layer 2D images, hyperspectral and multispectral images share a similar 3D structure with layers of images, each representing the radiation within a spectral band. Despite the similarity in the 3D structure, the main difference between multispectral and hyperspectral images is in the number of spectral bands. Commonly, images with more than 2 and up to 13 spectral bands are called multispectral, while the images with more spectral bands are called hyperspectral. Nevertheless, the main difference is that the hyperspectral acquisition of spectrum for each image pixel is contiguous, while for multispectral it is discrete (Figure 3-Left).
Figure 3. Left: the wavelength acquisition of spectral bands for multispectral (below) and hyperspectral sampling (above) (taken from [30]). Right: a schema of multispectral and hyperspectral images in the spatial-spectral domain.
Having hundreds of narrow and contiguous spectral bands, hyperspectral images (HSI) come with specific challenges intrinsic to their nature that do not exist with multispectral (MSI) and panchromatic images. These challenges include: (1) High-dimensionality of HSI, (2) different types of noise for each band, (3) uncertainty of observed source, and (4) non-linear relations between the captured spectral information [31]. The latter is explained to result from the scatterings of surrounding objects during the acquisition process, the different atmospheric and geometric distortions, and the intra-class variability of similar objects.
Despite the mentioned differences in the nature of MSI and HSI, both share a similar 3D cubic-shape structure (Figure 3-Right) and are mostly used for similar purposes. Indeed, the idea behind LULC classification/segmentation relies on the morphological characteristics and material differences of on-ground regions and items, which are respectively retrievable from the spatial and spectral information available in both MSI and HSI. Therefore, unlike [32], which reviews methodologies designed for spectral-spatial information fusion for hyperspectral image classification only, in this review we consider both data types as used in the literature for land cover classification using deep learning techniques, focusing on the spectral and/or spatial characteristics of land cover correlated pixels.

Data Sources and Datasets
There are many satellite and airborne imagery providers that release timely and high-resolution remote sensing data to the public at no cost. USGS [33,34], NEO [35], Copernicus open access hub [36], NASA Earth data [37], NOAA [38,39] and IPUMS Terra [40] are among the most popular open access remote sensing data providers. In the literature, satellite images used for deep learning purposes are mostly obtained from Landsat-7, Landsat-8, Sentinel-1, Sentinel-2, WorldView-2, WorldView-3, QuickBird, EO-1, PROBA-1, and SPOT-6 satellites. Table 1 presents a short overview of the status of these satellites and their image products. Except for Sentinel-1, which produces SAR images, and EO-1 and PROBA-1, which produce hyperspectral images, the products of the other satellites listed in the table are multispectral images. As explained before, panchromatic band images (black and white) are captured by a single channel detector that is sensitive to a broad wavelength range, coinciding with the visible range, which collects a higher amount of solar radiation. Therefore, the spatial resolution of panchromatic images is usually higher than that of MSI. Landsat, WorldView, SPOT-6 and QuickBird capture panchromatic images together with MSI. Among the MSI providers, Sentinel-2, with the highest number of spectral bands (13 bands) and the highest orbital altitude among these satellites, is the only mission that can provide data with global coverage in five days.
Among the satellites in Table 1, the highest resolution images are obtained by WorldView-3 and WorldView-2, followed by QuickBird and SPOT-6 satellites. All these satellites are commercial; their images are therefore expensive, and only limited land coverage is available in open access. In the literature, very high resolution multispectral and hyperspectral images used for object detection, building and road extraction, or crop analysis are mainly airborne images captured by digital sensors, such as AVIRIS and ROSIS. The spatial resolution of the images of such sensors may vary depending on the altitude of the aircraft.
To exploit airborne or spaceborne images, supervised techniques are usually utilised. Such techniques infer the logic for classification based on labelled training data. However, explicitly labelling the data and collecting ground-truth for such supervised approaches is a complex and time-consuming task. Few available databases come with ground-truth. The most used datasets in the literature, already labelled, for land cover classification using deep learning techniques are graphically shown in Figure 4, and detailed in Table 2. In some of these datasets, the images are also properly cropped, corrected and archived in a way that is easy for the machine to retrieve and process.
Indian Pines [45] and University of Pavia [45] datasets are used in many papers. Both datasets contain pixel-level ground-truth, and the images are captured by airborne hyperspectral imaging sensors. The Indian Pines dataset is taken by the AVIRIS sensor, which captures 224-band hyperspectral images. The dataset targets LULC in the agriculture field. Commonly, the studies using the Indian Pines dataset remove the water absorption bands and consider only 200 spectral bands for the images. Salinas [45] data types are very similar to Indian Pines, captured by the same sensor, targeting different agriculture classes. The University of Pavia dataset is captured by the ROSIS airborne sensor: the resulting images have 103 spectral bands. The dataset is very similar to Pavia city centre [45]; just a couple of classes are different (Pavia city centre has water and tile classes, while Pavia University has gravel and painted sheet classes). The University of Pavia dataset is more popular as it has more samples for training.
GRSS 2013 [46], Kennedy Space Centre (KSC) [45], Botswana [45] and Cuprite [45] single images are other airborne pixel-level labelled imagery datasets used for land cover classification. DeepGlobe [47] (the land cover dataset) is a new pixel-level labelled dataset introduced in 2018 for the CVPR2018 challenge. It provides a huge number of pixel training samples, with high pixel resolution, but it contains only the RGB channels. The images of the DeepGlobe dataset are the result of fusing images from different commercial satellites, but there is no precise indication of which sensors were used and how the images were fused.
Training samples with pixel-level labels are used for image segmentation. Therefore, the aforementioned datasets are in general adopted to classify the map pixels and to generate a segmented map. On the other hand, there exist also some datasets for which image patches are labelled with single or multiple tags. Sat-4 [48], Sat-6 [48], UCMerced [49], and Brazilian Coffee scenes [50] datasets are among the most popular patch-level labelled datasets. In addition to the commonly used datasets, some tools provide the users with access to annotated/semi-annotated databases, which are usually collected by combining information from different resources that target particular uses, for example, crops [51,52], forests [53], or wetlands [54] monitoring.
In general, almost all available labelled MSI and HSI datasets share limitations that hinder the application of supervised machine learning techniques. Effective use of supervised machine learning techniques requires a large number of training samples that should also cover different in-class variations. Since labelling such data is quite slow, costly and labour intensive, these datasets are usually limited in the number of samples, lack variety and are too case-specific. Such limitations are mainly referred to as the limited ground-truth challenge, which will be discussed later in this paper.

Machine Learning for LULC
Conventional supervised LULC machine learning pipelines usually include four major steps: (1) pre-processing, (2) feature engineering, (3) classifier training and (4) post-processing (Figure 5-top). Each of these stages may be composed of a set of sub-tasks. A good breakdown of the whole process into its sub-tasks, with an explicit statement of their assumptions, helps to define standalone sub-problems that can be studied independently and have solutions or models that can be incorporated into an LULC pipeline to accomplish the targeted classification/segmentation. Over the last years, with the growing popularity of deep learning as a very powerful tool for solving different types of AI problems, we are witnessing a surge of research employing deep learning techniques to tackle these sub-problems.
Specific sub-tasks are defined to tackle the needs of the above four stages of a machine learning pipeline. Usually, the pre-processing stage includes the sub-tasks within which the input data are prepared for the following stages (i.e., feature engineering and classifier training). The preparation may require correcting, de-noising, synchronising, or fusing the data to come up with an enhanced version of the original input and to improve the performance of the whole process. The feature engineering phase usually refers to a set of feature extraction, selection, and transformation tasks that remove redundant information from the processed input data, reduce its dimensionality, and define a set of good representations (features) for the input, based upon which the machine can build a model to predict the target classes. The heart of the workflow is the classifier training, where the machine builds a mathematical model based on the training samples and learns the correlation between the training data features/representations and their pre-defined classes. The model, after being trained, tested and validated, is used to predict and classify new data. Finally, the post-processing phase, in pixel-level classification, is usually a set of methods applied to enhance the final segmented image by emphasising the morphological properties of classes or objects.
Figure 5. The machine learning classification frameworks. The upper one shows the common steps of the conventional approaches, and the lower one shows the modern end-to-end structure. In the end-to-end deep learning structure, the feature engineering is replaced with feature learning as a part of the classifier training phase.
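The four stages described above can be sketched on synthetic data as follows. This is only an illustrative sketch under stated assumptions: the scene is randomly generated, per-band standardisation stands in for pre-processing, PCA for feature engineering, a nearest-centroid rule is swapped in as a dependency-free substitute for classifiers such as SVM or RF, and a 3x3 majority filter stands in for post-processing. All function names are hypothetical.

```python
import numpy as np

def preprocess(cube):
    """Stage 1: standardise every spectral band to zero mean and unit variance."""
    flat = cube.reshape(-1, cube.shape[2]).astype(float)
    return (flat - flat.mean(axis=0)) / (flat.std(axis=0) + 1e-12)   # (W*H, K)

def engineer_features(flat, n_components):
    """Stage 2: project the centred spectra onto the leading principal axes."""
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return flat @ vt[:n_components].T                                # (W*H, n_components)

def train_classifier(features, labels):
    """Stage 3: nearest-centroid classifier (a simple stand-in for SVM or RF)."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, features):
    classes, centroids = model
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def postprocess(label_map, n_classes):
    """Stage 4: 3x3 majority filter that smooths isolated pixels in the map."""
    padded = np.pad(label_map, 1, mode="edge")
    w, h = label_map.shape
    votes = np.zeros((n_classes, w, h), dtype=int)
    for dx in range(3):
        for dy in range(3):
            window = padded[dx:dx + w, dy:dy + h]
            for c in range(n_classes):
                votes[c] += (window == c)
    return votes.argmax(axis=0)

# Synthetic scene (assumed data): left half is class 0, right half is class 1.
rng = np.random.default_rng(0)
W, H, K = 20, 20, 10
truth = np.zeros((W, H), dtype=int)
truth[:, H // 2:] = 1
cube = rng.normal(loc=truth[..., None] * 3.0, scale=1.0, size=(W, H, K))

flat = preprocess(cube)
features = engineer_features(flat, n_components=3)
train_idx = rng.choice(W * H, size=80, replace=False)    # small labelled subset
model = train_classifier(features[train_idx], truth.ravel()[train_idx])
raw_map = predict(model, features).reshape(W, H)
final_map = postprocess(raw_map, n_classes=2)
accuracy = (final_map == truth).mean()
```

Because each stage is a separate function with an explicit input and output, any one of them can be replaced (for example by a deep learning model) without touching the others, which is the property of the conventional workflow that the text emphasises.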
With the increased computational capacity of the new generation of processors, over the last decade the end-to-end deep learning approach has received a lot of attention from scientists. The end-to-end learning pipeline, which takes the source data as the input-end and the classified map as the output-end, is a modern way of re-designing the process workflow that takes advantage of deep learning techniques in solving complex problems. Within the end-to-end deep learning structure, the feature engineering is replaced by feature learning as a part of the classifier training phase (Figure 5-bottom). In this case, instead of defining the inner steps of the feature engineering phase, the end-to-end architecture generalises the model generation, involving feature learning as part of it. Such improved capacity of deep learning has promoted its application in many research works where well-known, off-the-shelf, end-to-end models are directly applied to new data, such as remote sensing data. However, there are some open problems, complexities, and efficiency issues in the end-to-end use of deep learning in LULC classification that encourage us to adopt a new approach for investigating the state-of-the-art in deep learning for LULC classification.
In the next sections, we have collected the state-of-the-art in using deep learning techniques for the LULC classification of HSI and MSI, considering their use in an end-to-end approach or in one of the phases of the traditional approach, including the training of the land-cover classifier, the ground-truth generation, data fusion, data pre-processing and output post-processing stages. In particular:
• In Sections 4.1 and 4.2 we explain the feature learning property of an end-to-end approach and its limitations that lead us to consider the conventional machine learning model including feature engineering steps. Then we explain the concept of feature engineering, its components, and the common methodologies, as well as deep learning techniques employed in the literature to accomplish them. We also discuss the importance of defining the feature space and its direct impact on shaping the process pipeline.
• In Section 4.3 we explore the choices of MSI and HSI classifiers for LULC classification and discuss the effectiveness of deep learning techniques for this task. We also explain different types of deep learning approaches in classifying MSI and HSI used in the state-of-the-art.
• Focusing on the well-known challenge of limited ground-truth, in Section 4.4 we explain how it impacts the performance of deep learning models for HSI and MSI. Then, we report the research works facing this challenge.
• In Section 4.5 we discuss the challenge of data fusion as faced by many state-of-the-art studies. We explain the main concerns in data fusion and how deep learning is facilitating their accomplishment.
• Finally, in Section 4.6 we discuss other potential pre-processing and post-processing techniques in the literature that can improve the LULC classification performance.

End-To-End Deep Learning
As explained before, the increased computational capacity has popularised the end-to-end deep learning approaches, wherein instead of engineering the features, the features are automatically learnt by the classifier (Figure 5-bottom). In other words, in such approaches, the gradient-based learning is applied to the system as a whole. The end-to-end use of deep learning models has been very popular within the remote sensing community over the last years. The majority of the works compare the performance of such architectures with classical techniques, such as Support Vector Machine (SVM) and Random Forest (RF) classifiers [16]. However, the use of deep learning as an end-to-end approach comes with some complexities and inefficiencies in the processing time.
One insight is based on Wolpert's "No Free Lunch" (NFL) theorem [55] (the theorem was later developed in collaboration with Macready [56]), which states that "any two optimisation algorithms are equivalent when their performance is averaged across all possible problems" [56]. This implies that there is no single supervised learning algorithm, out of a set of uniformly distributed possible functions, that performs the best for all kinds of problems. This theorem refutes the idea of a generalised single machine learning algorithm for all types of problems and data, and underlines the need to check all assumptions and whether they are satisfied in our particular problem. In practice, however, deep learning models have shown a great capacity to generalise well, a property that remains theoretically unclear and is still being questioned [57][58][59][60].
A second open issue is that, to automatically generate a hierarchy of abstractions, the deep learning models require a massive amount of training samples annotated with the targeted classes. In the case of end-to-end approaches for land cover classification of HSI and MSI, the massive amount of training samples should cover the output-end's class distributions well. However, as stated in the previous section, due to difficulties in the collection of LULC ground-truth, these approaches suffer from a limited number of training samples.
Even if we could find an effective solution to increase the size of training datasets, for example, via unsupervised or semi-supervised learning, the issue of processing efficiency remains. As discussed in Section 1, the complexities in the nature of remote sensing data, such as multi-modality, resolution, high-dimensionality, redundancy and noise in data make it even more complex and challenging to model an end-to-end workflow for the LULC classification of MSI and HSI. The more complex the model architecture becomes, the more difficult the learning problem gets. In other words, increasing the complexity of deep learning architectures leads to more difficult optimisation problems and dramatically decreases the computational efficiency.
Therefore, despite the substantial attempts at applying end-to-end deep learning to LULC classification problems, the challenges of such a structure open up the floor for alternative approaches and make the former four-stage machine learning pipeline structure a debatable candidate. Indeed, defining the process according to a conventional workflow format makes it easier to shape, customise, and adapt the system to meet the targeted needs and, at the same time, it reduces the model optimisation complexity and the computational time of the learning process. By breaking down the assumptions, needs, and targets into a set of sub-tasks, the empirical process of choosing an effective algorithm for each sub-task becomes easier and more diagnosable. Indeed, we can employ deep learning techniques more effectively and transparently to accomplish single sub-tasks of a classical machine learning pipeline with smaller problems to solve. All the solutions and trained models for each sub-task can then be employed in parallel streams or in sequential order at different steps of the conventional workflow. For instance, the authors in [61] propose a model that treats the feature selection problem as a feature reconstruction problem using a deep belief network and compare its time efficiency with that of a deep CNN end-to-end model. Similarly, to deal with the ground-truth scarcity problem, the work in [62] proposes the use of deep learning in a semi-supervised generative framework that can deal with feature extraction from a small number of samples.

Feature Engineering
Feature engineering is one of the steps in the conventional LULC machine learning pipeline, before the classifier training, that deals with the definition of features (or representations) that better fit the classifier requirements. "Features are the observations or characteristics on which a model is built, and the process of deriving a new abstract feature based on the given data is broadly referred to as feature engineering" [63]. Feature engineering aims to reduce the size of input data and to transform it into a set of representations that carry only its relevant meaningful information. Building a model on large raw datasets, with a large number of attributes per data possibly with some redundancies, is computationally expensive and inefficient. Therefore, transforming the raw data into a manageable set of meaningful representations is very critical to build a model effectively. Commonly, different forms of feature engineering are referred to as feature selection, feature transformation, and feature extraction.
Feature selection and feature transformation are usually referred to as dimensionality reduction techniques. In particular, the aim of feature selection is to remove the irrelevant or redundant information of the data, possibly without altering the rest of the information in data. On the other hand, feature transformation maps the input into an alternative space, to make the process easier. Selecting and transforming features may be a manual task dealt with based on expert prior knowledge or can be automated employing machine learning techniques.
Feature extraction is mainly used to reduce the number of data features, by creating a new set of features out of the existing ones. In classical machine learning approaches, the feature extraction task, also called hand-crafting features, calculates the set of new representations using predefined algorithms. Thanks to deep learning, feature extraction can be also conducted automatically, without dealing with the complexity of designing and formulating proper algorithms.
HSI and MSI are cubic types of data [64] that contain two spatial dimensions (the width and height of channels) and one spectral dimension (number of channels). The spatial domain contains the morphological information, and the spectral one serves to distinguish the material that corresponds to a pixel on the ground. Indexing the data in time order adds another dimension to the data space, and it comes with time-series challenges. Transferring such a complex 3 or 4-dimensional space into a feature space with relevant information is very critical. The dimension of the feature space is defined based on the interrelation among the spatial, spectral, and temporal aspects of data. In some works, all these aspects are considered independent, while in others they are considered partially or fully dependent. The prior assumption on the dependency or independence of such features plays a crucial role in the design of the machine learning pipeline, the choice of the classifier, and the feature engineering steps.
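The cubic structure described above can be made concrete with a small array example. The sizes and the random values below are assumptions chosen only to show how the spatial, spectral and temporal axes are indexed, and how the independence or dependence assumption changes the shape of a training sample.

```python
import numpy as np

W, H, K, T = 5, 4, 6, 3    # hypothetical width, height, bands, acquisition dates
rng = np.random.default_rng(1)

cube = rng.random((W, H, K))            # single-date image: two spatial axes, one spectral axis
series = rng.random((T, W, H, K))       # time-indexed stack adds a fourth (temporal) axis

pixel_spectrum = cube[2, 1, :]          # spectral signature of one ground pixel, shape (K,)
band_image = cube[:, :, 0]              # spatial (morphological) content of one band, shape (W, H)
patch = cube[1:4, 0:3, :]               # 3x3 spatial neighbourhood with all bands, shape (3, 3, K)
pixel_trajectory = series[:, 2, 1, :]   # spectrum of the same pixel over time, shape (T, K)

# Treating spectral and spatial information as independent yields per-pixel vectors,
# while treating them as dependent yields joint spatial-spectral patches.
spectral_only_samples = cube.reshape(W * H, K)   # one K-dimensional sample per pixel
spatial_spectral_sample = patch.reshape(-1)      # one (3*3*K)-dimensional sample per patch
```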
Feature engineering is very challenging for HSI data. There are three problems to be considered: (1) the high number of spectral bands leads to the problem of high-dimensionality, the so-called curse of dimensionality [65]: with limited training samples, much of the hyperspectral data space is empty, i.e., it contains no observations upon which a model can be built; (2) the correlation between the spectral bands is not necessarily linear; and (3) the similarity between some spectral bands denotes a high spectral inter-band redundancy, such that removing some spectral bands does not cause a significant loss of information. Therefore, to extract proper representations from the spectral domain of HSI data, feature transformation and feature selection should be considered together with feature extraction. In this way, it is possible to reduce the dimension and to remove redundancy, which helps translate data into manageable and learnable representations. The work in [66] discusses in depth the differences in techniques and tools to implement the feature extraction of HSI.
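A rough back-of-the-envelope calculation illustrates why the data space is mostly empty. The sketch below assumes that each band is split into only two bins and that 10,000 labelled pixels are available; both numbers are assumptions for illustration, while the band counts 13 and 200 are taken from the text above.

```python
def max_occupied_fraction(n_samples, n_bands, bins_per_band=2):
    """Upper bound on the fraction of grid cells that can contain at least one sample."""
    n_cells = bins_per_band ** n_bands      # exact integer arithmetic in Python
    return min(n_samples, n_cells) / n_cells

# Compare a 3-band image, a 13-band MSI and a 200-band HSI for the same sample budget.
for bands in (3, 13, 200):
    print(bands, max_occupied_fraction(10_000, bands))
```

Even with this very coarse two-bin grid, the samples could fill every cell for 3 or 13 bands, whereas for 200 bands they can occupy only a vanishing fraction of the space, which is the situation that dimensionality reduction is meant to address.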
Feature engineering of data is of high value when the amount of training samples does not satisfy the end-to-end learning requirement, or when the learning of representations through the end-to-end approach is not computationally efficient. Nevertheless, the feature engineering stage can still benefit from deep learning and other machine learning techniques to find good representations for the data. In the next subsections, we explain feature selection, transformation and extraction for HSI and MSI data, and we discuss the common machine learning techniques, including deep learning techniques, used to tackle these tasks.

Feature Selection and Transformation
Feature selection and feature transformation of data are mostly referred to as dimensionality reduction techniques. Feature selection aims at removing redundancy by selecting the relevant attributes of the data, while feature transformation maps the data into another, simpler space with possibly smaller dimension. Selecting and transforming the data into a set of relevant, manageable representations that are compatible with the classifier requirements can significantly improve the performance of the model, reduce the possibility of overfitting, and cut down the training time; an excessive reduction of information, however, can have the opposite effect. Therefore, transformation and selection of features are quite challenging and sensitive.
The most common dimensionality reduction algorithm used for HSI data is the Principal Component Analysis (PCA) [67][68][69]. PCA projects the data into a new space whose axes are orthogonal and along which the projected data are linearly uncorrelated; the axes are ranked by variance, so that the first principal axis is the one along which the data are most spread [70]. Therefore, PCA transforms the data into a simpler space for analysis, tackling feature transformation to reduce the feature dimension upon which the model is built.
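As a minimal sketch of the idea, the snippet below finds the first principal axis of a handful of pixel spectra with power iteration on the band covariance matrix and projects the pixels onto it. The pixel values, the function names, and the choice of power iteration (rather than a full eigendecomposition) are illustrative assumptions, not the procedure of any cited work.

```python
import math

def first_principal_component(pixels, iterations=200):
    """Return (band means, unit vector of largest variance) for a list of spectra."""
    n, bands = len(pixels), len(pixels[0])
    mean = [sum(p[b] for p in pixels) / n for b in range(bands)]
    centred = [[p[b] - mean[b] for b in range(bands)] for p in pixels]
    # Sample covariance matrix between spectral bands.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(bands)]
           for i in range(bands)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(bands)] * bands
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(bands)) for i in range(bands)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(pixels, mean, axis):
    """Score of each centred pixel along the given axis."""
    return [sum((p[b] - mean[b]) * axis[b] for b in range(len(axis))) for p in pixels]

# Hypothetical pixels with three strongly correlated bands (made-up values).
pixels = [[1.0, 2.0, 3.1], [2.0, 4.1, 6.0], [3.0, 6.0, 9.2], [4.0, 7.9, 12.0]]
mean, axis = first_principal_component(pixels)
scores = project(pixels, mean, axis)  # one value per pixel instead of three
```

Because the three bands are almost perfectly correlated, a single score per pixel retains nearly all of the variance, which is the kind of interband redundancy that PCA exploits.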
Feature selection, also referred to as data cleaning or data filtering, removes redundancies in data and keeps the most relevant attributes to create a set of features for building a model. It reduces the chance of overfitting and the time of training, and eventually improves the accuracy of the classification. In almost all stochastic learning techniques, the importance of the features is calculated automatically through the classifier training phase. Feature importance ranking shows the importance of input data attributes, so it makes clear which attributes of the input data are potentially removable. For complex HSI and MSI data, with a huge number of attributes, the case for feature selection cuts both ways. With classical machine learning classifiers, such as SVM, feature selection is crucial, as defining the hyper-parameters for massive input types is too complex and impractical [71,72]. However, with the modern classifiers designed to avoid the overfitting problem, the necessity to reduce information from input data is questionable [73,74].
The use of deep learning in the feature engineering phase is mainly referred to as feature extraction. A feature engineering deep learning model learns how to optimally transform the input space into a smaller coded space that includes all its important information. Usually, the important information is referred to as the coded features, which are sufficient to reconstruct the input. In the following subsection, the feature extraction methodologies based on deep learning techniques are discussed.

Feature Extraction
Feature extraction defines a new set of representations, or abstractions, for data based on all existing attributes in it, to make the training process easier for the machine. A good set of representations contains all relevant information that fits the classification requirement. Such representations can be hand-crafted, using algorithms that calculate a new set of features. For instance, the well-known NDVI (Normalised Difference Vegetation Index) is a simple example of a hand-crafted feature: it combines the NIR (Near Infrared) and red bands of the image, and is very informative for detecting vegetation on land cover. Refs. [13,75] use NDVI masks and other indices to guide the Convolutional Neural Network (CNN)-based model (the technique is explained in Section 4.3) in detecting vegetation, water and other elements which are highlighted by these masks. Hand-crafted features can also be obtained by using image processing techniques, such as edge detection, smoothing, or segmentation.
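To make the NDVI example concrete, the following sketch computes the index per pixel as (NIR - Red) / (NIR + Red) and derives a simple vegetation mask. The reflectance values and the 0.3 threshold are hypothetical choices for illustration only; they are not taken from the cited works.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red) for two equally sized 2D bands."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0 for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]

def vegetation_mask(index, threshold=0.3):
    """Boolean mask of pixels whose NDVI exceeds an (assumed) threshold."""
    return [[value > threshold for value in row] for row in index]

# Made-up reflectances for a 2 x 2 patch.
nir = [[0.50, 0.05], [0.40, 0.30]]
red = [[0.10, 0.10], [0.08, 0.30]]
index = ndvi(nir, red)
mask = vegetation_mask(index)
```

A mask of this kind can be stacked with the original bands as an additional input channel, which is one way such indices may guide a downstream model.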
Unsupervised, semi-supervised, and supervised machine learning techniques can also extract relevant features for the classifier. The best known unsupervised machine learning techniques to extract features automatically are the deep learning Autoencoder (AE) techniques (the technique is explained in Section 4.3). Over the last few years, AE algorithms have become very handy and popular to extract the optimised abstraction of HSI and MSI data for the classifier [76][77][78]. Although such unsupervised algorithms can find the data representations without any hint or label, ref. [79] underline the advantages of using supervised algorithms, pointing out that not only the global mutual information but also the in-class discriminative projections have to be explored in HSI data. Supervised algorithms using labelled samples can learn metrics that keep data points within a class together and separate them from the other classes [80]. Since the preparation of labelled data for supervised techniques is quite labour-intensive, conventional supervised algorithms can be extended to the semi-supervised variants [81].

Classifier
Despite the growing popularity of deep neural networks, the classic supervised classifiers are still popular within the remote sensing community. RF and SVM are the most common classic classifiers used in the literature for the land cover classification of remote sensing data. Like the other non-parametric supervised classifiers, these algorithms do not make any assumption regarding the distribution of data, and they have shown promising results in classifying remote sensing data, outperforming classifiers adopted earlier in the field such as Linear Regression (LG), Maximum Likelihood (MLC), K Nearest Neighbor (KNN) and Classification and Regression Tree (CART).
RF is an ensemble classifier made of a set of tree-structured predictors (CARTs) such that each tree depends on a random set of training observations that are sampled independently with replacement [86] and, at each splitting node of the trees, a subset of features is randomly selected to grow the tree [87]. RF is popular for classifying remote sensing data due to its simplicity and its ability to produce robust models. It has been broadly used to classify land cover [88][89][90], and in many other applications as reported in [91]. However, like the majority of supervised classifiers, RF requires a sufficiently large set of reference data to learn the class distributions, which is often a critical problem.
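The two sources of randomness described above can be sketched as follows: a bootstrap sample of training rows for each tree, and a random subset of candidate features at each split. The helper names are hypothetical, and taking the square root of the number of features as the subset size is a common default that is assumed here, not something prescribed by the cited works.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Row indices for one tree, drawn independently with replacement."""
    return rng.choices(range(n_samples), k=n_samples)

def candidate_features(n_features, rng):
    """Random subset of feature indices to evaluate at one split node."""
    k = max(1, math.isqrt(n_features))  # assumed default: sqrt(number of features)
    return rng.sample(range(n_features), k)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows_for_tree = bootstrap_indices(10, rng)        # some rows repeat, some are left out
features_for_split = candidate_features(100, rng)  # e.g., 10 of 100 spectral bands
```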
SVM is another popular classifier for remote sensing data that works well with a relatively small amount of training samples. The algorithm aims at finding an optimal separating hyperplane that separates the observations into target classes so that the boundaries among the classes minimise the misclassification rates [92]. The regularisation parameter in SVM plays a critical role in its performance; with well-tuned regularisations, SVMs tend to be resistant to overfitting and do not have any inherent problem when the number of observations is less than the number of attributes [93,94]. Relying on such characteristics, SVM has been very popular for land cover classification of MSI and HSI [95][96][97].
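The role of the regularisation parameter can be illustrated with the standard soft-margin objective of a linear SVM, 0.5·||w||² + C·Σ max(0, 1 − y(w·x + b)). The sketch below only evaluates this objective for given weights; it does not train anything, and the samples, weights and values of C are made up for illustration.

```python
def svm_objective(w, b, samples, labels, C=1.0):
    """Soft-margin linear SVM objective: margin term plus C-weighted hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - y * score)  # zero once a sample is beyond the margin
    return margin_term + C * hinge

samples = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1]
w, b = [1.0, 0.0], 0.0
loose = svm_objective(w, b, samples, labels, C=1.0)    # margin violations are cheap
strict = svm_objective(w, b, samples, labels, C=10.0)  # the same violation costs more
```

Only the second sample lies inside the margin, so it is the only one contributing to the hinge term; a larger C makes that violation weigh more against the margin term.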
However, when it comes to complex problems such as classification of HSI images, deep learning approaches, with their capability to learn from hierarchies of features, outperform the other classifiers. Deep learning models are composed of multiple layers such that each layer computes a new data representation from the representation in the previous layers of artificial neurons, creating a hierarchy of data abstractions [98]. CNNs are a group of deep learning techniques composed of convolution and pooling layers, usually concluded by a fully connected neural network layer and a proper activation function. The exception is models that directly reconstruct an output image prediction, such as U-Net and generative models (explained later on), in which the fully connected network and activation function are not needed. CNNs, being very successful in classifying complex contextual images, have been widely used to classify remote sensing data too.
CNNs are feedforward neural networks (artificial neural networks wherein no cycle is formed by the connections between its nodes/neurons) that are designed to process data types composed of multiple arrays (e.g., images, which have layers of 2D arrays of pixels) [98]. Each CNN, as shown in Figure 6, contains multiple stages of convolution and pooling, creating a hierarchy of dependent feature maps. The example in the figure shows convolutional neural networks with two layers of convolution and two layers of pooling, for (a) patch-level classification, (b) pixel-level classification and (c) an image reconstructive model. In (a) and (b) a fully connected network is fed with the flattened feature maps of the last pooling layer. In (b) the central part shown in red is the pixel to which the class is assigned. In (c), the model does not include any fully connected network and activation function; instead, the right half of the model directly reconstructs an output image prediction.
At each layer of convolution, the feature maps are computed as the weighted sum of the previous layer of feature patches, using a filter with a stack of fixed-size kernels; the result is then passed through a non-linearity, using an activation function (e.g., ReLU). In such a way, the filters detect local correlations (fitted in the kernel size), while keeping invariance to the location within the input data array. The pooling layer is used to reduce the dimension of the resulting feature map by calculating the maximum or the average of neighbouring units, to create invariance to scaling, small shifts, and distortions. Eventually, the stages of convolution and pooling layers are concluded by a fully connected neural network and an activation function, which are in charge of the classification task within the network.
The process of training a CNN model, using a set of training samples, finds optimised values for the model's learnable parameters by reducing the cost calculated via a loss function (e.g., Mean Squared Error, Cross Entropy, or Hinge loss). In CNNs, the learnable parameters are the weights associated with both the convolution layer filters and the connections between the neurons in the fully connected neural network. Therefore, the aim of the optimiser (e.g., Stochastic Gradient Descent, RMSprop, or Adam) is not only to train the classifier, but also to learn data features by optimising the parameters of the convolution layers.
The size and dimension of filters for each convolution layer are the so-called model hyper-parameters. Although choosing the kernel size for the filters is usually an inductive process, the dimension of filters can be directly derived from prior knowledge of the input space (e.g., time-series, one channel image, multi-channel image, or time-series of multi-channel images) and from the expectations on the type of features to be extracted (e.g., spatial, spectral, spatial-spectral, or spatial-spectral-temporal features). The CNNs used in the literature for classifying remote sensing data can be categorised into three sub-types: CNN with one-dimensional filters (1D-CNN), CNN with two-dimensional filters (2D-CNN), and CNN with three-dimensional filters (3D-CNN), shown in Figure 7. These sub-types differ in their convolution layers. These networks may be used jointly in parallel streams to extract different independent features.
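One convolution stage followed by ReLU and max pooling can be sketched in a few lines. This is a minimal single-channel illustration with a hand-picked kernel and a toy image, both assumptions made for the example; as in most deep learning frameworks, the "convolution" is computed as a sliding weighted sum without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Sliding weighted sum of the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Element-wise non-linearity: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool2x2(feature_map):
    """Keep the maximum of each non-overlapping 2 x 2 neighbourhood."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]

# Toy 5 x 5 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edge_kernel = [[-1.0, 1.0], [-1.0, 1.0]]  # responds to dark-to-bright transitions

feature_map = relu(conv2d_valid(image, edge_kernel))  # 4 x 4 map
pooled = max_pool2x2(feature_map)                     # 2 x 2 map
```

The kernel responds only where the edge lies, and pooling keeps that response while halving each spatial dimension. In a trained CNN the kernel weights are learned rather than hand-picked.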
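The interplay between loss function and optimiser can be illustrated with a deliberately tiny case: three class scores (logits) treated directly as the learnable parameters, a cross-entropy loss, and one plain gradient descent step. Treating the logits as parameters and the learning rate of 0.5 are simplifying assumptions for the sketch; in a real CNN the gradient is propagated further back to the filter weights.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])

def gradient_step(params, grads, learning_rate):
    """One plain gradient descent update."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

logits = [0.0, 0.0, 0.0]  # the toy "learnable parameters"
target = 1                # index of the true class

probs = softmax(logits)
loss_before = cross_entropy(probs, target)
# Gradient of the cross-entropy w.r.t. the logits: probabilities minus one-hot target.
grads = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]
logits = gradient_step(logits, grads, learning_rate=0.5)
loss_after = cross_entropy(softmax(logits), target)
```

After the update the score of the true class has increased relative to the others, so the loss is lower than before the step.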
One-dimensional CNNs, mostly used for time series modelling, have also been used to extract the spectral features of pixels in HSI data [99][100][101]. This technique is sometimes called spectral curve classification [101]. Stacking the spectral layers of an HSI corresponding to three seasons on top of each other, ref. [102] apply 1D-CNN to also distinguish the seasonal change feature in the spectral-temporal domain. Reference [103] propose a hybrid model of 1D-CNN and RNN that learns the spectral dependencies automatically. In particular, it is composed of layers of convolution and pooling (to extract locally-invariant features), followed by recurrent layers (to retrieve the spectrally-contextual information from the previously extracted features), and concluded by a fully connected neural network and an activation function. Reference [104] use 1D-CNN in a generative adversarial network (1D-GAN) to generate fake spectral data, and also as a discriminator to classify the spectral features.
Two-dimensional CNN is the common type of CNN, used to classify images where there is a correlation between the morphological details and the target classes. In remote sensing, 2D-CNN is typically adopted to extract the spatial features of the HSI and MSI, considering the continuity of land covers in the spatial domain [105]. Well-known pre-defined CNN models, developed for image understanding, are sometimes used in the literature to classify land covers, including LeNet5 [106], AlexNet [107], VGGNet [108], CaffeNet [109], GoogLeNet [110] and ResNet [111]. In [8,112], the authors have compared the mentioned models in the context of land cover classification of HSI. In general, as the relation among spectral bands of HSI is not linear, 2D-CNNs are usually used jointly with 1D-CNNs to cover the spectral-spatial domain of features of HSI data [99]. In such cases the models extract spatial and spectral features separately in parallel; the extracted features are then normally put together and fed to a fully connected classifier followed by an activation function. However, since combining the extracted features in such a structure is an additional empirical process, fine-tuning the model gets even more complex [113].
Three-dimensional CNN is an alternative approach that can reduce this complexity by simply leaving the features as tensors in a 3D space, considering also potential correlations between the spatial and spectral aspects of data. Three-dimensional CNN is mostly used for multi-frame image classification in which the temporal dimension is added to the domain (spatio-temporal classification). In the case of remote sensing, 3D CNNs are used to extract spectral-spatial [114,115] and spatial-temporal [113] features. In such classifications, the features are assumed to be tensors in 3D domains, and each layer of convolution-pooling affects the size of the feature volume in depth, width and length. The authors of [73], focusing on the full utilisation of spectral and spatial information in input HSI data, propose an end-to-end model that contains four sequential residual blocks with 3D CNNs to extract the spectral and spatial features, respectively. Through a training-validation cycle in the proposed model and changing the CNN parameters, the features of the HSI data get learned. The authors of [116] introduce the attention network structure for hyperspectral image classification that includes 3D-CNN based spatial, spectral and attention modules; the latter is designed to extract the discriminative features from attention areas of HSI cubes.
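How the filter dimension determines which axes of the data a convolution acts on can be seen from the output shapes alone. The sketch below applies the usual no-padding rule, (n - k) // stride + 1 per axis, to a hypothetical HSI patch of 103 bands and 9 x 9 pixels; the patch size and kernel sizes are assumptions chosen for illustration.

```python
def conv_output_shape(input_shape, kernel_shape, stride=1):
    """Output size per axis for a convolution without padding."""
    if len(input_shape) != len(kernel_shape):
        raise ValueError("kernel must have one size per convolved axis")
    return tuple((n - k) // stride + 1 for n, k in zip(input_shape, kernel_shape))

bands, height, width = 103, 9, 9  # hypothetical HSI patch

# 1D filter: slides along the spectral curve of a single pixel.
spectral = conv_output_shape((bands,), (7,))
# 2D filter: slides over the spatial neighbourhood of a band (or stack of bands).
spatial = conv_output_shape((height, width), (3, 3))
# 3D filter: slides jointly over the spectral and spatial axes.
spectral_spatial = conv_output_shape((bands, height, width), (7, 3, 3))
```

Only the 3D filter shrinks all three axes at once, which reflects the fact that it mixes neighbouring bands and neighbouring pixels in a single operation.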
Normally, convolution and pooling layers apply linear operations involving the multiplication of a set of weights with the input to generate the input representations. However, good representations are generally highly non-linear functions of the input data, as stated by [117], and modelling such complexity with the conventional convolution feature mapping strategy requires very deep stacks of convolution and pooling layers, which are prone to overfitting and computational inefficiency. To solve these problems, the authors of [117] introduce the concept of Network in Network (NiN) structures, or Inception networks, which replace the linear convolution filters with "micro-networks" in order to deal with non-linear approximations. Inception networks use 1 × 1 filters that reduce the complexity of 3D-CNNs by decreasing the computational cost and the number of output features. Reference [118] employ this idea to realise the interaction of spectral information and the integration of specific bands in MSI data. The GoogLeNet model, with nine inception modules, has also been widely used in the literature for classification of MSI and HSI [119][120][121][122].
One of the main concerns of deep learning is the overfitting problem. Residual blocks, introduced by ResNet [111], have been proven to be a good replacement for the conventional convolution and pooling blocks to avoid this overfitting problem. The residual blocks (Figure 8) are networks composed of convolution and pooling layers with skip connections. The skip (or identity) connection provides the training process with the possibility to simply skip layers of convolution and pooling, if not needed. In some models the residual blocks are used in a customised network [73,74,123], and in many others the well-known ResNet models are directly employed to perform the land cover classification of MSI and HSI [120][121][122][124,125].
To output a segmented map, some works suggest the use of U-Net, which was initially introduced by [126] for biomedical image segmentation. U-Net (Figure 9) is composed of three steps: (1) contraction with convolutional layers and max pooling, (2) bottleneck with a couple of convolutional layers and a drop-out, and (3) expansion with some deconvolutional (transposed convolution) layers for up-sampling, convolutional layers, and feature map concatenations. The architecture of U-Net, as also pictured in Figure 9, looks like a 'U', from which the name is derived. The contraction path behaves as an encoder, trying to find the latent representations or the coded values for the input. The expansion part behaves as a decoder, recovering the information. Since the positional information gets lost within the contraction path, skip connections are used to pass a copy of the corresponding encoded feature map from the contraction path, so that information can be precisely recovered at every step of the expansion. These copies of encoded feature maps are concatenated with the result of deconvolutions to force the model to learn more precise outputs. In the context of remote sensing, U-Net has shown very promising results in extracting buildings [127,128], roads [129,130], and clouds [131,132], and in classifying other land covers [133][134][135] using high resolution MSI data.
As a final note, the process of learning the substantial number of parameters of convolutions and deconvolutions within complex architectures comes with an important problem: choosing a viable optimiser with efficient computational complexity, together with a corresponding cost function against which it can be evaluated. The authors of [136] provide a review discussing the optimisation methods vs. loss functions in detail and explain the potential issues and their computational complexities.
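The effect of a 1 × 1 filter can be sketched as a per-pixel linear mix of the input channels: the spatial size is unchanged, while the number of channels is set by the number of filters. The channel values and weights below are made up; a micro-network in the NiN sense would additionally apply a non-linearity (e.g., ReLU) after such a mix, which is omitted here for brevity.

```python
def conv1x1(channels, weights):
    """Apply 1 x 1 filters: each output channel is a weighted sum of the input
    channels, computed independently at every pixel position."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [[sum(w * channels[c][i][j] for c, w in enumerate(filter_weights))
          for j in range(width)]
         for i in range(height)]
        for filter_weights in weights
    ]

# Three input channels of a 2 x 2 patch (made-up values).
channels = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[100.0, 200.0], [300.0, 400.0]],
]
# Two 1 x 1 filters, so three channels are reduced to two.
weights = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]
mixed = conv1x1(channels, weights)
```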
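The skip connection can be written as y = F(x) + x, where F stands for the stacked layers inside the block. The sketch below uses plain lists and placeholder functions for F, which are assumptions made to keep the example self-contained; it shows that when F contributes nothing, the block reduces to the identity.

```python
def residual_block(x, transform):
    """Add the block input back onto the output of its inner layers."""
    fx = transform(x)  # stands in for the convolution/pooling layers of the block
    return [f + xi for f, xi in zip(fx, x)]

def inactive_layers(v):
    """Inner layers whose learned weights are all zero."""
    return [0.0 for _ in v]

def doubling_layers(v):
    """Inner layers that happen to double their input."""
    return [2.0 * vi for vi in v]

x = [1.0, 2.0, 3.0]
skipped = residual_block(x, inactive_layers)  # the input passes through unchanged
refined = residual_block(x, doubling_layers)  # the input plus a learned correction
```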

The Challenge of Limited Ground-Truth
As explained before, for deep learning to outperform other approaches, a large quantity of training data with ground-truth is required. That is why the classical machine learning techniques, such as SVM, sometimes show better or comparable performance in LULC classification of MSI and HSI. As an example, the authors of [137] evaluate the performance of Sparse Auto-Encoder (SAE) and SVM in classifying popular datasets, concluding that in the common situation of a limited number of samples, SVM, with fewer parameters to be learned, not only performs better than SAE but also requires a more reasonable computational time.
To deal with the aforementioned problem, ref. [138] propose a data augmentation approach which adopts image transformations (e.g., flip, translation, and rotation) to generate additional and more diversified data samples from the original data, which improves the performance of its CNN model (Figure 10). An alternative approach consists of using semi-supervised learning methods that utilise unlabelled data. One way is to use self-labelling techniques by using a pre-trained labelling classifier [139], and another recent way is to use Generative Adversarial Networks (GAN) including generative models together with discriminative evaluation methods [62] (shown in Figure 11).
Transfer learning is another approach proposed to deal with the challenge of limited ground-truth. The transfer learning methodology employs a pre-trained classifier to extract an initial set of representations for a new dataset (Figure 12). According to [140], with transfer learning, the model can expect a higher start, higher slope and higher asymptotic performance during the training process. References [141,142] use a classifier pre-trained on the ImageNet dataset to transfer knowledge into a land cover classification problem. Another example is the methodology proposed by [143], which pre-trains a classifier on the datasets from VOC and PASCAL challenges, which is then used to extract initial representations of GoogleMap images for remote sensing object detection. Reference [144] propose a model that is based on the idea of combining transfer learning and semi-supervised methods, which can deal with the challenge of limited ground-truth. In this methodology, a model pre-trained on a labelled multi-modal dataset (MSI-HSI or SAR-HSI) is used to label a single-modal dataset (only MSI or only SAR).
Another approach to tackle the lack of labelled data is unsupervised learning. For instance, without any labelled data, the work in [145], being inspired by [146], proposes an unsupervised deep learning method for HSI segmentation that initially exploits 3D convolutional Auto-Encoders (AE) (Figure 13) to learn embedded features, and uses the learnt representations in a clustering layer to segment an input image. The AE is composed of two stages: the encoding path and the decoding path. The encoding path uses convolutional layers together with pooling layers to transfer the input data into a latent representation space, or coded values. The decoder part evaluates how good the encoded representations are for recovering data, using up-sampling and convolutional layers. Autoencoders aim to extract meaningful information from the data in an unsupervised way. Although this methodology could dramatically facilitate the ground-truth generation process and could be useful for high-level applications such as anomaly detection, training these models is computationally expensive.
Figure 13. An example of a 3D auto-encoder with a couple of convolution layers followed by pooling layers at the encoder and a couple of up-sampling layers followed by convolutional layers at the decoder part, which learns the representations from an unlabelled set of data. In such an unsupervised learning strategy, the learning process takes place to encode the data into a set of representations, and the decoder evaluates whether the representations are good enough to reconstruct the original data using the same convolutions.
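Returning to the data augmentation approach mentioned earlier in this section, the image transformations involved (flips and rotations) can be sketched on a single-band patch as follows. The function names and the toy patch are hypothetical, and the sketch does not reproduce the specific pipeline of [138].

```python
def flip_horizontal(patch):
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]

def flip_vertical(patch):
    """Mirror the patch top to bottom."""
    return [row[:] for row in patch[::-1]]

def rotate90(patch):
    """Rotate the patch by 90 degrees clockwise."""
    return [list(column) for column in zip(*patch[::-1])]

def augment(patch):
    """Return the original patch plus five transformed variants."""
    r90 = rotate90(patch)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [patch, flip_horizontal(patch), flip_vertical(patch), r90, r180, r270]

patch = [[1, 2], [3, 4]]
variants = augment(patch)
```

For pixel-level labels, the same transformation would have to be applied to the label mask so that image and ground-truth stay aligned.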
Labelled datasets are not only limited in number but are also very limited in terms of variety. In other words, the majority of available HSI and MSI labelled datasets are not sufficient to train a generalised model, as they are specific to time and location. This causes the common issue where the classifier trained using one dataset usually does not perform as well over other datasets. Indeed, the seasonal land cover changes, lighting effects, and intra-class variability in different regions are factors that are not considered in the majority of datasets with ground-truth. Moreover, each dataset has a limited number of classes that are mostly specific to the context, location and its original application target, which makes it difficult to mix them and generate a bigger comprehensive dataset.
Generating labelled datasets requires manual intervention. Yet, in comparison to the pixel-level labelled datasets, the patch-level datasets are relatively easier to prepare, as labelling is less sensitive to fine details. EuroSAT, introduced by [122] (multi-labelled patches), and the SAT-4 and SAT-6 datasets by [43] (single-labelled patches) are some examples of patch-level labelled datasets released to the community. On the other hand, pixel-level labelling is still a challenge, and it is usually done by field experts. Crowd-sourcing approaches can greatly facilitate the generation of ground-truth maps; how to engage citizens in micro-tasks through gamification and competitions is studied by [147,148]. The potential challenges and required assessments in using such approaches are also discussed in [149].
In addition to the aforementioned limitations, we have to take into account that almost all the available datasets have a fixed spatial resolution. Sensor specifications, as well as the choice of airborne versus spaceborne platforms, directly impact the resolution of the image. The spatial resolution of data may be insufficient or misleading for the classifier depending on the targeted classes. For instance, conventional models normally do not consider visual saliency [150], i.e., the selective perceptual quality of the human visual and cognitive system that allows some items to immediately stand out among others within a scene, when extracting features from high-resolution images. One common solution is multi-scale learning: the authors of [151] propose a multi-scale CNN framework in which a pyramid of differently scaled versions of the high-resolution image sample is fed to the machine to capture the different conceptual information. On the other hand, low-resolution images lack enough details to be extracted. Usually, to deal with such a problem, data from other sources may be injected into the model pipelines to assist the machine in capturing the relevant features. In other words, one of the possible tasks that can be carried out by fusing different types of data (multi-modal data fusion) is to improve the resolution of the images. This aspect is explained in more detail in the following subsections.
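A pyramid of differently scaled versions of an image can be built, for example, by repeatedly averaging 2 x 2 blocks. This is a minimal sketch under that assumption; the exact scaling scheme used in [151] is not specified here, and the toy image is made up.

```python
def downsample2x(image):
    """Halve each spatial dimension by averaging non-overlapping 2 x 2 blocks."""
    return [[(image[i][j] + image[i][j + 1] +
              image[i + 1][j] + image[i + 1][j + 1]) / 4.0
             for j in range(0, len(image[0]) - 1, 2)]
            for i in range(0, len(image) - 1, 2)]

def pyramid(image, levels):
    """Return the image at `levels` scales, finest first."""
    scales = [image]
    for _ in range(levels - 1):
        scales.append(downsample2x(scales[-1]))
    return scales

image = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [5.0, 5.0, 7.0, 7.0],
    [5.0, 5.0, 7.0, 7.0],
]
scales = pyramid(image, levels=3)  # 4 x 4, then 2 x 2, then 1 x 1
```

Each level could then be passed to its own convolutional stream, so that fine details and broader context are both available to the classifier.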

Multi-Modal Data Fusion
Data fusion is the process of combining data from multiple sources to improve the potential values and interpretation performance of the source data, and to produce a high-quality visible representation of the data [152]. In remote sensing, data fusion is commonly used to improve the spatial and spectral resolution of data. Although data fusion has a long history in the remote sensing community, the advent of machine learning and in particular deep learning techniques has dramatically changed the way the data are fused.
The initial step for any geo-data fusion is geo-coordinates matching. Then, having the paired data from the same scene, data fusion may take place in one of the following three stages: (1) at the data preparation stage, (2) at the feature engineering stage, or (3) at the decision stage (all shown in Figure 14).
Data fusion at the data preparation stage (also called Early Fusion) (Figure 14a) is usually referred to as super-resolution transformation. In this process, the aim is to increase the resolution of a targeted dataset by using another, sometimes temporary, source of data. A very traditional form of super-resolution transformation is pan-sharpening where the panchromatic data is employed to increase the resolution of MSI or HSI data. Different studies show that deep learning techniques outperform conventional pan-sharpening approaches by automatic extraction of features that indicate the correlations between the two data types [153][154][155][156].
Super-resolution generation using deep learning is obtained by a model in which two versions of an image (high resolution as the target and low resolution as the input) are used to learn how to reconstruct a higher resolution image out of a low resolution one [157][158][159]. These types of models are becoming popular for increasing the resolution of remote sensing data too [160,161]. In addition to the spatial resolution, the authors of [162] apply the same idea using 3D-CNNs on HSI to also provide higher spectral quality. With the launch of Sentinel 1 and Sentinel 2, many questions were raised regarding the fusion of SAR and MSI to increase the resolution of data in terms of filling the gaps caused by the atmospheric conditions. For instance, on a cloudy day, optical sensors cannot capture the ground surface. To approach this problem, the authors of [163] propose a deep learning-based methodology using Sentinel 1 and Sentinel 2 time-series to estimate high-resolution NDVI time-series for monitoring agricultural changes. Another approach, targeting the inner multi-modality of MSI data, is by [164], wherein the proposed model super-resolves lower resolution spectral bands of Sentinel-2 data, using its higher resolution spectral bands.
Data fusion at the feature engineering stage (also called Feature/Representation Fusion) is a very common and efficient way of using multi-modal data. As also pictured in Figure 14b, instead of generating a new version of the input, the data sources from a scene are processed in parallel for the feature extraction. Then the extracted features of each pipeline are put together and fed to the classifier. Deep learning has been a breakthrough in this process. Using parallel convolutional streams, References [165,166] fuse LiDAR and MSI/HSI data at the feature engineering stage for the crop and land cover classification. Similarly, ref. [167] fuse SAR and MSI data to collect more ground details for classifications. Another interesting work belongs to [168], where the authors use OSM (Open Street Map) maps for semantic labelling of Earth Observation images. Panchromatic and MSI data are also commonly fused in many studies [169].
Another stage in which data fusion may take place is at the decision level (also called Late Fusion). As shown in Figure 14c, parallel streams leading to predictions are considered for each source of data, and the final decision is made based on all the streams' predictions. Generally, the decision level data fusion stands out when the input data types and formats are not auto-correlated. Indeed, when the input data are heterogeneous, multi-modal, and multi-source at the same time, it is difficult to extract correlatable information at the earlier stages. Due to the large number of neurons in the architectures with data fusion at the decision stage, the required time to train, and hence to test the model, is significantly higher than for the other data fusion architectures. Therefore, in the case of correlatable data types, as discussed in [167], the decision level data fusion is not the best approach. This is also stated in the work in [170], which compares the late fusion results with those of early fusion over aerial multispectral and LiDAR data and evaluates them with the same inference. On the other hand, a situation with more heterogeneous input types has been explored by [171]: they adopt multi-modal (MSI and SAR), multi-source (aerial images, Sentinel 2 and Sentinel 1), and multi-temporal (for Sentinel 1 and 2) data types to rapidly extract flooded buildings.
Sometimes, data fusion targets the temporal resolution as well. The work in [172], published in 2018, is a review of the state-of-the-art of spatio-temporal multi-modal data fusion studies. In that review, no study is reported that tackles this problem using deep learning techniques. However, the authors predict a potential opening through the emergence of deep learning in the field.
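The difference between fusing at the feature stage and at the decision stage can be sketched with plain lists. Feature-level fusion concatenates the feature vectors of the streams before a single classifier, whereas decision-level fusion combines the class probabilities predicted by each stream. The weighted average used below is one simple combination rule, assumed here for illustration, and all values are made up.

```python
def feature_fusion(features_a, features_b):
    """Feature-level fusion: one joint feature vector for a single classifier."""
    return list(features_a) + list(features_b)

def late_fusion(stream_probs, weights=None):
    """Decision-level fusion: weighted average of per-stream class probabilities.
    Returns the fused probabilities and the index of the winning class."""
    n_streams = len(stream_probs)
    weights = weights or [1.0 / n_streams] * n_streams
    n_classes = len(stream_probs[0])
    fused = [sum(w * probs[c] for w, probs in zip(weights, stream_probs))
             for c in range(n_classes)]
    return fused, max(range(n_classes), key=fused.__getitem__)

# Hypothetical features from two streams (e.g., an optical and a SAR stream).
joint = feature_fusion([0.2, 0.9], [0.4, 0.1, 0.7])

# Hypothetical class probabilities predicted by the two streams.
optical_probs = [0.7, 0.2, 0.1]
sar_probs = [0.2, 0.5, 0.3]
fused_equal, label_equal = late_fusion([optical_probs, sar_probs])
fused_sar, label_sar = late_fusion([optical_probs, sar_probs], weights=[0.2, 0.8])
```

With equal weights the optical stream's preferred class wins; trusting the SAR stream more flips the decision, which shows how the combination rule itself becomes a design choice in late fusion.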

Pre- and Post-Processing
The aim of pre-processing is to enhance the raw input data for the analysis. Within this stage of the machine learning pipeline, many methodologies, including deep learning techniques, can be employed to generate an improved dataset from the raw data. As discussed earlier, data fusion can also happen during the pre-processing stage to generate super-resolution data. Furthermore, when models pre-trained on different datasets exist, transfer learning may also be considered within the data pre-processing stage. For example, the authors of [173] use transfer learning to overcome the problem of noise in hyperspectral images from a newly launched Chinese satellite. The major pre-processing tasks addressed by the models in the literature are denoising, cloud detection, and resolution assessment. Besides data resolution assessment, deep learning has also been successful in HSI denoising [174,175] and in detecting clouds [176,177] for both MSI and HSI data.
Post-processing is an optional stage used to fine-tune the classifier output, usually by employing image processing techniques. Based on prior knowledge about the expected output or about the potential classifier errors and noise, post-processing applies a set of adjustments to the output to enhance the model performance. In the context of remote sensing, the post-processing stage is very useful to vectorise or create the shapefiles of on-Earth man-made objects (e.g., buildings) [170], for which the morphological characteristics of the expected output are known. Conditional random fields (CRFs) are the main technique used to this end [178], and they have been successfully applied in many studies jointly with deep learning models targeting semantically segmented maps [75,179,180].
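To make the idea of post-processing concrete, the sketch below cleans a toy per-pixel label map with a 3x3 majority filter. This is deliberately not a CRF, which the text identifies as the main technique; it is a much simpler image processing operation chosen only to show how prior knowledge (here, that isolated pixels are likely to be errors) can be applied to a classifier output. The function name `majority_filter` and the toy map are hypothetical.

```python
import numpy as np

def majority_filter(label_map, n_classes):
    """Replace each pixel label by the most frequent label in its 3x3 window.

    label_map: (H, W) integer array with labels in [0, n_classes).
    Pixels outside the image are padded with -1 and therefore cast no vote.
    Ties are resolved in favour of the lowest class index.
    """
    h, w = label_map.shape
    padded = np.pad(label_map, 1, mode="constant", constant_values=-1)
    votes = np.zeros((n_classes, h, w), dtype=int)
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            for c in range(n_classes):
                votes[c] += (window == c)
    return votes.argmax(axis=0)

# Toy classifier output: background (0) and building (1).
noisy = np.zeros((7, 7), dtype=int)
noisy[2:5, 1:4] = 1   # a 3x3 building
noisy[3, 2] = 0       # a misclassified hole inside the building
noisy[1, 5] = 1       # an isolated false positive

cleaned = majority_filter(noisy, n_classes=2)
```

In this toy case the hole at position (3, 2) is filled and the isolated pixel at (1, 5) is removed. A filter this simple also rounds off the corners of the building, and it does not model the contextual dependencies that a CRF captures, so it should be read as an illustration of the stage rather than a recommended method.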

Conclusions
Currently, the majority of the attempts to apply deep learning techniques to remote sensing data are made by non-machine learning experts. In this review, we addressed the critical challenges in employing such techniques and underlined the need for a deeper understanding of machine learning as a complex problem. Focusing on the land use and land cover classification of multispectral and hyperspectral images, we provided a review of the state-of-the-art by converging a wide range of different approaches reported in the literature into a generic machine learning framework, which encompasses different aspects of the whole problem. We discussed how deep learning techniques have been utilised in different stages of the framework to target different tasks and challenges, standing out among the other approaches.
There is a growing interest in employing deep learning techniques for a wide spectrum of remote sensing applications, which encourages industries to invest in this field. Accordingly, rapid developments in the foundational knowledge and an increase in the number of open opportunities are expected. Going through the state-of-the-art, there seem to be promising areas in which the implementation of deep learning has high potential:

• For the majority of the commercially viable applications, the spatial resolution of remote sensing images is required to be higher than what any satellite can provide. Therefore, aerial remote sensing images are more popular due to their higher spatial resolution. Yet, the limited coverage and low temporal resolution of such aerial images pose challenges for many applications, which leaves room for the use of satellite images as well. Therefore, the trade-off between temporal and spatial resolution lays the ground for further discussion on this matter.

• Ground-truth scarcity is still a challenge. An accurately annotated dataset could open the doors to new opportunities for researchers. Most of the available solutions suffer from a lack of funding and from the difficulty of assessing their accuracy. Indeed, the use of IoT and of the open science framework that supports the integration of citizen science, gamification, incentives and competitions is still to be explored.

• Despite the constant increase in the number of geospatial data providers, for many years there has been no standardised way to release and to get hold of the data. Commonly, processing and analysis of data are carried out on local machines, on a locally replicated instance of the data. With the fast growth of data in volume and the limitations in memory, relying on conventional infrastructures no longer appears to be feasible or efficient. Recently, data providers have introduced cloud platforms to access and analyse data directly, which offers the possibility of integrating data from different sources in the near future. Certainly, getting aligned with the advances in infrastructure opens up new opportunities to be investigated.

• The recent idea of on-board data processing could introduce new challenges: as announced by NASA and ESA, future satellites are planned to carry more powerful processors that can process data before transferring them to the Earth. However, power scaling and energy management are a crucial problem for on-board processing. Therefore, reducing the complexity of the models is an important matter to be considered in future works. The recent study by [181], which proposes the Firefly Harmony Search (FHS) tuning algorithm for its Deep Belief Network model, also shows that simplifying the models can improve the accuracy of classifications.
Lastly, deep learning has an enormous capacity to act as an indispensable tool to tackle some of the most serious and urgent environmental concerns of our time. There is a sense of urgency to channel the direction of research activities to address such matters, and there is substantial scope for further development in this area. Moreover, the effective use of deep learning calls for new case studies and requires determined efforts to tackle the technical challenges that come with remote sensing data and the problem of resource-constrained machines. Memory management, data preparation, and data loading are among the technical challenges that call for further endeavours in applications of deep learning. Furthermore, deep learning studies in the field of remote sensing lack an established framework that can categorise and optimally group the models. Future efforts may consider setting up such a framework, which would lay the ground for a rational and proper assessment of the effectiveness of the models.