Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review-Part I: Evolution and Recent Trends

Abstract: Deep learning (DL) has a great influence on large parts of science and has increasingly established itself as an adaptive method for new challenges in the field of Earth observation (EO). Nevertheless, the entry barriers for EO researchers are high due to the dense and rapidly developing field, which is mainly driven by advances in computer vision (CV). To lower the barriers for researchers in EO, this review gives an overview of the evolution of DL with a focus on image segmentation and object detection with convolutional neural networks (CNN). The survey starts in 2012, when a CNN set new standards in image recognition, and extends until late 2019. We highlight the connections between the most important CNN architectures and cornerstones coming from CV in order to facilitate the evaluation of modern DL models. Furthermore, we briefly outline the evolution of the most popular DL frameworks and provide a summary of datasets in EO. By discussing well performing DL architectures on these datasets as well as reflecting on advances made in CV and their impact on future research in EO, we narrow the gap between the reviewed, theoretical concepts from CV and practical application in EO.


Introduction
In recent years, deep learning (DL) has received a lot of attention, in both scientific research and practical application [1,2]. Two main factors are responsible for this growing attention: the accessibility of data and the increase in computational processing power, especially with graphics processing units [3][4][5]. Due to these developments, researchers were able to demonstrate working concepts for DL which could even outperform established approaches. Their rapidly improving insights were quickly applied in other disciplines and in practice. This created a self-reinforcing research environment which today has a significant impact on science and practice.
Increasing data accessibility can also be found in the field of Earth observation. The availability of high-resolution optical and multispectral imagery is particularly important. Due to the recent trend of opening archives of Earth observation data, it can be expected that the amount of high-resolution remote sensing data will increase dramatically in the near future. High-resolution optical data have already paved the way for transferring DL concepts from computer vision to Earth observation applications such as detecting or segmenting vehicles, roads and buildings from overhead images. Building on these proofs of concept of DL for Earth observation research, the applications today are wide and no longer limited to RGB images. The number of DL implementations in Earth observation is still growing, showing new trends and possibilities of analysing remotely-sensed data [6][7][8][9].
The importance of DL today reaches far across the scientific community. In 2019, Nature led the Google Scholar h5-index ranking, and its top three papers all covered DL and can therefore be seen as part of the most relevant papers in 2019 overall. In the same list of h5-index journals of 2019, the IEEE Conference on Computer Vision and Pattern Recognition, well known for its contributions to the research on DL, reached the top ten for the first time [10]. Furthermore, interest from data driven companies such as Google and Facebook has been responsible for driving its recent popularity [1]. Their contributions are both theoretical and practical. For instance, Google is the leading affiliation for papers submitted between 2014 and 2018 to one of the world's most important conferences in the field, the Neural Information Processing Systems Conference [11]. At the same time, Google and Facebook are the main developers of the two most popular DL frameworks, TensorFlow [12] and PyTorch [13], respectively (see Section 4).
One proxy for the recently fast growing interest in research on DL are the publications submitted to arXiv, a distribution service of open access articles frequently used in computer science. Figure 1 shows the annual absolute number of publications concerning deep learning and also its share of all publications submitted to arXiv in the computer science (cs) and statistics (stat) categories in the same year. Both numbers are growing, which means that there is not just an absolute growth in DL research but also its share of research in computer science and statistics is getting bigger each year. What makes DL so successful is its capacity to represent more abstract concepts [1,2,14] such as speech or images. DL models outperformed classical machine learning models and signal processing approaches [14], for instance in speech recognition [15][16][17] and image recognition for handwritten digits [18][19][20]. Finally, in 2012, Ciresan et al. [20] and Krizhevsky et al. [3] introduced convolutional neural networks (CNNs), the most representational DL models [2], for image recognition of natural images. The model of Krizhevsky et al. [3] called AlexNet, a CNN which extracts features from RGB images to predict a single label, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [3]. ILSVRC is an annual competition, held since 2010 for computer vision tasks such as image recognition and object detection. The breakthrough of AlexNet in performance during the ILSVRC in 2012 can today be seen as the birth of recent developments in DL [21] and is used as starting point in this review.
From traditional computer science, where research is done on DL, the technique spread into other disciplines which do research with DL. The paper of Krizhevsky et al. [3] is widely cited in both fields, those doing research on as well as with DL. Therefore, the number of citations can be interpreted as a proxy for how much attention DL and especially CNNs received over the last years, inside as well as outside the field of computer science. Figure 1 shows the increasing number of citations, thus demonstrating the relevance of not only AlexNet but also DL with CNNs for recent research.
Since imagery data are fundamental for Earth observation research, the application of DL seems likely. Nevertheless, looking at the combined publications of leading remote sensing journals and conference papers in Figure 1, one can see that DL reached the Earth observation community with an offset of three to four years, taking 2012 as the starting point. However, since then, the number of publications concerning DL and remote sensing has more than doubled each year. This trend continued in 2020, when in the first quarter the number of publications was nearly half the number for the entire year of 2019. Current reviews discuss the wide range of applications of DL on remotely-sensed data for super resolution, data fusion, denoising, weather forecasting, scene recognition, classification and object detection with optical, multispectral, hyperspectral and SAR sensors in different resolutions [6][7][8][9][22][23][24].
However, due to the vast amount of insights in DL over the last eight years, this review aims to provide a more detailed overview about the progress primarily made in computer vision from 2012 to late 2019 by focussing on object detection and image segmentation with CNNs. Therefore, the review takes an evolutionary perspective to outline the major milestones and their interrelations. By running a thread through important DL publications, we specifically address Earth observation researchers who want to add DL to their toolbox, or want to reflect on the evolution of DL approaches to choose a matching model design for their own research questions. Providing this thorough introduction, we contribute to the open question number eight, "How to best handle high entry barriers to DL?" (p. 32) as stated by Ball et al. (2017) [8]. Overall, this review contributes to a better understanding of the principles of CNNs. This foundation will further be used in Part II: Applications to discuss applications of CNNs in Earth observation research by reviewing leading Earth observation journals.

Terminology and Basic Concepts of Deep Learning with CNNs
In supervised machine learning (ML), algorithms try to learn features from labelled training data to predict an accurate output from an unknown input. DL models are specific ML models, made of stacked layers, which enable those models to consecutively extract richer features from the input data. As more layers are stacked, the model becomes deeper, and more complex features can be learned, hence the name deep learning. ML itself is part of the wide field of artificial intelligence (AI), thus DL is not the same as AI but part of it. Before different DL models are introduced, reference is made to Table 1. It explains the abbreviations used intensively and commonly in DL literature, which are also used in the following sections in order to accustom an audience interested in DL to its terminology. Furthermore, the table is ordered thematically and chronologically to provide a thread through DL literature concerning image recognition, image segmentation and object detection with CNNs.
The classical DL model, the artificial neural network (ANN) pictured in Figure 2c, can briefly be described as a sequence of fully connected layers, from input over hidden to output layers of artificial neurons. All neurons within one layer are connected to each neuron of the previous layer via linear operations also called weights or parameters, hence the term fully connected. The connections transport the values from all neurons of the previous layer by multiplying them with the weights assigned to each connection and passing them to the neurons of the next layer. Here, each neuron sums up the incoming values and applies a subsequent non-linear function, the activation. Such a network then transports values from the input layer via the weights over all hidden layers to the output layer and can be seen as a non-linear function of higher order, mapping input values to outputs [5,25].

Table 1. List of abbreviations, explanations and the original literature or further discussion. The sorting of the table reflects the structure of Sections 2 and 3. At the same time, it is a guiding thread through the milestone publications on deep learning architectures of convolutional neural networks for image recognition, segmentation and object detection from 2012 to late 2019.
The main characteristics of a DL model can be summarised as stacked layers of non-linear functions, trained to extract features from input data to predict an output based on those automatically found instead of hand crafted features. The model type and its architecture are influencing the handling of the underlying structure of the data and the feature richness found by it. As a model gets deeper, more complex, abstract and distinctive features are able to be found. However, this behaviour is not infinite and overfitting tends to occur when simply adding parameters to the model by making it deeper.
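The forward pass of a fully connected ANN as described above can be illustrated with a minimal sketch. The function names, the tiny two-layer network and its weight values are hypothetical and chosen only for illustration; bias terms and training are left out.

```python
def relu(values):
    """Non-linear activation: negative sums are set to zero."""
    return [max(0.0, v) for v in values]


def identity(values):
    """No activation, used here for the output layer."""
    return list(values)


def dense(inputs, weights):
    """Fully connected layer: weights[j][i] links input neuron i to neuron j.
    Each neuron sums the incoming values multiplied by their weights."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def forward(inputs, layers):
    """Pass the values from the input layer through all stacked layers."""
    values = inputs
    for weights, activation in layers:
        values = activation(dense(values, weights))
    return values


# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], relu),
    ([[1.0, 2.0]], identity),
]
output = forward([2.0, 1.0], layers)
print(output)
```

Stacking more `(weights, activation)` pairs in `layers` makes the sketched model deeper in the sense used in this section.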
An ANN as described above belongs to the DL models called stacked autoencoders. Other model types are Recurrent Neural Networks (RNNs) [34], or more specifically Long Short-Term Memory (LSTM) [35], which are used for processing sequential data; Generative Adversarial Networks (GANs) [36], the primary idea of which was to generate data; and finally Convolutional Neural Networks (CNNs).
CNNs are popular for processing 2D array input data such as images [2], which we focus on in this review.

Figure 2.
Overview and details of a convolutional neural network (CNN) architecture for image recognition. (a) Zoom in on a three-channel RGB input, convolution + activation function, e.g., Rectified Linear Unit (ReLU) (blue) and adjacent max pooling operations (orange). Those operations are used repeatedly in (b) the convolutional backbone, part of the overall structure of the architecture. From an input image, feature maps are created by convolution and resized by max pooling operations getting smaller in resolution but deeper in feature maps until they reach a classifier, the head of the architecture, here a fully connected artificial neural network (ANN). (c) Details on the transition between convolutional backbone and classifier head, as well as the structure of a multi-layer ANN performing classification.
Images are records of natural signals and therefore provide information as pixel values and their local connectivity. From this property, low level features such as edges can be combined to features of higher levels with a semantic meaning. Therefore, analysing images means exploring pixel values and their local connectivity to find strong, distinctive representations which can be used for image classification. Using a DL model such as an ANN would not fully take the local connectivity of a natural signal into account. In contrast, a CNN allows for the learning of those representational features of imagery data to an increasingly abstract degree, while also being aware of their values and local arrangement. The convolutional operations are thereby trainable kernel functions connecting the layers in a CNN as the linear connections are doing it in an ANN. Due to the local sensitivity of a kernel function, CNNs are able to take local connectivity of the input data into account when learning the features from them [2].
To further introduce the basic functionality of a CNN, a network architecture for image recognition is assumed, as shown in Figure 2b. The overall architecture can roughly be split into three modules: input, convolutional backbone and classifier head. The 2D array input is passed through a sequence of convolutions, activations and max pooling operations called the convolutional backbone in order to extract high level features. The adjacent classifier, located at the end of the backbone and therefore called the head, is here, similar to an ANN, a sequence of stacked, fully connected layers. It uses the extracted features from the convolutional backbone to classify them into output classes and provide their probability. Figure 2a shows how $k_{c1}$ kernels of size 3 × 3 convolve the 2D input array over its entire depth, producing a stack of $d_{c1}$ feature maps corresponding to the kernel functions. Since the kernel functions are linear operations, each feature map is activated with a non-linear function, e.g., ReLU $f(x) = \max(0, x)$ [31][32][33]. On each activated feature map, a pooling operation is applied such as max pooling with a 2 × 2 kernel, selecting the maximum value from a 2 × 2 neighbourhood. Pooling introduces translation invariance to the model [2,95] and reduces resolution by a factor of 2 when applied with a stride of 2. This is useful when going deeper into the network, where semantically low level features with high local relations are progressively combined into many semantically high level features with low spatial relevance [2,96]. To keep the computations reasonable as well as allowing deeper stacks of feature maps with larger feature variance, the resolution is decreased by max pooling while the number of feature maps is increased by convolutional layers using gradually more kernels (depicted as subsequently smaller but deeper blocks in Figure 2b). All convolutional operations, the kernel functions, are the weights in a CNN, which are adjusted during training. Thus, the extracted features are not hand crafted but learned from training data [3].
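The sequence of convolution, ReLU activation and max pooling described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: one kernel, no padding, stride 1 for the convolution, and a hand-set edge kernel instead of learned weights. All names and values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a multi-channel image (no padding, stride 1).
    image: [channels][height][width], kernel: [channels][kh][kw].
    The kernel covers the entire depth and yields one feature map."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    feature_map = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            total = 0.0
            for c in range(channels):
                for dy in range(kh):
                    for dx in range(kw):
                        total += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """ReLU f(x) = max(0, x), applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size neighbourhood."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[y + dy][x + dx]
                for dy in range(size) for dx in range(size))
            for x in range(0, width - size + 1, stride)
        ]
        for y in range(0, height - size + 1, stride)
    ]


# Hypothetical single-channel 6 x 6 input with a vertical edge in the middle.
image = [[[0, 0, 0, 1, 1, 1] for _ in range(6)]]
# Hand-set 3 x 3 kernel responding to dark-to-bright vertical edges.
kernel = [[[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]]

activated = relu(conv2d_valid(image, kernel))  # 4 x 4 feature map
pooled = max_pool(activated)                   # 2 x 2 after pooling
print(activated)
print(pooled)
```

In a CNN the kernel values are not set by hand as in this sketch but adjusted during training, and many kernels are applied in parallel to produce a stack of feature maps.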
To enter the classifier, the last stack of max pooled feature maps in the backbone is flattened. This is done by transferring each pixel value of the final feature maps to an input neuron for a shallow fully connected ANN, as depicted in Figure 2c. The final output layer holds as many neurons as there are classes and the activation of this layer is in this case the softmax function, which returns the probability for each class based on the values transported to the last layer. The output probability distribution finally tells how likely each possible class is predicted for the whole image. From this introduction, major characteristics of CNNs and differences to ANNs can be summarised:

• The convolutional backbone is a strong feature extractor for a natural signal, while it maintains the fundamental structure of that signal and is sensitive to local connectivity.
• Instead of pairwise connections of neurons, kernel functions are used to connect layers, in order to learn features from training data.
• By sequentially repeating convolution, activation and pooling, CNN architectures follow the hierarchical structure of a natural signal, in which low level features are combined into high level features, and mimic the behaviour of the visual cortex of living mammals [96][97][98][99].
• The modular composition of both the convolutional backbone itself and the overall architecture makes the CNN approach highly adaptable for a variety of tasks and optimisations.
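The transition from the backbone to the classifier head, i.e., flattening the final feature maps and applying a fully connected layer with softmax, can be sketched as follows. The feature maps, the weights and the number of classes are hypothetical, and the head is reduced to a single layer for brevity.

```python
import math


def flatten(feature_maps):
    """Transfer every pixel of every feature map to one input neuron."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def softmax(logits):
    """Turn the values of the last layer into class probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_maps, weights):
    """Single fully connected output layer followed by softmax."""
    features = flatten(feature_maps)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)


# Hypothetical backbone output: two max pooled feature maps of size 2 x 2.
feature_maps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# Hypothetical weights: one row per class, one column per flattened pixel.
weights = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5],
]
probabilities = classify(feature_maps, weights)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The output layer holds one neuron per class, so the length of `probabilities` equals the number of classes and its entries sum to one.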
Image recognition was chosen as an introductory example because of its relatively simple network architecture. By changing the head structure after the convolutional backbone, the network can be changed to perform a completely different task, such as image segmentation, object detection or instance segmentation. In Figure 3, example applications are shown for all these tasks. The main characteristics of the tasks can be summarised as:
• Image recognition is understood as the prediction of a class label for a whole image.
• Image segmentation, semantic segmentation or pixel wise classification segments the whole image into semantically meaningful classes, where the smallest segment can be a single pixel.
• Object detection predicts locations of objects as bounding boxes and a class label.
• Instance segmentation is an object detection task to which an image segmentation task for the specific bounding box and class is additionally applied. This results in a segmentation mask of the specific object predictions.
In this review, instance segmentation is discussed together with object detection, due to their evolutionary closeness.

Figure 3. Examples for the tasks of: image recognition, which assigns a single label to a whole image; image segmentation, which densely classifies each pixel; object detection, which locates and classifies specific objects in an image by providing a bounding box; and instance segmentation, which provides a segmentation mask for detected objects within a bounding box. The example image is from the DOTA dataset, an object detection dataset of high resolution RGB aerial images [100].

Evolution of CNN Architectures in Computer Vision
The CNN family grew bigger when AlexNet was introduced during the ILSVRC in 2012 [3]. To provide an overview of the evolution of CNNs, AlexNet is used as the root of the highly ramified field, which reaches until late 2019 in this review. As mentioned in Section 2, next to the main trunk of architectures used for image recognition, the focus is also put on branches with architectures for image segmentation and object detection tasks. Despite this focus, the number of architectures and variations since AlexNet is still overwhelming. Hence, the evolutionary review focusses on the main successors, defined by their performance on well-established benchmark datasets of their specific tasks, as well as their legacy to the field. Thus, a thread through the evolution of CNNs is provided, which starts in 2012 and follows recent trends until late 2019 (see Table 1).

Image Recognition and Convolutional Backbones
As introduced in Section 2, the main part of a CNN is the convolutional backbone. Its design is of high relevance for optimising its performance. Since convolutional backbones are widely used in other DL tasks, such as image segmentation and object detection, achievements in image recognition can be seen as a main driver for the field. The review on architectures for image recognition in this section therefore discusses more sophisticated backbones and concepts about CNNs than the one already introduced in Section 2. For a better overview of the evolution, the different architectures are assigned to four groups of CNN families which we call: Vintage Architectures, containing AlexNet [3], ZFNet [38] and VGG variants [39]; the Inception Family; the ResNet Family; and finally architectures of the MobileNet family and those designed using neural architecture search with the goal of being efficient, which we refer to as the Efficient designs.
In addition to new architectures, training data is one significant component of the recent successes in DL [3][4][5]. The ImageNet dataset, with about 14 million labelled images for image recognition [3,37] in its latest version, is used during the ILSVRC. In 2012, Krizhevsky et al. [3] won the image recognition task with large margin using their CNN called AlexNet. Since then, the 2012 version of the ImageNet dataset is widely used as benchmark for CNNs on image recognition [21].
Concerning the benchmarking of architectures designed for image recognition, two measures are of importance when reviewing the evolution: their number of parameters and top five accuracy (acc@5):

$$\text{acc@5} = 1 - \frac{1}{n} \sum_{i=1}^{n} \min_{j} d(\hat{y}_{ij}, y_i)$$

where $n$ is the number of images, $\hat{y}_{ij}$ with $j = 1, \dots, 5$ are the five predicted classes for an image with the highest probability, $y_i$ is the ground truth label of that image, and $d(\hat{y}_{ij}, y_i) = 0$ if $\hat{y}_{ij} = y_i$ and 1 if $\hat{y}_{ij} \neq y_i$. The top five accuracy is used since the images in the dataset might show more than one class but the ground truth labels $y_i$ just label one class for each image in image recognition on ImageNet [21]. The higher the acc@5, the better the performance, whereas a smaller number of parameters enables more efficient processing and abstraction. Hence, the goal is to maximise acc@5 and minimise the number of parameters. Figure 4 shows the evolution of acc@5 performance for milestone architectures over time, where the size of a circle relates to the number of parameters in log scale. It becomes clear that, since AlexNet in 2012, the acc@5 first increased rapidly and by late 2015 saturated around 95%, with a tendency towards stable or smaller numbers of parameters. That leads to the questions of which major advances were introduced during the last seven years and which of them are still prominent in state of the art (sota) architectures in late 2019.

AlexNet [3], named after Alex Krizhevsky, as well as its two follow up architectures ZFNet [38], named after the authors Zeiler and Fergus, and VGG-19 [39], named after the Visual Geometry Group (see Figure 5), are here grouped together as Vintage Architectures. All three architectures are similar in their design: convolutions with non-linear activation and max pooling layers are repeated. In this way, features are extracted from an input image by subsequently deeper feature maps with smaller resolution until a fully connected classifier head is reached. This classifier predicts on the extracted features and provides the probability for each possible class. Whereas AlexNet uses 11 × 11 and 5 × 5 kernel sizes for the first convolutional layers, also called the stem of a network, reaching 81.8% acc@5 at 62M parameters [3], ZFNet decreased the size to 7 × 7 and 5 × 5, improving acc@5 to 83.5% at 62M parameters. Zeiler and Fergus [38] argued that a smaller receptive field in the beginning of the network extracts more information. Due to smaller kernel sizes in the stem, they were able to extract more feature maps holding information and made their net even deeper by increasing the number of feature maps in each convolutional layer [3,38].
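The top five accuracy (acc@5) used for benchmarking in this section can be computed with a short sketch. It assumes the usual reading of the measure: an image counts as correct if its ground truth label is among the five predicted classes with the highest probability. The function name and the toy predictions are hypothetical.

```python
def top_k_accuracy(predictions, labels, k=5):
    """predictions: for each image, class labels sorted by decreasing
    predicted probability; labels: one ground truth label per image."""
    n = len(labels)
    errors = 0
    for predicted, truth in zip(predictions, labels):
        # d is 0 for a matching class and 1 otherwise; the minimum over
        # the k best predictions is 0 as soon as one of them matches.
        errors += min(0 if p == truth else 1 for p in predicted[:k])
    return 1 - errors / n


# Hypothetical ranked predictions for four images.
predictions = [
    ["car", "truck", "bus", "ship", "plane", "road"],
    ["ship", "harbour", "bridge", "plane", "car", "truck"],
    ["plane", "car", "truck", "bus", "road", "ship"],
    ["bridge", "road", "car", "truck", "bus", "harbour"],
]
labels = ["bus", "ship", "ship", "road"]

acc_at_5 = top_k_accuracy(predictions, labels)
acc_at_1 = top_k_accuracy(predictions, labels, k=1)
print(acc_at_5, acc_at_1)
```

In the third toy image the true class is ranked sixth and is therefore counted as an error, while a correct class at any of the first five ranks counts as a hit.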

Vintage Architectures
However, a significant leap in accuracy was possible with the VGG-19 architecture. Simonyan and Zisserman [39] introduced, as a relative to AlexNet and ZFNet, a deep network variant of 19 layers. The depth was reached by repeating building blocks, each consisting of a stack of up to four convolutional layers followed by dimension reduction via max pooling. They also used 3 × 3 kernels exclusively in their convolutional layers, which is still common in recent architectures. The stacked 3 × 3 kernels are able to synthesise larger receptive fields while using fewer parameters. The main ideas of these Vintage Architectures can be summarised as:
• The convolutional backbone consists of repeated convolutions to increase the feature depth and some kind of resizing method such as pooling with stride 2 to decrease resolution.
• The ReLU activation after convolutional layers is used to speed up training with backpropagation and stochastic gradient descent [3,[31][32][33]101].
• In VGG-19, the repeated building blocks with stacked convolutions of constant size enlarge the receptive field and deepen the network.

The architecture designs of AlexNet [3] and ZFNet [38] are similar except for smaller kernels in the first convolutional layers and deeper feature maps in ZFNet, whereas VGG-19 [39] is considerably deeper overall and uses a uniform kernel size.
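The effect of stacking 3 × 3 kernels can be checked with a small calculation. The sketch assumes stride 1 convolutions without bias terms and a hypothetical, constant depth of 64 feature maps; the function names are illustrative only.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked stride 1 convolutions: every additional
    layer widens the field by (kernel_size - 1) pixels."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights of one convolutional layer (bias terms ignored)."""
    return kernel_size * kernel_size * channels_in * channels_out


depth = 64  # hypothetical number of feature maps in and out of each layer

# Two stacked 3 x 3 layers see the same 5 x 5 neighbourhood as one 5 x 5 layer.
field_two_3x3 = stacked_receptive_field(3, 2)
weights_two_3x3 = 2 * conv_weights(3, depth, depth)
weights_one_5x5 = conv_weights(5, depth, depth)

print(field_two_3x3, weights_two_3x3, weights_one_5x5)
```

Under these assumptions the stack needs 73,728 weights instead of 102,400 for the same receptive field and additionally contains one more non-linear activation.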

Inception Family
In the same year that VGG-19 was presented, the GoogLeNet variation with Inception modules [40] was introduced, a late successor to the initial work of LeCun et al. (1989) [30]. Since it marks the starting point of the Inception Family, the network is called Inception V1 from here on. The main idea of the network was that, after the stem of first convolutions, the novel Inception modules are repeated as building blocks with sporadic max pooling in between for dimension reduction (see Figure 6, right). An Inception module itself is made up of parallel convolutional layers of different kernel sizes and max pooling. As a result, an increased variety of feature representations is obtained from the same input (see Figure 6, left). To avoid an explosion of parameters, so-called bottleneck layers were introduced in the beginning of the Inception module. These are 1 × 1 convolutional layers, which intermediately reduce the depth of an input tensor before it enters one of the next parallelised convolutions. Due to this depth reduction of feature maps, bottleneck layers lead to fewer parameters needed for each parallel operation, while richer features are gained when concatenating the results later on. With this design, Szegedy et al. [40] were able to build a 21-layer deep network, when counting the convolutional layers. Because of this deep design, backpropagating the gradients while training became increasingly difficult. To counteract the weak training effect on the early layers in the network caused by vanishing gradients [5,40,41], two additional classifiers were integrated in the middle of the network. The task of these so-called auxiliary classifiers is to provide additional gradients to early layers, so that even those layers can be adjusted during backpropagation. During inference, the auxiliary classifier branches are cut off, as they are only helpful while training the weights. The resulting Inception V1 reached an acc@5 of 89.9% with only 6.8M parameters. This is much more efficient than the VGG-19 design, which reaches a slightly higher acc@5 of 92% but needs many more parameters, at 144M.
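How a 1 × 1 bottleneck layer saves parameters in front of a larger convolution can be illustrated with a short calculation. The channel counts below are hypothetical values chosen for illustration and bias terms are ignored.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights of one convolutional layer (bias terms ignored)."""
    return kernel_size * kernel_size * channels_in * channels_out


# Hypothetical branch of an Inception-like module.
depth_in = 192      # depth of the tensor entering the module
depth_reduced = 16  # depth after the 1 x 1 bottleneck layer
depth_out = 32      # feature maps produced by the 5 x 5 convolution

# 5 x 5 convolution applied directly to the full input depth.
direct = conv_weights(5, depth_in, depth_out)

# 1 x 1 bottleneck first, then the 5 x 5 convolution on the reduced depth.
with_bottleneck = (conv_weights(1, depth_in, depth_reduced)
                   + conv_weights(5, depth_reduced, depth_out))

print(direct, with_bottleneck)
```

With these assumed numbers the branch needs 15,872 instead of 153,600 weights, roughly a tenth, while still producing the same number of output feature maps.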
Five months later, Ioffe and Szegedy [41] proposed an adaptation of Inception V1 with a stem of stacked 3 × 3 convolutional layers as in VGG-19. However, what was more important for this novel implementation and even the whole DL field was the introduction of batch normalisation after convolution and before ReLU activation. Batch normalisation [41] together with appropriate parameter initialisation [18,32,[102][103][104] and suitable activation functions [31][32][33]101] are part of the solution to the problems of vanishing and exploding gradients [32,105]. These concepts became highly important because, with increasing network depth, they determine whether deep networks are able to converge during training. By enhancing Inception V1 with batch normalisation, the updated variant was able to exceed VGG-19 in acc@5 performance, with 92.18% at only 11.5M parameters.
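The placement of batch normalisation between convolution and ReLU can be sketched for a single feature channel. This is a simplified training-time view under stated assumptions: the batch is a plain list of convolution outputs, the learnable scale `gamma` and shift `beta` are fixed to their usual initial values, and the running statistics needed for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel over a mini-batch to zero mean and unit
    variance, then rescale with gamma and shift with beta."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta
            for v in batch]


def relu(values):
    return [max(0.0, v) for v in values]


# Hypothetical convolution outputs of one channel for a batch of four images.
conv_outputs = [2.0, 4.0, 6.0, 8.0]

normalised = batch_norm(conv_outputs)  # after convolution
activated = relu(normalised)           # before ReLU activation
print(normalised)
print(activated)
```

Whatever the scale of the incoming values, the normalised values are centred on zero with a variance close to one, which keeps the inputs to the activation in a stable range during training.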
The idea of the Inception modules was further improved in Inception V2 and V3, where Szegedy et al. [42] applied factorisation on convolutions. Like VGG-19, they used stacked 3 × 3 convolutions to increase the receptive field. Factorisation can furthermore be applied to an m × n kernel to split it into a stack of 1 × n and m × 1 kernels. Therefore, instead of m × n × d parameters, only (m + n) × d parameters are needed. By applying factorisation on the original Inception module, three modified modules were created to increase the representational capacity of the model by using parameters more efficiently. The different modules are plugged into the network according to their ability to represent features in specific depths of the network, see Section 6 in [42]. The resulting Inception V3 network leaped up to an acc@5 of 94.4% using 23.6M parameters.
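The spatial factorisation of a kernel into a stack of 1 × n and m × 1 kernels can be verified numerically for a special case. The sketch assumes a single channel and a 3 × 3 kernel that is exactly the outer product of a column and a row kernel; in a network the two stacked kernels are learned directly and only approximate what a full kernel could express. All values are hypothetical.

```python
def correlate_valid(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(kh) for dx in range(kw))
            for x in range(len(image[0]) - kw + 1)
        ]
        for y in range(len(image) - kh + 1)
    ]


# Hypothetical 6 x 6 single-channel input.
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(6)]

row_kernel = [[2, 0, 1]]         # 1 x 3, three weights
col_kernel = [[1], [2], [-1]]    # 3 x 1, three weights
# The equivalent full 3 x 3 kernel (nine weights) is their outer product.
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

direct = correlate_valid(image, full_kernel)
stacked = correlate_valid(correlate_valid(image, row_kernel), col_kernel)

print(direct == stacked)
```

Both paths produce the same 4 × 4 feature map, but the stacked version holds six instead of nine weights.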
The idea of factorised convolutional filter banks can also be translated to a depth-wise factorisation, where convolutions are applied to each input channel separately. Chollet [45] assumed in his work "[. . . ] that cross-channel correlations and spatial correlations can be mapped completely separately" (p. 1801). With this extreme idea of factorisation, Chollet [45] presented a network called Xception. Besides the stem, it exclusively uses depth-wise separable convolutions as feature extractors. Those are 1 × 1 or pointwise convolutions with adjacent 3 × 3 depth-wise convolutions for each output channel, without a non-linear activation in between. Benchmark results of Xception are similar to Inception V3 with 94.4% acc@5 but slightly fewer parameters (22.8M). The Inception Family proposed and established the following cornerstones of sota CNNs:
• Bottleneck designs and complex building block structures
• Batch normalisation to make deep networks trainable faster via stochastic gradient descent
• Factorisation of convolutions in space and depth
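The parameter saving of a depth-wise separable convolution compared with a standard convolution can be estimated with a short calculation. The sketch follows the order described above (pointwise convolution first, then one 3 × 3 depth-wise kernel per output channel), ignores bias terms and uses hypothetical channel counts.

```python
def standard_conv_weights(kernel_size, channels_in, channels_out):
    """Every kernel spans all input channels."""
    return kernel_size * kernel_size * channels_in * channels_out


def separable_conv_weights(kernel_size, channels_in, channels_out):
    """1 x 1 pointwise convolution mixing the channels, followed by one
    spatial kernel per output channel (depth-wise convolution)."""
    pointwise = channels_in * channels_out
    depthwise = kernel_size * kernel_size * channels_out
    return pointwise + depthwise


# Hypothetical layer: 128 input and 256 output feature maps, 3 x 3 kernels.
standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)

print(standard, separable)
```

For this assumed layer the separable variant needs 35,072 instead of 294,912 weights, because cross-channel and spatial correlations are handled by separate, much smaller operations.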

ResNet Family
Besides the Inception modules, Xception also uses so-called residual connections, which lead to another stage of development, the ResNet Family. The first ResNet [43] was introduced in late 2015, at the time when Inception V3 was proposed. What gives this family its name is the above-mentioned residual connection used in the ResNet building block (see Figure 7). Technically, within the block's main trunk, the input depth is reduced by a 1 × 1 convolution, then features are extracted by a 3 × 3 convolutional layer and finally depth is increased again using a 1 × 1 convolution. This bottleneck design is closely related to the slightly older Inception Family, focussing on parameter efficiency. What is new is what goes around the main trunk. Before the first convolution, the input is branched away from the trunk and merged back after the second 1 × 1 convolution by simply adding the values to the output of the trunk. This branch is called a shortcut or residual connection. If the input is not manipulated on the shortcut, it is called an identity shortcut. Depth and resolution of the residual branch and the main trunk have to be the same in order to add them back together. To still allow the depth to increase and the resolution to decrease within the main trunk, the shortcut can also perform operations such as convolution [43] (see version 2 of the ResNet building block in Figure 7).
With the two ResNet building block designs, which are less complex than the Inception building blocks, it was possible to stack the components to very deep networks. However, increasing depth led to the counterintuitive degradation problem, in which accuracy first saturates and then rapidly decreases when using deeper architectures. Inception V1 [40] used the auxiliary classifiers to cope with this problem, but they made the network overly complex with additional branches. To avoid the degradation of deeper networks and support the optimisation with backpropagation, the above-described ResNet design reformulates the function $H(x, W_i)$, which describes the convolutional operations of the main trunk. In ResNet, it is now a residual function $H(x, W_i) := F(x, W_i) + x$, where $F(x, W_i)$ is the residue to approximate and $x$ the input before the convolutional operation. Since $x$ is known due to the residual connection, approximating the residue is considered easier [43]. Furthermore, when the residue of convolutional operations in the final part of the network is driven to zero, an identity function of $x$ is learned. From the backpropagation point of view, it is now possible to transport larger gradients to convolution operations early in the network and enable optimisation of those layers. Hence, residual connections enable deeper networks to converge during training and perform better than shallow networks [43,44,46].
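The residual formulation, in which the trunk only has to approximate the residue that is added to the unchanged input, can be sketched as follows. The trunk functions are hypothetical stand-ins for the convolutional operations of the main trunk and the input is reduced to a short list of values.

```python
def residual_block(x, trunk):
    """Identity shortcut: the output of the trunk (the residue) is added
    element-wise to the input that was branched off before the trunk."""
    residue = trunk(x)
    return [f + xi for f, xi in zip(residue, x)]


def zero_trunk(x):
    """Stand-in for a trunk whose weights produce a residue of zero."""
    return [0.0 for _ in x]


def small_trunk(x):
    """Stand-in for a trunk that learned a small correction of the input."""
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]

identity_output = residual_block(x, zero_trunk)    # block acts as identity
corrected_output = residual_block(x, small_trunk)  # input plus residue
print(identity_output, corrected_output)
```

With a residue of zero the block simply passes its input on, which illustrates why additional blocks of this kind do not have to degrade the result of a shallower network.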
ResNet [43] was presented with multiple depths, where the ResNet-152 model (152 layers, see Figure 7), performed with an acc@5 of 95.5% at 60.3M parameters. Similar to the Inception Family, the ResNet idea evolved over time. ResNeXt [44] changed the ResNet building blocks by introducing so-called cardinality, which can be imagined as parallel convolutional operations with fewer feature maps in each parallel branch. This design widens the block to the next dimension, hence the name.
Technically, this was implemented by redesigning the 3 × 3 convolution operation in the main trunk of the block to be grouped convolution, introduced in ShuffleNet [106]. The use of 32 groups on a 128 deep input feature map leads to four feature maps per group, which are convolved together. ResNeXt-101 was able to improve acc@5 slightly up to 95.6% at 83.6M parameters. Another ResNet variant is DenseNet [46], which uses residual connections intensively. As the name suggests, in a DenseNet building block, each convolutional layer takes as input the result of the previous convolution as well as all previous inputs within the block via multiple residual connections, forming a densely-connected building block. Instead of addition, DenseNet uses concatenation to merge the layers, resulting in a very deep feature map as the output of a block. For this reason, before entering the next block, depth is reduced in a transition block using a bottleneck design. For a detailed description, we refer to the work of Huang et al. [46]. DenseNet is trailing ResNet and ResNeXt designs with an acc@5 of 93.9% for DenseNet-264, which uses significantly fewer parameters, just 34M.
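The grouping arithmetic mentioned above (32 groups on a 128 deep feature map) can be checked with a short sketch. The helper `conv_weights` is hypothetical, bias terms are ignored, and the output depth of 128 is an assumption made only for this illustration.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Number of weights (no bias) of a k x k convolution split into `groups` groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

maps_per_group = 128 // 32                       # feature maps convolved together per group
dense = conv_weights(128, 128, 3)                # ordinary 3 x 3 convolution
grouped = conv_weights(128, 128, 3, groups=32)   # grouped 3 x 3 convolution
```

Under these assumptions the grouped layer needs 32 times fewer weights than the ordinary one, which is what allows the block to be widened without inflating the parameter count.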
Considering the network depth of the Vintage Architectures or the Inception Family, the ResNet Family led to very deep architectures, which were trainable without auxiliary classifiers by using the novel residual connection and the now established batch normalisation. Furthermore, the residual connection introduced by the ResNet Family is still a widely used component in CNN architectures and contributes to more sophisticated flow control in network architectures. Together, both families, Inception and ResNet contributed to the evolution by demonstrating that CNNs are highly modular models. Xception as well as the older Inception-ResNet-V2 [107] are exemplary for merging the corner stones of each family into new architectures. The results are Inception-like modules with bottleneck designs and batch normalisation, enhanced by residual connections. Those networks are deeper than their ancestors without auxiliary classifiers and perform better or equal to those with fewer parameters.
Besides complete new architectures, plug in modules such as the Squeeze and Excitation (SE) module [47] demonstrated that the modification of existing model layouts can enhance even the Vintage Architectures in an efficient way. SE modules used in SENet are small fully connected neural networks which weight the feature map outputs of a convolutional operation. They support the hypothesis that not all features in a feature map are equally responsible for the final prediction. The usage of such weighted convolutional operations improves Vintage Architectures such as VGG as well as networks from the ResNet and Inception Family by adding few parameters to the network [47].
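A minimal NumPy sketch of such a channel-weighting module is given below. It assumes the layout commonly described for SE modules (global average pooling, a reducing fully connected layer with ReLU, an expanding fully connected layer with sigmoid); the function name, the weight shapes and the reduction from 8 to 2 channels are hypothetical choices for illustration, and biases are omitted.

```python
import numpy as np

def squeeze_excite(fmap, w1, w2):
    """fmap: (C, H, W); w1: (C_reduced, C); w2: (C, C_reduced)."""
    squeezed = fmap.mean(axis=(1, 2))                # squeeze: one value per feature map
    hidden = np.maximum(w1 @ squeezed, 0.0)          # fully connected layer + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # fully connected layer + sigmoid
    return fmap * weights[:, None, None]             # re-weight each feature map

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 4, 4))
w1 = rng.normal(size=(2, 8))
w2 = rng.normal(size=(8, 2))
out = squeeze_excite(fmap, w1, w2)
```

The output has the same shape as the input; only the relative contribution of each feature map changes, which is why the module can be plugged into existing architectures.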

Efficient Designs
The modular concept of CNNs and their building blocks is crucial for the next group of architectures. In 2017, two major trends led to today's state of the art architectures in image recognition: highly parameter efficient networks and, instead of hand crafted designs, architectures drafted by other neural networks in a so-called Neural Architecture Search (NAS) [48,49].
To use DL to solve everyday problems, the architectures had to run on mobile devices. The designs making this possible are grouped as Efficient designs in this review. Motivated by the restriction of computations for mobile devices, the lightweight MobileNet family was founded. The first MobileNet-224 [50] had only 4.2M parameters; nevertheless, it performed with an acc@5 of 89.9%. The network mainly consists of depth-wise separable convolutions, which use a highly parameter efficient stack of a 3 × 3 convolution applied to each input feature map separately, followed by a 1 × 1 pointwise convolution across the entire depth. The next version, MobileNet-V2 [108], further improves this idea by using mobile inverted depth-wise convolution with residual connections, pictured in Figure 8 (left). This describes a building block which first performs a 1 × 1 pointwise convolution that expands the depth of feature maps for the adjacent 3 × 3 depth-wise convolution. Afterwards, another 1 × 1 pointwise convolution defines the output depth, which is normally smaller than the intermediate expansion depth. A surrounding residual connection adds the input to the output maps by connecting the bottleneck layers. MobileNet-V2 performed with an acc@5 of 92.5% at 6M parameters [108,109].
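The parameter efficiency of depth-wise separable convolutions can be made concrete with a small weight count. The helper names and the depths of 64 and 128 are hypothetical, and bias terms are ignored.

```python
def standard_conv_weights(c_in, c_out, k=3):
    # Every output map sees every input map through its own k x k kernel.
    return c_in * c_out * k * k

def separable_conv_weights(c_in, c_out, k=3):
    depthwise = c_in * k * k   # one k x k filter per input feature map
    pointwise = c_in * c_out   # 1 x 1 convolution across the entire depth
    return depthwise + pointwise

std = standard_conv_weights(64, 128)
sep = separable_conv_weights(64, 128)
```

For these example depths the separable variant needs roughly an eighth of the weights of the standard 3 × 3 convolution.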
In parallel, a few months after the first MobileNet was introduced, NASNet [49] and therewith a new way to define architectures was added to the development of CNNs. Neural architecture search (NAS) follows the idea "[. . . ] learning beats programming" of Krizhevsky et al. [110] (p. 84) not only for features but also whole architectures. In NAS, a defined search space of CNN building blocks was used by a controller, such as a recurrent neural network (RNN) [34], to find the best so-called child network architecture. The controller does this by using reinforcement learning which maximises the accuracy of prediction on an underlying dataset reached by the child during every iteration. Thereby, the RNN architecture in combination with reinforcement learning allows subsequently adapting the design of the child network. The reward signal, the accuracy performed by the resulting child network, is used to update the controller in order for it to produce a new child, which performs better in its defined task [48,49].
One drawback is that the briefly described search algorithm needs to train each child it produces. Since ImageNet is a relatively large dataset, it is common practice to use a smaller dataset such as the CIFAR-10 dataset [111] during NAS. After the new architecture is defined, it is scaled up to match the larger variance of ImageNet without being trained on it each time [49,51,52,112,113,114]. Scaling can be done by simply repeating the NAS defined building blocks to build a deep CNN [49,112], or by defining a more complex scaling rule [52], which is mentioned below. The NASNet variant NASNet-A (6@4032), introduced by Zoph et al. [49], performed with an acc@5 of 96.2% at 88.9M parameters.

Figure 8. (Left) The MobileNet-V2 building block [108], here with an additional Squeeze-and-Excitation (SE) module [47]. In comparison with a ResNet building block, the bottleneck design is inverted, so that first the expansion factor (t) is larger than 1, which leads to intermediate feature maps (td) that are deeper than the final output depth of the building block (d'). (Middle) The Mnas search space with a fixed overall architecture of the network, the skeleton, but fully optional layer designs, based on the MobileNetV2 building block [51]. (Right) A recurrent neural network (RNN) [34] controller searches the search space for the best performing combination of layer designs by maximising an optimisation rule [48]. The resulting architecture is scaled in depth, width and resolution to become the EfficientNet-B7 architecture, the sota design in late 2019 [52].
Due to this new way of designing networks, the next generation of the MobileNet family used the NAS approach to generate the MobileNASNet (MnasNet). The Mnas search space, pictured in Figure 8, defines a fixed skeleton of seven repeated blocks, where each block has $n$ layers with output depth $d_{1,\ldots,7}$, where $n$ and $d_{1,\ldots,7}$ are searched. Furthermore, the layer model of each block is a SE module enhanced MobileNet-V2 building block, but each component of this block is optional and the layer design is searched by the RNN [51]. By increasingly more complex layer designs found during NAS, layer diversity was increased, which was found to be important for richer feature representations [112,115]. With acc@5 of 93.3% at only 5.2M parameters, the resulting MnasNet performed better than the hand crafted MobileNet-V2, on which it is based, with fewer parameters needed [51].
The same search space was used in late 2019 by Tan and Le [52] to create the last family of CNNs for image recognition reviewed in this paper, the EfficientNets. Since the search space remained unchanged and the optimisation rule was also similar to MnasNet, the resulting baseline model EfficientNet-B0 (acc@5 = 93.5%, parameters = 5.3M) performed similarly to MnasNet. However, what makes EfficientNet successful is the scaling approach. Tan and Le [52] proposed compound scaling, which balances scaling in depth, width and resolution of the network (see Figure 8). The resulting EfficientNet-B7 architecture reached 97.1% acc@5 at 66M parameters, which means an increase in acc@5 of 15.3 percentage points with 4M more parameters, compared to AlexNet seven years before (see Table 2).
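Compound scaling can be sketched as follows. The sketch assumes the commonly cited formulation in which depth, width and resolution are multiplied by α^φ, β^φ and γ^φ for a single coefficient φ; the constants 1.2, 1.1 and 1.15 are the values we recall for the EfficientNet baseline and should be treated as assumptions, as should the function names.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed depth, width and resolution constants

def compound_scale(phi):
    """Multipliers for depth, width and resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi):
    # Cost grows linearly with depth and quadratically with width and resolution.
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

depth, width, resolution = compound_scale(1)
```

With these constants each increment of φ roughly doubles the computational cost, so a single number controls how much larger the scaled network is than the baseline.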
EfficientNet [52] and its closest relatives, the MobileNets [50,51,108,114] and other NAS induced architectures [49,112,113], are based on the findings of their ancestors. The most relevant features and concepts from Vintage Architectures and the Inception and ResNet Family, which are bottleneck designs, the factorisation of convolutional operations and residual connections, are the main components which are characteristic for the sequential evolution of CNNs. All of these components are finally included in the MobileNet building blocks and NAS search spaces (see Table 2). Furthermore, using NAS and scaling successfully demonstrates how flexibly CNNs can be designed to fit specific tasks [52]. However, what has not changed over the whole evolution is the meta structure of CNNs. For image recognition, it still remains: input, convolutional backbone and a final classifier head. Applying CNNs for image recognition on Earth observation data is normally done by classifying a small patch of a remote sensing image with one or multiple labels of land cover classes. An example dataset is the BigEarthNet dataset, which contains 125 Sentinel-2 tiles divided into 590,326 patches labelled with 43 land cover classes [116]. Sumbul et al. [117] used a VGG-19 and ResNet-50 to predict labels on BigEarthNet and reached an average precision of 64.87% and 74.78%, respectively. Therewith, they demonstrated the increase in performance between these architectures even for Earth observation data.

Image Segmentation
As pointed out in the previous section, the features extracted by a convolutional backbone hold high level semantic information and are used for predicting classes of whole images. Image segmentation also uses these high level features extracted by a convolutional backbone, but predicts classes on pixel level, which leads to the following problem: within a convolutional backbone, the feature maps are successively resized to a lower resolution with increasing semantic information. Hence, they provide less accurate location information, which is necessary for accurate pixel wise prediction. As a result, image segmentation is confronted with a trade-off between high resolution and feature depth.
Additionally, to predict the class of a single pixel, contextual relationship is important. This information can be found in a range of long and small distances around the pixel. The contextual information depends on the size and continuity of the semantic uniform segment the pixel belongs to, as well as the amount and density of neighbouring segments of other classes and background. Therefore, image segmentation can be seen as a multi scale context problem, even when prediction is made on single pixels. The advances in image segmentation are focussing strongly on solving this problem by exploiting features of different stages in the network or preserving or reconstructing high resolution during feature extraction and prediction.
Since image segmentation makes predictions on pixel level, the underlying benchmark dataset and metric to compare architectures have to be different from ImageNet and acc@5. In this review, the segmentation task of the PASCAL-VOC 2012 test dataset [53,54] is used and is further referred to as PASCAL-VOC, if not stated otherwise in this section. This dataset was chosen over more recently popular and challenging datasets such as the Cityscape dataset [118,119] because of its long-lasting tradition, which makes it possible to survey the evolution since 2014. The most frequently reported metric for this dataset is the mean Intersection over Union (mIoU) over all classes [53,78].
$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{|y_c \cap \hat{y}_c|}{|y_c \cup \hat{y}_c|}$$

where $C$ is the number of classes, $|y_c \cap \hat{y}_c|$ is the intersection between the ground truth $y$ and predicted segmentation $\hat{y}$ per class and $|y_c \cup \hat{y}_c|$ is the union of ground truth $y$ and predicted segmentation $\hat{y}$ per class. Its range is between 0% and 100% with the goal to maximise mIoU. Over the years, higher accuracies were achieved, as shown in Figure 9, similar to the evolution of image recognition.
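The mIoU, that is, the per-class ratio of intersection to union between ground truth and prediction averaged over all classes, can be computed with a few lines of NumPy. This is a minimal sketch on a hypothetical 2 × 2 label map; skipping classes that occur in neither map is an assumption of the sketch, and benchmark evaluation servers may handle this case differently.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean Intersection over Union for integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        t, p = (y_true == c), (y_pred == c)
        union = np.logical_or(t, p).sum()
        if union == 0:
            continue  # class absent in both maps; skipped by assumption
        ious.append(np.logical_and(t, p).sum() / union)
    return float(np.mean(ious))

y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
score = mean_iou(y_true, y_pred, num_classes=2)
```

In this toy case one misclassified pixel lowers the IoU of both classes (to 1/2 and 2/3), which shows why mIoU is stricter than plain pixel accuracy.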
In the following overview about image segmentation, focus is put on Fully Convolutional Networks (FCNs) inspired architectures. For a wider review of segmentation approaches which use LSTM or GANs we refer to Minaee et al. [78], Garcia-Garcia et al. [120]. Image segmentation with CNNs was strongly influenced by the work of Long et al. [55] who introduced FCNs in 2014. To further discuss the evolution of FCN architectures, they are separated into two groups: naïve decoder and encoder-decoder architectures. By using the term naïve decoder models, we follow, e.g., Chen et al. [60], who used this term to describe feature map upsampling mainly by bilinear interpolation and without using additional information from the encoder path. The overall idea of naïve decoder models (Figure 10, left) is to use a convolutional backbone to extract feature maps, upsample the feature maps to recover the input resolution by using bilinear interpolation and finally make pixel-wise prediction and optional post processing to gain segmentation masks.
In encoder-decoder models (Figure 10, right), the encoder part of the network can be seen as the convolutional backbone which extracts feature maps from input data. The decoder uses the final feature map in combination with spatially more accurate information from earlier stages in the encoder path. Within the decoder, the feature map is upsampled or deconvolved and additionally combined with spatially more accurate information transported from the corresponding layer in the encoder to the decoder via skip connections. After the input resolution is recovered, pixel wise predictions are made to produce the segmentation mask. The following overview of image segmentation architectures will first look at the original FCN, then focus on naïve decoder models, characterised by the DeepLab family, and finally discuss encoder-decoder models.

Figure 10. As an example for the naïve decoder design (left), DeepLabV1 was chosen, which uses atrous convolution as the last layers in the backbone, bilinear interpolation as upscaling method and a fully connected conditional random field (CRF) module for refinement [56]. As an example for the encoder-decoder models (right), a U-Net inspired design shows the skip connections, transporting feature maps from the encoder path to the corresponding decoder path, providing increasingly precise feature localisation during upscaling [67].

Naïve Decoder
In 2014, Long et al. [55] introduced Fully Convolutional Networks (FCN) for semantic segmentation. They used VGG-16 as backbone to create feature maps and trained deconvolutional layers to upsample them to input resolution. To increase the precision of fine grained features in their best performing FCN-8s variant (mIoU = 62.2%), they used feature maps from earlier layers in the VGG-16 backbone via skip connections during upsampling. With that, they provided a combination of spatially more accurate and higher level semantic features for the pixel wise classification. Inspired by the FCN design, Chen et al. [56] introduced DeepLabV1. The major differences are the use of atrous convolutions, explained below, and a naïve decoder performing upsampling by bilinear interpolation. Atrous or dilated convolutions, as depicted in Figure 11 (left), which are intensively used in the DeepLab family, are inspired by the algorithme à trous [121]. In atrous convolutions, "holes" are inserted in a convolutional kernel to increase the receptive field and at the same time maintain resolution in order to gain dense feature maps with high resolution [56,58]. Because of these dense feature maps, upsampling can be done with the computationally more efficient bilinear interpolation, compared to the trainable layers used by Long et al. [55] in FCN. However, despite the dense feature maps, after pixel wise prediction, in DeepLabV1 [56], the segmentation map is further refined by a fully connected conditional random field (CRF) [61]. When published, DeepLabV1 performed with a mIoU of 66.4%. The disadvantage of this combination of CNN and CRF is that DeepLabV1 is not end-to-end trainable in one stage.
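How inserting "holes" enlarges the receptive field without adding weights can be shown in one dimension. The following sketch is illustrative only: both function names are hypothetical, and it computes a "valid" atrous cross-correlation without padding, whereas segmentation networks typically pad the input so that resolution is maintained.

```python
import numpy as np

def effective_kernel_size(k, rate):
    """Extent covered by a kernel of k weights with dilation `rate`."""
    return k + (k - 1) * (rate - 1)

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D atrous (dilated) cross-correlation with stride 1."""
    k = len(kernel)
    out_len = len(signal) - effective_kernel_size(k, rate) + 1
    return np.array([
        sum(kernel[j] * signal[i + j * rate] for j in range(k))
        for i in range(out_len)
    ])

signal = np.arange(10, dtype=float)
out = atrous_conv1d(signal, np.array([1.0, 1.0, 1.0]), rate=2)
```

A kernel with three weights and rate 2 covers five input positions, and with rate 12 it covers 25, while the number of trainable weights stays at three.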
The later proposed DeepLab-LargeFOV [57] variation uses only 3 × 3 kernels and a larger dilation rate of 12 for the atrous convolution in order to generate a large receptive field. After CRF refinement, it reached 72.7% mIoU [57,58]. The successor of DeepLabV1, DeepLabV2 [58], uses a ResNet-101 instead of a VGG-16 as backbone and the novel Atrous Spatial Pyramid Pooling (ASPP) module in order to introduce multiscale feature exploitation to the DeepLab family.
ASPP is motivated by SPPNet [74], which will be discussed further in Section 3.3. The core idea of ASPP can be described as a parallel multiscale exploitation of feature maps processed by atrous convolution of different rates, see Figure 11 on the right. Therefore, multiscale information can be extracted using an efficient network design, instead of computational expensive processing of multiscale images, so-called image pyramids. DeepLabV2 consists of an atrous convolution enhanced ResNet-101 backbone and an ASPP module of convolutional layers with four different dilation rates, 6, 12, 18 and 24. The output feature maps from ASPP are upsampled to input resolution by bilinear interpolation on which pixel wise prediction is performed. After an adjacent fully connected CRF refinement, DeepLabV2 reached a mIoU of 79.7% [58].
Even though the multiscale exploitation by the ASPP module was successful, some problems were noted later on. By using the largest dilation rate of 24 within ASPP, one can imagine that the outer cells of the kernel can lie outside the input feature map. Hence, this atrous convolution degenerates to a 1 × 1 convolution of the centre, and the goal of extracting long distance context information by using large dilation rates is defeated [59].
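This degradation can be reproduced by counting how many cells of a 3 × 3 atrous kernel still fall onto the feature map. The sketch below uses a hypothetical 20 × 20 feature map and a hypothetical helper `valid_taps`; cells outside the map are assumed to see only zero padding.

```python
def valid_taps(size, rate, row, col):
    """Count cells of a 3 x 3 atrous kernel centred at (row, col) that hit a size x size map."""
    count = 0
    for dy in (-rate, 0, rate):
        for dx in (-rate, 0, rate):
            if 0 <= row + dy < size and 0 <= col + dx < size:
                count += 1
    return count

centre_rate6 = valid_taps(20, 6, 10, 10)    # all nine cells lie on the map
corner_rate6 = valid_taps(20, 6, 0, 0)      # only four cells remain at the corner
centre_rate24 = valid_taps(20, 24, 10, 10)  # only the centre cell remains
```

Once the rate exceeds the size of the feature map, every position is left with the centre cell only, so the layer behaves like a 1 × 1 convolution.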
Another contributing factor of long distance context on pixel-wise classification in segmentation architectures was investigated in ParseNet [65], which can be seen as an alternative to large dilation rates. Global image context was exploited by using global average pooling and fusing this branch of the network with the standard features extracted by convolution. ParseNet was able to reach 69.8% mIoU without CRF refinement and is an end to end trainable CNN for image segmentation.
Pyramid Scene Parsing Network (PSPNet) [66] goes even further with its Pyramid Pooling Module. This module pools from the input feature map on different scales, and applies 1 × 1 convolution and upsampling in order to concatenate context features from different scales with the original input feature map of the module. After a last convolutional operation on those fused features, the final pixel-wise prediction is performed. Therewith, local context of different scales is exploited for better classification. PSPNet with a ResNet backbone and without any CRF refinement performed at 82.6% mIoU. The described advances, such as the extraction of features from multiple scales, the usage of contextual information from global and local scales and relying completely on a CNN model without combining it with graphical models for refinement, were aggregated in DeepLabV3 [59] (see Figure 11, right). Its structure is the same as in DeepLabV2 [58], but changes were made in the ASPP module. Chen et al. [59] removed the largest dilation rate, due to the problem described above and instead introduced image level feature extraction by global average pooling. Therewith, long range information on image level, which was used successfully in ParseNet [65] and PSPNet [66], was incorporated into the ASPP module in DeepLabV3. Upsampling is still done with bilinear interpolation until input image resolution is reached, but no CRF refinement is applied after prediction of the last convolutional layer that produces the segmentation mask. DeepLabV3 is therefore the first end to end trainable DeepLab variant, and, when published, it performed with state-of-the-art results of 85.7% mIoU [59]. The last naïve decoder model reviewed here is HRNetV2 [71], published in 2019. Since no metric is reported on PASCAL-VOC but on the Cityscape dataset, HRNetV2 cannot directly be compared with the other architectures discussed before. 
To narrow this gap, we compare it to the reported performance of DeepLabV3 on the Cityscape test set with 81.3% mIoU. Both HRNetV1 [70] and its successor HRNetV2 [71] use four stages of five convolutional blocks. From the second stage, parallel convolutions on different resolutions are performed and finally merged in the fifth convolution by upsampling. With it, in Stages 2-4, features of different resolutions are combined with every other resolution in the so-called multi-resolution block. The difference between HRNetV1 and HRNetV2, is that V1 only uses the block of highest resolution for final prediction, whereas V2 uses all blocks of all resolutions by combining them using concatenation and stronger upsampling of low resolution feature maps. The final input resolution for prediction is reached by further upsampling with bilinear interpolation. The performance of HRNetV2 on the Cityscape test set is 81.6% mIoU, slightly better than DeepLabV3 [70,71].

Encoder-Decoder Models
All of the image segmentation architectures reviewed until now, besides FCN [55], share relatively naïve decoders. This means that mostly bilinear interpolation was used to upsample the feature maps back to input image resolution. The encoder-decoder models differ by introducing a more complex decoder (see Figure 10, right). The main idea is to use shortcuts or skip connections to transport information from the encoder branch to the decoder. Therewith, high level features from the later layers in the encoder are successively fused with more locally precise information from early layers in the encoder during upscaling.
In 2015, Noh et al. [64] introduced DeconvNet, which uses a VGG-16 as encoder and a mirrored VGG-16 as decoder. Within the decoder, instead of pooling, unpooling layers use the spatial localisation recorded during pooling, called pooling indices. With this information, a sparse feature map of higher resolution is restored and an adjacent deconvolution layer densifies them. Therewith, the decoder reaches the input resolution where pixel wise prediction is performed to generate the segmentation mask. DeconvNet reached a mIoU of 69.6%, matching the DeepLab version at this time without using additional refinement with a CRF.
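Unpooling with pooling indices can be sketched in NumPy as follows. This is an illustration on a single hypothetical 4 × 4 feature map, not the DeconvNet code; both function names are assumptions, and the subsequent deconvolution layer that densifies the sparse map is omitted.

```python
import numpy as np

def max_pool_with_indices(x):
    """2 x 2 max pooling with stride 2 on a 2D array with even side lengths."""
    h, w = x.shape
    # Gather each 2 x 2 window into the last axis (row-major order within the window).
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def unpool(pooled, idx):
    """Place each pooled value back at its recorded position; all other cells stay zero."""
    ph, pw = pooled.shape
    out = np.zeros((ph * 2, pw * 2), dtype=pooled.dtype)
    rows = 2 * np.arange(ph)[:, None] + idx // 2
    cols = 2 * np.arange(pw)[None, :] + idx % 2
    out[rows, cols] = pooled
    return out

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])
pooled, idx = max_pool_with_indices(x)
restored = unpool(pooled, idx)
```

The restored map has the original resolution but contains only the maxima at their original locations, which is the sparse feature map the text refers to.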
In SegNet, Badrinarayanan et al. [122] used a similar approach and added batch normalisation with ReLU activation for each convolutional layer in the encoder and decoder. SegNet outperformed DeconvNet slightly; those results were reported not on PASCAL-VOC but the CamVid dataset [123].
In 2015, Ronneberger et al. [67] proposed U-Net, an encoder-decoder architecture whose structure has been applied widely in many domains since it was introduced. U-Net, originally built for medical image analysis for cell tracking, is similar to SegNet [122] and DeconvNet [64]. Its encoder is made up of five building blocks where each consists of two adjacent 3 × 3 convolutions which successively double the amount of feature maps from 64 up to 1024. Between the blocks, the feature maps are downscaled with 2 × 2 max pooling of stride 2. The decoder uses deconvolution layers for upscaling within five blocks, where each block receives the whole feature maps from the encoder path of the same resolution, not only the pooling indices. The feature maps are transported via skip connections from the encoder to the corresponding building block in the decoder. Feature maps coming "up" within the decoder are concatenated with the feature maps from the encoder. Thereby, high level semantic information is combined with more precise local information but lower semantic meaning during upscaling. When input resolution is finally restored, a last 1 × 1 convolution predicts the segmentation map [67]. Since U-Net was developed for a specific cell tracking task, no results on PASCAL-VOC are reported in the original publication. However, Zhang et al. [124] reported a 72.7% mIoU for a vanilla U-Net on PASCAL-VOC. The U-Net design was modified, for instance in Tiramisu [68], which combines U-Net structure and DenseNet [46] building blocks, and in U-Net++ [125], where nested dense building blocks perform convolution on the skip connections.
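The concatenation of decoder and encoder feature maps can be sketched with array shapes alone. In this illustration, nearest-neighbour repetition is a stand-in for the trainable deconvolution layer (it does not change the number of feature maps), and the function name and the depths of 16 and 8 are hypothetical.

```python
import numpy as np

def decoder_step(deep, skip):
    """deep: (C1, H, W) from the decoder; skip: (C2, 2H, 2W) from the encoder."""
    # Stand-in for deconvolution: double the resolution by repeating each cell.
    up = deep.repeat(2, axis=1).repeat(2, axis=2)
    # Skip connection: stack encoder and decoder maps along the depth axis.
    return np.concatenate([skip, up], axis=0)

deep = np.ones((16, 4, 4))   # coarse, semantically rich decoder maps
skip = np.zeros((8, 8, 8))   # finer encoder maps of the matching resolution
fused = decoder_step(deep, skip)
```

Because the maps are concatenated rather than added, the following convolutions can decide how to weigh precise localisation against semantic information.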
With RefineNet [69], an encoder-decoder design was proposed, which focusses on residual connections. The encoder is built on a ResNet-152, the decoder of RefineNet blocks, which are using residual connections with sparse nonlinear activation. With this combination in RefineNet, a direct flow of features within the decoder was emphasised. RefineNet reached a mIoU of 83.4%.
Due to the convincing results achieved by encoder-decoder architectures, the DeepLab family also modified their naïve decoder model to an encoder-decoder design in DeepLabV3+ [60]. DeepLabV3+ uses a modified Xception backbone in the encoder. The modifications were inspired by the depth-wise separable convolutions [45] and the last convolutional layers became separable atrous convolutional layers. Due to this factorisation in depth, the model gained computational efficiency. With an overall stride of 16 for the encoder, the decoder first bilinearly upsamples the encoded feature maps by a factor of 4. After that, the corresponding feature maps from the encoder are concatenated via skip connections and after additional convolutional operations an adjacent bilinear upsampling of factor 4 restores input resolution. On these feature maps, the final prediction is performed to produce the segmentation mask. This effective decoder model in combination with the modified DeepLabV3 model reached a mIoU of 87.8%.
Similar to image recognition, after architectures were designed and optimised heavily by hand, NAS was used to search for high performance and computationally efficient image segmentation architectures. Dense Prediction Cell (DPC) [62], which belongs to the DeepLab family, and Auto-DeepLab [63] are two examples based on NAS [48]. DPC focusses on the optimisation of the ASPP module by searching for, e.g., dilation rates and the grid size of average spatial pyramid pooling [62]. DPC performs with 87.9% mIoU on PASCAL-VOC using a modified Xception backbone [62].
In contrast, Auto-DeepLab [63] focusses on searching for an image segmentation optimised backbone instead of optimising a single module. That means that in addition to the operations in a building block, the overall network structure is searched too. Hence, unlike in the Mnas search space [51] or DPC, no fixed skeleton is defined. For further details on the search space, we refer to the work of Liu et al. [63]. The Auto-DeepLab-L variant reached a mIoU of 85.6% but performed comparatively better on the more challenging Cityscape dataset, where it reached the same mIoU of 82.1% as DeepLabV3+ [63].
Reviewing the advances in CNNs for image segmentation, it was pointed out that multiscale feature and image context exploitation by maintaining high resolution and fusing features from different stages of the model are the cornerstones of sota architectures. Atrous convolutions, which are mostly used in naïve decoder models, and skip connections, which are typical for encoder-decoder models, are major concepts that were established during the evolution of image segmentation architectures. Table 3 summarises this evolution and demonstrates the decreasing use of CRFs for refinement and increasing use of atrous convolution and multiscale feature extraction.

Table 3. Summary of the evolution of convolutional neural networks (CNNs) for image segmentation and the above discussed cornerstones. The Backbone is the reported CNN used for feature extraction; CRF describes the use of a conditional random field for refinement; Atrous shows the application of atrous convolution; Multiscale shows if a dedicated module that handles multiscale feature extraction, e.g. ASPP, is used; NAS means that parts of the architecture are drafted by using neural architecture search; and mIoU describes the performance on the PASCAL-VOC 2012 dataset for image segmentation.

When CNNs for image segmentation are applied to Earth observation data, encoder-decoder models are a popular choice (see Section 4.2). The ISPRS 2D labelling dataset [126] is a widely used benchmark dataset consisting of a digital surface model and multispectral aerial images with very high spatial resolution. One example investigation of this dataset by Wang et al. [127] shows the application of the above-discussed DeepLabV3+ based on a ResNet-101 backbone. They modified this combination with an auxiliary loss directly after the ASPP module and a final CRF refinement and therewith demonstrated the applicability of a sota CNN coming from computer vision and its adaptation to the needs of Earth observation data.

Object Detection
While image recognition and image segmentation can be modelled as classification problems, object detection is a multi-task problem. Predicting the object class remains a classification problem, whereas predicting the location, which is a bounding box around each predicted object, is a regression problem. Therefore, the benchmark dataset has to contain additional bounding box labels and architectures have to handle both classification and regression at the same time. The used benchmark dataset is the object detection on Microsoft Common Objects in Context (MS-COCO) test-dev set [72]. The measure of interest is the mean Average Precision (mAP) or in the case of MS-COCO the AP, which is considered the same (see [128]).
MS-COCO's AP, also called $AP_{[0.5:0.05:0.95]}$, is an average over all classes for 10 different IoU (Intersection over Union) levels between ground truth and predicted bounding boxes. The IoU threshold defines if a prediction is a true positive. By taking different IoU levels into account, models with more accurate localisation characteristics are favoured. The calculation of AP for a single IoU is the average of interpolated precision values $p_{interp}$ at 101 equally spaced recall values $r \in \{0.0, 0.01, \ldots, 1.0\}$ [128,129]. Precision and recall are thereby:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP are true positives, FP false positives and FN false negatives. The empirical precision recall curve $p(r)$ is interpolated with

$$p_{interp}(r) = \max_{r' : r' \geq r} p(r') \tag{5}$$

in order to generate $p_{interp}(r)$. Finally AP is calculated with [53,129]:

$$AP = \frac{1}{101} \sum_{r \in \{0.0, 0.01, \ldots, 1.0\}} p_{interp}(r) \tag{6}$$

Other jointly reported metrics beside AP on MS-COCO are $AP_{75}$ and $AP_{50}$ with a single fixed IoU of 75% and 50%, respectively. Since MS-COCO was introduced in 2014, the first architectures reviewed use the PASCAL-VOC 2007 test set of the object detection task [54]. For these architectures, the mAP is reported, which is calculated similarly to MS-COCO's AP but with a fixed IoU and 11 instead of 101 bins of the precision-recall curve (see Equation (6)).
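The interpolated AP for a single IoU level and a single class can be sketched as follows. The function name and the two-point precision-recall curve are hypothetical; the sketch sets the interpolated precision to zero for recall levels the detector never reaches, and it omits the averaging over classes and over the 10 IoU levels that MS-COCO additionally performs.

```python
import numpy as np

def average_precision(recall, precision, num_points=101):
    """Mean of the interpolated precision at `num_points` equally spaced recall levels."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    levels = np.linspace(0.0, 1.0, num_points)
    interp = [
        # Interpolation: best precision at any recall greater than or equal to r.
        precision[recall >= r].max() if np.any(recall >= r) else 0.0
        for r in levels
    ]
    return float(np.mean(interp))

recall = [0.505, 1.0]    # hypothetical operating points of a detector
precision = [1.0, 0.5]
ap_101 = average_precision(recall, precision)                 # MS-COCO style, 101 levels
ap_11 = average_precision(recall, precision, num_points=11)   # PASCAL-VOC 2007 style, 11 levels
```

Both variants average the same interpolated curve and differ only in how finely the recall axis is sampled, which is why the two values are close but not identical.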
Architectures for object detection can be divided into two groups: two-stage detectors and one-stage detectors. In general, two-stage detectors perform with higher AP, which is shown in the progress made from late 2013 to 2019 in Figure 12. One-stage detectors are lightweight in parameters and complexity and thereby faster, measured in processed frames per second (fps) [130]. On a closer look, two-stage detectors are distinguished by a first stage, which processes class agnostic region proposals. The object class prediction of those potential regions and the final bounding box regression are then performed in the second stage. On the other hand, one-stage architectures perform class prediction and bounding box regression in a single shot from the input image. One-stage detectors play an important role in the evolution of CNN based object detection and are also widely used in applications and research [130][131][132]. Therefore, some popular designs are reviewed below. However, focus is put on two-stage detectors, and, among those, especially on the Region based CNN (R-CNN) family. For a wider overview on object detection, we refer to the works of Liu et al. [130], Wu et al. [132], Zhao et al. [133] and Jiao et al. [134].

Two-Stage Detectors
In 2013, Girshick et al. [73] introduced Region based CNN (R-CNN), which incorporated a CNN into a pipeline of established graphical models and classifiers in order to predict objects in images. From the input image, class-agnostic region proposals are generated with the selective search algorithm of Uijlings et al. [135]. For each proposal, AlexNet [3] or VGG-16 [39] is used to extract meaningful features. Those features are then forwarded to binary, class specific support vector machine (SVM) classifiers which predict the object class. To improve localisation, a class specific bounding box regression is applied, using the initial region coming from the selective search algorithm as a starting point. R-CNN with the bounding box regression scored a mAP of 66% on PASCAL-VOC 2007. However, R-CNN has some drawbacks because it performs tasks repetitively, e.g., feature extraction with a CNN for each region proposal or classification with class specific SVMs. Furthermore, the model is rather a composition of models, and therefore no end to end training is possible [73,76].
Spatial Pyramid Pooling (SPP) used in SPPNet by He et al. [74] and a redesigned R-CNN inspired architecture fixed some of the above mentioned issues. SPP is not a novel idea, but using it in a CNN context was a novel application. SPP pools features from arbitrary input sizes on different scales resulting in fixed length outputs. This enables processing on multiple scales which makes the object detection task more robust since objects are naturally appearing at different scales in images. Within the convolutional backbone, the SPP layer replaces the last pooling operation [74].
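The fixed-length property of SPP can be illustrated with a small sketch. It max-pools a single-channel feature map, given as a nested list, into grids of 1 × 1, 2 × 2 and 4 × 4 bins and concatenates the results. The pyramid levels, the bin-edge rule and the function names are assumptions chosen for this example, not the exact configuration of SPPNet.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n the map is divided into an n x n grid and the
    maximum of every grid cell is kept, so the output always has
    sum(n * n for n in levels) entries, whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for gy in range(n):
            # Floor for the start and ceiling for the end keep every bin non-empty.
            y0, y1 = (gy * height) // n, ceil_div((gy + 1) * height, n)
            for gx in range(n):
                x0, x1 = (gx * width) // n, ceil_div((gx + 1) * width, n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled
```

A 2 × 2 and a 5 × 7 feature map both yield 21 values with the default levels, which is what allows fully connected layers to follow a backbone that receives inputs of arbitrary size.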
Another change in the overall design of SPPNet made it up to more than one hundred times faster than the original R-CNN. This speed up was achieved by applying the convolutional backbone with adjacent SPP to extract features only once on the whole input image, instead of processing each region proposal coming from selective search separately. To this end, the region proposals derived from the same input image with selective search are downscaled to match the resolution of the shared feature map, so that the corresponding features can be extracted from there. SPPNet with bounding box regression performed with a mAP of 59.2% on PASCAL-VOC 2007 using a fast variant of ZFNet as convolutional backbone. This is a lower mAP than that of R-CNN, but SPPNet is 38 times faster [74].
Fast R-CNN was introduced by Girshick [75]. Similar to SPPNet, it exploits shared feature maps for the regions proposed by selective search and established the use of shared feature maps as the standard approach for two stage object detection. Another similarity to SPPNet is the RoI (Region of Interest) pooling layer in Fast R-CNN, which connects the convolutional backbone with a RoI wise classifier and bounding box regression. It performs max pooling by dividing the RoI provided by selective search into a feature map of fixed resolution. A clear difference from SPPNet is that Fast R-CNN introduced a multitask head that performs classification and regression within fully connected layers. Apart from the region proposal, feature extraction and object detection are now within one model, and no further SVMs are necessary. This was possible due to the definition of a multi task loss which sums the losses of the regression and the classifier head. Due to this combined design, training became much more efficient. Nevertheless, since selective search is still used for proposing regions, the model is not end-to-end trainable. Using a VGG-16 backbone, Fast R-CNN performed with a mAP of 66.9% on PASCAL-VOC 2007. A first performance on MS-COCO is also reported with an AP of 19.7% [75].
The next successor from the R-CNN family is Faster R-CNN by Ren et al. [76], developed in 2015. With it, the Region Proposal Network (RPN) module was introduced, and two stage object detectors finally became end to end trainable models, unified in a single network performing all tasks needed for competitive object detection results. The RPN, a small fully convolutional network, is inserted after the convolutional backbone. At each position of its sliding window, k translation invariant anchor boxes are investigated, where k is 9 in the proposed method: three scales and three aspect ratios. Since multiple anchors are predicted at the same sliding window position, heavily overlapping and overly noisy object proposals would be the result. To mark a region proposal as valid, it must be the region with the highest IoU score or have a score greater than 0.7, whereas a proposal with an IoU less than 0.3 is marked as a negative example. Each positive anchor is then regressed to the proposed object boundary and is finally used as an RoI. Besides the RPN module, the architecture is the same as for Fast R-CNN: RoI pooling with adjacent multitask object class predictor and final bounding box regression. Based on a VGG-16 backbone, Faster R-CNN performs with 21.9% AP [76]; changing the backbone to ResNet-101 raises the AP to 27.2% [101].
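The anchor mechanism can be sketched in a few lines. The example below generates the k = 9 anchors at one sliding window position and assigns a training label from the IoU thresholds 0.7 and 0.3 mentioned above. The concrete scales (128, 256, 512 pixels), the aspect ratios (0.5, 1, 2), the label encoding and all function names are assumptions for illustration.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has an area of scale**2 and a height/width ratio of `ratio`.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def label_anchor(ious_with_gt, is_best_match=False, pos_thr=0.7, neg_thr=0.3):
    """Assign 1 (positive), 0 (negative) or -1 (ignored during training).

    ious_with_gt:  IoU of this anchor with every ground-truth box.
    is_best_match: True if this anchor has the highest IoU for some
                   ground-truth box.
    """
    best = max(ious_with_gt, default=0.0)
    if is_best_match or best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1
```

Anchors that fall between the two thresholds are neither positive nor negative in this sketch and would simply not contribute to the RPN training objective.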
Further advances in gaining higher AP on MS-COCO address the following dilemma: for classification, high level semantic features are optimal, but they lack localisation. At the same time, object detection needs precise local information to find objects which appear at different scales and are potentially densely packed in the image. These issues are comparable to the problems mentioned for image segmentation in Section 3.2, and so are the solutions.
Feature Pyramid Network (FPN) (see Figure 13), introduced in 2016 by Lin et al. [77], extends the inherent bottom up feature pyramid of the convolutional backbone with an adjacent top down path. The novel top down path begins with the last layer of the conventional bottom up path and is upsampled by a factor of two at each level using nearest neighbour interpolation. In addition, each upsampled level is connected via lateral connections to the corresponding bottom up feature maps. In that way, fine grained low level features with high location information are passed to upscaled, coarse, high level semantic features. This idea is comparable with encoder-decoder models, but here it is part of the convolutional backbone for object detection. Therefore, the feature pyramid defined by the levels of the novel top down path holds features in different scales and localisation precision, all enhanced by high level features. Each of the feature pyramid levels is connected to a RPN whose anchors have three aspect ratios but only a single scale, since the different scales are now provided by the FPN levels. The FPN enhancement of a Faster R-CNN with ResNet-101 backbone reached 35.8% AP, an increase of 8.6 percentage points.

In 2017, building on the advances in CNN architectures made in object detection and image segmentation, He et al. [80] presented Mask R-CNN, an end to end trainable DL model for instance segmentation. Besides the two heads for classification and bounding box regression, a third head which performs instance mask segmentation was added to the architecture. By using the features within a RoI, binary class specific masks are predicted with the FCN for image segmentation introduced by Long et al. [55]. Due to the higher spatial precision needed for image segmentation, RoI pooling was adapted to RoI align. To this end, the RoI coordinates are represented as floating point numbers instead of being quantised to the discrete granularity of the feature map. To extract the feature value for one RoI bin, values are sampled at four regularly spaced points, bilinearly interpolated and aggregated by max or average pooling. Since standard object detection is still possible with instance segmentation architectures, performance is reported on the MS-COCO object detection task with 39.8% AP [80].
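The sampling step of RoI align can be sketched as follows. The example computes the value of one RoI bin on a single-channel feature map from four sample points that are bilinearly interpolated. The placement of the sample points at one and three quarters of the bin extent, the convention that feature values sit at integer coordinates and the function names are assumptions made for illustration.

```python
import math


def bilinear(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at float (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so that border samples stay defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_bin(feature_map, y_start, x_start, y_end, x_end, mode="max"):
    """Feature value of one RoI bin with floating point bin borders."""
    samples = []
    for fy in (0.25, 0.75):
        for fx in (0.25, 0.75):
            y = y_start + fy * (y_end - y_start)
            x = x_start + fx * (x_end - x_start)
            samples.append(bilinear(feature_map, y, x))
    if mode == "max":
        return max(samples)
    return sum(samples) / len(samples)
```

Because the bin borders are never rounded, a small shift of the RoI changes the sampled values smoothly, which is the property that pixel-accurate mask prediction benefits from.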
Another optimisation of Faster R-CNN was introduced by Cai and Vasconcelos [79], who proposed Cascade R-CNN. Looking back at RPN, which has been the state-of-the-art region proposal method for two-stage detectors since its introduction, a positive anchor is mainly defined by an IoU between 0.5 and 0.7 during training. The trade-off between a high and a low IoU threshold is that a low IoU marks more regions of interest as positive, which leads to noisy object proposals, whereas a high IoU rejects hypothetical bounding boxes coming from RPN as too weak, resulting in false negatives. Therefore, a cascading IoU with increasing bounding box refinement, which starts at a low IoU, was introduced to tackle this problem. Starting with the proposals from RPN, RoI align extracts the features for anchors with an IoU of 0.5, and a first classification and bounding box regression is performed. The bounding box results are then used as if they were produced by RPN, but now an IoU of 0.6 is applied. Again, RoI align extracts the features for the new, more accurate region proposals from the same feature map as before, and the next classifier and bounding box regression use them. This is repeated for a third time with an IoU of 0.7, which produces the final results of the object detection. Using ResNet-101 as backbone, Cascade R-CNN performed with an AP of 42.8% on MS-COCO [79]. A later implementation of Cascade Mask R-CNN which uses a ResNeXt-152 backbone reached 50.2% [136].

Rethinking the feature pyramid of FPN, high semantic features are passed within the top down path to finer grained feature maps with weaker semantic but better localised features, thereby enhancing those. However, no fine grained information is directly passed to the first top down pyramid level, which holds coarse but high level features. Path Aggregation Network (PANet) [82] adds an augmented bottom up path directly after the FPN inspired top down path, with access to low level, highly localised features via lateral connections. The feature pyramid of this second bottom up path therefore holds activation maps enhanced by both high semantic features and low semantic, highly localised features. The resulting architecture, which uses the Mask R-CNN detector head and a PANet enhanced ResNeXt-101 backbone, reached an AP of 45% [82].
FPN became a popular method to handle the important scale variation in object detection; however, different approaches exist, such as SNIP [83] and its successor SNIPER (Scale Normalisation for Image Pyramids with Efficient Resampling) [84], which focus on multi scale training. Another example, which handles scale by using atrous convolution, is the scale-aware TridentNet [85]. The last convolutional blocks of a backbone are replaced with Trident blocks. Those, as the name suggests, consist of three branches of atrous convolution with three different dilation rates. The branches share weights to prevent overfitting, which could otherwise result from potentially tripling the number of parameters. Their detections are aggregated by non-maximum suppression (NMS): for small objects, the branch with the smallest dilation rate finds the strongest activation and suppresses weaker activations from the other two branches, and vice versa. After applying a scale aware training scheme, the TridentNet-enhanced ResNet-101 backbone, which ends in a common Faster R-CNN detector, reaches an AP of 48.4% [85].
With FPN and TridentNet, the deeper layers in convolutional backbones were successfully enhanced in order to produce richer features for the subsequent detector. Nevertheless, the backbones remain architectures designed for image recognition tasks. Object detection specific backbones, in contrast, have to be trained from scratch. That means that no pre-trained parameters can be utilised for refinement, but all initialised values have to be trained solely on a given object detection dataset, which is a time and processing power consuming task [81]. Therefore, further enhancements of image recognition backbones pre-trained on ImageNet were investigated. Instead of changing some layers in a single backbone, several identical backbones, differentiated into assistant backbones and one lead backbone, are connected to form a Composite Backbone Network (CBNet) [81]. The novel composite connections are lateral connections in each stage of the neighbouring backbones, following the so-called Adjacent Higher-Level Composition scheme. After each convolutional block in a network, the feature maps are passed to the neighbouring network as input to the same stage. To this end, Adjacent Higher-Level Composition upsamples the feature map and performs a 1 × 1 convolution in a bottleneck style to reduce the feature map depth. Following that approach, features are passed from assistant to assistant until the lead backbone is reached, which then passes the feature maps from different stages, and therefore in different scales, to the detector. By using three connected backbones as CBNet and the most recent member of the R-CNN family, the Triple-ResNeXt-152 Cascade Mask R-CNN model reached a state-of-the-art AP of 53.3% on MS-COCO in 2019.
With RPN, FPN, RoI pooling and the later RoI align, cascading classifier and regression heads and finally composite backbones, the most important modules and insights for object detection with two-stage detectors were developed in close relation to the R-CNN family. However, those modules are not exclusively made for the R-CNN design, and most of them not even for two-stage detectors. This means that the introduced models and insights for object detection with CNNs are flexible and a foundation for highly task specific architectures. However, this flexibility comes at the price of highly complex models. One-stage detectors, on the other hand, tend to be less complex: they use predefined anchors to extract features after the convolutional backbone and pass them directly to a detector head, resulting in computationally efficient and more streamlined architectures. To complete the review on object detection models, a brief overview of a selection of one-stage detectors is now provided.

One-Stage Detectors
In 2015, Redmon et al. [86] introduced the first member of the YOLO (You Only Look Once) family, called YOLO-V1 in this review. The motivation for the whole YOLO family is representative of one-stage detectors: they are lightweight models that perform accurate real time detection at more than 20 fps, for example to bring object detection to mobile platforms. The YOLO-V1 detector separates the feature map which comes from the backbone into S × S cells. In each cell, YOLO-V1 looks for object centres and predicts object class and bounding box simultaneously. This approach results in many boxes, which are then sorted out by NMS based on a class agnostic objectness score. On PASCAL-VOC 2007, YOLO-V1 performed with a mAP of 63.4%, trailing the Faster R-CNN at the time, but with 45 fps, where the Faster R-CNN processed at 7-18 fps, depending on the backbone [86]. YOLO-V2 [87] introduced multi scale training, a custom and fast backbone called DarkNet-19, batch normalisation after convolutional layers, a fully convolutional design and, more importantly, the use of anchor boxes. The dimensions of the anchors are chosen by clustering the training dataset using k-means with five cluster centres. This approach is crucial, since well-chosen anchors have a high impact on the detection of objects. Thereafter, YOLO-V2 predicts five bounding boxes at each pixel of the last feature map after the backbone. The largest model, with 17 × 17 pixels in the last feature map, performed on PASCAL-VOC 2007 with 40 fps and a mAP of 78.6%. Notably, it thereby matched state-of-the-art (sota) performance on this dataset, whereas its 21.6% AP on MS-COCO trailed sota models at the time of its publication.
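The NMS step that removes redundant boxes can be sketched as a greedy procedure: boxes are visited in order of descending score, and a box is kept only if it does not overlap too strongly with an already kept box. The IoU threshold of 0.5, the box format (x1, y1, x2, y2) and the function names are assumptions for this illustration and are not taken from the YOLO implementations.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it overlaps little with every higher scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For three boxes of which the first two overlap strongly, only the higher scoring one of the pair and the distant third box would remain.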
Finally, YOLO-V3 [88] uses ResNet-inspired building blocks for its DarkNet-53 backbone and a FPN-inspired detector that passes features from high semantic to lower semantic layers. This enhancement in scale invariance tackled the issue of its ancestors, which perform poorly on smaller objects [86,88]. The model detects on three differently scaled feature maps using nine anchors, defined by k-means as in YOLO-V2, which results in three anchors per scale. YOLO-V3 performs with 33% AP on MS-COCO, now trailing sota networks of two and one stage designs, but outperforms them in speed, by a factor of 2.32 compared to RetinaNet [88,90]. The YOLO-V3 design is thus partly similar to the earlier proposed Single Shot MultiBox Detector (SSD) [89], which also performs detection on differently scaled feature maps, but uses manually predefined anchors. With a VGG-16 backbone, the performance of SSD was reported with an AP of 26.8%.
The above-mentioned RetinaNet [90] mainly focusses on an adaptive loss function, called focal loss, to tackle the problem of unbalanced foreground-background classes, which is explained here. Due to the lack of a RPN module, one-stage detectors sample many predefined anchors densely over the whole feature map. This leads to many boxes which classify background, the easy negatives, and just a few boxes which classify objects. As a consequence, the few harder to classify foreground examples contribute only a small share of the gradients, whereas the many easier to classify background examples together make up an overwhelmingly large part of the loss and therefore of the gradients. This unbalanced loss leads to a minimal training effect with respect to the real objects. To overcome this problem, the focal loss down-weights the loss contribution which comes from easy negatives, in order to focus more on the harder to classify foreground examples. Since the advances of RetinaNet focus on the loss, its architecture consists mostly of established modules such as a FPN enhanced ResNet-101 backbone connected to a fully convolutional classifier and bounding box regression. Nevertheless, due to the focal loss, RetinaNet matches sota performance with 39.1% AP at 5 fps.
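The down-weighting idea can be made concrete with a small sketch of the focal loss for a single binary prediction, written in the commonly used form FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The default values γ = 2 and α = 0.25 as well as the function name are assumptions for this example; with γ = 0 the expression reduces to a weighted cross entropy.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted foreground probability, y: label (1 foreground, 0 background).
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma is close to 0 for well classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For an easy background anchor predicted with p = 0.01, the modulating factor is 0.01² and the loss almost vanishes, whereas a hard foreground anchor predicted with p = 0.1 keeps most of its cross entropy loss.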
In late 2019, Tan et al. [93] introduced EfficientDet, a one-stage detector which further narrowed the gap towards two-stage detector performance. It is based on the novel EfficientNet family for image recognition, which is used as backbone and whose compound scaling approach is adopted as well [52]. Instead of a FPN at the end of the EfficientNet, they redesigned it to BiFPN. In it, they use the improvement of PANet, an additional bottom up path, and, orthogonal to the common top down and bottom up paths, cross scale lateral connections, first proposed in NAS-FPN [92] (see Figure 13, right). Feature connectivity across the FPN module is therefore enhanced even on a consistent scale level within the feature pyramid by a bidirectional flow, hence BiFPN. The best performance on MS-COCO was reached with the EfficientDet-D7 variant using auto-augmentation [137] with 52.2% AP [93].
As in image segmentation, object detection architectures focus heavily on multiscale feature exploitation, but instead of atrous convolution, the FPN is widely used. Two-stage detectors dominate the object detection designs when it comes to AP performance. Nevertheless, with the EfficientDet family, one-stage detectors improved recently. What remains an advantage of two-stage detectors is their modularity, which makes them easily adaptable to new modules and concepts and to other types of imagery data. Furthermore, the initialisation of the ratio of anchor boxes and the selection of the IoU which defines a true positive should always be considered when designing the architecture to match the dataset at hand. This was pointed out by discussing the anchor initialisation, e.g., of the YOLO architectures, and the cascading IoU in Cascade R-CNN. Table 4 summarises those cornerstones and provides an overview of the evolution and the onset and use of specific cornerstones in both two- and one-stage detectors.

Table 4. Summary of the evolution of convolutional neural networks (CNNs) for object detection and the above-discussed cornerstones, divided into two- and one-stage detectors. The Backbone is the reported CNN used for feature extraction; RPN shows if a region proposal network is used; RoI describes which kind of pooling is used for the regions of interest; Anchors describes if prior anchor boxes are utilised; NAS describes if neural architecture search was used; Multiscale Feature describes the approach of multiscale feature extraction; and AP describes the performance on the MS-COCO dataset.

In a recent example in Earth observation, Ding et al. [138] presented the application of an established Faster R-CNN with a FPN enhanced ResNet-101 backbone for object detection on the DOTA dataset [100], which contains 2806 optical, aerial images of very high resolution with 15 object classes. In their application, they demonstrated how to adapt the region proposal step to the needs of densely cluttered and arbitrarily rotated objects in Earth observation data. Their approach takes the rotation of objects and their specific height-width ratio into account by adding an angle offset for rotation and a spatial transformation of the RoI to fit the object's height-width ratio. The adjacent RoI align extracts much more object specific features from these rotated RoIs, resulting in state of the art performance on the DOTA dataset [138]. This study is one example of how to leverage the modularity of two-stage detectors to adapt the model to the needs of Earth observation data.
Independent of the task, image recognition, image segmentation or object detection, we want to mention that careful feature selection, an efficient training process and the choice, combination and possibly weighting of loss functions might have huge impacts on the model performance. The decision for an appropriate architecture is the first step but in DL more has to be considered to make the model perform well for the problem and data at hand. However, the next steps in training and fine tuning the network are more dependent on the combination of the task and data as well as used hardware. Due to this, we do not discuss them in this introduction to the cornerstones of DL architectures but want to highlight their importance and refer to open question number 9 "How to train and optimise the DL system?" (p. 32) by Ball et al. [8] for more details.

Deep Learning Frameworks
DL specific frameworks are important tools for investigating and using the architectures introduced in Section 3. Among others, such frameworks should provide building blocks to create architectures, input data pipelines to feed labelled data efficiently during the training process and training schemes to fit a model to the training data, to name some important aspects. Over the years, while architectures were proposed and optimised, DL frameworks were introduced as well. They opened the method to a wider community by making the application of DL increasingly easier. Here, the evolution of the two most popular DL frameworks, TensorFlow [12] and Pytorch [13], and their closest relatives is outlined.
Theano, developed from 2007 as a Python package for ML, eventually became a DL framework. One strength of Theano was the calculation of gradients by automatic differentiation of designed networks, as well as CPU and GPU processing, which inspired today's leading DL frameworks [139,140]. Its officially supported development ended with version 1.0 in November 2017 [141]. For an easier access to Theano, and later TensorFlow and other frameworks, Keras was introduced in May 2015 [142]. It became popular mainly because of its user friendly handling for a wider research community interested in applying DL. The main author is associated with Google, hence Keras is closely related to the TensorFlow library, developed by Google Brain and published later in November 2015 [12]. TensorFlow's handling was criticised for time intensive prototyping, mainly due to its session initialisation. However, in September 2019, this major drawback was fixed by introducing eager mode in TensorFlow 2, which also includes Keras as high level API. With this union, Keras provides easy access to TensorFlow in a native way, and at the same time it is now possible to have full access to core TensorFlow functionality if necessary without leaving the framework [12,143,144].
The evolution of Pytorch, which is supported by Facebook AI Research (FAIR), starts with Caffe, which was originally developed by Jia et al. [145] in 2013. Caffe is a DL framework specialising in CNNs and image processing. In April 2017, Facebook released Caffe2 as an open source derivative and finally merged it with Pytorch one year later in May 2018 [146]. Pytorch [13] is mainly based on the Torch library and as the name suggests developed for Python, same as for Keras and TensorFlow. Both frameworks, Pytorch and TensorFlow, are open source and today's most popular and widely used in research and practice [140].
From an Earth observation perspective, data, features and labels are commonly processed and visualised in a geo information system (GIS). Commercial tools such as ArcGIS as well as open source solutions such as QGIS provide accessibility to the above mentioned frameworks and therewith support the use of DL in a spatial data environment. For ArcGIS, the tools are implemented in the Image Analyst extension [147] supporting TensorFlow, Keras, Pytorch and CNTK. For QGIS the established Orfeo ToolBox for ML [148] supports DL via the remote OTBTF (Orfeo ToolBox meets TensorFlow) module [149], which uses TensorFlow as backend. Another open source project which leverages QGIS is rastervision [150] which supports TensorFlow, Keras and Pytorch.

Earth Observation Datasets
Given architectures and frameworks as tools, the application on Earth observation data is the next step to bridge the gap between computer vision and applied Earth observation. Since overhead imaging data differ from so-called natural images, it is questionable whether the architectures developed for the datasets presented above also perform well on Earth observation data. The main differences between Earth observation data and natural images as they were used in the datasets mentioned are [6,134,151]:
• The position of the sensor in Earth observation data has mostly an overhead perspective relative to the scene, whereas a natural image is captured from a side looking perspective, hence the same object classes appear differently.
• Data intensively used in computer vision are often three channel RGB images, whereas Earth observation data often consist of a multichannel image stack with more than three channels, which has to be considered, especially when transferring models from computer vision to Earth observation applications.
• Computer vision model input data are often from the same sensor and platform, whereas in Earth observation both can change and data fusion has to be incorporated into the model.
• Objects which appear in overhead images do not have a general orientation. That means that objects of the same class commonly appear at any rotation angle within 360°, which has to be considered in training data, architecture or both. In natural images, in contrast, bottom and top of the image, and therewith also of the pictured objects, are often defined more specifically, so that a general orientation of objects can be expected.
• In natural images, objects of interest tend to be in the centre of the image and in high resolution, whereas in Earth observation data the objects can lie off nadir or at borders with coarse resolution.
• In Earth observation data, objects or classes tend to be more densely packed and heterogeneous than in natural images.
Due to these properties of Earth observation data, the tasks of image recognition, image segmentation and object detection can be considered more challenging. To cope with this problem, DL models which were fitted to computer vision datasets are fine-tuned on Earth observation datasets. Those Earth observation datasets are often smaller and therefore it is common practice to refine models which have already learned how to hierarchically extract features from imagery data on larger computer vision datasets. This refinement of models, originally trained on other datasets, is known as transfer learning. Already optimised parameters for a computer vision task are refined to an Earth observation task which can be seen as transferring learned skills to apply them in a different context [152][153][154]. However, even for transfer learning, special Earth observation datasets have to be created. In Table 5, popular Earth observation datasets are presented and also well performing architectures, best practice examples or the baseline models of the datasets are associated in the last column.

Table 5. Summary of DL datasets, with abbreviations for longer names, grouped by their tasks image recognition (IR), image segmentation (IS) and object detection (OD). In the column Topic, the abbreviation LULC means Land Use Land Cover classification. All datasets are freely available, with the need to contact the author for a few of them. The last column shows example applications by briefly describing the architectures.

When looking at the tasks in combination with platforms, sensors and resolution, a pattern can be observed that spaceborne platforms with multispectral sensors which provide lower resolution are used for image recognition of whole image chips, so-called scene labelling. On the other hand, for image segmentation and object detection sensors with a higher resolution are used. Examples are spaceborne systems, mainly WorldView, and airborne platforms with sensors of very high spatial resolution. This underlines the necessity of richer feature information for image segmentation and object detection tasks than for image recognition. Since sensors with higher resolution provide this information depth, optical and multispectral sensors are dominating the selection for image segmentation and object detection tasks presented in the overview.
To cope with the special properties of Earth observation data, the best performing architectures are mostly modified versions of the architectures discussed in Section 3 or completely new designs made especially for the requirements of Earth observation data. For the studies selected here, the ResNet family is the most widely used backbone architecture. Its layer depth is rather shallow, between 18 and 101 layers, compared to the 152 layers commonly chosen in computer vision related designs.
In image segmentation, encoder-decoder models dominate, often due to their complex but modular design, which can be adapted to the properties of Earth observation data. The same is true for object detectors: more complex two-stage detector models are used more often than one-stage detectors. However, what stands out even more when reviewing the evolution of Earth observation datasets and models for object detection is their ability to deal with rotated bounding boxes. This becomes necessary because objects in Earth observation data appear at arbitrary orientations, which, when not considered, can easily lead to a large number of false negatives when objects are densely accumulated [100,138].

Future Research
The advances in DL models for image recognition, image segmentation and object detection since 2012 brought a better general understanding of hierarchical feature extraction from imaging data. The major findings are the use of feature pyramid networks, atrous convolutions, image context exploitation and the use of features from different stages in the network. Overall, these techniques focus on the extraction and combination of multiscale features. With this increasing understanding, features can be used more effectively and models have become better in general. However, with the recent emergence of NAS-induced network designs, architectures are now able to reach better or equal performance by using fewer parameters. This is partly due to the fact that they are now specifically trained to handle a specific task and, even more importantly, a specific type of dataset. Since NAS uses the reward signal of how designs perform on tasks and datasets to optimise the architecture, it can be argued that those architectures are no longer as easily transferable to other domains. Reflecting on this development from an Earth observation perspective, the new advances in leading architectures which come from the computer vision domain might not directly match the properties of Earth observation data due to their now more specific design for an underlying computer vision dataset. To take advantage of NAS, it should also be used for optimising Earth observation specific architectures or modules. Since NAS is computationally very expensive, it remains questionable how fast and how widely this technique will enter the Earth observation community compared to the established hand crafted designs.
Besides architectures, datasets are highly important for pushing DL itself and also its use in specific domains. The great effort required to create a DL dataset, and the resulting lack of datasets for specific tasks, remains a major concern in Earth observation research. Most importantly, the diversity of sensors, labelled classes and topics, as well as the size of datasets, should be increased. With a growing number and size of datasets, insights about the properties of Earth observation data with respect to DL models will be gained. Using these insights, DL models can be better optimised for the properties of the data of interest [8] instead of solely relying on the findings made in computer vision, which uses different kinds of data. The example of U-Net [67], proposed in medical image analysis, shows the impact that customised architectures can have when designed for specific data. This was also shown recently in Earth observation by Azimi et al. [181], who introduced the SkyScapes dataset and the custom designed SkyScapesNet, which combines cornerstones of CNN architectures to create a model that matches the specific properties of Earth observation data. On the other hand, focus should also be placed on training and designing DL models with small datasets [8], for example with the established use of transfer learning and data augmentation, as well as weakly supervised learning [57,190,191,192].
The observation that well performing deep learning models for Earth observation use shallower ResNet family backbones than models for computer vision tasks raises the question of the optimal depth of network designs, balancing accuracy and overfitting. Simply using deeper models to gain higher scores will eventually lead to overfitted models [6]. Hence, the relatively shallow depth of the well performing architectures presented in Table 5 encourages a closer look at optimal network depth. By building architectures with as few parameters as possible, more generalised models are produced. That would contribute to better transferability of models, which is another open issue for DL.
Because of the trends in DL architectures and growing dataset diversity, we argue that a thorough knowledge of the cornerstones of DL is important to assess the vast number of models in order to find the designs that match restrictions in data, hardware and time by balancing performance trade-offs [131]. Furthermore, a well-founded understanding of DL is highly necessary for Earth observation researchers when it comes to incorporating knowledge of physical and ecological relationships into DL models [7,9]. This ability will make the difference between purely data driven Earth observation, which tends to remain a black box, and more understandable models that combine data science and geoscientific expert knowledge.

Conclusions
The emergence of deep learning (DL) led to the adaptation of models developed in computer vision for applications in Earth observation and provided novel possibilities to analyse remotely sensed data. Nevertheless, the entry barriers remain high for Earth observation scientists who want to use DL models. To lower them, this review provided a fundamental introduction to the most popular type of DL model for image processing: the convolutional neural network (CNN). By discussing the evolution of CNNs in computer vision, we assigned main characteristics to specific tasks. For image recognition, they are:
• A so-called convolutional backbone extracts features from input data in a hierarchical manner by stacked convolutional operations. The repeated convolutions with non-linear activation increase the semantic meaning of features while going deeper into the model [2,3].
• A stack of fully connected artificial neurons uses the extracted features to predict the probability of the class.
• Such deep models need specific normalisation schemes such as batch normalisation to make supervised training of deep networks possible and faster [41].
• Residual connections further facilitate the training of increasingly deep architectures [43].
• To enable more complex networks, design elements such as bottleneck layers reduce intermediate feature depth [40] and the factorisation of convolutional operations reduces the number of parameters [42].
• The recent findings in neural architecture search (NAS) bring together complex network structures and efficient usage of parameters. With NAS, architectures are searched for by an artificial controller. This controller tries to maximise specific metrics of the networks it creates iteratively and thereby finds highly efficient architectures [48,52].
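The effect of bottleneck layers and factorised convolutions on the number of parameters can be made concrete with a short calculation. The following sketch only counts convolution weights and ignores biases and normalisation layers; the channel sizes (256 reduced to 64) are assumptions chosen for illustration, and the helper name `conv_params` is hypothetical.

```python
def conv_params(kernel_size, channels_in, channels_out):
    """Number of weights of a kernel_size x kernel_size convolution (no bias)."""
    return kernel_size * kernel_size * channels_in * channels_out

channels = 256  # assumed feature depth entering the block
reduced = 64    # assumed reduced depth inside the bottleneck

# Plain block: two 3x3 convolutions at full feature depth.
plain = 2 * conv_params(3, channels, channels)

# Bottleneck block: 1x1 reduce, 3x3 at reduced depth, 1x1 expand.
bottleneck = (conv_params(1, channels, reduced)
              + conv_params(3, reduced, reduced)
              + conv_params(1, reduced, channels))

# Factorisation: two stacked 3x3 convolutions cover the same 5x5
# receptive field as a single 5x5 convolution, with fewer weights.
single_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)

print(f"plain block:      {plain:>9,d} weights")
print(f"bottleneck block: {bottleneck:>9,d} weights")
print(f"5x5 vs two 3x3:   {single_5x5:,d} vs {two_3x3:,d} "
      f"({two_3x3 / single_5x5:.0%} of the weights)")
```

Under these assumptions the bottleneck block needs roughly 6% of the weights of the plain block, and the factorised 5x5 convolution needs 72% of the weights of the unfactorised one.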
Improvements in image segmentation and object detection mainly focus on the so-called heads of the architectures, which use the backbones developed in image recognition. The output of image segmentation, a segmentation mask where each pixel is assigned to one class, has the same resolution as the input image. Since the feature extraction first downscales the input image, image segmentation models have to recover the input resolution. These opposing operations lead to a trade-off between high resolution and feature depth. The following properties of CNNs for image segmentation were found to handle this problem effectively:
• Encoder-decoder models, which first encode information by extracting features and then use the downscaled feature maps from different stages of the encoder to recover input resolution in the decoder, are the most popular designs, namely U-Net [67] and DeepLabV3+ [60].
• So-called atrous convolutions [58], which maintain resolution while extracting features, are widely used to cope with the trade-off between high resolution and feature depth.
• The combination of feature maps of different scales with context information from image level was found to contribute effectively to pixel-wise classification [59].
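How an atrous convolution enlarges the field of view without reducing resolution can be sketched in one dimension. The code below is a minimal illustration in plain Python under simplifying assumptions: a single channel, zero padding, a kernel with an odd number of taps and, as is usual in DL frameworks, no kernel flipping. The function names are hypothetical.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D atrous convolution with zero padding; output length equals input length.

    rate is the spacing between kernel taps; rate=1 is an ordinary convolution.
    """
    k = len(kernel)
    span = (k - 1) * rate            # distance between first and last tap
    pad_left = span // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (span - pad_left)
    return [sum(kernel[j] * padded[i + j * rate] for j in range(k))
            for i in range(len(signal))]

def effective_kernel_size(k, rate):
    """Extent of the input covered by a k-tap kernel with the given rate."""
    return k + (k - 1) * (rate - 1)

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 1, 1]

dense = atrous_conv1d(signal, kernel, rate=1)   # sums direct neighbours
atrous = atrous_conv1d(signal, kernel, rate=2)  # sums every second sample

print(dense)
print(atrous)
print(effective_kernel_size(3, 1), effective_kernel_size(3, 2), effective_kernel_size(3, 4))
```

Both outputs keep the eight samples of the input, while the three kernel weights cover 3, 5 or 9 input samples for rates 1, 2 and 4. The field of view therefore grows without additional parameters and without downscaling.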
In object detection, a bounding box has to be provided in addition, which gives precise localisation information. Objects of different size and density should be detected equally well. This leads to problems which relate to multiscale feature extraction. The most effective modules developed for object detection are:
• Two-stage object detectors show both good performance and adaptability. The most popular detectors are the Faster R-CNN [76] design and its successors. In the first stage, they propose class-agnostic regions of interest (RoIs) for objects. During the second stage, those RoIs are classified and the bounding box is regressed to tight object boundaries.
• For multiscale processing, the feature pyramid network (FPN) [77] enhances the convolutional backbone by merging high semantic features with precise localisation information.
• Cascading classifiers and bounding box regression suppress noisy detections by iteratively refining RoIs [79].
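The merging step of a feature pyramid can be sketched as follows. This is a strongly simplified illustration and not the FPN implementation itself: it uses single-channel feature maps stored as nested lists, and it omits the 1x1 lateral convolutions and the 3x3 smoothing convolutions that a full FPN applies. The feature values and function names are assumptions.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour upsampling of a 2D map (list of rows) by a factor of 2."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def merge_top_down(coarse, fine):
    """Add the upsampled coarse (semantically strong) map to the fine map."""
    upsampled = upsample_nearest_2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(upsampled, fine)]

# Assumed toy features: a deep 2x2 map and a shallower 4x4 map.
coarse = [[10, 20],
          [30, 40]]
fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

merged = merge_top_down(coarse, fine)
for row in merged:
    print(row)
```

The merged map keeps the 4x4 resolution of the shallow stage while every position also carries the response of the deeper stage, which is the idea behind combining high semantic features with precise localisation information.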
Atrous convolution, image context aware designs and FPN show how important multiscale feature exploitation is for image segmentation and object detection. Moreover, these findings can be applied regardless of the task for which they were originally developed. This leads to fruitful development but is also the reason for the high entry barriers, since the interrelations of different approaches have to be considered. To narrow the gap between theoretical concepts and the application of DL to Earth observation research, an overview of Earth observation related datasets was given and recent trends were discussed from an Earth observation perspective. The major findings are:
• Building models with NAS might lead to overly optimised architectures for specific tasks and datasets. Therefore, it is questionable whether such models, which are currently successful in computer vision tasks, perform equally well in Earth observation, as was the case with hand crafted designs. However, NAS can also be used to find Earth observation specific designs.
• The number of DL datasets for Earth observation applications is still small in relation to possible applications and sensor diversity. Since datasets are highly important to push the understanding of the interaction between DL models and specific types of data, an increase in datasets has huge potential for further advances for DL in Earth observation.
• Besides more datasets, weakly supervised learning provides encouraging results as an alternative to expensive dataset creation. It is especially important for proof-of-concept studies and experimental research.
Reflecting on the differences and commonalities of the data and architectures used in computer vision and Earth observation, we state that the advances in computer vision have to be adapted to match Earth observation applications. Therefore, a thorough understanding of DL concepts is crucial for assessing and adapting models. With the extensive introduction to CNNs provided here, we created a foundation to closely review the application of DL in the field of Earth observation research. In Part II of this survey, we will use this basis to further discuss the recent trends of CNNs for image segmentation and object detection by reviewing published findings in leading Earth observation journals.