Review

A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition

1
College of Global Change and Earth System Science, Beijing Normal University, Beijing 100875, China
2
Chinese Research Academy of Environmental Sciences, No. 8, Da Yang Fang, An Wai, Chao Yang District, Beijing 100012, China
3
Institute of Computing, Modeling and Their Applications, Clermont-Auvergne University, 63000 Clermont-Ferrand, France
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work and should be considered co-first authors.
Remote Sens. 2023, 15(3), 827; https://doi.org/10.3390/rs15030827
Submission received: 17 December 2022 / Revised: 14 January 2023 / Accepted: 29 January 2023 / Published: 1 February 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract:
In recent years, remote sensing target recognition algorithms based on deep learning have gradually become mainstream in the field of remote sensing because deep learning has greatly improved the accuracy of image target recognition. In research on remote sensing image target recognition based on deep learning, an insufficient number of training samples is a frequently encountered issue, and too few samples cause the model to overfit. To solve this problem, data augmentation techniques have developed alongside the popularity of deep learning, and many methods have been proposed. However, to date, there is no literature that expounds and summarizes the current state of research on data augmentation for remote sensing object recognition, which is the purpose of this article. First, based on the essential principles of data augmentation methods, the existing methods are divided into two categories: data-based data augmentation methods and network-based data augmentation methods. Second, this paper subdivides and compares each method category to show the advantages, disadvantages, and characteristics of each method. Finally, this paper discusses the limitations of the existing methods and points out future research directions for data augmentation methods.

Graphical Abstract

1. Introduction

Image target recognition realizes image recognition by comparing the stored information with the current information [1]. With the continuous development of image target recognition technology, its application in remote sensing images has become increasingly common. Research on remote sensing image target recognition mainly includes the identification of ports [2], ships [3], buildings [4], electrical towers, water bodies, and vegetation [5], among others.
Although the selection of data augmentation methods must consider the individual characteristics of different image types and domains, all methods share one core principle: add as much variation as possible without changing the original semantic information of the image. For example, rotation and vertical flipping can change the semantics of natural images, so they are rarely used for natural image tasks; however, they are well suited to remote sensing images. In terms of adding variation, images taken in natural environments are affected by different lighting conditions, and the color gamut transform can simulate different lighting environments. For remote sensing images, however, the spectral information of ground features, i.e., the color information, is very important for interpretation; the color gamut transform can easily alter the original spectral information of the features in the image and lead to incorrect recognition. In addition, both natural images and remote sensing images are often affected by occlusion in content understanding, such as the occlusion of the background by the foreground in natural scenes and the occlusion of ground features by clouds in remote sensing scenes; cropping and local erasure methods can improve the robustness of the model against occlusion. Medical images, by contrast, are acquired with a different imaging method and do not suffer from occlusion.
In deep learning, if the model is complex and contains many parameters, it may predict the known data well but the unknown data poorly; this phenomenon is called overfitting [6,7]. In research on remote sensing image target recognition using deep learning, severe data shortages are often encountered, and a dataset that is too small easily leads to overfitting. Because of its structure, a deep neural network has far stronger expressive ability than traditional models and therefore requires more data to avoid overfitting and to ensure that the trained model performs well on new data [8,9]. In response to the problem of overfitting, the concept of data augmentation was introduced: training on similar but not identical samples added to the training data [10]. In a broad sense, any method that prevents network overfitting can be regarded as data augmentation; whether network capacity is reduced or data are increased, the effect is the same. Although controlling network capacity can improve generalization, the effect is limited, and the network parameters depend strongly on the network structure. Data augmentation neither reduces the capacity of the network nor increases the computational complexity or parameter-tuning effort; it is an implicit regularization method, which makes it more meaningful in practical applications [11]. In addition, data augmentation is closely related to the problem of data imbalance and is one of the important methods for addressing imbalanced datasets; all data-based data augmentation methods in this paper can be used as important means to solve this problem [12].
In recent years, many scholars have reviewed data augmentation based on deep learning. Mikolajczyk et al. compared and analyzed various data augmentation methods in image classification tasks, provided representative examples, and proposed their own data augmentation method based on an image style transfer technique [13]. Nalepa et al. reviewed the latest advances in the application of data augmentation techniques for brain tumor magnetic resonance images and verified the effectiveness of some of the data augmentation methods [14]. Shorten et al. [15] reviewed algorithms for image augmentation (geometric transformation, color space augmentation, kernel filters, blending images, random erasure, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning) and briefly discussed features of data augmentation, such as test-time augmentation, resolution impact, and dataset size. Lemley et al. [16] summarized and discussed the most important algorithms in advanced data augmentation strategies. Chlap et al. [17] conducted a systematic review of the literature on training deep learning models using data augmentation for medical images (limited to CT (computed tomography) and MRI (magnetic resonance imaging)). Song et al. [18] extensively surveyed recent papers related to FSL (few-shot learning), outlined its recent progress, and fairly compared the strengths and weaknesses of the existing works; furthermore, based on the FSL challenge, they classified the existing work according to the level of knowledge abstraction.
The above reviews mainly focus on the recent development of data augmentation and on summarizing data augmentation methods. To the best of our knowledge, there is no extensive review of deep learning-based remote sensing image data augmentation methods. Remote sensing imagery is the research basis for many fields, and with the rapid emergence of new data augmentation methods in recent years, it is necessary to summarize their development and provide a comprehensive review for scholars and practitioners. In addition, we identify future research directions for data augmentation methods for remote sensing images. Statistics on the number of references by year are shown in Figure 1.
We can see from the figure that from 1998 (when the concept of data augmentation was proposed) to 2013, the development of data augmentation technology was slow. Since 2013, data augmentation technology has rapidly developed, and it reached a peak in 2018. Comparing the development process of deep learning, it can be seen that the development trend of data augmentation technology and deep learning is consistent.
The authors of [1,2,3,4,5] detail the application area of data augmentation; refs. [6,7,8,9] detail the application of data augmentation. The authors of [10] introduced the concept of data augmentation, and [11,12] presented the theoretical basis of data augmentation. Refs. [13,14,15,16,17,18] are reviews of data augmentation, and [19,20,21,22,23,24,25,26,27,28,29] detail the theoretical basis and applications related to one-sample transformation methods. Refs. [30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47] detail the theoretical basis and applications related to multi-sample synthesis methods. The authors of [48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87] provide the theoretical basis and applications related to deep generative modeling methods. The authors of [88,89,90,91,92,93,94,95] detail the theoretical basis and applications related to virtual sample generation methods, and [96,97,98,99,100,101,102,103,104,105] detail the theoretical basis and applications related to transfer learning methods. Refs. [106,107,108,109] describe the theoretical basis and applications related to regularization methods. The authors of [110,111,112,113,114,115,116,117] explain the theoretical basis and applications related to meta-learning methods. Lastly, [118,119,120] detail the theoretical basis and applications related to reinforcement learning methods.
Figure 1. Statistics of the number of references in terms of years. (1998: [10], 1999: [31], 2000: [30], 2002: [40], 2004: [6], 2005: [44], 2008: [2,96], 2009: [42,100,119], 2010: [98], 2011: [9,74], 2012: [10,97], 2013: [4,43,50,99,101], 2014: [46,55,64,72,77,104,105,106], 2015: [52,54,56,57,58,67,80], 2016: [23,53,60,61,75,79,89,103,108], 2017: [7,22,26,29,32,41,65,66,71,72,78,81,90,102,110,112], 2018: [5,8,11,12,13,19,24,28,34,35,36,45,47,59,68,70,76,107,109,110,115,118], 2019: [14,15,25,27,37,38,39,82,87,91,95], 2020: [1,3,16,33,83,84,86,92,93,120], 2021: [17,20,21,62,69,85,88,94,111,116], 2022: [18,48,63,117]).
The rest of this article is organized as follows. Section 2 summarizes, analyzes, and compares data-based data augmentation methods based on the essential principles of the method. Section 3 summarizes network-based data augmentation methods. Section 4 discusses the challenges and developments in data augmentation based on remote sensing imagery. Section 5 concludes this review.

2. Data-Based Data Augmentation Methods

The data-based data augmentation method starts from the original data and increases the amount of data by different means; increasing the amount of data helps satisfy the large data requirements of current deep learning models. Theoretically, data augmentation is a collection of artificial transformations of the original training set with the labels unchanged. The mathematical relationship can be expressed as Equation (1):
ϕ: S → T
where ϕ defines the transformation from S to T, S represents the original training dataset, and T represents the augmented set of S. Therefore, the artificially augmented training set can be expressed as Equation (2):
S′ = S ∪ T_1 ∪ … ∪ T_n
where S′ contains the original training set S and the corresponding transformed datasets, T_n represents the augmented set of S generated by the n-th method, and n = 2, 3, 4, …. Data augmentation does not change the label information in the original training dataset; that is, if the label of image x belongs to class b, then the label of ϕ(x) also belongs to class b [19]. A summary of the data-based data augmentation methods is shown in Table 1.
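As a minimal illustration of Equations (1) and (2), the following Python sketch builds S′ as the union of the original set S with label-preserving transformed copies. The transform functions (rotate90, hflip) and the list-of-tuples dataset layout are illustrative assumptions, not the implementation used in any cited study.

```python
# A minimal sketch of Equation (2): the augmented training set S' is the union of
# the original set S with n label-preserving transformed copies T_1, ..., T_n.
import numpy as np

def rotate90(img):          # phi_1: geometric rotation, label unchanged
    return np.rot90(img)

def hflip(img):             # phi_2: horizontal flip, label unchanged
    return np.fliplr(img)

def augment(dataset, transforms):
    """dataset: list of (image, label); returns S' = S ∪ T_1 ∪ ... ∪ T_n."""
    augmented = list(dataset)                           # start with S
    for phi in transforms:                              # each phi defines one T_i
        augmented += [(phi(x), y) for x, y in dataset]  # labels are preserved
    return augmented

# usage: s_prime = augment(train_set, [rotate90, hflip])
```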

2.1. One-Sample Transform

The single-sample transformation method operates on individual samples, changing the original data through geometric transformation, color transformation, sharpness transformation, noise disturbance, and random erasure to generate new data that differ from the original data. In remote sensing image data, color is an important factor in image interpretation; therefore, color transformation is not suitable for the data augmentation of remote sensing images and is not discussed in this paper. Because the single-sample transformation method is simple to operate and has a low time cost, it is widely used in data augmentation [20].

2.1.1. Geometric Transformations

Geometric transformations change the geometry of an image by mapping pixel values to new destinations. Geometric transformations generate new data by rotating, scaling, flipping, shifting, and cropping the image. The basic shape of the class in general imagery is preserved, but its position and orientation are changed.
The rotation transformation randomly rotates the angle of the image and changes the orientation of the image content (Figure 2a). The scaling transformation enlarges or reduces the image according to a certain ratio (Figure 2b). The flip transformation flips the image along the horizontal or vertical direction (Figure 2c). The displacement transformation translates the image plane in a certain way. The translation range and translation step can be specified in a random or artificial way, and the translation is performed in the horizontal or vertical direction to change the position of the image content (Figure 2d). The cropping transformation randomly crops the image according to the given size (Figure 2e, which shows random cropping according to the original size). Among the techniques, displacement transformation and cropping transformation have strong practicability because the image displacement model can traverse across all hierarchical features of the image, thereby effectively improving the training effect of the model. However, these two methods have some defects; the selected target area may not contain the real target area or a large amount of key information is lost, resulting in incorrect network training labels [21].
Most of the remote sensing images are obtained by positioning the sensor vertically to the ground. The semantic category information of the image can be preserved through the rotation, flip, proper scaling, displacement, cropping, and other geometric transformations of remote sensing images. The data augmentation of geometric transformation increases the deviation of factors such as position and angle of view on the dataset and reduces the sensitivity of the model to the image, thereby improving the robustness of the model and finally achieving the purpose of improving the test accuracy. The results of [20] showed that cropping is the best way to improve recognition accuracy. The advantage of geometric transformation is that the operation is simple, and the original semantic annotation information of the image data can be preserved. However, geometric transformation also has disadvantages, such as data duplication and a limited amount of added information. It is precisely because of these shortcomings that in practical applications, the improvement of model accuracy by geometric transformation is very limited [19].
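As a hedged sketch of the geometric transformations described above, the following Python snippet composes rotation, flipping, translation, and scaled cropping with torchvision; the parameter values are illustrative choices, not settings recommended by the cited studies.

```python
# A sketch of geometric data augmentation (rotation, flip, displacement, scaling/cropping)
# applied to a PIL image; values are illustrative.
from torchvision import transforms

geometric_aug = transforms.Compose([
    transforms.RandomRotation(degrees=90),                     # rotation (Figure 2a)
    transforms.RandomHorizontalFlip(p=0.5),                    # flip (Figure 2c)
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # displacement (Figure 2d)
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),  # scaling + cropping (Figure 2b,e)
])

# augmented = geometric_aug(pil_image)  # a new variant is drawn each time it is called
```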

2.1.2. Sharpness Transformation

As the name suggests, sharpness transformation generates new data by changing the sharpness of the image (Figure 2f). Changes in sharpness are typically achieved by sharpening and blurring the image. Image sharpening reduces image blur by enhancing high-frequency components. After the remote sensing image is sharpened, more remote sensing information about the image can be highlighted. Image sharpening is widely used in the field of small target recognition for remote sensing images [22]. The principle of image blur processing is to reset the value of each pixel in the image to the value related to the surrounding pixels, such as the mean and median values of the surrounding pixels. Image blur is generally achieved by Gaussian blur, average blur, and median blur. Models trained on blurred images perform well on test sets with motion blur [20].
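A minimal Pillow sketch of the sharpness transformation described above, covering sharpening and Gaussian/median blurring; the file name and factor values are illustrative assumptions.

```python
# Sharpness transformation: enhancement factor > 1 sharpens, blur filters reduce sharpness.
from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("scene.tif").convert("RGB")           # hypothetical input image

sharpened = ImageEnhance.Sharpness(img).enhance(2.0)   # highlight high-frequency detail
gaussian  = img.filter(ImageFilter.GaussianBlur(radius=2))   # Gaussian blur
median    = img.filter(ImageFilter.MedianFilter(size=3))     # median blur
```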

2.1.3. Noise Disturbance

Noise perturbation generates new data by randomly perturbing the pixel RGB (red-green-blue) information of the image (Figure 2g is the image generated by the perturbation of salt and pepper noise). Introducing noise into an image introduces redundant and interfering information into the image to visually change the image quality. The improvement of the model’s ability to filter noise disturbance and redundant information can improve the model’s ability to recognize datasets with uneven image quality. Commonly used noise types in remote sensing images include Gaussian noise, salt and pepper noise, and speckle noise. In the noise perturbation method, because the synthetic aperture radar (SAR) imaging process is greatly affected by speckle noise, the speckle noise modeled by multiplication is more suitable [23]. Wang et al. [24] trained a convolutional neural network (CNN) with speckle noise augmentation, which performed well on test data. However, if the model is underfitting, the way the noise is perturbed does not generate new data, so this method does not substantially improve the performance of the model. Ma et al. [25] used Gaussian noise, salt and pepper noise, and other noise disturbance methods to perform data augmentation on a remote sensing image scene dataset (Aerial Image Dataset, AID). The results show that noise perturbation does not significantly improve the accuracy of the remote sensing scene classification task.
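The following Python sketch illustrates the three noise types mentioned above for an image normalized to [0, 1]; the additive Gaussian, salt-and-pepper, and multiplicative speckle models and their parameters are illustrative assumptions.

```python
# Noise disturbance applied to a float image with values in [0, 1].
import numpy as np

def gaussian_noise(img, sigma=0.05):
    return np.clip(img + np.random.normal(0, sigma, img.shape), 0, 1)

def salt_and_pepper(img, amount=0.02):
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0.0           # pepper pixels
    out[mask > 1 - amount / 2] = 1.0       # salt pixels
    return out

def speckle_noise(img, sigma=0.1):
    # multiplicative noise model, as suggested above for SAR imagery
    return np.clip(img * (1 + np.random.normal(0, sigma, img.shape)), 0, 1)
```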

2.1.4. Random Erase

Similar to Mixcut [26], random erasing generates new data by randomly erasing a region of image information on the image (Figure 2h). Random erasing is visually equivalent to adding an occlusion to the image. In theory, after information in the image is occluded, the model will learn other, more descriptive features in the image to prevent overfitting to a specific visual feature, thereby improving the robustness of the model under occlusion conditions. Random erasing not only helps the model cope with occlusion in the image but also ensures that the model learns the global features of the entire image rather than only local features. However, random erasure may remove information that is important for the object recognition task, which may leave objects in the image unrecognizable. Therefore, human intervention is necessary in random erasure to ensure the validity of the generated data. This article uses remote sensing images with a size of 800 × 800 pixels (px) as the original images. The Augmentor tool (Augmentor is a Python package designed to aid in the artificial generation and data augmentation of image data for machine learning tasks) processes the original image to generate new data, as shown in Figure 2.
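As a hedged usage sketch of the Augmentor package mentioned above, the following pipeline combines several single-sample transformations, including random erasing; the directory name, probabilities, and sample count are illustrative and are not the settings used to produce Figure 2.

```python
# A small Augmentor pipeline mixing geometric transforms with random erasing.
import Augmentor

p = Augmentor.Pipeline("input_images/")                 # hypothetical source directory
p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
p.flip_left_right(probability=0.5)
p.crop_random(probability=0.5, percentage_area=0.8)     # random cropping
p.random_erasing(probability=0.5, rectangle_area=0.2)   # erase a random rectangle
p.sample(1000)                                          # write 1000 augmented images
```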
Su [27] used noise perturbation, flipping, rotation, and other means to augment the data to improve ship recognition accuracy in a faster region-based CNN model. Ding et al. [24] augmented SAR data with translational transformation and noise perturbation to improve the recognition accuracy of military vehicles in CNNs. Wang, Z. et al. [28] generated new training images by manually cropping, noise perturbing, and flipping the original images to improve vehicle recognition accuracy in SSD networks (single shot multi-box detector; SSD is a real-time target detection method based on the convolutional neural network that is used for target detection in synthetic aperture radar images). Su [27] used methods such as flipping, adding Gaussian noise, and performing rotation to conduct data augmentation on data samples to improve the recognition accuracy of aircraft in a dense feature pyramid network (DFPN). Scott, G. J. et al. [29] extended a remote sensing image dataset with single-sample transformations (rotation, flip) to improve the robustness of the deep CNN (DCNN) model for remote sensing image data. Applying these techniques to the public UC Merced land use dataset, land cover classification accuracies of 97.8 ± 2.3%, 97.6 ± 2.6%, and 98.5 ± 1.4% were achieved using the CaffeNet, GoogLeNet, and ResNet models, respectively. These results show that the single-sample transformation method can effectively improve detection accuracy and simultaneously demonstrate the effectiveness and universality of the single-sample transformation. In addition, Table 2 shows the experimental results of different geometric transformation methods on the Caltech101 dataset.

2.2. Multi-Sample Synthesis

Unlike single-sample transformation, multi-sample synthesis artificially mixes information from multiple images to generate new data. Multi-sample synthesis is divided into two types: image space information synthesis and feature space information synthesis. Representative algorithms of image space information synthesis include Mixup, sample pairing, and between-class (BC), among others, and a representative algorithm of feature space information synthesis is SMOTE (synthetic minority oversampling technique).
Most label-preserving data augmentation methods can be represented by the following random Equation (3):
(x̃, ỹ) = f̃(x, y) = (f(x), y)
where x̃ = λ·x_i + (1 − λ)·x_j and ỹ = λ·y_i + (1 − λ)·y_j; x_i and x_j are raw input vectors; y_i and y_j are one-hot label encodings; (x_i, y_i) and (x_j, y_j) are two examples drawn at random from the training data; and λ ∈ (0, 1). f̃ is a linear combination of (x_i, y_i) and (x_j, y_j), and f is any function in a one-sample transformation. Based on the above, Summers and Dinneen proposed a general equation for data augmentation, Equation (4):
(x̃, ỹ) = f̃({(x_i, y_i)}, i = 1, 2)
This equation is highly general because it does not restrict the form of either function (f or f̃) appearing in Equation (3).

2.2.1. Spatial Information Synthesis

Image spatial information synthesis methods can be divided into two types: the linear stacking of multiple images and the nonlinear blending of multiple images.
(1)
Linear stacking method of multiple images
Common algorithms for the linear stacking method of multiple images include Mixup, between-class (BC), and sample pairing algorithms.
Mixup
The Mixup algorithm is a data augmentation method based on vicinal risk minimization (VRM), which uses linear interpolation to obtain new sample data [30,31,32]. The neighborhood distribution of Mixup is Equation (5):
μ(x̃, ỹ | x_i, y_i) = (1/n) Σ_{j=1}^{n} E_λ [δ(x̃ = λ·x_i + (1 − λ)·x_j, ỹ = λ·y_i + (1 − λ)·y_j)]
In Equation (5), λ ∼ Beta(α, α) with α ∈ (0, ∞), and Mixup's hyperparameter α controls the difference strength between image pairs or label pairs. E[·] is the expected value, λ ∈ (0, 1), x_i and x_j represent images, and y_i and y_j represent the labels corresponding to the images. (x_i, y_i) and (x_j, y_j) are two samples randomly drawn from the training data. δ(x_i, y_i) is a Dirac mass centered at (x_i, y_i). In summary, the virtual feature-target vectors produced by sampling from the Mixup neighborhood distribution are given by Equations (6) and (7):
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
α controls the strength of the interpolation between feature-target pairs. When α tends to 0, Mixup regresses to empirical risk minimization (ERM). The variable x_i in Equations (6) and (7) represents an image vector randomly selected from the dataset, y_i represents the semantic category probability encoding vector corresponding to the image x_i, λ ∈ (0, 1), x̃ is the final generated image, and ỹ is the semantic class probability vector corresponding to x̃.
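A minimal NumPy sketch of Equations (6) and (7): λ is drawn from Beta(α, α), and the image pair and its one-hot label pair are interpolated with the same coefficient; the α value is illustrative.

```python
# Mixup on a single pair of samples; x_i, x_j are image arrays of the same shape,
# y_i, y_j are one-hot label vectors.
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    lam = np.random.beta(alpha, alpha)      # lambda in (0, 1)
    x_tilde = lam * x_i + (1 - lam) * x_j   # Equation (6)
    y_tilde = lam * y_i + (1 - lam) * y_j   # Equation (7)
    return x_tilde, y_tilde
```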
Mixup has been validated in extensive experiments. In the study of remote sensing image scene classification, Yan, P. et al. [33] introduced the Mixup method into a genetic neural network, which alleviated the problem of small datasets, stabilized the training process, and proved the effectiveness of the method. The experimental results of Zhang, H. et al. [34] show that Mixup-augmented images can improve the generalization error of deep learning models on ImageNet, CIFAR, speech, and tabular datasets; Mixup can also reduce the model's memorization of corrupted labels and enhance the robustness of the model to adversarial examples while also improving the stability of training generative adversarial networks. The Mixup algorithm achieves boundary fuzzification, provides smooth prediction effects, and enhances the model's predictive ability beyond the range of the training data. Mixup implicitly controls the complexity of the model; as the model capacity and hyperparameters increase, the training error decreases. Despite a considerable performance improvement, the Mixup method is not yet well explained in terms of the bias-variance balance, and Mixup still has much room to grow in other types of supervised, unsupervised, semi-supervised, and reinforcement learning [30].
Between-Class (BC)
The central idea of the BC method is to train a neural network by inputting a mixture of samples from different classes and having it output the mixing ratio. That is, BC mixes two images belonging to different classes at a random ratio to generate an inter-class image; the mixed image is then fed into the model, and the model is trained to output the mixing ratio. BC was originally developed as a method for digitally mixing sounds. Mixing two images may not seem meaningful; however, because convolutional neural networks (CNNs) have an aspect that treats input data as waveforms, what works for sound can also work for images. Based on this, an improved version, BC+, was developed. BC can impose constraints on the shape of the feature distribution, thereby improving generalization ability. BC is not only simple to operate and easy to implement, but it also improves the image recognition performance of the network. The research of Tokozume, Y. et al. showed that introducing the BC method improved the image recognition error of CNNs by 1% on the ImageNet-1K dataset, demonstrating the effectiveness of BC for data augmentation [34].
Sample Pairing
The idea of sample pairing is to randomly select two images from the training set, process them through the single-sample transformation data augmentation operation, and then superimpose and synthesize a new sample in the form of pixel average [35]. The model structure of sample pairing is shown in Figure 3.
The training image A and a second image B are randomly retrieved from the training set. Both are augmented by single-sample transformation, the two images are averaged pixel-wise, the label of A is kept, and the result is sent to the training network. Therefore, sample pairing randomly creates new images from the image set, and label B is not used. Because images A and B carry equal weight in the mixed sample, the training error cannot reach 0 even with a large network, and the training accuracy cannot exceed 50% on average. Although the training accuracy with sample pairing will not be very high, when sample pairing is disabled for final fine-tuning, the training accuracy and validation accuracy improve quickly. After fine-tuning, the network trained with sample pairing performs much better than the model trained without it.
However, the experimental results [36] show that the training samples with different labels may be introduced into the data augmentation operation of sample pairing, which leads to a significant increase in the training error of sample pairing on each dataset; in terms of detection error, the validation error using sample pairing training is greatly reduced. Although the idea of sample pairing is simple, the performance improvement effect is considerable, and it conforms to the principle of Occam’s razor. Unfortunately, the interpretability is not strong, and there is still a lack of theoretical support. At present, there are only experiments with image data, and further experiments and interpretations are needed [35]. In the field of remote sensing research, it is theoretically feasible, but the applied research has not yet been carried out, which also guides our research direction.
CutMix
The CutMix method [37] replaces simple pixel erasure by filling the erased area with a block from another sample's image. Labels are also mixed in proportion to the number of pixels each fused image contributes. CutMix has the benefit of not losing information during training, which makes it efficient, while keeping the benefit of region dropout, which forces the model to focus on less discriminative parts of the target. The added block further enhances localization by requiring the model to identify the target from a partial view. The cost of training and inference remains unchanged. Compared with CutMix, Mixup samples look unnatural; CutMix overcomes this problem by replacing an image region with a block from another training sample. The experimental results show that CutMix not only significantly improves the accuracy of the model but also performs better in the target localization task. CutMix also enhances the robustness of the model and reduces the over-confidence problem of the neural network.
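A hedged NumPy sketch of the CutMix idea described above: a rectangular block cut from one training image replaces the corresponding region of another, and the labels are mixed in proportion to the area each image contributes; the block-sampling details follow the common formulation and are illustrative, not taken verbatim from [37].

```python
# CutMix on a single pair of same-sized images with one-hot labels.
import numpy as np

def cutmix(x_i, y_i, x_j, y_j, alpha=1.0):
    h, w = x_i.shape[:2]
    lam = np.random.beta(alpha, alpha)
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)        # block center
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    out = x_i.copy()
    out[y1:y2, x1:x2] = x_j[y1:y2, x1:x2]                       # paste block from x_j
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (h * w)               # area actually kept from x_i
    return out, lam_adj * y_i + (1 - lam_adj) * y_j
```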
(2)
Nonlinear blending method of multiple images
There are many ways to use the nonlinear blending method of multiple images. The common ones are Vertical Concat, Horizontal Concat, Mixed Concat, Random 2 × 2, VH-Mixup, VH-BC+, Random Square, Random Column Interval, Random Row Interval, Random Rows, Random Columns, Random Pixels, Random Elements, Noisy Mixup, Random Cropping, Stitching, etc. An example of a multi-image nonlinear blending method is shown in Figure 4 and Figure 5.
The Vertical Concat method samples the random mixing coefficient λ ∼ Beta(a, a). In this method, the top λ portion of an image x_1 and the bottom (1 − λ) portion of an image x_2 are vertically connected. The expression is given in Equation (8):
x̃(r, c) = { x_1(r, c), if r ≤ λ·H;  x_2(r, c), otherwise }
where H is the height of the image, and x(r, c) represents the pixel value of image x at row r and column c. The retained label ỹ is the original labels weighted by the mixing coefficient: ỹ = λ·y_1 + (1 − λ)·y_2.
Horizontal Concat is like Vertical Concat: it horizontally connects the left λ portion of an image x_1 with the right (1 − λ) portion of an image x_2. The expression is given in Equation (9):
x̃(r, c) = { x_1(r, c), if c ≤ λ·W;  x_2(r, c), otherwise }
where W is the width of the image, and x(r, c) represents the pixel value of image x at row r and column c. The retained label ỹ can be expressed as ỹ = λ·y_1 + (1 − λ)·y_2.
Mixed Concat is a combination of Vertical Concat and Horizontal Concat. Mixed Concat first samples the random mixing coefficients λ_1, λ_2 ∼ Beta(a, a) and then divides the output image into a 2 × 2 grid. λ_1 determines the horizontal boundaries in the grid; λ_2 determines the vertical boundaries in the grid. The upper left and lower right parts of the output image are set to the corresponding pixel values in image x_1, and the upper right and lower left parts are set to the corresponding pixel values in image x_2. The retained label ỹ can be expressed as ỹ = (λ_1·λ_2 + (1 − λ_1)·(1 − λ_2))·y_1 + (λ_1·(1 − λ_2) + λ_2·(1 − λ_1))·y_2.
Random 2 × 2 is a more randomized version of Mixed Concat. Random 2 × 2 first divides the output image into 2 × 2 grids of random size, and the contents of the grid are randomly filled from the image x 1 ,   x 2 . Research by [38] shows that Random 2 × 2 can constrain the generated 2 × 2 grid. Specifically, this constraint forces the intersection of the grid to appear somewhere in the image, preventing the image from becoming too long, too narrow, or even nonexistent. Although this constraint is not critical to the success of the method, it improves performance in a small but significant way.
VH-Mixup takes advantage of Vertical Concat, Horizontal Concat, and Mixup methods (equivalent to BC). First, the two intermediate images are populated by the results of Vertical Concat and Horizontal Concat, each with its own randomly chosen λ . These two images are then applied as an input to Mixup, which has the effect of generating an image where the upper left corner is from x 1 , the lower right corner is from x 2 , and the upper right and lower left are mixed in between with different mixing coefficients. The expression equation is Equation (10):
x̃(r, c) =
  x_1(r, c), if r ≤ λ_1·H and c ≤ λ_2·W
  λ_3·x_1(r, c) + (1 − λ_3)·x_2(r, c), if r ≤ λ_1·H and c > λ_2·W
  (1 − λ_3)·x_1(r, c) + λ_3·x_2(r, c), if r > λ_1·H and c ≤ λ_2·W
  x_2(r, c), if r > λ_1·H and c > λ_2·W
The label y ˜ is determined according to the label generation rules in Vertical Concat, Horizontal Concat, and Mixup and can be thought of as the expected score for the random pixel values of images x 1 and x 2 .
VH-BC+ is a combination of Vertical Concat, Horizontal Concat, and BC+. Compared with VH-Mixup, this method not only replaces Mixup with BC+ but also subtracts the mean of each image before generating the randomly stitched image. Random Square replaces a random square in image x_1 with a part of image x_2. Random Column Interval is a slight generalization of Horizontal Concat. Random Row Interval is the same as Random Column Interval but applies to random intervals of rows rather than columns. Random Rows is a high-frequency variant of Vertical Concat in which each row is taken entirely from x_1 or x_2; this approach allows alternation between x_1 and x_2, possibly multiple times, rather than grouping the rows into one large block. Random Columns is the same as Random Rows but samples columns. Random Pixels is like Random Rows, but each pixel is sampled separately. Random Elements is like Random Rows, but each element of the image is sampled separately. Noisy Mixup is like Mixup, except that it adds random zero-centered noise to the mixing coefficients.
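A minimal NumPy sketch of Vertical Concat (Equation (8)) and Horizontal Concat (Equation (9)); the Beta parameter a is illustrative.

```python
# Concatenation-style mixing of two same-sized images x1, x2 with one-hot labels y1, y2.
import numpy as np

def vertical_concat(x1, y1, x2, y2, a=1.0):
    lam = np.random.beta(a, a)
    h = x1.shape[0]
    out = x1.copy()
    out[int(lam * h):] = x2[int(lam * h):]        # rows below lambda*H come from x2
    return out, lam * y1 + (1 - lam) * y2

def horizontal_concat(x1, y1, x2, y2, a=1.0):
    lam = np.random.beta(a, a)
    w = x1.shape[1]
    out = x1.copy()
    out[:, int(lam * w):] = x2[:, int(lam * w):]  # columns right of lambda*W come from x2
    return out, lam * y1 + (1 - lam) * y2
```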
The experimental results of [39] show that the accuracy improvement achieved by the multi-image nonlinear mixing method is greater than that obtained by the multi-image linear mixing method. Moreover, the data augmentation method of the random image cropping and patching (RICAP) hybrid is better than the local erasing method. Although such methods of mixing images seem irrational and lack interpretability, they are very effective in improving the classification accuracy of the model and can achieve very competitive results [20].

2.2.2. Feature Space Information Synthesis

Feature-level image fusion is an intermediate level of fusion in the fusion hierarchy. The fusion principle of the method is as follows: first, useful features are extracted from the source image; then, the extracted features are comprehensively analyzed and processed. On the premise of preserving the information required for fusion, the input information is filtered, which not only compresses the amount of information effectively but also greatly improves the fusion speed. The algorithms commonly used in this method include CNN and SMOTE.
(1)
CNN
For images, the CNN model has powerful feature extraction capabilities and can obtain features at different levels of the image. Therefore, data augmentation can also be performed in the feature space with the help of image features extracted by CNN. Data augmentation methods for deep learning are briefly mentioned here and will be highlighted in Section 2.3.
(2)
SMOTE
Like Mixup’s interpolation in image space for data augmentation, the SMOTE [40] method is an approach of generating new samples by interpolation in the feature space, which can solve the problem of an unbalanced number of samples well. SMOTE is used to solve the problem of data imbalance in classification. The algorithm proposed by [40] is the synthetic minority oversampling technique, which is an improved scheme based on a random oversampling algorithm. This technology is currently a common method for dealing with unbalanced data and is unanimously recognized by academia and industry [41], and many methods have been improved based on SMOTE [42,43,44,45]. The basic idea of the SMOTE algorithm is to analyze the minority class samples and artificially synthesize new samples according to the minority class samples and add them to the dataset to improve the performance of the classifier. (In Python, the SMOTE algorithm has been packaged into the imbalanced-learn library.)
The interpolation-based SMOTE method synthesizes new samples for the small sample class. The main ideas are: ① Define the feature space, correspond each sample to a certain point in the feature space, and determine the sampling ratio N according to the sample imbalance ratio; ② for each small sample class sample   ( x , y ) , find the   K nearest neighbor samples according to the Euclidean distance and randomly select a sample point from them, assuming that the selected nearest neighbor point is x n , y n . A point is randomly selected as a new sample point on the line segment connecting the sample point and the nearest neighbor sample point in the feature space, which satisfies the following Equation (11); and ③ repeat the selection sampling until the number of large and small samples is balanced.
(x_new, y_new) = (x, y) + rand(0–1) × (x_n − x, y_n − y)
where rand(0–1) denotes a random value generated uniformly between 0 and 1.
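As noted above, SMOTE is packaged in the imbalanced-learn library; the following short sketch shows its typical usage on an imbalanced feature set, with the toy data and parameters chosen purely for illustration.

```python
# SMOTE oversampling of the minority class in feature space.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced feature vectors (e.g., stand-ins for CNN features of image patches)
X, y = make_classification(n_samples=500, n_classes=2, weights=[0.9, 0.1],
                           n_informative=3, random_state=0)
print(Counter(y))                        # heavily imbalanced class counts

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes balanced by synthetic minority samples
```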
In the field of remote sensing, [46] used the SMOTE algorithm to generate new synthetic data from existing data, which was used in the study of evaluating soil properties by the diffuse reflectance spectroscopy technique. The authors of [47] proposed a new SMOTE-based rotating forest method (SMOTEROF) for the classification of imbalanced hyperspectral image data.
One disadvantage of generating new sample feature vectors in feature space is that it is difficult to interpret the vector data. Although an autoencoder can be used to decode the new vector into an image, a decoding model corresponding to the CNN encoding model needs to be trained. Because many studies cannot reach consistent conclusions, the method of data mixing based on feature space is rarely used.

2.3. Deep Generative Models

The data augmentation methods of single-sample transformation and multi-sample synthesis mainly take a single image or multiple images as the operation object to generate new images. The generated new image only contains the information of the original image, and there is little prior knowledge available. The data augmentation method of the depth generation model randomly generates new samples by learning the probability density of the original data. Because the deep generative model approach takes the entire dataset as prior knowledge, this approach is theoretically optimal. The objective function of the deep generative model is the distance between the data distribution and the model distribution, which can be solved by the maximum likelihood method. From the perspective of dealing with maximum likelihood function methods, deep generative models can be divided into three types: the approximation method, the implicit method, and the deformation method [48]. The classification details are shown in Table 1.

2.3.1. Approximation Method

The approximation method obtains the approximate distribution of the likelihood function through variational or sampling methods, mainly including restricted Boltzmann machines (RBMs) [49] and variational autoencoders (VAE) [50]. An RBM is a shallow model that approximates the likelihood function by sampling. VAE uses the variational lower bound of the likelihood function as the objective function. This approximation method using the variational lower bound to replace the likelihood function is much more efficient than the sampling method of the RBM, and the actual effect is better. The representative models of VAE include importance-weighted autoencoders (IWAE), auxiliary deep generative models (ADGM), and adversarial autoencoders (AAE).
(1)
RBM
Boltzmann machines (BMs) are structured, undirected graph probabilistic models defined by energy functions for learning arbitrary probability distributions over binary vectors. RBM is a derivative model of BM. Models such as the deep belief network (DBN), Helmholtz machine (HM), and deep Boltzmann machine (DBM) were derived based on the RBM model.
The units of the RBM are divided into two groups. Each group is called a layer, and the connections between the layers are described by a weight matrix. The upper layer of RBM is an unobservable hidden layer, the lower layer is an observable input layer, and all neurons in the two layers only take 1 or 0, where these two values correspond to the two states of the neuron being activated or not activated, respectively. The model structure of the RBM is shown in Figure 6.
In Figure 6, x represents the visible layer neurons (input), the hidden layer neurons z represent the mapping of the input, and a, b, and w represent the visible layer bias vector, hidden layer bias vector, and weight matrix, respectively. The energy function of the RBM is given by Equation (12):
E(x, z) = −Σ_i a_i·x_i − Σ_j b_j·z_j − Σ_{i,j} x_i·W_ij·z_j
In Equation (12), E(x, z) represents the energy function, which makes the probability of any configuration in the model approach 0 arbitrarily closely without ever reaching it.
DBN
A belief network is generally a directed graph model; in a deep belief network, the connections between the top two layers are undirected, and the other layers are connected by directed edges. A DBN has multiple hidden layers. Hidden layer neurons usually take only the values 0 and 1, and visible layer units take binary or real values. Except for the undirected connections between the top two layers, the remaining layers form a belief network connected by directed edges, with the arrows pointing toward the visible layer. The DBN is a directed probabilistic graphical model, and its structure is shown in Figure 7. In the figure, x represents the visible layer neurons (input), the hidden layer neurons z represent the mapping of the input, h_1 and h_2 represent two hidden layers, and w represents the weight matrix.
HM
Although the emergence of DBNs was a huge advance for deep models, their training speed is slow. The HM model proposed by [51] can be regarded as another form of connected DBN. The basic idea of HM is to maintain the DBN orientation while adding separate weights between layers so that the hidden variables of the top layer can communicate with the visible layer, which can effectively improve the model training speed. The structure of the HM model is shown in Figure 8. The figure shows the partial connections in the fully connected HM network, where a_x represents the bias of the visible layer x, and a_h and a_z represent the biases of the two hidden layers h and z. The upward-pointing solid line connections Q represent the cognitive (recognition) weights, and the downward-pointing dotted line connections P represent the generative weights.
DBM
A DBM model is another deep generative model based on RBM. The difference from DBN is that DBM is an undirected probabilistic graphical model that belongs to the Markov random field model. The model parameters can be quickly initialized through simple bottom-up propagation, and the uncertainty of the data can be dealt with through top-down feedback. The disadvantage is that a certain amount of computation is required when generating samples. The neurons of the binary DBM only take 0 and 1, and it is also easy to extend to real values. Each neuron in the layer is independent of the other and is conditional on the neurons in adjacent layers. The DBM structure with three hidden layers is shown in Figure 9.
When the model of the RBM machine structure is applied to generate new samples, the training efficiency is low and the theory is complex, which seriously limits the theoretical development of the RBM. The RBM can only have three hidden layers at most, which makes the model’s applicable scope relatively narrow, and it is easy to reach a performance bottleneck. Various training algorithms and improvements in model structure cannot solve this problem. The RBM method has gradually lost attention due to problems such as low training efficiency and poor performance.
(2)
VAE
VAE is a deep generative model based on the autoencoder structure. Autoencoders are widely used in dimensionality reduction and feature extraction. The basic structure maps samples to latent variables in a low-dimensional space through the encoding process and then restores the latent variables to reconstructed samples through the decoding process. VAE completes the encoding process from input samples to hidden variables and the generation process from hidden variables to new samples through encoding, reconstruction, and decoding. Important variants of the VAE include the importance-weighted autoencoder (IWAE) [52], ADGM [53], and AAE [54].
VAE maps the samples to the hidden variable z through the encoding process P(z|x), assumes that the hidden variable obeys the multivariate normal distribution P(z) = N(0, I), and draws samples from the hidden variable. This method transforms the likelihood function into a mathematical expectation under the hidden variable distribution, as shown in Equation (13):
P(x) = ∫ P(x|z)·P(z) dz
The decoding process that generates samples from hidden variables is the generative model we need. Encoders and decoders can adopt a variety of structures, and recurrent neural networks (RNNs) or CNNs are now commonly used to process sequence samples or picture samples. The overall structure of the VAE is shown in Figure 10. To make a one-to-one correspondence between the samples and the reconstructed samples, each sample x must have its own corresponding posterior distribution so that the random hidden variable sampled from the posterior distribution can be restored to the corresponding reconstructed sample x ^ by the generator. Each batch of n groups of samples will be fitted by the neural network with n groups of corresponding parameters to facilitate sample reconstruction with the generator. If the distribution is a normal distribution, there are two encoders in the VAE, which generate the mean μ = g 1 x and variance log σ 2 = g 2 x of the sample in the hidden variable space.
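A minimal PyTorch sketch of the encoding and sampling steps described above: two encoder heads produce μ = g_1(x) and log σ² = g_2(x), a latent z is drawn via the reparameterization trick, and the loss combines a reconstruction term with the KL term of the variational lower bound; the network sizes and loss choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_z=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_z)        # g1(x): mean of q(z|x)
        self.logvar = nn.Linear(d_hidden, d_z)    # g2(x): log variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(d_z, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # ELBO = reconstruction term + KL(q(z|x) || N(0, I)); x is assumed to lie in [0, 1]
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```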
IWAE
The IWAE model is one of the most important improvement methods for VAE models. From the perspective of the variational lower bound, IWAE alleviates the problem of the posterior distribution to a certain extent and improves the performance of the generative model by weakening the role of the encoder in the variational lower bound. IWAE improves the ability of generative models at the cost of reducing the performance of the encoder, and the generation ability is significantly improved. Later VAE models are mostly based on IWAE, but if a good encoder and generator need to be trained at the same time, IWAE will no longer be applicable.
ADGM
The ADGM model is the best and most influential conditional variational autoencoder and considers both supervised and semi-supervised learning. The processing method of label information y by ADGM is like the method proposed by [55] to apply a deep generative model to semi-supervised learning, which constructs the likelihood function of labeled data and unlabeled data, respectively. ADGM can be used for supervised or semi-supervised learning, and this model and IWAE use different approaches to solve the problem of oversimplified posterior distributions. The advantage of the ADGM model is that it does not impair the encoder at the cost of requiring five neural networks, which is computationally intensive. Representing label information as one-hot vectors enables VAEs to handle supervised data, essentially adding a conditional constraint to the encoder. The model adds label factors when learning samples so that the VAE can generate corresponding types of samples according to the specified labels.
AAE
The AAE model is a generative model that applies the adversarial idea to the VAE training process. The generator of AAE uses the same neural network as the encoder, which is used to fake the distribution of the hidden variables close to the real distribution; the discriminator is responsible for distinguishing the samples obtained from the true and false distributions. The generator and discriminator together form an additional adversarial network. The purpose of the adversarial network is to make the distribution of arbitrary complexity generated by the generator close enough to the real hidden variable, and the objective function of this part is the same as the objective function of the generative adversarial network. This is equivalent to replacing the regularization term of Kullback–Leibler (KL) divergence in VAE in the overall loss function of AAE. The AAE model can construct three different model structures suitable for supervised learning, semi-supervised learning, and style transfer and obtain good experimental results in their respective fields, which greatly increases the application scope of VAE models.
In terms of sample generation, the VAE class model can generate high-definition handwritten numbers [56], natural images [57], human faces [58,59], and other basic data and successfully generate future prediction pictures [60] of static pictures. The most influential application is the DRAW network using an RNN in the encoder and decoder of the VAE [57]. DRAW extends the structure of VAE and generates realistic house number pictures (SVHN Dataset). The DRAW model was added to the convolutional network to extract spatial information [61], which further improved the generation ability of the model and generated clear natural image samples. Due to the inherent shortcomings of the VAE structure, the image samples generated by the model have considerable noise, and most of the VAE structures have difficulty generating high-definition image samples. In the field of image generation, the effect is not as good as that of the generative model based on a generative adversarial network (GAN) and FLOW, so VAE is usually used as a feature extractor in the field of images.
MAE
The Masked Autoencoders (MAE) model [62] is a scalable self-supervised learning method for computer vision whose advantages are scalability and simplicity. The improved model MRA [63], based on MAE, can generate distorted views of the input images. The generation process is as follows: first, the image is segmented into patches, and a group of patches is masked out of the input image, so that only part of the image is fed to the autoencoder; then, the autoencoder reconstructs the missing patches in pixel space; finally, the reconstructed image is used as an augmentation for the recognition task. In this way, MRA can not only perform strong nonlinear augmentation to train robust deep neural networks but also regularize the generation to retain similar high-level semantics via the reconstruction task. That is, the model can generate robust images with similar semantics and generalize well across different recognition tasks. Experimental results show that erasing label-independent noisy patches leads to a more predictable and constrained generation, which is highly beneficial for stable training and enhances the object awareness of the model. Notably, the whole pre-training process of MRA is label-free and cost-effective. MRA also shows stronger robustness than CutMix, Cutout, and Mixup, suggesting that masked autoencoders are robust data augmenters.

2.3.2. Implicit Method

The implicit method avoids the process of computing the maximum likelihood directly; its representative model is the generative adversarial network (GAN) [64]. A GAN uses the learning ability of a neural network to fit the distance between two distributions. It cleverly avoids the difficulty of solving the likelihood function and is currently the most successful and influential generative model. Its representative models are the deep convolutional generative adversarial network (DCGAN) and BigGAN (currently the strongest in generation ability).
GAN is an unsupervised learning method, and the data generated by GAN models are more diverse than those generated by traditional data augmentation. It allows two neural networks to learn by playing against each other. A GAN consists of a generative model G and a discriminative model D. G takes random samples from the latent space as input, and its output must imitate the real samples in the training set as closely as possible. The input of D is either a real sample or the output of G, and its goal is to distinguish the output of G from the real samples as well as possible, while G tries to deceive D as much as possible. The two models compete against each other, constantly adjusting their parameters, with the goal of making D unable to distinguish the output of G from real samples [24,65,66]. Goodfellow, I. et al. theoretically proved the convergence of the algorithm, and when the model converges, the generated data have the same distribution as the real data. The objective function of GAN is shown in Equation (14).
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
In the equation, x represents a real image, z represents the noise input to the G network, and G(z) represents the image generated by the G network. D(x) represents the probability that the D network judges the real image to be real; because x is real, the closer this value is to 1, the better for D. D(G(z)) is the probability that the D network judges the image generated by G to be real.
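A hedged PyTorch sketch of one alternating optimization step of Equation (14): D is updated to separate real samples from G(z), and G is updated with the common non-saturating variant of its loss. The network definitions, the optimizers opt_g and opt_d, and the output shape of D (a sigmoid probability per sample) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    batch = real.size(0)
    # --- discriminator: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                                  # stop gradients into G
    loss_d = F.binary_cross_entropy(D(real), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # --- generator: maximize log D(G(z)) (non-saturating form) ---
    z = torch.randn(batch, z_dim)
    loss_g = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```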
GANs have an absolute advantage in the field of image generation. A GAN essentially converts the difficult-to-solve likelihood function into a neural network and allows the model to train suitable parameters to fit the likelihood function. This neural network is the discriminator in a GAN model. The internal confrontation structure of GAN can be regarded as a training framework. In principle, any generative model can be trained, the model parameters are optimized through the confrontation behavior between the two types of models, and the process of solving the likelihood function is cleverly avoided. The principle of GAN is simple and easy to understand, and the clarity and resolution of the generated images are higher than those of other generative models. The disadvantage is that the training is unstable, so that the generation effect and training stability of the model have become the most concerning aspects.
The players in a GAN model are the generator and the discriminator. The goal of the generator is to generate realistic fake samples so that the discriminator cannot tell the true sample from the fake. The goal of the discriminator is to distinguish whether the data are a real sample or a fake sample. In the process of the game, the two competitors need to continuously optimize their generating ability and discriminative ability, and the result of the game is to find the Nash equilibrium between the two. When the discriminator’s recognition ability reaches a certain level but cannot correctly judge the data source, a generator that learns the real data distribution is obtained. The calculation process and structure of the GAN model are shown in Figure 11.
The shortcomings of the GAN model are as follows: 1. the model is difficult to train, because vanishing gradients often prevent training from continuing, and because the generator form is very unconstrained, the gradients fluctuate greatly during training, making training unstable; 2. mode collapse occurs, manifested in the generator producing only a single type of sample and failing to generate samples of other categories; and 3. the form of the objective function provides no indicator of training progress during the training process.
The deep convolutional GAN (DCGAN) model is the first important improvement of GAN [67]. It selects the set of generator and discriminator structures with the best effect among various candidates, which significantly improves the stability of GAN training, and it is still a commonly used architecture. The main feature of the model architecture is that the discriminator and generator use convolutional and deconvolutional networks, and each layer uses batch normalization. The DCGAN is fast to train, has a small memory footprint, and is the most commonly used structure for fast experiments. Its disadvantage is the checkerboard artifacts inherent to the deconvolution structure in the generator: when the generated image is enlarged, an interlaced, chessboard-like texture can be seen, which seriously affects the quality of the generated image and limits the reconstruction ability of the DCGAN structure.
The BigGAN [68] model, built on a residual-structure framework, is the best model in the current image generation field, and the fidelity of the high-definition samples it generates is significantly ahead of other generative models. BigGAN handles image details very well, can generate very realistic natural scene images, and achieves a great improvement in, and balance between, scale and stability. The disadvantage of the BigGAN model is that it requires a large amount of labeled data for training. S3GAN adds an additional unsupervised task model to BigGAN's discriminator so that labels can be assigned to unlabeled samples, thereby greatly enlarging the training set and enabling S3GAN to match the quality of samples generated by BigGAN with only 10% of the labeled data.
For cases of extreme scarcity of training data, Salazar, A. [69] proposed the GANSO method, based on generative adversarial networks (GAN) and vector Markov random fields (vMRF), for oversampling the training set of a classifier. The generation block of the GAN model uses the vMRF model to synthesize surrogates via the graph Fourier transform. The discriminant block then applies a linear discriminant to features that measure the partial similarity between the synthesized and original images. These two blocks are iterated until the linear discriminant can no longer distinguish the synthetic instances from the original ones. Experimental results demonstrate that the classifier can be trained with only 3 or 5 instances; GANSO effectively improves classifier performance, whereas SMOTE, used for comparison, is not suitable for handling such small training sets.
In the field of remote sensing, a deep generative framework based on a GAN model and an autoencoder was proposed in [70] to generate noncorresponding SAR patches for hard negative mining with a limited data volume. The study also evaluates the effectiveness of this hard negative mining in reducing false-positive rates and improving network determinism in SAR-optical patch matching applications. After training the generative network, realistic SAR images can be generated using existing SAR-optical matching datasets; these generated images are then used as noncorresponding hard negative samples for training a SAR-optical matching network. The experimental results show that the network can generate realistic SAR images with many SAR-like characteristics, such as dwell and speckle. Guo, J. et al. [71] developed an end-to-end model based on the GAN approach to directly synthesize desired images from a known image database. Its feasibility was verified by comparison with real images and ray-tracing results, and, as a further step, samples were synthesized from angles outside the dataset. However, the training process of GAN models is difficult, especially for SAR images that are often affected by noise interference. The main failure modes were analyzed in the experiments, and a clutter normalization method was proposed to address them; the results show that the method improves the convergence speed by up to 10 times and also improves the quality of the composite images. Although unsuccessful, Marmanis, D. et al. [72] proposed the idea of using a GAN network to generate remote sensing datasets.

2.3.3. Deformation Method

The deformation method appropriately deforms the likelihood function in order to simplify its calculation. Such methods include two model families: flow models [73] and autoregressive models [74]. The flow model uses a reversible network to construct the likelihood function and directly optimizes the model parameters; once the encoder is trained, the characteristics of the reversible structure directly yield the generative model. Flow models include three types: the normalizing flow model, the invertible residual network (i-ResNet), and variational inference with flows. The autoregressive model decomposes the objective function into a product of conditional probabilities. There are many such models, including the pixel recurrent neural network, the masked autoencoder, and WaveNet.
(1)
Flow Model
Among the mainstream deep generative models, the VAE derives a variational lower bound of the likelihood function; however, replacing the real data distribution with an easy-to-solve variational lower bound is an approximation, so the resulting model cannot achieve the best generation effect. Although the GAN approach uses adversarial, alternating training to avoid optimizing the likelihood function and thus retains the accuracy of the model, various problems can arise during training. Therefore, it is very meaningful to study a deep generative model that both guarantees the accuracy of the model and is easy to train. The basic idea of the flow model is that the real data distribution can be mapped to an artificially specified simple distribution by a transformation function. If the transformation function is invertible and its form can be found, then this simple distribution and the inverse of the transformation function together form a deep generative model. The invertibility of the transformation suggests that the flow model is an exact model that can hopefully generate samples of sufficiently good quality.
The transfer function of the flow model is represented by a neural network, which is equivalent to the accumulation of the effects of a series of transfer functions. The superposition process of this simple transformation is like flowing water, so it is called flow. Most flow models are based on this model framework. The log-likelihood function of the flow model can be written as Equation (15):
\( \log P(x) = \log P(z) + \sum_{i=1}^{k} \log\left|\det\frac{\partial h_i}{\partial h_{i-1}}\right| = -\frac{1}{2}\left\|G(x)\right\|^{2} + \sum_{i=1}^{k} \log\left|\det\frac{\partial h_i}{\partial h_{i-1}}\right| + c \)
where \( c = -\frac{D}{2}\log 2\pi \) is a constant (with D the data dimension), \( \det(\cdot) \) denotes the Jacobian determinant, \( P(x) \) represents the data distribution function, \( G(x) \) represents the transformation function mapping the data to the base variable \( z = G(x) \), and \( h_0 = x, h_1, \ldots, h_k = z \) are the intermediate variables of the flow.
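As a quick numerical illustration of Equation (15), the sketch below (a hand-made example, not taken from the cited literature) uses a single invertible affine transform as the flow, so k = 1, and checks that the change-of-variables formula reproduces the analytic Gaussian log-density.

```python
import numpy as np

# One invertible transform G(x) = (x - mu) / sigma maps N(mu, sigma^2) data to N(0, 1).
mu, sigma = 2.0, 0.5
x = np.array([1.3, 2.0, 2.9])

z = (x - mu) / sigma                                   # h_1 = z = G(x)
log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)         # log-density under the N(0, 1) base
log_det = np.log(1.0 / sigma)                          # log|det dz/dx| of the transform
log_px = log_pz + log_det                              # Equation (15) with k = 1

# Analytic log-density of N(mu, sigma^2) for comparison
ref = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
assert np.allclose(log_px, ref)
```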
Normalizing Flow Model
Normalizing flows are the most important family of flow models and include three models: NICE, Real NVP, and Glow. These models introduced the concept of the flow model and established its basic framework and the specific form of the transformation function, gradually improving model performance. Nonlinear independent component estimation (NICE) was the first flow model [73], and most flow models that have appeared since are based on the structure and theory of NICE. Real-valued non-volume preserving (Real NVP) [75] means that the Jacobian determinant of the model is not restricted to 1. Based on the basic structure of NICE, Real NVP proposes an affine coupling layer and a random dimension-shuffling mechanism, which are more nonlinear than the additive coupling layer. Introducing a convolutional layer into the coupling layer enables the flow model to better handle image problems, and a multiscale structure is designed to reduce the computation and storage requirements of the NICE model. Glow builds on NICE and Real NVP and is the best of the current flow models [76].
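The affine coupling layer at the heart of Real NVP and Glow can be sketched in a few lines of PyTorch; the hidden width and the tanh bounding of the scale are illustrative choices, not the exact configurations of the cited models. Because half of the variables pass through unchanged, the inverse and the log-Jacobian determinant are both available in closed form.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style affine coupling layer (illustrative sketch)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and translation from the first half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t             # affine transform of the second half
        log_det = s.sum(dim=1)                 # log|det J| of this coupling layer
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)          # exact inverse, no extra training needed
        return torch.cat([y1, x2], dim=1)
```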
i-ResNet
Invertible residual networks are generative models based on residual networks. Constraints are imposed to make the residual block invertible, and an approximate method is then used to calculate its Jacobian determinant, which makes i-ResNet fundamentally different from other flow models: the basic structure and fitting ability of the residual network (ResNet) are preserved, so the residual block remains symmetric and retains a strong nonlinear transformation ability. i-ResNet uses a variety of methods to solve the Jacobian determinant of the residual block directly and efficiently. Although its generation capability still differs greatly from that of Glow, it is an innovative and bold attempt at a flow model that breaks free of the drawbacks of the coupling layer.
Variational Inference with the Flow
Variational inference with the flow is the introduction of variational inference in a flow model. The mean and variance of the encoder output are mapped to a more complex distribution with a transfer function; then, the decoder reconstructs the samples according to the posterior distribution. This method makes the posterior distribution obtained by the variational flow map closer to the true posterior distribution.
Flow is a very elegant model and, in theory, an error-free one. Flow models design a reversible encoder: once the encoder parameters are trained, the complete decoder is obtained directly, and the construction of the generative model is complete. To ensure the reversibility and computational feasibility of the encoder, current flow models can only stack multiple coupling layers to increase the fitting ability of the model; however, the fitting ability of a coupling layer is limited, and this largely restricts model performance. The current applications of flow focus on face generation within the field of image generation, and the best model is Glow. Compared with other deep generative models led by GAN, flow has more parameters and requires more computation, and its application field is limited to image generation. These drawbacks limit the further development of flow. As an error-free generative model with great potential, the flow model should, in future research, be given a more efficient reversible encoder structure or a coupling layer with stronger fitting ability, and its application scope should be expanded.
(2)
Autoregressive Model
Autoregression is a statistical method for dealing with time series, which predicts the value of a variable at the current moment from the observed values of the same variable at previous moments. Models that use conditional probabilities to represent the relationship between adjacent elements of the visible-layer data and a product of conditional probabilities to represent the joint probability distribution can be called autoregressive networks. The most influential model among autoregressive networks is neural autoregressive distribution estimation (NADE). The model originated from the RBM and combines weight sharing and the probability product criterion with the autoregressive method. The forward propagation of this model is equivalent to an RBM in which the hidden variables obey a mean-field distribution, and it is more flexible, easier to interpret, and achieves better performance. Typical representatives of autoregressive networks are neural autoregressive distribution estimation, pixel recurrent neural networks (PixelRNN), and the masked autoencoder for distribution estimation (MADE).
NADE
NADE models the probability of high-dimensional data by decomposing it into a product of conditional probabilities through the chain rule, as in Equation (16):
\( P(x) = \prod_{d=1}^{D} P\left(x_{o_d} \mid x_{o_{<d}}\right) \)
Here, \( x_{o_{<d}} \) denotes all dimensions that precede \( x_{o_d} \) in the ordering of the D-dimensional observation; that is, the value of a given dimension depends only on the dimensions before it and is independent of the dimensions after it. In the RBM, the weights from the output layer to the hidden layer are the transpose of the weights from the hidden layer to the input layer, whereas NADE can use the above equation to parameterize the weights between each pair of layers independently. As an extension of the model, NADE-k enables the NADE model to better infer missing values in the data: Raiko et al. [77], following the idea of the CD-k algorithm, replaced the single iteration of the original NADE with repeated iterations between the visible layer and the hidden layer, and their experiments show that this effectively improves the ability of the NADE model to infer missing values. Although NADE is fast to train, the ordering of the conditional probabilities prevents parallel processing, so sample generation is slow. To break weak correlations between pixels, Reed, S. et al. [78] proposed making certain groups of modeled pixels conditionally independent, keeping only highly correlated adjacent pixels. This allows NADE to generate multiple pixels in parallel, greatly speeding up sampling, and it sharply reduces the computation required for the hidden variables and conditional probabilities; however, discarding the weak correlations between pixels inevitably affects model performance to some extent. The deep NADE model requires the same amount of computation in the first hidden layer and output layer as the single-layer model described above. Reed, S. et al. derived an unbiased estimate of the loss function for the order-agnostic case and introduced a mask at the input layer so that high-dimensional data can be processed with convolutional neural networks. Additionally, NADE can process data in any order through random sampling of orderings: feeding information to hidden units designated to observe certain inputs and predict the missing ones allows NADE models to perform efficient inference. The disadvantage of deep NADE models is that, as the number of hidden layers increases, the computation required for the additional hidden layers grows greatly, which makes it difficult to scale the NADE model to many layers. The experiments of Reed et al. show that the NADE model performs better than the DBM; however, the number of model parameters and the amount of computation are much larger than those of the DBM.
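A minimal NumPy sketch of the NADE computation in Equation (16) for binary data is given below; the parameter shapes follow the standard NADE parameterization, while the random parameters and dimensions are purely illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nade_log_likelihood(x, W, V, b, c):
    """log p(x) for binary x under a NADE with H hidden units (illustrative sketch).

    W: (H, D) input-to-hidden weights, V: (D, H) hidden-to-output weights,
    b: (D,) output biases, c: (H,) hidden biases.
    """
    D = x.shape[0]
    a = c.copy()                 # shared pre-activation, updated one dimension at a time
    log_p = 0.0
    for d in range(D):
        h = sigmoid(a)                              # hidden state given x_{<d}
        p_d = sigmoid(b[d] + V[d] @ h)              # p(x_d = 1 | x_{<d})
        log_p += x[d] * np.log(p_d) + (1 - x[d]) * np.log(1 - p_d)
        a += W[:, d] * x[d]                         # fold x_d in for the next conditional
    return log_p

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
D, H = 6, 4
x = rng.integers(0, 2, size=D).astype(float)
log_p = nade_log_likelihood(x, rng.normal(size=(H, D)), rng.normal(size=(D, H)),
                            rng.normal(size=D), rng.normal(size=H))
```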
PixelRNN
PixelRNN [79] takes the pixels of an image as the input to the network; in essence, it applies autoregressive neural networks to images. The model uses a deep autoregressive network to predict the pixel values of an image, and three deep generative models with different structures are proposed.
PixelCNN: The model directly uses a CNN to process pixels and then uses specially structured masks to avoid the problem of missing pixels when generating samples. This method is simple in structure, fast and stable in training, and can directly target the likelihood function, making PixelCNN’s likelihood index far superior to other deep generative models. However, the disadvantage is that the generated samples are not ideal, possibly because the convolution kernel is not large enough.
Row LSTM: This model structure captures more information about neighboring pixels. The model performs row convolution on the output of the long short-term memory network (LSTM), and the three gates are also produced by the convolution. This method can capture a wider range of pixels. However, the problem is that the pixel-dependent region of the model is a funnel shape, which obviously misses many important relevant pixels.
Diagonal BiLSTM: In this model, the blind spots in the LSTM input are eliminated by re-mapping the pixel positions; that is, the bidirectional long short-term memory network (BiLSTM) uses inverted feature maps to construct a bidirectional LSTM, eliminating pixel blind spots during mapping and capturing pixel information better than Row LSTM. The essence of these models is to capture the pixel information around the current element, optimize the deep model with a residual structure, and generate pixel samples serially. However, generating samples pixel by pixel makes the model's generation speed very slow.
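The raster-scan ordering that these PixelRNN/PixelCNN variants rely on is typically enforced with masked convolutions, as in the PixelCNN branch described above. The following PyTorch sketch builds such a mask; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each pixel only sees pixels above it
    and to its left (raster-scan order), as used in PixelCNN (illustrative sketch)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")      # "A" also hides the centre pixel (first layer)
        k = self.kernel_size[0]
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2 + (mask_type == "B"):] = 0   # right of centre
        mask[:, :, k // 2 + 1:, :] = 0                          # rows below centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask       # zero out "future" pixels before convolving
        return super().forward(x)

# Usage: the first layer uses mask "A", later layers use mask "B".
layer = MaskedConv2d("A", in_channels=1, out_channels=16, kernel_size=7, padding=3)
out = layer(torch.randn(2, 1, 28, 28))
```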
MADE
MADE [80] applies the autoregressive method to the autoencoder to improve the autoencoder's ability to estimate densities. The implementation mainly uses masks to modify the weight matrices so that the outputs of the autoencoder become conditional probabilities in autoregressive form. Autoencoders usually have limited density-estimation ability, so they are well suited to being combined with autoregressive models, whose representational ability is strong. According to the autoregressive approach to probability density estimation, the outputs of MADE should be conditional probabilities; when the input data are binary, the objective function of the model is the cross-entropy loss. Entries of the autoencoder weight matrices that correspond to disallowed connections are set to 0, and the easiest way to construct such weight matrices is to apply masks that block the connection paths between unrelated variables, realizing the combination of an autoencoder and an autoregressive network. Another advantage of MADE is that it is easy to extend to deep networks by simply increasing the number of hidden layers and adding the corresponding masks. Germain et al. [80] also presented other hidden-layer mask designs and training algorithms for mask-agnostic connection methods. The experimental results show that the generation ability of MADE matches that of NADE and surpasses it on some datasets.
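The mask construction that MADE applies to the weight matrices can be sketched as follows for a single hidden layer; the degree-assignment scheme follows the description above, while the dimensions and random seed are illustrative.

```python
import numpy as np

def made_masks(D, H, rng=np.random.default_rng(0)):
    """Build the input->hidden and hidden->output masks of a one-hidden-layer MADE
    (illustrative sketch). D is the data dimension, H the number of hidden units."""
    m_in = np.arange(1, D + 1)                    # degree of each input (its position)
    m_hid = rng.integers(1, D, size=H)            # hidden degrees drawn from 1..D-1
    M1 = (m_hid[:, None] >= m_in[None, :]).astype(float)   # (H, D): unit k sees inputs <= m(k)
    M2 = (m_in[:, None] > m_hid[None, :]).astype(float)    # (D, H): output d sees units with m(k) < d
    return M1, M2

M1, M2 = made_masks(D=6, H=16)
# In the forward pass the masks multiply the weight matrices element-wise,
# e.g. h = sigmoid((W * M1) @ x + c) and logits = (V * M2) @ h + b,
# so output d depends only on inputs x_1 .. x_{d-1}.
```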
The greatest advantage of the autoregressive structure is that it provides good density estimates for sequence data and can also be combined with other generative models. The disadvantage of autoregressive networks is that the product of conditional probabilities representing the objective function cannot be computed in parallel, so training and sample generation require far more computation than other mainstream models such as VAE and GAN, which greatly limits the development and application of autoregressive generative models.
Deep generative models attempt to combine knowledge of probability theory and mathematical statistics with the representation learning capabilities of powerful deep neural networks. Significant progress has been made in recent years, and they are a mainstream direction of current deep learning research. Although deep generative models have great potential, they also have many problems: ① evaluation indicators and evaluation systems: the training process of deep generative models is complex, their structures are not easy to understand and use, training is slow, and it is difficult to learn models from large-scale data, so establishing effective evaluation indicators and practical evaluation systems is an urgent problem; ② uncertainty: the motivation and construction of deep generative models usually rest on strict mathematical derivation, but in practice the models are often so difficult to solve that they must be approximated and simplified, which makes them deviate from the original goal; the trained model is then difficult to analyze thoroughly in theory, and adjustments can only be inferred indirectly from experimental results, which complicates the training of generative models and is an important factor restricting their further development. Therefore, understanding the impact of model approximation and simplification on model performance, error, and practical application is an important direction for developing generative models; ③ sample diversity: how to make the image samples generated by deep generative models diverse is a problem worthy of research; ④ rethinking the generalization power of deep learning: machine learning theory holds that a good model should have better generalization ability; and ⑤ more efficient model structures and training methods: a representative batch of state-of-the-art generative models, such as BigGAN, Glow, and the VQ-VAE of van den Oord and Razavi [81,82], can already generate sufficiently clear image samples, but behind such models lies a much larger amount of computation, which is the drawback of all large generative models. Expensive computer hardware and long training times make it difficult for many people to pursue cutting-edge research in this field; therefore, more efficient model structures and training methods are one of the future development directions. From the perspectives of model complexity, the bias-variance trade-off, and so on, theoretically examining the learning mechanisms of the various deep generative models and enriching their theoretical basis, so as to truly establish the prominent position of deep generative models in deep learning, is a problem worth considering.
The experiments of [83] based on the NWPU VHR-10 and DOTA datasets show that deep generative methods generate objects of high quality. The latest target detection models were used in the experiments, and the accuracy of target detection was satisfactorily improved after data augmentation based on deep generative models, showing that such augmentation can effectively improve the accuracy of remote sensing image target detection. Generative adversarial network-based remote sensing image generation methods have been used to create high-resolution annotated samples for scene classification [84], and the experimental results show that the method achieves satisfactory performance in high-quality annotated sample generation, scene classification, and data augmentation. To address the problem of synthetically generated remote sensing noise in self-supervised machine learning tasks, the natural noise produced by satellites while collecting images was generated with a GAN [85]; the experimental results show stable convergence of the generator and discriminator networks and demonstrate that the generator-derived patches are indistinguishable from the real dataset. In [86], the style of the test data is transferred to the training data with a GAN: a UNet is first trained on real training data and then fine-tuned on test-styled fake training data generated by the proposed method, and the experimental results demonstrate that the framework outperforms existing domain adaptation methods. Zheng, K. et al. [87] improved GAN-based generative learning methods and proposed vehicle synthetic generative adversarial networks (VS-GANs). This method can quickly generate high-quality annotated vehicle data samples, which is of great help in training vehicle detectors. Experimental results show that the framework can synthesize images of vehicles and their backgrounds with varied levels of detail; compared with traditional data augmentation methods, it significantly improves the generalization ability of vehicle detectors.

2.4. Virtual Sample Generation

Among data-based data augmentation methods, in addition to single-sample transformation, multi-sample transformation, and deep generative models, there are also virtual sample generation methods. Virtual sample generation establishes a computational model that simulates the imaging process so that images can be produced directly by computer. In terms of modeling scale, the virtual sample generation method used for object recognition is mainly instance modeling [88].
Instance modeling is the process of establishing a three-dimensional model of a research target and then mapping it to the real remote sensing background using a simulation system. This method is widely used for synthetic image generation in object recognition and object detection tasks. Kusk, A. et al. [89] added simulated object radar reflectivity to a terrain model scattered by a single point to obtain simulated SAR data. Malmgren-Hansen, D. et al. [90] used electromagnetic simulation techniques and CAD models to model thermal noise and terrain clutter to generate synthetic SAR images. Yan, Y. and You, T. [91,92] used a real background image to generate a virtual sample with a 3D model of the same scale as the real object. To solve the problem of unbalanced numbers of samples, Mo, N. et al. [93] used the method of splicing real background images and real research objects to generate virtual samples. Xiao, Q. et al. [94] generated ship simulation samples from 3D models of real images. To eliminate the domain gap between real images and simulated samples, a neural style transfer (NST)-based network Sim2RealNet was used to achieve style transfer from simulated samples to real images. Wang, K. et al. [95] also introduced virtual samples in the target recognition research of SAR images to alleviate the problem of insufficient training data.
The above research results show that convolutional neural networks pretrained on simulated data have greater advantages than those pretrained only on real data, especially when the real data are sparse. The advantages of pretraining on simulated data are reflected in faster convergence during training and higher final accuracy when benchmarked against moving and stationary target acquisition and recognition datasets. The main advantage of this augmentation method is that image generation is fast and the image content is highly controllable. Based on this technology, remote sensing images can be augmented effectively with high fidelity and at low cost, especially for images that are difficult to obtain. However, one limitation of this technique is that a domain gap remains between synthetic and real images; to solve the domain shift caused by this gap, it is often necessary to combine it with transfer learning for further optimization.

3. Network-Based Data Augmentation Methods

In a broad sense, any method that prevents network overfitting can be regarded as data augmentation, so starting from the network itself is also a worthwhile research direction for preventing overfitting. In this part, we mainly introduce and summarize network strategies and learning strategies. The network-based data augmentation methods are summarized in Table 3.

3.1. Network Strategy

In addition to the data augmentation methods described in Section 2, which generate new data, improving the existing learning model for the target task and reducing the network capacity are also effective ways to suppress overfitting. At present, transfer learning and regularization are relatively mature techniques.

3.1.1. Transfer Learning

Transfer learning starts from the target task and the original data and looks for an existing learning model that resembles the target task; this model is then improved according to the requirements of the target task to meet its needs. Transfer learning is a machine learning method that transfers knowledge from one domain (the source domain) to another (the target domain) so that the target domain can achieve a better learning effect. Usually, the amount of data in the source domain is sufficient while the amount of data in the target domain is small, a situation that is very suitable for transfer learning. For example, suppose we want to solve a classification task for which there is not enough data (the target domain), while a large amount of related training data is available (the source domain), but these training data have a feature distribution different from that of the test data in the required classification task. In this case, if a suitable transfer learning method can be used, the classification and recognition results obtained from the insufficient samples can be greatly improved.
The notation for transfer learning is defined as follows: (1) A data domain \( \mathcal{D} \) consists of two parts: a feature space \( \mathcal{X} \) and a marginal probability distribution \( P(X) \), where \( X = \{x_1, \ldots, x_n\} \in \mathcal{X} \). (2) Given a specific domain \( \mathcal{D} = \{\mathcal{X}, P(X)\} \), a task consists of two parts: a label space \( \mathcal{Y} \) and a target prediction function \( f(\cdot) \). The task is denoted by \( \mathcal{T} = \{\mathcal{Y}, f(\cdot)\} \); the prediction function cannot be observed directly but can be learned from training data consisting of pairs \( \{x_i, y_i\} \), where \( x_i \in \mathcal{X} \) and \( y_i \in \mathcal{Y} \). The function \( f(\cdot) \) can be used to predict the labels of new samples, and from a probabilistic point of view, \( f(x) \) can be written as \( P(y \mid x) \). To simplify the problem, only the case of one source domain \( \mathcal{D}_S \) and one target domain \( \mathcal{D}_T \) is discussed here. The source domain is \( \mathcal{D}_S = \{(x_{S_1}, y_{S_1}), \ldots, (x_{S_{n_S}}, y_{S_{n_S}})\} \), where \( x_{S_i} \in \mathcal{X}_S \) is an observed sample in the source domain and \( y_{S_i} \in \mathcal{Y}_S \) is its corresponding label. The target domain is \( \mathcal{D}_T = \{(x_{T_1}, y_{T_1}), \ldots, (x_{T_{n_T}}, y_{T_{n_T}})\} \), where \( x_{T_i} \in \mathcal{X}_T \) is an observed sample in the target domain and \( y_{T_i} \in \mathcal{Y}_T \) is the corresponding output. Usually, the number of observed samples in the source domain \( n_S \) and in the target domain \( n_T \) satisfy \( 0 \leq n_T \ll n_S \). Based on the above notation, transfer learning is defined as follows: given a source domain \( \mathcal{D}_S \), a source-domain learning task \( \mathcal{T}_S \), a target domain \( \mathcal{D}_T \), and a target-domain task \( \mathcal{T}_T \), with \( \mathcal{D}_S \neq \mathcal{D}_T \) or \( \mathcal{T}_S \neq \mathcal{T}_T \), transfer learning uses the knowledge in \( \mathcal{D}_S \) and \( \mathcal{T}_S \) to improve or optimize the learning of the target prediction function \( f_T(\cdot) \) in the target domain \( \mathcal{D}_T \) [96].
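In deep learning practice, the most common realization of this definition is fine-tuning: a backbone pretrained on a large source domain is reused and only a new head is trained on the small target domain. The PyTorch/torchvision sketch below illustrates the idea; it requires a recent torchvision, and the class count and learning rates are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10   # placeholder: number of target-domain classes

# Backbone pretrained on the source domain (ImageNet).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                                  # freeze source-domain knowledge

# Replace the classification head for the target domain D_T.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Train only the new head on the small target dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# After a few epochs the backbone can optionally be unfrozen and trained
# end-to-end with a much smaller learning rate (e.g., 1e-5).
```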
Transfer learning is supported by machine learning theory. In recent years, Yu, C. and Pan, S. [97,98] and others have addressed domain adaptation problems mainly through the AdaBoost-based transfer algorithm [99], the semi-supervised domain adaptation transfer component analysis method [100], and support-vector-based transfer learning theory [101].
In the field of remote sensing, Hu, K. et al. [102] used the multisource weighted TrAdaBoost algorithm from transfer learning to detect clouds in satellite cloud images. Many thick-cloud samples labeled by multiple people (multi-source) formed a multisource auxiliary sample set, while a small number of thin-cloud samples formed the target sample set. Using transfer learning and the auxiliary sample set, the extreme learning machine classifier obtained by training only on the thin-cloud sample set was given auxiliary training to improve its thin-cloud recognition rate. Based on the HJ-1A/B satellite data of the National Satellite Meteorological Center, the experimental results showed that transfer learning can make full use of the easily obtained thick-cloud sample knowledge to assist in identifying thin clouds and to improve the small-sample thin-cloud classifier of the same type. The experiments also show that the transfer learning algorithm can be further applied to more multisource samples and other cloud classification tasks.
To address the low accuracy of remote sensing classification under limited target samples, Han, M. et al. [103] proposed a remote sensing image classification algorithm based on improved Bayesian ARTMAP neural network transfer learning. Class proliferation is suppressed by improving the resonance matching, and a discrete incremental expectation-maximization parameter update strategy for the nodes is used to transfer the prior information of ground-object classes in historical remote sensing samples to the target model. The experiments show that the method can effectively use historical remote sensing data to compensate for the lack of target training data and greatly improves the classification accuracy of remote sensing images compared with other sample utilization strategies. Wu, T. J. et al. [104] proposed a transfer learning method for object-based classification of remote sensing images. In this method, invariant objects are marked on the new target image through change detection, and the knowledge of object categories interpreted in the past is transferred to the new image; the relationship between the new features and the objects is then established to complete the automatic object classification of target images assisted by historical thematic data. The experimental results show that, guided by the image knowledge of the existing historical thematic layers, the method can effectively and automatically select reliable samples suitable for classifying the new image, obtain a better information extraction effect, and improve the efficiency of object classification. Liu, Y. et al. [105] proposed a remote sensing image classification method based on the reuse of spatiotemporal information through case-based reasoning. The method introduces the TrCbrBoost model, which uses source-domain data to train a classifier for mapping land-use types in the target domain when newly labeled data are not available. A classifier based on the modified TrAdaBoost algorithm is trained using source-domain samples, where the weight of each sample is adjusted according to the performance of the classifier. In land-use classification experiments using time-series SPOT images, the results show that TrCbrBoost is more effective than traditional classification models provided that enough source-domain data are available, in which case the accuracy of the proposed classification method is improved by 9.19%.

3.1.2. Regularization

Both data augmentation and regularization can be viewed as ways of incorporating prior knowledge into the model, either through the invariances of the data or through priors (regularization) on how the weights and activations of the neural network should behave, and both aim to reduce the generalization gap between training and testing. In the fitting process, we usually prefer the weights to be as small as possible, finally constructing a model in which all parameters are relatively small, because a model with small parameter values is generally considered relatively simple, adapts to different datasets, and avoids overfitting to a certain extent. Intuitively, for a linear regression model with large parameters, even a slight shift in the data has a great impact on the results, whereas if the parameters are small enough, a shift in the data has little effect on the result; that is, the resistance to perturbation is strong. Regularization makes the model prefer smaller weights, and smaller weights mean lower model complexity. Adding regularization is equivalent to adding some kind of prior to the model that limits the distribution of the parameters, thereby reducing the complexity of the model. The reduced complexity means that the model's ability to resist noise and outliers is enhanced, which improves its generalization ability: the training data are fitted adequately but not overfitted. The representative regularization method is dropout. When a complex feedforward neural network is trained on a small dataset, it is prone to overfitting; to prevent this, Hinton et al. proposed the dropout method in 2012, and subsequent studies have verified its effectiveness in preventing network overfitting. The term "dropout" refers to temporarily dropping some units and their input and output connections in a neural network, as shown in Figure 12; the units to be dropped are chosen at random. In the simplest case, each unit is retained with a fixed probability ρ independent of the others, where ρ can be chosen using a validation set or simply set to 0.5, which appears to be close to optimal for a wide range of networks and tasks. For input units, however, the optimal retention probability is usually closer to 1 than to 0.5.
(1)
Dropout workflow
Assume a standard neural network with input x and output y. The normal process is as follows: first, x is propagated forward through the network, and then the error is backpropagated to decide how to update the parameters so that the network learns. With dropout, the process becomes the following: ① First, half of the hidden neurons in the network are randomly and temporarily deleted (their parameters are backed up), while the input and output neurons remain unchanged. ② Second, the input x is propagated forward through the modified network, and the resulting loss is backpropagated through the same modified network; after this is done for a small batch of training samples, the parameters (w, b) of the neurons that were not deleted are updated by stochastic gradient descent, while the parameters of the deleted neurons keep their values from before deletion. ③ Third, the deleted neurons are restored, and a new half-sized subset of hidden neurons is randomly selected and temporarily deleted. ④ The process is then repeated.
(2)
Use of Dropout in Neural Networks
During the model training phase
A probabilistic process is added to each unit of the training network, as shown in Figure 13. The corresponding equation changes are as follows. The calculation equations of the network without dropout are Equations (17) and (18):
\( z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)} \)
\( y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \)
where \( l \in \{1, \ldots, L\} \), L is the number of hidden layers of the network, \( \mathbf{z}^{(l)} \) is the vector of inputs into layer l, \( \mathbf{y}^{(l)} \) is the vector of outputs from layer l (\( \mathbf{y}^{(0)} = x \) is the input), \( \mathbf{W}^{(l)} \) and \( \mathbf{b}^{(l)} \) are the weights and biases at layer l, and f is any activation function.
The network calculation equations using dropout are Equations (19)–(22):
\( r_j^{(l)} \sim \mathrm{Bernoulli}(p) \)
\( \tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \odot \mathbf{y}^{(l)} \)
\( z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)} \)
\( y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \)
Here, \( \odot \) denotes the element-wise product. For any layer l, the Bernoulli function in Equation (19) generates a vector \( \mathbf{r}^{(l)} \) of 0s and 1s; \( \mathbf{r}^{(l)} \) is a vector of independent Bernoulli random variables, each of which equals 1 with probability p. \( \tilde{\mathbf{y}}^{(l)} \) are the thinned outputs created from \( \mathbf{y}^{(l)} \). At the code level, a neuron is stopped from working with probability 1 − p, i.e., its activation value is set to 0 with that probability so that it is retained with probability p [106].
During the testing phase
When the model makes predictions, the weight parameters of each neural unit are multiplied by the retention probability p, as shown in Figure 14, giving the test-phase dropout Equation (23):
\( \mathbf{W}_{\mathrm{test}}^{(l)} = p\,\mathbf{W}^{(l)} \)
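The two phases can be summarized in a few lines of NumPy; here p is the retention probability, matching Equations (19), (20), and (23). Note that modern frameworks usually implement the equivalent "inverted dropout", which rescales activations by 1/p during training instead of scaling the weights at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(y, p):
    """Training phase: keep each unit with probability p (Equations (19)-(20))."""
    r = rng.binomial(1, p, size=y.shape)   # Bernoulli(p) mask
    return r * y                           # thinned outputs y_tilde

def dropout_test_weights(W, p):
    """Testing phase: scale the weights by p instead of sampling a mask (Equation (23))."""
    return p * W

y = np.array([0.8, -1.2, 0.3, 2.0])
y_train = dropout_train(y, p=0.5)          # roughly half of the activations are zeroed
```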
In the field of remote sensing, an improved strategy based on dropout was proposed [107]. This strategy only selects part of the local area data to clear the weights each time, which not only maintains the local information of the image itself but also enhances the generalization ability of the model. Differential evolution is used to optimize the weights and offsets of the neural network. The experimental results on the remote sensing object image set show that an improved strategy based on dropout has an obvious effect on preventing overfitting of the remote sensing classification neural network. Jiang Hanlu [108] used the dropout model to prevent overfitting and to optimize the convolutional neural network model. The experimental results show that the introduction of dropout into the model not only suppresses the overfitting problem, but also improves the classification accuracy of the test set to a certain extent. Jiao, J. et al. [109] added dropout to the AlexNet convolutional neural network classification and recognition algorithm model to reduce the occurrence of overfitting.

3.2. Learning Strategy

The central idea of data augmentation for learning strategies is to train a model to adaptively select the optimal data augmentation strategy to maximize model performance improvement. The most representative methods are meta-learning and reinforcement learning.

3.2.1. Meta-Learning

In recent years, meta-learning methods [110] have reached the forefront of few-shot performance and have thus received increasing attention. Through meta-learning, people can learn by using their previous experience to guide the learning of new tasks. In meta-learning, knowledge from one problem domain is transferred to other domains or tasks, and the learning mechanism continuously improves the learning performance over time by accumulating experience. The main point of meta-learning is determining how to build a model that can quickly learn new tasks, whereas the problem of few-shot learning is determining how to build a model with excellent generalization performance. Therefore, we can treat meta-learning as a data augmentation method.
Meta-learning algorithms need to learn a network that can easily adapt to tasks with limited data and that can generalize to unknown examples. To achieve this, the adaptation and evaluation processes that occur at test time must be simulated during meta-training. Ni, R. et al. [111] sampled support data \( T_i^s \) and query data \( T_i^q \) to simulate a multiway classification task \( T_i \) such that \( T_i = \{T_i^s, T_i^q\} \). The support set simulates a small amount of training data, while the query set simulates unknown test data. The meta-learning algorithm consists of an inner loop and an outer loop for each parameter update during training. In the inner loop, the model is first fine-tuned on the support data \( T_i^s \). In the outer loop, the adapted model is then evaluated on the query data \( T_i^q \), and the query loss is minimized with respect to the model parameters as they were before fine-tuning; this loss minimization step may require computing gradients through the fine-tuning process.
Existing meta-learning algorithms apply various methods to fine-tune the support data during the inner loop, such as MAML [112], Reptile [113], MetaOptNet [114], and R2-D2 [115].
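A compact sketch of the inner/outer loop described above, in the style of MAML, is shown below; it assumes a recent PyTorch with torch.func, and the model, loss function, and single inner step are illustrative simplifications rather than the exact procedures of the cited algorithms.

```python
import torch

def maml_step(model, loss_fn, support, query, inner_lr=0.01):
    """One MAML-style meta-update for a single task T_i (illustrative sketch).

    `support` and `query` are (inputs, labels) pairs standing in for T_i^s and T_i^q."""
    xs, ys = support
    xq, yq = query
    params = dict(model.named_parameters())

    # Inner loop: one gradient step on the support set, keeping the graph so the
    # outer loss can be differentiated through the adaptation.
    inner_loss = loss_fn(torch.func.functional_call(model, params, (xs,)), ys)
    grads = torch.autograd.grad(inner_loss, params.values(), create_graph=True)
    adapted = {name: p - inner_lr * g for (name, p), g in zip(params.items(), grads)}

    # Outer loop: evaluate the adapted parameters on the query set; the caller
    # backpropagates this loss into the original (pre-adaptation) parameters.
    return loss_fn(torch.func.functional_call(model, adapted, (xq,)), yq)
```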
Ni, R. et al.'s [111] research shows that meta-learning is more sensitive to the query data than to the support data, and four modes of data augmentation are described accordingly. ① Support augmentation: data augmentation can be applied to the support data used for fine-tuning in the inner loop; this strategy expands the fine-tuning data pool. ② Query augmentation: data augmentation can also be applied to the query data; this strategy expands the pool of evaluation data sampled during training. ③ Task augmentation: the number of possible tasks can be increased by augmenting entire classes uniformly to create new classes for training. ④ Shot augmentation: at test time, the number of shots can be artificially enlarged by adding an extra augmented copy of each image. The authors' experimental results show that support augmentation weakens network performance, whereas query augmentation improves it, and the combination of task augmentation and query augmentation can also improve network performance.
In the field of remote sensing, there are few studies on data augmentation based on meta-learning, and most practitioners use improved meta-learning algorithms or combine them with traditional data augmentation methods (single-sample transformation, multi-sample transformation) to improve the performance of target recognition. For example, Li, Y. et al. [116] proposed a new model, Meta-FSEO, based on meta-learning to improve the performance of few-shot remote sensing scene classification across different urban scenes. Yang, Q. et al. [117] proposed a meta-captioning framework based on meta-learning to effectively caption remote sensing images. These studies exploit the suitability of meta-learning for few-shot learning tasks.
Meta-learning outperforms transfer learning in data augmentation. However, meta-learning is more sensitive to the network structure and requires careful tuning of the hyperparameters, and it is still limited to performance within a specific task space under a defined network structure. For classification tasks, only associations between classification tasks can currently be considered. Is it possible to build a framework that considers classification, detection, prediction, and generation tasks at the same time? This would decouple meta-learning from the notion of tasks. Some recent work attempts to optimize each mini-batch, and in this case, how to optimize the inner loop will be an important direction for efficient application optimization [111].
However, meta-learning has its own limitations: Meta-learning is rarely used to initialize parameter weights when there is a clear domain gap between training and testing tasks. This can easily lead to the negative transfer of the model. Furthermore, meta-learning is highly dependent on the structure of the network and needs to be redesigned for a wide variety of tasks. Meta-learning is a relatively new concept that has not been rigorously tested, and no theory has emerged to explain the causal relationships behind meta-learning. As the theoretical framework of causality develops, meta-learning may become a more complete framework.

3.2.2. Reinforcement Learning

Reinforcement learning provides a new idea for data augmentation: searching for the optimal combination strategy from a given set of image transformation and mixing methods. AutoAugment is the most representative method of this type. AutoAugment is an image data augmentation technique proposed by Cubuk et al. [118]. The method treats learning the best augmentation policy as a discrete search problem. Sixteen image transformation operations are used to build the search space; each augmentation policy comprises several sub-policies, and each sub-policy is composed of several image transformation operations. Each operation is determined by two parameters: the magnitude of the image transformation and the probability of performing the operation. The method's main idea is to create a search space of data augmentation strategies so that the quality of a strategy can be evaluated directly on the dataset of interest.
The specific implementation process is as follows. First, a search space is designed in which each policy package contains many sub-policies, and a sub-policy is randomly selected for each image in each mini-batch. Each sub-policy consists of two operations, each of which is an image processing function such as translation, rotation, or shear, as well as the probability and magnitude of applying these functions. A search algorithm is then used to find the best policy so that the neural network can achieve the highest validation accuracy on the target dataset. Finally, policies learned from one dataset can be transferred well to other similar datasets [119].
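The mechanics of applying a learned policy are simple once the policy is given. The sketch below uses a hand-written stand-in policy with PIL operations; the operations, probabilities, and magnitudes are illustrative assumptions, not the sub-policies found by the original search.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# A stand-in for a learned AutoAugment policy: each sub-policy is two
# (operation, probability, magnitude) triples, all values illustrative.
POLICY = [
    [("rotate", 0.7, 15), ("contrast", 0.3, 1.4)],
    [("flip_lr", 0.5, None), ("sharpness", 0.6, 1.8)],
]

OPS = {
    "rotate":    lambda img, m: img.rotate(m),
    "contrast":  lambda img, m: ImageEnhance.Contrast(img).enhance(m),
    "sharpness": lambda img, m: ImageEnhance.Sharpness(img).enhance(m),
    "flip_lr":   lambda img, m: ImageOps.mirror(img),
}

def apply_random_subpolicy(img: Image.Image) -> Image.Image:
    """Pick one sub-policy per image and apply each operation with its probability."""
    for op, prob, magnitude in random.choice(POLICY):
        if random.random() < prob:
            img = OPS[op](img, magnitude)
    return img
```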
Cubuk et al. [120] chose reinforcement learning as the search algorithm to search for the optimal strategy, pointing out that random search or evolutionary strategies are equally effective. AutoAugment reduced the error rate from 2.67% to 1.48% on the CIFAR-10 dataset and achieved an accuracy of 83.54% on the ImageNet dataset.
Although the search data augmentation strategy can effectively improve the accuracy of image classification and object detection tasks, this method also increases the complexity of training models, increases the computational cost, and cannot adjust the regularization strength according to the model or dataset size. Determining how to tailor a set of image transformation strategies for a given task to further improve the prediction performance of a given model remains an open question. Due to the relatively late appearance of this method, thus far, there has been no related application in the field of remote sensing. However, this also guides the direction for our future research.

4. Challenges and Future Directions

In summary, research aimed at solving the few-shot learning problem caused by a small number of samples is gradually improving. Among these solutions, single-sample transformation, deep generative models, transfer learning, regularization, meta-learning, and reinforcement learning emerged relatively early, so they have a good theoretical basis, have been confirmed by many experiments, and are relatively mature technologies. The multi-sample synthesis method appeared relatively late, so its theoretical basis is immature, experimental evidence is still lacking, and its application in the field of remote sensing needs further study. The advantages and disadvantages of different data augmentation methods are compared in Table 4, and Table 5 compares the results of different data augmentation algorithms on the CIFAR dataset.
Currently, the mainstream methods for data augmentation are single-sample transformations and deep generative models. (1) The single-sample transformation method is simple and effective in improving the classification accuracy of the model and is therefore the most common data augmentation method. However, inappropriate transformations can have negative effects, so the applicability of a method is the primary consideration when using data augmentation. In addition, there are many single-sample transformation methods, each with its own advantages and disadvantages; how to combine them so as to avoid their shortcomings and allow the training model to achieve optimal results is the main direction of development. (2) Deep generative models are rich in variety and evolving rapidly, although every type of model has certain problems and limitations. With the further deepening of theoretical research and the further expansion of application fields, deep generative models will become a mainstream technology in the future. The main directions for their future development are as follows. ① Evaluation metrics and evaluation systems: deep generative models suffer from complex training processes, structures that are not easy to understand and use, and slow training, which makes it difficult to learn models on large-scale data; establishing effective evaluation metrics and practical evaluation systems for different application areas is an urgent research problem. ② Uncertainty: the motivation and construction of deep generative models usually rest on rigorous mathematical derivation, but in practice the models are often so difficult to solve that they must be approximated and simplified, making them deviate from the original objective. Therefore, understanding the impact of model approximation and simplification on model performance, error, and practical applications is an important direction for developing generative models. ③ Sample diversity: how to make deep generative models generate diverse images is a problem worth investigating. ④ Generalization capability: the generalization ability of a model is an important indicator for evaluating its quality, and theoretically improving the learning mechanism of the model is an issue worthy of deep consideration. ⑤ Efficient model structures and training methods: the high cost of computer hardware and the long training times make it difficult for many researchers to enter frontier research in this field, so more efficient model structures and training methods are one of the future directions. ⑥ Application area expansion: the application scope of deep generative models is still relatively narrow, and how to use them in more fields is a key direction for further development.
Combining different data augmentation methods brings complementary and incremental enhancements to the model and is, in theory, a path toward the optimal data augmentation effect. AutoAugment [118], which uses reinforcement learning as its policy search algorithm, and related methods such as RandAugment [120] lay the research foundation for automatically selecting the optimal data augmentation scheme from the combined space. However, the applicability of the various types of data augmentation algorithms varies greatly across different data, tasks, and application scenarios, and the characteristics of the data and the task need to be considered when defining the search space. Therefore, theoretical analysis and experimental validation of the suitability of various data augmentation methods for different data and tasks are of great research significance and application value.
In future research, the most promising directions for data augmentation in remote sensing are learning strategies and virtual sample generation methods. The prospects of learning-strategy-based methods are mainly reflected in the following aspects: ① exploring optimal combination strategies through reinforcement learning according to the data and the task; ② adaptively learning optimal data deformation and mixing methods based on meta-learning; ③ further fitting the real data distribution with generative adversarial networks to sample high-quality unseen data; and ④ exploring the interconversion of multimodal data based on style transfer. Although better data augmentation strategies can be obtained more intelligently by learning or searching augmentation strategies, how to automatically customize an optimal set of data augmentation schemes for a given task remains to be studied. The prospects of virtual-sample-based methods mainly depend on the development of virtual visualization technology. The development of metaverse technology and the pursuit of highly realistic scenes by game customers and scientific researchers have stimulated the rapid development of virtual visualization technology; computers can simulate any scenario a researcher needs, so researchers can produce enough samples. At present, however, the high cost of simulating high-quality virtual scenes with computers seriously hinders the development of virtual sample production methods, and simulating virtual scenes that are closer to real scenes at low cost is an urgent problem that virtual sample production methods must solve.

5. Conclusions

At present, data augmentation is an important means of improving network performance and recognition accuracy when the number of samples is small. This paper summarizes the data augmentation techniques used for target recognition in the remote sensing field, which are mainly divided into two categories: data-based data augmentation methods and network-based data augmentation methods. Data-based methods are further divided into single-sample transformation, multi-sample synthesis, deep generative modeling, and virtual sample generation; network-based methods are further divided into network strategies (such as controlling network capacity) and learning strategies. Starting from the theoretical basis of each method and combining existing research results and experimental data, this paper provides a broad overview and in-depth analysis of each method. Through this article, readers can gain a more comprehensive understanding of image data augmentation methods in the field of remote sensing. In practical applications, practitioners can select and combine the most suitable methods according to the characteristics of their data and tasks to form a data augmentation scheme suited to the task, which in turn provides strong support for the application of deep learning methods. In addition, compared with other data augmentation review articles, the methods summarized in this article are more comprehensive and more focused on the field of remote sensing.

Author Contributions

Conceptualization, X.H. and X.L.; methodology, X.H.; validation, X.L., L.L. and R.Y.; formal analysis, L.Z.; investigation, L.L.; resources, X.H.; data curation, X.L.; writing—original draft preparation, X.H.; writing—review and editing, L.L. and L.Y.; visualization, X.L.; supervision, X.L.; project administration, X.H.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Watershed Non-point Source Pollution Prevention and Control Technology and Application Demonstration Project (2021YFC3201505), the National Key Research and Development Project (No. 2016YFC0502106), the Natural Science Foundation of China Research Grants (No. 41476161), and the Fundamental Research Funds for the Central Universities.

Data Availability Statement

The storage URL of the structured raw data to construct the knowledge graph is: https://github.com/hao1661282457/remotesensing-images.git (accessed on 2 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, S.; Cheng, Q.; Chen, D.; Zhang, H. Image Target Recognition Model of Multichannel Structure Convolutional Neural Network Training Automatic Encoder. IEEE Access 2020, 8, 113090–113103. [Google Scholar] [CrossRef]
  2. Zhou, Y.; Zhu, Z.; Ding, Q. Port Target Recognition of Remote Sensing Image. J. Nanjing Univ. Aeronaut. Astronaut. 2008, 40, 350–353. [Google Scholar]
  3. He, J.; Guo, Y.; Yuan, H. Ship Target Automatic Detection Based on Hypercomplex Flourier Transform Saliency Model in High Spatial Resolution Remote-Sensing Images. Sensors 2020, 20, 2536. [Google Scholar] [CrossRef] [PubMed]
  4. Huang, X.; Zhang, L.; Zhu, T. Building Change Detection from Multitemporal High-Resolution Remotely Sensed Images Based on a Morphological Building Index. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 7, 105–115. [Google Scholar] [CrossRef]
  5. Shu, C.; Sun, L. Automatic target recognition method for multitemporal remote sensing image. Open Phys. 2020, 18, 170–181. [Google Scholar] [CrossRef]
  6. Jin, L.; Kuang, X.; Huang, H.; Qin, Z.; Wang, Y. Over-fitting Study of Artificial Neural Network Prediction Model. J. Meteorol. 2004, 62, 62–70. [Google Scholar]
  7. Lee, J.-G.; Jun, S.; Cho, Y.-W.; Lee, H.; Kim, G.B.; Seo, J.B.; Kim, N. Deep Learning in Medical Imaging: General Overview. Korean J. Radiol. 2017, 18, 570–584. [Google Scholar] [CrossRef]
  8. Zhai, J. Why Not Recommend a Small Sample for Further Study? 2018. Available online: https://www.zhihu.com/question/29633459/answer/45049798 (accessed on 20 November 2022).
  9. Vogel-Walcutt, J.; Gebrim, J.; Bowers, C.; Carper, T.; Nicholson, D. Cognitive load theory vs. constructivist approaches: Which best leads to efficient, deep learning? J. Comput. Assist. Learn. 2010, 27, 133–145. [Google Scholar] [CrossRef]
  10. Simard, P.Y.; LeCun, Y.A.; Denker, J.S.; Victorri, B. Transformation Invariance in Pattern Recognition—Tangent Distance and Tangent Propagation. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 235–269. [Google Scholar]
  11. Abu Alhaija, H.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [Google Scholar] [CrossRef]
  12. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  13. Mikolajczyk, A.; Grochowski, M. Data augmentation for improving deep learning in image classification problem. In Proceedings of the International Interdisciplinary PhD Workshop (IIPhDW), Swinoujscie, Poland, 9–12 May 2018; pp. 117–122. [Google Scholar]
  14. Nalepa, J.; Marcinkiewicz, M.; Kawulok, M. Data Augmentation for Brain-Tumor Segmentation: A Review. Front. Comput. Neurosci. 2019, 13, 83. [Google Scholar] [CrossRef]
  15. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  16. Lemley, J.; Corcoran, P. Deep Learning for Consumer Devices and Services 4—A Review of Learnable Data Augmentation Strategies for Improved Training of Deep Neural Networks. IEEE Consum. Electron. Mag. 2020, 9, 55–63. [Google Scholar] [CrossRef]
  17. Chlap, P.; Min, H.; Vandenberg, N.; Dowling, J.; Holloway, L.; Haworth, A. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 2021, 65, 545–563. [Google Scholar] [CrossRef]
  18. Song, Y.; Wang, T.; Mondal, S.K.; Sahoo, J.P. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities. arXiv 2022, arXiv:2205.06743. [Google Scholar] [CrossRef]
  19. Taylor, L.; Nitschke, G. Improving Deep Learning with Generic Data Augmentation. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 1542–1547. [Google Scholar]
  20. Ma, D.A.; Tang, P.; Zhao, L.J.; Zhang, Z. Review of Data Augmentation for Image in Deep Learning. J. Image Graph. 2021, 26, 0487–0502. [Google Scholar]
  21. Zhang, W.; Cao, Y. A new data augmentation method of remote sensing dataset based on Class Activation Map. J. Phys. Conf. Ser. 2021, 1961, 012023. [Google Scholar] [CrossRef]
  22. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Shi, W. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  23. Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional Neural Network with Data Augmentation for SAR Target Recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368. [Google Scholar] [CrossRef]
  24. Wang, Y.K.; Zhang, P.Y.; Yan, Y.H. Data Enhancement Technology of Language Model Based on Countermeasure Training Strategy. J. Autom. 2018, 44, 126–135. [Google Scholar]
  25. Ma, D.; Tang, P.; Zhao, L. SiftingGAN: Generating and Sifting Labeled Samples to Improve the Remote Sensing Image Scene Classification Baseline In Vitro. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1046–1050. [Google Scholar] [CrossRef]
  26. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  27. Su, N. A Data Augmentation Strategy Based on Simulated Samples for Ship Detection in RGB Remote Sensing Images. ISPRS Int. J. Geo-Inf. 2019, 8, 276. [Google Scholar]
  28. Wang, Z.; Du, L.; Mao, J.; Liu, B.; Yang, D. SAR Target Detection Based on SSD with Data Augmentation and Transfer Learning. IEEE Geosci. Remote Sens. Lett. 2019, 16, 150–154. [Google Scholar] [CrossRef]
  29. Scott, G.J.; England, M.R.; Starms, W.A.; Marcum, R.A.; Davis, C.H. Training Deep Convolutional Neural Networks for Land–Cover Classification of High-Resolution Imagery. IEEE Geosci. Remote Sens. Lett. 2017, 14, 549–553. [Google Scholar] [CrossRef]
  30. Chapelle, O.; Weston, J.; Bottou, L.; Vapnik, V. Vicinal Risk Minimization. Adv. Neural Inf. Process. Syst. 2000, 13. [Google Scholar]
  31. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef]
  32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  33. Yan, P.; He, F.; Yang, Y.; Hu, F. Semi-Supervised Representation Learning for Remote Sensing Image Classification Based on Generative Adversarial Networks. IEEE Access 2020, 8, 54135–54144. [Google Scholar] [CrossRef]
  34. Tokozume, Y.; Ushiku, Y.; Harada, T. Between-Class Learning for Image Classification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5486–5494. [Google Scholar]
  35. Inoue, H. Data Augmentation by Pairing Samples for Images Classification. arXiv 2018, arXiv:1801.02929. [Google Scholar]
  36. Yan, Y. Data Augmentation Method in Deep Learning. 2018. Available online: https://www.jianshu.com/p/99450dbdadcf (accessed on 2 November 2022).
  37. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. arXiv 2019, arXiv:1905.04899. [Google Scholar]
  38. Summers, C.; Dinneen, M.J. Improved Mixed-Example Data Augmentation. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1262–1270. [Google Scholar]
  39. Takahashi, R.; Matsubara, T.; Uehara, K. Data Augmentation Using Random Image Cropping and Patching for Deep CNNs. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2917–2931. [Google Scholar] [CrossRef]
  40. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  41. Bakkouri, I.; Afdel, K. Breast Tumor Classification Based on Deep Convolutional Neural Networks. In Proceedings of the 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Fez, Morocco, 22–24 May 2017; pp. 1–6. [Google Scholar]
  42. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482. [Google Scholar]
  43. Bunkhumpornpat, C.; Subpaiboonkit, S. Safe Level Graph for Synthetic Minority Over-Sampling Techniques. In Proceedings of the 2013 13th International Symposium on Communications and Information Technologies (ISCIT), Surat Thani, Thailand, 4–6 September 2013; pp. 570–575. [Google Scholar]
  44. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  45. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471. [Google Scholar] [CrossRef]
  46. Bogner, C.; Kuhnel, A.; Huwe, B. Predicting with Limited Data—Increasing the Accuracy in Vis-Nir Diffuse Reflectance Spectroscopy by Smote. In Proceedings of the 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Lausanne, Switzerland, 24–27 June 2014; pp. 1–4. [Google Scholar]
  47. Feng, W.; Huang, W.; Ye, H.; Zhao, L. Synthetic Minority Over-Sampling Technique Based Rotation Forest for the Classification of Unbalanced Hyperspectral Data. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2651–2654. [Google Scholar] [CrossRef]
  48. Hu, M.F.; Zuo, X.; Liu, J.W. Survey on Deep Generative Model. Acta Autom. Sin. 2022, 48, 40–74. [Google Scholar]
  49. Smolensky, P. Information Processing in Dynamical Systems: Foundations of Harmony Theory; Colorado University at Boulder Department of Computer Science: Boulder, CO, USA, 1986. [Google Scholar]
  50. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  51. Dayan, P.; Hinton, G.E.; Neal, R.M.; Zemel, R.S. The Helmholtz Machine. Neural Comput. 1995, 7, 889–904. [Google Scholar] [CrossRef]
  52. Burda, Y.; Grosse, R.; Salakhutdinov, R. Importance Weighted Autoencoders. arXiv 2015, arXiv:1509.00519. [Google Scholar]
  53. Maaloe, L.; Sonderby, C.K.; Sonderby, S.K.; Winther, O. Auxiliary Deep Generative Models. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1445–1453. [Google Scholar]
  54. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial Autoencoders. arXiv 2015, arXiv:1511.05644. [Google Scholar]
  55. Kingma, D.P.; Mohamed, S.; Jimenez Rezende, D.; Welling, M. Semi-Supervised Learning with Deep Generative Models. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  56. Salimans, T.; Kingma, D.; Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1218–1226. [Google Scholar]
  57. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; Wierstra, D. Draw: A Recurrent Neural Network for Image Generation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1462–1471. [Google Scholar]
  58. Kulkarni, T.D.; Whitney, W.F.; Kohli, P.; Tenenbaum, J. Deep Convolutional Inverse Graphics Network. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar]
  59. Chen, R.T.; Li, X.; Grosse, R.B.; Duvenaud, D.K. Isolating Sources of Disentanglement in Variational Autoencoders. Adv. Neural Inf. Process. Syst. 2018, 31, 1–11. [Google Scholar]
  60. Walker, J.; Doersch, C.; Gupta, A.; Hebert, M. An Uncertain Future: Forecasting from Static Images Using Variational Auto-encoders. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 835–851. [Google Scholar]
  61. Gregor, K.; Besse, F.; Jimenez Rezende, D.; Danihelka, I.; Wierstra, D. Towards Conceptual Compression. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9. [Google Scholar]
  62. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar]
  63. Xu, H.; Ding, S.; Zhang, X.; Xiong, H.; Tian, Q. Masked Autoencoders are Robust Data Augmentors. arXiv 2022, arXiv:2206.04846. [Google Scholar]
  64. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  65. Wang, K.F.; Gou, C.; Duan, Y.J.; Lin, Y.L.; Zheng, X.H.; Wang, F.Y. Generative Adversarial Networks: The State of the Art and Beyond. Acta Autom. Sin. 2017, 43, 321–332. [Google Scholar]
  66. Wang, S.Y.; Gao, X.; Sun, H.; Zheng, X.W.; Sun, X. An Aircraft Detection Method Based on Convolutional Neural Networks in High-Resolution SAR Images. J. Radars 2017, 6, 195–203. [Google Scholar]
  67. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  68. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  69. Salazar, A.; Vergara, L.; Safont, G. Generative Adversarial Networks and Markov Random Fields for oversampling very small training sets. Expert Syst. Appl. 2020, 163, 113819. [Google Scholar] [CrossRef]
  70. Hughes, L.H.; Schmitt, M.; Zhu, X.X. Generative Adversarial Networks for Hard Negative Mining in CNN-Based SAR-Optical Image Matching. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 4391–4394. [Google Scholar]
  71. Guo, J.; Lei, B.; Ding, C.; Zhang, Y. Synthetic Aperture Radar Image Synthesis by Using Generative Adversarial Nets. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1111–1115. [Google Scholar] [CrossRef]
  72. Marmanis, D.; Yao, W.; Adam, F.; Datcu, M.; Reinartz, P.; Schindler, K.; Stilla, U. Artificial Generation of Big Data for Improving Image Classification: A Generative Adversarial Network Approach on SAR Data. arXiv 2017, arXiv:1711.02010. [Google Scholar]
  73. Dinh, L.; Krueger, D.; Bengio, Y. Nice: Non-Linear Independent Components Estimation. arXiv 2014, arXiv:1410.8516. [Google Scholar]
  74. Larochelle, H.; Murray, I. The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 29–37. [Google Scholar]
  75. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density Estimation Using Real NVP. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  76. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. Adv. Neural Inf. Process. Syst. 2018, 31, 1–10. [Google Scholar]
  77. Raiko, T.; Li, Y.; Cho, K.; Bengio, Y. Iterative Neural Autoregressive Distribution Estimator Nade-K. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  78. Reed, S.; Oord, A.; Kalchbrenner, N.; Colmenarejo, S.G.; Wang, Z.; Chen, Y.; Freitas, N. Parallel Multiscale Autoregressive Density Estimation. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2912–2921. [Google Scholar]
  79. Van Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel Recurrent Neural Networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1747–1756. [Google Scholar]
  80. Germain, M.; Gregor, K.; Murray, I.; Larochelle, H. Made: Masked Autoencoder for Distribution Estimation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 881–889. [Google Scholar]
  81. Van Den Oord, A.; Vinyals, O. Neural Discrete Representation Learning. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10. [Google Scholar]
  82. Razavi, A.; Van den Oord, A.; Vinyals, O. Generating Diverse High-Fidelity Images with VQ-VAE-2. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11. [Google Scholar]
  83. Zhu, D.; Xia, S.; Zhao, J.; Zhou, Y.; Jian, M.; Niu, Q.; Yao, R.; Chen, Y. Diverse sample generation with multi-branch conditional generative adversarial network for remote sensing objects detection. Neurocomputing 2019, 381, 40–51. [Google Scholar] [CrossRef]
  84. Han, W.; Wang, L.; Feng, R.; Gao, L.; Chen, X.; Deng, Z.; Chen, J.; Liu, P. Sample generation based on a supervised Wasserstein Generative Adversarial Network for high-resolution remote-sensing scene classification. Inf. Sci. 2020, 539, 177–194. [Google Scholar] [CrossRef]
  85. Zaytar, M.A.; El Amrani, C. Satellite Imagery Noising with Generative Adversarial Networks. Int. J. Cogn. Informat. Nat. Intell. 2021, 15, 16–25. [Google Scholar] [CrossRef]
  86. Tasar, O.; Happy, S.L.; Tarabalka, Y.; Alliez, P. SEMI2I: Semantically Consistent Image-to-Image Translation for Domain Adaptation of Remote Sensing Data. In Proceedings of the IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1837–1840. [Google Scholar]
  87. Zheng, K.; Wei, M.; Sun, G.; Anas, B.; Li, Y. Using Vehicle Synthesis Generative Adversarial Networks to Improve Vehicle Detection in Remote Sensing Images. ISPRS Int. J. Geo-Inf. 2019, 8, 390. [Google Scholar] [CrossRef]
  88. Sun, X.; Wang, B.; Wang, Z.; Li, H.; Li, H.; Fu, K. Research Progress on Few-Shot Learning for Remote Sensing Image Interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2387–2402. [Google Scholar] [CrossRef]
  89. Kusk, A.; Abulaitijiang, A.; Dall, J. Synthetic SAR Image Generation Using Sensor, Terrain and Target Models. Proceedings of EUSAR 2016: 11th European Conference on Synthetic Aperture Radar, Hamburg, Germany, 6–9 June 2016; pp. 1–5. [Google Scholar]
  90. Malmgren-Hansen, D.; Kusk, A.; Dall, J.; Nielsen, A.A.; Engholm, R.; Skriver, H. Improving SAR Automatic Target Recognition Models with Transfer Learning from Simulated Data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1484–1488. [Google Scholar] [CrossRef]
  91. Yan, Y.; Zhang, Y.; Su, N. A Novel Data Augmentation Method for Detection of Specific Aircraft in Remote Sensing RGB Images. IEEE Access 2019, 7, 56051–56061. [Google Scholar] [CrossRef]
  92. You, T.; Chen, W.; Wang, H.; Yang, Y.; Liu, X. Automatic Garbage Scattered Area Detection with Data Augmentation and Transfer Learning in SUAV Low-Altitude Remote Sensing Images. Math. Probl. Eng. 2020, 2020, 730762. [Google Scholar] [CrossRef]
  93. Mo, N.; Yan, L. Improved Faster RCNN Based on Feature Amplification and Oversampling Data Augmentation for Oriented Vehicle Detection in Aerial Images. Remote Sens. 2020, 12, 2558. [Google Scholar] [CrossRef]
  94. Xiao, Q.; Liu, B.; Li, Z.; Ni, W.; Yang, Z.; Li, L. Progressive Data Augmentation Method for Remote Sensing Ship Image Classification Based on Imaging Simulation System and Neural Style Transfer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 9176–9186. [Google Scholar] [CrossRef]
  95. Wang, K.; Zhang, G.; Leung, H. SAR Target Recognition Based on Cross-Domain and Cross-Task Transfer Learning. IEEE Access 2019, 7, 153391–153399. [Google Scholar] [CrossRef]
  96. Wang, H.; Gao, Y.; Chen, X. Transfer of Reinforcement Learning: Methods and Progress. Acta Electron. Sin. 2008, 36, 39–43. [Google Scholar]
  97. YU, C.C.; Tian, R.; Tan, L.; TU, X.Y. Integrated Transfer Learning Algorithmic for Unbalanced Samples Classification. Acta Electron. Sin. 2012, 40, 1358–1363. [Google Scholar]
  98. Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain Adaptation via Transfer Component Analysis. IEEE Trans. Neural Netw. 2010, 22, 199–210. [Google Scholar] [CrossRef]
  99. Ni, T.-g.; Wang, S.-t.; Ying, W.-h.; Deng, Z.-h. Transfer Group Probabilities Based Learning Machine. Acta Electron. Sin. 2013, 41, 2207–2215. [Google Scholar]
  100. Bruzzone, L.; Marconcini, M. Toward the Automatic Updating of Land-Cover Maps by a Domain-Adaptation SVM Classifier and a Circular Validation Strategy. IEEE Trans. Geosci. Remote Sens. 2009, 47, 1108–1122. [Google Scholar] [CrossRef]
  101. Zhang, Y.; Zheng, X.; Liu, G.; Sun, X.; Wang, H.; Fu, K. Semi-Supervised Manifold Learning Based Multigraph Fusion for High-Resolution Remote Sensing Image Classification. IEEE Geosci. Remote Sens. Lett. 2013, 11, 464–468. [Google Scholar] [CrossRef]
  102. Hu, K.; Yan, H.; Xia, M.; Xu, T.; Hu, W.; Xu, C. Satellite Cloud Classification Based on Migration Learning. Trans. Atmos. Sci. 2017, 40, 856–863. [Google Scholar]
  103. Han, M.; Yang, X. Modified Bayesian ARTMAP Migration Learning Remote Sensing Image Classification Algorithm. Acta Electron. Sin. 2016, 44, 2248–2253. [Google Scholar]
  104. Wu, T.J.; Luo, J.C.; Xia, L.G.; Yang, H.; Shen, Z.; Hu, X. An Automatic Sample Collection Method for Object-oriented Classification of Remotely Sensed Imageries Based on Transfer Learning. Acta Geod. Cartogr. Sin. 2014, 43, 908–916. [Google Scholar]
  105. Liu, Y.; Li, X. Domain adaptation for land use classification: A spatio-temporal knowledge reusing method. ISPRS J. Photogramm. Remote Sens. 2014, 98, 133–144. [Google Scholar] [CrossRef]
  106. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  107. Li, W.; Wu, L.; Chen, G.Y. Research on Deep Belief Network Model Based on Remote Sensing Classification. Geol. Sci. Technol. Inf. 2018, 37, 208–214. [Google Scholar]
  108. Jiang, H. Research on Feature Extraction and Classification Technology of Hyperspectral Data Based on Convolutional Neural Network. Master’s Thesis, Harbin Institute of Technology, Harbin, China, 2016. [Google Scholar]
  109. Jiao, J.; Zhang, F.; Zhang, L. Estimation of Rape Planting Area by Remote Sensing Based on Improved AlexNet Model. Comput. Meas. Control 2018, 26, 186–189. [Google Scholar]
  110. Lemley, J.; Bazrafkan, S.; Corcoran, P. Smart Augmentation Learning an Optimal Data Augmentation Strategy. IEEE Access 2017, 5, 5858–5869. [Google Scholar] [CrossRef]
  111. Ni, R.; Goldblum, M.; Sharaf, A.; Kong, K.; Goldstein, T. Data Augmentation for Meta-Learning. In Proceedings of the International Conference on Machine Learning, Shenzhen, China, 26 February 2021–1 March 2021; pp. 8152–8161. [Google Scholar]
  112. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  113. Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:1803.02999. [Google Scholar]
  114. Lee, K.; Maji, S.; Ravichandran, A.; Soatto, S. Meta-Learning with Differentiable Convex Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10657–10665. [Google Scholar]
  115. Bertinetto, L.; Henriques, J.F.; Torr, P.H.; Vedaldi, A. Meta-Learning with Differentiable Closed-Form Solvers. arXiv 2018, arXiv:1805.08136. [Google Scholar]
  116. Li, Y.; Shao, Z.; Huang, X.; Cai, B.; Peng, S. Meta-FSEO: A Meta-Learning Fast Adaptation with Self-Supervised Embedding Optimization for Few-Shot Remote Sensing Scene Classification. Remote Sens. 2021, 13, 2776. [Google Scholar] [CrossRef]
  117. Yang, Q.; Ni, Z.; Ren, P. Meta captioning: A meta learning based remote sensing image captioning framework. ISPRS J. Photogramm. Remote Sens. 2022, 186, 190–200. [Google Scholar] [CrossRef]
  118. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning Augmentation Policies from Data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
  119. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  120. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
Figure 2. A collection of one-sample transforms. (a–h) show rotation, zoom, flip, shift, random crop, sharpness transformation, noise disturbance, and random erasing. The first column shows the original image; the second, third, and fourth columns show the transformed images. (The size of each image is 800 × 800 pixels.)
Figure 3. Model structure of sample pairing.
Figure 4. Example of image spatial information synthesis.
Figure 5. Randomly cropped and stitched blended image.
Figure 6. The model structure of the RBM.
Figure 7. The model structure of the DBN [48].
Figure 8. The structure of the HM model [51].
Figure 9. The structure of the DBM model [48].
Figure 10. The overall structure of the VAE [50].
Figure 11. The calculation process and structure of the GAN model.
Figure 12. Dropout neural network model. (Left: a standard neural network with two hidden layers. Right: an example of a thinned network produced by applying dropout to the network on the left; crossed units have been dropped.)
Figure 13. Comparison of a standard network with a dropout network (Left: standard network; Right: network with dropout).
Figure 14. Dropout operation in the prediction model (Left: at training time; Right: at test time).
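To make the training/test distinction in Figures 12–14 concrete, the following minimal NumPy sketch implements dropout in the "inverted" form used by most modern frameworks, which rescales the surviving activations during training so that the test-time forward pass needs no change; the original formulation [106] instead scales the weights by p at test time, as Figure 14 depicts. The array shape and drop probability are illustrative assumptions.

```python
# A minimal NumPy sketch of (inverted) dropout for an activation matrix `a`
# of shape (batch, units): each unit is dropped with probability p during
# training and the survivors are rescaled; at test time the layer is a no-op.
import numpy as np

def dropout(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return a                       # test time: use the full network
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)        # inverted scaling keeps the expected activation unchanged
```

Frameworks such as PyTorch and TensorFlow apply exactly this rescaling inside their dropout layers, which is why no modification is required at inference time.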
Table 1. Data-based data augmentation methods.

| Primary Classification | Secondary Classification | Three-Level Classification | Representative Method |
| --- | --- | --- | --- |
| One-Sample Transform | Geometric Transformation | — | Rotate, scale, flip, shift, crop, etc. [21] |
| One-Sample Transform | Sharpness Transformation | — | Sharpen, blur, etc. [22] |
| One-Sample Transform | Noise Disturbance | — | Gaussian noise, salt-and-pepper noise, speckle noise, etc. [23] |
| One-Sample Transform | Random Erase | — | Erase, mask, etc. [20] |
| Multi-Sample Synthesis | Image Spatial Information Synthesis | Linear stacking method of multiple images | Mixup [32]; between-class (BC) [34]; sample pairing [35] |
| Multi-Sample Synthesis | Image Spatial Information Synthesis | Nonlinear blending method of multiple images | Vertical Concat, Horizontal Concat, Mixed Concat, Random 2 × 2, VH-Mixup, VH-BC+, random square, random column interval, random row interval, random rows, random columns, random pixels, random elements, noisy mixup, random cropping and stitching, etc. [38,39] |
| Multi-Sample Synthesis | Feature Space Information Synthesis | — | SMOTE [40] |
| Deep Generative Models | Approximation Method | Restricted Boltzmann machines | Deep Boltzmann machines [48]; Helmholtz machine [51]; deep belief network [48] |
| Deep Generative Models | Approximation Method | Variational auto-encoder | IWAE [52]; ADGM [53]; AAE [54]; MAE [62] |
| Deep Generative Models | Implicit Method | Generative adversarial nets (GAN) [64] | DCGAN [67]; BigGAN [68] |
| Deep Generative Models | Deformation Method | Flow model [73] | Normalizing flow model [73]; i-ResNet [48]; variational inference with flow [48] |
| Deep Generative Models | Deformation Method | Autoregressive model [74] | NADE [48]; PixelRNN [79]; MADE [80] |
| Virtual Sample Generation | — | — | — |
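As a concrete illustration of a few representative methods from Table 1, the sketch below implements a geometric transform, random erasing, and Mixup with plain NumPy; the array shapes, parameter values, and function names are illustrative assumptions rather than the implementations used in the cited papers.

```python
# A minimal NumPy sketch of three data-based methods from Table 1. Images are
# assumed to be float arrays of shape (H, W, C) and labels one-hot vectors.
import numpy as np

rng = np.random.default_rng(42)

def geometric_transform(img):
    """Random 90-degree rotation plus a random horizontal flip (semantics-preserving for remote sensing imagery)."""
    img = np.rot90(img, k=rng.integers(0, 4))
    return img[:, ::-1] if rng.random() < 0.5 else img

def random_erase(img, max_frac=0.3):
    """Mask a random rectangle with the image mean to simulate occlusion (e.g., cloud cover)."""
    out = img.copy()
    h, w = out.shape[:2]
    eh = rng.integers(1, max(2, int(h * max_frac) + 1))
    ew = rng.integers(1, max(2, int(w * max_frac) + 1))
    y, x = rng.integers(0, h - eh + 1), rng.integers(0, w - ew + 1)
    out[y:y + eh, x:x + ew] = out.mean()
    return out

def mixup(img1, lab1, img2, lab2, alpha=0.2):
    """Linearly blend two samples and their one-hot labels (image spatial information synthesis)."""
    lam = rng.beta(alpha, alpha)
    return lam * img1 + (1 - lam) * img2, lam * lab1 + (1 - lam) * lab2
```

Because these operations act directly on arrays rather than on a particular framework's pipeline, they can be applied on the fly inside any data loader.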
Table 2. The experimental results of geometric transformation methods on the Caltech101 dataset.

| Data Augmentation Method | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
| --- | --- | --- |
| Baseline (no data augmentation) | 48.13 ± 0.42 | 64.50 ± 0.65 |
| Geometric transformation: rotation | 50.80 ± 0.63 | 69.41 ± 0.48 |
| Geometric transformation: flip | 49.73 ± 1.13 | 67.36 ± 1.38 |
| Geometric transformation: random crop | 61.95 ± 1.01 | 79.10 ± 0.80 |
Table 3. The network-based data augmentation methods.

| Primary Classification | Secondary Classification | Representative Method |
| --- | --- | --- |
| Network Strategy | Transfer learning [97] | — |
| Network Strategy | Regularization [26,37] | Dropout [106] |
| Learning Strategy | Meta-learning [111,112,113,114,115,116] | MAML [112], Reptile [113], MetaOptNet [114], R2-D2 [115] |
| Learning Strategy | Reinforcement learning [96] | AutoAugment [118] |
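As a minimal illustration of the transfer learning strategy in Table 3, the following PyTorch/torchvision sketch (assuming torchvision ≥ 0.13) freezes an ImageNet-pretrained backbone and retrains only a new classification head on a small remote sensing dataset; the number of classes and the optimizer settings are placeholder assumptions.

```python
# A minimal sketch of transfer learning: reuse an ImageNet-pretrained ResNet-18
# as a frozen feature extractor and train only a new classification head.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # e.g., the number of target categories in the remote sensing task

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                               # freeze the pretrained feature extractor
model.fc = nn.Linear(model.fc.in_features, num_classes)       # new head, trained from scratch

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only the head's parameters are updated
```

When more target-domain data are available, the last convolutional stages can also be unfrozen and fine-tuned with a smaller learning rate.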
Table 4. Comparison of advantages and disadvantages of different data augmentation methods.

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Geometric Transformation [21] | Easy to operate; increases the spatial information in the dataset; improves the geometric robustness of the model. | The amount of added information is limited; it increases data duplication; improper manipulation may change the original semantic annotation of the image. |
| Sharpness Transformation [22] | Improves the robustness of the model to blurred targets and can highlight details of the target object. | Mostly realized by filtering, which duplicates the internal convolution mechanism of convolutional neural networks. |
| Noise Disturbance [23] | Improves the model's ability to filter out noise and redundant information, as well as its ability to recognize images of different resolutions. | Cannot add new effective information, and the improvement in model accuracy is not obvious. |
| Random Erase [20] | Increases the robustness of the model when objects are occluded; the model learns more descriptive object features and attends to the global information of the whole image. | May change the semantic information of the original image; the image may become unrecognizable once its most characteristic part is occluded. |
| Image Spatial Information Synthesis [20] | Mixes pixel-value information from multiple images. | The mixing is not intuitively reasonable and lacks interpretability. |
| Feature Space Information Synthesis [20] | Combines the feature information of multiple images. | The synthesized feature vectors are difficult to interpret. |
| Deep Generative Models [48] | Sample from the fitted data distribution and can generate an unlimited number of samples. | Complex to operate; a certain number of training samples is needed, and the models are difficult to train; training efficiency is low, the theory is complex, and the quality of the generated images is not always high. |
| Virtual Sample Generation [88] | Enables the generation of any required sample. | Production costs are high. |
| Transfer Learning [98] | High efficiency, low cost, and a short development cycle. | Accuracy in the target domain cannot be guaranteed, and there is currently no flexible, easy-to-use transfer learning tool. |
| Regularization [106] | Prevents overfitting and reduces complex co-adaptation between neurons. | The cost function is less well-defined, and training time increases significantly. |
| Meta-Learning [110] | Replaces a fixed augmentation method with a neural network that is trained to learn a better augmentation strategy. | High complexity; an additional network is required, which increases the training cost. |
| Reinforcement Learning [118] | Combines existing data augmentation methods and searches for an optimal strategy. | The policy search space is large, and the training complexity and computational cost are high. |
Table 5. The comparison of the results of different data augmentation algorithms on the CIFAR datasets.

| Algorithm | Model | CIFAR-10 Test Error: Baseline → Augmented (%) | CIFAR-100 Test Error: Baseline → Augmented (%) |
| --- | --- | --- | --- |
| Random Erase [20] | WideResNet | 3.80 → 3.08 | 18.49 → 17.73 |
| Cutout [26] | WideResNet | 6.97 → 5.54 | 26.06 → 23.94 |
| Sample Pairing [35] | CNN (8 layers) | 8.22 → 6.93 | 30.5 → 27.9 |
| Mixup [32] | WideResNet | 3.8 → 2.7 | 19.4 → 17.5 |
| BC [34] | ResNet-29 | 8.38 → 7.69 | 31.36 → 30.79 |
| AutoAugment [118] | WideResNet | 3.9 → 2.6 | 18.8 → 17.1 |
| AutoAugment [118] | PyramidNet | 2.7 → 1.5 | 14.0 → 10.7 |
