Enhancing crop classification accuracy by synthetic SAR-Optical data generation using deep learning

Crop classification using remote sensing data has emerged as a prominent research area in recent decades. Studies have demonstrated that fusing SAR and optical images can significantly enhance classification accuracy. However, a major challenge in this field is the limited availability of training data, which adversely affects the performance of classifiers. In agricultural regions, one or two crop types typically dominate, while other crops are scarce. Consequently, when training samples are collected to map agricultural products, samples from the dominant crops are abundant and form the majority classes, whereas samples from other crops are scarce and form the minority classes. Traditional data generation methods have been employed to tackle this imbalance in the training data, but they still face limitations in effectively handling the minority classes. In this research, we explore the effectiveness of the conditional tabular generative adversarial network (CTGAN), a synthetic data generation method based on a deep learning network, in addressing the challenge of limited training data for minority classes in crop classification using fused SAR-optical data. Our findings demonstrate that the proposed method generates higher-quality synthetic data that can significantly increase the number of samples for minority classes, leading to better performance of crop classifiers.


Introduction
Cropland classification using remote sensing data has been an active and prominent research topic over the last two decades. Remotely sensed data acquired by synthetic aperture radar (SAR) and optical sensors play an important role in estimating the area under cultivation and determining crop yield by supporting the preparation of reliable crop maps. The fusion of SAR and optical data can greatly help to improve classification accuracy and to obtain more comprehensive information.
The RapidEye satellite is one of the relatively high-resolution optical satellites that has been used in several recent studies for agricultural applications, especially crop mapping [1][2][3][4][5][6]. The spectral bands of RapidEye were specially designed for applications related to vegetation analysis and provide additional indexing capabilities for extracting crop types [4]. On the other hand, the airborne uninhabited aerial vehicle synthetic aperture radar (UAVSAR) has also been one of the most widely used radar sensors in the field of crop mapping in the last few years [7][8][9][10][11][12][13]. In addition to providing high spatial resolution, this sensor is fully polarimetric and acquires all four polarization channels. Therefore, it is possible to extract coherent and incoherent decomposition parameters related to vegetation and crop types. The fusion of both sensors has shown a high potential for improving the accuracy of crop mapping in different studies [14][15][16].
Agricultural regions have distinct cultivation patterns, with each region typically characterized by one or two predominant crops, while other crops are less prevalent. Consequently, the availability of training data for all classes, particularly minority crops, is limited. This scarcity poses a challenge for conventional classifiers, as they struggle to accurately differentiate between minority and dominant classes, and classes with insufficient training samples are often misclassified. For instance, previous research [17] demonstrated that the Maximum Likelihood and Fully Connected classifiers performed poorly when trained on datasets with insufficient samples compared to datasets with sufficient samples. In summary, the heavy emphasis of agricultural regions on a small set of major crops leads to an uneven distribution of training data, with too few samples for minority crops, and this imbalance degrades the performance of conventional classifiers on the minority classes.
To address the challenge of insufficient training data, various methods are employed in the data preprocessing stage. These methods aim to mitigate the impact of limited samples on classifier performance. For instance, the random under-sampling (RUS) method attempts to equalize the influence of all classes by randomly removing samples from the majority classes [18]. However, this approach carries the risk of discarding potentially valuable and informative samples that could benefit the classifier. Conversely, the random over-sampling (ROS) method artificially increases the sample size of minority classes by duplicating samples. While this technique may help balance the class distribution, it can also cause the classifier to overfit, because the duplicated data are redundant and uninformative [18]. In short, RUS may result in sample loss, while ROS can lead to overfitting.
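The two resampling strategies can be illustrated with a minimal sketch. The function names, the toy sample identifiers, and the class sizes below are hypothetical and are not taken from the dataset used in this study; the sketch only shows the mechanics of removing majority samples (RUS) and duplicating minority samples (ROS).

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_under_sample(samples, labels, seed=0):
    """RUS: keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        # Sampling without replacement discards the remaining samples.
        out.extend((x, y) for x in rng.sample(xs, target))
    return out


def random_over_sample(samples, labels, seed=0):
    """ROS: duplicate random samples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in xs)
        # The added samples are exact copies, so no new information is created.
        out.extend((rng.choice(xs), y) for _ in range(target - len(xs)))
    return out


# Hypothetical toy data: 9 majority samples and 3 minority samples.
samples = list(range(12))
labels = ["wheat"] * 9 + ["peas"] * 3

rus = random_under_sample(samples, labels)
ros = random_over_sample(samples, labels)
rus_counts = Counter(y for _, y in rus)
ros_counts = Counter(y for _, y in ros)
unique_peas_after_ros = len({x for x, y in ros if y == "peas"})
```

After ROS the minority class has as many rows as the majority class but still only three distinct samples, which is the redundancy that can lead to overfitting.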
To solve such problems, the synthetic minority oversampling technique (SMOTE) was proposed, in which synthetic data are created based on the feature space similarities between minority class samples, using linear interpolation between a minority sample and its K nearest minority neighbors [19]. This method has shown good performance in some studies [19][20][21]. However, the algorithm also has weaknesses. Specifically, SMOTE can be divided into two parts. The first part is the strategy for selecting the available samples to be used in the synthetic data generation stage. Since SMOTE treats all samples of the minority class as equally important, noisy samples are propagated into the generated synthetic data. The second part of SMOTE is the linear interpolation strategy for synthetic data generation. This strategy leads to the generation of almost duplicate data, which can cause overfitting in the classifier training process. Table 1 summarizes previous studies that used data generation methods to improve the accuracy of land use and land cover classification.
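The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration rather than the reference implementation: the function name `smote_like` and the toy points are hypothetical, and Euclidean distance is assumed for the neighbor search.

```python
import math
import random


def smote_like(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points on segments between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for i, p in enumerate(minority) if i != base_idx),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature minority samples on the corners of a unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority)
```

Every synthetic point lies on a straight segment between two existing samples, so the generated data never leave the region already spanned by the minority class. This is why the interpolation tends to produce near-duplicate samples.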
Recent advances in deep generative networks have created many possibilities in the field of synthetic data generation. These networks try to learn the probability distribution of real data and produce high-quality synthetic samples. Generative models have typically performed well in the image and text domains, but they have been less successful in producing structured (tabular) synthetic data. In recent years, several studies have focused on improving the performance of generative models, especially generative adversarial networks (GANs), on structured data [27]. One of the most important challenges is the often non-Gaussian distribution of features in tabular data [28]. As a solution, the conditional tabular GAN (CTGAN) has been developed to generate synthetic data by considering the distribution of the input features. The CTGAN architecture uses a new normalization method to overcome non-Gaussian distributions. Thus, CTGAN can potentially be employed for synthetic feature generation when the training data for the crop classification task are insufficient.
This paper aims to explore the potential of the CTGAN network in addressing the challenge of crop classification using unbalanced tabular samples that comprise optical and SAR polarimetric features. While previous studies have examined the capabilities of the CTGAN network in various data science domains, this study focuses on the network's effectiveness in mitigating the impact of insufficient training data for agricultural product classification using SAR-optical derived features. Integrating optical and SAR data has the potential to yield significant improvements in classification accuracy. By combining the unique strengths of these two data modalities, such as the spectral information from optical data and the structural information from SAR data, we can obtain a more comprehensive understanding of the target objects or land cover classes. This fusion of information enhances the discriminative power of the classification models and enables them to capture a wider range of features and characteristics.
Furthermore, this study aims to explore the capabilities of the CTGAN model in generating synthetic data from both optical and SAR images. CTGAN, as a deep learning-based generative model, has shown promise in generating realistic synthetic data that closely resemble the distribution of the original data. By leveraging this capability, we can augment the training dataset with synthetic samples, thereby increasing data diversity and balancing class distributions. This approach has the potential to address the challenge of limited or imbalanced training data and ultimately improve classification performance. Combining the complementary optical and SAR data with the synthetic data generation capabilities of CTGAN offers a promising avenue toward more precise and reliable results, because the resulting dataset captures the characteristics of both modalities while compensating for scarce minority samples. Ultimately, this research direction has the potential to contribute to the accuracy, robustness, and reliability of classification analyses in the field of Earth observation.
This manuscript consists of several sections. The literature review and the objective of this investigation were introduced above. The study area and the dataset are introduced in Section 2.1. The details of the proposed method and the experimental settings are explained in Sections 2.2 and 2.3, respectively. Section 3 presents the results of the experiments. Finally, the paper concludes with a discussion of the achieved results.

Data set and study area
The study area in this research was an agricultural area of Winnipeg, Manitoba, Canada (see Figure 1). The research used fused bi-temporal optical and PolSAR images. The optical and PolSAR images were acquired by the RapidEye and UAVSAR sensors on 5 and 14 July 2012. The spectral bands of the RapidEye images were blue (B), green (G), red (R), near-infrared (NIR), and red-edge (RE), with a spatial resolution of about 5 m. The UAVSAR images had four polarizations at L-band frequency with a spatial resolution of about 15 m.
The Soil Moisture Active Passive Validation Experiment 2012 (SMAPVEX 2012) campaign was conducted for the calibration and validation of the National Aeronautics and Space Administration (NASA)'s SMAP satellite over 43 days during the summer of 2012 [29]. During this campaign, the crop type labels of this dataset were collected from the study area, covering seven classes: broadleaf, canola, corn, oats, peas, soybeans, and wheat.
Table 2 presents the imbalance ratio (IR) in the utilized dataset, which measures the disparity in sample distribution. The IR is calculated as the ratio of $N_1$ to $N_2$, where $N_1$ represents the number of samples in each class and $N_2$ represents the number of samples in the majority class. It is evident that the samples are unevenly distributed across the classes. In particular, the Peas and Broadleaf classes are identified as minority classes, while the remaining classes are categorized as majority classes.
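As a small worked example of this definition, the IR of each class is its sample count divided by the count of the largest class. The class counts below are hypothetical placeholders and are not the values reported in Table 2.

```python
def imbalance_ratios(class_counts):
    """Return the IR of each class: its count divided by the majority-class count."""
    majority = max(class_counts.values())
    return {name: count / majority for name, count in class_counts.items()}


# Hypothetical class counts, for illustration only.
counts = {"wheat": 50000, "soybeans": 40000, "peas": 100}
ratios = imbalance_ratios(counts)
```

With this definition the majority class always has an IR of 1, and the IR of a class moves toward 1 as samples are added to it.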

Methodology
The detrimental impact of insufficient training samples in the minority classes on classifier performance has been highlighted in the introduction. This study investigates the potential of the CTGAN network in addressing this issue, specifically in the context of agricultural product classification using optical and polarimetric SAR features. The research process is depicted in Figure 2, which outlines the key steps involved. Initially, preprocessing is applied to the SAR and optical images. Before any further processing, the SAR and optical images must be co-registered. The two image sources were co-registered using a linear polynomial for geometric rectification and the nearest neighbor method for gray level interpolation [30]. Then, various features are extracted from the optical and SAR images. Synthetic samples are subsequently generated for the minority classes and added to the training data, and the resulting new dataset is utilized for hyperparameter tuning of different classifiers. Notably, the test samples are separated in advance and are involved neither in hyperparameter tuning nor in training the models. Finally, the quality of the generated synthetic data is evaluated by assessing the performance of the trained classifiers on the independent test data. This evaluation illustrates the effectiveness of the CTGAN network in addressing the challenge of insufficient training data for minority classes in agricultural product classification. Each step is described in more detail in the following sections.

Optical and polarimetric feature extraction
In this research, we extracted features from SAR and optical imagery according to the methodology presented in [15]. Tables 3 and 4 present the optical and polarimetric features, respectively. The optical features for each RapidEye image included 5 spectral channels, 17 vegetation indices, and 16 textural indicators, for a total of 38 features. The spectral channels were blue (B), green (G), red (R), near-infrared (NIR), and red-edge (RE). The polarimetric features for each UAVSAR image (Table 4) comprised several groups of decomposition parameters, including 4 Yamaguchi parameters, for a total of 46 features. Across the two acquisition dates, this gives 2 × (38 + 46) = 168 features. In Table 4, $H$, $A$, and $\alpha$ are the entropy, anisotropy, and alpha angle, respectively; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the eigenvalues of the coherency matrix (T); PH is the pedestal height; and RVI is the radar vegetation index. The polarimetric features give information about the physical and structural properties and also the scattering mechanisms of the various crop types [32].

Machine learning classifiers
Classifiers based on machine learning have received much attention in past studies, especially in the field of remote sensing [33][34][35][36]. These algorithms are well suited to modeling complex classes and exploiting different input features. Also, they do not need any initial assumptions about the data distribution. In general, these algorithms are more accurate than traditional parametric methods, especially for high-dimensional data [37]. In this research, three algorithms, i.e., random forest (RF), extreme gradient boosting (XGBoost), and K nearest neighbor (KNN), are used to investigate the performance of the CTGAN network in generating synthetic SAR-optical features.
RF utilizes the bagging method for training, where base learners (decision trees) are trained independently. In this approach, data points are randomly selected from the training set with replacement, so a training sample may be selected multiple times for the same tree. The majority vote over the decision trees' outputs determines the final output class. By aggregating decision trees, RF is robust against overfitting, capable of identifying outliers, and able to assess the importance of input variables. However, as the number and complexity of trees increase, the training and prediction time of the model also increases [38].
The XGBoost algorithm, based on gradient-boosted decision trees (GBDT), is another popular machine learning algorithm. It leverages the errors from previous iterations and gives more importance to incorrectly predicted instances in subsequent iterations. XGBoost incorporates regularization in the cost function to avoid overfitting and employs parallel processing during training, resulting in faster processing and improved accuracy. Each tree in this algorithm generates an output based on different independent variables. After the trees are constructed, their outputs are combined to determine the class of the input data [39].
KNN is a lazy learning algorithm that operates based on nearest neighbors. It calculates the distance between the test sample and all training samples, selects the K closest samples, and assigns the dominant class among these K samples to the test sample [40].
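The KNN decision rule, and its sensitivity to scarce classes, can be sketched in a few lines. The sketch assumes Euclidean distance and uses hypothetical two-feature samples; it is not the tuned classifier used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the dominant label among the k training samples closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: three majority samples and two minority samples.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
train_y = ["wheat", "wheat", "wheat", "peas", "peas"]

near_wheat = knn_predict(train_x, train_y, (0.1, 0.1))
near_peas = knn_predict(train_x, train_y, (5.0, 5.1))

# With a single minority sample, the same query is outvoted by the majority class.
swamped = knn_predict(train_x[:4], train_y[:4], (5.0, 5.1))
```

The last call shows why KNN is particularly affected by insufficient minority samples: when fewer than half of the K neighbors can come from the minority class, the vote goes to the majority class regardless of distance.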

Synthetic data generation
The introduction highlighted the negative impact of insufficient training samples on the performance of various classifiers. Because minority classes contribute little to minimizing the loss function during training, the classifier becomes biased towards the majority classes [41].
To address this challenge, different methods are employed, including random sampling techniques such as ROS and RUS. The strengths and weaknesses of these methods were already discussed in the Introduction. Another popular method for generating synthetic data for minority classes is SMOTE. This technique generates artificial samples along the line connecting K real samples within the minority classes. Previous studies have successfully utilized SMOTE for synthetic data generation [19,22,42]. However, this method still suffers from some problems, such as generating outlier data. Moreover, there are scenarios in which the linear interpolation used in SMOTE generates duplicate data and causes problems such as overfitting [22].
Recently, GANs have been employed for generating various types of synthetic data. A GAN includes two parts, a generator and a discriminator, which learn the distribution of the data through an adversarial training process. The task of the generator is to generate synthetic data, assuming a Gaussian distribution for the real data. The discriminator is responsible for distinguishing the synthetic data produced by the generator from real data. When the generator can defeat the discriminator, the GAN is able to produce synthetic data similar to the real data. However, these networks encounter challenges in generating desirable synthetic samples when the features of a tabular dataset are non-Gaussian. To mitigate this weakness, the CTGAN network has been proposed. CTGAN is a special type of GAN designed to generate synthetic structured (tabular) data. Despite the architectural similarities between CTGAN and other GANs, there are key differences between these networks. First, CTGAN is specifically designed for generating synthetic tabular data, which includes both continuous and categorical variables organized in a tabular format. Second, CTGAN supports conditional data generation, which means users can specify conditions to generate synthetic data for a particular value of a variable. Third, CTGAN uses categorical embeddings to represent categorical variables in the generated synthetic data, which allows it to handle discrete categorical variables effectively. Other GANs may not have specific mechanisms to handle categorical variables or may require additional preprocessing or encoding techniques [28,[43][44][45].
Figure 3 illustrates the overall structure of CTGAN, which includes novel preprocessing techniques to improve GAN performance on tabular data generation. Tabular data distributions may be non-Gaussian, which can cause GANs to struggle with the "vanishing gradient" problem during training. To address this, CTGAN first estimates the underlying distribution of each feature using a Variational Gaussian Mixture Model (VGMM). The VGMM represents the overall distribution as a weighted combination of multiple Gaussian components, each with its own mean and covariance, and therefore models multi-modal distributions more flexibly than a single Gaussian. The estimated VGMM distribution is then used to normalize each feature through "encoding". The encoded data have a standardized distribution that helps GAN training converge. The generator produces synthetic samples in this normalized space. After training, a "decoding" step transforms the generated data back to the original distribution by reversing the encoding transformations. This preprocessing allows CTGAN to handle complex, non-Gaussian tabular datasets while stabilizing GAN training. The end-to-end framework can generate high-quality synthetic samples in the native distribution of the real data [28,43].
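The encoding and decoding steps can be illustrated with a minimal sketch of mode-specific normalization for a single continuous feature. Several points here are assumptions made for illustration: the mixture parameters are hypothetical (in practice they would be fitted by the VGMM), the most likely mode is selected deterministically instead of being sampled, and the scaling by four standard deviations follows the convention commonly described for CTGAN.

```python
import math


def encode(value, modes):
    """Encode a value as (scaled offset within its mode, one-hot mode indicator).

    modes is a list of (weight, mean, std) tuples describing the fitted mixture.
    """
    # Unnormalised weighted Gaussian density of each mode at this value.
    dens = [w * math.exp(-0.5 * ((value - m) / s) ** 2) / s for w, m, s in modes]
    k = max(range(len(modes)), key=lambda i: dens[i])  # simplification: argmax
    _, mean, std = modes[k]
    alpha = (value - mean) / (4.0 * std)
    one_hot = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, one_hot


def decode(alpha, one_hot, modes):
    """Invert encode: map the scaled offset back to the original feature scale."""
    k = one_hot.index(1)
    _, mean, std = modes[k]
    return alpha * 4.0 * std + mean


# Hypothetical bimodal feature: modes centred at 0 and 10.
modes = [(0.5, 0.0, 1.0), (0.5, 10.0, 2.0)]
alpha, one_hot = encode(9.0, modes)
restored = decode(alpha, one_hot, modes)
```

Each value is thus represented relative to the mode it belongs to, so a multi-modal feature becomes a bounded scalar plus a discrete mode label, and decoding recovers the value on the original scale.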
CTGAN network training is based on two loss functions, $L_D$ and $L_G$, for the discriminator and the generator, respectively. In these losses, $D(x)$ is the output of the discriminator for real data, $D(x')$ is the output of the discriminator for synthetic data, $H$ is the cross-entropy score, and $N$ denotes the number of synthetic samples [28,43].
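One form of these losses that is consistent with the quantities named above is given below as an assumed sketch, not as the exact expression used in this study. Here $L_D$ and $L_G$ are the discriminator and generator losses, $D(x_i)$ and $D(x'_i)$ are the discriminator outputs for the $i$-th real and synthetic sample, $H$ is the cross-entropy term, and $N$ is the number of samples in a batch. The sketch follows a Wasserstein-style objective and omits the gradient-penalty term that is usually added to the discriminator loss.

```latex
\begin{align}
L_D &= \frac{1}{N}\sum_{i=1}^{N} D(x'_i) - \frac{1}{N}\sum_{i=1}^{N} D(x_i) \\
L_G &= -\frac{1}{N}\sum_{i=1}^{N} D(x'_i) + H
\end{align}
```

Under this form, the discriminator is rewarded for scoring real samples higher than synthetic ones, while the generator is rewarded for raising the scores of its synthetic samples and for matching the requested condition through the cross-entropy term.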

Experimental setups
During the training process of all algorithms, 10% of the dataset was allocated as the training dataset, while the remaining data served as the test dataset. To determine the optimal hyperparameters for each classifier, we employed the random search algorithm combined with a K-fold cross-validation strategy. Specifically, we set K to 3, so that the training dataset was divided into three subsets (folds) of approximately equal size. During the random search process, different combinations of hyperparameters were randomly sampled from predefined ranges for each classifier. These hyperparameters included learning rate, regularization strength, number of hidden layers, and activation functions, among others, depending on the specific classifier being tuned. For each sampled combination of hyperparameters, the classifier was trained on two folds of the training dataset and evaluated on the remaining fold. This process was repeated three times, with each fold serving as the evaluation set once. The evaluation results from the three folds were then averaged to obtain a more robust estimate of the classifier's performance for that particular set of hyperparameters.
The generator and discriminator components of the CTGAN network were defined using residual fully connected neural networks and linear networks, respectively. The discriminator consisted of two layers, each containing 255 neurons. The CTGAN model was trained for 300 iterations using an Adam optimizer with a learning rate of 0.002. For comparison purposes, the SMOTE, ROS, and RUS methods were also implemented. Table 5 summarizes the hyperparameters tuned for the different classifiers as well as the synthetic data generators implemented in this study.
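The tuning procedure described above can be sketched as follows. The helper names, the search space, and the toy scoring function are hypothetical; in practice the scoring function would train the classifier on the training folds and return its score on the held-out fold.

```python
import random


def k_fold_indices(n, k=3, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def random_search(score_fn, space, n_samples, n_trials=5, k=3, seed=0):
    """Sample hyperparameter sets at random and keep the best mean k-fold score."""
    rng = random.Random(seed)
    folds = k_fold_indices(n_samples, k, seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        scores = []
        for i, val_idx in enumerate(folds):
            # Train on all folds except fold i, evaluate on fold i.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(params, train_idx, val_idx))
        mean_score = sum(scores) / k
        if best is None or mean_score > best[0]:
            best = (mean_score, params)
    return best


# Hypothetical search space and a stand-in scoring function (no real training).
space = {"n_neighbors": [1, 3, 5, 7]}


def toy_score(params, train_idx, val_idx):
    return 1.0 - abs(params["n_neighbors"] - 5) / 10


folds = k_fold_indices(10, k=3)
best_score, best_params = random_search(toy_score, space, n_samples=10)
```

Averaging over the three folds means that a hyperparameter set is preferred only if it performs well on every held-out subset, not just on one favorable split.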
According to Table 2, synthetic data generation significantly increases the imbalance ratio of the minority classes compared to the original dataset. Specifically, the ratio after synthetic data generation is about 50 times greater than that of the original dataset (from about 0.002 to 0.12). Synthetic samples for the minority classes (Peas and Broadleaf) were generated using each of the algorithms (CTGAN, SMOTE, and ROS). Additionally, in the RUS method, the number of samples for all classes was reduced to match the number of samples in the minority classes.
In this study, to evaluate the performance of the various data generators, the G-mean and Sensitivity (recall) metrics were defined as follows:

$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \quad (2)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (3)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \quad (4)$$

where TP represents the positive samples that have been correctly predicted as positive, and FN denotes the positive samples that have been incorrectly predicted as negative. TN represents the negative samples that have been correctly predicted as negative, and FP identifies the negative samples that have been incorrectly predicted as positive. These values are calculated separately for each class.
According to equations 2 to 4, the sensitivity metric, also known as the true positive rate or recall, measures the proportion of actual positive samples correctly classified as positive by a classifier. It focuses on correctly identifying samples belonging to the positive class and is particularly sensitive to samples of that class being assigned to the wrong class. Sensitivity is an important metric, especially in scenarios where the accurate detection of positive instances is critical. On the other hand, the G-mean, or geometric mean, evaluates the overall performance of a classifier by considering both the majority and minority classes. It takes into account both the sensitivity (true positive rate) and the specificity (true negative rate), and is calculated as the square root of their product, providing a balanced measure of classifier performance across different classes. The G-mean is advantageous when dealing with imbalanced datasets, where the number of samples in one class is significantly smaller than in the others. In such cases, accuracy alone can be misleading, since a high accuracy can be achieved by simply classifying all samples into the majority class. The G-mean captures the classifier's ability to perform well in both the majority and minority classes, as it reflects the trade-off between sensitivity and specificity. It therefore shows how well the classifier handles both positive and negative instances, enabling a more accurate assessment of its effectiveness in real-world applications. In summary, while sensitivity focuses on the correct classification of positive samples, the G-mean takes into account the performance of classifiers in both majority and minority classes. Together, these metrics provide a more comprehensive understanding of classifier performance and are particularly useful when dealing with imbalanced datasets [46].
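The per-class computation of these metrics can be sketched as follows, treating one class as positive and all other classes as negative. The labels and predictions are hypothetical toy values, and the function assumes that both the positive class and at least one other class are present.

```python
import math


def per_class_metrics(y_true, y_pred, positive):
    """Return (sensitivity, specificity, G-mean) for one class versus the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Hypothetical labels: 4 minority ("peas") and 6 majority ("wheat") samples.
y_true = ["peas"] * 4 + ["wheat"] * 6
y_pred = ["peas", "peas", "peas", "wheat"] + ["wheat"] * 5 + ["peas"]

sens, spec, g_mean = per_class_metrics(y_true, y_pred, "peas")
```

A classifier that assigned every sample to the majority class would reach 60% accuracy on this toy set but a sensitivity, and hence a G-mean, of zero for the minority class.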

Result
This section presents the performance of the implemented synthetic data generation methods for different classifiers. As shown in Table 6, based on the Sensitivity metric, the performance of all three classifiers in detecting the minority classes (Peas and Broadleaf) improved after synthetic data generation. Among the different methods, CTGAN performed best while maintaining the overall performance of the classifier based on the total Sensitivity metric. With CTGAN, the XGBoost classifier improved by 9.4% and 8.9% for the Peas and Broadleaf classes, respectively, compared to the original dataset. For the RF classifier trained on the CTGAN dataset, the Sensitivity increased to 95% for both the Peas and Broadleaf classes, from 92.8% and 80%, respectively. Unlike the previous two classifiers, the KNN algorithm performs very poorly on these two classes with the imbalanced and insufficient dataset, to the point that it can classify almost none of their samples correctly. After data generation by CTGAN, the performance of the KNN algorithm for the Peas class reaches 92.2%, which is 20.5%, 57.2%, and 20.0% better than with the datasets produced by SMOTE, ROS, and RUS, respectively. For the Broadleaf class, however, SMOTE performs 2.2% better than CTGAN. Also, according to the obtained results, RUS strongly improves the performance for the minority classes based on the Sensitivity metric, but it reduces the total Sensitivity of crop classification.
For better evaluation, the confusion matrices of the RF classifier for the original (imbalanced and insufficient), RUS, ROS, SMOTE, and CTGAN datasets are displayed in Figure 4. Based on this figure, the proportion of correctly classified samples for the Peas class is 93% for both the original and SMOTE datasets, while it is 94% for the CTGAN dataset. In addition, for the Broadleaf class, the proportion of correctly classified samples increased from 80% for the original dataset to 90% for the CTGAN dataset. Although balancing with RUS, ROS, and SMOTE increases the Sensitivity of the minority classes, the performance of the classifier decreases for the other classes.
The evaluation of crop classification with the XGBoost algorithm using synthetic data generated by RUS, ROS, SMOTE, and CTGAN, based on the G-mean metric, is presented in Table 7. As shown, the classification accuracy is improved for the minority classes. The performance improvement for the Peas class is 5.0% for the CTGAN dataset. In addition, for the Broadleaf class, the G-mean improved from 92.2% for the original dataset to 96.9% for the CTGAN dataset. Also, the performance of the classifiers using the ROS dataset decreased significantly.
In summary, CTGAN generated data for classes with insufficient samples more effectively than the alternative techniques, improving minority class performance while preserving overall performance across the classifier metrics.

Discussion
This study aimed to address the issue of limited training samples in minority crop classes by utilizing synthetic data generated by the CTGAN network. Specifically, the proposed approach utilized the fusion of optical and polarimetric SAR features for crop classification. As illustrated in Section 3, the KNN performance was improved significantly by employing the synthetic data generated by CTGAN. Figure 5 shows the KNN classifier performance for different quantities of synthetic samples, from 100 to 1000, produced by CTGAN. The red line plots the accuracy of the Peas class. Similarly, the blue and brown lines depict the accuracy of the Broadleaf class and the overall accuracy, respectively, all based on the F1-score (a comprehensive metric accounting for both precision and sensitivity). Precision refers to the proportion of correctly identified positive cases out of all cases classified as positive. For this problem, 1000 synthetic samples achieved slightly higher accuracy than other volumes for the overall classifier and the Broadleaf class. The Peas class accuracy peaked at 200 samples. However, the addition of 200 samples did not sufficiently reduce the imbalance in the dataset. Therefore, 1000 synthetic samples were generated for the minority classes, resulting in a roughly 50-fold reduction in class imbalance.
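For reference, the F1-score combines precision and sensitivity as their harmonic mean. The sketch below uses hypothetical counts of true positives, false positives, and false negatives, not values from Figure 5.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity for one class."""
    precision = tp / (tp + fp)    # correct positives among predicted positives
    sensitivity = tp / (tp + fn)  # correct positives among actual positives
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts for one class.
f1 = f1_score(tp=3, fp=1, fn=1)
f1_skewed = f1_score(tp=9, fp=1, fn=9)
```

Because the harmonic mean is pulled toward the smaller of the two values, a class with high precision but low sensitivity (as in the second call) still receives a low F1-score.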
While the primary goal of this research was to generate additional data with CTGAN for minority classes in order to address the problem of insufficient training samples, this data generation method also helped reduce the class imbalance in the dataset as a side effect. By increasing the number of samples for minority classes, the technique brought the class distribution closer to a balanced ratio, even though balancing the dataset was not the main focus. Thus, data generation by CTGAN served a dual purpose: producing more training samples for classes with insufficient data and mitigating the existing skew between majority and minority classes.
Figure 5 also illustrates the ability of CTGAN to produce different volumes of data, while the optimal volume should be determined depending on the problem. CTGAN-generated synthetic data based on multimodal crop features helped to boost classifier performance on minority classes, and the analysis indicated that generating 1000 samples per minority class achieves a good balance between accuracy and class representation. In summary, the results illustrate the efficiency of CTGAN in addressing limited training data challenges for crop classification tasks.

Influence of synthetic data on the performance of classifiers
The results in Tables 6 and 7 and Figure 4 show that the impact of synthetic data generation on classification performance depends on the classifier model. For example, KNN benefited more from additional training data than RF and XGBoost, whose performance increased to a lesser extent. Based on G-mean, CTGAN produced higher quality synthetic data that led to greater classification accuracy improvements for minority classes than the other methods. Based on sensitivity, however, RUS outperformed CTGAN for minority classes, but RUS reduced overall performance by removing useful information from other classes. While ROS yielded a slight boost to classifiers, its performance was weaker than that of SMOTE and CTGAN because it provides less new information for training. SMOTE, on the other hand, generated synthetic data without considering real data distributions, which diminished the accuracy gains shown in Table 6. Among the compared methods, only CTGAN produced a balanced dataset that accurately reflects the real data distribution. This preserves classification performance for the majority class while substantially improving the classification accuracy of the minority classes. Whereas other methods either overfit certain classes or remove informative samples, CTGAN's distribution-aware generation approach leads to well-balanced classification across all classes.
In summary, CTGAN yielded the most robust accuracy improvements by introducing informative synthetic samples without distorting real data properties or removing important information. Its ability to balance datasets while maintaining fidelity to underlying distributions provides an advantage over other data augmentation methods.
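The two metrics discussed above can be computed from a confusion matrix as in the sketch below. It assumes the common multi-class definition of G-mean as the geometric mean of the per-class sensitivities; the exact formula used in this research is defined in the methodology and may differ. The confusion matrix is a made-up example, and the function names are illustrative.

```python
import math


def per_class_sensitivity(cm):
    """Sensitivity (recall) per class; cm[i][j] counts true class i predicted as j."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def g_mean(cm):
    """Geometric mean of the per-class sensitivities (assumed definition)."""
    recalls = per_class_sensitivity(cm)
    return math.prod(recalls) ** (1.0 / len(recalls))


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
cm = [
    [90, 5, 5],    # majority class
    [10, 30, 10],  # minority class
    [2, 2, 16],    # minority class
]

print(per_class_sensitivity(cm))
print(g_mean(cm))
```

Because the geometric mean collapses toward zero when any single class has low sensitivity, it penalizes a method that raises minority-class recall at the expense of other classes, which is consistent with the different rankings obtained from sensitivity and G-mean.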

Quality of generated synthetic data
To further investigate the quality of the synthetically generated data, the means and standard deviations of the 168 optical and polarimetric features introduced in Section 2.2 were compared between the real data and the synthetic data generated by CTGAN and SMOTE for the Peas and Broadleaf classes. Figure 6 displays the correlation plots between the means and standard deviations of real and synthetic data. The position of each scatter point is the mean (or standard deviation) of the real data versus that of the synthetic data for one feature. The closer the points lie to the line of equivalence (the Y=X line), the more similar the statistical properties of the synthetic and real data are, which reflects greater similarity between their distributions. The correlation plots show that the means and standard deviations of the features generated by CTGAN have a higher correlation with the real data than those generated by SMOTE. This indicates that CTGAN reproduces the distribution of the real data more accurately than SMOTE.
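A comparison of this kind can be sketched as follows, under stated assumptions: each dataset is a list of rows with one value per feature, the population standard deviation is used, and the Pearson coefficient summarizes how close the points lie to a straight line. The toy values and the names `column_stats` and `pearson` are illustrative and are not taken from this research.

```python
import math
import statistics


def column_stats(rows):
    """Per-feature means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: 3 samples x 3 features (hypothetical values).
real = [[1.0, 10.0, 100.0], [3.0, 20.0, 300.0], [5.0, 30.0, 500.0]]
synthetic = [[1.2, 11.0, 95.0], [2.8, 19.0, 310.0], [5.1, 31.0, 490.0]]

real_means, real_stds = column_stats(real)
syn_means, syn_stds = column_stats(synthetic)

mean_corr = pearson(real_means, syn_means)  # one point per feature
std_corr = pearson(real_stds, syn_stds)
print(mean_corr, std_corr)
```

Note that a high correlation only shows that the points follow a line; closeness to the Y=X line additionally requires a slope near one and an intercept near zero, so the scatter plot itself remains the more informative check.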
Figures 7 and 8 provide a detailed comparison of feature distributions between real data and synthetic data generated by CTGAN for two minority classes. Figure 7 focuses on the Peas class, showing the distributions of 6 exemplary features in the real data (blue columns) versus the synthetic data (orange columns). Figure 8 repeats this comparison for the same 6 features of the Broadleaf class. In both figures, the synthetic data distributions generated by CTGAN closely match those of the corresponding real data features. This consistency demonstrates CTGAN's ability to accurately model the underlying distributions of the real data. Notably, CTGAN is also capable of generating synthetic data over a wider range than the real features. For instance, the distributions of some synthetic features in Figures 7 and 8 extend beyond the maximum and minimum values of the real data. This extension may reduce the risk of overfitting during subsequent classifier training, as the models are exposed to a more diverse representation of each feature. Overall, these distribution comparisons provide strong evidence that CTGAN can capture the statistical properties of the real data and can therefore generate synthetic data that are representative of the original samples. This fidelity facilitates the effective application of synthetic data for classification tasks.
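The visual comparison in Figures 7 and 8 could be complemented by simple numeric summaries, sketched below. These are illustrative choices and not measures reported in this research: a histogram intersection computed on shared bins (1.0 means identical binned distributions) and the fraction of synthetic values that fall outside the range of the real values. The sample values are hypothetical.

```python
def shared_histograms(real, synthetic, bins=10):
    """Normalized histograms of both samples on the same equal-width bins."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # clamp the upper edge
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return hist(real), hist(synthetic)


def histogram_overlap(real, synthetic, bins=10):
    """Histogram intersection: sum of bin-wise minima of the two histograms."""
    h_real, h_syn = shared_histograms(real, synthetic, bins)
    return sum(min(a, b) for a, b in zip(h_real, h_syn))


def fraction_outside_real_range(real, synthetic):
    """Share of synthetic values below the real minimum or above the real maximum."""
    lo, hi = min(real), max(real)
    return sum(1 for v in synthetic if v < lo or v > hi) / len(synthetic)


# Hypothetical values of one feature for one class.
real_feature = [0.0, 1.0, 2.0, 3.0]
synthetic_feature = [-1.0, 1.0, 2.0, 4.0]

print(histogram_overlap(real_feature, synthetic_feature, bins=5))
print(fraction_outside_real_range(real_feature, synthetic_feature))
```

A high overlap together with a small but non-zero out-of-range fraction would correspond to the behavior described above, where the synthetic distribution follows the real one while extending slightly beyond its minimum and maximum.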

Conclusion
This article investigated the performance of the CTGAN model in generating synthetic data to reduce the impact of insufficient samples in crop classification. To study this issue, features extracted from the optical and SAR images acquired by the RapidEye and UAVSAR sensors were employed. Using three classifiers (XGBoost, RF, and KNN), the quality of the synthetic data generated by CTGAN (as a state-of-the-art method) was evaluated in comparison to RUS, ROS, and SMOTE. In general, the results demonstrated the significant superiority of the CTGAN network over the comparative algorithms. While SMOTE generated synthetic data without considering the distribution of real data, the CTGAN method took the data distribution into account during data generation. Furthermore, RUS and ROS did not generate data that considerably improved the performance of classifiers compared to the CTGAN model. The quality of the synthetic data generated by CTGAN was also evaluated by comparing it to real data using statistical metrics. Specifically, the mean, standard deviation, and distributions of different features were measured and compared. The results showed that the data produced by CTGAN were similar to the real data across these statistical metrics. This indicates that the synthetic data generated by CTGAN accurately reflect the real data distribution. Therefore, the CTGAN network is a better alternative to the basic methods of generating synthetic datasets.
However, it is important to acknowledge that using CTGAN for synthetic data generation has certain limitations. One such limitation is the requirement of a minimum amount of data for training. CTGAN relies on a sufficient quantity of training data to effectively learn the underlying data distribution and capture the intricate dependencies within the dataset. Additionally, the training process of CTGAN can be more time-consuming than traditional data generation methods. CTGAN involves training a generative model that learns the complex patterns and relationships inherent in the data. This training process typically requires multiple iterations and can be computationally intensive, especially when dealing with large and high-dimensional datasets. The processing time required for training CTGAN should be considered when deciding on the appropriate data generation approach, especially in time-sensitive applications or scenarios with resource constraints.
Despite these limitations, the benefits of CTGAN should not be overlooked. CTGAN excels at capturing the underlying data distribution and generating synthetic samples that closely resemble the real data. It has the potential to overcome the limitations of traditional methods by preserving complex relationships and dependencies present in the original dataset. Additionally, CTGAN offers more flexibility in generating synthetic data with desired characteristics, allowing researchers to control specific features or adjust the balance between classes.
Some important directions for future work include:
1) Applying CTGAN and comparing its performance to other generative models in remote sensing applications beyond crop classification. This could include tasks like land cover mapping, object detection in aerial/satellite imagery, and environmental monitoring. Evaluating generative solutions across different problem types would expand our understanding of their capabilities and limitations.
2) Leveraging synthetic data generation to address the lack of training samples in various remote sensing and geospatial problems beyond agriculture, such as damage assessment from natural disasters, urban development monitoring, infrastructure mapping, and species habitat modeling. Where limited labeled data exist, generative models may help boost predictive accuracy.
3) Developing new generative model architectures and training procedures specialized for different remote sensing inputs.

Figure 1. The study area and the reference data in this research.

Figure 2. The framework implemented for synthetic data generation and cropland classification.

Figure 3. The structure of CTGAN for synthetic data generation.

Figure 4. Confusion matrices for the RF classifier trained and tested on the a) Original, b) RUS, c) ROS, d) SMOTE, and e) CTGAN datasets.

Figure 5. The performance of the KNN classifier, measured by the F1 score metric, in relation to varying quantities of synthetic samples generated by the CTGAN model.

Figure 6. The correlation plots between means and standard deviations of synthetic data and real data. Each blue point corresponds to a feature extracted from SAR and optical images.

Figure 7. The data distribution of 6 exemplary features generated by CTGAN versus the distribution of the real data for the Peas class.

Figure 8. The data distribution of 6 exemplary features generated by CTGAN versus the distribution of the real data for the Broadleaf class.

Table 1. Preview of previous studies that used data generation methods to improve the accuracy of classification.

Table 2. The ratio of the number of samples for each class to the number of samples for the majority class.

Table 5. The hyperparameters tuned for different classifiers and synthetic data generators.

Table 6. The sensitivity of different classifiers after data generation with various methods. The sensitivity has been improved for all classifiers after generating data using different methods.

Table 7 .
of XGBoost for different crop classes.