1. Introduction
Farmland classification, including crop classification, is essential for agricultural monitoring, yield estimation, and policy-making. Advances in satellite-based remote sensing have enabled accurate mapping of farmland distribution and characteristics over extensive areas, greatly enhancing resource management and decision-making [1].
Early remote sensing-based farmland classification mainly employed pixel-wise classification methods [2]. However, these methods have notable drawbacks, including “salt-and-pepper” noise resulting from isolated misclassified pixels and difficulty handling mixed pixels [3]. To overcome these limitations, Object-Based Image Analysis (OBIA) emerged as an alternative. OBIA segments satellite images into meaningful objects (e.g., parcels) and classifies them based on statistical features such as the mean Normalized Difference Vegetation Index (NDVI), texture, and area, typically using traditional machine learning algorithms [4,5].
Recently, convolutional neural network (CNN)-based deep learning techniques have increasingly been applied to farmland classification. These approaches can be broadly categorized into two types: (i) methods that first extract field boundaries and subsequently classify each parcel using CNNs [6,7,8,9] and (ii) end-to-end segmentation models that directly learn from pixel-level data to predict land cover classes for entire images [10,11,12]. In parallel, research focusing on precise field-boundary extraction has gained traction [13,14,15,16], aiming to improve object definitions within classification pipelines.
Alongside these technical advances, numerous studies have reported that not only the design of model architectures but also non-architectural factors—such as the composition of training samples, data sampling methods, and pretraining strategies—can play a significant role in improving classification performance [17,18,19,20]. Based on this perspective, the present study aims to improve performance by refining how training data are extracted, while preserving the model architecture.
Farmland classification is also particularly sensitive to temporal and regional variability. Even identical crop types can exhibit significant spectral and textural differences, driven by variations in growth stages, soil properties, and climate conditions across regions [21]. These factors critically impact the generalization ability of classification models, emphasizing the need for methods that maintain robust performance across spatio-temporal variations. To address this need, the present study aims to develop a farmland classification model capable of strong generalization—both spatially and temporally.
Class imbalance is among the most significant challenges in farmland classification. Due to variations in crop cultivation areas influenced by region, season, and agricultural policies, farmland datasets often exhibit highly imbalanced class distributions. This trend has also been observed in previous studies, where farmland data were found to be heavily skewed toward a few dominant crop types, while rare crops appeared only sparsely [22]. For instance, staple crops such as rice and barley typically occupy large portions of agricultural datasets, whereas specialty crops like ginseng or nursery fields appear far less frequently—a pattern observed in our experimental dataset as well.
This imbalance frequently leads models to overfit majority classes and poorly classify minority classes [23]. Models trained with standard cross-entropy loss are particularly vulnerable to this issue, because well-represented classes contribute far more terms to the aggregate loss and therefore dominate the gradient. Consequently, accuracy for under-represented classes decreases, diminishing the overall robustness and fairness of the models.
In the field of remote sensing, approaches to address class imbalance can be broadly categorized into two groups. The first is data-level oversampling strategies, and the second includes algorithmic approaches such as loss-function adjustment using focal loss or weighted loss.
Oversampling methods aim to balance the class distribution by increasing the number of samples for minority classes. The most straightforward technique is simple replication of minority-class samples. However, this approach can lead to overfitting due to repeated exposure to identical data [24,25]. To alleviate this, techniques such as SMOTE [26] and ADASYN [27] have been proposed, which generate synthetic samples by interpolating between minority-class instances. These techniques have also been successfully applied to remote sensing classification problems, demonstrating strong performance in various applications [24,25,28,29]. However, such methods have been criticized for potentially generating noise or distorted spectral information that deviates from the true data distribution in high-dimensional remote sensing imagery [30,31]. To overcome these issues, recent studies have explored GAN-based oversampling, which offers the advantage of generating more realistic samples [32,33]. However, when the number of original minority-class samples is extremely limited, there is a risk of generating repetitive or unrealistic synthetic images, as has been reported in the literature [34,35].
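To make the interpolation idea concrete, the sketch below generates one synthetic minority-class sample in the SMOTE style, i.e., as a point on the line segment between a minority instance and one of its k nearest minority neighbors. The feature array, the choice of k, and the seeding are illustrative assumptions and not part of our pipeline, which deliberately avoids synthetic samples.

```python
import numpy as np

def smote_like_sample(minority_features, k=5, rng=None):
    """Generate one synthetic sample by interpolating between a randomly
    chosen minority-class instance and one of its k nearest minority
    neighbors (minority_features: array of shape (n_samples, n_features))."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(minority_features))
    anchor = minority_features[i]
    dists = np.linalg.norm(minority_features - anchor, axis=1)
    dists[i] = np.inf                           # exclude the anchor itself
    neighbor = minority_features[rng.choice(np.argsort(dists)[:k])]
    lam = rng.random()                          # interpolation factor in [0, 1)
    return anchor + lam * (neighbor - anchor)

# Example: oversample a 3-sample minority class of 4-band spectral means.
minority = np.array([[0.42, 0.35, 0.30, 0.55],
                     [0.40, 0.33, 0.31, 0.57],
                     [0.45, 0.36, 0.28, 0.52]])
print(smote_like_sample(minority, k=2, rng=0))
```

Repeating this procedure until the minority class reaches the desired count yields a balanced training set, at the cost of the spectral-distortion risks discussed above.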
In contrast, algorithmic approaches adjust the loss function to compensate for class imbalance by assigning greater weights to errors in minority-class predictions during training. For example, focal loss [36] and weighted cross-entropy modify the loss values to emphasize under-represented classes. These methods have the advantage of enhancing minority-class learning without altering the original data itself [37,38]. However, they are generally highly sensitive to hyperparameter settings, and if not carefully tuned, they may lead to excessive overfitting to the minority class or instability in the training process [23,39].
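The following sketch illustrates the two loss adjustments named above for a single sample, assuming `probs` is a softmax output vector; the per-class weight vector and the focusing parameter gamma are hypothetical hyperparameters, and the focal loss follows the standard form -(1 - p_t)^gamma * log(p_t).

```python
import numpy as np

def weighted_cross_entropy(probs, target, class_weights):
    """Cross-entropy for one sample, scaled by a per-class weight
    (larger weights emphasize under-represented classes)."""
    return -class_weights[target] * np.log(probs[target] + 1e-12)

def focal_loss(probs, target, gamma=2.0):
    """Focal loss: (1 - p_t)^gamma down-weights confidently classified
    samples so that hard, often minority-class, samples dominate training."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)

probs = np.array([0.7, 0.2, 0.1])      # softmax output for one sample
weights = np.array([0.5, 2.0, 3.0])    # hypothetical inverse-frequency class weights
print(weighted_cross_entropy(probs, 1, weights), focal_loss(probs, 1))
```

The weight vector and gamma are exactly the hyperparameters whose tuning sensitivity motivates our choice of a data-level strategy instead.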
In this study, we adopt a data-level oversampling strategy, and accordingly, avoiding redundant learning and ensuring diversity in the training data become key objectives.
While addressing class imbalance is important, it is equally critical to consider the geometric variability of objects in remote sensing imagery, including farmland data. Even at a fixed spatial resolution, objects in such imagery can vary significantly in scale, shape, and aspect ratio, making it challenging to process them in a consistent manner [40,41,42,43]. This variability poses a fundamental obstacle to the accuracy of CNN-based analysis tasks in remote sensing, such as classification and detection. To address this issue, previous studies have proposed various approaches, including multi-scale feature pyramid frameworks [40,41] and patch-based prediction methods that divide images into fixed-size regions [42,43]. Among these, our study explicitly adopts the patch-based prediction strategy known as tiling, which effectively handles geometric variability while satisfying the fixed input-size requirements of CNN models.
One straightforward method is resizing each parcel image to a predetermined input size. However, because farmland parcels vary significantly in shape and size, applying a uniform output size introduces inconsistent scaling factors across samples.
Figure 1 illustrates this issue by comparing two example parcels and how they are transformed when applying different input preparation strategies. As shown in Figure 1d, resizing causes images with different native resolutions to be rescaled disproportionately, distorting their original textures and spatial proportions. This inconsistency undermines the uniformity of image preprocessing and hampers the model’s ability to learn meaningful visual features. As a result, the visual distinction among classes is reduced, especially for crop types with fine-grained structural differences, making classification more difficult. In contrast, fixed-size patch extraction avoids this problem by directly sampling from the original image without altering its resolution, as shown in Figure 1c. This method preserves the local spatial scale and textural characteristics of each parcel, enabling the model to capture more reliable and class-specific visual cues, thereby improving its ability to distinguish between visually similar classes.
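As a concrete illustration of the inconsistent scaling factors mentioned above (the parcel sizes and network input size below are hypothetical), resizing two parcels of very different native sizes to the same input shape stretches them by very different factors, whereas a fixed-size crop keeps the original resolution of both:

```python
# Hypothetical parcel sizes (pixels) and a fixed CNN input size.
parcels = {"small_parcel": (60, 45), "large_parcel": (300, 220)}
input_size = (224, 224)

for name, (h, w) in parcels.items():
    scale_h, scale_w = input_size[0] / h, input_size[1] / w
    print(f"{name}: resize scale factors = ({scale_h:.2f}, {scale_w:.2f})")
# small_parcel is enlarged ~3.7-5.0x while large_parcel is rescaled to ~0.75-1.0x,
# so textures are distorted disproportionately; a fixed 224x224 crop instead
# keeps a scale factor of 1.0 for every parcel.
```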
To systematically implement this fixed-size patch approach, tiling-based classification frameworks have been introduced [42,43]. Instead of resizing the entire parcel, this approach divides each parcel into overlapping fixed-size patches. Each patch is individually classified by a CNN, and parcel-level predictions are aggregated from patch-level results—typically using a class-wise product. This strategy maintains spatial resolution, avoids global scaling distortions, and satisfies CNN input requirements, making it a widely adopted approach in tiling-based classification.
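A minimal sketch of this aggregation step is shown below, assuming each patch has already been passed through the CNN and `patch_probs` holds its softmax outputs; carrying out the class-wise product in log space is a standard numerical-stability choice rather than a detail taken from the cited frameworks.

```python
import numpy as np

def aggregate_parcel_prediction(patch_probs):
    """Combine patch-level softmax outputs of shape (n_patches, n_classes)
    into a parcel-level label via a class-wise product, computed in log
    space for numerical stability."""
    log_scores = np.log(patch_probs + 1e-12).sum(axis=0)
    return int(np.argmax(log_scores))

# Example: three overlapping patches, four candidate classes.
patch_probs = np.array([[0.60, 0.20, 0.10, 0.10],
                        [0.50, 0.30, 0.10, 0.10],
                        [0.40, 0.40, 0.10, 0.10]])
print(aggregate_parcel_prediction(patch_probs))  # -> 0
```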
Despite these advantages, typical tiling-based training methods have limitations. First, the Fixed Patch Extraction (FPE) method crops patches from fixed spatial offsets, which matches inference-time behavior. However, this fixed sampling strategy restricts spatial diversity in training data, as the same patch locations are repeatedly used during training epochs, thereby limiting variation in spatial contexts [44,45]. Such structural limitations can lead to overfitting to specific spatial patterns and degrade generalization performance for unseen domains [44]. Although repeated patch exposure is somewhat reduced by randomly sampling only 40% of patches when more than three patches are generated per parcel, fixed offsets still cause patch duplication across epochs. This redundancy limits training diversity and heightens the risk of overfitting, as shown in Figure 2.
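For concreteness, the sketch below enumerates fixed top-left offsets on a regular stride grid, mirroring the FPE behavior described above; the patch size and stride values are placeholders, while the rule of randomly keeping 40% of patches when more than three are produced follows the description in the text.

```python
import random

def fixed_patch_offsets(parcel_h, parcel_w, patch=224, stride=112, keep_frac=0.4):
    """Enumerate fixed top-left patch offsets on a regular stride grid.
    Because the grid depends only on parcel geometry, the same candidate
    locations are revisited in every epoch."""
    offsets = [(y, x)
               for y in range(0, max(parcel_h - patch, 0) + 1, stride)
               for x in range(0, max(parcel_w - patch, 0) + 1, stride)]
    if len(offsets) > 3:   # keep a random 40% subset when many patches exist
        offsets = random.sample(offsets, max(1, int(len(offsets) * keep_frac)))
    return offsets

print(fixed_patch_offsets(500, 380))
```

Because the candidate grid never changes, repeated epochs keep drawing from the same small set of crops, which is the duplication problem illustrated in Figure 2.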
Second, most tiling-based pipelines preserve the original class distribution without applying balancing mechanisms—a scenario we define as No Class Balancing (NCB). In imbalanced farmland datasets, NCB results in skewed training exposure: majority classes dominate the training process, while rare classes (e.g., ginseng or tree nursery) become severely under-represented. Consequently, models develop biases toward frequent classes, performing poorly on minority classes.
In short, patch duplication from FPE and class imbalance due to NCB degrade the generalization capability of models, underscoring the need for more robust training strategies tailored to parcel-based tiling classification.
To address these challenges, we adopt two complementary techniques: random patch extraction (RPE) and class-balanced sampling (CBS).
RPE improves overall classification performance by ensuring spatial diversity in training patches and also enhances performance across most minority classes. Under the 1× resolution setting, overall accuracy increased by +0.0176 on the validation set and +0.0574 on the test set. For an easy minority class, F1 scores improved by +0.0696/+0.0790 (Val/Test), respectively. However, when used alone, RPE does not resolve class-level training imbalance. In very difficult minority classes, the spatial diversity provided by RPE hindered generalization under limited training exposure, and the model failed to make any correct predictions, with F1 scores dropping to 0 on both the validation and test sets.
CBS mitigates majority-class bias by equalizing class-level training opportunities and facilitates better inter-class discrimination, improving both overall and minority-class performance. In the 1× setting, overall accuracy improved by +0.0172 (Val) and +0.0334 (Test), and for a normal-difficulty minority class, F1 scores increased significantly by +0.4000/+0.4332 (Val/Test). However, in a very difficult minority class, the F1 score increased by +0.4086 in validation but only +0.0349 in test, indicating limited generalization. An easy minority class also showed signs of overfitting, with an F1 gain of +0.0600 in validation but only +0.0123 in test.
By combining the two methods, we simultaneously achieve balanced training opportunities through CBS and spatial patch diversity through RPE, resulting in the following complementary effects. CBS compensates for the insufficient training exposure of minority classes under RPE, especially in high-difficulty cases, by ensuring sufficient repetition for learning. Meanwhile, RPE introduces spatial diversity at the patch level, alleviating the structural limitation of repeated patch sampling inherent in CBS and helping prevent overfitting. In practice, the proposed combined method achieved consistent performance improvements across minority classes of varying difficulty. Notably, for a very difficult minority class where both individual methods struggled, the combined strategy yielded substantial F1 gains of +0.6171 (Val) and +0.4044 (Test). Furthermore, overall accuracy improved significantly, with gains of +0.0443 (Val) and +0.0876 (Test), demonstrating enhanced general classification performance.
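A minimal sketch of how the two components can be combined in a single sampling step is given below; it is an illustration under simplifying assumptions rather than our exact implementation. Class-balanced sampling first draws a class uniformly and then a parcel of that class, after which random patch extraction crops the patch at a random valid offset; the dictionary layout, patch size, and the assumption that each parcel is at least one patch in size are all illustrative.

```python
import random

def sample_training_patch(parcels_by_class, patch=224):
    """parcels_by_class maps a class label to a list of parcel images
    (H, W, C arrays); each parcel is assumed to be at least patch x patch."""
    label = random.choice(list(parcels_by_class))     # CBS: classes drawn uniformly
    parcel = random.choice(parcels_by_class[label])   # then a parcel of that class
    h, w = parcel.shape[:2]
    y = random.randint(0, max(h - patch, 0))          # RPE: random valid offset
    x = random.randint(0, max(w - patch, 0))
    return parcel[y:y + patch, x:x + patch], label
```

Calling this sampler repeatedly yields a class-balanced stream of spatially diverse patches, which is the behavior the combined strategy relies on.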
The key contributions of this study are summarized as follows:
Class-Balanced Random Patch Training
We integrate data augmentation via random patch extraction (RPE) and over/undersampling through class-balanced sampling (CBS) in a unified training pipeline for tiling-based classification. RPE and CBS independently address patch duplication and class imbalance, respectively, but their combination demonstrates complementary effects on performance. This combined strategy significantly improves the F1 score for minority crop classes, achieving gains of up to +0.6171 (Validation) and +0.4044 (Test) compared to the baseline. Unlike many conventional methods that rely on synthetic data to address class imbalance, our method leverages only real data to ensure training diversity and achieves stable performance without complex hyper-parameter tuning.
Robustness to Context Loss and Aggregation Effects under Upscaling
We perform 2× upscaling of parcels to reduce the context per patch while increasing the number of patches per parcel. As a result, our method demonstrates robustness to context loss, maintaining stable patch-level performance, even when context is reduced to one-fourth. Furthermore, we quantitatively confirm that the increased number of patches enhances the aggregation effect of the softmax product, leading to improved overall classification accuracy.
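As a rough numerical illustration of this trade-off (the parcel size, patch size, and stride below are hypothetical), 2× upscaling leaves the network input size unchanged, so each patch covers one-quarter of the original ground area while the number of overlapping patches per parcel grows several-fold:

```python
def count_patches(h, w, patch=224, stride=112):
    """Number of overlapping patches on a regular stride grid."""
    return (max(h - patch, 0) // stride + 1) * (max(w - patch, 0) // stride + 1)

parcel_1x = (448, 448)                            # hypothetical parcel size at 1x
parcel_2x = (parcel_1x[0] * 2, parcel_1x[1] * 2)  # after 2x bilinear upscaling

print(count_patches(*parcel_1x))  # 9 patches, each seeing the full-resolution context
print(count_patches(*parcel_2x))  # 49 patches, each seeing ~1/4 of the original area
```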
3. Results
This section presents the results of experiments carried out to evaluate the effectiveness of our proposed method. In Section 2.2.1, we introduced the core techniques, random patch extraction (RPE) and class-balanced sampling (CBS). Here, we compare them against two baselines, fixed patch extraction (FPE), which uses a fixed offset per parcel, and no class balancing (NCB), which preserves the original class distribution, by performing an ablation study that gradually adds RPE, CBS, and the combined RPE + CBS on top of the baselines.
Specifically, we (i) measure model performance using accuracy, the macro F1 score, and Kappa to determine how our techniques affect overall classification performance; (ii) examine class-wise F1 scores to check whether minority classes (e.g., ginseng and tree nursery) benefit from RPE and CBS; (iii) evaluate patch-level predictions to gauge the model’s direct performance, as well as parcel-level predictions to see how aggregation (i.e., combining patches) enhances the final outcome; and (iv) analyze computational efficiency in terms of training time and memory usage across different configurations.
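For reference, the three overall metrics can be computed with scikit-learn as in the sketch below; the label arrays are hypothetical stand-ins for parcel-level ground truth and predictions.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

y_true = [0, 0, 1, 2, 2, 2]   # hypothetical parcel-level ground-truth labels
y_pred = [0, 1, 1, 2, 2, 0]   # hypothetical parcel-level predictions

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
kappa = cohen_kappa_score(y_true, y_pred)             # agreement corrected for chance
print(accuracy, macro_f1, kappa)
```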
As noted in Section 2.3, all experiments were performed under both 1× (no upscaling) and 2× (bilinearly upscaled) settings. In the 2× environment, bilinear interpolation enlarges the image, reducing each patch’s spatial context but increasing the number of overlapping patches per parcel, thereby compensating for the loss at the patch level via aggregated predictions.
3.1. Overall Performance
The ablation study’s overall performance (accuracy, macro F1 score, and Kappa) at 1× and 2× is summarized in Table 2a (validation set) and Table 2b (test set).
Based on these results, we observe that the baseline configuration—NCB + FPE—exhibits limited performance due to two key issues: FPE restricts data diversity by applying a fixed offset, and NCB amplifies class imbalance by preserving the natural class distribution. In contrast, RPE, when replacing FPE, enhances training diversity through random offsets. CBS, when substituting NCB, mitigates majority-class bias by enforcing uniform class sampling. Both modifications lead to individual performance gains over the baseline. When combined, the full RPE + CBS scheme achieves the greatest improvements across all evaluation settings.
In particular, the test set under the 2× condition is the most challenging, given not only domain differences but also reduced patch context. Nevertheless, RPE + CBS raises accuracy from 0.4857 (baseline) to 0.8033 (+31.76%p), the macro F1 score from 0.4394 to 0.7166 (+0.2772), and Kappa from 0.4227 to 0.7687 (+0.3410). Such a large improvement indicates that random patch extraction and class-balanced sampling remain highly effective, even when the patch context is reduced and the domain differs substantially.
3.2. Per-Class Performance
Table 3 and Table 4 summarize the per-class F1 scores for the validation and test sets, respectively. In this section, we primarily focus on the minority classes—namely, ginseng and tree nursery—while also briefly noting changes in major classes. We examine how each method’s F1 score changes (Val/Test) compared to the baseline (NCB + FPE).
RPE enhances training diversity by introducing random offsets during patch selection. While this approach generally improves performance across various classes, including several major ones, our analysis focuses on its effects on minority classes. For instance, ginseng showed substantial gains (+0.0696/+0.0788 in Val/Test), whereas tree nursery exhibited a performance drop (−0.2025/−0.0795), indicating that RPE does not consistently benefit all minority classes.
CBS addresses class imbalance by ensuring equal sampling probability across all classes, thereby providing more consistent learning opportunities for minority classes. While the number of training examples for most major classes is moderately reduced by undersampling, their performance sometimes improves—likely because balancing helps the model learn clearer class boundaries. Ginseng shows a large improvement of +0.4000/+0.4332 (Val/Test) in one setting, whereas in the other its gain of +0.0600 in validation but only +0.0123 in test is on par with other methods on the validation set yet significantly lower on the test set. Tree nursery likewise improves by +0.4086 on the validation set, but the gain is limited to +0.0349 on the test set. These results suggest that the use of CBS alone may lead to performance gaps between validation and test sets in certain minority classes.
When RPE and CBS are combined, the proposed method achieves top or near-top F1 scores for minority classes under both the 1× and 2× conditions on the validation and test sets, and it also frequently attains the highest scores among the major classes. Overall, this indicates stable and consistent improvement for minority classes. In particular, tree nursery shows the highest observed performance gain, improving by +0.6171 (Val) and +0.4044 (Test). Further interpretation of these results is presented in Section 4.1.
3.3. Tiling Aggregation Experiment
As described in Section 2.2.3, parcel-level classification is achieved by subdividing each parcel into multiple overlapping patches and aggregating their individual predictions. Although our final objective is parcel-level prediction, we separately evaluate patch-level performance to assess the model’s robustness under reduced contextual information and to verify how aggregation improves final accuracy. Therefore, we first evaluate the model at the patch level, then compare it to parcel-level results after aggregation, as summarized in Table 5.
Under the 2× environment, each patch covers less contextual information compared to the 1× environment, while the total number of overlapping patches per parcel increases. In the 1× setting, parcel-level accuracy improved by an average of +0.0114 over patch-level accuracy (with a maximum of +0.0238), the macro F1 score rose by an average of +0.0197 (up to +0.0293), and the Kappa improved by an average of +0.0179 (up to +0.0296). In contrast, under the 2× setting, parcel-level performance improved by an average of +0.0345 in accuracy (up to +0.0518), +0.0417 in macro F1 score (up to +0.0542) and +0.0458 in Kappa (up to +0.0669), highlighting how an increased number of patches per parcel can enhance aggregation effectiveness.
Because each patch necessarily covers less context in the 2× setting, every method exhibited lower patch-level performance compared to the 1× setting. In particular, for the baseline approach, switching from 1× to 2× caused patch-level accuracy to drop by 0.2015, the macro F1 score to decrease by 0.1835, and the Kappa to fall by 0.2151, underscoring that the loss of contextual information makes the model’s prediction task more challenging. A more detailed analysis of these results is provided in Section 4.1.2.
Summarizing all experimental findings in the Results section, our combined approach (RPE + CBS) consistently surpassed both the baseline and single-method variants in in-domain experiments (same year/region) and demonstrated robust performance, even in out-of-domain experiments (different year/region). These findings underscore the method’s effectiveness for farmland classification, especially for minority classes such as ginseng and tree nursery. In the following Discussion section, we address potential limitations and propose future directions.
3.4. Computational Efficiency
To assess potential increases in computational overhead due to additional preprocessing required by RPE and CBS compared to the baseline, we measured the average training time and memory usage over 100 epochs for each experimental configuration in the 1× setting; results are summarized in Table 6.
Using RPE increased training time by only 1.3% compared to FPE due to the additional step of verifying valid regions within each patch. In contrast, CBS introduced negligible computational overhead, as it merely involves selecting the class from which patches are extracted, exhibiting computational efficiency similar to that of the baseline. Overall, the combined approach of RPE and CBS effectively enhanced classification performance at the cost of a marginal increase in computation.