4.1. Feature Selection and Its Implications
The selection of features for LULC classification, or more generally the input dataset used for modeling, is a critical issue. As partly discussed in the Introduction, most current LULC classification methods rely on satellite indices rather than single bands alone to reduce input data dimensionality. Many of these indices are chosen for their ability to capture surface information effectively, as partially illustrated in Table 2.
The most commonly used indices are those related to canopy reflectance, especially NDVI. Nearly all LULC studies have used NDVI to distinguish between forest, cropland, water bodies, built-up areas, and bare soil. Given the seasonal changes in the canopy cover of various crops, several studies have used a time series of NDVI, combined with other indices, to predict LULC, e.g., [8,34,68,69]. Typically, features were selected for their ability to identify water bodies, built-up areas, croplands, or forests. As indicated in Section 2.4.1, the set usually included NDVI, NDBI, and NDWI, along with chlorophyll indices [4,7,11,70]. However, a limitation of these studies is the absence of temporal signatures in the input datasets, mainly due to limited resources.
Our proposed approach aims to overcome resource limitations by combining a new index, a time-series-like platform, and a feature selection procedure that reduces dimensionality and enhances model performance. As mentioned earlier, the VMRD is a large, mostly flat delta covered by paddy rice fields, other annual crops, perennial crops, melaleuca forests, and mangrove forests. The region experiences seasonal flooding each year and has a very dense river and channel system. Our feature selection results supported our choices by highlighting the importance of greenness indices at the seasonal transition points, including GCVI at the late stage of the dry season and NDVI at the early stage of the dry season, when the paddy rice fields are clearly observed. The high intensity of tillage in the region was also reflected in the feature selection, as indicated by the high scores of NDTI at the end of the dry season and the early stage of the flood season, which marks the beginning of the rice season in the region. The high importance of NDTI therefore implies intense tillage and high residue cover in the region at that time. This finding is consistent with other studies on the ability of NDTI to detect tillage and crop residue [71].
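As a minimal sketch of how the indices discussed above can be derived from reflectance bands, the snippet below uses commonly cited formulations. These are assumptions rather than the exact definitions of Section 2.4.1: NDWI is taken in its green/NIR form, NDBI and NDTI use the two shortwave-infrared bands, and the function and argument names are hypothetical.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.full(denom.shape, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def spectral_indices(green, red, nir, swir1, swir2):
    """Compute common index formulations (assumed, not taken from the paper)."""
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    ratio = np.full(green.shape, np.nan)
    np.divide(nir, green, out=ratio, where=green != 0)
    return {
        "NDVI": normalized_difference(nir, red),      # vegetation greenness
        "NDWI": normalized_difference(green, nir),    # open water (green/NIR form)
        "NDBI": normalized_difference(swir1, nir),    # built-up areas
        "NDTI": normalized_difference(swir1, swir2),  # tillage / crop residue
        "GCVI": ratio - 1.0,                          # green chlorophyll index
    }

# Illustrative reflectance values for a single vegetated pixel (made up).
example = spectral_indices(
    green=np.array([0.10]), red=np.array([0.05]), nir=np.array([0.45]),
    swir1=np.array([0.20]), swir2=np.array([0.10]),
)
```

Computing each index once per seasonal composite and stacking the results gives the kind of time-series-like feature set on which a feature selection step can then operate.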
4.2. Limitations of Unsupervised and Supervised Models in LULC Classification
Some studies have reported success with unsupervised classification, for which K-means is one of the most popular methods. For instance, Fatchurrachman et al. (2020) proposed a method to map rice fields using 24 monthly NDVI and VH polarization time series with K-means clustering, which achieved 95.95% accuracy in separating paddy rice fields from other fields in Malaysia [69]. Similarly, Cai et al. (2019) used the same types of NDVI and Sentinel-1 data to map paddy rice fields in a small wetland area in China, but with the Unsupervised Random Forest (URF) method, achieving an accuracy of 0.95 [68]. Other unsupervised approaches, such as deep clustering with convolutional autoencoders [72], have also shown promise in feature learning and pattern recognition tasks. However, it is not certain that such high accuracy can be achieved for multi-class classification. In a multi-class setting [14], the achieved accuracy fell significantly, revealing the limitations of unsupervised models, even those with complex architectures.
The limitations of unsupervised models, especially K-means, are evident during the labeling process. Such models rely on the statistical properties of the data, in practice spectral similarity, to separate data points, and the resulting clusters do not always align with actual class boundaries, so different land cover types are often mixed. For example, K-means frequently misclassifies flowing streams as static water bodies, particularly in aquaculture zones, showing a failure to recognize thematic details in LULC definitions. In practice, it can be challenging to assign class labels to clusters based on the ground-truth pixel distribution: often, a single ground-truth class is spread across multiple clusters, and a single cluster may include pixels from several classes. This mismatch reflects a gap between real-world land cover patterns and the criteria used by clustering algorithms, and it is influenced by factors such as feature selection, which determines the dimensions used during clustering. This dependence on spectral similarity underscores the need for supervised models, which can capture thematic differences more effectively through labeled training data.
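The cluster-labeling difficulty can be illustrated with a small sketch. It assumes a simple majority-vote rule for assigning a class to each cluster, which is one common convention and not necessarily the procedure used in this study. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, truth_labels):
    """Assign each cluster its most common ground-truth class and report purity."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, truth_labels):
        members[cluster][label] += 1
    mapping, purity = {}, {}
    for cluster, counts in members.items():
        label, hits = counts.most_common(1)[0]
        mapping[cluster] = label
        # Purity: share of the cluster's labeled pixels that carry the winning class.
        purity[cluster] = hits / sum(counts.values())
    return mapping, purity

# Toy data (made up): ten labeled pixels falling into three clusters.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
truth = ["water", "water", "water", "aquaculture",
         "rice", "rice", "water",
         "forest", "forest", "forest"]
mapping, purity = label_clusters(clusters, truth)
```

In this toy case the "water" class is split across clusters 0 and 1, and no cluster is ever labeled "aquaculture", so that class disappears from the map entirely. Low purity values flag clusters that mix several ground-truth classes.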
RF and SVM exhibited signs of overfitting, as evidenced by the discrepancies between cross-validation and testing accuracies. These gaps (approximately 7–8%) indicate that although the RF and SVM models fit the labeled training data well, they generalized less effectively to unseen data and often misclassified certain land cover types in the testing data. Such overfitting compromises the reliability of model predictions in real-world applications, especially in diverse land cover scenarios.
In contrast, the CNN outperformed the RF and SVM models, primarily because of its ability to extract spatial and contextual features. CNNs exploit local spatial patterns and relationships within the data, which allows them to discern complex structures and variations in land cover that RF and SVM may overlook, particularly in environments with densely mixed pixels. This capability enables CNNs to better distinguish between closely related classes, providing a more detailed classification than RF and SVM, which rely more heavily on feature importance and hyperplane separation, respectively.
Evaluation metrics and pattern analysis consistently show that supervised models outperform unsupervised ones. For instance, comparing the RF and CNN models with K-means clustering, the leading unsupervised method, makes clear that K-means struggles to differentiate between flowing streams and static water bodies, especially within aquaculture zones. While not definitively confirmed, this suggests that unsupervised models tend to over-predict water bodies. In contrast, supervised models tend to under-predict water bodies, often favoring forested areas.
The confusion matrices, shown in Figure 6, indicate that paddy rice is primarily associated with false negatives (Type II errors) in the RF model, whereas it is prone to false positives (Type I errors) in the SVM model. In other words, RF tends to misclassify paddy rice pixels as belonging to other classes, while SVM tends to label pixels of other classes as paddy rice. The latter highlights the limitations of using a hyperplane-based approach to separate complex land cover types. The CNN model demonstrated the highest accuracy, albeit with a few mislabeled samples.
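To make the distinction between the two error types concrete, the following sketch derives false negatives (omission) and false positives (commission) for one class from a confusion matrix. It assumes rows hold the true class and columns the predicted class. The counts are invented for illustration and are not those of Figure 6, and the function name is hypothetical.

```python
def class_errors(matrix, class_index):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    tp = matrix[class_index][class_index]
    # False negatives: samples of this class assigned to any other class.
    fn = sum(matrix[class_index]) - tp
    # False positives: samples of other classes assigned to this class.
    fp = sum(row[class_index] for row in matrix) - tp
    producers = tp / (tp + fn) if tp + fn else float("nan")
    users = tp / (tp + fp) if tp + fp else float("nan")
    return {"fn": fn, "fp": fp,
            "producers_accuracy": producers, "users_accuracy": users}

# Toy confusion matrix (made up); class order: rice, forest, water.
toy = [
    [80, 15, 5],
    [2, 95, 3],
    [1, 4, 95],
]
rice = class_errors(toy, 0)
```

Here rice has many more false negatives than false positives, so its Producer's Accuracy is lower than its User's Accuracy. The opposite imbalance would indicate a model that over-assigns pixels to the rice class.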
Several studies have compared classifiers to identify the best supervised classification method. For instance, Talukdar et al. (2020) used several machine learning methods, including RF, SVM, artificial neural network (ANN), fuzzy adaptive resonance theory-supervised predictive mapping (Fuzzy ARTMAP), spectral angle mapper (SAM), and Mahalanobis distance, with Landsat 8-derived satellite indices and other variables to classify six types of LULC in three different Ganga River landscapes [4]. In that study, RF was the most effective strategy. Basheer et al. (2022) found that the SVM classifier outperformed the others when applied to different datasets for LULC mapping in Charlottetown, Canada, between 2017 and 2021 [5]. Phan and Kappas (2014) evaluated the efficacy of RF, k-nearest neighbor (kNN), and SVM methods in a 30 × 30 km² area of the Red River Delta, northern Vietnam [6]. They tested six LULC types and 14 different training sample sizes, using all single bands of Sentinel-2 image data as independent variables, and concluded that SVM exhibited the highest accuracy. Their highest accuracy was about 0.95, which is quite close to our result. However, these successes were achieved in small study areas, with a limited number of classes and limited labeled sampling.
Due to their high resource consumption, deep learning models are rarely applied to LULC classification. Another reason is that deep learning models, such as CNNs, are better suited to object-detection tasks than to conventional LULC classification tasks [19]. Therefore, the input data was generally resampled to a higher resolution and then divided into square patches, and the CNN model was used to label these patches. This approach resulted in an output significantly coarser than the original imagery resolution. Kussul et al. (2017) applied 1-D and 2-D CNN models to separate several crop types in Kyiv, Ukraine, and compared the results with those of the RF model [8]. Overall, the 2-D CNN model achieved the highest accuracy of 0.97, while the RF model reached 0.89. When CNN models are run with a stride of 1, the output has the same resolution as the input imagery. Our study also applied this method, and as a result, the output LULC map resolution remained at 30 m. Considering the results of other studies, we conclude that the architecture of the deep learning model should be chosen carefully, depending on the scale of the study problem and the available computing resources, which remain the primary constraint on applying deep learning models in LULC classification.
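The stride-1 idea can be sketched as follows: one patch is extracted around every pixel, so a patch classifier yields one label per input pixel and the output keeps the input resolution. This is an illustration only. The reflect padding at the image edges, the (height, width, channels) layout, and the function name are assumptions, since the text does not state how edges are handled.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stride1_patches(image, patch_size):
    """Return one (patch_size x patch_size) patch per pixel of an (H, W, C) image."""
    if patch_size % 2 == 0:
        raise ValueError("patch_size must be odd so each patch has a center pixel")
    half = patch_size // 2
    # Pad the two spatial axes so that border pixels also get a full patch.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # Windows slide one pixel at a time; result shape is (H, W, C, p, p).
    windows = sliding_window_view(padded, (patch_size, patch_size), axis=(0, 1))
    # Move channels last: (H, W, p, p, C), the usual layout for a patch classifier.
    return np.moveaxis(windows, 2, -1)

# Toy image: 4 x 5 pixels with 2 feature channels.
image = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)
patches = stride1_patches(image, 3)
```

Because the patches overlap almost completely, the number of patches equals the number of pixels. This is the main reason the approach is computationally demanding over large study areas.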
4.3. Suitability of Semi-Supervised Approaches in Land Cover Mapping
While many studies have discussed the limitations of supervised models, the most prominent concern is the limited availability of labeled data for training. The relatively small sample size compared to the study area raises questions about the reliability of even high testing accuracy scores. Moreover, several authors have noted that out-of-class data can dominate the classification process, leading to skewed results [10,22,60,62,73]. As a result, SSL methods are increasingly viewed as a promising alternative for land cover classification [64,66]. However, the application of SSL and deep learning models is constrained by computational demands, technical complexity, and resource consumption (Table 6). Our observations suggest that only a limited number of studies in this domain have achieved credible accuracy evaluations [12,15,17]. Although the semi-supervised model (SoC4SS-FGVC) does not achieve the highest test accuracy, it provides a more accurate reflection of actual LULC by effectively utilizing unlabeled data. By leveraging large volumes of unlabeled samples, the model captures broader patterns and trends that may be absent from a smaller labeled dataset. This allows for a more comprehensive understanding of land cover distribution, which is particularly advantageous in data-scarce environments.
The experiments with varying sample sizes demonstrate that the semi-supervised model is robust to the scarcity of ground-truth data. When ground-truth data is extremely limited, the semi-supervised model consistently outperforms traditional supervised models, highlighting its strength in data-scarce environments. Additionally, increasing the amount of ground-truth data appears to have minimal impact on the semi-supervised model’s accuracy, in contrast to the behavior observed in supervised models, which underscores its stability. While there is potential to further enhance the performance of semi-supervised models and to deepen the analysis of accuracy matrices, the current findings support the model’s advantage under challenging conditions.
One of the primary challenges in assessing the quality and applicability of LULC classification is the issue of pure versus mixed pixels, which is influenced by spatial resolution. High-resolution sensors, such as Sentinel-2 and PlanetScope, tend to reduce spectral mixing and enhance class separability. In contrast, medium-resolution imagery often suffers from mixed pixels that blur boundaries and diminish classification reliability [74,75]. Nonetheless, empirical research suggests that increasing spatial resolution does not always improve classification accuracy. For example, Hsieh et al. (2001) developed a simulation framework to systematically examine the impacts of various parameters on classification performance [76]. Their results demonstrated that classification errors initially decrease and then increase as the ratio of ground sampling distance to field width decreases. Thus, finer spatial resolution does not inherently enhance classification accuracy, owing to boundary effects and within-class variability.
Furthermore, traditional accuracy evaluation metrics such as Overall Accuracy (OA), Producer’s Accuracy (PA), and User’s Accuracy (UA) have limitations, including the potential to obscure uncertainty and misrepresent datasets with class imbalance. Consequently, emerging approaches incorporate probabilistic and uncertainty-aware evaluation methods. These include soft classification accuracy [77], information-theoretic measures such as cross-entropy [78], and metrics that assess quantity and allocation disagreement, explicitly differentiating between systematic and random errors [79]. Such frameworks provide a more comprehensive account of classification uncertainty and are more consistent with modern per-pixel probability outputs derived from machine learning and deep learning classifiers.
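As an illustration of the quantity and allocation disagreement idea, the sketch below uses a widely used decomposition in which the two components sum to 1 − OA. Whether this matches the exact formulation in [79] is an assumption. The sketch also treats raw sample counts as population proportions, which holds only approximately unless sampling is proportional to class area. The function name and the counts are hypothetical.

```python
def disagreement(matrix):
    """matrix[i][j] = number of samples mapped as class i with reference class j."""
    total = sum(sum(row) for row in matrix)
    n = len(matrix)
    quantity = 0.0
    allocation = 0.0
    for g in range(n):
        mapped = sum(matrix[g]) / total                             # map share of class g
        reference = sum(matrix[i][g] for i in range(n)) / total     # reference share
        agree = matrix[g][g] / total
        # Quantity: mismatch in how much of class g there is overall.
        quantity += abs(mapped - reference)
        # Allocation: class g placed in the wrong locations (errors that swap).
        allocation += 2 * min(mapped - agree, reference - agree)
    overall_accuracy = sum(matrix[g][g] for g in range(n)) / total
    return {
        "overall_accuracy": overall_accuracy,
        "quantity": quantity / 2,      # each mismatch is counted in two classes
        "allocation": allocation / 2,
    }

# Toy two-class matrix (made up).
result = disagreement([[40, 10], [5, 45]])
```

Two maps with the same OA can differ sharply in this split: a large quantity component indicates that a class is systematically over- or under-mapped, whereas a large allocation component indicates that the class totals are right but the locations are wrong.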
To obtain a more reliable evaluation, we do not rely solely on the limited labeled data. We also attempt to validate our results against statistical data derived from field survey reports by local government staff. This type of comparison has been applied in only a few studies, with limited results. First, it is difficult to obtain up-to-date, official data for many study regions. Second, tabular statistics are difficult to reconcile with polygon-based spatial datasets, so they cannot be used directly as a validation dataset. However, we can compare the reported total area of a LULC type in a region with the area retrieved by each method. This approach does not yield a quantitative accuracy metric, but it provides an intuitive evaluation of the model’s performance, particularly for detecting overfitting.
The comparisons in Table A2 indicate that the area of rice land (including one-season rice crop areas) has remained stable over the years, at approximately 1.7 million hectares. Of the models, the SoC model yields the estimate closest to this figure. From our perspective, the quality of ground-truth data and the selection of input features are more critical to classification success than the choice of model itself.
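The area comparison itself is simple arithmetic, sketched below under the assumption of square 30 m pixels, so that one pixel covers 900 m², or 0.09 ha. The pixel count in the example is hypothetical and is not a result from Table A2. Only the reference area of roughly 1.7 million hectares comes from the text.

```python
def class_area_ha(pixel_count, pixel_size_m=30.0):
    """Convert a count of square pixels to hectares (1 ha = 10,000 m^2)."""
    return pixel_count * pixel_size_m ** 2 / 10_000.0

def relative_area_error(mapped_ha, reference_ha):
    """Signed relative difference; positive means the map over-estimates the area."""
    return (mapped_ha - reference_ha) / reference_ha

# Hypothetical example: 20 million pixels labeled as rice by some model.
mapped_rice_ha = class_area_ha(20_000_000)
error = relative_area_error(mapped_rice_ha, 1_700_000.0)
```

A strongly positive or negative value flags a model that over- or under-assigns a class across the region, even when its testing accuracy on the small labeled set looks high.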
Our framework also highlights the utility of satellite-derived indices, which function similarly to principal component analysis (PCA) variables, reducing input dimensionality and enhancing model performance in large-scale data environments.
One key limitation of our study is the sample size. The number of ground-truth samples is disproportionately small relative to the study area, which constrains the robustness of our results. Although SSL models can leverage unlabeled data, their effectiveness is still dependent on the quality and representativeness of the labeled subset. In our view, the sampling strategy and feature selection are more decisive than the model architecture for achieving reliable LULC classification.
Another limitation is the resolution of the resulting LULC map. As mentioned above, a finer resolution can yield richer information. However, depending on the specific goals, fine spatial resolution is not necessary for many LULC types, and its computational requirements are very high.