Advanced Semi-Supervised Learning for Remote Sensing-Based Land Cover Classification in the Mekong River Delta, Vietnam

Bui, Hai-An; Hsu, Chih-Hua; Young, Hsu-Wen Vincent; Chen, Yi-Ying; Liou, Yuei-An

doi:10.3390/rs18070989

Open Access Article


by Hai-An Bui ^1,2,3,†, Chih-Hua Hsu ^4,†, Hsu-Wen Vincent Young ^5, Yi-Ying Chen ^2,6 and Yuei-An Liou ^1,2,*

¹ Center for Space and Remote Sensing Research, National Central University, Taoyuan City 320317, Taiwan

² Taiwan International Graduate Program—Earth System Science (TIGP-ESS), Academia Sinica, National Central University, Taoyuan City 320317, Taiwan

³ Soils and Fertilizers Research Institute, Vietnam Academy of Agricultural Sciences, 10 Duc Thang, Dong Ngac, Hanoi 11909, Vietnam

⁴ Department of Industrial and Systems Engineering, Chung Yuan Christian University, Taoyuan City 320314, Taiwan

⁵ Department of Electronic Engineering, Chung Yuan Christian University, Taoyuan City 320314, Taiwan

⁶ Research Center for Environmental Changes, Academia Sinica, Taipei City 115024, Taiwan

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2026, 18(7), 989; https://doi.org/10.3390/rs18070989

Submission received: 29 October 2025 / Revised: 15 January 2026 / Accepted: 23 March 2026 / Published: 25 March 2026

(This article belongs to the Special Issue State of the Art in Land Cover Classification and Mapping: Building Up Digital Twins of Earth)


Highlights

What are the main findings?

We proposed a semi-supervised framework that integrates multi-temporal remote sensing indices for land cover classification in data-scarce environments.
The proposed SoC4SS-FGVC model achieved 0.92 accuracy with only 500 labeled samples, outperforming supervised models (RF, SVM, CNN).

What are the implications of the main findings?

Semi-supervised learning improves the discrimination of spectrally similar LULC classes in complex agricultural landscapes.
A realistic validation strategy reduces overfitting and provides more reliable land-cover estimates.

Abstract

The Vietnam Mekong River Delta (VMRD) is a climate-sensitive region characterized by diverse ecosystems, including extensive mangrove forests that protect against sea-level rise and contribute to global carbon sequestration. Accurate land cover classification in the VMRD is essential but remains challenging due to complex landscapes and dynamic environmental conditions. The primary objective of this study is to propose a semi-supervised deep learning framework that integrates satellite indices with multi-temporal remote sensing data to address key classification challenges, particularly when ground-truth data are limited, and to compare it with unsupervised and supervised machine learning methods. Our comparative analysis across different sample sizes (500 to 6000 ground-truth data points) reveals critical insights into model performance and scalability. Supervised models, including Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), performed strongly when sufficient labeled data were available, with CNN achieving the highest accuracy (0.97 at 6000 samples). However, at the smallest sample size (500 sample points), their accuracies dropped substantially (RF: 0.75, SVM: 0.80, CNN: 0.81). Supervised models also showed overfitting tendencies when compared with official land cover statistics. In contrast, the semi-supervised approach (SoC4SS-FGVC) achieved an accuracy of 0.92 with only 500 sample points, demonstrating its strength when labeled data are scarce. The framework also improved the discrimination of spectrally similar land-cover classes and the detection of environmentally sensitive types such as mangrove forests. Cross-validation with official statistics confirmed the semi-supervised model's superior effectiveness in delineating paddy rice fields and its resistance to overfitting.
The performance analysis demonstrates that SoC4SS-FGVC provides a practical and cost-effective solution for land cover mapping, particularly in regions where extensive ground-truth data collection is prohibitively expensive or logistically challenging.

Keywords:

deep learning; land cover classification; Mekong River delta; satellite indices; semi-supervised learning

1. Introduction

Land cover refers to the collection of biotic and abiotic components on the Earth’s surface, while land use describes the economic and cultural activities (e.g., agricultural, residential, industrial, mining, and recreational uses) carried out in a specific area. Accurate Land Use and Land Cover (LULC) mapping is essential for effective land management, water resource distribution, agricultural planning, food security, and environmental preservation. Regularly updated LULC maps offer governments, policymakers, and stakeholders vital information for sustainable development and climate adaptation strategies [1]. However, a major challenge remains in obtaining precise and detailed LULC data in certain regions, as traditional field survey mapping is often too costly or logistically difficult.

In agricultural landscapes, monitoring crop patterns is particularly important because they reflect the interactions between human activities and environmental stressors such as drought, floods, typhoons, and freshwater scarcity [2,3]. Despite this importance, producing LULC maps requires a significant investment of money and human resources. Even when supervised machine learning models are used, obtaining sufficient ground-truth data remains a major challenge, and the scarcity of labeled samples continues to limit model effectiveness.

Traditionally, LULC classification depended on visual interpretation of true-color (RGB: Red–Green–Blue) and false-color (NIR-R-G: Near-Infrared–Red–Green) imagery, supported by statistical clustering techniques such as Agglomerative Hierarchical Clustering (AHC). The combination of satellite and airborne remote sensing has since transformed mapping methods, enabling large-scale, data-driven analyses. Machine learning models such as Random Forest (RF) and Support Vector Machines (SVM) have become widely used to improve classification accuracy: RF has shown strong performance in diverse landscapes [4], and SVM has proven robust across various environments [5,6]. More recently, deep learning has transformed LULC mapping. Convolutional Neural Networks (CNNs) outperform traditional techniques by learning spatial hierarchies of features, achieving better results in image segmentation and classification [7]. However, deep learning methods typically require high spatial resolution imagery, substantial computational resources, and, most importantly, large amounts of labeled training data that are costly and difficult to acquire [8,9,10].

The use of satellite-derived indices has further enhanced LULC classification by reducing data dimensionality and capturing meaningful biophysical signals. Vegetation, water, built-up, and chlorophyll-based indices have been especially effective in distinguishing land cover types and detecting seasonal variations [11,12]. Time-series remote sensing data also allow dynamic monitoring, although persistent challenges remain in addressing cloud contamination, missing data, and atmospheric distortions [13,14,15].

Semi-supervised learning (SSL) methods have emerged as a promising alternative to address the ground-truth scarcity problem. By using both limited labeled data and abundant unlabeled satellite imagery, SSL approaches can potentially deliver reliable classification performance without the prohibitive data collection costs of fully supervised methods [16,17]. The theoretical foundation of SSL rests on three key assumptions: (1) similar samples should share labels (smoothness assumption); (2) decision boundaries should pass through low-density regions (low-density separation); and (3) data points lying on the same low-dimensional manifold should share labels (manifold assumption) [18,19]. These principles enable SSL to exploit the structure inherent in unlabeled data, reducing dependence on exhaustive ground-truth information. However, the practical application of SSL in LULC mapping remains underexplored, particularly under different levels of ground-truth availability.

The Vietnam Mekong River Delta (VMRD) is an ideal testbed for addressing these questions. As one of the most climate-vulnerable regions in the world, the VMRD faces severe challenges from sea-level rise, saltwater intrusion, shifting flood regimes, and land subsidence. In this context, monitoring changes in land cover, particularly paddy rice fields and mangrove forests, is critical, as these ecosystems are both environmentally sensitive and socio-economically significant. Mangrove forests play an essential ecological role and contribute substantially to global carbon sequestration, yet accurate mangrove detection remains technically challenging. Previous studies have generated annual LULC maps for the region using averaged spectral indices with traditional machine learning methods [20,21,22,23] or localized classification methods [24,25,26]. However, these approaches have been constrained by limited ground-truth availability and the practical impossibility of collecting comprehensive labeled datasets across the delta.

This study addresses these gaps by developing a semi-supervised deep learning framework tailored to the VMRD. By integrating localized feature selection with advanced modeling techniques, we aim to overcome the limitations of both unsupervised and supervised methods, providing more reliable LULC prediction, especially for mangrove and paddy rice observation.

2. Materials and Methods

2.1. Study Area

The Mekong River is one of the world's most significant river basins (Figure 1). Originating on the Tibetan Plateau, the river flows through China, Myanmar, Laos, Thailand, and Cambodia, transporting 150 million tons of alluvial sediment annually to the South China Sea. The delta continues to extend seaward, interfacing with the sea through a system of mangrove forests [27]. Seasonal inundation strongly shapes the region, which has two distinct seasons: the wet season, also referred to as the inundation season, from April to September, and the dry season, from October to early April of the following year. The inundation coincides with the rainy season, which accounts for 90% of the total rainfall. However, the region's primary water resource is the river's surface flow, which contributes approximately 400 billion m³ annually, compared with 65 billion m³ from annual rainfall [28,29].

The total area of the region is 40,816 km², more than 65% of which is agricultural land; paddy rice cultivation is predominant, occupying half of the agricultural area. According to recent reports, paddy rice fields cover 1.6 million hectares and support two or three rice seasons per year [30]. During the dry season, the lower part of the delta can support only one rice season [27]. The region is one of the world's leading rice exporters, producing approximately 6 million tons annually, and therefore plays a crucial role in Vietnam's and global food security. Its mangrove forests also serve as an important carbon store, contributing to global ecological security and sustainability.

Because of its importance, the region has been the subject of numerous LULC mapping studies. However, as noted in the Introduction, these works suffer from several limitations. Our study offers a new approach to overcoming these limitations, in particular the overfitting/underfitting problems.

2.2. Ground Truth Data

Following previous studies, we defined 12 LULC classes: paddy rice, rice–shrimp system, annual crops, perennial crops, productive forests, protective forests, mangrove forests, aquaculture, wetlands, dense built-up areas, sparse built-up areas, and rivers. These are the LULC types of greatest concern in the region. To collect samples for training, testing, and validation, we followed a procedure based on existing collection methods, revised to meet the study's specific requirements. The procedure is described below.

We collected a ground-truth dataset using the concept introduced by [31], with a few simplifications. Technical details are described in [32,33]. We manually selected ground-truth points on a yearly mean true-color mosaic image loaded into the Google Earth Engine (GEE) platform.
Then, the collection was compared with the 2005 LULC map compiled and provided by the Vietnam Ministry of Natural Resources and Environment (MONRE). Any points that received a different label were removed.
We also reviewed LULC maps from previous studies [23,24,25,26,34,35,36] and compared their ground-truth data or model outputs with our dataset to reach a consensus and improve label accuracy. Together, these steps mean that our collection contains only samples whose LULC has remained unchanged for a long time.
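The two filtering steps above can be sketched as follows. This is a minimal sketch, assuming that the label of each reference map has already been sampled at every ground-truth point and recoded to the same 12-class scheme, with -1 marking points a map does not cover. The helper name `consensus_filter` and the `min_agreement` threshold are hypothetical, since the text does not state how consensus with the earlier maps was defined.

```python
import numpy as np


def consensus_filter(labels, monre_labels, other_map_labels, min_agreement=1.0):
    """Return a boolean mask of ground-truth points to keep (hypothetical helper).

    labels           : (n,) class indices assigned by manual interpretation
    monre_labels     : (n,) class indices from the 2005 MONRE map at the same points
    other_map_labels : (m, n) class indices from m earlier maps; -1 = not covered
    min_agreement    : assumed fraction of covering maps that must agree
    """
    labels = np.asarray(labels)
    # Step 1: drop every point whose label differs from the MONRE map.
    keep = np.asarray(monre_labels) == labels

    # Step 2: consensus with earlier maps, ignoring maps that do not cover a point.
    others = np.atleast_2d(np.asarray(other_map_labels))
    covered = others >= 0
    n_covered = covered.sum(axis=0)
    n_agree = ((others == labels) & covered).sum(axis=0)
    # Points that no earlier map covers are kept (agreement fraction set to 1).
    frac = np.divide(n_agree, n_covered,
                     out=np.ones(labels.shape, dtype=float), where=n_covered > 0)
    return keep & (frac >= min_agreement)
```

With `min_agreement=1.0` a point survives only if every covering map agrees; lowering it to 0.5 turns the second step into a majority vote.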

Table 1 shows the number of samples collected and details for each class. The number of labeled samples is negligible compared with the total area, which comprises approximately 44 million samples (pixels at 30 m resolution). This is the primary concern when assessing the reliability of supervised models, which are evaluated only against ground-truth data. Therefore, SSL models, which use both labeled and unlabeled data during training, are considered a better approach.

2.3. Remote Sensing Data

The study uses Landsat 8 image bands as the primary data source, obtained from the United States Geological Survey (USGS) database at https://earthexplorer.usgs.gov/ (accessed on 17 April 2024). We used the Level-1 Collection 2 product (Landsat 8/9 OLI/TIRS C2 L1). We selected five optical bands: band 3 (green), band 4 (red), band 5 (near-infrared), band 6 (shortwave infrared 1), and band 7 (shortwave infrared 2). The VMRD is covered by six Landsat tiles (path/row 125/052, 125/053, 125/054, 126/052, 126/053, and 126/054), which we mosaicked into a single image covering the study region.

Inspection of the satellite image archive revealed high cloud coverage in the region. We therefore selected clear-sky images from late July 2022, late December 2022, early March 2023, and late May 2023. At each time point, we collected six scenes covering the entire study area, each containing five single-band images. For each band and time point, we created a regional single-band image by mosaicking the six scenes. Pixels identified as low quality by the dataset's quality flags were marked as NoData. Next, we clipped the mosaicked single-band images to the study region, removing pixels outside the study area so that only the pixels used in the subsequent modeling remained. NoData (NaN) areas in the clipped images were filled with the mean of the nearest valid pixels.
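The masking and gap-filling steps above can be sketched with NumPy as follows. This is a minimal sketch under two assumptions not stated in the text: the quality bits used (here, fill, dilated cloud, cloud, and cloud shadow, at the bit positions commonly documented for the Landsat Collection 2 `QA_PIXEL` band, which should be checked against the USGS product guide), and the reading of "nearest pixels" as the 3 × 3 neighborhood, applied repeatedly so that larger gaps fill from their edges inward.

```python
import numpy as np


def mask_low_quality(band, qa, bad_bits=(0, 1, 3, 4)):
    """Set pixels flagged in the QA band to NaN.

    bad_bits are ASSUMED Landsat Collection 2 QA_PIXEL bit positions
    (0 fill, 1 dilated cloud, 3 cloud, 4 cloud shadow); verify before use.
    """
    qa = np.asarray(qa).astype(np.int64)
    bad = np.zeros(qa.shape, dtype=bool)
    for bit in bad_bits:
        bad |= ((qa >> bit) & 1) == 1
    out = np.array(band, dtype=float, copy=True)
    out[bad] = np.nan
    return out


def fill_nodata(band, max_iter=50):
    """Fill NaN pixels with the mean of valid pixels in their 8-neighbourhood.

    The fill is repeated so that larger gaps are filled from their edges inward.
    """
    filled = np.array(band, dtype=float, copy=True)
    h, w = filled.shape
    for _ in range(max_iter):
        nan_mask = np.isnan(filled)
        if not nan_mask.any():
            break
        padded = np.pad(filled, 1, mode="constant", constant_values=np.nan)
        # Stack the 8 shifted copies so each pixel sees its neighbours.
        neighbours = np.stack([
            padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
            for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
        ])
        valid = ~np.isnan(neighbours)
        counts = valid.sum(axis=0)
        sums = np.where(valid, neighbours, 0.0).sum(axis=0)
        fillable = nan_mask & (counts > 0)
        filled[fillable] = sums[fillable] / counts[fillable]
    return filled
```

In this sketch the whole neighbourhood is read before any pixel is updated, so each pass uses only values from the previous pass.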

We also collected and used the digital elevation model (DEM), representing the elevation information. DEM data is a 30 m resolution product of the Shuttle Radar Topography Mission (SRTM) and accessed from the GEE [37].

The initial candidate input dataset includes 25 variables (6 satellite indices × 4 time points + DEM); the indices are described in Section 2.4.1.

2.4. Feature Processing and Selection

2.4.1. Feature Engineering

Many authors have used all original bands of a satellite image for LULC classification. However, because processing all bands is computationally intensive and time-consuming, we instead used ancillary variables (satellite indices) to limit the number of features. Numerous studies have demonstrated the efficient application of satellite indices, including the Normalized Difference Vegetation Index (NDVI) [38], Normalized Difference Built-up Index (NDBI) [39], and Modified Normalized Difference Water Index (MNDWI) [40], in LULC classification.

The occurrence of water and plant coverage varies seasonally in the VMRD. Therefore, we generate several satellite indices to reflect the variation, including NDVI, NDBI, Green Chlorophyll Vegetation Index (GCVI) [41], Normalized Difference Tillage Index (NDTI) [42], Normalized Difference Latent Heat Index (NDLI) [43], and MNDWI, using data from different times as input.

The indices initially considered in the study are listed in Table 2, which describes the goals of each index, its name, abbreviation, derived equation, and original reference. The final selected indices for the LULC classification task are GCVI, NDTI, NDVI, NDLI, and MNDWI. The specific acquisition dates and the corresponding feature importance rankings of the selected indices are listed in Table A1.

In Table 2, ρ denotes a Landsat 8 band, and the subscript gives the band number in order of wavelength: ρ₃ to ρ₇ are the green, red, near-infrared, shortwave infrared 1, and shortwave infrared 2 bands, respectively.
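The index computations can be sketched directly from the band definitions above. This is a minimal sketch assuming the standard published forms of these indices (the exact equations used in the study are those in Table 2). NDLI is left out because its equation is not reproduced in this text. The function names are illustrative.

```python
import numpy as np


def normalized_difference(a, b):
    """(a - b) / (a + b), returning NaN where a + b == 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.full(np.broadcast(a, b).shape, np.nan)
    return np.divide(a - b, denom, out=out, where=denom != 0)


def spectral_indices(g, r, nir, swir1, swir2):
    """Standard forms, using rho3=g, rho4=r, rho5=nir, rho6=swir1, rho7=swir2."""
    g = np.asarray(g, dtype=float)
    return {
        "NDVI": normalized_difference(nir, r),        # (rho5 - rho4) / (rho5 + rho4)
        "NDBI": normalized_difference(swir1, nir),    # (rho6 - rho5) / (rho6 + rho5)
        "MNDWI": normalized_difference(g, swir1),     # (rho3 - rho6) / (rho3 + rho6)
        "NDTI": normalized_difference(swir1, swir2),  # (rho6 - rho7) / (rho6 + rho7)
        "GCVI": np.asarray(nir, dtype=float) / g - 1.0,  # rho5 / rho3 - 1
    }
```

Applied to each of the four acquisition dates, this produces the per-date index layers that, together with the DEM, form the candidate feature stack.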

These satellite-derived spectral indices provide concise representations of complex surface properties that are vital for LULC classification. Recent developments have introduced advanced drought-sensitive indices that can enhance feature selection for environmental monitoring and classification. These include the Surface Water Availability and Temperature Index (SWAT), which robustly links hydrological and thermal dynamics for drought monitoring [44]; the Temperature-Soil Moisture Dryness Index (TMDI), which integrates thermal and soil moisture signals and performs effectively in semi-arid regions [45]; the Surface Water Availability-Temperature Index (SWATI), which combines water availability with thermal stress indicators for global-scale drought assessment [46]; and the Relative Surface Evapotranspiration Index (RSETI), which characterizes drought via relative evapotranspiration estimates across diverse landscapes [47]. Additionally, the most recent studies [48,49] apply these indices to urban heat island studies associated with LULC. In general, these indices capture subtle surface conditions often overlooked by conventional approaches and can thereby improve LULC classification accuracy.

2.4.2. Feature Selection

Random Forest Gini Index Importance

To reduce redundant predictors that do not significantly contribute to the modeling process, we first apply an RF model using the collected ground-truth data to determine the importance of each feature. In this case, the importance is defined by the Gini index. The Gini impurity at node j is defined as:

\mathrm{Gini}(j) = 1 - \sum_{i} [P(i \mid j)]^{2}

(1)

The weighted Gini index of a split of a node into M child nodes is:

\mathrm{Gini}_{split} = \sum_{j = 1}^{M} \frac{n_{j}}{n} \mathrm{Gini}(j)

(2)

where j ∈ {1, 2, …, M} indexes the child nodes, P(i|j) is the relative frequency of class i at node j, n_j is the number of instances at node j, and n is the number of instances at the parent node.
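Equations (1) and (2) translate directly into code. The following is a minimal NumPy sketch of both quantities; the function names are illustrative.

```python
import numpy as np


def gini(labels):
    """Gini impurity of one node, Equation (1)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # P(i | j) for each class i present at the node
    return float(1.0 - np.sum(p ** 2))


def gini_split(children):
    """Weighted Gini of a split into child nodes, Equation (2)."""
    n = sum(len(c) for c in children)   # instances at the parent node
    return sum(len(c) / n * gini(c) for c in children)
```

A split is preferred when it lowers the weighted impurity. For example, splitting {0, 0, 1, 1} (impurity 0.5) into {0, 0} and {1, 1} gives a weighted Gini of 0, whereas splitting it into {0, 1} and {0, 1} leaves it at 0.5.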

Based on the RF-derived importance rankings, the top 10 features were retained for subsequent modeling (Table A1).
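The ranking step can be sketched with scikit-learn, whose `RandomForestClassifier.feature_importances_` reports the mean decrease in Gini impurity. This is a sketch under assumptions: the library, `n_estimators=200`, and `random_state` are not given in the text, and `rf_top_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_top_features(X, y, feature_names, k=10, n_estimators=200, random_state=0):
    """Rank features by Gini (mean decrease in impurity) importance and keep the top k."""
    rf = RandomForestClassifier(n_estimators=n_estimators, criterion="gini",
                                random_state=random_state)
    rf.fit(X, y)
    importance = rf.feature_importances_          # normalised to sum to 1
    order = np.argsort(importance)[::-1]          # most important first
    return [feature_names[i] for i in order[:k]], importance
```

For the study's setting, `X` would hold the 25 candidate variables sampled at the labeled points, and the returned list would correspond to the 10 retained features.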

Mutual Information-Based Feature Importance

To mitigate potential bias arising from reliance on a single embedded feature selection method, we complemented the RF approach with an independent filter-based method using Mutual Information (MI). This technique evaluates feature relevance by quantifying the amount of information shared between each input feature and the target LULC class labels, effectively capturing both linear and nonlinear dependencies without assuming any specific functional form or classifier [50].

For a feature X_i and the LULC class label Y, the mutual information is defined as:

\mathrm{MI}(X_{i}; Y) = \sum_{x \in X_{i}} \sum_{y \in Y} p(x, y) \log \left(\frac{p(x, y)}{p(x)\, p(y)}\right)

(3)

where p(x, y) is the joint probability distribution function of X_i and Y, and p(x) and p(y) are the marginal probability distribution functions of X_i and Y, respectively.
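For discrete variables, Equation (3) can be evaluated from a contingency table of empirical frequencies. The following is a minimal sketch; continuous index values would first need to be discretized (for instance, by quantile binning), or estimated with a nearest-neighbor estimator. The text does not say which estimator was used [50].

```python
import numpy as np


def mutual_information(x, y):
    """Equation (3) for discrete x and y, with empirical probabilities (natural log)."""
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi.ravel(), yi.ravel()), 1.0)   # contingency table
    joint /= joint.sum()                     # p(x, y)
    px = joint.sum(axis=1, keepdims=True)    # p(x)
    py = joint.sum(axis=0, keepdims=True)    # p(y)
    nz = joint > 0                           # 0 * log 0 is taken as 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))
```

A feature that fully determines a balanced binary label yields log 2 ≈ 0.693 nats, while an independent feature yields 0. A continuous index can be binned first, e.g. `np.digitize(ndvi, np.quantile(ndvi, [0.25, 0.5, 0.75]))`.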

To ensure objectivity and reproducibility, feature importance was assessed solely based on band indices rather than semantic or descriptive feature names. The MI scores were normalized and ranked in descending order to identify the most informative features. This MI-based ranking was subsequently compared with the RF Gini-based importance scores to evaluate the consistency of feature relevance across the two distinct methodological approaches.
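The normalization, ranking, and comparison with the RF scores can be sketched as follows. This sketch makes several assumptions. It uses scikit-learn's `mutual_info_classif` (a nearest-neighbor MI estimator for continuous features) and normalizes by the maximum score. It measures consistency with a Spearman rank correlation and a top-k overlap. The text says only that the scores were "normalized" and "compared," so all of these choices are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def descending_ranks(scores):
    """Rank 0 = highest score (ties broken by column order)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranks = np.empty(len(order))
    ranks[order] = np.arange(len(order))
    return ranks


def compare_mi_rf(X, y, rf_importance, k=10, random_state=0):
    """Normalised MI scores, Spearman rho with RF ranks, and top-k overlap."""
    mi = mutual_info_classif(X, y, random_state=random_state)
    mi_norm = mi / mi.max() if mi.max() > 0 else mi   # assumed max-normalisation
    r_mi = descending_ranks(mi_norm)
    r_rf = descending_ranks(rf_importance)
    rho = float(np.corrcoef(r_mi, r_rf)[0, 1])        # Spearman = Pearson on ranks
    top_mi = set(np.argsort(-mi_norm)[:k].tolist())
    top_rf = set(np.argsort(-np.asarray(rf_importance))[:k].tolist())
    return mi_norm, rho, len(top_mi & top_rf) / k
```

A high rank correlation together with a large top-10 overlap would indicate that the embedded (RF) and filter (MI) methods agree on which features matter.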

After feature selection, a K-means clustering model was applied, and its performance, evaluated using the Silhouette score and inertia, was examined to assess how effectively the selected features partition the study area into 12 distinct classes.
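The evaluation described above can be sketched with scikit-learn's `KMeans` (whose `inertia_` attribute is the within-cluster sum of squares) and `silhouette_score`. The `n_init` value and the silhouette subsample size are assumptions. Subsampling is needed in practice because the silhouette requires pairwise distances, which is infeasible for tens of millions of pixels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_kmeans(X, n_clusters=12, sample_size=10_000, random_state=0):
    """Fit K-means and report inertia (WCSS) and a (subsampled) silhouette score."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # The silhouette costs O(n^2) pairwise distances, so it is estimated on a subsample.
    sil = silhouette_score(X, km.labels_, sample_size=min(sample_size, len(X)),
                           random_state=random_state)
    return km.labels_, km.inertia_, float(sil)
```

A silhouette close to 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping clusters. Comparing these scores across feature subsets is one way to judge how cleanly the selected features partition the 12 classes.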

2.5. Classification Models

2.5.1. K-Means Clustering

K-means clustering groups instances based on their Euclidean distances: data points are iteratively assigned to the nearest cluster centroid, and the centroids are then updated [51]. In this study, the Euclidean space is 10-dimensional, corresponding to the 10 selected features (Table A1). Therefore, an instance X is represented as

\vec{X} = (x_{1}, x_{2}, x_{3}, \dots, x_{10})

where x_i is the value of X in feature i. The distance between two instances is calculated by Euclidean distance.

Next, 12 points are randomly selected as the initial cluster centers (cluster means), denoted by M_k^(t), where k ∈ {1, 2, …, 12} is the cluster index and t is the iteration number at which the means are updated.

The means partition the feature space into Voronoi cells S_k^(t), assigning each observation to the cluster with the nearest mean. After all observations have been assigned, the means (centroids) are updated using the equation:

M_{k}^{(t + 1)} = \frac{1}{| S_{k}^{(t)} |} \sum_{X \in S_{k}^{(t)}} X

(4)

The objective function used is the square error function, or the within-cluster sum of squares function. The resulting classes were then assigned names based on how the training sets were distributed within each class and how they aligned with the official statistical areas for each LULC type in 2020.
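The assignment and update steps (Equation (4)) form Lloyd's algorithm. The following is a minimal NumPy sketch. It broadcasts an n × k × d distance array, which suits small samples only; a full 44-million-pixel scene would require chunking or an optimized library. Keeping an empty cluster's previous mean is a choice made for this sketch.

```python
import numpy as np


def assign(X, means):
    """Assignment step: index of the nearest mean (Voronoi cell) for every row of X."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans_lloyd(X, k=12, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initial M_k^(0)
    for _ in range(n_iter):
        labels = assign(X, means)
        # Update step, Equation (4); an empty cluster keeps its previous mean.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        converged = np.allclose(new_means, means)
        means = new_means
        if converged:
            break
    labels = assign(X, means)
    wcss = float(((X - means[labels]) ** 2).sum())          # within-cluster sum of squares
    return labels, means, wcss
```

Each iteration cannot increase the within-cluster sum of squares, which is why the loop converges, though only to a local optimum that depends on the random initial centers.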

2.5.2. RF and SVM Models Optimization

We test the supervised models using several subsets of labeled data to assess the impact of data scarcity on model accuracy. The number of labeled samples in the subsets is 500, 1000, 2000, 4000, and 6024 (the entire ground-truth dataset).

We apply Bayesian optimization to tune the hyperparameters of the RF and SVM models. Because these classifiers perform best with balanced classes, we first address the class imbalance in each labeled subset using the Synthetic Minority Over-sampling Technique (SMOTE) [52]. SMOTE creates new synthetic instances for minority classes by interpolating between a minority sample and one of its k nearest neighbors within the same class.
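The interpolation at the core of SMOTE can be sketched without external dependencies as follows. Libraries such as imbalanced-learn provide a full implementation. The helper names, `k=5`, and the choice to oversample every class up to the size of the largest class are assumptions for illustration.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples for one minority class by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)                      # needs at least 2 samples in the class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)           # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                       # samples to start from
    nb = neighbours[base, rng.integers(0, k, size=n_new)]       # one random neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])


def balance_classes(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    X_parts, y_parts = [X], [y]
    for c, cnt in zip(classes, counts):
        if cnt < counts.max():
            X_parts.append(smote(X[y == c], counts.max() - cnt, k=k, seed=seed))
            y_parts.append(np.full(counts.max() - cnt, c))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

Because every synthetic point lies on a segment between two real samples of the same class, SMOTE fills in the class region rather than duplicating existing points.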

We use the Hyperopt library for hyperparameter tuning. It uses Tree-structured Parzen Estimator (TPE) as a Bayesian optimization algorithm to minimize the objective functions for RF and SVM.

The hyperparameters optimized in RF are n_estimators, the number of trees in the forest, and max_depth, the maximum depth of each tree. In this study, n_estimators ranges from 50 to 500 in increments of 20, and max_depth ranges from 5 to 20 in increments of 1.

In SVM, the hyperparameters optimized are C, the penalty term that controls the trade-off between misclassification and margin maximization, and γ, the kernel coefficient of the radial basis function (RBF) kernel, which influences the shape of the decision boundary. Both are searched on a natural-logarithm scale, with ln C and ln γ each ranging from −4 to 4. Cross-validation is performed using K-Fold with five splits for each hyperparameter tuning trial. The accuracy for each trial is calculated as in [53]:
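The search spaces and the cross-validated objective described above can be sketched with scikit-learn. In this sketch, Hyperopt's TPE sampler is replaced by uniform random sampling over the same spaces, so it runs with scikit-learn alone. It is therefore a stand-in, not the TPE procedure itself. With Hyperopt, the same spaces would be written as `hp.choice` and `hp.loguniform` expressions, and `objective` would be passed to `fmin(..., algo=tpe.suggest)`. Note that a step of 20 starting at 50 reaches 490 as its largest value. The helper names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Search spaces as described in the text.
RF_N_ESTIMATORS = list(range(50, 501, 20))   # 50, 70, ..., 490
RF_MAX_DEPTH = list(range(5, 21))            # 5, 6, ..., 20


def sample_rf(rng):
    return {"n_estimators": int(rng.choice(RF_N_ESTIMATORS)),
            "max_depth": int(rng.choice(RF_MAX_DEPTH))}


def sample_svm(rng):
    # ln(C) and ln(gamma) drawn uniformly from [-4, 4].
    return {"C": float(np.exp(rng.uniform(-4, 4))),
            "gamma": float(np.exp(rng.uniform(-4, 4)))}


def make_rf(params):
    return RandomForestClassifier(random_state=0, **params)


def make_svm(params):
    return SVC(kernel="rbf", **params)


def objective(model, X, y, seed=0):
    """Loss to minimise: negative mean accuracy (Equation (5)) over 5-fold CV."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    return -cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


def random_search(make_model, sampler, X, y, n_trials=20, seed=0):
    """Stand-in for Hyperopt's TPE: uniform random sampling of the same space."""
    rng = np.random.default_rng(seed)
    best_loss, best_params = np.inf, None
    for _ in range(n_trials):
        params = sampler(rng)
        loss = objective(make_model(params), X, y)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params
```

Usage: `random_search(make_svm, sample_svm, X, y)` or `random_search(make_rf, sample_rf, X, y)`. TPE differs from this stand-in by concentrating later trials in regions that scored well earlier.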

A c c u r a c y = \frac{\sum_{i = 1}^{K} n_{i i}}{n}

(5)

where n_ii is the number of samples that are predicted to belong to class i and are in class i; n is the total number of samples; and K is the number of classes. In our case, K = 12.
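Equation (5) is the trace of the confusion matrix divided by the total number of samples. The following is a minimal sketch, with classes encoded as integers 0 to 11.

```python
import numpy as np


def overall_accuracy(y_true, y_pred, n_classes=12):
    """Equation (5): trace of the confusion matrix divided by the number of samples."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)   # rows: true, cols: predicted
    return float(np.trace(cm) / cm.sum())
```

For instance, true labels [0, 1, 2, 2] and predictions [0, 1, 1, 2] give n_00 + n_11 + n_22 = 3 correct out of 4, an accuracy of 0.75.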

2.5.3. Random Forest Classifier

RF [54,55] is a well-known and popular machine learning model. It extends the decision tree, which categorizes observations based on their similarity, into an ensemble: each tree votes, and the class receiving the most votes across the ensemble is chosen.

Each decision tree in the RF splits the data at each node based on a measure of the "purity" of the resulting subsets. Besides the Gini impurity in Equation (1), a common impurity measure in classification is the entropy of a node, H:

H = - \sum_{i = 1}^{k} p_{i} \log_{2} (p_{i})

(6)

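where p_i is the probability of class i in the node, and k is the total number of classes. The information gain of a split is the reduction in entropy from the parent node to the child subsets S_j: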

\mathrm{Information\ Gain} = H_{parent} - \sum_{j} \frac{| S_{j} |}{| S |} H_{j}

(7)

where |S_j|/|S| is the proportion of the parent node's samples that fall into subset S_j after the split. RF optimizes the following objectives:

Maximizing node purity in individual trees: Each tree splits the data to maximize purity (measured by minimizing Gini impurity or maximizing information gain) at each node.

Aggregating decisions from multiple trees: The result is the aggregation of all trees’ outputs. In classification, each tree votes for a class, and the class that gains the highest votes is chosen. Thus, the RF classifier can be viewed as optimizing the following ensemble objective:

H(x) = \mathrm{mode}(h_{1}(x), h_{2}(x), \dots, h_{m}(x))

(8)

where h_i(x) is the prediction of the i-th tree and H(x) is the final prediction.
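Equations (6) to (8) can be checked on small examples. The following is a minimal NumPy sketch of node entropy, information gain, and the per-sample majority vote. Ties in the vote go to the lowest class index, which is a choice made for this sketch.

```python
import numpy as np


def entropy(labels):
    """Equation (6): entropy of a node in bits."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(parent, children):
    """Equation (7): parent entropy minus the size-weighted child entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)


def majority_vote(tree_predictions):
    """Equation (8): per-sample mode of the trees' predicted class indices."""
    preds = np.asarray(tree_predictions)          # shape (n_trees, n_samples)
    n_classes = preds.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)                  # ties go to the lowest class index
```

Splitting {0, 0, 1, 1} into two pure halves gains 1 bit, the full parent entropy. Three trees predicting [0, 1, 2], [0, 2, 2], and [1, 2, 0] for three samples yield the ensemble prediction [0, 2, 2].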

2.5.4. Support Vector Machine

SVM determines the decision boundary by maximizing the margin between two classes [56]. In the feature space, the optimal hyperplane is:

W \cdot X + b = 0

(9)

where W is the normal vector to the hyperplane and b is the bias (offset) of the hyperplane. The model then minimizes the objective:

\min_{W, b} \frac{1}{2} {\| W \|}^{2} + C \sum_{i = 1}^{n} \max (0, 1 - y_{i} (W \cdot X_{i} + b))

(10)

where C is the penalty parameter (Section 2.5.2), y_i = ±1 is the target label of the i-th sample, and W · X_i + b is the raw output of the classifier's decision function for the i-th sample.
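The soft-margin objective in Equation (10) can be evaluated directly for a given hyperplane. The following is a minimal NumPy sketch of the linear primal form. With the RBF kernel, the dot products are replaced by kernel evaluations in the dual problem. A 12-class problem also requires combining several binary SVMs (for example, one-vs-one), which the text does not detail.

```python
import numpy as np


def svm_objective(W, b, X, y, C):
    """Equation (10): 0.5 * ||W||^2 + C * sum_i max(0, 1 - y_i (W . X_i + b))."""
    W, X, y = np.asarray(W, float), np.asarray(X, float), np.asarray(y, float)
    raw = X @ W + b                          # raw decision-function outputs
    hinge = np.maximum(0.0, 1.0 - y * raw)   # zero for points outside the margin
    return 0.5 * float(W @ W) + C * float(hinge.sum())


def predict(W, b, X):
    """Sign of the decision function gives the predicted label (+1 or -1)."""
    return np.where(np.asarray(X, float) @ np.asarray(W, float) + b >= 0, 1, -1)
```

With W = (1, 0), b = 0, and C = 1, a positive sample at (2, 0) lies outside the margin and adds no hinge loss. A negative sample at (−0.5, 0) is correctly classified but inside the margin and adds 0.5, so the objective is 0.5 + 0.5 = 1.0.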

2.5.5. Convolutional Neural Network

One of the most popular deep learning techniques is the convolutional neural network [57,58,59]. During the preprocessing step, image tiles are divided into smaller stacks of 48 × 48 pixels, each containing 10 bands. The labels are one-hot encoded, i.e., each label is a vector whose length equals the number of classes, where the index corresponding to the actual class is assigned a value of 1, and the rest are 0. The label is assigned to the center pixel of the image. The data is shuffled; 70% are used for training and 30% for validation.
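The patch preparation described above can be sketched with NumPy as follows. Two details are assumptions not stated in the text: reflect padding at the image border, and the placement of the labeled pixel at index (24, 24) within each patch, because a 48-pixel window has no single central pixel. The helper names are hypothetical.

```python
import numpy as np


def extract_patches(stack, centers, labels, size=48, n_classes=12):
    """Cut size x size patches around labeled pixels and one-hot encode their labels.

    stack   : (H, W, bands) feature image, e.g. the 10 selected features
    centers : (row, col) of each labeled pixel; the label belongs to this pixel
    """
    half = size // 2
    # Reflect-pad so that patches near the image border keep their full size.
    padded = np.pad(stack, ((half, half), (half, half), (0, 0)), mode="reflect")
    # In padded coordinates the patch for (r, c) spans rows r..r+size-1, so the
    # labeled pixel sits at index (half, half) inside the patch.
    patches = np.stack([padded[r:r + size, c:c + size, :] for r, c in centers])
    one_hot = np.eye(n_classes, dtype=np.float32)[np.asarray(labels)]
    return patches, one_hot


def shuffle_split(n, train_frac=0.7, seed=0):
    """Shuffle sample indices and split them 70/30 into training and validation."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]
```

The resulting arrays have shapes (n, 48, 48, 10) and (n, 12), matching a channels-last CNN input and a 12-way softmax output.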

The CNN architecture has several layers, which can be mathematically described as follows:

(1): Convolution Layers:

Each convolutional layer uses a series of filters (kernels) on the input to detect features such as edges, textures, or patterns. For a convolutional layer with kernel K and 2D input X, our 2D convolution operation can be expressed as:

C_{i, j} = \sum_{m = 0}^{k_{h} - 1} \sum_{n = 0}^{k_{ω} - 1} X_{i + m, j + n} \cdot K_{m, n}

(11)

where i, j represent the spatial coordinates, and k_h and k_ω are the height and width of the kernel.

After each convolution, batch normalization is applied to stabilize and accelerate training by normalizing activations:

\hat{B N} (C_{i, j}) = \frac{C_{i, j} - μ}{\sqrt{σ^{2} + ε}}

(12)

where µ is the mean, σ is the standard deviation, and ε is a small constant for numerical stability. BatchNorm then applies a learnable transformation:

B N (C_{i, j}) = γ \hat{B N} (C_{i, j}) + β

(13)

where γ (the scale, initialized to 1) and β (the shift, initialized to 0) are trainable parameters.

(2): Activation (ReLU): After convolution, a ReLU activation function is applied to introduce non-linearity:

R e L U (B N (C_{i, j})) = \max (0, B N (C_{i, j}))

(14)

(3): Pooling Layers:

Max pooling reduces the spatial dimensions of the data by taking the maximum value within a window (e.g., 2 × 2).

P_{i, j} = \max \{C_{2i, 2j}, C_{2i+1, 2j}, C_{2i, 2j+1}, C_{2i+1, 2j+1}\}

(15)

(4): Global Average Pooling (GAP):

Instead of flattening the output of the final convolutional layer, Global Average Pooling is used to reduce the spatial dimensions to a single value per feature map:

{G A P}_{k} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} P_{i, j}^{(k)}

(16)

where H is the height and W is the width of the feature maps. Because GAP has no trainable parameters, it helps reduce overfitting while preserving important information.

(5): Dense (Fully Connected) Layers with Dropout:

The GAP output is passed through fully connected (dense) layers for classification. Each dense layer applies:

Z = W \cdot X + b

(17)

where W is the weight matrix, X is the input feature vector, and b is the bias term. Dropout is used after each dense layer to reduce overfitting by randomly deactivating neurons during training.

(6): Output Layer (Softmax):

The final layer uses a Softmax activation function to convert the output into probabilities:

S o f t m a x (z_{i}) = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}

(18)

where z_i is the input to the softmax function for class i, and the output is a probability distribution over all classes.

The model is optimized with the Adam optimizer, using categorical cross-entropy as the objective function. Categorical cross-entropy loss measures the error between the predicted probabilities and the actual one-hot encoded labels:

L = - \sum_{i} y_{i} \log ({\hat{y}}_{i})

(19)

where y_i is the one-hot target for class i (1 for the actual class, 0 otherwise) and ŷ_i is the predicted probability for class i.
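The layer operations in Equations (11) to (19) can be traced on a toy input with plain NumPy. This is a minimal, single-channel sketch for illustration only, not the trained network. It processes a batch of one, computes batch-normalization statistics over a single feature map with the common √(σ² + ε) denominator, and omits dropout because dropout is active only during training.

```python
import numpy as np


def conv2d(X, K):
    """Equation (11): 'valid' 2D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = K.shape
    out = np.empty((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)
    return out


def batch_norm(C, gamma=1.0, beta=0.0, eps=1e-5):
    """Equations (12)-(13), using the common sqrt(variance + eps) denominator."""
    return gamma * (C - C.mean()) / np.sqrt(C.var() + eps) + beta


def relu(x):
    return np.maximum(0.0, x)                      # Equation (14)


def max_pool_2x2(C):
    """Equation (15): non-overlapping 2 x 2 max pooling (an odd last row/col is dropped)."""
    h, w = C.shape[0] // 2 * 2, C.shape[1] // 2 * 2
    return C[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def softmax(z):
    """Equation (18); subtracting max(z) avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def cross_entropy(y_onehot, y_hat, eps=1e-12):
    return float(-np.sum(y_onehot * np.log(y_hat + eps)))   # Equation (19)


def forward(X, kernels, W_dense, b_dense):
    """Conv -> BN -> ReLU -> pool per kernel, then GAP (16), dense (17), softmax (18)."""
    maps = np.stack([max_pool_2x2(relu(batch_norm(conv2d(X, K)))) for K in kernels])
    features = maps.mean(axis=(1, 2))              # global average pooling, one value per map
    return softmax(W_dense @ features + b_dense)
```

For example, an 8 × 8 input with two 3 × 3 kernels gives two 6 × 6 maps, pooled to 3 × 3, averaged to a 2-element feature vector, and mapped by a 12 × 2 dense layer to 12 class probabilities that sum to 1.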

In this study, we set the learning rate η = 10⁻⁵ and the exponential decay rates β₁ = 0.9 and β₂ = 0.999.

The model is trained over 100 epochs using mini-batch gradient descent and monitored using the accuracy metric. The validation dataset is used to check that the model generalizes well.
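A single Adam update with the hyperparameters above can be written out explicitly. The following is a minimal NumPy sketch. The stability constant `eps = 1e-8` is an assumption, since frameworks use slightly different defaults.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the settings from the text (eps is an assumed default)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction, t = step count from 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step, the bias-corrected ratio m̂/√v̂ equals the sign of the gradient. As a result, each weight moves by about η = 10⁻⁵ regardless of the gradient's scale, which is one reason such a small learning rate yields slow, stable training.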

2.5.6. Semi-Supervised Method

The self-training process begins by training a supervised classifier using only the labeled dataset [22,60,61,62,63]. This trained model is then used to assign labels to the unlabeled samples. The predictions with the highest confidence are selected and added to the labeled dataset as pseudo-labeled examples, and the classifier is retrained on the original labeled data together with the newly added pseudo-labeled samples. This cycle is repeated until all unlabeled data have been processed.
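The self-training loop above can be sketched as follows. This is a minimal sketch that uses scikit-learn's `LogisticRegression` as a stand-in for the base classifier and adds a fixed number of the most confident pseudo-labels per round (`batch=100`). Both choices are assumptions for illustration; the sketch shows conventional hard-label self-training, not the SoC method described next.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(X_lab, y_lab, X_unlab, batch=100, max_rounds=50):
    """Self-training with confidence-ranked hard pseudo-labels.

    LogisticRegression is a stand-in for the base classifier; returns the final
    model and the number of unlabeled samples never pseudo-labeled.
    """
    X_train, y_train = np.asarray(X_lab, float), np.asarray(y_lab)
    pool = np.asarray(X_unlab, float)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_train, y_train)                   # (re)train on labeled + pseudo-labeled
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        top = np.argsort(proba.max(axis=1))[::-1][:batch]   # most confident first
        pseudo = clf.classes_[proba[top].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[top]])
        y_train = np.concatenate([y_train, pseudo])
        pool = np.delete(pool, top, axis=0)
    return clf, len(pool)
```

A weakness of this scheme is that an early wrong pseudo-label is never revisited. This is the failure mode that soft-label methods such as SoC aim to mitigate.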

More recently, advanced SSL methods have been proposed that estimate the likelihood that an instance belongs to each class rather than assigning it a single hard label [64,65,66]. Furthermore, an instance need not be tied to a single labeled class. A novel SSL method for fine-grained visual classification was introduced in [66]. This method, named Soft Label Selection (SoC) with Confidence-Aware Clustering, addresses fine-grained recognition tasks in which distinguishing between classes is particularly difficult because of subtle visual differences.

Conventional SSL models assign a hard pseudo-label to an unlabeled sample and add it to the training set. In contrast, SoC assigns each unlabeled sample a soft label over a subset of candidate classes. Conventional SSL typically aims to minimize prediction entropy; however, this is difficult in fine-grained visual classification, where images are often ambiguous and early pseudo-labels are noisy. The authors therefore proposed SoC with a loss function that integrates two key objectives: Expansion and Shrinkage.

The expansion objective (L_exp) encourages soft labels to include multiple potential candidate classes. Fine-grained classification tasks often involve subtle differences between classes, making it difficult to identify the ground-truth class correctly from early-stage pseudo-labels. The expansion objective aims to maximize the chance of including the correct class by assigning probabilities to several candidate classes rather than making a hard assignment.

L_{exp} = - \sum_{i = 1}^{N} \sum_{j = 1}^{K} {\tilde{y}}_{i}^{(j)} \log {\hat{y}}_{i}^{(j)}

(20)

where ỹ_i^(j) is the soft pseudo-label assigned to class j for data point i, ŷ_i^(j) is the predicted probability for class j, N is the total number of samples, and K is the number of classes.

The shrinkage objective aims to reject noisy classes that might have been incorrectly considered during the pseudo-labeling process. This is achieved by minimizing entropy, focusing the model on the most confident predictions, and gradually reducing the number of candidate classes over time.

L_{\mathrm{shrink}} = - \sum_{i = 1}^{N} \sum_{j = 1}^{K} \hat{y}_{i}^{(j)} \log \hat{y}_{i}^{(j)}

(21)

where N is the number of samples in the dataset, and K is the number of possible classes in the classification task.

The method’s total loss function combines the Expansion and Shrinkage objectives. This creates a balance where the model expands its set of possible classes while simultaneously shrinking the selection by removing noisy predictions:

L_{\mathrm{total}} = L_{\mathrm{exp}} + \alpha L_{\mathrm{shrink}}

(22)

where α is a hyperparameter controlling the trade-off between expansion and shrinkage.
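The expansion, shrinkage, and total losses can be made concrete with a small sketch on plain Python lists. Two assumptions apply:

- The `soft_label` helper, which keeps probability mass only on a candidate set and renormalizes, is a hypothetical stand-in for SoC's confidence-aware clustering, not the method of [66].
- The expansion term is written as a cross-entropy with a leading minus sign, so that both terms are non-negative quantities to be minimized.

```python
import math


def soft_label(proba, candidates):
    """Hypothetical soft pseudo-label: keep mass only on candidate classes and
    renormalize (contrast with a hard one-hot pseudo-label)."""
    kept = [p if j in candidates else 0.0 for j, p in enumerate(proba)]
    total = sum(kept)
    if total == 0:  # no mass on the candidates: fall back to uniform over them
        return [1.0 / len(candidates) if j in candidates else 0.0 for j in range(len(proba))]
    return [p / total for p in kept]


def expansion_loss(soft_labels, preds, eps=1e-12):
    """Cross-entropy between soft pseudo-labels and predictions (expansion term)."""
    return -sum(t * math.log(p + eps)
                for tl, pl in zip(soft_labels, preds)
                for t, p in zip(tl, pl))


def shrinkage_loss(preds, eps=1e-12):
    """Entropy of the predicted distributions (shrinkage term)."""
    return -sum(p * math.log(p + eps) for pl in preds for p in pl)


def total_loss(soft_labels, preds, alpha):
    """Expansion term plus alpha times the shrinkage term."""
    return expansion_loss(soft_labels, preds) + alpha * shrinkage_loss(preds)
```

A larger α penalizes uncertain, spread-out predictions more strongly, which shrinks the effective candidate set. A smaller α lets the soft labels keep several plausible classes.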

The authors also introduce Class Transition Tracking (CTT), which monitors the evolution of class predictions over time to adjust confidence in class assignments. CTT helps to fine-tune the clustering of soft labels by dynamically modifying the candidate class set based on the history of class transitions.
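The idea of tracking class transitions can be illustrated as follows. This is a simplified, hypothetical sketch and not the CTT algorithm of [66]. For each sample, it records the predicted class at successive evaluation iterations and counts switches from class a to class b. It then derives a candidate set for a class from the classes that most often absorb its predictions. The `min_share` threshold is an assumption.

```python
def transition_matrix(history, num_classes):
    """Count class switches a -> b between consecutive iterations.

    history: one list per sample, giving its predicted class at each iteration.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for preds in history:
        for a, b in zip(preds, preds[1:]):
            if a != b:
                counts[a][b] += 1
    return counts


def candidate_classes(counts, cls, min_share=0.2):
    """Candidate set for class `cls`: the class itself plus classes that receive
    at least `min_share` of its recorded outgoing transitions (hypothetical rule)."""
    out = counts[cls]
    total = sum(out)
    cands = {cls}
    if total:
        cands |= {j for j, c in enumerate(out) if c / total >= min_share}
    return cands
```

Classes that predictions frequently oscillate between, such as two crop types with similar spectra, end up in each other's candidate sets. Stable classes keep a singleton set.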

In our study, the input datasets were preprocessed as described in Section 2.5.5. We used the CNN13 architecture with 500 CTT evaluation iterations. CNN13 consists of three main feature-extraction blocks. Each block comprises multiple convolutional layers, batch normalization, Leaky Rectified Linear Unit (Leaky ReLU) activations, pooling layers, and dropout for regularization. The network begins with 128 filters in the first block and increases to 256 and 512 filters in the subsequent blocks to extract progressively higher-level features. The final layers perform global average pooling and classification through a fully connected layer with 12 outputs, one per LULC class.
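The architecture's scale can be estimated with a configuration sketch. The text fixes only the block widths (128, 256, 512 filters) and the 12 outputs. The per-block layer counts and kernel sizes below follow the CNN-13 layout commonly used in SSL benchmarks and are assumptions. The 10 input channels (one per selected feature) are also hypothetical. Batch-normalization parameters are ignored.

```python
# Hypothetical CNN-13-style layout: (kernel_size, out_channels) for each conv layer.
# Layer counts and kernel sizes are assumptions; the text fixes only the block
# widths (128, 256, 512) and the 12 output classes.
CNN13_BLOCKS = [
    [(3, 128), (3, 128), (3, 128)],  # block 1, then max-pooling and dropout
    [(3, 256), (3, 256), (3, 256)],  # block 2, then max-pooling and dropout
    [(3, 512), (1, 256), (1, 128)],  # block 3, then global average pooling
]
NUM_CLASSES = 12


def count_parameters(in_channels, blocks=CNN13_BLOCKS, num_classes=NUM_CLASSES):
    """Count convolution and fully connected weights plus biases."""
    total, channels = 0, in_channels
    for block in blocks:
        for k, out in block:
            total += k * k * channels * out + out  # k x k kernels plus one bias per filter
            channels = out
    # global average pooling leaves `channels` features for the final dense layer
    total += channels * num_classes + num_classes
    return total
```

Under these assumptions, `count_parameters(10)` gives about 3.1 million weights. The 3 × 3 convolutions in blocks 2 and 3 account for most of them, which is consistent with the computational intensity reported for this architecture.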

Table 3 summarizes the chosen hyperparameters and the tuning process for each model.

2.6. Evaluation Metrics

All LULC maps generated by the models are evaluated on the testing dataset using the following metrics [53,64,67]:

1. Accuracy: as described in Equation (5), accuracy is the percentage of all predictions that a model gets right.
2. Kappa index: Cohen’s Kappa index measures the agreement between predicted and actual labels, adjusted for the agreement expected by chance. In the case of multi-class classification, Cohen’s Kappa score takes a form similar to the Matthews Correlation Coefficient:

κ = \frac{n \sum_{i = 1}^{K} n_{ii} - \sum_{i = 1}^{K} n_{i+} n_{+i}}{\sqrt{\left(n^{2} - \sum_{i = 1}^{K} n_{i+}^{2}\right) \left(n^{2} - \sum_{i = 1}^{K} n_{+i}^{2}\right)}}

(23)

where:

- n is the total number of samples.
- n_ij is the number of samples whose true class is i but that were predicted as class j (the confusion matrix entry at row i, column j).
- n_ii is the number of samples correctly classified as class i (the diagonal elements of the confusion matrix).
- n_{i+} = \sum_{j} n_{ij} and n_{+i} = \sum_{j} n_{ji} are the row (reference) and column (predicted) totals for class i.

3. Precision: indicates how reliable a model's predictions of a class of interest are:

\mathrm{Precision}_{i} = \frac{TP}{TP + FP}

(24)

where TP (true positives) is the number of samples of the targeted class that are correctly predicted as that class, and FP (false positives) is the number of samples predicted to belong to the targeted class that do not. Multi-class precision, also known as macro-averaged precision, is the average of the per-class precisions:

\mathrm{Precision} = \frac{\sum_{i = 1}^{K} \mathrm{Precision}_{i}}{K}

(25)

4. Recall: indicates the proportion of actual positives that the model predicts correctly:

\mathrm{Recall}_{i} = \frac{TP}{TP + FN}

(26)

where FN (false negatives) is the number of samples that belong to the targeted class but are predicted not to, and TP is the number of true positives. Multi-class (macro-averaged) recall is computed by averaging the per-class recalls, following the rule given for precision in Equation (25).

5. F1 score: the harmonic mean of precision and recall. It combines both into a single measure for comparing the performance of different models:

F_{1} = 2 \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(27)
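All of these metrics can be computed from a single K × K confusion matrix, as in the minimal plain-Python sketch below. Rows are the reference classes and columns the predicted classes. The sketch returns both the standard Cohen's kappa, whose denominator is n² − Σ n_{i+} n_{+i}, and the square-root-denominator expression of Eq. (23), which is the multi-class Matthews correlation coefficient. It also follows Eq. (27) in computing F1 from the macro-averaged precision and recall. Giving a never-predicted class a precision of 0 is an assumption.

```python
import math


def metrics(cm):
    """Evaluation metrics from a K x K confusion matrix cm[i][j]
    (row i = reference class, column j = predicted class)."""
    k = len(cm)
    n = sum(map(sum, cm))
    diag = sum(cm[i][i] for i in range(k))
    row = [sum(cm[i]) for i in range(k)]                       # n_{i+}
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # n_{+j}
    chance = sum(r * c for r, c in zip(row, col))

    accuracy = diag / n
    kappa = (n * diag - chance) / (n * n - chance)             # standard Cohen's kappa
    mcc = (n * diag - chance) / math.sqrt(                     # square-root form (Eq. 23)
        (n * n - sum(r * r for r in row)) * (n * n - sum(c * c for c in col)))

    # per-class precision and recall; a class that is never predicted gets precision 0
    prec = [cm[i][i] / col[i] if col[i] else 0.0 for i in range(k)]
    rec = [cm[i][i] / row[i] if row[i] else 0.0 for i in range(k)]
    precision = sum(prec) / k                                  # macro average (Eq. 25)
    recall = sum(rec) / k
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return dict(accuracy=accuracy, kappa=kappa, mcc=mcc,
                precision=precision, recall=recall, f1=f1)
```

For two classes, the square-root expression reduces to the familiar binary MCC, (TP·TN − FP·FN) divided by the square root of the product of the four marginal totals. It differs numerically from Cohen's kappa, so reports should state which one was computed.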

Figure 2 summarizes our workflow.

3. Results

3.1. Feature Selection

The ten highest-scoring features are presented in Table A1. The high agreement between the MI-based rankings and RF Gini importance indicates that the selected features are robust across both filter-based and embedded feature-importance methods.

Using the 12 classes as a reference point is a logical approach. The results support our feature selection: the 10 features can effectively classify the 12 classes in the region at 30 m medium resolution.

The results highlight the influence of vegetation coverage, the region’s annual variation in canopy cover, and the expansion of construction in the region. Only one of the ten features, MNDWI, relates to water bodies; it corresponds to the peak of the flooding season in the VMRD. The different acquisition times of the selected indices reflect the annual landscape changes in the region, driven by the flooding season and by the growth cycles of annual crops and other cropping systems.

3.2. K-Means Clustering

As illustrated in Figure 3, the three most suitable numbers of classes fall within the range of 10 to 14, with Silhouette scores of 0.49 to 0.50. The inertia curve shows only an ambiguous elbow point, which would otherwise guide the choice of the optimal class count. Figure 3a indicates that the choice of class number makes little difference between 10 and 15 classes; therefore, 12 classes is an acceptable choice. The 12-cluster map in Figure 4 nevertheless highlights the weakness of K-means clustering and its very low accuracy, particularly where aquaculture and river classes are intermixed over large areas.
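The two criteria used to choose the number of clusters, inertia (the elbow method) and the mean Silhouette score, are illustrated in the plain-Python sketch below. It assumes Euclidean distance and scores points in singleton clusters as 0. The clustering itself would come from a K-means implementation, which is not reproduced here.

```python
import math


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b); needs at least 2 clusters."""
    labels = list(labels)
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i])
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Inertia always decreases as clusters are added, so only a visible elbow is informative. The Silhouette score can be compared directly across cluster counts, which is why near-equal scores (0.49 to 0.50) leave the choice between 10 and 14 classes open.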

After clustering the region into 12 LULC groups, the 4216 ground-truth points were overlaid on the classification map. Each cluster was then named after the ground-truth label that occurred most frequently within it. Known patterns of the LULC types in the area, obtained from reference sources, were also taken into account.
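The majority-vote naming step can be sketched as follows. The sketch covers only the vote, not the additional use of known LULC patterns from reference sources. It also reports purity, the share of ground-truth points that agree with their cluster's name, as an illustrative diagnostic of how cleanly clusters map to classes; the study does not report this quantity.

```python
from collections import Counter


def name_clusters(cluster_ids, truth_labels):
    """Name each cluster after its most frequent ground-truth label."""
    votes = {}
    for cid, lab in zip(cluster_ids, truth_labels):
        votes.setdefault(cid, Counter())[lab] += 1
    names = {cid: cnt.most_common(1)[0][0] for cid, cnt in votes.items()}
    # purity: fraction of ground-truth points that match their cluster's name
    purity = sum(cnt.most_common(1)[0][1] for cnt in votes.values()) / len(cluster_ids)
    return names, purity
```

A low purity signals the mismatch described in the text, where one ground-truth class is spread over several clusters or one cluster mixes several classes.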

However, the model performed poorly on every evaluation metric: the testing accuracy, kappa index, F₁ score, and precision were all very low, at 0.16, 0.09, 0.16, and 0.16, respectively. These values are consistent with the validation metrics.

3.3. Supervised Classifications

The cross-validation and testing evaluations of the supervised models yielded considerably better results. Table 4 shows the optimal parameter selections and the time consumed by each supervised model. SVM is inherently a binary classifier, and it took a long time (about two hours) on the multi-class problem, which makes it inefficient in terms of resource consumption. The CNN model, owing to its complex structure, took significantly longer to train than the SVM or RF models (approximately 8.5 h). However, the heaviest workload lay in map compilation because of the very large number of pixels. This suggests that the training cost of supervised models depends mainly on the number of labeled samples, since unlabeled samples are not used. It also exposes a weakness of supervised approaches: the large pool of unlabeled samples is excluded from the analysis.
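Part of SVM's cost on multi-class problems comes from decomposing the task into several binary problems. The sketch below counts the binary SVMs required by the two common decompositions for the 12 classes used here. Which decomposition the study's SVM implementation used is not stated, so this is only an illustration.

```python
def binary_svms_needed(num_classes):
    """Number of binary SVMs for one-vs-one and one-vs-rest decompositions."""
    one_vs_one = num_classes * (num_classes - 1) // 2  # one SVM per pair of classes
    one_vs_rest = num_classes                          # one SVM per class
    return one_vs_one, one_vs_rest
```

With 12 classes, one-vs-one trains 66 binary SVMs and one-vs-rest trains 12, each on a kernel matrix that grows with the number of labeled samples.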

The cross-validation accuracy results for the RF, SVM, and CNN models (Table 5) indicate that the CNN model outperformed the SVM and RF models in terms of testing accuracy. The gap between the training and testing accuracies of the SVM and RF models indicates that these machine learning models have a high learning capacity but also a high potential for overfitting. The CNN model, like other deep learning models with complex architectures, can mitigate this problem.

The final maps produced by the RF, SVM, and CNN models are shown in Figure 5. The SVM overpredicted paddy rice fields, which occupy the largest area in the region, while the CNN model appeared to predict the river correctly in certain areas. Overall, CNN was the best supervised model for LULC classification in the VMRD.

Figure 6 presents the confusion matrices of the supervised models, which show acceptable results with relatively few erroneous predictions. All models shared general issues that reflect the difficulty of distinguishing between similar LULC types:

- The sparse built-up type was often predicted as other types, mostly annual or perennial crops; this is a Type I (omission) error.
- Several sparse built-up, annual crop, and perennial crop pixels were wrongly predicted as paddy rice fields.
- There was significant confusion between annual and perennial crops, reflecting the difficulty of distinguishing these two land cover types in this region.

Figure 7 shows the loss function values (on a log scale) per epoch for the CNN model. The loss decreases and converges in both the training and testing phases over the 100 epochs, which suggests that the chosen number of epochs is suitable. Convergence by the final epoch also indicates that the model design and hyperparameter choices were appropriate.

3.4. Semi-Supervised Model

The VMRD LULC map for 2023 generated by the proposed SoC4SS-FGVC model is shown in Figure 8. Running the CNN13 architecture with 500 CTT evaluation iterations was computationally intensive; however, this structural complexity was necessary to capture high-level features from limited data. The model achieved a testing accuracy of 0.81 when the full set of 6024 labeled samples was used. This accuracy is moderate. The model had persistent difficulty in spectrally distinguishing aquaculture systems from other water bodies, a limitation also observed in the baseline deep learning model (CNN). At smaller sample sizes, however, such as 500 or 1000 labeled samples, the SoC model achieved higher accuracy than the other models (Section 3.5). This indicates the capability of the semi-supervised model when ground-truth data are very scarce.

However, relying solely on standard accuracy metrics is insufficient in this context. Given the scarcity of ground-truth data relative to the vast study area, standard metrics can be misleading. For instance, the high accuracy scores of the supervised models for minority classes, such as mangrove forests, likely indicate overfitting to the small training set rather than true generalization. To address this, we adopted a broader validation approach by cross-referencing our results with regional statistical data. As shown in Figure 9, the SoC4SS-FGVC model estimated that 61% of the total area is used for agriculture, with 45% specifically identified as paddy rice fields. These proportions closely match official agricultural statistics and local knowledge of the region. Furthermore, the SoC4SS-FGVC model estimated that mangrove forests cover about 2% of the total area, very close to the statistical figure (2.12%). This consistency suggests that, although the semi-supervised model may have lower test accuracy on the limited labeled data, it generalizes better and produces a more realistic representation of the VMRD landscape than the overfitted supervised methods.

3.5. Sensitivity of Supervised and Semi-Supervised Models to Sample Size

Through experiments with different sample sizes, we explored how sample size influences the accuracy of both supervised and semi-supervised models. RF accuracy reached 0.75, 0.81, 0.81, 0.87, and 0.87 at sample sizes of 500, 1000, 2000, 4000, and 6024, respectively. For the SVM model, the corresponding accuracies were 0.80, 0.83, 0.82, 0.87, and 0.87, showing a trend similar to RF. The CNN model displayed a different trend, with accuracies of 0.60, 0.76, 0.91, 0.93, and 0.96. In contrast, the SoC4SS-FGVC model showed a decreasing trend, with accuracies of 0.91, 0.88, 0.85, 0.84, and 0.82.
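The accuracies listed above can be tabulated directly to read off two quantities: the change from the smallest to the largest sample, and the best model at each sample size. The numbers below are copied from the preceding paragraph, and the dictionary layout is only an illustrative choice.

```python
SAMPLE_SIZES = [500, 1000, 2000, 4000, 6024]
ACCURACY = {  # testing accuracy at each sample size, as reported in the text
    "RF":          [0.75, 0.81, 0.81, 0.87, 0.87],
    "SVM":         [0.80, 0.83, 0.82, 0.87, 0.87],
    "CNN":         [0.60, 0.76, 0.91, 0.93, 0.96],
    "SoC4SS-FGVC": [0.91, 0.88, 0.85, 0.84, 0.82],
}


def change_points(acc=ACCURACY):
    """Accuracy change in percentage points from the smallest to the largest sample."""
    return {model: round(100 * (v[-1] - v[0])) for model, v in acc.items()}


def best_model_per_size(acc=ACCURACY, sizes=SAMPLE_SIZES):
    """Model with the highest accuracy at each sample size."""
    return {s: max(acc, key=lambda m: acc[m][k]) for k, s in enumerate(sizes)}
```

From 500 to 6024 samples, RF gains 12 percentage points, SVM 7, and CNN 36, while SoC4SS-FGVC loses 9. SoC4SS-FGVC is the most accurate model at 500 and 1000 samples, and CNN leads from 2000 samples onward.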

As shown in Figure 10, sample size strongly affected the CNN model, whereas the SoC model showed strong resilience, with little sensitivity to changes in sample size. The SVM model was also relatively insensitive. Increasing its sample from 500 to 6024 labeled samples raised its accuracy by only 7 percentage points (0.80 to 0.87), compared with 36 percentage points (0.60 to 0.96) for the CNN model. Notably, with a sample size of 500 or 1000, the SoC model outperformed its counterparts in accuracy. Even as the CNN, RF, and SVM models improved with additional labeled samples, the semi-supervised approach aligned more closely with real-world data than the supervised models. This is discussed further in the next section, where we compare our results with official statistical data.

The decline in the accuracy of the SoC model can be largely attributed to the addition of new labeled samples to both the training and testing datasets after each training epoch. As more samples are incorporated, the complexity and variability of the data increase, yielding a more diverse and intricate representation of the information. This greater complexity makes accurate prediction harder for the SoC model. As a result, its accuracy generally decreases as the datasets expand and become more representative of varied scenarios.

3.6. Statistical Validation

Each year, the Statistical Office publishes updated survey-based estimates, including the areas of specific land use types, especially paddy rice. According to the 2020 release, the VMRD had approximately 600,000 hectares of aquaculture, 160,000 hectares of rice–shrimp systems, and 80,000 hectares of mangrove forests. We also referred to the 2005 official Land Use Maps of the VMRD from the Ministry of Natural Resources and Environment. To align these data with our study, we merged the different legends of the land-use maps into our 12 LULC classes.

Other sources of statistical validation come from previous studies. Table A2 and Table A3 present the area of each LULC type estimated by each of our methods, together with the corresponding official data for 2020 and 2005. Shaded cells indicate the predicted area closest to the official statistic for each LULC type. According to these sources, small areas of the rice–shrimp system are scattered across parts of Soc Trang, Bac Lieu, and Kien Giang provinces. The aquaculture system is the most common LULC type in the lower part of the VMRD, while paddy rice cultivation dominates the upper part.
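The area comparison behind the shaded cells can be sketched in two steps. First, convert per-class pixel counts of a 30 m map to hectares (one pixel is 30 m × 30 m = 900 m² = 0.09 ha). Then pick the model whose estimate has the smallest absolute difference from the official figure. The function names are illustrative, and the study may have used another closeness criterion, such as relative difference.

```python
PIXEL_AREA_HA = 30 * 30 / 10_000  # one 30 m pixel = 900 m^2 = 0.09 ha


def class_area_ha(pixel_counts):
    """Convert per-class pixel counts of a 30 m LULC map to hectares."""
    return {cls: n * PIXEL_AREA_HA for cls, n in pixel_counts.items()}


def closest_model(estimates_ha, official_ha):
    """Model whose area estimate is closest (absolute difference) to the official figure."""
    return min(estimates_ha, key=lambda m: abs(estimates_ha[m] - official_ha))
```

For example, roughly 888,900 mangrove pixels would correspond to the official 80,000 ha.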

As seen in the results presented in Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 of the previous sections, the supervised and unsupervised models tended to overpredict annual crops, and there was considerable ambiguity among the forest classes. Moreover, the models underestimated perennial crops or failed to distinguish them from productive forests with similar reflectance signals. SoC4SS-FGVC produced values closer to the official statistics, indicating the benefit of SSL models. In general, distinguishing between classes at the lower (finer) levels of the LULC hierarchy poses significant challenges. It is easier to separate water bodies from land surfaces than to differentiate between areas with different levels of canopy cover. In addition, land cover in the delta is highly homogeneous, which is an important obstacle in practical classification tasks.

Mangrove forests and coastal areas occupy only a small proportion of the region. They are also difficult to detect because they resemble other forest classes and other water body classes, respectively. The SoC model's estimate of mangrove forest area is close to the official data. Furthermore, SoC showed a much stronger ability to detect paddy rice than the other models. This implies the potential of semi-supervised models for LULC classification in study areas like the VMRD. The region is characterized by large areas of paddy rice, together with rice–shrimp systems and other special land cover types, such as mangrove forests and aquaculture land, that are common targets of LULC classification studies.

4. Discussion

4.1. Feature Selection and Its Implications

The selection of features for LULC classification, or more generally the input dataset used for modeling, is a critical issue. As partly discussed in the Introduction, most current LULC classification methods rely on satellite indices rather than single bands alone, which reduces input data dimensionality. Many of these indices are chosen for their ability to capture surface information effectively, as partially illustrated in Table 2.

The most commonly used indices are those related to canopy cover reflectance, especially NDVI. Nearly all LULC studies used NDVI in their classification to distinguish between forest, cropland, water bodies, built-up areas, and bare soil. Given the seasonal changes in canopy coverage of various crops, several studies have used a time series of NDVI, combined with other indices, to predict LULC, as referenced in, e.g., [8,34,68,69]. Typically, the selection of features was based on their ability to identify water bodies, built-up areas, croplands, or forests. As indicated in Section 2.4.1, the set usually included NDVI, NDBI, and NDWI, along with chlorophyll indices [4,7,11,70]. However, a limitation of these studies is the absence of temporal signatures in the input datasets, mainly due to limited resources.

Our proposed approach aims to overcome resource limitations by combining a new index, a time-series-like platform, and a feature selection procedure that reduces dimensionality and enhances model performance. As mentioned earlier, the VMRD is a large, mostly flat delta covered by paddy rice fields, other annual crops, perennial crops, melaleuca forests, and mangrove forests. The region experiences seasonal flooding every year and has a very dense network of rivers and channels.

The feature selection results supported our choices by highlighting the importance of greenness indices at seasonal transition points. These include GCVI at the late stage of the dry season and NDVI at the early stage of the dry season, when paddy rice fields are clearly observable. The high intensity of tillage in the region was also reflected in the feature selection, through the high scores of NDTI at the end of the dry season and the early stage of the flood season, which marks the beginning of the rice season in the region. The high importance of NDTI therefore suggests intensive tillage and high residue cover at that time. This finding is consistent with other studies on the ability of NDTI [71].
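The indices discussed here can be computed per date and stacked into a multi-temporal feature vector. The sketch below uses the standard textbook definitions, for example NDWI after McFeeters and MNDWI after Xu. The study's exact formulas are those given in Table 2 and may differ in detail. The band arguments (surface reflectance in the green, red, NIR, SWIR1, and SWIR2 ranges, i.e., Landsat 8 OLI bands 3 to 7) and the date-tag naming scheme are illustrative assumptions.

```python
def _nd(a, b):
    """Normalized difference (a - b) / (a + b)."""
    return (a - b) / (a + b)


def indices(green, red, nir, swir1, swir2):
    """Common spectral indices from surface reflectance (standard textbook forms)."""
    return {
        "NDVI": _nd(nir, red),       # vegetation greenness
        "GCVI": nir / green - 1,     # green chlorophyll vegetation index
        "NDBI": _nd(swir1, nir),     # built-up areas
        "NDWI": _nd(green, nir),     # open water (McFeeters)
        "MNDWI": _nd(green, swir1),  # modified NDWI (Xu)
        "NDTI": _nd(swir1, swir2),   # tillage and crop residue
    }


def stack_time_series(scenes):
    """Flatten per-date index dicts into one feature dict, e.g. 'NDVI_dry_early'."""
    features = {}
    for date_tag, bands in scenes.items():
        for name, value in indices(**bands).items():
            features[f"{name}_{date_tag}"] = value
    return features
```

The stacked features (one entry per index and date) would form the candidate pool from which a feature selection step, such as the MI ranking, keeps the highest-scoring ones.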

4.2. Limitations of Unsupervised and Supervised Models in LULC Classification

Some studies have reported success with unsupervised classification, for which K-means is one of the most popular methods. For instance, Fatchurrachman et al. (2020) proposed a method to map rice fields in Malaysia from 24 monthly NDVI and VH polarization time series using K-means clustering. The method achieved 95.95% accuracy in distinguishing paddy rice fields from other fields [69]. Similarly, Cai et al. (2019) used the same types of NDVI and Sentinel-1 data to map paddy rice fields in a small wetland area in China with the Unsupervised Random Forest (URF) method, achieving an accuracy of 0.95 [68]. Other unsupervised approaches, such as deep clustering with convolutional autoencoders [72], have also shown promise in feature learning and pattern recognition tasks. However, it is not certain that such high accuracy can be achieved in multi-class classification. In a multi-class setting [14], accuracy fell significantly, revealing the limitations of unsupervised models, even those with complex architectures.

The limitations of unsupervised models, especially K-means, become evident during the labeling process. K-means relies on spectral similarity and other statistical properties to separate data points. These properties may not align with actual class boundaries, so different land cover types are often mixed. For example, K-means frequently misclassifies flowing streams as static water bodies, particularly in aquaculture zones, failing to capture the thematic distinctions in LULC definitions. In practice, it can be difficult to assign class labels to clusters based on the ground-truth pixel distribution: a single ground-truth class is often spread across multiple clusters, and a single cluster may include pixels from several classes. This mismatch reflects a gap between real-world land cover patterns and the criteria used by clustering algorithms. It is influenced by factors such as feature selection, which determines the dimensions used during clustering. The dependence on spectral similarity underscores the need for supervised models, which can capture thematic differences more effectively through labeled training data.

RF and SVM exhibited signs of overfitting, as evidenced by the discrepancies between their cross-validation and testing accuracies. These gaps (approximately 7–8%) indicate that the RF and SVM models learned the labeled training data well but failed to generalize effectively to unseen data, often misclassifying certain land cover types. Such overfitting compromises the reliability of model predictions in real-world applications, especially in diverse land cover scenarios.

In contrast, the CNN outperformed the RF and SVM models, primarily because of its ability to extract spatial and contextual features. By exploiting local spatial patterns and relationships within the data, CNNs can identify complex structures and variations in land cover that RF and SVM may overlook, particularly in densely mixed pixel environments. This enables CNNs to better distinguish between closely related classes and to provide a more detailed classification than RF and SVM, which rely more heavily on feature importance and hyperplane separation, respectively.

Evaluation metrics and pattern analysis consistently show that supervised models outperform unsupervised ones. For instance, comparing the RF and CNN models with the leading unsupervised method, K-means clustering, shows that K-means struggles to differentiate between flowing streams and static water bodies, especially within aquaculture zones. Although not definitively confirmed, this suggests that the unsupervised model tends to overpredict water bodies. In contrast, the supervised models tend to underpredict water bodies, often favoring forested areas.

The confusion matrices in Figure 6 indicate that paddy rice is mainly associated with Type I errors (false negatives, i.e., omission) in the RF model, but is prone to Type II errors (false positives, i.e., commission) in the SVM model. In other words, RF tends to misclassify paddy rice pixels as other classes, whereas SVM tends to label pixels of other classes as paddy rice. This highlights the limitations of using a hyperplane-based approach to separate complex land cover types. The CNN model achieved the highest accuracy, with only a few mislabeled samples.

Several studies have compared supervised classification methods to identify the best one. For instance, Talukdar et al. (2020) used several machine learning methods, including RF, SVM, artificial neural network (ANN), fuzzy adaptive resonance theory-supervised predictive mapping (Fuzzy ARTMAP), spectral angle mapper (SAM), and Mahalanobis distance. They applied these methods to Landsat 8-derived satellite indices and other variables to classify six LULC types in three different Ganga River landscapes [4]. In that study, RF was the most effective method. Basheer et al. (2022) found that the SVM classifier outperformed the others when applied to different datasets for LULC mapping in Charlottetown, Canada, between 2017 and 2021 [5]. Phan and Kappas (2014) evaluated the efficacy of RF, k-nearest neighbor (kNN), and SVM in a 30 km × 30 km area of the Red River Delta, northern Vietnam [6]. They tested six LULC types and 14 different training sample sizes, using all single bands of Sentinel-2 imagery as independent variables, and concluded that SVM achieved the highest accuracy. Their highest accuracy was about 0.95, which is quite close to our result. However, these successes were achieved in small study areas, with a limited number of classes and limited labeled sampling.

Owing to their high resource consumption, deep learning models are rarely applied to LULC classification. Another reason is that deep learning models, such as CNNs, are better suited to object-detection tasks than to conventional LULC classification tasks [19]. Consequently, the input data were generally resampled to a higher resolution and divided into square patches, and the CNN model labeled each patch. This approach produced output much coarser than the original imagery resolution. Kussul et al. (2017) applied 1-D and 2-D CNN models to separate several crop types in Kyiv, Ukraine, and compared the results with those of the RF model [8]. Overall, the 2-D CNN model achieved the highest accuracy of 0.97, while the RF model reached 0.89. With a stride of 1, a CNN produces output at the same resolution as the input imagery. Our study applied this approach, so the output LULC map retained its 30 m resolution. Considering the results of other studies, we conclude that the architecture of a deep learning model should be chosen carefully, depending on the size of the problem and the available computing resources. This is the primary constraint on applying deep learning models to LULC classification.
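The stride-1 scheme that keeps the output at the input resolution can be illustrated with a minimal sketch. It extracts the k × k patch centred on every pixel and assigns that pixel the label predicted for its patch, so the output grid has exactly the input's dimensions. The odd patch size, the edge-replication padding, and the `classify_patch` callback are illustrative assumptions, not the study's configuration.

```python
def pixel_patches(image, k):
    """Yield the k x k patch centred on every pixel (stride 1, edge replication).

    image: 2-D list of pixel values (or feature vectors); k must be odd.
    """
    h, w = len(image), len(image[0])
    r = k // 2
    for i in range(h):
        for j in range(w):
            yield [[image[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                    for dj in range(-r, r + 1)]
                   for di in range(-r, r + 1)]


def classify_map(image, k, classify_patch):
    """Apply a patch classifier at every pixel; the output grid matches the input."""
    w = len(image[0])
    labels = [classify_patch(p) for p in pixel_patches(image, k)]
    return [labels[row * w:(row + 1) * w] for row in range(len(image))]
```

A patch-based approach with a stride equal to the patch size would instead return one label per k × k block, which is the coarser output described above. The stride-1 version costs one classification per pixel, which is why map compilation dominates the runtime for large regions.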

4.3. Suitability of Semi-Supervised Approaches in Land Cover Mapping

While many studies have discussed the limitations of supervised models, the most prominent concern is the limited availability of labeled training data. The small sample size relative to the study area raises questions about the reliability of even high testing accuracy scores. Moreover, several authors have noted that out-of-class data can dominate the classification process, leading to skewed results [10,22,60,62,73]. As a result, SSL methods are increasingly viewed as a promising alternative for land cover classification [64,66]. However, the application of SSL and deep learning models is constrained by computational demands, technical complexity, and resource consumption (Table 6). Our observations suggest that only a limited number of studies in this domain have achieved credible accuracy evaluations [12,15,17]. The semi-supervised model (SoC4SS-FGVC) does not achieve the highest test accuracy, but it provides a more realistic representation of LULC by effectively using unlabeled data. By leveraging large volumes of unlabeled samples, the model captures broader patterns and trends that may be absent from a smaller labeled dataset. This yields a more comprehensive understanding of land cover distribution, which is particularly advantageous in data-scarce environments where labeled data are limited.

The experiments with varying sample sizes demonstrate that the semi-supervised model is robust to the scarcity of ground-truth data. Notably, when ground-truth data are extremely limited (500 or 1000 labeled samples), the semi-supervised model outperforms the supervised models, highlighting its strength in data-scarce environments. Additionally, increasing the amount of ground-truth data appears to have limited impact on the semi-supervised model’s accuracy, unlike the supervised models, which underscores its stability. There is room to further improve the performance of semi-supervised models and to deepen the analysis of accuracy metrics. Even so, the current findings support the model’s advantages under challenging conditions.

One of the primary challenges in assessing the quality and applicability of LULC classification is the issue of pure and mixed pixels, which depends on spatial resolution. High-resolution sensors, such as Sentinel-2 and PlanetScope, tend to reduce spectral mixing and enhance class separability. In contrast, medium-resolution imagery often suffers from mixed pixels that blur boundaries and reduce classification reliability [74,75]. Nonetheless, empirical research suggests that increasing spatial resolution does not always improve classification accuracy. For example, Hsieh et al. (2001) developed a simulation framework to systematically examine the effects of various parameters on classification performance [76]. Their results showed that classification errors first decrease and then increase as the ratio of ground sampling distance to field width decreases. Thus, finer spatial resolution does not inherently improve classification accuracy, owing to boundary effects and within-class variability.

Furthermore, traditional accuracy evaluation metrics such as Overall Accuracy (OA), Producer’s Accuracy (PA), and User’s Accuracy (UA) have limitations, including the potential to obscure uncertainty and misrepresent datasets with class imbalance. Consequently, emerging approaches incorporate probabilistic and uncertainty-aware evaluation methods. These include soft classification accuracy [77], information-theoretic measures such as cross-entropy [78], and metrics that assess quantity and allocation disagreement, explicitly differentiating between systematic and random errors [79]. Such frameworks provide a more comprehensive account of classification uncertainty and are more consistent with modern per-pixel probability outputs derived from machine learning and deep learning classifiers.
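As an example of these alternatives, quantity and allocation disagreement [79] can be computed from the same confusion matrix as the conventional metrics. The sketch below assumes the Pontius and Millones formulation (rows are reference classes, columns are map classes), in which the two components sum to the overall disagreement, 1 − OA.

```python
def disagreement(cm):
    """Quantity and allocation disagreement from a count confusion matrix
    cm[i][j] (row = reference class, column = map class)."""
    n = sum(map(sum, cm))
    k = len(cm)
    p = [[v / n for v in row] for row in cm]                      # proportions
    ref = [sum(p[g]) for g in range(k)]                           # reference totals
    mapped = [sum(p[i][g] for i in range(k)) for g in range(k)]   # map totals
    # quantity: mismatch in class proportions, independent of location
    quantity = sum(abs(mapped[g] - ref[g]) for g in range(k)) / 2
    # allocation: mismatch in spatial placement given the proportions
    allocation = sum(2 * min(mapped[g] - p[g][g], ref[g] - p[g][g]) for g in range(k)) / 2
    return quantity, allocation
```

A model that reproduces class totals well, as in the statistical comparisons reported here, has low quantity disagreement even when its pixel-level allocation errors remain substantial.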

To obtain a more reliable evaluation, we do not rely solely on the limited labeled data. We also validate our results against statistical data derived from field survey reports by local government staff. This type of comparison has been applied in only a few studies, with limited results. First, up-to-date official data are difficult to obtain for many study regions. Second, tabular statistics are difficult to match with polygon-based map data, so statistical data cannot be used directly as a validation dataset. However, official figures for the total area of an LULC type in a region can be compared with the areas retrieved by different methods. This approach does not yield a quantitative evaluation metric, but it provides an intuitive assessment of model performance, particularly with respect to overfitting.

The comparisons in Tables A2 and A3 indicate that the area of rice land (including one-season rice areas) has remained stable over the years, at approximately 1.7 million hectares. Among the models, the SoC model produces the closest estimate. In our view, the quality of the ground-truth data and the selection of input features matter more for classification success than the choice of model itself.
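The area comparison reduces to simple arithmetic once each class's pixel count is known. The sketch below assumes 30 m pixels (0.09 ha each); this is an assumption based on the Landsat-style band numbering in Table 2 and on the model-derived areas in Tables A2 and A3 appearing to be multiples of 0.09 ha. Function names are illustrative. As a check, 17,469,844 pixels at 0.09 ha reproduce the SVM paddy rice area in Table A2 (1,572,285.96 ha, 39.73% of 43,967,308 mapped pixels).

```python
PIXEL_SIZE_M = 30.0  # assumed ground sampling distance
PIXEL_AREA_HA = PIXEL_SIZE_M ** 2 / 10_000  # 1 ha = 10,000 m^2, so 0.09 ha


def class_area_table(pixel_counts, pixel_area_ha=PIXEL_AREA_HA):
    """Convert per-class pixel counts to (area in ha, percentage of mapped area)."""
    areas = {name: count * pixel_area_ha for name, count in pixel_counts.items()}
    total = sum(areas.values())
    return {name: (area, 100.0 * area / total) for name, area in areas.items()}


def relative_difference(model_area_ha, reference_area_ha):
    """Signed relative difference of a mapped area from an official statistic."""
    return (model_area_ha - reference_area_ha) / reference_area_ha


if __name__ == "__main__":
    table = class_area_table({"Paddy rice": 17_469_844, "Other": 26_497_464})
    area_ha, share_pct = table["Paddy rice"]
    print(f"Paddy rice: {area_ha:,.2f} ha ({share_pct:.2f}%)")
```

Because official totals and mapped totals cover slightly different extents (Tables A2 and A3 report different total areas), comparing percentages as well as absolute hectares helps separate extent effects from classification effects.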

Our framework also highlights the utility of satellite-derived indices, which play a role similar to that of principal component analysis (PCA) components: they reduce input dimensionality and enhance model performance in large-scale data environments.

One key limitation of our study is the sample size. The number of ground-truth samples is disproportionately small relative to the study area, which constrains the robustness of our results. Although SSL models can leverage unlabeled data, their effectiveness is still dependent on the quality and representativeness of the labeled subset. In our view, the sampling strategy and feature selection are more decisive than the model architecture for achieving reliable LULC classification.

Another limitation is the spatial resolution of the resulting LULC map. As discussed above, finer resolution generally yields richer information, although not necessarily higher accuracy. For many LULC types and applications, however, fine spatial resolution is not required, and its computational requirements are very high.

5. Conclusions

This study proposes a future-oriented framework for LULC classification that combines remote sensing resources with deep learning models. Our results emphasize key limitations of purely unsupervised and supervised learning methods in LULC applications, especially when labeled data are scarce. These limitations highlight the need for alternative modeling strategies that can operate effectively in data-sparse environments.

Our comparative analysis across sample sizes ranging from 500 to 6024 ground-truth samples reveals important insights into model performance and scalability. Supervised models, including RF, SVM, and CNN, performed strongly when sufficient labeled data were available, with CNN achieving the highest accuracy. We recommend CNN as a practical and efficient solution for LULC classification when large numbers of ground-truth samples are available, owing to its favorable balance of accuracy and computational efficiency. At smaller sample sizes, however, these supervised methods showed significant limitations, with accuracy declining sharply.
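Reduced training sets for a sample-size comparison like this can be drawn so that each class keeps its share of the full dataset. The subset-sampling procedure is not restated in this section, so the class-proportional (stratified) scheme below is an assumption; the function name and seed are illustrative. Class counts are the totals from Table 1.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n, seed=0):
    """Return sorted indices of a class-proportional subsample of size n.

    Each class receives floor(n * share) slots; leftover slots go to the
    classes with the largest fractional remainders.
    """
    if not 0 < n <= len(labels):
        raise ValueError("n must be between 1 and len(labels)")
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    quotas = {c: n * len(idx) / len(labels) for c, idx in by_class.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1

    rng = random.Random(seed)
    chosen = []
    for c, idx in by_class.items():
        chosen.extend(rng.sample(idx, alloc[c]))
    return sorted(chosen)


# Total ground-truth samples per class (Table 1).
CLASS_COUNTS = {
    "Paddy rice": 1103, "Rice–shrimp system": 507, "Annual crops": 646,
    "Perennial crops": 500, "Productive forest": 548, "Protective forest": 414,
    "Mangrove forest": 453, "Aquaculture": 518, "Coastal wetland": 244,
    "Sparse built-up area": 230, "Dense built-up area": 313,
    "Rivers, streams, etc.": 548,
}
LABELS = [c for c, k in CLASS_COUNTS.items() for _ in range(k)]

if __name__ == "__main__":
    for size in (500, 1000, 2000, 4000):
        print(size, len(stratified_subsample(LABELS, size)))
```

Keeping class shares fixed means that differences in accuracy across sample sizes reflect the amount of labeled data rather than a shifting class balance.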

In contrast, the semi-supervised approach (SoC4SS-FGVC) performed consistently across all sample sizes (accuracy of 0.82–0.92), highlighting its strength in data-scarce scenarios. This stability results from the model’s effective use of abundant unlabeled satellite data alongside limited labeled samples. Comparison with official statistics suggested that the semi-supervised model delineates paddy rice fields more effectively and is less prone to overfitting. SSL models are therefore best suited to large-scale remote sensing tasks in which labeled data are scarce and unlabeled data are plentiful. Their main drawback is a substantial demand for computational resources and time, comparable to or greater than that of deep learning methods such as CNN (Table 6). Nevertheless, SSL models offer a promising way to overcome the limitations of supervised models by reducing the time and expense of ground-truth data collection. The semi-supervised framework presented here therefore offers a practical and cost-effective solution for land-cover mapping, particularly in regions such as the VMRD, where extensive ground-truth data collection is often prohibitively expensive or logistically challenging.
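The general way SSL methods exploit unlabeled pixels can be illustrated with confidence-thresholded pseudo-labeling: predictions on unlabeled samples are reused as training targets only when the model is confident enough. This is a generic sketch, not SoC4SS-FGVC's own soft-label selection scheme; the 0.95 threshold and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax for a single list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def select_pseudo_labels(unlabeled_logits, threshold=0.95):
    """Keep (index, class) pairs whose top predicted probability >= threshold.

    Unconfident samples are skipped in this round; in an iterative scheme
    they can be revisited after the model is retrained on the enlarged set.
    """
    selected = []
    for i, logits in enumerate(unlabeled_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


if __name__ == "__main__":
    # Hypothetical model outputs for three unlabeled pixels and three classes.
    print(select_pseudo_labels([[5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 6.0]]))
```

The threshold trades coverage for label quality: a lower value adds more unlabeled pixels per round but admits more wrong pseudo-labels.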

Results from the most reliable models (SoC4SS-FGVC, CNN, RF, and SVM) reveal the continued dominance of paddy rice cultivation in the study region, despite recent declines in its net area. Concurrently, rice–shrimp systems, annual crops, aquaculture zones, wetlands, and mangrove forests have expanded notably. These shifts are largely driven by seawater intrusion, evolving agricultural policies, and market dynamics, which increasingly favor high-value commodities such as shrimp, melon, and pineapple over traditional rice cultivation.

Author Contributions

Conceptualization, H.-A.B., C.-H.H., Y.-Y.C. and Y.-A.L.; methodology, H.-A.B., C.-H.H., H.-W.V.Y. and Y.-A.L.; software, H.-A.B. and C.-H.H.; validation, H.-A.B. and C.-H.H.; formal analysis, H.-A.B., C.-H.H. and Y.-A.L.; investigation, H.-A.B. and Y.-A.L.; resources, Y.-A.L.; data curation, H.-A.B.; writing—original draft preparation, H.-A.B., C.-H.H. and Y.-A.L.; writing—review and editing, H.-A.B., C.-H.H., H.-W.V.Y., Y.-Y.C. and Y.-A.L.; visualization, H.-A.B.; supervision, Y.-A.L.; project administration, Y.-A.L.; funding acquisition, Y.-A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council of Taiwan (Grant No. 114-2111-M-008-026) and the Taiwan and Global Drought Investigation and Research Center (TGDIRC) at National Central University. Additional support was provided by the Ministry of Education in Taiwan, as part of the collaborative “Joint Research Project between the University of Illinois System and the University Academic Alliance in Taiwan.” The author H.-A.B. receives a TIGP scholarship from Academia Sinica.

Data Availability Statement

All processed outputs and the source code developed for this study are openly available in a public GitHub repository (https://github.com/buihaian1403/Land-use-land-cover-classification, accessed on 24 March 2026). The repository is released under the MIT open-source license, permitting reuse and modification with proper attribution.

Acknowledgments

The authors sincerely thank the editors and anonymous reviewers for providing constructive comments to improve the original manuscript. The authors gratefully acknowledge the support provided by the Taiwan and Global Drought Investigation and Research Center (TGDIRC/Drought Hub) at National Central University.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

VSA	Vietnam Statistical Administration
VMRD	The Vietnam Mekong River Delta
LULC	Land use land cover
SSL	Semi-supervised learning
URF	Unsupervised random forest

Appendix A

Table A1. Feature importance scores (random forest Gini importance and mutual information).

No	Feature Name	Gini Importance	Mutual Information
1	NDVI in December 2022 (NDVI_Dec)	0.082625	0.680490
2	MNDWI in July 2022 (MNDWI_Jul)	0.062459	0.642543
3	NDVI in July 2022 (NDVI_Jul)	0.055597	0.661184
4	GCVI in May 2023 (GCVI_May)	0.055361	0.688441
5	NDBI in May 2023 (NDBI_May)	0.051718	0.477795
6	NDBI in December 2022 (NDBI_Dec)	0.050296	0.553992
7	GCVI in March 2023 (GCVI_Mar)	0.046039	0.697758
8	GCVI in December 2022 (GCVI_Dec)	0.044268	0.710642
9	NDBI in March 2023 (NDBI_Mar)	0.043294	0.466922
10	NDTI in March 2023 (NDTI_Mar)	0.042161	0.726247
11	NDTI in July 2022 (NDTI_Jul)	0.041319	0.621134
12	MNDWI in March 2023 (MNDWI_Mar)	0.040202	0.583532
13	NDTI in December 2022 (NDTI_Dec)	0.039911	0.667353
14	NDLI in July 2022 (NDLI_Jul)	0.039722	0.489303
15	NDVI in March 2023 (NDVI_Mar)	0.039327	0.648327
16	GCVI in July 2022 (GCVI_Jul)	0.038708	0.623524
17	NDLI in December 2022 (NDLI_Dec)	0.035886	0.584635
18	NDBI in March 2023 (NDBI_Mar)	0.035736	0.426260
19	MNDWI in May 2023 (MNDWI_May)	0.035037	0.565789
20	MNDWI in December 2022 (MNDWI_Dec)	0.034100	0.682353
21	Digital elevation model (DEM)	0.029817	0.326861
22	NDVI in May 2023 (NDVI_May)	0.029074	0.543718
23	NDLI in May 2023 (NDLI_May)	0.026552	0.501958
24	NDLI in March 2023 (NDLI_Mar)	0.000551	0.433832
25	NDTI in May 2023 (NDTI_May)	0.000239	0.695665
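Mutual information between a feature and the class label can be estimated in several ways. As a minimal sketch, the functions below discretize a feature into equal-width bins and compute the plug-in estimate in nats. The binning approach is an assumption for illustration, since the estimator used for Table A1 is not given in this appendix (scikit-learn's `mutual_info_classif`, for example, uses a nearest-neighbor estimator for continuous features), so these values will not reproduce the table. The Gini importance column would instead come from a fitted random forest, such as scikit-learn's `feature_importances_` attribute. Function names and example values are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=10):
    """Discretize a continuous feature into n_bins equal-width bins (0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def discrete_mutual_information(x_bins, labels):
    """Plug-in estimate of I(X; Y) in nats from paired discrete observations."""
    n = len(labels)
    joint = Counter(zip(x_bins, labels))
    px = Counter(x_bins)
    py = Counter(labels)
    mi = 0.0
    for (xv, yv), count in joint.items():
        pxy = count / n
        mi += pxy * math.log(pxy / ((px[xv] / n) * (py[yv] / n)))
    return mi


if __name__ == "__main__":
    ndvi_dec = [0.10, 0.15, 0.62, 0.70, 0.12, 0.66]  # hypothetical NDVI_Dec values
    classes = ["water", "water", "rice", "rice", "water", "rice"]
    print(discrete_mutual_information(equal_width_bins(ndvi_dec, 4), classes))
```

With 12 classes, mutual information is bounded above by the label entropy (at most ln 12, about 2.48 nats), which gives a scale for reading the values in Table A1.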

Table A2. Areas of LULC types mapped for 2023 by the SVM, RF, and CNN models, compared with the 2020 statistical data provided by the VSA. The shaded cells highlight the LULC types for which the model areas are closest to the statistical data.

No	LULC Type	SVM Area (ha)	SVM (%)	RF Area (ha)	RF (%)	CNN Area (ha)	CNN (%)	VSA 2020 Area (ha)	VSA 2020 (%)
1	Paddy rice	1,572,285.96	39.73	707,251.05	17.87	1,043,381.88	26.16	1,790,574.00	42.96
2	Rice–shrimp	216,701.46	5.48	442,399.05	11.18	365,161.77	9.16	-	-
3	Annual crops	682,218.27	17.24	558,114.12	14.10	325,514.88	8.16	115,159.00	2.76
4	Perennial crops	321,299.91	8.12	294,378.39	7.44	338,649.57	8.49	669,585.00	16.07
5	Productive forest	117,016.47	2.96	318,940.29	8.06	304,295.49	7.63	130,333.00	3.13
6	Protective forest	46,231.29	1.17	168,813.90	4.27	82,459.26	2.07	88,350.00	2.12
7	Mangrove forest	184,561.56	4.66	198,634.23	5.02	46,397.34	1.16	76,131.00	1.83
8	Aquaculture	385,400.79	9.74	465,475.14	11.76	464,171.49	11.64	509,034.00	12.21
9	Coastal wetland	78,362.19	1.98	278,883.45	7.05	418,679.73	10.50	79,211.00	1.90
10	Sparse built-up area	192,636.00	4.87	311,363.10	7.87	229,054.77	5.74	458,964.00	11.01
11	Dense built-up area	94,185.90	2.38	116,986.86	2.96	163,089.45	4.09	-	-
12	Rivers	66,157.92	1.67	95,818.14	2.42	207,052.29	5.19	250,528.00	6.01
	Total	3,957,057.72	100.00	3,957,057.72	100.00	3,987,907.92	100.00	4,167,869.00	100.00

Table A3. Areas of LULC types mapped for 2023 by the K-means and SoC models, compared with the 2005 statistical data provided by the VSA. The shaded cells highlight the LULC types for which the model areas are closest to the 2005 statistical data.

No	LULC Type	K-Means Area (ha)	K-Means (%)	SoC Area (ha)	SoC (%)	VSA 2005 Area (ha)	VSA 2005 (%)
1	Paddy rice	612,935.82	15.49	1,474,693.47	35.74	1,742,307.00	42.73
2	Rice–shrimp	248,068.89	6.27	233,059.41	5.65	158,683.00	3.89
3	Annual crops	320,849.10	8.11	777,327.84	18.84	143,832.00	3.53
4	Perennial crops	442,477.17	11.18	203,071.23	4.92	534,641.00	13.11
5	Productive forest	146,033.01	3.69	58,877.73	1.43	210,691.00	5.17
6	Protective forest	296,375.49	7.49	267,472.80	6.48	88,782.00	2.18
7	Mangrove forest	199,880.64	5.05	82,174.95	1.99	55,556.00	1.36
8	Aquaculture	494,889.48	12.51	646,280.28	15.66	501,538.00	12.30
9	Coastal wetland	196,308.90	4.96	123,163.74	2.99	29,218.00	0.72
10	Sparse built-up area	292,199.58	7.38	114,701.13	2.78	392,644.00	9.63
11	Dense built-up area	304,174.53	7.69	31,726.71	0.77	-	-
12	Rivers	402,865.11	10.18	113,430.06	2.75	220,039.00	5.40
	Total	3,957,057.72	100.00	4,125,979.35	100.00	4,077,931.00	100.00

References

1. Sudhakar, S.; Rao, K. Land use and Land cover Analysis. In Remote Sensing Applications, 2nd ed.; Roy, P.S., Dwivedi, R.S., Vijayan, D., Eds.; National Remote Sensing Center: Hyderabad, India, 2010; pp. 21–48.
2. Nguyen, K.A.; Liou, Y.A. Global mapping of eco-environmental vulnerability from human and nature disturbances. Sci. Total Environ. 2019, 664, 995–1004.
3. Talukdar, S.; Pal, S. Wetland habitat vulnerability of lower Punarbhaba river basin of the uplifted Barind region of Indo-Bangladesh. Geocarto Int. 2020, 35, 857–886.
4. Talukdar, S.; Singha, P.; Mahato, S.; Shahfahad; Pal, S.; Liou, Y.A.; Rahman, A. Land-Use Land-Cover Classification by Machine Learning Classifiers for Satellite Observations—A Review. Remote Sens. 2020, 12, 1135.
5. Basheer, S.; Wang, X.Q.; Farooque, A.A.; Nawaz, R.A.; Liu, K.; Adekanmbi, T.; Liu, S.Q. Comparison of Land Use Land Cover Classifiers Using Different Satellite Imagery and Machine Learning Techniques. Remote Sens. 2022, 14, 4978.
6. Phan, T.N.; Kappas, M. Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification using Sentinel-2 imagery. Sensors 2018, 18, 18.
7. Campos-Taberner, M.; Garcia-Haro, F.J.; Martinez, B.; Izquierdo-Verdiguier, E.; Atzberger, C.; Camps-Valls, G.; Gilabert, M.A. Understanding deep learning in land use classification based on Sentinel-2 time series. Sci. Rep. 2020, 10, 17188.
8. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782.
9. van Strien, M.J.; Gret-Regamey, A. Unsupervised deep learning of landscape typologies from remote sensing images and other continuous spatial data. Environ. Model. Softw. 2022, 155, 105462.
10. Dewangkoro, H.I.; Arymurthy, A.M. Land Use and Land Cover Classification Using CNN, SVM, and Channel Squeeze & Spatial Excitation Block. IOP Conf. Ser. Earth Environ. Sci. 2021, 704, 012048.
11. Szabo, S.; Gacsi, Z.; Bertalan-Balazs, B. Specific features of NDVI, NDWI, and MNDWI as reflected in land cover categories. Landsc. Environ. 2016, 10, 194–202.
12. Hosseiny, B.; Abdi, A.M.; Jamali, S. Urban land use and land cover classification with interpretable machine learning—A case study using Sentinel-2 and auxiliary data. Remote Sens. Appl. Soc. Environ. 2022, 28, 100843.
13. Chen, Y.Y.; Huang, W.; Wang, W.H.; Juang, J.Y.; Hong, J.S.; Kato, T.; Luyssaert, S. Reconstructing Taiwan’s land cover changes between 1904 and 2015 from historical maps and satellite images. Sci. Rep. 2019, 9, 3643.
14. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2017, arXiv:1610.02242.
15. Mahlayeye, M.; Darvishzadeh, R.; Nelson, A. Cropping Patterns of Annual Crops: A Remote Sensing Review. Remote Sens. 2022, 14, 2404.
16. Zhu, X.X.; Tuia, D.; Mou, L.C.; Xia, G.S.; Zhang, L.P.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
17. Vali, A.; Comai, S.; Matteucci, M. Deep learning for Land Use and Land Cover classification based on hyperspectral and multispectral Earth Observation Data: A review. Remote Sens. 2020, 12, 2495.
18. Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the Machine Learning Research; Meila, M., Zhang, T., Eds.; PMLR: Birmingham, UK, 2021; Volume 139, pp. 10096–10106.
19. Feng, J.F.; Luo, H.X.; Gu, Z.J. Improving semi-supervised remote sensing scene classification via Multilevel Feature Fusion and pseudo-labeling. Int. J. Appl. Earth Obs. Geoinf. 2024, 136, 104335.
20. Viana, C.M.; Oliveira, S.; Oliveira, S.C.; Rocha, J. Land Use/Land Cover Change Detection and Urban Sprawl Analysis. In Spatial Modeling in GIS and R for Earth and Environmental Sciences; Pourghasemi, H.R., Gokceoglu, C., Eds.; Elsevier: Amsterdam, The Netherlands, 2019; pp. 621–651.
21. Phan, D.C.; Trung, T.H.; Truong, V.T.; Sasagawa, T.; Vu, T.P.T.; Bui, D.T.; Hayashi, M.; Tadono, T.; Nasahara, K.N. First comprehensive quantification of annual land use/cover from 1990 to 2020 across mainland Vietnam. Sci. Rep. 2021, 11, 9979.
22. Su, J.C.; Cheng, Z.Z.; Maji, S. A Realistic Evaluation of Semi-Supervised Learning for Fine-Grained Classification. arXiv 2021.
23. Vu, H.T.D.; Tran, D.D.; Schenk, A.; Nguyen, C.P.; Vu, H.L.; Oberle, P.; Trinh, V.C.; Nestmann, F. Land use change in the Vietnamese Mekong Delta: New evidence from remote sensing. Sci. Total Environ. 2022, 813, 151918.
24. Tran, H.; Tran, T.; Kervyn, M. Dynamics of Land Cover/Land Use Changes in the Mekong Delta, 1973–2011: A Remote Sensing Analysis of the Tran Van Thoi District, Ca Mau Province, Vietnam. Remote Sens. 2015, 7, 2899–2925.
25. Ngo, D.K.; Lechner, A.M.; Vu, T.T. Land cover mapping of the Mekong Delta to support natural resource management with multi-temporal Sentinel-1A synthetic aperture radar imagery. Remote Sens. Appl. Soc. Environ. 2020, 17, 100272.
26. Phan, M.H.; Stive, M.J.F. Managing mangroves and coastal land cover in the Mekong Delta. Ocean Coast. Manag. 2022, 219, 106013.
27. Social Impact Inc.; USAID Learns. Climate Change Mitigation in the Mekong River Delta Assessment; Report for USAID; Social Impact Inc.: Arlington, VA, USA, 2023.
28. Bui, H.A.; Tran, M.T.; Nguyen, V.B. Identify limiting factors of soil fertility in Mekong river delta rice cultivable soils. In National Second Conference on Crop Sciences; The Gioi Publisher: Hanoi, Vietnam, 2016; pp. 1138–1143.
29. Connor, M.; Le, A.T.; De Guia, A.H.; Wehmeyer, H. Sustainable rice production in the Mekong River Delta: Factors influencing farmers’ adoption of the integrated technology package “One Must Do, Five Reductions” (1M5R). Outlook Agric. 2021, 50, 90–104.
30. Vo, T.X. Rice Cultivation in the Mekong Delta-Present Situation and Potentials for Increased Production. Southeast Asian Stud. 1975, 13, 88–111.
31. Stanimirova, R.; Tarrio, K.; Turlej, K.; McAvoy, K.; Stonebrook, S.; Hu, K.T.; Arévalo, P.; Bullock, E.L.; Zhang, Y.T.; Woodcock, C.E.; et al. A global land cover training dataset from 1984 to 2020. Sci. Data 2023, 10, 879.
32. Tenneson, K. Training Data Collecting Using Google Earth Engine. Available online: https://www.openmrv.org/w/modules/mrv/modules_1/training-data-collection-using-google-earth-engine (accessed on 20 July 2024).
33. ESRI. ArcGIS Desktop Documentation. 2021. Available online: https://desktop.arcgis.com/en/arcmap/latest/extensions/spatial-analyst/image-classification/creating-training-samples.htm (accessed on 25 July 2024).
34. Sakamoto, T.; Van, P.C.; Kotera, A.; Duy, K.N.; Yokozawa, M. Detection of Yearly Change in Farming Systems in the Vietnamese Mekong Delta from MODIS Time-Series Imagery. Jpn. Agric. Res. Q. 2009, 43, 173–185.
35. Minh, H.V.T.; Avtar, R.; Mohan, G.; Misra, P.; Kurasaki, M. Monitoring and Mapping of Rice Cropping Pattern in Flooding Area in the Vietnamese Mekong Delta Using Sentinel-1A Data: A Case of An Giang Province. ISPRS Int. J. Geo-Inf. 2019, 8, 211.
36. Liu, S.A.; Li, X.; Chen, D.; Duan, Y.Q.; Ji, H.N.; Zhang, L.P.; Chai, Q.; Hu, X.D. Understanding Land use/Land cover dynamics and impacts of human activities in the Mekong Delta over the last 40 years. Glob. Ecol. Conserv. 2020, 22, e00991.
37. Farr, T.G.; Rosen, P.A.; Caro, E.; Crippen, R.; Duren, R.; Hensley, S.; Kobrick, M.; Paller, M.; Rodriguez, E.; Roth, L.; et al. The Shuttle Radar Topography Mission. Rev. Geophys. 2007, 45, RG2004.
38. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring vegetation systems in the Great Plains with ERTS. In Third Earth Resources Technology Satellite-1 Symposium; NASA SP-351; NASA: Washington, DC, USA, 1974; Volume 1, pp. 309–317.
39. Zha, Y.; Gao, J.; Ni, S. Use of normalized difference built-up index in automatically mapping urban areas from TM imagery. Int. J. Remote Sens. 2003, 24, 583–594.
40. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033.
41. Gitelson, A.A.; Viña, A.; Arkebauer, T.J.; Rundquist, D.C.; Keydan, G.; Leavitt, B. Remote estimation of leaf area index and green leaf biomass in maize canopies. Geophys. Res. Lett. 2003, 30, 1248.
42. van Deventer, P.; Ward, D.; Gowda, P.H.M.; Lyon, J.G. Using thematic mapper data to identify contrasting soil plains and tillage practices. Photogram. Eng. Remote Sens. 1997, 63, 87–93.
43. Liou, Y.A.; Le, M.S.; Chien, H. Normalized Difference Latent Heat Index for Remote Sensing of Land Surface Energy Fluxes. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1423–1433.
44. Liou, Y.A.; Thai, M.T. Surface Water Availability and Temperature (SWAT): An innovative index for remote sensing of drought observation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4507312.
45. Thai, M.T.; Liou, Y.A. Advancements in the Temperature-Soil Moisture Dryness Index (TMDI) for Drought Monitoring in Southwestern Taiwan. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4405415.
46. Thai, M.T.; Liou, Y.A. Surface Water Availability-Temperature Index (SWATI) for Global Drought Monitoring. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4403714.
47. Liou, Y.A.; Thai, M.T. Relative Surface Evapotranspiration Index (RSETI): A Novel Approach for Drought Characterization in Australia. Remote Sens. Environ. 2025, 329, 114948.
48. Liou, Y.A.; Tran, D.P.; Melkamu, T. Overcoming cloud-induced gaps in surface urban heat island intensity monitoring using annual temperature cycle modeling across Taiwanese cities. Urban Clim. 2026, 65, 102748.
49. Nguyen, K.A.; Thai, M.T.; Melkamu, T.; Le, T.V.; Liou, Y.A. Quantifying the cooling intensity of urban green spaces (UGSs) on land surface temperature (LST) in Hanoi metropolitan Area, Vietnam. City Environ. Interact. 2025, 28, 100264.
50. Vergara, J.R.; Estevez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2013, 24, 175–186.
51. Molinski, S. Data Science: Unsupervised Classification of Satellite Images with K-Means Algorithm. ML&GIS Service. Available online: https://ml-gis-service.com/index.php/2020/10/14/data-science-unsupervised-classification-of-satellite-images-with-k-means-algorithm/ (accessed on 1 August 2024).
52. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
53. Yilmaz, A.E.; Demirhan, H. Weighted kappa measures for ordinal multi-class classification performance. Appl. Soft Comput. 2023, 134, 110020.
54. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
55. Breiman, L. Manual On Setting Up, Using, And Understanding Random Forests. V3.1; Statistics Department, University of California Berkeley: Berkeley, CA, USA, 2002.
56. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
57. Zhang, L.P.; Zhang, L.F.; Du, B. Deep learning for remote sensing data—A technical tutorial on the state of art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–44.
58. Ma, L.; Liu, Y.; Zhang, X.L.; Ye, Y.X.; Yin, G.F.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
59. Karra, K.; Kontgis, C.; Statman-Weil, Z.; Mazzariello, J.C.; Mathis, M.; Brumby, S.P. Global land use/land cover with Sentinel 2 and deep learning. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: Piscataway, NJ, USA, 2021; pp. 4704–4707.
60. Zhu, X.J. Semi-Supervised Learning Literature Survey; TR 1530; Computer Sciences, University of Wisconsin: Madison, WI, USA, 2008.
61. Reddy, Y.C.A.P.; Viswanath, P.; Reddy, B.E. Semi-supervised learning: A brief review. Int. J. Eng. Technol. 2018, 7, 81–95.
62. Rizve, M.N.; Kardan, N.; Shah, M. Towards Realistic Semi-Supervised Learning. arXiv 2022.
63. van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440.
64. Lucas, B.; Pelletier, C.; Schmidt, D.; Webb, G.I.; Petitjean, F. A Bayesian-inspired, deep learning-based, semi-supervised domain adaptation technique for land cover mapping. Mach. Learn. 2023, 112, 1941–1973.
65. Yang, X.L.; Song, Z.X.; King, I.; Xu, Z.L. A Survey on Deep Semi-Supervised Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 8934–8954.
66. Duan, Y.; Zhao, Z.; Qi, L.; Zhou, L.P.; Wang, L.; Shi, Y.H. Roll with the Punches: Expansion and shrinkage of soft label selection for semi-supervised fine grained learning. In Proceedings of the 38th AAAI Conference on Artificial Intelligence and 36th Conference on Innovative Applications of Artificial Intelligence and 14th Symposium on Educational Advances in Artificial Intelligence; Wooldridge, M., Dy, J., Natarajan, S., Eds.; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 11829–11837.
67. Grandini, M.; Bagli, E.; Visani, G. Metrics for multi-class classification: An overview. arXiv 2020.
68. Cai, Y.T.; Lin, H.; Zhang, M. Mapping paddy rice by the object-based random forest method using time series Sentinel-1/Sentinel-2 data. Adv. Space Res. 2019, 64, 2233–2244.
69. Fatchurrachman; Rudiyanto; Soh, N.C.; Shah, R.M.; Giap, S.G.E.; Setiawan, B.I.; Minasny, B. High-Resolution Mapping of Paddy Rice Extent and Growth Stages across Peninsular Malaysia Using a Fusion of Sentinel-1 and 2 Time Series Data in Google Earth Engine. Remote Sens. 2022, 14, 1875.
70. dela Torre, D.M.G.; Gao, J.; Macinnis-Ng, C.; Shi, Y. Phenology-based delineation of irrigated and rain-fed paddy fields with Sentinel-2 imagery in Google Earth Engine. Geo-Spat. Inf. Sci. 2021, 24, 695–710.
71. Beeson, P.C.; Daughtry, C.S.T.; Wallander, S.A. Estimates of Conservation Tillage Practices Using Landsat Archive. Remote Sens. 2020, 12, 2665.
72. Guo, X.F.; Liu, X.W.; Zhu, E.; Yin, J.P. Deep Clustering with Convolutional Autoencoders. In Neural Information Processing, ICONIP; Liu, D.R., Xie, S.L., Li, Y.Q., Zhao, D.B., El-Alfy, E.S.M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10635.
73. Tanha, J.; van Someren, M.; Afsarmanesh, H. Semi-supervised self-training for decision tree classifiers. Int. J. Mach. Learn. Cyber. 2017, 8, 355–370.
74. Ye, N.; Morgenroth, J.; Xu, C.; Chen, N. Indigenous Forest classification in New Zealand—A comparison of classifiers and sensors. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102395.
75. Zhao, Y.Q.; Liu, D.S. A robust and adaptive spatial-spectral fusion model for PlanetScope and Sentinel-2 imagery. GIScience Remote Sens. 2022, 59, 520–546.
76. Hsieh, P.F.; Lee, L.C.; Chen, N.Y. Effect of spatial resolution on classification errors of pure and mixed pixels in remote sensing. IEEE Trans. Geosci. Remote Sens. 2001, 39, 2657–2663.
77. Gu, J.; Congalton, R.G.; Pan, Y. The Impact of Positional Errors on Soft Classification Accuracy Assessment: A Simulation Analysis. Remote Sens. 2015, 7, 579–599.
78. Sun, P.; Lu, Y.; Zhai, J. Mapping land cover using a developed U-Net model with weighted cross entropy. Geocarto Int. 2022, 37, 9355–9368.
79. Pontius, R.G., Jr.; Santacruz, A. Quantity, exchange, and shift components of difference in a square contingency table. Int. J. Remote Sens. 2014, 35, 7543–7554.

Figure 1. Location of the VMRD and representative satellite indices.

Figure 2. General methodological flowchart of the proposed research framework.

Figure 3. Silhouette scores (a) and inertia (b) of k-means clustering for different numbers of classes.

Figure 4. LULC map in VMRD for 2023, generated by k-means clustering.

Figure 5. LULC maps generated by the RF (a), SVM (b), and CNN (c) models.

Figure 6. Testing confusion matrices of RF (a), SVM (b), and CNN (c). The numbers on the two axes represent the LULC classes in the order listed in Table 1.

Figure 7. Training and validation loss by epoch for the CNN model.

Figure 8. VMRD LULC map in 2023 generated by SoC4SS-FGVC model.

Figure 9. Proportions of LULC types derived from the SoC4SS-FGVC model.

Figure 10. Accuracy of the supervised and semi-supervised models under different ground-truth sample sizes. The horizontal axis shows the number of ground-truth samples, and the vertical axis shows the accuracy achieved. Blue squares, brown crosses, yellow circles, and magenta asterisks denote the RF, SVM, CNN, and SoC4SS-FGVC models, respectively, each evaluated with 500, 1000, 2000, 4000, and 6024 samples.

Table 1. Number of ground-truth data samples by class.

No.	LULC Type	Total Samples	Training Samples	Testing Samples
1	Paddy rice	1103	775	328
2	Rice–shrimp system	507	350	157
3	Annual crops	646	457	189
4	Perennial crops	500	335	165
5	Productive forest	548	361	187
6	Protective forest	414	300	114
7	Mangrove forest	453	323	130
8	Aquaculture	518	358	160
9	Coastal wetland	244	174	70
10	Sparse built-up area	230	159	71
11	Dense built-up area	313	225	88
12	Rivers, streams, etc.	548	399	149
	Total	6024	4216	1808

Table 2. List of features used in the study.

No	Feature Name	Feature Abbreviation	Equation	Description	Reference
1	Green Chlorophyll Vegetation Index	GCVI	$\mathrm{GCVI} = \frac{\rho_{5} - \rho_{3}}{\rho_{3}}$	Indicates green leaf biomass under different land cover states.	[41]
2	Modified Normalized Difference Water Index	MNDWI	$\mathrm{MNDWI} = \frac{\rho_{3} - \rho_{6}}{\rho_{3} + \rho_{6}}$	Highlights open water surfaces under different land cover states.	[40]
3	Normalized Difference Built-up Index	NDBI	$\mathrm{NDBI} = \frac{\rho_{6} - \rho_{5}}{\rho_{6} + \rho_{5}}$	Identifies urban and built-up areas.	[39]
4	Normalized Difference Latent Heat Index	NDLI	$\mathrm{NDLI} = \frac{\rho_{3} - \rho_{4}}{\rho_{3} + \rho_{4} + \rho_{6}}$	Reflects latent heat flux from the surface under different land cover states.	[43]
5	Normalized Difference Tillage Index	NDTI	$\mathrm{NDTI} = \frac{\rho_{6} - \rho_{7}}{\rho_{6} + \rho_{7}}$	Reflects tillage intensity under different land cover states.	[42]
6	Normalized Difference Vegetation Index	NDVI	$\mathrm{NDVI} = \frac{\rho_{5} - \rho_{4}}{\rho_{5} + \rho_{4}}$	Indicates canopy cover under different land cover states.	[38]
7	Digital Elevation Model	DEM	SRTM image mosaic data	Elevation information (in meters).	[37]
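The Table 2 formulas translate directly into code. The sketch below assumes Landsat 8/9 OLI band numbering (ρ3 green, ρ4 red, ρ5 near-infrared, ρ6 SWIR-1, ρ7 SWIR-2), which is consistent with the formulas but is an assumption about the sensor; inputs are surface reflectances. The functions work element-wise on NumPy arrays as well as on floats. Function names, the month keys, and the example reflectances are illustrative; the feature names follow the `INDEX_Month` pattern of Table A1, giving six indices for four dates plus the DEM, i.e., 25 features.

```python
def gcvi(green, nir):
    """Green Chlorophyll Vegetation Index: (rho5 - rho3) / rho3."""
    return (nir - green) / green


def mndwi(green, swir1):
    """Modified Normalized Difference Water Index: (rho3 - rho6) / (rho3 + rho6)."""
    return (green - swir1) / (green + swir1)


def ndbi(nir, swir1):
    """Normalized Difference Built-up Index: (rho6 - rho5) / (rho6 + rho5)."""
    return (swir1 - nir) / (swir1 + nir)


def ndli(green, red, swir1):
    """Normalized Difference Latent Heat Index: (rho3 - rho4) / (rho3 + rho4 + rho6)."""
    return (green - red) / (green + red + swir1)


def ndti(swir1, swir2):
    """Normalized Difference Tillage Index: (rho6 - rho7) / (rho6 + rho7)."""
    return (swir1 - swir2) / (swir1 + swir2)


def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (rho5 - rho4) / (rho5 + rho4)."""
    return (nir - red) / (nir + red)


def index_stack(b3, b4, b5, b6, b7):
    """Return the six Table 2 indices for one date, keyed by abbreviation."""
    return {
        "GCVI": gcvi(b3, b5),
        "MNDWI": mndwi(b3, b6),
        "NDBI": ndbi(b5, b6),
        "NDLI": ndli(b3, b4, b6),
        "NDTI": ndti(b6, b7),
        "NDVI": ndvi(b4, b5),
    }


def multi_temporal_features(bands_by_month, dem):
    """Stack per-date indices into one feature dict named like Table A1."""
    features = {}
    for month, bands in bands_by_month.items():  # bands = (b3, b4, b5, b6, b7)
        for name, value in index_stack(*bands).items():
            features[f"{name}_{month}"] = value
    features["DEM"] = dem
    return features


if __name__ == "__main__":
    vegetated = (0.08, 0.05, 0.40, 0.20, 0.10)  # hypothetical reflectances
    feats = multi_temporal_features(
        {m: vegetated for m in ("Jul", "Dec", "Mar", "May")}, dem=1.5
    )
    print(len(feats), round(feats["NDVI_Dec"], 3))
```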

Table 3. Hyperparameter settings and tuning strategies for the implemented machine learning models.

Model	Tuning Method	Tuning Package	Tuning Parameter	Search Space
K-means	Fixed at 12 classes		Silhouette score	-
SVM	Loss function: accuracy	Hyperopt	k-fold	5
			C	log[10⁻⁴, 10⁴]
			γ	log[10⁻⁴, 10⁴]
RF	Loss function: accuracy	Hyperopt	k-fold	5
			Max_depth	[1, 20]
			N_estimators	[50, 500]
CNN	Optimizer: Adam; loss function: cross-entropy	TensorFlow	Epochs	100
			Learning rate	10⁻⁵
			Decay rate	Default
SoC-FGVC	Optimizer: SGD	TensorFlow	Epochs	500
			Batch size	32
			Learning rate	0.01
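The search spaces in Table 3 can be made concrete with a small sampler. The study used Hyperopt; the sketch below substitutes plain random search (not Hyperopt's search algorithms such as TPE) so that it stays dependency-free, and leaves the objective as a user-supplied function, for instance 1 minus the mean 5-fold cross-validation accuracy of an SVM or RF built with the sampled parameters. All names are illustrative.

```python
import math
import random


def sample_svm_params(rng):
    """Draw C and gamma log-uniformly from [1e-4, 1e4] (Table 3 search space)."""
    return {"C": 10 ** rng.uniform(-4, 4), "gamma": 10 ** rng.uniform(-4, 4)}


def sample_rf_params(rng):
    """Draw integer max_depth in [1, 20] and n_estimators in [50, 500]."""
    return {"max_depth": rng.randint(1, 20), "n_estimators": rng.randint(50, 500)}


def random_search(objective, sampler, n_trials=50, seed=0):
    """Return (best_params, best_loss) after n_trials random draws.

    `objective` maps a parameter dict to a loss to minimize, e.g.
    1 - mean 5-fold cross-validation accuracy of a model with those params.
    """
    rng = random.Random(seed)
    best_params, best_loss = None, math.inf
    for _ in range(n_trials):
        params = sampler(rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


if __name__ == "__main__":
    def toy_loss(params):
        # Hypothetical stand-in for 1 - CV accuracy; minimized at C=10, gamma=1.
        return abs(math.log10(params["C"]) - 1) + abs(math.log10(params["gamma"]))

    print(random_search(toy_loss, sample_svm_params, n_trials=200))
```

Sampling C and γ on a log scale spreads trials evenly across eight orders of magnitude, which is why the table specifies log[10⁻⁴, 10⁴] rather than a linear range.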

Table 4. Optimized hyperparameters of the supervised models.

Model	Parameter	Optimized Value
RF	Max_depth	20
RF	N_estimators	420
SVM	C	11.95
SVM	γ	0.78
CNN	Epoch	100
CNN	Decay rate	0.0001

Table 5. Training and testing evaluation metrics for the supervised methods.

No	Metric	RF	SVM	CNN
1	Cross-validation accuracy	0.95	0.97	0.96
2	Testing accuracy	0.87	0.87	0.97
3	Testing kappa index	0.86	0.85	0.96
4	Testing precision	0.85	0.86	0.97
5	Testing F₁ score	0.85	0.85	0.97
6	Testing recall	0.84	0.83	0.97
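The testing metrics in Table 5 can all be derived from a confusion matrix. The sketch below assumes that rows are reference classes and columns are predictions, and it uses macro averaging for precision, recall, and F1; the table does not state the averaging scheme, so weighted averaging is an equally plausible reading. The function name is illustrative.

```python
def classification_metrics(confusion):
    """Overall accuracy, Cohen's kappa, and macro-averaged precision, recall, F1
    from a square confusion matrix (rows = reference, columns = predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_tot = [sum(confusion[i]) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    oa = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement expected from the marginal totals.
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (oa - pe) / (1 - pe)

    precision, recall, f1 = [], [], []
    for i in range(k):
        tp = confusion[i][i]
        p = tp / col_tot[i] if col_tot[i] else 0.0
        r = tp / row_tot[i] if row_tot[i] else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)

    return {
        "accuracy": oa,
        "kappa": kappa,
        "precision": sum(precision) / k,
        "recall": sum(recall) / k,
        "f1": sum(f1) / k,
    }


if __name__ == "__main__":
    print(classification_metrics([[40, 10], [5, 45]]))  # hypothetical counts
```

Kappa discounts the agreement expected by chance from the class totals, which is why it sits slightly below overall accuracy for every model in Table 5.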

Table 6. Training cost of each model.

Model	Training Time, 500 Samples (h)	Training Time, 6024 Samples (h)	Memory Usage (GB)
RF	0.52	0.41	18
SVM	0.64	1.93	69
CNN	8.11	8.32	25
SoC4SS-FGVC	10.16	15.92	34

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bui, H.-A.; Hsu, C.-H.; Young, H.-W.V.; Chen, Y.-Y.; Liou, Y.-A. Advanced Semi-Supervised Learning for Remote Sensing-Based Land Cover Classification in the Mekong River Delta, Vietnam. Remote Sens. 2026, 18, 989. https://doi.org/10.3390/rs18070989

AMA Style

Bui H-A, Hsu C-H, Young H-WV, Chen Y-Y, Liou Y-A. Advanced Semi-Supervised Learning for Remote Sensing-Based Land Cover Classification in the Mekong River Delta, Vietnam. Remote Sensing. 2026; 18(7):989. https://doi.org/10.3390/rs18070989

Chicago/Turabian Style

Bui, Hai-An, Chih-Hua Hsu, Hsu-Wen Vincent Young, Yi-Ying Chen, and Yuei-An Liou. 2026. "Advanced Semi-Supervised Learning for Remote Sensing-Based Land Cover Classification in the Mekong River Delta, Vietnam" Remote Sensing 18, no. 7: 989. https://doi.org/10.3390/rs18070989

APA Style

Bui, H.-A., Hsu, C.-H., Young, H.-W. V., Chen, Y.-Y., & Liou, Y.-A. (2026). Advanced Semi-Supervised Learning for Remote Sensing-Based Land Cover Classification in the Mekong River Delta, Vietnam. Remote Sensing, 18(7), 989. https://doi.org/10.3390/rs18070989

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
