Using Google Earth Engine to Assess the Current State of Thermokarst Terrain on Arga Island (the Lena Delta)

: The mapping of thermokarst landscapes and the assessment of their conditions are becoming increasingly important in light of a rising global temperature. Land cover maps provide a basis for quantifying changes in landscapes and identifying areas that are vulnerable to permafrost degradation. The study is devoted to assessing the current state of thermokarst terrain on Arga Island. We applied a random forests algorithm using the capabilities of the Google Earth Engine cloud platform for the supervised classification of the composite image. The analyzed composite consists of a Sentinel-2 image and a set of calculated indices. The study found that thermokarst-affected terrains occupy 35% of the total area, and stable terrains cover 29% at the time of image acquisition. The classifier has also mapped water bodies, slopes, and blowouts. The accuracy assessment revealed that the overall accuracy for all the different land cover classes was 98.34%. A set of other accuracy metrics also demonstrated a high level of performance. This study presents significant findings for assessing landscape changes in a region with unique environmental features. It also provides a potential basis for future interdisciplinary research and for predicting future thermokarst landscape changes in the Lena Delta area.


Introduction
The observations of atmospheric and terrestrial conditions indicate a rising global temperature [1].Furthermore, the 10 warmest years recorded have all occurred in the last decade (2014-2023) [1].At the same time, the average temperature of the Earth's surface in the polar regions of the Northern Hemisphere is increasing at approximately twice the rate [2].Rapid climate change has led to significant changes in the Arctic ecosystems and landscapes [3][4][5].Specifically, surface deformation rates are enhanced [6,7], active layer thickness increases at annual time scales [8], the mean coastal erosion rate grows [9], methane emissions are enhanced [10], periglacial geohazard risks increase [11,12], etc.
Permafrost degradation processes are a significant result of climate change.These processes include thermokarst, thermal erosion, thermal abrasion, and thermal denudation.In some instances, landforms that have been created by one process may subsequently undergo modification due to other processes.For example, the hollows created by the thermokarst process may be subsequently modified by thermal erosion, resulting in the formation of gullies and valleys.Because of this, we use the term "thermokarst" in a broad sense to characterize the landforms caused by the surface changes from permafrost degradation, according to [13].Thermokarst can damage infrastructure and impact arctic hydrology and ecology.Due to this, thermokarst landscape mapping on a local and global scale to forecast future topography changes has become an increasingly important task [14][15][16][17][18][19].
However, permafrost degradation is a complex process that has several effects on landscapes.There are several models and classification systems that describe the relationships Earth 2024, 5 229 between permafrost degradation and landscape features.One of the first classifications was developed by Czudek, T. and Demek, J. [20].Jorgenson [13] recognized 23 thermokarst and other thaw-related landforms based on differences in landform morphology, volume of ground ice, as well as mass and heat transfer processes.All classifications differentiate between degradation stages and landforms based on various factors and characteristics [20][21][22][23].Nevertheless, Arctic lowland landscapes can be conventionally divided into two main categories.The first one is stable surfaces of undegraded or already degraded permafrost ground.The second category includes permafrost ground surfaces that continuously degrade at the observation moment.Permafrost degradation has a direct impact on the successional patterns of ecosystem development, specifically on soils, vegetation, and hydrology [13,[24][25][26][27]. Since the surfaces of different landscapes have various spectral characteristics in remote sensing data, it provides the opportunity to map thermokarst landscapes using remote sensing techniques [16][17][18][19]28].
In recent years, the appearance of new remote sensing (RS) data, the advancement of analysis methods, and the development of new tools and software have significantly simplified thermokarst landscape mapping [19,29,30].For example, the Landsat and Sentinel satellite missions have accumulated a vast amount of imagery that is available to scientists [31].The launch of other new satellite missions has led to improvements in the spectral, spatial, and temporal resolution of satellite imagery.The use of machine learning (ML) algorithms for mapping has become increasingly significant and has led to enhanced accuracy in the mapping process.[19, [29][30][31].These techniques are based on learning from data and making decisions with minimal human input.This enables a significant improvement in the quality of the mapping.It is also worth noting the significant progress in land-use/land-cover mapping of Arctic landscapes [32][33][34].For example, Desjardins et al. utilized machine learning techniques to produce a highly accurate land cover map of the immediate area surrounding the Canadian Forces Station Alert [35].The use of cloud-based technologies and ML techniques has made it possible to detect changes in the landscape over the last few decades.This is because these technologies help analyze large volumes of data and provide the opportunity for cloud computing [36].The launch of the Google Earth Engine (GEE) cloud platform has significantly simplified the analysis of RS data [37][38][39][40].This is a cloud-based platform that provides researchers with access to and the ability to analyze extensive geospatial data for environmental monitoring, land use change assessment, and a variety of other Earth science applications.GEE combines a vast amount of RS data with the capabilities of modern data analysis techniques.
In this research, we focused on investigating the landscape of Arga Island within the Lena Delta.This island is mainly occupied by thermokarst terrain.The unique geographical location of this island, combined with its specific geological and geomorphological features and permafrost conditions, makes it a region of significant interest.
The purpose of this study is to evaluate the current state of the thermokarst terrain on Arga Island using the Google Earth Engine platform.This will help to determine the accuracy of ML techniques in GEE in classifying thermokarst terrain.The study includes the following steps.The first step was the preparation of data, which involved the filtering of satellite imagery and the creation of indices.The second step was the identification of land cover (LC) classes and the manual mapping of their training polygons.This identification is based on a visual analysis of RS data and a literature review.This step also includes the determination of the spectral signature of pixels within each identified LC class.The third step involves the supervised classification of RS data using the capabilities of the Google Earth Engine cloud platform.Finally, we analyzed the classification results to estimate the current state of the thermokarst landscapes on Arga Island.This analysis included an assessment of the accuracy (through quantitative measures and a visual inspection) and an estimation of the spatial distribution of LC classes in the study area.

Study Area
The total area of the Lena Delta (Figure 1) is approximately 29,278 km 2 [41].The average annual discharge volume is approximately 513 km 3 [42].The Lena Delta is composed of three terraces that differ from each other in terms of their geomorphic environment, sediment composition, and age [43,44].The youngest one is the first terrace.It comprises Holocene-age organic-rich sand or silty sand peats and mainly covers the eastern half of the delta [45,46].The second terrace, which is located in the western part of the delta, is represented by Arga Island.This terrace is composed of MIS 3-1 fluvial sediments with a fine-grained texture and no silt, clay, or organic matter [47].The third terrace consists of Late Pleistocene-Holocene deposits [48].It is characterized by the presence of different geological units of various origins and ages, such as Early Weichselian fluvial sands, Middle-Late Weichselian Ice complex deposits (Edoma), and Holocene peaty and sandy silts with plant remains [44,48].In addition, some islands in the delta represent erosional remnants of pre-Quaternary age bedrock.The active tectonic processes have played a significant role in shaping the Lena Delta and have been influenced by the topography of its islands [44,[49][50][51][52]. Due to this, different islands within the Lena Delta have elevation differences of up to 60 m [49].Thus, modern neotectonic processes have an impact on permafrost degradation, and they are one of the factors that cause differences in the geomorphic conditions of various islands.

Study Area
The total area of the Lena Delta (Figure 1) is approximately 29,278 km 2 [41].The average annual discharge volume is approximately 513 km 3 [42].The Lena Delta is composed of three terraces that differ from each other in terms of their geomorphic environment, sediment composition, and age [43,44].The youngest one is the first terrace.It comprises Holocene-age organic-rich sand or silty sand peats and mainly covers the eastern half of the delta [45,46].The second terrace, which is located in the western part of the delta, is represented by Arga Island.This terrace is composed of MIS 3-1 fluvial sediments with a fine-grained texture and no silt, clay, or organic matter [47].The third terrace consists of Late Pleistocene-Holocene deposits [48].It is characterized by the presence of different geological units of various origins and ages, such as Early Weichselian fluvial sands, Middle-Late Weichselian Ice complex deposits (Edoma), and Holocene peaty and sandy silts with plant remains [44,48].In addition, some islands in the delta represent erosional remnants of pre-Quaternary age bedrock.The active tectonic processes have played a significant role in shaping the Lena Delta and have been influenced by the topography of its islands [44,[49][50][51][52]. Due to this, different islands within the Lena Delta have elevation differences of up to 60 m [49].Thus, modern neotectonic processes have an impact on permafrost degradation, and they are one of the factors that cause differences in the geomorphic conditions of various islands.The Arga Island (Figure 1) is one of the main geomorphological elements of the Lena Delta.The island has a height of 10-30 m a.s.l.The surface of the island is mainly flat, complicated by thermokarst landforms, such as thermokarst depressions, gullies, and erosional valleys.The polygonal microrelief is not as pronounced as on other terraces [43].Numerous thermokarst lakes, with a sub-meridional orientation of their long axis, are located in depressions [43,53].The soils in the area primarily consist of tundra-gleys and peaty-gleys (histosols and inceptisols) [54].The dominant vegetation is moss-grass low- The Arga Island (Figure 1) is one of the main geomorphological elements of the Lena Delta.The island has a height of 10-30 m a.s.l.The surface of the island is mainly flat, complicated by thermokarst landforms, such as thermokarst depressions, gullies, and erosional valleys.The polygonal microrelief is not as pronounced as on other terraces [43].Numerous thermokarst lakes, with a sub-meridional orientation of their long axis, are located in depressions [43,53].The soils in the area primarily consist of tundra-gleys and peaty-gleys (histosols and inceptisols) [54].The dominant vegetation is moss-grass low-shrub tundra [54].According to a previous LC mapping of the Lena Delta, Arga Island appears to be dominated by classes that indicate mainly dry conditions [55].The permafrost in this area is about 400-1000 m thick [52].The active layer is approximately 20-40 cm in thickness.The cryostructure of Arga sediments is massive and contains thin polygonal ice wedges [43].Today, the Laptev Sea area and Lena Delta are experiencing active warming and are hot spots for hydrological consequences, as identified by numerous observations [56,57].One of the latest detailed studies on climate, permafrost, and active layer conditions in the Lena Delta was presented by Boike et al. [56].According to this study, the coldest month in the southern part of the delta is February, with an average monthly air temperature of −32.7 • C (−25.6 • F), while the warmest month is July with an average of 9.5 • C (48.2 • F).The mean annual air temperature is −12.3 • C (10.4 • F).The period of active layer thawing lasts from the end of May until the end of August or the beginning of September.Due to the fact that the near-coastal areas of the island are significantly influenced by the sea, we analyzed a rectangular area that encompasses most of the island, excluding the land-sea interface.
Therefore, there are several reasons to conduct a thorough investigation of the thermokarst landscape on Arga Island: (1) The island is situated in a unique location on a global scale, specifically at the transition between continent and sea.Due to this, the island's landscapes are highly sensitive to global climate change.(2) The island is affected by active modern tectonics and permafrost degradation, which are reflected in its landscapes.Owing to this, studies of landform morphology and landscape distribution can help to reveal the unknown relationships between the geological properties of the island's sediments, modern tectonic movements, and physical processes related to permafrost degradation.(3) Arga Island exhibits a uniform pattern of thermokarst topography over a significant area.This results in a reduction in the number of LC classes that need to be mapped and simplifies the LC mapping process.( 4) Additionally, it should be marked that Arga Island is practically inaccessible to expeditions.As a result, RS analysis is the sole method for observing the landscapes and environmental condition changes in one-third of the largest Arctic delta.

Data Processing
We conducted all image processing, creating vector polygons and data acquisition on the GEE platform.First, we selected the appropriate Sentinel-2 (Level-2A) image (Table 1) by filtering for cloud cover in order to ensure the quality and accuracy of the data in our analysis.We filtered summer images for the study area from recent years.The most appropriate image is the Sentinel-2 image captured on 29 July 2021 with a cloud cover of 0.008006% of the pixels.As thermokarst processes are characterized by changes to vegetation cover and hydrology, we added the number of indices to the dataset.Based on a review of the relevant literature, we selected several indices that are commonly used for the estimation of biomass and identification of water bodies.These are Normalized Difference Vegetation Index (NDVI) [59], Normalized Difference Water Index (NDWI) [60], Enhanced Vegetation Index (EVI) [61], and results of Tasselled Cap transformation [62,63]-wetness (TCW) and greenness (TCG) (Table 2).The NDVI is a widely used indicator for assessing and monitoring the amount and health of plant cover.The EVI is nonlinearly related to the NDVI and helps to extend the range over which vegetation indices respond to changes in vegetation density [61].In turn, the use of TCG can provide a more detailed understanding of vegetation dynamics, thanks to its ability to differentiate between various aspects of vegetation reflectance [64].The NDWI is a well-established index for analyzing surface water, and the use of TCW provides more detailed information about soil moisture [63].The formulas used to calculate these indices from the Sentinel-2 image channels were obtained from the Index Database [65].
Table 2. Spectral indices used in the study.The band's names in the formulas are in accordance with Table 1.

Index Formula
Normalized Difference Vegetation Index (NDVI) After producing indices, we normalized the initial Sentinel-2 image bands.Then, we used the principal components analysis (PCA) to derive three new uncorrelated bands from Sentinel-2 bands and calculated indices.This transform can be carried out in GEE by using a covariance reducer on an array image and the eigen() command on the resultant covariance array.After normalizing the PCA bands, the first three were added to the dataset.Therefore, the final dataset includes normalized Sentinel-2 Level-2A channels (Figure 2): B1, B2, B3, B4, B5, B6, B7, B8, B8A, B9, B11, and B12; calculated indices such as NDVI, NDWI, EVI, TCW, and TCG (Figure 2c-g); the first three principal component analysis results: PCA 1, PCA 2, and PCA 3 (Figure 2b).
Earth 2024, 5, FOR PEER REVIEW 5 literature, we selected several indices that are commonly used for the estimation of biomass and identification of water bodies.These are Normalized Difference Vegetation Index (NDVI) [59], Normalized Difference Water Index (NDWI) [60], Enhanced Vegetation Index (EVI) [61], and results of Tasselled Cap transformation [62,63]-wetness (TCW) and greenness (TCG) (Table 2).The NDVI is a widely used indicator for assessing and monitoring the amount and health of plant cover.The EVI is nonlinearly related to the NDVI and helps to extend the range over which vegetation indices respond to changes in vegetation density [61].In turn, the use of TCG can provide a more detailed understanding of vegetation dynamics, thanks to its ability to differentiate between various aspects of vegetation reflectance [64].The NDWI is a well-established index for analyzing surface water, and the use of TCW provides more detailed information about soil moisture [63].The formulas used to calculate these indices from the Sentinel-2 image channels were obtained from the Index Database [65].After producing indices, we normalized the initial Sentinel-2 image bands.Then, we used the principal components analysis (PCA) to derive three new uncorrelated bands from Sentinel-2 bands and calculated indices.This transform can be carried out in GEE by using a covariance reducer on an array image and the eigen() command on the resultant covariance array.After normalizing the PCA bands, the first three were added to the dataset.Therefore, the final dataset includes normalized Sentinel-2 Level-2A channels (Figure 2): B1, B2, B3, B4, B5, B6, B7, B8, B8A, B9, B11, and B12; calculated indices such as NDVI, NDWI, EVI, TCW, and TCG (Figure 2c-g); the first three principal component analysis results: PCA 1, PCA 2, and PCA 3 (Figure 2b).

Mapped LC Classes
We identified five LC classes based on a literature review and a visual analysis of the study area using RS data.In general, we streamlined the classifications of thermokarst terrain and identified common categories [13,14].The first one is water bodies.They vary in size from small ponds formed between collapsed ice-wedge polygons to large, elongated thermokarst lakes.In some instances, ponds are completely overgrown with vegetation.Additionally, some smaller lakes have a green color to their water.Many lakes have deep central areas and shallower areas near their shores.We manually created training polygons for each water body type in order to account for this diversity.
The second LC class is stable terrains.These are stable at the moment of observation.We mapped training polygons of this class on the flat upland surfaces that were not affected by thermokarst processes.These surfaces have lighter and more yellowish hues in the true color imagery.The third class is thermokarst-affected terrains.We identified training polygons for thermokarst-affected terrain at the bottoms of thermokarst hollows and erosional valleys.The colors of this landscape are generally darker and more brownish in the true color band combination.
Moreover, we identified a distinct class for slopes that visually differ from other classes and a specific class for blowouts.This latter class has a unique spectral signature and should be identified separately.Therefore, we identified 20 training polygons per LC class revealed, amounting to 100 training polygons in total (Figure 3).We distributed all the polygons evenly throughout the study area.The calculated spectral signatures for each class demonstrate their multidirectional nature, differentiation between LC classes, and the fundamental potential for classification (Figure 4).

Mapped LC Classes
We identified five LC classes based on a literature review and a visual analysis of the study area using RS data.In general, we streamlined the classifications of thermokarst terrain and identified common categories [13,14].The first one is water bodies.They vary in size from small ponds formed between collapsed ice-wedge polygons to large, elongated thermokarst lakes.In some instances, ponds are completely overgrown with vegetation.Additionally, some smaller lakes have a green color to their water.Many lakes have deep central areas and shallower areas near their shores.We manually created training polygons for each water body type in order to account for this diversity.
The second LC class is stable terrains.These are stable at the moment of observation.We mapped training polygons of this class on the flat upland surfaces that were not affected by thermokarst processes.These surfaces have lighter and more yellowish hues in the true color imagery.The third class is thermokarst-affected terrains.We identified training polygons for thermokarst-affected terrain at the bottoms of thermokarst hollows and erosional valleys.The colors of this landscape are generally darker and more brownish in the true color band combination.
Moreover, we identified a distinct class for slopes that visually differ from other classes and a specific class for blowouts.This latter class has a unique spectral signature and should be identified separately.Therefore, we identified 20 training polygons per LC class revealed, amounting to 100 training polygons in total (Figure 3).We distributed all the polygons evenly throughout the study area.The calculated spectral signatures for each class demonstrate their multidirectional nature, differentiation between LC classes, and the fundamental potential for classification (Figure 4).

The Band Selection for Classification from Composite
After identifying LC classes and preparing the final dataset, we needed to reduce the number of analyzed bands without compromising the accuracy of the classifier.It was necessary to avoid user memory limit errors and reduce the classifier runtime.Additionally, we endeavored to identify the most suitable bands for our objectives.We selected the bands based on two criteria.
The first criterion is the average correlation coefficient between bands in the composite.We attempted to select bands with maximal values of either positive or negative correlation coefficients.In other words, we aimed to select representative bands.In order to achieve this goal, we calculated a correlation matrix for the bands in the composite.We calculated Pearson's correlation coefficient using the ee.Reducer.pearsonsCorrelation()function in GEE.The correlation coefficients were computed for a random sample of 5000 points.In order to calculate the average correlation, each correlation coefficient was transformed using the Fisher Z-transformation according to [66,67].Then, we calculated the mean of the Z values, after which we converted the calculated means back to the correlation coefficient (Figure 5).

The Band Selection for Classification from Composite
After identifying LC classes and preparing the final dataset, we needed to reduce the number of analyzed bands without compromising the accuracy of the classifier.It was necessary to avoid user memory limit errors and reduce the classifier runtime.Additionally, we endeavored to identify the most suitable bands for our objectives.We selected the bands based on two criteria.
The first criterion is the average correlation coefficient between bands in the composite.We attempted to select bands with maximal values of either positive or negative correlation coefficients.In other words, we aimed to select representative bands.In order to achieve this goal, we calculated a correlation matrix for the bands in the composite.We calculated Pearson's correlation coefficient using the ee.Reducer.pearsonsCorrelation()function in GEE.The correlation coefficients were computed for a random sample of 5000 points.In order to calculate the average correlation, each correlation coefficient was transformed using the Fisher Z-transformation according to [66,67].Then, we calculated the mean of the Z values, after which we converted the calculated means back to the correlation coefficient (Figure 5).

The Band Selection for Classification from Composite
After identifying LC classes and preparing the final dataset, we needed to reduce the number of analyzed bands without compromising the accuracy of the classifier.It was necessary to avoid user memory limit errors and reduce the classifier runtime.Additionally, we endeavored to identify the most suitable bands for our objectives.We selected the bands based on two criteria.
The first criterion is the average correlation coefficient between bands in the composite.We attempted to select bands with maximal values of either positive or negative correlation coefficients.In other words, we aimed to select representative bands.In order to achieve this goal, we calculated a correlation matrix for the bands in the composite.We calculated Pearson's correlation coefficient using the ee.Reducer.pearsonsCorrelation()function in GEE.The correlation coefficients were computed for a random sample of 5000 points.In order to calculate the average correlation, each correlation coefficient was transformed using the Fisher Z-transformation according to [66,67].Then, we calculated the mean of the Z values, after which we converted the calculated means back to the correlation coefficient (Figure 5).The second criterion is the standard deviation of the average spectral characteristics of each LC class in all bands.This criterion was used to select the bands in which the LC classes exhibit the greatest separability.The use of bands with a significant difference in LC class values simplified the image classification process.As a result, we selected the following seven bands: B6, B7, B11, B12, NDWI, NDVI, and PCA1 (Figure 6).They have the highest values according to both criteria.It should be noted that we did not select B9 due to the lack of values in this band.The B1 has a lower correlation coefficient compared to the B7, and due to this, we chose the last one.The second criterion is the standard deviation of the average spectral characteristics of each LC class in all bands.This criterion was used to select the bands in which the LC classes exhibit the greatest separability.The use of bands with a significant difference in LC class values simplified the image classification process.As a result, we selected the following seven bands: B6, B7, B11, B12, NDWI, NDVI, and PCA1 (Figure 6).They have the highest values according to both criteria.It should be noted that we did not select B9 due to the lack of values in this band.The B1 has a lower correlation coefficient compared to the B7, and due to this, we chose the last one.

Classification Method
GEE offers a range of machine-learning algorithms for supervised classification tasks.We selected the tree-based random forest ensemble due to its widespread use for LC classification, owing to its flexibility, robustness, and frequent superior performance in comparison to other classifiers [68,69].The theoretical foundations of this method are described in [70,71].The random forest algorithm is a machine learning technique that constructs multiple decision trees during the training process and produces the most common class label for classification or an average prediction value for regression.Each tree in the random forest model is built using a random selection of features and a subset of the training dataset.The final prediction is determined by combining the outputs of all the individual trees in the ensemble.This approach helps to enhance accuracy and mitigate overfitting by incorporating the predictions from various trees.The GEE employs this algorithm through the Statistical Machine Intelligence and Learning Engine (SMILE) JAVA library.When running RF in GEE, it is necessary to set two important hyperparameters.

Classification Method
GEE offers a range of machine-learning algorithms for supervised classification tasks.We selected the tree-based random forest ensemble due to its widespread use for LC classification, owing to its flexibility, robustness, and frequent superior performance in comparison to other classifiers [68,69].The theoretical foundations of this method are described in [70,71].The random forest algorithm is a machine learning technique that constructs multiple decision trees during the training process and produces the most common class label for classification or an average prediction value for regression.Each tree in the random forest model is built using a random selection of features and a subset of the training dataset.The final prediction is determined by combining the outputs of all the individual trees in the ensemble.This approach helps to enhance accuracy and mitigate overfitting by incorporating the predictions from various trees.The GEE employs this algorithm through the Statistical Machine Intelligence and Learning Engine (SMILE) JAVA library.When running RF in GEE, it is necessary to set two important hyperparameters.These are the number of decision trees and the minimum number of variables selected at each node.We identified the minimum number of decision trees that can be used to achieve the highest level of accuracy.This was performed by estimating the overall accuracy of the classifier with a range of tree counts from 10 to 200 in increments of 10.Based on this analysis, we determined the optimal value for the number of decision trees.The second parameter, which is the minimum number of variables selected at each node, is set to the square root of the total number of variables by default.This corresponds to the recommended value [72].

Accuracy Assessment Technique
In order to quantitatively estimate the classification accuracy, we calculated a confusion matrix (CM).For the CM calculation, we divided the pixels from the training polygons into two random groups.The first group was used to train the classifier, and the second group was used to validate the predictions.The ratio was 60% for the training set and 40% for the validation set.After training the classifier and classifying the dataset, we compared the predicted values with those in the validation set using ee.Classifier.confusionMatrix()method.Based on the CM, we calculated the following metrics that determine the performance of the algorithm [73,74].Recall (producer's accuracy, PA) was calculated by dividing the number of correctly classified pixels in a given class by the total number of pixels in that class.Precision (consumer's accuracy, CA) was computed by dividing the number of true positive pixels by the total number of pixels predicted to be in the class.The overall accuracy (OA) of the model was calculated by taking the sum of the true positive and true negative classification outcomes and dividing this by the total number of data points.The F1 score was computed by taking the harmonic mean of precision and recall values [75].The Kappa coefficient determines the level of consistency between multiple measurements of a given variable under various conditions [76].We estimated the recall, precision, and OA on our own and verified these results using GEE [73].The F1 score and the Kappa coefficient were calculated using the GEE functionality provided.However, we also conducted a visual inspection of the classification results.

Classification Result
We programmatically tested a range of tree numbers in order to find the smallest possible value that results in the highest accuracy.Although overall accuracy was similar for all tested numbers of trees, the classifier with 130 trees achieved the highest accuracy, 98.3368% (10 trees-0.979;20 trees-0.982;30-200 trees-0.983).It should be noted that repeated execution of the classification process might produce slightly different results with minor variations in accuracy.However, the classifier usually produces results with an accuracy close to 98.3%.As a result of the classification, we obtained a map of the LC class distribution (Figure 7).The presented map allows for an understanding of the location of stable and thermokarst-affected areas, as well as the location of other LC classes.Despite the fact that this map is an independent product of the research, it has become a basis for the further spatial distribution of LC classes.

Accuracy Assessment
Based on the aforementioned methods, we calculated the confusion matrix in order to perform an accuracy assessment (Figure 8).The quality validation results are presented in Table 3.The overall accuracy of all the different LC classes was 0.9834, equivalent to 98.34%.We calculated the precision in order to demonstrate the positive predicted value.The outcome was that the overall precision was 0.9818, equivalent to 98.18%.Thermokarstaffected terrains showed the lowest observed level of precision, 0.9296.The recall metric of the confusion matrix is useful for assessing the rate of correctly identified true positives.The overall recall rate was 0.9782, which is equivalent to 97.82%.The stable terrains had the lowest recall rate of 0.9121.The F1 score has been taken into account, considering both the precision and recall metrics simultaneously.The overall F1 score was 0.9795, and the stable terrains achieved the lowest value of 0.9497.The kappa coefficient of the confusion matrix was 0.9782, which is a significant value.

Accuracy Assessment
Based on the aforementioned methods, we calculated the confusion matrix in order to perform an accuracy assessment (Figure 8).The quality validation results are presented in Table 3.The overall accuracy of all the different LC classes was 0.9834, equivalent to 98.34%.We calculated the precision in order to demonstrate the positive predicted value.The outcome was that the overall precision was 0.9818, equivalent to 98.18%.Thermokarst-affected terrains showed the lowest observed level of precision, 0.9296.The recall metric of the confusion matrix is useful for assessing the rate of correctly identified true positives.The overall recall rate was 0.9782, which is equivalent to 97.82%.The stable terrains had the lowest recall rate of 0.9121.The F1 score has been taken into account, considering both the precision and recall metrics simultaneously.The overall F1 score was 0.9795, and the stable terrains achieved the lowest value of 0.9497.The kappa coefficient of the confusion matrix was 0.9782, which is a significant value.

Accuracy Assessment
Based on the aforementioned methods, we calculated the confusion matrix in order to perform an accuracy assessment (Figure 8).The quality validation results are presented in Table 3.The overall accuracy of all the different LC classes was 0.9834, equivalent to 98.34%.We calculated the precision in order to demonstrate the positive predicted value.The outcome was that the overall precision was 0.9818, equivalent to 98.18%.Thermokarst-affected terrains showed the lowest observed level of precision, 0.9296.The recall metric of the confusion matrix is useful for assessing the rate of correctly identified true positives.The overall recall rate was 0.9782, which is equivalent to 97.82%.The stable terrains had the lowest recall rate of 0.9121.The F1 score has been taken into account, considering both the precision and recall metrics simultaneously.The overall F1 score was 0.9795, and the stable terrains achieved the lowest value of 0.9497.The kappa coefficient of the confusion matrix was 0.9782, which is a significant value.In addition to quantitative measures for evaluating the results, we also conducted a visual inspection of them.This inspection revealed the following (Figure 9).The classifier experienced difficulties with pixels at the transitions between different classes.This is Earth 2024, 5 238 because these pixels typically exhibit intermediate spectral characteristics and may, therefore, be classified as belonging to both categories.It results in decreased accuracy in the transitions between all the LC classes.Furthermore, the classifier made some errors in the interpretation of LC class pixels that have similar spectral characteristics.For example, the classifier may confuse water bodies and thermokarst-affected terrains due to their common "wet" characteristics, as well as the fact that small water bodies are often overgrown and are very shallow in the Lena Delta [77].However, in these cases, the classifier identifies small, shallow, overgrown ponds as thermokarst-affected terrain, which is generally true [26].Along with slopes and stable terrains, which are "drier" and have less biomass, they also may confuse the classifier [55].Nevertheless, in most cases, blowouts and lakes are correctly classified due to their specific spectral signatures.Therefore, based on both quantitative and qualitative accuracy measurements, we can conclude that the classifier functioned correctly, and the classification result reflects the spatial distribution of the identified LC classes.
blowouts 1 1 1 Mean: 0.9818 0.9782 0.9795 In addition to quantitative measures for evaluating the results, we also conducted a visual inspection of them.This inspection revealed the following (Figure 9).The classifier experienced difficulties with pixels at the transitions between different classes.This is because these pixels typically exhibit intermediate spectral characteristics and may, therefore, be classified as belonging to both categories.It results in decreased accuracy in the transitions between all the LC classes.Furthermore, the classifier made some errors in the interpretation of LC class pixels that have similar spectral characteristics.For example, the classifier may confuse water bodies and thermokarst-affected terrains due to their common "wet" characteristics, as well as the fact that small water bodies are often overgrown and are very shallow in the Lena Delta [77].However, in these cases, the classifier identifies small, shallow, overgrown ponds as thermokarst-affected terrain, which is generally true [26].Along with slopes and stable terrains, which are "drier" and have less biomass, they also may confuse the classifier [55].Nevertheless, in most cases, blowouts and lakes are correctly classified due to their specific spectral signatures.Therefore, based on both quantitative and qualitative accuracy measurements, we can conclude that the classifier functioned correctly, and the classification result reflects the spatial distribution of the identified LC classes.

The Spatial Distribution of the LC Classes
We estimated the area of the LC classes to assess the current state of thermokarst landscapes on Arga Island (Figure 10).The analysis reveals that the most prevalent LC class is thermokarst-affected terrains, which cover 705.4 km 2 or 35% of the total area.The stable terrains are the second largest LC class in terms of area.It covers 569.9 km 2 , 29% of the total area.The water bodies, slopes, and blowouts covered 19%, 16%, and 2%, respectively.It is worth noting that the western part of the study area exhibits a greater number of gullies and a more pronounced elevation variation in the topography.Conversely, the eastern part is more flat and slopes mainly delineate the thermokarst depressions in this area.
We estimated the area of the LC classes to assess the current state of thermokarst landscapes on Arga Island (Figure 10).The analysis reveals that the most prevalent LC class is thermokarst-affected terrains, which cover 705.4 km 2 or 35% of the total area.The stable terrains are the second largest LC class in terms of area.It covers 569.9 km 2 , 29% of the total area.The water bodies, slopes, and blowouts covered 19%, 16%, and 2%, respectively.It is worth noting that the western part of the study area exhibits a greater number of gullies and a more pronounced elevation variation in the topography.Conversely, the eastern part is more flat and slopes mainly delineate the thermokarst depressions in this area.The estimated area of the LC classes provides an indication of the current condition of the thermokarst landscape on Arga Island.It should be noted that stable terrains were mapped within the thermokarst depressions.It means that these areas were completely degraded and have now stabilized.Conversely, thermokarst-affected terrains, whose training polygons were identified in thermokarst depressions, were also mapped on upland surfaces.This suggests that these non-degraded uplands have begun to thaw and are at risk of future permafrost degradation.Therefore, these two LC classes (stable terrains and thermokarst-affected terrains) can be further divided into two subclasses each.These are stable, non-degraded uplands; thermokarst-affected uplands; thermokarst-affected terrains that are experiencing ongoing degradation; and stable thermokarst terrains.The water bodies can also be divided into roughly two categories: deep lakes, located in thermokarst depressions, and shallower, smaller water bodies situated on uplands.However, morphometric analysis is necessary in order to identify and map these subclasses.We intend to do so in future studies.The estimated area of the LC classes provides an indication of the current condition of the thermokarst landscape on Arga Island.It should be noted that stable terrains were mapped within the thermokarst depressions.It means that these areas were completely degraded and have now stabilized.Conversely, thermokarst-affected terrains, whose training polygons were identified in thermokarst depressions, were also mapped on upland surfaces.This suggests that these non-degraded uplands have begun to thaw and are at risk of future permafrost degradation.Therefore, these two LC classes (stable terrains and thermokarst-affected terrains) can be further divided into two subclasses each.These are stable, non-degraded uplands; thermokarst-affected uplands; thermokarst-affected terrains that are experiencing ongoing degradation; and stable thermokarst terrains.The water bodies can also be divided into roughly two categories: deep lakes, located in thermokarst depressions, and shallower, smaller water bodies situated on uplands.However, morphometric analysis is necessary in order to identify and map these subclasses.We intend to do so in future studies.

Conclusions
In this study, we focused on evaluating the current state of the thermokarst terrain on Arga Island.We prepared a composite image that includes a Sentinel-2 image (Level-2A) and a number of calculated indices.We classified this composite image using an RF algorithm in GEE.The accuracy assessment revealed that the overall accuracy for all the different LC classes was 0.9834, which is equivalent to 98.34%.Other metrics calculated from CM also show a high level of accuracy.The resulting map illustrates the distribution of identified LC classes.Their area estimation indicates that the most prevalent LC class was thermokarst-affected terrains, which occupied 35% of the total area.The stable terrains cover 29% of the total area.The classifier has also identified water bodies, slopes, and blowouts.Based on this assessment, we can conclude that approximately one-third of the study area is affected by ongoing thermokarst processes at the time of observation.The mapping allowed us to determine the location of these areas.
The presented findings reflect and quantitatively characterize the current condition of the thermokarst landscape on Arga Island.The quantitative nature of the evaluation will enable the comparison of image classification outcomes from different time periods; it makes the identification of the rate and character of landscape changes possible.Therefore, the presented study may serve as a basis for future investigations into the state of thermokarst landscapes.Furthermore, we plan to subdivide the identified LC classes into several subclasses after conducting a morphometric analysis of the study area.In particular, we plan to identify thermokarst-affected terrains in upland areas, stable terrains in thermokarst depressions and erosional valleys, and subdivide water bodies into two subclasses.In turn, it will enable the prediction of future landscape changes.
The results obtained are significant, as they provide insight into the state of the thermokarst landscape in the study area, which is characterized by a unique set of geological features and is inaccessible to field expeditions.Beyond that, the results may be applicable in multidisciplinary studies such as those related to soil, botanical, and hydrological studies in the Lena Delta, as well as general investigations of the relationships between tectonics, topography, and geocryology.In addition, this study demonstrates the potential of ML techniques and the GEE platform for thermokarst landscape mapping.The described methodology could be applied to mapping areas that are susceptible to thermokarst processes, especially if these Arctic lowland areas contain a number of distinct LC classes that are related to thermokarst processes.

Figure 1 .
Figure 1.Location of the study area (the black rectangle) in a mosaic of satellite Sentinel-2 images of the Lena Delta.The red square on the map represents a portion of the study area shown in Figure 2. The study area covers approximately 2000 km 2 .

Figure 1 .
Figure 1.Location of the study area (the black rectangle) in a mosaic of satellite Sentinel-2 images of the Lena Delta.The red square on the map represents a portion of the study area shown in Figure 2. The study area covers approximately 2000 km 2 .

Figure 3 .
Figure 3. Identified training polygons for each revealed LC class in R-G-B composite.Figure 3. Identified training polygons for each revealed LC class in R-G-B composite.

Figure 3 .
Figure 3. Identified training polygons for each revealed LC class in R-G-B composite.Figure 3. Identified training polygons for each revealed LC class in R-G-B composite.

Figure 4 .
Figure 4. Spectral signatures of revealed LC classes.The y-axis represents normalized reflectance for the Sentinel-2 bands, as well as index values for the indices.

Figure 4 .
Figure 4. Spectral signatures of revealed LC classes.The y-axis represents normalized reflectance for the Sentinel-2 bands, as well as index values for the indices.

Earth 2024, 5 , 7 Figure 4 .
Figure 4. Spectral signatures of revealed LC classes.The y-axis represents normalized reflectance for the Sentinel-2 bands, as well as index values for the indices.

Figure 5 .
Figure 5.The correlation matrix for the bands of analyzed composite image.The values of Pearson's correlation coefficients are presented after the Fisher Z-transformation.The mean values are presented after reverting back to correlation coefficients using the inverse Fisher transform.

Earth 2024, 5 , 8 Figure 5 .
Figure 5.The correlation matrix for the bands of analyzed composite image.The values of Pearson's correlation coefficients are presented after the Fisher Z-transformation.The mean values are presented after reverting back to correlation coefficients using the inverse Fisher transform.

Figure 6 .
Figure 6.The selection of the bands is based on both criteria.The columns represent the following.Mean values of each LC class (from Figure 3): 1-water bodies; 2-stable terrains; 3-thermokarstaffected terrains; 4-slopes; 5-blowouts.MV-the mean values of Pearson's correlation coefficients of each band (from Figure 5); SD-the standard deviation of the average spectral characteristics of each LC class in all bands.The chosen bands are indicated by a yellow color.The bands are sorted by SD column.

Figure 6 .
Figure 6.The selection of the bands is based on both criteria.The columns represent the following.Mean values of each LC class (from Figure 3): 1-water bodies; 2-stable terrains; 3-thermokarstaffected terrains; 4-slopes; 5-blowouts.MV-the mean values of Pearson's correlation coefficients of each band (from Figure 5); SD-the standard deviation of the average spectral characteristics of each LC class in all bands.The chosen bands are indicated by a yellow color.The bands are sorted by SD column.

Figure 7 .
Figure 7.The map of the LC class distribution.

Figure 8 .
Figure 8.The confusion matrix.The classifier achieved an overall accuracy of 0.9834 and a Kappa coefficient of 0.9782.

Figure 7 .
Figure 7.The map of the LC class distribution.

Figure 7 .
Figure 7.The map of the LC class distribution.

Figure 8 .
Figure 8.The confusion matrix.The classifier achieved an overall accuracy of 0.9834 and a Kappa coefficient of 0.9782.Figure 8.The confusion matrix.The classifier achieved an overall accuracy of 0.9834 and a Kappa coefficient of 0.9782.

Figure 8 .
Figure 8.The confusion matrix.The classifier achieved an overall accuracy of 0.9834 and a Kappa coefficient of 0.9782.Figure 8.The confusion matrix.The classifier achieved an overall accuracy of 0.9834 and a Kappa coefficient of 0.9782.

Figure 9 .
Figure 9.A visual inspection of the classification results: (a) the R-G-B composite; (b) the map of the LC classes' distribution; (c) a transparent map of the LC class distribution on the RGB composite; (d) the PCA 1-2-3 composite.The yellow circles indicate certain aspects of the classification results that require further attention.These include the following: 1. Difficulties in distinguishing between pixels belonging to different classes at the boundaries between them.2. Confusion in the recognition of water bodies and thermokarst-affected terrains.3. Challenges in distinguishing slopes from stable terrains.

Figure 10 .
Figure 10.The spatial distribution of the LC classes.

Figure 10 .
Figure 10.The spatial distribution of the LC classes.

Table 2 .
Spectral indices used in the study.The band's names in the formulas are in accordance with Table1.