Simulation of DEM Based on ICESat-2 Data Using Openly Accessible Topographic Datasets †

Abstract: The digital elevation model (DEM) is a three-dimensional digital representation of the terrain or the Earth’s surface. For determining topography, DEMs are the most used and ideal method with (i


Environ. Sci. Proc. 2024, 29, 66

Introduction
The digital elevation model (DEM) is a visualization of the bare Earth's surface elevations [1]. DEMs are generated from numerous sources, including contour lines, topographic maps, stereo photogrammetry, SAR interferometry, and DGPS points. Among the techniques for creating DEMs, high-resolution laser altimetry (LiDAR) has been shown to generate DEMs of higher accuracy [2]. Various terrain-related studies, including hydrological modelling, flood inundation mapping, and the monitoring of volcanic activity, use DEMs as an integral input. The accuracy of the input DEM is therefore an important parameter for obtaining good-quality results [3]. Systematic errors in DEM products are still possible due to equipment precision limitations, and they are time-consuming, costly, and difficult to rectify [4]. Various studies have been conducted to enhance the quality and accuracy of the available open-source DEMs [5].
An Earth Observation System satellite, the Ice, Cloud, and Land Elevation Satellite (ICESat-2), was launched by NASA. Its highly accurate data provide extensive and sufficient reference data for assessing the quality of different DEMs [6].
Interpolation is the process of estimating the value of attributes at unsampled sites from measurements made at point locations within the same area or region, but it often leads to over-smoothing [7]. Simulation, in contrast, can be defined as a statistical way of generating data where none are available. A statistical model such as linear regression is fitted to the sample (training) data to capture the relationship between input and output, and that relationship is then applied to other input points to generate their corresponding outputs. This study hence utilizes the CartoDEM and ICESat-2 LiDAR data to simulate a higher accuracy DEM using machine learning algorithms.
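The idea of simulation described above can be illustrated with a minimal sketch, assuming a single input (a coarse DEM elevation) and a single output (a reference elevation) related by a straight line. The function names and all numbers are hypothetical and are not taken from the study's data or code.

```python
# Sketch only: "simulation" as regression from DEM elevation to reference elevation.
# All values below are made-up illustrative numbers, not data from the study.

def fit_line(x, y):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted relationship to new input points."""
    slope, intercept = model
    return [slope * xi + intercept for xi in x]

# Training pairs: DEM elevation at locations where a reference elevation exists.
dem_train = [500.0, 650.0, 800.0, 1200.0, 1500.0]
ref_train = [503.0, 654.5, 806.0, 1210.0, 1513.0]

model = fit_line(dem_train, ref_train)

# Generate "simulated" reference elevations where only the DEM value is known.
simulated = predict(model, [700.0, 1000.0])
print(model, simulated)
```

The models used later in the paper replace the straight line with more flexible regressors, but the fit-then-apply pattern is the same.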
Various studies have shown that, for the Indian region, the Cartosat-1 DEM provides good-quality terrain data with the best available accuracy [8]. This study focuses on simulating a higher accuracy spaceborne LiDAR DEM by correlating it with the CartoDEM measurements. The simulated DEM is then validated using DGPS data. The accuracy of the simulated output DEM is higher than that of the CartoDEM and closer to the LiDAR measurements.

Study Area
This study was conducted over the hilly terrain of the Dehradun region in the foothills of the Himalayas. The study area lies between latitudes 30°01′ N and 31°2′ N and longitudes 77°34′ E and 78°18′ E (Figure 1).

Datasets

CartoDEM V3 R1
Using the CartoDEM V3 R1 product, the corresponding LiDAR DEM was generated to enhance the vertical accuracy of the CartoDEM. Cartosat-1 is the first Indian remote sensing satellite that can provide stereo visualization in orbit. A number of products derived from Cartosat-1 can be used for various geographical information system (GIS) applications, including digital elevation models (Figure 1), orthoimage products, and value-added products for GIS.

ICESat-2
On board ICESat-2, the Advanced Topographic Laser Altimeter System (ATLAS) instrument provides all of the topographic data. It emits three relatively strong beams and three relatively weak beams [9]. For the accurate analysis of different DEMs, it provides sufficient, high-quality reference data [10].

Ground Control Points (GCPs)
Trimble R7 GNSS receivers and Leica 500 series receivers were used to collect the field data. A total of 16 GCPs were collected over the Dehradun region and used to validate the simulated DEM.

Methodology
The overall methodology followed for this study is depicted in Figure 2. Pancholi et al. successfully generated a DEM using the machine learning models of decision tree (DT), random forest (RF), gradient boosting machine (GBM), and multi-layer perceptron (MLP) [11]; of these, the MLP model produced the lowest error.

Machine Learning Models
This section describes the machine learning models used for this study.

• Decision Tree
The decision tree model, rooted in machine learning theory, is a potent tool for regression and classification problems. In contrast to approaches that use a group of features (or bands) together to complete classification or regression in a single decision step, it relies on a multilevel, hierarchical decision strategy with a tree-like structure. It consists of a root node, internal nodes, and leaves. Each node applies a binary decision in a top-down manner, separating one or more classes from the others while progressing down the tree until a leaf node is reached. In essence, a decision tree divides a complicated problem into smaller problems, and the simpler decisions that follow lead to the complex conclusion. The decision tree model was chosen for this study because it effectively handles both linear and non-linear interactions [12].
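The hierarchical, top-down splitting described above can be sketched for regression on a single feature. This is a minimal illustration, assuming least-squares splits and leaf values equal to the mean target. The names `best_split`, `build_tree`, and `predict_tree` and the sample values are hypothetical, not the study's implementation.

```python
# Sketch only: a tiny one-feature regression tree with least-squares splits.

def best_split(x, y):
    """Return the threshold that minimises the summed squared error of both sides."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

def build_tree(x, y, max_depth=2):
    """Internal node: (threshold, left_subtree, right_subtree). Leaf: mean target."""
    if max_depth == 0 or len(set(x)) < 2:
        return sum(y) / len(y)
    thr = best_split(x, y)
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= thr]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > thr]
    return (thr,
            build_tree([p[0] for p in left], [p[1] for p in left], max_depth - 1),
            build_tree([p[0] for p in right], [p[1] for p in right], max_depth - 1))

def predict_tree(tree, xi):
    """Walk from the root down to a leaf, taking one binary decision per node."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if xi <= thr else right
    return tree

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
tree = build_tree(x, y, max_depth=1)
print(tree, predict_tree(tree, 2.0), predict_tree(tree, 11.0))
```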

• Random Forest
Random forest is an ensemble machine learning model made up of two or more decision trees, which together form a "decision forest". The model's outcome is obtained by majority voting over the individual decision tree outcomes (or, for regression, by averaging them). The performance of a random forest depends on the design of each of its decision trees. Two steps in the process involve random selection. In the first, a bootstrap technique randomly selects about two-thirds of the training dataset before each decision tree is built. The remaining third, the out-of-bag (OOB) data, is used for internal cross-validation to assess the precision of the model. In the second, a random subset of the input features is considered at each node split [13].
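The bootstrap and out-of-bag idea can be sketched as follows, assuming sampling with replacement of as many points as the training set contains. Under that assumption roughly 63% of the distinct points land in the bag and about 37% stay out of bag, which matches the "about two-thirds" and "final third" description. The function names are hypothetical.

```python
# Sketch only: bootstrap sampling, out-of-bag (OOB) indices, and averaging of trees.
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; indices never drawn are out of bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def forest_predict(tree_predictions):
    """For regression, the forest output is the mean of the individual tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_indices(1000, rng)
print(len(set(in_bag)) / 1000, len(oob) / 1000)
print(forest_predict([1210.0, 1214.0, 1212.0]))
```

Each tree would be trained on its own `in_bag` sample and scored on its own `oob` sample, so no separate validation set is needed for this internal check.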

• Gradient Boosting Machine
Gradient boosting is an ensemble machine learning approach that applies boosting to decision trees. It constructs several decision trees sequentially, each of which is a "weak" learner. Each subsequent learner learns from the errors of the preceding model, and together they form the final "strong" model. The first model is a constant, calculated by averaging all of the target values. Residuals are the differences between the predicted values and the actual target values. These residuals, r1, are the target values for the next decision tree, and the residuals r2 are computed from that tree's predictions and r1. This continues until every decision tree is trained [14].
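The sequential residual fitting described above can be sketched with one-split trees ("stumps") as the weak learners. This is a minimal illustration for squared-error loss. The `learning_rate` shrinkage factor, the function names, and the data are assumptions added for the sketch and are not stated in the source.

```python
# Sketch only: gradient boosting for squared error with decision stumps.

def fit_stump(x, residuals):
    """Least-squares single split on the residuals: (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def fit_gbm(x, y, n_trees=50, learning_rate=0.1):
    base = sum(y) / len(y)        # first model: mean of all target values
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # residuals of the current ensemble become the next tree's targets
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lm, rm = fit_stump(x, residuals)
        stumps.append((thr, lm, rm))
        pred = [pi + learning_rate * (lm if xi <= thr else rm)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps

def predict_gbm(model, xi):
    base, lr, stumps = model
    return base + lr * sum(lm if xi <= thr else rm for thr, lm, rm in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
model = fit_gbm(x, y, n_trees=200)
print([round(predict_gbm(model, xi), 2) for xi in x])
```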

• Artificial Neural Network
An artificial neural network (ANN) is a nonlinear, nonparametric framework that propagates signals across layers and is trained with gradient-based learning techniques, loosely simulating how the human brain receives and processes information. It is made up of an input layer, one or more hidden layers, and an output layer. Through synapses, the input layer receives the input and transmits it to the hidden layers; likewise, the hidden layers transmit the data to the output layer. The weights held by the synapses regulate how information moves from one layer to the next. Equation (1) mathematically describes a neuron in the hidden layer or output layer:

$$u = \sum_{i=1}^{n} w_i x_i, \qquad y = \varphi(u + b) \tag{1}$$

where w denotes the synaptic weights, x denotes the inputs to the neuron, n is the number of inputs, y denotes the output from the neuron, u denotes the linear combiner of the input signals, b denotes the bias, and φ() is the activation function used to restrict the range of the neuron's output.
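A single neuron can be sketched in a few lines, assuming the standard weighted-sum form in which u is the sum of the weighted inputs and the output is y = φ(u + b). The activation functions shown (ReLU and sigmoid) are common choices used here for illustration only; the source does not say which one was used.

```python
# Sketch only: forward pass of one neuron, y = phi(sum_i w_i * x_i + b).
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def neuron(x, w, b, activation):
    u = sum(wi * xi for wi, xi in zip(w, x))  # linear combiner of the inputs
    return activation(u + b)                  # bias added, then activation applied

print(neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu))
print(neuron([0.0, 0.0], [1.0, 1.0], 0.0, sigmoid))
```

A layer is many such neurons sharing the same inputs, and an MLP stacks layers so that the outputs of one become the inputs of the next.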

Hyperparameters Used
Some variables must be set in advance and cannot be changed during training. These variables are called hyperparameters. They control how a learning algorithm learns and influence the final outcome of the models [15]. The goal of hyperparameter optimization is to find the settings that give good results from the data as quickly as feasible. It is performed as a separate step because the parameters tuned in this process (Table 1) are not optimized by the models during training and have to be provided to the models before training begins.
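One simple way to carry out such a search is an exhaustive grid search, sketched below. The source does not state which search strategy was used, and the hyperparameter names, candidate values, and scoring function are hypothetical stand-ins rather than the contents of Table 1.

```python
# Sketch only: exhaustive grid search over hyperparameter combinations.
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination and keep the one with the lowest score (e.g. RMSE)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; these names and values are not taken from Table 1.
grid = {"max_depth": [2, 4, 8], "n_estimators": [50, 100]}

def toy_validation_error(params):
    # Stand-in for "train a model with these settings and return its validation error".
    return abs(params["max_depth"] - 4) + abs(params["n_estimators"] - 100) / 100

best_params, best_score = grid_search(grid, toy_validation_error)
print(best_params, best_score)
```

In practice `score_fn` would train the model on the training split and evaluate it on held-out data, so the cost grows with the number of combinations in the grid.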

Accuracy Assessment
The machine learning model was statistically evaluated for the Dehradun region using the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), computed between the simulated DEM and the DGPS data. Regression model performance is often evaluated using the R² and RMSE of the predicted and actual values. For elevation values over an area, a higher R² and lower MAE and RMSE indicate higher precision and accuracy. To obtain a clearer result, the LE90 value was also calculated for the DEM simulated with the MLP regressor model. The formula widely used for LE90 is given in Equation (2) [16,17]:

$$\mathrm{LE90} = 1.6449 \times \mathrm{RMSE} \tag{2}$$
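The accuracy metrics can be computed as in the following sketch. It assumes the commonly used relation LE90 = 1.6449 × RMSE, which is consistent with the reported figures: an RMSE of 6.58 m gives an LE90 of about 10.82 m. The function names and the small test vectors are illustrative only.

```python
# Sketch only: RMSE, MAE, R^2 and LE90 for reference vs. simulated elevations.
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def le90(rmse_value):
    # Assumed relation: 90% linear error for normally distributed vertical errors.
    return 1.6449 * rmse_value

actual = [1.0, 2.0, 3.0, 4.0]       # made-up reference elevations
predicted = [1.0, 2.0, 3.0, 6.0]    # made-up simulated elevations
print(rmse(actual, predicted), mae(actual, predicted), r2(actual, predicted))
print(round(le90(6.58), 2))
```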

Results and Discussion
In this study, machine learning models were implemented to simulate a higher accuracy DEM with elevation values closer to those of ICESat-2. The accuracy of the simulation was evaluated primarily using the 20% testing data, which were unseen by the models, and is shown in Table 2. The ANN model displayed the best results in terms of RMSE and MAE, followed by RF, GBM, and DT. Figure 3 shows the DEMs simulated using the four models. The simulated MLP output was validated using DGPS GCPs (shown in Figure 4). The validation against DGPS yielded an RMSE of 6.58 m, which is very promising for a simulated DEM product over hilly terrain in the foothills of the Himalayas. The LE90 score for the simulated DEM was 10.82 m, signifying confidence that at least 90% of the vertical errors fall within 10.82 m. The difference between the RMSE derived from ICESat-2 and that derived from DGPS can be attributed to the lower uncertainty of DGPS elevation measurements compared with ICESat-2 points, which require the footprints (elevation values) to be filtered based on their deviations. Furthermore, ICESat-2 footprints are not evenly distributed throughout the study area: they are more concentrated in plain areas and less concentrated in hilly areas.

Conclusions
The current study attempted to simulate an ICESat-2 DEM over a 388 km² area in the hilly terrain of Dehradun, located in the foothills of the Himalayas. Four machine learning algorithms, DT, RF, GBM, and MLP, were used for the simulation with CartoDEM and ICESat-2 data and produced promising results, with MLP performing the best. The accuracy assessment was initially conducted using ICESat-2 points and then validated using DGPS GCPs. The study concluded that although DGPS points provide a planned way of validating DEMs, collecting a large number of them is time-consuming and costly, whereas the ICESat-2 dataset provides a large number of high-accuracy elevation points for the simulation. Further investigations must be carried out to improve the accuracy of the DEM to the centimeter scale. Increasing the number of training points across all elevation zones and land use or land cover areas, together with a transfer learning ML approach, is suggested for future improvements.

Figure 2. Methodology followed for the simulation of DEM.

Figure 3. Simulated DEM from CartoDEM and ICESat-2: (a) decision tree regressor, (b) gradient boosting regressor, (c) random forest regressor, (d) multi-layer perceptron.

The highest elevation values are 1950.87 m, 1975 m, 1964.77 m, and 1967.78 m for the DT, GBM, RF, and MLP machine learning models, respectively, whereas the highest elevation value in the ICESat-2 footprints is 1976.87 m. This is a realistic representation of elevation with respect to the training data used in the model. However, since the ICESat-2 data points are not densely distributed in the study area and are very sparsely distributed in the high elevation zones, elevation may be under-represented in zones higher than 1976.87 m. An even distribution of ICESat-2 data over plain and hilly terrain while training the model can potentially improve the accuracy of the models. Including ICESat-2 points from the hilly terrain of a nearby area when training the models, or using them to develop a deeper neural network based on the transfer learning approach, can balance the training data across all elevation ranges and improve the results of the model.

Figure 4. GCPs collected using the DGPS survey, overlaid on the simulated MLP DEM product.

Table 1. Hyperparameters tuned for the regression models.

Table 2. Accuracy metrics of machine learning models.