U-Net-STN: A Novel End-to-End Lake Boundary Prediction Model

: Detecting changes in land cover is a critical task in remote sensing image interpretation, with particular signiﬁcance placed on accurately determining the boundaries of lakes. Lake boundaries are closely tied to land resources, and any alterations can have substantial implications for the surrounding environment and ecosystem. This paper introduces an innovative end-to-end model that combines U-Net and spatial transformation network (STN) to predict changes in lake boundaries and investigate the evolution of the Lake Urmia boundary. The proposed approach involves pre-processing annual panoramic remote sensing images of Lake Urmia, obtained from 1996 to 2014 through Google Earth Pro Version 7.3 software, using image segmentation and grayscale ﬁlling techniques. The results of the experiments demonstrate the model’s ability to accurately forecast the evolution of lake boundaries in remote sensing images. Additionally, the model exhibits a high degree of adaptability, effectively learning and adjusting to changing patterns over time. The study also evaluates the inﬂuence of varying time series lengths on prediction accuracy and reveals that longer time series provide a larger number of samples, resulting in more precise predictions. The maximum achieved accuracy reaches 89.3%. The ﬁndings and methodologies presented in this study offer valuable insights into the utilization of deep learning techniques for investigating and managing lake boundary changes, thereby contributing to the effective management and conservation of this signiﬁcant ecosystem.


Introduction
Changes in the Earth's surface cover and its man-made development have caused various impacts on ecosystems and environmental processes on a local, regional, and global scale [1][2][3]. The detection of lake boundaries is extremely important for predicting future lake boundary changes and supporting future lake protection policy decisions [4]. By studying changes in the lake boundary, it is possible to monitor the impact of various factors on the lake. This information helps assess the ecological integrity of the lake and identify any potential threats and is critical for the effective management and conservation of freshwater resources [5][6][7]. Remote sensing has become the main source of lake spatial information [8,9], and remote sensing images intuitively show the spatial distribution and dynamic change process of natural lakes. Remote sensing data reduce the manual work of collecting data on-site and reduce the cost of obtaining geographic information [10,11]. With the development of remote sensing technology and digital image processing technology, computer-based remote sensing image processing has recently become popular [12,13]. after images, and fits the corresponding evolution field. STN then predicts the change in surface coverage based on the resulting evolution field. Taking Lake Urmia as a case, this study performs image segmentation and image grayscale filling pre-processing operations on small sensing images in some areas of panoramic remote sensing images of Lake Urmia and then inputs them into the model designed in this paper for prediction.

Study Area
Lake Urmia is the largest lake in Iran, located in the basin between the provinces of East Azerbaijan and West Azerbaijan in the northwest of Iran, and is the second largest saltwater lake on earth, as shown in Figure 1. The lake once covered an area of 5200 square kilometers and was 16 m deep. Environmental changes have caused a 70% reduction in the surface area of Lake Urmia between 2002 and 2016. The ecological environment has undergone tremendous changes, the salinity of the lake water has increased, and the number of migratory birds and organisms in the lake has decreased. context information of each pixel in the image, making it possible to complete feature extraction and change prediction in one network at the same time. Firstly, U-Net processes a pair of consecutive images in the time series image, extracts the changes between the before and after images, and fits the corresponding evolution field. STN then predicts the change in surface coverage based on the resulting evolution field. Taking Lake Urmia as a case, this study performs image segmentation and image grayscale filling pre-processing operations on small sensing images in some areas of panoramic remote sensing images of Lake Urmia and then inputs them into the model designed in this paper for prediction.

Study Area
Lake Urmia is the largest lake in Iran, located in the basin between the provinces of East Azerbaijan and West Azerbaijan in the northwest of Iran, and is the second largest saltwater lake on earth, as shown in Figure 1. The lake once covered an area of 5200 square kilometers and was 16 m deep. Environmental changes have caused a 70% reduction in the surface area of Lake Urmia between 2002 and 2016. The ecological environment has undergone tremendous changes, the salinity of the lake water has increased, and the number of migratory birds and organisms in the lake has decreased.

Dataset
According to the research objectives, this paper selects two datasets for experiments. The data used are satellite image data obtained from Google Earth Pro software. The first dataset is the panoramic remote sensing image of Lake Urmia to verify the effectiveness of this model in predicting the overall evolution trend of cover; the second dataset is the remote sensing images of some areas of Lake Urmia to test the ability of this model to capture regional details.

Dataset 1: Sequence of Panoramic Historical Images of Lake Urmia
In this study, the historical images of the Google Earth remote sensing satellite of Lake Urmia from 1996 to 2014 are arranged in chronological order to form the first time series dataset, with 19 scenes. Each scene is taken on 30 December of that year, and the image size is 560 × 640. Latitudes are taken from 37°00′ N to 38°15′ N and longitudes from 46°10′ E to 44°50′ E, as shown in Figure 2. This is abbreviated as dataset 1.

Dataset
According to the research objectives, this paper selects two datasets for experiments. The data used are satellite image data obtained from Google Earth Pro software. The first dataset is the panoramic remote sensing image of Lake Urmia to verify the effectiveness of this model in predicting the overall evolution trend of cover; the second dataset is the remote sensing images of some areas of Lake Urmia to test the ability of this model to capture regional details.

Dataset 1: Sequence of Panoramic Historical Images of Lake Urmia
In this study, the historical images of the Google Earth remote sensing satellite of Lake Urmia from 1996 to 2014 are arranged in chronological order to form the first time series dataset, with 19 scenes. Each scene is taken on 30 December of that year, and the image size is 560 × 640. Latitudes are taken from 37 • 00 N to 38 • 15 N and longitudes from 46 • 10 E to 44 • 50 E, as shown in Figure 2. This is abbreviated as dataset 1. The data from 1996 to 2013 are mainly used as the data source for surface cover prediction, and the training set and test set are selected from them, while the 2014 image is not used for training, and is mainly used to test the predictive generalization ability of the trained model. During training, the model uses N images of the next year to predict its surface cover map in the next year, and the corresponding image of this year in the dataset is used as the ground truth. For example, the CNN prediction model used 1999 and 2000 images as input for training, the model learned the law of cover evolution and predicted its cover map in 2001, and the data in 2001 in the dataset are used as the ground truth to calculate loss.

Dataset 2: Sequence of Partial Historical Images of Lake Urmia
This section selects the historical images of the Google Earth remote sensing satellite in the north part of Lake Urmia from 2000 to 2014 to form the second time series dataset. Latitudes are taken from 38°05′ N to 38°15′ N and longitudes from 45°30′ E to 44°20′ E. The size of a single image is 1280 × 560, taking the year as the cycle unit, there are 15 scenes in total, and each scene is taken on 30 December of that year, as shown in Figure 3. This is abbreviated as dataset 2. The data from 1996 to 2013 are mainly used as the data source for surface cover prediction, and the training set and test set are selected from them, while the 2014 image is not used for training, and is mainly used to test the predictive generalization ability of the trained model. During training, the model uses N images of the next year to predict its surface cover map in the next year, and the corresponding image of this year in the dataset is used as the ground truth. For example, the CNN prediction model used 1999 and 2000 images as input for training, the model learned the law of cover evolution and predicted its cover map in 2001, and the data in 2001 in the dataset are used as the ground truth to calculate loss.  Figure 3. This is abbreviated as dataset 2.
Among them, the data from 2000-2013 are mainly used as the data source for surface cover prediction, and the training set and test set are selected from them, while the 2014 image is not used for training and is mainly used to test the predictive generalization ability of the trained model. For example, the U-Net prediction model used 2008 and 2009 images as input, the model learned the law of cover evolution and predicted its cover map in 2010, and the data in the dataset for 2010 are used as ground truth to calculate loss. Among them, the data from 2000-2013 are mainly used as the data source for surface cover prediction, and the training set and test set are selected from them, while the 2014 image is not used for training and is mainly used to test the predictive generalization ability of the trained model. For example, the U-Net prediction model used 2008 and 2009 images as input, the model learned the law of cover evolution and predicted its cover map in 2010, and the data in the dataset for 2010 are used as ground truth to calculate loss.

Image Segmentation
Image segmentation and binarization are essential pre-processing steps for remote sensing images [34,35]. In remote sensing, image segmentation is useful for separating different land cover types. By segmenting the image, it is easier to extract information from each segment, and it can also help reduce the computational cost of processing large images. In remote sensing, binarization is often used to separate the land cover from the background and to identify specific features of interest. This process can also help to reduce the complexity of the image and make it easier to analyze.
First, 15 remote sensing images were separated to distinguish the lake area from other areas. Image segmentation is conducive to the visualization of the evolution process of the lake boundary and is of great help to the follow-up model to learn the evolution process of the overburden layer. Figure 4 shows the results of remote sensing image segmentation and binarization in 2000.

Image Segmentation
Image segmentation and binarization are essential pre-processing steps for remote sensing images [34,35]. In remote sensing, image segmentation is useful for separating different land cover types. By segmenting the image, it is easier to extract information from each segment, and it can also help reduce the computational cost of processing large images. In remote sensing, binarization is often used to separate the land cover from the background and to identify specific features of interest. This process can also help to reduce the complexity of the image and make it easier to analyze.
First, 15 remote sensing images were separated to distinguish the lake area from other areas. Image segmentation is conducive to the visualization of the evolution process of the lake boundary and is of great help to the follow-up model to learn the evolution process of the overburden layer. Figure 4 shows the results of remote sensing image segmentation and binarization in 2000.

Image Grayscale Gradient Fill
The gray gradient filling method mainly includes three steps: In the first step, the coating boundary in the image gradually shrinks smoothly to the coating center through multiple continuous erosion operations, and the intermediate results after each shrinkage are saved. Image boundary erosion processing involves removing the edges of an image to reduce the impact of edge artifacts [36]. The erosion

Image Grayscale Gradient Fill
The gray gradient filling method mainly includes three steps: In the first step, the coating boundary in the image gradually shrinks smoothly to the coating center through multiple continuous erosion operations, and the intermediate results after each shrinkage are saved. Image boundary erosion processing involves removing the edges of an image to reduce the impact of edge artifacts [36]. The erosion algorithm of image A using erosion operator B can be expressed as Equation (1).
The erosion operator B is a circular operator with a radius of 3. For erosion operation , use erosion operator B to corrode binary image A once. The flow of erosion once is shown in Algorithm 1.

Algorithm 1: Erosion Algorithm
Input: Binary image A, erosion operator B Output: Erosion results A B

1.
Select the category, size, and center point z of the erosion operator B.

2.
Select the area A to be corroded of the original image A, and the remaining area is denoted as A − A , and the pixels set of A is denoted as N(p).

3.
Make the center point z of B coincide with an unselected point p in N(p), and then coincide other pixels of B with the corresponding pixel of A .

4.
Select an unselected point m with a pixel value of 1 in B, and find the corresponding pixel m in A that coincides with it. If the pixel value of m is 0, delete the point (p) in N(p), that is, the pixel value of point p is 0. Otherwise, continue to select another point m with a pixel value of 1 in B and repeat step 4 until all the pixels in B are traversed.
Combine the updated N(p) with A − A , and get erosion results A B.
The erosion operator is also called a structural element. Generally, cross, rectangle, circle, and ellipse can be selected. This paper uses a circular operator with a radius of 3 to sequentially corrode the 2000 images in the second dataset 50 times. Some of the results are shown in Figure 5. In the second step, the difference image of two adjacent shrinkage images is obtained by subtracting in sequence according to the erosion order. The pixel value corresponding to the i + 1 erosion result is subtracted from the i + 1 erosion result. The point set with a difference of 255 is the pixel point set deleted during the i + 1 erosion, and the result is displayed as the lake boundary. Figure 6 shows a partial difference image after the erosion of the binarization image of Lake Urmia in 2000.  In the second step, the difference image of two adjacent shrinkage images is obtained by subtracting in sequence according to the erosion order. The pixel value corresponding to the i + 1 erosion result is subtracted from the i + 1 erosion result. The point set with a difference of 255 is the pixel point set deleted during the i + 1 erosion, and the result is displayed as the lake boundary. Figure 6 shows a partial difference image after the erosion of the binarization image of Lake Urmia in 2000.
The third step is to sort the obtained difference image set in turn, fill the gray level from front to back according to the linear change, and change the pixel value of the white points in the difference image. Filling the lakes in the segmented binarized image with grayscale gradients can add additional contextual information to the image [37].
In the second step, the difference image of two adjacent shrinkage images is obtained by subtracting in sequence according to the erosion order. The pixel value corresponding to the i + 1 erosion result is subtracted from the i + 1 erosion result. The point set with a difference of 255 is the pixel point set deleted during the i + 1 erosion, and the result is displayed as the lake boundary. Figure 6 shows a partial difference image after the erosion of the binarization image of Lake Urmia in 2000. The third step is to sort the obtained difference image set in turn, fill the gray level from front to back according to the linear change, and change the pixel value of the white points in the difference image. Filling the lakes in the segmented binarized image with grayscale gradients can add additional contextual information to the image [37].
The gray value of the pixel set in the first difference image is the highest, close to white, and the gray value of the pixel set in the last difference image is the lowest, close to black. The erosion times are 50, the gray value of the outermost boundary is the highest, which is 200, and the gray level series is 2. Therefore, the gray value of the innermost layer is the lowest, which is 100. The erosion times, gray level series, and the highest gray value can be selected according to the size of the remote sensing image. The algorithm flow is shown in Algorithm 2.  The gray value of the pixel set in the first difference image is the highest, close to white, and the gray value of the pixel set in the last difference image is the lowest, close to black. The erosion times are 50, the gray value of the outermost boundary is the highest, which is 200, and the gray level series is 2. Therefore, the gray value of the innermost layer is the lowest, which is 100. The erosion times, gray level series, and the highest gray value can be selected according to the size of the remote sensing image. The algorithm flow is shown in Algorithm 2.

Algorithm 2: Grayscale Gradient Fill Algorithm
According to algorithm 1: the i-th erosion operation was performed to P, and the obtained result is Q.

3.
Let R i = P − Q, that is, the corresponding pixel values are subtracted, and the difference image is R i , and the set of points with a difference of 255 is denoted as N(p).

4.
Set all pixels belonging to N(p) in the image. R i to have pixel values G.
Let A = R 1 + R 2 + . . . + R 49 + R 50 , that is, the corresponding pixel values of 50 differential images R i are added together to obtain a grayscale gradient-filled image A .

Wave Filtering
After the image filled with a gray gradient is filtered, the gray change tends to be smooth. It is beneficial to the subsequent gradient descent algorithm of the model to obtain the optimal solution [38].
The filtering algorithm uses Gaussian filtering, and Equation (2) is a two-dimensional Gaussian function for calculating the weights of neighboring pixels. The size of the Gaussian kernel is 5 × 5.

Method
The dynamic evolution prediction algorithm flow of the remote sensing image lake boundary designed in this study is shown in Figure 7. It can be roughly divided into five steps: image pair pre-processing, evolution feature extraction, feature fusion, spatial transformation, and coverage prediction.

Method
The dynamic evolution prediction algorithm flow of the remote sensing image lake boundary designed in this study is shown in Figure 7. It can be roughly divided into five steps: image pair pre-processing, evolution feature extraction, feature fusion, spatial transformation, and coverage prediction. (1) Image pair pre-processing. Input two remote sensing images that were taken every other year. The pre-processing methods are used for cover segmentation, image binarization, erosion, difference, gray gradient filling, and filtering for subsequent evolutionary feature extraction; (2) Evolutionary feature extraction. Input the pre-processed image pair in step (1), increase the dimension of the input image, and then splice the image pair along the increased dimension. The input feature extraction module is composed of multiple convolution layers with the same convolution kernel size to extract the multi-scale features of the input image pair; (3) Feature fusion. The features extracted in the previous step are decoded through the multi-layer upper sampling layer, and the decoded features are fused with the multiscale features of the same size extracted in step (2), which are input into the subsequent layer to learn and obtain the evolution field required for prediction; (1) Image pair pre-processing. Input two remote sensing images that were taken every other year. The pre-processing methods are used for cover segmentation, image binarization, erosion, difference, gray gradient filling, and filtering for subsequent evolutionary feature extraction; (2) Evolutionary feature extraction. Input the pre-processed image pair in step (1), increase the dimension of the input image, and then splice the image pair along the increased dimension. The input feature extraction module is composed of multiple convolution layers with the same convolution kernel size to extract the multi-scale features of the input image pair; (3) Feature fusion. The features extracted in the previous step are decoded through the multi-layer upper sampling layer, and the decoded features are fused with the multi-scale features of the same size extracted in step (2), which are input into the subsequent layer to learn and obtain the evolution field required for prediction; (4) Spatial transformation. Using the evolution field learned in the previous module, the corresponding spatial transformation operation is performed on the subsequent time series image y of the input image pairy, and the image obtained after spatial transformation is the predicted land cover map z_pred of the next time series target area; (5) Cover prediction. Enter ground truth, i.e., the next sequential image z of y, compare the output z_pred of step (4), calculate to obtain the loss, and train the parameters of the network with the backpropagation.
The specific content is introduced in detail below.

Evolutionary Feature Extraction and Fusion Module
Different from the traditional change detection methods, the use of a change map can only show the characteristics of the limited target area. This model uses a CNN model similar to the U-Net [39] structure to fit the change mapping between two temporal remote sensing images and retains the position spatial structure and context information of each pixel in the image. U-Net is a CNN architecture designed for semantic segmentation tasks. The U-Net architecture has skip connections connecting the encoder and decoder layers at multiple resolutions. These connections enable the decoder to access low-level and high-level features, which help preserve spatial details and improve segmentation accuracy. The model structure is shown in Figure 8. similar to the U-Net [39] structure to fit the change mapping between two temporal remote sensing images and retains the position spatial structure and context information of each pixel in the image. U-Net is a CNN architecture designed for semantic segmentation tasks. The U-Net architecture has skip connections connecting the encoder and decoder layers at multiple resolutions. These connections enable the decoder to access low-level and high-level features, which help preserve spatial details and improve segmentation accuracy. The model structure is shown in Figure 8. One specific network structure is used in this experiment, but other network frameworks and structures may also be applicable. The same number of network layers and convolution kernel parameters are not our requirements. The change mapping modeling between image pairs fitted by the U-Net model is expressed as Equation (3): Among them, , are two temporal remote sensing images, is the evolution field fitting the displacement of each pixel of , , and is the network parameter, which is the core of the convolutional layer. In other words, for each pixel ∈ Ω in , ( ) indicates One specific network structure is used in this experiment, but other network frameworks and structures may also be applicable. The same number of network layers and convolution kernel parameters are not our requirements. The change mapping modeling between image pairs fitted by the U-Net model is expressed as Equation (3): Among them, x, y are two temporal remote sensing images, ϕ is the evolution field fitting the displacement of each pixel of x, y, and θ is the network parameter, which is the core of the convolutional layer. In other words, for each pixel p ∈ Ω in x, ϕ(p) indicates the displacement of point p between x, y. The displacement makes [x·ϕ](p) and y(p) correspond to the same pixel position. Figure 9 shows the network structure of the evolutionary feature extraction module, which can be roughly divided into four structures: input, encoder, decoder, and evolutionary field.
Different initialization methods are applied to the convolution layer and evolution field in the encoder and decoder, which are described in detail below.
(1) Input. The network receives a single input formed by connecting x, and y into a two-channel 2D image; (2) Encoder. It is composed of four convolution layers. In the continuous convolution layer, we set the convolution kernel size to 3 × 3, and the number of channels is 16, 32, 32, and 32 in turn; (3) Decoder. In the decoding stage, we alternately use up-sampling, convolution, and activation functions to increase the dimensions of the features learned in the encoding stage. The size of the convolution kernel is still 3 × 3, and the number of channels in the convolution layer is 32, 32, 32, 32, 32; (4) Evolution field. At the end of the network, we use the convolution layer with the same size as (1) input to fit the field. Through the multi-scale features extracted in steps (2) and (3), the network learns the displacement law of all pixels in the twodimensional direction of the two time series images and maps it into an evolving field ϕ of 560 × 640 × 2, where the tensor with the size of 560 × 640 × 1 of the first channel corresponds to the left and right displacement of each pixel, and the tensor with the same size of the second channel corresponds to the up and down displacement of each pixel.
Land 2023, 12, x FOR PEER REVIEW 10 of 23 the displacement of point between , . The displacement makes[ • ]( ) and ( ) correspond to the same pixel position. Figure 9 shows the network structure of the evolutionary feature extraction module, which can be roughly divided into four structures: input, encoder, decoder, and evolutionary field. Different initialization methods are applied to the convolution layer and evolution field in the encoder and decoder, which are described in detail below.
(1) Input. The network receives a single input formed by connecting x, and y into a twochannel 2D image; (2) Encoder. It is composed of four convolution layers. In the continuous convolution layer, we set the convolution kernel size to 3 × 3, and the number of channels is 16, 32, 32, and 32 in turn; (3) Decoder. In the decoding stage, we alternately use up-sampling, convolution, and activation functions to increase the dimensions of the features learned in the encoding stage. The size of the convolution kernel is still 3 × 3, and the number of channels in the convolution layer is 32, 32, 32, 32, 32; (4) Evolution field. At the end of the network, we use the convolution layer with the same size as (1) input to fit the field. Through the multi-scale features extracted in steps (2) and (3), the network learns the displacement law of all pixels in the twodimensional direction of the two time series images and maps it into an evolving field of 560 × 640 × 2, where the tensor with the size of 560 × 640 × 1 of the first channel corresponds to the left and right displacement of each pixel, and the tensor with the same size of the second channel corresponds to the up and down displacement of each pixel.

Spatial Transformation Module
The STN aims to enable neural networks to learn spatial transformations and perform geometric manipulations on input data, making the network more robust to variations in scale, rotation, translation, and other transformations. The STN can be integrated into various neural network architectures and tasks, allowing the network to learn to focus on relevant regions, handle data augmentation, and adapt to different input variations. The spatial change module receives the output of the evolution feature extraction module, which is the predicted evolution field. After spatially transforming the second image y of x, y, the predicted image z_pred is outputted. The flowchart is shown in Figure 10.
The model adjusts the network parameters by minimizing y·ϕ, that is, the difference between z_pred and z. To use gradient descent-based methods to minimize the loss function loss, we construct a differentiable operation based on the spatial transformation network to calculate z_pred. For each pixel p in z_pred, we calculate that the sub-pixel position of the pixel before the evolution field transformation is p = p + ϕ(p). As pixels are only defined in integer positions, we use the values of eight neighboring pixels to perform linear interpolation, as shown in Equation (4): variations in scale, rotation, translation, and other transformations. The STN can be integrated into various neural network architectures and tasks, allowing the network to learn to focus on relevant regions, handle data augmentation, and adapt to different input variations. The spatial change module receives the output of the evolution feature extraction module, which is the predicted evolution field. After spatially transforming the second image of , , the predicted image _ is outputted. The flowchart is shown in Figure 10.

The model adjusts the network parameters by minimizing
• , that is, the difference between _ and . To use gradient descent-based methods to minimize the loss function loss, we construct a differentiable operation based on the spatial transformation network to calculate _ . For each pixel in _ , we calculate that the sub-pixel position of the pixel before the evolution field transformation is ′ = + ( ) . As pixels are only defined in integer positions, we use the values of eight neighboring pixels to perform linear interpolation, as shown in Equation (4): Among them, ( ′) is the neighborhood pixel set of pixel ′, d is the dimension of the image, including , dimensions, and · represents the spatial transformation operation.

Loss Function
The model can be trained with any differentiable loss function. In this study, the loss function proposed by us consists of the following Equations (5) and (6): (1) Loss of similarity (2) Penalty items ℎ (∅) Among them, � = • ∅ is the nth image predicted by the model, where · represents the spatial transformation operation, and is the real n-th image, is an image pixel set, ( , ) is the pixel coordinate of any point.
is the mean square error of pixel values of

Loss Function
The model can be trained with any differentiable loss function. In this study, the loss function proposed by us consists of the following Equations (5) and (6): (1) Loss of similarity l sim MSE P n , (2) Penalty items l smooth (∅) Among them, ∼ P n = P n ·∅ is the nth image predicted by the model, where · represents the spatial transformation operation, and P n is the real n-th image, ω is an image pixel set, (u, v) is the pixel coordinate of any point. l sim is the mean square error of pixel values of all pixels of the predicted image and the real image, l smooth (∅) is used to smooth the cover evolution field.
The final loss function is shown in Equation (7): Among them, α is the regularization parameter.

Experiment Setting
This section divides a total of 19 images in dataset 1 into a group of 3 in chronological order, and splits them into 17 samples, as the experiment 1 and experiment 3 datasets Col 1 , which are {P 1 , P 2 /P 3 }, {P 2 , P 3 /P 4 } . . . . . . {P 17 , P 18 /P 19 }. The first two images in each sample are input data, and the last image is ground truth. The first 16 samples are used as the training set, and the last group is used as the test set to evaluate their performance.
Dataset 2 has 15 images, which are similarly split into 14 samples, as the dataset of experiment 2 and experiment 4 Col 2 . The first 13 samples are training data, and the last sample is used to test the model's ability to predict the coverage and preserve detail.
To verify whether the model built in this paper is effective in the task of predicting the evolution of the cover, experiments 1 and experiment 3 are set up. To analyze the impact of the length of the input image sequence on the prediction results, experiment 2 and experiment 4 are set up. Experiment 1 compared the prediction effect of using the time series images of the first 2 years, 5 years, 10 years, and 15 years in Col 1 . Experiment 2 compared the prediction effect of using the time series images of the first 2 years, 5 years, 6 years, and 7 years in Col 2 .
Experiment 3 compared the first 4 years, 5 years. . . 16, and 17 years of Col 1 to predict the effect of coverage in 2014. Experiment 4 used years 5, 6, 7, and 8 of time series images in Col 2 to predict the effect of the 2014 cover.
In addition, experiment 5 in the paper also gives a comparison of the prediction effect before and after data pre-processing to verify the effectiveness of the proposed preprocessing method. There is a pre-experiment that gives the comparative experimental results of the influence of the value of the regularization parameter α of the evolution field on the prediction results.
As for training details, the network is trained by the batch gradient descent method (BGD). The learning rate is set to 1 × 10 −4 , and the training cycle is 10,000 epochs to ensure network convergence.

Evaluating Indicators
This article uses ACC, MSE, MAE, and DICE as the experimental evaluation indicators. We use the difference image between the predicted result and the true value image to visually show the gap in the predicted result. See Equation (7) for MSE, Equation (8) for ACC, and Equation (9) for MAE: MAE P n , where N correct is the number of pixels in the predicted image with the same pixel value at the corresponding position of the real image, and N total is the total number of pixels. Use DICE as an accuracy metric for evaluating image segmentation, as shown in Equation (10).
DICE is a similarity metric commonly employed to quantify the resemblance between two samples. It is frequently utilized for assessing the similarity of two samples. The DICE metric is computed as depicted in Figure 11, where the red region corresponds to the actual lake-covered area denoted as T 1 . The remaining red portions represent the noncovered areas, designated as T 0 . The blue region corresponds to the predicted lake area, denoted as P 1 , while the remaining blue portions encompass the predicted non-covered areas designated as P 0 . We assume the lake region to be the positive sample and the areas outside the coverage to be the negative sample.
(1) TP: true positive, the predicted sample is positive, and the real sample is also positive; (2) TN: true negative, predicted as a negative sample, and true as a negative sample; (3) FP: false negative, predicted as a positive sample, true as a negative sample; (4) FN: false negative. It is predicted to be a negative sample.
where N TP is the number of pixels in the TP set, N TN is the number of pixels in the TN set, N FP is the number of pixels in the FP set, and N FN is the number of pixels in the FN set. between two samples. It is frequently utilized for assessing the similarity of two samples. The DICE metric is computed as depicted in Figure 11, where the red region corresponds to the actual lake-covered area denoted as 1 . The remaining red portions represent the non-covered areas, designated as 0 . The blue region corresponds to the predicted lake area, denoted as 1 , while the remaining blue portions encompass the predicted noncovered areas designated as 0 . We assume the lake region to be the positive sample and the areas outside the coverage to be the negative sample. (1) TP: true positive, the predicted sample is positive, and the real sample is also positive; (2) TN: true negative, predicted as a negative sample, and true as a negative sample; (3) FP: false negative, predicted as a positive sample, true as a negative sample; (4) FN: false negative. It is predicted to be a negative sample. where is the number of pixels in the set, is the number of pixels in the set, is the number of pixels in the set, and is the number of pixels in the set.

Parameter Debugging
This paper conducts a comparative experiment on the selection of parameters to achieve a better prediction effect. Figure 12 shows the effect of the evolution field regularization parameter α in the test loss function on the prediction task.

Parameter Debugging
This paper conducts a comparative experiment on the selection of parameters to achieve a better prediction effect. Figure 12 shows the effect of the evolution field regularization parameter α in the test loss function on the prediction task. (1) TP: true positive, the predicted sample is positive, and the real sample is also positive; (2) TN: true negative, predicted as a negative sample, and true as a negative sample; (3) FP: false negative, predicted as a positive sample, true as a negative sample; (4) FN: false negative. It is predicted to be a negative sample. where is the number of pixels in the set, is the number of pixels in the set, is the number of pixels in the set, and is the number of pixels in the set.

Parameter Debugging
This paper conducts a comparative experiment on the selection of parameters to achieve a better prediction effect. Figure 12 shows the effect of the evolution field regularization parameter α in the test loss function on the prediction task.  By comparison, it can be seen that the effect is better when α is 0.01. Through reasonable settings, parameters can make the evolution field extract the displacement value of the corresponding pixel and fit the real data.

Experiment 2
Similarly, experiment 2 compared the predictive performance using time se images from the first 2 years, 5 years, 6 years, and 7 years of collection 2 ( 2 ).  In each case, the comparison demonstrates the predictive performance of the model based on different learning durations and input data lengths for capturing land cover change patterns. The ground truth images from subsequent years are used for evaluating the accuracy of the predictions. The evaluation indicators of the prediction results are shown in Table 1.

Experiment 2
Similarly, experiment 2 compared the predictive performance using time series images from the first 2 years, 5 years, 6 years, and 7 years of collection 2 (Col 2 ). Figure   images from the years 2009 and 2010. The prediction is performed using images from 2010 to 2011, and the ground truth is the panoramic image of Lake Urmia from 2012.      Compared with directly predicting the land cover map of the next year according to the change law of the previous year, the model built in this paper is more mobile after learning the change law for many years. When predicting the overall and detailed coverage changes, the change rules learned from the training set can be adaptively changed and can be transferred to the new target image without any additional learning process to achieve better prediction results.

Timing Length Analysis and Comparison
To analyze the influence of the time series length on the prediction results, the training set of this section uses the time series images of the previous N years, and the same image pair [2012,2013] is used to predict the coverage in 2014. The dataset of experiment 3 is Col 1 , N = 4, 5, 6, . . ., 18. Figure 15 and Table 3 show some comparative experimental results when N = 4, 7, 10, 14, and 18. The results using more data have better prediction accuracy.

Timing Length Analysis and Comparison
To analyze the influence of the time series length on the prediction results, the training set of this section uses the time series images of the previous N years, and the same image pair [2012,2013] is used to predict the coverage in 2014. The dataset of experiment 3 is 1 , N = 4, 5, 6, ..., 18. Figure 15 and Table 3 show some comparative experimental results when N = 4, 7, 10, 14, and 18. The results using more data have better prediction accuracy.   When N is relatively small, the coverage predicted by the model is roughly similar to the coverage in 2013, and the ability to extract the existing time series features is not strong, so the corresponding migration is also poor. With the increase in N, the prediction result of the model is closer to the true value image, and there is less gap in the difference image. The prediction result of the lake center is more accurate than that of the smaller N, and the contour is more similar.  Figure 16 is a graph showing the changing trend of ACC and DICE as N increases. It can be seen that as N increases, ACC and DICE generally maintain an upward trend, while MSE generally maintains a downward trend, indicating that the prediction accuracy of the model is gradually improving. Therefore, increasing the time sequence length is beneficial for the model to extract the temporal and spatial characteristics of the coverage, and it has stronger mobility when applied to new data so that it can better predict the future coverage of remote sensing images. The dataset of experiment 4 is 2 , N = 5, 6, 7, 8. Figure 17 and Table 4 show the comparative experimental results. It can be seen from the experimental results that as the length of the time series increases, the number of samples increases, and the predicted results are more accurate, indicating that an appropriate increase in the historical data during the experiment can help improve the accuracy of the prediction.  The dataset of experiment 4 is Col 2 , N = 5, 6, 7, 8. Figure 17 and Table 4 show the comparative experimental results. It can be seen from the experimental results that as the length of the time series increases, the number of samples increases, and the predicted results are more accurate, indicating that an appropriate increase in the historical data during the experiment can help improve the accuracy of the prediction. The dataset of experiment 4 is 2 , N = 5, 6, 7, 8. Figure 17 and Table 4 show the comparative experimental results. It can be seen from the experimental results that as the length of the time series increases, the number of samples increases, and the predicted results are more accurate, indicating that an appropriate increase in the historical data during the experiment can help improve the accuracy of the prediction.

Data Pre-Processing Effect Verification
Experiment 5 gives a comparison of the results before and after the dataset preprocessing. Figure 18 and Table 5 show the results of prediction using the original image and the pre-processed dataset 1 . It can be seen from the experimental results that the pre-processed image experimental results are more intuitive and have clear boundaries.

Data Pre-Processing Effect Verification
Experiment 5 gives a comparison of the results before and after the dataset preprocessing. Figure 18 and Table 5 show the results of prediction using the original image and the pre-processed dataset Col 1 . It can be seen from the experimental results that the pre-processed image experimental results are more intuitive and have clear boundaries.   Since the unprocessed original image has more noise, the accuracy of the prediction decreases after increasing the length of the training set, while the prediction accuracy of the pre-processed image increases with the increase in N. It can be seen from the experimental results that the image experimental results after data pre-processing are more intuitive, and the boundaries are clear.
Using the pre-processed image to train the evolution field and applying it to the original image for prediction and simulation is a good approach to fitting real coverage situations. The experimental results of applying the deformation field obtained from training the pre-processed images of the first 18 time steps of Col1 to simulate the original image are shown in Figure 19 and Table 6. Figure 18. Comparison of experimental results using pre-processed images and original images: (a) pre-processed ground truth image for the year 2014; (b) predicted results using pre-processed images from 5 years; (c) difference image between (a,b); (d) predicted results using pre-processed images from 9 years; (e) difference image between (a,d); (f) predicted results using pre-processed images from 13 years; (g) difference image between (a,f); (h) original ground truth image for the year 2014; (i) predicted results using original images from 5 years; (j) difference image between (h,i); (k) predicted results using original images from 9 years; (l) difference image between (h,k); (m) predicted results using original images from 13 years; (n) difference image between (h,m). Since the unprocessed original image has more noise, the accuracy of the prediction decreases after increasing the length of the training set, while the prediction accuracy of the pre-processed image increases with the increase in N. It can be seen from the experimental results that the image experimental results after data pre-processing are more intuitive, and the boundaries are clear.
Using the pre-processed image to train the evolution field and applying it to the original image for prediction and simulation is a good approach to fitting real coverage situations. The experimental results of applying the deformation field obtained from training the pre-processed images of the first 18 time steps of Col1 to simulate the original image are shown in Figure 19 and Table 6.   Figure 19, it can be observed that using pre-processed images for trainin and then applying the obtained evolution field to the original image for simulating actua predictions results in better performance compared to training on the original image. Th simulation results are also better compared to using pre-processed images for testing and are more closely aligned with the real image.

Discussion
U-Net is a semantic segmentation network based on fully convolutional network (FCN), originally designed to address segmentation tasks in medical imaging [40]. In th field of remote sensing imagery, the U-Net network has been widely utilized for task such as object detection and instance segmentation [41][42][43][44]. While U-Net-based prediction techniques have found extensive applications in medical disease prediction, they hav been less utilized in the context of geographic information. This is mainly due to th availability of large-scale datasets for medical diseases, whereas specific natura phenomena such as lake boundary changes lack sufficient datasets for training. Han et a [45], in their study on convective precipitation nowcasting, employed U-Net to build prediction model. The model took radar images as inputs and transformed the prediction problem into an image-to-image translation task in deep learning. The aforementioned research demonstrated the capability of U-Net as a prediction model in geographi  The left image shows the result of training on the original image of the first 18 time steps and testing on the original image of the last time step. The middle image shows the ground truth image of 2014, and the right image shows the experimental result of applying the deformation field obtained from training the pre-processed images of the first 18 time steps of Col1 to simulate the original image of the last time step.
Based on Figure 19, it can be observed that using pre-processed images for training and then applying the obtained evolution field to the original image for simulating actual predictions results in better performance compared to training on the original image. The simulation results are also better compared to using pre-processed images for testing and are more closely aligned with the real image.

Discussion
U-Net is a semantic segmentation network based on fully convolutional networks (FCN), originally designed to address segmentation tasks in medical imaging [40]. In the field of remote sensing imagery, the U-Net network has been widely utilized for tasks such as object detection and instance segmentation [41][42][43][44]. While U-Net-based prediction techniques have found extensive applications in medical disease prediction, they have been less utilized in the context of geographic information. This is mainly due to the availability of large-scale datasets for medical diseases, whereas specific natural phenomena such as lake boundary changes lack sufficient datasets for training. Han et al. [45], in their study on convective precipitation nowcasting, employed U-Net to build a prediction model. The model took radar images as inputs and transformed the prediction problem into an imageto-image translation task in deep learning. The aforementioned research demonstrated the capability of U-Net as a prediction model in geographic information studies and showcased the research prospects when focusing on specific natural phenomena.
This study demonstrates the promise of the U-Net model in the task of lake boundary prediction through the Urmia Lake case. Its ability to automatically extract features and its architecture, which allows for the retention of spatial information, make it well-suited for this type of application. However, there is still room for further improvement and exploration in this area.
Although this study can complete the prediction task for lake boundaries, its limitations are strong. U-Net is a powerful architecture for spatial tasks, but predicting temporal dependencies between consecutive images may require more complex models or ensemble recurrent or attention-based mechanisms. In this study, the interval between data samples is relatively long, and some details may be missed in the change rule calculated on the change scale with a time interval of one year. In the remote sensing image, the edge of the lake is not strictly clear, and local cloud cover may also lead to inaccurate boundary delineation. In the data pre-processing stage, we processed the images into binary images, strictly distinguishing the lake surface from other areas, which resulted in the loss of a lot of land cover information. The lake surface information was also processed into an average grayscale distribution from the center to the edge without considering the depth of the lake and other water body features. However, the above processing can more accurately delineate the lake boundaries, and for our research purposes, this processing method can indeed obtain more accurate prediction results.
One potential area for future research is the integration of U-Net with recurrent neural networks (RNNs) or other deep learning algorithms. This could improve the model's ability to capture the temporal dynamics of lake boundary changes, which is an important factor in predicting future trends. In follow-up research, we will consider combining the predictions of various U-Net models with different architectures or training strategies to improve the accuracy and robustness of the overall predictions. From the perspective of data, in future research, we will try to increase the data size, reduce the time interval of time series images, and find the highest efficiency ratio between calculation and accuracy. At the same time, the research on the prediction of lake boundary change can not only be limited to images, but more relevant factors can be combined in future research. By combining U-Net's strengths in spatial feature extraction with the strengths of other models in capturing temporal patterns, a more comprehensive and accurate prediction model could be developed.

Conclusions
In this study, we design a novel prediction model of lake boundary change by combining U-Net with STN, taking Lake Umir as a case study. Different from the traditional algorithms that detect first and then predict, the proposed method uses an end-to-end prediction approach and can visually display the spatial location information of the predicted cover at the same time. The model includes two modules: the extraction and fusion module for extracting the features of cover change from remote sensing images and the spatial transformation module for fitting future cover. The original tasks that need to be completed twice are fused into a model, which improves prediction efficiency and saves resources. At the same time, the model can adaptively change the change rules learned from the training set and be transferred to the new target image without any additional learning process, which has good transferability and provides a basis for the application of the model to other cases. The experimental results show the importance of time series length for prediction and the effectiveness of data pre-processing methods. Compared with traditional models such as Markov models and cellular automata, the proposed model can dig deeper into the details and automatically fit the time series data trend. In addition, unlike existing deep learning methods, the proposed model can simultaneously detect and predict land cover change.
In summary, the use of deep learning models such as U-Net has great potential in land use/cover prediction areas such as lake boundaries. The method proposed in this study for natural lakes can predict changes in the next year with 89% accuracy using continuous images of about ten years. Its ability to automatically extract elements and capture spatiotemporal information can improve the accuracy of prediction, which is conducive to making better land-use policy decisions. In decision-making such as lake ecological protection, the comparison between the results after the implementation of protection measures and the results predicted by the model can be used as an evaluation index for the effectiveness of the measures. Future research should explore ways to further improve these models and integrate them with other algorithms to achieve more effective predictions. Data Availability Statement: The data presented are available upon request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.