LaeNet: A Novel Lightweight Multitask CNN for Automatically Extracting Lake Area and Shoreline from Remote Sensing Images

Abstract: Variations in lake area and shoreline are effective indicators of hydrological and climatic change. We therefore study how to extract lake area and shoreline from remote sensing images automatically and simultaneously, and formulate the problem as multitask learning. Unlike existing models that use a deep and complex network architecture as the backbone for feature extraction, we present LaeNet, a novel end-to-end lightweight multitask fully convolutional network without downsampling that automatically extracts lake area and shoreline from remote sensing images. Landsat-8 images over Selenco and its vicinity in the Tibetan Plateau are used to train and evaluate the model. On the testing image patches, the model achieves an Accuracy of 0.9962, Precision of 0.9912, Recall of 0.9982, F1-score of 0.9941, and mIoU of 0.9879, which match or exceed those of mainstream semantic segmentation models (UNet, DeepLabV3+, etc.). In particular, the running time per epoch and the model size are only 6 s and 0.047 megabytes, a significant reduction compared with the other models. Finally, we conducted fieldwork to collect the in-situ shoreline position for one typical part of lake Selenco to further evaluate the model. The validation indicates high accuracy (DRMSE: 30.84 m, DMAE: 22.49 m, DSTD: 21.11 m), a deviation of only about one Landsat-8 pixel. LaeNet can potentially be extended to area segmentation and edge extraction tasks in other application fields.


Introduction
Lakes are an important component of the global terrestrial ecosystem and represent a key water resource for human beings. The expansion or shrinkage of lakes is affected by regional as well as global conformation and climate changes [1]. Hence, lake area and shoreline variation can be taken as an indicator to monitor current climate fluctuation and predict future climate change [2]. For example, since the 1990s, some lakes in the Tibetan Plateau have expanded significantly [3]. This is induced by: (a) increasing precipitation rates [4,5]; (b) melting snow and glaciers [6]; and (c) decreasing lake evaporation [7]. Both expansion and shrinkage of lakes have a great influence on the regional environment and inhabitants. At present, frequent measurement of lake dynamics using remote sensing images is necessary for the conservation and utilization of lakes as well as for understanding climate change [8]. It is also a practical and effective method for numerous lake-change applications [9], including lake water storage changes [10], lake level changes [11][12][13], etc. For lake change detection, accurate extraction of lake area and shoreline is a crucial step.
Many studies have focused on lake area segmentation or shoreline extraction using remote sensing images, which can observe a wide area periodically every few days. For example, schemes for investigating the spatial distribution of lake area and the temporal changes of glacial lake shorelines through manual digitization of Landsat data with GIS technologies were presented in [14,15]. Although traditional manual digitization guarantees consistent examination and quality control to some extent, it requires a large amount of domain knowledge, time, and cost, while the accuracy may still not be high. Apart from manual digitization, many techniques can derive the lake area or shoreline from satellite images automatically or semiautomatically. The dominant methods are threshold approaches [16][17][18][19][20][21][22][23] based on a water index, such as the Normalized Difference Water Index (NDWI) [24] and the Modified Normalized Difference Water Index (MNDWI) [25], which are computed as normalized differences of appropriate bands. Such threshold methods are easy to compute and fast, but the magnitude of their errors varies significantly, because appropriate thresholds are difficult to set when water and land are mixed within pixels.
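As a minimal sketch of the threshold approach described above, the following Python snippet computes NDWI (green and near-infrared bands) and MNDWI (green and short-wave infrared bands) and applies a fixed threshold. The function names, the toy reflectance values, the zero threshold, and the small epsilon that guards the division are illustrative assumptions and are not taken from the cited studies.

```python
import numpy as np


def normalized_difference(a, b, eps=1e-12):
    """Return (a - b) / (a + b); assumes non-negative reflectance values."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return (a - b) / np.maximum(a + b, eps)


def ndwi(green, nir):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return normalized_difference(green, nir)


def mndwi(green, swir1):
    """Modified NDWI: (Green - SWIR) / (Green + SWIR)."""
    return normalized_difference(green, swir1)


def water_mask(index, threshold=0.0):
    """Pixels whose index exceeds the (hypothetical) threshold count as water."""
    return index > threshold


# Toy 2x2 scene: the top row behaves like water, the bottom row like land.
green = np.array([[0.10, 0.08], [0.05, 0.06]])
nir = np.array([[0.02, 0.03], [0.30, 0.25]])
mask = water_mask(ndwi(green, nir))
print(mask)
```

In practice the threshold has to be tuned per scene, which is exactly the weakness of these methods for mixed water-land pixels.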
To address the variation in optimal thresholds and reduce errors, many machine learning algorithms [26][27][28][29][30][31][32][33][34] have been used for lake water body extraction. For example, Random Forest (RF) [26], Support Vector Machine (SVM) [27][28][29], and Artificial Neural Networks (ANNs) [30][31][32] have been extensively applied in lake level prediction and lake quality mapping. Recently, deep learning [35][36][37][38][39][40][41] has attracted great attention in remote sensing image processing, especially deep Convolutional Neural Network (CNN)-based semantic segmentation [42][43][44][45]. Such semantic segmentation methods [46][47][48][49] assign a category label to each pixel of a remote sensing image. However, these studies apply deep learning to perform one task first and then perform the other task with postprocessing tools. This forgoes the mutual feedback between tasks that multitask deep learning models can exploit, and can cause models to get stuck in local minima. To overcome this, Liu et al. [50] proposed an end-to-end semantic segmentation network based on UNet [51] that is reinforced with spatial boundary information for remote sensing images. Waldner et al. [52] proposed multitask semantic segmentation with ResUNet [53] that predicts the extent of fields, the field boundaries, and the distance to the closest boundary. Despite the success of such end-to-end multitask learning in approaching a global optimum, the backbone is normally a deep and complex network architecture with a large number of parameters, which requires a longer training time per epoch and a large amount of storage. Meanwhile, training a deep network on a small remote sensing image dataset tends to result in overfitting. Additionally, some deep learning models (such as DeepLabV3+ [54,55]) cannot be applied directly to multispectral remote sensing observations with more than three bands, since band information is lost when the original bands are reduced to three.
To overcome the aforementioned difficulties, we propose a novel end-to-end lightweight multitask no-downsampling fully convolutional neural network that segments area and extracts edge from remote sensing images simultaneously. We name it the Lightweight area and edge Network (LaeNet); its architecture is illustrated in Figure 1. Specifically, we first stack several no-downsampling, multichannel fully convolutional layers with the ReLU activation function as a feature extractor to learn high-level feature maps from multiband remote sensing imagery. Then, another no-downsampling, single-channel convolutional layer with the Sigmoid activation function predicts lake area and nonarea (land), thereby achieving area segmentation. Based on this, the difference between the area segmentation and its spatial gradient is derived as the corresponding predictive edge. The edge label is derived from the mask label by the Canny edge detection operator in OpenCV. This requires no extra manual labeling, yet still provides accurate spatial edge details and reduces semantic feature ambiguity at the area segmentation stage. We train our model on Landsat-8 images of lakes near lake Selenco and validate it on images of lake Selenco itself in the Tibetan Plateau. Extensive experiments demonstrate that our model achieves comparable or even better results in both performance and complexity than mainstream semantic segmentation models (UNet, DeepLabV3+, etc.). Meanwhile, our model generates the lake shoreline very well. To further validate its effectiveness, we compare the lake shoreline produced by our model with in-situ GPS measurements over lake Selenco.
Figure 1. The architecture of our proposed LaeNet model for automatically and simultaneously extracting lake area and shoreline. The wheat boxes represent multichannel convolutional layers for semantic feature extraction. The khaki box is a single-channel convolutional layer for area segmentation. The red box indicates the gradient of the segmented area obtained with the max-pooling operation.
Our main contributions can be summarized as follows:
(1) An end-to-end lightweight multitask no-downsampling fully convolutional neural network (LaeNet) is proposed to extract lake area and shoreline automatically and simultaneously from multiband remote sensing images.
(2) The edge is extracted by computing the difference between the area segmentation map and its spatial gradient, where the spatial gradient is produced by the commonly used max-pooling operation. This does not increase the complexity of our LaeNet model.
(3) We assess the capability of the proposed LaeNet model on real-world multiband remote sensing images. Extensive experimental results demonstrate the superiority of our model in extracting lake area compared with mainstream deep semantic segmentation models (UNet, DeepLabV3+, etc.), especially in the time cost of each epoch and the model size. Moreover, the in-situ data collected by GPS further validate the effectiveness of the model.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the study area and data. Then, we elaborate our proposed lightweight end-to-end multitask no-downsampling fully CNN (LaeNet) in Section 3. The experimental settings, evaluation criteria, analysis, and assessment of the model are reported in Section 4. In Section 5, we discuss the use of different attention mechanisms with the model as well as its application to various satellite sensors. Finally, the conclusion is summarized in Section 6.

Study Area
The Selinco region in the Tibetan Plateau (Figure 2) is selected as the study area since it is sensitive to climate change and has many lakes scattered across it. It covers an area of 113,781 km² with longitude ranging from 87°13'19"E to 91°00'16"E and latitude ranging from 30°08'04"N to 32°48'19"N. The average altitude of this area is 4542 m a.s.l. The annual mean precipitation is about 315 mm, with the monsoon season lasting from May to September [56]. The mean annual temperature, the average annual sunshine duration, and the annual pan evaporation are 0.7 °C, 2950 h, and 2080 mm, respectively [57].
Over this region, the 18 lake regions (indicated by the white boxes in Figure 2) around lake Selinco are chosen as training data. The Selinco lake region (indicated by the red box in Figure 2) is selected as testing data to evaluate the effectiveness and practicability of the extractor, since it covers a large area and is sensitive to climate change, as evidenced by its 26% expansion over the past 40 years [58].

Landsat Images
The Landsat-8 OLI/TIRS images are acquired from the USGS (https://glovis.usgs.gov). In total, 7 scenes of cloudless and clear images are downloaded. Each scene has blue, green, red, near-infrared, and short-wave infrared-1 bands. The wavelength and spatial resolution of each band are shown in the Landsat-8 part of Table 1. After atmospheric correction, a water reflectivity image is produced. The bit depth is unified to 8 bits. The binary label is derived from a specific band of the corresponding Landsat-8 image by using the single-band threshold method, as shown in Figure 3. To meet the input requirement of the LaeNet model, we further subdivide the lake images and the corresponding binary labels into patches of 512 pixels with an overlap of 300 pixels. The overlap augments the training data and keeps results consistent among adjacent patches. Ultimately, 121 testing image patches are obtained from the image of the Selinco lake region. Similarly, 542 training image patches are obtained from the images of the 18 lake regions near the Selinco lake region.
Figure 3. Example of the subsets used for training and testing data. (a,b) An example of a training subset and its corresponding binary label image. (c,d) An example of a testing subset and its corresponding binary label image.
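The subdivision into overlapping patches can be sketched as follows. A patch size of 512 with an overlap of 300 pixels implies a step of 212 pixels between patch origins. How the image border is handled is not stated in the text, so shifting a final patch back to cover the border is an assumption here, as are the function names and the toy image size.

```python
import numpy as np


def patch_starts(length, patch=512, overlap=300):
    """Start offsets of patches along one image axis."""
    stride = patch - overlap  # 212 pixels for the values used in the text
    if length <= patch:
        return [0]  # image smaller than one patch: a single (smaller) patch
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        # Assumption: add one last patch aligned with the image border.
        starts.append(length - patch)
    return starts


def extract_patches(image, patch=512, overlap=300):
    """Cut an (H, W, bands) array into overlapping square patches."""
    height, width = image.shape[:2]
    return [
        image[r:r + patch, c:c + patch]
        for r in patch_starts(height, patch, overlap)
        for c in patch_starts(width, patch, overlap)
    ]


# Hypothetical two-band scene of 1000 x 1360 pixels.
image = np.zeros((1000, 1360, 2), dtype=np.uint8)
patches = extract_patches(image)
print(len(patches), patches[0].shape)
```

The same offsets would be applied to the binary label so that each image patch keeps its matching label patch.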

Field-Measured Lakeshore from GPS
To validate the LaeNet model more rigorously, we conducted fieldwork to collect in-situ observations of one typical part of lake Selinco (indicated by the yellow box in Figure 4) by handheld GPS. We used a SOUTH S86 GPS RTK surveying instrument for GPS data collection and set it to record positions automatically at equal intervals of 5 s. Before surveying, the RTK was kept still to obtain the static accuracy (STD = E: 1.5 cm, N: 0.5 cm, U: 4.8 cm). Then, we walked along lake Selinco from 88°36'13"E, 31°44'12"N to 88°37'13"E, 31°42'50"N on 20 August 2020. Since we only focused on the 2-D horizontal results, the antenna height of the RTK was ignored in our measurement. During data postprocessing, the modified RTKLIB software [59] was utilized. We chose PPP-Kinematic as the positioning mode after adding the final precise ephemeris, which was downloaded from the IGS Analysis Center of the Helmholtz-Centre Potsdam-GFZ German Research Centre for Geosciences (ftp://ftp.gfzpotsdam.de/GNSS/products/final/). Finally, we obtained vector line data approximately 5 km long with geographic coordinates, shown as the orange line in Figure 4.

LaeNet Model
The LaeNet model takes a multiband image as input and generates two grayscale probability maps for that image: one for the segmented area and one for the edge. It mainly consists of three components: (1) extracting semantic feature maps from multiband remote sensing images via several multichannel no-downsampling fully convolutional layers with the ReLU activation function; (2) segmenting the area with a single-channel no-downsampling fully convolutional layer with the Sigmoid activation function; (3) computing the edge as the difference between the area and the corresponding gradient map. The detailed procedure is described as follows.

Semantic Feature Extraction
It is known that convolutional layers can extract different kinds of semantic features from images. In order to extract semantic features of multiband remote sensing images and keep their subtle structures, we employ a multichannel fully convolutional network consisting of several no-downsampling convolutional layers, inspired by the matting refinement in [60], which used three no-downsampling fully convolutional layers. Each layer is followed by a rectified linear unit (ReLU), shown as the wheat boxes in Figure 1. Each convolutional layer can be formulated as:
$$X^{l} = \max\left(0,\; W_{conv}^{l} * X^{l-1} + b_{conv}^{l}\right)$$
where $W_{conv}^{l}$ and $b_{conv}^{l}$ represent the filters and bias of the $l$-th convolutional layer; $*$ denotes the convolution operation, and $\max(0, \cdot)$ implements the ReLU activation function. When $l = 0$, $X^{0}$ refers to the multiband remote sensing image; otherwise, $X^{l}$ represents the $l$-th semantic feature map. Specifically, in each convolutional layer, the kernel size is set to 3×3, the number of convolutional filters is 64, the stride is 1, and the padding is of the SAME type, which implements the no-downsampling function. The number of convolutional layers is chosen based on experimental performance.
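To illustrate why such a layer performs no downsampling, here is a minimal NumPy sketch of one 3×3, stride-1, SAME-padded convolution followed by ReLU. It is not the Keras layer used in the experiments; the function name, the toy 16×16 two-band patch, and the random weights are assumptions for illustration. As in deep learning frameworks, the "convolution" is implemented as a cross-correlation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same_relu(x, w, b):
    """Stride-1 convolution with zero (SAME) padding followed by ReLU.

    x: (H, W, C_in) input, w: (k_h, k_w, C_in, C_out) filters, b: (C_out,) bias.
    Assumes odd kernel sizes so that the output keeps the H x W input size.
    """
    k_h, k_w = w.shape[:2]
    padded = np.pad(x, ((k_h // 2, k_h // 2), (k_w // 2, k_w // 2), (0, 0)))
    # windows has shape (H, W, C_in, k_h, k_w): one k_h x k_w window per pixel.
    windows = sliding_window_view(padded, (k_h, k_w), axis=(0, 1))
    y = np.einsum("hwckl,klco->hwo", windows, w) + b
    return np.maximum(0.0, y)  # ReLU: max(0, .)


rng = np.random.default_rng(0)
x0 = rng.random((16, 16, 2))                # toy two-band input patch
w1 = rng.uniform(-1.0, 1.0, (3, 3, 2, 64))  # 64 filters of size 3x3
b1 = np.zeros(64)
x1 = conv2d_same_relu(x0, w1, b1)
print(x1.shape)  # spatial size is unchanged; only the channel count grows
```

Because every layer preserves the spatial size, the final prediction aligns pixel by pixel with the input and no upsampling path is needed.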

Area Segmentation
The objective of our first task is to segment the area from multiband remote sensing images. Based on the obtained high-level semantic feature, we apply a mask predictive layer to predict lake area or nonarea (land), resulting in image area segmentation. Specifically, as shown by the khaki box in Figure 1, we use a single-channel no-downsampling convolutional layer with the Sigmoid activation function as the mask predictive layer. It can be formulated as follows:
$$M^{pt} = \sigma\left(W_{conv} * X + b_{conv}\right)$$
where $W_{conv}$, $b_{conv}$, and $*$ denote the filters, bias, and convolution operation, respectively; $\sigma(\cdot)$ implements the sigmoid activation function; $X$ is the high-level semantic feature extracted from the last of the convolutional layers described in Section 3.1; and $M^{pt}$ is the predictive area. For this convolutional layer, the kernel size is set to 3×3, the stride is 1, the padding is of the SAME type, and the number of convolutional filters is 1, which produces a binary area segmentation map. Cross-entropy loss is commonly used in semantic segmentation. For area segmentation, each value of the final predictive layer corresponds to a binary class (lake or land). Therefore, binary cross-entropy loss is applied to each pixel during training. The binary cross-entropy loss for the two classes is formulated as follows:
$$\mathcal{L}_{area} = -\frac{1}{N}\sum_{i=1}^{N}\left[M_{i}^{gt}\log M_{i}^{pt} + \left(1 - M_{i}^{gt}\right)\log\left(1 - M_{i}^{pt}\right)\right]$$
where $M_{i}^{gt}$ is the pixel value of the mask label (1 for lake and 0 for land), $M_{i}^{pt}$ is the probabilistic estimate of the predictive layer, and $N$ is the number of pixels.
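The sigmoid output and the per-pixel binary cross-entropy can be checked numerically with a short sketch. Averaging over pixels and clipping probabilities to avoid log(0) are assumptions made here (they mirror common framework behavior); the toy logits and labels are invented for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map raw layer outputs to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(m_gt, m_pt, eps=1e-7):
    """Mean binary cross-entropy between a 0/1 label and predicted probabilities."""
    m_pt = np.clip(m_pt, eps, 1.0 - eps)  # assumption: clip to keep log finite
    per_pixel = -(m_gt * np.log(m_pt) + (1.0 - m_gt) * np.log(1.0 - m_pt))
    return float(np.mean(per_pixel))


# Toy 2x2 example: raw outputs of the mask layer and the matching label.
logits = np.array([[4.0, -4.0], [0.0, 2.0]])
m_gt = np.array([[1.0, 0.0], [1.0, 1.0]])  # 1 = lake, 0 = land
m_pt = sigmoid(logits)
loss = binary_cross_entropy(m_gt, m_pt)
print(round(loss, 4))
```

Confident correct pixels (logits of 4 and -4) contribute little to the loss, while the undecided pixel (logit 0) contributes log 2, which is what drives training toward sharper masks.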

Edge Extraction
The second task is to identify and extract the edge from multiband remote sensing images. Once area segmentation is finished following the procedure in Section 3.2, we obtain the probability map of the area from the predictive mask layer. Motivated by [61], the edge can be derived from the spatial gradient of the area segmentation. In particular, a max-pooling layer is employed to derive the spatial gradient, formulated as follows:
$$\nabla M = \mathrm{maxpooling}\left(M^{pt}\right)$$
where $\nabla$ is the gradient calculation; $\nabla M$ is the spatial gradient map of $M^{pt}$, shown as the red box in Figure 1; and maxpooling is a max-pooling layer used to calculate the gradient map of the predictive area $M^{pt}$. For the max-pooling layer, the kernel size and stride are set to 3 and 1, respectively. The padding is of the SAME type, which implements the no-downsampling function. The simple and general max-pooling operation is used here because it is available in any deep learning framework (such as Keras, TensorFlow, PyTorch, etc.). Based on this, we obtain the edge by computing the difference between the probability map of the segmented area and its spatial gradient. To keep the difference trainable via backpropagation and preserve as much semantic information of the edge as possible, the Leaky ReLU activation function is applied. The computation of the edge can be formulated as follows:
$$E^{pt} = \max\left(0,\; \nabla M - M^{pt}\right) + \alpha \times \min\left(0,\; \nabla M - M^{pt}\right)$$
where $E^{pt}$ is the predictive edge and $\max(0, \cdot) + \alpha \times \min(0, \cdot)$ implements the Leaky ReLU activation function. Here, $\alpha$ is set to 0.2. For clarity, an example of extracting the edge of a 7×7 binary image is given in Figure 5. The red "0" represents the edge of the binary image in Figure 5a. A 3×3 max-pooling operation with SAME padding and a stride of 1 (the red rectangle in Figure 5a) is performed on the binary image to derive the spatial gradient map (Figure 5b). Then the edge is obtained by computing the difference between the binary image (Figure 5a) and its spatial gradient map (Figure 5b). The remaining red "1" denotes the edge of the binary image in Figure 5c. To train the edge extraction task, the edge label is derived from the mask label by the Canny edge detection operator in OpenCV without extra manual labeling effort. Following Zhen et al. [61], we apply the Mean Absolute Error loss to measure the inconsistency between the predictive edge map and the edge label. The loss can be computed as follows:
$$\mathcal{L}_{edge} = \frac{1}{N}\sum_{i=1}^{N}\left|E_{i}^{gt} - E_{i}^{pt}\right|$$
where $E_{i}^{gt}$ is the pixel value of the edge label (1 for edge and 0 for nonedge), $E_{i}^{pt}$ is the probabilistic estimate of the predictive edge map, and $N$ is the number of pixels.
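The 7×7 example can be reproduced with a small NumPy sketch. It assumes the difference is taken as the pooled map minus the mask, so that the edge lands on the "0" pixels bordering the area, which matches the description of Figure 5; the subtraction order, the function names, and the toy mask are assumptions, and the pooling is written by hand rather than with a framework layer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def max_pool_same(m, k=3):
    """k x k max pooling with stride 1 and SAME padding (output keeps input size)."""
    m = np.asarray(m, dtype=np.float64)
    pad = k // 2
    # Pad with -inf so padded cells never win the maximum.
    padded = np.pad(m, pad, mode="constant", constant_values=-np.inf)
    return sliding_window_view(padded, (k, k)).max(axis=(-2, -1))


def leaky_relu(x, alpha=0.2):
    """max(0, x) + alpha * min(0, x)."""
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)


def edge_from_mask(m, alpha=0.2):
    """Edge map as Leaky ReLU of (pooled mask - mask); the order is an assumption."""
    m = np.asarray(m, dtype=np.float64)
    return leaky_relu(max_pool_same(m) - m, alpha)


# 7x7 binary image with a 3x3 area of ones in the centre.
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1.0
edge = edge_from_mask(mask)
print(edge.astype(int))
```

The result is a one-pixel-wide ring of ones around the 3×3 area. Since max pooling never returns less than the centre value, the difference is non-negative in this sketch and the Leaky ReLU acts as the identity.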
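We combine the area segmentation loss and the edge extraction loss to train our network via backpropagation. Thus, the total loss function can be formulated as follows:
$$\mathcal{L} = \mathcal{L}_{area} + \lambda \mathcal{L}_{edge}$$
where $\mathcal{L}_{area}$ and $\mathcal{L}_{edge}$ are the area segmentation loss and the edge extraction loss, respectively, and $\lambda$ is a hyperparameter that balances the two. Here, it is set to 1 in our experiments.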

Results
In this section, we firstly describe our experimental settings and evaluation criteria. Then, extensive experiments are carried out by various networks to extract lake area and shoreline. Moreover, the results are analyzed with quantitative and qualitative comparisons.

Experimental Settings
In the training phase, we randomly shuffled the training data and applied data augmentation to the subdivided training image patches and the corresponding label patches. The data augmentation includes flipping, rotating, and random cropping. All the experiments were implemented with the Keras framework (https://keras.io/) and conducted on a 64-bit Ubuntu 18.04 server with an Intel(R) Core(TM) i7-6700HQ CPU at 2.60 GHz × 8, 16 GB RAM, and an NVIDIA GeForce GTX 1070 GPU. Adaptive moment estimation (ADAM) [62] was selected as the optimizer to train the networks. The initial learning rate was set to $10^{-2}$. In the learning rate reduction policy, the reduction factor was set to 0.7 and the patience to 15. Training stops if the learning rate reaches $10^{-5}$ or there is no significant improvement after 50 epochs. The batch size was set to 4. All trainable parameters in the kernels of the convolutional layers were initialized from a uniform random distribution over [−1, 1]. To reduce the effect of randomness, each experiment was repeated ten times and the average result was recorded.
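As a quick check of the learning rate policy above, the following sketch counts how many reductions by a factor of 0.7 are needed to go from the initial rate to the stopping rate. It assumes each reduction multiplies the current rate by the factor and that each reduction requires at least `patience` epochs without improvement; the variable names are illustrative.

```python
import math

initial_lr = 1e-2  # initial learning rate
min_lr = 1e-5      # learning rate at which training stops
factor = 0.7       # reduction factor
patience = 15      # epochs without improvement before a reduction

# Smallest n such that initial_lr * factor**n <= min_lr.
reductions = math.ceil(math.log(min_lr / initial_lr) / math.log(factor))

# Assumption: each reduction needs at least `patience` stagnating epochs.
min_epochs_to_floor = reductions * patience

print(reductions, min_epochs_to_floor)
```

Under these assumptions the learning rate floor is reached only after many plateaus, so in practice the 50-epoch no-improvement criterion is likely to be the one that ends a run.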
The testing phase was employed under the same experimental environment with the training phase. Furthermore, the postprocessing (comparing with in-situ results, etc.) was implemented by Python 2.7, ArcGIS 10.5 and ArcGIS Pro 2.5.

Evaluation Criteria
We mainly evaluate the performance of the lake area segmentation task. The main reasons are: (1) lake area segmentation is the prerequisite of lake shoreline extraction; and (2) lake area segmentation is an application of semantic segmentation, so it can be compared with other mainstream deep learning models (UNet, DeepLabV3+, etc.). On the testing dataset, the predictive mask is compared with the corresponding mask label to perform assessments at the pixel level. A lake pixel that is correctly detected as lake water is a True Positive. A nonlake pixel that is misclassified as lake water is a False Positive. A True Negative is a nonlake pixel that is correctly detected as nonlake, and a False Negative is a lake pixel that is misidentified as nonlake. The metrics can be derived as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right)$$
where TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively. mIoU is the mean intersection over union of the two categories; here, it is used to represent the shared regions between prediction and label.
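The pixel-level metrics can be computed from a label mask and a predicted mask as in the sketch below. The two-class mIoU is taken as the average of the lake IoU and the land IoU, which is an assumption about the exact definition used; the toy masks are invented, and no guard against empty classes (zero denominators) is included.

```python
import numpy as np


def confusion_counts(label, pred):
    """Return (TP, TN, FP, FN) for binary masks where 1 = lake and 0 = land."""
    label = np.asarray(label, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = int(np.sum(label & pred))    # lake predicted as lake
    tn = int(np.sum(~label & ~pred))  # land predicted as land
    fp = int(np.sum(~label & pred))   # land predicted as lake
    fn = int(np.sum(label & ~pred))   # lake predicted as land
    return tp, tn, fp, fn


def segmentation_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall, F1-score, and two-class mIoU."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "miou": 0.5 * (tp / (tp + fp + fn) + tn / (tn + fp + fn)),
    }


label = np.array([[1, 1, 0, 0], [1, 1, 0, 0]])
pred = np.array([[1, 0, 0, 0], [1, 1, 1, 0]])
counts = confusion_counts(label, pred)
metrics = segmentation_metrics(*counts)
print(counts, metrics)
```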
To further demonstrate the prediction ability of the model, Mean Square Error (MSE) and Mean Absolute Error (MAE) are used. They can be formulated as follows:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(P_{i}^{mask} - P_{i}^{pt}\right)^{2}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|P_{i}^{mask} - P_{i}^{pt}\right|$$
where $P_{i}^{mask}$ is the pixel value of the mask label, $P_{i}^{pt}$ is the predicted value of the model, and $N$ is the number of pixels. The model size, which is mainly determined by the number of model parameters, is also a metric for the network model. In our proposed LaeNet model, parameters come only from the convolutional layers. Specifically, the number of parameters of each convolutional layer, counting weights and biases, can be derived as follows:
$$\mathrm{Params} = \left(K_{w} \times K_{h} \times C_{i} + 1\right) \times C_{o}$$
where $C_{i}$ and $C_{o}$ are the input and output channel numbers, and $K_{w}$ and $K_{h}$ are the width and height of the kernel in the convolutional layer. In addition, the time cost of each epoch in the training process is also considered as a metric. The field observation dataset captured by GPS is recorded in vector format, whereas the edge binary image predicted by our LaeNet model is a raster. To make them comparable, the center coordinate of each predictive edge pixel and the coordinate of the closest point on the measured line are converted from raster to vector using the API in ArcGIS Pro 2.5. To assess the deviation between the center coordinate and the corresponding coordinate measured by GPS, Distance Root Mean Square Error (DRMSE), Distance Mean Absolute Error (DMAE), and Distance Standard Deviation (DSTD) are adopted. The formulations are as follows: where (x

Performance Comparison on Band Combination and Different CNN Layers
We aim to explore the effects of the multiband information and of the CNN-based semantic feature extraction on lake area segmentation. Single bands of the Landsat-8 images, different band combinations, and different numbers of CNN layers for semantic feature extraction are considered. From the Landsat-8 images, B2 (Blue), B3 (Green), B4 (Red), B5 (NIR), and B6 (SWIR-1) were used in our experiments. The quantitative performance of the different band combinations and CNN layer numbers in terms of mIoU is summarized in Table 2. From this table, we can conclude the following: (1) In the single-band case, blue and green show the worst performance, while NIR and SWIR-1 perform best. This is caused by the spectral absorption characteristics of lake water: the longer the spectral wavelength, the stronger the absorption by lake water, which makes it easier for the LaeNet model to segment the lake area. (2) Band combinations generally outperform single bands because they take advantage of complementary information among different bands. Nevertheless, the combination of B5 and B6 outperforms the other combinations. One possible explanation is that the least informative band limits the overall performance of the LaeNet model, which is consistent with the so-called cannikin law. (3) No clear relationship is observed between performance and the number of CNN layers at the semantic feature extraction stage. This implies that the number of CNN layers should be selected carefully for multispectral semantic feature extraction in different tasks. Based on this experiment, we chose B5 and B6 as the spectral input and used 1 CNN layer to extract semantic features of lake and land from Landsat-8 images in the following experiments, because this setting performs better than the other combinations.

Performance Comparison with Different Semantic Segmentation Models
To verify the effectiveness and superiority of our proposed LaeNet model, we compared our results with the mainstream semantic segmentation models DeepLabV3+ [54], AttUNet [63], CloudNet [64], DeepUNet [65], UNet++ [66,67], and UNet [51]. The numerical and visual results of testing on the Selinco lake region are shown in Table 3 and Figure 6.
We summarized the quantitative results of the above models and the LaeNet model in Table 3 based on nine quantitative evaluation metrics: Accuracy, Precision, Recall, F1-score, mIoU, MSE, MAE, Time, and Size.
We have the following findings from Table 3: (1) DeepLabV3+ only accepts three image bands as input, so we report its best performance, obtained when bands B5, B5, and B6 are combined. However, it achieves the worst performance among the learning models. This may stem from the Xception65 [68] architecture we used for DeepLabV3+. Xception65 is pretrained on ImageNet [69], a complicated dataset of natural scenes, which may lead to oversmoothing and loss of detail in our lake-land segmentation task. (2) Similar performance is found among the family of UNet variants, including the original UNet, AttUNet, CloudNet, DeepUNet, and UNet++. Although AttUNet adds an attention mechanism, CloudNet adds Convolution-Identity-Concatenation mapping [70], DeepUNet applies a much deeper network architecture, and UNet++ is based on nested and dense skip connections, the original UNet still outperforms them. This suggests that complex network architectures do not necessarily perform well on small remote sensing image datasets and in simple application scenarios. (3) Our proposed LaeNet model outperforms all other learning models we tested in terms of Accuracy, Precision, Recall, F1-score, and mIoU, and both the MSE and the MAE between the labeled pixel values and the corresponding LaeNet predictions are the smallest among all algorithms. Although the LaeNet model has only two CNN layers, one for semantic feature extraction of lake and land and the other for lake area segmentation, it achieves comparable or even better accuracy. This demonstrates that a lightweight model is more suitable for small datasets and simple pixel-level binary classification problems, such as lake-land segmentation.
Figure 6 shows visual results for a qualitative and intuitive evaluation of our proposed LaeNet model. The first row shows the pseudo color remote sensing images composed of Bands 5, 5, and 6. The second row shows the corresponding semantic label. In the third row, the junction of lake and land obtained from DeepLabV3+ is the smoothest, which misses many details and some small areas. In rows 4-8, the visual appearances are mostly similar, and some local details are much better than those of DeepLabV3+. The ninth row shows that our model achieves comparable or even better results than the family of UNet variants, especially for small land areas. In addition, we obtained the lake shoreline while producing the lake segmentation result, as shown in the last row. Red rectangles highlight the shortcomings of the different models in Figure 6. These qualitative results are consistent with the quantitative results in Table 3.
Figure 6. Lake segmentation visualization from DeepLabV3+ [54], AttUNet [63], CloudNet [64], DeepUNet [65], UNet++ [66,67], UNet [51] and lake segmentation and shoreline results from our proposed LaeNet model.
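The small size of the two-layer configuration discussed above can be made concrete by counting parameters with the usual weights-plus-biases formula for a convolutional layer, (K_w × K_h × C_i + 1) × C_o. The sketch assumes a two-band input (B5 and B6), one 64-filter feature extraction layer, and one single-filter mask layer, all with 3×3 kernels as described in Section 3; the function name is illustrative.

```python
def conv_params(c_in, c_out, k_w=3, k_h=3):
    """Weights plus biases of one convolutional layer."""
    return (k_w * k_h * c_in + 1) * c_out


# Assumed configuration: 2 input bands -> 64 feature maps -> 1 mask channel.
feature_layer = conv_params(c_in=2, c_out=64)
mask_layer = conv_params(c_in=64, c_out=1)
total = feature_layer + mask_layer

print(feature_layer, mask_layer, total)
```

The max-pooling and Leaky ReLU steps of the edge branch add no trainable parameters, so the total is unchanged by the second task.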

Performance Comparison with In-Situ Observed Results
In order to further evaluate the practicability and accuracy of the proposed LaeNet model, we collected in-situ GPS trajectory data along one typical part of lake Selenco on 20 August 2020, as illustrated in Figure 7. The orange line represents the in-situ GPS trajectory, which has irregular edges due to the complex terrain and intricate structures of the region. The trajectory is continuous and lies slightly inland, because shoals and marshlands along the shoreline of lake Selenco prevented our surveyors from getting closer. The black raster line represents the lake Selenco shoreline, the sawtooth-shaped edge extracted by our LaeNet model from the Landsat-8 image of 19 August 2020. The main principle is to classify each pixel of the remote sensing image as shoreline or nonshoreline. The predictive shoreline (black) is very close to the measured line (orange), which indicates the effectiveness of the LaeNet model. Furthermore, quantitative results were computed between the predictive shoreline and the measured line in terms of DRMSE, DMAE, and DSTD. The center coordinate of each predictive shoreline pixel is derived by converting raster to vector with the API in ArcGIS Pro 2.5. Then, the closest point on the measured line is found for each center coordinate. Following Equations (17)
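The comparison between pixel centers and the measured line can be sketched without GIS software as a shortest point-to-polyline distance followed by three summary statistics. The definitions used here are assumptions: DRMSE as the root mean square of the distances, DMAE as their mean, and DSTD as their standard deviation about the mean. Under these definitions DRMSE² = DMAE² + DSTD², which the reported values satisfy approximately (22.49² + 21.11² ≈ 30.84²). Coordinates are assumed to be planar and in metres; the toy line and points are invented.

```python
import numpy as np


def point_to_polyline_distance(point, line):
    """Shortest distance from a 2-D point to a polyline given as (n, 2) vertices."""
    p = np.asarray(point, dtype=np.float64)
    line = np.asarray(line, dtype=np.float64)
    a, b = line[:-1], line[1:]                # segment start and end points
    ab = b - a
    seg_len_sq = np.sum(ab * ab, axis=1)
    seg_len_sq = np.where(seg_len_sq == 0.0, 1.0, seg_len_sq)  # guard zero-length segments
    # Projection parameter of p on each segment, clamped to the segment.
    t = np.clip(np.sum((p - a) * ab, axis=1) / seg_len_sq, 0.0, 1.0)
    closest = a + t[:, None] * ab
    return float(np.min(np.linalg.norm(p - closest, axis=1)))


def distance_metrics(points, line):
    """DRMSE, DMAE, and DSTD of point-to-line distances (assumed definitions)."""
    d = np.array([point_to_polyline_distance(p, line) for p in points])
    return {
        "DRMSE": float(np.sqrt(np.mean(d ** 2))),
        "DMAE": float(np.mean(d)),
        "DSTD": float(np.std(d)),
    }


measured_line = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0]])
pixel_centers = np.array([[50.0, 30.0], [130.0, 50.0], [20.0, -10.0]])
result = distance_metrics(pixel_centers, measured_line)
print(result)
```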

Applications on Different Attention Mechanisms
Remote sensing images have multiple spectral bands and complex spatial structures, which provide rich spectral and spatial information. The experimental results in Section 4.1 demonstrate that not all spectral bands are equally informative and predictive. We therefore introduced spectral attention [71] and channel attention [72,73] in the all-band setting to emphasize useful bands and suppress less useful ones. The spectral attention extracts global spatial information with a global convolutional layer, whose number and size of filters equal the number of spectral bands and the size of the input images, and sends it to the spectral gates (i.e., the sigmoid function) to recalibrate the spectral bands adaptively. The channel attention first extracts semantic attributes from the average-pooled and max-pooled features of the input images with a shared two-layer perceptron. Then, the sigmoid of the semantic attributes is multiplied with the band maps to reweight the spectral bands. Note that, within the two-layer perceptron, the number of neurons in the first layer should be greater than the number of spectral bands and is set to 16 in our experiment, while the number of neurons in the second layer equals the number of spectral bands. Both attention mechanisms were placed before the input of the LaeNet model. Table 4 implies that the two attention mechanisms can enhance the properties of all bands of the remote sensing images, but the performance of the LaeNet model is not significantly improved compared with the combination of B5 and B6. This indicates that, even when attention mechanisms are adopted, less useful spectral bands may still introduce noise and weaken the performance of the LaeNet model.
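The channel attention described above can be sketched in NumPy as follows. The text specifies average and max pooling, a shared two-layer perceptron with 16 hidden neurons, and a sigmoid gate multiplied with the bands; summing the two perceptron outputs before the sigmoid and using ReLU in the hidden layer are assumptions borrowed from common channel attention designs. The weights are random placeholders rather than trained values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, b1, w2, b2):
    """Reweight the bands of an (H, W, C) image with a shared two-layer perceptron."""
    avg_feat = x.mean(axis=(0, 1))  # global average pooling -> (C,)
    max_feat = x.max(axis=(0, 1))   # global max pooling -> (C,)

    def shared_mlp(v):
        hidden = np.maximum(0.0, v @ w1 + b1)  # assumption: ReLU hidden layer
        return hidden @ w2 + b2

    # Assumption: the two branches are summed before the sigmoid gate.
    gate = sigmoid(shared_mlp(avg_feat) + shared_mlp(max_feat))
    return x * gate, gate


rng = np.random.default_rng(0)
bands, hidden = 5, 16  # five Landsat-8 bands, 16 hidden neurons
image = rng.random((32, 32, bands))
w1 = rng.normal(0.0, 0.1, (bands, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(0.0, 0.1, (hidden, bands))
b2 = np.zeros(bands)
reweighted, gate = channel_attention(image, w1, b1, w2, b2)
print(gate)
```

Each band is scaled by a single value in (0, 1), so a band the gate considers uninformative is attenuated but never removed entirely, which is consistent with the observation that such bands can still introduce noise.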

Effect of Pixel Tolerance on Different Semantic Segmentation Models
Apart from the pixelwise evaluation described in Section 4.4, patchwise evaluation [74][75][76] is also commonly used to overcome data-labeling uncertainty in semantic segmentation tasks. Here, we applied a pixel tolerance to evaluate the labels generated by the single-band threshold method in Section 2.2. In this experiment, predicted pixels within a tolerance margin of 0, 1, 2, 3, or 4 pixels from the label pixels are counted as True Positives (TP). The results are summarized in Table 5, from which we draw the following findings. Firstly, when the tolerance margin is 0 pixels, the performance of all models decreases compared with the pixelwise evaluation results shown in Table 3. This is because the images predicted by these models are grayscale images, which can lose some information during the manual binarization postprocessing in the patchwise evaluation. Secondly, for any given semantic segmentation model, the performance decreases as the tolerance margin increases, which indicates that applying a pixel tolerance does not improve the measured performance. This further implies that the quality of our labels produced by the single-band threshold is quite high. This is because lake water shows strong spectral absorption in Landsat-8 multispectral images, which makes manual thresholding, applied scene by scene, well suited to generating labels for the lake-land segmentation task.

Applications on Images from Different Satellite Sensors
Various satellite sensors can collect different remote sensing images of the same geographical area at different times. Besides the Landsat-8 images described in Section 2.2, we also downloaded Landsat-5 images for the same area and applied our trained LaeNet model to extract the lake area and shoreline. It turns out that the basic information of lake area and shoreline can be extracted, but some details were missing: the wetlands and the mountain shadows were recognized as lake area, as shown in Figure 8b,c. This is because only Landsat-8 images near lake Selenco were used as training samples for the LaeNet model, whereas the images to be segmented were from Landsat-5. The two kinds of images (Landsat-5 and Landsat-8) have the same resolution and similar band wavelengths, but there are still slight differences between the corresponding bands, as shown in Table 1. To segment lake area and extract shoreline more accurately from Landsat-5 images, we built a new training dataset from Landsat-5 images of the study area by using the method provided in Section 2 and retrained the LaeNet model. In this way, the model can be applied to extract lake area and shoreline from Landsat-5 images. The results are shown in Figure 8d,e and are similar in performance to those of our proposed LaeNet model on Landsat-8.
Figure 8. Four examples (rows 1-4) of lake area and shoreline of lake Selenco from Landsat-5 images by using the LaeNet model with different training images as input. (a) indicates the pseudo color images by using band 4, 4, 5 as the RGB channels; (b,c) are the test results of lake area and shoreline by using Landsat-8 images for training; (d,e) are the test results of lake area and shoreline by using Landsat-5 images for training.

Conclusions
Lake area segmentation and shoreline extraction are crucial steps in lake monitoring. In this paper, we proposed a lightweight yet effective end-to-end multitask no-downsampling fully CNN (LaeNet) to segment lake area and extract shoreline simultaneously from remote sensing images. Firstly, several no-downsampling CNN layers with the ReLU activation function are applied to extract semantic features. Then, another no-downsampling CNN layer with the Sigmoid activation function is utilized to segment the lake area. Finally, the difference between the lake area segmentation and its spatial gradient is derived as the lake shoreline. Extensive experimental results showed that our LaeNet model matches or outperforms mainstream deep semantic segmentation approaches (e.g., UNet and DeepLabV3+) in accuracy while being considerably simpler, as indicated by its time and space usage. Furthermore, in-situ GPS measurements from one typical part of lake Selenco were used to validate the effectiveness of the LaeNet model. Moreover, we discussed the application of the LaeNet model to multiple bands with different attention mechanisms and to images from different remote sensing sensors.
In the future, we will expand the study area from the Selenco region to the whole Tibetan Plateau and enrich the data sources from the Landsat series to more sensors (for example, Sentinel-2) to improve the robustness and generality of the LaeNet model. Moreover, more powerful deep network models for area segmentation and edge extraction will be designed to tackle more complex application scenarios.
Author Contributions: W.L. designed and developed the model of this study, conducted the experiments and analysis, and wrote the manuscript. J.R. and L.L. supervised W.L. and codesigned the model. X.C. provided the training dataset, assessed the error, and also designed use case diagrams. L.X. helped with GPS data processing. Q.W., L.X., and G.L. revised the manuscript. All coauthors helped write the manuscript. All authors have read and agreed to the published version of the manuscript.