Multi-Resolution Supervision Network with an Adaptive Weighted Loss for Desert Segmentation

Abstract: Desert segmentation of remote sensing images is the basis for the analysis of desert areas. Desert images are usually characterized by large image size, large-scale change, and irregular location distribution of surface objects. The multi-scale fusion method is widely used in existing deep learning segmentation models to solve these problems. Based on the idea of multi-scale feature extraction, this paper took the segmentation result of each scale as an independent optimization task and proposed a multi-resolution supervision network (MrsSeg) to further improve the desert segmentation result. Because the optimization difficulty differs between branch tasks, we also proposed an auxiliary adaptive weighted loss function (AWL) to automatically optimize the training process. MrsSeg first used a lightweight backbone to extract features at different resolutions, then adopted a multi-resolution fusion module to fuse local and global information, and finally used a multi-level fusion decoder to aggregate and merge the features at different levels to obtain the desert segmentation result. In this method, each branch loss was treated as an independent task, and AWL was used to calculate and adjust the weight of each branch. By giving priority to the easy tasks, the improved loss function could effectively improve the convergence speed of the model and the desert segmentation result. The experimental results showed that MrsSeg-AWL effectively improved the learning ability of the model and achieved faster convergence, lower parameter complexity, and more accurate segmentation results.


Introduction
Desertification is a land degradation phenomenon characterized by wind-sand activities in arid and semi-arid areas due to the human-nature imbalance. It is a positive feedback process of environmental instability [1]. A comprehensive, macroscopic, and scientific grasp of the spatial distribution pattern and dynamic change information of desert land types is the basis for preventing and/or controlling desertification [2]. The feature types in desert areas are complex, manual field mapping statistics or visual interpretation consumes time and energy, and the information of dynamic large-scale areas cannot be reflected quickly and accurately [3]. In recent years, satellite remote sensing technology has been developing rapidly, making it possible to obtain remote sensing images in desert areas with low cost, fast speed, and high accuracy [4]. However, due to the complexity of remote sensing image features, there is no universal method for image recognition [5]. Light, water, and other external factors have different effects on the image features of different desert land types, making it difficult to identify land types and distinguish boundaries [6]. Therefore, desert remote sensing image recognition is still a challenging task.
Most existing remote sensing image-recognition methods use sliding windows to extract spectral and texture features [7,8]. Pi et al. [9] proposed the desert grassland classification network (DGC) and three-dimensional convolutional neural network (3D-CNN) models to identify desert and grassland. Moghaddam et al. [10] used a multi-layer perceptron (MLP) to classify Isfahan desert images and obtained the land cover map of the Sejzy area. Ge et al. [11] used an artificial neural network (ANN), random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN) to analyze seven different land cover types in China's Dengkou oasis. These methods made full use of the information contained in remote sensing images and effectively improved the land classification accuracy of high-resolution images, but problems such as time-consuming calculation and inaccurate edge segmentation remained. Research has shown that image segmentation methods can better avoid these problems [5,12].
Traditional desert segmentation methods, such as mathematical morphology and threshold segmentation, were mainly based on remote sensing (RS) and geographic information system (GIS) technologies. Their performance depended on many threshold parameters that had to be set carefully. Because the threshold parameters usually vary between images, the traditional methods could only work on a small range of data and could not be validated in complex circumstances [13,14]. Remote sensing image segmentation methods based on a single-path encoder-decoder network for pixel-to-pixel prediction have achieved good results [15,16]. Li et al. [17] proposed a land-use segmentation model based on deep learning, which improved performance by using residual connections [18] and the multi-scale ASPP module [19]. Ulmas et al. [20] used a deep learning model based on U-Net to identify the land cover type. The features recorded in desert images usually present multi-scale characteristics. The extraction and fusion of multi-scale features can help improve the learning ability and the segmentation result [21,22]. Existing single-branch segmentation models do not fully consider feature information at different scales, and existing multi-scale feature fusion models require a lot of computation [23,24]. In order to quickly and accurately segment desert remote sensing images, it is still necessary to further strengthen the multi-scale information fusion effect [25], reduce the number of parameters, and speed up model convergence.
In the fields of person re-identification and object detection, deep supervision can effectively improve network performance [26,27]. When applying this idea with multi-resolution learning to the segmentation task, it is important to balance the loss by considering the different contribution of each resolution task [28]. Reducing the weight of difficult tasks and increasing the weight of easy tasks can effectively accelerate the convergence of training and prevent the model from falling into a local minimum [16]. However, existing loss-balancing methods mostly adopt fixed balancing parameters or adjust the balancing parameters only according to the difficulty of a single task [29].
In view of the above problems, we consider the application of desert remote sensing, which is characterized by large image size, large-scale change, and irregular location distribution of surface objects [30]. This paper regarded the outputs of different branches as different optimization tasks and proposed a multi-resolution supervision network (MrsSeg) with an adaptive weighted loss function (AWL) to automatically segment desert remote sensing images. First, a lightweight backbone was used to extract features at different scales; then, a multi-resolution fusion module was adopted to fuse the local and global information; and finally, a multi-level fusion decoder was used to aggregate and merge the object features at different levels to obtain the desert segmentation result. An improved adaptive weighted loss function was also designed to automatically optimize the training process. The main contributions of this work are as follows: (1) This paper took the segmentation result of each resolution as an independent optimization task and proposed a multi-resolution supervision network (MrsSeg) to better promote the feature fusion process. (2) According to the characteristics of desert images, a specialized multi-resolution aggregation module was proposed to better recover the detailed information of desert segmentation results by aggregating features from low to high resolution. (3) In order to improve the efficiency of the multi-resolution supervision network, an adaptive weighted loss function (AWL) was designed. By giving priority to the easy tasks, the improved loss function could effectively improve the convergence speed of training and the desert segmentation result. (4) A new desert image dataset was collected, including the desert, gobi, oasis, and river categories. The experimental results on the self-constructed dataset showed that the proposed model obtained better performance in the desert segmentation task than existing approaches.

Materials and Methods
Desert remote sensing images are usually characterized by large image size, large scale change, and irregular location distribution of surface objects [30]. In order to quickly and accurately segment desert images, this paper proposes a multi-resolution supervision network to effectively fuse local information and global information, so as to improve the desert segmentation effect. According to the characteristics of multi-resolution outputs of the network, an adaptive weighted loss function was proposed to further improve the segmentation performance of the network.

Multi-Resolution Supervision Network
In existing remote sensing image segmentation methods, feature fusion models are often used to extract multi-scale features and preserve spatial details [31]. However, as Figure 1 shows, the multi-branch model (Figure 1a) does not adequately combine the high-level features of its parallel branches: the lack of feature communication between parallel branches leads to insufficient learning ability, and the additional branches on high-resolution images limit the acceleration of training. Commonly used pyramid feature map fusion methods include the image pyramid [32], the feature pyramid [33], and the spatial pyramid pooling (SPP) [34] module (Figure 1b). The SPP module uses shallow semantic information to enhance high-level features by extracting high-resolution context semantics and enlarging receptive fields. However, the segmentation results of this method are limited to the feature layer where the spatial pooling pyramid is located, and implementing the SPP module is usually time-consuming. The feature pyramid (Figure 1c) fuses the deep semantic information into the shallow network layer by layer through the top-down path. This way of aggregating context information not only increases the local information extraction ability of the deep layers but also gives the shallow layers a certain amount of deep-level semantic information. Inspired by the above ideas, the structure of the improved multi-resolution segmentation network in this paper is shown in Figure 2. The structure aimed to better extract and fuse local and global information through supervised training among multiple branches so as to improve the segmentation ability.
The MrsSeg was a lightweight desert image segmentation method that combined multi-resolution semantics to encode features. The whole network could be divided into three parts, among which the encoder module consisted of a lightweight backbone network and multi-resolution fusion modules and the decoder module was designed as a simple and effective up-sampling module that combined low-level and high-level features.
The overall structure of MrsSeg-AWL is illustrated in Figure 2. First, we used the pre-trained MobilenetV2 [35] as a lightweight backbone network to obtain features of the desert image at different levels. Then, we used the multi-resolution fusion module to fuse the multi-level semantic information and improve the feature representation ability of the network, and adopted multi-resolution supervised training to improve the feature extraction ability of each branch and thereby promote feature fusion. Finally, the segmentation result with the same size as the input image was obtained by up-sampling the feature map of the multi-level fusion decoder.

Backbone
Desert images usually have a large image size. In order to improve the model segmentation efficiency, a pre-trained MobilenetV2 was used in this paper as the lightweight backbone. MobilenetV2 adopts the inverted residual with a linear bottleneck; this structure not only ensured the efficiency of feature extraction but also effectively reduced the number of parameters.
The inverted residual with a linear bottleneck is shown in Figure 3. The inverted structure was designed according to the idea of "expansion-convolution-compression". First, a 1 × 1 point-wise convolution (PW) was used to expand the input F to a high-dimensional embedding space, and then a 3 × 3 depth-wise separable convolution (DW) was used for filtering. Subsequently, the features were projected back to a low-dimensional representation with a 1 × 1 linear convolution. Finally, the low-dimensional outputs were added to the inputs by the skip connection to obtain the final output. The inverted residual with a linear bottleneck Bott could be computed as follows:

$$\mathrm{Bott}(F) = F + \mathrm{PW}(\mathrm{DW}(\mathrm{PW}(F))) \quad (1)$$

where F is the input feature map, PW is the 1 × 1 point-wise convolution layer, and DW is the 3 × 3 depth-wise separable convolution layer.
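To make the parameter saving of the expansion-convolution-compression design concrete, the following sketch counts the convolution weights of one inverted residual block and compares them with the same block built around a full 3 × 3 convolution. The channel width of 64, the function names, and the omission of batch normalization and bias terms are illustrative assumptions; the expansion factor of 6 is the usual MobileNetV2 setting and is not stated in this paper.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a 2-D convolution without bias."""
    return (c_in // groups) * c_out * kernel * kernel


def inverted_residual_params(channels: int, expansion: int) -> int:
    """Weights of PW expand -> 3x3 DW -> linear PW project (BatchNorm ignored)."""
    hidden = channels * expansion
    expand = conv_params(channels, hidden, kernel=1)  # 1x1 point-wise expansion
    # Depth-wise: one 3x3 filter per channel (groups == channels).
    depthwise = conv_params(hidden, hidden, kernel=3, groups=hidden)
    project = conv_params(hidden, channels, kernel=1)  # 1x1 linear projection
    return expand + depthwise + project


def dense_bottleneck_params(channels: int, expansion: int) -> int:
    """Same layout, but with a full (non-depth-wise) 3x3 convolution in the middle."""
    hidden = channels * expansion
    return (
        conv_params(channels, hidden, kernel=1)
        + conv_params(hidden, hidden, kernel=3)
        + conv_params(hidden, channels, kernel=1)
    )


# Hypothetical setting: 64 input channels, expansion factor 6.
print(inverted_residual_params(64, 6))  # 52608
print(dense_bottleneck_params(64, 6))   # 1376256
```

Under these assumptions the depth-wise variant needs roughly 26 times fewer weights, because the 3 × 3 filtering no longer mixes channels; the skip connection itself adds no parameters.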

Multi-Resolution Fusion Model
The feature map output from the backbone upper layer had a smaller size and higher semantic information [35]. This kind of high-level information has been experimentally proven to play a key role in the subsequent segmentation task. However, the greater the stride of the downsampling of the network was, the more the spatial details of the image were lost. This led to deep encoder blocks' lack of low-level features and made it difficult for decoders to recover local details. This problem motivated us to propose an aggregation strategy to fuse local detail and global information in different depth positions of feature extraction networks to achieve better performance.
The high-level features of desert images contained more global information, while the low-level features contained more local information such as color, texture, and edge. Effective fusion of high-level and low-level features could improve the segmentation effect. Based on this, the multi-resolution fusion module with a top-down fusion mechanism was designed (Figure 2). The module was composed of aggregation blocks (Figure 4). Each aggregation block had two inputs. The low-level feature came from the previous aggregation block in the same branch, and the high-level feature came from the aggregation block in the lower-resolution branch. When the input feature map came from the backbone network, the number of channels (C1) was adjusted to C2 by a 1 × 1 convolution to match the dimension of the high-level feature. At the same time, a skip connection was used to connect the input and output, which could effectively avoid information loss and vanishing gradients and improve the model's optimization ability. The multi-resolution fusion module made each feature representation, from low resolution to high resolution, continuously receive information from the other parallel branches, so as to obtain a richer high-resolution representation. This made the final output feature map more accurate. The aggregation block Agg is defined by Equation (2), in which HF is a high-level feature, LF is a low-level feature, CBR represents a 3 × 3 convolution layer followed by one batch normalization layer and a ReLU activation function, and Up represents the bilinear interpolation upsampling layer.
The architecture of MrsSeg is shown in Table 1. The encoder of MrsSeg contained two parts: the lightweight backbone network and the multi-resolution fusion modules. A pre-trained MobilenetV2 was used in this paper as the lightweight backbone to down-sample the 512 × 512 training image to 1/16 of its size. The multi-resolution fusion module took the multi-scale outputs of the backbone as input. Each branch of the multi-resolution fusion module had four aggregation blocks. The first aggregation block was used to unify the number of feature map channels to 64, and the remaining aggregation blocks were used for multi-scale information fusion.

Multi-Level Fusion Decoder
According to the research of [25], not all the features of the stages were necessary to contribute to the decoder module. This motivated us to find a lightweight method to incorporate multi-level context into encoded features. The decoder in this paper was designed as a simple and effective upsampling module that integrated low-level and high-level features. First, the first feature map of each row (left to right in the multi-resolution module) was upsampled to the original image size through bilinear interpolation and added together, as shown by the black dotted line in Figure 2. Then, the result was followed by a convolution operation and added to the output feature map of the multi-resolution fusion module (upper right feature map in the multi-resolution module) so that high-level features and low-level details were further fused. Finally, the fused feature map was subjected to a convolution operation followed by a softmax function to obtain the segmentation result.

Adaptive Weighted Loss Function
In this paper, the output of different resolution branches of multi-resolution structure (MrsSeg) was regarded as different optimization tasks, and the supervised training method was used to promote multi-resolution fusion. In a multi-output structure, it is usually important to achieve the loss balance by integrating the multi-branches loss. However, the existing loss balancing parameter was determined uniformly or only determined by single task difficulties. In the case that balancing parameters were calculated without considering task difficulty for each branch, losses that did not match task difficulties of each branch were propagated (Figure 5a), and it seemed to reduce the effect of multitask learning.
In order to solve the above problem, an adaptive weighted loss function (AWL) was proposed to adjust the balancing parameters according to task difficulties for each branch. By reducing the weight of difficult tasks and increasing the weight of easy tasks, it could effectively accelerate the convergence speed of training and help to improve the segmentation result. The improved adaptive weighted loss function is shown in Figure 5b.
In this paper, the method of quantifying branch task difficulty and adjusting balancing parameters was used to achieve adaptive loss balance. The branch task difficulty was calculated from the loss reduction, so we first calculated the moving average $k_b^{\tau}$ of the current loss $L_b^{\tau}$ of each branch, as follows:

$$k_b^{\tau} = \alpha L_b^{\tau} + (1 - \alpha)\, k_b^{\tau-1} \quad (3)$$

where α ∈ [0, 1] is a discount factor, b is the index of the branch task, τ is the current training iteration, $k_b^{\tau}$ is the current moving average, and $k_b^{\tau-1}$ is the previous moving average.
Using $k_b^{\tau}$, we defined the current task difficulty $r_b^{\tau}$ of each branch as follows:

$$r_b^{\tau} = \frac{k_b^{\tau}}{k_b^{\tau-1}} \quad (4)$$

A large $r_b^{\tau}$ means that the current optimization step did not reduce the loss much; that is, optimization for the current branch task was difficult. In particular, if $r_b^{\tau} > 1$, the task appeared to have stepped into a local minimum. Therefore, we introduced the balancing parameter $\lambda_b^{\tau}$ to reduce the weight of the branch task with large $r_b^{\tau}$ and increase the weight of the branch task with small $r_b^{\tau}$. The balancing parameters are computed by Equation (5). The overall loss function $L_{AWL}^{\tau}$ was defined as the sum of the adjusted branch losses:

$$L_{AWL}^{\tau} = \sum_{b} \lambda_b^{\tau} L_b^{\tau} \quad (6)$$

Algorithm 1 shows the application of the adaptive weighted loss algorithm in the training process. First, the cross-entropy loss function was used to calculate the loss of each branch. Then, the moving average of each branch ($k_b^{\tau}$) could be obtained by Equation (3). The less the moving average changed, the more difficult the branch was to optimize, so Equation (4) was used to calculate the optimization difficulty of each branch ($r_b^{\tau}$). According to the principle of giving priority to the easy task, the balancing parameter of each branch ($\lambda_b^{\tau}$) was allocated by Equation (5). Finally, the weight of each branch loss was adjusted by the balancing parameters to obtain the overall loss, and the network parameters were updated through back-propagation.

Algorithm 1 (steps 9 to 14)
9: Calculate task difficulty $r_b^{\tau}$ with Equation (4);
10: Calculate balancing parameters $\lambda_b^{\tau}$ with Equation (5);
11: Get final loss $L_{AWL}^{\tau}$ with Equation (6);
12: Backpropagate $L_{AWL}^{\tau}$ and update network parameters W
13: end for
14: return W

Desert imagery is characterized by large-scale change and irregular location distribution of surface objects. In view of these characteristics, MrsSeg adopted multi-resolution feature aggregation modules in order to extract and fuse the multi-resolution features of desert images.
Aiming at the structural features of multi-resolution outputs of the model, this work designed auxiliary loss function on multiple resolution branches to improve the feature extraction. Because the training difficulty of different branch tasks is different, and the optimization difficulty of each branch is different in different training stages, an adaptive weighted loss function was designed, which could improve the convergence speed of the model and the desert segmentation result by giving priority to the easy tasks.
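The weighting procedure described in this section can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the moving average is taken as an exponential average with discount factor α, the task difficulty as the ratio of the current to the previous moving average, and the balancing parameters as a softmax over the negated difficulties scaled by the number of branches. The softmax form in particular is a hypothetical stand-in for Equation (5), and the class name `AdaptiveWeightedLoss` and the default `alpha=0.5` are illustrative.

```python
import math


class AdaptiveWeightedLoss:
    """Sketch of difficulty-based loss balancing over several branch losses."""

    def __init__(self, num_branches: int, alpha: float = 0.5):
        self.alpha = alpha
        # Previous moving average per branch; None until the first step.
        self.moving_avg = [None] * num_branches

    def weights(self, losses):
        """Update the moving averages and return one balancing weight per branch."""
        ratios = []
        for b, loss in enumerate(losses):
            prev = self.moving_avg[b]
            if prev is None:
                curr, ratio = loss, 1.0  # first step: no history, neutral difficulty
            else:
                curr = self.alpha * loss + (1.0 - self.alpha) * prev
                ratio = curr / max(prev, 1e-12)  # large ratio = little loss reduction
            self.moving_avg[b] = curr
            ratios.append(ratio)
        # ASSUMPTION: softmax over -ratio, so easy branches (small ratio) get
        # larger weights; scaled so that the weights sum to the branch count.
        exps = [math.exp(-r) for r in ratios]
        total = sum(exps)
        return [len(losses) * e / total for e in exps]

    def __call__(self, losses):
        """Weighted sum of the branch losses."""
        lam = self.weights(losses)
        return sum(w * loss for w, loss in zip(lam, losses))


awl = AdaptiveWeightedLoss(num_branches=2)
awl.weights([1.0, 1.0])         # first step: equal weights
print(awl.weights([0.5, 1.0]))  # branch 0 improved, so it receives the larger weight
```

In a PyTorch training loop, the weights would typically be computed from detached scalar loss values and then multiplied with the differentiable branch losses, so that gradients flow only through the losses and not through the weighting.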

Data and Pre-Processing
Xinjiang is the region with the largest desertification area, the widest desertification distribution, and the most serious desertification damage in China. The region is deep inland, forming a distinct temperate continental climate. Desert (sandy desert) and gobi (Gobi desert) are the main land types in this area. Unlike desert, gobi is mainly covered with bare gravel and stones. Based on the above reasons, Xinjiang was selected as the sampling object of the desert area segmentation data set.
The images collected in this paper were all from Environment and Disaster Monitoring and Forecasting Small Satellite Constellations A, B (HJ-1A/B). The satellite has unique advantages of autonomous control, medium and high resolution, wide coverage, etc. It can stably obtain the medium resolution remote sensing data covering the whole country every half a year and is the preferred remote sensing data source for carrying out high-dynamic national desert and desertification land monitoring.
The data set contained the desert, gobi, oasis, and river categories (Figure 6) and was divided into a training set, a validation set, and a test set at a ratio of 6:2:2. In order to make the model more robust, the data were expanded from 1665 to 6660 images by random flipping, rotation, and other image augmentation methods, as shown in Figure 7.
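A 6:2:2 split of this kind can be reproduced with a short index-based routine. The sketch below is illustrative only: the function name `split_indices`, the fixed seed, and the choice to split the 1665 original images (rather than the 6660 augmented ones) are assumptions, since the paper does not state at which stage the split was made.

```python
import random


def split_indices(n: int, ratios=(6, 2, 2), seed: int = 0):
    """Shuffle indices 0..n-1 and split them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1665)
print(len(train_idx), len(val_idx), len(test_idx))  # 999 333 333
```

Splitting before augmentation keeps augmented copies of one source image from landing in both the training and the test set; the expansion from 1665 to 6660 images corresponds to a factor of four.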

Experimental Results
The experimental environment was an Intel Core i7-8700K six-core processor (Intel Corporation, Santa Clara, CA, USA), an RTX2070 graphics card with 8 GB of memory, 32 GB of RAM, and a 1 TB hard disk; the software framework was PyTorch (Facebook, Inc., Menlo Park, CA, USA). For a fair comparison, all runs were trained with stochastic gradient descent. The hyper-parameters were set as follows: batch size = 4, momentum = 0.9, weight decay = 0.00005. We used cosine decay as the learning rate decay strategy. To obtain a quantitative evaluation result, we adopted frames per second (FPS) and mean Intersection over Union (mIoU) as metrics. FPS is the number of images that can be processed per second. The larger its value, the faster the prediction speed of the model. The mIoU averages the intersection-over-union ratio over all classes. This index can better reflect the accuracy and completeness of model segmentation in different terrain type areas in the experiment, and is defined as follows:

$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (7)$$

where k + 1 is the number of classes (including background), $p_{ii}$ is the number of pixels that belong to class i and were classified correctly, and $p_{ij}$ is the number of pixels that belong to class i and were classified as class j.
Due to the large amount of data in the desert data set, mini-batch training was adopted. In the validation process, the model result will inevitably be biased towards the batch data of the final iteration. When mini-batches are randomly drawn from the desert data set, the samples within a batch may be imbalanced. When this occurs in the last iteration, the validation curve jitters. However, this does not affect the overall training trend of the model. Hence, the average mIoU over every 20 epochs was used as the points of the curves in Figures 8 and 9, so as to better reflect the overall trend and performance of the model.
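As an illustration of the mIoU metric, the following sketch computes it from a confusion matrix whose entry in row i and column j counts the pixels of true class i that were predicted as class j. The function name `mean_iou` and the two-class example values are illustrative; skipping classes that appear in neither the labels nor the predictions is a design choice of this sketch, whereas the definition above averages over all k + 1 classes.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix (list of lists)."""
    n = len(confusion)
    ious = []
    for i in range(n):
        correct = confusion[i][i]                           # p_ii
        labelled = sum(confusion[i])                        # sum_j p_ij: true class i
        predicted = sum(confusion[j][i] for j in range(n))  # sum_j p_ji: predicted as i
        union = labelled + predicted - correct
        if union > 0:  # ignore classes absent from both labels and predictions
            ious.append(correct / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 2-class example: IoU is 3/5 for class 0 and 5/7 for class 1.
example = [[3, 1],
           [1, 5]]
print(round(mean_iou(example), 4))  # 0.6571
```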
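The curve smoothing described here amounts to averaging the per-epoch mIoU values in consecutive blocks. A minimal sketch, assuming the per-epoch values are available as a list (the function name `block_means` is illustrative):

```python
def block_means(values, block: int = 20):
    """Average consecutive blocks of `block` values; a shorter final block is kept."""
    means = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy example with a block size of 2 instead of 20.
print(block_means([1, 2, 3, 4, 5, 6], block=2))  # [1.5, 3.5, 5.5]
```

With a block size of 20, a 200-epoch run yields ten curve points, each summarizing 20 epochs.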
In this section, we first analyze the results of ablation experiments and then demonstrate the effect and role of the adaptive weighted loss function. Finally, we compare the results of MrsSeg-AWL and the existing segmentation network in desert land type segmentation task.
First, a detailed ablation experiment was performed on MrsSeg-AWL to better understand the gain from each improved component. The ablation results are shown in Figure 8 and Table 2. With the backbone network and the number of branches unchanged (Net1), introducing skip connections into the aggregation blocks (Net2) could effectively avoid the network degradation caused by the increase in network layers. By changing the final stage of the multi-level fusion decoder from convolution before upsampling to upsampling before convolution (Net3), the decoding capability of the decoder was enhanced to better recover the detailed features of desert images. Compared with the model that only used the cross-entropy loss function, MrsSeg-AWL with the adaptive weighted loss function improved mIoU by 3.8%. It can be clearly seen from Figure 8 that the mIoU of MrsSeg-AWL improved rapidly between 0 and 40 epochs, and the mIoU curve did not oscillate after 140 epochs, indicating that the adaptive weighted loss function effectively improved the convergence speed of the model.
The training curves of MrsSeg with different loss strategies compared with the single-branch model (Baseline) are shown in Figure 9 and Table 3. It can be seen from Table 3 that when fixed balancing parameters were used, increasing the number of integrated branches could effectively increase the mIoU of the model, which shows that the use of additional branch losses has a positive effect on the final result. Compared with the model trained only with cross-entropy loss, the model trained with four-branch loss improved the mIoU by 2.2%, and the model that used the adaptive weighted loss function improved the mIoU by 3.9%. It can be seen from Figure 9d that the training curve of MrsSeg with adaptive weighted loss was steeper between 0 and 40 epochs, and the curve did not oscillate after 140 epochs. Compared with the other loss strategies, the MrsSeg-AWL training curve also achieved the highest mIoU. The experimental result shows that the adaptive weighted loss function effectively improved the convergence speed and the mIoU of the model.
Figure 10 shows the balancing parameter curves of MrsSeg-AWL in different training stages. The balancing parameters represent the optimization degree of each branch. The larger the balancing parameter was, the easier the branch was to optimize. The lines in Figure 10 correspond to the four branch outputs, drawn in the same colors as in Figure 2. On the one hand, the top-down comparison of Figure 10a-d shows that the branch task with higher resolution had larger balancing parameter values throughout the training process, indicating that the integration of global information and local information could better optimize the training process.
On the other hand, the curves show that the optimization difficulty of each branch loss in the multi-resolution supervision network differs between training stages. If fixed balancing parameters were adopted, the proportion of each branch loss could not be dynamically adjusted, such that the model could not be further optimized. The experimental results further demonstrated that the adaptive weighted loss function helped to adjust the influence of each branch loss on the total loss in different training stages and gave priority to training the branch with a large optimization space, so as to accelerate convergence and improve accuracy.
Table 4 shows the desert segmentation results of four mainstream lightweight backbones, which were all pre-trained on ImageNet classification, with the feature fusion module and the adaptive loss function unchanged. The experimental results show that MobilenetV2 achieved the best desert segmentation result. Although its segmentation was slightly slower than that of ResNet-18, its mIoU was 2.2% higher. Therefore, in the following comparative experiment, we used MobilenetV2 as the backbone network.
The performances of different models in the desert segmentation task are shown in Table 5. It can be seen from the table that the improved MrsSeg-AWL achieved the highest mIoU, and the adaptive weighted loss function improved the mIoU by 1.7% without increasing the prediction time of MrsSeg. In the experiment, we found that FCN based on VGG was prone to convergence difficulties. FPN achieved the fastest prediction speed, but its accuracy was unsatisfactory. DeepLabV3+ was better than MrsSeg-AWL with respect to speed and comparable to MrsSeg-AWL with respect to accuracy. However, MrsSeg-AWL requires less parameter tuning.
Experimental results showed that the proposed MrsSeg-AWL with the multi-resolution fusion network and adaptive weighted loss function performed better in the desert segmentation task than the existing segmentation networks. Table 6 shows the land type segmentation results of each model. It can be seen that the IoU of the desert and oasis categories was generally high, indicating that these land types were easier to identify when samples were sufficient. The IoU of the MrsSeg-AWL segmentation result reached 84% and 86%, respectively. Due to the small number and the large-scale change of river samples, the IoU of this category was generally low. The IoU of MrsSeg-AWL for the river category reached 23.1%, which is 3.4% higher than that of DeepLabV3+. This shows that the multi-resolution supervision network could better learn the characteristics of river samples and obtain more accurate segmentation results when the number of samples was small and the sample scale changed greatly.

Desert Segmentation Results
The segmentation results of desert remote sensing images are shown in Figure 11. It can be seen that the images segmented by FPN have large mis-segmented areas, indicating that the feature fusion network with a single branch cannot make full use of local and global semantic information, leading to pixel-level classification errors. Compared with FPN, U-Net significantly reduced the false detection area in the desert and gobi land types, but its classification of river samples was still not accurate enough, and the segmentation edges were rough. MrsSeg-AWL and the state-of-the-art segmentation network DeepLabV3+ performed well on the desert and oasis segmentation tasks. While extracting the multi-resolution features of desert images, MrsSeg-AWL used the adaptive weighted loss function to promote multi-resolution feature fusion through supervised training. This enabled MrsSeg-AWL to better learn the characteristics of the river category, which has few samples and large-scale change, and to obtain more accurate and clearer desert segmentation maps.
In order to test the desert segmentation ability in a non-sampling region, we randomly selected some desert images of the Nile valley and carried out the segmentation test on these images using the model trained on the desert dataset. The segmentation results are shown in Figure 12. It can be seen from the figure that the segmentation results of FPN and U-Net contained large areas of falsely detected desert and gobi, indicating that the feature extraction ability of these methods needs to be improved. MrsSeg-AWL and the state-of-the-art segmentation network DeepLabV3+ had better overall classification results. MrsSeg-AWL, using the multi-resolution supervision network, had more accurate segmentation edges for the river category. The test result shows that MrsSeg-AWL has good application potential for desert segmentation.

Discussion
Desert remote sensing images are usually characterized by large image size, large-scale change, and irregular location distribution of surface objects. In the desert data, the gobi and river samples accounted for only 15% and 1% of the total number, respectively. It can be seen from Table 6 that MrsSeg-AWL achieved the highest IoU in the river category and a gobi segmentation result comparable to that of DeepLabV3+. It can also be seen from Figures 11 and 12 that MrsSeg-AWL achieved good segmentation results, especially for complex images such as those containing small areas of oasis and rivers. MrsSeg-AWL used the adaptive weighted loss function to promote multi-resolution feature fusion through supervised training. This enabled MrsSeg-AWL to better learn the characteristics of categories with few samples and large-scale changes and to obtain more accurate and clearer desert segmentation maps.
In the experiment, we found that some areas were overexposed and that the narrow rivers in the desert area had seasonal flow interruptions, both of which affected the accuracy of desert segmentation. Therefore, in future research, we will focus on reducing the adverse effects of image noise, so as to further improve the segmentation results.

Conclusions
Accurate desert segmentation results could provide a basis for the timely understanding of the status, extent, and evolution of desert areas. Desert images are usually characterized by large image size, large-scale change, and irregular location distribution of surface objects. The multi-scale fusion method is widely used in existing deep learning segmentation models to solve these problems. Based on the idea of multi-scale feature extraction, this paper took the segmentation result of each scale as an independent optimization task and proposed a multi-resolution supervision network (MrsSeg) to further improve the desert segmentation result. Because the optimization difficulty differs between scale tasks, we also proposed an adaptive weighted loss function (AWL) to automatically optimize the training process. First, we collected remote sensing images of the Xinjiang region from the Environment and Disaster Monitoring and Forecasting Small Satellite Constellations A and B (HJ-1A/B) and used these images to create a desert segmentation data set. Then, a multi-resolution segmentation method based on the adaptive weighted loss function was proposed. Finally, image segmentation experiments and result analysis were carried out on remote sensing images of Xinjiang and the Nile valley. The experimental results showed that the proposed multi-resolution supervision network could effectively improve the desert segmentation accuracy with low parameter complexity. The adaptive weighted loss function accelerated the convergence of the model and further improved the segmentation results. MrsSeg-AWL also showed a certain improvement in the gobi and river categories, which have few samples and are difficult to segment. To sum up, the improved network is an effective method for automatic desert remote sensing image segmentation.

Data Availability Statement:
The data and the code of this study are available from the corresponding author upon request (xiamin@nuist.edu.cn).

Conflicts of Interest:
The authors declare no conflict of interest.