Land Use and Land Cover Mapping Using RapidEye Imagery Based on a Novel Band Attention Deep Learning Method in the Three Gorges Reservoir Area

Abstract: Land use/land cover (LULC) change has been recognized as one of the most important indicators to study ecological and environmental changes. Remote sensing provides an effective way to map and monitor LULC change in real time and for large areas. However, with the increasing spatial resolution of remote sensing imagery, traditional classification approaches cannot fully represent the spectral and spatial information from objects and thus have limitations in classification results, such as the "salt and pepper" effect. Nowadays, deep semantic segmentation methods have shown great potential to solve this challenge. In this study, we developed an adaptive band attention (BA) deep learning model based on U-Net to classify the LULC in the Three Gorges Reservoir Area (TGRA) combining RapidEye imagery and topographic information. The BA module adaptively weighted input bands in convolution layers to address the different importance of the bands. By comparing the performance of our model with two typical traditional pixel-based methods including classification and regression tree (CART) and random forest (RF), we found a higher overall accuracy (OA) and a higher Intersection over Union (IoU) for all classification categories using our model. The OA and mean IoU of our model were 0.77 and 0.60, respectively, with the BA module and were 0.75 and 0.58, respectively, without the BA module. The OA and mean IoU of CART and RF were both below 0.51 and 0.30, respectively, although RF slightly outperformed CART. Our model also showed a reasonable classification accuracy in independent areas well outside the training area, which indicates strong model generalizability in the spatial domain. This study demonstrates the novelty of our proposed model for large-scale LULC mapping using high-resolution remote sensing data, which well overcomes the limitations of traditional classification approaches and suggests the consideration of band weighting in convolution layers.


Introduction
The Three Gorges Project (TGP) built the largest dam across the third longest river (Yangtze River) in the world. It was built to provide great power, improve river shipping, and control floods in the upper reaches while increasing the dry season flow in the middle and lower reaches of the Yangtze River. The TGP has created a reservoir area of 1080 km² by damming the Yangtze River and greatly changed the landscape pattern in the Three Gorges Reservoir Area (TGRA) [1]. Approximately 1.25 million people were displaced over a 16-year period ending in 2008. Thus far, this project has begun to bring enormous economic benefits but has caused adverse and often irreversible environmental impacts [2,3]. To characterize the environmental impacts of the TGP, the Chinese government has formulated a series of relevant policies to monitor the ecological and environmental changes [2,4,5].
LULC change has been recognized as one of the most important indicators to study ecological and environmental changes [6,7]. Remote sensing technology provides the most effective way to monitor LULC at large scales [8,9]. LULC mapping with high spatial resolution is fundamental to assessing ecosystem services and global environmental change [10]. It is therefore necessary to develop an automated and efficient classification method for high-resolution LULC mapping at large scales.
With the increasing availability and spatial resolution of remote sensing data, traditional classification approaches for LULC mapping using remote sensing can be divided into two groups: pixel-based methods and object-based methods [11].
Pixel-based methods have long been the main approach to classifying remote sensing images using the spectral information of each pixel. Initially, parametric classifiers were commonly used, such as parallelepiped, maximum likelihood, and minimum distance classifiers [12]. In 2014, Yu et al. found that 32% of the 1651 studies on remote sensing classification used maximum likelihood due to its efficiency and accessibility in most remote sensing software [13]. In recent years, non-parametric machine learning classifiers were also developed to improve classification accuracy, such as the classification and regression tree (CART) [14][15][16], support vector machine (SVM) [17][18][19], and random forest (RF) [20,21]. These classifiers are often used for hyperspectral data with more spectral information provided in each pixel [22,23]. However, the increase in within-class variance and decrease in between-class variance often limit the accuracy of pixel-based methods [24]. In addition, with the increase in the spatial resolution of sensors to 5-30 m or higher, the pixel-based classification results tend to possess "salt and pepper" noise, that is, noise from other categories appearing within a category due to the spectral confusion of different land cover types [25].
To overcome this issue, object-based classification methods were proposed for high-resolution image analysis [26]. In contrast to pixel-based methods, object-based methods use both spectral and spatial features of objects and divide the classification into two steps: image segmentation and classification [27]. Image segmentation splits an image into separate regions or objects based on geometric features, spatial relations, and scale topology relations of upscale and downscale inheritances. A group of pixels with similar spectral and spatial properties is aggregated into one object, thereby avoiding "salt and pepper" noise [28,29]. The classic segmentation algorithms mainly include recursive hierarchical segmentation (RHSeg) [6,30], multiresolution segmentation [31], and watershed segmentation [32]. Zhang et al. mapped the land cover of China using HJ satellite imagery based on an object-based method, which showed an overall accuracy of 86%. However, this two-step classification method requires substantial human involvement to adjust the parameters in each step, which limits model generalization and adaptability over large areas [3].
Over the past few years, deep learning based on convolutional neural networks (CNN) has become a technology of considerable interest in computer vision with the emergence of graphics processing units (GPU). CNN is a very powerful tool for object identification and image classification which can yield informative features hierarchically [33]. Recently, fully convolutional networks (FCN) based on CNN were developed for pixel-wise semantic segmentation, which assigns a semantic label (i.e., a class) to each coherent region of an image using dense pixel-wise prediction models. They were first used for semantic segmentation of digital images [34], multi-modal medical image analyses, and multispectral satellite image segmentation. Several semantic network architectures have been proposed, e.g., SegNet [35], U-Net [36], FC-Densenet [37], E-Net [38], Link-Net [39], RefineNet [40], and PSPNet [41]. In particular, the U-Net architecture, initially developed for biomedical image segmentation, has now been widely used in Kaggle competitions and in classifying ground features with high accuracy. The rapid development of semantic segmentation architectures brings opportunities for high-resolution remote sensing classification.
However, compared to digital images with red, green, and blue bands, multispectral imagery normally has 3-10 bands, and hyperspectral imagery can even consist of hundreds or thousands of bands. In addition, the data from different remote sensing sensors or platforms can be stacked together to provide multi-band information [11]. The relationship between bands is a critical feature in remote sensing classification. Traditionally, vegetation indices such as the normalized difference vegetation index (NDVI) [42], soil-adjusted vegetation index (SAVI) [43], and normalized difference water index (NDWI) [44] are calculated to enhance the contrast between targets and background. However, in a CNN, the convolutional operations extract feature information from input bands with equal weights, without considering the different importance of the spectral information in multiple bands.
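The vegetation indices mentioned above are simple band ratios. As a minimal illustration, NDVI is computed as (NIR − Red) / (NIR + Red); the function name and small epsilon guard below are our own choices for this sketch:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red).

    The small eps avoids division by zero over dark pixels."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Dense vegetation reflects strongly in NIR and absorbs red light,
# so vegetated pixels yield NDVI values approaching 1.
print(ndvi([0.45, 0.30], [0.05, 0.25]))
```

SAVI and NDWI follow the same normalized-difference pattern with different band pairs (and a soil-brightness term for SAVI).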
In this study, we developed an adaptive band attention (BA) deep learning model based on U-Net to classify LULC in TGRA using RapidEye imagery and topographic information. We introduced the BA module that weighted bands adaptively in CNN by learning the different importance of input bands. To evaluate our model, the results based on traditional pixel-based classification methods including CART and RF, as well as an object-based multiresolution segmentation method, were generated for comparison. The main contributions of this study are listed as follows:

1. We proposed an end-to-end deep learning method that overcomes the limitations of traditional methods for high-resolution remote sensing classification.
2. We investigated the impact of the BA module, which generates a dynamic weight for each band in the CNN, on model performance.
3. To evaluate the spatial generalizability of the proposed model, we tested the trained model in other regions of the study area.

Study Area
In this study, the TGRA was selected as the study area. It is located in the transition zone from the Qinghai-Tibet Plateau to the lower reaches of the Yangtze River within 106°16′E-111°28′E and 28°56′N-31°44′N (Figure 1). The TGRA has an area of 58,000 km² and a population of 20 million. Approximately 74% of this region is mountainous, 4.3% is plain, and 21.7% is hilly, with the elevation ranging from 73 to 2977 m. The TGRA consists of 20 counties, with 16 in Chongqing province and 4 in Hubei province. The study area is ecologically vulnerable with frequent landslides. Due to the monsoon climate, there is obvious seasonality in the TGRA. Annual precipitation ranges from 1000 to 1200 mm, with 60% of the precipitation occurring during June-September.
Moreover, Wushan county, located at the west entrance of the Wu Gorge, is called the 'core heart' of the TGRA (Figure 2). Its location is within 109°33′E-110°11′E and 30°45′N-31°28′N. Wushan county occupies 2958 km² (∼5% of the TGRA) and has a population of 600,000. In this study, our proposed model was first trained and validated in Wushan county. We then evaluated the model generalizability in other counties of the TGRA.

Data Sources
The RapidEye imagery at 5-m resolution was used in this study. The RapidEye constellation was launched into orbit on 29 August 2008. RapidEye constellation sensors provide five bands: blue (440-510 nm), green (520-590 nm), red (630-685 nm), red edge (690-730 nm), and near infrared (760-850 nm). The abundant spatial and spectral information in RapidEye imagery has been used for agriculture, forestry, environmental monitoring, etc. In particular, the unique red edge band, which is sensitive to changes in chlorophyll content, can assist in monitoring vegetation health and improving species separation [45]. In this study, 198 tiles of the RapidEye Surface Reflectance Product in summer and autumn of 2012 with minimum cloud cover were collected. The atmospheric effects were removed by atmospheric correction using 6S [46]. Topographic attributes such as elevation and slope provide useful terrain information for land cover classification [47]. The elevation from SRTM 1 arc-second global data at 30-m resolution was used in this study [48]. The slope was calculated from elevation using the Spatial Analyst toolbox in ArcGIS [49]. The elevation and slope were both resampled to 5 m using nearest neighbor interpolation to match the RapidEye imagery.
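For an integer resolution ratio (30 m to 5 m is a factor of 6), nearest-neighbor resampling amounts to repeating each grid cell along both axes. The sketch below is illustrative and is not the exact ArcGIS/GDAL routine used in the study:

```python
import numpy as np

def nn_resample(grid, factor):
    """Nearest-neighbor upsampling of a 2-D raster by an integer factor.

    Each cell value is repeated factor times along rows and columns,
    e.g. factor=6 brings a 30-m DEM onto a 5-m pixel grid."""
    grid = np.asarray(grid)
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)

# A 2x2 grid upsampled by 2 becomes a 4x4 grid of repeated blocks.
print(nn_resample([[1, 2], [3, 4]], 2))
```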
The LULC map of Wushan county in 2012 was generated as the ground truth map using object-oriented classification methods; it includes six categories: forest, shrubland, grassland, cropland, built-up, and waterbody (Figure 2). The segmentation was processed by multiresolution segmentation [29] in eCognition [27]. CART [50] was used to classify the segmented objects. The LULC maps were further corrected manually by visual interpretation and validated by extensive ground survey points [51]. The ground survey points were collected by technicians driving randomly through a survey area and recording the homogeneous land cover types within the visible range on both sides of the road using a smartphone app named GVG (a volunteer geographical information smartphone app) [52,53]. The locations of all points were corrected manually based on Google Maps. In total, 4174 points were used in this study (Figure 1).

The Model Architecture
In this study, the U-Net architecture was selected as our base structure (Figure 3). U-Net, with its simple and efficient architecture, has become the most widely used semantic segmentation structure in recent years [36]. The most important innovation of U-Net is the skip connection, which allows the decoder at each stage to learn back relevant features that are lost when pooled in the encoder. The skip connection has proved effective in recovering the details of target objects even against a complex background. As Figure 3 shows, the structure consists of encoder and decoder parts. The encoder extracts feature maps from the input image through several convolution-pooling operations, and the decoder predicts the classification result and recovers the deep hidden features to the original image size by upsampling. The input bands are fed into the encoder part to generate hidden features. For each encoder operation, the band number of the feature is doubled and the band size is halved. Each encoder block contains a max pooling layer and two repeated conv operations. The max pooling downsamples the input representation to half its size. The conv operation has a 3 × 3 convolution kernel followed by a rectified linear unit (ReLU) [54] and a BatchNorm layer [55]. The operation can be formulated as:

E_i = f_Conv(f_Conv(f_down(E_{i−1}))),

where E_i is the hidden feature at layer i, E_{i−1} is the hidden feature of the layer above E_i, f_down represents the downsampling operation by max pooling, and f_Conv denotes the conv operation. The decoder adopts a structure similar to the encoder, replacing the max pooling layer with an upsampling layer that bilinearly expands the deep feature to the original size. For each decoder block, the upscale factor was set to 2 to ensure that the output size matches the corresponding encoder output. A skip connection was used to concatenate the encoder and decoder at each layer. The original skip connection is simply residual learning.
The hidden feature from the encoder was directly concatenated to the decoder part. The decoder operation can be formulated as:

D_i = f_Conv(f_Conv(Concat(E_i, f_up(D_{i+1})))),

where D_i is the hidden feature of the decoder part at layer i, D_{i+1} is the hidden feature of the layer below D_i, and f_up represents the upsampling operation.
In this study, the model input included seven bands, combining the RapidEye imagery with the elevation and slope information. To emphasize the effect of the spectral relationship between bands, a BA connection module was introduced to replace the skip connection in the original U-Net structure. Four BA modules were added at different depths of the semantic segmentation model and provided 32, 64, 128, and 256 weights, respectively. The aim of this operation is to generate a dynamic weight for each band. We first squeezed each channel to a single numeric vector using pooling layers; here, average and max pooling were used to create two values for each channel. The squeezed vector was then fed into a fully connected (FC) layer with ReLU activation, which reduces the dimensionality while introducing new non-linearities. Sigmoid activation was applied at the end to give each channel a smooth weight constrained to the range 0-1. Finally, the original hidden feature was multiplied by the weights and sent to the decoder part. This can be formulated as:

W_i = σ(f_FC([P_avg(E_i), P_max(E_i)])),

where σ, f_FC, and P represent the Sigmoid, FC, and pooling operations, respectively. Consequently, the decoder operation with the BA connection can be formulated as:

D_i = f_Conv(f_Conv(Concat(W_i ⊗ E_i, f_up(D_{i+1})))).

The model was trained with a pixel-wise loss function; the categorical cross entropy was selected in this work:

Loss = −Σ_i y_i log(ŷ_i),

where y_i is the ground truth class at location i and ŷ_i is the predicted value at the corresponding location.
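The BA weighting above can be sketched framework-agnostically. The numpy version below follows the described pipeline (average/max squeeze, FC + ReLU, FC + Sigmoid, channel rescaling); the weight shapes and the two-layer FC design are our assumptions for illustration, not the study's exact PyTorch implementation:

```python
import numpy as np

def band_attention(feature, w1, b1, w2, b2):
    """Sketch of the band-attention (BA) channel weighting.

    feature: array of shape (C, H, W), one hidden feature map.
    w1/b1: first FC layer mapping the (2C,) squeezed vector to a
           reduced dimension (shape (C_mid, 2C) and (C_mid,)).
    w2/b2: second FC layer mapping back to C channel weights."""
    avg = feature.mean(axis=(1, 2))           # (C,) average-pooled descriptor
    mx = feature.max(axis=(1, 2))             # (C,) max-pooled descriptor
    squeezed = np.concatenate([avg, mx])      # (2C,) squeezed vector
    hidden = np.maximum(0.0, w1 @ squeezed + b1)          # FC + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # FC + Sigmoid, in (0, 1)
    return feature * weights[:, None, None]   # rescale each band

# With all-zero FC weights every channel gets sigmoid(0) = 0.5,
# so the feature map is uniformly halved.
feat = np.ones((4, 2, 2))
out = band_attention(feat, np.zeros((3, 8)), np.zeros(3),
                     np.zeros((4, 3)), np.zeros(4))
```

In the trained model these FC weights are learned, so each hidden band receives a data-dependent weight before being concatenated into the decoder.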

Data Preprocessing
In this study, we conducted data preprocessing before training and evaluating the proposed model (Figure 4). First, to standardize the brightness of the original RapidEye images, histogram matching was performed when mosaicking all images [56]. The original images and the image mosaic after histogram matching are shown in Figure 5. Then, we stacked the five-band RapidEye imagery with the topographic data, including the DEM and slope. In general, remote sensing images are too large to pass through a CNN. Given the limited GPU memory, we split our input bands into small patches (256 × 256 pixels) using an overlapped sliding window (Figure 6). The overlap rate was set to half of the patch size, which doubled the number of training samples. In total, we sampled 6664 image patches using the sliding window. Moreover, to reduce model overfitting, three forms of data augmentation were used for each sample: random brightness and contrast adjustment, vertical and horizontal flipping, and random rotation [57].
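The overlapped sliding window can be sketched as below; the function name and the (C, H, W) band-first layout are our own conventions for this illustration:

```python
import numpy as np

def sliding_patches(image, patch=256, overlap=0.5):
    """Split a (C, H, W) raster stack into overlapping square patches.

    With overlap=0.5 the stride equals half the patch size, so
    adjacent patches share 50% of their pixels, roughly doubling
    the number of samples along each axis."""
    stride = int(patch * (1 - overlap))
    _, h, w = image.shape
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[:, top:top + patch, left:left + patch])
    return patches

# A 7-band 512x512 stack yields a 3x3 grid of 256x256 patches
# at stride 128 (window tops at rows/cols 0, 128, 256).
stack = np.zeros((7, 512, 512))
patches = sliding_patches(stack)
```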

Experiment Design and Models for Comparison
In our proposed model, 80% of the image patches were used for model training and the rest were used for validation. The model was implemented in PyTorch and executed on a server with an Intel(R) Xeon(R) CPU E5-2650, an NVIDIA 2080TI, and 64 GB of memory. To optimize the model parameters, Adam, a stochastic optimization algorithm, was used to train the model with a batch size of 64 samples. We initially set the learning rate (LR) to 1 × 10⁻³. The LR decreased to 1 × 10⁻⁶ with increasing iterations.
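The paper states only the LR endpoints (1 × 10⁻³ down to 1 × 10⁻⁶); the exponential decay shape in the sketch below is an assumption, shown to illustrate how such a schedule can be computed per iteration:

```python
def decayed_lr(step, total_steps, lr_start=1e-3, lr_end=1e-6):
    """Exponential learning-rate decay from lr_start to lr_end.

    Returns the LR for a given training step; the exponential shape
    is an assumption, as the paper only gives the two endpoints."""
    frac = min(step / total_steps, 1.0)       # progress in [0, 1]
    return lr_start * (lr_end / lr_start) ** frac

# LR starts at 1e-3 and reaches 1e-6 at the final step.
print(decayed_lr(0, 10000), decayed_lr(10000, 10000))
```

In PyTorch this kind of schedule is typically attached to the optimizer via a scheduler that updates the LR each epoch or iteration.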
Moreover, two commonly used pixel-based classification methods, CART and RF, were selected for comparison. CART, often referred to as 'decision trees', is a strictly non-parametric model that can be trained on any input without parameter adjustment, and its prediction is rapid without complex computation. Due to its simplicity, CART has many advantages in land cover analysis. RF is an ensemble classifier that uses a set of CARTs. In an RF model, the out-of-bag (OOB) and permutation tests can determine the importance of each feature. RF is also computationally efficient and deals better with high-dimensional data without over-fitting. In this study, we randomly selected 2000 pixels for each category from the training images to train the pixel-based classifiers. The pixel-based classifiers were implemented in Python with the Scikit-Learn package [58]. Detailed parameters are shown in Table 1. Two metrics derived from the confusion matrix were calculated to assess the classification accuracy of each method: the overall accuracy (OA) and the F1-score. OA evaluates the overall effectiveness of an algorithm, and the F1-score measures the accuracy by combining the precision and recall measures. OA and F1-score are formulated as:

OA = S_d / n,

F1 = 2 X_ii / (X_i + X_j),

where S_d is the total number of correctly classified pixels, n is the total number of validation pixels, and X_ij is the observation in row i, column j of the confusion matrix. X_i is the marginal total of row i and X_j is the marginal total of column j of the confusion matrix. The Intersection over Union (IoU) was also used to evaluate the segmentation performance of each method. IoU is the ratio of the overlapped area to the area of the union between the predicted and ground truth categories (Figure 7). This metric ranges from 0 to 1 (0-100%), with 0 signifying no overlap and 1 signifying perfect segmentation. The mean IoU (MIoU) is calculated by averaging the IoU of each category.
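The three metrics can all be read off a confusion matrix. The sketch below assumes the common convention that cm[i, j] counts pixels of true class i predicted as class j:

```python
import numpy as np

def classification_metrics(cm):
    """Compute OA, per-class F1, and mean IoU from a confusion matrix.

    cm[i, j] = number of pixels of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)                 # correctly classified pixels per class
    oa = diag.sum() / cm.sum()         # OA = S_d / n
    row = cm.sum(axis=1)               # ground-truth totals (recall denominator)
    col = cm.sum(axis=0)               # prediction totals (precision denominator)
    f1 = 2 * diag / (row + col)        # per-class F1 = 2*X_ii / (X_i + X_j)
    iou = diag / (row + col - diag)    # intersection / union per class
    return oa, f1, iou.mean()

# Toy 2-class example: 8 and 9 correct, 3 misclassified out of 20 pixels.
oa, f1, miou = classification_metrics([[8, 2], [1, 9]])
```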

Classification Results
The performance of our proposed model, with or without the BA module, was significantly better than that of the traditional pixel-based classification methods (Table 2). Between the pixel-based classification methods, RF performed better than CART. The OA and MIoU of RF were 0.510 and 0.265, respectively, while the OA and MIoU of CART were only 0.390 and 0.175, respectively. In contrast, the OA and MIoU of our model were 0.771 and 0.596, respectively, with the BA module and 0.754 and 0.578, respectively, without it. Among the six categories, the classification of waterbody had the highest accuracy for each method (Figure 8). Overall, 91-96% of waterbody pixels could be correctly classified. Among the vegetation categories, the classification accuracy of forest was higher than the others. Specifically, 52% and 67% of forest pixels were correctly classified using CART and RF, respectively, and 80% and 82% of forest pixels were correctly classified using our proposed model without and with the BA module, respectively. For the two pixel-based classification methods, the classification accuracies of shrubland and grassland were the lowest, with an accuracy of 32-36% for shrubland and 39-53% for grassland (Figure 8b). Moreover, a small proportion of cropland was classified as built-up, especially by our proposed model.

Model Generalizability
In total, 4174 field ground truth points were used to evaluate the generalizability and adaptability of the proposed model in other counties. Our proposed model was originally trained in Wushan county and further evaluated with this independent dataset. As Figure 9 shows, the OA of the classification was above 70% in many counties. The lowest OA estimates occurred in Wuxi and Yuyang counties, where there were only a few ground truth points. The OA for the whole TGRA was 72.7% based on all ground truth points.

Discussion
In this study, we developed a novel adaptive BA deep learning model based on U-Net to classify the LULC in the TGRA using RapidEye imagery and topographic information. We comprehensively evaluated the performance of the proposed model by comparison with two traditional pixel-based classification methods. Given the relationship of the spectral information between input bands, we also investigated the effectiveness of an adaptive BA module which generates a dynamic weight for each band in CNN. As the results show, the accuracy of our proposed model was considerably higher than that of the pixel-based methods. By adding the BA module, the model performance could be further improved.
Our proposed deep learning model not only showed a higher classification accuracy in the training region, but also demonstrated strong spatial generalizability in other regions, which allows for large-scale LULC mapping based on a model trained in a local region.

Traditional Methods vs. CNN-Based Methods
The pixel-based classification methods only use the spectral information of each pixel, as shown in Figure 10. The waterbody has lower reflectance in the NIR band and is mostly distributed in low and flat areas (Figure 10). Thus, waterbody had the highest accuracy using all methods in our study (Figure 8). However, the vegetation categories such as forest, shrubland, grassland, and cropland share similar spectral properties, which makes it difficult to distinguish them. Forest had a relatively higher accuracy than the other vegetation types because forests are usually distributed in mountainous areas with high elevation. In our study, many cropland pixels were assigned to built-up. One reason is that Wushan county is a farming-based county where most croplands are distributed around residential areas, which increases the misclassification. Another reason is that the RapidEye images were acquired in summer and autumn when most cropland is near harvest or after harvest, which results in the misclassification of bare-ground cropland into built-up. In comparison, the classification output based on our proposed model showed a much smoother look than that based on the pixel-based methods (e.g., RF) and matched well with the ground truth labels (Figure 11, where the first column shows the RGB image, the second the ground truth label, the third the RF classification, and the fourth the classification from the proposed model). The disadvantage of pixel-based methods relying on spectral information is amplified when dealing with higher spatial resolution imagery, resulting in severe "salt and pepper" noise in the classified images. This was also demonstrated by the lower MIoU (<30%) estimates of the pixel-based classification methods (Table 2). In our proposed model, the CNN can deal with joint spatial-spectral information using convolution kernels. All values in adjacent pixels are fed into a convolution filter to extract feature information. The pooling layers further extract the average or maximum value of adjacent pixels while excluding abnormal and unimportant information, which avoids the "salt and pepper" effect in the results.

Furthermore, as the proposed model can generate a segmented result, we compared the segmentation performance of the proposed model with a traditional object-based method. In this work, the multiresolution segmentation algorithm, which is the most widely used segmentation method and has been integrated in the eCognition land cover mapping software, was selected. The LULC map of Wushan county (the ground truth map) was also generated using this method. This method creates objects using an iterative algorithm: objects (beginning with individual pixels) are clustered until an upper object variance threshold is reached. To minimize the fractal borders of the objects, the variance threshold (scale parameter) is weighted with shape parameters. Larger objects are generated when the scale parameter is raised, but their exact size and dimensions are determined by the underlying data [59]. In this study, given the large size of the remote sensing images and the 5 m spatial resolution, the scale parameter was set to 30 for the segmentation. Figure 12 shows the segmentation results using the multiresolution segmentation method and our proposed model. As shown in the second column of Figure 12, the edge of each segmented object looks rough due to the scale effect of segmentation. In contrast, the results of our proposed model were smoother and more realistic upon visual inspection.
In this study, we evaluated our model generalization in other counties using independent ground truth points. The accuracy in these counties was still acceptable, which demonstrates the strong model generalizability. Our proposed model was trained using massive training samples, and the data augmentation randomly transformed the training images to increase the effective variability of the data, which enables the CNN-based model to generalize well.

The Importance of Band Weighting
In this study, the topographic datasets were combined with the RapidEye imagery to provide more attribute information. Benediktsson et al. demonstrated that elevation information can account for up to 80% of the contribution to land cover classification [47]. Figure 13 shows the importance score of each band in the pixel-based classification; the DEM showed the highest feature importance, and the near-infrared band, which is generally considered the most important band for identifying vegetation, provided the second highest importance. The blue band showed a negative importance, which can also be explained by the similar spectral distribution shared by each category in Figure 10. However, in CNN-based models, the weights of each band are generally set equal. This is reasonable when dealing with traditional RGB or grayscale images, which have only three bands or one band. In this study, the model input has a total of seven bands, including the RapidEye imagery and the elevation and slope information. We introduced a BA module which generated an adaptive weight for each band in the convolution operation. Four BA modules were added at different depths of the semantic segmentation model. The number of bands was extended to 32, 64, 128, and 256 in the hidden features, and the band weights were multiplied on the hidden features. Based on the adaptive weight of each band for each category in Figure 14, the weights varied by class, which demonstrates that the BA module changed the weight of each band in the prediction. The F1-scores of forest, shrubland, grassland, cropland, and built-up increased by adding the BA module (Table 2). In particular, for the three most difficult categories, namely built-up, grassland, and shrubland, the F1-score increased by 0.025, 0.046, and 0.025, respectively. The OA and MIoU improved from 0.754 to 0.771 and from 0.578 to 0.596, respectively, which demonstrates the effectiveness of band weighting in our proposed model for classification.
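The per-band importance scores in Figure 13 come from the pixel-based classifiers. A minimal Scikit-Learn sketch on synthetic data shows how such scores are obtained; the seven "bands", labels, and random data here are entirely illustrative, not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic pixels with 7 input features, mimicking the stack of
# 5 spectral bands + DEM + slope (hypothetical data for illustration).
X = rng.normal(size=(500, 7))
# Label driven mostly by feature 5 (standing in for the DEM).
y = (X[:, 5] + 0.3 * X[:, 4] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Impurity-based importances sum to 1; the dominant feature scores highest.
print(rf.feature_importances_)
```

Scikit-Learn's `feature_importances_` is impurity-based; permutation importance (also available in Scikit-Learn) is the alternative mentioned in the paper's description of RF.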
However, the F1-score of waterbody slightly dropped when the BA module was added (Table 2). One reason is that the ground truth land cover data were corrected by visual interpretation and validated by extensive ground survey points, and it cannot be guaranteed that the field data were collected on the acquisition date of the remote sensing data. Thus, some seasonal water bodies might be labeled as waterbody in the ground truth map even though the remote sensing images were captured in summer, when these water bodies might dry out. This situation can lead to the slight decline in accuracy of the proposed model with the BA module. Moreover, deep learning methods are usually referred to as "black boxes", and their mechanisms are difficult to explain by traditional methods. Although we analyzed the impact of the BA module on the classification results, we still cannot clearly explain how the BA modules affected the results. Explaining the mechanism of the BA module in deep learning is a direction of our future work.

Conclusions
Land use and land cover mapping is a critical task in remote sensing applications, and it remains challenging as the spatial resolution of remote sensing data increases. The commonly used traditional pixel-based classification methods cannot fully represent the feature information of objects and thus have limitations in their classification results, such as the "salt and pepper" effect. Although object-oriented classification methods provide a way to address this challenge, they usually include two steps, namely image segmentation and classification, which require substantial human involvement to adjust the parameters in each step and lack generalization and adaptability when dealing with large areas. In this study, an end-to-end CNN-based method integrated with a band attention (BA) module was proposed. The BA module was introduced to leverage the spectral information in our proposed model. The results show that the proposed CNN model outperformed traditional pixel-based methods in classifying high-resolution images. By adding the BA module, the model performance increased further. When the trained model was evaluated at independent regions outside the training area, the classification accuracy was still acceptable, which demonstrates the strong model generalizability in the spatial domain.