A Coarse-to-Fine Deep Learning Based Land Use Change Detection Method for High-Resolution Remote Sensing Images

Abstract: In recent decades, high-resolution (HR) remote sensing images have shown considerable potential for providing detailed information for change detection. Traditional change detection methods based on HR remote sensing images mostly detect only a single land type or only the change range, and cannot simultaneously detect changes of all object types and pixel-level range changes in an area. To overcome this difficulty, we propose a new coarse-to-fine deep learning-based land-use change detection method. We independently created a new scene classification dataset called NS-55, and innovatively considered the adaptation relationship between the convolutional neural network (CNN) and scene complexity by selecting the CNN that best fits the scene complexity. The CNNs trained on NS-55 were used to detect the category of each scene, the final category of each scene was defined by majority voting, and the changed scenes were obtained by comparison, yielding the so-called coarse change result. Then, we created a multi-scale threshold (MST) method, a new method for obtaining high-quality training samples. We used the high-quality samples selected by MST to train a deep belief network and obtain pixel-level range change detection results. By mapping coarse scene changes to range changes, we obtained fine multi-type land-use change detection results. Experiments were conducted on the Multi-temporal Scene Wuhan dataset and aerial images of a particular area of Dapeng New District, Shenzhen, where the proposed method achieved promising results. This demonstrates that the proposed method is practical and easy to implement, and that the NS-55 dataset is physically justified. The proposed method has the potential to be applied to large-scale fine land-use change detection and to qualitative and quantitative research on land use/cover change based on HR remote sensing data.


Introduction
Change detection is the process of comparing objects, scenes, or phenomena in different time dimensions to identify their state differences [1,2]. Remote sensing image change detection is an interdisciplinary technology involving geosciences, statistics, and computer science [2]. With the development of optical satellite sensors capable of acquiring high-resolution (HR) images, research on the use of HR images for land cover has become popular. However, existing deep learning methods either detect a specific category or simply identify changed locations [36,37]. Whether by designing a better network or by adding new algorithms on top of a CNN, almost no comprehensive detection of multiple object types has been achieved. Some authors have tried to detect multiple objects by incorporating indices such as the normalized difference vegetation index, normalized difference water index, morphological building index, and enhanced vegetation index into the change detection process, but the categories they can detect are few (often not exceeding five) [38,39]. Others have used segmentation networks such as FCN, U-Net, and DeepLabv3+ to achieve pixel-level change detection. Although their detection accuracy is very high, most of them extract only the change information of buildings [37,40].
In the study of global environmental change, the description, quantification, and monitoring of land use/cover are particularly important [41], and land cover change has serious implications for the environment [42]. According to the detection results of the types and locations of land cover changes, global issues such as urban environmental changes and ecological changes can be better understood. In related research, it was found that few people have been able to obtain the fine change detection results of multiple types of land cover.
In order to obtain fine change detection results and extract information on the various types and positions of changed objects in the study area, we propose a coarse-to-fine deep learning-based land-use change detection method for HR remote sensing images. The basic idea is to use the CNN best suited to the complexity of each scene to classify the scenes, compare the classification results to obtain the changed and unchanged scenes, and use this scene change information to obtain the category-level coarse change detection result. For the changed scenes, we use a multi-scale threshold (MST) method to obtain high-quality changed and unchanged training samples by dividing the spectral and textural change intensity values. Using these changed and unchanged samples to train a deep belief network (DBN), the DBN can then detect the image and obtain pixel-level range detection results. By mapping the category detection results to the range detection results, a fine change detection result can be obtained. The main ideas and contributions of this paper can be summarized in three aspects: (1) A new idea of detecting land use/cover change from coarse to fine based on deep learning is proposed, which solves the problem that traditional methods detect a single type of object or only the change range. The coarse detection results provide type conversion information for all objects in the study area, and the pixel-level range change detection results reflect the precise positions of object changes. Mapping the coarse detection onto the range changes yields the fine change detection result, which reflects the change of object type at the precise position.
(2) We independently designed a new scene classification dataset, NS-55, which enriches the data sources of the remote sensing image scene classification research field and contributes to the development of related fields. When selecting CNN for scene classification, the adaptive relationship between CNN and scene complexity was innovatively considered, giving full play to the performance of CNN and improving the accuracy of scene category detection.
(3) We created a new method called MST for selecting high-quality training samples. We used MST to divide the spectral and textural change intensity value to obtain high-quality training samples. Moreover, we provide a new method for selecting training samples for the field of remote sensing image processing using deep learning, which can improve the efficiency of labelling samples and reduce the tedious work of manually labelling samples.

Methodology
The structure of the proposed method is shown in Figure 1. The method consists of three parts. The first part is to train the CNNs and detect scene categories. We used the NS-55 dataset to train four networks, AlexNet, ResNet50, ResNet152, and DenseNet169, and used these four CNNs to detect the category of each scene. The majority category was taken as the final category of the scene, and the scene categories were compared to obtain the changed scenes. In this part, the coarse change detection result was obtained. The second part is to use MST to extract the training samples. Robust change vector analysis (RCVA) and gray-level co-occurrence matrix (GLCM) algorithms were used to extract the intensity values of spectral and textural changes in the images. In each changed scene, MST was used to divide the change intensity values to separate the changed samples from the unchanged samples. In this part, high-quality training samples were obtained. The third part is to train the DBN with these samples and detect pixel-level range changes. The category detection results were mapped to the range detection results to obtain fine change detection results. We conducted experiments on the Multi-temporal Scene Wuhan dataset and aerial images of a particular area of Dapeng New District, Shenzhen, to verify the feasibility of our method.
Figure 1. Structure of the proposed method. (a) The first part: we used our self-built scene classification dataset NS-55 to train four CNNs, and used these four CNNs to detect the categories of all scenes cropped from the experimental area. (b) The second part: by comparing the scene categories, the changed scenes were obtained, and the RCVA and GLCM algorithms were used to calculate the change intensity values of the images; according to the changed scenes, MST was used to divide the changed samples from the unchanged samples. (c) The third part: we used the high-quality training samples obtained by MST to train the DBN, and used the trained DBN to detect pixel-level range changes; the category detection results were mapped to the range change detection results to obtain the fine change detection results.
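The coarse stage in parts (a) and (b) can be sketched in a few lines; the function names and the toy stand-ins for the four trained CNNs are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def detect_scene_categories(scenes, cnns):
    """Part one: each of the four CNNs assigns a category to every scene;
    the final category is the majority vote."""
    final = []
    for scene in scenes:
        votes = [cnn(scene) for cnn in cnns]
        final.append(Counter(votes).most_common(1)[0][0])
    return final

def coarse_change(categories_t1, categories_t2):
    """Scenes whose category differs between the two dates are 'changed'."""
    return [c1 != c2 for c1, c2 in zip(categories_t1, categories_t2)]

# Demo with toy "CNNs" that simply read a label field from the scene.
cnns = [lambda s: s["label"]] * 4
t1 = [{"label": "forest"}, {"label": "water"}]
t2 = [{"label": "residential"}, {"label": "water"}]
changed = coarse_change(detect_scene_categories(t1, cnns),
                        detect_scene_categories(t2, cnns))
print(changed)  # [True, False]
```

The changed flags then select which scenes are passed on to the MST sampling stage of part (b).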

NS-55 Dataset
In addition to the impact of preprocessing results and CNNs on the detection results of deep learning change detection based on scene classification, the datasets used for training often determine the richness of its detection categories, which in turn affects the accuracy of the detection results. In order to obtain a dataset containing more scene categories than the general dataset, we produced an NS-55 scene classification dataset. The data in NS-55 came from NWPU-RESISC45 [43], UCMercedLand_Use [44], WHU-RS19 [45], and AID [46], and added a large number of scene images cropped from Gaofen-2 satellite remote sensing images.



Coarse Deep-Learning Change Detection
With the continuous development of artificial intelligence algorithms, deep learning has become one of the most powerful tools in the field of computer vision. In particular, the widespread application of CNN in the image field and its network structure are being optimized continuously, which has further deepened and expanded the research of remote sensing image scene classification [47,48].
There is a specific adaptive relationship between the complexity of remote sensing images and the depth of a CNN. That is, when using shallow and deep CNNs to classify images of low and high complexity, respectively, if the correspondence between the CNN and image complexity is used effectively, the inherent characteristics of the CNN can be fully exploited to improve classification accuracy [49]. According to Zhang Xiaonan et al. [45] and Li, H. [49], AlexNet has a high recognition rate for macroscopic scenes such as snow mountains, beaches, clouds, runways, forests, and bridges; ResNet50 has a high recognition rate for scenes with fixed shapes such as basketball courts, baseball fields, golf courses, ground track fields, and airplanes; DenseNet169 has a higher recognition rate for abstract scenes with more semantic features such as parks, harbors, chaparral, sea ice, and islands; and ResNet152 has a high recognition rate for scenes such as lakes, terraces, thermal power stations, and mobile home parks. AlexNet and ResNet were the champions of the ImageNet classification task in 2012 and 2015, respectively, and DenseNet is a CNN model with better performance developed on the basis of ResNet. This paper selected the above four CNNs as the basic networks for scene classification: (1) AlexNet: AlexNet contains five convolutional layers and three fully connected layers. Using ReLU as the activation function solves the problem of gradient disappearance. A regularization layer was added after the first convolutional layer and the second pooling layer. It uses max pooling, and it applies the Dropout technique after the convolutional layers to prevent overfitting [32].
(2) ResNet: ResNet solves the problem of network degradation as networks deepen by designing the residual module. Assuming that I_{a−1} is the input of the a-th layer, the output of a directly connected CNN at layer a can be formalized as: I_a = H(I_{a−1}), where H denotes the layer's nonlinear transformation. The design of the residual block makes the output of layer a become: I_a = H(I_{a−1}) + I_{a−1}. That is, the problem of network degradation is solved by adding an identity mapping. Introducing a batch normalization layer after the convolutional and pooling layers makes the network easier to train. ResNet includes five variants, ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152; their conv1 structures are the same, and each network ends with average pooling, a 1000-d fully connected layer, and Softmax, but their conv2_x, conv3_x, conv4_x, and conv5_x stages differ [35].
(3) DenseNet: DenseNet expands and develops the idea of ResNet. It uses batch normalization to accelerate convergence and control overfitting, and feeds the outputs of all preceding convolution and pooling layers into the input of each subsequent convolutional layer, so that the output of layer a becomes: I_a = H([I_0, I_1, ..., I_{a−1}]), where [·] denotes concatenation. Its convergence is faster than ResNet, and its results on the ImageNet dataset are also significantly improved [50].
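The three connectivity patterns can be contrasted with a minimal NumPy sketch, where a linear map plus ReLU stands in for a real convolutional layer; this illustrates the formulas above, not the actual networks:

```python
import numpy as np

def conv_like(x, w):
    # Stand-in for a conv layer: a simple linear map followed by ReLU.
    return np.maximum(w @ x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal((8, 8)) * 0.1

# Plain stack: I_a = H(I_{a-1})
plain = conv_like(x, w)

# Residual block: I_a = H(I_{a-1}) + I_{a-1}  (identity shortcut)
residual = conv_like(x, w) + x

# Dense connectivity: layer a sees the concatenation of all earlier outputs
dense_input = np.concatenate([x, plain])
print(residual.shape, dense_input.shape)  # (8,) (16,)
```

Note how the residual path preserves the feature dimension, while dense connectivity grows the input width of each successive layer.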
The classification dimension of the above CNN was modified to 55, and the rest of the structure remained the same as the classic network. We used NS-55 to train the above CNN and tested its classification performance.
In order to obtain segments of ground objects that match the actual situation and improve the efficiency of our method, we used a multi-scale segmentation algorithm, a classic and widely used approach with a good segmentation effect. Multi-scale segmentation not only makes the classification results more consistent with the actual distribution of ground objects, but also effectively retains the information of different objects in the image [51,52]. Each of the four trained CNNs independently assigns a class to each scene, and the final class is obtained by majority voting. By comparing the categories of scenes at the same location in the two images, the change information of the scene is determined, and the coarse change detection result is obtained.

Use Multi-Scale Threshold (MST) to Separate the Changed Samples from the Unchanged Samples
In the change detection method based on deep learning, the selection of high-quality training samples has been a hot research issue for many experts and scholars [37]. RCVA [53,54] and GLCM [55,56] are classic algorithms for extracting image spectrum and texture features. They have a wide range of applications and excellent feature extraction capabilities.
The RCVA algorithm considers the neighborhood information of pixels and selects the pair of pixels with the smallest spectral difference for detection, which greatly eliminates the impact of registration errors. Based on the RCVA principle, by traversing all the pixel pairs of two images one by one, the spectral change intensity values of all pixel pairs can be calculated, and then the spectral information change intensity map, considering the neighborhood information, can be obtained.
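A simplified, illustrative version of this neighbourhood-minimum idea can be written as follows; this sketch omits the symmetric img2-to-img1 comparison of the full RCVA algorithm, and image borders wrap for brevity:

```python
import numpy as np

def rcva_magnitude(img1, img2, w=1):
    """Simplified RCVA-style change magnitude (illustrative, not the exact
    published algorithm): for each pixel of img1, take the minimum spectral
    distance to any pixel of img2 within a (2w+1)x(2w+1) neighbourhood,
    which suppresses small registration errors."""
    h, width, bands = img1.shape
    out = np.full((h, width), np.inf)
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            shifted = np.roll(np.roll(img2, dy, axis=0), dx, axis=1)
            dist = np.sqrt(((img1 - shifted) ** 2).sum(axis=2))
            out = np.minimum(out, dist)   # keep smallest difference per pixel
    return out

rng = np.random.default_rng(1)
a = rng.random((16, 16, 3))
b = a.copy()
b[4:8, 4:8] += 0.5          # simulated change patch
mag = rcva_magnitude(a, b)
print(mag.shape)             # (16, 16)
```

Unchanged pixels score exactly zero (the zero-shift comparison matches), while pixels inside the simulated change patch retain a large magnitude even after the neighbourhood search.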
GLCM is one of the most classic methods for extracting texture and is a commonly used texture feature analysis method with a good extraction effect. GLCM extracts texture based on the conditional probability density between the grey levels of an image; it is a matrix that describes the relationship between the grey levels of neighboring pixels, or of pixels at a certain distance [57]. Taking the variance as the texture scalar best reflects the differences between ground objects [58]. After obtaining the texture variance features of the two images, the texture change intensity value can be calculated, and the change intensity map can then be obtained.
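The co-occurrence counting and the variance scalar can be sketched directly in NumPy; this uses a single pixel offset for clarity, whereas real GLCM texture extraction typically averages several offsets over a sliding window:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Grey-level co-occurrence matrix for one offset (a minimal sketch)."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m / m.sum()          # normalise to joint probabilities

def glcm_variance(p):
    """Variance texture scalar: sum_ij (i - mu)^2 * p(i, j)."""
    levels = p.shape[0]
    i = np.arange(levels)[:, None]
    mu = (i * p).sum()
    return (((i - mu) ** 2) * p).sum()

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
p = glcm(img, levels=4)
print(round(glcm_variance(p), 3))  # 1.076
```

Computing this scalar per window over both images and differencing the results yields the textural change intensity map described above.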
When dividing training samples according to the change intensity values of the spectrum and texture, traditional methods mostly use a single global threshold, whose selection has linear characteristics. This leads to considerable confusion between the changed and unchanged samples, which dramatically reduces sample quality. Based on this, we propose a strategy for dividing samples using MST. The threshold of the i-th scene is computed as

T_i = k · (Σ F_i W_i) / (Σ W_i), (4)

where F_i represents the change intensity value of RCVA or GLCM of the i-th scene; W_i is the number of pixels corresponding to that change intensity value; and k is the scale coefficient.
The schematic diagram of sample division in different scenarios is shown in Figure 3.
In each changed scene, a different threshold can be calculated using Equation (4). The threshold is used to divide the spectral and textural change intensity values to separate the changed samples from the unchanged samples, giving two sets of training samples corresponding to RCVA and GLCM. Finally, the union of the two sets is taken as the final training sample set.
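Under one plausible reading of Equation (4), the per-scene threshold is the weighted mean of the change intensity values scaled by k; the following sketch divides a synthetic scene's intensities on that assumption (for continuous intensity values the weighted mean reduces to the plain mean):

```python
import numpy as np

def mst_threshold(intensity, k=0.8):
    """Per-scene threshold as we read Equation (4): the change-intensity
    histogram's weighted mean (F_i weighted by pixel counts W_i), scaled by k."""
    values, counts = np.unique(intensity, return_counts=True)  # F_i, W_i
    return k * (values * counts).sum() / counts.sum()

def split_samples(intensity, k=0.8):
    t = mst_threshold(intensity, k)
    changed = intensity > t
    return changed, ~changed

rng = np.random.default_rng(2)
scene = np.concatenate([rng.normal(0.05, 0.02, 900),   # unchanged pixels
                        rng.normal(0.80, 0.05, 100)])  # changed pixels
changed, unchanged = split_samples(scene)
print(changed.sum(), unchanged.sum())
```

Because each changed scene receives its own threshold, scenes with very different overall change intensity are divided on their own statistics rather than by one global cut.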

Fine Deep-Learning Change Detection
Deep learning can leverage its strengths to fully mine deep data features, making it an effective method for remote sensing big data analysis and data mining. It can cope with the challenges of complex feature information and the large amount of information brought by the improvement of remote sensing image resolution [59].
DBN [60,61] developed from biological neural networks and shallow neural networks. It is a probability generation model that infers the sample data distribution from a joint distribution. It uses layer-by-layer unsupervised training to extract features from a large amount of unlabeled sample data, and optimizes the model using a small amount of labelled data. Finally, it obtains the optimal network weights, so that the network can generate the training data with maximum probability, forming high-level abstract features and thus a better classification model. The DBN is mainly composed of two parts. The first part is a stack of restricted Boltzmann machines (RBMs) [62,63] for pre-training the network. The second part is a feedforward back-propagation network, which fine-tunes the stacked RBMs. Each RBM, viewed as a system, has an energy E:

E(v, h) = −Σ_{i=1}^{n} a_i v_i − Σ_{j=1}^{m} b_j h_j − Σ_{i=1}^{n} Σ_{j=1}^{m} v_i W_ij h_j,

where n and m are the numbers of neurons in the visible and hidden layers; v_i and h_j indicate the states of the visible and hidden neurons; W_ij is the weight between neurons i and j; and a_i and b_j represent the biases of the visible and hidden neurons, respectively. From the energy function, the joint probability distribution P over v and h can be obtained, and the DBN reconstructs the sample data through calculations based on P. At the same time, the input of the DBN is a vector of continuous real-valued variables, which overcomes the limitation of the CNN that its input must be an image of a specific size. The DBN can thus be easily applied to pixel-based change detection and can perform more precise detection of the image. Using the samples selected by MST to train the DBN, a detection model with excellent performance can be obtained.
By detecting the entire image, we can get the pixel-level range change detection results.
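The RBM energy function above can be evaluated directly; the toy dimensions and unit states below are arbitrary illustrations:

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Energy of an RBM configuration:
       E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j"""
    return -(a @ v) - (b @ h) - v @ W @ h

n, m = 4, 3                       # visible / hidden unit counts
rng = np.random.default_rng(3)
W = rng.standard_normal((n, m)) * 0.1
a = np.zeros(n)                   # visible biases
b = np.zeros(m)                   # hidden biases
v = np.array([1., 0., 1., 0.])    # visible state
h = np.array([1., 1., 0.])        # hidden state
print(rbm_energy(v, h, W, a, b))
```

Lower-energy configurations correspond to higher joint probability P(v, h) ∝ exp(−E(v, h)), which is what contrastive pre-training of each RBM layer exploits.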
In order to evaluate the accuracy of change range detection, accuracy (A), recall (R), false alarm (FA), and missing alarm (MA) were introduced to evaluate the detection results, as shown in Table 1. These indices are computed from the numbers of correctly and incorrectly detected changed and unchanged pixels. According to the principle of the change matrix, we can obtain the change trajectory of the scene, that is, from what to what. Finally, we mapped the category changes to the range changes to obtain the fine change detection results, as shown in Figure 4.
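One common reading of these four indices, written in terms of true/false positives and negatives (TP, FP, FN, TN) of changed-pixel detection, is the following sketch; the paper's exact formulas are not reproduced in the text, so treat these definitions as assumptions:

```python
def change_metrics(tp, fp, fn, tn):
    """A, R, FA, MA from a pixel-level confusion matrix (our reading of the
    evaluation indices; FA and MA are defined against detected and actual
    changes, respectively)."""
    a = (tp + tn) / (tp + fp + fn + tn)   # accuracy
    r = tp / (tp + fn)                    # recall
    fa = fp / (tp + fp)                   # false alarm
    ma = fn / (tp + fn)                   # missing alarm
    return a, r, fa, ma

a, r, fa, ma = change_metrics(tp=80, fp=20, fn=20, tn=880)
print(a, r, fa, ma)  # 0.96 0.8 0.2 0.2
```

Under these definitions, MA is the complement of R, so a high recall directly implies a low missing alarm rate, consistent with the paired trends reported in the results.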


Experiments and Results
We used the NS-55 dataset to train the CNNs. To accelerate model convergence and improve training accuracy, following the idea of transfer learning [64-66], the AlexNet, ResNet50, ResNet152, and DenseNet169 used in this paper were pre-trained models (from torchvision), that is, networks pre-trained on the 1000-class ImageNet dataset. All CNN weights were updated during training. For each of the 55 categories, 300 images were selected as the training set (a total of 16,500 images) and 100 images as the validation set (a total of 5500 images). The training loss, validation loss, training accuracy, and validation accuracy of each CNN are shown in Figure 5.

It can be seen from Figure 5 that the pre-trained CNNs converged quickly. Within 100 epochs of training, the training and validation accuracies reached 80% or more, and after 200 or more epochs, the training accuracy reached 90% or more. Only the validation accuracy of AlexNet was 82%; the remaining CNNs reached 85% or higher. Training and validation loss values dropped below 0.3. We selected 300 images of each category as the test dataset (a total of 16,500 images) to verify the classification accuracy of the CNNs, as shown in Figure 6.
From the test results, the average test accuracies of AlexNet, ResNet50, ResNet152, and DenseNet169 were 88.56%, 91.78%, 89.51%, and 90.85%, respectively. CNNs trained on the NS-55 dataset thus have good detection capabilities across various scenes. Except for a few categories with lower test accuracy, the test accuracy for most categories was higher than 85% or even 95%, and a few reached 100%, indicating that the NS-55 dataset is reasonably designed, is suitable for CNN training, and can be used as a scene classification dataset. At the same time, the CNNs have high training accuracy and can be used for scene category detection. It can also be seen that AlexNet had high test accuracy for macro scenes such as snow mountains and beaches, ResNet50 for fixed-shape scenes such as basketball courts and baseball fields, DenseNet169 for scenes with rich semantic features such as parks and harbors, and ResNet152 for more complex scenes such as lakes and terraces, which further validates the adaptive relationship between the complexity of remote sensing images and the depth of the CNN.

Data Processing and Accuracy Assessment
Based on the multi-scale segmentation algorithm, 73 scenes were cropped in Site1 and 44 scenes in Site2. The trained CNNs detected the categories of these scenes, each CNN producing its own category result. The final category of each scene was obtained by majority voting, and the category change detection result (the coarse change detection result) was obtained by comparing the scene categories. The category labelling numbers correspond to the categories in the NS-55 dataset, and binary labelling was used to indicate changed and unchanged scenes, where 0 represents an unchanged scene and 1 represents a changed scene, as shown in Figure 8.
Based on the category labels of the MtS-WH dataset (Figure 7c,d), we evaluated the classification accuracy of the CNNs for the scenes in Site1 and Site2. We counted the number of changed scenes: Site1 contained 37 changed scenes and 36 unchanged scenes; Site2 contained 36 changed scenes and eight unchanged scenes. The CNN detection results were that Site1 contained 39 changed scenes (four false detections) and 34 unchanged scenes (two false detections), and Site2 contained 37 changed scenes (two false detections) and seven unchanged scenes (one false detection).
We set the value of k to 0.8, which was used to separate the changed segments in the changed scenes. At the same time, to ensure that the unchanged samples were distributed across the entire image, we also selected a portion of the unchanged samples from the unchanged scenes. Using the spectral change intensity values and the textural change intensity values, two sets of samples were obtained, and the changed samples and the unchanged samples of the two sets were respectively combined to obtain the final training samples. A schematic diagram of the sample division of Site1 and Site2 is shown in Figure 9.
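As an illustration only — the exact MST thresholding formula is defined in the method section — the sample division can be sketched by assuming the common convention T = μ + kσ for the threshold, and reading "combined" as keeping pixels on which the spectral and textural cues agree:

```python
import numpy as np

def threshold_samples(intensity, k=0.8):
    # Assumed threshold form T = mean + k * std (illustrative only).
    t = intensity.mean() + k * intensity.std()
    return intensity > t, intensity <= t  # changed mask, unchanged mask

def combine_samples(spec_changed, tex_changed):
    # One plausible reading of "respectively combined": keep only pixels
    # on which the spectral and textural cues agree.
    changed = spec_changed & tex_changed
    unchanged = ~spec_changed & ~tex_changed
    return changed, unchanged

rng = np.random.default_rng(0)
spectral = rng.random((64, 64))   # stand-in for spectral change intensity
textural = rng.random((64, 64))   # stand-in for textural change intensity
sc, _ = threshold_samples(spectral)
tc, _ = threshold_samples(textural)
changed, unchanged = combine_samples(sc, tc)
# changed and unchanged are disjoint by construction; pixels on which the
# two cues disagree are excluded from training, which is what keeps the
# sample quality high.
```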
MST categorized 110,723 changed samples and 324,675 unchanged samples at Site1 and 231,972 changed samples and 1,167,469 unchanged samples at Site2. The training results of DBN and the detection results of the change range are shown in Figure 10.
We selected 8000 changed samples and 8000 unchanged samples to train the DBN. In order to compare it with the MST, 8000 training samples categorized by a single threshold were selected to train the DBN. The above range change detection results were evaluated, and the results are shown in Table 1.
From these indicators, it can be seen that the accuracy and recall of Site1 increased by more than 20%, and the false alarm rate and missing alarm rate decreased by more than 30%; for Site2, the accuracy increased by more than 10%, the recall increased by more than 35%, the false alarm rate decreased by more than 10%, and the missing alarm rate decreased by more than 35%. At the same time, the accuracy of the Site1 and Site2 detection results exceeded 80%, and the recall exceeded 78%.
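The four indicators can be recovered from a pixel-level confusion matrix; a minimal sketch, assuming the standard definitions of "false alarm rate" and "missing alarm rate" (the counts below are hypothetical, not the paper's data):

```python
def range_change_metrics(tp, fp, fn, tn):
    """tp/fp/fn/tn: pixel counts, with 'changed' as the positive class."""
    accuracy = tp / (tp + fp)        # accuracy rate of detected changes (precision)
    recall = tp / (tp + fn)
    false_alarm = fp / (fp + tn)     # unchanged pixels wrongly flagged as changed
    missing_alarm = fn / (tp + fn)   # changed pixels that were missed (1 - recall)
    return accuracy, recall, false_alarm, missing_alarm

acc, rec, fa, ma = range_change_metrics(tp=800, fp=150, fn=200, tn=8850)
print(round(acc, 3), round(rec, 3), round(fa, 3), round(ma, 3))  # 0.842 0.8 0.017 0.2
```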
We mapped the category change detection result onto the detection result of the change range, and obtained the fine detection results of multiple types of land-use changes, as shown in Figure 11.
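The mapping step can be sketched as masking the pixel-level change map with the scene-level change labels. For simplicity, the sketch below treats scenes as rectangular crops, whereas the paper's scenes come from multi-scale segmentation:

```python
import numpy as np

def map_category_to_range(range_mask, scene_boxes, scene_changed):
    """Keep DBN-detected pixel changes only inside scenes flagged as changed.

    range_mask:    boolean H x W change map from the pixel-level detector
    scene_boxes:   (row0, row1, col0, col1) crop per scene
    scene_changed: 0/1 scene-level change label per scene
    """
    fine = np.zeros_like(range_mask)
    for (r0, r1, c0, c1), changed in zip(scene_boxes, scene_changed):
        if changed:
            fine[r0:r1, c0:c1] = range_mask[r0:r1, c0:c1]
    return fine

mask = np.ones((4, 4), dtype=bool)           # every pixel flagged as changed
boxes = [(0, 2, 0, 4), (2, 4, 0, 4)]         # two stacked scenes
fine = map_category_to_range(mask, boxes, [1, 0])
print(int(fine.sum()))  # 8: only the changed scene's pixels survive
```

Suppressing pixel changes inside unchanged scenes is what lets the coarse scene result clean up false alarms in the fine range result.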

Study Area and Data
We obtained images of an area in Dapeng New District, Shenzhen, by aerial photography. The image size was 5622 × 5587 pixels, the acquisition times were December 2014 and May 2017, and the resolution was 0.5 m, with three bands (red, green, and blue), as shown in Figure 12.

Data Processing and Accuracy Assessment
According to the data processing flow in experiment 1, segmentation, cropping, and category detection were performed on the image, and 141 scenes were cropped out. The other processing results are shown in Figures 13-15.

The CNN detection results were as follows: there were 30 changed scenes (two false detections) and 111 unchanged scenes (three false detections). The index analysis of the range detection results is shown in Table 2. As shown in the table, the range change detection result of this experiment had an accuracy of more than 91% (an increase of 5%) and a recall of more than 77% (an increase of 47.52%); the false alarm rate decreased by more than 24%, and the missing alarm rate decreased by more than 47%. The fine detection results of multiple types of land-use changes are shown in Figure 16.

Discussion
In the field of change detection, the accuracy, meticulousness, and comprehensiveness of a change detection method reflect its practical value. Whether pixel-based or object-based, traditional methods mostly detect a single land type or only the change range (change location). Our proposed method starts by detecting the category of each scene, comparing the land cover classes of a scene at two different moments in time, using MST to select high-quality training samples, and using a binary change value to describe change/non-change, so as to gradually obtain fine land use/cover change information. At the same time, two innovative strategies were proposed in our method: (1) improving the accuracy of CNN scene detection; and (2) solving, to a certain extent, the problem of the low quality of pixel-level training samples in the field of deep learning. The experimental results (Figures 11 and 16) confirmed the superiority of our proposed method in reflecting the detailed land use/cover information of the study area compared with traditional methods, achieving effects (both visual and descriptive) that the traditional methods did not.
In research on HR remote sensing image classification tasks, training deep learning models on scene classification datasets has become an almost indispensable step. Due to the limitations of existing scene classification datasets (limited categories or limited data), many studies have only conducted classification over a few categories, often fixing certain categories as the research object in order to address a specific problem (forest changes, building changes, etc.). It can be observed from Figure 2 that the NS-55 dataset we designed contained a wide variety of scene categories, high-quality images, and simple, reasonable category naming. As can be seen from Figure 5, when we used NS-55 to train the CNNs, the training accuracy and validation accuracy of each CNN reached more than 80% in a short time (within 100 epochs). The fluctuation of the loss value was also very small and stabilized within roughly 50 epochs. This means that the NS-55 dataset we designed is fit-for-purpose: it meets the data requirements of CNN training and can be used in related research.
CNNs trained with NS-55 had high accuracy when detecting scenes, as verified in Figure 6. It is worth mentioning that when a CNN detects scenes with complex semantic information, such as churches, palaces, railway stations, and stadiums, the classification accuracy is often low. The reason may be that the semantic information in these scenes is too complicated and the scenes contain objects beyond the subject objects, which is a common problem when using CNNs for classification. However, when we chose the CNNs, we innovatively considered the relationship between CNN and scene complexity, which may avoid these problems to a certain extent. As can be seen from the test accuracies of the four CNNs on "tennis_court" in Figure 6, only ResNet50 achieved a high test accuracy on "tennis_court" (above 85%), while the other three networks performed poorly (below 75%). This shows that if the adaptive relationship between CNN and scene complexity is considered when performing scene classification, the classification accuracy can be improved.
The performance of deep learning methods depends on the number and quality of training samples. Especially for models such as RBM, DBM, and DBN, which take vectors of stacked pixel spectral values as input, confusion between positive and negative samples reduces the robustness of the model and greatly weakens its detection ability. Figures 9 and 14 show that the quality of the training samples categorized by MST was very high, which avoids the drawback that training samples categorized by a single threshold contain much confusion. It can be concluded from Tables 1 and 2 that the DBN trained on the samples obtained through MST had strong detection ability and high accuracy.
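The per-pixel input described here — stacked spectral values from the two dates — can be sketched as follows (the function name and shapes are illustrative; the paper does not specify its exact vectorization):

```python
import numpy as np

def stack_pixel_vectors(img_t1, img_t2):
    """Build per-pixel input vectors for a DBN-style model by stacking the
    spectral values of the two dates: two H x W x B images -> (H*W) x 2B."""
    h, w, b = img_t1.shape
    return np.concatenate(
        [img_t1.reshape(h * w, b), img_t2.reshape(h * w, b)], axis=1)

x = stack_pixel_vectors(np.zeros((8, 8, 3)), np.ones((8, 8, 3)))
print(x.shape)  # (64, 6)
```

Because each row is a single pixel's spectral history, a mislabeled training pixel directly injects a contradictory input vector, which is why sample quality matters so much for these models.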
Finally, we proposed mapping the category changes onto the range changes to obtain the fine multi-type land-use change detection results. As can be seen from Figures 11 and 16, this fine land-use change detection map reflects the scope and type of land change more intuitively, meticulously, and accurately, and can provide detailed information on land change for land planning, urban change analysis, disaster monitoring, military applications, agriculture, forestry, and other fields.

Conclusions
In this paper, a coarse-to-fine deep learning based land use change detection method for HR remote sensing images was proposed, which overcomes the limitation of detecting only a single land type or only the range of land change and can obtain more complete change information. Our new dataset, NS-55, enriches the available scene classification datasets, and we used it to train the CNNs. When choosing the CNNs, we innovatively considered the adaptation relationship between CNN and scene complexity. The change category obtained by CNN detection is the so-called coarse change detection result. When selecting training samples, we proposed using MST to obtain the changed and unchanged samples by dividing the intensity values of the spectral and textural changes. This method avoids the large amount of sample confusion caused by using a single threshold to divide the intensity values in traditional methods. MST can obtain high-quality samples, and the trained DBN can more accurately detect the change range and obtain accurate pixel-level change detection results. By mapping the category change result onto the range change result, the so-called coarse-to-fine change detection result was obtained. Experiments were carried out on two high-resolution remote sensing image pairs. The results showed that the method has high detection accuracy and can reflect land use/cover changes in detail. This is a novel multi-type land use/cover change detection method for remote sensing images, which has practical value and promotion significance for the study of urban change, land planning, disaster monitoring, and other urban environmental and ecological changes.

Conflicts of Interest:
The authors declare no conflicts of interest.