Multimodal Data and Multiscale Kernel-Based Multistream CNN for Fine Classification of a Complex Surface-Mined Area

Fine land cover classification (FLCC) of complex landscapes is a popular and challenging task in the remote sensing community. In complex surface-mined areas (CSMAs), researchers have conducted FLCC using traditional machine learning methods and deep learning algorithms. However, convolutional neural network (CNN) algorithms that may be useful for FLCC of CSMAs have not been fully investigated. This study proposes a multimodal remote sensing data and multiscale kernel-based multistream CNN (3M-CNN) model. Experiments based on two ZiYuan-3 (ZY-3) satellite images, acquired at different times and in different seasons, were conducted in Wuhan, China. The 3M-CNN model had three main features: (1) multimodal data-based multistream CNNs, i.e., using ZY-3 imagery-derived true color, false color, and digital elevation model data to form three CNNs; (2) multisize neighbors, i.e., using different neighborhood sizes of optical and topographic data as inputs; and (3) multiscale convolution flows, revised from an inception module, for optical and topographic data. Results showed that the proposed 3M-CNN model achieved excellent overall accuracies on two different images and outperformed comparative models. In particular, the 3M-CNN model yielded clearly better visual results. In general, the proposed process was beneficial for the FLCC of complex landscape areas.


Introduction
Land cover classification (LCC) [1,2] is one of the most popular topics in the remote sensing community. With the enhancement of the spatial resolution of remote sensing imagery, fine LCC (FLCC) [3][4][5][6][7] of complex landscapes has attracted increasing attention because of the need to improve accuracy. Owing to the complex features of three-dimensional terrain, FLCC of complex surface-mined areas (CSMAs) is challenging and of great importance for mine development.
The solution of FLCC applied to CSMAs involves the use of multimodal data for feature learning and advanced algorithms for classification [8,9]. Firstly, feature engineering coupled with machine learning algorithms (MLAs) is effective. Feature engineering mainly includes hand-crafted feature calculation, feature selection (FS), and feature extraction (i.e., feature reduction). FS uses an algorithm to choose a representative feature combination. Feature extraction uses algorithms [10] to extract a combination of new features. For example, Azeez et al. [11] used support vector machine (SVM), decision tree (DT), and random forest (RF) algorithms for LCC of a regional agricultural area. Our previous studies have used this scheme. For example, Li et al. [12] proposed a fine classification framework for three surface-mined land cover classes using multimodal spatial, spectral, and topographic features from ZiYuan-3 (ZY-3) imagery, together with an FS method and RF, SVM, and artificial neural network (ANN) algorithms. For FLCC of CSMAs, Chen et al. [13] assessed the importance of feature sets based on an FS method; Chen et al. [14] utilized multiple features derived from WorldView-3 imagery and optimized SVM models. In addition, Chen et al. [8] conducted a review of FLCC in CSMAs using remote sensing data. The shortcoming of all of the above-mentioned methods was that the feature sets derived by feature engineering were insufficiently representative.
Deep learning (DL) algorithms have been widely used in computer vision and remote sensing fields because they can learn informative high-level features. As a result, some researchers have examined many DL algorithm-based schemes for LCC and FLCC. For example, Lv et al. [15] used a deep belief network (DBN)-based method for LCC. Li et al. [16] utilized a stacked autoencoder (SAE) method for LCC, while Tong et al. [17] proposed a transferable convolutional neural network (CNN) model for LCC. For FLCC of CSMAs, Li et al. [9] proposed a multimodal and multimodel deep fusion strategy based on ZY-3 imagery, the DBN algorithm, and two MLAs (i.e., RF and SVM). Li et al. [18] proposed a novel multi-level, output-based DBN. Compared to SAE and DBN, CNN has stronger representation ability due to the consideration of spatial neighborhood information, which has seldom been examined for FLCC.
In addition to directly using DL algorithms, some researchers have explored multistream CNNs based on multiple inputs. For example, Liu et al. [19] applied a triplet CNN for remote sensing scene classification. Other studies were based on multimodal data such as hyperspectral image (HSI) and light detection and ranging (LiDAR) data. For example, Li et al. [20] proposed a three-stream CNN model based on the spectral and spatial features of HSI and the topographic features of LiDAR data. Xu et al. [21] developed a two-branch CNN based on HSI and LiDAR data, in which a two-tunnel CNN framework was first used to extract deep spectral-spatial features of the HSI. Chen et al. [22] proposed a similar two-stream CNN based on HSI and LiDAR data. Chen et al. [23] utilized multispectral (MS) image/HSI and LiDAR data to form a two-stream CNN. Jahan et al. [24] proposed a three-stream CNN using HSI and LiDAR data with a feature-reduction technique for LCC. Lastly, Rasti et al. [25] proposed a three-stream CNN using HSI and LiDAR data for LCC. Moreover, different band combinations used as inputs to DL algorithms have been considered worthy of investigation [26,27]. It has been suggested that the fusion methods for panchromatic (PAN) and MS images might affect the classification results [28][29][30].
Some researchers have focused on extracting multiscale features. For example, Zhang et al. [31] proposed a multiscale and multifeature-based CNN for LCC. Zhang et al. [32] proposed a multiscale dense network for LCC. They stated that the models with multi-scale features effectively improved classification accuracy.
Four influencing factors were pertinent to land cover classification over the study area: (1) Sample acquisition is costly; as a result, classification had to be performed with small samples. (2) The three-dimensional terrain has complex features, so topographic data were needed; how to effectively mine the multimodal multispectral image and topographic data was therefore an issue. (3) The topographic data were of lower spatial resolution than the multispectral images, which raised the question of how to fuse them. (4) The ground objects were at different scales, which raised the question of how to represent them.
Considering these influencing factors, the pipeline was designed using certain prior knowledge to reduce the need for sample data, as follows: (1) True color and false color images have different and complementary feature representation capabilities and can essentially replace the four-band multispectral data. Compared with using the multispectral data directly as input, this band combination method reduces the demand for sample data; (2) Multi-branch extraction with late fusion avoids interactions among different modal data caused by differing numerical units, meanings, and magnitudes, and is theoretically more conducive to the extraction of multimodal deep features; (3) Using multisize neighbors as inputs for data with different spatial resolutions can fully express the spatial relationships among different land covers; (4) In view of the multiscale characteristics of the ground objects, a multiscale convolution module was designed to give the algorithm multiscale feature extraction capability.
In this study, a novel multimodal remote sensing data and multiscale kernel-based multistream CNN (3M-CNN) model was proposed for FLCC of CSMAs. The 3M-CNN model was tested on two ZY-3 images. The contributions of the approach were as follows: (1) We proposed a multimodal data-based multistream CNN model, which can extract multiple features from true color and false color images and from digital elevation model (DEM) data derived from ZY-3 imagery. (2) We used multisize neighbors as inputs to the CNN, i.e., different neighborhood sizes for the optical images and the DEM data. (3) Multiscale convolution flows, revised from the inception module, were applied to the optical images and DEM data to match the different input neighbors.

Study Area and Data Source
As in our previous studies [9,12,13], we conducted experiments in an area of 109.4 km² in Wuhan City, China (Figure 1; G107 denotes national highway 107 of China). The terrain of the study area is complex and includes multiple agriculture-related land covers in different phenological periods [12].
We first obtained four images from different cameras of the ZY-3 satellite on 20 June 2012 (Table 1). Processing of the four images produced two kinds of data. One was a 10 m DEM generated from the stereo pair of the two 3.6 m forward- and backward-facing PAN images. The other was a 2.1 m fused MS image produced by pixel-level image fusion (Gram-Schmidt spectral sharpening) of the 2.1 m PAN and 5.8 m MS images. To match the resolutions of these two data sets, the DEM was resampled to 2.1 m. The DEM contained relative elevation data ranging from −14 m to 122 m.

Table 1. Characteristics of ZiYuan-3 satellite imagery [9,12]. PAN, panchromatic; GSD, ground spatial distance; MS, multispectral; NIR, near-infrared.

ZiYuan-3 Satellite Imagery
Another scene of ZY-3 imagery, acquired on 11 November 2020, was obtained, preprocessed, and tested in this study (Figure 2). A 2.1 m fused MS image and 2.1 m DEM data (resampled from 10 m) were also produced. The DEM contained relative elevation data ranging from −25 m to 113 m. Although the values of the two DEMs differed somewhat, this had little effect on the subsequent classification results, as the data were scaled before being input to the classification models. These two images, acquired in different years and seasons, helped to evaluate the generalization capability of the proposed model.

Methods
In this section, the details of the 3M-CNN model are presented (Figure 4). The 3M-CNN model was motivated by three observations: (1) the sizes of ground objects vary greatly; (2) data of different modalities show different characteristics of ground objects; and (3) multiscale convolution kernels can extract informative features. By combining these three concepts, a novel 3M-CNN method was proposed for FLCC of CSMAs. The FLCC scheme, the data sets, model development, and accuracy assessment are described below.

The MS data included four bands: red, green, blue, and near-infrared. To make full use of the MS image, we used the true color (R-red, G-green, B-blue) and false color (R-near-infrared, G-red, B-green) images as inputs, which have different representative abilities. The false color image helps to distinguish water from vegetation, which appear in clearly different colors. Finally, based on the two optical images and the DEM data, a three-stream CNN was constructed.
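As an illustration of the band combinations described above, the following NumPy sketch forms the true color and false color composites from a four-band MS stack. The patch size and the random data are hypothetical; only the band ordering (blue, green, red, NIR) follows the band numbering used in this paper.

```python
import numpy as np

# Hypothetical 4-band MS patch: (rows, cols, bands), bands ordered blue, green, red, NIR.
ms = np.random.rand(15, 15, 4)
blue, green, red, nir = (ms[..., i] for i in range(4))

# True color composite (R-red, G-green, B-blue) and
# false color composite (R-NIR, G-red, B-green), as used for the two optical streams.
true_color = np.stack([red, green, blue], axis=-1)
false_color = np.stack([nir, red, green], axis=-1)

print(true_color.shape, false_color.shape)  # (15, 15, 3) (15, 15, 3)
```

In the false color composite, the NIR band is mapped to the red channel, which is why vegetation (highly NIR-reflective) appears red and is easy to separate from water.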

Multisize Neighbors
It is worth noting that the DEM data were derived by interpolation, which resulted in more highly correlated pixels. In other words, same-size pixel neighbors from the DEM and MS data did not provide the same amount of valid classification information. Hence, we designed a series of experiments to determine the optimal neighbor size for the DEM data (i.e., N2 × N2 in Figure 4). To simplify parameter selection, the input size of the MS data (i.e., N1 × N1 in Figure 4) was fixed at 15 pixels × 15 pixels. Figures 5 and 6 show the demonstration area (i.e., the input image in Figure 4) and subset areas illustrating the use of the 15 pixels × 15 pixels neighbor. From subset areas 1 to 5 and 7 to 8, we can see that the 15 pixels × 15 pixels neighbor can represent both the features of some land covers and their surrounding classes. Furthermore, from subset area 6, we can see that the 15 pixels × 15 pixels neighbors of some sample points in the data polygon (for details, see Section 3.3) covered other sample points. This may have inflated test accuracy and affected the parameter selection based on the training and validation sets. We judged the influence of the 15 pixels × 15 pixels neighbor to be acceptable; larger neighbors, which could have larger negative impacts, were not used.
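The multisize-neighbor inputs amount to extracting a window of a different size from each modality around the same sample point. A minimal sketch follows; the image sizes, center coordinates, and the DEM neighbor size of 51 are illustrative (51 is the size later selected in Experiment 1).

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Extract a size x size neighborhood centered at (row, col).
    Assumes an odd size and that the window lies fully inside the image."""
    half = size // 2
    return image[row - half: row + half + 1, col - half: col + half + 1]

# Optical data use 15 x 15 neighbors; DEM data use a larger neighbor (e.g., 51 x 51).
optical = np.zeros((200, 200, 3))
dem = np.zeros((200, 200))
opt_patch = extract_patch(optical, 100, 100, 15)
dem_patch = extract_patch(dem, 100, 100, 51)

print(opt_patch.shape, dem_patch.shape)  # (15, 15, 3) (51, 51)
```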

Multiscale Convolutional Modules
The multiscale convolutional modules were revised from the inception block. Specifically, we introduced a parameter, k, to magnify the kernel sizes.
To fully utilize the contextual information, two different strategies were pursued for the DEM and the optical data. The branches of the true color and false color data were designed with fixed convolutional kernels; as shown in Figure 7, the kernel sizes were always 1 × 1 + 3 × 3, 3 × 3 + 3 × 3, and 3 × 3 + 5 × 5. The branch of the DEM data was designed with varied kernels to match the input size; as shown in Figure 8, its kernel sizes were k × k + 3k × 3k, 3k × 3k + 3k × 3k, and 3k × 3k + 5k × 5k. Hence, the parameter k could be objectively selected using the training and validation sets.
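The kernel-size scheme above can be written as a small helper that enumerates the kernel pairs of the three convolution flows for a given k. With k = 1 it reduces to the fixed kernels of the optical branches; larger k gives the magnified kernels of the DEM branch.

```python
def multiscale_kernels(k=1):
    """Kernel-size pairs of the three convolution flows, scaled by the parameter k.
    k = 1 reproduces the fixed kernels used for the true/false color branches."""
    return [
        ((k, k), (3 * k, 3 * k)),          # k x k followed by 3k x 3k
        ((3 * k, 3 * k), (3 * k, 3 * k)),  # 3k x 3k followed by 3k x 3k
        ((3 * k, 3 * k), (5 * k, 5 * k)),  # 3k x 3k followed by 5k x 5k
    ]

print(multiscale_kernels(1))  # optical branches: 1x1+3x3, 3x3+3x3, 3x3+5x5
print(multiscale_kernels(3))  # DEM branch with k = 3: 3x3+9x9, 9x9+9x9, 9x9+15x15
```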

FLCC Scheme
This study addressed the FLCC of CSMAs. To maintain continuity of research, the second-level LCC scheme described in Li et al. [9] was applied. It contained four types of cropland: paddy, greenhouse, green dry land, and fallow land; four types of forest land: woodland, shrubbery, coerced forest, and nursery; two water classes: pond and stream (natural) and mine pit pond; three road classes: dark road, bright road, and light gray road; three types of residential land: bright roof, red roof, and blue roof; one bare surface land class; and three types of surface-mined land: open pit, ore processing site, and dumping ground.
All other operations, such as the division of the training, validation, and test sets, parameter selection, and accuracy assessment, were conducted under this FLCC scheme. The two scenes of imagery share the same scheme.

Training, Validation and Test Set
First, polygons in the study area were labeled manually. Then, sampling points were randomly selected for the training, validation, and test sets at a ratio of 4:1:1 (see Figure 9). In this way, 2000, 500, and 500 samples of each class were obtained for the training, validation, and test sets, respectively. The experimental data comprised five such groups of training, validation, and test sets; more details can be found in Table 2. Similarly, for the imagery of 11 November 2020, training, validation, and test sets with 2000, 500, and 500 random samples per land cover class, respectively, were constructed.
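The per-class 4:1:1 split can be sketched as a simple random partition. The function and the integer "sample points" below are illustrative stand-ins for one class's labeled pixels; the counts match those used in the paper.

```python
import random

def split_samples(points, n_train=2000, n_val=500, n_test=500, seed=0):
    """Randomly partition one class's labeled sample points into
    training/validation/test sets at the 4:1:1 ratio used in the paper."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

points = list(range(3000))  # hypothetical sample points for one class
train, val, test = split_samples(points)
print(len(train), len(val), len(test))  # 2000 500 500
```

Because the three slices come from one shuffled copy, the sets are disjoint by construction, which is what the accuracy assessment assumes.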

Model Construction, Parameter Selection, and Model Comparison
The model structures and hyper-parameters from the imagery on 20 June 2012 were directly used for that on 11 November 2020.

Model Construction
As mentioned above, the 3M-CNN model had three different inputs. Each stream was constructed by two convolution flows, a flattened layer, a concatenating layer, and a fully connected layer. Then the three streams were concatenated. The output was subsequently placed into two fully connected layers. Finally, the result was produced.
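The fusion step described above can be sketched schematically with NumPy. The weights are random and untrained, and the feature dimensions (256-dimensional flattened features, 64-unit per-stream layers, a 128-unit fused layer) are hypothetical choices made only to show how the three streams concatenate; the real model ends with a Softmax classifier rather than a ReLU output.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, out_dim):
    """A fully connected layer with random (untrained) weights, for shape illustration."""
    w = rng.standard_normal((x.shape[-1], out_dim))
    return np.maximum(x @ w, 0.0)  # ReLU activation

# Flattened feature vectors from the three streams (true color, false color, DEM).
stream_features = [dense(rng.standard_normal(256), 64) for _ in range(3)]

# Concatenate the three streams, then apply two fully connected layers
# (20 outputs, one per second-level land cover class).
fused = np.concatenate(stream_features)
hidden = dense(fused, 128)
out = dense(hidden, 20)

print(fused.shape, out.shape)  # (192,) (20,)
```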

Grid Search for Parameter Selection
Several parameters can affect the performance of a CNN, such as input size, kernel size, batch size, the iteration number (epoch), steps per epoch, learning rate, loss function, the activation function, and dropout. The first two were the most important parameters as well as the most difficult ones to select.
In general, a more complex model might yield better performance. However, complex models increase training difficulty and training time. Moreover, in the field of remote sensing, where labeled data are scarce, complex models often lead to serious overfitting.
In this study, the two above-mentioned parameters were selected. To simplify the parameter selection process, we fixed other parameters with different empirical values.
To determine the best parameters, we conducted two types of experiments.

Experiment 1: Input Size Selection. This experiment was used to find the optimal input size of the DEM data (i.e., N2 × N2) using the single-scale CNN model (Figure 10). The single-scale convolution flow is depicted in Figure 11; it contains two groups of convolution block, batch normalization (BN), activation function (ReLU), and maximum pooling layer. After the flattening and concatenation operations, two fully connected layers were applied, followed by the Softmax classifier. The optimal input size was applied in the other experiments.

Figure 11. BN, batch normalization; ReLU, rectified linear unit; Dropout, a method to improve generalization performance.

Experiment 2: Kernel Size Selection. This experiment used the optimal input size obtained in Experiment 1. Kernel size also needed to be considered. As mentioned above, we tested different values of the parameter k to find the optimal kernel sizes for the 3M-CNN model.

Model Comparison
Firstly, using the selected input size of the DEM data, the single-scale CNN (Figure 10) and the 3M-CNN model (multiscale kernel with k = 1) were compared with the final 3M-CNN-Magnify model, which used the multiscale kernel with the selected value of k (Table 3). Then, the MLA- and DBN-based models in [9] were also compared. Table 3. Models for comparison. CNN, convolutional neural network; 3M-CNN, multimodal data and multiscale kernel-based multistream CNN.

Model | Description
Single-scale CNN | With just 3 × 3 convolutional kernels.
3M-CNN | Multiscale kernels with k = 1.
3M-CNN-Magnify | Multiscale kernels with the selected value of parameter k.

Accuracy Assessment
The training and validation sets were used for model training and parameter selection, while the test set was used to conduct the quantitative evaluation.
The classification performances of the models were evaluated based on the average overall accuracy (OA), training time, quantity disagreement (QD), and allocation disagreement (AD) [33]. For the results derived from the ZiYuan-3 imagery of 20 June 2012, some earlier results from [9] were used for comparison. Visual assessment of the predicted maps was also conducted.
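The QD and AD metrics of [33] can be computed directly from a confusion matrix of counts, with rows taken as predicted classes and columns as reference classes; QD + AD equals the total disagreement, 1 − OA. The 2-class matrix below is a made-up example for illustration.

```python
import numpy as np

def quantity_allocation_disagreement(cm):
    """Quantity disagreement (QD) and allocation disagreement (AD) [33]
    from a confusion matrix (rows = predicted, columns = reference)."""
    p = cm / cm.sum()                                   # convert counts to proportions
    row, col, diag = p.sum(axis=1), p.sum(axis=0), np.diag(p)
    qd = 0.5 * np.abs(row - col).sum()                  # class-total mismatch
    ad = 0.5 * (2 * np.minimum(row - diag, col - diag)).sum()  # spatial allocation error
    return qd, ad

cm = np.array([[40.0, 5.0], [10.0, 45.0]])              # hypothetical 2-class matrix
qd, ad = quantity_allocation_disagreement(cm)
oa = np.trace(cm) / cm.sum()
print(round(qd, 3), round(ad, 3), round(1 - oa, 3))     # 0.05 0.1 0.15
```

The final print confirms the identity QD + AD = 1 − OA for this example, a useful sanity check when reporting the two disagreement components alongside OA.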

Experiment 1: Input Size Selection
For this experiment, as shown in Figure 12, a series of input sizes ranging from 31 to 71, at intervals of 10, was tested, and each setting was repeated five times. Training time was lowest for an input size of 51, and the OA at input size 51 exceeded the OAs at input sizes 31, 41, and 61. Although input size 71 yielded a higher OA than input size 51, its training time was almost twice as long. Hence, we considered input size 51 optimal. The single-scale CNN with a size of 15 for the spectral data and 51 for the DEM achieved an average OA of 94.60%. This experiment indicated that magnifying the DEM input size improved the accuracy of the single-scale CNN.

Experiment 2: Kernel Size Selection
For this experiment, we used 51 as the input size of the DEM data and 15 as the input size of the spectral data. The values of k were set from 1 to 4, where 1 represented the original model and 2, 3, and 4 represented the magnifying models. We limited k to a maximum of 4 because larger kernel sizes produce small feature maps; with k = 5, the feature maps in the middle of the model would be smaller than the kernel size, causing model construction to fail. Each value of k was applied in five random tests. As shown in Figure 13, when k was set to 3, the model achieved the highest OA and took the least time. Notably, 51 divided by 15 is 3.4, which lies between 3 and 4 and is closer to 3; from this point of view as well, a k-value of 3 was most suitable.
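The feasibility limit on k can be checked with a small shape calculation. The sketch below assumes "valid" (no-padding) convolutions and 2 × 2 pooling after each convolution block in the largest-kernel flow (3k × 3k followed by 5k × 5k); these layer details are assumptions of this sketch, not a specification of the exact architecture, but under them k = 5 indeed fails for a 51 × 51 input while k = 1 to 4 fit.

```python
def feasible(input_size, k, pool=2):
    """Check whether the largest-kernel DEM flow (3k x 3k then 5k x 5k) fits an
    input of input_size x input_size, assuming valid convolutions and
    2 x 2 max pooling after each convolution block (illustrative assumptions)."""
    size = input_size
    for kernel in (3 * k, 5 * k):
        if kernel > size:                   # kernel larger than feature map: fails
            return False
        size = (size - kernel + 1) // pool  # valid convolution, then pooling
    return True

for k in range(1, 6):
    print(k, feasible(51, k))  # k = 5 is infeasible under these assumptions
```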

Precision Evaluation Result
After the parameter selection stage, the optimal models were obtained. Table 4 presents the OA, training time, QD, and AD values of the single-scale CNN, 3M-CNN, and 3M-CNN-Magnify models, and the models in [9]. The 3M-CNN-Magnify model achieved the highest OA value of 95.09% ± 0.29% and the lowest AD value of 1.84% ± 0.22%. The DBN took the least training time. It was apparent that the more complex the model, the more computation was involved, resulting in longer running times. Furthermore, the multiple convolution layers operating on feature maps needed more time than the coding layers of the DBN operating on feature vectors. In future work, model compression could be considered to simplify the model and improve running speed.

Visual Assessment of the Predicted Maps
To qualitatively evaluate the results, two predicted maps of the study area were produced. One was a map with 20 classes for FLCC, as depicted in Figure 14 (bottom image). The other was a map with 7 classes for LCC, as depicted in Figure 15 (bottom image), derived from the FLCC map by grouping the second-level land covers into their corresponding first-level land covers. Overall, the two maps were visually accurate. Compared to those derived from the DBN-SVM model in [9] (upper images in Figures 14 and 15), this study achieved better map results. For example, the bare surface land, residential land, dumping ground, green dry land, ore processing site, and pond and stream (areas 1 to 6 in Figure 14) were classified more accurately in this study. The Jing-zhu expressway (see Figure 1) was clearly misclassified by the DBN-SVM model. The bare surface land, residential land, and road (areas 1, 3, and 4 in Figure 15) derived by the proposed model were clearly better than those obtained by DBN-SVM. In particular, almost all the north-south roads were wrongly classified by the DBN-SVM model. There were also some misclassifications in the 7-class map produced by the proposed model; for example, some croplands were classified erroneously as forest lands in the middle right of the bottom map in Figure 15 (area 2).
Figure 16 shows the predicted FLCC and LCC maps based on the 3M-CNN-Magnify model and the ZY-3 imagery of 11 November 2020. Overall, the two maps in Figure 16 were visually accurate. There were three main differences between the FLCC maps in Figures 14 and 16: (1) there were more fallow lands in Figure 16, especially in the upper right corner; (2) there were more green dry lands in Figure 14; these two differences could be attributed to the seasons of the two images (one acquired in summer, the other in autumn); (3) there were some misclassifications in Figure 16. For example, many croplands in the bottom left corner (area 2) were classified as nursery. The dumping ground, crop land, and residential land (areas 1, 3, and 4 in Figure 16) derived from the imagery of 11 November 2020 were better than those from 20 June 2012.
The maps of LCC in Figures 15 and 16 were similar overall. This result indicated that from 2012 to 2020, the changes among the first-level land covers were small.
In sum, from 2012 to 2020 there were essentially no changes in the agricultural development pattern; however, open-pit mining grew rapidly. Figure 17 shows a diagram of the proposed scheme. Where land covers cause terrain changes, topographic data are helpful for classification, and the proposed model is likely to be beneficial. We consider this model also suitable for other sensors, as follows: (1) When a sensor has multi-view cameras (e.g., the GaoFen-7 satellite), topographic data can be extracted directly, and the model is directly applicable; (2) When a sensor provides only multispectral images, the model is still applicable because other topographic data can be used. For example, lower-resolution data, such as the Advanced Spaceborne Thermal Emission and Reflection Radiometer global digital elevation model, can be used if their spatial resolution can describe the terrain changes of the study area. Higher-resolution topographic data, such as light detection and ranging data, can also be used. The DEM data can likewise be replaced by synthetic aperture radar (SAR), hyperspectral, or unmanned aerial vehicle (UAV) images in the proposed scheme. Fused models with such multimodal data (e.g., multispectral and SAR images, multispectral and hyperspectral images, or multispectral and UAV images) are then suitable for classification of other landscapes.

Effectiveness of Multimodal Data-Based Multistream CNN Structure
Li et al. [9] pointed out that directly stacking and normalizing the multimodal features using DBN-based models might undermine the advantages of the topographic information. As a result, this study used a multimodal data-based multistream structure. The results showed that the proposed models outperformed single-stream models, such as the DBN-based models in [9]. Although performance differences might be attributed partly to the utilization of different models, the results indirectly confirmed the effectiveness of a multistream structure. Other studies have also demonstrated the effectiveness of multimodal data-based multistream structures [20][21][22][23][24][25].
Considering that the true color and false color images had different representative ability, both were used as inputs. Li et al. [12] found that DEM had the highest importance for LCC. Slope was important, but its significance was much lower than that of DEM. Aspect was not important and was not selected using an FS procedure. As a result, DEM was used as the input for the third CNN branch.
Using the imagery of 20 June 2012, an additional experiment was conducted. Based on the multimodal data of two band combinations (true color image, hereafter 321; false color image, hereafter 432) of the MS imagery (blue, green, red, and near-infrared bands, hereafter bands 1, 2, 3, and 4) and the DEM, this study tested three models with different numbers of branches: (1) CNN-4321DEM: a CNN model with a single input stacking the four spectral bands and DEM data; (2) CNN-4321+DEM: a CNN model with two inputs, the four spectral bands and the DEM data; (3) CNN-321+432+DEM: a CNN model with three inputs, the true color image, the false color image, and the DEM data. The OA values of these three models are shown in Table 6; multiple branches improved the classification performance. This experiment demonstrated that using both true color and false color images as inputs was effective. Studies [26,27] investigated other band combinations for mapping ice wedges, and the fusion methods for PAN and MS images might affect classification results [28][29][30]; these two issues require further consideration.

Effectiveness of Different Input Sizes for Multimodal Data
As shown in Figure 12, different DEM input sizes resulted in different classification accuracies. The model achieved the highest accuracy when the DEM input size was 51 and the MS input size was 15. This indicated that matching the input sizes to the different data was helpful for feature extraction and classification. Terrain fluctuations in the study area varied. An input size that was too small captured too little elevation variation to aid classification, whereas one that was too large introduced excessive interfering information. Therefore, the selection of the DEM input size was crucial.

Effectiveness of Multiscale Kernel Strategy
The convolution kernel size should increase with the input size. Figure 13 showed that multiscale convolutional modules with different k-values yielded different results, and the 3M-CNN-Magnify model outperformed the others. Both findings demonstrated that the multiscale convolution kernel contributes to classification performance.
The 3M-CNN model was slightly inferior to the single-scale CNN model; this was because its multiscale kernel k-value of 1 was unsuitable for the large input size of 51.

Comparison with Different Methods
To assess the effectiveness of the 3M-CNN-Magnify model, a series of comparison experiments was conducted. Traditional MLA-based models, such as RF, SVM, and FS-RF, and DBN-based models were selected. As shown in Table 4, the 3M-CNN-Magnify model achieved the highest average OA, significantly higher than the OAs of the MLA-based models and slightly higher than those of the DBN-based models.
The results showed that the strategy of multiscale kernels, coupled with different input sizes for different data, contributed significantly to the extraction of representative classification features. Table 4 indicates that the training time of the 3M-CNN models was slightly longer than that of the MLA- and DBN-based models. However, the parameter selection time of some MLA-based models was greater; for example, there were 80 groups of parameter combinations for the SVM-based models. The parameter selection time for DBN-SVM was also greater, as it involved selecting both DBN and SVM parameters. Using less time, DBN-SVM yielded an OA comparable to that of the 3M-CNN models. On the one hand, the coding layers operating on feature vectors were much quicker than the convolution layers operating on feature maps. On the other hand, the test accuracies were based on only a small set of samples, i.e., 500 for each class. Therefore, the comparable test accuracies could not fully reflect the true performances of the DBN-SVM and 3M-CNN models. The assessments of the predicted maps demonstrated that the 3M-CNN models were significantly better than the DBN-SVM model.
Three larger test sets were then used to further compare the model performances. Table 7 shows the accuracy values, from which the following conclusions can be drawn: (1) With larger test sets, the CNN-based models obtained higher accuracy values (about 3% improvement). This might be attributed to the overlap of 15 pixels × 15 pixels neighbors. However, the accuracies across the three test sets were stable, showing that the accuracy values of the CNN-based models were reliable; (2) With larger test sets, the accuracy values of the DBN-based models were slightly reduced; (3) With larger test sets, the CNN-based models yielded about a 4.7% improvement over the DBN-based models. In the future, spatially independent test sets [9] will be constructed for model comparison.

Parameter Selection and Effects of the Dataset
The selection of the DEM input size and the corresponding kernel size was conducted rigorously. In contrast, the tuning of the baseline was very simple. For example, the input size of the spectral images was determined based on subjective experience and the sizes of the different land covers in the study area. In the single-scale CNN models, the numbers of convolution flows and fully connected layers were determined by trial and error.
The datasets in this study were not very large, containing only 2000, 500, and 500 samples per class in the training, validation, and test sets, respectively. Therefore, the baseline single-scale CNN model was kept simple, which helps avoid overfitting. The small performance gaps among the 3M-CNN-Magnify, single-scale CNN, and DBN-SVM models might be partly caused by the small size of the datasets and their spatial autocorrelation [9] (see subset area 6 in Figure 6), which led to accuracy saturation. However, the map accuracies over the whole study area were far from optimal and showed obvious differences. As a result, larger spatially independent datasets should be investigated in the future.

Investigations on Different Datasets
A great deal of time is needed to label the polygons for extracting the training, validation, and test sets; consequently, our former studies [9,12,13,18] relied on only one dataset, i.e., the ZY-3 imagery of 20 June 2012.
The following features contributed to the reliability of the accuracy assessment results: (1) The training, validation, and test sets, containing 2000, 500, and 500 samples per class, respectively, were spatially independent (on average, only 12.88% of the pixels in the data polygons were used; Table 2) [9]; (2) There were five groups of training, validation, and test sets, and each was run five times; (3) The study area was very large, i.e., 109.4 km²; (4) The classification maps were visually accurate.
However, it was considered beneficial to also investigate other datasets (e.g., different imagery for different study areas) or ZY-3 data of this study area at other times. As a result, the ZY-3 imagery of 11 November 2020 was further tested in this study. Directly using the model structures and hyper-parameters derived from the imagery of 20 June 2012, the imagery of 11 November 2020 yielded very high test accuracies and accurate visual maps of LCC and FLCC. In the future, imagery from different sources, geographical locations, and spatial resolutions should be investigated further to test the generalization capability of the proposed models.

Conclusions
A 3M-CNN model was proposed for FLCC of a CSMA in Wuhan City, China. ZY-3 imagery-derived true color, false color, and DEM data were fed into a multistream structure. The DEM data used a different input neighborhood size from the optical data for better feature learning. A multiscale kernel was revised from the inception module. Tests on two scenes of ZY-3 imagery indicated that: (1) a multistream CNN structure contributed significantly to the extraction of effective features from multimodal data; (2) for CSMAs, the optimal input neighbors and corresponding multiscale convolution kernels for the DEM branch were crucial. In general, the proposed model achieved optimal performance and surpassed all comparison models. In the future, we will focus on the utilization of various imagery from different study areas, sources, and spatial resolutions, with spatial independence and larger datasets. Furthermore, transfer learning from one dataset to another is worthy of further examination.