1. Introduction
Remote sensing-assisted classification of forest types at the tree species level supports a wide variety of applications, including sustainable forest management [1,2], biological conservation and surveillance [3,4], and invasive species monitoring [5].
Over the last four decades, advances in remote sensing technology have enabled the classification of tree species using various satellite sensors. With the widespread availability of satellites and the advancement of sensor manufacturing technology, higher spatial resolution images and more detailed classification results have been obtained. Recently, a growing number of studies have sought more specific information on forest types at the tree species level using high spatial resolution (HSR) satellite images [6,7,8]. However, the spectral responses of tree species in a forest environment usually display more complicated patterns as the spatial resolution improves. Several tree species or forest types may therefore share the same or comparable spectral response, which complicates their discrimination in mono-temporal high-resolution data. As a result, despite substantial breakthroughs in geographic information science and technology, reliably classifying forest types at the tree species level from mono-temporal high-resolution images remains a challenge [9].
Due to phenological variations across tree species, multi-temporal satellite imagery can compensate for the limited spectral information of a single acquisition. Previous research demonstrated that utilizing multi-temporal satellite images helped improve forest type classification results. Nelson (2017) used multi-temporal satellite images to classify tree species groups and forest types in central Sweden and demonstrated that a multi-temporal approach could enhance overall classification accuracy [10]. Wessel (2018) extracted four tree species classes from multi-temporal satellite images at a German test site and achieved up to 88% overall accuracy [11]. Persson (2018) demonstrated that employing multi-temporal satellite images yielded greater performance in tree species classification than using mono-temporal data [12]. Other research has also underlined the significance of the phenological information contained in multi-season data for forest type mapping [13].
In terms of methodologies, researchers have successfully applied machine learning methods such as the support vector machine (SVM) and random forest (RF) to forest type classification based on multi-temporal satellite data and produced satisfactory results. Many comparative analyses of different machine learning methods for forest type classification based on multi-temporal data have also been conducted [14,15]. The results showed that the key factor determining the effectiveness of a machine learning algorithm was the feature representation of the satellite data, which typically relied on manual feature extraction and optimization. Such operations, however, are time-consuming, laborious, and vulnerable to the analyst's experience.
Deep learning approaches have shown considerable potential for the feature representation of remote sensing images with the advent of the big data era and the rapid development of scientific computing [16]. They have attracted the attention of many researchers by demonstrating good classification performance with satellite imagery and overcoming many limitations of traditional classification methods [17,18,19].
The fully convolutional network (FCN) is regarded as a watershed moment in deep learning for semantic segmentation, since it demonstrates how to train a convolutional neural network (CNN) end-to-end and produce dense predictions from inputs of any size [20]. Owing to the close connection between semantic segmentation in computer vision and satellite image classification, the FCN has been used to extract hierarchical context characteristics of satellite image pixels for land cover and land use classification [21,22]. The model's core premise is that it converts a standard CNN into a fully convolutional one by replacing the fully connected layers with convolutional layers and produces dense per-pixel labeled outputs through progressive up-sampling [23]. Despite its strength and versatility, the FCN model loses considerable detail and lacks spatial consistency of pixels owing to its many pooling and up-sampling operations [24].
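To illustrate this premise, the following minimal PyTorch sketch converts a small convolutional feature extractor into a fully convolutional classifier by using a 1 × 1 convolution in place of fully connected layers and a transposed convolution for up-sampling. The class name, layer sizes, and band/class counts are illustrative placeholders, not the configurations used in the cited works.

```python
import torch
import torch.nn as nn

class MinimalFCN(nn.Module):
    """Toy FCN: convolutional feature extractor, a 1x1 convolution in place
    of fully connected layers, and a transposed convolution that up-samples
    the coarse score map back to the input resolution."""
    def __init__(self, in_channels=4, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(                    # down-sampling path (x8)
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Conv2d(128, num_classes, 1)  # replaces FC layers
        self.upsample = nn.ConvTranspose2d(               # learnable x8 up-sampling
            num_classes, num_classes, kernel_size=16, stride=8, padding=4)

    def forward(self, x):
        scores = self.classifier(self.features(x))
        return self.upsample(scores)                      # dense per-pixel logits

x = torch.randn(1, 4, 256, 256)      # e.g., a hypothetical 4-band image patch
print(MinimalFCN()(x).shape)         # torch.Size([1, 8, 256, 256])
```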
The uNet is a symmetric U-shaped FCN that was first employed for image segmentation in biomedicine [25]. The performance of uNet exceeds that of the regular FCN because skip connections merge the spatial information captured during down-sampling with the inputs of the up-sampling path. Deep learning approaches based on uNet have made significant progress in forest type and tree species classification with remote sensing data in recent years. Wang (2020) used HSR images to classify forest types with the uNet model; compared with the classification results of the FCN, support vector machine, and random forest models, the results improved considerably [26]. Cao (2020) proposed an enhanced Res-uNet network based on the uNet structure for tree species classification using HSR imagery [27]. To extract the multi-scale features of an image, the method optimized the uNet with the residual unit of ResNet [28]. The experimental results revealed that, compared with uNet and ResNet, the upgraded Res-uNet model produced superior results since it could extract the spatial and spectral properties of an image more efficiently.
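The following toy PyTorch sketch illustrates the two ideas discussed above: a residual unit of the kind Res-uNet substitutes for uNet's plain convolution blocks, and a one-level encoder-decoder whose skip connection concatenates encoder features with the up-sampled decoder features. All names and channel sizes are illustrative and do not reproduce the cited architectures.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual unit of the kind Res-uNet uses in place of uNet's plain
    double-convolution blocks (illustrative channel sizes)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)     # identity shortcut

class TinyUNet(nn.Module):
    """One-level uNet: the encoder feature map is concatenated with the
    up-sampled decoder feature map through a skip connection."""
    def __init__(self, in_channels=4, num_classes=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 32, 3, padding=1),
                                 nn.ReLU(inplace=True), ResidualBlock(32))
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                        nn.ReLU(inplace=True), ResidualBlock(64))
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1),  # 64 = 32 (skip) + 32 (up)
                                 nn.ReLU(inplace=True))
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([e, self.up(b)], dim=1))  # skip connection
        return self.head(d)

print(TinyUNet()(torch.randn(1, 4, 128, 128)).shape)     # torch.Size([1, 8, 128, 128])
```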
Although the optimized Res-uNet model offered new possibilities for tree species classification in HSR images, it did not exploit the advantages of multi-temporal imagery for classifying forest types and tree species. Moreover, the enhanced model did not integrate a deeper Resnet model with the original uNet network, instead simply replacing the conventional convolutional block in the uNet with a three-level residual unit. It overlooked the fact that, to some extent, the depth of representations is critical for many visual recognition tasks [28].
With the development of artificial intelligence technologies, several studies have explored optimized deep learning models for land cover classification [29,30,31]. Recent studies have also investigated fusing multiple branch classifiers into an FCN model for forest type classification at the tree species level, which further enhances the ability of a single classifier through a multi-classifier ensemble. Guo (2020) presented a two-branch FCN8s approach to improve forest type classification based on China's Gaofen-2 (GF-2) HSR imagery by fusing two sub-FCN8s models constructed from the multi-spectral channels and a pretrained model, respectively [32,33]. The results showed that the suggested model could improve classification performance by combining the two classifiers. More recently, Guo (2020) further exploited a deep fusion model constructed in an end-to-end manner for mapping forests at the tree species level with HSR satellite imagery, which further enhanced the ability of a single classifier by combining a two-branch FCN8s model with a conditional random field implemented as a recurrent neural network [34]. However, the phenological information extractable from multi-temporal images, which benefits forest type and tree species classification, was not exploited in the models constructed in these previous studies. Moreover, while the uNet model has demonstrated strong performance in forest type and tree species classification, the backbone of these deep fusion models was mostly the FCN8s, and the effect of using uNet as the backbone was not investigated.
It can be noted that: (1) although multi-temporal high-resolution satellite data are favored for better performance in forest type classification, few studies have examined the performance of deep learning models built with such data for this task; (2) although an improved Res-uNet network has been successfully applied to tree species classification, the combination of deeper Resnets, such as 18-layer, 34-layer, 50-layer, and 101-layer residual nets, with uNet has been insufficiently investigated; (3) furthermore, while Res-uNet and the deep fusion model hold great promise for improving forest classification accuracy at the tree species level, models that fuse these two approaches using multi-temporal satellite data have rarely been investigated.
As a result, this paper proposes a novel deep fusion uNet model based on multi-temporal HSR satellite data for mapping forest types at the tree species level. The proposed model is built on a two-branch deep fusion architecture that employs the deep Res-uNet model as its backbone and is named dual-uNet-Resnet in this study.
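As a highly simplified illustration of the two-branch idea, the sketch below feeds the two image dates into two separate branches and fuses their outputs into per-pixel class scores. The branch internals here are placeholders; the actual dual-uNet-Resnet architecture, with Res-uNet branches and multi-level fusion, is described in Section 2.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Placeholder for one uNet-Resnet branch; a real branch would be a full
    Res-uNet encoder-decoder as described in Section 2."""
    def __init__(self, in_channels, out_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class TwoBranchFusion(nn.Module):
    """Two branches ingest the two image dates separately; their outputs are
    concatenated and mapped to per-pixel class scores."""
    def __init__(self, bands_per_date=4, num_classes=8):
        super().__init__()
        self.branch_a = Branch(bands_per_date)   # e.g., growing-season image
        self.branch_b = Branch(bands_per_date)   # e.g., defoliation-period image
        self.fuse = nn.Conv2d(32, num_classes, 1)

    def forward(self, img_a, img_b):
        return self.fuse(torch.cat([self.branch_a(img_a),
                                    self.branch_b(img_b)], dim=1))

a = torch.randn(1, 4, 128, 128)
b = torch.randn(1, 4, 128, 128)
print(TwoBranchFusion()(a, b).shape)   # torch.Size([1, 8, 128, 128])
```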
The remainder of the paper is structured as follows. Section 2 presents the Materials and Methods in detail. Section 3 gives the results, while Section 4 discusses the feasibility of the optimized model. Finally, Section 5 concludes the paper.
4. Discussion
We developed a novel deep fusion model employing HSR remote sensing imagery from two dates to improve forest type classification at the tree species level. The experimental results demonstrated that the proposed model could efficiently extract the main tree species and forest types in the study regions, particularly plantation species such as Chinese pine and Larix principis, both of which had an OA greater than 90.00%.
The time phase of the multi-temporal remote sensing data is particularly crucial, since the major objective of this work is to classify forest types at the tree species level using multi-temporal HSR optical remote sensing images. The images from May, June, and September were chosen in this study based on the findings of the literature [46] and the availability of GF-2 data over the experimental area in the last three years. These periods correspond to the growth and defoliation stages of vegetation in the experimental area and contain a variety of phenological information that aids the classification task. It is worth mentioning that the two-branch optimized deep learning model outperformed traditional methods such as the original uNet when extracting forest types from multi-temporal satellite data (Table 4). We also attempted to incorporate the December image into the training dataset in order to improve the classification results. However, the accuracy decreased dramatically when the December image was added, and the snow-covered area had a considerable impact on the classification result. Thus, data from the vegetation growth phase and early defoliation period should be selected whenever feasible when employing HSR remote sensing images for forest type classification. If obtaining images during the vegetation growth and defoliation stages is challenging, multi-resolution optical data fusion, which is also the next research topic of this study, might be employed for modeling.
Moreover, since this study adopted a supervised deep learning optimization method for forest type classification, the quality and quantity of the training samples had a major impact on the model's classification result. We employed 149 samples in the experiment, including 119 training samples and 30 verification samples. Compared with similar studies that performed FCN-based HSR remote sensing classification, this work obtained comparable classification results with a smaller sample size. Liu et al. (2018) carried out remote sensing classification of land use types based on the FCN model, extracting a total of seven land use types with a classification accuracy of 87.1% [47]. However, that study used 400 orthophotos to extract 2800 sample blocks, a much larger sample size than ours. Fu et al. (2017) used the FCN model and two dates of GF-2 images to extract the land use types of urban areas, with an OA of 81% for 12 land use types [22]. That study divided the two dates of remote sensing images into 74 image blocks with a size of 1024 pixels each, including 70 blocks for training and four blocks for testing. The size of each image block in our study is only 310 pixels; although our sample size was much smaller, our classification accuracy was better.
Compared with previous studies on forest classification at the tree species level using multi-temporal HSR data, the proposed model achieved better performance. Agata (2019) classified tree species based on multi-temporal Sentinel-2 data and a DEM following a stratified approach, with a classification accuracy of 89.5% for broadleaf and 82% for coniferous species [46]. Ren (2019) performed a fine classification of forest types based on multi-temporal SPOT-5 and China's GF-1 data, with an accuracy of up to 92% [48]. Persson (2018) classified common tree species over a mature forest in central Sweden based on a multi-temporal Sentinel-2 dataset, with a classification accuracy of 88.2% [13].
Compared with earlier forest type classification results obtained with HSR data in the same test region, the approach proposed in this study also achieved better performance. Xie (2019) carried out the classification of tree species, forest types, and land cover types in the Wangyedian forest farm based on multi-temporal ZY-3 data and obtained an overall accuracy of 84.9% [15], which was lower than the results of this study. For the tree species classification results, the performance for the Larix principis category improved markedly with the proposed model, increasing from 87.3% to 93.24%. The proposed model also performed better than the results in [34], with the accuracy improving from 85.89% to 93.30%. Categories with an obvious improvement included Larix principis, White birch and aspen, and Construction land, for which accuracy increased from 91.3% to 92.54%, from 80.65% to 85.71%, and from 43.75% to 100.00%, respectively. Compared with the results of [32], the proposed model also performed better, especially for the Larix principis category, for which the accuracy increased from 89.86% to 92.54%.
To obtain the optimal structure of the proposed model, we also evaluated its network design. First, four residual networks with varying depths at the encoder were compared. The various fusion techniques at the decoder were then examined. Finally, the classification effects of including a residual convolution module, a standard convolution module, or no convolution module in the skip connection were assessed.
4.1. Impact of the Depth of Residual Network on Classification Results
The compared residual networks included Resnet 18, Resnet 34, Resnet 50, and Resnet 101.
Table 5 shows that Resnet 50 produced the best classification results of the four models. The weights of Resnet 50 were updated in 19 s per epoch; although its processing speed was not the fastest, it achieved the best performance in the comprehensive comparison. Of the remaining three models, the proposed model based on Resnet 34 and on Resnet 101 produced similar classification results, but Resnet 34 ran much faster than Resnet 101, with the time per epoch decreasing from 31 s to 19 s. Resnet 18 had the fastest processing speed, but its classification performance was relatively poor, possibly due to the shallow depth of the residual network.
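For reference, encoders of different depths can be obtained by reusing the convolutional stages of standard torchvision ResNets, as in the sketch below. This is an illustrative construction, not necessarily the exact encoder implementation used in this study, and the 3-band input assumed by torchvision would need a modified first layer for 4-band GF-2 data.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

def build_encoder(depth=50):
    """Return the convolutional stages of a torchvision ResNet for use as a
    uNet encoder; the classification head (avgpool + fc) is removed.
    Depth selects the 18-, 34-, 50-, or 101-layer residual net."""
    builders = {18: tvm.resnet18, 34: tvm.resnet34,
                50: tvm.resnet50, 101: tvm.resnet101}
    resnet = builders[depth]()                  # randomly initialized weights
    return nn.Sequential(*list(resnet.children())[:-2])

encoder = build_encoder(depth=50)
feat = encoder(torch.randn(1, 3, 256, 256))     # coarse feature map (stride 32)
print(feat.shape)                               # torch.Size([1, 2048, 8, 8])
```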
Figure 8 depicts the classification results of the four residual networks in further detail. All of the above residual networks performed well, except that the Larix principis category was misclassified as Chinese pine by Resnet 18 and the Korean pine category was misclassified as Construction land by Resnet 101.
4.2. Impact of the Different Fusion Strategies of the Decoder
Table 6 displays the results of the two decoder fusion strategies: final decision layer fusion (dual-uNet-Resnet-DeMerge) and multi-level fusion of all decoder layers (dual-uNet-Resnet). The multi-level fusion strategy clearly outperformed the decision layer fusion strategy, particularly for the Larix principis, Cultivated land, and Grassland categories. This might be because spatial information is extracted more efficiently by the multi-level fusion strategy, particularly for types with regular shape and texture.
Figure 9 shows the results of the two decoder fusion strategies in further detail. The multi-level fusion method clearly produced superior outcomes and significantly improved the classification result for the Larix principis category.
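The sketch below contrasts the two strategies in simplified form: decision-level fusion merges only the final score maps of the two branches, whereas multi-level fusion merges the feature maps of every decoder level before prediction. The tensors, channel sizes, and fusion convolutions are placeholders standing in for the branch outputs of the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decision_level_fusion(scores_a, scores_b, fuse_head):
    """dual-uNet-Resnet-DeMerge style: only the final score maps of the
    two branches are merged at the decision layer."""
    return fuse_head(torch.cat([scores_a, scores_b], dim=1))

def multi_level_fusion(feats_a, feats_b, fuse_convs, head):
    """dual-uNet-Resnet style: features from every decoder level of the
    two branches are merged, so spatial detail at all scales contributes."""
    fused = [conv(torch.cat([fa, fb], dim=1))
             for fa, fb, conv in zip(feats_a, feats_b, fuse_convs)]
    target = fused[0].shape[-2:]                  # finest resolution
    merged = sum(F.interpolate(f, size=target, mode="bilinear",
                               align_corners=False) for f in fused)
    return head(merged)

# Placeholder branch outputs: three decoder levels, 16 channels each.
feats_a = [torch.randn(1, 16, s, s) for s in (128, 64, 32)]
feats_b = [torch.randn(1, 16, s, s) for s in (128, 64, 32)]
fuse_convs = nn.ModuleList([nn.Conv2d(32, 16, 1) for _ in range(3)])
head = nn.Conv2d(16, 8, 1)
print(multi_level_fusion(feats_a, feats_b, fuse_convs, head).shape)  # [1, 8, 128, 128]

scores_a, scores_b = torch.randn(1, 8, 128, 128), torch.randn(1, 8, 128, 128)
print(decision_level_fusion(scores_a, scores_b, nn.Conv2d(16, 8, 1)).shape)
```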
4.3. Impact of Inserting the Convolution Module into the Skip Connection
The skip connection is one of the major components of the uNet model. To validate the effect of adding convolution modules of different architectures to the skip connection, the research compared adding a residual convolution module (dual-uNet-Resnet), an ordinary convolution module (dual-uNet-Resnet-ConvConnect), and no convolution module (dual-uNet-Resnet-WithoutConnect). According to
Table 7, the classification accuracy improved significantly as the complexity of the convolution module increased. The overall classification accuracy rose from 91.01% to 93.30%, and the same trend was observed for the Larix principis, Cultivated land, Construction land, Shrub land, and Grassland categories. As shown in
Figure 10, the classification accuracy increased following the addition of the residual convolution module, particularly in areas containing several categories.
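The three skip-connection variants compared above can be summarized by the following sketch, in which a factory function returns a residual convolution module, an ordinary convolution module, or an identity mapping. The channel sizes and module internals are illustrative assumptions rather than the exact blocks used in dual-uNet-Resnet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_skip(kind, channels):
    """Three skip-connection variants (illustrative):
    'residual' -> residual convolution module  (dual-uNet-Resnet)
    'ordinary' -> plain convolution module     (dual-uNet-Resnet-ConvConnect)
    'none'     -> identity, encoder features pass through unchanged
                                               (dual-uNet-Resnet-WithoutConnect)"""
    if kind == "ordinary":
        return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                             nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
    if kind == "residual":
        class ResSkip(nn.Module):
            def __init__(self):
                super().__init__()
                self.body = nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.BatchNorm2d(channels))
            def forward(self, x):
                return F.relu(self.body(x) + x)    # identity shortcut
        return ResSkip()
    return nn.Identity()                           # 'none'

encoder_feature = torch.randn(1, 64, 80, 80)
for kind in ("none", "ordinary", "residual"):
    out = make_skip(kind, 64)(encoder_feature)
    print(kind, tuple(out.shape))                  # all keep (1, 64, 80, 80)
```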