1. Introduction
Forests play a significant, multi-dimensional role in human health and life [1]. Reasonable afforestation can conserve soil and water, purify the air, regulate temperature, and perform other ecological functions vital to maintaining the earth's ecological safety [2,3]. The distribution of plant species is the basis of establishing a stable ecosystem and plant community. Understanding the plants' distribution can be extremely helpful in environmental protection and resource development, from both a practical and an academic perspective [4]. Therefore, classifying plant species and mapping their distribution is essential and has attracted considerable attention [5,6]. Traditional manual surveys are laborious and require many experts to be adequate. Recently, thanks to the increased availability of plant images, studies have provided more insight into automatic classification based on plants' appearance characteristics [4,7,8]. Moreover, it has become popular to collect plant pictures from the web, or to capture them directly, to create vegetation datasets [9]. Based on these datasets, many meaningful studies have been conducted, such as individual plant species classification, large-scale plant image classification, and multiple plant species segmentation.
In individual plant species classification research, an important step is to obtain plant organ features from images; for example, studies use the global features of leaf images [10] or the contour information of leaves [11] for classification. However, leaf-level image recognition is overly fine-grained and reveals little about structural composition and spatial distribution. The goal of collecting real plant information is to quickly understand the distribution of plants in an area. Plant growth is contiguous and dense in the wild, where a small-scale image of an individual plant is not easily obtained. Therefore, an image classification method that can segment plant species from large-scale images is needed to capture all plants in a particular region. Thanks to the emergence of high-spatial-resolution satellite images, some studies, including parametric and non-parametric methods, have made breakthrough advances in plant classification. Parametric methods are the mainstream algorithms in automatic plant classification. In early studies, K-means [12], maximum likelihood (ML) [13], linear discriminant analysis (LDA) [14], and principal component analysis (PCA) [15] could be easily applied to plant classification. However, these methods are affected by the distribution of the training data. Therefore, some studies propose non-parametric approaches, such as random forest (RF) [16] and support vector machine (SVM) [17], to overcome this problem. However, these methods have low efficiency and weak discriminative ability and cannot effectively capture the distribution of plant species. Therefore, studies have turned to segmentation techniques to address these problems.
In image segmentation, convolutional neural networks (CNNs) have succeeded in computer vision thanks to the rapid growth of computing power and have become the mainstream method for image segmentation [18]. Many studies propose CNN variants to improve plant segmentation performance on images of various scales. Well-trained CNNs can extract plant features and achieve good segmentation performance on small-scale images [19]. To identify plants in large-scale images, CNNs have been applied to remote sensing (RS) data and can segment vegetation with high accuracy [20]. Moreover, CNNs have also been used to identify plant crowns in aerial RS data [21]. However, existing CNNs segment plants into families rather than species, which is the limitation of most large-scale plant segmentation studies; this is because most studies use large-scale remote sensing imagery, in which some plants appear too small to be recognized well. In contrast, plants in low-altitude aerial images have a moderate size, which helps segment and classify plant species accurately. Therefore, we build a dedicated plant dataset using an unmanned aerial vehicle with an onboard optical camera and develop a novel approach for pixel-level plant species segmentation based on this self-collected dataset. Plant species segmentation from optical aerial images faces the following challenges:
The optical image only has RGB channels, carrying little information compared to multispectral and hyperspectral images.
The aerial image misses details of the tree species, such as leaf texture, edge, and shape.
Many tree species appear in the aerial images, interlaced with each other.
We demonstrate some examples related to plant segmentation in Figure 1. In Figure 1, each example has two images; the left-hand image is the input, and the right-hand image is its result. Figure 1a–c can only address a single plant; Figure 1d uses high-resolution remote sensing images to segment objects without fine-grained tree species segmentation; Figure 1e takes RGB remote sensing imagery for plant detection. These approaches cannot segment tree species in a fine-grained way against complex backgrounds. Figure 1f shows our results, which segment interlaced tree species against a complex background.
In this study, we analyze in depth the extraction of plant features with an information enhancement technique and develop a convolutional neural network with enhanced nested downsampling features, named END-Net, for semantic segmentation of plants. The proposed END-Net nests a tiny encoder–decoder framework in each downsampling block to replace the original ordinary convolution operation and adds a pixel-based enhancing module to each encoder block. The pixel-based enhancing module introduces a learnable variable map of size n by n, the same size as the corresponding feature map, to adjust the enhancement information; it is then associated with the corresponding features to obtain the enhanced features. In addition, we conduct extensive experiments with well-known semantic segmentation frameworks on our self-built dataset and present quantitative and qualitative results showing that the proposed model is generally beneficial to the plant semantic segmentation task. In summary, our main contributions are four-fold:
We propose a novel enhancing module; it consists of a learnable variable map and can adaptively enhance each pixel's features. Moreover, its simplicity makes it a plug-and-play module. In the ablation analysis, the proposed variable map (plug-and-play module) improves accuracy by 0.45% and 0.60% over using a single variable in the PA and FWIoU metrics, respectively.
We nest a tiny encoder–decoder framework in the downsampling process to replace the original ordinary convolution operation, which can extract deeper, more informative features of plant species. In the ablation analysis, the accuracy of the network without the tiny encoder–decoder framework is 1.3% lower than that of the proposed network.
Combining the enhancing module and the nested structure, the proposed END-Net extracts rich, distinctive features of plant species and achieves the best performance on the self-collected plant dataset (OAPI dataset) compared with other well-known methods. For example, the accuracies of END-Net are 18.49% and 20.17% higher than OCNet and ASPOCRNet in the PA metric, respectively.
We build an optical aerial plant image dataset named OAPI, which contains hundreds of optical aerial images with corresponding manual annotations. It is constructive for the study of plant species segmentation. To the best of our knowledge, the OAPI dataset is one of the few databases focused on optical images captured by a low-altitude drone.
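To make the first contribution concrete, the core idea of the pixel-based enhancing module can be sketched in a few lines. The NumPy sketch below assumes the learnable n × n variable map is broadcast across channels and associated with its feature map by element-wise multiplication; both choices are illustrative assumptions, and Section 3 gives the exact definition.

```python
import numpy as np

# Minimal sketch of the pixel-based enhancing module. The broadcast across
# channels and the element-wise multiplication are our own assumptions for
# illustration, not the paper's exact definition.
n, channels = 4, 3
variable_map = np.ones((n, n))  # learnable in the network, initialized to 1
features = np.arange(n * n * channels, dtype=float).reshape(n, n, channels)

# Each spatial position gets its own enhancing factor.
enhanced = features * variable_map[:, :, None]

# With the map at its initial value of 1, features pass through unchanged;
# training then adapts each factor to emphasize discriminative pixels.
assert enhanced.shape == features.shape
assert np.array_equal(enhanced, features)
```

Because the map has one factor per pixel, edges and corners can be weighted differently from homogeneous regions, which is the behavior examined later in the ablation analysis.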
This study focuses on fine-grained segmentation of plant species in optical aerial images, which differs from existing studies such as single plant species segmentation [25], rough segmentation [26], plant crown segmentation [27], and single-object segmentation based on current multi-functional technologies [28]. The segmentation accuracy of the proposed method surpasses that of recent semantic segmentation models from popular journals and conferences, such as OCNet [29] and ASPOCRNet [30].
We organize the rest of this study as follows: Section 2 introduces related work, including semantic segmentation, aerial image semantic segmentation, and information fusion. Section 3 explains the network architecture of the proposed END-Net and the details of the enhancing module. Section 4 describes the self-collected OAPI dataset with its 16 tree species. Section 5 presents the implementation details and experimental results, including quantitative and qualitative results. Section 6 discusses the effects of various hyper-parameter settings, data expansion, and external factors, as well as the advantages and limitations of the proposed model. Section 7 gives the conclusions.
4. Dataset
We aim to realize semantic plant species understanding from aerial images. However, existing public plant datasets over-process plant images, meaning there is only one plant or plant organ per image, and they are mainly designed for classification, without segmentation ground truth. Therefore, no existing dataset satisfies our research goal, prompting us to build the Optical and Aerial Plant species Images (OAPI) dataset.
We take Anxi and Changting counties as the study area; both have suffered severe soil erosion, have undergone vegetation restoration, and are therefore significant for research. We carefully designed the recording procedure to capture abundant plant information per image at high resolution and acquired thousands of aerial images from a moving UAV in summer. We use an unmanned aerial vehicle (UAV) equipped with an optical camera to capture aerial images of the vegetation from a bird's-eye view. The UAV model is the DJI Inspire 1 RAW, and its specific parameters are as follows: the rotational angular velocity is 300°/s on the pitch axis and 150°/s on the heading axis. The optical camera is a Zenmuse X5R, which provides high-resolution images.
We consider different aerial shooting heights, setting the sampling height range at 20–100 m and concentrating mainly on 20–60 m, as shown in Figure 7, to verify the robustness of the proposed END-Net. In Figure 7, leaf appearance, such as texture, shape, and color, can be easily distinguished at low altitudes. However, plants occupy a smaller area as the aerial height increases; they may be overlooked, which adds to the difficulty of segmentation. Finally, we selected 592 images to produce the training and testing sets.
We invite relevant professionals to label the selected images using various colors based on the distribution of plant species, ensuring that each class has its own unique color. The OAPI dataset has 16 classes, as shown in Table 1: Background (#0), Pinus massoniana (#1), Eucalyptus citriodora (#2), Dicranopteris dichotoma (#3), Photinia serrulata (#4), Adenanthera pavonina (#5), Blechnum orientale (#6), Miscanthus sinensis (#7), withered Dicranopteris dichotoma (#8), withered Pinus massoniana (#9), Unknown (#10), Stone (#11), Schima superba (#12), Mosla chinensis (#13), Carmona microphylla (#14), and Liquidambar formosana (#15). In Table 1, we can observe that the morphological characteristics of most plants differ. For example, the leaves of #1 are fasciculate, slender, and slightly twisted; those of #2 are narrow and needle-shaped; and those of #15 are thin, leathery, and broadly ovate. However, aerial images sometimes fail to show clear differences, as for #4 and #5. In Figure 8, we demonstrate an optical aerial image with its ground truth; we mark the same plant species with the same color and outline the intersections of different plants in yellow according to their distribution.
5. Experiments
This section first describes the experimental settings and the evaluation indicators for semantic segmentation. Then, we present the quantitative evaluation of the proposed END-Net against eleven well-known methods, show the qualitative results with visualizations, and run diagnostic and ablation experiments to evaluate the feasibility and robustness of END-Net. All experiments were executed on Ubuntu 16.04 using an NVIDIA 1080 graphics card, and all experimental settings are kept consistent.
5.1. Implementation Details
We implement our model in TensorFlow and execute all experiments on a workstation with an NVIDIA 1080 (11 GB) under Ubuntu 16.04. The hyper-parameters are as follows: the number of training epochs is set to 400, but training terminates early when over-fitting occurs; each batch has eight images per GPU; and the dropout rate of the proposed net is 0.5. The initial learning rate is 0.0003; the optimizer is the Adam optimizer; images are resized to a uniform size; and there are 414/178 images for training and testing, respectively. The learnable variable map k is initialized to 1.
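For convenience, the settings above can be gathered into a single configuration object; the sketch below merely restates these hyper-parameters, and the key names are our own shorthand rather than identifiers from any released code.

```python
# Hyper-parameter settings from Section 5.1 gathered into a plain dictionary;
# key names are our own shorthand, not identifiers from the paper's code.
config = {
    "max_epochs": 400,          # training stops early when over-fitting occurs
    "batch_size": 8,            # images per GPU
    "dropout_rate": 0.5,
    "initial_learning_rate": 3e-4,
    "optimizer": "Adam",
    "num_train_images": 414,
    "num_test_images": 178,
    "variable_map_init": 1.0,   # learnable variable map k initialized to 1
}

# Sanity check: the train/test split covers all 592 selected images.
assert config["num_train_images"] + config["num_test_images"] == 592
```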
5.2. Evaluation Indicators for Semantic Segmentation
In this study, we take PA (Pixel Accuracy), MPA (Mean Pixel Accuracy), MIoU (Mean Intersection over Union), and FWIoU (Frequency Weighted Intersection over Union) as the evaluation indicators, defined in terms of TP (True Positive), FP (False Positive), TN (True Negative), and FN (False Negative). TP means that the model's prediction is a positive example and the actual observation is positive. FP means that the model's prediction is a positive example but the actual observation is negative. TN means that the model's prediction is a negative example and the actual observation is negative. FN means that the model's prediction is negative but the actual observation is positive.
5.2.1. PA (Pixel Accuracy)
PA is the ratio of correctly classified pixel points to all pixel points; it reflects the classification accuracy of the whole image and can be computed from the confusion matrix. The equation of PA is as follows:

PA = (TP + TN) / (TP + TN + FP + FN)
5.2.2. FWIoU (Frequency Weighted Intersection over Union)
FWIoU is an extension of MIoU (Mean Intersection over Union): the IoU of each class is weighted by the frequency of that class's occurrence and then summed. The IoU of a given class is the ratio of the intersection to the union of the predicted result and the ground truth. The values of IoU and FWIoU can also be computed from the confusion matrix. The equations of IoU and FWIoU are as follows:

IoU = TP / (TP + FP + FN)

FWIoU = Σ_k (N_k / N) · IoU_k

where N_k is the number of pixels belonging to class k and N is the total number of pixels.
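As a concrete illustration, both indicators can be computed directly from a confusion matrix. The sketch below assumes the common convention that rows are ground-truth classes and columns are predictions; the function names are ours.

```python
import numpy as np

def pixel_accuracy(cm):
    """PA: correctly classified pixels (the diagonal) over all pixels."""
    return np.trace(cm) / cm.sum()

def iou_per_class(cm):
    """IoU_k = TP_k / (TP_k + FP_k + FN_k) for each class k."""
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # rows: ground-truth pixels missed per class
    fp = cm.sum(axis=0) - tp   # columns: pixels wrongly predicted per class
    return tp / (tp + fp + fn)

def fw_iou(cm):
    """FWIoU: per-class IoU weighted by the class's pixel frequency."""
    freq = cm.sum(axis=1) / cm.sum()
    return (freq * iou_per_class(cm)).sum()

# Toy 2-class confusion matrix: 3 + 5 correct pixels out of 10.
cm = np.array([[3, 1],
               [1, 5]])
assert pixel_accuracy(cm) == 0.8
assert abs(fw_iou(cm) - (0.4 * 3 / 5 + 0.6 * 5 / 7)) < 1e-12
```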
5.3. Quantitative Analysis
In the quantitative analysis, we compare the proposed END-Net with eleven well-known semantic segmentation architectures on the self-collected dataset (OAPI dataset), including Unet [32], FCN [31], RefineNet [46], FC-DenseNet [35], FRRN [51], DeepResUNet [47], BiSeNet [52], DANet [53], CFNet [54], ASPOCRNet [30], and OCNet [29], and adopt the PA, FWIoU, MPA, MIoU, Parameters (Params), and FPS metrics to demonstrate the validity of the proposed method, as shown in Table 2.
In Table 2, the proposed END-Net has the best accuracy in the PA, FWIoU, MPA, and MIoU metrics, achieving 84.52%, 74.96%, 52.17%, and 37.49%, respectively. It is 1.98%, 3.03%, 0.56%, and 1.36% higher than the second-best approach (Unet) in PA, FWIoU, MPA, and MIoU, respectively, indicating a noticeable improvement over our backbone (Unet). Moreover, the compared methods that take ResNet101 as the backbone have low accuracy in all metrics, which illustrates that ResNet101 is not suitable for our dataset. In addition, we also take FCN as the backbone, apply the proposed pixel-based enhancing module in its downsampling blocks, and achieve 81.61%, 70.36%, 45.14%, and 34.14% in PA, FWIoU, MPA, and MIoU, which is 4.90%, 7.15%, 3.80%, and 5.21% higher than plain FCN, respectively. In terms of Params, our net is the third-smallest model while achieving the best accuracy in PA, FWIoU, MPA, and MIoU; compared with the second-best method (Unet), our model is 5.6 M smaller. In the FPS metric, most methods run at 11–12 FPS; OCNet has the best FPS at 14.35, and the proposed method reaches 11.42, close to most of the methods.
Additionally, we select three situations, (I) common tree species, (II) more numerous and scattered tree species, and (III) uncommon categories, to assess the robustness of our model, and demonstrate the fine-grained segmentation results of these situations in Table 3, Table 4, and Table 5, respectively. In Tables 3–5, we use the PA metric to present each model's performance in classifying each plant species; ID is the number of the plant species, Model is the name of the compared method, and Overall is the PA metric for the whole image. The best accuracy is marked in red, and the second-best accuracy is marked in green.
Table 3 presents situation I, an aerial image containing common tree species. The proposed frameworks, namely FCN with the proposed enhancing module and END-Net, have almost the best accuracy in classifying every tree species except #0, and for #0 they are only 0.27% lower than the best approach. In Table 4, each tree species is widely distributed across the image, and the number of tree species exceeds that of situation I. The proposed END-Net performs well and has the best overall accuracy, except in recognizing #0 (the background). Our model has the best accuracy in classifying Blechnum orientale (#6), which occupies few pixels in the image and on which the compared models perform poorly. Moreover, FCN with the proposed pixel-based enhancing module embedded in its downsampling blocks has the second-best overall accuracy. In Table 5, the numbers of tree species and of rare tree species increase compared with situations I and II. None of the models performs well for most tree species, but the proposed END-Net and the enhanced FCN achieve the best overall accuracies. More specifically, the proposed frameworks perform well for most tree species compared with the other methods.
5.4. Qualitative Analysis
This section provides visualizations of the predicted results to describe the capability of each approach, as shown in Table 6. In Table 6, we consider four situations: (A) few tree species with a relatively concentrated distribution; (B) a piece of bare land and more tree species, including some rare classes; (C) a relatively scattered distribution of the same tree; and (D) a lot of bare land with many tree species, where the number of each species is limited and most of the rare categories appear.
In Table 6, most networks have difficulty identifying the tree species in the proposed dataset, and their performance is far from ideal. However, the proposed net performs best in both situations: tree species with a dense distribution and tree species with a scattered distribution. In situation A, the proposed framework produces the visualization closest to the ground truth, while the compared methods produce broken segmentation results. In situation B, none of the nets performs well, especially on withered Pinus massoniana (#9); however, our model has the best overall performance, and its recognition of Pinus massoniana (#1), Dicranopteris dichotoma (#3), and Schima superba (#12) is very close to the ground truth. In situation C, the shape of each tree recognized by our model resembles the ground truth, though it is slightly lacking in detail, and our model again performs best overall. In situation D, all models perform poorly, but the proposed net still performs best among them. Overall, our model has the best performance in the various situations; FCN with the proposed pixel-based enhancing module embedded in its downsampling blocks (with a VGG backbone) has the second-best performance; and the remaining compared methods with ResNet101 perform worst.
5.5. Diagnostic and Ablation Experiments
In this subsection, we conduct diagnostic and ablation experiments to demonstrate the feasibility and effectiveness of the proposed network.
5.5.1. Diagnostic Experiments
In the diagnostic experiments, we perform significance testing to prove the significance of the proposed enhancing module.
Significance test: We perform a paired-samples t-test to verify the significance of the pixel-based enhancing module on the self-collected dataset and demonstrate the testing results with two metrics in Table 7. In Table 7, X denotes the proposed network with the n × n variable map in the enhancing module, Y denotes the proposed network with a single variable in the enhancing module, "Sig. (2-tailed)" is the p-value of the two-sided test, and the difference column reports the difference between X and Y for each cross-validation fold. In Table 7, the performance of X is higher than that of Y in every validation fold. Moreover, the p-values (Sig.) are all less than 0.05 for both metrics. The test shows that the improvement brought by the proposed pixel-based enhancing module is significant and that the module can efficiently increase accuracy.
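The paired-samples t-test itself is routine to reproduce; the SciPy sketch below uses made-up per-fold PA values purely for illustration (they are not the numbers reported in Table 7).

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold PA scores (%) for the two settings; these values are
# illustrative only and are NOT the numbers reported in Table 7.
x = np.array([84.6, 84.3, 84.8, 84.4, 84.5])  # with the n-by-n variable map
y = np.array([84.1, 83.9, 84.2, 84.0, 84.1])  # with a single variable

# Paired-samples t-test on the same cross-validation folds.
t_stat, p_two_tailed = stats.ttest_rel(x, y)

# X beats Y in every fold, and the two-sided p-value falls below 0.05.
assert np.all(x > y)
assert p_two_tailed < 0.05
```

Pairing the folds matters here: the test compares the two settings on identical data splits, so fold-to-fold variation cancels out of the difference.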
5.5.2. Ablation Experiments
In this subsection, we conduct ablation experiments on OAPI with three settings to verify the rationality of the proposed END-Net: (1) the locations of the pixel-based enhancing module, (2) the structure of the variable map, and (3) the loss strategy.
Locations of the pixel-based enhancing module: We place the proposed pixel-based enhancing module at various locations in END-Net to explore the optimal setting and demonstrate the results in Table 8. In Table 8, we consider four locations: (a) the downsampling blocks (DS-blocks), (b) the upsampling blocks (US-blocks), (c) both (DS + US-blocks), and (d) NONE. More specifically, situations (a) and (b) place the proposed pixel-based enhancing module only at the downsampling and upsampling blocks, respectively; situation (c) uses the enhancing module in both kinds of blocks; and situation (d) does not use the enhancing module at all. The network with the proposed pixel-based enhancing module in the DS-blocks has the best accuracy, achieving 84.52% and 74.96% in the PA and FWIoU metrics. Its PA is 1.64% higher than that of the US-blocks setting, 1.75% higher than DS + US-blocks, and 1.9% higher than NONE; its FWIoU is 2.32% higher than US-blocks, 1.95% higher than DS + US-blocks, and 2.73% higher than NONE. Its accuracy is significantly higher than that of the other locations, confirming that placing the pixel-based enhancing module in the downsampling blocks achieves the best results. The tiny encoder–decoder structure with an embedded pixel-based enhancing module improves segmentation accuracy in the DS-blocks because downsampling gathers features from a large region into a small one through convolution to reduce the feature-map size; there, the enhancing module efficiently highlights the features. In contrast, an upsampling block enlarges the feature map, which blurs the features, and applying the tiny encoder–decoder structure with the pixel-based enhancing module to a blurred feature map does not yield better results.
Structures of the variable map: We consider different structures of the variable map in the enhancing module, namely a variable map of size n × n and a single variable (size 1 × 1), and demonstrate the results in Table 9. In Table 9, the item "Size" indicates the size of the variable map in the enhancing module, and n × n means the variable map has the same size as the corresponding feature map. It can be seen that a variable map of size n × n is more powerful than a single variable. The variables adjust their values (enhancing factors) adaptively as the network iterates; therefore, the n × n variable map obtains an appropriate enhancing factor for each pixel rather than applying one global enhancing factor. Note that we initialize the variable map to one regardless of its size. In Table 9, the n × n variable map performs better than the single variable, by 0.45% in PA and 0.60% in FWIoU. A feature map with only one enhancing factor gives every pixel the same weight; however, discriminative characteristics, such as edges and corners, which highlight the differences, should receive different weights (enhancing factors) to emphasize their importance in feature learning. Therefore, our study designs an n × n variable map, which yields better results.
Loss strategies: We analyze two loss strategies, single-loss and multi-loss, to determine the better one and demonstrate the results in Table 10. In the single-loss strategy, we keep only loss7 and abandon the remaining losses in our network. In the multi-loss strategy, we reserve all the losses designed in our network. In Table 10, the multi-loss strategy reaches 84.52% and 74.96% in PA and FWIoU, which is 0.09% and 0.54% higher, respectively, than the single-loss strategy. Overall, the multi-loss approach improves the performance of the proposed model.
7. Conclusions
This study proposes a plant species segmentation network with enhanced nested downsampling features (END-Net) for the complex and challenging plant species segmentation task. END-Net takes Unet as the backbone and contains three main contributions: (1) the tiny encoder–decoder structure with a pixel-based enhancing module embedded in each downsampling block efficiently highlights the features, improving segmentation accuracy; (2) the pixel-based enhancing module assigns different weights (enhancing factors) to adaptively highlight the importance of each pixel's features in feature learning; and (3) the multi-loss strategy is a deep-supervision strategy that calculates and accumulates the losses for efficient adjustment of the network. Moreover, we collected aerial optical images to construct a plant dataset, the OAPI dataset. To the best of our knowledge, the OAPI dataset is one of the few databases focused on optical images captured by a low-altitude drone.
In the experiments, we carried out diagnostic and ablation experiments to prove the significance of the proposed pixel-based module and to demonstrate the effectiveness of our network. Moreover, we provided quantitative results with six metrics to show the performance of the proposed END-Net, along with qualitative visualization results that prove its feasibility.
In the future, we will improve the model to increase the segmentation accuracy of rare plant species, which have fewer data than the dominant species. More specifically, we will incorporate class weights into the loss strategy so that the network pays more attention to categories with little data [58]. In addition, we will consider using adversarial networks [59] or various loss strategies [60] to improve the accuracy of categories with small sample sizes. Furthermore, we will keep collecting aerial optical images, especially of infrequent species, to balance the number of categories in the dataset. More precisely, we will expand the number of rare categories, such as withered Dicranopteris dichotoma, Adenanthera pavonina, Blechnum orientale, and Miscanthus sinensis, and add new sampling points to extend the number of tree species and images.