Article

YOLO-SegNet: A Method for Individual Street Tree Segmentation Based on the Improved YOLOv8 and the SegFormer Network

by Tingting Yang 1,2,3, Suyin Zhou 2,3, Aijun Xu 2,3,*, Junhua Ye 2 and Jianxin Yin 2

1 College of Chemistry and Materials Engineering, Zhejiang Agriculture and Forestry University, Hangzhou 311800, China
2 College of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311800, China
3 Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou 311300, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(9), 1620; https://doi.org/10.3390/agriculture14091620
Submission received: 27 August 2024 / Revised: 3 September 2024 / Accepted: 13 September 2024 / Published: 15 September 2024
(This article belongs to the Section Digital Agriculture)

Abstract

In urban forest management, individual street tree segmentation is a fundamental and critical step for obtaining tree phenotypes. Most existing tree image segmentation models have been evaluated on small datasets and lack experimental verification on larger, publicly available datasets. Therefore, based on a large, publicly available urban street tree dataset, this paper proposes YOLO-SegNet for individual street tree segmentation. In the first stage, the street tree object detection task, the BiFormer attention mechanism was introduced into the YOLOv8 network to increase contextual information extraction and improve the ability of the network to detect multiscale and multishaped targets. In the second stage, the street tree segmentation task, the SegFormer network was introduced to obtain street tree edge information more efficiently. The experimental results indicate that our proposed YOLO-SegNet method, which combines YOLOv8+BiFormer and SegFormer, achieved a 92.0% mean intersection over union (mIoU), 95.9% mean pixel accuracy (mPA), and 97.4% accuracy on this large, publicly available urban street tree dataset. Compared with those of the fully convolutional network (FCN), lite-reduced atrous spatial pyramid pooling (LR-ASPP), pyramid scene parsing network (PSPNet), UNet, DeepLabv3+, and HRNet, the mIoU of our YOLO-SegNet increased by 10.5, 9.7, 5.0, 6.8, 4.5, and 2.7 percentage points, respectively. The proposed method can effectively support smart agroforestry development.

1. Introduction

A significant portion of city afforestation consists of street trees. Trees enhance environmental comfort by improving the air quality [1], providing shade [2], and storing carbon [3,4].
The parameters of standing street trees, such as the tree height, trunk diameter at breast height, crown width, and tree species, are important components of urban forest inventory systems [5,6]. Traditional field measurements are time-consuming and expensive and cannot reliably and promptly capture dynamic changes in street trees [7]. Computer vision is a relatively new but increasingly popular technology for intelligent urban street tree research. Street tree images are usually captured with consumer phones, iPads, or cameras, and tree parameters are extracted from these images via visual interpretation. Wu et al. [8] first used a feature-adaptive mean-shift algorithm to segment images and extract tree contours [9] and then proposed a smartphone-based passive measurement method for the tree height and crown diameter.
Segmenting individual trees from complex backgrounds is the crucial first step of street tree inventories. Studies on tree segmentation follow two main approaches: those based on point clouds [10,11,12] and those based on images [13,14]. The morphological parameters of street trees extracted from light detection and ranging (LiDAR) point cloud data are close to the measured values [15]. Mobile laser scanning (MLS) can collect high-precision point clouds of trees [16,17,18]. One of the purposes of these three-dimensional (3D) point clouds is to measure the morphological parameters of street trees, a purpose that two-dimensional (2D) images can also serve. However, these data acquisition devices are expensive and difficult to carry, which is problematic for researchers without professional equipment who wish to carry out intelligent tree research. Furthermore, compared with 2D images, point clouds require more effort to prepare and annotate as a dataset.
Another type of tree segmentation research is based on low-cost 2D images, and many convolutional neural networks (CNNs) have been used in these studies [19,20]. For apple trees, a depth camera was used to acquire RGB and depth images, and a branch segmentation method based on a region convolutional neural network (R-CNN) was developed [21]; however, its efficiency and accuracy still need to be greatly improved. Different parts of the tree, including the trunk and the whole tree, should be recognized. A SegNet-based network architecture [22] with a Kinect V2 camera was used to segment branches, trunks, and trellis wires [19]. This method used the depth camera to obtain the foreground RGB image and improve the IoU value, but it still has room for improvement. A UNet++-based method for tree branch localization [23] was implemented to achieve pixelwise segmentation of RGB images [24]; UNet++ is an improved version of UNet [25] in which the skip paths are revised with denser skip connections. Most existing tree segmentation methods focus on local information extraction, such as tree branches and canopies, and there is a lack of research on whole trees covered with leaves. Moreover, the experimental data in these studies are customized, and the effectiveness of the algorithms has not been verified on relevant public datasets.
To address these problems, we use larger, publicly available urban street tree datasets to verify our segmentation method. The data and number of tree species used in this paper far exceed those of previous related studies. Tree image object detection was first conducted to extract tree positions in images, and then the image segmentation step was performed to improve the street tree segmentation accuracy. Therefore, in the object detection stage, the BiFormer attention mechanism [26] was introduced into the YOLOv8 network [27] to extract regions of interest and reduce the impact of the background on target segmentation. Based on the detection results, the SegFormer [28] network was introduced to segment street trees. The multilayer perceptron (MLP) decoder aggregates information from different layers and thus combines both local and global attention to render powerful representations. Therefore, it is very suitable for street tree objects with large differences in global and local features.
The main contributions of this study are as follows.
(1)
Street tree datasets were reconstructed. The original images [13] were re-annotated and re-transformed according to the YOLO dataset format to produce a new street tree dataset for object detection. Our method was applied to this public street tree segmentation dataset.
(2)
The BiFormer module was introduced into YOLOv8 to reduce the background interference in street tree segmentation tasks, enabling the backbone network to capture long-distance dependencies effectively. The experimental results also show that the street tree object detection results markedly affect the image segmentation task.
(3)
This new YOLO-SegNet model, combined with the improved YOLO + BiFormer and SegFormer models, was used to accurately segment individual street trees of different species.

2. Materials and Methods

2.1. Street Tree Dataset

2.1.1. Image Acquisition

In previous research, we published a large public dataset on urban street tree image segmentation [13]. The details of the street tree dataset are available at https://ytt917251944.github.io/dataset_jekyll (accessed on 10 October 2023). These street tree images were obtained using different consumer-grade equipment, such as an iPhone 12, iPhone 13, and iPad 9.
The street tree images were collected in natural environments. A total of 3949 street tree images were produced, and the numbers of different tree species are shown in Figure 1A. A tree image object detection task was first conducted to extract tree positions from the image to improve the accuracy of street tree segmentation. Then, an image segmentation step was performed. Based on public street tree images [13], the LabelMe rectangle tool was used to carry out object detection labeling, as shown in Figure 1B.
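As a hedged illustration of this re-annotation step, the sketch below converts a LabelMe rectangle annotation into a YOLO-format label file; the class map, file paths, and function name are illustrative assumptions rather than the authors' actual conversion script.
```python
import json
from pathlib import Path

CLASS_MAP = {"tree": 0}  # hypothetical class map; the public dataset may use species names

def labelme_rect_to_yolo(json_path: str, out_dir: str) -> None:
    """Convert one LabelMe JSON file with rectangle shapes to a YOLO label file."""
    ann = json.loads(Path(json_path).read_text())
    img_w, img_h = ann["imageWidth"], ann["imageHeight"]
    lines = []
    for shape in ann["shapes"]:
        if shape.get("shape_type") != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # YOLO format: class_id x_center y_center width height, all normalized to [0, 1]
        xc = (x1 + x2) / 2.0 / img_w
        yc = (y1 + y2) / 2.0 / img_h
        bw = abs(x2 - x1) / img_w
        bh = abs(y2 - y1) / img_h
        lines.append(f"{CLASS_MAP[shape['label']]} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    (Path(out_dir) / (Path(json_path).stem + ".txt")).write_text("\n".join(lines))
```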

2.1.2. Image Annotation

In our object detection task, we further used rectangular boxes to label the original street tree images. In the instance segmentation task, this public street tree dataset provides fine annotations at the pixel level, as Figure 2 shows. The rectangular boxes are the object detection labels, and the segmentation masks with different colors are the segmentation labels. Intuitively, there are great differences in the contour information of different tree species. Some tree crowns are nearly flat–spherical, such as those of Cinnamomum camphora (Linn) Presl. Some are nearly conical, such as Cedrus deodara. Some are nearly umbrella-shaped, such as Ginkgo biloba. Some are nearly weeping, such as Salix babylonica. We also divided the street tree dataset. See Table 1 for details.

2.2. YOLO-SegNet Model

Figure 3 shows the structure of our method, YOLO-SegNet, including the object detection module and image segmentation module. Based on the street tree object detection dataset, the raw tree images were fed into the YOLOv8 module to obtain the background-removed images. The background region was set to black. Based on the street tree image segmentation dataset, the background-removed images were then input into the SegFormer network to obtain the final segmentation images.
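The following minimal sketch illustrates this two-stage inference flow: the detector returns tree bounding boxes, everything outside the boxes is set to black, and the masked image is passed to the segmentation network. The weights file name and the `segformer` callable are placeholders, not artifacts released with the paper.
```python
import cv2
import numpy as np
from ultralytics import YOLO

detector = YOLO("yolov8m_biformer.pt")  # hypothetical weights for the improved detector

def yolo_segnet_inference(image_path: str, segformer) -> np.ndarray:
    """Stage 1: detect trees and black out the background; stage 2: segment the masked image."""
    image = cv2.imread(image_path)
    result = detector(image)[0]                       # YOLOv8 object detection
    masked = np.zeros_like(image)                     # background region set to black
    for box in result.boxes.xyxy.cpu().numpy().astype(int):
        x1, y1, x2, y2 = box
        masked[y1:y2, x1:x2] = image[y1:y2, x1:x2]    # keep only the detected tree regions
    return segformer(masked)                          # placeholder call to a trained SegFormer
```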

2.2.1. YOLOv8 with the BiFormer Attention Mechanism

Researchers have conducted multiple updates and iterations on YOLO [30,31,32]. YOLOv8 [27], an improvement on the previous versions, further enhances the object detection performance. See the left of Figure 3. YOLOv8 is an anchor-free model that can directly predict the center of an object.
A street tree is a large, irregular object: the distance between the tree top and the tree bottom is large, and the features of these regions differ considerably. Therefore, this study introduces BiFormer into the backbone [26,33,34] to capture long-range dependencies in street tree images, as shown in Figure 4. BiFormer uses bilevel routing to enable more flexible, content-aware computation allocation, which benefits the training and inference of large models [35,36,37,38].
Several studies [39,40,41,42,43] have proposed different sparse attention mechanisms. Vanilla attention is the most basic attention mechanism, with high computational complexity and a heavy memory footprint. Local attention limits the attention operations to local windows but may sacrifice some global context information. Axial attention is a pattern of sparse attention along a specific axis. Dilated attention is calculated by using an expanded window pattern over a local area. Deformable attention allows the attention window to be adjusted adaptively according to the image content, thus capturing image features more flexibly. As Figure 5b–d shows, these existing studies try to reduce the complexity via sparse attention.
Therefore, a dynamic, query-aware bilevel routing attention (BRA) was proposed, as shown in Figure 5f. The upper-layer router interacts with all image blocks through a global self-attention mechanism and generates a global image representation. The low-level router uses a local self-attention mechanism to interact with each image block and generate a local image representation. Through this dual-layer routing attention mechanism, BiFormer is able to capture both global and local feature information, thereby improving the accuracy of the model.
BRA: A feature map $X \in \mathbb{R}^{H \times W \times C}$ is separated into $S \times S$ nonoverlapping regions such that each region contains $\frac{HW}{S^2}$ feature vectors. This is carried out by reshaping $X$ as $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$. The query, key, and value tensors $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$ are derived as follows:
$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v$$
where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are the projection weights of the query, key, and value, respectively.
Each query token in region $i$ attends to all key–value pairs residing in the union of the $k$ routed regions indexed with $I^r_{(i,1)}, I^r_{(i,2)}, \ldots, I^r_{(i,k)}$. The gathered key and value tensors are as follows:
$$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r), \quad I^r = \mathrm{topkIndex}\left(Q^r (K^r)^{\mathrm{T}}\right)$$
where $K^g, V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ are the gathered key and value tensors, respectively, and $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$ are the region-level query and key. See Figure 6.
The attention output is
$$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V)$$
where the function $\mathrm{LCE}(\cdot)$ is parameterized with a depthwise convolution.
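To make the routing process concrete, the PyTorch sketch below implements a single-head version of bilevel routing attention following the equations above; the region count S, top-k value, and layout details are illustrative choices, not the exact BiFormer implementation.
```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    def __init__(self, dim: int, num_regions: int = 7, topk: int = 4):
        super().__init__()
        self.S, self.k = num_regions, topk
        self.qkv = nn.Linear(dim, 3 * dim)                        # W_q, W_k, W_v
        self.lce = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise-conv LCE(V)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, H, W, C), H and W divisible by S
        B, H, W, C = x.shape
        S, k, n = self.S, self.k, (H * W) // (self.S ** 2)
        q, key, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):
            # split the map into S x S regions of n tokens each: (B, S^2, n, C)
            t = t.view(B, S, H // S, S, W // S, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, n, C)

        qr, kr, vr = map(to_regions, (q, key, v))
        # region-level routing: pick the top-k most relevant regions for every region
        a_r = qr.mean(2) @ kr.mean(2).transpose(-1, -2)           # (B, S^2, S^2)
        idx = a_r.topk(k, dim=-1).indices                         # (B, S^2, k)

        def gather(t):
            # gather the routed regions' tokens for every query region: (B, S^2, k, n, C)
            expanded = t.unsqueeze(1).expand(-1, S * S, -1, -1, -1)
            index = idx[..., None, None].expand(-1, -1, -1, n, C)
            return torch.gather(expanded, 2, index)

        kg, vg = gather(kr).flatten(2, 3), gather(vr).flatten(2, 3)  # (B, S^2, k*n, C)
        # fine-grained token-to-token attention within the routed regions
        attn = (qr @ kg.transpose(-1, -2)) / C ** 0.5
        out = attn.softmax(-1) @ vg                                # (B, S^2, n, C)
        out = out.reshape(B, S, S, H // S, W // S, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        lce = self.lce(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # local context enhancement on V
        return out + lce
```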

2.2.2. SegFormer Network

SegFormer [28] is an efficient and reliable semantic segmentation model that combines a transformer encoder with a lightweight multilayer perceptron (MLP) decoder. SegFormer includes a hierarchically structured transformer encoder in which the output feature size of each transformer layer decreases layer by layer. In this way, feature information at different scales is captured, i.e., coarse-grained large-scale and fine-grained small-scale features are extracted. Instead of a complex decoder, SegFormer uses a simple and lightweight MLP decoder that merges information from different layers and thus combines local and global attention to provide powerful representations. Therefore, it is very suitable for street tree objects with large differences between global and local features.
The query head $A$, key head $B$, and value head $D$ have the same dimensions $N \times C$, where $N = H \times W$. The self-attention is as follows:
$$\mathrm{Attention}(A, B, D) = \mathrm{Softmax}\!\left(\frac{A B^{\mathrm{T}}}{\sqrt{d_{\mathrm{head}}}}\right) D$$
In the attention calculation, the reduction ratio $R$ is used to shorten the key sequence $B$ as follows:
$$\hat{B} = \mathrm{Reshape}\!\left(\frac{N}{R}, C \cdot R\right)(B), \qquad B = \mathrm{Linear}(C \cdot R, C)(\hat{B})$$
where $B$ is the mapping feature of the input, $\hat{B}$ is the feature after the dimension transformation of $B$, and the new $B$ is the feature after dimension reduction.
The Mix-FFN (feed-forward network) is as follows:
$$x_{\mathrm{out}} = \mathrm{MLP}\big(\mathrm{GELU}\big(\mathrm{Conv}_{3\times3}(\mathrm{MLP}(x_{\mathrm{in}}))\big)\big) + x_{\mathrm{in}}$$
where $x_{\mathrm{in}}$ is the feature from the self-attention module. The Mix-FFN mixes a $3 \times 3$ convolution and an MLP in each FFN.
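For illustration, the sketch below implements the efficient self-attention (with the key/value sequence reduced by a strided convolution, one common realization of the reduction above) and the Mix-FFN block; dimensions and the reduction ratio are illustrative.
```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Conv2d(dim, dim, reduction, stride=reduction)  # sequence reduction by R
        self.norm = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:  # x: (B, N, C), N = h*w
        b, n, c = x.shape
        q = self.q(x)
        xr = self.reduce(x.transpose(1, 2).reshape(b, c, h, w))          # (B, C, h/R, w/R)
        xr = self.norm(xr.flatten(2).transpose(1, 2))                    # reduced key/value tokens
        k, v = self.kv(xr).chunk(2, dim=-1)
        attn = (q @ k.transpose(-1, -2)) * self.scale
        return attn.softmax(-1) @ v                                      # (B, N, C)

class MixFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # the 3x3 conv in the FFN
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:  # x: (B, N, C)
        b, n, c = x.shape
        y = self.fc1(x)
        y = self.dwconv(y.transpose(1, 2).reshape(b, -1, h, w)).flatten(2).transpose(1, 2)
        return self.fc2(self.act(y)) + x                                  # residual connection
```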
In the lightweight all-MLP decoder, the channel dimension is first unified:
$$\hat{F}_i = \mathrm{Linear}(C_i, C)(F_i), \quad \forall i$$
where $i \in \{1, 2, 3, 4\}$ and $F_i$ denotes the multilevel features.
Then, the features are upsampled and concatenated:
$$\hat{F}_i = \mathrm{Upsample}\!\left(\frac{W}{4} \times \frac{W}{4}\right)(\hat{F}_i), \quad \forall i$$
Next, the concatenated features are fused into $F$:
$$F = \mathrm{Linear}(4C, C)\big(\mathrm{Concat}(\hat{F}_i)\big), \quad \forall i$$
Finally, the segmentation mask $M$ is predicted:
$$M = \mathrm{Linear}(C, N_{\mathrm{cls}})(F)$$
where $M$ has a resolution of $\frac{H}{4} \times \frac{W}{4} \times N_{\mathrm{cls}}$, and $N_{\mathrm{cls}}$ is the number of categories.
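A minimal sketch of this all-MLP decoder is given below; the stage channel sizes, embedding dimension, and class count are illustrative, and a real SegFormer configuration may differ.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256, num_classes=2):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(c, embed_dim) for c in in_channels)  # Linear(C_i, C)
        self.fuse = nn.Linear(4 * embed_dim, embed_dim)      # Linear(4C, C) after concatenation
        self.classifier = nn.Linear(embed_dim, num_classes)  # Linear(C, N_cls)

    def forward(self, features):
        # features: multilevel maps F_i, the first at 1/4 resolution, the rest progressively smaller
        b = features[0].shape[0]
        target_hw = features[0].shape[2:]                    # upsample everything to H/4 x W/4
        unified = []
        for f, lin in zip(features, self.linears):
            h, w = f.shape[2:]
            f = lin(f.flatten(2).transpose(1, 2))                   # (B, N_i, C)
            f = f.transpose(1, 2).reshape(b, -1, h, w)              # back to (B, C, h, w)
            unified.append(F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False))
        fused = torch.cat(unified, dim=1).flatten(2).transpose(1, 2)  # (B, N, 4C)
        fused = self.fuse(fused)                                      # (B, N, C)
        mask = self.classifier(fused)                                 # (B, N, N_cls)
        return mask.transpose(1, 2).reshape(b, -1, *target_hw)        # (B, N_cls, H/4, W/4)
```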

3. Results and Analysis

3.1. Experimental Environment and Evaluation Indicators

The experimental environment consists of an Intel Core i7-12700KF CPU, 64 GB RAM, 10 GB of video memory, an Nvidia GeForce RTX3090Ti GPU, CUDNN V11.7, the CUDA Toolkit 9.0, the Ubuntu 22.04 operating system, and Python 3.7.12.
The precision, recall, and mean average precision (mAP) are used as the object detection evaluation indices [27]. The intersection over union (IoU), mean intersection over union (mIoU), pixel accuracy (PA), and mean pixel accuracy (mPA) are used as segmentation evaluation indices [28]. Specific definitions are provided in the corresponding research literature.
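For reference, the sketch below computes the per-class IoU/PA and their means (mIoU/mPA) from a confusion matrix, matching the standard definitions cited above; the function names are ours, not from the cited papers.
```python
import numpy as np

def confusion_matrix(pred: np.ndarray, label: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a (num_classes x num_classes) confusion matrix from flattened label/prediction maps."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_pa_metrics(cm: np.ndarray):
    """Return per-class IoU, mIoU, per-class PA, and mPA."""
    tp = np.diag(cm)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)   # intersection over union per class
    pa = tp / (cm.sum(1) + 1e-10)                     # pixel accuracy per class
    return iou, iou.mean(), pa, pa.mean()
```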

3.2. YOLOv8m+BiFormer Model Results and Analysis

The YOLOv8 series comprises YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. For the urban street tree image object detection task, the hyperparameters were set to 200 epochs, a batch size of 32, an initial learning rate of 0.01, momentum of 0.937, and a weight attenuation coefficient of 0.0005. The trained model was then evaluated on the validation and test sets.
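A hedged sketch of launching such a training run through the Ultralytics API is shown below; the dataset YAML path and the weights file are illustrative, and the BiFormer-modified backbone would have to be registered separately.
```python
from ultralytics import YOLO

# Baseline YOLOv8m weights; a BiFormer-modified model configuration would replace this.
model = YOLO("yolov8m.pt")
model.train(
    data="street_trees.yaml",  # hypothetical dataset description file
    epochs=200,
    batch=32,
    lr0=0.01,                  # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    imgsz=640,
)
```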
Figure 7A–D exhibit the loss function curves on the training and validation sets. These loss function curves gradually decrease and flatten, indicating that the models converge. Figure 7E–H display the results of the object detection evaluation indices on the validation set. These curves show that the network converges quickly and tends to be stable.
In this work, five YOLOv8 models were trained and verified using a large public urban street tree dataset. The details are provided in Table 2 and Table 3. The results show that YOLOv8m was superior to YOLOv8l and YOLOv8x in terms of parameters (M), layers, and preprocessing, inference, and postprocessing times (ms), while its object detection evaluation indicators were comparable. Compared with those of the smaller YOLOv8n and YOLOv8s models, the preprocessing and postprocessing times of YOLOv8m decreased rather than increased, and the mAPval50 and mAPval50-95 improved, as detailed in Table 3.
Table 2 shows that the params, FLOPs, layers, preprocessing, inference, and postprocessing values of the YOLOv8m+BiFormer model were much lower than those of YOLOv8l and YOLOv8x. As shown in Table 3, its mAPval50-95 also increased by 0.3 and 0.5 percentage points relative to these two models, respectively. The experimental results demonstrate that the YOLOv8m+BiFormer model can still achieve high object detection performance while reducing the number of parameters and layers. Moreover, this approach greatly reduces the preprocessing, inference, and postprocessing times.
Therefore, we specifically selected YOLOv8m, which has a moderate model size but relatively high detection accuracy, as the main network of the object detection module. The loss function of YOLOv8m+BiFormer on the training and validation sets and the change curves of the object detection indicators on the validation set are shown in Figure 7.
To display the performance of the object detection models more directly, Figure 8A shows thermal map examples of the YOLOv8 series models and YOLOv8m+BiFormer in the training process to reveal the intensity distribution of the detected targets. According to the thermal images of layer 9 in Figure 8A, most of the YOLOv8 series models mistakenly captured the left target, while the YOLOv8m+BiFormer model was better able to capture the target region. In the thermal images of layer 10, the color highlight region of the YOLOv8m+BiFormer model is larger and more closely fits the tree crown region than the other models, indicating that the model is more accurate in locating the targets of interest in the image. In the thermal images of layer 12, there are multiple scattered thermal regions, while our model focuses more on capturing the slender trunk region. The thermal region of our YOLOv8m+BiFormer model is larger and closer to the target region in the images; in other words, the region that attracts the attention of the network is relatively more accurate.
Figure 8B displays examples of the results of the different object detection models on the street tree test set. Combined with Figure 8A,B, it is found that the trunk is an important false detection region. The thermal value of the trunk in the thermal map is relatively low, so the trunk region object detection effect in the sample images is not ideal. However, after the BiFormer attention mechanism is fused, the trunk region detection performance significantly improves.

3.3. Results and Analysis of the Different Segmentation Models

In this paper, an approach for urban street tree image segmentation based on the improved YOLO-SegNet is proposed. First, the YOLOv8m+BiFormer model was introduced to conduct the object detection task, and then the object detection results were fed into the SegFormer model for the segmentation task. To confirm the segmentation performance of our YOLO-SegNet method, we carried out FCN [44], LR-ASPP [45], PSPNet [46], UNet [25], DeepLabv3+ [47], HRNet [48], and SegFormer [28] experiments. For the urban street tree segmentation task, the hyperparameters were set as follows: 200 epochs, a batch size of 8, an initial learning rate of 0.01, momentum of 0.9, and a weight attenuation coefficient of 0.0001. The trained model was then evaluated on the street tree validation and test datasets. Figure 9A shows the training loss function curves of the segmentation models without the object detection module, and Figure 9B shows those of the segmentation models with the object detection module. The loss function curves of these models decline steadily and flatten, indicating that these models converge.
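As a hedged illustration of the segmentation-stage settings listed above, the sketch below performs one SGD training step with pixel-wise cross-entropy; `segformer`, `images`, and `masks` are placeholders rather than code from the paper.
```python
import torch
import torch.nn.functional as F

def make_optimizer(segformer: torch.nn.Module) -> torch.optim.Optimizer:
    # SGD with the hyperparameters reported above
    return torch.optim.SGD(segformer.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

def train_step(segformer, optimizer, images, masks) -> float:
    logits = segformer(images)                        # coarse (1/4 resolution) class logits
    logits = F.interpolate(logits, size=masks.shape[-2:], mode="bilinear", align_corners=False)
    loss = F.cross_entropy(logits, masks)             # pixel-wise cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```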
These results prove that the segmentation performance of almost all of the models based on the YOLOv8m+BiFormer module is better than that of the original models on the street tree segmentation dataset. The specific comparison results are shown in Table 4. Compared with the original models without the YOLOv8m+BiFormer module, the mIoUs of the models based on YOLOv8m+BiFormer, i.e., YOLO-FCN, YOLO-LR-ASPP, YOLO-PSPNet, YOLO-DeepLabv3+, YOLO-HRNet, and YOLO-SegNet, increased by 1.3, 2.5, 0.9, 0.8, 1.2, and 1.0 percentage points, respectively. Additionally, compared with those of the FCN, LR-ASPP, PSPNet, UNet, DeepLabv3+, and HRNet models, the mIoU of YOLO-SegNet increased by 10.5, 9.7, 5.0, 6.8, 4.5, and 2.7 percentage points, respectively. Moreover, the mPA and f_score values also improved. This also indicates that the performance of our YOLO-SegNet model is significantly better than that of the other models.
Compared with those of the original SegFormer model, the mIoU, mPA, and f_score indices of our YOLO-SegNet model are 1.0, 0.5, and 0.2 percentage points greater, respectively. The important mIoU index increased by 1.0 percentage point, which also indicates that the introduced YOLOv8m+BiFormer module is beneficial to the SegFormer model for the segmentation of street trees. In addition, we also conducted experiments on the street tree test set. Table 5 shows the performance of the different segmentation models on the test set. The segmentation models based on the YOLOv8m+BiFormer module are significantly improved compared with the original models.

4. Discussion

4.1. YOLOv8m+BiFormer Module

This part further analyzes the street tree object detection module and the image segmentation module. In the object detection task, YOLOv8m, which has a moderate model size and good detection performance, was selected as the main network [27,49,50]. As shown in Table 2, the parameters, layers, and preprocessing, inference, and postprocessing times of the YOLOv8m model are superior to those of YOLOv8l and YOLOv8x, and the obtained object detection evaluation indices are comparable. Compared with those of the smaller YOLOv8n and YOLOv8s models, the preprocessing and postprocessing times of YOLOv8m decreased rather than increased, and the mAPval50 and mAPval50-95 improved. On this basis, we introduce the BiFormer attention mechanism to further improve the object detection accuracy [26,33,34]; this helps the network capture long-distance context dependencies.
To further confirm the effectiveness of the YOLOv8m+BiFormer model [51], the following object detection experiments were conducted: YOLOv8n+BiFormer, YOLOv8s+BiFormer, YOLOv8l+BiFormer, and YOLOv8x+BiFormer. The experimental results show that the YOLOv8m+BiFormer model is superior to the other BiFormer-based YOLOv8 models of different scales, as detailed in Table 6 and Table 7. The YOLOv8m+BiFormer model's size is moderate, its preprocessing and postprocessing times are the lowest, and the differences in the important mAPval50 and mAPval50-95 index values are small compared with those of YOLOv8l+BiFormer and YOLOv8x+BiFormer. Moreover, compared with that of the original YOLOv8m model, the mAPval50-95 of our YOLOv8m+BiFormer model increased by 0.7 percentage points.

4.2. YOLO-SegNet Network

Regarding the segmentation task, Section 3.3 analyzes the performance of the different segmentation models overall. However, to some extent, these average indices cannot directly reflect the segmentation results for different street tree species. Therefore, in this paper, we analyze the segmentation results of different methods on the validation set and the test set to evaluate the segmentation performance of the YOLO-SegNet model more comprehensively.
Figure 10 depicts the segmentation performance of the different models in detail with radar diagrams, which intuitively show the segmentation differences of the models for the same street tree species. Figure 10(A1) shows the distribution of the IoU index values of the different segmentation models on the validation set. The IoU values of the YOLO-SegNet model almost always exceed those of the other segmentation models. Compared with the other models, the YOLO-SegNet model significantly improved the IoU values of the Acer palmatum, Elaeocarpus decipiens, Koelreuteria paniculata, Liriodendron chinense, Platanus, and Sapindus saponaria street tree species. Figure 10(A1,A2) shows that the segmentation performance of the classical PSPNet, UNet, and DeepLabv3+ models listed in this paper is poor for street tree species such as Acer palmatum, Elaeocarpus decipiens, and Liriodendron chinense, whereas the IoU and PA values of our method are significantly improved. In addition, compared with those of the original model, the mIoU values of the YOLO-HRNet model improved, and the IoU values of tree species such as Acer palmatum, flowering cherry, and Zelkova serrata significantly increased. In conclusion, the overall mIoU and mPA segmentation index values of our YOLO-SegNet model on the validation set, together with the individual IoU, PA, and accuracy values for different tree species, demonstrate the effectiveness of the model, especially for tree species that are not easy to segment, such as Liriodendron chinense, whose canopy is relatively sparse and whose texture is more complex.
Figure 11 shows the results of the different segmentation models on the test set, where (A) and (C) are evergreen street trees with dense canopies, (B) shows deciduous trees with relatively sparse canopies, and (D) shows deciduous trees with multiple branches. The original classical segmentation models have a good segmentation effect for street trees with dense canopies but a relatively poor effect for trees with sparse canopies, and their segmentation of trunk parts is also poor. The classical models based on YOLOv8m+BiFormer improve the segmentation indices for the street tree trunk, as shown in Figure 11(A5,A11,C2,C8,D2,D8). This is because the object detection module conducts a preliminary detection of the trunk and eliminates non-target regions, especially the white region of the trunk, which is not considered in this paper. Moreover, the segmentation models based on the object detection module can improve the segmentation of street trees with sparse canopies to a certain extent, as shown in Figure 11(B4,B10).

5. Conclusions

In this work, an object detection model based on YOLOv8m+BiFormer was introduced to remove the image background. On this basis, a SegFormer-based tree segmentation method was proposed to obtain tree edge information more efficiently. By combining the YOLOv8m+BiFormer and SegFormer networks, the proposed YOLO-SegNet model effectively eliminates complex backgrounds, which is very beneficial for urban street tree segmentation. The experimental results show that our method obtains an mIoU of 92.0% on the validation set and an mIoU of 88.4% on the test set, displaying higher accuracy than the other models. Therefore, our method accurately segments single trees, providing support for the calculation of the tree height and crown length and for tree species identification. In subsequent studies, we will expand the tree dataset to verify the robustness of our method.

Author Contributions

Methodology, T.Y.; software, J.Y. (Junhua Ye); validation, S.Z. and J.Y. (Jianxin Yin); investigation, T.Y.; data curation, A.X.; writing—original draft preparation, T.Y.; writing—review and editing, A.X. and S.Z.; visualization, J.Y. (Jianxin Yin); supervision, A.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [32371867].

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We thank Ya Li and Yixin Li for their assistance during the preparation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Miao, C.; Li, P.; Huang, Y.; Sun, Y.; Chen, W.; Yu, S. Coupling outdoor air quality with thermal comfort in the presence of street trees: A pilot investigation in Shenyang, Northeast China. J. For. Res. 2022, 34, 831–839. [Google Scholar] [CrossRef]
  2. Jareemit, D.; Srivanit, M. A comparative study of cooling performance and thermal comfort under street market shades and tree canopies in tropical savanna climate. Sustainability 2022, 14, 4653. [Google Scholar] [CrossRef]
  3. Havu, M.; Kulmala, L.; Kolari, P.; Vesala, T.; Riikonen, A.; Jarvi, L. Carbon sequestration potential of street tree plantings in Helsinki. Biogeosciences 2022, 19, 2121–2143. [Google Scholar] [CrossRef]
  4. Kim, J.Y.; Jo, H.K. Estimating carbon budget from growth and management of urban street trees in South Korea. Sustainability 2022, 14, 4439. [Google Scholar] [CrossRef]
  5. Ma, B.; Hauer, R.J.; Ostberg, J.; Koeser, A.K.; Wei, H.; Xu, C. A global basis of urban tree inventories: What comes first the inventory or the program. Urban For. Urban Green. 2021, 60, 127087. [Google Scholar] [CrossRef]
  6. Zhu, Y.; Li, D.; Fan, J.; Zhang, H.; Eichhorn, M.P.; Wang, X.; Yun, T. A reinterpretation of the gap fraction of tree crowns from the perspectives of computer graphics and porous media theory. Front. Plant Sci. 2023, 14, 1109443. [Google Scholar] [CrossRef]
  7. Galle, N.J.; Halpern, D.; Nitoslawski, S.; Duarte, F.; Ratti, C.; Pilla, F. Mapping the diversity of street tree inventories across eight cities internationally using open data. Urban For. Urban Green. 2021, 61, 127009. [Google Scholar] [CrossRef]
  8. Wu, X.M.; Xu, A.J.; Yang, T.T. Passive Measurement Method of Tree Height and Crown Diameter Using a Smartphone. IEEE Access 2020, 8, 11669–11678. [Google Scholar] [CrossRef]
  9. Yang, T.T.; Zhou, S.Y.; Xu, A.J.; Yin, J.X. A Method for Tree Image Segmentation Combined Adaptive Mean Shifting with Image Abstraction. J. Inf. Process Syst. 2020, 16, 1424–1436. [Google Scholar] [CrossRef]
  10. Li, Q.J.; Yan, Y.; Li, W.Z. Coarse-to-fine segmentation of individual street trees from side-view point clouds. Urban For. Urban Green. 2023, 89, 128097. [Google Scholar] [CrossRef]
  11. Hakula, A.; Ruoppa, L.; Lehtomaki, M.; Yu, X.; Kukko, A.; Kaartinen, H.; Hyyppä, J. Individual tree segmentation and species classification using high-density close-range multispectral laser scanning data. ISPRS Open J. Photogramm. Remote Sens. 2023, 9, 100039. [Google Scholar] [CrossRef]
  12. Xu, X.; Iuricich, F.; Calders, K.; Armston, J.; Florian, L.D. Topology-based individual tree segmentation for automated processing of terrestrial laser scanning point clouds. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103145. [Google Scholar] [CrossRef]
  13. Yang, T.T.; Zhou, S.Y.; Huang, Z.J.; Xu, A.J.; Ye, J.H.; Yin, J.X. Urban Street Tree Dataset for Image Classification and Instance Segmentation. Comput. Electron. Agric. 2023, 209, 107852. [Google Scholar] [CrossRef]
  14. Borrenpohl, D.; Karkee, M. Automated pruning decisions in dormant sweet cherry canopies using instance segmentation. Comput. Electron. Agric. 2023, 207, 107716. [Google Scholar] [CrossRef]
  15. Sun, X.; Xu, S.; Hua, W.; Tian, J.; Xu, Y. Feasibility study on the estimation of the living vegetation volume of individual street trees using terrestrial laser scanning. Urban For. Urban Green. 2022, 71, 127553. [Google Scholar] [CrossRef]
  16. Jiang, K.; Chen, L.; Wang, X.; An, F.; Zhang, H.; Yun, T. Simulation on different patterns of mobile laser scanning with extended application on solar beam illumination for forest plot. Forests 2022, 13, 2139. [Google Scholar] [CrossRef]
  17. Wang, Y.J.; Chen, Q.; Zhu, Q.; Liu, L.; Li, C.; Zheng, D. A survey of mobile laser scanning applications and key techniques over urban areas. Remote Sens. 2019, 11, 1540. [Google Scholar] [CrossRef]
  18. Wu, B.; Yu, B.; Yue, W.; Shu, S.; Tan, W.; Hu, C.; Huang, Y.; Wu, J.; Liu, H. A voxelbased method for automated identification and morphological parameters estimation of individual street trees from mobile laser scanning data. Remote Sens. 2013, 5, 584–611. [Google Scholar] [CrossRef]
  19. Majeed, Y.; Zhang, J.; Zhang, X.; Fu, L.; Karkee, M.; Zhang, Q.; Whiting, M.D. Deep learning based segmentation for automated training of apple trees on trellis wires. Comput. Electron. Agric. 2020, 170, 105277. [Google Scholar] [CrossRef]
  20. Wan, H.; Zeng, X.; Fan, Z.; Zhang, S.; Kang, M. U2ESPNet-A lightweight and high-accuracy convolutional neural network for real-time semantic segmentation of visible branches. Comput. Electron. Agric. 2023, 204, 107542. [Google Scholar] [CrossRef]
  21. Zhang, J.; He, L.; Karkee, M.; Zhang, Q.; Zhang, X.; Gao, Z. Branch Detection with Apple Trees Trained in Fruiting Wall Architecture using Stereo Vision and Regions-Convolutional Neural Network (R-CNN). In Proceedings of the 2017 ASABE Annual International Meeting, Spokane, WA, USA, 16–19 July 2017; American Society of Agricultural and Biological Engineers: Saint Joseph, MI, USA, 2017; p. 1. [Google Scholar]
  22. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  23. Kok, E.; Wang, X.; Chen, C. Obscured tree branches segmentation and 3D reconstruction using deep learning and geometrical constraints. Comput. Electron. Agric. 2023, 210, 107884. [Google Scholar] [CrossRef]
  24. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef] [PubMed]
  25. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  26. Zhu, L.; Wang, X.J.; Ke, Z.H.; Zhang, W.; Lau, R.W.H. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 24 June 2023. [Google Scholar] [CrossRef]
  27. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO (Version 8.0.0) [Computer Software]. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 May 2023).
  28. Xie, E.; Wang, W.H.; Yu, Z.D.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar] [CrossRef]
  29. Everingham, M.; Ali Eslami, S.M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  30. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  32. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar] [CrossRef]
  33. Feng, X.; Ren, A.; Qi, H. Improved Highway Vehicle Detection Algorithm for YOLOv8n. In Proceedings of the 2023 9th International Conference on Mechanical and Electronics Engineering (ICMEE), Xi’an, China, 17–19 November 2023. [Google Scholar] [CrossRef]
  34. Yang, F.; Wang, T.; Wang, X. Student Classroom Behavior Detection based on YOLOv7-BRA and Multi-Model Fusion. ArXiv 2023, arXiv:2305.07825. [Google Scholar] [CrossRef]
  35. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  36. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. NAACL (North American Chapter of the Association for Computational Linguistics). 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 10 May 2023).
  37. Reddy, D.M.; Basha, S.M.; Hari, M.C.; Penchalaiah, N. Dall-e: Creating images from text. UGC Care Group I J. 2021, 8, 71–75. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 5998–6008. [Google Scholar]
  39. Dong, X.Y.; Bao, J.M.; Chen, D.D.; Zhang, W.M.; Yu, N.H.; Yuan, L.; Chen, D.; Guo, B.N. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  40. Liu, Z.; Lin, Y.T.; Cao, Y.; Hu, H.; Wei, Y.X.; Zhang, Z.; Lin, S.; Guo, B.N. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  41. Tu, Z.Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y.X. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  42. Wang, W.X.; Yao, L.; Chen, L.; Lin, B.B.; Cai, D.; He, X.F.; Liu, W. Crossformer: A versatile vision transformer hinging on cross-scale attention. In Proceedings of the International Conference on Learning Representations (ICLR), online, 25–29 April 2022. [Google Scholar]
  43. Xia, Z.F.; Pan, X.R.; Song, S.J.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  44. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038. [Google Scholar] [CrossRef]
  45. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
  46. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2016. [Google Scholar] [CrossRef]
  47. Huang, G.; Liu, Z.; Laurens, V.D.M.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  48. Sun, K.; Xiao, B.; Liu, D.; Wang, J.D. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar] [CrossRef]
  49. Rizvi, S.M.H.; Naseer, A.; Rehman, S.U.; Akram, S.; Gruhn, V. Revolutionizing Agriculture: Machine and Deep Learning Solutions for Enhanced Crop Quality and Weed Control. IEEE Access 2024, 12, 11865–11878. [Google Scholar] [CrossRef]
  50. Bonfanti-Gris, M.; Herrera, A.; Paraíso-Medina, S.; Alonso-Calvo, R.; Martínez-Rus, F.; Pradíes, G. Performance evaluation of three versions of a convolutional neural network for object detection and segmentation using a multiclass and reduced panoramic radiograph dataset. J. Dent. 2024, 144, 104891. [Google Scholar] [CrossRef]
  51. Sun, S.; Mo, B.; Xu, J.; Li, D.; Zhao, J.; Han, S. Multi-YOLOv8: An infrared moving small object detection model based on YOLOv8 for air vehicle. Neurocomputing 2024, 588, 127685. [Google Scholar] [CrossRef]
Figure 1. (A) is the number distribution of street tree images; (B) is the street tree image annotation.
Figure 2. Examples of street tree object detection and instance segmentation annotated images for different tree species.
Figure 3. YOLO-SegNet model. The CBS is the basic module, including the Conv2d layer, BatchNorm2d layer, and Sigmoid Linear Unit (SiLU) layer. The function of the CBS module is to introduce a cross-stage partial connection to improve the feature expression ability and information transfer efficiency. The role of the Spatial Pyramid Pooling Fast (SPPF) module is to fuse larger-scale global information to improve the performance of object detection. The bottleneck block can reduce the computational complexity and the number of parameters.
Figure 4. (A) The overall architecture of BiFormer; (B) details of a BiFormer block.
Figure 5. (a) Vanilla attention. (b–d) Local window [40,42], axial stripe [39], and dilated window [41,42]. (e) Deformable attention [43]. (f) Bilevel routing attention (BRA) [26].
Figure 6. Gathering key–value pairs in the top k related windows.
Figure 7. (A,B) are the loss function curves of the object detection network on the training and validation sets, respectively; (C,D) are the loss function curves of tree classification on the training and validation sets, respectively; (E–H) are the change curves of the four object detection indicator values on the validation set.
Figure 8. (A) Thermal map examples of YOLOv8 series models and YOLOv8m+BiFormer in the training process; (B) example results of the different object detection models on the test set.
Figure 9. (A) The training loss function curves of the segmentation models without the object detection module. (B) The training loss function curves of the segmentation models with the object detection module.
Figure 10. Performance of different segmentation models on the validation and test sets: (A1,A2) the segmentation results on the validation set; (B1,B2) the segmentation results on the test set.
Figure 11. Results of the different segmentation models on the test set.
Table 1. Street tree dataset.
| Street Tree Dataset | Ratio | Number | Dataset Format (Object Detection) | Dataset Format (Segmentation) |
|---|---|---|---|---|
| Train set | 8 | 3168 | YOLO [27] | VOC2012 [29] |
| Validation set | 1 | 395 | YOLO [27] | VOC2012 [29] |
| Test set | 1 | 386 | YOLO [27] | VOC2012 [29] |
Table 2. Different scale models of the YOLOv8 series.
| Model | Size (Pixels) | Params (M) | FLOPs (B) | Layers | Preprocess (ms) | Inference (ms) | Postprocess (ms) |
|---|---|---|---|---|---|---|---|
| YOLOv8n | 640 | 3.0 | 8.2 | 225 | 1.6 | 0.6 | 1.6 |
| YOLOv8s | 640 | 11.1 | 28.7 | 225 | 0.5 | 1.0 | 0.8 |
| YOLOv8m | 640 | 25.8 | 79.1 | 295 | 1.3 | 2.1 | 1.1 |
| YOLOv8l | 640 | 43.6 | 165.5 | 365 | 0.5 | 3.7 | 1.3 |
| YOLOv8x | 640 | 68.2 | 257.5 | 365 | 0.6 | 5.5 | 0.3 |
| YOLOv8m+BiFormer | 640 | 36.3 | 178.5 | 306 | 0.2 | 2.3 | 0.5 |
Table 3. Object detection performance of the YOLOv8 series on the validation set.
| Model | Recall | Precision | mAPval50 | mAPval50-95 |
|---|---|---|---|---|
| YOLOv8n | 96.9 | 97.1 | 99.1 | 91.8 |
| YOLOv8s | 97.1 | 98.1 | 99.3 | 93.2 |
| YOLOv8m | 98.2 | 97.6 | 99.4 | 93.9 |
| YOLOv8l | 99.4 | 97.5 | 99.5 | 94.3 |
| YOLOv8x | 97.5 | 98.4 | 99.5 | 94.1 |
| YOLOv8m+BiFormer | 98.9 | 97.5 | 99.4 | 94.6 |
Table 4. Segmentation results of different models on the validation set.
| Model | mIoU | mPA | f_Score |
|---|---|---|---|
| Not based on the YOLOv8m+BiFormer module: | | | |
| FCN [44] | 81.5 | - | - |
| LR-ASPP [45] | 82.3 | - | - |
| PSPNet [46] | 87.0 | 92.7 | 95.8 |
| UNet [25] | 85.2 | 91.7 | 85.3 |
| DeepLabv3+ [47] | 87.5 | 92.5 | 97.0 |
| HRNet [48] | 89.3 | 94.7 | 97.0 |
| SegFormer [28] | 91.0 | 95.4 | 97.8 |
| Based on the YOLOv8m+BiFormer module: | | | |
| YOLO-FCN | 82.8 | - | - |
| YOLO-LR-ASPP | 84.8 | - | - |
| YOLO-PSPNet | 87.9 | 93.3 | 97.0 |
| YOLO-UNet | 84.7 | 91.4 | 88.6 |
| YOLO-DeepLabv3+ | 88.3 | 93.4 | 97.6 |
| YOLO-HRNet | 90.5 | 95.3 | 97.4 |
| Ours: YOLO-SegNet | 92.0 | 95.9 | 98.0 |
Table 5. Segmentation results of different models on the test set.
| Model | mIoU | mPA |
|---|---|---|
| Not based on the YOLOv8m+BiFormer module: | | |
| FCN | 71.3 | - |
| LR-ASPP | 76.6 | - |
| PSPNet | 80.6 | 88.6 |
| UNet | 81.2 | 89.6 |
| DeepLabv3+ | 83.2 | 90.7 |
| HRNet | 88.1 | 93.5 |
| SegFormer | 87.3 | 93.1 |
| Based on the YOLOv8m+BiFormer module: | | |
| YOLO-FCN | 79.3 | - |
| YOLO-LR-ASPP | 83.4 | - |
| YOLO-PSPNet | 81.1 | 88.9 |
| YOLO-UNet | 79.7 | 88.5 |
| YOLO-DeepLabv3+ | 86.0 | 92.6 |
| YOLO-HRNet | 87.7 | 94.0 |
| Ours: YOLO-SegNet | 88.4 | 93.7 |
Table 6. Comparison of the YOLOv8 series models based on the BiFormer attention mechanism.
| Model | Size (Pixels) | Params (M) | FLOPs (B) | Layers | Preprocess (ms) | Inference (ms) | Postprocess (ms) |
|---|---|---|---|---|---|---|---|
| YOLOv8m [27] | 640 | 25.8 | 79.1 | 295 | 0.1 | 1.6 | 0.9 |
| YOLOv8n+BiFormer | 640 | 3.3 | 18.6 | 236 | 1.8 | 1.2 | 2.3 |
| YOLOv8s+BiFormer | 640 | 12.2 | 70.0 | 236 | 0.6 | 1.2 | 0.7 |
| YOLOv8l+BiFormer | 640 | 80.9 | 357.4 | 376 | 0.4 | 4.5 | 1.1 |
| YOLOv8x+BiFormer | 640 | 126.4 | 557.9 | 376 | 0.2 | 6.3 | 0.7 |
| Ours: YOLOv8m+BiFormer | 640 | 36.3 | 178.5 | 306 | 0.2 | 2.4 | 0.5 |
Table 7. Comparison of the detection performance on the validation set of YOLOv8 series models based on the BiFormer attention mechanism.
| Model | Recall | Precision | mAPval50 | mAPval50-95 |
|---|---|---|---|---|
| YOLOv8m [27] | 98.2 | 97.6 | 99.4 | 93.9 |
| YOLOv8n+BiFormer | 98.1 | 97.1 | 99.4 | 92.5 |
| YOLOv8s+BiFormer | 97.9 | 96.9 | 99.3 | 93.4 |
| YOLOv8l+BiFormer | 98.6 | 98.9 | 99.5 | 94.9 |
| YOLOv8x+BiFormer | 98.4 | 98.9 | 99.5 | 94.9 |
| Ours: YOLOv8m+BiFormer | 98.9 | 97.5 | 99.4 | 94.6 |