Article

TEB-YOLO: A Lightweight YOLOv5-Based Model for Bamboo Strip Defect Detection

by Xipeng Yang 1, Chengzhi Ruan 2,*, Fei Yu 3, Ruxiao Yang 2, Bo Guo 2, Jun Yang 2, Feng Gao 4 and Lei He 3

1 College of Mechanical and Electrical Engineering, Fujian Agriculture and Forestry University, Fuzhou 350002, China
2 The Key Laboratory for Agricultural Machinery Intelligent Control and Manufacturing of Fujian Education Institutions, Wuyi University, Wuyishan 354300, China
3 College of Mechanical and Electrical Engineering, Wuyi University, Wuyishan 354300, China
4 School of Electronic Science and Engineering, Nanjing University, Nanjing 210008, China
* Author to whom correspondence should be addressed.
Forests 2025, 16(8), 1219; https://doi.org/10.3390/f16081219
Submission received: 8 June 2025 / Revised: 15 July 2025 / Accepted: 22 July 2025 / Published: 24 July 2025

Abstract

The accurate detection of surface defects in bamboo is critical to maintaining product quality. Traditional inspection methods rely heavily on manual labor, making the manufacturing process labor-intensive and error-prone. To overcome these limitations, this paper introduces TEB-YOLO, a lightweight and efficient defect detection model based on YOLOv5s. Firstly, EfficientViT replaces the original YOLOv5s backbone, reducing the computational cost while improving feature extraction. Secondly, BiFPN is adopted in place of PANet to enhance multi-scale feature fusion and preserve detailed information. Thirdly, an Efficient Local Attention (ELA) mechanism is embedded in the backbone to strengthen local feature representation. Lastly, the original CIoU loss is replaced with EIoU loss to enhance localization precision. The proposed model achieves a precision of 91.7% with only 5.4 million parameters and 10.5 GFLOPs, marking a 5.4% precision improvement and a 22.9% reduction in parameters compared to YOLOv5s. Compared with other mainstream models, including YOLOv5n, YOLOv7, YOLOv8n, YOLOv9t, and YOLOv9s, TEB-YOLO achieves precision improvements of 11.8%, 1.66%, 2.0%, 2.8%, and 1.1%, respectively. The experimental results show that TEB-YOLO substantially improves detection precision while remaining lightweight, offering a practical and effective solution for real-time bamboo surface defect detection.

1. Introduction

As a sustainable resource, bamboo products are widely used in construction materials, household goods, and industrial applications, gaining significant global popularity [1]. Bamboo strips, serving as essential raw materials in the manufacturing of bamboo products, are susceptible to defects resulting from inherent growth characteristics and processing-induced factors [2]. To ensure the quality of bamboo products, defects must be accurately identified and appropriately addressed during the production process. However, traditional methods for defect detection remain heavily reliant on manual inspection, which is not only labor-intensive and time-consuming but also increasingly expensive as production scales [3,4]. Moreover, human assessment is subject to subjective bias and visual fatigue, which compromises both the efficiency and accuracy of defect identification. In addition, different types of defects often require distinct processing techniques, making precise and timely identification essential [5,6,7]. Furthermore, some defects may appear visually similar or occur in overlapping regions, which can compromise the accuracy of manual classification. Additionally, misclassification during this process may lead to inappropriate treatment methods, potentially resulting in damage to the raw material.
With rapid advances in object detection technologies, machine vision and computer vision have garnered substantial research interest in recent years [8,9]. Although numerous defect detection models have been applied in various industries, relatively few studies have focused on defect detection in bamboo strips. In parallel, rising public environmental awareness continues to drive the growing demand for bamboo-based products [10,11]. Consequently, there is an urgent need to develop a lightweight and high-precision surface defect detection algorithm tailored to bamboo strips. Such a system would facilitate the shift in the bamboo industry from a conventional labor-intensive approach to a modern, automated, and intelligent manufacturing model.
Model development generally follows two dominant paradigms, namely machine learning and deep learning. Traditional machine vision methods mainly rely on manually designed features, focusing on the extraction and analysis of low-level image characteristics such as color, texture, and morphology [12]. Early researchers carried out substantial work on bamboo defect detection using machine learning. Qin et al. developed an online detection and classification system for bamboo strips based on color features, achieving preliminary defect screening [13]. Zeng et al. further integrated color, texture, and morphological features to construct feature vectors and introduced a backpropagation (BP) neural network to identify defects with blurred edges or subtle color differences [14]. Li and Ye utilized the LabVIEW platform in combination with Otsu threshold segmentation and particle analysis to implement the industrial-grade detection of carbonized bamboo poles [15]. Kuang et al. combined Local Binary Pattern (LBP) and Gray-Level Co-occurrence Matrix (GLCM) features and introduced a Support Vector Machine (SVM) classifier, significantly improving the detection accuracy and efficiency [16].
Although traditional machine learning approaches offer advantages such as low implementation cost and flexible deployment, they exhibit limited robustness and generalization ability when confronted with complex backgrounds, minor defects, or unstructured images. Due to their strong capabilities in feature representation and semantic understanding, deep learning methods have been increasingly adopted to overcome these limitations. In the field of deep learning, researchers have improved models for various application domains by introducing attention mechanisms and modifying network architectures. For example, Yang et al. proposed an improved YOLOv8n model for detecting subtle bamboo strip defects such as cracks, mildew, wormholes, and burrs. The model incorporates Dynamic Convolution, Ghost-enhanced C3k2, DySample for adaptive upsampling, and Efficient Multiscale Attention (EMA) [17]. Hu et al. incorporated residual structures and the Convolutional Block Attention Module (CBAM) into the U-Net architecture, enhancing its feature extraction capabilities [18]. Qin et al. proposed a lightweight optimization of U-Net, enabling the online deployment and rapid identification of multiple defect types [19]. Hu et al. also integrated DropBlock and transfer learning strategies into the Transformer framework to improve adaptability in new environments [20]. Guo et al. optimized YOLOv4-CSP by introducing asymmetric convolutions and the CBAM module, which not only improved the accuracy but also validated the model’s generalization capability across different materials [21]. Yang et al. proposed an improved YOLOv5s model incorporating the Ghost module, C2f structure, and CA mechanism, achieving lightweight design and high accuracy in detecting five common bamboo strip defects. Their approach significantly enhances the detection speed and reduces model complexity while maintaining robust recognition performance [22]. Haq et al. utilized a deep learning-based supervised image classification approach with UAV-collected images for the classification of forest areas, demonstrating the effectiveness of stacked auto-encoders in achieving high accuracy in identifying various features within forested regions [23].
In addition, experience in detecting defects similar to bamboo strip defects may provide a valuable reference for this study. For instance, Liu et al. proposed the CARF framework, which uses multi-level feature alignment to improve the detection of small defects [24]. Ji et al. enhanced the detection robustness under complex backgrounds by combining YOLO with SIFT feature fusion [25]. Lin et al. designed the WCU-Net, incorporating positional attention and multi-scale residual structures to highlight crack regions and suppress noise interference [26]. Zhang et al. improved the detection speed and accuracy by integrating Principal Component Analysis (PCA) with compressed sensing techniques [27]. Xu et al. developed a multi-line detection system based on Bi-LSTM, combining semantic modeling with point cloud correction to achieve efficient real-time recognition [28]. These studies, each from a different perspective, offer diverse strategies and technical foundations for improving the performance of bamboo defect detection systems. Haq et al. proposed an automated weed detection system based on a CNN-LVQ architecture, which achieved 99.44% accuracy using UAV imagery of soybean fields, demonstrating its effectiveness in classifying soil, grass, and broadleaf weeds with high precision [29].
The aforementioned studies have demonstrated the feasibility of constructing an efficient and high-precision bamboo strip defect detection system based on convolutional neural networks (CNNs). Although numerous defect detection models have been proposed, several challenges remain in detecting surface defects on bamboo strips. Namely, the aforementioned models are relatively large, making them difficult to deploy on embedded or other edge devices. Moreover, certain types of bamboo strip defects exhibit similar visual characteristics or appear in the same regions, which increases the likelihood of misclassification by detection algorithms. To address these issues, the primary objective of this study is to develop a lightweight model that achieves both high efficiency and high accuracy for bamboo strip surface defect detection. Through a series of comparative and ablation experiments, this paper proposes an optimized detection model based on the YOLOv5 architecture, named TEB-YOLO. The contributions of this paper are as follows:
(1)
A lightweight and task-specific feature extraction framework is proposed by integrating EfficientViT with an Efficient Local Attention (ELA) mechanism. Unlike previous works that apply ViT backbones in general detection tasks, our design is tailored for bamboo strip defects, where fine-grained local features are critical. This combination enhances the representation power while maintaining model compactness.
(2)
A BiFPN-based neck with embedded ELA was introduced to improve multi-scale feature fusion. Compared to conventional PANet or even standard BiFPN usage, our configuration is optimized for defect types with subtle or overlapping appearances, leading to improved robustness under complex backgrounds.
(3)
A modified loss function using Efficient IoU (EIoU) is employed to enhance localization precision. Unlike CIoU, which struggles with aspect ratio variation, EIoU directly penalizes width and height misalignment, resulting in faster convergence and better regression accuracy, especially on irregular-shaped bamboo defects.
Furthermore, a customized bamboo strip surface defect dataset was also constructed. This dataset consists of 10,000 images featuring five defect classes that were captured by an industrial camera and subsequently annotated.

2. Materials and Methods

2.1. Data Acquisition and Preparation

The bamboo strip defect images were captured at Maiwei Intelligent Equipment Co., Ltd., located in Shangping Township, Yong’an, Sanming, Fujian Province, China. The image acquisition system consisted of a linear light source and a CCD color industrial camera (MV-GE133GC) equipped with an 8 mm lens. The dataset includes five types of defects: carbonized spots, dark knotted scars, green-black knots, concave yellowing, and flat yellowing, as illustrated in Figure 1. The industrial camera was connected to a microcomputer via a network interface, and both defective and non-defective images were stored in JPEG format with an image size of 640 × 640 pixels.
Subsequently, three annotators labeled the images through a dual-review and cross-validation approach to ensure the accuracy of the defect information. After labeling, the annotations were converted into text files compatible with the YOLOv5 format. In total, 10,000 images were collected, including the five aforementioned defect types. The quantity distribution of each defect category is illustrated in Figure 1.
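To illustrate the label format mentioned above, the short sketch below converts a corner-format bounding box into a normalized YOLOv5 annotation line. The class index, pixel coordinates, and helper name are hypothetical and only show the expected layout of one label line.

```python
def corner_to_yolo(xmin, ymin, xmax, ymax, img_w=640, img_h=640):
    """Convert a (xmin, ymin, xmax, ymax) box in pixels into the normalized
    (x_center, y_center, width, height) layout used by YOLOv5 label files."""
    xc = (xmin + xmax) / 2.0 / img_w
    yc = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return xc, yc, w, h

# One line per object: "<class_id> <x_center> <y_center> <width> <height>"
class_id = 2  # hypothetical index for one of the five defect classes
print(class_id, *(f"{v:.6f}" for v in corner_to_yolo(112, 240, 198, 310)))
```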

2.2. YOLOv5 Model Enhancements

2.2.1. Proposed Methods

To address the trade-off between detection accuracy and real-time inference on resource-constrained embedded devices, this study proposes TEB-YOLO, a lightweight detection model based on YOLOv5s. YOLOv5s was selected as the baseline due to its proven stability and reliability in industrial inspection tasks. As a lightweight variant, it strikes an effective trade-off between detection precision and computational efficiency, which makes it suitable for use in environments with limited resources. Although more recent object detection architectures have emerged, they often come with increased model complexity and higher hardware requirements, which can hinder their applicability in edge computing scenarios. In contrast, YOLOv5s offers a mature and well-optimized framework with extensive community support, facilitating efficient customization and deployment. As illustrated in Figure 2, TEB-YOLO replaces the original CSPDarknet backbone with EfficientViT for lightweight feature extraction. To improve the extraction of fine-grained local features, the ELA mechanism is incorporated into both the backbone and neck of the network.
Additionally, the original PANet in the neck is replaced with BiFPN to improve multi-scale feature fusion, and the CIoU loss function is substituted with EIoU to enhance localization precision. These architectural modifications collectively enable TEB-YOLO to deliver high detection performance with significantly reduced computational complexity and parameter count, making it suitable for real-world bamboo defect detection applications.

2.2.2. Optimization of the Backbone Network

CSPDarknet is the original backbone network of YOLOv5, primarily used for image feature extraction. In object detection tasks, the choice of backbone network is critical, as a high-quality feature extractor can significantly enhance the overall model performance. To achieve a lightweight architecture, this study proposes replacing CSPDarknet with EfficientViT (Efficient Vision Transformer) [30]. The overall architecture of EfficientViT is illustrated in Figure 3a.
Although traditional Transformers have demonstrated excellent performance in vision tasks, they suffer from high computational overhead and often exhibit limited inference speed due to inefficient memory access. To address these issues, EfficientViT introduces a novel “sandwich” block structure that not only strengthens inter-channel information interaction but also improves memory access efficiency. Moreover, to mitigate the redundancy present in multi-head self-attention mechanisms, EfficientViT further proposes a cascaded group attention (CGA) module to enhance the semantic modeling capabilities.
The EfficientViT architecture consists of three stages, each comprising multiple sandwich structures. Each sandwich block is composed of 2N depthwise separable convolutions (DWConv, responsible for spatial local communication) [31], a Feed-Forward Network (FFN), and a CGA module. The sandwich structure is shown in Figure 3b, and the CGA module is depicted in Figure 3c. In CGA, the input is first divided into multiple subspaces, and the Q, K, and V matrices are computed separately. The attention heads are organized in a cascaded manner, with each head receiving the output from the previous one to progressively capture more informative semantic representations. The outputs of all heads are then concatenated and passed through a linear layer to produce the final output.
This sandwich layout enhances feature representation through the FFN and employs only a single self-attention layer for spatial mixing, thereby reducing dependence on memory access efficiency. The computation is formulated as follows:
X_{i+1} = \prod^{N} \Phi_i^{F}\left( \Phi_i^{A}\left( \prod^{N} \Phi_i^{F}(X_i) \right) \right)    (1)
In the equation, X_i denotes the input feature of the i-th EfficientViT block, Φ_i^F represents the Feed-Forward Network (FFN) layer, and Φ_i^A denotes the self-attention layer.
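As a minimal PyTorch sketch of Equation (1), the block below stacks N depthwise-convolution/FFN layers, a single spatial-mixing (attention) layer, and N further FFN layers. The expansion ratio, residual placement, and the placeholder attention module are illustrative assumptions, not the exact EfficientViT implementation.

```python
import torch.nn as nn

class DWConvFFN(nn.Module):
    """One Phi^F layer: residual depthwise 3x3 convolution for local spatial
    communication followed by a residual pointwise feed-forward network."""
    def __init__(self, dim, expand=2):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.ffn = nn.Sequential(nn.Conv2d(dim, dim * expand, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(dim * expand, dim, 1))

    def forward(self, x):
        x = x + self.dw(x)
        return x + self.ffn(x)

class SandwichBlock(nn.Module):
    """Equation (1): N FFN layers, one self-attention layer (Phi^A), then N more
    FFN layers. The attention module is passed in (Identity as a placeholder)."""
    def __init__(self, dim, n=2, attention=None):
        super().__init__()
        self.pre = nn.Sequential(*[DWConvFFN(dim) for _ in range(n)])
        self.attn = attention if attention is not None else nn.Identity()
        self.post = nn.Sequential(*[DWConvFFN(dim) for _ in range(n)])

    def forward(self, x):           # x: (B, C, H, W)
        return self.post(self.attn(self.pre(x)))
```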
The attention computation process of the CGA module is as follows:
\tilde{X}_{ij} = \mathrm{Attn}\left( X_{ij} W_{ij}^{Q},\; X_{ij} W_{ij}^{K},\; X_{ij} W_{ij}^{V} \right)    (2)
\tilde{X}_{i+1} = \mathrm{Concat}\left[ \tilde{X}_{ij} \right]_{j=1:h} W_i^{P}    (3)
X'_{ij} = X_{ij} + \tilde{X}_{i(j-1)}, \quad 1 < j \le h    (4)
In the equations, X̃_ij denotes the output of the j-th attention head, X_ij is the j-th input feature subspace, W_ij^Q, W_ij^K, and W_ij^V are the projection matrices for the query, key, and value, respectively, W_i^P is the linear layer that projects the concatenated features back to the input dimension, and X'_ij is the updated input feature for the j-th attention head.
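The following sketch is a rough PyTorch rendering of Equations (2)–(4) under assumed head counts and key dimensions: the channels are split into per-head subspaces, each head attends over its own slice plus the previous head's output, and the concatenated result is projected back to the input dimension. It is illustrative rather than the reference EfficientViT code.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Cascaded group attention sketch (Eqs. (2)-(4)): per-head channel subspaces,
    cascaded inputs, concatenation, and a final projection W^P."""
    def __init__(self, dim, heads=4, key_dim=16):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.key_dim = heads, key_dim
        self.split_dim = dim // heads
        self.qkv = nn.ModuleList([
            nn.Linear(self.split_dim, 2 * key_dim + self.split_dim)
            for _ in range(heads)])
        self.proj = nn.Linear(dim, dim)     # W^P in Eq. (3)
        self.scale = key_dim ** -0.5

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, HW, C)
        chunks = tokens.chunk(self.heads, dim=-1)
        outs = []
        for j in range(self.heads):
            feat = chunks[j] if j == 0 else chunks[j] + outs[-1]   # Eq. (4)
            q, k, v = self.qkv[j](feat).split(
                [self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale          # Eq. (2)
            outs.append(attn.softmax(dim=-1) @ v)
        out = self.proj(torch.cat(outs, dim=-1))                   # Eq. (3)
        return out.transpose(1, 2).reshape(b, c, h, w)
```

A SandwichBlock from the previous sketch could take this module as its attention argument, since both operate on (B, C, H, W) tensors.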
The EfficientViT series includes six baseline models, each configured with different network widths (C), depths (L), and numbers of attention heads (H). The detailed structural parameters are shown in Table 1. In line with the goal of model lightweighting, this study selects EfficientViT-M0 as the backbone network to replace CSPDarknet.

2.2.3. Attention Mechanism

The small size, diversity, and co-occurrence of bamboo strip surface defects make it challenging to preserve critical defect information during the feature extraction stages of deep CNNs, where such details are often diminished or overlooked. This significantly increases the difficulty of feature learning and poses a major challenge to defect detection tasks. To address these issues, this study introduces an Efficient Local Attention (ELA) mechanism [32], designed to balance the computational efficiency and feature representation capabilities. By enhancing the convolutional neural network’s ability to model local details, ELA helps improve the overall detection performance.
While global attention mechanisms excel at capturing global dependencies, they typically demand substantial computational and memory resources. ELA, by comparison, concentrates on local contextual relationships, offering a more efficient solution for deep neural networks. It calculates attention weights within a localized scope, effectively modeling spatial neighborhood dependencies while significantly reducing computational complexity. This makes it particularly suitable for deployment in deep, large-scale convolutional neural networks. Specifically, ELA builds upon the Coordinate Attention (CA) mechanism [33] by introducing Stripe Pooling operations [34] to capture features in both horizontal and vertical orientations. The detailed architecture of the ELA mechanism is illustrated in Figure 4.
Let the output of the previous layer be denoted as x ∈ R^{H×W×C}. After performing pooling operations along the horizontal (H, 1) and vertical (1, W) directions, the outputs at height h and width w for the c-th channel are obtained, respectively. The specific calculations are shown in Equations (5) and (6).
Z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} X_c(h, i)    (5)
Z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} X_c(j, w)    (6)
In the equations, Z_c^h(h) represents the output at position h along the height for the c-th channel, obtained by averaging over the width, while Z_c^w(w) corresponds to the output at position w along the width, obtained by averaging over the height.
As shown in Equations (7) and (8), after obtaining the feature maps Zh and Zw along the horizontal and vertical directions, one-dimensional convolutions Fh and Fw with a kernel size of 7 are applied to enhance spatial awareness. Group Normalization (GN) is then used to further normalize the feature distribution, and finally, a sigmoid activation function is employed to generate the directional attention weights.
y^{h} = \sigma\left( G_n\left( F_h(Z^{h}) \right) \right)    (7)
y^{w} = \sigma\left( G_n\left( F_w(Z^{w}) \right) \right)    (8)
In the equation, F denotes the one-dimensional convolution, Gn represents Group Normalization with 16 groups, and σ is the sigmoid activation function.
Finally, ELA fuses the horizontal and vertical attention weights with the original input features through channel-wise multiplication to generate the weighted output features, as shown in Equation (9).
F_{ELA} = x_c \times y^{h} \times y^{w}    (9)
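A compact PyTorch sketch of Equations (5)–(9) is given below. The depthwise 1-D convolutions, kernel size of 7, and 16-group normalization follow the description above, while the remaining layer hyperparameters of the original ELA implementation are assumed.

```python
import torch
import torch.nn as nn

class EfficientLocalAttention(nn.Module):
    """ELA sketch: strip pooling along each axis (Eqs. (5)-(6)), 1-D convolutions
    with Group Normalization and a sigmoid to form directional weights
    (Eqs. (7)-(8)), and channel-wise rescaling of the input (Eq. (9))."""
    def __init__(self, channels, kernel_size=7, groups=16):
        super().__init__()
        pad = kernel_size // 2
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad,
                                groups=channels, bias=False)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size, padding=pad,
                                groups=channels, bias=False)
        self.gn = nn.GroupNorm(groups, channels)  # channels must divide evenly by groups
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        z_h = x.mean(dim=3)               # average over W -> (B, C, H)
        z_w = x.mean(dim=2)               # average over H -> (B, C, W)
        y_h = self.sigmoid(self.gn(self.conv_h(z_h))).view(b, c, h, 1)
        y_w = self.sigmoid(self.gn(self.conv_w(z_w))).view(b, c, 1, w)
        return x * y_h * y_w              # Eq. (9)
```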

2.2.4. Optimization of the Neck Network

In object detection networks, the neck component plays a crucial role in enhancing the detection accuracy by constructing top-down and bottom-up multi-scale feature fusion pathways. This design effectively integrates semantic and spatial information from different feature levels. YOLOv5s employs PANet (Path Aggregation Network) [35] as its neck structure, as illustrated in Figure 5a. The core idea of PANet lies in the observation that high-level features encode abundant semantic context, whereas low-level features preserve detailed spatial information. Introducing a bottom-up information path into the network helps build a multi-scale feature representation that combines both semantic expression and spatial localization, thus improving the detection performance.
In this study, BiFPN (Bidirectional Feature Pyramid Network), an improved version of PANet, was adopted, as shown in Figure 5b. Specifically, BiFPN removes the single-input nodes present in PANet since they lack involvement in feature aggregation and contribute little to the overall performance of the neck. Their removal helps simplify the network structure. Moreover, BiFPN adds shortcut connections between input and output nodes located at the same level in PANet, enhancing the feature fusion capability without significantly increasing the model parameters.
Additionally, BiFPN introduces an extra top-down pathway and reuses the feature layers formed by bidirectional paths within the network, enabling deeper multi-scale feature fusion. Finally, BiFPN improves upon the conventional feature map addition or concatenation strategies by proposing a weighted feature fusion method. This strategy adaptively assigns weights to input feature maps based on their importance, thus improving the effectiveness of the fusion. The calculation method is shown in Equation (10). In this study, we address the detection of bamboo strip surface defects, which typically present as subtle textural variations within overlapping spatial regions. Conventional feature concatenation assigns equal weight to all inputs, thereby diluting the contribution of texture-sensitive layers. In contrast, the weighted fusion scheme of BiFPN enables the network to learn the relative importance of multi-level features. By adaptively amplifying the most informative scales, BiFPN improves both the localization and classification performance. Consequently, its weighted fusion strategy delivers a more discriminative and robust integration of texture-centric features than simple concatenation.
O = \sum_{i} \frac{w_i \cdot I_i}{\epsilon + \sum_{j} w_j}    (10)
In the formula, O represents the fused output feature map, I_i denotes the i-th input feature map from different levels, and w_i is the corresponding learnable weight used to measure the importance of each input feature. ε is a small constant added to avoid division by zero; in this experiment, ε = 0.0001.
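To make Equation (10) concrete, the sketch below implements weighted fusion over a list of equally shaped feature maps. The ReLU-clamped learnable weights and the two-input usage example are assumptions for illustration and are not taken from the TEB-YOLO source.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Weighted feature fusion (Eq. (10)): each input map gets a learnable,
    non-negative weight normalized by the sum of all weights plus epsilon."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):                 # list of same-shape feature maps
        w = torch.relu(self.weights)           # keep weights non-negative
        w = w / (w.sum() + self.eps)           # w_i / (eps + sum_j w_j)
        return sum(wi * xi for wi, xi in zip(w, inputs))

# Example: fuse two pyramid levels already resized to a common resolution.
fuse = WeightedFusion(num_inputs=2)
fused = fuse([torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)])
```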

2.2.5. Loss Function Optimization

In object detection tasks, the Intersection over Union (IoU) is an important metric used to evaluate the similarity between the predicted bounding box and the ground truth bounding box. Its calculation formula is shown in Equation (11).
IoU = \frac{\left| B \cap B^{gt} \right|}{\left| B \cup B^{gt} \right|}    (11)
where B represents the predicted bounding box, and Bgt represents the ground truth bounding box. Ideally, the two should completely overlap, resulting in an IoU value of 1. However, in practical applications, as shown in Figure 6a, the predicted box often does not perfectly match the ground truth box. When using IoU directly as the loss function for bounding box regression optimization, if there is no overlap between the predicted box and the ground truth box, the IoU value becomes 0, which prevents effective gradient backpropagation. Therefore, the following form of loss function is commonly used to optimize the difference between the predicted box and the ground truth box.
Loss = 1 - IoU    (12)
However, using IoU alone as the loss function still presents certain limitations. Even when the IoU values are identical, the predicted box and the ground truth box may differ significantly in terms of relative position, distance, and scale. Therefore, IoU cannot fully express the geometric relationship between the two boxes. To address this shortcoming, researchers have proposed improved IoU-based loss functions such as GIoU and DIoU to further enhance model accuracy [36]. It is evident that the effectiveness of object detection heavily depends on the design of the loss function, and selecting an appropriate regression loss function is vital for improving the detection performance.
The YOLOv5s architecture employs the CIoU (Complete IoU) metric to guide the regression of bounding boxes. CIoU builds upon DIoU by further incorporating an aspect ratio consistency term, considering not only the center point distance and overlap between the predicted box and the ground truth box, but also the consistency of the bounding box shape. This makes the predicted box fit the ground truth box more closely. Its calculation formulas are shown in Equations (13) and (14).
L_{box} = L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v    (13)
v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}    (14)
In the equations, b and b^gt denote the center points of the predicted and ground truth boxes, respectively; ρ represents the Euclidean distance between these centers; and c refers to the diagonal length of the minimal enclosing box that contains both. w and h denote the width and height of the predicted box, while w^gt and h^gt represent those of the ground truth box. The term v quantifies the disparity between aspect ratios, and α serves as a weighting factor to balance its impact.
Although CIoU introduces an aspect ratio constraint, it measures shape differences by calculating the relative change in aspect ratios, which can be somewhat ambiguous. To further enhance the geometric constraint capability of the bounding box, this study introduces the EIoU (Efficient IoU) loss function [37] based on CIoU. This method directly models the width and height differences between the predicted box and the ground truth box separately, improving the loss function’s sensitivity to scale differences in bounding boxes. The calculation formula of EIoU is shown in Equation (15).
L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{(w^{c})^{2} + (h^{c})^{2}} + \frac{\rho^{2}(w, w^{gt})}{(w^{c})^{2}} + \frac{\rho^{2}(h, h^{gt})}{(h^{c})^{2}}    (15)
where w^c and h^c represent the width and height of the smallest enclosing rectangle that contains both the predicted box and the ground truth box, as shown in Figure 6b. EIoU more accurately describes the spatial location and size differences between bounding boxes, thereby helping to enhance the precision of bounding box localization.
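For reference, a plain PyTorch sketch of Equation (15) for corner-format boxes follows. The helper name and the epsilon guard are assumptions, and a production implementation would also handle degenerate boxes and batching conventions.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU sketch (Eq. (15)) for (N, 4) boxes in (x1, y1, x2, y2) format:
    1 - IoU plus separate penalties for center distance, width gap, and height gap,
    each normalized by the smallest enclosing box."""
    # Intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Smallest enclosing box (w^c, h^c)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Squared distance between box centers
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    dist = rho2 / (cw ** 2 + ch ** 2 + eps)
    asp = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)
    return 1 - iou + dist + asp
```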

3. Experiments and Analysis

3.1. Experimental Environment

This experiment was conducted on a platform running Windows 11, equipped with an Intel Core i9-14900HX processor, an NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM), and 32 GB of SK Hynix DDR5 RAM. PyTorch 2.4.0 was used as the deep learning framework, with GPU acceleration enabled via CUDA 12.4. Visual Studio Code 2024 served as the programming environment, and the programming language used was Python 3.8. The specific training parameter configurations are shown in Table 2. In future work, the deployment performance of TEB-YOLO is intended to be evaluated on embedded platforms such as the NVIDIA Jetson series to further assess its suitability for real-world edge applications.

3.2. Evaluation Metrics

To evaluate the performance of the proposed method in the task of bamboo strip defect detection, this study adopts precision (P), recall (R), mean Average Precision (mAP), frames per second (FPS), and Giga Floating-Point Operations (GFLOPs) as the evaluation metrics.
Precision refers to the proportion of correctly identified positive samples among all samples predicted as positive, and it reflects the accuracy of object detection. Recall indicates the proportion of correctly identified positive samples among all actual positive samples, representing the model’s ability to detect all object instances in the image. The calculation formulas are shown in Equations (16)–(20) as follows.
P = \frac{TP}{TP + FP}    (16)
R = \frac{TP}{TP + FN}    (17)
mAP = \frac{\sum_{i=1}^{k} AP_i}{k}    (18)
\mathrm{GFLOPs} = O\left( \sum_{i=1}^{n} K_i^{2} \times G_{i-1}^{2} \times G_i + \sum_{i=1}^{n} M^{2} \times G_i \right)    (19)
FPS = \frac{1000}{\text{preprocess} + \text{inference} + \text{NMS}}    (20)
In the equations, TP (true positive) represents the number of instances that are accurately predicted as belonging to the positive class, TN (true negative) refers to those correctly identified as negative, FP (false positive) indicates samples incorrectly predicted as positive, and FN (false negative) refers to samples incorrectly identified as negative. K denotes the convolution kernel size, n is the number of iterations, and M denotes the image size. Preprocess refers to the image preprocessing time, inference is the model inference time, and NMS stands for the time taken for Non-Maximum Suppression.
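As a quick illustration of Equations (16)–(18) and (20), the sketch below computes precision, recall, mAP, and FPS from counted detections and per-image latencies. The numeric values are hypothetical and serve only to show the arithmetic.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (16)-(17): precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def mean_ap(ap_per_class):
    """Eq. (18): mAP is the mean of per-class average precision."""
    return sum(ap_per_class) / len(ap_per_class)

def fps(preprocess_ms, inference_ms, nms_ms):
    """Eq. (20): frames per second from per-image latency in milliseconds."""
    return 1000.0 / (preprocess_ms + inference_ms + nms_ms)

# Example: 90 correct detections, 8 false positives, 12 missed defects.
p, r = precision_recall(tp=90, fp=8, fn=12)
print(f"precision={p:.3f}, recall={r:.3f}, "
      f"mAP={mean_ap([0.91, 0.88, 0.93, 0.90, 0.92]):.3f}, "
      f"fps={fps(1.2, 12.5, 1.1):.1f}")
```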

3.3. Comparison Experiments of Different Module Combination

To evaluate the impact of different backbone networks, attention mechanisms, and neck structures on the performance of the YOLOv5s model, this study uses YOLOv5s as the baseline and systematically replaces its backbone, integrates various attention mechanisms, and experiments with different neck designs. The experimental results are presented in Table 3, Table 4, and Table 5, respectively.
For the backbone comparison, the original CSPDarknet in YOLOv5s was replaced with ShuffleNetv2 [38], MobileNetv3 [39], and EfficientViT for benchmarking. According to Table 3, the CSPDarknet-based model achieves the highest precision (88.9%) and mAP@50 (92.6%). The model using EfficientViT followed closely with a precision of 86.9% and mAP@50 of 91.5%. Although slightly lower in accuracy, EfficientViT markedly decreased the number of parameters and the computational complexity by approximately 23% and 38%, respectively, demonstrating a clear advantage in model lightweighting. ShuffleNetv2 performed the worst across all metrics, failing to meet the accuracy requirements for defect detection.
Next, using the YOLOv5s + EfficientViT architecture as the new baseline, various attention mechanisms were introduced into both the backbone and neck components, including Coordinate Attention (CA), CBAM [40], SE [41], and the lightweight and efficient ELA module. As summarized in Table 4, while most attention mechanisms improved certain performance metrics, they also introduced considerable increases in model parameters and computational complexity. Remarkably, the ELA mechanism yielded the highest precision (89.3%) among all tested configurations, while only marginally increasing the parameter count from 5.326 M to 5.342 M. Although its mAP@50 slightly declined to 90.9%, ELA demonstrated an effective compromise between computational efficiency and detection precision. The ELA mechanism enhances precision by leveraging localized attention. However, the slight decline in mAP@50 can be attributed to its limited capacity for global semantic modeling. This design aligns with the industrial requirement of prioritizing precision in bamboo strip defect detection, and its advantages in parameter efficiency and computational cost make it more suitable for edge deployment. This is crucial for enabling the model to accurately identify defect types and reduce the additional costs caused by false positives and false negatives.
Finally, based on the YOLOv5s + EfficientViT + ELA configuration, different neck modules were further compared to assess their effectiveness in feature fusion, including PANet, GFPN [42], HSFPN [43], SlimNeck [44], and BiFPN. As shown in Table 5, BiFPN achieves the highest precision (91.7%) while preserving a modest parameter size (5.423 M) and computation cost (10.5 GFLOPs), showcasing excellent detection capability and deployment efficiency. Although SlimNeck had the fewest parameters, its computational cost was higher, and its detection performance was the weakest, with a precision of only 83.6%.

3.4. Ablation Studies

To validate the effectiveness of the proposed improvements in bamboo strip defect detection, a series of ablation experiments based on the YOLOv5s framework were conducted. Each experiment was independently run three times, and the average results were reported to ensure reliability and robustness. The experimental results are summarized in Table 6.
Firstly, the original CIoU loss function was replaced with EIoU. This led to increases of 1.9% in precision, 1.1% in recall, and 9.9 frames per second (FPS). EIoU improves the localization accuracy and convergence speed by individually penalizing the width and height discrepancies between the predicted and ground truth boxes, replacing the original aspect ratio constraint. Since only the loss function was changed, the mAP, GFLOPs, and parameter count remained nearly unchanged.
Next, the backbone network was changed from CSPDarknet to EfficientViT. This modification reduced the parameters by 24.3% and GFLOPs by 38.1%. The improvements are due to EfficientViT’s cascaded grouped attention mechanism, which lowers computational overhead. However, it also caused a 51.6 FPS drop and slight decreases in precision and mAP@50. Despite the observed decline in inference speed, the improved model still achieves 67.3 FPS, which surpasses the commonly accepted threshold for real-time monitoring applications (i.e., >60 FPS). As a result, the proposed architecture proves effective for use in industrial settings that demand accurate and real-time defect identification. These drawbacks are attributed to the multi-head self-attention (MHSA) mechanism. Transformer-based structures like EfficientViT are generally slower and more computationally intensive during inference.
Subsequently, the ELA mechanism was introduced. This change significantly improved the precision to 89.3%. At the same time, the parameter count and GFLOPs decreased by 24.06% and 37.5%, respectively. These results suggest that ELA enhances feature representation while maintaining computational efficiency. However, the mAP@50 decreased by 1.8%, and the FPS dropped by 56.1. This reflects a performance trade-off.
Finally, the original PANet neck was replaced with BiFPN. The EIoU loss function, EfficientViT backbone, and ELA mechanism were retained. This configuration achieved the best overall performance. Precision reached 91.7%, the highest in all configurations. FPS improved to 67.3. The parameters and GFLOPs were reduced by 22.9% and 34.3%, respectively, compared to the original YOLOv5s. These gains result from BiFPN’s bidirectional feature fusion, which improves multi-scale feature interactions.
Overall, the ablation results confirm that the proposed model significantly improves the detection accuracy on the bamboo defect dataset while effectively reducing model complexity and resource consumption, demonstrating strong potential and applicability for real-world deployment.

3.5. Improved Model Results

After modifying the backbone, neck module, and loss function of YOLOv5s, and integrating an attention mechanism, the training results are shown in Figure 7. According to Figure 7, the loss curve of TEB-YOLO becomes smooth after replacing the loss function, indicating no significant overfitting during training. Compared with the original YOLOv5s, the optimized model increased in depth from 214 layers to 602 layers, a growth of approximately 181%. However, the model size and computational complexity decreased from 7.035 M parameters and 16.0 GFLOPs to 5.423 M parameters and 10.5 GFLOPs, respectively, reductions of about 22.9% and 34.3%. Meanwhile, the model’s precision improved from 87% to 91.7%, a gain of around 5.4%. Although the FPS decreased, it still reached 67.3 f/s, which is suitable for real-time defect detection on bamboo strips. These improvements enhanced the model’s representational capacity while maintaining its lightweight design, in line with the original design goals.
In addition, Figure 8 shows the detection results of TEB-YOLO, where the defect locations, types, and confidence scores are clearly annotated. Experimental outcomes confirm that TEB-YOLO performs well in recognizing five distinct types of bamboo strip defects contained in the dataset.
To further verify the model’s focus on target areas, feature maps from the first detection head were visually examined. As illustrated in Figure 9a, the feature map presents the first detection head of the improved TEB-YOLO, while Figure 9b shows the corresponding feature map from YOLOv5s. The comparison reveals that the feature map from TEB-YOLO exhibits more prominent color responses in defect regions, indicating stronger feature representation capabilities in defect identification. This enables the more accurate localization and discrimination of defect areas. To provide quantitative support for the visual comparison in Figure 9, the mean activation intensity and the foreground-to-background contrast ratio on the feature maps generated by the first detection head were computed. Table 7 indicates that compared to YOLOv5s, TEB-YOLO achieved a higher mean activation (34.5001 vs. 33.1852) and a stronger contrast ratio (1.47 vs. 1.31), indicating better focus on defect regions and enhanced noise suppression. These quantitative metrics further validate the improved feature representation capability of TEB-YOLO.
Moreover, the feature maps generated by TEB-YOLO display clearer boundaries between target and background regions, with significantly fewer abnormally highlighted areas. This reflects the model’s enhanced ability to suppress noise and irrelevant information. These findings further demonstrate that TEB-YOLO possesses higher robustness and reliability in reducing both missed detections and false positives.

3.6. Comparison of Different Models

To objectively evaluate the performance of the proposed improved model TEB-YOLO for bamboo strip defect detection, a systematic comparison was conducted against several mainstream models, including YOLOv5n, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9t, and YOLOv9s. All models were tested under the same conditions as specified in Table 2. The evaluation metrics include precision, recall, mAP@50, GFLOPs, Params, GPU inference speed (FPS-GPU), and CPU inference speed (FPS-CPU), with the results summarized in Table 8.
As shown in Table 8, TEB-YOLO delivers the highest detection accuracy among all models, achieving a precision of 91.7%, a recall of 85.6%, and an mAP@50 of 90.8%, while maintaining a lightweight design with only 10.5 GFLOPs and 5.423M parameters. In terms of inference speed, the model runs at 67.3 frames per second (FPS) on GPU and 12.4 FPS on CPU, demonstrating strong practical deployment potential. Compared to the baseline YOLOv5s, TEB-YOLO improves the precision by approximately 5.4%, reduces the Params and GFLOPs by 22.9% and 34.3%, respectively, with only a slight decrease of 2.1% in mAP@50.
In horizontal comparisons with other mainstream models, TEB-YOLO ranks second in precision, just behind YOLOv8s. It also maintains competitive computational efficiency—its GFLOPs are only slightly higher than those of YOLOv5n and YOLOv8n, and its parameter count exceeds only YOLOv5n, YOLOv8n, and YOLOv9t. Although its CPU inference speed is marginally lower than that of YOLOv5n, YOLOv5s, and YOLOv7, TEB-YOLO still outperforms most models overall in inference performance. These results indicate that TEB-YOLO strikes an effective balance between compact model design, high detection accuracy, and satisfactory inference efficiency, making it particularly suitable for deployment on resource-constrained edge devices.
Furthermore, to intuitively demonstrate TEB-YOLO’s stability and superior detection precision, Figure 10 shows the precision curves of each model over 300 training epochs, while Figure 11 presents the visual detection results of the bamboo strip defect dataset. As shown in Figure 10, TEB-YOLO maintains consistently high detection accuracy in the later stages of training, with overall performance second only to YOLOv8s, reflecting good convergence and stability. Figure 11 further illustrates the performance of each model on specific sample images. In the second image, YOLOv8n, YOLOv8s, and YOLOv9s exhibit missed detections; in the fifth image, YOLOv5n, YOLOv8n, and YOLOv8s fail to detect some defects, while YOLOv7, YOLOv9s, and YOLOv9t produce false positives. In contrast, TEB-YOLO demonstrates higher accuracy in identifying defect targets, providing strong evidence of its superior balance between detection precision and model compactness.

4. Conclusions and Future Work

In conclusion, this study presents TEB-YOLO, an improved YOLOv5-based framework optimized for real-time bamboo strip defect detection. Compared with the original model on a custom dataset, TEB-YOLO achieves a lighter model and higher detection accuracy by replacing the backbone and neck networks, modifying the loss function, and embedding attention mechanisms in both modules. TEB-YOLO attains a precision of 91.7% with 10.5 GFLOPs and 5.423 M parameters, representing an improvement of approximately 5.4% in precision and reductions of 22.9% and 34.3% in parameters and GFLOPs, respectively, relative to the original model.
Compared with mainstream detection models including YOLOv5n, YOLOv5s, YOLOv7, YOLOv8n, YOLOv9t, and YOLOv9s, TEB-YOLO improves precision by 11.8%, 5.4%, 1.66%, 2%, 2.8%, and 1.1%, respectively. Although its precision is 0.3% lower than YOLOv8s, TEB-YOLO achieves FPS of 67.3 on GPU and 12.4 on CPU, meeting real-time monitoring requirements. The comparison indicates that TEB-YOLO strikes an ideal trade-off between precision and inference speed, making it hardware-friendly and significantly enhancing the defect detection effectiveness in bamboo strips. This contributes to the automation and intelligence of the bamboo product industry.
Despite the balanced performance, TEB-YOLO has limitations. Compared with lighter models such as YOLOv5n and YOLOv8n, there remains room to further reduce the model size and increase the inference speed for improved real-time applicability. Additionally, TEB-YOLO has only been validated on a bamboo strip defect dataset; its generalizability to related tasks such as wood defect detection is yet to be determined. Moreover, TEB-YOLO underperforms YOLOv8s in precision. Furthermore, in the current implementation, TEB-YOLO emphasizes high precision (91.7%) to minimize false positives, which is particularly important in industrial quality assurance, where unnecessary rejection of acceptable products leads to material waste and reduced production efficiency. While the recall (85.6%) is slightly lower, it remains acceptable for many industrial scenarios, although it may not suffice for applications requiring high reliability or relying on a single-stage final inspection.
Future work will focus on improving the detection accuracy while reducing the computational complexity and deployment on embedded GPUs. In addition, the dataset will be expanded to cover a broader range of defect categories. Efforts will also aim to extend the model’s applicability to a broader range of defect detection tasks. Furthermore, the integration of multi-scale context modules (e.g., ASPP or RFB), advanced attention fusion strategies, and optimized training techniques such as label smoothing and adversarial augmentation will be explored. Moreover, recall may be improved through incorporating hard example mining, multi-scale context modules, and lightweight ensemble strategies to achieve a better balance between precision and recall according to specific QA requirements. Additionally, model compression techniques such as channel pruning, quantization (e.g., INT8), and inference acceleration using frameworks like TensorRT or ONNX Runtime are necessary for more resource-constrained edge scenarios. Finally, the model will be refined to strengthen model robustness in difficult scenarios, such as under extreme illumination or limited visual clarity.

Author Contributions

Conceptualization, C.R. and R.Y.; methodology, X.Y. and F.G.; software, X.Y. and F.G.; validation, X.Y. and F.Y.; formal analysis, X.Y.; investigation, J.Y. and F.Y.; resources, C.R.; data curation, R.Y.; writing—original draft preparation, X.Y.; writing—review and editing, C.R. and B.G.; visualization, X.Y. and L.H.; supervision, C.R.; project administration, C.R. and B.G.; funding acquisition, C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 61903288), Fujian Provincial Natural Science Foundation (No. 2023J011043), College Student Innovation Training Program of Fujian (No. S202410397049), Nanping Science and Technology Commissioner Resource Industry Science and Technology Innovation Joint Funding Project (No. N2023Z003), Fujian Province Undergraduate Education and Teaching Reform Research Project (No. FBJY20230282), Wuyi University Introduced Talents Research Startup Project (No. YJ202320), and Wuyi University Teaching and Research Project (No. JY2024002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yun, J. A Comparative Study of Bamboo Culture and Its Applications in China, Japan and South Korea. In Proceedings of the SOCIOINT14—International Conference on Social Sciences and Humanities, Istanbul, Turkey, 8–10 September 2014; pp. 8–10. [Google Scholar]
  2. Lucy, B.; Vahid, N.; Chunping, D. Bamboo Industrialization in the Era of Industry 5.0: An Exploration of Key Concepts, Synergies and Gaps. Environ. Dev. Sustain. 2024, 1, 1–3. [Google Scholar] [CrossRef]
  3. Ibrahim, Y.; Liam, B.; Fadi El, K.; Ramy, H. Leveraging Computer Vision Towards High-Efficiency Autonomous Industrial Facilities. J. Intell. Manuf. 2024, 35, 401–442. [Google Scholar]
  4. Zhao, R.; Yan, R.; Chen, Z.; Mao, K.; Wang, P.; Gao, R.X. Deep Learning and Its Applications to Machine Health Monitoring: A Survey. J. Manuf. Syst. 2022, 62, 738–752. [Google Scholar] [CrossRef]
  5. Amir, T.; Joonho, C.; Jangwoon, P.; Wonsup, L.; Myeongsup, C.; Jongchul, P.; Kihyo, J. Development of a Human-Friendly Visual Inspection Method for Painted Vehicle Bodies. Appl. Ergon. 2023, 106, 103911. [Google Scholar]
  6. Muhammad, B.R.; Sheheryar, M.Q.; Muhammad, S.H.; Tayyab, N.; Muhammad, A.; Hafsa, J. Evaluation of Human Factors on Visual Inspection Skills in Textiles and Clothing: A Statistical Approach. Cloth. Text. Res. J. 2022, 41, 43–58. [Google Scholar]
  7. Alireza, S.; Jing, R.; Moustafa, E.G. Defect Detection Methods for Industrial Products Using Deep Learning Techniques: A Review. Algorithms 2023, 16, 95. [Google Scholar] [CrossRef]
  8. Zhang, Q.; Wang, H.; Wang, X.; Shang, J.; Wang, X.; Li, J.; Wang, Y. DSCW-YOLO: Vehicle Detection from Low-Altitude UAV Perspective via Coordinate Awareness and Collaborative Module Optimization. Sensors 2025, 25, 3413. [Google Scholar] [CrossRef]
  9. Wang, X.; Hong, W.; Liu, Y.; Yan, G.; Hu, D.; Jing, Q. Improved YOLOv8 Network of Aircraft Target Recognition Based on Synthetic Aperture Radar Imaging Feature. Sensors 2025, 25, 3231. [Google Scholar] [CrossRef]
  10. Prasad, N.; Muthusamy, A. A Review on Sustainable Product Design, Marketing Strategies and Conscious Consumption of Bamboo Lifestyle Products. Intell. Inf. Manag. 2023, 15, 67–99. [Google Scholar] [CrossRef]
  11. Zhang, H.; Zhang, X.; Ding, Y.; Huang, F.; Cai, Z.; Lin, S. Charting the Research Status for Bamboo Resources and Bamboo as a Sustainable Plastic Alternative: A Bibliometric Review. Forests 2024, 15, 1812. [Google Scholar] [CrossRef]
  12. Bi, Y.; Xue, B.; Mesejo, P.; Cagnoni, S.; Zhang, M. A Survey on Evolutionary Computation for Computer Vision and Image Analysis: Past, Present, and Future Trends. IEEE Trans. Evol. Comput. 2022, 27, 5–25. [Google Scholar] [CrossRef]
  13. Qin, X.; Song, X.; Liu, Q.; He, F. Online Detection and Sorting System of Bamboo Strip Based on Visual Servo. In Proceedings of the 2009 IEEE International Conference on Industrial Technology, Gippsland, Australia, 10–13 February 2009. [Google Scholar]
  14. Zeng, Q.; Lu, Q.; Yu, X.; Li, S.; Chen, N.; Li, W.; Zhang, F.; Chen, N.; Zhao, W. Identification of Defects on Bamboo Strip Surfaces Based on Comprehensive Features. Eur. J. Wood Wood Prod. 2023, 81, 315–328. [Google Scholar] [CrossRef]
  15. Yeni, L.; Shaowei, Y. Defect Inspection System of Carbonized Bamboo Cane Based on LabView and Machine Vision. In Proceedings of the 2017 International Conference on Information, Communication and Engineering (ICICE), Xiamen, China, 17–20 November 2017. [Google Scholar]
  16. Kuang, H.; Ding, Y.; Li, R.; Liu, X. Defect Detection of Bamboo Strips Based on LBP and GLCM Features by Using SVM Classifier. In Proceedings of the 2018 Chinese Control and Decision Conference (CCDC), Shenyang, China, 9–11 June 2018. [Google Scholar]
  17. Yang, R.X.; Lee, Y.R.; Lee, F.S.; Liang, Z.; Chen, N.; Liu, Y. Improvement of YOLO Detection Strategy for Detailed Defects in Bamboo Strips. Forests 2025, 16, 595. [Google Scholar] [CrossRef]
  18. Hu, J.; Yu, X.; Zhao, Y.; Wang, K.; Lu, W. Research on Bamboo Defect Segmentation and Classification Based on Improved U-Net Network. Wood Res. 2022, 67, 109–122. [Google Scholar] [CrossRef]
  19. Qin, X.; He, F.; Liu, Q.; Song, X. Online Defect Inspection Algorithm of Bamboo Strip Based on Computer Vision. In Proceedings of the 2009 IEEE International Conference on Industrial Technology, Gippsland, Australia, 10–13 February 2009. [Google Scholar]
  20. Hu, J.F.; Yu, X.; Zhao, Y.F. Bamboo Defect Classification Based on Improved Transformer Network. Wood Res. 2022, 67, 501–510. [Google Scholar] [CrossRef]
  21. Guo, Y.; Zeng, Y.; Gao, F.; Qiu, Y.; Zhou, X.; Zhong, L.; Zhan, C. Improved YOLOv4-CSP Algorithm for Detection of Bamboo Surface Sliver Defects with Extreme Aspect Ratio. IEEE Access 2022, 10, 29810–29820. [Google Scholar] [CrossRef]
  22. Yang, R.X.; Lee, Y.R.; Lee, F.S.; Liang, Z.; Liu, Y. An Improved YOLOv5 Algorithm for Bamboo Strip Defect Detection Based on the Ghost Module. Forests 2024, 15, 1480. [Google Scholar] [CrossRef]
  23. Haq, M.A.; Rahaman, G.; Baral, P.; Ghosh, A. Deep Learning Based Supervised Image Classification Using UAV Images for Forest Areas Classification. J. Indian Soc. Remote Sens. 2021, 49, 601–606. [Google Scholar] [CrossRef]
  24. Liu, B.; Guo, T.; Luo, B.; Cui, Z.; Yang, J. Cross-Attention Regression Flow for Defect Detection. IEEE Trans. Image Process. 2024, 33, 5183–5193. [Google Scholar] [CrossRef]
  25. Ji, M.; Zhang, W.; Han, J.K.; Miao, H.; Diao, X.L.; Wang, G.F. A Deep Learning-Based Algorithm for Online Detection of Small Target Defects in Large-Size Sawn Timber. Ind. Crops Prod. 2024, 222, 119671. [Google Scholar] [CrossRef]
  26. Lin, Y.; Xu, Z.; Chen, D.; Ai, Z.; Qiu, Y.; Yuan, Y. Wood crack detection based on data-driven semantic segmentation network. IEEE/CAA J. Autom. Sin. 2023, 10, 1510–1512. [Google Scholar] [CrossRef]
  27. Zhang, Y.Z.; Xu, C.; Li, C.; Yu, H.L.; Cao, J. Wood Defect Detection Method with PCA Feature Fusion and Compressed Sensing. J. For. Res. 2015, 26, 745–751. [Google Scholar] [CrossRef]
  28. Xu, Z.; Lin, Y.; Chen, D.; Yuan, M.; Zhu, Y.; Ai, Z.; Yuan, Y. Wood Broken Defect Detection with Laser Profilometer Based on Bi-LSTM Network. Expert Syst. Appl. 2024, 242, 122789. [Google Scholar] [CrossRef]
  29. Haq, M.A. CNN Based Automated Weed Detection System Using UAV Imagery. Comput. Syst. Sci. Eng. 2022, 42, 837–849. [Google Scholar] [CrossRef]
  30. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  32. Wei, X.; Yi, W. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar]
  33. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  34. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  35. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  36. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  37. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IoU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  38. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet v2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  39. Andrew, H.; Mark, S.; Bo, C.; Weijun, W.; Liangchieh, C.; Mingxing, T.; Hartwig, A. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  40. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  41. Jie, H.; Li, S.; Gang, S. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  42. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-YOLO: A Report on Real-Time Object Detection Design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  43. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef]
  44. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by GSConv: A Lightweight Design for Real-Time Detector Architectures. J. Real-Time Image Process. 2024, 21, 6. [Google Scholar] [CrossRef]
Figure 1. Sample instances and quantity distribution of five bamboo strip defect categories.
Figure 2. The architecture of the proposed TEB-YOLO detection framework.
Figure 3. Overview of the EfficientViT backbone. (a) EfficientViT architecture; (b) sandwich layout block; (c) cascaded group attention module.
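For readers unfamiliar with the cascaded group attention illustrated in Figure 3c, the sketch below shows the cascading idea in PyTorch: each head attends only to its own channel split, and each head's output is added to the input of the next head before its attention is computed. This is a simplified illustration on token sequences, not the authors' implementation; the original EfficientViT module additionally applies depthwise convolutions to the queries and operates on 2-D feature maps.

import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    # Minimal sketch of cascaded group attention (illustration only).
    def __init__(self, dim, num_heads=4, key_dim=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.key_dim = num_heads, key_dim
        self.head_dim = dim // num_heads
        self.scale = key_dim ** -0.5
        # Each head projects only its own channel split to Q, K and V.
        self.qkvs = nn.ModuleList(
            [nn.Linear(self.head_dim, 2 * key_dim + self.head_dim) for _ in range(num_heads)])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, C) tokens
        splits = x.chunk(self.num_heads, dim=-1)
        outs = []
        for i, qkv in enumerate(self.qkvs):
            # Cascade: feed the previous head's output into the next head's input.
            feat = splits[i] if i == 0 else splits[i] + outs[-1]
            q, k, v = qkv(feat).split([self.key_dim, self.key_dim, self.head_dim], dim=-1)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            outs.append(attn @ v)
        return self.proj(torch.cat(outs, dim=-1))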
Figure 4. Structure of the Efficient Local Attention (ELA) mechanism.
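As a rough PyTorch sketch of the mechanism in Figure 4 (not the authors' implementation), ELA applies strip pooling along each spatial axis, passes the resulting 1-D signals through a 1-D convolution followed by group normalization and a sigmoid, and uses the two attention vectors to reweight the feature map. The kernel size and group count below are illustrative defaults and assume the channel count is divisible by the number of normalization groups.

import torch
import torch.nn as nn

class ELA(nn.Module):
    # Sketch of Efficient Local Attention: strip pooling + 1-D conv + GroupNorm + sigmoid gating.
    def __init__(self, channels, kernel_size=7, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.gn = nn.GroupNorm(groups, channels)           # assumes channels % groups == 0
        self.act = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        xh = x.mean(dim=3)                                 # strip pooling along W -> (B, C, H)
        xw = x.mean(dim=2)                                 # strip pooling along H -> (B, C, W)
        ah = self.act(self.gn(self.conv(xh))).view(b, c, h, 1)
        aw = self.act(self.gn(self.conv(xw))).view(b, c, 1, w)
        return x * ah * aw                                 # positional attention on both axes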
Figure 5. Comparison of PANet and BiFPN structures. (a) PANet introduces a supplementary bottom-up pathway to the FPN; (b) BiFPN introduces an additional top-down path and reuses features via bidirectional pathways.
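The core of BiFPN's feature reuse is a learnable, fast-normalized weighted fusion of the inputs arriving at each node. A minimal sketch of that fusion step is given below, assuming the input feature maps have already been resized to a common resolution; the node topology and the convolutions that follow each fusion are omitted.

import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # Fast normalized fusion used at each BiFPN node (illustration only).
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))      # one learnable weight per input
        self.eps = eps

    def forward(self, feats):                              # list of (B, C, H, W) tensors, same shape
        w = torch.relu(self.w)                             # keep weights non-negative
        w = w / (w.sum() + self.eps)                       # normalize so the weights sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))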
Figure 6. Illustration of the positions of the predicted box and the ground truth box. (a) IoU-based relationship between the predicted box and the ground truth box. (b) EIoU-based relationship between the predicted and ground truth boxes.
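For reference, the EIoU loss sketched below adds to the IoU term a normalized center-distance penalty plus separate width and height penalties, all normalized by the smallest enclosing box. This is a generic implementation of the published EIoU formulation, not the authors' training code.

import torch

def eiou_loss(pred, target, eps=1e-7):
    # EIoU loss for boxes in (x1, y1, x2, y2) format (illustration only).
    ix1, iy1 = torch.max(pred[..., 0], target[..., 0]), torch.max(pred[..., 1], target[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], target[..., 2]), torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (wp * hp + wt * ht - inter + eps)
    # Smallest enclosing box and its squared diagonal.
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared distance between box centers.
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # IoU term + center-distance term + width and height penalties.
    return 1 - iou + rho2 / c2 + (wp - wt) ** 2 / (cw ** 2 + eps) + (hp - ht) ** 2 / (ch ** 2 + eps)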
Figure 7. Training performance curves of TEB-YOLO on the bamboo defect dataset.
Figure 8. Detection results of TEB-YOLO for various bamboo strip defects.
Figure 9. Comparison of feature maps from YOLOv5s and TEB-YOLO. (a) TEB-YOLO; (b) YOLOv5s. The color intensity indicates the level of attention by the model, with more yellow areas representing higher attention to that region.
Figure 10. Precision curves of TEB-YOLO and benchmark models across training epochs.
Figure 11. Visual detection results of TEB-YOLO and other models on sample images.
Table 1. Architectural configurations of the EfficientViT model series.
Model              {C1, C2, C3}      {L1, L2, L3}   {H1, H2, H3}
EfficientViT-M0    {64, 128, 192}    {1, 2, 3}      {4, 4, 4}
EfficientViT-M1    {128, 144, 192}   {1, 2, 3}      {2, 3, 3}
EfficientViT-M2    {128, 192, 224}   {1, 2, 3}      {4, 3, 2}
EfficientViT-M3    {128, 240, 320}   {1, 2, 3}      {4, 3, 4}
EfficientViT-M4    {128, 256, 384}   {1, 2, 3}      {4, 4, 4}
EfficientViT-M5    {192, 288, 384}   {1, 3, 4}      {3, 3, 4}
Table 2. Training settings and hyperparameters for the experiments.
Parameter        Setting
Image size       640 × 640
Optimizer        SGD
Epochs           300
Batch size       8
Momentum         0.937
Weight decay     0.0005
Learning rate    0.01
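For reproducibility, the settings in Table 2 map directly onto the usual YOLOv5-style hyperparameter fields. The dictionary below is a hypothetical configuration mirroring the table; the field names follow common ultralytics/yolov5 conventions and are not taken from the paper.

# Hypothetical training configuration corresponding to Table 2 (field names assumed).
train_cfg = {
    "imgsz": 640,            # input resolution 640 x 640
    "optimizer": "SGD",
    "epochs": 300,
    "batch_size": 8,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "lr0": 0.01,             # initial learning rate
}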
Table 3. Performance comparison of different backbone networks.
Models                   Precision (%)   mAP@50 (%)   GFLOPs   Params (M)
YOLOv5 + CSPDarknet      88.9            92.6         16.0     7.035
YOLOv5 + ShuffleNetV2    75.6            80.0         2.0      0.871
YOLOv5 + MobileNetV3     83.5            91.9         7.7      4.917
YOLOv5 + EfficientViT    86.9            91.5         9.9      5.326
Table 4. Performance comparison of different attention mechanisms.
Models                            Precision (%)   mAP@50 (%)   GFLOPs   Params (M)
YOLOv5s + EfficientViT            86.9            91.5         9.9      5.326
YOLOv5s + EfficientViT + CA       84.0            90.7         14.0     7.251
YOLOv5s + EfficientViT + CBAM     85.0            90.9         7.8      4.296
YOLOv5s + EfficientViT + SE       86.8            91.1         10.1     5.474
YOLOv5s + EfficientViT + ELA      89.3            90.9         10.0     5.342
Table 5. Performance comparison of different neck modules.
Models                                     Precision (%)   mAP@50 (%)   GFLOPs   Params (M)
YOLOv5s + EfficientViT + ELA + PANet       89.3            90.9         10.0     5.342
YOLOv5s + EfficientViT + ELA + GFPN        85.3            91.8         12.7     6.946
YOLOv5s + EfficientViT + ELA + HSFPN       88.5            91.0         21.0     4.274
YOLOv5s + EfficientViT + ELA + SlimNeck    83.6            92.3         9.2      5.341
YOLOv5s + EfficientViT + ELA + BiFPN       91.7            90.8         10.5     5.423
Table 6. Ablation study results on individual module contributions in TEB-YOLO.
Model      EIoU   EfficientViT   ELA   BiFPN   Precision (%)   Recall (%)   mAP@50 (%)   GFLOPs   Params (M)   FPS (f/s)
YOLOv5s    ×      ×              ×     ×       87.0            88.6         92.8         16.0     7.035        123.5
           √      ×              ×     ×       88.9            89.7         92.6         16.0     7.035        133.4
           √      √              ×     ×       86.9            87.1         91.5         9.9      5.326        71.1
           √      √              √     ×       89.3            86.2         90.9         10.0     5.342        67.4
           √      √              √     √       91.7            85.6         90.8         10.5     5.423        67.3
Table 7. Feature map activation statistics of TEB-YOLO and YOLOv5s.
Model       Mean Activation   Mean Activation
TEB-YOLO    34.500            11.47
YOLOv5s     33.185            21.31
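The paper does not spell out how the activation statistics in Table 7 were computed; one plausible way to obtain a mean-activation figure from a backbone feature map is sketched below (hypothetical helper, illustration only).

import torch

def mean_activation(feature_map: torch.Tensor) -> float:
    # Mean absolute response over all channels and spatial positions of a
    # (B, C, H, W) feature map; a plausible but unverified reading of Table 7.
    return feature_map.abs().mean().item()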
Table 8. Comparison of TEB-YOLO with other YOLO-based detection models.
Model       Precision (%)   Recall (%)   mAP@50 (%)   GFLOPs   Params (M)   FPS-GPU (f/s)   FPS-CPU (f/s)
YOLOv5n     82.0            89.9         92.1         4.2      1.8          108.0           34.1
YOLOv5s     87.0            88.6         92.8         16.0     7.035        123.5           18.3
YOLOv7      90.2            87.2         93.0         103.2    36.3         87.1            12.5
YOLOv8n     89.9            91.2         95.9         8.1      3.0          84.7            11.2
YOLOv8s     92.0            90.0         96.2         28.4     11.1         72.9            93.8
YOLOv9t     89.2            91.3         95.6         10.7     2.6          134.6           6.9
YOLOv9s     90.7            88.2         95.4         38.7     9.60         58.4            2.6
TEB-YOLO    91.7            85.6         90.8         10.5     5.423        67.3            12.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
