Abstract
The precise localization of the pith within sawn timber cross-sections is essential for improving downstream processing accuracy in modern wood manufacturing. Existing industrial workflows still rely heavily on manual interpretation, which is labor-intensive, error-prone, and unsuitable for real-time quality control. However, automatic pith detection is challenging due to the small size of the pith, its visual similarity to knots and cracks, and the dominance of negative samples (boards without visible pith) in practical scenarios. To address these challenges, this study develops Wood-YOLOv11, a task-adapted YOLOv11-based pith detection model optimized for real-time and high-precision operation in wood processing environments. The proposed approach incorporates: (1) a dedicated sawn-timber cross-section dataset including multiple species, mixed imaging sources, and clearly annotated pith positions; (2) a negative-sample-aware training strategy that explicitly leverages pithless boards and weighted binary cross-entropy to mitigate extreme class imbalance; (3) a high-resolution (840 × 840) input configuration and optimized loss weighting to improve small-target localization; and (4) a comprehensive evaluation protocol including false-positive analysis on pithless boards and comparison with mainstream detectors. Validated on a comprehensive, custom-annotated sawn timber dataset, our model demonstrates excellent performance. It achieves a mean Average Precision (mAP@0.5) of 92.1%, a Precision of 95.18%, and a Recall of 87.72%, proving its ability to handle high-texture backgrounds and small target sizes. The proposed Wood-YOLOv11 model provides a robust, real-time, and efficient technical solution for the intelligent transformation of the wood processing industry.
1. Introduction
Wood, as a sustainable and renewable material, plays a vital role in modern construction and engineered wood products such as Glulam and Cross-Laminated Timber (CLT). However, the mechanical stability and dimensional reliability of these materials strongly depend on the accurate determination of the pith, which serves as the geometric reference for assessing grain orientation. Traditionally, human operators perform visual inspection under illumination, but such manual identification is highly subjective and error-prone. Studies have shown that human observers achieve only 70–80% accuracy when locating the pith under varying lighting and texture conditions [1,2], and the results vary significantly with operator experience.
With the rise of machine vision and deep learning, wood inspection has entered a new era of automation. Early research primarily focused on wood species recognition [3] and surface defect detection [4], using convolutional neural networks (CNNs) [5] to learn texture and anatomical patterns. Other studies explored internal defect detection using CT scanning or laser profiling [6]. However, the specific task of real-time pith detection using 2D machine vision remains a formidable and under-explored challenge. This task is exceptionally difficult because the pith typically constitutes a very small target within the cross-section image, making its features susceptible to loss during the down-sampling process inherent in deep neural networks. Furthermore, the pith must be distinguished from a visually complex background comprising dense annual rings, as well as features with similar appearances such as small knots, resin canals, and drying cracks. These issues demand a detection model specifically tailored for small, hard-to-distinguish targets in textured environments [7,8].
To address these challenges, this paper introduces Wood-YOLOv11, an architecture adapted from YOLOv11 for accurate and real-time pith localization. In its original form, YOLOv11 is a high-performance, general-purpose object detector incorporating an efficient backbone, a feature pyramid network, and a decoupled detection head. In this work, we build upon the standard YOLOv11 framework and tailor its training and configuration for the pith detection task. Specifically, we adapt and apply the YOLOv11 architecture, leveraging its advanced feature extraction (C2PSA), multi-scale fusion (FPN + PAN) [9,10], and decoupled head components to maximize performance for pith detection, and we implement a composite loss function strategy that synergistically improves localization precision [11], classification accuracy under severe class imbalance [12], and boundary refinement for the pith target. Our contributions do not modify the fundamental network architecture; rather, they focus on dataset construction, training strategies, and task-oriented parameter adjustments.
The main contributions of this study are as follows:
- Task-Oriented Dataset Construction: We develop a dedicated sawn timber cross-section dataset composed of multiple species, diverse imaging conditions, and high-quality pith annotations. The dataset includes a substantial number of negative samples (pithless boards) to reflect practical industrial conditions.
- Negative-Sample-Aware Training Strategy: Pithless images are explicitly incorporated during training as background-only samples. A weighted binary cross-entropy (WBCE) component is adopted to mitigate severe class imbalance and enhance the model’s capability to suppress false positives.
- Resolution and Loss Optimization for Small-Target Detection: The model uses a high-resolution input configuration (840 × 840) and a composite loss function with tuned weighting coefficients to improve the localization accuracy of small pith targets.
- Comprehensive Evaluation Pipeline: Beyond standard mAP and precision/recall metrics, the evaluation includes false-positive rate analysis on pithless boards, ablation studies on model configurations, and comparisons with mainstream detectors.
2. Related Work and Theoretical Basis
2.1. Convolutional Neural Networks (CNNs) in Image Recognition
CNNs are the cornerstone of modern computer vision, distinguished by their ability to automatically learn a hierarchy of features from raw pixel data [13]. Foundational architectures like AlexNet [14], VGGNet [15], and GoogLeNet [16] proved the efficacy of deep stacks of convolutional and pooling layers, while later models like ResNet introduced residual connections to facilitate the training of even deeper networks, significantly boosting image recognition capabilities. In the domain of wood science, CNNs have been effectively used for wood species identification based on anatomical features and for classifying surface defects.
2.2. Object Detection Models
Object detection algorithms identify and localize objects within an image using bounding boxes. They are broadly categorized into two-stage and one-stage approaches [17]. Two-stage detectors, like Faster R-CNN [18], first generate region proposals and then classify each proposal, generally achieving high accuracy at the cost of speed. One-stage detectors, such as the YOLO series [19,20,21] and SSD [22], formulate detection as a single regression problem, directly predicting bounding box coordinates and class probabilities. This approach offers superior speed, making it highly suitable for real-time industrial applications like automated wood processing lines.
The YOLO (You Only Look Once) series is a prominent family of one-stage object detection algorithms [23]. Its core principle is to divide the input image into a grid and have each grid cell be responsible for detecting objects whose centers fall within it. This design enables end-to-end training and rapid inference [19]. Our work leverages the YOLOv11 model, which builds upon this foundation with several architectural enhancements, such as a CSPNet-based backbone [24], path aggregation networks [10], and spatial pyramid pooling [25] to improve its performance on challenging targets, making it an ideal theoretical basis for our pith detection task.
3. Dataset Construction and Preprocessing
A high-quality, task-specific dataset is essential for training a robust pith detection model. To accurately reflect real industrial scenarios, the dataset developed in this study includes multiple wood species, diverse imaging sources, a substantial number of pithless boards, and high-resolution cross-sectional images. This section details the data collection, annotation, preprocessing, augmentation, and task analysis.
3.1. Data Collection
To build a comprehensive dataset suitable for small-target pith localization, we collected images from two complementary sources:
Laboratory Photography: A large set of images was captured from physical timber samples using a Canon 5D series digital SLR camera. The sawn timber specimens were arranged in parallel on custom-made, simply supported timber blocks, and the camera angle was adjusted so that the optical axis was perpendicular to the cross-sections of the specimens. The camera was then connected to a computer, and remote shooting was controlled through the EOS Utility software. The experimental setup is illustrated in Figure 1. The focus was on species commonly used in engineered wood, including domestically grown Chinese Fir and imported species such as Spruce.
Figure 1.
The experimental photography setup using the Canon 50D SLR camera: (a) right side view; (b) left side view.
Online Wood Database: To further expand the dataset, we downloaded numerous high-resolution cross-section images from “The Wood Database”, a large, open-source repository of wood species imagery. The wood-filter database of The Wood Database website is shown in Figure 2.
Figure 2.
The wood-filter database of The Wood Database website.
We selected two wood species, Chinese fir and spruce from The Wood Database. The dataset contains 1640 images in total: 960 for Chinese fir and 680 for spruce. For each species, we combined online images (primarily from The Wood Database and related open sources) with our laboratory photography, yielding an even split overall (820 online/820 lab). The set is balanced with respect to pith, comprising 820 images with pith and 820 without pith, and covering multiple cutting types. Representative samples are shown in Figure 3.
Figure 3.
Examples of cross-sectional images (a) Chinese fir sawn timber; (b) Spruce sawn timber.
3.2. Image Annotation
All images were manually annotated using LabelImg, following a strict single-class labeling protocol.
Class definition: A single object class, “pith”, was defined for detection.
Annotation Tool: The graphical annotation tool LabelImg was used to manually draw bounding boxes around the pith in each relevant image.
Annotation Standard: Annotators were instructed to create the tightest possible bounding box fully enclosing the pith, ensuring high-quality labels for precise model training.
Negative samples: Images without visible pith were preserved as background-only images with no bounding boxes, and explicitly incorporated into training.
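For clarity, the snippet below illustrates the label convention assumed above: each annotated image has a plain-text file with one line per bounding box in normalized YOLO format (class x_center y_center width height), while pithless boards keep an empty label file so they act as background-only samples. The file names, image sizes, and pixel coordinates are hypothetical examples, not values from our dataset.

```python
from pathlib import Path

def write_yolo_label(label_path, boxes, img_w, img_h, class_id=0):
    """Write YOLO-format labels: one 'class xc yc w h' line per box,
    with coordinates normalized to [0, 1]. An empty box list produces an
    empty file, marking the image as a background-only (pithless) sample."""
    lines = []
    for x1, y1, x2, y2 in boxes:  # pixel-space corner coordinates
        xc = (x1 + x2) / 2.0 / img_w
        yc = (y1 + y2) / 2.0 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    Path(label_path).write_text("\n".join(lines))

# Hypothetical example: one pith box on a 2400 x 1600 image, plus a pithless board.
write_yolo_label("labels/fir_001.txt", [(1180, 760, 1240, 820)], 2400, 1600)
write_yolo_label("labels/fir_002.txt", [], 2400, 1600)  # negative sample
```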
3.3. Data Preprocessing and Augmentation
Before model training, all images underwent standardized preprocessing steps:
- Image Normalization: All images were resized to a uniform input resolution of 840 × 840 pixels. This larger-than-standard size was chosen specifically to preserve the fine details of the small pith target during training [26];
- Data Augmentation: To prevent overfitting and improve the model’s robustness to variations in appearance, several data augmentation techniques were applied during the training phase, including random rotations, horizontal flips, and adjustments to brightness and contrast. Techniques like Mosaic and Mixup, common in YOLO training pipelines, were also employed [27,28].
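As a minimal sketch of the geometric and photometric augmentations listed above, the torchvision pipeline below resizes images to 840 × 840 and applies random flips, rotations, and brightness/contrast jitter. The parameter ranges are illustrative assumptions, and box-aware augmentations such as Mosaic and MixUp are normally handled inside the detection training pipeline rather than at this image-level stage.

```python
import torchvision.transforms as T

# Illustrative image-level pipeline; parameter ranges are assumptions,
# not the exact values used in training. Bounding boxes must be transformed
# consistently, which YOLO-style training frameworks handle internally.
train_transforms = T.Compose([
    T.Resize((840, 840)),                        # uniform 840 x 840 input
    T.RandomHorizontalFlip(p=0.5),               # horizontal flip
    T.RandomRotation(degrees=10),                # small random rotation
    T.ColorJitter(brightness=0.2, contrast=0.2), # brightness/contrast jitter
    T.ToTensor(),
])
```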
3.4. Task Analysis
The core objective is to accurately and reliably localize the pith. This task is defined by several key challenges:
Extreme Small-Target Problem: The pith’s pixel area is often minuscule compared to the entire cross-section, making it prone to feature loss in deep network layers.
Strong Background Interference: The pith must be distinguished from visually similar features inherent to wood, such as dense annual ring lines, small knots, resin canals, and cracks, which act as strong “pseudo-targets”.
Morphological Diversity and Ambiguous Boundaries: The pith’s shape can be irregular (circular, elliptical, ruptured), and its boundary with the surrounding wood is often diffuse, complicating precise bounding box regression.
Negative Sample Dominance: A large proportion of cross-sections contain no pith, requiring the model to strongly avoid false positives.
Given the need for real-time processing in industrial settings [29,30], the one-stage YOLOv11 architecture was selected as the baseline. Its powerful feature extraction backbone, multi-scale feature fusion neck (FPN + PAN) [9,10], and pre-training on large datasets [31] provide a strong foundation. However, to achieve state-of-the-art performance, we leverage its most advanced components and a tailored loss function, as detailed below.
4. Proposed Method
The proposed Wood-YOLOv11 is a task-adapted YOLOv11-based detection framework designed specifically for small-target pith localization in sawn timber cross-sections. Rather than modifying the architectural modules of YOLOv11, this work focuses on dataset design, negative-sample-aware training, resolution configuration, and loss-weight optimization, all of which are critical factors for robust small-object detection under wood's complex texture patterns.
This section describes the adopted YOLOv11 architecture, the composite loss function, training configurations, and ablation strategies incorporated into this study.
4.1. YOLOv11 Architecture Overview
YOLOv11 consists of three major components: a backbone, a feature pyramid (neck), and a decoupled detection head. The network structure of YOLOv11 is shown in Figure 4.
Figure 4.
Network structure diagram of YOLOv11.
(A) Backbone Network
The backbone employs CSP-enhanced convolution modules and C3K2 blocks to extract hierarchical features. These modules help capture high-frequency details such as ring boundaries and crack contours, which are essential for distinguishing the small pith from visually similar texture patterns.
C3K2 Module: This is the core feature extraction unit. By using its deeper bottleneck structure, it can extract more complex and subtle features, which is essential for differentiating the pith from the intricate wood grain texture. The structure diagram of the C3K2 module is shown in Figure 5.
Figure 5.
Structure diagram of the C3K2 module.
SPPF (Spatial Pyramid Pooling Fast) Module: This module enhances the model’s multi-scale detection capability by efficiently creating a spatial pyramid. It allows the network to capture both local and global context, significantly improving its ability to detect small targets like the pith. The structure diagram of the SPPF module is shown in Figure 6.
Figure 6.
Structure diagram of the SPPF module.
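For reference, the SPPF block can be summarized as a 1 × 1 channel reduction followed by three chained 5 × 5 max-pooling operations whose outputs are concatenated and re-projected. The sketch below follows this commonly used formulation and omits the normalization and activation layers of the full implementation; it is an illustration, not the exact Ultralytics source.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three chained 5x5 max-pools whose
    outputs are concatenated with the reduced input, then fused by a 1x1 conv."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```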
C2PSA (Cross-scale Pixel Spatial Attention) Module: This innovative attention mechanism allows the model to “focus” on regions most likely to contain the pith while suppressing background noise from annual rings and cracks. Attention mechanisms have been shown to be highly effective in various computer vision tasks [32,33,34,35]. The structure diagram of the C2PSA module is shown in Figure 7.
Figure 7.
Structure diagram of the C2PSA module.
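As a rough illustration of the spatial-attention idea behind C2PSA (the actual Ultralytics module is considerably more elaborate), the block below is a CBAM-style spatial attention [33]: channel-wise average and max maps are concatenated, passed through a convolution, and used to re-weight the feature map so that texture-heavy background regions can be suppressed. It is a conceptual sketch only, not the C2PSA implementation itself.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Simplified CBAM-style spatial attention, used here only to illustrate
    how attention re-weights pith-like regions; not the exact C2PSA module."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)      # channel-average map
        max_map, _ = x.max(dim=1, keepdim=True)    # channel-max map
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                            # re-weight spatial positions
```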
(B) Neck: Multi-Scale Feature Aggregation
The neck, combining a Feature Pyramid Network (FPN) [9] and a Path Aggregation Network (PAN) [10], fuses feature maps from different backbone levels. This ensures that the final feature maps are rich in both the strong semantic information needed to identify the pith and the precise spatial information needed to localize it accurately.
(C) Detection Head: Classification and Localization
We use YOLOv11’s advanced decoupled head, which separates the classification and regression tasks into different convolutional branches. This allows each branch to be optimized independently, improving convergence speed and overall detection accuracy (mAP) [36]. The structure diagram of the detection head module is shown in Figure 8.
Figure 8.
Structure diagram of the detection head module.
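Conceptually, the decoupled head places two parallel convolutional branches on each neck output, one producing class logits and one producing box regression terms. The sketch below is a simplified single-scale version written under that assumption; the actual YOLOv11 head uses additional layers and a DFL-style regression output.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Simplified decoupled detection head: separate branches for
    classification logits and box regression at each spatial location."""
    def __init__(self, c_in, num_classes=1, reg_channels=4):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_classes, 1),   # class logits (here: "pith")
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, reg_channels, 1),  # box regression terms
        )

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)
```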
4.2. Loss Function Optimization
A loss function defines the optimization objective of the model and quantifies the gap between predicted and true values. To accomplish pith recognition efficiently and accurately, YOLOv11 integrates a comprehensive optimization objective composed of multiple loss functions that address the three core subtasks of object detection: localization, classification, and boundary refinement. Rather than simply summing independent terms, these specialized losses are designed to work in coordination. This study fully adopts this design to ensure that the model achieves optimal performance in all subtasks.
4.2.1. Localization Loss: Using Enhanced CIoU Loss
Localization loss directly measures the spatial difference between the bounding box predicted by the model and the manually annotated real bounding box, and its performance directly determines the accuracy of model localization. To achieve precise locking of the pith position, YOLOv11 adopts an enhanced Complete Intersection over Union (CIoU) loss function [11]. Compared with the traditional IoU (which only considers the overlapping area) and GIoU (which adds a penalty for non-overlapping areas on the basis of IoU), CIoU provides a more comprehensive way to evaluate bounding boxes because it additionally integrates two key geometric factors: the distance between central points and the consistency of aspect ratios. This comprehensive measurement makes it particularly excellent in handling tasks such as dense target detection or small target detection like pith.
The calculation formula of Box_Loss is defined as (Equation (1)):

$$L_{box} = 1 - \mathrm{CIoU}$$

The calculation method of CIoU is as follows (Equation (2)):

$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} - \alpha v$$

This formula consists of three core components:

Intersection over Union term ($\mathrm{IoU}$): this is the most basic metric, defined as the ratio of the area of intersection between the predicted bounding box and the ground truth box to the area of their union, used to measure the degree of their spatial overlap.

Center point distance penalty term ($\rho^{2}(b, b^{gt})/c^{2}$): this term penalizes the deviation between the center points of the predicted box and the ground truth box. Here $b$ and $b^{gt}$ denote the center points of the predicted box and the ground truth box, respectively, $\rho^{2}(b, b^{gt})$ is the squared Euclidean distance between these two center points, and $c$ is the diagonal length of the smallest enclosing rectangle that covers both boxes. Normalization by $c^{2}$ ensures that this penalty term is scale-invariant and effectively alleviates the problem of the IoU loss gradient being zero when the two boxes do not overlap.

Aspect ratio consistency penalty term ($\alpha v$): this term measures the similarity in shape between the predicted bounding box and the ground truth box. $v$ is a parameter that measures the aspect ratio consistency and $\alpha$ is a trade-off coefficient; their calculation formulas are (Equation (3)):

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$

When the IoU is high, the value of $\alpha$ becomes larger, and the model pays more attention to the accurate matching of the aspect ratio. By integrating these three dimensions into a unified framework, CIoU Loss provides the model with a robust and comprehensive evaluation metric: it guides the model to simultaneously pursue higher overlap, closer center point distance, and more similar shapes during optimization, thereby significantly improving the localization quality of the bounding box. This is crucial for accurately locking onto pith targets that are small in size and varied in shape.
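A self-contained sketch of the CIoU loss defined in Equations (1)-(3), assuming corner-format boxes, is given below. It is written for clarity rather than copied from any particular framework.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format; see Equations (1)-(3).
    pred, target: tensors of shape (N, 4). Returns per-box loss 1 - CIoU."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # Intersection over Union
    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared center distance, normalized by the enclosing-box diagonal
    center_dist = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    enclose_w = torch.max(px2, tx2) - torch.min(px1, tx1)
    enclose_h = torch.max(py2, ty2) - torch.min(py1, ty1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term v and trade-off coefficient alpha
    v = (4 / math.pi ** 2) * (
        torch.atan((tx2 - tx1) / (ty2 - ty1 + eps)) -
        torch.atan((px2 - px1) / (py2 - py1 + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    ciou = iou - center_dist / diag_sq - alpha * v
    return 1 - ciou
```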
4.2.2. Classification Loss: Using Weighted Binary Cross-Entropy
To address the severe class imbalance between the rare pith target and the abundant background, we use Weighted Binary Cross-Entropy (WBCE). This loss function applies a higher weight to the positive class (pith), ensuring the model gives adequate attention to learning its features despite the prevalence of negative samples. This concept is related to the Focal Loss, which also addresses class imbalance by down-weighting the loss assigned to well-classified examples [12].
The core idea of WBCE is to balance the contribution of different categories (here, foreground and background) to the total loss by introducing category weights. Its calculation formula is as follows (Equation (4)):

$$L_{WBCE} = -\frac{1}{N}\sum_{i=1}^{N} w_{i}\left[y_{i}\log\left(p_{i}\right) + \left(1 - y_{i}\right)\log\left(1 - p_{i}\right)\right]$$

The meanings of each symbol in the formula are as follows:

$N$: the total number of samples.

$y_{i}$: the true label of the $i$-th sample. For this task, $y_{i} = 1$ indicates that the sample is a pith (positive sample), and $y_{i} = 0$ indicates that it is background (negative sample).

$p_{i}$: the probability that the model predicts the $i$-th sample as a positive sample, representing the confidence of the model.

$w_{i}$: the weight factor for the category to which the $i$-th sample belongs. This weight is designed to balance the contributions of the categories; for example, it can be set to the reciprocal of the number of samples in each category, thereby amplifying the loss caused by the scarce pith (positive) samples and ensuring that the model fully learns their features without being dominated by the massive background (negative) samples.
By integrating WBCE, YOLOv11 can effectively balance the learning process of various categories, improve the classification performance for minority categories, ensure the stability of the training process and the accuracy of the final classification, and achieve robust performance even in cases where the ratio of foreground to background samples is extremely imbalanced.
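In practice, the weighting of Equation (4) can be realized with a positive-class weight, for example via PyTorch's BCEWithLogitsLoss pos_weight argument. The counts in the sketch below are illustrative assumptions used only to show how the weight would be derived, not the values used in this study.

```python
import torch
import torch.nn as nn

# Illustrative counts only (assumptions); in practice they come from training-set statistics.
num_positive = 800     # assumed number of positive (pith) assignments
num_negative = 12000   # assumed number of background (negative) assignments

# Up-weight the rare positive (pith) class, in the spirit of Equation (4).
pos_weight = torch.tensor([num_negative / num_positive])
wbce = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                      # raw classification scores
labels = torch.randint(0, 2, (8, 1)).float()    # 1 = pith, 0 = background
loss = wbce(logits, labels)
```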
4.2.3. Distribution Focal Loss: Optimizing Boundary Details
Distribution Focal Loss (Dfl_Loss) is an innovative design in YOLOv11 aimed at improving the quality of bounding box regression. Traditional regression losses (such as L1 or L2 Loss) regress coordinate values as continuous real numbers, while Dfl_Loss innovatively models the coordinate prediction task of bounding boxes as a learning problem of probability distribution [36].
The core idea of this method is that, instead of directly predicting a single coordinate value, the model predicts the probability distribution of the coordinate value over a discretized range. This enables the model not only to optimize the overall positioning but also to focus on the fine details of high-confidence regions, thereby improving the accuracy and regression quality of the bounding box. Its calculation formula can be expressed as (Equation (5)):

$$L_{Dfl} = \sum_{j=1}^{n} P_{j}\left|y - y_{j}\right|$$

The meanings of each symbol in the formula are as follows:

$n$: the total number of points after discretization of the coordinate values.

$y$: the actual coordinate value.

$y_{j}$: the $j$-th candidate coordinate value after discretization.

$P_{j}$: the probability that the model assigns to the candidate coordinate value $y_{j}$.

The innovation of Dfl_Loss lies in weighting the regression error $\left|y - y_{j}\right|$ of each candidate coordinate point by the predicted probability $P_{j}$. Regions to which the model assigns higher probability therefore contribute more to the total loss through their regression errors. This mechanism encourages the model to put greater effort into optimizing the regions in which it is most confident, thereby achieving refined polishing of the boundaries. This is particularly beneficial for targets such as the pith, whose boundaries are often vague and difficult to define precisely.
Dfl_Loss and CIoU Loss are functionally complementary: CIoU is responsible for ensuring the overall and global positioning accuracy of the bounding box, while Dfl_Loss is responsible for refining the local and fine-grained boundary details. These two work together, enabling YOLOv11 to achieve both accurate and refined bounding box predictions in various challenging scenarios, providing strong performance guarantees for the pith recognition task.
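A minimal sketch of the distribution-based regression term in the expected-error form of Equation (5) is shown below: the model outputs a probability distribution over n discrete coordinate bins, and each bin's regression error is weighted by its predicted probability. Note that the reference DFL formulation in [36] instead applies a cross-entropy to the two bins nearest the target; the version here simply mirrors Equation (5) for illustration.

```python
import torch
import torch.nn.functional as F

def dfl_expected_error(logits, target, n_bins=16):
    """Expected regression error following Equation (5).
    logits: (N, n_bins) raw scores over discretized coordinate bins.
    target: (N,) continuous target coordinate in bin units, in [0, n_bins - 1]."""
    probs = F.softmax(logits, dim=-1)                          # P_j in Equation (5)
    bins = torch.arange(n_bins, dtype=probs.dtype)             # candidate values y_j
    error = (bins.unsqueeze(0) - target.unsqueeze(-1)).abs()   # |y - y_j|
    return (probs * error).sum(dim=-1).mean()
```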
The overall loss is formulated as (Equation (6)):

$$L_{total} = \lambda_{1} L_{CIoU} + \lambda_{2} L_{WBCE} + \lambda_{3} L_{Dfl}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are task-tuned weighting coefficients selected empirically through experiments.
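Putting the pieces together, Equation (6) amounts to a weighted sum of the three components. The helper below combines them with the weights left as parameters, since the concrete lambda values are tuned experimentally rather than fixed constants.

```python
def total_loss(box_loss, cls_loss, dfl_loss, lambda1, lambda2, lambda3):
    """Composite objective of Equation (6): weighted sum of CIoU, WBCE, and DFL terms.
    The lambda weights are task-tuned hyperparameters (selected experimentally)."""
    return lambda1 * box_loss + lambda2 * cls_loss + lambda3 * dfl_loss
```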
4.3. Model Parameters and Training
The model was trained using the following key parameters, based on best practices from the source literature:
Optimizer: AdamW [37], which is often preferred over standard Adam in deep learning for its improved weight decay implementation.
Learning Rate Schedule: Initial learning rate of 1 × 10⁻⁴ with a cosine learning rate scheduler [38].
Batch Size: 64.
Input Image Size: 840 × 840 pixels.
Pre-trained Weights: The model was initialized with official YOLOv11 weights pre-trained on a large-scale dataset (e.g., COCO) to leverage transfer learning [39].
Augmentation Strategy: Training integrates Mosaic, MixUp, flipping, rotation, scaling, and color adjustments, helping the model generalize across species and imaging conditions.
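Assuming the standard Ultralytics training interface, the configuration above maps onto a call like the following. The epoch count and augmentation magnitudes are illustrative placeholders rather than the exact values from our training logs.

```python
from ultralytics import YOLO

# Start from official COCO-pretrained YOLOv11 weights (transfer learning).
model = YOLO("yolo11s.pt")

# Illustrative training call; 'data.yaml' points to the pith dataset, and the
# epoch count and augmentation magnitudes are placeholders, not reported values.
model.train(
    data="data.yaml",
    imgsz=840,            # high-resolution input for the small pith target
    batch=64,
    optimizer="AdamW",
    lr0=1e-4,             # initial learning rate
    cos_lr=True,          # cosine learning-rate schedule
    epochs=300,           # placeholder value
    mosaic=1.0,           # Mosaic augmentation
    mixup=0.1,            # MixUp augmentation (placeholder magnitude)
    fliplr=0.5,           # horizontal flip probability
    degrees=10.0,         # random rotation range (placeholder)
)
```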
5. Results and Discussion
5.1. Experimental Setup
Hardware Environment: Experiments were conducted on a workstation with a 13th Gen Intel(R) Core(TM) i5-13400 CPU, an NVIDIA GeForce RTX 5060 Ti GPU, and 32.0 GB of RAM. Software Environment: The model was implemented in Python 3.13.5 using the PyTorch framework (version 2.8.0) and the OpenCV library (version 4.12). Evaluation Metrics: Model performance was evaluated using standard object detection metrics: confusion matrix, accuracy, precision, recall, and F1-score. Mean Average Precision (mAP) is also a primary metric for this task [40].
5.2. Quantitative Results
Wood-YOLOv11 demonstrates strong performance on the pith detection task, as summarized in Table 1. Wood-YOLOv11 achieves the best overall accuracy, reaching 0.921 mAP@0.5 with the highest precision (0.952) and strong recall (0.877) at 27 FPS. SSD is the fastest at 45 FPS but shows the lowest accuracy (0.72 mAP) and recall (0.78), highlighting a clear speed-accuracy trade-off. Faster R-CNN reaches 0.845 mAP; its precision (0.935) is respectable, yet the two-stage pipeline limits throughput to 14 FPS, constraining real-time use. YOLOv7 delivers 0.89 mAP at 24 FPS, an acceptable accuracy but with lower speed and marginally lower recall than YOLOv8 and YOLOv11. YOLOv8 trails closely at 0.905 mAP with comparable precision and recall (0.945/0.865) and slightly higher speed (28 FPS), making it a solid balanced baseline. In summary, Wood-YOLOv11 is preferable for peak accuracy, YOLOv8 for accuracy-speed balance, SSD for latency-critical scenarios, and Faster R-CNN when classical two-stage behavior is preferred.
Table 1.
Comparison of precision, recall, mAP@0.5, and inference speed for different detectors.
The comparison confirms that task-specific optimizations provide measurable improvements over other mainstream detectors. Wood-YOLOv11 achieves a strong balance of accuracy and real-time performance.
5.3. Ablation Studies
Ablation experiments verify the necessity of each task-adapted component.
As shown in Table 2, increasing the input resolution from 640 to 840 yields consistent accuracy gains under identical weights and evaluation settings (same split, thresholds, and NMS). mAP@0.5 improves from 0.909 to 0.921, precision from 0.948 to 0.952, and recall from 0.865 to 0.877. The larger resolution preserves finer spatial details, improving localization and reducing missed detections, particularly for small or slender objects; tighter boxes also slightly reduce false positives. Although higher resolution increases computational cost (latency and memory), the 840 setting offers a favorable accuracy–efficiency trade-off for our dataset, and is therefore adopted as the default in subsequent experiments.
Table 2.
Effect of input resolution on detection performance.
As shown in Table 3, ablating WBCE and DFL shows complementary effects. With either component absent, performance is lower. Enabling WBCE alone primarily benefits recall (0.871) with a modest mAP gain (0.912), consistent with better handling of class imbalance and hard negatives. Enabling DFL alone mainly improves localization quality, increasing precision (0.951) and mAP (0.914), with a smaller recall lift. Combining WBCE + DFL yields the best result, indicating additive gains from improved box quality and higher true-positive rate.
Table 3.
Effect of WBCE weighting and DFL on detection performance.
5.4. Training Curves and Convergence
The trained Wood-YOLOv11 model was evaluated on the held-out test set. Training progress was monitored through loss and accuracy curves, which demonstrated stable convergence. The final detection performance, including visualizations of the confusion matrix and detection examples on test images, is presented below.
Figure 9 presents the predicted labels of images in the validation set. The blue bounding boxes denote the regions of the target "item" (i.e., the pith region in wood cross-sections) detected by the model. It can be observed that the model successfully locates the target areas in various wood cross-section images, demonstrating its capability to identify the target despite variations in wood texture, background, and target position, which reflects the model's generalization performance on the validation data.
Figure 9.
Predicted objects in test images.
Figure 10 illustrates the training and validation loss curves, along with key evaluation metrics throughout the model training process.
Figure 10.
Training and validation loss.
In the loss subplots, all loss values exhibit a consistent decreasing trend as the number of epochs increases. This indicates that the model effectively learns feature representations from the data and gradually converges during training.
For the evaluation metrics (i.e., metrics/precision(B), metrics/recall(B), metrics/mAP50(B), and metrics/mAP50-95(B)), a notable upward trend is observed. Precision and recall increase and stabilize at relatively high levels, while the mean Average Precision (mAP) metrics (both mAP50—focusing on IoU = 0.5, and mAP50-95—averaging across IoU thresholds from 0.5 to 0.95) also show continuous improvement.
Collectively, the decreasing loss curves and increasing evaluation metrics validate the effectiveness of the training process, demonstrating that the model not only reduces training and validation losses but also achieves enhanced detection performance in terms of precision, recall, and overall accuracy across different Intersection over Union (IoU) criteria.
Figure 11 presents the normalized confusion matrix for the target detection task, where the rows correspond to predicted classes and the columns correspond to true classes, involving two categories: item and background.
Figure 11.
Confusion matrix.
For the true class item, the model correctly predicts item with a normalized ratio of 89%, while 11% of true item instances are misclassified as background. In contrast, when the true class is background, the model achieves perfect prediction: all true background instances are correctly identified as background (a normalized ratio of 1.00, indicating no misclassifications for the background class).
In summary, the confusion matrix demonstrates that the model excels at identifying the background class. For the item class, although the correct prediction ratio is relatively high, there remains a minor portion of item instances misclassified as background, suggesting potential for further optimization in subsequent model iterations.
As illustrated in Figure 12, four evaluation curves (Precision-Confidence, Recall-Confidence, F1-Confidence, and Precision-Recall) are plotted to assess the model’s detection robustness. The model delivers a perfect precision of 1.00, a high recall of 0.91, and a peak F1-score of 0.93, with the AP for the “item” class and mAP@0.5 across all categories both attaining 0.924. These metrics not only validate the model’s capability to maintain high confidence while balancing precision and recall, but also confirm its outstanding accuracy in object detection tasks.
Figure 12.
Precision-confidence (BoxP), recall-confidence (BoxR), F1-confidence (BoxF1), and precision-recall (BoxPR) curves.
Figure 13 displays the visual detection results for wood cross-section images. In each subfigure, the blue bounding box labeled “item” (accompanied by a corresponding confidence score, such as 0.85, 0.82, 0.73, and 0.87) indicates the target region identified by the model. It can be observed that the model successfully locates the target areas across various wood cross-section scenarios, even when the images exhibit differences in wood texture, background environments, or contain cracks and other structural variations. The confidence scores of these detections are relatively high (ranging from 0.73 to 0.87), demonstrating the model’s capability to accurately identify the target “item” with reliable confidence across diverse wood samples and reflecting good generalization performance.
Figure 13.
Detection results.
5.5. Visualization Results and Analysis of Pith Detection Model
The designed pith detection model was validated on the test dataset, and the results demonstrate that the model can accurately identify and localize the pith regions in cross-sectional images. As shown in Figure 9, the predicted bounding boxes are compared with the ground-truth labels for different sample images. The model performs well in images where the pith boundaries are clear and the contrast is high, with predicted boxes almost overlapping the actual regions.
5.5.1. Initial Candidate Box Generation Stage
In object detection tasks, the initial candidate box generation is the first and fundamental step of the YOLO (You Only Look Once) algorithm. The core idea of this stage is to divide the input image into an $S \times S$ grid, where each grid cell is responsible for predicting the objects whose center points fall within it. To accommodate targets of different sizes and aspect ratios, YOLO predefines multiple anchor boxes (Anchors) of various scales and aspect ratios in each grid cell, forming the initial set of candidate boxes.
Unlike traditional sliding window or Region Proposal Network (RPN) methods, YOLO adopts a dense prediction mechanism, performing simultaneous predictions for all anchor boxes in a single forward pass, thereby achieving real-time, end-to-end detection. Instead of directly predicting absolute coordinates, each anchor box predicts the relative offsets to refine localization. This parameterization significantly improves regression stability and generalization capability.
Furthermore, the initial shapes of the anchor boxes are often determined by clustering analysis (e.g., K-means) on the ground-truth bounding boxes of the training set, ensuring that the predefined anchors better match the typical target dimensions and proportions in the dataset. In practice, this stage generates a large number of candidate boxes, approximately $S \times S \times B$, where $B$ is the number of anchor boxes per grid cell. These boxes are subsequently refined and filtered through confidence thresholding and Non-Maximum Suppression (NMS).
In this study, the visualization experiment, as shown in Figure 14, illustrates the dense distribution of YOLO’s initial prediction boxes across the entire image, vividly demonstrating the large-scale candidate box generation process.
Figure 14.
Initial candidate box generation stage.
5.5.2. Feature Extraction and Prediction Stage
After the generation of initial candidate boxes, the YOLO model proceeds to the feature extraction and prediction stage. The main task at this stage is to extract image features via deep convolutional neural networks (CNNs) and to make predictions for each candidate box. Typically, YOLO employs efficient CNN backbones, such as the Darknet family, to extract hierarchical semantic representations through multiple layers of convolution, pooling, and nonlinear activation operations.
Following feature extraction, YOLO predicts three main types of outputs for each anchor box: (1) bounding box coordinate adjustments for precise localization; (2) the objectness score, representing the likelihood that the box contains an object; and (3) the class probability distribution, indicating the object’s category.
To improve detection reliability, the model applies a confidence threshold (e.g., 0.25) to filter out low-confidence predictions, removing background noise and retaining only high-confidence boxes. This not only enhances detection accuracy but also reduces computational overhead in later stages.
From an architectural perspective, YOLO integrates multi-scale feature fusion and Feature Pyramid Network (FPN) structures to strengthen its capability in detecting objects of varying sizes, especially small targets. Multi-scale fusion combines low-level detailed features with high-level semantic features, while the FPN’s top-down pathway ensures consistent detection performance across different resolutions.
As shown in Figure 15, the confidence thresholding process retains candidate boxes that the model deems most likely to contain the pith region, providing a refined set of boxes for subsequent Non-Maximum Suppression and final result generation.
Figure 15.
Feature extraction and prediction stage.
5.5.3. Post-Processing and Final Output Stage
After candidate box prediction and filtering, YOLO enters the post-processing and final output stage. The main objective here is to resolve redundant detections and finalize class assignments, producing the ultimate detection results. Because multiple high-confidence boxes may correspond to the same object, YOLO employs the Non-Maximum Suppression (NMS) algorithm to eliminate redundancy.
Specifically, NMS first sorts all candidate boxes by confidence score and selects the highest-confidence box as the current detection result. Then, it computes the Intersection over Union (IoU) between this box and the remaining ones, discarding those with IoU values greater than a predefined threshold (e.g., 0.45). This process iterates until all boxes have been processed. NMS effectively removes duplicate detections of the same object, yielding a single, optimal detection result.
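The NMS procedure just described can be written compactly as below; the IoU threshold of 0.45 follows the example in the text, and in production the equivalent optimized routine torchvision.ops.nms would normally be used instead of this educational sketch.

```python
import torch

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, discard remaining boxes whose
    IoU with it exceeds the threshold, and repeat.
    boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,). Returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU between the kept box and the remaining candidates
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```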
After NMS, the model performs class filtering to determine the most probable class label for each retained box. The final detection output includes bounding box coordinates, class labels, and confidence scores, fully describing the model’s detection outcome.
The performance of NMS depends heavily on the IoU threshold, which controls the strictness of the suppression. To further improve detection accuracy, several enhanced versions such as Soft-NMS (using score decay instead of hard elimination) and DIoU-NMS (considering both overlap and center distance) have been proposed. It is worth noting that post-processing is executed only during inference and does not participate in backpropagation, thus having minimal impact on computational efficiency.
As illustrated in Figure 16, the final detection results include bounding boxes, category labels, and confidence scores after NMS processing, comprehensively reflecting YOLO’s post-processing logic and output characteristics.
Figure 16.
Post-processing and final output stage.
In summary, the YOLO detection process can be divided into three core stages: initial candidate box generation, feature extraction and prediction, and post-processing and output. The main advantages of YOLO lie in its one-stage, end-to-end design and real-time performance: (1) unlike traditional two-stage detectors, YOLO completes detection in a single forward pass; (2) it considers the global context of the entire image, effectively reducing background false detections; (3) its efficient post-processing mechanism ensures unique and accurate detection results. Together, these characteristics make YOLO a representative algorithm balancing detection accuracy and computational efficiency in modern object detection research.
5.6. Discussion
Our proposed Wood-YOLOv11 model achieved excellent performance on the pith detection task. The results validate that the targeted application of the YOLOv11 architecture and its composite loss function is highly effective for this specific industrial challenge. The model’s ability to accurately detect the small pith target amidst complex wood textures demonstrates the success of the C2PSA attention mechanism and the SPPF module. The high precision and recall rates confirm that the model is both reliable (low false positives) and sensitive (low false negatives), making it suitable for deployment in an automated quality control pipeline in the wood industry [41,42,43].
6. Conclusions and Future Work
This study presents Wood-YOLOv11, a task-adapted YOLOv11-based detection framework designed to achieve accurate and real-time pith localization in sawn timber cross-sections. Unlike methods that modify the architecture of YOLO models, Wood-YOLOv11 focuses on dataset construction, negative-sample–aware training, small-target–oriented loss optimization, and high-resolution input settings, all of which directly address the unique challenges of pith detection.
Through the development of a comprehensive multi-species dataset—including both pith and a large number of pithless boards—this work recreates realistic industrial detection conditions. The proposed training strategy effectively suppresses false positives, while the composite CIoU + WBCE + DFL loss significantly enhances small-target localization accuracy. Extensive experiments demonstrate that Wood-YOLOv11 consistently outperforms standard YOLOv11 configurations and other mainstream detectors (YOLOv7/YOLOv8, Faster R-CNN, SSD), while maintaining real-time inference performance.
For future work, the model could be expanded to detect and classify other important wood features simultaneously, such as knots, cracks, and resin canals, creating a comprehensive automated timber grading system [3,44,45]. Furthermore, deploying the model on edge computing hardware for on-site, real-time analysis in sawmills would be the next step toward its industrial application [46,47,48]. Future research could also explore integrating this vision system with other sensing modalities, such as X-ray or laser scanning, for a more holistic wood quality assessment [49,50].
Author Contributions
Conceptualization, S.J.; Methodology, S.J.; Software, S.J.; Formal analysis, S.J.; Investigation, S.J. and F.K.; Data curation, S.J., F.K., B.J. and C.J.; Writing—original draft, S.J.; Writing—review and editing, F.K.; Supervision, Z.Q.; Project administration, Z.Q.; Funding acquisition, Z.Q. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.
Acknowledgments
The authors gratefully acknowledge the financial support provided by the National Key Research and Development Program of China (Grant No. 2024YFD2201204).
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Habite, T.; Abdeljaber, O.; Olsson, A. Automatic detection of annual rings and pith location along Norway spruce timber boards using conditional adversarial networks. Wood Sci. Technol. 2021, 55, 461–488. [Google Scholar] [CrossRef]
- Hu, M.; Briggert, A.; Olsson, A.; Johansson, M.; Oscarsson, J.; Säll, H. Growth layer and fibre orientation around knots in Norway spruce: A laboratory investigation. Wood Sci. Technol. 2018, 52, 7–27. [Google Scholar] [CrossRef]
- Zielinski, K.M.; Scabini, L.; Ribas, L.C.; da Silva, N.R.; Beeckman, H.; Verwaeren, J.; Bruno, O.M.; De Baets, B. Advanced wood species identification based on multiple anatomical sections and using deep feature transfer and fusion. Comput. Electron. Agric. 2025, 231, 109867. [Google Scholar] [CrossRef]
- Wang, B.; Wang, R.; Chen, Y.; Yang, C.; Teng, X.; Sun, P. FDD-YOLO: A Novel Detection Model for Detecting Surface Defects in Wood. Forests 2025, 16, 308. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Lee, S.H.; Seo, H.I.; Seong, J.H.; Joo, Y.I.; Seo, D.H. A study on defect detection in X-ray image castings based on unsupervised learning. J. Adv. Mar. Eng. Technol. 2020, 44, 487–493. [Google Scholar] [CrossRef]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Gorban, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
- Liu, H.; Wang, M.; Liu, L.; Wu, J.; Huang, H. A survey of small object detection based on deep learning. Comput. Eng. Sci. 2021, 43, 1429–1442. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar] [CrossRef]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montréal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
- Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
- Singh, B.; Davis, L.S. An analysis of scale invariance in object detection-snip. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3578–3587. [Google Scholar]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar] [CrossRef]
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
- Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118. [Google Scholar] [CrossRef] [PubMed]
- Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M.C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J.; et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA-J. Am. Med. Assoc. 2016, 316, 2402–2410. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
- Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? In Proceedings of the Advances in Neural Information Processing Systems 27, Montréal, QC, Canada, 8–13 December 2014; pp. 3320–3328. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. Deep neural networks segment neuronal membranes in electron microscopy images. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–8 December 2012; pp. 2843–2851. [Google Scholar]
- Fink, G.; Kohler, J. Model for the prediction of the tensile strength and tensile stiffness of knot clusters within structural timber. Eur. J. Wood Wood Prod. 2014, 72, 331–341. [Google Scholar] [CrossRef]
- Guindos, P.; Guaita, M. A three-dimensional wood material model to simulate the behavior of wood with any type of knot at the macro-scale. Wood Sci. Technol. 2013, 47, 585–599. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
- Tan, M.; Le, Q.E. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 15, pp. 6105–6114. [Google Scholar]
- Mokroš, M.; Liang, X.; Surový, P.; Valent, P.; Čerňava, J.; Chudý, F.; Tunák, D.; Saloň, Š.; Merganič, J. Evaluation of close-range photogrammetry image collection methods for estimating tree diameters. ISPRS Int. J. Geo-Inf. 2018, 7, 93. [Google Scholar] [CrossRef]
- Bhandarkar, S.M.; Luo, X.; Daniels, R.F.; Tollner, E.W. Automated planning and optimization of lumber production using machine vision and computed tomography. IEEE Trans. Autom. Sci. Eng. 2008, 5, 677–695. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).