Article

WoodYOLO: A Novel Object Detector for Wood Species Detection in Microscopic Images

Lars Nieradzik, Henrike Stephani, Jördis Sieburg-Rockel, Stephanie Helmling, Andrea Olbrich, Stephanie Wrage and Janis Keuper

1 Image Processing Department, Fraunhofer ITWM, Fraunhofer Platz 1, 67663 Kaiserslautern, Germany
2 Thünen Institute of Wood Research, Leuschnerstraße 91, 21031 Hamburg, Germany
3 Institute for Machine Learning and Analysis (IMLA), Offenburg University, Badstr. 24, 77652 Offenburg, Germany
* Author to whom correspondence should be addressed.
Forests 2024, 15(11), 1910; https://doi.org/10.3390/f15111910
Submission received: 26 September 2024 / Revised: 22 October 2024 / Accepted: 26 October 2024 / Published: 30 October 2024
(This article belongs to the Section Wood Science and Forest Products)

Abstract

Wood species identification plays a crucial role in various industries, from ensuring the legality of timber products to advancing ecological conservation efforts. This paper introduces WoodYOLO, a novel object detection algorithm specifically designed for microscopic wood fiber analysis. Our approach adapts the YOLO architecture to address the challenges posed by large, high-resolution microscopy images and the need for high recall in localization of the cell type of interest (vessel elements). Our results show that WoodYOLO significantly outperforms state-of-the-art models, achieving performance gains of 12.9% and 6.5% in F2 score over YOLOv10 and YOLOv7, respectively. This improvement in automated wood cell type localization capabilities contributes to enhancing regulatory compliance, supporting sustainable forestry practices, and promoting biodiversity conservation efforts globally.

1. Introduction

Global deforestation is a cause of biodiversity loss and climate change. The European Union’s recently adopted EU Deforestation Regulation (EUDR [1]), which replaces the EU Timber Regulation (EUTR), requires that products traded in the EU are based on deforestation-free supply chains. This increases the demand for confirming the declaration of wood products regarding the wood species and origin.
This is a particular challenge for paper products, where the DNA is destroyed and different pulps are mixed during production. Consequently, neither genetics, stable isotopes nor NIR spectroscopy can be used to analyze the origin [2,3]. A new, albeit very complex, chemotaxonomic method for determining wood species was recently introduced [4], but the standard analysis for verifying the declared wood species in paper remains the anatomical one [5,6]. After sample preparation, the microscopic examination of the cell characteristics by experts is time-consuming and requires a high level of personal experience. Figure 1 shows an example of a microscope image of macerated cells that is analyzed. The limited number of experts in the field makes it challenging to meet the increasing demand for wood species verification [7].
To address these challenges, recent advancements in computer vision and machine learning offer promising avenues for automating wood species identification. Machine learning techniques, particularly deep neural networks, have shown remarkable capabilities in analyzing large-scale image datasets and extracting intricate features crucial for species classification [8]. However, while automated systems exist for macroscopic wood analysis [9,10,11], automated methods for microscopic analysis of fibrous materials like paper are still nascent [12].
A deep learning-based approach specifically targeting the detection and classification of vessel elements in microscopic images of macerated wood samples was recently presented [12]. These efforts have highlighted the potential of automation to streamline what has traditionally been a manual task. However, existing methods often face challenges such as suboptimal recall and high demands on computational power, especially when processing large and high-resolution microscopic images.
To address these limitations and further advance automated wood species identification, we present WoodYOLO, a novel object detection algorithm specifically designed for microscopic wood fiber analysis. WoodYOLO builds upon the YOLO (You Only Look Once) architecture, incorporating tailored optimizations to enhance performance in high-resolution microscopy.
Our algorithm introduces several key innovations:
  • Customized YOLO-based architecture specifically optimized for microscopic images, achieving significant performance gains over YOLOv10 and YOLOv7 by 12.9% and 6.5% respectively in terms of F2 score, while using around 3–4× less VRAM.
  • Introduction of a novel anchor box specification method, where users define only the maximum width and height of objects. This approach improves F2 score by 0.7%.
  • Comprehensive evaluation of various architectural decisions in modern object detectors. Our findings reveal that optimizations designed for general datasets like COCO [13] may not always translate to improved performance in real-world datasets or different domains.
By advancing automated wood species identification capabilities, our work contributes to enhancing regulatory compliance, supporting sustainable forestry practices, and promoting biodiversity conservation efforts globally. WoodYOLO represents a significant step towards developing scalable, reliable and efficient methods for wood species identification in microscopic images of fibrous materials.

2. Related Work

The automated identification of wood species in microscopic images of fibrous materials has gained significant attention in recent years. This interest is driven by the need for efficient and accurate methods to support global wood fiber product controls.
A pioneering approach for the identification of hardwood species in microscopic images using deep learning techniques was introduced by [12]. They developed a methodology for generating a large dataset of macerated wood references, focusing on nine hardwood genera. This approach utilized a two-step process: first, detecting vessel elements using YOLOv7 [14], and then classifying these elements using convolutional neural networks (CNNs).
While the localization of objects achieved promising results, there remains room for improvement. Recently developed object detection algorithms, particularly those based on transformers, such as the DETR (DEtection TRansformer) model family [15,16,17,18], have shown potential. However, they have not seen widespread use due to higher time complexity, slower training speeds or lower mAP on real-world datasets.
Another line of research is the continuation of YOLO. It is important to note that a higher version number in YOLO does not necessarily indicate an improvement; instead, different techniques are applied, which may or may not work on particular datasets. Since the original YOLO publication [19], only YOLOv2 [20] and YOLOv3 [21] were developed by the original authors. Other versions have been introduced by different institutes or companies, including YOLOv4 [22], Scaled-YOLOv4 [23], YOLOX [24], YOLOv6 [25], DAMO-YOLO [26], YOLOv9 [27], YOLOv10 [28], PP-YOLO [29], PP-YOLOv2 [30], and PP-YOLOE [31]. Notably, YOLOv5 and YOLOv8 have never been published. In our method section, we will analyze some of the different components found in these papers.
A recent study by [32] titled “Segmentation and characterization of macerated fibers and vessels using deep learning” demonstrated the application of YOLOv8 for analyzing microscopy images of wood fibers.
In most practical machine learning research and data competitions, YOLO remains the state-of-the-art. Therefore, our focus is on developing an object detector based on this literature. Our current work builds upon these foundations by introducing a novel object detection algorithm specifically tailored for vessel element detection in microscopic images of fibrous materials. By designing our detection algorithm with this task in mind, we can make better optimizations and avoid focusing on general-purpose detection datasets such as COCO.
Although there are numerous papers in the microscopy and satellite imaging literature that adapt YOLO for high-resolution image analysis, they generally rely on the original YOLO code base and make only minor changes. For example, refs. [33,34] adapted YOLOv5 for cell counting. There are also various studies in the field of satellite images in which YOLO [35,36] has been slightly modified. As a result, the improvements compared to the baseline are often only marginal. In contrast, we have developed our version of YOLO from scratch and tested components from different versions. This allows for more significant and customized improvements specifically for our application.

3. Materials and Methods

Frequently processed woods that are cultivated in plantations for pulp, paper and fiberboard production were selected, such as poplar or eucalypt. The exact genera can be found in [12]. Vouchered specimens of the Thünen Institute’s wood collection and other documented sources served as reference material for training and testing. Analogous to pulp production, the cell structure of the wood tissue was broken down into individual cells by maceration according to the method of [37]. At least 3 macerates per genus were produced. Maceration and staining are described in detail in [5,38]. For each macerate, 20 slides were prepared. Ten of these were stained with Alexander Herzberg solution and ten with nigrosine (1 wt%).
Our detection framework is tailored to localize vessel elements in microscopic images, a crucial step for automating hardwood species identification in fibrous materials. Vessel elements, the water-conducting cells of deciduous trees, contain characteristic morphological features that differ between genera, in contrast to fibers. We adapted the YOLO architecture for this domain, addressing the challenges posed by large image sizes (up to 54,000 × 31,000 pixels) and the need for high recall. Unlike algorithms such as DETR, which do not scale well to very large images and have slower training times, YOLO has proven effective in real-world applications, making it a suitable choice for our task.
Although the YOLO family includes various models optimized for general datasets like COCO, these models are not directly applicable to our problem due to their design for multiple classes and general-purpose images. Therefore, we customized YOLO by integrating components from different versions to optimize it for vessel detection without the need for classification.
In this section, we describe our model’s architecture, loss function, metric, and additional approaches evaluated to enhance detection performance.

3.1. Architecture

Our model architecture begins with selecting a backbone capable of efficiently extracting features from large microscopic images. The backbone processes the input to generate multi-scale feature maps. We tested several backbones such as VGG11 [39], ConvNeXt [40], and ResNet [41], and combined their feature maps through a component known as the neck, which outputs three feature maps. Although more than three feature maps can be used, our evaluation showed no significant advantage in doing so.
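As an illustration of how a standard backbone can provide such multi-scale feature maps, the following sketch slices a torchvision VGG11-bn into three stages; the split indices, the absence of pretrained weights, and the test input size are our own assumptions, not necessarily the WoodYOLO configuration.

```python
# Minimal sketch: three multi-scale feature maps from a torchvision VGG11-bn backbone.
# The stage split indices (strides 8/16/32) are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision


class VGGBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        features = torchvision.models.vgg11_bn(weights=None).features
        self.stage1 = features[:15]   # assumed split: stride 8, 256 channels
        self.stage2 = features[15:22] # assumed split: stride 16, 512 channels
        self.stage3 = features[22:]   # assumed split: stride 32, 512 channels

    def forward(self, x):
        p3 = self.stage1(x)
        p4 = self.stage2(p3)
        p5 = self.stage3(p4)
        return p3, p4, p5  # the three maps handed to the neck


if __name__ == "__main__":
    feature_maps = VGGBackbone()(torch.zeros(1, 3, 512, 512))
    print([tuple(f.shape) for f in feature_maps])
```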
Our neck architecture is based on YOLOv7-tiny. We also tested YOLOX’s CSPNet [42] but found the former to be better. The use of a smaller architecture is due to the need for memory efficiency. Since we want to train the network with a higher image resolution than the usual 640 × 640 or 1280 × 1280, we need to reduce the memory requirements. Also, deeper networks are usually chosen when many features are needed to distinguish between different classes. Here it is only a matter of finding objects without the need for classification. Therefore, simpler networks work better.
Figure 2 shows that our neck consists of several convolutional layers that are combined in different ways. A “c” block consists of a simple convolution followed by a batch normalization and a ReLU function. The “b” block consists of parallel convolutions that are combined by concatenation. Figure 3 shows the “b” block in detail.
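A minimal PyTorch sketch of the “c” and “b” blocks as described above; the channel split between the two parallel branches is an assumption where Figure 3 leaves details open.

```python
# Sketch of the "c" (conv + BN + ReLU) and "b" (parallel convs + concatenation) blocks.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """The 'c' block: convolution followed by batch normalization and ReLU."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class ConcatBlock(nn.Module):
    """The 'b' block: parallel 3x3 and 1x1 convolution branches joined by concatenation."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.branch3x3 = ConvBlock(c_in, c_out // 2, k=3)  # assumed half of the output channels
        self.branch1x1 = ConvBlock(c_in, c_out // 2, k=1)

    def forward(self, x):
        return torch.cat([self.branch3x3(x), self.branch1x1(x)], dim=1)
```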
The three orange blocks in Figure 2 indicate the outputs of the neck. These three blocks are then used as inputs for the head. The outputs have different dimensions as a higher stride size is used for some of the convolutions.
The head produces the predictions of the neural network. It consists of only one convolutional block and one output convolution. A decoupled head, such as the one used in YOLOX, has not proven to be better in our case.
For each feature map, the head produces an output tensor $f_i$ of dimensions $g_{h_i} \cdot g_{w_i} \times 5$, where $g_{h_i}$ and $g_{w_i}$ denote the grid height and width for the $i$-th layer. Each grid cell in $f_i$ predicts five parameters: the center x-coordinate, center y-coordinate, width, height, and object confidence. These outputs are transformed as follows:
$$x_c = 2\,\sigma(f_{i,\cdot,1}) - 0.5, \quad y_c = 2\,\sigma(f_{i,\cdot,2}) - 0.5, \quad w = \sigma(f_{i,\cdot,3})^2 \cdot g_{w_i} \cdot m_w, \quad h = \sigma(f_{i,\cdot,4})^2 \cdot g_{h_i} \cdot m_h, \quad o = \sigma(f_{i,\cdot,5}),$$
where $m_w, m_h \in [0, 1]$ are hyperparameters defining the maximum width and height of the object. For instance, $m_w = 0.1$ means an object can be at most 10% of the total image width. This is similar to having a single anchor box of a maximum specific size.
The advantage of using two hyperparameters instead of anchor boxes is that no techniques such as clustering [20] have to be used to determine them. In addition, the loss function is much simpler and the training speed is higher.
The sigmoid function $\sigma(\cdot)$ ensures $x_c$ and $y_c$ are offsets within the grid cell, while $w$ and $h$ define the bounding box dimensions. The confidence score $o$ indicates the likelihood of a bounding box’s presence at each location.
$x_c$ and $y_c$ are scaled to the range $[-0.5, 1.5]$. This allows the model to shift the center of the box by up to half a grid cell to either side.
In the prediction phase, the $x_c$ and $y_c$ offsets are adjusted by adding the grid indices $\{0, 1, \ldots, g_{w_i} - 1\}$ and $\{0, 1, \ldots, g_{h_i} - 1\}$, respectively. Coordinates are scaled to the original image size by multiplying $x_c$ and $w$ by $s_w / g_{w_i}$, and $y_c$ and $h$ by $s_h / g_{h_i}$, where $s_w$ and $s_h$ are the input image width and height.
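A minimal sketch of this decoding step for a single feature map, assuming a raw head output of shape (g_h, g_w, 5); the tensor layout and function names are our own.

```python
# Sketch: decode one raw head output f of shape (g_h, g_w, 5) into image-space boxes.
import torch


def decode_head(f, m_w, m_h, s_w, s_h):
    """m_w, m_h: maximum object width/height as a fraction of the image (e.g. 0.1).
    s_w, s_h: input image width and height in pixels."""
    g_h, g_w, _ = f.shape
    ys, xs = torch.meshgrid(torch.arange(g_h), torch.arange(g_w), indexing="ij")

    x_c = 2.0 * torch.sigmoid(f[..., 0]) - 0.5 + xs  # offset in [-0.5, 1.5] plus grid index
    y_c = 2.0 * torch.sigmoid(f[..., 1]) - 0.5 + ys
    w = torch.sigmoid(f[..., 2]) ** 2 * g_w * m_w     # at most m_w of the grid width
    h = torch.sigmoid(f[..., 3]) ** 2 * g_h * m_h     # at most m_h of the grid height
    o = torch.sigmoid(f[..., 4])                      # objectness confidence

    # Scale grid-unit coordinates back to the original image size.
    x_c, w = x_c * s_w / g_w, w * s_w / g_w
    y_c, h = y_c * s_h / g_h, h * s_h / g_h
    return torch.stack([x_c, y_c, w, h, o], dim=-1)
```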

3.2. Loss Function

Our loss function consists of two components:
$$L = L_r + L_p,$$
where $L_r$ is the regression loss and $L_p$ is the classification loss.

3.2.1. Regression Loss

The regression loss measures the alignment between predicted bounding boxes $\hat{b}$ and ground truth boxes $b$ using the Intersection over Union (IoU):
$$L_r = \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( 1 - \mathrm{IoU}(\hat{b}_{i,j}, b_{i,j}) \right),$$
where $n$ is the number of feature pyramid layers (in our case, $n = 3$) and $m$ is the number of bounding boxes. The regression loss is either evaluated only with the corresponding bounding box at that grid cell or additionally with neighboring grid cells (multi-positives).
There are different variants of IoU: Complete IoU (cIoU) [43], Distance IoU (DIoU) [44], Generalized IoU (GIoU) [45] and standard IoU. In the evaluation section, we evaluate the different approaches to see which maximizes our metric.

3.2.2. Classification Loss

The classification loss evaluates the confidence score $\hat{o}$ using binary cross entropy (BCE), with the ground truth confidence derived from the IoU:
$$L_p = \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{BCE}\left( \hat{o}_{i,j}, \mathrm{IoU}(\hat{b}_{i,j}, b_{i,j}) \right).$$
Unlike the regression loss, we evaluate the BCE at all locations of the grid. However, we set $\mathrm{IoU}(\hat{b}_{i,j}, b_{i,j}) = 0$ when there is no ground truth box at a specific grid cell.
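The following sketch illustrates both loss terms for one feature map, assuming boxes already decoded to center format in grid units and a boolean mask marking cells with an assigned ground truth box; the helper names and the logit-based BCE variant are our assumptions.

```python
# Sketch of the combined loss for one feature map: (1 - IoU) at positive cells
# plus BCE over all cells with the IoU as soft objectness target.
import torch
import torch.nn.functional as F


def box_iou_cxcywh(a, b, eps=1e-7):
    """Elementwise IoU between boxes given as (xc, yc, w, h)."""
    ax1, ay1 = a[..., 0] - a[..., 2] / 2, a[..., 1] - a[..., 3] / 2
    ax2, ay2 = a[..., 0] + a[..., 2] / 2, a[..., 1] + a[..., 3] / 2
    bx1, by1 = b[..., 0] - b[..., 2] / 2, b[..., 1] - b[..., 3] / 2
    bx2, by2 = b[..., 0] + b[..., 2] / 2, b[..., 1] + b[..., 3] / 2
    inter = (torch.minimum(ax2, bx2) - torch.maximum(ax1, bx1)).clamp(min=0) * \
            (torch.minimum(ay2, by2) - torch.maximum(ay1, by1)).clamp(min=0)
    union = a[..., 2] * a[..., 3] + b[..., 2] * b[..., 3] - inter
    return inter / (union + eps)


def detection_loss(pred_boxes, pred_logits, gt_boxes, pos_mask):
    """pred_boxes, gt_boxes: (g_h, g_w, 4); pred_logits: (g_h, g_w) raw confidence logits;
    pos_mask: (g_h, g_w) boolean, True where a ground truth box is assigned."""
    iou = box_iou_cxcywh(pred_boxes, gt_boxes)

    # Regression loss: 1 - IoU, evaluated only at positive cells.
    l_r = (1.0 - iou[pos_mask]).mean() if pos_mask.any() else pred_boxes.sum() * 0.0

    # Classification loss: BCE over all cells, IoU as target at positives, 0 elsewhere.
    target = torch.where(pos_mask, iou.detach(), torch.zeros_like(iou))
    l_p = F.binary_cross_entropy_with_logits(pred_logits, target)
    return l_r + l_p
```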

3.3. Metric

The predominant metric in object detection is average precision (AP) [46] computed at different thresholds, which summarizes both precision and recall:
$$\mathrm{AP} = \int_0^1 p(r) \, dr,$$
where $r$ denotes recall and $p(r)$ denotes precision as a function of recall. A detection is considered correct if the IoU between the predicted and the true bounding box exceeds a predefined threshold.
In our specific application, however, the use of AP would not be a good choice. Recall takes precedence over precision as our goal is to find all objects.
Furthermore, we are less interested in an exact overlap with the ground truth: minor shifts or size variations in the bounding box should not be penalized by the metric. We therefore consider only a single, low IoU threshold, whereas AP is typically computed at multiple thresholds.
Hence, we propose an alternative metric: the F2 score, computed at a fixed IoU threshold of 0.3. This choice emphasizes recall over precision. False positives can be handled in a postprocessing step by training a classifier to distinguish between correct and wrong detections. Figure 4 shows two examples where an overlap of 30% is sufficient.
While the usual threshold is 0.5, we choose the lower value of 0.3 because perfect alignment with the ground truth bounding box is not essential for our objectives.
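A minimal sketch of how the F2 score at a fixed IoU threshold of 0.3 can be computed, using greedy confidence-ordered matching of predictions to ground truth boxes; the matching strategy and names are our assumptions.

```python
# Sketch: F-beta score (beta = 2) for one image at a fixed IoU threshold.
def f2_score(pred_boxes, gt_boxes, iou_thresh=0.3, beta=2.0, eps=1e-7):
    """pred_boxes: list of (x1, y1, x2, y2, confidence); gt_boxes: list of (x1, y1, x2, y2)."""

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + eps)

    matched, tp = set(), 0
    for box in sorted(pred_boxes, key=lambda p: p[4], reverse=True):
        best_j, best_iou = -1, iou_thresh
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            o = iou(box, gt)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j >= 0:
            matched.add(best_j)
            tp += 1

    fp, fn = len(pred_boxes) - tp, len(gt_boxes) - tp
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
```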

3.4. Additional Approaches

We explored several innovations from the YOLO series to further enhance our detection framework, evaluating their impact on performance. Some of these results will be shown in the evaluation section.

3.4.1. Center Sampling and Multi-Positives

We explored the use of neighboring grid cells for matching ground truth boxes, a technique known in the literature as multi-positives [24] or center sampling [47].
In the standard loss function $L_r$, we compute the IoU loss only between boxes at coordinates $(i, j)$. Center sampling extends this concept by also comparing boxes at $(i + k_1, j + k_2)$, where $k_1$ and $k_2$ are integer offsets. The ground truth box is duplicated for these new coordinates $(i + k_1, j + k_2)$ so that a comparison with the ground truth box at those positions is possible. We investigated three variants:

0 Neighbors:      2 Neighbors:      4 Neighbors:
0 0 0             0 × 0             0 × 0
0 ∘ 0             0 ∘ ×             × ∘ ×
0 0 0             0 0 0             0 × 0

Here, ∘ denotes the original bounding box, while × represents neighboring boxes and 0 means “empty cell”. For the 0 neighbors configuration, the loss $L_r$ remains unchanged, as it only considers the original box. In the 2 neighbors configuration, the nearest bounding boxes within the grid are selected, in this case the right and upper boxes. For the 4 neighbors configuration, we use bounding boxes from all four directions: left, right, up, and down. Note that the diagonal boxes are never selected.
Since object detection is a one-to-many mapping (one ground-truth box corresponds to many correctly predicted boxes), this strategy attempts to simulate this mapping using the loss function.
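A small sketch of how the grid-cell offsets for the three variants could be generated; the rule for picking the two nearest neighbors from the fractional position of the box center is our assumption.

```python
# Sketch: grid-cell offsets at which a ground truth box is duplicated (multi-positives).
def neighbor_offsets(cx_frac, cy_frac, num_neighbors):
    """cx_frac, cy_frac: fractional position of the box center inside its grid cell (0..1).
    num_neighbors: 0, 2 or 4. Returns a list of (dx, dy) offsets including (0, 0)."""
    if num_neighbors == 0:
        return [(0, 0)]
    if num_neighbors == 4:
        return [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, up, down
    # num_neighbors == 2: the horizontally and vertically nearest neighboring cells,
    # e.g. the right and upper cells when the center lies in that quarter of the cell.
    dx = 1 if cx_frac > 0.5 else -1
    dy = 1 if cy_frac > 0.5 else -1
    return [(0, 0), (dx, 0), (0, dy)]
```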

3.4.2. Label Assignment

Bounding boxes are predicted for every feature map. The use of center sampling further increases the number of predicted boxes. To manage this increase of bounding boxes, we evaluated label assignment strategies designed to reduce the number of valid boxes per object.
We experimented with modern label assignment techniques such as SimOTA and TAL [48,49]. However, these methods did not yield improved results in our scenario. We attribute this to our metric, which prioritizes maximizing recall rather than balancing precision and recall.

3.4.3. Auxiliary Head Loss

Deep supervision techniques, such as those used in YOLOv7 [14], involve adding auxiliary losses to guide deeper networks. Our experiments with additional model layers showed no benefit, so this approach was excluded from our final model.

3.4.4. Anchor Boxes

Anchor boxes, introduced in YOLOv2 [20], are used to predict object locations. Consistent with the YOLOX findings [24], our results showed no improvement with anchor boxes, leading us to exclude them for simplicity. Instead, we incorporate parameters $m_h$ and $m_w$ in the range $[0, 1]$ to constrain the predicted width and height of bounding boxes, as discussed previously.

3.4.5. NMS-Free Detection

NMS-free approaches from models like YOLOv10 did not perform as well in our tests. We retained traditional Non-Maximum Suppression (NMS) for its robustness and simplicity.
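For reference, a minimal sketch of the standard NMS procedure (in practice, a library routine such as torchvision.ops.nms can be used instead):

```python
# Sketch: standard greedy Non-Maximum Suppression.
import torch


def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) tensor as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU between the currently best box and all remaining boxes.
        xx1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Drop boxes that overlap the chosen box too much; keep the rest for the next round.
        order = rest[iou < iou_thresh]
    return torch.tensor(keep, dtype=torch.long)
```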

3.4.6. Training Strategies

Techniques such as mosaic augmentation and gradient accumulation, which are effective in other YOLO implementations, did not significantly improve detection in our application. Therefore, they were excluded from the final model configuration.

4. Results

We evaluate WoodYOLO on a dataset constructed for automating the detection and identification of vessel elements in hardwood species, a critical step toward wood species classification. Vessel elements are the water-conducting cells in hardwoods that differ from genus to genus due to their characteristic morphological features. These vessel elements provide vital information for wood identification and are easy to distinguish from other cell types such as fibers or parenchyma cells.
In this paper, we are specifically concerned with improving the localization of these vessel elements. The dataset comprises high-resolution microscope images of macerated hardwood samples, captured with a ZEISS Axioscan 7 microscope. Each image, originally in the czi format with a resolution of approximately 54,000 × 31,000 pixels and a file size of about 1 GB, was scaled down to 10% of its original size (approximately 5400 × 3100 pixels) to enhance training efficiency and reduce memory usage. The final dataset consists of 767 images annotated with 118,287 bounding boxes identifying vessel elements.
Only the third of five focal planes of each image was utilized for training, as the additional planes did not contribute significant information for detecting the vessel elements. The annotated dataset was split into 613 images for training and 154 images for validation. We conducted initial experiments with 5-fold cross-validation and found the metrics to be relatively stable across folds; due to time constraints, we therefore use a simple train-validation split.
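As an illustration of the downscaling step, the following sketch resizes an already exported image to 10% of its original size with OpenCV; the export format and interpolation choice are assumptions, and the bounding box annotations have to be scaled by the same factor.

```python
# Sketch: downscale an exported microscope image (e.g. TIFF/PNG) to 10% of its size.
import cv2


def downscale_to_10_percent(path_in, path_out, boxes=None):
    """boxes: optional list of (x1, y1, x2, y2) in original pixel coordinates."""
    img = cv2.imread(path_in)
    small = cv2.resize(img, None, fx=0.1, fy=0.1, interpolation=cv2.INTER_AREA)
    cv2.imwrite(path_out, small)
    # Annotations must be scaled with the same factor as the image.
    scaled_boxes = [(x1 * 0.1, y1 * 0.1, x2 * 0.1, y2 * 0.1) for x1, y1, x2, y2 in (boxes or [])]
    return small, scaled_boxes
```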
In this section, we evaluate the performance of our vessel detection framework across various configurations and compare it to other state-of-the-art models. The evaluations were conducted using the F2 score at a fixed IoU threshold of 0.3, as described before.

4.1. Detection Model and Backbone Comparison

Since we use YOLO as a basis, it is useful to compare our model with other YOLO variants. In Table 1, we present the F2 scores for different detection models.
Our customized YOLO variant outperforms other models, achieving an F2 score of 0.848, highlighting its superior ability to detect vessel elements in large microscopic images.
The parameters of YOLOv10 and YOLOv7 have both been optimized. It is worth noting that we use a resolution of 5184 × 5184 pixels for the second-best model YOLOv7-W6, which requires the use of an A100. Our model uses 2048 × 2048 and can be trained with less than 10 GB of VRAM.
We also evaluated various backbone networks to determine their impact on detection performance. Table 2 summarizes the results, while including the number of parameters.
The VGG11-bn backbone yielded the highest F2 score (0.8316) while maintaining a reasonable parameter count and VRAM usage. All other backbones except YOLOv7-tiny have much higher VRAM requirements, as they use skip connections, more complex activation functions or special layers such as Squeeze-and-Excitation blocks [52]. The simplicity of VGG makes it easier to scale to higher resolutions.

4.2. Effect of Neighboring Cells and IoU Loss Function

We assessed the impact of considering neighboring grid cells (multi-positives) for matching ground truth boxes. As shown in Table 3, using 0 neighboring cells produced the highest F2 score (0.8481).
Adding more neighboring cells led to a decrease in performance, suggesting that the resulting loss in precision outweighs any gain in recall.
Next, we compared different IoU-based loss functions to determine their effectiveness in our model. Table 4 shows that the generalized IoU (GIoU) loss yielded the best performance with an F2 score of 0.8340.
However, the differences in F2 score are small, so the choice of IoU variant has no major influence on the result.

4.3. Impact of Image Size and Training Techniques

Table 5 evaluates the impact of varying image sizes on detection performance. Training on images of size 2048 provided the highest F2 score (0.8316).
This confirms that we do not need the full resolution of 54,000 × 31,000 pixels to find the vessel elements. It is therefore also not necessary to split the images into patches and run the detection per patch. Since only a single image needs to be processed per prediction, our approach also achieves a higher prediction speed.
We have successfully trained a model with a resolution of 6144 × 6144 on an A100 GPU with 40 GB VRAM. Even higher resolutions are possible with further adjustments to the architecture. It is important to emphasize that our standard model, which operates at a resolution of 2048 × 2048, is designed to be more accessible. It can be trained on consumer-grade hardware and requires only about 8 GB of VRAM for training.
In training our YOLO-based model, we explored several advanced techniques to enhance performance, including mosaic augmentation and gradient accumulation. Mosaic augmentation is a data augmentation strategy that creates a new training image by combining four different images from the dataset. This technique is intended to provide more context and variability during training, potentially improving the model’s generalization ability. However, as shown in Table 6, mosaic augmentation did not lead to an improvement in the F2 score for our task.
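For illustration, a simplified sketch of the idea: four images are placed on a fixed 2 × 2 grid and their boxes rescaled accordingly, whereas the original YOLOv4 version uses a randomized layout.

```python
# Sketch: simplified mosaic augmentation on a fixed 2x2 grid.
import numpy as np
import cv2


def simple_mosaic(images, boxes_per_image, out_size=2048):
    """images: list of 4 HxWx3 uint8 arrays; boxes_per_image: list of 4 arrays of
    (x1, y1, x2, y2) in pixel coordinates of the corresponding image."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # (x, y) corner of each quadrant
    all_boxes = []
    for img, boxes, (ox, oy) in zip(images, boxes_per_image, offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + half, ox:ox + half] = cv2.resize(img, (half, half))
        if len(boxes):
            scale = np.array([half / w, half / h, half / w, half / h], dtype=np.float32)
            shift = np.array([ox, oy, ox, oy], dtype=np.float32)
            all_boxes.append(np.asarray(boxes, dtype=np.float32) * scale + shift)
    boxes_out = np.concatenate(all_boxes) if all_boxes else np.zeros((0, 4), dtype=np.float32)
    return canvas, boxes_out
```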
Gradient accumulation is another technique we evaluated. It allows for effective training with larger batch sizes than can fit in GPU memory by accumulating gradients over multiple mini-batches before updating the model weights. Despite its potential to stabilize training and improve convergence, our results indicate that gradient accumulation did not provide a significant benefit in our experiments.
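A minimal sketch of gradient accumulation in PyTorch, assuming standard model, loss function, optimizer, and dataloader objects are passed in:

```python
# Sketch: accumulate gradients over several mini-batches before each optimizer step.
def train_with_grad_accumulation(model, loss_fn, optimizer, dataloader, accumulation_steps=4):
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(dataloader):
        loss = loss_fn(model(images), targets) / accumulation_steps  # scale so gradients sum to one large-batch step
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```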
One key modification that proved beneficial was the implementation of a maximum object width and height (the previously discussed anchor box variant). Removing this constraint resulted in a noticeable decrease in the F2 score, demonstrating the effectiveness of this technique in improving detection performance.

4.4. Summary of the Results

We have demonstrated that WoodYOLO outperforms other YOLO variants in our specific use case. Interestingly, certain techniques that have consistently shown improvements in mAP on COCO do not yield similar benefits here. For instance, mosaic augmentation, introduced in YOLOv4 [22], showed a 1.8% increase in AP50 in their ablation study. In contrast, our experiments reveal a substantial decrease of 6.2% in F2 score when applying this technique. Similarly, we observed no advantage in using multi-positives, despite YOLOX reporting a 2.1% improvement.
We attribute these discrepancies to several factors:
  • Metric difference: Our focus is on recall and approximate bounding box overlap, rather than the standard COCO metrics.
  • Task simplification: As we only need to localize objects, our architecture can be shallower compared to those designed for more complex tasks.
  • Reproducibility challenges: Deep learning, particularly in object detection, often faces reproducibility issues. Many YOLO implementations use legacy code with undocumented workarounds to improve AP, which are not mentioned in the original papers. These may include arbitrary loss function weightings or different weight decay strategies [53].
To mitigate these confounding factors, we developed our detector from scratch, avoiding reliance on previous codebases. This approach allows us to more accurately assess the impact of individual modifications.
In conclusion, our findings suggest that for specialized domains that diverge significantly from the standard COCO use-case, developing customized detectors can be more beneficial than adapting existing general-purpose models. This approach enables a more tailored solution that better addresses the specific requirements of the task at hand.

5. Discussion and Conclusions

In this paper, we presented WoodYOLO, a novel object detection algorithm specifically designed for microscopic wood fiber analysis. Our approach builds upon the YOLO architecture, incorporating tailored optimizations to enhance performance in high-resolution microscopy images. We introduced several key innovations, including a customized YOLO-based architecture optimized for microscopic images and a novel anchor box specification method.
Our comprehensive evaluation demonstrated that WoodYOLO outperforms state-of-the-art models such as YOLOv10 and YOLOv7 by significant margins in terms of F2 score. We also provided insights into the effectiveness of various architectural decisions and training techniques in the context of wood vessel detection.
The superior performance of WoodYOLO in detecting vessel elements in microscopic images of fibrous materials represents a significant advancement in automated wood species identification. This contribution has far-reaching implications for enhancing regulatory compliance, supporting sustainable forestry practices, and promoting biodiversity conservation efforts globally.

Future Work

The development of WoodYOLO opens up several promising avenues for future research and improvement. A key area for exploration is the integration of rotated bounding boxes to improve the accuracy of vessel element localization, particularly for elongated or angled structures. This further development requires adjustments to both the model architecture and the dataset annotations and offers considerable potential for improving detection accuracy.
At the same time, the WoodYOLO architecture can be further optimized to reduce GPU requirements and increase recall. Reducing the model’s memory requirements is crucial to enable the processing of larger, higher-resolution microscopic images.
In addition to these technical improvements, we see great potential for adapting WoodYOLO to other areas that require high-resolution image analysis. For example, our approach could be useful in medical imaging to detect cell structures or in analyzing satellite imagery to identify specific geographical features. By exploring these cross-domain applications, we aim to extend the impact of our research beyond forestry and wood science and potentially contribute to advances in various scientific disciplines.

Author Contributions

L.N. conceived the study, developed the WoodYOLO algorithm, and wrote the main manuscript. H.S. and J.K. provided supervision, offered valuable remarks, and substantially revised the work, significantly contributing to its refinement and overall guidance. H.S., J.S.-R., S.H. and A.O. provided the wood samples, conducted maceration, performed staining, and prepared the images, including handling biological details related to the dataset. S.W. designed and established the database infrastructure. All authors reviewed and approved the final manuscript.

Funding

This research was funded by Fachagentur Nachwachsende Rohstoffe e.V. (FNR–FKZ 2220HV063A and 2220HV063B).

Data Availability Statement

The data will be made available upon request from the authors once all studies have been completed and published.

Acknowledgments

The authors would like to thank all colleagues who participated in the preparation of the numerous samples, helped with annotation and made the project happen: P. Gospodnetić, L. Gradert, J. Heddier, D. Helm, S. Kaschuro, G. Koch, C. Piehl, M. Rauhut, L. Wenrich, A. Wettich, T. Stephani (all Fraunhofer Institute for Industrial Mathematics ITWM or Thünen Institute of Wood Research).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. European Parliament. Regulation (EU) 2023/1115 of the European Parliament and of the Council of 31 May 2023 on the making available on the Union market and the export from the Union of certain commodities and products associated with deforestation and forest degradation and repealing Regulation (EU) No 995/2010. Off. J. Eur. Union 2023, 150, 206–247.
  2. Tsuchikawa, S.; Kobori, H. A review of recent application of near infrared spectroscopy to wood science and technology. J. Wood Sci. 2015, 61, 213–220.
  3. Schmitz, N.; Beeckman, H.; Blanc-Jolivet, C.; Boeschoten, L.; Braga, J.W.; Cabezas, J.A.; Chaix, G.; Crameri, S.; Deklerck, V.; Degen, B.; et al. Overview of Current Practices in Data Analysis for Wood Identification; A Guide for the Different Timber Tracking Methods; Technical Report; GTTN, 2020.
  4. Flaig, M.L.; Berger, J.; Wenig, P.; Olbrich, A.; Saake, B. Identification of tropical wood species in paper: A new chemotaxonomic method based on extractives. Holzforschung 2023, 77, 860–878.
  5. Helmling, S.; Olbrich, A.; Heinz, I.; Koch, G. Atlas of vessel elements: Identification of Asian timbers. Iawa J. 2018, 39, 249–352.
  6. Ilvessalo-Pfäffli, M.S. Fiber Atlas: Identification of Papermaking Fibers; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1995.
  7. Ruffinatto, F.; Crivellaro, A. Atlas of Macroscopic Wood Identification: With a Special Focus on Timbers Used in Europe and CITES-Listed Species; Springer Nature: Berlin/Heidelberg, Germany, 2019.
  8. Silva, J.L.; Bordalo, R.; Pissarra, J.; de Palacios, P. Computer Vision-Based Wood Identification: A Review. Forests 2022, 13, 2041.
  9. UTAR; FRIM. MyWood-Premium. 2018.
  10. Ravindran, P.; Thompson, B.J.; Soares, R.K.; Wiedenhoeft, A.C. The XyloTron: Flexible, open-source, image-based macroscopic field identification of wood products. Front. Plant Sci. 2020, 11, 1015.
  11. Wiedenhoeft, A.C. The XyloPhone: Toward democratizing access to high-quality macroscopic imaging for wood and other substrates. Iawa J. 2020, 41, 699–719.
  12. Nieradzik, L.; Sieburg-Rockel, J.; Helmling, S.; Keuper, J.; Weibel, T.; Olbrich, A.; Stephani, H. Automating Wood Species Detection and Classification in Microscopic Images of Fibrous Materials with Deep Learning. arXiv 2023, arXiv:2307.09588.
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
  14. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
  16. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069.
  17. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
  18. Ouyang-Zhang, J.; Cho, J.H.; Zhou, X.; Krähenbühl, P. NMS Strikes Back. arXiv 2022, arXiv:2212.06137.
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640.
  20. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
  21. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  22. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  23. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. arXiv 2021, arXiv:2011.08036.
  24. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
  25. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
  26. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A Report on Real-Time Object Detection Design. arXiv 2023, arXiv:2211.15444.
  27. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
  28. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
  29. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-YOLO: An Effective and Efficient Implementation of Object Detector. arXiv 2020, arXiv:2007.12099.
  30. Huang, X.; Wang, X.; Lv, W.; Bai, X.; Long, X.; Deng, K.; Dang, Q.; Han, S.; Liu, Q.; Hu, X.; et al. PP-YOLOv2: A Practical Object Detector. arXiv 2021, arXiv:2104.10419.
  31. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250.
  32. Qamar, S.; Baba, A.I.; Verger, S.; Andersson, M. Segmentation and characterization of macerated fibers and vessels using deep learning. Plant Methods 2024, 20, 126.
  33. López Flórez, S.; González-Briones, A.; Hernández, G.; Ramos, C.; de la Prieta, F. Automatic Cell Counting With YOLOv5: A Fluorescence Microscopy Approach. Int. J. Interact. Multimed. Artif. Intell. 2023, 8, 64.
  34. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.Z.; Humayun, M. YOLOv5-FPN: A Robust Framework for Multi-Sized Cell Counting in Fluorescence Images. Diagnostics 2023, 13, 2280.
  35. Meng, X.; Li, C.; Li, J.; Li, X.; Guo, F.; Xiao, Z. YOLOv7-MA: Improved YOLOv7-Based Wheat Head Detection and Counting. Remote Sens. 2023, 15, 3770.
  36. Li, P.; Che, C. SeMo-YOLO: A Multiscale Object Detection Network in Satellite Remote Sensing Images. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual Conference, 18–22 July 2021; pp. 1–8.
  37. Franklin, G. Preparation of thin sections of synthetic resins and wood-resin composites, and a new macerating method for wood. Nature 1945, 155, 51.
  38. Helmling, S.; Olbrich, A.; Tepe, L.; Koch, G. Qualitative and quantitative characteristics of macerated vessels of 23 mixed tropical hardwood (MTH) species: A data collection for the identification of wood species in pulp and paper. Holzforschung 2016, 70, 839–844.
  39. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
  40. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545.
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  42. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv 2019, arXiv:1911.11929.
  43. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. arXiv 2021, arXiv:2005.03572.
  44. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2019, arXiv:1911.08287.
  45. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. arXiv 2019, arXiv:1902.09630.
  46. Everingham, M.; Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vision 2010, 88, 303–338.
  47. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355.
  48. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal Transport Assignment for Object Detection. arXiv 2021, arXiv:2103.14259.
  49. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. arXiv 2021, arXiv:2108.07755.
  50. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. arXiv 2021, arXiv:2101.03697.
  51. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946.
  52. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507.
  53. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. arXiv 2018, arXiv:1812.01187.
Figure 1. Microscope image of macerated hardwood cells including vessel elements. Blue boxes indicate vessel locations correctly localized by WoodYOLO, the light purple box indicates a false negative, and red boxes denote false positives that were not annotated by wood anatomists. WoodYOLO significantly speeds up the manual annotation process by automatically identifying hundreds of vessel elements.
Figure 2. Detection architecture based on YOLOv7-tiny [14]. “c” = Convolution with BN and ReLU, “+” = Concatenation, “m” = MaxPooling, “u” = Upsampling, “b” = Concatenation Block, “o” = single convolution with 5 outputs (x, y, width, height, confidence). Orange denotes an output in the neck of the model, which is given to the head. There are in total three outputs.
Figure 3. The “b” concatenation block consists of convolutions of kernel size 3 × 3 and 1 × 1. Each convolution is followed by batch normalization and ReLU activation. The “+” means concatenation.
Figure 4. Comparison of predicted bounding boxes (blue) and ground truth boxes (green). A high IoU threshold can result in both predicted boxes being rated as errors. (A) The overlap is below 0.5. Due to incorrect annotations, the predicted bounding boxes are sometimes more accurate. (B) Imperfect prediction as the end of the object (vessel element) is not detected.
Table 1. Comparison of detection models based on the F2 score. “Ours” refers to our best WoodYOLO configuration.
Architecture     F2 Score
YOLOv10-S        0.691
YOLOv10-M        0.719
YOLOv7-W6        0.783
YOLOv7-tiny      0.723
Ours             0.848
Table 2. Comparison of different backbone networks. We used 2 neighbors for this experiment.
Backbone                 Params (M)    F2 Score
YOLOv7-tiny [14]         9.77          0.8146
VGG11-bn [39]            16.59         0.8316
RepVGG-A0 [50]           15.52         0.8168
ResNet-18 [41]           18.48         0.8096
EfficientNet-B0 [51]     10.78         0.8198
ConvNeXt-Nano [40]       22.33         0.8284
Table 3. Effect of using neighboring grid cells on detection performance.
Number of Neighbors    F2 Score
0                      0.8481
2                      0.8316
4                      0.8080
Table 4. Comparison of different IoU-based loss functions. We used 2 neighbors for this experiment.
IoU Loss    F2 Score
CIoU        0.8316
DIoU        0.8321
IoU         0.8293
GIoU        0.8340
Table 5. Effect of different image sizes on detection performance. We used 2 neighbors and CIoU for this experiment.
Image Size    F2 Score
1024          0.7863
2048          0.8316
4096          0.8243
Table 6. Comparison of approaches to increase F2.
Method                          F2 Score
Baseline (Ours)                 0.848
+ Mosaic Augmentation           0.786
+ Gradient Accumulation         0.838
+ No Maximum Size Constraint    0.841
