MoLi-Net: A Lightweight Brightness-Aware Model for Chinese Herbal Materials Recognition with an Auxiliary Module for Impurity Detection

Xu, Zilong; Jiang, Changcheng; Ding, Jianhui; Ding, Weiyang; Wan, Zhenping

doi:10.3390/electronics15122731


by Zilong Xu ¹, Changcheng Jiang ¹, Jianhui Ding ², Weiyang Ding ² and Zhenping Wan ¹,*

¹ School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510641, China

² Guangzhou Ruijia Industry Co., Ltd., Guangzhou 510730, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2731; https://doi.org/10.3390/electronics15122731 (registering DOI)

Submission received: 11 April 2026 / Revised: 14 June 2026 / Accepted: 19 June 2026 / Published: 21 June 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Object detection in complex industrial environments is hampered by insufficient dynamic weighting of local and global features, as well as by illumination variations and impurities. Moreover, existing models are overly complex, which directly impairs computational efficiency. To more accurately distinguish Chinese herbal materials with diverse morphologies, this paper proposes the MobileAttn module. Drawing on the idea of token representation in the Transformer architecture, this module extracts contextual information through global feature compression, fuses it with tokens to generate a spatial attention map, and realizes dynamic recalibration of convolutional features. This process enhances the feature weights of key semantic regions, suppresses redundant background information, and improves feature discriminability. To address illumination interference, brightness-aware weights are combined with dual-path (channel and spatial) attention for global control, dynamically reducing the impact of illumination; this component is named LightAttn. For Chinese herbal materials that contain common industrial unknown impurities (e.g., small stones and weeds), an auxiliary impurity detection module is proposed as a post-processing step independent of the main detection network. This module refines Non-Maximum Suppression (NMS) logic to distinguish target Chinese herbal materials from interfering impurities. Subsequently, it accurately locates and marks impurities on the conveyor belt, thereby achieving effective unknown impurity detection. Experimental results demonstrate that, compared with the original YOLOv11 on the Chinese herbal materials detection task, the optimized model achieves a 1.7% improvement in the overall mean Average Precision (mAP@0.5:0.95). On a per-class basis, gains are particularly pronounced for certain challenging high-aspect-ratio Chinese herbal materials: Prunella vulgaris and orange peel achieve respective AP improvements of 5.8% and 4.1%. Meanwhile, the model parameter count is reduced by 23.1% and the computational complexity by 20.3%. The F1-Score of the impurity detection results is 86.38%, verifying the effectiveness of the impurity detection auxiliary module.

Keywords: lightweight; Chinese herbal materials; attention mechanism; impurity detection

1. Introduction

The identification of Chinese herbal materials has significant application value in the field of traditional Chinese medicine (TCM). In industry in particular, the accuracy of identification before packaging directly affects the quality control and clinical application of Chinese herbal materials. However, Chinese herbal materials are highly diverse and span a wide range of samples, so manual inspection and traditional vision techniques struggle to meet the requirements of industrialization. With the development of computer vision technology, object detection algorithms based on deep learning have provided a new solution for Chinese herbal materials identification [1]. The YOLO series algorithms [2] have been widely applied in object detection tasks due to their low latency and high accuracy. Nevertheless, when YOLOv11 is used for Chinese herbal materials detection and identification, it still suffers from an excessive parameter count and high computational complexity, which lead to model overfitting and make it difficult to meet the high-precision requirements of complex industrial environments [3].

In the task of Chinese herbal materials identification, detection accuracy depends not only on the complexity of the model but also on factors such as insufficient dynamic weighting of local and global features, illumination conditions, and impurity interference [4]. In particular, different Chinese herbal materials vary in color depth: dark-colored materials absorb light strongly, while light-colored ones absorb it weakly. As a result, a single illumination condition cannot present the features of different Chinese herbal materials in a balanced way, leaving some feature information unclear and increasing detection difficulty [5]. Furthermore, Chinese herbal materials usually contain impurities, such as small stones, weeds, and dead branches, so an auxiliary impurity detection module is required for quality inspection. Most existing models address only one of the above problems and cannot handle practical scenarios in which several of them coexist.

Recent lightweight vision architectures, including MobileViT [6] and EfficientViT [7], improve efficiency by restricting self-attention. Contemporaneous work such as Mobile-Former sidesteps dense self-attention by using learnable global tokens as proxies that interact with convolutional features through cross-attention. However, these methods still incur non-negligible computational overhead from cross-attention or token mixing, limiting their suitability for low-power edge devices. In contrast, the proposed dynamic global feature weighting module (MobileAttn) aggregates global context via an extremely lightweight, attention-free additive fusion of pooled features and learnable tokens, and generates a spatial attention map for feature recalibration rather than updated token representations.

To this end, this study proposes a lightweight, brightness-aware network based on the YOLOv11 architecture, integrating an auxiliary module for impurity identification and detection. Three functional modules are optimized collaboratively, namely dynamic global feature weighting (MobileAttn), adaptive illumination adjustment (LightAttn), and precise identification of common industrial unknown impurities, to attain a favorable trade-off between detection performance, computational efficiency, and robustness for industrial Chinese herbal inspection. The improved model not only maintains high accuracy in Chinese herbal materials detection and impurity identification but also reduces the model size and computational complexity.

The main contributions of this paper are as follows:

This research proposes MobileAttn, a lightweight attention module that dynamically recalibrates convolutional features. By encoding global features into learnable tokens and fusing them to generate a spatial attention map, this mechanism adaptively enhances key regions and suppresses background noise, improving feature discriminability and generalization at a marginal computational cost.
An illumination-adaptive attention module denoted as LightAttn is proposed, which integrates brightness-aware weights with dual-path attention covering both channel and spatial dimensions to achieve global regulation, thereby dynamically mitigating the adverse effects induced by varying illumination conditions.
This paper replaces some convolutional layers (Conv) of YOLOv11 with depth-wise separable convolutional layers (DSConv) to realize a lightweight convolutional network, P-YOLOv11, which effectively reduces the model complexity. Taking this as the baseline model, the MoLi-Net network is constructed by fusing the MobileAttn and LightAttn modules.
A new auxiliary algorithmic module is proposed to help detect impurities on the conveyor belt by improving the IoU metric and implementing staged NMS.
The proposed improved model MoLi-Net is evaluated using the Chinese herbal materials dataset, and its detection mAP@0.5 reaches 96.6%. Compared with YOLOv11, the designed model achieves a better balance between accuracy and computational efficiency.

The structure of this paper is as follows: Section 2 introduces related work, including the research progress of object detection and impurity detection. Section 3 details the principles of the MobileAttn module, LightAttn module, and auxiliary impurity detection module. Section 4 verifies the performance of the designed model through experiments and conducts a comparative analysis with other models. Section 5 mainly discusses the limitations of this study and directions for future research. Finally, Section 6 summarizes the research results of this work.

2. Related Work

2.1. Object Detection Applications

Current object detection research generally faces growing model complexity and computational burden as accuracy improves. Among two-stage networks, Hong et al. [8] proposed in 2024 a two-stage IoU-aware feature fusion R-CNN model to enhance detection accuracy in dense scenarios, achieving an average precision (AP) of 55.9% on the SKU-110k dataset. In 2025, Song et al. [9] optimized Faster R-CNN for steel strip defect scenarios, adopting PSO-Gabor filtering and improving AlexNet, with a classification accuracy of 99.78%. However, the number of parameters of the above models is still large, especially ResNet50, which has more than 20 million parameters. Since 2024, research on models based on the DETR framework has been active. Liu et al. [10] and Zhang et al. [11] detected weld seams and domestic waste, respectively, based on RT-DETR. Although their mAP reached 80.1% and 66.75%, respectively, both rely on ResNet-like backbone networks, with parameter scales exceeding 42 million and 19.9 million, respectively.

To address the problem of insufficient dynamic weighting of local and global features, in 2024, Tu et al. [12] designed the NCGLF² network, which combined global and local features to improve target recognition performance. However, it lacks an adequate dynamic balance mechanism between high-frequency detail preservation and low-frequency semantic integration. In 2025, Zhang et al. [13] introduced the GLAF-DETR model in infrared maritime object detection, using a global–local adaptive fusion attention mechanism to cope with diverse target sizes and real-time requirements. Nevertheless, its attention-weight generation process relies on fixed-structure modules, lacking the ability to dynamically adjust the ratio of local to global features according to input content. In 2025, DFAS-YOLO proposed by Liu et al. [14] alleviated the feature shift problem caused by upsampling through dual-path feature-aware sampling, but its cross-level feature fusion still adopted a static weighting method, failing to achieve sample-wise and position-wise dynamic regulation. In a parallel direction, Dong et al. [15] proposed a moving object segmentation method that captures temporal distributions and spatial correlations via DIDL and SBR for robust perception in complex dynamic scenes. This approach aligns with the present work in adopting a modular attention-enhancement paradigm, where both employ lightweight attention modules to extract more discriminative features under background interference. Currently, illumination conditions lead to severe degradation of image quality, such as low contrast, significant noise, and detail loss, which directly affect the accuracy and robustness of object detection. To solve this problem, existing studies have proposed various enhancement methods, such as unsupervised illumination enhancement of EnlightenGAN, a Retinex-based variational denoising enhancement framework, and zero-reference curve estimation of Zero-DCE [16,17,18]. 
Although these methods improve image visibility, they ignore differences in importance across channels and spatial positions and are not deeply integrated with detection models.

In summary, although existing methods achieve excellent detection accuracy, they generally suffer from large parameter scales, complex structures, insufficient dynamic weighting of local and global features, and sensitivity to illumination conditions, which seriously restrict their application in industry. Therefore, making object detection models lightweight and suppressing the impact of illumination have become the focus of current research.

2.2. Advances in Impurity Detection Technology

In industrial production, target objects transported on assembly lines are often mixed with impurities, which affect product quality. Such objects have complex types and limited data. To realize impurity detection, and in particular to address the difficulty of identifying small-target impurities and the misjudgments caused by similar colors, Xu et al. [19] adopted the lightweight FasterNet-PConv backbone and applied reparameterization training to the RepBlock structure, effectively reducing convolutional redundancy. Furthermore, they implemented ByteTrack-based weed tracking to prevent repeated weeding operations. Liang et al. [20] proposed a solution combining lightweight convolution, an attention mechanism, and Focal Loss, which addressed the difficulties in detecting rice impurities and broken rice, achieving high-precision detection of 97.55%. In addition, the model proposed by Li et al. [21] addressed insufficient feature expression, detail loss, and background interference in the detection of small foreign bodies in large infusions by introducing the multi-scale dynamic enhancement network MSG-CECM module, increasing the mAP by 2.2%. Although these methods perform well under the closed-set setting, their reliance on prior annotations and difficulty in generalizing to common industrial unknown impurities limit their practical potential in open industrial environments.

At present, open-set recognition technology is gradually being applied to impurity detection tasks to address the challenges posed by unknown categories. For example, OpenMax, Grounding DINO and YOLO-World [22,23,24] have provided new ideas for detecting unknown categories in impurity recognition by introducing the concept of unknown classes, yet severe missed detections remain.

In summary, deep learning technology has made remarkable progress in the field of object detection, but the complexity and computational load of existing models limit their application in practical scenarios. Research on lightweight methods and impurity detection technologies provides new ideas and methods for solving these problems, laying a foundation for the subsequent proposal of improved methods.

3. Method

This section describes the proposed method. The goal is to design a lightweight detection model for Chinese herbal materials with brightness perception capability and an auxiliary impurity detection function.

3.1. The MoLi-Net Network Architecture

3.1.1. MobileAttn Module

To extract the feature information of Chinese herbal materials more efficiently, this study first attempted to increase the network depth, but doing so also gradually increases the number of channels. Although this design can progressively extract higher-level features, it results in substantial computational load and memory occupation. Moreover, although current convolutional neural networks can achieve global perception through deep stacking, that perception is still learned locally by convolution kernels, which capture global context weakly and cannot relate two distant pixel regions well. Studies have found that the introduction of the Transformer architecture [25] can significantly improve feature extraction efficiency. For example, Vision Transformer can achieve efficient feature extraction of targets through the self-attention mechanism, which allows any token in the sequence to interact directly with all other tokens in parallel and thus greatly strengthens global connections in the data. However, its high computational complexity makes it unsuitable for direct application [26]. To this end, this paper makes a lightweight improvement to the Transformer architecture. The original Transformer requires more computation for larger inputs because self-attention captures global dependencies by modeling the interaction between all pairs of elements in the input sequence. This process requires the dot product of query and key vectors, from which the attention matrix gives the degree of correlation between each pair of positions. Based on this observation, this study removes the original self-attention mechanism and places the resulting lightweight block in the shallow layers of the network. This approach reduces computational load while preserving the integrity of the feature maps. Furthermore, it leverages a token mechanism to facilitate efficient information interaction. For this reason, the MobileAttn module is proposed.

Traditional attention methods require each spatial pixel to interact with all other pixels for global information exchange, so the computational load grows quadratically with the spatial size, and much of the computation is spent on redundant associations between spatial positions rather than effective information exchange at the semantic level. To this end, in the feature extraction stage shown in Figure 1, tokens are used as agents for global interaction. First, global average pooling extracts channel features to form an initial semantic representation, which confines global information interaction to a minimal number of tokens. Then, starting from random initialization, each token learns to capture specific semantic information, aggregating scattered spatial features into a highly directed semantic representation:

\bar{t} = \frac{1}{K} \sum_{i = 1}^{K} T_{i} \in ℝ^{B \times D \times 1 \times 1}

(1)

where K denotes the total number of tokens, B is the batch size, D represents the token dimension, and T_i denotes the i-th token.

The Token vectors generated in the MobileAttn module need to undergo deep feature transformation and semantic enhancement processing through MLP [27] to meet the demand for highly discriminative features in object detection tasks. As shown in Figure 2, different from the traditional two-layer or multi-layer MLP structure, the Token MLP module adopts an extremely simple and efficient MLP architecture, which only includes a single-layer fully connected layer, combined with LayerNorm normalization and ReLU activation function [28], thus achieving a balance between computational efficiency and feature enhancement.

MLP processing may leave the token vectors with a spatial size different from that of the original feature map. As shown in Figure 3, the processed tokens are therefore expanded through convolution operations to match the size of the original feature map. The advantage of the Transformer architecture is that global correlation can be achieved, making up for the locality of traditional convolutional neural networks. Using this property, all targets in the data can be correlated, and channel weighting of the feature map is performed with the attention map, assigning high weights to features related to the target and thereby suppressing the background:

F_{out 1} = x ⊙ σ (W * (\bar{t} \times (H \times W)) + b)

(2)

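where x denotes the input feature map, ⊙ denotes the element-wise product, σ represents the sigmoid activation function, and \bar{t} × (H × W) denotes the averaged token expanded to the spatial resolution H × W. Inside the sigmoid, W and b are the learned weights and bias, respectively (this W is distinct from the width W in H × W), while ∗ denotes the 1 × 1 convolution operation.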
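As a rough illustration of Equations (1) and (2), the following Python sketch averages the tokens and uses the result to gate a feature map for a single sample. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`token_mean`, `mobile_attn_gate`), the nested-list data layout, and the mapping from token dimension D to C channels by the 1 × 1 convolution are all hypothetical, and the Token MLP and the fusion with pooled features are omitted. Because the averaged token is expanded to a spatially constant map, a 1 × 1 convolution of it yields one value per output channel, which the sketch computes directly.

```python
import math


def token_mean(tokens):
    """Eq. (1): average K tokens of dimension D into a single D-vector."""
    k = len(tokens)
    d = len(tokens[0])
    return [sum(tok[j] for tok in tokens) / k for j in range(d)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mobile_attn_gate(x, tokens, weight, bias):
    """Eq. (2) for one sample (assumed layout, for illustration only).

    x:      C x H x W feature map as nested lists
    tokens: K x D learnable tokens
    weight: C x D weights of the 1 x 1 convolution (assumed shape)
    bias:   C biases
    """
    t_bar = token_mean(tokens)
    out = []
    for c, plane in enumerate(x):
        # A 1 x 1 convolution over the spatially expanded token gives the
        # same logit at every position, so it is computed once per channel.
        logit = sum(w * t for w, t in zip(weight[c], t_bar)) + bias[c]
        gate = sigmoid(logit)
        out.append([[v * gate for v in row] for row in plane])
    return out
```

With zero weights and bias the gate is 0.5 everywhere, so the output is simply half of the input; trained weights would instead raise or lower each channel according to the token content.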

3.1.2. LightAttn Module

To address the issue of degraded robustness in industrial vision systems under complex illumination conditions, this investigation proposes a lightweight illumination-adaptive attention module, denoted as LightAttn. As shown in Figure 4, this module draws on the CBAM attention mechanism [29] and incorporates a brightness perception sub-network to realize dynamic reweighting of feature maps, thereby significantly suppressing illumination disturbances and enhancing the discriminability of target features.

In convolutional neural networks (CNNs), convolutional layers can usually only capture local features because the size of the convolution kernel is limited [30]. Although stacking multiple convolutional layers can expand the receptive field, it is still difficult to directly obtain global information. As shown in Figure 4, first, the channel attention branch compresses the input feature map F ∈ ℝ^{C×H×W} into a channel descriptor g_c ∈ ℝ^{C×1×1} using global average pooling to capture global contextual information. When the number of channels C and the spatial dimension H×W are large, directly processing the high-dimensional feature map F ∈ ℝ^{C×H×W} will lead to a significant increase in the number of parameters and computational load. To address this issue, the proposed method maps channel features to a low-dimensional latent space through a bottleneck structure and then reconstructs them, which effectively reduces the number of parameters and enhances nonlinear representation capabilities. Finally, the channel weight M_c ∈ [0,1]^C is obtained through normalization by the Sigmoid function, and this weight can realize adaptive enhancement of key channels [31].

In the process of feature extraction, optimizing feature expression is a problem that needs to be solved. Traditional convolutional neural networks (CNNs) mainly rely on convolutional layers to extract local features and lack effective utilization of global spatial information, resulting in limited ability of the model to identify key regions of targets. To this end, this investigation uses a max-average pooling collaborative method to extract saliency features and global statistical features along the channel dimension, respectively, generating a dual-channel feature map F_spa ∈ ℝ^{2×H×W}. A large-size 7 × 7 convolution kernel is used to fuse local cross-channel information, and a spatial weight map M_s ∈ [0,1]^{1×H×W} is output, enabling the model to focus on key spatial regions of the target while suppressing background information.

To cope with backlight and low-illumination scenarios, the module adds a brightness perception sub-network. This sub-network generates a brightness weight M_b ∈ [0,1] via global average pooling and a lightweight convolutional block, adaptively adjusting the brightness distribution of the feature map. Linear enhancement is applied to the dark regions, and M_b is utilized to gate the channel and spatial attention weights to suppress noise transmission in shadow regions, as formulated in Equations (3)–(5):

F^{'} = F + F ⊙ (1 - M_{b})

(3)

M_{s} = A_{s} ⊙ M_{b}

(4)

F_{out 2} = F + F ⊙ M_{s}

(5)

where F denotes the input feature map, F' is the brightness-adjusted feature map, A_s ∈ [0,1]^{1×H×W} represents the spatial attention weights, ⊙ denotes element-wise multiplication, and F_out2 is the final output feature map.
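To make Equations (3)–(5) concrete, the short Python sketch below evaluates them literally on nested lists. It is only an illustrative sketch: the function name `light_attn_combine` and the data layout are assumptions, the brightness weight M_b and spatial attention A_s are taken as given inputs rather than produced by sub-networks, and the channel attention branch is left out.

```python
def light_attn_combine(feat, spatial_attn, m_b):
    """Literal evaluation of Eqs. (3)-(5) for one sample.

    feat:         C x H x W feature map F as nested lists
    spatial_attn: H x W spatial attention A_s with values in [0, 1]
    m_b:          scalar brightness weight M_b in [0, 1]
    """
    # Eq. (3): F' = F + F * (1 - M_b); features are scaled up more when M_b is small
    f_prime = [[[v + v * (1.0 - m_b) for v in row] for row in plane] for plane in feat]
    # Eq. (4): M_s = A_s * M_b; the brightness weight gates the spatial attention
    m_s = [[a * m_b for a in row] for row in spatial_attn]
    # Eq. (5): F_out2 = F + F * M_s; residual re-weighting of the input features
    f_out = [
        [[v + v * m for v, m in zip(row, m_row)] for row, m_row in zip(plane, m_s)]
        for plane in feat
    ]
    return f_prime, m_s, f_out
```

For example, with F = [[[1.0, 2.0]]], A_s = [[0.5, 1.0]], and M_b = 0.5, the gated spatial weights are [[0.25, 0.5]] and the output is [[[1.25, 3.0]]].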

The LightAttn module is integrated into the backbone network in the form of residuals. The channel attention weight acts on the input feature map first to realize feature selection in the channel dimension, then the spatial attention performs spatial recalibration on the channel-weighted results, and finally the brightness perception weight is used for global control of the intensity of the dual-path attention. The final output feature is F_out2, which not only retains the original contextual information but also realizes the collaborative optimization of illumination disturbance suppression and discriminative feature enhancement.

3.1.3. Network Architecture Optimization

First, the original YOLOv11 model undergoes lightweight processing. As shown in Figure 5, the original Conv modules are replaced with DSConv modules [32]. This is because DSConv adopts a decoupled design of spatial convolution and channel convolution, which can balance the feature extraction of the dual channels of the LightAttn module and significantly reduce computational complexity. However, this will lead to a certain degree of degradation in feature extraction capability. To address this, standard convolution is used in the shallow feature extraction stage, which can effectively extract texture and edge information from the original images. The parameter count and computational complexity of the DSConv and Conv modules can be computed via the following formulas:

Params_Conv = c_{1} \times c_{2} \times k^{2}

(6)

Params_DSConv = c_{1} \times k^{2} + c_{1} \times c_{2}

(7)

FLOPs_Conv = 2 \times H \times W \times c_{1} \times c_{2} \times k^{2}

(8)

FLOPs_DSConv = 2 \times H \times W \times (c_{1} \times k^{2} + c_{1} \times c_{2})

(9)

where c_{1} and c_{2} denote the number of input and output channels, respectively, and k represents the kernel size. In this network, all convolutional layers employ a kernel size of 3. H \times W denotes the spatial dimension of the output feature maps, and the default number of groups is set to 1. As shown in the above formulations, the parameter count and computational cost of DSConv compared to standard Conv depend on the number of output channels c_{2}. For a typical convolution with kernel size k = 3, the parameter ratio is:

\frac{Params_DSConv}{Params_Conv} = \frac{1}{9} + \frac{1}{c_{2}}

(10)

When c_{2} is large (e.g., c_{2} ≥ 64 in most network layers), the term 1 / c_{2} becomes negligible, and DSConv requires approximately one-ninth of the parameters and FLOPs compared to standard convolutions.
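The parameter and FLOP counts in Equations (6)–(10) can be checked numerically with a few lines of Python. This is a small sketch for illustration; the function names are hypothetical, and bias terms are ignored, as in the formulas.

```python
def conv_params(c1, c2, k=3):
    """Eq. (6): parameters of a standard k x k convolution."""
    return c1 * c2 * k * k


def dsconv_params(c1, c2, k=3):
    """Eq. (7): depth-wise k x k convolution plus 1 x 1 point-wise convolution."""
    return c1 * k * k + c1 * c2


def conv_flops(h, w, c1, c2, k=3):
    """Eq. (8): FLOPs of a standard convolution on an H x W output map."""
    return 2 * h * w * conv_params(c1, c2, k)


def dsconv_flops(h, w, c1, c2, k=3):
    """Eq. (9): FLOPs of DSConv on an H x W output map."""
    return 2 * h * w * dsconv_params(c1, c2, k)


def param_ratio(c1, c2, k=3):
    """Eq. (10) in general form: equals 1 / k**2 + 1 / c2."""
    return dsconv_params(c1, c2, k) / conv_params(c1, c2, k)
```

For instance, with c_1 = 64, c_2 = 128, and k = 3, a standard convolution has 73,728 parameters and DSConv has 8,768, a ratio of about 0.119, close to the one-ninth limit. The FLOP ratio is identical because both counts share the factor 2 × H × W.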

Then, with the P-YOLOv11 network structure as the baseline model, the MobileAttn module is added to the Backbone. Through its dynamic recalibration mechanism, an adaptive feature selection prior is established, thereby optimizing the information density and discriminability of feature representation and achieving collaborative optimization of lightweight and performance improvement. Finally, the LightAttn module is added to the Neck to meet the demand of the Neck for processing multi-scale features. Its channel and spatial attention can dynamically screen key features, and the brightness perception branch can reduce illumination interference, thereby improving the performance of the model.

3.2. Auxiliary Module for Impurity Detection

Impurities in Chinese medicinal herbs are mostly small-sized and irregularly shaped targets. Due to their blurred boundaries and susceptibility to background interference, the traditional Intersection over Union (IoU), which relies on the area of the union, tends to cause regression deviations for small targets. To simultaneously achieve the detection and recognition of Chinese medicinal herbs and the screening of potential common industrial unknown impurities on the conveyor belt, the module first uses the spatiotemporal prior information of the fixed-camera conveyor belt scene to perform constrained extraction of the moving foreground in the region of interest. Subsequently, it combines multi-scale feature fusion and multi-feature candidate screening to suppress background residuals, boundary trailing, and small-scale false targets. Finally, stable recognition and accurate localization of common industrial unknown impurities are realized through improved IoU, stage-wise non-maximum suppression (NMS), and adaptive confidence threshold.

Foreground extraction is performed on the input image I ∈ ℝ^{H×W×C} within the region of interest Ω_ROI ⊆ ℝ², and a candidate foreground mask M₀ is constructed. Considering that impurity detection mainly occurs in the effective working area of the conveyor belt, a regional constraint is introduced to retain only the foreground responses within the ROI. After background modeling, threshold segmentation, and morphological processing, the candidate foreground mask is defined as:

M_{0} (p) = \{\begin{matrix} 1, & p \in Ω_{ROI} and p \in Ω_{fg} \\ 0, & otherwise \end{matrix}

(11)

where p = (x, y) denotes the pixel position, and Ω_fg represents the foreground region obtained after background modeling, threshold segmentation, and morphological processing. This working area constraint suppresses invalid background responses and reduces the generation probability of false candidate regions. Multi-scale features are then fused, as formulated in Equation (12):

F_{fused} = α \cdot \mathrm{Upsample} (F_{high}) \otimes W_{high} + β \cdot F_{low} \otimes W_{low}

(12)

where F_high and F_low represent the high-level and low-level semantic feature maps, respectively, and α and β are adaptively learned fusion coefficients optimized via gradient descent, with the total loss of the detection task as the loss function. Both coefficients are initialised to 0.5, so that high-level and low-level features contribute equally at the start of training. This symmetric initialisation promotes stable gradient updates in early epochs and allows the network to adaptively learn the optimal fusion ratio from the data. The learning rate of the main network is set to 0.01, and the fusion ratio is dynamically adjusted during training. In Equation (12), ⊗ denotes the convolution operation, and W_high and W_low are the corresponding convolution kernel weights. This feature fusion scheme effectively alleviates the problem of feature loss when detecting impurities of unknown size.

To further suppress false detections caused by conveyor belt contamination, edge afterimages, and local noise, this study extracts the area, aspect ratio, rectangular filling rate, and rotated rectangular filling rate for each candidate connected region R_i, and constructs a multi-feature joint screening criterion as:

S (R_{i}) = \{\begin{matrix} 1, & A_{i} \in [A_{\min}, A_{\max}], ρ_{i} \in [ρ_{\min}, ρ_{\max}], ϕ_{i} \geq ϕ_{\min}, ψ_{i} \geq ψ_{\min} \\ 0, & otherwise \end{matrix}

(13)

A_{i} = Area (R_{i})

(14)

ρ_{i} = \frac{w_{i}}{h_{i}}

(15)

ϕ_{i} = \frac{A_{i}}{w_{i} h_{i}}

(16)

ψ_{i} = \frac{A_{i}}{A_{i}^{rot}}

(17)

Here, S(R_i) = 1 indicates that the candidate region R_i passes the screening; A_i is the region area; ρ_i is the aspect ratio; ϕ_i is the filling rate of the candidate region in its horizontal bounding rectangle; ψ_i is the filling rate of the candidate region in its minimum rotated bounding rectangle; w_i and h_i are the width and height of the horizontal bounding rectangle of the candidate region, respectively; and A_{i}^{rot} is the area of the minimum rotated bounding rectangle. This criterion effectively eliminates false candidate regions with abnormal areas, unreasonable geometric shapes, or insufficient edge support, thereby improving the effectiveness of subsequent detection inputs.
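The screening rule of Equations (13)–(17) can be sketched in Python as follows. This is a minimal sketch under assumptions: the `Region` container, the function name `passes_screening`, and every threshold value are hypothetical placeholders, since the source does not report the thresholds it uses.

```python
from dataclasses import dataclass


@dataclass
class Region:
    area: float      # A_i, area of the connected region in pixels
    width: float     # w_i, width of the horizontal bounding rectangle
    height: float    # h_i, height of the horizontal bounding rectangle
    rot_area: float  # A_i^rot, area of the minimum rotated bounding rectangle


def passes_screening(
    r,
    area_range=(20.0, 5000.0),  # [A_min, A_max], placeholder values
    rho_range=(0.2, 5.0),       # [rho_min, rho_max], placeholder values
    phi_min=0.3,                # placeholder value
    psi_min=0.4,                # placeholder value
):
    """Return True when S(R_i) = 1 in Eq. (13)."""
    rho = r.width / r.height            # Eq. (15): aspect ratio
    phi = r.area / (r.width * r.height)  # Eq. (16): fill rate, horizontal box
    psi = r.area / r.rot_area            # Eq. (17): fill rate, rotated box
    return (
        area_range[0] <= r.area <= area_range[1]
        and rho_range[0] <= rho <= rho_range[1]
        and phi >= phi_min
        and psi >= psi_min
    )
```

Under these placeholder thresholds, a compact 100-pixel region in a 20 × 10 box passes, whereas a thin 100 × 5 streak is rejected by the aspect-ratio bound, which is the kind of boundary trailing the criterion is meant to remove.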

Building on the aforementioned spatial domain filtering, this paper further develops a temporal association model based on inter-frame Euclidean distance and a continuous hit count-based stability judgment rule. Through multi-frame position continuity verification, only real moving impurities that persist continuously and stably across consecutive frames are retained. The proposed two-stage mechanism is serially cascaded and synergistically complementary, comprehensively improving the quality of impurity candidate bounding boxes from both spatial and temporal dimensions. This process is mathematically formulated as follows:

D_{i j}^{(t)} = \sqrt{{(c_{x, i}^{(t)} - c_{x, j}^{(t - 1)})}^{2} + {(c_{y, i}^{(t)} - c_{y, j}^{(t - 1)})}^{2}}

(18)

H_{i}^{(t)} = \begin{cases} H_{i}^{(t - 1)} + 1, & D_{i j}^{(t)} < τ_{d} \\ 1, & \text{otherwise} \end{cases}

(19)

H_{i}^{(t)} \geq H_{stable}

(20)

where (c_{x, i}^{(t)}, c_{y, i}^{(t)}) and (c_{x, j}^{(t - 1)}, c_{y, j}^{(t - 1)}) denote the center coordinates of the candidate target in the current frame and the stable target in the previous frame, respectively; D_{i j}^{(t)} denotes the Euclidean distance between the centers of candidate bounding boxes across consecutive frames; H_{i}^{(t)} quantifies the continuous hit count of objects within the region; τ_d is the distance matching threshold; and H_stable is the trajectory stability judgment threshold. This joint method leverages temporal features to strengthen the attribute evaluation of objects in the bounding box, suppresses transient interference, and improves the robustness and accuracy of impurity recognition.
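To make the temporal rule concrete, the following sketch applies Equations (18)–(20) to a short sequence of frames. It rests on assumptions the source does not state: each candidate is matched to its nearest previous center, every candidate from the previous frame is eligible for matching, and the values of `TAU_D` and `H_STABLE` are hypothetical.

```python
import math

TAU_D = 25.0   # hypothetical distance threshold tau_d, in pixels
H_STABLE = 3   # hypothetical stability threshold H_stable, in frames


def update_hits(prev_tracks, centers, tau_d=TAU_D):
    """Apply Equations (18)-(19) to one frame.

    prev_tracks: list of ((cx, cy), hit_count) from frame t-1.
    centers: list of (cx, cy) candidate centers in frame t.
    Returns the list of ((cx, cy), hit_count) for frame t.
    """
    tracks = []
    for cx, cy in centers:
        hits = 1     # reset value when no previous target is close enough
        best = None  # smallest matching distance found so far
        for (px, py), prev_hits in prev_tracks:
            d = math.hypot(cx - px, cy - py)  # Equation (18)
            if d < tau_d and (best is None or d < best):
                best = d
                hits = prev_hits + 1          # Equation (19)
        tracks.append(((cx, cy), hits))
    return tracks


def stable(tracks, h_stable=H_STABLE):
    """Keep only the candidates that satisfy Equation (20)."""
    return [center for center, hits in tracks if hits >= h_stable]


# An impurity drifting 10 px per frame, plus a one-frame noise blob.
frames = [
    [(100.0, 50.0)],
    [(110.0, 50.0), (400.0, 300.0)],
    [(120.0, 50.0)],
]
tracks = []
for centers in frames:
    tracks = update_hits(tracks, centers)
print(stable(tracks))
```

The drifting impurity accumulates three consecutive hits and is retained, while the blob that appears in a single frame never reaches the stability threshold.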

Traditional Intersection over Union (IoU) is insensitive to small localization errors in tiny targets. For small-sized impurities, a minor positional shift of the bounding box causes only a negligible change in the union area, leading to insufficient gradient feedback and poor regression accuracy. To address this limitation, the Small-Box IoU (SB-IoU) regression metric is proposed, which uses the area of the smaller of the predicted box and the ground truth box as the denominator instead of the union area in traditional IoU. Because the intersection of two boxes can never exceed the area of the smaller one, SB-IoU always lies between 0 and 1:

A_{intersect} = \max (0, \min (x_{2}, x_{2}^{'}) - \max (x_{1}, x_{1}^{'})) \times \max (0, \min (y_{2}, y_{2}^{'}) - \max (y_{1}, y_{1}^{'}))

(21)

A_{small\text{-}box} = \min ((x_{2} - x_{1}) \times (y_{2} - y_{1}), (x_{2}^{'} - x_{1}^{'}) \times (y_{2}^{'} - y_{1}^{'}))

(22)

\text{SB-IoU} = \frac{A_{intersect}}{A_{small\text{-}box} + ε}

(23)

where ε is a small constant to avoid division by zero. Compared with traditional IoU based on union area, SB-IoU amplifies the impact of small positional deviations, improving bounding box regression for small-sized targets and meeting the precise localization requirement for common industrial unknown impurities.

In addition, to make the horizontal bounding rectangle approximate the real impurity boundary more closely, this study further designs an adaptive boundary shrinking mechanism based on local foreground distribution. Let the initial candidate box be B = (x, y, w, h) and its internal foreground binary mask be M_B. Taking the left boundary correction as an example, the corrected position is defined as:

x^{'} = x + δ_{l}

(24)

\sum_{v = y}^{y + h} M_{B} (x + δ_{l}, v) \geq η_{x}

(25)

Similarly, the positions x″, y′, and y″ after right, top, and bottom boundary corrections can be derived to determine the final shrunk candidate box, as expressed below:

B^{'} = (x^{'}, y^{'}, w^{'}, h^{'})

(26)

w^{'} = x^{″} - x^{'} + 1

(27)

h^{'} = y^{″} - y^{'} + 1

(28)
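The boundary shrinking of Equations (24)–(28) can be illustrated with a small pure-Python sketch. Several points are assumptions rather than statements from the source: δ_l is taken to be the smallest inward offset for which Equation (25) holds, the box is treated as covering columns x to x + w − 1 and rows y to y + h − 1, an analogous row threshold `eta_y` is used for the top and bottom boundaries, and the function name `shrink_box` is hypothetical.

```python
def shrink_box(mask, box, eta_x=1, eta_y=1):
    """Shrink box = (x, y, w, h) onto the foreground of a binary mask.

    mask[v][u] is 1 for foreground and 0 for background (image coordinates).
    Each boundary moves inward until the column (or row) it rests on contains
    at least eta_x (or eta_y) foreground pixels, following Equation (25).
    Returns the shrunk box, or None when no column or row has enough support.
    """
    x, y, w, h = box
    cols = range(x, x + w)
    rows = range(y, y + h)

    def col_sum(u):
        return sum(mask[v][u] for v in rows)

    def row_sum(v):
        return sum(mask[v][u] for u in cols)

    good_cols = [u for u in cols if col_sum(u) >= eta_x]
    good_rows = [v for v in rows if row_sum(v) >= eta_y]
    if not good_cols or not good_rows:
        return None

    x1, x2 = good_cols[0], good_cols[-1]  # x' and x''
    y1, y2 = good_rows[0], good_rows[-1]  # y' and y''
    return (x1, y1, x2 - x1 + 1, y2 - y1 + 1)  # Equations (26)-(28)


mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(shrink_box(mask, (0, 0, 8, 6)))
print(shrink_box(mask, (0, 0, 8, 6), eta_x=2, eta_y=2))
```

Raising the thresholds makes the box ignore thin protrusions: with a threshold of 2, the single-pixel tips on the left and right of the blob no longer count as boundary support.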

{IoU}_{pair} (d_{i}, d_{j}) = \frac{A_{intersect} (d_{i}, d_{j})}{\min (A_{area} (d_{i}), A_{area} (d_{j}))}

(29)

Subsequently, if IoU_pair > τ_IoU and d_i < d_j, the detection box d_i is removed. This process is repeated until the IoU value between any two boxes in the set is less than the threshold τ_IoU. Experiments show that, compared with traditional single-stage NMS, this strategy reduces the repeated false detection rate for the same unknown impurity.
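Equations (21)–(23) and Equation (29) share the same ratio, intersection over the smaller box area, so they can be sketched together. The code below is an illustrative reading, not the authors' implementation: the comparison d_i < d_j is assumed to mean that d_i has the lower confidence score, the suppression is written as a greedy pass over score-sorted boxes, and the names `sb_iou`, `suppress`, and the default `tau_iou=0.5` are hypothetical.

```python
EPS = 1e-7


def area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = b
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Equation (21): overlap area of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih


def sb_iou(a, b, eps=EPS):
    """Equations (22)-(23): intersection over the smaller box area."""
    return intersection(a, b) / (min(area(a), area(b)) + eps)


def iou(a, b, eps=EPS):
    """Conventional IoU over the union area, for comparison."""
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter + eps)


def suppress(dets, tau_iou=0.5):
    """Greedy pairwise suppression in the spirit of Equation (29).

    dets: list of (box, score). ASSUMPTION: "d_i < d_j" in the text is read
    as "d_i has the lower confidence score", so the lower-scored box of an
    overlapping pair is the one removed.
    """
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(sb_iou(box, k) <= tau_iou for k, _ in kept):
            kept.append((box, score))
    return kept


small = (10, 10, 20, 20)   # 100 px^2 box
large = (8, 8, 24, 24)     # 256 px^2 box that fully contains it
dets = [(small, 0.9), (large, 0.6), ((50, 50, 60, 60), 0.8)]
print(sb_iou(small, large), iou(small, large))
print([score for _, score in suppress(dets)])
```

In this example the conventional IoU of the nested pair is about 0.39, which would survive a 0.5 threshold, whereas the small-box ratio is close to 1. A box that fully encloses a smaller one therefore scores near 1 under this measure, which is what lets the same ratio merge duplicate detections of one impurity.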

Finally, for unknown impurity category detection, an adaptive confidence threshold strategy is adopted. The dynamic threshold is calculated by the following equation:

τ_{adaptive} = τ_{base} + λ \cdot \left(1 - \frac{TP(k)}{TP(k) + FP(k) + ε}\right)

(30)

where τ_base is the baseline confidence threshold; λ is the adjustment coefficient; and TP(k) and FP(k) denote the number of true positives and false positives for category k, respectively. This strategy effectively suppresses false detections in scenarios with high class imbalance and reduces the false positive rate of common industrial unknown impurities.
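As a small worked example of Equation (30), the function below raises the confidence threshold as the running precision of a category falls. The default values of `tau_base` and `lam` are hypothetical, since the source does not report them.

```python
def adaptive_threshold(tp, fp, tau_base=0.25, lam=0.3, eps=1e-7):
    """Equation (30): raise the threshold when precision for a class is low.

    tau_base and lam are illustrative defaults, not values from the paper.
    """
    precision = tp / (tp + fp + eps)
    return tau_base + lam * (1.0 - precision)


# High precision leaves the threshold near its baseline;
# low precision pushes it upward.
print(round(adaptive_threshold(tp=90, fp=10), 4))
print(round(adaptive_threshold(tp=50, fp=50), 4))
```

The bracketed term in Equation (30) is one minus the precision of category k, so the threshold stays within the interval from τ_base to τ_base + λ.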

Together, these components give the auxiliary impurity detection module a complete pipeline, from feature extraction and bounding box refinement to category recognition and localization, for detecting impurities during industrial conveyor belt transportation.

4. Experiments and Results

To verify the performance of MoLi-Net in Chinese herbal materials detection and the effectiveness of the auxiliary impurity detection module, this study conducted the following experiments: a Chinese herbal materials dataset was constructed, relevant experimental parameters were configured, and finally, experiments were implemented with detailed analysis of the results.

4.1. Dataset

As shown in Figure 6, a self-built Chinese herbal materials image dataset named CH-MO is established in this paper. All images were captured in a controlled illumination environment at a resolution of 5120 × 5120 px, and divided into a training set (Train, 3591 images), a validation set (Val, 1197 images), and a test set (Test, 1197 images) with a ratio of 6:2:2.

Because the black experimental platform absorbed light, it was replaced with a white platform that simulates the white conveyor belt used in industrial production. Image acquisition was performed using Hikvision MVS software (Version 4.3.0) with a theoretical frame rate of 14.2223 fps, a 24 V LED light source, and rising-edge trigger polarity. The software environment for this experiment was PyTorch 2.2.0 and CUDA 12.7, with an NVIDIA RTX 4090 GPU used for training and testing. All quantitative results reported in this manuscript are averages over 3 independent experiments with different random seeds; the standard deviations of all mAP metrics are less than 0.6%, indicating stable and reproducible results. As listed in Table 1, several hyperparameters were adjusted to reduce the risk of overfitting. To improve the detection performance and generalization ability of the model in complex scenarios, the following augmentation strategies were applied to the training set: random rotation (degrees: 20.0), scaling (0.5–1.5×), horizontal flip (probability 0.5), HSV color jitter (hue ±1.5%, saturation ±70%, value ±40%), and translation (±10%). No augmentation was performed on the validation and test sets.
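For readers who want to reproduce the setup, the settings in Table 1 and the augmentation list above can be collected into a single configuration. The dictionary below is a sketch only: the key names follow the argument naming used by Ultralytics-style YOLO trainers, which is an assumption, because the source does not name the training framework.

```python
# Hypothetical mapping of Table 1 and the augmentation list onto
# Ultralytics-style argument names; the source does not name its trainer.
train_cfg = {
    "epochs": 100,
    "batch": 16,
    "imgsz": 640,
    "lr0": 0.01,          # initial learning rate
    "optimizer": "auto",
    "close_mosaic": 10,   # disable mosaic for the last 10 epochs
    "degrees": 20.0,      # random rotation range, +/- degrees
    "scale": 0.5,         # scaling gain, i.e. 0.5x to 1.5x
    "fliplr": 0.5,        # horizontal flip probability
    "hsv_h": 0.015,       # hue jitter, +/- 1.5%
    "hsv_s": 0.7,         # saturation jitter, +/- 70%
    "hsv_v": 0.4,         # value jitter, +/- 40%
    "translate": 0.1,     # translation, +/- 10% of image size
}

for key, value in train_cfg.items():
    print(f"{key}: {value}")
```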

To rigorously evaluate the practical industrial deployment capability and real-time suitability, hardware-in-the-loop benchmarking was conducted on an NVIDIA Jetson Orin Nano 8 GB Developer Kit. This device serves as a representative low-power (15 W) edge AI computing platform based on the ARM aarch64 architecture. The deployment environment was configured with Ubuntu 22.04, CUDA 12.6, and TensorRT 10.7.0. To maximize hardware execution efficiency, all evaluated models were compiled into TensorRT engines utilizing FP16 half-precision, fully leveraging the Ampere architecture’s Tensor Cores and kernel-level Operator Fusion. Furthermore, rather than relying solely on theoretical FLOPs, the practical inference speed (FPS) was profiled continuously over 500 iterations using dummy tensors. This strict benchmarking protocol effectively isolates pure network forward-pass efficiency by eliminating random I/O fluctuations, ensuring a highly reliable assessment of real-world edge deployment.
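The timing protocol described above, 500 forward passes on dummy input with I/O excluded, can be sketched generically as follows. This is not the authors' benchmarking script: `measure_fps` and `fake_forward` are hypothetical names, the warm-up phase is an added assumption, and the placeholder workload stands in for a compiled TensorRT engine. On a GPU, the device would also need to be synchronized before each clock reading so that asynchronous kernels are fully counted.

```python
import time


def measure_fps(forward, dummy_input, iters=500, warmup=20):
    """Time `iters` calls of `forward` on a fixed dummy input.

    The warm-up calls (an assumption, not stated in the source) let caches
    and lazy initialisation settle before the timed loop starts.
    """
    for _ in range(warmup):
        forward(dummy_input)
    start = time.perf_counter()
    for _ in range(iters):
        forward(dummy_input)
    elapsed = time.perf_counter() - start
    return iters / elapsed


def fake_forward(x):
    # Placeholder workload standing in for a network forward pass.
    return sum(v * v for v in x)


fps = measure_fps(fake_forward, [0.5] * 1000)
print(f"{fps:.1f} iterations per second")
```

Reusing one preallocated input for every iteration is what removes data loading and preprocessing from the measurement, so the figure reflects forward-pass cost only.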

4.2. Comparative Experiments

4.2.1. Accuracy Comparison

As shown in Table 2, this study evaluated several mainstream object detection algorithms on the CH-MO dataset. Five metrics were adopted for result assessment: mAP@0.5, mAP@0.5:0.95, FLOPs, Parameters, and Recall. It can be observed from Table 2 that without pre-trained weights, single-stage detectors generally achieve better performance than two-stage detectors. MoLi-Net achieves the highest mAP@0.5 (96.6%) and mAP@0.5:0.95 (76.0%) of all compared models while remaining among the lightest, with 2.0 M parameters and 5.1 G FLOPs; only YOLOv9 has fewer parameters (1.7 M), and only P-YOLOv11 has a lower computational cost (5.0 G). Despite the lightweight design of the network, the introduction of the MobileAttn and LightAttn modules enhances the detection performance of the model.

Experimental results show that both YOLOv11 and MoLi-Net achieve an overall mAP@0.5 of over 95%. However, two categories, prunella vulgaris and orange peel, show a mAP@0.5 below 90% in the YOLOv11 model; in particular, the mAP@0.5:0.95 of prunella vulgaris is lower than 60%. The proposed model improves accuracy on both categories, with detailed data presented in Table 3. For prunella vulgaris, MoLi-Net increases the mAP@0.5:0.95 by 5.8 percentage points compared with YOLOv11n (from 58.3% to 64.1%). For orange peel, the mAP@0.5 and mAP@0.5:0.95 are improved by 1.9 and 4.1 percentage points, respectively. Nevertheless, because the structural lightweighting replaces some standard convolutions with depth-wise separable convolutions, the model’s ability to extract certain features is slightly degraded, and the Recall for prunella vulgaris is 1 percentage point lower in MoLi-Net than in YOLOv11n (82.2% versus 83.2%). As can be seen from Figure 7, the detection confidence of MoLi-Net is more than 10% higher than that of YOLOv11, and the comprehensive detection effect exceeds 60%. Meanwhile, YOLOv11 suffers from a missed detection in the figure, indicated by a confidence level below 50%. In the MoLi-Net architecture, LightAttn performs brightness adjustment on the feature maps, and MobileAttn highlights key features through global dynamic weighting to reduce background interference, enabling more robust object detection.

In terms of inference speed, the lightweight baseline P-YOLOv11 achieves 124 FPS, yielding an 18% speedup over YOLOv11 (105 FPS). Even after integrating the MobileAttn and LightAttn modules, MoLi-Net maintains a competitive 108 FPS while delivering improved detection accuracy. This demonstrates that the proposed model successfully balances accuracy and efficiency for industrial deployment.

4.2.2. Efficiency Comparison

Finally, the proposed model is compared with various lightweight models. As shown in Table 4, MoLi-Net achieves the best accuracy and Recall among them. Its mAP@0.5 of 96.6% is 1.8 to 2.5 percentage points higher than that of the other lightweight models, and its mAP@0.5:0.95 of 76.0% is at least 4.5 percentage points higher. The Recall reaches 92.1%, also the highest, indicating that the proposed model maintains a high recall rate while improving the average precision under different IoU thresholds. These gains come at a moderate cost: MoLi-Net uses 2.0 M parameters and 5.1 G FLOPs, compared with 1.7 M and 3.9 G for MobileNetV3.

The detection results are shown in Figure 8. It can be observed that the detection results in the first column are similar, and the advantage of MoLi-Net is not obvious. In the second column, only MoLi-Net can accurately detect Sophora flavescens, while the other four models misclassify it as prunella vulgaris. This demonstrates that for similar Chinese herbal materials of different categories, the MobileAttn module of MoLi-Net enhances the feature weights of key semantic regions, suppresses redundant background information, and improves feature discriminability. Meanwhile, LightAttn enables the model to suppress irrelevant background noise via dual paths, greatly improving the generalization ability. In the third column, the first three models misdetect orange peel, as excessively high or low image brightness severely degrades the detection performance. The brightness-aware sub-network in the LightAttn module plays a crucial role, reducing the sensitivity of the model to brightness and enabling adaptation to varying illumination conditions.

4.3. Ablation Study

4.3.1. Sensitivity Analysis of Parameters

To investigate the impact of the number of tokens on the performance of the MobileAttn module, targeted experiments were conducted (only the MobileAttn module was included in the experiments). As shown in Figure 9, with the increase in the number of tokens, the number of network parameters increases incrementally, while the mAP@0.5:0.95 reaches its peak when the number of tokens is 16. Considering the overall performance of the network comprehensively, the module achieves the optimal performance when the number of tokens is set to 16.

Similarly, to investigate the impact of the Reduction ratio on the performance of the LightAttn module, targeted experiments were conducted (only the LightAttn module was included in the experiments). As shown in Figure 10, with the increase in the Reduction ratio, the mAP@0.5:0.95 shows a trend of first increasing and then decreasing and reaches the maximum value when the Reduction ratio is 128; meanwhile, the number of network parameters and computational load remain basically stable with the change of the Reduction ratio.

Based on the optimal hyperparameters identified above, the inference computational overhead of each module was further quantified. MobileAttn introduces 0.019 M additional parameters (+1.0%) and 0.2 G FLOPs (+4.1%), while LightAttn adds only 0.0016 M parameters (+0.1%) with no increase in computational complexity. These results confirm that both attention modules achieve significant performance gains with negligible computational cost, fully aligning with the lightweight design objectives of this work.

4.3.2. Module-Wise Ablation Analysis

This study evaluates the proposed model through a series of ablation experiments and heatmap analyses designed to verify the effectiveness of the MobileAttn and LightAttn modules in Chinese herbal materials detection. As presented in Table 5 and Figure 11, the full model improves the mAP@0.5:0.95 of prunella vulgaris from 56.0% to 64.1% relative to the baseline model, and that of orange peel from 57.8% to 64.8%. Using either the MobileAttn or LightAttn module alone also enhances model performance, but the improvement is smaller than that achieved by their combination. For instance, when only the LightAttn module is inserted, the mAP@0.5:0.95 of prunella vulgaris increases by 5.4 percentage points and that of orange peel by 4.9 percentage points relative to the baseline. The heatmap comparison reveals that, compared with P-YOLOv11, the proposed method not only substantially reduces the proportion of heat color in background regions but also weakens the response weight to dark background areas. By suppressing background interference via the dual-path background suppression mechanism, the model’s attention is directed toward target regions, thereby improving the accuracy and robustness of target recognition. In the LightAttn-only configuration, however, the confusion rate of orange peel reaches 4%, higher than the 1% achieved by the dual-module combination.

When only the MobileAttn module is inserted, the mAP@0.5:0.95 of prunella vulgaris is increased by 5.5% and that of orange peel by 3.1% compared with the baseline. As observed from the heatmaps, relative to P-YOLOv11, MobileAttn reduces the heat color proportion in background regions while paying more attention to small targets. This is because MobileAttn dynamically weights the features of Chinese herbal materials from local to global, enabling the extraction of critical features and further suppression of background information for more efficient detection. Nevertheless, under complex illumination conditions, the mAP@0.5:0.95 values of both prunella vulgaris and orange peel decrease by more than 3%, indicating that the absence of illumination-adaptive adjustment renders the model sensitive to brightness variations.

Experimental results demonstrate that when both the MobileAttn and LightAttn modules are activated simultaneously, the detection performance of the model is significantly improved for both categories. Moreover, the heatmaps show that both background suppression and overall target attention are superior to those of P-YOLOv11. This indicates that the proposed model maintains high precision while possessing favorable recall capability.

4.4. Performance Evaluation of Impurity Detection

Because the auxiliary module is a post-processing step outside the network, it cannot be assessed with the network’s own evaluation metrics, so its performance is measured by counting detection outcomes directly. A total of 300 images containing 1126 individual impurity instances were used in this impurity detection experiment, including 100 randomly selected from the test set of the CH-MO dataset and 200 randomly collected impurity images (100 images of small stones and 100 images of non-target plants), with the impurity size ranging from 20 px to 200 px. Impurities were regarded as the positive class and targets as the negative class. As shown in Table 6, the final F1-Score reached 86.38%. Meanwhile, as shown in Figure 12, the impurity detection module can accurately identify the two most common representative interfering impurities in industrial herbal packaging lines. In addition, the original network structure is retained, facilitating plug-and-play deployment. The evaluation metrics are defined as:

Precision = \frac{T P}{T P + F P}

(31)

Recall = \frac{T P}{T P + F N}

(32)

F_{1} - Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(33)

where TP (True Positives) denotes positive instances correctly identified, FN (False Negatives) refers to positive instances incorrectly labeled as negative, FP (False Positives) indicates negative instances mistakenly classified as positive, and F1-Score is the harmonic mean of precision and recall, offering a balanced measure of model performance.
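As a worked check, Equations (31)–(33) can be applied to the counts reported in Table 6 (TP = 856, FP = 98, FN = 172). The snippet below uses only those published counts; the variable names are illustrative.

```python
# Counts taken from Table 6.
tp, fp, fn = 856, 98, 172

precision = tp / (tp + fp)                          # Equation (31)
recall = tp / (tp + fn)                             # Equation (32)
f1 = 2 * precision * recall / (precision + recall)  # Equation (33)

print(f"Precision = {precision:.2%}")
print(f"Recall    = {recall:.2%}")
print(f"F1-Score  = {f1:.2%}")
```

The resulting F1-Score of 86.38% matches the value in Table 6, with a precision of about 89.7% and a recall of about 83.3%.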

5. Discussion

This discussion focuses on the key limitations of the current model and proposes future optimization directions and improvement approaches in combination with practical application scenarios.

Our industrial-calibrated data augmentation strategy has significantly improved the model’s robustness to common production variations. However, we acknowledge that data collection was primarily conducted under controlled laboratory conditions, which do not fully replicate all complexities of real industrial environments. In particular, extreme conditions such as severe conveyor belt staining, dense background clutter, high-frequency equipment vibration, and illumination variations beyond the ±40% range may still cause performance degradation. Additionally, the model’s detection performance for tiny impurities (<30 px) requires further improvement due to their indistinct feature information.

To systematically address these limitations, our future work will advance along two complementary and interconnected directions. First, we have signed a formal collaboration agreement with Guangzhou Ruijia Industry Co., Ltd. to conduct large-scale on-site validation using real production data from their three commercial herbal packaging lines. This validation will include testing under diverse operational conditions, varying conveyor speeds, and different background environments. We plan to publish the complete validation results as a preprint on arXiv. Second, we will further optimize the model architecture to improve its performance under extreme conditions. Specifically, self-supervised learning algorithms will be introduced to reduce reliance on manually labeled data and enhance generalization to unknown scenarios, while super-resolution reconstruction technology will be applied to enhance feature extraction for tiny impurities and reduce their missed detection rate.

The impurity detection module in this study is mainly aimed at high-frequency unknown impurities (stones and weeds) in industrial production, rather than all possible impurity types. Therefore, expanding the impurity types and the scale of the impurity dataset is an important direction for future work.

6. Conclusions

To improve feature extraction efficiency and mitigate the impact of illumination on recognition, the Transformer architecture is restructured and refined, and a lightweight network named MoLi-Net is designed by globally controlling brightness-aware weights integrated with channel and spatial dual-path attention. Experimental results demonstrate that by embedding the LightAttn and MobileAttn modules into MoLi-Net, the proposed model successfully achieves both lightweight design and performance improvement. Compared with the original YOLOv11, the overall mAP@0.5:0.95 of the improved model is increased by 1.7%, with gains of 5.8% and 4.1% for prunella vulgaris and orange peel, respectively. The number of parameters is reduced by 23.1%, and the computational cost is decreased by 20.3%. Compared with other lightweight models, the proposed model can better alleviate the influence of brightness and enhance the generalization ability. Meanwhile, to detect common industrial unknown impurities on the conveyor belt, an auxiliary impurity detection module is designed with an improved IoU metric and staged non-maximum suppression (NMS). Finally, the auxiliary module is embedded into MoLi-Net, enabling the direct identification of common industrial unknown impurities with an F1-Score of 86.38%, which verifies the effectiveness of the auxiliary module for impurity detection on conveyor belts.

Author Contributions

Conceptualization, Z.X. and Z.W.; methodology, Z.X.; software, Z.X.; validation, Z.X. and C.J.; formal analysis, Z.X.; investigation, J.D.; resources, W.D.; data curation, Z.X.; writing—original draft preparation, Z.X.; writing—review and editing, C.J. and Z.W.; visualization, Z.X.; supervision, Z.W.; project administration, J.D. and W.D.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw dataset used in this study is available at www.kaggle.com/datasets/silencexu001/moli-net (accessed on 11 April 2026).

Conflicts of Interest

Author Jianhui Ding and Author Weiyang Ding were employed by the company Guangzhou Ruijia Industry Co., Ltd. The remaining authors declare that there are no commercial or financial relationships that could be construed as a potential conflict of interest. No external sponsors or funding sources had any involvement in the study design, data collection, analysis and interpretation, manuscript writing, or the decision to submit this work for publication.


Figure 1. The structure of feature compression.

Figure 2. The structure of Token MLP.

Figure 3. The structure of spatial attention.

Figure 4. The structure of the LightAttn module.

Figure 5. The architecture of P-YOLOv11 (Left) and MoLi-Net (Right).

Figure 6. Statistical information of the CH-MO dataset.

Figure 7. Detection results of YOLOv11n and MoLi-Net on prunella vulgaris and orange peel. The red arrows point to the undetected objects.

Figure 8. Detection results of lightweight models evaluated on the CH-MO dataset. The red arrows point to the undetected objects. Specifically, the green boxes indicate Sophorae flavescentis.

Figure 9. Effect of Token count in the MobileAttn module on model performance.

Figure 10. Effect of reduction ratio in the LightAttn module on model performance.

Figure 11. Heatmap visualization of network outputs, with L and M representing LightAttn and MobileAttn, respectively. The colors of the bounding boxes are used to distinguish different objects, while the heatmap (e.g., red for high activation) indicates the network’s attention regions.

Figure 12. Visualization of detection results by the auxiliary module for impurities.

Table 1. Training hyperparameter configuration.

Parameter	Setup
Epochs	100
Batch Size	16
Img Size	640
Learning Rate	0.01
Optimizer	AUTO
Close Mosaic	Last 10 Epochs

Table 2. Overall performance comparison of different models on the CH-MO dataset.

Models	mAP@0.5	mAP@0.5:0.95	Params	FLOPs	Recall
Faster RCNN	91.1%	61.3%	30.5 M	25.6 G	80.1%
YOLOv5	94.9%	73.3%	2.2 M	6.1 G	86.6%
YOLOv8	96.1%	73.3%	2.7 M	6.8 G	91.9%
YOLOv9	96.2%	74.6%	1.7 M	6.5 G	92.8%
MobileNetV4-RT-DETR	91.8%	45.6%	7.3 M	11.7 G	88.2%
YOLOv11	96.2%	74.3%	2.6 M	6.4 G	92.0%
P-YOLOv11	95.5%	74.1%	2.0 M	5.0 G	91.9%
SSD	83.2%	51.2%	26.3 M	31.8 G	77.9%
MoLi-Net	96.6%	76.0%	2.0 M	5.1 G	92.1%
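
The trade-off between accuracy and model size in Table 2 can be made explicit with a little arithmetic. The sketch below uses only the YOLOv11 and MoLi-Net rows; the dictionary keys and the helper `relative_reduction` are illustrative names, not part of the paper.

```python
# Values copied from Table 2 (YOLOv11 baseline vs. MoLi-Net).
baseline = {"map50_95": 74.3, "params_m": 2.6, "flops_g": 6.4}
moli_net = {"map50_95": 76.0, "params_m": 2.0, "flops_g": 5.1}


def relative_reduction(old, new):
    """Percentage by which `new` is smaller than `old`."""
    return (old - new) / old * 100.0


# Absolute gain in percentage points, not a relative change.
map_gain = moli_net["map50_95"] - baseline["map50_95"]
param_cut = relative_reduction(baseline["params_m"], moli_net["params_m"])
flops_cut = relative_reduction(baseline["flops_g"], moli_net["flops_g"])

print(f"mAP@0.5:0.95 gain: {map_gain:.1f} points")
print(f"Params reduction:  {param_cut:.1f}%")
print(f"FLOPs reduction:   {flops_cut:.1f}%")
```

By this calculation the 1.7-point gain in mAP@0.5:0.95 comes with roughly 23% fewer parameters and 20% fewer FLOPs than the YOLOv11 baseline.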

Table 3. Comparative experiments of models on prunella vulgaris and orange peel. Note: class 1 and class 2 denote prunella vulgaris and orange peel, respectively.

Models	Class	mAP@0.5	mAP@0.5:0.95	Recall	FPS
YOLOv11n	1	88.4%	58.3%	83.2%	105
YOLOv11n	2	88.5%	60.7%	81.8%	105
P-YOLOv11	1	84.9%	56.0%	80.1%	124
P-YOLOv11	2	86.9%	57.8%	81.5%	124
MoLi-Net	1	89.5%	64.1%	82.2%	108
MoLi-Net	2	90.4%	64.8%	81.8%	108

Table 4. Comparison of lightweight models.

Models	mAP@0.5	mAP@0.5:0.95	Params	FLOPs	Recall
StarNet	94.8%	58.6%	1.7 M	4.8 G	91.1%
MobileNetV3	94.1%	70.5%	1.7 M	3.9 G	89.5%
MobileOne	94.7%	71.4%	5.4 M	3.9 G	91.1%
ShuffleNetV2	94.3%	71.5%	2.2 M	5.1 G	90.2%
MoLi-Net	96.6%	76.0%	2.0 M	5.1 G	92.1%

Table 5. Ablation study results of P-YOLOv11 with different module replacements on prunella vulgaris (class 1) and orange peel (class 2).

Model	Class	mAP@0.5	mAP@0.5:0.95	Recall	Overall mAP@0.5:0.95	FPS
P-YOLOv11	1	84.9%	56.0%	80.1%	74.1%	124
P-YOLOv11	2	86.9%	57.8%	81.5%	74.1%	124
P-YOLOv11 + LightAttn	1	88.0%	61.4%	83.9%	75.8%	121
P-YOLOv11 + LightAttn	2	89.0%	62.7%	83.6%	75.8%	121
P-YOLOv11 + MobileAttn	1	87.3%	61.5%	82.0%	75.1%	114
P-YOLOv11 + MobileAttn	2	87.9%	60.9%	85.2%	75.1%	114
MoLi-Net	1	89.5%	64.1%	82.2%	76.0%	108
MoLi-Net	2	90.4%	64.8%	81.8%	76.0%	108

Table 6. Detection performance of the auxiliary module for impurity detection.

Metric	TP	FP	FN	F1-Score	FPS
Value	856	98	172	86.38%	51
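
The F1-score in Table 6 can be reproduced from the reported counts. The sketch below assumes the standard definitions of precision, recall, and F1; the table does not list precision and recall, so those two values are derived here and are not reported results.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) from raw detection counts."""
    precision = tp / (tp + fp)  # share of flagged impurities that are real
    recall = tp / (tp + fn)     # share of real impurities that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from Table 6.
precision, recall, f1 = detection_scores(tp=856, fp=98, fn=172)

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-score:  {f1:.2%}")
```

The computed F1-score matches the 86.38% given in the table. The derived recall is lower than the derived precision, which reflects the larger number of missed impurities (FN = 172) compared with false alarms (FP = 98).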


