Article

MSG-YOLO: A Multi-Scale Dynamically Enhanced Network for the Real-Time Detection of Small Impurities in Large-Volume Parenterals

School of Automation and Intelligence, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(6), 1149; https://doi.org/10.3390/electronics14061149
Submission received: 23 February 2025 / Revised: 7 March 2025 / Accepted: 12 March 2025 / Published: 14 March 2025

Abstract

The detection of small targets holds significant application value in the identification of small foreign objects within large-volume parenterals. However, existing methods often face challenges such as inadequate feature expression capabilities, the loss of detailed information, and difficulties in suppressing background interference. To tackle the task of the high-speed and high-precision detection of tiny foreign objects in production scenarios involving large infusions, this paper introduces a multi-scale dynamic enhancement network (MSG-YOLO) based on an improved YOLO framework. The primary innovation is the design of a multi-scale dynamic grouped channel enhancement convolution module (MSG-CECM). This module captures multi-scale contextual features through parallel dilated convolutions, enhances the response of critical areas by integrating channel-space joint attention mechanisms, and employs a dynamic grouping strategy for adaptive feature reorganization. In the channel dimension, cross-scale feature fusion and a squeeze-excitation mechanism optimize feature weight distribution; in the spatial dimension, local maximum responses and spatial attention enhance edge details. Furthermore, the module features a lightweight design that reduces computational costs through grouped convolutions. The experiments conducted on our custom large infusion dataset (LVPD) demonstrate that our method improves the mean Average Precision (mAP) by 2.2% compared to the baseline YOLOv9 and increases small target detection accuracy (AP_small) by 3.1% while maintaining a real-time inference speed of 58 FPS.

1. Introduction

Large-volume parenteral (LVP) inspection involves detecting foreign objects in 50 mL medical infusion containers [1]. During manufacturing and manual handling processes, various insoluble contaminants, including gel particles, plastic fibers, and hair, may be introduced into the product. If these contaminated containers reach the market, they pose significant health risks to patients. The current industry standard relies on manual visual inspection under strong illumination, where operators examine each container for foreign objects [2]. However, this approach is inherently limited by operator fatigue and varying experience levels, resulting in significant detection failures. Consequently, developing an automated detection system for small foreign objects is essential, with the primary challenge being the simultaneous achievement of high detection accuracy and rapid processing speed [3,4].
Recent advances in machine vision, deep learning, and multimodal image processing have led to substantial improvements in LVP foreign matter detection. Multimodal imaging techniques have enhanced detection accuracy by integrating various imaging modalities, including visible light, X-rays, and laser technologies. Liang et al. (2023) developed a robotic inspection system utilizing a 660 nm parallel laser light source and 90 kV X-rays (5 µm wavelength) for transmission imaging [5], effectively addressing the limitations of the plastic containers’ poor light transmission and structural rigidity. Zhang et al. [6] further advanced this field by introducing a visual inspection system based on reverse PM diffusion, which enhances detection robustness through multimodal image acquisition and fusion, incorporating both transmission and reflection modes using infrared and visible light. Deep learning approaches have gained significant traction in LVP foreign matter detection. Wu et al. developed an innovative image segmentation method using Fuzzy Cellular Neural Networks (FCNN) [7], incorporating non-linear fuzzy min/max connection weights to enhance edge detection accuracy. Liang et al. (2023) [8] utilized transfer learning techniques to perform efficient image classification and foreign object detection under limited sample conditions, achieving a detection accuracy of 97%. Their findings demonstrate the significant advantages of deep learning algorithms in detecting foreign objects within complex backgrounds.
In small object detection, YOLO series [9,10,11,12,13,14,15,16] models have emerged as preferred solutions due to their computational efficiency and real-time capabilities. Recent improvements to these models have specifically targeted small object detection performance. The SED-YOLO [17] architecture incorporates Switchable Atrous Convolution (SAC) and Efficient Multi-Scale Attention (EMA) mechanisms, achieving a 71.6% mAP on the DOTA dataset, a 2.4% improvement over YOLOv5s [18]. CPDD-YOLOv8 [19] extends this progress by implementing a Global Attention Mechanism (GAM) and Dynamic Snake Convolution (DSConv), achieving 41% mAP@0.5 on VisDrone2019, surpassing YOLOv8 by 6.9%. Transformer architectures have also demonstrated promising results in small object detection. LKR-DETR [20] employs large kernel convolutions and multi-scale feature fusion, improving mAP@0.5 and mAP@0.5:0.95 by 2.5% and 2.0%, respectively, on VisDrone2019-DET. Additionally, D-FINE [21] enhances model robustness by reformulating bounding box regression as a Fine-grained Distribution Refinement (FDR) task. Current improvements in object detection primarily focus on feature pyramid optimization and attention mechanism design. While the ASPP [22] module effectively expands the receptive field through multi-dilated convolution, it introduces a substantial increase in parameters, hampering its deployment in real-time detection scenarios. Similarly, spatial-channel attention methods, such as CBAM [23], enhance feature representation but fail to address the collaborative optimization of multi-scale features.
Umar et al. [24] present a fault diagnosis technique for milling machines based on acoustic emission (AE) signals and a hybrid deep learning model, which achieves an accuracy of 99.6% and offers an efficient solution for fault detection in milling machines. Siddique et al. [25] propose a method for bearing-fault diagnosis using Mel-transformed scalograms obtained from vibration signals, achieving perfect precision, recall, F1-scores, and an AUC of 1.00 across all fault categories. LiqD [26], proposed by Ma, is a container dynamic liquid level detection model based on U²-Net; extensive experimental results show that the model can effectively detect dynamic liquid level changes in containers. However, such approaches rely heavily on handcrafted preprocessing (e.g., signal-to-image conversion) and lack adaptability to dynamic industrial scenes with motion blur or illumination changes. Current visual deep learning models, particularly YOLO variants, prioritize speed but sacrifice precision on small or partially occluded defects (e.g., micro-cracks in metal surfaces).
Furthermore, the traditional grouped convolution’s fixed grouping strategy constrains the model’s adaptability across diverse scenarios. To overcome these limitations, we propose the MSG-CECM module with three key innovations:
(1)
Multi-scale Dynamic Perception: We implement adaptive multi-scale feature fusion through parallel dilated convolution, leveraging feature disparities across scales to enhance small target detection.
(2)
Dual-dimensional Attention Enhancement: We integrate channel re-weighting with spatial detail enhancement to create a two-level noise filtering mechanism, effectively suppressing interference and highlighting target features.
(3)
Lightweight Structural Design: Our module employs grouped computation for spatial attention, enabling channel group-specific attention calculation while preserving inter-channel differences. This approach significantly reduces parameters while improving deployment efficiency.

2. Proposed Method

2.1. Overall Model Structure

Given the diversity of foreign objects and the prevalence of small targets in LVPD images, directly applying the general object detection model YOLOv9s for the high-precision detection of small foreign objects presents significant challenges. YOLO series algorithms face issues such as insufficient feature representation and a loss of detailed information in small target detection, with performance deteriorating notably in complex backgrounds or dense target scenarios. To address these challenges, this paper proposes the MSG-YOLO framework, which combines the GELAN structure with our proposed Multi-Scale Group Channel Enhance Convolution Module (MSG-CECM) to form a new feature computation block called MSGELAN. This integration enhances multi-scale feature fusion capabilities and detail preservation mechanisms, improving small target detection performance while maintaining high computational efficiency. The model structure of MSG-YOLO is illustrated in Figure 1.
We augmented the original model by implementing a size-adaptive loss function weighting mechanism that dynamically enhances gradient focusing for small-scale targets, specifically objects smaller than 32 × 32 pixels. The modified localization loss incorporates an adaptive weight coefficient, as follows:
\Gamma_{box} = \lambda_{small} \cdot (1 - \mathrm{IoU}) + \Gamma_{CIoU} \quad (1)
where the dynamic weight is given as follows:
\lambda_{small} = 1 + \alpha \cdot \tanh\!\left(\frac{1024}{w \cdot h}\right) \quad (2)
This formulation automatically assigns higher weights (α = 1.5 in our implementation) to smaller targets based on their pixel area (w · h), forcing the network to prioritize learning discriminative features for subtle impurities. This intrinsic optimization at the learning stage achieves superior small-object detection without requiring additional post-processing modules, maintaining real-time performance at 58 FPS.
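A minimal PyTorch-style sketch of this weighting follows, assuming per-box IoU values and CIoU loss terms are already produced by the detector's label-assignment stage; the function and variable names are illustrative rather than taken from the released code.

```python
import torch

def small_target_weight(w, h, alpha=1.5):
    """lambda_small = 1 + alpha * tanh(1024 / (w * h)).

    w, h: predicted box widths/heights in pixels, shape [N].
    Boxes near or below 32 x 32 px (area <= 1024) receive weights
    approaching 1 + alpha; large boxes decay toward a weight of 1.
    """
    area = (w * h).clamp(min=1.0)  # guard against zero-area boxes
    return 1.0 + alpha * torch.tanh(1024.0 / area)

def weighted_box_loss(iou, ciou_loss, w, h, alpha=1.5):
    """Gamma_box = lambda_small * (1 - IoU) + Gamma_CIoU, averaged over boxes."""
    lam = small_target_weight(w, h, alpha)
    return (lam * (1.0 - iou) + ciou_loss).mean()
```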

2.2. MSGELAN

The Generalized Efficient Layer Aggregation Network (GELAN [9]) is an advanced layer aggregation network that integrates the CSPNet [27] structure with ELAN [28] gradient path optimization. This combination enables high-precision object detection while ensuring a lightweight design and high processing speed. In this paper, we introduce the MSGELAN module, which merges the efficiency of GELAN with the specificity of the Multi-Scale Group Channel Enhance Convolution Module (MSG-CECM) to enhance small target detection.
Figure 2 illustrates the architectural framework of the Generalized Efficient Layer Aggregation Network (GELAN), tracing its evolutionary trajectory from the neural network architectures of CSPNet and ELAN. Both antecedent architectures feature gradient path planning mechanisms carefully designed to facilitate efficient information flow.
  • CSPNet: The CSPNet architecture involves bifurcating the input via a transformation layer, followed by parallel processing through arbitrary computational blocks. Subsequently, these divergent branches are reconciled through concatenation, and then subjected to another transformation layer, thereby revitalizing the information flow.
  • ELAN: In contrast to CSPNet, ELAN employs a hierarchical arrangement of stacked convolutional layers, where each layer’s output is synergistically combined with the next layer’s input, and then subjected to additional convolutional processing. This hierarchical schema enables ELAN to effectively capture complex patterns and relationships.
  • GELAN: By synthesizing the design philosophies of CSPNet and ELAN, GELAN emerges as a more versatile and efficient architecture. It incorporates the segmentation and recombination principles of CSPNet while integrating ELAN’s hierarchical convolutional processing paradigm at each segment. A key differentiator of GELAN lies in its flexibility to accommodate any type of computational block, rather than being confined to convolutional layers alone. This adaptability enables GELAN to be tailored to diverse application requirements.
As depicted in panel (a) of Figure 3, the GELAN module within YOLOv9 utilizes RepNCSP as a multi-tiered computational block. However, this configuration lacks specific optimization for small target detection. To address this limitation, this study introduces MSG-CECM as a computational module, designed to enhance the feature extraction capacity for small targets during the multi-level feature computation process. By integrating MSG-CECM, the proposed framework aims to elevate the detection performance for small targets, thereby contributing to more accurate and robust object detection.

2.3. MSG-CECM Module

As illustrated in Figure 4, the MSG-CECM module incorporates multi-scale depth convolutional layers, channel attention mechanisms, dynamic group spatial attention, and residual connections. The mathematical formulation of MSG-CECM can be expressed as follows:
\mathrm{Output} = F_{res}(x) + \mathrm{BN}\big(A_s(G(A_c(M(x))))\big) \quad (3)
where M represents multi-scale feature extraction, A_c and A_s denote channel and spatial attention, respectively, G signifies the dynamic grouping transformation, F_res represents the residual calculation, and BN represents batch normalization.

2.3.1. Multi-Scale Deep Feature Extraction

Small target detection necessitates the concurrent extraction of local details and global contextual information. Traditional single-scale convolutional operations are inadequate for this dual requirement. To overcome this limitation, we implement parallel deep convolutional operations with varying dilation rates to extract features across multiple receptive fields. For instance, a convolutional operation with a dilation rate of 2 generates a 5 × 5 receptive field within the architecture. Through the fusion of multi-scale features, we enhance the representation of local differential information critical for small target detection. Our approach employs parallel dilated convolutions with adjustable dilation rates (d ∈ {1, 2, …, D}, where D is application-specific), generating multi-scale feature maps via computationally efficient depthwise separable convolutions. This process can be mathematically formulated as follows:
F_d = \mathrm{DWConv}(x, k = 3, \mathrm{dilation} = d), \quad d \in \mathrm{scales} \quad (4)
F_{cat} = [F_1; F_2; \ldots; F_D] \quad (5)
Through the implementation of varied receptive fields with dilation rates d, we achieve comprehensive spatial coverage with an effective receptive field size of (k−1)d + 1. Our multi-branch architecture encompasses receptive fields ranging from 5 × 5 to 9 × 9, enabling the adaptive processing of targets with diverse spatial dimensions. Additionally, the incorporation of depthwise separable convolution substantially reduces the computational complexity by decreasing the number of parameters, resulting in enhanced computational efficiency.
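As a rough sketch of this branch under stated assumptions (3 × 3 depthwise kernels with dilations 2, 3, and 4 to cover the 5 × 5 to 9 × 9 receptive fields; branch outputs concatenated along channels), the multi-scale extraction could be implemented as follows; the exact dilation set and channel bookkeeping inside MSG-CECM may differ.

```python
import torch
import torch.nn as nn

class MultiScaleDWConv(nn.Module):
    """Parallel 3x3 depthwise convolutions with different dilation rates.

    Each dilation d yields an effective receptive field of (k - 1) * d + 1,
    i.e., 5x5, 7x7 and 9x9 for d = 2, 3, 4; branch outputs are concatenated
    along the channel dimension (Equations (4) and (5)).
    """
    def __init__(self, channels, dilations=(2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels, bias=False)
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```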

2.3.2. Channel Attention Enhancement

Following multi-scale feature fusion, the channels exhibit varying degrees of importance, necessitating dynamic feature weight adjustment through an attention-based mechanism. The system employs a squeeze-and-excitation network to generate attention weights, enabling adaptive channel-wise feature recalibration that simultaneously amplifies informative channels and attenuates noise components.
F_{att} = F_{cat} \otimes \sigma\big(W_2 \cdot \delta(W_1 \cdot \mathrm{GAP}(F_{cat}))\big) \quad (6)
where σ represents the Sigmoid activation function, W_1 \in \mathbb{R}^{(2C_{in}/4) \times 2C_{in}} and W_2 \in \mathbb{R}^{2C_{in} \times (2C_{in}/4)} utilize a reduction ratio of 4, and GAP denotes Global Average Pooling with a kernel size of 3 and stride of 1.
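A squeeze-and-excitation style sketch of this step is given below, assuming standard global average pooling and ReLU for δ; the paper's 3 × 3, stride-1 pooling variant and the exact channel widths are not reproduced here.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel re-weighting (Equation (6) sketch): squeeze with
    global average pooling, excite with two 1x1 convolutions (reduction
    ratio 4), and rescale the input channels with sigmoid weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),     # delta
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),              # sigma
        )

    def forward(self, x):
        return x * self.fc(self.gap(x))
```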

2.3.3. Dynamic Group-Wise Spatial Attention

Conventional spatial attention mechanisms utilize an ungrouped approach, which exhibits limited adaptability to diverse feature distributions across scenarios. To address this limitation, we propose a group-wise spatial attention mechanism where feature F a t t is uniformly partitioned into g groups (g = 4) along the channel dimension, each with a dimension of 2 C i n / g . Subsequently, these grouped features undergo independent transformations through 1 × 1 convolutions prior to spatial attention computation. Given the critical role of edge and texture information in small target detection, we implement an enhanced spatial attention mechanism. The process begins with local maximum response extraction, where 3 × 3 max pooling operations are applied to each feature group to obtain local structural features M i . Spatial attention is then computed for individual groups by averaging M i along the channel dimension and generating spatial masks through 7 × 7 convolutions. These masks are subsequently multiplied with their corresponding group features, effectively enhancing edge and texture information within each group. The process culminates in the concatenation of all enhanced group features to produce the final feature representation O:
F_{att}^{(1)}, F_{att}^{(2)}, \ldots, F_{att}^{(g)} = \mathrm{Chunk}(F_{att}) \quad (7)
M_i = \mathrm{MaxPool}\big(\mathrm{Conv}_{1\times 1}(F_{att}^{(i)})\big), \quad i = 1, 2, \ldots, g \quad (8)
O_i = G_i \otimes \sigma\left(\mathrm{Conv}_{7\times 7}\left(\frac{1}{C}\sum_{c=1}^{C} M_i^{(c)}\right)\right) \quad (9)
O = \mathrm{Concat}(O_1, O_2, \ldots, O_g) \quad (10)
where Chunk represents the channel grouping operation, C denotes the channel dimensionality per group, and g specifies the total number of groups.
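The grouped spatial attention can be sketched as follows, with g = 4 groups, per-group 1 × 1 transforms, 3 × 3 max pooling, and 7 × 7 mask convolutions as described in the text; whether the mask convolutions share weights across groups is our assumption.

```python
import torch
import torch.nn as nn

class GroupSpatialAttention(nn.Module):
    """Dynamic group-wise spatial attention (Equations (7)-(10) sketch)."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.transforms = nn.ModuleList(nn.Conv2d(c, c, 1) for _ in range(groups))
        self.masks = nn.ModuleList(nn.Conv2d(1, 1, 7, padding=3) for _ in range(groups))
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        outs = []
        for i, g in enumerate(torch.chunk(x, self.groups, dim=1)):
            m = self.pool(self.transforms[i](g))                    # M_i
            mask = torch.sigmoid(self.masks[i](m.mean(1, keepdim=True)))
            outs.append(g * mask)                                   # O_i
        return torch.cat(outs, dim=1)                               # O
```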

2.3.4. Residual Connections

Residual connections serve as an effective mechanism for mitigating the gradient vanishing problem in deep neural architectures. Within the MSG-CECM module, we implement residual connections through a dimension-adaptive approach: for matching input–output channel dimensions, direct element-wise addition is performed; for mismatched dimensions, a 1 × 1 convolutional projection aligns the input features before the addition of the module’s output to produce the final feature representation. This process can be formally expressed as follows:
F_{res}(x) = \begin{cases} x, & C_{in} = C_{out} \\ \mathrm{Conv}_{1\times 1}(x), & \text{otherwise} \end{cases} \quad (11)
\mathrm{Output} = F_{res}(x) + \mathrm{BN}(O) \quad (12)
This architectural design preserves the original input information and high-frequency features while simultaneously preventing small target suppression during deep feature extraction, thereby enhancing training convergence.
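Putting the pieces together, a compact sketch of the full MSG-CECM forward pass (Equations (3), (11), and (12)) is given below, reusing the sketches above; the 1 × 1 fusion that maps the concatenated multi-scale channels back to the output width is our assumption about details not spelled out in the text.

```python
import torch.nn as nn

class MSGCECM(nn.Module):
    """MSG-CECM sketch: multi-scale extraction M, channel attention A_c,
    grouped spatial attention G/A_s, batch norm, and a dimension-adaptive
    residual branch."""
    def __init__(self, in_channels, out_channels, dilations=(2, 3, 4), groups=4):
        super().__init__()
        fused = in_channels * len(dilations)
        self.msconv = MultiScaleDWConv(in_channels, dilations)
        self.channel_att = ChannelAttention(fused)
        self.spatial_att = GroupSpatialAttention(fused, groups)
        self.fuse = nn.Conv2d(fused, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.res = (nn.Identity() if in_channels == out_channels
                    else nn.Conv2d(in_channels, out_channels, kernel_size=1))

    def forward(self, x):
        y = self.msconv(x)           # M(x)
        y = self.channel_att(y)      # A_c
        y = self.spatial_att(y)      # G / A_s
        return self.res(x) + self.bn(self.fuse(y))   # F_res(x) + BN(O)
```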

3. Methodology

3.1. Material

Deep learning-based foreign object detection methods require extensive large-volume parenteral (LVP) image data. However, no comparable datasets are currently available in public databases, and existing small object datasets exhibit significant differences in target characteristics and background conditions from our research context, rendering them unsuitable for model training. To address this limitation, we present the large-volume parenteral dataset (LVPD). The dataset comprises images of 250 mL standard medical saline solutions manufactured by a pharmaceutical company in cylindrical soft-plastic containers measuring 6.5 cm in diameter and 14 cm in height.
On the right side of Figure 5, multiple high-resolution cameras are positioned to capture images. A robotic gripper holds each bottle and moves it past the cameras while a backlight panel provides illumination, enabling the acquisition of clear bottle images. These captured images (as shown in Figure 6) subsequently undergo a series of preprocessing operations, including the isolation of bottle regions through cropping. Following annotation, these processed images constitute the LVPD utilized in this study.
The LVPD comprises a wide range of foreign objects, including insects, hair strands, color anomalies, and arthropod appendage fragments, along with various interference factors such as surface scratches, air bubbles, and textural reflections. Given that industrial applications potentially involve many more foreign object categories than those present in our current dataset, we adopt a binary classification approach, designating all targets simply as “foreign objects” without detailed categorization. Interference elements are excluded from the annotation process. As illustrated in Figure 7, the contaminants present in the bottle are microscopic in size, while surface features such as bubbles and scratches exhibit subtle visual characteristics that are nearly indistinguishable from these contaminants, potentially leading to classification errors.
The LVPD consists of 5000 images containing foreign objects with 32,800 annotations, and exhibits considerable dimensional diversity among foreign object classes, with varying imaging characteristics under different illumination conditions. A large proportion of the annotated objects are contained within a 32 × 32 pixel area, establishing the LVPD primarily as a small-object detection dataset. The dataset presents four major challenges: object category diversity, significant intra-class variations, multiple interference sources, and the prevalence of small-scale targets.

3.2. Experimental Setup

All experiments were performed on a system running Windows 10, featuring an NVIDIA GeForce RTX 4070 GPU (12 GB VRAM) and 64 GB RAM. The implementation utilized PyTorch 2.6.0 with CUDA 12.4 and Python 3.11. For both model training and inference, we set the Intersection over Union (IoU) and confidence thresholds to 0.2.
To enhance model generalization, we applied several data augmentation techniques to the input images, which were standardized to 640 × 640 pixels. These techniques included Mosaic augmentation, random horizontal flipping, affine transformations (scale factor: 0.5–1.5), and HSV color space adjustments (hue: ±0.1, saturation: ±0.7, value: ±0.4). Additional hyperparameter configurations are detailed in Table 1. We divided the LVPD using a 70:30 split between the training/validation and test subsets.
For all the experiments conducted on the LVPD, the models were initialized randomly without pre-training weights. Training was performed using Stochastic Gradient Descent (SGD) with momentum optimization (learning rate: 0.01, momentum: 0.937, L2 regularization: 5 × 10−4). We adopted a two-stage training strategy, freezing the backbone network for the initial 10 epochs followed by full model training for a total of 300 epochs, with learning rate adjustment governed by cosine annealing. Model performance was quantitatively assessed using multiple metrics: mean Average Precision at IoU threshold 0.5 (mAP@0.5), Average Precision for small objects (AP_small), inference speed (FPS), model parameters (millions), and computational complexity (GFLOPs).
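For reproducibility, the sketch below shows how the reported optimizer, schedule, and two-stage freezing could be set up in PyTorch; `model.backbone` is an assumed attribute name, and the actual training runs through the YOLOv9 pipeline rather than this minimal setup.

```python
import torch

def build_optimizer_and_scheduler(model, epochs=300):
    # SGD with momentum and L2 regularization, cosine-annealed learning rate
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.937, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def set_backbone_frozen(model, frozen):
    # Two-stage strategy: backbone frozen for the first 10 epochs, then released
    for p in model.backbone.parameters():   # assumes a `.backbone` attribute
        p.requires_grad = not frozen
```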

3.3. Evaluation Metrics

We evaluated our enhanced model using two key metrics: mean Average Precision at IoU threshold 0.5 (mAP@0.5), and AP_small for objects smaller than 32 × 32 pixels. The metric calculations incorporate standard classification counts: True Positives (TP, correctly identified positive samples), False Positives (FP, negative samples incorrectly identified as positive), and False Negatives (FN, positive samples incorrectly identified as negative). Precision, calculated using Equation (13), represents the proportion of correct positive predictions among all positive predictions. Recall, calculated using Equation (14), quantifies the model's ability to identify positive samples by measuring the proportion of correctly detected positive instances among all actual positive cases. Average Precision (AP), a comprehensive metric for evaluating detection performance, is computed as the area under the precision–recall curve according to Equation (15). Mean Average Precision (mAP), which quantifies the model's overall detection capability, is computed as the mean of the Average Precision values across all object classes, as defined in Equation (16).
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (13)
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (14)
AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall}) \quad (15)
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \quad (16)
In Equation (16), AP_i represents the Average Precision for class index i, while N denotes the total number of distinct classes in the training dataset.
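A small numerical sketch of Equations (13)-(16) follows; real evaluations match detections to ground truth at IoU 0.5 and typically use interpolated precision, so this shows only the core arithmetic.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Area under the precision-recall curve (Equation (15)),
    approximated by trapezoidal integration over recall."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """mAP (Equation (16)): unweighted mean of per-class AP values."""
    return float(np.mean(ap_per_class))
```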

4. Experimental Results

4.1. Ablation Studies

To evaluate the effectiveness of our proposed approach, we conducted comprehensive ablation studies on the LVPD. Figure 8 illustrates the baseline architecture of our feature extraction module, where the CBS block (highlighted in red) represents the computational unit targeted for improvement. To validate both the effectiveness of individual components and their mutual compatibility, we incrementally incorporated several optimization techniques: multi-scale perception, channel attention mechanisms, spatial attention mechanisms, and dynamic grouping strategies.
As shown in Table 2, the integration of multi-scale perception and deformable convolution for cross-scale feature fusion yields a 1.9% improvement in mAP, particularly enhancing the detection of small objects (below 32 × 32 pixels) with a 2.3% increase in AP_small. The incorporation of the channel attention mechanism enables the dynamic adjustment of channel importance, enhancing key feature representations and emphasizing channels containing small objects, resulting in improvements of 1.5% and 1.2% in mAP@0.5 and AP_small, respectively. Furthermore, the addition of the spatial attention mechanism allows the network to focus on crucial image regions while suppressing irrelevant background information, leading to enhanced local detail perception and a corresponding increase of 1.1% in both mAP@0.5 and AP_small. The final enhancement involves a dynamic grouping module, where input features are segmented into multiple groups, each processed through independent 1 × 1 convolution layers with adjustable kernel weights. This dynamic grouping enables adaptive processing strategies based on input features, with separate spatial attention computation for different groups, enhancing model flexibility and adaptability. These improvements culminate in final mAP@0.5 and AP_small values of 36.5% and 25.6%, respectively.
The group size represents a critical hyperparameter within the proposed dynamic grouping mechanism, with its configuration significantly influencing detection performance. Through comprehensive experimental investigations, we systematically evaluated the impact of varying group sizes on detection accuracy (mean Average Precision, mAP) and inference speed (frames per second, FPS). The subsequent section presents the experimental methodology, detailed results, and analytical insights.
Table 3 shows that increasing the group size (g) initially improves detection accuracy (mAP@0.5 and AP_small peak at g = 8 with 36.9% and 26.1%, respectively). Beyond g = 8, performance declines, likely due to the over-segmentation of features or redundant attention allocation, which reduces discriminative power for small objects. Smaller groups (g = 2–4) limit dynamic feature fusion flexibility, while larger groups (g = 16–32) introduce computational noise. As the number of groups increases, we observed an inverse relationship between model parameter count and computational complexity: while the parameter count decreases, computation time rises, reducing the detection FPS. After comprehensive evaluation, we selected a group size of g = 4 for LVPD's impurity detection tasks, prioritizing detection speed over the marginally higher mean Average Precision (mAP) observed at g = 8.
Furthermore, we evaluated detection performance under diverse interfering conditions, including varying resolutions, noise perturbations, and image blur. As shown in Table 4, MSG-YOLO balances speed (58–25 FPS) and accuracy (mAP@0.5: 36.5–39.7%) across resolutions. Increasing the resolution improves mAP to a degree, but it also raises video memory consumption and significantly reduces FPS. The model resists moderate Gaussian noise (−3.9% mAP@0.5) but struggles with salt-and-pepper noise (−24.4% mAP@0.5), which particularly harms small objects (−37.9% AP_small). Motion blur (kernel length = 25) degrades mAP@0.5 by 15.9%, though dynamic feature fusion alleviates the distortion. Deployed at resolutions from 640 × 640 to 1024 × 1024 with noise suppression preprocessing, the model achieves performance suitable for industrial real-time use.
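The perturbations listed in Table 4 can be approximated with standard image operations; the sketch below is our reconstruction using OpenCV and NumPy, not the authors' exact test pipeline.

```python
import numpy as np
import cv2

def add_gaussian_noise(img, sigma=10):
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def add_salt_and_pepper(img, density=0.10):
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < density / 2] = 0          # pepper
    out[mask > 1 - density / 2] = 255    # salt
    return out

def motion_blur(img, kernel_len=15):
    kernel = np.zeros((kernel_len, kernel_len), np.float32)
    kernel[kernel_len // 2, :] = 1.0 / kernel_len   # horizontal streak
    return cv2.filter2D(img, -1, kernel)
```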

4.2. Comparative Experiments

To evaluate the effectiveness of MSG-YOLO, we conducted extensive comparative experiments on the LVPD against state-of-the-art models, including YOLOv9t, YOLOv5s, PP-YOLOE, TPH-YOLO, and YOLOv5-CBAM. PP-YOLOE, developed by Baidu, extends PP-YOLOv2 with an anchor-free architecture to achieve enhanced efficiency through structural simplification. TPH-YOLO advances YOLOv5 specifically for drone-based small object detection by implementing a Transformer attention mechanism that prioritizes key regions. YOLOv5-CBAM augments the base YOLOv5 architecture with a Convolutional Block Attention Module (CBAM), leveraging both channel and spatial attention mechanisms to strengthen feature representation. Following the aforementioned experimental protocol, each model underwent 30 independent training runs, with the results reported in Table 5 representing the mean values across all evaluation metrics.
Results from Table 5 and Figure 9 demonstrate that MSG-YOLO significantly outperforms the baseline YOLOv9t model, achieving an mAP@0.5 of 36.5% and an AP_small of 25.6% with only a modest increase in parameters. The integration of CBAM attention modules in YOLOv5s-CBAM yields modest improvements of 0.8% and 1.6% in mAP@0.5 and AP_small, respectively. Our proposed MSGELAN architecture, incorporating multi-scale fusion and grouped convolution, achieves substantially better performance than the CBAM-enhanced YOLOv5s: with a comparable parameter count, it improves mAP@0.5 by 2.3% and AP_small by 3.1%, a significant advancement in small object detection. PP-YOLOE's performance is limited by its sensitivity to training strategies, while TPH-YOLO achieves competitive accuracy (36.5%) but proves too computationally intensive for real-time applications. MSGELAN, our proposed multi-scale dynamic grouped convolution module, demonstrates exceptional capability in detecting small foreign objects in large-volume parenteral solutions. The architecture integrates multi-scale feature extraction, dynamic grouping strategies, dual-attention mechanisms (channel and spatial), and residual connections to enhance both detection accuracy and computational efficiency. The resulting MSG-YOLO framework achieves state-of-the-art performance on the LVPD for small object detection.
We also conducted experiments on the TinyPerson dataset to assess the generalization capability of the proposed method. As shown in Table 6, MSG-YOLO achieves the highest mAP@0.5 (41.2%) and AP_small (28.6%), outperforming YOLOv9t (+3.1% mAP) and TPH-YOLO (+1.8% mAP), demonstrating superior small-object detection via multi-scale dynamic grouping and dual attention. While slightly slower than YOLOv9t (54 vs. 62 FPS), it balances accuracy and speed better than TPH-YOLO (47 FPS). Its moderate parameter count (6.8 M) and FLOPs (15.4 G) confirm efficient feature fusion, avoiding the computational overhead of transformer-based designs.

4.3. Visualization Analysis of Results

To evaluate the practical effectiveness of our proposed model, we conducted inference tests on four representative test images using optimally trained weights across different models; the results of the experiment are presented in Figure 10.
The proposed MSG-YOLO employs an adaptive confidence threshold strategy specifically optimized for pharmaceutical quality control scenarios. While the detection results in Figure 10 demonstrate confidence scores ranging from 0.26 to 0.92, our system implements a conservative lower threshold of 0.25 during inference. This threshold was rigorously validated through ROC curve analysis on our LVPD validation set, achieving an optimal balance between a 93.5% recall and 5.2% false discovery rate—compliant with pharmaceutical industry standards requiring < 0.1% defective product leakage.
The selected images in Figure 10 encompass diverse challenging scenarios: tilted bottles, color spot interference, elongated foreign objects, bubble occlusion, multi-scale foreign objects, and bottle surface reflections. In the first test case, featuring a tilted bottle with clear visibility, all models successfully detected three foreign objects. The second case presented increased detection complexity with an elongated foreign object and an additional minute object near the bottle bottom; notably, only TPH-YOLO and our MSG-YOLO model achieved successful detection. The third and fourth test cases introduced dense bubble interference and significant size variations among foreign objects. Under these challenging conditions, MSG-YOLO maintained robust performance with detection confidence scores consistently exceeding 0.5, while PP-YOLOE and YOLOv5s exhibited detection failures. Although TPH-YOLO demonstrated comparable detection accuracy to MSG-YOLO, our model achieved superior inference speed. These results validate the enhanced effectiveness of MSG-YOLO for foreign object detection in large-volume parenteral solutions.
As illustrated in Figure 11, our experimental analysis revealed critical detection limitations, particularly when encountering targets with ambiguous edge boundaries. The model demonstrated significant challenges in discriminating between fine bubbles and minute impurities, highlighting a crucial area for subsequent research to enhance algorithmic recognition precision.

5. Conclusions

This work presents MSGELAN, a novel computational block optimized for rapid small object detection, and its derivative model MSG-YOLO, developed to enhance the detection of foreign particles in large-volume parenteral solutions in industrial settings. The comprehensive evaluation on the LVPD demonstrates that MSG-YOLO achieves state-of-the-art performance in both mean Average Precision (mAP) and inference speed, with notable improvements in small object detection (AP_small) while maintaining real-time processing capabilities. Through extensive ablation studies, we establish the efficacy of the synergistic integration of multi-scale convolution, channel-spatial attention, and dynamic grouping mechanisms, elucidating their collective contribution to small object feature enhancement. Furthermore, the proposed approach exhibits robust performance under challenging conditions, including noise interference and extreme scale variations, making it a viable solution for industrial applications.

Author Contributions

Conceptualization, Z.L. and D.J.; methodology, Z.L.; software, Z.L. and Z.H.; validation, Z.L. and N.W.; formal analysis, Z.L.; investigation, Z.H.; resources, N.W.; data curation, N.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L. and D.J.; visualization, N.W.; supervision, D.J.; project administration, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to dongyaoj1974@163.com.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Eisenhauer, D.A.; Schmidt, R.; Martin, C.; Schultz, S.G. Processing of small volume parenterals and large volume parenterals. In Pharmaceutical Dosage Forms-Parenteral Medications; CRC Press: Boca Raton, FL, USA, 2016; pp. 348–366.
2. Jia, D.; Sun, H.; Zhang, C.; Tang, J.; Li, Z.; Wu, N.; He, Z. Detection Method of Foreign Body in Large Volume Parenteral Based on Continuous Time Series. Available online: https://ssrn.com/abstract=4178845 (accessed on 11 March 2025).
3. Zhang, H.; Li, X.; Zhong, H.; Yang, Y.; Wu, Q.J.; Ge, J.; Wang, Y. Automated machine vision system for liquid particle inspection of pharmaceutical injection. IEEE Trans. Instrum. Meas. 2018, 67, 1278–1297.
4. Zhang, Q.; Liu, K.; Huang, B. Research on Defect Detection of The Liquid Bag of Bag Infusion Sets Based on Machine Vision. Acad. J. Sci. Technol. 2023, 5, 186–197.
5. Ge, J.; Xie, S.; Wang, Y.; Liu, J.; Zhang, H.; Zhou, B.; Weng, F.; Ru, C.; Zhou, C.; Tan, M.; et al. A system for automated detection of ampoule injection impurities. IEEE Trans. Autom. Sci. Eng. 2015, 14, 1119–1128.
6. Zhang, H.; Shi, T.; He, S.; Wang, H.; Ruan, F. Visual detection system design for plastic infusion combinations containers based on reverse PM diffusion. In Proceedings of the 2015 7th International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 26–27 August 2015; Volume 2, pp. 306–310.
7. Cheng, K.S.; Lin, J.S.; Mao, C.W. Techniques and comparative analysis of neural network systems and fuzzy systems in medical image segmentation. In Fuzzy Theory Systems; Elsevier: Amsterdam, The Netherlands, 1999; pp. 973–1008.
8. Liang, Q.; Luo, B. Visual inspection intelligent robot technology for large infusion industry. Open Comput. Sci. 2023, 13, 20220262.
9. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21.
10. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
12. Farhadi, A.; Redmon, J. YOLOv3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6.
13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
16. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
17. Wei, X.; Li, Z.; Wang, Y. SED-YOLO based multi-scale attention for small object detection in remote sensing. Sci. Rep. 2025, 15, 3125.
18. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D.; et al. Ultralytics/yolov5: v6.2—YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. Zenodo 2022. Available online: https://ui.adsabs.harvard.edu/abs/2022zndo...7002879J/abstract (accessed on 11 March 2025).
19. Wang, J.; Gao, J.; Zhang, B. A small object detection model in aerial images based on CPDD-YOLOv8. Sci. Rep. 2025, 15, 770.
20. Dong, Y.; Xu, F.; Guo, J. LKR-DETR: Small object detection in remote sensing images based on multi-large kernel convolution. J. Real-Time Image Process. 2025, 22, 46.
21. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842.
22. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
24. Umar, M.; Siddique, M.F.; Ullah, N.; Kim, J.M. Milling machine fault diagnosis using acoustic emission and hybrid deep learning with feature optimization. Appl. Sci. 2024, 14, 10404.
25. Siddique, M.F.; Zaman, W.; Ullah, S.; Umar, M.; Saleem, F.; Shon, D.; Yoon, T.H.; Yoo, D.S.; Kim, J.M. Advanced Bearing-Fault Diagnosis and Classification Using Mel-Scalograms and FOX-Optimized ANN. Sensors 2024, 24, 7303.
26. Ma, Y.; Mao, Z. LiqD: A Dynamic Liquid Level Detection Model under Tricky Small Containers. arXiv 2024, arXiv:2403.08273.
27. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
28. Wittenburg, P.; Brugman, H.; Russel, A.; Klassmann, A.; Sloetjes, H. ELAN: A professional framework for multimodality research. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 22–28 May 2006; pp. 1556–1559.
29. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892.
30. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250.
31. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788.
32. Miao, L.; Li, N.; Zhou, M.; Zhou, H. CBAM-Yolov5: Improved Yolov5 based on attention model for infrared ship detection. In Proceedings of the International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2021), Harbin, China, 24–26 December 2021; Volume 12168, pp. 564–571.
Figure 1. MSG-YOLO structure. The part outlined by the red dashed box in the figure is the MSGELAN module proposed in this paper, which combines MSG-CECM with GELAN. Its specific structure will be provided in the following text.
Figure 2. Comparison of CSPNet, ELAN, and GELAN Structures: (a) Structure of CSPNet. (b) Structure of ELAN. (c) Structure of GELAN.
Figure 3. Comparison of RepNCSPELAN and MSGELAN structures: (a) Structure of RepNCSPELAN. (b) Structure of MSGELAN.
Figure 4. Detailed structure of MSG-CECM: (a) The Overall Calculation Process of MSG-CECM. (b) The Calculation Process of DWConv. (c) The Calculation Process of Channel Att. (d) The Calculation Process of PWConv. (e) The Calculation Process of Spatial Att.
Figure 5. Image acquisition mechanical structure.
Figure 6. Examples of captured bottle images.
Figure 7. Examples of microscopic contaminants and visually similar interference (bubbles, scratches) in bottle images.
Figure 8. Structure of baseline. To ensure the experimental rigor, we replaced the feature extraction module in the computation block of the original model with a basic CBS module, which served as the baseline. Subsequently, we systematically validated the modules proposed in this study through a series of experiments.
Figure 9. Comparison of mAP performance across different models on the LVPD dataset.
Figure 10. Comparative analysis of model inference performance. In this figure, each row from top to bottom represents the inference results of YOLOv9-t, YOLOv5s, PP-YOLOE, TPH-YOLO, YOLOv5s-CBAM, and MSG-YOLO, respectively.
Figure 11. False negative and false positive examples of MSG-YOLO in LVPD detection. (a) Ground Truth. (b) Inference Result.
Table 1. Hyperparameters of experiment.

Hyperparameter | Value | Hyperparameter | Value
lr0 | 0.01 | warmup_momentum | 0.8
lrf | 0.01 | dfl | 1.5
momentum | 0.937 | box | 7.5
warmup_bias_lr | 0.1 | cls | 0.5
warmup_epochs | 3.0 | obj | 0.7
Table 2. Performance comparison of different optimization strategies integrated with the baseline method.

Experiment Group | Multi-Scale Conv. | Channel Attention | Spatial Attention | Dynamic Grouping | mAP@0.5 (%) | AP_Small (%)
Baseline | × | × | × | × | 31.5 | 19.5
Stage 1 | ✓ | × | × | × | +1.7 (33.2) | +2.3 (21.8)
Stage 2 | ✓ | ✓ | × | × | +1.5 (34.7) | +1.3 (23.1)
Stage 3 | ✓ | ✓ | ✓ | × | +1.1 (35.8) | +1.1 (25.2)
MSG-YOLO | ✓ | ✓ | ✓ | ✓ | +0.7 (36.5) | +0.4 (25.6)
× means the strategy is not used in the experiment; ✓ means the strategy is used in the experiment.
Table 3. Impact of group size on detection performance.

Group Size (g) | mAP@0.5 (%) | AP_Small (%) | FPS | Params (M)
2 | 35.1 | 24.3 | 56 | 7.1
4 | 36.5 | 25.6 | 54 | 6.8
8 | 36.9 | 26.1 | 51 | 6.7
16 | 36.4 | 25.4 | 47 | 6.6
32 | 35.8 | 24.9 | 42 | 6.6
Table 4. Consolidated performance analysis (%).

Test Condition | Parameters | mAP@0.5 | AP_Small | FPS | mAP Drop
Resolution | Original | 36.5 | 25.6 | 58 | -
Resolution | 768 × 768 | 38.1 | 26.9 | 46 | -
Resolution | 1024 × 1024 | 39.7 | 27.7 | 25 | -
Gaussian Noise | σ = 10 | 35.1 | 24.8 | - | −3.9%
Gaussian Noise | σ = 30 | 29.4 | 18.2 | - | −19.5%
Salt-and-Pepper Noise | Density = 10% | 27.6 | 15.9 | - | −24.4%
Motion Blur | Kernel Length = 15 | 34.2 | 23.1 | - | −6.3%
Motion Blur | Kernel Length = 25 | 30.7 | 19.5 | - | −15.9%
Table 5. Comparison of the detection results of the proposed MSG-YOLO with those of classical networks.

Models | mAP@0.5 ↑ | AP_Small ↑ | FPS ↑ | Params (M) ↓ | FLOPs (G) ↓
YOLOv9t [9] | 34.2 | 22.5 | 72 | 2.4 | 10.1
YOLOv5s [29] | 33.8 | 21.7 | 63 | 7.2 | 16.5
PP-YOLOE [30] | 35.1 | 23.1 | 65 | 8.9 | 24.3
TPH-YOLO [31] | 36.5 | 24.8 | 41 | 16.7 | 36.8
YOLOv5s-CBAM [32] | 34.6 | 23.3 | 58 | 7.5 | 17.2
MSG-YOLO (ours) | 36.5 | 25.6 | 58 | 2.5 | 10.7
↑ indicates that a larger value of the indicator represents better performance, while ↓ indicates that a smaller value represents better performance.
Table 6. Performance comparison on TinyPerson dataset (%).

Model | mAP@0.5 | AP_Small | FPS | Params (M) | FLOPs (G)
MSG-YOLO | 41.2 | 28.6 | 54 | 6.8 | 15.4
YOLOv9t | 38.1 | 24.1 | 62 | 5.2 | 12.7
TPH-YOLO | 39.4 | 26.8 | 47 | 8.5 | 18.9
