Article

DScanNet: Packaging Defect Detection Algorithm Based on Selective State Space Models

Yirong Luo, Yanping Du, Zhaohua Wang, Jingtian Mo, Wenxuan Yu and Shuihai Dou
Beijing Institute of Graphic Communication, Beijing 102600, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(6), 370; https://doi.org/10.3390/a18060370
Submission received: 13 May 2025 / Revised: 11 June 2025 / Accepted: 13 June 2025 / Published: 19 June 2025
(This article belongs to the Special Issue Algorithms in Data Classification (3rd Edition))

Abstract

With the rapid development of e-commerce and the logistics industry, defect detection of logistics packaging has become increasingly important as a key link in product quality control. However, existing object detection models often struggle to improve detection accuracy on small-scale targets in logistics packaging while keeping model complexity low. For this reason, an improved detection model, DScanNet, is proposed in this paper. To address the insufficient extraction of fine-grained features from small target defects, which leads to low detection accuracy, we propose the MEFE module with a multi-scale convolution and feature enhancement strategy, the local feature extraction module (LFEM Block), and the PCR module, which strengthen the model's ability to capture defect features and focus on specific features and thus improve detection accuracy. To address excessive model complexity, a Mamba module incorporating a channel attention mechanism is proposed, which exploits its linear complexity to streamline the model. In experiments on our own dataset, BIGC-LP, DScanNet achieves a high accuracy of 96.8% on the defect detection task compared with current mainstream detection algorithms, while the number of model parameters and the computational volume are effectively controlled.

1. Introduction

In today's economy, brand value has become an important source of growth in the value of commodities. Packaging has long been regarded as an important carrier that both increases the value of goods and protects them from damage, and its importance continues to grow with the development of the commodity economy.
From the point of view of protecting the contents, defects in the outer packaging reflect, to a certain extent, the state of the products inside, and different defects correspond to different problems encountered during logistics. For example, a carton may be damaged during production due to improper folding, distorted and deformed during storage, scratched or torn on the surface through human error, or affected by equipment failure that causes poor adhesion or uneven fold lines. Larger defects are easy to screen out from good products. However, some minor outer-packaging defects that are hard to spot visually can lead to product quality problems, affecting consumers' repurchase decisions and the market competitiveness of the products. In this context, defect detection technology becomes particularly important.
Currently, automated vision inspection systems have gradually replaced traditional manual inspection to improve inspection accuracy and efficiency. Traditional defect detection methods, such as template matching or rule-based methods [1,2,3], often require considerable manual intervention to design and adjust the detection rules, which leads to inefficiency in real, complex industrial environments and makes it difficult to achieve good detection results. In 2012, AlexNet [4] demonstrated the powerful feature learning capabilities of convolutional neural networks (CNNs), after which target detection entered the era of deep learning. Girshick et al. [5] pioneered the introduction of deep learning techniques into the field of target detection by proposing the region-based convolutional neural network (R-CNN). Since then, deep learning has gained momentum in computer vision, with a range of powerful network architectures emerging, from CNNs to the Transformer [6] and the more recent Mamba architecture [7]. These architectures have achieved impressive results in tasks such as image classification, target detection, and image segmentation, fully demonstrating their great potential in computer vision.
As the first single-stage target detection algorithm of the deep learning era, YOLO is widely used in various scenarios because of its excellent real-time performance and accuracy, and it remains the typical algorithm that achieves the best balance between speed and accuracy. However, YOLO often suffers from insufficient detection accuracy on small-target detection tasks because small targets occupy few pixels in the image and their features are not obvious. To solve this problem, researchers have proposed a variety of Transformer-based improvement strategies, including multi-scale feature fusion and attention mechanisms.
For example, Yasir et al. [8] proposed the SwinYOLOv7 model, which achieves 96.59% ship detection accuracy by combining the YOLOv7 framework, a feature pyramid network, and the Swin Transformer with an anchor-free detection algorithm. Tong et al. [9] proposed YOLO-Faster, an enhanced lightweight remote sensing target detection network, which addresses the difficulty of detecting small-scale targets in large, complex backgrounds through a lightweight design, adaptive multi-scale feature fusion, and a decoupled detection head. Fu et al. [10] introduce a small target feature enhancement module (STFEM) and a Transformer-based jump-connected path aggregation network (Tr-PANet) into YOLO, fully extract the contextual information of small targets, and use the self-attention mechanism to model their non-local contextual features, which maximizes the retention of global structural information in small targets and addresses the low accuracy of small target detection. Wang et al. [11] embedded a novel prior enhanced transformer (PETNet) module and a one-to-many feature fusion (OMFF) mechanism in YOLO networks to improve target detection performance on aerial images.
The above studies can indeed enhance the model’s ability to detect small targets, but these methods require complex network structures and large computational resources. Subsequently, more researchers started to use hybrid architectures, such as MobileViT [12], EdgeViT [13], and EfficientFormer [14], to reduce model complexity; however, such hybrid architectures degrade model performance. Therefore, finding a balance between performance and speed has been a concern for researchers [15].
To solve the above problems, this paper designs DScanNet, an efficient and accurate small-target defect detection model based on the YOLO framework, to improve both the accuracy and the efficiency of small-target defect detection.
The main contributions of this paper are as follows:
  • We propose DScanNet, a detection algorithm based on deep learning and selective state space models, as an efficient model for defect detection in logistics packaging, providing a new technical solution to the real-time defect detection problem.
  • The multi-scale enhanced feature extractor (MEFE Block), PCR, Mamba Block, and local feature extraction module (LFEM Block) are proposed. The MEFE Block better captures detailed features and contextual information in the image through multi-scale feature extraction; PCR optimizes the feature extraction process and reduces redundant computation through partial convolution; the Mamba Block streamlines the model by exploiting Mamba's linear complexity; and the LFEM Block fully extracts local spatial information. These modules work in concert to effectively improve the model's ability to capture and focus on defect features and to enhance detection accuracy.
  • Given the current shortage of packaging defect datasets, we conducted experiments on our own dataset, BIGC-LP. After a large number of experiments, our proposed algorithm performs excellently on the defect dataset: compared with mainstream detection algorithms, its accuracy reaches 96.8%, a clear advantage in detection accuracy, while the number of model parameters and the computational volume are effectively controlled, achieving a good balance between performance and efficiency.

2. Methods

2.1. YOLO Series Real-Time Target Detection Algorithm

The YOLO (You Only Look Once) series of algorithms is a typical representative of single-stage target detection models and has remained a research hotspot since it was first proposed in 2016. By continuously optimizing the network structure, feature fusion strategies, and training techniques, this series has gradually become a widely used real-time target detection framework in both industry and academia. The core idea is to turn the target detection task into a single forward-propagation regression problem by predicting bounding boxes and category probabilities directly on an image grid, which avoids redundant computation and significantly improves detection speed compared with traditional two-stage detectors.
YOLOv1 to YOLOv3 [16,17,18] are the forerunners of the YOLO family; their performance improvements are closely tied to backbone improvements and led to the widespread use of DarkNet. YOLOv4 [19] uses architectures such as CSP-Darknet-53 and PAN to optimize speed and accuracy, and introduces the bag-of-freebies (BoF) and bag-of-specials (BoS) strategies. YOLOv5 [20] moves the YOLO development environment to PyTorch and refines the backbone network and CSP structure to further improve ease of use and performance. YOLOv6 [21] constructs EfficientRep, an efficient re-parameterization network based on RepVGG, and designs an effective decoupled detection head to improve classification accuracy. YOLOv7 [22] uses an extended efficient layer aggregation network (E-ELAN) to enhance the learning ability of the network without destroying the original gradient paths. YOLOv8 [23] adopts the C2f structure, which keeps the model lightweight while providing richer gradient-flow information. YOLOv9 [24] builds on YOLOv7 with programmable gradient information (PGI), which maximizes the retention of the key features required for target detection tasks without incurring additional cost. YOLOv10 [25] builds on previous YOLO models by eliminating non-maximum suppression (NMS) and optimizing various model components to achieve state-of-the-art performance while significantly reducing computational overhead. YOLOv11 [26] employs the C3k2 module and depthwise-separable convolution to reduce redundant computation and improve efficiency. YOLOv12 [27] designs an attention-centric YOLO framework that improves detection accuracy while maintaining real-time speed.
Owing to performance limitations, YOLOv1–v4 struggle to meet the dual requirements of accuracy and real-time performance in current industrial inspection. Although YOLOv8 and later versions achieve higher inspection accuracy, their computational complexity makes them better suited to high-performance GPU platforms and difficult to deploy efficiently on edge devices. In contrast, YOLOv5 remains the preferred solution for real-time target detection in edge computing scenarios thanks to its balance of accuracy, speed, and hardware compatibility.

2.2. An Efficient Visual Modeling Approach Based on Mamba

Most current state-of-the-art foundation models are built on Transformers with self-attention mechanisms. When processing images, self-attention achieves more accurate image understanding and more efficient task execution thanks to its global information capture, dynamic feature adaptation, parallel computation, and long-range dependency modeling. However, the computational complexity of self-attention is quadratic in the length n of the input sequence, which means that training requires more computational resources and longer training time.
Recently, Mamba [7], a sequence model based on state space models (SSMs), has maintained computational efficiency in long-sequence modeling through a selective scanning mechanism. The Mamba architecture approaches the Transformer in global modeling capability while keeping its computational overhead growing only linearly, and it has been applied to vision tasks by several groups. VMamba [28] achieves linear complexity without sacrificing the global receptive field by introducing a state model and a cross-scanning module, while also improving computational efficiency; U-Mamba [29] addresses the limitations of traditional networks in handling long-range dependencies in medical image segmentation and achieves significant results on a series of complex biomedical image segmentation tasks; Vision Mamba [30] designs a bidirectional SSM and introduces positional encoding to specialize in visual signals. Across classification, detection, and segmentation tasks, Vision Mamba significantly improves accuracy, computation, and memory efficiency compared with existing vision Transformers.
Directly applying Mamba to visual detection tasks leads to a loss of information in the global receptive field. To address this challenge, VMamba uses the VSS Block module: by introducing a two-dimensional selective scanning mechanism (SS2D), linear complexity can be preserved while the global receptive field is maintained. Because packaging defects vary in scale and are irregular in shape, the model needs long-range dependency modeling capability, so the long-range modeling ability of the selective state space is transferred to the spatial dimension to process image information, which makes it well suited to detecting packaging defects. Therefore, we design the Mamba Block, the core module of this paper, based on the VSS Block. The architecture of the VSS Block and the two-dimensional selective scanning (SS2D) module is shown in Figure 1.

3. Proposed Method

The DScanNet proposed in this paper follows the design concept of single-stage target detection networks and is built on YOLOv5 to further improve detection accuracy and computational efficiency. In the backbone network, we exploit the parallel multi-scale convolution and channel attention of the MEFE Block to improve the detection of defects of different sizes, and we use partial convolution, which reduces both redundant computation and memory access, to optimize the CBS module so that spatial features are extracted efficiently and the number of model parameters is significantly reduced. Based on this, we use the Mamba Block to replace the C3 structure, which improves computational efficiency while maintaining the global receptive field. In the neck, we follow the FPN + PAN design for feature fusion and pass the fused features to the head for the final prediction. The DScanNet network structure is shown in Figure 2:

3.1. Multi-Scale Enhanced Feature Extractor (MEFE Block)

In the backbone network we design the multi-scale enhanced feature extractor module (MEFE Block) to optimize the initial feature extraction of the input image. The aim is to enhance the model’s ability to detect different objects through the multi-scale feature extraction and enhancement mechanism.
The MEFE module employs a multi-scale parallel convolution (MSPC) strategy, in which convolution kernels of different scales are used in parallel in the same layer to extract features, each kernel producing a feature map. These feature maps are concatenated along the channel dimension to form a new feature map. Channel compression is then performed by a 1 × 1 convolutional layer to reduce redundant information and enhance the feature representation. Finally, an SE (squeeze-and-excitation) mechanism is introduced to automatically adjust the importance of the features, weighting the features of each channel before output.
This design allows the MEFE module to capture both the local detail information of the image and the global context, solving the problem that a single convolution kernel acquires insufficient information because of its restricted receptive field. At the same time, compared with stacking deeper networks, the extra computational overhead is effectively avoided. The MEFE Block is shown in Figure 3.
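A minimal PyTorch sketch of the MEFE idea described above is given below: parallel convolutions at several scales, channel concatenation, 1 × 1 compression, and SE re-weighting. The kernel sizes (3/5/7) and the SE reduction ratio are illustrative assumptions, since the exact values are not specified here.

```python
# Sketch of the MEFE Block: multi-scale parallel convolution + concat + 1x1
# compression + SE channel re-weighting. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # channel-wise re-weighting

class MEFEBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # one branch per kernel size, applied in parallel to the same input
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )
        # 1x1 convolution compresses the concatenated channels
        self.compress = nn.Conv2d(out_ch * len(kernel_sizes), out_ch, 1)
        self.se = SEBlock(out_ch)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # concat on channels
        return self.se(self.compress(feats))

# quick shape check:
# y = MEFEBlock(3, 64)(torch.randn(1, 3, 640, 640))   # -> (1, 64, 640, 640)
```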

3.2. PCR Block

To enable the network to focus on specific features and improve the efficiency of feature extraction, we use partial convolution Pconv [31] instead of the traditional convolution Conv2d. In partial convolution, the convolution filter performs convolution operations on only a portion of the channels of the input feature map, not all of them.
Pconv optimizes the computational cost based on the channel redundancy property of the feature map. When processing the feature map, only a portion of the channels of the input feature map are subjected to regular convolution operations, and the remaining channels are directly retained. For example, if the input feature map has 64 channels, Pconv may select only 32 of them for convolution, leaving the remaining 32 channels unchanged. After completing the partial convolution, the network splices the new feature map obtained from the convolution with the feature map of the part of the channel that was not processed by the convolution. This means that the final output feature map contains both the part that has been convolved and the part that has not been processed. This approach makes partial convolution more flexible in the feature extraction process, and because only a portion of the input channel is convolved, the network is able to focus its computational resources on key features, reducing unnecessary redundant computations. The FLOPs for Pconv are as follows:
$$\mathrm{FLOPs}(\mathrm{PConv}) = h \times w \times k^2 \times c_p^2$$
For the typical ratio $r = c_p / c = 1/4$ in deep learning, the FLOPs of PConv are only 1/16 of those of a regular convolution.
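The sketch below shows one way to express partial convolution in PyTorch, following the description above: a fraction r of the channels passes through a regular convolution while the remaining channels are copied through unchanged, and the two parts are concatenated. The default ratio follows the typical r = 1/4; the class name and kernel size are illustrative assumptions.

```python
# Sketch of partial convolution (PConv): convolve only c_p = r * c channels,
# pass the rest through untouched, then concatenate the two parts.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.conv_ch = int(channels * ratio)            # e.g. 16 of 64 channels for r = 1/4
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        x_conv, x_pass = torch.split(
            x, [self.conv_ch, x.shape[1] - self.conv_ch], dim=1)
        return torch.cat([self.conv(x_conv), x_pass], dim=1)

# FLOPs scale with c_p^2, so for r = c_p / c = 1/4 the cost is 1/16 of a
# regular convolution applied to all c channels.
```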

3.3. Mamba Block

The design of the Mamba Block is inspired by state space models (SSMs). State space models are mathematical models that map a one-dimensional input signal $x(t)$ to an $N$-dimensional latent state $h(t)$, which is then mapped to a one-dimensional output signal $y(t)$. SSMs are usually expressed as linear ordinary differential equations (ODEs):
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t)$$
Deep learning cannot use state space models directly; to solve this problem, the ODEs need to be discretized. This conversion effectively aligns the model with the sampling rate of the underlying signals in the input data, thereby improving computational efficiency. Considering the input $x_k \in \mathbb{R}^{L \times D}$, i.e., a sequence of length $L$ sampled from the signal stream, the equations can be discretized as follows according to the zero-order hold (ZOH) rule:
$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = \bar{C}\,h_k + D\,x_k$$
$$\bar{A} = e^{\Delta A}, \qquad \bar{B} = (\Delta A)^{-1}\,(e^{\Delta A} - I)\,\Delta B, \qquad \bar{C} = C$$
where $B, C \in \mathbb{R}^{D}$ and $I$ is the identity matrix. After discretization, the SSM can be computed by a global convolution with a structured convolution kernel $\bar{K} \in \mathbb{R}^{L}$:
$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{\,L-1}\bar{B}\big)$$
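The ZOH discretization and the resulting recurrence can be illustrated in a few lines of PyTorch. This is a didactic sketch with a dense state matrix and scalar inputs; the actual Mamba layer parameterizes Δ, B, and C as functions of the input (the selection mechanism described next) and uses a hardware-efficient parallel scan rather than a Python loop.

```python
# Illustrative zero-order-hold (ZOH) discretisation of the continuous SSM
# parameters, followed by the discrete recurrence. Shapes are kept small and
# generic for clarity; this is not the authors' implementation.
import torch

def discretize_zoh(A, B, delta):
    """A: (N, N) state matrix, B: (N, 1) input matrix, delta: scalar step size."""
    N = A.shape[0]
    A_bar = torch.matrix_exp(delta * A)                               # e^{ΔA}
    B_bar = torch.linalg.solve(delta * A,
                               (A_bar - torch.eye(N)) @ (delta * B))  # (ΔA)^{-1}(e^{ΔA}-I)ΔB
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Run h_k = Ā h_{k-1} + B̄ x_k, y_k = C h_k over a 1-D input sequence x."""
    h = torch.zeros(A_bar.shape[0], 1)
    ys = []
    for xk in x:                                  # sequential scan over length L
        h = A_bar @ h + B_bar * xk
        ys.append((C @ h).squeeze())
    return torch.stack(ys)
```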
Based on the above equations, Mamba devises a simple selection mechanism that parameterizes the SSM parameters as functions of the input. Through this mechanism, the model filters out irrelevant information and can retain relevant information indefinitely. However, directly applying Mamba to visual detection tasks leads to a loss of information in the global receptive field. To address this challenge, VMamba proposes the VSS Block module: by introducing a two-dimensional selective scanning mechanism (SS2D), linear complexity is preserved while the global receptive field is maintained. As shown in Figure 4, the core module of this paper is the Mamba Block. The input feature map passes through a series of feature extraction modules that enable the network to capture deeper features, while batch normalization keeps training and inference efficient and stable.
LFEM is used here to refer to the LFEM Block (see Section Local Feature Extraction Module (LFEM Block) for details). SS2D is composed of three parts: the scan expansion operation, the S6 module, and the scan merging operation. The scanning expansion operation expands the input image into sequences along four different directions (top left to bottom right, bottom right to top left, top right to bottom left, and bottom left to top right), which allows each pixel to capture information from other pixels in different directions. These sequences are then processed by the S6 module to extract features, ensuring that the information from each direction is fully scanned so that diverse features are captured. Subsequently, a scan-and-merge operation sums and merges the sequences from different directions [32].
$$F_1 = \Phi\big(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(x))\big)$$
$$F_2 = \mathrm{SS2D}\big(\mathrm{LN}(\mathrm{LFEM}(F_1))\big)$$
$$F = \mathrm{LN}(F_2) + F_1$$
where $F_1$ and $F_2$ are intermediate states and $F$ is the final output feature.
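Putting the three equations together, a minimal PyTorch sketch of the Mamba Block forward pass could look as follows. The LFEM and SS2D submodules are passed in (or default to identity placeholders here), and GroupNorm with a single group is used as a LayerNorm substitute for NCHW feature maps; both are implementation assumptions rather than the authors' exact choices.

```python
# Sketch of the Mamba Block forward pass:
#   F1 = GELU(BN(Conv1x1(x))),  F2 = SS2D(LN(LFEM(F1))),  F = LN(F2) + F1
import torch.nn as nn

class MambaBlock(nn.Module):
    def __init__(self, channels: int, lfem: nn.Module = None, ss2d: nn.Module = None):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()
        # placeholders by default; in DScanNet these are the LFEM Block and the
        # four-direction SS2D selective-scan module described in this section
        self.lfem = lfem if lfem is not None else nn.Identity()
        self.ss2d = ss2d if ss2d is not None else nn.Identity()
        self.ln1 = nn.GroupNorm(1, channels)   # LayerNorm-style norm for NCHW maps
        self.ln2 = nn.GroupNorm(1, channels)

    def forward(self, x):
        f1 = self.act(self.bn(self.proj(x)))        # F1
        f2 = self.ss2d(self.ln1(self.lfem(f1)))     # F2
        return self.ln2(f2) + f1                    # F: residual fusion
```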

Local Feature Extraction Module (LFEM Block)

LFEM is a local feature extraction module that plays an important role in the Mamba Block. For the input features $x_k \in \mathbb{R}^{H \times W \times C}$, a depthwise separable convolution is applied first; this convolution performs an independent convolution on each input channel, which reduces the number of parameters and the amount of computation while retaining strong feature learning ability. Batch normalization (BN) is then applied, which not only accelerates training but also provides a degree of regularization, reducing the risk of overfitting to the training set. The resulting intermediate state is defined as:
$$F_{x_k}^{1} = \mathrm{BN}\big(\mathrm{DWConv}_{3\times 3}(x_k)\big)$$
The intermediate state $F_{x_k}^{1}$ then mixes channel information through $1 \times 1$ convolutions to obtain richer feature representations, and the GELU activation function introduces nonlinearity between the channel-changing convolutions, enabling the model to learn more complex mappings. The next intermediate state is defined as
$$F_{x_k}^{2} = \mathrm{Conv}_{1\times 1}\big(\Phi(\mathrm{Conv}_{1\times 1}(F_{x_k}^{1}))\big)$$
After activation, the channel attention SE module is added to obtain the global features and attention weights for each channel, which are scaled and then output. Finally, the input features are fused with the processed features through residual connections, which effectively retains the input information and promotes feature reuse.
$$X = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{x_k}^{2}$$
$$S = \sigma\big(W_2 \cdot \Phi(W_1 \cdot X)\big)$$
$$F_{x_k}^{3} = S \cdot F_{x_k}^{2}$$
$$F_x = \mathrm{Conv}_{1\times 1}(F_{x_k}^{3}) + x_k$$
where $F_x$ is the output feature, $\Phi$ is the GELU activation function, $X$ is the channel-wise global feature, $S$ is the attention weight, $W_1$ and $W_2$ are the weights of the fully connected layers, and $\sigma$ is the sigmoid activation function.
The LFEM Block can effectively capture and utilize the local spatial information of the input feature maps and combine them with the original inputs to finally obtain the enhanced feature maps, which capture richer feature information for DScanNet in complex tasks.
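The LFEM Block equations map onto a compact PyTorch module as sketched below. The channel expansion factor and SE reduction ratio are illustrative assumptions; the structure (depthwise 3 × 3 convolution + BN, 1 × 1 → GELU → 1 × 1 channel mixing, SE re-weighting, and a residual connection back to the input) follows the description above.

```python
# Sketch of the LFEM Block: DWConv+BN -> 1x1/GELU/1x1 -> SE -> residual fusion.
import torch.nn as nn

class LFEMBlock(nn.Module):
    def __init__(self, channels: int, expansion: int = 2, reduction: int = 16):
        super().__init__()
        hidden = channels * expansion
        self.dwconv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, hidden, 1)      # expand and mix channels
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(hidden, channels, 1)      # project back
        # squeeze-and-excitation: global average pool -> W1 -> GELU -> W2 -> sigmoid
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.GELU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        f1 = self.bn(self.dwconv(x))                   # F^1: depthwise conv + BN
        f2 = self.pw2(self.act(self.pw1(f1)))          # F^2: channel mixing
        f3 = f2 * self.se(f2)                          # F^3: SE channel re-weighting
        return self.proj(f3) + x                       # residual fusion with the input
```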

4. Dataset Preparation and Experimental Environment

4.1. Dataset Preparation

This section verifies the effectiveness of the proposed method in the packaging defect detection task as well as its ability to solve problems in real production. In this experiment, carton samples with different types of defects were collected from a company and imaged to build a dataset, which we call the BIGC-LP defect dataset. The dataset contains three main types of defects, i.e., tears, scratches, and deformations, totaling 1056 images; examples are shown in Figure 5. These logistics packaging defects exhibit significant scale diversity: some appear as elongated scratches or indentations, while others appear as localized depressions or large stains. The significant differences in defect morphology and size pose a great challenge to conventional inspection models, which often struggle to recognize all of these defect types accurately at the same time.
In addition, in order to enhance the diversity of the dataset and improve the generalization ability and robustness of the model while avoiding problems such as overfitting, we used data augmentation techniques to extend the dataset, including rotation, scaling, horizontal and vertical flipping, and brightness adjustment; a total of 3270 defect images were obtained after processing. All images were uniformly resized to 640 × 640, and all defect images were annotated with labeling software using the smallest possible bounding box, with the results stored as txt files recording the defect locations. We divided the dataset in the ratio 8:1:1, giving a training set of 2616 images, a validation set of 327 images, and a test set of 327 images. Such a division ensures that the model can learn from sufficient training data while performance is evaluated on independent validation and test sets.
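As a concrete illustration of the preprocessing described above, the sketch below shows one way to implement the augmentation families (rotation, scaling, flips, brightness adjustment) and the 8:1:1 split using torchvision and the standard library. The specific rotation/scale/brightness ranges and the random seed are assumptions, not the authors' exact settings, and for detection data the same geometric transforms must also be applied to the bounding-box labels.

```python
# Illustrative offline augmentation and 8:1:1 train/val/test split.
import random
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((640, 640)),                      # uniform 640 x 640 input size
    transforms.RandomRotation(degrees=15),              # rotation (range assumed)
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),  # scaling (range assumed)
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3),             # brightness adjustment
])
# NOTE: for object detection, the geometric transforms above must also be
# applied to the YOLO-format txt box annotations.

def split_dataset(paths, seed=0):
    """Shuffle image paths and split them 8:1:1 into train/val/test subsets."""
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```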
Figure 6 shows the location and size distribution of the defect target centroids. It can be seen that the defect samples are concentrated in location, the labeled boxes overlap with one another, and the defects are mostly small targets at short distances from each other, which makes them difficult to detect and identify; they are therefore well suited to the original research aim of this paper.

4.2. Experimental Environment

In order to ensure experimental rigor, we conducted all experiments under the same server configuration to strictly evaluate the effectiveness of the improved algorithm. The experiments were run on the Linux-Ubuntu 20.04 operating system with an NVIDIA GeForce RTX 3090 GPU and an Intel i7-11700 CPU. The deep learning framework was PyTorch 1.11 with Python 3.9 and CUDA 11.3. The training parameters were configured as follows: the input image size is 640 × 640, the total number of training epochs is 100, the batch size is 16, the initial learning rate is 0.01, and SGD is used as the optimizer. To improve the model's performance in real scenarios, we turn off mosaic data augmentation in the last 10 epochs of training. Meanwhile, to ensure the reliability of the evaluation results, each metric is obtained from three independent runs, and the best of the three is reported.
The experiment-related configuration is shown in Table 1.
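For reproducibility, the training setup above can be collected into a single configuration object. The snippet below is a plain illustrative dict; the key names are not tied to any specific framework, and the device string is an assumption.

```python
# Training configuration summarized from the experimental setup above.
train_cfg = {
    "img_size": 640,
    "epochs": 100,
    "batch_size": 16,
    "optimizer": "SGD",
    "lr0": 0.01,                       # initial learning rate
    "close_mosaic_last_epochs": 10,    # mosaic augmentation disabled for final 10 epochs
    "runs_per_metric": 3,              # each metric reported as the best of 3 runs
    "device": "cuda:0",                # RTX 3090, CUDA 11.3, PyTorch 1.11
}
```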

5. Experiments

5.1. Evaluation Metrics

In this study, the number of parameters and FLOPs were used as metrics of model size and computational complexity. Model performance was evaluated using precision (P), recall (R), mean average precision (mAP), and frames per second (FPS).
The precision rate represents the ratio of true positives (TP) to all samples predicted to be positive (TP + FP). That is, how many of the samples predicted to be defective are actually defective samples, with the following equation:
$$P = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \times 100\%$$
where $N_{\mathrm{TP}}$ denotes the number of samples correctly predicted as positive and $N_{\mathrm{FP}}$ denotes the number of samples incorrectly predicted as positive.
Recall represents the ratio of true positives (TP) to all samples that are actually defective (TP + FN). That is, how many of the samples that are actually defective are predicted to be abnormal, with the following equation:
$$R = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} \times 100\%$$
where $N_{\mathrm{FN}}$ denotes the number of samples that were actually positive but incorrectly detected as negative.
The mAP is the mean of the average precision (AP) over all defect categories at a given IoU threshold, with the following equation:
$$\mathrm{mAP}_{\mathrm{IoU}} = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathrm{AP}_{\mathrm{IoU}}(i)$$
where $N_c$ denotes the number of defect categories.
FPS indicates the detection speed of the model, which is the number of images that can be processed in a second. The formula is as follows:
$$F_{\mathrm{FPS}} = \frac{N}{T}$$
where $N$ is the number of detected images and $T$ is the total detection time.
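The counting-based metrics above translate directly into code. The helpers below are a minimal sketch; in practice, AP is obtained by integrating a precision–recall curve per class at the chosen IoU threshold, which is omitted here for brevity.

```python
# Minimal helpers for the evaluation metrics defined above.
def precision(n_tp: int, n_fp: int) -> float:
    """Precision in percent: TP / (TP + FP)."""
    return 100.0 * n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0

def recall(n_tp: int, n_fn: int) -> float:
    """Recall in percent: TP / (TP + FN)."""
    return 100.0 * n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0

def mean_ap(ap_per_class: list[float]) -> float:
    """mAP at a fixed IoU threshold: mean of per-class average precision."""
    return sum(ap_per_class) / len(ap_per_class)

def fps(num_images: int, total_seconds: float) -> float:
    """Detection speed: images processed per second."""
    return num_images / total_seconds

# example: 94 true positives and 3 false positives -> precision ≈ 96.9 %
# print(precision(94, 3))
```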

5.2. Experimental Analysis

We used the same experimental setup to train the two models before and after the improvement, respectively. The specific performance metrics are shown in Table 2.
The experimental results show that the improved DScanNet algorithm increases detection precision by 4.3 percentage points compared with the original YOLOv5 algorithm, reaches a detection speed of 97.1 FPS, and significantly reduces the number of parameters and the amount of computation. Overall, the DScanNet model is lighter, requires fewer computational resources, is easier to deploy, and can meet the accuracy and real-time requirements of practical detection scenarios.
In addition, the DScanNet algorithm improves detection precision alongside a 2.9 percentage point increase in recall, which means that fewer defective samples are missed while the higher recall does not introduce additional false detections (false positives). This suggests that the algorithm improves its sensitivity to defects while maintaining good discrimination.
In order to further compare the model performance of DScanNet and YOLOv5, we compared the mAP@0.5 and mAP@0.5:0.95 variations between the two. Since there is often uncertainty in the target boundary in the actual defect detection scenario, it is more practical to adopt the evaluation standard of an IoU threshold of 0.5. That is, if there is a 50% overlap between the prediction box and the real target, the target is considered to have been detected. This index can objectively reflect the recognition ability of the model when it tolerates a certain positioning error, which is more in line with the needs of actual industrial inspection scenarios. The mAP@0.5:0.95 is the average accuracy of the average calculation using the threshold from 0.5 to 0.95 in 0.05 steps, considering the different overlap degrees, which is more comprehensive and accurate [33].
As shown in Figure 7, in the early stage of training the mAP@0.5 values of both models rise rapidly, indicating that both models quickly learn the data features. In the middle stage, DScanNet rises more steadily, while YOLOv5 is more volatile and less stable. In the later stage, both models level off, but the improved DScanNet model fluctuates less and shows a more stable convergence trend, with mAP@0.5 reaching 95.1%, 2 percentage points higher than the original YOLOv5 model. Overall, DScanNet achieves a higher mAP@0.5 in most rounds. This indicates that, under the same training conditions, DScanNet learns the key features in the dataset better, which gives it better performance in the defect detection task.
As shown in Figure 8, DScanNet consistently outperforms YOLOv5. DScanNet rose rapidly at mAP@0.5:0.95 in the early stage, fluctuating in a small range, but quickly stabilized at a higher level. Although YOLOv5 has improved in the early stage, the overall accuracy is lower than that of DScanNet, and the fluctuations are relatively more obvious. DScanNet is even more advantageous when it comes to packaging defect detection tasks.
Figure 9 and Figure 10 show the detection results of YOLOv5 and DScanNet, respectively. The experimental results show that YOLOv5 has three limitations: a high miss rate for defective targets with fuzzy boundaries, incomplete detection of small-target defects, and redundant predictions for large defects. In contrast, DScanNet, with its improved feature extraction mechanism, significantly improves the reliability and stability of defect recognition, locates and identifies defects more accurately, and provides more reliable results for quality inspection in real production. In addition, the identical experimental setup and a more realistic image acquisition environment further validate the performance advantages of DScanNet in real scenarios.

5.3. Comparison Experiment

In order to fully evaluate the performance of DScanNet, we compare it with current mainstream detection algorithms under the same experimental conditions. Specifically, YOLOv6–v12, SSD, and Faster R-CNN are selected for the following reasons: YOLOv6–v12 belong to the same series as our baseline, and comparing against models of the same series ensures that the experimental variables focus on the improvement points themselves rather than on architectural differences. In addition, we selected Faster R-CNN and SSD for comparison; they are representative two-stage and single-stage object detection algorithms, respectively, and are widely used in different scenarios for their high-quality detection results and efficient inference speed. Comparing against them allows a more complete evaluation of the model's performance. The specific experimental results are shown in Table 3.
As shown in Table 3, DScanNet achieves the best precision (96.8%) and mAP@0.5 (95.1%), giving it the highest detection accuracy; with an FPS of 97 it is also the fastest, and with only 3.68 M parameters and 9.4 B FLOPs it is the lightest model, performing best among the YOLO models of the same series. SSD has a precision of 84.98% and a mAP@0.5 of 86.1%, which is low, but the FPS is 90 and the speed is fast, with 62.74 M parameters and 26.29 B FLOPs. Faster R-CNN achieves good accuracy, with a precision of 93.66% and a mAP@0.5 of 91.73%, but its FPS is only 40, its speed is slow, and with 137 M parameters and 370.21 B FLOPs the model is complex. Overall, DScanNet performs well in recognition accuracy, judging the target category more accurately with fewer false positives, while SSD and Faster R-CNN may be more suitable for scenarios with less stringent performance requirements. The following figures show the inspection results of each model.

5.4. Ablation Experiment

In this section, we conduct ablation studies to evaluate the impact of different improvements on the final performance of the network and analyze the effectiveness of the module; “√” indicates that the corresponding improvement strategy is used in the model. The results of the ablation experiments are shown in Table 4.
Experiments show that model precision, FPS (detection speed), and mAP@0.5 (detection accuracy) all improve after the MEFE Block is introduced into the backbone network, which suggests that the MEFE Block enhances the backbone's ability to extract and process target features and improves detection accuracy.
After replacing the traditional convolution with the PCR module, the accuracy, FPS and mAP@0.5 are further improved while the number of parameters is reduced. This shows that it can streamline the model structure and improve the computational efficiency, removing the redundant parts without increasing the burden of the model, making the model more efficient.
After adding Mamba Block, the accuracy and mAP@0.5 of the model were significantly improved, while maintaining a high efficiency, indicating that it has good feature learning and processing capabilities, which can greatly improve the performance of the model, so that the model achieves a good balance between efficiency and performance.
The addition of MEFE Block, PCR, and Mamba Block all had a positive impact on the performance of the model, especially the addition of the Mamba Block, which significantly improved the accuracy and mAP@0.5 of the model, while maintaining a high efficiency, indicating that the model achieved a good balance between efficiency and performance. These improvements make Model 4 the best choice for comprehensive performance.
The ablation experiments in this section fully demonstrate the contribution of each module to the overall performance improvement of the DScanNet algorithm and also verify the rationality and effectiveness of the improvement strategy.

6. Discussion

The DScanNet model proposed in this study shows significant advantages in the task of logistics packaging defect detection, but there are still many issues to be explored.
In terms of model performance, DScanNet performs well in terms of accuracy, recall, mean average precision (mAP) and detection speed (FPS). Compared with the original YOLOv5 algorithm, the detection precision is improved by 4.3 percentage points to reach a high accuracy of 96.8%, the recall is also improved by 2.9 percentage points, and the detection speed reaches 97.1 FPS. This is due to the synergistic effect of the components of the MEFE Module, the PCR Module, and the Mamba Block. Although the model performs well on existing datasets, it may face more challenges in real-life complex and changing logistics scenarios. For example, factors such as different lighting conditions, complex backgrounds, and the diversity of packaging materials may have an impact on the model performance. Subsequent studies may consider training and testing on more diverse datasets to further improve the generalization ability of the model.
Analyzing from the perspective of model structure, the introduction of the Mamba Block is a major innovation of DScanNet. The Mamba architecture was originally applied to long-sequence modeling, and using it directly in visual detection tasks would lead to the loss of global receptive field information; this is effectively solved by the Mamba Block designed in this paper based on the VSS Block. Through the SS2D mechanism, linear complexity is achieved while the global receptive field is maintained. However, in practical applications, how to further optimize the implementation of the Mamba Block so that it runs more efficiently on edge devices with limited computational resources remains an open issue.
In terms of the dataset, the BIGC-LP defect dataset used in this study contains multiple defect types and has been processed with data augmentation, which effectively improves the generalization ability of the model. However, there is still room for improvement in the scale and diversity of the dataset. Real logistics packaging exhibits many types of defects; in addition to the tears, scratches, deformations, and stains covered by the dataset, other defect types may occur. Meanwhile, packages differ in material, size, and design, and these factors may affect the detection performance of the model. Therefore, subsequent work can consider expanding the dataset by collecting more images of different types of logistics packaging defects to further enhance the model's adaptability to various real-world situations.

7. Conclusions

In this paper, we analyzed the limitations of current defect detection algorithms in terms of detection accuracy and computational efficiency. On this basis, we propose a DScanNet model for small-target defect detection, aiming to solve the current problem that performance and speed are difficult to balance in small-target defect detection tasks. The core of the DScanNet model is the Mamba module, which enhances the model’s ability to extract local features as well as to focus on specific features through a series of feature extraction modules, and, at the same time, successfully reduces the number of parameters and the computational effort of the model using its linear complexity. In addition, we introduce the MEFE module with multi-scale convolution and feature enhancement strategies to improve the detection of objects of different sizes. Extensive experiments with existing detection algorithms have shown a significant improvement in the accuracy and computational speed of our model for defect image recognition. In the future, we will focus on exploring the application of the model to other types of defect detection tasks.

Author Contributions

Y.L.: writing—original draft, methodology, investigation, validation; Y.D.: funding acquisition, data curation; Z.W.: writing—review and editing, resources; J.M.: writing—original draft; W.Y.: software; S.D.: funding acquisition, resources. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly funded by the General Project of the Science and Technology Programme of the Beijing Municipal Commission of Education, "Research on high-dimensional intelligent optimisation of linkage lines in packaging and printing enterprises in the context of mixed-flow production" (Grant No. KM202410015003).

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request. Corresponding authors declare on behalf of all authors that the data in this article are available.

Acknowledgments

We thank the Beijing Municipal Commission of Education for financial support of this study. We would like to thank Beijing Institute of Graphic Communication for providing us with a learning platform for academic exchanges and discussions with other scholars.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anjomshoae, S.T.; Rahim, M.S.M. Enhancement of template-based method for overlapping rubber tree leaf identification. Comput. Electron. Agric. 2016, 122, 176–184. [Google Scholar] [CrossRef]
  2. Rahman, M.O.; Hussain, A.; Scavino, E.; Hannan, M.; Basri, H. DNA computer based algorithm for recyclable waste paper segregation. Appl. Soft Comput. 2015, 31, 223–240. [Google Scholar] [CrossRef]
  3. Kulkarni, K.; Evangelidis, G.; Cech, J.; Horaud, R. Continuous action recognition based on sequence alignment. Int. J. Comput. Vis. 2015, 112, 90–114. [Google Scholar] [CrossRef]
  4. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  7. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  8. Yasir, M.; Shanwei, L.; Mingming, X.; Jianhua, W.; Nazir, S.; Islam, Q.U.; Dang, K.B. SwinYOLOv7: Robust ship detection in complex synthetic aperture radar images. Appl. Soft Comput. 2024, 160, 111704. [Google Scholar] [CrossRef]
  9. Tong, Y.; Yue, G.; Fan, L.; Lyu, G.; Zhu, D.; Liu, Y.; Meng, B.; Liu, S.; Mu, X.; Tian, C. YOLO-Faster: An efficient remote sensing object detection method based on AMFFN. Sci. Prog. 2024, 107, 00368504241280765. [Google Scholar] [CrossRef]
  10. Fu, X.; Zhou, Z.; Meng, H.; Li, S. A synthetic aperture radar small ship detector based on transformers and multi-dimensional parallel feature extraction. Eng. Appl. Artif. Intell. 2024, 137, 109049. [Google Scholar] [CrossRef]
  11. Wang, T.; Ma, Z.; Yang, T.; Zou, S. PETNet: A YOLO-based prior enhanced transformer network for aerial image detection. Neurocomputing 2023, 547, 126384. [Google Scholar] [CrossRef]
  12. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  13. Chen, Z.; Zhong, F.; Luo, Q.; Zhang, X.; Zheng, Y. Edgevit: Efficient visual modeling for edge computing. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, Dalian, China, 24–26 November 2022; pp. 393–405. [Google Scholar]
  14. Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16889–16900. [Google Scholar]
  15. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv 2024, arXiv:2406.05835. [Google Scholar]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  20. Nelson, J.; Solawetz, J. Yolov5 is here: State-of-the-art object detection at 140 fps. Roboflow 2020, 17, 26. Available online: https://blog.roboflow.com/yolov5-is-here/ (accessed on 22 November 2022).
  21. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  23. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 June 2025).
  24. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  26. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  27. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  28. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  29. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  30. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. ICML 2024. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  31. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  32. Shi, H.; Wang, N.; Xu, X.; Qian, Y.; Zeng, L.; Zhu, Y. HeMoDU: High-Efficiency Multi-Object Detection Algorithm for Unmanned Aerial Vehicles on Urban Roads. Sensors 2024, 24, 4045. [Google Scholar] [CrossRef]
  33. Szőlősi, J.; Szekeres, B.J.; Magyar, P.; Adrián, B.; Farkas, G.; Andó, M. Welding defect detection with image processing on a custom small dataset: A comparative study. IET Collab. Intell. Manuf. 2024, 6, e70005. [Google Scholar] [CrossRef]
  34. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Figure 1. (Left) VSS Block. The global acceptance domain is implemented through the SS2D module, which dynamically adjusts the weights and linear complexity. (Right) SS2D. The input blocks are traversed along four different paths, and each sequence is processed independently by a different S6 block. Subsequently, the results are merged as the final output.
Figure 2. DScan Network Architecture.
Figure 3. MEFE Block. The module is divided into two steps. Firstly, different scales of convolution kernels are used in parallel in the same layer to extract features and perform splicing and channel compression; in the second step, the importance of the features is adjusted by the SE mechanism, and the features of each channel are weighted and output.
Figure 4. Mamba Block.
Figure 5. Defect types.
Figure 6. Target center point location and size distribution.
Figure 7. Comparison of the mAP@0.5 curves of YOLOv5 and DScanNet.
Figure 8. Comparison of the mAP@0.5:0.95 curves of YOLOv5 and DScanNet.
Figure 9. YOLOv5 detection results.
Figure 10. DScanNet detection results.
Table 1. Configurations related to this experiment.
Name                        Configuration
Operating System            Linux-Ubuntu 20.04
CPU                         Intel i7-11700
GPU                         NVIDIA GeForce RTX 3090
RAM                         24 GB
IDE                         PyCharm 2020
Deep Learning Framework     PyTorch 1.11
Programming Language        Python 3.9
CUDA                        Version 11.3
Image size                  640 × 640
Learning rate               0.01
Epochs                      100
Batch size                  16
Optimizer                   SGD
Table 2. Specific performance comparison of the model before and after improvement.
Algorithm     Precision    Recall    mAP@0.5    mAP@0.5:0.95    FPS     Param (M)    FLOPs (B)
YOLOv5        92.5         91.7      93.1       61.3            83.1    7.02         16.1
DScanNet      96.8         94.6      95.1       66.1            97.1    3.68         9.4
Note: Bold values indicate the best performance.
Table 3. We conducted experiments on a home-made defect dataset and compared the current mainstream detection algorithms for this defect.
Model               Precision    mAP@0.5    mAP@0.5:0.95    FPS     Params (M)    FLOPs (B)
YOLOv6 [21]         92.7         91.8       61.3            60      9.14          17.23
YOLOv7 [22]         93.5         93.1       64              90.2    6.007         13
YOLOv8 [23]         90.1         88.7       58.2            76      11.2          28
YOLOv9 [24]         89.5         85         54.4            72.5    7.2           26
YOLOv10 [25]        60.2         58.7       34.9            83.4    7.2           21.6
YOLOv11 [26]        86.8         86.1       54.8            72.5    9.4           21.5
YOLOv12 [27]        85.9         82.2       47.9            65.7    9.3           21.4
SSD [34]            84.98        86.06      50.9            52.3    62.74         26.29
Faster R-CNN [35]   85.66        87.05      56              40      137           370.21
DScanNet            96.8         95.1       66.1            97.1    3.68          9.4
Note: Bold values indicate the best performance.
Table 4. Ablation experiments were conducted to analyze the effectiveness of the different improvement strategies in this paper.
Model    MEFE Block    PCR    Mamba Block    Precision    FPS     mAP@0.5    Params (M)
1        –             –      –              92.5         83.1    93.1       7.02
2        √             –      –              93.1         92.5    93.5       7.27
3        √             √      –              94.3         96.4    93.9       5.67
4        √             √      √              96.8         97.1    95.1       3.68
Note: Bold values indicate the best performance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

