Dangerous Goods Detection in X-Ray Security Inspection Images Based on Improved YOLOv8-seg

Wang, Ting; Yuan, Pengfei; Wang, Aili

doi:10.3390/electronics15051112

by Ting Wang ¹, Pengfei Yuan ² and Aili Wang ²,*

¹ School of Electronics and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150022, China

² Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(5), 1112; https://doi.org/10.3390/electronics15051112

Submission received: 5 February 2026 / Revised: 4 March 2026 / Accepted: 5 March 2026 / Published: 7 March 2026

(This article belongs to the Special Issue Image Processing, Target Tracking and Recognition System Design)

Abstract

In X-ray security inspection imagery, hazardous object detection is challenged by severe object overlap/occlusion, ambiguous boundaries of small objects, and complex texture representations caused by material diversity. Although YOLOv8-seg provides real-time instance segmentation capability, it still has clear limitations in this application scenario. Specifically, the original SPPF module has limited ability to model long-range spatial dependencies, making it difficult to accurately separate boundaries of densely overlapped objects, while the C2f module is insufficient for multi-scale feature parsing of hazardous items with diverse sizes and materials and introduces feature redundancy, which degrades segmentation accuracy in occluded scenes. To address these issues, this paper proposes an improved YOLOv8-seg framework for X-ray hazardous object detection, termed LM-YOLOv8. For feature enhancement, an SPPF-LSKA module is constructed by integrating large-kernel separable attention with dynamic receptive-field adjustment, thereby improving global contextual modeling and alleviating boundary ambiguity. For multi-scale feature fusion, a C2f-MSC module is designed by combining multi-branch dilated convolutions with the C2f structure to enhance complex contour parsing and cross-scale feature interaction. Experiments on the PIDray dataset show that the proposed method achieves 84.8% mAP50 in instance segmentation, representing an improvement of approximately 4.0 percentage points over the baseline YOLOv8-seg. In addition, the method demonstrates stronger robustness on challenging hard/hidden subsets, validating its effectiveness for X-ray security inspection hazardous object detection.

Keywords: X-ray security inspection; instance segmentation; YOLOv8-seg; large separable kernel attention

1. Introduction

Security screening is a critical component of passenger safety in transportation systems. However, dangerous goods detection in X-ray security images still relies heavily on manual inspection in many practical scenarios. This paradigm has two major limitations: (1) prohibited items are often overlapped or occluded by other objects in X-ray images, which increases recognition difficulty; and (2) prohibited items occur with low probability, so inspectors are exposed to large numbers of negative samples, which can cause fatigue and reduced attention during prolonged operation. These factors degrade both detection accuracy and inspection efficiency, making automated dangerous goods recognition in X-ray imagery an important research problem.

With the development of machine vision and deep learning, X-ray dangerous goods detection has gradually shifted from handcrafted feature-based methods to deep learning-based approaches. Early studies mainly focused on feature representation optimization. For example, BoVW-based methods encode local features into global statistical histograms to improve robustness and classification efficiency, and category-oriented feature clustering further enhances semantic discriminability in complex scenes [1]. Subsequently, CNN-based detectors have been introduced to X-ray security inspection. DML-Net improves occluded prohibited item detection through enhanced appearance representation and cross-dataset domain alignment [2], while STN + Faster R-CNN improves recall under severe occlusion by dynamic geometric correction [3]. Transformer- and attention-based methods further improve global context modeling and feature weighting, such as MHT-X [4] and DGDN [5].

Due to their real-time advantages, YOLO-series models have become a mainstream direction in X-ray security inspection [6,7,8,9]. A series of studies have improved YOLOv4/v5/v8 variants by introducing attention mechanisms, feature pyramid redesign, Transformer modules, and edge-enhanced structures [10,11,12,13]. These methods achieve promising detection performance on datasets such as SIXray, OPIXray, CLCXray, and PIDray. However, most existing works focus primarily on object detection accuracy, while the instance-level segmentation of dangerous goods in dense and occluded X-ray scenes remains insufficiently addressed.

The difficulty of X-ray instance segmentation is rooted in the imaging mechanism and scene complexity. X-ray imaging projects multiple objects onto a 2D plane, causing superposition of pixel responses rather than clean foreground–background separation. As a result, occlusion affects segmentation in three ways: (i) boundary discriminability is weakened because mixed responses reduce local gradient consistency, leading to mask adhesion or contour leakage; (ii) visible regions become incomplete, which breaks object shape integrity and weakens shape-prior-based feature aggregation; and (iii) dense overlap intensifies feature competition among neighboring instances, causing inconsistency across classification, localization, and mask prediction. These effects are typically reflected in lower recall and poorer mask quality metrics (e.g., mAP50-95), especially on hard/hidden subsets.

YOLOv8-seg is an efficient single-stage real-time instance segmentation model and is therefore a suitable baseline for online X-ray security inspection. It generates shared mask prototypes and combines them with instance-specific coefficients to directly output instance masks, enabling collaborative optimization of detection and segmentation. Nevertheless, in X-ray hazardous object segmentation, YOLOv8-seg still has application-specific deficiencies. First, the original SPPF module mainly performs multi-scale local pooling and has limited ability to model long-range spatial dependencies, which is unfavorable for separating densely overlapped object boundaries. Second, although the C2f structure is computationally efficient, it is not specifically designed to suppress feature redundancy and preserve discriminative multi-scale cues in multi-material X-ray images, which may degrade segmentation quality in complex scenes. Third, many existing improvements report overall detection gains but do not explicitly target security-screening-specific segmentation challenges, such as occlusion robustness and mask boundary quality.
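The prototype-and-coefficient mechanism mentioned above can be illustrated with a minimal pure-Python sketch: one instance mask is formed as the sigmoid of a linear combination of shared prototype maps. The function name `assemble_mask`, the 0.5 threshold, and the toy 2 × 2 prototypes are illustrative assumptions, not taken from the paper; a real pipeline works on tensors and typically also crops each mask to its predicted box.

```python
import math

def assemble_mask(prototypes, coefficients, threshold=0.5):
    """Combine shared prototype maps with per-instance coefficients.

    prototypes: list of K maps, each an H x W nested list of floats.
    coefficients: list of K floats predicted for one instance.
    Returns an H x W binary mask (nested list of 0/1).
    """
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            # Linear combination of the K prototype responses at pixel (i, j).
            logit = sum(c * p[i][j] for c, p in zip(coefficients, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask

# Two hypothetical prototypes (toy values chosen for illustration).
protos = [
    [[1.0, -1.0], [1.0, -1.0]],   # responds to the left column
    [[1.0, 1.0], [-1.0, -1.0]],   # responds to the top row
]
left_object = assemble_mask(protos, [2.0, 0.0])
top_object = assemble_mask(protos, [0.0, 2.0])
```

Because the prototypes are shared, two instances differ only in their coefficient vectors, which is what keeps this style of mask head inexpensive.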

To address these issues, this paper proposes LM-YOLOv8, an improved YOLOv8-seg-based instance segmentation model for dangerous goods detection in X-ray security images. The design follows a problem-driven strategy: enhancing long-range context modeling for occluded instance separation and improving multi-scale discriminability while reducing feature redundancy in multi-material scenes. Specifically, an SPPF-LSKA composite module is introduced to strengthen global spatial perception and boundary discrimination under occlusion with controlled computational overhead; in addition, a C2f-MSC module is designed to decouple feature paths and perform heterogeneous multi-scale parsing, thereby improving contour preservation and texture-detail representation.

The remainder of this paper is organized as follows. Section 2 presents the proposed LM-YOLOv8 framework and describes the designs of the SPPF-LSKA and C2f-MSC modules in detail. Section 3 introduces the experimental settings, dataset, comparative results, and ablation studies. Section 4 further discusses the effectiveness of the proposed method through visualization analysis and receptive field analysis in complex X-ray security inspection scenarios. Finally, Section 5 concludes this paper and outlines possible directions for future work.

The main contributions of this work are as follows:

1. An SPPF-LSKA module is designed to compensate for the limited long-range dependency modeling of the original SPPF in dense occlusion scenarios, improving global context aggregation and occlusion-aware boundary separation while maintaining efficiency.

2. A C2f-MSC module is proposed to address feature redundancy and insufficient multi-scale discriminability in multi-material X-ray images by channel decoupling and heterogeneous convolution-based multi-branch parsing.

2. Materials and Methods

To detect the location and shape of dangerous goods more accurately, this paper constructs an improved X-ray dangerous goods detection model based on YOLOv8-seg, as shown in Figure 1. The model strengthens YOLOv8-seg for hazardous materials in X-ray images through two modifications, summarized as follows:

1. The original SPPF module extracts local detail features through multi-scale pooling, but it struggles to establish long-range spatial dependencies among hazardous materials in X-ray images (such as correlations between multiple objects inside a package). This study therefore combines SPPF with Large Separable Kernel Attention (LSKA): the SPPF module preserves the geometric structure of small targets through multi-scale pooling, while the LSKA module adopts an axial decomposition strategy that splits a two-dimensional large convolution kernel into one-dimensional kernels in the horizontal and vertical directions, reducing computational complexity while building global spatial perception.

2. Because material attenuation varies widely and textures are complex in X-ray images, the static multi-branch structure of the original C2f module struggles to adaptively distinguish the characteristics of different hazardous materials such as metals and liquids. The proposed C2f-MSC adopts a channel decoupling strategy that splits the input features into a native path and multi-scale parsing paths. The native path passes the original attenuation coefficient distribution through directly to avoid losing key information; the multi-scale paths extract local details and wide-area context in parallel through heterogeneous 3 × 3 and 5 × 5 convolutions; and the fusion stage balances the contributions of different material features through a weight-adaptive mechanism.

2.1. SPPF-LSKA Module

In X-ray detection scenarios, handling complex object overlap and severe occlusion usually requires attention to both local details and global information. The traditional Spatial Pyramid Pooling (SPP) module obtains multi-scale feature information through pooling at multiple scales, but it still struggles to extract global information fully for small objects or occluded areas. YOLOv8's SPPF (Spatial Pyramid Pooling - Fast) builds on SPP by fusing the outputs of successive pooling stages, which improves perception and detection performance to a certain extent. However, it still pays little attention to global context, especially given the multi-target overlap and redundant local details common in X-ray images. The SPPF network structure is shown in Figure 2.

To address the challenge of long-range dependency modeling caused by severe occlusion in X-ray security images, this section incorporates the Large Separable Kernel Attention (LSKA) mechanism to enhance the network's global perception capability [14]. Conventional large-kernel convolution and multi-head self-attention face significant challenges in practice: the parameter count and computation of a full two-dimensional convolution grow quadratically with kernel size, so a large kernel requires storing many weights and slows detection. In addition, a single dilated convolution struggles to balance receptive field expansion against the preservation of feature resolution.

To address these issues, this section proposes the SPPF-LSKA architecture shown in Figure 3. The scheme achieves efficient feature extraction through a three-stage decomposition strategy. First, a depthwise convolutional layer captures local spatial features, with channel-independent computation reducing parameter coupling. Second, a dilated convolutional layer broadens the feature capture range, using interval sampling to overcome the limited reach of conventional convolution. Third, an axial decomposition decouples the two-dimensional convolution kernel into cascaded horizontal and vertical one-dimensional kernels, significantly reducing computation while maintaining the completeness of feature extraction. Compared with the standard LSKA, SPPF-LSKA reduces redundant weight storage through parameter sharing and improves matrix operation efficiency with a separated convolution structure. This reorganization offers advantages in the complex scenes characteristic of X-ray security images, including resolving multi-level occlusion of objects inside a package, suppressing scattering noise from metal objects, and enhancing the edge features of densely stacked objects.
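The axial decomposition described above can be checked on a toy example. The sketch below, written in plain Python with hypothetical kernel values, shows that applying a 1 × K kernel followed by a K × 1 kernel gives the same result as a single K × K kernel formed from their outer product, while storing 2K rather than K² weights per channel. Note the limitation: only kernels that factor in this way (rank-1 kernels) are reproduced exactly, which is the expressiveness traded for the lower cost.

```python
def correlate2d_valid(image, kernel):
    """2-D cross-correlation without padding, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

row = [1, 2, 1]       # 1 x K horizontal kernel (illustrative values)
col = [1, 0, -1]      # K x 1 vertical kernel (illustrative values)
full = [[c * r for r in row] for c in col]   # equivalent K x K kernel

# Small integer test image so the comparison below is exact.
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

# Horizontal pass first, then vertical pass on its output.
two_pass = correlate2d_valid(correlate2d_valid(image, [row]), [[c] for c in col])
one_pass = correlate2d_valid(image, full)

weights_axial = len(row) + len(col)   # 2K weights per channel
weights_full = len(row) * len(col)    # K^2 weights per channel
```

For a large kernel the gap widens quickly: with K = 23, the axial pair stores 46 weights per channel against 529 for the full kernel.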

To clarify the efficiency of the proposed SPPF-LSKA, we compare its computational complexity with the original LSKA under the same input feature size. Let the input feature map be $X \in \mathbb{R}^{C \times H \times W}$. We report complexity in FLOPs, where one multiplication and one addition are counted separately (i.e., $\mathrm{FLOPs} = 2 \times \mathrm{MACs}$).

For a depthwise convolution with kernel size $a \times b$, the parameter count and FLOPs are:

$$\mathrm{Params}_{DW(a \times b)} = C \cdot a \cdot b, \qquad \mathrm{FLOPs}_{DW(a \times b)} = 2 \cdot HWC \cdot a \cdot b. \tag{1}$$

For a pointwise convolution ($1 \times 1$) with identical input/output channels $C$, the parameter count and FLOPs are:

$$\mathrm{Params}_{1 \times 1} = C^{2}, \qquad \mathrm{FLOPs}_{1 \times 1} = 2 \cdot HWC^{2}. \tag{2}$$

The original LSKA branch is implemented by an axial pair of depthwise convolutions $1 \times K$ and $K \times 1$, followed by a $1 \times 1$ projection. Its complexity is:

$$\mathrm{Params}_{LSKA} = C(2K) + C^{2}, \qquad \mathrm{FLOPs}_{LSKA} = 2 \cdot HWC(2K) + 2 \cdot HWC^{2}. \tag{3}$$

SPPF-LSKA introduces two additional depthwise stages (a depthwise $3 \times 3$ and a dilated depthwise $3 \times 3$; dilation does not change parameter count), followed by the same axial $1 \times K$ and $K \times 1$ depthwise operations and a $1 \times 1$ projection. Therefore:

$$\mathrm{Params}_{SPPF\text{-}LSKA} = C(9) + C(9) + C(2K) + C^{2} = C(18 + 2K) + C^{2}, \tag{4}$$

$$\mathrm{FLOPs}_{SPPF\text{-}LSKA} = 2 \cdot HWC(9) + 2 \cdot HWC(9) + 2 \cdot HWC(2K) + 2 \cdot HWC^{2} = 2 \cdot HWC(18 + 2K) + 2 \cdot HWC^{2}. \tag{5}$$

Thus, the incremental overhead of SPPF-LSKA over LSKA is:

$$\Delta\mathrm{Params} = 18C, \qquad \Delta\mathrm{FLOPs} = 2 \cdot 18\,HWC = 36\,HWC, \tag{6}$$

which indicates that the additional cost comes only from the two extra depthwise $3 \times 3$ stages and scales linearly with $C$ and $H \times W$. Since the SPPF stage is typically at a relatively low spatial resolution in YOLOv8-seg, this overhead remains limited in practice while improving contextual aggregation.

For $C = 256$, $H = W = 20$, and $K = 23$:

$$\Delta\mathrm{Params} = 18 \times 256 = 4608, \qquad \Delta\mathrm{FLOPs} = 36 \times 20 \times 20 \times 256 = 3{,}686{,}400. \tag{7}$$
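The counts in Equations (1)–(7) can be reproduced with a few lines of Python. This is a bookkeeping sketch only: the function names are hypothetical, and it follows the simplified stage list used in the equations above rather than any particular implementation.

```python
def depthwise_cost(channels, height, width, kh, kw):
    """Params and FLOPs of a depthwise conv, following Equation (1)."""
    params = channels * kh * kw
    flops = 2 * height * width * channels * kh * kw
    return params, flops

def pointwise_cost(channels, height, width):
    """Params and FLOPs of a 1x1 conv with C inputs and C outputs, Equation (2)."""
    return channels ** 2, 2 * height * width * channels ** 2

def lska_cost(channels, height, width, k):
    """Axial 1xK and Kx1 depthwise pair plus 1x1 projection, Equation (3)."""
    stages = [
        depthwise_cost(channels, height, width, 1, k),
        depthwise_cost(channels, height, width, k, 1),
        pointwise_cost(channels, height, width),
    ]
    return sum(p for p, _ in stages), sum(f for _, f in stages)

def sppf_lska_cost(channels, height, width, k):
    """LSKA plus a 3x3 and a dilated 3x3 depthwise stage, Equations (4)-(5)."""
    extra_params, extra_flops = depthwise_cost(channels, height, width, 3, 3)
    base_params, base_flops = lska_cost(channels, height, width, k)
    # Dilation changes the sampling pattern, not the number of weights,
    # so both extra stages cost the same as a plain 3x3 depthwise conv.
    return base_params + 2 * extra_params, base_flops + 2 * extra_flops

C, H, W, K = 256, 20, 20, 23
lska_params, lska_flops = lska_cost(C, H, W, K)
ours_params, ours_flops = sppf_lska_cost(C, H, W, K)
delta_params = ours_params - lska_params
delta_flops = ours_flops - lska_flops
print(delta_params, delta_flops)  # 4608 3686400
```

Under these counts, the LSKA branch itself has 77,312 parameters at this setting, so the two extra depthwise stages add roughly 6% in both parameters and FLOPs.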

2.2. C2f-MSC Module

To solve the serious occlusion problem in X-ray security inspection images, we propose a novel multi-channel multi-scale interactive convolution C2f-MSC. X-ray images typically contain complex textures and stacked objects, making object detection particularly difficult. Our method aims to reduce the redundancy of feature maps, decrease computational complexity and parameter count, while enhancing the network’s ability to capture multi-scale information, which is crucial for detecting occluded objects. Figure 4 shows the structural design of the C2f-MSC layer.

In object detection tasks, feature maps produced by convolution typically contain redundant information [15]. These feature maps include both high-frequency edge information and low-frequency overall contour information. Splitting these features across channels and letting the groups interact can improve recognition of occluded targets while reducing computational complexity [16]. As shown in Figure 5, several feature maps extracted by the first convolutional layer of YOLOv8 are highly similar, such that one can be generated from another by a linear transformation.

To this end, we propose a C2f-MSC module for achieving multi-scale interaction across channel dimensions while reducing computational complexity. In Equations (8)–(12), the input feature map is evenly partitioned along the channel dimension. This equal partition is adopted as a balanced default design rather than a claim of theoretical optimality. The rationale is threefold: First, equal partitioning maintains comparable channel capacity for the native path and the multi-scale parsing path, which avoids biasing feature allocation toward either feature preservation or feature transformation at initialization. Second, under fixed total channels, equal splitting provides a simple and stable implementation without introducing an additional ratio hyperparameter, which helps control model complexity and improves reproducibility. Third, it preserves regular tensor shapes and facilitates efficient parallel computation in the proposed multi-branch structure.

The specific principle is as follows:

Given the input feature map $X \in \mathbb{R}^{C \times H \times W}$, first divide it evenly along the channel dimension into two parts:

$$X_{cheap} = X[0 : \tfrac{C}{2}], \qquad X_{complex} = X[\tfrac{C}{2} : C] \tag{8}$$

Here, $X_{cheap}$ retains the original features and is passed on only through a residual connection, at zero computational cost, while $X_{complex}$ is further split into two sub-channel groups of $\frac{C}{4}$ channels each:

$$X_{3}, X_{5} = \mathrm{Split}(X_{complex}, \tfrac{C}{4}) \tag{9}$$

$X_{3}$ and $X_{5}$ are processed by 3 × 3 and 5 × 5 convolutions, respectively, yielding $Y_{3}$ and $Y_{5}$:

$$Y_{3} = \mathrm{Conv}_{3 \times 3}(X_{3}) \tag{10}$$

$$Y_{5} = \mathrm{Conv}_{5 \times 5}(X_{5}) \tag{11}$$

Finally, $X_{cheap}$, $Y_{3}$, and $Y_{5}$ are concatenated along the channel dimension and fused through a 1 × 1 convolution:

$$Y = \mathrm{Conv}_{1 \times 1}([X_{cheap}; Y_{3}; Y_{5}]) \tag{12}$$
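The channel bookkeeping behind Equations (8)–(12) can be sketched as follows. The function name `msc_channel_plan` and the ordering of the two quarter-width groups within the second half are assumptions made for illustration; the sketch also assumes each branch convolution preserves its channel count and spatial size, so that the concatenation again has C channels.

```python
def msc_channel_plan(channels):
    """Return channel index ranges for the split in Equations (8)-(9).

    The identity (cheap) path keeps the first C/2 channels; the remaining
    C/2 channels are halved again for the 3x3 and 5x5 branches.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4 for an even split")
    half, quarter = channels // 2, channels // 4
    return {
        "cheap": (0, half),                      # X_cheap, passed through unchanged
        "conv3x3": (half, half + quarter),       # X_3
        "conv5x5": (half + quarter, channels),   # X_5
    }

plan = msc_channel_plan(64)
# Width of the tensor fed to the final 1x1 fusion convolution.
concat_channels = sum(stop - start for start, stop in plan.values())
```

With 64 input channels, the identity path carries 32 channels and each convolution branch carries 16, so only half of the channels pass through any spatial convolution.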

In the YOLOv8-based framework, we use C2f-MSC modules to build a feature extraction architecture adapted to the characteristics of X-ray security imaging. To counter the superposition of multi-material attenuation projections and the blurring of target edges that are common in X-ray penetration imaging, the design adopts a channel decoupling strategy that divides the input features into a native path and multi-scale parsing paths. The former preserves the original attenuation coefficient distribution to maintain information integrity in low signal-to-noise conditions, while the latter extracts local detail gradients and wide-area spatial correlations in parallel through heterogeneous convolution groups (3 × 3 and 5 × 5 kernels), helping to separate material boundaries merged by spectral hardening effects. The fusion stage then combines the paths through a 1 × 1 convolution, whose learned weights balance the high-frequency edge responses of metal objects against the continuous attenuation features of organic regions.

To verify its effectiveness in reducing complexity, we will now compare and analyze the computational complexity between the MC-Conv convolution in the C2f-MSC model and the traditional convolution model.

Assume the input feature map has dimensions $C \times H \times W$. The computational complexity of performing 3 × 3 and 5 × 5 convolutions on a quarter of the channels each is:

$$\mathcal{O}_{3 \times 3} = 9 \cdot \frac{C}{4} \cdot H \cdot W \tag{13}$$

$$\mathcal{O}_{5 \times 5} = 25 \cdot \frac{C}{4} \cdot H \cdot W \tag{14}$$

The complexity of the traditional full-channel multi-scale convolution is:

$$\mathcal{O}_{direct} = (9 + 25)\, C \cdot H \cdot W \tag{15}$$

The total complexity of MC-Conv is:

$$\mathcal{O}_{MC\text{-}Conv} = \frac{C}{4}(9 + 25) \cdot H \cdot W + 1 \cdot C \cdot H \cdot W \tag{16}$$

The second term corresponds to the 1 × 1 convolution. Under this counting, MC-Conv costs $9.5 \cdot CHW$ against $34 \cdot CHW$ for the direct approach, roughly 28% of the latter, so C2f-MSC significantly reduces computational cost while maintaining feature diversity.

Note that the standard convolution cost is $C_{in} C_{out} k^{2} HW$. In Equations (13)–(16), each multi-scale branch processes only $C/4$ channels (thus $C_{in} = C_{out} = C/4$ for the branch), and the missing cross-channel $C_{in} C_{out}$ mixing is captured by the final $1 \times 1$ fusion term in Equation (16).
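As a quick check of Equations (13)–(16), the ratio between the two costs can be computed exactly. The sketch below uses hypothetical function names and follows the per-channel counting of those equations, which differs from the standard $C_{in} C_{out} k^{2} HW$ count; the feature size is an arbitrary example.

```python
from fractions import Fraction

def direct_cost(channels, height, width):
    """Full-channel 3x3 plus 5x5 cost, Equation (15)."""
    return (9 + 25) * channels * height * width

def mc_conv_cost(channels, height, width):
    """Quarter-channel 3x3 and 5x5 branches plus the 1x1 fusion term, Equation (16)."""
    branches = Fraction(channels, 4) * (9 + 25) * height * width
    fusion = channels * height * width
    return branches + fusion

C, H, W = 256, 20, 20   # example feature size, chosen for illustration
ratio = mc_conv_cost(C, H, W) / direct_cost(C, H, W)
print(ratio)  # 19/68
```

The ratio 19/68 is about 0.28 and does not depend on C, H, or W, since both expressions scale with the product of the three.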

3. Results

To verify the effectiveness of the proposed model in X-ray security inspection scenarios, this section evaluates its detection accuracy, robustness, and detection speed through multiple sets of comparative experiments. The experiment covers baseline model comparison, ablation experiments, and in-depth analysis through visualization.

3.1. Experimental Environment and Dataset Description

The experimental environment of this study is built on PyCharm 2023.1 IDE (JetBrains s.r.o., Prague, Czech Republic). The model architecture is constructed using PyTorch 2.0.1 deep learning framework (Meta Platforms Inc., Menlo Park, CA, USA). The hardware platform is configured with an Intel Core i5-13490F CPU (turbo 4.8 GHz; Intel Corporation, Santa Clara, CA, USA) combined with an NVIDIA GeForce RTX 4070 graphics card (12 GB GDDR6X memory; NVIDIA Corporation, Santa Clara, CA, USA), and GPU-accelerated computing is achieved through CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA). The model training is conducted for a total of 300 epochs, with a batch size set to 32. The input images are uniformly scaled to a resolution of 640 × 640 pixels. We use a standard SGD-based optimization with momentum and weight decay, together with a learning-rate schedule including warm-up. Mixed-precision training is enabled to accelerate training and reduce memory usage. Data augmentation includes random flips, scale/translation, and color jitter following the default training pipeline. For inference, NMS is applied with fixed confidence and IoU thresholds.

Exploring the precise localization of hazardous materials requires a dataset with pixel-level instance segmentation annotations. Existing datasets have supported significant progress in object detection, such as SIXray, which contains over one million weakly annotated samples, and OPIXray, which focuses on tool classification, but only the PIDray dataset provides pixel-level instance segmentation annotations. This fine-grained annotation plays a crucial role in accurately identifying the contour and spatial distribution of hazardous materials. This study therefore chooses PIDray as the core benchmark dataset for algorithm validation. The PIDray dataset was constructed by the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences [17]. It covers 12 kinds of dangerous goods commonly seen in X-ray security images, and each image contains at least one dangerous item annotated with a bounding box and a mask. The test set is divided into three subsets, namely easy, hard, and hidden. The hidden subset focuses on prohibited items intentionally hidden among cluttered objects (for example, by wrapping wires around an item to change its shape). The division of the PIDray dataset is shown in Table 1 and Figure 6.

In X-ray security inspection, dangerous goods are difficult to detect accurately because of monotonous pseudo-color rendering, blurry small-target features, overlapping occlusion of items, and image distortion caused by imaging-angle deviation. Dataset examples are shown in Figure 7.

3.2. Comparative Experiments and Analysis

To verify the detection performance of the improved YOLOv8 algorithm on X-ray dangerous goods images, this study compared the proposed algorithm with mainstream object detection algorithms such as YOLOv11, YOLOv8, and YOLACT on the PIDray dataset, as shown in Table 2.

To ensure a fair comparison among YOLOv8, YOLOv11, and YOLACT in Table 2, we trained all models using the same training configuration, including the same epochs, batch size, input resolution, and augmentation pipeline. Pretrained weights were not used; all models were trained from scratch. Moreover, we did not optimize hyperparameters separately for each model—a single fixed set of hyperparameters was applied to all methods to maintain comparability.

As shown in Table 2, the improved YOLOv8 algorithm delivers a clear performance improvement on the PIDray dataset. It outperforms YOLOv11 and the original YOLOv8 on all four core metrics. Its precision is 83.5%, which is 3.2 and 3.7 percentage points higher than YOLOv11 and YOLOv8, respectively; likewise, its recall rises to 80.8%, which is 1.3 and 2.9 percentage points higher than YOLOv11 and YOLOv8, respectively. In terms of localization accuracy, the improved model achieves 84.8% mAP50 and 63.0% mAP50-95.

The per-category results are shown in Table 3; the improved algorithm performs differently across forms of prohibited items. The mAP50 values of baton, pliers, hammer, and handcuffs are all above 99.0%, whereas cutting tools, with their flat structure and frequent occlusion, reach an mAP50 of only 48.5% and a precision of 59.0%. Handcuffs, with their clear geometric features, achieve an mAP50-95 of 78.5%. Across the 12 hazardous item categories in the PIDray dataset, the algorithm attains an mAP50 of 84.8% and an mAP50-95 of 63.0%.

The training results on the hidden subset are shown in Table 4. To verify that the improvements in Table 4 are not due to randomness, we repeated each experiment with multiple random seeds and reported the results as mean ± standard deviation. We further conducted statistical validation on the hidden subset using a paired significance test (and bootstrap confidence intervals), which indicates that the performance gains remain consistent across runs. Compared to YOLOv8n, LM-YOLOv8 has improved accuracy in the hidden subset, with precision increasing by 1.9%, recall increasing by 2.6%, mAP50 increasing by 4.1%, and mAP50-95 increasing by 7.4%.

3.3. Ablation Experiments

For the joint object detection and instance segmentation of hazardous materials in X-ray security inspection images, this study verifies the effectiveness of each improvement through controlled-variable experiments. The results of the ablation experiments are shown in Table 5.

In the ablation design, A denotes the C2f-MSC module and B denotes the SPPF-LSKA module. Without either module, YOLOv8 achieves an mAP50 of 80.8%. Adding the C2f-MSC module alone raises mAP50 to 82.6% through its multi-scale feature fusion mechanism, an increase of 1.8 percentage points; adding the SPPF-LSKA module alone, with its large-kernel spatial aggregation strategy, raises mAP50 by 1.1 percentage points to 81.9%. When the two modules work together, the model shows a clear synergy, with mAP50 reaching 84.8%, 4.0 percentage points higher than YOLOv8. The simultaneous improvement of precision (83.2%) and recall (80.3%) supports the robustness of the algorithm in dense occlusion scenes.

To show the combined effect of the modules more intuitively, Figure 8 plots, as a line graph, the mAP, precision, and recall obtained in the ablation experiments for module A, module B, and their combination on the PIDray dataset.

Visualization of the training process further reveals the advantages of the improved model. As shown in Figure 9a, the mAP curve of the improved model enters a stable convergence stage after 150 epochs, about 37% earlier than YOLOv8 (150 epochs instead of 240), and the peak mAP50 reaches 84.8%, an increase of 5.2 percentage points compared to YOLOv8. The loss curves are shown in Figure 9b; the improved model reaches a lower loss than the original YOLOv8 model.

4. Discussion

To evaluate the effect of the proposed improvements, this study verifies the improvement strategy through multidimensional visual comparison experiments. Figure 10 shows the detection results for single-target recognition, occluded-target detection, and multi-target occlusion scenarios. The first row shows the annotated images from the dataset, the second row the baseline model's detection results, the third row the results after adding C2f-MSC, the fourth row the results after adding SPPF-LSKA, and the fifth row the results using both modules together. Figure 11 reveals, through heat map analysis, how the improvements change the algorithm's focus area.

1. Performance analysis of single object detection in complex scenarios.

For a single-target scenario, as shown in Figure 10a, the baseline model achieves a confidence of 84.0%. The C2f-MSC module raises the confidence to 86.0% through boundary detail enhancement; combining the two improvement strategies yields a confidence of 89.0%, supporting the effectiveness of jointly optimizing local and global features.

2. Performance analysis of occlusion target detection.

In the occluded scene, as shown in Figure 10b, the baseline model fails to detect the occluded scissors because of insufficient feature extraction ability. With the C2f-MSC module, the model detects the scissors, but only at a low confidence of 29.0%. The final model detects the scissors through the combined effect of the two improvements, with a confidence of 48.0%, 19 percentage points higher than with C2f-MSC alone.

3. Performance analysis of multi-scale and multi-target detection.

In the multi-target occlusion experiment, as shown in Figure 10c, the baseline model only detected one hazardous substance, the baton, and achieved a confidence level of only 36.0%. After introducing the C2f-MSC module, two hazardous materials were detected, namely the baton and bullet, with confidence levels of 54.0% and 39.0%, respectively. After introducing the SPPF-LSKA module, the confidence level of the baton was improved, achieving a confidence level of 62.0%. The final model successfully achieved full object recognition, fully verifying the key role of C2f-MSC and SPPF-LSKA modules in complex scenes, providing a better solution for occlusion target detection.

The heat maps of the baton and bullets are shown in Figure 11. Figure 11b is the heat map obtained from the baseline model, and Figure 11c is the heat map obtained from the improved model. The baseline model shows significant feature dispersion: its hotspot areas are scattered, with multiple small isolated hotspots, and its attention is diffuse, indicating relatively weak recognition ability. The improved model achieved cross-scale feature interaction through heterogeneous kernel group parallel computing using the SPPF-LSKA module. Its hotspots are clearly concentrated within the target object area, with precise and sharp attention, indicating that the model can effectively focus on target features.

The experimental results show that the collaborative optimization of local feature extraction and global dependency modeling can effectively improve occluded-target detection, and that the multi-scale attention mechanism can significantly improve feature focusing in complex scenes. The consistent performance improvement of the improved model in single-target, multi-target, and occlusion scenes supports the generality of the algorithm design.

To verify whether the original SPPF provides sufficient contextual coverage under severe occlusion and whether the proposed SPPF-LSKA effectively enlarges the receptive field, we visualized the effective receptive field (ERF) of the two modules. Specifically, we backpropagated the gradient from a fixed spatial position on the module output to the input feature map, accumulated the absolute gradient magnitude, and normalized the result into a heatmap. The colors in Figure 12 represent the intensity of the gradient response in the ERF, ranging from light yellow (weak response) to dark green (strong response). Figure 12 shows that the ERF of the original SPPF is highly concentrated around the center (a), indicating that the response is dominated by local neighborhoods. In contrast, SPPF-LSKA exhibits a noticeably broader and more spatially distributed ERF (b), indicating that the module integrates information from a wider region and thus provides stronger contextual aggregation, which is beneficial for recognizing partially occluded objects.
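The ERF procedure described above (take the gradient of one fixed output position with respect to every input position, keep the absolute magnitude, normalize to a heatmap) can be illustrated with a minimal, framework-free sketch. Several things here are assumptions rather than the authors' setup: `toy_module` is a hypothetical stand-in (three stacked 3×3 mean filters), not SPPF or SPPF-LSKA; the gradient is estimated by central finite differences instead of autograd backpropagation so that the example runs with the standard library only; and a single-channel 11×11 input is used, so there is no accumulation over channels.

```python
import random


def mean_filter3(x):
    """3x3 mean filter with zero padding (one layer of the hypothetical toy module)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        total += x[ii][jj]
            out[i][j] = total / 9.0
    return out


def toy_module(x):
    """Stand-in for a feature module: three stacked 3x3 mean filters."""
    for _ in range(3):
        x = mean_filter3(x)
    return x


def effective_receptive_field(module, x, pos, eps=1e-3):
    """Estimate |d out[pos] / d x[i][j]| for every input position, normalized to [0, 1].

    Central finite differences are used here in place of backpropagation.
    """
    py, px = pos
    h, w = len(x), len(x[0])
    grad = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            original = x[i][j]
            x[i][j] = original + eps
            up = module(x)[py][px]
            x[i][j] = original - eps
            down = module(x)[py][px]
            x[i][j] = original  # restore the input before moving on
            grad[i][j] = abs((up - down) / (2.0 * eps))
    peak = max(max(row) for row in grad)
    if peak == 0.0:
        return grad
    return [[g / peak for g in row] for row in grad]


random.seed(0)
size = 11
feature_map = [[random.random() for _ in range(size)] for _ in range(size)]
erf = effective_receptive_field(toy_module, feature_map, (size // 2, size // 2))

# Print the heatmap row by row; larger values mean a stronger gradient response.
for row in erf:
    print(" ".join(f"{v:.2f}" for v in row))
```

For this toy module the response is strongest at the center and falls to zero beyond three pixels from it, which is the kind of centrally concentrated pattern the text attributes to a module with limited contextual coverage. With a real network, the same heatmap would normally be obtained from the framework's automatic differentiation rather than from finite differences.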

5. Conclusions

This article addresses dangerous goods detection in X-ray security inspection scenarios. Taking the YOLOv8-seg algorithm as the basis, it targets the loss of segmentation accuracy caused by complex occlusion, multi-target overlap, and material texture differences. Comparative experiments on the PIDray dataset show that the improved model reaches an mAP50 of 84.8%, which is 4.0 percentage points higher than the baseline YOLOv8-seg, and an mAP50-95 of 63.0%. The experimental results show that the collaborative optimization strategy of integrating local detail enhancement and global dependency modeling can effectively improve instance segmentation accuracy in complex X-ray security inspection scenes. The proposed method achieves good results for hazardous item detection in X-ray security checks, but its practical application still needs further improvement and validation. In addition, pruning the YOLOv8-seg network could be explored in future work to increase detection speed while preserving accuracy.

In dense X-ray occlusion scenes, boundary ambiguity is often caused by superimposed responses and incomplete visible regions. The SPPF-LSKA module improves contextual aggregation over a larger spatial interaction range, which helps the model use broader structural cues to reduce boundary confusion between overlapped instances. Meanwhile, the C2f-MSC module improves multi-scale feature allocation by decoupling feature paths and introducing heterogeneous convolutional parsing, which helps preserve contour-related cues while enhancing local texture details in multi-material X-ray images. As a result, the proposed framework shows stronger robustness in challenging occlusion scenarios, where segmentation quality depends more heavily on context modeling and feature complementarity than in simpler scenes. These observations indicate that the performance gains are not only empirical improvements in metrics but are also consistent with the design motivations of the proposed modules.

Author Contributions

Conceptualization, T.W., P.Y. and A.W.; methodology, T.W. and P.Y.; software, P.Y.; validation, T.W. and P.Y.; writing – review and editing, T.W., P.Y. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Natural Science Foundation of Heilongjiang Province (LH2023F034).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

SIXray: https://hyper.ai/datasets/18691, accessed on 15 February 2019; CLCXray: https://github.com/GreysonPhoenix/CLCXray, accessed on 15 February 2022; PIDray: https://github.com/bywang2018/security-dataset, accessed on 15 August 2021; HiXray: https://github.com/hixray-author/hixray, accessed on 23 August 2021.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Turcsany, D.; Mouton, A.; Breckon, T.P. Improving feature-based object recognition for X-ray baggage security screening using primed visual words. In Proceedings of the IEEE International Conference on Industrial Technology, Cape Town, South Africa, 25–28 February 2013; pp. 1140–1145.
Yang, F.; Jiang, R.; Yan, Y.; Xue, J.-H.; Wang, B.; Wang, H. Dual-Mode Learning for Multi-Dataset X-Ray Security Image Detection. IEEE Trans. Inf. Forensics Secur. 2024, 19, 3510–3524.
Pai, P.; Krishna Kumar, S.; Asok Kumar, G. Improving Prohibited Item Detection in X-Ray Images Using Neural Networks Under Complex Occlusion Conditions. IEEE Access 2025, 13, 182608–182620.
Alansari, M.; Ahmed, A.; Alnuaimi, K.; Velayudhan, D.; Hassan, T.; Javed, S.; Bennamoun, M.; Werghi, N. Multi-Scale Hierarchical Contour Framework for Detecting Cluttered Threats in Baggage Security. IEEE Access 2024, 12, 77454–77467.
Yang, X.; Lan, T.; Xu, Y. A Novel Dangerous Goods Detection Network Based on Multi-Layer Attention Mechanism in X-Ray Baggage Images. IEEE Access 2025, 13, 106805–106816.
Yu, Q.; Wu, Q.; Liu, H. Research on X-ray Contraband Detection and Overlap Target Detection Based on Convolutional Network. In Proceedings of the 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC); IEEE: New York, NY, USA, 2022; pp. 736–741.
Dong, Y.S.; Li, Z.X.; Guo, J.Y.; Chen, T.; Lu, S. Improved YOLOv5 Model for X-Ray Prohibited Item Detection. Laser Optoelectron. Prog. 2023, 60, 359–366.
Li, W.Q.; Chen, L.; Xie, X.; Hao, X.; Li, H. Algorithm for Detecting Prohibited Items in X-Ray Images Based on Improved YOLOv5. Comput. Eng. Appl. 2023, 59, 170–176.
Dong, J.; Luo, T.; Li, G. Prohibited Items Detection Method of X-ray Security Inspection Image Based on Improved YOLOv8s. Laser Optoelectron. Prog. 2024, 6, 2215008.
Liu, J.J.; Feng, P.; Liao, W.; Xi, W. YOLO-STM: A network model for identifying prohibited items in X-ray security inspection images based on Swin-Transformer. Chin. J. Stereol. Image Anal. 2024, 29, 230–241.
Chen, M.; Zhang, Z.; Jiang, N.; Li, X.; Zhang, X. YOLO-SRW: An Enhanced YOLO Algorithm for Detecting Prohibited Items in X-Ray Security Images. IEEE Access 2025, 13, 68323–68339.
Xu, W.; Peng, C.; Wang, W.; Tian, Y.; Ge, J. E-MPDNet: An Edge-Enhanced Multi-Scale Network for X-Ray Prohibited Item Detection. IEEE Access 2025, 13, 182177–182191.
Han, L.; Ma, C.; Liu, Y.; Sun, J.; Jia, J. SC-Lite: An Efficient Lightweight Model for Real-Time X-Ray Security Check. IEEE Access 2024, 12, 103419–103432.
Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN. Expert Syst. Appl. 2024, 236, 121352.
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
Howard, A.G. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
Wang, B.; Zhang, L.; Wen, L.; Liu, X.; Wu, Y. Towards Real-World Prohibited Item Detection: A Large-Scale X-ray Benchmark. arXiv 2021, arXiv:2108.07020.

Figure 1. Overall architecture diagram of the improved YOLOv8-seg model.

Figure 2. The structure of SPPF.

Figure 3. The structure of SPPF-LSKA.

Figure 4. The structure of C2f-MSC.

Figure 5. Visualization of feature map redundancy.

Figure 6. Class distribution of PIDray dataset.

Figure 7. Sample datasets with different situations.

Figure 8. Ablation results analysis of detection index.

Figure 9. Visualization of training process results in ablation study.

Figure 10. Visualization of detection results.

Figure 11. Comparison of heat map between YOLOv8n and the improved model.

Figure 12. Comparison of effective receptive field distributions.

Table 1. Dataset partitioning of PIDray.

Mode	Training Samples	Test Samples (Easy)	Test Samples (Hard)	Test Samples (Hidden)
Number	29,457	9482	3733	5005
Sum	47,677

Table 2. Comparison of experimental results between improved YOLOv8 and mainstream algorithms.

Model	Precision	Recall	mAP50	mAP50-95	FPS
YOLOv8	79.8	77.9	80.8	60.2	488.1
YOLOv11n	80.3	79.5	81.6	61.9	537.6
YOLACT	81.1	79.8	81.9	62.1	503.0
Ours	83.5	80.8	84.8	63.0	493.2

Table 3. Analysis of training results for each category on the PIDray dataset (%).

Class	Precision	Recall	mAP50	mAP50-95
Baton	97.8	99.2	99.6	85.2
Pliers	97.0	99.9	99.5	74.5
Hammer	91.5	93.3	99.6	68.5
Powerbank	92.0	94.7	97.5	71.0
Scissors	87.0	85.5	92.0	52.5
Wrench	75.5	98.8	92.8	59.0
Gun	45.0	18.0	30.0	21.0
Bullet	73.2	95.9	95.5	62.0
Sprayer	96.8	85.5	82.5	51.5
Handcuffs	95.8	99.6	99.5	78.5
Knife	59.0	46.0	48.5	30.0
Lighter	87.0	83.0	86.5	54.0
All	84.3	82.1	84.8	63.0

Table 4. Analysis of the training results of the hard and hidden subset in the PIDray dataset.

Subset	Model	Precision	Recall	mAP50	mAP50-95
Hard	YOLOv8	62.1 ± 0.1	63.0 ± 0.2	69.5 ± 0.1	40.0 ± 0.3
Hard	Ours	63.0 ± 0.1	65.1 ± 0.1	73.2 ± 0.1	48.1 ± 0.2
Hidden	YOLOv8	68.4 ± 0.1	66.4 ± 0.2	71.1 ± 0.2	42.7 ± 0.2
Hidden	Ours	70.3 ± 0.2	69.0 ± 0.1	75.2 ± 0.1	50.1 ± 0.2

Table 5. Results of ablation experiment.

Index	YOLOv8n	YOLOv8n + A	YOLOv8n + B	YOLOv8n + A + B
Parameters	32.6	29.7	35.3	32.4
GFLOPs	12.0	11.5	11.7	11.7
Precision	79.8	81.4	83.2	83.2
Recall	77.9	79.4	80.3	80.8
mAP50	80.8	82.6	81.9	84.8
mAP50-95	60.2	61.4	61.0	63.0
FPS	323	316	321	308
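
The ablation figures in Table 5 can be read either as absolute gains (percentage points) or as relative gains (percent of the baseline value), and the two are easy to confuse. The short sketch below computes both for the mAP50 and mAP50-95 rows. The numbers are copied from Table 5; the names `ablation`, `gains`, and `summary` are illustrative only, and the labels A and B are used exactly as they appear in the table.

```python
# mAP50 and mAP50-95 values (in %) copied from Table 5.
ablation = {
    "YOLOv8n": {"mAP50": 80.8, "mAP50-95": 60.2},
    "YOLOv8n + A": {"mAP50": 82.6, "mAP50-95": 61.4},
    "YOLOv8n + B": {"mAP50": 81.9, "mAP50-95": 61.0},
    "YOLOv8n + A + B": {"mAP50": 84.8, "mAP50-95": 63.0},
}


def gains(baseline, variant):
    """Return {metric: (percentage-point gain, relative gain in %)}."""
    result = {}
    for metric, base_value in baseline.items():
        delta = variant[metric] - base_value
        result[metric] = (round(delta, 1), round(100.0 * delta / base_value, 2))
    return result


baseline = ablation["YOLOv8n"]
summary = {
    name: gains(baseline, values)
    for name, values in ablation.items()
    if name != "YOLOv8n"
}

for name, per_metric in summary.items():
    for metric, (points, relative) in per_metric.items():
        print(f"{name:16s} {metric:9s} +{points:.1f} points ({relative:.2f}% relative)")
```

Read this way, the combined model's mAP50 gain over YOLOv8n is 4.0 percentage points, which corresponds to a relative increase of a little under 5%. It is also larger than the sum of the two single-module gains in the same row.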

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

