Fine-YOLO: A Simplified X-ray Prohibited Object Detection Network Based on Feature Aggregation and Normalized Wasserstein Distance

X-ray images typically contain complex background information and abundant small objects, posing significant challenges for object detection in security tasks. Most existing object detection methods rely on complex networks and high computational costs, which makes lightweight deployment difficult. This article proposes Fine-YOLO to achieve rapid and accurate detection in the security domain. First, a low-parameter feature aggregation (LPFA) structure is designed for the backbone feature network of YOLOv7 to enhance its ability to learn more information with a lighter structure. Second, a high-density feature aggregation (HDFA) structure is proposed to solve the loss of local details and deep location information caused by the neck feature fusion network in YOLOv7-Tiny-SiLU, connecting cross-level features through max-pooling. Third, the Normalized Wasserstein Distance (NWD) method is employed to alleviate the convergence difficulty resulting from the extreme sensitivity of bounding box regression to small objects. The proposed Fine-YOLO model is evaluated on the EDS dataset, achieving a detection accuracy of 58.3% with only 16.1 M parameters. In addition, auxiliary validation is performed on the NEU-DET dataset, where the detection accuracy reaches 73.1%. Experimental results show that Fine-YOLO is not only suitable for security inspection but can also be extended to other inspection areas.


Introduction
Security screening is an indispensable part of activities such as transportation and accessing sensitive areas. Currently, security screening relies heavily on security personnel examining the pseudo-color images generated from X-ray scans to identify prohibited objects. However, increasingly dense traffic networks and large passenger flows have led to a surge in security inspection tasks, which may result in missed inspections, endangering the safety of people and public property. Meanwhile, owing to the specificity of the application scenario, the images produced by X-ray security channels have their own characteristics: the randomness of the items to be inspected and their arbitrary placement during screening result in X-ray images with complex backgrounds. Moreover, practical application scenarios impose strict requirements on detection speed, posing a serious challenge to the detection of prohibited objects. Therefore, it is significant to develop an efficient method to automatically recognize prohibited objects in security X-ray images.
With the rapid development of artificial intelligence technology, convolutional neural network-based vision technologies have found widespread application in image classification and object detection, and have achieved remarkable results. At the same time, an increasing number of scholars have investigated how to utilize computer vision technology to improve the accuracy and speed of detecting prohibited objects in X-ray images. Zhu et al. [1] designed a unique Frequency-aware Dual-stream Transformer (FDTNet) tailored for analyzing X-ray images, introducing a Frequency-Aware Module (FAM) to enhance feature representation using frequency-domain information, thus facilitating the accurate detection of prohibited objects. Chen et al. [2] used the Discrete Cosine Transform (DCT) to convert RGB-domain images into frequency-domain representations, alongside proposing an RGB Frequency Attention Module (RFAM) for comprehensive feature representation integrating RGB and frequency-domain information. Ding et al. [3] introduced FE-DETR, a transformer-based target detection framework that improves anchor-based detectors in foreign object detection through split-attention mechanisms, integration of DCN and CBAM, an MSFE module for feature dispersion processing, and a transformer as a prediction head, along with optimized training strategies to boost detector performance. Wei et al. [4] employed collaborative knowledge distillation and leveraged a teacher model to assist the student model in distillation training, thereby uncovering hard-to-detect prohibited objects in X-ray images. Chang et al. [5] proposed a hard negative sample selection method to generate proposed foreground regions for joint commodity segmentation, aiming to detect prohibited objects in cluttered X-ray baggage images. Hassan et al. [6] utilized incremental learning and a traditional encoder-decoder structure to extract and recognize chaotic, occluded, and overlapping prohibited objects from X-ray images, achieving instance recognition with small-scale training batches. CFPA-Net introduced a cross-layer feature extraction fusion module (CEF) to augment semantic and localization information between low-level and high-level features [7], whereas EAOD-Net incorporated a learnable Gabor convolution layer to enhance the network's capability to capture edge and contour information about prohibited objects [8].
Although the aforementioned methods address the problems of X-ray images and achieve good detection results, they are not designed with deployment in realistic scenarios in mind. Considering these limitations, this study proposes a lightweight detection model that improves both the detection speed and the ability to detect small prohibited objects. The contributions of this study are as follows:
1. A low-parameter feature aggregation (LPFA) structure is proposed for the backbone feature network of YOLOv7, simplifying the network structure and enhancing its ability to capture global object information.

2. A high-density feature aggregation (HDFA) structure is proposed for the YOLOv7-Tiny-SiLU neck feature fusion network, which improves the feature integration capability of the lightweight network, resulting in a finer and more comprehensive representation of object features.

3. To avoid the loss of detailed information during layer-by-layer feature transmission, a max-pooling operation is employed for the cross-layer connections. Moreover, the NWD loss function is utilized to enhance the detection of small objects, given the size constraints of prohibited objects.

4. Experiments conducted on the EDS dataset demonstrate a successful balance between detection accuracy and speed. Furthermore, results on the NEU-DET dataset illustrate the robustness of the model and its potential extension to various practical detection domains.
The remainder of this paper is structured as follows. In Section 2, related works on object detection and X-ray prohibited object detection are reviewed. The key components of the proposed Fine-YOLO are described in Section 3, including low-parameter feature aggregation (Section 3.2), high-density feature aggregation and cross-layer connection (Section 3.3), and the Normalized Gaussian Wasserstein distance (Section 3.4). Detailed experiments and analysis are reported in Section 4. Finally, we conclude our work and discuss future work in Sections 5 and 6, respectively.

Object Detection Algorithms
Object detection constitutes a fundamental task in computer vision, aiming to identify objects of interest within natural images. Existing object detection algorithms can be divided into two categories: two-stage detection algorithms and one-stage detection algorithms. Two-stage detection algorithms, such as Fast R-CNN [9], Faster R-CNN [10], Mask R-CNN [11], and ThunderNet [12], typically generate region proposals prior to making predictions. While effective, these algorithms may exhibit inefficiencies in inference time due to their multi-stage nature. One-stage object detection algorithms, such as SSD [13], YOLO [14], RetinaNet [15], CenterNet [16], FCOS [17], and DETR [18], directly predict bounding boxes without the need to generate proposals through additional RPNs. This characteristic makes one-stage object detection algorithms particularly suitable for efficient inference on mobile devices. Although DETR has an advantage in accuracy, its complex Transformer structure may require more data and computational resources for training. In contrast, the network architecture of YOLO is designed to be simple, which effectively reduces the amount of computation and the number of model parameters, and is thus more suitable for the construction of lightweight models.

YOLO Series Object Detection Algorithms
The YOLOv1 algorithm [14] simplifies object detection by dividing images into grids, with each grid responsible for predicting bounding boxes and category probabilities. Although the algorithm achieves fast processing with a single forward pass, it has limited performance in detecting small objects and dense scenes because only one category can be recognized in each grid. To address this problem, YOLOv2 [19] replaces the original GoogLeNet with the more advanced DarkNet-19 network architecture and replaces Dropout with batch normalization, which improves convergence speed and generalization. YOLOv3 [20] introduces the residual structure of ResNet and utilizes Darknet-53 as the base network architecture, resolving gradient explosion issues while increasing network depth. YOLOv4 [21] inherits YOLOv3 and introduces the CSPDarknet53 backbone architecture, which integrates spatial pyramid pooling and path aggregation network (PAN) modules, effectively solving the challenge of feeding features of varying sizes into the fully-connected layer. YOLOv5 [22] retains a similar architecture to YOLOv4; however, it adopts anchor boxes automatically learned from the training dataset to better adapt to the size distribution of different objects. YOLOv6 [23] brings an innovative Neck network design, featuring a reparameterizable bi-directional fusion Rep-PAN Neck network with enhanced characterization capability. Coupled with a decoupled localization distillation strategy, it boosts the performance of smaller models. YOLOX [24] enhances YOLOv3 by decoupling the prediction branch, separating classification and regression tasks, which speeds up model convergence and improves algorithm generality and versatility.
As the state of the art in the field of image recognition and object detection, YOLOv7 [25] combines extended E-ELAN networks, layer aggregation networks, and composite model scaling strategies. It delivers 5 to 160 frames per second without compromising accuracy in image recognition and object detection. Compared to other YOLO algorithms, YOLOv7 offers the best balance between detection speed and accuracy. By replacing the C3 module in YOLOv5 with the more lightweight C2f module, YOLOv8 [26] achieves a further improvement in network performance. Despite retaining the SPPF module and the PAN design concept of the YOLOv5 architecture, YOLOv8 simplifies the network structure by removing the convolutional structure of the upsampling stage in the PAN-FPN. In addition, YOLOv8 introduces a decoupled-head design with anchor-free detection, which enhances the flexibility and accuracy of the detection algorithm. Notably, YOLOv7 enhances detection accuracy through various improvements, including advancements in the backbone network and feature fusion methods. By incorporating additional feature layers, YOLOv7 is able to capture finer details of the targets, providing a significant advantage in handling complex scenes and diverse objects. Furthermore, YOLOv7 achieves faster detection speeds while maintaining high detection accuracy, making it an exceptionally efficient model for practical applications. Therefore, YOLOv7 is chosen as the baseline model in this study.

X-ray Prohibited Object Detection Datasets
As described in the literature [27,28], X-ray prohibited object images differ significantly from natural scene images. While natural image datasets like Pascal VOC [29] and MS-COCO [30] are widely known and easily accessible, publicly available datasets of X-ray prohibited object images are relatively scarce. Accessing natural images is relatively straightforward, as they can be captured using common cameras or smartphones. In contrast, obtaining X-ray images requires specialized equipment. Furthermore, a substantial volume of prohibited objects is essential to assemble the X-ray baggage image datasets utilized for training deep learning-based X-ray prohibited object detection models. This necessity significantly compounds the challenge of constructing such a dataset. Moreover, due to privacy policies, many X-ray baggage image datasets cannot be publicly released. To the best of our knowledge, the five current datasets used for X-ray prohibited object detection are GDXray [31], SIXray [32], OPIXray [33], HiXray [34], and EDS [35].
GDXray is the first complete dataset that can be used for the detection of prohibited objects in X-ray security screening, consisting of 19,407 X-ray images. However, this dataset contains greyscale images captured by a single-energy X-ray screening machine, whereas the images scanned by present X-ray screening machines are all pseudo-colour images; thus, it does not reflect the present state of research. The SIXray dataset comprises 1,059,231 X-ray images depicting prohibited objects across six categories: guns, knives, spanners, pliers, scissors, and hammers. However, the dataset faces challenges due to the relatively small number of images containing prohibited objects, coupled with significant disparities in the quantities of the different types of prohibited objects represented. The OPIXray dataset contains 8885 X-ray images of five different types of objects, namely folding knives, straight knives, scissors, utility knives, and multi-tool knives, with three different levels of occlusion, aimed at investigating occlusion and overlap problems. However, the dataset lacks diversity, as the objects all belong to the knife category. The HiXray dataset comprises 45,364 X-ray images depicting 102,928 commonly prohibited objects across eight categories. However, among the eight categories, only lighters and rechargeable batteries are prohibited objects.
The EDS dataset is considered to be the first endogenous domain-shift benchmark, crafted specifically for X-ray security screening scenarios. The dataset contains 14,219 X-ray images covering 10 common prohibited categories: beverage bottles, pressure containers, lighters, knives, small electronic devices, power banks, umbrellas, glass bottles, scissors, and laptops. The diversity of these images, which come from screening devices produced by three different manufacturers, makes the dataset suitable for evaluating the proposed Fine-YOLO model. Figure 1 illustrates examples from the EDS dataset, where Xray1, Xray2, and Xray3 represent images from the three different manufacturers' screening machines.

Overall Architecture
The framework of the proposed Fine-YOLO is shown in Figure 2. First, the LPFA module with a low number of parameters is proposed for the backbone network of YOLOv7, which is able to better handle the complex backgrounds of X-ray images and the morphological differences of prohibited objects while balancing performance and computational cost. Then, the HDFA module is proposed for the neck feature fusion network of YOLOv7-Tiny-SiLU, enabling the network to capture more detailed feature information and use max-pooling operations for cross-layer connectivity to enhance the prediction effect of the model. Finally, the detection results are evaluated using NWD to compute the loss, which in turn guides the optimization of the model.

Low-Parameter Feature Aggregation
In X-ray images, objects often overlap, making it challenging to detect occluded objects, particularly in areas where the integrity of the occluded objects is compromised. To address this issue, we introduce the LPFA module, designed to enhance the detection accuracy of occluded objects by effectively leveraging the global context information within the features. The specific structure of the LPFA module is illustrated in Figure 3.
Compared with YOLOv7, the LPFA module uses the max-pooling operation directly for downsampling, reducing convolutional layers and model complexity while preserving texture details. The resulting features are then processed with 1 × 1 and 3 × 3 convolution kernels, respectively. The 1 × 1 convolution facilitates the interaction of channel information, while the 3 × 3 convolution further extracts higher-level features and semantic insights, expanding the receptive field of the network. After that, the outputs of the different convolutions are concatenated so that the network can learn from the different feature information. The two subsequent 3 × 3 and 1 × 1 convolutional kernels function similarly, forming a feature-pyramid-like aggregation structure. Finally, a 1 × 1 convolution is used to restore the channel dimension and fuse the features.
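The parameter saving of this design can be illustrated with a quick count. The sketch below uses hypothetical channel counts (the exact LPFA configuration is given in Figure 3) to compare a strided 3 × 3 downsampling convolution against parameter-free max-pooling followed by the 1 × 1 and 3 × 3 convolutions described above:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a KxK convolution layer (bias omitted)."""
    return k * k * c_in * c_out

# Hypothetical downsampling stage with 256 input channels: a strided 3x3
# convolution carries k*k*c_in*c_out weights, while max-pooling is
# parameter-free, so an LPFA-style design only pays for the 1x1 and 3x3
# convolutions that follow.
strided_conv = conv_params(3, 256, 256)                       # 589,824 weights
maxpool = 0                                                    # parameter-free
follow_up = conv_params(1, 256, 128) + conv_params(3, 128, 128)
print(strided_conv, maxpool + follow_up)                       # 589824 180224
```

Max-pooling contributes no weights, so the downsampling step itself costs nothing; the parameter budget is spent only on the convolutions that mix and refine features afterwards.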

High-Density Feature Aggregation
Optimizing only the backbone feature extraction network of YOLOv7 obviously cannot achieve a better lightweight effect. Therefore, the HDFA module is proposed in this paper to optimize the neck feature fusion network of YOLOv7-Tiny-SiLU. YOLOv7-Tiny-SiLU is a compact variant of YOLOv7, which adopts a streamlined network architecture and uses the SiLU activation function instead of Leaky ReLU. The comparison between the HDFA module and YOLOv7-Tiny-SiLU is shown in Figure 4, where a 3 × 3 convolution is additionally added to the original structure to improve the feature integration capability of the network. Because the neck feature fusion network is directly connected to the detection head, weak feature integration would decrease network accuracy and degrade object prediction. The additional 3 × 3 convolution is affordable because the number of channels in the neck feature fusion network is small, so even an extra convolution does not add a very large amount of computation. To enhance the detection capabilities and facilitate the exchange of various feature information within the model, we implement a cross-level connection strategy that integrates the backbone feature extraction network of YOLOv7 with the neck feature fusion network of YOLOv7-Tiny-SiLU, introducing a cross-layer network connection that harmonizes the features extracted by both networks. To preserve texture information as much as possible, we employ a max-pooling operation for feature transfer from the backbone feature extraction network to the neck feature fusion network. This strategy enhances the feature integration of the entire network, mitigating performance degradation and loss of object prediction. However, due to the large discrepancy between the feature maps produced by max-pooling and those extracted by multi-layer convolution, directly performing a concat operation can lead to network learning difficulties. Therefore, a convolutional layer is added after max-pooling to adjust the feature maps and reduce the discrepancy. The cross-level connection method is illustrated in Figure 5.
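The cross-level connection can be sketched at the tensor level as follows. This is an illustrative NumPy mock-up, not the actual implementation: the feature-map shapes and channel counts are hypothetical, and the 1 × 1 convolution is reduced to its essence, a channel-mixing matrix multiply applied at every spatial position:

```python
import numpy as np

def max_pool_2x2_chw(x):
    """2x2, stride-2 max-pooling over a (C, H, W) tensor (H, W even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def conv_1x1(x, weight):
    """1x1 convolution as a (C_out, C_in) channel-mixing matrix multiply."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
backbone_feat = rng.standard_normal((64, 32, 32))    # hypothetical backbone map
neck_feat = rng.standard_normal((128, 16, 16))       # hypothetical neck map

# Max-pooling preserves salient texture responses while downsampling; the
# 1x1 convolution then adjusts the channel count so the two branches agree
# before concatenation, reducing the discrepancy between them.
pooled = max_pool_2x2_chw(backbone_feat)                      # (64, 16, 16)
adjusted = conv_1x1(pooled, rng.standard_normal((128, 64)))   # (128, 16, 16)
fused = np.concatenate([adjusted, neck_feat], axis=0)         # (256, 16, 16)
print(fused.shape)
```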

Normalized Wasserstein Distance
The IoU (Intersection over Union) metric, commonly employed for object detection, exhibits remarkable sensitivity, particularly in the detection of small prohibited objects. Even slight positional deviations between the predicted and ground-truth bounding boxes can induce substantial fluctuations in the IoU score. As demonstrated in Figure 6, a negligible positional offset can cause the IoU of a small prohibited object to plummet from 0.53 to 0.06, whereas for objects of regular size, the same degree of positional deviation only lowers the IoU from 0.90 to 0.65. The heightened sensitivity of the IoU metric to small prohibited objects leads to significant overlap between positive and negative samples during training, hindering effective network convergence. Additionally, small prohibited objects often occupy minimal space within the image, causing the IoU between the ground truth and the predicted bounding box to fall below the minimum threshold, resulting in the absence of positive samples. Hence, there is a pressing need for an alternative evaluation metric to precisely assess small prohibited objects.
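This sensitivity is easy to reproduce numerically. The sketch below applies the same 3-pixel diagonal shift to a small box and a larger box (the box sizes here are illustrative, not the ones behind Figure 6) and shows the IoU of the small box collapsing much harder:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 6x6-pixel "small" object vs. a 36x36-pixel "regular" object, each with
# the same 3-pixel diagonal shift of the predicted box.
small_gt, small_pred = (0, 0, 6, 6), (3, 3, 9, 9)
large_gt, large_pred = (0, 0, 36, 36), (3, 3, 39, 39)
print(iou(small_gt, small_pred))   # ~0.14: the small box's IoU collapses
print(iou(large_gt, large_pred))   # ~0.72: the large box stays comparatively high
```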

(1) Bounding Box Two-Dimensional Gaussian Distribution Modeling
The IoU metric, primarily designed to gauge sample similarity, proves highly sensitive to object size variations, rendering it less suitable for detecting small prohibited objects. Hence, we introduce NWD as a novel measure [36]. The Wasserstein distance, insensitive to object scale, effectively quantifies distribution similarity even with minimal or no overlap. Consequently, it addresses the issues of sample similarity during training on small prohibited objects and the scarcity of positive samples. Notably, we model the bounding box as a two-dimensional Gaussian distribution:

$$f(\mathbf{P}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{P}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{P}-\boldsymbol{\mu})\right)}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}} \tag{1}$$

where P represents the coordinates (x, y), and µ and Σ represent the mean vector and covariance matrix of the Gaussian distribution, respectively. When (P − µ)ᵀΣ⁻¹(P − µ) = 1, the ellipse in Equation (1) is a density contour of the two-dimensional Gaussian distribution. Hence, the horizontal bounding box R = (cx, cy, w, h) can be modeled as a two-dimensional Gaussian distribution, as shown in Equation (2):

$$\boldsymbol{\mu}=\begin{bmatrix}cx\\cy\end{bmatrix},\qquad \boldsymbol{\Sigma}=\begin{bmatrix}\frac{w^2}{4}&0\\0&\frac{h^2}{4}\end{bmatrix} \tag{2}$$

where (cx, cy), w, and h represent the center coordinates, width, and height, respectively.
(2) Normalized Gaussian Wasserstein Distance

The Wasserstein distance, from Optimal Transport theory, is a common method for calculating the distance between two distributions, and it facilitates measuring the similarity or difference between them. For two two-dimensional Gaussian distributions µₐ = N(mₐ, Σₐ) and µᵦ = N(mᵦ, Σᵦ), the Wasserstein distance between µₐ and µᵦ is shown in Equation (3):

$$W_2^2(\mu_a,\mu_b)=\lVert \mathbf{m}_a-\mathbf{m}_b\rVert_2^2+\left\lVert \Sigma_a^{1/2}-\Sigma_b^{1/2}\right\rVert_F^2 \tag{3}$$

where ∥·∥_F is the Frobenius norm. Furthermore, for the Gaussian distributions N_a and N_b modeled from bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b), the distance can be further simplified to Equation (4):

$$W_2^2(\mathcal{N}_a,\mathcal{N}_b)=\left\lVert \left[cx_a,\,cy_a,\,\tfrac{w_a}{2},\,\tfrac{h_a}{2}\right]^{\mathrm{T}}-\left[cx_b,\,cy_b,\,\tfrac{w_b}{2},\,\tfrac{h_b}{2}\right]^{\mathrm{T}}\right\rVert_2^2 \tag{4}$$

Finally, using exponential normalization, the distance is converted into a value between 0 and 1 to obtain a new metric, the Normalized Wasserstein Distance (NWD), as shown in Equation (5):

$$NWD(\mathcal{N}_a,\mathcal{N}_b)=\exp\!\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a,\mathcal{N}_b)}}{C}\right) \tag{5}$$

where C denotes a constant related to the dataset.
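Combining Equations (2), (4), and (5), the NWD between two boxes reduces to a few arithmetic operations. A minimal sketch (the value of C is dataset-dependent, so the constant used here is only a placeholder):

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between boxes given as (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian N([cx, cy], diag(w^2/4, h^2/4));
    the squared 2-Wasserstein distance between such Gaussians reduces to a
    squared L2 distance over the vectors (cx, cy, w/2, h/2). The constant
    c is dataset-dependent; 12.8 is only a placeholder.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    w2 = (ax - bx) ** 2 + (ay - by) ** 2 \
        + ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2
    return math.exp(-math.sqrt(w2) / c)

# Identical boxes give NWD = 1; a small offset degrades it smoothly rather
# than collapsing it, unlike IoU for small objects.
print(nwd((10, 10, 6, 6), (10, 10, 6, 6)))
print(nwd((10, 10, 6, 6), (13, 13, 6, 6)))
```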

Experiments

Implementation Details
All experiments in this study are conducted on the Ubuntu 20.04 operating system, utilizing an Intel Core i7-12800HX processor, 128 GB of random access memory, and an NVIDIA GeForce RTX 3080 graphics card with 16 GB of memory. Python 3.8 is the programming language employed. Images are resized to 640 × 640, and the detector is trained for 150 epochs. The initial learning rate is set to 0.01, and a batch size of 16 is used.

Evaluation Metrics
For evaluation metrics, this paper employs mean Average Precision (mAP), Precision, and Recall to evaluate the detection accuracy of the model. The Average Precision (AP) value represents the average accuracy of object detection within a particular category, and the mAP value represents the average of the AP values across all categories, with an IoU threshold of 0.5. Precision denotes the ratio of correctly detected objects to all detected objects, while recall denotes the ratio of correctly detected objects to all positive samples, as illustrated in Equations (6)-(8):

$$Precision=\frac{TP}{TP+FP} \tag{6}$$

$$Recall=\frac{TP}{TP+FN} \tag{7}$$

$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_i \tag{8}$$

where TP, FP, and FN denote the numbers of correctly detected objects, false detections, and missed objects, respectively, and N is the number of categories.
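Equations (6) and (7) amount to simple ratio computations over the detection counts; a minimal sketch with illustrative TP/FP/FN counts (not values from the paper):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 80 correct detections, 20 false detections,
# 40 missed objects.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)   # precision 0.8, recall ~0.667
```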
In addition, we evaluate the complexity of the model using the number of parameters and Floating Point Operations (FLOPs), which directly reflect the computational workload. We also measure the detection speed of the model, denoted as Frames Per Second (FPS), which represents the number of images processed per second. These metrics are defined in Equations (9) and (10), respectively:

$$FLOPs=K^2 \times C_{in} \times C_{out} \times H \times W \tag{9}$$

$$FPS=\frac{1}{T_1+T_2+T_3+T_4} \tag{10}$$

where K denotes the size of the convolution kernel, C_in and C_out represent the number of input and output channels, H and W denote the height and width of the output feature map, T_1 represents the preprocessing time before the input image is sent to the neural network, T_2 represents the computational inference time of the neural network, T_3 represents the post-processing time, and T_4 represents the output time of the result.
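Equations (9) and (10) can be sketched directly; the layer shape and timing values below are illustrative placeholders, not measurements from this study:

```python
def conv_flops(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulate count of a KxK convolution producing an
    (h_out, w_out, c_out) feature map (bias and activation ignored)."""
    return k * k * c_in * c_out * h_out * w_out

def fps(t1, t2, t3, t4):
    """Frames per second as the reciprocal of the total per-image latency:
    pre-processing (t1), inference (t2), post-processing (t3), output (t4)."""
    return 1.0 / (t1 + t2 + t3 + t4)

# Hypothetical 3x3 layer: 64 -> 128 channels on an 80x80 output map,
# and hypothetical stage timings in seconds.
print(conv_flops(3, 64, 128, 80, 80))          # 471,859,200 MACs
print(fps(0.002, 0.010, 0.003, 0.001))         # 62.5 frames per second
```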

Performance of the Fine-YOLO Model

EDS Dataset
This section provides a comprehensive comparison between Fine-YOLO and several classical detection algorithms, as shown in Table 1. The two-stage object detection algorithm Faster R-CNN is too slow due to its large number of parameters. SSD has a faster detection speed than Faster R-CNN; however, its detection accuracy is very unsatisfactory. RetinaNet is 6.9% higher than SSD in terms of mAP, although it is nearly twice as slow as SSD in terms of detection speed. Anchor-free object detection algorithms, such as CenterNet and FCOS, cannot strike a good balance between detection accuracy and detection speed. In the YOLO series, smaller models such as YOLOv5-S, YOLOX-Nano, and YOLOv8-S, while more lightweight in structure, are not comparable to Fine-YOLO in terms of detection speed and accuracy. This discrepancy can be attributed to the impact of various factors on model detection speed, including model structure, optimization techniques, hardware efficiency, and implementation details. Despite having fewer parameters, these models may employ more complex network structures or model designs, resulting in greater computational resource requirements and longer inference times. In contrast, Fine-YOLO strikes a balance between model complexity and computational efficiency by optimizing feature extraction, thereby reducing overall execution time. Compared to models such as YOLOv7-Tiny and YOLOv7-Tiny-SiLU, Fine-YOLO provides significant advantages in terms of detection accuracy, speed, and lightweight design for X-ray datasets, despite a slight increase in computational complexity. Moreover, when compared to larger-scale models such as YOLOv5-L, YOLOX-L, YOLOv8-L, and YOLOv7, Fine-YOLO stands out due to its lighter network architecture, faster detection speed, and superior detection accuracy. To further validate the efficacy of Fine-YOLO, we compare it with X-ray prohibited object detection algorithms on the EDS dataset, and the results are presented in Table 2. The methods in the table are all lightweight X-ray prohibited object detection methods. The experimental results confirm that our model not only maintains superior detection accuracy, but also reduces model complexity and accelerates detection speed. In summary, Fine-YOLO achieves an excellent balance between lightweight design, high detection accuracy, and rapid detection speed. Its excellent performance against both small and large model variants demonstrates that it is an advanced lightweight inspection solution. Figure 7 provides a more intuitive representation of the superiority of the proposed Fine-YOLO algorithm.

NEU-DET Dataset
In order to more comprehensively evaluate the performance of Fine-YOLO on datasets from different domains, as well as the effectiveness of the proposed improvements on complex background information and small object detection, the NEU-DET dataset is selected for comparative analysis in this paper. The dataset consists of 1800 images containing six different types of surface defects: cracks, inclusions, spots, pitted surfaces, rolled-in scales, and scratches, as shown in Figure 8. In this study, the dataset is first split 9:1 into a training set (1620 images) and a test subset of 180 images; the training set is then further split 9:1 to yield a training subset of 1458 images and a validation subset of 162 images. Experiments are conducted on the NEU-DET dataset, and the results are compared with recently proposed lightweight models for defect detection. The experimental results, presented in Table 3, confirm the robust performance of our model in the field of small object detection. Notably, our algorithm significantly improves the detection speed while maintaining high accuracy. This makes our model a more convenient solution for object detection applications.
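The two-stage 9:1 split described above reproduces the stated subset sizes exactly:

```python
total = 1800
test = total // 10          # 10% held out for testing -> 180 images
train_full = total - test   # 1620 images remain for training
val = train_full // 10      # 10% of the training set for validation -> 162
train = train_full - val    # remaining training subset -> 1458
print(train, val, test)     # 1458 162 180
```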

Ablation Study
In this section, we methodically examine the impact of the three enhancement techniques on the network model. The experimental findings are detailed in Table 4. Initially, we evaluate the effectiveness of enhancing the backbone feature extraction network using YOLOv7 as the baseline model. Second, the impact of the improvement to the neck feature fusion network is evaluated using YOLOv7-Tiny-SiLU as the baseline model. Subsequently, the effectiveness of the cross-layer connection network is validated. Finally, the NWD loss function is introduced, along with a comparison against the performance of YOLOv7. Ten sets of experiments are conducted incorporating different modules, and their performance is evaluated using mAP, Parameters, GFLOPs, and FPS as metrics. LPFA is an improvement of the YOLOv7 backbone feature extraction network, which reduces the number of parameters and the computational complexity through the design of the low-parameter aggregation module. The proposed method exhibits a remarkable decrease of 67.7% in parameters and 51.6% in computations compared to YOLOv7, while retaining a high detection accuracy. Additionally, our proposed method achieves a 43.9% increase in detection speed.
HDFA is an improvement of the YOLOv7-Tiny-SiLU neck feature fusion network, achieved by designing a high-density aggregation module that efficiently integrates features at different scales, thus improving object detection performance and reducing the dependence on background information. Despite the increased parameters and computational resources of our proposed method compared to YOLOv7-Tiny-SiLU, the detection accuracy is improved by 2.2%.
By connecting the LPFA and HDFA modules through a max-pooling operation, this network architecture offers significant advantages over YOLOv7. Notably, Fine-YOLO achieves a 56.7% reduction in the number of parameters, a 44.9% decrease in computational load, and a 60.1% increase in detection speed. This improves the detection speed while maintaining high detection accuracy, making the model easier to deploy in practical applications. In the pursuit of lightweight design, we observe a potential impact on the detection accuracy of the model when verifying the above methods.
To address this issue, we introduce the NWD loss function in this article, aimed at striking a balance between detection accuracy and speed. This enhancement yields a 1.0% improvement in detection accuracy compared to YOLOv7. The introduction of the NWD loss function effectively solves the problem of bounding box regression mismatch when detecting small objects and further improves the detection accuracy while maintaining a lightweight design. Figure 9 illustrates the results of the ablation experiments for the improved methods.

Visualization of the Detection Result
Figure 10 shows the comparison results of YOLOv7-Tiny-SiLU, YOLOv7, YOLOv8-S, YOLOv8-L, and Fine-YOLO. The figure shows that Fine-YOLO is able to accurately identify small prohibited objects such as scissors, knives, and lighters, and achieves higher confidence for slightly larger objects such as laptops and bottles.

Discussion
According to the above experimental results and analyses, Fine-YOLO has obvious advantages in X-ray security detection and can balance detection accuracy and detection speed well. There are three main innovations. (1) A low-parameter feature aggregation module is proposed for the backbone network of YOLOv7. This design enables the model to capture global information efficiently while maintaining a lightweight structure. (2) A high-density feature aggregation module is introduced for the neck feature fusion network of YOLOv7-Tiny-SiLU. This enhancement allows the model to maintain its lightweight nature while simultaneously improving detection accuracy.
(3) Considering the different structures of YOLOv7 and YOLOv7-Tiny-SiLU, a cross-layer connection is established using max-pooling operations.In light of the above discussion, the superiority of Fine-YOLO is mainly manifested in the following aspects: (1) Two different feature fusion modules are designed according to the characteristics of YOLOv7 and YOLOv7-Tiny-SiLU models, which achieves a good balance between detection speed and accuracy.(2) For different network structures, max-pooling operation is employed for cross-layer connection, thereby minimizing the loss of feature information.(3) The application of the NWD loss function further enhances the detection accuracy of the model without sacrificing the lightweight design advantage of the model, particularly for small prohibited objects.Notably, the auxiliary validation conducted on the NEU-DET dataset confirms the excellent detection performance of Fine-YOLO and shows the potential to be extended to other object detection domains.

Conclusions
This study presents the Fine-YOLO model for detecting prohibited objects in X-ray images. Despite the challenges posed by complex backgrounds and the small size of prohibited objects, the proposed Fine-YOLO model shows excellent detection performance. The major contributions are as follows: (1) The LPFA module is proposed for the YOLOv7 backbone feature extraction network to extract multi-scale global context information, which contributes to the detection of occluded objects; (2) The proposed HDFA module enables the neck feature fusion network of YOLOv7-Tiny-SiLU to capture detailed object information with a lower number of parameters; (3) The features of the backbone feature extraction network and the neck feature fusion network are effectively merged by a max-pooling operation; (4) The model is optimized using the NWD loss function to enhance detection accuracy for small objects. Extensive experiments and visualization analyses demonstrate that Fine-YOLO precisely identifies and classifies prohibited objects while achieving faster detection speeds with a lighter network structure. Experiments on the EDS dataset and auxiliary validation on the NEU-DET dataset demonstrate the excellent detection performance of Fine-YOLO. However, the unique characteristics of security screening data, such as frequently stacked objects in X-ray images, pose challenges that may hinder overall algorithm performance.
In future research, we aim to utilize reinforcement learning techniques to further improve the detection accuracy of prohibited objects through an iterative trial-and-error and feedback process. In addition, we plan to explore optimization techniques such as knowledge distillation and model pruning to improve the efficiency and applicability of the model. Furthermore, we intend to integrate the LPFA and HDFA modules into other YOLO algorithms to refine their architectural design and improve their detection capabilities.

Figure 1. Example of the EDS dataset.

Figure 3. Comparative analysis of improved components within the backbone feature extraction network.

Figure 4. Comparative analysis of improved components within the neck feature fusion network.

Figure 6. We conduct an analysis of IoU sensitivity focusing on small- and normal-scale objects. In this analysis, each grid cell represents one pixel. Box A represents the ground-truth bounding box, while Boxes B and C illustrate predicted bounding boxes with diagonal biases of one pixel and four pixels, respectively. Specifically, (a) pertains to the detection of micro-scale objects, whereas (b) relates to the detection of normal-scale objects.

Figure 8. Example of the NEU-DET dataset.

Figure 9. Ablation experiment results of improved methods.

Figure 10. Comparison of the prediction effect of different models on the EDS dataset.

Table 1. Experimental results of different object detection algorithms on the EDS dataset. Note that baseline model values are underlined, whereas optimal values are in bold.

Table 2. Performance comparison of Fine-YOLO with different lightweight X-ray inspection models, where bold indicates the optimal value for each metric.

Table 3. Performance comparison between Fine-YOLO and different advanced lightweight steel defect detection models, where bold indicates the optimal value for each metric.

Table 4. Ablation experiments on the EDS dataset.