Article

Advancing Intelligent Logistics: YOLO-Based Object Detection with Modified Loss Functions for X-Ray Cargo Screening

1 Faculty of Engineering, Technology and Built Environment, UCSI University, Cheras, Kuala Lumpur 56000, Malaysia
2 Billion Prima Sdn Bhd, PTB 1587, Jalan Sengkang, Kulai, Johor Bahru 81000, Malaysia
3 Sensenet Sdn Bhd, No. 2C, Jalan Jubli Perak 22/1, SS22, Shah Alam 40400, Selangor, Malaysia
4 Billion Prima Technologies Sdn Bhd, PTB 1587, Jalan Sengkang, Kulai, Johor Bahru 81000, Malaysia
* Authors to whom correspondence should be addressed.
Future Transp. 2025, 5(3), 120; https://doi.org/10.3390/futuretransp5030120
Submission received: 23 June 2025 / Revised: 16 August 2025 / Accepted: 29 August 2025 / Published: 8 September 2025

Abstract

Efficient threat detection in X-ray cargo inspection is critical for the security of the global supply chain. This study evaluates YOLO-based object-detection models from YOLOv5 to the latest, YOLOv11, which is enhanced with modified loss functions and Soft-NMS to improve accuracy. The YOLO model comparison also includes DETR (Detection Transformer) and Faster R-CNN (Region-based Convolutional Neural Network). Standard loss functions struggle with overlapping items, low contrast, and small objects in X-ray imagery. To overcome these weaknesses, IoU-based loss functions—CIoU, DIoU, GIoU, and WIoU—are integrated into the YOLO frameworks. Experiments on a dedicated cargo X-ray dataset assess precision, recall, F1-score, mAP@50, mAP@50:95, GFLOPs, and inference speed. The enhanced model, YOLOv11 with WIoU and Soft-NMS, achieves superior localization, reaching 98.44% mAP@50. This work highlights effective enhancements for YOLO models to support intelligent logistics in transportation services and automated threat detection in cargo security systems.

1. Introduction

X-ray imaging in cargo inspection is a critical component of modern security and inspection technology, providing a non-intrusive method for scanning cargo containers, vehicles, and large shipments [1,2]. This technology operates by transmitting X-ray beams through the cargo and capturing images that reveal its contents, enabling inspectors to identify and analyze items without the need to physically open or unload them.
Traditionally, cargo inspection was primarily done manually, relying on the experience and judgment of security officers. However, this manual procedure was prone to human error and was affected by factors such as fatigue and mental stress. As a result, automatic detection of forbidden objects during cargo inspection, particularly through artificial intelligence (AI)-based X-ray scanning, is used to reduce processing time and improve detection accuracy.
Deep learning, with its exceptional ability to process image data, has emerged as a transformative technology in X-ray cargo inspection, automating and enhancing the detection of contraband, explosives, and other prohibited items. In addition to accelerating and automating the inspection process, it also enhances the accuracy of the process, making it one of the most indispensable tools for cargo security today. Ref. [2] reviews the practical applications of deep learning algorithms in X-ray-based detection of hazardous materials, including improving public safety in airports, train stations, and subways. X-ray-based detection of dangerous goods in baggage benefits from more accurate and efficient object-detection algorithms, which in turn strengthen security screening.
Deep learning is viewed as part of computer vision or machine vision in the context of X-ray cargo inspection. A deep learning approach can be used to perform three types of machine-vision tasks: image classification, object detection, and segmentation [3]. This project falls under the category of object detection, specifically within the context of X-ray inspection, where the goal is to identify and localize objects of interest, such as contraband, weapons, or hazardous materials, within X-ray images.
YOLO (You Only Look Once) is currently a highly popular deep learning object-detection algorithm due to its real-time performance and accuracy. YOLO has undergone substantial evolution, progressing from YOLOv1 to YOLOv11, the latest iteration as of 2024. These advancements include multi-scale detection capabilities, improved computational efficiency, and enhanced adaptability to diverse datasets and hardware platforms. Such improvements have established YOLO as a preferred solution for high-speed, precise tasks across applications like industrial quality control [4], biomedical applications [5], security inspection [2,6], security surveillance [7,8] and autonomous systems [9,10]. Recent review publications on YOLO variants and their various applications can be found in [5,11,12,13,14,15,16,17,18].
In YOLO-based object detection, the loss function is vital for training the model to accurately predict object locations and classifications [19]. A key component is the Intersection over Union (IoU), which quantifies the overlap between predicted bounding boxes and ground truth boxes, directly influencing the precision of object localization. IoU-based metrics ensure that the confidence scores reflect the quality of localization while penalizing inaccurate predictions. Variants like Generalized IoU (GIoU), Complete IoU (CIoU), and Distance IoU (DIoU) further enhance robustness by addressing limitations of standard IoU, such as handling nonoverlapping boxes or optimizing aspect ratios. These improvements lead to stable training and better performance, making IoU essential for applications like X-ray cargo inspection, where precise detection of prohibited items is critical for operational success.
In light of the inherent limitations of manual cargo inspection and the demonstrated potential of deep learning-based object detection to assist human inspectors in improving efficiency and robustness in X-ray-based cargo screening, this study is guided by the following hypotheses:
  • YOLO-based object-detection models trained with advanced IoU-based loss functions (GIoU, DIoU, CIoU, WIoU) achieve a statistically significant improvement in accuracy compared to models using the standard IoU loss function on cargo X-ray datasets.
  • The combination of advanced IoU-based loss functions and Soft-NMS significantly reduces the rates of false positives and false negatives, particularly for occluded or overlapping prohibited items in X-ray images of cargo.
  • The integration of advanced IoU-based loss functions into the latest YOLO architecture offers a more favorable trade-off between detection accuracy and computational efficiency for real-time X-ray-based cargo inspection than other popular deep learning object-detection architectures such as DETR (DEtection TRansformer) and Faster R-CNN (Region-based Convolutional Neural Network).
The objective of this paper is to investigate the feasibility of combining various YOLO versions with different loss functions to evaluate their accuracy and efficiency in YOLO-based object detection for X-ray cargo inspection. This study aims to enhance the performance of object-detection models by exploring the interplay between a standard YOLO architecture and loss functions. The dataset used is a private dataset provided by a security company located in Malaysia. The main contributions of this paper are as follows:
  • Benchmarks standard IoU against advanced variants (i.e., GIoU, DIoU, CIoU, WIoU) to determine their effectiveness in addressing domain-specific challenges like occlusion and overlapping objects on a specialized cargo X-ray dataset.
  • Demonstrates the effectiveness of modified IoU variants in optimizing object detection in X-ray cargo inspection, addressing a significant gap in the existing research.
  • Highlights the scarcity of annotated cargo X-ray datasets, encouraging further research, dataset development, and collaboration in this important field in security and inspection.
  • Incorporates a Soft-NMS mechanism in the latest YOLOv11, enhancing detection accuracy, especially for challenging categories of contraband items.

2. Background and Related Works

2.1. Modern YOLO Architecture

According to a review paper published in 2023 [2], there are three types of object detectors, namely CNN-based detectors, Transformer-based detectors, and hybrid detectors. CNN-based detectors include the R-CNN (Regional Convolutional Neural Networks) series and the YOLO series. One popular end-to-end Transformer-based detector is called DETR (DEtection TRansformer). We can also classify deep learning-based object detectors as one-stage detectors and two-stage detectors. Detectors in the R-CNN series are two-stage detectors, while those in the YOLO series and SSD (Single Shot MultiBox Detector) are one-stage detectors. YOLO has several key advantages over two-stage detectors (such as R-CNN) and transformer-based detectors (like DETR). YOLO’s design prioritizes speed, simplicity, and real-time performance, making it one of the most popular choices for practical applications including X-ray imaging. The YOLO series has evolved through multiple versions since its first appearance in 2016 [20], with each new version building on previous versions to improve performance and address limitations. As of the time of writing of this manuscript, the most recent version was YOLOv11, launched by the Ultralytics team in September 2024 [21]. Initial versions of YOLO were built based purely on a CNN architecture. In later versions, from version YOLOv5 onward, the architecture is composed of three parts, namely the backbone, neck, and head, as shown in Figure 1 [12]. This paper will implement YOLOv5 and compare its performance to that of the latest version, YOLOv11.
The respective responsibilities of these three parts (backbone, neck, and head) are as follows:
  • Backbone: The backbone is all about efficiency and effective feature extraction. It often employs architectures like EfficientNet [22] or CSPNet [23]. These networks excel at capturing essential features from the input image while minimizing the number of parameters and operations. In addition to efficient architectures, the backbone may incorporate techniques like attention mechanisms to further enhance feature representation. Attention mechanisms help the network focus on the most relevant parts of the image, improving the discriminative power of the learned features.
  • Neck: The neck is a crucial component that connects the backbone to the head. Its primary role is to combine features from different levels of the backbone, creating a feature pyramid that captures information at multiple scales. Feature Pyramid Networks (FPNs) [24] are commonly used in the neck, and variants like PAN (Path Aggregation Network) [25] further improve feature flow and information exchange between different levels of the pyramid. The neck’s focus is on efficient feature fusion, ensuring minimal information loss while combining features from different scales.
  • Head: The head is responsible for making the final predictions, including object location, class, and confidence score. The head typically consists of several components: objectness prediction, classification, and regression. Objectness prediction determines the probability of an object being present in a specific region. Classification assigns a class label to the detected object (e.g., person, car, etc.). Regression refines the bounding-box coordinates to accurately localize the object within the image. The head utilizes loss functions that balance classification and localization accuracy, guiding the training process towards optimal performance.
Further detailed explanations of the architectures of different YOLO versions, particularly concerning the changes in feature-extraction and feature-fusion techniques, as well as their respective applications, are provided in studies such as [5,12,26,27,28]. We summarize the comparative analysis of different YOLO versions (from YOLOv5 onward), focusing on their architectures, in Table 1.
The architecture of YOLOv11 is shown in Figure 2. YOLOv11 retains a structure similar to those of earlier versions, beginning with convolutional layers that downsample the input image. These layers are crucial for feature extraction, progressively decreasing spatial resolution while increasing channel depth. A key enhancement in YOLOv11 is the integration of the C3k2 block, which replaces the previously used C2f block [36]. The C3k2 block serves as a more computationally efficient variant of the Cross Stage Partial (CSP) Bottleneck. In contrast to the single large convolution used in YOLOv8, C3k2 utilizes two smaller convolutions. The term “k2” reflects the use of a smaller kernel size, which improves processing speed without sacrificing performance.
YOLOv11 retains the Spatial Pyramid Pooling–Fast (SPPF) block from earlier iterations but introduces a new component—the Cross Stage Partial with Spatial Attention (C2PSA) block—immediately following it [36]. This newly added C2PSA block enhances the model’s ability to focus on important spatial regions within the feature maps. By incorporating a spatial attention mechanism, it allows YOLOv11 to better prioritize critical areas of the image. Through spatial pooling, the C2PSA block helps the network concentrate on key regions, potentially improving the accuracy of object detection across different scales and positions.
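To make this description more concrete, the following PyTorch sketch illustrates the two ideas discussed above: a CSP-style block that splits channels and processes one branch with two small convolutions, followed by a spatial-attention stage that re-weights important regions of the feature map. This is an illustrative approximation only, not the Ultralytics C3k2/C2PSA implementation; the class names and layer choices are assumptions.

```python
# Illustrative PyTorch sketch (not the Ultralytics implementation) of a
# CSP-style block with two small convolutions and a spatial-attention stage.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Re-weights each spatial location using pooled channel statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_pool = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_pool, _ = x.max(dim=1, keepdim=True)      # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                                # emphasize salient regions

class CSPLikeBlock(nn.Module):
    """Splits channels, transforms one branch with two 3x3 convolutions,
    then fuses both branches (cross-stage-partial idea) and applies attention."""
    def __init__(self, channels: int):                 # assumes an even channel count
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.attn = SpatialAttention()

    def forward(self, x):
        identity, transformed = x.chunk(2, dim=1)
        out = self.fuse(torch.cat([identity, self.branch(transformed)], dim=1))
        return self.attn(out)

# Example: a 64-channel feature map keeps its shape through the block.
features = torch.randn(1, 64, 80, 80)
print(CSPLikeBlock(64)(features).shape)               # torch.Size([1, 64, 80, 80])
```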

2.2. Performance Metrics in Object Detection

The following four metrics are commonly employed for evaluating the performance of object-detection models:
  • Model size (Parameter Count/PC) [10]: The total number of parameters (weights and biases) or the model’s storage size, which is reduced through compression techniques like pruning and quantization.
    $$\mathrm{PC} = \left(C_{\mathrm{in}}\,K^{2} + \delta(\mathrm{bias})\right) M\,C_{\mathrm{out}}, \qquad \mathrm{bias} \in \{0, 1\}$$
    Here, $C_{\mathrm{in}}$ represents the number of input channels, $C_{\mathrm{out}}$ is the number of output channels, $K$ denotes the size of the convolutional kernel, bias represents the bias term, $M$ is the number of convolutional kernels, and $\delta(\mathrm{bias})$ is an indicator function that equals 1 if the bias is included and 0 otherwise.
  • FLOPs (Floating-Point Operations) [37]: This refers to the number of floating-point calculations that a model requires during inference. It is a key metric in evaluating the efficiency of an object-detection model, as it directly influences the model’s computational complexity and speed. Reducing FLOPs often reduces power consumption, which is critical for mobile and embedded AI. The FLOPs calculations for convolutional and fully connected layers are distinct; the following equations can be used to compute FLOPs for convolutional and fully connected layers, respectively [38]:
    $$\mathrm{FLOPs}_{\mathrm{conv}} = \left(2\,C_{\mathrm{in}}\,K^{2} + 2\right) H\,W\,C_{\mathrm{out}}$$
    $$\mathrm{FLOPs}_{\mathrm{fc}} = (2I - 1)\,O$$
    where $K$ denotes the kernel size; $H$ and $W$ represent the spatial dimensions (height and width) of the output layer; $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ indicate the numbers of channels in the input and output layers, respectively; and $I$ and $O$ denote the numbers of neurons in the input and output layers.
  • Model inference (latency): Latency refers to the time it takes for the model to process a single input (image or frame) and produce an output (detection or prediction). It is often measured in milliseconds (ms). Lower latency means the system can process inputs faster, which is critical for real-time applications.
  • Model accuracy: The evaluation of detection accuracy is described using Precision and Recall, respectively expressed in Equations (4) and (5).
    $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
    $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
    Here, $TP$ refers to True Positives, indicating instances that are correctly detected as positive. $FP$ denotes False Positives, referring to instances that are incorrectly detected as positive when they are actually negative. $FN$ stands for False Negatives, indicating instances that are incorrectly detected as negative when they are actually positive. Since object-detection tasks involve both object classification and localization, a detection is considered positive only if it satisfies both criteria. Otherwise, it is regarded as a negative detection. The positivity of classification is straightforward: it depends on whether the predicted class matches the ground truth. In contrast, localization positivity is primarily determined by the Intersection over Union (IoU), as defined in Equation (6). A detection is considered correctly localized if the IoU exceeds a predefined threshold (e.g., 0.5). An illustration of detection accuracy in object detection is presented in Figure 3.
    A harmonic mean called the F1-score is calculated to ensure that both precision and recall are considered equally and is expressed as follows:
    $$F_{1}\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
    Assuming the predicted bounding box is denoted as $B_p$ and the actual detection box as $B_r$, the formula for Intersection over Union (IoU) is given as follows:
    $$\mathrm{IoU} = \frac{|B_p \cap B_r|}{|B_p \cup B_r|}$$
    When $B_p$ and $B_r$ have no intersection, IoU is 0, and when they perfectly overlap, IoU is 1. The range of IoU values is between 0 and 1, with higher values indicating greater proximity between the predicted and real boxes. Similar to other methods, YOLO structures often employ IoU as a loss function.
    In principle, aiming for higher values of Precision and Recall is ideal. However, in real-world situations, these two metrics frequently conflict, posing a challenge in intuitively comparing detection accuracy. Therefore, the detection task employs a more comprehensive evaluation metric called mean average precision (mAP), as shown in Equation (8).
    $$\mathrm{mAP} = \frac{1}{m}\sum_{i=1}^{m} \mathrm{AP}_{i}$$
    Here, $m$ represents the number of object categories to be detected and $\mathrm{AP}_{i}$ denotes the average precision for category $i$, which quantifies the area under the Precision–Recall curve.
The specific detection-accuracy metrics employed in this study are mAP@50 and mAP@50:95. The metric mAP@50 indicates that when the IoU value exceeds 50%, the object is considered positively located and the mAP value is computed accordingly. In contrast, mAP@95 requires greater overlap between the detected and the real bounding box, with an IoU threshold of 95% for a positive detection. Therefore, mAP@50:95 (mean Average Precision averaged over IoU thresholds from 50% to 95% in steps of 5%) provides a stringent evaluation of detection accuracy [37]; a minimal computational sketch of these metrics is given below.
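For concreteness, the sketch below expresses the efficiency and accuracy metrics above as plain Python functions. Boxes are assumed to be axis-aligned in (x1, y1, x2, y2) form, and the helper names are illustrative rather than tied to any particular detection library.

```python
def conv_params(c_in: int, c_out: int, k: int, m: int, bias: bool = True) -> int:
    """Parameter count of a convolutional layer, following the PC formula above."""
    return (c_in * k * k + (1 if bias else 0)) * m * c_out

def conv_flops(c_in: int, c_out: int, k: int, h: int, w: int) -> int:
    """FLOPs of a convolutional layer producing an H x W output feature map."""
    return (2 * c_in * k * k + 2) * h * w * c_out

def fc_flops(i: int, o: int) -> int:
    """FLOPs of a fully connected layer with I inputs and O outputs."""
    return (2 * i - 1) * o

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

def iou(box_p, box_r) -> float:
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_p[0], box_r[0]), max(box_p[1], box_r[1])
    x2, y2 = min(box_p[2], box_r[2]), min(box_p[3], box_r[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_r = (box_r[2] - box_r[0]) * (box_r[3] - box_r[1])
    union = area_p + area_r - inter
    return inter / union if union else 0.0

# A detection counts as a true positive only if the predicted class matches the
# ground truth AND iou(...) exceeds the chosen threshold (0.5 for mAP@50).
# mAP@50:95 repeats the evaluation at thresholds 0.50, 0.55, ..., 0.95 and
# averages the resulting per-class average precisions.
```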

2.3. IoU Loss Functions and Their Modification

Similarly to other methods, YOLO models commonly employ Intersection over Union (IoU) as a loss function to iteratively adjust parameters, aiming to increase the IoU value such that it approaches 1. Consequently, selecting an appropriate loss function is essential, as the choice of loss function directly affects convergence speed and the overall quality of localization. The IoU-based loss function, denoted as $L_{\mathrm{IoU}}$, is defined in Equation (9) as follows:
$$L_{\mathrm{IoU}} = 1 - \frac{|B_p \cap B_r|}{|B_p \cup B_r|}$$
where $B_p$ and $B_r$ denote the predicted and ground-truth bounding boxes, respectively.
However, traditional IoU exhibits limitations, particularly when dealing with overlapping, small, or elongated objects. To address these challenges, several enhanced IoU-based loss functions have been proposed to improve the accuracy and robustness of YOLO-based object detection. These include CIoU [39], DIoU [40], GIoU [41], and WIoU [42].
A key factor in evaluating the overlap between two bounding boxes is the distance between their center points. The Distance IoU (DIoU) loss incorporates this distance, aiming to minimize the discrepancy between the centers of the predicted and ground-truth boxes. DIoU is defined as follows:
$$L_{\mathrm{DIoU}} = 1 - \mathrm{IoU} + \frac{d^{2}(c, c_r)}{w_m^{2} + h_m^{2}}$$
Here, $w_m$ and $h_m$ denote the width and height of the minimum enclosing rectangle covering both $B_p$ and $B_r$; $c$ and $c_r$ are the center coordinates of the predicted and ground-truth boxes, respectively; and $d$ is the Euclidean distance between $c$ and $c_r$.
When the center points coincide, the penalty term becomes zero and the loss reduces to the standard IoU, even if the two boxes still differ in shape. This shortcoming motivates the Complete IoU (CIoU) loss, which introduces an additional term to penalize differences in aspect ratio. The CIoU loss is defined as follows:
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{d^{2}(c, c_r)}{w_m^{2} + h_m^{2}} + u\,v$$
where
$$u = \frac{v}{(1 - \mathrm{IoU}) + v}, \qquad v = \frac{4}{\pi^{2}} \left( \arctan\frac{w_r}{h_r} - \arctan\frac{w}{h} \right)^{2}$$
Here, $w_r$ and $h_r$ denote the width and height of the ground-truth box, and $w$ and $h$ are those of the predicted box.
The Generalized IoU (GIoU) improves upon IoU by introducing a penalty term that considers the area of the smallest enclosing box C, thus providing useful gradients even for nonoverlapping boxes. The GIoU loss is formulated as follows:
$$L_{\mathrm{GIoU}} = 1 - \mathrm{IoU} + \frac{|C \setminus (B_r \cup B_p)|}{|C|}$$
This formulation encourages better spatial alignment and localization of bounding boxes.
The Weighted IoU (WIoU) loss is another variant designed to enhance bounding-box regression by incorporating a weighting factor based on box attributes such as aspect ratio, scale, or confidence scores. This approach helps to address issues related to scale variance and data imbalance. The WIoU v1 loss is defined as follows:
$$L_{\mathrm{WIoU(v1)}} = L_{\mathrm{IoU}} \cdot R_{\mathrm{WIoU}}$$
$$R_{\mathrm{WIoU}} = \exp\!\left( \frac{(x_r - x_p)^{2} + (y_r - y_p)^{2}}{d_C^{2}} \right)$$
where $(x_r, y_r)$ and $(x_p, y_p)$ are the center coordinates of the ground-truth and predicted bounding boxes, respectively, and $d_C$ is the diagonal length of the smallest enclosing box.
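The self-contained Python sketch below implements the loss variants above for single boxes in (x1, y1, x2, y2) form; in a training pipeline the same formulas would be applied element-wise to tensors of predicted and ground-truth boxes, and refinements such as gradient detaching in WIoU are omitted. The helper names are illustrative.

```python
# Sketch of the IoU-family losses for single boxes bp (predicted) and br (ground truth).
import math

def _areas_and_inter(bp, br):
    ix1, iy1 = max(bp[0], br[0]), max(bp[1], br[1])
    ix2, iy2 = min(bp[2], br[2]), min(bp[3], br[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_r = (br[2] - br[0]) * (br[3] - br[1])
    return inter, area_p, area_r

def iou_loss(bp, br):
    inter, area_p, area_r = _areas_and_inter(bp, br)
    return 1.0 - inter / (area_p + area_r - inter)

def giou_loss(bp, br):
    inter, area_p, area_r = _areas_and_inter(bp, br)
    union = area_p + area_r - inter
    cw = max(bp[2], br[2]) - min(bp[0], br[0])       # enclosing-box width
    ch = max(bp[3], br[3]) - min(bp[1], br[1])       # enclosing-box height
    c_area = cw * ch
    return 1.0 - inter / union + (c_area - union) / c_area

def diou_loss(bp, br):
    inter, area_p, area_r = _areas_and_inter(bp, br)
    union = area_p + area_r - inter
    cw = max(bp[2], br[2]) - min(bp[0], br[0])       # w_m
    ch = max(bp[3], br[3]) - min(bp[1], br[1])       # h_m
    d2 = ((bp[0] + bp[2]) / 2 - (br[0] + br[2]) / 2) ** 2 + \
         ((bp[1] + bp[3]) / 2 - (br[1] + br[3]) / 2) ** 2   # squared centre distance
    return 1.0 - inter / union + d2 / (cw ** 2 + ch ** 2)

def ciou_loss(bp, br):
    inter, area_p, area_r = _areas_and_inter(bp, br)
    iou = inter / (area_p + area_r - inter)
    w, h = bp[2] - bp[0], bp[3] - bp[1]
    wr, hr = br[2] - br[0], br[3] - br[1]
    v = (4 / math.pi ** 2) * (math.atan(wr / hr) - math.atan(w / h)) ** 2
    u = v / ((1 - iou) + v + 1e-9)                   # aspect-ratio trade-off term
    return diou_loss(bp, br) + u * v

def wiou_v1_loss(bp, br):
    cw = max(bp[2], br[2]) - min(bp[0], br[0])
    ch = max(bp[3], br[3]) - min(bp[1], br[1])
    d2 = ((bp[0] + bp[2]) / 2 - (br[0] + br[2]) / 2) ** 2 + \
         ((bp[1] + bp[3]) / 2 - (br[1] + br[3]) / 2) ** 2
    r_wiou = math.exp(d2 / (cw ** 2 + ch ** 2))      # d_C^2 = w_m^2 + h_m^2
    return iou_loss(bp, br) * r_wiou
```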

2.4. X-Ray-Based Baggage- and Cargo-Inspection Datasets

In this section, we review research studies on implementations of YOLO object-detection models with a specific focus on X-ray security imaging of cargo. Current state-of-the-art object-detection methods involve two basic tasks: object classification and object localization.
Training a YOLO model requires high-quality datasets with annotated bounding boxes and class labels. Specific datasets are preferred for specific applications. Table 2 shows a list of the most common datasets used for training YOLO models, categorized by their use cases and domain. These data sets come mainly from four standard public datasets for object-detection tasks, namely PASCAL VOC (Pascal Visual Object Classes) [43], ILSVRC (ImageNet Large Scale Visual Recognition Challenge) [44], MS-COCO (Microsoft Common Objects in Context) [45], and Open Images from Google [46].
In addition, Table 3 shows a list of X-ray imaging datasets used in research on baggage security inspection that are publicly accessible. The column #Positive refers to the total number of positive (prohibited) items in the dataset, such as cutters, knives, etc., while #Negative refers to the total number of normal (nonprohibited) items. Note that the total of #Positive and #Negative may exceed the total number of images (#Images), as in certain datasets, an image may contain multiple elements. In other situations, only #Positive items are annotated, resulting in a value of 0 in the column #Negative.
Samples of X-ray images from baggage and cargo inspection are shown in Figure 4. Unfortunately, compared to X-ray baggage-image datasets, the number of publicly available X-ray cargo-image datasets is limited. Based on our survey, only two such datasets are accessible. Some datasets are fully synthetic or partially augmented with synthetic data, such as by adding prohibited items to existing X-ray images. This approach is necessitated by the scarcity of real-world data containing genuine prohibited items.
Table 4 presents a list of X-ray imaging datasets used in cargo-inspection research, including the dataset used in this work. Regarding our dataset, a more detailed explanation of the data-collection process and its attributes will be presented in Section 3.

2.5. X-Ray Baggage and Cargo Inspection Using the YOLO Algorithm

This section reviews related studies that investigate the application of YOLO in X-ray baggage and cargo inspection, emphasizing its effectiveness in detecting concealed or prohibited items. The outcomes demonstrate how YOLO adapts to the specific challenges of X-ray images, such as overlapping objects and varying densities, through fine-tuning and custom dataset training.
Table 5 shows recent studies of X-ray image inspection using YOLO-based object detection. Research applying YOLO (You Only Look Once) object-detection models to X-ray cargo inspection is still in its early stages compared to the extensive studies conducted on baggage X-ray screening. While YOLO has been effectively utilized in baggage X-ray applications, such as threat-object detection and prohibited-item detection, its application to cargo X-ray images remains limited. This disparity highlights the need for the further exploration and development of YOLO-based models tailored for the unique challenges presented by cargo X-ray inspection.
As shown in Table 5, while numerous studies have applied various YOLO versions (up to YOLOv10) for X-ray baggage inspection, only a single study has utilized YOLOv3 for X-ray cargo inspection. This paper advances research in X-ray cargo inspection by applying the latest YOLO models (from YOLOv5 to YOLOv11), addressing the limitations of earlier studies. Additionally, the implementation of Intersection over Union (IoU) variants in this field has not been sufficiently explored in the existing literature.
Furthermore, only two studies so far have proposed a modified IoU loss function, specifically the so-called Inner-IoU loss in [68] and Soft-WIoU-NMS in [6]. In [68], the authors propose the Inner-CIoU loss function, which introduces a scale factor to alter the size of auxiliary bounding boxes and thus boost the precision of bounding-box regression. This method refines the classic IoU calculation by allowing the model to handle diverse item sizes more accurately, which is notably beneficial in congested X-ray images where small and overlapping objects must be detected individually. Inner-CIoU is particularly effective in the high-stakes environment of security screening because it addresses the challenges of generalization and convergence that traditional IoU-based loss calculations face, resulting in faster, more accurate model training and improved detection performance. In their results, adopting the Inner-CIoU loss function on SIXray, HiXray, CLCXray, and PIDray increases accuracy by 1.0%, 0.4%, 0.2%, and 1.3%, respectively.
Table 5. YOLO-based studies for X-ray baggage- and cargo-image inspection.
Dataset | Ref. | Year | YOLO Version | Modified IoU Loss | Metrics
CLCXray | [69] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n | none | mAP50 = 0.731, 0.755, 0.766, 0.772
SIXray | [69] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n | none | mAP50 = 0.849, 0.896, 0.916, 0.920
PIDray | [69] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n | none | mAP50 = 0.789, 0.803, 0.816, 0.825
SIXray | [68] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9, YOLOv10n | Inner-IoU loss | mAP50 = 0.853, 0.931, 0.911, 0.908, 0.940, 0.914, 0.916
HiXray | [68] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9, YOLOv10n | Inner-IoU loss | mAP50 = 0.756, 0.810, 0.805, 0.804, 0.823, 0.809, 0.811
CLCXray | [68] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9, YOLOv10n | Inner-IoU loss | mAP50 = 0.726, 0.883, 0.865, 0.880, 0.886, 0.890, 0.880
PIDray | [68] | 2024 | YOLOv4, YOLOv5s, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9, YOLOv10n | Inner-IoU loss | mAP50 = 0.764, 0.836, 0.832, 0.825, 0.861, 0.835, 0.825
SIXray | [6] | 2023 | YOLOv3, YOLOv5, YOLOv7 | Soft-WIoU-NMS | mAP50 = 0.876, 0.910, 0.951
Baggage | [49] | 2023 | YOLOv3, YOLOv5, YOLOv8 | none | mAP50 = 0.893, 0.875, 0.933
OPIXray | [49] | 2023 | YOLOv8 | none | mAP = 0.880
SIXray | [63] | 2022 | YOLOv3 | none | mAP50 = 0.738
CargoX | [63] | 2022 | YOLOv3 | none | mAP50 = 0.931
In [6], the authors propose a Soft-NMS variant based on the WIoU loss function (Soft-WIoU-NMS) to address the problem of object overlap and achieve satisfactory results. Compared to the original NMS, Soft-NMS prioritizes the selection of prediction boxes with overlapping positions and incorporates a WIoU penalty term to boost accuracy. As shown in their results, the accuracy of YOLOv7 using WIoUv1 is 0.4% to 0.7% higher than with the other IoU versions.

3. Methodology

3.1. X-Ray Cargo-Image Datasets and Preprocessing

The dataset used in this paper was mainly obtained from Billion Prima Sdn Bhd, Malaysia. The X-ray cargo-image datasets are stored on the company’s cloud server, which contains numerous image datasets from different customs ports, organized on a daily basis.
We use grayscale images from the X-ray cargo-image datasets to train the deep learning-based object-detection model. There are 11 classes of labelled items to be detected, namely Automotives/Automotive Parts, Liquid Tank, Refrigerator, Nonhomogeneous, Metal Scrap, High Intensity, Suspicious Arrangement, Drums, Food/Fruits, E-waste, and Unknowns. Table 6 shows the distribution of category/class labels of items in our datasets.
The image preprocessing implemented in this project includes cropping and resizing. The cropping technique automates the removal of horizontal white regions from the right side of images using image-processing methods. The process begins by converting the image to grayscale, then applying binary thresholding to create a binary map. In this map, pixels with intensity values of 240 or above are set to white (255), while others are set to black (0). This binary map isolates the white regions, allowing the script to calculate column-wise sums of white pixels and identify the boundary where the white area ends. A small offset is added to prevent the crop from being too tight, and the image is horizontally trimmed at the detected boundary. The cropped image is then saved, preserving the original content while removing unwanted white space. The use of binary mapping is essential in this process, as it simplifies the image into a black-and-white representation, making the analysis and cropping of the white regions both efficient and precise.
On the other hand, the resizing methods used in this project resize images to a target size using the Python Imaging Library. It automatically resizes every image in the folder to specified dimensions (2000 × 816) using the library’s “resize” function and filters, ensuring high-quality down-sizing. This process is useful for standardizing image sizes, which is often required for tasks like training deep learning models.
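A minimal sketch of this preprocessing pipeline is given below, assuming a white threshold of 240 and the 2000 × 816 target size mentioned above; the function names and the 10-pixel safety offset are illustrative and not the exact production script.

```python
# Sketch: trim the white strip on the right of a scan, then resize with PIL.
import numpy as np
from PIL import Image

WHITE_THRESHOLD = 240
TARGET_SIZE = (2000, 816)   # (width, height)

def crop_white_right(img: Image.Image, offset: int = 10) -> Image.Image:
    gray = np.asarray(img.convert("L"))
    binary = (gray >= WHITE_THRESHOLD).astype(np.uint8)         # white -> 1, else 0
    col_white = binary.sum(axis=0)                              # column-wise white counts
    content_cols = np.where(col_white < binary.shape[0])[0]     # columns not fully white
    if content_cols.size == 0:
        return img                                              # nothing to crop
    right = min(int(content_cols[-1]) + offset, gray.shape[1])  # keep a small margin
    return img.crop((0, 0, right, img.height))

def preprocess(path_in: str, path_out: str) -> None:
    img = Image.open(path_in)
    img = crop_white_right(img)
    img = img.resize(TARGET_SIZE, Image.LANCZOS)                # high-quality resampling
    img.save(path_out)
```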

3.2. Image Annotation

The annotation software tool used in this project is LabelImg v1.8.6, a graphical image-annotation tool designed to label objects in images, which is crucial for training machine learning models, especially for object-detection tasks. The LabelImg annotation tool was used to draw bounding boxes around objects of interest in each image, with each box assigned a corresponding label (e.g., “Refrigerator” or “Food & Fruits”). Figure 5 illustrates the GUI of the LabelImg annotation tool applied to the X-ray cargo dataset.
These annotations are saved as text files using the YOLO format, which stores the coordinates of the bounding boxes along with their associated labels. This labeled dataset serves as the ground truth for training models, allowing them to learn and recognize objects in new unseen images. This process ensured precise and consistent annotations, which are essential for achieving high accuracy and robust performance from deep learning models. The general annotation process is guided and verified by experienced X-ray security staff using real contraband cases to ensure accuracy and relevance. Samples of the annotated images can be accessed at github.com/iwanxx/Xray-cargo (accessed on 26 August 2025).
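As a brief illustration of this format, the snippet below shows a single, made-up label line and how it decomposes; the class index, its mapping to a category name, and the coordinate values are hypothetical, and all four coordinates are normalized by the image width and height.

```python
# YOLO format: one object per line, "class_id x_center y_center width height",
# with all four coordinates normalized to [0, 1] by the image dimensions.
example_line = "2 0.4830 0.5120 0.2105 0.3468"   # hypothetical values, e.g. class 2 = "Refrigerator"

class_id, x_c, y_c, w, h = example_line.split()
print(int(class_id), float(x_c), float(y_c), float(w), float(h))
```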
Finally, Figure 6 provides deeper insight into the X-ray cargo images used for training. In addition to the distribution of samples across the 11 categories (Figure 6a), Figure 6b shows the distribution of bounding-box sizes, indicating variation in shape, i.e., squares and both vertical and horizontal rectangles. Figure 6c reveals that, spatially, many bounding boxes are located around the centers of the images, while some lie near the edges. Figure 6d indicates that vertical rectangular bounding boxes dominate, despite the presence of numerous horizontal rectangles.

3.3. Setup for Model Training

All experiments were conducted in the same environment on a Windows system. The hardware configuration comprised a 12th Gen Intel(R) Core(TM) i5-12400F processor, 16 GB of RAM, and an NVIDIA GeForce RTX 4070 Super graphics card with 12 GB of video memory; CUDA version 12.1 and Python version 3.12 were used. The input image size was consistently adjusted to 2000 × 816. Hyperparameter tuning was implemented for this project as follows: the number of epochs was set to 300 for model training, the batch size was set to 32, and the initial learning rate was set to 0.001. The probability hyperparameter of Mosaic data augmentation was set to 1.0, and all other parameters were left at the default settings.
The dataset consists of 1117 training images, 320 validation images, and 160 test images, excluding background images, using a 7:2:1 ratio to ensure balanced and effective model evaluation. To ensure the reproducibility of experimental results and training outcomes, the seed function provided by YOLO was utilized. This function sets a fixed random seed across multiple libraries commonly used in deep learning workflows. It guarantees consistent data shuffling, weight initialization, and other stochastic processes throughout the training pipeline.
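A minimal training sketch under these settings, assuming the Ultralytics Python API, a hypothetical dataset configuration file cargo.yaml describing the splits and the 11 classes, and an illustrative square training resolution (the API is typically configured with a single imgsz value), is shown below.

```python
# Minimal training sketch using the Ultralytics Python API with the
# hyperparameters listed above; file names and imgsz are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")            # YOLOv11 nano weights as the starting point
results = model.train(
    data="cargo.yaml",                # paths to train/val/test splits and 11 class names
    epochs=300,
    batch=32,
    lr0=0.001,                        # initial learning rate
    mosaic=1.0,                       # probability of Mosaic augmentation
    imgsz=2016,                       # illustrative size; the raw scans are 2000 x 816
    seed=0,                           # fixed seed for reproducible shuffling and init
)
metrics = model.val()                 # mAP@50 and mAP@50:95 on the validation split
```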

3.4. Evaluation of Model Performance

In this work on identifying contraband items at port customs, it is essential to prioritize both real-time performance and model accuracy to ensure consistent detection results. To evaluate model performance in contraband detection, key metrics were employed, including precision, recall, F1-score, mean average precision (mAP) at different thresholds (namely mAP@50 and mAP@50:95), FLOPs, and inference time. These metrics were explained in Section 2.
Hence, model performance is measured in terms of detection accuracy (precision, recall, F1-score, mAP), computational efficiency (FLOPs), and real-time capability (inference time). Inference time, the duration required for the model to process a single input and generate a prediction, is critical for port customs operations, where rapid processing of scanned cargo is essential to maintaining workflow efficiency.

3.5. Incorporation of Soft-NMS and Modified IoU Loss Functions

Conventionally, recent YOLO algorithms use standard Non-Maximum Suppression (NMS) to eliminate duplicate bounding boxes by retaining only the most confident one among overlapping detections. However, this approach may suppress true positives in cases where objects are densely packed or partially overlapping, as in cargo X-ray images.
To address this limitation, Soft-NMS is integrated into the YOLO model. Unlike standard NMS, Soft-NMS reduces the confidence scores of overlapping bounding boxes instead of removing them entirely. This is achieved by decaying the confidence score $s_i$ of each remaining detection box $b_i$ based on its Intersection over Union (IoU) with the current highest-scoring box $M$ using the following Gaussian penalty function [70]:
$$s_i = s_i \, e^{-\frac{\mathrm{IoU}(M, b_i)^{2}}{\sigma}}, \qquad \forall\, b_i \in D$$
where $D$ denotes the set of filtered candidate boxes and $\sigma$ is a hyperparameter controlling the decay rate. This method allows the model to retain closely spaced true positives, thereby enhancing detection robustness in cluttered or complex scenes.
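A compact sketch of this Gaussian decay is shown below; the box layout (x1, y1, x2, y2), the default σ of 0.5, and the score threshold used for pruning are illustrative choices rather than the exact values used in our pipeline.

```python
# Sketch of Gaussian Soft-NMS: overlapping boxes are down-weighted, not deleted.
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list[int]:
    """Returns indices of kept boxes, ordered by (decayed) score."""
    idxs = list(range(len(scores)))
    scores = scores.copy()
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: scores[i])    # current highest-scoring box M
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            overlap = iou(boxes[best], boxes[i])     # pairwise IoU
            scores[i] *= np.exp(-(overlap ** 2) / sigma)   # Gaussian decay from the equation above
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep

def iou(a, b) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```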
In addition to modifying the NMS mechanism, this work also investigates enhancements to the bounding-box-regression loss function. While YOLO algorithms typically rely on standard IoU loss during bounding-box regression, we explore alternative IoU-based loss functions to determine the most effective variant, as discussed in Section 2. A comparative summary of these loss functions and their characteristics is provided in Table 7.

4. Experimental Results

4.1. Comparing Different Object-Detection Models on the Cargo and Baggage Datasets

To evaluate detection performance, we compared several object-detection models (i.e., YOLO variants, DETR, and Faster R-CNN) using our compiled cargo dataset and the PIDray dataset [53].
The results for our cargo dataset, summarized in Table 8, show that the YOLOv11 model generally outperforms the other YOLO variants, particularly in the key evaluation metrics mAP@50 and mAP@50:95. In Table 8, bold and underlined values represent the highest performance for each metric, while underlined values indicate the second-highest. Faster R-CNN and DETR incur significantly higher FLOPs compared to the YOLO variants, despite the comparable accuracy between the DETR and YOLO models. Clearly, Faster R-CNN and DETR are not favorable choices of model, as both produce longer inference time in real-time applications. Similar trends are observed for the PIDray baggage dataset, as shown in Table 9.
We now turn our attention to the results produced for our cargo dataset (Table 8), which represents the main focus of this study. The YOLOv11 model achieved 95.9% precision; although this value is not the highest, the model remains highly competitive. For recall, which is the more important metric in the detection of contraband items, the YOLOv11 model achieved the highest value. The YOLOv11 model achieved the second-highest F1-score (95.9%), trailing YOLOv10n by just 0.1%, with both surpassing the other variants.
Regarding mAP@50, the YOLOv11 model yielded a value of 96.7%, which is 3.14% higher than the values yielded by YOLOv5n and YOLOv7t and nearly 1% higher than those yielded by YOLOv8n, YOLOv9t, and YOLOv10n. For the more challenging metric mAP@50:95, the YOLOv11 model achieved a value of 92.8%, matching YOLOv10n, while outperforming YOLOv8n and YOLOv9t by only 0.1%.
The YOLOv11 model achieved a balance between performance and computational complexity. It produced 1.2 G, 1.8 G, 1.3 G, and 2.0 G fewer FLOPs compared to YOLOv7t, YOLOv8n, YOLOv9t, and YOLOv10n, respectively, while using 2.1 G more FLOPs than YOLOv5n. In addition, its inference time was the fastest, at only 0.8 ms; YOLOv8n had the same inference time. Overall, these results demonstrate that the YOLOv11 model achieves superior accuracy while maintaining a good trade-off between computational cost and inference speed. Thus, we use the YOLOv11 model as the baseline, enhancing it with the modified IoU-based loss function and Soft-NMS.

4.2. Experiments with Different IoU Loss Functions with Incorporation of the Soft-NMS Mechanism

We use the YOLOv11 model as the baseline in the coming experiments. The experiments were conducted using different IoU-based loss functions. The detection results are presented in detail per contraband item category, as shown in Table 10 and Table 11 for mAP@50 and mAP@50:95, respectively. These represent both loose and strict localization thresholds. In both tables, bold and underlined values represent the highest-performing metrics, while underlined values indicate the second-highest. Note that we implement the Soft-NMS mechanism in the model.
The results demonstrate that the model incorporating WIoUv1 and Soft-NMS outperforms all other IoU loss functions. Specifically, it achieves mAP@50 improvements of 1.74%, 1.64%, 2.34%, and 1.64% over models using IoU, GIoU, DIoU, and CIoU, respectively. Furthermore, it records mAP@50:95 improvements of 4.16% over IoU, GIoU, and CIoU and 4.86% over DIoU. These findings clearly indicate that the WIoUv1 loss function, when combined with Soft-NMS, significantly enhances performance in item detection in cargo images.
According to the mAP@50 metric, most loss functions, including IoU, GIoU, DIoU, and CIoU, produce near-perfect results of 99.5% in almost all categories. However, these traditional loss functions perform poorly in the Nonhomogeneous category, with mAP values dropping as low as 61.7%, 69.3%, 69.7%, and 69.8% for DIoU, IoU, GIoU, and CIoU, respectively. In contrast, the WIoUv1 + Soft-NMS configuration demonstrates significantly improved detection in the Nonhomogeneous class, achieving a mAP of 90.42%, while maintaining parity across all other classes. However, the Automotives category recorded a slightly lower performance of 97.9%.
Under the mAP@50:95 evaluation metric, traditional loss functions such as IoU, GIoU, DIoU, and CIoU exhibit a notable drop in detection performance for challenging categories such as the Nonhomogeneous and Refrigerator categories. Specifically, these loss functions achieved detection accuracies of 91.7%, 92.0%, 90.1%, and 92.0%, respectively, for the Refrigerator category, while recording lower accuracies of 38.5%, 39.2%, 30.6%, and 39.2% for the Nonhomogeneous category. In contrast, the proposed WIoUv1, when combined with Soft-NMS, demonstrates superior performance in these categories, attaining 93.82% in refrigerator detection and a substantial improvement to 81.24% in detecting nonhomogeneous items. Likewise, traditional loss functions maintain strong performance in most other categories such as Liquid Tank, Metal Scrap, Drums, Unknowns, Food & Fruits, and E-waste, while the proposed method matches or slightly outperforms them. However, a slight decline in performance is observed for categories like Automotives/Automotive Parts and Suspicious Arrangement when using the proposed method compared to the IoU-based loss functions.
Furthermore, Figure 7 and Figure 8 show examples of detection by the proposed model for challenging categories, specifically Refrigerator and Nonhomogeneous, using test images. In the Refrigerator category, the proposed method with WIoUv1 + Soft-NMS achieves the highest detection accuracy, outperforming the others. Similar results are seen for the Food & Fruits category. Moreover, in the Nonhomogeneous category, one of the most difficult due to irregular structures, the proposed method also exhibits slightly higher-confidence detections than the configurations using IoU, GIoU, DIoU, and CIoU.
Figure 9 presents examples demonstrating that the proposed YOLO model maintains comparable detection performance on images rotated by approximately −10° and +10°. Rotating the images further is not feasible, as that would lead to missing or truncated image information. The overall results for detection on rotated images are shown in Table 12.
Figure 10 shows a comparison of the training curves for the modified YOLOv11 model incorporating WIoUv1 and Soft-NMS to the training curves for five other YOLO variants: YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv9t, and YOLOv10n. Each model was trained for 300 epochs, with key evaluation metrics and loss values captured during both training and validation phases.
The proposed YOLOv11 model consistently outperforms all other variants in terms of both mAP@50 and mAP@50:95. It demonstrates a faster convergence rate and achieves the highest final mAP scores. Notably, YOLOv11 (WIoUv1 + Soft-NMS) reaches nearly 1.0 in mAP@50 and exceeds 0.96 in mAP@50:95. Furthermore, the precision and recall curves for the proposed model remain consistently higher than those for the other YOLO variants across all epochs, with final values surpassing 0.97. These results highlight the effectiveness of the proposed YOLOv11 model enhanced with WIoUv1 and Soft-NMS for detection of contraband items in cargo X-ray imagery.

5. Discussion

Overall, the enhanced YOLOv11 model with WIoU and Soft-NMS achieved exceptionally high scores, with a mean precision of 98.24%, recall of 97.65%, F1-score of 97.94%, mAP@50 of 98.44%, and mAP@50:95 of 96.96% across 10 independent training runs using various random seeds, as shown in Table 13.
The consistently low standard deviations across all metrics reinforce the model’s stability and resistance to variance caused by training randomness or data splits. These findings indicate that the model is not only highly accurate but also stable and robust across multiple training iterations. Furthermore, the class-wise performance analysis shows that the model performs excellently across a wide range of contraband item categories, achieving nearly perfect detection scores of ≥99% in classes such as Liquid Tank, Metal Scrap, High Intensity, Drums, and E-waste under both the mAP@50 and mAP@50:95 metrics.
However, the Nonhomogeneous category posed a notable challenge, yielding lower detection scores, particularly under the more stringent mAP@50:95 metric. This suggests that detecting items with irregular layouts or highly overlapping boundaries remains difficult for the modified models, indicating a possible area for future enhancement in the use of more specialized augmentation techniques or architectural improvements. Likewise, the overall model training curves demonstrate the model’s learning efficiency. These key metric curves collectively indicate that the modified model converges efficiently and achieves high performance across all core metrics, validating the effectiveness of integrating WIoUv1 and Soft-NMS into the YOLOv11 architecture.
Moreover, the experiments were conducted using different IoU-based loss functions to evaluate the effectiveness of the modified model. It outperforms traditional IoU-based losses such as GIoU, DIoU, and CIoU across both mAP@50 and mAP@50:95, demonstrating improvements of up to 2.34% and 4.86%, respectively. This enhancement is especially significant for detecting irregularly shaped or spatially dispersed items, for which traditional loss functions tend to underperform. Notably, the proposed model significantly improves detection accuracy in challenging classes such as Refrigerator, Food & Fruits, and Nonhomogeneous, achieving small but meaningful improvements of 0.1% to 0.8%—highlighting its increased sensitivity to complex object structures, which is crucial for real-world cargo inspection.
The proposed method also demonstrates superiority in nearly all evaluation metrics when compared with the other YOLO variants. While it slightly trails YOLOv10n in terms of precision and inference speed, it achieves the best balance between detection accuracy and computational efficiency. It outperforms all other variants in F1-score, mAP@50, and mAP@50:95, while maintaining a relatively low FLOPs count and fast inference latency. The comparison of training curves further highlights the superior learning behavior, stability, and accuracy of the modified model. By integrating WIoUv1 and Soft-NMS, the model not only enhances detection performance but also improves training efficiency and generalization. It achieves higher final accuracy with smoother and faster convergence, validating the architectural and optimization advancements over other YOLO versions and making it the most robust among the evaluated variants.
In addition, the project findings strongly support the efficacy of the proposed model modifications. The integration of WIoUv1 and Soft-NMS improves the localization and classification capabilities of the model, particularly in complex and cluttered X-ray scenarios. These improvements contribute to better detection accuracy and real-time performance, which are essential for operational deployment in cargo inspection systems.

6. Conclusions and Recommendations

This project aims to develop contraband detection using deep learning and machine vision techniques. The primary motivation comes from the limitations of manual cargo screening at border control and customs checkpoints, where both speed and accuracy are paramount.
Throughout the study, it has been found that the enhanced YOLOv11 model outperforms other versions when the soft-NMS mechanism is implemented to reduce the number of missed detections in cases involving overlapping objects and densely packed cargo. Additionally, Wise Intersection over Union (WIoU) is implemented as a bounding-box regression loss function to improve the localization precision and robustness during training. The YOLO model comparison also includes DETR and Faster R-CNN. While the DETR model exhibits promising detection performance, it requires significantly higher FLOPs, resulting in longer inference time. This makes it less practical for real-time X-ray cargo inspection. Future work may explore transformer-inspired architectural improvements while maintaining the lightweight efficiency of YOLO models.
Furthermore, the enhanced YOLOv11 model was trained and evaluated on a private dataset of X-ray cargo images provided by Billion Prima Sdn Bhd (Malaysia). The proposed model was evaluated using performance metrics such as precision, recall, F1-score, mean average precision, floating-point operations (FLOPs), and inference speed (latency). The enhanced model showed improvements in precision and recall over the baseline, while maintaining real-time detection speed. Moreover, qualitative analysis of the detection results demonstrated the model’s enhanced ability to identify and localize objects that were either partially visible or heavily occluded, compared to the standard object-detection models.
There are several areas that can be considered in future work. The first involves expanding the dataset. The current model was trained on a private dataset from the company, which may not fully represent the diversity of real-world cargo scenarios. Expanding the dataset to include multi-source or open-access data can improve generalization, making the model more reliable against unseen object types.
Additionally, optimizing the model architecture is a potential approach to enhancing deployment efficiency, especially in resource-constrained environments. Techniques such as model pruning, quantization, and knowledge distillation could be implemented to reduce the FLOPs of the model while preserving detection accuracy [71]. These improvements would make it feasible to run the model at lower computational cost in real time without compromising performance. Moreover, integrating explainable AI (XAI) frameworks into the detection system could increase transparency and trustworthiness, particularly in high-stakes environments such as customs and national security. This would allow security personnel to understand why a certain object was flagged, facilitating better decision-making and faster human verification.
Another potential direction for development is the incorporation of multi-view X-ray cargo-image inspection. Such approaches could enhance detection accuracy in cases where objects are hidden behind other items or aligned unfavorably in single-view X-rays. Furthermore, exploring multi-modal fusion by combining X-ray imaging with additional data sources such as weight sensors could provide contextual cues to improve classification reliability. Finally, the system could be integrated with robotic inspection platforms capable of intelligent path planning and dynamic scanning, transforming the current passive detection model into an active, adaptive security system. Such integration would enable fully autonomous cargo-inspection pipelines capable of identifying threats with minimal human intervention. Overall, these future directions aim to make the system more robust, scalable, and operationally effective in real-world deployment scenarios.

Author Contributions

Conceptualization, M.I.S., K.S.C. and W.H.L.; methodology, J.H.T., M.I.S. and W.Y.T.; software, S.S.T. and Y.J.L.; validation, C.K.A. and C.L.G.; formal analysis, J.H.T., M.I.S. and W.Y.T.; investigation, J.H.T. and W.Y.T.; resources, K.S.C. and C.L.G.; data curation, C.L.G.; writing—original draft preparation, J.H.T. and M.I.S.; writing—review and editing, M.I.S., C.K.A. and W.H.L.; visualization, J.H.T.; supervision, M.I.S., Y.J.L. and W.H.L.; project administration, K.S.C.; funding acquisition, C.K.A., K.S.C., W.H.L. and S.S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the industry project grant IND-FETBE-2024-008-SST.

Data Availability Statement

Due to the proprietary nature of the data used in this study, access is restricted. Data may be made available by the authors upon request with an official letter to Billion Prima Sdn Bhd, subject to approval by the data center.

Acknowledgments

The authors gratefully acknowledge Billion Prima Sdn Bhd (Malaysia) for their support of the research and publication of this work.

Conflicts of Interest

Author Jun Hao Tee was affiliated with both UCSI University and Billion Prima Sdn Bhd. Y. J. Lee and C. L. Goh were employed by Billion Prima Sdn Bhd, Sensenet Sdn Bhd and Billion Prima Technologies Sdn Bhd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Michel, S.; Mendes, M.; de Ruiter, J.C.; Koomen, G.C.M.; Schwaninger, A. Increasing X-ray image interpretation competency of cargo security screeners. Int. J. Ind. Ergon. 2014, 44, 551–560. [Google Scholar] [CrossRef]
  2. Wu, J.; Xu, X.; Yang, J. Object Detection and X-Ray Security Imaging: A Survey. IEEE Access 2023, 11, 45416–45441. [Google Scholar] [CrossRef]
  3. Manakitsa, N.; Maraslidis, G.S.; Moysis, L.; Fragulis, G.F. A Review of Machine Learning and Deep Learning for Object Detection, Semantic Segmentation, and Human Action Recognition in Machine and Robotic Vision. Technologies 2024, 12, 15. [Google Scholar] [CrossRef]
  4. Li, W.; Solihin, M.I.; Nugroho, H.A. RCA: YOLOv8-Based Surface Defects Detection on the Inner Wall of Cylindrical High-Precision Parts. Arab. J. Sci. Eng. 2024, 49, 12771–12789. [Google Scholar] [CrossRef]
  5. Ragab, M.G.; Abdulkadir, S.J.; Muneer, A.; Alqushaibi, A.; Sumiea, E.H.; Qureshi, R.; Al-Selwi, S.M.; Alhussian, H. A Comprehensive Systematic Review of YOLO for Medical Object Detection (2018 to 2023). IEEE Access 2024, 12, 57815–57836. [Google Scholar] [CrossRef]
  6. Jing, B.; Duan, P.; Chen, L.; Du, Y. EM-YOLO: An X-ray Prohibited-Item-Detection Method Based on Edge and Material Information Fusion. Sensors 2023, 23, 16–19. [Google Scholar] [CrossRef]
  7. Dalal, S.; Lilhore, U.K.; Sharma, N.; Arora, S.; Simaiya, S.; Ayadi, M.; Almujally, N.A.; Ksibi, A. Improving smart home surveillance through YOLO model with transfer learning and quantization for enhanced accuracy and efficiency. PeerJ Comput. Sci. 2024, 10, e1939. [Google Scholar] [CrossRef]
  8. Wang, G.; Ding, H.; Duan, M.; Pu, Y.; Yang, Z.; Li, H. Fighting against terrorism: A real-time CCTV autonomous weapons detection based on improved YOLO v4. Digit. Signal Process. 2023, 132, 103790. [Google Scholar] [CrossRef]
  9. Zhao, Y.; Yang, D.; Cao, S.; Cai, B.; Maryamah, M.; Solihin, M.I. Object detection in smart indoor shopping using an enhanced YOLOv8n algorithm. IET Image Process. 2024, 18, 4745–4759. [Google Scholar] [CrossRef]
  10. Yang, D.; Solihin, M.I.; Zhao, Y.; Cai, B.; Chen, C.; Riyadi, S. A YOLO Benchmarking Experiment for Maritime Object Detection in Foggy Environments. In Proceedings of the 2024 IEEE 14th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 24–25 May 2024; pp. 354–359. [Google Scholar] [CrossRef]
  11. Vijayakumar, A.; Vairavasundaram, S. YOLO-based Object Detection Models: A Review and its Applications. Multimed. Tools Appl. 2024, 83, 83535–83574. [Google Scholar] [CrossRef]
  12. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  13. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  14. Hussain, M. YOLOv1 to v8: Unveiling Each Variant—A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  15. Flores-Calero, M.; Astudillo, C.A.; Guevara, D.; Maza, J.; Lita, B.S.; Defaz, B.; Ante, J.S.; Zabala-Blanco, D.; Moreno, J.M.A. Traffic Sign Detection and Recognition Using YOLO Object Detection Algorithm: A Systematic Review. Math 2024, 12, 297. [Google Scholar] [CrossRef]
  16. Vinh, T.Q.; Anh, N.T.N. Real-Time Face Mask Detector Using YOLOv3 Algorithm and Haar Cascade Classifier. In Proceedings of the 2020 International Conference on Advanced Computing and Applications (ACOMP), Quy Nhon, Vietnam, 25–27 November 2020; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2020; pp. 146–149. [Google Scholar] [CrossRef]
  17. Markappa, P.S.S.; O’Leary, C.; Lynch, C. A Review of YOLO Models for Soccer-Based Object Detection. In Proceedings of the 2024 Sixth International Conference on Intelligent Computing in Data Sciences (ICDS), Marrakech, Morocco, 23–24 October 2024; pp. 1–7. [Google Scholar] [CrossRef]
  18. Sohan, M.; Ram, T.S.; Reddy, C.V.R. A Review on YOLOv8 and Its Advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Springer: Singapore, 2024; pp. 529–545. [Google Scholar] [CrossRef]
  19. Sun, Y.; Wang, J.; Wang, H.; Zhang, S.; You, Y.; Yu, Z.; Peng, Y. Fused-IoU Loss: Efficient Learning for Accurate Bounding Box Regression. IEEE Access 2024, 12, 37363–37377. [Google Scholar] [CrossRef]
  20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  21. Ultralytics. Ultralytics YOLO11 Has Arrived! Redefine What’s Possible in AI! Available online: https://www.ultralytics.com/blog/ultralytics-yolo11-has-arrived-redefine-whats-possible-in-ai?utm_source=chatgpt.com (accessed on 18 December 2024).
  22. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 10691–10700. Available online: https://arxiv.org/abs/1905.11946v5 (accessed on 26 August 2025).
  23. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar] [CrossRef]
  24. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  26. Alif, M.A.R.; Hussain, M. YOLOv1 to YOLOv10: A Comprehensive Review of YOLO Variants and Their Application in the Agricultural Domain. 2024. Available online: http://arxiv.org/abs/2406.10139 (accessed on 26 August 2025).
  27. Hussain, M.; Khanam, R. In-Depth Review of YOLOv1 to YOLOv10 Variants for Enhanced Photovoltaic Defect Detection. Solar 2024, 4, 351–386. [Google Scholar] [CrossRef]
  28. Jegham, N.; Koh, C.Y.; Abdelatti, M.; Hendawi, A. Evaluating the Evolution of YOLO (You Only Look Once) Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors. 2024. Available online: http://arxiv.org/abs/2411.00201 (accessed on 26 August 2025).
  29. Ultralytics. YOLOv5–Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/models/yolov5/ (accessed on 1 January 2025).
  30. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. September 2022. Available online: https://arxiv.org/abs/2209.02976v1 (accessed on 26 August 2025).
  31. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  32. Ultralytics. YOLOv8–Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 1 January 2025).
  33. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. February 2024. Available online: https://arxiv.org/abs/2402.13616v2 (accessed on 1 January 2025).
  34. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. May 2024. Available online: https://arxiv.org/abs/2405.14458v2 (accessed on 1 January 2025).
  35. Ultralytics. YOLO11–NEW–Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 1 January 2025).
  36. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024. Available online: http://arxiv.org/abs/2410.17725 (accessed on 26 August 2025).
  37. Yang, D.; Solihin, M.I.; Zhao, Y.; Yao, B.; Chen, C.; Cai, B.; Machmudah, A. A review of intelligent ship marine object detection based on RGB camera. IET Image Process. 2023, 18, 281–297. [Google Scholar] [CrossRef]
  38. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; Available online: https://arxiv.org/abs/1611.06440v2 (accessed on 26 August 2025).
  39. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  40. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  41. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar] [CrossRef]
  42. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. January 2023. Available online: https://arxiv.org/abs/2301.10051v3 (accessed on 19 December 2024).
  43. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  44. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  45. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. Lect. Notes Comput. Sci. 2014, 8693, 740–755. [Google Scholar] [CrossRef]
  46. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
  47. Mbwslib. GitHub–DvXray: The First Large-Scale Dual-view X-Ray Baggage Dataset. Available online: https://github.com/Mbwslib/DvXray (accessed on 14 December 2024).
  48. Ma, B.; Jia, T.; Li, M.; Wu, S.; Wang, H.; Chen, D. Toward Dual-View X-Ray Baggage Inspection: A Large-Scale Benchmark and Adaptive Hierarchical Cross Refinement for Prohibited Item Discovery. IEEE Trans. Inf. Forensics Secur. 2024, 19, 3866–3878. [Google Scholar] [CrossRef]
  49. Han, L.; Ma, C.; Liu, Y.; Jia, J.; Sun, J. SC-YOLOv8: A Security Check Model for the Inspection of Prohibited Items in X-ray Images. Electronics 2023, 12, 4208. [Google Scholar] [CrossRef]
  50. GitHub–MACHUNHAI/LSIray. Available online: https://github.com/MACHUNHAI/LSIray (accessed on 23 December 2024).
  51. GitHub–GreysonPhoenix/CLCXray: Detecting Overlapping Objects in X-Ray Security Imagery by a Label-aware Mechanism. Available online: https://github.com/GreysonPhoenix/CLCXray (accessed on 14 December 2024).
  52. GitHub–HiXray-author/HiXray. Available online: https://github.com/HiXray-author/HiXray (accessed on 14 December 2024).
  53. GitHub–lutao2021/PIDray: PIDray: A Large-Scale X-Ray Benchmark for Real-World Prohibited Item Detection. Available online: https://github.com/lutao2021/PIDray (accessed on 14 December 2024).
  54. GitHub–LPAIS/Xray-PI: An X-Ray Image Dataset for Prohibited Item Segmentation. Available online: https://github.com/LPAIS/Xray-PI (accessed on 14 December 2024).
  55. GitHub–OPIXray-author/OPIXray. Available online: https://github.com/OPIXray-author/OPIXray (accessed on 14 December 2024).
  56. GitHub–MeioJane/SIXray: The SIXray Dataset. Available online: https://github.com/MeioJane/SIXray (accessed on 14 December 2024).
  57. Miao, C.; Xie, L.; Wan, F.; Su, C.; Liu, H.; Jiao, J.; Ye, Q. Sixray: A large-scale security inspection x-ray benchmark for prohibited item discovery in overlapping images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2114–2123. [Google Scholar] [CrossRef]
  58. COMPASS-XP. Available online: https://zenodo.org/records/2654887 (accessed on 14 December 2024).
  59. Kolte, S.; Bhowmik, N.; Dhiraj. Threat Object-based anomaly detection in X-ray images using GAN-based ensembles. Neural Comput. Appl. 2023, 35, 23025–23040. [Google Scholar] [CrossRef] [PubMed]
  60. GitHub–Computervision-Xray-Testing/GDXray: Dataset GDXray Used Along the Book Computer Vision for X-Ray Testing. Available online: https://github.com/computervision-xray-testing/GDXray (accessed on 14 December 2024).
  61. Mery, D.; Riffo, V.; Zscherpel, U.; Mondragón, G.; Lillo, I.; Zuccar, I.; Lobel, H.; Carrasco, M. GDXray: The Database of X-ray Images for Nondestructive Testing. J. Nondestruct. Eval. 2015, 34, 1–12. [Google Scholar] [CrossRef]
  62. OSF|MFA-Net: Object Detection for Complex X-Ray Cargo and Baggage Security Imagery. Available online: https://osf.io/7hd3v/ (accessed on 16 December 2024).
  63. Viriyasaranon, T.; Chae, S.H.; Choi, J.H. MFA-net: Object Detection for Complex X-Ray Cargo and Baggage Security Imagery. PLoS ONE 2022, 17, e0272961. [Google Scholar] [CrossRef]
  64. GitHub–IS2AI/cargoxray: It Is a Dataset of X-Ray Images of Cargo Transport. The Dataset Includes Images of Railcars and Trucks with Trailers. Available online: https://github.com/IS2AI/cargoxray (accessed on 15 December 2024).
  65. Cho, H.; Park, H.; Kim, I.J.; Cho, J. Data Augmentation of backscatter x-ray images for deep learning-based automatic cargo inspection. Sensors 2021, 21, 7294. [Google Scholar] [CrossRef]
  66. Jaccard, N.; Rogers, T.W.; Morton, E.J.; Griffin, L.D. Detection of concealed cars in complex cargo X-ray imagery using Deep Learning. J. Xray. Sci. Technol. 2017, 25, 323–339. [Google Scholar] [CrossRef]
  67. Kolokytha, S.; Flisch, A.; Lüthi, T.; Plamondon, M.; Visser, W.; Schwaninger, A.; Hardmeier, D.; Costin, M.; Vienne, C.; Sukowski, F.; et al. Creating a reference database of cargo inspection X-ray images using high energy radiographs of cargo mock-ups. Multimed. Tools Appl. 2018, 77, 9379–9391. [Google Scholar] [CrossRef]
  68. Wang, A.; Yuan, P.; Wu, H.; Iwahori, Y.; Liu, Y. Improved YOLOv8 for Dangerous Goods Detection in X-ray Security Images. Electronics 2024, 13, 3238. [Google Scholar] [CrossRef]
  69. Liu, Y.; Zhang, E.; Yu, X.; Wang, A. Efficient X-ray Security Images for Dangerous Goods Detection Based on Improved YOLOv7. Electronics 2024, 13, 1530. [Google Scholar] [CrossRef]
  70. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–Improving Object Detection with One Line of Code. Proc. IEEE Int. Conf. Comput. Vis. 2017, 2017, 5562–5570. [Google Scholar] [CrossRef]
  71. Yang, D.; Solihin, M.I.; Zhao, Y.; Cai, B.; Chen, C.; Wijaya, A.A.; Ang, C.K.; Lim, W.H. Model compression for real-time object detection using rigorous gradation pruning. iScience 2025, 28, 111618. [Google Scholar] [CrossRef]
Figure 1. Main parts of modern YOLO architecture (YOLOv5 onward).
Figure 2. The YOLOv11 architecture, consisting of the head, neck, and backbone modules.
Figure 3. Example of object-detection outcomes indicating TP (left), FP (middle), and FN (right) (with IoU threshold = 0.5; red bounding box = ground truth; yellow bounding box = prediction).
Figure 4. Sample of X-ray images from baggage (top) and cargo (bottom) inspection (red bounding box indicates ground truth).
Figure 5. The GUI of the LabelImg annotation tool on our cargo X-ray dataset.
Figure 6. (a) Histogram of sample distribution across categories; (b) Distribution of bounding-box size; (c) Spatial distribution of bounding boxes in images; (d) Distribution of aspect ratios of bounding boxes.
Figure 7. Detection result for the Refrigerator category.
Figure 8. Detection result for the Nonhomogeneous category.
Figure 9. Sample of detection results with the proposed YOLO model on normal (a) and rotated images, with (b) negative rotation and (c) positive rotation.
Figure 10. Comparison of training curves over different YOLO variants.
Table 1. Highlights of architectural innovations in YOLO from version v5 to v11.
Version | Released | Architectural Highlights
YOLOv5 [29] | June 2020
  • Uses CSPNet derived from ResNet.
  • Multiple SPP modules for feature extraction.
  • PAN neck for integrating semantic and spatial information.
YOLOv6 [30] | June 2022
  • EfficientRep backbone based on RepVGG.
  • FPN for multi-scale feature fusion.
YOLOv7 [31] | July 2022
  • E-ELAN for efficient feature aggregation.
YOLOv8 [32] | January 2023
  • Introduces semantic segmentation.
  • Cross-stage partial connections.
  • C2f module instead of traditional FPN.
  • Anchor-free detection with NMS.
YOLOv9 [33] | February 2024
  • PGI for optimized gradient flow.
  • GELAN backbone for lightweight high performance.
YOLOv10 [34] | May 2024
  • Eliminates NMS for faster inference.
  • Lightweight heads, decoupled downsampling, rank-guided blocks.
YOLOv11 [35,36] | September 2024
  • Supports segmentation and pose estimation.
  • Replaces C2f with C3k2 block.
  • Introduces C2PSA for enhanced spatial attention.
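All of the Ultralytics-maintained variants listed in Table 1 share the same Python training and evaluation interface [29,32,35]. The sketch below illustrates how such a version comparison could be scripted; the dataset file cargo_xray.yaml, the number of epochs, and the image size are placeholders rather than the exact configuration used in this study.

    # Illustrative sketch (not the exact experimental code): comparing Ultralytics
    # YOLO variants on a YOLO-format dataset described by a hypothetical
    # "cargo_xray.yaml" file (paths and class names).
    from ultralytics import YOLO

    for weights in ["yolov8n.pt", "yolo11n.pt"]:      # other variants follow the same pattern
        model = YOLO(weights)                         # load a pretrained checkpoint
        model.train(data="cargo_xray.yaml", epochs=100, imgsz=640)
        metrics = model.val()                         # evaluate on the validation split
        print(weights, metrics.box.map50, metrics.box.map)   # mAP@50 and mAP@50:95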
Table 2. Four standard public datasets for object-detection tasks.
Dataset | Subset | #Cat. | #Images | #Boxes | Boxes/Image
Pascal VOC | VOC07 | 20 | 5011 | 12,608 | 2.5
Pascal VOC | VOC08 | 20 | 4332 | 10,364 | 2.4
Pascal VOC | VOC09 | 20 | 7054 | 17,218 | 2.3
Pascal VOC | VOC10 | 20 | 10,103 | 23,374 | 2.4
Pascal VOC | VOC11 | 20 | 11,540 | 27,450 | 2.4
Pascal VOC | VOC12 | 20 | 11,540 | 27,450 | 2.4
ILSVRC | ILSVRC13 | 200 | 416,030 | 401,356 | 1.0
ILSVRC | ILSVRC14 | 200 | 476,668 | 534,309 | 1.1
ILSVRC | ILSVRC15 | 200 | 476,668 | 534,309 | 1.1
ILSVRC | ILSVRC16 | 200 | 476,668 | 534,309 | 1.1
ILSVRC | ILSVRC17 | 200 | 476,668 | 534,309 | 1.1
MS-COCO | MS-COCO15 | 80 | 123,287 | 896,782 | 7.3
MS-COCO | MS-COCO16 | 80 | 123,287 | 896,782 | 7.3
MS-COCO | MS-COCO17 | 80 | 123,287 | 896,782 | 7.3
MS-COCO | MS-COCO18 | 80 | 123,287 | 896,782 | 7.3
Open Images | OICOD18 | 500 | 1,743,042 | 12,195,144 | 7.0
Table 3. Common public datasets for X-ray baggage inspection.
Dataset | Year | #Cat. | #Positive | #Negative | Annotation | #Images
DvXray [47,48] | 2024 | 15 | 5496 | - | Bbox, mask | 32,000
LSIray [49,50] | 2023 | 21 | - | - | Bbox | 37,106
CLCXray [51] | 2022 | 12 | 9565 | 0 | Bbox | 9565
HiXray [52] | 2021 | 8 | 883 | 102,045 | Bbox | 45,364
PIDray [53] | 2021 | 12 | 124,486 | 0 | Bbox, mask | 124,486
Xray-PI [54] | 2020 | 12 | >2409 | 10,000 | Bbox, mask | 12,409
OPIXray [55] | 2020 | 5 | 8885 | 0 | Bbox | 8885
SIXray [56,57] | 2019 | 6 | 8929 | 1,050,302 | Bbox | 1,059,231
Compass-XP [58,59] | 2019 | 366 | 1928 | 0 | Bbox | 1928
GDXray [60,61] | 2015 | 5 | 19,407 | 0 | Bbox | 19,407
Table 4. Common datasets and our dataset for X-ray cargo inspection.
Dataset | Year | #Cat. | Public? | Synthetic? | Annotation | #Images
CargoX [62,63] | 2022 | 4 knife types | Y | Y | Bbox, mask | 64,000
IS2AI [64] | 2022 | 7 types of goods | Y | N | Bbox | -
BSX-car [65] | 2021 | car types | N | N | Bbox, mask | 1776
SoC cargo [66] | 2017 | 5 car categories | N | N | Bbox | 79
ACXIS [67] | 2016 | 1 (cigarettes) | N | N | None | 38,331
This paper | 2024 | 11 item categories | N | Y | Bbox | 2598
Table 6. Distribution of category/class labels in our dataset.
Classes | Total Images
Automotives/Automotive Parts | 74
Drums | 454
E-waste | 72
Food/Fruits | 337
High Intensity | 426
Liquid Tank | 166
Metal Scrap | 70
Nonhomogeneous | 210
Refrigerator | 337
Suspicious Arrangement | 238
Unknowns | 214
Total | 2598
Table 7. Key features of IoU-based loss functions used in bounding-box regression.
Loss Function | Key Feature
IoU | Basic overlap metric
GIoU | Penalizes nonoverlapping boxes
DIoU | Adds center-distance penalty
CIoU | Adds aspect-ratio similarity
WIoU | Distance-aware attention term
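For reference, the losses summarized in Table 7 can be written as follows, following the formulations in [39,40,41,42], where A and B are the predicted and ground-truth boxes, C is their smallest enclosing box, b and b^gt are the box centers, ρ is the Euclidean distance, c is the diagonal length of C, and W_g and H_g are the width and height of C:

\[
\begin{aligned}
\mathcal{L}_{\mathrm{IoU}}  &= 1 - \mathrm{IoU}, \\
\mathcal{L}_{\mathrm{GIoU}} &= 1 - \mathrm{IoU} + \frac{|C \setminus (A \cup B)|}{|C|}, \\
\mathcal{L}_{\mathrm{DIoU}} &= 1 - \mathrm{IoU} + \frac{\rho^{2}(\mathbf{b},\mathbf{b}^{gt})}{c^{2}}, \\
\mathcal{L}_{\mathrm{CIoU}} &= 1 - \mathrm{IoU} + \frac{\rho^{2}(\mathbf{b},\mathbf{b}^{gt})}{c^{2}} + \alpha v,
\quad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\quad \alpha = \frac{v}{(1-\mathrm{IoU}) + v}, \\
\mathcal{L}_{\mathrm{WIoUv1}} &= \mathcal{R}_{\mathrm{WIoU}}\,(1 - \mathrm{IoU}),
\quad \mathcal{R}_{\mathrm{WIoU}} = \exp\!\left(\frac{\rho^{2}(\mathbf{b},\mathbf{b}^{gt})}{(W_{g}^{2}+H_{g}^{2})^{*}}\right),
\end{aligned}
\]

where the superscript * indicates that the enclosing-box term is detached from the gradient computation, which gives WIoU its distance-aware focusing behavior [42].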
Table 8. Experimental results from our cargo dataset.
Model | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 | mAP@50:95 | FLOPs (G) | Inference (ms)
Faster R-CNN | 71.5 | 81.2 | 76.0 | 71.5 | 68.4 | 36 | 72
DETR | 96.4 | 92.0 | 94.2 | 96.4 | 92.6 | 28 | 320
YOLOv5n | 93.9 | 93.2 | 93.5 | 93.3 | 89.0 | 4.2 | 6.5
YOLOv7t | 93.7 | 93.2 | 93.4 | 93.3 | 89.5 | 7.5 | 84
YOLOv8n | 97.0 | 94.2 | 95.6 | 95.8 | 92.7 | 8.1 | 0.8
YOLOv9t | 97.9 | 94.2 | 96.0 | 96.0 | 92.7 | 7.6 | 2.0
YOLOv10n | 98.6 | 93.2 | 95.8 | 96.2 | 92.8 | 8.3 | 1.0
YOLOv11n | 95.9 | 96.0 | 95.9 | 96.7 | 92.8 | 6.3 | 0.8
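The F1-scores in Tables 8 and 9 are the harmonic mean of precision (P) and recall (R); for instance, substituting the YOLOv11n values from Table 8 gives

\[
\mathrm{F1} = \frac{2PR}{P+R} = \frac{2 \times 95.9 \times 96.0}{95.9 + 96.0} \approx 95.9\%,
\]

which matches the tabulated value.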
Table 9. Experimental results from the PIDray baggage dataset.
Model | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 | mAP@50:95 | FLOPs (G) | Inference (ms)
Faster R-CNN | 58.4 | 65.9 | 61.9 | 58.8 | 58.3 | 36 | 71
DETR | 96.8 | 85.0 | 90.5 | 95.7 | 82.2 | 28 | 500
YOLOv5n | 95.2 | 89.5 | 92.3 | 94.6 | 81.2 | 4.2 | 0.7
YOLOv7t | 94.9 | 86.0 | 90.2 | 93.1 | 73.2 | 7.5 | 55
YOLOv8n | 95.8 | 88.1 | 91.8 | 94.1 | 81.4 | 8.1 | 0.9
YOLOv9t | 95.8 | 90.7 | 93.2 | 95.6 | 83.3 | 7.6 | 0.7
YOLOv10n | 94.8 | 88.9 | 91.8 | 94.4 | 81.7 | 8.3 | 0.9
YOLOv11n | 96.5 | 88.4 | 92.3 | 95.0 | 82.0 | 6.3 | 0.9
Table 10. Performance comparison of IoU loss functions for all categories of contraband items with mAP@50 (%).
IoU Loss Function/Category | Automotives | Liquid Tank | Refrigerator | Nonhomogeneous | Metal Scrap | High Intensity | Suspicious Arr. | Drums | Unknowns | Food & Fruits | E-Waste | Mean
IoU | 99.5 | 99.5 | 99.5 | 69.3 | 99.5 | 99.5 | 99.5 | 99.5 | 99.2 | 99.5 | 99.5 | 96.7
GIoU | 99.5 | 99.5 | 99.5 | 69.7 | 99.5 | 99.5 | 99.5 | 99.5 | 99.2 | 99.5 | 99.5 | 96.8
DIoU | 99.5 | 99.5 | 99.5 | 61.7 | 99.5 | 99.5 | 99.5 | 99.5 | 99.4 | 99.5 | 99.5 | 96.1
CIoU | 99.5 | 99.5 | 99.5 | 69.8 | 99.5 | 99.5 | 99.4 | 99.5 | 99.2 | 99.5 | 99.5 | 96.8
WIoUv1 | 97.9 | 99.5 | 99.0 | 90.4 | 99.5 | 99.5 | 99.5 | 99.5 | 99.5 | 99.1 | 99.5 | 98.4
Table 11. Performance comparison of IoU loss functions for all categories of contraband items with mAP@50:95 (%).
IoU Loss Function/Category | Automotives | Liquid Tank | Refrigerator | Nonhomogeneous | Metal Scrap | High Intensity | Suspicious Arr. | Drums | Unknowns | Food & Fruits | E-Waste | Mean
IoU | 99.5 | 99.3 | 91.7 | 38.5 | 99.5 | 99.5 | 98.4 | 99.5 | 96.9 | 99.2 | 98.8 | 92.8
GIoU | 98.5 | 99.3 | 92.0 | 39.2 | 99.5 | 99.5 | 99.3 | 99.5 | 98.3 | 98.9 | 97.0 | 92.8
DIoU | 98.6 | 99.5 | 90.1 | 30.6 | 99.5 | 99.5 | 98.1 | 99.5 | 99.2 | 99.2 | 99.5 | 92.1
CIoU | 98.5 | 99.3 | 92.0 | 39.2 | 99.5 | 99.5 | 99.3 | 99.5 | 98.3 | 98.9 | 96.9 | 92.8
WIoUv1 | 97.6 | 99.5 | 93.8 | 81.2 | 99.5 | 99.5 | 98.4 | 99.5 | 99.4 | 98.9 | 99.4 | 97.0
Table 12. YOLOv11n detection performance on rotated cargo images.
Rotation | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 (%) | mAP@50:95 (%)
0° | 95.9 | 96.0 | 95.9 | 96.7 | 92.8
+5° | 95.4 | 95.6 | 95.5 | 96.3 | 92.3
−5° | 95.2 | 95.5 | 95.3 | 96.1 | 92.1
+10° | 94.6 | 95.0 | 94.8 | 95.6 | 91.5
−10° | 94.4 | 94.8 | 94.6 | 95.4 | 91.3
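The rotated test sets behind Table 12 can be produced with a standard affine rotation of each scan; the sketch below (OpenCV, with a hypothetical file name and a plain border fill) shows one way to generate the ±5° and ±10° copies. The ground-truth boxes would need to be transformed consistently before evaluation; the exact preprocessing used in this study may differ.

    # Illustrative sketch: generating rotated copies of a cargo X-ray scan for a
    # rotation-robustness check. "sample_scan.png" is a hypothetical file name.
    import cv2

    def rotate_image(path, angle_deg):
        img = cv2.imread(path)
        h, w = img.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)   # rotate about the image center
        return cv2.warpAffine(img, m, (w, h), borderValue=(255, 255, 255))  # fill the exposed border

    for angle in (5, -5, 10, -10):
        cv2.imwrite(f"sample_scan_rot{angle:+d}.png", rotate_image("sample_scan.png", angle))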
Table 13. Performance statistics for the proposed YOLOv11 model over 10 runs. Metrics are reported as mean ± standard deviation.
Model | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 | mAP@50:95
YOLOv11 | 95.90 ± 0.013 | 96.00 ± 0.010 | 95.95 ± 0.008 | 96.70 ± 0.004 | 92.80 ± 0.005
Enhanced YOLOv11 (with WIoU + Soft-NMS) | 98.24 ± 0.014 | 97.65 ± 0.008 | 97.94 ± 0.009 | 98.44 ± 0.004 | 96.96 ± 0.006
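The enhanced model in Table 13 replaces hard NMS with Soft-NMS [70] in post-processing. The sketch below is an illustrative NumPy implementation of the Gaussian score-decay rule, not the exact code used in this work: instead of discarding every box whose overlap with the current best detection exceeds a threshold, overlapping boxes only have their confidence attenuated, which preserves recall for tightly packed cargo items.

    # Illustrative Gaussian Soft-NMS sketch (per-class boxes assumed), following
    # the score-decay rule of Bodla et al. [70].
    import numpy as np

    def iou(box, boxes):
        """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (box[2] - box[0]) * (box[3] - box[1])
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_a + area_b - inter + 1e-9)

    def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
        """Return indices of kept boxes; overlapping boxes are down-weighted, not removed."""
        scores = scores.copy()
        keep, idxs = [], np.arange(len(scores))
        while idxs.size > 0:
            best = idxs[np.argmax(scores[idxs])]                 # highest-scoring remaining box
            keep.append(int(best))
            idxs = idxs[idxs != best]
            if idxs.size == 0:
                break
            overlaps = iou(boxes[best], boxes[idxs])
            scores[idxs] *= np.exp(-(overlaps ** 2) / sigma)     # Gaussian confidence decay
            idxs = idxs[scores[idxs] > score_thresh]             # prune near-zero detections
        return keep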
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
