Foreign Object Detection Model for Retail Cabinets Under Complex Backgrounds

Zhou, Zhenshuo; Xie, Kai; Zhang, Wei; He, Jianbiao

doi:10.3390/electronics15132920

¹ School of Electronic Information and Electrical Engineering, Yangtze University, Jingzhou 434023, China

² School of Electronic Information, Central South University, Changsha 410004, China

³ School of Computer Science, Central South University, Changsha 410083, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(13), 2920; https://doi.org/10.3390/electronics15132920

Submission received: 29 April 2026 / Revised: 26 June 2026 / Accepted: 30 June 2026 / Published: 3 July 2026

(This article belongs to the Special Issue Intelligent Sensing Empowered by Artificial Intelligence)

Abstract

With the rapid expansion of the unmanned retail ecosystem, real-time foreign object detection (FOD) in smart vending cabinets has become a critical technology for ensuring equipment safety and protecting user rights. However, existing models often face bottlenecks in accuracy when dealing with small targets and occlusion scenarios, and struggle to balance accuracy with speed on edge devices. To address these challenges, this paper proposes an improved model specifically designed for foreign object detection based on the YOLOv11n framework, named YOLOv11n-FOD (foreign object detection). In terms of algorithm design, this paper reconstructs the feature extraction and fusion paradigm. Specifically, the original C3K2 module in the backbone network is replaced with a C3K2-SAC (Spatial Attention Convolution) module incorporating an attention mechanism, which enhances global context modeling capabilities. Subsequently, the CARAFE (Content-Aware ReAssembly of Features) operator is introduced to replace traditional interpolation, significantly improving sensitivity to small targets and textural details. Furthermore, the CBAM (Convolutional Block Attention Module) is integrated into the downsampling stage to suppress background noise while reducing computational redundancy. Notably, these improvements maintain an extremely lightweight architecture, increasing computational overhead by only 0.3 GFLOPs. Experimental results demonstrate that the proposed YOLOv11n-FOD achieves significant performance gains: mAP@50 is increased by 0.4%, and mAP@50-95 is improved by 1.0%. Extensive experiments on the SKU-110K dataset further verify the superior performance of the proposed model. In conclusion, this study effectively balances detection accuracy, model complexity, and inference speed, providing an efficient solution for foreign object detection in smart retail cabinets.

Keywords: unmanned retail; foreign object detection; dynamic upsampling and downsampling; low-light image enhancement

1. Introduction

The rapid popularization of unattended intelligent retail has driven the extensive deployment of self-service intelligent cabinets in residential communities and shopping malls. After long-term continuous operation, various scattered foreign contaminants—such as plastic fragments, leftover packaging debris and misplaced sundries—tend to accumulate inside the cabinets. These foreign objects frequently jam the cargo pushing mechanisms, trigger frequent equipment shutdowns, and bring persistent economic losses to cabinet operators. Against this backdrop, vision-based foreign object detection (FOD) has evolved into an essential core technology to guarantee stable cabinet operation, attracting rising research interest from both industry and academia.

Traditional periodic manual inspection fails to support round-the-clock online real-time monitoring, while also incurring high labor costs and risks of missed detection. Driven by the rapid advancement of deep learning, CNN-based object detection algorithms have gradually replaced manual patrols and become mainstream technical solutions for FOD tasks. Nevertheless, existing detection approaches still have obvious shortcomings when applied to real-world intelligent retail cabinet scenarios, limiting their large-scale industrial rollout. Worse still, this specific research direction remains largely underexplored in current academic circles.

Unlike mature studies focusing on industrial surface defect detection, supermarket shelf commodity recognition and open-area foreign matter inspection, visual foreign object detection targeting enclosed intelligent retail cabinets is still in its preliminary stage. Few studies take into account the unique traits of cabinet inspection environments: closed dark inner cavities, tiny miscellaneous foreign objects, occlusion interference from stocked goods, and time-varying mirror reflection noise. Furthermore, there is a severe shortage of public benchmark datasets and customized detection frameworks tailored for cabinet FOD missions. Generic detection models cannot adapt to the distinctive feature distribution of such scenarios, thus failing to satisfy practical industrial deployment requirements.

In short, the lack of dedicated algorithm optimization, open datasets, and systematic experimental validation severely impedes intelligent maintenance and automatic early fault warning for unattended retail terminals. The present study therefore carries both academic value and practical engineering significance. It fills the research gap in tiny foreign object detection within confined intelligent retail cabinets, addresses the shortage of lightweight detection algorithms customized for such scenarios, and delivers an edge-deployable technical scheme for the autonomous operation and health monitoring of unmanned retail cabinets. This work should further facilitate the engineering implementation and iterative upgrading of fully unattended intelligent retail systems.

Two groups of obstacles stand out. The first concerns imaging conditions inside the cabinet. Complex and variable illumination, including local specular reflection, uneven indoor ambient light, and the extremely dark enclosed cavity, severely distorts the features of tiny foreign targets, and most existing detection networks lack targeted low-light feature optimization, so accuracy declines sharply in dim environments. In addition, foreign objects are mostly tiny and often partially shielded by stacked commodities; standard feature extraction modules fail to capture the fine-grained multi-scale features of occluded small targets efficiently, resulting in frequent missed detections. The second concerns deployment. State-of-the-art detectors such as the YOLO series and DETR have inherent drawbacks on embedded edge hardware. On the one hand, conventional backbone and upsampling modules generate redundant feature information and unnecessary computational overhead, which conflicts with the strict computing and memory constraints of low-power embedded chips in cabinets; on the other hand, most existing lightweight strategies simply cut network parameters at the cost of an obvious loss of precision, failing to balance detection accuracy and inference speed. Finally, most published FOD algorithms are designed for open-space industrial defect inspection and lack optimization for the confined, variably lit, and easily occluded conditions of cabinets.

To tackle the above bottlenecks, this work proposes an improved lightweight YOLOv11n detection network dedicated to foreign object inspection in intelligent retail cabinets. The framework introduces three targeted optimizations: first, the original feature extraction unit is upgraded to a C3K2-enhanced block to strengthen multi-scale small-target feature aggregation; second, CARAFE adaptive upsampling replaces fixed bilinear interpolation to recover the detailed spatial features of tiny foreign objects; third, CBAM + Conv hybrid attention is embedded in the downsampling branches to suppress background interference and enhance target feature extraction under irregular illumination. Ablation and comparative experiments on our self-built CFOD cabinet foreign object dataset verify that the proposed method achieves high detection precision while maintaining low computational cost, satisfying the real-time deployment requirements of cabinet edge devices.

The remainder of this paper is organized as follows: Section 2 reviews related work on lightweight object detection and industrial foreign object inspection. Section 3 elaborates the structural optimization details of our improved YOLOv11n. Section 4 presents the experimental setup, comparison results, and detailed analysis. Finally, Section 5 summarizes this paper and outlines future research directions.

2. Related Work

2.1. Deep Learning-Based Foreign Object Detection

Foreign object detection (FOD) plays a vital role in intelligent visual inspection systems, aiming to accurately identify anomalous or unwanted objects under complex and dynamic environments. Current research efforts are primarily concentrated in six representative domains: food safety, railway transportation, power line inspection, industrial conveyor belts, aviation security, and unmanned retail. Methodological advancements generally follow four directions: YOLO series lightweight optimization, attention mechanism integration, downsampling structure redesign, and cross-modal feature enhancement.

In the food industry, Wang et al. [1] proposed an enhanced YOLOv8 variant for detecting micro-impurities in Pu-erh sun-dried green tea, significantly improving fine-grained feature extraction for small foreign objects. Khan et al. [2] combined near-infrared hyperspectral imaging with compressed deep models and hardware-aware acceleration, enabling non-destructive FOD in poultry products with high edge efficiency. For industrial conveyors, Li et al. [3] integrated ESRGAN-based super-resolution with an improved YOLOv11 to alleviate motion blur and occlusion caused by belt wear and dust accumulation.

Railway and power line inspection remain the most intensively studied FOD scenarios. To address data scarcity in catenary monitoring, Chen et al. [4] employed generative data augmentation and lightweight perception heads to reduce missed detections. Gu et al. [5] proposed ATW-YOLO, reconstructing the downsampling pipeline and embedding a customized attention module to accommodate irregular and multi-scale railway foreign objects. Bin et al. [6] designed CI-YOLO, a parameter-efficient model satisfying real-time constraints on embedded edge devices. Departing from conventional CNNs, Dong et al. [7] introduced a lightweight transformer-based detector that adapts well to illumination variations and cluttered backgrounds.

Aviation and unmanned retail scenarios have also witnessed rapid progress. Mushtaq et al. [8] adopted multi-model fusion to achieve high-precision FOD at airports, enhancing infrastructure safety. In unmanned retail, research is divided into commodity recognition and cabinet anomaly detection. Luo et al. [9] proposed a day–night cross-modal network for robust commodity recognition under low-light conditions. Hou et al. [10] introduced BCSM-YOLO for packaged product detection, while Patel et al. [11] fused visual and OCR features to improve retail recognition accuracy. Chadha [12] systematically reviewed vision-based retail recognition frameworks. Beyond recognition, Agranata et al. [13] combined pose estimation and object detection to analyze customer behavior and spatial distribution.

Despite these advances, most existing FOD approaches target single, fixed scenarios. There remains a notable lack of detection frameworks specifically adapted to retail cabinets, where extreme target scale variation, low-light conditions, reflective surfaces, and sample imbalance coexist simultaneously. Unlike open scenarios, the confined retail cabinet presents unique challenges: foreign objects are often tiny fragments occluded by stacked commodities, and strong reflections and uneven illumination further degrade feature visibility. Current commodity recognition methods focus primarily on large-scale packaged products and cannot distinguish small debris from background clutter. Meanwhile, edge-deployed lightweight detectors designed for industrial inspection typically sacrifice fine-grained feature representation to reduce computational cost, leading to frequent missed and false detections for cabinet-specific FOD tasks. This underscores the urgent need for tailored algorithms that simultaneously balance accuracy, efficiency, and robustness for retail cabinet environments.

2.2. Low-Light Image Enhancement

Low-light imaging inherently suffers from low visibility, severe noise, color distortion, and structural detail loss, all of which degrade downstream detection performance. Recent deep learning-based enhancement methods can be categorized into five groups: custom color space design, normalizing flow models, transformer-based architectures, task-specific restoration networks, and systematic reviews.

From an algorithmic perspective, Yan et al. [14] proposed the HVI color space to redistribute pixel intensities under dim lighting, achieving fundamental image quality improvement. Xu et al. [15] introduced UPT-Flow, a multi-scale transformer-guided normalizing flow model that effectively balances noise suppression and texture preservation. Feijoo et al. [16] developed DarkIR, a restoration network explicitly optimized for color shift and grain noise in dark environments, demonstrating strong generalization across public benchmarks.

Dataset construction and benchmarking for low-light image enhancement have witnessed continuous progress. Ciubotariu et al. [17] constructed a unified low-light enhancement benchmark under the NTIRE 2026 framework and standardized corresponding evaluation protocols to support fair experimental comparisons. In addition, comprehensive review works conducted by Zhao et al. [18] and Liu and Fan [19] systematically summarize the mainstream architectural trends, loss function designs, public datasets, and practical applications in this research field. These surveys also point out several persistent and challenging issues, including irreversible detail loss in extremely dark scenarios, undesired artifact introduction during image enhancement, and the poor compatibility between low-light enhancement models and downstream high-level vision tasks.

Crucially, most existing low-light enhancement algorithms are designed for generic natural scenes. Few methods consider the unique constraints of retail cabinets—narrow enclosed spaces, uneven illumination, strong reflections, and glossy surfaces. Direct application of general enhancement techniques often leads to over-exposure and detail erosion, ultimately degrading cabinet foreign object detection accuracy. In retail cabinets, local specular reflections on commodity packaging can easily be amplified as high-intensity noise during enhancement, while small debris details are often suppressed or washed out by over-aggressive brightness adjustment. Moreover, existing enhancement methods typically treat the image as a whole, failing to prioritize the recovery of small foreign object features, which are exactly the critical targets in our task. This mismatch between generic enhancement and cabinet-specific detection requirements further justifies the need for our task-aware optimization strategy, which integrates feature enhancement directly into the detection pipeline rather than relying on a separate preprocessing step.

The main contributions of this paper are summarized as follows:

(1): A lightweight improved YOLOv11n-FOD (foreign object detection) model is proposed for retail cabinet foreign object detection, achieving an optimal balance between high detection accuracy and real-time inference efficiency.
(2): A dynamic up-and-down sampling module integrated with the CBAM is introduced, significantly enhancing multi-scale feature extraction and improving detection robustness under complex background interference.
(3): By incorporating advanced low-light image enhancement techniques, the proposed framework demonstrates superior detection robustness compared to state-of-the-art detectors in dimly illuminated retail scenarios.

3. Materials and Methods

3.1. Overall Algorithm Framework

As shown in Figure 1, this paper proposes a multi-scale foreign object detection model for intelligent retail cabinets, built upon the YOLO framework. Targeted improvements are introduced to address three core challenges in the retail cabinet scenario: extreme scale imbalance of foreign objects, severe occlusion and reflective noise interference, and poor generalization to complex lighting conditions. First, the standard C3K2 bottleneck in the backbone is replaced with the proposed C3K2-SAC module. This redesigned structure enhances shallow feature extraction capabilities by optimizing residual connections and channel-wise feature recalibration, strengthening the model’s ability to capture fine-grained features of tiny foreign objects such as fragments and small debris. This directly alleviates the common problem of small object missed detection in retail cabinet scenarios. Second, the default nn.Upsample operation in the neck is replaced with the CARAFE (Content-Aware ReAssembly of FEatures) module, and a 1/64 downsampling branch is added during the feature fusion stage. The content-aware upsampling operator preserves more contextual details during cross-scale feature fusion, while the additional downsampling path enables the network to better perceive and learn multi-scale feature representations. These modifications together improve the model’s robustness against partial occlusion and enhance its generalization to objects of varying sizes. Third, a CBAM + CONV attention-enhanced detection head is introduced. The combined channel and spatial attention mechanism suppresses background noise caused by strong reflections and uneven lighting in retail cabinets, allowing the model to focus on discriminative features of occluded or partially visible foreign objects. This refinement significantly boosts detection accuracy and localization precision under challenging conditions. 
Consequently, these complementary modifications effectively improve the overall performance of the proposed model, enabling it to handle various foreign object detection tasks robustly in practical retail cabinet scenarios.

3.2. C3K2-SAC

The structural analysis of the C3k2 module in YOLOv11 indicates that its original design lacks explicit attention mechanisms. To address this, this paper introduces the PSAModule, whose core principle follows the self-attention paradigm of transformers: constructing Query (Q) and Key (K) to compute an attention matrix, which is then fused with Value (V) to generate enhanced feature representations. This process effectively suppresses background interference and enables the model to focus on discriminative target regions.

Further analysis reveals that the bottleneck structure within C3k2, despite enabling stacked feature extraction, relies on single-scale convolutional kernels. In scenarios characterized by strong occlusion, multi-scale variations, and complex textures (e.g., retail cabinets), this limitation often leads to insufficient feature modeling. To mitigate this, dilated convolution is introduced. By adjusting the dilation rate d, the receptive field is expanded without significantly increasing the parameter count.

The relationship between the dilation rate and the padding required to preserve the spatial size is formulated in Equation (1). When d = 1, the operation degenerates into a standard convolution; when d \in \{1, 2, 3\}, multi-scale receptive fields are obtained. Furthermore, the spatial dimension transformation of the output feature map is derived in Equation (2). This study employs Depthwise Separable Convolution (DWConv) as the feature extraction operator. Inheriting from the parent Conv class, its padding parameter p is dynamically calculated to maintain consistent spatial dimensions between input and output. This design preserves scale consistency, ensuring feature alignment across different branches and stabilizing multi-scale fusion.

Through the derivation of Equation (2), it can be observed that the spatial resolution of the output feature map remains constant provided that appropriate stride and padding strategies are applied. This design is particularly crucial for multi-scale feature fusion, as it guarantees feature alignment between different branches, thereby improving fusion effectiveness and training stability.

p = \frac{(k - 1) \times d}{2}

(1)

o = ⌊\frac{i + 2 p - k - (k - 1) (d - 1)}{s}⌋ + 1

(2)

Parameter Analysis:

In Equations (1) and (2), o represents the spatial size of the output feature map, i is the input size, k denotes the kernel size, d is the dilation rate, s is the stride, and p is the padding size. Equation (1) is designed to calculate the padding required to maintain dimensional consistency when the dilation rate varies. Equation (2) is the standard formulation for calculating convolutional output size, where ⌊ \cdot ⌋ denotes the floor operation.

Proof. Assuming the stride is set to s = 1, substituting Equation (1) into Equation (2) yields

\begin{aligned} o &= ⌊\frac{i + 2 \cdot \frac{(k - 1) d}{2} - k - (k - 1) (d - 1)}{1}⌋ + 1 \\ &= ⌊i + (k - 1) d - k - (k - 1) d + (k - 1)⌋ + 1 \\ &= ⌊i - 1⌋ + 1 \\ &= i \end{aligned}

Thus, it is mathematically proven that under the condition of s = 1, the proposed dynamic padding strategy strictly guarantees that the output feature map size o remains identical to the input size i. □
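
The result of the proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names `same_padding` and `conv_output_size` are hypothetical, and the sketch assumes that (k - 1) d is even so that the padding of Equation (1) is an integer.

```python
def same_padding(k: int, d: int) -> int:
    """Equation (1): padding that preserves the spatial size when s = 1."""
    span = (k - 1) * d
    if span % 2:
        # Assumption of this sketch: symmetric integer padding only.
        raise ValueError("(k - 1) * d must be even for symmetric padding")
    return span // 2


def conv_output_size(i: int, k: int, d: int, s: int, p: int) -> int:
    """Equation (2): output size of a dilated convolution (floor division)."""
    return (i + 2 * p - k - (k - 1) * (d - 1)) // s + 1


# With s = 1 and the padding of Equation (1), the size is unchanged
# for every dilation rate considered in the text.
for d in (1, 2, 3):
    p = same_padding(3, d)
    print(f"k=3, d={d}: p={p}, 64 -> {conv_output_size(64, 3, d, 1, p)}")
```

For strides other than 1 the same formula applies, but the output is no longer equal to the input size, which is why the proof is stated for s = 1 only.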

As illustrated in Figure 2, to enhance the network’s ability to model multi-scale targets and complex background information, dilated convolution is introduced in the feature extraction stage. Compared with standard convolution, dilated convolution inserts gaps between kernel elements, thereby expanding the feature sampling interval. Its mathematical expression is given by

Y (i, j) = \sum_{m = 1}^{K} \sum_{n = 1}^{K} W (m, n) \cdot X (i + r \cdot m, j + r \cdot n)

(3)

In Equation (3), X denotes the input feature map, W the K \times K convolution kernel, and Y the output feature map; (i, j) are the spatial coordinates of the output, and m and n index the kernel elements. The parameter r denotes the dilation rate (written d in Equations (1) and (2)); when r = 1, the equation degenerates into standard convolution. For a convolutional kernel of size K \times K, its effective receptive field is formulated as

ERF = (K - 1) \cdot r + 1 .

(4)

Consequently, without increasing the number of parameters, dilated convolution significantly enlarges the receptive field, strengthening the joint modeling capability for local details and global context. In object detection tasks, this mechanism improves the representation of small, occluded, and scale-varying targets, thereby enhancing detection accuracy and recall.
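
Equations (3) and (4) can be illustrated with a small pure-Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, no padding is applied ("valid" convolution), and kernel indices are 0-based, which shifts the sampling window by a constant offset relative to the 1-based sums in Equation (3).

```python
def effective_kernel_size(k: int, r: int) -> int:
    """Equation (4): span covered by a K x K kernel with dilation rate r."""
    return (k - 1) * r + 1


def dilated_conv2d(x, w, r=1):
    """Unpadded dilated 2D convolution on nested lists (Equation (3))."""
    k = len(w)
    span = effective_kernel_size(k, r)
    out_h = len(x) - span + 1
    out_w = len(x[0]) - span + 1
    y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for m in range(k):
                for n in range(k):
                    # Neighbouring kernel taps are r pixels apart.
                    acc += w[m][n] * x[i + r * m][j + r * n]
            y[i][j] = acc
    return y


# A 3 x 3 kernel always has 9 weights, but its span grows with r.
for r in (1, 2, 3):
    print(r, effective_kernel_size(3, r))
```

The kernel passed to `dilated_conv2d` has the same number of weights for every r, which is the sense in which the receptive field grows without adding parameters.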

3.3. CBAM

The attention mechanism guides the model to focus on critical channels and salient spatial regions, thereby strengthening feature extraction and improving the efficiency of capturing essential information. The Convolutional Block Attention Module (CBAM) is designed to address the limitations of conventional convolutional neural networks in handling features with varying scales, shapes, and orientations, enabling more flexible feature selection and representation. To achieve this, CBAM incorporates two complementary types of attention: channel attention and spatial attention. The channel attention mechanism enhances the representation of informative feature channels, while the spatial attention mechanism emphasizes key regions within the feature map, allowing the network to better capture spatially significant information.

The CBAM consists of two key components: the channel attention module and the spatial attention module. The structure of the channel attention module is shown in Figure 3. These two modules can be independently integrated into different layers of a convolutional neural network (CNN) to enhance the network’s feature representation capability.

Similar to SENet, CBAM can be seamlessly integrated into almost any convolutional neural network (CNN), enhancing model performance with only a slight increase in computational cost and parameter count. For the channel attention module, given an input feature map F \in R^{C \times H \times W}, we first conduct adaptive average pooling (AAP) to aggregate global spatial information. Next, a 1 × 1 two-dimensional convolution is applied to capture inter-channel dependencies. Finally, the result is fed into the Sigmoid activation function to produce the channel attention map M_{c}, and the refined feature map is obtained by multiplying these channel-wise weights with the original input feature F. The process is formulated as follows:

z_{c} = AdaptiveAvgPool2d (f_{c}) = \frac{1}{H \cdot W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} f_{c} (i, j)

(5)

where f_{c} (i, j) refers to the pixel value at spatial coordinate (i, j) of the c-th channel feature map, and H and W are the height and width of the input feature map, respectively.

M_{c} (F) = σ ({Conv}_{1 \times 1} (z_{c} (F)))

(6)

where F \in R^{C \times H \times W} denotes the input feature tensor, M_{c} (F) \in R^{C \times 1 \times 1} is the output channel attention weight matrix, σ (\cdot) denotes the Sigmoid activation function, z_{c} (\cdot) represents the 2D global adaptive average pooling defined in Equation (5), and {Conv}_{1 \times 1} (\cdot) stands for the 1 × 1 convolution used for channel dimension mapping.

F_{out} = F ⊙ M_{c} (F)

(7)

where ⊙ denotes element-wise Hadamard product between tensors.

Adaptive average pooling (AAP) is a flexible pooling operation that adaptively computes pooling kernels and strides according to the target output size, without manually configuring kernel size or stride. Given an input feature map of arbitrary spatial resolution, AAP evenly divides the feature map into the target number of sub-regions and calculates the average value of pixels within each sub-region as the output feature. When the target output size is set to 1 \times 1, AAP degenerates into global average pooling (GAP), which compresses each feature channel into a single scalar to aggregate global spatial context. In the channel attention branch of our improved CBAM, we adopt single-branch adaptive average pooling instead of the original dual-branch structure with average pooling and max pooling, which effectively reduces computational overhead and suppresses interference from occluded pixels in dense commodity detection scenarios.
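
The single-branch channel attention of Equations (5) to (7) can be sketched in plain Python as follows. This is an illustrative sketch rather than the trained module: the names are hypothetical, and the arguments `weight` (C × C) and `bias` (C) are assumed stand-ins for the learned 1 × 1 convolution, which reduces to a matrix-vector product on a C × 1 × 1 tensor.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def global_avg_pool(feature):
    """Equation (5): mean over H x W for each channel of a C x H x W tensor."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]


def channel_attention(feature, weight, bias):
    """Equations (5)-(7) with a single average-pooling branch."""
    z = global_avg_pool(feature)
    # Equation (6): 1 x 1 convolution across channels, then Sigmoid.
    m_c = [
        sigmoid(sum(w * z_c for w, z_c in zip(row, z)) + b)
        for row, b in zip(weight, bias)
    ]
    # Equation (7): scale every channel by its attention weight.
    refined = [
        [[m * v for v in row] for row in ch]
        for m, ch in zip(m_c, feature)
    ]
    return refined, m_c
```

Because only the channel mean enters the weight computation, a few extreme pixels in a channel influence M_c less than they would through a max-pooling branch.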

The spatial attention module is illustrated in Figure 4. For the input feature map F^{'}, channel-wise information is aggregated by applying both average pooling and max pooling across the channel dimension, resulting in two feature maps of size H × W × 1. These maps are then concatenated and passed through a 7 × 7 convolutional layer to fuse the spatial information. Finally, a Sigmoid activation function is applied to produce the spatial attention map M_{s}, whose weights are subsequently multiplied with the input feature map F^{'}. The attention map is defined by the following equation:

M_{s} (F^{'}) = δ (f^{7 \times 7} (CONV ([AvgPool (F^{'}); MaxPool (F^{'})])))

(8)

Here, F^{'} denotes the feature map after channel attention; [;] indicates concatenation along the channel dimension; f^{7 \times 7} represents a convolutional kernel of size 7 × 7; CONV represents the convolution operation applied after pooling; δ is the Sigmoid activation function; and AvgPool and MaxPool correspond to average pooling and max pooling along the channel dimension, respectively.

Final output of CBAM:

F^{″} = M_{s} (F^{'}) ⊙ F^{'}

(9)

where F^{'} denotes the intermediate feature tensor fed to the spatial attention module, M_{s} (F^{'}) represents the spatial attention weight matrix calculated from F^{'}, and ⊙ stands for the element-wise Hadamard product, which implements pixel-wise weighting of the feature map by the spatial attention weights.
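
Equations (8) and (9) can be traced with the following pure-Python sketch. It is a simplified illustration, not the trained module: the names are hypothetical, the `kernel` argument (shape 2 × k × k, k odd) is an assumed stand-in for the learned 7 × 7 convolution, and zero padding is assumed so that the attention map keeps the size H × W.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def spatial_attention(feature, kernel):
    """Equations (8)-(9) on a C x H x W nested list."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool along the channel axis: two H x W maps (average and max).
    avg_map = [[sum(feature[ch][i][j] for ch in range(c)) / c
                for j in range(w)] for i in range(h)]
    max_map = [[max(feature[ch][i][j] for ch in range(c))
                for j in range(w)] for i in range(h)]
    pooled = [avg_map, max_map]

    attention = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for p, plane in enumerate(pooled):
                for m in range(k):
                    for n in range(k):
                        y, x = i + m - pad, j + n - pad
                        if 0 <= y < h and 0 <= x < w:  # zero padding
                            acc += kernel[p][m][n] * plane[y][x]
            attention[i][j] = sigmoid(acc)  # Equation (8)

    # Equation (9): the same H x W map reweights every channel.
    refined = [[[attention[i][j] * feature[ch][i][j] for j in range(w)]
                for i in range(h)] for ch in range(c)]
    return refined, attention
```

A kernel of all zeros gives a uniform weight of 0.5 at every position, which is a convenient sanity check before plugging in learned weights.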

The CBAM architecture is composed of a combination of channel attention and spatial attention modules, as shown in Figure 5. The channel attention module is based on SENet, while the spatial attention module is appended after SENet and similarly leverages global pooling along the channel dimension to generate a two-dimensional spatial attention map. By performing both global average pooling and global max pooling in the spatial and channel dimensions, CBAM is able to capture a richer set of informative features, enhancing the network’s ability to focus on the most relevant information.

Compared with other attention mechanisms, SENet and ECANet focus solely on channel attention, while CA focuses only on spatial attention. In contrast, CBAM combines these modules in a sequential manner, simultaneously capturing channel-wise importance and highlighting critical spatial regions within feature maps. This enables the model to concentrate more effectively on local regions corresponding to target objects, thereby mitigating background interference. These advantages are further validated in the ablation study section.

3.4. CARAFE

The original CCFPN network adopts the standard upsampling operation based on bilinear interpolation via nn.Upsample(). Limited by the fixed kernel weights and static receptive field of traditional interpolation, this operation easily causes the loss of fine edge semantic information and incomplete feature reconstruction, which degrades the overall detection performance. To address the above defects, this paper introduces the Content-Aware ReAssembly of Features (CARAFE) module [20] for adaptive feature upsampling. As illustrated in Figure 6, the overall workflow is summarized as follows: The input feature map first undergoes a lightweight convolutional layer to compress channel dimensions. Subsequently, dynamic upsampling and weight normalization are performed to generate content-dependent feature reassembly matrices. The generated weight matrices are applied to the corresponding pixel regions of the original feature map to enhance effective target features. Finally, pixel reorganization is implemented to obtain high-resolution upsampled features with expanded receptive fields and complete semantic details.

CARAFE is a typical content-adaptive upsampling operator, which is fundamentally different from conventional upsampling strategies including nearest neighbor interpolation, bilinear interpolation, and transposed convolution. Instead of adopting fixed interpolation kernels or hand-crafted transformation rules, CARAFE dynamically predicts pixel reassembly weights from input feature semantics and aggregates neighboring pixel features to reconstruct high-resolution feature representations.

Compared with CARAFE, traditional upsampling methods have obvious inherent limitations. Nearest neighbor and bilinear interpolation are parameter-free and rely on fixed static transformation rules, which inevitably lose detailed texture information and produce blurred feature maps. Transposed convolution introduces learnable parameters but frequently generates undesirable checkerboard artifacts. Although the PyTorch-implemented PixelShuffle realizes pixel rearrangement for upsampling, it lacks adaptive adjustment capability and cannot dynamically fit diverse feature distributions of different input scenes.

The CARAFE module mainly consists of two core sub-modules: kernel prediction and content-aware reassembly.

Kernel Prediction Module

This module generates dynamic reassembly weights corresponding to each upsampling position through lightweight feature transformation. Specifically, a $1 \times 1$ convolution layer combined with the Softmax normalization function is applied to the input feature map. The layer adaptively normalizes the kernel weights without additional global normalization operations, which effectively reduces parameter overhead and computational complexity. The output is pixel-level dynamic weight kernels matching the spatial dimension of the input feature map. The dynamic weight generation process can be formulated as

$K = \mathrm{Softmax}(\mathrm{Conv}_{1 \times 1}(X))$

(10)

where $X$ denotes the input feature map, $\mathrm{Conv}_{1 \times 1}(\cdot)$ represents the $1 \times 1$ convolution operation for channel compression and feature transformation, and $\mathrm{Softmax}(\cdot)$ normalizes the predicted kernel weights to satisfy spatial aggregation constraints.
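As a rough illustration of Equation (10), the sketch below applies a $1 \times 1$ convolution (written as a channel-wise matrix product) and a Softmax over each predicted kernel. The function name `predict_kernels`, the kernel size `k = 3`, the scale factor `scale = 2`, the channel layout of the convolution output and the random weights are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def predict_kernels(x, weight, scale=2, k=3):
    """Sketch of K = Softmax(Conv1x1(X)).

    x:      (C, H, W) input feature map
    weight: (k*k*scale*scale, C) weights of the 1x1 convolution (assumed layout)
    returns (scale*H, scale*W, k*k): one normalized kernel per output position
    """
    c, h, w = x.shape
    # A 1x1 convolution is a per-pixel linear map over channels.
    logits = np.einsum("oc,chw->ohw", weight, x)       # (k*k*s*s, H, W)
    # Move the s*s sub-positions into the spatial grid (pixel rearrangement).
    logits = logits.reshape(k * k, scale, scale, h, w)
    logits = logits.transpose(3, 1, 4, 2, 0).reshape(h * scale, w * scale, k * k)
    # Softmax over the k*k kernel entries, shifted for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 4))
weight = rng.standard_normal((3 * 3 * 2 * 2, 4))
K = predict_kernels(x, weight)
print(K.shape)  # (8, 8, 9)
```

The Softmax makes every predicted kernel non-negative and sum to one, so the later reassembly step is a weighted average of neighboring features.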

Content-Aware Reassembly Module

This module aggregates neighboring pixel features based on the dynamically predicted weights to complete adaptive feature upsampling. Different from fixed interpolation rules, CARAFE performs weighted aggregation on local neighboring pixels for each output position to retain edge details and semantic consistency. The local feature aggregation calculation is defined as

$\chi'_{l'} = \sum_{n = -r}^{r} \sum_{m = -r}^{r} W_{l'(n, m)} \cdot \chi_{(i + n, j + m)}$

(11)

where $W_{l'(n, m)}$ denotes the dynamic reassembly kernel weight at the corresponding spatial position, $\chi_{(i + n, j + m)}$ represents the pixel feature value of the original input feature map within the local receptive field, and $r$ is the aggregation radius of the reassembly kernel. Before feature aggregation, the input feature map is padded to ensure dimension matching for the unfold operation, which guarantees valid pixel-level reassembly across the entire feature map.

To further clarify the overall upsampling process of the CARAFE module, we define the unified calculation formula for feature upsampling. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$ and an upsampling scale factor $s$, the output high-resolution feature map is $Y \in \mathbb{R}^{C \times sH \times sW}$. For each output spatial position $(i, j)$, the upsampling calculation is formulated as

$Y_{c, i, j} = \sum_{(u, v) \in N(i, j)} K_{i, j, u, v} \cdot X_{c, u, v}$

(12)

where $Y_{c, i, j}$ denotes the feature value of channel $c$ at position $(i, j)$ in the output upsampled feature map, $N(i, j)$ represents the valid neighboring pixel set of the input feature map corresponding to the output position $(i, j)$, and $K_{i, j, u, v}$ is the content-adaptive dynamic weight predicted by the kernel prediction module. Through weighted aggregation of neighboring pixels with dynamic weights, CARAFE realizes flexible and scene-adaptive upsampling, effectively preserving fine-grained semantic details and avoiding the artifacts caused by traditional fixed upsampling methods. Subsequent ablation experiments on the upsampling module further verify the effectiveness and superiority of the adopted CARAFE module.
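The weighted aggregation of Equation (12) can be sketched in NumPy as follows. This is an illustrative loop-based version under stated assumptions: the neighborhood $N(i, j)$ is taken as the $k \times k$ window centered on the input pixel $(\lfloor i/s \rfloor, \lfloor j/s \rfloor)$, zero padding is used at the borders, and the function name `carafe_reassemble` is hypothetical.

```python
import numpy as np

def carafe_reassemble(x, kernels, scale=2, k=3):
    """Y[c, i, j] = sum over the k x k window of K[i, j, :] * X[c, u, v].

    x:       (C, H, W) input feature map
    kernels: (scale*H, scale*W, k*k) normalized reassembly kernels
    returns  (C, scale*H, scale*W)
    """
    c, h, w = x.shape
    r = k // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))   # zero padding (assumed)
    y = np.zeros((c, h * scale, w * scale), dtype=x.dtype)
    for i in range(h * scale):
        for j in range(w * scale):
            u, v = i // scale, j // scale          # source pixel in X
            patch = padded[:, u:u + k, v:v + k].reshape(c, k * k)
            y[:, i, j] = patch @ kernels[i, j]     # same kernel for all channels
    return y

# Sanity check: a kernel that puts all weight on the window center
# reduces the operator to nearest-neighbor upsampling.
x = np.arange(2 * 3 * 3, dtype=np.float64).reshape(2, 3, 3)
center = np.zeros((6, 6, 9))
center[..., 4] = 1.0
y = carafe_reassemble(x, center)
print(np.array_equal(y, np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)))  # True
```

The sanity check shows that fixed interpolation is a special case of this operator; content-dependent kernels let each output position choose its own mixture of neighbors.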

4. Results and Analysis

4.1. Experimental Environment Setup

Deep learning frameworks play a pivotal role in determining the efficiency of model development and deployment. In this study, PyTorch was selected as the primary development framework. Compared with static graph frameworks, PyTorch leverages the interpretability and flexibility of Python, enabling dynamic computational graph construction and rapid prototyping. This significantly accelerates network modification and debugging. In contrast, C++, being a strongly typed compiled language, excels in low-level system interaction and dynamic-link library (DLL) development but is less suitable for fast-paced deep learning algorithm iteration.

As shown in Table 1, at the hardware level, the training platform was equipped with an NVIDIA RTX 4080 GPU, and CUDA 12.4 was adopted to accelerate model training. Under this configuration, the proposed YOLOv11n-FOD model was successfully trained and validated. The corresponding datasets and source codes are publicly available at the following download center: https://blogcenter.top/ (accessed on 1 May 2026).

The PyTorch framework used in this experiment was obtained from the official PyTorch website at https://pytorch.org/get-started/previous-versions/ (accessed on 1 March 2026). Model training was conducted on NVIDIA GPUs, partly on personal computers equipped with NVIDIA graphics cards and partly on a cloud server deployed at the main campus of Yangtze University, which played a critical role in facilitating the experiment.

4.2. Object Detection Evaluation Metrics

To comprehensively evaluate the model’s performance, the main evaluation metrics used in the experiment are parameters, GFLOPs, mAP@0.5, and mAP@0.5:0.95. Precision refers to the proportion of correctly predicted positive samples among all predicted positive results, while recall refers to the proportion of actual positive samples that are correctly predicted as positive.

AP

AP (average precision) is calculated as the area under the precision–recall (P-R) curve to evaluate the model’s performance in the object detection task, reflecting the average precision of the model across different recall levels. The calculation formula is as follows:

$AP = \int_{0}^{1} P(r) \, dr$

(13)

Recall

$Recall = \frac{TP}{TP + FN}$

(14)

In the formula, FN represents the total number of false negatives, and TP represents the number of true positives.

Precision

Precision denotes the proportion of correctly detected positive samples among all predicted positive results.

$Precision = \frac{TP}{TP + FP}$

(15)

In the formula, FP represents the number of negative samples incorrectly detected as positive by the model. In practice, the integral in Equation (13) is approximated by a discrete sum. Let the total number of samples be n, and let k index the samples detected from them. The recall at k is denoted as $r_{k}$, and $p_{k}$ represents the maximum precision for recall greater than $r_{k}$. AP is then computed as

$AP = \sum_{k = 1}^{n} p_{k} (r_{k + 1} - r_{k})$

(16)

mAP

$mAP$ is the mean of the average precision across all classes and is defined as

$mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}$

(17)

In the formula, N represents the total number of classes, and $AP_{i}$ denotes the average precision of the i-th class.
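The following minimal sketch shows how these metrics could be computed from detection counts and a sampled precision–recall curve. It is illustrative only: the function names are hypothetical, the recall axis is assumed to start at 0, and the example numbers are made up rather than taken from the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Discrete AP: sum of p_k * (r_{k+1} - r_k).

    recalls must be sorted in ascending order; precisions[k] is the
    precision measured at recalls[k]. p_k is the maximum precision
    observed at any recall greater than r_k (precision envelope).
    """
    r = [0.0] + list(recalls)            # assumed starting point r = 0
    p = list(precisions)
    for k in range(len(p) - 2, -1, -1):  # make precision non-increasing
        p[k] = max(p[k], p[k + 1])
    return sum(p[k] * (r[k + 1] - r[k]) for k in range(len(p)))

def mean_average_precision(ap_per_class):
    """mAP = (1 / N) * sum of AP_i over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Made-up numbers, for illustration only.
print(precision_recall(tp=8, fp=2, fn=2))            # (0.8, 0.8)
print(average_precision([0.5, 1.0], [1.0, 0.5]))     # 0.75
print(mean_average_precision([0.75, 0.25]))          # 0.5
```

mAP@0.5 uses a single IoU threshold of 0.5 to decide which detections count as true positives, while mAP@0.5:0.95 averages the same computation over IoU thresholds from 0.5 to 0.95.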
Inference Time
To comprehensively evaluate the inference capability of the proposed model, the inference time is adopted as a key evaluation metric. This metric is defined as the cumulative sum of three components: preprocessing, model inference, and postprocessing. As illustrated in the experimental results, this composite indicator objectively reflects the real-time responsiveness of the entire object detection framework in practical deployment scenarios.
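A simple way to obtain such a composite figure is to time the three stages separately and add them up. The sketch below uses stand-in callables for the stages; `measure_latency` and the stage functions are hypothetical names, not part of any detection framework. When the model runs on a GPU, the device would additionally need to be synchronized before each clock reading, which this CPU-only sketch omits.

```python
import time

def timed(fn, arg):
    """Run fn(arg) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, (time.perf_counter() - start) * 1000.0

def measure_latency(preprocess, infer, postprocess, image):
    """Inference time = preprocessing + model inference + postprocessing."""
    x, t_pre = timed(preprocess, image)
    y, t_inf = timed(infer, x)
    result, t_post = timed(postprocess, y)
    timing = {
        "preprocess_ms": t_pre,
        "inference_ms": t_inf,
        "postprocess_ms": t_post,
        "total_ms": t_pre + t_inf + t_post,
    }
    return result, timing

# Stand-in stages (placeholders for resizing, the detector, and NMS).
result, timing = measure_latency(
    preprocess=lambda img: [v / 255.0 for v in img],
    infer=lambda x: [v * 2.0 for v in x],
    postprocess=lambda y: max(y),
    image=[0, 51, 255],
)
print(result, timing["total_ms"])
```

In practice the measurement is usually averaged over many images after a few warm-up runs, since the first calls tend to be slower.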

4.3. Experimental Dataset

Public Dataset

To comprehensively compare and select baseline models, this paper adopts the public dataset for commodity recognition of intelligent retail cabinets released on Alibaba Tianchi as shown in Figure 7. The dataset is available at https://tianchi.aliyun.com/dataset/138895 accessed on 29 June 2026. This dataset follows the VOC annotation format and is compatible with most mainstream deep learning development frameworks. It contains a total of 5422 fully annotated images covering 113 categories of retail commodities. According to the official data partition, the training set consists of 3796 images, the validation set contains 1084 images, and the test set includes 542 images. Since the dataset is collected from real scenarios of intelligent retail cabinets, it can effectively evaluate the overall performance of different detection algorithms in practical application environments.

According to the comparative analysis presented in Table 2, the development of object detection models has evolved into two dominant technical paradigms: CNN and transformer. Among them, transformer-based detectors generally achieve superior detection accuracy, whereas one-stage detectors represented by the YOLO series exhibit outstanding inference efficiency, with models such as YOLOv11 achieving millisecond-level (2 ms) inference speed. Notably, YOLOv11 breaks the inherent accuracy–speed trade-off, achieving high detection precision while maintaining ultra-fast inference. To satisfy the real-time constraints of edge deployment and ensure the detection rate of small foreign objects, this paper selects YOLOv11—with the optimal comprehensive performance—as the baseline architecture for further improvement and optimization.

Custom Dataset CFOD

The experimental dataset was collected in the laboratory’s visual cabinet using a fisheye camera, capturing both products and foreign objects. Since the focus of this study is anomaly detection, additional common foreign objects were deliberately collected to enable the model to recognize frequently occurring anomalies. Other foreign objects were labeled based on the model’s inference results and categorized as miscellaneous anomalies. The specific collection results are shown in the figure below. Access link for the self-built CFOD dataset: https://blogcenter.top (accessed on 29 June 2026).

As illustrated in Figure 8, typical residual foreign objects inside retail cabinets include earphones, keys, paper scraps, chewing gum, mobile phones, Xizhilang jelly and other sundries. Such targets are generally tiny in size and frequently suffer from partial occlusion by stacked commodities, which makes them easily left behind in compartments or mistakenly discarded as waste. This inherent detection difficulty constitutes the core research focus of this work.

Because the number of originally collected real-scene images is limited, multiple data augmentation strategies are adopted to increase illumination diversity and to synthesize image samples under graded brightness conditions. In addition, the white balance and CLAHE (Contrast Limited Adaptive Histogram Equalization) parameters are tuned in the image preprocessing pipeline. As displayed in the subsequent figure, these preprocessing operations improve raw image quality and provide a solid data foundation for the following foreign object detection task.

As illustrated in Figure 9, pixel-level illumination adjustment experiments are implemented with the OpenCV and Albumentations libraries to simulate diverse lighting conditions. The visualized results include, in order, Gamma correction with $\gamma = 0.4$ and $\gamma = 0.7$, CLAHE adaptive histogram equalization with the clip limit set to 2.0 and 4.0, brightness scaling with $\alpha = 0.6$, and the built-in Gamma enhancement function of Albumentations. This augmentation pipeline enriches the dataset with diversified samples covering under-exposure, over-exposure and distorted contrast, which reproduces complicated practical illumination variations and provides solid data support for improving the illumination robustness of the subsequent detection model.
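Two of the listed adjustments, Gamma correction and brightness scaling, are simple enough to sketch directly in NumPy. The sketch assumes 8-bit images and the convention $out = 255 \cdot (in / 255)^{\gamma}$, under which $\gamma < 1$ brightens the image; the paper does not state which convention was used, and the function names are hypothetical. CLAHE is omitted because it relies on a library implementation.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma correction on a uint8 image: out = 255 * (img / 255) ** gamma.

    Convention assumed here: gamma < 1 brightens, gamma > 1 darkens.
    """
    norm = img.astype(np.float64) / 255.0
    return np.clip(np.rint(255.0 * norm ** gamma), 0, 255).astype(np.uint8)

def scale_brightness(img, alpha):
    """Linear brightness scaling: out = alpha * img, clipped to [0, 255]."""
    scaled = img.astype(np.float64) * alpha
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

# Flat gray test image standing in for a cabinet photo.
img = np.full((4, 4, 3), 100, dtype=np.uint8)
variants = {
    "gamma_0.4": gamma_correct(img, 0.4),
    "gamma_0.7": gamma_correct(img, 0.7),
    "alpha_0.6": scale_brightness(img, 0.6),
}
for name, out in variants.items():
    print(name, int(out[0, 0, 0]))
```

Applying each variant to every source image yields several illumination levels per scene, which is the kind of graded brightness series the augmentation aims for.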

All original image samples are captured in our laboratory with fisheye cameras mounted on intelligent retail cabinets, consisting of normal commodities and various residual foreign objects. Targeting the foreign object detection task, we prioritize data collection of frequently encountered sundries to guarantee reliable identification for common abnormal targets, while sporadic uncommon foreign items are categorized into one unified miscellaneous class according to model inference feedback, and detailed sample distribution is demonstrated in the following figure.

In this experiment, we compare the performance of various baseline detection models on our self-built dataset. All image samples are collected inside the visual inspection cabinets located in the main teaching building of Yangtze University, covering scenarios ranging from single commodity placement to densely stacked commodities. The dataset is manually annotated via Labelme, and online/offline data augmentation is implemented based on the Albumentations library, including random rotation and white balance adjustment to expand sample diversity. The entire dataset is randomly split into training, validation and test subsets following an 8:1:1 ratio, resulting in 4800 training images, 600 validation images and 600 test images. The source dataset and experimental codes are available at https://blogcenter.top/ accessed on 29 June 2026.
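The 8:1:1 random split can be reproduced with a few lines of standard-library Python. This is a sketch only: the seed, the file-name pattern and the function name `split_dataset` are assumptions, and the authors' actual splitting script is not described in the paper.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for a repeatable split
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Hypothetical file names; 6000 images give the 4800/600/600 split above.
names = [f"img_{i:05d}.jpg" for i in range(6000)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))   # 4800 600 600
```

Splitting before offline augmentation keeps augmented copies of the same source image from appearing in both the training and test subsets.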

Based on the experimental results on the CFOD benchmark dataset listed in Table 3, existing object detection approaches can be categorized into two mainstream branches: CNN-based detectors (including one-stage and two-stage architectures) and transformer-based models. Despite the theoretically superior feature extraction capability of transformer frameworks represented by DETR, their heavy computational overhead inevitably leads to inferior real-time performance compared with mainstream one-stage detectors.

Among one-stage YOLO variants, YOLOv26 delivers the optimal inference latency of only 1.3 ms, yet it suffers from relatively unsatisfactory recall and moderate mAP50-95 without top-ranked comprehensive accuracy. In contrast, the vanilla YOLOv11n maintains favorable lightweight inference efficiency at 2.3 ms; our refined YOLOv11n-FOD further achieves remarkable accuracy gains, yielding state-of-the-art mAP50 (0.992) and competitive mAP50-95 (0.776). Such well-balanced precision and inference speed perfectly match the practical requirement of foreign object inspection for intelligent retail cabinets, which demands both high detection accuracy and low latency. Accordingly, YOLOv11n is selected as the baseline architecture for subsequent structural improvement and optimization in this work.

4.4. Heatmap Visualization Experiment

Inspired by Grad-CAM, we design a heatmap visualization script to qualitatively and intuitively evaluate the object localization capability of the model. Gradient-weighted Class Activation Mapping (Grad-CAM) is a gradient-based visualization technique, which reveals the image regions that deep neural networks focus on during inference. Although the detection head of YOLOv11n does not adopt a conventional CNN structure, Grad-CAM can still be applied to the convolutional feature maps of the backbone and encoder to generate class-specific activation heatmaps.

The implementation pipeline is as follows: First, we obtain the gradients corresponding to the target class and back-propagate them to the key feature layers of the network. Then, the gradient of each channel is averaged to calculate channel weights. Afterwards, the feature maps are weighted and fused with the obtained weights. The results are processed with the ReLU function to suppress negative responses and upsampled to the same resolution as the original image to generate the final heatmap. Applying Grad-CAM to YOLOv11n helps interpret how the network associates local features with global context for efficient object detection. The heatmaps clearly visualize the attention regions of the model, which can verify the rationality of attention mechanisms and the accuracy of object localization. This method provides reliable support for model optimization and interpretability analysis.
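The pipeline above can be condensed into a short NumPy sketch that starts from already-extracted activations and gradients of one feature layer. It is not the authors' visualization script: obtaining the gradients from the network is left out, nearest-neighbor upsampling is used as a simplifying assumption, and the function name `grad_cam` is hypothetical.

```python
import numpy as np

def grad_cam(activations, gradients, out_size):
    """Grad-CAM heatmap from one feature layer.

    activations, gradients: (C, h, w) arrays for the chosen layer, where
    gradients holds d(class score) / d(activations).
    out_size: (H, W) of the original image, assumed to be integer
    multiples of (h, w) so nearest-neighbor upsampling applies.
    """
    weights = gradients.mean(axis=(1, 2))              # average gradient per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum -> (h, w)
    cam = np.maximum(cam, 0.0)                         # ReLU removes negative evidence
    fy = out_size[0] // cam.shape[0]
    fx = out_size[1] // cam.shape[1]
    cam = np.repeat(np.repeat(cam, fy, axis=0), fx, axis=1)
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # normalize to [0, 1]

# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = np.zeros((2, 2, 2))
acts[0, 0, 0] = 1.0
acts[1, 1, 1] = 1.0
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heat = grad_cam(acts, grads, out_size=(4, 4))
print(heat)
```

In the toy example only the top-left block of the heatmap is active, because the response of the opposing channel is removed by the ReLU step.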

As shown in Figure 10 and Figure 11, the proposed method achieves favorable detection performance in single-category, multi-category, and multi-object scenarios, accurately localizing foreign objects with higher detection confidence compared to the baseline model. In dense multi-category scenes, the baseline model exhibits significantly weaker attention to target regions than the improved model. The red regions in the heatmaps represent the model’s focused areas; the baseline model’s attention is scattered and overly broad, leading to false positives. Due to the high proportion of small objects in the scene, the baseline model lacks sufficient feature extraction capability: its 1/32 downsampling output meets the requirements for medium-sized object detection but fails to effectively represent small objects, resulting in underfitting. In contrast, the improved model demonstrates clear advantages in both target localization accuracy and small object feature focusing.

4.5. Occlusion Comparison Experiment

To further verify the effectiveness of the improved model proposed in this paper under complex interference scenarios, special tests are carried out on foreign object targets with different occlusion degrees in this experiment. All experiments are implemented on a cloud server platform, and the detailed experimental environment configurations are listed as follows:

Python: 3.10.14;
Ultralytics framework: 8.3.23;
PyTorch: 2.5.1;
Hardware: NVIDIA RTX A6000 GPU with 48670 MiB video memory.

Training hyperparameters are set as follows: the batch size is 16 and the number of training epochs is 100.

As can be seen from the experimental results in Figure 12, the improved model can effectively detect foreign objects under various occlusion ratios and exhibits outstanding anti-occlusion detection performance. It should be noted that all comparative experiments in this work adopt a unified training epoch of 100, which is less than the optimal convergence epoch of the model. Restricted by this setting, some network parameters cannot be fully optimized through sufficient iterations, leaving room for improvement in the detection accuracy of partial targets. This limitation can be alleviated by appropriately increasing training epochs and optimizing hyperparameters.

In the practical application scenario of intelligent retail cabinets, foreign object occlusion is a frequent interference factor. The core performance requirement of the detection model is to reduce the missing detection rate, so as to avoid equipment failures and poor shopping experience caused by undetected foreign objects. Therefore, the model’s strong anti-occlusion capability and low missing detection rate contribute to its great practical application value.

4.6. Comparison Under Diverse Brightness Levels

As can be observed from Figure 13, the brightness of samples in the left four columns gradually decreases from left to right to simulate the gradually darkening environment inside intelligent retail cabinets, while the rightmost image corresponds to dense commodities captured under normal illumination. With continuous reduction in ambient brightness, the prediction confidence of bounding boxes produced by original YOLOv11n drops evidently, and the feature distinguishability of foreign objects degrades severely under extremely low-light conditions. In contrast, benefiting from the embedded feature enhancement module, the improved YOLOv11n-FOD proposed in this paper maintains stable detection confidence under both various dim illumination and normal lighting conditions. The presented architecture effectively mitigates performance degradation caused by insufficient light, which validates its outstanding anti-dark–light robustness for the foreign object detection task of retail cabinets.

4.7. Ablation Experiment

Table 4 lists the results of the multi-module ablation experiments. Compared with the baseline model, C3K2-SAC, CARAFE and CBAM + CONV all improve detection performance individually. CARAFE reduces computational overhead, CBAM + CONV achieves prominent gains in accuracy and recall, and C3K2-SAC enhances feature representation. Experimental results show that simply stacking two modules cannot yield continuous performance improvement. When all three modules are integrated, the model obtains the best overall mAP50, mAP50-95 and recall. For foreign object detection in retail cabinets, C3K2-SAC strengthens feature extraction for tiny targets, CARAFE optimizes multi-scale feature fusion, and CBAM + CONV suppresses noise caused by reflection and occlusion. These modules complement each other to address key challenges in practical scenarios. Meanwhile, the computational cost and inference speed still meet the real-time requirements of edge devices, which demonstrates the rationality and necessity of the three-module combination.

4.8. Attention Comparison Experiment

To verify the effectiveness of the proposed attention module, we conduct ablation experiments on various mainstream attention mechanisms based on our final network equipped with CIoU loss. All training procedures are run on an NVIDIA GeForce RTX 4080 GPU with 16380 MiB VRAM. The training hyperparameters remain consistent across all trials: the number of training epochs is set to 100, eight workers are used for parallel data loading, and all experiments use the CFOD dataset proposed in this paper.

We compare five classic attention modules to determine the optimal component for dense small commodity detection in unmanned retail scenes. The quantitative experimental results are summarized in Table 5. Bold numbers in each column denote the optimal value of the corresponding evaluation metric. For mAP50, mAP50-95 and recall, larger values correspond to superior detection precision; for GFLOPS and inference latency, smaller values indicate lower computational cost and faster inference speed.

From the perspective of detection accuracy, CBAM achieves the best performance on mAP50 (0.988) and mAP50-95 (0.788), which demonstrates its outstanding capability for both coarse target localization and fine-grained detection under dense occlusion. CA gains the maximum recall of 0.983, yet its mAP50-95 is only 0.746, which reveals insufficient feature extraction capability when goods are heavily stacked and blocked. EMA obtains a competitive mAP50-95 of 0.786, but its mAP50 drops sharply to 0.959, showing weak coarse localization performance. SENet and ECA achieve balanced medium-level accuracy, with identical mAP50-95 of 0.782, while SENet’s mAP50 (0.972) is obviously inferior to CBAM, CA and ECA.

In terms of computational overhead, all five attention modules share the same computational complexity of 6.6 GFLOPs, so the calculation burden does not distinguish the lightweight SENet from the other modules. In terms of inference latency, CBAM reaches the shortest inference time of 2.3 ms, which is much faster than EMA (3.8 ms). Although CA achieves the highest recall, its inference latency (2.6 ms) is higher than that of CBAM. ECA, SENet and EMA have successively longer inference times and are therefore less suited to the strict real-time requirement of unmanned retail vision systems.

Comprehensive analysis of detection accuracy and inference speed shows that CBAM is the most suitable attention module for our unmanned retail shelf dense small object detection task. Under the same computational overhead as other attention mechanisms, it delivers the highest overall detection accuracy and the optimal real-time performance, outperforming SENet, CA, ECA and EMA in comprehensive performance.

4.9. Training Loss Curve Comparison

As shown in Figure 14, the penultimate curve represents YOLOv11n-FOD, the final improved model proposed in this work. To quantitatively and intuitively compare the performance of ablation experiments, we plot relevant curves on the validation set, including loss convergence curves, as well as variation curves of mAP50 and mAP50-95. The experimental results indicate that the proposed model achieves better overall performance than the original YOLOv11n baseline. However, the performance improvement does not show a monotonic growth with the superposition of modules. Specifically, the models equipped with only the CBAM attention module or only the CARAFE upsampling module outperform the scheme combining both modules. Two main reasons explain this phenomenon. First, stacking multiple modules increases model parameters and computational cost. The expanded model requires more training epochs for adequate parameter fitting, leading to insufficient convergence under the current training settings. Second, the combination of multiple modules raises model complexity, which imposes higher requirements on data distribution, sample quantity and annotation quality. Restricted by the inherent characteristics of the retail cabinet dataset, such as imbalanced target scales and reflective noise interference, the coupled modules tend to cause overfitting and degraded feature adaptation. Overall, all the improved schemes outperform the original baseline model. By optimizing the basic C3k2 module and adding a 1/64 downsampling branch, our method enhances the extraction of shallow fine-grained features and effectively reduces missed detections of tiny foreign objects in retail cabinets. Meanwhile, the original upsampling module is replaced with CARAFE adaptive upsampling to optimize cross-scale feature fusion and improve the model’s generalization ability. 
Consequently, the proposed YOLOv11n-FOD model is capable of detecting various foreign objects robustly under the complex conditions of retail cabinet scenarios, including multi-scale targets, low illumination and strong reflection interference.

4.10. Generalization Experiment

To fully verify the generalization and robustness of our improved model in scenarios with densely arranged objects, a high proportion of tiny targets, and severe occlusion, and to objectively demonstrate the effectiveness of the proposed improvements, we conduct comparative experiments on the public benchmark dataset SKU-110K. Released in 2019 by Bar-Ilan University, Tel Aviv University and other institutions in Israel, SKU-110K [33] is a large-scale dense object detection dataset for retail shelf scenarios. It is specially constructed for the detection of densely stacked, visually similar and tiny-sized commodities, and serves as an authoritative benchmark in the research fields of retail vision and dense object detection.

The experiments were conducted on a server with the following hardware and software configuration: an NVIDIA RTX 4090 (24 GB) GPU, a 20 vCPU Intel Xeon Platinum 8470Q processor, and 90 GB of system memory. The software environment was built on Ubuntu 22.04, using Python 3.12, PyTorch 2.8.0, and CUDA 12.8 to fully leverage GPU computing power.

The experimental results are shown in Figure 15. The improved YOLOv11n-FOD model demonstrates significant advantages in both model convergence and detection accuracy. In terms of convergence performance, the improved model exhibits a faster decline and a lower final value of classification loss, indicating higher efficiency in learning category features. For detection accuracy, the mAP@50-95 curve of the improved model consistently outperforms the baseline YOLOv11n model, with a notable improvement in the final converged accuracy. These results validate that the proposed improvements effectively address the inherent limitations of YOLO series models in dense small object scenarios (e.g., insufficient feature extraction and localization drift), significantly enhancing the model’s representation ability and robustness.

5. Conclusions

This study proposed the YOLOv11n-FOD model, which was obtained through a lightweight redesign of the original YOLOv11n object detection framework. The enhanced architecture significantly improves the model’s ability to capture small object features while maintaining the computational efficiency necessary for deployment on embedded platforms. Although the categories of foreign objects are relatively limited, their intra-class variations exhibit considerable complexity and diversity. Future research will further explore open-set object detection and the integration of multimodal information, such as weight sensor data, to enhance detection accuracy and robustness. In addition, multimodal large model techniques will be investigated to enable more comprehensive and intelligent foreign object detection in retail cabinet scenarios.

Author Contributions

Conceptualization, Z.Z. and K.X.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z.; formal analysis, Z.Z.; investigation, Z.Z.; resources, K.X.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z.; visualization, Z.Z.; supervision, K.X., W.Z. and J.H.; project administration, K.X., W.Z. and J.H.; funding acquisition, K.X., W.Z. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62272485.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The SKU110K dataset used in this study is publicly available from the official GitHub repository: https://github.com/eg4000/SKU110K_CVPR19 accessed on 29 June 2026. According to the repository, the dataset is provided solely for academic and non-commercial purposes. The Tianchi public dataset used in this study is subject to the Alibaba Cloud/Tianchi public dataset terms of use, which are available at: https://terms.aliyun.com/legal-agreement/terms/suit_bu1_ali_cloud/suit_bu1_ali_cloud201811161413_63039.html?spm=a2c22.12282016.J_9453290150.2.7e903650QfnqmE accessed on 29 June 2026. The dataset is available at https://tianchi.aliyun.com/dataset/138895 accessed on 29 June 2026.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant No.62272485). The authors gratefully acknowledge this support.

Conflicts of Interest

The authors declare no conflicts of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Wang, Z.; Zhang, S.; Chen, Y.; Xia, Y.; Wang, H.; Jin, R.; Wang, C.; Fan, Z.; Wang, Y.; Wang, B. Detection of small foreign objects in Pu-erh sun-dried green tea: An enhanced YOLOv8 neural network model based on deep learning. Food Control 2025, 168, 110890.
2. Khan, Z.; Yoon, S.C.; Bhandarkar, S.M. Deep learning model compression and hardware acceleration for high-performance foreign material detection on poultry meat using NIR hyperspectral imaging. Sensors 2025, 25, 970.
3. Li, Q.; Zeng, R.; Wang, G.; Yang, T. Conveyor belt foreign object detection method based on improved YOLOv11 and ESRGAN. Sci. Rep. 2026; Online ahead of print.
4. Chen, Z.; Yang, J.; Li, F.; Feng, Z.; Chen, L.; Jia, L.; Li, P. Foreign object detection method for railway catenary based on a scarce image generation model and lightweight perception architecture. IEEE Trans. Circuits Syst. Video Technol. 2025, 36, 1377–1391.
5. Gu, W.; Gao, W.; Zou, Y.; Ma, S. ATW-YOLO: Reconstructing the downsampling process and attention mechanism of YOLO network for rail foreign body detection. Signal Image Video Process. 2025, 19, 368.
6. Bin, F.; He, J.; Qiu, K.; Hu, L.; Zheng, Z.; Sun, Q. CI-YOLO: A lightweight foreign object detection model for inspecting transmission line. Measurement 2025, 242, 116193.
7. Dong, Z.; Yang, Q.; Chen, H.L.; Zhou, H.; Gao, D. A lightweight transformer-based framework for real-time foreign object detection in complex railway environments. J. Real-Time Image Process. 2026, 23, 3.
8. Mushtaq, Y.; Ali, W.; Ghani, U.; Khan, R.U.; Adak, A.K. Advancing aviation safety and sustainable infrastructure: High-accuracy detection and classification of foreign object debris using deep learning models. Int. J. Sustain. Dev. Goals 2025, 1, 82–98.
9. Luo, Z.; Fu, Z.; Huang, Z.; Fu, W.; Zhu, Z.; Chen, X. A day-night cross-modal network for robust commodity recognition under low-light illumination. Eng. Appl. Artif. Intell. 2026, 164, 113164.
10. Hou, P.; Huang, S. BCSM-YOLO: An improved product package recognition algorithm for unmanned retail stores based on YOLOv11. IEEE Access 2025, 13, 139665–139679.
11. Patel, S. Multi-Modal product recognition in retail environments: Enhancing accuracy through integrated vision and OCR approaches. World J. Adv. Res. Rev. 2025, 25, 1837–1844.
12. Chadha, S. Vision-Based Object Recognition in Retail. Int. J. Sci. Res. Eng. Trends 2025, 11, 1–8.
13. Agranata, I.Y.B.; Hamami, F.; Suakanto, S. Detection of Customer Interaction and Density Patterns Using Pose Estimation and Object Detection for Retail Store Layout Optimization. In Proceedings of the 2025 International Seminar on Intelligent Technology and Its Applications (ISITIA); IEEE: Piscataway, NJ, USA, 2025; pp. 230–235.
14. Yan, Q.; Feng, Y.; Zhang, C.; Pang, G.; Shi, K.; Wu, P.; Dong, W.; Sun, J.; Zhang, Y. HVI: A new color space for low-light image enhancement. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 5678–5687.
15. Xu, L.; Hu, C.; Hu, Y.; Jing, X.; Cai, Z.; Lu, X. UPT-Flow: Multi-scale transformer-guided normalizing flow for low-light image enhancement. Pattern Recognit. 2025, 158, 111076.
16. Feijoo, D.; Benito, J.C.; Garcia, A.; Conde, M.V. DarkIR: Robust low-light image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 10879–10889.
17. Ciubotariu, G.; Rehman, A.; Dharejo, F.A.; Naqvi, R.A.; Conde, M.V.; Timofte, R.; Jin, Z.; Wu, H.; Zhang, W.; Ye, C.; et al. Low Light Image Enhancement Challenge at NTIRE 2026. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2026.
18. Zhao, Q.; Li, G.; He, B.; Shen, R. Deep learning for low-light vision: A comprehensive survey. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 15685–15705.
19. Liu, F.; Fan, L. A review of advancements in low-light image enhancement using deep learning. Neurocomputing 2025, 652, 131052.
20. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 3007–3016.
21. Tian, J.; Lee, S.; Kang, K. Faster R-CNN in healthcare and disease detection: A comprehensive review. In Proceedings of the 2025 International Conference on Electronics, Information, and Communication (ICEIC); IEEE: Piscataway, NJ, USA, 2025; pp. 1–6.
22. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 6569–6578.
23. Ale, L.; Zhang, N.; Li, L. Road damage detection using RetinaNet. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data); IEEE: Piscataway, NJ, USA, 2018; pp. 5197–5200.
24. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
25. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
26. Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. RF-DETR object detection vs YOLOv12: A study of transformer-based and CNN-based architectures for single-class and multi-class greenfruit detection in complex orchard environments under label ambiguity. arXiv 2025, arXiv:2504.13099.
27. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653.
28. Afifah, V.; Erniwati, S. YOLOv8 for object detection: A comprehensive review of advances, techniques, and applications. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 53–61.
29. Ghahremani, A.; Adams, S.D.; Norton, M.; Khoo, S.Y.; Kouzani, A.Z. Detecting defects in solar panels using the YOLO v10 and v11 algorithms. Electronics 2025, 14, 344.
30. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. Adv. Neural Inf. Process. Syst. 2026, 38, 78433–78457.
31. Liu, Z.; Wang, J.; Wu, H.; Xue, F.; Qin, Z.; Sun, S.; Guo, X.; Zhao, F. Water-aware real-time detection of floating plastic debris via an enhanced YOLOv13 framework for aquatic pollution monitoring. Expert Syst. Appl. 2026, 313, 131552.
32. Hidayatullah, P.; Tubagus, R. YOLO26: A Comprehensive Architecture Overview and Key Improvements. arXiv 2026, arXiv:2602.14582.
33. Goldman, E.; Herzig, R.; Eisenschtat, A.; Goldberger, J.; Hassner, T. Precise detection in densely packed scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 5227–5236.

Figure 1. Overall algorithm flowchart.

Figure 2. Improved C3k2 module architecture.

Figure 3. Channel attention module.

Figure 4. Spatial attention module.

Figure 5. CBAM module.

Figure 6. CARAFE module.

Figure 7. Public dataset from the Alibaba Tianchi Competition.

Figure 8. CFOD dataset.

Figure 9. Visualization of multi-illumination augmentation results.

Figure 10. Single-class detection in retail scenes.

Figure 11. Multi-class detection in dense scenes.

Figure 12. Occlusion comparison experiment.

Figure 13. Detection comparison under different illumination levels on the CFOD test dataset. The first row shows the original test images, with brightness decreasing from left to right; the second row shows the detection results of the baseline YOLOv11n; the third row shows the predictions of the improved YOLOv11n-FOD model.

Figure 14. Metric comparison chart.

Figure 15. Results on the SKU-110K dataset.

Table 1. System environment configuration.

Name	Configuration Information
CPU	Intel(R) Core(TM) i7-14700K 3.40 GHz
Graphics Card	NVIDIA GeForce RTX 4080 (16 GB) / NVIDIA A6000 (50 GB)
Deep Learning Environment	CUDA v12.4 + cuDNN v8.9.8
PyTorch	2.5.1
Operating System	Windows 11 Professional Edition
Camera	HF899_27mm
Development Environment	PyCharm 2025.3 EAP
Optimizer	SGD
Epochs	100
Batch size	16
Learning rate	0.001
Augmentation	True
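
The training hyperparameters in Table 1 can be collected in a small configuration object. The following is a minimal sketch, not the authors' training script: the class name `TrainingConfig`, the helper `total_iterations`, and the dataset size of 1000 images are hypothetical and serve only to illustrate the arithmetic. The optimizer, epoch count, batch size, learning rate, and augmentation flag are taken from Table 1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in Table 1 (field names are illustrative)."""
    optimizer: str = "SGD"
    epochs: int = 100
    batch_size: int = 16
    learning_rate: float = 0.001
    augment: bool = True

    def total_iterations(self, num_train_images: int) -> int:
        """Optimizer steps over the full run, assuming the last partial batch is kept."""
        steps_per_epoch = math.ceil(num_train_images / self.batch_size)
        return steps_per_epoch * self.epochs


cfg = TrainingConfig()
# Hypothetical dataset size; the paper's actual training-set size is not used here.
print(cfg.total_iterations(1000))
```

With a batch size of 16, a hypothetical 1000-image training set would give 63 steps per epoch, so the iteration count scales linearly with the 100 epochs listed in the table.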

Table 2. Comparison of baseline models on the public dataset.

Model	GFLOPs	Recall	mAP@50	mAP@50-95	Inference Time (ms)
Faster R-CNN [21]	20.9	0.899	0.724	0.329	7.8
CenterNet [22]	15.32	0.902	0.714	0.330	4.3
RetinaNet [23]	48.64	0.897	0.710	0.328	6.3
DETR [24]	156.77	0.905	0.718	0.312	9.5
Deformable DETR [25]	89.30	0.908	0.733	0.315	7.2
Rf-DETR [26]	14.43	0.910	0.725	0.312	6.1
YOLOv5n [27]	4.50	0.885	0.880	0.420	2.6
YOLOv8n [28]	7.40	0.891	0.891	0.483	3.1
YOLOv10n [29]	7.80	0.749	0.749	0.430	2.7
YOLOv11n [3]	6.50	0.911	0.956	0.467	2.6
YOLOv12 [30]	6.30	0.907	0.954	0.456	3.8
YOLOv13 [31]	6.40	0.913	0.953	0.480	4.8
YOLOv26 [32]	5.80	0.788	0.855	0.432	2.1

Note: The values in bold represent the best performance in each column.
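
One way to read the accuracy/speed trade-off in Table 2 is to ask which models are not beaten on both mAP@50-95 and inference time by any other model. The sketch below computes that Pareto front from the values printed in Table 2. It is an illustrative analysis rather than part of the authors' evaluation, and the names `Entry`, `dominates`, and `pareto_front` are assumptions introduced here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    map50_95: float  # higher is better
    time_ms: float   # lower is better


# mAP@50-95 and inference time copied from Table 2.
TABLE2 = [
    Entry("Faster R-CNN", 0.329, 7.8),
    Entry("CenterNet", 0.330, 4.3),
    Entry("RetinaNet", 0.328, 6.3),
    Entry("DETR", 0.312, 9.5),
    Entry("Deformable DETR", 0.315, 7.2),
    Entry("Rf-DETR", 0.312, 6.1),
    Entry("YOLOv5n", 0.420, 2.6),
    Entry("YOLOv8n", 0.483, 3.1),
    Entry("YOLOv10n", 0.430, 2.7),
    Entry("YOLOv11n", 0.467, 2.6),
    Entry("YOLOv12", 0.456, 3.8),
    Entry("YOLOv13", 0.480, 4.8),
    Entry("YOLOv26", 0.432, 2.1),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.map50_95 >= b.map50_95 and a.time_ms <= b.time_ms
    strictly_better = a.map50_95 > b.map50_95 or a.time_ms < b.time_ms
    return no_worse and strictly_better


def pareto_front(entries: list[Entry]) -> list[str]:
    """Names of entries that no other entry dominates, in table order."""
    return [e.name for e in entries if not any(dominates(o, e) for o in entries)]


front = pareto_front(TABLE2)
print(front)
```

On these two metrics alone, only lightweight YOLO variants remain on the front; other columns of Table 2 (recall, mAP@50, GFLOPs) are deliberately left out of this two-axis view.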

Table 3. Comparative experiments of baseline models on CFOD dataset.

Model	GFLOPs	mAP@50	mAP@50-95	Recall	Inference Time (ms)
Faster R-CNN [21]	88.32	0.975	0.765	0.953	7.6
CenterNet [22]	15.32	0.976	0.764	0.896	3.4
RetinaNet [23]	48.64	0.973	0.763	0.953	6.7
DETR [24]	156.78	0.977	0.761	0.942	8.8
Deformable DETR [25]	8.92	0.975	0.763	0.938	8.3
Rf-DETR [26]	14.43	0.983	0.788	0.955	5.6
YOLOv5n [27]	4.50	0.977	0.772	0.948	2.6
YOLOv8n [28]	8.70	0.981	0.754	0.958	2.7
YOLOv10n [29]	8.70	0.963	0.772	0.964	2.5
YOLOv11n [3]	6.50	0.984	0.768	0.992	2.3
YOLOv12 [30]	6.30	0.973	0.763	0.986	2.7
YOLOv13 [31]	6.40	0.984	0.774	0.967	2.9
YOLOv26 [32]	5.80	0.983	0.771	0.961	1.3
YOLOv11n-FOD	6.60	0.992	0.776	0.995	2.4

Note: Bold values represent the optimal result for each evaluation metric.
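
The differences between the YOLOv11n baseline and YOLOv11n-FOD can be read directly off the two corresponding rows of Table 3. The snippet below is a small worked example of that arithmetic, using only the rounded values printed in the table. The helper name `absolute_deltas` is an assumption made for illustration.

```python
METRICS = ("GFLOPs", "mAP@50", "mAP@50-95", "Recall", "Inference time (ms)")

# Rows "YOLOv11n" and "YOLOv11n-FOD" as printed in Table 3.
baseline = dict(zip(METRICS, (6.50, 0.984, 0.768, 0.992, 2.3)))
improved = dict(zip(METRICS, (6.60, 0.992, 0.776, 0.995, 2.4)))


def absolute_deltas(base: dict, new: dict, ndigits: int = 3) -> dict:
    """Per-metric difference new - base, rounded to suppress float noise."""
    return {key: round(new[key] - base[key], ndigits) for key in base}


deltas = absolute_deltas(baseline, improved)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.3f}")
```

Because the inputs are already rounded to three decimals, the computed differences carry the same rounding uncertainty.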

Table 4. Ablation experiments of multi-module improvements.

Index	C3K2-SAC	CARAFE	CBAM + CONV	GFLOPs	mAP@50	mAP@50-95	Recall	Inference Time (ms)
0	×	×	×	6.5	0.984	0.768	0.966	2.1
1	✓	×	×	6.8	0.986	0.742	0.962	2.7
2	×	✓	×	5.8	0.984	0.753	0.968	2.6
3	×	×	✓	6.6	0.985	0.771	0.973	2.7
4	✓	✓	×	6.3	0.988	0.778	0.970	1.9
5	✓	×	✓	7.1	0.983	0.779	0.980	2.1
6	×	✓	✓	7.1	0.983	0.779	0.980	2.8
7	✓	✓	✓	6.8	0.988	0.778	0.976	2.8

Note: ✓ denotes module enabled, × denotes module removed; bold values indicate optimal value per metric.
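
The per-metric optimum referred to in the note can be recomputed from the rows of Table 4. The following is a minimal sketch under the assumption that lower is better for GFLOPs and inference time and higher is better for the other metrics. The names `ROWS`, `COLUMNS`, and `best_indices` are illustrative, and ties are reported as a list of all matching configuration indices.

```python
# Table 4 rows: index -> (GFLOPs, mAP@50, mAP@50-95, Recall, inference time in ms)
ROWS = {
    0: (6.5, 0.984, 0.768, 0.966, 2.1),
    1: (6.8, 0.986, 0.742, 0.962, 2.7),
    2: (5.8, 0.984, 0.753, 0.968, 2.6),
    3: (6.6, 0.985, 0.771, 0.973, 2.7),
    4: (6.3, 0.988, 0.778, 0.970, 1.9),
    5: (7.1, 0.983, 0.779, 0.980, 2.1),
    6: (7.1, 0.983, 0.779, 0.980, 2.8),
    7: (6.8, 0.988, 0.778, 0.976, 2.8),
}

# Column position and the direction in which a metric is considered optimal.
COLUMNS = {
    "GFLOPs": (0, min),
    "mAP@50": (1, max),
    "mAP@50-95": (2, max),
    "Recall": (3, max),
    "Inference time (ms)": (4, min),
}


def best_indices(rows: dict, col: int, pick) -> list[int]:
    """All configuration indices that share the optimal value in one column."""
    best_value = pick(values[col] for values in rows.values())
    return [idx for idx, values in rows.items() if values[col] == best_value]


best = {name: best_indices(ROWS, col, pick) for name, (col, pick) in COLUMNS.items()}
for name, indices in best.items():
    print(f"{name}: configuration(s) {indices}")
```

Reporting ties explicitly matters here because several configurations share the same printed value in more than one column.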

Table 5. Comparison experiment of different attention mechanisms.

Attention Type	GFLOPs	mAP@50	mAP@50-95	Recall	Inference Time (ms)
CBAM	6.6	0.988	0.788	0.976	2.3
CA	6.6	0.987	0.746	0.983	2.6
ECA	6.6	0.985	0.782	0.960	2.7
SENet	6.6	0.972	0.782	0.965	2.8
EMA	6.6	0.959	0.786	0.959	3.8

Note: The bold values in each column represent the optimal performance of the corresponding metric.


Share and Cite

MDPI and ACS Style

Zhou, Z.; Xie, K.; Zhang, W.; He, J. Foreign Object Detection Model for Retail Cabinets Under Complex Backgrounds. Electronics 2026, 15, 2920. https://doi.org/10.3390/electronics15132920

