Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout

Li, Jicai; Zhu, Meng; Ren, Honge

doi:10.3390/jimaging11100337

Open AccessArticle

Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout

by

Jicai Li

¹

,

Meng Zhu

^2,*

and

Honge Ren

^1,3,*

¹

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China

²

College of Information Engineering, Harbin University, Harbin 150076, China

³

Heilongjiang Forestry Intelligent Equipment Engineering Research Center, Harbin 150040, China

^*

Authors to whom correspondence should be addressed.

J. Imaging 2025, 11(10), 337; https://doi.org/10.3390/jimaging11100337

Submission received: 18 August 2025 / Revised: 18 September 2025 / Accepted: 24 September 2025 / Published: 27 September 2025

(This article belongs to the Section Computer Vision and Pattern Recognition)

Download

Browse Figures

Versions Notes

Abstract

The Automatic Checkout (ACO) task aims to accurately generate complete shopping lists from checkout images. Severe product occlusions, numerous categories, and cluttered layouts impose high demands on detection models’ robustness and generalization. To address these challenges, we propose the Edge-Embedded Multi-Feature Fusion Network (E2MF2Net), which jointly optimizes synthetic image generation and feature modeling. We introduce the Hierarchical Mask-Guided Composition (HMGC) strategy to select natural product poses based on mask compactness, incorporating geometric priors and occlusion tolerance to produce photorealistic, structurally coherent synthetic images. Mask-structure supervision further enhances boundary and spatial awareness. Architecturally, the Edge-Embedded Enhancement Module (E3) embeds salient structural cues to explicitly capture boundary details and facilitate cross-layer edge propagation, while the Multi-Feature Fusion Module (MFF) integrates multi-scale semantic cues, improving feature discriminability. Experiments on the RPC dataset demonstrate that E2MF2Net outperforms state-of-the-art methods, achieving checkout accuracy (cAcc) of 98.52%, 97.95%, 96.52%, and 97.62% on Easy, Medium, Hard, and Average mode, respectively. Notably, it improves by 3.63 percentage points in the heavily occluded Hard mode and exhibits strong robustness and adaptability in incremental learning and domain generalization scenarios.

Keywords:

automatic checkout; object detection; multi-feature fusion; edge enhancement

1. Introduction

With the rapid advancements in deep learning and computer vision, Automatic Checkout (ACO) technology has emerged as a key driver for intelligent transformation in the retail industry. ACO leverages computer vision to analyze product images in real time and automatically generate shopping lists. This approach overcomes the efficiency bottlenecks of manual checkout and supports the development of unmanned retail and personalized services. Its advances carry both strong theoretical significance and broad practical value.

However, ACO faces several technical challenges such as product occlusion, overlap, diverse poses, and high inter-class similarity. These difficulties demand stronger structural understanding, boundary perception, and fine-grained feature modeling.

Furthermore, as illustrated in Figure 1, ACO faces a significant domain gap: training data often consists of multi-view images of individual products, while test data comprises real-world checkout images containing multiple items. This discrepancy in data distribution hinders direct model generalization. In addition, the frequent update of product categories in retail scenarios introduces further challenges for model scalability. Consequently, the ACO task is essentially a complex problem of cross-domain fine-grained object detection and counting.

Existing research addresses these challenges mainly in two directions. The first is synthetic data generation, using methods such as cut-and-paste and style transfer. This alleviates data scarcity and supports cross-domain learning. However, synthetic samples often distort physical structures and fail to capture realistic layouts and occlusion patterns. As a result, a domain gap remains, which limits model generalization. The second direction focuses on strengthening feature representation. This improves detection robustness under occlusion, fine-grained categories, and complex layouts. Moreover, most detection frameworks are still designed for general tasks and lack the capacity to model edge features and structural details. In dense and heavily occluded checkout scenes, they often fall short of the accuracy required in retail applications.

To address these challenges, we propose a unified framework for data augmentation and structural modeling, called Edge-Embedded Multi-Feature Fusion Network (E2MF2Net). On the data side, we design a Hierarchical Mask-Guided Composition (HMGC) strategy that incorporates pose pruning and occlusion control mechanisms to ensure physical plausibility and structural realism of synthesized images. Additionally, mask-based supervision is introduced to enhance the model’s perception of boundaries and occluded regions. On the structural side, we construct an Edge-Embedded Enhancement Module (E3) and a Multi-Feature Fusion Module (MFF) to improve the model’s capability in structural modeling and semantic representation. Extensive experiments on the RPC dataset, including systematic ablation studies and incremental learning evaluations, demonstrate the effectiveness of our method in addressing challenges such as complex occlusions, cross-domain generalization, and category expansion.

In summary, the main contributions of this work are as follows:

We propose E2MF2Net, a novel detection framework for ACO tasks, which achieves collaborative optimization in image synthesis and feature modeling, effectively alleviating the insufficient detection accuracy of mainstream methods in complex occlusion scenarios.
A Hierarchical Mask-Guided Composition (HMGC) strategy based on pose stability and occlusion tolerance is designed to enhance the structural consistency and perceptual realism of synthetic images.
An Edge-Embedded Enhancement Module (E3) and a Multi-Feature Fusion Module (MFF) are constructed to respectively enhance the model’s boundary perception and multi-scale semantic integration capabilities.
The proposed method achieves leading performance on the RPC dataset and demonstrates strong generalization and robustness in scenarios such as incremental learning and complex occlusion.

2. Related Work

2.1. General Object Detection Frameworks

Object detection methods can be broadly categorized into two paradigms: one-stage and two-stage approaches. One-stage detectors, such as the YOLO [1,2] series, directly perform classification and regression on feature maps in a single forward pass, offering fast inference speed and suitability for real-time applications. In contrast, two-stage detectors, such as the R-CNN [3,4] family, decompose the detection pipeline into region proposal generation followed by classification and regression, achieving higher accuracy, especially in tasks involving severe occlusions or requiring fine-grained recognition.

Regardless of the detection paradigm, most frameworks adopt a Backbone-Neck-Head architecture. The Backbone (e.g., VGG16 [5], ResNet [6], MViT [7], MViTv2 [8]) serves as the fundamental module for visual feature extraction, providing low-level semantic representations. Notably, MViTv2 employs spatial downsampling, multi-head self-attention, and attention pooling. These mechanisms enhance object boundary localization and contextual modeling, which leads to strong performance in ACO tasks.

The Neck, such as the Feature Pyramid Network (FPN) [9], enhances multi-scale feature fusion to improve the detector’s sensitivity to objects of various sizes. The Head is responsible for final prediction tasks. For example, Cascade R-CNN [10] introduces multi-stage refinement mechanisms at the Head level, significantly boosting robustness and accuracy in complex scenes.

Recent studies have improved general object detection frameworks through lightweight design and feature optimization, enabling applications in hyperspectral classification [11], change detection [12], and small-object recognition [13]. Multimodal fusion and structural innovations have also been widely applied to autonomous driving [14] and railway defect detection [15], achieving efficient detection under limited computational resources. In addition, the integration of transformers and convolutions has advanced salient object detection and remote sensing segmentation, while cross-scale modeling further enhances model generalization [16,17,18,19].

2.2. Feature Representation and Structural Enhancement

To enhance the adaptability of detectors to multi-scale objects and complex scenes, extensive research has been conducted. A major line of work emphasizes the structural optimization of the Neck module. In particular, most studies focus on improving feature fusion and strengthening structural representation.

In terms of feature fusion, FPN and its variants (e.g., PANet [20], BiFPN [21]) enhance feature representation by exploring multi-scale feature fusion methods. Some studies [22,23,24,25] attempt to integrate multi-scale 2D feature maps into a unified high-dimensional space. This effectively constructs a “pseudo-3D” feature structure that models spatial layout, channel dependencies, and semantic content together. As a result, the contextual representation capabilities of the network are enhanced.

For structural enhancement, edge-aware mechanisms have been widely employed to strengthen the model’s ability to perceive object boundaries and structural details. Some works [26,27] explicitly guide the network to model object contours by fusing low-level edge details with high-level semantics, thus improving both structural awareness and detection accuracy. Other studies [28,29,30] have also proposed multi-level pooling strategies to explicitly extract structural boundary features.

Despite the notable progress in multi-scale semantic integration, current feature fusion approaches still fall short in modeling structural boundaries explicitly—particularly when dealing with fine-grained variations or heavy occlusions. Meanwhile, although edge enhancement methods improve the detector’s sensitivity to boundary cues, their integration with multi-scale semantic features remains suboptimal. Motivated by these observations, we propose a structural enhancement strategy that combines multi-feature fusion and edge awareness to improve the model’s capacity for complex structural modeling, thereby better addressing the fine-grained recognition and localization challenges inherent in ACO tasks.

2.3. Automatic Checkout Systems

With the continuous advancement of research in product visual recognition, a wide range of public datasets has been developed to support model training and evaluation across various application scenarios. However, significant differences exist among these datasets in terms of task suitability, scene complexity, and data distribution, making them difficult to apply directly to ACO tasks.

Early grocery-related datasets—such as SOIL-47 [31], Freiburg Groceries Dataset [32], SHORT [33], MVTec D2S [34], and Supermarket Produce Dataset [35]—are small and mainly support single-object classification, making them unsuitable for complex checkout scenarios. Shelf or warehouse datasets like GroZi-120 [36], Grocery Products Dataset [37], RP2K [38], SKU-110K [39], and Locount Dataset [40] feature orderly layouts, which differ from cluttered checkout scenes. Datasets for vending machines (UVMs [41], TGFS [42]) have narrow applicability, while large-scale e-commerce datasets (Products-6K [43], AliProducts [44], Products-10K [45], Product1M [46]) focus on retrieval or text-image matching and lack occlusion modeling.

These limitations motivate the use of datasets specifically designed for complex checkout scenarios. In contrast, the Retail Product Checkout (RPC) [47] dataset provides large-scale, richly annotated checkout images with natural stacking and occlusions, closely mimicking real cashier environments. Its high representativeness makes it the primary benchmark for ACO model training and evaluation in this study.

To address the challenges in ACO scenarios, various methods have been proposed based on the RPC dataset. Wei et al. [47] established a baseline using copy-pasted synthetic images and CycleGAN-based [48] adaptation. Li et al. [49] introduced a posture filtering method and DPNet for joint detection and counting. IncreACO [50] improved synthesis by incorporating real spatial layouts. DPSNet [51] extended DPNet with multi-scale feature learning and iterative distillation. Sun et al. [52] proposed CGFENet, integrating global and local attention into FPN. PSP [53] built prototype representations for similarity-based discrimination, followed by PLACO [54] which aligned cross-domain prototypes. S2MC2Net [55] used point-level supervision and density maps for counting, iCashier [56] introduced triplet learning against forgetting, and DEGNet [57] adopted a decoupled edge-image architecture for improved localization.

In addition, some recent works [58,59,60] have explored the adaptation or fine-tuning of YOLO series detectors to facilitate lightweight ACO deployment on edge devices. Recent advances in multi-view learning for handling incomplete multi-modal data [61,62] offer new perspectives for addressing the challenges of missing or conflicting multi-sensor data in ACO systems. While previous methods have partially addressed domain gap and occlusions, most still face limitations in synthesizing high-quality data and lack mechanisms for jointly modeling edge structures and multi-scale semantics. These shortcomings hinder the effectiveness of current approaches in challenging scenarios. Consequently, there remains considerable room for improving recognition accuracy and enhancing model generalization.

3. Methodology

3.1. Hierarchical Mask-Guided Composition

To address the limitations of existing synthetic product image generation methods in terms of spatial layout and structural consistency, a Hierarchical Mask-Guided Composition (HMGC) strategy is proposed. This method selects product instances with more natural and realistic poses by analyzing mask compactness, and incorporates pose pruning and structural control mechanisms to generate virtual checkout images that conform better to physical constraints and real-world scene structures.

Furthermore, we design a mask-supervised training pipeline to explicitly guide the model in perceiving object boundaries and spatial structures, thereby enhancing its robustness to the distribution gap between synthetic and real images (see Section 4.3 for details). The overall HMGC strategy comprises two key components: one is a pose pruning strategy based on mask compactness, which eliminates unnatural product placements; the other is a mask-aligned image composition and supervision mechanism, which enables high-quality image generation and promotes structural consistency learning.

3.1.1. Pose Pruning Strategy Based on Mask Compactness

For every product category

c \in C

, let the corresponding image set be

I_{c} = {I_{c}^{1}, I_{c}^{2}, \dots, I_{c}^{n}}

, where each item is captured from four viewpoints with multiple in-plane rotations. Background pixels are removed using the method in DPNet [49]. However, not every pose is physically plausible in real checkout scenarios—for example, hanging snack bags in a vertical pose or upside-down beverage bottles are rarely observed in practice. Consequently, eliminating pose-anomalous samples before composition is a necessary preprocessing step. To this end, we introduce a stability metric that combines mask area and boundary compactness to evaluate the geometric plausibility and visual usability of each instance. Specifically, for a single product image I, the non-zero region of its alpha mask is extracted, and the minimum area enclosing rotated rectangle

R (I)

is computed. The Mask Rect Rate (MRR) of I is defined as

MRR (I)

MRR (I) = \frac{| M a s k (I) |}{Area (R (I))},

(1)

where

M a s k (I)

is the foreground-mask pixel set,

| M a s k (I) |

is its cardinality (mask area) and

Area (R (I))

denotes the area of the enclosing rectangle. When

MRR (I) \to 1

, the product boundary is compact and the pose is considered stable and suitable for synthesis. For any

I_{c}^{i} \in I_{c}

, we further define its Stability Score as

S_{stable} (I_{c}^{i})

S_{stable} (I_{c}^{i}) = \frac{| M a s k (I_{c}^{i}) |}{M a s k_{c}^{max}} \cdot \frac{MRR (I_{c}^{i})}{{MRR}_{c}^{max}},

(2)

where

M a s k_{c}^{max} = {max}_{I_{c}^{i} \in I_{c}} | M a s k (I_{c}^{i}) |

is the maximum mask area in category c and

{MRR}_{c}^{max} = {max}_{I_{c}^{i} \in I_{c}} MRR (I_{c}^{i})

is the maximum compactness. The first term measures relative mask size within the category, and the second the relative compactness. All stability scores are clustered with Otsu’s method [63] to obtain an adaptive threshold

α

(set to 0.6 in this work). Images with

S_{stable} (I_{c}^{i}) \geq α

are retained as high-quality samples for subsequent composition.

In real checkout scenes, inter-product occlusions depend not only on spatial position but also on geometric and structural constraints. For example, a flimsy bag cannot realistically rest horizontally on top of an upright bottle. To mimic such physical realism, we define an Occlusion Tolerance (OT) for each category c as

OT (c) = 0.8 \sqrt{\frac{max (0, β (c) - 0.5)}{2.5}},

(3)

where the parameter

β (c)

is computed by

β (c) = \frac{M a s k_{top}^{min} (c)}{M a s k_{horizontal}^{max} (c)},

(4)

with

M a s k_{top}^{\min} (c)

the minimal mask area in the top-view, and

M a s k_{horizontal}^{\max} (c)

the maximal mask area in the horizontal view. When

β (c)

is larger, it indicates that the category exhibits similar shapes across viewpoints and can physically sustain a certain degree of occlusion or stacking; a smaller

β (c)

indicates lower tolerance to being occluded.

3.1.2. Image Composition Strategy

After pose pruning and sample filtering, we design a structured multi-stage image composition pipeline to paste multiple preprocessed single-product images onto a shared background, thereby generating checkout images with realistic occlusion patterns and geometrically plausible layouts. This process involves a series of position sampling steps, occlusion constraints, and paste scheduling rules to ensure that the resulting images are both visually realistic and structurally annotatable.The complete procedure of this strategy is summarized in the pseudocode provided in Algorithm 1.

Formally, let the set of composable elements be

S y n = {S y n_{1}, S y n_{2}, \dots, S y n_{n}}

, the background image be

S y n_{b}

, and the final composite image be

I_{s y n}

. We first sort

S y n

in descending order according to each item’s OT. Each item is augmented before being pasted, using random rotation (in the range

[- 180^{\circ}, 180^{\circ}]

) and scaling (scaling factor in

[0.4, 0.7]

) to increase diversity. Next, the first element

S y n_{1}

is pasted onto

S y n_{b}

using its foreground mask region to generate the initial composite image

I_{s y n}^{1}

. Its pasted region is recorded as mask

M_{1}

, and the initial mask set is

M = {M_{1}}

. For each subsequent element

S y n_{i} (i > 1)

, the following strategy is applied

Candidate Pasting Point Extraction via Edge Density:

-

Compute the union of existing masks

M_{union}^{i - 1} = ⋃_{j = 1}^{i - 1} M_{j}

;

-

Extract edge pixels from the union mask using the Sobel operator

P_{edge}^{i - 1} = Edge (M_{union}^{i - 1})

;

-

For each edge pixel

P (x, y) \in P_{edge}^{i - 1}

, calculate the non-zero pixel density

d (P (x, y))

within a square neighborhood

N (x, y)

of size

k \times k

(e.g., k=200), where

I [\cdot]

denotes the indicator function, which equals 1 if the condition is satisfied and 0 otherwise.

d (P (x, y)) = \frac{1}{k^{2}} \sum_{(i, j) \in N (x, y)} I [M_{union}^{i - 1} (x, y) > 0] .

(5)

-

Select low-density points

d (P (x, y)) < τ_{d}

(threshold set to 0.5) to form the candidate paste point set

P_{candidate}^{i - 1}

sorted in ascending order of density.

Algorithm 1 HMGC: Image Composition Strategy

1:: Input:
2:: $S y n$ : Composable elements sorted in descending order by OT value
3:: $S y n_{b}$ : Background image
4:: $τ_{d}$ : Edge density threshold (e.g., 0.5)
5:: k: Neighborhood size for density calculation (e.g., 200)
6:: $N_{\max}$ : Maximum paste attempts per item (e.g., 500)
7:: Output:
8:: $I_{s y n}$ : Generated synthetic checkout image
9:: $M$ : Set of instance masks for $I_{s y n}$
10:: // Initialize with first element
11:: $I_{s y n}^{1} \leftarrow Paste (S y n_{b}, S y n_{1})$
12:: $M_{1} \leftarrow GetPastedRegion (I_{s y n}^{1})$
13:: $M \leftarrow {M_{1}}$
14:: for $i \leftarrow 2$ to $| S y n |$ do
15:: $S y n_{i} \leftarrow RandomRotateAndScale (S y n_{i})$
16:: $(m a s k_{target}, I_{target}) \leftarrow S y n_{i}$
17:: $s u c c e s s \leftarrow False$
18:: $a t t e m p t s \leftarrow 0$
19:: // Find suitable paste location
20:: $M_{union}^{i - 1} \leftarrow ⋃_{j = 1}^{i - 1} M_{j}$
21:: $P_{edge}^{i - 1} \leftarrow Edge (M_{union}^{i - 1})$
22:: $P_{candidate}^{i - 1} \leftarrow FindLowDensityPointsAndSort (P_{edge}^{i - 1}, M_{union}^{i - 1}, τ_{d}, k)$
23:: while not $s u c c e s s$ and $a t t e m p t s < N_{\max}$ do
24:: if $P_{candidate}^{i - 1}$ is not empty then
25:: $p_{a t t e m p t s} \leftarrow SelectPointByIndex (P_{candidate}^{i - 1})$
26:: $p_{r} \leftarrow SampleNearbyPoint (p_{a t t e m p t s})$
27:: // Align and paste item
28:: $M_{k} \leftarrow FindSourceMask (M, p_{a t t e m p t s})$
29:: $I_{candidate} \leftarrow AlignAndPaste (I_{s y n}^{i - 1}, I_{target}, M_{k}, m a s k_{target}, p_{r})$
30:: $M_{i} \leftarrow GetPastedRegion (I_{candidate})$
31:: // Validate occlusion constraints
32:: if $ValidateOcclusion (M_{i}, M) ()$ then
33:: $I_{s y n}^{i} \leftarrow I_{candidate}$
34:: $M \leftarrow M \cup {M_{i}}$
35:: $s u c c e s s \leftarrow True$
36:: end if
37:: end if
38:: $a t t e m p t s \leftarrow a t t e m p t s + 1$
39:: end while
40:: end for
41:: return $I_{s y n}, M$

Geometric Alignment-Based Pasting:
-
Choose the first candidate point $P_{0} \in P_{candidate}^{i - 1}$ , and randomly sample a paste reference point $P_{r}$ within a 50-pixel neighborhood;
-
Let $M_{k} \in M$ be the mask containing $P_{0} \in M_{k}$ . Compute its minimum enclosing rotated rectangle with corner set $R e c t_{source} = {R_{0}^{s}, R_{1}^{s}, R_{2}^{s}, R_{3}^{s}}$ (in counter-clockwise order), where $R_{0}^{s}$ is the bottom-left vertex.
-
Similarly, compute the rectangle for $S y n_{i}$ : $R e c t_{target} = {R_{0}^{t}, R_{1}^{t}, R_{2}^{t}, R_{3}^{t}}$
-
Identify the two vertices ${R_{a}^{s}, R_{b}^{s}} (a, b \in {0, 1, 2, 3})$ of $R e c t_{source}$ closest to $P_{r}$ , then randomly select a vertex $R_{c}^{t} \in R e c t_{target}$ such that $c \notin {a, b}$ . Align $R_{c}^{t}$ to $P_{r}$ , and paste $S y n_{i}$ onto $I_{s y n}^{i - 1}$ accordingly, yielding candidate composite $I_{s y n}^{i}$ .
Occlusion Ratio Constraint Validation:
-
Let $M_{i}$ be the pasted region of $S y n_{i}$ in $I_{s y n}^{i}$ . For any previous mask $M_{j (j < i)} \in M$ , define its top-layer union as $M_{top - union}^{j} = (⋃_{k = j + 1}^{i - 1} M_{k}) \cup M_{i}$ ; Then, the occlusion ratio of $M_{j}$ is $r_{shade} (M_{j})$

$r_{shade} (M_{j}) = \frac{| M_{j} \cap M_{top - union}^{j} |}{| M_{j} |} .$

(6)

-
If $r_{shade} > 0.5$ , the paste is invalid; proceed to the next candidate point in $P_{candidate}^{i - 1}$ and repeat. If all attempts (max 500) fail, discard $S y n_{i}$ .
-
If occlusion constraints are satisfied, add $M_{i}$ to $M$ and update $I_{s y n}^{i}$ . Proceed to the next item.
Final Supervision and Annotation Generation: By enforcing OT-based priority and progressive occlusion validation, we ensure that occlusion-sensitive items appear on top, while geometrically stable items act as structural support. A dynamic supervision strategy ensures that the occlusion ratio of any object in the final virtual image remains within a tolerable range. After all elements in $S y n$ are successfully composed, we obtain a structurally compact and realistic image $I_{s y n}$ and its mask set $M$ . The set $M$ enables the generation of annotations compatible with the RPC dataset. Compared to the RPC test set, which only includes bounding boxes and center-point annotations, our pipeline provides instance-level masks, offering stronger structural supervision.

3.2. Architecture of the Detection Framework

As part of the proposed Edge-Embedded Multi-Feature Fusion Network (E2MF2Net), the detection framework is designed to effectively capture fine-grained product features in complex ACO scenarios. This framework primarily comprises two key modules: the Edge-Embedded Enhancement Module (E3) and the Multi-Feature Fusion Module (MFF), as shown in Figure 2.

Specifically, the E3 module enhances each feature level with top-down edge-guided processing, embedding salient structural cues into local feature maps to strengthen the representation of object boundaries. Building on this, the MFF module enhances the contextual modeling of semantic features while preserving edge-aware cues, ultimately producing more discriminative feature maps for prediction by the detection head.

3.2.1. Edge-Embedded Enhancement Module

The proposed E3 aims to overcome the limitations of conventional FPN architectures in preserving low-level structural cues—particularly their deficiency in representing object boundaries, contours, and local transitions. While standard top-down pathways in FPNs effectively propagate semantic context, they often dilute spatially detailed information from shallow layers, which is critical for fine-grained object localization under occlusions and background clutter.

As illustrated in Figure 2, the E3 module constructs an auxiliary enhancement path that independently processes each stage’s features to extract edge-aware representations. The design is implemented as follows.

Let the output feature map of the i-th stage from the backbone be denoted as

X^{i}

X^{i} \in R^{B \times \frac{H}{2^{i}} \times \frac{W}{2^{i}}}, i \in {2, 3, 4, 5},

(7)

where B is the batch size, and

H, W

are the height and width of the input image. To amplify structural cues, the edge-enhanced feature map

X_{edge}^{i}

is formulated as

X_{edge}^{i} = Concat ([E 3 (X^{i}), E 3 (E 3 (X^{i}))]) + X^{i},

(8)

where

Concat (\cdot)

denotes channel-wise concatenation, and

E 3 (\cdot)

denotes the E3 block for edge enhancement. Each

E 3 (\cdot)

performs salient edge extraction and adaptive fusion, formulated as

E 3 (X) = X + Att ((X - A P (X)), (X - M P (X))),

(9)

where

A P (\cdot)

and

M P (\cdot)

denote 3 × 3 average pooling and max pooling operations with stride 1, respectively. The subtraction from the original feature map emphasizes local edge transitions by suppressing uniform regions, thus highlighting spatial discontinuities. The differential features from average and max pooling are then adaptively fused via

Att (a, b)

Att (a, b) = γ a + (1 - γ) b,

(10)

where

γ \in [0, 1]

is a learnable attention weight that dynamically balances the fusion of differential edge responses from average and max pooling. The core idea of this design lies in performing multi-scale local context comparisons over the original feature map to enhance activations in edge transition regions. By stacking multiple E3 operations, the network expands its receptive field for edge-aware representations, thereby enriching boundary features at higher semantic levels. This attention mechanism offers a flexible way to preserve both soft contour transitions and hard edge boundaries, depending on content-specific spatial statistics.

The theoretical motivation behind this design lies in local contrastive filtering, where edge evidence is derived through contextual comparisons across multiple scales. Unlike conventional high-pass filters, which may amplify noise, the dual-path differential pooling coupled with learnable fusion adaptively selects discriminative transitions, enabling the network to retain structural consistency without sacrificing semantic abstraction.

To integrate edge-aware information into the multi-level FPN hierarchy, the enhanced features

X_{edge}^{i}

are fused with the upsampled output from the higher-level semantic features

X^{i^{'}} = Concat (X_{edge}^{i}, UP (X^{{(i + 1)}^{'}})), i \in {2, 3, 4},

(11)

where

UP (\cdot)

denotes bilinear upsampling with a scale factor of 2.

X^{5^{'}}

(equivalent to

X^{5}

) is the final high-level output from the backbone. This preserves the standard top-down semantic propagation while incorporating multi-scale edge information.

As illustrated in Figure 3, the E3 module forms a structurally aware multi-scale fusion loop. By reinforcing edge cues across hierarchical levels, this mechanism augments the saliency of detailed information within feature maps, thereby mitigating its dilution during top-down residual propagation. Consequently, this module strengthens the network’s capability in delineating object boundaries, disambiguating overlapping instances, and enhancing spatial coherence—factors that are crucial for detection performance, particularly in dense and occluded scenes.

3.2.2. Multi-Feature Fusion Module

While the E3 module effectively enhances edge-awareness in the feature hierarchy, the network may still encounter challenges in detecting objects under complex scenes due to limited semantic abstraction—particularly in the presence of background clutter or occlusions. To address this, we propose MFF, which aims to incorporate rich contextual semantics via multi-scale integration and hierarchical compression. The core intuition is to enable fine-to-coarse feature guidance, where lower-level details and higher-level semantics are jointly optimized to strengthen both localization and category-level discrimination.

We define the feature set output from the E3 module as

X = {X^{2^{'}}, X^{3^{'}}, X^{4^{'}}, X^{5^{'}}}

. Our goal is to synthesize a unified feature representation

P^{i}

for each level, preserving low-level precision while enriching high-level semantics.

To this end, for lower stages

i \in {2, 3}

, we perform multi-scale upsampling and feature compression through a recursive top-down fusion scheme

P^{i} = P U (E C (X^{i^{'}}, UP (X^{{(i + 1)}^{'}}), UP (UP (X^{{(i + 2)}^{'}})))),

(12)

Here,

E C (a, b, c)

expands and concatenates features across three adjacent levels, enabling hierarchical cross-talk among levels and preserving contextual richness

E C (a, b, c) = [a : b : c],

(13)

and

P U (\cdot)

represents a lightweight 3D convolutional compression unit, composed of a 1 × 1 × 1 3D convolution, normalization, activation, spatial pooling, and dimensionality reduction. This module acts as an inter-scale context projector, learning joint correlations across spatial and scale dimensions and enhancing cross-level semantic continuity.

For higher layers in

X

, which already encode sufficient abstraction, we apply conventional convolutional refinement

P^{4} = {Conv}_{3 \times 3} (X^{4^{'}}),

(14)

P^{5} = {Conv}_{3 \times 3} (X^{5^{'}}),

(15)

P^{6} = {Conv}_{3 \times 3} ({MaxPool}_{3 \times 3, s = 2} (P^{5})) .

(16)

The final multi-scale feature set is

P = {P^{2}, P^{3}, P^{4}, P^{5}, P^{6}}

, which is passed to the detection head for final prediction.

The MFF module is grounded in feature-scale disentanglement and multi-resolution consistency. By explicitly modeling inter-scale dependencies through hierarchical fusion, the module enables the network to reconcile spatial accuracy with semantic abstraction. Unlike standard additive FPN fusion, our EC-PU scheme provides nonlinear context transformation and semantic continuity, yielding greater robustness in dense or occluded scenes.

4. Experiments

4.1. Dataset Details

To systematically evaluate the proposed image composition strategy and detection model in the context of ACO, we adopt the most representative and challenging public benchmark—the RPC dataset.

This dataset contains 200 retail product categories and 83,739 images, including 53,739 single-product images and 30,000 checkout images. The single-product images are captured under controlled lighting and backgrounds from various viewpoints, while the checkout images simulate realistic cashier scenes with diverse object arrangements, such as rotations, overlaps, and occlusions. Among the 30,000 real checkout images, 6000 are selected for auxiliary training, while the remaining 24,000 are used for performance evaluation. Furthermore, the dataset provides rich annotations (including shopping lists, point-level positions, and bounding boxes), and categorizes checkout scenes into three occlusion difficulty levels—easy (3–10 instances), medium (10–15 instances), and hard (15–20 instances). Additionally, we utilize the SKU-110K [39] dataset to further validate the model’s performance across diverse benchmarks. This large-scale dataset is characterized by densely packed retail products on shelves and contains over 11,000 images with bounding box annotations.

To address the domain gap between single-product and checkout images, we employ the proposed HMGC strategy to generate 46,000 high-fidelity synthetic checkout images for training with a fixed resolution of 1815 × 1815 pixels. This greatly enhances the diversity and generalization capacity of the training data. A summary of the datasets is shown in Table 1.

4.2. Evaluation Metrics

To comprehensively assess model performance in the ACO task, we utilize a set of specialized metrics defined by the RPC [47] dataset, including Checkout Accuracy (cAcc), Average Counting Distance (ACD), Mean Category Counting Distance (mCCD), and Mean Category Intersection over Union (mCIoU). These are formally defined as follows.

cAcc evaluates whether the predicted shopping list (both category and quantity) is completely correct for each image. As the most critical metric for ACO systems, higher cAcc indicates better commercial feasibility. It is computed as

cAcc = \frac{1}{N} \sum_{i = 1}^{N} δ (\sum_{k = 1}^{K} | {\hat{y}}_{i, k} - y_{i, k} |, 0),

(17)

where N is the number of images, K is the total number of product categories,

{\hat{y}}_{i, k}

and

y_{i, k}

denote the predicted and ground-truth count for category k in image i, respectively. The indicator function

δ (a, b)

= 1 if

a = b

, and 0 otherwise.

ACD measures the total absolute count error per image across all categories

ACD = \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} |{\hat{y}}_{i, k} - y_{i, k}| .

(18)

mCCD computes the average relative error per category across the dataset

mCCD = \frac{1}{K} \sum_{k = 1}^{K} \frac{\sum_{i = 1}^{N} |{\hat{y}}_{i, k} - y_{i, k}|}{\sum_{i = 1}^{N} y_{i, k}} .

(19)

mCIoU measures the similarity between predicted and ground-truth shopping lists in a set-wise manner

mCIoU = \frac{1}{K} \sum_{k = 1}^{K} \frac{\sum_{i = 1}^{N} min ({\hat{y}}_{i, k}, y_{i, k})}{\sum_{i = 1}^{N} max ({\hat{y}}_{i, k}, y_{i, k})} .

(20)

In addition to these ACO-specific classification and counting metrics, we also report mAP50 and mmAP following the COCO [64] evaluation protocol, to assess detection quality in terms of object localization and bounding box accuracy.

4.3. Implementation Details

The proposed detection network is implemented based on Detectron2 [65], with all training and inference conducted using PyTorch [66] 2.3.0 and CUDA 11.8. All experiments are run on a workstation equipped with an NVIDIA RTX 4090 GPU, with Automatic Mixed Precision (AMP) enabled to fully exploit Tensor Core acceleration. The overall architecture adopts the Cascade Mask R-CNN detection framework with MViTv2 as the backbone. The AdamW optimizer and step-based learning rate decay are used, with decay to 0.1 and 0.01 of the original rate at 0.6 and 0.9 of total iterations.

To bridge the significant appearance and distribution gap between synthetic and real checkout images, we propose a mask-supervised two-stage training paradigm (pretrain-then-transfer) that is designed to enhance the model’s cross-domain generalization and structural awareness without introducing additional domain adaptation modules.

In the first stage, we utilize large-scale synthetic checkout images with precise instance-level mask annotations to pretrain the model. Input images are resized to 448 × 448, and the training set contains 46,000 automatically composed images. The model is trained using a batch size of 10 and an initial learning rate of 5 × 10⁻⁴ for a total of 60,000 iterations.

The mask branch plays a crucial role in structural guidance by supervising instance boundaries and enhancing context modeling. By leveraging mask supervision, the network learns to identify object instances precisely while capturing inter-instance structural relationships.

In the second stage, the input size is fixed at 832 × 832, and 6000 real checkout images are used for fine-tuning. A batch size of 5 is adopted, with the initial learning rate set to 1 × 10⁻⁴ for a total of 36,000 iterations. For evaluation on the SKU-110K dataset, we use an input size of 832 × 832, with 8233 training images and 2941 testing images. A batch size of 5 is adopted, with the initial learning rate set to 5 × 10⁻⁴ for a total of 50,000 iterations.

The model is initialized with the weights from the pretraining stage, but the mask branch is removed, with only the box head retained for detection on real images. This transfer process enables efficient adaptation to the target domain while preserving the structural priors learned from the synthetic data.

4.4. Experimental Results

To comprehensively evaluate the effectiveness of the proposed method in complex product detection scenarios, we compare it against advanced ACO methods from representative technical paradigms, including DPSNet (multi-scale detection) [51], PLACO (prototype-based alignment) [54], and DEGNet (decoupled edge guidance) [57]. These methods are selected as they represent the strongest and most competitive solutions currently available in the ACO literature. For each method, we cite the best performance reported in the original publication on the RPC dataset using the same evaluation metrics, ensuring a fair and representative comparison, as summarized in Table 2.

Our method consistently achieves the best performance across all evaluation metrics and occlusion difficulty levels, demonstrating its superior generalization and fine-grained modeling capabilities. Specifically, our method achieves cAcc of 98.52%, 97.95%, 96.52% and 97.62% under easy, medium, hard, and average mode, improving upon DEGNet by 1.32, 2.20, 3.63 and 2.37 percentage points. In localization performance, the relatively high mCIoU (99.75%) combined with low values for ACD (0.03) and mCCD (0.00) indicates stable detection and counting capability. When evaluated under COCO standards, our method shows improvements of 0.32 and 2.32 percentage points over DEGNet in mAP50 and mmAP, respectively, demonstrating consistent progress in localization accuracy.

We repeated the experiments five times. Compared with DEGNet (71.83 M parameters, 95.25 ± 0.07% cAcc), our E2MF2Net (97.88 M parameters, 97.62 ± 0.05% cAcc) shows statistically significant improvements. A paired t-test confirms the difference (p = 2.22 × 10⁻¹⁶ < 0.05), and the effect size is large (Cohen’s d = 36.74), indicating that our method exhibits distinctiveness.

As shown in Figure 4, we visualize detection results of E2MF2Net across different occlusion levels, where each row corresponds to easy, medium, and hard modes, respectively.

The results on the SKU-110K dataset are shown in Table 3. Our model achieves 42.2% mmAP and 64.5% mAP50. The performance is comparable to YOLOv7 and Deformable DETR, but lower than the task-specific DeCo-DETR. These results show that the method retains a degree of generalization even without architectural changes, tuning, or task-specific data augmentation.

4.5. Ablation Study

To thoroughly investigate the individual contributions of each proposed module, we conduct four groups of ablation experiments by progressively incorporating the key components: HMGC, E3, and MFF. All experiments are conducted based on the same settings and evaluation protocols to ensure fair comparisons. The results are presented in Table 4.

Experiment A, the baseline without any structural enhancement or synthetic data, achieves 93.93% cAcc under average mode, with notable drops on hard cases (90.80%) and higher ACD value (0.09). Adding HMGC in Experiment B provides modest gains in cAcc (+0.32 percentage points under average mode), as the additional generated samples improve generalization. Incorporating E3 in Experiment C further boosts performance, especially in medium and hard mode, where cAcc under hard mode increases by 3.77 percentage points over Experiment B, and ACD is reduced from 0.08 to 0.05, showing the benefit of enhanced boundary features. Most significantly, adding MFF to form the full model (Experiment D) yields the highest scores across all metrics, with average cAcc reaching 97.62%, mCIoU up to 99.75%, and minimal counting deviation (ACD = 0.03, mCCD = 0.00). These results confirm that HMGC, E3, and MFF complement each other, jointly improving classification robustness, spatial localization, and counting precision.

To comprehensively assess the impact of each component, we also report the computational costs in terms of parameter count and inference latency, as detailed in Table 5. The baseline model (without E3 or MFF) contains 83.48 M parameters and processes an image in 73.3 ms. Integrating the E3 module introduces 11.51 M additional parameters and increases latency by only 1.6 ms, indicating its efficient design for edge-aware enhancement. The subsequent inclusion of the MFF module adds a further 2.89 M parameters and 2.9 ms in latency, culminating in a full model with 97.88 M parameters and an inference speed of 77.8 ms per image. This marginal increase in computational overhead is justified by the significant gains in accuracy and robustness demonstrated in our experiments. More importantly, the final latency remains within a practical range for real-world deployment, affirming that E2MF2Net successfully balances performance with efficiency.

4.6. Incremental Learning

In practical ACO deployments, the continual expansion of product inventories requires detection models to support incremental learning, enabling them to quickly adapt to new product categories without forgetting previously learned classes.

To evaluate the scalability and robustness of our model under incremental learning scenarios, we conduct a three-stage evaluation simulating evolving category distributions in the RPC dataset. Following the protocol of [54], we use 183 categories as base classes and the remaining 17 categories as novel classes to ensure consistency with prior work.

In the first configuration (denoted as 183/183), the model is trained and evaluated solely on the 183 base classes. Then, in the second configuration (183/200), the same model is directly evaluated on the complete 200-class test set without further training, which allows us to assess its zero-shot generalization ability to the 17 unseen categories. Finally, in the third configuration ((183 + 17)/200), the model is incrementally fine-tuned using training samples from the 17 novel classes—while retaining data from the base classes—and subsequently evaluated on the full 200-class test set.

For comparison, we include IncreACO [50], PLACO [54] and DEGNet [57] in our evaluation as representative incremental ACO methods. The experimental results are summarized in Table 6.

The proposed method first achieves a high accuracy of 97.01% on the 183 base classes, outperforming existing approaches and demonstrating strong initial learning capability. After the incremental fine-tuning on the 17 novel categories, the model restores its overall performance to 96.59% on the full 200-class test set, showing minimal degradation on the base classes while maintaining strong generalization across all categories.

These results demonstrate that E2MF2Net exhibits strong resistance to catastrophic forgetting through a simple fine-tuning procedure, without employing any specialized continual learning techniques. This capability originates from the model’s architectural design: HMGC pre-training establishes robust structural priors, while the E3 and MFF modules enhance feature discriminability and stability. Consequently, the model adapts to new classes with minimal interference to its existing knowledge, confirming the intrinsic architectural robustness against forgetting.

5. Discussion

The present study introduces E2MF2Net, a unified framework designed for the ACO task. This framework simultaneously addresses key challenges in ACO, including the domain gap, fine-grained product recognition, and scalable incremental learning, by coordinating innovations at both the data and model levels. The method comprises three core components: the HMGC strategy, the MFF module, and the E3 module.

HMGC generates synthetic images that are both photorealistic and structurally consistent through pose pruning and structure-aligned composition. This process injects crucial structural priors into the learning process. More importantly, HMGC produces large-scale and structurally plausible synthetic images. These images compel the model to learn a deeper and more general understanding of product physics and occlusion. The model does not simply memorize pixel patterns. Instead, it acquires a robust and well-organized feature foundation. This foundation supports a generalized representation of checkout scenes. As a result, the features learned for base categories are less likely to be overwritten during fine-tuning. This process enhances the model’s incremental learning ability and strengthens its resistance to catastrophic forgetting. When integrated with a mask-supervised training regime, the network learns instance boundaries and spatial relationships between instances with high fidelity, significantly enhancing robustness during the transition from synthetic to real checkout images. Notably, both HMGC and the mask branch are active only during training and are entirely excluded during inference. This design ensures substantial performance improvements without introducing any additional computational cost at runtime.

The E3 module is designed to counteract the tendency of conventional feature pyramids to dilute boundary information. It leverages difference-pooling-guided, edge-aware attention to preserve fine-grained structural cues and propagate them to higher feature levels. This capability allows the model to resolve ambiguous boundaries between heavily overlapping instances, which is essential for distinguishing products under severe occlusion. Complementarily, the MFF module employs a recursive expand-compress fusion mechanism across adjacent feature levels. This process retains edge-aware details while enriching contextual semantics, thereby achieving a balance between spatial precision and semantic abstraction. As a result, MFF strengthens the discriminative feature representations necessary for fine-grained category recognition. Furthermore, the combined operation of E3 and MFF contributes to mitigating catastrophic forgetting. E3’s precise boundary modeling establishes more stable and discriminative feature representations. This reduces interference between old and new categories during fine-tuning. At the same time, MFF’s multi-scale integration enhances the robustness and generalizability of these features. As a result, they are less likely to be overwritten when learning new tasks.

Evaluated on the RPC dataset, E2MF2Net outperforms current state-of-the-art methods. The model attains cAcc of 98.52%, 97.95%, 96.52%, and 97.62% in the Easy, Medium, Hard, and Average mode, respectively, and yields a notable improvement of 3.63 percentage points in the Hard mode relative to a strong baseline (DEGNet). Given that ACO emphasizes complete and accurate shopping-list generation, instance-level counting and categorical correctness (i.e., cAcc) serve as the primary evaluation criteria. Although COCO-style metrics such as mAP50 and mmAP are treated as supplementary diagnostics, the observed improvements in these metrics further corroborate enhanced localization accuracy.

Ablation studies validate the contribution of each component, and incremental-learning experiments demonstrate that the model experiences only minor degradation after the inclusion of novel classes and can rapidly recover via fine-tuning. We acknowledge that existing continual learning methods offer effective mechanisms for mitigating catastrophic forgetting. However, these methods often impose additional constraints and introduce significant computational overhead. In contrast, our framework achieves competitive incremental learning performance through architectural innovations and pre-training strategies alone, without relying on explicit continual learning techniques. The observed resistance to forgetting indicates that, in application domains such as ACO, carefully designed pre-training and feature representation strategies can substantially reduce the need for complex continual learning algorithms.

Despite the encouraging results, several limitations of the proposed method remain and suggest directions for future work. A first limitation lies in the computational cost of HMGC. Its

O (n^{2})

mask and edge operations become heavy when processing high-resolution images. This is manageable for checkout-counter images with fixed sizes (e.g., 1815 × 1815), but scaling to larger scenes is challenging. In addition, HMGC works well for rigid products but struggles with deformable items such as soft packages. It also has difficulty modeling extreme occlusions. These issues are caused by its reliance on 2D composition and the limited diversity of available single-product captures.

Dataset reliance is another constraint. RPC, while the largest and most representative ACO benchmark, focuses only on checkout-counter layouts. This narrow scope limits evaluation diversity. Results on SKU-110K show that our method does not fully adapt to shelf-style settings, which contain dense arrangements and many small objects. The absence of specialized post-processing and dataset-specific augmentation further reduces performance in such cases. Moreover, most ACO datasets remain unimodal, relying only on visual information. This prevents the model from exploiting complementary signals such as depth sensing or RFID.

While the proposed method achieves competitive accuracy, we acknowledge certain limitations in computational efficiency. The main constraints on parameters and inference speed arise from the complexity of the MViTv2 backbone and the Cascade R-CNN framework. Notably, the HMGC strategy operates only during the data generation phase and does not introduce any overhead at inference. In addition, the E3 and MFF modules contribute only marginally to the total parameter count and inference latency, as confirmed by our ablation studies.

Future work will address these challenges. One direction is to integrate 3D-aware modeling or physics-based simulation into HMGC, improving the handling of deformable items and occlusion. Another is to combine the framework with continual learning to support growing product catalogs while avoiding catastrophic forgetting. Deployment efficiency could also be improved through lighter backbones, knowledge distillation, pruning, and quantization. Furthermore, we plan to improve dense small object detection by optimizing post-processing components, including NMS strategies and IoU-based loss functions. These enhancements aim to boost generalization across datasets with different object distributions and scales. Finally, multimodal inputs and domain adaptation strategies should be explored to strengthen robustness in heavily occluded or diverse retail environments.

Author Contributions

Conceptualization, H.R.; methodology, H.R.; software, J.L. and M.Z.; validation, M.Z. and H.R.; formal analysis, J.L. and M.Z.; resources, H.R.; data curation, J.L. and M.Z.; writing—original draft preparation, J.L. and M.Z.; writing—review and editing, M.Z. and H.R.; visualization, J.L.; supervision, H.R.; project administration, H.R.; funding acquisition, H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was made possible by the project of Harbin Science and Technology Plan (ZC2022ZJ010027), the Young Doctoral Research Initiation Fund Project of Harbin University (HUDF2022110) and the Natural Science Foundation of Heilongjiang Province(LH2024F047). Their contributions have been instrumental in enabling us to conduct this study and reach our research goals.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in [Retail Product Checkout Dataset] at [https://rpc-dataset.github.io] (accessed on 1 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
He, K.; Gkioxari, G.; DollÁR, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6804–6815. [Google Scholar]
Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4804. [Google Scholar]
Lin, T.Y.; DollÁR, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Li, C.; Rasti, B.; Tang, X.; Duan, P.; Li, J.; Peng, Y. Channel-Layer-Oriented Lightweight Spectral–Spatial Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
Wang, M.; Zhu, B.; Zhang, J.; Fan, J.; Ye, Y. A Lightweight Change Detection Network Based on Feature Interleaved Fusion and Bistage Decoding. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2557–2569. [Google Scholar] [CrossRef]
Lv, Y.; Tian, B.; Guo, Q.; Zhang, D. A Lightweight Small Target Detection Algorithm for UAV Platforms. Appl. Sci. 2025, 15, 12. [Google Scholar] [CrossRef]
Khan, A.A.; Shao, J.; Rao, Y.; She, L.; Shen, H.T. LRDNet: Lightweight LiDAR Aided Cascaded Feature Pools for Free Road Space Detection. IEEE Trans. Multimed. 2025, 27, 652–664. [Google Scholar] [CrossRef]
Cao, Y.; Liu, Y.; Sun, Y.; Su, S.; Wang, F. Enhancing rail safety through real-time defect detection: A novel lightweight network approach. Accid. Anal. Prev. 2024, 203, 107617. [Google Scholar] [CrossRef] [PubMed]
Qiu, Y.; Liu, Y.; Zhang, L.; Lu, H.; Xu, J. Boosting Salient Object Detection with Transformer-Based Asymmetric Bilateral U-Net. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2332–2345. [Google Scholar] [CrossRef]
Zheng, W.; Lu, S.; Yang, Y.; Yin, Z.; Yin, L. Lightweight transformer image feature extraction network. PeerJ Comput. Sci. 2024, 10, e1755. [Google Scholar] [CrossRef]
Liu, Y.; Zhang, Y.; Wang, Y.; Mei, S. Rethinking Transformers for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3123–3136. [Google Scholar] [CrossRef]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
Wang, X.; Zhang, S.; Yu, Z.; Feng, L.; Zhang, W. Scale-equalizing pyramid convolution for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13359–13368. [Google Scholar]
Park, H.J.; Kang, J.W.; Kim, B.G. ssFPN: Scale sequence (S²) feature-based feature pyramid network for object detection. Sensors 2023, 23, 4432. [Google Scholar] [CrossRef] [PubMed]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C.W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
Liu, S.; Gao, K.; Qin, J.; Gong, H.; Wang, H.; Zhang, L.; Gong, D. SE2Net: Semantic segmentation of remote sensing images based on self-attention and edge enhancement modules. J. Appl. Remote Sens. 2021, 15, 026512. [Google Scholar] [CrossRef]
Yang, L.; Gu, Y.; Feng, H. Multi-scale Feature Fusion and Feature Calibration with Edge Information Enhancement for Remote Sensing Object Detection. Sci. Rep. 2025, 15, 15371. [Google Scholar] [CrossRef]
Li, K.; Geng, Q.; Wan, M.; Cao, X.; Zhou, Z. Context and Spatial Feature Calibration for Real-Time Semantic Segmentation. IEEE Trans. Image Process. 2023, 32, 5465–5477. [Google Scholar] [CrossRef] [PubMed]
Gao, S.; Zhang, P.; Yan, T.; Lu, H. Multi-scale and Detail-Enhanced Segment Anything Model for Salient Object Detection. In Proceedings of the ACM International Conference on Multimedia, Melbourne, Australia, 28 October 2024–1 November 2024; pp. 9894–9903. [Google Scholar]
Koubaroulis, D.; Matas, J.; Kittler, J. Evaluating Colour-Based Object Recognition Algorithms Using the SOIL-47 Database. In Proceedings of the Asian Conference on Computer Vision, Melbourne, Australia, 23–25 January 2002; Volume 2, pp. 840–845. [Google Scholar]
Jund, P.; Abdo, N.; Eitel, A.; Burgard, W. The Freiburg Groceries Dataset. arXiv 2016, arXiv:1611.05799. [Google Scholar] [CrossRef]
Rivera-Rubio, J.; Idrees, S.; Alexiou, I.; Hadjilucas, L.; Bharath, A.A. A dataset for Hand-Held Object Recognition. In Proceedings of the IEEE International Conference on Image Processing, Paris, France, 27–30 October 2014; pp. 5881–5885. [Google Scholar] [CrossRef]
Follmann, P.; Böttger, T.; Härtinger, P.; König, R.; Ulrich, M. MVTec D2S: Densely Segmented Supermarket Dataset. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 581–597. [Google Scholar] [CrossRef]
Rocha, A.; Hauagge, D.C.; Wainer, J.; Goldenstein, S. Automatic fruit and vegetable classification from images. Comput. Electron. Agric. 2010, 70, 96–104. [Google Scholar] [CrossRef]
Merler, M.; Galleguillos, C.; Belongie, S. Recognizing Groceries in situ Using in vitro Training Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007. [Google Scholar] [CrossRef]
George, M.; Floerkemeier, C. Recognizing Products: A Per-exemplar Multi-label Image Classification Approach. In Computer Vision–ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Pt. II; Lecture Notes in Computer Science; Springer: Zurich, Switzerland, 2014; Volume 8690, pp. 440–455. [Google Scholar] [CrossRef]
Peng, J.; Xiao, C.; Wei, X.; Li, Y. RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification. arXiv 2020, arXiv:2006.12634. [Google Scholar]
Goldman, E.; Herzig, R.; Eisenschtat, A.; Goldberger, J.; Hassner, T. Precise Detection in Densely Packed Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5227–5236. [Google Scholar]
Cai, Y.; Wen, L.; Zhang, L.; Du, D.; Wang, W. Rethinking Object Detection in Retail Stores. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 947–954. [Google Scholar]
Zhang, H.; Li, D.; Ji, Y.; Zhou, H.; Wu, W.; Liu, K. Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending Machines. IEEE Trans. Ind. Inform. 2020, 16, 7722–7731. [Google Scholar] [CrossRef]
Hao, Y.; Fu, Y.; Jiang, Y.G. Take Goods from Shelves: A Dataset for Class-Incremental Object Detection. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 271–278. [Google Scholar] [CrossRef]
Georgiadis, K.; Kordopatis-Zilos, G.; Kalaganis, F.; Migkotzidis, P.; Chatzilari, E.; Panakidou, V.; Pantouvakis, K.; Tortopidis, S.; Papadopoulos, S.; Nikolopoulos, S.; et al. Products-6K: A Large-Scale Groceries Product Recognition Dataset. In Proceedings of the PETRA Conference, Virtual, 29 June–2 July 2021; pp. 1–7. [Google Scholar]
Cheng, L.; Zhou, X.; Zhao, L.; Li, D.; Shang, H.; Zheng, Y.; Pan, P.; Xu, Y. Weakly Supervised Learning with Side Information for Noisy Labeled Images. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 306–321. [Google Scholar]
Bai, Y.; Chen, Y.; Yu, W.; Wang, L.; Zhang, W. Products-10K: A Large-scale Product Recognition Dataset. arXiv 2020, arXiv:2008.10545. [Google Scholar]
Zhan, X.; Wu, Y.; Dong, X.; Wei, Y.; Lu, M.; Zhang, Y.; Xu, H.; Liang, X. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining. In Proceedings of the 2021 IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11762–11771. [Google Scholar]
Wei, X.S.; Cui, Q.; Yang, L.; Wang, P.; Liu, L. RPC: A Large-Scale Retail Product Checkout Dataset. arXiv 2019, arXiv:1901.07249. [Google Scholar] [CrossRef]
Chu, C.; Zhmoginov, A.; Sandler, M. CycleGAN, a Master of Steganography. arXiv 2017, arXiv:1712.02950. [Google Scholar] [CrossRef]
Li, C.; Du, D.; Zhang, L.; Luo, T.; Wu, Y.; Tian, Q.; Wen, L.; Lyu, S. Data Priming Network for Automatic Check-Out. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; Amsaleg, L., Huet, B., Larson, M.A., Gravier, G., Hung, H., Eds.; ACM: New York, NY, USA, 2019; pp. 2152–2160. [Google Scholar] [CrossRef]
Yang, Y.; Sheng, L.; Jiang, X.; Wang, H.; Xu, D.; Cao, X. IncreACO: Incrementally Learned Automatic Check-out with Photorealistic Exemplar Augmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 626–634. [Google Scholar]
Zhang, L.; Du, D.; Li, C.; Wu, Y.; Luo, T. Iterative Knowledge Distillation for Automatic Check-Out. IEEE Trans. Multim. 2021, 23, 4158–4170. [Google Scholar] [CrossRef]
Sun, Y.; Luo, T.; Zuo, Z. Context-guided feature enhancement network for automatic check-out. Neural Comput. Appl. 2022, 34, 593–606. [Google Scholar] [CrossRef]
Chen, H.; Wei, X.S.; Zhang, F.; Shen, Y.; Xu, H.; Xiao, L. Automatic Check-Out via Prototype-Based Classifier Learning from Single-Product Exemplars. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., CissÉ, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2022; Volume 13685, pp. 277–293. [Google Scholar] [CrossRef]
Chen, H.; Wei, X.S.; Xiao, L. Prototype Learning for Automatic Check-Out. IEEE Trans. Multim. 2023, 25, 9147–9160. [Google Scholar] [CrossRef]
Chen, H.; Zhou, Y.; Li, J.; Wei, X.S.; Xiao, L. Self-Supervised Multi-Category Counting Networks for Automatic Check-Out. IEEE Trans. Image Process. 2022, 31, 3004–3016. [Google Scholar] [CrossRef] [PubMed]
Chen, H.; Wei, X.; Xiao, L. Incremental learning method for intelligent retail automatic check-out. J. Nanjing Univ. Sci. Technol. 2024, 48, 74–81. [Google Scholar]
You, R.; He, F.; Lin, W. Decoupled Edge Guidance Network for Automatic Checkout. Int. J. Neural Syst. 2023, 33, 2350049:1–2350049:17. [Google Scholar] [CrossRef] [PubMed]
Tan, L.; Liu, S.; Gao, J.; Liu, X.; Chu, L.; Jiang, H. Enhanced Self-Checkout System for Retail Based on Improved YOLOv10. J. Imaging 2024, 10, 248. [Google Scholar] [CrossRef] [PubMed]
Skinderowicz, R. Solving Automatic Check-out with Fine-Tuned YOLO Models. Procedia Comput. Sci. 2024, 246, 1649–1658. [Google Scholar] [CrossRef]
Yang, X.; Zhang, X.; Song, C.; Jin, H.; Li, G. Product Detection and Counting Algorithm for Automatic Check-Out. Comput. Eng. Appl. 2024, 60, 324–332. [Google Scholar]
Liu, C.; Wen, J.; Xu, Y.; Zhang, B.; Nie, L.; Zhang, M. Reliable Representation Learning for Incomplete Multi-View Missing Multi-Label Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4940–4956. [Google Scholar] [CrossRef] [PubMed]
Liu, C.; Wu, Z.; Wen, J.; Xu, Y.; Huang, C. Localized Sparse Incomplete Multi-View Clustering. IEEE Trans. Multimed. 2023, 25, 5539–5551. [Google Scholar] [CrossRef]
Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; DollÁR, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 28 March 2025).
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
Dai, L.; Liu, H. DeCo-DETR: Densely Packed Commodity Detection with Transformer. In Proceedings of the 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT), Jiaxing, China, 10–12 November 2023; pp. 453–462. [Google Scholar] [CrossRef]

Figure 1. Illustration of the Automatic Checkout (ACO) task and its domain gap. The ACO task aims to generate complete shopping lists by detecting all products in checkout images. However, training data typically consists of multi-view images of single isolated products, while test data (checkout images) involves real-world scenes with multiple occluded and overlapping items, leading to significant cross-domain differences that hinder model generalization.

Figure 2. Architecture of the detection framework within the proposed Edge-Embedded Multi-Feature Fusion Network (E2MF2Net). It contains two modules: the Edge-Embedded Enhancement Module (E3) for boundary-aware feature enhancement and the Multi-Feature Fusion Module (MFF) for multi-scale semantic fusion.

Figure 3. Architecture of Edge-Embedded Enhancement Module (E3). The module extracts salient edge features via differential pooling and applies attention-based fusion to enhance boundary-aware feature representations.

Figure 4. Detection results of E2MF2Net under different occlusion levels. Each row corresponds to Easy (first row), Medium (second row), and Hard (third row) modes.

Table 1. Statistics of the RPC and SKU-110K datasets.

Datasets	Train_Image	Test_Image	Total_Image
RPC	46,000 (synthetic) + 6000 (real)	24,000 (real)	76,000
SKU-110K	8233	2941	11,174

Table 2. Performance comparison with state-of-the-art ACO methods on the RPC dataset under easy, medium, hard, and average mode.

Network	${cAcc}_{e a s y}$	${cAcc}_{m e d i u m}$	${cAcc}_{h a r d}$	${cAcc}_{a v e r a g e}$	mCIoU	ACD	mCCD	mAP50	mmAP
DPSNet [51]	97.22%	91.64%	85.06%	90.74%	99.12%	0.10	0.01	99.02%	82.72%
PLACO [54]	95.97%	94.73%	90.47%	93.73%	99.20%	0.10	0.01	98.94%	85.62%
DEGNet [57]	97.20%	95.75%	92.89%	95.25%	99.47%	0.07	0.01	99.02%	84.96%
Ours	98.52%	97.95%	96.52%	97.62%	99.75%	0.03	0.00	99.34%	87.28%

The proposed method achieves the highest scores across all evaluation metrics, consistently outperforming strong baselines in both classification and localization accuracy, with particularly large gains under the hard mode. Bold numbers indicate the best performance.

Table 3. Performance Comparisons on the SKU-110K dataset.

Network	mmAp	mAP50
Faster R-CNN	38.4%	62%
YOLOv7	42.6%	62.3%
DERT	25.6%	37.8%
Deformable DERT	42.6%	63.5%
DAB-DERT	33.9%	81%
DeCo-DETR	54.8%	93.3%
Ours	42.2%	64.5%

All results of other models are reported from [67].

Table 4. Ablation studies of proposed method on the RPC dataset.

Experiment	HMGC	E3	MFF	${cAcc}_{e a s y}$	${cAcc}_{m e d i u m}$	${cAcc}_{h a r d}$	${cAcc}_{a v e r a g e}$	mCIoU	ACD	mCCD
A				96.21%	94.95%	90.80%	93.93%	99.30%	0.09	0.01
B	✔			96.65%	95.00%	91.31%	94.25%	99.35%	0.08	0.01
C	✔	✔		97.72%	96.78%	95.08%	96.49%	99.61%	0.05	0.00
D	✔	✔	✔	98.52%	97.95%	96.52%	97.62%	99.75%	0.03	0.00

Notes: HMGC refers to a data generation method combined with corresponding training strategies. Experiments without HMGC (e.g., Experiment A) are trained directly on 6000 real checkout images. A checkmark (✔) indicates that the corresponding component is included in the experiment. Bold numbers indicate the best performance.

Table 5. Ablation studies about computational costs of proposed method.

E3	MFF	Params	Inference
		83.48 M	73.3 ms/img
✔		94.99 M	74.9 ms/img
✔	✔	97.88 M	77.8 ms/img

Table 6. Comparison of incremental learning with other methods.

Network	Train/Test	${cAcc}_{a v e r a g e}$
IncreACO [50]	183/183	$78.76 %$
	183/200	$47.53 %$
	(183 + 17)/200	$74.3 %$
PLACO [54]	183/183	$89.25 %$
	183/200	$47.35 %$
	(183 + 17)/200	$87.39 %$
DEGNet [57]	183/183	$91.97 %$
	183/200	$50.86 %$
	(183 + 17)/200	$91.35 %$
Ours	183/183	$97.01 %$
	183/200	$53.42 %$
	(183 + 17)/200	$96.59 %$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, J.; Zhu, M.; Ren, H. Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout. J. Imaging 2025, 11, 337. https://doi.org/10.3390/jimaging11100337

AMA Style

Li J, Zhu M, Ren H. Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout. Journal of Imaging. 2025; 11(10):337. https://doi.org/10.3390/jimaging11100337

Chicago/Turabian Style

Li, Jicai, Meng Zhu, and Honge Ren. 2025. "Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout" Journal of Imaging 11, no. 10: 337. https://doi.org/10.3390/jimaging11100337

APA Style

Li, J., Zhu, M., & Ren, H. (2025). Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout. Journal of Imaging, 11(10), 337. https://doi.org/10.3390/jimaging11100337

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout

Abstract

1. Introduction

2. Related Work

2.1. General Object Detection Frameworks

2.2. Feature Representation and Structural Enhancement

2.3. Automatic Checkout Systems

3. Methodology

3.1. Hierarchical Mask-Guided Composition

3.1.1. Pose Pruning Strategy Based on Mask Compactness

3.1.2. Image Composition Strategy

3.2. Architecture of the Detection Framework

3.2.1. Edge-Embedded Enhancement Module

3.2.2. Multi-Feature Fusion Module

4. Experiments

4.1. Dataset Details

4.2. Evaluation Metrics

4.3. Implementation Details

4.4. Experimental Results

4.5. Ablation Study

4.6. Incremental Learning

5. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI