1. Introduction
Large-scale recognition of stock keeping units (SKUs) is a challenging instance of fine-grained visual recognition. It often involves thousands of visually similar products and, in practice, only a single reference image per class [1]. Unlike conventional object recognition, SKU recognition requires representations that are robust to variations in viewpoint, illumination, and occlusion while remaining sensitive to subtle cues such as typography, colour schemes, and packaging geometry. In real-world retail deployments, such capabilities are essential for automated planogram compliance and shelf-monitoring systems [2]. Such systems rely on accurate SKU-level recognition to ensure correct product placement and stock availability in dynamic retail environments.
Although supervised convolutional neural networks achieve strong performance on fine-grained tasks, their effectiveness typically depends on large, densely annotated datasets [3]. In retail environments, maintaining such datasets at the SKU level is especially challenging due to frequent product turnover and continuously evolving assortments [4]. These constraints motivate the adoption of self-supervised learning (SSL). SSL enables representation learning without exhaustive manual annotation and has demonstrated strong transferability across downstream tasks.
Contrastive self-supervised learning has recently emerged as a dominant paradigm in visual representation learning. Methods such as SimCLR [5], MoCo [6], and BYOL [7] learn embeddings by enforcing consistency across augmented views of the same image, either through explicit negative sampling or predictive alignment. When trained on large-scale datasets such as ImageNet [8], these approaches can match or surpass supervised pre-training across numerous downstream tasks. However, different contrastive objectives induce distinct representation geometries, which may have significant implications for fine-grained recognition.
Previous research suggests that contrastive methods differ in terms of linear separability, sensitivity to augmentation, and adaptability under fine-tuning [5,7]. Recent surveys on few-shot fine-grained image classification provide a comprehensive taxonomy of meta-learning, data augmentation, and self-supervised strategies, highlighting persistent challenges related to domain adaptation and limited supervision [3]. Nevertheless, systematic analyses of how contrastive objectives influence representation geometry and transfer behaviour in large-scale SKU recognition remain limited. Understanding these differences is important when selecting SSL strategies under realistic computational and annotation constraints.
This study presents a controlled empirical comparison of SimCLR [5], MoCo v2 [9], and BYOL [7] for SKU recognition. We use a shared backbone architecture and identical augmentation strategies. Representations are evaluated using both linear evaluation and full fine-tuning. Experiments are conducted on the RP2K benchmark [10] and extended to an in-house dataset collected under real retail conditions to evaluate robustness and cross-domain generalisation.
The experiments indicate a trade-off between linear separability and fine-tuning adaptability across contrastive objectives. SimCLR produces highly linearly separable embeddings, making it well-suited to frozen-encoder scenarios. In contrast, BYOL performs better when end-to-end fine-tuning is feasible. We further show that model capacity interacts strongly with the optimisation budget, with lightweight encoders often outperforming deeper architectures under fixed training constraints.
The main contributions of this paper are as follows:
A systematic comparison of SimCLR, MoCo v2, and BYOL for large-scale fine-grained SKU recognition.
An analysis of how contrastive objectives influence linear separability and fine-tuning behaviour.
Evidence that encoder capacity and predictor design critically affect BYOL performance.
An extension of the evaluation to a real-world in-house dataset, highlighting the effect of domain shift.
2. Related Work
2.1. Contrastive Self-Supervised Learning
Early representation learning approaches employed Siamese architectures trained with contrastive or triplet losses to learn discriminative embeddings for tasks such as signature verification and face recognition. These approaches established the foundations of metric learning by explicitly modelling similarity relationships in the embedding space [
11,
12].
Most modern self-supervised contrastive learning methods rely on the InfoNCE objective [
13], which maximises agreement between representations of positive pairs while contrasting them against a set of negative samples drawn from a batch or a memory bank. SimCLR [
5] demonstrated that a relatively simple contrastive framework, when combined with strong data augmentation, a projection head, and sufficiently large batch sizes, can match or even surpass supervised ImageNet [
8] pre-training. MoCo [
9] addressed the computational limitations of large-batch contrastive learning by introducing a momentum encoder and a negative example queue, thus decoupling the number of negatives from batch size and improving hardware efficiency. BYOL [
7] further reduced the need for explicit negative samples by training an online network to predict the representation of a target network, which was updated via an exponential moving average. Related non-contrastive approaches, including SimSiam [
14], Barlow Twins [
15], and VICReg [
16], address representation collapse through architectural asymmetry or redundancy reduction. These methods also highlight the sensitivity of self-supervised learning to dataset scale and diversity.
Recent surveys offer a comprehensive overview of contrastive self-supervised learning, covering loss functions, architectural design choices, and applications across domains such as vision, speech, and remote sensing [17]. Previous studies highlight the importance of data augmentation, projection heads, and evaluation protocols, particularly the distinction between linear evaluation and full fine-tuning. In particular, the two protocols have been shown to capture complementary aspects of representation quality: strong linear performance does not necessarily imply superior downstream adaptability [18].
Importantly, previous studies suggest that different contrastive objectives induce distinct geometric properties in the learned representation space, influencing factors such as linear separability, feature isotropy, and optimisation behaviour during fine-tuning [
14,
15,
19]. Although linear evaluation captures the intrinsic separability of embeddings, fine-tuning performance reflects the adaptability of representations to task-specific decision boundaries. Understanding this distinction is particularly relevant for fine-grained recognition problems, where downstream performance may not correlate directly with linear probe accuracy.
2.2. Retail Product and SKU Recognition
Retail product recognition is a challenging instance of fine-grained visual classification, involving thousands of visually similar stock keeping units (SKUs) captured under unconstrained in-store conditions. The evaluation of such systems has relied on several publicly available benchmarks that vary substantially in scale and task formulation.
Early instance-level datasets such as GroZi-120 [
20] introduced the problem of recognising grocery products from in-store images using studio-quality references. More recent benchmarks, including RP2K [
10], extend this setting to thousands of products and more realistic acquisition conditions. RP2K comprises over 450,000 images across approximately 2000 classes. It exhibits strong class imbalance, background clutter, and substantial variation in viewpoint and lighting, making it a realistic testbed for large-scale embedding-based evaluation.
In contrast, datasets such as SKU110K [
21] focus primarily on object detection in densely packed retail shelves, providing large numbers of annotated bounding boxes, but not explicitly addressing instance-level retrieval. Recent system-level datasets extend beyond instance-level recognition by incorporating tasks such as planogram compliance verification and shelf-row localisation, capturing the structural organisation of products on shelves, and supporting practical retail analytics applications [
2].
Early work on retail shelf product recognition often followed a two-stage pipeline, combining object detection with global descriptor matching between reference and query images as demonstrated by Tonioni et al. [
22]. Such approaches framed SKU recognition as an instance-level retrieval problem, relying on embedding similarity rather than closed-set classification.
Subsequent research has primarily explored supervised classification models and embedding-based metric learning approaches, particularly in grocery and checkout scenarios [
23,
24]. Metric learning methods are particularly well-suited to large SKU catalogues, as they enable nearest-neighbour retrieval and support the incremental addition of new products without requiring full classifier retraining. Proxy-based metric learning has been proposed to address scalability issues arising from the large number of classes [
25].
In addition to embedding-based retrieval models, several retail frameworks at the system-level integrate fully supervised object detection and optical character recognition (OCR) modules to distinguish visually similar products based on packaging text and layout cues [
26]. Such pipelines demonstrate strong performance in controlled settings but require extensive labelled data and task-specific post-processing components.
Despite their effectiveness, most retail-focused methods rely on fully labelled training datasets, which are expensive to acquire and difficult to maintain in practice due to frequent product turnover and evolving assortments. Although self-supervised and contrastive learning approaches offer the potential to leverage large volumes of unlabelled retail imagery, systematic evaluations of such methods for large-scale SKU recognition remain limited. In particular, little attention has been paid to how different contrastive objectives behave under extreme fine-grained conditions, strong class imbalance, and domain shifts between public benchmarks and real retail environments. This gap motivates the present study.
3. Contrastive Representation Learning Methods
We consider contrastive self-supervised learning in a unified experimental setting to enable a controlled comparison between methods. All approaches operate on pairs of augmented views of the same input image and share a common encoder backbone, augmentation pipeline, and evaluation protocol. The methods differ primarily in how positive and negative relationships are defined and how the training objective is formulated.
Formally, given an input image $x$, two augmented views $x_i = t(x)$ and $x_j = t'(x)$ are obtained by sampling transformations $t$ and $t'$ from an augmentation distribution $\mathcal{T}$. These views are processed by an encoder network followed by method-specific projection and prediction heads. The resulting representations are optimised according to the contrastive or predictive objectives described below.
3.1. SimCLR
SimCLR [5] learns visual representations by maximising agreement between two independently augmented views of the same image within a minibatch. For a batch of $N$ original images, two stochastic augmentations are applied to each image, yielding $2N$ samples. Each view is processed by a shared encoder followed by a projection head, producing latent representations that are optimised using the Normalised Temperature-scaled Cross Entropy (NT-Xent) loss. The objective encourages representations of positive pairs (different augmentations of the same image) to be close in the embedding space while pushing them away from other samples in the minibatch, which act as implicit negative examples. Consequently, SimCLR benefits from large batch sizes that provide a diverse set of negatives.
The NT-Xent loss for a positive pair $(i, j)$ is defined as
$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$
where $z_i$ and $z_j$ denote the $\ell_2$-normalised latent representations of two augmented views derived from the same input image, forming a positive pair. The function $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity between representations. The denominator includes similarities between $z_i$ and the remaining $2N - 1$ representations in the minibatch, which (apart from the positive $z_j$) serve as negative samples. The indicator function $\mathbb{1}_{[k \neq i]}$ excludes the trivial comparison of a sample with itself. The temperature parameter $\tau$ scales the similarity scores and controls the concentration level of the resulting distribution. The final loss is computed across all positive pairs in the minibatch, considering both directions $(i, j)$ and $(j, i)$.
The projection head consists of a two-layer MLP with ReLU activation and an output dimension of 128.
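The NT-Xent computation can be sketched in a few lines of framework-free Python. This is a minimal illustration rather than the training code used in the experiments: the function names, the default temperature of 0.5, and the convention that rows 2k and 2k+1 hold the two views of image k are all assumptions made for the example.

```python
import math


def l2_normalise(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent(views, temperature=0.5):
    """Mean NT-Xent loss over a batch of 2N projected views.

    Assumed layout (illustrative): rows 2k and 2k+1 are the two
    augmentations of image k, so each row's positive is its neighbour.
    """
    z = [l2_normalise(v) for v in views]
    n = len(z)
    # After normalisation, cosine similarity is a plain dot product.
    sim = [[sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        j = i ^ 1  # positive partner: 0<->1, 2<->3, ...
        # Denominator runs over every view except i itself (indicator term).
        denom = sum(math.exp(sim[i][k]) for k in range(n) if k != i)
        total += -(sim[i][j] - math.log(denom))
    # Summing over all i covers both directions of every positive pair.
    return total / n


aligned = nt_xent([[1, 0], [1, 0], [0, 1], [0, 1]])
misaligned = nt_xent([[1, 0], [0, 1], [1, 0], [0, 1]])
```

In this toy batch, `aligned` is smaller than `misaligned`, because the loss is low only when each view is closer to its partner than to the views of other images.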
3.2. MoCo v2
MoCo v2 [9] addresses the batch-size dependency of SimCLR by maintaining a memory queue of negative examples. A momentum-updated encoder computes key representations, while the main encoder produces query representations. For each query, the positive key is obtained from a differently augmented view of the same image, and the InfoNCE loss is computed using the positive key and all entries in the queue as negatives. The momentum update ensures the temporal consistency of stored representations, which stabilises training [13].
MoCo v2 uses the same contrastive objective as SimCLR, namely the InfoNCE (NT-Xent) loss, where a query representation is encouraged to be similar to its corresponding positive key and dissimilar to a large set of negative keys. Unlike SimCLR, which relies on large batch sizes to provide many negatives, MoCo v2 maintains a dynamic queue that stores representations from previous mini-batches.
The parameters of the key encoder are not updated by backpropagation. Instead, they are updated using a momentum-based moving average of the query encoder parameters:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$
where $\theta_q$ denotes the parameters of the query encoder, $\theta_k$ denotes the parameters of the key encoder, and $m$ is the momentum coefficient. The projection head and output dimension are identical to those used in SimCLR.
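The two mechanisms described above, the momentum update of the key encoder and the first-in-first-out queue of negatives, can be illustrated with a small sketch. Parameters are represented as flat lists of floats for simplicity; the names `momentum_update` and `NegativeQueue`, and the default value `m=0.999`, are assumptions for the example and not settings reported in this paper.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Return m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO of key representations used as negatives."""

    def __init__(self, max_size):
        # A bounded deque discards its oldest entries once it is full.
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        """Append the keys of the current minibatch."""
        self._keys.extend(batch_keys)

    def negatives(self):
        """Return the stored keys, oldest first."""
        return list(self._keys)


queue = NegativeQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])  # "k1" is evicted to respect the size limit
```

Because the queue length is set independently of the minibatch, the number of negatives seen by each query does not depend on the batch size.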
3.3. BYOL
BYOL eliminates the need for explicit negative samples altogether. It consists of an online network, comprising an encoder, a projection head, and a predictor, and a target network, comprising an encoder and a projection head. Given two augmented views $x_i$ and $x_j$ of the same image, the online network is trained to predict the target network’s representation. The target network parameters are updated as a momentum-based moving average of the online network parameters. A stop-gradient operation is applied to the target branch to prevent representational collapse. The BYOL objective is defined as
$$\mathcal{L}_{\mathrm{BYOL}} = \left\| q_\theta\big(f_\theta(x_i)\big) - \mathrm{sg}\big(f_\xi(x_j)\big) \right\|_2^2,$$
where $f_\theta$ denotes the online encoder, $q_\theta$ is the predictor network applied to the online projection, and $f_\xi$ denotes the target encoder. The operator $\mathrm{sg}(\cdot)$ represents the stop-gradient operation, which blocks gradient flow through the target branch [7].
The target network parameters $\xi$ are updated using an exponential moving average of the online network parameters $\theta$:
$$\xi \leftarrow \tau_{\mathrm{ema}}\,\xi + (1 - \tau_{\mathrm{ema}})\,\theta,$$
where $\tau_{\mathrm{ema}}$ is a momentum coefficient typically close to 1.
Unlike contrastive methods such as SimCLR or MoCo, BYOL does not use negative samples. Representation collapse is mitigated through architectural and optimisation asymmetry between the online and target networks, including the use of the predictor and the stop-gradient operation. The stop-gradient prevents trivial collapse by blocking gradient propagation through the target branch, enforcing asymmetric prediction, and stabilising training.
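For reference, if the prediction and the target projection are both $\ell_2$-normalised before comparison, as in the original BYOL formulation [7], the squared-error objective is an affine function of their cosine similarity. The short derivation below is a sketch under that assumption; the symbols $u$ and $v$ are introduced here only for illustration and stand for the predictor output and the target projection, respectively.

```latex
\begin{align*}
\left\| \frac{u}{\|u\|_2} - \frac{v}{\|v\|_2} \right\|_2^2
  &= \frac{\|u\|_2^2}{\|u\|_2^2} + \frac{\|v\|_2^2}{\|v\|_2^2}
     - 2\,\frac{\langle u, v \rangle}{\|u\|_2\,\|v\|_2} \\
  &= 2 - 2\,\frac{\langle u, v \rangle}{\|u\|_2\,\|v\|_2}
\end{align*}
```

Under this assumption the loss lies in $[0, 4]$ and reaches zero when the two vectors point in the same direction, so minimising it is equivalent to maximising the cosine similarity between the online prediction and the target projection.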
3.4. Method Comparison
As shown in Table 1, SimCLR relies on large in-batch negative sampling, so the number of pairwise similarity computations grows quadratically with the batch size and GPU memory requirements are substantial. MoCo v2 alleviates this limitation by decoupling the number of negatives from the batch size via a memory queue, thus improving hardware efficiency. BYOL eliminates explicit negative sampling entirely, reducing similarity computations while introducing architectural complexity through a momentum target network. These differences have practical implications for large-scale industrial deployment and for integration with proxy-based fine-tuning pipelines.
4. Experimental Setup
4.1. Network Architecture and Optimisation
Unless stated otherwise, all experiments use a ResNet-50 [
27] encoder with the classification layer removed. This architecture was selected as the primary backbone due to its widespread adoption in the metric learning and large-scale image recognition literature, enabling comparison with prior work. Furthermore, ResNet-50 provides a good balance between representational capacity and computational efficiency, which is important in industrial retail scenarios where both training and inference costs must be considered. The goal of this study is to analyse how different learning objectives behave under identical architectural conditions. Therefore, a stable and widely adopted convolutional backbone was selected to isolate the impact of the loss functions.
The encoder outputs are passed to method-specific projection heads. For SimCLR and MoCo v2, the projection head consists of a two-layer multilayer perceptron (MLP) with ReLU activation and an output dimension of 128. For BYOL, a similar projection head is employed, followed by an additional two-layer MLP predictor in the online branch.
All models are trained using the same data augmentation pipeline (
Section 4.2) to ensure that differences in performance arise from learning objectives rather than input transformations. SimCLR is optimised using the Adam optimiser, while MoCo v2 and BYOL are trained with stochastic gradient descent (SGD) with momentum, following standard practice. In all cases, cosine annealing is applied to the learning rate schedule.
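As an illustration of the learning rate schedule, the sketch below evaluates a single-cycle cosine decay in closed form. It assumes no warm-up and no restarts; the function name and the parameters `lr_max`, `lr_min`, and `total_steps` are hypothetical and do not correspond to the settings used in the experiments.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` for one cosine decay from lr_max to lr_min."""
    progress = step / total_steps  # 0.0 at the start, 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative schedule: 100 steps starting from a learning rate of 0.1.
schedule = [cosine_annealed_lr(s, 100, lr_max=0.1) for s in range(101)]
```

The rate starts at `lr_max`, passes through the midpoint of the range halfway through training, and decays smoothly to `lr_min` at the final step.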
4.2. Data Augmentation
Data augmentation comprises transformations that preserve semantic identity and, therefore, plays a central role in contrastive learning. For retail product images, we adopt a strong augmentation strategy inspired by prior work on SimCLR and BYOL and adapted to the SKU recognition setting:
Random resized cropping and horizontal flipping;
Colour jittering (brightness, contrast, saturation, and hue);
Random conversion to greyscale;
Gaussian blur;
Occasional cutout-style occlusions.
These transformations simulate realistic variations encountered in retail environments, such as changes in viewpoint, lighting, and partial occlusions. Examples of the applied augmentations, including cropping, colour distortion, and blurring, are illustrated in Figure 1. Following these augmentations, all images are normalised using the ImageNet standard parameters, with mean values $(0.485, 0.456, 0.406)$ and standard deviations $(0.229, 0.224, 0.225)$.
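The normalisation step amounts to a per-channel shift and scale. The sketch below applies it to a single RGB pixel, assuming channel values have already been scaled to [0, 1] and using the commonly cited ImageNet statistics; the function name is chosen for illustration.

```python
# Commonly used ImageNet channel statistics (R, G, B), for inputs in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalise_pixel(rgb):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


white = normalise_pixel((1.0, 1.0, 1.0))
```

A pixel equal to the channel means maps to zero in every channel, which is the intended centring effect of the transformation.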
4.3. Datasets
4.3.1. RP2K
RP2K [
10] is a large-scale retail product dataset comprising more than 450,000 images across approximately 2000 SKU classes. The dataset exhibits strong class imbalance and substantial variation in background, viewpoint, and illumination, reflecting realistic in-store acquisition conditions. Example images are shown in
Figure 2, illustrating the diversity of categories and acquisition variability.
The images presented in
Figure 2 are product crops extracted from full shelf photographs captured under unconstrained retail conditions. Consequently, the crops may exhibit reduced resolution, compression artefacts, motion blur, and imperfect framing. These characteristics reflect real deployment conditions and make fine-grained SKU recognition more challenging, particularly when subtle packaging details must be distinguished.
4.3.2. In-House SKU Dataset
To assess performance under retailer-specific conditions, we also evaluate the methods on an in-house SKU dataset (InSKU) collected in real stores. The dataset contains:
1400 SKU classes;
33,600 images;
Images acquired using smartphone cameras, CCTV snapshots, or handheld scanners;
SKU-level annotations derived from store catalogues or enterprise resource planning systems.
Representative images from the InSKU dataset are shown in
Figure 3, highlighting real-world acquisition conditions, including shelf clutter, motion blur, and partial occlusions.
As in RP2K, the images in
Figure 3 correspond to the cropped product regions extracted from full shelf photographs. Many of these images were captured with smartphone cameras or surveillance devices and therefore exhibit limited spatial resolution and non-ideal imaging conditions. The cropping process, while necessary to isolate individual SKUs, often results in small regions of objects with reduced detail. At the SKU level, where class distinctions may depend on minor textual elements or subtle colour variations, such degradation substantially increases recognition complexity. Consequently, the visual quality of the examples reflects realistic operational constraints rather than curated benchmark conditions.
Compared to RP2K, InSKU exhibits stronger domain specificity in terms of store layout, lighting conditions, and branding characteristics, as well as increased visual noise, including motion blur and partial occlusions.
4.4. Evaluation Protocol
4.4.1. Linear Evaluation Protocol
We evaluate the intrinsic quality of the learned representations using the standard linear evaluation protocol. In this setting, the encoder parameters are frozen, and a single linear classifier is trained on top of the pooled feature representations. The classifier is optimised using cross-entropy loss on the labelled training split, and performance is reported on a held-out test set in terms of Top-1 and Top-5 accuracy. This protocol measures the linear separability of the learned embeddings.
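The Top-1 and Top-5 metrics can be computed from classifier scores as sketched below. The function name and the toy scores are illustrative; ties are broken by class index, which is an arbitrary choice made for the example.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Three samples, three classes (toy values).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-k accuracy is non-decreasing in k, which is why Top-5 figures are always at least as high as the corresponding Top-1 figures.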
4.4.2. Fine-Tuning
To assess representation adaptability, we additionally perform full fine-tuning, updating both the encoder and the classification head on labelled data. We consider two transfer scenarios:
During fine-tuning, a lower learning rate is applied to the encoder than to the classifier to preserve previously learned representations while enabling task-specific adaptation. Cosine annealing schedules are applied to both components. Performance is evaluated on a held-out test set using Top-1 and Top-5 accuracy.
4.5. Hyperparameter Configuration
The main hyperparameter settings used for SimCLR [
5], MoCo v2 [
9], and BYOL [
7] across the RP2K [
10] and InSKU datasets are summarised in
Table 2. The experiments were conducted on a workstation running Ubuntu and equipped with an NVIDIA RTX 5000 GPU (32 GB VRAM). The code was implemented using Python 3.12.3, PyTorch 2.6.0 and CUDA 12.8. All models were trained on a single GPU using Automatic Mixed Precision (AMP) via PyTorch’s native autocast and GradScaler modules. This approach enabled stable training at higher throughput while maintaining the numerical precision required for convergence.
5. Results on RP2K
5.1. Self-Supervised Pre-Training and Linear Evaluation
We first pre-train SimCLR, MoCo v2, and BYOL on the RP2K dataset in a purely self-supervised manner. All models use a ResNet-50 backbone and are trained for 150 epochs using the augmentation pipeline described in
Section 4.2.
To assess linear separability, we apply the standard linear evaluation protocol in the frozen-encoder setting. A linear classifier is trained for 50 epochs while keeping the encoder parameters fixed. The results are summarised in
Table 3.
As shown in Table 3, SimCLR achieves the strongest performance under the linear evaluation protocol in this configuration, with near-perfect Top-5 accuracy and very high Top-1 accuracy despite the large number of SKU classes. BYOL reaches moderate performance, whereas MoCo v2 is substantially less accurate.
These results suggest that, in the frozen-encoder setting, SimCLR produces more linearly separable representations when trained directly on RP2K as evidenced by the performance gap reported in
Table 3.
5.2. Effect of Encoder Depth and BYOL Variants
To better understand how BYOL can be adapted to fine-grained retail data, we explore the impact of encoder depth and predictor design. We evaluate ResNet-18, ResNet-50, and ResNet-152 backbones, as well as modifications to the online network and predictor. Although the standard BYOL online network uses a 2-layer MLP predictor, we introduce a deeper 3-layer architecture, denoted as BYOL*. We evaluate how this modification affects performance across different scales, specifically using ResNet-18 and ResNet-50 backbones. All experiments in this subsection use 50 epochs of self-supervised pre-training and 50 epochs of linear classifier training on RP2K.
The results in
Table 4 show that the shallower ResNet-18 backbone achieves the best performance among the evaluated architectures. In contrast, the very deep ResNet-152 does not converge satisfactorily under the available training budget, which may be attributable to optimisation difficulties and over-parameterisation relative to the number of training epochs.
As reported in
Table 5, the BYOL* configuration with the ResNet-18 backbone achieves the strongest performance among the evaluated variants, slightly exceeding the ResNet-50 baseline while using fewer parameters and a comparable training budget. These results suggest that BYOL performance is sensitive to the interaction between model capacity and the optimisation budget, particularly on fine-grained datasets such as RP2K.
5.3. Transfer from ImageNet to RP2K
We next evaluate how contrastive representations pre-trained on ImageNet transfer to SKU recognition. Encoders pre-trained with SimCLR, MoCo v2, and BYOL on ImageNet are adapted to RP2K under two settings: classifier-only training with frozen encoders and full fine-tuning of both the encoder and the classifier.
As shown in
Table 6, SimCLR again achieves the strongest linear performance among the evaluated methods, indicating that its ImageNet representations remain highly linearly separable even for fine-grained SKU classes.
When the encoder is fully fine-tuned (
Table 7), BYOL achieves near-perfect accuracy under the evaluated setting and outperforms SimCLR in terms of Top-1 accuracy. This result suggests that BYOL representations may be particularly well-suited for task-specific adaptation when end-to-end optimisation is feasible.
5.4. Analysis and Deduplication of RP2K Dataset
The results in
Table 7 motivated a closer inspection of the RP2K dataset to verify data integrity and exclude potential sources of evaluation bias. An integrity check based on SHA-256 hashing revealed that approximately 7.7% of the original test set images had exact duplicates in the training split. Such overlap may lead to information leakage and artificially inflate evaluation performance.
To address this issue, all detected duplicate samples were removed prior to further experimentation. The dataset was subsequently re-partitioned to ensure that no exact duplicates were shared across subsets.
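An exact-duplicate check of the kind described above can be sketched with the standard `hashlib` module. The example operates on in-memory byte strings keyed by file name so that it is self-contained; in practice the bytes would be read from the image files. The function names and the toy data are assumptions for illustration.

```python
import hashlib


def content_hash(data):
    """SHA-256 hex digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def find_cross_split_duplicates(train_items, test_items):
    """Return names of test items whose exact bytes also occur in training.

    Both arguments map an identifier (e.g. a file name) to raw bytes.
    """
    train_hashes = {content_hash(blob) for blob in train_items.values()}
    return sorted(name for name, blob in test_items.items()
                  if content_hash(blob) in train_hashes)


train = {"a.jpg": b"image-one", "b.jpg": b"image-two"}
test = {"c.jpg": b"image-one", "d.jpg": b"image-three"}
leaked = find_cross_split_duplicates(train, test)
```

Hashing detects only byte-identical files; re-encoded or resized copies of the same photograph would require a perceptual similarity measure and are not covered by this check.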
Furthermore, due to the highly imbalanced class distribution in the original RP2K dataset, the classes were filtered according to a criterion on $n_c$, where $n_c$ denotes the number of images assigned to a given class. This filtering procedure yielded a curated subset of 1076 classes. The refined dataset was then randomly partitioned into training, validation, and test splits, stratified by class, in a 90:5:5 ratio.
After SHA-256-based deduplication and re-partitioning, no exact duplicates spanning multiple splits remained.
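A class-stratified 90:5:5 partition can be sketched as follows. This is one possible realisation under stated assumptions: per-class split sizes are obtained by rounding, the random seed is fixed for reproducibility, and the function name is hypothetical. It is not necessarily the procedure used to produce the refined dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.90, 0.05, 0.05), seed=0):
    """Split sample indices into train/val/test, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * ratios[1])
        n_test = round(len(indices) * ratios[2])
        n_train = len(indices) - n_val - n_test  # remainder goes to training
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy label list: 100 images of class 0 and 40 images of class 1.
toy_labels = [0] * 100 + [1] * 40
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

With rounding, very small classes may contribute no validation or test samples, which is one practical reason to filter classes by image count before splitting.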
Using this refined dataset, each model was trained in a two-stage process consisting of 50 epochs of self-supervised encoder training followed by 50 epochs of linear classifier training. The resulting performance metrics are presented in
Table 8.
As shown in
Table 8, the relative ranking of the methods changes after filtering and deduplicating the dataset. Among the methods evaluated in the filtered configuration, MoCo v2 achieves the highest Top-1 accuracy (87.17%) and Macro-F1 score (0.87), slightly outperforming SimCLR. This contrasts with the original RP2K configuration, where SimCLR demonstrated superior linear separability. The improved Macro-F1 score suggests that MoCo v2 benefits from the more balanced class distribution in the filtered subset. In contrast, BYOL exhibits a substantial performance drop, which may reflect increased sensitivity to reduced dataset scale and shorter pre-training duration.
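Macro-F1 averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below is a minimal implementation that averages over the classes present in the ground truth; the function name and toy labels are illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); FN + TP >= 1 for classes in y_true.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

In the toy example the per-class F1 scores are 2/3 and 4/5, and the macro average weights them equally regardless of class frequency.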
5.5. Summary of RP2K Results
SimCLR produces the most linearly separable representations among the evaluated methods under the original RP2K configuration. However, after filtering and deduplication, MoCo v2 demonstrates competitive and, in certain configurations, superior linear performance under a reduced training budget. These complementary observations suggest that representation quality is influenced not only by the contrastive objective but also by dataset composition and the optimisation regime. Consequently, selecting a contrastive learning strategy for SKU recognition may need to account for deployment constraints, including the available data scale and computational resources.
6. Generalisation and In-House Dataset Evaluation
6.1. Evaluation Protocol on the In-House Dataset
To assess robustness under real deployment conditions, we extend our evaluation to the InSKU dataset, which exhibits a distribution distinct from RP2K, characterised by store-specific layouts, lighting conditions, and a smaller sample size. All InSKU experiments follow the evaluation protocol described in
Section 4, with Top-1 and Top-5 accuracy reported under frozen-encoder or fine-tuning constraints.
6.2. Self-Supervised Pre-Training on InSKU
We first evaluate the intrinsic quality of representations learned directly from InSKU without any supervised fine-tuning. The encoders are trained in a purely self-supervised manner and evaluated using a retrieval-based protocol, where performance reflects the model’s ability to associate samples of the same SKU in the embedding space.
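One simple way to realise such a retrieval-based evaluation is a leave-one-out nearest-neighbour search with cosine similarity, sketched below. This particular protocol is an assumption made for illustration and may differ from the one used to obtain the reported results; the function names are hypothetical and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieval_top1(embeddings, labels):
    """Share of samples whose nearest other sample has the same label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Search every sample except the query itself.
        candidates = (j for j in range(len(embeddings)) if j != i)
        best = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        hits += labels[best] == labels[i]
    return hits / len(embeddings)


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = retrieval_top1(toy_embeddings, [0, 0, 1, 1])
scrambled = retrieval_top1(toy_embeddings, [0, 1, 0, 1])
```

The score is high only when samples of the same SKU are neighbours in the embedding space, so it probes the representation directly without training a classifier.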
The results in
Table 9 suggest that InSKU poses a substantially greater challenge for self-supervised learning than RP2K. The smaller dataset size and reduced diversity may increase the risk of overfitting to background or illumination cues.
A particularly notable case is BYOL, which achieves only 7.31% Top-1 accuracy. As BYOL relies on predictive alignment without explicit negative samples, it appears especially sensitive to limited data diversity, a behaviour consistent with prior observations on non-contrastive self-supervised methods [
14,
15]. By contrast, SimCLR may benefit from explicit negative pairs, which provide an additional separation signal under constrained conditions.
6.3. Linear Probing on InSKU
Following self-supervised pre-training, we attach a linear classifier to the frozen ResNet-50 encoders and train it for 250 epochs on InSKU.
As shown in
Table 10, SimCLR yields the highest linear evaluation accuracy on InSKU, outperforming MoCo v2 and BYOL under the evaluated configuration. Nevertheless, overall performance remains substantially lower than on RP2K, suggesting reduced linear separability in representations learned from smaller, more homogeneous datasets.
6.4. Cross-Domain Transfer
We further evaluate robustness by applying the encoders pre-trained on RP2K to InSKU under a frozen-encoder constraint. Only the linear classifier is trained on InSKU for 250 epochs.
The results in
Table 11 reveal a substantial performance gap between the evaluated methods. SimCLR demonstrates relatively strong generalisation under the frozen-encoder constraint, achieving 61.90% Top-1 accuracy—higher than when trained directly on InSKU. In contrast, MoCo v2 and BYOL exhibit substantial performance degradation in this setting.
We note that the results in Table 11 correspond to a stress test under a frozen-encoder constraint. Earlier fine-tuning experiments on RP2K (Table 7) indicate that MoCo v2 and BYOL can achieve substantially higher performance when end-to-end adaptation is permitted.
6.5. Error Analysis and Fine-Grained Confusions
To better understand the limitations of our models, we performed a detailed error analysis focusing on fine-grained confusions. Retail product recognition on the InSKU dataset presents a significant challenge due to high intra-class variability and minimal inter-class differences. As illustrated in Figure 4, models often struggle to distinguish between specific pairs of SKUs that share nearly identical visual features.
This difficulty often arises from branding strategies in which products of the same line differ only by minor textual details—such as flavour, weight, or limited-edition markings—while maintaining the same colour palette and packaging layout. In addition, competing brands often adopt similar design languages for the same product categories. The challenge is further intensified by the nature of the acquisition pipeline: the analysed images are cropped product regions extracted from full-shelf photographs, frequently captured on mobile devices under unconstrained retail conditions. As a result, crops often exhibit limited spatial resolution, compression artefacts, motion blur, and imperfect framing. When combined with real-world factors such as varied lighting and partial occlusions, these subtle visual distinctions become nearly imperceptible at the SKU level. These errors suggest that fine-grained recognition in retail environments extends beyond general object detection and may require the extraction of highly localised and discriminative visual descriptors under realistic imaging constraints.
6.6. Cross-Domain Representation Geometry Analysis
To further investigate the effect of domain shift between RP2K and InSKU, we analysed the geometry of the learned embedding spaces using joint visualisations of representations from both datasets. For each method (SimCLR, MoCo v2, and BYOL), we projected the high-dimensional embeddings of samples from RP2K and InSKU into two dimensions using t-distributed Stochastic Neighbour Embedding (t-SNE).
Figure 5 presents joint t-SNE projections of RP2K and InSKU embeddings, showing the relative arrangement of samples from both domains within a shared representation space. The visualisations highlight qualitative differences in cross-domain alignment and domain clustering behaviour induced by SimCLR, MoCo v2, and BYOL.
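A joint projection of this kind can be sketched with scikit-learn's `TSNE`. The example below is illustrative only: the perplexity, initialisation, and random seed are assumptions (the values used for Figure 5 are not stated here), the function name `joint_tsne` is hypothetical, and random vectors stand in for encoder embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE


def joint_tsne(source_emb, target_emb, perplexity=30.0, seed=0):
    """Project embeddings from two domains with a single t-SNE fit.

    Both domains are concatenated before fitting, because coordinates from
    separate t-SNE runs are not directly comparable.
    """
    joint = np.concatenate([source_emb, target_emb], axis=0)
    domain = np.concatenate([
        np.zeros(len(source_emb), dtype=int),   # 0 = source domain (e.g., RP2K)
        np.ones(len(target_emb), dtype=int),    # 1 = target domain (e.g., InSKU)
    ])
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    coords = tsne.fit_transform(joint)
    return coords, domain


# Random stand-ins for encoder outputs (hypothetical sizes and shift).
rng = np.random.default_rng(0)
rp2k_like = rng.normal(size=(40, 32))
insku_like = rng.normal(loc=0.5, size=(30, 32))

coords, domain = joint_tsne(rp2k_like, insku_like, perplexity=10.0)
print(coords.shape, int(domain.sum()))
```

The returned `domain` vector can be used to colour the scatter plot by dataset. Since t-SNE preserves local neighbourhoods rather than global distances, such plots are best read qualitatively.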
6.6.1. Qualitative Analysis of Domain Structure
When SimCLR is trained on RP2K and evaluated jointly on RP2K and InSKU samples, the resulting embedding space exhibits partial domain separation. While substantial overlap between RP2K and InSKU samples is observed, certain regions of the space are predominantly occupied by one domain. This visual pattern suggests that SimCLR may capture domain-specific characteristics to some extent. At the same time, the representation geometry still enables partial cross-domain alignment. This behaviour is consistent with the relatively strong frozen-encoder transfer performance observed for SimCLR.
In contrast, BYOL trained on RP2K produces a more pronounced domain-level clustering effect. The t-SNE projections reveal that RP2K and InSKU samples tend to occupy more clearly separated regions of the embedding space, indicating stronger domain-specific structuring. This behaviour may partially explain BYOL’s weaker performance under frozen-encoder transfer, as domain-specific clustering can reduce cross-domain linear separability. At the same time, the smooth and well-organised geometry of the embedding space is consistent with BYOL’s strong adaptability under full fine-tuning.
MoCo v2 exhibits a different pattern. The embeddings of RP2K and InSKU samples appear more uniformly intermixed, with less clearly defined domain clusters. However, the overall structure of the embedding space is less compact and less distinctly organised compared to SimCLR and BYOL. This observation aligns with the comparatively weaker linear evaluation results for MoCo v2 on RP2K, suggesting that the lack of clear geometric separation may reflect suboptimal feature discrimination rather than improved domain invariance.
When models are trained directly on the smaller InSKU dataset, the resulting embedding spaces display reduced structural coherence across all methods. In particular, the separation between domains becomes less stable and more diffuse. This is consistent with the hypothesis that dataset scale and diversity may play an important role in shaping the global geometry of self-supervised representations.
6.6.2. Interpretation in the Context of Domain Shift
The joint visualisations provide qualitative evidence that different self-supervised objectives induce distinct domain sensitivities in the embedding space. SimCLR appears to favour representations that partially align samples across domains, which may facilitate frozen-encoder transfer. BYOL, by contrast, tends to produce more domain-clustered embeddings, which can hinder direct transfer, although the same encoders can still benefit from end-to-end adaptation. In this configuration, MoCo v2 shows neither strong domain clustering nor highly separable class structures.
These findings suggest that the domain shift may not be solely a function of differences in dataset distributions. It may also depend on the geometric properties induced by the self-supervised objective. The interaction between contrastive alignment, feature uniformity, and predictor asymmetry influences generalisation. In particular, these factors may determine whether representations become more domain-invariant or remain tightly coupled to the source distribution.
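The alignment and uniformity properties mentioned above can be computed directly from embeddings. The sketch below follows the definitions of Wang and Isola [19], assuming L2-normalised inputs and the commonly used defaults alpha = 2 and t = 2; the function names and toy points are hypothetical, and these metrics were not part of the reported experiments.

```python
import numpy as np


def alignment(z1, z2, alpha=2.0):
    """Mean distance (to the power alpha) between embeddings of positive pairs."""
    return float(np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha))


def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower = more uniform)."""
    diff = z[:, None, :] - z[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    upper = np.triu_indices(len(z), k=1)             # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))


# Four points spread evenly on the unit circle versus four collapsed points.
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
collapsed = np.tile(np.array([[1.0, 0.0]]), (4, 1))

print("alignment of identical views:", alignment(spread, spread))
print("uniformity (spread):   ", uniformity(spread))
print("uniformity (collapsed):", uniformity(collapsed))
```

Collapsed embeddings attain the worst possible uniformity (zero), while evenly spread points yield a clearly negative value. Reporting both quantities alongside accuracy would be one way to characterise the geometric trade-offs discussed in this subsection.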
In general, geometric analysis complements the quantitative results reported in Section 5 and Section 6, providing additional insight into why certain methods demonstrate stronger frozen transfer performance (e.g., SimCLR), while others achieve superior results under full fine-tuning (e.g., BYOL). These observations further support the view that raw accuracy alone is insufficient to determine method suitability. Representation geometry and sensitivity to domain shift also play an important role.
6.7. Summary of InSKU Results
Experiments on InSKU further illustrate the sensitivity of contrastive learning methods to dataset scale and domain shift. Although SimCLR demonstrates robust generalisation under frozen encoders, BYOL and MoCo v2 appear to require fine-tuning to achieve competitive performance in retailer-specific settings.
7. Discussion
The experiments provide insight into the behaviour of contrastive self-supervised learning methods for large-scale SKU recognition, particularly when evaluated under different optimisation regimes and domain conditions.
Contrastive learning appears to be effective for large-scale SKU recognition under the evaluated conditions, but its success depends on dataset scale and training regime. On RP2K, both SimCLR and BYOL achieve high recognition accuracy. This suggests that contrastive pre-training can extract discriminative representations from unlabelled retail images. However, experiments on the smaller and more domain-specific InSKU dataset reveal that this effectiveness does not automatically transfer to constrained data regimes, underscoring the potential importance of dataset scale and diversity in this setting.
Linear separability and fine-tuning adaptability appear to capture complementary properties of learned representations. Under the original RP2K configuration, SimCLR produces the most linearly separable embeddings among the evaluated methods (94.98% Top-1, Table 3), making it well-suited to scenarios where the encoder must remain frozen and only a lightweight classifier can be retrained. In contrast, BYOL exhibits substantially weaker linear evaluation performance (62.90% Top-1) but benefits disproportionately from end-to-end fine-tuning, ultimately achieving the highest accuracy when full optimisation is feasible (99.22% Top-1, Table 7). This trade-off may reflect differences in the inductive biases associated with each objective.
Model capacity interacts strongly with optimisation budget in fine-grained settings. Our analysis of BYOL architecture variants reveals that lightweight backbones, such as ResNet-18, can outperform deeper models under fixed training budgets. In fine-grained SKU recognition, excessive model capacity may hinder optimisation rather than improve representation quality under limited training budgets.
MoCo v2 performance appears sensitive to configuration choices. In our experiments, under the original RP2K configuration and full pre-training regime, MoCo v2 achieves lower performance than SimCLR and BYOL. However, after dataset filtering and deduplication, it achieves competitive and, in some cases, superior linear performance under a reduced training budget. This behaviour may reflect sensitivity to specific hyperparameters such as queue size, momentum coefficient, and batch size, rather than an inherent limitation of the method. Such sensitivity can pose practical challenges in compute-constrained retail settings, where extensive hyperparameter tuning is often infeasible. The observed performance of MoCo v2 should therefore be interpreted in the context of the hyperparameter configuration reported in Table 2, with particular emphasis on the queue size and momentum coefficient.
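To make the role of these two hyperparameters concrete, the sketch below shows the exponential-moving-average update of a momentum encoder and a first-in, first-out queue of negative keys, as commonly described for MoCo-style training. It is a simplified illustration: parameters are plain arrays, the names `momentum_update`, `ema_horizon`, and `NegativeQueue` are hypothetical, and only the momentum coefficients (0.999 and 0.996) are taken from Table 2.

```python
import numpy as np
from collections import deque


def momentum_update(key_params, query_params, m):
    """EMA update of the key (momentum) encoder: theta_k <- m * theta_k + (1 - m) * theta_q."""
    return m * key_params + (1.0 - m) * query_params


def ema_horizon(m):
    """Approximate number of steps over which the EMA averages, 1 / (1 - m)."""
    return 1.0 / (1.0 - m)


class NegativeQueue:
    """Fixed-size FIFO store of key embeddings; the oldest keys are dropped first."""

    def __init__(self, size):
        self._items = deque(maxlen=size)

    def enqueue(self, keys):
        for key in keys:
            self._items.append(key)

    def as_array(self):
        return np.stack(list(self._items))

    def __len__(self):
        return len(self._items)


# Momentum coefficients from Table 2: 0.999 (MoCo v2) and 0.996 (BYOL).
for m in (0.999, 0.996):
    print(f"m = {m}: averages over roughly {ema_horizon(m):.0f} steps")

# Tiny queue for illustration; Table 2 lists queue sizes of 16384 and 4096.
queue = NegativeQueue(size=4)
queue.enqueue(np.arange(6, dtype=float).reshape(6, 1))
print(queue.as_array().ravel())
```

A larger momentum coefficient makes the key encoder change more slowly, and a larger queue retains keys produced by older encoder states. On a small dataset, both choices affect how stale or repetitive the negatives become, which is one plausible route to the sensitivity noted above.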
Domain shift highlights potential differences in representation robustness. Experiments on the InSKU dataset highlight the critical role of domain adaptation in retail applications. While SimCLR demonstrates strong generalisation from RP2K to InSKU under a frozen-encoder constraint, both BYOL and MoCo v2 exhibit substantial performance degradation in this setting. This suggests that representations learned by these methods may be more tightly coupled to the source data distribution and may benefit from end-to-end fine-tuning to adapt effectively. The InSKU evaluation protocol may provide a practical framework for assessing such robustness prior to deployment. The qualitative representation geometry analysis presented in Section 6.6 is consistent with this interpretation. Joint t-SNE projections suggest that different self-supervised objectives are associated with distinct domain clustering patterns in the projected space. In particular, BYOL representations exhibit stronger domain-specific grouping, which may explain their reduced frozen transfer performance but improved adaptability under full fine-tuning. SimCLR embeddings demonstrate comparatively greater cross-domain overlap, consistent with stronger linear separability in cross-domain settings.
Retail deployment constraints favour representation-based and self-supervised learning paradigms. While supervised ImageNet pre-training is a strong baseline in many vision tasks, large-scale retail deployment imposes additional constraints that limit the practicality of purely supervised approaches. Although supervised baselines were not the primary focus of this study, prior work shows that ImageNet pre-training yields strong, general-purpose features. However, such features may struggle to capture fine-grained retail-specific distinctions without additional adaptation. Our results suggest that self-supervised objectives trained directly on retail imagery may produce representations that are more specialised to the SKU domain. This effect appears to be more pronounced under large-scale pre-training conditions. Even small convenience stores typically carry 2000 to 4000 SKUs, while larger retail chains may handle tens of thousands of products. The product assortment continuously evolves, with frequent packaging updates, seasonal variants, and newly introduced items. In such environments, class-specific supervised classifiers require repeated annotation and full model retraining. This process can be computationally costly and operationally inefficient in large-scale retail settings.
8. Conclusions
This study examined contrastive self-supervised learning methods for large-scale SKU recognition, using the RP2K dataset as the primary benchmark and introducing a structured evaluation protocol for an in-house InSKU dataset collected under real retail conditions. The objective was not to establish a universally superior method, but rather to analyse how different contrastive objectives behave under controlled architectural and optimisation settings.
The experimental results indicate that the evaluated contrastive objectives are associated with distinct representation characteristics. Under the linear evaluation protocol on RP2K, SimCLR produced the highest Top-1 accuracy among the tested configurations, suggesting strong linear separability of the learned embeddings in this setting. In contrast, BYOL achieved the highest performance under full fine-tuning, indicating that its representations may benefit more substantially from task-specific adaptation. These findings suggest that linear evaluation and full fine-tuning capture complementary aspects of representation quality.
The analysis of BYOL architecture variants further suggests that model capacity interacts with the available optimisation budget in fine-grained recognition scenarios. In the examined configurations, the lightweight backbones achieved competitive or superior performance compared to deeper architectures under limited training duration. This observation may reflect optimisation constraints rather than inherent limitations of deeper networks and should therefore be interpreted within the specific experimental setup considered in this study.
Experiments conducted on the in-house InSKU dataset highlight the sensitivity of self-supervised representations to dataset scale and domain shift. While SimCLR exhibited comparatively stronger performance under frozen-encoder transfer from RP2K to InSKU, BYOL and MoCo v2 showed more pronounced performance degradation in this constrained setting. However, earlier fine-tuning experiments suggest that these methods can recover substantially higher performance when end-to-end adaptation is permitted. Taken together, these findings suggest that the effectiveness of contrastive learning for SKU recognition depends on several factors, including the objective function, dataset characteristics, computational budget, and retraining strategy.
From a deployment perspective, the findings suggest that method selection should be aligned with operational constraints. Linear separability may be particularly relevant in scenarios where the encoder must remain fixed, whereas adaptability under fine-tuning may be more critical when full retraining is feasible. These considerations are likely to be especially important in retail environments characterised by frequent product turnover and evolving assortments.
Several limitations should be acknowledged. First, the experiments were conducted using a single backbone family (ResNet) and fixed augmentation strategies to enable controlled comparison; alternative architectural choices may yield different behaviours. Second, although efforts were made to mitigate dataset bias through filtering and deduplication, residual dataset-specific effects cannot be entirely excluded. Third, the geometric analysis based on t-SNE visualisations provides qualitative rather than quantitative evidence of domain structure and should be interpreted accordingly.
Future work may extend this analysis to alternative backbone architectures, including vision transformers [28], to further investigate how representation geometry interacts with domain shift in retail SKU recognition. Additional studies incorporating larger multi-retailer datasets and systematic fine-tuning on in-house data would help clarify the generality of the observed trends.
Overall, the results presented here suggest that contrastive self-supervised learning constitutes a viable approach for large-scale SKU recognition. However, the suitability of a given objective appears to depend on multiple interacting factors, including representation geometry, the optimisation regime, dataset scale, and deployment constraints, rather than raw accuracy alone.
Author Contributions
Conceptualisation, W.K. and G.S.; methodology, W.K.; software, W.K.; validation, W.K. and G.S.; formal analysis, G.S.; investigation, W.K.; resources, G.S.; data curation, G.S.; writing—original draft preparation, W.K. and G.S.; writing—review and editing, G.S.; visualisation, W.K.; supervision, G.S.; project administration, G.S.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research was co-funded by the National Centre for Research and Development under Subtask 1.1.1 of the Smart Growth Operational Program 2014–2020, co-financed from public funds of the Regional Development Fund No. 2014/2020 under grant no. POIR.01.01.01-00-2326/20.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The RP2K dataset analysed in this study is publicly available at https://www.kaggle.com/datasets/khyeh0719/rp2k-dataset/data (accessed on 10 March 2026). The InSKU dataset cannot be publicly released due to commercial restrictions. Limited access to data may be provided by the corresponding author upon reasonable request and with permission of the data owner.
Acknowledgments
The authors would like to thank the Omniaz data collection and annotation team for their contribution to the preparation of the in-house dataset. Generative AI tools were used for language editing, grammar correction, and improving clarity and readability of the manuscript. The scientific content, experimental design, data analysis, results, and conclusions were entirely produced and verified by the authors. No generative AI tools were used to generate data, conduct experiments, or draw scientific conclusions.
Conflicts of Interest
Author Grzegorz Sarwas was employed by the company Omniaz Sp. z o.o. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MLP | Multi-Layer Perceptron |
| SGD | Stochastic Gradient Descent |
| SKU | Stock Keeping Unit |
| SSL | Self-Supervised Learning |
References
1. Tonioni, A.; Di Stefano, L. Domain Invariant Hierarchical Embedding for Grocery Products Recognition. Comput. Vis. Image Underst. 2019, 182, 81–92.
2. Pietrini, R.; Paolanti, M.; Mancini, A.; Frontoni, E.; Zingaretti, P. Shelf Management: A Deep Learning-Based System for Shelf Visual Monitoring. Expert Syst. Appl. 2024, 255, 124635.
3. Lim, J.M.; Lim, K.M.; Lee, C.P.; Lim, J.Y. A Review of Few-Shot Fine-Grained Image Classification. Expert Syst. Appl. 2025, 275, 127054.
4. Kowalczyk, A.; Sarwas, G. One-Shot Learning from Prototype Stock Keeping Unit Images. Information 2024, 15, 526.
5. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning; Daumé, H., III, Singh, A., Eds.; PMLR: Cambridge, MA, USA, 2020; Volume 119, pp. 1597–1607.
6. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv 2020, arXiv:1911.05722.
7. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning. In Proceedings of the Conference on Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 21271–21284.
8. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
9. Chen, X.; Fan, H.; Girshick, R.B.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297.
10. Peng, J.; Xiao, C.; Li, Y. RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification. arXiv 2021, arXiv:2006.12634.
11. Weinberger, K.Q.; Saul, L.K. Distance Metric Learning for Large Margin Nearest Neighbor Classification. J. Mach. Learn. Res. 2009, 10, 207–244.
12. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature Verification Using a “Siamese” Time Delay Neural Network. In Proceedings of the Conference on Advances in Neural Information Processing Systems; Cowan, J., Tesauro, G., Alspector, J., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1993; Volume 6, pp. 737–744.
13. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019, arXiv:1807.03748.
14. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753.
15. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the International Conference on Machine Learning; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 12310–12320.
16. Bardes, A.; Ponce, J.; LeCun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022.
17. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071.
18. Kornblith, S.; Shlens, J.; Le, Q.V. Do Better ImageNet Models Transfer Better? In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2656–2666.
19. Wang, T.; Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the International Conference on Machine Learning; Daumé, H., III, Singh, A., Eds.; PMLR: Cambridge, MA, USA, 2020; Volume 119, pp. 9929–9939.
20. Merler, M.; Galleguillos, C.; Belongie, S. Recognizing Groceries in situ Using in vitro Training Data. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
21. Goldman, E.; Herzig, R.; Eisenschtat, A.; Goldberger, J.; Hassner, T. Precise Detection in Densely Packed Scenes. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5222–5231.
22. Tonioni, A.; Serra, E.; Di Stefano, L. A Deep Learning Pipeline for Product Recognition on Store Shelves. In Proceedings of the International Conference on Image Processing, Applications and Systems, Lyon, France, 9–11 January 2018; pp. 25–31.
23. Wei, X.S.; Cui, Q.; Yang, L.; Wang, P.; Liu, L.; Yang, J. RPC: A large-scale and fine-grained retail product checkout dataset. Sci. China Inf. Sci. 2022, 65, 197101.
24. Wei, Y.; Tran, S.; Xu, S.; Kang, B.; Springer, M. Deep Learning for Retail Product Recognition: Challenges and Techniques. Comput. Intell. Neurosci. 2020, 2020, 8875910.
25. Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No Fuss Distance Metric Learning Using Proxies. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 360–368.
26. Selvam, P.; Koilraj, J.A.S. A Deep Learning Framework for Grocery Product Detection and Recognition. Food Anal. Methods 2022, 15, 3498–3522.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9630–9640.
Figure 1. Examples of data augmentations applied during contrastive pre-training: (a) original image, (b) crop and resize, (c) crop, resize and horizontal flip, (d) colour distortion (drop), (e) colour distortion (jitter), (f) rotation, (g) cutout, (h) Gaussian noise, (i) Gaussian blur, (j) Sobel filtering.
Figure 2. Example images from the RP2K dataset illustrating different SKU categories and variations in background, viewpoint, and illumination.
Figure 3. Representative images from the in-house SKU (InSKU) dataset illustrating challenging real retail acquisition conditions.
Figure 4. Examples of fine-grained confusions.
Figure 5. t-SNE projections of joint embeddings from RP2K (blue) and InSKU (orange). Each subfigure shows representations extracted using a model trained on the indicated dataset. The visualisations illustrate differences in domain clustering and cross-domain alignment induced by different self-supervised objectives.
Table 1. Extended comparison of representative self-supervised learning methods with respect to computational complexity, memory footprint, optimisation strategy, and practical deployment considerations. Here, B denotes batch size, D embedding dimension, and Q queue size.
| Method | Negatives | Memory Mechanism | Computational Complexity | Memory Footprint | Batch Size | Training Stability |
|---|---|---|---|---|---|---|
| SimCLR | In-batch | None | similarity computations | Low (no queue), but high GPU memory due to large B | Very high | High (with strong augmentation) |
| MoCo v2 | Queue-based | Momentum encoder + queue (Q) | similarity computations | Moderate (queue of size Q) | Moderate | High |
| BYOL | No explicit negatives | Momentum target network | (no negative comparisons) | Moderate (dual networks) | Moderate | High (collapse avoided via architecture) |
Table 2. Hyperparameter configuration for SimCLR, MoCo v2, and BYOL across the RP2K and InSKU datasets.
| Parameter | SimCLR | MoCo v2 | BYOL |
|---|
| RP2K | InSKU | RP2K | InSKU | RP2K | InSKU |
|---|
| Batch size | 256 | 128 | 64 | 256 |
| Learning rate | 0.01 | 0.0015 | 0.001 | 0.0015 | 0.01 | 0.1 |
| Temperature | 0.07 | 0.07 | – |
| Momentum coefficient | – | 0.999 | 0.996 |
| Queue size | – | 16384 | 4096 | – |
| Optimiser | ADAM | SGD | SGD |
| Optimiser Momentum | – | 0.9 | 0.9 |
| Optimiser Weight Decay | | | |
Table 3. Linear evaluation accuracy (Top-1 and Top-5) on RP2K after self-supervised pre-training using a ResNet-50 backbone.
| Method | Encoder Epochs | Classifier Epochs | Training Time (Encoder) | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|
| SimCLR | 150 | 50 | 37 h 27 min | 94.98 | 99.22 |
| MoCo v2 | 150 | 50 | 35 h 40 min | 29.38 | 50.92 |
| BYOL | 150 | 50 | 35 h 30 min | 62.90 | 83.36 |
Table 4. Linear evaluation accuracy (Top-1 and Top-5) of BYOL on RP2K with different encoder backbones.
| Encoder | Encoder Epochs | Classifier Epochs | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|
| ResNet-50 | 50 | 50 | 62.74 | 82.17 |
| ResNet-18 | 50 | 50 | 64.04 | 83.96 |
| ResNet-152 | 50 | 50 | 35.67 | 59.29 |
Table 5. Linear evaluation accuracy (Top-1 and Top-5) of BYOL variants on RP2K.
| Configuration | Encoder | Encoder Epochs | Classifier Epochs | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|
| BYOL (base) | ResNet-50 | 50 | 50 | 62.74 | 82.17 |
| BYOL* | ResNet-50 | 50 | 50 | 61.61 | 82.21 |
| BYOL* | ResNet-18 | 50 | 50 | 65.84 | 84.87 |
Table 6. Linear evaluation accuracy (Top-1 and Top-5): ImageNet contrastive pre-training → RP2K, classifier-only training.
| Method | Classifier Epochs | Training Time | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|
| SimCLR | 50 | 4 h | 91.52 | 98.21 |
| MoCo v2 | 50 | 4 h | 89.98 | 97.35 |
| BYOL | 50 | 4 h | 88.82 | 97.22 |
Table 7. Classification accuracy (Top-1 and Top-5): ImageNet contrastive pre-training → RP2K, full fine-tuning.
| Method | Pre-Training Epochs | Fine-Tuning Epochs | Training Time | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|
| SimCLR | 50 | 50 | 16 h | 95.58 | 99.21 |
| MoCo v2 | 50 | 50 | 16 h | 89.15 | 97.04 |
| BYOL | 50 | 50 | 16 h | 99.22 | 99.98 |
Table 8. Linear evaluation accuracy (Top-1 and Top-5) and Macro-F1 scores on filtered RP2K after self-supervised pre-training using a ResNet-50 backbone.
| Method | Encoder Epochs | Classifier Epochs | Top-1 [%] | Top-5 [%] | Macro-F1 |
|---|---|---|---|---|---|
| SimCLR | 50 | 50 | 84.41 | 96.10 | 0.84 |
| MoCo v2 | 50 | 50 | 87.17 | 96.98 | 0.87 |
| BYOL | 50 | 50 | 25.15 | 44.73 | 0.23 |
Table 9. Retrieval accuracy (Top-1 and Top-5) after self-supervised pre-training on InSKU (ResNet-50 backbone).
| Method | Encoder Epochs | Training Time | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|
| SimCLR | 1000 | 7 h 40 min | 44.34 | 59.97 |
| MoCo v2 | 1000 | 5 h 45 min | 24.63 | 37.05 |
| BYOL | 1000 | 5 h 30 min | 7.31 | 11.40 |
Table 10. Linear evaluation accuracy (Top-1 and Top-5) on InSKU after self-supervised pre-training (ResNet-50 backbone).
| Method | Encoder Epochs | Classifier Epochs | Training Time | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|---|
| SimCLR | 1000 | 250 | 7 h 52 min | 54.53 | 73.53 |
| MoCo v2 | 1000 | 250 | 5 h 59 min | 31.13 | 50.42 |
| BYOL | 1000 | 250 | 5 h 44 min | 14.33 | 26.70 |
Table 11. Linear evaluation accuracy (Top-1 and Top-5) on InSKU after self-supervised pre-training on RP2K (ResNet-50 backbone).
| Method | Encoder Epochs | Classifier Epochs | Top-1 [%] | Top-5 [%] |
|---|---|---|---|---|
| SimCLR | 150 | 250 | 61.90 | 79.85 |
| MoCo v2 | 150 | 250 | 6.72 | 17.94 |
| BYOL | 150 | 250 | 7.52 | 18.75 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.