1. Introduction
The growing deployment of deep learning on mobile and edge devices has intensified the need for lightweight vision models that operate under strict constraints on computation, memory, and energy [1,2]. Although a wide range of mobile-oriented architectures—such as MobileNet, ShuffleNet, EfficientNet, and MobileViT—have emerged [3,4,5,6,7,8,9,10], along with techniques like pruning, quantization, and neural architecture search [11,12,13,14,15], their evaluation remains largely focused on ImageNet. Mobile models are often scaled up for ImageNet training to achieve measurable performance gains. However, before deployment on mobile devices, these architectures must be scaled down and retrained on smaller, domain-specific datasets, where performance on ImageNet may no longer hold and cross-domain generalization is uncertain [16,17,18,19]. Understanding whether ImageNet-optimized architectures generalize effectively under these conditions has thus become an increasingly important and underexplored problem.
These challenges raise several critical questions for the design and evaluation of lightweight mobile vision models. First, do ImageNet-optimized architectures maintain stable performance across domains, or are their performance gains largely dataset-specific [3,5,6,8,9,11,12,13,14,15,20,21,22,23,24,25,26]? Prior work on cross-domain robustness, such as DomainBed [27] and WILDS [28], has exposed the limitations of average accuracy by evaluating models across heterogeneous domains, environments, and real-world distribution shifts, often reporting mean and worst-case performance. However, these benchmarks primarily report per-domain accuracies rather than a single robustness metric, and they do not explicitly address robustness in mobile-scale models trained from scratch under constrained parameter and computational budgets. Second, which evaluation metrics most effectively capture cross-domain performance and robustness in resource-constrained models, providing a practical substitute for large-scale benchmarks like ImageNet [16,17,18,19,29,30]? Third, which architectural patterns consistently deliver strong performance under standardized, limited-resource training, revealing design principles that support robust generalization [1,2,5,8,9,18,19,21,23,25,26,31,32,33,34,35]? Collectively, these questions highlight the need for a simple evaluation framework that moves beyond single-dataset benchmarks and offers actionable guidance for developing lightweight models that generalize reliably across domains.
To address these questions, we develop a benchmark framework to systematically evaluate 11 representative mobile-oriented architectures across seven diverse datasets, with the goal of uncovering robust design patterns and establishing reliable evaluation practices. All models are trained from scratch under standardized conditions, where standardization refers to a fixed data pre-processing and augmentation pipeline, training framework, optimizer, learning-rate schedule, loss function, and stopping criterion applied uniformly across models; implementation details are provided in the repository [36]. This setup isolates architectural effects from dataset-specific tuning or optimization artifacts. The framework enables us to assess whether ImageNet performance reliably predicts behavior across domains and to explore a smaller, cheaper metric as a practical alternative to large-scale benchmarks. Based on this analysis, we introduce the Cross-Dataset Score (xScore), computed from four representative datasets, which effectively captures cross-domain performance. Examining models with high xScores further reveals architectural patterns that consistently support strong generalization. This approach motivates principled evaluation beyond single-dataset benchmarks and guides the design of more robust lightweight models.
In summary, we make three primary contributions. First, we establish a fully reproducible benchmark capable of evaluating future mobile-oriented architectures across diverse datasets under standardized training conditions, enabling fair and controlled comparisons as reflected by xScore [36]. Second, we demonstrate that xScore is a more informative evaluation metric than ImageNet accuracy alone, jointly capturing accuracy and consistency across datasets while being computable from just four small, diverse datasets [37,38,39,40]. Third, we use xScore to identify architectural patterns that consistently support robust generalization under tight resource constraints, highlighting design principles that meaningfully contribute to reliable cross-domain performance [3,5,6,8,20,21,22,25,26,31]. Together, these contributions establish a principled framework for evaluating mobile vision models and offer actionable guidance for designing architectures that generalize across diverse datasets.
3. Experiments, Results and Discussion
To quantify cross-domain robustness with a single, comparable metric, we compute each model’s xScore by aggregating its normalized test accuracies across multiple datasets (Equations (1)–(3)). By comparing xScore values across architectures, we examine whether this metric provides a more objective and reliable assessment of mobile-scale models than conventional single-dataset benchmarks such as ImageNet, and what it reveals about the relationship between architectural efficiency and generalization.
To obtain each model’s xScore, we evaluate its test accuracy on all seven datasets using the streamlined training framework described in Section 2. We apply this pipeline consistently to all eleven mobile-scale vision architectures, training each model from scratch with standard weight initialization. We use the Adam optimizer [46] without weight decay, with fixed maximum and minimum learning rates shared across models (the maximum learning rate is reduced for MobileViT due to its higher sensitivity); the exact values are provided in the released configuration [36]. Training begins with a 5-epoch linear warm-up, followed by cosine annealing [47] to ensure stable convergence [12,15]. Cross-entropy loss with label smoothing [48] is used throughout.
All dataset images are resized to a common input resolution, normalized using each dataset’s mean and standard deviation, and augmented with CutMix [49], RandomFlip, and ColorJitter [30]. The batch size is 32, and training is performed in FP32 on an NVIDIA GeForce RTX 3090 GPU to emulate mobile deployment constraints. Each model is trained for 100 epochs on each dataset, with performance reported at the end of training on the corresponding test or evaluation split. Early stopping is intentionally not employed, ensuring a fair, blind comparison across models with no visibility into test performance until training is complete, and preventing heuristic adjustments based on early-stopping behavior.
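For reference, the following PyTorch sketch mirrors the training recipe described above (Adam without weight decay, a 5-epoch linear warm-up followed by cosine annealing, and cross-entropy with label smoothing). The learning-rate values, label-smoothing factor, and warm-up start factor here are illustrative placeholders rather than the paper's exact settings; the authoritative configuration is the released framework [36].

```python
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_training_setup(model, epochs=100, warmup_epochs=5,
                         max_lr=1e-3, min_lr=1e-5, label_smoothing=0.1):
    """Sketch of the standardized recipe; hyperparameter values are
    placeholders, not the paper's exact settings."""
    # Adam without weight decay, as described above
    optimizer = optim.Adam(model.parameters(), lr=max_lr, weight_decay=0.0)

    # 5-epoch linear warm-up, then cosine annealing down to the minimum LR
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=min_lr)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_epochs])

    # Cross-entropy with label smoothing; CutMix, RandomFlip, and ColorJitter
    # are applied in the data pipeline (omitted here).
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
    return optimizer, scheduler, criterion
```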
Given the large number of model–dataset combinations, each model is initially trained once per dataset to identify top-performing architectures. Based on this screening stage, the three best-performing models—ConvMixer, EfficientNet, and MobileViT—are selected and evaluated in two additional runs with different random seeds, giving three runs in total from which average performance is estimated. Owing to the limited number of repeated evaluations, standard deviations are not reported. For each model–dataset combination, test accuracy is reported in Table 4, with the top three models (marked by +) reported as the average over three runs. Models are listed in rows and datasets in columns. All models perform well on CIFAR-10, reflecting its common use as a lightweight benchmark, and accuracy on ImageNette is similarly high, confirming its role as a reduced-scale proxy for ImageNet. In contrast, performance drops sharply on Stanford Dogs and MIT Indoor-67, which are less commonly incorporated into mobile-model pipelines. This pattern underscores the limitations of single-dataset evaluation and motivates xScore as a more comprehensive measure of cross-domain robustness.
For clarity, each dataset’s highest and lowest accuracies among the evaluated models are highlighted in green and red, respectively, in Table 4. These column-wise extrema serve as normalization anchors in the xScore calculation (Equation (1)), establishing a consistent relative scale for current and future models. Although the anchors are derived from the models considered here, adding a future state-of-the-art model with accuracy above the current maximum would simply produce a normalized value greater than 1, and hence an xScore exceeding those of the listed models. Because the normalization scales differences proportionally, xScore continues to reflect cross-dataset improvement and maintains meaningful relative rankings without adjusting the original min and max anchors, which would otherwise require recalculating all previously reported xScores.
Table 5 reports the xScore for each architecture under several settings of the weighting parameter that balances mean performance against cross-dataset variance, combined according to Equations (2) and (3). Models are ordered by xScore magnitude, providing a concise summary of cross-domain robustness. Notably, the relative ranking of models remains largely consistent across different values of this parameter, indicating the stability and discriminative power of the proposed xScore metric.
For subsequent analysis, we adopt an intermediate value of this weighting parameter to balance cross-dataset generalization against performance variability, avoiding bias toward either extreme. For clarity, the xScore is computed from normalized cross-dataset accuracies (Equation (1)), where the per-dataset minima and maxima define the normalization range. From these normalized values, we compute the mean performance and the variance across datasets (Equation (2)) and combine them via Equation (3) to obtain a single metric that jointly captures accuracy and robustness to domain shifts. As a result, the xScore enables fair and reproducible comparison of lightweight models across diverse benchmarks.
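As a concrete illustration of this aggregation, the sketch below normalizes per-dataset accuracies against the fixed anchors and then combines their mean and variance. The exact functional form of Equation (3) is not reproduced here; the sketch assumes a simple mean-minus-weighted-variance combination with a weighting parameter `k`, so it should be read as a schematic of the procedure rather than the reference implementation in the repository [36].

```python
import numpy as np

def xscore(accs, anchor_min, anchor_max, k=1.0):
    """Schematic xScore: min-max normalize each dataset's accuracy against the
    fixed Table 4 anchors (Eq. (1)), take the mean and variance of the
    normalized values (Eq. (2)), and combine them (assumed form of Eq. (3))."""
    accs = np.asarray(accs, dtype=float)
    anchor_min = np.asarray(anchor_min, dtype=float)
    anchor_max = np.asarray(anchor_max, dtype=float)

    normalized = (accs - anchor_min) / (anchor_max - anchor_min)  # Eq. (1)
    mean, var = normalized.mean(), normalized.var()               # Eq. (2)
    return mean - k * var                                         # assumed Eq. (3)
```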
Examining the xScore rankings in Table 5 alongside the per-dataset accuracies in Table 4 reveals a clear pattern: despite uniform training pipelines, consistent input resolutions, and comparable parameter capacities, the models’ accuracies do not fluctuate mildly around a shared average or oscillate randomly across datasets. Instead, they diverge substantially and systematically across domains, and this divergence is directly reflected in their xScore values. Such disparity prompts several methodological questions:
3.1. Is ImageNet a Fair Benchmark for Mobile Models?
Traditional single-dataset metrics, particularly ImageNet accuracy, are widely used as evaluation standards. However, prior work shows that higher ImageNet accuracy does not reliably translate to downstream performance or cross-domain robustness [16,17,19]. Structural regularities in ImageNet—such as object-centered composition, consistent lighting, and balanced class distributions [50]—can inflate reported performance while masking weaknesses in generalization. Despite this, evaluating mobile-scale architectures exclusively through ImageNet has become routine, even though the dataset was designed for high-capacity models rather than networks constrained to a few million parameters. Expecting such models to perform competitively on a 1.28M-image, 1000-class benchmark is therefore a fundamental mismatch of capacity and task difficulty.
This misalignment is further compounded by common reporting practices. So-called “state-of-the-art” ImageNet results for mobile backbones are frequently achieved by scaling models far beyond realistic deployment budgets, pairing them with bespoke augmentation pipelines, and applying heavy regularization and extensive hyperparameter tuning. These inconsistent—and often opaque—training conditions further undermine meaningful evaluation and render direct architectural comparisons unreliable.
To mitigate potential confounds, we retrain all models from scratch with comparable parameter budgets suitable for realistic mobile environments.
Figure 1 shows the per-model correlation between ImageNette accuracy and xScore across the seven datasets, illustrating whether ImageNette accuracy predicts cross-domain robustness. If ImageNet (as proxied by ImageNette) were a reliable benchmark for mobile robustness, the correlation in Figure 1 would be tight and monotonic. Instead, it is loose, noisy, and in some cases inverted, clearly demonstrating the divergence between ImageNette performance and cross-domain robustness. For example, several models (e.g., GhostNet) achieve high ImageNette accuracy yet generalize poorly across diverse datasets, while others (e.g., StarNet) with lower ImageNette performance exhibit more stable cross-domain behavior. The implication is clear: ImageNet accuracy, in practice, provides an incomplete and often misleading signal of mobile model capability. By contrast, xScore offers a more faithful, domain-aware assessment, capturing both performance and stability in ways that reflect real-world mobile deployments.
The xScore is most meaningful when applied to models with comparable parameter budgets trained under the provided framework, including the standardized training procedure and hyperparameters. Models with substantially higher capacity may achieve superior raw accuracy on individual datasets, but such gains often reflect scale rather than architectural merit. In contrast, xScore measures performance consistency across heterogeneous domains under equivalent resource constraints, rather than dominance on a single benchmark. This distinction is crucial for mobile-scale research, where efficiency, adaptability, and robustness are as important as raw accuracy. By normalizing evaluation to capacity-equivalent models, xScore provides a fairer and more interpretable measure of cross-domain generalization.
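The correlation underlying Figure 1 can be reproduced with a few lines of SciPy. The helper below is our own sketch: it takes per-model ImageNette accuracies and xScores (e.g., read off Tables 4 and 5) and returns Pearson and Spearman coefficients; the function name and inputs are assumptions, not part of the released framework.

```python
from scipy.stats import pearsonr, spearmanr

def imagenette_vs_xscore(imagenette_acc, xscores):
    """Correlate per-model ImageNette accuracy with xScore (sketch).
    Both arguments are equal-length sequences ordered by model."""
    pearson_r, _ = pearsonr(imagenette_acc, xscores)
    spearman_rho, _ = spearmanr(imagenette_acc, xscores)
    return pearson_r, spearman_rho
```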
3.2. How to Make xScore a Practical Measure of Mobile Architecture Generalization
While many datasets exist—and new ones continue to be created—we aim to show that not all are equally discriminative or challenging for computing a representative xScore. To demonstrate that a smaller subset suffices without losing generality, we perform a brute-force search to identify the four datasets from the original seven that best preserve the full xScore ranking. For each candidate subset, the seven-dataset xScore is predicted via simple linear regression on the subset’s accuracies, with the coefficient of determination (R² [51]) quantifying fidelity. Subsets with R² near 1 closely reproduce the full xScore, and exhaustive evaluation selects the most representative combination.
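The subset search itself is simple enough to sketch: enumerate all four-dataset combinations, regress the full seven-dataset xScore on each subset’s per-model accuracies, and keep the subset with the highest R². The scikit-learn snippet below is our own illustration of this procedure; the variable names and the use of an in-sample R² are assumptions, not the released implementation.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def most_representative_subset(acc_matrix, full_xscores, dataset_names, k=4):
    """Brute-force search for the k-dataset subset whose accuracies best
    predict the full xScore via linear regression (sketch).
    acc_matrix: (n_models, n_datasets) array of test accuracies."""
    best_subset, best_r2 = None, -np.inf
    for idx in combinations(range(acc_matrix.shape[1]), k):
        X = acc_matrix[:, idx]                       # accuracies on the candidate subset
        reg = LinearRegression().fit(X, full_xscores)
        r2 = reg.score(X, full_xscores)              # coefficient of determination
        if r2 > best_r2:
            best_subset, best_r2 = [dataset_names[i] for i in idx], r2
    return best_subset, best_r2
```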
This process selects CIFAR-10, HAM10000, Stanford Dogs, and MIT Indoor-67, the combination achieving the highest R² among all candidate subsets. The resulting 4-dataset xScore for each model is listed in the last column of Table 5.
Figure 2 shows the correspondence between the full seven-dataset benchmark and the reduced four-dataset subset for the eleven mobile-scale vision models. The x-axis denotes the xScore computed across all seven datasets, while the y-axis shows the xScore predicted from the selected subset. The dashed diagonal line (y = x) marks perfect agreement. Each bubble represents a model, with its color encoding the mean normalized accuracy (generalization score) and its size reflecting the cross-dataset variance. Smaller bubbles indicate stable performance; larger ones indicate variability across domains. Most models align closely with the identity line, confirming that the four-dataset xScore is a faithful approximation of the full seven-dataset benchmark, preserving both ranking and generalization trends.
Notably, the subset containing CIFAR-10, HAM10000, Stanford Dogs, and MIT Indoor-67 happens to span four distinct domains—low-resolution object diversity, domain-shifted medical imagery, fine-grained categorization, and general scene recognition—preserving heterogeneity while dramatically reducing computational cost. The combined benchmark includes roughly 74,000 training samples across 204 classes, compared to ImageNet’s 1.28 million samples and 1000 classes. This reduced yet representative set avoids the underfitting common in parameter-limited mobile models and provides a reliable, computationally efficient measure of cross-domain generalization.
From a practical perspective, evaluating a new model using xScore is straightforward. The shared training and evaluation framework—including dataloaders, preprocessing, and schedule—is available on GitHub [36]. A new model can be trained independently on each dataset (see Algorithm 1, Step 3), and its per-dataset accuracies aggregated using Equations (1)–(3) with the fixed anchors in Table 4 to compute its xScore. On a standard desktop GPU (e.g., an NVIDIA RTX 3090), evaluation requires only a few hours per dataset, with memory demands compatible with mobile-scale models. Reference models’ xScores are already available, allowing immediate comparison.
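Putting the pieces together, scoring a new model reduces to a single aggregation step once its per-dataset accuracies are collected. The snippet below reuses the schematic `xscore()` helper sketched earlier (not the repository code); the accuracy and anchor values are illustrative placeholders only, not measured results.

```python
# Hypothetical per-dataset accuracies of a new model on the 4-dataset benchmark
# (order: CIFAR-10, HAM10000, Stanford Dogs, MIT Indoor-67); placeholders only.
new_model_acc = [0.92, 0.78, 0.48, 0.55]

# Fixed per-dataset min/max anchors taken from Table 4 (placeholder values here).
anchor_min = [0.80, 0.65, 0.30, 0.40]
anchor_max = [0.95, 0.85, 0.60, 0.70]

score = xscore(new_model_acc, anchor_min, anchor_max, k=1.0)
print(f"4-dataset xScore (illustrative): {score:.3f}")
```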
3.3. What Can xScore Reveal About Mobile Architectures’ Efficiency?
Building on the demonstration that xScore is a reliable cross-domain metric, we take an open-ended approach to examining models with high xScore values, aiming to uncover which architectural patterns contribute to their superior performance. The 7-dataset xScore in Table 5 highlights EfficientNet and ConvMixer as the most consistent performers, forming a top tier with both high accuracy and stability across diverse datasets. The next tier—including MobileViT, MobileNet, StarNet, and FBNet—shows moderate generalization (xScore of roughly 0.65–0.8), while ShuffleNet, GhostNet, MobileOne, and TinyNet occupy a lower tier (0.5–0.6). Notably, ConvNeXt, despite strong reported ImageNet performance, exhibits the lowest cross-domain xScore under the constrained model size, suggesting overfitting to high-capacity benchmarks or reliance on implicit regularization that does not generalize.
Table 6 summarizes the core architectural elements of the eleven evaluated mobile models. All architectures share a multi-layer convolutional backbone with depthwise and pointwise convolutions, residual connections, and standard activations. However, each model incorporates distinct design patterns. Evaluating elements such as pointwise (1×1) convolutions, depthwise separable layers, inverted bottlenecks [31], SE modules [33], patch-based convolutions [52], and large-kernel spatial mixing [53] reveals their unequal impact on cross-domain robustness.
Most notably, EfficientNet and ConvMixer illustrate complementary strategies for robust channel-wise information flow. EfficientNet employs inverted residuals to expand bottlenecks, followed by depthwise convolutions in an “expand–filter–squeeze” pattern, with Squeeze-and-Excite attention dynamically reweighting channels as spatial resolution contracts. This combination may help explain its strong performance. ConvMixer, in contrast, maintains uniform spatial resolution and isotropic convolutions. Inputs are patch-embedded and processed with depthwise convolutions, residual connections, and pointwise convolutions, enabling unrestricted channel mixing and direct learning of local and mid-level representations. Avoiding hierarchical downsampling preserves spatial information, which may support robust generalization with a minimal design. Despite their different philosophies, both architectures embody the principle of preserving rich, independent channel representations that encode diverse visual features, yielding compact yet expressive feature maps.
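To make the ConvMixer pattern concrete, the minimal PyTorch sketch below follows its published structure: a patch-embedding convolution, then repeated depthwise spatial mixing with a residual connection followed by pointwise channel mixing, each with GELU and BatchNorm. The width, depth, kernel, and patch sizes here are illustrative, not the configuration evaluated in this study.

```python
from torch import nn

class ConvMixerBlock(nn.Module):
    """One isotropic ConvMixer block: depthwise spatial mixing with a residual
    connection, followed by pointwise (1x1) channel mixing."""
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(), nn.BatchNorm2d(dim))
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x):
        x = x + self.depthwise(x)   # residual around spatial mixing
        return self.pointwise(x)    # unrestricted channel mixing

def conv_mixer(dim=256, depth=8, patch_size=7, kernel_size=9, num_classes=10):
    """Illustrative ConvMixer-style classifier; sizes are placeholders."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),  # patch embedding
        nn.GELU(), nn.BatchNorm2d(dim),
        *[ConvMixerBlock(dim, kernel_size) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_classes))
```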
In comparison, models that limit inter-channel communication or reduce feature diversity—such as ShuffleNet, GhostNet, and MobileOne—illustrate the trade-offs inherent in efficiency-driven design. Techniques such as group convolutions, channel shuffling, “ghost” features, and inference-time merging constrain representational capacity, which may lead to weaker cross-domain stability. Hybrid architectures like MobileViT tend to underperform in low-parameter regimes, suggesting that attention mechanisms alone cannot overcome tight capacity constraints. NAS-optimized models, including FBNet and TinyNet, prioritize ImageNet efficiency through block-level optimization, often at the expense of transfer robustness. Collectively, these observations suggest that macro-architectural factors—depth, width, scaling strategy, and feature mixing—play a more decisive role in generalization than block-level optimizations alone.
MobileNet—despite its relative simplicity—achieves competitive performance across datasets. This suggests that its straightforward design scales effectively across diverse visual domains, whereas derived models are often optimized specifically for ImageNet. These observations further illustrate why xScore can provide a more reliable measure of model performance across multiple datasets, capturing generalization trends beyond a single benchmark.
In summary, xScore enables quantification of the effects of both shared and distinct architectural patterns, indicating that cross-domain robustness is more closely associated with coordinated macro-architecture and effective channel-wise information flow than with single-dataset accuracy or parameter-efficient block design. As such, xScore provides a principled framework for identifying and formalizing design patterns for the next generation of mobile-scale architectures, balancing efficiency with robust generalization.
4. Conclusions
In the context of mobile vision models, large-scale datasets such as ImageNet are often overly demanding and misaligned with mobile deployment constraints. While ImageNet remains valuable for general benchmarking, it is not ideally suited for evaluating mobile-scale architectures. By contrast, xScore provides a faster, more interpretable, and more reliable measure of cross-domain performance. Its selected 4-dataset benchmark—roughly 74,000 training samples across 204 classes, less than 6% of ImageNet—enables substantially shorter training times and higher computational efficiency while preserving domain diversity. Each candidate model requires only a single training–evaluation pass on each of the four datasets to compute its xScore. Beyond evaluation, xScore highlights key architectural drivers of robustness, particularly efficient channel-wise information flow and balanced macro-architectural scaling. The framework is reproducible, extensible, and tractable even for parameter-limited models (see Section 3.2 for implementation details and computational considerations). Taken together, xScore constitutes a practical, domain-aware alternative to single-dataset metrics, providing diagnostic insight and actionable guidance for designing reliable lightweight vision architectures.
Our work aligns with the growing emphasis on cross-dataset evaluation and reproducible benchmarking [11,14,16,17]. While prior studies have documented the instability of ImageNet-pretrained models under domain shift [19] and proposed techniques such as self-distillation [54], feature sharing [55], and token pruning [56,57], they lack a simple quantitative metric for systematic cross-domain assessment. The xScore fills this gap by normalizing for parameter budget, balancing mean performance and variance, and enabling a reduced four-dataset benchmark that preserves full cross-domain rankings, lowering computational cost while maintaining interpretability.
Several limitations should be noted, highlighting areas for further research that could complement and extend the findings of this study:
Model constraints and training strategy: This study focuses on models constrained to approximately 2.5 M parameters, using fixed architectural layouts and consistent training hyperparameters. We did not explore cases where architecture-specific tuning within the same parameter budget could yield stronger performance, nor did we evaluate models with higher parameter counts (e.g., 5 M), both of which could influence the xScore ranking. Variations in training protocols—such as additional evaluation seeds, different batch sizes, varied data augmentation, or optimizer settings—may influence absolute performance, though relative trends are expected to remain largely consistent.
Dataset coverage: Although the four selected datasets span diverse domains, they were chosen from the seven representative datasets included in this framework, which do not cover the full spectrum of real-world or synthetic visual domains. As a result, observed xScore trends may differ when additional datasets—such as synthetic or video data—are incorporated.
Vision task specificity: This study is limited to image classification tasks. Tasks requiring fine spatial precision, such as semantic segmentation or object detection, may rely on architectural features not captured by the image classification evaluations, potentially producing different xScore rankings. Dedicated benchmark metrics would be needed to fairly assess performance for these task-specific scenarios.
Looking ahead, xScore can guide next-generation mobile architecture design, inform robustness-oriented neural architecture search (NAS), and support evaluation in semi- or self-supervised settings. The design patterns observed in EfficientNet and ConvMixer—particularly their strategies for maximizing channel-wise information flow—can assist in searching for and testing more efficient architectures. In NAS, xScore prioritizes cross-domain robustness and discourages overfitting to datasets like ImageNet. In deployment, it enables early-stage assessment of mobile models, predicting real-world stability without the cost of full-scale training—an essential advantage for mobile and edge systems.