1. Introduction
The growing deployment of deep learning on mobile and edge devices has intensified the need for lightweight vision models that operate under strict constraints on computation, memory, and energy [1,2]. Although a wide range of mobile-oriented architectures—such as MobileNet, ShuffleNet, EfficientNet, and MobileViT—have emerged [3,4,5,6,7,8,9,10], along with techniques like pruning, quantization, and neural architecture search [11,12,13,14,15], their evaluation remains largely focused on ImageNet. Mobile models are often scaled up for ImageNet training to achieve measurable performance gains. However, before deployment on mobile devices, these architectures must be scaled down and retrained on smaller, domain-specific datasets, where performance on ImageNet may no longer hold and cross-domain generalization is uncertain [16,17,18,19]. Understanding whether ImageNet-optimized architectures generalize effectively under these conditions has thus become an increasingly important and underexplored problem.
These challenges raise several critical questions for the design and evaluation of lightweight mobile vision models. First, do ImageNet-optimized architectures maintain stable performance across domains, or are their performance gains largely dataset-specific [3,5,6,8,9,11,12,13,14,15,20,21,22,23,24,25,26]? Prior work on cross-domain robustness, such as DomainBed [27] and WILDS [28], has exposed the limitations of average accuracy by evaluating models across heterogeneous domains, environments, and real-world distribution shifts, often reporting mean and worst-case performance. However, these benchmarks primarily report per-domain accuracies rather than a single robustness metric, and they do not explicitly address robustness in mobile-scale models trained from scratch under constrained parameter and computational budgets. Second, which evaluation metrics most effectively capture cross-domain performance and robustness in resource-constrained models, providing a practical substitute for large-scale benchmarks like ImageNet [16,17,18,19,29,30]? Third, which architectural patterns consistently deliver strong performance under standardized, limited-resource training, revealing design principles that support robust generalization [1,2,5,8,9,18,19,21,23,25,26,31,32,33,34,35]? Collectively, these questions highlight the need for a simple evaluation framework that moves beyond single-dataset benchmarks and offers actionable guidance for developing lightweight models that generalize reliably across domains.
To address these questions, we develop a benchmark framework to systematically evaluate 11 representative mobile-oriented architectures across seven diverse datasets, with the goal of uncovering robust design patterns and establishing reliable evaluation practices. All models are trained from scratch under standardized conditions, where standardization refers to a fixed data pre-processing and augmentation pipeline, training framework, optimizer, learning-rate schedule, loss function, and stopping criterion applied uniformly across models; implementation details are provided in the repository [36]. This setup isolates architectural effects from dataset-specific tuning or optimization artifacts. The framework enables us to assess whether ImageNet performance reliably predicts behavior across domains and to explore a smaller, cheaper metric as a practical alternative to large-scale benchmarks. Based on this analysis, we introduce the Cross-Dataset Score (xScore), computed from four representative datasets, which effectively captures cross-domain performance. Examining models with high xScores further reveals architectural patterns that consistently support strong generalization. This approach motivates principled evaluation beyond single-dataset benchmarks and guides the design of more robust lightweight models.
In summary, we make three primary contributions. First, we establish a fully reproducible benchmark capable of evaluating future mobile-oriented architectures across diverse datasets under standardized training conditions, enabling fair and controlled comparisons as reflected by xScore [36]. Second, we demonstrate that xScore is a more informative evaluation metric than ImageNet accuracy alone, jointly capturing accuracy and consistency across datasets while being computable from just four small, diverse datasets [37,38,39,40]. Third, we use xScore to identify architectural patterns that consistently support robust generalization under tight resource constraints, highlighting design principles that meaningfully contribute to reliable cross-domain performance [3,5,6,8,20,21,22,25,26,31]. Together, these contributions establish a principled framework for evaluating mobile vision models and offer actionable guidance for designing architectures that generalize across diverse datasets.
3. Experiments, Results and Discussion
To quantify cross-domain robustness with a single, comparable metric, we compute each model’s xScore by aggregating its normalized test accuracies across multiple datasets (Equations (1)–(3)). By comparing xScore values across architectures, we examine whether this metric provides a more objective and reliable assessment of mobile-scale models than conventional single-dataset benchmarks such as ImageNet, and what it reveals about the relationship between architectural efficiency and generalization.
To obtain each model’s xScore, we evaluate its test accuracy on all seven datasets using the streamlined training framework described in Section 2. We apply this pipeline consistently to all eleven mobile-scale vision architectures, training each model from scratch with standard weight initialization. We use the Adam optimizer [46] without weight decay, with fixed maximum and minimum learning rates shared across models (the maximum learning rate is reduced for MobileViT due to its higher sensitivity); the exact values are provided in the released configuration [36]. Training begins with a 5-epoch linear warm-up, followed by cosine annealing [47] to ensure stable convergence [12,15]. Cross-entropy loss with label smoothing [48] is used throughout.
All dataset images are resized to a common input resolution, normalized using each dataset’s mean and standard deviation, and augmented with CutMix [49], RandomFlip, and ColorJitter [30]. The batch size is 32, and training is performed in FP32 on an NVIDIA GeForce RTX 3090 GPU to emulate mobile deployment constraints. Each model is trained for 100 epochs on each dataset, with performance reported at the end of training on the corresponding test or evaluation split. Early stopping is intentionally not employed, ensuring a fair, blind comparison across models with no visibility into test performance until training is complete, and preventing heuristic adjustments based on early-stopping behavior.
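For reference, the following PyTorch sketch mirrors the training recipe described above (Adam without weight decay, a 5-epoch linear warm-up followed by cosine annealing, and cross-entropy with label smoothing). The learning-rate values, label-smoothing factor, and warm-up start factor here are illustrative placeholders rather than the paper's exact settings; the authoritative configuration is the released framework [36].

```python
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_training_setup(model, epochs=100, warmup_epochs=5,
                         max_lr=1e-3, min_lr=1e-5, label_smoothing=0.1):
    """Sketch of the standardized recipe; hyperparameter values are
    placeholders, not the paper's exact settings."""
    # Adam without weight decay, as described above
    optimizer = optim.Adam(model.parameters(), lr=max_lr, weight_decay=0.0)

    # 5-epoch linear warm-up, then cosine annealing down to the minimum LR
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=min_lr)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_epochs])

    # Cross-entropy with label smoothing; CutMix, RandomFlip, and ColorJitter
    # are applied in the data pipeline (omitted here).
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
    return optimizer, scheduler, criterion
```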
Given the large number of model–dataset combinations, each model is initially trained once per dataset to identify top-performing architectures. Based on this screening stage, the three best-performing models—ConvMixer, EfficientNet, and MobileViT—are selected and evaluated in two additional runs with different random seeds, giving three runs in total from which average performance is estimated. Owing to the limited number of repeated evaluations, standard deviations are not reported. For each model–dataset combination, test accuracy is reported in Table 4, with the top three models (marked by +) reported as the average over three runs. Models are listed in rows and datasets in columns. All models perform well on CIFAR-10, reflecting its common use as a lightweight benchmark, and accuracy on ImageNette is similarly high, confirming its role as a reduced-scale proxy for ImageNet. In contrast, performance drops sharply on Stanford Dogs and MIT Indoor-67, which are less commonly incorporated into mobile-model pipelines. This pattern underscores the limitations of single-dataset evaluation and motivates xScore as a more comprehensive measure of cross-domain robustness.
For clarity, each dataset’s highest and lowest accuracies among the evaluated models are highlighted in green and red, respectively, in Table 4. These column-wise extrema serve as normalization anchors in the xScore calculation (Equation (1)), establishing a consistent relative scale for current and future models. Although the anchors are derived from the models considered here, adding a future state-of-the-art model with accuracy above the current maximum would simply produce a normalized value greater than 1, and hence an xScore exceeding those of the listed models. Because the normalization scales differences proportionally, xScore continues to reflect cross-dataset improvement and maintains meaningful relative rankings without adjusting the original min and max anchors, which would otherwise require recalculating all previously reported xScores.
Table 5 reports the xScore for each architecture under several settings of the weighting parameter that balances mean performance against cross-dataset variance, combined according to Equations (2) and (3). Models are ordered by xScore magnitude, providing a concise summary of cross-domain robustness. Notably, the relative ranking of models remains largely consistent across different values of this parameter, indicating the stability and discriminative power of the proposed xScore metric.
For subsequent analysis, we adopt an intermediate value of this weighting parameter to balance cross-dataset generalization against performance variability, avoiding bias toward either extreme. For clarity, the xScore is computed from normalized cross-dataset accuracies (Equation (1)), where the per-dataset minima and maxima define the normalization range. From these normalized values, we compute the mean performance and the variance across datasets (Equation (2)) and combine them via Equation (3) to obtain a single metric that jointly captures accuracy and robustness to domain shifts. As a result, the xScore enables fair and reproducible comparison of lightweight models across diverse benchmarks.
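As a concrete illustration of this aggregation, the sketch below normalizes per-dataset accuracies against the fixed anchors and then combines their mean and variance. The exact functional form of Equation (3) is not reproduced here; the sketch assumes a simple mean-minus-weighted-variance combination with a weighting parameter `k`, so it should be read as a schematic of the procedure rather than the reference implementation in the repository [36].

```python
import numpy as np

def xscore(accs, anchor_min, anchor_max, k=1.0):
    """Schematic xScore: min-max normalize each dataset's accuracy against the
    fixed Table 4 anchors (Eq. (1)), take the mean and variance of the
    normalized values (Eq. (2)), and combine them (assumed form of Eq. (3))."""
    accs = np.asarray(accs, dtype=float)
    anchor_min = np.asarray(anchor_min, dtype=float)
    anchor_max = np.asarray(anchor_max, dtype=float)

    normalized = (accs - anchor_min) / (anchor_max - anchor_min)  # Eq. (1)
    mean, var = normalized.mean(), normalized.var()               # Eq. (2)
    return mean - k * var                                         # assumed Eq. (3)
```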
Examining the xScore rankings in Table 5 alongside the per-dataset accuracies in Table 4 reveals a clear pattern: despite uniform training pipelines, consistent input resolutions, and comparable parameter capacities, the models’ accuracies do not fluctuate mildly around a shared average or oscillate randomly across datasets. Instead, they diverge substantially and systematically across domains, and this divergence is directly reflected in their xScore values. Such disparity prompts several methodological questions:
3.1. Is ImageNet a Fair Benchmark for Mobile Models?
Traditional single-dataset metrics, particularly ImageNet accuracy, are widely used as evaluation standards. However, prior work shows that higher ImageNet accuracy does not reliably translate to downstream performance or cross-domain robustness [16,17,19]. Structural regularities in ImageNet—such as object-centered composition, consistent lighting, and balanced class distributions [50]—can inflate reported performance while masking weaknesses in generalization. Despite this, evaluating mobile-scale architectures exclusively through ImageNet has become routine, even though the dataset was designed for high-capacity models rather than networks constrained to a few million parameters. Expecting such models to perform competitively on a 1.28M-image, 1000-class benchmark is therefore a fundamental mismatch of capacity and task difficulty.
This misalignment is further compounded by common reporting practices. So-called “state-of-the-art” ImageNet results for mobile backbones are frequently achieved by scaling models far beyond realistic deployment budgets, pairing them with bespoke augmentation pipelines, and applying heavy regularization and extensive hyperparameter tuning. These inconsistent—and often opaque—training conditions further undermine meaningful evaluation and render direct architectural comparisons unreliable.
To mitigate potential confounds, we retrain all models from scratch with comparable parameter budgets suitable for realistic mobile environments.
Figure 1 shows the per-model correlation between ImageNette accuracy and xScore across the seven datasets, illustrating whether ImageNette accuracy predicts cross-domain robustness. If ImageNet (as proxied by ImageNette) were a reliable benchmark for mobile robustness, the correlation in Figure 1 would be tight and monotonic. Instead, it is loose, noisy, and in some cases inverted, clearly demonstrating the divergence between ImageNette performance and cross-domain robustness. For example, several models (e.g., GhostNet) achieve high ImageNette accuracy yet generalize poorly across diverse datasets, while others (e.g., StarNet) with lower ImageNette performance exhibit more stable cross-domain behavior. The implication is clear: ImageNet accuracy, in practice, provides an incomplete and often misleading signal of mobile model capability. By contrast, xScore offers a more faithful, domain-aware assessment, capturing both performance and stability in ways that reflect real-world mobile deployments.
The xScore is most meaningful when applied to models with comparable parameter budgets trained under the provided framework, including the standardized training procedure and hyperparameters. Models with substantially higher capacity may achieve superior raw accuracy on individual datasets, but such gains often reflect scale rather than architectural merit. In contrast, xScore measures performance consistency across heterogeneous domains under equivalent resource constraints, rather than dominance on a single benchmark. This distinction is crucial for mobile-scale research, where efficiency, adaptability, and robustness are as important as raw accuracy. By normalizing evaluation to capacity-equivalent models, xScore provides a fairer and more interpretable measure of cross-domain generalization.
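The correlation underlying Figure 1 can be reproduced with a few lines of SciPy. The helper below is our own sketch: it takes per-model ImageNette accuracies and xScores (e.g., read off Tables 4 and 5) and returns Pearson and Spearman coefficients; the function name and inputs are assumptions, not part of the released framework.

```python
from scipy.stats import pearsonr, spearmanr

def imagenette_vs_xscore(imagenette_acc, xscores):
    """Correlate per-model ImageNette accuracy with xScore (sketch).
    Both arguments are equal-length sequences ordered by model."""
    pearson_r, _ = pearsonr(imagenette_acc, xscores)
    spearman_rho, _ = spearmanr(imagenette_acc, xscores)
    return pearson_r, spearman_rho
```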
3.2. How to Make xScore a Practical Measure of Mobile Architecture Generalization
While many datasets exist—and new ones continue to be created—we aim to show that not all are equally discriminative or challenging for computing a representative xScore. To demonstrate that a smaller subset suffices without losing generality, we perform a brute-force search to identify the four datasets from the original seven that best preserve the full xScore ranking. For each candidate subset, the seven-dataset xScore is predicted via simple linear regression on the subset’s accuracies, with the coefficient of determination (R² [51]) quantifying fidelity. Subsets with R² near 1 closely reproduce the full xScore, and exhaustive evaluation selects the most representative combination.
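The subset search itself is simple enough to sketch: enumerate all four-dataset combinations, regress the full seven-dataset xScore on each subset’s per-model accuracies, and keep the subset with the highest R². The scikit-learn snippet below is our own illustration of this procedure; the variable names and the use of an in-sample R² are assumptions, not the released implementation.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def most_representative_subset(acc_matrix, full_xscores, dataset_names, k=4):
    """Brute-force search for the k-dataset subset whose accuracies best
    predict the full xScore via linear regression (sketch).
    acc_matrix: (n_models, n_datasets) array of test accuracies."""
    best_subset, best_r2 = None, -np.inf
    for idx in combinations(range(acc_matrix.shape[1]), k):
        X = acc_matrix[:, idx]                       # accuracies on the candidate subset
        reg = LinearRegression().fit(X, full_xscores)
        r2 = reg.score(X, full_xscores)              # coefficient of determination
        if r2 > best_r2:
            best_subset, best_r2 = [dataset_names[i] for i in idx], r2
    return best_subset, best_r2
```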
This process selects CIFAR-10, HAM10000, Stanford Dogs, and MIT Indoor-67, the combination achieving the highest R² among all candidate subsets. The resulting 4-dataset xScore for each model is listed in the last column of Table 5.
Figure 2 shows the correspondence between the full seven-dataset benchmark and the reduced four-dataset subset for the eleven mobile-scale vision models. The x-axis denotes the xScore computed across all seven datasets, while the y-axis shows the xScore predicted from the selected subset. The dashed diagonal line (y = x) marks perfect agreement. Each bubble represents a model, with its color encoding the mean normalized accuracy (generalization score) and its size reflecting the cross-dataset variance. Smaller bubbles indicate stable performance; larger ones indicate variability across domains. Most models align closely with the identity line, confirming that the four-dataset xScore is a faithful approximation of the full seven-dataset benchmark, preserving both ranking and generalization trends.
Notably, the subset containing CIFAR-10, HAM10000, Stanford Dogs, and MIT Indoor-67 happens to span four distinct domains—low-resolution object diversity, domain-shifted medical imagery, fine-grained categorization, and general scene recognition—preserving heterogeneity while dramatically reducing computational cost. The combined benchmark includes roughly 74,000 training samples across 204 classes, compared to ImageNet’s 1.28 million samples and 1000 classes. This reduced yet representative set avoids the underfitting common in parameter-limited mobile models and provides a reliable, computationally efficient measure of cross-domain generalization.
From a practical perspective, evaluating a new model using xScore is straightforward. The shared training and evaluation framework—including dataloaders, preprocessing, and schedule—is available on GitHub [36]. A new model can be trained independently on each dataset (see Algorithm 1, Step 3), and its per-dataset accuracies aggregated using Equations (1)–(3) with the fixed anchors in Table 4 to compute its xScore. On a standard desktop GPU (e.g., an NVIDIA RTX 3090), evaluation requires only a few hours per dataset, with memory demands compatible with mobile-scale models. Reference models’ xScores are already available, allowing immediate comparison.
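Putting the pieces together, scoring a new model reduces to a single aggregation step once its per-dataset accuracies are collected. The snippet below reuses the schematic `xscore()` helper sketched earlier (not the repository code); the accuracy and anchor values are illustrative placeholders only, not measured results.

```python
# Hypothetical per-dataset accuracies of a new model on the 4-dataset benchmark
# (order: CIFAR-10, HAM10000, Stanford Dogs, MIT Indoor-67); placeholders only.
new_model_acc = [0.92, 0.78, 0.48, 0.55]

# Fixed per-dataset min/max anchors taken from Table 4 (placeholder values here).
anchor_min = [0.80, 0.65, 0.30, 0.40]
anchor_max = [0.95, 0.85, 0.60, 0.70]

score = xscore(new_model_acc, anchor_min, anchor_max, k=1.0)
print(f"4-dataset xScore (illustrative): {score:.3f}")
```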
3.3. What Can xScore Reveal About Mobile Architectures’ Efficiency?
Building on the demonstration that xScore is a reliable cross-domain metric, we take an open-ended approach to examining models with high xScore values, aiming to uncover which architectural patterns contribute to their superior performance. The 7-dataset xScore in Table 5 highlights EfficientNet and ConvMixer as the most consistent performers, forming a top tier with both high accuracy and stability across diverse datasets. The next tier—including MobileViT, MobileNet, StarNet, and FBNet—shows moderate generalization (xScore of roughly 0.65–0.8), while ShuffleNet, GhostNet, MobileOne, and TinyNet occupy a lower tier (0.5–0.6). Notably, ConvNeXt, despite strong reported ImageNet performance, exhibits the lowest cross-domain xScore under the constrained model size, suggesting overfitting to high-capacity benchmarks or reliance on implicit regularization that does not generalize.
Table 6 summarizes the core architectural elements of the eleven evaluated mobile models. All architectures share a multi-layer convolutional backbone with depthwise and pointwise convolutions, residual connections, and standard activations. However, each model incorporates distinct design patterns. Evaluating elements such as pointwise (1×1) convolutions, depthwise separable layers, inverted bottlenecks [31], SE modules [33], patch-based convolutions [52], and large-kernel spatial mixing [53] reveals their unequal impact on cross-domain robustness.
Most notably, EfficientNet and ConvMixer illustrate complementary strategies for robust channel-wise information flow. EfficientNet employs inverted residuals to expand bottlenecks, followed by depthwise convolutions in an “expand–filter–squeeze” pattern, with Squeeze-and-Excite attention dynamically reweighting channels as spatial resolution contracts. This combination may help explain its strong performance. ConvMixer, in contrast, maintains uniform spatial resolution and isotropic convolutions. Inputs are patch-embedded and processed with depthwise convolutions, residual connections, and pointwise convolutions, enabling unrestricted channel mixing and direct learning of local and mid-level representations. Avoiding hierarchical downsampling preserves spatial information, which may support robust generalization with a minimal design. Despite their different philosophies, both architectures embody the principle of preserving rich, independent channel representations that encode diverse visual features, yielding compact yet expressive feature maps.
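To make the ConvMixer pattern concrete, the minimal PyTorch sketch below follows its published structure: a patch-embedding convolution, then repeated depthwise spatial mixing with a residual connection followed by pointwise channel mixing, each with GELU and BatchNorm. The width, depth, kernel, and patch sizes here are illustrative, not the configuration evaluated in this study.

```python
from torch import nn

class ConvMixerBlock(nn.Module):
    """One isotropic ConvMixer block: depthwise spatial mixing with a residual
    connection, followed by pointwise (1x1) channel mixing."""
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(), nn.BatchNorm2d(dim))
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x):
        x = x + self.depthwise(x)   # residual around spatial mixing
        return self.pointwise(x)    # unrestricted channel mixing

def conv_mixer(dim=256, depth=8, patch_size=7, kernel_size=9, num_classes=10):
    """Illustrative ConvMixer-style classifier; sizes are placeholders."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),  # patch embedding
        nn.GELU(), nn.BatchNorm2d(dim),
        *[ConvMixerBlock(dim, kernel_size) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_classes))
```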
In comparison, models that limit inter-channel communication or reduce feature diversity—such as ShuffleNet, GhostNet, and MobileOne—illustrate the trade-offs inherent in efficiency-driven design. Techniques such as group convolutions, channel shuffling, “ghost” features, and inference-time merging constrain representational capacity, which may lead to weaker cross-domain stability. Hybrid architectures like MobileViT tend to underperform in low-parameter regimes, suggesting that attention mechanisms alone cannot overcome tight capacity constraints. NAS-optimized models, including FBNet and TinyNet, prioritize ImageNet efficiency through block-level optimization, often at the expense of transfer robustness. Collectively, these observations suggest that macro-architectural factors—depth, width, scaling strategy, and feature mixing—play a more decisive role in generalization than block-level optimizations alone.
MobileNet—despite its relative simplicity—achieves competitive performance across datasets. This suggests that its straightforward design scales effectively across diverse visual domains, whereas derived models are often optimized specifically for ImageNet. These observations further illustrate why xScore can provide a more reliable measure of model performance across multiple datasets, capturing generalization trends beyond a single benchmark.
In summary, xScore enables quantification of the effects of both shared and distinct architectural patterns, indicating that cross-domain robustness is more closely associated with coordinated macro-architecture and effective channel-wise information flow than with single-dataset accuracy or parameter-efficient block design. As such, xScore provides a principled framework for identifying and formalizing design patterns for the next generation of mobile-scale architectures, balancing efficiency with robust generalization.
4. Conclusions
In the context of mobile vision models, large-scale datasets such as ImageNet are often overly demanding and misaligned with mobile deployment constraints. While ImageNet remains valuable for general benchmarking, it is not ideally suited for evaluating mobile-scale architectures. By contrast, xScore provides a faster, more interpretable, and more reliable measure of cross-domain performance. Its selected 4-dataset benchmark—roughly 74,000 training samples across 204 classes, less than 6% of ImageNet—enables substantially shorter training times and higher computational efficiency while preserving domain diversity. Each candidate model requires only a single training–evaluation pass on each of the four datasets to compute its xScore. Beyond evaluation, xScore highlights key architectural drivers of robustness, particularly efficient channel-wise information flow and balanced macro-architectural scaling. The framework is reproducible, extensible, and tractable even for parameter-limited models (see Section 3.2 for implementation details and computational considerations). Taken together, xScore constitutes a practical, domain-aware alternative to single-dataset metrics, providing diagnostic insight and actionable guidance for designing reliable lightweight vision architectures.
Our work aligns with the growing emphasis on cross-dataset evaluation and reproducible benchmarking [11,14,16,17]. While prior studies have documented the instability of ImageNet-pretrained models under domain shift [19] and proposed techniques such as self-distillation [54], feature sharing [55], and token pruning [56,57], they lack a simple quantitative metric for systematic cross-domain assessment. The xScore fills this gap by normalizing for parameter budget, balancing mean performance and variance, and enabling a reduced four-dataset benchmark that preserves full cross-domain rankings, lowering computational cost while maintaining interpretability.
Several limitations should be noted, highlighting areas for further research that could complement and extend the findings of this study:
Model constraints and training strategy: This study focuses on models constrained to approximately 2.5 M parameters, using fixed architectural layouts and consistent training hyperparameters. We did not explore cases where architecture-specific tuning within the same parameter budget could yield stronger performance, nor did we evaluate models with higher parameter counts (e.g., 5 M), both of which could influence the xScore ranking. Variations in training protocols—such as additional evaluation seeds, different batch sizes, varied data augmentation, or optimizer settings—may influence absolute performance, though relative trends are expected to remain largely consistent.
Dataset coverage: Although the four selected datasets span diverse domains, they were chosen from the seven representative datasets included in this framework, which do not cover the full spectrum of real-world or synthetic visual domains. As a result, observed xScore trends may differ when additional datasets—such as synthetic or video data—are incorporated.
Vision task specificity: This study is limited to image classification tasks. Tasks requiring fine spatial precision, such as semantic segmentation or object detection, may rely on architectural features not captured by the image classification evaluations, potentially producing different xScore rankings. Dedicated benchmark metrics would be needed to fairly assess performance for these task-specific scenarios.
Looking ahead, xScore can guide next-generation mobile architecture design, inform robustness-oriented neural architecture search (NAS), and support evaluation in semi- or self-supervised settings. The design patterns observed in EfficientNet and ConvMixer—particularly their strategies for maximizing channel-wise information flow—can assist in searching for and testing more efficient architectures. In NAS, xScore prioritizes cross-domain robustness and discourages overfitting to datasets like ImageNet. In deployment, it enables early-stage assessment of mobile models, predicting real-world stability without the cost of full-scale training—an essential advantage for mobile and edge systems.