Journal of Imaging
  • Article
  • Open Access

19 September 2025

Empirical Evaluation of Invariances in Deep Vision Models

MLV Research Group, Department of Informatics, Democritus University of Thrace, 65404 Kavala, Greece
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Machine Learning for Computer Vision Applications

Abstract

The ability of deep learning models to maintain consistent performance under image transformations, termed invariances, is critical for reliable deployment across diverse computer vision applications. This study presents a comprehensive empirical evaluation of modern convolutional neural networks (CNNs) and vision transformers (ViTs) concerning four fundamental types of image invariances: blur, noise, rotation, and scale. We analyze a curated selection of thirty models across three common vision tasks (object localization, recognition, and semantic segmentation) using benchmark datasets including COCO, ImageNet, and a custom segmentation dataset. Our experimental protocol introduces controlled perturbations to test model robustness and employs task-specific metrics such as mean Intersection over Union (mIoU) and classification accuracy (Acc) to quantify models’ performance degradation. Results indicate that while ViTs generally outperform CNNs under blur and noise corruption in recognition tasks, both model families exhibit significant vulnerabilities to rotation and extreme scale transformations. Notably, segmentation models demonstrate higher resilience to geometric variations, with SegFormer and Mask2Former emerging as the most robust architectures. These findings challenge prevailing assumptions regarding model robustness and provide actionable insights for designing vision systems capable of withstanding real-world input variability.

1. Introduction

The ability of computational models to maintain consistent representations of visual data under geometric and photometric transformations—known as image invariances—has been a cornerstone of computer vision research. Effectively handling image invariances has profound implications across numerous practical and critical domains. Applications ranging from autonomous driving systems, robotic navigation, medical imaging diagnostics, surveillance and security systems to augmented reality heavily depend on robust invariant recognition mechanisms. Systems that inadequately manage invariances tend to exhibit significant performance deterioration when faced with minor deviations from idealized training conditions. Invariances ensure that a model recognizes an object regardless of its orientation (rotation invariance), size (scale invariance), or perturbations such as blur or noise [1,2]. For instance, a rotation-invariant system identifies a cat whether it is upright or upside-down, while scale invariance allows detection of the same object at varying distances. These properties are critical for real-world applications, where input data rarely conforms to idealized conditions.
Despite their importance, the mechanisms by which modern deep learning architectures such as convolutional neural networks (CNNs) and vision transformers (ViTs) achieve invariances remain poorly understood, particularly when compared to classical computer vision methods [3]. Moreover, systematic comparisons of invariances in CNNs and transformers are scarce. Prior work has focused on isolated transformations (e.g., rotation or scale) or specific architectures [3,4,5]. This work aims to empirically investigate how CNNs and transformers respond to systematic transformations, shedding light on their robustness and limitations. This paper bridges the identified gap in the literature by analyzing how both model families respond to injected invariances—controlled perturbations in rotation, scale, blur, and noise—against original images. By quantifying output deviations and latent space shifts, this work aims to provide a unified framework for evaluating invariances across model architectures. Our findings challenge assumptions about the inherent robustness of deep vision models and offer guidelines for designing more reliable vision systems. To this end, the main scope of this work is the design of a systematic benchmarking study providing the following contributions:
  • Unified evaluation across tasks—recognition, localization, and segmentation are rarely benchmarked together under identical perturbation protocols.
  • Controlled comparison—testing of 30 models under the same invariance transformations, ensuring a fair basis for comparison.
  • Empirical confirmation of assumptions—although prior works have suggested such robustness patterns, they were often task-specific or anecdotal. The presented results provide systematic evidence that these assumptions hold across different tasks and perturbations.
The rest of the paper is structured as follows: in Section 2, approaches to invariance are reviewed, from traditional to deep learning-based, as well as related works covering a wide range of invariances. Section 3 presents the proposed methodology, including datasets used, model selection and invariance applications. The experimental setup is described in Section 4, while the results are summarized in Section 5. Discussion and conclusions are included in Section 6 and Section 7, respectively.

3. Materials and Methods

The proposed approach is illustrated in Figure 3. Three key computer vision tasks are examined: object localization, object recognition, and semantic segmentation. For each task, a benchmark dataset is selected, and four invariance transformations are applied to its images in order to evaluate, through targeted performance metrics, the robustness of ten different deep learning models per task to the applied invariances.
Figure 3. Pipeline of the proposed methodology towards studying deep learning model invariances.
The rest of the section presents a detailed overview of the datasets utilized for evaluating models’ performance (ten models for each task) across the three key computer vision tasks. To ensure robust comparative analysis, we carefully selected established benchmark datasets while addressing the need for consistent evaluation metrics through strategic dataset preparation and fine-tuning approaches.

3.1. Benchmark Datasets

The empirical evaluation required datasets that could effectively assess model invariances across different vision tasks. The datasets were selected based on their prevalence in the literature, quality of annotations, and suitability for our specific evaluation needs.

3.1.1. Object Localization Dataset

For evaluating localization models, we utilized a subset of the COCO validation dataset. The COCO (Common Objects in Context) dataset is widely recognized for its complex scenes containing multiple objects at various scales with detailed instance-level annotations [62]. All localization models in our study were originally trained on the COCO dataset, making its validation set an appropriate choice for evaluation.
This selection was necessitated by the lack of available test data with corresponding ground truth labels. The COCO validation set contains extensive instance annotations, including bounding boxes and segmentation masks across 80 common object categories [62,63]. This rich annotation scheme enabled us to evaluate fine-grained localization performance across different scales and object configurations. The full validation set consists of 5000 images; from it, we randomly selected five images for each of the 80 classes, 400 images in total. The COCO validation subset provided sufficient diversity to assess model invariance to object scale, perspective, occlusion, and background complexity, which are considered key aspects in our investigation of CNN invariances.

3.1.2. Object Recognition Dataset

For recognition models, we employed a carefully selected subset of the ImageNet ILSVRC2012 validation dataset. The ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) provides a robust benchmark for object recognition with 1000 object categories [64,65]. The full validation set contains 50,000 images; our subset was randomly selected to provide five images per class, 5000 images in total.
This subset was chosen primarily because it represents the de facto standard for training and evaluating recognition models. The diverse nature of ImageNet categories allowed us to assess recognition performance across a wide spectrum of object types, illumination conditions, and viewpoints. This diversity was crucial for our study of CNNs’ invariances in the recognition domain. The validation subset contains high-quality annotations and presents challenging recognition scenarios that test the limits of model invariance to various transformations and real-world conditions.

3.1.3. Semantic Segmentation Dataset

To address the challenge of comparing models trained on different datasets, we incorporated a custom dataset retrieved from Kaggle [66]. This approach was necessary because the segmentation models in our study were originally trained on different datasets with varying annotation schemes and object categories. This dataset was used for fine-tuning the already trained models. Fine-tuning generally requires much less data, since the model already has rich learned representations; even a few thousand samples can be enough to adapt a pre-trained model to a new domain. An important reason for its selection was that the objects of interest are always centered in its images. The original dataset contained 588 pictures of leaves affected by a disease, together with their annotations. After reviewing the dataset and removing unsuitable photos, data augmentation was applied to reach a more appropriate total of 3138 images used for fine-tuning each model, while 62 images from the original dataset were kept intact for the inference and validation of the models. Augmentation included rotating images by 60 degrees in order to create a more diverse dataset. Finally, the decision to use a small subsample of data was made due to our limited computational resources.
The custom dataset served as a crucial fine-tuning benchmark, particularly for transformer-based segmentation models. By fine-tuning all models on this common dataset, we established a more equitable baseline for performance comparison, mitigating the bias introduced by differences in pre-training data [64,67].

3.2. Models’ Selection Criteria

The selection criteria prioritized models that are openly accessible, widely adopted within the research community, and extensively documented online. More specifically, selection criteria included the following aspects:
  • Documentation and Resource Availability. Priority was given, first, to models trained on open-source and well-constructed datasets and, second, to models with comprehensive documentation, tutorials, and implementation examples, which facilitate integration into our research pipeline. The availability of educational resources surrounding these models ensures efficient troubleshooting and optimization.
  • Open-Source Availability. All selected models had to be available through open-source licenses, allowing for unrestricted academic use and modification. This accessibility is crucial for the reproducibility and extension of our research findings. Open-source models also typically provide pre-trained weights on standard datasets, reducing the computational resources required for implementation. Models with significant adoption within the computer vision community were prioritized. All models were imported from well-known libraries such as PyTorch or Hugging Face.
  • Performance-Efficiency Balance. This criterion spans a range of architectures that offer different trade-offs between accuracy and computational efficiency. This variety allows us to evaluate which models best suit specific hardware constraints and performance requirements. Many models are released in multiple variants with different parameter counts, which strongly affects how easily they can be run. For this research, each model variant was selected so that the testing hardware could run inference in a reasonable amount of time without incurring the performance penalty of an overly small parameter count.
Table 1, Table 2 and Table 3 include the selected models, ten for each one of the three key computer vision tasks: object localization, object recognition, and semantic segmentation.
Table 1. Details of selected models for object localization.
Table 2. Details of selected models for object recognition.
Table 3. Details of selected models for segmentation.
At this point, it should be noted that all models evaluated for localization and recognition were pretrained on their respective standard datasets—localization models on COCO and recognition models on ImageNet—consistent with common practice and the model authors’ original training regimes. However, for the semantic segmentation task, the selected models originated from diverse training backgrounds, having been trained on different datasets with varying annotation schemas. In the latter case, to ensure a fair and consistent comparison across segmentation architectures, we standardized their evaluation by fine-tuning all segmentation models on the same custom dataset.

3.3. Generating Degraded Images

3.3.1. Blurred Images

To assess the robustness of the models under image blur distortions, we applied Gaussian blurring at multiple levels to a fixed subset of images from the dataset. The blurring was implemented using a Gaussian kernel of fixed size (5 × 5), with the degree of blur controlled by varying the standard deviation (σ, or sigma) parameter. Specifically, we evaluated blur invariance across five sigma values: σ = 0 (no blur), 1, 2, 3, and 4. The application of Gaussian blur was performed using OpenCV’s GaussianBlur function, with identical values for both sigmaX and sigmaY to ensure isotropic smoothing.
Unlike geometric transformations, the blur operation does not require adjustment of bounding box coordinates, as it preserves the spatial layout of the image. For each level of blur, the blurred images were passed directly to a pre-trained model for object detection. Model predictions were then evaluated against the unaltered ground truth annotations.
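As a minimal sketch of the blur protocol described above (assuming OpenCV and NumPy images in BGR layout; the helper name and the example file path are illustrative only), the perturbed inputs can be generated as follows:

```python
import cv2

SIGMAS = [0, 1, 2, 3, 4]  # sigma = 0 serves as the unblurred baseline

def blur_variants(image_bgr):
    """Return one copy of the image per blur strength, using a fixed 5x5 kernel."""
    variants = {}
    for sigma in SIGMAS:
        if sigma == 0:
            variants[sigma] = image_bgr.copy()  # baseline: no blur applied
        else:
            # Identical sigmaX and sigmaY give isotropic Gaussian smoothing.
            variants[sigma] = cv2.GaussianBlur(image_bgr, (5, 5), sigma, sigmaY=sigma)
    return variants

# Example usage (hypothetical path):
# blurred = blur_variants(cv2.imread("coco_val/000000000139.jpg"))
```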

3.3.2. Noised Images

To investigate the robustness of the models to additive Gaussian noise, we applied varying levels of pixel-wise noise to the images of the dataset. Gaussian noise was synthetically added to each image in the normalized pixel intensity range [0, 1]. Specifically, noise was sampled from a zero-mean normal distribution with standard deviations (std) of 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, and 0.3. The noise was added to the RGB image values, and the results were clipped to maintain valid pixel intensity bounds. Afterwards, the images were scaled back to the input range expected by each model.
This transformation preserved the original spatial and structural layout of the images and did not require any modification to ground truth bounding boxes. For example, the same annotations provided by the COCO dataset were used for performance evaluation across all noise levels. Inference was performed on each noise-augmented image using a pre-trained model. The model outputs were compared against the ground truth annotations.
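A minimal sketch of this noise injection, assuming 8-bit RGB inputs handled as NumPy arrays (the function name and the random seed are illustrative assumptions):

```python
import numpy as np

NOISE_STDS = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]

def add_gaussian_noise(image_uint8, std, rng=None):
    """Add zero-mean Gaussian noise in the normalized [0, 1] range, clip, and rescale."""
    rng = rng if rng is not None else np.random.default_rng(0)
    img = image_uint8.astype(np.float32) / 255.0        # normalize to [0, 1]
    noisy = img + rng.normal(0.0, std, size=img.shape)  # pixel-wise additive noise
    noisy = np.clip(noisy, 0.0, 1.0)                    # keep valid intensity bounds
    return (noisy * 255.0).round().astype(np.uint8)     # back to the model input range
```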

3.3.3. Rotated Images

To evaluate the robustness of the models under rotation transformations, we systematically applied controlled image rotations to the datasets. For each selected image, a series of rotations was performed at fixed angular intervals, specifically from 0 to 330 degrees. Yet, it should be noted that for the experiments, only up to 180 degrees were used, since beyond this range, the same results are mirrored, offering no additional insights. Moreover, inclusion of redundant degree points would significantly increase the models’ runtime without further contributing meaningful results. The rotation was applied using a standard 2D affine transformation centered at the image midpoint. To ensure that the entire rotated image content was preserved within the frame, the dimensions of the output image were adjusted based on the rotation matrix to accommodate any spatial expansion resulting from the transformation.
Crucially, for the case of the localization task, ground truth bounding boxes associated with each image were also transformed to align with the rotated image coordinates. This was achieved by computing the rotated positions of the four corners of each bounding box and then recomputing the axis-aligned bounding box that minimally enclosed these rotated points. This process ensured that object annotations remained consistent and accurate under all transformation conditions. Following the rotation, the transformed images and their adjusted annotations were input into the appropriate model.
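The rotation and bounding-box adjustment can be sketched as follows, assuming OpenCV/NumPy and COCO-style [x, y, w, h] boxes; the helper name is illustrative and details such as border handling may differ from the exact implementation:

```python
import cv2
import numpy as np

def rotate_with_boxes(image, boxes_xywh, angle_deg):
    """Rotate an image about its center, expand the canvas, and re-fit the boxes."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)

    # Enlarge the output frame so no rotated content is cropped.
    cos, sin = abs(M[0, 0]), abs(M[0, 1])
    new_w, new_h = int(h * sin + w * cos), int(h * cos + w * sin)
    M[0, 2] += new_w / 2 - w / 2
    M[1, 2] += new_h / 2 - h / 2
    rotated = cv2.warpAffine(image, M, (new_w, new_h))

    new_boxes = []
    for x, y, bw, bh in boxes_xywh:
        corners = np.array([[x, y], [x + bw, y], [x, y + bh], [x + bw, y + bh]],
                           dtype=np.float32)
        rot = np.hstack([corners, np.ones((4, 1), dtype=np.float32)]) @ M.T
        x0, y0 = rot[:, 0].min(), rot[:, 1].min()
        x1, y1 = rot[:, 0].max(), rot[:, 1].max()
        new_boxes.append([x0, y0, x1 - x0, y1 - y0])  # minimal axis-aligned enclosing box
    return rotated, new_boxes
```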

3.3.4. Scaled Images

To evaluate the scale invariance of the models, we conducted a series of controlled image resizing operations on the datasets. Each image was rescaled using a range of predefined scale factors: 0.1, 0.25, 0.5, 0.75, 1.0 (original resolution), 1.25, 1.5, 2.0, and 3.0. The scaling was performed using bilinear interpolation via OpenCV’s resize function, ensuring consistent image quality across resolutions.
After rescaling, each image was passed through each model. Since the resized images modified the spatial scale of objects, all predicted bounding boxes were rescaled back to the original coordinate frame before evaluation. This normalization step ensured that detection results could be directly compared against the original ground truth annotations provided by the dataset.
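A minimal sketch of the scaling protocol and of mapping predictions back to the original coordinate frame (the function names and the [x1, y1, x2, y2] box convention are assumptions):

```python
import cv2

SCALES = [0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0, 3.0]

def rescale_image(image, scale):
    """Resize with bilinear interpolation for consistent quality across resolutions."""
    h, w = image.shape[:2]
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return cv2.resize(image, new_size, interpolation=cv2.INTER_LINEAR)

def boxes_to_original_frame(pred_boxes_xyxy, scale):
    """Undo the resize so predictions align with the original ground truth."""
    return [[x1 / scale, y1 / scale, x2 / scale, y2 / scale]
            for x1, y1, x2, y2 in pred_boxes_xyxy]
```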
Figure 4 includes indicative visualizations of an image passing through different levels of blur (Figure 4a), Gaussian Noise (Figure 4b), Rotation (Figure 4c), and Scaling (Figure 4d). Note that the figures are provided for reference and do not depict all possible scales of invariances.
Figure 4. Indicative image transformations: (a) Gaussian blur of different strengths; (b) Gaussian noise of different strengths; (c) Rotation of different angles; (d) Scaling.

3.4. Models’ Performance Evaluation

A comprehensive evaluation framework was employed to assess model performance across localization, recognition, and segmentation tasks, followed by a systematic robustness analysis against image perturbations. The methodology aligns with recent advances in vision model evaluation while addressing domain-specific requirements through tailored metric selection and transformation protocols.
Localization performance was quantified using the Mean Average Precision (mAP). mAP is calculated across Intersection-over-Union (IoU) thresholds ranging from 0.50 to 0.95 in 0.05 increments, following the COCO evaluation protocol, and provides a robust measure of detection accuracy across various overlap criteria [68].
Recognition performance was assessed by using Accuracy, calculated as: Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP, and FN refer to true positive, true negative, false positive and false negative, respectively. Accuracy measures the overall correctness of classification across all classes [69].
Finally, we evaluated segmentation performance using the Mean Intersection over Union (mIoU), calculated as: mIoU = (1/C) × Σ [TP_c/(TP_c + FP_c + FN_c)] for c = 1 to C, where C is the number of classes and TP_c, FP_c, and FN_c are the numbers of true positives, false positives, and false negatives for class c, respectively. This metric penalizes both over- and under-segmentation errors while weighting all classes equally.
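For illustration, the sketch below gives plain NumPy versions of the quantities behind these metrics (a single-pair box IoU as used by the COCO mAP thresholds, classification accuracy, and class-averaged mIoU); it is a simplified sketch, not the exact evaluation code used in this study:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes; COCO mAP averages AP over IoU thresholds 0.50:0.05:0.95."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mean_iou(gt_mask, pred_mask, num_classes):
    """mIoU = (1/C) * sum_c TP_c / (TP_c + FP_c + FN_c), averaged over classes that occur."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(gt_mask == c, pred_mask == c).sum()
        fp = np.logical_and(gt_mask != c, pred_mask == c).sum()
        fn = np.logical_and(gt_mask == c, pred_mask != c).sum()
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```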
This multi-metric approach ensured comprehensive characterization of model capabilities, with mIoU emphasizing structural segmentation fidelity, mAP quantifying localization reliability, and Accuracy reflecting diagnostic recognition accuracy under real-world conditions [70,71]. The unified evaluation framework enabled direct cross-model comparisons while maintaining alignment with domain-specific performance requirements for plant disease analysis.

4. Experimental Setup

To ensure a fair and statistically meaningful assessment of model performance under variant transformations, we selected five random images per class for both localization and recognition models. This sampling strategy strikes a balance between computational efficiency and representational diversity, enabling a robust estimation of performance without overwhelming the evaluation pipeline. By drawing multiple images per class, we mitigate the risk of bias that might arise from atypical examples or outliers, thereby improving the generalizability of the reported metrics. This approach also facilitates class-wise performance comparisons and supports a more granular analysis of invariance effects across different object categories.
For the inference, both the localization and recognition models were evaluated under controlled variations. Although a larger sample size could further stabilize the statistical estimates, the chosen five-image-per-class configuration provides sufficient coverage for detecting trends and patterns in model robustness, especially when compounded over many classes. This design also allows us to compute average metrics and confidence intervals, which are essential for drawing reliable conclusions about model sensitivity to transformations.
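The per-class sampling strategy can be sketched as follows, assuming a precomputed mapping from class label to image paths (all names and the seed are illustrative):

```python
import random

def sample_per_class(images_by_class, per_class=5, seed=42):
    """Randomly draw up to `per_class` images from every class."""
    rng = random.Random(seed)
    subset = []
    for label, paths in images_by_class.items():
        subset.extend(rng.sample(paths, min(per_class, len(paths))))
    return subset
```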
The experimental evaluation of the segmentation models was conducted using a custom dataset of diseased leaf images obtained from Kaggle. The initial dataset was relatively small, comprising paired images of diseased leaves along with their corresponding segmentation masks that precisely delineate the affected regions. To address the limited size of the original dataset and ensure sufficient training data for effective model fine-tuning, a systematic data augmentation strategy was implemented. Each original image was subjected to three successive 90-degree rotations, effectively quadrupling the dataset size by creating four distinct variations of each source image (the original plus three rotated versions). This geometric transformation-based augmentation approach was selected to preserve the essential structural characteristics of leaf diseases while introducing sufficient variability to enhance model generalization capabilities [72,73]. The augmentation protocol maintained the paired relationship between input images and their corresponding segmentation masks, ensuring consistency in the training data.
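A minimal sketch of this rotation-based augmentation, assuming image/mask pairs stored as NumPy arrays (the helper name is illustrative):

```python
import numpy as np

def augment_pair(image, mask):
    """Return the original pair plus three rotated copies (90, 180, 270 degrees),
    keeping the image and its segmentation mask aligned."""
    pairs = [(image, mask)]
    for k in (1, 2, 3):  # number of successive 90-degree rotations
        pairs.append((np.rot90(image, k).copy(), np.rot90(mask, k).copy()))
    return pairs
```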
For all ten architectures—whether convolutional or transformer-based—we followed a uniform fine-tuning pipeline on our custom leaf dataset: we first wrapped the raw RGB images and their binary masks into a PyTorch Dataset, applying identical spatial preprocessing (resizing to the input of the model, normalizing inputs, thresholding masks to {0,1}) and retaining originals for visualization; next, we fed each sample through the respective model’s built-in preprocessing or a shared image processor to produce tensor inputs, then reconfigured the model’s classification head to output two classes (background vs. leaf), allowing weight adaptation via an “ignore mismatched sizes” flag when remapping final layers. During training, we batched data with a custom collate function, back-propagated a pixel-wise segmentation loss across the entire network and monitored validation IoU for early stopping.
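As an illustration of this shared pipeline, the sketch below fine-tunes a Hugging Face SegFormer checkpoint on such image/mask pairs; the checkpoint name, hyperparameters, and helper names (LeafSegDataset, train_one_epoch) are assumptions rather than the exact code used in this work, and the validation-IoU early stopping is omitted for brevity:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

class LeafSegDataset(Dataset):
    """Wraps RGB images and binary masks (NumPy arrays) for segmentation fine-tuning."""
    def __init__(self, images, masks, processor):
        self.images, self.masks, self.processor = images, masks, processor

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # The model's image processor handles resizing/normalization; masks are thresholded to {0, 1}.
        enc = self.processor(self.images[idx],
                             segmentation_maps=(self.masks[idx] > 0).astype("uint8"),
                             return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"  # illustrative choice
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(
    checkpoint,
    num_labels=2,                  # background vs. leaf
    ignore_mismatched_sizes=True,  # remap the final classification head
)

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train().to(device)
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # pixel-wise segmentation loss from the model head
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Usage (illustrative):
# loader = DataLoader(LeafSegDataset(train_images, train_masks, processor), batch_size=4, shuffle=True)
# train_one_epoch(model, loader, torch.optim.AdamW(model.parameters(), lr=5e-5))
```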

5. Results

This comprehensive empirical study presents systematic evaluation results of deep learning models across three computer vision tasks—object localization, recognition, and semantic segmentation—under four types of invariance transformations. The evaluation encompasses 30 distinct models tested against blur, noise, rotation, and scale invariances using established benchmark datasets, including COCO validation subset, ImageNet ILSVRC2012 validation subset and the augmented custom Kaggle leaf disease segmentation dataset. Our findings reveal significant variation in patterns of robustness across model architectures and transformation types, providing empirical evidence for the current limitations of invariance handling in modern deep learning systems.

5.1. Object Localization

The evaluation of object localization models under Gaussian blur transformations reveals varying degrees of robustness across different architectures. Table 4 presents the mean Average Precision (mAP) scores for ten localization models tested with Gaussian blur strengths ranging from σ = 0 (no blur) to σ = 4. YOLOv10 and YOLOv11 demonstrate the highest baseline performance at 0.88 mAP with σ = 0, followed closely by YOLOv8 and YOLOv9 at 0.87. Notably, we found that most modern localization models exhibit strong robustness to Gaussian blur, maintaining mAP within ±0.01 across blur strengths up to σ = 4. This suggests that coarse object structure is sufficient for accurate localization in these cases. Figure 5 illustrates the localization blur plots for all models, divided into two groups for better visualization; Group 1 includes YOLOv10, YOLOv11, YOLOv7, YOLOv8, and YOLOv9, while Group 2 includes EfficientDet, RetinaNet, RetinaNetv2, SSD, and SSDlite.
Table 4. Models’ robustness evaluation (mAP) on object localization under Gaussian blur transformation of different strengths.
Figure 5. Localization Gaussian blur plots for all models and strengths: (a) Group 1; (b) Group 2.
Gaussian noise impacts localization performance across all tested models more strongly than blur, as evidenced in Table 5. The results demonstrate a consistent pattern of performance degradation as noise intensity increases from 0 to 0.3. RetinaNetv2 shows the most severe degradation, dropping from 0.84 mAP at std = 0 to 0.75 mAP at std = 0.3, a sizable performance loss. Most models manage to maintain acceptable performance even at higher noise levels. Figure 6 illustrates the localization noise plots for all models, divided into two groups for better visualization.
Table 5. Models’ robustness evaluation (mAP) on object localization under Gaussian noise of different strengths.
Figure 6. Localization Gaussian noise plots for all models and strengths: (a) Group 1; (b) Group 2.
Rotation transformation reveals significant invariance limitations across all localization models, as shown in Table 6. The results demonstrate a consistent pattern where performance at 90° and 180° rotation angles approaches baseline levels, while intermediate angles (30°, 60°, 120°, 150°) cause substantial performance degradation. This can be attributed to many possible factors, such as the visibility of the subject after the rotation. Figure 7 illustrates the localization rotation plots for all models, divided into two groups for better visualization.
Table 6. Models’ robustness evaluation (mAP) on object localization under rotation transformation of different angles.
Figure 7. Localization rotation plots for all models and strengths: (a) Group 1; (b) Group 2.
Scale transformation demonstrates a varying impact on localization performance, as presented in Table 7. Most models show degraded performance at extreme downscaling (0.1×), with performance improving as the scale approaches the baseline (1.0×) and remaining close to baseline levels thereafter. Figure 8 illustrates the localization scaling plots for all models, divided into two groups for better visualization.
Table 7. Models’ robustness evaluation (mAP) on object localization under scaling transformation of different scales.
Figure 8. Localization scaling plots for all models and strengths: (a) Group 1; (b) Group 2.

5.2. Object Recognition

Recognition models demonstrate varying degrees of blur robustness across different architectures, as shown in Table 8. ConvNeXt achieves the highest baseline accuracy at 0.84 with σ = 0, followed by Swin Transformer at 0.82. Under increasing blur intensity, ConvNeXt maintains relatively strong performance compared to other models, declining to 0.60 at σ = 4, representing a 29% accuracy reduction. Vision Transformer shows superior blur robustness, declining from 0.81 to 0.62, demonstrating only a 24% performance loss.
Table 8. Models’ robustness evaluation (Accuracy) on object recognition under Gaussian blur transformation of different strengths.
Mobile architectures exhibit significant blur sensitivity, with MobileNetV2 dropping from 0.73 to 0.39 accuracy and RegNet showing the most severe degradation from 0.68 to 0.26. EfficientNet demonstrates moderate blur robustness, maintaining 0.46 accuracy at σ = 4 from a baseline of 0.78. Swin Transformer, despite strong baseline performance, shows substantial blur sensitivity, with accuracy declining to 0.52 at maximum blur intensity, indicating that transformers also struggle with the recognition of blurred images. Figure 9 illustrates the recognition blur plots for all models, divided into two groups for better visualization; Group 1 includes EfficientNet, MobileNetV2, MobileNetV3, RegNet, ResNet, and ResNeXt, while Group 2 includes ConvNeXt, DeiT-Base, Swin Transformer, and Vision Transformer.
Figure 9. Recognition Gaussian blur plots for all models and strengths: (a) Group 1; (b) Group 2.
Gaussian noise significantly impacts recognition accuracy across all architectures, as demonstrated in Table 9; the general pattern shows severe performance degradation with increasing noise levels. ConvNeXt shows remarkable noise robustness relative to the other CNN models, declining from an accuracy of 0.84 to 0.32. All transformer models maintain strong noise robustness, with the smallest performance decline among the tested models.
Table 9. Models’ robustness evaluation (Accuracy) on object recognition under Gaussian noise of different strengths.
Mobile architectures demonstrate severe noise sensitivity, with MobileNetV2 showing near-complete failure under high noise conditions. EfficientNet exhibits substantial noise sensitivity despite its baseline performance. Vision Transformer shows moderate noise robustness, maintaining better performance than CNN-based architectures under equivalent noise conditions. Figure 10 illustrates the recognition noise plots for all models, divided into two groups for better visualization.
Figure 10. Recognition Gaussian noise plots for all models and strengths: (a) Group 1; (b) Group 2.
Rotation transformation reveals significant limitations in recognition model invariance, as shown in Table 10. All models exhibit substantial performance degradation at intermediate rotation angles while showing some recovery at 90° and 180° rotation angles. ConvNeXt maintains the most stable rotation performance, declining from 0.84 to approximately 0.70 to 0.77 at intermediate angles. Swin Transformer demonstrates superior rotation robustness, maintaining above 0.72 accuracy across most rotation angles.
Table 10. Models’ robustness evaluation (Accuracy) on object recognition under rotation transformation of different angles.
EfficientNet shows severe rotation sensitivity, with performance dropping to below 0.50 at many intermediate angles. Mobile architectures, once again, demonstrate significant vulnerability, with MobileNetV2 and MobileNetV3 showing substantial performance losses. Vision Transformer exhibits moderate rotation sensitivity but maintains better performance than most CNN architectures at equivalent rotation angles. Figure 11 illustrates the recognition performance under rotation plots for all models, divided into two groups for better visualization.
Figure 11. Recognition rotation plots for all models and strengths: (a) Group 1; (b) Group 2.
Scale transformation reveals significant architectural differences in invariance handling, as presented in Table 11. At extreme downscaling (0.1×), all models show substantial performance degradation, with RegNet performing worst at 0.13 accuracy and DeiT showing the best resilience at 0.47. Performance generally improves as scale increases toward baseline resolution, with most models achieving optimal performance around 1.0× to 1.25× scaling.
Table 11. Models’ robustness evaluation (Accuracy) on object recognition under scaling transformation of different scales.
SwinTransformer demonstrates the most stable scaling performance, maintaining above 0.80 across the 0.75× to 2.0× range and achieving peak performance of 0.83 at 1.25× scaling. Vision Transformer shows similar stability with minimal performance variation across scale factors. Mobile architectures again show significant scale sensitivity, with MobileNetV2 and MobileNetV3 exhibiting substantial performance losses at small scale factors. Figure 12 illustrates the recognition performance under scaling plots for all models, divided into two groups for better visualization.
Figure 12. Recognition scaling plots for all models and strengths: (a) Group 1; (b) Group 2.

5.3. Semantic Segmentation

Segmentation models demonstrate varied blur robustness patterns, as presented in Table 12. Mask2Former achieves the highest baseline performance with 0.73 mIoU and maintains superior blur robustness, declining to 0.68 mIoU at σ = 4. CLIPSeg shows excellent blur stability, maintaining above 0.65 mIoU across all blur levels. DeepLabV3 variants demonstrate moderate blur sensitivity, with the ResNet-101 version maintaining 0.43 mIoU at maximum blur intensity.
Table 12. Models’ robustness evaluation (mIoU) on semantic segmentation under Gaussian blur transformation of different strengths.
PSPNet exhibits severe blur sensitivity, dropping from 0.69 to 0.25 mIoU, representing a 64% performance loss. SegFormer maintains good blur robustness with minimal performance degradation across blur levels. UNet and SqueezeNet show moderate blur sensitivity, with performance declining to approximately 0.43–0.45 mIoU at σ = 4. Figure 13 illustrates the segmentation performance under blurring plots for all models, divided into two groups for better visualization. Group 1 includes DeepLabv3+, DeepLabv3 (ResNet-101), DeepLabv3, FCN, and UNet, while Group 2 includes CLIPSeg, Mask2Former, PSPNet, SegFormer, and SqueezeNet.
Figure 13. Segmentation Gaussian blur plots for all models and strengths: (a) Group 1; (b) Group 2.
Gaussian noise significantly impacts segmentation performance across all tested models, as shown in Table 13. Mask2Former demonstrates the most robust noise handling, declining from 0.73 to 0.35 mIoU but maintaining the highest absolute performance under low noise conditions, while being surpassed by the more stable SegFormer model. CLIPSeg shows good noise robustness, maintaining 0.39 mIoU at σ = 0.3. SegFormer exhibits stable noise performance with gradual degradation across noise levels.
Table 13. Models’ robustness evaluation (mIoU) on semantic segmentation under Gaussian noise of different strengths.
PSPNet shows severe noise sensitivity, with performance dropping to 0.02 mIoU at maximum noise intensity. FCN ResNet-50 demonstrates an unusual pattern, initially showing slight performance improvement at low noise levels before degrading at higher intensities. DeepLabV3 variants show substantial noise sensitivity, with the standard version performing worse than the ResNet-101 variant. Figure 14 illustrates the segmentation performance under noise plots for all models, divided into two groups for better visualization.
Figure 14. Segmentation Gaussian noise plots for all models and strengths: (a) Group 1; (b) Group 2.
Rotation transformation affects segmentation models differently than localization and recognition tasks, as demonstrated in Table 14. Most segmentation models show relatively stable performance across rotation angles, with performance variations typically within 0.05–0.10 mIoU. The DeepLabv3 models maintain strong rotation robustness, showing minimal performance variation across all tested angles. CLIPSeg demonstrates excellent rotation stability with consistent performance around 0.59–0.67 mIoU.
Table 14. Models’ robustness evaluation (mIoU) on semantic segmentation under rotation transformation of different angles.
PSPNet maintains consistent performance across rotation angles, suggesting good rotational invariance properties. SegFormer shows minimal rotation sensitivity with performance remaining stable across all tested angles. Figure 15 illustrates the segmentation performance under rotation plots for all models, divided into two groups for better visualization.
Figure 15. Segmentation rotation plots for all models and strengths: (a) Group 1; (b) Group 2.
Scale transformation reveals significant architectural differences in segmentation model robustness, as presented in Table 15. At extreme downscaling (0.1×), most models show substantial performance degradation, with several models achieving below 0.10 mIoU. Mask2Former demonstrates the most robust scaling performance, maintaining above 0.31 mIoU even at 0.1× scale and achieving peak performance of 0.729 mIoU at 1.25× scaling.
Table 15. Models’ robustness evaluation (mIoU) on semantic segmentation under scaling transformation of different scales.
CLIPSeg shows good scale robustness, maintaining above 0.47 mIoU at 0.1× scale and stable performance across larger scale factors. SegFormer exhibits moderate scale sensitivity, with substantial performance loss at small scales but good recovery at normal and large scales. PSPNet, DeepLabv3 with ResNet-101, DeepLabv3+, and SqueezeNet show severe scale sensitivity, with near-zero performance at 0.1× scaling. Figure 16 illustrates the segmentation performance under scaling plots for all models, divided into two groups for better visualization.
Figure 16. Segmentation scale plots for all models and strengths: (a) Group 1; (b) Group 2.
The empirical results reveal several consistent patterns across model architectures and tasks. Vision Transformers (ViT, Swin Transformer, DeiT) generally demonstrate superior robustness to blur and noise transformations compared to CNN architectures, particularly in recognition tasks. However, rotation invariance remains challenging for all architecture types, with performance degradation occurring at intermediate angles regardless of model family.
Scale invariance exhibits the most dramatic performance variations, with extreme downscaling (0.1×) resulting in near-complete failure for many models across all tasks. Modern YOLO variants demonstrate relatively stable performance across most invariance types in localization tasks, while mobile architectures consistently show the highest sensitivity to all transformation types. SegFormer emerges as the most robust segmentation model across all invariance types, as its performance drop is the smallest among the evaluated models. In contrast, traditional CNN-based segmentation approaches show significant sensitivity to most transformations.

5.4. Case Studies of Misclassifications Across Architectures

To complement the quantitative results, we include qualitative examples that illustrate typical failure cases for each of the three deep model usages evaluated in this study. These examples provide insight into how and why certain models struggle with specific transformations or object features. Each case shows the ground truth label alongside the model’s incorrect prediction, before applying any transformation. These indicative cases were selected to highlight common patterns in the types of errors made by each architecture.
Regarding the localization models, an interesting observation was made. The COCO dataset generally consists of images depicting multiple objects in the scene, as shown in Figure 17. The issue here lies with the labeling: for each picture, only certain objects are annotated, such as the bottle in the left image of Figure 17. In this case, however, the model selects the person, which is not entirely wrong, yet the prediction does not match the annotation.
Figure 17. Indicative examples of object localization incorrect predictions with Yolov8 using an unprocessed image from the COCO dataset. Columns from left to right: ground truth, and incorrect prediction.
The recognition models produce more straightforward results. It was observed that many misclassified predictions involved labels with visual properties similar to the ground truth. For example, in Figure 18, the ground truth image shows a goldfish, and the model predicted a sea slug. The model can therefore detect broadly similar animals (in this case, aquatic creatures) but sometimes lacks fine-grained discrimination of the subject.
Figure 18. Indicative examples of object recognition incorrect predictions with SwinTransformer using an unprocessed image from the ImageNet dataset. Columns from left to right: ground truth, and incorrect prediction.
For the segmentation prediction in Figure 19, it can be seen that the model cannot fully reproduce the original masks of the leaves, yet it manages to capture most of their extent.
Figure 19. Indicative examples of semantic segmentation incorrect predictions with Mask2Former using an unprocessed image from the Leaf disease segmentation dataset. Columns from left to right: input image, ground truth mask, and incorrect prediction masks.
These qualitative examples reinforce the quantitative results, showing that even the best-performing models have specific weaknesses. Such visualizations aim to clarify the limitations of current architectures and motivate further research towards improving model robustness across diverse input variations. It should be noted that many observed differences between models—often within ±0.01 mAP or mIoU—are smaller than the expected statistical noise from the limited test set and should be interpreted as indicative trends rather than definitive claims of model superiority.

6. Discussion

6.1. Observation Points

The empirical results reveal fundamental architectural trade-offs in handling different invariance types across vision tasks. While YOLO variants demonstrated exceptional blur and scale robustness in localization, their results under noise expose vulnerabilities in the frequency domain. This aligns with the theoretical observations from Azulay et al. [74], who attributed CNNs’ sensitivity to improper sampling of high-frequency components. The counterintuitive SSD performance improvement under moderate blur suggests blur-induced regularization effects, potentially mitigating overfitting to high-frequency artifacts as hypothesized in Cui et al.’s anti-aliasing work [75].
Vision Transformers exhibited superior noise robustness, supporting the findings of Wang et al. [76] regarding ViTs’ reduced high-frequency sensitivity. However, their rotation performance parity with CNNs challenges assumptions about attention mechanisms inherently solving geometric invariance, corroborating recent debates in the work of Pinto et al. [77]. The segmentation results further reinforce this narrative, showing that transformers broadly maintain more stable performance, even at higher strengths of the various image degradations.
Our multi-task analysis uncovered a fundamental scale invariance paradox: while localization models maintained their functionality at 0.1× scaling, recognition and segmentation models suffered catastrophic failures. This dichotomy suggests scale invariance operates through distinct mechanisms in detection versus classification architectures. The localization results support the feature pyramid hypothesis of Mumuni et al. [78], where multi-scale anchors provide inherent scale resilience.
The universal rotation sensitivity across architectures underscores a critical limitation of modern deep vision systems. The segmentation exception suggests dense prediction tasks may indirectly learn rotation-equivariant features through pixel-wise consistency objectives, which is a phenomenon meriting further investigation.
The noise results expose a fundamental CNN-ViT divergence: while CNNs suffered progressive degradation, ViTs displayed thresholded failure modes. The segmentation anomaly (FCN ResNet-50’s increase in mIoU at σ = 0.05 as seen in Table 13) parallels biomedical imaging findings where low noise regularizes over-segmentation [79], suggesting task-dependent noise responses.

6.2. Limitations and Future Directions

While comprehensive, this study faces specific limitations, summarized in the following points:
  • Static Transformation Analysis: Real-world invariances often involve combined perturbations, which are absent from our isolated tests.
  • Simplicity of Transformations: The applied perturbations (blur, noise, rotation, scale) were simulated independently and represent clean, idealized distortions. In practice, corruptions frequently co-occur (e.g., blur + rotation, scale + occlusion) or appear partially across the image, producing more complex challenges. Evaluating such compounded perturbations remains an important direction for future work. Yet, it should be noted that isolating individual transformations allows for a more precise evaluation of the models’ sensitivity and robustness to specific types of perturbations, and the single transformations used in this work also serve as a clear, foundational baseline benchmark.
  • Dataset Bias: COCO/ImageNet focus limits ecological validity for specialized domains like medical or satellite imaging.
  • Black-Box Metrics: Layer-wise invariance analysis could reveal mechanistic insights beyond task performance.
  • Given our modest per-class sample size, marginal score differences (e.g., ±0.01 in mAP or mIoU) are not necessarily statistically significant and should not be over-interpreted.
  • Limited computational resources: in this work, subsets of data were considered to fine-tune the segmentation pre-trained models, mainly due to limited available resources. Considering recent corruption-robustness benchmarks such as ImageNet-C by Hendrycks and Dietterich [41], future work will aim to enhance the size of the datasets used.
Recent advances in complex and dynamic environments pose significant challenges for testing the invariance capabilities of vision models. In surgical contexts, Zhang et al. [80] proposed adaptive graph learning frameworks that anticipate surgical workflow by modeling dynamic interactions under varying factors such as lighting changes, occlusions, and motion-induced artifacts typical in robotic-assisted procedures. These settings introduce compounded invariance challenges involving simultaneous blur, scale variations, and geometric transformations, thereby providing realistic benchmarks for assessing model robustness in localization and segmentation tasks critical to patient safety. Similarly, in underwater detection scenarios, Ge et al. [81] presented datasets capturing marine organisms under diverse environmental conditions like illumination changes, turbidity, and color distortion. These factors inherently combine noise, blur, color shifts, and scale variations, offering an ecologically valid framework to evaluate model robustness under naturally occurring complex perturbations. Both domains underscore the necessity of evaluating vision models beyond isolated invariance tests to reflect real-world, multi-faceted transformation challenges.
The systematic evaluation of models across such challenging domains would provide more actionable insights for real-world deployment. Surgical tool detection in dynamic scenarios and underwater object recognition represent domains where invariance failures can have significant consequences—patient safety in surgical robotics and autonomous underwater vehicle navigation, respectively. Therefore, future work should prioritize evaluation on these challenging datasets to better understand the practical limitations of current deep learning architectures.
Deep learning models exhibit significant vulnerabilities to fundamental image transformations. Systematic evaluation of thirty CNNs and ViTs across localization, recognition, and segmentation tasks reveals that while ViTs outperform CNNs under blur and noise in recognition, both architectures suffer substantial degradation under rotation and extreme scale transformations. Segmentation models, particularly SegFormer and Mask2Former, demonstrate superior geometric robustness. These findings expose persistent architectural limitations and underscore the need for explicit invariance mechanisms in vision systems [41].
Regarding the Scale Invariance Paradox, its observation suggests that multi-scale anchor mechanisms in detection architectures provide inherent scale resilience absent in classification systems. Potential solutions include: (1) integrating feature pyramid networks (FPNs) into recognition and segmentation architectures to enable multi-scale processing; (2) implementing scale-equivariant convolutions that explicitly preserve scale relationships across network layers; (3) adopting pyramid pooling strategies from localization models; and (4) developing hybrid training protocols that expose classification models to extreme scale variations during pre-training, mimicking the multi-scale robustness observed in detection frameworks.
These results collectively challenge the notion of universal architectural superiority, instead advocating for task-specific model selection guided by operational invariance requirements. The findings particularly underscore the urgent need for standardized robustness benchmarks beyond conventional accuracy metrics in real-world vision system deployment.

7. Conclusions

This empirical study provides comprehensive evidence of significant variation in robustness patterns across deep learning architectures and transformation types, challenging assumptions about inherent invariance capabilities in modern vision systems. The systematic evaluation of 30 models across three computer vision tasks reveals fundamental limitations that persist across architectural families, with implications for real-world deployment of deep vision systems.

7.1. Key Empirical Findings

Results demonstrate that Vision Transformers consistently outperformed CNN architectures in handling blur and noise transformations across recognition tasks. Vision Transformer exhibited lower performance loss under maximum blur intensity (σ = 4), compared to substantially higher degradation in CNN-based models such as RegNet, which showed extreme performance loss under equivalent conditions. Similarly, noise robustness analysis revealed that ViTs maintained better performance retention compared to CNN architectures under equivalent noise conditions. In general, all models show signs of performance degradation when applying noise and blur, regardless of their underlying architecture.
All tested architectures, regardless of their type (CNN or Transformer), exhibited substantial performance degradation at intermediate rotation angles (30°, 60°, 120°, 150°) while showing recovery at 90° and 180° rotations. This pattern emerged consistently across localization, recognition, and segmentation tasks, while for the segmentation models, the degradation was smaller, indicating that rotation invariance remains an unsolved challenge for current deep learning architectures.
Extreme downscaling (0.1×) caused near-complete failure across all model types and tasks, with recognition and segmentation systems suffering more severe degradation than localization models. This scale invariance paradox reveals fundamental differences in how various vision tasks handle scale transformations, with localization models maintaining some functionality where classification systems fail completely.
These findings represent two novel empirical contributions: (1) the scale invariance paradox demonstrates fundamental architectural differences in handling extreme transformations across vision tasks, challenging assumptions about unified robustness; and (2) ViT thresholded failure modes under noise reveal discrete rather than gradual degradation patterns, contrasting with CNN progressive failure and suggesting different underlying robustness mechanisms.

7.2. Task-Specific Invariance Patterns

The multi-task analysis revealed distinct invariance mechanisms operating across computer vision applications. Segmentation models demonstrated superior rotation stability compared to localization and recognition tasks, with most models showing performance variations within 0.05–0.10 mIoU across rotation angles. This suggests that dense prediction tasks may inherently develop rotation-equivariant features through pixel-wise consistency objectives.
Localization models maintained functionality under extreme scale transformations where recognition systems failed, supporting the hypothesis that multi-scale anchor mechanisms provide inherent scale resilience in detection architectures. The noise robustness analysis exposed fundamental CNN-ViT divergences, with CNNs suffering progressive degradation while ViTs displayed thresholded failure modes.
The documented vulnerabilities across all tested architectures underscore the urgent need for developing dedicated invariance mechanisms beyond conventional data augmentation strategies. The superior performance of certain models under specific transformation types (e.g., Mask2Former for segmentation, Vision Transformers for noise robustness) provides empirical guidance for practitioners selecting models based on expected operational conditions.

7.3. Future Research Directions

This study establishes a foundation for several critical research directions. The development of standardized robustness benchmarks beyond conventional accuracy metrics emerges as an immediate priority for the computer vision community. Future investigations should examine combined transformation effects and explore layer-wise invariance analysis to reveal mechanistic insights into architectural differences.
The documented scale invariance paradox and rotation sensitivity across all architectures indicate fundamental theoretical gaps that require novel architectural innovations rather than incremental improvements. The task-specific invariance patterns observed suggest potential for developing hybrid architectures that leverage the strengths of different model families for enhanced robustness across transformation types.
The empirical evidence presented in this study contributes to the growing body of literature documenting the limitations of current deep learning approaches while providing quantitative benchmarks for evaluating future invariance-aware architectures. Our findings emphasize that achieving robust vision systems requires explicit consideration of invariance properties during both architectural design and model selection phases, moving beyond the assumption that deeper or larger models inherently provide superior robustness capabilities.

Author Contributions

Conceptualization, G.A.P.; methodology, G.A.P.; software, K.K.; validation, G.A.P., E.V. and K.K.; investigation, K.K. and E.V.; data curation, K.K.; writing—original draft preparation, K.K. and E.V.; writing—review and editing, E.V. and G.A.P.; visualization, G.A.P.; supervision, G.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study refer to three datasets available in public domains: (1) COCO dataset available in GitHub at https://github.com/cocodataset/cocodataset.github.io (accessed on 8 August 2025), (2) ImageNet ILSVRC2012 dataset available at https://www.image-net.org/challenges/LSVRC/2012/ (accessed on 8 August 2025), (3) Leaf disease segmentation dataset available in Kaggle at https://www.kaggle.com/datasets/fakhrealam9537/leaf-disease-segmentation-dataset (accessed on 8 August 2025).

Acknowledgments

This work was supported by the MPhil program “Advanced Technologies in Informatics and Computers”, which was hosted by the Department of Informatics, Democritus University of Thrace, Kavala, Greece.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs: Convolutional Neural Networks
ViTs: Vision Transformers
COCO: Common Objects in Context
mIoU: Mean Intersection over Union
Acc: Accuracy
SIFT: Scale-Invariant Feature Transform
SVMs: Support Vector Machines
ASIFT: Affine-Scale-Invariant Feature Transform
SURF: Speeded-Up Robust Features
ORB: Oriented FAST and Rotated BRIEF
FAST: Features from Accelerated Segment Test
BRIEF: Binary Robust Independent Elementary Features
HOG: Histogram of Oriented Gradients
SPM: Spatial Pyramid Matching
DIAL: Domain Invariant Adversarial Learning
DAT: Domain-wise Adversarial Training
DCT: Discrete Cosine Transform
PyramidAT: Pyramid Adversarial Training
SP-ViT: Spatial Prior-enhanced Vision Transformers
STNs: Spatial Transformer Networks
RViT: Rotation Invariant Vision Transformer
AMR: Artificial Mental Rotation
SPP: Spatial Pyramid Pooling
RiT: Rotation Invariance Transformer
ILSVRC2012: ImageNet Large Scale Visual Recognition Challenge 2012
FPNs: Feature Pyramid Networks

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
