1. Introduction
Detecting blurred targets on the sea surface remains a significant technical challenge in maritime surveillance [1], search-and-rescue operations [2], and autonomous vessel navigation [3]. Under challenging maritime conditions, marine target images typically exhibit three distinct blur types originating from different sources: motion blur from ship movement, defocus blur due to water-vapor refraction [4], and scattering blur caused by sea fog [5]. These degradation effects primarily stem from wave reflection [6], atmospheric turbulence [7], and imaging system vibrations. Research shows that ship recognition systems utilizing visible/infrared imaging exhibit an average accuracy reduction of approximately 20% under Sea State 4 conditions relative to calm waters [8], which significantly increases collision risks.
Image recognition algorithms remain a key research focus across multiple domains, including facial recognition [9], autonomous vehicle perception [10], and medical diagnostics [11]. In real-world scenarios, three major types of image degradation impair visual task performance: motion blur from high-speed autonomous vehicles [12], defocus blur in medical endoscopic imaging [13], and low-light noise in nighttime surveillance systems [14]. In medical imaging, respiratory-induced motion blur in ultrasound significantly degrades the lesion recognition accuracy of conventional CNNs [15]. Similarly, in autonomous driving, high-speed motion creates directional blur that substantially reduces object detection performance [16]. These critical challenges continue to motivate technological breakthroughs in blurred image processing.
Conventional image recognition methods exhibit significant sensitivity to environmental disturbances, including occlusions and illumination variations, frequently leading to compromised model performance. In contrast, deep learning methods [17], particularly Convolutional Neural Networks (CNNs), demonstrate markedly enhanced recognition accuracy. CNNs mimic the hierarchical organization of the visual cortex, autonomously extracting discriminative features from raw pixel data to facilitate robust image classification. Contemporary methodologies primarily adopt two distinct paradigms: (1) CNN-based architectures for localized feature extraction, and (2) Transformer-based frameworks for global contextual modeling.
CNNs have demonstrated remarkable success on large-scale visual benchmarks (e.g., ImageNet), primarily attributable to their exceptional local feature extraction capacity [18]. The pioneering AlexNet [19] reduced the top-5 error rate by 10.8 percentage points, while ResNet-50 achieved 76.0% top-1 accuracy on ImageNet, demonstrating superior performance compared to conventional methods. DeblurGAN [20] utilized multi-scale dilated convolutional networks, achieving significant PSNR improvements on the GoPro benchmark dataset, albeit with substantial computational overhead. Subsequent advancements, including DANet [21], developed dynamic convolution mechanisms that adaptively adjust kernel weights according to predicted blur severity. Although these methods demonstrate superior PSNR performance on the REDS benchmark, their extensive parameter requirements limit practical deployment, particularly on resource-constrained devices. Yan et al. [22] incorporated Sobel operators within the initial CNN layer to enhance edge feature detection; however, this method showed reduced accuracy in complex natural environments, indicating constrained generalization potential. Lightweight design has emerged as a prominent research focus. While MobileBlurNet utilized depthwise separable convolution to minimize parameters, this approach led to reduced classification accuracy, illustrating the fundamental precision-efficiency trade-off [23]. Moreover, CNNs exhibit significant performance degradation when processing blurred images because their limited receptive fields (typically 3 × 3 or 5 × 5 convolutional kernels) cannot effectively model the long-range feature degradation patterns induced by blur artifacts. Quantitative experiments reveal that standard convolution kernels experience an approximately 60% reduction in effective feature response magnitude under Gaussian blur conditions.
Vision Transformers (ViT) establish long-range dependencies via self-attention mechanisms [24,25], enabling global contextual modeling. State-of-the-art architectures, including Swin Transformers [26], dense residual Transformers [27], and hybrid attention networks [28], consistently outperform CNN-based approaches on diverse vision tasks. The Swin Transformer employs a shifted window mechanism to reduce FLOPs; however, in high-blur scenarios (σ > 1.5), window boundary effects elevate attention entropy, significantly degrading feature discriminability. To mitigate normalization layer limitations, Lee et al. [29] proposed an input-adaptive scaling method that adjusts feature statistics according to individual input characteristics; however, this method does not fully address the fundamental issue of feature distribution shift. Position encoding optimization represents another significant advancement. The Focal Transformer [30] employs novel relative position encoding to mitigate localization errors in blurred images, but its computational complexity scales quadratically with image size, hindering large-scale applications. While these methods show progress in targeted areas, they typically improve one performance metric while compromising others, presenting challenges for simultaneously optimizing accuracy, efficiency, and generalization. Directly applying traditional Transformer architectures to blurred image processing has inherent limitations. Firstly, normalization components exhibit strong sensitivity to blur-induced feature distribution shifts, where channel-wise variance discrepancies escalate quadratically with increasing blur kernel size σ [31]. Secondly, global attention operations incur substantial computational redundancy when processing spatially homogeneous blurred regions [32]. Finally, conventional positional encoding schemes introduce systematic localization errors under motion blur conditions due to violated shift-equivariance assumptions [33]. These inherent constraints significantly degrade the practical utility of current approaches in real-world deployment scenarios where blur is prevalent.
Current deep learning methods broadly fall into two categories: CNN-based and Transformer-based architectures. While most approaches enhance performance by optimizing either CNNs or Transformers individually, a critical limitation persists: they fail to effectively combine the complementary advantages of both architectures.
Hybrid architecture research aims to integrate the complementary advantages of CNNs and Transformers. As shown in [34], CNNs provide essential local feature extraction while Transformers capture global contextual relationships, both of which are critical for comprehensive visual understanding. Furthermore, [35] demonstrates that incorporating CNNs in early stages can significantly improve Transformer training efficiency. Current approaches fall into three primary categories. The first is cascaded architectures (e.g., CNN–Transformer Cascade [36]), which connect independent networks in series but exhibit feature inconsistency that induces a covariance shift during backpropagation. The second is parallel fusion architectures (e.g., CMT [37]), which employ cross-attention for feature integration, achieving performance gains at the cost of excessive parameters that hinder practical deployment. The third is dynamic routing architectures (e.g., Dynamic-ViT [38]), which adaptively adjust computation paths based on image quality, though their performance shows undesirable instability. Current hybrid approaches exhibit two fundamental limitations. First, naive fusion strategies (e.g., direct concatenation or summation) cannot properly handle blur-induced inter-modal distribution shifts. Second, fixed computation patterns lack adaptability to blur variations, resulting in inefficient resource utilization. Tang [39] proposed ConTriNet, an RGB-T salient object detection network that enhances detection robustness and accuracy through a triple-stream architecture and a series of innovative modules, with superior performance validated on a new benchmark dataset. DC-Net achieves efficient and high-performance salient object detection by introducing a divide-and-conquer strategy and designing an innovative network architecture [40]. Recent work [41] has identified critical shortcomings in conventional normalization for blur processing: forced zero-mean normalization eliminates meaningful low-frequency components in blurred areas. This discovery establishes the theoretical basis for emerging denormalization techniques.
Our systematic analysis identifies three key research challenges requiring urgent breakthroughs: (1) the absence of dynamic network architectures capable of adapting to blur-induced variations in feature distributions; (2) the inability of existing fusion methods to establish synergistic optimization between CNNs’ local feature extraction and Transformers’ global attention mechanisms; and (3) a substantial performance-efficiency trade-off, particularly in mobile and edge computing applications. Solving these challenges necessitates innovative network architecture designs that transcend basic module integration and parameter optimization.
To overcome existing research limitations and technological barriers, we propose a novel CNN–Lightweight Transformer hybrid network to enhance blurred image recognition capabilities. In terms of architectural innovation, we introduce a dynamic denormalization module into the Transformer branch, addressing the compatibility issues between traditional normalization layers and blurred features. This module replaces LayerNorm with a tanh-based nonlinear compression mechanism, dynamically adjusting feature amplitudes via a learnable parameter α to preserve expressive power while avoiding the information loss caused by enforced zero-mean normalization. Theoretical analysis and experimental validation demonstrate that the DyT module maintains high feature validity even under extreme blur conditions, outperforming traditional LayerNorm by approximately 30 percentage points. To mitigate position encoding distortion, we propose blur-invariant positional encoding, which couples sinusoidal functions with blur-level parameters λK for spectral fusion. Compared to traditional methods, blur-invariant positional encoding reduces positioning errors and significantly improves spatial localization accuracy. These advancements underscore a paradigm shift in blur-adaptive deep learning, moving beyond incremental improvements to deliver computationally efficient yet high-precision recognition, which is particularly crucial for edge deployment.
In terms of network architecture, this study designs a gradient-content collaborative CNN branch, achieving efficient blur feature extraction through the MSG-ARB. This module incorporates two key innovations: the learnable gradient convolution dynamically optimizes gradient extraction kernels to adapt to various blur types by parameterizing the Sobel operator, and the gradient-content gating mechanism fuses gradient features with raw content features in an attention-based manner. Compared to traditional CNN architectures, MSG-ARB remains lightweight while achieving superior PSNR on the VAIS dataset, outperforming leading methods such as DeblurGAN-v2 [42]. This design ensures computational efficiency without compromising feature discriminability, addressing a critical trade-off in real-world blur processing applications.
In terms of feature fusion strategy, this study breaks through the traditional paradigm of simple concatenation or addition and proposes an adaptive hybrid fusion framework. The core of this framework is the dynamic gated fusion unit, whose innovations include: (1) automatically adjusting the fusion weights of CNN and Transformer features based on local clarity σi, achieving modality adaptation; (2) enhancing feature complementarity through a multi-scale interaction mechanism, effectively reducing inter-modal distribution differences; and (3) introducing a differentiable token-pruning strategy, which significantly reduces the computational load while incurring only minor accuracy loss. Systematic experiments on the VAIS dataset demonstrate that the model incorporating DyT achieves higher classification accuracy than pure CNN and Transformer baselines, along with improved inference speed and reduced memory usage.
The contributions of this paper are mainly reflected in four aspects: (1) the DyT module is integrated into a dual-branch model, and its effectiveness in blurred image recognition tasks is empirically validated; (2) a new paradigm for CNN feature extraction based on gradient-content synergy is established, significantly enhancing the model’s discriminative ability for blurred features; (3) an adaptive feature fusion method is pioneered, achieving complementary advantages between the CNN’s local features and the Transformer’s global features; and (4) state-of-the-art performance is achieved on multiple benchmark datasets while meeting stringent computational efficiency requirements for practical applications. These results not only provide an efficient solution for blurred image recognition but also open new directions for lightweight model design in the computer vision field. Future research will further explore the potential of this framework in extended applications such as video deblurring and medical image analysis.
2. Methods
2.1. Overall Architecture
This paper proposes a CNN–Transformer fusion network employing a dual-branch parallel architecture, which achieves efficient blur recognition through complementary feature extraction and dynamic fusion. The overall framework is illustrated in Figure 1. The CNN branch utilizes an improved lightweight ResNet architecture that replaces standard residual blocks with MSG-ARB. This branch explicitly extracts local gradient features through learnable gradient convolution and enhances representation capability in blur-sensitive regions via gradient-content gating, reducing computational costs by 42% compared to conventional CNNs. The Transformer branch adopts an HST structure, capturing global contextual information through shifted-window multi-head self-attention. Blur-invariant PE is introduced to strengthen the modeling of blur frequency spectra, enabling efficient computation of long-range dependencies with a 16 × 16 window configuration. Figure 2 illustrates the overall architectural framework.
2.2. CNN Branch
The core challenge in blurred image recognition lies in the degradation of high-frequency information (e.g., edges and textures) caused by blurring, whereas traditional CNN architectures (e.g., ResNet [43] and VGG [44]) struggle to explicitly model such degradation patterns. To address this issue, this paper proposes the MSG-ARB, which tackles the gradient information loss in blurred images by employing learnable gradient convolution. By embedding the traditional Sobel operator into convolutional layers, it dynamically optimizes gradient extraction kernels to adapt to various blur types while mitigating the limitations of handcrafted gradient operators (e.g., fixed directional sensitivity). Furthermore, considering that different blur types (e.g., motion blur and Gaussian blur) affect textures at varying scales, we introduce a gradient-content gating mechanism. This module selectively fuses gradient features with original content features through a gating mechanism to enhance attention toward blur-sensitive regions. Lastly, to address the high computational cost typically associated with complex multi-scale modules (e.g., dilated convolutions), we integrate dilated convolutions with grouped convolutions to efficiently capture multi-scale blur features at low computational overhead. The network architecture adopts a lightweight ResNet variant as its backbone to avoid redundant computations induced by excessively deep networks. Specifically, we replace the standard ResNet BasicBlock with the proposed MSG-ARB module, constructing a gradient-enhanced residual network. The detailed structure of the MSG-ARB module is illustrated in Figure 3.
The learnable gradient convolution layer initializes gradient kernels using group convolution, where each group corresponds to a specific gradient direction. The formulation is as follows:
In this equation, G represents the number of groups, and Wi represents the learnable parameters.
Learnable gradient convolution simulates the sensitivity of the human visual cortex to edge orientations through parameterized gradient operators, where its multi-directional learnable kernels correspond to the orientation selectivity mechanism of visual neurons. The Perona–Malik equation is based on anisotropic diffusion theory [45]; its core concept is the adjustment of diffusion strength according to local image characteristics, thereby smoothing images while preserving edges. The gradient convolution kernels can be viewed as a discretized implementation of the Perona–Malik equation, achieving a balance between feature preservation and noise suppression through adaptive gradient weight adjustment. The grouped convolution design in Equation (1) constructs a complete set of gradient basis functions through G groups of learnable parameters Wi, and can theoretically approximate first-order differential operators in any direction. The fixed-coefficient Sobel operator is merely a special case of learnable gradient convolution under specific parameters; through end-to-end learning, scene-adaptive gradient kernels can be obtained.
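As an illustration of this idea, the following PyTorch sketch initializes a grouped 3 × 3 convolution from directional Sobel-style kernels and leaves the weights trainable; the module name, the choice of four directions, and the initialization scheme are illustrative assumptions rather than the exact configuration of MSG-ARB.

```python
import torch
import torch.nn as nn

class LearnableGradientConv(nn.Module):
    """Grouped 3x3 convolution initialized with directional Sobel-style kernels.

    Illustrative sketch: each group starts from one gradient direction
    (horizontal, vertical, two diagonals) and is refined end-to-end.
    """

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=groups, bias=False)
        # Directional Sobel-like bases (assumed set of four directions).
        bases = torch.tensor([
            [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],   # horizontal gradient
            [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],   # vertical gradient
            [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],   # 45-degree diagonal
            [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],   # 135-degree diagonal
        ], dtype=torch.float32)
        with torch.no_grad():
            per_group = channels // groups
            for out_ch in range(channels):
                direction = (out_ch // per_group) % 4
                # Each output channel sees `per_group` input channels of its group.
                self.conv.weight[out_ch] = bases[direction].repeat(per_group, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

# Example: a 64-channel feature map split into 4 directional groups.
# grad_feats = LearnableGradientConv(64, groups=4)(torch.randn(1, 64, 56, 56))
```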
The multi-scale dilated convolution group employs three parallel dilated convolutions (dilation rates = 1, 2, 3) with respective receptive fields of 3 × 3, 7 × 7, and 11 × 11. Depthwise separable convolution is adopted to reduce the computational overhead.
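The multi-scale group can be sketched in a similar spirit; the concatenation-plus-projection fusion of the three parallel branches below is an assumption, while the dilation rates (1, 2, 3) and the depthwise separable design follow the description above.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedGroup(nn.Module):
    """Three parallel depthwise-separable 3x3 convolutions with dilation 1, 2, 3."""

    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in (1, 2, 3):
            self.branches.append(nn.Sequential(
                # Depthwise dilated convolution (one filter per channel).
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),
                # Pointwise convolution mixes channels cheaply.
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ))
        # Assumed fusion: concatenate the branch outputs, then project back.
        self.project = nn.Conv2d(3 * channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```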
The gradient-content gating mechanism concatenates gradient features Fgrad with original features Fcontent, then generates gating weights through a Sigmoid activation function. The equation is expressed as:
It mimics the dual-stream mechanism of the human visual system, comprising a content pathway and a spatial pathway. Gradient features Fgrad correspond to edge-sensitive characteristics, while content features Fcontent represent holistic attributes. The sigmoid gating in Equation (2) aligns with the feature selection characteristics of visual cortex area V4, which enhances response intensity in key regions. It employs gating operations to selectively inject gradient information while preserving original content features. Equation (3) forms a dynamic system in which edge regions (α → 0) enhance gradient features, while smooth areas (α → 1) maintain the original content.
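A minimal sketch of the gating described by Equations (2) and (3) is given below; the 1 × 1 convolutional gate and the exact mixing form α·Fcontent + (1 − α)·Fgrad are assumptions consistent with the stated behavior that smooth regions (α → 1) keep the content while edge regions (α → 0) emphasize gradients.

```python
import torch
import torch.nn as nn

class GradientContentGate(nn.Module):
    """Sigmoid gate that blends gradient features with content features (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_grad: torch.Tensor, f_content: torch.Tensor) -> torch.Tensor:
        # Assumed form of Eq. (2): gate from the concatenated features.
        alpha = self.gate(torch.cat([f_grad, f_content], dim=1))
        # Assumed form of Eq. (3): alpha -> 1 keeps content, alpha -> 0 keeps gradients.
        return alpha * f_content + (1.0 - alpha) * f_grad
```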
Finally, a residual skip connection is introduced to stabilize the training process, formulated as:
The MSG-ARB module significantly enhances the feature extraction capability of CNN branches in blurred image recognition through explicit gradient modeling and dynamic multi-scale fusion, while maintaining lightweight characteristics. This module can be flexibly embedded into existing networks, establishing a foundation for subsequent dual-branch fusion.
2.3. Transformer Branch
The proposed Transformer branch employs a modified HST as its backbone network, establishing a comprehensive local-sensitive feature extraction framework through an in-depth analysis of blurred image characteristics.
The core innovation of this branch lies in its dynamically quality-aware adaptive processing mechanism and the nonlinear dynamic compression mechanism, which significantly reduces the computational overhead. The main architecture comprises three key components. The Blur-Adaptive Dynamic Window Partitioning Module automatically adjusts the attention computation range based on local sharpness, effectively maintaining feature capture capability while substantially reducing computational complexity. Specifically, Gaussian blur detection is employed to generate dynamic weight maps that guide intelligent window size adaptation between 7 × 7 and 11 × 11. Compared to fixed-window designs, this approach reduces computational costs by 18%.
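One plausible realization of this window selection step is sketched below: a Gaussian-blurred copy of the input is compared with the original to obtain a per-region sharpness map, and each region is then assigned a 7 × 7 or 11 × 11 window. The sharpness proxy, the threshold, and the mapping from sharpness to window size are assumptions, not the exact detector used here.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def select_window_sizes(image: torch.Tensor, grid: int = 8,
                        threshold: float = 0.02) -> torch.Tensor:
    """Assign a 7x7 or 11x11 attention window to each grid cell (sketch).

    image: (B, C, H, W) in [0, 1]. Returns a (B, grid, grid) map of window sizes.
    Sharpness proxy: mean absolute residual of a Gaussian-blurred copy.
    """
    blurred = gaussian_blur(image, kernel_size=[7, 7], sigma=[1.5, 1.5])
    high_freq = (image - blurred).abs().mean(dim=1, keepdim=True)      # (B,1,H,W)
    cell_sharpness = F.adaptive_avg_pool2d(high_freq, grid).squeeze(1)  # (B,grid,grid)
    # Assumed mapping: sharp cells get small 7x7 windows, blurred cells 11x11.
    return torch.where(cell_sharpness > threshold,
                       torch.full_like(cell_sharpness, 7),
                       torch.full_like(cell_sharpness, 11))
```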
The Local-Sensitive Self-Attention (LS-SA) mechanism breaks through the traditional undifferentiated computation of attention in blurred regions by dynamically modulating query-key-value projections via a sharpness coefficient (σi). As a result, regions with rich high-frequency details receive stronger attention, while uniformly blurred areas are processed sparsely. Figure 4 illustrates the process of the LS-SA mechanism and compares its effects with traditional SA. Mathematically, this is represented by the modified attention scoring function:
In this equation, σi represents the local sharpness coefficient, obtained through real-time calculation via low-pass filtering.
Experimental results demonstrate that this design achieves a 23% improvement in key feature extraction accuracy under moderate blurring conditions (σ = 1.0).
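Since the modified scoring function is given only in outline here, the sketch below shows one plausible reading: the per-token sharpness coefficient σi rescales the attention logits so that sharp tokens receive relatively stronger attention. The σ-weighting form is an assumption rather than the exact equation.

```python
import torch
import torch.nn.functional as F

def local_sensitive_attention(q, k, v, sigma):
    """Sharpness-modulated self-attention (illustrative sketch).

    q, k, v: (B, heads, N, d) projections.
    sigma:   (B, 1, N, 1) per-token local sharpness coefficient in [0, 1],
             e.g. derived from the residual of a low-pass filter.
    """
    d = q.size(-1)
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5    # (B, heads, N, N)
    logits = logits * sigma.transpose(-2, -1)        # boost columns of sharp keys (assumed form)
    attn = F.softmax(logits, dim=-1)
    return attn @ v

# Example shapes: B=2, heads=4, N=196 tokens, d=32.
# q = k = v = torch.randn(2, 4, 196, 32); sigma = torch.rand(2, 1, 196, 1)
# out = local_sensitive_attention(q, k, v, sigma)
```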
The multi-scale feature interaction system employs a three-level pyramid structure to process visual features at varying granularities: the micro-scale (16 × 16) uses four-head attention for fine-grained detail focus; the meso-scale (32 × 32) establishes contextual correlations via two-head attention; and the macro-scale (64 × 64) captures holistic patterns using single-head attention. Cross-scale feature fusion is achieved through depthwise separable convolution, enabling efficient inter-level integration.
Figure 5 illustrates the three-level multi-scale feature pyramid structure.
In terms of positional encoding, to address the geometric distortion caused by blurring, we use blur-invariant positional encoding, which integrates the conventional sinusoidal function with a blur-level parameter λK. A compact MLP network is employed to predict dynamic weights in real time. Traditional encoding produces aliasing errors in the frequency domain, whereas blur-invariant positional encoding dynamically compensates the spectrum by introducing the blur parameter λK. Biological place cells (grid cells) adjust their firing patterns in blurred environments, and the λK modulation mechanism of blur-invariant positional encoding mimics this characteristic: by adjusting the blur degree parameter σ, λK can dynamically regulate the grid spacing of the positional encoding. Under severe blur conditions (σ = 2.0), this method reduces positional errors from 53.1 pixels to 15.8 pixels. Figure 6 illustrates the comparative effects between traditional PE and blur-invariant PE. The mathematical equation is presented below:
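To make the mechanism concrete, the following sketch re-weights standard sinusoidal encodings with per-band coefficients λK predicted from the estimated blur level σ by a compact MLP; the specific coupling of λK with the sinusoids is an assumption based on the description above.

```python
import math
import torch
import torch.nn as nn

class BlurInvariantPE(nn.Module):
    """Sinusoidal positional encoding re-weighted by blur-dependent coefficients (sketch)."""

    def __init__(self, dim: int, max_len: int = 4096):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                  # (max_len, dim)
        # Compact MLP mapping the scalar blur level sigma to per-band weights
        # lambda_K (assumed realization of the "dynamic weights" in the text).
        self.lambda_net = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(inplace=True),
            nn.Linear(32, dim), nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, dim); sigma: (B, 1) estimated blur level per image."""
        n = tokens.size(1)
        lam = self.lambda_net(sigma).unsqueeze(1)       # (B, 1, dim)
        return tokens + lam * self.pe[:n].unsqueeze(0)  # broadcast over tokens
```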
To optimize computational efficiency, this branch incorporates a differentiable token-pruning strategy that removes tokens contributing less to the inference task, enhancing efficiency without significantly compromising accuracy. This strategy estimates token importance through learnable parameters λK and gradients obtained during training, enabling the model to dynamically balance computational cost and performance.
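A hedged sketch of this strategy is shown below: a learnable scorer produces per-token importance values that act as a soft mask during training (so gradients can flow) and as a hard top-k selection at inference. The scoring head and keep ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DifferentiableTokenPruning(nn.Module):
    """Soft token masking in training, hard top-k selection at inference (sketch)."""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Linear(dim, 1)   # learnable importance estimate

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, dim)."""
        scores = torch.sigmoid(self.scorer(tokens))       # (B, N, 1)
        if self.training:
            # Differentiable: low-importance tokens are attenuated, not removed.
            return tokens * scores
        # Inference: physically drop the least important tokens.
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.squeeze(-1).topk(k, dim=1).indices   # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)
```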
The normalization layers throughout this branch are replaced with the DyT layer, which is defined as follows:
Here, α is a learnable scalar parameter that dynamically adjusts the scaling ratio based on the input value range, thereby accommodating inputs x of varying magnitudes. The parameters γ and β are learnable channel-wise vectors, following standard practice in normalization layers, which allows the output to be rescaled to arbitrary magnitudes. DyT typically performs well without requiring hyperparameter tuning of the original architecture, significantly reducing deployment overhead. For parameter initialization, we adhere to conventional normalization layer protocols: γ is initialized as an all-ones vector, and β is initialized as an all-zeros vector. The scaling parameter α defaults to an initial value of 0.5.
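A minimal implementation of the DyT layer as defined above, with the stated initialization (α = 0.5, γ = 1, β = 0), is:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh layer used in place of LayerNorm: y = gamma * tanh(alpha * x) + beta."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # channel-wise scale
        self.beta = nn.Parameter(torch.zeros(dim))             # channel-wise shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); no mean/variance statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```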
The branch adopts a two-stage training strategy: first, pre-training the foundational feature representation on clear images, followed by fine-tuning with augmented data incorporating various degradation types such as Gaussian blur and motion blur. With an input resolution of 256 × 256, the total parameter count is constrained to 64M, achieving an inference latency of 21 ms on an NVIDIA V100 GPU.
This branch can accurately focus on relevant feature regions, achieving superior recognition accuracy over the standard Swin Transformer under typical blur scenarios. Compared to existing approaches, this design pioneers the collaborative blur-adaptive optimization of three key components: the window mechanism, attention computation, and PE. This work establishes a novel technical paradigm for subsequent ViT applications in degraded image processing. The overall structure of this branch is shown in Figure 7.
2.4. Fusion Module
To fully integrate the local feature extraction capability of CNNs with the global modeling advantages of Transformers, the proposed feature fusion module employs a parallel processing mechanism: the CNN branch captures fine-grained local details in images, while the Transformer branch establishes long-range dependencies. The two complementary feature representations are then fused to construct more discriminative visual representations. This design preserves the benefits of traditional convolutional operations in local feature extraction while incorporating the global context modeling capabilities of self-attention mechanisms. Specifically, the fusion module consists of three key components: a feature alignment unit for resolving cross-branch feature map size mismatch, a multi-scale fusion unit for cross-scale feature interaction and information compensation, and a dynamic gating unit for adaptive feature weight allocation.
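A rough sketch of how these components can fit together is given below: the two feature maps are aligned to a common resolution and blended with a gate conditioned on the local clarity map σi; the concrete layers and the gating convention are assumptions, not the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGatedFusion(nn.Module):
    """Align CNN and Transformer features, then blend them with a learned gate (sketch)."""

    def __init__(self, cnn_dim: int, trans_dim: int, out_dim: int):
        super().__init__()
        self.align_cnn = nn.Conv2d(cnn_dim, out_dim, 1)
        self.align_trans = nn.Conv2d(trans_dim, out_dim, 1)
        # Gate conditioned on both feature maps plus a local clarity map sigma.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * out_dim + 1, out_dim, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn, f_trans, sigma):
        """f_cnn: (B,Cc,H,W); f_trans: (B,Ct,h,w); sigma: (B,1,H,W) clarity map."""
        f_cnn = self.align_cnn(f_cnn)
        f_trans = self.align_trans(
            F.interpolate(f_trans, size=f_cnn.shape[-2:], mode="bilinear",
                          align_corners=False))
        g = self.gate(torch.cat([f_cnn, f_trans, sigma], dim=1))
        # Assumed convention: g -> 1 favours local CNN detail,
        # g -> 0 favours global Transformer context.
        return g * f_cnn + (1.0 - g) * f_trans
```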
To enhance the model’s perception capability for subtle features in low-quality images, this study integrates attention mechanisms during the feature fusion stage. Specifically, a composite attention module is designed by cascading a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), forming a CBAM (Convolutional Block Attention Module) architecture. The structure is shown in Figure 8. This dual attention mechanism operates synergistically, effectively identifying important feature channels while precisely focusing on critical spatial regions, thereby significantly improving the model’s recognition performance for blurred image details.
In the design of the CAM, this study introduces a feature compression strategy. Specifically, the proposed module not only retains the conventional mean pooling operation but also incorporates the advantages of max pooling, preserving both average response and peak response feature information along the spatial dimensions to achieve more comprehensive channel feature representation. This dual-pooling strategy effectively mitigates the potential loss of crucial feature information that may occur with single mean pooling. Experimental results demonstrate that the enhanced channel attention mechanism can accurately identify key channels in feature maps and reinforce the most discriminative feature elements in blurred images through dynamic weight allocation, thereby significantly improving the network’s recognition accuracy for low-quality images.
In the design of the SAM, this study employs a dual-pooling and feature fusion strategy. Unlike the CAM, SAM first performs parallel max pooling and average pooling operations on the input feature maps for initial compression, followed by further dimensionality reduction along the channel axis. This processing generates a pair of complementary two-dimensional spatial feature maps, which are then concatenated to form an intermediate feature representation with two channels. The intermediate features are subsequently fed into a lightweight convolutional network for feature fusion. The carefully designed convolutional operations ensure that the output feature maps maintain the same spatial dimensions as the input. This design not only effectively captures the statistical characteristics of spatial feature distributions but also preserves the geometric consistency of the feature maps. The SAM dynamically adjusts the attention weights across different spatial regions of the feature maps, enabling the model to adapt to visual variations caused by illumination changes, scale differences, or partial occlusions. This mechanism significantly enhances the model’s generalization capability when processing various types of blurred images. The overall structure of the feature fusion module is shown in Figure 9.
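A compact sketch of the cascaded channel and spatial attention used in this fusion stage is given below, with dual average/max pooling in both modules as described; the reduction ratio and kernel size are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Dual-pooling channel attention: average and max responses share one MLP."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Dual-pooling spatial attention over the channel axis."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as cascaded in the fusion stage."""

    def __init__(self, channels: int):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(x))
```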
2.5. Loss Function
In deep learning models, the loss function serves as a crucial optimization objective whose fundamental role is to quantitatively assess the discrepancy between model predictions and ground truth labels. To ensure the stability and effectiveness of the training process, the design of loss functions requires a comprehensive consideration of task-specific characteristics, data distribution properties, and network architecture. Particular attention should be paid to the fact that during backpropagation, loss functions may induce gradient anomalies, including gradient explosion and vanishing gradient problems. These potential issues can significantly impair the efficiency of model parameter updates, thus necessitating an appropriate loss function selection and regularization strategies for mitigation. An ideal loss function should not only accurately reflect prediction errors but also maintain numerical stability throughout the training process. In early-stage image recognition tasks, conventional Softmax cross-entropy loss (L_CE) was typically employed for network training, with its mathematical formulation expressed as:
In this equation, m represents the training batch size; e denotes the exponential function used to normalize the output vector; Wyi represents the column of the weight matrix W corresponding to the class yi of the i-th sample; xi represents the feature vector of the i-th sample image; n represents the number of sample categories; Wj refers to the j-th column of the weight matrix W; and b is the bias term.
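For clarity, the standard form of this loss, consistent with the notation above, is:

```latex
L_{CE} = -\frac{1}{m}\sum_{i=1}^{m}
\log \frac{e^{W_{y_i}^{\mathsf{T}} x_i + b_{y_i}}}
          {\sum_{j=1}^{n} e^{W_{j}^{\mathsf{T}} x_i + b_j}}
```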
In later improvements to the loss function, the vector product of Wj and xj was reformulated using a cosine similarity term cos θj, as shown in the following formulation:
In this equation, cos θj represents the cosine of the angle between the j-th column vector of the weight matrix (Wj) and the feature vector of the j-th sample (xj).
To facilitate training, we normalize ||Wj|| to unit length (i.e., ||Wj|| = 1), and introduce an angular margin multiplier (denoted as m) to further maximize inter-class separation by penalizing angles between samples and their corresponding class centers.
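A hedged sketch of this margin-based formulation is given below: weights and features are L2-normalized and a multiplicative angular margin is applied to the ground-truth angle before the softmax cross-entropy. The feature-scale factor and the exact margin handling are assumptions, since only the normalization of ||Wj|| and the margin multiplier m are specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginSoftmax(nn.Module):
    """Cosine-similarity softmax with a multiplicative angular margin (sketch).

    Weights are L2-normalized (||W_j|| = 1) as stated in the text; the scale s
    and the simple cos(m * theta) margin handling are illustrative assumptions.
    """

    def __init__(self, feat_dim: int, num_classes: int, margin: float = 2.0,
                 scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = margin
        self.scale = scale

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos_theta = F.linear(F.normalize(feats), F.normalize(self.weight))
        theta = torch.acos(cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Penalize the angle to the ground-truth class center by the margin m.
        target_logit = torch.cos(self.margin * theta.gather(1, labels.unsqueeze(1)))
        logits = cos_theta.clone()
        logits.scatter_(1, labels.unsqueeze(1), target_logit)
        return F.cross_entropy(self.scale * logits, labels)

# Example: loss = AngularMarginSoftmax(512, 10)(torch.randn(8, 512), torch.randint(0, 10, (8,)))
```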