1. Introduction
With the rapid advancement of visual perception technologies such as video surveillance, intelligent transportation, and autonomous driving, wide-angle and fisheye cameras have been widely adopted in complex real-world scenarios due to their superior field-of-view coverage and efficient information acquisition capabilities [
1,
2]. However, these lenses often introduce severe geometric distortions during image acquisition, manifesting as edge stretching, radial deformation, and structural warping [
3]. Such distortions not only compromise geometric consistency but also severely affect the accuracy and robustness of downstream tasks, including image analysis, object recognition, and multi-view image stitching [
4,
5]. Therefore, developing methods for accurate distortion rectification and seamless image stitching, with minimal loss of visual integrity, remains a key challenge in the field of computer vision.
Although deep learning-based approaches have achieved notable progress in image rectification tasks in recent years [
6,
7], most state-of-the-art models rely on large-scale networks and high-performance computing resources, making them unsuitable for deployment in resource-constrained environments where low latency and real-time performance are critical [
8]. These models often perform poorly on edge platforms such as smart surveillance systems and mobile sensing devices. Meanwhile, traditional image rectification and stitching methods are primarily based on geometric modeling—such as polynomial distortion correction and homography transformations [
9,
10]—which, despite their adaptability in specific scenarios, lack generalizability and robustness when applied to high-resolution images, non-rigid deformations, or dynamic scenes [
11]. Furthermore, multi-camera stitching tasks face additional system-level challenges such as image synchronization, dynamic seam-line adjustment, frame rate control, and resource allocation [
12], posing significant trade-offs between lightweight design, high performance, and deployment feasibility.
In response to the aforementioned challenges, this study proposes an image processing framework that achieves both high-precision distortion correction and efficient deployment on edge devices. At the algorithmic level, a lightweight image rectification network is designed, integrating a Swin Transformer-based encoder architecture, a Thin Plate Spline (TPS) control point prediction module, and an optical flow estimation mechanism [
13,
14,
15]. The network enhances feature representation through a sliding window attention mechanism and multi-scale feature extraction modules while leveraging optical flow vector directions as structural constraints to effectively model nonlinear distortions, thereby improving rectification accuracy and robustness. To further enhance generalization performance, the framework incorporates a comprehensive data augmentation strategy—including random cropping, elastic deformation, illumination simulation, and noise injection [
16]—as well as a spatial attention module (SAM) and multi-branch feature fusion [
17], enabling the model to better adapt to complex scene variations.
At the system engineering level, a real-time dual-camera image stitching system is implemented on the Jetson TX2 NX edge computing platform. Leveraging CUDA-based parallel computation and the DeepStream multimedia framework, the system realizes an end-to-end pipeline encompassing image acquisition, preprocessing, rectification, stitching, and video streaming [
18,
19]. In addition, structural optimizations are applied to reduce computational complexity—such as resizing and pruning Swin Transformer layers, simplifying the TPS prediction module, and employing low-rank approximation and network pruning techniques [
20]—thereby ensuring that the proposed model satisfies the dual requirements of real-time performance and deployment stability in resource-constrained environments.
The main objectives of this study can be summarized as follows:
- (1)
To develop a deep rectification network tailored for high-resolution wide-angle images that balances accuracy, speed, and deployment feasibility;
- (2)
To implement a dual-camera stitching system that integrates image preprocessing, registration, and GPU-based stitching, enabling real-time image fusion and visual output;
- (3)
To achieve a closed-loop transition from algorithm design to deployment on edge devices, ensuring efficient system operation and engineering maintainability under resource-constrained conditions.
Taken together, this work provides a comprehensive technical pathway encompassing lightweight algorithm design, system-level integration, and embedded platform deployment, thereby facilitating the transformation of wide-angle image processing technologies from theoretical research to practical engineering applications. The proposed approach holds significant theoretical and practical value for the real-time implementation and productization of intelligent vision systems.
3. Methodology
3.1. Network Architecture Design and Optimization
The image correction algorithm proposed in this study is built upon an encoder–decoder architecture based on the Swin Transformer [
43], which integrates shifted window attention with multi-scale feature extraction to effectively balance correction performance and computational efficiency.
The input image is first decomposed into multiple resolution levels, and the extracted multi-scale features are fed into the TPS control point prediction module to generate a deformation control map. This map is subsequently passed through the decoder to produce the motion estimation flow. In parallel, the features are also input into the optical flow estimation module, which assists in deformation modeling and guides the overall loss optimization process. The complete network architecture is illustrated in
Figure 1.
To enhance the correction accuracy and generalization capability of the model for high-resolution distorted images, this study incorporates three multi-scale enhancement strategies into the network architecture.
First, during the training phase, in addition to using standard public datasets, image region masks are introduced to emphasize dominant content areas. These masks guide the network to focus on key regions, thereby improving feature weighting and enhancing the model’s perceptual sensitivity to geometrically critical zones.
Second, the backbone employs a Swin Transformer with a shifted window encoding structure, which divides the input image into patches and maps them into a high-dimensional semantic space. Combined with downsampling operations, this design increases channel dimensionality while reducing spatial resolution, effectively improving local feature representation.
Third, multi-scale focus modules are embedded between successive Swin Transformer blocks. These modules integrate dilated convolutions and dual skip connections to expand the receptive field, enabling the joint modeling of local details and global contextual information. This structure significantly improves the model’s adaptability to complex and irregular distortions.
The multi-scale features extracted from the encoder are then fed into the TPS control point prediction module, where a set of convolutional kernels—guided by optical flow vectors—iteratively model non-rigid deformations to produce progressively refined control point maps. These maps are subsequently passed to the decoder to generate the motion estimation flow. In parallel, the original image features are processed by the optical flow estimation module, which employs a feature correlation layer and a GRU-based recurrent mechanism to output the final dense flow field. Both motion representations are fused via concatenation and jointly optimized using a unified loss function.
Given the structural complexity and inference latency of conventional Transformer-based networks in high-resolution wide-angle image processing, a lightweight redesign of the Swin architecture is further implemented. This includes reducing the number of encoder/decoder layers, pruning redundant modules, decreasing the number of attention heads, and simplifying parameter-intensive components. These modifications significantly reduce computational overhead.
Experimental results confirm that the optimized model maintains competitive correction performance while substantially improving inference speed, demonstrating strong potential for deployment in edge computing environments and real-world applications.
3.2. Multi-Scale Feature Extraction Module
3.2.1. Encoder Architecture for Multi-Scale Feature Extraction
The shifted window attention mechanism in the Swin Transformer enables efficient extraction of multi-dimensional image features within local spatial regions and serves as the core foundation for feature representation in this study. To further enhance the expressive capacity of the network, two types of multi-scale feature enhancement modules are introduced between the sliding window attention layers: the Cross-Scale Enhancement and Dilated (CED) Block and the Intermediate Focus Block. Specifically, the CED Block extracts features from multiple resolution scales within the same image, effectively improving the model’s horizontal multi-scale perceptual capacity. In contrast, the Intermediate Focus Block increases network depth to capture richer semantic hierarchies from a vertical dimension, thereby significantly strengthening the model’s understanding of global contextual information.
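To make the horizontal (cross-scale) enhancement concrete, the following PyTorch sketch shows one plausible realization of a CED-style block that combines parallel dilated convolutions with two skip connections. The layer widths, dilation rates, and activation are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CEDBlock(nn.Module):
    """Illustrative Cross-Scale Enhancement and Dilated (CED) block:
    parallel dilated convolutions enlarge the receptive field at several
    scales, and two skip connections fuse the input back in."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.fuse = nn.Conv2d(3 * ch, ch, kernel_size=1)
        self.refine = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU())

    def forward(self, x):                                   # x: (B, C, H, W)
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        y = self.fuse(multi) + x        # first skip connection
        return self.refine(y) + x       # second skip connection
```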
The proposed encoder architecture consists of four main stages, with the first three stages composed of repeated stacks of sliding window attention modules. These layers progressively extract hierarchical structural features from the input image, forming a gradual feature abstraction process. The complete structure of the encoder is illustrated in
Figure 2.
In this study, the encoder architecture begins with a patch partitioning layer that divides the input image into fixed-size patches, with a default size of 4 × 4 pixels per patch. This operation results in a 4× downsampling along each spatial dimension and increases the number of output channels by a factor of 16 (i.e., 4²), significantly enhancing the feature representation capacity while maintaining computational efficiency. This patch partitioning mechanism is similar to the Patch Embedding used in the Vision Transformer (ViT), but the proposed method adopts smaller patch sizes to better capture fine-grained local features.
Following the patch partitioning, a linear embedding layer projects each low-dimensional feature vector into a predefined high-dimensional embedding space of dimension C, which serves as the input for the subsequent self-attention mechanisms. The embedding dimension C is determined based on a trade-off between model capacity and computational cost, with the goal of achieving an optimal balance between expressive power and resource efficiency.
Within the hierarchical structure of the encoder, the Patch Merging module is used to merge each group of 2 × 2 adjacent image patches into a new structural block, thereby performing spatial downsampling. This process not only reduces the spatial resolution of the feature maps but also reorganizes the feature dimensions by increasing the number of channels, resulting in more compact inputs with higher semantic density. The aggregation of local region information effectively expands the model’s receptive field and enhances its ability to model contextual relationships.
In addition, the multi-scale feature merging strategy helps capture semantic information across different spatial hierarchies, improving the diversity and generalizability of feature representations. The merged structural blocks are further processed by convolutional layers to enhance the features and extract higher-level abstract semantics.
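For reference, the sketch below shows the standard Swin-style patch merging operation described above: each 2 × 2 group of neighbouring patches is concatenated (4C channels) and linearly projected, halving the spatial resolution. The projection to 2C follows the original Swin design, which the proposed encoder may adapt.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of adjacent patches: halves H and W,
    concatenates the four patch vectors (4C) and projects them to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of every 2x2 block
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```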
In the core module of the Swin Transformer—namely, the Swin Transformer Block—this study introduces two efficient local attention mechanisms: Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window-based Multi-Head Self-Attention (SW-MSA), which are employed to replace the traditional global Multi-Head Self-Attention (MSA) mechanism [
43]. Compared with global MSA, the window-based attention significantly reduces computational complexity, which can be theoretically expressed as shown in Equation (1):

Ω(MSA) = 4hwC² + 2(hw)²C,  Ω(W-MSA) = 4hwC² + 2M²hwC.  (1)

Here, h and w denote the height and width of the feature map, C represents the number of channels, and M is the window size. For instance, when the feature map size is 112 × 112, the window size M = 7, and the channel dimension C = 128, the W-MSA module saves approximately 40,124,743,680 floating point operations (FLOPs) compared to the conventional global MSA module under the same configuration. This substantial reduction in computational cost significantly improves the inference efficiency of the network. The structure of the Swin Transformer Block is illustrated in
Figure 3.
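The FLOP saving quoted above can be verified with a few lines of Python, assuming the standard Swin Transformer complexity estimates of Equation (1); this is a rough operation count, not a profiler measurement.

```python
def msa_flops(h, w, C):
    # Omega(MSA) = 4*h*w*C^2 + 2*(h*w)^2*C
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    # Omega(W-MSA) = 4*h*w*C^2 + 2*M^2*h*w*C
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

if __name__ == "__main__":
    h = w = 112            # feature-map height and width
    C, M = 128, 7          # channels and window size
    saving = msa_flops(h, w, C) - wmsa_flops(h, w, C, M)
    print(f"FLOPs saved by W-MSA: {saving:,}")   # 40,124,743,680
```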
3.2.2. Design of the Focus Block
Another key component of the multi-scale feature extraction module is the introduction of the Focus Block, which primarily aims to increase network depth and extract higher-level semantic features from images. This module consists of two core substructures: (1) a Token Mixer responsible for spatial feature interaction and (2) a two-stage Multi-Layer Perceptron (MLP) unit designed to model semantic relationships across channels.
When the Focus Block is positioned between the encoder output and the TPS prediction module input, multiple skip connections are introduced within the block to facilitate direct cross-layer information transmission and integration. This structure helps alleviate the vanishing gradient problem in deep networks and improves both the fidelity and robustness of the extracted features.
The architecture of the Focus Block is illustrated in
Figure 4, where dashed lines indicate skip connection paths between different layers.
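A minimal sketch of such a block is given below, with a depth-wise convolutional token mixer for spatial interaction and a two-stage channel MLP, each wrapped by a skip connection. Exact layer sizes, normalization choices, and the expansion ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FocusBlock(nn.Module):
    """Illustrative Focus Block: token mixer (spatial interaction) followed
    by a two-stage MLP (cross-channel modelling), with skip connections
    around both sub-structures."""
    def __init__(self, ch, expand=4):
        super().__init__()
        self.mixer = nn.Sequential(
            nn.BatchNorm2d(ch),
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),   # depth-wise token mixer
        )
        hidden = expand * ch
        self.mlp = nn.Sequential(                          # two-stage channel MLP
            nn.BatchNorm2d(ch),
            nn.Conv2d(ch, hidden, kernel_size=1), nn.GELU(),
            nn.Conv2d(hidden, ch, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.mixer(x)   # skip connection around the token mixer
        x = x + self.mlp(x)     # skip connection around the MLP
        return x
```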
3.3. Thin Plate Spline Control Point Prediction Module
TPS transformation is a non-rigid deformation method based on control point interpolation. By minimizing the bending energy functional, a quantitative measure of deformation smoothness, it achieves locally continuous transformation and fine-grained adjustment of the image while preserving global affine properties.
In this study, the image features extracted by the encoder are fed into the TPS prediction module to estimate the spatial distribution of TPS control points. It is important to note that the generation of control points is constrained to the regions defined by the input image mask, allowing the model to focus on areas exhibiting prominent distortions or structural variations. The module outputs the total number of TPS control points across the entire image, which serves as an indicator of local deformation complexity—where a higher number of control points suggests a more complex nonlinear rectification requirement.
This module provides critical support for the subsequent geometric reconstruction of the image and enables accurate spatial transformation and distortion correction.
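For readers unfamiliar with the underlying interpolation, the NumPy sketch below fits a thin plate spline to a set of control-point correspondences and applies it to query points. It illustrates the standard TPS formulation (radial basis U(r) = r² log r² plus an affine part); the function name and interface are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def tps_warp_points(src_ctrl, dst_ctrl, query, eps=1e-9):
    """Fit the TPS mapping defined by control-point pairs (src_ctrl -> dst_ctrl)
    and apply it to query points. Shapes: (N, 2), (N, 2), (M, 2)."""
    def U(r2):                       # TPS radial basis U(r) = r^2 * log(r^2)
        return r2 * np.log(r2 + eps)

    n = src_ctrl.shape[0]
    d2 = ((src_ctrl[:, None, :] - src_ctrl[None, :, :]) ** 2).sum(-1)
    K = U(d2)                                    # (N, N) radial terms
    P = np.hstack([np.ones((n, 1)), src_ctrl])   # (N, 3) affine part
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    Y = np.vstack([dst_ctrl, np.zeros((3, 2))])  # targets + affine constraints
    params = np.linalg.solve(L, Y)               # (N+3, 2): [w | a]
    w, a = params[:n], params[n:]

    q2 = ((query[:, None, :] - src_ctrl[None, :, :]) ** 2).sum(-1)
    Pq = np.hstack([np.ones((query.shape[0], 1)), query])
    return U(q2) @ w + Pq @ a                    # warped query points, (M, 2)
```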
Design of the Thin Plate Spline Control Point Prediction Module
In the TPS control point prediction module, a series of convolutional layers with different kernel configurations are employed to predict progressively denser grids of control points, configured as 10 × 10, 10 × 10, 12 × 12, and 14 × 14 across the four prediction heads. The control points predicted by the previous head are upsampled and integrated into the next head’s prediction, and the resulting points are arranged to form a grid structure.
Subsequently, a TPS transformation is applied to warp this predicted grid, aligning it with a reference grid defined on the ground-truth image. The architecture of the TPS control point prediction module is illustrated in
Figure 5.
In implementation, considering that cascading fully connected layers incurs substantial computational and storage costs, we employ one or two convolutional layers after each TPS transformation head to predict the control points. The control point computation is defined by Equation (2).
Here, T_t denotes the TPS transformation at layer t, while F_{t−1} and P_{t−1} represent the feature map and control points from the (t−1)-th head, respectively. The operator W(·) denotes the warping operation applied to the feature map based on the predicted control points, and U(·) refers to a custom upsampling layer for control points.
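The cascade described by Equation (2) can be sketched as follows: each head upsamples the previous control-point grid, pools the encoder features to the grid resolution, and predicts a residual refinement. Mask handling and the exact head design are omitted; module and parameter names are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPSControlPointHeads(nn.Module):
    """Illustrative cascade of control-point heads (10x10 -> 10x10 -> 12x12 -> 14x14),
    each refining the upsampled points of the previous head."""
    def __init__(self, in_ch, grids=((10, 10), (10, 10), (12, 12), (14, 14))):
        super().__init__()
        self.grids = grids
        self.heads = nn.ModuleList(
            [nn.Conv2d(in_ch + 2, 2, kernel_size=3, padding=1) for _ in grids]
        )

    def forward(self, feat):                       # feat: (B, C, H, W)
        B = feat.size(0)
        pts = torch.zeros(B, 2, *self.grids[0], device=feat.device)
        outputs = []
        for head, grid in zip(self.heads, self.grids):
            pts = F.interpolate(pts, size=grid, mode="bilinear", align_corners=True)
            f = F.adaptive_avg_pool2d(feat, grid)          # features at grid resolution
            pts = pts + head(torch.cat([f, pts], dim=1))   # residual refinement
            outputs.append(pts)                            # (B, 2, gh, gw) offsets
        return outputs
```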
After generating the control point map, it is passed to the decoder, which transforms the feature vectors associated with the TPS control points into the target output sequence. This produces the motion estimation flow, which is then used for loss computation. The output control point map generated by the decoder is illustrated in
Figure 6.
3.4. Optical Flow Estimation Module
The introduction of optical flow fields in this study is inspired by prior work in dynamic scene analysis, where inter-frame optical flow is used to predict the contours and positions of moving objects. In the context of wide-angle image rectification, the positional changes of feature points during the deformation process resemble the motion of dynamic targets. During network training, there exists a progressive transformation process in which normal images are increasingly warped into distorted wide-angle images, and optical flow fields can be computed between these stages.
In this study, we first compute the optical flow between source and target images in the dataset and compare it with the intermediate optical flow fields generated during training. These optical flow fields are incorporated into the loss function to guide the rectification process. The vector directions encoded in the optical flow are used as constraints on image deformation, which not only accelerate network convergence but also enhance the final rectification performance.
3.4.1. Design of the Optical Flow Estimation Module
The design of the optical flow field model in this study is inspired by the classical RAFT architecture [
29] but differs significantly in how image features are handled. Unlike RAFT, where features are extracted independently, the image features in our method are derived directly from the encoder output.
These encoded features are passed into a feature correlation layer, which computes a four-dimensional (4D) correlation volume of size H × W × H × W, representing the similarity between all pairs of pixels in the source and target features. This 4D correlation volume is constructed by computing the dot product between every pair of feature vectors.
To improve efficiency and enable scale-aware matching, multi-scale pooling is applied along the last two dimensions of the 4D volume to construct a hierarchy of multi-scale correlation volumes. The 4D volume C can be efficiently computed as a single matrix multiplication, as defined in Equation (3):

C_{ijkl} = Σ_c F1_{ijc} · F2_{klc}.  (3)
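A RAFT-style all-pairs correlation volume of this form can be computed with a single batched matrix multiplication, as sketched below; the 1/√C scaling follows common practice and is an assumption here.

```python
import torch

def correlation_volume(f1, f2):
    """All-pairs 4D correlation volume from two feature maps of shape (B, C, H, W)."""
    B, C, H, W = f1.shape
    f1 = f1.reshape(B, C, H * W)                      # (B, C, HW)
    f2 = f2.reshape(B, C, H * W)
    corr = torch.einsum("bci,bcj->bij", f1, f2)       # dot product for every pixel pair
    return corr.view(B, H, W, H, W) / C ** 0.5        # 4D volume, scaled by sqrt(C)
```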
The constructed 4D correlation volume is fed into a recurrent GRU-based update module, which iteratively generates the final optical flow field. The flow estimation process is initialized with a zero-flow field, f_0 = 0, and a series of flow predictions is progressively refined through iteration. At each iteration step, the update operator computes a flow increment Δf, which is added to the previous estimate to obtain the updated flow field: f_{k+1} = f_k + Δf.
The inputs to the update operator include the current optical flow estimate, the 4D correlation volume, and a latent hidden state. Its output consists of the updated flow increment and a new hidden state. The design of this structure is inspired by the iterative optimization processes commonly found in traditional numerical solvers, aiming to emulate convergence behavior. To this end, the update operator employs parameter sharing (i.e., tied weights) and bounded activation functions to enhance convergence stability during training, ensuring that the sequence of predicted flows satisfies the convergence condition f_k → f*.
The core architecture of the update operator is based on a Gated Recurrent Unit (GRU), where the fully connected layers are replaced with convolutional layers to enhance local spatial awareness. This modification enables more effective processing of spatial features. The specific computation process is defined in Equation (4):

z_k = σ(Conv([h_{k−1}, x_k], W_z)),  r_k = σ(Conv([h_{k−1}, x_k], W_r)),
h̃_k = tanh(Conv([r_k ⊙ h_{k−1}, x_k], W_h)),  h_k = (1 − z_k) ⊙ h_{k−1} + z_k ⊙ h̃_k.  (4)

In this structure, z_k and r_k represent the update gate and reset gate of the GRU, respectively, both computed from the feature maps using two distinct convolutional kernels W_z and W_r; W_h parameterizes the candidate state, whose nonlinear activation is tanh, while the gates use sigmoid activations σ. The operator ⊙ denotes element-wise multiplication, and [·, ·] denotes channel-wise concatenation. The current hidden state h_k and the previous state h_{k−1} encode the temporal dynamics of the optical flow estimation process, and the input x_k is formed by concatenating the initial flow estimate, the 4D correlation information, and the context features.
To enlarge the receptive field without significantly increasing model complexity, this study replaces the traditional 3 × 3 convolution with two directionally separable GRUs: one utilizing a 1 × 5 convolutional kernel to capture horizontal context and the other using a 5 × 1 kernel to extract vertical contextual information.
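The separable update cell can be sketched as below, in the spirit of RAFT's ConvGRU: a horizontal pass with 1 × 5 kernels followed by a vertical pass with 5 × 1 kernels. Hidden and input dimensions, as well as the surrounding flow head, are omitted and would depend on the actual configuration.

```python
import torch
import torch.nn as nn

class SepConvGRU(nn.Module):
    """Convolutional GRU with separable 1x5 / 5x1 kernels, used as the
    iterative flow update operator (see Equation (4))."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        cat = hidden_dim + input_dim
        self.convz1 = nn.Conv2d(cat, hidden_dim, (1, 5), padding=(0, 2))
        self.convr1 = nn.Conv2d(cat, hidden_dim, (1, 5), padding=(0, 2))
        self.convq1 = nn.Conv2d(cat, hidden_dim, (1, 5), padding=(0, 2))
        self.convz2 = nn.Conv2d(cat, hidden_dim, (5, 1), padding=(2, 0))
        self.convr2 = nn.Conv2d(cat, hidden_dim, (5, 1), padding=(2, 0))
        self.convq2 = nn.Conv2d(cat, hidden_dim, (5, 1), padding=(2, 0))

    def _step(self, h, x, convz, convr, convq):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(convz(hx))                        # update gate
        r = torch.sigmoid(convr(hx))                        # reset gate
        q = torch.tanh(convq(torch.cat([r * h, x], dim=1))) # candidate state
        return (1 - z) * h + z * q

    def forward(self, h, x):
        h = self._step(h, x, self.convz1, self.convr1, self.convq1)  # horizontal pass
        h = self._step(h, x, self.convz2, self.convr2, self.convq2)  # vertical pass
        return h
```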
The hidden state output from the GRU module is subsequently passed through two convolutional layers to produce the optical flow update Δf. To reduce computational cost, the resolution of the predicted flow is set to 1/8 of the input image size. During both training and evaluation, the predicted optical flow is upsampled to match the resolution of the ground-truth flow, enabling accurate supervision and performance assessment. The overall structure of the optical flow estimation module is illustrated in
Figure 7.
3.4.2. Vector Direction of the Optical Flow Field
After the optical flow field is obtained through iterative updates based on multi-scale image features, the vector direction of the flow field is computed. During network training, this directional information is used to partially update the correction parameters in the TPS prediction module. The optical flow vectors are further combined with the motion estimation flow to perform distortion rectification, and their consistency with the ground-truth image is enforced via loss computation. The full training procedure is detailed in
Section 3.5.
The output rectified maps from each convolutional stage of the TPS prediction module, together with the target image, are used to generate the optical flow field visualization shown in
Figure 8,
Figure 9 and
Figure 10.
3.5. Loss Function
The wide-angle image correction network is designed to train a control point prediction structure based on TPS, which encodes the positions of TPS control points and their corresponding feature point transformations through a multi-scale feature extraction mechanism. Under the guidance of the vector direction of the optical flow field, the network quantifies the degree of feature map distortion and minimizes it during training.
In this study, four loss functions are jointly incorporated to optimize the training process: a pixel-wise reconstruction loss (L_pixel), a perceptual loss (L_percep), a grid regularization loss (L_grid), and an optical flow constraint loss (L_flow). Among them, the reconstruction and perceptual losses supervise the correction process at both the pixel and semantic levels, guiding the network to maintain structural consistency while improving perceptual quality.
The reconstruction loss aims to penalize discrepancies between the output image and the original reference image, ensuring that the generated image remains as close as possible to the target in pixel space. This loss is typically measured using Mean Squared Error (MSE) or Mean Absolute Error (MAE). In this study, MSE is adopted as the primary metric for reconstruction loss, defined as Equation (5):

L_pixel = (1/N) Σ_{i=1}^{N} (I_i^gt − I_i^out)²,  (5)

where N denotes the total number of pixels, I_i^gt represents the pixel value of the ground-truth image, and I_i^out is the corresponding pixel value in the generated image.
The perceptual loss provides supervision at the semantic level. In this study, a VGG-16 network pretrained on ImageNet is adopted as the feature extractor. Intermediate feature representations of both the rectified output and the target image are extracted from the ReLU3_3 layer, and their Euclidean distance is computed as the perceptual loss term. This formulation enhances the quality of structural restoration and semantic consistency. The loss is defined as Equation (6):

L_percep = ‖ φ(I^out) − φ(I^gt) ‖₂,  (6)

where φ(·) denotes the intermediate features extracted from the VGG-16 network.
The perceptual loss is typically computed by passing both the input (rectified) image and the target image through a pretrained neural network to obtain their intermediate feature representations. These feature maps are then used as inputs to the loss function, where the distance between them is calculated using either the Euclidean distance (L2 norm) or the Manhattan distance (L1 norm), thereby promoting semantic alignment between the predicted output and the ground truth.
The grid loss constrains the alignment between adjacent deformed grid edges. In Equation (7), N_e denotes the number of adjacent-edge tuples in the deformed grid G:

L_grid = (1/N_e) Σ_{(e_i, e_j)} (1 − cos(e_i, e_j)),  (7)

where the vectors e_i and e_j represent adjacent edges of the consecutively deformed mesh grids in the feature map. By maximizing the cosine similarity between adjacent edges, the corresponding edge pairs tend to become collinear; as a result, the loss reaches its minimum, ensuring structural consistency in the corrected image.
The optical flow distance loss L_flow is calculated using Equation (8) as the L1 distance between the predicted and ground-truth optical flows, with an exponentially increasing weighting scheme applied across the iterative flow predictions. Given the ground-truth motion flow field f_gt and the sequence of predictions {f_1, …, f_N}, the loss is defined as follows:

L_flow = Σ_{i=1}^{N} γ^{N−i} ‖ f_gt − f_i ‖₁,  (8)

where γ < 1 controls the exponential weighting.
To enhance training flexibility and balance the contribution of each loss component to the overall optimization objective, this study introduces weight coefficients into the loss formulation, incorporating the four terms defined above: the reconstruction loss (L_pixel), perceptual loss (L_percep), grid regularization loss (L_grid), and optical flow-guided loss (L_flow). The weighted total loss function is constructed as shown in Equation (9), where λ1 to λ4, respectively, control the relative importance of each loss component:

L_total = λ1·L_pixel + λ2·L_percep + λ3·L_grid + λ4·L_flow.  (9)
During the initial training phase, we explored various combinations of loss weights. The results indicated that moderately increasing the weight of the optical flow-guided loss (λ4) helped enhance deformation guidance and improve image reconstruction quality. However, assigning excessive weights to the perceptual loss or grid regularization loss tended to cause training instability or gradient oscillations.
To achieve a balance among training stability, structural accuracy, and generalization capability, we ultimately adopted an equal-weight configuration: λ1 = λ2 = λ3 = λ4 = 1. This setting demonstrated consistently good convergence and robustness across multiple validation experiments. Compared with certain weighted configurations that led to unstable training dynamics, the equal-weight strategy exhibited greater stability and controllability in practical deployment, making it a suitable and engineering-friendly choice.
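The sketch below shows one way the weighted objective of Equation (9) could be assembled in PyTorch. It is a simplified illustration: VGG inputs are not ImageNet-normalized, the perceptual term uses a squared-error form, the grid term is taken as a precomputed scalar, and the flow term uses a single prediction rather than the full iterative weighting of Equation (8). Class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RectificationLoss(nn.Module):
    """Weighted sum of the four losses in Equation (9), equal weights by default."""
    def __init__(self, l1=1.0, l2=1.0, l3=1.0, l4=1.0):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.vgg = vgg.features[:16].eval()      # up to relu3_3
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.w = (l1, l2, l3, l4)

    def forward(self, out, gt, grid_loss, flow_pred, flow_gt):
        l_pix = torch.mean((out - gt) ** 2)                       # Eq. (5), MSE
        l_per = torch.mean((self.vgg(out) - self.vgg(gt)) ** 2)   # Eq. (6), simplified
        l_flow = torch.mean(torch.abs(flow_pred - flow_gt))       # Eq. (8), single-scale
        l1, l2, l3, l4 = self.w
        return l1 * l_pix + l2 * l_per + l3 * grid_loss + l4 * l_flow
```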
3.6. Accelerated Deployment of the Wide-Angle Image Correction Network
To improve the runtime efficiency of wide-angle image rectification tasks, this study adopts a lightweight Swin-T (Tiny) configuration as the backbone encoder, incorporating multi-scale feature extraction and a shifted window attention mechanism. This design significantly reduces model complexity and computational overhead while maintaining correction accuracy, thereby improving inference speed and promoting structural compactness.
Although the proposed structure achieves satisfactory performance on server-side platforms, its deployment on edge devices such as the Jetson TX2 NX remains constrained by limited computational resources. Specifically, the model exhibits an average inference latency of approximately 150 ms and high GPU occupancy, which does not satisfy the real-time requirement of 15 FPS. To address this limitation, we propose a pruning-based optimization strategy tailored for embedded deployment scenarios, aiming to further reduce the model size and enhance inference efficiency.
More specifically, structural pruning is applied to both the shifted window attention module within the Swin-T encoder and the TPS control point prediction module. The pruning procedure involves the following steps: (1) introducing L1 regularization during retraining to promote sparse weight distributions, thereby identifying redundant channels and attention heads; (2) ranking the importance of hidden channels and attention heads based on their L1 norm values or output contributions; and (3) removing low-contribution structural units to reduce model complexity while preserving inference capability. The overall pruning framework is illustrated in
Figure 11. The performance of the proposed strategy on embedded platforms is further detailed in
Section 4.4.2.
Specifically, to optimize the sliding window attention modules within the encoder, this study introduces an L1 regularization term into the objective function, which promotes sparse weight learning during training. Given an input image resolution of 320 × 320, the original encoder produces a 100 × 100 × 24 feature map. Through pruning, the number of output channels is reduced to 15–20, significantly decreasing computational redundancy.
For the TPS control point prediction module, a flow-guided structural constraint is applied by incorporating the loss term from the optical flow estimation module into the overall objective function. Based on a 320 × 320 × 14 control point map, the pruning process computes the L1 distance between the predicted TPS gradients and the optical flow vector directions, which enables the identification and removal of low-response channels. As a result, the number of output channels is reduced to approximately seven.
The joint pruning optimization of these two modules results in an approximate 40% reduction in total network parameters, thereby substantially enhancing runtime efficiency on embedded platforms.
Regarding weight importance evaluation, channel pruning is implemented by computing the L1 norm across the columns (or rows) of the convolutional weight matrices, whereas attention head pruning is achieved by aggregating the L1 norms of the query, key, and value (Q/K/V) projection matrices within each attention head, as defined in Equations (10) and (11):

s_c = Σ_i | W_{c,i} |,  (10)

where c denotes the index of the output channel, W is the (reshaped) convolutional weight matrix of shape C_out × (C_in·k·k), and the L1 norm is computed along the output channel dimension;

s_h = ‖ W_h^Q ‖₁ + ‖ W_h^K ‖₁ + ‖ W_h^V ‖₁,  (11)

where h denotes the index of the attention head and W_h^Q, W_h^K, and W_h^V are the projection sub-matrices associated with the h-th head.
L1 regularization promotes sparsity by adding the sum of the absolute values of the weight parameters to the loss function, encouraging some weights to approach zero. The formulation is given in Equation (12):

L = L_task + λ_reg · Σ_i | w_i |,  (12)

where L_task is the original training objective, w_i are the network weights, and λ_reg controls the strength of the sparsity penalty.
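A minimal sketch of the importance scoring and selection step is given below, assuming L1-based ranking as in Equations (10) and (11); the function names, head reshaping, and top-k selection are illustrative assumptions rather than the deployed pruning code.

```python
import torch

def channel_importance(conv_weight):
    """Per-output-channel L1 score for a conv weight of shape (C_out, C_in, kH, kW);
    low-scoring channels are candidates for removal (cf. Eq. (10))."""
    return conv_weight.abs().flatten(1).sum(dim=1)            # (C_out,)

def head_importance(q_w, k_w, v_w, num_heads):
    """Per-head L1 score aggregated over the Q/K/V projection matrices,
    each of shape (embed_dim, embed_dim) (cf. Eq. (11))."""
    scores = []
    for w in (q_w, k_w, v_w):
        d = w.shape[0] // num_heads
        scores.append(w.abs().reshape(num_heads, d, -1).sum(dim=(1, 2)))
    return sum(scores)                                        # (num_heads,)

def keep_mask(scores, keep_ratio=0.75):
    """Boolean mask keeping the top keep_ratio fraction of units."""
    k = max(1, int(round(keep_ratio * scores.numel())))
    idx = torch.topk(scores, k).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[idx] = True
    return mask
```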
3.7. Hardware and Software Environment
The system is deployed on the Jetson TX2 NX edge computing platform, which features an embedded GPU with 256 CUDA cores (1.33 TFLOPS), supports up to five CSI camera inputs, and is equipped with 4 GB of LPDDR4 memory and a dedicated video codec unit. This hardware configuration makes it well-suited for real-time image correction and stitching tasks. The software environment is built on Ubuntu 18.04 and JetPack 4.5.1, ensuring driver compatibility and system stability.
Two Hikvision B14HV3-LT4MM network cameras (Hikvision, Hangzhou, China), each with a resolution of 2560 × 1440 and a frame rate of 50 Hz, are connected to the system. These cameras support TCP/IP and RTSP protocols and are controlled via the official API. A gigabit router and wired Ethernet connections are used for network communication, enabling high-throughput and concurrent video data transmission (see
Figure 12).
The graphical user interface (GUI) is developed using the Qt framework and supports key functionalities such as dual-camera image acquisition, real-time display, image registration, and stitching output. The operational interface is shown in
Figure 13. To improve runtime efficiency and ensure stability on the edge platform, a multithreaded architecture is adopted to alleviate resource contention among functional modules.
The image stitching and DeepStream-based streaming (pull/push) functionalities are configured to run on the main thread, which is prioritized for access to system computing resources. Auxiliary modules, such as image preview and button event responses, are executed in separate subthreads. These threads share key intermediate variables (e.g., GPU memory addresses) to avoid redundant computation and reduce communication overhead, thereby ensuring system stability and responsiveness during runtime.
Upon clicking the “Image Calibration Preprocessing” button, the system initiates a series of initialization tasks based on real-time image input from the cameras. These tasks include homography matrix estimation, stitching seam planning, and pixel coordinate mapping table generation. The results are cached in GPU memory (see partial illustration in
Figure 14) for subsequent real-time image stitching.
When the “Start Stitching and Streaming” button is triggered, the main processing pipeline begins. Dual-channel image data are fed into the GPU for stitching, and the output is streamed via a sink component to the designated network address (e.g., rtmp://192.168.1.69:93/lsw), which can be accessed in real time through clients such as VLC.
It is worth noting that the current system is designed for automated processing, without user-adjustable parameters, to simplify the operation workflow and maintain stability across specific task scenarios. Historical data storage and access functionality are not yet integrated, but future versions will consider adding result caching and retrieval modules. During embedded deployment, the system leverages model pruning and DeepStream stream optimization strategies to significantly reduce computational redundancy and communication latency. In practical testing, the system demonstrated smooth performance, with no noticeable UI lag or streaming delay, thus meeting the basic real-time requirements of embedded platforms.
4. Experimental Setup and Result Analysis
4.1. Experimental Configuration
The proposed model was trained using the publicly available Places2 dataset [
44], which contains over 10 million images spanning more than 500 scene categories. Each category includes approximately 6000 to 40,000 images with a resolution of 256 × 256 pixels, providing high representativeness for real-world scenarios. Based on the distortion simulation method described in
Section 4.2, the original images were processed to generate synthetic wide-angle images for training purposes.
From the processed dataset, 6000 images were randomly selected for training, 600 for validation, and 250 for testing. To enhance sample diversity, various deformation parameters and randomly selected distortion centers were applied to create a wide range of distortion patterns. For generalization evaluation, the MS-COCO dataset and real-world wide-angle surveillance images were further employed during the testing phase to assess model performance under complex scene conditions.
All experiments were conducted on a server running the Ubuntu 20.04 operating system. The hardware configuration consisted of an Intel Core i9-10900K CPU (Intel, Santa Clara, CA, USA) and an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA), providing substantial parallel computing power and memory resources suitable for both training and inference tasks.
4.2. Image Rectification Data Preparation
Designing customized data generation methods tailored to specific application scenarios has been widely recognized as an effective strategy for addressing real-world challenges in visual tasks. Building upon this idea, given the difficulty in acquiring large-scale, high-quality wide-angle images paired with their rectified ground-truth counterparts, this study adopts a synthetic data generation approach for training the wide-angle image rectification network. Specifically, existing distortion correction techniques are leveraged to construct a synthetic dataset [
6,
45,
46,
47].
Specifically, original image samples are first selected from the MS-COCO dataset [
48], and radial distortion in wide-angle images is simulated using a fourth-order polynomial model. Experimental results demonstrate that this model can effectively approximate most common projection models and offers high fitting accuracy.
For the distortion coefficients k1–k4, the value ranges are determined based on empirical settings from prior studies [45,49]. The four coefficients are randomly sampled from these ranges to construct a diverse set of radial distortion models, ensuring that the synthesized distortions closely approximate those found in real-world wide-angle imagery.
In Equation (13), (x_u, y_u) and (x_d, y_d) denote the coordinates of the undistorted and distorted image points, respectively, and r represents the Euclidean distance from the distorted point to the distortion center. The parameters k1–k4 are the coefficients of the polynomial distortion model, which control the intensity and shape of the radial distortion.
The polynomial distortion model is one of the most commonly used approaches for correcting fisheye camera images. Based on mathematical modeling, this method fits the radial distortion present in fisheye imagery using a polynomial function, enabling pixel-level geometric correction. The correction process typically involves two steps: first, converting image coordinates to a normalized coordinate system; then, applying the polynomial function to correct the radial distortion and restore a projection closer to the real-world geometry.
Due to its simplicity and effective fitting performance in handling fisheye distortion, this model has been widely applied in industrial vision and wide-angle image reconstruction tasks. Although more complex distortion models exist, the field has largely standardized around two primary types of polynomial distortion formulations.
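As an illustration of the synthesis procedure described above, the sketch below warps an undistorted image into a fisheye-like one with a polynomial radial model and a remapping step. The specific odd-power form and the default coefficients are illustrative assumptions, not the exact model or coefficient ranges of [45,49].

```python
import numpy as np
import cv2

def synthesize_radial_distortion(img, k=(1.0, 0.3, 0.1, 0.05)):
    """Warp an undistorted image into a synthetic wide-angle-like image using a
    polynomial radial model r_u = k1*r_d + k2*r_d^3 + k3*r_d^5 + k4*r_d^7
    (illustrative form; coefficients here are placeholders)."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    xd, yd = (xs - cx) / cx, (ys - cy) / cy                # normalized distorted coords
    r_d = np.sqrt(xd ** 2 + yd ** 2) + 1e-8
    r_u = k[0] * r_d + k[1] * r_d**3 + k[2] * r_d**5 + k[3] * r_d**7
    scale = r_u / r_d                                       # radial stretch factor
    map_x = (xd * scale * cx + cx).astype(np.float32)       # source x for each output pixel
    map_y = (yd * scale * cy + cy).astype(np.float32)       # source y for each output pixel
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```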
4.3. Training Details
During the training process of the wide-angle image correction network, the Adam optimizer was employed to update the network parameters. The exponential decay rates for the first and second moment estimates were set to 0.9 and 0.999, respectively, to ensure stable and convergent gradient estimation.
In the initial training phase (first 15 epochs), only the TPS control point prediction module was activated and supervised via a key point-based task classifier to enhance its capability in modeling geometric deformation. Following this stage, all modules were jointly optimized in an end-to-end manner during the unified training phase.
During inference, the proposed method supports distortion correction for arbitrary-resolution images by applying scale-adaptive transformations to the predicted TPS control points and motion residual flow. This design enhances the model’s flexibility and adaptability in practical deployment scenarios.
For learning rate scheduling, a staged strategy was adopted. A linear warm-up policy was applied over the first three epochs to gradually increase the learning rate, followed by a cosine annealing schedule that smoothly decayed the learning rate from 1 × 10−4 to 1 × 10−6 over the remaining training epochs. The batch size was set to 64, and the input image resolution was fixed at 256 × 256 pixels.
Additionally, several data augmentation techniques—including random cropping, horizontal flipping, and brightness perturbation—were applied to improve the model’s robustness and generalization under varying distortion types and image distributions.
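The optimizer and schedule described above can be sketched as follows, assuming the warm-up and cosine decay are stepped once per training iteration; the helper name is illustrative.

```python
import math
import torch

def build_optimizer_and_scheduler(model, epochs, steps_per_epoch,
                                  lr=1e-4, lr_min=1e-6, warmup_epochs=3):
    """Adam with betas (0.9, 0.999), 3-epoch linear warm-up, then cosine
    annealing from 1e-4 down to 1e-6 (sketch of the schedule in Section 4.3)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warm-up
            return (step + 1) / warmup_steps
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cos = 0.5 * (1 + math.cos(math.pi * t))       # cosine decay to lr_min
        return (lr_min + (lr - lr_min) * cos) / lr

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```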
4.4. Ablation Study
4.4.1. Ablation Study on Multi-Scale Feature Extraction and Optical Flow Module
In this study, a wide-angle image dataset was constructed by applying radial distortion transformations to the Places2 dataset and was used as the experimental sample for both model training and performance evaluation.
Experiment 2 aims to compare the network’s performance before and after introducing key modules, with a particular focus on the trend of total loss during training.
Figure 15 and
Figure 16 present the loss curve comparisons between the two configurations. It is evident that the incorporation of the multi-scale feature extraction module and the optical flow field module leads to a significant reduction in total loss. At epoch 50, the optimized model achieved a reduction of 0.3019 in total loss compared to the baseline model (denoted as BASE), indicating improved convergence efficiency and enhanced generalization capability.
This section further verifies the effectiveness of integrating the multi-scale feature extraction module and the optical flow field module into the base Transformer architecture. Using the original Transformer structure as the baseline (denoted as BASE), four comparative configurations were designed, and comparative experiments were conducted on the Places2 dataset, MS-COCO, and real-world wide-angle surveillance images. The performance differences of these models in correcting distorted wide-angle images are summarized in Table 1.
Introducing the multi-scale feature extraction module alone resulted in a PSNR improvement from 22.21 dB to 25.36 dB (+3.15 dB), indicating that enhancing channel dimensionality and hierarchical depth significantly improves the model’s ability to represent fine image details. The inclusion of the optical flow module further increased the PSNR to 26.97 dB, demonstrating its strong guidance effect in modeling the directional structure of image distortions. When both modules were jointly incorporated, the model achieved the best performance, with a PSNR of 27.43 dB and SSIM of 0.8190, representing improvements of +5.22 dB and +0.0808 over the BASE model, respectively.
In addition, we evaluated the computational cost and inference efficiency of each model variant. As shown in the table, integrating both modules increased the parameter count from 26.85 M to 35.93 M and the FLOPs from 2.15T to 2.45T. Nonetheless, the average inference time was maintained at 9.5 ms, which satisfies real-time processing requirements (>10 FPS) on the Jetson TX2 NX platform. These results demonstrate that the proposed model achieves a favorable balance between performance gains and lightweight deployment suitability.
4.4.2. Ablation Study on Network Pruning for Wide-Angle Image Correction Deployment
This experiment compares the differences in model size, image correction quality, and inference speed before and after applying pruning-based optimization strategies to the wide-angle image correction network when deployed on an edge computing platform. The test results are shown in Table 2. The baseline model (denoted as BASE) refers to the original network trained on the server. The parameter ρ represents the channel retention ratio during pruning; a higher ρ indicates that more channels are preserved. All experiments are conducted on the Jetson TX2 NX platform with an input image resolution of 1024 × 768.
The experimental results in
Table 2 indicate that although the correction network operates in real time on a server, its deployment on an edge computing platform results in an inference time of 104 ms per image frame. When integrated with the image stitching system, this latency fails to meet the real-time requirement of 15 FPS, necessitating pruning optimization of the original BASE network.
At a retention ratio of ρ = 55%, the inference time is reduced to 31 ms; however, the correction quality deteriorates significantly, leading to poor visual outcomes and limiting its applicability in downstream registration tasks. In contrast, the configuration with ρ = 75% offers a better balance between accuracy and efficiency. While the PSNR drops by only 4.01 dB compared to the BASE model, the inference time is reduced by 68 ms, and it is also 23 ms faster than the ρ = 85% configuration. Based on this trade-off, we adopt ρ = 75% as the final retention ratio, achieving an optimal balance between performance and runtime efficiency for deployment on the embedded platform.
4.5. Comparative Analysis
- (1)
Evaluation of Convolution Kernel Configurations in the TPS Prediction Module
This experiment analyzes the impact of different convolutional kernel configurations, corresponding to varying numbers of TPS control points, on the rectification performance of the network. Specifically, we examine how the number of control points predicted at each layer of the TPS prediction module affects the output quality. Quantitative metrics, including PSNR and SSIM, are used for evaluation. The experimental results are summarized in
Table 3.
To investigate the impact of TPS control point density on correction performance, this study adjusts the output dimensions of the convolutional layers in the TPS module to generate different numbers of control points (e.g., 10 × 10, 12 × 12, 16 × 16) and conducts systematic evaluations on a standard distortion dataset. The results are summarized in
Table 3. The experiments reveal that the relationship between the number of control points and correction accuracy is not linearly positive. Specifically, when all four TPS modules output 16 × 16 control points, the model, despite its theoretically stronger spatial fitting capacity, shows a notable performance drop in both PSNR and SSIM (16.34 dB, 0.5234), falling behind the medium-scale 12 × 12 configuration and even underperforming the initial 10 × 10 setting.
This phenomenon can be attributed to the following factors:
- (1)
Overfitting to local distortions: Excessive control points tend to overfit localized deformation regions, which reduces global correction smoothness and causes geometric inconsistencies across the image.
- (2)
Increased model complexity: A higher number of control points significantly increases the output dimensionality of the network, leading to greater computational overhead and potential training instability or gradient degradation—especially detrimental for embedded deployment.
- (3)
Enhanced noise sensitivity: In regions with blurred edges or weak textures, dense control points often lack sufficient semantic support, resulting in erroneous displacements and degraded correction accuracy.
In comparative experiments, certain existing methods (e.g., RecRecNet) exhibit significant performance degradation when applied to irregularly distorted images, with substantial drops in PSNR and SSIM. This is largely due to their limited geometric adaptability and lack of structural guidance. In contrast, the optical flow-guided vector constraint mechanism proposed in this study allows dynamic adjustment of TPS control point trajectories, introducing direction awareness into control point prediction. This enhances the model’s robustness and geometric stability in the presence of complex distortions.
Considering correction accuracy, model robustness, and computational efficiency, we adopt a hierarchical control point configuration (10 × 10, 10 × 10, 12 × 12, 14 × 14) as the final model setting. Under this configuration, the model achieves optimal performance, with the PSNR and SSIM reaching 19.97 dB and 0.6151, respectively. These findings highlight that properly regulating TPS control point density—along with direction-aware guidance—is a key strategy for improving wide-angle image correction quality and deployment efficiency.
- (2)
Comparison of Different Algorithms on Public Datasets
This experiment compares several mainstream algorithms for wide-angle image rectification and distortion correction. PSNR and SSIM are used as the primary evaluation metrics for quantitative analysis, and inference time is also measured for selected methods. The detailed results are presented in
Table 4.
The test dataset was synthesized using images from Places2 and MS-COCO, incorporating various types of distortions, including warping operations and radial distortion modeling approaches based on [
6,
45]. The final evaluation set covers irregular distortions, wide-angle distortions, and fisheye distortions, with all images resized to a resolution of 256 × 256.
Experimental results show that the proposed method outperforms several state-of-the-art baselines in both PSNR and SSIM, demonstrating superior distortion correction capability and image quality restoration across diverse distortion types. Additionally, it achieves faster inference speeds, ranking among the most efficient in the benchmark. In the wide-angle correction task, the proposed method improves the PSNR by 3.28 dB compared to MOWA. In the fisheye image correction task, it achieves a 2.18 dB gain in PSNR, while maintaining leading inference efficiency.
It is worth noting that methods such as Crop, ROP, and Padding rely on manual parameter adjustments and cannot be executed in an automated inference pipeline. Therefore, inference time is marked as “–” in the table to indicate exclusion from speed comparison.
In the comparative experiments, the proposed method was evaluated using two widely adopted objective metrics—PSNR and SSIM—for quantitatively assessing the quality of rectified images. Additionally, a qualitative comparison based on subjective visual perception was conducted against existing state-of-the-art algorithms. As illustrated in
Figure 19, the proposed method exhibits superior overall performance in correcting wide-angle distortions, with particularly pronounced improvements in regions affected by severe deformation.
Moreover, test results on high-resolution and complex surveillance scenarios indicate that the proposed method exhibits stronger robustness and adaptability in restoring structural consistency and edge continuity compared to other baseline approaches. These advantages significantly enhance both the visual quality of the corrected images and their practical usability in real-world applications.
Experimental results demonstrate that the proposed multi-scale feature modeling mechanism, TPS control point prediction strategy, and optical flow-guided constraint collectively exhibit significant advantages in the geometric correction of high-resolution distorted images. Compared with existing methods, the proposed model achieves higher PSNR and SSIM scores on standard benchmark datasets, indicating superior image restoration quality.
In terms of processing efficiency, the proposed algorithm delivers faster inference speed while maintaining excellent correction performance. For instance, on distorted images with a resolution of 640 × 640 pixels, the model achieves a substantially lower average inference time compared to representative methods such as RecRecNet and MOWA, thereby fully meeting the real-time requirements of practical applications.
4.6. System Testing
4.6.1. Real-Time Dual-Camera Image Stitching Implementation
The real-time image stitching system was deployed on the Jetson TX2 NX edge computing platform, utilizing the DeepStream framework to capture dual-camera video streams. The system acquires the initial frame from each camera and performs pre-alignment, during which the computed homography matrices and optimal seam lines are stored in GPU memory. Upon completion of the initialization process, the real-time stitching module is activated, and the stitched output is streamed to a designated URL.
Without the integration of an object detection network, the system achieves a frame rate of 36 FPS, with six-core CPU utilization exceeding 50%, GPU usage at 19%, and memory consumption of approximately 250 MB. After incorporating the object detection network, the frame rate decreases to 27 FPS, CPU utilization increases to over 65%, GPU usage rises to an average of 68.7%, and memory usage reaches approximately 678 MB. Experimental results are illustrated in
Figure 20 and
Figure 21.
4.6.2. Wide-Angle Image Correction Implementation
In this study, the trained wide-angle image correction network was deployed on the Jetson TX2 NX edge computing platform to support model inference and perform distortion correction. During system execution, the input image to be corrected is fed into the deployed network, and the output is the corresponding geometrically corrected image. The correction results are illustrated in
Figure 22 and
Figure 23.
4.6.3. Wide-Angle Video Stitching Implementation
The real-time wide-angle image stitching system was deployed on the Jetson TX2 NX edge computing platform. Utilizing the DeepStream framework, dual wide-angle video streams were captured from two cameras. The first frame from each stream was first processed by the deployed correction network for distortion rectification. The precomputed homography matrices and optimal stitching seams (i.e., image registration and alignment parameters) were then stored in GPU memory to initiate the stitching process. The final stitched video output was streamed to a designated URL, as illustrated in
Figure 24 and
Figure 25.
In the tested configuration—featuring two wide-angle cameras each capturing images at a resolution of 1024 × 768 and producing a stitched output of 1400 × 1400 pixels—the system achieved a stable frame rate of 15 FPS on the Jetson TX2 NX platform. The correction network performed inference at 56 ms per frame, while the stitching operation required an additional 15 ms. The six-core CPU exhibited an average utilization exceeding 90%, and the GPU maintained an average utilization of 75%, with memory usage around 2031 MB.
Experimental results demonstrate that the proposed system is capable of delivering high-quality, low-latency wide-angle image stitching even under constrained embedded computing resources, effectively meeting real-time application requirements.
5. Conclusions
This study addresses the challenges of nonlinear distortion correction and real-time stitching in high-resolution wide-angle images for visual perception tasks. A lightweight image correction and system integration framework tailored for edge computing scenarios is proposed, introducing systematic innovations in both algorithmic design and engineering deployment efficiency.
At the algorithmic level, a direction-guided dynamic geometric constraint mechanism is introduced by incorporating optical flow vector directions into the TPS control point prediction path. Unlike conventional static or rule-based TPS configurations, the proposed method enables end-to-end dynamic adjustment of control points based on explicitly observed local deformation directions in the image, significantly improving the model’s adaptability and correction accuracy under complex nonlinear distortions.
Additionally, a multi-scale feature extraction module is designed by embedding sliding window attention mechanisms across different network levels, enabling joint modeling of multi-resolution image features. This not only enhances the network’s semantic perception and fine-grained structural representation capabilities but also reduces overall computational cost through feature map partitioning and channel compression strategies. The vector direction outputs from the optical flow estimation module are further integrated into the training process as structural guidance signals, directing parameter updates toward actual deformation directions. This improves convergence efficiency and generalization ability, thereby enhancing the model’s robustness to severe geometric distortions.
To address the high computational complexity of Transformer-based architectures in embedded deployments, this study adopts a lightweight Swin-Tiny encoder structure. Combined with L1 regularization and attention head/channel pruning strategies, the model is compressed along both the channel redundancy and attention redundancy dimensions. This results in approximately a 40% reduction in parameter count and a 40% increase in inference speed, effectively mitigating deployment bottlenecks under resource-constrained conditions.
On the deployment side, the proposed system is implemented on the Jetson TX2 NX platform using the DeepStream framework and a multithreaded optimization strategy. A dual-camera real-time stitching system is constructed, achieving a processing speed of 15 FPS at an image resolution of 1400 × 1400. The average correction latency is 56 ms, and the stitching delay is 15 ms. The system operates stably and responds efficiently, fully validating the proposed method’s feasibility and engineering applicability in embedded environments.
Although this study has made notable progress in algorithm design and system implementation, there remains considerable room for further improvement. Future research may proceed along the following directions:
Model lightweighting and cross-platform deployment: For lower-power edge devices (e.g., Jetson Nano or AI camera SoCs), advanced compression techniques such as model pruning, quantization, and knowledge distillation can be further explored to enhance deployment flexibility and adaptability across heterogeneous hardware platforms. Although the proposed model has been successfully deployed on the Jetson TX2 NX platform with favorable inference performance, most baseline algorithms are structurally complex or lack lightweight implementations, making it challenging to conduct unified benchmarking on embedded platforms. As such, this study primarily relies on inference time and model size measured in a server environment. Future work will focus on reconstructing and adapting representative baseline models for embedded deployment to establish a more comprehensive benchmarking framework.
Although the proposed model has achieved initial optimization in inference speed and structural compactness through the integration of the sliding window attention mechanism and multi-scale feature extraction strategy, it is primarily tailored for typical radial distortion scenarios. Its adaptability to more complex distortion patterns—such as extreme wide-angle or non-uniform distortions—remains limited. Future work may explore the incorporation of physics-based modeling constraints or multimodal information fusion strategies to enhance the model’s generalization ability and robustness across diverse nonlinear distortion conditions.
System-level integration and multi-task expansion: Building upon the current image correction framework, future developments may incorporate modules for object detection, behavior recognition, and other perception tasks. This will facilitate the creation of a more comprehensive edge-intelligent system tailored for practical applications in smart surveillance, industrial vision, and other real-world scenarios, thereby enhancing the system’s versatility and environmental adaptability.
In summary, this study presents a feasible technical pathway for wide-angle image correction modeling, system integration, and embedded deployment, providing both theoretical support and a practical foundation for the engineering application of visual perception technologies in resource-constrained environments.