Review

Video Stabilization: A Comprehensive Survey from Classical Mechanics to Deep Learning Paradigms

1 Jiangsu Engineering Research Center of Digital Twinning Technology for Key Equipment in Petrochemical Process, Changzhou University, Changzhou 213000, China
2 College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China
* Author to whom correspondence should be addressed.
Modelling 2025, 6(2), 49; https://doi.org/10.3390/modelling6020049
Submission received: 7 May 2025 / Revised: 8 June 2025 / Accepted: 10 June 2025 / Published: 17 June 2025

Abstract

Video stabilization is a critical technology for enhancing video quality by eliminating or reducing image instability caused by camera shake, thereby improving the viewing experience. It has been deeply integrated into diverse applications, including handheld recording, UAV aerial photography, and vehicle-mounted surveillance. Propelled by advances in deep learning, data-driven stabilization methods have emerged as prominent solutions, demonstrating superior efficacy in handling jitter while achieving enhanced processing efficiency. This review systematically examines the field of video stabilization. First, it delineates the paradigm shift from classical to deep learning-based approaches. Subsequently, it elucidates conventional digital stabilization frameworks and their deep learning counterparts, and establishes standardized assessment metrics and benchmark datasets for comparative analysis. Finally, it addresses critical challenges such as robustness limitations in complex motion scenarios and latency constraints in real-time processing. By integrating interdisciplinary perspectives, this work provides scholars with academically rigorous and practically relevant insights to advance video stabilization research.

Graphical Abstract

1. Introduction

With the increasing prevalence of mobile recording devices such as digital cameras, smartphones, wearable gadgets, and unmanned aerial vehicles (UAVs), users can now capture high-resolution videos across diverse environments. However, unlike professional videographers who utilize specialized stabilizers to ensure video stability, amateur users often face challenges when recording with handheld devices or vehicle-mounted recorders. Due to the lack of professional filming skills and stabilization equipment, their videos frequently exhibit noticeable jitter and poor stability. Consequently, video stabilization has emerged as a critical research focus, aiming to eliminate or mitigate undesired motion artifacts to generate stable, high-quality video outputs. Figure 1 visually contrasts the differences between jittery and stabilized video sequences.
A video can be conceptualized as a temporal sequence of frames, i.e., static images captured at minimal time intervals and spatial proximity. Beyond the spatial information inherent in individual frames, videos encapsulate motion trajectories of foreground/background objects and dynamic variations in capturing devices (ego-motion). Interdisciplinary studies in psychology and neurophysiology [1] have extensively investigated how the human visual system infers motion velocity and direction. Empirical evidence demonstrates that viewers perceive and infer motion attributes with high precision during video observation, mirroring real-world perceptual mechanisms. Notably, while stationary objects may escape attention, motion-triggered attention mechanisms enable the rapid detection of dynamic changes. Critically, the visual system robustly discriminates between inherent object motion and artifacts induced by camera shake across most scenarios.
Despite the sophisticated processing capabilities of the human visual system, computationally addressing video jitter remains a formidable challenge. When capture devices operate under adverse external conditions, recorded videos often contain unintentional jitter that deviates from the photographer’s intent. The widespread availability of mobile devices capturing videos anytime and anywhere has exacerbated the pervasiveness of instability issues. Beyond user-induced shaking, jitter artifacts frequently degrade visual quality in videos captured by vehicle-mounted surveillance systems [2] and autonomous vehicle cameras [3], particularly under dynamic operational environments.
Beyond focus inaccuracies, texture distortions, and hardware-induced artifacts, video acquisition introduces another instability artifact termed in-capture distortion. Subjective studies have confirmed that video instability is perceptually salient to viewers, provoking significant visual discomfort; for instance, low-frequency vertical oscillations caused by walking during recording can distract viewers and hinder content comprehension. Consequently, unstable camera motion severely degrades user experience. Although systematic analyses remain scarce regarding how unstable camera dynamics specifically impair computer vision tasks, empirical evidence has established that excessive motion adversely impacts action recognition and object detection performance. This underscores the critical need for precise camera motion computation and correction. The primary objective of video stabilization is to compensate for undesired camera motion in jittery videos, thereby generating stabilized outputs with optimized perceptual quality.
Refs. [4,5] optimized segmentation boundaries through foreground/background separation in dynamic scenes, enhancing the robustness of stabilized frames and reducing misjudgments caused by occlusion; this provides a useful reference for stabilization in complex motion scenes.
Stabilized videos may retain intentional camera movement, provided that such motions follow smooth and controlled trajectories. Thus, the primary objective of video stabilization is not to suppress all camera dynamics, but rather to selectively mitigate irregular, high-frequency jitter components while preserving intentional motion cues.
Contemporary video stabilization technologies are typically categorized into three primary classes: mechanical stabilization, optical stabilization, and digital stabilization.
Mechanical stabilization [6] was a mainstream approach for early camera stabilization, relying on sensors to achieve stability. A typical mechanical stabilization method uses a gyroscope to detect the camera’s motion state and then employs specialized stabilizing equipment to physically adjust the camera’s position, counteracting the effects of shake, as shown in Figure 2. Although mechanical stabilization technology excels at handling large-amplitude, high-frequency random vibrations and has been widely applied in vehicular, airborne, and marine platforms, its precision is limited and it is susceptible to environmental factors like friction. To improve stabilization accuracy, integrating mechanical stabilization with optical or digital stabilization techniques has been considered. However, the need for specialized equipment, device weight, and battery consumption associated with mechanical stabilization impose constraints on its application in handheld devices.
Optical stabilization [7] is another important video stabilization method. It compensates for camera rotation and translation by adjusting internal optical components in the imaging device in real time, such as mirrors, prisms, or optical wedges, to redirect the optical path or move the imaging sensor. This stabilization process is completed before image information is recorded by the sensor, with the internal component layout shown in Figure 3. To improve stabilization accuracy, some optical stabilization systems incorporate gyroscopes to measure differences in motion velocity at different time points, effectively distinguishing between normal camera movements and unwanted shake. However, optical stabilization technology also has limitations. First, the high cost of optical components increases the overall system cost. Second, it is susceptible to lighting conditions, which can reduce accuracy and degrade the final stabilization effect. As a result, optical stabilization is only suitable for scenarios with relatively small random vibrations. Given the pursuit of higher stabilization performance and broader applicability, integrating optical stabilization with other stabilization techniques may be a research direction worth exploring.
Digital video stabilization (DVS) is mainly implemented in software and, in most cases, does not depend on specific hardware devices [8,9,10,11,12,13,14,15,16]. Its fundamental principle involves accurately estimating undesired camera motion from video frames and compensating for such motion through appropriate geometric or optical transformations, as schematically illustrated in the stabilization pipeline of Figure 4. Although DVS exhibits slightly less advantageous processing speeds compared to optical stabilization, it delivers superior performance in terms of stabilization quality and operational flexibility. A defining advantage of DVS lies in its unique capability: it is the sole technology capable of stabilizing pre-recorded videos post hoc, a critical feature for retrofitting stability to existing footage. With technological advancements, DVS has further diverged into two distinct paradigms, traditional model-driven methods and emerging deep learning-based approaches, expanding the research and application frontiers for video stabilization.
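As a concrete illustration of this pipeline, the sketch below implements a minimal classical DVS loop with OpenCV: track features between consecutive frames, estimate a global similarity transform, accumulate it into a camera trajectory, smooth the trajectory, and warp each frame by the correction. The function name, parameter values, and the simple moving-average smoother are illustrative assumptions rather than a specific published method.

```python
# Minimal classical DVS sketch (illustrative; parameters are assumptions).
import cv2
import numpy as np

def stabilize(frames, smooth_radius=15):
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    transforms = []                                   # per-frame (dx, dy, d_angle)
    for prev, curr in zip(gray[:-1], gray[1:]):
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                      qualityLevel=0.01, minDistance=30)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
        good = status.ravel() == 1
        M, _ = cv2.estimateAffinePartial2D(pts[good], nxt[good])
        transforms.append([M[0, 2], M[1, 2], np.arctan2(M[1, 0], M[0, 0])])
    trajectory = np.cumsum(transforms, axis=0)        # accumulated camera path
    kernel = np.ones(2 * smooth_radius + 1) / (2 * smooth_radius + 1)
    smoothed = np.stack([np.convolve(trajectory[:, i], kernel, mode="same")
                         for i in range(3)], axis=1)  # moving-average smoothing
    corrected = np.array(transforms) + (smoothed - trajectory)
    out = [frames[0]]
    for i, (dx, dy, da) in enumerate(corrected):
        W = np.array([[np.cos(da), -np.sin(da), dx],
                      [np.sin(da),  np.cos(da), dy]])
        h, w = frames[i + 1].shape[:2]
        out.append(cv2.warpAffine(frames[i + 1], W, (w, h)))
    return out
```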
In the field of video stabilization, several representative studies have been conducted [17,18,19,20]. However, ref. [17] focuses solely on traditional methods and lacks a discussion of deep learning-based methods. Ref. [18] further discusses deep learning methods and quality assessment standards, but does not conduct a comprehensive study of deep learning methodologies and their performance comparisons. Although ref. [19] comprehensively covers traditional methods and deep learning-based methods, it does not address quality assessment metrics, datasets, or discussions of the achievements of leading stabilization methods. Ref. [20] provides a systematic overview of the development of image stabilization technology, but lacks sufficient integration and discussion of datasets. Therefore, conducting more in-depth research on video stabilization is crucial for establishing better guidelines for future research. This study achieves the following breakthrough contributions through interdisciplinary integration and systematic reconstruction:
First, we have established a unified classification system that integrates classical mechanics with deep learning paradigms. Unlike [17], which ignores deep learning, or [19], which omits evaluation metrics, we reveal algorithmic innovations and how to address practical bottlenecks through a methodological evolution diagram (Figure 5).
Second, we have conducted the first comprehensive comparison of subjective evaluation metrics and objective metrics (cropping ratio, distortion, and stability, collectively CDS).
Third, we have constructed a dataset classification system, reorganizing 11 benchmark datasets along application scenarios and complexity dimensions to guide training for generalization.
Finally, we propose cross-domain synergies between video stabilization and adjacent fields. For example, we adapted the TCR module from video compression [21] to reduce latency, and transferred angular weighting strategies from skeleton-based angular feature enhancement [22] to motion modeling. These paths have not been explored in existing reviews.
These contributions make this review both a technical reference and a roadmap for the development of next-generation stabilization systems. In this paper, Section 2 reviews the state-of-the-art development and representative methodologies in video stabilization. Section 3 then elaborates on the assessment methodologies for video stabilization quality in detail. Section 4 introduces public datasets and summarizes the state-of-the-art performance. Section 5 discusses the challenges and future directions faced by video stabilization. Finally, Section 6 concludes this work.

2. Advances in Video Stabilization

Currently, research in digital video stabilization (DVS) is predominantly categorized into two paradigms: traditional methods and deep learning-based approaches. Traditional methods rely on manually designed and meticulously extracted features, as shown in Figure 6a. This process demands substantial professional expertise and involves complex operational workflows. In contrast, deep learning-based video stabilization does not directly compute or visualize camera motion trajectories. Instead, it employs supervised learning models for stabilization processing, as illustrated in Figure 6b. Notably, deep learning-based methods have demonstrated superior performance in recent studies [23,24,25,26,27,28,29], primarily due to their capability to automatically extract high-dimensional features. This eliminates the reliance on manual feature extraction and matching, thereby enhancing stabilization effectiveness. In this paper, we provide a systematic overview of DVS by categorizing algorithms into traditional and deep learning-based approaches. The objective is to comprehensively organize the current landscape of video stabilization and systematically expound on digital video stabilization techniques within these two frameworks.

2.1. Algorithms of Traditional Digital Video Stabilization

In 2011, Grundmann et al. proposed the L1-optimal method [30]. This method aims to obtain a stable video that meets the visual requirements by generating a smooth camera path that follows the laws of cinematography. To achieve this goal, they employed a linear programming algorithm. By minimizing the first-order, second-order, and third-order derivatives and simultaneously considering various constraints in the camera path, they effectively smoothed the camera path. It is worth noting that the L1 optimization method tends to generate some smooth paths with zero derivatives. This characteristic enables it to eliminate the undesired low-frequency jitter of the camera, further enhancing the stability of the video. However, its reliance on a global motion model renders it ineffective in scenes with significant parallax or dynamic foregrounds, leading to inaccurate motion estimation and subpar stabilization results.
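To make the path-smoothing idea concrete, the sketch below smooths a one-dimensional camera trajectory by minimizing the L1 norms of its first, second, and third derivatives, in the spirit of [30]; the weights, the slack constraint, and the use of cvxpy are illustrative assumptions, not the original formulation, which optimizes parametric 2D transforms under cinematography and inclusion constraints.

```python
# 1D sketch of L1-style camera-path smoothing (illustrative only).
import numpy as np
import cvxpy as cp

def smooth_path_l1(original_path, slack=0.05, w1=10.0, w2=1.0, w3=100.0):
    """Minimize L1 norms of the 1st/2nd/3rd derivatives of the path
    while keeping it close to the original trajectory."""
    n = len(original_path)
    p = cp.Variable(n)                  # optimized (stabilized) path
    d1 = p[1:] - p[:-1]                 # velocity
    d2 = d1[1:] - d1[:-1]               # acceleration
    d3 = d2[1:] - d2[:-1]               # jerk
    objective = cp.Minimize(w1 * cp.norm1(d1) + w2 * cp.norm1(d2) + w3 * cp.norm1(d3))
    # Stay within a window of the original path (a stand-in for the
    # inclusion/crop constraints of the real method).
    constraints = [cp.abs(p - original_path) <= slack * np.ptp(original_path)]
    cp.Problem(objective, constraints).solve()
    return p.value

# Example: a linear pan corrupted by high-frequency jitter.
t = np.arange(200)
shaky = 0.5 * t + 5.0 * np.random.randn(200)
stable = smooth_path_l1(shaky)
```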
In 2021, Bradley et al. [31] improved the L1-optimal method. By introducing homography transformation, they significantly enhanced the precision of the algorithm. This method is deeply rooted in Lie theory and operates within the logarithmic homography space to maintain the linearity of the processing, thus enabling efficient convex optimization. In order to enhance the approximation between the stabilized path and the original path, this method employs the L2 norm. Meanwhile, the constraints in the optimization process ensure that only valid pixels are included in the cropped frames, and the field of view is retained according to the approximate values of the area and side length. In addition, this method can effectively address the distortion problem through specific constraints and optimization objectives. When dealing with videos of arbitrary length, Bradley et al. adopted the sliding window strategy, demonstrating extremely high flexibility. To properly handle the problem of discontinuity, they further introduced the third-order Markov property, which means that the first three frames of the current window will maintain the solutions generated within the previous window unchanged. These techniques contribute to improved coherence in the stabilized video. Yet it struggles with non-rigid motion scenarios and can be computationally demanding for high-resolution videos.
In 2013, Liu et al. [32] proposed a video stabilization method called Bundle. This method achieves the globally optimal path by performing two minimization operations. To reduce excessive cropping and geometric distortion, the Bundle method aims to make the stabilized path as close as possible to the original path. Notably, this method calculates the bundled path separately for each grid cell. By thoroughly exploring local transformations and shape-preserving constraints, Bundle aims to effectively handle parallax and rolling shutter effects, thereby further improving stabilization performance. Nevertheless, it comes with a high computational cost due to the need for dense grid computations. It also requires long and reliable feature tracks for accurate motion estimation, which can be difficult to obtain in complex scenes.
In 2014, Liu et al. [33] proposed a video stabilization method named SteadyFlow. The core idea of this method is to achieve video stabilization by smoothing the optical flow. In specific implementation, it skillfully combines the traditional optical flow and the global matrix for initialization. In order to identify discontinuous motion more accurately, this method also divides the motion vectors of a given frame into inliers and outliers based on spatio-temporal analysis. However, the computational complexity associated with dense optical flow calculations can be a drawback, especially for real-time applications. It may also encounter difficulties in scenes with large occlusions or drastic illumination changes.
In 2016, Liu et al. [34] further improved the SteadyFlow method and proposed a new optimization scheme called MeshFlow. This method has excellent online stability and only produces one frame of delay, which greatly improves real-time processing efficiency. By using sparse motion fields and adaptive window technology to achieve minimum delay online stabilization, it is very suitable for real-time feedback applications, such as live video streaming. On the other hand, in scenarios with rapid and large-scale camera rotation, it may fail to accurately capture complex geometric changes, leading to instability or distortion in the stabilized video. The method may also perform poorly in scenarios with significant occlusion.
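The core of MeshFlow-style smoothing can be pictured as temporally filtering the accumulated motion profile of each grid vertex. The sketch below uses a plain Gaussian-weighted temporal average as a stand-in for MeshFlow's adaptive windows and weights; the array shapes and parameters are assumptions for illustration.

```python
# Illustrative smoothing of per-vertex motion profiles (not MeshFlow's exact scheme).
import numpy as np

def smooth_vertex_profiles(profiles, radius=10, sigma=5.0):
    """profiles: (T, H, W, 2) accumulated x/y motion per grid vertex over T frames."""
    T = profiles.shape[0]
    smoothed = np.zeros_like(profiles, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        w = np.exp(-0.5 * ((np.arange(lo, hi) - t) / sigma) ** 2)
        w /= w.sum()                                   # normalized temporal weights
        smoothed[t] = np.tensordot(w, profiles[lo:hi], axes=(0, 0))
    return smoothed
```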
In 2015, Zhang et al. [35] proposed a video stabilization method. This method uses a set of specific trajectories to work together to remove excessive shaking and grid deformation, thereby achieving faster stabilization processing. In order to reduce geometric distortion in stabilized videos, this method carefully encodes the two key steps of shaking removal and grid deformation in a single global optimization based on the position of the grid vertices. This global optimization method not only improves processing speed but also effectively reduces geometric distortion during video stabilization, further enhancing video quality. However, its use of a single global transformation makes it difficult to handle local parallax in complex scenes, potentially introducing edge blurring. Weight settings require empirical adjustment and lack an automatic adaptation mechanism.
In 2017, Zhang et al. [36] proposed a video stabilization method operating in geometric transformation space. By utilizing the Riemannian metric, they transformed geodesics in Lie groups into optimization paths, providing a novel framework for trajectory smoothing. This approach employs closed-form solutions to compute geodesics within a specific space, enabling efficient camera path stabilization through geometric interpolation. Crucially, the method achieves significant computational efficiency by solving geodesics in the rigid transformation space, offering practical advantages for real-world applications. However, its global transformation model struggles with scenes exhibiting drastic depth variations, necessitating local grid refinement that increases computational overhead. Additionally, geodesic paths prioritize mathematical smoothness over physical plausibility, potentially deviating from actual camera motion.
To address the common limitation whereby feature-based stabilization risks discarding foreground features, Zhang et al. introduced constraints on foreground trajectory motion using inter-frame and intra-frame similarity transformations. These constrained trajectories were used to construct a Delaunay triangular mesh. An optimization problem was then solved with three core constraints: smooth stabilized camera paths for visual coherence; geometric similarity between stabilized and original triangles to preserve local structure; and consistent relative transformations between stabilized triangles and their neighbors across frames to maintain spatiotemporal motion coherence. Collectively, these constraints enable effective foreground retention during stabilization while mitigating the aforementioned limitations.
In 2018, Wu et al. [37] proposed a novel video stabilization method. They posited that motion within a local temporal window is typically stationary, has constant velocity, or is accelerated. Differing from previous approaches, their optimization method specifically addresses each type of motion irregularity. The first key contribution is the enhanced motion perception fidelity. While most prior methods employed a Gaussian kernel as the temporal weight in optimization, its temporal isotropy performs poorly with accelerated motion. To solve this, Wu et al. introduced the motion steering kernel to replace the Gaussian kernel and used an adaptive window size for better performance in fast motion. The second contribution is the local low-rank regularization term, which improves robustness to different motion patterns and enhances stabilization.
Since deep learning technology entered the field of video stabilization in 2018, digital video stabilization methods have shown a trend of diversification. Compared with traditional methods, these methods, while dealing with video jitter, adopt different training strategies to reduce the possible distortion and cropping problems during the stabilization process. More importantly, the speed advantage brought about by deep learning makes online video stabilization possible. Figure 5 shows various digital video stabilization methods mentioned in this paper in chronological order. The part before the dotted line represents traditional video stabilization technologies, while the part after the dotted line demonstrates new video stabilization methods based on deep learning.

2.2. Video Stabilization Methods Based on Deep Learning

Since 2018, deep learning technology has been increasingly widely applied in the field of video stabilization, and methods such as those in refs. [27,38,39,40,41,42,43,44] have emerged as prominent approaches for digital video stabilization. By adopting diverse training strategies, these methods can not only effectively deal with video jitter but also reduce distortion and cropping during the stabilization process. At the same time, the high-speed processing ability of deep learning also provides strong support for online video stabilization. The extensive application of deep learning has greatly enriched the technical means of video stabilization. In addition to the basic jitter removal task, different methods also show unique advantages during video processing. According to the type of motion information utilized in motion estimation, this paper classifies deep learning-based video stabilization algorithms into three major categories: 2D methods, 3D methods, and 2.5D methods that use 2D motion information to estimate three-dimensional motion for handling parallax. This classification helps to more systematically understand and compare the performance and characteristics of various video stabilization algorithms.
Two-dimensional methods primarily rely on two-dimensional motion calculations, and their processes may also involve other video processing tasks such as compression, denoising, interpolation, and segmentation. Two-dimensional stabilization methods are widely adopted in practice due to their robustness and lower computational overhead. In contrast, 3D methods require reconstructing a 3D scene model and precisely modeling camera poses to compute smooth virtual camera trajectories in 3D space. To determine the camera’s six-degrees-of-freedom (6DoF) pose, researchers have employed various techniques, including projective 3D reconstruction [45], camera depth estimation [46], structure from motion (SFM) [47], and light-field analysis [48]. Three-dimensional methods typically demand high computational costs [49] or specific hardware support. Additionally, 3D methods may fail when large foreground objects are present [50], further limiting their application scenarios, whereas 2D methods exhibit better adaptability across a broader range of scenarios. However, when successfully implemented, 3D methods often yield stabilized results of superior quality.
To integrate the strengths of 2D and 3D approaches while mitigating their limitations, researchers have proposed 2.5D methods. These hybrid methods fuse features from 2D and 3D techniques, aiming to enhance stabilization quality while reducing computational requirements or hardware dependencies. As such, 2.5D methods hold promising development prospects and substantial application potential in the field of digital video stabilization.

2.2.1. Two-Dimensional Video Stabilization

In 2018, Wang et al. [51] introduced StabNet, the first deep neural network-based solution for video stabilization. Before this, the application of deep learning to video stabilization had been limited primarily by the lack of suitable datasets. To train StabNet, the researchers created a dataset called DeepStab, using a specially designed portable device to capture synchronized pairs of stable and unstable videos. The DeepStab dataset contains 60 video pairs, each approximately 30 s long and recorded at 30 frames per second.
Traditional digital image stabilization methods mostly use offline algorithms to smooth the overall camera path through feature matching techniques. In contrast, StabNet is committed to achieving low-latency and real-time camera path smoothing. It neither explicitly presents the camera path nor relies on the information of future frames. Instead, it learns mesh transformations from historical stabilized frames and warps input frames to output stabilized results. The algorithm flow is shown in Figure 7.
Compared with traditional offline video stabilization, the online method StabNet achieves nearly 10 times faster processing without relying on future frames. At the same time, StabNet demonstrates robust performance on low-quality videos, including night scenes and watermarked, blurry, and noisy footage. However, StabNet scores slightly lower on stability, and unnatural inter-frame swinging and distortion may appear. This is mainly due to its limitations as an online method: it relies only on information from historical frames and lacks a comprehensive grasp of the complete camera path. In addition, StabNet's effectiveness is limited by its generalization ability, because it requires a large number of video pairs containing different types of motion for training. Unfortunately, since the types of videos covered in the DeepStab dataset are relatively limited, methods trained on this dataset are often affected by insufficient generalization ability.
Xu et al. [52] proposed an unsupervised motion trajectory stabilization framework named DUT. While traditional methods depend on hand-crafted trajectory smoothing, hand-crafted features prove unstable in occluded or textureless scenes. DUT made the first attempt to use an unsupervised deep learning method to explicitly estimate and smooth trajectories for video stabilization. This framework is composed of a keypoint detector and a motion estimator based on a deep neural network (DNN), which are used to generate grid-based trajectories, together with a trajectory smoother based on a convolutional neural network (CNN) to stabilize the video. During unsupervised training, DUT makes full use of motion continuity as well as the consistency of keypoints and grid vertices before and after stabilization.
As illustrated in Figure 8, the DUT framework primarily comprises three core modules: the Keypoint Detection (KD) module, the Motion Propagation (MP) module, and the Trajectory Smoothing (TS) module. The KD module first employs the detector from RFNet [53] (utilized for feature point analysis) along with PWCNet’s optical flow to compute motion vectors. The MP module then propagates sparse keypoints to the dense grid vertices of each frame and obtains their motion trajectories through temporal correlation. Finally, the TS module smooths these trajectories. By optimizing the estimated motion trajectories while maintaining the consistency of the keypoints and vertices before and after stabilization, this stabilizer forms an unsupervised learning scheme.
DUT handles challenging scenes with multi-plane motion effectively. Compared to traditional stabilizers, DUT’s deep learning-based keypoint detector provides advantages in handling fast rotation, blur, and large motions, often achieving lower distortion and higher stability metrics. However, this algorithm relies on optical flow estimation and may fail in low-light or high-motion-blur scenarios. The dual-stream design increases computational complexity compared to single-stream models. In extreme-camera-motion scenarios, performance degrades due to limited temporal context in unsupervised training.
Yu and Ramamoorthi [54] proposed a neural network named DeepFlow. This network innovatively adopts the optical flow method for motion analysis and directly infers the pixel-level warping used for video stabilization from the optical flow field of the input video. The DeepFlow method not only uses optical flow for motion restoration but also achieves smoothing through the warping field. In addition, this method applies the PCA optical flow [55] technology to the field of video stabilization, thus significantly improving the processing robustness in complex scenes such as moving objects, occlusions, and inaccurate optical flow. This strategy of integrating optical flow analysis and learning represents a significant approach for video stabilization.
However, its reliance on optical flow accuracy limits its performance in low-texture or extreme-motion scenes, as PCA optical flow interpolation may introduce blurring in severely occluded regions. The sliding window method for long videos introduces latency and requires multiple iterations for offline processing, limiting its real-time applicability.
Choi and Kweon [56] proposed a full-frame video stabilization method named DIFRINT. This deep learning-based method performs smoothing through frame interpolation to achieve stable video output; its uniqueness lies in its ability to generate stable video frames without cropping while maintaining a low distortion rate. By generating interpolations between frames, it effectively reduces inter-frame jitter, and iterative application further enhances stabilization.
Given an unstable input video, DIFRINT ingeniously uses frame interpolation as a means of stabilization. Essentially, it interpolates iteratively between consecutive frames while keeping the stable frames at the boundaries of the interpolated frames, thus achieving a full-frame output. This unsupervised deep learning framework does not require paired real stable videos, making the training process more flexible. By using the frame interpolation method to stabilize frames, DIFRINT successfully avoids the problem of introducing cropping.
From the perspective of interpolation, the interpolated frames generated by the DIFRINT deep framework represent the frames captured between two consecutive frames, that is, the intermediate frames in a temporal sense. The sequential generation of such intermediate frames effectively reduces the spatial jitter between adjacent frames. Intuitively, frame interpolation acts like a low-pass filter in the time domain, effectively applying linear interpolation to the frame sequence, and iteratively applying this interpolation significantly amplifies the stabilization effect. The DIFRINT method estimates precise intermediate pixel positions through interpolation, generating intermediate frames for high-precision video stabilization. An additional advantage is user control: the number of iterations and the interpolation parameters can be adjusted based on preference, allowing users to retain a desired level of residual motion. However, DIFRINT still has certain limitations. In videos with severe jitter, it may introduce blurring at the image boundaries and may lead to serious distortion during the iterative process.
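The temporal low-pass behavior described above can be sketched as repeated mid-frame interpolation, as below; interp_mid is a placeholder for a learned frame-interpolation model (DIFRINT's actual network also reuses the original frame to limit content drift), and the linear blend in the usage comment is purely illustrative.

```python
# Conceptual sketch of iterative mid-frame interpolation as a temporal low-pass filter.
def stabilize_by_interpolation(frames, interp_mid, iterations=3):
    frames = list(frames)
    for _ in range(iterations):
        new_frames = [frames[0]]                         # keep boundary frames fixed
        for i in range(1, len(frames) - 1):
            new_frames.append(interp_mid(frames[i - 1], frames[i + 1]))
        new_frames.append(frames[-1])
        frames = new_frames
    return frames

# Usage with a naive linear blend standing in for the learned model:
# stabilized = stabilize_by_interpolation(frames, lambda a, b: (a + b) / 2)
```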
Liu et al. [57] proposed a frame synthesis algorithm named Hybrid, designed to achieve full-frame video stabilization. This approach effectively addresses issues such as noticeable distortion and significant boundary cropping common in existing stabilization methods. As illustrated in Figure 9, the Hybrid algorithm integrates four key components: a feature extractor, a warping layer, a frame generator with fusion functionality, and a final weighted summation step, collectively producing stabilized video output. The method's core concept involves robustly fusing information from multiple adjacent frames. It first estimates dense warping fields from neighboring frames, then synthesizes stabilized frames by fusing these warped contents. A key feature of Hybrid is its deep learning-based hybrid spatial fusion technology, which can mitigate the impact of inaccurate optical flow and fast-moving objects, thereby reducing artifacts. Instead of operating directly on RGB frames, Hybrid extracts features with a trained CNN, fuses the aligned feature maps, and decodes them into the final color frames.
Hybrid employs a hybrid fusion mechanism combining feature-level and image-level fusion to reduce sensitivity to inaccurate optical flow. By learning to predict spatially adaptive fusion weights, Hybrid minimizes blurriness and distortion in the generated videos. Additionally, to enhance perceptual quality, the method transfers high-frequency details to stabilized frames through a unique reprocessing approach. Finally, to minimize blank areas across frames, Hybrid introduces a path adjustment strategy that balances camera motion smoothness with reduced cropping.
Compared with methods that cause boundary distortion and cropping, Hybrid generates full-frame stabilized videos with fewer visual artifacts and less distortion, and it shows improved robustness to inaccurate optical flow predictions, although it can produce overly blurred results. It can also incorporate existing optical flow smoothing methods for further stabilization. However, Hybrid performs poorly in addressing the rolling shutter effect and may be limited in complex scenarios involving lighting changes, occlusions, and foreground/background motion.

2.2.2. Three-Dimensional Video Stabilization

Lee and Tseng [58] proposed a deep learning-based 3D video stabilization method named Deep3D, which innovatively uses 3D information to enhance video stability. Compared with previous 2D methods, when dealing with scenes with complex scene depths, Deep3D can significantly reduce the generation of artifacts. It adopts a self-supervised learning framework to simultaneously learn the depth and camera pose in the original video. It is worth noting that Deep3D does not require pre-trained data but directly stabilizes the input video through 3D reconstruction. During testing, the CNN simultaneously learns the scene depth and 3D camera motion of the input video.
As depicted in Figure 10, this method’s implementation comprises two sequential stages. The initial 3D geometric optimization stage employs PoseNet and DepthNet to estimate the 3D camera trajectory and dense scene depth, respectively, from the input RGB frame sequence. Optical flow and the frame sequence itself serve as learning constraints for 3D scene reconstruction. The subsequent frame correction stage utilizes the estimated camera trajectory and scene depth to generate stabilized video through smoothed-trajectory view synthesis. During this process, users may adjust smoothing filter parameters to achieve varying stabilization intensities. Finally, warping and cropping operations produce the stabilized video output.
The DepthNet and PoseNet in the geometric optimization framework of Deep3D can estimate the dense scene depth and camera pose trajectory according to the segments of the input sequence. Using the loss term of 3D projection measurement, the parameters of these networks are updated through backpropagation during the test time.
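The view synthesis at the heart of such 3D pipelines can be sketched as a depth-based reprojection: pixels are back-projected with the estimated depth, transformed by the relative pose between the original and the smoothed virtual camera, and projected back. The code below is a generic illustration under an assumed pinhole model, not Deep3D's implementation.

```python
# Generic depth-and-pose reprojection sketch (assumed pinhole intrinsics K).
import numpy as np

def reproject(depth, K, T_src_to_smooth):
    """depth: (H, W) depth map; K: 3x3 intrinsics; T_src_to_smooth: 4x4 relative pose.
    Returns, for every source pixel, its coordinates in the smoothed (virtual) view."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    rays = np.linalg.inv(K) @ pix                                       # back-project
    pts = rays * depth.reshape(1, -1)                                   # 3D points
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])                # homogeneous
    pts_new = (T_src_to_smooth @ pts_h)[:3]                             # move to virtual pose
    proj = K @ pts_new
    return (proj[:2] / proj[2:]).T.reshape(H, W, 2)                     # new pixel coords
```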
Deep3D effectively handles parallax and severe camera shake, producing stabilized outputs with good stability and low distortion. However, the computational cost during the optimization phase is relatively high, making it less suitable for real-time streaming applications compared to lightweight online methods. Additionally, while postprocessing applies adaptive depth smoothing to dynamic objects, it may still introduce minor distortion in highly non-rigid regions.
Chen Li et al. [59] proposed an innovative online video stabilization method leveraging gyroscope (Euler angles) and accelerometer data from an inertial measurement unit (IMU) sensor. To comprehensively evaluate their method, they constructed a novel dataset covering seven typical scenarios: walking, ascending/descending stairs, panning, zooming, fast shaking, running, and stationary recording. All videos are 1080p resolution at 30 FPS, approximately 30 s long. For refined optimization, they employed an improved Cubic Spline Method to generate pseudo-ground-truth-stabilized videos as references.
In terms of trajectory optimization, Chen Li et al. adopted two sub-networks. The first sub-network focuses on detecting the motion scene and adaptively selects the features that conform to a specific scene by generating an attention mask. This method significantly improves the flexibility and robustness of the model when dealing with complex motion scenes, enabling the model to achieve a prediction accuracy of 99.9% in all seven scenes. The second sub-network, under the supervision of the mask, uses the Long Short-Term Memory network (LSTM) to predict a smooth camera path based on the real unstable trajectory.
However, this method has limitations. Its reliance on inertial measurement unit (IMU) sensors limits its applicability on devices lacking these components. Although the method performs well in general scenarios, it may encounter difficulties in extreme motion or dynamic object scenarios due to its assumption of static scenes. Although the trajectory smoothing module is enhanced by scene priors, post-processing (two-step modification) is still required to reduce jitter, introducing additional complexity.
RStab [60] breaks through these limitations with a 3D multi-frame fusion volume rendering framework. Its core Stabilized Rendering (SR) module fuses multi-frame features and color information in 3D space; combined with the depth-prior-based sampling of the adaptive ray range (ARR) module and the optical flow constraints of the color correction (CC) module, it preserves the full field of view (FOV) while significantly improving projection accuracy in dynamic regions, realizing full-frame stabilized video generation. By fusing multi-view features and colors in 3D space through Stabilized Rendering, RStab avoids aggressive cropping and geometric distortion, achieving a cropping ratio of 1 across datasets such as NUS and Selfie. The ARR module uses depth priors to constrain sampling around object surfaces, reducing interference from dynamic objects, while the CC module refines projections with optical flow to enhance color accuracy.
However, RStab relies on accurate depth maps and optical flow from pre-trained models, which may introduce errors in low-texture or fast-motion scenes. The volume rendering and multi-module pipeline may be computationally intensive, potentially limiting real-time deployment on resource-constrained devices. While ARR and CC mitigate dynamic region issues, extreme cases with complex non-rigid motions might still cause residual blurs or misalignments. Despite these, RStab demonstrates strong performance in 3D-aware video stabilization, effectively balancing full-frame generation and structural fidelity.
Although this method has a slightly lower score in terms of stability and may occasionally cause visual distortion due to the lack of future information, its output results are more reliable. It has produced robust results with less distortion in various scenes and achieved a more balanced performance in different types of videos. It is particularly worth mentioning that in the four scenes of walking, stairs, zooming, and static, even with only a 3-frame delay, this method has achieved impressive results.

2.2.3. Two-and-a-Half-Dimensional Video Stabilization

Two-and-a-half-dimensional video stabilization leverages the complementary strengths of 2D and 3D approaches while avoiding explicit 3D camera path reconstruction. Instead, it imposes spatial coherence constraints to smooth the camera motion trajectory.
Goldstein and Fattal [61] implement this approach through projective reconstruction—an epipolar geometry-based method. Their technique models geometric relationships between arbitrary uncalibrated camera views using fundamental matrices and camera projections, achieving motion smoothing without reconstructing the 3D scene. The method is robust to planar scenes and degenerate camera motions, with lower computational cost than full 3D approaches. However, it relies on sufficient feature tracks, fails with strong occlusions or non-Lambertian surfaces, and cannot handle rolling shutter effects.
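As a small illustration of the epipolar building block used by such projective approaches, the snippet below robustly estimates a fundamental matrix from matched feature points with OpenCV; the thresholds and the assumption of precomputed matches are illustrative.

```python
# Robust fundamental-matrix estimation from matched points (illustrative thresholds).
import cv2
import numpy as np

def estimate_fundamental(pts_a, pts_b):
    """pts_a, pts_b: (N, 2) float arrays of matched points in two frames."""
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = mask.ravel().astype(bool)          # RANSAC inlier correspondences
    return F, inliers
```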
Matthias et al. [62] introduced the concept of a hybrid homography, which is applied after feature extraction and RANSAC-based outlier rejection to estimate global motion parameters. The method is integrated with video stabilization but may fail in scenes with significant depth variations or degraded visual signals. Its reliance on feature density limits performance in low-texture regions, and it requires dense KLT tracks for accurate distortion modeling.
Wang et al. [63] modeled each trajectory as a Bézier curve, preserving spatial relationships by maintaining original offsets between neighboring curves. Their approach formulates video stabilization as a spatio-temporal optimization problem that minimizes trajectory jitter while preventing visual distortion. The Bézier representation ensures smooth motion but may introduce artifacts in long, twisting trajectories. The method is sensitive to feature tracker reliability and may crop excessive content during warping, especially in aggressively stabilized videos.
Zhao and Ling [64] proposed a video stabilization network named PWStableNet, which adopts a pixel-by-pixel calculation method. Unlike most previous methods that calculate a global homography matrix or multiple homography matrices based on a fixed grid to warp jittery frames to a stable view, PWStableNet introduces a pixel-level warping map, allowing each pixel to be warped independently. This design more accurately handles the parallax problem caused by depth changes and represents the first pixel-level video stabilization algorithm based on deep learning.
PWStableNet employs a multi-level cascaded encoder–decoder structure with innovative inter-stage connections. These connections fuse the feature map of the previous stage with the corresponding feature map of the later stage, enabling the latter to learn residuals from the former’s features. This cascaded architecture helps the later stage generate a more accurate warping map.
As illustrated in Figure 11, to stabilize a specific frame, PWStableNet takes a group of adjacent frames as input and estimates two warping maps: a horizontal warping map and a vertical warping map. For each pixel, the values in these two maps indicate its new position in the stable view after transformation from the original position. Composed of three-level cascaded encoder–decoder modules, PWStableNet features two branches of a Siamese network with shared parameters. This Siamese structure ensures temporal consistency between consecutive stabilized frames, thereby enhancing the stability of the generated video.
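Applying such per-pixel horizontal and vertical warping maps to a frame amounts to a standard remapping operation, as in the sketch below; the convention that each output pixel stores its source coordinates follows OpenCV's cv2.remap and is an assumption about how the network outputs would be consumed.

```python
# Applying pixel-wise warping maps with OpenCV (map convention is an assumption).
import cv2
import numpy as np

def warp_with_maps(frame, map_x, map_y):
    """map_x, map_y: (H, W) arrays giving, for every output pixel,
    the source coordinates in the unstable input frame."""
    return cv2.remap(frame, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
```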
PWStableNet achieves precise frame stabilization through pixel-level warping, outperforming methods based on global affine transformations or transformation sets—particularly in videos containing parallax effects or crowd scenes. This advantage stems from its ability to model complex depth discontinuities that cannot be represented by sparse affine matrices. Furthermore, the approach demonstrates robust performance with low-quality inputs including noisy and motion-blurred footage. However, stabilization efficacy may be compromised in scenarios involving extreme parallax or rapid motion. The reliance on homography mixtures for local regions limits its accuracy in highly non-planar scenes.
Chen et al. [65] proposed PixStabNet, a multi-scale convolutional neural network for real-time video stabilization that operates without future frames. To boost robustness, the researchers implemented a two-stage training scheme.
Some previous methods, such as StabNet, did not consider depth changes; StabNet uses historical ground-truth stabilized frames as inputs during training but historical output stabilized frames during testing, which may lead to serious distortion and warping in the output video. Although PWStableNet takes depth changes into account by generating pixel-based warping maps, it requires 15 future frames as network inputs and therefore introduces a delay of at least 15 frames. PixStabNet addresses these problems with a multi-scale CNN that directly predicts the transformation for each input.
As shown in Figure 12, PixStabNet employs a multi-scale, coarse-to-fine architecture. The encoder–decoder network estimates pixel-based warping maps for stabilization, enhanced by a two-stage training scheme. Processing begins at the coarsest scale to compute an initial transformation. This coarse warping map pre-stabilizes frames at the next finer scale, where the network performs further refinement. The final warping map aggregates outputs from all scales.
The experimental results show that compared with StabNet, PixStabNet can produce more stable results with less distortion. Although the video output by PWStableNet has less distortion, its stability is poor and there is obvious jitter. It is worth mentioning that PixStabNet is currently the fastest online method, with a running speed of 54.6 FPS and without using any future frames. However, while pursuing stability, it may lead to large cropping in the output results. It struggles with extreme jitter or low-texture regions and relies on large datasets for optimal performance.
Yu and Ramamoorthi [66] proposed a robust video stabilization method that innovatively models inter-frame appearance changes directly as a dense optical flow field between consecutive frames. Compared with traditional techniques that rely on complex motion models, this method adopts a first-principles formulation of video stabilization, although this introduces a large-scale non-convex problem. To solve it, they transfer the problem into the parameter domain of a convolutional neural network (CNN). Notably, this method not only leverages the standard advantages of CNNs in gradient-based optimization but also uses the CNN purely as an optimizer, rather than as a feature extractor learned from data. Its uniqueness lies in the fact that it trains the CNN from scratch for each input case and deliberately overfits the CNN parameters to achieve the best stabilization effect on that specific input. By recasting the transformation of image pixels as a problem in the CNN parameter domain, this method provides a new feasible route for video stabilization. However, it requires a large amount of computational resources, making it challenging for real-time applications. Performance degrades under low-light conditions or when optical flow accuracy is compromised, and, owing to the warping-based formulation, it may over-smooth fine details.
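A heavily simplified PyTorch sketch of this "CNN as optimizer" idea is given below: a small network is trained from scratch on a single input video so that its predicted per-pixel corrections minimize a stabilization loss on that video alone. The architecture and loss terms are illustrative assumptions and not the formulation of [66].

```python
# Conceptual "CNN as optimizer" sketch (architecture and losses are assumptions).
import torch
import torch.nn as nn

class WarpNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1))          # per-pixel (dx, dy) correction

    def forward(self, flow):
        return self.net(flow)

def overfit_on_video(flows, steps=500, lr=1e-3):
    """flows: (T, 2, H, W) tensor of inter-frame optical flow for one video."""
    model = WarpNet()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        correction = model(flows)                       # predicted per-pixel correction
        residual = flows + correction                   # motion after compensation
        smoothness = residual.diff(dim=0).abs().mean()  # encourage temporally smooth motion
        fidelity = correction.abs().mean()              # stay close to the input
        loss = smoothness + 0.1 * fidelity
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```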
In a recent study addressing the model generalization challenge, Ali et al. [67] proposed a test-time adaptation strategy that optimizes pixel-level synthesis parameters through single-step fine-tuning combined with meta-learning, significantly improving stability in complex motion scenes. This work relies on pre-trained deep models, limiting applicability to scenarios without access to such baselines. A framework for deep camera path optimization [68] achieves real-time stabilization with single-frame-level delay within a sliding window, reaching performance comparable to offline methods via a motion smoothing attention (EMSA) module and a hybrid loss function. However, the sliding window approach may accumulate errors in long videos, and the model requires sensor noise parameter estimation, which is challenging under dynamic lighting. Dense flow processing also increases memory usage compared to mesh-based approaches.
Sánchez-Beeckman et al. [69] propose a self-similarity-based two-stage denoising scheme combined with temporal trajectory pre-filtering to enhance the quality of input frames; however, demosaicking artifacts persist in low-contrast regions, and the method struggles with fast motion due to optical flow limitations. Meanwhile, Zhang et al. [70] utilize IMU-assisted gray-scale pixel matching across frames to significantly enhance the temporal consistency of white balance and to provide more robust preprocessing support for stabilization algorithms; this strictly requires IMU sensor data, limiting compatibility with legacy devices.
For videos with large occlusions, the 2D methods based on feature trajectories often fail due to the difficulty of obtaining long feature trajectories and produce artifacts in the video. At the same time, the structure from motion (SFM) method is usually not suitable for dynamic scenes, and the 3D method also cannot produce satisfactory results. In contrast, the optical flow-based method shows stronger robustness when dealing with larger foreground occlusions.

2.3. Comparison of Methods

This section aims to provide a comprehensive comparison of the main algorithms from both performance metrics (quantitative) and visual quality (qualitative) perspectives, thereby elucidating their technical characteristics and laying the foundation for understanding the current state of research. Table 1 and Table 2 respectively show the processing speed (FPS) of some classic algorithms on the NUS dataset under CPU and GPU testing. Table 3 summarizes the performance of representative algorithms on mainstream evaluation metrics (typically cropping rate C, distortion D, and stability S). Figure 13 visually demonstrates the differences in stabilization quality among selected algorithms in typical scenarios. Table 3 clearly reflects the mainstream shift in video stabilization from traditional optimization methods (L1Stabilize, Bundle) to deep learning methods (StabNet and its successors). Deep learning methods demonstrate overall advantages in the core metrics (especially C and D) and offer greater potential for speed improvements.
A single quantitative metric is insufficient for a comprehensive evaluation of stabilization performance. Combining visual comparisons is crucial for understanding algorithm performance in real-world scenarios and identifying potential issues. Comparisons on standard datasets provide a baseline, but the generalization capability of algorithms in more complex and diverse real-world scenarios remains a challenge.

3. Assessment Metrics for Video Stabilization Algorithms

In the technology of video stabilization, the irregular jitter of the camera can cause visual discomfort. However, at the same time, the processing of video stabilization may introduce artifacts, distortion, and cropping, all of which will lead to the impairment of the visual quality of the video output by the algorithm. Therefore, the assessment of video stabilization quality has become a key indicator for measuring the advantages, disadvantages, and practicality of video stabilization methods.
Although there have been many previous studies [71,72,73,74,75] involving the assessment of video stabilization performance, up to now, there has not been a clear and unified assessment standard for video stabilization quality. This section reviews the historical development of video stabilization quality assessment. Like general video quality assessment, prevailing methods fall into two categories: subjective and objective evaluation.

3.1. Subjective Quality Assessment

Subjective quality assessment is an assessment method based on the human subjective visual system. In this method, the stability of a video is usually evaluated through inspection by the human visual system and user surveys. Among them, the Mean Opinion Score (MOS) [18] is currently a widely adopted subjective method for measuring video stabilization quality. This method was first established by Suan et al. on a statistical basis and has been applied in the medical field. To ensure that subjective assessment results have statistical significance, a certain number of observers must be involved, so that the results truly reflect the stabilization effect of the video. However, the drawback of the MOS method is that it cannot be represented by mathematical modeling and is time-consuming, so it is not suitable for stability assessment on large datasets.
The Differential Mean Opinion Score (DMOS) is a derivative index based on the MOS score, which reflects the difference in assessment scores between the distortion-free image and the distorted image by the human visual system. The smaller the DMOS value, the higher the image quality.
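As a simple numerical illustration with synthetic ratings, MOS is the mean of the observer scores for a video, and DMOS is the difference between the scores given to the reference and to the distorted (stabilized) version; the values below are made up for illustration.

```python
# Synthetic example of MOS / DMOS aggregation (ratings are made up).
import numpy as np

ratings_reference  = np.array([5, 5, 4, 5, 4])   # observer scores for the reference video
ratings_stabilized = np.array([4, 3, 4, 4, 3])   # observer scores for the stabilized output

mos_reference  = ratings_reference.mean()
mos_stabilized = ratings_stabilized.mean()
dmos = mos_reference - mos_stabilized            # smaller DMOS indicates higher quality
```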
In addition, an MOS-based user survey method was introduced in [18], which measures the subjective preferences of users through combined comparisons between different methods. Specifically, the investigator will show multiple videos to the observers and ask the observers to select the best-performing video among them according to some predefined indicators.
In the assessment of video stabilization quality, user surveys are often used as a supplementary means to objective assessment to assess the stability of the stabilized video and whether the distortion and blurriness have an impact on the user’s viewing experience. However, the drawback of user surveys is that they are too time-consuming and it is difficult to accurately describe them through mathematical modeling. Nevertheless, in the current situation where there is a lack of a unified objective quality assessment standard, the existence of a subjective quality assessment is still very necessary, as it provides us with an intuitive and effective method for evaluating the quality of video stabilization.

3.2. Objective Quality Assessment

The objective assessment method is to conduct a standardized assessment of the video stabilization algorithm by constructing a mathematical model of indicators that can reflect the quality of video stabilization. According to whether they rely on a reference object or not, these methods can be further divided into full-reference quality assessment methods and no-reference quality assessment methods. The core of the full-reference quality assessment method lies in comparing the stabilized video processed by the video stabilization algorithm with the real stabilized video, and judging the advantages and disadvantages of the algorithm based on the differences between the two. In contrast, the no-reference quality assessment method does not require a real stabilized video as a reference. Instead, it uses a statistical model to measure the motion changes of the video before and after the stabilization process and conducts an assessment accordingly.

3.2.1. Full-Reference Quality Assessment

In full-reference quality assessment, an authentic stable video serves as the reference standard. This evaluation involves synthesizing jittery video by introducing artificial shake to the stable reference, which then serves as input to the stabilization algorithm. Performance is quantified by measuring quality differences between the algorithm’s output and the original stable reference.
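As an illustration of this protocol, the sketch below perturbs a stable reference clip with small random rigid transforms to create a synthetic jittery input; the shake magnitudes, the OpenCV-based warping, and the function name are our assumptions rather than a prescribed procedure.

```python
import cv2
import numpy as np

def add_synthetic_shake(frames, max_shift=8.0, max_angle=1.5, seed=0):
    """Apply a small random rotation and translation to every frame of a
    stable clip, returning the jittery frames and the ground-truth transforms."""
    rng = np.random.default_rng(seed)
    h, w = frames[0].shape[:2]
    jittery, transforms = [], []
    for frame in frames:
        angle = rng.uniform(-max_angle, max_angle)            # degrees
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)   # pixels
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        M[:, 2] += (dx, dy)                                   # add the random translation
        jittery.append(cv2.warpAffine(frame, M, (w, h)))
        transforms.append(M)
    return jittery, transforms
```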
Common full-reference metrics include Peak Signal-to-Noise Ratio (PSNR) [76], Mean-Squared Error (MSE) [77], and Structural Similarity (SSIM) [76]. These methods are widely adopted due to their computational efficiency and general applicability, making them suitable for integration into video optimization pipelines.
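A minimal sketch of how these full-reference metrics could be averaged over a clip is given below, using the scikit-image implementations of PSNR and SSIM; grayscale 8-bit frames and the helper name are assumptions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(stabilized_frames, reference_frames):
    """Average MSE, PSNR, and SSIM between each stabilized frame and the
    corresponding stable reference frame (grayscale, 8-bit frames assumed)."""
    mse, psnr, ssim = [], [], []
    for out, ref in zip(stabilized_frames, reference_frames):
        out = out.astype(np.float64)
        ref = ref.astype(np.float64)
        mse.append(np.mean((out - ref) ** 2))
        psnr.append(peak_signal_noise_ratio(ref, out, data_range=255))
        ssim.append(structural_similarity(ref, out, data_range=255))
    return float(np.mean(mse)), float(np.mean(psnr)), float(np.mean(ssim))
```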
Several specialized approaches have emerged: Offiah et al. [78] developed a full-reference assessment method for medical endoscopic video stabilization. Tanakian et al. [72] employed MSE between stabilized and reference motion paths as a stabilization distance metric. Qu et al. [79] synthesized jitter–stable video pairs for dataset construction, evaluating algorithms via SSIM. Zhang et al. [80] proposed a Riemannian metric-based approach that improves accuracy through motion difference analysis while increasing computational complexity. Liu et al. [57] utilized PSNR, SSIM, and LPIPS for synthesized frame evaluation. Ito et al. [81] introduced a composite index combining MSE, SSIM, and resolution loss, and developed a dedicated stabilization assessment dataset.
Despite these advances, the limited availability of paired reference-distorted video datasets continues to constrain the widespread adoption of full-reference assessment in video stabilization evaluation.

3.2.2. No-Reference Quality Assessment

Unlike full-reference assessment, no-reference quality evaluation operates without paired stable reference videos. Instead, it employs statistical models to quantify motion variations between unstabilized inputs and stabilized outputs, directly measuring video stabilization efficacy.
Niskanen et al. [82] separated the residual jitter component of stabilized videos using low-pass/high-pass filtering and evaluated stability by the magnitude of jitter attenuation. However, this approach has limited ability to characterize motion stability precisely.
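A minimal sketch in this spirit is given below: it high-pass filters a one-dimensional camera-motion signal before and after stabilization and reports the jitter attenuation in decibels. The filter order, cutoff frequency, and function name are assumptions and do not reproduce the cited method exactly.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def jitter_attenuation_db(path_before, path_after, fps=30.0, cutoff_hz=1.0):
    """High-pass filter a 1-D motion signal (e.g., inter-frame x-translation)
    to isolate jitter, then compare jitter energy before and after
    stabilization; larger values mean more jitter was removed."""
    b, a = butter(2, cutoff_hz / (fps / 2.0), btype="high")
    jitter_in = filtfilt(b, a, np.asarray(path_before, dtype=float))
    jitter_out = filtfilt(b, a, np.asarray(path_after, dtype=float))
    energy_in = np.sum(jitter_in ** 2) + 1e-12
    energy_out = np.sum(jitter_out ** 2) + 1e-12
    return 10.0 * np.log10(energy_in / energy_out)
```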
To comprehensively evaluate the quality of videos processed by stabilization algorithms from multiple perspectives, Liu et al. [32] proposed a set of objective evaluation indicators: cropping rate, distortion degree, and stability. The cropping rate measures the frame area remaining after the empty pixel regions caused by motion compensation are cropped away; the distortion degree quantifies the anisotropic scaling of the homography between input and output frames; and stability measures the smoothness of the stabilized video. Their definitions are as follows:
The cropping rate quantifies the proportion of retained frame content after stabilization, calculated per frame as the homography scale factor between input/output frames. The video-level metric represents the mean across all frames, where higher values indicate superior content preservation.
The distortion degree (Distortion) describes how much the stabilized result is distorted relative to the original video. It is computed per frame as the ratio of the two largest eigenvalues of the affine part of the homography matrix, and the distortion degree of the whole video is defined as the minimum (worst) per-frame value.
The stability (Stability) indicator is used to evaluate the stability and smoothness of the video. The rotation and translation sequences of all homography transformations between consecutive frames in the output video are treated as two time series. The metric calculates the ratio of low-frequency energy (2–6 Hz) to full-band energy (excluding DC), taking the minimum ratio as the stability score.
These cropping/distortion/stability (CDS) indicators are now widely used in video stabilization. However, the threshold separating low- and high-frequency components usually has to be adjusted for videos with different motion types, so practical applications require scenario-specific settings to keep the evaluation accurate and effective.
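The sketch below follows the definitions above, assuming that per-frame input-to-output homographies and per-frame translation/rotation sequences have already been estimated; using the square root of the determinant as the scale factor and treating the 2–6 band in hertz are simplifying assumptions.

```python
import numpy as np

def cropping_and_distortion(homographies_in_to_out):
    """Per-frame cropping ratio (scale of the input-to-output homography) and
    distortion (eigenvalue ratio of its affine part); the video-level scores
    are the mean cropping ratio and the minimum distortion, as defined above."""
    crops, dists = [], []
    for H in homographies_in_to_out:
        H = H / H[2, 2]                                   # normalize the homography
        A = H[:2, :2]                                     # affine (linear) part
        eig = np.abs(np.linalg.eigvals(A))
        crops.append(float(np.sqrt(np.abs(np.linalg.det(A)))))
        dists.append(float(eig.min() / eig.max()))
    return float(np.mean(crops)), float(min(dists))

def stability_score(translations, rotations, fps=30.0, band=(2.0, 6.0)):
    """Ratio of low-frequency energy (in the band above) to full-band energy,
    excluding the DC component; the smaller of the translation and rotation
    ratios is taken as the stability score."""
    def energy_ratio(signal):
        spec = np.abs(np.fft.rfft(np.asarray(signal, dtype=float))) ** 2
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
        low = spec[(freqs >= band[0]) & (freqs <= band[1])].sum()
        return low / (spec[1:].sum() + 1e-12)
    return float(min(energy_ratio(translations), energy_ratio(rotations)))
```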
Zheng et al. [83] introduced motion-path curvature analysis by mapping feature-derived homographies to Lie group spaces and computing the discrete geodesic total curvature as a stability metric. Zhang et al. [84] enhanced this approach using constrained paths for spatial motion variations, evaluating stability through weighted curvature calculations.
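To convey the intuition behind these curvature-based metrics, the toy sketch below computes the discrete total turning angle of a 2-D camera path; it deliberately ignores the Lie-group parameterization and constrained paths used in the cited works, so it should be read only as an illustration.

```python
import numpy as np

def discrete_total_curvature(path_points):
    """Sum of absolute turning angles along a polygonal 2-D camera path;
    smoother (straighter) paths yield smaller values."""
    p = np.asarray(path_points, dtype=float)
    v = np.diff(p, axis=0)                    # segment vectors along the path
    headings = np.unwrap(np.arctan2(v[:, 1], v[:, 0]))
    return float(np.sum(np.abs(np.diff(headings))))
```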
Liu et al. [57] proposed measuring video stability using accumulated optical flow. However, current stabilization quality assessment methods focus mainly on distortion, cropping, and path stability, while neglecting factors that can also affect perceived stability, such as lighting conditions, characteristics of the human visual system, and accurate recognition of background regions. A major challenge for future work is therefore to identify more stability-related features and model them explicitly, so as to further improve the accuracy and generality of objective evaluation.

4. Benchmark Datasets for Video Stabilization

Several benchmark datasets have been constructed for evaluating video stabilization algorithms:
(1) The HUJ dataset comprises 42 videos, covering driving scenarios, dynamic scenes, zooming sequences, and walking motions.
(2) The MCL dataset contains 162 videos across seven categories: regular motion, jello effect, depth scenes, crowd environments, driving sequences, running motions, and object-focused scenarios.
(3) The BIT dataset includes 45 videos spanning walking, climbing, running, cycling, driving, large parallax, crowd scenes, close-up objects, and low-light environments.
(4) The QMUL dataset consists of 421 videos, encompassing regular, blurry, high-speed motion, low-light, textureless, parallax, discontinuous, depth, crowd, and close-up object scenarios.
With the integration of deep learning into video stabilization research, the following datasets have become pivotal for algorithm development and validation:
(5) The NUS dataset, containing 144 videos, covers seven distinct categories: regular motion, fast rotation, zooming, parallax, driving, crowd scenes, and running sequences.
(6) The DeepStab dataset is a purpose-built dataset for supervised deep learning in video stabilization, containing 61 pairs of synchronized videos, each with a duration of up to 30 s and a frame rate of 30 FPS. It encompasses diverse scenarios, such as indoor environments with large parallax and outdoor scenes featuring buildings, vegetation, and crowds. Camera motions include forward translation, lateral movement, rotation, and their composite dynamics, providing rich temporal–spatial variations. Notably, data collection employs dual-camera synchronization: one camera records stable footage, while the other captures handheld jitter, generating high-fidelity paired samples for training (as illustrated in Figure 14).
(7) The Video+Sensor dataset contains 50 videos, each accompanied by gyroscope and optical image stabilization (OIS) sensor logs, which is its distinguishing feature. The dataset is divided into 16 training videos and 34 test videos, and the test set is further categorized into six scenarios (regular motion, rotation, parallax, driving, human-centric, and running) to enable comprehensive evaluation of stabilization algorithms across diverse real-world contexts.
(8) The IMU_VS dataset contains 70 videos that have been augmented with IMU sensor data in seven scenarios: walking, stair climbing/descending, static, translation, running, zooming, and rapid shaking. To ensure data diversity, 10 videos were collected for each scenario, with sensor logs detailing angular velocity and acceleration. The dataset is carefully partitioned into training (42 videos), validation (7 videos), and test (21 videos) subsets, supporting algorithm development, optimization, and evaluation at different stages.
In addition, datasets designed for special scenarios are continually being introduced:
(9) Building on their earlier work on selfie video stabilization [85], Yu et al. introduced the expanded Selfie dataset, comprising 1005 videos in total. Dlib is used for facial detection in each frame, and face occurrences are tracked across consecutive frames; only segments in which a face is successfully detected in at least 50 consecutive frames are retained, ensuring high-quality, face-focused clips through this stringent selection criterion. Uniquely, the Selfie dataset provides both regular color videos and corresponding ground-truth foreground masks for each frame, offering rich annotations critical for selfie video stabilization research.
(10) Leveraging the virtual scene generator Silver [86], Kerim et al. proposed the VSAC105Real dataset, divided into two sub-datasets: VSNC35Synth and VSNC65Synth. VSNC35Synth focuses on normal weather conditions and contains 35 videos, while VSNC65Synth is broader, encompassing 65 videos across diverse weather scenarios (day/night, normal, rainy, foggy, and snowy) to simulate real-world filming environments.
(11) The ISDS dataset [87] optimizes small-target detection through a multi-scale weighted feature fusion model (YOLOv4-MSW), providing data and modeling support for applying image stabilization algorithms to dynamic water scenes.
Table 4 summarizes the types, scenarios, and main features of common video stabilization datasets. NUS is widely used for algorithm validation due to its motion diversity, while DeepStab is frequently adopted for training due to paired samples. Selfie and Video+Sensor datasets address specialized stabilization needs. The Selfie dataset is specifically tailored for researching selfie video stabilization methods, while the Video+Sensor dataset plays a pivotal role in advancing 3D video stabilization.
These meticulously collected and curated datasets have made substantial contributions to the progress of video stabilization, offering critical support for the development of novel stabilization approaches. However, to further enhance the generalizability and robustness of deep learning algorithms, ongoing efforts are needed to explore standardized datasets that incorporate a broader spectrum of motion types in the future.

5. Challenges and Future Directions

5.1. Current Challenges

Despite the significant progress and rapid development of video stabilization research and applications over the years, numerous challenges and unresolved technical problems persist.
Real-time high-resolution processing: Existing methods face severe computational delays on high-resolution videos. Applications such as drone aerial photography and medical endoscopy require sub-millisecond responses that are beyond current capabilities.
Cross-device generalization: Deep learning models suffer significant performance degradation when applied to unseen camera hardware due to device-specific training data. The DeepStab dataset’s limitation to handheld devices results in increased distortion errors with wide-angle lenses.
Training data scarcity: Supervised learning requires costly “stable–jittery” video pairs. Current datasets cover only 7–10 motion scenes, lacking long-tail samples (extreme illumination, textureless regions), limiting model generalization.
Optical flow limitations: These methods provide nonlinear motion compensation flexibility but fail in feature-scarce environments, producing non-rigid distortions and artifacts when reliable rigid constraints are absent.
Sensor-based parallax issues: While sensor-based methods avoid scene-content dependencies [88,89], their homography-based stabilization at infinity cannot adapt to depth variations, causing residual parallax in close-range scenarios.
Warping-cropping trade-off: Methods dependent on warping inevitably introduce boundary artifacts that necessitate cropping, thereby perpetuating the fundamental stability-cropping trade-off. Smoother motion trajectories require larger crops, whereas restricted warping capacity constrains achievable stability [30,47,49,61].

5.2. Future Directions

With the further advancement of research, there are still some worthy research directions in the field of video stabilization:
Adaptive model selection: The 2D/3D model trade-off requires application-specific solutions. Two-dimensional models suit lightweight applications but fail under parallax distortion, whereas three-dimensional models alleviate parallax at the cost of high computational complexity. Recent work on flapping-wing robot stabilization demonstrates the value of context-adaptive designs [90].
Real-time online systems: Minimizing latency remains critical. While some deep learning methods achieve real-time performance, future-frame dependencies create fixed delays. Optical flow-guided context reduction [91] exemplifies mobile-compatible acceleration strategies.
Cross-domain integration: Techniques from related fields could enhance stabilization. Skeleton-based angular weighting [22] may improve dynamic motion modeling, while video compression’s hierarchical enhancement [21] offers co-optimization insights for mobile systems.
Comprehensive datasets: Current datasets lack diversity, causing overfitting and poor generalization. Curating datasets covering long-tail scenarios (e.g., low-texture environments) is essential for robust algorithms.

5.3. Multi-Disciplinary Applications

Video stabilization enables critical functionality across domains:
Intelligent transportation: Real-time in-vehicle stabilization improves traffic sign recognition through motion blur suppression.
Industrial inspection: Sub-pixel motion compensation solves micro-flutter in precision part microscopy.
Geological exploration: UAV systems integrate IMU with 3D reconstruction to eliminate wind-induced image shifts.
Medical robotics: Endoscopic systems use pixel-warping to compensate respiratory motion, reducing surgical risk.
These implementations transform mechanical vibration into quantifiable motion vectors, enhancing vision-system robustness. Future integration with edge computing and multimodal perception could enable cross-domain collaborative systems.

6. Conclusions

In recent years, with deepening research on digital video stabilization, deep learning-based stabilization algorithms have become a focus of the field. Compared with traditional approaches, they not only handle video shake effectively but also reduce distortion and cropping during the stabilization process.
In this paper, we first conduct an in-depth analysis of the research background and significance of video stabilization. Building on a comprehensive review of prior work, we systematically describe the current mainstream stabilization algorithms, classifying them into traditional and deep learning approaches according to their implementation, and elaborate on their development status and representative methods. We then introduce the evaluation metrics for video stabilization and summarize the widely used datasets. Finally, we point out the main challenges and future directions of this technology in practical applications.

Author Contributions

Conceptualization, Q.H. and X.L.; investigation, C.J. and Y.W.; writing—original draft, Q.X. and Y.W.; writing—review and editing, Q.X. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Opening Foundation of the Jiangsu Engineering Research Center of Digital Twinning Technology for Key Equipment in Petrochemical Process under grant numbers DTEC202301 and DTEC202303.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Baker, C.L.; Hess, R.F.; Zihl, J. Residual motion perception in a “motion-blind” patient, assessed with limited-lifetime random dot stimuli. J. Neurosci. 1991, 11, 454–461. [Google Scholar] [CrossRef] [PubMed]
  2. Ling, Q.; Zhao, M. Stabilization of traffic videos based on both foreground and background feature trajectories. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2215–2228. [Google Scholar] [CrossRef]
  3. Sharif, M.; Khan, S.; Saba, T.; Raza, M.; Rehman, A. Improved video stabilization using SIFT-log polar technique for unmanned aerial vehicles. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April 2019; pp. 1–7. [Google Scholar]
  4. Li, X.; Xu, F.; Yong, X.; Chen, D.; Xia, R.; Ye, B.; Gao, H.; Chen, Z.; Lyu, X. SSCNet: A spectrum-space collaborative network for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 5610. [Google Scholar] [CrossRef]
  5. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  6. Yu, J.; Wu, Z.; Yang, X.; Yang, Y.; Zhang, P. Underwater target tracking control of an untethered robotic fish with a camera stabilizer. IEEE Trans. Syst. Man Cybern. Syst. 2020, 51, 6523–6534. [Google Scholar] [CrossRef]
  7. Cardani, B. Optical image stabilization for digital cameras. IEEE Control Syst. Mag. 2006, 26, 21–22. [Google Scholar]
  8. e Souza, M.R.; de Almeida Maia, H.; Pedrini, H. Rethinking two-dimensional camera motion estimation assessment for digital video stabilization: A camera motion field-based metric. Neurocomputing 2023, 559, 126768. [Google Scholar] [CrossRef]
  9. Ke, J.; Watras, A.J.; Kim, J.J.; Liu, H.; Jiang, H.; Hu, Y.H. Efficient online real-time video stabilization with a novel least squares formulation and parallel AC-RANSAC. J. Vis. Commun. Image Represent. 2023, 96, 103922. [Google Scholar] [CrossRef]
  10. Li, X.; Mo, H.; Liu, F. A robust video stabilization method for camera shoot in mobile devices using GMM-based motion estimator. Comput. Electr. Eng. 2023, 110, 108841. [Google Scholar] [CrossRef]
  11. Raj, R.; Rajiv, P.; Kumar, P.; Khari, M.; Verdú, E.; Crespo, R.G.; Manogaran, G. Feature based video stabilization based on boosted HAAR Cascade and representative point matching algorithm. Image Vis. Comput. 2020, 101, 103957. [Google Scholar] [CrossRef]
  12. Kokila, S.; Ramesh, S. Intelligent software defined network based digital video stabilization system using frame transparency threshold pattern stabilization method. Comput. Commun. 2020, 151, 419–427. [Google Scholar]
  13. Cao, M.; Zheng, L.; Jia, W.; Liu, X. Real-time video stabilization via camera path correction and its applications to augmented reality on edge devices. Comput. Commun. 2020, 158, 104–115. [Google Scholar] [CrossRef]
  14. Dolly, D.R.J.; Peter, J.D.; Josemin Bala, G.; Jagannath, D.J. Image fusion for stabilized medical video sequence using multimodal parametric registration. Pattern Recognit. Lett. 2020, 135, 390–401. [Google Scholar] [CrossRef]
  15. Huang, H.; Wei, X.X.; Zhang, L. Encoding Shaky Videos by Integrating Efficient Video Stabilization. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 1503–1514. [Google Scholar] [CrossRef]
  16. Siva Ranjani, C.K.; Mahaboob Basha, S. Video Stabilization using SURF-CNN for Surveillance Application. In Proceedings of the 2024 8th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 6–8 November 2024; pp. 1118–1123. [Google Scholar] [CrossRef]
  17. Wei, S.; Xie, W.; He, Z. Digital Video Stabilization Techniques: A Survey. J. Comput. Res. Dev. 2017, 54, 2044–2058. [Google Scholar]
  18. Guilluy, W.; Oudre, L.; Beghdadi, A. Video stabilization: Overview, challenges and perspectives. Signal Process. Image Commun. 2021, 90, 116015. [Google Scholar] [CrossRef]
  19. Roberto e Souza, M.; Maia, H.d.A.; Pedrini, H. Survey on Digital Video Stabilization: Concepts, Methods, and Challenges. ACM Comput. Surv. 2022, 55, 47. [Google Scholar] [CrossRef]
  20. Wang, Y.; Huang, Q.; Jiang, C.; Liu, J.; Shang, M.; Miao, Z. Video stabilization: A comprehensive survey. Neurocomputing 2023, 516, 205–230. [Google Scholar] [CrossRef]
  21. Huang, Q.; Lu, H.; Liu, W.; Wang, Y. Scalable Motion Estimation and Temporal Context Reinforcement for Video Compression using RGB sensors. IEEE Sens. J. 2025, 25, 18323–18333. [Google Scholar] [CrossRef]
  22. Huang, Q.; Liu, W.; Shang, M.; Wang, Y. Fusing angular features for skeleton-based action recognition using multi-stream graph convolution network. IET Image Process. 2024, 18, 1694–1709. [Google Scholar] [CrossRef]
  23. Ravankar, A.; Rawankar, A.; Ravankar, A.A. Video stabilization algorithm for field robots in uneven terrain. Artif. Life Robot. 2023, 28, 502–508. [Google Scholar] [CrossRef]
  24. e Souza, M.R.; Maia, H.d.A.; Pedrini, H. NAFT and SynthStab: A RAFT-based Network and a Synthetic Dataset for Digital Video Stabilization. Int. J. Comput. Vis. 2024, 133, 2345–2370. [Google Scholar] [CrossRef]
  25. Ren, Z.; Zou, M.; Bi, L.; Fang, M. An unsupervised video stabilization algorithm based on gyroscope image fusion. Comput. Graph. 2025, 126, 104154. [Google Scholar] [CrossRef]
  26. Wang, N.; Zhou, C.; Zhu, R.; Zhang, B.; Wang, Y.; Liu, H. SOFT: Self-supervised sparse Optical Flow Transformer for video stabilization via quaternion. Eng. Appl. Artif. Intell. 2024, 130, 107725. [Google Scholar] [CrossRef]
  27. Gulcemal, M.O.; Sarac, D.C.; Alp, G.; Duran, G.; Gucenmez, S.; Solmaz, D.; Akar, S.; Bayraktar, D. Effects of video-based cervical stabilization home exercises in patients with rheumatoid arthritis: A randomized controlled pilot study. Z. Rheumatol. 2024, 83, 352–358. [Google Scholar] [CrossRef] [PubMed]
  28. Liang, H.; Dong, Z.; Li, H.; Yue, Y.; Fu, M.; Yang, Y. Unified Vertex Motion Estimation for integrated video stabilization and stitching in tractor–trailer wheeled robots. Robot. Auton. Syst. 2025, 191, 105004. [Google Scholar] [CrossRef]
  29. Dong, L.; Chen, L.; Wu, Z.C.; Zhang, X.; Liu, H.L.; Dai, C. Video Stabilization-Based elimination of unintended jitter and vibration amplification in centrifugal pumps. Mech. Syst. Signal Process. 2025, 229, 112500. [Google Scholar] [CrossRef]
  30. Grundmann, M.; Kwatra, V.; Essa, I. Auto-directed video stabilization with robust L1 optimal camera paths. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 225–232. [Google Scholar]
  31. Bradley, A.; Klivington, J.; Triscari, J.; van der Merwe, R. Cinematic-L1 video stabilization with a log-homography model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1041–1049. [Google Scholar]
  32. Liu, S.; Yuan, L.; Tan, P.; Sun, J. Bundled camera paths for video stabilization. ACM Trans. Graph. (TOG) 2013, 32, 78. [Google Scholar] [CrossRef]
  33. Liu, S.; Yuan, L.; Tan, P.; Sun, J. Steadyflow: Spatially smooth optical flow for video stabilization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 4209–4216. [Google Scholar]
  34. Liu, S.; Tan, P.; Yuan, L.; Sun, J.; Zeng, B. Meshflow: Minimum latency online video stabilization. In Computer Vision–ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 800–815. [Google Scholar]
  35. Zhang, L.; Xu, Q.K.; Huang, H. A global approach to fast video stabilization. IEEE Trans. Circuits Syst. Video Technol. 2015, 27, 225–235. [Google Scholar] [CrossRef]
  36. Zhang, L.; Chen, X.Q.; Kong, X.Y.; Huang, H. Geodesic video stabilization in transformation space. IEEE Trans. Image Process. 2017, 26, 2219–2229. [Google Scholar] [CrossRef]
  37. Wu, H.; Xiao, L.; Lian, Z.; Shim, H.J. Locally low-rank regularized video stabilization with motion diversity constraints. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2873–2887. [Google Scholar] [CrossRef]
  38. Chereau, R.; Breckon, T.P. Robust motion filtering as an enabler to video stabilization for a tele-operated mobile robot. In Proceedings of the Electro-Optical Remote Sensing, Photonic Technologies, and Applications VII; and Military Applications in Hyperspectral Imaging and High Spatial Resolution Sensing, Dresden, Germany, 24–26 September 2013; Kamerman, G.W., Steinvall, O.K., Bishop, G.J., Gonglewski, J.D., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2013; Volume 8897, p. 88970I. [Google Scholar] [CrossRef]
  39. Franz, G.; Wegner, D.; Wiehn, M.; Keßler, S. Evaluation of video stabilization metrics for the assessment of camera vibrations. In Proceedings of the Infrared Imaging Systems: Design, Analysis, Modeling, and Testing XXXV, National Harbor, MD, USA, 21–26 April 2024; Haefner, D.P., Holst, G.C., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2024; Volume 13045, p. 130450D. [Google Scholar] [CrossRef]
  40. Yang, C.; He, Y.; Zhang, D. LSTM based video stabilization for object tracking. In Proceedings of the AOPC 2021: Optical Sensing and Imaging Technology, Beijing, China, 20–22 June 2021; Jiang, Y., Lv, Q., Liu, D., Zhang, D., Xue, B., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2021; Volume 12065, p. 120653D. [Google Scholar] [CrossRef]
  41. Takeo, Y.; Sekiguchi, T.; Mitani, S.; Mizutani, T.; Shirasawa, Y.; Kimura, T. Video stabilization method corresponding to various imagery for geostationary optical Earth observation satellite. In Proceedings of the Image and Signal Processing for Remote Sensing XXVII, Online, 6 October 2021; Bruzzone, L., Bovolo, F., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2021; Volume 11862, p. 1186205. [Google Scholar] [CrossRef]
  42. Voronin, V.; Frantc, V.; Marchuk, V.; Shrayfel, I.; Gapon, N.; Agaian, S.; Stradanchenko, S. Video stabilization using space-time video completion. In Proceedings of the Mobile Multimedia/Image Processing, Security, and Applications 2016, Baltimore, MD, USA, 17–21 April 2016; Agaian, S.S., Jassim, S.A., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2016. [Google Scholar] [CrossRef]
  43. Mehala, R.; Mahesh, K. An effective absolute and relative depths estimation-based 3D video stabilization framework using GSLSTM and BCKF. Signal Image Video Process. 2025, 19, 412. [Google Scholar] [CrossRef]
  44. Zhang, Y.; Guo, P.; Ju, M.; Hu, Q. Motion Intent Analysis-Based Full-Frame Video Stabilization. IEEE Signal Process. Lett. 2025, 32, 1685–1689. [Google Scholar] [CrossRef]
  45. Buehler, C.; Bosse, M.; McMillan, L. Non-metric image-based rendering for video stabilization. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001; Volume 2, p. II. [Google Scholar]
  46. Liu, S.; Wang, Y.; Yuan, L.; Bu, K.; Tan, P.; Sun, J. Video stabilization with a depth camera. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 89–95. [Google Scholar]
  47. Liu, F.; Gleicher, M.; Jin, H.; Agarwala, A. Content-preserving warps for 3D video stabilization. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; ACM: New York, NY, USA, 2023; pp. 631–639. [Google Scholar]
  48. Smith, B.M.; Zhang, L.; Jin, H.; Agarwala, A. Light field video stabilization. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 341–348. [Google Scholar]
  49. Liu, F.; Gleicher, M.; Wang, J.; Jin, H.; Agarwala, A. Subspace video stabilization. ACM Trans. Graph. (TOG) 2011, 30, 4. [Google Scholar] [CrossRef]
  50. Lee, K.Y.; Chuang, Y.Y.; Chen, B.Y.; Ouhyoung, M. Video stabilization using robust feature trajectories. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1397–1404. [Google Scholar]
  51. Wang, M.; Yang, G.Y.; Lin, J.K.; Zhang, S.H.; Shamir, A.; Lu, S.P. Deep online video stabilization with multi-grid warping transformation learning. IEEE Trans. Image Process. 2018, 28, 2283–2292. [Google Scholar] [CrossRef]
  52. Xu, Y.; Zhang, J.; Maybank, S.J.; Tao, D. Dut: Learning video stabilization by simply watching unstable videos. IEEE Trans. Image Process. 2022, 31, 4306–4320. [Google Scholar] [CrossRef]
  53. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  54. Yu, J.; Ramamoorthi, R. Learning video stabilization using optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8159–8167. [Google Scholar]
  55. Wulff, J.; Black, M.J. Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 120–130. [Google Scholar]
  56. Choi, J.; Kweon, I.S. Deep iterative frame interpolation for full-frame video stabilization. ACM Trans. Graph. (TOG) 2020, 39, 4. [Google Scholar] [CrossRef]
  57. Liu, Y.L.; Lai, W.S.; Yang, M.H.; Chuang, Y.Y.; Huang, J.B. Hybrid neural fusion for full-frame video stabilization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2299–2308. [Google Scholar]
  58. Lee, Y.C.; Tseng, K.W.; Chen, Y.T.; Chen, C.C. 3D video stabilization with depth estimation by CNN-based optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10621–10630. [Google Scholar]
  59. Li, C.; Song, L.; Chen, S.; Xie, R.; Zhang, W. Deep online video stabilization using IMU sensors. IEEE Trans. Multimed. 2022, 25, 2047–2060. [Google Scholar] [CrossRef]
  60. Peng, Z.; Ye, X.; Zhao, W.; Liu, T.; Sun, H.; Li, B.; Cao, Z. 3D Multi-frame Fusion for Video Stabilization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 7507–7516. [Google Scholar] [CrossRef]
  61. Goldstein, A.; Fattal, R. Video stabilization using epipolar geometry. Acm Trans. Graph. (TOG) 2012, 31, 126. [Google Scholar] [CrossRef]
  62. Grundmann, M.; Kwatra, V.; Castro, D.; Essa, I. Calibration-free rolling shutter removal. In Proceedings of the 2012 IEEE International Conference on Computational Photography (ICCP), Seattle, WA, USA, 28–29 April 2012; pp. 1–8. [Google Scholar] [CrossRef]
  63. Wang, Y.S.; Liu, F.; Hsu, P.S.; Lee, T.Y. Spatially and Temporally Optimized Video Stabilization. IEEE Trans. Vis. Comput. Graph. 2013, 19, 1354–1361. [Google Scholar] [CrossRef]
  64. Zhao, M.; Ling, Q. Pwstablenet: Learning pixel-wise warping maps for video stabilization. IEEE Trans. Image Process. 2020, 29, 3582–3595. [Google Scholar] [CrossRef] [PubMed]
  65. Chen, Y.T.; Tseng, K.W.; Lee, Y.C.; Chen, C.Y.; Hung, Y.P. Pixstabnet: Fast multi-scale deep online video stabilization with pixel-based warping. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1929–1933. [Google Scholar]
  66. Yu, J.; Ramamoorthi, R. Robust video stabilization by optimization in CNN weight space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3800–3808. [Google Scholar]
  67. Ali, M.K.; Im, E.W.; Kim, D.; Kim, T.H. Harnessing Meta-Learning for Improving Full-Frame Video Stabilization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 12605–12614. [Google Scholar] [CrossRef]
  68. Liu, S.; Zhang, Z.; Liu, Z.; Tan, P.; Zeng, B. Minimum Latency Deep Online Video Stabilization and Its Extensions. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1238–1249. [Google Scholar] [CrossRef]
  69. Sánchez-Beeckman, M.; Buades, A.; Brandonisio, N.; Kanoun, B. Combining Pre- and Post-Demosaicking Noise Removal for RAW Video. IEEE Trans. Image Process. 2025. [Google Scholar] [CrossRef] [PubMed]
  70. Zhang, L.; Chen, X.; Wang, Z. IMU-Assisted Gray Pixel Shift for Video White Balance Stabilization. IEEE Trans. Multimed. 2025, 1–14. [Google Scholar] [CrossRef]
  71. Balakirsky, S.B.; Chellappa, R. Performance characterization of image stabilization algorithms. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; Volume 2, pp. 413–416. [Google Scholar]
  72. Morimoto, C.; Chellappa, R. Evaluation of image stabilization algorithms. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), Seattle, WA, USA, 15 May 1998; Volume 5, pp. 2789–2792. [Google Scholar]
  73. Tanakian, M.J.; Rezaei, M.; Mohanna, F. Camera motion modeling for video stabilization performance assessment. In Proceedings of the 2011 7th Iranian Conference on Machine Vision and Image Processing, Tehran, Iran, 16–17 November 2011; pp. 1–4. [Google Scholar]
  74. Cui, Z.; Jiang, T. No-reference video shakiness quality assessment. In Computer Vision–ACCV 2016: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part V 13; Springer International Publishing: Cham, Switzerland, 2017; pp. 396–411. [Google Scholar]
  75. Karpenko, A.; Jacobs, D.; Baek, J.; Levoy, M. Digital video stabilization and rolling shutter correction using gyroscopes. CSTR 2011, 1, 13. [Google Scholar]
  76. Streijl, R.C.; Winkler, S.; Hands, D.S. Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives. Multimed. Syst. 2016, 22, 213–227. [Google Scholar] [CrossRef]
  77. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  78. Wang, Z.; Bovik, A.C. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Process. Mag. 2009, 26, 98–117. [Google Scholar] [CrossRef]
  79. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA USA, 12–16 October 2020; ACM: New York, NY, USA, 2020; pp. 55–63. [Google Scholar]
  80. Offiah, M.C.; Amin, N.; Gross, T.; El-Sourani, N.; Borschbach, M. An approach towards a full-reference-based benchmarking for quality-optimized endoscopic video stabilization systems. In Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, Mumbai, India, 16–19 December 2012; ACM: New York, NY, USA, 2012; pp. 1–8. [Google Scholar]
  81. Zhang, L.; Zheng, Q.Z.; Liu, H.K.; Huang, H. Full-reference stability assessment of digital video stabilization based on riemannian metric. IEEE Trans. Image Process. 2018, 27, 6051–6063. [Google Scholar] [CrossRef]
  82. Niskanen, M.; Silven, O.; Tico, M. Video Stabilization Performance Assessment. In Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9–12 July 2006; pp. 405–408. [Google Scholar] [CrossRef]
  83. Ito, M.S.; Izquierdo, E. A dataset and evaluation framework for deep learning based video stabilization systems. In Proceedings of the 2019 IEEE Visual Communications and Image Processing (VCIP), Sydney, NSW, Australia, 1–4 December 2019; pp. 1–4. [Google Scholar]
  84. Liu, S.; Li, M.; Zhu, S.; Zeng, B. Codingflow: Enable video coding for video stabilization. IEEE Trans. Image Process. 2017, 26, 3291–3302. [Google Scholar] [CrossRef]
  85. Yu, J.; Ramamoorthi, R. Selfie video stabilization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 551–566. [Google Scholar]
  86. Kerim, A.; Marcolino, L.S.; Jiang, R. Silver: Novel rendering engine for data hungry computer vision models. In Proceedings of the 2nd International Workshop on Data Quality Assessment for Machine Learning, Virtual, 14–18 August 2021. [Google Scholar]
  87. Huang, Q.; Sun, H.; Wang, Y.; Yuan, Y.; Guo, X.; Gao, Q. Ship detection based on YOLO algorithm for visible images. IET Image Process. 2024, 18, 481–492. [Google Scholar] [CrossRef]
  88. Thivent, D.J.; Williams, G.E.; Zhou, J.; Baer, R.L.; Toft, R.; Beysserie, S.X. Combined Optical and Electronic Image Stabilization. U.S. Patent 9,596,411, 14 March 2017. [Google Scholar]
  89. Liang, C.K.; Shi, F. Fused Video Stabilization on the Pixel 2 and Pixel 2 xl; Tech. Rep.; Google: Mountain View, CA, USA, 2017. [Google Scholar]
  90. Ye, J.; Pan, E.; Xu, W. Digital Video Stabilization Method Based on Periodic Jitters of Airborne Vision of Large Flapping Wing Robots. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2591–2603. [Google Scholar] [CrossRef]
  91. Wang, Y.; Huang, Q.; Tang, B.; Sun, H.; Guo, X. FGC-VC: Flow-Guided Context Video Compression. In Proceedings of the IEEE International Conference on Image Processing, ICIP 2023, Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: Kuala Lumpur, Malaysia, 2023; pp. 3175–3179. [Google Scholar] [CrossRef]
Figure 1. Jittery Video (Top) and Stabilized Video (Bottom).
Figure 2. Mechanical Stabilizer.
Figure 3. Optical Stabilization.
Figure 4. Digital Video Stabilization Process.
Figure 5. The Development Process of Video Stabilization Algorithms.
Figure 6. Digital Video Stabilization Process. The red line indicates the desired smooth motion trajectory (target stable path), the yellow line represents the original jitter motion trajectory (actual unstable path).
Figure 7. StabNet Algorithm Flowchart.
Figure 8. DUT Algorithm Flowchart.
Figure 9. Hybrid Algorithm Flowchart.
Figure 10. Deep3D Algorithm Flowchart.
Figure 11. PWStableNet Algorithm Flowchart.
Figure 12. PixStabNet Algorithm Flowchart.
Figure 13. Visual Quality of Classic Video Stabilization Methods.
Figure 14. DeepStab Dataset Collection Process and Image Examples.
Table 1. Comparison of Typical Method Performance on CPU.
Methods      | FPS  | Datasets
Bundle       | 3.5  | NUS (test)
L1Stabilizer | 10.0 | NUS (test)
MeshFlow     | 22.0 | NUS (test)
StabNet      | 35.5 | NUS (test), DeepStab (train)
Table 2. Comparison of Typical Method Performance on GPU.
Methods     | FPS  | Datasets
Hybrid      | 2.6  | NUS (test)
Deep3D      | 34.5 | NUS (test)
DIFRINT     | 14.3 | NUS (test)
DUT         | 14   | NUS (test), DeepStab (train)
PWStableNet | 56   | NUS (test), DeepStab (train)
PixStabNet  | 54.6 | NUS (test), DeepStab (train)
Table 3. CDS Metrics for Classic Video Stabilization Methods on Video+Sensor Datasets.
Method      | Year | C (Cropping) | D (Distortion) | S (Stability)
L1Stabilize | 2011 | 0.641        | 0.905          | 0.826
Bundle      | 2013 | 0.758        | 0.886          | 0.848
StableNet   | 2018 | 0.751        | 0.850          | 0.840
PWStableNet | 2020 | 0.937        | 0.971          | 0.830
DeepFlow    | 2020 | 0.792        | 0.851          | 0.845
DIFRINT     | 2021 | 1.000        | 0.880          | 0.787
Table 4. Features of Video Stabilization Datasets.
Datasets     | Type       | Scenario           | Features
HUJ          | Real-world | General            | Driving/zooming/walking scenarios
MCL          | Real-world | General            | 7 scenarios
BIT          | Real-world | General            | Includes low-light/large parallax
QMUL         | Real-world | General            | Largest scale
NUS          | Real-world | General            | Deep learning validation
DeepStab     | Real-world | General            | Dual-camera synchronization
Video+Sensor | Real-world | General            | Paired with gyroscope and OIS sensor logs
IMU_VS       | Real-world | General            | IMU sensor data augmentation
Selfie       | Real-world | Special (selfie)   | Continuous face tracking
VSAC105Real  | Synthetic  | Special (weather)  | Weather simulation
ISDS         | Real-world | Special (maritime) | Small-target detection optimization
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
