1. Introduction
As Augmented Reality (AR) technologies, particularly those employing Head Mounted Displays (HMDs), become increasingly integrated into everyday life, the need for video stabilization has grown more pronounced [1,2]. This technology, essential for correcting the shake induced by hand-held cameras, has been developed steadily since the early 2000s alongside the rapid advancement of camera hardware. The need for stabilization also stems from advances in image sensor fabrication: higher resolutions make camera shake more visible, so footage captured without post-correction is harder to interpret. Employed not only in everyday smartphone cameras but also in action cameras through modules known as image stabilizers, the demand for this technology continues to grow, as reflected in an expanding market. This trend underscores the critical role of video stabilization in enhancing the user experience in AR environments and ensuring a seamless integration of the virtual and physical worlds.
This paper addresses the limitations of existing video stabilization techniques. Video stabilization can be divided into three main processes: motion estimation, which estimates the movement between frames; motion smoothing, which smooths the estimated motion path; and stable frame generation, which synthesizes stabilized frames from the computed corrections. Previously, visual tracking techniques such as the KLT tracker [3], built on good features to track, have improved motion estimation, and motion smoothing methods such as robust L1 optimal camera paths [4] or Kalman filters have been developed. Frames are then transformed with computed warping fields, which introduces two primary issues: cropping, which reduces resolution by cutting off image edges during warping, and distortion, which perturbs pixel values and produces visual artifacts such as blur and wobble.
Firstly, the rolling shutter effect arises from the recording mechanism of image sensors that scan from the top to the bottom of the sensor, so the top of a frame is recorded slightly earlier than the bottom (see Figure 1). This results in a stretched appearance and can amplify apparent shake when the camera is in motion. The rolling shutter effect introduces various visual artifacts, including wobble, skew, and blur, which necessitates motion compensation.
Secondly, cropping occurs: the edges of the video are cut off, reducing the resolution, because of the warping process based on the pixel values between adjacent frames (see Figure 2). This reduction in resolution implies a loss of information at the video’s periphery.
Thirdly, distortion manifests as a spatial warping or stretching of the image, creating an illusion of fluttering, primarily due to camera movement, vibration, or the rolling-shutter effect (see Figure 3). This is further compounded by visual artifacts such as shakiness, blur, and wobble.
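For illustration, the sketch below smooths a cumulative camera trajectory with a Gaussian low-pass filter, one simple realization of the motion smoothing stage mentioned above; the per-frame motion array, filter width, and function names are hypothetical and do not correspond to the method proposed later in this paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_camera_path(per_frame_motion, sigma=5.0):
    """Illustrative motion smoothing: integrate per-frame (dx, dy, d_angle)
    motions into a cumulative camera path, low-pass filter that path, and
    return the per-frame corrections that move the raw path onto the
    smoothed one."""
    path = np.cumsum(per_frame_motion, axis=0)        # raw camera trajectory
    smoothed = gaussian_filter1d(path, sigma, axis=0) # smoothed trajectory
    return smoothed - path                            # correction per frame

# Hypothetical usage: motions[i] = (dx, dy, da) estimated between frames i and i+1
motions = np.random.randn(120, 3) * 0.5
corrections = smooth_camera_path(motions)
```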
To alleviate these issues, video stabilization techniques are widely used and can be broadly categorized into mechanical and digital methods, with mechanical stabilizers including gimbals and Optical Image Stabilizers (OIS). Mechanical methods have drawbacks in cost and bulkiness, and despite the surge of development brought by deep learning, digital methods still face performance limitations. Our approach is a digital method that uses only the camera’s visual sensor, enhanced by the Inertial Measurement Unit (IMU) sensor found in most mobile electronic devices today; it thus avoids additional hardware cost while improving on existing digital methods to produce visually useful imagery. To mitigate the performance degradation caused by the aforementioned issues, our method first uses the IMU sensor to reduce the rolling-shutter effect caused by dynamic motion: motion compensation is applied using the input frames together with the sensor data. Secondly, to address cropping, the unknown pixel regions must be filled from surrounding areas to create a full-frame image of the same size as the input frame. However, dynamic scenes introduce additional factors such as lighting changes that alter pixel values, and panorama-style image stitching for frame rendering can severely exacerbate visual artifacts, which motivates the application of neural rendering [5] with convolutional, Encoder–Decoder-based networks for performance enhancement. Previous studies have developed video stabilization using IMU sensors or produced full-frame images with a Cropping ratio of 1 using deep learning-based methods. Our video stabilization algorithm first synchronizes IMU sensor data with the video, receiving timestamp (s), gyroscope (rad/s), accelerometer (m/s²), and magnetometer (μT) values, and applies motion compensation to the input video frames using the Versatile Quaternion-based Filter (VQF) [6] algorithm and AKAZE-based [7] optical flow. Then, to reduce cropping, PCA-flow-based video stabilization [8] is performed. The final stabilized frames are produced by applying neural rendering to address the distortion that occurs while constructing the full-frame video. Our method, which performs sensor-based motion compensation followed by deep learning-based full-frame video stabilization, represents a hybrid approach to video stabilization not previously explored. To assess our method quantitatively, we employ several metrics: the Stability score, indicating video stability; the Distortion value, capturing the extent of deformation; and the Cropping ratio, denoting the proportion of peripheral area removed during stabilization. We further evaluate visual quality with the LPIPS, which assesses alignment with human visual perception, the SSIM, which measures structural similarity between images, and the PSNR, which gauges image quality loss. The specifics of these evaluation metrics are elaborated in Section 4.2. The applicability and expected benefits of the proposed technique are twofold: first, correcting video shake contributes to accurate stabilization, making the method valuable across various industries; second, it produces visually useful imagery without the bulkiness and cost associated with gimbals and OIS.
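As a rough illustration of the sensor fusion and feature matching steps just described, the sketch below estimates orientations with the open-source vqf Python package and extracts AKAZE correspondences with OpenCV. The function names, array shapes, and the brute-force Hamming matching are illustrative assumptions rather than our exact implementation.

```python
import cv2
import numpy as np
from vqf import VQF  # open-source VQF implementation (assumed available)

def imu_orientations(gyr, acc, mag, ts):
    """Estimate per-sample orientation quaternions from synchronized IMU data
    (gyroscope in rad/s, accelerometer in m/s^2, magnetometer in uT) with VQF."""
    dt = float(np.mean(np.diff(ts)))    # sampling period from timestamps (s)
    vqf = VQF(dt)
    out = vqf.updateBatch(gyr, acc, mag)
    return out["quat9D"]                # orientation estimate using all sensors

def akaze_matches(prev_gray, curr_gray):
    """Sparse AKAZE correspondences between consecutive frames, used here as a
    stand-in for the AKAZE-based optical flow step."""
    akaze = cv2.AKAZE_create()
    kp1, des1 = akaze.detectAndCompute(prev_gray, None)
    kp2, des2 = akaze.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts1, pts2
```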
In summary, the contributions of this paper are as follows:
For the first time, we combined an IMU sensor with a deep learning-based full-frame video stabilization method, demonstrating an increase in stability.
To address the main issues of video stabilization, such as cropping and distortion degradation, we integrated PCA flow and neural rendering.
Our technology contributes to correcting video shake for accurate target detection and tracking and has the advantage of generating visually high-quality videos at low cost.
2. Background
2.1. Motion Estimation
Optical flow is a technique commonly used to estimate the motion of objects between video frames. The resulting warping field is used to compensate for this motion in subsequent processing stages, which is why optical flow is widely employed across various fields. Motion estimation techniques based on optical flow can be broadly categorized into sparse and dense approaches. Sparse optical flow detects features such as ORB corner points [9] and then estimates motion with a KLT tracker applied to the detected features. Conversely, dense optical flow provides the magnitude and direction of movement for every pixel, performing motion estimation without relying on features. Although dense optical flow offers high accuracy, computing it over unnecessary areas results in slow processing. To overcome these limitations, RAFT (Recurrent All-Pairs Field Transforms) optical flow [10], which combines dense optical flow with R-CNN (Region-based Convolutional Neural Network) features [11], has been developed. This method, structured around feature extraction, visual similarity, and iterative updates, improves accuracy by repeatedly refining the flow vectors. Recently, Gyroflow+ [12] has integrated gyroscope data with optical flow and homography, introducing a self-guided fusion module and a homography decoder for this purpose. Attempts to overcome the limitations of dense optical flow have continued; among them, Xiao et al. [13] experimented with a module that deeply couples optical flow with deformable convolution, enabling robust motion estimation even under large motion. Additionally, for scenarios such as satellite video, an efficient computation method was developed that applies temporal differences for temporal compensation [14] as an alternative to optical flow for motion compensation.
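As a minimal sketch of the sparse (KLT-style) motion estimation described above, assuming OpenCV and placeholder grayscale frames, one can track good features between frames and fit a simple motion model to the tracked points:

```python
import cv2
import numpy as np

def estimate_sparse_motion(prev_gray, curr_gray):
    """Sparse (KLT-style) motion estimation: detect good features to track in
    the previous frame and follow them into the current frame, then summarize
    the inter-frame camera motion with a similarity/affine model."""
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=10)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]
    matrix, _ = cv2.estimateAffinePartial2D(good_prev, good_curr)
    return matrix
```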
Another family of motion estimation methods employs affine transforms, using matrix operations to map coordinates between input and output. This encompasses techniques such as homography, per-pixel warp fields, and multi-grid methods. Homography applies perspective transformation matrices derived from features extracted across two planes. Per-pixel warp fields generate warp fields at the pixel level based on histogram differences, exploiting the similarity between feature trajectories and pixel profiles in static backgrounds and their differences on dynamic objects. Multi-grid techniques learn a series of mesh-grid transformations from previously stabilized camera frames to generate camera paths. Bundled camera paths define a bundle of spatially variant camera paths from measured local homographies and optimize these paths for video stabilization after motion smoothing. IMU sensor-based motion estimation selects between optical flow (a KLT tracker) and an IMU-aided motion estimator according to a threshold on the camera’s angular velocity; however, cropping may occur in all scenarios regardless of the threshold applied. Lately, GlobalFlowNet [15], an unsupervised video stabilization method, has been developed. It utilizes a foreground mask in preprocessing for robust homography-based motion estimation and employs low-level confidence features, enhancing the capture of consistent spatial correspondence.
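The homography-based motion model mentioned above can be sketched as follows, assuming OpenCV and matched feature points from any detector; the RANSAC threshold and function name are illustrative choices rather than the setting of any cited method.

```python
import cv2
import numpy as np

def warp_with_homography(frame, pts_src, pts_dst):
    """Homography-based motion model: fit a perspective transform to matched
    feature points (e.g., from AKAZE or KLT) and warp the frame onto the
    target view. RANSAC rejects outliers caused by moving objects."""
    H, inliers = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, 3.0)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```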
2.2. Video Completion
Full-frame video stabilization with motion inpainting addresses missing areas in stabilized videos by constructing image mosaics from neighboring frames. It uses local motion estimation to apply a global transformation only to the areas that frames have in common and computes optical flow between frames to remove unwanted motion fluctuations, thereby achieving stabilized motion paths. However, this method may produce visible artifacts in non-planar and dynamic scenes and can create visible color seams when combining colors propagated from different frames, owing to effects such as lighting changes, shadows, and vignetting.
Temporally coherent completion of dynamic video presents an automatic video completion algorithm that synthesizes missing areas in videos in a temporally coherent manner. Despite limitations in handling discrepancies caused by dynamically changing video frames and mismatched image-space motion vectors, this algorithm is well suited to dynamic scenes captured with moving cameras. It uses optical flow and color, matching colors temporally with pixel-wise forward/backward flow fields, although it may not accurately repaint regions using motion-based features alone. Moreover, in videos containing rapid movement, the flow is difficult to estimate accurately, which degrades the quality of the color completion.
Flow-edge-guided video completion addresses the traditional inability to synthesize sharp flow edges, which often yields over-smoothed results, by jointly synthesizing colors and flow and propagating color along flow trajectories to enhance temporal consistency. This approach alleviates memory issues, allows for high-resolution output, and avoids visible seams by operating in the gradient domain, performing video completion through dense flow fields. In some cases, such as with dynamic textures, optical flow estimation can be inaccurate, leading to visual artifacts. Additionally, image composition becomes challenging when large areas are obscured throughout the entire sequence.
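A heavily simplified sketch of flow-guided color propagation, in the spirit of the completion methods above, is shown below; it uses Farneback dense flow from OpenCV, assumes the flow in the missing region is usable, and is not the algorithm of any cited work (those methods additionally complete the flow field itself before propagating colors).

```python
import cv2
import numpy as np

def propagate_colors(target, neighbor, mask):
    """Flow-guided color propagation (toy version): fill the missing region of
    `target` (where mask > 0) with colors warped from a neighboring frame
    along the dense optical flow computed between the two frames."""
    g_t = cv2.cvtColor(target, cv2.COLOR_BGR2GRAY)
    g_n = cv2.cvtColor(neighbor, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g_t, g_n, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g_t.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(neighbor, map_x, map_y, cv2.INTER_LINEAR)
    out = target.copy()
    out[mask > 0] = warped[mask > 0]   # copy propagated colors into the hole
    return out
```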
2.3. View Synthesis and Rendering
Deep blending for free-viewpoint image-based rendering (IBR) addresses the numerous visible artifacts that traditional IBR produces when the viewpoint moves far from the input frames. It employs novel view synthesis with held-out real image data to learn blending weights for combining the contributions of the input photos. Accurate geometry is crucial for the CNN to find correct blending weights, yet blending directly in image space can still produce visible artifacts and glitches, especially when flow estimates are unreliable. For instance, flickering can occur in the resulting image when composing inputs with large or inconsistent lighting differences.
Free view synthesis overcomes the limitations of traditional methods that rely on camera grids and stereo matching, which restrict the layout of input views. It synthesizes free viewpoints from unstructured input images of general scenes by registering the input images with Structure from Motion (SfM) and computing 3D proxy geometry through Multi-View Stereo (MVS). Using depth maps and the 3D proxy geometry, it maps encoded features to the target view and blends them with an Encoder–Decoder network. Since it composes images on a frame-by-frame basis, it lacks temporal consistency, and visual artifacts occur when the proxy 3D model used for mapping misses significant parts of the scene.
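To illustrate the idea of blending several warped source views with an Encoder–Decoder network, a toy PyTorch sketch follows; the layer sizes and the simple channel-wise concatenation of warped sources are assumptions for illustration, not the architecture of the cited method.

```python
import torch
import torch.nn as nn

class BlendingNet(nn.Module):
    """Minimal encoder-decoder that blends several source images warped to the
    target view (stacked along the channel axis) into one output frame."""
    def __init__(self, num_sources=4):
        super().__init__()
        c_in = 3 * num_sources
        self.encoder = nn.Sequential(
            nn.Conv2d(c_in, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, warped_sources):       # (B, 3 * num_sources, H, W)
        return self.decoder(self.encoder(warped_sources))

# Hypothetical usage with four warped source views of size 256x256
net = BlendingNet(num_sources=4)
blended = net(torch.rand(1, 12, 256, 256))   # -> (1, 3, 256, 256)
```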
Out-of-boundary view synthesis towards full-frame video stabilization [16] significantly improves upon traditional grid-based and pixel-based warping methods through a two-stage coarse-to-fine approach, notably minimizing cropping in the boundary areas and reducing jitter. However, when dynamic objects move substantially between adjacent frames, accurately generating the out-of-boundary regions may be challenging due to discontinuities.
In addition, methods based on progressive fusion and temporal fusion have been explored to reduce distortion. Jiang et al. [17] proposed a Multi-Scale Progressive Fusion Network (MSPFN) to remove rain streaks of varying severity. They generated a Gaussian pyramid of rain images and employed a coarse fusion module with Conv-LSTM to capture global textures. A fine fusion module was then introduced to fuse correlated information in a cascading manner, forming progressive multi-scale fusion, and a residual module ultimately facilitated the generation of high-quality images. Xiao et al. [18] addressed the limited and difficult-to-extract information provided by individual frames by proposing temporal grouping projection fusion and Multi-Scale Deformable (MSD) convolution alignment. Temporal fusion regroups the continuously arriving frames into different poses, reducing the complexity of projection while allowing more complementary information to be learned from the frames. A multi-scale residual block then learns complex motion information for accurate frame alignment, and finally a temporal attention module generates images that remain highly consistent with the reference frame.
2.4. Video Stabilization Using IMU Sensors
Image deblurring using IMU sensors estimates the blur function from gyroscope and accelerometer data recorded during shooting [19]. Once the blur function is known, the image can be improved through non-blind deconvolution. Because the algorithm assumes a constant scene depth, blur estimation becomes inaccurate where real scenes contain depth variation. Digital video stabilization and rolling-shutter correction using gyroscopes measures camera motion with gyroscopes to perform stabilization and rolling-shutter correction efficiently; despite its robustness under poor lighting and significant foreground motion, it may introduce motion-induced visual artifacts. Deep online video stabilization using IMU sensors synthesizes stabilized images through deep motion estimation from IMU data. It identifies different motion types with a Deep Neural Network (DNN) classifier and employs Long Short-Term Memory (LSTM) [20] to extract temporal features, effectively removing shake while performing strongly across datasets with little time overhead, although it requires sufficient training data for accurate predictions. Deep online fused video stabilization uses both gyroscope data and image content in an unsupervised DNN: the network fuses motion representations that combine optical flow with real/virtual camera pose histories, and LSTM cells infer new virtual camera poses from which warping grids are generated to stabilize the frames. Although numerous studies continue to enhance stabilization performance using IMU sensors, the issue of cropping still persists.
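A minimal sketch of gyroscope-based motion compensation in the spirit of these methods is given below: angular velocity is integrated into a per-frame rotation and converted to a compensating homography under a rotation-only camera model. The intrinsic matrix, sample data, and composition order are illustrative assumptions, not those of any cited system.

```python
import numpy as np
import cv2
from scipy.spatial.transform import Rotation as R

def gyro_correction_homography(gyro, ts, K):
    """Integrate angular velocity (rad/s) over one frame interval into a
    rotation, then build the compensating homography H = K * R^T * K^-1 that
    undoes the rotation, assuming a purely rotating camera."""
    rot = R.identity()
    for w, dt in zip(gyro, np.diff(ts, append=ts[-1])):
        rot = R.from_rotvec(w * dt) * rot   # simplified sample-wise integration
    R_frame = rot.as_matrix()
    return K @ R_frame.T @ np.linalg.inv(K)

# Hypothetical usage: undo the rotation measured during one frame interval
K = np.array([[1200.0, 0, 640], [0, 1200.0, 360], [0, 0, 1]])
gyro = np.random.randn(10, 3) * 0.05        # placeholder gyroscope samples
ts = np.linspace(0.0, 0.033, 10)            # placeholder timestamps (s)
H = gyro_correction_homography(gyro, ts, K)
frame = np.zeros((720, 1280, 3), np.uint8)
stabilized = cv2.warpPerspective(frame, H, (1280, 720))
```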
2.5. Limitations
Conventional methods have several limitations. Firstly, shooting with cameras whose sensors use a rolling-shutter mechanism, in which rows are exposed and read out sequentially, results in stretched and shaky footage. Secondly, cropping occurs: the edges of the video are cut off, reducing the resolution, because of the warping process based on the pixel values between adjacent frames. Lastly, distortion arises when pixel values are corrupted by visual artifacts, including shakiness, blur, and wobble, caused by camera movement or vibration, making the scene appear twisted, elongated, and wavering.
5. Conclusions
The primary objective of our research was to generate visually useful imagery for Augmented Reality applications by preventing cropping through resolution preservation and by enhancing stabilization performance while minimizing degradation in stability and distortion. In pursuit of this goal, we focused on improving performance while adequately preserving execution speed, leading to a novel hybrid full-frame video stabilization algorithm based on dual-modality cross-interaction with neural rendering, not previously explored. Our method was evaluated using stability, distortion, and cropping metrics and demonstrated enhanced stabilization, especially when the IMU sensor was used to counter flow inaccuracies robustly. Overall, our method was the only one to place in the top two across the Stability/Distortion/Cropping (S/D/C) metrics and showed the most significant improvement in Turn environments. Furthermore, the visual quality associated with the Distortion value was compared quantitatively using the LPIPS, SSIM, and PSNR metrics, providing a detailed analysis. Combining the IMU sensor with the neural rendering technique increased the Stability score while maintaining the Distortion value, effectively reducing visual artifacts caused by shaking. Our technology has proven valuable in correcting video shake, contributing to accurate target detection and tracking.
The application of our technology is contingent upon the availability of IMU sensor values measured concurrently with the original video capture, which presents a limitation in terms of applicability to the vast array of videos available online. Despite this challenge, given that modern AR/VR devices and smartphones are inherently equipped with IMU sensors, leveraging these to acquire videos and applying our technology can result in visually superior, stabilized videos. Additionally, since the warping field depends on optical flow and the IMU sensor values are globally reflected across frames, visual artifacts may still persist at the video edges.
Looking forward, we propose three areas for future work. First, we aim to achieve better motion compensation by fusing the transformation matrices produced by deep learning-based optical flow, assigning greater weight to these matrices as the IMU sensor values increase; this fusion should also be applied locally within the frame to improve compensation at the video edges. Because motion estimation would then not rely solely on optical flow, we expect improved performance and fewer visual artifacts caused by large motion and environmental changes during warping. Second, we aim to implement better feature extraction by adding SE (Squeeze-and-Excitation) blocks, which recalibrate channel responses in a manner similar in spirit to self-attention. This should allow more accurate optical flow computation across the frame while reducing the parameter count for better efficiency without sacrificing accuracy. Third, to ensure the generalization of the proposed method, the dataset should be enriched with daytime/nighttime and sunny/rainy/foggy scenarios and the experiments extended. These directions underscore our commitment to refining the balance between stabilization, visual quality, and computational efficiency, thereby pushing the boundaries of video stabilization techniques.
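A standard SE block of the kind referred to above could look as follows in PyTorch; the channel count and reduction ratio are placeholder choices and not part of our current implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: global-average-pool ('squeeze') to a
    channel descriptor, then two small FC layers ('excitation') produce
    per-channel weights that rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

# Hypothetical usage on a 64-channel feature map
feat = torch.rand(1, 64, 48, 48)
reweighted = SEBlock(64)(feat)
```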