An Unsupervised Video Stabilization Algorithm Based on Key Point Detection

In recent years, video stabilization has improved significantly in simple scenes, but is not as effective as it could be in complex scenes. In this study, we built an unsupervised video stabilization model. In order to improve the accurate distribution of key points in the full frame, a DNN-based key-point detector was introduced to generate rich key points and optimize the key points and the optical flow in the largest area of the untextured region. Furthermore, for complex scenes with moving foreground targets, we used a foreground and background separation-based approach to obtain unstable motion trajectories, which were then smoothed. For the generated frames, adaptive cropping was conducted to completely remove the black edges while maintaining the maximum detail of the original frame. The results of public benchmark tests showed that this method resulted in less visual distortion than current state-of-the-art video stabilization methods, while retaining greater detail in the original stable frames and completely removing black edges. It also outperformed current stabilization models in terms of both quantitative and operational speed.


Introduction
With the growing popularity of mobile devices, more and more people are adopting mobile devices to record their lives, but unprofessional shooting often leads to the introduction of shakiness into the final video, which affects people's subsequent viewing and applications of the video, so more and more people are beginning to pay attention to video stabilization. Especially in complex scenes, the difficulty of video stabilization is increased by the influence of the interactions between the moving foreground and background in the video. It is therefore essential to develop high-performance video stabilization algorithms for shaky video in complex scenes.
The process of a video stabilization algorithm generally starts with the detection of key points, obtaining the unstable motion trajectory according to the key points, then smoothing the motion trajectory, and finally generating stable frames.
Traditional video stabilization methods have the following shortcomings: • Key-point detection relies on manual methods, such as ORB and SIFT. However, the detection will be inaccurate or fail when there are drastic changes in external conditions, such as light, darkness, large movements, and the video itself is not of high quality, etc. [1][2][3]; • Traditional methods of key-point tracking usually use optical flow or Kalman filtering to track the key points. This assumes that there is no significant motion between the two frames and that the grayscale between the two frames is unchanged. However, the scenes in real life are ever-changing, and this produces a large mismatch for complex videos [4]; • The traditional method of trajectory smoothing uses a fixed kernel of smoothing. However, because the trajectories are ever-changing, the use of a fixed kernel does not achieve the best stabilization effect [2,[5][6][7][8][9]. It can also be achieved by fitting to obtain the smoothed trajectories, but this method involves a high levle of error [10]; • Over-cropping the black edge of the generated frames results in the final frame retaining a smaller field of view and less detail [1,5,7].
DNN-based methods avoid the drawbacks of the traditional algorithm. However, there are still several problems: • There are few pairs of stable/unstable videos in the existing official datasets, which is not conducive to supervised models [11][12][13]; • Most of the existing stabilization models still use traditional methods to detect key points in the key-point detection stage, but the traditional methods do not detect the key points completely and many key points are lost. In the key-point detection stage of the video stabilization method using a DNN-based model, although the key points generated are abundant, the method does not consider the impact of redundant key points in a large untextured area [14]; • For generated frames, a fixed cropping method is used to remove black edges, which preserves less of the original frame content [11][12][13][14][15][16].
In this paper, an unsupervised video stabilization algorithm based on key-point detection is proposed to deal with the above problems. A DNN-based key-point detector was first used to detect key points. The key points and optical flow inside the untextured area with the largest area were optimized to obtain an accurate unstable motion trajectory. The motion trajectory was then placed into the smoother to adaptively generate a smoothing kernel to smooth the motion trajectory iteratively. Finally, based on the smoothed trajectory, a homography transformation was applied to generate a stable frame with black edges. For the processing of the black edges, in this study we used an adaptive cropping method to crop and retain the maximum area of the original frame.
The four contributions of this paper are as follows: • An unsupervised deep learning model is used for video stabilization, which does not require pairs of stable/unstable datasets; • A method is proposed to optimize key points and optical flow in large texture-free areas to obtain accurate non-stationary motion trajectories; • Adaptive cropping is proposed to retain the maximum field of view for the stable video, while not having a black edge; and • The model runs faster on the basis of improved model performance, which is more suitable for the requirements of video stabilization.

Related Work
In this section, we summarize traditional stabilization methods and DNN-based stabilization methods. Traditional stabilization methods include 2D and 3D methods.
2D methods. They use 2D transforms to describe the 2D motion trajectory in the video and stabilize the video based on these transforms. For example, MeshFlow [1] uses a traditional method for key-point detection, then uses Kalman filtering [17] to track key points and obtains motion trajectories. The motion trajectories are smoothed by optimizing the minimum energy function and stable frames are generated by means of grid-based warping. Liu et al. optimized camera trajectories by means of bundling to achieve motion smoothing, and generated stable frames based on the grid [5]. Wang et al. developed a stabilizer based on finding the relative position of key points [6]. The developers of the Steadyflow [7] method used Jacobi's method to smooth the motion trajectories. Matsushita Y et al. achieved video stabilization by interpolating and inpainting the video [8]. Grundmann M et al. achieved video stabilization by optimizing the camera path using L1 [9]. Hu R et al. used SIFT for keypoint detection, and these key points were used to estimate the motion trajectories of the camera [3]. The 2D method is relatively simple to implement, but its time complexity is high. However, when it comes to complex scenes, the method's processing capability for stabilizing video is relatively limited. This is because the focus when stabilizing a simple scene is only the jitter factor introduced by the camera, whereas in a complex scene one also needs to consider whether there is a moving foreground target in the video. At the same time, the 2D method only uses a fixed smooth kernel in the process of generating stable frames, but unstable trajectories are various, so it is incorrect to use a fixed smooth kernel.
3D methods. They find the camera trajectory in 3D space and smooth it to achieve video stabilization. Liu and Wang et al. used a depth camera for stabilization [2]. Liu and Gleicher et al. projected the 3D feature points onto a new camera location for stabilization [10]. Karpenko et al. used a gyroscope sensor for stabilization [18]. Zhang et al. stabilized video sequences based on a 3D perspective camera model [19]. Bhat P et al. modified the multi-view stereo algorithm to calculate the view-dependent depth for video stabilization [20]. 3D methods obtain more information and therefore perform better. However, the 3D stabilization algorithm has strict hardware conditions and limited applications.
DNN-based stabilization methods. They aim at the transformation of unstable frames to stable frames, whereas traditional stabilization methods aim at the obtainment of motion trajectories. Yu and Ramamoorthi et al. achieved the generation of stable frames by inputting the processed accurate optical flow into the network to generate the warping fields of stable frames [15]. The advantage of this method is that the optical flow is accurate. The disadvantage is that the post-processing of the optical flow leads to complex operation costs and long running time. Furthermore, optical flow over-patching can result in local distortion. Moreover, this method uses a manual key-point detector, which may fail to detect complex video or texture-free region cases. Xu et al. obtained the stable video directly using unsupervised learning models [14]. The advantages of this approach are its short running time, clear model logic, and the lack of a need for training data pairs. However, the disadvantage is that the key points are generated using a depth model, so redundant key points will be generated, especially in regions with relatively large untextured areas. Furthermore, the motion trend inside the large untextured area is different from the overall motion trend. This means that the final motion trajectories obtained are inaccurate, which introduces distortion. Wang et al. obtained the homography matrix of the grid through the use of a Siamese network, which generated stable frames [11]. This method has the advantage of using both unstabilized frames and frames that have been stabilized up to the current time as inputs to obtain the warping field until all stabilized frames are generated, which is performed cyclically, as it does not stabilize only one frame individually. The disadvantages of this method are that, firstly, many training data pairs are required, and secondly, the model only uses the simple homography matrix to generate stable frames, so the method will not work for videos with violent motion and close-up scenes. Choi et al. used interpolation for video stabilization but introduced artifacts [21]. The advantage of this approach is that the full-frame is stabilized, so there are no cropping problems with the final stabilized video obtained. The disadvantage is that the generated frames can introduce artifacts due to the need for multiple interpolations. The PWStableNet method [12] introduces a cascade structure to generate pixel-by-pixel warp fields for video stabilization. The advantage of this method is that it takes into account the fact that different pixels may have different distortions, so it stabilizes in a pixel-by-pixel manner in adjacent frames, and the use of a cascade structure is beneficial to learning the warp field more accurately. A disadvantage of this method is that it uses pixel-by-pixel stabilization, so the running time is relatively long, and it also requires a large number of training data pairs. It is not as effective as the stable grid-based model for complex video processing.
In contrast, the method proposed in this paper uses an unsupervised DNN-based model for video stabilization, so the requirements on the dataset are not strict. For videos of complex scenes, we focus on the problem of obtaining rich key points, while optimizing key points and optical flow over large untextured areas to generate more accurate motion trajectories. In this study, we used SuperPoint [22] to generate rich key points, then used gradient information to optimize key points and optical flow so that the final result would have less distortion. For the problem of black edges in the generated frames, an adaptive cropping method was used to ensure that the black edges were completely removed, while retaining more of the original information.

Pipeline
The overall structure of the algorithm is shown in Figure 1. The algorithm is divided into four stages: the coarse key-point detection and tracking stage, the accurate keypoint detection and accurate optical flow stage, the smooth motion trajectory stage, the stable frame generation and adaptive cropping stage.The above stages are described in Sections 3.1-3.4.

Coarse KeyPoint Detection and Tracking Stage
This stage is divided into two parts, one involves using SuperPoint to detect key points and return the matched key points, and the other involves using RAFT [23] to generate the optical flow iteratively. The stage is illustrated in Figure 2. This optical flow is then combined with the key points returned by SuperPoint to complete the tracking of the key points. In the optical flow stage, we refer to the RAFT method [23] developed by Zachary Tee et al., whereas in the key-point detection stage, we refer to the SuperPoint method of Daniel Detone et al. [22].
This stage takes adjacent frames I i and I i+1 , i∈ {0, C}, where C is the total number of frames. It uses SuperPoint for key-point detection to obtain kp i and kp i+1 , which are matched to obtain the matched pairs {m i , m i+1 }. Then I i and I i+1 are input into RAFT to obtain the optical flow F i . According to F i , the motion vector f i of m i , at the position corresponding to the key point in F i , is extracted. m i is added to the motion vector f i to obtain the position of the m i in relation to the next frame, which completes the tracking of the key points and obtains the motion trajectories of the key points. As shown in Figure 3, the reason for adopting SuperPoint was that SuperPoint is robust compared with manual key-point detection methods. Manual methods may miss many potential points of interest. Another problem is that SuperPoint produces too many key points, as shown in Figure 4. As a result, for videos with large untextured areas, redundant key points will be generated and distortion will be introduced. This problem is solved in the second stage of this paper.

Accurate Alignment and Motion Trajectory Stage
The second stage mainly involves obtaining accurate key points and optical flow, as well as obtaining the trajectory of the keypoints. This stage is divided into two parts. One is related to obtaining the accurate alignment, and the other is related to obtaining the motion trajectories. The accurate alignment portion of this stage includes the generation of the mask, ensuring the precision of the key points and the precision of the optical flow, as shown in Figure 5.

Mask Generation
From experiments, we know that key-point redundancy in large untextured areas and inaccurate optical flow inside large untextured areas may distort the generated stable frames. Thus, in our method, we obtain the mask of the i-th frame by means of a gradient, denotes the position of the pixel and G i denotes the average of the gradient of the i-th frame.
The visualization of M i according to the gradient, denoted by part 1 in Figure 5, shows that there are many burrs around the untextured area and irregularities, which will affect the subsequent processing. Therefore, the operation of dilation and erosion on M i is used to obtain M i .
Connecting small and non-continuous non-flat areas at small intervals via dilation, Equation (1) is as follows: and rejecting non-flat areas that are too small via erosion, Equation (2) is as follows: where the area around (x, y) is denoted by (x + x , y + y ) (|x | ≤ 1, |y | ≤ 1).

Accurate Key Points
The mask M i can be obtained from the mask generation stage. As shown in part 2 of Figure 5, the maximum untextured area in M i is obtained via calculation. We remove the key points within the maximum untextured area to obtain the accurate key points, as shown in Figure 6.

Accurate Optical Flow
The unprocessed optical flow is shown in Figure 7a. It can be noted that the optical flow within the untextured area is not consistent with that in the textured area, which will distort the generated frame in the local region. Next, PCAFlow [24] is used to fill in the optical flow in the removed area based on the optical flow in the unremoved area, so that the motion of the frame tends to be consistent and an accurate optical flow is obtained, as shown in Figure 7b.

Trajectory Generation
The final step is to obtain a mesh-based vertex motion trajectory. Due to the various levels of complexity of the video, in this study we used homography based on the separation of foreground and background and mesh-based vertex profiles to obtain the motion trajectories. In this study, the frames were divided into one grid every 16 pixels, with a total of A×B grids. According to d i , the m i value is then propagated to the nearby grid vertices v i according to Equation (3): where m i denotes the position of the key points, v i is the position of the mesh vertexes, and "| |" denotes an absolute value. The next two median filters are used for the motion trajectories obtained for each vertex. The first one aims at assigning each vertex a most representative motion vector and the second one aims at preventing the appearance of noise. The motion vector f i is then divided into two categories representing the background and the foreground of motion according to Equation (4): According to the above, {m i , m i+1 } are used in RANSAC to obtain two homographies, H b and H p . The local residual motion vectors for the foreground and background are obtained according to Equation (5): where × means matrix multiplication. According to Equation (6), the foreground and background motions are obtained:

Motion Trajectory Smoothing
The unstable trajectory is input into the encoder/decoder structure to generate a smooth kernel for the trajectory, and then the smooth kernel is applied to the iterative smoothing of the unstable trajectories, and finally, the stable trajectories are obtained.
The overall loss function is composed of three parts: temporal smoothness loss L t , shape-preserving loss L s , and content-preserving loss L c : According to Equation (7), L t is formed by minimizing the energy function: where C(t) denotes the camera path, t denotes time moment, P(t) denotes the optimized path, r denotes the radius (r ∈ 16), and v(t) denotes the motion vector of the mesh vertices. Suppose that {p,p} is the matching pair of key points from I i to I i+1 , and p consists of a closed grid including four vertices , according to Equation (8): where w p denotes the bilinear interpolation operation, a denotes the grid, and i denotes the i-th frame.
p can then be expressed using v 1 ia ,v 2 ia ,v 3 ia ,v 4 ia , then the shape-preserving loss L s is built on each grid, according to Equation (9): where R 90 = 0 1 −1 0 .
To prevent distortion around key points, and to avoid distortion between flat and non-flat areas, content-preserving loss L c is added here, according to Equation (10): where H g is the homography matrix of the grid containing p, and p is found as in Equation (8).

Adaptive Cropping
The fourth stage is the adaptive cropping stage. The stable frames generated in the previous stage have black edges. Here, we present an adaptive cropping method designed to crop the stable frames with black edges, cropping all the black edges while retaining more original frames.
The process of this stage first involves obtaining the binary image of the black edge of each frame according to the generated stable frame with a black edge. Then, the binary image is eroded. Finally, the global maximum inscribed rectangle is obtained according to the maximum inner contour of the black edge, and the video frame is cropped according to the inscribed rectangle. The overall processes are as follows.
Step1: Find the black-edge binary image, as shown in Figure 8. A specific binary image with only black edges is first obtained from the generated stable frames with black edges. Using a grid-based format, each grid generates its own homography matrix H g based on the motion vectors in the x and y directions. The original grid vertices generate the target-stabilized grid based on H g , which ultimately synthesizes the stable frames. Only the coordinates within 0-H and 0-W are the RGB values of the original frames. All other locations greater than W and H, and less than 0, are black edge locations, from which a specific binary image can be calculated. Step2: The "binary image " obtained from step 1 is eroded according to Equation (2).
Step3: Contouring the white area of the "binary image". This step involves finding the connected domain in the "binary image" with a pixel value of 255, and the same idea is used in the work of Satoshi et al. [25]. In a binary image, when the pixel values between adjacent pixels are different for the first time, this isconsidered to be the start of the boundary i s and is uniquely encoded to deal with the problem of creating multiple boundaries. When you find the the start-point i s , then you look for the end-point i e , which is used to determine whether the contour has ended.When the current point is found to belong to the contour, it is currently the center point i c . In the process of tracking, we look for the next point belonging to this contour in the 8-neighbourhood of the center point, which is the latest point, i n . i c and i n are continuously iterated and replaced to complete the generation of the boundary, as shown in Figure 9. Step4: The process of finding the maximum inner rectangle for the contour. The outline of the white region and the set of points of that outline can be obtained from step 3 . Then, according to the characteristics of the rectangle, we define the sum value of the coordinates of the upper left corner of the rectangle as the minimum, and the sum value of the coordinates of the lower right corner as the maximum. Using this rule, we find the coordinates of the two points diagonal to the rectangle using point set of the outline, so we can find the maximum inner rectangle of the outline.
The advantage of adaptive cropping is the ability to avoid the problem of the black edge in generating stable frames and to retain the maximum field of view of the original frames.

Experiments
We evaluated our model on the public benchmark dataset, the NUS dataset [5], and compared it with the current state-of-the-art models, both DNN-based methods and traditional methods. In this section we provide comprehensive quantitative and qualitative assessment results. Furthermore, during processing, it was found that most models could not handle high-resolution video. The model presented in this paper can handle video of any resolution adaptively.

Datasets
The NUS dataset includes a total of 174 unstable videos of varying duration (10-60 s), which are divided into six categories: Regular, Zooming, Running, QuickRotation, Crowd, and Parallax. The resolutions in the dataset are 640 × 360, 640 × 480, and 1280 × 720.

Experimental Settings
Since the experimental model in this paper was an unsupervised model, there was no need for a training dataset. We just input the unstable video data, implemented the model, and output stable video. The test set used for the model was obtained from the NUS dataset. We ran the model for this paper on a Tesla V100-SXM2 with a 32G GPU.
The results from the visualization can be decsribed as follows. Video1: Large untextured area (sky in the video). The MeshFlow [1] and Yu et al. [15] methods produced redundant black edges due to improper cropping. DUT [14] and Yu et al. [15] did not take into account the key points in the untextured area and the optical flow's inconsistency with the overall optical flow, resulting in significant distortion of the tower in the stable frame, as well as around the buildings. StabNet [16] excessively cropped the image and at the same time introduced more distortion as this method only handles regular two-dimensional motion.
Video2: Motion foreground. MeshFlow [1] produced too few key points, resulting in distortion in and around the moving object, as well as excess black edges. DUT [14] and Yu et al. [15] both produced excessive black edges due to improper cropping. DIFRINT [21] produced artifacts around the moving foreground target. DIFRINT [21] resulted in artifacts due to the use of multiple interpolations. StabNet [16] left only a small field of view due to excessive cropping.
Video3: QuickRotation. MeshFlow [1], DUT [14], and Yu et al. [15] all produced redundant black edges due to improper cropping. MeshFlow [1] did not adapt to fast rotation using only homography transforms and out-of-place warping. The DIFRINT [21] method uses interpolation and does not adapt to fast-moving objects-neither the camera itself nor the objects in the video. The motion trajectory of the feature points of this type of video is complex, so StabNet [16] did not adapt well and this led to distortion.
Video4: Crowd. MeshFlow [1] and Yu et al. [15] produced excessive black edges due to improper cropping. MeshFlow [1] produced distortion, resulting from inaccurate estimation due to the use of homography transformations alone, whereas the results of the method of Yu et al. [15] showed distortion due to the inaccurate optical flow at the filled edges. DIFRINT [21] still produced artifacts due to interpolation. The results produced by StabNet [16] were still distorted due to the fact that this method is not able to adapt well to complex motion.

Evaluation Metrics
To fairly assess the quality of the model, three evaluation metrics are introduced here: the cropping ratio, distortion ratio, and stability score.
The cropping ratio is the area of the generated frame remaining after the black border has been completely removed. The larger the ratio, the better. The calculation is based on the homography matrix M, obtained by matching the key points of the original and stable frames. The cropping ratio cp of each frame is obtained from the scale component of M.
The final crop ratio is equal to the average of all cp values, according to Equation (11): The distortion ratio is the degree of distortion of the stable frame. A higher value means that a better shape of the original frame has been maintained. The distortion rate is equal to the minimum value of the ratio of the two smallest eigenvalues λ 1 and λ 2 among all the eigenvalues of M in all frames, and the distortion fraction is equal to the minimum of all λ 1 λ 2 ratios, according to Equation (12): λ 2 , λ 3 , . . . . . . , λ n = linalg . eig(M)  λ 1 , λ 2 = min((λ 1 , λ 2 , λ 3 , . . . . . . , λ n where linalg.eig() represents the eigenvalue operation. The stability score is the degree of smoothness of the stabilized video-the larger the better. We select the average of the percentage of translational components and the average of the percentage of rotational components over the frame, the minimum of both is called stability score, according to Equation (13): where Theta denotes the rotational component, Trans denotes the translational component, and fft() denotes the Fourier transform function.

Quantitative Results
We evaluated our model and state-of-the-art video stabilization methods on the public benchmark dataset NUS, where the models being compared were Meshflow [1], Liu et al. [5], StabNet [16], and DUT [14], Yu et al. [15]. The results of the evaluation of these quality metrics in the demonstrated approach are shown in Tables 1-3. Table 1 shows a comparison of the cropping ratio. In the NUS dataset, the method presented in this paper maintained a similar cropping ratio to that obtained using the methods of Liu et al. [5] and Yu et al. [15]. Meanwhile, compared with the methods of Meshflow [1], DUT [14] and StabNet [16] based on the "one size fits all" cropping method, the cropping ratio obtained using the method presented in this paper has already improved upon the cropping ratio of many other methods.  Table 2 shows a comparison of the distortion ratio. Our model, using the NUS dataset, resulted in less distortion. Compared with MeshFlow's transform based on global homography [1], the transform based on the foreground and background separation introduced less distortion. Compared with DUT [14], which also uses a DNN-based key-point detector, the problem of key-point redundancy caused by a large untextured area was considered in our experimental process, and the key points in a large untextured area were removed. Furthermore, PCAFlow was used to fit the optical flow, which introduced less distortion. Compared to StabNet [16] which also generates stable frames based on grids , StabNet [16] was not unsupervised and DeepSta datasets were used in the training. This dataset is too small and contains a single scene, so the training was not accurate enough. Compared with the method of Yu et al. [15], which also filled the flat area, Yu et al. [15] filled the boundary between the untextured area and the textured area for a second time, which was overfilled and also produced deformation and distortion. When the object in the video was too close to the camera or had a large occlusion or there were too many people in the video, the motion trajectory optimization method of Liu et al. [5] did not follow its motion trend, resulting in poor results, including distortion and artifacts.  Table 3 shows the stability score obtained in our experiments. In the NUS dataset, our model obtained a high stability score compared to those of the other models. Our model was the best on the whole, especially in Running, Zooming, QuickRotation, and Parallax categories, and in Regular and Crowd the method was ranked second. Our model performed better than Meshflow [1], StabNet [16], and DUT [14], which are also based on the use of a grid. Compared with DUT [14], which also adopts unsupervised learning, the stability score of our model exceeded that of DUT [14] on the NUS datasets. Compared with the method of Liu et al. [5], in regard to smoothing the path of the bundled camera, our model showed superior performance. Compared to the model of Yu et al. [15], which uses optical flow as its input, our model showed superior performance in the Running, Zooming, QuickRotation, and Parallax categories. However, in the Crowd and Regular categories, our model performed slightly worse than that of Yu et al. [15], because, for some videos, the flat area was complicated. For example, when the flat area in the video was very fragmented, our model did not deal with such videos well. The data in Table 3 are the average values of different models obtained for different classes. If videos were compared on a one-to-one basis, there would be videos with large gaps in their stability scores in each category for the different models. Because the stability score of the gap is relatively small, it is generally not obvious in the subjective results. The stability score is related to many factors, including video content, video brightness and darkness, etc. Thus, if the difference between the stability scores obtained for a single video is small, this is not visually obvious, but if the difference between the stability scores of a video is large, it is visually obvious. So the average represents the average situation of a certain category. Table 4 shows a comparison of the different state-of-the-art methods in terms of their runtime, including traditional and DNN methods. Under the same experimental conditions, the runtime of the model presented in this paper was the shortest compared to the other models. Our model includes fewer operations, so it has a lower runtime. It was found through our experiments that the most time-consuming part of the process the stage involving the generation of the trajectory and obtaining the optical flow. During the trajectory generation process, each key point should be processed separately, and the key points should be propagated to the grid vertices and processed on the grid vertices. The overall time required was as follows:

Comparison of Runtimes
Grid-based time consumption can be expressed as in Equation (14): where n is the number of key points. Pixel-based time consumption can be expressed as in Equation (15): where H and W are the resolutions of the videos.

Runtime per Frame
MeshFlow [1] 2576 ms Liu et al. [5] 1360 ms DUT [14] 198 ms Yu et al. [15] 570 ms ours 134 ms In the trajectory generation stage, we constructed a Boolean matrix by matricizing all operations so that the index could be used to determine whether the current point was required and to extract the corresponding movement deviation based on the index. The extraction of the optical flow was also found to be a time-consuming step when using RAFT. RAFT itself was found to be a means of generating optical flow via iteration, The accuracy of optical flow had little influence on the experimental results between 20 and 10 iterations, so the number of iterations was adjusted to 10, as shown in Figure 11.

Conclusions
In this paper we present a method for video stabilization using an unsupervised approach. The accurate motion trajectory of key points was obtained by post-processing the key points and the optical flow in the maximum flat area. The trajectory was also smoothed using a deep learning approach. Finally, for the generated frames with black edges, an adaptive cropping method was used to crop them so that the original frame details were preserved as much as possible. On the benchmark dataset, we verified the superiority of our model over other state-of-the-art methods, finding that it had less distortion and larger stability scores, minimizing black edges while maximizing the preservation of the original frames. In terms of computational efficiency, our method was more efficient than the other methods tested.
The model presented in this paper is not yet an end-to-end model, so we hope to improve it into an end-to-end model. Moreover, to further improve the performance of the model presented in this paper, we hope that online stabilization can be implemented.