Robust Video Stabilization Using Particle Keypoint Update and l1-Optimized Camera Path

Acquiring stabilized video is an important issue for various types of digital cameras. This paper presents an adaptive camera path estimation method using robust feature detection to remove shaky artifacts in a video. The proposed algorithm consists of three steps: (i) robust feature detection using particle keypoints between adjacent frames; (ii) camera path estimation and smoothing; and (iii) rendering to reconstruct a stabilized video. The proposed algorithm estimates the optimal homography by redefining important feature points in flat regions using particle keypoints. In addition, stabilized frames with fewer holes can be generated from the optimal, adaptive camera path that minimizes a temporal total variation (TV). The proposed video stabilization method is suitable for enhancing the visual quality of various portable cameras and can be applied to robot vision, driving assistance systems, and visual surveillance systems.


Introduction
The demand for compact, portable cameras is growing rapidly because of the popularity of consumer hand-held cameras that are small and easy to handle, such as mobile cameras, digital cameras, digital camcorders, drone cameras, and wearable cameras. With the advancement of cloud services, acquiring high-quality video has become more important for sharing content without the barriers of time and space. However, video sequences are subject to undesired vibrations due to camera shaking caused by poor handling and/or a dynamic, unstable environment. To overcome this problem, various video stabilization methods have been developed to improve the visual quality of hand-held cameras [1]. A mechanical video stabilization system controls the camera vibrations using a gyro sensor or accelerometer. It either moves the lens to change the light path and the optical axis or uses an internal sensor to minimize the shaky motion. In spite of their high performance, mechanical and optical video stabilizers are not suitable for portable cameras because of the increased volume and cost of the system. On the other hand, an image processing-based video stabilizer can efficiently remove the movement of video frames without the extra cost of additional hardware devices.
An image processing-based video stabilization method generally consists of two steps: (i) removing undesired motion by smoothing the camera path and (ii) rendering the stabilized frames [2]. Existing video stabilization systems can be classified by the camera path estimation method. Early two-dimensional (2D) stabilization methods used the block matching algorithm to estimate inter-frame motion vectors. Jang et al. estimated the optimal affine model between adjacent frames by using a variable block size [3]. Xu et al. proposed a video stabilization algorithm using circular block matching and least squares fitting [4]. Since the 2D block matching-based methods can easily estimate the camera path, they are used in various applications [5]. However, they are sensitive to noise and produce matching errors between video frames acquired in a dynamic environment. Improved 2D video stabilization methods used optical flow to estimate the global camera path. Chang et al. used the Lucas-Kanade optical flow estimation algorithm to define an affine motion model between frames, and stabilized the camera path by motion compensation [6]. Matsushita et al. estimated the camera path using the homography between adjacent frames and smoothed the global path using a Gaussian kernel [7]. Xu et al. used the Horn-Schunck optical flow estimation algorithm to compute an affine model between successive frames and smoothed the camera path with a model-fitting filter [8]. Although optical flow-based stabilization methods can compute an affine motion model in a simple, flexible manner, they fail to stabilize multiple objects at different distances at the same time. To improve the quality of stabilized video, an alternative approach used feature points to estimate a rotation- and scale-invariant camera path. Battiato et al. used the scale-invariant feature transform (SIFT) to estimate the camera path and reduced the estimation error using the least squares algorithm [9]. Lee et al. used trajectories of SIFT feature points to estimate the camera path and minimized an energy function to smooth the camera path while reducing geometric distortion [10]. Xu et al. estimated motion parameters of the affine model using the features from accelerated segment test (FAST) algorithm for video stabilization [11]. Nejadasl et al. stabilized a calibrated image sequence using the Kanade-Lucas-Tomasi (KLT) tracker and SIFT [12]. Cheng et al. presented motion detection using the speeded up robust features (SURF) and a modified random sample consensus (RANSAC) for video stabilization [13]. To define a more powerful 2D camera model, locally estimated camera paths have been proposed. Liu et al. modeled mesh-based 2D camera motion with bundled camera paths to improve the video stabilization performance [14], and Kim et al. classified background feature points using the KLT tracker [15]. Although 2D video stabilization methods are fast and robust because they use a linear transformation, they fail to estimate the optimal camera path in textureless regions.
Currently, 3D camera motions are estimated based on the image segmentation result to improve the quality of a video. Liu et al. proposed a 3D video stabilization method using structure from motion and spatial warping to preserve 3D structures [16]. Zhou et al. generated labeled frames using a 3D point cloud and estimated the homography of each label to reduce distortion in textureless regions [17]. The 3D stabilization methods can generate higher quality results and are suitable for accurate video analysis [18,19]. However, they are hard to implement in real-time or near real-time services because of their high computational complexity, and they share the problem of parallax caused by feature tracking failure in flat regions.
To solve these problems, this paper presents a novel video stabilization algorithm using a robust feature detection method to improve existing 2D methods instead of the less robust 3D methods. The proposed algorithm redefines important feature points using particle keypoints. The homography is accurately estimated by detecting robust particle keypoints. Undesired motions are removed by minimizing the temporal total variation of the camera path. As a result, the proposed method significantly increases the visual quality of shaky video acquired by a handheld camera. This paper is organized as follows. Section 2 presents the theoretical background of video stabilization. Section 3 presents video stabilization based on robust feature extraction and matching, and Section 4 presents the optimal camera path estimation. Experimental results are given in Section 5, and Section 6 concludes the paper.

Theoretical Background
Digital video stabilization plays the important role of a stabilized sensor in acquiring high-quality video while preserving information for visual perception. A portable or wearable camera produces jitter and an undesired camera path because of unstable video acquisition environments with camera shaking. Specifically, we can observe the geometric distortion of the video due to the mislocation of the pixels as shown in Figure 1a. The camera path is not consistent with the camera coordinate system from the world coordinate system's point of view. Since a perspective distortion is generated by undesired camera motion and rotation, the geometric transformation in the sensor output generates unstable video frames. For that reason, the proposed video stabilization algorithm compensates for the perspective distortion caused by the transformation of the acquired video as shown in Figure 1b. The shaky video can be considered as a geometrically transformed version of the ideally stable video. The relationship between feature points in the original and the shaky frames is defined in homogeneous coordinates as $q = Hp$ (Equation (1)), where $H$ represents the homography, $p = [x, y, 1]^T$ a feature point in the original frame, and $q = [\hat{x}, \hat{y}, 1]^T$ its corresponding point in the shaky frame. The homography is generally estimated using the correspondences between adjacent frames. Although state-of-the-art feature extraction algorithms can detect distinguishable keypoints regardless of scale change, rotation, and brightness change, these methods fail to estimate an accurate homography for images that include a large flat region without any salient texture. An incorrectly estimated homography produces an erroneous camera path and significantly degrades the performance of video stabilization.
To solve these problems, we extract robust feature points to estimate the optimal homography of the textureless region. By updating important feature points in flat regions using the particle keypoint, the proposed method can significantly remove undesirable jitter using the optimally estimated homography in the entire image. The proposed method can also improve the visual quality without expensive optical devices by reconstructing stable video with a significantly reduced perspective distortion.
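As an illustration of the relation between a feature point p and its correspondence q through the homography H (read here as q = Hp, up to the homogeneous scale), the following minimal Python sketch maps one point through a 3 × 3 matrix. The function name `apply_homography` and the example matrix `H_shake` are hypothetical and are not taken from the proposed implementation.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    p = (x, y, 1.0)  # p = [x, y, 1]^T
    qx, qy, qw = (sum(H[r][c] * p[c] for c in range(3)) for r in range(3))
    if abs(qw) < 1e-12:
        raise ValueError("point maps to infinity")
    # Divide by the homogeneous scale to return to pixel coordinates.
    return (qx / qw, qy / qw)


# Hypothetical shake: the frame is shifted 2 pixels right and 1 pixel down.
H_shake = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0]]
q = apply_homography(H_shake, (10.0, 20.0))
```

A pure translation leaves the third homogeneous coordinate at 1; a general homography changes it, which is why the final division is needed.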

Feature Extraction and Matching for Robust Video Stabilization
The proposed video stabilization method estimates the optimal camera path of a video of a certain length by redefining robust feature points, and it is an extended version of Jeon's work [20]. Figure 2 shows the block diagram of the proposed video stabilization method. The proposed algorithm consists of three steps: (i) robust feature detection, (ii) estimation of the camera path, and (iii) rendering to reconstruct a stabilized video. Given a pair of input shaky video frames $f_{t-1}$ and $f_t$, the flat region map is generated. The FAST and BRIEF keypoints $X^{FB}_{t-1}$ and $X^{FB}_t$ are extracted in $f_{t-1}$ and $f_t$, respectively. The particle keypoints in the two frames, $X^P_{t-1}$ and $X^P_t$, are generated using statistical analysis of the extracted FAST and BRIEF keypoints in flat regions. After that, the global camera path $C_t$ is estimated from the optimal homography $H_t$, and the smoothed camera path $P_t$ is then estimated using a variational method. As a result, the stabilized frame $\hat{f}_t$ is obtained using the estimated camera path.

Flat Region Map Generation for Feature Extraction
Conventional video stabilization methods enhance the quality of a consumer video by estimating and smoothing the global path. Existing video stabilization methods assumed that temporally adjacent frames are related by a homography, which is robust to camera transformation, and the global camera path can be easily estimated using the geometric transformation. The global camera path is estimated by matching feature points that are robust to a geometric transformation. However, existing methods fail to detect feature points in a flat region. In addition, an inaccurately estimated homography in a textureless region further degrades the stabilization performance. In order to solve this problem, the proposed method generates the flat region map and the optimal camera path by redefining important keypoints in a flat region.
A textureless region is extracted using the flat region map. Spatially smoothed frames are obtained by convolving the shaky frames $f_{t-1}$ and $f_t$ with a 3 × 3 Gaussian low-pass filter. Each frame is divided into flat and active regions using the absolute difference between the original frame and its smoothed version. The estimated flat region map is then used to redefine robust feature points. Figure 3 shows the t-th original shaky frame and the corresponding flat region map.
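The flat region map described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel weights (1-2-1 binomial, normalized by 16), the border replication, and the `threshold` value are not given in the text, and the function names are hypothetical.

```python
# 3x3 Gaussian-like kernel; the exact weights are an assumption.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def gaussian_smooth(frame):
    """Convolve a grayscale frame (list of rows) with the 3x3 kernel, replicating borders."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += KERNEL[dy + 1][dx + 1] * frame[yy][xx]
            out[y][x] = acc / 16.0
    return out


def flat_region_map(frame, threshold=2.0):
    """Binary map: True where |frame - smoothed| is below the (assumed) threshold."""
    smooth = gaussian_smooth(frame)
    return [[abs(f - s) < threshold for f, s in zip(frow, srow)]
            for frow, srow in zip(frame, smooth)]
```

Pixels marked True belong to the flat region, where few FAST keypoints are expected; the remaining pixels form the active region.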

Robust Feature Matching between Adjacent Frames
Matching features between temporally adjacent frames is very important for understanding the geometric relationship between two frames and for detecting specific objects in video [11,21]. Various feature detection methods have been proposed and widely applied to detect a common region in two images [22]. Harris et al. proposed a seminal model to detect corner points where shifting a local window in any direction yields a large change in appearance [23]. Lowe proposed the scale-invariant feature transform (SIFT), which generates an image pyramid using the difference of Gaussians (DoG) and detects keypoints at the local maxima in the image pyramid [24]. Although SIFT can detect scale- and rotation-invariant feature points, its computational complexity is a bottleneck for video applications. To solve this problem, Bay et al. proposed the speeded up robust features (SURF), which use approximated filters and integral images to reduce the processing time [25]. Recently, a number of intensity-based feature point detection algorithms have been proposed. Rosten et al. proposed a faster corner detection algorithm using an accelerated segment test, which is called FAST [26]. Calonder et al. proposed a simple description method using the binary robust independent elementary features (BRIEF), which compares image intensities of sampling pairs [27]. More binary descriptors were proposed using a special sampling pattern to compensate for the orientation of keypoints [28,29].
The proposed method combines FAST and BRIEF for fast, accurate extraction of feature points. FAST extracts feature points by comparing the intensity of a candidate pixel with those of the 16 neighborhood pixels on a circle around it. The candidate pixel $p$ is classified as a corner if the intensities $I_{p \to x}$ of $n$ contiguous neighborhood pixels are all brighter than the intensity $I_p$ of the candidate pixel, or all darker than $I_p$. To order the neighborhood pixels by the amount of information they carry about whether the candidate pixel $p$ is a corner, a decision tree classifier is trained using the iterative Dichotomiser 3 (ID3) algorithm. The keypoint $p$ is defined as in Equation (2), where $x$ represents the neighborhood pixel selected by the decision tree using the ID3 algorithm, and $t$ the threshold for comparing intensities. We used $t = 0.2$, which gave the best experimental results. BRIEF identifies local feature points by comparing intensities of sampling pairs. The homography can be computed very efficiently because binary strings can be matched using the Hamming distance, which is computed with the XOR operation. The FAST keypoints are extracted from two adjacent video frames $f_{t-1}$ and $f_t$ to determine the distribution of random particle keypoints. The descriptors are generated using BRIEF and matched using the Hamming distance. The extracted FAST and BRIEF keypoints are denoted as in Equation (3), where $\mu$ and $\sigma$ respectively represent the mean and standard deviation of the distribution. The descriptor matches the frames in the sense of the distance between the particle keypoints and the FAST and BRIEF keypoints. The descriptor $D_t$ of the t-th frame is defined in Equation (4). Final correspondences are matched using the sum of squared differences (SSD) of the descriptors of the two frames. The descriptor is used to match robust keypoints in the flat region using particle keypoints. Finally, the optimal homography $H_t$ is estimated using random sample consensus (RANSAC) to eliminate outliers [30]. RANSAC defines the optimal geometric model between two images by repeating random sampling of matched points. Figure 4 shows feature detection results using the proposed method. Figure 4a shows matched points using SIFT with RANSAC, and Figure 4b shows matched points using SURF with RANSAC. Figure 4c shows matched points using FAST and BRIEF, and Figure 4d shows the results using the proposed particle keypoints. As a result, the particle keypoints provide robust feature points over the whole image, including the flat region.
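The two building blocks named above, the FAST segment test and Hamming-distance matching of binary descriptors, can be sketched as follows. The sketch makes several assumptions that the text does not confirm: intensities are normalized to [0, 1] (so that t = 0.2 is meaningful), n = 9 contiguous pixels are required, the ID3 decision tree is omitted in favor of a direct test, and descriptors are stored as Python integers. All function names are hypothetical.

```python
def is_fast_corner(center, ring, t=0.2, n=9):
    """Segment test on the 16 circle pixels `ring` (listed in circular order).

    Returns True if n contiguous ring pixels are all brighter than center + t
    or all darker than center - t.
    """
    brighter = [v > center + t for v in ring]
    darker = [v < center - t for v in ring]
    for flags in (brighter, darker):
        run = 0
        # Walk the ring twice so that runs wrapping past the start are counted.
        for flag in flags + flags:
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


def hamming_distance(a, b):
    """Hamming distance between two binary descriptors: popcount of a XOR b."""
    return bin(a ^ b).count("1")


def match_descriptors(desc_prev, desc_curr):
    """Nearest-neighbor matching; returns (index_prev, index_curr, distance) tuples."""
    matches = []
    for i, d in enumerate(desc_prev):
        j = min(range(len(desc_curr)),
                key=lambda k: hamming_distance(d, desc_curr[k]))
        matches.append((i, j, hamming_distance(d, desc_curr[j])))
    return matches
```

In a full pipeline the resulting matches would still contain outliers, which is why the text passes them to RANSAC before the homography is accepted.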

Estimation of the Optimal Camera Path
Traditional video stabilization methods use a moving average or Gaussian filter to smooth the camera path. The moving average filter smooths the camera path using the temporal mean of neighboring frames. The Gaussian kernel can remove undesired motion using the global transformation [7]. However, these methods fail to track a sharp change of the camera path. Furthermore, the performance of video stabilization degrades as the cropped region and the amount of distortion increase. To solve this problem, the proposed method adaptively smooths the camera path using a 1D TV algorithm [31]. A hole is an empty region in a video frame that is generated after the frame is moved along the smoothed camera path. To compensate for the holes, the boundary region of a stabilized video is generally cropped out, and the remaining central region is enlarged to fill the original size of the video frame. Therefore, it is important to minimize the hole region to preserve the original contents. The stabilized video has fewer holes since the TV method preserves the original path while removing undesired outliers.
Given the optimal homography $H_t$ between $f_{t-1}$ and $f_t$, a global camera path $C_t$ is generated. The corner points, denoted as $V_t = \{(1, 1), (1, h), (w, 1), (w, h)\}$ in the $w \times h$ input shaky frame $f_t$, are transformed to $\hat{V}_t$ by $H_t$. $H_t$ can be regarded as the transformation matrix of the camera movement. Therefore, the camera motion between $f_{t-1}$ and $f_t$ is simply considered as the difference between $V_t$ and $\hat{V}_t$. The global camera path $C_t$ is computed by accumulating the movement between adjacent frames as given in Equation (5), where $\hat{V}_t = H_t V_t$. The estimated global camera path $C_t$ is smoothed by 1D TV for video stabilization. The energy function for the smoothed camera path $P_t$ is defined in Equation (6), where $A$ is the temporal difference matrix and $\lambda$ the weight coefficient for smoothing. The first term of Equation (6) keeps the smoothed camera path close to the original path, and the second removes noisy motions by smoothing the camera path. The energy function of Equation (6) can be minimized by the iterative clipping algorithm. Figure 5 shows the estimated camera path using the proposed method. Figure 5a shows the x-coordinates of the original camera path in the dotted curve and the smoothed path using the moving average filter in the solid curve. Figure 5b shows the y-coordinates of the original camera path in the dotted curve and the smoothed path using the moving average filter in the solid curve. Figure 5c shows the x-coordinates of the original camera path in the dotted curve and the smoothed path using the proposed method in the solid curve. Figure 5d shows the y-coordinates of the original camera path in the dotted curve and the smoothed path using the proposed method in the solid curve. The proposed method can smooth the camera path without undesirable jitter or delay.
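The path accumulation and 1D TV smoothing steps can be sketched as follows. Since the exact form of the energy function is not reproduced here, the sketch assumes the common formulation 0.5·||C − P||² + λ·||A P||₁ with A the first-order difference operator, and minimizes it with an iterative clipping scheme on the dual variable; the factor 0.5, the step parameter `alpha = 4`, and the iteration count are assumptions, and all function names are hypothetical. One coordinate of one corner is treated as a 1D signal.

```python
def accumulate_path(motions):
    """Cumulative sum of inter-frame displacements (one coordinate of one corner)."""
    path, total = [], 0.0
    for m in motions:
        total += m
        path.append(total)
    return path


def diff(x):
    """Forward difference A x, one element shorter than x."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def diff_transpose(z):
    """Apply A^T to z, returning a vector one element longer than z."""
    out = [0.0] * (len(z) + 1)
    for i, zi in enumerate(z):
        out[i] -= zi
        out[i + 1] += zi
    return out


def tv_smooth(path, lam, iterations=500, alpha=4.0):
    """Approximately minimize 0.5*||path - p||^2 + lam*||A p||_1 by iterative clipping."""
    y = [float(v) for v in path]
    z = [0.0] * (len(y) - 1)  # dual variable, one entry per temporal difference
    for _ in range(iterations):
        x = [yi - di for yi, di in zip(y, diff_transpose(z))]
        # Gradient step on the dual, then clip to the box [-lam, lam].
        z = [min(max(zi + dxi / alpha, -lam), lam) for zi, dxi in zip(z, diff(x))]
    return [yi - di for yi, di in zip(y, diff_transpose(z))]


# Hypothetical jittery horizontal motion between successive frames.
motions = [0.0, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 0.0]
path = accumulate_path(motions)
smooth = tv_smooth(path, lam=1.0)
```

A larger `lam` flattens the path more aggressively, while `lam = 0` returns the original path unchanged, which mirrors the trade-off between the two terms of the energy function.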
The final step of video stabilization is to reconstruct geometrically transformed frames using the smoothed camera path. The smoothed homography $\hat{H}_t$ can be estimated from the difference between the original camera path $C_t$ and the smoothed path $P_t$ as given in Equation (7), where $V_t$ represents the four corner points of the image. The stabilized video frame $\hat{f}_t$ is generated by warping $f_t$ with $\hat{H}_t$ as $\hat{f}_t = \hat{H}_t f_t$.
As a result, the proposed video stabilization method can successfully generate a stabilized video by estimating the optimal homography.
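One way to obtain a homography from four corner displacements is to solve the standard eight-unknown linear system with the last matrix entry fixed to 1. The sketch below does this with plain Gaussian elimination. It assumes that the correcting warp moves each corner of $V_t$ by the per-corner difference $P_t - C_t$; this reading, the sign convention, and all function names are assumptions rather than the formulation used in the paper.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def homography_from_corners(src, dst):
    """Homography (h33 = 1) mapping four source points onto four destination points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def correction_homography(w, h, offsets):
    """Warp taking the corners V_t to V_t + offsets; offsets[i] = (dx, dy) is the
    assumed difference P_t - C_t for corner i."""
    corners = [(1, 1), (1, h), (w, 1), (w, h)]
    moved = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(corners, offsets)]
    return homography_from_corners(corners, moved)


# Hypothetical correction for a 640 x 480 frame: shift every corner by (3, -2).
H_hat = correction_homography(640, 480, [(3.0, -2.0)] * 4)
```

Rendering the stabilized frame would then resample the input frame through this matrix; using only four corner points keeps the per-frame cost of this step small.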

Experimental Results
This section presents experimental results and compares the performance of the proposed and existing methods. The proposed method improves the video quality by estimating the optimal homography using the particle keypoint update. To verify the accuracy of the estimated homography $H_t$ between temporally adjacent frames $f_{t-1}$ and $f_t$, we tested the projective transformation matrices estimated by four different feature extraction methods: SIFT, SURF, FAST+BRIEF, and the proposed method. We used the SIFT and SURF algorithms with the threshold values used in [24,25], respectively. The proposed algorithm uses the intensity threshold $t = 0.2$ for FAST and a 256-bit string for the BRIEF descriptor. After extracting feature points between $f_{t-1}$ and $f_t$, each transformation matrix is estimated. By combining all correspondences from the four methods, we evaluated the motion errors between the correspondences using the l1-norm error of Equation (8), where $\hat{X}_t = H_t X_{t-1}$ represents the transformed feature points of the previous frame $X_{t-1} = \{(x^1_{t-1}, y^1_{t-1}), \cdots, (x^n_{t-1}, y^n_{t-1})\}$, and $X_t = \{(x^1_t, y^1_t), \cdots, (x^n_t, y^n_t)\}$ the feature points in the current frame. Table 1 summarizes the error of the estimated homography using the four feature detection algorithms. The proposed method estimates a more accurate homography than the other feature extraction methods as shown in Table 1.
Figure 6a shows the 80th, 81st, and 82nd frames in the original shaky video, and Figure 6b the corresponding frames stabilized using the feature-based global camera path smoothing method [7], which cannot avoid geometric distortion on the boundary because of the inaccurately estimated homography. The distortion is easily seen in the vertical structure on the right side of each frame. The bundled path algorithm fails to warp the textureless blocks at the bottom of the frame as shown in Figure 6c [14].
On the other hand, the proposed particle keypoint-based method can significantly enhance the shaky video with less geometric distortion on the boundary as shown in Figure 6d.
Figure 6. Experimental results of various video stabilization methods: (a) the input shaky video frames (80th, 81st, and 82nd frames), (b) the stabilized video using the global camera path using feature detection [7], (c) the bundled path algorithm [14], and (d) the proposed method.
Figure 7 shows an enlarged version of the upper right region of Figure 6 for clearer comparison. The long object at the right side of each image should be observed carefully. Figure 7a shows the enlarged images of three temporally adjacent frames in the original shaky video, and Figure 7b shows the stabilized video with geometric distortion produced by the feature-based global camera path smoothing method [7]. As shown in Figure 7c, the video stabilization method based on the bundled path could not successfully stabilize the video [14]. On the other hand, the proposed stabilization algorithm considerably improves the video quality while preserving the contents. Figure 8 shows the difference between two temporally adjacent frames. Figure 8a shows the differences of three pairs of original frames {(79, 80), (80, 81), (81, 82)}. Figure 8b shows the differences of three pairs of stabilized frames {(79, 80), (80, 81), (81, 82)}. As shown in Figure 8, the proposed method can significantly compensate for the undesirable movements.
Figure 7. Experimental results of various video stabilization methods: (a) the expanded shaky video frames (80th, 81st, and 82nd frames), (b) the stabilized video using the global camera path using feature detection [7], (c) the bundled path algorithm [14], and (d) the proposed method.
To evaluate the empty region caused by frame registration for stabilization, we compared the results of the proposed stabilization method and the YouTube stabilizer using the same test video as shown in Figure 9.
Stabilized frames are cropped to eliminate the missing boundaries, so a low cropping ratio is important for preserving the significant region of the original image. To measure the amount of cropping in various stabilization methods, tick marks are inserted on the diagonal line in the 80th input frame as shown in Figure 9a. Figure 9b,c respectively show the stabilized frames using the auto-directed video stabilization method [32] and the proposed video stabilization method. As shown in Figure 9, the proposed video stabilization method can successfully preserve the contents of the input frame with a reduced cropping ratio.
Figure 10. Experimental results of various video stabilization methods: (a) the input shaky video frames (170th, 171st, and 172nd frames), (b) the stabilized video using the global camera path using feature detection [7], (c) the bundled path algorithm [14], and (d) the proposed method.
Figure 10 shows the same test as Figure 6 using a different input video. Figure 10a shows the 170th, 171st, and 172nd frames of the input shaky video captured by a mobile camera. Significant portions of the video stabilized using the existing methods in [7,14] are removed by cropping to eliminate holes in the boundaries as shown in Figure 10b,c. As shown in Figure 10d, the video stabilized using the proposed method shows significantly improved video quality by removing undesired artifacts.
As shown in Figure 11, the bottom right region of Figure 10 is enlarged to make the results easier to compare. Figure 11a shows the three enlarged original frames, and Figure 11b shows the stabilized results using the feature-based global camera path smoothing method [7]. Figure 11c shows the stabilized frames using the bundled path algorithm [14]. As shown in Figure 11d, the proposed method successfully obtains stabilized video with fewer holes.
Figure 11. Experimental results of various video stabilization methods: (a) the enlarged shaky video frames (170th, 171st, and 172nd frames), (b) the stabilized video using the global camera path using feature detection [7], (c) the bundled path algorithm [14], and (d) the proposed method.
Figure 13 compares the performance of various camera path smoothing methods. Each resulting frame is divided into sixteen rectangular grids to make the stabilization performance easier to evaluate. Figure 13a shows the input shaky video frames acquired by a hand-held camera, and Figure 13b shows the results of stabilized video by smoothing the camera path using a moving average filter [7]. The stabilized frames using the proposed method that minimizes the 1D TV are shown in Figure 13c. A grid-by-grid comparison shows that the proposed method can successfully enhance the shaky video with significantly reduced holes. Figure 14 shows the enlarged version of Figure 13. Figure 14a shows the first three frames in the original shaky video. Figure 14b shows the distorted object moving back and forth in the center of each frame. On the other hand, the proposed method successfully reduces the noisy motion of the shaky video as shown in Figure 14c. The difference between two successive frames is minimized since the proposed method reduces the noisy motions. To evaluate the objective performance, we used the peak signal-to-noise ratio (PSNR) values of the temporally adjacent frames.
The PSNR is defined as $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}_f^2 / \mathrm{MSE})$ (Equation (9)), where MSE represents the mean square error, and $\mathrm{MAX}_f$ the maximum intensity value of the frames. Table 2 summarizes the PSNR values of adjacent video frames stabilized by the proposed method. As a result, the proposed video stabilization can correct the location of the pixels in the adjacent frames. Finally, we measured the perspective distortion for objective assessment of the proposed video stabilization method using Liu's method [14]. As mentioned in Section 2, a perspective distortion generally occurs when the real world is projected onto the image sensor. An inaccurately estimated homography results in perspective distortion that significantly degrades the geometric quality of the video. For that reason, we estimated the perspective distortion using the transformation between the original and stabilized frames. The homography of the stabilized image sequences can be defined as in Equation (10), where $C_t$ and $P_t$ respectively represent the cumulative homographies between adjacent frames of the observed shaky and stabilized videos, and $B_t$ the transformation matrix. The perspective distortion is computed by averaging the perspective components in $B_t$ since the homography with distortion determines the video quality. Tables 3-5 summarize the perspective distortion of various video stabilization methods. As shown in the tables, the proposed video stabilization method can successfully remove the undesired motion with less perspective distortion than conventional video stabilization algorithms.
Unstable videos with undesired camera motions limit the performance of object detection and tracking. The final experiment is performed to demonstrate whether the proposed method can play a practical role as a pre-processing step in various video analysis systems. We used the Lucas-Kanade feature tracker (LKT) to demonstrate the performance of object tracking on shaky and stabilized videos.
Figure 16 illustrates the experimental results of the object tracking. The yellow boxes in Figure 16 represent the tracking results using the LKT tracking method. Although the popular LKT algorithm tracks features robustly under image rotation and viewpoint change, it has the fundamental problem of missing the objects of interest in the shaky video as shown in Figure 16a. As shown in Figure 16b, the proposed method can significantly improve the object tracking performance.
The stabilized results of the proposed method used in Figures 6-16 can be found in the supplementary video, together with a comparison between the original and stabilized versions.
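The two objective measures used in this section, the l1 motion error between transformed and observed feature points and the PSNR between adjacent frames, can be sketched as follows. The PSNR follows its standard definition; averaging the l1 error over the number of points is an assumption, since the exact normalization is not stated here. Function names and the default `max_value = 255` are hypothetical choices for 8-bit frames.

```python
import math


def l1_motion_error(predicted, observed):
    """Mean l1 distance between transformed previous-frame points and current-frame points.

    Dividing by the number of points is an assumption about the normalization.
    """
    total = sum(abs(px - ox) + abs(py - oy)
                for (px, py), (ox, oy) in zip(predicted, observed))
    return total / len(predicted)


def psnr(frame_a, frame_b, max_value=255.0):
    """PSNR in dB between two equally sized grayscale frames (lists of rows)."""
    count = 0
    squared_sum = 0.0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            squared_sum += (a - b) ** 2
            count += 1
    mse = squared_sum / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR between adjacent stabilized frames indicates that pixels stay closer to the same locations from frame to frame, which is the sense in which the measure is used above.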

Conclusions
The proposed video stabilization method removes unstable motions by estimating the optimal camera path using robust keypoint extraction in textureless regions, and it smooths the shaky motions without frame delay using a variational optimization method. In addition, the proposed method is particularly suitable for hardware implementation in handheld cameras since it estimates the optimal camera path of shaky video using only the four vertices of each frame. As a result, the proposed algorithm can successfully enhance shaky video using an improved 2D stabilization method based on particle keypoints. The proposed method can be used in various video systems including mobile imaging devices, video surveillance systems, and vehicle imaging information systems. To overcome the vibration of video acquired by vision-based mobile robots, a state-of-the-art system implements video stabilization on a field programmable gate array (FPGA)-based mobile robot so that it can run on a single-chip embedded system for real-time video streams [33]. The proposed method can be applied to this system to extract correct features in flat regions and to improve the quality of the stabilized video. Recently, an aerial surveillance system used a video stabilization method to detect objects in a wide area [34]. Aerial video acquired with a moving camera cannot avoid jitter between temporally adjacent frames. For that reason, video stabilization is an indispensable pre-processing step for robust detection of objects in an aerial surveillance system. The proposed method can define significant feature points that are hard to extract in flat or low-resolution regions, and it can thereby significantly improve the performance of conventional video stabilization methods. Users of portable handheld cameras share videos of dynamic activities such as walking, cycling, and hiking, so it is important to remove undesirable shaky motion.
The proposed feature extraction algorithm can be flexibly modified to extract robust initial keypoints, and it can also be used in a computationally powerful server-based cloud service to enhance the quality of uploaded videos. First-person road videos can be stabilized by optimally estimating the camera path based on the particle keypoint update in flat regions. Moreover, personal videos are nowadays often summarized in the form of time-lapse video because of the limited battery energy of mobile devices and the limited speed of wireless networks. In this context, the proposed method can be applied as a pre-processing step of a video summarization algorithm to remove wobble effects.