Metadata-Assisted Global Motion Estimation for Medium-Altitude Unmanned Aerial Vehicle Video Applications

Global motion estimation (GME) is a key technology in unmanned aerial vehicle remote sensing (UAVRS). However, when a UAV's motion and behavior change significantly or the image information is not rich, traditional image-based methods for GME often perform poorly. Introducing bottom metadata can improve precision under large-scale motion conditions and reduce the dependence on unreliable image information. GME is divided into coarse and residual GME through coordinate transformation and based on the study hypotheses. In coarse GME, an auxiliary image is built to convert image matching from a wide baseline condition to a narrow baseline one. In residual GME, a novel information and contrast feature detection algorithm is proposed for big-block matching to maximize the use of reliable image information and ensure that the contents of interest are well estimated. Additionally, an image motion monitor is designed to select the appropriate processing strategy by monitoring the motion scales of translation, rotation, and zoom. A medium-altitude UAV is employed to collect three types of large-scale motion datasets. Peak signal-to-noise ratio (PSNR) and motion scale are computed. The results are encouraging and applicable to other medium- or high-altitude UAVs with a similar system structure.

Unmanned aerial vehicle remote sensing (UAVRS), as a means of aerospace remote sensing, is a strong complement to satellite remote sensing and aerial remote sensing by manned aircraft. With the exponential development of the sensors and instruments to be installed onboard, UAVRS applications with new potential are continuously increasing [1]. Owing to its real-time video transmission, detection of high-risk areas, low cost, high resolution, flexibility, and other advantages, UAVRS has been widely utilized in military and civilian areas in the past decade [2,3]. Equipped with various imaging equipment of visible light, infrared, and synthetic aperture radar to obtain remote sensing images, unmanned aerial vehicles (UAVs) utilize aerial and ground control systems for automatic video shooting, data compression and transmission, video preprocessing and post-processing, and other functions and can be utilized in many applications, such as national environmental protection [4], mineral resource exploration, land use survey, marine environmental monitoring [5], water resource development, crop growth monitoring and assessment [6], forest protection and monitoring [7], natural disaster monitoring and evaluation [8], target surveillance [9], and digital Earth.
Worldwide applications have contributed to the development of several scientific studies on remote sensing [10]. As an auxiliary means that employs indispensable visual information, video processing has been widely utilized in guidance, navigation, and control [11] and is eliciting an increasing amount of attention because of its special characteristics, namely, moving imaging, long-distance transmission, and complex atmosphere. Mai et al. [12] summarized five characteristics of small UAV airborne videos. Brockers [13] and Kanade et al. [14] studied computer vision for micro air vehicles. These studies have contributed significantly to UAV vision in the field of UAVRS.
Compared with popular low-altitude UAVs, medium-altitude UAVs play a more important role in UAVRS because of their longer endurance, higher altitude, and more powerful imaging sensor loading capability. However, published studies on vision technologies and applications for medium- or high-altitude UAVs remain scarce.

Utility of Global Motion Estimation in UAVRS
Global motion estimation (GME) is the process of determining the motion of the camera and is a key technology in UAV vision for remote sensing data acquisition and various applications. Data acquisition is the process of imaging, compression, and transmission; it provides the image data source for remote sensing applications. During data acquisition, GME could be utilized to estimate the camera's motion in video stabilization [15] to obtain a stable and smooth video. It can also be utilized to calculate the redundancy information between frames in video encoding [16] to achieve high-resolution image compression and transmission despite the limited bandwidth of the data link.
GME has important contributions to the other four classes of video processing shown in Table 1, namely, target detection and tracking [17], video shot segmentation and retrieval [18], super-resolution reconstruction [19], and structure from motion [20]. These four classes have important applications in remote sensing. In target detection and tracking, the global motion presents the background, and the different local motions indicate the moving targets. Several remote sensing applications for monitoring and surveillance are performed based on this fact. In video shot segmentation and retrieval, accurate global motion is a reliable indicator to extract several important sequences or images from a remote sensing database. In super-resolution reconstruction, robust GME is utilized to complete image registration, which is highly useful for forest and agriculture applications. Another important application is the popular structure from motion for 3D mapping, which is based on camera motion estimation. In outdoor video applications, conventional GME based on image block matching has nearly evolved to maturity. However, in a UAV environment, performance is poor when the UAV's motion or behavior changes significantly. The cause of this problem is that conventional image-based GME is only applicable to the narrow baseline condition between frames and not to the wide baseline condition when a large-scale motion is prevalent in the video. Owing to the combined motion of the vehicle and camera, the image block moves out of the search window, or image distortion renders the same content in different sizes and shapes; either of these conditions results in image block matching failure. Several other methods utilize prior information from special sensors to deal with this problem. Although these methods provide improved results, they cannot be applied to other UAVs with different structures. Furthermore, when images have minimal information (e.g., deserts, rivers, and other special landforms), image-based methods become completely unreliable.
The problems in GME that need to be solved are summarized below; these problems provide future research directions for our work.
(1) How to improve precision under a large-scale motion condition?
(2) How to reduce the dependence on image information to adapt to several special landforms?
(3) How to enhance adaptability to different UAVs?

Related Work
Traditional GME methods can generally be divided into pixel-based [21], feature-based [22], and vector-based [23] methods according to different analysis objects. The performances of these methods were evaluated in [24,25]. To achieve high precision, a large number of pixels, features, and vectors are involved in the computation.
In recent years, several excellent GME algorithms [26][27][28][29][30] have been developed in an attempt to achieve both precision and low computation. Okade et al. [26] proposed the use of discrete wavelet transform in the block motion vector field to estimate the global motion parameters in the compressed domain. A key assumption was that the LL sub-band provides the average motion, which is predominantly due to the background (camera) motion. The algorithm proposed by Yoo et al. [27] independently conducted motion estimations in both forward and backward directions and selected the more reliable vector between forward and backward motion vectors by evaluating motion vector reliability from the perspective of the interpolated frame. In [28], a new class of prediction algorithms based on region prediction was introduced. This new class of algorithms can be applied in conventional fixed-pattern algorithms to predict the region in which the best matched block is located, thereby increasing the speed of the algorithm. Sung and Chung [29] presented a robust real-time camera motion estimation method that employs a fast detector with a multi-scale descriptor; the method entails minimal computation but exhibits high precision. Krutz et al. [30] proposed the concept of global motion temporal filtering for the emerging video coding standard HEVC. All of these methods can be considered image-based GME.
Aside from image-based GME, several scholars have employed camera sensor parameters as auxiliary information to solve the problem. This class of GME can be referred to as sensor-assisted GME. In [31], a novel approach called sensor-assisted motion estimation was developed to estimate the linear displacement of a mobile device through the use of built-in sensors. A built-in three-axis accelerometer was utilized to determine linear displacement on the X-, Y-, and Z-axes. Another method called sensor-assisted video encoding (SaVE) was introduced in [32] to reduce the computational complexity of video encoding. The method calculates the movement of a camera (on mobile devices) and then infers the global motion in a video. SaVE utilizes readings from a single accelerometer attached to the video camera to compute the vertical angle. For the horizontal angle, either absolute angle readings from a single digital compass or a pair of accelerometers are utilized. Wang et al. introduced a sensor-assisted GME method for H.264/AVC video encoding [33]. By leveraging location (GPS) and digital compass data, the method exploits the geographical sensor information to detect transitions between two video sub-shots based on the variations in both camera location and shooting direction. Strelow et al. [34] introduced an algorithm that computes optimal vehicle motion estimated by simultaneously considering all measurements from the camera, rate gyro, and accelerometer. Sensor-assisted GME takes advantage of the position and behavior information of sensors, such as accelerometers, GPS devices, digital compasses, and rate gyros. With information from sensors, good performance is achieved to some extent. The success of these methods lies in the fact that they upgrade the GME problem to the system level and then exploit useful information from the system.
Owing to the characteristics of moving imaging and dual-platform (vehicle and camera) behavior change, UAV video GME becomes a system issue that cannot be simply solved by image processing algorithms. For example, in the compression domain, MPEG-4 or H.264 generally exhibits low effectiveness when ground speed and dual-platform behavior change significantly. In this case, translation between frames is larger than the search radius of the block matching algorithm, or rotation and zoom motions create image distortion; either one of these conditions results in an error in block matching. The compression ratio becomes large with the limited bandwidth of the data link, leading to data loss during transmission. The worst effect is the possible generation of mosaics in the ground video receiver, thereby seriously affecting video applications, such as target detection and recognition. The key to solving this problem is to develop a fast and accurate GME method.
To achieve effective GME for UAV/aerial video applications, several scholars worked to solve the problem from the system level, similar to sensor-assisted GME. Rodríguez et al. [35] presented an efficient algorithm to solve the motion estimation problem. The algorithm requires minimal computation and is thus suitable for implementation in a mini-UAV. Computation is reduced by using prior knowledge on camera locations (from available mini-UAV sensor data) and the projective geometry of the camera. Based on the algorithm in [35], Gong et al. [36] proposed a low-complexity image-sequence compressing algorithm for UAVs. Bhaskaranand et al. [37] designed a video encoding scheme suitable for applications in which encoder complexity needs to be low (e.g., UAV video surveillance). Encountering the same problem, Angelino et al. proposed a novel motion estimation scheme that employs the global motion information provided by the onboard navigation system [38,39]. The homography between two images was utilized to initialize the block matching algorithm, allowing for a more robust motion estimation and a smaller search window to reduce complexity. Bhaskaranand and Gibson proposed a low-complexity encoder [40] with no block-level motion estimation, global motion compensated prediction, and spectral entropy-based coefficient selection and bit allocation.
These studies promoted the development of UAVRS technologies and applications under certain conditions.However, they also have several limitations (indicated below).
(1) The research objects were predominantly small, low-altitude UAVs with different structures. This condition leads to poor expansibility of the methods.
(2) The information used was not the bottom data measured from the UAV system, and some information was assumed to be known. Thus, the process of GME was not completed from the bottom level.
(3) The motion of the dual platform was often assumed to be smooth and stable, which confines GME to a narrow baseline condition. However, even the same contents (e.g., houses, bridges) of two adjacent frames differ in geometric features (shape and size), location, and orientation when the vehicle's translation or the dual platform's behavior changes considerably.

Present Work
The current work aims to solve the three problems in GME mentioned in Section 1.1.3. First, according to the theory of coordinate transformation, GME is converted from a wide baseline condition to a narrow baseline one to improve the precision of GME under a large-scale motion condition. Second, bottom metadata are utilized to reduce the dependence on image information, and an information and contrast feature is derived to maximize the use of reliable image information. Third, a medium-altitude UAV with a common structure is investigated. The proposed scheme can also be applied to medium- or high-altitude UAVs with a similar system structure.

Study Hypotheses
A novel metadata-assisted GME method (MaGME) for medium-altitude UAV video applications was developed. According to the imaging characteristics of medium-altitude UAVs, three hypotheses were established.
(1) Central projection model hypothesis. The camera imaging model conforms to the central projection model. On this basis, MaGME is applicable to CCD and infrared video. The central projection model is a common imaging model that is similar to the pinhole camera model assumed in [41]. It is utilized to solve collinear equations in coordinate transformation. However, selecting the wrong type of camera would lead to an erroneous result.
(2) Field depth consistency hypothesis. Terrain fluctuations and man-made buildings can be ignored relative to the long imaging distance. Thus, all pixels of an image are assumed to be on the same plane. Field depth consistency is a basic hypothesis in most studies on GME [35][36][37][38][39]. The lower the terrain fluctuations and buildings are, the more accurate the result of GME is.
(3) Content of interest hypothesis. Users are more interested in an area with strong contrast and rich information (e.g., houses, overpasses) than in an area with hue/brightness consistency and minimal information (e.g., lakes, wheat fields, grasslands). The image information in the contents of interest is rich and reliable. The purpose of the content of interest hypothesis is to indicate which regions of the image contain valuable information. Based on this hypothesis, this study attempts to detect meaningful blocks and discard worthless blocks from images in block matching.

Metadata
To improve the precision of GME under large-scale motion conditions and reduce the dependence on image information, full information mining was implemented to build a model from bottom metadata to global motion. A medium-altitude UAV with a CCD camera was employed. Equipped with GPS and INS, the UAV can measure its position and behavior by itself. The two-DOF camera mounted on the front belly can complete attitude measurement independently. The metadata associated with image global motion are shown in Table 2.

Workflow of MaGME
A GME method based on the theory of coordinate transformation was designed with the metadata provided above. The method completely relies on bottom metadata that UAV systems produce initially rather than on known data of camera position, orientation, and projective geometry provided in [35]. In addition, the camera calibration process in [35,38,39] was not considered in the current study because the camera mounted on a medium-altitude UAV usually implements this process before image compression.
As shown in the light blue box of Figure 1, MaGME performs in both wide and narrow baseline conditions, as indicated by the image motion monitor. To improve performance under a wide baseline condition, the following two steps were conducted. First, based on the theory of coordinate transformation, the coarse GME between image F(t) and auxiliary image F'(t) was computed using metadata (shown as M1). Second, auxiliary image F'(t) was built to convert the wide baseline condition to a narrow baseline one; then, the residual GME was solved by the big-block matching method (shown as M2). These two steps were combined to obtain the final global motion. The narrow baseline condition requires only big-block matching, the process of which is similar to the residual GME under the wide baseline condition. Given that UAV video global motion suffers from the combined motion of the vehicle and camera, complex transformations, including translation, rotation, zoom, and shear, widely exist between two frames or between the image and the real scene. This relationship needs to be represented by a perspective projection model [42]. Thus, video GME can be converted into a problem of solving the perspective projection transformation matrix between frames. A perspective projection model is described by eight parameters (m0, m1, m2, m3, m4, m5, m6, m7). At time t, one point of the frame is recorded as P(xt, yt). At time t+1, the point is recorded as P(xt+1, yt+1). The relationship is shown in Equation (1). The purpose of video GME is to compute successive perspective projection models.
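Concretely, the standard eight-parameter perspective model maps a point through two ratios of linear forms. The minimal sketch below (plain Python; the m0–m7 parameter ordering is an assumption about the paper's layout, chosen to match the common convention) applies such a model to a point:

```python
def warp_point(m, x, y):
    """Map (x, y) through an 8-parameter perspective projection model
    m = (m0, ..., m7), using the standard form
      x' = (m0*x + m1*y + m2) / (m6*x + m7*y + 1)
      y' = (m3*x + m4*y + m5) / (m6*x + m7*y + 1)."""
    w = m[6] * x + m[7] * y + 1.0
    return ((m[0] * x + m[1] * y + m[2]) / w,
            (m[3] * x + m[4] * y + m[5]) / w)

# Identity model: leaves every point unchanged.
IDENTITY = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)

# A pure translation by (5, -2): only m2 and m5 are non-zero offsets.
TRANSLATE = (1.0, 0.0, 5.0, 0.0, 1.0, -2.0, 0.0, 0.0)
```

When m6 = m7 = 0, the model degenerates to an affine transformation; the denominator is what encodes the perspective (shear-like) distortion between frames.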
An image motion monitor was designed to determine whether the global motion between frames is under a wide baseline condition to guide strategy determination. When the monitor shows a large-scale motion between frames, coarse and residual GME need to be performed together. Otherwise, only residual GME based on big-block matching is required. This procedure makes the proposed method feasible under all motion conditions of the UAV system.
Under a wide baseline condition, the ground coordinate system (GCS) is introduced as an auxiliary coordinate system. Based on the central projection model hypothesis, frames F(t) and F(t+1) in the image coordinate system (ICS) are projected onto the ground plane in GCS. Corrected image planes F*(t) and F*(t+1) are then obtained with Equations (2) and (3).
where H1 and H2 denote the perspective projection transformations from the image planes in ICS to their projection image planes in GCS at time t and t+1, respectively. After transformation from ICS to GCS, the main motion between frames can be represented by translation T, which can be obtained from the positional relationship of F*(t) and F*(t+1) in GCS. The transformation between F(t) and F(t+1) can be expressed as Equation (4). Based entirely on metadata, coarse global motion M1 between the two images is obtained.
Applying M1 to image F(t) yields compensated image F'(t).
If both the metadata and the calculation are absolutely accurate, M1 is the global motion from frame F(t) to frame F(t+1), and F'(t) is similar to F(t+1). However, affected by equipment installation and sensor measurement errors, M1 cannot represent the real global motion. In fact, a residual global motion exists between F'(t) and F(t+1).
Although F'(t) is not similar to F(t+1), experiments show that much of the translation, rotation, zoom, and shear is eliminated. With auxiliary image F'(t), the wide baseline problem between F(t) and F(t+1) can be converted to a narrow baseline one between F'(t) and F(t+1). The residual global motion between F'(t) and F(t+1) is denoted as M2.
Finally, as the core of MaGME, global motion M is expressed as Equation (7).
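In homogeneous coordinates, composing the coarse motion with the residual motion is a matrix product. The sketch below assumes both motions are expressed as 3 × 3 homographies and that the combined motion applies the coarse motion first:

```python
def compose(m2, m1):
    """Compose two 3x3 homographies: the combined motion applies m1
    (coarse GME) first, then m2 (residual GME), i.e., M = M2 * M1."""
    return [[sum(m2[i][k] * m1[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_h(h, x, y):
    """Apply a 3x3 homography to point (x, y) with perspective division."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

# Illustrative values: coarse motion is a translation by (10, 4),
# residual motion corrects it by a further (-1, 2).
M1 = [[1.0, 0.0, 10.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
M2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]]
M = compose(M2, M1)
```

Because homography composition is simply matrix multiplication, the coarse (metadata-driven) and residual (image-driven) estimates can be computed independently and merged at the end.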

Coordinate Transformation
Based on coordinate transformation theory and the central projection model, the image plane in ICS was converted to the ground plane in GCS by utilizing the metadata. After the transformation, the complex perspective projection transformation between two image planes can be described by a simple translation transformation in GCS.
Coordinate transformation from ICS to GCS follows the order "Image Coordinate System → Camera Coordinate System → Plane Coordinate System → North-East-Up Coordinate System (NCS) → Ground Coordinate System", as shown in Figure 2. Accordingly, the transformation from image plane FI(xI, yI, zI) in ICS to image plane FN(xN, yN, zN) in NCS can be expressed as Equation (8).
GCS employs the Gauss-Kruger coordinate on the XOY plane and altitude on the Z-axis. The origin of GCS is the intersection of the Greenwich Meridian and the Equator. Parallel to NCS, with the X-axis pointing north, the Y-axis pointing east, and the Z-axis pointing upward, GCS forms a left-handed coordinate system, as shown in Figure 2.
A central projection model needs to be established to solve collinear equations according to the central projection model hypothesis.

Equation (10) is obtained according to the theory of similar triangles. FN(xN, yN, zN) can be calculated with Equation (8), and FG(xG, yG, zG) can be obtained with Equation (11).
In Equation (11), zG is the height of the object point. According to the field depth consistency hypothesis, the entire image is regarded as a plane, so zG is a known value.
The transformations TCI, TPC, MPC, MNP, and TGN can be expressed as 4 × 4 matrices. Pixel plane FI is expressed as (xI, yI, −f, 1)T in homogeneous coordinates.
The general transformation from image plane FI in ICS to ground plane FG in GCS is described above. However, owing to the different coordinate system forms in UAV systems, some adjustments (e.g., coordinate axis direction or rotation direction) are required in specific applications.
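As an illustration of the chain, the sketch below rotates the camera-frame ray of a pixel into ground-aligned axes and intersects it with the ground plane z = zG using the similar-triangle scaling described above. The single combined rotation R and the Z-axis rotation convention are simplifying assumptions; a real UAV system composes several rotations (vehicle attitude, camera gimbal) with its own axis conventions:

```python
import math

def rot_z(a):
    """Rotation about the Z-axis by angle a (radians); convention assumed."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(r, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(r[i][j] * v[j] for j in range(3)) for i in range(3)]

def project_to_ground(pix, f, R, cam_pos, zG):
    """Project pixel (xI, yI) onto the ground plane z = zG.
    R rotates the camera-frame ray (xI, yI, -f) into ground-aligned axes;
    cam_pos is the camera position in GCS. By similar triangles, the ray
    is scaled so that its Z-coordinate reaches the ground height zG."""
    d = mat_vec(R, [pix[0], pix[1], -f])
    s = (zG - cam_pos[2]) / d[2]          # similar-triangle scale factor
    return (cam_pos[0] + s * d[0], cam_pos[1] + s * d[1], zG)
```

With R set to the identity (a nadir-pointing camera), each pixel's ground point is simply the camera position shifted proportionally to the pixel offset, which is the field depth consistency hypothesis at work.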

Coarse GME
After coordinate transformation, the rotation or zoom scale between F*(t) and F*(t+1) becomes consistent in GCS. Without considering the precision of metadata and calculation, the transformation is absolutely accurate, and only translation T exists between F*(t) and F*(t+1). It can be estimated according to the relationship of the two planes in GCS.
As the two image planes have already been corrected to GCS, the same content points of the two images in GCS should have the same coordinates. Leveraging this fact, translation T from F*(t) to F*(t+1) can be estimated by the position difference of the same point on the two planes. As shown in Figure 4, the center of the overlapping area of F*(t) and F*(t+1) is a point P, denoted by (x*t, y*t) in F*(t) and (x*t+1, y*t+1) in F*(t+1). Then, translation T(dx*, dy*) from F*(t) to F*(t+1) can be expressed as Equation (18). Consequently, the transformation from F(t) to F(t+1) can be expressed as M1 = H2^(-1) · T · H1, which represents the homography matrix between the two images. The solution of M1 is entirely based on the metadata and does not involve any image information. Accordingly, the computation is less than that of image-based methods. However, two issues need to be noted.
(1) Given that metadata precision is affected by equipment installation and the measured parameters, M1 is not the real global motion between frames. The low precision of the vehicle and camera parameters P leads to the poor capability of M1 to represent global motion under large-scale rotation or zoom motion. The low precision of the position parameters leads to the poor capability of M1 to represent global motion under large-scale translation motion.
(2) Under a wide baseline condition, the initialization of matching window positions and the reduction of search computation mentioned in [38,39] cannot be achieved by M1. When the behaviors of the dual platform and the focal length change considerably, the same content in the two images exhibits large distortion, which causes the block matching method to fail.
Owing to the high performance of the INS, GPS, camera, and other equipment mounted on the large medium-altitude UAV, M1 can eliminate much of the distortion and translation between F'(t) and F(t+1); the matching of F'(t) and F(t+1) is thus under a narrow baseline condition. Utilizing F'(t) as an auxiliary image, GME M between F(t) and F(t+1) under a wide baseline condition can be converted into coarse GME M1 between F(t) and F'(t) and residual GME M2 between F'(t) and F(t+1), the latter under a narrow baseline condition, with M = M2 · M1.

Information and Contrast Feature
To maximize the use of reliable information in the image and ensure the precision of GME in the contents of interest, an information and contrast feature (I&C feature) for big-block selection was developed based on the content of interest hypothesis.
where Ω is the image block window, I is the gray value of point (x, y) in Ω, Ī is the average gray value of the image block, P(I) is the probability of gray value I in the image block, λ is the amount of gray information in the image block, and Q(I) indicates the presence or absence of gray value I in the image block: when I is present, Q(I) = 1; otherwise, Q(I) = 0. κ is the normalization factor. To facilitate observation and calculation, the I&C features of all blocks are normalized to (0, 255).
According to Equation (20), a region with rich information and high contrast has a high I&C feature value. The content of interest hypothesis indicates that people show more interest in regions with strong contrast and rich information than in open fields with consistent hue or brightness and minimal information; the image information in the contents of interest is reliable. Consequently, the selected image blocks should be located in the contents of interest to ensure accurate registration of important image content and avoid mismatches in regions with minimal information and low contrast. Thus, the I&C feature can be utilized as an indicator in image block selection. Additionally, it can reduce the work required to eliminate the many mismatched outliers in [43,44]. As to the number of image blocks, only one big block is needed to solve residual translation motion, and at least four big blocks are sufficient for residual perspective projection motion estimation, unlike in conventional methods wherein each image block is matched.
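The idea behind the I&C feature can be sketched as follows. Since Equation (20) is not reproduced here, the combination below (variance for contrast, Shannon entropy of the gray-level histogram for information) is an illustrative stand-in rather than the paper's exact formula; it still ranks flat, low-information blocks below busy, high-contrast ones:

```python
import math

def ic_feature(block):
    """Illustrative information-and-contrast score for a gray-value block.
    Contrast: variance about the block mean. Information: Shannon entropy
    of the gray-level histogram. (Plausible stand-in for Equation (20).)"""
    pixels = [v for row in block for v in row]
    n = len(pixels)
    mean = sum(pixels) / n
    contrast = sum((v - mean) ** 2 for v in pixels) / n
    hist = {}
    for v in pixels:
        hist[v] = hist.get(v, 0) + 1
    entropy = -sum((c / n) * math.log2(c / n) for c in hist.values())
    return contrast * entropy

flat = [[128] * 4 for _ in range(4)]          # open field: one gray level
busy = [[0, 255, 0, 255], [255, 0, 255, 0],
        [0, 255, 0, 255], [255, 0, 255, 0]]   # strong contrast pattern
```

A flat block scores zero on both terms, so it is never selected; thresholding such a score reproduces the behavior described above, where matching is confined to the contents of interest.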

Residual GME Based on Big-block Matching
The experiments indicate that the residual global motion is predominantly translation. In consideration of both computation and precision, M2 can be solved with two models, namely, translation transformation and perspective projection transformation. Motion estimation between F'(t) and F(t+1) can use the conventional image block matching method. To maximize the use of the typical contents (e.g., houses, trees) in the image, a big block is appropriate.
As shown in Figure 5, an image was divided into 16 × 16 big blocks to maximize the use of several typical contents (houses, trees). Through visual judgment, the yellow-marked region has low contrast and minimal information; image blocks should not be selected in this region.
The blocks whose I&C features are greater than a certain threshold value in auxiliary image F'(t) were utilized to search for the best matching blocks in F(t+1). Three-step search (TSS) [45] was employed as the search method. When the MAD between two matching blocks is less than a certain threshold or the number of matching iterations exceeds the maximum value, the location is accepted as the best matching position.
where M and N are the column and row of the block and Cij and Rij are the pixels being compared in the current and reference blocks, respectively. The center of an image block in auxiliary image F'(t) is recorded as (x't, y't), and (xt+1, yt+1) is the corresponding point in F(t+1). The motion vector calculated with the block matching method is recorded as V(dx, dy; t). The motion model from F'(t) to F(t+1) can be expressed as translation in Equation (22) or perspective projection transformation in Equation (23).
where tx and ty can be evaluated as the average translation of all image blocks on the X- and Y-axes, from which the translation is easily obtained, and m0, m1, ..., m7 are the parameters of the perspective projection matrix. At time t, the real perspective projection transformation from F'(t) to F(t+1) is denoted as Mt, and the motion vector it predicts at (x, y) is denoted as V(dMx, dMy; Mt). The best global motion can be described as the solution of Mt for which the squared distance between V(dMx, dMy; Mt) and V(dx, dy; t), summed over Ω, is at the minimum, where Ω is the set of centers of the image blocks involved in the calculation. This formulation, (m0, ..., m7) = arg min E, leads to nonlinear optimization, which can be solved with the Levenberg-Marquardt algorithm [46] or the Newton-Raphson algorithm [47] at the cost of a large amount of computation. To avoid nonlinear optimization, Farin et al. [25] replaced the Euclidean error E with an algebraic error and then converted the problem into a linear least squares one. Multiplying the Euclidean error by (m6xt' + m7yt' + 1)^2 results in an algebraic error, as shown in Equations (27) and (28).
Imposing the necessary condition ∂E/∂mi = 0 for a minimal error results in a linear equation system from which m0, m1, …, m7 can be obtained.
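The algebraic-error linearization above can be sketched as a direct-linear-transform style fit: each block correspondence contributes two linear equations in m0–m7, and the resulting normal equations are solved exactly. The equation layout follows the standard perspective model; the small Gaussian elimination solver is a stand-in for any linear solver:

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_perspective(points, targets):
    """Fit (m0, ..., m7) from correspondences by minimizing the algebraic
    error: each pair (x, y) -> (X, Y) contributes the linear equations
      m0*x + m1*y + m2 - X*(m6*x + m7*y) = X
      m3*x + m4*y + m5 - Y*(m6*x + m7*y) = Y,
    stacked and solved as normal equations A^T A m = A^T b."""
    rows, rhs = [], []
    for (x, y), (X, Y) in zip(points, targets):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -X * x, -X * y]); rhs.append(X)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -Y * x, -Y * y]); rhs.append(Y)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(8)]
    return solve_linear(ata, atb)
```

With at least four correspondences in general position, the normal equations are nonsingular and the eight parameters follow in one pass, which is exactly the computational advantage over the nonlinear formulation.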

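The big-block matching step itself (MAD criterion plus three-step search, as described above) can be sketched as follows. The bounds handling and strict-improvement rule are implementation choices, not taken from the paper:

```python
def get_block(img, r, c, n):
    """Extract an n x n block whose top-left corner is (r, c)."""
    return [row[c:c + n] for row in img[r:r + n]]

def mad(cur, ref):
    """Mean absolute difference between two equal-sized blocks."""
    n, m = len(cur), len(cur[0])
    return sum(abs(cur[i][j] - ref[i][j])
               for i in range(n) for j in range(m)) / (n * m)

def three_step_search(ref_block, frame, r0, c0, step=4):
    """Three-step search (TSS): test the 8 neighbours of the current best
    position at the current step size, move to the best one, then halve
    the step until it reaches 1."""
    n = len(ref_block)
    best = (r0, c0)
    best_mad = mad(get_block(frame, r0, c0, n), ref_block)
    while step >= 1:
        cr, cc = best
        for dr in (-step, 0, step):
            for dc in (-step, 0, step):
                r, c = cr + dr, cc + dc
                if 0 <= r <= len(frame) - n and 0 <= c <= len(frame[0]) - n:
                    score = mad(get_block(frame, r, c, n), ref_block)
                    if score < best_mad:
                        best, best_mad = (r, c), score
        step //= 2
    return best, best_mad
```

Starting with step = 4 covers displacements of up to ±7 pixels in three refinement rounds, which is why TSS needs far fewer MAD evaluations than an exhaustive search over the same window.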
Image Motion Monitor
An image motion monitor was created to select an appropriate processing strategy (whether to use coarse GME or not) by monitoring the image motion scales of translation, rotation, and zoom, as shown in Figure 6. When the motion scale is large, image matching is under a wide baseline condition, and coarse and residual GME are both performed. Otherwise, only residual GME based on big-block matching is required. How to use the existing information to represent the three basic motion scales is a key issue. Coordinate transformation theory and spatial geometry were adopted to solve this problem. The scale of image translation can be manifested as the sum of the image center shift on the X- and Y-axes. Given that this value cannot be calculated directly, the center shift in GCS can be acquired first and then multiplied by the image proportion. Image zoom motion is relevant to focal length and imaging distance. Image rotation is determined by the angle between the north and the projection of the camera optical axis on the ground plane in GCS. To derive the three basic motion scales, several notions need to be defined. As shown in Figure 7, D1 stands for the projection distance between the projection of the optical center and the visual field center. Imaging distance D2 is the spatial distance between the optical center and the visual field center. Projection point OG(xOG, yOG) in GCS transformed from image center OI in ICS can be obtained through coordinate transformation. For simplified calculation, the lens center can be replaced by the UAV position, denoted as PG(xPG, yPG, zPG).
UAV height is denoted as H, and terrain height is denoted as Hter. According to spatial geometry, D1 and D2 can be calculated with Equation (29). Scale S represents the image proportion between pixel distance and actual distance; it is relevant to focal length f, sensor pixel size u, and imaging distance D2. The angle between the north and the projection of the optical axis of the camera on the ground plane at the imaging time is denoted as α and can be solved by triangle calculation on the ground plane of GCS.
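The geometry behind D1, D2, and S can be sketched directly. The form S = f / (u · D2) is an assumption consistent with the description that S depends on f, u, and D2 (pixels per unit ground distance); the paper's exact Equation (29) may arrange the terms differently:

```python
import math

def imaging_geometry(p_uav, o_ground, h, h_ter, f, u):
    """Projection distance D1, imaging distance D2, and image scale S.
    p_uav: UAV position (xPG, yPG) on the ground plane; o_ground: the
    projected visual-field centre OG; h: UAV height; h_ter: terrain
    height; f: focal length; u: sensor pixel size."""
    d1 = math.hypot(o_ground[0] - p_uav[0], o_ground[1] - p_uav[1])
    d2 = math.sqrt(d1 ** 2 + (h - h_ter) ** 2)  # spatial optical-centre
    s = f / (u * d2)                            # to field-centre distance
    return d1, d2, s
```

With a 3-4-5 geometry (D1 = 50 m, height difference 120 m), D2 comes out as 130 m, and the scale shrinks as the UAV looks farther away, matching the intuition that oblique views cover more ground per pixel.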
The monitor scales for translation, rotation, and zoom motion, denoted as δT, δR, and δZ, can be calculated with Equation (30). To maintain the consistency of the image proportion, S is set to a constant value S0 during the transformation from ICS to GCS.
When at least one scale exceeds its threshold, the monitor outputs 1; otherwise, it outputs 0. The thresholds of translation, rotation, and zoom motion, marked as κT, κR, and κZ, are different, and their specific values are recommended through experiments. When gM(t) > 0, a large-scale motion occurs between frames, and the global motion needs to be solved by both coarse and residual GME; otherwise, only residual GME based on big-block matching is required.
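The monitor logic reduces to three threshold tests. A minimal sketch follows; the threshold values used as defaults here are placeholders (the paper recommends determining κT, κR, and κZ experimentally):

```python
def motion_monitor(delta_t, delta_r, delta_z,
                   kappa_t=20.0, kappa_r=5.0, kappa_z=0.1):
    """Image motion monitor g_M(t): outputs 1 when at least one of the
    translation, rotation, or zoom scales exceeds its threshold.
    The default kappa values are placeholders, not the paper's."""
    exceeded = (delta_t > kappa_t) or (delta_r > kappa_r) or (delta_z > kappa_z)
    # g_M(t) > 0: coarse + residual GME; otherwise residual GME only.
    return 1 if exceeded else 0
```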

Study Area and Dataset
In the experiments, a medium-altitude UAV that can cruise at an altitude of 3000 m to 5000 m was employed. GPS, an INS, a radio altimeter, a barometric altimeter, and a CCD camera were mounted on the UAV. The camera has two DOFs relative to the vehicle. The images were calibrated by the camera itself before GME. The study area is located in the eastern plain of China. The main types of landforms include city, village, and open field. The maximum height of the terrain fluctuations and man-made buildings is below 100 m. The area and flight path are shown in Figure 8.
After several flights, a database that includes approximately 100 hours of video and the original metadata was established. We assumed that translation, rotation, and zoom represent the three basic motions in a UAV video and that the actual motion is composed of these three basic motions. Thus, three types of images and metadata with large-scale motions were selected as the experimental dataset, with 100 groups of images and corresponding metadata for each type of motion (300 groups in total).
Several image examples of the three types are shown in Figure 9.

Coarse GME
The experimental process of coarse GME is illustrated with a group of data comprising two images (images A and B) and their metadata. According to the workflow of MaGME shown in Figure 10, image A is set as F(t), and image B is set as F(t+1).
A large-scale rotation exists between the two images. At this point, GME is a wide baseline registration problem. With the method proposed in Section 2.2, the transformation matrices from ICS to GCS (H1, H2) can be computed from the metadata. After this transformation, the two images F*(t) and F*(t+1) in GCS can be obtained (shown in Figure 11).
The two images maintain their consistency in shape and size in GCS. This result is proven by the clarity and lack of aliasing in the overlapping pixels. The translation between images A and B in GCS can be represented by the translation from XAOAYA to XBOBYB. By using the method proposed in Section 2.3, the translation T can be obtained as (−248, 36) pixels. Finally, the coarse GME M1 can be obtained with Equation (5).
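With both frames expressed in GCS, the residual inter-frame motion is approximately a pure translation, as in the (−248, 36)-pixel example above. One standard way to recover such a shift, shown here only as an illustration and not as the paper's exact matching procedure, is phase correlation:

```python
import numpy as np

def phase_correlation_shift(img_a, img_b):
    """Estimate the integer translation (dy, dx) such that
    np.roll(img_a, (dy, dx), axis=(0, 1)) best matches img_b,
    using the phase of the cross-power spectrum."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    cross = np.conj(fa) * fb
    cross /= np.abs(cross) + 1e-12           # keep only the phase
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    if dy > h // 2:                          # map wrap-around peaks to signed shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

Because the correlation is computed in the frequency domain, the cost is dominated by two FFTs regardless of the shift magnitude, which suits the large translations that remain after projection to GCS.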

Residual GME
Through coarse GME, the auxiliary image A' can be obtained as F'(t) with Equation (6); this image is represented as a gray image in Figure 13. Auxiliary image A' has several invalid pixels that cannot be compensated by the contents of image A because the corresponding contents do not exist in image A.
To determine whether M1 can accurately represent the global motion between images A and B, an image fusion experiment was designed with Equation (32), where C represents only the image gray value. The image on the left in Figure 14 is the direct fusion result of images A' and B, and the image on the right is the fusion result after a certain translation. The edges of houses and roads in the image on the left exhibit aliasing, which indicates that the fusion did not reach pixel-level registration precision. In the image on the right, by contrast, the overlapping region has sharp and clear edges, which indicates higher matching accuracy. These results indicate that M1 alone cannot represent the global motion accurately. Translation plays the main role in the residual motion between images A' and B; a small amount of distortion also makes a minor contribution.
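The fusion experiment can be sketched as follows, assuming Equation (32) is a pixel-wise average of the gray values of the two registered images (our reading; the paper's exact form is not reproduced here):

```python
import numpy as np

def fuse_gray(img_a, img_b):
    """Pixel-wise average fusion of two registered gray images (assumed
    form of Equation (32)); aliased edges in the result reveal residual
    misregistration, sharp edges indicate pixel-level registration."""
    return 0.5 * img_a.astype(float) + 0.5 * img_b.astype(float)

def fuse_with_translation(img_a, img_b, dy=0, dx=0):
    """The Figure 14 experiment: fuse img_a with img_b translated by
    (dy, dx); the correct residual translation removes edge aliasing."""
    shifted = np.roll(img_b, (dy, dx), axis=(0, 1))
    return fuse_gray(img_a, shifted)
```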
With Equation (20), we can calculate the I&C feature of each block in image A'. By using the I&C feature map, image block selection for residual GME becomes easy, quick, and reliable.
As shown in Figure 15, the higher the I&C feature value on the left is, the brighter the blocks on the right are. These blocks have a high probability of being selected in residual GME. Hence, accurate estimation of the contents of interest can be ensured, and the number of image blocks can be reduced at the same time.
The motion vector field with all image blocks involved in matching is shown in Figure 16a. The fusion result of the motion vector field and the I&C feature map is displayed in Figure 16b. The blocks with high I&C feature values obtain motion estimates that are approximately the same in size and orientation, whereas in the blocks with low I&C feature values, the motion vectors are erroneous. Accordingly, using the I&C feature as an indicator to select image blocks for residual GME is reasonable.
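Equation (20) is not reproduced here, but the block-selection idea can be illustrated with a stand-in score that multiplies a block's gray-level entropy (information) by its standard deviation (contrast); the block size, histogram bin count, and keep ratio below are illustrative, not the paper's values:

```python
import numpy as np

def ic_feature_map(img, block=32):
    """Per-block information-and-contrast score (a stand-in for the
    paper's Equation (20)): gray-level entropy times standard deviation."""
    h, w = img.shape
    rows, cols = h // block, w // block
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            blk = img[r * block:(r + 1) * block, c * block:(c + 1) * block]
            hist, _ = np.histogram(blk, bins=32, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]
            entropy = -np.sum(p * np.log2(p))   # information content
            contrast = blk.std()                 # contrast
            scores[r, c] = entropy * contrast
    return scores

def select_blocks(scores, keep_ratio=0.25):
    """Indices of the highest-scoring blocks for residual big-block matching."""
    k = max(1, int(scores.size * keep_ratio))
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [np.unravel_index(i, scores.shape) for i in flat]
```

A flat (uniform) block scores zero on both factors and is never selected, which matches the observation above that low-I&C blocks yield erroneous motion vectors.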

Performance of the Entire Algorithm
In the performance test experiment, we selected 300 groups of images and corresponding metadata with three typical types of motion (translation, rotation, and zoom). To simulate the condition of wide baseline registration, large-scale motions widely exist between frames. Aside from the proposed MaGME(T) (MaGME with residual translation motion estimation) and MaGME(P) (MaGME with residual perspective projection motion estimation), BM-GME (GME based on block matching) and SIFT-GME (SIFT-based GME) were used for comparison.
For the block matching algorithm of BM-GME, one may refer to [45]; for the model solution, one may refer to [25].
The homography matrix of SIFT-GME was computed with the SIFT matching method in [48,49]. SIFT features are invariant to image scale and rotation and are known to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. After scale-space extrema detection, keypoint localization, orientation assignment, and keypoint description, SIFT features are generated as 128-dimensional vectors. SIFT-GME involves three major stages: feature detection, feature matching, and global motion solution. In feature detection, the initial Gaussian smoothing parameter (σ0) is 1.6, and the number of sampled intervals per octave is 3. In feature matching, a modification of the k-d tree algorithm called the best-bin-first search method [50] is applied to identify the nearest neighbors with high probability using only a limited amount of computation. In the global motion solution, the global motion is represented by the perspective projection model in Equation (1). The RANSAC algorithm is utilized to select the matching features and calculate the perspective projection matrix.
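The SIFT detector itself is not reimplemented here, but the final stage, solving the perspective projection matrix from matched point pairs, can be sketched with the direct linear transform (a minimal version without the RANSAC wrapper described above):

```python
import numpy as np

def homography_dlt(src, dst):
    """Direct linear transform: estimate the 3x3 perspective projection
    matrix H (normalized so h33 = 1) from >= 4 point correspondences
    src -> dst, as in the global motion solution stage. Each pair gives
    two linear equations in the 9 entries of H; the null vector of the
    stacked system is recovered from the SVD."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]
```

In practice this solver is run inside RANSAC: random minimal subsets of four matches are drawn, H is estimated, and the hypothesis with the largest inlier set is kept, which is what suppresses the erroneous SIFT matches.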
Under each motion condition, the motion scales between frames were calculated with Equation (30). After GME compensation, the PSNR values of the four methods were computed with Equation (33), where MSE is the mean squared error. The results are shown in Figures 17 to 19 and analyzed in Table 1.

SIFT-GME entails heavy computation in feature detection and matching. In general, the existence of hundreds of features in one image can make the algorithm difficult to implement in real time. Remote sensing images acquired by UAVs usually have rich textures; thousands or tens of thousands of SIFT features would need to be calculated, which would seriously affect real-time performance. By contrast, the calculation speed of BM-GME is high. In several digital video compression standards, such as H.264 and MPEG-4, block matching-based methods are applied to real-time GME. MaGME is a selective big-block matching-based GME method. The complexity of MaGME is approximately equal to the sum of three parts: metadata calculation, I&C feature detection, and selective block matching. The overall computation of MaGME is related to the number of blocks involved in block matching. Experiments show that if block matching employs less than 50% of the image blocks to calculate the global motion matrix, the amount of computation in MaGME is less than that in BM-GME. In practice, the number of required image blocks is much smaller than this. MaGME only requires minimal computation in coarse GME and a small number of image blocks to solve the residual GME. For the residual GME in particular, MaGME(T) requires at least one image block to calculate the translation, whereas MaGME(P) requires at least four image blocks to compute the eight parameters of the perspective projection transformation. Therefore, the amount of computation in MaGME is less than that in BM-GME and, consequently, also less than that in SIFT-GME.
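Equation (33) is the standard PSNR definition; a minimal implementation, assuming 8-bit imagery (peak value 255):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (Equation (33)):
    PSNR = 10 * log10(peak^2 / MSE), where MSE is the mean squared
    error between the reference frame and the GME-compensated frame."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float('inf')     # identical images: perfect compensation
    return 10.0 * np.log10(peak ** 2 / mse)
```

A higher PSNR after compensation indicates that the estimated global motion explains more of the inter-frame change.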

Conclusions
GME is a key step in many video applications of UAVRS. Given that conventional image-based GME methods do not perform well when a UAV's motion and behavior change significantly or when the image information is not rich, a metadata-assisted GME method called MaGME was developed in this study for medium-altitude UAVs.
The main contributions of this study are threefold. First, GME was divided into coarse and residual GME. Coarse GME was solved according to the theory of coordinate transformation. With the assistance of an auxiliary image, the large-scale motion effect on image matching was eliminated, and the wide baseline condition was converted to a narrow baseline one. Second, to maximize the use of reliable information in the image and ensure high-precision motion estimation of the contents of interest, an I&C feature detection algorithm was designed to describe the information content and contrast simultaneously. Based on the I&C feature, a big-block matching method was developed to complete residual GME. Third, an image motion monitor was designed to determine the scale of video motion and select the appropriate processing strategy.
A medium-altitude UAV was employed to collect experimental data. Three typical groups of datasets, covering translation, rotation, and zoom, were set up to test four GME methods. These four methods are the proposed MaGME(T) (MaGME with residual translation motion estimation), the proposed MaGME(P) (MaGME with residual perspective projection motion estimation), GME based on block matching, and SIFT-based GME. The PSNR and motion scale values of the three datasets were computed and analyzed (300 images and metadata samples in all). The results show that the proposed MaGME(T) and MaGME(P) exhibit encouraging performance when the motion scale is large. The two methods can be applied to images with few local features in several special landforms. The results of this research can be applied to other medium- or high-altitude UAVs with a similar system structure.

Figure 2.
Figure 2. Five coordinate systems. GCS utilizes the Gauss-Kruger coordinate on the XOY plane and altitude on the Z-axis.

Figure 5.
Figure 5. Image divided into big blocks. Several typical contents (house, tree) are in big blocks, which is useful in improving matching precision.

Figure 6.
Figure 6. Three basic motions between frames of a UAV video.

Figure 8.
Figure 8. Study area and flight path.

Figure 9.
Figure 9. Image examples of the dataset: (a) images of translation, (b) images of rotation, and (c) images of zoom motion.

Figure 12.
Figure 12. Registration and fusion of two images in GCS.

Figure 13.
Figure 13. Transformation from image A to image A': (a) image A, F(t), and (b) auxiliary image A', F'(t).

Figure 14.
Figure 14. Fusion of image A'-F'(t) and image B-F(t+1): (a) fusion of images A' and B without translation; (b) fusion of images A' and B with some translation.

Figure 15.
Figure 15. I&C feature value map and fusion result: (a) I&C feature map of the image and (b) fusion of the I&C feature map and the image.

Figure 16.
Figure 16. Analysis of the motion vectors: (a) motion vector field of all blocks; (b) fusion of the motion vector field and the I&C feature map.

Figure 17.
Figure 17. Performance analysis under a large-scale translation condition: (a) PSNR of the four methods and (b) translation scale of the images.

Figure 18.
Figure 18. Performance analysis under a large-scale rotation condition: (a) PSNR of the four methods and (b) rotation scale of the images.

Figure 19.
Figure 19. Performance analysis under a large-scale zoom motion condition: (a) PSNR of the four methods and (b) zoom scale of the images.

Table 1.
Uses of GME in UAVRS.
11 Camera pan (pan): angle between the camera's optical axis and the UAV's nose, unit: degree
12 Camera tilt (tilt): angle between the camera's optical axis and the UAV body plane, unit: degree
13 Resolution (Row*Col): Row is the number of image rows; Col is the number of image columns
14 Focal length (f): unit: meter
15 Pixel size (u): size of each pixel, unit: meter
According to this process, images F(t) and F(t+1) can be projected onto the ground plane in GCS; the two image planes F*(t) and F*(t+1) are then obtained. The transformation from image plane F to ground plane F*.