Towards Real-Time Service from Remote Sensing : Compression of Earth Observatory Video Data via Long-Term Background Referencing

City surveillance enables many innovative applications of smart cities. However, the real-time utilization of remotely sensed surveillance data via unmanned aerial vehicles (UAVs) or video satellites is hindered by the considerable gap between the high data collection rate and the limited transmission bandwidth. High efficiency compression of the data is in high demand. Long-term background redundancy (LBR) (in contrast to local spatial/temporal redundancies in a single video clip) is a new form of redundancy common in Earth observatory video data (EOVD). LBR is induced by the repetition of static landscapes across multiple video clips and becomes significant as the number of video clips shot of the same area increases. Eliminating LBR improves EOVD coding efficiency considerably. First, this study proposes eliminating LBR by creating a long-term background referencing library (LBRL) containing high-definition geographically registered images of an entire area. Then, it analyzes the factors affecting the variations in the image representations of the background. Next, it proposes a method of generating references for encoding current video and develops the encoding and decoding framework for EOVD compression. Experimental results show that encoding UAV video clips with the proposed method saved an average of more than 54% bits using references generated under the same conditions. Bitrate savings reached 25–35% when applied to satellite video data with arbitrarily collected reference images. Applying the proposed coding method to EOVD will facilitate remote surveillance, which can foster the development of online smart city applications.


Introduction
Dynamic Earth observatory video data (EOVD) has enabled many innovative smart city applications (e.g., smart transportation, sewage disposal monitoring, and disaster management). Remote surveillance via unmanned aerial vehicles (UAVs) and video satellites has become a new trend in smart city development. However, receiving the EOVD immediately after its capture, in order to meet the real-time demands of dynamic remote sensing data analysis and service in smart cities, remains a key problem.
The substantial gap between the EOVD data collection rate and the transmission bandwidth has greatly restricted remote surveillance applications in smart cities. Taking the Jilin-1 satellite as an example, a single frame of satellite video data is about 12,000 × 5000 pixels with a frame rate of 15 fps, resulting in 20 Gbps of video data. However, the transmission channel for real-time transmission between satellites and the Earth is only 10–20 Mbps. Even with the latest coding standard, high-efficiency video coding (HEVC), at a compression ratio of 300:1, the gap to be bridged is still 3- to 6-fold; thus, efficient data compression techniques are in high demand. Although the situation is alleviated to some extent for data transmission from short-distance UAVs, it deteriorates rapidly as the data receiving distance increases. To solve this problem, much work has been done to reduce the data size through dictionary learning-based data representation. One excellent work is the incremental K-SVD method for spatial big data representation [1,2]. Another representative work is the low-rank dictionary [3,4]. However, as we focus on the representation of continuous data sequences in the pixel domain, these methods cannot be directly applied to compress remote sensing video data.
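The arithmetic behind the quoted gap can be checked with a short sketch. The 24 bits per pixel used for the raw rate is an assumption (the text does not state the bit depth), and the function names are illustrative only.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel):
    """Uncompressed video bitrate in bits per second."""
    return width * height * fps * bits_per_pixel


def transmission_gap(raw_bps, compression_ratio, channel_min_bps, channel_max_bps):
    """Return (best case, worst case) ratio of compressed bitrate to channel capacity."""
    compressed = raw_bps / compression_ratio
    return compressed / channel_max_bps, compressed / channel_min_bps


# Assumption: 24-bit color. This gives 21.6 Gbps, close to the quoted 20 Gbps.
raw_bps = raw_bitrate_bps(12_000, 5_000, 15, 24)

# Using the rounded 20 Gbps figure, a 300:1 ratio, and a 10-20 Mbps downlink.
gap_low, gap_high = transmission_gap(20e9, 300, 10e6, 20e6)
print(f"raw: {raw_bps / 1e9:.1f} Gbps, gap: {gap_low:.1f}x to {gap_high:.1f}x")
```

With the rounded 20 Gbps figure, the compressed stream is about 67 Mbps, which is roughly 3.3 to 6.7 times the channel capacity and matches the 3- to 6-fold gap stated above.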
EOVD are video clips taken from high altitudes (e.g., 500–600 km for video satellites and hundreds of meters for UAVs) in which the majority of the picture's content is landscape with small foreground objects, in contrast to common videos with foreground objects as the major content. Moreover, remote surveillance for smart city applications produces large overlaps in the surveillance video data collected over an extended time period. Since the landscape changes slowly, the overlapping areas will have similar backgrounds across video clips, giving rise to a new form of redundancy, called long-term background redundancy (LBR) in this paper. Taking all the videos on a large temporal scale, LBR becomes significant as the background repetition dramatically increases. Thus, eliminating LBR in EOVD will significantly improve coding efficiency and support real-time smart city video applications.
Most widespread video coding strategies adopt intra/inter-frame prediction to exploit similarities in local spatial/temporal domains [5,6], effectively eliminating most local redundancies within a single video clip. Moreover, to further reduce the redundancy in ground surveillance video data induced by static backgrounds, Reference [7] proposed generating short-term, high-quality reference frames of backgrounds to improve the prediction accuracy for those areas. While this study achieved efficient coding for a single video source, the similarity measurement is subject to changes in visual appearance due to projection and illumination variations of the background on large spatial and temporal scales.
Several multisource data coding schemes have been proposed in recent years, which mainly focus on coding image sets from arbitrary views. Some researchers [8–10] have utilized scale-invariant feature transform (SIFT) features to measure the similarity between blocks from different images. Due to their invariance to rotation and robustness to illumination changes, SIFT features can build correlations among different images, thus achieving inter-image prediction to explore redundancies among multisource images. The same idea has been extended to duplicated video clips [11], where the redundancies between video clips are eliminated by referencing the basic video clip after adjusting for projection and illumination. While these methods have provided excellent ideas for exploiting redundancies across data sources within a dataset, the matched image blocks in pixel domains from different sources usually do not relate in reality and thus are not suitable for matching large areas like backgrounds in EOVD.
A reference library that records information common to all video data (e.g., libraries of two-dimensional (2D) vehicle images [12] or three-dimensional (3D) vehicle models [13,14] to eliminate redundancies caused by the repetition of similar vehicles) could efficiently eliminate redundancies across video clips. Unlike references from a dataset, a library-based method normally presents the basic knowledge of the encoded content and transformation. It is more efficient than only selecting references in the pixel domain, because this method reveals how the images relate in reality. Our method was developed based on this idea but focuses on using a library of backgrounds rather than foregrounds to eliminate LBRs in EOVD.
In this study, we developed a long-term background referencing library (LBRL)-based EOVD coding scheme according to the characteristics of the EOVD. First, we discussed the LBR induced by similarities among video clips taken of the same area throughout a temporal scale. Then, we analyzed the factors causing image representation variations of the background in different video clips. Based on that analysis, we proposed how to develop an LBRL for remote surveillance applications in smart cities. Next, we proposed a method to generate references based on the LBRL and the adjusted impact factor. Finally, we developed an encoding and decoding framework for EOVD compression.
Video clips from UAVs and from video satellites were used to conduct experiments to evaluate the performance of the proposed method. A reference library built under the same conditions as those of the encoded video clips was used in the UAV case to show how much a good background reference can reduce the bitrate, and the results revealed that the proposed method can achieve 54% bitrate savings on average over the main profile of HEVC. In the satellite case, the LBRL was developed from a Google Earth image [15], which attempted to simulate the usage of real historical remote sensing data. In this case, the bitrate savings were around 25%. In addition, we also tested to what extent different impact factors contribute to the bitrate savings.
There are three main contributions of this work: (1) We analyzed the characteristics of Earth observatory video data and discovered the long-term background redundancy among the videos collected of the same location at different times, which provides a chance to further compress the EOVD. (2) We introduced the concept of a referencing library (the LBRL) as the fundamental infrastructure to facilitate the real-time collection of EOVD, which will further enhance online smart city applications. (3) We proposed an LBRL-based reference generation method and the coding framework for EOVD, which can significantly reduce the bitrate compared to the coding standard for a single video source, helping to alleviate the difference between the data collection bitrate and the space-to-Earth transmission bandwidth.
The remainder of this paper is organized as follows: Section 2 provides a literature review regarding related work. A detailed analysis of the LBR of EOVD and the development of an LBRL to eliminate LBR is illustrated in Section 3. The LBRL-based reference generating and encoding framework is developed in Section 4. Section 5 reports our experimental results, and Section 6 concludes the paper.

Related Work
Our work is related to the current single video coding method, the coding method for ground surveillance data considering the scene redundancy, and the coding method for multisource video clips. Therefore, we review the coding method from these three aspects.

Video Compression of Satellite Videos
In the initial stages of satellite development, satellite data were stored as remote sensing images. Satellite image compression methods can be divided into two categories: prediction-based and transformation-based. Prediction-based methods [16–18] use encoded pixels to estimate the current pixel value based on the correlation between pixels or bands of satellite images. Transformation-based methods [19,20] regard satellite data as a generalized, stationary random field [21]; its three-dimensional orthogonal transformation [22] maximizes the information concentrated in a small number of transform coefficients, thereby removing the maximum amount of spatial redundancy and inter-spectrum redundancy. The above methods were designed for a single image frame. Although they effectively removed spatial redundancy in the image, removing the redundancy caused by the correlation between the images was difficult. In recent years, with the development of video satellites, general video compression standards have been integrated into satellites. For example, Skysat [23], a video satellite launched by Skybox, was outfitted with video compression standard H.264 [5]. General video compression standards use local spatial-temporal prediction models in small-scale space-time ranges to process local, short-term data; they cannot, however, remove geographical background redundancy from satellite video.

Video Compression of Surveillance Videos
Surveillance videos characteristically have fixed scenes and slight changes in the background. Given these characteristics, surveillance video compression methods can be divided into LRSD (low-rank sparse decomposition)-based and background modeling methods. LRSD-based methods [24–26] employ LRSD to decompose the input video into low-rank components representing the background and sparse components representing the moving objects, which are encoded by different methods. Background modeling methods [7,27–30] use background modeling technology to build background frames for reference that improve the compression efficiency by improving the prediction accuracy. These surveillance video compression methods only apply to local spatial-temporal redundancy in single-source video; they do not consider the similarity of the background when the same region is captured by multisource satellite videos and cannot cope with apparent differences in the area due to shooting time, posture, height, and other factors.

Video Compression of Multisource Image/Video Data
Multisource image/video data refers to the collection of images/videos obtained by multiple shooting devices at various times from different positions. They contain a large number of similar images with common pixel distributions, features, and backgrounds. With the development of cloud technology, cloud-based image compression has attracted substantial interest [8,31,32]. These methods use cloud historical data to compress images by searching for similar images in the cloud data as a reference to improve prediction accuracy. Concurrently, compression methods [9–11,33,34] for image sets were developed using cloud historical data as a reference. The basic idea is to cluster images via image content, organize those images into a pseudo sequence, and code them like a video. The compression methods for multisource image/video data are designed from the perspective of image features, which usually mine similarities between image blocks by matching feature points. Moreover, multiscale features for image representation have been proposed to extend representation from a single payload to multiple payloads [35–38], which is also a way to build relations between multiple data sources. However, computational complexity is high, and the actual correspondence between the selected image block and the coding object is often lacking, which is not conducive to large-area matching.

Long-Term Background Referencing Library
This section first examines LBR in EOVD and discusses which factors matter for eliminating it. Then, it develops an LBRL to represent the long-term background.

A New Redundancy Induced by Background Repetition
LBR is a new type of redundancy found in remotely captured video clips shot of the same location. It is caused by the similarity among the repeated backgrounds in different video clips. In the long term, LBR shows two characteristics: structural consistency and appearance variance. For notation, let $A$ denote the area shot by a certain video clip and $\mathcal{A}$ the entire area of a smart city, so that $A \subset \mathcal{A}$. The background represented in a video clip of area $A$ is denoted as $B$.
Structural consistency: Since landscapes change slowly, the structure of a certain area can be assumed to be consistent within a time period. Therefore, different video clips shot of this area will reveal the same structure of the area. Figure 1a shows two frames taken from two video clips shot of the same location at different times by two satellites. Even though there are some differences in the image representations, we can easily judge that this is the same place according to the shared structure.
Appearance variance: Due to the different conditions under which the video clips are captured, such as natural conditions (e.g., atmosphere, illumination) and device conditions (e.g., sensors), the image representations of the same area will have some variations. As shown in Figure 1a and the magnified part in Figure 1b, we can find variances in viewing angle, color, and quality. Thus, we discuss the appearance variance in these aspects.
Remote Sens. 2018, 10, x FOR PEER REVIEW

(1) Projection difference: a location in a specific video clip can be represented by the projection of that area into the image plane, which is:
$$B = P_v(A) \tag{1}$$
where $B$ is the background in a picture and $P_v(A)$ is the projection of area $A$ into a video clip. Since the projection is decided by the position and angle of the camera, it changes for every frame.
(2) Radiometric difference: the color of an image is affected by changes in the area's environmental radiation. Radiation changes can be modeled because the factors causing them, such as illumination, are limited in the long term. Therefore, the image representation of an area can be expressed as follows:
$$I_R(B) = M_I(B) \tag{2}$$
where $M_I$ is a radiation model that converts from the reference background to the current image, and $I_R(B)$ is the image representation of the background after radiation change.
(3) Quality difference: EOVD image quality is affected by many factors. Some are related to the sensor itself, such as the optical imaging system, electrical signal conversion, and motion of the platform. These factors remain stable for a certain video clip, leading to consistent quality degradation for that video clip. Therefore, the image representation of an area can be expressed as follows:
$$I_D(B) = M_q(I_R(B)) \tag{3}$$
where $M_q$ is the quality degradation of a certain satellite, and $I_D(B)$ is the final image representation of the area.

Development of an LBRL
Current video coding standards using intra- and inter-frame prediction are very efficient at eliminating short-term redundancies. However, using such a prediction method across multiple video clips is uncommon, mostly because of how the image representation changes due to variations in projection, radiation, and quality. As a result, the same area is recorded every time it is captured, leading to a waste of transmission bandwidth.
Creating an LBRL to eliminate LBR addresses this redundant transmission issue. Ideally, an LBRL should do the following:
(1) Be able to cover the entire area of smart city applications.
(2) Be robust enough to handle changes in image representation due to various viewing angles.
(3) Be compatible with changes in the visual appearance of the background caused by radiation changes and quality degradation.
Therefore, we proposed an LBRL composed of basic, high-resolution reference images of a smart city's area, which can support three essential transformations: projection transformation related to each frame, and radiometric adjustment plus quality adjustment related to a video clip. The formation of the LBRL is shown in Figure 2.
We used historical, geographically registered images that had been corrected to develop an LBRL of an area for EOVD referencing. These images were stitched together to cover the entire area of a smart city. The geographical attribute was used to facilitate image matching during the referencing process in the EOVD encoding. The approximate area was determined according to the initial video's data positioning. Image data in an LBRL can be updated when a static ground change is detected from new video data.
Since the three transformations were highly related to each video clip, we did not include them in the LBRL but made the transformations available within the LBRL-based reference generator, which will be described in the next section.

LBRL-Based Reference Generation and Coding Framework
This section first details how to generate references from an LBRL for a newly collected video clip through geometrical matching for projection transformation, radiometric adjustment, and quality adjustment of the background image. Then, it describes the encoding and decoding scheme based on a generated background reference. An overview of the proposed LBRL-based EOVD coding framework is illustrated in Figure 3.

Generating a Background Reference
Prediction from references is known to be the most important process in the current video coding process to remove redundancies. As in block-based prediction, coding efficiency is directly related to the similarity between the reference and the current encoding frame. We generated background references for the current frame in the following sequence: geometrical matching, radiometric adjustment, and quality adjustment.

Geometric Matching
The initial position of a captured video clip is decided through the satellite global positioning system (GPS), which is used to decide an approximate captured area in the geographically registered library. A buffer is added outside the approximate captured area to cope with the positioning error of GPS. The reference image containing the captured area with the buffer is then cropped for the geometrical matching between the reference image and the captured video frame.
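The buffered crop can be illustrated with a minimal sketch. It assumes the approximate captured area has already been converted from GPS coordinates to pixel rows and columns in the library image; the function name and the buffer size are hypothetical.

```python
import numpy as np


def crop_with_buffer(library_image, row0, col0, row1, col1, buffer_px):
    """Crop [row0:row1, col0:col1] enlarged by buffer_px on every side.

    The window is clamped to the library image bounds. Returns the crop and
    the (row, col) of its top-left corner in library coordinates.
    """
    height, width = library_image.shape[:2]
    r0 = max(row0 - buffer_px, 0)
    c0 = max(col0 - buffer_px, 0)
    r1 = min(row1 + buffer_px, height)
    c1 = min(col1 + buffer_px, width)
    return library_image[r0:r1, c0:c1], (r0, c0)


# Hypothetical library image and approximate captured area.
library = np.arange(100 * 120).reshape(100, 120)
crop, origin = crop_with_buffer(library, 10, 20, 30, 50, buffer_px=15)
```

Returning the crop origin keeps the link between crop coordinates and library coordinates, which later matching steps would need.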
Geometrical matching to the LBRL locates the correct shooting area of the current frame and transforms the area reference image from the LBRL to the target frame through projection transformation. This process consists of downsampling the reference image, matching correspondence points, estimating the perspective transformation, and resampling the image.
SIFT feature matching [39] is normally used to find correspondence points. Due to differences in resolution, quality, and radiation between the basic area image from the LBRL and the current frame, however, sufficient correspondence point pairs often cannot be obtained, resulting in incorrect matches. Therefore, we first downsampled the high-resolution image from the LBRL to convert it to an image with ground resolution similar to the video data. The ground resolution of the current video was obtained from satellite documentation. The approximate shooting area was estimated from the online geopositioning of the satellite imagery. Then, to match the downsampled reference image with the current video frame, we adopted the improved SIFT matching method [40] that was developed for multisensor remote sensing image matching, which developed a distinctive order based on a self-similarity descriptor that was robust against illumination differences.
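The resolution-matching step can be sketched as follows. The text does not specify the downsampling filter, so block averaging by an integer factor is an assumption here, as are the ground-resolution values and function names. Feature matching itself is not shown.

```python
import numpy as np


def downsample_factor(ref_gsd_m, video_gsd_m):
    """Integer factor bringing the reference close to the video ground resolution."""
    return max(int(round(video_gsd_m / ref_gsd_m)), 1)


def block_average(image, factor):
    """Downsample by averaging non-overlapping factor x factor blocks.

    Rows and columns that do not fill a complete block are trimmed.
    Works for grayscale (H, W) and multi-channel (H, W, C) arrays.
    """
    h, w = image.shape[:2]
    h2, w2 = h // factor * factor, w // factor * factor
    trimmed = image[:h2, :w2].astype(np.float64)
    blocks = trimmed.reshape(h2 // factor, factor, w2 // factor, factor,
                             *image.shape[2:])
    return blocks.mean(axis=(1, 3))


# Hypothetical values: 0.5 m/pixel reference, 1.0 m/pixel video.
factor = downsample_factor(0.5, 1.0)
small = block_average(np.arange(16).reshape(4, 4), factor)
```

Averaging rather than plain subsampling acts as a simple low-pass filter, which limits aliasing before feature extraction.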
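Based on correspondence point pairs, the perspective transformation was estimated by solving the following mapping functions:
$$x_i^c = \frac{h_{11} x_i^r + h_{12} y_i^r + h_{13}}{h_{31} x_i^r + h_{32} y_i^r + 1}, \qquad y_i^c = \frac{h_{21} x_i^r + h_{22} y_i^r + h_{23}}{h_{31} x_i^r + h_{32} y_i^r + 1} \tag{4}$$
where $(x_i^c, y_i^c)$ are the coordinates of points in the current video, $(x_i^r, y_i^r)$ are the corresponding points in the downsampled reference images from the LBRL, and $h_{11}, \ldots, h_{32}$ are the coefficients of the perspective transformation. After estimating the perspective transformation, we generated the geometrically transformed reference image $I_r^g$ using Equation (4).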
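As an illustration of how a perspective transformation can be estimated from point pairs, the sketch below solves the standard eight-parameter model by linear least squares, with the last matrix element fixed to 1. The solver choice, function names, and sample points are assumptions; the point pairs are taken as given, and outlier rejection and image resampling are omitted.

```python
import numpy as np


def estimate_homography(ref_pts, cur_pts):
    """Least-squares 3x3 matrix H (H[2, 2] = 1) mapping reference to current points.

    Each pair contributes two linear equations obtained by multiplying the
    projective mapping through by its denominator. Needs at least 4 pairs.
    """
    rows, rhs = [], []
    for (xr, yr), (xc, yc) in zip(ref_pts, cur_pts):
        rows.append([xr, yr, 1, 0, 0, 0, -xc * xr, -xc * yr])
        rhs.append(xc)
        rows.append([0, 0, 0, xr, yr, 1, -yc * xr, -yc * yr])
        rhs.append(yc)
    coeffs, *_ = np.linalg.lstsq(np.array(rows, dtype=np.float64),
                                 np.array(rhs, dtype=np.float64), rcond=None)
    return np.append(coeffs, 1.0).reshape(3, 3)


def apply_homography(H, pts):
    """Map (x, y) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]


# Hypothetical ground-truth transform and reference points for a round trip.
H_true = np.array([[1.1, 0.02, 5.0], [-0.03, 0.95, -2.0], [1e-4, -2e-4, 1.0]])
ref_points = [(0, 0), (100, 0), (100, 80), (0, 80), (50, 40)]
cur_points = apply_homography(H_true, ref_points)
H_est = estimate_homography(ref_points, cur_points)
```

In practice, mismatched pairs would bias a plain least-squares fit, so a robust estimator is usually wrapped around a solver like this one.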

Radiometric Adjustment
We employed the color transfer model proposed in Reference [41] to adjust the radiation of the geometrically transformed reference image $I_r^g$ to correspond with the current video frame. Since video data is recorded in the YUV color space, the radiometric adjustment was also conducted in this color space. Since a YUV color space is similar to a lαβ color space, in which the first channel is lightness and the other two channels are color components, we adopted a color transform model similar to the one proposed for the lαβ color space in our work: where
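As a rough illustration of this kind of adjustment, the sketch below matches the per-channel mean and standard deviation of the reference to those of the current frame, which is a common statistics-matching form of color transfer. Whether this is exactly the model of Reference [41] as used here is an assumption, and the function name is hypothetical.

```python
import numpy as np


def match_channel_statistics(reference_yuv, current_yuv, eps=1e-8):
    """Shift and scale each reference channel to the current frame's statistics.

    Both inputs are (H, W, 3) arrays in the same color space (YUV here).
    For each channel: out = (ref - mean_ref) * (std_cur / std_ref) + mean_cur.
    """
    ref = reference_yuv.astype(np.float64)
    cur = current_yuv.astype(np.float64)
    mean_ref, std_ref = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    mean_cur, std_cur = cur.mean(axis=(0, 1)), cur.std(axis=(0, 1))
    scale = std_cur / np.maximum(std_ref, eps)  # eps guards flat channels
    return (ref - mean_ref) * scale + mean_cur


# Synthetic check: the current frame is a linear radiometric change of the reference.
rng = np.random.default_rng(0)
reference = rng.uniform(0, 255, size=(8, 8, 3))
current = 0.5 * reference + 10.0
adjusted = match_channel_statistics(reference, current)
```

Under a model of this form, only a mean and a standard deviation per channel would need to be sent as control data, which keeps the side information small.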

Quality Adjustment
Images in the current satellite video data usually appeared blurrier than the reference image; thus, the quality of the reference image was adjusted to correspond to the quality of the current video frame. In this paper, we adjusted the quality based on the previously obtained reference image $I_r^c$ after geometrical matching and radiometric adjustment, generating the final reference image $I_r$.
We applied a 2D Gaussian blur filter to simulate the quality degradation of the satellite video. Since we assumed the quality degradation was homogeneous over the whole image, we adopted an isotropic Gaussian model and set the mean of the Gaussian distribution to 0, leaving the standard deviation $\sigma$ to be defined according to the difference between $I_r^c$ and the current video frame. In practice, a Gaussian blur filter can be converted to a 5 × 5 kernel, the values of which can be represented by a polynomial of $\sigma$ according to their distance to the kernel center. In this way, the image value after Gaussian blur can be represented by a function of $\sigma$; thus, $\sigma$ can be obtained by minimizing the pixel value differences between $I_r^c$ after Gaussian blur and the current frame. To reduce computational complexity, we stochastically selected $N$ 5 × 5 blocks from $I_r^c$ and their corresponding pixels from the current frame and minimized the objective function to obtain the optimized parameter $\sigma$ as follows:
where $B_r^k$ is the $k$th block from $I_r^c$, and $p_c^k$ is the $k$th corresponding pixel value from the current frame. $G(\sigma)$ is the Gaussian kernel, whose values are defined by the parameter $\sigma$ as follows:
The final reference image $I_r$ was obtained by $G(\sigma) * I_r^c$. The next section will introduce how to use an LBRL reference image to encode and decode EOVD.
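The idea of fitting $\sigma$ on randomly sampled blocks can be sketched as below. Several choices are assumptions rather than the authors' method: the kernel is a normalized sampled Gaussian, the cost is a sum of squared differences, and $\sigma$ is picked by a simple grid search instead of the closed-form polynomial treatment described above. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma, size=5):
    """Isotropic zero-mean Gaussian sampled on a size x size grid, normalized to sum to 1."""
    half = size // 2
    axis = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(axis, axis)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def estimate_sigma(reference, current, n_blocks=200, sigmas=None, seed=0):
    """Pick sigma so that blurred 5x5 reference blocks best match current pixels."""
    if sigmas is None:
        sigmas = np.linspace(0.3, 3.0, 28)
    rng = np.random.default_rng(seed)
    h, w = reference.shape
    rows = rng.integers(2, h - 2, size=n_blocks)  # block centers, 2 px from border
    cols = rng.integers(2, w - 2, size=n_blocks)
    blocks = np.stack([reference[r - 2:r + 3, c - 2:c + 3]
                       for r, c in zip(rows, cols)]).astype(np.float64)
    targets = current[rows, cols].astype(np.float64)
    costs = []
    for sigma in sigmas:
        # Blurred value at each block center is the kernel-weighted block sum.
        blurred = np.tensordot(blocks, gaussian_kernel(sigma), axes=([1, 2], [0, 1]))
        costs.append(np.sum((blurred - targets) ** 2))
    return float(sigmas[int(np.argmin(costs))])
```

Sampling a few hundred blocks keeps the cost of each candidate $\sigma$ independent of the frame size, which is the point of the stochastic selection described above.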

Encoding and Decoding Scheme
The LBRL-based encoding and decoding of satellite videos requires the LBRL reference image at both the encoding and decoding ends. Figure 4 illustrates the overall LBRL-based encoding framework. The encoding process can be described as follows:

Step 1. Generating the background reference. Initially, or for an I frame, $f_I$, we used the proposed LBRL-based reference generation method described in Section 4.1 to initialize a background reference (denoted by $I_r$ in Figure 4) for the encoding of I frames. Since the generated background reference was not sent, we needed to send the control data together with the encoded frame to reconstruct the reference at the decoder. The control data for generating the background reference was denoted as $P_{rg}$, including the perspective transformation matrix from geometric matching, the model parameters from radiometric adjustment, and $\sigma$ from quality adjustment. The generated background reference was stored in a temporal buffer in the encoder to update the reference for subsequent frames. When a newly generated $I_r$ was put into the reference buffer, previous data were removed from the buffer.
Step 2. Updating the background reference. For P frames, f_P, immediately after an I frame, the radiometric conditions and quality degradation did not vary markedly; only the projections changed slightly. Therefore, we only updated the perspective transformation, starting from the background reference of the last frame. The output control data was denoted as P_ru, including a new perspective transformation (PT) matrix. The updated reference image was then added to the reference buffer.
Step 3. Calculating candidate modes and performing predictions. A generated or updated background reference was added to the coding reference list. For any I frame, besides the traditional intra-picture prediction, a long-term prediction (denoted by m_l) taking I_r as the additional reference could also be performed. Since inter-picture prediction is normally more efficient than intra-picture prediction, this long-term prediction helps reduce the bitrate of I frames. Then, for P frames, both short-term (denoted by m_s) and long-term predictions could be selected by referring to the adjacent frames or the background reference, respectively. As shown in Reference [3], a high-quality background reference can help reduce the bitrate of blocks in P frames.
Step 4. Encoding and reconstructing the current block. Rate-distortion optimization was applied to select the best encoding mode. After the predictions were performed, residuals (denoted as Res) were computed and encoded by transform, scaling, quantization, and entropy coding. Each block was reconstructed by adding the block reference to the decoded block residuals, and the reconstructed frames (denoted as Rec_f) were stored in the reconstructed frame buffer to provide short-term references in the reference list.
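The mode decision in Steps 3 and 4 can be illustrated with a minimal Python sketch. It assumes the usual Lagrangian form of rate-distortion selection, J = D + λR; the class name, the function names, the cost values, and the value of λ are hypothetical and are not taken from the HM or x264 code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mode: str          # "intra", "short_term" (m_s), or "long_term" (m_l)
    distortion: float  # e.g., sum of squared errors of the reconstructed block
    rate_bits: float   # bits for mode, motion data, and residual


def candidate_modes(frame_type):
    """Prediction modes named in Step 3 for each frame type."""
    if frame_type == "I":
        # The long-term prediction is the addition made by the LBRL reference.
        return ["intra", "long_term"]
    # For P frames, Step 3 names short-term and long-term predictions.
    return ["short_term", "long_term"]


def select_mode(candidates, lam):
    """Return the candidate with the smallest cost J = D + lam * R."""
    return min(candidates, key=lambda c: c.distortion + lam * c.rate_bits)


# Hypothetical costs for one block of an I frame.
block = [
    Candidate("intra", distortion=900.0, rate_bits=420.0),
    Candidate("long_term", distortion=1000.0, rate_bits=150.0),
]
best = select_mode(block, lam=2.0)
```

With these illustrative numbers the long-term mode is chosen because its lower rate outweighs its slightly higher distortion, which is the situation the background reference is meant to create for I frames.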
After encoding a video clip, the parameters for the reference generation and prediction, as well as the encoded residuals, are output. After being transmitted from the remote sensing platforms to the server on Earth, video clips are decoded by reconstructing the background references from the LBRL using the reference generating or updating parameters.
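Because the decoder rebuilds the reference from the transmitted control data, the core geometric operation on both ends is applying a 3 × 3 perspective transformation matrix. The following NumPy sketch shows that operation on point coordinates only; the function name and the example matrix are hypothetical, and a full implementation would also resample the image pixels.

```python
import numpy as np


def warp_points(H, points):
    """Apply a 3x3 perspective transformation H to an (N, 2) array of (x, y) points."""
    pts = np.asarray(points, dtype=float)
    # Lift to homogeneous coordinates (x, y, 1).
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homog @ np.asarray(H, dtype=float).T
    # Divide by the third coordinate to return to image coordinates.
    return mapped[:, :2] / mapped[:, 2:3]


# Hypothetical matrix with a small projective term in the last row.
H_example = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.001, 0.0, 1.0]])
corner = warp_points(H_example, [[100.0, 50.0]])
```

Since only the matrix entries have to be transmitted, the control data for a P frame stays small compared with sending the reference image itself.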

Experimental Setup
We evaluated the effectiveness of the proposed LBRL-based EOVD compression method by carrying out extensive experiments. Two types of EOVD datasets were used in this paper, as follows:
Video clips from UAVs: Four video clips from UAVs captured over Yangtze river park, Wuhan, China, were employed to evaluate performance, as shown in Figure 5a. These video clips were captured once a day for four consecutive days by a Yuneec Typhoon H UAV. The flight height was fixed at 100 m. Each original video clip contained 300 frames of 1080p (1920 × 1280) resolution at 15 fps. The videos contain both slow and fast flight. Two other videos of the same area were captured to extract frames as data for developing a reference image library for this test. In total, the area was covered by nine stitched key frame pictures.
Video clips from satellites: Four video clips from the Jilin-1 satellite over Valencia, Spain, were used in the experiment (Figure 5b), comprising two video clips of buildings, one of farmland, and one of seaside areas. To facilitate the coding process, we did not directly use the original 12,000 × 5000 resolution, but cropped video clips of 1080p size with 300 frames at 15 fps. As access to the historical satellite data was limited in our experiments, images from Google Earth [15] were employed as the LBRL images. The Google Earth images were already geographically registered and had high resolution.
Because the reference images for the UAV video clips were captured by the same device under similar conditions, this group of tests mainly focused on evaluating how effective the background library was for video coding. The reference images for the satellite video clips came from a different satellite under significantly different sensing conditions. This realistic scenario was used to test whether the algorithm for radiometric and quality adjustments in reference generation was effective.
In the experiment, we built two implementations of the proposed method on two standard codecs for different testing purposes (details shown in Table 1). The method can also be implemented on other codecs, since it mainly provides an extra encoding reference. The first implementation, named LBRL-HEVC, was based on the low-delay configuration of the HEVC test model HM16.8. It was compared to the unmodified HEVC codec to test the effectiveness of the proposed method on EOVD compression, since HEVC can achieve the highest compression ratio. This test was run on a four-core Intel i5 CPU at 2.6 GHz. Although HEVC achieves a very high compression ratio, it is not commonly applied in practice due to its computational complexity. To test the applicability of our method for practical use, the second implementation, named LBRL-x264, was based on the x264 codec and compared to the unmodified x264. x264 is known as the fastest CPU implementation of video compression [42]. It is the most commonly used codec in practice, including in applications on UAVs and video satellite platforms. This test was run on an Nvidia Jetson TX2, which was selected as one part of an embedded system developed for a small satellite set to launch in 2020. The Nvidia Jetson TX2 contains four ARM Cortex A57 cores and one GPU with 256 CUDA cores.
In the experiment, Bjøntegaard delta PSNR (BD-PSNR) and Bjøntegaard delta rate (BD-Rate) [43] were utilized as the metrics for the objective evaluation of coding performance. We also included subjective evaluation metrics for the satellite video data.
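For readers unfamiliar with the metric, the following Python sketch shows how a BD-Rate value is commonly computed: a cubic polynomial is fitted to the logarithm of the bitrate as a function of PSNR for each codec, both fits are integrated over the overlapping PSNR range, and the average difference is converted to a percentage. The function name and the sample rate/PSNR points are hypothetical; they are not the measurements reported in this paper.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec relative to the anchor."""
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of PSNR for each codec.
    fit_a = np.polyfit(pa, log_ra, 3)
    fit_t = np.polyfit(pt, log_rt, 3)

    # Integrate both fits over the PSNR interval covered by both curves.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)

    avg_log_diff = (area_t - area_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


# Hypothetical RD points: the test codec needs half the bitrate at each PSNR.
psnr = [32.0, 35.0, 38.0, 41.0]
anchor_kbps = [400.0, 800.0, 1600.0, 3200.0]
test_kbps = [200.0, 400.0, 800.0, 1600.0]
saving = bd_rate(anchor_kbps, psnr, test_kbps, psnr)
```

A negative value means the test codec needs less bitrate than the anchor at equal quality, so the bitrate savings quoted in the following sections correspond to negative BD-Rate values.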

Experiments with UAV Video Clips
This experiment tested how effectively a long-term background reference can improve the encoding efficiency. We used the first implementation of the proposed method, LBRL-HEVC, against HEVC for this purpose.
The image data used to build the LBRL was collected under the same conditions as the to-be-encoded video clips. The images were geographically corrected by using ground control points and then stitched together. The developed LBRL for UAV data is shown in Figure 6a, with a size of 12.69 MB. Considering that the images in the UAV LBRL shared quite similar conditions with the UAV video data, we only conducted geometric matching to generate the reference image, without radiometric adjustment and quality adjustment. Taking one video clip in Figure 6d as an example, the approximate area was first located in the LBRL (red rectangle) and then cropped out, as shown in Figure 6b. The geometric matching and transformation were then conducted to convert the cropped image from the LBRL to the same shape as the frames of the video clip.
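The locate-and-crop step can be sketched as a conversion from map coordinates to a pixel window. This Python sketch assumes a north-up library image described by the map coordinates of its upper-left corner and a single ground sampling distance; the function name, parameters, and numbers are hypothetical, and the paper does not specify how its library is indexed.

```python
import numpy as np


def crop_window(origin_xy, gsd, bounds, library_shape, margin_px=0):
    """Pixel window (row0, row1, col0, col1) covering bounds in a north-up image.

    origin_xy: map coordinates (x, y) of the upper-left pixel corner.
    gsd: ground sampling distance in map units per pixel.
    bounds: (x_min, y_min, x_max, y_max) of the area seen by the video clip.
    """
    x0, y0 = origin_xy
    x_min, y_min, x_max, y_max = bounds
    col0 = int(np.floor((x_min - x0) / gsd)) - margin_px
    col1 = int(np.ceil((x_max - x0) / gsd)) + margin_px
    # Rows grow downward while map y grows upward, hence the flipped sign.
    row0 = int(np.floor((y0 - y_max) / gsd)) - margin_px
    row1 = int(np.ceil((y0 - y_min) / gsd)) + margin_px
    rows, cols = library_shape[:2]
    # Clamp the window to the extent of the library image.
    return max(row0, 0), min(row1, rows), max(col0, 0), min(col1, cols)


# Hypothetical library of 400 x 400 pixels at 0.5 m per pixel.
library = np.zeros((400, 400, 3), dtype=np.uint8)
r0, r1, c0, c1 = crop_window((1000.0, 2000.0), 0.5,
                             (1010.0, 1950.0, 1060.0, 1980.0), library.shape)
cropped = library[r0:r1, c0:c1]
```

The cropped window only needs to cover the approximate area, since the subsequent geometric matching refines the alignment with the video frames.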
The total encoding performance gains of the proposed LBRL-HEVC compared with HEVC are listed in Table 2, and the rate-distortion (RD) curves are shown in Figure 7. At the same PSNR, the proposed method reduced the bitrate by 54.18% on average compared with HEVC. This result corresponds to a 4.32 dB PSNR gain over HEVC at the same bitrate. The bitrate reduction occurs mainly because of the bit savings from the I frame. Most of the prediction modes of the I frame changed from intra-frame prediction to inter-frame prediction, referencing the generated background reference images. The P frame can also reference both the generated background reference images and its previous frames.

Experiments with Satellite Video Clips
The LBRL in this case consisted of satellite images downloaded from Google Earth, due to the limited access to historical satellite data. Since the satellite video clips were captured over the city of Valencia, Spain, we built the LBRL for this city. The total land area of Valencia was 134.7 km², and the total size of the Google Earth images covering this city was 5.93 GB. The size of the library was proportional to the land area, namely 45 MB/km² on average. Even considering one of the biggest cities, New York City, USA, with a land area of about 784 km², the size of the LBRL is less than 35 GB.
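The storage estimate above can be reproduced with a few lines of Python. The sketch assumes that 1 GB is counted as 1024 MB and that the library size scales linearly with land area, as stated in the text; the function names are hypothetical.

```python
def library_density_mb_per_km2(size_gb, area_km2, mb_per_gb=1024):
    """Average library size per unit land area."""
    return size_gb * mb_per_gb / area_km2


def estimate_library_gb(area_km2, density_mb_per_km2, mb_per_gb=1024):
    """Library size for another city under the same density."""
    return area_km2 * density_mb_per_km2 / mb_per_gb


density = library_density_mb_per_km2(5.93, 134.7)  # Valencia
nyc_gb = estimate_library_gb(784, density)         # New York City
```

Under these assumptions the density comes out at roughly 45 MB/km² and the New York City estimate at roughly 34.5 GB, consistent with the figures quoted above.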

Intermediate Results from Background Reference Generation
The intermediate results for background reference generation from a Google Earth image (Figure 8b) for a certain video clip (sample frame in Figure 8a) are presented in Figure 8c,d. The Google Earth image is clearly much sharper and its colors are more vivid than those of the captured satellite videos, so besides geometric matching, we also conducted radiometric adjustment and quality adjustment to generate the final reference image. From the intermediate results, we can see that the proposed background reference generation method can successfully handle the variances in image representation caused by illumination and sharpness differences. However, the current strategy will not work well for the following problems, which change the representation of remote sensing images: (1) shadow movement; (2) projection difference of tall buildings; (3) huge illumination change; (4) scene change due to seasonal variation, e.g., vegetation; (5) landscape change. The solution to these problems might require multimode reference images for one region in the LBRL, together with image translation techniques and other methods for background reference generation. An updating strategy for the LBRL is also required to handle the landscape change problem.
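To make the two adjustments more concrete, the following NumPy sketch uses two simple stand-ins: per-channel mean and standard deviation matching for the radiometric adjustment, and a Gaussian blur with parameter σ for the quality adjustment. These are assumptions chosen for illustration; the models actually used in this paper are defined in the reference generation section and may differ, and all function names here are hypothetical.

```python
import numpy as np


def match_color_statistics(reference, frame):
    """Shift and scale each channel of reference to the frame's mean and std."""
    ref = reference.astype(float)
    frm = frame.astype(float)
    ref_mean, ref_std = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    frm_mean, frm_std = frm.mean(axis=(0, 1)), frm.std(axis=(0, 1))
    gain = frm_std / np.maximum(ref_std, 1e-8)  # avoid division by zero
    return (ref - ref_mean) * gain + frm_mean


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur of an (H, W, C) image with edge padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = image.astype(float)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out


def generate_reference(warped_reference, frame, sigma):
    """Radiometric adjustment followed by quality adjustment."""
    return gaussian_blur(match_color_statistics(warped_reference, frame), sigma)
```

The order mirrors the intermediate results in Figure 8: the geometrically matched image is first adjusted in color and then degraded in sharpness to resemble the captured frame.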

Results of LBRL-HEVC
The improvement of coding efficiency was tested first using the implementation LBRL-HEVC. In this test, the coding results from references generated with only radiometric adjustment (Only-RA) and with only quality adjustment (Only-QA) were also compared to analyze the effectiveness of the radiometric and quality adjustment in generating good background references.
The coding results of LBRL-HEVC compared with HEVC are presented in Table 3. On average, the bitrate savings reached 24.93%. The gap from the average bitrate savings obtained with UAV data indicates that the similarity of the background reference strongly affects the improvement of the EOVD compression ratio.
Across video clips with different content, the highest bitrate reduction appeared with farmland, where there were few tall buildings. Since we did not consider the elevation change, we could not correct the projection difference in our geometric matching, leading to low-efficiency prediction for places containing projection differences. The seaside video clip had the lowest bitrate reduction, which was probably due to the negative influence of waves in the water area.
Comparing the results from Only-RA and Only-QA with HEVC, the coding efficiency was not obviously improved. This might be because, with only one process, there were still great differences between the background reference and the video frames being encoded in the pixel domain, resulting in ineffective inter-frame predictions. From the experimental data, we can conclude that the quality adjustment was slightly more important than the radiometric adjustment for background reference generation.
The RD curves for the tested satellite video clips are shown in Figure 9, revealing results similar to those we obtained from Table 2. The RD curves for Only-RA and Only-QA almost overlapped with HEVC, showing no significant improvement. The curves for the proposed method were higher than the other curves for the four video clips, representing the general effectiveness of the proposed method in bitrate reduction for satellite videos.
In general, the implementation of the proposed method on HEVC (LBRL-HEVC) showed that we can generate effective background references from the Google Earth images, and that the compression ratio can be successfully increased. The bitrate reduction for satellite data was smaller than that for UAV data, mainly because of the lower similarity between the reference image in the LBRL and the current video data.

Results of LBRL-x264
In this section, the effectiveness of the proposed method in the embedded system of real applications is evaluated. The results of LBRL-x264 compared to x264 are presented in Table 4. Similar to the results of LBRL-HEVC compared to HEVC, LBRL-x264 reduced the bitrate by around 32.77% compared to x264 at the same PSNR, and the quality improvement was on average 1.7 dB at the same bitrate. The detailed results are plotted in Figure 10, together with the curves of LBRL-HEVC and HEVC. As shown in the curves of x264 and LBRL-x264, the differences between them were bigger in the lower part than in the higher part. The lower part of the curves covered the range of the selected bitrate for transmission, where an obvious bitrate reduction can be observed. More details are shown in the visual results in Figure 11. We can also notice that the bitrates from HEVC were much lower than the results from LBRL-x264 or x264. However, the HEVC-based codecs cannot be implemented on UAV or satellite platforms, due to the computational complexity presented in Section 5.4.
The visual comparisons for video clips Building-2 and Seaside are shown in Figure 11. In the visual comparison, we selected target bitrates of around 500 kbps. The video clips in the test were 1080p of 1920 × 1280 resolution, and the original video data of 12,000 × 5000 resolution was 30 times the size of the tested video clips. At the same quality, the encoded stream of the original video data would be about 15 Mbps, which is within the required range of 10-20 Mbps transmission bandwidth between satellites and Earth.
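The 15 Mbps figure follows from scaling the clip bitrate by the size ratio stated above. The short Python sketch below makes the arithmetic explicit; it assumes, as the text implicitly does, that the bitrate at a fixed quality grows linearly with the number of pixels, and the function name is hypothetical.

```python
def scaled_bitrate_mbps(clip_bitrate_kbps, size_ratio):
    """Bitrate of the full-size video, assuming linear scaling with pixel count."""
    return clip_bitrate_kbps * size_ratio / 1000.0


full_frame_mbps = scaled_bitrate_mbps(500, 30)  # 500 kbps clip, 30 times larger frame
fits_downlink = 10 <= full_frame_mbps <= 20     # 10-20 Mbps range quoted above
```

This is only a first-order estimate, since the actual bitrate of the full frame depends on its content.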
As shown in the pictures, if encoded at nearly the same bitrate, the LBRL-based methods can provide better visual quality than that achieved by the corresponding codec. Comparing different decoded pictures of the same frame, the visual qualities were consistent with the PSNR values; namely, lower PSNR corresponded to lower visual quality. LBRL-HEVC can provide almost the same visual result as the original frame, especially for Seaside with 39.25 dB. When the quality degraded to 35-37 dB with HEVC, decoded pictures tended to be blurry. Since the compression ratio is lower in LBRL-x264 and x264, the qualities of the decoded pictures were obviously lower than those from HEVC. We can clearly notice the blocking artifacts in the pictures from x264. Taking the cars on the road in the Seaside video clip as an example to clarify the visual comparison, we can count six cars in the original picture. After encoding by LBRL-HEVC, five remained, whilst only four cars were left in the decoded picture from HEVC. The shape of the cars became blurry in LBRL-x264, but we can still count five cars. The cars had almost disappeared from the picture from x264.

Computational Complexity Analysis
In the proposed method, the additional computational cost comes from the generation of reference images, including geometric matching, radiometric adjustment, quality adjustment, and the resampling of the images, as well as the long-term prediction for the I frame. The computational complexity of the other encoding processes is the same as that for HEVC. The LBRL is built offline; thus, we do not take it into consideration in the computational complexity analysis.
The computational complexity was measured in frames per second (fps) and tested separately on the two implementations. LBRL-HEVC and HEVC were tested using a laptop with an i5 CPU. The total additional time for background reference generation was 5.52 s for the I frame and 3.50 s for the P frame, since the P frame did not need radiometric adjustment and quality adjustment. The tests for LBRL-x264 and x264 were carried out on the Nvidia Jetson TX2. The algorithms for SIFT matching, radiometric adjustment, and resampling were accelerated by parallel processing on the GPU. Therefore, the total processing times for generating background reference images for the I frame and P frame were around 96.1 ms and 45.8 ms, respectively.
The comparison of the computational speed of the two implementations on different platforms is reported in Table 5. The QP for encoding was uniformly set to 32 for the comparison. Because the computational complexity was quite high for HEVC, the additional cost for reference generation did not have obvious effects on the processing speed. Since the processing time was around 1 min for one frame, far from real-time processing, the HEVC-based implementation cannot be used on UAV or satellite platforms. The computational speed reached more than 100 fps for x264, which left time for the proposed method to generate the background reference. The average processing speed was around 16.77 fps for LBRL-x264, slightly higher than the 15 fps set for remote sensing video data. Therefore, our method implemented on x264 could achieve the real-time processing of 1080p video data on a remote sensing platform.
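The real-time claim can be restated as a per-frame time budget. The Python sketch below converts the two frame rates quoted above into milliseconds per frame; the function name is hypothetical, and the calculation uses only the average speed reported here, not per-frame timings.

```python
def frame_time_ms(fps):
    """Average time available or spent per frame, in milliseconds."""
    return 1000.0 / fps


budget_ms = frame_time_ms(15.0)      # capture rate of the video data
achieved_ms = frame_time_ms(16.77)   # average speed of LBRL-x264
headroom_ms = budget_ms - achieved_ms
```

The average leaves about 7 ms of headroom per frame. Because reference generation for an I frame (around 96.1 ms) alone exceeds the per-frame budget, the real-time result should be read as an average over the sequence rather than a per-frame guarantee.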

Conclusions
This paper proposes a long-term background referencing-based Earth observatory data encoding method for real-time collection, analysis, and applications in smart cities. The key idea is to build an LBRL covering the entire area of a smart city to represent the common appearance of the landscape. For each newly captured video clip, the corresponding image of the shooting location is cropped from the library and converted according to the image representation of the area in the video clip. The converted image is used as the additional long-term reference for the encoding of I frames and P frames. Extensive experiments with UAV video data and satellite video data show that the proposed LBRL-based EOVD encoding method can save 25% to 54% of the total bitrate and achieve a significant gain in background coding performance over HEVC and x264, respectively. The GPU implementation of the proposed method based on the x264 codec on the Nvidia TX2 can achieve real-time processing of 1080p video data at 15 fps. By applying the x264 implementation, the gap between the bitrate of video data and the bandwidth of the transmission channel can be reduced from 3-6-fold to 2-4-fold.
Compared with the existing short-term prediction-based coding methods for single video clips, the proposed method exploits the large portion of static landscape in EOVD, in addition to making use of existing information about the landscape. Moreover, the information is restructured and geographically organized in the library, rather than kept in the original data form used in multisource coding methods. The geographically organized form of the library facilitates the reference searching. The uniform representation of the landscape and its transformations guarantee a highly similar reference, which further improves the compression efficiency.
The proposed method does not completely solve the real-time transmission problem between remote sensing platforms and Earth, but provides an idea for using known information on Earth to reduce the information that needs to be sent from remote sensing platforms. To further improve the compression efficiency of the proposed method, we will investigate the development of background referencing libraries from multiple sources of historical data, exploiting the extraction and representation of common knowledge from images taken under different conditions. Then, by exploring more accurate radiometric and quality adjustment models, this method can possibly be implemented for different land cover types. A complete solution to the transmission problem calls for development in different fields, including computational platforms, new data transmission solutions, and improved data processing techniques.

Figure 1. Appearances of two video clips shot of the same location at different times by two different satellites. (a) A sample frame taken from each of the video clips; the same structure indicates the same location. (b) Magnified area showing the variances in projection, color, and quality.

Figure 2. Sketch of the long-term background referencing library (LBRL) for long-term background redundancy (LBR) elimination. (a) Corrected historical images registered at corresponding locations in the scale of a smart city, forming the foundation library of the LBRL; (b) projection transformation; (c) radiometric adjustment; (d) quality adjustment.

Figure 3. Overview of the proposed background reference image generation and the Earth observatory video data (EOVD) encoding using the generated reference image.

T
and Y s U s V s T are the color values in the radiometrically adjusted reference images I c r and I g r , respectively. Y s U s V s T frame's values. By using this model, the color of the reference I g r was adjusted according to the color statistics of the current frame.

Figure 4. The overall framework of the LBRL-based encoding of EOVD.

Figure 6. LBRL and reference image generation for UAV video clips. (a) The LBRL developed for the UAV video clip test; (b) area cropped from the LBRL (red rectangle in (a)) according to the to-be-encoded video clip; (c) geometrically transformed reference image; (d) to-be-encoded video clip. First row: UAV video clip a; second row: UAV video clip d.

Figure 7. RD curves of LBRL-HEVC and HEVC for four video clips from UAV data.

Figure 8. Reference images. (a) Sample frame from satellite video clip building-1; (b) image cropped from a large image downloaded from © Google Earth; (c) reference image after geometric matching; (d) reference image after radiometric adjustment; (e) final background reference image after quality adjustment.

From the intermediate result, we can see that the proposed background reference generation method can successfully handle the image representation variances caused by illumination and sharpness differences. However, the current strategy will not work well with problems that change the representation of remote sensing images: (1) shadow movement; (2) projection differences of tall buildings; (3) large illumination changes; (4) scene changes due to seasonal variation, e.g.,
1080p of 1920 × 1280 resolution, and the original video data of 12,000 × 5000 resolution was 30 times that of the tested video clips. With the same quality, the encoded original video data stream would be 15 Mbps, which was within the required range of 10–20 Mbps transmission bandwidth between satellites and Earth.

Table 1. Experimental configuration of two implementations.
