Article

Geometric Construction of Video Stereo Grid Space

1 School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
2 Pengcheng Laboratory, Shenzhen 518055, China
3 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(10), 2356; https://doi.org/10.3390/rs14102356
Submission received: 25 April 2022 / Revised: 8 May 2022 / Accepted: 10 May 2022 / Published: 13 May 2022
(This article belongs to the Special Issue Digital Twins for Sustainable and Smart Cities)

Abstract: The construction of digital twin cities is a current research hotspot. Video data are one of the important aspects of digital twin cities, and their digital modeling is one of the important foundations of their construction. For this reason, the construction and digital analysis of video data space has become an urgent problem to be solved. After in-depth research, this study found that the existing video space construction methods have three shortcomings: first, high requirements for objective conditions or low accuracy; second, the lack of simple and efficient mapping algorithms from 2D video pixel coordinates to 3D; and third, the lack of efficient correlation mechanisms between the video space and external geographic information, making it difficult to integrate the video space with external information and thus preventing more effective analysis. In view of the above problems, this paper proposes a video stereo grid geometric space construction method based on GeoSOT-3D stereo grid coding and a camera imaging model to form a video stereo grid space model. Finally, targeted experiments on video stereo grid space geometry construction were conducted to analyze the experimental results before and after optimization and compare the variance to verify the feasibility and effectiveness of the model.

Graphical Abstract

1. Introduction

With the development of the economy and society, the construction of digital twin cities is being carried out on the basis of smart cities in many places [1,2]. A digital twin city is a digital model twinned with a physical world city through digital twin modeling to realize the digitalization and intelligence of urban management [3,4,5]. Cities contain massive amounts of data, and 80% of the information in smart cities is related to spatial location [6]. The use of artificial intelligence and big data technology for the large-scale data analysis of digital twin cities is a long-term demand [7,8], and it is necessary to use faster and more efficient communication interfaces and techniques such as data compression to overcome various obstacles related to data transmission in communication networks [9,10,11]. The automatic analysis, transmission, and decision making by algorithms based on valuable data, as well as the establishment of digital models, comprise an important basis for promoting the construction of digital twin cities. As an indispensable part of current urban construction, video surveillance systems are an important data source for digital twin cities [12]. Therefore, the construction and digital analysis of video data space is an urgent problem to be solved. The reasons are twofold: on the one hand, the interpretation of video space is currently performed largely by hand, which is labor-intensive and time-consuming, so it is urgent to analyze video information through a spatial construction algorithm that forms a video space model; on the other hand, a large amount of video content exists in isolation, so the content of the video space cannot be queried efficiently, and the three-dimensional information of the real world needs to be combined with the space expressed by the video.
Therefore, the construction of a video stereo grid space model will make better use of video space data, so that the actual three-dimensional space of the video can be displayed; the expression becomes richer and more intuitive, and the data management becomes simpler. The existing video space construction methods either place high requirements on objective conditions or offer low precision, and they lack simple and efficient mapping algorithms from 2D video pixel coordinates to 3D. In addition, there is a lack of an efficient correlation mechanism between a video space and external geographic information, making it difficult to integrate the video space with external information and achieve a more effective data analysis.
A grid is a data structure that manages and expresses space, and it can model any space. Using the grid data structure as a starting point, and in view of the current problems, this paper proposes an innovative approach using the Geographic Coordinate Subdividing Grid with One-Dimension Integral Coding on 2n-Tree-3D (GeoSOT-3D) global mesh as a carrier, combined with the camera imaging model, to construct a video stereo grid space model. The GeoSOT-3D global mesh has the characteristics of being identifiable, locatable, indexable, and computable, and of automatic spatial association. Due to these characteristics, GeoSOT-3D can accurately express geographic spatial positions and is suitable for the construction of a video stereo grid spatial model. Using GeoSOT-3D to disassemble the space into grid units and then construct a three-dimensional grid space for video improves the utilization efficiency of video data, and can promote the rapid development of additional video applications based on the stereo grid space. In addition, the construction of the video stereo grid space provides ideas for real-time urban data and its visualization, thereby providing effective support for efficient and intelligent decision making using digital twin cities.
The innovative method used in this paper integrates the advantages of the GeoSOT-3D grid, namely the efficiency of its underlying data organization and the accuracy of its geographic location expression, with the camera imaging model, and proposes optimization strategies on the basis of traditional algorithms to accurately complete the video spatial representation under low requirements for external constraints and known conditions. The mapping relationship between the three-dimensional grid space and geographic coordinates was constructed to achieve the associated expression of GeoSOT-3D grid coding and complete the modeling of the spatial geometry of the video three-dimensional grid.

2. Related Work

The purpose of the construction of video stereo space is to calibrate, solve, and restore the video space information through computer vision and other methods to realize the space restoration of the surveillance video data on the map through transformation mapping or virtual reality technology. Therefore, the related work mainly introduces two aspects: a camera-calibration method and spatial construction of the video.
Camera calibration is the process of solving for camera parameters. Usually it is necessary to use some calibration objects or structural information in the image to solve for them. It can be divided into three methods: the traditional camera-calibration method, the active-vision calibration method, and camera self-calibration method [13,14,15].
The traditional camera-calibration method mainly relies on calibration objects with known information for calibration, and must perform imaging processing on the calibration objects in the early stage. When the camera is shooting, the calibration objects are imaged at different positions and angles, and the calculation is performed after the images are processed. When the calibration object is relatively accurate, the traditional camera-calibration method can obtain a relatively high accuracy, but the calculation is relatively complicated. Typical algorithms for traditional camera calibration include the DLT algorithm [16], the Tsai two-step calibration method [17], and the Zhengyou Zhang camera-calibration method [18].
The active-vision method mainly solves for the camera parameters based on the camera’s motion information, and must know the qualitative and quantitative information of the camera [19]. The solution is a mainly linear solution with high robustness. Due to the constraints on the camera movement, the equipment requirements are relatively high, and the application scope is relatively limited.
The camera self-calibration method does not require calibration objects, nor does it require camera-motion-related information, and it is more flexible. The camera self-calibration method is mainly solved by using the constraints between multiple images. An example of this would be solving the Kruppa equation to directly obtain the camera parameters [20], or solving through the vanishing-point information [21]. Due to less prior information, the camera self-calibration method has a low calculation accuracy, and is suitable for scenarios such as virtual reality that do not require a high accuracy.
To summarize, each method of camera calibration has a different scope of application, as well as different advantages and disadvantages. A summary is shown in Table 1.
Previous research on the spatial construction of video was relatively focused on the 3D reconstruction of video or the use of virtual reality technology to display moving objects within the video or map the video onto a map.
Lee and Nevatia developed a calibration tool for surveillance video, and performed camera calibration for street views by interactively selecting vanishing points; these functions provided a practical tool for the calibration of street-view surveillance video [22]. Meizhen used a quadtree-like grid structure to describe the true coverage of the video when the camera parameters were known, and the effect was related to the initial grid size and the maximum grid level [23]. For structured scenes, Zhang proposed a mutual mapping model between the surveillance video and the geospatial data under multiplane constraints, and the model was suitable for situations in which the ground area contained multiple planes [24]. Li extracted geographic information from video data and used semiautomatic processing methods to provide the data with a geographic frame to support an effective spatial analysis [25].
Aleksandar’s team proposed a method to integrate geographic information into a surveillance video [26], and defined two ensemble models: GIS-enhanced video and video-enhanced GIS. Then, on the basis of the models, more than 10 control points were used to perform a least-squares optimization of the geographic reference system of the video, and the video was projected into the virtual reality scene to complete the analysis and display of the video. Yujia et al. integrated the moving objects in a surveillance video with GIS to realize the GIS-MOV system. The system’s main purpose was to generate a virtual geographic environment, extract moving objects, and visualize these objects in the virtual environment according to their motion data [27,28,29]. Wu used an optimization algorithm to calibrate a pose-changing camera and a fixed camera [30]. The algorithm separated the foreground and background of the video shot by the different cameras, completed the fusion with a two-dimensional map, proposed the concept of a video map, and generated the video map.
One problem with this type of analysis is a lack of robust mapping algorithms that can relate video pixel coordinates with 3D world coordinates. The current mapping methods of pixel coordinates and three-dimensional world coordinates have their own limitations. When the research object is a surveillance video with a fixed angle that has already been recorded, it is usually impossible to make a specific movement with the camera or use a calibration object in the video, and algorithms that do not use a calibration object have a large error. In addition, many current methods have extensive requirements for the number and precision of control points, and these methods may lack corresponding conditions when actually using a video. When displaying the video space, the use of a virtual space for fusion requires high amounts of equipment and precision, the algorithm is time-consuming, and the three-dimensionality of the video is lost when displayed on a two-dimensional map.

3. Materials and Methods

3.1. Model Architecture

The model construction was largely based on the camera imaging model and the GeoSOT-3D global grid model. On the basis of the imaging model, the relationship between the grid codes and the world coordinate system was added to provide a representation of the real geographic world for the video stereo grid space. The grid was associated with the space expressed by the video, which provided a good foundation for the subsequent spatial fusion between the videos and the external information. The mapping relationship constructed by the model is shown in Figure 1.
Figure 2 shows the technical route of model construction, which is described as follows:
  • The vanishing-point-based stereo space recovery optimization algorithm was used to determine the camera parameters. The external parameters determined the rotation angle and translation vector of the camera in the three-dimensional world, i.e., the mapping from the world coordinate system to the camera coordinate system, and the internal parameters determined how the camera coordinate system is projected onto the 2D image. The inverse calculation of the camera imaging model then yielded the mapping from the 2D video frame to the 3D world coordinate system.
  • Based on the known geographic control points and other information, we first established a mapping relationship between the world coordinate system and the geodetic coordinate system of longitude and latitude through the solved angle and scale information. Secondly, we established the encoding and decoding conversion between the GeoSOT-3D grid code and the geodetic coordinates of longitude, latitude, and height to realize the mapping relationship between the GeoSOT-3D grid code and the world coordinate system coordinates.
  • With known height information, the model completed the mapping from 2D pixel coordinates to the three-dimensional grid code.

3.1.1. GeoSOT-3D Global Mesh Model

GeoSOT (Geographic Coordinate Subdividing Grid with One-Dimension Integral Coding on 2n-Tree) [6] is a grid coding based on longitude and latitude that can accurately quantify the spatial information of any location and size on the earth. GeoSOT has the characteristics of small deformation, multiscalability, and easy computer processing, and is used in various fields of spatial information organization [31]. Therefore, this grid coding was very suitable for the construction of a video stereo grid space [32].
The GeoSOT-3D subdivision framework selected the sphere center of the reference ellipsoid of the China Geodetic Coordinate System 2000 (CGCS2000) as its origin, and the height dimension followed the geodetic height direction; the center height was 0°, and the maximum height reached 512° (expressed in the surface longitude/latitude length unit). The core idea was to expand the earth space of the original longitude and latitude system three times: first, the earth space was expanded to 512° × 512° × 512°, and then 1° was expanded to 64′ and 1′ to 64″ to achieve the recursive octree division of whole degrees, whole minutes, and whole seconds [33,34,35], as shown in Figure 3. The edge length of the first-level GeoSOT-3D grid was 32,768 km, and the side length of each subsequent level was half that of the previous level; the subdivision comprised 32 levels, and the finest side length reached approximately 1.5 cm.
As shown in Figure 4, the GeoSOT-3D encoding method used Z-order filling. The 0th level was global, and the eight child voxels of each cell were coded from 0 to 7 in turn using octal coding; the coding length was related to the level. Since the maximum subdivision level was 32, the maximum octal code length was 32 digits. The binary fixed-length encoding adopted the form of a 96-bit (0,1) code, with every three binary bits corresponding to one octal digit. This encoding greatly reduced storage requirements and sped up calculation. In practical applications, the encoding can be divided into three parts: a longitude code, a latitude code, and an altitude code, allowing the encoding to be converted to and from known coordinates of longitude, latitude, and altitude.

3.1.2. Camera Imaging Model

For video space construction, an explicit camera imaging model is required. Camera imaging is essentially the mapping of three-dimensional world information onto a two-dimensional plane through a lens, which is a simplification of the optical imaging model [36]. The imaging model involves four coordinate systems: the world coordinate system, the camera coordinate system, the image-plane coordinate system, and the pixel coordinate system [37]. The relationship between the four coordinate systems is shown in Figure 5.
1. The world coordinate system is an expression of the coordinate system of the objectively existing world. Its origin and scale unit can be arbitrary, and it is used to express the object information imaged by the camera. In a surveillance video system, the world coordinate system is constructed based on the monitoring field of view, and can be customized according to user needs. The camera model denotes it by ( $O X_w Y_w Z_w$ ).
2. The camera coordinate system is a camera-based coordinate system. Its origin is located at the optical center of the camera imaging lens, its x-axis and y-axis are parallel to the u′ and v′ axes of the image-plane coordinate system, and its z-axis is perpendicular to both and parallel to the optical axis of the camera, forming a three-dimensional coordinate system inside the camera, represented by ( $O X_c Y_c Z_c$ ). The world coordinate system is transformed into the camera coordinate system by a rigid-body transformation.
3. The image-plane coordinate system is a two-dimensional coordinate system on the imaging plane. Its origin is the intersection of the optical axis with the image plane, and its u′ and v′ axes are parallel to the u and v axes of the pixel coordinate system, respectively. The distance from the origin of the image-plane coordinate system to the origin of the camera coordinate system is determined by the focal length of the camera.
4. The pixel coordinate system takes the upper-left corner of the image as its origin, with the two mutually perpendicular sides of the image forming the u-axis and the v-axis.
The parameters used to complete an imaging process by transforming the above four coordinate systems are the camera parameters. The camera parameters mainly include two parts: the external parameters and internal parameters. Among these, the world coordinate system is converted to the camera coordinate system by the external parameters, and the camera coordinate system is converted to the image-plane coordinate system and the pixel coordinate system by the internal parameters. The details are as follows:
Camera external parameters: These are related to the camera position and the conversion relationship between the world coordinate system and the camera coordinate system, including the rotation matrix and the translation vector. The rotation matrix R is a three-dimensional orthogonal matrix, and describes the rotation relationship of the coordinate axes. If the original coordinate axis is rotated clockwise around the z-axis by an angle of θ to obtain the target coordinate axis, the rotation relationship between the two coordinate axes is as follows:
$$ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} $$
In the same way, the rotation relationship between the γ angle clockwise around the x-axis and the ω angle clockwise around the y-axis is as follows:
$$ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & \sin\gamma \\ 0 & -\sin\gamma & \cos\gamma \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} $$
$$ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos\omega & 0 & -\sin\omega \\ 0 & 1 & 0 \\ \sin\omega & 0 & \cos\omega \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} $$
The rotation matrix R is obtained by multiplying the above three matrices. Through a Rodrigues transformation, the rotation matrix can also be converted into a three-dimensional rotation vector.
The translation vector is a three-dimensional vector that describes the translation relationship from the origin of the world coordinate system to the origin of the camera coordinate system. Therefore, the transformation from the world coordinate system to the camera coordinate system can be expressed by the following formula:
$$ \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad R: 3 \times 3, \quad T: 3 \times 1 $$
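For concreteness, the following is a minimal sketch (in Python with numpy, not part of the original paper) of how the extrinsic step above can be coded: it composes a rotation matrix from the three axis rotations and applies the rigid-body transform $X_c = R X_w + T$. The composition order $R = R_z R_x R_y$ is an illustrative assumption, since the text only states that the three matrices are multiplied.

```python
import numpy as np

def rotation_matrix(theta, gamma, omega):
    # Rotations about the z-, x-, and y-axes (in radians), matching the three matrices above.
    rz = np.array([[ np.cos(theta), np.sin(theta), 0.0],
                   [-np.sin(theta), np.cos(theta), 0.0],
                   [ 0.0,           0.0,           1.0]])
    rx = np.array([[1.0,  0.0,            0.0          ],
                   [0.0,  np.cos(gamma),  np.sin(gamma)],
                   [0.0, -np.sin(gamma),  np.cos(gamma)]])
    ry = np.array([[np.cos(omega), 0.0, -np.sin(omega)],
                   [0.0,           1.0,  0.0          ],
                   [np.sin(omega), 0.0,  np.cos(omega)]])
    return rz @ rx @ ry  # composition order assumed for illustration

def world_to_camera(p_world, R, T):
    # Rigid-body transform of a world point into the camera frame: X_c = R X_w + T.
    return R @ np.asarray(p_world, dtype=float) + np.asarray(T, dtype=float)
```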
Camera internal parameters: These are determined by the physical characteristics of the camera, mainly including the camera focal length f, which is used to represent the transformation relationship from the camera coordinate system to the pixel coordinate system.
The conversion relationship from the camera coordinate system to the pixel coordinate system is a conversion from three dimensions to two dimensions, and can be obtained by triangle similarity. As shown in Figure 6, a point in the camera coordinate system, its image on the image plane, and the optical center $O_c$ form similar triangles, and the distance from the principal point to the origin of the camera coordinate system is the focal length $f$. From this, the following formulas can be obtained:
$$ u' = f\,\frac{x_c}{z_c}, \qquad v' = f\,\frac{y_c}{z_c} $$
$$ z_c \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} $$
Since there is no rotation transformation from image-plane coordinates to pixel coordinates, only the origin translation and scale transformation need to be considered. The following formula can be obtained:
$$ \begin{cases} u = u_0 + \dfrac{u'}{d_u} \\[1mm] v = v_0 + \dfrac{v'}{d_v} \end{cases} $$
$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{d_u} & 0 & u_0 \\ 0 & \frac{1}{d_v} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} $$
When combining the above two parts of the camera parameters, the relationship between the pixel coordinates $(u, v)$ and the world coordinates $(x_w, y_w, z_w)$ can be obtained, and is expressed by the following formula:
$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} \begin{bmatrix} \frac{1}{d_u} & 0 & u_0 \\ 0 & \frac{1}{d_v} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \left[\, R \mid T \,\right] \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} $$
The product of the first two matrices is the camera internal parameter, and the latter matrix is the camera external parameter. After accurate camera calibration, the conversion relationship from the three-dimensional space of the camera to the two-dimensional space can be established in such a way to clarify the three-dimensional information contained in the two-dimensional image in the video, and achieve the purpose of further analysis and modeling of the video space.
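As a continuation of the sketch above (again Python/numpy, illustrative only), the full world-to-pixel projection can be written by composing the intrinsic matrix with the extrinsic transform; the parameters f, du, dv, u0, and v0 correspond to the symbols in the formula above.

```python
import numpy as np

def project_to_pixel(p_world, R, T, f, du, dv, u0, v0):
    # Intrinsic matrix: pixel scale/offset matrix times the focal-length matrix.
    K = np.array([[1.0 / du, 0.0,      u0],
                  [0.0,      1.0 / dv, v0],
                  [0.0,      0.0,      1.0]]) @ np.diag([f, f, 1.0])
    xc, yc, zc = R @ np.asarray(p_world, dtype=float) + np.asarray(T, dtype=float)  # world -> camera
    u, v, _ = (K @ np.array([xc, yc, zc])) / zc  # perspective division by the depth z_c
    return u, v
```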

3.2. Stereo Space Restoration Optimization Algorithm Based on Vanishing Point

3.2.1. Basic Algorithm

This section combines the classic vanishing-point-based camera calibration algorithm to solve for the camera parameters [21,38]. Let $O$ be the camera's photographic (optical) center and $O'$ its projection on the image plane, also called the image principal point. The coordinates of the image principal point are initialized as the center of the image $(w/2, h/2)$. Two sets of parallel lines that are perpendicular to each other are assumed known; let the rectangle they enclose be ABCD, imaged as abcd. Because perspective projection does not preserve parallelism, the opposite sides of ABCD are no longer parallel in the image: the intersection of AD and BC in the image ($F_u$) and the intersection of AB and CD in the image ($F_v$) are the two vanishing points. The vanishing line is the straight line $F_u F_v$ defined by the two vanishing points. A schematic diagram of vanishing-point imaging is shown in Figure 7.
Let $O_{uv}$ be the vertical projection of $O'$ onto the straight line $F_u F_v$; then the focal length $f$ of the camera is calculated as follows:
$$ f = \sqrt{\left| O O_{uv} \right|^2 - \left| O' O_{uv} \right|^2} $$
The pixel coordinates of the two vanishing points are known, and the length of $OO_{uv}$ can be calculated from the similarity of the triangles. Since the coordinates of the principal point are also known, the length of $O'O_{uv}$ can also be calculated. It can be assumed that the focal length has the same value in both the u and v directions. So far, the focal length $f$ and the image principal point $(u_0, v_0)$ of the internal parameter matrix have been obtained, and based on these, the internal parameter matrix $A$ is:
$$ A = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix} $$
The rotation matrix of the external parameters is usually represented by the letter M, and it determines the rotation relationship from the world coordinate system to the camera coordinate system. Assume the unit vectors of the x, y, and z axes of the world coordinate system, translated to the optical center, are $\vec{x}$, $\vec{y}$, and $\vec{z}$, and the unit vectors of the three axes of the camera coordinate system are $\vec{a}$, $\vec{b}$, and $\vec{c}$. Since the world coordinate system is artificially specified, the directions of its unit vectors are constructed along the two lines connecting the vanishing points with the optical center O; in other words, the unit vectors along $OF_u$ and $OF_v$. By definition, the length from the optical center to the image principal point is the focal length f. From this, the coordinates of the unit vectors of the world coordinate system in the camera coordinate system can be obtained:
$$ \vec{x} = \frac{\overrightarrow{OF_u}}{\left| \overrightarrow{OF_u} \right|} = \left( \frac{F_{ui}}{\left| \overrightarrow{OF_u} \right|},\; \frac{F_{uj}}{\left| \overrightarrow{OF_u} \right|},\; \frac{f}{\left| \overrightarrow{OF_u} \right|} \right) $$
$$ \vec{y} = \frac{\overrightarrow{OF_v}}{\left| \overrightarrow{OF_v} \right|} = \left( \frac{F_{vi}}{\left| \overrightarrow{OF_v} \right|},\; \frac{F_{vj}}{\left| \overrightarrow{OF_v} \right|},\; \frac{f}{\left| \overrightarrow{OF_v} \right|} \right) $$
$$ \vec{z} = \vec{x} \times \vec{y} = (z_l, z_j, z_k) $$
According to the properties of the rotation matrix, the rotation matrix is obtained by the following formula:
$$ M = \begin{bmatrix} \dfrac{F_{ui}}{\sqrt{F_{ui}^2 + F_{uj}^2 + f^2}} & \dfrac{F_{vi}}{\sqrt{F_{vi}^2 + F_{vj}^2 + f^2}} & z_l \\[2mm] \dfrac{F_{uj}}{\sqrt{F_{ui}^2 + F_{uj}^2 + f^2}} & \dfrac{F_{vj}}{\sqrt{F_{vi}^2 + F_{vj}^2 + f^2}} & z_j \\[2mm] \dfrac{f}{\sqrt{F_{ui}^2 + F_{uj}^2 + f^2}} & \dfrac{f}{\sqrt{F_{vi}^2 + F_{vj}^2 + f^2}} & z_k \end{bmatrix} $$
The translation vector describes the translation from the origin of the world coordinate system to the origin of the camera coordinate system. Suppose the origin of the world coordinate system is $O_w$ and its projection in the image is $O_w' = (u_w, v_w)$; then the coordinates of $\overrightarrow{OO_w}$ in the camera coordinate system are $(u_w - u_0,\; v_w - v_0,\; f)$, which gives the direction of the translation vector. The translation vector is then:
$$ \vec{t} = (u_w - u_0,\; v_w - v_0,\; f) $$
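A minimal sketch of this vanishing-point initialization is given below (Python/numpy, illustrative, not the authors' implementation). It uses the algebraically equivalent closed form $f = \sqrt{-(F_u - O') \cdot (F_v - O')}$ for the focal length, and it expresses the world-axis directions relative to the principal point, which is an assumption about the notation used above.

```python
import numpy as np

def calibrate_from_vanishing_points(Fu, Fv, principal, origin_pixel):
    # Fu, Fv: pixel coordinates of the two orthogonal vanishing points;
    # principal: principal point (u0, v0); origin_pixel: image of the world origin.
    Fu, Fv, p = (np.asarray(q, dtype=float) for q in (Fu, Fv, principal))
    # Orthogonality of the two vanishing directions gives the focal length
    # (equivalent to f^2 = |OO_uv|^2 - |O'O_uv|^2 above).
    f = np.sqrt(-np.dot(Fu - p, Fv - p))
    A = np.array([[f, 0.0, p[0]],
                  [0.0, f, p[1]],
                  [0.0, 0.0, 1.0]])
    # World x- and y-axis directions in the camera frame, taken along OF_u and OF_v.
    x = np.array([Fu[0] - p[0], Fu[1] - p[1], f]); x /= np.linalg.norm(x)
    y = np.array([Fv[0] - p[0], Fv[1] - p[1], f]); y /= np.linalg.norm(y)
    z = np.cross(x, y)
    M = np.column_stack([x, y, z])            # world axes expressed in the camera frame
    uw, vw = origin_pixel
    t = np.array([uw - p[0], vw - p[1], f])   # direction of the translation vector
    return A, M, t
```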
The translation vector, rotation matrix, and internal parameters of the camera can thus be calculated by the vanishing-point algorithm to establish the mapping relationship between the three-dimensional world and the two-dimensional pixel plane. Through the inverse calculation of the camera imaging model, the mapping relationship between the two-dimensional pixel coordinates and the three-dimensional world can be obtained. Suppose the camera rotation matrix is M, the translation vector is $\vec{t}$, the camera internal parameter matrix is A, and the z-axis coordinate of the world point corresponding to the known pixel coordinate is the height; then the following can be obtained from the camera parameters:
$$ z_c = \frac{height + \left( M^{-1} \vec{t}\, \right)_3}{\left( M^{-1} A^{-1} (u, v, 1)^T \right)_3} $$
$$ (x, y, z)^T = M^{-1} \left( z_c A^{-1} (u, v, 1)^T - \vec{t}\, \right) $$
In the formulas above, the subscript 3 represents the third element of the vector. The current algorithm can obtain a relatively practical result, but because the image principal point is initialized at the center of the image and the horizontal and vertical focal lengths are assumed equal, it is still insufficient to describe the real camera imaging model. The next section, therefore, will introduce the optimization method of the algorithm. For a space with many known conditions, the algorithm can be optimized to better calibrate the parameter model.
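Before moving to the optimization, a minimal back-projection sketch (Python/numpy, illustrative) of the two inverse formulas above may be useful: given a pixel $(u, v)$ and the known height of the corresponding world point, it recovers the world coordinates.

```python
import numpy as np

def pixel_to_world(u, v, height, M, t, A):
    # Back-project a pixel whose world point has known height z_w = height:
    # first solve the depth z_c from the height constraint, then invert the model.
    Minv, Ainv = np.linalg.inv(M), np.linalg.inv(A)
    ray = Ainv @ np.array([u, v, 1.0])               # A^-1 (u, v, 1)^T
    zc = (height + (Minv @ t)[2]) / (Minv @ ray)[2]  # depth from the height constraint
    return Minv @ (zc * ray - t)                     # (x_w, y_w, z_w)
```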

3.2.2. Optimization Method

The algorithm based on the vanishing point can initially complete the calibration of the camera parameters, but because the solution is completely dependent on the method, and the coordinates of the image principal point are also initialized as the image center, the accuracy cannot be guaranteed for different cameras. For the camera imaging model, its form is determined. When more correspondence between world coordinates and pixel coordinates is known, it is suitable for optimization methods such as gradient descent to further update the parameters with the goal of reducing the errors. Due to different angles or different construction methods of the world coordinate system, the external parameters are different. In order to make the algorithm more universal, iterative updates are performed for the camera internal parameters. Therefore, when the known conditions allow, inspired by Zhang’s calibration method, the iterative optimization method can be used to optimize the camera’s internal parameters according to the known conditions to obtain more accurate parameter results.
Assuming the pixel coordinates $(u, v)$ corresponding to a set of world coordinates $(x_w, y_w, z_w)$ are known, the camera imaging model is set as follows:
$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \left[\, R \mid T \,\right] \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} $$
$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} \left( \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix} \right) $$
The above formula can be simplified into equations of the form:
$$ u = \frac{(f_x r_{11} + u_0 r_{31}) x_w + (f_x r_{12} + u_0 r_{32}) y_w + (f_x r_{13} + u_0 r_{33}) z_w + f_x t_1 + u_0 t_3}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} $$
$$ v = \frac{(f_y r_{21} + v_0 r_{31}) x_w + (f_y r_{22} + v_0 r_{32}) y_w + (f_y r_{23} + v_0 r_{33}) z_w + f_y t_2 + v_0 t_3}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} $$
Assuming the error is the squared error, the formula is:
$$ J^{(i)}(f_x, f_y, u_0, v_0) = \frac{1}{2} \left\| h\!\left( (x_w, y_w, z_w)^T \right) - (u, v)^T \right\|^2 $$
It can be seen from the simplified form of the equations that $u$ is related to $f_x$ and $u_0$, and $v$ is related to $f_y$ and $v_0$. One can then apply the stochastic gradient descent method, taking the partial derivatives of the objective function with respect to the four parameters $f_x$, $u_0$, $f_y$, and $v_0$:
$$ \frac{\partial J}{\partial f_x} = (u_{pred} - u)\, \frac{r_{11} x_w + r_{12} y_w + r_{13} z_w + t_1}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} $$
$$ \frac{\partial J}{\partial f_y} = (v_{pred} - v)\, \frac{r_{21} x_w + r_{22} y_w + r_{23} z_w + t_2}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} $$
$$ \frac{\partial J}{\partial u_0} = (u_{pred} - u)\, \frac{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} = u_{pred} - u $$
$$ \frac{\partial J}{\partial v_0} = (v_{pred} - v)\, \frac{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3}{r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3} = v_{pred} - v $$
According to the stochastic gradient descent method, the four parameters are optimized in the negative gradient direction; that is, for any sample pair $(x_w, y_w, z_w)^{(i)} \rightarrow (u, v)^{(i)}$, the update steps for the parameters $f_x$, $u_0$, $f_y$, and $v_0$ are as follows:
$$ f_x := f_x - \alpha \frac{\partial J}{\partial f_x}, \qquad f_y := f_y - \alpha \frac{\partial J}{\partial f_y}, \qquad u_0 := u_0 - \alpha \frac{\partial J}{\partial u_0}, \qquad v_0 := v_0 - \alpha \frac{\partial J}{\partial v_0} $$
The basic algorithm steps based on the vanishing point are equivalent to providing a good initial value for the optimization algorithm. Through multiple known control points, the camera's internal parameters are optimized over multiple rounds of iterations, and an appropriate learning rate α is selected to control the number of iteration rounds. Once an appropriate optimization training set and test set are selected, a better optimization effect can be obtained. In order to prevent overfitting, a maximum number of iterations is generally set as one termination condition, and the error reduction falling below a threshold can be set as another. Finally, the algorithm realizes the mutual mapping relationship between the two-dimensional video frame and the three-dimensional world. These steps lay the foundation for the construction of the video three-dimensional grid space. The algorithm is described in Algorithm 1.
Algorithm 1. Optimization algorithm based on vanishing-point calibration
Require: learning rate $\alpha$, initial parameters $f_x, u_0, f_y, v_0$, maximum number of iteration rounds $N$, small constant $\delta = 10^{-7}$
Repeat:
  Calculate the error: $J(f_x, f_y, u_0, v_0) = \frac{1}{2} \left\| h\!\left( (x_w, y_w, z_w)^T \right) - (u, v)^T \right\|^2$
  Randomly select a sample $(x_w, y_w, z_w)^{(i)}$ with its corresponding pixel coordinates $(u, v)^{(i)}$ from the known control points
  Calculate the gradients: $\frac{\partial J}{\partial f_x}, \frac{\partial J}{\partial f_y}, \frac{\partial J}{\partial u_0}, \frac{\partial J}{\partial v_0}$
  Update the parameters: $f_x := f_x - \alpha \frac{\partial J}{\partial f_x}$, $f_y := f_y - \alpha \frac{\partial J}{\partial f_y}$, $u_0 := u_0 - \alpha \frac{\partial J}{\partial u_0}$, $v_0 := v_0 - \alpha \frac{\partial J}{\partial v_0}$
  Calculate the new error: $J_{new}(f_x, f_y, u_0, v_0) = \frac{1}{2} \left\| h\!\left( (x_w, y_w, z_w)^T \right) - (u, v)^T \right\|^2$
  Iteration rounds: $n = n + 1$
Until: $\left| J_{new}(f_x, f_y, u_0, v_0) - J(f_x, f_y, u_0, v_0) \right| < \delta$ or $n \ge N$
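A minimal sketch of Algorithm 1 follows (Python, illustrative only; the learning rate and thresholds are assumed values, and R and T are held fixed at the vanishing-point initialization). It refines the four intrinsics by stochastic gradient descent over a set of known world–pixel control-point pairs.

```python
import random

def refine_intrinsics(world_pts, pixel_pts, R, T, fx, fy, u0, v0,
                      lr=1e-6, max_iters=2000, delta=1e-7):
    # world_pts: list of (xw, yw, zw); pixel_pts: list of (u, v);
    # R: 3x3 rotation (rows r1, r2, r3); T: translation (t1, t2, t3).
    r1, r2, r3 = R
    t1, t2, t3 = T
    prev_err = float("inf")
    for n in range(max_iters):
        i = random.randrange(len(world_pts))          # randomly select a control point
        xw, yw, zw = world_pts[i]
        u, v = pixel_pts[i]
        num_u = r1[0]*xw + r1[1]*yw + r1[2]*zw + t1   # x_c
        num_v = r2[0]*xw + r2[1]*yw + r2[2]*zw + t2   # y_c
        den   = r3[0]*xw + r3[1]*yw + r3[2]*zw + t3   # z_c (shared denominator)
        u_pred = fx * num_u / den + u0
        v_pred = fy * num_v / den + v0
        err = 0.5 * ((u_pred - u) ** 2 + (v_pred - v) ** 2)
        # Negative-gradient updates of the squared reprojection error.
        fx -= lr * (u_pred - u) * num_u / den
        fy -= lr * (v_pred - v) * num_v / den
        u0 -= lr * (u_pred - u)
        v0 -= lr * (v_pred - v)
        if abs(prev_err - err) < delta:               # error no longer decreasing
            break
        prev_err = err
    return fx, fy, u0, v0
```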

3.3. Stereo Grid Space Geometry Construction

In the previous section, through the vanishing point coordinates, the camera parameter information could be solved. This section will establish the mapping relationship between the world coordinate system and grid coding to connect the camera imaging model with the grid and complete the construction of the three-dimensional grid space, thus completing the transition from relative positioning in the video space to absolute geographic positioning. Through coordinate-system mapping, the combination of video spatial data and geographic information is truly completed, and a video three-dimensional grid space is established, laying the foundation for grid-based data association.

3.3.1. Algorithm for Mapping Latitude and Longitude Coordinates to World Coordinate System

The algorithm required some prior knowledge to complete the construction of the mapping relationship. Most importantly, it needed to determine two parameters: the scale, which represents the ratio between lengths in the world coordinate system and real distances, and the degree, which represents the true orientation of the world coordinate system and is specified as the angle between the world coordinate system's x-axis and true north. The scale relationship requires the real distance of a line segment in the video scene to be known, and the direction information requires the orientation of the video scene in the real 3D world to be known. Therefore, at least the real latitude and longitude coordinates of two ground pixels must be known to solve for the two required parameters.
First of all, it was assumed that the longitude and latitude coordinates of two ground pixels in the image were known. The world coordinate system is a uniform Cartesian three-dimensional coordinate system, and its origin can be set as one of the two known points. From the latitude and longitude coordinates of the two points, the azimuth angle φ of one point relative to the other and the true distance d between the two points can be calculated through the inverse solution of the geodetic problem [39]. The azimuth is the clockwise angle from true north at a point to the target line; in this algorithm, it is the angle from true north at the origin of the world coordinate system to the line connecting the two points.
For the two known control points, since both were ground points with a height of 0, the corresponding world coordinates could be obtained from the inverse mapping formulas in Section 3.2.1. From this, the coordinates of the two points in the world coordinate system were obtained, and the distance l between them in the world coordinate system could be calculated using the two-point distance formula. By calculating scale = d/l, the real distance per unit length of the world coordinate system was obtained, and the height coordinate of the world coordinate system could then be converted using the scale.
Since the world coordinate system is not necessarily a right-handed system, the camera imaging model was used to infer whether the y-axis was 90° counterclockwise from the x-axis. The judgment method projected the unit points of the x-axis and y-axis into the pixel coordinates $p_1 = (u_1, v_1)$ and $p_2 = (u_2, v_2)$, which formed the vectors $\overrightarrow{p_0 p_1}$ and $\overrightarrow{p_0 p_2}$ with the pixel coordinates of the origin $p_0 = (u_0, v_0)$. Since the pixel coordinate system is not a right-handed system, the cross product of the two vectors was examined: if the cross product value was less than zero, the x-axis was 90° counterclockwise from the y-axis, and vice versa.
Next, we calculated the angle θ between the line connecting the two points and the x-axis in world coordinates. Using the azimuth angle and the relationship between the x-axis and the y-axis, the angle between the x-axis and the true north direction could be calculated. It was stipulated that starting from x to true north counterclockwise was positive.
When the y-axis was in the 90° clockwise direction of the x-axis, then the degree = φ − θ, and when the y-axis was in the 90° counterclockwise direction of the x-axis, then the degree = φ+ θ. Under extreme conditions, when there was only one known longitude and latitude point, the true north direction vector and a known distance (ground point) could be marked on the video image, and the world coordinate system coordinates of the point on the image could also be calculated to obtain the required parameter.
Once the two parameters degree and scale were obtained, the mapping relationship could be established. When the latitude and longitude of a point with known world coordinates (x, y, z) were required, the projection of the point onto the x–y plane of the world coordinate system, the azimuth φ, and the distance d from the origin of the world coordinate system could be calculated, because the angle between the x-axis and true north and the scale relationship with the real world were known. With the latitude and longitude of the origin $(lng_0, lat_0)$ known, the direct solution of the geodetic problem [39] was applied to obtain the latitude and longitude coordinates of the point, and the height could be directly converted to the real geodetic height using the scale parameter:
$$ \begin{cases} \theta = \arccos\left( \dfrac{x}{\sqrt{x^2 + y^2}} \right) \\[1mm] \varphi = degree + flag \cdot \theta \quad (flag = 1 \text{ if } xy \text{ is a right-handed system, else } -1) \\[1mm] d = \sqrt{x^2 + y^2} \cdot scale \\[1mm] (lng, lat) \xleftarrow{\text{direct solution of geodetic problem}} (lng_0, lat_0, \varphi, d) \\[1mm] H = z \cdot scale \end{cases} $$
When the longitude, latitude, and elevation coordinates of a point were known, the azimuth φ between the point and the origin of the world coordinate system and the real ground distance d could be calculated from the longitude and latitude of the point. Using the two parameters degree and scale, the x- and y-coordinates of the point in the world coordinate system could be obtained, and the z-coordinate could be obtained directly by dividing the height by the scale:
$$ \begin{cases} (\varphi, d) \xleftarrow{\text{inverse solution of geodetic problem}} (lng, lat, lng_0, lat_0) \\[1mm] \theta = flag \cdot (degree - \varphi) \quad (flag = 1 \text{ if } xy \text{ is a right-handed system, else } -1) \\[1mm] x = \cos\theta \cdot d / scale \\[1mm] y = \sin\theta \cdot d / scale \\[1mm] z = H / scale \end{cases} $$
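A minimal sketch of the two mappings above follows (Python, illustrative only). It assumes pyproj's Geod class for the direct and inverse geodetic problems and uses the WGS84 ellipsoid for illustration, whereas the paper's framework is defined on CGCS2000; the parameter deg_north stands for the degree parameter (angle between the x-axis and true north).

```python
from math import atan2, cos, sin, sqrt, radians, degrees
from pyproj import Geod  # assumed available for the geodetic direct/inverse problems

geod = Geod(ellps="WGS84")  # illustration only; the paper uses CGCS2000

def world_to_geodetic(x, y, z, lng0, lat0, deg_north, scale, right_handed=True):
    # World coordinates -> (lng, lat, H), using the degree and scale parameters above.
    flag = 1.0 if right_handed else -1.0
    theta = degrees(atan2(y, x))                       # angle of the point from the x-axis
    azimuth = deg_north + flag * theta                 # azimuth from true north at the origin
    dist = sqrt(x * x + y * y) * scale                 # ground distance in metres
    lng, lat, _ = geod.fwd(lng0, lat0, azimuth, dist)  # direct geodetic problem
    return lng, lat, z * scale

def geodetic_to_world(lng, lat, H, lng0, lat0, deg_north, scale, right_handed=True):
    # (lng, lat, H) -> world coordinates (x, y, z).
    azimuth, _, dist = geod.inv(lng0, lat0, lng, lat)  # inverse geodetic problem
    flag = 1.0 if right_handed else -1.0
    theta = radians(flag * (deg_north - azimuth))
    return cos(theta) * dist / scale, sin(theta) * dist / scale, H / scale
```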

3.3.2. GeoSOT-3D Grid Code Construction Method

In the previous section, the mapping method of latitude and longitude to the world coordinate system was established through an algorithm. In this section, we introduce the method used to convert the latitude, longitude, and height coordinates to the GeoSOT-3D grid code to realize the construction of the three-dimensional grid space.
Given the longitude, latitude, and height coordinates of a location $(Lng, Lat, H)$, the formulas for calculating the nth-level GeoSOT-3D grid location code are as follows, where the subscript 2 after $\lfloor \cdot \rfloor$ denotes conversion of the result to binary.
  • Calculate longitude code
$$ \mathrm{Code}_{Lng}^{\,n} = \begin{cases} \left\lfloor \dfrac{Lng + 256}{2^{\,9-n}} \right\rfloor_2 & 0 \le n \le 9 \\[2mm] \left\lfloor \mathrm{Code}_{Lng}^{\,9} \cdot \dfrac{64}{2^{\,15-n}} + \dfrac{\left( Lng + 256 - \mathrm{Code}_{Lng}^{\,9} \right) \times 60}{2^{\,15-n}} \right\rfloor_2 & 10 \le n \le 15 \\[2mm] \left\lfloor \mathrm{Code}_{Lng}^{\,15} \cdot \dfrac{64}{2^{\,21-n}} + \dfrac{\left( Lng + 256 - \mathrm{Code}_{Lng}^{\,15} \right) \times 3600}{2^{\,21-n}} \right\rfloor_2 & 16 \le n \le 32 \end{cases} $$
  • Calculate latitude code
$$ \mathrm{Code}_{Lat}^{\,n} = \begin{cases} \left\lfloor \dfrac{Lat + 256}{2^{\,9-n}} \right\rfloor_2 & 0 \le n \le 9 \\[2mm] \left\lfloor \mathrm{Code}_{Lat}^{\,9} \cdot \dfrac{64}{2^{\,15-n}} + \dfrac{\left( Lat + 256 - \mathrm{Code}_{Lat}^{\,9} \right) \times 60}{2^{\,15-n}} \right\rfloor_2 & 10 \le n \le 15 \\[2mm] \left\lfloor \mathrm{Code}_{Lat}^{\,15} \cdot \dfrac{64}{2^{\,21-n}} + \dfrac{\left( Lat + 256 - \mathrm{Code}_{Lat}^{\,15} \right) \times 3600}{2^{\,21-n}} \right\rfloor_2 & 16 \le n \le 32 \end{cases} $$
  • Calculate height code
$$ \mathrm{Code}_{H}^{\,n} = \left\lfloor H \times \dfrac{2^{\,n}}{512 \times 11130} \right\rfloor_2, \qquad 0 \le n \le 32 $$
Similarly, given the binary codes $\left( \mathrm{Code}_{Lng}^{\,n}, \mathrm{Code}_{Lat}^{\,n}, \mathrm{Code}_{H}^{\,n} \right)$, the formulas for converting the codes to longitude, latitude, and height are as follows:
  • Calculate longitude
$$ Lng = \begin{cases} \mathrm{Code}_{Lng}^{\,n}(10) \times 2^{\,9-n} - 256 & 0 \le n \le 9 \\[2mm] Lng_9 + \dfrac{\left( \mathrm{Code}_{Lng}^{\,n}(10) - Lng_9 \times 2^{\,n-9} \right) \times 2^{\,15-n}}{60} - 256 & 10 \le n \le 15 \\[2mm] Lng_{15} + \dfrac{\left( \mathrm{Code}_{Lng}^{\,n}(10) - Lng_{15} \times 2^{\,n-15} \right) \times 2^{\,21-n}}{3600} - 256 & 16 \le n \le 32 \end{cases} $$
  • Calculate latitude
$$ Lat = \begin{cases} \mathrm{Code}_{Lat}^{\,n}(10) \times 2^{\,9-n} - 256 & 0 \le n \le 9 \\[2mm] Lat_9 + \dfrac{\left( \mathrm{Code}_{Lat}^{\,n}(10) - Lat_9 \times 2^{\,n-9} \right) \times 2^{\,15-n}}{60} - 256 & 10 \le n \le 15 \\[2mm] Lat_{15} + \dfrac{\left( \mathrm{Code}_{Lat}^{\,n}(10) - Lat_{15} \times 2^{\,n-15} \right) \times 2^{\,21-n}}{3600} - 256 & 16 \le n \le 32 \end{cases} $$
  • Calculate height
$$ H = \mathrm{Code}_{H}^{\,n}(10) \times \dfrac{512 \times 11130}{2^{\,n}}, \qquad 0 \le n \le 32 $$
Based on the above formulas, the encoding and the longitude, latitude, and elevation could be converted into each other; on this basis, the mapping path from the encoding to pixel coordinates could be formed, and the spatial geometric model of the video stereo grid could be solved. When the height of a certain pixel was known, the mapping could also be solved in reverse, from pixel coordinates to grid encoding, forming a complete bidirectional mapping. The mapping relationships constructed by the final algorithm were:
$$ \mathrm{Code} \rightarrow (lng, lat, H) \rightarrow (x, y, z) \rightarrow (u, v) $$
$$ (u, v, H) \rightarrow (x, y, z) \rightarrow (lng, lat, H) \rightarrow \mathrm{Code} $$
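To illustrate the final mapping path, a minimal encoding sketch follows (Python, illustrative only). It covers only the degree-level case ($n \le 9$) of the longitude/latitude formulas above together with the height formula, and it assumes a (height, latitude, longitude) bit order for the Z-order interleaving, which is not specified in the text.

```python
def surface_code(value, n):
    # Level-n (0 <= n <= 9) 1D code for longitude or latitude: offset into [0, 512)
    # degrees and quantize by the level-n cell size of 2**(9 - n) degrees.
    return int((value + 256.0) // 2 ** (9 - n))

def height_code(h_metres, n):
    # Level-n 1D height code, following the height formula above (constant 11130 as given).
    return int(h_metres * 2 ** n // (512 * 11130))

def geosot3d_octal_code(lng, lat, h_metres, n):
    # Interleave the three dimension codes bit by bit (Z-order) into an n-digit octal code.
    clng, clat, ch = surface_code(lng, n), surface_code(lat, n), height_code(h_metres, n)
    digits = []
    for i in range(n - 1, -1, -1):  # most significant bit first
        digit = (((ch >> i) & 1) << 2) | (((clat >> i) & 1) << 1) | ((clng >> i) & 1)
        digits.append(str(digit))
    return "".join(digits)

# Example: an 8-level code for a point near Fuzhou at a height of 10 m.
print(geosot3d_octal_code(119.3, 26.05, 10.0, 8))
```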

4. Results

This experiment was based on the video three-dimensional grid space construction model proposed in this paper, and covered the feasibility verification of the model-construction algorithm and a three-dimensional accuracy test. The experimental data had to include the three-dimensional grid coordinates of the scene corresponding to the video. The data used in this experiment were collected at the Cangshan District Convention and Exhibition Center, Fuzhou City, Fujian Province, as shown in Figure 8.
Using remote-sensing images and collected height information, more than 15 pairs of pixel coordinates and corresponding latitude and longitude points were collected. The collected data were mainly characteristic points of the buildings close to the two known control points, and experiments were carried out using these data. Two different scenes were collected, as shown in Figure 9: the west side of the Fuzhou Cangshan Convention and Exhibition Center and the first passageway, referred to below as scene A and scene B, respectively. The experimental data for the two scenes are shown in Table 2 and Table 3.
For the data of each scene, 70% of the points were randomly assigned to the control point set, and the remaining 30% were used as the verification point set. The verification points did not participate in the optimization process, and the accuracies at the control points and verification points before and after optimization were calculated. There were two stopping conditions for the optimization: the number of iterations reaching 2000, or the error at the verification points no longer decreasing. For each scene, the actual latitude and longitude points corresponding to the upper-left and lower-right corner pixels at a height of 0 were calculated, and the distance between these two points was taken as an estimate of the scene scale. The calculated straight-line distance error was then divided by the scene scale to obtain the error rate. After this, the grid code corresponding to the pixel coordinates (with known height) was calculated, along with the calculation accuracy relative to the grid codes actually collected. The accuracy was measured using the straight-line distance error of the latitude and longitude points of the dataset and the average distance from the corresponding grid centers, also called the grid error.
The parallel lines acquired in the two scenes for calculating the vanishing points and the positions of the two known control points are shown in Figure 10 and Figure 11. The selected parallel lines were usually contours with clear boundaries or straight-line features, and it could be confirmed that the two sets of parallel lines were orthogonal. The vanishing points were calculated by collecting the pixel coordinates of the parallel lines and computing the intersections of the corresponding lines, which gave the pixel coordinates of the vanishing points. The known control points required by the two algorithms were selected as far apart as possible in the scene, at locations that could be clearly matched between the remote-sensing image and the picture.
The collection of control points using scene A as an example is shown in Figure 11.
We calculated the latitude and longitude points corresponding to the pixels and heights, and the straight-line distance errors of the two scenes are shown in Table 4 below.
The errors of the two scenarios were reduced after optimization, showing that the optimization method was feasible. The field of view of scene A was wider than that of scene B, so it was more difficult to describe a large scene than a small scene, and the error in the modeling was larger. However, the error rate calculated relative to the range of the scene was smaller, and the error tolerance was higher. At the same time, scene A, with a large error, stopped when it reached the upper limit of the number of iterations during optimization, and its optimization space was larger than for scene B. Overall, the error of the model adapted to the application and was within acceptable limits.
When calculating the accuracy of the two scenes, multiple grid levels were tested; because the range of scene B was small, all point grid errors were 0 at level 23, the coarsest level tested. The calculation results are shown in Table 5 and Table 6.
From the above tables, it can be seen that different grid levels produced different error results. The larger the level, the more finely the grid described the error, and the closer the grid error was to the straight-line distance error.
For scene A, the verification point error at level 22 was greater than that at level 21 because the grid at level 21 had a higher tolerance for errors; the control point error at level 23 was smaller than the verification point error at level 22 because the grid side length at level 23 was smaller and the description more detailed. At levels 23 and 22, the control point error became larger after optimization; this was because the optimization sought the global optimum while sacrificing the accuracy of some points, so that the real value and the calculated value no longer fell in the same grid. At level 24, the grid error of the verification points did not change after optimization, because the optimization fitted the control point data and the improvement at the verification points was not large enough to move them into a different grid.
For scene B, the side length of the grid at level 23 was too large, so the error before and after optimization could not be described in detail and the effect of the optimization was not reflected; the error of the verification points at levels 24 and 25 did not change after optimization because the improvement was not large enough to be reflected at those grid levels, while the 27-level grid error was relatively close to the straight-line distance error result. Based on the above analysis, an appropriate grid level should be selected according to the application requirements to describe the three-dimensional video grid space. Coarser (lower-level) grids had a higher tolerance for errors, but the penalty for not falling in the same grid was also greater; when the error tolerance was low, the grid error was closer to the straight-line distance error result.
As shown in Figure 12 and Figure 13, after applying the camera internal parameters before and after optimization to the mapping relationship of the three-dimensional grid space of the video and mapping the collected grid code to the two-dimensional image, we could observe the difference between the mapping results before and after optimization.
In the above two figures, the left side is the projection result before optimization, and the right side is the optimized projection result; the blue points are the real values in the collected data, and the red points are the corresponding model projection results. On the whole, the projection results of the video stereo grid space geometry construction model were good, and the optimized projection results were closer to the real values than those before optimization. Some local points deviated slightly from the real values after optimization; this was because the optimization was based on the overall error, sacrificing some points to achieve the global optimum. From the above analysis, when there were not enough control point data, better results could be obtained by using only the basic algorithm; if there was enough control point information, the parameters could be further optimized to achieve better results.
According to the method proposed in this paper, the video stereo grid space geometry algorithm constructed the video stereo grid space effect in the exhibition center, as shown in Figure 14. By constructing a three-dimensional grid for the video space, an efficient association mechanism between the video space and the external geographic information was established with the grid as the medium. As a data container, the three-dimensional grid integrated and fused the video space and external information to achieve a more effective analysis.

5. Discussion

We proposed a geometric construction model of a video stereo space based on a GeoSOT grid coding framework and camera imaging model to address the shortcomings of existing methods of video stereo space construction and geographic data organization, and combined the video characteristics and application requirements. The model constructed the video stereoscopic space using a GeoSOT-3D grid and managed the representation using grid coding, and proposed an optimization strategy based on the traditional algorithm to achieve an efficient modeling and display of the video stereoscopic space.
Using Table 5 and Table 6 and Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14, we concluded that the model proposed in this paper was feasible and correct. It was indeed possible to realize the stereo construction of the video space and use coordinate transformation and mapping to associate the video space with the actual corresponding geographic space through the stereo grid. The model-construction algorithm could complete the video space representation more accurately, with lower requirements for external constraints and known conditions.
For the video stereo space, the grid structure could provide a feasible and simple method for modeling and expression. In addition, the earth subdivision grid based on latitude and longitude integrated well with the existing theoretical system and easily expanded the model. The GeoSOT subdivision grid model could express the video space at multiple scales, fully reflecting the geographic attributes of the video space. On the basis of the GeoSOT grid and the video application requirements, the grid model expression was extended, and fully met the needs of video space modeling. The research results of this paper will help to promote the construction of digital twin cities in terms of the construction and digital analysis of massive video data spaces.

6. Conclusions

The main contribution of this paper is that, starting from actual demand and taking the video stereo grid space as the research object, a construction method for the video stereo grid space was proposed. The method established the correlation between the video and the real world through a stereo grid containing real geographic information, realized the grid representation of the video stereo space, and provided strong data organization and integration capabilities. By extension, the model can realize the integration and fusion of the video space with external geographic information and other information, which is conducive to a more effective digital analysis and is of great significance for the massive video data processing required by digital twin cities.

Author Contributions

Conceptualization, H.Z., R.S. and G.L.; methodology, H.Z. and R.S.; software, H.Z. and R.S.; validation, H.Z. and G.L.; formal analysis, H.Z. and G.L.; investigation, H.Z. and R.S.; data curation, H.Z. and R.S.; writing—original draft preparation, H.Z.; writing—review and editing, G.L.; visualization, H.Z. and R.S.; supervision, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62172021).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Botín-Sanabria, D.M.; Mihaita, S.; Peimbert-García, R.E.; Ramírez-Moreno, M.A.; Ramírez-Mendoza, R.A.; Lozoya-Santos, J.D.J. Digital twin technology challenges and applications: A comprehensive review. Remote Sens. 2022, 14, 1335. [Google Scholar] [CrossRef]
  2. Sagar, M.; Miranda, J.; Dhawan, V.; Dharmaraj, S. The Growing Trend of Cityscale Digital Twins around the World. 2020. Available online: https://opengovasia.com/the-growingtrend-of-city-scale-digital-twins-around-the-world/ (accessed on 25 April 2022).
  3. Wu, Y.; Shang, J.; Xue, F. Regard: Symmetry-based coarse registration of smartphone’s colorful point clouds with cad drawings for low-cost digital twin buildings. Remote Sens. 2021, 13, 1882. [Google Scholar] [CrossRef]
  4. Lee, A.; Lee, K.; Kim, K.; Shin, S. A geospatial platform to manage large-scale individual mobility for an urban digital twin platform. Remote Sens. 2022, 14, 723. [Google Scholar] [CrossRef]
  5. Ketzler, B.; Naserentin, V.; Latino, F.; Zangelidis, C.; Thuvander, L.; Logg, A. Digital twins for cities: A state of the art review. Built Environ. 2020, 46, 547–573. [Google Scholar] [CrossRef]
  6. Cheng, C.Q. An Introduce to Spatial Information Subdivision Organization; Science Press: Beijing, China, 2012. [Google Scholar]
  7. Mimouna, A.; Alouani, I.; Khalifa, A.B.; El Hillali, Y.; Taleb-Ahmed, A.; Menhaj, A.; Ouahabi, A.; Amara, N.E.B. OLIMP: A heterogeneous multimodal dataset for advanced environment perception. Electronics 2020, 9, 560. [Google Scholar] [CrossRef] [Green Version]
  8. Ma, X.; Cheng, J.; Qi, Q.; Tao, F. Artificial intelligence enhanced interaction in digital twin shop-floor. Procedia CIRP 2021, 100, 858–863. [Google Scholar] [CrossRef]
  9. Haneche, H.; Boudraa, B.; Ouahabi, A. A new way to enhance speech signal based on compressed sensing. Measurement 2020, 151, 107–117. [Google Scholar] [CrossRef]
  10. Mahdaoui, A.E.; Ouahabi, A.; Moulay, M.S. Image denoising using a compressive sensing approach based on regularization constraints. Sensors 2022, 22, 2199. [Google Scholar] [CrossRef]
  11. Ferroukhi, M.; Ouahabi, A.; Attari, M.; Habchi, Y.; Taleb-Ahmed, A. Medical video coding based on 2nd-generation wavelets: Performance Evaluation. Electronics 2019, 8, 88. [Google Scholar] [CrossRef] [Green Version]
  12. He, F. Intelligent video surveillance technology in intelligent transportation. J. Adv. Transp. 2020, 2020, 8891449. [Google Scholar] [CrossRef]
  13. Brown, M.; Majumder, A.; Yang, R. Camera-based calibration techniques for seamless multiprojector displays. IEEE Trans. Vis. Comput. Graph. 2005, 11, 193–206. [Google Scholar] [CrossRef] [PubMed]
  14. Sun, J.; Wang, P.; Qin, Z.; Qiao, H. Overview of camera calibration for computer vision. In Proceedings of the 11th World Congress on Intelligent Control and Automation, Shenyang, China, 29 June–4 July 2014. [Google Scholar]
  15. Saeifar, M.H.; Nia, M.M. Camera calibration: An overview of concept, methods and equations. Int. J. Eng. Res. Appl. 2017, 7, 49–57. [Google Scholar] [CrossRef]
  16. Abdel-Aziz, Y.I.; Karara, H.M. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogramm. Eng. Remote Sens. 2015, 81, 103–107. [Google Scholar] [CrossRef]
  17. Shih, S.W.; Hung, Y.P.; Lin, W.S. Efficient and accurate camera calibration technique for 3-D computer vision. In Optics, Illumination, and Image Sensing for Machine Vision VI; International Society for Optics and Photonics: Bellingham, WA, USA, 1992. [Google Scholar]
  18. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  19. Hu, Z.Y.; Wu, F.C. Camera calibration method based on active vision. Chin. J. Comput. 2002, 11, 1149–1156. [Google Scholar]
  20. Triggs, B. Autocalibration and the absolute quadric. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997. [Google Scholar]
  21. Caprile, B.; Torre, V. Using vanishing points for camera calibration. Int. J. Comput. Vis. 1990, 4, 127–139. [Google Scholar] [CrossRef]
  22. Lee, S.C.; Nevatia, R. Robust camera calibration tool for video surveillance camera in urban environment. In Proceedings of the CVPR 2011 WORKSHOPS, Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  23. Meizhen, W.; Liu, X.; Yanan, Z.; Ziran, W. Camera coverage estimation based on multistage grid subdivision. ISPRS Int. J. Geo-Inf. 2017, 6, 110. [Google Scholar]
  24. Zhang, X.G.; Liu, X.J.; Wang, S.N. Mutual mapping between surveillance video and 2D geospatial data. Geomat. Inf. Sci. Wuhan Univ. 2015, 40, 139–145. [Google Scholar]
  25. Li, W.; Jing, H.T.; Yuan, S.W. Research on geographic information extraction from video data. Sci. Surv. Mapp. 2017, 11, 96–100. [Google Scholar]
  26. Milosavljevic, A.; Rancic, D.; Dimitrijevic, A.; Predic, B.; Mihajlovic, V. Integration of GIS and video surveillance. Int. J. Geogr. Inf. Sci. 2016, 30, 2089–2107. [Google Scholar] [CrossRef]
  27. Milosavljevic, A.; Rancic, D.; Dimitrijevic, A.; Predic, B.; Mihajlovic, V. A method for estimating surveillance video georeferences. ISPRS Int. J. Geo-Inf. 2017, 6, 211. [Google Scholar] [CrossRef]
  28. Xie, Y.; Meizhen, W.; Liu, X.; Wu, Y. Integration of GIS and moving objects in surveillance video. ISPRS Int. J. Geo-Inf. 2017, 6, 94. [Google Scholar] [CrossRef]
  29. Zhang, X.; Hao, X.Y.; Li, J.S.; Li, P.Y. Fusion and visualization method of dynamic targets in surveillance video with geospatial information. Acta Geod. Cartogr. Sin. 2019, 48, 1415–1423. [Google Scholar]
  30. Wu, Y.X. Research on Video Map and Its Generating Method. 2018. Available online: https://kns.cnki.net/KCMS/detail/detail.aspx?dbname=CMFD201901&filename=1018292439.n (accessed on 25 April 2022).
  31. Li, S.Z.; Song, S.H.; Cheng, C.Q. Mapping Satellite-1 remote sensing data organization based on GeoSOT. J. Remote Sens. 2012, 16, 102–107. [Google Scholar]
  32. Sun, Z.Q.; Cheng, C.Q. True 3D data expression based on GeoSOT-3D ellipsoid subdivision. Geomat. World 2016, 23, 40–46. [Google Scholar]
  33. Meng, L.; Cheng, C.Q.; Chen, D. Terrain quantization model based on global subdivision grid. Acta Geod. Cartogr. Sin. 2016, 45, 152–158. [Google Scholar]
  34. Yuan, J. Research on Administrative Division Coding Model Based on GeoSOT Grid. Master’s Thesis, School of Earth and Space Sciences, Peking University, Beijing, China, 2017. [Google Scholar]
  35. Hu, X.G.; Cheng, C.Q.; Tong, X.C. Research on 3D data representation based on GeoSOT-3D. J. Peking Univ. Nat. Sci. Ed. 2015, 51, 1022–1028. [Google Scholar]
  36. Liu, J.F. Research on Digital Camera Calibration and Related Technologies. Master’s Thesis, Chongqing University, Chongqing, China, 2010. [Google Scholar]
  37. Yoshikawa, N. Spatial position detection of three-dimensional object using complex amplitude derived from Fourier transform profilometry. In Proceedings of the Information Photonics, Optical Society of America, Charlotte, NC, USA, 6–8 June 2005. [Google Scholar]
  38. Orghidan, R.; Salvi, J.; Gordan, M.; Orza, B. Camera calibration using two or three vanishing points. In Proceedings of the 2012 Federated Conference on Computer Science and Information Systems (FedCSIS), Wroclaw, Poland, 9–12 September 2012; IEEE: Piscataway, NJ, USA, 2012. [Google Scholar]
  39. Kong, Y.Y. Fundamentals of Geodesy; Wuhan University Press: Wuhan, China, 2010. [Google Scholar]
Figure 1. Mapping relationships between different coordinate systems in the model.
Figure 2. Model technical route.
Figure 3. GeoSOT-3D meshing reference frame [35].
Figure 4. GeoSOT-3D space-filling curve [35].
Figure 5. Four coordinate relationships, from left to right: the world coordinate system, the camera coordinate system, the image-plane coordinate system, and the pixel coordinate system.
Figure 6. The relationship between the camera coordinate system and the image-plane coordinate system.
Figure 7. Schematic diagram of vanishing-point imaging.
Figure 8. Remote-sensing image of Cangshan Convention and Exhibition Center.
Figure 9. Experimental scene A (left) and experimental scene B (right).
Figure 10. Collecting parallel lines to calculate the vanishing point.
Figure 11. Scene A acquisition of control points.
Figure 12. Projection results for scene A: (a) projection results before optimization; (b) optimized projection results.
Figure 13. Projection results for scene B: (a) projection results before optimization; (b) optimized projection results.
Figure 14. Construction of the video stereo grid space display.
Table 1. Comparison of camera-calibration methods [14,15,19,24].

Calibration Method | Traditional Camera Calibration | Active-Vision Calibration | Camera Self-Calibration
Advantage | In the case of more accurate calibration objects, a higher accuracy can be obtained | The algorithm is more stable and robust | High flexibility and wide range of applications
Disadvantage | The calculation is relatively complicated | High requirements for equipment and limited application scope | Low precision
Table 2. Scene A experimental data.

u | v | Lon | Lat | Height
485 | 275 | 119.356096 | 26.030987 | 20.0
1520 | 322 | 119.356876 | 26.030922 | 20.0
789 | 231 | 119.356313 | 26.031290 | 20.0
1338 | 226 | 119.356755 | 26.031389 | 20.0
949 | 305 | 119.356447 | 26.030902 | 20.0
1190 | 623 | 119.356690 | 26.030652 | 0.0
2413 | 638 | 119.357597 | 26.030742 | 0.0
2295 | 640 | 119.357492 | 26.030730 | 0.0
378 | 576 | 119.356022 | 26.030765 | 0.0
1952 | 290 | 119.357228 | 26.031058 | 20.0
2457 | 502 | 119.357607 | 26.030835 | 10.0
2332 | 293 | 119.357521 | 26.031098 | 20.0
277 | 245 | 119.355923 | 26.031143 | 20.0
2153 | 332 | 119.357378 | 26.030923 | 20.0
1570 | 475 | 119.35694 | 26.030758 | 10.0
92 | 496 | 119.355755 | 26.031068 | 0.0
Table 3. Scene B experimental data.

u | v | Lon | Lat | Height
1464 | 1034 | 119.357021 | 26.030674 | 0.0
1263 | 1038 | 119.356999 | 26.030674 | 0.0
653 | 281 | 119.356948 | 26.030732 | 6.5
1250 | 275 | 119.357006 | 26.030739 | 6.5
1039 | 279 | 119.356982 | 26.030733 | 6.5
1030 | 831 | 119.356982 | 26.030733 | 0.0
1039 | 585 | 119.356982 | 26.030733 | 3.0
810 | 310 | 119.356961 | 26.030703 | 6.5
1409 | 311 | 119.357021 | 26.030708 | 6.5
210 | 312 | 119.356902 | 26.030697 | 6.5
2001 | 312 | 119.357082 | 26.030715 | 6.5
70 | 279 | 119.356887 | 26.030726 | 6.5
1267 | 81 | 119.357006 | 26.030739 | 8.5
670 | 81 | 119.356948 | 26.030732 | 8.5
1255 | 180 | 119.357006 | 26.030739 | 7.5
660 | 179 | 119.356948 | 26.030732 | 7.5
Table 4. Linear distance error results of experimental latitude and longitude points (unit: m).

Scene | Control Point Error before Optimization | Control Point Error after Optimization | Error Rate after Optimization | Validation Point Error before Optimization | Validation Point Error after Optimization | Error Rate after Optimization
A | 5.917 | 5.858 | 1.06% | 6.974 | 6.183 | 1.13%
B | 1.119 | 1.110 | 2.52% | 1.089 | 0.945 | 2.10%
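For reference, the linear distance errors in Table 4 are metre-level separations between projected and surveyed longitude/latitude points. The following minimal sketch shows one way such a separation can be computed over the small extent of the test scenes; the local equirectangular approximation, the function name, and the sample coordinates are assumptions for illustration, not the authors' procedure.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS-84 semi-major axis

def linear_distance_m(lon1, lat1, lon2, lat2):
    """Approximate ground distance in metres between two nearby lon/lat points
    (local equirectangular approximation, adequate over a few hundred metres)."""
    lat_mid = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(lat_mid) * EARTH_RADIUS_M
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(dx, dy)

# Hypothetical example: a projected point a few metres from its surveyed position.
print(round(linear_distance_m(119.356096, 26.030987, 119.356150, 26.030990), 3))
```

Over scenes a few hundred metres across, this approximation differs from a full geodesic computation by far less than the errors reported in the table.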
Table 5. Calculation results of grid errors in scene A (unit: m).

Grid Level | Control Point Error before Optimization | Control Point Error after Optimization | Validation Point Error before Optimization | Validation Point Error after Optimization
21 | 9.515 | 9.515 | 0.0 | 0.0
22 | 7.974 | 8.416 | 6.546 | 6.546
23 | 5.146 | 5.567 | 7.187 | 7.187
24 | 5.648 | 5.648 | 6.870 | 5.879
25 | 5.943 | 5.943 | 7.041 | 6.226
Table 6. Calculation results of grid errors in scene B (unit: m).

Grid Level | Control Point Error before Optimization | Control Point Error after Optimization | Validation Point Error before Optimization | Validation Point Error after Optimization
23 | 2.104 | 2.104 | 1.960 | 1.960
24 | 1.385 | 1.300 | 2.071 | 2.071
25 | 1.088 | 0.921 | 1.280 | 1.280
26 | 1.110 | 1.027 | 1.179 | 1.057
27 | 1.181 | 1.111 | 1.127 | 1.057
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
