An Accurate Matching Method for Projecting Vector Data into Surveillance Video to Monitor and Protect Cultivated Land

Abstract: The integration of intelligent video surveillance and GIS (geographical information system) data provides a new opportunity for monitoring and protecting cultivated land. For a GIS-based video monitoring system, the prerequisite is to align the GIS data with the video image. However, existing methods and systems have shortcomings when applied to monitoring cultivated land. To address this problem, this paper proposes an accurate matching method for projecting vector data into surveillance video.


Introduction
China is experiencing an unprecedented urbanization process, with the urbanization level increasing from 17.92% in 1978 to 59.58% in 2018 [1]. With the acceleration of urbanization, urban expansion and population growth have led to an increasing demand for land, resulting in unreasonable land use. Large amounts of farmland have been occupied and converted to other uses, such as residential, industrial, commercial, infrastructure, and institutional uses, during the period of urban expansion [2]. Meanwhile, the idleness of cultivated land is pervasive in rural areas as a consequence of urbanization or unclear ownership. As China is a developing country with a large population and scarce land resources, the loss of cultivated land threatens the security of the food supply and the stability of society. In this respect, accurate monitoring of cultivated land plays a crucial role in farmland preservation.
In order to ensure the basic food supply for human survival, the Chinese government has formulated a series of policies to protect the total amount of cultivated land, such as the delineation of basic farmland protection zones (BFPZs). The homography-based approach assumes a planar ground in geographic space and requires at least four corresponding points to calculate the homography matrix parameters [18]. Both of the existing approaches rely on matching corresponding points between video image space and geospatial space [19]. Automatically matching features from video images and GIS data remains a challenge, and for PTZ cameras it is not advisable to select points manually, considering the cost and work efficiency.
Based on the above analysis, the main objective of this paper is to propose an accurate matching method for projecting vector data into surveillance video to monitor and protect the cultivated land. It mainly relies on matching the 2D image coordinates from PTZ video frames with orthophoto maps and vector surveying and mapping data. On the basis of the proposed method, we design and implement a prototype of cultivated land monitoring system. Then, the implemented prototype is applied to monitor cultivated land in Dongyang City, Zhejiang Province, China.

Related Work
Currently, several GIS-based video monitoring systems have been developed and applied in fields such as city safety management and the monitoring of forest insect defoliation and discoloration [17,18]. Inspired by these systems, this paper develops a new integrated video and GIS surveillance system for cultivated land, which can make up for the shortcomings of traditional field surveying and remote sensing technologies. In this section, we mainly introduce related work on the integration of video and GIS.
These studies mainly focus on establishing the geometric mapping relationship between a point set sampled from the video frame and a geospatial point set sampled from geodetically calibrated reference imagery. Since a video is composed of a series of independent frames (images), the geometric mapping relationship must be constructed for each frame.
At present, methods for video geo-referencing can be classified into two categories: methods based on the intersection of the view line with a DEM, and methods based on a homography transformation [18][19][20]. Methods based on the intersection between the line of sight and the DEM require precise parameters, including the inner and outer camera parameters and a DEM, to project video frames into a 3D model of the watched area. For example, Zhao et al. [17] proposed a video-based monitoring framework for forest insect defoliation and discoloration, in which GIS data and video surveillance were integrated to determine the geographical position of forest insect damage; it required a digital elevation model (DEM) and the returned parameters of the PTZ camera. Milosavljevic et al. [19] proposed a method to estimate the geo-reference of both fixed and PTZ cameras, which relied on matching 2D image coordinates from video frames with 3D geodetic coordinates to estimate the camera's position and orientation in geographic space. In this situation, video frames do not match the intuitive view of the human eye because they have been geometrically corrected to the 2D or 3D GIS scene. Furthermore, it is difficult to obtain precise real-world physical information for the acquired images because of changes in the focal length and angles of the PTZ cameras. Meanwhile, high-resolution DEMs or DSMs are not always available, and not all objects in the real world can be modelled, which causes cross-mapping to fail.
Methods based on a homography matrix assume a planar ground in geographic space, so they require at least four matching points to estimate the corresponding homography matrix [21]. Automatically detecting and matching features between ground imagery or stationary videos, such as surveillance camera videos, and satellite or aerial images can be defined as cross-view image matching [22][23][24]. It is almost impossible to automatically match features from such images because of the dramatic differences between them, as they are captured from different viewpoints and at different resolutions. Therefore, the matching points are manually selected from the cross-view images. The number, precision, and spatial distribution of the selected points are very important and directly affect the accuracy of image registration. For 2D GIS, the mapping between 2D geographic space and video image space is two-way, whereas for 3D GIS the transformation is one-way [18]. When using these methods, the terrain model is usually taken into account to satisfy the planar-ground assumption. The main drawback of these methods is that they are unsuitable for PTZ cameras: every change of pan, tilt, or zoom requires further user interaction to select corresponding features, which prevents automation.

Methodology
The proposed method for integrating surveillance video and the BFPZs (2D GIS vector data) relies on registering the BFPZs to the surveillance video image. This is realized by a mapping matrix estimated by matching 2D geodetic coordinates with 2D image coordinates in the video frames. Compared to fixed cameras, PTZ cameras can be controlled to rotate horizontally and vertically, and their field of view can be changed; the image coordinate system of each video frame is therefore independent of the others. Consequently, we need to construct the mapping relationship between the BFPZs data and every video image.
As there are large resolution and geometric differences between the two types of data, it is impossible to automatically detect and match features. Therefore, we break the integration process down into two stages, as shown in Figure 1. In the configuration phase, we determine the mapping relationship between the BFPZs and the preset video image I_0 by identifying corresponding features in the two datasets and applying estimation algorithms. A DOM (digital orthophoto map) is used to help identify corresponding points in the BFPZs and the preset video image, because vector data is graphical and hard to recognize. Based on these features, the transformation matrix can be estimated, and the coordinates of the 2D BFPZs can then be transformed from geodetic space into video image space. In the dynamic mapping phase, the mapping between the 2D BFPZs and PTZ video images can be treated as a multi-view image matching problem, since the BFPZs have already been projected into image space. Unlike the configuration phase, a matching method can automatically detect and match points without manual intervention. Once the relative geometric relation between video frames is estimated, the coordinates of the 2D BFPZs can be transformed between different video image spaces. In the following sections, we introduce the details of the homography transformation, the SIFT (scale-invariant feature transform) [25] and ASIFT (affine-SIFT) [26] matching algorithms, and the integration strategies.


Homography Transformation
The homography transformation is a popular geo-referencing technique used worldwide. Surveillance video and remote sensing (RS) sensors capture the same scene from two different views, i.e., a front view and a side view, as shown in Figure 2. Usually, the remote sensing images are geometrically corrected and used to produce a DOM. When the terrain is planar or the topographic relief is small, the complex geometric relations between the surveillance video image and the remote sensing image can be modelled by a homography transformation.

Assuming P is a point in the DOM spatial coordinate system and Q is its corresponding point in the video image coordinate system:

$$P = (X, Y), \qquad Q = (u, v),$$

where X and Y are real-world coordinates, and u and v are the column and row, respectively, in image coordinates. Given a homography matrix H, the relationship between P and Q can be expressed in homogeneous coordinates as

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix},$$

where s is a non-zero scale factor. H is a 3 × 3 matrix and can be represented as

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}, \qquad h_{33} = 1.$$

As H has eight unknowns, at least four pairs of non-collinear video image points and geospatial points are required to calculate the parameters of H. Once H is determined, the coordinates of any point in the spatial coordinate system can be projected into image coordinates.
The mapping is bidirectional: image coordinates can also be projected back into the spatial coordinate system through the inverse of H. This model is convenient in terms of geometric presentation, computation, and implementation simplicity.
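As a concrete sketch, the eight parameters of H can be estimated from four non-collinear correspondences with the direct linear transform (DLT). The function names below are illustrative, not part of the paper's system; a production implementation would typically call a library routine such as OpenCV's `findHomography`.

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Estimate the 3x3 homography mapping src_pts to dst_pts by the
    direct linear transform. Each correspondence contributes two rows of
    the linear system A h = 0; at least four non-collinear pairs are
    needed for the eight unknowns (h33 is normalized to 1)."""
    A = []
    for (X, Y), (u, v) in zip(src_pts, dst_pts):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    # The right singular vector of A with the smallest singular value
    # solves A h = 0 in the least-squares sense.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply H to an (N, 2) array of points with the projective division."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Because the homography is invertible, `project(np.linalg.inv(H), pts)` maps image coordinates back into the spatial coordinate system, matching the bidirectional property noted above.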

SIFT and ASIFT Matching
When the PTZ surveillance camera is panned, tilted, or zoomed, there is geometric distortion between the resulting video frames. To integrate GIS data and surveillance video automatically, we use an image matching algorithm to detect and match corresponding features, rather than manually selecting feature points.
SIFT [25] is a well-known image matching algorithm that is invariant to scaling, rotation, and illumination changes and robust to moderate affine distortion, with sub-pixel accuracy. The original SIFT algorithm first describes a detected feature point by its subpixel location, scale, and dominant orientation. A feature can be expressed as f = (L, σ, θ, d), where L represents the location of the feature, σ and θ denote the point's characteristic scale and dominant orientation, respectively, and d is a 128-dimensional invariant feature descriptor. Feature points are then matched by the minimum Euclidean distance between their descriptors. To ensure correct matching, Lowe [25] suggested a matching strategy called NNDR (nearest neighbor distance ratio), in which a match is accepted only if the ratio of the distance to the nearest neighbor to the distance to the second nearest neighbor is sufficiently small:

$$\frac{NN_1}{NN_2} < d_{ratio},$$

where NN_1 is the Euclidean distance from a descriptor in the sensed image to its nearest descriptor in the reference image, and NN_2 is the distance to the second nearest descriptor. When the distance ratio is lower than d_ratio, the pair is taken as a correct match. However, SIFT does not perform well when the geometric distortion between images is severe. To improve performance in this situation, ASIFT [26] was proposed, which simulates the rotation of the camera around its optical axis. ASIFT models viewpoint changes with an image affine transformation, which can be decomposed as

$$A = \lambda \, R_1(\psi) \, T_t \, R_2(\phi), \qquad T_t = \begin{bmatrix} t & 0 \\ 0 & 1 \end{bmatrix},$$

where λ is a zoom factor, R_1 and R_2 are rotation matrices, and T_t is a tilt along one axis. The processing steps of ASIFT can be summarized as follows: first, a dense set of rotation transformations is applied to both images A and B; then a series of tilt transformations is applied to all rotated images; finally, SIFT is performed to match all pairs of simulated images.
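The NNDR test can be sketched independently of how the descriptors were produced. `nndr_match` below is a hypothetical helper operating on precomputed SIFT-style descriptor arrays (in practice, the 128-vectors returned by a SIFT implementation):

```python
import numpy as np

def nndr_match(desc_a, desc_b, d_ratio=0.8):
    """Lowe's nearest-neighbor distance ratio test on two descriptor sets
    (rows are SIFT-style feature vectors). A candidate match for row i of
    desc_a is kept only when NN1 / NN2 < d_ratio, i.e. when the nearest
    neighbor in desc_b is clearly better than the second nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nn1, nn2 = dists[order[0]], dists[order[1]]
        if nn2 > 0 and nn1 / nn2 < d_ratio:
            matches.append((i, int(order[0])))
    return matches
```

The default `d_ratio=0.8` follows the threshold suggested by Lowe; ambiguous features, whose two best neighbors are nearly equidistant, are discarded.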

Integration of 2D Vector BFPZs and Surveillance Video Images
In the proposed method, the mapping between the 2D vector BFPZs and the PTZ surveillance video is constructed in a semi-automatic way. To solve the homography matrix between the BFPZs data and the preset frame of the PTZ camera, we provide an interactive tool for selecting control points. Once an adequate number of points are matched, Levenberg-Marquardt iterative optimization [27,28] is applied to calculate the homography matrix.
Assuming that the 2D vector BFPZs form a set of polygons {P_1, P_2, . . . , P_n}, for a given polygon P_i, i = 1, 2, . . . , n, the coordinates of its vertices are V(X_ij, Y_ij), j = 1, 2, . . . , s. Each vertex can be mapped into the image coordinate system, giving coordinates V'(x_ij, y_ij), computed as

$$x_{ij} = \frac{A_0 X_{ij} + B_0 Y_{ij} + C_0}{G_0 X_{ij} + H_0 Y_{ij} + 1}, \qquad y_{ij} = \frac{D_0 X_{ij} + E_0 Y_{ij} + F_0}{G_0 X_{ij} + H_0 Y_{ij} + 1},$$

where A_0, B_0, C_0, D_0, E_0, F_0, G_0, and H_0 are the parameters of H_0. The precision, number, and spatial distribution of the selected points play a vital role in calculating the homography matrix and directly affect the accuracy of the estimated matrix. It is better to choose relatively permanent points, such as road intersections, as shown in Figure 3.
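A minimal sketch of this per-vertex mapping, with the eight parameters of H_0 passed explicitly (the function name is illustrative):

```python
def map_vertex(X, Y, A0, B0, C0, D0, E0, F0, G0, H0):
    """Map one BFPZ vertex (X, Y) from geodetic space into the preset
    image using the eight parameters of H_0 (the ninth entry of the
    homography is normalized to 1)."""
    w = G0 * X + H0 * Y + 1.0   # projective denominator
    x = (A0 * X + B0 * Y + C0) / w
    y = (D0 * X + E0 * Y + F0) / w
    return x, y
```

With identity parameters the vertex is unchanged; non-zero G0 and H0 introduce the perspective foreshortening seen in oblique surveillance views.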
When the PTZ camera is panned, tilted, or zoomed, the homography matrix between the GIS data and the video frames is calculated as follows. First, the ASIFT matching method is used to detect and match features from the video sequences. Then, random sample consensus (RANSAC) [29] is applied to estimate the homography matrix between video frames, since the obtained correspondences contain mismatches even with the best automatic matching method. RANSAC iteratively estimates the parameters of the homography model from a set of observations containing outliers. Finally, the mapping between the 2D vector BFPZs and any video frame can be calculated.
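The RANSAC step can be sketched as follows. This is a simplified, pure-NumPy stand-in for a library routine (e.g., OpenCV's `findHomography` with the RANSAC flag), with illustrative function names and a least-squares DLT as the model fitter:

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: two equations per correspondence."""
    A = []
    for (X, Y), (u, v) in zip(src, dst):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    if abs(H[2, 2]) < 1e-12:
        raise np.linalg.LinAlgError("degenerate sample")
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to an (N, 2) point array."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def ransac_homography(src, dst, n_iters=500, thresh=2.0, seed=0):
    """Robust estimation: fit candidate homographies to random 4-point
    samples, keep the hypothesis with the most inliers (reprojection
    error below thresh pixels), then refit on all inliers."""
    rng = np.random.default_rng(seed)
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)
        try:
            H = fit_homography(src[idx], dst[idx])
        except np.linalg.LinAlgError:
            continue
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The inlier threshold (here a hypothetical 2 pixels) trades off tolerance to matching noise against sensitivity to genuine mismatches.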
Specifically, the locations of the 2D vector BFPZs in each video frame's coordinate system can be computed with two strategies; what matters is the selection of the reference video frame. If the preset video frame is chosen as the reference image, ASIFT and RANSAC are applied to estimate the mapping matrix between each video frame and the preset image. As shown in Figure 4, assuming that the estimated transformation matrix between the reference frame and frame I_t is H_t, any vertex V'_t of the BFPZs polygons in video frame I_t can be calculated from its corresponding vertex V' in the reference frame as

$$V'_t = H_t V'.$$
Alternatively, image matching can be performed between adjacent video frames, as shown in Figure 5. In this situation, vertex V'_t of the BFPZs polygons in video frame I_t is computed as Equation (10):

$$V'_t = H_t H_{t-1} \cdots H_1 V', \qquad (10)$$
where H_i is the transformation matrix between video frames I_{i-1} and I_i.
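Under the adjacent-frame strategy, the per-pair matrices are composed into a single reference-to-current transform; a small sketch (function names illustrative):

```python
import numpy as np

def chain_homographies(hs):
    """Compose per-pair homographies H_1..H_t, where H_i maps frame
    I_{i-1} into frame I_i, into one matrix taking the reference frame
    I_0 directly into frame I_t."""
    H_total = np.eye(3)
    for H in hs:
        H_total = H @ H_total  # each new transform is applied on the left
    return H_total

def project_vertex(H, v):
    """Project a 2D vertex with a homography (homogeneous division)."""
    x, y, w = H @ np.array([v[0], v[1], 1.0])
    return x / w, y / w
```

Note that chaining adjacent-frame matrices can accumulate small per-pair estimation errors, whereas matching every frame against the preset reference does not; this is the trade-off between the two strategies.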

Framework of Cultivated Land Surveillance System
On the basis of the method for integrating the 2D vector BFPZs and real-time PTZ surveillance video, we design and implement a prototype monitoring system. The system can collect and manage geospatial and video data and perform overall display and analysis. Through video surveillance of cultivated land, supervisors can improve the efficiency of their supervision work. As shown in Figure 6, the integration framework mainly contains three layers: the presentation layer, the middle layer, and the data layer.


1.
Data layer: The data layer is mainly used to store and manage geospatial and surveillance video data. In the proposed system, the geospatial data contain BFPZs in 2D vector format and high-resolution ortho-images in raster format. The former are used to define ROIs (regions of interest) in real-time video, and the latter are used to establish the mapping relationship between geospatial space and image pixel coordinates.

2.
Middle layer: The middle layer has functions of data acquisition, processing, and analysis. As geospatial data and video data are unstructured data, they are independent of each other. In this layer, GIS components are used for fetching geospatial data, and video stream components are used for reading real-time video data. 2D mapping between BFPZs and video images is the core component, and BFPZs data can be projected into real-time video to generate ROI area. For the specific area, video analysis can be performed.

3.
Presentation layer: The presentation layer provides interactive integration of the 2D BFPZs data and surveillance video, as well as real-time alerts for the command and control center.
The software structure of the cultivated land monitoring system consists of five parts: video retrieval, video control, image registration, image matching, and video analysis. Video retrieval is mainly used to load real-time video over the HTTP protocol and display the current video frame; it can also visualize the 2D BFPZs overlaid on video frames. Video control allows the user to pan, tilt, and zoom the PTZ camera and fetch the current, previous, or next video frame. Image registration provides an interactive tool for constructing the mapping matrix between the 2D vector BFPZs and the preset video image by selecting correspondences between image space and geospatial space and solving the matrix parameters; the 2D vector BFPZs are then projected into the preset video image. Image matching provides an automatic way to construct the mapping between surveillance video images using the multi-view matching method ASIFT, so that correspondences between video images are detected and matched without human involvement, and the 2D vector BFPZs projected into the preset video image can be mapped into real-time video images with different matching strategies. Finally, the 2D vector BFPZs polygons can be set as the defense area for intelligent video surveillance, which can efficiently extract useful information from the huge volume of video by automatically detecting, tracking, and recognizing objects and analyzing activities [14].

Experiments and Results
Details of the experiments on the alignment of 2D vector data and video images are presented in this section. We tested our system on cultivated land with a relatively smooth landscape and compared the alignment results with manually constructed ground truth.

Datasets and Evaluation Criteria
In order to visually and quantitatively evaluate the proposed method, two intelligent PTZ surveillance cameras were deployed on China Mobile signal towers in Dongyang City, Zhejiang Province, China. One is located in Woodcarving Town, and the other in the Wangfeng community, Jiangbei Street. Two video datasets were recorded from these cameras; detailed information is given in Table 1. Dataset MDXZ contains scenes of cultivated land in Woodcarving Town, while JBWF contains scenes of the Wangfeng community. The two datasets were created by rotating the cameras vertically and horizontally as well as changing the zoom level. As shown in Figure 7, the cultivated landscape in both places is relatively smooth.

The mapping between the 2D vector BFPZs and video frames is treated as an image registration problem. Therefore, the root mean square error (RMSE) is used to quantify the accuracy of the alignment results; it measures the deviation between the observed values and the ground truth:

$$RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left[ (x_i - x'_i)^2 + (y_i - y'_i)^2 \right]},$$

where (x_i, y_i) are the coordinates of the i-th of the M selected points in the transformed 2D GIS data, and (x'_i, y'_i) denotes the corresponding ground-truth values.
Since ground-truth values are not directly available, it is hard to determine the projection error of the estimated mapping relationship. In addition, the pixel coordinate systems of different video frames are independent. To cope with these challenges, we manually constructed ground truth for each video frame as follows: (1) select 20 points in the 2D vector BFPZs and record their projected coordinates in the image pixel coordinate system; (2) for each point in (1), find its corresponding point by visual inspection and record its coordinates in the current image pixel system.
Then, for each video frame, the projection error can be calculated using the RMSE. Note that some principles should be obeyed when selecting points; for example, the selected points should be well distributed.
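A minimal sketch of this RMSE computation over the M selected point pairs, assuming (as in the formula above) that the per-point deviation is the Euclidean distance between the projected and ground-truth coordinates:

```python
import numpy as np

def rmse(projected, ground_truth):
    """Root mean square error between projected BFPZ points (x, y) and
    their manually identified ground-truth positions (x', y'): the square
    root of the mean squared Euclidean deviation over all M point pairs."""
    p = np.asarray(projected, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean(np.sum((p - g) ** 2, axis=1))))
```

The result is expressed in pixels of the current video frame, so values from different zoom levels are not directly comparable in ground units.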

Implementations and Experimental Results
The cultivated land monitoring system was implemented as a Visual C# project in Microsoft Visual Studio 2017 and works in a client/server environment. The application is mainly designed for PTZ network (i.e., IP) cameras but is also suitable for fixed cameras. As Figure 8 shows, the user can retrieve real-time video using the HTTP protocol. The URL, username, and password are needed to access the camera API and retrieve the current video frame.
The user interface is composed of two interconnected components. The first involves the registration of 2D BFPZs data from the geospatial coordinates system to video image pixel The mapping between 2D vector BFPZs and video frames is taken as an issue of image registration. Therefore, root of mean square error (RMSE) is used to qualify the accuracy of alignment results, which measures the deviation between the observed value and ground truth. It can be calculated as follows: where (x, y) is the coordinates of the selected M points in transformed 2D GIS data, and (x , y ) demotes the ground true values.
Since we lack ground truth values, it is hard to determine the projection error of the estimated mapping relationship. In addition, pixel coordinates system between video frames is independent. To cope with the above challenge, we manually constructed ground truth for each video frame as follows: (1) Select 20 points in 2D vector BFPZs and record their projected coordinates in the image pixel coordinate system; (2) For each point in (1), find its corresponding point by visual inspection and record its coordinates in the current image pixel system.
Then, for each video frame, the projection error can be calculated by using RMSE. Note that some principles should be obeyed when selecting points. For example, the selected points should be well-distributed.

Implementations and Experimental Results
The cultivated land monitoring system was implemented as a Visual C# project in Microsoft Visual Studio 2017 and works in a client/server environment. This application is mainly designed for PTZ network (i.e., IP) cameras. It is also suitable for fixed cameras. As Figure 8 shows, the user can retrieval real-time video using the HTTP protocol. The URL, username, and password are needed to access the camera API to retrieve the current video frame.
The user interface is composed of two interconnected components. The first involves the registration of 2D BFPZs data from the geospatial coordinates system to video image pixel coordinate system, which corresponded to the function of image registration. The second involves registering 2D BFPZs data in different video image space, which corresponded to the function of image matching.  To visually check the suitability of the proposed method, several examples are taken in Figures  8 and 9. Though the intention of this paper is to register 2D BFPZs into real-time video to determine the defense boundary of cultivated land, in fact, this is a reversible process, which means that the real-time video image is also mapped into the geospatial space. This mapping can satisfy the needs of spatial analysis, such as the measurement of cultivated land. Figure 9 shows the cross-mapping results of 2D BFPZs (polygons in red) and the preset video image, while Figure 10 shows the matching results between video images and results of projected 2D BFPZs in different video image space using ASIFT and RANSAC method. From these figures, we can see that the proposed method can successfully build the transformation between 2D BFPZs and video image, and accurately define the boundaries of the protected cultivated land in the video image.  To visually check the suitability of the proposed method, several examples are taken in Figures 8  and 9. Though the intention of this paper is to register 2D BFPZs into real-time video to determine the defense boundary of cultivated land, in fact, this is a reversible process, which means that the real-time video image is also mapped into the geospatial space. This mapping can satisfy the needs of spatial analysis, such as the measurement of cultivated land. 
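The registration of 2D BFPZs into a video frame reduces to estimating a homography from the control-point pairs and projecting the polygon vertices through it. The sketch below uses the standard direct linear transform (DLT); the paper does not show the authors' code, so this is an illustration of the underlying mathematics rather than their exact implementation.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT estimate of the homography mapping src -> dst.

    src, dst: (N, 2) arrays of corresponding points (e.g., geospatial
    control points and their pixel positions), with N >= 4.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography vector is the null space of the design matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project_points(H, pts):
    """Apply homography H to (N, 2) points, e.g., BFPZ polygon vertices."""
    pts = np.asarray(pts, float)
    mapped = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

With 6-8 control points, as used in our experiments, the system is overdetermined and the SVD yields a least-squares solution.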

Accuracy Analysis
The examples presented in Figures 9 and 10 show the feasible and achievable accuracy of the solution for projecting 2D vector BFPZs into real-time video frames. In this section, we mainly focus on the evaluation of the proposed method in a quantitative way and comprehensively analyze the influence of the selected reference image on the alignment results.
In this experiment, the video frames named MDXZ05 and JBWF05 in the two datasets were chosen as the preset images, respectively. We manually selected 6-8 control points to map the 2D BFPZs from the geospatial coordinate system to the preset image pixel system. Two matching strategies (Strategy-A: the adjacent image is set as the reference image; Strategy-S: the preset image is set as the reference image) were then applied to automatically detect and match control points and estimate the mapping matrix. The relationship between images is depicted in Figures 11 and 12, taking the MDXZ dataset as an example.
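The mapping-matrix estimation in both strategies combines feature matching with RANSAC outlier rejection. The RANSAC stage can be sketched as below, operating on putative matches between the reference and current frames; the DLT fitter, the iteration count, and the 2-pixel threshold are illustrative choices, not the paper's exact settings (in practice OpenCV's `cv2.findHomography` with the `cv2.RANSAC` flag performs this step).

```python
import numpy as np

def fit_homography(src, dst):
    # Direct linear transform on corresponding points (N >= 4).
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows += [[x, y, 1, 0, 0, 0, -u * x, -u * y, -u],
                 [0, 0, 0, x, y, 1, -v * x, -v * y, -v]]
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    p = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, n_iter=500, thresh=2.0, seed=0):
    """Estimate the mapping matrix from putative matches (e.g., ASIFT
    correspondences), rejecting mismatches; returns (H, inlier_mask)."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    best_mask, best_count = None, -1
    for _ in range(n_iter):
        # Hypothesize a homography from a minimal 4-point sample.
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        mask = err < thresh
        if mask.sum() > best_count:
            best_count, best_mask = int(mask.sum()), mask
    # Refit on all inliers of the best hypothesis.
    return fit_homography(src[best_mask], dst[best_mask]), best_mask
```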
Quantitative comparisons of the two matching strategies on the two datasets using the RMSE are reported in Tables 2 and 3, respectively, where RMSE-A denotes the results of Strategy-A and RMSE-S the results of Strategy-S. From Tables 2 and 3, we can see that the average projection error of the two datasets is within 13 pixels. As a surveillance camera images the scene in a side view, the GSD (ground sample distance) varies with the distance from the object to the camera. With reference to the GSD of TDOMs (true digital orthophoto maps) generated from low-altitude UAV images [30], the proposed integration method can achieve meter-level accuracy and meet the requirements of real-time cultivated land monitoring in the experimental area. Meanwhile, the average image registration accuracy of dataset MDXZ is higher than that of dataset JBWF, which suggests that the geometric difference between video images in JBWF is greater than that in MDXZ.
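The reported errors follow the standard point-wise RMSE over check points; the sketch below assumes that definition, since the paper's formula is not restated in this section.

```python
import numpy as np

def reprojection_rmse(projected, reference):
    """RMSE (pixels) between projected check points and their manually
    measured reference positions in the video frame."""
    d = np.asarray(projected, float) - np.asarray(reference, float)
    # Mean squared Euclidean distance over all check points, then root.
    return float(np.sqrt(np.mean(np.sum(d * d, axis=1))))
```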
Comparing the experimental results of the video frames in the MDXZ dataset, a clear trend stands out: both RMSE-A and RMSE-S increase as the matching process moves from MDXZ05 to MDXZ00 and from MDXZ05 to MDXZ09, since the geometric distortion becomes larger as the PTZ camera is panned, tilted, and zoomed. Furthermore, Strategy-S obtains higher accuracy than Strategy-A, because matching error accumulates as the number of matching steps increases. An interesting discovery is that the best image registration results may not be obtained with the nearest image, but with the second or third nearest image, as seen in the results of MDXZ02, MDXZ03, and MDXZ04 in Table 2. A similar conclusion can be reached from Table 3.
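The error accumulation in Strategy-A follows from how frame-to-frame homographies are chained: the transformation from the preset frame to frame n is the product of all intermediate estimates, so each estimation error is carried into every later frame, whereas Strategy-S uses a single direct estimate. The chaining itself is just matrix composition (the matrices below are illustrative):

```python
import numpy as np

def compose(step_homographies):
    """Chain per-step homographies H_{k->k+1} into H_{0->n} (Strategy-A).

    Note the order: the first step is applied first, so each new matrix
    is left-multiplied onto the accumulated transform. Any error in one
    step propagates into all subsequent composed transforms.
    """
    H = np.eye(3)
    for Hk in step_homographies:
        H = np.asarray(Hk, float) @ H
    return H / H[2, 2]
```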

Discussions
For a GIS-based video monitoring system, the prerequisite is to align the GIS data with video images. To accomplish this alignment, it is necessary to establish a mapping relationship between the video frame and the GIS data. In this study, we propose a method to register 2D vector data of the protected cultivated land onto surveillance video images. Experimental results on cultivated land with a relatively flat landscape suggest that the proposed method can accurately project 2D vector BFPZs onto real-time video frames and obtain meter-level positioning accuracy, meeting the requirements of cultivated land monitoring.
Currently, there have been some studies on GIS-based video monitoring systems. The GeoScopeAVS system designed by Milosavljevic et al. [16] integrated GIS and video surveillance for real-time retrieval of information about viewed geospatial objects. In their system, 3D GIS and camera views were aligned so that appropriate 3D virtual objects could be visualized and placed over the video. However, this method may not be applicable to cultivated land, because 3D models of geographic features and an accurate DEM of the area are essential. In practice, modeling efforts concentrate on urban environments, especially buildings, while the modeling of rural environments is largely ignored. Moreover, not all objects in the real world can be completely modeled, such as vehicles, people, trees, and street lights [15]. Zhao et al. [17] designed a GIS-based video monitoring system for forest insect defoliation and discoloration. Though 2D GIS and 3D GIS are simultaneously integrated in their system, it is not a real fusion because the data are only geographically linked and displayed separately in their own windows. Consequently, it is difficult for non-experts to interpret these data, as they are not presented in the same layer. Milosavljevic et al. [19] proposed a method to georeference video images that relies on matching video frame coordinates of certain point features with their 3D geographic locations. Overall, the above-mentioned methods have one thing in common: all of them require the internal and external parameters of the camera and high-precision 3D geospatial data, which are difficult to obtain accurately for cultivated land. Zhang et al. [18] presented a GIS-based prototype system for city safety management, describing a semi-automatic cross-mapping method based on 2D GIS and some constraints.
However, the case of the PTZ camera was not considered in their system, which makes it unsuitable for monitoring cultivated land. In comparison, our solution is more feasible for this task. On the one hand, it is fast and easy to implement: neither camera parameters nor 3D models of objects are needed to integrate 2D GIS data and PTZ video images. Once a number of control points are identified from the 2D GIS data and the selected reference video image, the alignment of 2D GIS data and PTZ video frames can be realized by an automatic feature matching method. On the other hand, even non-experts and inexperienced persons can intuitively see the boundary of protected farmland in the surveillance video and identify destruction of cultivated land.
Several factors may influence the performance of our system. The selection of reference images, as well as the amplitude and speed of camera movement, may change the overlap between images and thus affect feature matching. If the overlap between images is too small, it is difficult to obtain uniformly distributed corresponding features to estimate the geometric transformation matrix. In addition, the matching strategy affects the accuracy of the registration between video images and GIS data. As described in Section 5.2, a reasonable matching strategy can reduce the propagation of matching error. Overall, it is better to choose Strategy-S instead of Strategy-A for the integration of 2D vector BFPZs and video images: one reason is the error propagation in Strategy-A, and the other is that it is difficult to determine the interval between adjacent images. The major shortcoming of Strategy-S is that the movement of the PTZ camera is restricted; if the overlap between the preset frame and the current frame is too small, this strategy may fail. Furthermore, the proposed method is not suitable for areas of complex terrain, such as mountainous areas. In such areas, the geometric mapping relationship between the video image plane and 2D GIS data cannot be simply modeled by a homography transformation; the alignment of video images with GIS data may fail, preventing subsequent applications.
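The overlap concern above can be monitored at run time with a cheap proxy: warp the frame corners through the current homography estimate and measure how much of their bounding box still lies inside the frame. This is a rough heuristic of our own for illustration, not a procedure from the implemented system:

```python
import numpy as np

def overlap_fraction(H, w, h):
    """Rough overlap proxy between two frames related by homography H.

    Warps the (w x h) frame corners by H and returns the fraction of the
    frame area covered by the warped corners' axis-aligned bounding box.
    Values near 0 warn that Strategy-S matching is likely to fail.
    """
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], float)
    p = np.column_stack([corners, np.ones(4)]) @ np.asarray(H, float).T
    p = p[:, :2] / p[:, 2:3]
    x0, y0 = p.min(axis=0)
    x1, y1 = p.max(axis=0)
    ix = max(0.0, min(x1, w) - max(x0, 0.0))
    iy = max(0.0, min(y1, h) - max(y0, 0.0))
    return (ix * iy) / (w * h)
```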
Based on the alignment results, we can easily identify the occurrence of farmland destruction by visually inspecting the image content covered by the 2D vector area. In China, the conversion of permanent farmland in BFPZs to other uses is forbidden, while the use of non-permanent agricultural land can be changed, for example, to construction land, with the approval of the relevant departments. Since the two kinds of farmland are very similar and indistinguishable in video images, it is necessary to label the protected area on the video image. The proposed method provides a solution to the problem of labelling basic farmland in PTZ videos: polygons in the 2D GIS data define the boundary of protected cultivated land, and if the image content delimited by these polygons is recognized as construction land, it can be flagged as an illegal change.
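Deciding whether detected construction activity falls inside a projected BFPZ boundary is a standard point-in-polygon test. A self-contained ray-casting sketch (our illustration, not the system's code):

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: is image point pt inside the projected polygon?

    poly is a list of (x, y) vertices of the projected BFPZ boundary.
    A horizontal ray from pt toggles 'inside' at each edge crossing.
    """
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```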
However, little effort has been made to automatically identify the destruction of farmland or unauthorized changes of farmland use from video images; there is almost no research on this issue in the current literature. Moreover, it is a complicated problem that requires technologies from various fields, such as computer vision, intelligent video analysis, image processing, and data mining. In the future, we will try to improve the robustness of our system and further study the automatic detection of illegal farmland occupation from real-time PTZ video images delineated by 2D GIS vector data.

Conclusions
The objective of this paper is to integrate 2D GIS data and real-time PTZ surveillance video, and to implement a prototype cultivated land monitoring system based on the proposed integration method. This integration can assist users in identifying illegal occupation of protected cultivated land as early as possible. Since permanent cultivated land and non-permanent agricultural land are very similar and indistinguishable in video images, 2D GIS data can provide the boundaries and other semantic information of the protected farmland. For the integration process, the alignment of GIS data and video frames is necessary. In the proposed method, 2D GIS vector data are projected onto the video image by a transformation matrix estimated from extracted pairs of corresponding features. Compared with existing methods or systems, the proposed integration method has the following advantages: (1) it is fast and easy to implement, as neither camera parameters nor 3D models of objects or terrain are needed to integrate 2D GIS data and PTZ video images; (2) it is easier for non-experts and inexperienced persons to identify illegal occupation of the protected land from the real-time video. In addition, some other issues are worthy of consideration, such as how to automatically identify the destruction of farmland or unauthorized changes of farmland use from video images with the aid of the projected 2D GIS data. This will be the focus of our future work.