LandQ v2 : A MapReduce-Based System for Processing Arable Land Quality Big Data

: Arable land quality (ALQ) data are a foundational resource for national food security. With the rapid development of spatial information technologies, the annual acquisition and update of ALQ data covering the country have become more accurate and faster. ALQ data are mainly vector-based spatial big data in the ESRI (Environmental Systems Research Institute) shapeﬁle format. Although the shapeﬁle is the most common GIS vector data format, unfortunately, the usage of ALQ data is very constrained due to its massive size and the limited capabilities of traditional applications. To tackle the above issues, this paper introduces LandQ v2 , which is a MapReduce-based parallel processing system for ALQ big data. The core content of LandQ v2 is composed of four key technologies including data preprocessing, the distributed R-tree index, the spatial range query, and the map tile pyramid model-based visualization. According to the functions in LandQ v2 , ﬁrstly, ALQ big data are transformed by a MapReduce-based parallel algorithm from the ESRI Shapeﬁle format to the GeoCSV ﬁle format in HDFS (Hadoop Distributed File System), and then, the spatial coding-based partition and R-tree index are executed for the spatial range query operation. In addition, the visualization of ALQ big data with a GIS (Geographic Information System) web API (Application Programming Interface) uses the MapReduce program to generate a single image or pyramid tiles for big data display. Finally, a set of experiments running on a live system deployed on a cluster of machines shows the efﬁciency and scalability of the proposed system. All of these functions supported by LandQ v2 are integrated into SpatialHadoop, and it is also able to efﬁciently support any other distributed spatial big data systems.


Introduction
Arable land is a foundational resource for national food security.Arable land quality (ALQ) data are about the quality evaluation of arable land by classification and gradation.With the rapid development of spatial information technologies, the annual acquisition and update of ALQ data covering the country have become more accurate and faster.ALQ data are dominated by spatial ISPRS Int.J. Geo-Inf.2018, 7, 271 2 of 15 vector data in the ESRI shapefile format.In China, the government began to perform all round arable land quality monitoring [1,2] and evaluation work in 2011.In addition, in 2013, the government formed a relatively complete and large arable land quality dataset of approximately 2.51 TB in the ESRI shapefile format [3].To make such an enormous amount of data readily available requires a management information system with high-performance computing that will play a crucial role in offering accessibility for the government to make scientific decisions and ensuring valuable spatial data safety, readily, and practically.
The traditional standalone version GIS development pattern has been challenged by the volume, velocity, and variety of the arable land quality big dataset.Our team has developed a GIS cluster-based management information system, LandQ v1 (version 1) [3].In the first version, the ArcGIS platform was used as the basic development framework and Oracle 11g was adopted as the spatial database.However, along with the continuous application of the system (LandQ v1 ), the increasing challenges and overhead make it very hard to use such a rich spatial vector dataset from the national scale.On the one hand, data preprocessing takes a large amount of time and is not ideal in terms of system performance.On the other hand, a large number of tables not only cause data organization difficulties, but also bring about client service scheduling problems.In addition, there are serious deficiencies in the data querying and visualizations for ALQ big data.
The rise of big data has brought about a remarkable change in the traditional GIS industry, especially based on cloud computing technology [4,5], which has provided a potential solution for addressing these computational challenges.Several processing and analyzing platforms have been developed for spatial big data, such as Hadoop-GIS [6], SpatialHadoop [7], GeoMesa [8], GeoSpark [9], ST-Hadoop [10], and others.Meanwhile, related application systems have also sprung up from different fields around the world, including satellite or remote sensing data [11,12], sensor data [13], social media [14], and others [15,16].For the various application areas, the directions and emphasis of the above systems focus on different topics.In this paper, we take the national land and resource information fields, especially the arable land quality (ALQ) data, into consideration.For the ALQ big data, cloud computing technology is adopted to solve the practical problems in data storage and management.
LandQ v2 (version 2) is a MapReduce-based parallel processing system that lays out the necessary infrastructure to store, index, query, and visualize the nation's arable land quality (ALQ) big data.
To solve the problem of data storage in the cloud environment, we designed a GeoCSV model to support the data processing, and then, we used a spatial coding-based approach to partition the whole dataset and build a distributed R-tree index in HDFS.In addition, the spatial query and visualization operations are developed and implemented within the MapReduce program.All these functions, supported by LandQ v2 , are integrated into SpatialHadoop [7], which is a very successful framework for spatial big data, and it is also able to efficiently support any other distributed big spatial data systems.

LandQ v2 System Overview
Figure 1 shows the LandQ v2 system overview.It was developed in a cloud computing environment, which mainly includes four levels, namely, the spatial data infrastructure layer, the cloud computing platform layer, the key technologies layer, and the system application layer.The spatial data infrastructure layer is composed of ordinary servers that form a resilient and scalable cluster, mainly providing data storage and computing resources for the upper layer.In the cloud computing platform layer, this paper selects the mainstream Hadoop ecosystem, and the current system architecture mainly involves the processing system (MapReduce), program language (Pig), and storage system (HDFS).The core content of this system is the key technologies layer, which includes the vector data cloud storage model, spatial data partitioning, the distributed R-tree index, the tile pyramid model-based visualization, and others, as shown in Figure 1.The system application layer is mainly based on the above layers to handle and leverage arable land quality (ALQ) big data.
In the system development process, in order to improve the compatibility and continuity of LandQ v2 , the existing open source standards and basic tools are referenced.For example, the cloud storage model-GeoCSV-is proposed based on the OGC (Open Geospatial Consortium) standard, characteristics of the distributed storage and parallel programming model.For spatial data processing, the ESRI Geometry API (Application Programming Interface) for Java is used to complete the analysis of simple geometric object elements.In addition, in the data visualization engine, the mainstream ArcGIS Runtime for the WPF (Windows Presentation Foundation) desktop development and ESRI Web API for web development are used to meet the application requirements of the system client.All of these functions supported by LandQ v2 are integrated into SpatialHadoop.

Data Preprocessing
Arable land quality (ALQ) data are mainly vector-based spatial data in the ESRI Shapefile format, which is a mainstream spatial vector data storage format [17] because of its simple structure, high precision, and rapid display capabilities.However, in the era of spatial big data, this format faces severe challenges.On the one hand, its storage volume is constrained by the 2 GB upper limit, which will result in thousands of small files and will increase the difficulty of data management and processing.On the other hand, there are obvious shortcomings in the field types, index efficiency, and network transmission.In addition, the visualization of the Shapefile generally requires professional GIS tools or software to operate.To solve the above shortcomings, GeoCSV is proposed based on the OGC standard.This format also offers to the idea of distributed storage and the parallel programing model.

Vector Data Cloud Storage Model
As Figure 2 shown, the data structure of GeoCSV [18] adopts the object-based vector data model to store spatial geometric elements in the cloud environment.GeoCSV uses the simple Key-Value storage model and the OGC-WKT format [19] to describe the spatial geometry.It takes the advantages of CSV (comma-separated values) files to organize the spatial vector data, that is, each record represents only one spatial geometry object.CSV file contains a number of data records.This is in line with the cloud computing platform, which is conducive to spatial data segmentation, processing, and analysis.

Data Preprocessing
Arable land quality (ALQ) data are mainly vector-based spatial data in the ESRI Shapefile format, which is a mainstream spatial vector data storage format [17] because of its simple structure, high precision, and rapid display capabilities.However, in the era of spatial big data, this format faces severe challenges.On the one hand, its storage volume is constrained by the 2 GB upper limit, which will result in thousands of small files and will increase the difficulty of data management and processing.On the other hand, there are obvious shortcomings in the field types, index efficiency, and network transmission.In addition, the visualization of the Shapefile generally requires professional GIS tools or software to operate.To solve the above shortcomings, GeoCSV is proposed based on the OGC standard.This format also offers to the idea of distributed storage and the parallel programing model.

Vector Data Cloud Storage Model
As Figure 2 shown, the data structure of GeoCSV [18] adopts the object-based vector data model to store spatial geometric elements in the cloud environment.GeoCSV uses the simple Key-Value storage model and the OGC-WKT format [19] to describe the spatial geometry.It takes the advantages of CSV (comma-separated values) files to organize the spatial vector data, that is, each record represents only one spatial geometry object.CSV file contains a number of data records.This is in line with the cloud computing platform, which is conducive to spatial data segmentation, processing, and analysis.Based on the GeoCSV model, this model has the following advantages: (1) object-oriented storage of existing spatial vector data, and the parallel computation of fine granularity is supported; (2) simple data structure with OGC standard data and the advantages of network transmission; (3) CSV is used for the separation of storage, which is conducive to the distributed processing of data segmentation, processing, and analysis; and (4) spatial data and attribute data formats are unified, and it is easy to add a field or adjust the data structure.

Data Conversion
Data conversion from Shapefile to GeoCSV is executed by one MapReduce job as shown in Figure 3.In this way, batch conversions can be implemented, which will change multiple small Shapefiles into one large GeoCSV file.Since a Shapefile is composed of multiple sub-files (.shp, .shx,.dbf,etc.), and each sub-file contains a unique header file, the conversion of a single Shapefile file will be treated as a separate subtask in the implementation of the data conversion algorithm.The task is done primarily as a two-part process, namely, Shapefile parsing and spatial object reconstruction in GeoCSV.Based on the GeoCSV model, this model has the following advantages: (1) object-oriented storage of existing spatial vector data, and the parallel computation of fine granularity is supported; (2) simple data structure with OGC standard data and the advantages of network transmission; (3) CSV is used for the separation of storage, which is conducive to the distributed processing of data segmentation, processing, and analysis; and (4) spatial data and attribute data formats are unified, and it is easy to add a field or adjust the data structure.

Data Conversion
Data conversion from Shapefile to GeoCSV is executed by one MapReduce job as shown in Figure 3.In this way, batch conversions can be implemented, which will change multiple small Shapefiles into one large GeoCSV file.Since a Shapefile is composed of multiple sub-files (.shp, .shx,.dbf,etc.), and each sub-file contains a unique header file, the conversion of a single Shapefile file will be treated as a separate subtask in the implementation of the data conversion algorithm.The task is done primarily as a two-part process, namely, Shapefile parsing and spatial object reconstruction in GeoCSV.

Geometry
Geometry The spatial vector data cloud storage model, GeoCSV [18].
Based on the GeoCSV model, this model has the following advantages: (1) object-oriented storage of existing spatial vector data, and the parallel computation of fine granularity is supported; (2) simple data structure with OGC standard data and the advantages of network transmission; (3) CSV is used for the separation of storage, which is conducive to the distributed processing of data segmentation, processing, and analysis; and (4) spatial data and attribute data formats are unified, and it is easy to add a field or adjust the data structure.

Data Conversion
Data conversion from Shapefile to GeoCSV is executed by one MapReduce job as shown in Figure 3.In this way, batch conversions can be implemented, which will change multiple small Shapefiles into one large GeoCSV file.Since a Shapefile is composed of multiple sub-files (.shp, .shx,.dbf,etc.), and each sub-file contains a unique header file, the conversion of a single Shapefile file will be treated as a separate subtask in the implementation of the data conversion algorithm.The task is done primarily as a two-part process, namely, Shapefile parsing and spatial object reconstruction in GeoCSV.

Spatial Index
In the distributed storage environment, a spatial dataset will be separated into several blocks dispensed to different nodes in the cluster [20], and the spatial index technology based on a single node cannot be used directly.Compared with the centralized index, the distributed index increases the communication protocol and communication overhead among the nodes in the cluster [21].The index mechanism of a distributed storage system must take full account of the distributed systems and the organization of the data storage.

Data Partitioning
In this section, we present the spatial partitioning architecture based on the spatial coding-based approach (SCA) [22].Figure 4 5) Spatial data partitioning: every spatial object will be matched with the spatial coding and written into the data blocks in HDFS; (6) Allocating the data blocks: all blocks will be distributed into the nodes in the cluster with their spatial codes.

Spatial Index
In the distributed storage environment, a spatial dataset will be separated into several blocks dispensed to different nodes in the cluster [20], and the spatial index technology based on a single node cannot be used directly.Compared with the centralized index, the distributed index increases the communication protocol and communication overhead among the nodes in the cluster [21].The index mechanism of a distributed storage system must take full account of the distributed systems and the organization of the data storage.

Data Partitioning
In this section, we present the spatial partitioning architecture based on the spatial coding-based approach (SCA) [22].Figure 4

Distributed R-Tree Index
The R-tree index is the most influential data access method in the GIS database [23].As Figure 5 shows, the distributed R-tree index includes the local index and global index [24].The local index, also called a data block index, is created for the collection of spatial object information in the data block after the spatial partitioning steps in HDFS.Here, this index will generate an index header file for each data block in HDFS, and the contents of the index header file include the envelope rectangle (MBR), the count of spatial objects (Count), offset to the start (Offset), and other basic information.The global index is created from the local index.The purpose of this phase is to build the global index structure, indexing all partitions in HDFS.By concatenating all the local index files into one file, the global index file will be generated.In addition, the content of the global index file includes the data block id (ID), the envelope rectangle (MBR), the total size (Size), and others.

Distributed R-Tree Index
The R-tree index is the most influential data access method in the GIS database [23].As Figure 5 shows, the distributed R-tree index includes the local index and global index [24].The local index, also called a data block index, is created for the collection of spatial object information in the data block after the spatial partitioning steps in HDFS.Here, this index will generate an index header file for each data block in HDFS, and the contents of the index header file include the envelope rectangle (MBR), the count of spatial objects (Count), offset to the start (Offset), and other basic information.The global index is created from the local index.The purpose of this phase is to build the global index structure, indexing all partitions in HDFS.By concatenating all the local index files into one file, the global index file will be generated.In addition, the content of the global index file includes the data block id (ID), the envelope rectangle (MBR), the total size (Size), and others.

Spatial Range Query
The spatial query operation is closely related to the spatial data storage model and spatial index, and it can be regarded as their reverse process.In the non-indexed Hadoop cloud environment, a spatial query needs to traverse all the spatial information records to match the corresponding results, and for large-scale spatial data, the query efficiency is extremely low.
As shown in Figure 6, this paper combines the spatial range query algorithm into three phases: (a) the Global filtering phase.The content stored in the global index is the basic information of all the data blocks in HDFS.The size of the global index file is relatively small, so this phase is executed at the master node.All query operations need to go through the global filtering phase first and the global index file usually resides in the memory of the master node; (b) the Local filtering stage.The local filtering phase is mainly for filtering the candidate sets in the data blocks from the global filtering results.The local index file is mainly the leaf node of the distributed R-tree stored in the data block header file; (c) the Refining stage.Through step (b), it can get all the leaf nodes in the current data block that intersects the query range.The refining stage mainly matches each spatial geometric object in the candidate set; if the object intersects or is included in the spatial query range, it will be put in the results file, otherwise, it will not perform this operation.

Spatial Range Query
The spatial query operation is closely related to the spatial data storage model and spatial index, and it can be regarded as their reverse process.In the non-indexed Hadoop cloud environment, a spatial query needs to traverse all the spatial information records to match the corresponding results, and for large-scale spatial data, the query efficiency is extremely low.
As shown in Figure 6, this paper combines the spatial range query algorithm into three phases: (a) the Global filtering phase.The content stored in the global index is the basic information of all the data blocks in HDFS.The size of the global index file is relatively small, so this phase is executed at the master node.All query operations need to go through the global filtering phase first and the global index file usually resides in the memory of the master node; (b) the Local filtering stage.The local filtering phase is mainly for filtering the candidate sets in the data blocks from the global filtering results.The local index file is mainly the leaf node of the distributed R-tree stored in the data block header file; (c) the Refining stage.Through step (b), it can get all the leaf nodes in the current data block that intersects the query range.The refining stage mainly matches each spatial geometric object in the candidate set; if the object intersects or is included in the spatial query range, it will be put in the results file, otherwise, it will not perform this operation.

Spatial Range Query
The spatial query operation is closely related to the spatial data storage model and spatial index, and it can be regarded as their reverse process.In the non-indexed Hadoop cloud environment, a spatial query needs to traverse all the spatial information records to match the corresponding results, and for large-scale spatial data, the query efficiency is extremely low.
As shown in Figure 6, this paper combines the spatial range query algorithm into three phases: (a) the Global filtering phase.The content stored in the global index is the basic information of all the data blocks in HDFS.The size of the global index file is relatively small, so this phase is executed at the master node.All query operations need to go through the global filtering phase first and the global index file usually resides in the memory of the master node; (b) the Local filtering stage.The local filtering phase is mainly for filtering the candidate sets in the data blocks from the global filtering results.The local index file is mainly the leaf node of the distributed R-tree stored in the data block header file; (c) the Refining stage.Through step (b), it can get all the leaf nodes in the current data block that intersects the query range.The refining stage mainly matches each spatial geometric object in the candidate set; if the object intersects or is included in the spatial query range, it will be put in the results file, otherwise, it will not perform this operation.

Data Visualization
The tile pyramid model uses "vertical grading, horizontal block" thinking, and the map data are divided into uniform rectangular tiles where the number of levels depends on the map scale, and the tile number is decided by the picture size [25].
The visualization based on the tile pyramid model is developed by employing a three phase technique [26]: (1) The partitioning phase: here, we use the default Hadoop partitioning to split the total input spatial big data into partitions in HDFS; (2) The plotting phase: through the internal loop, all tiles occupied by each spatial object will be plotted in the tile pyramid model; (3) The merging phase: the partial images having the same index code will be combined into one final image.As shown in Figure 7, each tile in the map pyramid model is indexed according to the level, row, and column as the unique identifier.In the parallel algorithm, the construction of each tile can be regarded as a separate sub-task, and the entire tile construction of the pyramid model can be regarded as a collection of multiple subtasks.

Data Visualization
The tile pyramid model uses "vertical grading, horizontal block" thinking, and the map data are divided into uniform rectangular tiles where the number of levels depends on the map scale, and the tile number is decided by the picture size [25].
The visualization based on the tile pyramid model is developed by employing a three phase technique [26]: (1) The partitioning phase: here, we use the default Hadoop partitioning to split the total input spatial big data into partitions in HDFS; (2) The plotting phase: through the internal loop, all tiles occupied by each spatial object will be plotted in the tile pyramid model; (3) The merging phase: the partial images having the same index code will be combined into one final image.As shown in Figure 7, each tile in the map pyramid model is indexed according to the level, row, and column as the unique identifier.In the parallel algorithm, the construction of each tile can be regarded as a separate sub-task, and the entire tile construction of the pyramid model can be regarded as a collection of multiple subtasks.In general, we assume that the enclosing rectangle of the space object shape is mbr, and the maximum and minimum coordinates are (x1, y1) and (x2, y2), respectively.Then, the following formula is used to get the spatial object in the pyramid model.The cover of the map tile (r) column (c) is r  (shape) = ⌈ (mbr 1 − MBR 1 MBR ℎ /2  ⌉ (1) where the [rmin, rmax] and [cmin, cmax] for the space object shape of the map tile correspond to the line number and column number interval, respectively, and MBR is for the entire map.The l is the pyramid level.
In particular, if the map tile size is p and the level has a spatial resolution of r, the spatial resolution of the tile in the x and y directions satisfies the following conditions: In general, we assume that the enclosing rectangle of the space object shape is mbr, and the maximum and minimum coordinates are (x1, y1) and (x2, y2), respectively.Then, the following formula is used to get the spatial object in the pyramid model.The cover of the map tile (r) column (c) is where the [r min , r max ] and [c min , c max ] for the space object shape of the map tile correspond to the line number and column number interval, respectively, and MBR is for the entire map.The l is the pyramid level.
In particular, if the map tile size is p and the level has a spatial resolution of r, the spatial resolution of the tile in the x and y directions satisfies the following conditions:

System Test and Results
In this section, the key technologies mentioned above are tested with the national arable land quality (ALQ) big data.The content of the test in this paper includes three aspects: data conversion, spatial query operation, and visualization performance.The test results will be given bellow.

Data Conversion
The parallel conversion experiment of the vector data mainly includes two aspects.First, the parallel algorithm proposed in this paper is compared with the data transformation algorithm in the ArcGIS software.On the other hand, the parallel algorithm is tested in the cluster environment.The experimental environment in this paper includes the stand-alone and cluster environment.Table 1 configures the information for the experimental environment.Table 2 shows the test data and that the size of the ESRI Shapefile varies from 2 GB to 128 GB.The experimental results are shown in Figures 8 and 9.It can be seen that the conversion time of the two algorithms increases with the growth of the size of the ESRI Shapefile.At the same time, it is fairly clear that the parallel conversion algorithm based on the MapReduce proposed in this paper is more efficient in terms of time.Moreover, with the increase of the number of ESRI Shapefile, the efficiency growth is clearer, approximately 4 to 6 times to the ArcToolBox.The R-tree spatial index quality can be measured by the envelope rectangle of its index tree hierarchies.The calculation formula is as follows, where Area (T) in Equation ( 7) represents the total area covered by the MBR of each index layer of the R-tree.Generally, the less the coverage area is, the better the R-tree index performance.In Equation ( 8), Overlap (T) represents the overlap area between the MBRs of each index layer in the R-tree.The larger the overlap area is, the more space the query path will have, which leads to a greater query time consumption and greatly reduces the efficiency of the data retrieval.Thus, for overlapping areas, the smaller the area is, the better the Rtree performance will be.The R-tree spatial index quality analysis formula is as follows: As shown in Figure 10, spatial coding-based data partition (Hilbert coding-based approach, HCA) shows a better index quality than random sampling for building an R-tree in both Area (T) and Overlay (T).The main reasons include the following two aspects.(1) Due to the randomness of the sample set and lost spatial features.The data partitioning method based spatial-coding can make good use of the adjacent features of spatial coding, which greatly guarantees the spatial distribution of the vector data; (2) the data partitioning method based on spatial sampling.Only the spatial location information of the sample is considered, however, the other one is considered with more Figure 9 shows the time consuming comparison of the data conversion efficiency in the cluster environment.Here, we only test a data volume of 64 GB.Through the test results, we can find that the execution time of the algorithm becomes shorter with the increase of the number of cluster nodes, which fully embodies the advantages of the cloud computing environment.

Index Quality of the R-Tree
The R-tree spatial index quality can be measured by the envelope rectangle of its index tree hierarchies.The calculation formula is as follows, where Area (T) in Equation ( 7) represents the total area covered by the MBR of each index layer of the R-tree.Generally, the less the coverage area is, the better the R-tree index performance.In Equation (8), Overlap (T) represents the overlap area between the MBRs of each index layer in the R-tree.The larger the overlap area is, the more space the query path will have, which leads to a greater query time consumption and greatly reduces the efficiency of the data retrieval.Thus, for overlapping areas, the smaller the area is, the better the R-tree performance will be.The R-tree spatial index quality analysis formula is as follows: As shown in Figure 10, spatial coding-based data partition (Hilbert coding-based approach, HCA) shows a better index quality than random sampling for building an R-tree in both Area (T) and Overlay (T).The main reasons include the following two aspects.(1) Due to the randomness of the sample set and lost spatial features.The data partitioning method based spatial-coding can make good use of the adjacent features of spatial coding, which greatly guarantees the spatial distribution of the vector data; (2) the data partitioning method based on spatial sampling.Only the spatial location information of the sample is considered, however, the other one is considered with more influencing factors, including spatial location information, the number of elements, the amount of data bytes, and so on.In addition, the sampling based method is also due to the randomness of the samples, the partitioning strategy for the same dataset and the same sampling rate are different, and the algorithm lacks stability.influencing factors, including spatial location information, the number of elements, the amount of data bytes, and so on.In addition, the sampling based method is also due to the randomness of the samples, the partitioning strategy for the same dataset and the same sampling rate are different, and the algorithm lacks stability.

Spatial Query Performance
Based on the distributed R-tree, here, we test the query performance of LandQ v2 .We measure the query time consumption in the different numbers of jobs and nodes.
In experiment 1, the result in Figure 11 shows the query performance comparison between the different job numbers with two modes, namely, the non-index and R-tree index.It can be seen that the spatial query execution time for the data increases with the increase of the spatial parallel access times, and there is a linear trend of growth for the non-index model, which is mainly due to traversing all the data blocks regardless of any spatial query.For the R-tree index model, because the search is only executed with the relevant data block, there is no obvious linear growth trend and there is a great relationship with the spatial query itself.In addition, for both models, the more query jobs are queried, the longer it will take.Experiment 2 tests the vector data query efficiencies with spatial index in different cluster nodes.The experimental results are shown in Figure 12.It can be seen that the efficiency of the data query increases with the increase in the number of cluster nodes.At the same time, the parallel query algorithm in this paper can perform well and shows good superiority.

Spatial Query Performance
Based on the distributed R-tree, here, we test the query performance of LandQ v2 .We measure the query time consumption in the different numbers of jobs and nodes.
In experiment 1, the result in Figure 11 shows the query performance comparison between the different job numbers with two modes, namely, the non-index and R-tree index.It can be seen that the spatial query execution time for the data increases with the increase of the spatial parallel access times, and there is a linear trend of growth for the non-index model, which is mainly due to traversing all the data blocks regardless of any spatial query.For the R-tree index model, because the search is only executed with the relevant data block, there is no obvious linear growth trend and there is a great relationship with the spatial query itself.In addition, for both models, the more query jobs are queried, the longer it will take.
influencing factors, including spatial location information, the number of elements, the amount of data bytes, and so on.In addition, the sampling based method is also due to the randomness of the samples, the partitioning strategy for the same dataset and the same sampling rate are different, and the algorithm lacks stability.

Spatial Query Performance
Based on the distributed R-tree, here, we test the query performance of LandQ v2 .We measure the query time consumption in the different numbers of jobs and nodes.
In experiment 1, the result in Figure 11 shows the query performance comparison between the different job numbers with two modes, namely, the non-index and R-tree index.It can be seen that the spatial query execution time for the data increases with the increase of the spatial parallel access times, and there is a linear trend of growth for the non-index model, which is mainly due to traversing all the data blocks regardless of any spatial query.For the R-tree index model, because the search is only executed with the relevant data block, there is no obvious linear growth trend and there is a great relationship with the spatial query itself.In addition, for both models, the more query jobs are queried, the longer it will take.Experiment 2 tests the vector data query efficiencies with spatial index in different cluster nodes.The experimental results are shown in Figure 12.It can be seen that the efficiency of the data query increases with the increase in the number of cluster nodes.At the same time, the parallel query algorithm in this paper can perform well and shows good superiority.Experiment 2 tests the vector data query efficiencies with spatial index in different cluster nodes.The experimental results are shown in Figure 12.It can be seen that the efficiency of the data query increases with the increase in the number of cluster nodes.At the same time, the parallel query algorithm in this paper can perform well and shows good superiority.

Visualization for Arable Land Quality Big Data
Based on the parallel construction algorithm of the tile pyramid model, the provincial and county level of arable land quality (ALQ) big data are respectively visualized.Following the OGC map standard, we construct the map tiles and integrate them with the ESRI API for JavaScript very well.
For the 2013 national and provincial level of arable land quality, there were equal data map elements of the tile pyramid parallel construction.In this paper, we showed the national economic classification, which reflects the economic value of the arable land.The resulting tile details are shown in Table 4.The tile pyramid model has 8 levels of map scale ranging from 1:20 million to 1:31.25 million.The county dataset had a parallel build time of 1 h 15 min 7 s, and the provincial data had a parallel build time of 11 min and 52 s.

Visualization for Arable Land Quality Big Data
Based on the parallel construction algorithm of the tile pyramid model, the provincial and county level of arable land quality (ALQ) big data are respectively visualized.Following the OGC map standard, we construct the map tiles and integrate them with the ESRI API for JavaScript very well.
For the 2013 national and provincial level of arable land quality, there were equal data map elements of the tile pyramid parallel construction.In this paper, we showed the national economic classification, which reflects the economic value of the arable land.The resulting tile details are shown in Table 4.The tile pyramid model has 8 levels of map scale ranging from 1:20 million to 1:31.25 million.The county dataset had a parallel build time of 1 h 15 min 7 s, and the provincial data had a parallel build time of 11 min and 52 s.

Discussion
The increased amount of the arable land quality (ALQ) data has actually brought challenges for processing and managing the vector-based spatial big data, especially in terms of data processing efficiency.Traditional GIS solutions have been unable to meet the needs of the national application.Here, we present a MapReduce-based parallel processing system, LandQ v2 , for national arable land quality (ALQ) big data.LandQ v2 is a complete set of application systems and is integrated into SpatialHadoop.Similarly, the relevant technologies can also be applied to other areas.
In LandQ v2 , four key technologies are designed and implemented.It is based on the cloud storage model of the spatial vector data (GeoCSV).The many shortcomings of the ESRI Shapefile will be addressed and solved in the data storage and process.The batch conversion algorithm with MapReduce improves the efficiency of data processing from the ESRI Shapefile to GeoCSV.Aimed at the problem of data query efficiency, a spatial coding-based approach for partitioning the spatial big data is presented, and then the distributed R-tree index is built for the spatial range query.This method will improve the query performance of the spatial big data, as well as the data balance in HDFS [22].For the visualization of the remote sensing data, the tile pyramid model is a good and mature solution [27,28].In this paper, we use this model to solve the problem of visualizing the vector-based spatial big data.The tile pyramid model-building algorithm is proposed and parallelized with the MapReduce program.

Discussion
The increased amount of the arable land quality (ALQ) data has actually brought challenges for processing and managing the vector-based spatial big data, especially in terms of data processing efficiency.Traditional GIS solutions have been unable to meet the needs of the national application.Here, we present a MapReduce-based parallel processing system, LandQ v2 , for national arable land quality (ALQ) big data.LandQ v2 is a complete set of application systems and is integrated into SpatialHadoop.Similarly, the relevant technologies can also be applied to other areas.
In LandQ v2 , four key technologies are designed and implemented.It is based on the cloud storage model of the spatial vector data (GeoCSV).The many shortcomings of the ESRI Shapefile will be addressed and solved in the data storage and process.The batch conversion algorithm with MapReduce improves the efficiency of data processing from the ESRI Shapefile to GeoCSV.Aimed at the problem of data query efficiency, a spatial coding-based approach for partitioning the spatial big data is presented, and then the distributed R-tree index is built for the spatial range query.This method will improve the query performance of the spatial big data, as well as the data balance in HDFS [22].For the visualization of the remote sensing data, the tile pyramid model is a good and mature solution [27,28].In this paper, we use this model to solve the problem of visualizing the vector-based spatial big data.The tile pyramid model-building algorithm is proposed and parallelized with the MapReduce program.
We tested the above key technologies and system functions.By comparing these features with existing GIS tools, all of the parallel algorithms, including data conversion, data partitioning, distributed R-tree index, data query, and visualization, show good performance and, at the same time, they can take full advantages of the cluster scalability.Compared with LandQ v1 , the visualization of the ALQ dataset in national level makes the macroscopic grasp of data more comprehensive and scientific, which will be useful to policymakers.

Conclusions
Recent advancements in cloud computing technology have offered a potential solution to the big data challenges [29].This paper presents LandQ v2 , which is a MapReduce-based parallel processing system for arable land quality (ALQ) big data that uses the Hadoop cloud computing platform.The contents of system architecture, key technologies, and the tests shown in this study are efficient enough to bring the following remarkable advantages of the developed system: (1) The spatial vector big data cloud storage model, GeoCSV, which is based on the characteristics of spatial vector data ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 3 of 15 characteristics of the distributed storage and parallel programming model.For spatial data processing, the ESRI Geometry API (Application Programming Interface) for Java is used to complete the analysis of simple geometric object elements.In addition, in the data visualization engine, the mainstream ArcGIS Runtime for the WPF (Windows Presentation Foundation) desktop development and ESRI Web API for web development are used to meet the application requirements of the system client.All of these functions supported by LandQ v2 are integrated into SpatialHadoop.

Figure 3 .
Figure 3.The MapReduce-based parallel data conversion algorithm.Figure 3. The MapReduce-based parallel data conversion algorithm.

Figure 3 .
Figure 3.The MapReduce-based parallel data conversion algorithm.Figure 3. The MapReduce-based parallel data conversion algorithm.
summarizes the six steps, which can be done by two MapReduce jobs.(1) Computing the spatial location: the central point is used instead of the spatial objects; (2) Defining the spatial coding matrix (SCM): we define the spatial coding values as a spatial coding matrix; (3) Computing the sensing information set (SIS): the SIS includes the spatial code, location, size, and count; (4) Computing the spatial partitioning matrix (SPM): SPM is built according to the SIS and the blocked size in HDFS; ( summarizes the six steps, which can be done by two MapReduce jobs.(1) Computing the spatial location: the central point is used instead of the spatial objects; (2) Defining the spatial coding matrix (SCM): we define the spatial coding values as a spatial coding matrix; (3) Computing the sensing information set (SIS): the SIS includes the spatial code, location, size, and count; (4) Computing the spatial partitioning matrix (SPM): SPM is built according to the SIS and the blocked size in HDFS; (5) Spatial data partitioning: every spatial object will be matched with the spatial coding and written into the data blocks in HDFS;(6) Allocating the data blocks: all blocks will be distributed into the nodes in the cluster with their spatial codes.

Figure 5 .
Figure 5.The distributed R-tree index.

Figure 6 .
Figure 6.The spatial query steps based on the distributed R-tree index.

Figure 5 .
Figure 5.The distributed R-tree index.

Figure 5 .
Figure 5.The distributed R-tree index.

Figure 6 .
Figure 6.The spatial query steps based on the distributed R-tree index.Figure 6.The spatial query steps based on the distributed R-tree index.

Figure 6 .
Figure 6.The spatial query steps based on the distributed R-tree index.Figure 6.The spatial query steps based on the distributed R-tree index.

Figure 7 .
Figure 7.The schematic diagram of the tile pyramid model.

Figure 7 .
Figure 7.The schematic diagram of the tile pyramid model.

of 15 Figure 8 .Figure 8 .
Figure 8.The time-consuming comparison of data conversion in a stand-alone environment.

Figure 8 .
Figure 8.The time-consuming comparison of data conversion in a stand-alone environment.

Figure 9 .
Figure 9.The time-consuming comparison of data conversion in a cluster environment.

Figure 9 .
Figure 9.The time-consuming comparison of data conversion in a cluster environment.

Figure 10 .
Figure 10.The index quality comparison of the R-tree based on random and HCA.

Figure 11 .
Figure 11.The query performance comparison between the different job numbers.

Figure 10 .
Figure 10.The index quality comparison of the R-tree based on random and HCA.

Figure 10 .
Figure 10.The index quality comparison of the R-tree based on random and HCA.

Figure 11 .
Figure 11.The query performance comparison between the different job numbers.

Figure 11 .
Figure 11.The query performance comparison between the different job numbers.

Figure 13 .
Figure 13.The time-consuming comparison of the tile pyramid construction in a single machine environment.

Figure 14 .
Figure 14.The time-consuming comparison of the tile pyramid construction in a cluster environment.
Figures 15 and 16 show the national arable land quality data global browsing and local browsing diagrams, respectively.In line with the WMS standard, the resulting map tiles are stored in folders with different scales which will be accessed and loaded by the Web client.At the same time, it can integrate well with other layers, such as national administrative boundary, railway, water, and other layers in the client.

Figure 14 .
Figure 14.The time-consuming comparison of the tile pyramid construction in a cluster environment.

Figures 15 and 16 15 Figure 15 .
Figures 15 and 16 show the national arable land quality data global browsing and local browsing diagrams, respectively.In line with the WMS standard, the resulting map tiles are stored in folders with different scales which will be accessed and loaded by the Web client.At the same time, it can integrate well with other layers, such as national administrative boundary, railway, water, and other layers in the client.ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 13 of 15

Figure 15 .
Figure 15.The map tiles of the national arable land quality dataset.

Figure 15 .
Figure 15.The map tiles of the national arable land quality dataset.

Figure 16 .
Figure 16.The schematic diagram of visualization for the national arable land quality dataset.

Figure 16 .
Figure 16.The schematic diagram of visualization for the national arable land quality dataset.

Table 1 .
The test environment for the data conversion algorithm.

Table 2 .
The datasets for the parallel conversion test.

Table 4 .
The datasets for the parallel conversion test.

Table 4 .
The datasets for the parallel conversion test.