Parallel Spatial-Data Conversion Engine: Enabling Fast Sharing of Massive Geospatial Data

Abstract: Large-scale geospatial data have accumulated worldwide over the past decades. However, the variety of data formats often creates a data sharing problem in the geographical information system (GIS) community. Among the various methodologies proposed in the past, geospatial data conversion has always served as a fundamental and efficient way of sharing geospatial data. However, these methodologies are beginning to fail as data volumes increase. This study proposes a parallel spatial data conversion engine (PSCE) with a symmetric mechanism that achieves efficient sharing of massive geodata by utilizing high-performance computing technology. The engine is designed as an extendable and flexible framework in which the methods of reading and writing particular spatial data formats can be customized. A dynamic task scheduling strategy based on a feature computing index is introduced in the framework to improve load balancing and performance. An experiment is performed to validate the engine framework and its performance. In this experiment, geospatial data stored in the vector spatial data format defined in the Chinese Geospatial Data Transfer Format Standard are converted on a parallel file system (Lustre cluster). Results show that the PSCE has a reliable architecture that can quickly cope with massive spatial datasets.


Introduction
The current tools and equipment for capturing geospatial data at both mega and milli scales are insufficient. Spatiotemporal data acquired through remote sensors (e.g., remote sensing images), widespread location-aware mobile devices, and large-scale simulations (e.g., climate data) have always been "big" [1][2][3]. Furthermore, considerable data have accumulated via the geographical information systems (GIS) of different institutions, organizations, and communities worldwide. However, reusing existing spatial data for new applications remains challenging [4,5].
The complexity of data sharing is believed to originate from two main characteristics of data sources, namely, distribution and heterogeneity [6][7][8]. Heterogeneity problems, including syntactic, schematic, and semantic heterogeneity, can be very complicated [6,8,9]. The remainder of this paper is structured as follows. Related works on spatial data conversion techniques and the problems they encounter are presented in Section 2. The framework and architecture of the proposed PSCE are discussed in detail in Section 3. A use case, which converts a large dataset in the Chinese standard vector geo-spatial data transfer (VCT) format into the GeoJSON format, is utilized to demonstrate the framework in Section 4. Finally, conclusions are drawn in Section 5.

Spatial Data Conversion Techniques
Spatial data conversion techniques, which mainly aim to read spatial data from one system or format and then write them to another, have existed for a long time and are well documented in the literature [8,9,14]. However, the key to the conversion process lies in the spatial models (the ways people view the geospatial world) behind the spatial datasets, which largely determine the quality of the outcome. In other words, if the two underlying models are compatible or even identical, the conversion may be satisfactory; otherwise, it may be poor.
However, the conversion engine must observe several rules to obtain satisfactory results. The first rule concerns geometric data: no geometries shall be lost or distorted unless the destination format does not support the type in question. Second, the attributes associated with the geometries shall be considered and matched between the source and the destination. Other important information includes the spatial relationships between features, the spatial reference, metadata, and symbol styles. An ideal conversion engine should also provide an expandable and customizable way to handle all these aspects.
Spatial data conversion techniques have gone through three generations. Converting spatial data from system A directly into system B was the first attempt. Because the transfer program is designed for specific formats, it offers fine-grained control and is usually fast.
However, the file formats at both ends must be fully accessible and understood, which is sometimes difficult or even impossible, especially when a data format is proprietary to an organization. Furthermore, developing transfers between every pair of data formats is laborious. Even after they are developed, all the transfers related to a format must be revised accordingly whenever that format evolves into a new version.
Therefore, converting spatial data directly from format to format is inefficient at scale. Instead, an exchange format can be created to reduce the workload: data are transferred first into the exchange format and then to the destination format [13,27]. Many exchange formats exist, such as commercial releases (e.g., DXF for AutoCAD, MIF for MapInfo, and E00 for Arc/Info) and national standard exchange formats from all over the world (e.g., the Spatial Data Transfer Standard by the American National Standards Institute [11], the Chinese National Geo-spatial Data Transfer Format [13], and the Standard Procedure and Data Format for Digital Mapping in Japan [28]). However, exchange-format methods may produce many redundant datasets, and the quality of the conversion largely depends on the compatibility between the exchange format and the two ends. Moreover, so many exchange formats exist worldwide that no single one can serve as a universal intermediary.
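The workload reduction an exchange format brings can be made concrete with a quick count. The sketch below is the standard n(n-1) versus 2n argument, not a figure from this paper: direct conversion needs a translator for every ordered pair of formats, whereas a shared exchange format needs only one translator into and one out of the hub per format.

```python
# Translator counts under the two strategies; function names are
# illustrative, not part of any engine API.

def direct_converters(n_formats: int) -> int:
    """Translators needed when every format converts directly to every other."""
    return n_formats * (n_formats - 1)


def exchange_converters(n_formats: int) -> int:
    """Translators needed when all formats pass through one exchange format."""
    return 2 * n_formats
```

With ten formats, the direct approach requires 90 translators while the exchange-format approach requires only 20, which is why the second generation reduced development effort despite its compatibility drawbacks.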
The third generation of spatial data conversion technology shares a common interface library. The library mediates the conversion process, interpreting data from the source and serializing them into the destination, thereby enabling direct transfer between two data formats. The entire process, from reading and transferring to writing, is flexible and controllable, so specific adjustments can be made for particular data formats. Furthermore, the architecture of the conversion engine is scalable: a new data format can simply be plugged into the architecture without reconstructing the engine itself or its other parts. GDAL/OGR, an open-source translator library for geospatial data formats, and FME, a spatial data transformation software package, are two well-known examples of this approach.

Engine Framework
The PSCE, which evolved from the common interface library methodology, is a framework for massive spatial data sharing. It provides an infrastructure for data exchange among different spatial data formats in high-performance computing (HPC) environments and aims to simplify access to different data sources for users working on large computing clusters. The framework was designed to meet the following six requirements for converting massive spatial data:
1. Effectiveness: reliability of data conversion and the ability to reduce information loss;
2. Efficiency: the ability to handle massive data conversion quickly in an HPC environment;
3. Expandability: a new data format can easily be plugged into the framework and participate in the conversion;
4. Concurrency: the potential to run on a large computing cluster concurrently;
5. Independence: formats are independent and can act separately;
6. Transparency: users should not see the complexities associated with data conversion.

Architecture
An abstract standard model for spatial data, serving as a common interface, is the first step toward a data conversion engine. The PSCE uses the OpenGIS standard [29,30] as the common data model to understand geospatial data in various formats. An overview of the common interface set is shown in Figure 1. Each data format in the PSCE should implement its own provider, a specific way of loading or unloading data in that format. The extendable architecture of the PSCE is symmetric, as shown in Figure 2. It mainly comprises three parts, namely, the TRANSFER ENGINE, the PROVIDER HARBOR, and the PROVIDERs. Each PROVIDER is a customized wrapper for a particular spatial data format that loads and unloads the spatial data and bridges the given data format and the abstract spatial data model.

The conversion itself, which includes at least three aspects, namely, geometry transfer, attribute mapping, and metadata conveyance, is performed by the TRANSFER ENGINE (TE). Geometry transfer can be difficult because many spatial data models exist and are never perfectly compatible with one another. For example, the vector data format of the Chinese National Geo-spatial Data Transfer Format defines a type of line called a Circular Arc, which is described by only three points; it has no counterpart in OpenGIS SFA (OpenGIS Simple Feature Access) and can only be approximated via a LineString. The attribute information attached to the geometry is maintained in two lists (one for the origin and one for the destination). Before conversion starts, a mapping between the two is established to identify which information should be delivered to the destination, thereby preventing mismatches.
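The symmetric PROVIDER / PROVIDER HARBOR / TRANSFER ENGINE arrangement can be sketched in a few classes. This is a minimal illustration under stated assumptions, not the engine's actual API: the method names (`read_features`, `write_features`, `register`, `get`, `convert`) are hypothetical.

```python
# Sketch of the symmetric provider architecture: each format implements
# one Provider, providers are registered in a harbor, and the transfer
# engine pipes features from a source provider to a destination provider.
from abc import ABC, abstractmethod


class Provider(ABC):
    """Wrapper that loads and unloads one spatial data format."""

    @abstractmethod
    def read_features(self, path):
        """Yield features as (geometry, attributes) pairs in the common model."""

    @abstractmethod
    def write_features(self, path, features):
        """Serialize common-model features into this format."""


class ProviderHarbor:
    """Registry mapping a format name to its Provider."""

    def __init__(self):
        self._providers = {}

    def register(self, fmt, provider):
        self._providers[fmt] = provider

    def get(self, fmt):
        return self._providers[fmt]


class TransferEngine:
    """Reads via the source provider and writes via the destination provider."""

    def __init__(self, harbor):
        self.harbor = harbor

    def convert(self, src_fmt, src_path, dst_fmt, dst_path):
        src = self.harbor.get(src_fmt)
        dst = self.harbor.get(dst_fmt)
        dst.write_features(dst_path, src.read_features(src_path))
```

The symmetry lies in the fact that every format participates in exactly the same way on either side of the engine: adding a format means writing one Provider and registering it, with no change to the engine itself.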

Task Scheduler
The PSCE utilizes parallel computing technology and employs a manager-worker paradigm with two kinds of entities, a manager and multiple workers. The manager decomposes the problem logically into small tasks, distributes the tasks among a set of workers, and gathers the computing statistics from each worker. The workers, in turn, act in a very simple cycle: they obtain a task instruction from the manager, process the task, and send the computing statistics back to the manager.
The manager decomposes the entire problem according to a domain partitioning strategy that treats one Feature object as the minimum unit of typical geospatial data. By dividing the input spatial data into parts without any dependencies, the partitioning strategy achieves one of the most important goals: decoupling the processes. Communication or collaboration between workers is therefore unnecessary, which greatly simplifies the relationships between processes and enables high computational speedups.
A dynamic load-balancing strategy is used to allocate tasks to worker processes, enabling the engine to adapt to changing system conditions. A first-in-first-out (FIFO) queue of idle workers and a last-in-first-out (LIFO) stack of barrels to be done (called the cellar) are maintained. The PSCE pairs one idle worker from the worker queue with one barrel from the cellar to form a task instruction and sends the instruction to the worker; this repeats until the cellar is empty. The workers receive the task instructions, fulfill the tasks, and report feedback to the manager. The cycle continues until all the spatial data are converted, as shown in Figure 3. This dynamic load-balancing strategy gives the PSCE robust performance, allowing it to respond appropriately to the failure of individual processors.
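The pairing cycle above can be simulated in a single process. The real engine dispatches tasks over MPI across cluster nodes; this serial sketch (with an assumed `schedule` function and an injected `process` callback) only illustrates the FIFO idle-worker queue and the LIFO barrel cellar.

```python
# Serial simulation of the dynamic scheduling cycle: the manager pairs
# the longest-idle worker with the most recently stacked barrel until
# the cellar is empty.
from collections import deque


def schedule(workers, barrels, process):
    """Pair idle workers with barrels until the cellar is empty;
    return the sequence of (worker, barrel) assignments."""
    idle = deque(workers)        # FIFO queue of idle workers
    cellar = list(barrels)       # LIFO stack of pending barrels
    assignments = []
    while cellar:
        worker = idle.popleft()  # take the longest-idle worker
        barrel = cellar.pop()    # take the top barrel from the cellar
        process(worker, barrel)  # the worker fulfils the task ...
        idle.append(worker)      # ... reports back, and re-queues
        assignments.append((worker, barrel))
    return assignments
```

Because workers re-enter the queue only after finishing, a slow worker naturally receives fewer barrels, which is the self-balancing property the text attributes to the scheduler.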



Domain Partitioning Strategy
Geospatial data conversion is a typical data-intensive problem [3,25], and most of the processing time is spent on the data themselves. Load balance is important because the slowest worker determines the overall performance if all workers are subject to a barrier synchronization point. Assigning the same number of Feature objects to each partition (EFC, Equal Feature Count) may not be a good choice because the data size of a single Feature object can vary over a large range due to the complexity of the geometry within. The attributes, by contrast, have the same data size because each Feature in a FeatureCollection shares the same schema. Thus, the PSCE proposes a feature computing index (FCI) as an estimate of the actual workload of each Feature object. The FCI of a Feature object can be described as follows:

F_i = G_i + Atr, (1)

where Atr is the total number of bytes of the attribute data divided by the number of bytes of one POINT, G_i is the total number of POINTs in the geometry contained in Feature i, and F_i is the feature computing index of Feature object i.
The PSCE first examines all the spatial data, builds a computing index for each Feature object, calculates their volumes, and then places them into barrels by Feature ID, so that each barrel carries a comparable amount of computation. The number of Feature objects in one barrel may nevertheless vary from a dozen to a thousand, according to the following formula:

V_0 = Σ_{i=1}^{n} F_i = Σ_{i=1}^{n} (G_i + Atr), (2)

where n is the number of Feature objects in the barrel, V_0 is the total volume of the given barrel, and Atr, G_i, and F_i have the same meanings as in Equation (1). Each barrel, taken as a single task, therefore has a similar FCI and thus a similar workload.

However, the efficiency of massive spatial data conversion is severely limited by the relatively low I/O performance of most current computing platforms [31]. The PSCE utilizes a parallel file system, Lustre, to manage large spatial data files and takes advantage of its powerful I/O capacity to improve overall performance. All spatial data files are arranged in the Lustre file system, and each data format provider is implemented with MPI-IO support to achieve parallel access to large spatial datasets. The manager process in the PSCE does not read or write spatial data directly; it only fetches certain metadata to make informed decisions, whereas the workers communicate with the I/O scheduler of the file system directly. The messages between the manager and workers are only instructions or reports of small data size, so communication does not become a bottleneck, as shown in Figure 4.
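The indexing-and-packing step can be sketched as follows. The per-feature workload estimate combines the point count with the attribute size normalized by the size of one point, as the FCI definition above describes; the constant `POINT_BYTES`, the function names, and the sequential packing heuristic are illustrative assumptions, not the engine's exact procedure.

```python
# Sketch of FCI-based partitioning: estimate each feature's workload,
# then pack features into barrels of roughly equal total FCI.
POINT_BYTES = 16  # assumed size in bytes of one coordinate pair


def fci(point_count, attr_bytes):
    """Estimate the workload of one Feature object."""
    return point_count + attr_bytes / POINT_BYTES


def pack_barrels(features, target_volume):
    """Group (feature_id, point_count, attr_bytes) records into barrels
    whose total FCI is close to target_volume."""
    barrels, current, volume = [], [], 0.0
    for feature_id, points, attr_bytes in features:
        current.append(feature_id)
        volume += fci(points, attr_bytes)
        if volume >= target_volume:  # barrel is full: close it
            barrels.append(current)
            current, volume = [], 0.0
    if current:                      # flush the last, partly filled barrel
        barrels.append(current)
    return barrels
```

A barrel holding one huge polygon and a barrel holding hundreds of tiny ones end up with similar total FCI, which is exactly the property that makes each barrel a task of comparable cost.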

Use Case
To validate the framework and evaluate the performance of the PSCE described in the previous sections, a benchmark experiment was conducted. The experiment was performed on a Linux cluster (with one metadata target (MDT) and nine object storage targets (OSTs)) to import a large VCT dataset into the GeoJSON format. The GeoJSON specification is quite similar to the geometric hierarchical structure designed in the OpenGIS SFA; therefore, writing a GeoJSON Provider is theoretically easy.
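The writing side of such a provider can be sketched in a few lines: one feature in the common model becomes one GeoJSON Feature object. The function names here are hypothetical; the output structure itself follows the GeoJSON specification (RFC 7946).

```python
# Minimal sketch of a GeoJSON provider's writing path: map common-model
# parts onto the Feature / FeatureCollection structure of GeoJSON.
import json


def feature_to_geojson(geometry_type, coordinates, attributes):
    """Build a GeoJSON Feature dict from common-model parts."""
    return {
        "type": "Feature",
        "geometry": {"type": geometry_type, "coordinates": coordinates},
        "properties": attributes,
    }


def collection_to_geojson(features):
    """Serialize a list of Feature dicts as a FeatureCollection string."""
    return json.dumps({"type": "FeatureCollection", "features": features})
```

The near one-to-one mapping between SFA geometry classes and GeoJSON geometry types (Point, LineString, Polygon, and their Multi variants) is what makes this provider straightforward compared with the VCT side.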
However, a VCT file can be complicated because points, lines, and polygons are stored alongside annotations and symbol styles. Moreover, four types of points, eight types of lines, five types of polygons, and even solid (3D) data can exist within the file. This work presents how the PSCE understands the VCT format, paying close attention to the core geographical information, geometric information, and linked attributes.

Figure 4. Communication and data stream in the PSCE. Assuming that 16 barrels and four workers exist, the manager process dynamically assigns the tasks to the workers and prints the reports to logs. Each worker then directly reads and writes data from files or a database.


VCT Provider
The first step in supporting the VCT format is to create a new Provider for it, following the common interface and registering it in the ProviderHarbor. In the PSCE, each VCT document is referred to as a Datasource and may contain several FeatureCollections whose geometry types vary from point to polygon. The metadata enclosed between <vectorMetadata> and </vectorMetadata> at the head of a VCT file holds the geographic coordinate system information and is parsed to instantiate the SpatialReference associated with the Datasource.
After the <vectorMetadata> part, the file continues, in order, with the <featureCodes>, <tableStructures>, <pointFeatures>, <curveFeatures>, <polygonFeatures>, and <attributeTables> parts, in which the actual spatial data are described. The PSCE regards every <featureCode> entry as a FeatureCollection. For example, a FeatureCollection named bont_L, which has a geometry type of line, is defined in Figure 5, and the curves belonging to bont_L appear in the later <curveFeatures> part. The <tableStructures> part presents all the attribute tables and supplies the FeatureDefn and FieldDefn definitions that describe the structure of each FeatureCollection. With this structural knowledge, the PSCE can interpret the three data parts that follow.
Figure 5. Sample data to illustrate the vector geo-spatial data transfer (VCT) data structure. In this example, a FeatureCollection called bont_L with its attribute table Tbont_L is defined.
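A first parsing pass along these lines can be sketched as below. Real VCT files carry far richer syntax than this, so the sketch only illustrates how a provider might locate the tagged sections in the order described; the function `split_sections` is hypothetical.

```python
# Illustrative extraction of the tagged parts of a VCT-like document,
# in the order the text describes. A provider would read the
# <tableStructures> body first to learn each table's schema before
# interpreting the three data parts.
import re

VCT_SECTIONS = (
    "vectorMetadata", "featureCodes", "tableStructures",
    "pointFeatures", "curveFeatures", "polygonFeatures", "attributeTables",
)


def split_sections(text):
    """Return a dict mapping each section name present in `text`
    to its raw body, stripped of surrounding whitespace."""
    sections = {}
    for name in VCT_SECTIONS:
        match = re.search(
            r"<{0}>(.*?)</{0}>".format(name), text, flags=re.DOTALL)
        if match:
            sections[name] = match.group(1).strip()
    return sections
```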


Geometric Transfer
The VCT spatial data model was not built on the OpenGIS SFA standard; therefore, compatibility problems exist between them, and the two are not always easy to fuse. The transfer tables for geometries between VCT and SFA are shown in Table 1, Table 2, and Table 3 for points, lines, and polygons, respectively. The points defined in a VCT file are quite simple and can be directly converted to POINT or MULTIPOINT in the SFA, as shown in Table 1. For the lines in VCT, the polyline (CODE 11) is quite similar to the LINESTRING in the SFA, as is the integrated line (CODE 100) to the MULTILINESTRING, so direct conversion can be adopted. For the remaining line types, the following formula is employed for resampling and optimal approximation (illustrated in Figure 6):

X_j < X < X_{j+1}, j ∈ {1, 2, 3, 4, 5, 6}, X_j, X_{j+1} ∈ {Point | CURVE(X, Y) = 0}, (3)

where CURVE(X, Y) = 0 is the function describing the curve defined in the VCT file; {Point | CURVE(X, Y) = 0} denotes the existing points provided in the definition of the curve (for line types 13 and 14, the start and end points are used); and X_j and X_{j+1} are two adjacent points along the direction of increasing X.
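As a concrete instance of such resampling, a Circular Arc given by three points (the VCT example mentioned earlier) can be approximated by a LineString: recover the circumscribed circle of the three points, then sample the arc at evenly spaced angles. This is a sketch of one plausible approach; the function names and the sample count `n` are assumptions, not the engine's actual routine.

```python
# Approximate a three-point circular arc with a polyline on its
# circumcircle: find the circle through a, b, c, then sweep from a to c
# through b, emitting n+1 evenly spaced points.
import math


def circumcircle(a, b, c):
    """Return (center, radius) of the circle through three points."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)


def arc_to_linestring(a, b, c, n=16):
    """Approximate the arc from a through b to c by n+1 points."""
    (ux, uy), r = circumcircle(a, b, c)
    t0 = math.atan2(a[1] - uy, a[0] - ux)
    t2 = math.atan2(c[1] - uy, c[0] - ux)
    # orientation of the sweep: counterclockwise iff a, b, c turn left
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    sweep = t2 - t0
    if cross > 0:
        while sweep <= 0:
            sweep += 2 * math.pi
    else:
        while sweep >= 0:
            sweep -= 2 * math.pi
    return [(ux + r * math.cos(t0 + sweep * i / n),
             uy + r * math.sin(t0 + sweep * i / n)) for i in range(n + 1)]
```

Increasing `n` trades output size for fidelity, which is the same trade-off the resampling formula above expresses for the other curved line types.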
Polygons exhibit a similar situation. The polygons (CODE 21) and integrated polygons (CODE 22) in VCT correspond nearly one-to-one to the POLYGON and MULTIPOLYGON in SFA and are thus suitable for a straight transfer. By contrast, VCT polygon types 12, 13, and 14 must first be resampled and converted into a LINEARRING before the polygon is built in SFA, with the following formula:

θ_i < θ < θ_{i+1}, θ_i, θ_{i+1} ∈ {θ | CURVE(R, θ) = 0}, (4)

where R is the radius, θ is the angle, and CURVE(R, θ) = 0 is the function describing the polygon curve defined in the VCT file.

Performance Evaluation
Experiments were conducted to demonstrate the framework and evaluate the performance of the PSCE. The test data comprised 420,000 polygons generated in a heavily uneven distribution with a feature-size standard deviation (STD) of 99.30, as shown in Table 4. The smallest polygon contains only three points with an attribute size of 1.11 KB, whereas the largest feature is 401.18 KB. The data were then partitioned and converted with the EFC and FCI methods, respectively. Table 4 indicates that the EFC method, whose partitions contain the same number of features (5000 each), results in a heavily uneven distribution, with a minimum partition size of 3.60 MB and a maximum of 1876.83 MB. Meanwhile, the FCI method achieves a much more even partition, with a data-size STD of only 6.29.

The experiments were conducted on a Lustre cluster with one metadata target (MDT) and eight object storage targets (OSTs). Each node has eight Intel(R) Xeon(R) CPU E5-2603 cores and 16 GB of memory. Both the EFC and FCI methods invoked eight workers and one manager, one process per node; the manager partitions the data and distributes the tasks among the workers, which process their parts until all data are converted. Table 5 shows the benchmark results. Given that the PSCE is an I/O-intensive algorithm and the EFC partitioned the data very unevenly, the per-partition performance under EFC was also very diverse, from a minimum time of 1.53 s to a maximum of 2269.23 s. By contrast, the FCI exhibited quite stable performance, with an average time of 155.46 s and an STD of 23.75. Overall, FCI achieved a speedup of 6.99, whereas EFC achieved only 3.30 (as shown in Figure 7).
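A toy model, not a measurement from the paper, illustrates why uneven partitions cap the speedup: under barrier synchronization, each round of tasks costs as much as its slowest partition. The round-robin task model and function name below are simplifying assumptions.

```python
# Toy speedup model under barrier synchronization: workers take one
# partition per round, and every round lasts as long as its slowest
# partition. Sequential time is simply the sum of all partition times.
def barrier_speedup(partition_times, n_workers):
    """Speedup of n_workers over sequential execution when workers
    synchronize at a barrier after each round of partitions."""
    rounds = [partition_times[i:i + n_workers]
              for i in range(0, len(partition_times), n_workers)]
    parallel = sum(max(r) for r in rounds)   # slowest task bounds each round
    return sum(partition_times) / parallel
```

With four equal partitions on four workers the speedup is the ideal 4.0, but a single oversized partition (for example, times 1, 1, 1, 37) drags it down toward 1, mirroring how EFC's 1876.83 MB outlier partition limited its measured speedup relative to FCI's balanced barrels.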

Conclusions
In the era of big data, large amounts of geospatial data have accumulated in isolated systems throughout the world, and large-scale geospatial data sharing between organizations in the GIS community has always been a challenge. Traditional spatial data conversion methods running on a single computer are increasingly becoming bottlenecks for data sharing, even as multicore CPUs and multi-node computing clusters become widely available. Therefore, a new approach that enables massive spatial data sharing by utilizing parallel and high-performance computing techniques is critical.
The PSCE presented in this paper is a generic framework for massive spatial data sharing, built upon large computing clusters to enhance the speed and reliability of spatial data sharing. The framework utilizes a common interface based on the OpenGIS Simple Feature Model, which bridges any two spatial data formats. The architecture is flexible and extendable, and every data format can have a customized way of reading and writing its spatial data.
The PSCE can handle large datasets because it is designed to run on high-performance computing clusters with the support of a parallel file system or a distributed spatial database. Hence, it achieved performance unmatched by sequential approaches. The dynamic task scheduling strategy based on the FCI improves load balancing, allowing the PSCE to cope with massive spatial datasets in a fast and stable manner.
However, as the number of worker processes increases, the PSCE encounters its I/O limitations, and the speedup declines. In future studies, we will try to improve geospatial data distribution in the computing cluster to break through the I/O bottleneck. We will also improve our method by adopting technologies such as temporal graph convolutional networks [32] and deep inference networks [33] to predict performance bottlenecks in applications.
