A CUDA-Based Parallel Geographically Weighted Regression for Large-Scale Geographic Data

Abstract: Geographically weighted regression (GWR) introduces a distance-weighted kernel function to examine the non-stationarity of geographical phenomena and improve the performance of global regression. However, GWR calibration becomes computationally prohibitive when a serial computing mode is used to process large volumes of data. To address this problem, an improved approach based on the compute unified device architecture (CUDA) parallel architecture, fast-parallel-GWR (FPGWR), is proposed in this paper to efficiently handle the computational demands of performing GWR over millions of data points. FPGWR is capable of decomposing the serial process into parallel atomic modules and optimizing the memory usage. To verify the computing capability of FPGWR, we designed simulation datasets and performed the corresponding testing experiments. We also compared the performance of FPGWR and other GWR software packages using open datasets. The results show that the runtime of FPGWR is negatively correlated with the CUDA core number, and the calculation efficiency of FPGWR is thousands or even tens of thousands of times higher than that of traditional GWR algorithms. FPGWR provides an effective tool for exploring spatial heterogeneity in large-scale geographic data (geodata).


Introduction
Large-scale geodata is currently a topic of considerable attention in many research fields, including mobile communication [1], public transportation [2], medical health [3], Earth observation [4], and climate monitoring [5].
To enhance the capability of analyzing massive geodata, geographic knowledge mining is turning to data-driven patterns [6]. Distributed systems and parallel computing are two feasible technologies for solving the problem of massive geodata analysis. A tremendous amount of multisource geodata is stored in distributed spatial index systems [7], enabling people to access records efficiently. Exploiting the advantages of the distributed system Hadoop, Aji et al. (2019) [8] proposed a scalable high-performance spatial data warehousing system (Hadoop-GIS) that can meet the needs of managing and querying massive geodata. Furthermore, based on the MapReduce parallel computing framework and the HadoopBase database (HBase) technology, the origin-destination (OD) estimation method [9] can efficiently manage massive bus travel data and directly reckon the origins and destinations of travel for bus passengers. In the parallel computing field, large-scale geodata can be partitioned into multiple data pieces using the strategies of multiple instruction multiple data (MIMD) and single instruction multiple data (SIMD); MIMD handles multiple instruction streams simultaneously, whereas SIMD applies a single instruction stream to multiple data elements. Several environments, such as clusters, are available to parallelize multiple tasks.

As a representative model of local regression, GWR incorporates all of the observations (samples) into the loop of the regression sequence. The key to geographic weighting is the calculation of distance weights for each sample, which causes costly complexity in terms of runtime and memory. At the same time, the entire process consumes a large amount of computing time because the weight calibrator participates in multilayer loops. For large-scale geodata, GWR needs to go through two levels of large-cycle iteration: the outer iteration is responsible for point-by-point regression, and the inner iteration performs matrix calculations between a single sample and the full sample set. Therefore, limited by its data structure and operating mode, GWR is less effective in addressing large-scale geodata. Concurrency methods can improve the efficiency of geographic analysis tools through software optimization, but a hardware parallel environment obtains native support and achieves the best acceleration performance. Both FastGWR and Spark-GWR can divide GWR into several parallel task sets, but these two parallel programs are designed for the CPU architecture and cannot be adapted to the GPU architecture. FPGWR decomposes large-scale GWR into simpler parallelizable computing units via an atomization algorithm and processes them with numerous parallel GPU cores.
In this paper, we develop FPGWR to reduce the computational complexity of the GWR process and enable GWR applications on millions or even tens of millions of geodata records. This technique significantly improves regression efficiency by parallelizing large tasks. On the basis of the CUDA framework, atomic subtasks that are decomposed from large tasks can run on a GPU device in parallel mode. This paper contributes to the prior literature as follows. (1) FPGWR can compensate for the deficiencies of GWR in undertaking regression computation for large-scale geodata, and FPGWR with separate atomic computing units (atomization) is more efficient than GWR.
(2) FPGWR is a powerful model for exploring spatial heterogeneity and incorporating high parallelism into geographic analysis, which is applicable to studies in various fields, such as economic geography, social science, and public health. (3) The improvement from GWR to FPGWR can provide new insights into geospatial computing from spatial and computational perspectives.

GWR Review
Before the 1980s, ordinary linear regression (OLR) was frequently applied to the analysis of geographical phenomena. The predictive coefficients β, calculated by the ordinary least squares (OLS) estimator, abide by the rule of globally optimal unbiased estimation. The final regression result merely reflects the average level over the study region. It is illegitimate to apply global regression methods to a local regression model. Therefore, Foster et al. (1986) [28] created a spatial adaptive filter (SAF) learning from varying-coefficient modeling, which can automatically describe step-jump and continuous spatial non-stationarity in the coefficients. Based on the local polynomial smoothing technique, Brunsdon et al. (1996) [17] proposed the analysis tool of GWR.

GWR Model
The GWR model extends OLR by introducing the location factor to express the spatial variation of the coefficients. In other words, we have the following:

$$y_i = \beta_0(u_i, v_i) + \sum_{m=1}^{p} \beta_m(u_i, v_i)\, x_{im} + \varepsilon_i, \quad i = 1, 2, \cdots, n \quad (1)$$

where $y_i$ is the regression variable (dependent variable) at location $i$, $(u_i, v_i)$ represents the coordinates (usually latitude and longitude) of the $i$th sample point in the study area, $\beta_m(u_i, v_i)$ denotes the $m$th coefficient of the $i$th sample point based on a function with independent variables $u_i$ and $v_i$, $x_{im}$ expresses the $m$th predictor variable (independent variable), $\varepsilon_i$ represents the error term, and $n$ is the sample size. The necessary conditions for Equation (1) can be expressed as follows:

$$E(\varepsilon_i) = 0, \quad Var(\varepsilon_i) = \sigma^2, \quad Cov\left(\varepsilon_i, \varepsilon_j\right) = 0 \ (i \neq j) \quad (2)$$

For simplicity, Equation (1) is abbreviated as

$$y_i = \beta_{i0} + \sum_{m=1}^{p} \beta_{im}\, x_{im} + \varepsilon_i \quad (3)$$

To prevent GWR from degenerating into a general linear regression, the condition $\beta_{1m} = \beta_{2m} = \cdots = \beta_{nm}$ must not hold in the preconditions.
The variables related to GWR can be defined in matrix form. The independent variable matrix $X$ can be written as follows:

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \quad (4)$$

Spatial Weight Kernel Function
There are $n$ terms of spatial weight $w_{ij}$ between two sample points ($i = j$ is allowed) in the study area. In the GWR model, the weight matrix $W_i$ is usually denoted as a diagonal square matrix:

$$W_i = \mathrm{diag}\left(w_{i1}, w_{i2}, \cdots, w_{in}\right) \quad (5)$$

At present, there are several forms of the weight kernel function $w_{ij}$, of which Bi-Square and Gaussian are the most used. The two functions can be expressed as Equations (6) and (7):

$$\text{Bi-Square}: \quad w_{ij} = \begin{cases} \left(1 - \left(d_{ij}/bw\right)^2\right)^2, & d_{ij} < bw \\ 0, & d_{ij} \geq bw \end{cases} \quad (6)$$

$$\text{Gaussian}: \quad w_{ij} = \exp\left(-\frac{1}{2}\left(\frac{d_{ij}}{bw}\right)^2\right) \quad (7)$$

where $d_{ij}$ represents the distance between two sample points ($i$ and $j$), and $bw$ denotes the bandwidth parameter, which can be interpreted both as the neighbor threshold and as the distance attenuation factor within the weight kernel function.
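As an illustration, the two kernels map directly onto branch-light CUDA device functions. The sketch below is our own (not the released FPGWR code); it assumes Euclidean distances and single-precision arithmetic:

```cuda
#include <math.h>

// Bi-Square kernel, Equation (6): weight decays to zero at the bandwidth.
// Written branchlessly (fmaxf) so that threads of a WARP do not diverge.
__device__ float bisquare_weight(float d, float bw)
{
    float t = 1.0f - (d / bw) * (d / bw);
    t = fmaxf(t, 0.0f);          // d >= bw  ->  weight 0
    return t * t;
}

// Gaussian kernel, Equation (7): strictly positive, attenuates with distance.
__device__ float gaussian_weight(float d, float bw)
{
    float r = d / bw;
    return expf(-0.5f * r * r);
}

// Euclidean distance d_ij between samples i and j from flat coordinate arrays.
__device__ float sample_distance(const float *u, const float *v, int i, int j)
{
    float du = u[i] - u[j];
    float dv = v[i] - v[j];
    return sqrtf(du * du + dv * dv);
}
```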

Model Regression
The regression coefficient estimate $\hat{\beta}_i$ at position $i$ is defined by

$$\hat{\beta}_i = \left(X^T W_i X\right)^{-1} X^T W_i Y \quad (8)$$

The regression value $\hat{Y}_i$ of the regression point $i$ based on $\hat{\beta}_i$ can be estimated from

$$\hat{Y}_i = X_i \hat{\beta}_i \quad (9)$$

where $X_i$ represents the $i$th row vector in matrix $X$. The hat matrix plays a very important role in the residual analysis of the linear regression model. This study introduces the hat matrix $S$ into GWR. The $i$th row of matrix $S$ can be expressed as follows:

$$S_i = X_i \left(X^T W_i X\right)^{-1} X^T W_i \quad (10)$$

The regression result matrix $\hat{Y}$ can be represented with the hat matrix $S$:

$$\hat{Y} = S\, Y \quad (11)$$

The Criteria of Optimal Bandwidth Selection
The key to discovering the optimal bandwidth $bw$ is minimizing the $AIC_c$ score. Loop selection and golden-section selection methods are available to obtain the lowest $AIC_c$ value. Searching for the optimal bandwidth $bw$ is inseparable from the parameter estimation criterion. The criterion $AIC_c$ [29] was introduced by Brunsdon et al. (2002) [30] to select the optimal bandwidth of the weight function. The specific formula can be expressed as

$$AIC_c = 2n\ln(\hat{\sigma}) + n\ln(2\pi) + n\left(\frac{n + tr(S)}{n - 2 - tr(S)}\right) \quad (12)$$

The residual $\varepsilon$ can be calculated from the sample data $Y$ and the regression result $\hat{Y}$:

$$\varepsilon = Y - \hat{Y} \quad (13)$$

The unbiased estimate of the random error variance is expressed as $\hat{\sigma}^2$:

$$\hat{\sigma}^2 = \frac{RSS}{n - 2tr(S) + tr\left(S^T S\right)} \quad (14)$$

where $RSS$ indicates the sum of squared residuals, $tr(S)$ is the trace of the hat matrix $S$, and $n - 2tr(S) + tr\left(S^T S\right)$ represents the effective degrees of freedom of GWR. In most cases, $tr\left(S^T S\right)$ approximately equals $tr(S)$, and thereby Equation (14) can be simplified as

$$\hat{\sigma}^2 = \frac{RSS}{n - tr(S)} \quad (15)$$
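Once $RSS$ and $tr(S)$ are available, the score is a few scalar operations on the host. The following is a minimal sketch under the simplification of Equation (15); the inputs would be reduced from the per-point GPU results, and the function name is our own:

```cuda
#include <math.h>

// Host-side AICc evaluation, Equations (12), (13), and (15).
// rss  = sum of squared residuals, trace_s = tr(S), n = sample size.
double gwr_aicc(double rss, double trace_s, long n)
{
    const double PI = 3.14159265358979323846;
    double sigma2 = rss / (n - trace_s);              // Equation (15)
    return 2.0 * n * log(sqrt(sigma2)) + n * log(2.0 * PI)
         + n * (n + trace_s) / (n - 2.0 - trace_s);   // Equation (12)
}
```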

Atomizing the GWR Model
As mentioned above, the regression process of the GWR model involves two fixed steps: optimal bandwidth selection and model diagnosis. Most existing packages that implement the GWR algorithm run in serial mode. Compared with the parallel mode, the serial mode has undesirable consequences for the regression computation. Computing containers with finite computational power will be overloaded by excessively large samples, and the runtime rises along with the sample size, following a power or even an exponential relationship [31]. In this paper, Algorithm 1 (atomization) is designed as a feasible solution for reducing the complexity of the GWR regression calculation. Given a test bandwidth (bw) and an atomic process index (z), Algorithm 1 loops over all samples to accumulate the matrices A and B element by element, calculates B−1, initializes Sz = 0 and Ŷz = 0, and then loops over each a = 1, 2, · · · , p + 1 and b = 1, 2, · · · , p + 1 (with a temporary accumulator temp_x_inv) to combine the entries of A and B−1 into Sz and Ŷz.

In order to introduce the parallel mode legitimately, we design GWR atomization to decompose the matrix calculation process. The matrix elements used in the result are extracted on demand, and the result value is obtained via simple algebraic calculations. This saves a huge amount of memory and computing resources in the large matrix operations of GWR. The intermediate matrix is an important research object of GWR atomization, and it exists in several common models.
OLR can be calculated in the following matrix form:

$$Y = X\beta + \varepsilon \quad (16)$$

On the basis of OLS, the regression coefficient $\hat{\beta}$ is estimated from

$$\hat{\beta} = \left(X^T X\right)^{-1} X^T Y \quad (17)$$

Next, the regression result $\hat{Y}$ of OLR can be expressed as follows:

$$\hat{Y} = X\hat{\beta} = X\left(X^T X\right)^{-1} X^T Y \quad (18)$$

By comprehensively analyzing Equations (8), (9), (17), and (18), we can find the intermediate matrix $M$, which exists in all regression models estimated unbiasedly via OLS (with $W_i$ reducing to the identity matrix in OLR). It can be calculated by the following:

$$M = \left(X^T W_i X\right)^{-1} X^T W_i \quad (19)$$

In the point-by-point regression process, the intermediate matrix $M$ is inevitable. Matrix $X^T$ can be defined by

$$X^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix} \quad (20)$$

where $p$ is the number of independent variables. The multiplication of matrix $X^T$ and the diagonal square matrix $W_i$ is special. The resulting matrix $A$ can be expressed as follows:

$$A = X^T W_i = \begin{bmatrix} w_{i1} & w_{i2} & \cdots & w_{in} \\ x_{11} w_{i1} & x_{21} w_{i2} & \cdots & x_{n1} w_{in} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} w_{i1} & x_{2p} w_{i2} & \cdots & x_{np} w_{in} \end{bmatrix} \quad (21)$$

Similarly, matrix $B$ can be written as

$$B = X^T W_i X = A\,X \quad (22)$$

where matrix $B$ is a square matrix with $p + 1$ dimensions. In practical applications, $p + 1$ is usually less than 10, which means that it is legal to ignore the time spent by the inverse operation on matrix $B$.
In contrast with matrix decomposition, the conventional regression subprocess of GWR relies on the full weight matrix $W_i$ when calculating matrices $A$ and $B$. The determination of the weighting scheme $W_i$ is usually achieved as follows: (a) obtain the coordinate matrix $UV_{(n\times 2)}$ of all samples, and then transpose it to the matrix $UV^T_{(2\times n)}$; (b) solve the distance matrix $D_{(n\times n)}$ between the coordinate matrix $UV_{(n\times 2)}$ and its transposed matrix $UV^T_{(2\times n)}$; (c) calculate the weight matrix $W_{(n\times n)}$ of all samples according to Equation (6) or (7), where $W_i$ is the diagonal matrix formed by the $i$th row elements of the weight matrix $W$. However, this process needs huge memory space and calculation time for an enormous sample size. In addition, each subprocess of GWR determines $W_i$ once, which causes high redundancy in memory and runtime. The matrix decomposition approach decreases memory usage and runtime occupation by means of Equations (21) and (22).

Implementation of the Atomization Algorithm
Unlike the large process with full-matrix multiplication, each logically independent subprocess merely participates in the regression calculation once, on the basis of the atomization algorithm. Ensuring that the subprocess is repeatable is the prerequisite for parallelization. To address the problems caused by redundant computing, two aspects of optimization (memory and time) are conducted in this study. The $AIC_c$ scores and estimates $\hat{Y}$ generated during the bandwidth calibration process are stored in the singleton pattern. Moreover, by means of on-demand computing, the disadvantage of high-repetition-rate calculation in the large process is eliminated. Given a test bandwidth bw, the detailed steps of the atomization algorithm can be implemented as Algorithm 1; a CUDA-style sketch of one atomic subprocess is given below.
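To make the atomization concrete, the following kernel sketches one atomic subprocess following Equations (8)–(10), (21), and (22): thread z accumulates B = XᵀWzX and c = XᵀWzy on the fly, inverts B, and produces Ŷz together with the diagonal hat-matrix entry Szz needed for tr(S) in the AICc score. This is a minimal illustration of Algorithm 1 rather than the released FPGWR code; it reuses the device helpers sketched earlier, and it assumes row-major single-precision inputs and at most MAXK coefficients:

```cuda
#define MAXK 10  // upper bound on p + 1 (p + 1 <= 10 in common applications)

// Small Gauss-Jordan inverse of a k x k row-major matrix (k <= MAXK).
// B is destroyed; the inverse is written to Binv. No pivoting: B = X^T W_z X
// is symmetric positive definite here, which is adequate for a sketch.
__device__ void invert_small(float *B, float *Binv, int k)
{
    for (int i = 0; i < k * k; ++i) Binv[i] = 0.0f;
    for (int i = 0; i < k; ++i) Binv[i * k + i] = 1.0f;
    for (int i = 0; i < k; ++i) {
        float piv = B[i * k + i];
        for (int j = 0; j < k; ++j) { B[i * k + j] /= piv; Binv[i * k + j] /= piv; }
        for (int r = 0; r < k; ++r) {
            if (r == i) continue;
            float f = B[r * k + i];
            for (int j = 0; j < k; ++j) {
                B[r * k + j]    -= f * B[i * k + j];
                Binv[r * k + j] -= f * Binv[i * k + j];
            }
        }
    }
}

// One atomic subprocess (Algorithm 1) for regression point z with bandwidth bw.
// x is the n x (p+1) design matrix (column 0 = intercept), y the observations,
// u/v the coordinates; yhat and s_diag receive Y^_z and S_zz.
__global__ void fpgwr_atomic_kernel(const float *x, const float *y,
                                    const float *u, const float *v,
                                    int n, int p, float bw,
                                    float *yhat, float *s_diag)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    if (z >= n) return;

    int k = p + 1;
    float B[MAXK * MAXK] = {0};  // B = X^T W_z X, Equation (22)
    float Binv[MAXK * MAXK];
    float c[MAXK] = {0};         // c = X^T W_z y

    // Accumulate B and c on demand: no n x n weight matrix is ever stored.
    for (int j = 0; j < n; ++j) {
        float w = bisquare_weight(sample_distance(u, v, z, j), bw);
        for (int a = 0; a < k; ++a) {
            float xa = w * x[j * k + a];   // one entry of A = X^T W_z, Eq. (21)
            c[a] += xa * y[j];
            for (int b = 0; b < k; ++b)
                B[a * k + b] += xa * x[j * k + b];
        }
    }

    invert_small(B, Binv, k);    // Calculate B^{-1}

    // beta_z = B^{-1} c (Equation (8)); yhat_z = X_z beta_z (Equation (9)).
    float yz = 0.0f;
    for (int a = 0; a < k; ++a) {
        float beta_a = 0.0f;
        for (int b = 0; b < k; ++b) beta_a += Binv[a * k + b] * c[b];
        yz += x[z * k + a] * beta_a;
    }

    // Diagonal hat-matrix entry S_zz = w_zz * X_z B^{-1} X_z^T (Equation (10)),
    // the only entry of S this thread needs for tr(S).
    float szz = 0.0f;
    for (int a = 0; a < k; ++a) {
        float t = 0.0f;
        for (int b = 0; b < k; ++b) t += Binv[a * k + b] * x[z * k + b];
        szz += x[z * k + a] * t;
    }
    szz *= bisquare_weight(0.0f, bw);  // w_zz = 1 for the Bi-Square kernel

    yhat[z] = yz;
    s_diag[z] = szz;
}
```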

CUDA Enabled FPGWR
FPGWR based on CUDA has the capability to process massive spatial geodata. The technique substantially increases the computing speed of GWR. Supported by a large number of streaming processors (SPs), the GPU device can handle parallel computing as a natural carrier of high-performance computing (HPC). Hardware performs better than software in terms of multithread scheduling. Hence, improving GWR on the basis of the CUDA framework is the preferred solution.

Optimizing the Kernel Function of CUDA
CUDA is a general-purpose parallel computing architecture introduced by the NVIDIA Corporation [32]. In the CUDA framework, parallel tasks are instantiated as independent, controllable threads. Independence means that there are no mutually exclusive signals among the threads; each thread can run concurrently without depending on its sibling threads. Controllability means that the specificity of the thread instances can be controlled by the same parameters; the initialization values are set differently to make the generated instances distinct from each other. Because the computing processes of the threads are identical, merely one thread scheduler is needed to manage all threads.
There are two principles for designing the CUDA kernel function to maximize the usage of the GPU scheduling resources and computing cycles. First, the occurrence of WARP branches in the kernel function should be minimized as much as possible. Second, it is recommended to choose the CUDA memory type flexibly. The specific optimization strategy is shown in Figure 1. By the method of matrix decomposition, the atomic kernel function prevents process branching, so the computation workload can be evened out among the threads. Each atomic task is dynamically assigned one unique thread index (z) that is different from the others. Because the tasks execute in a completely random order, the coupling relationship between the atomic threads and SPs is disconnected. To overcome the performance bottleneck caused by frequent access to global memory, FPGWR utilizes the shared memory to store temporary variables.
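To isolate the two principles, the fragment below keeps each thread's accumulator in shared memory and evaluates the Bi-Square weight branchlessly. It is an illustrative sketch, not the FPGWR source; SHARED_K and the choice to accumulate c_z = XᵀWz y are our own:

```cuda
#define BLOCK_SIZE 64
#define SHARED_K   10  // assumed upper bound on p + 1

// Accumulate c_z = X^T W_z y for each regression point z, keeping the
// k-element accumulator in fast shared memory instead of global memory.
__global__ void xtwy_demo(const float *x, const float *y,
                          const float *u, const float *v,
                          int n, int k, float bw, float *c_out)
{
    __shared__ float acc[BLOCK_SIZE][SHARED_K];

    int z = blockIdx.x * blockDim.x + threadIdx.x;
    if (z >= n) return;

    for (int a = 0; a < k; ++a)
        acc[threadIdx.x][a] = 0.0f;

    for (int j = 0; j < n; ++j) {
        float du = u[z] - u[j], dv = v[z] - v[j];
        float d  = sqrtf(du * du + dv * dv);
        // Branch-free Bi-Square weight: every thread of a WARP follows the
        // same instruction path whether or not d exceeds the bandwidth.
        float t = fmaxf(1.0f - (d / bw) * (d / bw), 0.0f);
        float w = t * t;
        for (int a = 0; a < k; ++a)
            acc[threadIdx.x][a] += w * x[j * k + a] * y[j];
    }

    for (int a = 0; a < k; ++a)
        c_out[z * k + a] = acc[threadIdx.x][a];
}
```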

Implementing FPGWR Based on CUDA
In this study, we have implemented FPGWR in a CUDA framework by utilizing the method of atomization. FPGWR significantly shortens the total time of large-scale GWR regression and releases the memory space of massive spatial matrix data. The FPGWR implementation consists of five steps.
Step 1, the program on the Host device invokes the GPU device to be prepared, and at the same time, a series of initial parameters is set in the constant memory of the GPU.
Step 2, the sample data are input into the global memory of the GPU. The volume of geodata is too enormous for it to be instantiated in either the shared memory or the local memory. An "out of memory (OOM)" error will be thrown when the sample data volume is too large to fit in the GPU global memory.
Step 3, CUDA loads the instructions compiled from the code of the atomic kernel function, and then the scheduler generates individual threads with the same kernel function.
Step 4, all threads are assigned to Streaming Multiprocessors (SMs) in units of WARPs. To address the enormous number of threads, the GPU activates the flow-shop scheduling mode.
Step 5, CUDA feeds back the regression results from the GPU to the Host, and the GPU device resources are released immediately. The detailed workflow of the FPGWR implementation is shown in Figure 2 (Figure 2c depicts the core implementation process of the fpgwr_kernel function). As shown in Figure 2a, the FPGWR algorithm can be divided into four layers: data layer, input layer, working layer, and output layer. The data layer is dedicated to storing the files of original observations. The input layer reads the spatial observation data from hard disk into host memory.
At the same time, the CUDA programming part is instantiated in this layer. The initialization parameters and observation matrices are introduced together into the atomic kernel function, and the function is then compiled into an executable program. The working layer runs on the NVIDIA GPU. It starts massive task threads, which are managed uniformly by the multithreaded scheduler of the GPU. At the physical level, WARPs are bundled into a queue of batches, and the WARPs in the same batch are executed synchronously. The output layer is designed to collect the regression results. Based on the bandwidth indexes, these results are organized into multiple sets of regression matrices ($S$, $\hat{Y}$, and $\hat{\beta}$). Finally, the algorithm finds the optimal result set that corresponds to the minimum $AIC_c$ score. Figure 2b,c illustrate how the core part of FPGWR works at the micro level. The specific meanings of the initialization parameters ($n$, $p$, and $bws$) and the prototype of the FPGWR_KERNEL function are described in Figure 2b. The detailed process of the FPGWR_KERNEL function is presented in Figure 2c; its steps correspond to those of Algorithm 1. The multithread scheduling depends on the initial BLOCK and GRID settings of the kernel function in CUDA. BLOCK is set as a one-dimensional vector with a constant value (64); namely, each BLOCK contains 64 threads. GRID is set as a two-dimensional vector, in which the first dimension is the sample size n divided by 64 (the BLOCK size), and the second dimension is the size of the bandwidth array.
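A condensed host-side sketch of the five steps and the BLOCK/GRID configuration described above might look as follows. Here fpgwr_kernel stands in for the atomic kernel sketched earlier, adapted to read n, p, and the candidate bandwidth for blockIdx.y from constant memory; names, the bandwidth-array bound, and the omission of error handling are our own simplifications:

```cuda
#include <cuda_runtime.h>

// Hypothetical constant-memory parameters (Step 1).
__constant__ int   c_n, c_p;
__constant__ float c_bws[32];  // candidate bandwidth array (assumed bound)

// Forward declaration of the atomic kernel; each thread handles one sample
// for the bandwidth selected by blockIdx.y.
__global__ void fpgwr_kernel(const float *x, const float *y,
                             const float *u, const float *v,
                             float *yhat, float *s_diag);

void run_fpgwr(const float *h_x, const float *h_y, const float *h_u,
               const float *h_v, float *h_yhat, float *h_sdiag,
               int n, int p, const float *bws, int nbw)
{
    // Step 1: initial parameters into GPU constant memory.
    cudaMemcpyToSymbol(c_n, &n, sizeof(int));
    cudaMemcpyToSymbol(c_p, &p, sizeof(int));
    cudaMemcpyToSymbol(c_bws, bws, nbw * sizeof(float));

    // Step 2: sample data into global memory (shared/local memory is too small).
    int k = p + 1;
    float *d_x, *d_y, *d_u, *d_v, *d_yhat, *d_sdiag;
    cudaMalloc(&d_x, (size_t)n * k * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_u, n * sizeof(float));
    cudaMalloc(&d_v, n * sizeof(float));
    cudaMalloc(&d_yhat,  (size_t)n * nbw * sizeof(float));
    cudaMalloc(&d_sdiag, (size_t)n * nbw * sizeof(float));
    cudaMemcpy(d_x, h_x, (size_t)n * k * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_u, h_u, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_v, n * sizeof(float), cudaMemcpyHostToDevice);

    // Steps 3-4: one thread per (sample, bandwidth); BLOCK = 64 threads,
    // GRID = (ceil(n / 64), number of candidate bandwidths).
    dim3 block(64);
    dim3 grid((n + 63) / 64, nbw);
    fpgwr_kernel<<<grid, block>>>(d_x, d_y, d_u, d_v, d_yhat, d_sdiag);

    // Step 5: results back to the Host, then release device resources.
    cudaMemcpy(h_yhat,  d_yhat,  (size_t)n * nbw * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sdiag, d_sdiag, (size_t)n * nbw * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_u); cudaFree(d_v);
    cudaFree(d_yhat); cudaFree(d_sdiag);
}
```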

Data Source
To explore the real performance of FPGWR, three data sources are used for the experiments: the simulation dataset, the "Zillow test dataset" [26], and the "Georgia" dataset [33]. The simulation dataset is designed to evaluate the influence of the sample size and the number of independent variables. The "Zillow test dataset" (https://github.com/Ziqi-Li/FastGWR) is used to compare the acceleration performance of the different GWR packages. The "Georgia" dataset is used to validate the result accuracy of FPGWR against the other schemes.

Simulation Dataset
The test region is displayed as a square area [34] with sides of length $l$, where the sample points are distributed evenly. After setting the sample size of each row to $c$, the total number of samples can be expressed as $n = c \times c$. The distance between two adjacent samples is calculated as $\Delta l = l/(c - 1)$. The lower-left corner is defined as the origin of the coordinate system. The expression for calculating the positions of the samples is given by

$$u_i = \mathrm{mod}(i - 1, c) \cdot \Delta l, \qquad v_i = \mathrm{floor}\left(\frac{i - 1}{c}\right) \cdot \Delta l \quad (23)$$

where mod stands for the remainder function, and floor denotes the rounding-down function.
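A small host-side helper consistent with Equation (23) (our illustration, using 0-based indexing) could generate the simulated coordinates like this:

```cuda
// Fill u, v with the evenly spaced grid positions of Equation (23).
// i runs 0..n-1 here (0-based); c is the number of samples per row.
void make_grid(float *u, float *v, int c, float l)
{
    float dl = l / (c - 1);               // spacing between adjacent samples
    for (int i = 0; i < c * c; ++i) {
        u[i] = (float)(i % c) * dl;       // mod(i, c) * dl
        v[i] = (float)(i / c) * dl;       // floor(i / c) * dl
    }
}
```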
The sample data are generated by the GWR model predefined in Equation (24):

$$y_i = \beta_0(u_i, v_i) + \sum_{m=1}^{4} \beta_m(u_i, v_i)\, x_{im} + \varepsilon_i \quad (24)$$

To unify the dimensions of the regression coefficients $\beta$, all of their values are limited to the interval $(0, \beta_{max})$ ($\beta_{max}$ is a fixed constant). The coefficients $\beta$ follow five position-dependent functions (Equations (25)–(29)). The spatial distribution of the coefficients $\beta$ is displayed in Figure 3. The five coefficients $\beta$ selected by the model are closely related to the position of the sample, which demonstrates the spatial non-stationarity of the observations. According to Formula (27), eight sets of sample datasets are produced for testing. The datasets' construction parameters and resource link are exhibited in Table 1.

Zillow Test Dataset
The "Zillow test dataset" [26] is a subset of the Zillow property dataset, which consists of the single-family housing information within the metropolitan area of Los Angeles. The mathematical expression of the dataset is expressed as Equation (30). The dataset is open source on GitHub along with the FastGWR algorithm (https://github.com/Ziqi-Li/FastGWR). This paper has downloaded eight datasets (1 k, 2 k, 5 k, 10 k, 15 k, 20 k, 50 k and 100 k) from the GitHub repository for the comparative experiment.

Georgia Dataset
The Georgia dataset [33] contains a subset (socio-demographic characteristics) of the 1990 US census within the state of Georgia. The coordinates of the data points are set at the centroids of counties, so there are 159 records containing county population attributes in the dataset. The model can be defined as Equation (31):

$$PctBach = \beta_0(u, v) + \beta_1(u, v)\, PctPov + \beta_2(u, v)\, PctRural + \beta_3(u, v)\, PctBlack + \varepsilon \quad (31)$$

Testing Specifications and Environment
The experiment for FPGWR is conducted on a desktop computer equipped with an Intel i7-9700K CPU (3.6 GHz base clock). Relying on its ultra-high-speed task scheduling capability, the GPU can withstand the pressure of multithreaded computation tasks in parallel.

FPGWR Performance
Due to the parallelization transformation, the regression efficiency of FPGWR increases dramatically compared with GWR. The bandwidth optimization is clearly a repetitive process; to compare acceleration performance, this study analyzes only a single subprocess with a fixed bandwidth. The runtimes of FPGWR with different sample sizes are shown in Table 2. Given four independent variables, the runtime stays within 2 s when the sample size is less than 40 k. After increasing the sample size to 250 k, the runtime becomes approximately 66.6 s. As the sample size increases to the millions scale, it takes only approximately 1094.7 s. The result shows that the runtime obeys a logarithmic variation rule as the sample size changes.
The runtimes vary tremendously with different sample sizes. To enable the display of all results together, the y-axis of Figure 4 is plotted on a logarithmic scale. The comprehensive analysis of Table 2 and Figure 4 reveals that both the sample size and the number of independent variables influence the variation of the runtime. The regression time is positively associated with the number of independent variables. When the sample size varies, the variation in the runtime is similar across different numbers of independent variables. Given the same sample size, the runtimes exhibit a simple multiple relationship among the different numbers of independent variables. In summary, the sample size has a more pronounced impact than the number of independent variables on the regression time.

Speed-up and efficiency are important metrics for the performance evaluation of parallel algorithms [35]. Speed-up refers to the ratio of single-processor runtime to multiprocessor runtime, and efficiency represents the average speed-up per processor [36]. The speed-up of FPGWR relies on the GPU performance, which consists of the SP number, base clock frequency, and memory bandwidth. Table 3 compares the performance configurations of different GPU types. On the basis of the 250,000 simulation samples, this study regresses the model given in Equation (24) with an increasing number of GPU cores. The runtime of the GTX1050 is set as the benchmark value to calculate the speed-up factor of GPUs with different SP numbers. The speed-up growth proves that FPGWR has outstanding parallel scalability. Figure 5 illustrates that the computation time decreases approximately linearly as the number of GPU cores increases, exhibiting an obvious positive linear relationship between efficiency and GPU cores.

The experimental results demonstrate the outstanding capability of FPGWR to accelerate GWR, although its performance varies slightly among different orders of magnitude of observations. By setting appropriate sample sizes and independent variable numbers, the full potential of FPGWR can be achieved in various fields.

Comparison of FPGWR and Other GWR
Benefiting from the development of computer hardware, researchers can quickly and easily build a GPU environment for large-scale spatial studies. To verify the acceleration capability, four other GWR packages, namely, FastGWR (Python), MGWR (Python), GWmodel (R), and spgwr (R), are selected for comparison with FPGWR. The test data utilized by the experiment are the "Zillow test dataset." GWmodel uses a moving-window weighting technique to decrease the computation. FastGWR implements distributed parallelism in an HPC environment to improve operating efficiency. FastGWR is superior to MGWR, GWmodel, and spgwr in terms of overall calculating efficiency. As a side note, although the optimal environment for FastGWR is an HPC cluster, it is fairer to conduct the experiments in a single desktop environment.
The runtimes of the five packages with different sample sizes are displayed in Table 4. The runtime merely contains the single regression time with a specified bandwidth (as in Section 4.3.1). Given 1000 observations, FPGWR is 5 times faster than FastGWR, 32 times faster than MGWR, 88 times faster than GWmodel, and 865 times faster than spgwr. As the sample size increases to 10,000, FPGWR is approximately 14 times faster than FastGWR, approximately 157 times faster than MGWR, approximately 2185 times faster than GWmodel, and approximately 45,811 times faster than spgwr. Once the sample size exceeds 20,000, spgwr is the first to fail to complete the regression task, followed by GWmodel and MGWR. The cause is that the three schemes fail to avoid storing the high-dimensional weight matrix and other intermediate matrices. A fragment of Table 4 (runtimes in seconds):

Sample size | FPGWR | FastGWR | MGWR | GWmodel | spgwr
… | 0.50 | 6.76 | n/a | n/a | n/a
50,000 | 2.72 | 64.57 | n/a | n/a | n/a
100,000 | 10.80 | 307.12 | n/a | n/a | n/a

The runtimes of the five packages are illustrated in Figure 6. The y-axis is marked on a logarithmic scale to display all of the results together. Observing each package separately reveals that the runtimes of the five schemes all exhibit a logarithmic increasing trend. According to Figure 6, the performances of the five packages are enhanced generation by generation, and FPGWR is the most ideal implementation among these schemes. Overall, FPGWR is a feasible GWR accelerator with a low development cost and a simple productization process. Compared with other packages, FPGWR can greatly simplify a complicated job by decomposing the redundant full-sample regression.

Validation of the Result Accuracy
To validate the accuracy of FPGWR against other GWR packages, the results ($\hat{\beta}$, $Adj.R^2$, and $AIC_c$ scores) of the five packages are compared based on the well-known "Georgia" dataset. The dependent variable PctBach and the independent variables Intercept, PctPov, PctRural, and PctBlack are chosen to calibrate the same GWR model according to Equation (31). On the basis of the adaptive Bi-Square kernel function, 93 nearest neighbors are selected as the optimal bandwidth. As shown in Table 5, the Mean and Standard Deviation of the estimated coefficients $\hat{\beta}$ are displayed in the middle section, and the $Adj.R^2$ and $AIC_c$ scores are indicated in the lower section. The FPGWR result is clearly consistent with those of the other four packages. The spatial distribution of the estimated coefficients $\hat{\beta}$ in the study area is illustrated in Figure 7. To simulate the spatial variation better, the coefficient surfaces are interpolated into continuous surfaces utilizing the griddata method.

Discussion
Multiple loops are necessary in the GWR model until FPGWR atomizes the large process. Enormous calculation redundancy inevitably emerges in the implementation of the GWR algorithm. The algorithm structure is nested with multilevel loops, where the upper loop depends on the lower loop and the internal sequence of each loop is fixed. It is illegitimate to disturb the original iterative sequence of the subprocesses; otherwise, the accuracy of the regression results would be seriously questioned. FPGWR introduces a hybrid (parallel-serial) mode, which enables the GPU device not only to tolerate the parallel tasks of each batch but also to complete all of the tasks efficiently. The subprocesses can be randomly executed without errors, and the accuracy of the results is guaranteed for the model diagnosis. FPGWR differs markedly from GWR in its memory usage and time cost.

Memory
The matrix storage strategy of GWR is different from that of FPGWR, as shown in Figure 8. FPGWR optimizes the storage mode by utilizing two schemes, on-demand storage and matrix vectorization. The weight matrix $W_i$ is stored as an $n \times n$ diagonal matrix in the GWR model. Although only the diagonal elements must be solved, a storage space of size $n^2$ is demanded; the memory complexity can be expressed as $O(n^2)$. In the subsequent steps, the calculations of the matrices $B$, $B^{-1}$, and $A$ all inherit this memory complexity. In comparison, the FPGWR method only stores the data as required. Its memory complexity is thereby reduced to $O((p + 1)n)$ ($p + 1 \leq 10$ in common applications). Table 6 illustrates the comparison of memory usage between FPGWR and GWR. When the sample size is less than 100,000, the memory of the GPU device is still sufficient. Once the size is increased to 10,000,000, GWR requires approximately 364 TB of RAM; no existing single GPU device can allocate so much storage space. In contrast, FPGWR consumes only 380 MB of RAM. In summary, FPGWR demonstrates a tremendous advantage over GWR.
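These two figures are consistent with simple back-of-the-envelope arithmetic for $n = 10^7$, assuming 4-byte single-precision storage (our assumption; the element size is not stated above):

$$\text{GWR: } n^2 \times 4\ \text{B} = \left(10^7\right)^2 \times 4\ \text{B} = 4 \times 10^{14}\ \text{B} \approx 364\ \text{TB}$$

$$\text{FPGWR: } (p+1)\,n \times 4\ \text{B} \approx 10 \times 10^7 \times 4\ \text{B} = 4 \times 10^8\ \text{B} \approx 380\ \text{MB}$$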

Time
The process of calculating the hat matrix $S$ is necessary in GWR. The weight matrix $W_i$ is a special diagonal matrix that does not increase the runtime complexity during matrix multiplication. The runtime complexity of matrix $A$ is $O(1)$, and the runtime complexity of $B$ is $O\left((p + 1)^2 n\right)$. Because $p + 1$ is far less than $n$, the computing time of inverting matrix $B$ can be ignored. Given a fixed weight bandwidth, the runtime complexity of the hat matrix row $S_i$ can be expressed as $O\left((p + 1)^2 n\right)$. After $n$ operations on $S_i$, the runtime complexity of matrix $S$ is defined by $O\left((p + 1)^2 n^2\right)$. Different GWR schemes utilize varied ways to select the optimal bandwidth, and thereby this study omits a discussion of the runtime complexity of the whole process; the regression subprocess of a single sample is the appropriate object for the runtime analysis in this subsection. Instead of computing iteratively, FPGWR atomizes the regression process of each point as an independent thread. This strategy reduces the runtime complexity of matrix $S$ appreciably. At the same time, the runtime complexity of matrix $A$ becomes $O((p + 1)n)$, and the runtime complexity of matrix $B$ becomes $O\left((p + 1)^2 n\right)$. By combining matrices $A$ and $B$ in the form of parallel addition, the hat matrix row $S_i$ gains a runtime complexity of $O((p + 1)(p + 2)n)$. The runtime complexity of FPGWR is theoretically lower than that of GWR, but the instruction operation efficiency of the host device differs significantly from that of the GPU device, and the methods used by the different libraries to optimize matrix operations are inconsistent. Therefore, the actual comparison should refer to the experimental results (in Section 4.3.2).
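To put these orders of magnitude in perspective, take the illustrative values $n = 10^6$ and $p + 1 = 5$ (our own choice of numbers):

$$\text{GWR (serial): } (p+1)^2 n^2 = 25 \times 10^{12} \text{ operations in one pass over all points}$$

$$\text{FPGWR (per thread): } (p+1)(p+2)\,n = 3 \times 10^7 \text{ operations, executed concurrently across } n \text{ threads}$$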
Obtaining the GWR regression coefficients requires much time for repeated iteration when handling big geodata. For example, when the sample size is 1,000,000, the single-point regression of GWR needs to be iterated 1,000,000 times, with huge time and memory usage. This problem can be solved by a parallelization strategy. The atomization algorithm does not store the weight matrix and other temporary matrices during each point regression iteration but only reads and calculates the matrix elements on demand. FPGWR shortens the computation time while using much less memory space by parallelizing these atomic units on CUDA.

Conclusions
GWR is a local modeling technique that has been widely used in various disciplines. However, GWR has significant computational redundancy and can handle only approximately 15,000 geographical observations at most. To apply the local smoothing technique to large-scale spatial datasets, we proposed an improved algorithm, FPGWR, to solve these problems. FPGWR optimizes the matrix storage mode to overcome the limitation on memory space, thereby significantly reducing the memory complexity of GWR. Furthermore, it introduces a parallel computing mode, decomposing the full-sample large cycle into an atomization process, to substantially decrease the runtime complexity.
To demonstrate the practicability of FPGWR, the simulation and Zillow datasets are used to conduct the experiments. The results show that the regression runtime of GWR is exponentially related to the number of observations, and thus GWR is unable to process regression tasks with large volumes of geodata. In comparison, the time taken by FPGWR exhibits a logarithmic relationship with the number of observations; hence, FPGWR represents a significant advance in handling massive geodata mining tasks.
In summary, the dilemma that limits GWR in data scale can be considerably alleviated by FPGWR, and thus the application domains of GWR can potentially be expanded to a large extent. Under these circumstances, increasingly large datasets from geographical or nongeographical fields could become providers of large-scale geographic analysis services. In the future, we will investigate a key issue: how to adapt FPGWR to non-CUDA architectures, and even other non-GPU HPC devices, to enhance the versatility of the extended algorithm.