Mapping Gross Domestic Product Distribution at 1 km Resolution across Thailand Using the Random Forest Area-to-Area Regression Kriging Model

: Accurate spatial distribution of gridded gross domestic product (GDP) data is crucial for revealing regional disparities within administrative units, thus facilitating a deeper understanding of regional economic dynamics, industrial distribution, and urbanization trends. The existing GDP spatial models often rely on prediction residuals for model evaluation or utilize residual distribution to improve the ﬁnal accuracy, frequently overlooking the modiﬁable areal unit problem within residual distribution. This paper introduces a hybrid downscaling model that combines random forest and area-to-area kriging to map gridded GDP. Employing Thailand as a case study, GDP distribution maps were generated at a 1 km spatial resolution for the year 2015 and compared with ﬁve alternative downscaling methods and an existing GDP product. The results demonstrate that the proposed approach yields higher accuracy and greater precision in detailing GDP distribution, as evidenced by the smallest mean absolute error and root mean squared error values, which stand at USD 256.458 and 699.348 ten million, respectively. Among the four different sets of auxiliary variables considered, one consistently exhibited a higher prediction accuracy. This particular set of auxiliary variables integrated classiﬁcation-based variables, illustrating the advantages of incorporating such integrated variables into modeling while accounting for classiﬁcation characteristics.


Introduction
Socioeconomic parameters play essential roles in government management, enterprise decision-making, and scientific research, often acquired through traditional statistical surveys.However, these data often fail to reveal the inner spatial differences of irregular administrative units, posing a challenge for understanding the complex interactions between human beings and the environment [1].To overcome this limitation and obtain high-spatial-resolution socioeconomic data, spatialization studies have been developed to allocate such data from statistical units to regular grids [2][3][4].Among the various socioeconomic parameters, gross domestic product (GDP) holds particular significance as the broadest measure of economic development and resource allocation on national and local scales [5].While GDP data from statistical units provide valuable insights, gridded GDP data with fine regular grid cells offer a more detailed spatial distribution of GDP counts within each statistical unit (i.e., administrative unit).Moreover, these gridded GDP data can be effectively integrated with other raster geoscience variables, such as remotely sensed data from Earth observation satellites, enabling spatial analyses in both raster and vector formats [6].Consequently, the importance of downscaling statistical GDP (i.e., GDP spatialization or GDP prediction in space) has been increasingly recognized, leading to the widespread adoption of gridded GDP data in various applications.
The theoretical foundation of GDP spatialization lies in the strong relationship between anthropogenic lighting signals at night (nighttime light, NTL) and economic activity, which was first reported in [7].To explore this relationship, the earliest satellite-derived NTL data came from the US Air Force Defense Meteorological Satellite Program-Operational Linescan System (DMSP-OLS), including daily, monthly, and annual data, developed by the NOAA National Geophysical Data Center (NGDC) and ranging from 1992 to 2013 [8,9].Notably, Doll et al. [10] produced the first global gridded GDP map with a 1 • resolution using DMSP-OLS NTL data.Subsequently, newer NTL data have been acquired from the Suomi National Polar-Orbiting Partnership satellite (NPP) since 2012 [11].Compared to DMSP-OLS NTL data, NPP-VIIRS NTL data from the Visible Infrared Imaging Radiometer Suite (VIIRS) offer superior in-flight calibration, higher spatial resolution, lower detection limit, wider dynamic range, and finer radiometric quantization [12].As a result, NPP-VIIRS NTL data can generate gridded GDP maps with even finer resolution [13].Recently, the latest version of synthetic annual NPP-VIIRS NTL data, developed in [14], has gained wide acceptance and has been used to examine economic activity in various studies [15][16][17][18][19][20][21][22].Despite its potential, few studies have explored gridded GDP prediction using this version of the NPP-VIIRS NTL data.
Despite the successful application of GDP spatialization in various regions, including 11 European Union countries and the United States [23], different administrative units in China [6,[24][25][26][27], the continental United States (CONUS) [28], and the entire world [29][30][31], few studies have considered the impacts of different sectors in GDP spatialization [15,19].Studies have shown that NTL data alone are insufficient to explain GDP from agriculture and forestry [32].To overcome this limitation, a more effective method is to combine NTL with land use/land cover (LULC) data and other auxiliary data, such as vegetation index, to predict and spatialize GDP in different sectors [6,24,25,29].Recent studies have utilized vegetation index together with NTL and LULC data to spatialize primary industry GDP in Turkey and China [33,34].Moreover, Wang et al. [31] employed gridded population data, statistical agricultural output data, and market price data of different crops to spatialize agricultural GDP in Uganda.However, there remains a lack of studies exploring the estimation of non-agricultural and agricultural GDP separately, particularly using the new version of the NPP-VIIRS data.
The methods for GDP spatialization have undergone evolution over the past three decades.Initially, researchers heavily relied on linear regression models, exponential linear models, and multiple linear fitting for GDP prediction [3,10,25,26,[29][30][31]35].It has been increasingly recognized that using NTL data alone is insufficient for achieving accurate GDP prediction.As a result, researchers have turned to employing multiple linear fitting alongside auxiliary data, such as land use/land cover (LULC) data, environmental data, and population data, to enhance the accuracy of GDP spatialization [6,31].In recent years, machine learning techniques, such as convolutional neural networks and random forests, have emerged in tandem with geospatial big data, offering new opportunities to improve the spatial prediction of GDP [2,15,28,34].Conventionally, the residuals predicted by the aforementioned models have served as a yardstick for assessing the performance of GDP prediction models.Additionally, some studies have employed residual allocation methods to address potential biases in spatial scales.These methods allocate the residuals at the target spatial scale, ensuring unbiased results at the original spatial scale [23,26,36].Researchers have successfully applied these residual allocation methods in studies utilizing multiple linear fitting and random forests to improve the accuracy of GDP spatialization [27,33].However, it is worth noting that some studies have simply corrected residuals in space by multiplying the ratio of the actual GDP to forecast GDP, without accounting for potential modifiable areal unit problems.
Hence, this paper introduces a sub-sector downscaling strategy based on a hybrid approach that combines random forest algorithm (RF) and area-to-area kriging (ATAK) interpolation to map the GDP across Thailand at a 1 km spatial resolution.Subsequently, we conducted an assessment of the potential of various combinations of ancillary variables in enhancing GDP prediction by incorporating the latest version of the NPP-VIIRS data, points of interest, vegetation index, and other relevant data.To achieve this, we established distinct random forest area-to-area regression kriging (RFATARK) models for both agricultural and non-agricultural GDP, incorporating four distinct groups of ancillary variables.A comparative analysis of the downscaled GDP results from five alternative methodologies underscored the notable advantages of the RFATARK-based downscaling approach.This advantage was further validated through a comparison with an alternate gridded GDP product.The structure of the paper is as follows: Section 2 provides insights into the study area, data sources, and the specific methodology deployed for GDP spatialization.Subsequently, Section 3 presents the outcomes of our analytical endeavors, followed by a comprehensive discussion in Section 4. The conclusive Section 5 encapsulates the inferences derived from our research.

Study Area
The study is centered within the Kingdom of Thailand, located in tropical Southeast Asia on the Indochina Peninsula, encompassing an approximate land area of 513,000 km 2 (Figure 1).Bordered by Myanmar to the north and west, Thailand's northeastern boundary is delineated by the Mekong River, adjacent to Laos.To the east lies Cambodia, while the southern region extends as a narrow peninsula between the Andaman Sea and the Gulf of Thailand, adjoining northern Malaysia.As a significant member of the Association of Southeast Asian Nations (ASEAN), Thailand serves as a primary conduit for China's access to the Indian Ocean via the Indochina Peninsula, signifying its strategic role within China's "One Belt and One Road" (OBOR) initiative [37].

Materials
The dataset employed in this study encompasses a synthesis of remote sensing data, geographic information data, and socioeconomic statistical data (as presented in Table 1).
The remote sensing data assimilated within this study encompass NTL, vegetation index (VI), LST, and LULC.The NTL data utilized in this investigation were derived from the NPP/VIIRS day/night band (DNB) dataset, acquired from the Earth Observation Group (EOG) at the Payne Institute for Public Policy, Colorado School of Mines (h ps://eogdata.mines.edu/products/vnl/).Specifically, the annual imagery for the year 2015 was employed, with a spatial resolution of 15 arc-seconds (~500 m at the equatorial regions) presented as geographic grids.The NTL values are expressed in digital number The geographical features of Thailand are varied and contribute significantly to its regional diversity.Northern Thailand is characterized by extensive forest cover, whereas the northeastern plateau, known as the Khurat Plateau, faces challenges in agriculture due to its dry summers and rainy wet seasons.The southern peninsula extends from the western mountain range, featuring hills in the west and coastal plains in the east.The central plain, nourished by the Chao Phraya River, historically serves as the nation's rice belt for agricultural production.The administrative divisions in Thailand encompass seven regions: the northern region, northeastern region, western region, central region, Bangkok and vicinity, eastern region, and southern region.The nation is subdivided into 77 provinces, with the capital city, Bangkok, holding a unique status as both the political center and the sole municipality under direct central government jurisdiction.
Thailand's climate is decidedly tropical and exhibits three distinct seasons: summer (mid-February to mid-May), the rainy season (mid-May to mid-October), and winter (mid-October to mid-February).The country experiences temperatures ranging from 18 • C to 36 • C, accompanied by an average annual rainfall of approximately 1700 mm.The country's geographical and climatic characteristics significantly influence its agricultural landscape, which spans 265,200 km 2 , including 49,600 km 2 of irrigated land.Thailand's stature as the world's largest exporter of rice and natural rubber underscores its agricultural prominence [38].The nation's maritime significance is underscored by its position as the third-largest fishing country in Asia, attributed to its abundant marine fisheries in the Gulf of Thailand and Andaman Sea.In 2015, Thailand's Gross Domestic Product (GDP) reached USD 401.3 million, with industry and services contributing 39.2%, agriculture 8.4%, trade 13.4%, logistics technology and communication 9.8%, construction and mining 4.3%, and other service sectors, encompassing finance, education, and hospitality, comprising 24.9% [39].

Materials
The dataset employed in this study encompasses a synthesis of remote sensing data, geographic information data, and socioeconomic statistical data (as presented in Table 1).

Dataset Description Source
The remote sensing data assimilated within this study encompass NTL, vegetation index (VI), LST, and LULC.The NTL data utilized in this investigation were derived from the NPP/VIIRS day/night band (DNB) dataset, acquired from the Earth Observation Group (EOG) at the Payne Institute for Public Policy, Colorado School of Mines (https://eogdata.mines.edu/products/vnl/,accessed on 6 December 2022).Specifically, the annual imagery for the year 2015 was employed, with a spatial resolution of 15 arcseconds (~500 m at the equatorial regions) presented as geographic grids.The NTL values are expressed in digital number (DN) format.The 'vcm-orm-ntl' annual synthetic product, the first version, was utilized in this study, which involved filtering out transient lights, stray light, background values, and anomalous data, resulting in cloudless average radiation values [14].The vegetation index data, encompassing normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), and near-infrared vegetation index (NIRv), were calculated utilizing MODIS reflectance data, specifically the MCD43A4.006BRDF-Adjusted Reflectance 16-Day L3 Global 500 m product.These data were hosted on Google Earth Engine and subjected to quality control using the quality assurance layer [40].Calculation of the NIRv values followed the approach outlined in [41], while NDVI and EVI values were computed according to the methodology presented in [42].LST data were drawn from MODIS LST data of the Aqua satellite, employing the MYD11A1.006Land Surface Temperature/Emissivity Daily L3 Global 1 km dataset, available on Google Earth Engine [43].The selection of the Aqua satellite data was influenced by its local overpass time of 13:30, closely aligning with the occurrence of daily maximum temperatures.The LULC data were procured from the FROM-GLC dataset developed in [44], with a spatial resolution of 30 m, accessible at http://data.ess.tsinghua.edu.cn/(accessed on 24 March 2023).
The geographic information data encompass digital elevation model (DEM), boundary information, road networks, water bodies, and points of interest (POIs).The DEM data, sourced from Advanced Space Borne Thermal Emission and Reflection Radiometer (ASTER)/Global Digital Elevation Model (GDEM), had a spatial resolution of 30 m and were provided by the Earth Remote Sensing Data Analysis Center of Japan [45].Boundary information data were procured from the GADM data repository (https://gadm.org/data.html,accessed on 14 June 2022), specifically version 3.6.The road, water, and POI data were sourced from the OpenStreetMap (OSM) project, renowned for generating and distributing free global geographic data (download URL: http://download.geofabrik.de/asia/thailand.html, accessed on 20 December 2022).

Data Processing
The collected data underwent a series of preprocessing steps, including the calculation of relevant indicators, all of which served as auxiliary variables for GDP spatialization.The resulting auxiliary variable data were subsequently resampled to achieve a final spatial resolution of 1 km.

NTL Data Processing
Bangkok, as Thailand's most developed city, served as a critical reference point in the analysis of nighttime light (NTL) data.The fundamental assumption is that pixel values in other regions should not theoretically exceed the values observed in Bangkok.Therefore, the highest Digital Number (DN) value recorded in Bangkok was employed as a threshold to identify outliers in other areas.Additionally, to ensure compatibility with other datasets, the spatial resolution of the NPP/VIIRS NTL data was resampled from 15 arc-seconds to 500 m using the nearest neighbor method.Subsequently, these resampled NPP/VIIRS NTL data were employed to compute four essential nighttime light indices, Averaged Nighttime Light (NLAve), Nighttime Light Intensity (NLVI), Nighttime Light Lit Area Proportion (NLVS), and Compounded Nighttime Light Index (VCNLI), following the approach outlined by Zhao et al. [13].
NLAve represents the average DN value of the NPP/VIIRS NTL data within a given statistical unit.It is computed using the formula: where DN i and n i denote the pixel value of the light level and the corresponding pixel count within a statistical unit, respectively.S represents the total area of the statistical unit in square kilometers.NLVI is defined as the ratio of NLAve to the maximum nighttime light value within a statistical unit.The calculation is as follows: where DN max is the maximum DN value in the NPP/VIIRS NTL data, and N L represents the total number of lit pixels within a statistical unit.NLVS is determined by the ratio of the lit pixel area (SL) to the total area of the statistical unit (S), as indicated below: VCNLI is calculated as the product of NLVI and NLVS, providing a comprehensive measure of nighttime light characteristics within a statistical unit:

OSM Data Processing
The OSM dataset encompasses road data, water data, and POI data, as detailed in Tables 2 and 3.In this subsection, we provide a comprehensive description of the processing steps undertaken for each type of OSM data.
For both road and water data, we calculated the auxiliary variables by determining the distance from the center point of each 1 km × 1 km pixel to the nearest road or water feature within each respective class.This information assumes a critical role in subsequent analyses.
For the POI data, we generated ten gridded (1 km × 1 km) auxiliary variables, each corresponding to ten types of POI point data.This was achieved through the application of kernel density estimation (KDE), following the methodology detailed by Wang et al. [48].An integral aspect of KDE is the bandwidth parameter, which we meticulously determined.Our process encompassed the exploration of a bandwidth range spanning from 500 m to 25 km.Within this range, we adopted a step length of 100 m for distances ranging from 500 to 10,000 m, and a 1 km step length for distances between 10 and 25 km.Consequently, this procedure yielded 110 distinct bandwidth values, consequently generating 1100 gridded POI density layers (comprising 10 types of POIs and 110 bandwidths).These layers were aggregated to the provincial scale through an average aggregation method.Subsequently, we constructed random forest (RF) models at the provincial scale, with each bandwidth corresponding to a unique RF model.The ten gridded POI density layers generated for each bandwidth served as independent variables, while provincial Gross Domestic Product (GDP) was employed as the dependent variable.Among the 110 RF models, we discerned the optimal bandwidth by identifying the bandwidth that minimized the out-of-bag error value (i.e., normalized mean square error (MSE)), a determination illustrated in Figure 2. Notably, the optimal bandwidth was determined to be 8100.bandwidth corresponding to a unique RF model.The ten gridded POI density layers ge erated for each bandwidth served as independent variables, while provincial Gross D mestic Product (GDP) was employed as the dependent variable.Among the 110 RF mo els, we discerned the optimal bandwidth by identifying the bandwidth that minimiz the out-of-bag error value (i.e., normalized mean square error (MSE)), a determinati illustrated in Figure 3. Notably, the optimal bandwidth was determined to be 8100.Concerning the treatment of OSM auxiliary variables, we administered four distinct approaches based on a combination of two processing methods.The two methods considered were the classification of water and road auxiliary variables and the merging of OSM auxiliary variables.These four treatments are as follows: (1) Unmerged and Single (UMS): water and road auxiliary variables remained unclassified, and POI auxiliary variables were left unmerged; (2) Unmerged and Classified (UMC): while water and road auxiliary variables were classified, POI auxiliary variables were not subjected to merging; (3) Merged and Single (MS): water, road and POI auxiliary variables were merged without any classifying; and (4) Merged and Classified (MC): involved the merged water, road and POI auxiliary variables, in which the water and road auxiliary variables were classified.
The merging of OSM auxiliary variables was executed through a weighted approach [49], as described by Equation ( 5): where OSM sum represents the merged OSM auxiliary variable for different types; b ij is the value of the j-th OSM auxiliary variable for the i-th pixel after normalization; m denotes the count of one type of OSM auxiliary variables; and w j indicates the weight value of the j-th OSM auxiliary variable.These weight values of each class (or group) of OSM auxiliary variable are calculated through the entropy method, governed by Equations ( 6)-(8) [48]: . .,a nj max a 1j , . . .,a nj −min{a 1j , . . .,a nj (6) In these equations, a ij corresponds to the value of the j-th OSM auxiliary variable for the i-th pixel, n is the count of pixels in the auxiliary variable raster layer, and e j represents the entropy value of the j-th OSM auxiliary variable.The weight associated with each class of OSM auxiliary variables is presented in Tables 2 and 3.

Other Processing
Beyond the above two data sources, five additional types of data play a crucial role in the GDP spatialization process.These encompass terrain data, land use and land cover (LULC) data, vegetation index data, land surface temperature (LST) data, and population data.
The cornerstone of terrain data lies in the Digital Elevation Model (DEM) data, characterized by a spatial resolution of 30 m.This initial dataset underwent a transformation process, resulting in the derivation of three fundamental parameters: elevation, slope, and slope direction.Subsequently, the elevation values at broader scales, specifically operating at a 1 km scale or the provincial scale, were computed through the aggregation of 30 m elevation values within individual statistical units.Furthermore, the determination of slope and slope direction at these elevated scales was achieved through an assessment of the proportion of slopes with inclinations less than 5 degrees and an examination of the distribution of slope directions, encompassing sunny slopes, semi-sunny slopes, semi-shady slopes, and shady slopes, within each statistical unit.
The LULC dataset employed for this study in Thailand comprises eight distinct land cover categories, conforming to the FROM-GLC classification system.These categories encompass crop, forest, grass, shrub, wetland (denoted as wet), water bodies, bare land (denoted as bare), and impervious surfaces.The analysis entailed the quantification of the relative abundance of these land cover classes within each statistical unit, operating at either a 1 km scale or the provincial level.
To discriminate human settlements from other land cover types and to ameliorate the impact of cloud interference, an annual synthetic vegetation index was generated through the application of the maximum value composite method to multitemporal vegetation index imagery.Subsequently, the vegetation index data underwent spatial aggregation, resulting in a resolution of 1 km through a process of averaging.This refined dataset was then amenable to the computation of vegetation index values per unit area, whether at the provincial level or within individual grid cells.Notably, it is imperative to emphasize that the LST data follow an analogous preprocessing procedure to that of the vegetation index data.
The gridded population data, sourced from the Worldpop database, underwent a meticulous calibration process, which harmonized it with census data to ensure data integrity and fidelity.Following this calibration, the population data were aggregated to a resolution of 1 km or the provincial scale.

Downscaling Methodology
The dataset employed in this study encompassed a synthesis of remote sensing data, geographic information data, and socioeconomic statistical data (as presented in Table 1).

Downscaling Model
A machine learning-based geostatistical downscaling method was introduced in this study, referred to as the RFATARK model (random forest area-to-area regression kriging).The RFATARK model combines two distinct techniques, namely the random forest (RF) algorithm, which is employed within the trend module, and the area-to-area kriging (ATAK) interpolation method, utilized within the residual module.The RF algorithm, initially proposed by Breiman [50] and further refined by Culter and Stevens [51], is a machine learning technique utilized for diverse tasks, including classification, regression, and other decision tree-based applications.Compared to traditional regression algorithms, the RF algorithm offers several advantages, such as a high predictive accuracy, the absence of prior probability distribution assumptions, and the capacity to assess variable importance [52].The RF model involves two key parameters, which represents the number of variables used to make decisions at the nodes of decision trees (denoted as mtree), and indicates the number of decision trees generated (denoted as ntree).During the modeling process, mtry and ntree were optimized to determine the most suitable parameter settings based on error variations across different parameter combinations.On the other hand, ATAK, a variant of areal interpolation, changes the supports before and after the interpolation.Its purpose is to redistribute areal residuals to their target area-level positions [53].This approach achieves the prediction of areal values through a linear combination of areal data.For each specific pixel, the residual is calculated as a weighted linear combination of residuals from neighboring areas, satisfying an unbiased weight constraint.
Let GDP(S i ) and X k (S i ) represent the target and k ancillary random variables at coarse pixel S i .The RF regression model capturing the relationship between GDP(S i ) and X k (S i ) can be symbolized as f (•).Under the assumption of scale-invariance in the statistical relationships among these variables, the trend component of fine spatial resolution is estimated through the utilization of the coarse regression function.Subsequently, the residual component of the fine spatial resolution is determined through ATAK, which interpolates the coarse residuals by considering m neighboring coarse pixels, denoted as e(S i ).The predicted GDP at a 1 km resolution, represented as GDP(s), at the fine resolution using the RFATARK model, can be expressed as follows: (9)   where λ i represents the weight assigned to m neighboring coarse pixels in ATAK, satisfying the constraint ∑ m i=1 λ i (S i ) = 1.For a detailed calculation process, refer to [54].In this study, the RFATARK model was applied separately to non-agricultural and agricultural statistical GDP, resulting in the mapping of non-agricultural and agricultural GDP at a 1 km resolution.Given the distinct characteristics of agricultural and nonagricultural GDP, land cover information was used to mask the GDP prediction results based on the following rules: GDP where GDP agri j

and GDP
non−agri j represent the agricultural and non-agricultural GDP for pixel j, respectively, while L forest , L imp , L crop , L water , L grass , L shrub , and L wetland denote the proportions of forest, impervious layer, cropland, water, grassland, shrub, and wetland in the pixel, respectively.The model-predicted GDP (denoted as GDP model ) is the sum of the predicted agricultural GDP and non-agricultural GDP values derived from the RFATARK model.To achieve GDP spatialization, we needed to disaggregate GDP data at the administrative unit scale to the pixel scale.Therefore, it was necessary to make corrections to the simulated GDP for each pixel in the administrative unit [13].The statistical GDP data at the seven regions were used to adjust the simulated GDP obtained from the RFATARK model.The corrected GDP at the pixel level is expressed as follows: where GDP corrected i,j and GDP model i,j represent the corrected and model-predicted GDP values for pixel j within region i, respectively.GDP statistical i and GDP model i denote the total GDP values for region i derived from region-level statistical data and model-predicted GDP, respectively.

Downscaling Strategy
In this experiments, nine types of data from seven sources were employed as auxiliary information.The corresponding auxiliary variables were divided into four groups according to the different treatments of OSM data, as detailed in Section 2.3.2.These categories are denoted as UMS, UMC, MS, and MC scenarios.All auxiliary variables underwent a preprocessing procedure described in the data processing section to acquire data at the provincial scale and target resolution (i.e., 1 km × 1 km).
Subsequently, different downscaling models for agricultural and non-agricultural GDP were established, resulting in the generation of GDP distribution maps through six distinct models.These models encompass random forest (RF), multiple linear regression (MLR), support vector regression (SVR) [55], random forest area-to-area regression kriging (RFATARK), multiple linear area-to-area regression kriging (MLATARK,) and support vector area-to-area regression kriging (SVATARK) [54].Within the SVR modeling, we employed ε-SVR with the Gaussian radial basis function as its kernel function.Additionally, we optimized the relevant penalty coefficient and gamma parameters by minimizing model error.Meanwhile, during the RF modeling process, we adjusted the key parameters mtry and ntree to identify their optimal configurations, guided by the objective of minimizing the mean absolute error (MAE) associated with agricultural and non-agricultural GDP models across various sets of independent variables.Throughout this parameter exploration, we explored a range of values for mtry, spanning from 1 to 10, while ntree took on values of 5, 10, 20, 50, 100, 300, to 500.For a visual representation of these parameter variations, please refer to Figure 3.
distinct models.These models encompass random forest (RF), multiple linear regression (MLR), support vector regression (SVR) [55], random forest area-to-area regression kriging (RFATARK), multiple linear area-to-area regression kriging (MLATARK,) and support vector area-to-area regression kriging (SVATARK) [54].Within the SVR modeling, we employed ε-SVR with the Gaussian radial basis function as its kernel function.Additionally, we optimized the relevant penalty coefficient and gamma parameters by minimizing model error.Meanwhile, during the RF modeling process, we adjusted the key parameters mtry and ntree to identify their optimal configurations, guided by the objective of minimizing the mean absolute error (MAE) associated with agricultural and nonagricultural GDP models across various sets of independent variables.Throughout this parameter exploration, we explored a range of values for mtry, spanning from 1 to 10, while ntree took on values of 5, 10, 20, 50, 100, 300, to 500.For a visual representation of these parameter variations, please refer to Figure 3.Among the six downscaling models employed, the la er three models integrated the area-to-area kriging (ATAK) method for residual redistribution, whereas the initial three models did not.Each model was separately trained on four distinct sets of auxiliary variables, yielding a total of twenty-four agricultural GDP models and twenty-four non-agricultural GDP models.The model performance evaluation was conducted using a 10-fold cross-validation methodology, revealing R-squared values exceeding 0.7 for both training and testing datasets, with consistently low p-values below the significance threshold of 0.01.These findings provide robust evidence supporting the validity of the modeling process.Furthermore, by combining the predictions of agricultural GDP and non-agricultural GDP for each model, we obtained corresponding downscaled GDP predictions at a 1 km resolution.These predictions underwent masking using land cover information and were subsequently corrected based on regional statistical GDP data, as shown in Equations ( 10)-(12).
Finally, we conducted a validation exercise by comparing the predicted GDP distribution results with both statistical GDP data and existing gridded GDP data, utilizing metrics such as mean absolute error (MAE), root mean square error (RMSE), and determinate coefficients (R 2 ).A visual illustration of the GDP spatialization procedure can be found in Figure 4.Among the six downscaling models employed, the latter three models integrated the area-to-area kriging (ATAK) method for residual redistribution, whereas the initial three models did not.Each model was separately trained on four distinct sets of auxiliary variables, yielding a total of twenty-four agricultural GDP models and twenty-four non-agricultural GDP models.The model performance evaluation was conducted using a 10-fold cross-validation methodology, revealing R-squared values exceeding 0.7 for both training and testing datasets, with consistently low p-values below the significance threshold of 0.01.These findings provide robust evidence supporting the validity of the modeling process.Furthermore, by combining the predictions of agricultural GDP and non-agricultural GDP for each model, we obtained corresponding downscaled GDP predictions at a 1 km resolution.These predictions underwent masking using land cover information and were subsequently corrected based on regional statistical GDP data, as shown in Equations ( 10)-( 12).
Finally, we conducted a validation exercise by comparing the predicted GDP distribution results with both statistical GDP data and existing gridded GDP data, utilizing metrics such as mean absolute error (MAE), root mean square error (RMSE), and determinate coefficients (R 2 ).A visual illustration of the GDP spatialization procedure can be found in Figure 4.

Gridded GDP Mapping
Depending on the categorization of the independent variables and the chosen method for spatializing GDP, a series of 1 km gridded GDP maps were generated, as visually represented in Figures 5 and 6.In contrast to the statistical provincial GDP maps, which aggregate both agricultural and non-agricultural sectors (as illustrated in Figure 1c,d), the spatialized GDP maps provide a significantly finer-grained perspective by allocating GDP values within individual 1 km × 1 km grid cells.This approach affords an enhanced resolution, facilitating a more detailed examination of economic activities across the entire study area.

Gridded GDP Mapping
Depending on the categorization of the independent variables and the chosen method for spatializing GDP, a series of 1 km gridded GDP maps were generated, as visually represented in Figures 5 and 6.In contrast to the statistical provincial GDP maps, which aggregate both agricultural and non-agricultural sectors (as illustrated in Figure 1c,d), the spatialized GDP maps provide a significantly finer-grained perspective by allocating GDP values within individual 1 km × 1 km grid cells.This approach affords an enhanced resolution, facilitating a more detailed examination of economic activities across the entire study area.ISPRS Int.J. Geo-Inf.2023, 12, x FOR PEER REVIEW 13 of 25 Evidently, the vicinity surrounding Bangkok emerged as the most economically developed region in Thailand, characterized by a robust business and industrial sector, consequently resulting in significantly elevated GDP values.This high GDP concentration in the Bangkok region was consistently reflected across all GDP downscaling predictions.However, the results obtained from the MLR and MLATARK methods exhibited notable disparities when compared to the outcomes derived from the other four machine learningbased methods.At the same time, the MLR and MLATARK methods had limitations in accurately representing GDP information for cities located beyond the provincial boundaries surrounding the capital.They struggled to effectively capture other economically developed regions.In contrast, the four machine learning-based methods demonstrated a superior ability to capture the major urban structures in economically prosperous provinces outside the capital, such as Songkhla, Phuket, Chiang Mai, Khon Kaen, Nakhon Ratchasima, and Kamphaeng Phet.In addition, variations in the selection of auxiliary variables also exerted a discernible influence on the spatial distribution of GDP.For instance, within the same modeling method, the utilization of different auxiliary variables led to conspicuous disparities in prediction outcomes, particularly in the periphery of the capital economic zone.Evidently, the vicinity surrounding Bangkok emerged as the most economically developed region in Thailand, characterized by a robust business and industrial sector, consequently resulting in significantly elevated GDP values.This high GDP concentration in the Bangkok region was consistently reflected across all GDP downscaling predictions.However, the results obtained from the MLR and MLATARK methods exhibited notable disparities when compared to the outcomes derived from the other four machine learningbased methods.At the same time, the MLR and MLATARK methods had limitations in accurately representing GDP information for cities located beyond the provincial boundaries surrounding the capital.They struggled to effectively capture other economically developed regions.In contrast, the four machine learning-based methods demonstrated a superior ability to capture the major urban structures in economically prosperous provinces outside the capital, such as Songkhla, Phuket, Chiang Mai, Khon Kaen, Nakhon Ratchasima, and Kamphaeng Phet.In addition, variations in the selection of auxiliary variables also exerted a discernible influence on the spatial distribution of GDP.For instance, within the same modeling method, the utilization of different auxiliary variables led to conspicuous disparities in prediction outcomes, particularly in the periphery of the capital economic zone.The influence of residual redistribution on the spatial distribution of GDP becomes evident through a comparative analysis of Figures 5 and 6.As a consequence of this redistribution process, notable alterations in GDP values were observed in several provinces located outside the economic core centered around the capital, such as Sa Kaeo, Chanthaburi, Trat, Chai Nat, and Ratchaburi.Moreover, this redistribution of residuals induced significant transformations in the spatial distribution of GDP within the regions characterized by varying levels of economic development.For example, in the SVATARK results, certain economically developed areas experienced an increase in GDP values, while certain less developed areas witnessed a decrease.Conversely, RFATARK's results reveal a decrease in GDP values within specific economically developed regions, coupled with a simultaneous increase in GDP values within certain less developed areas.

Accuracy Assessment
The spatialization of GDP results was carried out utilizing various models based on four groups of auxiliary variables, namely UMS, UMC, MS, and MC.These predictions were aggregated to the administrative province level.The accuracy of these downscaled results was subsequently validated using provincial GDP data.Additionally, a The influence of residual redistribution on the spatial distribution of GDP becomes evident through a comparative analysis of Figures 5 and 6.As a consequence of this redistribution process, notable alterations in GDP values were observed in several provinces located outside the economic core centered around the capital, such as Sa Kaeo, Chanthaburi, Trat, Chai Nat, and Ratchaburi.Moreover, this redistribution of residuals induced significant transformations in the spatial distribution of GDP within the regions characterized by varying levels of economic development.For example, in the SVATARK results, certain economically developed areas experienced an increase in GDP values, while certain less developed areas witnessed a decrease.Conversely, RFATARK's results reveal a decrease in GDP values within specific economically developed regions, coupled with a simultaneous increase in GDP values within certain less developed areas.

Accuracy Assessment
The spatialization of GDP results was carried out utilizing various models based on four groups of auxiliary variables, namely UMS, UMC, MS, and MC.These predictions were aggregated to the administrative province level.The accuracy of these downscaled results was subsequently validated using provincial GDP data.Additionally, a compar-ison was made with an existing gridded GDP product referred to as G_GDP.Table 4 presents a comprehensive evaluation of the model performances, utilizing three classical statistical metrics.Among the three methods that do not incorporate residual redistribution, the RF models consistently demonstrated strong predictive capabilities across different categories.For both UMS and UMC, RF achieved an average RMSE of approximately 7322.48.Similarly, in the case of MS and MC, RF maintained consistently relatively low RMSE values, averaging around 7248.037.This accuracy was further substantiated by the models' performance, with an average MAE of approximately 2148.53.While the SVR models were competitive, they slightly trailed behind RF.In both the UMS and UMC, SVR achieved an average RMSE of approximately 8590.632, whereas in MS and MC, SVR maintained a comparable RMSE of approximately 8593.313.The SVR models exhibited an average MAE of approximately 2521.071, reflecting their overall accuracy.On the other hand, the MLR models yielded mixed results.In the UMS and UMC categories, MLR achieved moderate predictive accuracy, with an average RMSE of approximately 12,974.289.However, in the MS and MC categories, MLR's accuracy significantly declined, resulting in an average RMSE of approximately 42,859.676.The MLR models exhibited an average MAE of approximately 15,779.77.
Comparatively, the method incorporating residual redistribution led to a substantial reduction in the mean RMSE and MAE when compared to the method without such a redistribution.Moreover, the residual redistribution method notably enhanced the mean R 2 in the machine learning-based models, although this improvement was not observed in the case of MLR.Overall, the models with residual redistribution outperform those without, underscoring the advantages of incorporating residual redistribution methods.Specifically, the RFATARK models exceled across all scenarios, consistently achieving remarkable precision with an average RMSE below 1000 and an average MAE of approximately 256.458.The SVATARK models also demonstrated strong predictive performance, with an average RMSE of approximately 977.692 and an average MAE of approximately 297.494.While RFATARK and SVATARK possessed similar mean R 2 values, RFATARK exhibited smaller mean RMSE and MAE values, suggesting its superiority in spatializing GDP.In contrast, the MLATARK models displayed relatively lower predictive accuracy, with an average RMSE ranging from approximately 8363.801 to 9174.452 across categories, and an average MAE of approximately 2721.766.As previously discussed, the RFATARK models consistently outperform other methods in terms of predictive accuracy.Notably, the RFATARK model, utilizing the MC group of auxiliary variables, achieved the highest accuracy among all scenarios, with an RMSE value of 402.082, a MAE value of 174.195, and an R 2 value of 0.998.Furthermore, comparing the prediction results of different independent variable groups, it is evident that, apart from the MLR-based method, the classification and factor weighting of OSM variables significantly enhance prediction accuracy.
To further validate the model performances, we compared the RFATARK, SVATARK, and MLATARK results within the MC group with G_GDP using scatter plots (Figure 7).The linearity of RFATARK and SVATARK with statistical GDP is notably superior to that of MLATARK and G_GDP, as evidenced by R 2 values of 0.999, 0.998, 0.443, and 0.877 for RFATARK, SVATARK, MLATARK, and G_GDP, respectively.Additionally, the regression lines for RFATARK and SVATARK closely align with the 1:1 line when compared with statistical GDP, as indicated by the slope of the regression line.Specifically, the slopes of the regression lines for RFATARK, SVATARK, MLATARK, and G_GDP were 1.024, 1.063, 0.388, and 3.653, respectively.In comparison to SVATARK, RFATARK demonstrated the advantage of a slope closer to 1, making it better suited for predicting GDP in economically developed regions, such as Bangkok and its surrounding provinces.Comparatively, the method incorporating residual redistribution led to a substantial reduction in the mean RMSE and MAE when compared to the method without such a redistribution.Moreover, the residual redistribution method notably enhanced the mean R 2 in the machine learning-based models, although this improvement was not observed in the case of MLR.Overall, the models with residual redistribution outperform those without, underscoring the advantages of incorporating residual redistribution methods.Specifically, the RFATARK models exceled across all scenarios, consistently achieving remarkable precision with an average RMSE below 1000 and an average MAE of approximately 256.458.The SVATARK models also demonstrated strong predictive performance, with an average RMSE of approximately 977.692 and an average MAE of approximately 297.494.While RFATARK and SVATARK possessed similar mean R 2 values, RFATARK exhibited smaller mean RMSE and MAE values, suggesting its superiority in spatializing GDP.In contrast, the MLATARK models displayed relatively lower predictive accuracy, with an average RMSE ranging from approximately 8363.801 to 9174.452 across categories, and an average MAE of approximately 2721.766.As previously discussed, the RFA-TARK models consistently outperform other methods in terms of predictive accuracy.Notably, the RFATARK model, utilizing the MC group of auxiliary variables, achieved the highest accuracy among all scenarios, with an RMSE value of 402.082, a MAE value of 174.195, and an R 2 value of 0.998.Furthermore, comparing the prediction results of different independent variable groups, it is evident that, apart from the MLR-based method, the classification and factor weighting of OSM variables significantly enhance prediction accuracy.
To further validate the model performances, we compared the RFATARK, SVATARK, and MLATARK results within the MC group with G_GDP using sca er plots (Figure 7).The linearity of RFATARK and SVATARK with statistical GDP is notably superior to that of MLATARK and G_GDP, as evidenced by R 2 values of 0.999, 0.998, 0.443, and 0.877 for RFATARK, SVATARK, MLATARK, and G_GDP, respectively.Additionally, the regression lines for RFATARK and SVATARK closely align with the 1:1 line when compared with statistical GDP, as indicated by the slope of the regression line.Specifically, the slopes of the regression lines for RFATARK, SVATARK, MLATARK, and G_GDP were 1.024, 1.063, 0.388, and 3.653, respectively.In comparison to SVATARK, RFATARK demonstrated the advantage of a slope closer to 1, making it be er suited for predicting GDP in economically developed regions, such as Bangkok and its surrounding provinces.

Discussion
Gridded GDP data with fine spatial resolution are invaluable for offering detailed insights into the spatial distribution of GDP, facilitating comprehensive socio-economic analyses alongside other rasterized geoscientific big data.In this study, we introduced a hybrid GDP spatialization approach that integrates random forest and geostatistical analysis to downscale statistical GDP data into gridded GDP data, leveraging a diverse range of geospatial big data sources.The trajectory of research in GDP spatialization methods has witnessed a shift from modeling total GDP to sector-specific GDP, from reliance solely on NTL data to incorporating broader geoscientific big data, and from linear GDP prediction models to more sophisticated counterparts.We explored the effects of residual redistribution on prediction accuracy and considered the influence of various potential factors on the accuracy of predictions.

GDP Spatialization for Different Sectors
In contrast to the traditional approach of directly modeling total GDP, we have chosen to separately simulate agricultural GDP and non-agricultural GDP.Historically, NTL data exhibit a strong linear relationship with total GDP, serving as a foundational element for spatializing total GDP [7,23].However, it became evident that the accuracy of linear models for total GDP is significantly influenced by the dominant GDP sector [23].Consequently, scholars increasingly began optimizing GDP spatialization results by developing models for different GDP sectors.
Early investigations suggested that the latitude and the variation coefficient of NTL could be leveraged to partition GDP into agricultural and non-agricultural components, thereby paving the way for future studies [56,57].While some studies have employed NTL data, gridded population data, and the fraction of primary sector GDP in the total GDP of each statistical unit to spatialize primary sector GDP (e.g., agricultural or rural GDP), the varying relationship between population and primary sector GDP within this sector has posed challenges and introduced systematic errors into the results of GDP spatialization.
More recent studies have capitalized on LULC data in conjunction with NTL data to spatialize GDP across different sectors [6,15,24,25,32].This approach allows for the refinement of GDP models for specific sectors, resulting in more realistic GDP distributions.For instance, Chen et al. [34] introduced distinct GDP models for six sectors, four of which fall within the primary sector, while Ustaoglu et al. [33] advocated for the inclusion of vegetation indices in the GDP prediction model to optimize GDP spatialization results.In our study, we combined the strengths of both approaches, considering the robust relationship between population and primary sector GDP while also using LULC data to refine the GDP prediction model.Nevertheless, we chose not to further subdivide GDP into additional sectors due to computational efficiency considerations, as separating secondary and tertiary sectors presents challenges [34].

Data Used for GDP Spatialization
This study incorporated a wide array of geoscientific data types in GDP spatialization, as illustrated in Tables 1-3.Early efforts in GDP spatialization primarily relied on statistical GDP data and NTL data [23,36].This approach was often employed to investigate factors influencing the relationship between NTL and GDP, such as temporal effects [16,22].However, this method faced limitations when spatializing GDP at scales smaller than the smallest statistical unit, such as at the pixel level.This limitation prompted our utilization of a broader array of geoscientific big data, in addition to NTL data.
Supplementing NTL data with other datasets has become commonplace to enhance GDP mapping for different sectors.The strong correlation between population data and agricultural GDP has led researchers to incorporate gridded demographic data, particularly when spatializing agricultural GDP [19,26,29].Additional data often used alongside population data includes settlement patterns, primary sector GDP contributions to total GDP, agricultural product prices, and geographical coordinates [31,58].Location information, encompassing latitude, longitude, and altitude, is also a common auxiliary variable in GDP spatialization [15,27,34,56], with evidence demonstrating the significant influence of geomorphological types detected through altitude on GDP [13].Furthermore, LULC data have frequently been used alongside NTL data for GDP spatialization [24,25,28,32].Specifically, when spatializing agricultural GDP, LULC data have been combined with vegetation indices such as EVI, NDVI, or net primary productivity (NPP) [6,33,34].Recent research in GDP spatialization has incorporated an array of geoscientific big data, including road networks, building information, POIs, and more [15,27,34].
In our study, we employed NTL data, population data, terrain information, LULC data, vegetation indices, LST, road and water information, and POI data to spatialize GDP.The incorporation of diverse geoscientific data sources enables the generation of a more accurate gridded GDP map.However, it is essential to acknowledge that this approach entails challenges related to data acquisition and increased computational demands.Given the constraints of data availability, it is noteworthy that some auxiliary variables, as discussed above, were not utilized in the present experiment.Nevertheless, it is essential to underscore that these excluded variables remain viable candidates for incorporation in future GDP spatialization studies, with the potential to augment the precision of the final gridded GDP predictions, an avenue that merits further exploration.A noteworthy consideration is the observation that the population residing in one pixel may be involved in economic activities in another pixel.Consequently, datasets related to public utility (such as roads and water) and production factor (such as POIs) data may not fully capture the intricate dynamics of economic activity.To improve the practical relevance of the gridded GDP results in governance and business decision-making contexts, future GDP spatialization models could incorporate data specifically addressing the population engaged in economic activities.This approach may yield a more precise representation of the economic landscape by accounting for workforce distribution and employment patterns within distinct geographic units.Through the inclusion of data on the population actively contributing to economic productivity, the resulting gridded GDP model has the potential to provide insights more directly applicable to governmental policy formulation and enterprise-level decision-making processes.
Moreover, our study explores various combinations of auxiliary variables, yielding distinct groupings based on data processing, which are denoted as UMS, UMC, MS, and MC groups.Within these four distinct groups, the MC and MS groups exhibit the lowest number of variables, followed by the UMC group, while the UMS group boasts the highest number of variables.Notably, across different scaling experiments, the results obtained using the MC variable group consistently manifested superior accuracy.This observation underscores the notion that a greater abundance of auxiliary variables does not necessarily translate into heightened predictive accuracy.This phenomenon can be attributed, in part, to the possibility that the introduction of additional variables may amplify the impact of error, potentially outweighing their contribution to the modeling process.Conversely, the attribute integration process takes into consideration the unique characteristics of these variables, potentially leading to more accurate representations of their underlying conditions.For instance, by factoring in road network classifications and applying weighted integration techniques, we are able to capture a more precise overall depiction of the road network than would be achievable by analyzing each road network attribute in isolation.

GDP Spatialization Methods
Initially, GDP spatialization relied on linear models [7,23].However, Ghosh et al. [29] highlighted the need to consider variations in the linear NTL-GDP relationship among administrative units.Simple linear models exhibited substantial GDP precision errors.To enhance accuracy, some scholars proposed using "lit-population," calculated as the product of NTL and population, as an independent variable in linear models [26,58].It is worth noting that simple linear models are ill-suited for simulating GDP across different sectors.
Building upon simple linear models, Doll et al. [10] introduced the log-linear model, later expanded by Henderson et al. [59], to incorporate cross-sectional and temporal effects.Additional models capable of disaggregating GDP into sectors, such as log-quadratic and multivariate log-linear models [19], have emerged.Zhang and Gibson [22] extended this framework to distinguish unobserved time-invariant features within cross-sectional units and spatially invariant features across time periods.While widely used to investigate NTL-GDP relationships, these models exhibit limitations in GDP spatialization accuracy.
Various regression models, including quadratic polynomial, power, exponential, stepwise multiple regression, multivariate linear regression, and fractional multivariate models, have demonstrated superior performance compared to simple linear models [6,13,24,27,33].Notably, the latter three models incorporate not only NTL data but also diverse geographical data for GDP forecasting.
While the discussed models offer valuable insights, they fall under the category of traditional statistical approaches and have limitations.Recent advancements in GDP spatialization have seen the adoption of artificial intelligence algorithms, such as RF regression and deep learning models [15,27,34,49].These intelligent models have demonstrated remarkable capabilities in simulating GDP, primarily attributed to their capacity to harness extensive geographical data and deploy sophisticated modeling techniques.However, it is essential to acknowledge a significant drawback associated with these artificial intelligence models-their 'black box' nature-which limits interpretability.
In our study, we adopted the RF regression model for GDP forecasting.Additionally, we introduced a hybrid approach involving the incorporation of a residual distribution based on geostatistical interpolation.Previous studies have utilized residuals solely to gauge model accuracy.Some of these studies, following accuracy assessments, applied residuals at the target spatial scale to attain unbiased results at the original spatial scale [27,33].However, this residual distribution method merely corrects spatial residuals by adjusting the ratio of actual GDP to forecasted GDP, without accounting for the challenge posed by modifiable area units.In contrast, our method employs a geostatistical approach, known as ATAK, to effectively address the issue of modifiable area units, resulting in improved GDP spatialization outcomes compared to models lacking residual distribution.

Impact of Residual Predictions
In the introduction and the preceding section, we previously highlighted the potential enhancement in overall prediction accuracy through the integration of residual prediction.Due to the lack of detailed statistical GDP data at finer administrative levels, such as city and street-level GDP data in our experimental setup, we utilized the most detailed available provincial-level statistical GDP data for validation purposes.Considering that residual predictions were also based on the redistribution within provincial administrative units, theoretically, the validation of results at the provincial level, combined with residual predictions, should outperform downscaling models that do not consider residuals.This theoretical advantage is reflected in the results and comparative analysis.As indicated in the previous sections, the RFATARK method utilizing the auxiliary variables of MC group demonstrated the highest accuracy in downscaled GDP predictions.To further illustrate the superiority of the proposed RFATARK method, we conducted additional comparisons with three downscaling models under the MC auxiliary variable group.One model employed random forest inverse distance weighting (RFIDW), utilizing inverse distance weighting for residual redistribution.The other two models incorporated additional auxiliary variables into the random forest and XGBoost models (denoted as RF_MC and XGBoost), respectively, by adding latitude and longitude information into the MC auxiliary variable group.Figure 8 displays the downscaled GDP spatial distribution maps for these three methods, and Table 5 presents their accuracy assessment results.dataset used in our experiment, we proposed the RFATARK downscaling method which is a combination of RF and ATAK.In future applications, the XGBoost method can be integrated with ATAK residual prediction for downscaling, following the same steps as the RFATARK method but using the XGBoost method during regression modeling.

Additional Factors Potentially Impacting Accuracy
The accuracy of downscaling predictions is influenced by the model.In the process of downscaling modeling, it is commonly assumed that the model parameters of the target variable and auxiliary variables remain constant at both coarse and finer scales.However, due to the presence of scale effects, the relationship models between variables dynamically change with scale [65].The greater the differences in scale, the poorer the applicability of the model may become.In this experiment, due to the lack of more detailed administrative units such as counties or streets in the statistical GDP data, provincial-level GDP statistical data were utilized for modeling.As there is a significant gap between the provincial level and the target scale of 1 km, the accuracy of the model may be compromised when applying the province-level GDP model to 1 km grids.Subsequently, in the process of correcting the model simulation results, as the remaining provincial data needs to be used as validation data, regional GDP statistics were employed for correction.The difference in scale between the regional level and the provincial level, as well as the kilometer grid scale, could have negative implications for result accuracy.In future research, obtaining statistical data at multiple scales would be advantageous for enhancing modeling effectiveness.
Additionally, the accuracy of variables involved in the modeling process also affects the accuracy of downscaling predictions.The errors in variables directly impact the modeling process and accumulate as the model progresses.On one hand, variables exhibit certain observation errors, and on the other hand, they have representativeness errors.In particular, for socio-economic variables, the current pixel-based partitioning (i.e., 1 km grids) may inadvertently lead to the truncation of certain production factors as they might be divided by pixel contours in space.This fragmentation of production factors could affect the accuracy of calculations, especially when attempting to capture the overall economic functions of a given spatial unit.In future research, it might be beneficial to consider using analysis units with greater spatial coherence to better capture subtle economic dynamics within each region.This approach could also provide a more realistic representation of the interconnections between economic factors that extend beyond pixel boundaries, thereby improving the overall accuracy and reliability of modeling results.For example, reorganizing pixels into larger, more coherent regions that serve as subdivisions of provinces could ensure spatial continuity of economic activities and allow for a more accurate description of the intricate relationships between production factors, providing a more realistic representation of the economic landscape within each region.

Conclusions
This paper introduced a hybrid GDP spatialization method, which employs a RF regression model in conjunction with the ATAK technique, and utilizes multi-source data as auxiliary information.The proposed approach entails the construction of distinct RF models for the agricultural and non-agricultural sectors, allowing us to capture the nonlinear relationships between independent variables and the statistical GDP pertaining to diverse sectors.Subsequently, the application of ATAK facilitates the distribution of residuals in the GDP prediction process.This hybrid approach, which has not been extensively explored in previous GDP spatialization studies, plays a pivotal role in refining the spatial distribution of GDP predictions.The integration of diverse geoscientific data sources, coupled with the proposed methodological framework, has yielded significant improvements in the predictive capabilities of GDP spatialization.Our case study focusing on Thailand's GDP underscores the efficacy of our approach, providing a more accurate and intuitive visualization of GDP distribution.Additionally, we have examined the utility of four distinct sets of covariates in the downscaling process, with the MC group demonstrating superior informativeness compared to the UMS, UMC, and MS groups, thereby yielding the most accurate downscaled predictions.The methodology introduced in this study can be extended to downscale GDP data for additional time periods, as well as to finer resolution grids, through the incorporation of auxiliary information corresponding to the desired resolution.The resulting gridded GDP map serves as a valuable scientific resource for government bodies and organizations, facilitating the formulation of informed development strategies.

Figure 1 .
Figure 1.Elevation and regional divisions (a), LULC (b), agriculture statistical GDP (c), and nonagriculture statistical GDP (d) of Thailand at the province level.(GDP in constant 2011 international USD).

Figure 1 .
Figure 1.Elevation and regional divisions (a), LULC (b), agriculture statistical GDP (c), and nonagriculture statistical GDP (d) of Thailand at the province level.(GDP in constant 2011 international USD).

Figure 2 .
Figure 2. The result of KDE bandwidth determination.

Figure 2 .
Figure 2. The result of KDE bandwidth determination.

Figure 3 .
Figure 3. Tuning curves of random forest modeling for agricultural and non-agricultural GDP with four different scenarios.(The horizontal coordinate represents mtree).

Figure 3 .
Figure 3. Tuning curves of random forest modeling for agricultural and non-agricultural GDP with four different scenarios.(The horizontal coordinate represents mtree).

Figure 5 .
Figure 5.The spatial distribution of downscaled GDP using RF, SVR, and MLR methods.

Figure 5 .
Figure 5.The spatial distribution of downscaled GDP using RF, SVR, and MLR methods.
ISPRS Int.J. Geo-Inf.2023, 12, x FOR PEER REVIEW 16 of 25 RMSE of approximately 42,859.676.The MLR models exhibited an average MAE of approximately 15,779.77.

Figure 7 .
Figure 7.The linear relationship between statistical GDP and spatialized GDP at the provincial scale.GDP derived by (a) RFATARK, (b) SVATARK, (c) MLATARK using MC group of auxiliary variables, and (d) G_GDP product.The black dashed line represents the 1:1 line, and the orange line represents the line fitted through the scatter points.

Table 1 .
Comprehensive datasets utilized in the present study.

Table 2 .
Introduction and weighting of OSM auxiliary variables for both road and water data.

Table 3 .
Introduction and weighting of OSM auxiliary variables for POI data.

Table 2 .
Introduction and weighting of OSM auxiliary variables for both road and water data.

Table 3 .
Introduction and weighting of OSM auxiliary variables for POI data.

Table 4 .
Performance of the different models based on three statistical measures between the estimated and actual GDP in province level.The RMSE and MAE values are in 10 7 USD.Bolding indicates the situation with the highest precision.