Subsurface Topographic Modeling Using Geospatial and Data-Driven Algorithms

Abstract: Infrastructure plays an important role in urbanization and economic activity but remains vulnerable. Because accurate subsurface infrastructure maps are rarely available, sustainability and resilience concerns are often poorly addressed. In this paper, a 3D topographical predictive model based on distributed geospatial data and evolutionary gene expression programming (GEP) was developed and applied to a concrete-face rockfill dam (CFRD) in Guilan Province, northern Iran, to generate the spatial variation of the subsurface bedrock topography. Comparing the proficiency of the GEP model with geostatistical ordinary kriging (OK) using different analytical indexes showed 82.53% accuracy performance and a 9.61% improvement in correctly labeled data. These results imply that the retrieved GEP model can efficiently provide sufficiently accurate predictions and consequently improve the visualized insights linking natural and engineering concerns. Accordingly, the generated subsurface bedrock model offers valuable information on the stability of structures and hydrogeological properties, supporting the adoption of appropriate foundations.


Introduction
Increased access to timely and well-documented geospatial data has led to the application of diverse, empowered predictive data mining approaches [1] that address knowledge extraction problems through more sensible information processing paradigms [2][3][4]. In recent years, predictive geospatial models based on remote sensing, combined with field survey data on a GIS platform, have been used to interpret different variants of geo-objects of interest [5][6][7][8][9]. Despite the drawbacks of GIS as a parametric modeling tool [10,11], its incorporation with innovative intelligence approaches has shown a significant degree of success for a series of emerging environmental phenomena on the ground surface [12][13][14][15][16]. However, developing such models from geospatial resources for subsurface investigation is a difficult and cumbersome task, owing to limited data, the need for exploratory interpolation tools, and the complexity of the prospected geo-objects [17,18].
In the subsurface mapping of infrastructure such as dams, analyses of thematic and surveyed geospatial data play a fundamental role because of the complexity of both the design and construction phases. Surface geospatial data help find the best site locations with minimum impact on surrounding areas [19,20], while subsurface geospatial data provide more insight into the construction process and the structural elements required for safety. This implies the great importance of geospatial analyses both on and under the ground surface. Thereby, in addition to traditional decision-making methods [21], different intelligence systems have been integrated to improve geospatial data analyses for selecting dam site locations [22,23]. Such studies use a variety of geospatial layers, including geological formations, soil type and erosion, fault lines, the digital elevation model (DEM), slope, groundwater, rainfall, water discharge, land use/cover, road networks, dominant urban areas, seismicity patterns, landslide scars, and dissolved solids. During construction, however, the subsurface spatial data describing topographical conditions, foundation concerns, the groundwater table, and the availability of materials are of greater interest [24,25]. This implies the necessity of an integrated, up-to-date geospatial-survey database for analyzing the associated problems and speeding up workflow phases at the infrastructure scale [26]. However, building such a geospatial data repository through traditional collection is typically a time-consuming process [27,28]. Figure 1 shows the typical steps in providing an appropriate geospatial database for a knowledge-based predictive model. Such a geospatial database involves a wide variety of geo-information, remotely sensed products/images, and site surveys.
Depending on the required attributes, the geospatial database can cover the DEM, land use, soil type distribution, topographical indices, surface geo-data, the drainage pattern of the catchment area, and slope stability, as well as probable natural hazard risks. These incorporations yield significant economic, social, and environmental benefits for future urbanization and industrial development on both local and regional scales [27]. Depth to bedrock (DTB) is one of the most relevant subsurface attributes in the planning and construction of dams because of its significant influence on stability performance and land surface processes [22,23,26]. This attribute (Figure 2) provides knowledge of the spatial variation and topographical interface between unconsolidated sediments and the stiff layer, as well as environmental insight into the migration of contaminants following the bedrock gradient [29][30][31]. From an economic perspective, the DTB can significantly affect initial project costs through damage, rippability, and excavation volume, as well as unnecessary mitigation measures. In foundation design, shallow bedrock requires less overburden removal for shoring, while deeper bedrock can be a liability because of lateral variations in soil characteristics [32] and, consequently, footing and slab failures if not accommodated.
According to the literature, the risk of a dam project can be reduced through an efficient yet accurate 3D subsurface spatial DTB model [33][34][35]. However, creating an accurate, continuous predictive spatial DTB model is usually beyond the ability of traditional methods because of the sparse density of surveyed points, imprecise physical data on the material formations, and the associated uncertainties [17,18,36]. Furthermore, the observed conflicts among DTB predictions produced by individual interpolation algorithms [17] clearly explain why it is not always obvious which method provides the most appropriate model. Gene expression programming (GEP) is an evolutionary computational algorithm for optimization problems, closely related to genetic algorithms and genetic programming [37]. Despite the success of computational intelligence techniques in overcoming the demerits of DTB models [17,38,39], no distinguished work using GEP for subsurface DTB modeling has been reported. Moreover, because of its great influence on dam stability in highly seismic zones such as northern Iran, the variability of DTB should be estimated as accurately as possible [40][41][42][43]. This gap motivates an appropriate alternative with superior predictability: a predictive 3D spatial DTB model using GEP. The approach was applied to a dam site in Guilan Province, northern Iran, where a highly accurate quantitative model is of great interest.
In the current study, the geospatial database was built using the first three steps presented in Figure 1 from the retrieved information of 63 compiled boreholes, the DEM, the geology map, and spatial geographical coordinates. The structure of the optimum GEP model was captured through programming in C++, with the ability to check a wide variety of internal characteristics. It was then compared with the conventional ordinary kriging (OK) technique and evaluated using different performance metrics and uncertainty analyses. The GEP model achieved 82.53% accuracy with R² = 0.97, compared with R² = 0.92 and 74.6% for OK, revealing superior performance in predicting DTB. These predictions can help reduce the number of boreholes and the corresponding costs.

Principle of GEP
GEP [37], like genetic programming (GP) [44], is an evolutionary intelligent paradigm that originated from genetic algorithms (GAs) [45]. The main difference between GAs, GP, and GEP lies in the individuals they employ: GAs use linear strings of fixed length (chromosomes), while GP works with nonlinear entities of different sizes and shapes (parse trees). In GEP, these characteristics are combined: encoded linear strings of fixed length (the genome or chromosomes) express nonlinear entities of different sizes and shapes, i.e., simple diagram representations or expression trees (ETs).
Referring to Figure 3, GEP starts with a randomly generated initial population (chromosomes) and aims to select and reproduce individuals using genetic operators and a calculated fitness function (FT). The appropriate solution is then found through an iterative procedure comprising three main genetic operators: mutation, transposition, and recombination. If the termination criterion (number of generations) is not reached, the process reproduces new genes using a roulette wheel [46]. Thereby, in each iteration a chromosome may be modified by none, one, or several operators, which select individuals at random. In the modification process, the mutation operator is essentially the most effective because it can occur at any level of the chromosome. An appropriate value for this operator can be selected within the interval [0.01-0.1] [37,46]. The transposition operator internally replaces some consecutive elements of the chromosome. It can be performed using gene, insertion sequence (IS), or root IS (RIS) procedures. IS transposition copies a randomly selected segment of consecutive elements into the head of a gene (anywhere except the root), while RIS places the segment at the root of the head. In gene transposition, an entire gene (any except the first) is moved to the beginning of the chromosome. For transposition, a value in the domain [0.01-0.1] is recommended [46]. Subsequently, the information of two parent chromosomes is exchanged through recombination to generate two offspring. New populations of the same size as the parents, but in the form of different sub-ETs, are thus generated, which are then evaluated using the FT. Each gene of the evolved entities is a fixed-length string over different alphabets, comprising a head (functions and terminals) and a tail (terminals only).
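As an illustration of these operators, the minimal Python sketch below applies point mutation, IS transposition, and one-point recombination to a single fixed-length chromosome. The function/terminal sets, head length, and function names are hypothetical choices for illustration, not the configuration of the paper's C++ implementation.

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]       # allowed only in the head
TERMINALS = ["d0", "d1", "d2"]         # allowed anywhere
HEAD = 8                               # head length h (user-defined)
TAIL = HEAD * (2 - 1) + 1              # t = h(n - 1) + 1 with max arity n = 2

def random_chromosome():
    head = [random.choice(FUNCTIONS + TERMINALS) for _ in range(HEAD)]
    tail = [random.choice(TERMINALS) for _ in range(TAIL)]
    return head + tail

def mutate(chrom, rate=0.05):
    """Point mutation: any position may change, but the tail stays terminal-only."""
    out = list(chrom)
    for i in range(len(out)):
        if random.random() < rate:
            pool = FUNCTIONS + TERMINALS if i < HEAD else TERMINALS
            out[i] = random.choice(pool)
    return out

def is_transpose(chrom, max_len=3):
    """IS transposition: copy a short segment into the head (never position 0,
    the root), then truncate the head back to its fixed length."""
    out = list(chrom)
    start = random.randrange(len(out))
    seg = out[start:start + max_len]
    target = random.randrange(1, HEAD)
    head = out[:HEAD]
    head = head[:target] + seg + head[target:]
    return head[:HEAD] + out[HEAD:]     # chromosome length is preserved

def one_point_recombine(a, b):
    """Exchange the material of two parents to produce two equal-length offspring."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```

Because head and tail regions are position-wise, every operator above returns a syntactically valid chromosome, which is the property that distinguishes GEP from plain GP tree manipulation.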
Considering the head length (h) as a user-defined parameter, the length of the tail (t) is determined as:

t = h (n − 1) + 1

where n denotes the maximum number of arguments (arity) of the functions. Therefore, the root of the ET (position 0) is always filled by a functional operator, and appropriate functions, depending on the nature of the problem, are then attached as branches. The inputs of the model are presented in the terminals in the form of the included variables and numerical constants. Proper functions can be selected from the standard mathematical operators (e.g., +, −, *, /, sqrt, exp, log, ln, sin, cos, etc.) and logical operators (e.g., and, or, not, nor, etc.), as well as Boolean or user-defined relations. The corresponding ET is then translated using the Karva language [37], as shown in Figure 4.
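The Karva translation can be sketched as follows: the gene's symbols fill the expression tree in breadth-first order, and trailing symbols that are never reached are simply ignored. The symbol set and the protected-division rule below are illustrative assumptions, not the actual configuration used in the study.

```python
import operator

ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": lambda a, b: a / b if b != 0 else 1.0}  # protected division (assumed)

def evaluate_kexpr(gene, env):
    """Evaluate a Karva gene (K-expression): symbols fill the expression tree
    level by level, each node consuming the next unread symbols as children.
    `env` maps terminal names to numeric values."""
    nodes = [[gene[0], []]]
    queue = [nodes[0]]
    pos = 1
    while queue:
        node = queue.pop(0)
        for _ in range(ARITY.get(node[0], 0)):
            child = [gene[pos], []]
            pos += 1
            node[1].append(child)
            queue.append(child)
    # Symbols past `pos` are unused, mirroring GEP's open reading frame.

    def ev(node):
        sym, kids = node
        if sym in OPS:
            return OPS[sym](ev(kids[0]), ev(kids[1]))
        return env[sym]
    return ev(nodes[0])
```

For example, the gene `+ * a b c` decodes to the tree `(b * c) + a`, so with a = 1, b = 2, c = 3 it evaluates to 7.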

Study Area and Geospatial Database
Guilan Province lies along the Caspian Sea in northern Iran and is characterized by a humid subtropical climate with high precipitation. Moreover, part of this territory falls in the mountainous, seismically active Alborz seismotectonic province, which has endured several destructive earthquakes, such as the 1990 Manjil-Roudbar event (Ms 7.3). As presented in Figure 5, the Bijar CFRD is situated between the cities of Roudbar and Rasht in Guilan Province, with a 105 Mm³ nominal reservoir capacity, 430 × 10 m crest dimensions, and a height of 59.4 m from the basement. To provide the geo-database, a 10 × 10 m resolution DEM of the study area (Figure 5) and the corresponding slope map were created. Geological and rainfall information from the Geological Survey and the Meteorological Organization of Iran were added to the database.
The level of overlying sediments, the depth, and the geographical coordinates of 63 sparsely drilled boreholes, together with knowledge-based pseudo-observations, were also processed and stored in the database. These documented geospatial data were then used to build the DTB predictive topographical model. Referring to Figure 5, the elevation of the area varies from 130 m to more than 430 m.

DTB Modelling Using GEP
As in most intelligence algorithms, the acquired data should be randomized, with user-defined percentages assigned to each subset. Randomization controls lurking variables and eliminates possible biases that may arise in the experiment. In this paper, 65%, 20%, and 15% of the data were used to build the training, testing, and validation sets, respectively. To increase the accuracy of the modeling process and gain more insight into areas lacking data, the expert knowledge-based approach presented in [17] was applied. The most appropriate numbers of chromosomes and genes for the predictive GEP model were captured, using the root mean square error (RMSE) as the FT, through a combination of trial-and-error and constructive techniques [32,47]. Accordingly, the variation of RMSE against the achieved R² leads to an optimum number of chromosomes, and the predictive GEP structure was selected as the outcome of systematic feedback loops over changing internal characteristics. Each generated model was checked and controlled to ensure it served the required needs. The explained procedure was programmed in C++. Table 1 reflects the ranges of the applied parameters in model adjustment. Figure 6 shows a sample of the executed models and their compared performances as functions of the number of chromosomes and the head size. Such parametric investigations revealed that a topology of 35-8-3, expressing the number of chromosomes, the head size, and the number of genes, and thus the assigned ETs (Figure 7), can be selected as optimal. Accordingly, the ETs of the GEP model yield the mathematical relation as: The predictability of the GEP model on the employed data is presented in Figure 8. Subsequently, the spatial DTB model of the area was created from the DEM pixels (Figure 5) comprising the latitude (d0), longitude (d1), and elevation (d3) and is presented in Figure 9.
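A minimal sketch of the randomized 65/20/15 split and the RMSE fitness described above, assuming a plain list of borehole records; the function names and the seed are hypothetical.

```python
import math
import random

def split_dataset(records, fractions=(0.65, 0.20, 0.15), seed=42):
    """Shuffle once, then slice into training / testing / validation sets."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

def rmse(observed, predicted):
    """Root mean square error, used here as the GEP fitness function (FT)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))
```

With the 63 boreholes of this study, such a split yields roughly 40, 12, and 11 records for training, testing, and validation.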

Comparison, Validation, and Discussion
In the subsurface mapping of infrastructure, analyses of thematic and surveyed geospatial data play a fundamental role because of the complexity of both the design and construction phases. Such modeling is a complex task for GIS technology; however, the spatial nature of geo-objects always drives GIS to be part of the modeling system.
Dam projects, owing to their vulnerability and susceptibility, require information on subsurface conditions that is as accurate as possible. Accordingly, the unavailability of sufficiently accurate subsurface maps can create a challenging situation and increase exposure to unconsidered and unexpected problems during planning and construction. Moreover, autocorrelation, heterogeneity, limited ground truth, and multiple scales and resolutions also pose unique challenges [48] that violate common assumptions of many traditional predictive modeling methods [49]. As most exploratory boreholes are randomly distributed and may not be representative of the study area, selecting the interpolation method giving the closest approximation to known parameters is usually difficult but imperative [50]. Therefore, predictive spatial models are developed to increase the resolution of the response variable based on explanatory features. Since systematic underground surveys are critical, subsurface mapping and modeling help mitigate the risk of utility-congested scenarios. Combining innovative technologies such as BIM was also found to be critical in reducing the engineering risks associated with subsurface construction sites [10,11].
From an engineering perspective, generated models need to be validated and checked for compliance and system realization. This process aims to ensure that the produced model meets the accuracy and operational criteria. In this paper, the results were compared with the OK geostatistical technique; accuracy performance was assessed using a confusion matrix and statistical error criteria, and uncertainty analyses were carried out through confidence and prediction intervals (CI and PI).
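The CI/PI idea can be sketched as follows, under the simplifying assumption of roughly normal residuals and a fixed 95% z-value; the paper's actual uncertainty analysis may differ in detail.

```python
import math
import statistics

def interval_bands(observed, predicted, z=1.96):
    """Approximate 95% confidence and prediction half-widths from model
    residuals, assuming roughly normal errors (a deliberate simplification)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    bias = statistics.mean(residuals)
    s = statistics.stdev(residuals)
    ci_half = z * s / math.sqrt(n)           # uncertainty of the mean error (CI)
    pi_half = z * s * math.sqrt(1 + 1 / n)   # spread of a single future prediction (PI)
    return bias, ci_half, pi_half
```

A wider CI band signals a less stable model, which is exactly the comparison criterion applied to GEP versus OK here.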
Kriging [51] is a recognized geostatistical interpolation technique for the unknown values of spatial and temporal variables. Governed by prior covariances through Gaussian process regression, the algorithm provides the best practical unbiased prediction for intermediate values [52]. Among the different types of kriging, OK can implicitly evaluate the mean in a moving neighborhood as well as estimate a block value [53]. The method minimizes the estimation variance through optimal weights that reduce the error [45], by solving:

Z*(x0) = Σi λi Z(xi), subject to Σi λi = 1

Σj λj γ(xi, xj) + μ = γ(xi, x0), i = 1, …, n

where γ is the variogram value; Z(xi) and Z*(x0) denote the measured value at location xi and the estimated value at the target location x0, respectively; λi is the assigned weighting coefficient; and μ is the Lagrange factor. The performance of and comparison between the OK and GEP models on the randomized datasets are shown in Figure 10A. The level of the CI gives a visual sense of the long-term success of the method in capturing the DTB, while the PI shows the certain probability of a future estimate. The stability and capacity of the GEP and OK predictive models on the employed data are reflected in Figure 10B. Statistically, a predictive model is in control if its outputs fall within certain ranges; the wider the CI, the more unstable the relation. The residual, as the difference between the observed and predicted values, indicates the fitting deviation and is presented in Figure 10C. A visual comparison describing the predictability levels of OK and GEP using a confusion matrix [54] is given in Table 2, where each array element indicates the number of correctly labeled outputs. The results showed 82.53% accurate data labeling for GEP, a 9.61% improvement over OK. Perspectives of the 3D generated models are presented in Figure 11.
Figure 11. Visualized insights into the predicted spatial DTB models using the optimum GEP and OK.
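A minimal ordinary kriging sketch following the system above, using only the standard library; the spherical variogram and its nugget/sill/range values are illustrative assumptions, not the fitted model of this study.

```python
import math

def variogram(h, nugget=0.0, sill=1.0, rng=500.0):
    """Spherical variogram model (an assumed choice for illustration)."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + sill
    r = h / rng
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)

def solve(A, b):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ok_predict(points, values, target):
    """Ordinary kriging at `target`: solve
    sum_j lam_j * gamma(xi, xj) + mu = gamma(xi, x0) with sum_j lam_j = 1."""
    n = len(points)
    A = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    A.append([1.0] * n + [0.0])          # unbiasedness constraint row
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    w = solve(A, b)                      # n weights followed by mu
    return sum(w[i] * values[i] for i in range(n))
```

Note that with a nugget of zero, OK is an exact interpolator: predicting at a borehole location returns the measured value there.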
The performance of a predictive model with quantitative outputs can be scaled and compared using statistical error metrics such as the mean absolute percentage error (MAPE), RMSE, R², and residuals. MAPE is one of the most popular indexes for describing the accuracy and size of the forecasting error, while the residual represents the deviation of a predicted value from the measured one. RMSE expresses the prediction accuracy of a model and is an appropriate estimator of the standard deviation of the error distribution. R² reflects the goodness of fit of the prediction. As presented in Figure 12, higher values of R² as well as smaller MAPE, residuals, and RMSE are interpreted as more accurate levels (Figure 11).
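These four metrics can be computed as in the following sketch (MAPE assumes nonzero observed values); the function name and dictionary layout are arbitrary choices.

```python
import math

def error_metrics(observed, predicted):
    """MAPE, RMSE, R2, and residuals for paired observations/predictions."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    # MAPE: mean absolute percentage error (observed values must be nonzero).
    mape = 100.0 / n * sum(abs(r / o) for r, o in zip(residuals, observed))
    rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)
    # R2: 1 minus the ratio of residual to total sum of squares.
    mean_obs = sum(observed) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAPE": mape, "RMSE": rmse, "R2": r2, "residuals": residuals}
```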

Concluding Remarks
Developing sufficiently accurate subsurface topographical predictive models can address new challenges at both the infrastructure and urban scales, since unforeseen subsurface conditions arising from incomplete information can incur large indirect costs. On a regional scale, 3D models can fully tie geospatial databases into subsurface models. This leads to a higher-quality database, superior knowledge, and thus more visualized insight into new opportunities. It also enables more tailored planning and management of increased subsurface use and thus wider, long-term benefits.
These concerns stimulated checking the performance of the GEP evolutionary algorithm in subsurface analyses, where high-resolution and more accurate predictive DTB models are in significant demand. The capabilities of the GEP programmed in C++ were pursued at the Bijar CFRD, northern Iran, and compared with OK. According to the obtained results, the GEP model showed a 9.61% improvement over OK in DTB predictability. The analyzed control performance in terms of MAPE, RMSE, residuals, and R² reflected the superior performance of GEP over OK. Referring to the established CI and PI, more stability in the model, and thus in future predictions, is expected from GEP rather than OK. The compared results revealed that OK over/underestimates more than GEP and thus can generate unrepresentative interpolations over the entire study area. This implies that GEP efficiently provides a more cost-effective and sufficiently accurate prediction of the DTB.
From a practical perspective, DTB mapping can help distinguish infill sediments. It can be applied to modeling landscape evolution, natural hazard assessment, and groundwater evaluation, helping decision makers assess risk patterns and promote new responses that detect out-of-the-ordinary behavior before serious damage is inflicted.

Conflicts of Interest:
The authors declare no conflict of interest.