Geographic modeling is a fundamental methodology for understanding, simulating, and predicting geographic phenomena and processes within a certain context [1]. A crucial step in geographic modeling is preparing input data for geographic models. Input data, including preliminary (raw) input data (e.g., a digital elevation model (DEM)) and derived information (e.g., topographic properties such as slope and area), are a prerequisite for model setup, calibration, and validation, and their quantity and quality directly affect simulation results [1]. Insufficient or inappropriate input data (e.g., a lack of observations or an inappropriate DEM resolution) might limit both the accuracy of the model results and the applications of geographic models [9].
Input data preparation in geographic modeling is particularly challenging because the input data needed by geographic models are often obtained from distributed data sources and are syntactically and semantically heterogeneous [1]. Modelers have to assess the quality, relevance, and suitability of preliminary data for geographic modeling. They then need to select and compose a set of applicable and compatible data pre-processing algorithms and their implementations, such as web services, to prepare the needed input data. In most cases of geographic modeling, these data preparation steps contain many operations that are repeated manually. Consequently, considerable time, expertise, and effort are required to set up a new model application, which restricts the reproducibility of previous studies, particularly for non-expert stakeholders (e.g., policymakers from local government) [5].
Integrated modeling environments (IMEs) have been proposed as efficient and convenient tools for sharing, reusing, integrating, and running heterogeneous geographic models [1]. IMEs are shifting the application model of geographic modeling from centralized desktop software systems to distributed and service-oriented online geoprocessing platforms [20]. In addition, IMEs are increasingly using advanced computing technologies, such as parallel computing and cloud computing, to meet the computation requirements of large-scale and complex geographic models in the big data era [23].
However, most IME studies have focused on developing new models or on sharing and coupling existing models and modules [1]. Input data preparation for geographic modeling in IMEs still depends heavily on modelers’ modeling knowledge (including knowledge of the geographical domain, knowledge of geographic models and their input/output data, prior modeling experience, and technical expertise). This situation not only reduces modeling efficiency and the applicability of IMEs, but might also lead to untrustworthy model results [11]. The situation is becoming harder to avoid because geographic models are growing more complicated as research trends toward integrating multiple factors, processes, and scales [4]. Therefore, methods that can prepare appropriate input data for geographic models in a user-friendly and efficient way are urgently needed for IMEs.
To address problems related to input data preparation for geographic models, a variety of methods have been proposed. Based on artificial intelligence (AI) technologies such as ontologies, logical reasoning, case-based reasoning (CBR), and AI planning, these methods aim to provide an automatic and intelligent way to discover data and the necessary pre-processing applications (e.g., web services) for geographic models [32]. Using these methods, the time, expertise, and prior experience required to prepare model input data can be reduced significantly, which improves the efficiency and effectiveness of geographic modeling.
In this paper, we systematically review the state-of-the-art methods for preparing input data for geographic models and recommend areas for future study. The remainder of this paper is structured as follows: Section 2 provides an analysis of the factors that influence data preparation for geographic models, and Section 3 outlines the corresponding key tasks that should be accomplished. In Section 4, existing input data preparation methods are classified into three categories: manual, (semi-)automatic, and intelligent (i.e., not only (semi-)automatic but also adaptive to the application context) methods. Each category is then discussed according to the influencing factors and key tasks. Section 5 discusses future research directions of intelligent input data preparation methods for geographic models and their integration with IMEs. The last section provides a summary of this review.
5. Future Research Directions
Although many methods have been proposed to improve the efficiency and accuracy of input data preparation and to minimize the need for extensive modeler expertise, the three key tasks presented in Section 3 are still far from being accomplished. As geographic models are becoming increasingly complicated through integrating sub-models from diverse domains [4], input data preparation now requires more time, modeling knowledge, and technical expertise than ever. Meanwhile, increasing numbers of cross-domain stakeholders are engaged in geographic modeling [1]. The need for user-friendly and intelligent input data preparation methods and tools is therefore becoming increasingly urgent.
To fill the gap between the existing methods and the requirement for a highly intelligent and easy-to-use input data preparation environment, knowledge-driven and service-oriented methods for IMEs must be developed. These methods should be able to use domain knowledge and prior experience to solve new modeling problems, automatically discover and pre-process (or reuse) application-context-matching input data for geographic models from distributed data sources, and report the uncertainty of the automatically recommended solutions. Such methods will make IMEs easier to use and more effective for modelers.
To this end, we recommend the following research priorities in input data preparation for geographic models:
Publishing, sharing, and reusing model data and data pre-processing workflows. Data involved in geographic modeling can be classified into four types: preliminary data, intermediate data (processing results used by subsequent steps), prepared input data, and simulation results. Publishing, sharing, and reusing these data and the corresponding workflows could avoid repetitive work in the data pre-processing steps for preparing input data, thus reducing errors and supporting collaboration and reproducibility. This has been demonstrated by several hydrological model data sharing platforms [106] and workflow building environments [73]. However, a unified, semantically rich, and machine-understandable metadata framework for publishing model data and workflows is still lacking. Thus, it is difficult to efficiently discover and reuse multi-source, heterogeneous data and workflows. In addition, because current sharing platforms are isolated from IMEs, a considerable amount of manual work is required to exchange data between these platforms and IMEs. To solve these problems, web service and semantic web technologies could be used to reduce syntactic and semantic heterogeneities between the data of these platforms and IMEs.
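As an illustration only, the following Python sketch shows what a machine-readable metadata record for shared model data might look like and how it could support discovery of reusable intermediate results. The `DataRecord` class, its field names, the `find_reusable` helper, and the example catalog are all hypothetical; they are not taken from any existing platform or metadata standard.

```python
from dataclasses import dataclass, field, asdict
import json

# The four data types distinguished in the text.
DATA_TYPES = {"preliminary", "intermediate", "prepared_input", "simulation_result"}


@dataclass
class DataRecord:
    """Hypothetical metadata record for one shared dataset."""
    identifier: str
    data_type: str        # one of DATA_TYPES
    variable: str         # semantic label, e.g. "slope"
    study_area: str
    resolution_m: float
    produced_by: list = field(default_factory=list)  # ordered workflow step names

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")


def find_reusable(catalog, variable, study_area, max_resolution_m):
    """Return records for the variable and area whose cell size is fine enough."""
    return [r for r in catalog
            if r.variable == variable
            and r.study_area == study_area
            and r.resolution_m <= max_resolution_m]


# Illustrative catalog; identifiers and step names are made up.
catalog = [
    DataRecord("dem-01", "preliminary", "elevation", "basin_A", 30.0),
    DataRecord("slope-01", "intermediate", "slope", "basin_A", 30.0,
               produced_by=["fill_sinks", "compute_slope"]),
    DataRecord("slope-02", "intermediate", "slope", "basin_B", 90.0,
               produced_by=["compute_slope"]),
]

matches = find_reusable(catalog, "slope", "basin_A", max_resolution_m=30.0)
print(json.dumps([asdict(r) for r in matches], indent=2))
```

Recording the producing workflow alongside the data is one possible way to let a later modeler reuse an intermediate result instead of repeating the pre-processing steps.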
Integrating both data discovery and processing functionalities into IMEs. As mentioned in Section 2, the integration of data processing functionalities and the geographic model program in IMEs has been extensively researched. However, modelers still have to discover and process input data for geographic models separately. This means that the model input data acquired from data discovery tools, or directly from distributed spatial data infrastructures (SDIs), have to be manually transferred to input data pre-processing tools or IMEs. This procedure is tedious and requires users to have specialized knowledge of SDIs (such as metadata standards, protocols, and domain terminologies) and of data pre-processing functionalities [5]. Recently, integrated geospatial analysis platforms, such as HydroDesktop [48], Google Earth Engine (GEE) [110], and the Joint Research Centre Earth Observation Data and Processing Platform (JEODPP) [111], have attracted increasing interest. They enable users to discover, process, analyze, and visualize the needed data in one platform. Unfortunately, the data discovery and processing steps in these platforms have not yet been automated and have not been integrated with IMEs, which means that data have to be exchanged manually. Therefore, integrating both data discovery and processing functionalities into IMEs should be researched in the future.
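The idea of removing the manual hand-off between discovery and pre-processing can be sketched as a single call that looks up data and passes it directly through a chain of processing steps. This is a minimal sketch under stated assumptions: the in-memory `CATALOG`, the `discover` and `prepare_input` functions, and the two processing steps are hypothetical stand-ins and do not correspond to any real SDI protocol or IME interface.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a distributed data source.
CATALOG: Dict[str, List[float]] = {
    "precipitation_mm": [0.0, 12.5, -9999.0, 3.2],
}

Step = Callable[[List[float]], List[float]]


def discover(variable: str) -> List[float]:
    """Look up a variable; a real IME would query remote catalogs instead."""
    if variable not in CATALOG:
        raise KeyError(f"no data source offers {variable!r}")
    return list(CATALOG[variable])


def drop_nodata(values: List[float], nodata: float = -9999.0) -> List[float]:
    """Remove entries equal to the no-data marker."""
    return [v for v in values if v != nodata]


def mm_to_m(values: List[float]) -> List[float]:
    """Convert millimetres to metres."""
    return [v / 1000.0 for v in values]


def prepare_input(variable: str, steps: List[Step]) -> List[float]:
    """Discover data and pass it straight through the pre-processing steps."""
    data = discover(variable)
    for step in steps:
        data = step(data)
    return data


model_input = prepare_input("precipitation_mm", [drop_nodata, mm_to_m])
print(model_input)
```

Because discovery and processing share one entry point, no intermediate file has to be transferred by hand between tools.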
Developing task-oriented input data preparation methods. Geographic modeling is inherently task-driven work. Solving geographic problems depends heavily on conceptual knowledge of geographic problem-solving and on technical expertise in geographic models, data, data pre-processing tools (including parameter settings), and workflows. Users can understand and express tasks far more easily than specialized domain knowledge, study area characteristics, and the technical details of geographic modeling [112]. Recent studies have proposed several task-oriented geospatial data retrieval or processing methods [90]. However, these methods are still difficult to use in geographic modeling because they lack automation driven by task knowledge specific to geographic modeling, especially to input data preparation.
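To make the task-oriented idea concrete, the sketch below expands a user-stated task into the inputs a model needs and the pre-processing steps that derive each input. The task names, the `TASK_INPUTS` and `DERIVATION` tables, and the step names are hypothetical examples of formalized task knowledge, not content drawn from the reviewed methods.

```python
# Hypothetical task knowledge: the inputs each task requires.
TASK_INPUTS = {
    "soil_erosion_assessment": ["slope", "rainfall_erosivity"],
    "flood_simulation": ["flow_direction", "precipitation"],
}

# Hypothetical derivation knowledge: source data and ordered steps per input.
DERIVATION = {
    "slope": ("dem", ["fill_sinks", "compute_slope"]),
    "flow_direction": ("dem", ["fill_sinks", "compute_flow_direction"]),
    "rainfall_erosivity": ("precipitation_records",
                           ["aggregate_annual", "compute_erosivity"]),
    "precipitation": ("precipitation_records", ["interpolate_to_grid"]),
}


def plan_for_task(task):
    """Expand a task name into a per-input preparation plan."""
    if task not in TASK_INPUTS:
        raise KeyError(f"no knowledge recorded for task {task!r}")
    plan = {}
    for needed in TASK_INPUTS[task]:
        source, steps = DERIVATION[needed]
        plan[needed] = {"source": source, "steps": list(steps)}
    return plan


plan = plan_for_task("flood_simulation")
for name, entry in plan.items():
    print(name, "<-", entry["source"], "via", " -> ".join(entry["steps"]))
```

In this arrangement the user states only the task; the data sources and technical steps follow from the recorded knowledge.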
Constructing large-scale, high-quality knowledge bases for intelligent geographic modeling. The quantity and quality of formalized geographic modeling knowledge determine the level of automation and intelligence of input data preparation methods and the corresponding IMEs [33]. Currently, a large amount of knowledge on geographic modeling in different domains has not yet been formalized, for example, knowledge of geoprocessing functionalities and of the domain concepts and algorithms of digital terrain analysis [34]. Knowledge fusion and refinement are also urgently needed to alleviate problems of incompleteness, incorrectness, redundancy, and heterogeneity in knowledge bases [116]. In addition, few studies have addressed the representation of, and reasoning with, application-context knowledge [34]. Therefore, determining how to construct large-scale and high-quality knowledge bases for intelligent modeling is a key problem in future research. To build these knowledge bases, advanced technologies, such as machine learning, natural language processing, and knowledge graphs [121], could be explored to extract, represent, and use cross-domain modeling knowledge.
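A toy example may help show how formalized knowledge supports reasoning. The sketch below stores modeling knowledge as subject-predicate-object triples and follows `derivedFrom` links transitively to find every upstream dataset of a required input. The triples, predicate names, and helper functions are hypothetical and far simpler than a real knowledge graph or ontology.

```python
# Hypothetical knowledge triples: (subject, predicate, object).
TRIPLES = {
    ("slope", "derivedFrom", "filled_dem"),
    ("filled_dem", "derivedFrom", "dem"),
    ("slope", "computedBy", "compute_slope"),
    ("filled_dem", "computedBy", "fill_sinks"),
}


def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def all_sources(item, triples=TRIPLES):
    """Follow derivedFrom links transitively to collect upstream datasets."""
    found, frontier = set(), [item]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "derivedFrom", triples):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(sorted(all_sources("slope")))
```

Even this small store answers a question that is not stated explicitly in any single triple, namely that slope ultimately depends on the DEM.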
Input data preparation for geographic models has been increasingly recognized as a vital step in geographic modeling. An easy-to-use, efficient, and intelligent input data preparation method could not only free modelers from the burden of repetitive work and extensive training but also improve the accuracy of the model results.
We first analyzed the factors influencing input data preparation for geographic models and the corresponding three key tasks that need to be accomplished when developing input data preparation methods. Then, we divided existing input data preparation methods into three categories: manual methods, (semi-)automatic methods, and intelligent methods. Based on a survey of the state-of-the-art methods, we determined that knowledge-driven intelligent input data preparation for geographic models is the most promising yet challenging research subject. It is still seldom implemented in practical systems, which limits the ability of IMEs to improve modeling efficiency and to ensure the suitability of model inputs to the application context. Therefore, we discussed four future research directions to improve this situation. With the support of advanced technologies and methods such as web services, the semantic web, and AI, input data preparation methods, as well as geographic modeling with IMEs, are entering the era of intelligence. Improvements in these research directions will enable modelers, whether they are domain experts or novices, to easily and effectively prepare sufficient and application-matching input data for geographic models.