Overcoming Data Scarcity in Earth Science

The Data Scarcity problem is repeatedly encountered in environmental research. This may induce an inadequate representation of the response’s complexity in any environmental system to any input/change (natural and human-induced). In such a case, before getting engaged with new expensive studies to gather and analyze additional data, it is reasonable first to understand what enhancement in estimates of system performance would result if all the available data could be well exploited. The purpose of this Special Issue, “Overcoming Data Scarcity in Earth Science” in the Data journal, is to draw attention to the body of knowledge that leads at improving the capacity of exploiting the available data to better represent, understand, predict, and manage the behavior of environmental systems at meaningful space-time scales. This Special Issue contains six publications (three research articles, one review, and two data descriptors) covering a wide range of environmental fields: geophysics, meteorology/climatology, ecology, water quality, and hydrology.


Introduction
Environmental modeling deals with the representation of processes that occur in the real world in space and time. Based on differential equations, dynamic models mostly describe the processes that transform the environment through time. The spatial interactions and topological rules are mostly managed by geographic information systems (GIS) [1]. These mathematical models heavily rely on the data collected by direct field observations. However, a functional and complete dataset of any environmental variable is difficult to collect because of two main reasons: (i) the low reliability in the measurements (e.g., due to issues related to the equipment location or occurrences of equipment malfunctions); and (ii) the high cost of the monitoring campaigns [2,3]. The lack of an adequate amount of Earth-science data may induce an unsatisfactory and not reliable representation of the response's complexity of an environmental system to any input/change, both natural and human-induced. In this case, before undertaking expensive studies to collect and analyze additional environmental data, it is reasonable to first understand what improvement in estimates of system performance would result if all the available data could be well exploited [4].
Missing data imputation is a crucial task in cases where it is fundamental to use all available data and not neglect records with missing values [5]. Since the 1980s, many techniques to impute missing data have been proposed [6,7]. Generally speaking, the methods for filling in an incomplete dataset can be divided into two main categories: single imputation and multiple imputations [6]. Single imputation, i.e., filling in precisely one value for each missing one, intuitively has many appealing features, e.g., standard complete-data methods can be applied directly, and the substantial effort required to create imputations needs to be carried out only once. Multiple-imputation is a method of   Table 4. Countries of the authors.

Country Quantity Percentage
Czech Republic 1  5  Italy  5  26  Kyrgyzstan  3  16  Netherland  1  5  United States  7  37  Uruguay  2  11  Total  18  100 Author Contributions: Conceptualization, A.G.; writing-original draft preparation, A.G.; writing-review and editing, A.C., C.C., and L.E. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Introduction
Soils play a crucial role in the global hydrologic cycle by governing the rates of infiltration and transmission of rainfall, and surface runoff, i.e., precipitation that does not infiltrate into the soil and runs across the land surface into water bodies, such as streams, rivers and lakes. Runoff occurs when rainfall exceeds the infiltration capacity of soils, and it is based on the physical nature of soils, land cover, hillslope, vegetation and storm properties such as rainfall duration, amount and intensity. The rainfall-runoff process serves as a catalyst for the transport of sediments and contaminants, such as fertilizers, pesticides, chemicals and organic matter, negatively impacting the morphology and biodiversity of receiving water bodies [1,2]. Flooding and erosion caused by uncontrolled runoff, particularly downstream, results in damage to agricultural lands and manmade structures [1]. Hence, modeling surface runoff is an essential part of soil and water conservation efforts, including but not limited to, forecasting floods and soil erosion and monitoring water and soil quality.
The U.S. Department of Agriculture's (USDA) agency for Natural Resources Conservation Service (NRCS), formerly known as the Soil Conservation Service (SCS), developed a parameter called Curve Number (CN) to estimate the amount of surface runoff. Furthermore, soils are classified into Hydrologic Soil Groups (HSGs) based on surface conditions (infiltration rate) and soil profiles (transmission rate). Combinations of HSGs and land use and treatment classes form hydrologic soil-cover complexes, each of which is assigned a CN [3]. A higher CN indicates a higher runoff potential. Consequently, accurate classification of HSGs is critical for the calculation of CNs that provide a meaningful prediction of runoff.
In the United States, more than 19,000 soil series have been identified and aggregated into map unit components with similar physical and runoff characteristics, and assigned to one of four HSGs: A, B, C or D. The original assignments were based on measured rainfall, runoff and infiltrometer data [4]. Since then, assignments have been based on the judgement of soil scientists, primarily relying on their interpretation of criteria published in the National Engineering Handbook (NEH) Part 630, Hydrology [5]. As with any subjective interpretation, the placement of soils into appropriate hydrologic groups have been non-uniform and inconsistent over time and across geographical locations. Soils with similar runoff characteristics were placed in the same hydrologic group, under the assumption that soils found within a climatic region with similar depth, permeability and texture will have similar runoff responses. Conventional soil mapping techniques extrapolate these classifications and geo-reference them with GPS (Global Positioning Systems) and digital elevation models visualized in a GIS (Geographic Information Systems) [6,7]. However, in addition to the inconsistent classification of soil profiles, the varying definition of mapping units introduces a certain degree of subjectivity. Over the past two decades, Pedology research has witnessed an evolution from traditional soil mapping techniques to methods for 'the creation and population of spatial soil information systems by numerical models inferring the spatial and temporal variations or soil types and soil properties from soil observation and knowledge and from related environmental variables' [8], also known as Digital Soil Mapping (DSM) [9][10][11].
Considering the advances in modern computing and the vastly expanding soil databases, NRCS and the Agricultural Research Service (ARS) formed a joint working group in 1990 to address shortcomings attributed to guidelines stated in NEH reference documents [12]. Two among the several goals identified by the group were to standardize the procedure for the calculation of CNs from rainfall-runoff data and to reconsider the HSG classifications. A fuzzy model that was developed using the National Soil Information System (NASIS) soil interpretation subsystem was applied to 1828 unique soils using data from Kansas, South Dakota, Missouri, Iowa, Wyoming and Colorado. Correlation between the soil's assigned and modeled HSG was analyzed, and the overall HSG frequency coincidence exceeded 54 percent [13]. It was observed that the correlation frequencies for soils from groups A and D were higher than those for groups B and C. These correlation inconsistencies were attributed to: (1) boundary conditions that occur when soils exhibit properties that do not fit entirely into a single hydrologic group. The effects of this are more profound for groups B and C considering that they are each bounded by two groups (2) fuzzy modeling of the subjective HSG criteria. To address the inconsistencies due to boundary conditions, an improved method that developed an automated system based on detailed soil attribute data was proposed by Li, R et al. [14]. This work aimed to mitigate the aggregation effect of HSGs on soil information, and eventually the CNs, due to the assignment of similar soils into different HSGs (exaggerating small differences between them) or different soils to the same HSG (omitting differences between them). Furthermore, this work successfully identified improper placement of HSGs. However, this work used a significantly smaller sample size of 67 soil types in the Lake Fork watershed in Texas.
Machine learning, a branch of Artificial Intelligence, is an inherently interdisciplinary field that is built on concepts such as probability and statistics, information theory, game theory and optimization, among many others. In 1959, Arthur Samuel, one of the pioneers of machine learning, defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed" [15]. A more recent and widely accepted definition can be attributed to Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [16]. Based on the approach used, type of input and output data, and nature of the problem being addressed, machine learning techniques can be classified into four main categories: (1) supervised learning; (2) unsupervised learning; (3) semi-supervised learning; and (4) reinforcement learning.
In supervised learning, the goal is to infer a function or mapping from training data that is labeled. The training data consist of an input vector X and an output vector Y that is labeled based on available prior experience. Regression and classification are two categories of algorithms that are based on supervised learning. Unsupervised learning, on the other hand, deals with unlabeled data, with the goal of finding a hidden structure or pattern in this data. Clustering is one of the most widely used unsupervised learning methods. In semi-supervised learning, a combination of labeled and unlabeled data is used to generate an appropriate model for the classification of data. The reinforcement learning method uses observations gathered from the interaction with the environment to make a sequence of decisions that would maximize the reward or minimize the risk. Q-learning is an example of a reinforcement learning algorithm.
The application of machine learning techniques in soil sciences ranges from the prediction of soil classes using DSM [17,18] to the classification of sub-soil layers using segmentation and feature extraction [19]. The predictive ability of machine learning models has been leveraged for agricultural planning and mass crop yield, the prediction of natural hazards, including, but not limited to, landslides, floods, drought and forest fires and monitoring the effects of climate change on the physical and chemical properties of soil [20,21]. Based on high spatial resolution satellite data, terrain/climatic data, and laboratory soil samples, the spatial distribution of six soil properties including sand, silt, and clay were mapped in an agricultural watershed in West Africa [22]. Of the four statistical prediction models tested and compared, i.e., Multiple Linear Regression (MLR), Random Forest Regression (RFR), Support Vector Machine (SVM) and Stochastic Gradient Boosting (SGB), machine learning algorithms performed generally better than MLR for the prediction of soil properties at unsampled locations. In a similar study for a steep-slope watershed in southeastern Brazil [23], the performance of three algorithms: Multinomial Logistic Regression (MLR), C5-decision tree (C5-DT) and Random Forest (RF) was evaluated and compared based on performance metrices of overall accuracy, standard error, and kappa index. It was observed that the RF model consistently outperformed the other models, while the MLR model had the lowest overall accuracy and kappa index. In the context of DSM applications, complex models such as RF are found to be better classifiers than generalized linear models such as MLR. While machine learning offers the added advantage of identifying trends and patterns with continuous improvement over time, these models are only as good as the quality of the data collected. An unbiased and inclusive dataset, along with the right choice of model, parameters, cross-validation method, and performance metrices is necessary to achieve meaningful results.
In this work, we investigated the application of four machine learning methods: kNN, SVM-Gaussian Kernel, Decision Trees and Ensemble Learning towards the classification of soil into hydrologic groups. The results of these algorithms are compared to those obtained using estimation based on soil texture.

Background
Soils are composed of mineral solids derived from geologic weathering, organic matter solids consisting of plant or animal residue in various stages of decomposition, and air and water that fill the pore space when soil is dry and wet, respectively. The mineral solid fraction of soil is composed of sand, silt and clay, relative percentages of which determine the soil texture in accordance with the USDA system of particle-size classification. Sand, being the larger of the three, feels gritty, and ranges in size from 0.05 to 2.00 mm. Sandy soils have poor water-holding capacity that can result in leaching loss of nutrients. Silt, being moderate in size, has a smooth or floury texture, and ranges from 0.002 to 0.05 mm. Clay, being the smallest of the three, feels sticky, and is made up of particles smaller than 0.002 mm in diameter. In general, the higher the percentage of silt and clay particles in soil, the higher is its water-holding capacity. Particles larger than 2.0 mm are referred to as rock fragments and are not considered in determining soil texture, although they can influence both soil structure and soil-water relationships. The ease with which pores in a saturated soil transmit water is known as saturated hydraulic conductivity (Ksat), and it is expressed in terms of micrometers per second 8 Data 2020, 5,2 (or inches per hour). Pedotransfer functions (PTFs) are commonly used to estimate Ksat in terms of readily available soil properties such as particle size distribution, bulk density, and organic matter content [24,25]. Machine Learning-based PTFs have been developed to understand the relationship between soil hydraulic properties and soil physical variables [26].

Hydrologic Soil Groups
Soils are classified into HSGs based on the minimum rate of infiltration obtained for bare soil after prolonged wetting [5]. The four hydrologic soil groups (HSGs) are described as follows: Group A-Soils in this group are characterized by low runoff potential and high infiltration rates when thoroughly wet. They typically have less than 10 percent clay and more than 90 percent sand or gravel. The saturated hydraulic conductivity of all soil layers exceeds 40.0 micrometers per second. Group B-Soils in this group have moderately low runoff potential and moderate infiltration rates when thoroughly wet. They typically have between 10 and 20 percent clay and 50 to 90 percent sand. The saturated hydraulic conductivity ranges from 10.0 to 40.0 micrometers per second. Group C-Soils in this group have moderately high runoff potential and low infiltration rates when thoroughly wet. They typically have between 20 and 40 percent clay and less than 50 percent sand. The saturated hydraulic conductivity ranges from 1.0 to 10.0 micrometers per second. Group D-Soils in this group are characterized by high runoff potential and very low infiltration rates when thoroughly wet. They typically have greater than 40 percent clay and less than 50 percent sand. The saturated hydraulic conductivity is less than or equal to 1.0 micrometers per second. Dual hydrologic soil groups-Certain wet soils are placed in group D based solely on the presence of a high water table. Once adequately drained, they are assigned to dual hydrologic soil groups (A/D, B/D and C/D) based on their saturated hydraulic conductivity. The first letter applies to the drained condition and the second to the undrained condition.

Soil Survey Data
The dataset used for this work was obtained from USDA's NRCS Web Soil Survey (WSS), the largest public-facing natural resource database in the world [27]. The Soil Survey Geographic Database (SSURGO) developed by the National Cooperative Soil Survey was used to identify Areas of Interests (AOI) in the State of Washington the Idaho Panhandle National Forest. Tabular data corresponding to Physical Soil Properties and Revised Universal Soil Loss Equation, Version 2 (RUSLE2) related attributes for various AOIs were retrieved from the Microsoft Access database and compiled into Microsoft Excel spreadsheets. Features of interest include the map symbol and soil name, its corresponding hydrologic group, percentages of sand, silt and clay, depth in inches and Ksat in micrometers per second. The initial dataset comprised of 4468 unique soil types.
As with most survey-based datasets, there were incomplete or missing data, inconsistencies in formatting and undesired data entries. The compiled dataset was preprocessed to remove samples corresponding to: missing data points, dual hydrologic groups (A/D, B/D and C/D), and soil layers beyond a water impermeable depth range of 20 to 40 inches. This reduced the dataset to 2107 unique soil types. MATLAB ® programming environment was used for all data preparation and processing.

Estimation Based on Soil Texture
Based on the percentages of sand, silt, and clay, soils can be grouped into one of the four major textural classes: (1) sands; (2) silts; (3) loams; and (4) clays. The soil textural triangle shown in Figure 1 illustrates twelve textural classes as defined by the USDA [28]: sand, loamy sand, sandy loam, loam, silt loam, silt, sandy clay loam, clay loam, silty clay loam, sandy clay, silty clay, and clay. These classifications are typically named after the primary constituent particle size, e.g., "sand", 9 Data 2020, 5,2 or a combination of the most abundant particles sizes, e.g., "sandy clay". One side of the triangle represents percent sand, the second side represents percent clay, and the third side represents percent silt. Given the percentages of sand, silt and clay in the soil sample, the corresponding textural class can be read from the triangle. Alternately, the NRCS soil texture calculator [28] can be used to determine textural class based on specific relationships between sand, silt and clay percentages as shown in Table 1. In this work, the method used to assign HSGs based on soil texture was adopted from Hong and Adler (2008) [29], which was modified from the USDA handbook [30] and National Engineering Handbook Section 4 [5]. MATLAB ® was used to assign HSGs based on soil texture calculations. Figure 1. The soil textural triangle is used to determine soil textural class from the percentages of sand, silt and clay in the soil [28]. Table 1. Soil texture calculations and mapping to hydrologic soil groups [28,29].

Machine Learning Algorithms
A common problem encountered in machine learning and data science is that of overfitting, where the model does not generalize well from training data to unseen data. Cross validation techniques are generally used to assess the generalization ability of a predictive model, thus avoiding the problem of overfitting. In this work, a Monte Carlo Cross-Validation (MCCV) method [31] was used by randomly splitting the dataset into equal-sized training and test subsets, training the model, predicting classification and repeating the process 100 times. The overall prediction accuracy (or other performance metrics) is the average over all iterations.
A machine learning algorithm can be classified as either parametric or non-parametric. Parametric methods assume a finite and fixed set of parameters, independent of the number of training examples. In non-parametric methods, also called instance-based or memory-based learning, the number of parameters is determined in part by the data, i.e., the number of parameters grows with the size of the training set. Due to the availability of a large dataset with labeled data, in this work, we considered four non-parametric supervised learning algorithms: (1) kNN (2) SVM Gaussian Kernel (3) Decision Trees (4) Random Forest. A qualitative introduction to these algorithms is presented in the following subsections.
3.3.1. k-Nearest Neighbors (kNN) Algorithm kNN algorithm, an instance-based method of learning, is based on the principle that instances within a dataset will generally exist in close proximity to other instances that have similar properties. If the instances are tagged with a classification label, then the value of the label of an unclassified instance can be predicted based on the labels of its nearest neighbors.
The Statistics and Machine Learning Toolbox from MATLAB ® was used to create a Classification kNN model using function 'fitcknn', followed by the function 'predict' to predict classification for test data.
where features is a numeric matrix that contains percent sand, percent silt, percent clay and Ksat; label is a cell array of character vectors that contain the corresponding HSGs; and k represents the number of neighbors.

Support Vector Machines (SVMs) with Gaussian Kernel
Support Vector Machines are non-parametric, supervised learning models that are motivated by a geometric idea of what makes a classifier "good" [32]. For linearly separable data points, the objective of the SVM algorithm is to find an optimal hyperplane (or a decision boundary) in an N-dimensional space (where N is the number of features) that distinctly classifies data points. Support vectors are the data points that lie closest to the hyperplane. The SVM algorithm aims to maximize the margin around the separating hyperplane, essentially making it a constrained optimization problem.
For data points that are not linearly separable, which is true of most real-world data, the features can be mapped into a higher-dimensional space in such a way that the classes become more easily separated than in the original feature space. A technique commonly referred to as the 'kernel trick', uses a kernel function that defines the inner product of the mapping functions in the transformed space. One of the most popular kernels are the Radial Basis Functions (RBFs), of which, the Gaussian kernel is a special case.
The Statistics and Machine Learning Toolbox from MATLAB ® was used to create a template for SVM binary classification based on a Gaussian kernel function using function 'templateSVM', followed by the function 'fitcecoc' that trains an Error-Correcting Output Codes (ECOC) model based on the features and labels provided. t is specified as a binary learner for an ECOC multiclass model. Finally, the function 'predict' is used to predict classification for test data. t = templateSVM ('KernelFunction', 'gaussian') SVM_gaussian_model = fitcecoc (features, labels, 'Learners', t); predict_HSG = predict (SVM_gaussian_model, features_test) 11 Data 2020, 5, 2

Decision Trees
Decision Trees are hierarchical models for supervised learning in which the learned function is represented by a decision tree [16,33]. The model classifies instances by querying them down the tree from the root to a leaf node, where each node represents a test over an attribute, each branch denotes its outcomes and each leaf node represents one class. Based on the measure used to select input variables and the type of splits at each node, decision trees can be implemented using statistical algorithms such as CART (Classification And Regression Tree), ID3 (Iterative Dichotomiser 3) and C4.5 (successor of ID3), among many others.
The Statistics and Machine Learning Toolbox from MATLAB ® was used to grow a fitted binary classification decision tree based on the features and labels using function 'fitctree', followed by the function 'predict' to predict classification for test data. Function 'fitctree' uses the standard CART algorithm to grow decision trees. decisiontree_model = fitctree (features, labels); predict_HSG = predict (decisiontree_model, features_test)

Ensemble Learning
While decision trees are a popular choice for predictive modeling due to their inherent simplicity and intuitiveness, they are often characterized by high variance. Consequently, decision trees can be unstable because small variations in the data might result in a completely different tree and hence, a different prediction. Ensemble learning methods that combine and average over multiple decision trees have been used to improve predictive performance [32]. Bagging (or bootstrap aggregation) is a technique that is used to generate new datasets with approximately the same (unknown) sampling distribution as any given dataset. Random forests, an extension of the bagging method, also selects a random subset of features. In other words, random forests can be considered as a combination of 'bootstrapping' and 'feature bagging'.
The Statistics and Machine Learning Toolbox from MATLAB ® was used to grow an ensemble of learners for classification using function 'fitcensemble'', followed by the function 'predict' to predict classification for test data. ensemble_model = fitcensemble (features, labels); predict_HSG = predict (ensemble_model, features_test); The function 'TreeBagger' bags an ensemble of decision trees for classification using the Random Forest algorithm, followed by the function 'predict' to predict classification for test data. Decision trees in the ensemble are grown using bootstrap samples of the data, with a random subset of features to use at each decision split. treebagger_model = TreeBagger (50, features, labels, 'OOBPrediction', 'On', 'OOBPredictorImportance', 'On'); predicted_HSG = predict (treebagger_model, features_test); 'OOBPrediction' and 'OOBPredictorImportance' are set to 'on' to store information on what observations are out of bag for each tree and to store out-of-bag estimates of feature importance in the ensemble, respectively.

Performance Metrics
A Confusion matrix is commonly used to visualize the performance of a classification algorithm. Figure 2 illustrates the confusion matrix for a multi-class model with N classes [34]. Observations on correct and incorrect classifications are collected into the confusion matrix C c ij , where c ij represents the frequency of class i being identified as class j. In general, the confusion matrix provides four types of classification results with respect to one classification target k: Several performance metrics can be derived from these four outcomes. The ones of interest to us are listed below, for per-class classifications: • Accuracy: This metric simply measures how often the classifier makes a correct prediction.
• Cohen's Kappa (κ): This metric compares an Observed Accuracy with an Expected Accuracy (random chance) where p o represents the accuracy and p e represents a factor that is based on normalized marginal probabilities.
For multi-class classification problems, averaging per-class metric results can provide an overall measure of the model's performance. There are two widely used averaging techniques: macro-averaging and micro-averaging.

•
Macro-average: Macro-averaging reduces the multi-class predictions down to multiple sets of binary predictions. The desired metric for each of the binary cases are calculated and then averaged resulting in the macro-average for the metric over all classes. For example, the macro-average for Recall is calculated as shown below: • Micro-average: Micro-averaging uses individual true positives, true negatives, false positives and false negatives from all classes to calculate the micro-average. For example, the micro-average for Recall is calculated as shown below: Macro-averaging assigns equal weight to each class, whereas micro-averaging assigns equal weight to each observation. Micro-averages provide a measure of effectiveness on classes with large observations, whereas macro-averages provide a measure of effectiveness on classes with small observations.

Results and Discussions
Following data preparation and pre-processing in MATLAB ® , soil data samples were classified into one of the four hydrologic groups using soil texture calculations, followed by classifications using the following algorithms: (a) k-Nearest Neighbor (kNN), (b) SVM Gaussian Kernel, (c) Decision Tree, (d) Classification Bagged Ensemble, and (e) TreeBagger. A Monte Carlo Cross-Validation (MCCV) method was used to avoid the problem of overfitting [31]. A measure of overall accuracy was first computed to compare all five algorithms with the soil texture-based classification. Table 2 shows that overall accuracy for the latter is significantly lower than those for the machine learning-based classification algorithms. In fact, none of the HSG C occurrences were correctly classified using soil texture calculations. The TreeBagger (Random Forest) algorithm has the highest overall accuracy of 84.70 percent, closely followed by the Decision Tree and kNN algorithms with 83.12 percent and 80.66 percent, respectively. Although applied for an entirely different dataset, the fuzzy system hydrologic grouping model [13] results in an overall correlation frequency of 60.5 percent for HSGs A, B, C and D, with higher correlation between assigned and modeled results for HSGs A and D. For datasets in which the classes are not represented equally (also known as imbalanced classes), accuracy is typically not a good measure of performance. Out of the 2107 unique soil samples in the observed group, 337 belong to HSG A, 1142 to HSG B, 511 to HSG C and 117 to HSG D. Given that our dataset is relatively imbalanced, we further evaluate the performance of all five algorithms based on metrics of Recall, Precision, FPR, TNR, F1-Score, MCC and Kappa. It is important to account for chance agreement when dealing with highly imbalanced classes since a high classification accuracy could result from classifying all observations as the largest class [35,36]. Table 3 lists per-class results and macro-and micro-averages of these metrics for classification using kNN, SVM and Decision Trees. Table 4 presents the same for two Ensemble Learning algorithms. A graphical comparison of individual classes (HSGs) for each metric is shown in Figure 3.  It should be noted that micro-averages for Recall, Precision and F1-Sscore are equal, as expected in multi-class classification problems. Moreover, micro-averages for MCC and Kappa are equal. It can be observed that for all five algorithms, the ability of the classifiers to correctly predict (Recall) HSGs A and B are relatively higher when compared to HSGs C and D. This is in line with results obtained for three-class and seven-class classification of soil types using Decision Trees and SVM in [19], wherein sandy soils had higher classification accuracy. On the other hand, the certainty with which the classifiers predict correct classes (Precision) is relatively higher for HSG D in our work. A comparison of macro-and micro-averages of F1-Scores among the five classifiers shows that kNN, Decision Tree and TreeBagger have scores close to 0.8, while SVM-Gaussian Kernel lags with a score close to 0.7. Among the four soil groups, HSG B has the highest rate of False Positives, with the highest being 57.8 percent for SVM-Gaussian Kernel and lowest being 19.83 percent for kNN. The fact that HSG B is the largest class in the dataset, and bordered by two other groups, explains the high FPR. A comparison of macroand micro-averages of MCCs among the five classifiers shows comparable results (~0.72) for kNN, Decision Tree and TreeBagger. Yet again, SVM-Gaussian kernel has the lowest score (~0.6). The results of Cohen's Kappa coefficient for HSG B shows some discrepancy that is consistent across all five classifiers. This may be related to the corresponding high FPRs. Regardless, the micro-average Kappa value is consistent with that of MCC, possibly accounting for any class imbalance. An interesting observation is that the micro-and macro-averages of Kappa coefficients for all five classifiers are similar in value. The macro-averages range from 0.55 (RF and DT) to 0.58 (SVM) and micro-averages range from 0.65 (SVM) to 0.73 (kNN and CBE), all within the moderate to substantial agreement range [37]. This similarity is observed in studies related to machine learning techniques for DSM, suggesting that the quality and robustness of datasets is of greater importance than the classifier itself [38,39]. In the context of predicting soil map units on tropical hillslopes in Brazil, an RF model yielded an overall accuracy of 78.8 percent and a Kappa index of 0.76, while a Decision Tree model had an overall accuracy of 70.2 percent and a Kappa value of 0.66 [39]. In contrast, for classification based on soil taxonomic units in British Columbia, Canada, kNN and SVM resulted in the highest accuracy of 72 percent; however, models such as CART with bagging and RF were preferred due to the speed of parameterization and the interpretability of the results, while resulting in similar accuracies ranging from 65 to 70 percent [40].

Conclusions
This work presents the application of machine learning towards classification of soil into hydrologic groups. The machine learning models tested were kNN, SVM-Gaussian Kernel, Decision Trees and Ensemble Learning (Classification Bagged Ensemble and Random Forest). It was observed that for all five classifiers, Recall for HSGs A and B were relatively higher when compared to HSGs C and D, but precision was relatively higher for HSG D. Overall, performance metrics related to kNN, Decision Tree and TreeBagger exceeded those for SVM-Gaussian Kernel and Classification Bagged Ensemble.
As part of future work, the effects of class imbalance will be investigated by comparing datasets with varying degrees of imbalance and using various cross-validation techniques with proportional stratified random sampling. Deep learning methods that address this classification problem will also be explored.

Introduction
Since water quality is affected by complex factors like animal/human activities and weather events, its continuous sampling and monitoring is of paramount importance for human health [1]. The United States Geological Survey (USGS) has been continuously monitoring the quality of surface water across the U.S. over the past decades [2]. The most common water quality indicators suggested by the USGS are temperature, specific conductance, dissolved oxygen concentration (DO), pH, and turbidity. Collecting and analyzing water quality data is a challenging task. First off, water quality monitoring techniques are different in different water bodies like streams, lakes, bays, and estuaries, characterized not only by different microscopic and macroscopic organisms, but also by different ecosystems, flow rate, and accessibility. Additional common challenges include uncertainty in water quality observations and instrument failure. In the instance of instrument malfunctioning or stop recording, one or more values in the time series may be missing. Popular methods to recover gaps in time series are divided into two major groups: deterministic and stochastic [3]. Examples of deterministic approaches are nearest-neighbor interpolation, polynomial interpolation, and methods based on distance weighting. Regression methods, auto regressive methods, and machine learning methods fall under the stochastic category [3].
Sampling water quality is further complicated by the development of an effective method to analyze and evaluate the collected data. Water quality data are usually characterized by non-Gaussian distributions. Also, the presence of outliers and missing values are very common [4]. As a result, finding an appropriate analytical method is key. Some popular classical methods are graphical analysis (e.g., boxplots, scatter plots, and Q-Q plots), probability distribution analysis, and trend analysis. However, when dealing with excessive amount of data, it is easy to miss hidden patterns and information. In the past two decades, several studies have proposed novel approaches to analyze water quality data, including fuzzy theory [5], maximum likelihood methods [6], principal component analysis [7], cascade correlation artificial neural network [8], interactive fuzzy multi-objective linear programming [9], linear regression [10], inexact chance-constrained quadratic programming [11], and Dempster-Shafer methods [12]. All these methods have the ability to deal with large datasets and investigate relationships among water quality indicators. However, to take advantage of the above tools, prior and/or additional information about the data is needed. For example, the fuzzy set theory requires a grade of membership (that defines how each data point is mapped to a membership value) or a value of possibility (e.g., possible, quite possible, slightly possible, impossible). Similarly, the Dempster-Shafer theory necessitates basic probability analysis [13].
Rough Set Theory (RST), introduced by Pawlak in 1982 [13], represents a valid alternative to overcome these issues. RST is a powerful tool to deal with large amounts of information, does not require preliminary or additional information about the data, and considers vagueness and uncertainty in the dataset [14]. RST is commonly used in classification, ranking, multi-criteria decision analysis, and decision rules [15]. One of the applications of RST is pattern recognition by attribute reduction. By reducing unnecessary features, RST is capable of discovering hidden patterns in high dimensional datasets [16]. The philosophy of rough set is based on the assumption that some information is associated to every object in the universe. Objects sharing the same information are called indiscernible and the indiscernibility relation is the mathematical basis of rough set theory [17]. This tool has been successfully applied to areas like healthcare, banking, medicine, engineering, environmental science, among others [17].
In this work, we investigate the potential of applying RST to water quality analysis. RST is useful when dealing with complexity and vagueness in a dataset, which is always the case when analyzing water quality field data. Although a few attempts exist in the field of environmental and water resources engineering [18,19], the application of RST for assessing water quality indicators has not been widely explored. For example, Shen and Chouchoulas [20] proposed a hybrid system called fuzzy-rough estimator to assess the size of algae population based on water characteristics. Although their attribute reduction method (going from eleven original attributes to seven) was demonstrated to be successful, their approach was not capable of extracting high accuracy sets of rules. Another application of RST in water resources engineering is the one investigated by Barbagallo et al. [21] who studied reservoir operating rules. This study employed the integrated RST and Rose application, a software developed by the University of Poznan in Poland [22], to provide the minimal condition attributes and reveal the relevance of each attribute. Dong et al. [18] proposed a model to forecast annual runoff from a reservoir using RST. Their results showed that the larger the samples was, the more accurate the model. In a study performed by Ip et al. [23], RST was employed to identify the significant water quality indicators in a decision-making system. Specifically, RST was able to reduce the number water quality indicators and quantify the importance degree of each core indicator.
Other studies combined RTS with other approaches, such as the one by Pai and Lee [19] that introduced the Multinomial Logistics Regression (MLR) model. MLR was used to investigate the relationship between different degrees of water pollution and environmental factors, like the one between the concentration of SO 2 emitted by car and motorcycle exhausts and ozone density in the atmosphere. This framework was shown to be capable of predicting water quality using environmental factors rather than monitoring the processes of chemical elements. Another example is the work by Karimi et al. [24] who employed the variable consistency dominance-based rough set approach to explore the complex relationship between water quality and environmental indicators. They explored the relationship between total dissolved solids (TDS) and environmental indicators used as explanatory variables, such as precipitation, river water temperature, runoff, normalized difference vegetation index (NVDI), land surface temperature, river water temperature. Using a moving average filter in the TDS data, they decreased the noise and reduced the width of the boundary region between the lower approximation (all elements in a subset belong to the set) and upper approximation (all elements possibly belong to the set).
The main goal of this work is to assess the efficiency of using RST to estimate one water quality indicator based on given (known) indicators. Evaluating the overall quality of the stream water is outside the scope of this work. What we consider here is a comprehensive approach that looks at several water quality indicators rather than providing a generic assessment of the stream healthiness. Our hypothesis is that, when observations in a time series are missing, RST is capable of providing information regarding the missing indicator based on the other recorded indicators. RST also identifies the dispensable indicators. By eliminating the dispensable (redundant) indicator or indicators, the complexity of the dataset is reduced. The strength of each indicator in finding an unknown indicator is assessed and dispensable attributes are identified to discover hidden patterns. Section 2 introduces the basics of rough set theory and its application to a water quality dataset collected in Fairfax, VA during 2015 to 2017. Section 3 presents the results, whereas Section 4 discusses the results and summarizes the main conclusions.

Rough Set Theory
In RST, data are represented by an information system or information table consisting of non-empty sets of finite objects (rows) and non-empty finite set of attribute (columns). More formally: where S is the decision system, U is the universe, and A is an attribute. The central concept in RST is the indiscernibility relation, a relationship between two (or more) objects where all the values are identical (equivalent) with respect to a subset of considered attributes [25]. The indiscernibility relation is defined as any subset B of A with a binary relation I (B) on U. For every a ∈ A: (x, y) ∈ I(B) if and only if a(x) = a(y), where the value of attribute a is for element x (or y).
Approximation is another fundamental concept in RST. On one hand, lower approximation refers to the domain of objects that are known with certainty to belong to the subset of interest. The lower approximation is also called B-positive region, posB(X). On the other hand, upper approximation refers to objects that possibly belong to the subset of interest [26].
Suppose X ⊂ U, and B ⊂ A, the B lower and B upper approximation of X, respectively, are: Therefore, the B-boundary region of X is defined as: If the boundary region is empty, then X is exact (or crisp). Otherwise, X is inexact and is classified as rough. The approximation method is a valuable method to express data topological properties [14]. The decision-making (DM) rule is another helpful tool to discover hidden patterns in a dataset and is defined as follows: 22 Data 2018, 3,50 where C is a disjoint set of condition attributes and D is the decision attribute. For every x ⊂ U, there exist C1(x), . . . , Cn(x), d1(x), . . . , dm(x). The decision rule induced by x in S is: where the arrow implies the decision D is based on condition C. The importance degree of attributes relative to the decision is calculated as: This way the most important attributes are selected and if the importance degree equals zero, the attribute is unimportant. The larger γcd (c), the higher the attribute degree of importance is. Please note that the importance degree is not a percentage and has no units. If |x| is the number of element in a set (i.e., cardinality of x), then the support of decision is defined as: and the strength of the decision is quantified as: In other words, the support of the decision corresponds to the number of times that a certain rule is observed within the dataset and the strength of the decision is the support of the decision divided by the total number of decision rules. So, if the support of a decision is high, it means that the number of times that the specific decision rule is repeated is high and consequently, this decision rule is strong.
Also, the certainty of the decision rule is calculated as: where π|C(x)| = |C(x)|/|U|. When cerx equals to one, then C→xD is a certain decision rule. Another useful factor in the DM rule concept is the coverage of decision rule defined as: where π|C(x)| = |D(x)|/|U|. The coverage coefficient is the conditional probability of reasons for a given decision. When C→xD is a decision rule, then D→xC is called the inverse decision rule and can be used to give explanations (reasons) for a decision. Please note that the certainty factors for inverse decision rules are coverage factors for the original decision rule [14].

Study Area and Dataset
In this study, we evaluate the chemical and physical quality of water at the outlet of the watershed that contains the George Mason University campus in Fairfax, VA. Figure 1 shows the watershed boundaries and the location where water quality indicators were sampled. This urbanized watershed contains two small creeks and one retention pond and is located within the larger Pohick Creek Watershed. Moreover, it consists of generally flat to sloping topography with most drainage (approximately 90%) flowing towards the south central portion of the campus and Pohick Creek.
A water quality monitoring instrument (the Eureka Manta2 Waterprobe) with six sensors automatically records six water quality indicators (listed in Table 1): dissolved oxygen concentration (DO), nitrate concentration, pH, specific conductivity, temperature, and turbidity. These water quality indicators were chosen for several reasons. First off, they are listed by the Environmental Protection Agency (EPA) to define water quality standards for surface water [27]. Secondly, since George Mason University complies with the Clean Water Act and EPA storm water regulations, its Facilities Department monitors these indicators across the campus every year [28]. The water quality probe recorded each indicator every hour from October 2015 to December 2017. However, the probe was out for calibration and repairs occasionally and there have been some frequent network issues with the data logger. As a result, only 14 months of data are used in this work. Most data have been collected during 2016 and 2017. Both years are characterized by monthly mean temperature and precipitation that are similar to the 30-year mean values for the region [29]. Specifically, the average temperature during the past 30-year in Fairfax, VA is 13 • C and the yearly average temperature both in 2016 and 2017 is 14 • C. The average 30-year cumulative precipitation is 107 cm and the average precipitation for 2016 and 2017 is 90 cm and 104 cm, respectively. This indicates that 2016 and 2017 are not anomalous years in terms of regional climatology [29]. The collected data, summarized in Table 1, show that water temperature fluctuates from about 5 • C in winter to almost 30 • C in summer. The average pH is 6.75 and it falls in the range identified by EPA water quality standards for the Commonwealth of Virginia [27]. The average DO is 6.14 mg/L and it is also within the EPA water quality standards. The level of nitrate (average of 136.11 mg/L-N) shows that the runoff water possibly traveled through lands with fertilizers. Another possible source of nitrate is the atmosphere containing nitrogen compounds derived from automobiles [30]. According to EPA, the natural level of nitrate from wastewater effluent can range up to 30 mg/L. Finally, the high standard deviations in conductivity and turbidity are also common because of the frequent storms in this region.
The collected data are then discretized into three categories (low, medium and high): (i) any value lower than the 25th quartile is classified as low (L); (ii) any value between the 25th and 75th quartiles is classified as medium (M); and (iii) any value higher than the 75th quartile is classified as high (H). 24 Data 2018, 3, 50 Table 1. Units, average, standard deviation, 25th and 75th quartiles of water quality indicators collected during the study period at the location shown in Figure 1. A plot of time series of all the water quality indicators during the study period is shown in Figure 2. The inverse correlation between pH and temperature is clearly notable. However, it is important to mention that correlations between water quality indicators (as shown in [31]) at monthly scales are affected by several parameters, including environmental conditions and anthropogenic factors (e.g., rainfall events, construction sites). The shape of the frequency distributions for the water quality indicators considered in this study demonstrates the difficulty of fitting a known distribution to these datasets ( Figure 3). For instance, the turbidity frequency distribution-shown in Figure 3e-is clearly non-normal and skewed towards lower values, with a long tail at higher values. On the other hand, some indicators (e.g., DO, specific conductivity) show more symmetrical distributions.

Results
The information table is set up to apply RST to the data collected at the Mason watershed outlet. Specifically, five water quality indicators are chosen as the conditional attributes and the sixth one as the decision attribute. The water quality probe reads each water quality indicator every hour. However, in order to introduce rough set theory to water quality analysis, coarse resolution (monthly average) data are examined. This not only helps with showing a limited amount of condition and decision attributes in the following tables, but also helps to reduce the random noise in the data sample. Numerical values are assigned to each of the 14 months and presented as time codes in Table 2.
The following analysis is based on the scenario in which pH is chosen as the decision attribute (D) and the rest of the indicators as condition attributes (C). In set theory formalism, this corresponds to U = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}, C = {DO, NO 3 , K, T, Tu}, and D = {pH}, where C and D can be either L, or M, or H.
The first step is to identify redundant (or identical) time codes. After analyzing each time code, {7} and {8} are the only ones found identical, not only in terms of attributes, but also in terms of decision. This means that every single conditional and decision attribute is the same for time codes {7} and {8}. The fact that they are identical not only in terms of condition attributes but also in terms of decision attributes shows that if DO and K are medium, NO 3 and T are low, and Tu is high, pH is certainly high. This is the first certain decision rule concluded from Table 2. No other time codes are found to be identical in terms of condition and/or decision attributes. Thus, since all the other codes are unique in terms of both condition and decision attributes, each one of them represents a unique rule. As a result, 13 unique rules are identified in Table 2. June-17 October-17 November-17 The second step explores the discernibility relation by eliminating one condition attribute at the time. There are 6 tables in Table 3 and each table except the first one is missing one attribute. Firstly, as discussed above, time codes {7} and {8} are identical and they are highlighted. Secondly, if DO, NO 3 , and T were removed, discernibility would be the same, as shown in Table 3(b),(c),(e). As a result, these three attributes are deemed dispensable. Thirdly, if K and Tu were removed, new decision rules would appear. These new rules are highlighted in Table 3(d),(f) as well. In Table 3 The formal process of identifying dispensable attributes is further investigated in Table 4. The first column represents the attribute that is removed, whilst the second column represents the unique condition attribute combination in the absence of the corresponding attribute. When K and Tu are removed, the unique condition attribute combinations are different than when other indicators are removed in the other cases. In the third column, the unique decision making rules are displayed. If column 3 is not identical to the rules found in the presence of all attributes (conditional and decision), then the removed attribute is deemed dispensable (column 5). More specifically, the posc(D) is equal to {(1), (2), (3), (4), (5), (6), (7,8), (9), (10), (11), (12), (13), (14)}. If column 3 does not match posc(D), the removed attribute is indispensable. As a result of the analyses shown in Tables 3 and 4, the indispensable attributes are specific conductivity and turbidity. Clearly, in the absence of K, decision rules {2} and {3} are identical, in the absence of Tu, decision rules {9} and {11} are identical, and decision rules {13} and {14} are also identical. These attributes are defined as the core attributes. Table 3. Analysis of the discernibility relation with identical time code highlighted for the following cases: (a) all attributes; (b) DO eliminated; (c) NO 3 concentration eliminated; (d) specific conductivity eliminated; (e) temperature eliminated; (f) turbidity eliminated.
Results in Table 5 demonstrate that turbidity is equally important in every scenario considered in the study with an importance degree of 0.14. Specific conductivity is the next important factor with an importance degree of 0.07. If the decision attribute is turbidity, the only indispensable attribute is specific conductivity and if the specific conductivity is the decision attribute, the indispensable attribute beside turbidity is temperature. According to the foregoing analysis, if any of the six water quality indicator needs to be retrieved because of a missed measurement, turbidity and specific conductivity are the core values that would provide useful information about the missing information. Moreover, the decision in every scenario is weighted towards turbidity since the importance degree of turbidity is higher than the importance degree of conductivity. Table 5. Importance degree of C attributes relative to the decision attribute D.

Decision Attribute
Indispensable Attribute 1 (Importance Degree)

Indispensable Attribute 2 (Importance Degree)
pH Tu (0.14) K (0.07) DO Tu (0.14) K (0.07) NO 3 Tu (0.14) K (0.07) T Tu (0.14) There is a direct relationship between temperature and all the other water quality indicators. Furthermore, conductivity has an effect on turbidity and turbidity influences dissolved oxygen concentration, which also affects nitrate concentration. However, there is no direct relation between pH and other indicators. Therefore, we start our analysis by selecting pH as a decision attribute. Based on the dispensability analyses shown above, conductivity and turbidity are the core condition attributes (Table 6). There is a strong relationship between these two core attributes in stormwater runoff across the Mason campus watershed, as previously shown by [32].  Table 6 shows the DM rules together with their strength, certainty, and coverage, computed according to Equations (9)-(11), respectively. Table 6 also shows the support of each DM rule (N). As mentioned above, N is the number of times that each DM rule was recorded. Table 6 shows that N is larger than 1 for DM rules 3, 4, 5, and 7. As a result, their strengths are higher than the strengths of the rules for which N = 1.
If the conditional attributes are identical and the decision attributes are not equal, the certainty of the DM rule is less than one. Thus, the certain DM rules are 1, 2, 7, and 9. In other words, if specific conductivity is high and turbidity is either low or high, then pH is certainly medium (according to DM rule 1 and 2). If specific conductivity is low and turbidity is either medium or high, then pH is certainly low (according to DM rule 7 and 9).
In order to explain the decision attribute in terms of condition attributes, the conditions and decision attributes need to be mutually replaced in every DM rule. The only certain inverse rule is DM rule 5, which indicates that if pH is high, then turbidity is high and specific conductivity is medium. Moreover, rule number 5 is a unique case. Since there is only one rule with a high pH value, the coverage for this rule is equal to 1 and, as a result, the certainty for inverse DM rule 5 is one.
The same analysis is repeated five times by selecting a different attribute as a decision attribute and setting the rest of the attributes as a condition attributes every time. Table A1 shows the DM rules and strength, certainty, and coverage for all the other cases. The highest strength factor (0.29) belongs to the rule in which the conditional attributes are specific conductivity and turbidity and the decision attribute is temperature. On the other hand, five rules show a certainty factor equal to 1 when the conditional attributes are specific conductivity and turbidity and the decision attribute is dissolved oxygen. Moreover, the coverage factor equals to 1 in one of the rules when the specific conductivity and turbidity are conditional attributes and the nitrate is decision attribute.
A similar analysis was performed also at weekly scale, by averaging the water quality indicators for each week of the study period. However, because of the high temporal variability in water quality, no redundant attribute was identified. Hence, at finer temporal resolutions, more attributes play an important role. Since this work is meant as an attempt to apply rough set theory to water quality data analysis, it would not be feasible to effectively display the step-by-step procedure using a larger dataset (e.g., weekly). Nevertheless, the developed approach based on rough set theory could be applied to data at any temporal resolution and to time series of any length.
The developed methodology can also be used to compare different months or the same month in different years. For instance, the months of April, May, June, and August of 2016 (case 1) can be compared to the same months in 2017 (case 2). In case 1, the indispensable attribute would be specific conductivity. However, in case 2, there is no indiscernible attribute. This shows that indiscernible attributes may vary depending on environmental and/or anthropogenic conditions. This kind of comparison highlights possible changes in the stream water quality conditions, whose sources can be potentially investigated by the analyst.

Discussion
This study investigates the application of RST to water quality analysis. RST does not require any prior information on the dataset and represents a powerful tool to deal with uncertainty and vagueness in the sample. Moreover, RST is capable of finding indiscernible attributes and extracting rules based on core attributes. This work presents the basic concepts of rough set theory and its application to six water quality indicators collected during a 3-year-long study period at the George Mason University campus in Fairfax, VA. More specifically, monthly averages for each water quality indicator are calculated and 14 months are considered.
It is important to mention that the streamflow velocity at the watershed outlet where data were collected is particularly high during and after rainfall events. As a result, the common relationships among water quality attributes are not observed in this case study that focuses on the monthly scale. For example, when water temperature is low, DO concentration is commonly high [33]. However, we cannot observe this rule at the monthly resolution. When a storm happens, even during summer when temperatures are high, the rapidly moving water contains more DO than stagnant water in winter days (when the temperature is lower).
Coarse temporal resolution (i.e., monthly) data are selected here in order to present a novel methodology in the field of water quality analysis. The coarse resolution helps with showing a limited number of attributes and decision values. Six different scenarios are studied here and in each scenario one attribute is assigned to be a decision attribute and the rest are reflected as conditional attributes. In most cases, specific conductivity (with an importance degree of 0.07) and turbidity (with an importance degree of 0.14) are the core conditional attributes. In addition, we generate DM rules for each scenario and calculate the strength, certainty, and coverage of each rule. The certain rules show that if specific conductivity is high and turbidity is either low or high, then pH is medium. Also, if specific conductivity is low and turbidity is either medium or high, then pH is certainly low. However, the coverage of these DM rules is the lowest among all DM rules. Five other possible DM rules with certainty lower than one are identified as well. There is one DM rule with coverage factor of one (DM 5), which means that there is only one DM rule with a unique pH value (high). As a result, the certainty for the inverse DM rule 5 is one.
Overall, RST was proven capable of finding core indicators and discovering DM rules. Considering more attributes and more data entries could increase the certainty of the identified DM rules and possibly identify additional DM rules. RST-based DM rules can be of tremendous help to planners and analysts in their decision making process. For instance, results from this study can be useful for university facility managers that monitor water quality across campus. If applied to a larger scale, the proposed methodology has the potential of providing timely, relevant, and essential water quality information.
Future work should look at the raw data at their native resolution (one hour). Although no difference in the DM rules was observed in the weekly analysis with respect to the monthly one, increasing the resolution to one hour may result in higher certainty in the DM rules. Moreover, other locations should be investigated to verify the efficiency of the proposed methodology and possibly sampling additional indicators (i.e., conditional attributes). Further conditional attributes can be related to atmospheric conditions, like the amount and duration of precipitation events and land cover/land use.  Abstract: This work assessed the quality of wind speed estimates in Uruguay. These estimates were obtained using the Weather Research and Forecast Model Data Assimilation System (WRF-DA) to assimilate wind speed measurements from 100 m above the ground at two wind farms. The quality of the estimates was assessed with an anemometric station placed between the wind farms. The wind speed estimates showed low systematic errors at heights of 87 and 36 m above the ground. At both levels, the standard deviation of the total errors was approximately 25% of the mean observed speed. These results suggested that the estimates obtained could be of sufficient quality to be useful in various applications. The assimilation process proved to be effective, spreading the observational gain obtained at the wind farms to lower elevations than those at which the assimilated measurements were taken. The smooth topography of Uruguay might have contributed to the relatively good quality of the obtained wind estimates, although the data of only two stations were assimilated, and the resolution of the regional atmospheric simulations employed was relatively low.

Summary
This work evaluated the use of techniques for assimilation of data from field measurements into initial conditions of atmospheric numerical simulations in order to obtain wind estimates in Uruguay, at heights of 100 m above the ground and lower. The wind was estimated with hourly frequency in a regular grid that covers the country. The field data to be assimilated was operatively measured in wind farms installed in Uruguay, using anemometers placed 100 m above the ground. The data was assimilated into initial conditions for the Weather Research and Forecast regional model (WRF) of the National Center of Atmospheric Research (NCAR), [1] using the module for data assimilation included in this model, the WRF-DA module [2].
The data assimilation process, also called analysis, is an essential component of numerical atmospheric forecasts, and its main purpose is to generate initial conditions for the predictions. The variables that compose an initial condition are called prognostic variables because the model uses their values at a given instant to compute their values at a later time. To generate an initial condition for a specific numerical model at a given time, a first approximation is generally used. This first approximation, called "background condition", usually consists of a prediction for the same instant, obtained with the same model, from previous initial conditions. The data assimilation system must combine the information from the background condition with the information from measurements of atmospheric variables (or variables of systems related to it; for example, the ocean, the soil, or the cryosphere). This combination of information is done in a way that optimizes the quality of the result in statistical terms, either minimizing the expected value of the sum of its quadratic errors or maximizing its likelihood. Note that the short-term predictions used as background values are, in turn, affected by field measurements that were assimilated during previous times. This allows considering information from regions or atmospheric levels in which few measurements are available since measurements, done earlier in other regions or at other levels, can propagate their influence through atmospheric dynamical processes into the zones that have relatively fewer observations. Kalnay [3] described the main techniques for data assimilation that are currently used, such as optimal interpolation, 3D-Var, Kalman filters, ensemble Kalman filters, and 4D-Var. The WRF-DA system implements the 3D-Var [4] and 4D-Var techniques [5], and a technique that combines the ensemble Kalman filter with 3D-Var [6,7]. Harlim [8] pointed that the referred techniques assume that the errors from the background conditions and the field measurements are unbiased and normally distributed, although numerical models inevitably have systematic errors. These assumptions can provide reasonable estimates of the first-order statistics while being practical to implement. However, these methodologies require caution for interpreting their higher-order statistical estimates, and an important challenge in data assimilation is to use existing methods in the presence of model systematic errors. The author proposed methodologies to mitigate the effects of model systematic errors on the results of the assimilation process. Rao and Sandu [9] proposed an a-posteriori error estimation methodology that quantifies the impact of model and data errors on the inference results of inverse problems, including the 4D-Var assimilation process. The model and data errors considered include both unbiased noise and systematic biases. The authors found that the proposed methodology could be useful to reduce and quantify uncertainties in a real-time system with feedback. Besides, the error estimates can be used to locate faulty sensors and to determine areas of maximum sensitivity, where improvements in the stations network or an increase in model resolution may be required.
In addition to its direct use in the numerical prediction process, the results of data assimilation can be considered "pseudo-observations" of atmospheric variables in regular grids. Note that these do not consist purely of interpolations of field measurements in a regular grid since they also consider the information from short-term predictions. The uses of the pseudo-observations obtained from a data assimilation process can be very broad, but they require an evaluation of their quality by comparison with field measurements not used in the assimilation process.
The current work used the rather conventional 3D-Var assimilation technique. In Section 2, we described the data used, both from field measurements and from numerical predictions used as background conditions, and we referred to the quality control method used for the field data from the wind farms. In Section 3, we described the main aspects of the 3D-Var assimilation technique and its implementation in this work using the WRF DA system. In Section 4, we described the main results, and in Section 5, we presented the conclusions.

Data Description
The wind data used in the data assimilation process was obtained from two anemometers installed in the "Rosendo Mendoza" (WF1) and "Valentines" (WF2) wind farms. The geographic locations of these wind farms are shown in Figure 1. The anemometers were placed 100 m above the ground, and they recorded mean velocities for successive periods of 15 min, which were transmitted to the National Dispatch of Electric Charges of Uruguay, operated by the Electricity Market Administration (Administración del Mercado Eléctrico; ADME) and the National Administration of Electric Power Plants and Transmissions (Usinas y Trasmisiones del Estado; UTE), and Uruguayan National Public Electricity Utility Organizations. In addition to the wind measurements, the mean electric power generated in the same 15-min periods and the corresponding quantity of aerogenerators that were effectively active were also transmitted by each wind farm. These additional data allowed for a quality control analysis of the measurements, as described by Orteli and Cazes Boezio [10]. With the historical information of wind velocity and electric power generated, the authors built an empirical wind-power curve for each wind farm. In those cases in which not all the aerogenerators of the wind farm were active, the power generation that would correspond to a condition of full availability was estimated by linear extrapolation. The authors considered that those combinations of wind and generated power that depart from the empirical curve beyond certain thresholds were suspicious of being affected by malfunctioning of the measurement or recording systems, or by occasional interference of the wake of an aerogenerator with the anemometer. The data from WF1 and WF2 available for this study were from the months from January to May and from November to December of 2017 (seven months). To evaluate the quality of the pseudo-observations obtained through the implemented data assimilation process, an independent station was used ("validation station", VSt), which is located at Colonia Arias, Uruguay. The station location is also shown in Figure 1. The station measures wind speed and direction at heights of 87 and 36 m above the ground and is part of a network of stations that measures wind velocity and solar radiation that has been operated by UTE since 2009. This network is described by Cornalino and Draper [11], and its measurements (that include wind velocities averaged every 10 min) are made available online by UTE. In recent years, many of the anemometric stations operated by UTE have become affected by wind farms. The VSt of Colonia Arias has been active since 2011, and during the period studied in this work, it was not affected by any wind farm. Figure 2 shows the wind speed at the WF1, WF2, and Vst sensors averaged for each local hour during the seven months considered.
The assimilation process uses regional simulations computed with the WRF model from NCAR, as shown by Cazes Boezio and Orteli [12]. The regional simulations take their initial and boundary conditions from global predictions made by the Global Forecast System (GFS) of the National Ocean and Atmosphere Administration (NOAA) of the Unites States. The horizontal grid of the regional simulations has a resolution of 30 km in the zonal and meridional directions, as shown in Figure 3. The vertical direction is discretized in 54 levels, 7 of which are within the first 100 m of height above the ground. In Appendix A, we indicated the parameterizations of physical processes employed, and we defined in detail the vertical discretization. The regional simulations have two purposes: first, they generate the background conditions into which the data measured in WF1 and WF2 are assimilated; and second, they allow for the estimation of the matrix of covariances of the errors of these background conditions. In Section 3, the hours of initialization and the simulated periods used for each one of these purposes were specified.

Methods
Next, we described the main aspects of the 3D-Var assimilation technique, which was used in this work and is available in the WRF-DA system. A complete description of this technique and its implementation in the WRF-DA system can be found in the NCAR Technical Note 453 [13], in Barker et al. [4] and Barker et al. [2], while the corresponding operational details are given in the WRF-DA User Guide [14].
The background condition information is displayed as a vector called x b , which contains the values of all the different prognostic variables that compose the background condition in a certain conventional order. This information is displayed for each variable and each point of the grid. The field measurements that are to be assimilated, obtained at the same instant that corresponds to x b , are displayed as a vector called y, also in a conventional order. The initial condition determined by the assimilation system is expressed as a vector called x a , with a structure analogous to that of x b . x a combines the information from the background condition x b and the field measurements contained in y, as expressed in Section 1. The 3D-Var technique determines the vector x a as that of maximum likelihood, conditioned to the information of the background condition and the field measurements, and is the vector x that minimizes the following expression [3]: that is, J(x a ) must be minim. In this formula, B and R are the matrices of error covariances of the background condition x b and the field measurements y, respectively. It is assumed that these errors are normal variables and are unbiased. The B matrix is thus related to the accidental errors of x b , while the model systematic errors or biases are assumed to be zero, as pointed out in [8]. Matrix B is defined as follows: where E represents the expected value of the elements of the matrix ( where ∇H is the gradient of H(x). If the linearized expression for H(x) is used, J(x) assumes a quadratic form. It is assumed that the errors of the field measurements contained in vector y are statistically independent of each other, so R is a diagonal matrix; its diagonal contains the variances of the errors of each field measurement.
The matrix B is essential to this assimilation system. First, its non-diagonal terms contain the covariances of x b errors at different grid points and also the covariances of errors of different variables. These covariances are necessary to propagate the information related to any field measurement through the horizontal and vertical directions and allow measurement of a specific variable to affect the analysis of others. In addition to this, the matrices B −1 and R −1 together determine the relative importance of the background conditions x b and the field measurements y to determine the analysis x a .
The WRF-DA system offers two methods to estimate B: the National Meteorological Center method (NMC), described by Parrish and Derber [15], and a method based on ensembles of predictions ([1], Chapter 9). Due to the availability of computer resources, we used the NMC method, which is relatively more economic. This method requires a database of pairs of x b ; each pair has two short-term predictions for the same hour obtained with different forecast horizons (for example, 12 and 24 h), and consequently with different initial conditions. The pairs of predictions correspond to certain hours of the day (for example, 0:00 and 12:00 GMT), and cover a certain period (for example, one year). It is assumed that the differences between the forecasts of each pair have statistical properties analogous to those of the background condition errors, and in particular, to estimate B, the following approximation can be used: where Note that, if n is the length of the x a, and x b vectors, the size of B is nxn. This implies a matrix of very large dimensions that would cause important technical difficulties to operate with B or B −1 , and even to store them in the computer memory. The 3D-Var technique implemented in the WRF-DA system solves this problem by factoring B as a product of certain matrices that have clear physical interpretations, allowing certain assumptions about their structures to be made and making them computationally tractable [2,4].
As a final remark in this subsection, we noted that both for the simulations of the background conditions and the estimation of the B matrix, a specific configuration of the WRF model was used, which here is the one described in Section 2.

WRF-DA Implementation for This Work
To apply the NMC method to estimate the matrix B, we used WRF-GFS simulations extended for 24 h and initialized at 0:00 GMT and 12:00 GMT during the entire year of 2016. The results obtained for 12 and 24 h after the initial conditions were used to estimate the matrix B, as indicated in the previous section. In order to gain insight into the spatial structure of the covariances contained in B, Decombes et al., [16] and Rivi [17] proposed "pseudo-single observation tests". Such tests consist of choosing a particular variable of x b in a particular grid point and suppose a hypothetical observation that increments in a fixed amount the background value of this variable at the selected grid point. The data assimilation process is carried on considering this single pseudo observation and prescribing a hypothetical value of the standard deviation of its error. Rivi [17] showed that the x a − x b difference was proportional to the covariance between the errors of the background variable in question at the chosen grid point and the errors of the rest of the background variables at all the grid points. The plots of the x a − x b differences for selected variables illustrate how the B matrix spreads the information of field measurements. In Appendix B, we summarized the fundamentals of the pseudo observation tests, and we showed some selected results for the B matrix estimated here.
The background conditions x b were obtained from WRF-GFS simulations initiated at 0:00, 6:00, 12:00, and 18:00 GMT, during those months of 2017 for which information from the WF1 and WF2 stations was available. We used the hourly results of these simulations from 4 to 9 h since the initialization of each simulation. In this way, four consecutive simulations could cover the 24 h of each day. The selected forecast horizon was the earliest for which the GFS prediction results were available in real-time, with some time left to carry out the processes described in this work. In this way, it is possible to implement these processes in an operative mode. For each local hour in Uruguay, Table 1 shows which WRF-GFS initialization cycle was used, and which hour within its forecast horizon corresponded to the background condition for that local hour. Table 1. Hours within the WRF simulations used to define the background conditions as a function of the initialization cycle and local time in Uruguay. Each row corresponds to an initialization cycle, and each column corresponds to a local hour in Uruguay. The hours used within each cycle are indicated, considering the hours elapsed since its initialization. In the present study, the assimilation made for each hour was independent of the previous assimilations; the corresponding background condition only incorporated information from the GFS global simulation. This is called a "cold start". Alternatively, the background conditions could also have considered the field measurements made in previous hours. This alternative is called a warm start. The warm start method was tested in the context of the present work and obtained results of equivalent quality to those obtained with the cold start, so here we showed only the latter results. The similarity of the results from both methods might be because the assimilated information was very local; the assimilation of data in a wider region could influence the simulations in the area of interest over more hours and thus increase the relevance of assimilation from previous hours. The verification of this hypothesis requires a database of field measurements more complete than the one that was available for this work.
Each field measurement is specified to the WRF-DA system in a file that contains the day and time of the measurement, the geographic coordinates, and ground elevation of the corresponding station, the height of the measurement sensor, and the records of wind speed and direction. This information is used to generate the vector of field observations, y. Considering that the topography of the regional model has some degree of smoothing, the elevation of the specified terrain is not the actual elevation, but the elevation corresponding to the topography of the model interpolated to the station's location coordinates. The height specified for the sensors is the elevation of the station plus 100 m. The specified module of wind speed is the average of the anemometer record for two consecutive 15-min periods centered on the hour for which the assimilation will be performed.
The specified wind direction is deduced from the background condition, interpolating each component of the wind vector to the geographical position and elevation of the anemometer. Although vanes are available at the stations, it was found that using their records produces results that are not as satisfactory as those obtained by deducing the wind direction from the background condition. This suggests that the quality of data obtained from these weathervanes is relatively poor, while regional short-term predictions are reasonably good with regard to this variable. It may be of interest to evaluate the effect of the assimilation of atmospheric pressure measurements on the wind direction obtained in the analysis, but no such observations were available to complete this analysis.
In addition, to estimate the R matrix, the WRF-DA system uses a file that specifies the standard deviations of the errors of the different kinds of field observations (obserr.txt). In the present work, we adjusted the errors specified in this file, proposing a value of 0.1 m/s for wind speed, which is reasonable for the type of sensors installed in the stations considered here [18]. For wind direction, we kept the default value proposed in the obserr.txt file, 5 • . We also pointed out that the WRF-DA system uses two "namelist" files to prescribe parameters to be set for the B matrix estimation and the computation of the resultant analysis (x a vector). In this work, we used the default values for these parameters, which are indicated in the WRF-DA user guide [14].

Results
To evaluate the quality of the obtained pseudo-observations, we interpolated each component of the wind vector to the geographic location of VSt and the levels of 87 and 36 m above the ground, and then we computed the wind speed at these levels. Figure 4a shows the systematic error or bias of the two levels for each hour of the day. The bias is defined as the average of the wind estimate at a given level and hour minus the corresponding observed value over all the studied days: b ≡ s a − s obs (5) where s a is the wind speed interpolated from the analysis to the location and level of each Vst sensor, and s obs is the correspondent measured wind speed. The overline represents the average over all the studied days. Figure 4b shows the relative bias, defined as the bias divided by the observed mean wind speed at the corresponding hour: b rel ≡ s a − s obs s obs (6) At 87 m, the bias was moderate, generally smaller than 0.5 m/s, while relative bias was generally smaller than 5%. At 36 m, the bias and the relative bias were similarly moderate: the bias was generally smaller than 0.5 m/s, while the relative bias during some hours was slightly larger than that found at 87 m. Note that 87 m above the ground was similar to the elevation of the assimilated observations, while the 36-m elevation was significantly closer to the ground. The moderate systematic errors found at both elevations indicated that the data assimilation technique effectively combined the information from short-term WRF predictions and wind measurements. The wind measurements from the wind farms lacked information about elevations relatively close to the ground, e.g., at 36 m, while the WRF simulations did include this information since they considered several elevations within the first 100 m above the ground but had errors of their own. The assimilation process corrected these errors at the locations and elevations of the anemometers in the wind farms and also transmitted the effects of these gains to other regions and elevations. The information that allowed these gains to spread was contained in the background error matrix B. To further illustrate this point, Figure 5 shows the bias of the background conditions 36 m above the ground at VSt and the bias resulting from the assimilation process (also included in Figure 4a). The background bias was larger for all hours. Since both biases were means of errors, it was possible to compute the significance of their difference with a two-tailed Student t-test [19]. It was found that these biases were different with a significance value lower than 0.05 for the local hours from 0:00 to 16:00, and from 21:00 to 23:00 (the hours with significant differences are indicated with a green bar in Figure 5). Next, we presented two statistical parameters related to accidental errors. Figure 6 shows the Pearson correlation of estimated versus observed wind speeds at the VSt station. At 87 m above the ground, there were correlations generally larger than 0.75 during the night hours, and generally larger than 0.80 during the day. At 36 m, the values were slightly smaller at nighttime and very similar during the day. Figure 7 shows the standard deviation of the error (estimated minus observed values) divided by the mean observed wind speed at each hour (RSTD). RSTD was approximately 25% for both elevations considered.

Summary and Conclusions
This work assessed the quality of wind speed estimates in Uruguay obtained with the WRF-DA system, which was used to assimilate wind speed measurements 100 m above the ground at two wind farms. The quality of the estimates was assessed with an anemometric station placed between the wind farms, that measured wind speed at 87 and 36 m above the ground. The information to be assimilated from field measurements was minimal, not only because it included only two stations but also because it lacked records of other atmospheric variables that are related to wind, such as atmospheric pressure. It was interesting to assess the extent to which these minimum field measurements could generate useful interpolations and evaluate the quality of the wind estimates at elevations both similar and different to those of the assimilated records. The measurement station used to validate the wind speed assessments was placed between those used for the assimilation process, so the conclusions of the assessment are applicable only to regions between the two main stations.
Wind speed estimates showed a low systematic error at the verification station, generally below 0.5 m/s at both 87 and 36 m above the ground. A relative systematic error was generally less than 5% of the average speed. This result indicated that the data assimilation technique effectively combined information from field measurements and background conditions. The assimilated measurements did not include information from elevations as low as 36 m. The background conditions did contain information from these low elevations, but with systematic errors of their own. The assimilation technique managed to propagate the gain from the observations at 100 m above the ground in the wind farms to other regions and to lower elevations. The covariance matrix of the background condition error was essential to the propagation of these observational gains.
As for the total error, the correlation values between observed and estimated wind speed and the standard deviation of the total error of each estimate, generally about 1 m/s to 1.5 m/s, suggested that the obtained estimates could be of sufficient quality to be useful in various applications. Some examples of applications in which such estimates are valuable are the estimation of wind climatology within the range of the considered height levels, retrospective simulations of transport processes and dispersion of air pollutants, or real-time estimation of environmental conditions in which systems whose operation can be affected by the wind are being used. In any case, the effective use of pseudo-observations in a specific application requires the estimation of their confidence intervals, which are necessary both to assess whether the accuracy of pseudo-observations is acceptable for the application in question and to take into account the effects of the uncertainty of these data if they are used. The generation of databases of pseudo-retrospective observations, such as the one presented in this work, allows for the estimation of these confidence intervals.
For future studies, we are interested in quantifying the effects of including atmospheric pressure observations on the quality of the results. We are also interested in evaluating the effect of expanding the region in which observations are collected for the assimilation of data on the results of the hot start option.
Finally, we pointed out that the topography of the studied region is not completely flat but is relatively smooth. This can contribute to the quality of results obtained by assimilating a few observations in numerical simulations with relatively low resolution. In the case of regions with relatively complex topography, the numerical simulation may require finer spatial resolution to properly take into account the effects of topography on the wind field. The proper quantity and location of measurement stations should also be evaluated in each case.

Appendix A
The setting of the WRF model used in this work is analogous to that of the work by Cazes Boezio and Ortelli [12] which evaluates short-term forecasts of wind power generated in Uruguay. The horizontal resolution was 30 km in the zonal and meridional direction. Figure 3 shows the grid points used to compute air temperature, pressure, density, and vertical velocity. The grid points used to compute the zonal and meridional velocities (not shown) were staggered one half of the horizontal resolution in the zonal and the meridional directions, respectively, according to the Arakawa C-grid arrange [1].
The model setting employed considered 54 layers in the vertical direction. The model vertical coordinate is η [1], defined as where p d is the hydrostatic component of dry air pressure at a particular atmosphere level, and p S and p T are the analogous pressures at the Earth's surface and the atmosphere conventional top, respectively. In this work, the atmosphere top was set to 50 hPa. Table A1 gives the values of h at each layer interface. The horizontal velocities and the air temperature were computed inside each layer, while the vertical velocity was computed at the layer interfaces, according to the Lorenz vertical grid arrangement [1].
The WRF model allowed us to choose several parameterizations of physical processes, especially for atmospheric boundary layer processes, surface layer processes, short and long wave radiative heat transfers, convective precipitation, clouds microphysics, and drag associated with gravity waves. Table A2 shows the parameterizations chosen in the current work and indicates references for them. 46 Data 2019, 4, 142 Table A2. Parameterization of physical processes used.

Physical Process Scheme Used
Short Wave Radiation Dudhia scheme [20] Long Wave Radiation RRTM schene [21] Surface Layer Revised MM5 surface layer scheme [22] Atmospheric Boundary Layer Yonsei University scheme [23] Microphysics Hong et al. scheme [24] Cumulus Precipitation Simplified Arakawa Schubert scheme [25] Gravity Wave Drag Kim Arakawa scheme [26] Land Processes Noah land surface model [27] Appendix B Kalnay [3] showed the result of the optimization indicated in Equation (A1), that yields the 3D-Var analysis x a and is equivalent to the result of the optimal interpolation procedure, We chose a particular variable and a particular grid point that corresponds to the kth element of x b or x a , according to the conventional order of these vectors. We defined the synthetic observation y at the chosen grid point as the kth value of x b plus a conventional increment Δ, Since the ∇H operator produces analogous variables to those of the vector of observations y from x b , and the analogous to the synthetic observation y in the background condition is x b (k), the correspondent ∇H is a vector with all its terms equal to 0, except the kth element, which is equal to 1, so The matrix R of covariance of observation errors has a single element, which is the covariance of the synthetic observation y. We chose a conventional value s 2 for this covariance. Rizvi [17] showed that with these choices, Equation (A2) yields where B k is the k column of B. Equation A5 indicates that the x a − x b difference is proportional to a vector that gives all the covariances of the error of x b (k) with the errors of all the other variables of x b . Note that the x a − x b difference is independent of the x b condition chosen. Plots of x a − x b fields for a selected variable at a selected level of the numerical domain can help to understand the geographical structure of the error covariances of that variable at that level.
Here, we chose to increase in 1 m/s the zonal ("test A") and the meridional ("test B") wind of a particular x b condition at the grid point and level closest to the location of the Vst station and the level of 87 m above ground. s was chosen as 1 m/s. Figure A1a shows the x a − x b field for zonal velocity resulting from test A, at the same model level of the selected grid point (level 7 from the ground). Figure A1b shows the x a − x b field for the meridional velocity at the same level, for "test B". These results indicated that WF1 and WF2 were located in regions, where the zonal and meridional winds are well correlated with those of the Vst, and, therefore, wind estimates at this location benefit from observational gains obtained at WF1 and WF2. Besides, this type of analysis can be useful to define the density of a station network intended to cover a specific area. Figure A2 shows the vertical profile of the x a − x b zonal wind difference for test A, at the point of the horizontal grid in which the increase of zonal wind speed was prescribed. This profile shows the vertical structure of the covariances of zonal velocity background errors with the background error at the level chosen for the test. It was found that these covariances increased with the elevation up to 500 m above the ground, and then decreased to values that were close to 0 in the upper atmosphere ( Figure A2a). Figure A2b shows a zoom of the profile shown in Figure A2a for the first 150 m above the ground, in order to focus on the levels of interest to this work. Although covariances were lower at lower elevations, their relatively large values indicated that observational gains obtained at elevations about 100 m could propagate quite directly to lower levels. The analogous vertical profiles from test B were found to be very similar to those from test A, and are not shown here.   Figure A2, a vertical profile at the first 150 m above the ground. Dots indicate the model grid points, and the arrow, the grid point at which the perturbation is prescribed.

Introduction
Extremes in climate such as floods, droughts, and cold and heat-waves can have significant societal, ecological, and economic impacts globally [1]. Since the publication of the third assessment report of the Intergovernmental Panel on Climate Change (IPCC) in 2000, characterizing extremes under past and projected future climate has generated rapid interest [2]. The climate modelling community, for instance, has spent increasing effort to capture high-frequency extreme events in their simulations of historical and future projected climate. The underlying aim for both regional and global climate modelling exercises (e.g., CORDEX and PRIMAVERA) 1 has been to develop a better understanding of the evolution of extreme weather events under long-term climate change and variability.
The impetus to better understand extreme weather events is further driven by the impact modellers who assess sectoral damages at varying spatial scales. The two vital characteristics of climate that are at the core of impact models are (i) mean climate and (ii) the occurrence and frequency of extreme events [3]. An increasing notion shared within the climate research community is that even a relatively small change in the frequency or severity of extreme weather events (i.e., in the tails of the probability distribution function) would have profound impacts on life and assets [4], thus making it further imperative to analyze extremes at higher temporal and spatial resolutions. For the scientific community focusing on impacts of climate change and variability, historical observations of extreme indicators can facilitate a better understanding of the role of extreme events and sectoral implications [5].
Largely driven by the requirement for a robust definition of climate extreme indicators, the Expert Team on Climate Change Detection and Indices (ETCCDI) 2 in 1999 led the first efforts in defining a set of climate extreme indices (CEIs) that provide a comprehensive overview of temperature and precipitation statistics [4,[6][7][8]. The ETCCDI has developed an internationally coordinated set of core climate indices consisting of 27 descriptive indices for moderate weather extremes 3 [9][10][11]. The preliminary set of these 27 core indices were drawn up keeping the detection and attribution needs of the research community in mind [10,11]. Noting the limitations of the ETCCDI indices with regard to restricted scope/usage in assessing sectoral impacts, additional sector-relevant indices were recommended and developed by the Expert Team on Sector-specific Climate Indices (ET-SCI) [9].
This study introduces a new open-access high-resolution global gridded (0.25 • × 0.25 • ) 4 dataset of 71 CEIs (including the original 27 ETCCDI indices), covering the period 1970-2016. The dataset (hereafter referred to as "CEI_0p25_1970_2016") aims to contribute to the existing CEI databases by making available the first comprehensive CEI dataset currently unavailable for the climate community at a high resolution with worldwide coverage. Moreover, a consistent global CEI dataset covering a long historical time period can lay a framework for not only analyzing observed changes in extremes, but also potentially improving information services on extremes at regional scales [10].
The CEI_0p25_1970_2016 are a set of core (Table S1 in Supplementary Materials) and non-core (Table S2 in Supplementary Materials) indices 5 as defined and developed by the ETCCDI/ET-SCI, and adopted by the World Meteorological Organization (WMO). The set of "core indices" refers to indices that were developed by ETCCDI targeting the research community focusing on "detection and attribution" in climate science (details in Section 4).
The rest of the paper is organized as follows. Section 2 describes the CEI_0p25_1970_2016 in detail. Section 3 discusses the underlying meteorological dataset and the tools/methodology used in the preparation of the CEI_0p25_1970_2016. Section 4 outlines the novelty, potential scope, application, and limitations of the CEI_0p25_1970_2016. Dataset availability, ongoing work, and some recommendations for future research are summarized in Section 5.

Spatial and Temporal coverage of CEI_0p25_1970_2016
The CEIs included in this study encompass all but two indices 6 that are part of the complete list of 73 ETCCDI/ET-SCI core and non-core indices [9]. The CEI_0p25_1970_2016 is derived using meteorological variables from the reanalysis data product Global Land Data Assimilation 2 Formed by the World Meteorological Organization (WMO) Commission for Climatology (CCl). 3 Extreme events that by definition typically occur a few times annually rather than severe impact, decadal weather events. The indices for moderate weather extremes use absolute or percentile thresholds generally set at moderate values (e.g., 25 • C, 90th percentile). 4~2 7 km × 27 km at the equator. System (GLDAS) [13]. GLDAS is a new generation of reanalysis developed jointly by the National Aeronautics and Space Administration (NASA) Goddard Space Flight Center (GSFC) and National Centers for Environmental Prediction (NCEP) [14]. Because the spatial extent of GLDAS covers all land north of 60 • S, the indices in CEI_0p25_1970_2016 are also computed over the corresponding 1440 (longitude) × 600 (latitude) grid cells. Further description of GLDAS as well as the reasons for using it as a data source in this study are discussed in Section 3.

Other Existing Datasets Incorporating CEIs
While other similar historical gridded CEI datasets do exist, they are either (i) regional in coverage, (ii) at coarser resolution, or (iii) limited in the number of indices available for research purpose. Examples include (i) the 30 CEIs made available by E-OBS at 0.10 • gridded resolution for Europe http://surfobs.climate.copernicus.eu/dataaccess/access_eobs_indices.php, (ii) the global 0.50 • gridded resolution S-14 indices dataset of 27 core ETCDDI indices available at http://h08.nies.go. jp/s14/ [15], and (iii) the global 3.75 • × 2.5 • resolution HadEX2 and GHCNDex datasets of 27 core ETCDDI indices available at https://www.climdex.org/learn/datasets/ [6,7]. To the best of the author's knowledge, the present database CEI_0p25_1970_2016 is currently the only comprehensive high-resolution global-gridded historical dataset of ETCCDI/ET-SCI core and non-core indices.

Data Acquisition and Processing
The CEIs used in this study were computed utilizing the WMO ET-SCI recommended and developed R-software package "ClimPACT2" 7 [9]. R [16] is an open-source language and software environment, developed primarily (but not solely) for statistical computing, and is applied widely in climate research. Moreover, ClimPACT2 also makes use of several R subroutines, such as SPEI [17], and is designed for operating on parallel computing infrastructure.
For the computation of CEI, ClimPACT2 requires the following meteorological variables (i) maximum near-surface air temperature (TX), (ii) minimum near-surface air temperature (TN), and (iii) near-surface total precipitation (PR), all at daily timesteps. These variables in the native Network Common Data Form 4 (NetCDF4) 8 format were obtained from the GLDAS-version 2 9 [13,18,19], available at 3-hourly timesteps and a fine spatial resolution of 0.25 • × 0.25 • . GLDAS is a global high-resolution reanalysis dataset that incorporates satellite and ground-based observations, producing optimal fields of land surface states and fluxes in near-real-time [13].

Choice of GLDAS as a Reanalysis Dataset for the Computation of CEIs
Vis-à-vis other global gridded reanalysis datasets, GLDAS offers several advantages. First, GLDAS provides a consistent quality-controlled long global gridded time-series of the required variables (i.e., TX, TN, and PR) at a high spatial resolution. Other reanalysis data products available were found to have either a coarser spatial resolution (e.g., ECMWF-ERA40 and JRA-55, both available from the mid-1950s but at 1.125 • ), or a shorter time series (e.g., newly released ECMWF-ERA5 at 0.281 • from 1979-present day, and NCEP-CFSv2 at 0.205 • from 2011-present day). Second, GLDAS runs in near-real-time, offering the potential to regularly update the database presented here.
The choice of GLDAS for computing the current set of indices was further motivated by its large number of additional meteorological (e.g., specific humidity, surface pressure), land surface state (e.g., soil moisture, surface temperature), and flux (e.g., evaporation, sensible heat flux) variables, not commonly available in other reanalysis data products for a long time-series and at a high spatial resolution 12 . While none of these additional variables are required for computing the current set of indices, another dataset [12] of sectoral indices that are not presently implemented in the ETCCDI/ET-SCI indices requires a subset of these variables (details in Section 5.2). The two datasets of indices (current and [12] under prep.) will together comprise a large (~85) number of indices both based on the same underlying GLDAS data, thus enabling the climate impacts community to access "ready-to-use" multi-sectoral indices.
GLDAS has been comprehensively evaluated using different regional/global reference datasets in earlier studies (e.g., see [14] who compare the GLDAS daily surface air temperature at 0. Equally well-documented are certain known limitations of the temperature and precipitation estimates in GLDAS. Whereas spatial details in high mountainous areas are not sufficiently estimated by the GLDAS data, the surface air temperature estimates are generally accurate, with some caution recommended for mountainous areas [14]. Previous studies that have incorporated GLDAS data include (i) [22] for impact assessment studies in energy sector, and (ii) [23,24] for the analysis of regional environmental conditions and changes. For a comprehensive list of GLDAS-related references, readers are referred to https://ldas.gsfc.nasa.gov/gldas/GLDASpublications.php.

Novelty of CEI_0p25_1970_2016
The CEI_0p25_1970_2016 is currently the only dataset providing researchers and policymakers with an exhaustive list of ETCCDI/ET-SCI recommended indices, dating back to the preceding four decades, covering nearly all global land grid-cells, and assembled using a quality-controlled reanalysis data product at a high spatial resolution. Considering the computational time and resources required for assembling a comprehensive dataset of CEIs at a global scale, the biggest asset of CEI_0p25_1970_2016 from the users' perspective is the open access to a pre-compiled ready-to-use set of indices in its native data format, along with a web interface allowing robust statistical analysis and mapping of the results in a few easy steps (details in Section 5.1).

Scope of Application
The CEIs included in this study are not only suited as assessment tools in multiple sectors such as Agriculture, Health, Energy, Water resource, etc., but also as metrics capable of being aggregated as composite indicators for risk assessment and vulnerability studies (e.g., as demonstrated and applied recently by [25] over Italy in the form of a "Climate Risk Index"). A number of earlier studies have demonstrated the efficacy of the CEIs, both in detection and attribution studies, as well in the impacts assessment of climate change and variability in key sectors. Examples include (i) [26] who use "Rx1day" (Table S2) to examine the changes in model-simulated extreme precipitation by decomposing the daily regional-scale extreme precipitation as contributions from atmospheric thermodynamics and dynamics; and (ii) [27] who consider a broad range of CEIs (from Tables S1 and S2), for assessing future climate change impacts on agriculture, human health, ecological ecosystems and utility (energy demand) in Canada.
Moreover, it is widely known and established in sectoral impact studies employing empirical methods that a large proportion of variation in the outcome variable is better explained by the climatic variables accounting for moderate or severe extremes (e.g., the relationship (i) between crop productivity and a variant of the index "GDDgrown" in Table S1, known as killing degree days (KDDs) [28], (ii) between electricity consumption and degree-day indices namely "CDD" and "HDD" [29]). CEI_0p25_1970_2016 for instance provides an instant resource platform for empirical modellers to download and investigate a number of potential predictor variables that are robust moderate/severe extreme indicators.
The robust characteristics and climatological attributes captured by ETCCDI/ET-SCI indices can facilitate consistent comparison of results across different climatic zones, different time periods, and the identification of regions (clusters) with similar characteristics in extremes (e.g., grid cells with similar trends in annual days when daily maximum temperature is at least 30 • C ("TXge30", Table S1). The identification of common hot spots can be of potential interest to policymakers, insurance companies, and country planners for the assessment of the risk and vulnerability of regions to extreme weather disasters (e.g., flooding, drought, heat waves).
While the mean climatology of a location is invariably well-captured by the state-of-the-art reanalysis data products and Earth System Models (ESMs), extremes (particularly in precipitation) at fine spatial scales have been difficult to replicate [30]. CEIs provide the modelling community with a detailed set of indicators enabling the comparison of different input data sources in their ability to model extremes [8,9].
Finally, with the planned inclusion of additional indices to the current inventory of ETCCDI/ET-SCI indices in the near future [9], the development of larger CEI datasets for historical and future time periods could make valuable instruments available to researchers, policymakers, and adaptation planners focusing on occurrences and return periods of rarer extreme meteorological events (e.g., using extreme value theory).

Limitations of Indices Included in CEI_0p25_1970_2016
While the CEIs included in this study (Tables S1 and S2) were developed by the WMO expert teams to largely address the growing demands of sectoral impact modellers, certain limitations of the existing ETCCDI/ET-SCI indices have been recognized, and efforts are ongoing to develop other robust indices meeting multi-sectoral requirements [9]. For instance, under the current framework of ET-SCI definitions, the Heat Wave Magnitude (HWM) indices (Table S2 are based on the methodology developed by either [31] or [32]. The more recently developed HWM Index daily (HWMId) defined by [33] and implemented in various sectoral studies (e.g., [34] for river discharge and [35] for assessing impacts on wheat yields 13 ) is yet to be included in the inventory of ETCCDI/ET-SCI indices.
Moreover, the ETCCDI/ET-SCI indices are defined largely at annual timescales, and some are defined at monthly timescales as well. For certain sectoral applications (e.g., in Agriculture and Energy), the current set of monthly/annual indices may prove less useful, as climate anomalies need to be computed over different timescales. For instance, the "GSL" index (Table S1) in its current form defined at annual timescales does not account for heterogeneity in the length of crop-specific growing season (further details in [35]). In such cases, using indices computed at annual timescales can lead to misleading results. Some further shortcomings of the existing ETCCDI/ET-SCI indices are discussed and recommended for future work (details in Section 5.2).
Lastly, it must be emphasized that because CEI_0p25_1970_2016 utilizes temperature and precipitation data from GLDAS, when using the current set of indices users should keep in mind the known uncertainties and limitations of the GLDAS data (as discussed in Section 3.2).
The files follow the naming convention CEI_timescale_GLDAS_0p25_deg_hist_1970_2016.nc (Figure 1), wherein "CEI" is the abbreviation of the index (as described in Tables S1 and S2) and "timescale" is either "ANN", "MON", or "DAY", relating to annual, monthly, or daily timescales 15 over which the corresponding CEI is computed.
The size of the individual NetCDF files vary between 156 megabytes (Mb) and 1.9 gigabytes (Gb), depending on the CEI and time-scales at which it is computed. One exception is the file "hw_ANN_GLDAS_0p25_deg_hist_1970_2016.nc" which is 3.1 Gb as it includes twenty individual indices in a single netCDF4 file. GLDAS does not include data over (or near) water bodies. Such grid cells where the required GLDAS TX, TN, and PR data are not available for computing the CEIs are identified by missing values "1.e+20f". Further details of the variables/dimensions in the individual netCDF4 files can be examined using either NCO or CDO commands, such as "ncdump -h netcdf_file_name" or "cdo sinfo netcdf_file_name", respectively. For creating quick plots and exploratory data analysis of individual netCDF files, open-access data tools such as Panoply (https:// www.giss.nasa.gov/tools/panoply/) or NCview (http://meteora.ucsd.edu/~pierce/ncview_home_ page.html) are recommended. Sample plots using Panoply for the four indices ("TXx", "HWM_Tx90", "CSDI", and "PRCPTOT") are shown in Appendix A (Figures A1-A4).

Ongoing Work and Recommendations for Work in Future
The indices in CEI_0p25_1970_2016 are intended to be updated post-2016 years, subject to the availability of the required GLDAS raw meteorological variables in the coming years. The updated longer time-series of CEIs of more recent years should prove beneficial to the research community focusing on recent extreme events (e.g., the droughts of 2017 and 2018 in south-east Australia, the heat waves of 2018 in California, United States of America, the more recent January-February 2019 extreme cold wave in North America). Additionally, upon the formal inclusion of any new indices (such as the "HWMId" and the "Crop-specific GSL" as discussed in Section 4.3) by the WMO expert teams to their list of ET-SCI indices, the same will be formally included in the existing dataset presented in this study.
While the ETCCDI/ET-SCI core and non-core indices employed in this study encompass a very large spectrum of sectoral and non-sectoral indices, the list is by no means exhaustive. Motivated by the suggestions of the R ClimPACT2 [9] package creators, another dataset of indices largely relevant for health and energy sectors (called "HEI_0p25_1970_2016") is currently under preparation [12]. Some features of HEI_0p25_1970_2016 will for instance be the inclusion of the two ETCCDI indices (i.e., CDD and HDD [36]) that are not included in this study 16 . Moreover, HEI_0p25_1970_2016 will also account for additional meteorological variables (e.g., near-surface relative humidity and wind speed) for computing non ETCCDI/ET-SCI indices, such as the Humidex [37,38], the Heat Index (HI) [39,40], and the Discomfort Index (DI) [41,42]. Together, both CEI_0p25_1970_2016 and HEI_0p25_1970_2016 are aimed to address the growing needs of the climate impact community, by overcoming the current data scarcity of high-resolution global gridded CEIs in earth science.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2306-5729/4/1/41/s1, Table S1: 32 Core ET-SCI indices. Bold indicates index is also an ETCCDI index. (TX: daily maximum near-surface air temperature, TN: daily minimum near-surface air temperature, PR: daily near-surface total precipitation, H: Health, AFS: Agriculture and Food Security, WRH: Water Resources and Hydrology); Table S2: 39 non-core ET-SCI indices. Bold indicates index is also an ETCCDI index. Sectoral abbreviations same as in Table S1. Acknowledgments: The author is grateful to Nicholas Herold in assisting with the R software package ClimPACT2; Lisa Alexander and Enrico Scoccimarro for constructive discussion on sectoral extreme indices; Enrica De Cian for feedback on the draft version of the paper; the high-performance computing resources of the Boston University Shared Computing Cluster (SCC) on which the CEIs were computed; and NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) for making GLDAS data publicly available. Developers of R SPEI package, CDO, and NCO are also acknowledged for providing open-access tools that were used for data preparation in this study. The constructive feedback received from three anonymous reviewers helped to improve the manuscript further.

Conflicts of Interest:
The author declares no conflict of interest.

Introduction
In an era of major global change (i.e., in climate, land use, habitat fragmentation, and movements of humans and other species) the introduction of invasive species and the geographic expansion of endemic species to novel ranges are occurring at unprecedented rates [1,2]. Invasive species have extensive negative impacts on the ecosystems they invade, such as losses in both taxonomic and functional diversity [3], resulting in severe economic consequences. For example, in the USA invasive insects cost the agricultural sector USD 13 billion per year due to crop loss and damage [4], routine activities to control Aedes mosquitoes in Cuba cost USD 16.80 per household [5], and Great Britain spends USD 34.6 million per year on the control of invasive fresh-water species [6]. The stakes are even higher when invasive species can vector pathogens that cause disease in humans, animals, or plants. Some arthropod species that are highly invasive are among the most effective vectors of human pathogens. Mosquitoes, such as Aedes aegypti [Linneaus, 1762] and Aedes albopictus [Skuse, 1895], are invasive arthropods able to vector pathogens (henceforth referred to as IAVPs) and transmit three globally important viruses to humans: chikungunya, dengue, and Zika [7]. Likewise, the Asian longhorned tick Haemaphysalis longicornis [Neumann, 1901], which is rapidly expanding across the east coast of the USA [8], can vector the severe fever and thrombocytopenia syndrome virus, which has human fatality rates exceeding 30% in Asia [9]. Of veterinary importance, the biting midge Culicoides imicola [Kieffer, 1913] is currently expanding in range throughout Europe and can transmit the bluetongue and African horse sickness viruses [10]. Plants are also affected, cotton whiteflies (Bemisia species, including Bemisia tabaci [Gennadius, 1889]), now present in every continent except Antarctica, can transmit over 100 different plant viruses [10].
In order to prevent the potentially catastrophic ecological, economical, and health consequences associated with IAVPs, mitigation methods must be rapidly employed following species introduction or expansion into a new geographical range [11]. Mitigation methods may include IAVP control and eradication, or communication of the risks to policy makers, physicians and the public, and environmental data are often used to inform these different processes. Here, we describe the benefits and limitations associated with using i) remotely sensed data, which we define as data acquired by sensors mounted on satellite, airborne, or other distant means, and ii) proximally sensed data, which we define as having been collected by a ground-based, or other platform, in close proximity to the variable being measured, in order to inform IAVP mitigation.

Linking Environmental Earth Data and IAVPs
As observed by Malanson and Walsh; "detection and eradication [of invasive species] are essentially spatial problems. They primarily require learning where the invasives are and getting there" [12]. This is a simplification of a more complex issue, which may also involve a lack of personnel or funding to efficiently implement detection and eradication, insufficient communication or perception of the IAVP risk, and even IAVP resistance to control measures. However, environmental data can be used to address the "spatial problems" by informing predictions on where invasive species may be introduced and become established.
In some instances, using environmental data in the mitigation of an invasive species can be as straightforward as directly detecting the species. For example, thanks to the reflectance properties of vegetation, invasive plants can be mapped using indices such as NDVI (the Normalized Difference Vegetation Index) or EVI (the Enhanced Vegetation Index) that are derived from remotely sensed data that measures infrared reflectance (e.g., in [13], and also see [14] for a review on this method). A similar concept can be applied to invasive arthropods which cause damage to vegetation, and NDVI data has been used to track the dispersal of invasive insects by monitoring defoliation [15,16]. Although weather radars have detected mass migrations of invasive insects [17], as yet remotely sensed data cannot directly characterize IAVP geographical distributions. There are promising proximal sensing methods that use reflectance data from cameras that can detect and differentiate between multiple fruit fly species, including those that vector crop pathogens [18] (see also [19] for an interesting application of proximal sensing of an invasive pathogenic plant bacterium). However, mapping IAVP distribution in real-time is often less desirable than preempting the potential geographic distribution, as surveillance and control are more efficient if implemented prior to the establishment of a species [11,12,20].
Species distribution models (SDMs) are frequently used to predict the current and future geographic distribution of IAVPs [11,12,20] due to their ability to be applied to species that cannot be directly detected because they are small, elusive or inhabit remote locations. Typically, correlative SDMs apply an algorithm, such as maximum entropy, boosted regression trees or random forest, that combines empirical occurrence data on the species with relevant environmental data (e.g., average temperature and precipitation) to predict the spatial and temporal distributions of a species [21,22]. Mechanistic models, such as compartmental or agent-based models have also been developed, alone or in combination with correlative models, to characterize potential species distributions [23,24]. Here, we adopt a broad definition of SDMs to include any modeling approach that aims to predict the distribution of a species, from logistic regression to multi-criteria decision analyses. In the last few decades there has been a sharp increase in the number of publications on SDMs, with hundreds published each year [25]. This dramatic rise in interest in SDMs is in part due to advances in remote sensing technology, including new satellites and sensors that have hugely increased the quantity and quality of environmental data that can be used [26].

Gathering Environmental Data for SDMs
The accuracy of SDM predictions is highly dependent on how closely the data used in the model match conditions relevant to the species and, despite considerable increases in both spatial and temporal resolution of available environmental data, there is often still a substantial mismatch in the conditions represented by the available data and those experienced by IAVP species. Environmental data used in SDMs can be classified as bio-physical or climatic, both of which can be measured by proximal sensing, but data used in SDMs is typically derived from remote sensing.

Bio-Physical Variables
Bio-physical variables generally include land-use, land cover, primary productivity, and vegetation phenology and fragmentation. Bio-physical variables are almost exclusively derived from Earth observation satellites which measure either reflectance at various wavelengths in the electromagnetic spectrum, or emitted radiances in the thermal spectrum. These reflectance data can be used to calculate NDVI and NDWI (Normalized Difference Water Index), which are applied instead of, or alongside, other satellite imagery/reflectance data to ascertain variables such as land-use and land cover (Table 1). Satellite data are available in a wide range of spatial (<1 m to >5 km) and temporal (hourly to yearly) resolutions, and allow for some user flexibility based on the scale at which the model is applied (e.g., eco-region, county, national, global). Given technological limitations due to on-board storage media or limited opportunity for data transmission, spatial and temporal resolution of remote sensing tools are inversely correlated [27]. As the majority of bio-physical variables remain static or exhibit very gradual changes over time, spatial resolution is often prioritized over temporal resolution. For example, since 1972 the NASA-USGS Landsat series has provided uninterrupted data on the Earth's surface at a relatively high resolution of 30 m, but measurements are only taken once every 16 days, although this will increase to every eight days starting from 2020. NOAA VIIRS provides a series of environmental data, as well as monthly cloud-free composites of visible infrared emittance for the entire Earth during night at a resolution of 15 arcsec (<500 m at the equator) [28], which can be used as a proxy for human settlements to inform the possible human contact risk associated with IAVP presence [29,30]. Since the 1980's, satellite remote sensors such as AVHRR and, many years later, MODIS, have allowed the derivation of more spatially and temporally continuous vegetation and surface temperature data at a moderate spatial resolution (250-1000 m), but with more frequent (daily) observations, thus greatly enriching the available datasets [31]. In addition, the more recent Sentinel missions (2A, 2B, 3) from the European Space Agency (ESA) have offered optical data at 10-300 m spatial resolution every 3-7 days since 2016. The accuracy of satellite data is generally strongly linked to the method of derivation, geographical region, climatic condition, and availability of in-situ data for calibration, which in turn affect SDM results. For example, cloud cover often hinders satellite optical data, especially in inter-tropical regions, but there are multiple statistical approaches that can fill these gaps over space or time [32]. Remotely sensed data on bio-physical Earth observations can be combined with ground-based (in-situ) data to provide crucial information on habitat structure, and are therefore commonly used in SDMs. Large extent datasets for bio-physical variables, such as the Global Copernicus Land Cover maps (spatial resolution = 100 m) [33], the pan-European Corine Land Cover (100 m) [34], the USA National Land Cover Datasets (30 m) [35] and the global MODIS Land Cover Type/Dynamics (500 m-1 km) [36], are derived using a combination of satellite and ground (in-situ) sensors [35]. However, these data are typically presented as a single multi-annual "snapshot" using a composite of several observations over time, and thus, provide very limited information on temporal variation.

Climatic Variables
Climatic data, which is often fundamental in the physiology of arthropods, includes variables such as land surface temperature (LST) or air temperature and precipitation. Such data are commonly derived from remote sensing and are frequently used in SDMs. Precipitation can be measured by active satellite sensors in the micro-wave region and offer high temporal (hourly) but coarse spatial resolution data (e.g., GPM and TRMM; Table 1). As for bio-physical variables, the spatial and temporal resolution of satellite data for climatic variables are also inversely related, which results in a lack of high spatial resolution data at higher temporal frequencies of measurement. In the case of climatic variables, which can vary minute-by-minute, temporal resolution is highly important. This trade-off often plays a significant role in attaining high accuracy results from SDMs. To fill these temporal gaps, recent satellite missions that measure radiance in the thermal spectrum bands (i.e., which measure temperature) are focused on providing higher spatial resolution climatic data with frequent measurements (e.g., Sentinel 2A/B data at 10 m with weekly acquisitions).
In addition to satellite-derived Earth data, data collected by ground-based weather stations, or a combination of both, such as the WorldClim, PRISM, Daymet and ECA&D datasets, are perhaps the most widely used climatic data in SDMs due to the user friendly format that requires comparatively little pre-processing compared with satellite data (e.g., [37][38][39][40][41][42]). As weather stations measure variables at discrete geographic locations these data must be interpolated to create a continuous spatial layer before being used in SDMs. There are multiple methods by which weather station data can be interpolated, but all are limited by the density of weather stations in the study area, and are confounded by topographical features and spatial gradients, although satellite or other remotely sensed data can help to remedy some of these shortcomings [37,43,44].
For regional SDM applications high resolution datasets are required, but the availability of such data remains a challenge also for current satellite missions, despite considerable improvements during the last few years with the advent of the new Landsat and Sentinel missions.

Issues Faced When Using Environmental Data in IAVP Models
As described, there are many environmental datasets available that can inform SDMs. However, these datasets are often of limited relevance in the context of IAVP modeling, not least due to substantial mismatches between the spatial resolution at which predictions are made and the resolution at which the predictions are interpreted, communicated or applied. The spatial resolution of model predictions are constrained by the resolution of the environmental data used, which is typically in the order of kilometers. However, the subsequent predictions are often used to inform actions applied at spatial scales in the order of meters, such as informing which neighborhoods should be targeted for surveillance and control, where to install deer fences to control tick abundances, or communicating IAVP presence. Although some inaccuracies in SDM outputs may seem trivial in the context of a scientific paper, they can pose a serious issue when accurate predictions are required for use in "real world" scenarios. For example, SDMs and model-derived data are used by the Centers for Disease Control and Prevention (CDC) in the USA to inform administrative regions on the likelihood of Aedes mosquito invasion, in order to distribute vector control resources (e.g., intensive surveillance and insecticide application) [45,46]. Consequently, disparities between the spatial resolution of the data used to inform the model and that at which model outputs are applied will result in model outputs that are inaccurate for their intended applications. At best, IAVP distributions may be over-estimated, leading to unnecessary use of resources, and at worst, distributions can be underestimated such that no, or insufficient, actions are employed to control IAVPs in an area that is actually at risk. Indeed, an economic evaluation of biological invasions states that "uncertainty prevails concerning what ecosystems will be invaded and what impacts an invasion will have within these ecosystems", highlighting that accurate ecological and economic analyses are crucial in the allocation of finite resources to control invasive species [47].
There is also a mismatch in spatial resolution between environmental data used in SDMs, and that at which the IAVP is affected. Indeed, many arthropods, such as mosquitoes and ticks are small and poikilothermic, and are therefore heavily affected by microclimatic conditions, which vary at fine spatial scales (in the order of centimeters to meters) and differ to the surrounding macroclimatic conditions [48][49][50]. For example, potentially invasive ticks, especially nidicolous (nest-dwelling) species, spend almost their entire life-cycle within a limited spatial radius; following a bloodmeal they detach from the host and remain within the host's nest or a nearby sheltered area, such as a cave or crevice, in order to metamorphose [51]. Within these isolated and sheltered microhabitats environmental conditions can be very different to those in the surrounding environment. In the same way, IAVPs can be sensitive to extreme environmental conditions, for example the lone star tick (Amblyomma americanum [Linnaeus, 1758]), which is invasive across much of the north east of the USA, dies within just 2 h of exposure to temperatures of ≤−3 • C in the laboratory [52] and rapidly desiccates when exposed for several hours to temperatures exceeding 30 • C [53]. Likewise, mortality of Culicoides brevitarsis [Kieffer, 1917] (Diptera: Ceratopogonidae), a vector of the bluetongue virus, is high in the laboratory when temperatures are greater than 35 • C, even if just for a few days [54]. Consequently, high temporal resolution of data is required to accurately capture the variance and range in environmental variables [44], but at present the most accessible remotely sensed data are only available for 1-6 day interval measurements, thus do not capture data at the same hourly temporal resolution that can affect IAVP survival. There is a wealth of literature demonstrating that if species were theoretically subjected to the macroclimate as measured by remote sensing, rather than the microclimate which they truly experience, their behavior, reproduction, growth, survival, and both phenotypic and genotypic adaptations would all be profoundly impacted [55].
In addition to issues of resolution in environmental data, some factors that impact IAVP distribution cannot be directly measured, and instead other measurements are used as a proxy, or are interpolated, for the variable of interest. Due to its importance in the IAVP life cycle, temperature is among the most broadly applied variables in IAVP species distribution modeling. However, land surface temperature is generally used as a proxy for ambient temperature [27,56,57], whilst relative humidity, which is vital to arthropod survival, is often calculated from temperature and dew point measurements, or minimum day-time air temperature [58]. SDMs are made further complex when the species of interest has multiple life stages, each of which may exploit a different microhabitat. Mosquitoes have an "amphibious" life history, throughout which they experience air, below-water, and water-surface temperatures, by having terrestrially fixed or floating eggs, aquatic immature larvae and flying adults [59]. Researchers have measured air temperature, water temperature, and precipitation to understand whether air temperature, usually used to determine mosquito distribution or life cycle, provides an appropriate direct measure for determining Anopheles [Diptera: Culicidae] larval development in water [48]. The authors of one such study concluded that their results "suggest that although widely used, air temperature alone does not provide an appropriate variable for estimating immature mosquito development or for setting threshold temperatures". Another study that measured temperature in microhabitats suitable for Aedes mosquitoes found that when utilizing temperature from remote sensors or weather stations instead of from proximal data loggers, model outputs predicted that Ae. albopictus developmental rates were delayed and population growth rates were under-estimated ( Figure 1) [60]. Thus, the environmental characteristics important to the survival of an IAVP vary considerably compared to those that can be measured or interpolated by currently available data [59,61], and the obliged use of sub-optimal proxy data may result in erroneous model outputs [48,61]. . Grey points and smoothed trends represent temperature measured at the microhabitat scale (i.e., an artificial, hard plastic, water filled container typically used for egg laying by this species) in different environmental settings: full-shadow, half-shadow, and full sun conditions. Temperatures were recorded inside (mosquito larvae habitat) and outside water (mosquito adult and egg habitat) using iButton ® (Maxim Integrated, US) DS1923 data loggers at one hour intervals. Blue points and smoothed trends depict four-daily Land Surface Temperature (LST; MOD11A1 and MYD11A1 MODIS data) values, derived from the Moderate Resolution Imaging Sensor (MODIS) instruments, on-board the Terra and Aqua satellites. MODIS data were downloaded from a NASA server (https://lpdaac.usgs.gov/data_access), imported into GRASS GIS, and temperature values were extracted for each pixel (1 km resolution) where iButton sensors were placed (this figure was produced using data reported in [60]).
Strictly related to the low spatial resolution at which remotely sensed data are acquired, ecotones, i.e., where two macrohabitats intersect, for instance at the edge of a river, between mountains and valleys, green areas in a city or in catch-basins, are not currently well-captured by environmental data. However, ecotones can create microhabitat refugia in a macrohabitat that would otherwise be unsuitable. For example, Hoogstraal demonstrated that in the Nile Valley which is otherwise too dry for ticks, the soft tick Ornithodoros sonrai [Sautet and Witkowski, 1943] was able to colonize rodent burrows close to a permanent river, which provided adequate water and humidity [62]. Research has identified general patterns and mathematical relationships in the "buffering effect" of the physical structure of a microhabitat and has determined that, in general, within the microhabitat experienced by the tick, temperature and relative humidity are lower than that of the external environment typically measured for environmental data [63]. However, these patterns are influenced by a variety of factors, including the structure of the microhabitat and surrounding hydrography [63].

Attempting to Overcome the Lack of Microhabitat Data
Methods to capture, interpret, and produce remotely-sensed data that can be applied to SDMs are continually improving. In March 2019, Planet announced that they can provide satellite imagery from which NDVI can be derived at a resolution of 3-5 m, every 3 days (https://www.planet.com/pulse/ developing-the-worlds-first-indicator-of-forest-carbon-stocks-emissions/). Additionally the project can gain NDVI at 0.8 m using Light Detecting and Ranging (LiDAR) sensors mounted on aircraft, but for only a single time point due to the costliness of this data collection process. Despite these improvements, there may be a considerable lag time between such data being available and being used in SDMs, which typically require time series data spanning multiple years to truly capture adequate information on the climate. In addition, high resolution data at such a large scale require intense computational power and expertise for use in SDMs, as high resolution satellite data brings with it challenges related to differentiating details between variables within the imagery, as well as new sources of noise [64]. For instance, a project attempted to use NDWI calculated from QuickBird satellite imagery at 2.44 m spatial resolution to locate potential habitat for invasive mosquitoes (e.g., swimming pools). However, ground truthing of the data showed that shadows cast upon swimming pools by surrounding trees or structures resulted in decreased NDWI values and reduced the ability to detect water bodies [65].
We understand that improving the quality of the remotely sensed data processing chain, including geometric and radiometric corrections, is a complex discipline in itself and takes time and an organized effort. However, we can take better ownership of the data that are currently available to us, and can follow the lead of other disciplines in doing this. A set of Essential Biodiversity Variables (EBV) have been identified to support biodiversity monitoring under the framework of the Group on Earth Observations Biodiversity Observation Network (GEO BON). Out of 21 candidate EBVs suggested by GEO BON, 14 EBVs have been identified as directly or indirectly measurable by remote sensing (https://geobon.org/ebvs/what-are-ebvs/) [66][67][68]. Two subsets of EBVs, focusing on Species Abundance (SA EBV) and Species Distribution (SD EBV), have been introduced and defined as a space-time-species-gram (cube), which can address species distribution or abundance irrespective of the taxonomy or scale [69]. This classification is facilitated by the availability of global, high-resolution, remotely-sensed data on environmental conditions and ecological species attributes. The framework has been optimized for biodiversity monitoring, but an equivalent product could be developed for relevant data pertaining to invasive species monitoring. Similarly, other areas of research have identified the need for environmental data that better meet the requirements of modelers, and have built high resolution and user friendly databases. For example, Bio-ORACLE (Ocean Rasters for Analysis of Climate and Environment) is a global dataset of environmental data which has been tailored for, and successfully implemented in, the distribution modeling of marine species [70][71][72][73]. Creating similar datasets that include environmental (both remotely and proximally sensed) data relevant to IAVP species at a fine spatial scale and a user friendly format could greatly improve the way in which currently available environmental data are used in IAVP SDMs. In addition, online data repositories, that include microhabitat data are available, such as DataONE (Data Observation Network for Earth, https://www.dataone.org/), JaLTER (Japan Long-Term Ecological Research Network; http://db.cger.nies.go.jp/JaLTER/metacat/style/skins/jalter-en/index.jsp) and the VLIZ: IMIS (The Flanders Marine Institute: Integrated Marine Information System, http://www.vliz.be/en/imis for example see [74]). Whilst these databases represent a great resource, people must be made aware that microclimatic data do exist, and centralization of microhabitat data in a well-structured repository could greatly facilitate data dissemination and utilization by the scientific community.
On a smaller scale, unmanned aerial vehicles (UAVs) and drones can be equipped with visible light, near-infrared, and/or thermal sensors to measure environmental variables, producing NDVI and surface temperature data at high resolution and at the desired scale [75,76]. UAVs have also been used to survey for bird and primate nesting and resting sites to estimate population numbers [77,78], and although this method is not currently suitable for the direct detection of IAVPs, host nesting sites, e.g., woodrat middens, or water bodies suitable for mosquito egg laying, could be surveyed using these techniques and used as a parameter for host availability in SDMs. Environmental data at the microhabitat level can also be measured using data loggers; small sensors able to measure a range of variables paralleling those that can be remotely sensed, such as temperature, light, air velocity, barometric pressure, and relative humidity (e.g., see HOBO ® U30 USB Station (U30-NRC) data logger; Bourne, MA). Many data loggers are small enough to be placed in almost any microhabitat, and can be programmed to record measurements at multiple intervals throughout a 24 h period. Data derived from such data loggers has been successfully used to model the extirpation and persistence of mammals (American pika, Ochotona princeps [Richardson, 1828]) [79,80] [81], and could no doubt also be applied to IAVP distributions. While a large number of data loggers need to be employed to collect sufficient data for species distribution modeling, requiring considerable resources to deploy and manage, these data could be supplemented by crowd-sourced means. Environmental data can now be collected from sensors within smart phones that can measure multiple variables, including temperature, pressure, and light, as well as from privately owned amateur weather stations and apps that ask citizens to report climatic data, such as amount of precipitation [82].
Despite the generalized application of coarse resolution data for modeling the distribution of IAVPs, overcoming the lack of data at a microhabitat scale has been attempted, although not necessarily for SDMs. An alternative and empirical strategy to improve our understanding of microhabitat thermal properties can be in the implementation of controlled experiments, that allow us to characterize microhabitat properties [83,84]. For example, studies have directly measured temperature within aquatic mosquito egg laying sites [48], and have recorded environmental variables in catch basins known to be egg laying sites for Ae. albopictus in Italy [58]. Both studies found that modeling mosquito population dynamics using these variables, rather than air temperature which is typically used, changes, and likely improves, the estimated development of the mosquito. An increasing number of scientific studies call for a better estimation of the thermal characteristics of mosquito microhabitats, specifically in order to achieve more reliable SDMs [48,58,85].
The strategies to overcome the lack of microhabitat data proposed above require intense use of resources, and we propose that microhabitat data collected by data loggers, crowd-sourcing, unmanned aerial vehicles, and/or controlled environmental studies are maintained in a database which can be freely contributed to, and be used by all those studying IAVP ecology. An open access database of microhabitat data could greatly facilitate the propagation of both collection and use of this type of environmental data.

Conclusions
Currently, environmental data that is used to inform ecological niche models largely relies on remotely sensed data, which is at a relatively course temporal and spatial resolution and does not accurately represent the microhabitat experienced by the species of interest, nor that at which activities informed by the prediction are executed. The predicted distribution of invasive arthropods resulting from models are therefore likely to be insufficient for direct application. The subsequent over-and/or under-estimations in IAVP distribution can have considerable consequences on control efforts, which may be informed by such predictions. We posit that consequently, efficiency and efficacy in the allocation of resources to control IAVPs are sub-optimal. The optimal resolution of environmental Abstract: On the basis of the Research Station of the Russian Academy of Sciences in Bishkek, a unique scientific infrastructure-a complex geophysical station-is successfully functioning, realizing a monitoring of geodynamic processes, which includes research on the network of points of seismological, geodesic, and electromagnetic observations on the territory of the Bishkek Geodynamic Proving Ground located in the seismically active zone of the Northern Tien Shan. The scientific and practical importance of monitoring the geodynamical activity of the Earth's crust takes place not only in seismically active regions, but also in the areas of the location of particularly important objects, mining, and hazardous industries. Therefore, it seems highly relevant to create new software and hardware to study geodynamic processes in the earth's crust of seismically active zones, based on integrated monitoring of the geological environment in the widest possible depth range. The use of modern information technology in such studies provides an effective data management tool. The considering system for collecting, processing, and storing monitoring electromagnetic data of the Bishkek geodynamic proving ground can help overcome the scarcity of experimental data in the field of Earth sciences.
Dataset: For general use, a center for collective use of scientific equipment "Integrated geodynamic research" (CCU IGR) was created, on the basis of the Research Station of the Russian Academy of Sciences in Bishkek (RS RAS) (http://ckp-rf.ru/auth/). Through it, you can register and get access to data that is laid out for general use. In open access on the Internet, EDI-files on the MANAS profile are posted at http://ds.iris.edu/spud/emtf.

Summary
The complex of regional geophysical works, including magnetotelluric studies, is carried out in almost all major tectonic provinces worldwide. One of the tasks of such a complex is to study the geodynamic state of the regions and assess the development and distribution of hazardous geological processes. Tien Shan region is one of the most tectonically active. This paper discusses the information aspects of the developed technology of multidisciplinary geophysical monitoring of geodynamic processes in the Earth's crust seismically active regions. The approach to the created technology is based on the integrated use of structural-functional and object-oriented information models. The developed structural-functional information model describes the processes of obtaining, storing and converting raw electromagnetic data, measured by magnetotelluric soundings method (MTS), and the object-oriented model used for describing the data itself (initial, intermediate, and final) and the relationships between them. The models are built using CASE tools All Fusion (Business Process Modeler-BPwin) and Power Designer, to define the boundaries and hierarchical structure of the developed system. The created technology provides an effective information system for integrated geophysical monitoring of geodynamic processes originating in the earth's crust of the seismically active zone of the Northern Tien Shan (the territory of the Bishkek geodynamic polygon) [1][2][3]. The main element of the complex geophysical monitoring is electromagnetic observations with a natural source of electromagnetic fields, which include magnetotelluric (MT) continuous observations of changes in the electrical parameters of the geoelectric cross-section at stationary points, continuous geomagnetic observations of the full vector T of the geomagnetic field at stationary points of the network and periodic observations at controlled points served by mobile stations. MT observations are used to determine variations of electromagnetic parameters in the Tien Shan lithosphere to a depths of 100 km and to study their relationship with geodynamic processes, occurring at these depths.
This work represents the results of research related to the development of azimuthal magnetotelluric monitoring techniques, which consists of analyzing the obtained time series of electromagnetic parameters in order to determine the contribution of each of the components of the impedance tensor to the informativeness of monitoring studies [3]. On the basis of the correlation analysis of gravitational tidal effects and the results of magnetotelluric monitoring, an additional test is carried out, the previously identified azimuthal dependence of the environmental stress sensitivity. When performing modern monitoring studies, scientists have to face an unprecedented amount of data that is subject to orderly storage, processing, graphical visualization and analysis [4]. Both stations are located on the territory of the Bishkek geodynamic proving ground, which in turn is part of the Northern Tien Shan seismic zone (Figure 1), and data is recorded around the clock in the period of 0.01-1000 s. Over the years of research, a catalog of geoelectrical data based on magnetotelluric soundings (MTS) and magnetovariational soundings (MVS) made in a series of regional and local profiles in the range of periods from 0.06 to 1800 s, created in the Tien Shan region, has been constantly updated. The catalog also includes the results of deep magnetotelluric soundings (periods up to 10,000 s). Information characterizing the parameters of the network of magnetotelluric observations (observation points and their coordinates) is contained in the catalog of the regional network of MTS, MVS cost center. To date, the established regional network of MTS and MVS for stationary observation points and profiles covers almost the entire territory of Tien Shan, within Kyrgyzstan and the surrounding areas. The data acquisition and information processing system of magnetotelluric monitoring allows collecting and accumulating data from a variety of monitoring observation points-stationary, regime, and profile ( Figure 1). Monitoring was performed to study geodynamic processes in the Earth's crust and upper mantle based on the calculation of the transfer functions between the components of the magnetotelluric field with high temporal resolution in order to study their temporal dynamics. The final result of such monitoring, from a formal point of view, is a set of time series of various data [5,6]. In the practice of monitoring geodynamical processes, statistical methods of data analysis are widely used. In particular, a correlation analysis is used to determine the degree of the interrelation of the observed data series. Time series are formed, which are used to study changes in the recorded parameters over time and to isolate anomalies associated with the preparation of strong earthquakes [7,8]. Programs are designed for visualization, processing, and analysis of time series. They have a convenient user interface. They implemented arithmetic, statistical, and other functions for working with time series. It is possible to edit drawings (graphs) on the screen, save, and print them.

Data Description
According to the results of continuous monitoring of electromagnetic, geomagnetic, GPS, gravimetric, and seismic observations, banks of primary data of the territory of the Bishkek geodynamic proving ground are formed and a catalog of earthquakes is compiled. As an example, consider the procedure for collecting data from magnetotelluric monitoring.

Data Collection Procedure
The monitoring network continuously records the MT field on the embedded flash memory of the Phoenix MTU-5D instrumentation. The duration of the recording depends on the amount of flash memory and registration parameters. The registration parameters indicate the polling frequency and the duration of the recording. At the maximum polling frequency, the recording time is about 20 days, after which data is copied from the flash memory to a laptop, and the equipment is serviced and restarted. When working on the MTS profile, the measurement mode depends on the objectives of the study and is seasonal. In an ordinary observation point on a profile, the duration of an MT-field recording is a time interval from several hours to several days, which is determined by the depth of soundings. To check the performance of the station, a test recording of about an hour is made. The most informative is the night registration.

Structure and Data Processing of the Magnetotelluric Sounding (MTS) Method
The primary time series files of magnetotelluric data are stored in two binary files one of which saves the data of high and middle-frequency band (2400 and 150 Hz) at intervals of a few seconds from the beginning of the minute, while the second file continuously saves low-frequency data (24 Hz).
Time series are accompanied by a small binary file which saves registration parameters. To process time series data, use the SSMT 2000 program from the standard set of Phoenix software. As a result of this processing, average daily MT monitoring records are obtained stored as binary files. Work with these files is performed using the GSPlot (General Spectra Plot) program from the standard set of Phoenix software. The GSPlot program allows to visually view the transformants of the MT data and also presents them in a table form.
The data storage scheme of magnetotelluric soundings processing is based on the data storage scheme in the international data exchange standard MT. In this standard, sensing data at a point is written to a file with the extension EDI (Electrical Data Interchange). EDI files are obtained using the MT-Correct processing application program developed by the North-West geophysical company and saved in ASCII format, in contrast to the primary binary files.

Structure and Storage of Magnetotelluric (MT) Data
All MT sounding material, both source material and processing results, are placed in archives on working computers, in a database and in an external archive on CDs. In the MT database, the material is classified by year of observation, by profiles and measurement points.
The data of the MT-monitoring are located in the directories corresponding to the names of the items-Aksu-mon and Chon-mon, in which folders with the number of years of monitoring are created.
The data on the MT-profiles are in the directories with the name of the profile and year of work.

Methods
The geophysical monitoring database (DB) of the RS RAS includes an electromagnetic observation database with an artificial source of electromagnetic field, electromagnetic observations with a natural source of electromagnetic field, geodetic GPS observations on a local network and geomagnetic observations and can be considered as a single distributed database (Distributed DataBase-DDB) [9] of geophysical monitoring. These databases play the role of local databases located in different nodes of the corporate and/or global computer network. DDB, as defined by Data [10], can be considered as a loosely coupled network structure whose nodes are local databases.
Local databases are autonomous, independent, and self-defined; access to them is provided through the DBMS. Connections between nodes are replicated data streams. DDB topology is a star structure.
For organizing the collection, storage of data, and processing of the results of MT monitoring, the As IS model was developed. The model was developed in the BPWin [11] environment in the form of data flow diagrams (DFD-diagrams) and is presented in Figure 2  For on-line access to the results of MT monitoring, a model of a distributed interactive system of access to the results of magnetotelluric monitoring in the form of a data flow diagram (DFD diagrams) was developed. This model is essentially an As To Be model and is presented in Figure 3. On the system model, the main process is allocated: distributed interactive system of access to the results of monitoring MT and two external entities: User and Storage. The repository is a storage medium on the external medium with respect to the system, on which the primary Phoenix station files, the average daily (processed) files, and EDI files are stored. Figure 4 shows an example of the MT data outputs that was obtained from the database. . The example of the magnetotelluric (MT) data outputs that was obtained from database. The left side shows the time series source data-5 components of the electromagnetic field. In the right side sample processed data. Blue and green dots show processed curves that are obtained in binary format, red and yellow solid lines-smooth curves in EDI format.

User Notes
Usage: for collecting and processing geophysical information, in particular for measuring, recording, and processing the electrical and magnetic components of the natural electromagnetic field, in the study of geodynamic processes occurring in the Earth's crust and upper mantle using electrical survey methods. Thus, the developed software system makes it possible to increase the efficiency of processing MT monitoring data by significantly reducing the time spent searching for the necessary information, the ability to quickly view the newly received data, and create a distributed database of monitoring MT observations over a period of about 15 years. Currently, the system is in the trial operation of the Research Station of the Russian Academy of Sciences.

Patents
As part of these studies, the database "Local database of magnetotelluric data in the system of geophysical monitoring of geodynamic processes in the Earth's crust of seismically active regions" was registered.

Conflicts of Interest:
The authors declare no conflict of interest.