A First Approximation for Acid Sulfate Soil Mapping in Areas with Few Soil Samples

: Acid sulfate soil mapping is the first step to avoid possible environmental damages created by one of the most problematic soils existing in nature. One of the problems in acid sulfate soil mapping is the lack of soil samples in some regions. This prevents the creation of the occurrence maps. For a first recognition of these regions, a possible solution could be the use of soil samples from other areas with similar characteristics. In this study, we analyze if a machine learning method is able to correctly classify the soil samples in an area where it has not been trained. For this, Random Forest and two different regions located in southern Finland with a similar composition of their soils are considered.


Introduction
Nowadays, artificial intelligence has a great potential to solve different problems in various research fields related to the economy and society.One of these research fields is soil science, which has a direct impact on agriculture, energy or climate change.Among the different types of soils, the acid sulfate (AS) soils are considered the most problematic soils from an environmental point of view [1].This is because this type of soil when oxidized can acidify the soil and mobilize toxic metals that may lead to serious environmental damage.To minimize possible environmental problems, it is essential to avoid the oxidation and drainage of these soils.A fundamental part of this is to locate the areas where these soils appear.The application of machine learning techniques in digital soil mapping has meant a great advance in this field [2].One of the advantages is that these techniques allow the creation of more precise maps with a lower number of samples than the traditional methods [3].In the case of AS soils, not all machine learning techniques are successful in their prediction [4].This is mainly due to the complex relationship between the AS soils and the environmental covariates [4].For example, Support Vector Machine (SVM) is a method unable to correctly classify this type of soil [4], while Random Forest (RF) and Gradient Boosting (GB) are very good methods for an accurate classification and prediction of AS soils [4,5].Furthermore, the AS soil probability maps created with these two models have a high precision [4,5].Other method able to classify AS soils under some conditions is Extreme Learning Machine (ELM) [6].Artificial Neural Network [7], Fuzzy Logic [8] or Fuzzy k-means [9] have also been used for AS soil mapping.
The use of a supervised machine learning technique in AS soil mapping requires two different types of data: the soil samples and the environmental covariates generally created from remote sensing data.The soil samples and their corresponding environmental covariates values are used to train and validate the model.While for the prediction, the model only considers the environmental covariates.One of the problems in AS soil mapping is the lack of soil samples in some regions.This hinders the creation of AS soils occurrence maps by means of supervised machine learning techniques.In such regions, the presence of AS soils may be overlooked using a traditional mapping approach.Consequently, there is a risk that AS soils are unknowingly drained, leading to environmental damage, as a result of agricultural, forestry or urban activities.For a first recognition of such regions, in addition to using the remote sensing data of the area, a possible solution could be the use of soil samples from other areas with similar characteristics for training the model.The question is whether a machine learning model could correctly classify AS soils in an area where it has not been trained.So far, there are no previous works where this study has been carried out.However, if this were possible, this first prediction could be used to design an efficient sampling plan for the region.In previous works, Random Forest has shown high abilities for the correct prediction of AS soils [4,5].In this work, we analyze if RF is able to correctly classify the soil samples in an area where it has not been trained.For this, two different regions located on the coast of the Gulf of Finland, with a similar composition of their soils, are considered.

Study Areas
For this study, two different study areas in the south of Finland were considered: Virolahti and its surroundings, and a coastal area situated between the cities of Helsinki and Loviisa (Hel-Lov), see Figure 1.Both areas are located in the Littorina Sea, where AS soils are usually found.The land cover in these two regions mainly consists of bedrock, outcrops and blockfields.Furthermore, the soils in both areas are similar.This has allowed to carry out the study of this work.

Figure 1.
Location of the study areas (pink color) and the maximal extent of the Littorina Sea (diagonal lines).Both study areas are located in the Littorina Sea.

Datasets
The datasets of the two study areas include two different types of data: the soil samples and the environmental covariates.The soil samples correspond to AS and non-AS soils.The classification of the samples was made following the method described in [10] by the Geological Survey of Finland (GTK).In the area of Virolahti and surrondings, there are 186 soil samples, 93 for AS soils and 93 for non-AS soils.In the Hel-Lov area, there is a total of 458 soil samples, 229 for each class.
The environmental covariates (raster data) considered in this study are: Quaternary geology, digital elevation model (DEM) derived from LiDAR data, terrain layers derived from DEM, and aerogeophysics layers.LiDAR and the aerogeophysics are remote sensing data from airborne surveys.All these covariates can provide useful information for the characterization of AS soils.Each dataset consists of 17 environmental covariates with a resolution of 50 m and the Finnish coordinate reference system (ETRS89/TM35FIN(E,N)).The environmental covariates of Virolahti and its surroundings area were created in a previous work [11], whereas the environmental covariates of the Hel-Lov area have been created for this study with Qgis.

Method
In this study, we have analyzed the ability of a supervised machine learning method for the classification of AS soils in an area different from the one in which the model was trained.The method considered is Random Forest (RF), which is a method based on decision trees [12].This technique has been used for the classification and prediction of AS soils in previous studies [4,5,11].We have used grid search and cross-validation (GridSearchCV) for the selection of the best tuning parameters for the performance of the model, for more information see [4].The metrics used for the determination of the suitability of the model are those related to the confusion matrix: precision, recall and F1-score [13].

Results and Discussion
A supervised machine learning method can make the predictions if it has been previously trained and validated with the dataset.In the present study, the only way to know if a model is capable of correctly predicting the AS soils of an area where it has not been trained is by validating the model with the soil samples of that area.For this reason, an area with a large dataset available has been selected for the study.This area is the region between Helsinki and Loviisa (Hel-Lov).The Virolahti area dataset was already used in a previous study where it was analyzed how variable selection improves the prediction accuracy of the AS soil mapping [5].In principle, it cannot be guaranteed that the importance of environmental covariates is the same in the two regions.For this reason, the dataset with 17 layers has been considered for this study.As already mentioned, RF is a method with high predictive capacity for the prediction of AS soils [4,5,11].Furthermore, it is one of the machine learning techniques that is least affected by irrelevant covariates or with redundant information [5,14].This makes it an ideal method for the present study.
The process of this study has two steps.First, the RF model is trained and validated with the dataset of Virolahti and its surroundings.80% of the soil samples are used for training the model and the remaining 20% for its validation (Figure 2).The next step is the validation of the model with the dataset of the Hel-Lov area.All the soil samples of this region are used for the validation.The results are shown in Table 1, where the metrics considered for the evaluation of the model are represented.As it can be seen, the model is able to distinguish the AS and non-AS soils of the Hel-Lov area when the model is trained using the data from Virolahti and its surroundings.The similar F1-score values for both classes indicate that the performance of the model is good for both classes.This is also confirmed by the balance between precision and recall for both classes.These results, where all the metrics are above 60%, are very good for a model that has not been trained in the area of the prediction.On the other hand, we have also analyzed the performance of the model when it is trained with the soil samples of the own area.For this, the proportion of the points for training the model and its validation was 80 and 20%, respectively.Training the model in the same area improves the results by up to 13%, see Table 1.Therefore, RF is a model that is able to do a prediction in an area where it has not been trained.However, the prediction accuracy of the model will not be so good as the prediction made when the model is trained with the soil samples of the own area.Thus, training the RF model in a different area can be used for a first recognition of areas with limited soil samples as well as for the creation of an efficient sampling plan design in those areas.
Table 1.Metrics related to the confusion matrix for the Random Forest (RF) prediction in the area between Helsinki and Loviisa (Hel-Lov) when the model is trained using soil samples from Virolahti and its surroundings, and when the model is trained with the soil samples from the own area (Hel-Lov).The two classes are acid sulfate (AS) and non acid sulfate (non-AS) soils.It must be taken into account that training the model with soil samples from other areas should only be carried out if there are at least some points in the area itself that allow the validation of the model, and if the geology of both areas can be considered relatively similar.Future works should analyze the ability of the model to predict AS soils in areas where the soil composition is different from the one of the training area.

Conclusions
In this study, we have analyzed the ability of a machine learning method such as Random Forest to make a correct prediction of the acid sulfate soils in an area where it has not been previously trained.The training of the model has been carried out with soil samples from another region with similar characteristics in the composition of the soils.Our results show that the method performs well in classifying both classes, with all metrics evaluated above 60%.This is an advance in the field since it allows a first recognition of the regions with a limited number of soil samples.Furthermore, this first prediction will permit the development of a more efficient sampling plan design in these regions that is also oriented towards the application of machine learning techniques for acid sulfate soil mapping.

Figure 2 .
Figure 2. Schematic diagram of the validation process.