Designing a European-Wide Crop Type Mapping Approach Based on Machine Learning Algorithms Using LUCAS Field Survey and Sentinel-2 Data

: One of the most challenging aspects of obtaining detailed and accurate land-use and land-cover (LULC) maps is the availability of representative field data for training and validation. In this manuscript, we evaluate the use of the Eurostat Land Use and Coverage Area frame Survey (LU-CAS) 2018 data to generate a detailed LULC map with 19 crop type classes and two broad categories for woodland and shrubland, and grassland. The field data were used in combination with Copernicus Sentinel-2 (S2) satellite data covering Europe. First, spatially and temporally consistent S2 image composites of (1) spectral reflectances, (2) a selection of spectral indices, and (3) several bio-geophysical indicators were created for the year 2018. From the large number of features, the most important were selected for classification using two machine-learning algorithms (support vector machine and random forest). Results indicated that the 19 crop type classes and the two broad categories could be classified with an overall accuracy (OA) of 77.6%, using independent data for validation. Our analysis of three methods to select optimum training data showed that by selecting the most spectrally different pixels for training data, the best OA could be achieved, and this already using only 11% of the total training data. Comparing our results to a similar study using Sentinel-1 (S1) data indicated that S2 can achieve slightly better results, although the spatial coverage was slightly reduced due to gaps in S2 data. Further analysis is ongoing to leverage synergies between optical and microwave data.


Introduction
Information on land use and land cover (LULC) is crucial for spatial modeling and monitoring of the global hydrological and carbon cycle, energy balance, and status of natural resources [1][2][3]. A repeated, transparent and precise LULC monitoring is also essential for addressing rapid changes in land use such as soil sealing and agricultural production [4]. The most needed are LULC maps that can display the dynamic land-use changes, such as in arable land.
Satellite-based remote sensing is the most appropriate tool for LULC mapping [5] because of its global continuous and regular coverage and cost-efficiency [6]. The abundance of freely available remote-sensing data offers unprecedented opportunities to produce land cover, and land cover change maps over large-scale areas [7].
The Sentinel-2 (S2) satellites are part of the Copernicus European Union's Earth Observation (EO) Programme and represent one of the most promising instruments to mon-itor dynamic land cover changes in a timely way [8]. The two twin satellites carry a multispectral sensor with 13 spectral bands (four bands at 10 m, six at 20 m, and three at 60 m spatial resolution). With the two satellites in operation, the temporal resolution is 5 days.
The use of S2 data for LULC monitoring applications has been widely documented in several papers. Xiong et al. [9] produced a nominal 30 m binary cropland map (cropland and non-cropland) of the entire African continent using S2 and Landsat-8 (L8) data for 2015 and suggested the application of 10 m S2 data to address limitations in the classification of small fragmented fields. Inglada et al. [10] also combined S2 and L8 time series data to produce a country-wide crop map for France, achieving a kappa value of 0.86 considering 17 classes. Malinowski et al. [4] created a LULC map for a large part of the European continent for 2017 using S2 data, with 13 classes and average overall accuracy (OA) of 86.1% with a methodology potentially suitable for a fully automated classification at a relatively high frequency. Dostálová et al. [11] produced a two-class forest-type map (coniferous and broadleaved) for Europe using Sentinel-1 (S1) for 2017 (achieving an OA of 82.7%). The authors found S1-inherent problems related to the side-viewing observation of SAR sensors for complex terrains that negatively impacted the accuracy. Another publicly released 10 m spatial resolution and global LULC map was derived by (deep) neural nets [12] using S2 data with an OA of 86.0%. In the European Space Agency's (ESA) WorldCover project, S1 and S2 data were combined to provide a global land cover map for 2020 at 10 m resolution with 11 land-cover classes [13]. Their algorithm achieved an OA of 74.4%, in which the accuracy of the shrubs class and the separation between tree cover and shrubs areas could be improved. Defourny et al. [14] classified five main crop groups and non-cropland areas in three countries (Ukraine, Mali, South Africa) and five local sites distributed across the world using S2 and L8 data with an OA value higher than 80% for all sites, except one. Jiang et al. [15] produced a high-resolution map of major crop types for three large regions in China (2-3 crop types per region), achieving an average OA of 94%.
One of the major issues for deriving accurate and detailed LULC maps is the availability of field data for training (and validation) purposes. In this regard, one concerted effort is represented by the European Land Use and Cover Area frame Survey (LUCAS)organized by Eurostat [16]. LUCAS is a regular in situ survey performed every 3 years (since 2006) to collect land cover data in the European Union (EU). The most recent survey (2018) was further improved to respond to the needs of the EO community with the introduction of the "Copernicus module" point data. These are points characterized by homogeneous land cover types, and therefore they are directly comparable to S1 and S2 pixels. In 2020, d'Andrimont et al. [17] further revised the Copernicus module data by constructing polygon geometries that spatially represent the homogenous regions of approximately 0.5 ha around each point. The polygon geometries aimed at further increasing the availability of field data-however, care should be taken not to sample more than one pixel from the same polygon as this would bias the results.
Over recent years, the LUCAS data offered the opportunity to develop and advance LULC mapping capacities for Europe. For example, Close et al. [18] used the LUCAS 2015 survey and S2 data to classify the Wallonia region in Belgium for monitoring land use, land-use change, and forestry (LULUCF). Pflugmacher et al. [19] produced a pan-European land cover (13 classes) map using LUCAS 2015 survey and Landsat-8 data. Weigand et al. [20] utilized the LUCAS 2015 survey as reference information for high-resolution land cover mapping using S2 data in Germany (seven classes). Venter et al. [21] generated a 10 m resolution land cover map (ELC10) of Europe composed of eight land cover classes with the fusion of S1 and S2 data.
Noteworthy, d'Andrimont et al. [22] derived a 10 m crop type map with S1 data from 28 member states of Europe (EU-28). The study presented a relatively precise and detailed land cover map (19 specific crop type classes beside two broad classes) using a random forest (RF) classifier. Data from the LUCAS Copernicus module (using the polygon geometries) were exploited for training, while the validation was based on an independent set of 87,853 high-quality points filtered from LUCAS core points. The overall accuracy was 74.0% for the 21 classes. The accuracy increased to 79.2% when the classes were grouped into eight broader classes. To date, however, the full potential of S2 data to obtain detailed crop type maps (as the 19 crop types identified in [22]) has not yet been demonstrated.
To evaluate the potential of S2 time series for the production of detailed LULC maps, this manuscript reports on the potential of the LUCAS 2018 data to derive detailed EUwide crop type maps at 10 m spatial resolution based on a spatially and temporally harmonized S2 time series. As different classification algorithms often yield different results [23], this research also includes a comparison between support vector machine (SVM) and random forest (RF). Both methods are relatively insensitive to noise and overtraining, and they can deal with imbalanced data [24]. To compare the results of this research to previous findings, the classification scheme used by [22] was also adopted here.
The specific objectives of this investigation are as follows: • To evaluate the potential of using the LUCAS 2018 data in combination with S2 data for generating detailed LULC maps.

•
To compare the classification performance of S2 and S1, using as [16] a reference study for S1.

•
To compare the performance of RF and SVM classifiers.

•
To study the impact on the classification accuracy due to autocorrelation in field data (considering data sampled from the LUCAS Copernicus polygon geometries).

Materials and Methods
The area of interest is the EU-28 territory containing LUCAS 2018 survey data. The main steps shown in Figure 1 included the extraction of S2 spectral and temporal features at the point locations of the LUCAS data, followed by the SVM and RF hyperparameter tuning, training, and assessment of the results. A final LULC map was generated for EU-28 regions by combining the best model and an additional model (based on yearly indicators only) to fill remaining spatial gaps (15% of land area that could not be covered with the optimal feature).

Sentinel-2: Data Preparation, Spectral and Temporal Features
The processing chain relied on S2A and S2B L2A data which are processed at the BOKU/EODC data facility and accessed via dedicated APIs [25]. Only S2 image tiles with cloud cover below 50% were included. A fully automated processing chain was created to produce cloud-masked S2 L2A data at 10 m spatial resolution, all projected to the same spatial reference (EPSG: 3035) and extent (EU-28). The six S2 bands at 20 m pixel size were resampled at 10 m using the nearest neighbor algorithm.
Starting from this dataset, yearly and monthly cloud-free image composites were computed for the year 2018. The features used in this study are summarized in Table 1.
Monthly indicators encompass monthly composites of B02-B08, B8A, B11, and B12 from S2 spectral bands, bio-geophysical indicators (FAPAR, FCOVER, and LAI) [33], and NDVI. To build the monthly composite, every pixel was taken from the S2 acquisition presenting the maximum NDVI value over the month. In total, 189 features were obtained.    Figure 2.
The smallest area of observation is a circle with a 1.5 m radius. About 70% of core points were surveyed on the ground, and the rest were photo-interpreted in the office ( Figure 3). The land cover of these points was labeled according to three different classification schemes. The level-1 legend includes eight broad land cover groups, while level-2 and level-3 comprise detailed classes with 26 and 66 categories, respectively [34].
The regular LUCAS survey was expanded by the so-called Copernicus module to address the needs of the EO community [34]. The Copernicus module was originally planned for 90,620 points out of the main 337,854 core points. Due to difficulties in accessing the locations, however, only 63,364 points could be sampled. A summary of the LU-CAS survey data is presented in Table 2.   To delineate the polygon geometries for each of the LUCAS Copernicus points, d'Andrimont et al. [17] expanded the point geometries in the four cardinal directions (up to 51 m distance), using the information on LULC homogeneity reported in the field during the survey. Among 63,364 polygons, 77 polygons were discarded due to geolocation problems. The remaining 63,287 polygons presented areas from 0.005 ha to 0.52 ha (with an average of 0.32 ha). However, only the polygons for which the level-3 legend was available were kept (resulting in a final set of 58,428 LUCAS Copernicus polygons). Among the 58,428 polygons, S2 spectral and temporal information could be extracted from our S2 dataset only for 56,366 polygons (total area of 1,901,627 individual pixels).
Due to remaining gaps in S2 data, some samples did not have spectral information at a specific time of the year (mostly winter months). This caused a large number of missing values in the monthly composites. The statistics are summarized in Table 3. Therefore, an additional data refinement was necessary, leading to the exclusion of additional 84 features. The final dataset (with 105 features) results in a total of 1,349,052 samples (taken from 43,013 different polygons). The LUCAS data presents eight level-1 land-cover categories, including A-artificial land, B-cropland, C-woodland, D-shrubland, E-grassland, F-bare land, G-water, and H-wetlands. Since this study focuses on the classification of the main crop types, and to enable comparability with the S1 study described in [22], the class types were redefined. In the chosen classification scheme, only classes and subclasses of B-cropland, C-woodland, D-shrubland, and E-grassland (and a subclass from F-bare land) were considered to create 19 specific crop type classes, plus additional two broad classes, namely woodland and shrubland, and grassland. More details can be found in Table 4 adapted from [22]. According to the defined scheme, the total samples for training the classification models (1,349,052) were decreased to 1,344,885 (taken from 42,753 different polygons) by keeping only the points related to the 21 classes.  (2) support vector machine (SVM). The RF classifier, first introduced by Breiman [35], is a robust machine learning algorithm capable of generating high classification accuracy and quantifying the feature importance. RF is an ensemble technique that utilizes bagging (bootstrap + aggregation) and builds a swarm of random decision trees. In bootstrap, instead of training all data, a subset of samples is trained in a tree of RF. The selected subset is called the bag, and the remnant samples are described as out-of-bag (OOB) samples. The outcome of all trained trees is aggregated, which reduces the variance and increases the classification accuracy. The OOB score can be applied to internally evaluate the trained model (with OOB samples not included during the training process), even without using an independent validating dataset [35,36].
The SVM classifier is a popular kernel-based machine learning presented by Vapnik [37,38]. The SVM algorithm generates an optimal hyperplane to discriminate the dataset into determined classes utilizing training data. The closest samples of different classes in feature space called support vectors are used to maximize the margin. SVM can transform training data into different spaces to find better discrimination between categories using the kernel trick [39].

Training Data Sub-Setting
To study the impact of autocorrelation in the training data (i.e., many points are sampled from the same polygons), three different training subsets were created (see Table 5 the resulting training samples): (1) proportional (one random sample from each polygon), (2) balanced (same number of samples for each class) and (3) keeping all dissimilar samples within each polygon. In details: • Proportional (One random sample from each polygon) A single random sample per polygon was selected to represent the polygon. The extracted subset contains 42,753 samples (third column in Table 5), equal to the total number of available Copernicus polygons.
• Balanced At least one point per polygon was selected to reach a balanced dataset of 4000 samples per class. In this case, the majority of classes have the same number of samples. However, the number of polygons in the woodland and shrubland and grassland is greater than the reference number (4000), and some polygons were not considered in the sampling. Moreover, as shown in Table 5, some classes have a smaller number of polygons than the reference number. In this case, all available samples in those classes were considered for the balanced subset.

• Dissimilar
The 10% most dissimilar samples within each polygon were selected using the similarity matrix (based on the Euclidean distance metric calculated on all samples in an individual polygon).

. Feature Selection
A recursive feature elimination method was used to select the important features [40]. In a first step, a RF classification was performed on the training dataset using all 105 features, and the OOB score was calculated. Then, the importance of each feature in the process was extracted. The least significant feature was eliminated from the dataset at the next step. The process was repeated until no features remained.

Hyperparameter Tuning
Hyperparameter tuning is an essential step in machine learning, which directly impacts the model performance. Grid search and random search are very popular hyperparameter optimization techniques [41,42]. In the first method, the domain of hyperparameters is discretized to a grid. Then, the performance of all possible combinations is assessed using statistical metrics, here cross-validation (CV). The set of hyperparameters that can maximize the average value in CV is selected as an optimal one for training the model. The random search employs random combinations of hyperparameters instead of using all plausible sets in the grid. These two methods were used in sequence to tune the hyperparameters of RF and SVM classifiers applying to the main training dataset with the 43 most important features.
In the case of RF tuning, a large grid range of values with 1000 possible combinations was defined for the random search (2nd column in Table 6). Then, 100 combinations were assessed randomly with three-fold CV (total 300 fits) multiple times to reduce the grid span. In the second step, all remained 56 combinations of the reduced range (3rd column in Table 6) were examined using the grid search method. The 4th column of Table 6 presents the tuned values for the hyperparameters of the RF classifier, namely n-estimators, max_features, min_samples_split, and min_samples_leaf. The other parameters of the RF classifier were kept in their default settings. The same procedure was applied for tuning the main parameters of the SVM. In this case, 1800 and 90 fits with three-fold CV were utilized for the random search and grid search, respectively. Table 7 reports the adjusted values for the four hyperparameters C, kernel, degree, and gamma. For the other parameters, the default settings were used.

Validation Data
For evaluating the efficiency of the classification models, an independent dataset (not part of the main training dataset) was extracted from the 274,490 LUCAS core points (see Table 2). To select the validation dataset, we followed the procedure outlined in [16]. We retained only those directly interpreted points in the field and were within parcels greater than 0.5 ha with homogeneous land cover. Only samples related to the relevant classes were kept (Table 4). By applying these rules, 91,201 samples were selected. Then, the related features for those points were derived from S2 data. Samples with at least one missing feature were removed. Finally, 70,800 points with 43 features remained for the accuracy assessment.

Assessment Metrics
To evaluate the accuracy, the final classification model was run on the validation data. From the results, the confusion matrix was calculated. Five assessment metrics were derived from the confusion matrix, including user accuracy (UA), producer accuracy (PA), F1-score, OA, and kappa coefficient (KC). The UA presents false positives or errors of commission, while PA displays false negatives or errors of omission. F1-score is the weighted average of UA and PA and is a useful metric for evaluating the classification with imbalanced classes and for class-specific analysis. The OA metric is the ratio of correctly predicted samples to the total samples used for validating the general performance of the model on the validation dataset. The kappa coefficient (KC) was considered for legacy with previous studies [43]. This measure defines the agreement between classified and truth values. Table 8 shows a categorization of the kappa statistics as defined in [44]. The OOB score is additionally used to evaluate the models generated with RF classifiers.

Selecting the Best Classifier and Dataset
The results of RF and SVM classifiers using all training data and the three smaller training subsets are reported in Table 9.
The SVM classifier provided slightly better results in most cases. The most effective outcome was obtained with the 'Dissimilar' subset that reached an OA of 77.6%, compared to 77.8% when using all data for training. Consequently, it could be concluded that high accuracy is reachable by applying only the best set of samples, which contains only 11% of the total data and the difference between their OA is only 0.2 percentage points.
Regarding the OOB score, the correlation among samples increases by growing the number of samples, which is evident in the divergence of the OOB and OA, especially when considering all data for training.

The Accuracy Assessment of the Best Classification Model
The results were interpreted by applying the best model ("Dissimilar" subset for training and SVM algorithm) on the validation data. The detailed confusion matrix for the 21 LULC classes (19 crop types, grassland and woodland and shrubland) is represented in Table A1. The map in Figure 4 shows the classification result for the validation data.
When grouping the 21 classes into eight main classes (the scheme is described in the first column of Table 4), the overall accuracy increased from 77.6% to 82.5% (Table 10). The cereals, root crops, non-permanent industrial crops, grassland, and woodland and shrubland classes had appropriate discrimination compared with other categories by having an F1-score of 83.6%, 77.9%, 75.4%, 79.8%, and 90.4%, respectively. Besides, the dry pulses, fodder crops, and bare arable land classes had lower accuracy due to their confusion with cereals, grassland, and woodland and shrubland classes.  In addition, maize, sugar beet, sunflowers, and rape and turnip rape had suitable discrimination between crops, as reported in Table A1. The misclassification proportion was also reported as relatively high in cereals. The KC for main (21 classed) and grouped (8 classed) schemes of classification was 0.70 and 0.75, respectively. Thus, according to Table 8, there is a substantial agreement between classified and true data in the generated model.

Discussion
This work highlights the high potential of multi-temporal S2 data for large-scale land-cover classification focusing on field crops. Based on monthly composites of the summer months and yearly image composites, 19 field crops, one forest class, and grassland were classified with consistently good results. Both overall and class-specific accuracies are comparable to a recently published study using the same training and validation data but S1 data for the modeling. Andrimont et al. [16] achieved an overall accuracy in discriminating the 21 vegetation classes of 74.0% and if they were grouped into the main crop type groups of 79.2%. In this work, the 21 classes could be separated with an accuracy of 77.6% and the eight classes with 82.5%, outperforming the S1 classification performance by 5.2 and 3.2 percentage points, respectively.
Both the training and the validation datasets are affected by an unbalanced number of samples and this is reflected in the large differences in the class-specific accuracies achieved. Classes with low presence in the field are also significantly less represented in the training data and subsequently often lead to lower accuracies. The grouping of classes could alleviate the problem, but also here, the class-specific accuracies range between 30% and 90% (Table 10). Compared to the study of Andrimont et al. [16], the results of the present work are not dramatically better but somewhat higher and more balanced across the classes.
Due to cloud cover, optical satellite data such as Landsat and S2 are often not available over large areas and at regular acquisition intervals. This is not a problem for producing actual crop classification for smaller areas [8] or for analyzing temporally more "stable" classes such as non-vegetation and forests, since scenes from several years can be combined in such cases [45]. Another alternative is the aggregation to seasonal or annual composites [21], whereby even data from several years have to be combined [19] and phenological differences are lost. Another possibility is the smoothing and gap-filling of multi-temporal data [46].
The present work aimed to analyze the potential of monthly and yearly S2 composites. However, it turned out that despite the very good temporal repetition rate of the two S2 satellites, the acquisitions, at least for 2018, were insufficient to generate cloudfree/snow-free monthly composites for the entire Europe. Naturally, this leads to missing data, especially in the winter months. However, also, in the 6 months used for this study, spectral values were not available for all reference data, which also hampers an area-wide application of the model. Best classification results would be achieved with S2 data acquired in the period April-September period. Unfortunately, such a feature set provides only 75% spatial coverage. For geographic regions with data acquisition problems (due to extensive cloud coverage), narrower temporal windows must be used. For example, the monthly indicators between May and August allow coverage of 82% of the EU-28, with a drop in the OA of 1 percentage point compared to the best model. The use of the yearly indicators allows coverage of 97% of the area, with a drop in the accuracy of 4 percentage points. For this purpose, several models with different input data would have to be combined. This underlines the complementary potential of weather-independent microwave data such as S1. This could already be shown with less detailed classes by Venter and Sydenham [21], who could increase the overall accuracy by 3 percentage points by combining S1 and S2. Similar results were obtained by Inglada et al. [47]. Their study also found that the use of S2 outperformed that of S1 alone, which can be confirmed with the present study compared with the results of d'Andrimont et al. [22] based on S1 data.
In addition to high-quality remote-sensing data, reference data are crucial for successful classification. In this respect, both quality and quantity play a decisive role, and a good spatial distribution of the data is also important. Here, the European LUCAS data provide a good dataset that has already been used in several recent papers for large-scale applications [18][19][20]22]. As already mentioned, these studies used seasonal composites of optical data and, therefore, the focus was on broader and more general land-cover classes.
However, in addition to continental or global land cover classifications, the current distribution of crops is of interest, and the LUCAS 2018 data represent an excellent dataset. However, using this data, one major problem is related to the big differences in the number of samples per class. This study, therefore, investigated whether it is advantageous to use only different subsamples of the total available LUCAS data. The achieved accuracies using the independent validation data showed only very small differences between the approaches. However, the use of spectrally different pixels from a reference polygon reflects the class distributions best and gave practically the same results as using all pixels but using only 11% of the samples. The use of a single pixel per polygon (e.g., a proportional sampling), on the other hand, provided the lowest accuracies. Not surprisingly, this subset gave the most realistic RF OOB score (Table 9) while using all data for training provided a positively bias RF OOB score (compared to the OA obtained with independent data). This confirms the high significance of the RF OOB score as shown in previous literature provided that the training data are largely independent [48][49][50][51]. Attempting to create a more balanced training dataset mainly for the classes with larger numbers of training samples to address the issue of unbalanced datasets [45,50,[52][53][54], on the other hand, did not lead to any significant improvement. Also, the influence of the classification approach used was minor, with SVM performing only slightly better than RF, confirming the results of other studies [50,55].

Conclusions
Earth observation (EO) data, such as the Copernicus Sentinel-1 (S1) and Sentinel-2 (S2) data, are ideal for repetitive, cost-efficient crop type mapping over wide-areas. However, extensive field data are necessary for training and testing the classification algorithms and validating the maps.
In this study, we explored the potential of the Land Use and Coverage Area frame Survey (LUCAS) field data to provide crop types maps in combination with S2 observations for the year 2018. Classification results are very encouraging (achieving an overall accuracy of 77.6% with 19 crop types, plus grassland, and woodland and shrubland, across Europe) and show the potential of both LUCAS field data and S2 satellite data. However, there are several open questions, the most relevant being related to how to obtain and validate crop type maps in years when LUCAS data are not available. The community is looking forward to the LUCAS 2022 survey data to further test the approach across multiple years in combination with S1 and S2 satellite observations. There is also an urgent need to develop and validate transfer learning approaches that will allow reinforced learning based on multiple years of data. As a result, an improved temporal continuity in the mapping capacity and a more cost-effective use of the field data are expected.  Acknowledgments: This research was motivated by the need of detailed crop type maps that support different research activities and applications. For example, in the Horizon 2020 (H2020) Landsupport project (https://www.landsupport.eu, 13 December 2021), the development was driven by (1) the need of crop type maps for management, modelling and scenario analysis for best practices in agriculture at regional level, and (2) the need for spatially and thematically consistent Europeanwide land cover maps for the application of policy-related tools. In the context of the H2020 SIEUSOIL (https://www.sieusoil.eu, 13 December 2021), crop types, intensity and rotations are key indicators to design optimal soil management practices. A more farmer-oriented application is being developed in the Austrian FFG ARmEO project, where maps are needed to run a benchmarking tool that will allow farmers to compare the performance of their crops against same crop types growing on similar soils in the neighboring regions.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A   Table A1. The confusion matrix extracted from the SVM classification result on validation data. UA = user accuracy, PA = producer's accuracy, OA = overall accuracy.