Reliable Crops Classification Using Limited Number of Sentinel-2 and Sentinel-1 Images

Abstract: The study analyzes whether a limited number of Sentinel-2 and Sentinel-1 images can be used to check whether the crop declarations that EU farmers submit to receive subsidies are true. The declarations used in the research were randomly divided into two independent sets (training and test). Based on the training set, supervised classification of both single images and their combinations was performed using the random forest algorithm in SNAP (ESA) and our own Python scripts. A comparative accuracy analysis was performed on the basis of two forms of confusion matrix (the full confusion matrix commonly used in remote sensing and the binary confusion matrix used in machine learning) and various accuracy metrics (overall accuracy, accuracy, specificity, sensitivity, etc.). The highest overall accuracy (81%) was obtained in the simultaneous classification of multitemporal images (three Sentinel-2 and one Sentinel-1). An unexpectedly high accuracy (79%) was achieved in the classification of one Sentinel-2 image acquired at the end of May 2018. Noteworthy is the fact that the accuracy of the random forest method trained on the entire training set is equal to 80%, while with the sampling method it is ca. 50%. Based on the analysis of various accuracy metrics, it can be concluded that the metrics used in machine learning, for example specificity and accuracy, are always higher than the overall accuracy. These metrics should be used with caution because, unlike the overall accuracy, they count not only true positives but also true negatives as correct results, giving the impression of higher accuracy. Correct calculation of overall accuracy values is essential for comparative analyses. Reporting the mean accuracy value for the classes as overall accuracy gives a false impression of high accuracy. In our case, the difference was 10–16% for the validation data, and 25–45% for the test data.


Introduction
The Integrated Administration and Control System (IACS) was created in the European Union to control direct payments to agriculture. Under the common agricultural policy (CAP), direct payments, without going into different payment schemes, apply to crops that the farmer declares each year, specifying the type of crop (plant) and its area. Some declarations (ca. 5%) are controlled by on-the-spot checks, in which the area of an agricultural parcel is measured and the plant is identified in the field. In order to simplify and automate this procedure, in 2018 the European Commission adopted new rules to control all declared parcels based on Copernicus satellite data: Sentinel-1 (S-1) and Sentinel-2 (S-2). To control farmers' declarations, the EC's Joint Research Centre (JRC) recommends analyzing time series of vegetation indices for each agricultural parcel during the growing season [1][2][3].
The most popular index calculated from optical images is NDVI (normalized difference vegetation index) [4], and in the microwave spectral range SIGMA (radar backscattering coefficient) [5]. Analysis of the variability plots of these parameters over time allows for [...] The presented two approaches provided high accuracy of crop recognition for control purposes, but required many cloud-free images of large areas. This is rather difficult in the case of Sentinel-2, especially considering that the area of Belgium and North Rhine-Westphalia is 10 times smaller than the area of Poland.
Similar research was also conducted by our team [17,18]. Ten Sentinel-2 images from September 2016 to August 2017 and nine Sentinel-1 images from March to September 2017 were analyzed. A Spectral Angle Mapper (SAM) classifier was used to classify the time series of NDVI images. An accuracy of OA = 68.27% was achieved, which is consistent with the accuracy (69%) of other independent studies of a similar nature conducted in Poland [19]. Therefore, instead of NDVI and SIGMA time series, it was decided to check the possibility of classifying single S-2 images and a combination of several multitemporal S-2 images. The aim of the research was to develop a simple, fast but reliable screening method for the control of farmers' declarations.
However, while reviewing the literature on currently used image classification methods for crop recognition, we encountered the problem of comparing the accuracy of classification results. In the traditional remote sensing approach, accuracy is calculated from test data independent of the training data. In machine learning, a lot of attention is paid to the selection of hyperparameters, which is carried out iteratively using only a part of the training set. At the same time, the validation accuracy is determined on the basis of the samples from the training set that were not used for learning.
Some authors report validation accuracy and accuracy calculated on independent test data [7,12,20]. Others only provide information on accuracy based on external reference data not included in the training set [8,15,16,21]. In some publications, there is not enough information on this issue [13,22,23].
There are plenty of examples of using only one reference set divided, randomly or by stratification, into training and validation samples, while the distinction between training, validation and independent test data is extremely rare [24].
It turns out that the problem was also noticed by other researchers, e.g., in a key work on rigorous accuracy assessment of land cover products [25], which discusses good practice in accuracy assessment and in sampling design for training, validation and accuracy analysis.
The key statement, from the point of view of our research, is: "Using the same data for training and validation can lead to optimistically biased accuracy assessments. Consequently, the training sample and the validation sample need to be independent of each other which can be achieved by appropriately dividing a single sample of reference data or, perhaps more commonly, by acquiring separate samples for training and testing". Similar conclusions can be found in [26].
Reliability of the classification also depends on the reported accuracy metrics and on an unambiguous way of calculating them. Like the previous topic, this is not a trivial issue, although it may seem so. Despite many years of research on various accuracy metrics [27][28][29][30][31][32][33], the 2019 paper mentioned above states that overall accuracy (OA), producer accuracy (PA) and user accuracy (UA) are still considered the basic ones [25].
Unfortunately, the situation in this area has become more complicated due to the common use of ML methods. In recent years, additional accuracy metrics have come into use that are not used in traditional approaches, i.e., specificity and accuracy. Other metrics calculated automatically in ML tools, sensitivity and precision, correspond, respectively, to producer accuracy (PA) and user accuracy (UA) in traditional image classification.
In extensive reviews of the literature, both older and recent, comparative analyses of many different metrics can be found [26][27][28][29][30][31][32][33], but they do not cover sensitivity and accuracy, although these are commonly used in ML. It is worth noting that sensitivity and accuracy are appropriate for the classification of one class. A problem arises when they are used in assessing the classification of multiple classes, especially if the average value of the accuracy [34] or global precision [35] is reported as OA, creating an illusion of higher accuracy [36].
Summarizing the issue of the reliability of the classification result, especially when ML tools are used uncritically, there is a double risk of overestimating accuracy: one related to the lack of an independent test set and one related to the incorrect calculation of the metric most frequently compared in research, OA. Therefore, the accuracy of the classification, which has a very significant impact on the reliability and validity of the remote sensing method for verifying the crops declared by farmers, should be demonstrated with great attention and care. In the article, we focused on three issues:

1. Image classification as a screening method for controlling farmers' declarations based on a limited number of images (verification, in Polish conditions, of the hypothesis from publication [8]).

2. Analysis of classification results obtained with extremely different sampling designs (we did not focus on the description of the model fitting).

3. Comparison of the traditional accuracy metrics with those used in ML (including a discussion of incorrect OA calculations), in order to confirm the hypothesis that the accuracy of the classification result is artificially overestimated.
To the best of our knowledge, there are no publications on a quick and reliable screening method to control the declarations submitted by millions of farmers in each EU country each year. In addition, although some publications [7,25] note that classification accuracy is artificially overestimated when OA is calculated from the validation set instead of the test set, there is no broader discussion of this issue. The issue of incorrect calculation of OA, in turn, is completely ignored in publications.

Materials and Methods
The research consists of three parts (Figure 1):
• obtaining and preparing reference and image data,
• image classification,
• comparative accuracy analysis.

Materials and Data Preparation
The test area of 625 square kilometers (25 km × 25 km) was located in central Poland, near Poznań (Figure 2), and included 5500 agricultural parcels declared by farmers for subsidies. Data on farmers' declarations were provided by the Agency for Restructuring and Modernisation of Agriculture (ARMA) in Poland and included the size of the agricultural parcel, the type of crop, and the geometry of the agricultural parcel (polygon). The critical size of an agricultural plot is 0.5 ha (this size should exclude the influence of the shape [1]). We selected parcels with an area of 1 ha or larger to avoid technical problems with identifying small parcels. In order to reduce the number of plots and eliminate errors, farmers' declarations were statistically analyzed for:
• plot size (less than 1 ha),
• use of rare crops (less than 5 declarations).
Finally, 4576 parcels were selected for the analysis (Figure 3, Table 1). The set of parcels was randomly divided into 2 groups: training fields (2190 parcels) and test fields (2386 parcels), which were used for classification and accuracy assessment, respectively.
Images from the Sentinel-2 and Sentinel-1 satellites of the European Copernicus program (ESA, 2020) were used for the analysis. Tables 2 and 3 contain a list and description of the satellite images used in the study. The images were downloaded from the Copernicus Services Data Hub (CSDH) (https://cophub.copernicus.eu/, accessed on 1 September 2018).
The data were collected as granules with a size of 100 km × 100 km. Three Sentinel-2 images registered in September 2017, May 2018 and July 2018 were selected for analysis. Level-2A images were acquired, i.e., images after geometric, radiometric, and atmospheric correction. Their short names are S2_20170928, S2_20180526 and S2_20180720.

Table 3. Characteristics of the Sentinel-1 satellite image.

Parameter: Information
Satellite: Sentinel-1
Level: Ground Range Detected
Polarisation: Dual VV+VH
Number of images: 1
Dates: 15 July 2018
Images: S1B_IW_GRDH_1SDV_20180715T164255_20180715T164320_011824_015C28_07F3
Short name: S1_VV_20180715, S1_VH_20180715

Additionally, one Sentinel-1 image was included in the tests. It was preprocessed to obtain the sigma coefficient according to the following workflow, applied to both polarization modes (VV and VH):
• radiometric transformation of pixel values to the backscatter coefficient (sigma0),
• geometric transformation by the Range Doppler orthorectification method with SRTM 3 sec as the DEM and bilinear interpolation,
• removal of the salt-and-pepper (speckle) effect using the refined Lee filter,
• logarithmic transformation of the backscatter coefficient to dB.
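The last step of the workflow above, the logarithmic transformation, can be sketched in a few lines. This is only a minimal illustration, assuming the calibrated and filtered sigma0 values are already available as a NumPy array; the function name `sigma0_to_db` and the `floor` guard are hypothetical, and the earlier steps (calibration, terrain correction, speckle filtering) are not reproduced here.

```python
import numpy as np

def sigma0_to_db(sigma0, floor=1e-10):
    """Convert linear backscatter (sigma0) to decibels: 10 * log10(sigma0)."""
    sigma0 = np.asarray(sigma0, dtype=float)
    # Clip to a small positive floor (an assumption of this sketch)
    # so that zero-valued pixels do not produce -inf.
    return 10.0 * np.log10(np.clip(sigma0, floor, None))

# Hypothetical linear backscatter values
sigma0_linear = np.array([1.0, 0.1, 0.01])
sigma0_db = sigma0_to_db(sigma0_linear)
print(sigma0_db)
```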
In the next step, the classifications were performed on the basis of the following image sets: In the classification of a single image, all 10 channels were used, and the channels with a resolution of 20 m were first resampled to a spatial resolution of 10 m. The classification of the combination of 3 images is a classification of 30 channels, 10 from each Sentinel-2 image. The simultaneous classification of optical and radar images was based on a stack of 31 channels: 30 optical Sentinel-2 channels and 1 Sentinel-1 channel (VV or VH).
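The 31-channel stack described above can be illustrated with a small NumPy sketch. It assumes that all bands have already been resampled to the same 10 m grid and are held as arrays of shape (bands, rows, cols); the random arrays and all variable names are hypothetical stand-ins for the real rasters.

```python
import numpy as np

rows, cols = 4, 5  # tiny hypothetical raster size
rng = np.random.default_rng(0)

# Three Sentinel-2 images with 10 channels each (placeholder data)
s2_images = [rng.random((10, rows, cols)) for _ in range(3)]
# One Sentinel-1 channel (VV or VH), already on the same grid
s1_channel = rng.random((1, rows, cols))

# Stack along the channel axis: 3 * 10 optical + 1 radar = 31 channels
stack = np.concatenate(s2_images + [s1_channel], axis=0)

# Pixel-based classifiers expect one row per pixel, one column per channel
features = stack.reshape(stack.shape[0], -1).T
print(stack.shape, features.shape)
```

Each row of `features` then holds the 31 values of one pixel, which is the layout expected by scikit-learn estimators.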

Images' Classifications
The idea was to use the SNAP (ESA) software, because it is open source, is commonly used for processing Sentinel-1 and Sentinel-2 images, and is likely to be used in IACS control. However, it has limitations regarding the size of the training set. Therefore, we prepared our own Python scripts to complete the research.
Eventually, the image classifications were carried out with the random forest algorithm using: • Python version 3.9.0, scikit-learn version 0.23.2.
As mentioned above, classification in SNAP has some limitations. For example, it is not possible to load a relatively large number of training fields, as in our case (2190 parcels). In addition, the choice of classification parameters, e.g., the number of sample pixels, is also limited. The training fields must allow the selection of the required number of training pixels. The total number of assigned sample pixels is divided by the number of classes, and from each class the algorithm tries to select this number of pixels if possible. Setting a number of sample pixels that exceeds the total number of pixels in a class may be problematic. Therefore, the default settings in SNAP are 5000 sample pixels (due to the sampling method of the training set) and 10 trees (due to the computation time). As part of the research, many different variants of classification with different settings were carried out, especially since the default values were insufficient.

The commonly used grid search method was used to select the best hyperparameters of RF. The GridSearchCV class implemented in scikit-learn was applied. The investigated parameter grid included: Five-fold cross-validation (CV = 5) was applied. Three metrics were used to assess the quality of the model: accuracy, mean value of recall (balanced_accuracy), and a weighted average of the precision and recall (f1_weighted). In all hyperparameter estimation simulations, all 3 metrics assessed the parameters at the same level. Usually, the set of hyperparameters that best suits the available computational capabilities is selected. Increasing the number of trees can improve the results, but this is strongly limited by the available RAM. Moreover, when the number of trees is increased above 100, the differences in accuracy for the considered problems are negligible, in the order of tenths of a percent (mean_accuracy: 0.8150, 0.8164 and 0.8177 for 50, 100 and 500 trees, respectively).
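A grid search of this kind might look roughly as follows. This is a hedged sketch and not the scripts used in the study: the synthetic data from `make_classification` and the small `param_grid` are assumptions made for illustration, since the investigated grid is not reproduced in the text. Only the five-fold cross-validation and the three scoring names come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the pixel samples (assumption, not the study data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Hypothetical grid; the grid investigated in the study is not given here
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 10]}
scoring = ["accuracy", "balanced_accuracy", "f1_weighted"]

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=scoring,
    refit="accuracy",  # with several metrics, one must be named for refitting
    cv=5,              # five-fold cross-validation
)
search.fit(X, y)

# Mean cross-validated score of every parameter combination, per metric
for name in scoring:
    print(name, search.cv_results_[f"mean_test_{name}"])
print(search.best_params_)
```

Comparing the three `mean_test_*` arrays side by side shows whether the metrics rank the parameter combinations consistently, which is the observation reported above.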
Eventually, the largest feasible numbers of sample pixels and trees were adopted:
• 50,000 randomly selected samples (pixels) from the training set,
• 23 trees.
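The sampling rule described above (the total number of sample pixels is divided by the number of classes, and each class contributes that many pixels if it can) can be sketched as follows. This mirrors the textual description only and is not SNAP's actual implementation; the function name `sample_per_class` and the label counts are hypothetical.

```python
import numpy as np

def sample_per_class(labels, total_samples, rng):
    """Draw about total_samples / n_classes pixel indices from each class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = total_samples // len(classes)
    picked = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        # A class cannot contribute more pixels than it contains
        k = min(per_class, idx.size)
        picked.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(picked)

rng = np.random.default_rng(42)
# Hypothetical training pixels: three classes of very different sizes
labels = np.repeat([1, 2, 3], [1000, 200, 30])
sample_idx = sample_per_class(labels, total_samples=300, rng=rng)
print(np.bincount(labels[sample_idx])[1:])
```

In this example the smallest class holds fewer pixels than its share of the request, so the sample is smaller than requested and the classes are no longer balanced, which illustrates why requesting more sample pixels than a class contains can be problematic.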
There are generally no such limitations in Python, so the whole training set (2190 parcels, 1,412,092 pixels) could be used for training. We tested different settings with k-fold cross-validation and decided to apply the following settings:

Accuracy Analysis
Based on a literature review, the main metrics were selected for the analysis: OA, PA, UA and, additionally, f1, together with two metrics from ML that are usually not used in remote sensing: accuracy and specificity (Table 4; please note the difference between OA and accuracy). The meaning of these metrics can be illustrated on any full confusion matrix, for example the one in Table 5, taken from a 2021 publication [37] (its Table 4, transposed for our purposes).

Table 4. Selected accuracy indicators calculated for each class separately except OA [30,38]. Columns: RS, Description RS, ML, Description ML, Formula.

Table 5. Example of full confusion matrix (source: [37], Table 4 modified for our purposes: transposed, symbols of classes instead of names).
The accuracy of the classification results is estimated on the basis of a confusion matrix: either the full confusion matrix, typically used in remote sensing, or the binary confusion matrix used in machine learning.
The full confusion matrix represents the complete error matrix (Table 5), i.e., the combination of all classes with each other using the peer-to-peer method, which includes all commission and omission errors for each class.
The binary confusion matrix contains only cumulative information: the number of samples correctly classified as a given class (TP, true positives), correctly not classified as this class (TN, true negatives), falsely classified as this class (FP, false positives) and falsely not classified as this class (FN, false negatives). One binary confusion matrix is assigned to one class (e.g., Table 6 for C1). In our case we therefore have 6 binary confusion matrices (Table 7), which are flattened so that each matrix is written in one row. The binary confusion matrices can be computed from the full confusion matrix, but the reverse operation is impossible. All metrics can be computed from both kinds of matrix, and their values are of course the same. However, it should also be noted that the full confusion matrix contains more information than the binary confusion matrices. In the case of more than 4 classes, the full confusion matrix is larger than the set of binary confusion matrices, because the binary confusion matrix for one class is always 2 × 2 (Table 6), i.e., after flattening, one row in Table 7. The main advantage of the full confusion matrix is the possibility of an exhaustive analysis of testing samples and errors (so-called omission and commission errors).
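The derivation of the flattened binary confusion matrices from a full confusion matrix can be sketched as follows. The 3-class matrix is hypothetical (it is not Table 5), and the convention that rows hold the reference classes and columns the classification result is an assumption of this sketch.

```python
import numpy as np

def binary_confusion_rows(cm):
    """Return one row (TP, FP, FN, TN) per class from a full confusion matrix."""
    cm = np.asarray(cm)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # classified as the class, reference differs
    fn = cm.sum(axis=1) - tp      # reference is the class, classified otherwise
    tn = cm.sum() - tp - fp - fn  # everything that does not involve the class
    return np.column_stack([tp, fp, fn, tn])

# Hypothetical full confusion matrix (rows: reference, columns: classified)
full_cm = np.array([[50, 3, 2],
                    [4, 40, 6],
                    [1, 4, 30]])
rows = binary_confusion_rows(full_cm)
print(rows)
```

Each row keeps only the class totals, so the information about which classes were confused with each other is lost; this is why the full matrix cannot be rebuilt from the binary ones.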
More important, however, is the distinction between OA and the mean value of accuracy (acc). The sum of the numbers of correctly classified samples is used in the numerator to calculate OA. In the classification of many classes, it is the sum of TP. For one class, we are dealing with samples correctly classified as a given class and correctly not classified as it, i.e., the diagonal of the binary confusion matrix contains TP and TN. So, for class C1, OA and acc are both equal to (87 + 220)/(87 + 3 + 10 + 220) = 0.9594.
When individual classes are analyzed separately, the acc values correspond to OA. However, the OA for all classes is 0.9000, and it is not the mean acc of the classes, which is 0.9667. In this case, the difference is ca. 7%, but one should also take into account the relatively small number of TN because, as can be seen from the formula for acc, the more TN, the greater the accuracy (acc).
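The gap between OA and the mean per-class accuracy can be reproduced with a short sketch. The 3-class matrix is hypothetical, and rows are assumed to hold the reference classes. For single-label classification, every misclassified sample counts once as a false positive and once as a false negative, so the mean acc over K classes equals 1 − 2(1 − OA)/K; with OA = 0.9000 and the six classes of the example above, this gives the quoted value of 0.9667.

```python
import numpy as np

def overall_accuracy(cm):
    """OA: correctly classified samples (sum of TP) over all samples."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def per_class_accuracy(cm):
    """acc per class: (TP + TN) / (TP + TN + FP + FN)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    return (tp + tn) / cm.sum()

# Hypothetical full confusion matrix (rows: reference, columns: classified)
cm = np.array([[50, 3, 2],
               [4, 40, 6],
               [1, 4, 30]])
oa = overall_accuracy(cm)
acc_m = per_class_accuracy(cm).mean()
k = cm.shape[0]
print(f"OA = {oa:.4f}, mean acc = {acc_m:.4f}")

# Values quoted in the text: class C1, and OA = 0.9000 with six classes
acc_c1 = (87 + 220) / (87 + 3 + 10 + 220)
acc_m_text = 1 - 2 * (1 - 0.9000) / 6
print(f"acc(C1) = {acc_c1:.4f}, mean acc from OA = {acc_m_text:.4f}")
```

Because the true negatives grow with the number of classes, the mean acc approaches 1 as K increases even when OA stays fixed, which is why reporting it as OA overstates the result.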
In our research, the accuracy analysis was performed in accordance with the classification design. In the case of learning on a selected number of samples, the accuracy analysis was performed twice: on the basis of the validation set and on the basis of the test set. In the case of using the entire training set in learning, the accuracy analysis was performed only on the test set.
Binary confusion matrices have been calculated for validation simultaneously with the classification in SNAP on the basis of randomly selected pixels from the training set (results are available in the text file, default name: "classifier.txt", SNAP_META). SNAP does not provide accuracy analysis on the independent test set, therefore we made the analysis externally in our own Python scripts.
Accuracy analysis was performed on the test set in the pixel and object-oriented approach, using our own scripts, Python (PP) and Python (PO), respectively. In the object-oriented approach, 2386 samples equal to the number of all parcels in test set were analyzed, corresponding to 1,412,092 pixels (10 m pixel size), which is the number of samples analyzed in the pixel approach.
In Python (PP), test polygons were converted to raster form and crossed with the classification result to compute the confusion matrix. In Python (PO), the modal value of the classification result within each polygon was calculated using a zonal statistics algorithm. This provided the basis for calculating the confusion matrix. The full confusion matrix, binary confusion matrices and accuracy metrics were calculated for each classification result.
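The object-oriented step, i.e., taking the modal class within each parcel, can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the scripts used in the study: the parcels are assumed to be already rasterized into an array of parcel identifiers aligned with the classification result, 0 is assumed to mark background, and the function name `modal_class_per_parcel` is hypothetical.

```python
import numpy as np

def modal_class_per_parcel(parcel_ids, class_map, n_classes):
    """Return {parcel id: most frequent class among the parcel's pixels}."""
    parcel_ids = np.asarray(parcel_ids).ravel()
    class_map = np.asarray(class_map).ravel()
    result = {}
    for pid in np.unique(parcel_ids):
        if pid == 0:  # 0 = background, outside any parcel (assumption)
            continue
        counts = np.bincount(class_map[parcel_ids == pid], minlength=n_classes)
        # argmax resolves ties in favor of the lowest class index
        result[int(pid)] = int(counts.argmax())
    return result

# Hypothetical 3 x 3 rasters: parcel identifiers and classified pixels
parcel_ids = np.array([[1, 1, 2],
                       [1, 2, 2],
                       [0, 2, 2]])
class_map = np.array([[0, 0, 1],
                      [2, 1, 1],
                      [0, 2, 1]])
modal = modal_class_per_parcel(parcel_ids, class_map, n_classes=3)
print(modal)
```

The resulting per-parcel classes can then be compared with the declared crops to build the object-based confusion matrix, with one sample per parcel instead of one per pixel.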

Results
The chapter is composed of three parts. The first part presents the values of accuracy metrics (calculated from the binary confusion matrices given in SNAP) for the RF classification using the sampling method for three types of data:
• two sets randomly selected from the training set (one for training and one for validation),
• an independent test set.
The second part presents the results of the RF classification using all training samples (the entire set of training plots) with the full accuracy analysis on all testing samples.
The third part shows the discrepancies between farmers' declarations and classification results obtained in these two approaches.

Random Forest Classification Using Sampling Method
The first part presents two sets of RF classification results using the sampling method:
• 5000 training pixels, 5000 validating pixels and 10 trees (default in SNAP),
• 50,000 training pixels, 50,000 validating pixels and 23 trees.
Table 8 shows the accuracy metrics of single image classification for the example image S2_20180526, calculated from the binary confusion matrix stored in SNAP_META for 5000 training pixels, 5000 validating pixels and 10 trees. All metrics are very high: overall accuracy OA = 0.9056, and all average accuracy indices are above 0.90: accuracy acc_m = 0.9874 (8.18% higher than OA), sensitivity tpr_m/PA = 0.9055, precision ppv_m/UA = 0.9065 and F1 score f1_m = 0.9055.
Table 9 shows the accuracy metrics of single image classification for the same image, S2_20180526, calculated from the binary confusion matrix stored in SNAP_META for 50,000 training pixels, 50,000 validating pixels and 23 trees. By analyzing the accuracy metrics for image S2_20180526 in Table 9, it can be noticed that:
• all metrics are significantly above 0.78 (acc even above 0.97) in all classes; all mean values (last row) are above 0.85,
• incorrectly reporting acc_m as OA creates a false impression of a 10.50% higher accuracy (acc_m = 0.9838 while OA = 0.8788); this illustrates the problem highlighted in the Introduction and also presented in Materials and Methods.
Accuracy metrics for the remaining images and their combinations were calculated in the same way as for S2_20180526 (Table 10 contains all metrics). Additionally, the variability of two selected indices, OA and f1, is presented graphically in Figure 4. [...]
• the inclusion of radar images did not increase the accuracy,
• the accuracy analysis on the control set, Python (PP) and Python (PO), shows a slight decline in accuracy for the image combination, which contradicts the values obtained from the accuracy analysis of SNAP_META.

Random Forest Classification Using Entire Training Set
A sample full confusion matrix with the best result (Combination4VV, Python PO) is shown in Table A1, and the corresponding binary confusion matrices calculated from it are shown in Table 11.
By analyzing the accuracy metrics for Combination4VV (Python PO) in Table 11, it can be noticed that:
• there is a significant variation in the values of the metrics compared to Table 9; among the mean values (last row), only acc_m is especially high (0.9746); the ppv value is also high (0.8587), while tpr and f1 are much lower,
• in this case, the difference between OA and acc is much higher: OA = 0.8097 and mean acc_m = 0.9746; the accuracy overestimation is approx. 16%.
Accuracy metrics for the remaining images and their combinations are presented in Table 12. Additionally, the variability of two selected indices, OA and f1, is presented graphically in Figure 5. Based on the metrics' values in Table 12 and Figure 5, it can be concluded that:
• the highest accuracy (OA = 81%) was obtained for Combination_3x, Combination_4VV and Combination_4VH,
• an unexpectedly high accuracy (OA = 79%) was obtained for a single image registered in May 2018, S2_20180526,
• a very low accuracy (OA = 33%) was obtained for the image in the fall of the previous year compared to the year for which the analysis was performed (S1_20180715 VV),
• the difference between the OA calculated in the pixel and object approaches is smaller than in Figure 4, especially for the combination of images (compare the yellow and red curves in Figure 5 with the yellow and purple curves in Figure 4),
• when comparing the metrics in rows, greater variation can be seen than in the previous paragraph, but acc_m always has very high values, above 90%,
• comparing the columns OA and acc_m (in this case only for the test set, Python (PP)/(PO)), we can see the discrepancy between the correctly calculated OA value and the mean acc_m, but smaller than in the previous paragraph; the difference is on average 25% (except S1_20180715 and S2_20170928),
• the shape of the relationship in Figure 5 is similar to that in Figure 4 (i.e., for image S2_20180720, the accuracy is reduced compared to image S2_20180526 and the image combinations),
• the inclusion of radar images did not increase the accuracy.

Influence of the Number of Samples on the Classification Result
Figure 6 shows the SNAP learning accuracy metrics obtained for the default number of pixels, 5000 (orange bars in the chart), and for the number of pixels used in the research, 50,000 (blue bars in the chart). The first number in the legend is the number of training samples and the second is the number of samples used for validation; in SNAP, these numbers are equal (the total number of pixels used is 10,000 and 100,000, respectively). The values of OA, tpr_m, ppv_m and f1_m are close to each other and high, above 0.8; acc is an outlier, almost equal to 1.0. It is worth noting that the metrics calculated on the basis of 5000 pixels are slightly higher than those calculated for 50,000 pixels. These higher accuracy metrics are not reflected in the accuracy determined on the basis of the independent control fields (1,400,000 pixels), which for the default settings is lower than for 50,000 pixels (brown dashed line compared to blue dashed line). Finally, it is also worth paying attention to the high overall accuracy of the classification made with the use of all available pixels from the training set (130,000).

Discrepancies between Farmers' Declarations and Classification Results
The accuracy analysis discussed in the Results makes it possible to create a map of the discrepancies between the crop declared by the farmer and the one identified using the random forest algorithm. Figures 7 and 8 show the discrepancies between farmers' declarations and classification results obtained for the single image (S2_20180526) and, for comparison, for the combination of 3 Sentinel-2 images and 1 Sentinel-1 (VV) image. The parcels for which the classification confirms the declarations are presented in gray, and the parcels for which the crops declared by the farmers differ from the classification result are presented in brown. There is a huge difference between the results obtained with the classification using the sampling method and the results of the classification performed on the whole training set.
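The comparison behind such a discrepancy map reduces to a per-parcel check of the declared crop against the classified one. A minimal sketch is given below; the parcel identifiers, crop names and the function name `flag_discrepancies` are hypothetical, and the classified crop is assumed to be already available for each parcel (for example as the modal class within its polygon).

```python
def flag_discrepancies(declared, classified):
    """Return {parcel id: True if the classified crop differs from the declared one}."""
    return {pid: classified.get(pid) != crop for pid, crop in declared.items()}

# Hypothetical declarations and classification results per parcel
declared = {101: "wheat", 102: "maize", 103: "rapeseed"}
classified = {101: "wheat", 102: "rapeseed", 103: "rapeseed"}

flags = flag_discrepancies(declared, classified)
# Parcels that would be drawn in brown on the map and could be checked further
to_inspect = sorted(pid for pid, differs in flags.items() if differs)
print(to_inspect)
```

A parcel missing from the classification result is also flagged in this sketch, which is a design choice for screening: anything that cannot be confirmed is passed on for further checking.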

Discussion
In the discussion, we refer to the aim of the research, i.e., the analysis of the results of image classification for an effective and reliable screening method to control farmers' declarations. In temperate climates, an efficient method that is applicable to a large area must be based on as few images as possible, preferably one. The method implemented for the inspection of farmers' declarations must be reliable as it may result in financial penalties for the farmer. The reliability of the method can be determined on the basis of a properly performed accuracy analysis. In this case, we are not interested in the accuracy of fitting the hyperparameters of the classification method. Accuracy of validation does not determine the actual accuracy of the product, which is the classification result. The phenomenon of accuracy overestimation using only validation data set, emphasized in the literature review [25], is confirmed in other literature, e.g., [7,20], and also in our research.
The accuracy analysis should be performed on the training set (if possible, e.g., in the SAM method), on the validation set and on the test set. In most methods, it is not possible to obtain accuracy on the training set, but only on the validation and test sets. In many publications the accuracy of validation (OA) is reported, which in almost all cases is above 80% (e.g., SVM = 97.7% [7], SVM = 98.96% [20], modified SVM = 98.07% [13], RF = 93% [22], RF = 86.98% [23], RF = 83.96% [39], Dynamic Time Warping algorithm: NDVI time series classification = 72–89%, multi-band classification = 76–88% [40]). In some cases, the accuracy for test data is also given: 84.2% [7] and 88.94% [20], which is 13.5% [7] and 10.02% [20] lower than the respective validation accuracy.
In our research, the accuracy of the validation was also over 90%, and the accuracy on the test data was approx. 45%. We used a training set composed of 2190 parcels/1,412,092 pixels and a test set of 2386 parcels/1,412,092 pixels, and the number of samples for learning was 5000 and 50,000. Since the classification based on the selected samples did not deliver satisfactory results, the entire training set was used for training and the accuracy on the test data increased to 80% (all accuracy metrics: OA, acc, tpr, ppv, f1).
It is difficult to compare our experiment with the research design mentioned above. Note the number of training and test samples: 2005/341 points [7] and 2281/1239 pixels [20].
When analyzing the credibility of the method, the issue of selecting accuracy metrics cannot be ignored. The most frequently reported accuracy metric is OA, regardless of whether the traditional approach or ML is used. Confusion may arise when the mean accuracy (acc) value is given in ML instead of the correct OA value. In our research, we obtained an overestimation of up to 45%. It is impossible to refer to the literature on this topic because, to our knowledge, this problem has not been discussed so far.
Many studies exist regarding the application of remote sensing for crop recognition. They are typically based on time series of optical images, radar images, or both simultaneously. The authors do not always provide sufficient information about the accuracy analysis and they use different metrics. Nevertheless, several examples can be given in this area.
Integration of multi-temporal S-1 and S-2 images resulted in higher classification accuracy compared to classification of S-2 and S-1 data alone [41] (max. kappa for two crops: 0.82 for wheat and 0.92 for rapeseed). Using only S-2 images, the max. kappa obtained was 0.75 and 0.86 for wheat and rapeseed, respectively. Using only S-1 images, the max. kappa obtained was 0.61 and 0.64 for wheat and rapeseed, respectively.
The kappa coefficient was also used in the evaluation of in-season mapping of irrigated crops using Landsat 8, Sentinel-1 time series and Shuttle Radar Topography Mission (SRTM) [42]. Reported classification accuracy using the RF method for integrated data was: kappa = 0.89 compared to kappa = 0.84 for each type of data separately.
In other studies, simultaneous classification of S-1, S-2 and Landsat-8 data was applied to the recognition of three crops: wheat, rapeseed, and corn [43]. The classification was performed with the Classification and Regression Trees (CART) algorithm in Google Earth Engine (GEE), and its accuracy, estimated in this case by the overall accuracy metric, was OA = 84.25%.
The effect of different time intervals on early season crop mapping (rice, corn and soyabean) has been the subject of other studies [44]. Based on the analysis of time profiles of different features computed from satellite images, optimal classification sets were selected. The study resulted in a maximum accuracy of OA = 95%, and a slightly lower accuracy of 91-92% in specific periods of plant phenology.
Wheat area mapping and phenology detection using S-1 and S-2 data has been the subject of another study [45]. Classifications were performed using the RF method in GEE, yielding an accuracy of 88.31% for integrated data (accuracy drops to 87.19% and 79.16% when using only NDVI or VV-VH, respectively).
Time series of various features from S-2 were analysed in the context of recognizing three crops: rice, corn and soyabean [46]. The research included 126 features from Sentinel-2A images: spectral reflectance of 12 bands, 96 texture parameters, 7 vegetation indices, and 11 phenological parameters. The results of the study indicated 13 features as optimal. The overall accuracies obtained by the different methods were: SVM 98.98%, RF 98.84%, and maximum likelihood classifier (MLC) 88.96%.
To conclude this brief review, it is important to note that the metrics differ when validation accuracy is compared with accuracy based on a test set. In the following discussion, the accuracy data cited from the literature and from our study apply only to metrics computed on a test data set independent of the training data.
Finally, in the context of the literature, we would like to discuss the results of our research on the accuracy of single-image classification for crop recognition. In 2020, Ref. [8] reported that it was possible to obtain high accuracy of crop classification using one Sentinel-2 image registered in the appropriate plant phenological phase. The authors presented the results of Sentinel-2 time series classification performed on a test area in the Western Cape Province, South Africa, for five crops: canola, lucerne, pasture (grass), wheat, and fallow. The most important conclusion from that research is that high crop classification accuracy (77.2% with the support vector machine (SVM) method) can be obtained using one Sentinel-2 image recorded approx. 8 weeks before harvest (compared with a maximum of 82.4%). Some other researchers compare the results of classification of various combinations of time series with the results of classification of single images. This is particularly important in temperate climatic zones, where acquiring many cloudless images over large areas is problematic.
One example can be found in [7], where the authors examined perennial crops using various combinations of Sentinel-1 and Sentinel-2 images. They obtained the maximum accuracy, 84.2%, for the combination of ten Sentinel-2 and ten Sentinel-1 images; for comparison, the classification accuracy of the combination of ten Sentinel-2 images was 83.0%.
The effect of classifying all Sentinel-2 channels was also tested in comparison with the classification of only the channels with a resolution of 10 m. A single optical image with four 10 m channels resulted in an accuracy of 71.6%, while the use of 10 channels improved the accuracy to 77.4%. In addition, these studies lead to one more conclusion: the NDVI time series classification gives worse results than the classification of the original images (which was also observed during our research [17,18]). Results of the research most similar to ours can be found in [16] (also cited in the Introduction). The maximum accuracy of 82% was achieved for the combination 6 × Sentinel-1 + 6 × Sentinel-2, much higher than for the single Sentinel-2 image, for which it was 39%.
In our case, the highest accuracy (81%) was obtained in RF classification using the entire training set, with accuracy estimated in the object-oriented approach, for Combination_3x, Combination_4VV, and Combination_4VH (for comparison, 78% in the pixel approach). It was also surprising that there was no large decrease in accuracy for the single image S2_20180526 (79% in Python PO and 73% in Python PP). We obtained better accuracy for one image than [16] (one Sentinel-1: 47%, one Sentinel-2: 39%), and accuracy comparable with [7,8]. The highest single-image accuracy of 79% was for the image registered on 26 May 2018, while the classification of the image of 20 July, just before the harvest, was slightly less accurate.

Conclusions
Analyses of the classification accuracy of three Sentinel-2 and one Sentinel-1 images allow the following conclusions to be drawn:
1. The accuracy metrics used in machine learning, "accuracy" and "specificity", show overestimated accuracy values because they count not only "true positive" but also "true negative" cases. This approach is valid for one-class classification (e.g., medical testing) but not for the use of classification for crop recognition.
2. Reporting the mean accuracy value as overall accuracy gives a false impression of high accuracy. In our first case (SNAP), for the image from May on the control fields, the accuracy overestimation was approx. 45% (if, instead of the correct value of 52%, we gave the average acc_m value of 94%); in the second case it was approx. 20% (97% instead of 79%). Compare OA and acc_m for S2_20180526 in Tables 10 and 12.
3. The use of all training pixels from the reference polygons, compared to the sampling method, increases the classification accuracy of the RFC algorithm by almost 40% (from 50% to 80%).
4. The highest classification accuracy, equal to 81%, was obtained for the combination of three Sentinel-2 images with all pixels in our own Python script (for comparison, 80% was reported by [8]).
5. The overall accuracy of the single-image classification was equal to 79%, which is slightly higher than the values from the literature (77.4% [7], 77.2% [8]) and much better than 47% [16]. We obtained the highest accuracy in May, a few weeks before harvest, which confirms [8].
6. Adding radar images did not improve the classification result, which is also confirmed in the literature [20,23]; however, because only one Sentinel-1 image was used, this conclusion cannot be generalized and requires further research.
The research confirmed the possibility of using a single Sentinel-2 image, registered several weeks before the harvest, for screening control of farmers' declarations. This conclusion matters mainly because of the difficulty of acquiring cloudless multitemporal images over large areas in central Europe.
In the random forest classification method, it is recommended to use all data from the training set. It is not possible to input large training data sets into SNAP, but it is possible with one's own scripts written, e.g., in Python. In accuracy analyses, it is not recommended to use the metrics accuracy and specificity, which are commonly used in machine learning, and overall accuracy should not be confused with the class mean value of accuracy. However, the following metrics seem reliable: overall accuracy, sensitivity (= producer accuracy), precision (= user accuracy), and F1-score. The conclusion in the last paragraph, according to the authors, fills a gap in the use of the random forest algorithm for the classification of crops that are characterized by high variability of the spectral response within individual crops.
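The recommended per-class metrics can be derived from a full confusion matrix as in the following minimal sketch. It assumes the usual one-vs-rest definitions of TP, FN, FP and TN; the function name and the three-class matrix are hypothetical and are not taken from our scripts or results. Specificity is included only to show how the true negatives inflate it.

```python
def per_class_metrics(cm):
    """Per-class metrics from a confusion matrix (rows = reference, columns = predicted)."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                               # omission errors for class k
        fp = sum(cm[i][k] for i in range(n_classes)) - tp  # commission errors for class k
        tn = total - tp - fn - fp                          # samples of all other classes
        results.append({
            # Sensitivity = producer accuracy: share of reference samples found.
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            # Precision = user accuracy: share of predictions that are correct.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            # F1-score: harmonic mean of sensitivity and precision.
            "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
            # Specificity depends on TN, which grows with the number of classes.
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        })
    return results


# Hypothetical three-class confusion matrix.
cm = [
    [40, 5, 5],
    [10, 30, 10],
    [5, 5, 40],
]

metrics = per_class_metrics(cm)
for k, m in enumerate(metrics):
    print(k, {name: round(value, 3) for name, value in m.items()})
```

For the second class of this made-up matrix, sensitivity is 0.60 and precision is 0.75, while specificity reaches 0.90 because it is dominated by the true negatives contributed by the other classes.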

Conflicts of Interest:
The authors declare no conflict of interest.