Crop identification using deep learning on LUCAS crop cover photos

Crop classification via deep learning on ground imagery can deliver timely and accurate crop-specific information to various stakeholders. Dedicated ground-based image acquisition exercises can help to collect data in data-scarce regions, improve control over the timing of collection, and cover study areas that are too small to monitor via satellite. Automatic labelling is essential when collecting large volumes of data. One such data collection is the EU's Land Use/Cover Area frame Survey (LUCAS), and in particular the recently published LUCAS Cover photos database. The aims of this paper are to select and publish a subset of LUCAS Cover photos for 12 mature major crops across the EU; to deploy, benchmark, and identify the best configuration of MobileNet for the classification task; to showcase the possibility of using entropy-based metrics for post-processing of results; and finally to show the applications and limitations of the model in a practical and policy-relevant context. In particular, the usefulness of automatically identifying crops on geo-tagged photos is illustrated in the context of the EU's Common Agricultural Policy. The work has produced a dataset of 169,460 images of mature crops for the 12 classes, of which 15,876 were manually selected as representing a clean sample without any foreign objects or unfavorable conditions. The best performing model achieved a Macro F1 (M-F1) of 0.75 on an imbalanced test dataset of 8,642 photos. Using a metric from information theory, namely the Equivalent Reference Probability, increased this score by 6%. The most unfavorable conditions for taking such images, across all crop classes, were found to be too early or too late in the season. The proposed methodology shows the possibility of using minimal auxiliary data, outside the images themselves, to achieve an M-F1 of 0.817 for labelling among 12 major European crops.


Introduction
The Deep Learning (DL) paradigm is regarded as the gold standard of the Machine Learning (ML) community (Alzubaidi et al. [4]). While there is understandably a trade-off between model performance and the amount of data and resources required, DL methods deliver a clear and significant improvement, especially in the field of image classification. Recent advancements in Convolutional Neural Networks (CNNs) have made popular classification tasks ever more affordable in an operational context. Related to this is the ability to perform on-device processing, which keeps computational overhead low and makes the technology accessible to anyone wanting to implement it. A leading architecture in this regard is MobileNet (Howard et al. [19]), now in its third generation (Howard et al. [18]), which is both significantly smaller than and equal in performance to the previous generations (Sandler et al. [30]). MobileNets are convenient, as they perform on par with other state-of-the-art architectures such as Inception V3 on popular benchmarking datasets, but have up to 20 million fewer parameters.
Another important point in making DL models operational is the proper and appropriate use of post-processing techniques. This generally covers all manipulations applied to the data output by the model. In an image classification setting this means everything done to the output of the softmax activation function, which is a probability vector with a value for each class that sums to one (Salvi et al. [29]). Popular post-processing approaches include appending Random Forest or Support Vector Machine classifiers to the CNN output, majority voting (d'Andrimont et al. [14]), patch aggregation (Matvienko et al. [25]), and thresholding. Thresholding is one of the most popular choices, as it is simple to implement: keep only the examples for which the network outputs a Maximum Probability (MP) for the winning class above the threshold. An interesting development in the field is the re-mapping of the base probabilistic output to a value of higher versatility, using notions from information theory such as Shannon information and entropy (Bogaert et al. [7]).
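The MP thresholding described above can be sketched in a few lines of numpy (an illustrative sketch; the function name and threshold value are ours, not from the study):

```python
import numpy as np

def mp_threshold_filter(probs, threshold=0.5):
    """Keep only predictions whose Maximum Probability (MP) for the
    winning class meets the threshold; return kept indices and labels."""
    probs = np.asarray(probs)
    mp = probs.max(axis=1)          # winning-class probability per example
    keep = mp >= threshold
    labels = probs.argmax(axis=1)   # predicted class per example
    return np.where(keep)[0], labels[keep]

# Softmax vectors for three examples over four classes (rows sum to one).
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.30, 0.30, 0.20, 0.20],
              [0.05, 0.05, 0.10, 0.80]])
idx, lab = mp_threshold_filter(p, threshold=0.5)
# idx -> [0, 2]; lab -> [0, 3]: the ambiguous second example is discarded.
```

The analyst's only lever here is the single scalar `threshold`, which is exactly the one-dimensionality that the entropy-based re-mapping discussed later aims to overcome.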
In the agricultural sector, these technological developments are reflected in projected increases of the use of Artificial Intelligence (AI) throughout the food production chain (Columbus [9]). DL-aided Computer Vision (CV) in particular is crucial for automation and robotic tasks that rely on inspection, evaluation, and execution of management interventions (Tian et al. [37]). Ultimately, these technical innovations should contribute to decreasing costs, while increasing resource use efficiency and precision, of food production systems. An important element of the application of these technologies relates to the possibility of new ways of information exchange among the various actors in the food production chain. This may relate to certification of management practices (Santoso et al. [31]), traceability of products (Kollia et al. [22]), as well as communication towards consumers (Zhu et al. [44]), or indeed various activities in the realm of citizen science and food related topics (Schiller et al. [32]), including biodiversity (Affouard et al. [2]). In technical terms, the possibilities have already been successfully tested for weed management (Wu et al. [42]), crop disease recognition and management (Mohanty et al. [26]), and harvesting operations (Kapach et al. [20]).
Activities also focus on training data collection and curation for increasingly specific applications. Lu and Young [23] identified 34 public image datasets collected under field conditions of relevance to precision agriculture. Zheng et al. [43] presented a crop dataset for deep-learning-based classification and detection in precision agriculture, while Sudars et al. [36] presented a dataset of annotated food crops and weed images for robotic CV control. In the Earth Observation (EO) domain, datasets such as CropHarvest (Tseng et al. [38]), with more than 90,000 geographically diverse samples with labels worldwide, and the LUCAS 2018 Copernicus polygons (d'Andrimont et al. [11]), with almost 60,000 stratified samples in the EU, demonstrate the push from the community to have open and free data to facilitate ML-and-DL-driven research. In this manuscript, we focus on recognizing crops and rely on a selection of legacy close-up photos taken during the five triennial Land Use/Cover Area frame Survey (LUCAS) campaigns from 2006 to 2018 in the EU (d'Andrimont et al. [12]) for our training set.
A fitting application in this sense, and specifically so in the European context, is the ability of technology to deliver on the needs of regulating bodies that administer the technical regulations of the European Union (EU)'s Common Agricultural Policy (CAP). The CAP is the single largest item on the EU budget, amounting to a total of 58.38 billion euros in 2022, including funds allocated for rural development, market measures, and income support (Commission [10]). Thus, developing technology for automating the application process and the provision of evidence for practices required under subsidy schemes is in great demand by the paying agencies of the respective Member States (MS).
While Copernicus Sentinel-based monitoring of the CAP area subsidies is being developed and implemented (Devos et al. [13]), ground-based information in the form of geo-tagged pictures (Sima et al. [33]) can support and complement the Checks by Monitoring (CbM) approach. CbM relies on Copernicus Sentinel data streams providing wall-to-wall coverage of EU territory and cloud-based processing on the Data and Information Access Services (DIAS) platforms. By using Copernicus Application Ready Data (CARD), in conjunction with geospatial information from the Land Parcel Identification Systems (LPIS) and Geo-Spatial Aid Applications (GSAA), it is possible to extract parcel-level information on markers (see Devos et al. [13]). These markers evidence specific practices (e.g. mowing, irrigation, etc.) that can be related to compliance requirements. Nevertheless, in situations where the Sentinel-based checks do not lead to conclusive results, geo-tagged pictures can be used to support and complement checks. Such processing chains may have to be developed for each specific agri-environmental practice for which evidence is needed. In the current CAP programming period (2023-2027), this includes practices under GAEC (Good Agricultural and Environmental Conditions) conditionality, as well as eco-schemes and agri-environmental and climate measures.

Objectives
The aim of the research is to benchmark and test computer vision models to recognize Major and Mature European Crops (MMEC) on close-up photos in a practical, agricultural-policy-relevant context. Specific objectives are:
• To select and publish a subset of LUCAS cover photos representative of major and mature crops across the EU for training purposes.
• To deploy and benchmark a set of MobileNet computer vision models to recognize crops on close-up pictures and identify the best performing model.
• To explore the use of probability and entropy-based metrics to threshold and filter correct and incorrect classifications.
• To illustrate the applications and limitations of the model for inference in a practical and agricultural policy relevant context.

Materials and Methods
The methodological approach of the manuscript consists of 1) selecting close-up LUCAS cover MMEC photos; 2) training, validating, and testing a large set of MobileNet-based computer vision models; 3) applying the best model to inference photos across the EU; 4) evaluating model performance using metrics derived from information theory to filter and understand why photos are not classified well; 5) testing model performance against images exhibiting a series of unfavorable/out-of-scope conditions; and 6) illustrating practical implications for protocol development. The workflow is presented in Figure 1.

Data

LUCAS cover photos
LUCAS Cover has been a part of the core LUCAS survey since its inception, and accordingly data has been collected for all five campaigns from 2006 to 2018. A total of 875,661 LUCAS Cover photos have been collected, and 874,646 of those were published after anonymization and curation [12]. In contrast to other LUCAS core imagery (the four photos in the cardinal directions N, S, E, W, and the point photo P), the Cover (C) photos, by protocol, must show the cover on the ground at the GPS location where the survey is carried out, in such a way that the relevant crop or plant can easily be identified during data quality controls. An example of one photo per selected crop is shown in Figure 2. The selection was done by reference to the main crops that are monitored and forecast by the European Commission's Joint Research Centre crop forecasting activities (AGRI4CAST, formerly MARS, see van der Velde et al. [40]). Omitting some classes due to data insufficiency, and including Temporary grassland, the number of crops arrives at 12.

Figure 1: Conceptual diagram of the study. The used data is shown on the left. LUCAS attributes are fused with harmonized crop calendars for the selected crops, after which the combined dataset undergoes a process of manual annotation using the pyGeon library. After annotating enough images of sufficiently high quality, a stratified sample across EU countries is drawn to select the training and inference sets, followed by the DL workflow (described further in Figure 4). The DL workflow produces a best parameterized model, which in turn is used for inference over a large imbalanced set, where post-processing and further operational-context work takes place.

Crop calendars and harmonization
One of the objectives of the study is the identification of mature crops on geo-tagged LUCAS imagery, the rationale being that, from an operational standpoint, the mature stage of the crop is the one in which it is most recognizable. The mature stages of the selected crops must first be ascertained. One way of doing so is by collecting crop calendars from the variety of available sources, harmonizing them into a common format, extracting the harvest period for each crop, and finally, through the use of expert knowledge, deriving the pre-harvest mature stage of the crop.
A Crop Calendar (CC) is a schedule that provides timely information about crops in their respective agro-ecological zone. CCs are usually provided in tabular or gridded form and cover the space of a calendar year by dividing it into the planting, vegetative, and harvest stages of the respective crop. For the present purposes, CCs were gathered from various sources (Table 1) and harmonized to a common style (AGRI4CAST), as it already hosts the data in tabular and numeric format, facilitating further processing. It must be noted that certain steps had to be taken to account for instances where more than one variety (spring/winter, or early/late ware varieties) of the same crop is cultivated in a country. The decision was made to exclude countries that cultivate both varieties, and to use the CC information only for those countries that cultivate the winter or early ware variety, with the information for the excluded countries being populated by expert knowledge. After harmonizing the CCs and extracting harvest stages at national level, the study fills the gaps and validates the result by means of expert knowledge. One way of identifying gaps is using the information from the JRC MARS bulletins (agr [1]). These bulletins offer information on crop growth conditions and yield forecasts at EU level and for neighbouring countries and regions such as the UK, Ukraine, the Black Sea area, and the Maghreb. The rationale is that if the bulletin contains information about the yield of a certain crop in a certain country, then the crop is evidently cultivated there, and CC information about it should therefore be present. The identified gaps were then filled with all available information, comprising interpolations from the COST 725 phenology network (Koch et al. [21]) and expert knowledge.
A breakdown of all the information gathered and its sources is given in Table 1. The pre-harvest mature stage was then derived from the harvest stage, again with the use of expert knowledge, in accordance with the following rules: for cereals, rapeseed, sunflower, and soya, remove the last half month of the harvest stage and add 2 months at its beginning; for potatoes, sugar beet, maize, and rice, remove the last half month and add 3 months.
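These rules can be sketched as follows, assuming harvest stages are encoded as (start, end) pairs in decimal months where a half month is 0.5 (the encoding, crop-group keys, and function name are our illustrative assumptions; year wrap-around is not handled):

```python
# Months to prepend to the harvest stage, per crop group, following the
# rules stated in the text.
EXTEND_BACK = {
    "cereals": 2, "rapeseed": 2, "sunflower": 2, "soya": 2,
    "potatoes": 3, "sugar beet": 3, "maize": 3, "rice": 3,
}

def mature_window(crop_group, harvest_start, harvest_end):
    """Remove the last half month of the harvest stage and extend the
    window backwards by the group-specific number of months."""
    months_back = EXTEND_BACK[crop_group]
    return harvest_start - months_back, harvest_end - 0.5

# A cereal harvest stage of July to mid-September yields a mature
# window of May to the start of September.
start, end = mature_window("cereals", 7.0, 9.5)
# start -> 5.0, end -> 9.0
```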

Manual photo pre-processing by visual assessment
The dataset was then visually assessed with the use of the PyGeon jupyter library to remove examples not suited for the study. The photos were selected on the basis of what one could expect in farmer photos: artificial backgrounds (map, hand, leg, pivot) and low-quality photos (e.g. against the sun, shadowed, etc.) were not allowed. Close-ups showing individual leaves, ears, or grains were also removed, as were overview photos where the crop appears somewhere in the background, usually mixed with other elements (a road, a neighbouring field, etc.). Photos with seeds only on a bare-soil background (mostly in cereals, soy, and maize) and other obviously wrong photos were also eliminated, although this happened only a few times.
At this stage, photos flagged as not suitable to train on were manually classified into one of six categories of unfavorable conditions: out of season (too early or too late in the season), out of protocol (too close or too far away to image the plant matter adequately), blurred, or containing a foreign object in the photo. A total of 354 images were selected in this way, while making sure that there is at least one photo per year, per LUCAS land cover class, per unfavorable condition. An example of unfavorable conditions for Common wheat (B11) is shown in Figure 3.

Method
The study makes use of a CNN for an image-classification CV exercise with a balanced training and inference set. There are two rounds of training and parallelized inferencing that make up the hyper-parameterization workflow (Figure 4): one without and one with data augmentations (flip, brightness, etc.). After the final augmented inference, the best model is identified and fed with a much larger imbalanced inference set, meant to represent a quasi-operational scenario. Specific and innovative post-processing techniques are also explored.

Training and inference set(s) sample selection
The number of photos per class for training was set to 400, following the current state of the art [34]. To select the set, a stratified sample across NUTS0 regions in the EU is drawn from the MMEC dataset (from Section 3.1). This is done with the idea of having equal representation across EU countries, which allows for articulating conclusions on the European scale.
In order to shorten processing time, instead of using the entire leftover (post-training-set-selection) set of images for inferencing, the study makes use of a custom inference set sampled out of the leftover set. A total of 85 images per class (the total number of examples of the least represented class, B12) were selected, with a geographic distribution matching that of the training set. This "balanced" inference set was used during the first and second stages of inferencing (Figure 4).
The last set of images to be discussed is the "imbalanced" inference set, which includes all the photos left after the training set selection, with all classes capped at 1,000 examples per class. This set includes the previous "balanced" set, and it is the one used with the identified best model in order to judge the possibility of using the model as an operational tool. It is also the set on which any further developments are tested.
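The per-class, country-stratified selection described above can be sketched in pure Python (an illustrative sketch with a round-robin draw across countries; the actual sampling code in the repository may differ):

```python
import random
from collections import defaultdict

def stratified_sample(records, n_per_class, seed=42):
    """Draw up to n_per_class photos per crop class, spread across
    countries by round-robin so each NUTS0 region is represented.
    `records` is a list of (photo_id, crop_class, country) tuples."""
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for photo_id, crop, country in records:
        by_class[crop][country].append(photo_id)
    sample = {}
    for crop, countries in by_class.items():
        pools = []
        for ids in countries.values():
            ids = ids[:]
            rng.shuffle(ids)
            pools.append(ids)
        picked = []
        # Round-robin across country pools until the quota is met
        # or every pool is exhausted.
        while len(picked) < n_per_class and any(pools):
            for pool in pools:
                if pool and len(picked) < n_per_class:
                    picked.append(pool.pop())
        sample[crop] = picked
    return sample
```

The same routine covers both the 400-per-class training draw and the 85-per-class balanced inference draw; capping at 1,000 for the imbalanced set is the same call with a larger quota.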

Hyper-parameter search and best model selection
The network in use is MobileNet V2. The images vary in their native resolution (see Supplementary Material Table A.5), but every image in the training and inference sets is re-scaled to the network input size of 224x224. The effects of this re-scaling are discussed in Section 4.4. The MobileNet V2 models are trained for 3,000 epochs, with the following settable parameters: learning rate, momentum, optimizer, and batch size. Values for these parameters were drawn from a random search space [6] to initialize the learning process. In this way 157 model configurations were tried in order to find the best approximation for solving the problem. Model performance was then tested by carrying out an independent inference exercise on the dedicated balanced set. The models were then ranked by their Overall Accuracy (OA) to find the top five performers. This completes the first round of training.
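The random search over the four settable parameters can be sketched as follows (the parameter ranges and choices below are hypothetical assumptions for illustration, not the values used in the study):

```python
import random

# Hypothetical search space for the settable parameters named in the text.
SPACE = {
    "learning_rate": lambda r: 10 ** r.uniform(-4, -1),  # log-uniform
    "momentum":      lambda r: r.uniform(0.5, 0.99),
    "optimizer":     lambda r: r.choice(["sgd", "adam", "rmsprop"]),
    "batch_size":    lambda r: r.choice([16, 32, 64, 128]),
}

def sample_configs(n, seed=0):
    """Draw n random configurations for the first training round."""
    rng = random.Random(seed)
    return [{k: f(rng) for k, f in SPACE.items()} for _ in range(n)]

configs = sample_configs(157)   # 157 configurations, as in the study
# Each configuration is then trained and ranked by Overall Accuracy on
# the balanced inference set; the top five proceed to the second round.
```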
For the second round, the top five performers are run through another cycle of training with the same configuration, but with image augmentations added, in this case random brightness changes and horizontal flips. The same inferencing on the balanced set is done to rank the augmented models by OA. The best performing of these is then taken as the overall best model.
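The two augmentations can be sketched in numpy (the brightness range and flip probability are illustrative assumptions; the study's own pipeline ran inside TensorFlow):

```python
import random
import numpy as np

def augment(image, rng):
    """Random horizontal flip and random brightness shift, the two
    augmentations used in the second training round. `image` is an
    HxWx3 float array in [0, 1]."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    delta = rng.uniform(-0.2, 0.2)         # brightness shift
    return np.clip(image + delta, 0.0, 1.0)

rng = random.Random(0)
out = augment(np.full((224, 224, 3), 0.5), rng)   # net input size 224x224
```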

Operational use
After the best model is identified, it is used on the imbalanced inference set (see Section 2.2.1). Because of the class imbalance, it was necessary to use a different metric: Macro-F1 (M-F1) [27]. The results from this inference run are presented in Section 3, and it is on these results that the effectiveness of innovative post-processing techniques is tested and upon which all the discussion is carried out.
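Macro-F1 averages the per-class F1 scores with equal weight, so minority classes count as much as majority ones. A self-contained illustration (our own minimal implementation, shown here only to make the metric concrete):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-F1: per-class F1 scores averaged with equal weight."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# With an imbalanced truth vector, always predicting the majority class
# scores a high Overall Accuracy but a low Macro-F1.
y_true = ["B11"] * 9 + ["B12"]
y_pred = ["B11"] * 10
score = macro_f1(y_true, y_pred, ["B11", "B12"])   # ≈ 0.47, while OA = 0.9
```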

Computational Infrastructure
All the code developed for this study is openly available in the following repository: https://github.com/Momut1/lucasVision. The working environment was containerized in a Docker image. The processing pipeline is fully reproducible and automated, driven by shell scripts that respectively carry out the hyper-parameter tuning, inferencing, results derivation, and post-processing and plotting. For more information, consult the README of the git repository. The processing was done on the JRC BDAP, an in-house, cloud-based, versatile, petabyte-scale platform for heavy-duty processing [35]. The offered GPU services run on an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory, CUDA version 10.1, and CUDA driver version 418.67. Pre-processing, launching, and post-processing are done in the JEO-lab layer of the platform in a jupyter notebook docker container, running TensorFlow 1.3.0.

Equivalent Reference Probability filter
Post-processing the results of ML/DL exercises is an established practice in practically all such workflows ([17], [8], [5]). It usually consists of the selective removal, based on some criterion, of a substantial enough number of the incorrectly classified examples to increase model performance, while not falling into the trap of "cherry-picking" one's results.
In classification problems, analysts can employ a filter on probability: keeping only examples for which the network has output an MP for the winning class above a threshold. The analyst then decides where to put the threshold in order to control the rigorousness of the filter: higher for a more stringent classification, lower for a more lenient one. The first problem with this is that it depends heavily on the user's decision and is thus, to a degree, arbitrary. The second problem is that the filter is one-dimensional: one can only set a threshold along a single axis. Introducing other, or indeed multiple, dimensions to this process would allow for different spreads of the data in the given space. The intuition is that, given well-chosen dimensions, the data would neatly split between correct and incorrect classifications and allow for more precise filtering. The desired outcome of such filtering is to remove the largest number of incorrectly classified examples without removing too many correctly classified ones.
The proposed method works with a metric based on information theory: the Equivalent Reference Probability (ERP), as described in [7]. In information theory, information is a measure of the surprise of an event: rare, low-probability events are surprising and hence carry more information, and vice versa (Equation 1). Entropy is the expected information over the probability distribution of the events of a given variable (Equation 2). A low entropy means there is a more pronounced difference between the MP for a given class and the probabilities for the remaining classes.

h(x) = −log₂(p(x))   (1)

H(X) = −Σ_x p(x) log₂(p(x))   (2)

In [7] the authors make use of these notions to re-map the softmax probability vector into the ERP, a probability-like value that takes the full distribution into account rather than only the winning class. The appropriate thresholds for ERP and probability are ascertained with a custom function that iteratively moves the threshold down the line. At each step it counts the number of disqualified incorrect images, while trying to keep the number of discarded correctly classified ones below a certain percentage. The settable parameter of the function is thus the percentage of correctly classified examples the analyst is willing to discard. After ascertaining the thresholds, the space within the scatter plot is divided into four quadrants. Through the iterative exclusion of individual quadrants or combinations thereof, the analyst can perform a more precise filtering of the results.
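The entropy computation and the iterative threshold search can be sketched as follows (the ERP computation itself follows [7] and is not reproduced here; the search function is our illustrative reconstruction of the procedure described in the text, with an assumed step size):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (Equation 2)."""
    p = np.asarray(p)
    p = p[p > 0]                       # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def find_threshold(values, correct, max_correct_lost=0.01, step=0.01):
    """Lower the threshold from the top of the [0, 1] range and return
    the highest (most conservative) value that discards at most
    `max_correct_lost` of the correctly classified examples.
    `values` holds the MP or ERP per example; `correct` flags whether
    each example was classified correctly."""
    values = np.asarray(values)
    correct = np.asarray(correct, bool)
    n_correct = correct.sum()
    for t in np.arange(1.0, 0.0, -step):
        lost = np.sum(correct & (values < t))
        if lost <= max_correct_lost * n_correct:
            return round(float(t), 2)
    return 0.0
```

Running the function once on the MP axis and once on the ERP axis yields the two dotted-line thresholds that divide the scatter plot into the four quadrants discussed in the Results.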

Results
Results are divided into five sections. First, we present the MMEC dataset; second, the best performing model; third, the confusion matrix and M-F1 score for the best performing model, alongside the Producer's (PA) and User's (UA) Accuracy; fourth, the improvement generated by employing an ERP filter; and lastly, the performance of the model when faced with images from unfavorable conditions for each class, simulating operational use of the model.

Mature Major European Crops
The processing chain from Sections 2.1.2 and 2.1.3 produces a dataset of 169,460 LUCAS photos of mature crops across 25 EU Member States. Utilizing the manual labelling described in Section 2.1.4, the study also publishes 15,876 high-quality, ready-to-train-on photos, each of which has been manually checked and verified to exhibit a clear view of the crop in its mature, pre-harvest stage with no visual obstructions or foreign objects in the frame. Each class has more than 400 photos, allowing for considerable leeway in training set selection. A breakdown per country is given in Table 2, with a geographical visualization of the same in Figure 5.

Confusion Matrix
The confusion matrix for the best model run (78) over the imbalanced operational inference set is presented in Figure 6. It is clear that the majority of confusion happens between the cereal classes (B11-B15) and with Grassland (B55). In fact, the difference between the average PA of all crops, excluding Grassland, and the average PA of the cereal classes is 27.9, and for UA the difference is 30.9. The class most commonly misclassified as a false positive is Durum wheat (B12), with a UA of 10.8; the low score arguably has much to do with the unequal representation of the class. The best performing class is Maize (B16), with a PA of 95.5 and a UA of 95, followed closely by Rape and turnip rape (B32), showing the clear separation of both from the other classes.

Equivalent reference probability filter
The application of the quadrant filtering method using ERP and MP is shown in Figure 7. The dotted lines represent the thresholds identified by the functions described in Section 2.2.5. The settable parameter is fixed at losing no more than one percent of the correctly classified images, meaning the identified thresholds are the most conservative ones: 0.46 for MP and 0.2 for ERP. The inscribed table shows the number of true and false classifications in each quadrant, as labelled by their respective quadrant ID. Although similar, there is a notable difference between the distributions of the true and false classifications, visible in the smooth fitted lines for each group.
The results achieved by employing such filtering are presented in Figure 7, in the table in the uppermost right corner. There is an M-F1 increase of 0.6 over not using any filter and of 0.2 over using only the MP filter.

Unfavorable conditions
Best model 78 was applied over a stratified sample of 1 photo per year, per LUCAS LC1 class, and per unfavorable condition, totalling an inference set of 354, i.e. 59 photos per unfavorable condition (see the examples in Figure 3). A boxplot of the Top1 probability for each unfavorable condition is presented in Figure 8. The conditions are compared firstly to a reference set of quality images, randomly sampled to have the same distribution as the sets of the conditions, and secondly to the entire imbalanced inference set. Model 78 is most confused by photos with foreign objects, landscape photos, and photos showing the crop after its harvest period, with blurry, early, and especially close-up photos performing significantly closer to the reference in terms of Top1 probability.
The actual classification results are presented in Table 4. The worst results are achieved with photos exhibiting post-harvest conditions, with an OA of 20%, followed by early photos and examples with a foreign object in the frame, at 31% and 37% respectively. The unfavorable conditions that impact performance the least are blurry and overly close-up photos (54%). This illustrates that a clear protocol is needed when such automated procedures are used within operational workflows, such as for the CAP [33]. In addition, models can progressively be trained with a set of photos covering a wider range of conditions to improve their generalization capacity.

Context
Recently, several relevant studies have been published. Zheng et al. [43] present the CropDeep dataset, on which they test state-of-the-art classification and detection DL algorithms, achieving an averaged accuracy of 99.81%. These results are impressive, although not directly comparable, as the images were collected by robots in a sterile greenhouse environment, allowing image conditions to be identical between acquisitions. They furthermore used average accuracy as a metric over an imbalanced inference set, which is not in accordance with the literature [24]. Gao et al. [16] achieved an accuracy of 99.51% in differentiating 30 wheat cultivars at the flowering (most mature) stage. This is very impressive, considering the present study suffered the most error when trying to discriminate between the various cereal classes. The difference is again in the lab quality of the images taken, whereby each image exhibits a single plant on a white background. d'Andrimont et al. [14] achieved an M-F1 score of 62.3% for 10 classes using street-level images. The current study outperformed the cited work by 13.4%, though this can be attributed to the lower presence of noise in the images fed to the model.
This study presents the first use of the LUCAS Cover dataset for automatic crop identification. Indeed, it is the first study to apply DL for crop identification on still images that are not taken in a controlled environment and that come from a wide variety of sensors, which truly mimics an operational scenario. Secondly, the study produces an automated way to attach crop life-cycle stage information to a database of photos. Third, the introduction of quadrant filtering is a step towards a new state of the art for more precise post-processing filtering. Whether using crop calendars to extract photos for specific crop life-cycle stages, or using the dataset as a whole, the authors believe that various lines of research may be developed using the LUCAS Cover photos.

Table caption: In order, the columns represent the quadrant method (QM) ID, the quadrants included in the method, the number of images, and the M-F1 achieved through the inclusion of the respective Qs. In order, the QMs represent: 1. MP only; 2. ERP only; 3. both above their respective thresholds; 4. at least one above its threshold.

ERP filtering
A main achievement of the study is the exploration of methods for filtering classification results to achieve better performance and to quantify uncertainty. The study made use of ERP as a metric for assessing this uncertainty. According to the literature, ERP has been shown to be more robust than MP in classifying pixel-level thematic uncertainty [7]; more precise than majority voting in post-processing speckle removal of classified maps [41]; and more flexible than OA in terms of independence of the distribution of the validation data [28].
In practice, MP and ERP are connected, as is clearly visible in the distribution of both groups (correct and incorrect) in the region of Figure 7 where the joint reference probability distribution is not null. From the marginal distribution plots we can see that this connection is inverted: there is a high peak at low values of MP for the incorrectly classified points and a high peak at high values of ERP for the correctly classified ones. Furthermore, as shown in Figure 9, ERP performs significantly better than MP in post-processing filtering. Because ERP and MP are both probabilities in the range between zero and one, their direct comparison in this regard is straightforward. Subplot A shows that, for an equal threshold value, the M-F1 value is always higher when utilising ERP rather than MP. This means that ERP is a much better estimator of uncertainty and captures, to a finer degree, the nuances that distinguish an incorrect from a correct classification. It should be mentioned that this is partly due to the fact that, while for MP the smallest possible threshold value is relatively high (0.20), for ERP it is found at the first stage of filtering (0.01). As seen on the secondary Y axis, which shows the number of images left in the set after applying the filter, this process is not without cost: the number is, for every threshold value, lower for ERP than for MP. Nevertheless, it is always preferable to have a bigger spread of the data over which to set thresholds, especially when the analysis needs to be conservative regarding the number of correct classifications it is willing to lose.
Furthermore, the histograms in subplots B and C show the point at which the proportion of correct and incorrect classifications changes with the threshold value.
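The threshold sweep behind subplot A can be emulated on hypothetical data (the confidence values below are invented for illustration): for each candidate threshold we keep only the predictions whose confidence meets it, and track both the quality among the kept predictions (plain accuracy here, as a lightweight proxy for M-F1) and the number of images surviving the filter:

```python
def threshold_sweep(scores, correct, thresholds):
    """For each threshold, keep predictions with confidence >= threshold,
    then report (threshold, fraction correct among kept, number kept).
    The last value corresponds to the secondary Y axis in Figure 9A."""
    results = []
    for t in thresholds:
        kept = [c for s, c in zip(scores, correct) if s >= t]
        quality = sum(kept) / len(kept) if kept else float("nan")
        results.append((t, quality, len(kept)))
    return results

# Hypothetical confidences: incorrect classifications tend to score lower.
scores  = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.15]
correct = [1, 1, 1, 0, 1, 0, 0]
sweep = threshold_sweep(scores, correct, [0.0, 0.5, 0.85])
```

Raising the threshold improves the quality of what remains while shrinking the retained set, which is exactly the trade-off the figure visualises.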

Limitations
Although several novel aspects have been highlighted, some limitations are present in our study. Firstly, there are issues with the pre-processing of the data, in particular the fact that the CC information comes from a variety of sources. Although these are official CCs that have been harmonized, treating them as semantically harmonized a priori could be problematic. Because organizations, depending on their goals, have different data collection, processing, and publishing protocols, it is conceivable that the data was intended for a different use. The issue becomes even more apparent when considering the gap filling based on expert knowledge and model output. Indeed, the concern that the latter might introduce error into the results was such that the study calculated the M-F1 for each country (NUTS0 region) for which the crop calendar information was derived from expert knowledge or model output, and compared it to the reference M-F1. No clear drop in M-F1 based on the origin of the mature-crop information was registered in this analysis.
Another data issue is that bias can be introduced during manual selection by visual assessment. Beyond errors due to distraction during annotation, the annotator inevitably decides which images to keep and discard at their own discretion. For example, the annotator had to consider questions such as: should any sky or an abundance of soil be visible in the image; is the crop in this image mature enough; and, especially for the cereal classes, is this the correct label. The matter is even more pronounced when selecting examples of unfavorable conditions, where, for example, the distinction between "Blurry" and "Close" was sometimes hard to make, the objects in the "Object" class and the visual appearance of the landscape in the "Landscape" class were highly varied, and an image sometimes showed more than a single unfavorable condition (the crop can be both early in the season and blurred out, in which case multi-tags could have been used). Such issues were considered prior to undertaking each task, yet the possibility of bias has to be mentioned.
Secondly, there are issues related to the processing logic of certain steps. One such issue is the identification of threshold points for MP and ERP to generate quadrants. The custom function works by peeking into the correct/incorrect classification results in order to iteratively arrive at the threshold, the main consideration being to keep the number of disqualified correct classifications below a certain percentage. In a sense this means putting the proverbial cart before the horse, as instead of simply using the values of whichever metric is chosen, the function also considers the result of the classification.
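A minimal sketch of such a threshold-finding function makes the caveat concrete (the function name and data below are hypothetical, not the study's actual code): the search consults the correct/incorrect outcomes at every step:

```python
def find_threshold(scores, correct, max_lost_correct=0.05, steps=100):
    """Raise the threshold step by step and stop just before the share of
    *correct* classifications that would be disqualified exceeds
    max_lost_correct. Note the caveat from the text: the function peeks
    at the correct/incorrect outcomes while choosing the cut-off."""
    n_correct = sum(correct)
    best = 0.0
    for i in range(steps + 1):
        t = i / steps
        lost = sum(1 for s, c in zip(scores, correct) if c and s < t)
        if lost / n_correct > max_lost_correct:
            break
        best = t
    return best

# Hypothetical confidences for three correct and one incorrect prediction:
scores, correct = [0.9, 0.8, 0.7, 0.3], [1, 1, 1, 0]
strict = find_threshold(scores, correct, max_lost_correct=0.0)    # lose none
lenient = find_threshold(scores, correct, max_lost_correct=0.34)  # lose <= 1/3
```

Loosening the tolerance lets the threshold climb higher, trading a few correct classifications for a cleaner retained set.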

Recommendations
There are several recommendations that would be a logical continuation of this work. In terms of class selection, the major part of the confusion stems from the cereal classes (Section 3). This makes sense, as distinguishing between them can sometimes be troublesome even for a skilled professional. As a grouped cereal class they are easily set apart from the rest of the crops, but among themselves the structure of the fruit, stem, and leaf organs can look too similar. Indeed, the approach in Gao et al. [16] yields such good results exactly because the model is designed to pick up on the subtle differences between the varieties. In the present case, grouping the cereals together would produce a M-F1 of 88.2 without and 90.4 with quadrant filtering, which is 12.5 and 14.7 points higher than the achieved result. Ideally, one would capture the cereal class first at these higher ranges of M-F1 and then have a separate model that deals solely with classifying the type of cereal, variety, or cultivar.
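The effect of merging the cereal classes can be reproduced on toy labels (the class names and predictions below are invented for illustration, not the study's data):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Invented labels: cereals confused among themselves, maize always right.
CEREALS = {"wheat", "barley", "rye"}
group = lambda c: "cereal" if c in CEREALS else c
y_true = ["wheat", "barley", "rye", "maize", "wheat", "rye"]
y_pred = ["barley", "wheat", "rye", "maize", "rye", "wheat"]

before = macro_f1(y_true, y_pred)  # cereals penalise each other: 0.375
after = macro_f1([group(c) for c in y_true], [group(c) for c in y_pred])
assert after > before  # merging the confusable classes lifts M-F1
```

Because all the confusion in this toy example happens between cereals, grouping them removes it entirely; on the real data the lift is smaller but follows the same mechanism.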
Concerning the point of being more robust in identifying thresholds, or more generally the topic of splitting the space in Figure 7, one could build a kind of Bayesian discriminant rule to generalize the combination of the two 1-D thresholds into a 2-D threshold. This can be done by taking the joint distributions into consideration and would yield a single curve separating correctly and incorrectly classified examples.
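One simple instance of such a rule, assuming Gaussian class-conditional densities with diagonal covariance (a strong simplification; the (MP, ERP) points below are fabricated for illustration), fits one density per group and keeps a prediction when the posterior of the "correct" group dominates:

```python
import math

def fit_diag_gaussian(points):
    """Mean and diagonal variance of 2-D points (small floor for stability)."""
    n = len(points)
    mean = tuple(sum(p[i] for p in points) / n for i in range(2))
    var = tuple(sum((p[i] - mean[i]) ** 2 for p in points) / n + 1e-6
                for i in range(2))
    return mean, var

def log_density(point, mean, var):
    """Log of a diagonal-covariance Gaussian density at `point`."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x, m, v in zip(point, mean, var))

def bayes_rule(correct_pts, incorrect_pts):
    """Fit one Gaussian per group in (MP, ERP) space and keep a point when
    the posterior of the 'correct' group dominates: a single curved 2-D
    boundary instead of two axis-aligned 1-D thresholds."""
    gc, gi = fit_diag_gaussian(correct_pts), fit_diag_gaussian(incorrect_pts)
    prior_c = len(correct_pts) / (len(correct_pts) + len(incorrect_pts))
    def keep(point):
        return (log_density(point, *gc) + math.log(prior_c) >
                log_density(point, *gi) + math.log(1.0 - prior_c))
    return keep

# Fabricated (MP, ERP) pairs: correct classifications cluster high-high.
keep = bayes_rule(
    correct_pts=[(0.90, 0.95), (0.85, 0.90), (0.95, 0.85)],
    incorrect_pts=[(0.30, 0.20), (0.25, 0.30), (0.40, 0.25)],
)
```

Unlike the quadrant approach, the resulting decision boundary is data-driven and need not be parallel to either axis.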
A perennial topic in DL for computer vision (CV) is the effect of resolution on results. In this case, one can discuss both the resolution of the source images and the input resolution of the network in use. Firstly, the resolutions of the images in the inference set vary between 480 and 3504 pixels in height and 640 and 4672 pixels in width, a 7.3-fold difference in each dimension. Almost 65% of the images have a resolution of 1600x1200, with another 22% being 2048x1536 (for a full breakdown of available image resolutions, see Supplementary Table A.5). With such a spread, one can imagine that the level of detail visible in images from either end of the range is quite different. When measuring the correlation between image resolution and the proportion of correctly classified examples for each resolution bin (Figure A.11), the study found an R-squared value of 0.009, meaning there is almost no correlation for this set of LUCAS photos. Secondly, the network input size is 224x224, meaning each rectangular image of the training and inference set gets re-scaled to this square size. Intuitively, one could say that larger images would lose more information during re-scaling than smaller ones. In reality, the re-scaling turns the problem into a detection of the major structural features of the crops (e.g. broad leaves vs. cereals, colouring, having recognisable flowers or not), where resolution does not matter as much. This also sheds light on why the network has trouble distinguishing between cereal classes. The analysis still serves to illustrate that the method handles images of different resolutions equally well. This further showcases the policy relevance of the work, as in an operational context a regulating body can expect to receive evidence images in a variety of resolutions.
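The re-scaling step can be illustrated with a nearest-neighbour resize (a simplification; production pipelines typically use bilinear or bicubic interpolation): the larger the source grid, the more pixels each output cell skips over, which is why fine texture fades while broad structure survives:

```python
def resize_nearest(img, out_h=224, out_w=224):
    """Nearest-neighbour rescale of a 2-D pixel grid to the network's
    fixed 224x224 input. A 1600x1200 source keeps roughly 1 pixel in 38;
    a 640x480 source roughly 1 in 6, yet both end up the same size."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

# Toy 12x16 'image' shrunk to 4x4: only every 3rd row / 4th column survives.
img = [[r * 16 + c for c in range(16)] for r in range(12)]
small = resize_nearest(img, out_h=4, out_w=4)
```

Whatever the source resolution, the network always sees the same 224x224 grid, so per-image differences in detail are largely normalised away before classification.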

Conclusion
This study provides a subset of LUCAS Cover photos for 12 major crops across the EU; deploys, benchmarks, and identifies the best configuration of MobileNet for the classification task; showcases the possibility of using entropy-based metrics for post-processing of results; and, finally, shows the applications and limitations of the model in a practical and policy-relevant context. The work has produced a dataset of 169,460 images of mature crops for the 12 classes, out of which 15,876 were manually selected as representing a clean sample without any foreign objects or unfavorable conditions. The best performing model for crop identification achieved a Macro F1 (M-F1) of 0.75 on an imbalanced test dataset of 8,642 photos. Using metrics from information theory resulted in an increase of 6%. The most unfavorable conditions for taking such images, across all crop classes, were found to be too early or too late in the season. The proposed methodology shows the possibility of using minimal auxiliary data, beyond the images themselves, to achieve a M-F1 of 0.817 for labelling among 12 major European crops.

Range of pixels        WxH included                    % of images
Less than 1 million    640x480, 1024x768, 800x600