Machine Learning in Agriculture: A Comprehensive Updated Review

The digital transformation of agriculture has evolved various aspects of management into artificial intelligent systems for the sake of making value from the ever-increasing data originated from numerous sources. A subset of artificial intelligence, namely machine learning, has a considerable potential to handle numerous challenges in the establishment of knowledge-based farming systems. The present study aims at shedding light on machine learning in agriculture by thoroughly reviewing the recent scholarly literature based on keywords’ combinations of “machine learning” along with “crop management”, “water management”, “soil management”, and “livestock management”, and in accordance with PRISMA guidelines. Only journal papers were considered eligible that were published within 2018–2020. The results indicated that this topic pertains to different disciplines that favour convergence research at the international level. Furthermore, crop management was observed to be at the centre of attention. A plethora of machine learning algorithms were used, with those belonging to Artificial Neural Networks being more efficient. In addition, maize and wheat as well as cattle and sheep were the most investigated crops and animals, respectively. Finally, a variety of sensors, attached on satellites and unmanned ground and aerial vehicles, have been utilized as a means of getting reliable input data for the data analyses. It is anticipated that this study will constitute a beneficial guide to all stakeholders towards enhancing awareness of the potential advantages of using machine learning in agriculture and contributing to a more systematic research on this topic.


General Context of Machine Learning in Agriculture
Modern agriculture has to cope with several challenges, including the increasing call for food, as a consequence of the global explosion of earth's population, climate changes [1], natural resources depletion [2], alteration of dietary choices [3], as well as safety and health concerns [4]. As a means of addressing the above issues, placing pressure on the agricultural sector, there exists an urgent necessity for optimizing the effectiveness of agricultural practices by, simultaneously, lessening the environmental burden. In particular, these two essentials have driven the transformation of agriculture into precision agriculture. This modernization of farming has a great potential to assure sustainability, maximal productivity, and a safe environment [5]. In general, smart farming is based on four key pillars in order to deal with the increasing needs; (a) optimal natural resources' management, (b) conservation of the ecosystem, (c) development of adequate services, and (d) utilization  The four generic categories in agriculture exploiting machine learning techniques, as presented in [12].
The generic categories dealing with the management of water and soil were found to be less investigated, corresponding cumulatively to 20% of the total number of papers (10% for each category).
Finally, two main sub-categories were identified for the livestock-related applications corresponding to a total 19% of journal papers: • Livestock production; • Animal welfare.

Open Problems Associated with Machine Learning in Agriculture
Due to the broad range of applications of ML in agriculture, several reviews have been published in this research field. The majority of these review studies have been dedicated to crop disease detection [13][14][15][16], weed detection [17,18], yield prediction [19,20], crop recognition [21,22], water management [23,24], animal welfare [25,26], and livestock production [27,28]. Furthermore, other studies were concerned with the implementation of ML methods regarding the main grain crops by investigating different aspects including quality and disease detection [29]. Finally, focus has been paid on big data analysis using ML, aiming at finding out real-life problems that originated from smart farming [30], or dealing with methods to analyze hyperspectral and multispectral data [31].
Although ML in agriculture has made considerable progress, several open problems remain, which have some common points of reference, despite the fact that the topic covers a variety of sub-fields. According to [23,24,28,32], the main problems are associated with the implementation of sensors on farms for numerous reasons, including high costs of ICT, traditional practices, and lack of information. In addition, the majority of the available datasets do not reflect realistic cases, since they are normally generated by a few people getting images or specimens in a short time period and from a limited area [15,[21][22][23]. Consequently, more practical datasets coming from fields are required [18,20]. Moreover, the need for more efficient ML algorithms and scalable computational architectures has been pointed out, which can lead to rapid information processing [18,22,23,31]. The challenging background, when it comes to obtaining images, video, or audio recordings, has also been mentioned owing to changes in lighting [16,29], blind spots of cameras, environmental noise, and simultaneous vocalizations [25]. Another important open problem is that the vast majority of farmers are non-experts in ML and, thus, they cannot fully comprehend the underlying patterns obtained by ML algorithms. For this reason, more user-friendly systems should be developed. In particular, simple systems, being easy to understand and operate, would be valuable, as for example a visualization tool with a userfriendly interface for the correct presentation and manipulation of data [25,30,31]. Taking into account that farmers are getting more and more familiar with smartphones, specific smartphone applications have been proposed as a possible solution to address the above challenge [15,16,21]. Last but not least, the development of efficient ML techniques by incorporating expert knowledge from different stakeholders should be fostered, particularly regarding computing science, agriculture, and the private sector, as a means of designing realistic solutions [19,22,24,33]. As stated in [12], currently, all of the efforts pertain to individual solutions, which are not always connected with the process of decision-making, as seen for example in other domains.

Aim of the Present Study
As pointed out above, because of the multiple applications of ML in agriculture, several review studies have been published recently. However, these studies usually concentrate purely on one sub-field of agricultural production. Motivated by the current tremendous progress in ML, the increasing interest worldwide, and its impact in various do-mains of agriculture, a systematic bibliographic survey is presented on the range of the categories proposed in [12], which were summarized in Figure 1. In particular, we focus on reviewing the relevant literature of the last three years (2018-2020) for the intention of providing an updated view of ML applications in agricultural systems. In fact, this work is an updated continuation of the work presented at [12]; following, consequently, exactly the same framework and inclusion criteria. As a consequence, the scholarly literature was screened in order to cover a broad spectrum of important features for capturing the current progress and trends, including the identification of: (a) the research areas which are interested mostly in ML in agriculture along with the geographical distribution of the contributing organizations, (b) the most efficient ML models, (c) the most investigated crops and animals, and (d) the most implemented features and technologies.
As will be discussed next, overall, a 745% increase in the number of journal papers took place in the last three years as compared to [12], thus justifying the need for a new updated review on the specific topic. Moreover, crop management remained as the most investigated topic, with a number of ML algorithms having been exploited as a means of tackling the heterogeneous data that originated from agricultural fields. As compared to [12], more crop and animal species have been investigated by using an extensive range of input parameters coming mainly from remote sensing, such as satellites and drones. In addition, people from different research fields have dealt with ML in agriculture, hence, contributing to the remarkable advancement in this field.

Outline of the Paper
The remainder of this paper is structured as follows. The second section briefly describes the fundamentals of ML along with the subject of the four generic categories for the sake of better comprehension of the scope of the present study. The implemented methodology, along with the inclusive criteria and the search engines, is analyzed in the third section. The main performance metrics, which were used in the selected articles, are also presented in this section. The main results are shown in the fourth section in the form of bar and pie charts, while in the fifth section, the main conclusions are drawn by also discussing the results from a broader perspective. Finally, all the selected journal papers are summarized in Tables A1-A9, in accordance with their field of application, and presented in the Appendix A, together with Tables A10 and A11 that contain commonly used abbreviations, with the intention of not disrupting the flow of the main text.

Fundamentals of Machine Learning: A Brief Overview
In general, the objective of ML algorithms is to optimize the performance of a task, via exploiting examples or past experience. In particular, ML can generate efficient relationships regarding data inputs and reconstruct a knowledge scheme. In this data-driven methodology, the more data are used, the better ML works. This is similar to how well a human being performs a particular task by gaining more experience [34]. The central outcome of ML is a measure of generalizability; the degree to which the ML algorithm has the ability to provide correct predictions, when new data are presented, on the basis of learned rules originated from preceding exposure to similar data [35]. More specifically, data involve a set of examples, which are described by a group of characteristics, usually called features. Broadly speaking, ML systems operate at two processes, namely the learning (used for training) and testing. In order to facilitate the former process, these features commonly form a feature vector that can be binary, numeric, ordinal, or nominal [36]. This vector is utilized as an input within the learning phase. In brief, by relying on training data, within the learning phase, the machine learns to perform the task from experience. Once the learning performance reaches a satisfactory point (expressed through mathematical and statistical relationships), it ends. Subsequently, the model that was developed through the training process can be used to classify, cluster, or predict.
An overview of a typical ML system is illustrated in Figure 2. With the intention of forming the derived complex raw data into a suitable state, a pre-processing effort is required. This usually includes: (a) data cleaning for removing inconsistent or missing items and noise, (b) data integration, when many data sources exist and (c) data transformation, such as normalization and discretization [37]. The extraction/selection feature aims at creating or/and identifying the most informative subset of features in which, subsequently, the learning model is going to be implemented throughout the training phase [38]. Regarding the feedback loop, which is depicted in Figure 2, it serves for adjustments pertaining to the feature extraction/selection unit as well as the pre-processing one that further improves the overall learning model's performance. During the phase of testing, previously unseen samples are imported to the trained model, which are usually represented as feature vectors. Finally, an appropriate decision is made by the model (for example, classification or regression) in reliance of the features existing in each sample. Deep learning, a subfield of ML, utilizes an alternative architecture via shifting the process of converting raw data to features (feature engineering) to the corresponding learning system. Consequently, the feature extraction/selection unit is absent, resulting in a fully trainable system; it starts from a raw input and ends with the desired output [39,40]. Based on the learning type, ML can be classified according to the relative literature [41,42] as:

•
Supervised learning: The input and output are known and the machine tries to find the optimal way to reach an output given an input; • Unsupervised learning: No labels are provided, leaving the learning algorithm itself to generate structure within its input; • Semi-supervised learning: Input data constitute a mixture of labeled and non-labeled data; • Reinforcement learning: Decisions are made towards finding out actions that can lead to the more positive outcome, while it is solely determined by trial and error method and delayed outcome.

Crop Management
The crop management category involves versatile aspects that originated from the combination of farming techniques in the direction of managing the biological, chemical and physical crop environment with the aim of reaching both quantitative and qualitative targets [52]. Using advanced approaches to manage crops, such as yield prediction, disease detection, weed detection, crop recognition, and crop quality, contributes to the increase of productivity and, consequently, the financial income. The above aspects constitute key goals of precision agriculture.

Yield Prediction
In general, yield prediction is one of the most important and challenging topics in modern agriculture. An accurate model can help, for instance, the farm owners to take informed management decisions on what to grow towards matching the crop to the existing market's demands [20]. However, this is not a trivial task; it consists of various steps. Yield prediction can be determined by several factors such as environment, management practices, crop genotypic and phenotypic characteristics, and their interactions. Hence, it necessitates a fundamental comprehension of the relationship between these interactive factors and yield. In turn, identifying such kinds of relationships mandates comprehensive datasets along with powerful algorithms such as ML techniques [53].

Disease Detection
Crop diseases constitute a major threat in agricultural production systems that deteriorate yield quality and quantity at production, storage, and transportation level. At farm level, reports on yield losses, due to plant diseases, are very common [54]. Furthermore, crop diseases pose significant risks to food security at a global scale. Timely identification of plant diseases is a key aspect for efficient management. Plant diseases may be provoked by various kinds of bacteria, fungi, pests, viruses, and other agents. Disease symptoms, namely the physical evidence of the presence of pathogens and the changes in the plants' phenotype, may consist of leaf and fruit spots, wilting and color change [55], curling of leaves, etc. Historically, disease detection was conducted by expert agronomists, by performing field scouting. However, this process is time-consuming and solely based on visual inspection. Recent technological advances have made commercially available sensing systems able to identify diseased plants before the symptoms become visible. Furthermore, in the past few years, computer vision, especially by employing deep learning, has made remarkable progress. As highlighted by Zhang et al. [56], who focused on identifying cucumber leaf diseases by utilizing deep learning, due to the complex environmental background, it is beneficial to eliminate background before model training. Moreover, accurate image classifiers for disease diagnosis need a large dataset of both healthy and diseased plant images. In reference to large-scale cultivations, such kinds of automated processes can be combined with autonomous vehicles, to timely identify phytopathological problems by implementing regular inspections. Furthermore, maps of the spatial distribution of the plant disease can be created, depicting the zones in the farm where the infection has been spread [57].

Weed Detection
As a result of their prolific seed production and longevity, weeds usually grow and spread invasively over large parts of the field very fast, competing with crops for the resources, including space, sunlight, nutrients, and water availability. Besides, weeds frequently arise sooner than crops without having to face natural enemies, a fact that adversely affects crop growth [18]. In order to prevent crop yield reduction, weed control is an important management task by either mechanical treatment or application of herbicides. Mechanical treatment is, in most cases, difficult to be performed and ineffective if not properly performed, making herbicide application the most widely used operation. Using large quantities of herbicides, however, turns out to be both costly and detrimental for the environment, especially in the case of uniform application without taking into account the spatial distribution of the weeds. Remarkably, long-term herbicide use is very likely to make weeds more resistant, thus, resulting in more demanding and expensive weed control. In recent years, considerable achievements have been made pertaining to the differentiation of weeds from crops on the basis of smart agriculture. This discrimination can be accomplished by using remote or proximal sensing with sensors attached on satellites, aerial, and ground vehicles, as well as unmanned vehicles (both ground (UGV) and aerial (UAV)). The transformation of data gathered by UAVs into meaningful information is, however, still a challenging task, since both data collection and classification need painstaking effort [58]. ML algorithms coupled with imaging technologies or non-imaging spectroscopy can allow for real-time differentiation and localization of target weeds, enabling precise application of herbicides to specific zones, instead of spraying the entire fields [59] and planning of the shortest weeding path [60].

Crop Recognition
Automatic recognition of crops has gained considerable attention in several scientific fields, such as plant taxonomy, botanical gardens, and new species discovery. Plant species can be recognized and classified via analysis of various organs, including leaves, stems, fruits, flowers, roots, and seeds [61,62]. Using leaf-based plant recognition seems to be the most common approach by examining specific leaf's characteristics like color, shape, and texture [63]. With the broader use of satellites and aerial vehicles as means of sensing crop properties, crop classification through remote sensing has become particularly popular. As in the above sub-categories, the advancement on computer software and image processing devices combined with ML has led to the automatic recognition and classification of crops.

Crop Quality
Crop quality is very consequential for the market and, in general, is related to soil and climate conditions, cultivation practices and crop characteristics, to name a few. High quality agricultural products are typically sold at better prices, hence, offering larger earnings to farmers. For instance, as regards fruit quality, flesh firmness, soluble solids content, and skin color are among the most ordinary maturity indices utilized for harvesting [64]. The timing of harvesting greatly affects the quality characteristics of the harvested products in both high value crops (tree crops, grapes, vegetables, herbs, etc.) and arable crops. Therefore, developing decision support systems can aid farmers in taking appropriate management decisions for increased quality of production. For example, selective harvesting is a management practice that may considerably increase quality. Furthermore, crop quality is closely linked with food waste, an additional challenge that modern agriculture has to cope with, since if the crop deviates from the desired shape, color, or size, it may be thrown away. Similarly to the above sub-section, ML algorithms combined with imaging technologies can provide encouraging results.

Water Management
The agricultural sector constitutes the main consumer of available fresh water on a global scale, as plant growth largely relies on water availability. Taking into account the rapid depletion rate of a lot of aquifers with negligible recharge, more effective water management is needed for the purpose of better conserving water in terms of accomplishing a sustainable crop production [65]. Effective water management can also lead to the improvement of water quality as well as reduction of pollution and health risks [66]. Recent research on precision agriculture offers the potential of variable rate irrigation so as to attain water savings. This can be realized by implementing irrigation at rates, which vary according to field variability on the basis of specific water requirements of separate management zones, instead of using a uniform rate in the entire field. The effectiveness and feasibility of the variable rate irrigation approach depend on agronomic factors, including topography, soil properties, and their effect on soil water in order to accomplish both water savings and yield optimization [67]. Carefully monitoring the status of soil water, crop growth conditions, and temporal and spatial patterns in combination with weather conditions monitoring and forecasting, can help in irrigation programming and efficient management of water. Among the utilized ICTs, remote sensing can provide images with spatial and temporal variability associated with the soil moisture status and crop growth parameters for precision water management. Interestingly, water management is challenging enough in arid areas, where groundwater sources are used for irrigation, with the precipitation providing only part of the total crop evapotranspiration (ET) demands [68].

Soil Management
Soil, a heterogeneous natural resource, involves mechanisms and processes that are very complex. Precise information regarding soil on a regional scale is vital, as it contributes towards better soil management consistent with land potential and, in general, sustainable agriculture [5]. Better management of soil is also of great interest owing to issues like land degradation (loss of the biological productivity), soil-nutrient imbalance (due to fertilizers overuse), and soil erosion (as a result of vegetation overcutting, improper crop rotations rather than balanced ones, livestock overgrazing, and unsustainable fallow periods) [69]. Useful soil properties can entail texture, organic matter, and nutrients content, to mention but a few. Traditional soil assessment methods include soil sampling and laboratory analysis, which are normally expensive and take considerable time and effort. However, remote sensing and soil mapping sensors can provide low-cost and effortless solution for the study of soil spatial variability. Data fusion and handling of such heterogeneous "big data" may be important drawbacks, when traditional data analysis methods are used. ML techniques can serve as a trustworthy, low-cost solution for such a task.

Livestock Management
It is widely accepted that livestock production systems have been intensified in the context of productivity per animal. This intensification involves social concerns that can influence consumer perception of food safety, security, and sustainability, based on animal welfare and human health. In particular, monitoring both the welfare of animals and overall production is a key aspect so as to improve production systems [70]. The above fields take place in the framework of precision livestock farming, aiming at applying engineering techniques to monitor animal health in real time and recognizing warning messages, as well as improving the production at the initial stages. The role of precision livestock farming is getting more and more significant by supporting the decision-making processes of livestock owners and changing their role. It can also facilitate the products' traceability, in addition to monitoring their quality and the living conditions of animals, as required by policy-makers [71]. Precision livestock farming relies on non-invasive sensors, such as cameras, accelerometers, gyroscopes, radio-frequency identification systems, pedometers, and optical and temperature sensors [25]. IoT sensors leverage variable physical quantities (VPQs) as a means of sensing temperature, sound, humidity, etc. For instance, IoT sensors can warn if a VPQ falls out of regular limits in real-time, giving valuable information regarding individual animals. As a result, the cost of repetitively and arduously checking each animal can be reduced [72]. In order to take advantage of the large amounts of data, ML methodologies have become an integral part of modern livestock farming. Models can be developed that have the capability of defining the manner a biological system operates, relying on causal relationships and exploiting this biological awareness towards generating predictions and suggestions.

Animal Welfare
There is an ongoing concern for animal welfare, since the health of animals is strongly associated with product quality and, as a consequence, predominantly with the health of consumers and, secondarily, with the improvement of economic efficiency [73]. There exist several indexes for animal welfare evaluation, including physiological stress and behavioral indicators. The most commonly used indicator is animal behavior, which can be affected by diseases, emotions, and living conditions, which have the potential to demonstrate physiological conditions [25]. Sensors, commonly used to detect behavioral changes (for example, changes in water or food consumption, reduced animal activity), include microphone systems, cameras, accelerometers, etc.

Livestock Production
The use of sensor technology, along with advanced ML techniques, can increase livestock production efficiency. Given the impact of practices of animal management on productive elements, livestock owners are getting cautious of their asset. However, as the livestock holdings get larger, the proper consideration of every single animal is very difficult. From this perspective, the support to farmers via precision livestock farming, mentioned above, is an auspicious step for aspects associated with economic efficiency and establishment of sustainable workplaces with reduced environmental footprint [74]. Generally, several models have been used in animal production, with their intentions normally revolving around growing and feeding animals in the best way. However, the large volumes of data being involved, again, call for ML approaches.

Screening of the Relative Literature
In order to identify the relevant studies concerning ML in respect to different aspects of management in agriculture, the search engines of Scopus, Google Scholar, ScienceDirect, PubMed, Web of Science, and MDPI were utilized. In addition, keywords' combinations of "machine learning" in conjunction with each of the following: "crop management", "water management", "soil management", and "livestock management" were used. Our intention was to filter the literature on the same framework as [12]; however, focusing solely within the period 2018-2020. Once a relevant study was being identified, the references of the paper at hand were being scanned to find studies that had not been found throughout the initial searching procedure. This process was being iterated until no relevant studies occurred. In this stage, only journal papers were considered eligible. Thus, non-English studies, conferences papers, chapters, reviews, as well as Master and Doctoral Theses were excluded. The latest search was conducted on 15 December 2020. Subsequently, the abstract of each paper was being reviewed, while, at a next stage, the full text was being read to decide its appropriateness. After a discussion between all co-authors with reference to the appropriateness of the selected papers, some of them were excluded, in the case they did not meet the two main inclusion criteria, namely: (a) the paper was published within 2018-2020 and (b) the paper referred to one of the categories and subcategories, which were summarized in Figure 1. Finally, the papers were classified in these sub-categories. Overall, 338 journal papers were identified. The flowchart of the present review methodology is depicted in Figure 3, based on the PRISMA guidelines [75], along with information about at which stage each exclusive criterion was imposed similarly to recent systematic review studies such as [72,[76][77][78].

Definition of the Performance Metrics Commonly Used in the Reviewed Studies
In this subsection, the most commonly used performance metrics of the reviewed papers are briefly described. In general, these metrics are utilized in an effort to provide a common measure to evaluate the ML algorithms. The selection of the appropriate metrics is very important, since: (a) how the algorithm's performance is measured relies on these metrics and (b) the metric itself can influence the way the significance of several characteristics is weighted. Confusion matrix constitutes one of the most intuitive metrics towards finding the correctness of a model. It is used for classification problems, where the result can be of at least two types of classes. Let us consider a simple example, by giving a label to a target variable: for example, "1" when a plant has been infected with a disease and "0" otherwise. In this simplified case, the confusion matrix ( Figure 4) is a 2 × 2 table having two dimensions, namely "Actual" and "Predicted", while its dimensions have the outcome of the comparison between the predictions with the actual class label. Concerning the above simplified example, this outcome can acquire the following values:  As can be shown in Table 1, the aforementioned values can be implemented in order to estimate the performance metrics, typically observed in classification problems [79]. Other common evaluation metrics were the coefficient of correlation (R), coefficient of determination (R 2 ; basically, the square of the correlation coefficient), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Squared Error (MSE), which can be given via the following relationships [80,81]: where X(t) and Z(t) correspond to the predicted and real value, respectively, t stands for the iteration at each point, while T for the testing records number. Accordingly, low values of MAE, MAPE, and MSE values denote a small error and, hence, better performance. In contrast, R 2 near 1 is desired, which demonstrates better model performance and also that the regression curve efficiently fits the data.

Preliminary Data Visualization Analysis
Graphical representation of data related to the reviewed studies, by using maps, bar or pie charts, for example, can provide an efficient approach to demonstrate and interpret the patterns of data. The data visualization analysis, as it usually refers to, can be vital in the context of analyzing large amounts of data and has gained remarkable attention in the past few years, including review studies. Indicatively, significant results can be deduced in an effort to identify: (a) the most contributing authors and organizations, (b) the most contributing international journals (or equivalently which research fields are interested in this topic), and (c) the current trends in this field [82].

Classification of the Studies in Terms of Application Domain
As can be seen in the flowchart of the present methodology (Figure 3), the literature survey on ML in agriculture resulted in 338 journal papers. Subsequently, these studies were classified into the four generic categories as well as into their sub-categories, as already mentioned above. Figure 5 depicts the aforementioned papers' distribution. In particular, the majority of the studies were intended for crop management (68%), while soil management (10%), water management (10%), and livestock management (12% in total; animal welfare: 7% and livestock production: 5%) had almost equal contribution in the present bibliographic survey. Focusing on crop management, the most contributing subcategories were yield prediction (20%) and disease detection (19%). The former research field arises as a consequence of the increasing interest of farmers in taking decisions based on efficient management that can lead to the desired yield. Disease detection, on the other hand, is also very important, as diseases constitute a primary menace for food security and quality assurance. Equal percentages (13%) were observed for weed detection and crop recognition, both of which are essential in crop management at farm and agricultural policy making level. Finally, examination of crop quality was relatively scarce corresponding to 3% of all studies. This can be attributed to the complexity of monitoring and modeling the quality-related parameters. In this fashion, it should be mentioned again that all the selected journal papers are summarized in Tables A1-A9, depending on their field of application, and presented in the Appendix A. The columns of the tables correspond (from left to right) to the "Reference number" (Ref), "Input Data", "Functionality", "Models/Algorithms", and "Best Output". One additional column exists for the sub-categories belonging in crop management, namely "Crop", whereas the corresponding column in the sub-categories pertaining to livestock management refers to "Animal". The present systematic review deals with a plethora of different ML models and algorithms. For the sake of brevity, the commonly used abbreviations are used instead of the entire names, which are summarized in Tables A10 and A11 (presented also in the Appendix A). The list of the aforementioned Tables, along with their content, is listed in Table 2. Table 2. List of the tables appearing in the Appendix A related to: (a) the categories and sub-categories of the machine learning applications in agriculture (Tables A1-A9) and (b) the abbreviations of machine learning models and algorithms (Tables A10 and A11, respectively). Abbreviations of machine learning models A11 Abbreviations of machine learning algorithms

Geographical Distribution of the Contributing Organizations
The subject of this sub-section is to find out the geographical distribution of all the contributing organizations in ML applications in agriculture. To that end, the author's affiliation was taken into account. In case a paper included more than one author, which was the most frequent scenario, each country could contribute only once in the final map chart ( Figure 6), similarly to [83,84]. As can be gleaned from Figure 6, investigating ML in agriculture is distributed worldwide, including both developed and developing economies. Remarkably, out of the 55 contributing countries, the least contribution originated from African countries (3%), whereas the major contribution came from Asian countries (55%). The latter result is attributed mainly to the considerable contribution of Chinese (24.9%) as well as Indian organizations (10.1%). USA appeared to be the second most contributing country with 20.7% percentage, while Australia (9.5%), Spain (6.8%), Germany (5.9%), Brazil, UK, and Iran (5.62%) seem to be particularly interested in ML in agriculture. It should be stressed that livestock management, which is a relatively different sub-field comparing to crop, water, and soil management, was primary examined from studies coming from Australia, USA, China, and UK, while all the papers regarding Ireland were focused on animals. Finally, another noteworthy observation is that a large number of articles were a result of international collaboration, with the synergy of China and USA standing out.

Distribution of the Most Contributing Journal Papers
For the purpose of identifying the research areas that are mostly interested in ML in agriculture, the most frequently appeared international journal papers are depicted in Figure 7. In total, there were 129 relevant journals. However, in this bar chart, only the journals contributing with at least 4 papers are presented for brevity. As a general remark, remote sensing was of particular importance, since reliable data from satellites and UAV, for instance, constitute valuable input data for the ML algorithms. In addition, smart farming, environment, and agricultural sustainability were of central interest. Journals associated with computational techniques were also presented with considerable frequency. A typical example of such type of journals, which was presented in the majority of the studies with a percentage of 19.8%, was "Computers and Electronics in Agriculture". This journal aims at providing the advances in relation to the application of computers and electronic systems for solving problems in plant and animal production.  The "Remote Sensing" and "Sensors" journals followed with approximately 11.8% and 6.5% of the total number of publications, respectively. These are cross-sectoral journals that are concentrated on applications of science and sensing technologies in various fields, including agriculture. Other journals, covering this research field, were also "IEEE Access" and "International Journal of Remote Sensing" with approximately 2.1% and 1.2% contribution, respectively. Moreover, agriculture-oriented journals were also presented in Figure 7, including "Precision Agriculture", "Frontiers in Plant Science", "Agricultural and Forest Meteorology", and "Agricultural Water Management" with 1-3% percentage. These journals deal with several aspects of agriculture ranging from management strategies (so as to incorporate spatial and temporal data as a means of optimizing productivity, resource use efficiency, sustainability and profitability of agricultural production) up to crop molecular genetics and plant pathogens. An interdisciplinary journal concentrating on soil functions and processes also appeared with 2.1%, namely "Geoderma", plausibly covering the soil management generic category. Finally, several journals focusing on physics and applied natural sciences, such as "Applied Sciences" (2.7%), "Scientific Reports" (1.8%), "Biosystems Engineering" (1.5%), and "PLOS ONE" (1.5%), had a notable contribution to ML studies. As a consequence, ML in agriculture concerns several disciplines and constitutes a fundamental area for developing various techniques, which can be beneficial to other fields as well.

Machine Learning Models Providing the Best Results
A wide range of ML algorithms was implemented in the selected studies; their abbreviations are given in Table A11. The ML algorithms that were used by each study as well as those that provided the best output have been listed in the last two columns of Tables A1-A9. These algorithms can be classified into the eight broad families of ML models, which are summarized in Table A10. Figure 8 focuses on the best performed ML models as a means of capturing a broad picture of the current situation and demonstrating advancement similarly to [12].
As can be demonstrated in Figure 8, the most frequent ML model providing the best output was, by far, Artificial Neural Networks (ANNs), which appeared in almost half of the reviewed studies (namely, 51.8%). More specifically, ANN models provided the best results in the majority of the studies concerning all sub-categories. ANNs have been inspired by the biological neural networks that comprise human brains [85], while they allow for learning via examples from representative data describing a physical phenomenon. A distinct characteristic of ANNs is that they can develop relationships between dependent and independent variables, and thus extract useful information from representative datasets. ANN models can offer several benefits, such as their ability to handle noisy data [86], a situation that is very common in agricultural measurements. Among the most popular ANNs are the Deep Neural Networks (DNNs), which utilize multiple hidden layers between input and output layers. DNNs can be unsupervised, semi-supervised, or supervised. A usual kind of DNNs are the Convolutional Neural Networks (CNNs), whose layers, unlike common neural networks, can set up neurons in three dimensions [87]. In fact, CNNs were presented as the algorithms that provide the best output in all sub-categories, with an almost 50% of the individual percentage of ANNs. As stressed in recent studies, such as that of Yang et al. [88], CNNs are receiving more and more attention because of their efficient results when it comes to detection through images' processing. Recurrent Neural Networks (RNNs) followed, representing approximately 10% of ANNs, with Long Short-Term Memory (LSTM) standing out. They are called "recurrent" as they carry out the same process for every element, with the previous computations determining the current output, while they have a "memory" that stores information pertaining to what has been calculated so far. RNNs can face problems concerning vanishing gradients and inability to "memorize" many sequential data. Towards addressing these issues, the cell structures of LSTM can control which part of information will be either stored in long memory or discarded, resulting in optimization of the memorizing process [51]. Moreover, Multi-Layer Perceptron (MLP), Fully Convolutional Networks (FCNs), and Radial Basis Function Networks (RBFNs) appeared to have the best performance in almost 3-5% of ANNs. Finally, ML algorithms, belonging to ANNs with low frequency, were Back-Propagation Neural Networks The second most accurate ML model was Ensemble Learning (EL), contributing to the ML models used in agricultural systems with approximately 22.2%. EL is a concise term for methods that integrate multiple inducers for the purpose of making a decision, normally in supervised ML tasks. An inducer is an algorithm, which gets as an input a number of labeled examples and creates a model that can generalize these examples. Thus, predictions can be made for a set of new unlabeled examples. The key feature of EL is that via combining various models, the errors coming from a single inducer is likely to be compensated from other inducers. Accordingly, the prediction of the overall performance would be superior comparing to a single inducer [89]. This type of ML model was presented in all sub-categories, apart from crop quality, perhaps owing to the small number of papers belonging in this subcategory. Support Vector Machine (SVM) followed, contributing in approximately 11.5% of the studies. The strength of the SVM stems from its capability to accurately learn data patterns while showing reproducibility. Despite the fact that it can also be applied for regression applications, SVM is a commonly used methodology for classification extending across numerous data science settings [90], including agricultural research.
Decision Trees (DT) and Regression models came next with equal percentage, namely 4.7%. Both these ML models were presented in all generic categories. As far as DT are concerned, they are either regression or classification models structured in a tree-like architecture. Interestingly, handling missing data in DT is a well-established problem. By implementing DT, the dataset can be gradually organized into smaller subsets, whereas, in parallel, a tree graph is created. In particular, each tree's node denotes a dissimilar pairwise comparison regarding a certain feature, while each branch corresponds to the result of this comparison. As regards leaf nodes, they stand for the final decision/prediction provided after following a certain rule [91,92]. As for Regression, it is used for supervised learning models intending to model a target value on the basis of independent predictors. In particular, the output can be any number based on what it predicts. Regression is typically applied for time series modeling, prediction, and defining the relationships between the variables.
Finally, the ML models, leading to optimal performance (although with lower contribution to literature), were those of Instance Based Models (IBM) (2.7%), Dimensionality Reduction (DR) (1.5%), Bayesian Models (BM) (0.9%), and Clustering (0.3%). IBM appeared only in crop, water, and livestock management, whereas BM only in crop and soil management. On the other hand, DR and Clustering appeared as the best solution only in crop management. In brief, IBM are memory-based ML models that can learn through comparison of the new instances with examples within the training database. DR can be executed both in unsupervised and supervised learning types, while it is typically carried out in advance of classification/regression so as to prevent dimensionality effects. Concerning the case of BM, they are a family of probabilistic models whose analysis is performed within the Bayesian inference framework. BM can be implemented in both classification and regression problems and belong to the broad category of supervised learning. Finally, Clustering belongs to unsupervised ML models. It contains automatically discovering of natural grouping of data [12].

Most Studied Crops and Animals
In this sub-section, the most examined crops and animals that were used in the ML models are discussed as a result of our searching within the four sub-categories of crop management similarly to [12]. These sub-categories refer to yield prediction, disease detection, crop recognition, and crop quality. Overall, approximately 80 different crop species were investigated. The 10 most utilized crops are summarized in Figure 9. Specifically, the remarkable interest on maize (also known as corn) can be attributed to the fact that it is cultivated in many parts across the globe as well as its versatile usage (for example, direct consumption by humans, animal feed, producing ethanol, and other biofuels). Wheat and rice follow, which are two of the most widely consumed cereal grains. According to the Food and Agriculture Organization (FAO) [93], the trade in wheat worldwide is more than the summation of all other crops. Concerning rice, it is the cereal grain with the third-highest production and constitutes the most consumed staple food in Asia [94]. The large contribution of Asian countries presented in Figure 6, like China and India, justifies the interest in this crop. In the same vein, soybeans, which are broadly distributed in East Asia, USA, Africa, and Australia [95], were presented in many studies. Finally, tomato, grape, canola/rapeseed (cultivated primarily for its oil-rich seed), potato, cotton, and barley complete the top 10 examined crops. All these species are widely cultivated all over the world. Some other indicative species, which were investigated at least five times in the present reviewed studies, were also alfalfa, citrus, sunflower, pepper, pea, apple, squash, sugarcane, and rye. As far as livestock management is concerned, the examined animal species can be classified, in descending order of frequency, into the categories of cattle (58.5%), sheep and goats (26.8%), swine (14.6%), poultry (4.9%), and sheepdog (2.4%). As can be depicted in Figure 10, the last animal, which is historically utilized with regard to the raising of sheep, was investigated only in one study belonging to animal welfare, whereas all the other animals were examined in both categories of livestock management. In particular, the most investigated animal in both animal welfare and livestock production was cattle. Sheep and goats came next, which included nine studies for sheep and two studies for goats. Cattles are usually raised as livestock aimed at meat, milk, and hide used for leather. Similarly, sheep are raised for meat and milk as well as fleece. Finally, swine (often called domestic pigs) and poultry (for example, chicken, turkey, and duck), which are used mainly for their meat or eggs (poultry), had equal contribution from the two livestock sub-categories.

Most Studied Features and Technologies
As mentioned in the beginning of this study, modern agriculture has to incorporate large amounts of heterogeneous data, which have originated from a variety of sensors over large areas at various spatial scale and resolution. Subsequently, such data are used as input into ML algorithms for their iterative learning up until modeling of the process in the most effective way possible. Figure 11 shows the features and technologies that were used in the reviewed studies, separately for each category, for the sake of better comprehending the results of the analysis.
Data coming from remote sensing were the most common in the yield prediction sub-category. Remote sensing, in turn, was primarily based on data derived from satellites (40.6% of the total studies published in this sub-category) and, secondarily, from UAVs (23.2% of the total studies published in this sub-category). A remarkable observation is the rapid increase of the usage of UAVs versus satellites from the year 2018 towards 2020, as UAVs seem to be a reliable alternative that can give faster and cheaper results, usually in higher resolution and independent of the weather conditions. Therefore, UAVs allow for discriminating details of localized circumscribed regions that the satellites' lowest resolution may miss, especially under cloudy conditions. This explosion in the use of UAV systems in agriculture is a result of the developing market of drones and sensing solutions attached to them, rendering them economically affordable. In addition, the establishment of formal regulations for UAV operations and the simplification and automatization of the operational and analysis processes had a significant contribution on the increasing popularity of these systems. Data pertaining to the weather conditions of the investigated area were also of great importance as well as soil parameters of the farm at hand. An additional way of getting the data was via in situ manual measurements, involving measurements such as crop height, plant growth, and crop maturity. Finally, data concerning topographic, irrigation, and fertilization aspects were presented with approximately equal frequency. As far as disease detection is concerned, Red-Green-Blue (RGB) images appear to be the most usual input data for the ML algorithms (in 62% of the publications). Normally, deep learning methods like CNNs are implemented with the intention of training a classifier to discriminate images depicting healthy leaves, for example, from infected ones. CNNs use some particular operations to transform the RGB images so that the desired features are enhanced. Subsequently, higher weights are given to the images having the most suitable features. This characteristic constitutes a significant advantage of CNNs as compared to other ML algorithms, when it comes to image classification [79]. The second most common input data came from either multispectral or hyperspectral measurements originated from spectroradiometers, UAVs, and satellites. Concerning the investigated diseases, fungal diseases were the most common ones with diseases from bacteria following, as is illustrated in Figure 12a. This kind of disease can cause major problems in agriculture with detrimental economic consequences [96]. Other examined origins of crop diseases were, in descending order of frequency, pests, viruses, toxicity, and deficiencies. Images were also the most used input data for weed detection purposes. These images were RGB images that originated mainly from in situ measurements as well as from UGVs and UAVs and, secondarily, multispectral images from the aforementioned sources. Finally, other parameters that were observed, although with lower frequency, were satellite multispectral images, mainly due to the considerably low resolution they provide, video recordings, and hyperspectral and greyscale images. Concerning crop recognition, the majority of the studies used data coming mostly from satellites and, secondarily, from in situ manual measurements. This is attributed to the fact that most of the studies in this category concern crop classification, a sector where satellite imaging is the most widely used data source owing to its potential for analysis of time series of extremely large surfaces of cultivated land. Laboratory measurements followed, while RGB and greyscale images as well as hyperspectral and multispectral measurements from UAVs were observed with lower incidence.
The input data pertaining to crop quality consisted mainly of RGB images, while X-ray images were also utilized (for seed germination monitoring). Additionally, quality parameters, such as color, mass, and flesh firmness, were used. There were also two studies using spectral data either from satellites or spectroradiometers. In general, the studies belonging in this sub-category dealt with either crop quality (80%) or seed germination potential (20%) (Figure 12b). The latter refers to the seed quality assessment that is essential for the seed production industry. Two studies were found about germination that both combined X-ray images analysis and ML.
Concerning soil management, various soil properties were taken into account in 65.7% of the studies. These properties included salinity, organic matter content, and electrical conductivity of soil and soil organic carbon. Usage of weather data was also very common (in 48.6% of the studies), while topographic and data pertaining to the soil moisture content (namely the ratio of the water mass over the dry soil) and crop properties were presented with lower frequency. Additionally, remote sensing, including satellite and UAV multispectral and hyperspectral data, as well as proximal sensing, to a lesser extent, were very frequent choices (in 40% of the studies). Finally, properties associated with soil temperature, land type, land cover, root microbial dynamics, and groundwater salinity make up the rest of data, which are labeled as "other" in the corresponding graph of Figure 11.
In water management, weather data stood for the most common input data (appeared in the 75% of the studies), with ET being used in the vast majority of them. In many cases, accurate estimation of ET (the summation of the transpiration via the plant canopy and the evaporation from plant, soil, and open water surface) is among the most central elements of hydrologic cycle for optimal management of water resources [97]. Data from remote sensors and measurements of soil water content were also broadly used in this category. Soil water availability has a central impact on crops' root growth by affecting soil aeration and nutrient availability [98]. Stem water potential, appearing in three studies, is actually a measure of water tension within the xylem of the plant, therefore functioning as an indicator of the crop's water status. Furthermore, in situ measurements, soil, and other parameters related to cumulative water infiltration, soil and water quality, field topography, and crop yield were also used, as can be seen in Figure 11.
Finally, in what concerns livestock management, motion capture sensors, including accelerometers, gyroscopes, and pedometers, were the most common devices giving information about the daily activities of animals. This kind of sensors was used solely in the studies investigating animal welfare. Images, audio, and video recordings came next, however, appearing in both animal welfare and livestock production sub-categories. Physical and growth characteristics followed, with slightly less incidence, by appearing mainly in livestock production sub-category. These characteristics included the animal's weight, gender, age, metabolites, biometric traits, backfat and muscle thickness, and heat stress. The final characteristic may have detrimental consequences in livestock health and product quality [99], while through the measurement of backfat and muscle thickness, estimations of the carcass lean yield can be made [100].

Discussion and Main Conclusions
The present systematic review study deals with ML in agriculture, an ever-increasing topic worldwide. To that end, a comprehensive analysis of the present status was conducted concerning the four generic categories that had been identified in the previous review by Liakos et al. [12]. These categories pertain to crop, water, soil, and livestock management. Thus, by reviewing the relative literature of the last three years (2018-2020), several aspects were analyzed on the basis of an integrated approach. In summary, the following main conclusions can be drawn:

•
The majority of the journal papers focused on crop management, whereas the other three generic categories contributed almost with equal percentage. Considering the review paper of [12] as a reference study, it can be deduced that the above picture remains, more or less, the same, with the only difference being the decrease of the percentage of the articles regarding livestock from 19% to 12% in favor of those referring to crop management. Nonetheless, this reveals just one side of the coin. Taking into account the tremendous increase in the number of relative papers published within the last three years (in particular, 40 articles were identified in [12] comparing to the 338 of the present literature survey), approximately 400% more publications were found on livestock management. Another important finding was the increasing research interest on crop recognition. • Several ML algorithms have been developed for the purpose of handling the heterogeneous data coming from agricultural fields. These algorithms can be classified in families of ML models. Similar to [12], the most efficient ML models proved to be ANNs. Nevertheless, in contrast to [12], the interest also been shifted towards EL, which can combine the predictions that originated from more than one model. SVM completes the group with the three most accurate ML models in agriculture, due to some advantages, such as its high performance when it works with image data [101].

•
As far as the most investigated crops are concerned, mainly maize and, secondarily, wheat, rice, and soybean were widely studied by using ML. In livestock management, cattle along with sheep and goats stood out constituting almost 85% of the studies. Comparing to [12], more species have been included, while wheat and rice as well as cattle, remain important specimens for ML applications.

•
A very important result of the present review study was the demonstration of the input data used in the ML algorithms and the corresponding sensors. RGB images constituted the most common choice, thus, justifying the broad usage of CNNs due to their ability to handle this type of data more efficiently. Moreover, a wide range of parameters pertaining to weather as well as soil, water, and crop quality was used. The most common means of acquiring measurements for ML applications was remote sensing, including imaging from satellites, UAVs and UGVs, while in situ and laboratory measurements were also used. As highlighted above, UAVs are constantly gaining ground against satellites mainly because of their flexibility and ability to provide images with high resolution under any weather conditions. Satellites, on the other hand, can supply time-series over large areas [102]. Finally, animal welfarerelated studies used mainly devices such as accelerometers for activity recognition, whereas those ones referring to livestock production utilized primary physical and growth characteristics of the animal.
As can be inferred from the geographical distribution (illustrated in Figure 6) in tandem with the broad spectrum of research fields, ML applications for facilitating various aspects of management in the agricultural sector is an important issue on an international scale. As a matter of fact, its versatile nature favors convergence research. Convergence research is a relatively recently introduced approach that is based on shared knowledge between different research fields and can have a positive impact on the society. This can refer to several aspects, including improvement of the environmental footprint and assuring human's health. Towards this direction, ML in agriculture has a considerable potential to create value.
Another noteworthy finding of the present analysis is the capturing of the increasing interest on topics concerning ML analyses in agricultural applications. More specifically, as can be shown in Figure 13, an approximately 26% increase was presented in the total number of the relevant studies, if a comparison is made between 2018 and 2019. The next year (i.e., 2020), the corresponding increase jumped to 109% against 2019 findings; thus, resulting in an overall 164% rise comparing with 2018. The accelerating rate of the research interest on ML in agriculture is a consequence of various factors, following the considerable advancements of ICT systems in agriculture. Moreover, there exists a vital need for increasing the efficiency of agricultural practices while reducing the environmental burden. This calls for both reliable measurements and handling of large volumes of data as a means of providing a wide overview of the processes taking place in agriculture. The currently observed technological outbreak has a great potential to strengthen agriculture in the direction of enhancing food security and responding to the rising consumers' demands. In a nutshell, ICT in combination with ML, seem to constitute one of our best hopes to meet the emerging challenges. Taking into account the rate of today's data accumulation along with the advancement of various technologies, farms will certainly need to advance their management practices by adopting Decision Support Systems (DSSs) tailored to the needs of each cultivation system. These DSSs use algorithms, which have the ability to work on a wider set of cases by considering a vast amount of data and parameters that the farmers would be impossible to handle. However, the majority of ICT necessitates upfront costs to be paid, namely the high infrastructure investment costs that frequently prevent farmers from adopting these technologies. This is going to be a pressing issue, mainly in developing economies, where agriculture is an essential economic factor. Nevertheless, having a tangible impact is a long-haul game. A different mentality is required by all stakeholders so as to learn new skills, be aware of the potential profits of handling big data, and assert sufficient funding. Overall, considering the constantly increasing recognition of the value of artificial intelligence in agriculture, ML will definitely become a behind-thescenes enabler for the establishment of a sustainable and more productive agriculture. It is anticipated that the present systematic effort is going to constitute a beneficial guide to researchers, manufacturers, engineers, ICT system developers, policymakers, and farmers and, consequently, contribute towards a more systematic research on ML in agriculture.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In this section, the reviewed articles are summarized within the corresponding Tables as described in Table 2.