Machine Learning Applications on Agricultural Datasets for Smart Farm Enhancement

This work aims to show how to manage heterogeneous information and data coming from real datasets that collect physical, biological, and sensory values. As productive companies—public or private, large or small—need increasing profitability with cost reduction, discovering appropriate ways to exploit continuously recorded and available data can be the right choice to achieve these goals. The agricultural field is only apparently refractory to digital technology, and the "smart farm" model is increasingly widespread, exploiting the Internet of Things (IoT) paradigm applied to environmental and historical information through time series. The focus of this study is the design and deployment of practical tasks, ranging from crop harvest forecasting to the reconstruction of missing or wrong sensor data, exploiting and comparing various machine learning techniques to suggest in which direction to direct efforts and investments. The results show that there are ample margins for innovation while supporting requests and needs coming from companies that wish to run a sustainable and optimized agricultural business, investing not only in technology, but also in the knowledge and skilled workforce required to get the best out of it.


Introduction
Nowadays, we are surrounded by a large number of "smart" sensors and intelligent systems that are always inter-connected through the Internet and cloud platforms; this is the Internet of Things (IoT) paradigm, which introduces advanced technologies into all social and productive sectors of society. Considering the worldwide market, companies compete to increase their profitability by optimizing costs, time, and resources while, at the same time, trying to improve the quality of services and the variety of products offered to customers. Attention towards efficiency and productive improvements is coveted also in the agricultural sector, where production dynamics and resource management affect crop types, irrigation, and the amount of disinfestation; keeping such production rhythms without any automatic control is likely to bring resource waste, rotten or abandoned crops, and polluted and impoverished soils.
Innovative technologies can be useful to face problems such as environmental sustainability, waste reduction, and soil optimization; the gathering and analysis of agricultural data, which include numerous heterogeneous variables, are of considerable interest for the possibility of developing production techniques respectful of the ecosystem and its resources (optimization of irrigation and sowing in relation to soil history and seasonal cycles), identifying influential and non-influential factors, carrying out market analyses based on the forecast of information that is otherwise hard to predict, adapting crops to specific environments, and, finally, maximizing technological investments by limiting and predicting hardware failures and replacements.
In this work, three different datasets will be exploited that differ from each other in origin, structure, organization, and availability of their values, since they belong to industry, scientific research, and national statistics institutes. On the well-structured and publicly available Istat dataset, for example, the forecasting of future crop amounts on complete time series is developed, while on the second one, related to industrial IoT sensors, the reconstruction and forecasting of missing or wrong IoT data, as well as the detection of faulty hardware sensors in monitoring stations, are performed by exploiting several machine learning methods. Also, the mid-structured and publicly available scientific National Research Council (CNR) dataset is approached with a predictive goal, introducing evaluation metrics for specific culture species.
While facing living environments like the agricultural one, it is essential to treat an important amount of data even in short time frames, based on daily, weekly, or annual collection, by examining and identifying patterns and particular combinations that impact plantations and production. The cases faced in this study arise from real requests coming from industrial projects, providing a pilot study that allows companies to use their own data to drive hardware and software investments; for this aim, environmental factors (weather, humidity, wind) along with productive and structural factors (such as soil type and extension) are taken into account and used in five practical tasks that exploit supervised machine learning techniques like decision trees, K-nearest neighbors, neural networks, and polynomial predictive models.

Related Works
Agriculture companies can be classified according to different factors; knowing the classification allows one to hypothesize the type of information to be handled, its probable structure, and the operations required to meet the needs of a specific agricultural farm [1,2], which can be specialized in the following:
• non-permanent arable crops (cereals, vegetables, rice, cotton, forage, legumes)
• permanent crops (grapes, apples, oily and citrus fruits, coffee, spices)
• horticulture (flowers, greenhouses)
• plant reproduction
• support or post-harvest activities (maintenance and soil conservation).
The Precision Agriculture model is a result of the rapid developments in the Internet of Things and cloud computing paradigms, which feature context-awareness and real-time events [3]; Wolfert et al. [4] and Biradar et al. [5] present surveys about smart-farm industries, while multidisciplinary models exploiting IoT sensors are examined in the works of [6,7].
Arkeman et al. [8] use greenhouse gas analysis to monitor the oil palm plantations used in the production of biodiesel, while Amanda et al. [9] propose an expert system to help farmers determine tomato varieties matching parameters or preferences using fuzzy logic on factors like altitude, resistance to diseases, fruit size, fruit shape, yield potential, maturity, and fruit color.
The work of Nurulhaq et al. [10] uses IoT hotspots as indicators of forest fires in a region where sequential patterns of occurrences can be extracted from a dataset; Murphy et al. [11] use wireless sensor network (WSN) technology to monitor a beehive colony and collect key information about activity and environment, while the authors of [12] present solutions that can be integrated into drones using a Raspberry Pi module to improve crop quality in agricultural fields.
Major agri-business companies, such as Monsanto [13], Farmlink [14], and Farmlogs [15], invest large resources in research and innovation; considering environmental sustainability, the predictive modeling employed to manage crop failure risk and to boost feed efficiency in livestock production, presented in the literature [16], proves very useful.
Patil and Thorat [17] develop a monitoring system that identifies grape diseases in their early stages, using factors such as temperature, relative humidity, moisture, and a leaf wetness sensor, while Truong et al. [18] use an IoT device with a machine learning algorithm that predicts environmental conditions for fungal detection and prevention, using conditions such as air temperature, relative air humidity, wind speed, and rainfall; moreover, a system for detection and control of diseases on cotton leaves, along with soil quality monitoring, is presented by Sarangdhar and Pawar [19]. Rural Bridge is an IoT-based system that uses sensors to collect scientific information such as soil moisture level, soil pH value, ground water level (GWL), and surface water level (SWL) for smart and co-operative farming [20]; also, Pallavi et al. [21] present remote sensing used in greenhouse agriculture to increase yield and support organic farming.
A SmartAgriFood conceptual architecture is proposed by Kaloxylos et al. [22], while the authors of [23] introduce internet applications in the agri-food domain; Poppe [24] analyzes both the scope and the organization of farm production regulations. Garba [25] develops smart water-sharing methods for semi-arid regions; Hlaing et al. [26] introduce plant disease recognition using statistical models; and, moreover, Alipio et al. [27] present smart hydroponics systems that exploit inference in Bayesian networks. Marimuthu et al. [28] propose and design a Persuasive Technology to encourage smart farming, while also exploiting historical time series for production quality assurance [29], because nowadays consumers are concerned about food safety assurance related to health and well-being.
In the work of Venkatesan and Tamilvanan [30], a system monitors the agricultural field through a Raspberry Pi camera, allowing automatic irrigation based on temperature, humidity, and soil moisture. Bauer and Aschenbruck [31] primarily focus on in situ assessment of the leaf area index (LAI), a very important crop parameter for smart farming, while studies of Pandithurai et al. [32] introduce an IoT application, named 'AGRO-TECH', that is accessible by farmers to keep track of soil, crops, and water, a theme further explored by the authors of [33]; Rekha et al. [34] develop an IoT-based precision farming method for high-yield groundnut agronomy, suggesting irrigation timings and optimal usage of fertilizers respecting soil features.
Emerging economies are also researching these models; the Government of China has performed research to save irrigation water by forecasting weather conditions [35], also considering soil integrity and air quality (Zhou et al. [36]), while in Sun et al. [37] the smart farm paradigm is proposed as an opportunity. Finally, an additional issue to take into account is data evolution in the deployment of a real application, where data availability increases as time goes by [38].

Materials and Methods
This work aims to show practical and experimental results, with the goal of introducing improvements in data management and analysis for small-size industrial companies in contingent territorial contexts that are often refractory to innovation. In the pre-IoT era, small amounts of well-structured data were profitably treated using scarcely adaptive mathematical models coming from statistical and numerical theories; in this context, the comparison between stable and well-known methodologies (often developed with simple spreadsheets) and different, innovative ones, which require investments as well as new knowledge for workers, becomes interesting. By considering the sources of data, there are three main processes for their gathering and generation [14,39,40]:

• Machine-generated (MG): data coming from sensors and intelligent machines (drones, Unmanned Aerial Vehicles (UAVs), Global Positioning System (GPS)). These represent the IoT paradigm and their structure ranges from simple to complex, but the records are generally well-formed and numerical; these data grow critically in volume and speed, and traditional approaches are today not sufficient for their treatment.

• Process-mediated (PM): traditional commercial data coming from business processes referring to corporate events such as purchases and orders; they are highly structured, with various data types, and usually stored in relational databases.

• Human-sourced (HS): attestations of human experience recorded in books, photos, audio, and video; they are now mostly digitized in digital devices and social networks, vaguely structured, and often not validated. The management, analysis, and storage of these data are problematic and open to research.

Data Sources
For this study, three different sources of information are considered (Figure 1), each of them offering complementary and characteristic features useful to design and test machine learning approaches:

Istat (National Institute of Statistics) dataset: annually aggregated data concerning Italian crop amounts (Table 1); it is a well-structured database and contains agricultural production information for each Italian province [41]. This dataset has been integrated with the altitude attribute of the provinces.
The 16 attributes are summarized in Table 1. The dataset portion employed for this work consists of 17 tables, one for each crop type considered; for each of them, there is the cumulative value of each attribute for 124 Italian provinces, calculated on the time series between 2006 and 2017; in this way, there are 17 × 124 × 12 = 25,296 considered records.
CNR (National Research Council) dataset: a structured agrarian dataset, but with values often incomplete or only partially ordered, concerning scientific and technical information from agricultural and biological studies on crops and horticultural species [42]. Some useful data have already undergone transformations and measurements (Table 2).
The four considered attributes are as follows:
• Date, which indicates the date of detection and calculation
• LAI value (leaf area index), which measures the leaf area per soil surface unit
• Evapotranspiration (ETc) and its reference value (ETo), calculated with the Penman-Monteith method
• Evapotranspiration ratio (ETc/ETo), which represents a useful culture coefficient evaluator.
The dataset portion employed for the tasks of this work consists of 23 tables, one for each crop type and year considered (duplicated crop tables are present, but belonging to different years); all the time series between 1993 and 2004 are selected, but they are not always available and not all of them have the same cardinality (128 is the average). Globally, 23 × 128 = 2944 records have been used.
IoT Sensors dataset: an industrial database developed for business needs that gathers precision agriculture data (thermometers, rain gauges) coming from 41 monitoring stations with a 15-min timing [43]; because the sensor values do not arrive in an organized form, a pre-schematization has been performed (Table 3), as well as the integration of the altitude attribute for the monitoring stations.
The 17 attributes are summarized in Table 3. The dataset employed consists of 65 tables, one for each monitoring station; the stations are located in 43 different Italian municipalities, and the time series goes from 1 January 2012 to 2 March 2018 with daily measurements; the resulting values are arranged in a total of 873,344 records.
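As an illustration of the pre-schematization step described above, the sketch below pivots raw 15-min sensor records into one column per sensor and aggregates them to daily values; the station IDs, sensor names, and readings are hypothetical, and pandas is assumed as the data-handling library.

```python
import pandas as pd

# Hypothetical raw 15-minute sensor records: one row per (station, timestamp, sensor)
raw = pd.DataFrame({
    "id_station": [173, 173, 186, 186],
    "timestamp": pd.to_datetime([
        "2018-03-01 00:00", "2018-03-01 00:15",
        "2018-03-01 00:00", "2018-03-01 00:15",
    ]),
    "sensor": ["temp", "temp", "r_inc", "r_inc"],
    "value": [4.2, 4.1, 812.0, 799.5],
})

# Pre-schematization: pivot each station's readings into one column per sensor...
wide = raw.pivot_table(index=["id_station", "timestamp"],
                       columns="sensor", values="value").reset_index()

# ...then aggregate to daily values (mean) per station
daily = (wide.groupby(["id_station", pd.Grouper(key="timestamp", freq="D")])
             .mean()
             .reset_index())
print(daily)
```

A transformation of this kind yields the per-station daily tables on which the later prediction tasks operate.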

Machine Learning Task Design
With so much data from which a technological farm may want to extract valuable information, business-oriented tasks have been designed and performed to identify useful business- and process-oriented practices.

Task 1-Forecasting Future Data (Istat Dataset)
The complete and organized historical time series of the Istat dataset, concerning Italian annual crop amounts, is very useful for forecasting future data (prediction), as well as for employing and comparing the performances of different supervised machine learning techniques.
The supervised machine learning methodology is based on labeled examples used to train and test a model that must learn to discriminate or generate new examples based on those previously seen, after the automatic tuning of its internal parameters, exploiting a specific loss function. The first models that will be exploited are the feed-forward neural network and the polynomial regression models.
A neural network (or multi-layer perceptron) requires a large amount of high-quality training data and a fine-tuning process of its internal parameters to achieve the best performance; for this work, a feed-forward, fully connected architecture with two hidden layers is employed, with the expectation that it will be powerful, fast, and cheap to manage.
The back-propagation algorithm, exploited to update the neuron weights, is summarized as follows:
1. for each layer l, the weights w^l_ij and thresholds w^l_j are randomly initialized;
2. given the training inputs I_p and the target outputs O_p, the output of all layers is computed with a forward pass (1);
3. in each layer, the square error err^l_jp is calculated as the difference between the predicted and the real value at the output layer, and it is used to obtain the new weight and threshold values with (2) and (3).
Polynomial (and linear) regression: a standard technique widely used in the business and industrial fields, based on statistical methods that are computationally inexpensive when using low-order functions (for example, the linear one); it estimates a function that best fits and approximates the input values in a low-dimensional search space.
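A minimal sketch of such a two-hidden-layer feed-forward regressor, using scikit-learn's MLPRegressor on synthetic data: the toy series and layer sizes are assumptions, while the training cycles, learning rate, and momentum echo the values reported later for the RapidMiner Neural Net block (an approximation, not the authors' exact setup).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy stand-in for a crop series: learn a noisy linear relation from one feature
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 0.8 * X[:, 0] + rng.normal(0.0, 0.02, size=200)

# Feed-forward, fully connected, two hidden layers, trained with SGD;
# 500 cycles, learning rate 0.3, momentum 0.2 mirror the paper's settings
mlp = MLPRegressor(hidden_layer_sizes=(10, 10), solver="sgd",
                   learning_rate_init=0.3, momentum=0.2,
                   max_iter=500, random_state=0)
mlp.fit(X, y)
print(mlp.predict([[0.5]]).round(2))
```

In practice the layer widths and learning-rate schedule would be tuned per dataset, as the paper's fine-tuning discussion suggests.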
With regression analysis, it is possible to build a mathematical model where the expected value of a dependent variable Y (expressed in matrix form by y_i) is obtained in terms of the value of an independent variable (or vector of independent variables) X, as in (4), where y_i is the i-th value of the dependent variable, β_0 is the intercept, β_i is the i-th angular coefficient, and x_i is the i-th vector of observations (features). The task goal is: "considering the Istat time series, forecast what the total harvest of apples and pears will be in 2017 for the provinces of the Calabria, Friuli Venezia Giulia, and Abruzzo Italian regions".
The problem faced in this task exploits the time series referring to the apple and pear crops in the previous years (2006-2016), keeping the 2017 time series aside for comparison with its simulation.
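The forecasting setup can be sketched as follows: fit low-order polynomials to a hypothetical 2006-2016 annual harvest series and extrapolate to the held-out year 2017. The harvest figures below are invented for illustration; numpy's polyfit/polyval are assumed, consistent with the 'polyval'-based regression blocks described later in the paper.

```python
import numpy as np

# Hypothetical annual harvest series (one crop, one province), 2006-2016
years = np.arange(2006, 2017)
harvest = np.array([51.2, 50.8, 52.5, 53.1, 52.0, 54.4,
                    55.0, 54.1, 56.3, 57.2, 56.8])

# Fit low-order polynomials (degree 1 = linear regression) and extrapolate
# to 2017, the year kept aside for comparison; years are shifted so the
# fit is well-conditioned
for degree in (1, 2):
    coeffs = np.polyfit(years - 2006, harvest, deg=degree)
    forecast_2017 = np.polyval(coeffs, 2017 - 2006)
    print(f"degree {degree}: forecast 2017 = {forecast_2017:.1f}")
```

Comparing the extrapolated value against the real 2017 figure, as the task prescribes, gives a direct measure of each model's forecasting quality.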
Task 2-Prediction of LAI Values (CNR Scientific Dataset)
Experimental design: In this task, the predictive goal relies on scientific and biological information about plants and crops, estimating the LAI (leaf area index) coefficient.
This experiment is interesting because, for each plant species, the LAI value has been recorded in a discontinuous and non-constant way and over different time periods (1997-1998 or 1999-2000, and so on), configuring itself as a problem of missing data reconstruction and evaluation, suitable for the linear/polynomial regression and neural network models.
The culture types that are the objects of the experiments are the following:
1. Eggplant and Mulched ('Pacciamata') Eggplant for the year 2003
The task goal is: "predict the LAI attribute values exploiting the scientific CNR agrarian data, constituted by often incomplete and fragmented temporal series".
The structure of this dataset is peculiar because it contains information and factors coming from literature and empirical studies; in order to evaluate and compare the predictive performances, the LAI values will not be used directly, but rather the RAE (relative absolute error) on the predicted value, as in (5); this metric has been chosen because it represents a percentage that does not depend on the significant figures of the value on which forecasts are made. In (5):
• N indicates the number of data points on which the prediction is made, from which it is possible to evaluate the deviation between a predicted value and the real one
• θ_i is the real value in the i-th row of the test set
• θ̂_i is the predicted value for the i-th row of the test set

Task 3-Reconstruction of Sensor Data (IoT Sensors Dataset)
With this task, the dataset containing values and attributes coming from smart sensors and IoT devices will be used. As these data are very granular and plentiful, they are useful in demonstrating the reconstruction of corrupted or ambiguous sensor data (recovering); it is also interesting to understand how the training attributes influence the model performances.
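The RAE metric used to score the LAI predictions can be sketched as below. Since the formula in (5) is not reproduced here, one plausible reading is assumed: the mean per-point absolute deviation relative to the real value, expressed as a percentage; the LAI values are invented for illustration.

```python
import numpy as np

def relative_absolute_error(actual, predicted):
    """Mean per-point relative deviation, as a percentage: a scale-free
    score that does not depend on the magnitude of the predicted values
    (one plausible reading of (5), not the paper's exact formula)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(predicted - actual) / np.abs(actual))

# Hypothetical real vs. predicted LAI values for one crop
lai_real = np.array([0.8, 1.5, 2.9, 3.4, 2.1])
lai_pred = np.array([0.9, 1.4, 2.7, 3.6, 2.0])
print(f"RAE = {relative_absolute_error(lai_real, lai_pred):.1f}%")
```

Being a percentage, the score allows predictions on crops with very different LAI magnitudes to be compared directly.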
The solar radiation incidence attribute values (r_inc) come from the panels mounted on each monitoring station [44,45] and will be exploited for this experiment.
The task goal is: "consider the r_inc attribute and predict its values at 00:00 (hour of maximum solar incidence) from monitoring stations 173 and 186; in order to evaluate the model performances, retrieve the contribution of the remaining attributes".
The experimental setup considers different attribute combinations in the training session to retrieve the amount of their contribution to the model performance.

Task 4-Further Methods for Sensor Data Reconstruction (IoT Sensors Dataset)
This task is a variant of the previous one, applying further machine learning methods while keeping all the hypotheses of the previous task.
The K-nearest neighbors algorithm (KNN) is a non-parametric method used for classification and regression. The training examples are vectors in a multidimensional feature space, each with a class label, and the training phase of the algorithm simply consists of storing the feature vectors and class labels of the training samples; in the classification phase, k is a user-defined parameter, and an unlabeled vector is classified by assigning the most frequent label among the k nearest training samples (Figure 2), found by computing a vector distance (Euclidean (6), Manhattan (7), etc.); the classifier can be viewed as assigning the k nearest neighbors a weight 1/d and all the others a weight of 0.
A decision tree (DT or D-tree) is a machine learning classifier based on the tree data structure that can be used for supervised learning with a predictive modeling approach; each internal node (split) is labeled with an input feature, while the arcs that link a node to its children are labeled with a condition on the input feature that determines the descending path leading from the root node to the leaves (nodes without children). Considering the simplest binary tree (Figure 3), a node can have at most two children; each leaf is labeled with a class name in a discrete set of values or with a probability distribution over the classes that predicts the value of the target variable. In this way, the decision tree classifier is characterized by the following:
• nodes (root/parent/child/leaf/split) and arcs (descending, directed)
• no cycles between nodes
Also, for this classifier, the split functions are very important: classification (or clustering) tree analysis consists of the prediction of the class to which the data belong, P_n(c), where c is a class label, with a process called recursive partitioning repeated recursively on each split subset. The algorithms that navigate and build decision trees usually work top-down by choosing, at each step, the value of the variable that best splits the set of items; to decide which feature to split at each node, the information gain value is used.
Experimental design: It is the same as that of Task 3, but employing the decision tree, KNN (K-nearest neighbors), and polynomial regression models.
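The KNN and decision tree models described above can be compared on a sensor-reconstruction problem as sketched below. The attribute names follow the dataset (r_inc predicted from temperature and humidity), but the values are synthetic stand-ins; scikit-learn's distance-weighted KNN regressor matches the 1/d weighting mentioned above, and the tree depth is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for station records: r_inc as a noisy function of
# temperature and humidity (values fabricated for illustration)
n = 300
temp = rng.uniform(5, 35, n)
humidity = rng.uniform(20, 95, n)
r_inc = 900 - 4.0 * humidity + 6.0 * temp + rng.normal(0, 15, n)
X = np.column_stack([temp, humidity])

# Cross-validated comparison, echoing the cross-validation blocks of the tasks
models = [("KNN (k=5, 1/d weights)",
           KNeighborsRegressor(n_neighbors=5, weights="distance")),
          ("Decision tree (depth 6)",
           DecisionTreeRegressor(max_depth=6, random_state=0))]
for name, model in models:
    score = cross_val_score(model, X, r_inc, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")
```

Re-running the loop with attribute subsets, as the experimental setup prescribes, exposes how much each training attribute contributes to the reconstruction quality.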

Task 5-Detection of Faulty Monitoring Stations by Sensor Values (IoT Sensors Dataset)
The task is oriented towards the detection of hardware malfunctions, which occur, for example, when data have plausible values but are very different from those gathered by the sensors of adjacent monitoring stations; prompt recognition of such anomalous variations is very important for a business company in order to avoid future errors.
The main step is the localization of neighboring monitoring stations, achieved by clustering them within an area of fixed amplitude (Figure 4) based on their distances, calculated with the Euclidean distance on their altitude, longitude, and latitude geographical attributes.

The task goal is: "perform the geographical clustering of the monitoring stations by a fixed area amplitude and, considering the solar incidence attribute r_inc with a threshold value for its variation, identify all the anomalies as faulty stations".
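The neighborhood-and-threshold idea above can be sketched as follows; the station IDs, coordinates, radius, and threshold are all hypothetical, chosen only to make the anomalous station stand out.

```python
import numpy as np

# Hypothetical stations: (latitude, longitude, altitude in km, so the three
# coordinates are on comparable scales) and one daily r_inc reading
stations = {
    101: ((45.10, 7.60, 0.25), 810.0),
    102: ((45.12, 7.63, 0.27), 805.0),
    103: ((45.09, 7.58, 0.24), 560.0),   # plausible value, but far from its neighbours
    201: ((41.90, 12.50, 0.05), 880.0),  # isolated station, no nearby peers
}

RADIUS = 0.5       # fixed area amplitude (Euclidean, in coordinate units)
THRESHOLD = 0.20   # flag stations deviating > 20% from the neighbourhood mean

def faulty_stations(stations):
    flagged = []
    for sid, (pos, value) in stations.items():
        # Neighbours = other stations within the fixed radius
        neigh = [v for s, (p, v) in stations.items()
                 if s != sid and np.linalg.norm(np.subtract(pos, p)) <= RADIUS]
        if neigh and abs(value - np.mean(neigh)) / np.mean(neigh) > THRESHOLD:
            flagged.append(sid)
    return flagged

print(faulty_stations(stations))
```

Stations without neighbours inside the radius are simply skipped, since there is no local reference against which to judge their readings.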
Experimental design:

Software Tools
In this study, RapidMiner Studio has been used: a visual workflow design tool developed in Java, used to manage big data from the pre-processing phases to the machine learning algorithms applied to data coming from heterogeneous sources. Being open and expandable through extensions allows one to integrate its visual part with code such as Python scripting to increase its power, as has been done in this work to accomplish some functions.
The use of this powerful but accessible tool, thanks to a friendly interface and fast production times, brings a double advantage:
1. it allows quick replication, reuse, and customization of the workflows and/or their components (blocks)
2. it allows a soft and friendly introduction in small/medium business and industrial environments
Owing to limits of space and legibility, it is difficult to insert the complete workflows for all the tasks in a readable way; moreover, many activity blocks contain sub-workflows. In Figure 5, an example is provided: the workflow depicting the general structure of Task 3 (with three sub-processes not expanded).
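The Python script blocks mentioned above follow the calling convention of RapidMiner's Python Scripting extension, in which the operator invokes an rm_main function that exchanges pandas DataFrames; the sketch below illustrates the pattern with a toy linear-regression script (the column names and data are assumptions, not the authors' actual blocks).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

def rm_main(data):
    """Entry point called by RapidMiner's Execute Python operator: the
    incoming example set arrives as a pandas DataFrame and the returned
    DataFrame flows to the next block."""
    x = data["year"].to_numpy(dtype=float)
    y = data["harvest"].to_numpy(dtype=float)
    coeffs = np.polyfit(x, y, deg=1)            # linear predictive regression
    data["prediction"] = np.polyval(coeffs, x)
    data["mae"] = mean_absolute_error(y, data["prediction"])
    return data

# Standalone usage with toy data (outside RapidMiner)
df = rm_main(pd.DataFrame({"year": [2014, 2015, 2016],
                           "harvest": [50.0, 52.0, 53.9]}))
print(df[["prediction", "mae"]].round(2))
```

Because the script is an ordinary function over DataFrames, it can be tested outside the visual workflow and then dropped into a block unchanged.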

Below is the description of the workflows employed for each task. The block names are explanatory and a brief description is provided; when not specified, the parameter values are the default ones.

Task 1 (Istat Dataset)

1. Filtering <province>: selects one or more Italian provinces from the time series
2. Filtering <crop>: selects one or more crop types from the time series
3. Prediction Neural Network NN (apple/pear): two sub-processes containing the predictive model (neural network)
4. Union <results>: combines the results of the prediction models

[Prediction NN] components:
1. Set_role: defines the attribute on which to make the prediction
2. Nominal_to_Numerical: transforms the nominal values into numerical ones
3. Filter <missing values>: divides the dataset into missing values and present values
4. Filter values = 0: selects the examples with a reliable value
5. Multiply: takes an object from the input port and delivers copies of it to the output ports
6. Cross Validation + NN: a sub-process that applies the model and makes predictions
7. Linear predictive regression: developed as a Python script, where the prediction model is performed through the numpy 'polyval' function, with the sklearn 'mean_absolute_error' used to calculate the performances
8. Label <crop>: selects the attributes useful for the representation of the results

[Cross validation + NN] components:
1. Neural Net: at each cycle, it is trained with the training set coming from the cross validation. Parameters are as follows: two hidden layers fully connected, training_cycles = 500, learning rate = 0.3, momentum = 0.2, epsilon error = 1.0 × 10^-5
2. Apply_Model: at each cycle, it is applied to the test set by the cross validation
3. Performance: measures, for each fold, the errors and performances
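As a rough external check, the [Cross validation + NN] sub-process can be approximated outside RapidMiner; the sketch below is a stand-in using scikit-learn's MLPRegressor on invented data, keeping only the parameter values listed above (the hidden-layer sizes and the synthetic attributes are illustrative assumptions, not the paper's actual operator settings).

```python
# Hypothetical scikit-learn stand-in for the RapidMiner "Cross Validation + NN"
# sub-process; hidden-layer sizes and data are illustrative, while the training
# parameters mirror the ones listed above.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 2))        # stand-in numeric attributes
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]              # stand-in crop amounts

model = MLPRegressor(hidden_layer_sizes=(10, 10),  # two fully connected hidden layers
                     solver="sgd", momentum=0.2,
                     learning_rate_init=0.3,
                     max_iter=500, tol=1e-5)
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
mae_per_fold = -scores                          # one MAE per fold
```

Here cross_val_score plays the role of the Cross Validation operator, and the per-fold MAE mirrors what the Performance block reports.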

Task 2 (CNR Scientific Dataset)
It has the same workflow structure as Task 1, with a "polynomial predictive regression" model implemented in a Python script block; it allows for the reconstruction and visualization by setting the polynomial degree in the 'polyval' function and exploiting the matplotlib 'poly1d' and 'plot' functions to draw the interpolated curves. Components:
1. Remove: removes and replaces missing and anomalous values
2. Filter <id_station>: selects data about the monitoring stations
3. Select_Attributes: composes (removing or adding) the attribute combinations that affect the predictive performances
4. Multiply: takes an object from the input port and delivers copies of it to the output ports
5. Linear and Polynomial Regression: a Python script block where the prediction model is performed through the numpy 'polyval' function, with the sklearn 'mean_absolute_error' used to calculate the performances; the polynomial degree is set in the 'polyval' function, and the matplotlib 'poly1d' and 'plot' functions are used to draw the curves
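The regression script blocks named above can be sketched as follows; this is a minimal reconstruction under the stated assumptions (numpy for the fit, sklearn 'mean_absolute_error' for scoring), with an invented toy series; the helper and its data are hypothetical, not the paper's actual script.

```python
# Minimal sketch of the linear/polynomial predictive regression script block:
# numpy's polyfit/polyval produce the model, sklearn's mean_absolute_error
# scores it; matplotlib's poly1d/plot (omitted here) would draw the curve.
import numpy as np
from sklearn.metrics import mean_absolute_error

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree; return coefficients, predictions, MAE."""
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    preds = np.polyval(coeffs, x_test)              # evaluate on the test abscissae
    return coeffs, preds, mean_absolute_error(y_test, preds)

# Toy series with a quadratic trend: degree 2 recovers it, degree 1 cannot.
x = np.arange(1.0, 9.0)
y = 0.5 * x**2 + 1.0
_, _, mae_poly = fit_and_score(x[:6], y[:6], x[6:], y[6:], degree=2)
_, _, mae_lin = fit_and_score(x[:6], y[:6], x[6:], y[6:], degree=1)
```

On such a growing trend the polynomial fit extrapolates far better than the linear one, which is the behavior reported for the LAI series in the Results section.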
Task 3 (IoT Sensors Dataset)
Components (the loading and cleaning of the IoT dataset follow the structure shown in Figure 5):
1. Filter <r_inc>: divides the dataset into the training set and the prediction test set
2. Remove values: removes missing and anomalous values
3. Cross Validation NN: a sub-process (see the Task 1 components); it is possible to delete the single cross validation to use the whole training set
4. Performance: measures, for each fold, the errors and performances

Tasks 4 and 5 (IoT Sensors Dataset)
Components:
1. Filter <year-province>: selects the datetime information
2. Select_Attributes: the numeric attributes are selected from latitude, longitude, and altitude
3. Data_to_similarity: measures the similarity of each example of the given ExampleSet with every other example (clustering parameters: mixed_measure = Mixed Euclidean Distance, kernel_type = dot)
4. Similarity_to_data: calculates an ExampleSet from the given similarity measure
5. Select <coordinates>: arranges the coordinates attribute for the monitoring stations
6. Select <station>: a Python script sub-process where the stations of interest are selected, any duplicated data are removed, and finally the data are arranged to make the comparisons
7. Select Attributes: isolates the datetime and r_inc attributes for each monitoring station
8. Generate_difference: generates a new column within the main table in which the differences between the r_inc values of each station are recorded
9. Filter difference: selects the differences with a significant value based on a criterion

Where a correlation matrix is requested on a dataset that consists of few components, the data supplied to the "Correlation_matrix" block are filtered as in Figure 6: the pipeline reads the input dataset (the table produced by the Generate_difference step), filters the categories (the clustered stations that are in the same area), selects the attributes (r_inc and its variation), and uses the correlation matrix block to visualize the results.

Results and Discussion
After the task design in Section 2, the consequent experimental results and their discussion are presented here.
For the error rates of the classifiers, the percentage value contained in the tables identifies the percentage prediction error calculated with (8) on the difference between the real value v and the value p obtained from the predictive model. In this way, for example, if the real value is 3 and the model predicts 7, the error reported in the table will be (|3 − 7|/3) × 100 ≈ 133%.
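Expressed as code, the error measure of (8) is simply:

```python
# Percentage prediction error of (8): |v - p| / v * 100,
# with v the real value and p the predicted one.
def pct_error(v, p):
    return abs(v - p) / v * 100.0
```

For the example above, pct_error(3, 7) evaluates to about 133.3%.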

Task 1-Forecast of Future Data (Istat Dataset-Results)
To train the predictive models, a 10-fold cross validation is applied, considering each series ten times; in ten iterations, nine series are used in turn for training while the remaining one is used for testing, optimizing the model's internal parameters. The best trained model is also employed to predict new data, comparing them with the unused 2017 series to assess its actual ability to process statistical time series.
In Table 4, the experimental results on the apple and pear crop amounts, with the percentage error for the three predictive models, are depicted; for the provinces of Friuli Venezia Giulia, Abruzzo, and Calabria, the error mean values show that the neural network model outperforms linear regression both on the apple crop (9.19% vs. 30.77%) and on the pear one (19.36% vs. 39.11%).

As the Istat dataset features large and complete time series, the neural network model best fits the predictive task; Table 5 reports the predicted and real values for the total crops of the L'Aquila province. The real values are very close to the predicted ones: for apples the difference is less than 2% and for pears less than 4.5%, confirming the suitability of this technique on this type of dataset.
Table 5. Task 1: a comparison example between the real values and their neural network model prediction for the apple and pear total crops for the Italian province of L'Aquila on the Istat dataset.

Method: NN
Italian Province | Apple: Real Value | Apple: Predicted Value | Pears: Real Value | Pears: Predicted Value
L'Aquila | 45,900 | 45,000 | 3925 | 3750

Task 2-Comparison between Machine Learning Algorithms on Missing Data (CNR Dataset-Results)
For this task, the predictive errors depicted in Table 6 highlight that the polynomial model best fits the LAI value prediction for the three considered cultures. This outcome can be explained by the nature of these scientific values, the temporal discontinuity with which they have been gathered, and their small amount; the gap between the polynomial model and the others is very large, highlighting the simplicity and the advantage of using this standard yet performing technique.
The plot comparison between the linear and polynomial predictive models on this scientific dataset is in Figure 7, where the polynomial interpolation (green plot) shows how the predictive model approximates the peculiar growing trend (blue plot) and can fit unknown incoming data very well. The higher-degree mathematical model is better than the others, both when the training data cover a single year and when they cover three years.
Task 3 compares the machine learning performances obtained when using training time intervals of different sizes; it also features a sub-task aiming to predict future values. The prediction errors shown in Tables 7-9 are measured considering two distinct monitoring stations (173 and 186) and finally both together, using as training data the series from 1 to 30 January and making the prediction for the 31st day.
In almost all the experiments, the neural network performance is worse than that of linear regression; one reason is certainly the small amount of training data available in the one-month temporal series. Results are also reported for a polynomial regression model with a function of higher degree than the linear one, but they are again poor and very far from the others. Unlike in the previous task, the time-series data are few but temporally complete and well organized, so the fastest and most resource-cheap model, the linear one, performs best.
It is also interesting to evaluate how the attributes influence the performances: for the neural network, relative humidity is the single factor determining good results in two of the three experiments (82.15% and 51.73%), while considering only the temperature leads to worse ones; conversely, for the linear predictive model, which is the best technique, relative humidity must be used together with temperature during the training phase to produce the best results in all three experiments. In Figure 8, the real values (red dots) and their linear prediction (blue step plot) are shown for the thirty-day training; for this dataset and with these training intervals, even the best model is still insufficient and hardly fits the new values.
An alternative experiment was performed avoiding the cross-validation training mode because, in this task, with time series, it is better not to mix past temporal data with future ones, in particular when predicting short-term values from few past ones; the temporal coherence between the training and test sets was maintained, and more data coming from both stations were used.
From Table 10, it emerges that the neural network model regains the performance supremacy when predicting the value for 31 January while trained with the cumulative data of the past thirty days (1 January to 30 January); the same holds when considering only the previous five days (26 January to 30 January). However, when using only the previous and the following four days to predict the central one (5 January), the polynomial model wins over the linear one (9.37% vs. 13.83%).
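A sketch of this window-based setup, with an invented r_inc series in place of the stations' logs, might look like this (a linear model fitted on days 1-30 and evaluated on day 31):

```python
# Fit only on the past window (days 1-30) and predict day 31, preserving
# temporal order instead of shuffling folds; the r_inc series is synthetic.
import numpy as np

days = np.arange(1, 31)
r_inc = 2.0 + 0.05 * days                  # invented, trend-only radiation series
coeffs = np.polyfit(days, r_inc, 1)        # linear model trained on the window
pred_31 = np.polyval(coeffs, 31)
true_31 = 2.0 + 0.05 * 31
pct_err = abs(true_31 - pred_31) / true_31 * 100.0
```

The same scheme, shifting or shrinking the window, reproduces the five-day and four-day variants described above.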
In this way, a linear regression model appears preferable when predicting a single value whose previous and following values are known, using a small amount of data for training, while when the data are very few, the polynomial one is the slightly better choice.

Task 4-Reconstruction of Missing Data from Monitoring Stations Exploiting the Decision Tree, and Polynomial and K-Nearest Neighbors (KNN) Models (IoT Sensors Dataset-Results)
Maintaining the experimental design seen previously, Tables 11-13 show the performance error for the two monitoring stations, first separated and then united, when employing the decision tree and K-nearest neighbors prediction models.
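As a hedged illustration of this comparison (not the paper's workflow), the two models can be contrasted on synthetic temperature/humidity data with scikit-learn:

```python
# Compact stand-in for the Task 4 comparison: decision tree vs. KNN regressors
# predicting r_inc from [temperature, relative humidity]; data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
X = rng.uniform(size=(120, 2))             # scaled [temperature, relative humidity]
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.05, 120)

X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]
maes = {}
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("KNN", KNeighborsRegressor(n_neighbors=5))]:
    maes[name] = mean_absolute_error(y_test, model.fit(X_train, y_train).predict(X_test))
```

Which of the two wins depends, as in the tables, on the amount and regularity of the training data.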
It emerges that, in almost all the experiments, the decision tree model reaches the best prediction performance, while a polynomial model with a function of degree higher than two brings worse results. Regarding the influence of the attributes on the performances, for the decision tree the relative humidity together with the temperature determines the best results, while considering the temperature alone leads to a performance deterioration. As in Task 3, other prediction sub-tasks were performed without the cross-validation training mode so as to maintain the temporal coherence of the data when making value predictions (Table 14); again, the model that worked best with a large training interval (DT) is exceeded by the other one (KNN), while, when considering very few training data (four days), the polynomial one is the slightly better choice (9.37% vs. 16.16%). Exploiting the monitoring station attributes altitude, longitude, and latitude in the IoT Sensors dataset, a clustering based on the Euclidean distance builds groups with similar geographic attributes.
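The geographic clustering step can be sketched with scikit-learn's agglomerative clustering; the station coordinates, readings, and the tolerance threshold below are invented for illustration:

```python
# Illustrative sketch of the station clustering and difference check: stations
# are grouped by geographic proximity, then the spread of their r_inc readings
# is used to flag a possible faulty sensor. All values here are invented.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# [altitude, longitude, latitude] for five imaginary stations
coords = np.array([[650.0, 13.40, 42.35],
                   [655.0, 13.41, 42.36],
                   [652.0, 13.39, 42.34],
                   [20.0, 12.50, 41.90],
                   [22.0, 12.51, 41.91]])
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(coords)

# r_inc readings for the first cluster's stations; the third one "fails".
r_inc = np.array([3.7, 3.6, 0.57])
difference_max = r_inc.max() - r_inc.min()
suspect = difference_max > 1.0   # tolerance threshold (illustrative)
```

A large difference_max within a geographically tight cluster is the signal exploited in Task 5 to flag a faulty sensor.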
In Table 15, there is an example with a cluster made by three monitoring stations (ID = 394, 396, and 397), showing the log of their r_inc value and its calculated global difference; because the difference_max calculated on their r_inc attribute for June 2017 is very high (3.740 − 0.570 = 3.170), well beyond a tolerance threshold of 30/40, it is plausible that station 396 suffered a fault of its solar radiation sensor from 9 June 2017. The Correlation Index between two statistical variables is a metric that expresses a linear relation between them; given two statistical variables X and Y, their correlation index is the Pearson product-moment correlation coefficient defined in (9) as their covariance divided by the product of their standard deviations:

ρ_XY = σ_XY / (σ_X σ_Y)    (9)
where σ_XY is the covariance between X and Y (a measure of how much the two variables vary together) and σ_X, σ_Y are their standard deviations (a statistical dispersion index estimating the variability).
The coefficient always assumes values between −1 and 1; an absolute value greater than 0.7 evidences a strong linear correlation, which can be direct (positive sign) or inverse (negative sign). The correlation indexes of n variables (or attributes) can be presented in a correlation matrix, a square matrix of dimension n × n with the variables on the rows and columns; the matrix is symmetrical, that is, ρ_ji = ρ_ij, and the coefficients on the main diagonal are 1.
Considering the previous cluster of three monitoring stations, the correlation matrix in Table 16 extends the correlation coefficient to a set of factor pairs, useful to observe whether there are other correlated attributes in addition to the geographical ones. Considering the attributes described in Task 3, the solar incidence r_inc is strongly (inversely) correlated with the minimum relative humidity (RH_min, −0.739) and weakly with the maximum temperature (+0.351). There is also a predictable mild correlation between the temperature and the humidity values.
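The correlation matrix itself can be reproduced with numpy's corrcoef; the series below are invented stand-ins for r_inc, RH_min, and T_max, built so that the first pair is strongly inversely related and the second only weakly related, as in Table 16:

```python
# Pearson correlation matrix via numpy; rows of the stacked array are the
# variables (r_inc, RH_min, T_max), all synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
r_inc = rng.uniform(0.5, 4.0, 100)
rh_min = 90.0 - 10.0 * r_inc + rng.normal(0, 1.0, 100)   # strong inverse relation
t_max = 15.0 + 2.0 * r_inc + rng.normal(0, 4.0, 100)     # weak direct relation

corr = np.corrcoef(np.vstack([r_inc, rh_min, t_max]))    # 3x3 symmetric matrix
```

The diagonal is 1 by construction, and corr[0, 1] lands below −0.7, the strong-correlation threshold discussed above.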

Conclusions
The study presented in this work introduces practical, cheap, and easy-to-develop tasks that are useful to increase the productivity of an agricultural company, deepening the study of the smart farm model; technological progress in a field that needs control and optimization can really contribute to saving environmental resources, respecting business and international laws, satisfying consumer needs, and pursuing economic profits. The three different data sources, with a special eye for the IoT sensors dataset, have been exploited using machine learning techniques as well as more standard statistical ones. The first task shows that the forecast of apple and pear total crops on the Istat dataset can be achieved with a neural network model with success rates close to 90%, while the second task shows that, for the CNR scientific data, polynomial predictive and regression models are more suited given the nature of the dataset.
Tasks 3 and 4 address the same goal with different machine learning methods on a pure IoT sensors dataset, showing that the decision tree model works very well; that there are specific environmental factors, coming from the sensor hardware, that affect the model performances; and that short-term future values can be predicted from few past data using statistical regressions. It cannot be left out, however, that in cases with very few data, statistical models such as linear or polynomial regression still maintain the best predictive performances; moreover, the detection of faulty monitoring stations in Task 5 successfully employs a clustering of the stations based on their geographic location, useful to detect hardware faults.
The proposed real cases highlight the need to integrate management and data scientists: IoT systems require engineering and diffusion investments that only a wise and visionary management can favor in small/medium industries; moreover, the necessity emerges to invest in skills and knowledge to profitably employ the IoT paradigm at higher levels.
The main reason for facing the proposed tasks with different machine learning techniques is that the work has been exploratory and highly experimental; Information Fusion, together with the related optimization of methods and results, is expected in future work, where new experiments and tasks exploiting other sensor types and datasets will be designed and performed to meet the great heterogeneity of agri-companies and of the hardware sensor market. The intelligent systems developed with machine learning algorithms (supervised and not) have to manage fault tolerance and hardware malfunction prediction, and they therefore require the design of integrated tools, user interfaces, and machines that easily adapt to contexts subject to natural events that are not easily predictable, such as the agricultural one. Finally, smart systems that provide real-time suggestions and make long-term forecasts based on user choices and preferences must be studied and tested.

Figure 1 .
Figure 1. The datasets used for this study: National Research Council (CNR) scientific dataset, Istat statistical dataset, and the industrial Internet of Things (IoT) Sensors dataset.


Figure 2 .
Figure 2. Two consecutive steps of the K-nearest neighbors (KNN) algorithm (K = 3) in a bi-dimensional feature space; (a) a blue item has ambiguous clustering; (b) the green cluster is assigned to it according to its number and proximity.

• feature vector v ∈ R^n
• split function f_n(v): R^n → R
• thresholds T_n ∈ R^n
• set of classes (labels) C
• classifications P_n(c), where c is a class label

Figure 3 .
Figure 3. A (binary) decision tree used to classify and predict values with numerical features.


Figure 4 .
Figure 4. Task 4: the monitoring station clustering brings together geographically close sensors that are expected to record very similar data values.


Figure 5 .
Figure 5. The workflow blocks on the IoT dataset featuring the two predictive models for Task 3: the IoT sensors dataset is loaded, invalid and missing values are removed, filters find the monitoring stations and the combinations of their attributes, and finally the two machine learning sub-process blocks execute the models.


Figure 6 .
Figure 6.A workflow for a correlation matrix to visualize the attributes magnitude for Task 5, where the input dataset is the result of the monitoring stations clustering.


Figure 7 .
Figure 7. Task 2: plot comparison between real-values (red dots), the linear (blue), and polynomial (green) predictive model on the CNR scientific agrarian dataset.


3.3. Task 3-Reconstruction of Missing Data from Monitoring Stations Exploiting Neural Network, and Linear and Polynomial Regression Techniques (IoT Dataset-Results)

Figure 8 .
Figure 8. Task 3: sensor real values (red dots) and their still insufficient linear predictive model (blue step lines) employing a training time series of thirty days.


Table 1 .
Details about culture time-series in the Istat dataset.


Table 3 .
Details of the Internet of Things (IoT) sensors dataset.
linear regression • Training set: 10 years (2006-2016) time series (pear and apple total crop, in the Friuli Venezia Giulia, Abruzzo, and Calabria Italian provinces) • January to 30 January 2018 (30 days) by the stations 173 and 186 • Training mode: 10-fold cross validation using five combinations of the attributes r_inc, latitude, longitude, temperature, humidity, and rainfall; performed with data from the distinct stations and after from both • Results: prediction percentage error as the mean of that in each cycle for the r_inc attribute • Training mode (2): whole data from 1 January to 30 January 2018, all the six attributes, both stations • Results (2): prediction percentage error for the future value of the r_inc attribute on 31 January 2018 • Training mode (3): whole data from 26 January to 30 January 2018, all the six attributes, both stations • Results (3): prediction percentage error for the future value of the r_inc attribute on 31 January 2018 • Training mode (4): whole data from 1 January to 9 January 2018 leaving out the 5 January, all the six attributes, both stations • Results (4): prediction percentage error for the future value of the r_inc attribute on 5 January 2018 2.2.4.Task 4-Reconstruction of Missing Data from Monitoring Stations Exploiting the Decision Tree, and Polynomial and K-Nearest Neighbors (KNN) Models (IoT Sensors Dataset)

Table 4 .
Task 1: apple and pear crop prediction error exploiting the neural network and the linear and polynomial predictive models on the Istat dataset.



Table 6 .
Task 2: comparison of the prediction error for the cultures' leaf area index (LAI) values exploiting machine learning methods on the CNR scientific agrarian dataset.


Table 7 .
Task 3: prediction error of the sensor attribute r_inc coming from monitoring station 173 using neural network, and linear and polynomial regression machine learning models on the IoT Sensors dataset.

Table 8 .
Task 3: prediction error of the sensor attribute r_inc coming from monitoring station 186 using neural network, and linear and polynomial regression machine learning models on the IoT Sensors dataset.

Table 9 .
Task 3: prediction error of the sensor attribute r_inc coming from both the 173 and 186 monitoring stations using neural network, and linear and polynomial regression machine learning models on the IoT Sensors dataset.

Table 10 .
Task 3: prediction error of the sensor attribute r_inc coming from both the 173 and 186 monitoring stations using neural network, and linear and polynomial regression machine learning models trained with different time-series intervals on the IoT Sensors dataset.


Table 11 .
Task 4: missing data prediction error of the sensor attribute r_inc from monitoring station 173 using decision trees, KNN, and polynomial machine learning methods on IoT Sensors dataset.

Table 12 .
Task 4: missing data prediction error of the sensor attribute r_inc from monitoring station 186 using decision trees, KNN, and polynomial machine learning methods on IoT Sensors dataset.

Table 13 .
Task 4: missing data prediction error of the sensor attribute r_inc from both monitoring stations 173 and 186 using decision trees (DT), KNN, and polynomial machine learning methods on the IoT Sensors dataset.

Table 14 .
Task 4: prediction error of the sensor attribute r_inc coming from both the 173 and 186 monitoring stations using decision tree, KNN, and polynomial regression machine learning models trained with different time-series intervals on the IoT Sensors dataset.

Table 15 .
Task 5: a cluster of three monitoring stations where the high value of difference_max on the r_inc attribute indicates a hardware sensor issue from June 2017 for station 396.

Table 16 .
Task 5: the correlation matrix for the clustering attributes magnitude.