Anomaly Detection on Data Streams for Smart Agriculture

: Smart agriculture technologies are effective instruments for increasing farm sustainability and production. They generate many spatial, temporal, and time-series data streams that, when analysed, can reveal several issues on farm productivity and efﬁciency. In this context, the detection of anomalies can help in the identiﬁcation of observations that deviate from the norm. This paper proposes an adaptation of an ensemble anomaly detector called enhanced locally selective combination in parallel outlier ensembles (ELSCP). On this basis, we deﬁne an unsupervised data-driven methodology for smart-farming temporal data that is applied in two case studies. The ﬁrst considers harvest data including combine-harvester Global Positioning System (GPS) traces. The second is dedicated to crop data where we study the link between crop state (damaged or not) and detected anomalies. Our experiments show that our methodology achieved interesting performance with Area Under the Curve of Precision-Recall (AUCPR) score of 0.972 in the combine-harvester dataset, which is 58.7% better than that of the second-best approach. In the crop dataset, our analysis showed that 30% of the detected anomalies could be directly linked to crop damage. Therefore, anomaly detection could be integrated in the decision process of farm operators to improve harvesting efﬁciency and crop health.


Introduction
Agriculture is an industry that has greatly benefited from recent advances in sensor technology, data science, and machine-learning approaches. These innovations are in response to the environmental and population challenges that our society is facing where large global agriculture production increases are required to feed a growing population. Agricultural crop production is dependent on the management of elements such as soil, water, and pests, and climate conditions and unforeseen hazards for sustainable agricultural output [1]. Pesticide usage in agriculture is a major issue on a global scale both in terms of the environment, and the health and safety of farmers [2]. Precision agriculture is a strategic approach that collects, processes, and evaluates temporal, spatial, and individual data, and other types of information in order to support managerial decisions on the basis of estimated variability for increased the resource-utilisation efficiency, productivity, quality, profitability, and stability of agricultural production.
Anomalies are data points or patterns in data that do not correspond to normal behavior. Anomaly detection is a crucial approach for detecting significant occurrences in a variety of applications. It is the process of identifying observations that deviate from the norm. Anomalies can be classified as point anomalies, contextual anomalies, or collective anomalies [3]. Individual data elements that are inconsistent or anomalous in comparison to all other data elements are called point anomalies. Data elements that are regarded to be odd or abnormal in a given context are referred to as contextual or conditional anomalies.
A data element is a contextual anomaly if its behavior attributes differ from the behavior attributes of a selection of data elements with the same contextual attributes (e.g., time of day, season, or location). A collection or sequence of linked data elements that are inconsistent with the rest of the dataset is referred to as a collective or group anomaly. Individual data elements in collective anomalies may not be anomalous on their own, but when joined with additional elements, they constitute a group anomaly.
Anomalous conditions in a farm environment were detected in [4] by the application of linear regression to sensor data (soil temperature, humidity, and electrical conductivity, light intensity, and air temperature and humidity). Anomalous data may indicate that the agricultural environment is not conducive for crop growth. Slopes of the regression line were used to determine the trends of each sensor datum over time. The outlier detection threshold was set using the interquartile range (IQR) technique, which was used to calculate the upper and lower bounds, such that any slopes exceeding the set thresholds were classified as anomalies. The proposed technique can be improved by combining environmental data with real-time images of the farm environment in the anomaly detection process. Comparing image variations obtained at different time periods, or recognising particular pest and disease features can also help to enhance the accuracy of evaluating abnormal agricultural conditions. This study differs from ours, in that we consider anomaly detection at the tail end of the production cycle during harvest, with our focus being the crop state and harvest efficiency of farm machinery. We also apply geometric mean and F1 measure as thresholds for the optimal performance of our model, as opposed to IQR techniques used in this study.
Deep-learning techniques were applied in agriculture with DeepAnomaly being proposed in [5]. DeepAnomaly uses a combination of background subtraction and deep learning for detection of obstacles and anomalies in an agricultural field. It takes advantage of the fact that the visual characteristics of an agricultural field are uniform, and obstacles are uncommon. The major goal is to find things that are far away and severely obscured, and unknown object categories. People, barrels, wells, and a distant home were among the observed obstacles. At greater distances, DeepAnomaly identified persons more precisely and in real time. DeepAnomaly applies image data in anomaly detection and is suitable for real-time applications, running on an embedded Graphics Processing Unit (GPU). Our approach, on the other hand, analyses numerical time-series data with the purpose of developing an approach that could be applied in general-purpose computing systems where memory and processing power are constrained. Support vector machine (SVM) and artificial neural networks (ANNs) were applied in [6] in the detection of the occurrence of regional frost disasters using tea frost cases. The study forecast multiple degrees of hazard for tea tree frost catastrophes, so that producers may both detect frost incidence and nonoccurrence using the model, and respond to varying levels of hazard. In [7], two machine-learning models, autoregressive integrated moving average model (ARIMA) and long short-term memory (LSTM), were used to find anomalies in time-series data. The temporal linkages between digital farm sensor data were considered in the development of a temporal anomaly detection technique. With the goal of finding abnormal data readings, LSTM and ARIMA models were evaluated on actual data gathered from deployed agricultural sensors. Employing LSTM improved anomaly detection prediction while also necessitating additional training time. LSTM and ARIMA approaches consider important training datasets and time for each application; therefore, they are not always usable in the case of spatial multivariate data streams with limited allowed computational power.
Precision agriculture grapples with the automated detection of crop plots with aberrant vegetation growth. Detecting crop patches that have significantly different phenological patterns than those of the rest of the crop might help farmers and agricultural cooperatives improve their agricultural practices, disease detection, and fertiliser management. The isolation forest unsupervised outlier detection algorithm was used in [8] to detect the most abnormal wheat and rapeseed crop parcels within a growing season using synthetic aperture radar (SAR) and multispectral images acquired using Sentinel-1 and Sentinel-2 satellites. Heterogeneity problems, growth anomalies, database inaccuracies, and nonagronomic outliers were the four major categories of studied anomalies. Experiments revealed that using both Sentinel-1 and Sentinel-2 characteristics in anomaly detection resulted in the discovery of more late-growth abnormalities and heterogeneous parcels. Our approach adopts the use of isolation forest as one of the base detectors in the ensemble technique with a focus on detecting local and contextual anomalies during crop harvest.
Yield maps are crucial in farmers' decision making when it comes to precision farming, with many farmers having multiyear yield maps [9]. The use of spatial-trend and temporalstability maps can give valuable information on the spatial and temporal variability of a field. In [9], a contour yield map was generated by interpolating the data into a regular grid using a Kriging [10] linear unbiased estimator. Classification management maps were generated which highlighted homogeneous zones that could be targets for investigation, or sections where inputs might need to be increased or reduced on the basis of yield performance. To identify spatial and temporal trends, a quantitative analysis of yield data from four fields over a six-year period was conducted in [11] . Their previous work in [9] was modified to separate the temporal effects into two parts: interyear offset (total variations in yield from one year to the next ) and temporal variance (the degree of change across time). On the basis of obtained results, interyear variability had the greatest influence on overall yield, with spatial variability being significant within each year and cancelling out over time.
Precision farming benefits from the integration of high-accuracy GPS technology into farm machinery such as combine harvesters, grain carts, tractors, and trucks. Machine data generated by modern agricultural machinery provide crucial information that may be used to obtain valuable insights and forecasts about a farm's operating plan, present status, and productivity. Additionally, on the basis of predictive analytics, farm operators could make adaptive logistical decisions. Visualisation of GPS data from farm machinery can provide vital information on the operational status and efficiency of a machine. As an example, Figure 1 presents the movement trajectory of a tractor as it worked on a citrus field, and the followed path to and from the field. From the image, it is clear that a row was skipped during the working of the tractor, which can be detrimental to plant growth, especially during spraying and fertiliser application. A Kalman filter and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm were used to detect abnormal activity movement of combine harvesters in [13]. Easting and northing coordinates from GPS logs are iteratively applied to a Kalman filter on the basis of a constant velocity dynamical model. The amount of divergence of the filter estimations from actual data is calculated by using the Kalman filter residual. The Kalman filter residual is a measure of the smoothness of the combineharvester motion, with a greater residual value indicating a rapid shift in motion. The DBSCAN algorithm method is then applied to the engine load, vehicle speed, and calculated Kalman filter residual, which creates indicative clusters of the combine harvester's activity. It was feasible to identify clusters for uniform motion on the road, stationary spots, and nonuniform movements on the basis of the findings. Our approach focuses on infield harvest efficiency with special emphasis on local contextual anomalies, which differ from activity detection in this study.
Unsupervised anomaly-detection algorithms automatically identify deviating observations from unlabelled datasets, under some assumptions. Thus, model performance is determined by the different characteristics present in a dataset. Due to this specialisation of the models to different characteristics of observations and resulting in varying detection rates, it is a good idea to integrate the different detection abilities to produce a consensus judgement through the implementation of an ensemble. The combination of these individual abilities is more robust than relying on just one of them [23].
It is also possible that some anomaly-detection algorithms succeed at detecting certain subspaces while others may present poor detection performance [24], although the former's total accuracy may be lower than that of the latter. This implies that an algorithm can have a domain of expertise in a local domain, yet perform poorly over the entire feature space. It is critical to merge the domains of knowledge of each algorithm in order to reduce overall error [25]. The notion of data locality was first presented by the authors of [26], and it was later enhanced in [27] to perform dynamic classifier selection in local spaces of training points. In comparison to static strategies, such as those that simply vote on the basis of all base classifier outputs, strategies that dynamically pick and combine base classifiers produced better results.
The focus of this research is the identification of anomalies that impact harvest efficiency and those that can be linked to crop state and health during harvest. This study examines anomaly detection in agricultural fields by analysing GPS logs where techniques based on hypothesis testing on the occurrence of an anomalous movement pattern during infield harvesting are utilised. We propose the following contributions: we present a detailed state of the art on anomaly-detection techniques with a focus on smart agriculture; 2.
we propose a robust ensemble-based methodology for the detection of anomalies from data streams in smart agriculture context; 3.
we apply the proposed technique to a data stream of combine-harvester GPS logs with the aim of identifying anomalies that impact harvest efficiency of farm machinery; and 4.
we apply the proposed technique to crop data with the aim of identifying anomalies that reveal the state of the crop during harvest.
This paper is structured as follows. Section 2 presents the temporal datasets corresponding to our two cases of study (the first for harvest GPS logs, the second for the crop-damage dataset) and the preprocessing steps. They correspond to the used materials. It also introduces our anomaly-detection approach, called enhanced LSCP (ELSCP) and the performance indicators that we use. Section 3 shows the experiment results on the two datasets. Section 4 is dedicated to the conclusion.

Materials and Methods
In this study, we investigate anomaly detection in two fronts: application to the GPS logs of combine harvesters and detection of crop damage in crop dataset.

Scenario A: Combine Harvester GPS Logs
We used a GPS dataset [28], collected using Nexus 7 tablets on a farm in Colorado USA during the 2019 wheat-harvesting season. A GPS recorder application was kept running for each involved vehicle (combine harvesters, grain carts, and trucks) during harvesting. The combine harvester moved in a straight line at a near constant speed in a normal combine harvester operation (whether harvesting or traveling between fields). The sampling frequency of the data was 1 sample (GPS log) per second. In a typical harvesting scenario, there are more than one harvester working simultaneously in the field. This results in an overlap of the trajectories of the vehicles. A single trajectory is considered as the consolidation of all collected GPS logs belonging to a single combine harvester. For the purpose of this study, we used the movement trajectory of a single combine harvester recorded over a period of three days.
x is the longitude of the combine c k at time t, Trajectory mining was performed using Quantum GIS (QGIS), an open-source crossplatform desktop geographic information system application that supports viewing, editing, and visualising geospatial data. The key focus was on data points generated in the field during harvesting; therefore, visualisation and map matching were conducted using QGIS to ensure that GPS points were mapped onto a field. The second step was to extract only those points within a specific field using a bounding box. The extracted data exhibited normal harvesting behaviour with all data points below 4 mph, which is the maximal harvesting speed [29]. The data were further processed by removing all data points with zero speed, since our interest was on GPS points associated to actual grain harvest. To create an evaluation dataset, anomalies were introduced in the original dataset by varying the vehicle speed at specific points along the trajectory, such that, for specific sections of the trajectory, a sequential number of points had their speed increased by a random number between a given range of values above 4 mph. Figure 2 presents the trajectory of a combine harvester after the introduction of anomalies, with normal points in green and anomalies in red. To visualise the effect of the introduced anomalies, outlier analysis was performed using box plots (as shown in Figure 3) that show the presence of outliers in speed and accuracy attributes.
Further data analysis was conducted to establish the existence of any correlation among data attributes through the generation of a correlation heat map (as shown in Figure  4). The map shows a correlation between latitude and longitude, which was expected since these are spatial coordinates. There was also a high positive correlation between speed and class attribute, which is a true representation since introduced anomalies in the data affect the speed attribute.

Scenario B: Crop Dataset
The second scenario considers anomalies in crop data, where the aim was to link the detected anomalies to the state of the crop at the end of the harvest season (i.e., alive or damaged). The crop dataset [30] was collected by various farmers at the end of the harvest season spanning a period of three seasons. To simplify analysis, we assumed that all other factors such as variations in farming technique were controlled. This dataset has 10 variables as summarised in Table 1. In this study, we consider data for crop type zero (0), which we extracted from the global dataset. The first preprocessing step was to check for missing values in the data, which revealed 6397 missing values in the Number_Weeks_Used variable.The missing data are a result of farmers not recording data for the total number of weeks of pesticide use. Missing values were filled using an iterative imputation technique [31], which performs multivariate imputation by chained equations. In iterative imputation, each feature is modeled as a function of other features. Each feature is imputed one after the other, allowing for previously imputed values to be utilised as part of a model to predict future features.The next step was to check for the distribution of the data per attribute, which is summarised in Figure 5. Exploratory data analysis was performed on the crop data, so as to gain a better understanding of the data through the generation of a correlation heat map (as shown in Figure 6). On the basis of the heat map, Estimated_Insects_count, Pesticide_use_category, and Number_weeks_used were positively correlated with crop damage. This indicates that the state of the crop at harvest time was greatly impacted by the presence and number of insects, the type of pesticides used to control these insects, and the duration of exposure to the pesticides. On the other hand, Number_weeks_used was positively correlated with Estimated_Insects_count and Pesticide_use_category. High negative correlation existed, where Number_weeks_Quit was negatively correlated with Pesticide_use_category and Number_weeks_used. Further analysis was also performed to establish the presence of outliers in the dataset. Figure 7 shows that, except for soil type and season, all other attributes had outliers.

Proposed Approach
The previous section presented the data and preprocessing steps that were performed on them to prepare them for experiments. This section first presents our adaptation and extension of the LSCP approach, which was applied in our experiments on the preprocessed datasets, and, second, the performance indicators that we use.

Enhanced LSCP Algorithm (ELSCP)
Locally selective combination in parallel outlier ensembles (LSCP) [17] is an unsupervised detector that defines a local region around a test instance by using the consensus of its nearest neighbours in randomly selected feature subspaces. The implementation of LSCP uses an average of maximum strategy where a homogeneous list of base detectors is fitted to the training data; then, pseudo-ground truth is generated for each instance by selecting the maximal outlier score. By training base detectors on the entire dataset and emphasising data locality during detector combination, LSCP examines both global and local data relationships.The process flow of LSCP is summarised in Figure 8. We propose ELSCP, which is an extension of LSCP with improvements on how the local-region definition is extracted, and competent detectors are selected. To improve local-region definition, we propose to implement local neighbour search by using a ball tree k-nearest neighbours (kNN) algorithm with Harvesine distance metric. We also implemented Kendall correlation in the model selection and combination phase. The procedure for ELSCP starts with a heterogeneous list of base detectors being fitted to the training data. A pseudo-ground truth for each train instance is generated by taking the maximal outlier score from all the base detectors. For each test instance: 1.
Using a ball tree nearest neighbour algorithm with Haversine distance metric, the local region is defined to be the set of nearest training points in randomly sampled feature subspaces that occur more frequently using a defined threshold over multiple iterations.

2.
Using the local region, a local pseudo-ground truth is defined, and Kendall correlation is calculated between each base detector's training outlier scores and the pseudoground truth.

3.
A histogram is built out of Kendall correlation scores, and detectors in the largest bin are selected as competent base detectors for the given test instance.

4.
Using the correlation scores, the best detector is selected. The final score for the test instance is computed by using the average of the best detector's local region scores.
One of the inherent challenges in detection algorithms is how to handle variance and bias, especially in ensemble techniques. According to [32], variance is reduced by integrating heterogeneous base detectors by using techniques such as averaging, maximum of average, and average of maximum. A combination of all base detectors, on the other hand, may contain inaccuracies, resulting in greater bias. As described in Aggarwal's biasvariance framework, ELSCP combines variance and bias reduction. It improves variance reduction by introducing diversity through the initialisation of numerous base detectors with different hyperparameters. ELSCP also focuses on detector selection based on local competency, which helps identify base detectors with conditionally low model bias.
ELSCP implementation applies three base detectors, HBOS, MCD, and isolation forest (IForest). Histogram-based outlier score (HBOS) [20] is based on the assumption that features are independent, and hence computes outlier scores by creating histograms for each feature. Modelling the precise features of produced histograms and identifying deviations are used to identify anomalies. HBOS does not require data labeling and does not require any training or learning phase. With scoring-based detection, it also provides quick computation, which is important in our context since we process data streams. The isolation forest [18] anomaly-detection algorithm is unsupervised, does not assume the distribution of the data, and does not require labelled data. It also performs well on normal unbiased data with few noise points and is nonparametric [33]. IForest is constructed from a forest of random distinct isolation trees (itrees). IForest properties inform the choice for application in this study because anomalies are rare and different from normal instances; therefore, a tree can be constructed to isolate every instance.
Diagnostics are frequently used in combination with a standard fitting technique to detect outliers. Outliers, on the other hand, can impact these methods, causing the fitted model to miss deviating observations (masking effect) [34]. Robust statistics is a useful technique for discovering these deviating observations since it fits the majority of the data to the fitted model and then identifies data points that differ from the fitted model. In this study, we mitigate the masking effect by implementing minimum covariance determinant (MCD) [35], which is a very reliable estimate for multivariate anomaly identification as one of the base detectors in ELSCP. The determinant of a matrix and Mahalanobis distances are two multivariate feature statistics on which MCD relies. It searches the dataset for observations with the lowest possible determinant in the classical covariance matrix.

Performance Indicators
The balance between false positives and false negatives in anomaly-detection tasks is determined by the application case. False negatives (when no anomaly is discovered while one exists) are usually penalised far more severely than false positives are (a false warning). We use receiver-operating-characteristic (ROC) curves-that is, the true positive rate as a function of the false positive rate-to allow a complete evaluation that is independent of these application-specific trade-offs.
In order to evaluate the performance of the different approaches, we use the area under the curve of the receiver operating characteristic (AUC-ROC) [36] and the area under the curve of precision-recall (AUCPR) [37]. Both indicators are based on the following concepts: AUC-ROC receiver operating characteristics are TPR and FPR. The higher the AUC-ROC is, the better the detection is. AUC-ROC is the most popular evaluation measure for unsupervised outlier-detection methods [38].
AUCPR uses precision and recall. Precision is the fraction of retrieved instances that are relevant [39]. Recall or sensitivity is the ability of a model to find all relevant cases within a dataset. The F1 score (F measure) is the weighted average of precision and recall, and considers both precision (P) and recall (R) rates: AUCPR is a very robust assessment measure that works well for a wide range of classification tasks. It is particularly useful when dealing with imbalanced data when the minority class is more significant, such as in anomaly detection. Therefore, for the purpose of this study, we evaluate the performance of anomaly-detection algorithms using these indicators.

Results
In this section, we present obtained results by experimenting with anomaly-detection techniques on GPS logs and crop data. The algorithms were implemented in Python programming language by using the PyOD framework [41]. Since ELSCP is an ensemble framework, and we used three base detectors in its implementation: HBOS, MCD, and IForest. The reason for using heterogeneous base detectors was to attain unbiased overall detection accuracy with little variance by incorporating the capabilities of different base detectors while carefully combining their outputs to form a robust detector. ELSCP performance ofwas also compared against state-of-the-art anomaly-detection techniques.

Scenario A: Combine Harvester GPS Data
Preprocessed data were then applied to ELSCP, where the aim was to evaluate its performance by looking at the true-positive and false-positive rates by using a ROC curve. AUC-ROC was also computed as an evaluation measure of the model. The baseline AUC-ROC is usually set at a value of 0.5 which suggests that the detector does not discriminate between normal and anomalous data, 0.7 to 0.8 is considered to be acceptable, 0.8 to 0.9 is considered to be excellent, and more than 0.9 is considered to be outstanding [42]. ELSCP was trained with 67% of the data, and prediction performance tested with 33% of the data. ELSCP achieved a global AUC-ROC value of 0.998, which is outstanding performance (as shown in Figure 9). We also evaluated ELCSP performance in terms of precision and recall. The baseline AUCPR for the dataset was 0.0199 (based on a sample size of 51580 with 1026 true-positive anomalies). On the basis of obtained results, ELSCP achieved a global AUCPR of 0.972, which is very good performance (as shown in Figure 10).  In the second set of experiments, we identified the anomaly-detection algorithm that best estimated anomalous instances within GPS logs. We applied the data to ELSCP, baseline LSCP with variants of LOF as base detectors, LODA, COPOD, OCSVM, LOF, and CBLOF. Performance was evaluated by considering each detector's performance in terms of AUC-ROC, AUCPR, and F1 score. Obtained results from the experiments are summarised in Table 2. According to the results, our approach obtained the best scores with an AUC-ROC that was 6.4% better than that of the second-best approach COPOD (99.8% vs. 93.4%); an AUCPR of 97.2%, when the best scores of the other approaches (the one of OCSVM) were only 38.5%. The F1 score reflected the same conclusion (92.1% vs. 22.8%). Therefore, our approach was clearly the best, and could help farmers in real time during harvest in detecting potential issues.

Scenario B: Crop Damage
Preprocessed data were then applied to ELSCP where the aim was to evaluate performance by looking at the true-positive and false-positive rates using a ROC curve. AUC-ROC was also computed as an evaluation measure of the model. In the unsupervised outlier-detection setting, it is often problematic to rigorously judge the effectiveness of the algorithms, especially when it outputs an outlier score that is converted into a label on the basis of a threshold. If the threshold selection is too restrictive (to minimise the number of declared outliers), then the algorithm misses true outlier points (false negatives). On the other hand, too many false positives are generated if the algorithm declares too many data points as outliers. To deal with this issue, we performed threshold selection with the aim of the identifying the threshold on the ROC curve that resulted in the best balance between true-positive and false-positive rates for our imbalanced dataset. The Geometric mean (G-mean) was used to compute the optimal threshold, such that the best threshold was the one with the largest G-mean value. The slope of the ROC curve does not affect the performance of an algorithm, since ROC curves are used to make it easier to identify the best threshold from computed probabilities when making a decision on whether an instance is an anomaly or not.
ELSCP was trained with 67% of the data, and prediction performance tested with 33% of the data. The baseline area under the ROC curve is usually set at a value of 0.5, which represents a random guess scenario where the detector does not discriminate between normal and anomalous data. Figure 11 represents the ROC curve showing the best threshold value for ELSCP model. ELSCP achieved 0.641 AUC-ROC, G-mean value of 0.605, and the best threshold ROC value of 0.1404, which shows that it is able to detect anomalies resulting in crop damage. We also evaluated ELCSP performance in terms of precision and recall. The optimal threshold for the precision-recall curve, which focuses on the performance of a model on the positive (minority class/anomalies), was computed by optimising the F measure. The baseline AUCPR for the dataset was 0.17 (based on a sample size of 63589 with 10811 true-positive anomalies). On the basis of obtained results from the precision-recall curve, ELSCP achieved an AUCPR of 0.277 with an F1 score of 0.343 and the best threshold value of 0.1404 (as shown in Figure 12). The obtained AUCPR value is acceptable since it is above the acceptable baseline for the crop dataset.  The second phase of analysis determined if the detected anomalies were linked to the state of the crop being damaged at the end of the harvest season. We extracted all data that had been labelled as anomalies by the model and compared them to the actual ground truth. The model detected 6324 samples as anomalies, of which 1922 (30%) were actual anomalies that were linked to crop damage. In the dataset, the crop state at the end of the harvest could be categorised as alive, damaged due to other causes, or damaged due to pesticides. On the basis of 30% of the actual anomalies, we categorised them according to the cause of damage, where we established that 1641 samples (85.47%) were damage caused by other sources, and 281 samples (14.62%) were damage as a result of pesticide use.
The third phase identified the anomaly-detection algorithm that best estimated the anomalous instances within the crop dataset. We applied the data to ELSCP, baseline LSCP with variants of LOF as base detectors, LODA, COPOD, OCSVM, LOF, and CBLOF. Performance was evaluated by considering each detector's performance in terms of AUC-ROC, AUCPR, and F1 score. Obtained results from the experiments are summarised in Table 3. According to the results, our approach obtained competitive results where it came second to COPOD with an AUC-ROC margin of 5.4% (64.1% vs. 67.5%) and an AUCPR of 27.7% vs.29.7% of COPOD, which gives a margin of 3.0%. It achieved the highest F1 score, which was 5.2% (34.3% vs. 29.1%) better than that of the second-best detector, OCSVM. Therefore, our approach is clearly very competitive in this context, and can help farmers in real-time to detect potential issues on crop health.

Conclusions
Improving agricultural productivity is critical for increasing farm profitability and fulfilling the world's fast expanding food demand, which is fuelled by rapid population increase. Precision farming seeks to reduce waste and negative outcomes by precisely focusing agricultural inputs. It has benefited from the integration of high-accuracy GPS technology into farm machinery. Several factors such as water availability, soil fertility, pest protection, timely pesticide application, and nature all contribute to a healthy yield. The precise application of pesticides protects crops from pests, while wrong dosages can result in crop damage or even death. The most versatile and most errorprone anomaly-detection technique is unsupervised anomaly detection. In this case, no data assumptions are made, i.e., the data comprise both normal and anomalous records, and algorithms should be able to distinguish between the two without any prior training. Unsupervised learning is highly preferred for real-life applications, especially in anomaly detection since there are many data without labels in this scenario.
In this study, we performed unsupervised anomaly detection in obtained data streams from movement tracks of combine harvesters during wheat harvest, and data for crop damage recorded by farmers over a period of three harvest seasons. On the basis of our results, deviant combine-harvester behavior could be effectively detected by using machine learning. It was also possible to link anomalies extracted from multivariate analysis of various features to damaged crop state at the end of harvest. Therefore, anomaly detection could be integrated in the decision process of farm operators to improve harvesting efficiency and crop health.

Limitations and Future Work
The first limitation of ELSCP is on the extraction of neighbouring data points that constitute the local region of a test instance using distance metrics applied to KNN Ball Tree algorithm. This approach brings two challenges: (a) it takes much time to determine the nearest neighbours of the test instance; and (b) the performance in a multidimensional space may be affected, especially when many features or attributes are irrelevant. To mitigate this challenge, local-region definition can be solved by clustering [43], the use of fast approximate methods [44] or by prototyping [45], which can significantly reduce the required time to set up the local domain because not all data points are required by these techniques. The second limitation is on the generation of the pseudo-ground truth, where we applied a simple maximisation technique. This could be improved by considering exact strategies, for instance, with the active pruning of base detectors [46].
Our future work will focus on addressing model calibration and hyperparameter tuning to improve on outlier scoring since a false negative in smart agriculture can induce uncomfortable issues in crop production and farm efficiency. The configuration of local subspaces can also be improved to decrease the amount of time spent locating a test instance's nearest neighbours by replacing the kNN search strategy with a clustering technique. We also propose to develop the methodology into a tool that can be used by farmers and experts for the real-time detection and analysis of farming data streams.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: