- Article
A Standardized Framework for Cleaning Non-Normal Yield Data from Wheat and Barley Crops, and Validation Using Machine Learning Models for Satellite Imagery
- Patricia Arizo-García,
- Sergio Castiñeira-Ibáñez and
- Alberto San Bautista
- + 3 authors
Modern combine harvesters can collect real-time geolocated yield data, but this data is subject to errors. Various protocols of varying complexity have been proposed to clean it. Yield data is valuable in precision agriculture, both for implementing site-specific management and for training models that predict yield from remote sensing data. Machine learning and deep learning techniques have shown their potential for precision agriculture, and their performance shows no significant differences between models trained with data cleaned using a computationally demanding protocol and models trained with data cleaned using a simpler one, such as parametric filtering. However, parametric filtering approaches primarily rely on statistics that are highly sensitive to data distribution and do not effectively filter inliers. The objective of this study is to develop a data-cleaning method that leverages robust statistical measures, specifically the median and interquartile range, to identify and filter outliers and inliers while retaining valid observations in datasets collected from combine harvesters, thereby minimizing the influence of non-normal data distributions. Different levels of data cleaning were applied to a total of 7399 ha of wheat and barley crops, and the quality of each cleaning level was compared. The selected protocol improved the spatial structure of the data, deleting up to 42% and 33% of the data at the polygon level for wheat and barley, respectively. It increased the mean and median, and decreased the standard deviation and coefficient of variation of the data. Between 78.7% and 82.9% of the fields showed a normal distribution after applying the selected method, and machine learning performance improved compared with the raw data. Compared with previous data cleaning studies, the present work proposes an automatic, computationally inexpensive, parametric filtering method that uses robust statistics for non-normal distributions. In addition, its scalability has been demonstrated by applying the method to a large dataset, improving data quality and the performance of yield-prediction ML models in all cases.
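To make the idea of filtering with the median and interquartile range concrete, the following is a minimal sketch, not the authors' protocol. The abstract does not state the thresholds or the order of the filtering steps, so the function name `iqr_filter`, the multiplier `k = 1.5`, and the sample yield values are all illustrative assumptions.

```python
from statistics import median, quantiles


def iqr_filter(yields, k=1.5):
    """Keep values within median +/- k * IQR.

    Assumption: k = 1.5 is a conventional multiplier chosen for
    illustration; the study's actual thresholds are not given in the abstract.
    """
    # Quartiles computed with the "inclusive" method (sample treated as the population).
    q1, _, q3 = quantiles(yields, n=4, method="inclusive")
    iqr = q3 - q1
    centre = median(yields)
    lower, upper = centre - k * iqr, centre + k * iqr
    return [y for y in yields if lower <= y <= upper]


# Hypothetical yield readings (t/ha) for one field, with two implausible values.
sample = [3.1, 3.4, 3.3, 3.6, 3.2, 0.2, 3.5, 9.8]
cleaned = iqr_filter(sample)
print(cleaned)
```

Because both the centre (median) and the spread (interquartile range) are barely affected by the two extreme readings, the bounds stay close to the bulk of the data. A filter based on the mean and standard deviation would have its bounds widened by the very values it is meant to remove.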
5 February 2026







