Article

A Standardized Framework for Cleaning Non-Normal Yield Data from Wheat and Barley Crops, and Validation Using Machine Learning Models for Satellite Imagery

by Patricia Arizo-García 1,2, Sergio Castiñeira-Ibáñez 2,3,*, Enric Cruzado-Campos 1, Beatriz Ricarte 4,5, Constanza Rubio 2,3 and Alberto San Bautista 1,6

1 Centro de Investigación del Regadío y Agrosistemas Mediterráneos, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
2 Departamento de Física Aplicada, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
3 Centro de Tecnologías Físicas, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
4 Instituto Universitario de Investigación de Matemática Multidisciplinar, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
5 Departamento de Matemática Aplicada, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
6 Departamento de Producción Vegetal, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain
* Author to whom correspondence should be addressed.
Agronomy 2026, 16(3), 386; https://doi.org/10.3390/agronomy16030386
Submission received: 2 December 2025 / Revised: 29 January 2026 / Accepted: 2 February 2026 / Published: 5 February 2026
(This article belongs to the Special Issue Integrating Yield Maps, Soil Data, and IoT for Smarter Farming)

Abstract

Modern combine harvesters can collect real-time geolocated yield data, but it is subject to errors. Various protocols have been proposed to clean this data, each with varying levels of complexity. This data is valuable for precision agriculture to implement site-specific management and to train models to predict yield using remote sensing data. Machine learning and deep learning techniques have shown their potential for precision agriculture, and their performance shows no significant differences between models trained with data cleaned using a computationally demanding protocol or a simpler one, such as parametric filtering. However, parametric filtering approaches primarily rely on statistics that are highly sensitive to data distribution and do not effectively filter inliers. The objective of this study is to develop a data-cleansing method that leverages robust statistical measures, specifically the median and interquartile range, to effectively identify and filter outliers and inliers while retaining valid observations in datasets collected from combine harvesters, thereby minimizing the influence of non-normal data distributions. Different levels of data cleaning were applied to a total of 7399 ha of wheat and barley crops, and the quality of each cleaning level was compared. The selected protocol improved the spatial structure of the data, deleting up to 42% and 33% of the data at the polygon level, for wheat and barley, respectively. It increased the mean and median, and decreased the standard deviation and coefficient of variation of the data. Between 78.7% and 82.9% of the fields showed a normal distribution after applying the selected method, and machine learning performance improved compared with the raw data. Compared with previous data cleaning studies, the present work proposes an automatic, low-computational, parametric filtering method that uses robust statistics for non-normal distributions. 
In addition, its scalability has been demonstrated by applying the method to a large dataset, improving data quality and the performance of yield-prediction ML models in all cases.

1. Introduction

As the world population is expected to reach 10.3 billion by the mid-2080s [1], increasing food production is essential to meet the food requirements of this rapidly growing society. Until now, production increases have been achieved through a combination of variety breeding, expansion of the cultivated area, and intensive use of inputs. However, many crops are nearing their maximum physiological yield [2], the negative environmental effects of excessive fertilizer application are undeniable [3], and natural resources are increasingly limited.
In view of these limitations, precision agriculture (PA) has been presented in the last few decades as a solution that allows site-specific crop management (SSCM) [4] to improve productive potential and efficiency of farm inputs [5].
Within PA, some of the most important sources of information are yield maps and their precise location data, provided by yield monitors installed in combine harvesters and their Global Positioning Systems (GPS) [6]. These data can be used for SSCM and for building yield prediction models. Nevertheless, even as more growers adopt these technologies, the data are often underused by growers and the industry [7], mainly because they contain anomalies that could jeopardize future decisions based on them. These errors can be classified into four groups [8]: harvest dynamics (lag time, filling and emptying time), measurement errors related to yield and moisture observations, positioning system accuracy, and harvester operator (speed, harvest turns, and headlands).
In this regard, several authors have highlighted the need to detect and filter these errors prior to any subsequent data analysis. Numerous methodologies have been proposed, most of which have been tested in only a limited number of fields and production areas. Overall, two main approaches are commonly employed:
  • Global filtering, in which outliers are removed from the dataset [9]. In the literature, global filtering comprises several types of filters, including the use of complementary harvest data (e.g., moisture measurements, harvester speed, or harvesting pass), biological limits of the crop, and parametric filtering criteria.
  • Local, post-processing filtering, where a neighborhood of observations is defined, and inliers are subsequently removed [4,6,8,10]. Local filtering can be performed through different approaches, such as the application of the Local Moran’s spatial correlation index (LM) [6,11], clustering algorithms [7,12], or expert-based manual filtering [13].
In both cases, interpolation techniques such as the Block Kriging and Inverse Distance Weighting (IDW) methods are commonly used to replace the removed data points.
Nevertheless, data-cleaning protocols that rely on complementary harvest information are limited by the fact that such ancillary data are not always available [14]. Similarly, parametric filters based on statistical descriptors such as the mean and standard deviation may be unsuitable for large, multi-location datasets with a large number of fields, as the proportion of fields with non-normal distributions tends to increase. Conversely, local filtering techniques are typically characterized by high computational demands, which do not necessarily yield superior performance when the cleaned data are subsequently used to train ML models [15], and many of these approaches lack full automation.
In this sense, researchers in other domains have recommended data transformations, such as the logarithmic transformation, the min-max method, or z-score normalization [16], to obtain a dataset with a normal distribution. Even so, these techniques are not helpful for all datasets and may be difficult to interpret in agriculture. Another feasible option is to use other parameters that are more reliable in non-normal distributions for the filtering process, such as the median or the interquartile range (IQR) [17].
For this purpose, in this paper, the non-normality of raw yield datasets is tested and validated for a large dataset, including two different crops (wheat and barley), five locations, and four different growing seasons. Once this statement is verified, the paper aims to select a new data-cleaning protocol that can be easily automated with low computational expense. This study proposes global and local filtering, using the median and interquartile range as robust statistics to clean skewed data, along with different combinations of coefficient limits. The selection of the optimized data cleaning method is based on an analysis of the data’s internal structure and on enhancing the performance of the machine learning model trained on Sentinel-2 reflectance data and filtered yield data. The proposed method is applied in 7399 ha of wheat and barley.

2. Materials and Methods

2.1. Site of Study

The study was conducted using yield data from five of the main wheat and barley production areas in Spain (Figure 1): Burgos, Córdoba, León, Sevilla, and Valladolid. A total surface of 7399 ha of yield data was used in the data cleaning study (3178 ha and 4221 ha for wheat and barley, respectively), distributed in 648 fields (309 and 339 fields for wheat and barley, respectively) over 4 crop seasons (from 2020 to 2023). For wheat, data from all 5 locations were available, while for barley, only three locations had available yield data (Burgos, Córdoba, and Valladolid). The available yield data correspond to winter wheat and barley crops, which were sown in November. The crop management of wheat and barley fields in Spain follows the recommendations of López-Bellido et al. [18,19].
According to the Köppen Climate Classification (KCC), the studied areas have different climates: temperate oceanic climate (Cfb) for Burgos, semiarid climate (BSh) for Córdoba, warm-summer Mediterranean climate (Csb) for León, hot-summer Mediterranean climate (Csa) for Sevilla, and arid steppe (BSk) for Valladolid. The soil type for these areas is calcimorphic, except for the León area, which presents an umbrisol soil type [20].

2.2. Satellite Data

The satellite data were obtained from the Multi-Spectral Instrument (MSI) on board two twin satellites (Sentinel-2A and Sentinel-2B) that fly in the same orbit but are phased 180° apart, allowing the acquisition of wide-swath, high-resolution images with a 5-day revisit frequency [21]. The optical instrument samples 13 spectral bands, 10 of which were used (Table 1). Only cloud-free images were used, downloaded from ESA’s official Copernicus Browser platform. The downloaded images were level 2A products that provide Bottom-of-Atmosphere (BOA) reflectance data.

2.3. Yield Data Acquisition

The crops’ final yield data were recorded using two different Yield Track software systems, one installed by TOPCON Corporation (Tokyo, Japan) and the other by Trimble (Westminster, CO, USA). A total of three combine harvesters were used for data acquisition. Both companies’ measurement systems are based on volumetric grain flow estimates from optical sensors before the grain enters the combine harvester hopper. The software that generates yield maps includes internal calibration based on the crop type, so combine operators select the grain type to be harvested and calibrate the sensors before starting. However, the representation of the data in the resulting shapefiles differs. TOPCON (Yieldtrakk YM-1) (Topcon Positioning Systems, Inc., Livermore, CA, USA) creates a layer composed of polygons with an irregular surface and a constant width that matches the cutting width of the combine (7.5 m for wheat and barley crops) (Figure 2a), whereas Trimble (Trimble Ag) (Trimble Inc., Westminster, CO, USA) creates a layer of points (Figure 2b). The yield maps of Córdoba, León, Sevilla, and Valladolid were created using TOPCON software, while the Burgos yield maps used Trimble software. TOPCON yield maps were downloaded directly from the combine harvester, while the Trimble yield maps were obtained from the company. The Trimble datasets were already preprocessed and free of yield values outside the biological limits of the crops.

2.4. Study of Data and Data Processing

Figure 3 shows the workflow of the data cleaning study, which is composed of two stages. Stage 1 defines the starting point for the analysis by filtering the raw yield data to obtain P0 (biological limits), evaluating the distribution and the transformations, determining the minimum accuracy threshold value, and providing a preprocessed database (PP). Stage 2 cleans the PP data (Figure 4), using 4 global combinations (G1.0–G2.5) and 16 final combinations (G × L). The optimal cleaning level is selected by comparing the resulting yield maps using the statistical indicators described in Section 3 (percentage of data removed, dispersion, normality metrics, and spatial distribution).

2.4.1. Analysis of Raw Data and Preprocessing

A descriptive analysis of the data was carried out in the original (raw) and preprocessed (P0) data. The P0 datasets were obtained by deleting from the raw data the measured values outside the biological limits of wheat and barley crops (0–10,000 kg ha−1). The selected statistics for the descriptive analysis were the minimum (Min) and maximum (Max) values, the Mean, the median, the coefficient of variation (CV), the Fisher-Pearson coefficient of skewness (Skew), and Pearson’s coefficient of kurtosis (Kurt). In this study, normality was examined using asymmetry and kurtosis thresholds in accordance with descriptive statistical criteria. Subsequently, the percentage of fields that presented a normal distribution of yield data was calculated for each location and crop. The dataset was considered normally distributed when its skew was within the range of −0.5 to 0.5, and its kurtosis was within the range of −2.0 to 2.0.
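The normality criterion above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names are hypothetical, population (biased) moments are assumed, and kurtosis is assumed to be reported as excess kurtosis (normal distribution = 0), which is the reading most compatible with the −2.0 to 2.0 range.

```python
from statistics import fmean


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Fisher-Pearson coefficient of skewness, g1 = m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """m4 / m2**2 minus 3 (assumption: the thresholds refer to excess kurtosis)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def is_normal(values, skew_limit=0.5, kurt_limit=2.0):
    """Apply the descriptive thresholds: |skew| <= 0.5 and |kurtosis| <= 2.0."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)


def percent_normal_fields(fields):
    """fields maps a field identifier to its list of yield values (kg/ha)."""
    flags = [is_normal(values) for values in fields.values()]
    return 100.0 * sum(flags) / len(flags)
```

Each field needs some variation in its yield values; a constant field would make the second moment zero and the ratios undefined.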
Once the normality of the P0 data is tested, several methods of data transformation [17,22,23] are applied to determine whether the percentage of fields that present a normal distribution increases with respect to P0. If it does, the transformed data should be used for the next analysis step; otherwise, P0 data will be used. Seven transformations were selected after reviewing the existing literature: Reciprocal ($1/x_i$), Box-Cox, Min-Max, Square Root ($\sqrt{x_i}$), Squared ($x_i^2$), Cubic ($x_i^3$), and common logarithm ($\log_{10}(x_i)$). Afterward, to determine the minimum value that the combine harvester can measure with accuracy (Min. Acc. Value), the data are segmented into groups as a function of the yield range (ranges of 500 kg ha−1), and a robust coefficient of variation (CVmedian), based on the median and the Median Absolute Deviation (MAD), is studied. CVmedian values were obtained for each crop, and the two monitoring systems were calibrated on each harvester following the instructions of [10]. The minimum threshold value was set using a conservative criterion to standardize the methodology across all groups. CVmedian (Equation (1), where $X = \{x_1, x_2, \ldots, x_i\}$, $i = 1, \ldots, N$) was used instead of the mean-based CV because uncleaned yield data tend to be non-normally distributed, as widely reported by other authors [7,24], and this CV analog is more robust to outliers when relative dispersion needs to be measured. Once the CVmedian of each range was calculated, excessively high values in the low yield ranges, compared to the adjacent ranges, were taken as a sign of low accuracy of the combine harvester. The value ranges flagged as low-accuracy were deleted, yielding the PP data, which are used as base data for applying the data cleaning protocols (stage 2).

$$CV_{median} = \frac{MAD}{\mathrm{Median}} \times 100 = \frac{\mathrm{Median}\left(\left|x_i - \mathrm{Median}(X)\right|\right)}{\mathrm{Median}(X)} \times 100 \qquad (1)$$

where $CV_{median}$ represents the coefficient of variation based on the median; MAD denotes the median absolute deviation, computed as the median of $|x_i - \mathrm{Median}(X)|$, where $x_i$ corresponds to each observation and $\mathrm{Median}(X)$ is the median of the dataset. The absolute value operator $|\cdot|$ measures the deviation from the median, and the factor $\times 100$ converts the resulting ratio into a percentage.
Given the unusually high dispersion of data in the lower-yield range, the threshold of 500 kg ha−1 is used as a criterion for data quality rather than as a general physiological bias of the plant. Low yield values may be due not only to crop management anomalies but also to grain flow within the harvester system itself. Consequently, these values are considered to have little validity for processing and modeling.
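As an illustration of the range-wise CVmedian analysis, the sketch below computes Equation (1) for consecutive 500 kg ha−1 yield ranges. The function names and the handling of a zero median (returning NaN) are assumptions made for this example and are not described in the study.

```python
from statistics import median


def cv_median(values):
    """Robust CV in percent: MAD / median * 100, as in Equation (1)."""
    med = median(values)
    if med == 0:
        # Assumption: the ratio is undefined for a zero median, so flag it as NaN.
        return float("nan")
    mad = median(abs(v - med) for v in values)
    return mad / med * 100.0


def cv_median_by_range(yields, bin_width=500.0):
    """Group yields (kg/ha) into ranges of bin_width; return {lower_edge: CVmedian}."""
    bins = {}
    for y in yields:
        lower_edge = int(y // bin_width) * bin_width
        bins.setdefault(lower_edge, []).append(y)
    return {edge: cv_median(vals) for edge, vals in sorted(bins.items())}
```

A range whose CVmedian is much higher than that of the adjacent ranges would then be a candidate for the low-accuracy flag described above.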

2.4.2. Proposal of Data Cleaning Methodology and Execution Comparison with Other Data Cleaning Methods

The proposed data-cleaning methodology was introduced with two key objectives: ensuring applicability across all yield datasets and minimizing computational cost. Figure 4 illustrates the proposed approach in detail. For the proposed method to be applied to all yield datasets, the only truth data required for the cleaning process are the yield maps generated by the Yield Track system, as other measurements, such as harvesting speed and grain moisture, may not always be available [14]. The PP level of data obtained at stage 1 was subjected to three levels of processing (global adjustment, local adjustment, and rescaling) before obtaining the final product, which will be used for PA purposes.
1. Global adjustment: A parametric filtering was carried out at the field level to delete the remaining outliers. It is based on delimiting a range of yield values for the field and deleting the data points that fall outside the predefined bounds. The upper and lower limits of the range were established using Equations (2) and (3). Four types of global filtering were tested (G1.0–G2.5) based on the coefficient limit (n). The selected n values were 1.0, 1.5, 2.0, and 2.5.

$$\text{Upper limit} = \alpha + n \cdot \beta \qquad (2)$$

$$\text{Lower limit} = \alpha - n \cdot \beta \qquad (3)$$

where $\alpha$ denotes the field median and $\beta$ represents the field interquartile range. The parameter $n$ takes the values 1.0, 1.5, 2.0, and 2.5.

2. Local adjustment: Once the global adjustment was made, another parametric filtering was carried out within each field to delete the remaining inliers. For this second filtering, a search radius of 40 m [6,10] was used to create distinct neighborhoods within each field. The search radius was chosen so that each neighborhood contained data from each travel direction of the combine harvester, with at least one repetition, covering 5 harvester passes for TOPCON data and 4 for Trimble data. To verify that the selected search radius was not over-smoothing the data, the standard deviation of yield was analyzed as a function of different search radii (Figure 5). As the search radius increases, the mean SD approaches the median, and the percentage of anomalous data and the CV decrease, indicating that smaller radii are more strongly influenced by noise. Nevertheless, between search radii of 40 and 45 m the improvement is minimal, indicating that increasing the search radius beyond 40 m would over-smooth the data. Upper (Equation (4)) and lower (Equation (5)) limits were established for each neighborhood, and data points that fell outside the bounds were deleted. For each resulting map of the tested global filtering, four types of local filtering (L1.0–L2.5) were tested, depending on the n used. The selected n values were 1.0, 1.5, 2.0, and 2.5.

$$\text{Upper limit} = \gamma + n \cdot \delta \qquad (4)$$

$$\text{Lower limit} = \gamma - n \cdot \delta \qquad (5)$$

where $\gamma$ corresponds to the median within the search-radius area and $\delta$ denotes the interquartile range of the search-radius area. The parameter $n$ takes the values 1.0, 1.5, 2.0, and 2.5.

3. Rescaling: All the maps resulting from global filtering (4 levels) and global+local filtering (16 levels) were rescaled into a 10 × 10 m grid, corresponding to the highest spatial resolution of Sentinel-2. The 10 m grid is projected in EPSG:32630-WGS 84/UTM zone 30N and explicitly aligned to the Sentinel-2 grid, ensuring that pixel boundaries coincide exactly with the Sentinel-2 10 m bands to prevent spatial shifts. For polygon data, the 10 m gridding was performed by aggregating polygon values using an area-weighted mean based on their spatial overlap with each pixel; for point data, the pixel yield value was obtained as the simple mean of the points falling within the same pixel.
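The global and local adjustments of Equations (2) to (5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: observations are represented as hypothetical `(x, y, yield)` tuples in projected metres, quartiles use the "inclusive" method of Python's `statistics.quantiles`, each neighborhood includes the point itself, and neighborhoods with fewer than two points are left untouched. The brute-force neighbor search is quadratic and only meant for small examples.

```python
from math import hypot
from statistics import median, quantiles


def robust_bounds(values, n):
    """Return (lower, upper) = median -/+ n * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    spread = q3 - q1
    return centre - n * spread, centre + n * spread


def global_filter(points, n):
    """Keep points whose yield lies within the field-level bounds (Eqs. (2)-(3))."""
    lower, upper = robust_bounds([p[2] for p in points], n)
    return [p for p in points if lower <= p[2] <= upper]


def local_filter(points, n, radius=40.0):
    """Keep points whose yield lies within the bounds of their neighborhood (Eqs. (4)-(5))."""
    kept = []
    for x, y, value in points:
        neighbours = [v for (px, py, v) in points
                      if hypot(px - x, py - y) <= radius]
        if len(neighbours) < 2:
            # Assumption: too few neighbours to estimate an IQR, so keep the point.
            kept.append((x, y, value))
            continue
        lower, upper = robust_bounds(neighbours, n)
        if lower <= value <= upper:
            kept.append((x, y, value))
    return kept


def clean_field(points, n_global, n_local, radius=40.0):
    """Apply one G x L combination, e.g. n_global=1.0 and n_local=2.5."""
    return local_filter(global_filter(points, n_global), n_local, radius)
```

For instance, `clean_field(points, 1.0, 2.5)` would correspond to the G1.0L2.5 combination in the notation used here.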
A parametric filtering procedure was proposed for both global and local filters to obtain a data-cleaning protocol that can be easily industrialized at low computational cost. An automatic cleaning procedure is necessary because it can be used across different contexts without resorting to empirical knowledge. In the literature, parametric global filters have been widely used, but most cleaning protocols use the mean and standard deviation as filtering statistics [4,6,8,10]. However, these statistics are heavily influenced by the data distribution and can increase the amount of deleted data, erasing data that may not be outliers. Therefore, the proposed protocol relies on statistics that are less sensitive to data distribution, such as the median and the IQR [15,24]. The n values were selected based on the existing literature; the most commonly used n values are 3 and 2.5, in accordance with the three-sigma rule. Because yield monitor data tend to be skewed by harvesting artifacts, a reduction of n to 1.5 has also been applied [4]. Consequently, the value of 3 was omitted, and n values from 1.0 to 2.5 were tested to determine which n combination improved the final data quality the most. The decision to use a parametric local filter rather than other local filtering techniques, such as clustering algorithms that detect anomalous data points within established neighborhoods, is justified by the computational expense. Clean yield data are used as ground truth information for training the models that enable PA. Those models are built using ML and deep learning (DL) algorithms, which are characterized by being less sensitive to anomalous data. In this sense, it has been observed that data cleaned with parametric and more complex filters show similar performance when used to train yield prediction models [15]; thus, filters with lower computational expense are preferable.
Once the data were cleaned, the proposed levels were compared with the raw and PP data to determine which cleaning level performed better. The comparison began with the percentage of deleted data at the polygon and pixel levels. This was followed by the construction of semivariograms to examine how the different cleaning levels affected the spatial structure of the datasets. A descriptive analysis was also performed, studying the changes in mean, median, SD, CV, Skew, Kurt, and the percentage of fields that present a normal distribution of yield, since raw yield maps often exhibit skewness due to systematic harvester errors or human mistakes, and proper data cleaning allows the distribution of yield data to more closely resemble a normal distribution [25]. Finally, the data were used in conjunction with Sentinel-2 data to train the final yield prediction models, comparing the impact of the cleaning level on model performance. Two ML algorithms were used to train the prediction models: PLSR and XGBoost. The hyperparameters used for PLSR training were 30 components and no variable scaling. For XGBoost, the hyperparameters were 50 trees, a maximum tree depth of 3, a learning rate of 0.05, and a random seed of 0. An independent model was trained for each cleaning level, location, year, and crop. Each independent dataset was split randomly into training and test subsets using an 80–20% ratio. All models were validated with the hold-out method, using the test dataset, which was not used during training. Model performance was evaluated using R2, the Mean Absolute Error (MAE), and the Root Mean Squared Error (RMSE).
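The evaluation step can be illustrated with a small, library-free sketch of the 80–20% hold-out split and the three performance metrics. The function names are hypothetical, the shuffling scheme is an assumption (the study only states that the split was random), and the models themselves (PLSR and XGBoost) are not reproduced here.

```python
import random
from math import sqrt


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) for hold-out validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def regression_metrics(observed, predicted):
    """Return R2, MAE and RMSE for paired observed and predicted yields."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        "RMSE": sqrt(ss_res / n),
    }
```

In this sketch the metrics would be computed twice per model, once on the training subset and once on the held-out test subset, so that the two can be compared.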
Finally, the selected cleaning level was compared with the most widely used data filtering method in the literature. Specifically, the 3SD method was applied, which consists of a global filtering that deletes all the yield measurements that fall outside the threshold of $\text{mean}_{\text{field}} \pm 3 \cdot \text{SD}_{\text{field}}$ [6,8,13,25,26].
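For comparison, the 3SD rule can be sketched in the same style. The function name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import fmean, stdev


def three_sd_filter(yields, k=3.0):
    """Keep yields within mean -/+ k * SD of the field (k = 3 for the 3SD method)."""
    mean = fmean(yields)
    sd = stdev(yields)  # assumption: sample standard deviation
    lower, upper = mean - k * sd, mean + k * sd
    return [y for y in yields if lower <= y <= upper]
```

Because both the mean and the SD are pulled toward extreme values, the resulting bounds widen when a field is strongly skewed, which is the sensitivity that the median and IQR bounds are meant to reduce.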

2.5. Software

The processing of yield maps, nitrogen maps, and Sentinel-2 data was carried out using QGIS 3.34.6. Descriptive data analysis and semivariogram computation were performed with the scikit-gstat 1.0.19 and pykrige 1.7.2 libraries. The PLSR models were trained and tested, and model performance metrics were calculated using scikit-learn 1.2.2. The training and testing of the XGBoost algorithm were performed with the xgboost 3.0.5 library. A summary of the software used is provided in Table 2.

3. Results

3.1. Study of the Data

3.1.1. Raw and Preprocessed Data

A descriptive analysis of the raw and P0 yield data was carried out to determine whether they followed a normal distribution (Table A1 and Table A2 of Appendix A). In both wheat and barley, deleting the biological outliers improved the values of CV, skewness, and kurtosis. However, not all the datasets presented a normal distribution. In the case of wheat, removing the biological outliers resulted in 53.8% of the datasets exhibiting a normal distribution, compared to 7.7% of the raw datasets. Slightly worse results were observed in the barley datasets, where the percentage of datasets with a normal distribution increased from 11.1% to 33.3%.
On the other hand, Table 3 shows the percentage of fields inside the datasets of both crops that present a normal distribution. In this sense, the percentage of fields with final production data that is normally distributed ranges from 8.2% to 36.9%.

3.1.2. Transformed Data

The P0 data were subjected to several transformations to obtain normally distributed data. In this regard, Table 4 shows the percentage of fields with normally distributed yield data for each location, crop type, and applied transformation. After data transformation, the datasets showed percentages of fields with normally distributed data ranging from 0 to 50%, indicating that the transformation did not normalize the final production data.
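As a rough illustration of the transformations that were tested, the sketch below implements six of the seven with the standard library only. The function and key names are hypothetical. Box-Cox is left out because it requires estimating a power parameter, and the data are assumed to be strictly positive, since the reciprocal and the logarithm are undefined at zero.

```python
from math import log10, sqrt


def transform(values, name):
    """Apply one of the element-wise transformations to a list of positive yields."""
    if name == "min_max":
        lowest, highest = min(values), max(values)
        return [(v - lowest) / (highest - lowest) for v in values]
    functions = {
        "reciprocal": lambda v: 1.0 / v,
        "square_root": sqrt,
        "squared": lambda v: v ** 2,
        "cubic": lambda v: v ** 3,
        "log10": log10,
    }
    return [functions[name](v) for v in values]
```

Each transformed field could then be passed through the same skewness and kurtosis thresholds used for the P0 data to obtain the percentage of normally distributed fields.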

3.1.3. Variance of Data as a Function of the Range of Yields

Table 5 shows, for each of the crops studied, the median-based coefficient of variation (CVmedian) for each yield range, as well as the percentage of the total dataset that each range represents. The CVmedian values were calculated for each crop, using all locations and years, and a conservative minimum threshold of 500 kg ha−1 was selected for all systems and data collection combinations to standardize the subsequent data cleaning steps. For both crops, the range of 0 to 500 kg ha−1 presents a strikingly high value, indicating that the yield values in that range are not homogeneously distributed. Therefore, the range of 0–500 kg ha−1, which showed an unexpectedly high CVmedian, was taken to lie below the minimum yield value that the combine harvester can measure accurately. Consequently, 500 kg ha−1 was considered an appropriate low-confidence threshold for this dataset, which affected approximately 12% (wheat) and 10% (barley) of the observations.

3.2. Comparison of Raw and Cleaned Data

Different levels of data cleaning were applied to all available datasets. Figure 6 and Figure 7 show the percentage of data removed after the cleaning process. In the case of wheat (Figure 6), erasing outliers related to potential yield and combine harvester measurement limitations (PP) eliminated between 0.5% and 31.4% of the data at the polygon level and between 0% and 10.9% at the pixel level. Among the remaining data-cleaning levels, G1.0L1.0 was the most restrictive, removing between 31.3% and 51.0% of the data at the polygon level and up to 32.7% at the pixel level. The same trend can be observed for the barley datasets (Figure 7): the PP level erased between 0.9% and 29.7% of the data at the polygon level and between 0% and 7.6% at the pixel level, while the most restrictive cleaning level (G1.0L1.0) erased between 30.8% and 44.9% of the data at the polygon level and a maximum of 33.6% at the pixel level.
Although some cleaning filters remove a large proportion of polygon-level observations, the loss is smaller after resampling to a standard 10 × 10 m grid (pixel level), because the polygon yield layers consist of irregular segments, and working at the pixel level homogenizes the spatial resolution, reducing the influence of the small segments.
The CVs of the raw yield datasets ranged from 34.9% to 424.1% in wheat datasets and 34.4% to 362.9% in barley datasets. After applying the different cleaning levels, the CV ranged from 4.9% to 80.1% and 12.6% to 74.9% for wheat and barley datasets, respectively. The effect of data cleaning is equal across both crops in all datasets studied, increasing the mean and median values while reducing the CV and standard deviation. Table 6 and Table 7 show an example of a summary of statistics for wheat and barley, respectively. When compared to the raw dataset, the cleaning process translates into an increase in the mean (ranging from 13.4% to 15.5%, and 9.5% to 10.3% for the wheat and barley presented examples) and median (ranging from 8.2% to 8.8%, and 14.1% to 15.7% for the wheat and barley presented examples), and a decrease in the standard deviation (ranging from 35.9% to 47.1%, and 20.1% to 22.2% for the wheat and barley presented examples) and CV (ranging from 48.8% to 58.1%, and 27.2% to 29.4% for the wheat and barley presented examples).
Concerning the spatial structure of the datasets, the structure improved in all the datasets as the cleaning process became more restrictive. The cleaning levels that incorporated a global filtering with an n value of 1.0 (the G1.0 family) improved the spatial structure of the yield datasets the most. Figure 8 shows two examples of semivariograms, one for a wheat dataset (Figure 8a) and another for a barley dataset (Figure 8b). In both examples, two separate groups are evident in terms of spatial structure. However, the cleaning level G1.5L1.0 lies between the two groups, so it should be carefully considered.
The results of this study highlight how restrictive cleaning levels reduce both nugget and sill, demonstrating the importance of cleaning protocols for improving measurement consistency and motivating researchers to consider these factors in their work. Additionally, it is noteworthy that the estimated range remains comparable between the different cleaning protocols, maintaining the spatial dimension of yields.
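To make the nugget and sill comparison more concrete, the sketch below computes an empirical semivariogram with the classical (Matheron) estimator. It is a plain re-implementation for illustration only and does not reproduce the scikit-gstat workflow used in the study; the function name, the `(x, y, yield)` tuple layout, and the lag binning are assumptions, and the pairwise loop is quadratic in the number of points.

```python
from math import hypot


def empirical_semivariogram(points, lag_width, n_lags):
    """Return [(lag_centre, semivariance, n_pairs)] for non-empty lag classes."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            lag = int(hypot(xi - xj, yi - yj) // lag_width)
            if lag < n_lags:
                sums[lag] += (zi - zj) ** 2
                counts[lag] += 1
    # Semivariance is half the mean squared difference within each lag class.
    return [((k + 0.5) * lag_width, sums[k] / (2 * counts[k]), counts[k])
            for k in range(n_lags) if counts[k] > 0]
```

In such a curve, the value approached at very short lags approximates the nugget and the plateau at long lags approximates the sill, so a cleaner dataset would be expected to show lower values of both.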
The improvement in spatial structure was reflected in a progressive increase in the percentage of normally distributed fields in the datasets. At the P0 level, 48% of the wheat fields and 50% of the barley fields presented a normal distribution. After applying the least restrictive level of cleaning (G2.5), these percentages increased to 64.8% and 56.1% for wheat and barley, respectively. The most restrictive cleaning level (G1.0L1.0) presented a percentage of normally distributed fields of 79.4% in wheat and 73.5% in barley. The levels G1.0L2.5 and G1.5L1.0 presented, respectively, percentages of 82.9% and 71.9% in wheat, and 78.7% and 66.1% in barley. Among all the proposed cleaning levels, G1.0L2.5 was the one that increased the percentage of fields with a normal distribution the most.
In view of these results, several cleaning levels can be discarded. The most suitable cleaning process was therefore chosen by comparing the final map quality of five cleaning levels: the four G1.0 levels with local filtering (G1.0L1.0 to G1.0L2.5) and G1.5L1.0.

3.3. Validation of Final Yield Quality

Two ML models were evaluated: partial least squares regression (PLSR), a multivariate statistical technique that combines principal component analysis and multiple linear regression, using Sentinel-2 bands and the yield of the two crops; and XGBoost, an open-source ML model that implements gradient-boosted decision tree algorithms. In this study, PLSR was fitted using 30 components (without scaling), while XGBoost used 50 trees with a maximum depth of 3 and a learning rate of 0.05 (random seed = 0). Table 8 and Table 9 show examples of the performance of models trained using the final yield as the target variable and the Sentinel-2 reflectance bands for all available dates as predictors. The tendency observed in these examples can be extrapolated to the remaining datasets. Only the results of the models trained for the pre-selected cleaning levels, as well as for the PP and Raw data, are shown. In both crops, the XGBoost algorithm achieved the best performance during training, followed by the PLSR models, whose results were remarkably similar to those of XGBoost. The performance of the PLSR and XGBoost models on the test dataset is similar to that on the training dataset. Comparing the results for both crops, the R2 values of the barley models are higher, but their RMSE and MAE values are also higher than in the wheat cases.
Regarding the quality of the final yield datasets, the different cleaning levels performed similarly. Therefore, the less restrictive levels (G1.0L2.5 and G1.5L1.0) presented in Table 8 and Table 9 are preferable, since they erase less data.

3.4. Comparison of Selected Cleaning Level with Common Yield Cleaning Method

The selected cleaning level (G1.0L2.5) and the Raw data were compared with a yield cleaning method commonly used in the literature (3SD). In terms of summary statistics, for both wheat (Table A3 of Appendix B) and barley (Table A4 of Appendix B), the conservative method (3SD) gives a higher mean and median and a lower SD and CV than the Raw data, but the statistics are slightly better for the selected cleaning level, especially in terms of SD. The 3SD method deletes less data than G1.0L2.5: 2.7% and 3.1% with 3SD versus 25.8% and 22.4% with G1.0L2.5, for wheat and barley, respectively. However, in terms of spatial structure (Figure A1 of Appendix B), 3SD improves little on the Raw data. Regarding ML model performance, 3SD performed better than Raw, but G1.0L2.5 had the highest R2 values and the lowest RMSE and MAE for both training and testing. Finally, after the 3SD method, 44.0% and 55.6% of the fields showed a normal distribution, a considerably lower percentage than that achieved by the cleaning method proposed and selected in this study.
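The contrast between the two filters can be illustrated with a small sketch. It assumes that 3SD keeps values within the mean ± 3 standard deviations and that the global filter keeps values within the field median ± n × IQR; the exact bounds are defined by equations outside this section, so both function names and both rules are assumptions here, and the yield values are invented.

```python
import numpy as np


def filter_3sd(values):
    """Keep values within mean +/- 3 SD (assumed form of the 3SD rule)."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    return values[np.abs(values - mean) <= 3.0 * sd]


def filter_median_iqr(values, n=1.0):
    """Keep values within median +/- n * IQR (assumed form of the global filter)."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return values[(values >= med - n * iqr) & (values <= med + n * iqr)]


# Invented yields (kg/ha): ten plausible readings and two gross sensor errors.
yields = np.array(
    [3200, 3300, 3400, 3450, 3500, 3550, 3600, 3650, 3700, 3800, 50000, 50000],
    dtype=float,
)

kept_3sd = filter_3sd(yields)
kept_iqr = filter_median_iqr(yields, n=1.0)
print(len(kept_3sd), kept_3sd.max())
print(len(kept_iqr), kept_iqr.max())
```

In this example the two gross errors inflate the mean and standard deviation enough that 3SD keeps them, whereas the median and IQR are barely affected and the errors are removed. This is one plausible reason why a mean-based rule deletes little data on heavily contaminated fields.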

4. Discussion

The descriptive analysis confirms the non-normal distribution of the Raw datasets. It also confirms that removing combine harvester overlaps, GPS errors, and yield values outside the possible biological range improves the distribution, increasing the number of fields with a normal distribution in the datasets (from 8.2% to 36.9% of fields, depending on location and crop type). To address non-normality, other authors, such as Aworka et al. [27], have applied data transformations (e.g., logarithmic transformation) to obtain a dataset with a normal distribution. However, those transformations are usually used with small datasets containing few fields. When they are applied to large datasets, as shown in Table 4, it is evident that this approach does not work for all datasets. Therefore, the use of the preprocessed datasets (P0) for the proposed data cleaning processes is highly recommended. The non-normal distribution of the datasets is accounted for, as the proposed approach uses statistics that are not overly influenced by extreme values [17].
Other authors, such as Blasch et al. [9], Sun et al. [4], Ping and Dobermann [12], Robinson and Metternicht [25], and Gozdowski et al. [11], have emphasized that the combine harvester has a minimum yield value that it can measure accurately. For winter cereals such as wheat and barley, this minimum has been set between 100 and 1000 kg ha−1, but the criteria for choosing one value or another have not yet been explained. The study of the CVmedian for each yield range (Table 5) helped to set this minimum at 500 kg ha−1. This decision can be explained from a mathematical point of view: as the yield range increases, the mean tends to increase while the standard deviation remains similar, resulting in a higher CVmedian in the ranges with lower yields. However, the CVmedian of the 0 to 500 kg ha−1 range is too high when compared to the next studied range (500 to 1000 kg ha−1), indicating an unusually high standard deviation.
In terms of total erased data at polygon level, the results obtained are aligned with those obtained by other authors, deleting a percentage of data between 10% and 50% [4,6,7,11,12,24,26]. However, if the cleaned yield data is resampled and used with Sentinel-2 at higher spatial resolution, the percentage of deleted data decreases, reaching a maximum of 33.6% of the 10 × 10 m pixels (pixel level). Given concerns about potential bias, the proposed filtering protocol is applied to each field using the median and interquartile range at the global and 40 m neighborhood levels, with the aim of reducing inconsistent values. To validate this measure, not only were statistical indicators used, but semivariograms and validation tests were also reported.
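The local step described above can be sketched as a brute-force neighborhood filter. The sketch assumes that a point is kept when its value lies within the neighborhood median ± n × IQR, where the neighborhood is every point within 40 m (the point itself included); the exact bounds are given by equations outside this section. The function name `local_filter` and the toy grid are hypothetical.

```python
import numpy as np


def local_filter(x, y, values, radius=40.0, n=2.5):
    """Return a boolean mask: True where a point lies within the
    neighborhood median +/- n * IQR (assumed form of the local bounds)."""
    coords = np.column_stack([x, y]).astype(float)
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.size, dtype=bool)
    for i in range(values.size):
        delta = coords - coords[i]
        dist = np.hypot(delta[:, 0], delta[:, 1])
        neigh = values[dist <= radius]  # neighbors within the radius, self included
        q1, med, q3 = np.percentile(neigh, [25, 50, 75])
        iqr = q3 - q1
        keep[i] = (med - n * iqr) <= values[i] <= (med + n * iqr)
    return keep


# Toy field (assumption): 5 x 5 points on a 10 m grid, yields near 3500 kg/ha.
ii, jj = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
x = (10.0 * ii).ravel()
y = (10.0 * jj).ravel()
values = (3500.0 + 20.0 * ((ii + jj) % 3 - 1)).ravel()

center = 2 * 5 + 2          # flat index of grid point (2, 2)
values[center] = 3900.0     # plausible globally, inconsistent with its neighbors

keep = local_filter(x, y, values, radius=40.0, n=2.5)
print("center kept:", bool(keep[center]))
```

The loop compares every point with every other point, which is fine for a sketch but scales quadratically. For full yield maps a spatial index such as a KD-tree would likely be needed.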
The removal of global outliers and local inliers significantly modified the summary statistics of the datasets, lowering the CV (from 34.4–424.1% in the Raw dataset to 4.9–80.1% after cleaning) and the standard deviation (by 22–64%), and increasing the mean (by 9.5–15.5%) and median (by 8.2–15.7%). Vega et al. [6] reported a similar decrease in CV, with the datasets’ CV ranging from 88% to 213% before cleaning and from 34% to 46% after cleaning. Other authors have also reported increases of 6–15% in the mean and decreases of 22–64% in the standard deviation, similar to Lyle et al. [28], who reported an increase of 11% in the mean and a decrease of 47% in the standard deviation. An increase in the median has also been reported for different crops, reducing the difference between mean and median yield [12,24]; in winter wheat, differences below 160 kg ha−1 have been observed [7]. Considering that a good cleaning method should increase the normality of the distribution of the yield data [14,24,25], these reduced differences between mean and median yield were reflected in the rise in the number of fields that present a normal distribution, with maximum values of 82.9% and 78.7% (G1.0L2.5 cleaning level) of the total wheat and barley fields, respectively. Although the proposed protocols increase the number of fields that meet the normality criteria (skewness between −0.5 and 0.5; kurtosis between −2 and 2), a significant group remained outside these thresholds even with the most effective method (G1.0L2.5). Residual non-normality does not necessarily indicate a sensor or processing effect, since the yield values may reflect the influence of the field’s physical environment and crop management on crop productivity. Consequently, obviously anomalous data can be eliminated using statistical indicators that do not assume strict normality.
Moreover, uncertainty was reduced after applying the cleaning protocol to the yield datasets, as shown by a substantial reduction in the nugget value [25] at each proposed cleaning level. The G1.0 levels with local filtering and G1.5L1.0 improved the spatial structure of the yield datasets the most. In addition, the lag distance at which the sill is reached in the semivariograms is consistent across all proposed cleaning levels, with a downward shift in the total semivariance, as observed by Sun et al. [4]. This downward shift in the semivariance is consistent with the elimination of error, while the stable range of the semivariogram indicates that the spatial dimension of the yield distribution is maintained.
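The normality criteria quoted above can be checked per field with a few lines of NumPy. The sketch assumes that the kurtosis threshold refers to excess kurtosis (0 for a normal distribution) and uses population moments; the text does not state which estimator was used, and the function names and sample values are hypothetical.

```python
import numpy as np


def skewness(values):
    """Moment-based sample skewness."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z**3))


def excess_kurtosis(values):
    """Moment-based excess kurtosis (assumption: normal distribution = 0)."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z**4) - 3.0)


def meets_normality_criteria(values, skew_limit=0.5, kurt_limit=2.0):
    """True when skewness is within +/-0.5 and excess kurtosis within +/-2."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)


# Invented example: a symmetric field and the same field with one gross error.
clean_field = np.arange(3000.0, 4000.0, 10.0)
dirty_field = np.append(clean_field, 400000.0)

clean_ok = meets_normality_criteria(clean_field)
dirty_ok = meets_normality_criteria(dirty_field)
print(clean_ok, dirty_ok)
```

A single extreme value is enough to push skewness and kurtosis far outside the thresholds, which is consistent with the very large Skew and Kurt values of the Raw data in Appendix A.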
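The downward shift in semivariance can be reproduced on synthetic data with the classical (Matheron) estimator, where the semivariance at lag h is half the mean squared difference between all point pairs separated by roughly h. Table 2 lists scikit-gstat for the semivariograms; the NumPy version below is only a sketch, and the grid, the spike pattern, and the function name are invented for the example.

```python
import numpy as np


def empirical_semivariogram(coords, values, lag_width=10.0, n_lags=8):
    """Classical estimator: mean of 0.5 * (z_i - z_j)^2 per distance bin.
    Bin k collects pairs with k * lag_width < distance <= (k + 1) * lag_width."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    half_sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    upper = np.triu_indices(values.size, k=1)  # count each pair once
    dist, half_sq = dist[upper], half_sq[upper]
    bins = np.ceil(dist / lag_width).astype(int) - 1
    gamma = np.full(n_lags, np.nan)
    for k in range(n_lags):
        in_bin = bins == k
        if in_bin.any():
            gamma[k] = half_sq[in_bin].mean()
    lags = (np.arange(n_lags) + 1) * lag_width  # upper edge of each bin
    return lags, gamma


# Synthetic field (assumption): 15 x 15 points on a 10 m grid, smooth yield surface.
gx, gy = np.meshgrid(np.arange(15) * 10.0, np.arange(15) * 10.0, indexing="ij")
coords = np.column_stack([gx.ravel(), gy.ravel()])
smooth = 3500.0 + 300.0 * np.sin(coords[:, 0] / 50.0) * np.cos(coords[:, 1] / 50.0)

raw = smooth.copy()
raw[::10] += 3000.0                      # invented gross errors on every 10th point
is_clean = np.ones(raw.size, dtype=bool)
is_clean[::10] = False                   # "cleaning" = dropping the erroneous points

lags, gamma_raw = empirical_semivariogram(coords, raw)
_, gamma_clean = empirical_semivariogram(coords[is_clean], raw[is_clean])
print(np.round(gamma_raw[:3]), np.round(gamma_clean[:3]))
```

Removing the erroneous points lowers the semivariance at every lag, including the shortest ones that drive the nugget estimate, while the underlying smooth surface and its spatial scale are left unchanged.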
Yield data is a valuable source of information for precision agriculture, and proper cleaning is therefore fundamental to its reliable use. One common use is the development of yield prediction models. ML-based prediction models have shown good results, and these techniques are less affected by errors than regression models [29]. Running the two ML algorithms helped achieve a balance between cleaning level and prediction quality, as other authors have reported [15]. All the preselected cleaning levels performed similarly during training for both algorithms, yielding better results than the models trained with Raw and PP yield data. In addition, when the models are evaluated on the test data, R2 decreases while RMSE and MAE increase substantially for the Raw and PP models, indicating clear overfitting and highlighting the poor quality of the data. The performance of the models trained with the preselected cleaning levels remains similar when applied to the test data. The good performance of the models in training and testing, along with the similar results obtained across preselected cleaning levels, indicates that the less restrictive preselected cleaning levels (G1.0L2.5 and G1.5L1.0) are the most suitable protocols for cleaning yield data, as they also erase a lower percentage of yield data (maximum percentage of erased polygons of 42% in wheat and 33% in barley for G1.0L2.5, and of 48% in wheat and 35% in barley for G1.5L1.0). In addition, the range of values remains similar while R2 increases, reflecting the improved spatial structure of the data. RMSE and MAE decrease compared with the Raw data, but not substantially, which suggests that the behavioral pattern of the data has not changed, that the real variability of the yield data is maintained, and, consequently, that the cleaning process has not introduced bias into the data.
Between the G1.0L2.5 and G1.5L1.0 cleaning levels, G1.0L2.5 is chosen as the better protocol for cleaning yield data, as the semivariograms show a better spatial structure of the cleaned data and the percentage of fields presenting a normal distribution is highest when it is applied (82.9% in wheat and 78.7% in barley for G1.0L2.5, compared with 71.9% in wheat and 66.1% in barley for G1.5L1.0).
After choosing the better cleaning protocol, its results were compared with those of a more conservative method commonly found in the literature, the 3SD method. In this comparison, the G1.0L2.5 method significantly improved the spatial structure of the data and significantly increased the percentage of fields with a normal distribution of yield data.
The presented framework is proposed as an easily applied approach for removing measurement errors from yield data. As the protocol includes both global and local filtering, both outliers and inliers are erased. The framework employs a parametric approach at both the global and local levels. However, since the unfiltered datasets do not follow a normal distribution, the median and interquartile range are used as filtering statistics, as they are less influenced by the data distribution. The proposed cleaning method was tested on a large dataset across different locations, growing seasons, and two crops, yielding good results. It has lower computational requirements than more complex methods, and, as a form of validation, the cleaned data was used to train ML models with good results. Future studies should focus on applying the selected data-cleaning method to other locations and crops to assess the quality of the cleaned data and determine whether the same cleaning level can be applied to conditions not included in the present work.

5. Conclusions

The presented study proposes a data-cleaning methodology for yield data in wheat and barley crops, applied to a large dataset. Only georeferenced yield data was used in the proposed methodology, allowing its application to all yield maps generated by the YieldTrack software. Firstly, biological outliers and measurements below 500 kg ha−1 are deleted. Secondly, low-computational-expense parametric global and local filtering is applied, using statistics robust to non-normally distributed data (median and interquartile range). Four different limit coefficients were used for global and local filtering, yielding 20 cleaning levels, which were compared to select the optimal one. The percentage of deleted data, the spatial structure, a descriptive analysis of the data, and the performance of machine learning models trained and tested on clean data and Sentinel-2 reflectance data were used to select the best approach.
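The count of 20 cleaning levels follows from four global-only levels plus the 16 global × local combinations. A short sketch of the enumeration is given below; the coefficient 2.0 is an assumption inferred from the stated range of 1.0 to 2.5 with four values, since only 1.0, 1.5, and 2.5 are named explicitly in this section.

```python
from itertools import product

# Limit coefficients n; 2.0 is inferred, the others appear in the level names.
coefficients = [1.0, 1.5, 2.0, 2.5]

# Levels that apply the global filter only (G1.0 ... G2.5).
global_only = [f"G{g:.1f}" for g in coefficients]

# Levels that apply the global filter followed by the local filter.
global_local = [f"G{g:.1f}L{l:.1f}" for g, l in product(coefficients, coefficients)]

levels = global_only + global_local
print(len(levels), levels[:6])
```

The selected level, G1.0L2.5, therefore pairs the strictest global bound with the most permissive local bound.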
The selected level was the G1.0L2.5. This cleaning method was among the most effective in improving the spatial structure of the data and the predictive performance of machine learning models. The maximum percentage of polygons deleted was 42% and 33% for wheat and barley, respectively, and the percentage of fields that showed a normal distribution after filtering was highest at 82.9% and 78.7% for wheat and barley, respectively. The use of the selected method increased the mean and median of the datasets, and decreased the standard deviation and CV values.
Future work will focus on applying the selected cleaning method to other locations and crops not included in the present study, as well as to the YieldTrack software, and on evaluating the quality of the resulting cleaned data. In addition, the results of the present work will enable the development of early yield-prediction models that producers can use to implement site-specific crop management and apply corrective measures when necessary.
This approach will enable the training of predictive models for advanced phenological stages, which can be applied to estimate cereal production. These estimations contribute to Sustainable Development Goal 2 (Zero Hunger) by improving food security and agricultural planning, and to Sustainable Development Goal 8 (Decent Work and Economic Growth) by reducing market uncertainty and facilitating risk management.

Author Contributions

Conceptualization, A.S.B., S.C.-I., C.R. and P.A.-G.; methodology, A.S.B., S.C.-I., C.R. and P.A.-G.; software, P.A.-G., S.C.-I. and B.R.; validation, A.S.B., P.A.-G., S.C.-I., B.R., E.C.-C. and C.R.; formal analysis, A.S.B., P.A.-G., S.C.-I., B.R., E.C.-C. and C.R.; investigation, A.S.B., P.A.-G., S.C.-I., E.C.-C., B.R. and C.R.; resources, A.S.B. and C.R.; data curation, A.S.B., P.A.-G., S.C.-I., E.C.-C., B.R. and C.R.; writing—original draft preparation, A.S.B., P.A.-G. and S.C.-I.; writing—review and editing, A.S.B., P.A.-G., S.C.-I., E.C.-C., B.R. and C.R.; visualization, P.A.-G., S.C.-I. and B.R.; supervision, A.S.B., P.A.-G., S.C.-I., E.C.-C., B.R. and C.R.; project administration, A.S.B.; funding acquisition, A.S.B. and C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the PREDIC-PRO, project SCPP2100C008733XV0, of the State Research Agency of the Ministry of Science, Innovation and Universities, and the ACIF Generalitat Valenciana, European Union (European Social Fund: Investing in Your Future), grant number CIACIF/2022/255.

Data Availability Statement

The Sentinel-2 data used are openly available at https://browser.dataspace.copernicus.eu (accessed on 5 January 2025). The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

P.A.-G. acknowledges financial support from Generalitat Valenciana, European Union (European Social Fund: Investing in Your Future) through grant CIACIF/2022/255.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PA: Precision Agriculture
IQR: Interquartile Range
CV: Coefficient of Variation
CVmedian: Robust Coefficient of Variation
SD: Standard Deviation
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
MAD: Median Absolute Deviation
ML: Machine Learning
DL: Deep Learning
PP: Pre-Processed Data
SSCM: Site-Specific Crop Management
GPS: Global Positioning Systems
LM: Local Moran’s I
KCC: Köppen Climate Classification
MSI: Multi-Spectral Instrument
ESA: European Space Agency
NIR: Near Infrared
SWIR: Short-Wave Infrared
PLSR: Partial Least Squares Regression

Appendix A

The appendix contains the yield descriptive statistics of the raw and P0 data for wheat (Table A1) and barley (Table A2) crops.
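Table A1. Yield descriptive statistics of original (Raw) and preprocessed (P0) data of wheat crops. n is the number of data points, Min the minimum value, Max the maximum value, and Mdn the median value.
| Loc. | Year | Type | n | Min (kg ha−1) | Max (kg ha−1) | Mean (kg ha−1) | Mdn (kg ha−1) | CV (%) | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|---|---|
| BUR | 2021 | P0 | 15,255 | 3 | 9988 | 4170 | 4362 | 34.9 | −0.2 | 0.3 |
| BUR | 2022 | P0 | 13,181 | 2 | 9810 | 2580 | 2537 | 37.4 | 0.9 | 3.5 |
| COR | 2020 | Raw | 22,570 | 0 | 37,770 | 5056 | 5483 | 43.8 | 0.7 | 12.9 |
| COR | 2020 | P0 | 21,773 | 1 | 10,000 | 5128 | 5516 | 35.9 | −0.7 | 0.2 |
| COR | 2021 | Raw | 109,299 | 0 | 32,284 | 3453 | 3579 | 41.6 | 0.6 | 10.9 |
| COR | 2021 | P0 | 106,812 | 1 | 9838 | 3511 | 3605 | 37.0 | −0.1 | 0.6 |
| COR | 2022 | Raw | 22,699 | 0 | 32,960 | 3755 | 3899 | 44.9 | 1.7 | 26.9 |
| COR | 2022 | P0 | 22,065 | 1 | 9734 | 3820 | 3933 | 37.2 | −0.4 | 0.4 |
| COR | 2023 | Raw | 230,611 | 0 | 385,276 | 1621 | 805 | 399.6 | 25.8 | 1001.2 |
| COR | 2023 | P0 | 217,274 | 1 | 9992 | 1189 | 829 | 98.2 | 2.8 | 11.5 |
| LEO | 2023 | Raw | 79,893 | 0 | 4,461,868 | 10,134 | 2795 | 424.1 | 22.1 | 1547.6 |
| LEO | 2023 | P0 | 60,265 | 1 | 9994 | 3278 | 2798 | 47.4 | 1.9 | 3.9 |
| SEV | 2020 | Raw | 5722 | 0 | 50,593 | 3483 | 3935 | 51.1 | 4.5 | 107.2 |
| SEV | 2020 | P0 | 5523 | 1 | 9926 | 3547 | 3952 | 38.6 | −0.9 | 1.5 |
| SEV | 2021 | Raw | 58,847 | 0 | 43,886 | 2053 | 2064 | 47.4 | 3.2 | 81.4 |
| SEV | 2021 | P0 | 56,610 | 1 | 9938 | 2099 | 2086 | 40.2 | 0.8 | 5.0 |
| VALL | 2020 | Raw | 161,369 | 0 | 57,030 | 4331 | 4026 | 60.4 | 0.8 | 5.6 |
| VALL | 2020 | P0 | 157,237 | 1 | 10,000 | 4360 | 4066 | 56.5 | 0.3 | −0.9 |
| VALL | 2021 | Raw | 48,900 | 0 | 399,669 | 3464 | 3298 | 186.1 | 41.1 | 2253.6 |
| VALL | 2021 | P0 | 44,992 | 1 | 9941 | 3392 | 3416 | 55.2 | 0.1 | −0.7 |
| VALL | 2022 | Raw | 297,140 | 0 | 783,500 | 1696 | 1373 | 221.8 | 84.2 | 13,859.8 |
| VALL | 2022 | P0 | 279,970 | 1 | 9911 | 1640 | 1432 | 66.2 | 1.5 | 4.6 |
| VALL | 2023 | Raw | 344,669 | 0 | 295,365 | 1555 | 1171 | 221.6 | 35.0 | 1763.3 |
| VALL | 2023 | P0 | 314,039 | 1 | 9997 | 1548 | 1270 | 77.0 | 1.2 | 2.1 |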
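Table A2. Yield descriptive statistics of original (Raw) and preprocessed (P0) data of barley crops. n is the number of data points, Min the minimum value, Max the maximum value, and Mdn the median value.
| Loc. | Year | Type | n | Min (kg ha−1) | Max (kg ha−1) | Mean (kg ha−1) | Mdn (kg ha−1) | CV (%) | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|---|---|
| BUR | 2021 | P0 | 13,612 | 8 | 9980 | 3936 | 3899 | 34.4 | 0.0 | 0.5 |
| BUR | 2022 | P0 | 15,292 | 1 | 9616 | 2666 | 2540 | 36.7 | 1.3 | 4.6 |
| COR | 2021 | Raw | 17,711 | 0 | 26,992 | 3154 | 2698 | 54.8 | 1.0 | 5.4 |
| COR | 2021 | P0 | 17,290 | 1 | 9890 | 3217 | 2737 | 50.7 | 0.6 | −0.2 |
| COR | 2022 | Raw | 2540 | 0 | 10,657 | 2232 | 2440 | 50.7 | 0.5 | 4.1 |
| COR | 2022 | P0 | 2404 | 1 | 8231 | 2346 | 2494 | 42.2 | 0.3 | 2.2 |
| COR | 2023 | Raw | 95,048 | 0 | 539,489 | 1912 | 1267 | 362.9 | 38.4 | 1973.9 |
| COR | 2023 | P0 | 89,612 | 1 | 9816 | 1650 | 1352 | 85.1 | 1.2 | 2.0 |
| VALL | 2020 | Raw | 155,085 | 0 | 53,320 | 3657 | 3587 | 56.0 | 1.0 | 10.8 |
| VALL | 2020 | P0 | 150,975 | 1 | 9999 | 3713 | 3648 | 51.6 | 0.3 | −0.5 |
| VALL | 2021 | Raw | 459,268 | 0 | 2,910,450 | 3892 | 3429 | 287.5 | 108.8 | 22,414.3 |
| VALL | 2021 | P0 | 412,130 | 1 | 10,000 | 3738 | 3627 | 54.2 | 0.4 | −0.3 |
| VALL | 2022 | Raw | 865,854 | 0 | 2,910,450 | 2922 | 2125 | 300.1 | 141.5 | 36,038.5 |
| VALL | 2022 | P0 | 795,430 | 1 | 10,000 | 2743 | 2250 | 71.8 | 1.0 | 0.6 |
| VALL | 2023 | Raw | 490,984 | 0 | 727,859 | 1563 | 1114 | 229.7 | 65.0 | 8145.7 |
| VALL | 2023 | P0 | 442,432 | 1 | 9997 | 1602 | 1234 | 91.7 | 1.9 | 4.4 |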

Appendix B

This appendix presents the results of applying the 3SD method to the study dataset, compared with the Raw dataset and the G1.0L2.5 cleaned data. Table A3 and Table A4 present the comparisons of statistics for the wheat and barley examples, respectively. Figure A1 shows the comparative semivariograms for both crops. Finally, Table A5 and Table A6 compare the performance of the ML models for wheat and barley, respectively.
Table A3. Comparison of statistics of Raw, 3SD, and G1.0L2.5 yield data of wheat in Córdoba (2021).
| Type | Mean | Median | SD | CV | Skew | Kurt |
|---|---|---|---|---|---|---|
| Raw | 3453.3 | 3579.4 | 1300.7 | 41.6 | 0.6 | 10.9 |
| 3SD | 3895.6 | 3866.5 | 859.9 | 22.1 | 0.2 | 1.1 |
| G1.0L2.5 | 3944.1 | 3889.5 | 700.8 | 17.8 | 0.5 | 0.3 |
Table A4. Comparison of statistics of Raw, 3SD, and G1.0L2.5 yield data of barley in Valladolid (2020).
| Type | Mean | Median | SD | CV | Skew | Kurt |
|---|---|---|---|---|---|---|
| Raw | 3657.4 | 3587.3 | 2050.0 | 56.1 | 1.0 | 10.8 |
| 3SD | 3988.9 | 4075.0 | 1644.1 | 41.2 | 0.1 | −0.7 |
| G1.0L2.5 | 4031.9 | 4150.0 | 1596.0 | 39.6 | 0.1 | −0.8 |
Figure A1. Semivariograms comparing RAW, 3SD, and G1.0L2.5 level of yield data: (a) wheat (Córdoba, 2021) and (b) barley (Valladolid, 2020).
Table A5. Performance of the trained models using PLSR and XGBoost algorithms for Raw, 3SD, and G1.0L2.5 cleaning levels for wheat in Córdoba (2021).
| Data | Type | PLSR R2 | PLSR MAE (kg ha−1) | PLSR RMSE (kg ha−1) | XGBoost R2 | XGBoost MAE (kg ha−1) | XGBoost RMSE (kg ha−1) |
|---|---|---|---|---|---|---|---|
| Raw | Training | 0.74 | 292.79 | 420.31 | 0.76 | 285.90 | 399.38 |
| Raw | Testing | 0.68 | 343.40 | 473.48 | 0.65 | 364.90 | 497.48 |
| 3SD | Training | 0.72 | 309.43 | 440.18 | 0.75 | 300.09 | 416.39 |
| 3SD | Testing | 0.71 | 306.49 | 449.44 | 0.71 | 316.96 | 455.97 |
| G1.0L2.5 | Training | 0.81 | 224.98 | 298.74 | 0.83 | 219.92 | 286.62 |
| G1.0L2.5 | Testing | 0.78 | 247.55 | 327.63 | 0.80 | 242.28 | 314.58 |
Table A6. Performance of the trained models using PLSR and XGBoost algorithms for Raw, 3SD, and G1.0L2.5 cleaning levels for barley in Valladolid (2020).
| Data | Type | PLSR R2 | PLSR MAE (kg ha−1) | PLSR RMSE (kg ha−1) | XGBoost R2 | XGBoost MAE (kg ha−1) | XGBoost RMSE (kg ha−1) |
|---|---|---|---|---|---|---|---|
| Raw | Training | 0.82 | 511.68 | 687.35 | 0.84 | 484.09 | 651.87 |
| Raw | Testing | 0.83 | 512.16 | 681.57 | 0.84 | 488.50 | 652.91 |
| 3SD | Training | 0.84 | 490.46 | 652.06 | 0.85 | 468.68 | 621.22 |
| 3SD | Testing | 0.84 | 490.89 | 656.01 | 0.85 | 474.07 | 634.19 |
| G1.0L2.5 | Training | 0.88 | 431.49 | 549.02 | 0.89 | 405.34 | 516.34 |
| G1.0L2.5 | Testing | 0.88 | 430.26 | 553.91 | 0.89 | 417.72 | 534.57 |

References

  1. United Nations. Population, 2025. Available online: https://www.un.org/en/global-issues/population (accessed on 19 October 2025).
  2. Bruinsma, J. World Agriculture: Towards 2015/2030; Routledge: London, UK, 2017.
  3. Billen, G.; Lassaletta, L.; Garnier, J. A biogeochemical view of the global agro-food system: Nitrogen flows associated with protein production, consumption and trade. Glob. Food Secur. 2014, 3, 209–219.
  4. Sun, W.; Whelan, B.; McBratney, A.B.; Minasny, B. An integrated framework for software to provide yield data cleaning and estimation of an opportunity index for site-specific crop management. Precis. Agric. 2013, 14, 376–391.
  5. Arizo-García, P.; Castiñeira-Ibáñez, S.; Tarrazó-Serrano, D.; Franch, B.; Rubio, C.; San Bautista, A. Use of Sentinel-2 Images to Elaborate a VRT Sensor-Based and Map-Based Nitrogen Fertilization in Wheat and Barley Crops. Appl. Sci. 2025, 15, 11646.
  6. Vega, A.; Córdoba, M.; Castro-Franco, M.; Balzarini, M. Protocol for automating error removal from yield maps. Precis. Agric. 2019, 20, 1030–1044.
  7. Leroux, C.; Jones, H.; Clenet, A.; Dreux, B.; Becu, M.; Tisseyre, B. A general method to filter out defective spatial observations from yield mapping datasets. Precis. Agric. 2018, 19, 789–808.
  8. Córdoba, M.A.; Bruno, C.I.; Costa, J.L.; Peralta, N.R.; Balzarini, M.G. Protocol for multivariate homogeneous zone delineation in precision agriculture. Biosyst. Eng. 2016, 143, 95–107.
  9. Blasch, G.; Li, Z.; Taylor, J.A. Multi-temporal yield pattern analysis method for deriving yield zones in crop production systems. Precis. Agric. 2020, 21, 1263–1290.
  10. Fita, D.; Rubio, C.; Franch, B.; Castiñeira-Ibáñez, S.; Tarrazó-Serrano, D.; San Bautista, A. Improving harvester yield maps postprocessing leveraging remote sensing data in rice crop. Precis. Agric. 2025, 26, 33.
  11. Gozdowski, D.; Samborski, S.; Dobers, E.S. Evaluation of Methods for the Detection of Spatial Outliers in the Yield Data of Winter Wheat. In Proceedings of the Colloquium Biometricum. Uniwersytet Przyrodniczy w Lublinie, Katedra Zastosowań Matematyki i Informatyki, 2010, Number 40. Available online: https://bibliotekanauki.pl/articles/9643 (accessed on 15 December 2023).
  12. Ping, J.L.; Dobermann, A. Processing of Yield Map Data. Precis. Agric. 2005, 6, 193–212.
  13. Sudduth, K.A.; Drummond, S.T. Yield Editor: Software for Removing Errors from Crop Yield Maps. Agron. J. 2007, 99, 1471–1482.
  14. Natale, A.; Antognelli, S.; Ranieri, E.; Cruciani, A.; Boggia, A. A Novel Cleaning Method for Yield Data Collected by Sensors: A Case Study on Winter Cereals. In Computational Science and Its Applications – ICCSA 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 684–691.
  15. Sanchez, C.; Pathak, D.; Miranda, M.; Charfuelan, M.; Helber, P.; Nuske, M.; Bischke, B.; Habelitz, P.; Rahman, N.; Mena, F.; et al. Influence of Data Cleaning Techniques on Sub-Field Yield Predictions. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; Volume 13, pp. 4852–4855.
  16. Jain, S.; Shukla, S.; Wadhvani, R. Dynamic selection of normalization techniques using data complexity measures. Expert Syst. Appl. 2018, 106, 252–262.
  17. Sainani, K.L. Dealing with Non-normal Data. PM&R 2012, 4, 1001–1005.
  18. López-Bellido, L. Cereales, Cultivos Herbáceos; Escuela Técnica Superior de Ingenieros Agrónomos, Universidad de Córdoba: Córdoba, Spain, 1991; Volume 1.
  19. López-Bellido, R.J.; López-Bellido, L.; Benítez-Vega, J.; López-Bellido, F.J. Tillage system, preceding crop, and nitrogen fertilizer in wheat crop: II. Water utilization. Agron. J. 2007, 99, 66–72.
  20. IGME. Web Map Viewer, 2023. Available online: https://igme.maps.arcgis.com/home/webmap/viewer.html?useExisting=1 (accessed on 26 September 2025).
  21. European Space Agency. Sentinel-2 User Handbook. Technical Report, ESA, 2015. ESA Standard Document. Available online: https://sentinels.copernicus.eu/documents/247904/685211/Sentinel-2_User_Handbook (accessed on 15 December 2025).
  22. Kappal, S. Data Normalization Using Median & Median Absolute Deviation (MMAD) based Z-Score for Robust Predictions vs. Min–Max Normalization. Lond. J. Res. Sci. Nat. Form. 2019, 19, 39–44.
  23. Atkinson, A.C.; Riani, M.; Corbellini, A. The Box–Cox Transformation: Review and Extensions. Stat. Sci. 2021, 36, 239–255.
  24. Maldaner, L.F.; Molin, J.P. Data processing within rows for sugarcane yield mapping. Sci. Agric. 2020, 77, e20180391.
  25. Robinson, T.; Metternicht, G. Comparing the performance of techniques to improve the quality of yield maps. Agric. Syst. 2005, 85, 19–41.
  26. Řezník, T.; Herman, L.; Trojanová, K.; Pavelka, T.; Leitgeb, Š. Interpolation of Data Measured by Field Harvesters: Deployment, Comparison and Verification. In Environmental Software Systems. Data Science in Action; Springer International Publishing: Cham, Switzerland, 2020; pp. 258–270.
  27. Aworka, R.; Cedric, L.S.; Adoni, W.Y.H.; Zoueu, J.T.; Mutombo, F.K.; Kimpolo, C.L.M.; Nahhal, T.; Krichen, M. Agricultural decision system based on advanced machine learning models for yield prediction: Case of East African countries. Smart Agric. Technol. 2022, 2, 100048.
  28. Lyle, G.; Bryan, B.A.; Ostendorf, B. Post-processing methods to eliminate erroneous grain yield measurements: Review and directions for future development. Precis. Agric. 2013, 15, 377–402.
  29. Neutatz, F.; Chen, B.; Alkhatib, Y.; Ye, J.; Abedjan, Z. Data Cleaning and AutoML: Would an Optimizer Choose to Clean? Datenbank-Spektrum 2022, 22, 121–130.
Figure 1. Location of the studied data. The fields studied are outlined in red for wheat and pink for barley, for each location and growing season presented.
Figure 2. Representation of the yield data in (a) TOPCON software and (b) Trimble software.
Figure 3. Two-stage workflow used in this study. Stage 1: Raw to P0 (biological limits), distribution/transformations to PP. Stage 2: PP to global (G) and local (L) filtering, combinations (GL), and comparison of cleaning levels to select the optimal protocol.
Figure 4. Proposed parametric cleaning procedure applied to PP data. Global adjustment uses field median ( α ) and IQR ( β ) to define bounds (Equations (2) and (3)) for n = 1.0–2.5 (G1.0–G2.5). Local adjustment uses neighborhood median ( γ ) and IQR ( δ ) within a 40 m radius (Equations (4) and (5)) to test L1.0–L2.5, producing 16 G × L combinations. All outputs are rescaled to a 10 × 10 m grid by averaging values within each pixel. The dotted box indicates that all the levels of cleaned data within it were rescaled, not only those obtained after applying the local adjustment.
Figure 5. Box-whisker plots of the standard deviation of yield as a function of the local adjustment search radius for an example field. Red lines indicate the median standard deviation for each box-whisker; white boxes indicate the mean standard deviation for each group; and black circles represent anomalous data.
Figure 6. Percentage of erased data in the different cleaning levels at polygon and pixel level for wheat datasets.
Figure 7. Percentage of erased data in the different cleaning levels at polygon and pixel level for barley datasets.
Figure 8. Semivariograms for all proposed cleaning levels using an example dataset: (a) wheat (Córdoba, 2021) and (b) barley (Valladolid, 2020).
Table 1. Characteristics of the Sentinel-2 bands used.
| Sentinel-2 Band | Central Wavelength (nm) | Spatial Resolution (m) |
|---|---|---|
| B02-Blue | 450 | 10 |
| B03-Green | 560 | 10 |
| B04-Red | 665 | 10 |
| B05-Vegetation Red-Edge | 705 | 20 |
| B06-Vegetation Red-Edge | 740 | 20 |
| B07-Vegetation Red-Edge | 783 | 20 |
| B08-NIR | 842 | 10 |
| B8A-Narrow NIR | 865 | 20 |
| B11-SWIR | 1610 | 20 |
| B12-SWIR | 2190 | 20 |
Table 2. Software and libraries used in the study.
| Software/Library | Version | Developer/Organization | Official URL |
|---|---|---|---|
| QGIS | 3.34.6 | QGIS Development Team | https://qgis.org (accessed on 12 January 2025) |
| scikit-gstat | 1.0.19 | M. Mälicke & contributors | https://scikit-gstat.readthedocs.io (accessed on 25 April 2025) |
| pykrige | 1.7.2 | PyKrige Developers | https://pykrige.readthedocs.io (accessed on 30 April 2025) |
| scikit-learn | 1.2.2 | scikit-learn Developers | https://scikit-learn.org (accessed on 31 August 2025) |
| xgboost | 3.0.5 | XGBoost Developers | https://xgboost.readthedocs.io (accessed on 1 September 2025) |
Table 3. Percentage of fields (%) with normally distributed yield data as a function of location, crop, and type of data (Type).
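| Location | Crop | Type | % |
|---|---|---|---|
| Burgos | Barley | P0 | 33.3 |
| Burgos | Wheat | P0 | 32.3 |
| Córdoba | Barley | Raw | 25.0 |
| Córdoba | Barley | P0 | 25.0 |
| Córdoba | Wheat | Raw | 6.9 |
| Córdoba | Wheat | P0 | 13.8 |
| León | Wheat | Raw | 0.0 |
| León | Wheat | P0 | 8.2 |
| Sevilla | Wheat | Raw | 0.0 |
| Sevilla | Wheat | P0 | 16.7 |
| Valladolid | Barley | Raw | 12.2 |
| Valladolid | Barley | P0 | 36.9 |
| Valladolid | Wheat | Raw | 7.9 |
| Valladolid | Wheat | P0 | 22.2 |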
Table 4. Percentage of fields (%) with normally distributed yield data as a function of location (Loc.), crop, and transformation used (Transf.).
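| Loc. | Crop | Transf. | % | Loc. | Crop | Transf. | % |
|---|---|---|---|---|---|---|---|
| BUR | Barley | $1/x_i$ | 4.9 | LEO | Wheat | $1/x_i$ | 12.2 |
| | | B–Cox | 29.4 | | | B–Cox | 22.5 |
| | | M–M | 33.3 | | | M–M | 8.2 |
| | | $\sqrt{x_i}$ | 16.7 | | | $\sqrt{x_i}$ | 26.5 |
| | | $x_i^2$ | 31.4 | | | $x_i^2$ | 0.0 |
| | | $x_i^3$ | 8.8 | | | $x_i^3$ | 0.0 |
| | | $\log_{10}(x_i)$ | 9.8 | | | $\log_{10}(x_i)$ | 12.2 |
| BUR | Wheat | $1/x_i$ | 5.1 | SEV | Wheat | $1/x_i$ | 0.0 |
| | | B–Cox | 25.3 | | | B–Cox | 16.7 |
| | | M–M | 32.3 | | | M–M | 16.7 |
| | | $\sqrt{x_i}$ | 23.2 | | | $\sqrt{x_i}$ | 0.0 |
| | | $x_i^2$ | 22.2 | | | $x_i^2$ | 0.0 |
| | | $x_i^3$ | 8.1 | | | $x_i^3$ | 0.0 |
| | | $\log_{10}(x_i)$ | 9.1 | | | $\log_{10}(x_i)$ | 0.0 |
| COR | Barley | $1/x_i$ | 0.0 | VALL | Barley | $1/x_i$ | 0.0 |
| | | B–Cox | 50.0 | | | B–Cox | 28.3 |
| | | M–M | 25.0 | | | M–M | 36.9 |
| | | $\sqrt{x_i}$ | 25.0 | | | $\sqrt{x_i}$ | 23.6 |
| | | $x_i^2$ | 0.0 | | | $x_i^2$ | 8.6 |
| | | $x_i^3$ | 0.0 | | | $x_i^3$ | 1.3 |
| | | $\log_{10}(x_i)$ | 0.0 | | | $\log_{10}(x_i)$ | 1.3 |
| COR | Wheat | $1/x_i$ | 0.0 | VALL | Wheat | $1/x_i$ | 0.0 |
| | | B–Cox | 48.3 | | | B–Cox | 30.2 |
| | | M–M | 13.8 | | | M–M | 22.2 |
| | | $\sqrt{x_i}$ | 13.8 | | | $\sqrt{x_i}$ | 25.4 |
| | | $x_i^2$ | 10.3 | | | $x_i^2$ | 4.0 |
| | | $x_i^3$ | 0.0 | | | $x_i^3$ | 3.2 |
| | | $\log_{10}(x_i)$ | 27.6 | | | $\log_{10}(x_i)$ | 3.2 |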
B–Cox: Box–Cox data transformation; M–M: Min–Max data transformation; BUR: Burgos; COR: Córdoba; LEO: León; SEV: Sevilla; VALL: Valladolid; $x_i$: yield value of a data point.
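As an illustration of how a percentage like those in Table 4 could be obtained, the following minimal sketch applies several of the listed transformations to each field and counts the fields that pass a normality check. It rests on assumptions not taken from this section: the normality test is a Jarque–Bera check with the asymptotic chi-square p-value (the test actually used in the study is not stated here), the significance level `alpha` is 0.05, the Box–Cox transformation is omitted because it requires estimating a parameter, and all function names and example data are hypothetical.

```python
import math

def skew_kurt(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def jarque_bera_pvalue(values):
    """Asymptotic Jarque-Bera p-value (assumed stand-in for the study's test).

    The statistic follows a chi-square law with 2 degrees of freedom,
    whose survival function is exp(-JB / 2).
    """
    n = len(values)
    s, k = skew_kurt(values)
    jb = n / 6.0 * (s ** 2 + k ** 2 / 4.0)
    return math.exp(-jb / 2.0)

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Transformations applied to a whole field; yields must be strictly positive.
TRANSFORMS = {
    "1/x": lambda xs: [1.0 / x for x in xs],
    "x^2": lambda xs: [x ** 2 for x in xs],
    "x^3": lambda xs: [x ** 3 for x in xs],
    "log10": lambda xs: [math.log10(x) for x in xs],
    "min-max": min_max,
}

def percent_normal(fields, transform, alpha=0.05):
    """Share (%) of fields whose transformed yields pass the normality check."""
    passed = sum(
        1 for yields in fields
        if jarque_bera_pvalue(transform(yields)) > alpha
    )
    return 100.0 * passed / len(fields)

if __name__ == "__main__":
    # Hypothetical yields (kg ha-1) for two small fields.
    fields = [
        [3100.0, 3400.0, 3650.0, 3900.0, 4200.0],
        [250.0, 260.0, 270.0, 280.0, 290.0, 300.0, 310.0, 320.0, 330.0, 9000.0],
    ]
    for name, transform in TRANSFORMS.items():
        print(name, percent_normal(fields, transform))
```

Because Min–Max scaling is a linear rescaling, it leaves skewness and kurtosis unchanged, so a moment-based check of this kind returns the same result for M–M as for the untransformed data. The asymptotic p-value is a rough approximation for small samples.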
Table 5. Median coefficient of variation and percentage of total P0 yield data for each yield range and studied crop.
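| Range of Yields (kg ha−1) | Wheat CVmedian (%) | Wheat % of Total Data | Barley CVmedian (%) | Barley % of Total Data |
|---|---|---|---|---|
| $0 < x_i \leq 500$ | 44.4 | 12.1 | 53.7 | 9.8 |
| $500 < x_i \leq 1000$ | 16.4 | 15.9 | 16.3 | 11.7 |
| $1000 < x_i \leq 1500$ | 10.0 | 14.3 | 10.0 | 12.5 |
| $1500 < x_i \leq 2000$ | 7.2 | 11.1 | 7.2 | 10.8 |
| $2000 < x_i \leq 2500$ | 5.5 | 9.7 | 5.6 | 9.4 |
| $2500 < x_i \leq 3000$ | 4.6 | 8.2 | 4.6 | 7.9 |
| $3000 < x_i \leq 3500$ | 3.9 | 6.7 | 3.9 | 6.8 |
| $3500 < x_i \leq 4000$ | 3.3 | 5.5 | 3.3 | 6.0 |
| $4000 < x_i \leq 4500$ | 3.0 | 4.3 | 2.9 | 5.4 |
| $4500 < x_i \leq 5000$ | 2.6 | 3.1 | 2.6 | 4.8 |
| $5000 < x_i \leq 5500$ | 2.4 | 2.1 | 2.4 | 4.1 |
| $5500 < x_i \leq 6000$ | 2.2 | 1.7 | 2.1 | 3.1 |
| $6000 < x_i \leq 6500$ | 2.0 | 1.3 | 2.0 | 2.2 |
| $6500 < x_i \leq 7000$ | 1.8 | 1.0 | 1.9 | 1.6 |
| $7000 < x_i \leq 7500$ | 1.8 | 0.8 | 1.7 | 1.2 |
| $7500 < x_i \leq 8000$ | 1.6 | 0.7 | 1.6 | 0.9 |
| $8000 < x_i \leq 8500$ | 1.5 | 0.6 | 1.5 | 0.6 |
| $8500 < x_i \leq 9000$ | 1.4 | 0.5 | 1.5 | 0.4 |
| $9000 < x_i \leq 9500$ | 1.3 | 0.3 | 1.4 | 0.3 |
| $9500 < x_i \leq 10{,}000$ | 1.3 | 0.1 | 1.3 | 0.2 |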
Table 6. Summary of statistics for each type of yield data of wheat in Córdoba (2021).
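| Type | Mean | Median | SD | CV | Skew | Kurt |
|---|---|---|---|---|---|---|
| Raw | 3453.3 | 3579.4 | 1300.7 | 41.6 | 0.6 | 10.9 |
| PP | 3559.7 | 3622.5 | 1246.4 | 35.0 | −0.1 | 0.6 |
| G1.0 | 3943.3 | 3889.0 | 700.5 | 17.8 | 0.5 | 0.3 |
| G1.5 | 3949.3 | 3894.0 | 771.9 | 19.5 | 0.5 | 0.5 |
| G2.0 | 3932.8 | 3883.0 | 811.7 | 20.6 | 0.4 | 0.8 |
| G2.5 | 3915.3 | 3874.0 | 834.3 | 21.3 | 0.3 | 0.9 |
| G1.0L1.0 | 3944.3 | 3890.5 | 688.0 | 17.4 | 0.5 | 0.2 |
| G1.0L1.5 | 3946.8 | 3891.0 | 699.3 | 17.7 | 0.5 | 0.3 |
| G1.0L2.0 | 3945.1 | 3891.0 | 701.0 | 17.8 | 0.5 | 0.3 |
| G1.0L2.5 | 3944.1 | 3889.5 | 700.8 | 17.8 | 0.5 | 0.3 |
| G1.5L1.0 | 3949.8 | 3891.0 | 742.8 | 18.8 | 0.5 | 0.5 |
| G1.5L1.5 | 3953.1 | 3895.0 | 764.2 | 19.3 | 0.5 | 0.5 |
| G1.5L2.0 | 3952.1 | 3896.0 | 769.5 | 19.5 | 0.5 | 0.5 |
| G1.5L2.5 | 3950.5 | 3894.0 | 770.8 | 19.5 | 0.5 | 0.5 |
| G2.0L1.0 | 3939.3 | 3885.0 | 773.8 | 19.6 | 0.5 | 0.8 |
| G2.0L1.5 | 3943.6 | 3889.0 | 796.8 | 20.2 | 0.4 | 0.8 |
| G2.0L2.0 | 3941.2 | 3888.0 | 805.0 | 20.4 | 0.4 | 0.8 |
| G2.0L2.5 | 3938.2 | 3887.5 | 808.6 | 20.5 | 0.4 | 0.8 |
| G2.5L1.0 | 3928.3 | 3877.0 | 789.1 | 20.1 | 0.4 | 0.9 |
| G2.5L1.5 | 3932.7 | 3883.0 | 813.8 | 20.7 | 0.4 | 0.9 |
| G2.5L2.0 | 3928.3 | 3881.0 | 825.8 | 21.0 | 0.3 | 0.9 |
| G2.5L2.5 | 3924.4 | 3881.0 | 830.3 | 21.2 | 0.3 | 0.9 |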
Table 7. Summary of statistics for each type of yield data of barley in Valladolid (2020).
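| Type | Mean | Median | SD | CV | Skew | Kurt |
|---|---|---|---|---|---|---|
| Raw | 3657.4 | 3587.3 | 2050.0 | 56.1 | 1.0 | 10.8 |
| PP | 3797.6 | 3718.0 | 1858.3 | 48.9 | 0.3 | −0.5 |
| G1.0 | 4030.5 | 4150.0 | 1593.9 | 39.5 | 0.1 | −0.8 |
| G1.5 | 4021.0 | 4138.0 | 1627.2 | 40.5 | 0.1 | −0.7 |
| G2.0 | 4009.6 | 4107.0 | 1633.6 | 40.7 | 0.1 | −0.7 |
| G2.5 | 4003.3 | 4093.0 | 1633.3 | 40.8 | 0.1 | −0.7 |
| G1.0L1.0 | 4033.6 | 4142.0 | 1601.2 | 39.7 | 0.1 | −0.8 |
| G1.0L1.5 | 4033.6 | 4149.0 | 1600.6 | 39.7 | 0.1 | −0.8 |
| G1.0L2.0 | 4033.0 | 4151.0 | 1597.6 | 39.6 | 0.1 | −0.8 |
| G1.0L2.5 | 4031.9 | 4150.0 | 1596.0 | 39.6 | 0.1 | −0.8 |
| G1.5L1.0 | 4023.3 | 4145.0 | 1629.6 | 40.5 | 0.1 | −0.8 |
| G1.5L1.5 | 4025.3 | 4140.0 | 1630.7 | 40.5 | 0.1 | −0.7 |
| G1.5L2.0 | 4024.5 | 4143.0 | 1630.1 | 40.5 | 0.1 | −0.7 |
| G1.5L2.5 | 4023.4 | 4140.0 | 1628.7 | 40.5 | 0.1 | −0.7 |
| G2.0L1.0 | 4011.6 | 4115.0 | 1635.9 | 40.8 | 0.0 | −0.8 |
| G2.0L1.5 | 4015.1 | 4115.0 | 1637.7 | 40.8 | 0.1 | −0.8 |
| G2.0L2.0 | 4014.1 | 4115.0 | 1637.2 | 40.8 | 0.1 | −0.7 |
| G2.0L2.5 | 4012.7 | 4113.0 | 1635.7 | 40.8 | 0.1 | −0.7 |
| G2.5L1.0 | 4007.0 | 4106.0 | 1634.1 | 40.8 | 0.1 | −0.8 |
| G2.5L1.5 | 4009.8 | 4106.0 | 1636.4 | 40.8 | 0.1 | −0.7 |
| G2.5L2.0 | 4008.0 | 4100.0 | 1636.4 | 40.8 | 0.1 | −0.7 |
| G2.5L2.5 | 4006.2 | 4097.0 | 1635.1 | 40.8 | 0.1 | −0.7 |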
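The summary statistics reported in Tables 6 and 7 can be reproduced for any yield vector with a few lines of code. The sketch below is a minimal illustration under stated assumptions: the SD is taken as the sample standard deviation, the CV as 100 × SD / mean (consistent with, for example, the PP row of Table 6, where 1246.4 / 3559.7 ≈ 35.0%), and kurtosis as excess kurtosis (suggested by the negative values in Table 7). The function name `summarize` and the example data are hypothetical.

```python
import statistics

def summarize(yields):
    """Mean, median, SD, CV (%), skewness and excess kurtosis of yield values."""
    n = len(yields)
    mean = statistics.fmean(yields)
    sd = statistics.stdev(yields)  # sample SD (n - 1 denominator), an assumption
    # Central moments (n denominator) for the moment-based shape statistics.
    m2 = sum((y - mean) ** 2 for y in yields) / n
    m3 = sum((y - mean) ** 3 for y in yields) / n
    m4 = sum((y - mean) ** 4 for y in yields) / n
    return {
        "mean": mean,
        "median": statistics.median(yields),
        "sd": sd,
        "cv": 100.0 * sd / mean,
        "skew": m3 / m2 ** 1.5,
        "kurt": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal law
    }

if __name__ == "__main__":
    # Hypothetical yield values (kg ha-1) for one cleaned field.
    example = [3650.0, 3820.0, 3905.0, 4010.0, 4100.0, 4480.0]
    for name, value in summarize(example).items():
        print(f"{name}: {value:.2f}")
```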
Table 8. Performance of the trained models using PLSR and XGBoost algorithms for Raw, PP, and the preselected cleaning levels for wheat in Córdoba (2021).
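| Data Type | Set | PLSR R² | PLSR MAE (kg ha−1) | PLSR RMSE (kg ha−1) | XGBoost R² | XGBoost MAE (kg ha−1) | XGBoost RMSE (kg ha−1) |
|---|---|---|---|---|---|---|---|
| Raw | Training | 0.74 | 292.79 | 420.31 | 0.76 | 285.90 | 399.38 |
| Raw | Testing | 0.68 | 343.40 | 473.48 | 0.65 | 364.90 | 497.48 |
| PP | Training | 0.73 | 299.86 | 426.56 | 0.76 | 292.20 | 401.94 |
| PP | Testing | 0.67 | 359.49 | 477.59 | 0.66 | 364.66 | 485.83 |
| G1.0L1.0 | Training | 0.84 | 204.99 | 273.65 | 0.84 | 203.48 | 266.03 |
| G1.0L1.0 | Testing | 0.81 | 226.01 | 300.80 | 0.81 | 240.47 | 305.48 |
| G1.0L1.5 | Training | 0.82 | 218.83 | 291.61 | 0.83 | 215.64 | 281.63 |
| G1.0L1.5 | Testing | 0.79 | 240.87 | 319.30 | 0.81 | 234.25 | 305.93 |
| G1.0L2.0 | Training | 0.81 | 223.70 | 297.42 | 0.83 | 219.37 | 286.13 |
| G1.0L2.0 | Testing | 0.79 | 246.38 | 325.81 | 0.80 | 240.33 | 312.99 |
| G1.0L2.5 | Training | 0.81 | 224.98 | 298.74 | 0.83 | 219.92 | 286.62 |
| G1.0L2.5 | Testing | 0.78 | 247.55 | 327.63 | 0.80 | 242.28 | 314.58 |
| G1.5L1.0 | Training | 0.81 | 229.46 | 317.81 | 0.82 | 230.10 | 309.10 |
| G1.5L1.0 | Testing | 0.80 | 239.65 | 335.17 | 0.77 | 260.39 | 355.59 |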
Table 9. Performance of the trained models using PLSR and XGBoost algorithms for Raw, PP, and the preselected cleaning levels for barley in Valladolid (2020).
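| Data Type | Set | PLSR R² | PLSR MAE (kg ha−1) | PLSR RMSE (kg ha−1) | XGBoost R² | XGBoost MAE (kg ha−1) | XGBoost RMSE (kg ha−1) |
|---|---|---|---|---|---|---|---|
| Raw | Training | 0.82 | 511.68 | 687.35 | 0.84 | 484.09 | 651.87 |
| Raw | Testing | 0.83 | 512.16 | 681.57 | 0.84 | 488.50 | 652.91 |
| PP | Training | 0.84 | 493.28 | 655.74 | 0.85 | 465.44 | 618.57 |
| PP | Testing | 0.83 | 506.20 | 652.82 | 0.85 | 470.50 | 619.50 |
| G1.0L1.0 | Training | 0.89 | 415.78 | 530.90 | 0.90 | 389.38 | 497.80 |
| G1.0L1.0 | Testing | 0.88 | 424.23 | 545.37 | 0.90 | 396.23 | 507.98 |
| G1.0L1.5 | Training | 0.88 | 428.74 | 545.60 | 0.90 | 402.11 | 512.71 |
| G1.0L1.5 | Testing | 0.88 | 425.79 | 542.82 | 0.89 | 405.71 | 517.73 |
| G1.0L2.0 | Training | 0.88 | 430.49 | 547.14 | 0.90 | 404.17 | 514.29 |
| G1.0L2.0 | Testing | 0.88 | 429.90 | 553.44 | 0.89 | 417.02 | 532.90 |
| G1.0L2.5 | Training | 0.88 | 431.49 | 549.02 | 0.89 | 405.34 | 516.34 |
| G1.0L2.5 | Testing | 0.88 | 430.26 | 553.91 | 0.89 | 417.72 | 534.57 |
| G1.5L1.0 | Training | 0.87 | 441.99 | 573.70 | 0.89 | 416.84 | 541.11 |
| G1.5L1.0 | Testing | 0.87 | 443.58 | 586.74 | 0.88 | 424.47 | 557.62 |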
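For readers who wish to compute the three performance metrics of Tables 8 and 9 on their own predictions, a minimal sketch follows. It assumes that R² is the coefficient of determination (1 − SS_res / SS_tot), which may differ from a squared correlation coefficient if the study used the latter; MAE and RMSE are expressed in the units of the yield data (kg ha−1). The function name `regression_metrics` and the example values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, MAE and RMSE for paired observed and predicted yields."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }

if __name__ == "__main__":
    # Hypothetical observed and predicted yields (kg ha-1).
    observed = [3000.0, 4000.0, 5000.0]
    predicted = [3100.0, 3900.0, 5200.0]
    print(regression_metrics(observed, predicted))
```

Evaluating the same function separately on the training and testing subsets gives the two rows reported per data type.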
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
