AI-Based Prediction of Carrot Yield and Quality on Tropical Agriculture

Abstract: The adoption of artificial intelligence tools can improve production efficiency in the agroindustry. Our objective was to perform the predictive modeling of carrot yield and quality. The crop was grown in two commercial areas during the summer season in Brazil. Root samples were taken at 200 points on a 30 × 30 m sampling grid at 82 and 116 days after sowing in both areas. The total fresh biomass, aerial part, and root biometry were quantified prior to crop harvesting to measure yield. The quality of the roots was assessed in the laboratory by sub-sampling three carrots and measuring the concentration of total soluble solids (°Brix) and firmness. Vegetation indices were extracted from satellite imagery. The most important variables for the predictive models were selected by principal component analysis and submitted to the Artificial Neural Network (ANN), Random Forest (RF), and Multiple Linear Regression (MLR) algorithms. The SAVI and NDVI indices stood out as predictors of crop yield, and the results from the ANN (R² = 0.68) were superior to those of the RF (R² = 0.67) and MLR (R² = 0.61) models. Carrot quality could not be modeled by the predictive models in this study; however, it should be explored in future research, including other crop variables.


Introduction
Carrot (Daucus carota L.) is among the top 10 vegetables grown worldwide [1]. It is a geocarpic vegetable crop established underground and is abundant in biologically active substances, such as vitamins, anthocyanins, and carotenoids. Such compounds have antioxidant, antitumor, and antihypertensive capabilities and other beneficial characteristics for human health [2]. In addition to these components, carrots contain fermentable sugars that can be used to produce bioethanol and can contribute to energy security.
The genetic factors of the cultivars, growing season, environmental conditions, and management practices in the field affect the quality of the roots [3]. Among the factors associated with crop quality, in decreasing order of importance, are cultivar, environmental conditions, and cultivation practices [4]. These factors interact with each other to determine crop yield and root quality. The quality of table carrots (in natura), the focus of this study, is an even stricter requirement for food consumption. Uniformity in attributes such as size, shape, and flavor defines the market value and consumer preference. There are numerous alternatives to evaluate the general quality of the roots, such as the flavor and texture for consumption, the size classes, and the absence of defects in the final product sent to market. The morphology of carrots is influenced by their genetic components, which impact the relationship between shoot biomass and storage root biomass [5]. Over the period of cultivation, there is a deformation of the roots, associated with the accumulation of biomass, according to the environmental and soil conditions of the agricultural area. The carrot has a pivoting root system, with a long, conical primary root, generally orange in color. The aerial part of the plant is composed of compound, pinnate leaves with lanceolate or ovate segments, serrated margins, and a dark green color. During the second year of cultivation, the plant produces an umbel-type inflorescence, with numerous small white or slightly pink flowers.
The content of total soluble solids (sweetness or °Brix) and firmness are directly related to the quality of carrots for the food sector. The °Brix is commonly evaluated in the laboratory by quantifying the amount of free sugars in the root. Firmness is measured by the resistance of the food to penetration by an instrument, from which the crispness of the food when consumed fresh is inferred. These characteristics make up essential appearance, flavor, and texture factors that indicate the freshness and ripeness of the product.
Flavor, sweetness, and texture are organoleptic variables appreciated during consumption. Mpemba et al. [6] evaluated the soluble solid content of fresh carrots with a refractometer, obtaining values of 9.33 °Brix and a firmness of 23.58 N. These values are comparable to the averages of 7.82 °Brix and 31.81 N obtained in this study. Temperature affects the sweetness of carrot roots. Colder temperatures (about 9 °C) increase the glucose and fructose contents of the root, while normal temperatures (21 °C) favor the accumulation of sucrose and carotene in the product [7]. The climatic conditions of this study, around 25 °C, favored greater color in the carrots rather than sweetness. Small changes in the biochemical composition of plants, closely linked to the plant's physiology, make it possible to evaluate the crop quickly and non-destructively in relation to its physiological stage. In carrot germplasm, total sugar varies more than ten-fold [8], and terpenoids are genetically controlled by 30 quantitative trait loci (QTLs) [9]. The sugar content affects the level of sweetness, and volatile terpenoids affect the product's flavor [10].
Some contributions of this study to agricultural engineering include innovative tools for monitoring carrot yield and quality characteristics, which are essential for agronomic decision-making processes in the field. Investigating the local variation in the yield and quality of carrots would make it possible to define management zones and to identify locations with higher yield potential for the crop. In agricultural areas with a greater production potential, resources such as fertilizers, pesticides, seed population, and sowing period can be used more efficiently. Areas with carrots of higher sugar content can be directed towards energy production or towards higher-value consumer markets.
Different predictive approaches to crop parameters leveraging remote sensing (RS) and artificial intelligence (AI) are being explored in the literature [11][12][13].RS involves collecting data on the Earth's surface using sensors installed on satellites or remotely piloted aircraft to monitor large areas at low costs [14], control in-season weeds, identify the health of the crop, and predict crop yield in a non-destructive way.
Vegetation indices (VIs) are calculated by equations from the spectral bands [15], which makes it possible to map the dynamics of the crop to better understand the field conditions and practices that interfere with the yield and its desired qualitative attributes. Wei et al. [11] did not observe a linear relationship between 88 spectral bands and carrot yield. Therefore, machine learning methods are well suited to modeling this variable, which shows non-linear behavior. The authors identified that the reflectance of the NIR (near-infrared) spectral band increased during crop growth, and the RGB bands decreased at 40 days after sowing (DAS) due to the root phenological stage. The accuracy of crop yield prediction varies depending on the climatic conditions during the development of the plants. The correlation between VIs and yield changes throughout the crop cycle, so it would not be ideal to use only a fixed period to model crop yield through remote sensing [12][13][14]. Monitoring geocarpic crops, such as carrots, is more complex because the organ of interest is established underground [15][16], and its extraction is necessary for sampling [17].
Agricultural systems based on machine learning (ML) methods can become more efficient and sustainable in different production scenarios [18][19]. ML is fundamental for increasing crop yield in a sustainable manner, helping to interpret and correlate field data with computational techniques that can support decision making in agriculture [20][21]. Yield and quality modeling for underground crops, such as carrots, is limited, as the product of interest is evaluated indirectly through its aerial part, which reduces the accuracy of predicting its parameters [22][23]. There are still no studies that have jointly explored the yield and quality of carrots under commercial conditions. This study aims to establish a relationship between the variables of total fresh mass, aerial part, and roots, as well as the variables of length, diameter, °Brix, and firmness of the roots, together with crop reflectance data on a large scale.

Experimental Areas
This work was carried out in two irrigated commercial areas in São Gotardo, Minas Gerais, Brazil. Encrusted seeds of the carrot hybrid EX 4098 (Tropical Nantes Group) were sown in the summer season (2022/2023) on two different dates: 29 August (Site 1, coordinates 19°24′33.3″S 46°15′58.0″W, UTM23S) and 24 October (Site 2, coordinates 19°25′45.7″S 46°16′46.6″W, UTM23S), with a final population of 466,700 plants per hectare after thinning (Figure 1). The soil under the pivots is a red-yellow Oxisol with a gently undulating relief (LVd4). The soils at the experimental sites have a history of agricultural cultivation spanning over 30 years with annual crops and are used most of the time for vegetable crops. The soil profile was prepared with a subsoiler (0.06 m deep) and a rotary hoe with a tiller (0.015 m deep) for sowing the encrusted hybrid seeds. The areas were chosen according to the criteria of accessibility to the cultivation site and the logistics of transporting the carrot samples to the laboratory.

Root Sampling and Biometric Assessment
Root sampling consisted of two data collection periods, 82 and 116 DAS, with 50 points at each data collection in both experimental areas (total: 200 sampling points). These periods were chosen as the best times for crop modeling according to Suarez et al. [13], corresponding to full radial filling (82 DAS) and the date before crop harvesting (116 DAS). The climatic conditions were 1600 degree-days and 2260 degree-days (base temperature of 3 °C) for 82 and 116 DAS, respectively (Figure 2). Irrigation and cultural treatments were carried out according to the needs of the crop, which were monitored daily by a regional station. A hailstorm occurred at Site 1 between 82 and 116 DAS, temporarily affecting the aerial part of the plants. One week after the event, following a nutritional management strategy, the plants had recovered.
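The thermal sums reported for 82 and 116 DAS can be illustrated by accumulating the daily mean temperature above the base temperature. The sketch below assumes the simple averaging method (mean of the daily maximum and minimum), which the text does not confirm; the function name and the temperature series are hypothetical and are not station data from this study.

```python
def growing_degree_days(t_max, t_min, t_base=3.0):
    """Accumulate degree-days with the simple averaging method (an assumption).

    t_max, t_min: sequences of daily maximum and minimum air temperature (deg C).
    t_base: base temperature; 3 deg C is the value stated in the text.
    """
    total = 0.0
    for hi, lo in zip(t_max, t_min):
        daily_mean = (hi + lo) / 2.0
        # Days with a mean below the base temperature contribute nothing.
        total += max(daily_mean - t_base, 0.0)
    return total


# Hypothetical example: 82 days with a constant 28/17 deg C regime.
gdd_82 = growing_degree_days([28.0] * 82, [17.0] * 82)
print(round(gdd_82, 1))
```

With real station records, the two lists would hold the observed daily extremes from sowing up to each sampling date.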
All sampling points were georeferenced with a GNSS (Global Navigation Satellite System) receiver. The sampling points were spaced on a grid of 30 m × 30 m, which was defined to reduce the interference between the sampling points and the satellite imagery used to obtain the VIs. The carrot roots were collected manually within the beds containing four double-sowing lines. A metal template of 1.60 m × 0.155 m was used to delimit the sampling area of 0.25 m². This sampling area was chosen by observing the mass of the carrots and to facilitate data collection in the field (Figure 3). Based on the carrots collected in the delimited area, the total fresh biomass and the biomass of the aerial part and roots were determined separately. The aerial part and roots were separated close to the base of the root using a knife in the field. A semi-analytical balance with an accuracy of 0.01 g was used to measure crop biomass. Crop yield was extrapolated from the mass in grams over the 0.25 m² area to boxes per hectare (29 kg box, the commercial standard for carrots before the washing process). Then, three carrots were subsampled per sampling point for the biometric analysis of root length and diameter. These roots were sent to the laboratory for qualitative analysis.
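The extrapolation from the template to boxes per hectare is a unit conversion, sketched below. The sampling area (0.25 m²) and the box mass (29 kg) come from the text; the function name and the sample mass are hypothetical.

```python
SAMPLE_AREA_M2 = 0.25      # template area stated in the text
BOX_MASS_KG = 29.0         # commercial box stated in the text
M2_PER_HECTARE = 10_000.0


def boxes_per_hectare(root_mass_g):
    """Convert the root mass (g) collected in one template to boxes per hectare."""
    kg_per_m2 = (root_mass_g / 1000.0) / SAMPLE_AREA_M2
    kg_per_ha = kg_per_m2 * M2_PER_HECTARE
    return kg_per_ha / BOX_MASS_KG


# Hypothetical sample: 1450 g of roots inside the template.
print(round(boxes_per_hectare(1450.0), 1))
```

Note that the template dimensions give 1.60 m × 0.155 m = 0.248 m², which the text rounds to 0.25 m²; the sketch uses the rounded value.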

Qualitative Analysis of the Roots
The subsampled carrots were sanitized, placed in identified plastic bags, and stored under refrigeration at 2 °C to prevent the loss of their characteristics. Total soluble solids (°Brix) and firmness readings for quality analysis were taken two days after collection on the three carrots from each sampling point, and the readings were averaged per sampling point. For the analysis of total soluble solids (TSS or °Brix), the refractive index method was used [24]. °Brix was determined with a VX0-90 portable digital refractometer (accuracy of 0.2%) with automatic temperature compensation to 20 °C. The pure, undiluted carrot juice obtained after maceration was measured, and the result was expressed as a percentage [25]. Root firmness was also assessed by a direct method, with a portable penetrometer, model PTR-300 (accuracy of 0.5%), equipped with an 8 mm diameter tip [26]. Three readings were taken near the base of the roots to calculate the average firmness per sampling point, and the results were expressed in newtons (N).

The Acquisition and Processing of Satellite Imagery
Satellite images were required to establish the relationship between the biometric and qualitative variables of the plant roots and the vegetation indices during the data collection period. The satellite images were acquired from the PlanetScope CubeSat platform, which consists of 148 different types of nanosatellites in orbit with high spatial and temporal resolutions that capture images in the following spectral ranges: 618 to 780 nm (red), 497 to 570 nm (green), 427 to 476 nm (blue), and above 700 nm (NIR), with a spatial resolution of 3.0 m and four spectral bands [27]. The images of the experimental areas were requested and subsequently downloaded for the calculation of the selected vegetation indices based on the spectral bands available.
Owing to the occurrence of clouds, an interval of at most three to five days before field sampling was recommended for the images used to extract the vegetation indices. The VIs that were calculated and analyzed were the NDVI (normalized difference vegetation index [28]), SAVI (soil-adjusted vegetation index [29]), EVI (enhanced vegetation index [30]), and RDVI (renormalized difference vegetation index [31]). These VIs were chosen because they are widely used in agronomic studies, mainly for underground crops. The VI values were extracted at each georeferenced sampling point for each experimental area and processed in QGIS 3.4. The constants in the VI equations were set to L = 1.0, G = 2.5, C1 = 6, and C2 = 7.5.
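The four indices can be computed from the red, blue, and NIR bands as sketched below. The equations are the commonly published forms of each index and are an assumption here, since the equations themselves are not reproduced in this section; the constants are those stated above, with the single value L = 1.0 applied to both the SAVI and the EVI. The function name and reflectance values are hypothetical.

```python
import numpy as np

L, G, C1, C2 = 1.0, 2.5, 6.0, 7.5  # constants stated in the text


def vegetation_indices(nir, red, blue):
    """Return NDVI, SAVI, EVI and RDVI from surface reflectance (0-1 scale)."""
    nir, red, blue = (np.asarray(b, dtype=float) for b in (nir, red, blue))
    ndvi = (nir - red) / (nir + red)
    savi = (1.0 + L) * (nir - red) / (nir + red + L)
    evi = G * (nir - red) / (nir + C1 * red - C2 * blue + L)
    rdvi = (nir - red) / np.sqrt(nir + red)
    return {"NDVI": ndvi, "SAVI": savi, "EVI": evi, "RDVI": rdvi}


# Hypothetical reflectance values for two pixels.
vi = vegetation_indices(nir=[0.45, 0.30], red=[0.05, 0.10], blue=[0.03, 0.05])
for name, values in vi.items():
    print(name, np.round(values, 3))
```

The inputs must be reflectance rather than raw digital numbers, because the additive constants in the SAVI and EVI assume a 0 to 1 scale.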

Development of Predictive Models
Principal component analysis (PCA) is a multivariate statistical technique used to analyze and interpret the interrelationships between variables according to their dimensions. The analysis yields as many components as there are variables, each component being a linear combination of the original variables. The components can be extracted using a covariance matrix or a correlation matrix. PCA is a non-parametric linear technique widely used to understand and orthogonalize a dataset [32][33]. This analysis was carried out to filter the most important variables, aiming to find the smallest set of variables with a minimal loss of information. PCA was performed in R 4.3.2 and was used to select the variables for the multiple output regression algorithms to predict carrot yield and quality individually.
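The variable screening described above can be sketched as follows. The study ran its PCA in R; this Python version with scikit-learn is only an illustration, and the data, column names, and noise levels are synthetic assumptions. Standardizing the variables first makes the analysis equivalent to a PCA on the correlation matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the field table: 100 points, hypothetical columns.
names = ["NDVI", "SAVI", "root_mass", "brix"]
latent = rng.normal(size=100)
X = np.column_stack([
    latent + 0.3 * rng.normal(size=100),
    latent + 0.3 * rng.normal(size=100),
    latent + 0.5 * rng.normal(size=100),
    rng.normal(size=100),  # variable unrelated to the others
])

# Standardize, then keep the first two principal components.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Z)

explained = pca.explained_variance_ratio_.sum()
# Loadings: weight of each variable on PC1 and PC2, scaled by component variance.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
for name, row in zip(names, loadings):
    print(name, np.round(row, 2))
print("variance explained by PC1 + PC2:", round(explained, 3))
```

Variables with small loadings on the retained components, such as the unrelated column in this synthetic table, are the candidates to drop before regression.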
Before submitting the dataset to the predictive models, the outliers and scattered data points observed in the PCA were removed for the subsequent analyses of correlation between the variables. The dataset was normalized for training and testing in modeling. To predict carrot yield and quality from vegetation indices, three regression methods were used: an Artificial Neural Network (ANN), Random Forest (RF), and Multiple Linear Regression (MLR). For the training and testing of the predictive models, the data were split by sampling period (82 and 116 DAS) and then randomly divided into 70% for training and 30% for testing, as a way of reducing bias and overfitting in the models. This procedure was carried out so that calibration points from both sampling periods were available for modeling. Cross-validation was carried out to test and verify the prediction performance during the training of all regression methods.
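A minimal sketch of the per-period 70/30 split follows, assuming 100 points per sampling period as described in the sampling section. The arrays are synthetic, and min-max scaling is an assumption, because the text does not state which normalization was applied.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical table: 200 points, 100 per sampling period (82 and 116 DAS).
das = np.repeat([82, 116], 100)
X = rng.uniform(0.2, 0.9, size=(200, 2))  # stand-ins for SAVI and NDVI
y = rng.normal(1500.0, 300.0, size=200)   # stand-in for yield (boxes/ha)

train_idx, test_idx = [], []
for period in np.unique(das):
    idx = np.flatnonzero(das == period)
    # 70/30 random split inside each period so both dates reach both sets.
    tr, te = train_test_split(idx, test_size=0.30, random_state=0)
    train_idx.extend(tr)
    test_idx.extend(te)

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Min-max normalization fitted on the training rows only (an assumption).
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
X_train_n = (X_train - lo) / (hi - lo)
X_test_n = (X_test - lo) / (hi - lo)
print(len(train_idx), len(test_idx))
```

Fitting the scaling limits on the training rows alone keeps information from the test rows out of the calibration step.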
An ANN is a supervised machine learning technique that uses artificial neurons capable of learning patterns in a dataset from examples by adjusting the weights of the connections between neurons according to the training data [34]. The RF algorithm is also a supervised model, commonly used to improve the accuracy of predictive models by combining other, simpler models. In this model, the number of trees and the number of prediction variables at each node are defined according to the lowest error observed [35]. MLR relates several predictor variables to a single criterion variable and has been successful in modeling biological processes. Its structure is described by a regression equation, with the estimation of regression coefficients, measures of overall model fit, and the contribution of individual predictor variables [36][37]. The performance of the predictive models was evaluated by means of the coefficient of determination (R²), root-mean-squared error (RMSE), and mean absolute error (MAE). All modeling was performed in Python 3.12.0 (JupyterLab interface) using the NumPy, Pandas, SciPy Stats, and scikit-learn libraries.
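The comparison of the three regression methods with the three metrics can be sketched with scikit-learn as shown below. The data are synthetic, and the network size, number of trees, and other hyperparameters are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Synthetic data: two predictors (stand-ins for SAVI and NDVI), noisy target.
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Hyperparameters below are illustrative, not those of the study.
models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "MLR": LinearRegression(),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "R2": r2_score(y_te, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
        "MAE": mean_absolute_error(y_te, pred),
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})
```

All three metrics are computed on the held-out 30%, so they describe prediction error on unseen points rather than the fit to the training data.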
A synthesis of the proposed methodology is depicted in Figure 4. It includes the (i) manual data collection of 200 sampling points in the field (ground-truth); (ii) acquisition of orbital images with multispectral data; (iii) statistical correlation among the sampling points and calculated VIs and the selection of variables by the PCA; (iv) splitting the database into the training and test datasets; (v) developing predictive models for carrot yield and quality; and (vi) comparing the performance of each model by the selected metrics (R², MAE, and RMSE).

Normality of the Dataset
The total fresh mass of the carrot plants followed the normal curve, with an R² value of 0.94 (Figure 5). The PCA showed that the data collected in the experimental areas for the qualitative and quantitative variables on the different dates explained 89.4% of the carrot variation in the field, which is above the critical limit of 80% for PCAs [38] (Figure 6). The variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the dataset. The PCA also highlighted the temporal influence on the arrangement and structuring of eigenvalues and eigenvectors according to the period of data collection. For component 1, there was a strong correlation for the NDVI at 116 DAS in Site 2, where it contributed more effectively to the characterization of the root mass. The SAVI obtained the best results at 116 DAS for Site 2 for the quantitative variables, such as the total mass and root length of the crop.
In component 2, the EVI had a strong positive correlation at 82 DAS for Site 2. The RDVI, root diameter, °Brix, and firmness had little explanatory power, regardless of the period of data collection.
The better performance of the variables in Site 2 can be explained by the hailstorm that occurred in Site 1 between 82 and 116 DAS, which may have compromised the modeling of the quality of the roots. Regarding the °Brix and firmness variables, which had little explanatory power in the PCA, none of the machine learning models were able to accurately predict root quality in this study. Future studies can be developed, including the collection of samples at the end of the crop cycle and the assessment of the root mass with automated solutions, such as data sampling using multispectral sensors on robotic platforms over the field.
The descriptive statistics of all the measured data by experimental area are shown in Table 1. The numerical values of the VIs were obtained from orbital images by image processing, and the crop variables were measured in the laboratory prior to harvesting. The SAVI values varied more than those of the other VIs, which could be attributed to the influence of soil reflectance and low vegetation cover at the initial stages of crop development and which resulted in a high coefficient of variation at Site 1 (CV = 32%). The aerial mass had a higher CV than the other crop variables, indicating that the aerial part of the plants was not uniform over the experimental site. This can occur due to the different levels of solar radiation over the field and the maturation processes of the crop.
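The coefficient of variation used to compare variables with different units is the standard deviation expressed as a percentage of the mean. The sketch below assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the function name and the readings are hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()


# Hypothetical SAVI readings at five sampling points.
savi = [0.20, 0.35, 0.28, 0.45, 0.22]
print(round(coefficient_of_variation(savi), 1))
```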

Correlations between Variables
The results of the statistical correlations (Table 2) indicated that the SAVI and NDVI stood out significantly in relation to the variables of carrot yield (root mass) and quality (firmness). Carrot yield showed high correlation values with the SAVI and NDVI. The total mass was correlated with these indices by 78 and 68%, respectively. The SAVI and NDVI correlated with root mass at 67 and 78%, respectively. Root length had a low correlation with all the variables studied and was of little use for measuring crop yield and quality. The qualitative attributes °Brix and firmness were not correlated with each other and showed a weak relationship with the root mass of the crop. An inverse relationship was observed between the RDVI and root diameter (r = −0.67). The diameter of the roots is directly and linearly proportional to the carrot's total biomass and root yield [39]. In this study, root mass and diameter correlated by 49%. The SAVI also stood out in other studies with crops that have a low area of vegetative coverage at the beginning of the cycle and organs positioned below the ground. The good performance of the SAVI in its correlation with the total mass and root mass is linked to the correction of soil reflectance, which can mask the real reflectance of the crop's vegetation [40]. Other methodologies to determine °Brix and firmness should be tested to improve the understanding of the in-field variations in total soluble solid content and root texture. Reading the °Brix value on the refractometer is simple and quick. However, the time of collection of the food to be analyzed must be standardized and the calibration of the equipment must be checked. This standardization is already recommended for fruits and other foods [41], but there is no clear and specific methodology for carrots.
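The percentages quoted above can be read as correlation coefficients multiplied by 100. The sketch below assumes Pearson's r, since the text does not name the coefficient used; the function name and the five data points are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Hypothetical values for five sampling points.
savi = [0.31, 0.42, 0.47, 0.55, 0.63]
root_mass_g = [820.0, 1010.0, 990.0, 1240.0, 1380.0]
r = pearson_r(savi, root_mass_g)
print(f"r = {r:.2f} ({100 * r:.0f}%)")
```

A negative value, such as the r = −0.67 reported between the RDVI and root diameter, indicates that one variable tends to decrease as the other increases.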
The measurement of crop biometrics, as well as qualitative characteristics, should be carried out in the phase after the midpoint of the cycle, as it is more closely related to the real yield of the crop. After the slow phase of germination and emergence of carrot seedlings, the root grows in length downward from the soil surface until around 45 DAS. Soon after this growth in depth, the roots grow radially and increase their diameter. The aerial part stabilizes at this stage, while the roots grow in diameter. The closer to the end of the crop cycle, the greater the accuracy of the predictive models will be, because the crop has already defined its potential for production (maximum accumulation of reserve substances and ideal size required by the commercial classification).
Carrot cultivation lacks modeling methods focused on root quality parameters. This study provides some answers for continuing the qualitative modeling of the crop. No regression model was able to predict the quality of the roots. Although carrot quality modeling was not achieved in this study, the use of simple and accessible AI tools can make such modeling more applicable in the field for decision-making purposes regarding the management of the crop. New studies with intensive data collection at the end of the crop cycle could accurately predict root quality before commercial harvest.

Assessment of Model Performance
The ANN algorithm proved to be accurate in predicting carrot yield. The model's performance after training had an R² value of 0.68 and a RMSE of 23.80 boxes ha⁻¹ (Figure 7A). For training and testing the model, the SAVI and NDVI, which obtained the best correlation with crop yield, were selected beforehand. These VIs are considered capable of deriving relationships between the intrinsic characteristics of crop physiology and monitoring variations in underground crops in the field [42]. The regression based on the RF algorithm achieved an R² value of 0.67 and a RMSE of 23.93 boxes ha⁻¹ (Figure 7B) using the same VIs. The results from RF modeling are also in agreement with the coefficients found in the literature for underground crops [43,44]. The MLR model, after training, achieved an R² value of 0.61 and a RMSE of 24.21 boxes ha⁻¹ (Figure 7C). The advantage of this model is its simple structuring of the predictor variables for developing the regression that explains the outputs. However, it has the disadvantage of being subject to multicollinearity, that is, a high degree of correlation between the independent variables, which affects the estimation of the regression coefficients. Machine learning algorithms are continually employed to build models that predict crop yield. There are no definitive conclusions about the best fitting model; however, more complex models, such as ANNs, stand out [45,46]. The ANN model was superior in terms of accuracy and lower error in predicting carrot yield at the field level, followed by the RF and MLR models (Table 3). Although the ANN algorithm is considered ineffective on a sub-regional scale with a relatively limited database [46], the results found in this work showed that it was able to model crop yield with greater accuracy than the RF and MLR methods. R² values above 0.5 are not always found in the literature using VIs to predict underground crop yield, even with the application of ML methods [11]. ANNs and MLR were used in a study on three types of carrots by the authors of [47] to understand the relationship between root volume and agroclimatic factors on yield. These authors found adjustments of 0.80 to 0.90 for modeling carrot yield. Tedesco et al. [18] modeled the yield of sweet potato with RF regression and obtained errors ranging from 2.50 to 2.90 t ha⁻¹, regardless of the stage of the crop-growing season. The SAVI also resulted in better performance in the RF model for this underground crop in this study. Wei et al. [11] modeled carrot yield with the RF algorithm using raw spectral bands and obtained an R² value of 0.82 and an average error of 2.64 t ha⁻¹. Madugundu et al. [21] concluded that the SAVI correlated satisfactorily with carrot yield in their regression models with an error of 4.50 t ha⁻¹. Abbas et al. [48] studied ML algorithms to predict potato yield with a RMSE ranging from 5.97 to 6.17 t ha⁻¹. The MLR model of this study is similar to that reported by Suarez et al. [12]. The authors had optimal regression adjustments between the VIs and total carrot root production and its size [49], such as the EVI (R² = 0.58), RDVI (R² = 0.78), and SAVI (R² = 0.77).
The findings of this study could be used in regional agricultural cooperatives to share estimates of carrot production, avoid crop damage by following trends in meteorological data, and manage field conditions to obtain a higher quality product (added value to the agroindustry chain). It is common to share this kind of information among farmers in Brazil by means of regional or local agricultural cooperatives.
The potential ethical and social implications of the adoption of AI tools in the agroindustry are related to principles such as transparency, privacy, sustainability, and responsibility [50]. Therefore, developing the methodology proposed in this study requires adherence to those principles to enable data sharing (imagery, meteorological, and crop variables) and computational modeling to validate it under national production conditions on a large scale.
The implications of this study for the future of agricultural production are related to the integration of multiple sources of data to optimize crop yield and quality, the management of input applications in the field (water, macronutrients, and energy) to reduce negative impacts on the environment, and the improvement of mechanized operations based on historical databases for sustainability purposes.

Conclusions
This study tested different AI algorithms to predict carrot yield and quality under tropical conditions and prior to harvesting, using reflectance data and ground-truth data as input variables in predictive modeling. The satellite imagery was selected according to the crop's phenology, which made it possible to observe the crop's response and its spectral reflectance by the days after sowing for both experimental areas. The total mass under both field conditions showed higher variability, indicating a non-uniformity of the plant cover despite the same agricultural practices related to plant nutrition and irrigation management. The SAVI and NDVI indices from orbital images showed promising results for predicting carrot yield, as demonstrated by their correlations and their relevance as input variables in modeling. Principal component analysis revealed the temporal influence on predictor variables, which can be useful for optimizing crop monitoring over the fields. The ANN algorithm demonstrated greater accuracy and lower error in crop yield prediction than the RF and MLR methods. Although the modeling of crop quality did not achieve satisfactory results in this study, it was possible to provide methodological context for future research and data analysis. This study makes it possible to implement AI modeling in agricultural scenarios as an alternative approach to field management based on data-driven solutions and the integration of multiple sources of georeferenced data.

Figure 1. Experimental sites 1 (A) and 2 (B) and their respective areas of data collection.

Figure 3. Manual data collection of the carrots.

Figure 4. Flowchart of the experimental process and data processing.

Figure 5. Comparison of the total crop mass data in relation to the normal distribution.

Table 1. Descriptive statistics of the database. Values are reported by experimental area (Site 1 and Site 2).
SD: standard deviation; CV: coefficient of variation.

Table 3. Metrics of the performance of ANN, RF, and MLR modeling from the test dataset.
RMSE: root-mean-squared error (boxes ha⁻¹); MAE: mean absolute error (boxes ha⁻¹).