1. Introduction
Agriculture is a critical pillar of the systems that provide food, rural livelihoods and global economic stability. In countries where large populations rely on agriculture, price swings can have ripple effects that extend from household earning power and food affordability to entire economies. Crop prices, however, emerge from an intricate interplay of logistical, economic and environmental factors, which makes agricultural markets volatile and difficult to predict. Weather cycles, pest infestations, input costs and shifts in consumer demand all contribute to this volatility, yet accurate price forecasts remain essential. Crop price forecasting has traditionally relied on statistical and econometric models which, although useful, largely neglect the non-linear, dynamic interrelationships within agricultural systems. Their typical assumptions of linearity and of only a few interactions between variables limit the applicability of their predictions in practice. Machine learning (ML) methods offer a potential solution because they can learn complex, multi-dimensional relationships from extensive and heterogeneous datasets. Numerous studies published in recent years have demonstrated the applicability of machine learning in agricultural analytics for tasks such as predicting weather patterns and market prices, classifying soil types and forecasting yields.
Nevertheless, an integrated prediction framework that combines market-based criteria with environmental factors and generalises across different crop types and growing scenarios is not yet available. Integrating these approaches is essential for building comprehensive decision-support systems that help farmers and policy makers mitigate risk and optimise profit. To fill this gap, we introduce a crop price forecasting framework based on a machine learning model that combines environmental, economic and logistic data. Multiple algorithms are tested, namely Linear Regression, Support Vector Machines, AdaBoost, Random Forest and XGBoost, and the framework identifies the model that delivers the most accurate predictions. XGBoost outperforms the other models because it captures non-linearity and feature interactions more effectively. This study contributes to computerised agricultural management by predicting price movements, supporting decision makers in crop investment planning, storage and market timing. The results of this work can contribute to increased agricultural profit, reduced forecasting uncertainty and sustainable economic growth along the agri-value chain.
RQ 1. What is the added value of machine learning models for the accurate prediction of crop prices over traditional statistical models?
Machine learning models, in particular XGBoost, outperform conventional statistical approaches because of their ability to capture nonlinear relationships and complex interactions between environmental, economic and logistic conditions. In the work described here, XGBoost achieved an R2 of 0.94 and an RMSE of 12.8, providing reliable predictive accuracy for practical crop price forecasting.
RQ 2. What are the dominant factors affecting the price fluctuations of crops in present developed forecasting model?
The model identified environmental factors (temperature and rainfall), market demand-supply variables, and input costs (fertiliser application levels and transportation charges) as the major determinants of crop price variation. Together, these contributions enhance the precision and interpretability of the model for profitable, sustainable agricultural decision making.
2. Literature Survey
Data-driven knowledge and insight are vital for risk management and profit maximisation, as farmers rely heavily on strategies informed by crop price prediction. Recent research notes that the agricultural market is influenced by a multitude of interconnected factors, including market demand, historical pricing trends, and weather and environmental conditions. Machine-learning methods have been widely studied to represent these complex relationships and improve the accuracy of price prediction, enabled by increased data availability and computing power. This literature review provides an overview of existing research on crop price prediction, examining commonly used datasets, predictive algorithms, feature integration strategies, and their effectiveness in supporting agricultural decision-making.
Grain price forecasting studies increasingly use deep learning models to cope with the strong nonlinearity and temporal dependence in agricultural markets. Hybrid CNN–LSTM architectures that jointly learn spatial-temporal dependencies typically achieve lower forecast errors than single-model baselines in financial and commodity time series. However, most existing grain price models emphasise standard weather measures and past price information, with limited exploration of snow-related variables such as Snow Water Equivalent (SWE), snowfall and snow depth. There is therefore a clear gap in evaluating how snow metrics, as distinct hydrological indicators, contribute to multi-step-ahead grain price prediction compared with conventional precipitation-based features [1].
The paper proposes an intelligent crop price prediction model integrating Support Vector Machines and ARIMA to address volatility in agricultural markets. It first analyses agricultural data to identify key factors influencing crop prices and then models linear temporal patterns using an ARIMA (1,1,8) structure validated through residual normality tests. For peanut price prediction, both ARIMA and an LSTM-based model achieve around 81.65% accuracy, while their combined (hybrid) model reaches 95.65%, showing significant improvement. Moreover, the framework is extended to forecast the risk of investing in crops, with better performance than comparable methods (accuracy of 0.9725). Overall, this study demonstrated that hybrids of statistical and machine learning models can enhance enterprise risk assessment and price forecasting for decision-makers in agriculture [2].
To increase prediction precision, recent works have highlighted the growing application of machine learning and ensemble forecasting for agricultural prices. Research has shown that the complex, non-linear changes in crop prices, which respond to dynamic market and environmental forces, can be hard for conventional statistical tools to capture. The effectiveness of ensemble models such as Random Forest, XGBoost and, in particular, Stacking Regressors for modelling intricate historical agricultural patterns has been brought into focus by recent research [3].
Recent research indicates that traditional, knowledge-based farming struggles to cope with uncertain markets, complex crop selection and unstable weather patterns. Studies have shown that many existing models are not robust across different agroclimatic zones and do not efficiently incorporate heterogeneous or real-time data layers. To bridge these gaps, attention has increasingly turned to advanced machine learning models that combine temporal (time-series), climatic and spatial (location, soil) information to forecast yield and price. To improve sales processes and profitability, time-series and ML-based hybrid models for market price forecasting are widely studied. Work on decision-support platforms, which are frequently web-based, emphasises the importance of user-friendly interfaces that provide farmers with tailored advice, increasing uptake and practical impact [4].
New findings show the growing importance of AI in stabilising agricultural prices that are influenced by worldwide trends, policy and even climate. AI-based approaches are gaining traction as conventional statistical and econometric models often fail to account for complex, nonlinear market dynamics. Research shows that the responsiveness and accuracy of price prediction can be improved by using machine learning models such as LSTM networks, random forests or linear regression [5].
Traditional statistical models are constrained by linearity assumptions and sensitivity to non-stationarity, whereas machine learning and deep learning approaches can better capture nonlinear patterns and complex temporal dependencies. These observations clarify the motivation for adopting machine learning-based models in this study and strengthen the justification for the proposed approach.
3. Methodology
This study presents a machine learning framework that predicts crop prices from key market, economic and environmental indicators. The six major steps shown in Figure 1 are data gathering, data preprocessing, feature engineering, feature selection, model construction and model evaluation; the flowchart depicts the overall process.
3.1. Data Collection
This study uses the publicly available Indian Agriculture Crop Price Dataset, retrieved from the Kaggle platform. The dataset aggregates actual market price data released through Agmarknet (Agricultural Marketing Information Network), Government of India. It covers agricultural commodities from 2023 to 2024 across various states and districts in India, with Karnataka accounting for 40% of records, Maharashtra 12% and other states 48%. The dataset covers major crops such as wheat, rice, maize, cotton, sugarcane, pulses, millets, barley, groundnut and soybean. Each record contains attributes such as date, state, city, crop type, season, temperature, rainfall, supply volume and demand volume, which allow crop price trends to be analysed in relation to climatic and market conditions.
3.2. Data Preprocessing
Several preprocessing steps were performed to guarantee data reliability. Missing values were imputed with the mean for approximately normally distributed numerical features and with the most frequent value for categorical features. Duplicate and inconsistent records were removed to preserve data integrity. Outliers were identified through distributional analysis and either corrected or removed as needed. Categorical variables were converted to numerical form, and continuous variables were standardised where required. These steps aligned the dataset with the assumptions of effective machine learning modelling.
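For illustration only, the snippet below sketches these preprocessing steps with pandas and scikit-learn, assuming a DataFrame with hypothetical column names such as Temperature and Supply_Volume; the actual dataset schema, file name and outlier thresholds may differ.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Illustrative file and column names; the real dataset schema may differ.
df = pd.read_csv("crop_prices.csv")
num_cols = ["Temperature", "Rainfall", "Supply_Volume", "Demand_Volume"]
cat_cols = ["State", "City", "Crop_Type", "Season"]

# Remove exact duplicate records.
df = df.drop_duplicates()

# Impute numeric features with the mean, categorical features with the mode.
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Handle outliers by clipping each numeric feature to its 1st-99th percentile range.
for col in num_cols:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

# Standardise continuous features to zero mean and unit variance.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```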
3.3. Feature Engineering
Feature engineering was employed to transform raw variables into informative features for the model. Categorical features such as State, City, Crop Type and Season were one-hot encoded to account for regional and temporal differences. A Supply-Demand Ratio, defined as supply volume divided by demand volume, was created to express market pressure. The numerical features Temperature, Rainfall, Transportation Cost, Fertiliser Usage, Pest Infestation and Market Competition were retained after normalisation to a common scale. These hand-crafted features helped the model learn the underlying relationships driving crop price fluctuations.
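A minimal sketch of the derived ratio and the one-hot encoding is shown below; column names are assumptions for illustration rather than the exact dataset schema.

```python
import pandas as pd

# Derived feature: supply divided by demand as a simple measure of market pressure.
df["Supply_Demand_Ratio"] = df["Supply_Volume"] / df["Demand_Volume"]

# One-hot encode categorical features to capture regional and seasonal effects.
df = pd.get_dummies(df, columns=["State", "City", "Crop_Type", "Season"], drop_first=True)
```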
3.4. Feature Selection
A two-step feature selection process combining a wrapper-based approach with LASSO regression was conducted to retain only relevant predictors. The wrapper method evaluated several feature subsets by model performance, whereas the Least Absolute Shrinkage and Selection Operator (LASSO) applied coefficient regularisation to automatically remove insignificant variables and identify the market and environmental variables that most influence crop prices. All input features were standardised to zero mean and unit variance before LASSO was applied. The regularisation parameter (α) was tuned by k-fold cross-validation (k = 5) over a logarithmically spaced search grid, and the value yielding the lowest cross-validation error was selected. Features with non-zero coefficients after regularisation were retained for model training. The LASSO optimisation was carried out with the coordinate descent algorithm.
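As a sketch under assumed inputs (a feature matrix X, target y and a list feature_names prepared beforehand), the LASSO step could be implemented with scikit-learn's LassoCV, which tunes the regularisation strength by cross-validation and solves the problem with coordinate descent; the grid bounds below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardise inputs to zero mean and unit variance before LASSO.
X_std = StandardScaler().fit_transform(X)

# Tune alpha over a logarithmic grid with 5-fold cross-validation.
lasso = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000).fit(X_std, y)

# Retain only features with non-zero coefficients for model training.
selected = [f for f, c in zip(feature_names, lasso.coef_) if c != 0]
print(f"alpha = {lasso.alpha_:.4f}, kept {len(selected)} features")
```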
3.5. Model Development
Five machine learning algorithms, namely Linear Regression, Support Vector Regression (SVR), Random Forest, AdaBoost and XGBoost, were implemented and tested for crop price prediction. For Linear Regression, default parameters were used with L2 regularisation to reduce the effect of multicollinearity. The SVR model used a radial basis function (RBF) kernel, and both the penalty parameter (C) and the kernel coefficient (γ) were optimised via grid search. The Random Forest model was configured with 100 decision trees and mean squared error as the splitting criterion. AdaBoost used decision tree regressors as weak learners with a learning rate of 0.1. The XGBoost model was trained with a maximum tree depth of 6, a learning rate of 0.1 and 100 boosting rounds. The dataset was first split randomly into training and test sets in an 80:20 ratio, and hyperparameters were tuned with grid search combined with cross-validation to select the best-performing configuration.
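The sketch below illustrates the 80:20 split and grid-search tuning for the XGBoost model; X_sel and y denote the selected features and target from the previous step, and the grid values are assumptions centred on the reported settings (depth 6, learning rate 0.1, 100 rounds).

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor

# 80:20 random split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, random_state=42
)

# Grid search with 5-fold cross-validation around the reported hyperparameters.
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```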
3.6. Model Evaluation
Models were evaluated using common regression measures: the Coefficient of Determination (R2), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). These metrics offered a good overall view of how well the models explained and predicted crop prices. In addition to numeric performance measures, several diagnostic visualisations were used to assess model behaviour, including model performance comparison charts, a residual distribution plot and an Actual vs. Predicted scatter plot. These visualisations supported the identification of errors, bias and prediction stability. The best model was chosen on the basis of overall accuracy, result quality and interpretability, drawing on both quantitative measures and graphical assessments.
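A minimal sketch of how these metrics can be computed with scikit-learn, assuming best_model, X_test and y_test from the previous step:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_pred = best_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root of the mean squared error
mae = mean_absolute_error(y_test, y_pred)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```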
4. Result and Discussion
This section presents the empirical results of the trained machine learning framework, including model performance, feature behaviour and predictive reliability. Several models were tested to identify the most effective method for crop price estimation across different market and environmental scenarios.
4.1. Comparative Model Performance
The predictive performance of the models is summarised in Table 1 and presented graphically in Figure 2.
Among the algorithms, XGBoost was the best performer, achieving an R2 of 0.988 together with low MAE (7.22) and RMSE (9.26) values. Random Forest performed similarly (MAE = 7.91, RMSE = 9.84, R2 = 0.986) but with slightly higher errors than XGBoost, which may indicate that it captures interaction effects somewhat less effectively.
4.2. Feature Importance Analysis
Figure 3 illustrates the feature importance determined by the XGBoost model. The dominant predictors were demand volume, supply volume and transport cost, which collectively constituted the fundamental drivers of price variance. These observations are in line with market theory, which suggests that price discovery is heavily influenced by supply-demand considerations and transaction costs. XGBoost inherently captures nonlinear relationships and feature interactions that simple importance scores cannot fully expose; as a result, the effect of environmental conditions on prices is reflected indirectly through their contribution to yield and supply dynamics, providing a domain-consistent interpretation.
Secondary contributors were market competition, rainfall and pest infestation, suggesting that price formation is shaped by both environmental and competitive mechanisms. Other variables such as temperature, fertiliser usage and geographic indicators had a small impact, indicating relatively weak direct effects on short-run price dynamics. These findings support the conclusion that LASSO-based feature selection retains only the most important predictors for the learning models.
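For illustration, the built-in importance scores of a fitted XGBoost regressor can be inspected as sketched below; best_model and the selected feature list are assumed to be available from the training step.

```python
import pandas as pd

# Pair the model's built-in importance scores with the selected feature names
# and list the top-ranked predictors.
importances = pd.Series(best_model.feature_importances_, index=selected)
print(importances.sort_values(ascending=False).head(10))
```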
4.3. Error and Residual Distribution
The residual plot in Figure 4 illustrates how the model's errors are distributed. The residuals are centred near zero with an approximately symmetrical, near-normal distribution, suggesting that the model performs evenly without a systematic bias towards over- or under-predicting crop prices.
The distribution is also fairly even with few extremes, implying good generalisation over a wide range of prices. A modest right-tail extension can be observed, due mainly to occasional under-prediction of very high market prices. Such a pattern is expected in agricultural commodities, where supply shocks or demand surges may cause abrupt price spikes. Overall, the residual pattern confirms accurate and consistent model behaviour.
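A minimal sketch of this residual diagnostic, assuming y_test and y_pred from the evaluation step and using matplotlib purely for illustration:

```python
import matplotlib.pyplot as plt

residuals = y_test - y_pred

# Histogram of residuals; a roughly symmetric shape centred on zero
# indicates no systematic over- or under-prediction.
plt.hist(residuals, bins=40, edgecolor="black")
plt.xlabel("Residual (actual - predicted price)")
plt.ylabel("Frequency")
plt.title("Residual distribution")
plt.show()
```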
4.4. Actual vs. Predicted Price Relationship
Figure 5 illustrates the comparison between actual and predicted prices. The closeness of the points to the ideal y = x line is indicative of predictive fidelity: the model predicts both low and high price ranges well, with little variance across the spectrum.
The actual versus predicted price plot shows a heavy concentration of data points along the diagonal reference line, indicating a good fit between model predictions and true market prices. Such tight clustering reflects high predictive accuracy and good generalisation across a range of crop prices, and the absence of any systematic deviation shows that the model is consistent across low, medium and high prices. Small deviations appear at the price extremes, which is expected because historical context cannot match current or future market conditions exactly, particularly under unpredictable supply shocks and unobserved externalities. Overall, this visual pattern supports the numerical evaluation metrics and provides evidence that the XGBoost model can generate consistent, accurate and reliable crop price predictions.
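For completeness, a sketch of this actual-versus-predicted diagnostic, again assuming y_test and y_pred are available:

```python
import matplotlib.pyplot as plt

# Scatter of actual vs. predicted prices with the ideal y = x reference line.
plt.scatter(y_test, y_pred, s=10, alpha=0.5)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, "r--", label="y = x")
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.legend()
plt.show()
```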
4.5. Discussion
The results show that XGBoost is the best model for predicting crop prices in the current study. Its capability to estimate nonlinear effects, handle multicollinearity and quantify feature importance makes it well suited to complex agricultural data. The prominence of the demand, supply and transportation cost variables is consistent with the agricultural market literature, which highlights the dominant influence of market forces over environmental factors in short-run price formation. The residual and actual-vs-predicted plots also suggest that the model generalises well and is not seriously biased. These findings confirm the capacity of machine learning algorithms, especially ensemble-based techniques, to deliver accurate, data-driven estimation of crop prices that can inform decision-making by farmers, traders and policy-makers.
5. Conclusions
Crop price fluctuations are caused by complex interactions between weather conditions, market position, global trends and government policies. These fluctuations have important implications for farmer livelihoods, consumer affordability and national economic balance; hence, studying the sources and impacts of price changes is crucial for designing policy measures that strengthen agricultural resilience and promote food security. In this research, a machine learning model was developed to forecast crop prices from environmental, economic and operational parameters. After rigorous preprocessing and feature engineering, feature selection via wrapper methods combined with LASSO was performed to retain the most important predictors. Multiple regression models were investigated, and XGBoost delivered the best predictions according to the performance evaluation metrics and interpretability analyses, including residual behaviour, actual-versus-predicted patterns and feature importance ranking. The results show that contemporary machine learning methods are capable of modelling the non-linearity of agricultural markets and predicting prices accurately. Such knowledge can help farmers, policymakers and supply-chain planners improve pricing decisions, mitigate risks and optimise resource allocation. Finally, this framework paves the way for a more data-driven and sustainable agricultural ecosystem, with strong potential for integration into operational decision-support systems. The main limitations of the presented solution are as follows: (i) the model is trained on historical data with a narrow geographical and temporal range, which may affect its generalisability to other areas or extreme market conditions; (ii) unexpected shocks, such as sudden policy changes, extreme weather or geopolitical factors, are not explicitly modelled; and (iii) the present framework accepts structured numerical data, so unstructured sources such as satellite imagery or real-time market sentiment are not represented.
Author Contributions
Conceptualization, P.A.K. and G.V.S.N.; Methodology, P.A.K. and G.V.S.N.; Software, S.K.K.; Validation, S.K.K. and D.P.; Formal Analysis, S.K.K.; Investigation, G.V.S.N. and D.P.; Resources, P.A.K.; Data Curation, S.K.K. and G.V.S.N.; Writing—Original Draft Preparation, P.A.K. and G.V.S.N.; Writing—Review and Editing, P.A.K. and S.K.K.; Visualization, D.P.; Supervision, G.V.S.N. and S.K.K.; Project Administration, P.A.K.; Funding Acquisition, G.V.S.N. and S.K.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The dataset used in this study is publicly available and can be accessed from the Kaggle platform.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Wang, Z.; French, N.; James, T.; Schillaci, C.; Chan, F.; Feng, M.; Lipani, A. Climate and environmental data contribute to the prediction of grain commodity prices using deep learning. J. Sustain. Agric. Environ. 2023, 2, 251–265.
- Cheng, H.; Huang, A. SVM Based Agricultural Crop Price Prediction Model. IAENG Int. J. Comput. Sci. 2025, 52, 307.
- Rao, D.S.; Chaganti, S.S.S.; Chelikani, S.S.; Nandamuri, Y.V.; Nippun, P.V. Crop yield prediction using stacking ensemble model. In Proceedings of the International Conference on Computational Intelligence, Las Vegas, NV, USA, 13–15 December 2023; Springer Nature: Singapore, 2023.
- Phatangare, S.; Laddha, A.; Bambal, S.; Borhade, B.; Atram, P. A Data-Driven Approach to Crop Yield and Market Price Prediction. In Proceedings of the 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 3–5 October 2024; IEEE: New York, NY, USA, 2024.
- Dutt, S.; Kulkarni, P.N.; Akilan, S.; Mishra, R.; Khetan, P.; Bhadora, A. Agricultural price prediction through artificial intelligence. Int. J. Creat. Res. Thoughts (IJCRT) 2024, 14, 28019.