Prediction of Corn and Sugar Prices Using Machine Learning, Econometrics, and Ensemble Models

This paper compares several state-of-the-art machine learning models with traditional econometric models for predicting the daily prices of corn and sugar in Brazil. The following models were implemented and compared: ARIMA, SARIMA, support vector regression (SVR), AdaBoost, and long short-term memory (LSTM) networks. Even though the price time series of the two products differ considerably, the best results for both were obtained by the same four models: SVR, an ensemble of the SVR and LSTM models, an ensemble of the AdaBoost and SVR models, and an ensemble of the AdaBoost and LSTM models. The econometric models presented the worst results for both products on all metrics considered. All models predicted corn prices better than sugar prices, which can be attributed mainly to the lower variation of corn prices over the training and test sets. The methodology can be applied to other products.


Introduction
Agricultural value chains are essential for producing and distributing food, medicines, and clothes, among many other products [1,2]. Two of the most important agricultural value chains worldwide are the sugar and corn chains [3]. One essential activity for all agents in those value chains is to predict agricultural product prices accurately [1,4-7]. The quality of this prediction affects decision-making and revenue generation for all agents in the value chains. However, most of the literature on time series analysis for price prediction focuses on predicting stock market prices, such as the works in [8-10]. Traditionally, econometric models such as autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA), and SARIMA with exogenous factors (SARIMAX) are used [1,4-10]. In the last decade, several machine learning (ML) models have shown better performance, with lower mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), and a higher R2 score [1,4-7]. In addition, the long short-term memory (LSTM) network is considered a state-of-the-art model for price prediction [6].
Ouyang et al. [11] compared ARIMA, LSTNet (an LSTM-based network), and different configurations of artificial neural networks for predicting the prices of twelve agricultural commodities, and concluded that LSTNet presented the best results. Kanchymalay et al. [4] compared a multi-layer perceptron, support vector regression (SVR), and the Holt-Winters exponential smoothing method for predicting crude palm oil prices, concluding that the SVR model presented the best results. Ref. [6] evaluated a backpropagation neural network and an LSTM for predicting the high and low prices of soybean futures, concluding that the LSTM presented better results.
The main objective of this work is to evaluate state-of-the-art ML models for predicting the daily prices of corn and sugar in Brazil in comparison with traditional econometric models. The following models were implemented: ARIMA, SARIMA, SVR, AdaBoost, LSTM, and ML ensembles with different configurations. All models were evaluated on the test subset, which comprises the whole of 2019, using three metrics: MAE, MSE, and R2 score.
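For reference, the evaluation metrics named above can be computed as in the minimal sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions and do not come from the original experiments; RMSE is included only because it is the square root of MSE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R2 is undefined for a constant target; report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


# Illustrative values only, not data from the paper.
actual = [40.0, 41.0, 43.0, 42.0]
predicted = [40.5, 41.0, 42.0, 42.5]
metrics = regression_metrics(actual, predicted)
```

Lower MAE and MSE and a higher R2 indicate a better fit, which is the sense in which the models are ranked in this work.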

Methodology
The methodology used in this paper comprised five main steps: (i) data gathering, collecting daily prices for sugar (from 2004 to 2019) and corn (from 2003 to 2019) from the CEPEA agricultural prices database [12]; (ii) data preprocessing, encompassing the identification and handling of missing data and outliers, and the division of each product's dataset into three subsets: training (from the beginning of the dataset until the validation subset), validation (cross-validation using the blocking time series method), and testing (2019); (iii) exploratory data analysis, comprising an analysis of each price time series and an autocorrelation analysis using the augmented Dickey-Fuller (ADF) test and the autocorrelation (ACF) and partial autocorrelation (PACF) functions with their respective plots; (iv) model implementation and hyperparameter analysis, considering the following models: ARIMA, SARIMA, SVR, AdaBoost, LSTM, and ML ensembles with different configurations; and (v) evaluation of the final models on the test subset.
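The blocking time series cross-validation used in the preprocessing step can be sketched as follows. This is one common reading of the method, assuming consecutive non-overlapping blocks, each split chronologically into a training part and a validation part; the function name, the number of blocks, and the 80/20 split are hypothetical choices that the text does not specify.

```python
def blocked_time_series_folds(n_samples, n_blocks, train_fraction=0.8):
    """Yield (train_indices, validation_indices) for consecutive, non-overlapping blocks.

    Within each block, the earliest observations are used for training and the
    latest for validation, so no fold ever trains on data from its own future.
    """
    if n_blocks < 1 or n_samples < 2 * n_blocks:
        raise ValueError("need at least two samples per block")
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    block_size = n_samples // n_blocks
    for b in range(n_blocks):
        start = b * block_size
        # The last block absorbs any leftover observations.
        stop = n_samples if b == n_blocks - 1 else start + block_size
        split = start + int((stop - start) * train_fraction)
        # Keep both the training and the validation part non-empty.
        split = min(max(split, start + 1), stop - 1)
        yield list(range(start, split)), list(range(split, stop))


# Illustrative sizes only: 20 daily observations divided into 4 blocks.
folds = list(blocked_time_series_folds(n_samples=20, n_blocks=4))
```

Unlike shuffled k-fold cross-validation, this scheme preserves the temporal order inside each block and keeps the blocks disjoint, which avoids leaking future prices into the training data.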

Results and Discussions
The result of the ADF test showed that both agricultural products' price series present autocorrelation. The analysis of the ACF and PACF plots also indicated that there is seasonality in the data. Figure 1 (v) for both products, the prices were increasing significantly during 2020. Those points reflect several factors, such as fluctuations in product demand worldwide, financial crises, fluctuations in product supply (influenced mainly by droughts), variations in exchange rates, and the impacts of the COVID-19 pandemic.
Table 1 presents a comparison of the final models on the test subset (the year 2019), using the best hyperparameter values identified during the cross-validation procedure. Even though the price time series of the two products differed considerably, the same models presented the best results for both: (i) SVR (MAE: 0.287 for corn and 0.430 for sugar); (ii) an ensemble of the SVR and LSTM models (MAE: 0.335 for corn and 0.458 for sugar); (iii) an ensemble of the AdaBoost and SVR models (MAE: 0.395 for corn and 0.476 for sugar); and (iv) an ensemble of the AdaBoost and LSTM models (MAE: 0.425 for corn and 0.500 for sugar). One reason that may explain the superior performance of the SVR is the small dataset size.
The econometric models presented the worst results for both products on the three metrics considered, implying that those models did not capture the trends in the datasets. Two main reasons may explain this observation: (i) both products' prices were significantly volatile during the period; and (ii) both datasets presented non-stationary data and varying trends. All models also predicted corn prices better than sugar prices, which can be attributed mainly to the lower variation of corn prices.
Additionally, the better results obtained by the ML models and ensembles indicate that they could improve price prediction for agricultural products. This observation has important implications for researchers and practitioners, as it can help improve the quality of the price predictions for a given agricultural product. Furthermore, practitioners could implement the ML models used in this work and backtest them for different agricultural products and periods, which could provide valuable information for decision-making. The methodology used in this work can also be applied to other products. Lastly, the main limitations observed were: (i) the small datasets used, which could have affected the LSTM results; (ii) the unknown market dynamics, which make it challenging to generate new features; and (iii) the lack of standard datasets and model implementations in the literature for comparing the results obtained for the different agricultural products.
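The text does not state how the ensemble forecasts were combined. As a minimal sketch, assuming a simple (optionally weighted) average of the individual model forecasts, an ensemble and its MAE could be computed as below. The function names, the equal weights, and the numbers are hypothetical and are not taken from Table 1.

```python
def average_ensemble(predictions_by_model, weights=None):
    """Combine per-model forecasts into one forecast by (weighted) averaging."""
    names = list(predictions_by_model)
    if not names:
        raise ValueError("at least one model is required")
    horizon = len(predictions_by_model[names[0]])
    if any(len(predictions_by_model[name]) != horizon for name in names):
        raise ValueError("all models must forecast the same number of days")
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal weights
    total = sum(weights[name] for name in names)
    return [
        sum(weights[name] * predictions_by_model[name][t] for name in names) / total
        for t in range(horizon)
    ]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and forecast values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: one model overshoots, the other undershoots.
actual = [10.0, 11.0, 12.0]
forecasts = {"svr": [10.5, 11.5, 12.5], "lstm": [9.5, 10.5, 11.5]}
combined = average_ensemble(forecasts)
ensemble_mae = mean_absolute_error(actual, combined)
svr_mae = mean_absolute_error(actual, forecasts["svr"])
```

In this toy case the errors of the two models cancel, so the average beats each member. That is not guaranteed in practice: in the results reported here, SVR alone achieved a lower MAE than any ensemble.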
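The backtesting suggested for practitioners can be illustrated with a one-step-ahead walk-forward loop. This is a sketch under stated assumptions: the function names are hypothetical, the prices are made up, and the naive last-value forecaster is only a stand-in for any of the models evaluated in this work.

```python
def walk_forward_backtest(series, start, forecast_fn):
    """Forecast series[t] from series[:t] for every t >= start and return the MAE.

    forecast_fn receives only past observations, so no future data leaks in.
    """
    if not 1 <= start < len(series):
        raise ValueError("start must leave some history and at least one test point")
    predictions = [forecast_fn(series[:t]) for t in range(start, len(series))]
    actuals = list(series[start:])
    mae = sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)
    return predictions, mae


def last_value(history):
    """Naive persistence baseline: tomorrow's price equals today's price."""
    return history[-1]


# Illustrative prices only, not CEPEA data.
prices = [30.0, 31.0, 30.5, 32.0, 33.0, 32.5]
preds, mae = walk_forward_backtest(prices, start=3, forecast_fn=last_value)
```

Replacing `last_value` with a function that fits a model on `history` and returns its next-day forecast allows the same loop to be reused for other products and periods.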

Conclusions and Future Works
Agricultural value chains are essential for producing and distributing food, medicines, and clothes, among many other products. Therefore, improving the prediction of product prices is essential for better decision-making by the different value chain agents. However, most works in the literature focus on predicting stock market prices. In this work, traditional econometric models (ARIMA and SARIMA), ML models (SVR, AdaBoost, and LSTM), and ML ensembles (with different configurations) were evaluated for predicting the daily prices of corn and sugar in Brazil.
It was observed that: (i) the SVR model presented the best results for both products, followed by the SVR and LSTM ensemble; (ii) the econometric models presented the worst results for both products; and (iii) all models predicted corn prices better than sugar prices. Future work includes implementing other ML models, using unsupervised learning to improve pattern detection, implementing deep reinforcement learning models to allow for autonomous decision-making, and evaluating other datasets and periods.