Article
Peer-Review Record

In-Season Price Forecasting in Cotton Futures Markets Using ARIMA, Neural Network, and LSTM Machine Learning Models

J. Risk Financial Manag. 2025, 18(2), 93; https://doi.org/10.3390/jrfm18020093
by Jeffrey Vitale 1,* and John Robinson 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 November 2024 / Revised: 20 January 2025 / Accepted: 27 January 2025 / Published: 10 February 2025
(This article belongs to the Special Issue Financial Innovations and Derivatives)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Comments:  

1. The authors fail to elucidate the training methodology of the models, including the optimizer and learning rate employed.

2. The article fails to present the performance results, including RMSE, for all models in a tabular format. 

3. In this research, models such as stacked LSTM or 2D convolutional LSTM may exhibit overfitting to the training data; hence, the study must explain the measures taken to prevent this issue.

4. The paper relies on historical pricing while neglecting the influence of weather and global demand on cotton prices. The authors should explain the reason.

5. Examine the contribution of your research.

6. It is advised that the author should cite some recent research articles relating to forecasting models, as listed below:

* Forecasting foreign tourist arrivals in India using a single time series approach based on rough set theory. International Journal of Computing Science and Mathematics.

 

Author Response

Responses to Reviewer #1 Comments

 

Comments:  

  1. The authors fail to elucidate the training methodology of the models, including the optimizer and learning rate employed.

Author’s Response: The models were trained using the Adam optimizer. The learning rate was one of several model parameters chosen using a k-fold hyperparameter random search, as discussed in our response to Comment 3 below. We also employed early stopping during training to halt the process when validation loss failed to improve, ensuring efficient use of computational resources and avoiding overfitting. All hyperparameter ranges and settings have been transparently documented to facilitate reproducibility.

  2. The article fails to present the performance results, including RMSE, for all models in a tabular format.

Response: These may not have been clearly labeled in the original manuscript. The revised manuscript includes a table listing the overall RMSE for each model, and we have ensured that all tables and figures are clearly labeled and referenced. The figures complement the tables by illustrating the temporal (cumulative) progression of RMSE, which matches the final values reported in the tables for each model. This was painstaking work!

  3. In this research, models such as stacked LSTM or 2D convolutional LSTM may exhibit overfitting to the training data; hence, the study must explain the measures taken to prevent this issue.

Response: Thank you for your insightful feedback. We completely agree that a systematic and transparent approach is essential to mitigate overfitting and avoid “cherry-picking” models. To address these concerns, we employed a k-fold cross-validation method with five folds, integrated with a random parameter search strategy. This approach ensured that both training and validation RMSE were thoroughly evaluated across diverse parameter combinations.

To further enhance the robustness of model selection, we developed a Pareto frontier to systematically identify the models that offered the best trade-offs between training and validation RMSE. The Pareto frontier highlights models that can only be improved by increasing either training or validation RMSE, providing a clear framework for selecting optimal configurations.

From this frontier, we selected the best model based on a composite metric that equally weighted two critical factors (a sketch of this selection logic follows the list):

  1. The model’s distance from the line representing points where training RMSE equals validation RMSE, which prioritizes generalizability.
  2. The training RMSE itself, to ensure that the model adequately captured the underlying data patterns.
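
The following minimal sketch illustrates the frontier construction and composite metric described above; the variable names and the exact scaling of the two equally weighted factors are illustrative rather than a verbatim excerpt of our scripts.

import numpy as np

def select_model(train_rmse, val_rmse):
    """Pareto frontier over (training RMSE, validation RMSE) pairs, then pick
    the candidate that balances generalizability and fit."""
    pts = np.column_stack([train_rmse, val_rmse])
    # Pareto-efficient: no other candidate is at least as good on both
    # criteria and strictly better on at least one
    frontier = [i for i, p in enumerate(pts)
                if not any((q <= p).all() and (q < p).any() for q in pts)]

    # equal weight on (1) the distance from the train-RMSE = val-RMSE line,
    # |train - val| / sqrt(2), and (2) the training RMSE itself
    def score(i):
        return abs(pts[i, 0] - pts[i, 1]) / np.sqrt(2) + pts[i, 0]

    return min(frontier, key=score)  # index of the selected configuration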

This systematic and transparent process minimizes the risk of overfitting. Additionally, we strictly adhered to best practices by ensuring that the testing data was never considered during model selection. The use of the k-fold cross-validation approach further alleviates overfitting concerns by evaluating models across multiple splits of the data.

Finally, our parameter selection process also emphasizes the complexity and sensitivity of neural network models, particularly LSTMs, to hyperparameter choices. Parameters such as the number of epochs, learning rates, and the number of neurons significantly influence model performance, as demonstrated in our study. We hope this approach addresses concerns about overfitting while providing a valuable framework for researchers building similar models.

 

Here is a snippet of our k-fold random search:

# Imports assumed by this snippet (NumPy, scikit-learn, TensorFlow/Keras)
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Dropout, BatchNormalization, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Hyperparameters (defaults; re-sampled for each trial below)
n_seq = 6
n_steps = 4
n_steps_in = n_seq * n_steps
n_steps_out = 15
n_features = 1

# Random search and k-fold settings
n_trials = 20
n_splits = 5
random_search_results = []

for trial in range(n_trials):
    print(f"Trial {trial + 1}/{n_trials}")

    # Random hyperparameter sampling
    filters = np.random.choice([32, 64, 96, 128])
    dropout_rate = np.random.uniform(0.0, 0.5)
    learning_rate = np.random.uniform(0.0001, 0.1)
    batch_size = np.random.choice([16, 32, 64])
    patience = np.random.choice([5, 10])
    epochs = np.random.choice([50, 75, 100])
    n_seq = np.random.choice([1, 3])
    n_steps = np.random.choice([2, 4, 6])
    n_steps_in = n_seq * n_steps

    # Window the held-out test data (set aside here; never used in model selection)
    X_test, y_test = split_sequence(raw_seq_test, n_steps_in, n_steps_out)
    X_test = X_test.reshape((X_test.shape[0], n_seq, 1, n_steps, n_features))

    # K-fold cross-validation (no shuffling, to respect temporal ordering)
    kf = KFold(n_splits=n_splits, shuffle=False)
    val_losses = []
    train_losses = []

    for fold, (train_index, val_index) in enumerate(kf.split(raw_seq_train)):
        print(f" Fold {fold + 1}/{n_splits}")
        train_split = np.array(raw_seq_train)[train_index]
        val_split = np.array(raw_seq_train)[val_index]
        X_train, y_train = split_sequence(train_split, n_steps_in, n_steps_out)
        X_val, y_val = split_sequence(val_split, n_steps_in, n_steps_out)
        X_train = X_train.reshape((X_train.shape[0], n_seq, 1, n_steps, n_features))
        X_val = X_val.reshape((X_val.shape[0], n_seq, 1, n_steps, n_features))

        # Define the ConvLSTM2D model
        model = Sequential()
        model.add(ConvLSTM2D(filters=filters, kernel_size=(1, 2), activation='relu',
                             input_shape=(n_seq, 1, n_steps, n_features), return_sequences=True))
        model.add(Dropout(dropout_rate))
        model.add(BatchNormalization())
        model.add(Flatten())
        model.add(Dense(n_steps_out))
        model.compile(optimizer=Adam(learning_rate=learning_rate), loss='mse')

        # Callbacks: early stopping and learning-rate reduction on plateau
        early_stop = EarlyStopping(monitor='loss', patience=patience, restore_best_weights=True)
        reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=3, min_lr=1e-6)

        # Train the model
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                  validation_data=(X_val, y_val), callbacks=[early_stop, reduce_lr], verbose=0)

        # Evaluate on training data
        train_loss = model.evaluate(X_train, y_train, verbose=0)
        train_losses.append(train_loss)

        # Predict and evaluate on validation data
        y_pred_val = model.predict(X_val, batch_size=batch_size)
        val_loss = model.evaluate(X_val, y_val, verbose=0)
        val_losses.append(val_loss)

    # Record the average fold losses (MSE) for this trial's parameter draw
    # (aggregation shown for completeness)
    random_search_results.append({'filters': filters, 'dropout_rate': dropout_rate,
                                  'learning_rate': learning_rate, 'batch_size': batch_size,
                                  'n_seq': n_seq, 'n_steps': n_steps,
                                  'train_rmse': np.sqrt(np.mean(train_losses)),
                                  'val_rmse': np.sqrt(np.mean(val_losses))})
 

Our approach aligns with best practices in machine learning by rigorously evaluating models across multiple data splits. While computationally intensive, this method ensures robustness and generalizability, which are critical for time series forecasting in volatile markets.

  4. The paper relies on historical pricing while neglecting the influence of weather and global demand on cotton prices. The authors should explain the reason.

Response: This will be considered in our future research. The intent of this paper was to compare the forecasting performance of a suite of LSTM models against a single time series model, with particular emphasis on how the memory features of LSTMs can add forecasting power depending on model structure. We felt that including additional variables would have introduced too much complexity. While this study emphasizes baseline comparisons, the LSTM architectures used are inherently scalable and can incorporate additional variables in future iterations, such as weather data or geopolitical events, without fundamental changes to the model framework.

 

Note that we already mentioned this in the originally submitted manuscript, at the end of the Conclusions:

Future work should consider incorporating external variables, such as weather data or other agricultural commodity prices, to potentially enhance the predictive power of these models. Additionally, further exploration of hybrid models that blend traditional statistical methods with deep learning approaches could offer more comprehensive forecasting tools for market stakeholders.

 

 

  5. Examine the contribution of your research.

Response:

Our study presents a novel application of advanced deep learning techniques, including sequential, stacked, bidirectional, and 2D convolutional LSTMs, to forecast cotton futures prices. While prior research has explored traditional models like ARIMA, the application of these architectures to agricultural commodity forecasting, particularly in the context of cotton futures, is scarce. A significant contribution of our research lies in systematically comparing a wide range of machine learning architectures and hyperparameters using a robust random search and k-fold cross-validation methodology. Our results demonstrate that 2D convolutional LSTMs outperformed other architectures and, in some instances, traditional ARIMA models in terms of forecast accuracy, especially during periods of heightened market volatility. This underscores the superiority of deep learning in capturing complex patterns in historical price data.

The primary beneficiaries of this research are U.S. cotton producers, particularly those in South Texas, who face significant risks due to price volatility. By providing more accurate price forecasts, our models enable producers to make informed hedging and marketing decisions, ultimately reducing financial risk and enhancing profitability. This practical utility extends beyond cotton futures, offering insights for other agricultural commodities with similar characteristics. While this study focuses on cotton futures, the models and methodologies can be generalized to other agricultural commodities or markets with time series data. Furthermore, incorporating additional variables (e.g., weather data, geopolitical events) could enhance predictive power. Future work could also explore hybrid approaches, combining statistical models with deep learning to leverage the strengths of both.

By addressing the challenges of forecasting in volatile commodity markets and providing actionable insights for producers, our study makes a meaningful contribution to the fields of agricultural economics, finance, and machine learning. We hope this clarification adequately highlights the significance of our work.

We added text related to this in the Summary section, right before the Conclusions:

In addition to the improved forecast accuracy demonstrated by 2D convolutional LSTMs, our models provide tangible benefits to U.S. cotton producers, particularly those in South Texas. Producers facing significant price volatility can leverage these models to make more informed hedging and marketing decisions, reducing financial risk and enhancing profitability. This practical utility underscores the value of deep learning in agricultural commodity forecasting and offers a foundation for broader applications across other markets and commodities. By addressing the challenges of forecasting in volatile commodity markets, our findings provide actionable tools for producers and traders. The methodologies can also serve as a template for other sectors, such as energy or finance, where accurate time series forecasting is critical.

 

 

 

  6. It is advised that the author should cite some recent research articles relating to forecasting models, as listed below:

* Forecasting foreign tourist arrivals in India using a single time series approach based on rough set theory. International Journal of Computing Science and Mathematics.

Response: We are having trouble including new references given that the editors have placed the manuscript in what is, not to be presumptuous, an early galley-proof format. We will do our best to work in the reference you describe and other relevant ones. We do have an extensive reference list already. Thanks for the suggestion!

Reviewer 2 Report

Comments and Suggestions for Authors

 

The article investigates the application of advanced machine learning models, specifically various Long Short-Term Memory (LSTM) architectures, to forecast cotton futures prices. While it contributes to the literature by introducing new LSTM architectures and comparing them with traditional models like ARIMA and basic neural networks, the study exhibits several disadvantages that may affect the validity and applicability of its findings.

There are issues with data inconsistencies and temporal relevance. The dataset is stated to comprise historical daily December cotton futures prices from December 6, 2009, to December 9, 2023. Given that the current date is December 5, 2024, and considering the knowledge cutoff in October 2023, the inclusion of data beyond 2023 raises questions about the accuracy and availability of such data. This temporal inconsistency could undermine the credibility of the study and its conclusions.

The article lacks detailed information on data preprocessing and handling of potential data issues. Time series data often require careful preprocessing to address missing values, outliers, and non-stationarity. The absence of a thorough discussion on data cleaning, normalization techniques, and stationarity tests may lead to concerns about the robustness of the models and the reliability of the forecasting results.

The study focuses primarily on forecasting accuracy measured by Mean Squared Error (MSE) but does not adequately discuss the statistical significance or economic relevance of the differences in performance among the models. While the 2D convolutional LSTM model outperforms others in terms of MSE, the practical implications of this improvement for cotton producers and market stakeholders are not explored. Without an analysis of whether the enhanced accuracy translates into better decision-making or increased profitability, the usefulness of the model remains uncertain.

Moreover, the article does not sufficiently address the potential for overfitting, especially with complex models like LSTMs that have numerous parameters. Overfitting occurs when a model learns the training data too well, including its noise, and fails to generalize to new, unseen data. The lack of discussion on validation techniques, such as cross-validation or the use of a separate validation dataset, raises concerns about the generalizability of the models and the reliability of the forecasting performance reported.

The article also appears to have a narrow focus in terms of model comparison. It compares various LSTM architectures and an ARIMA model but does not consider other machine learning models that could potentially offer competitive performance, such as Gradient Boosting Machines, Random Forests, or Support Vector Machines. Including a broader range of models could provide a more comprehensive assessment of the forecasting methods available.

Additionally, there is limited consideration of external factors that influence cotton futures prices. Cotton prices are affected by a myriad of factors, including macroeconomic indicators, weather conditions, policy changes, and global supply and demand dynamics. By relying solely on historical price data, the models may fail to capture these important exogenous variables, potentially limiting the accuracy and applicability of the forecasts. Incorporating additional relevant features could enhance the models' predictive capabilities.

The study's discussion on model interpretability is also limited. Advanced neural networks, particularly deep learning models like LSTMs with convolutional layers, are often criticized as "black boxes" due to their complex architectures, making it difficult to understand the underlying relationships they capture. This lack of interpretability can hinder trust and acceptance among practitioners who require transparent models to inform their decision-making processes. The article does not address how this issue might impact the adoption of such models in the cotton industry.

Furthermore, the computational complexity and resource requirements of the advanced models are not discussed. Training deep neural networks can be computationally intensive and time-consuming, which may not be practical for all users, especially those with limited access to high-performance computing resources. The article does not provide insights into the computational costs associated with each model, which is important for evaluating their feasibility in real-world applications.

The analysis of forecasting during periods of price volatility highlights the superior performance of the 2D convolutional LSTM model. However, the explanations for why this model outperforms others are somewhat speculative and lack empirical validation. The article attributes the success to the model's ability to capture localized patterns and short-term fluctuations but does not provide detailed evidence or analysis to support this claim. A deeper examination of the model's behavior during volatile periods would strengthen the conclusions.

Lastly, the article's language and structure could be improved for clarity and coherence. There are instances of grammatical errors, incomplete sentences, and inconsistent formatting, which may distract readers and obscure key points. A thorough proofreading and editing process would enhance the readability and professionalism of the work.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

ok.

Author Response

Quick Note from Authors to Reviewer 2: We sincerely appreciate your review. It shows an open mind toward using AI models in commodity price analysis while raising valid concerns about their application. Hasty use of any new idea or technology is always to be avoided. We spent the holidays working diligently to address the concerns, focusing primarily on providing transparent documentation of the model parameter choices. That was time consuming, and we were not left with a lot of time to address some of the writing concerns that were raised. Please keep this in mind. Thanks!

 

The article investigates the application of advanced machine learning models, specifically various Long Short-Term Memory (LSTM) architectures, to forecast cotton futures prices. While it contributes to the literature by introducing new LSTM architectures and comparing them with traditional models like ARIMA and basic neural networks, the study exhibits several disadvantages that may affect the validity and applicability of its findings.

There are issues with data inconsistencies and temporal relevance. The dataset is stated to comprise historical daily December cotton futures prices from December 6, 2009, to December 9, 2023. Given that the current date is December 5, 2024, and considering the knowledge cutoff in October 2023, the inclusion of data beyond 2023 raises questions about the accuracy and availability of such data. This temporal inconsistency could undermine the credibility of the study and its conclusions.

Response: We started the paper well over a year ago, and the dataset reflected the latest data available at that time. There was no purposeful omission of data. We acknowledge the importance of temporal consistency and clarify that the dataset reflects the latest available historical data at the time of study commencement. Any perceived discrepancies are due to the preparation timeline, not inaccuracies in the dataset. Future work will update the dataset to include the most recent observations for a broader temporal scope.

 

The article lacks detailed information on data preprocessing and handling of potential data issues. Time series data often require careful preprocessing to address missing values, outliers, and non-stationarity. The absence of a thorough discussion on data cleaning, normalization techniques, and stationarity tests may lead to concerns about the robustness of the models and the reliability of the forecasting results.

Response: We try to emphasize that the primary objective is to compare LSTM models and to assess whether they have any merit in commodity price analysis. We felt it was necessary to provide a reasonable benchmark from a traditional model alongside the LSTM models. We are not trying to conclude that AI models are better than time series models or vice versa. To provide the same type of systematic model parameter choice between the LSTM models and the time series model, i.e., the ARIMA model, we used the Python procedure auto_arima to choose the best ARIMA parameters based on minimum AIC. The output is included here:

ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=7550.699, Time=1.51 sec
ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=7566.960, Time=0.09 sec
ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=7562.955, Time=0.16 sec
ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=7561.771, Time=0.31 sec
ARIMA(0,1,0)(0,0,0)[0]             : AIC=7564.962, Time=0.07 sec
ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=7548.965, Time=0.94 sec
ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=7548.255, Time=0.39 sec
ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=7549.466, Time=0.55 sec
ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=7553.713, Time=0.94 sec
ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=7550.925, Time=1.61 sec
ARIMA(0,1,2)(0,0,0)[0]             : AIC=7546.257, Time=0.17 sec
ARIMA(0,1,1)(0,0,0)[0]             : AIC=7559.773, Time=0.14 sec
ARIMA(1,1,2)(0,0,0)[0]             : AIC=7546.967, Time=0.40 sec
ARIMA(0,1,3)(0,0,0)[0]             : AIC=7547.468, Time=0.36 sec
ARIMA(1,1,1)(0,0,0)[0]             : AIC=7551.715, Time=1.33 sec
ARIMA(1,1,3)(0,0,0)[0]             : AIC=7548.927, Time=3.01 sec

 

The ARIMA(0,1,2) model was chosen by the procedure, not the authors, as the best fit. Note that this choice indicates the data were non-stationary, which was handled through the ARIMA model's differencing parameter.

We added this to support ARIMA model choice:

The ARIMA model was selected using the Auto ARIMA procedure in Python, which automates the process of identifying the best model configuration by minimizing the Akaike Information Criterion (AIC). Auto ARIMA evaluates multiple combinations of model parameters, including the autoregressive (AR) terms, differencing (I), and moving average (MA) terms, as well as the inclusion of intercepts. In this study, the model configurations ranged from simple ARIMA(0,1,0) to more complex models like ARIMA(2,1,2). Each configuration was assessed based on its AIC value, with lower values indicating better model fit while balancing complexity.

The best-performing model, ARIMA(0,1,2), achieved the lowest AIC of 7546.257, demonstrating a superior balance between goodness-of-fit and model complexity compared to alternatives such as ARIMA(2,1,2) (AIC=7550.699) and ARIMA(0,1,3) (AIC=7547.468). The total fit time for the procedure was approximately 12 seconds, reflecting the efficiency of the Auto ARIMA algorithm in exploring the parameter space. Additionally, the diagnostic statistics for the selected model indicate its robustness: the Ljung-Box test for autocorrelation returned a non-significant p-value of 0.94, suggesting the residuals are uncorrelated, while the Jarque-Bera test confirmed non-normality, potentially due to the presence of heavy tails in the distribution.

The ARIMA(0,1,2) model implies a non-stationary process that was made stationary through first-order differencing (I = 1) and exhibits a dependence structure captured by the two moving average (MA) terms. The parameter estimates for MA(L1) and MA(L2) were statistically significant (p < 0.001), with coefficients of 0.0531 and -0.0857, respectively, further validating the model. The small residual variance (σ² = 1.6294) indicates that the model captures the underlying data trends effectively, making it a robust choice for forecasting.
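
In equation form, the selected ARIMA(0,1,2) specification (with no intercept, consistent with the auto_arima output above) can be written as

$$\Delta y_t = y_t - y_{t-1} = \varepsilon_t + 0.0531\,\varepsilon_{t-1} - 0.0857\,\varepsilon_{t-2}, \qquad \operatorname{Var}(\varepsilon_t) = \sigma^2 = 1.6294.$$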

 

For the ARIMA model, first-order differencing was applied to handle non-stationarity, as indicated by the parameters selected through the Auto ARIMA procedure. Outlier detection and imputation strategies were not explicitly required for this dataset, as it contained no missing values and adhered to expected seasonal patterns.
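
For concreteness, a minimal sketch of this selection step using the pmdarima package (the series name train_prices is illustrative):

import pmdarima as pm

# train_prices: the training portion of the December futures price series
model = pm.auto_arima(
    train_prices,
    seasonal=False,               # matches the (0,0,0)[0] seasonal order in the log above
    information_criterion='aic',  # select the configuration minimizing AIC
    stepwise=True,
    trace=True,                   # print each candidate, as shown above
)
print(model.summary())            # reports the selected ARIMA(0,1,2) fit and diagnostics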

 

The study focuses primarily on forecasting accuracy measured by Mean Squared Error (MSE) but does not adequately discuss the statistical significance or economic relevance of the differences in performance among the models. While the 2D convolutional LSTM model outperforms others in terms of MSE, the practical implications of this improvement for cotton producers and market stakeholders are not explored. Without an analysis of whether the enhanced accuracy translates into better decision-making or increased profitability, the usefulness of the model remains uncertain.

Response: Yes, we wholeheartedly agree. This paper's objective is to present LSTM price forecasts. While this study focuses on forecast accuracy as a benchmark, future research will explore the economic significance of these forecasts in decision-making. By assessing profitability impacts, we aim to quantify how enhanced forecasting accuracy translates into actionable benefits for cotton producers and market stakeholders. We are already planning a follow-up paper focused on decision-making and profitability when the forecasts are used in marketing decisions.

 

Moreover, the article does not sufficiently address the potential for overfitting, especially with complex models like LSTMs that have numerous parameters. Overfitting occurs when a model learns the training data too well, including its noise, and fails to generalize to new, unseen data. The lack of discussion on validation techniques, such as cross-validation or the use of a separate validation dataset, raises concerns about the generalizability of the models and the reliability of the forecasting performance reported.

Response: Thank you for your insightful feedback. We completely agree that a systematic and transparent approach is essential to mitigate overfitting and avoid “cherry-picking” models. To address these concerns, we employed a k-fold cross-validation method with five folds, integrated with a random parameter search strategy. This approach ensured that both training and validation RMSE were thoroughly evaluated across diverse parameter combinations.

To further enhance the robustness of model selection, we developed a Pareto frontier to systematically identify the models that offered the best trade-offs between training and validation RMSE. The Pareto frontier highlights models that can only be improved by increasing either training or validation RMSE, providing a clear framework for selecting optimal configurations.

From this frontier, we selected the best model based on a composite metric that equally weighted two critical factors:

  1. The model’s distance from the line representing points where training RMSE equals validation RMSE, which prioritizes generalizability.
  2. The training RMSE itself, to ensure that the model adequately captured the underlying data patterns.

This systematic and transparent process minimizes the risk of overfitting. Additionally, we strictly adhered to best practices by ensuring that the testing data was never considered during model selection. The use of the k-fold cross-validation approach further alleviates overfitting concerns by evaluating models across multiple splits of the data.

Finally, our parameter selection process also emphasizes the complexity and sensitivity of neural network models, particularly LSTMs, to hyperparameter choices. Parameters such as the number of epochs, learning rates, and the number of neurons significantly influence model performance, as demonstrated in our study. We hope this approach addresses concerns about overfitting while providing a valuable framework for researchers building similar models.

 

Here is a snippet of our k-fold random search:

# Imports assumed by this snippet (NumPy, scikit-learn, TensorFlow/Keras)
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Dropout, BatchNormalization, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Hyperparameters (defaults; re-sampled for each trial below)
n_seq = 6
n_steps = 4
n_steps_in = n_seq * n_steps
n_steps_out = 15
n_features = 1

# Random search and k-fold settings
n_trials = 20
n_splits = 5
random_search_results = []

for trial in range(n_trials):
    print(f"Trial {trial + 1}/{n_trials}")

    # Random hyperparameter sampling
    filters = np.random.choice([32, 64, 96, 128])
    dropout_rate = np.random.uniform(0.0, 0.5)
    learning_rate = np.random.uniform(0.0001, 0.1)
    batch_size = np.random.choice([16, 32, 64])
    patience = np.random.choice([5, 10])
    epochs = np.random.choice([50, 75, 100])
    n_seq = np.random.choice([1, 3])
    n_steps = np.random.choice([2, 4, 6])
    n_steps_in = n_seq * n_steps

    # Window the held-out test data (set aside here; never used in model selection)
    X_test, y_test = split_sequence(raw_seq_test, n_steps_in, n_steps_out)
    X_test = X_test.reshape((X_test.shape[0], n_seq, 1, n_steps, n_features))

    # K-fold cross-validation (no shuffling, to respect temporal ordering)
    kf = KFold(n_splits=n_splits, shuffle=False)
    val_losses = []
    train_losses = []

    for fold, (train_index, val_index) in enumerate(kf.split(raw_seq_train)):
        print(f" Fold {fold + 1}/{n_splits}")
        train_split = np.array(raw_seq_train)[train_index]
        val_split = np.array(raw_seq_train)[val_index]
        X_train, y_train = split_sequence(train_split, n_steps_in, n_steps_out)
        X_val, y_val = split_sequence(val_split, n_steps_in, n_steps_out)
        X_train = X_train.reshape((X_train.shape[0], n_seq, 1, n_steps, n_features))
        X_val = X_val.reshape((X_val.shape[0], n_seq, 1, n_steps, n_features))

        # Define the ConvLSTM2D model
        model = Sequential()
        model.add(ConvLSTM2D(filters=filters, kernel_size=(1, 2), activation='relu',
                             input_shape=(n_seq, 1, n_steps, n_features), return_sequences=True))
        model.add(Dropout(dropout_rate))
        model.add(BatchNormalization())
        model.add(Flatten())
        model.add(Dense(n_steps_out))
        model.compile(optimizer=Adam(learning_rate=learning_rate), loss='mse')

        # Callbacks: early stopping and learning-rate reduction on plateau
        early_stop = EarlyStopping(monitor='loss', patience=patience, restore_best_weights=True)
        reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=3, min_lr=1e-6)

        # Train the model
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                  validation_data=(X_val, y_val), callbacks=[early_stop, reduce_lr], verbose=0)

        # Evaluate on training data
        train_loss = model.evaluate(X_train, y_train, verbose=0)
        train_losses.append(train_loss)

        # Predict and evaluate on validation data
        y_pred_val = model.predict(X_val, batch_size=batch_size)
        val_loss = model.evaluate(X_val, y_val, verbose=0)
        val_losses.append(val_loss)

    # Record the average fold losses (MSE) for this trial's parameter draw
    # (aggregation shown for completeness)
    random_search_results.append({'filters': filters, 'dropout_rate': dropout_rate,
                                  'learning_rate': learning_rate, 'batch_size': batch_size,
                                  'n_seq': n_seq, 'n_steps': n_steps,
                                  'train_rmse': np.sqrt(np.mean(train_losses)),
                                  'val_rmse': np.sqrt(np.mean(val_losses))})

 

Our use of k-fold cross-validation and random parameter search not only ensures robustness but also addresses overfitting concerns by testing the model on multiple data splits. Additionally, early stopping and learning rate reduction strategies further mitigate the risk of overfitting, ensuring that the model generalizes well to unseen data.

 

 

The article also appears to have a narrow focus in terms of model comparison. It compares various LSTM architectures and an ARIMA model but does not consider other machine learning models that could potentially offer competitive performance, such as Gradient Boosting Machines, Random Forests, or Support Vector Machines. Including a broader range of models could provide a more comprehensive assessment of the forecasting methods available.

Response: We agree, and that can be addressed in future research. It took considerable effort to parameterize and execute this suite of models, which we believe is adequate for a journal article. LSTMs were selected for their superior ability to model temporal dependencies in time series data, while ARIMA serves as a robust benchmark for comparison. Future studies will incorporate machine learning models such as Gradient Boosting and Random Forests to broaden the scope of comparative analysis.

 

Additionally, there is limited consideration of external factors that influence cotton futures prices. Cotton prices are affected by a myriad of factors, including macroeconomic indicators, weather conditions, policy changes, and global supply and demand dynamics. By relying solely on historical price data, the models may fail to capture these important exogenous variables, potentially limiting the accuracy and applicability of the forecasts. Incorporating additional relevant features could enhance the models' predictive capabilities.

Response: We agree, and in particular the 2D convolutional LSTM model could be a strong performer with multiple sources of data, particularly if the data are spatially explicit. This study establishes a baseline for forecasting accuracy using historical prices alone. The framework is designed to accommodate additional variables, such as weather, global supply-demand indicators, and sentiment analysis, which will be incorporated in future work to further enhance model performance. Since this was our first exploratory attempt, our aim was to keep the dataset simple. Our future aim is to pare down the model suite and, as you suggest, add exogenous factors. Big data can provide satellite imagery from cotton fields throughout the world as production proxies, weather data is available, and even the use of sentiment analysis from published “cotton outlook” forecasts is next on our agenda.

 

The study's discussion on model interpretability is also limited. Advanced neural networks, particularly deep learning models like LSTMs with convolutional layers, are often criticized as "black boxes" due to their complex architectures, making it difficult to understand the underlying relationships they capture. This lack of interpretability can hinder trust and acceptance among practitioners who require transparent models to inform their decision-making processes. The article does not address how this issue might impact the adoption of such models in the cotton industry.

Response: Yes, this is a great point, and we added text to explain the difficulties that non-machine-learning analysts will face. Specifically, the sensitivity of the LSTM models to parameter choice is daunting, as illustrated in the random search figure we added. We are not advocates of machine learning; our aim is to provide an unbiased comparison of these models against traditional time series models. We acknowledge the interpretability challenge posed by neural networks, particularly deep LSTMs. Future work will explore attention mechanisms and feature attribution techniques to improve transparency and provide actionable insights for stakeholders.

We added this to conclusions:

Neural network models, particularly deep learning architectures, are often considered "black boxes" because they rely on complex, non-linear transformations and large numbers of parameters to make predictions. This makes it difficult to interpret the relationships captured by the model or understand how specific inputs influence outputs. In contrast, traditional time series models, such as ARIMA, offer the advantage of statistical inference, allowing researchers to test structural variables and directly evaluate the significance of model components. This transparency enables a deeper understanding of the underlying data-generating process, making these models more accessible and interpretable for decision-making.

 

 

Furthermore, the computational complexity and resource requirements of the advanced models are not discussed. Training deep neural networks can be computationally intensive and time-consuming, which may not be practical for all users, especially those with limited access to high-performance computing resources. The article does not provide insights into the computational costs associated with each model, which is important for evaluating their feasibility in real-world applications.

Response: We used an ordinary PC, but we agree that if implemented in a real-world setting, where money is truly on the line, much greater computing resources would be required, e.g., a supercomputer.

We added this to the revised manuscript:

While machine learning and AI algorithms can, in certain applications, outperform traditional econometric and statistical methods, their benefits should be weighed against the increased computational time and complexity. This study was conducted using a typical personal computer, but real-time, real-world implementation would likely require much larger computing facilities. Businesses and institutions could also face human capital limitations, given the complexity of constructing and implementing advanced neural network models, which have yet to become standard curriculum in most business and economics programs.

 

 

The analysis of forecasting during periods of price volatility highlights the superior performance of the 2D convolutional LSTM model. However, the explanations for why this model outperforms others are somewhat speculative and lack empirical validation. The article attributes the success to the model's ability to capture localized patterns and short-term fluctuations but does not provide detailed evidence or analysis to support this claim. A deeper examination of the model's behavior during volatile periods would strengthen the conclusions.

Response: We agree, and the issue lies in perhaps the biggest drawback of AI models: their lack of economic structure and the resulting “black box” characterization. Neural network models rely on neurons and simply do not offer the same level of interpretation as a traditional econometric model. In all honesty, this turns off many economists from using them, and that is fine. We also wish there were a better rationale to explain their workings; the best we can do is remind the reader of this drawback of AI models. In future studies, we aspire to conduct a detailed analysis of the convolutional layers to identify the specific patterns and features that enable the 2D convolutional LSTM to excel during volatile periods. This will provide empirical validation for its observed performance advantages.

 

 

Lastly, the article's language and structure could be improved for clarity and coherence. There are instances of grammatical errors, incomplete sentences, and inconsistent formatting, which may distract readers and obscure key points. A thorough proofreading and editing process would enhance the readability and professionalism of the work.

Response: We have thoroughly revised the manuscript to improve clarity, correct grammatical errors, and ensure consistent formatting. This includes restructuring key sections for coherence and readability, and adding text to address the comments you raised. Thank you for your thoughtful comments!

Reviewer 3 Report

Comments and Suggestions for Authors

The paper reports an application of selected machine learning/neural network methods to forecasting cotton futures prices in short horizons: 5, 10 and 15 days. Although the problem is quite important and interesting from the empirical research of commodity markets perspective, the design and execution of the empirical experiment leave something to be desired. Implementing the following remarks could improve the paper considerably.

1. It is not clear how the models (ARIMA as well as NN) were estimated. The paper suggests that the models were estimated on the training data once and then applied to the entire test data sample. This is inconsistent with the standard forecast evaluation approach that requires rolling or recursive reestimation and evaluation schemes. Single model estimations lead to deterioration of accuracy towards the end of test samples (as seen in your charts) due to possible structural changes of dynamics in the test sample, and do not reflect how models are used in practice.

2. There is no clearly defined benchmark. ARIMA(0,1,2) is somewhat arbitrary (in-sample fit used by auto-arima is usually not a good criterion for selecting a forecasting model). I suggest, e.g., the naive no-change forecast or AR(1), and show the RMSEs relative to it (as ratios). Then we would clearly see whether we gain anything at all by employing more complex models. Moreover, in order to show statistical significance of your results, proper Diebold-Mariano type test results should be reported.

3. What you refer to as MSE (line 212) is actually RMSE (root mean squared error).

4. The paper could have been much shorter. The detailed description of NN models should be replaced by references to proper literature. Also, the results are discussed in an overly lengthy and repetitive way. On the other hand, the specifics of the cotton market may be interesting to potential readers, and you may think of extending this section.

 

Author Response

Reviewer 3

 

The paper reports an application of selected machine learning/neural network methods to forecasting cotton futures prices in short horizons: 5, 10 and 15 days. Although the problem is quite important and interesting from the empirical research of commodity markets perspective, the design and execution of the empirical experiment leave something to be desired. Implementing the following remarks could improve the paper considerably.

  1. It is not clear how the models (ARIMA as well as NN) were estimated. The paper suggests that the models were estimated on the training data once and then applied to the entire test data sample. This is inconsistent with the standard forecast evaluation approach that requires rolling or recursive reestimation and evaluation schemes. Single model estimations lead to deterioration of accuracy towards the end of test samples (as seen in your charts) due to possible structural changes of dynamics in the test sample, and do not reflect how models are used in practice.

 

Authors’ Response:

We appreciate the reviewer’s insightful comment regarding rolling or recursive re-estimation schemes. We took your comment strongly to heart and completely agree that a systematic and transparent approach is essential to mitigate overfitting and avoid “cherry-picking” models. To address these concerns, we employed a k-fold cross-validation method with five folds, integrated with a random parameter search strategy (see code below). This approach ensured that both training and validation RMSE were thoroughly evaluated across diverse parameter combinations.

To further enhance the robustness of model selection, we developed a Pareto frontier to systematically identify the models that offered the best trade-offs between training and validation RMSE. The Pareto frontier highlights models that can only be improved by increasing either training or validation RMSE, providing a clear framework for selecting optimal configurations.

From this frontier, we selected the best model based on a composite metric that equally weighted two critical factors:

  1. The model’s distance from the line representing points where training RMSE equals validation RMSE, which prioritizes generalizability.
  2. The training RMSE itself, to ensure that the model adequately captured the underlying data patterns.

This systematic and transparent process minimizes the risk of overfitting. Additionally, we strictly adhered to best practices by ensuring that the testing data was never considered during model selection. The use of the k-fold cross-validation approach further alleviates overfitting concerns by evaluating models across multiple splits of the data.

Finally, our parameter selection process also emphasizes the complexity and sensitivity of neural network models, particularly LSTMs, to hyperparameter choices. Parameters such as the number of epochs, learning rates, and the number of neurons significantly influence model performance, as demonstrated in our study. We hope this approach addresses concerns about overfitting while providing a valuable framework for researchers building similar models.

 

Here is a snippet of our k-fold random search:

# Imports assumed by this snippet (NumPy, scikit-learn, TensorFlow/Keras)
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Dropout, BatchNormalization, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Hyperparameters (defaults; re-sampled for each trial below)
n_seq = 6
n_steps = 4
n_steps_in = n_seq * n_steps
n_steps_out = 15
n_features = 1

# Random search and k-fold settings
n_trials = 20
n_splits = 5
random_search_results = []

for trial in range(n_trials):
    print(f"Trial {trial + 1}/{n_trials}")

    # Random hyperparameter sampling
    filters = np.random.choice([32, 64, 96, 128])
    dropout_rate = np.random.uniform(0.0, 0.5)
    learning_rate = np.random.uniform(0.0001, 0.1)
    batch_size = np.random.choice([16, 32, 64])
    patience = np.random.choice([5, 10])
    epochs = np.random.choice([50, 75, 100])
    n_seq = np.random.choice([1, 3])
    n_steps = np.random.choice([2, 4, 6])
    n_steps_in = n_seq * n_steps

    # Window the held-out test data (set aside here; never used in model selection)
    X_test, y_test = split_sequence(raw_seq_test, n_steps_in, n_steps_out)
    X_test = X_test.reshape((X_test.shape[0], n_seq, 1, n_steps, n_features))

    # K-fold cross-validation (no shuffling, to respect temporal ordering)
    kf = KFold(n_splits=n_splits, shuffle=False)
    val_losses = []
    train_losses = []

    for fold, (train_index, val_index) in enumerate(kf.split(raw_seq_train)):
        print(f" Fold {fold + 1}/{n_splits}")
        train_split = np.array(raw_seq_train)[train_index]
        val_split = np.array(raw_seq_train)[val_index]
        X_train, y_train = split_sequence(train_split, n_steps_in, n_steps_out)
        X_val, y_val = split_sequence(val_split, n_steps_in, n_steps_out)
        X_train = X_train.reshape((X_train.shape[0], n_seq, 1, n_steps, n_features))
        X_val = X_val.reshape((X_val.shape[0], n_seq, 1, n_steps, n_features))

        # Define the ConvLSTM2D model
        model = Sequential()
        model.add(ConvLSTM2D(filters=filters, kernel_size=(1, 2), activation='relu',
                             input_shape=(n_seq, 1, n_steps, n_features), return_sequences=True))
        model.add(Dropout(dropout_rate))
        model.add(BatchNormalization())
        model.add(Flatten())
        model.add(Dense(n_steps_out))
        model.compile(optimizer=Adam(learning_rate=learning_rate), loss='mse')

        # Callbacks: early stopping and learning-rate reduction on plateau
        early_stop = EarlyStopping(monitor='loss', patience=patience, restore_best_weights=True)
        reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=3, min_lr=1e-6)

        # Train the model
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                  validation_data=(X_val, y_val), callbacks=[early_stop, reduce_lr], verbose=0)

        # Evaluate on training data
        train_loss = model.evaluate(X_train, y_train, verbose=0)
        train_losses.append(train_loss)

        # Predict and evaluate on validation data
        y_pred_val = model.predict(X_val, batch_size=batch_size)
        val_loss = model.evaluate(X_val, y_val, verbose=0)
        val_losses.append(val_loss)

    # Record the average fold losses (MSE) for this trial's parameter draw
    # (aggregation shown for completeness)
    random_search_results.append({'filters': filters, 'dropout_rate': dropout_rate,
                                  'learning_rate': learning_rate, 'batch_size': batch_size,
                                  'n_seq': n_seq, 'n_steps': n_steps,
                                  'train_rmse': np.sqrt(np.mean(train_losses)),
                                  'val_rmse': np.sqrt(np.mean(val_losses))})

 

 

For the test evaluation, we did not allow models to be retrained. Our primary objective was to evaluate the LSTM models' ability to generalize over extended test periods, rather than optimizing them for practical rolling applications. However, we recognize the relevance of rolling or recursive schemes for practical implementation and will incorporate such approaches in future research. This extension will ensure that our findings align more closely with real-world forecasting workflows while maintaining robust evaluation frameworks.
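
A minimal sketch of the rolling-origin (walk-forward) scheme we plan to adopt, where fit_and_forecast is a hypothetical wrapper around any of the models above:

import numpy as np

def rolling_rmse(series, first_origin, horizon, step, fit_and_forecast):
    """Refit at each origin on all data up to that point (recursive scheme),
    forecast `horizon` steps ahead, and pool the errors into one RMSE."""
    errors = []
    for origin in range(first_origin, len(series) - horizon + 1, step):
        y_hat = fit_and_forecast(series[:origin], horizon)  # refit, then forecast
        errors.append(np.asarray(series[origin:origin + horizon]) - y_hat)
    return np.sqrt(np.mean(np.concatenate(errors) ** 2))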

 

  2. There is no clearly defined benchmark. ARIMA(0,1,2) is somewhat arbitrary (in-sample fit used by auto-arima is usually not a good criterion for selecting a forecasting model). I suggest, e.g., the naive no-change forecast or AR(1), and show the RMSEs relative to it (as ratios). Then we would clearly see whether we gain anything at all by employing more complex models. Moreover, in order to show statistical significance of your results, proper Diebold-Mariano type test results should be reported.

Response: We apologize for being too brief on the ARIMA model construction in the originally submitted manuscript. Please see the lengthy text we added to the revised manuscript, included below in this same comment. We appreciate the suggestion to include additional benchmarks, such as AR(1) or naive forecasts. While Auto ARIMA provides a systematic baseline for this study, future iterations will incorporate these alternatives and explicitly report RMSE ratios relative to them. Additionally, we acknowledge the value of the Diebold-Mariano test for assessing forecast accuracy. While not included in this version due to time constraints, we plan to integrate DM testing in future studies to rigorously evaluate statistical differences between model performances. We added the ARIMA model to alleviate concerns that, by omitting a time series model, there would be no “traditional” benchmark for comparison. Hence our purpose is not to make a definitive comparison between AI and traditional econometric models. More refined time series models, such as error correction models, could perform as well as, if not better than, the LSTM. We are not AI model advocates: we have no dog in the fight and are not trying to convert time series modelers to AI!
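
For reference, a minimal sketch of the relative-RMSE benchmark and the Diebold-Mariano test we plan to report in future work (the names prices and e_model are illustrative):

import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2, h=1):
    """DM test for equal predictive accuracy under squared-error loss.
    e1, e2: forecast errors from two competing models; h: forecast horizon."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2        # loss differential
    n, dbar = len(d), np.mean(d)
    # long-run variance with autocovariances up to lag h - 1
    gamma = [np.sum((d[k:] - dbar) * (d[:n - k] - dbar)) / n for k in range(h)]
    lrv = gamma[0] + 2.0 * sum(gamma[1:])
    stat = dbar / np.sqrt(lrv / n)
    return stat, 2.0 * (1.0 - norm.cdf(abs(stat)))       # statistic, two-sided p-value

def rmse(e):
    return np.sqrt(np.mean(np.square(e)))

# naive no-change benchmark: forecast tomorrow's price as today's price
# e_naive = prices[1:] - prices[:-1]
# print(rmse(e_model) / rmse(e_naive))  # ratio < 1 means the model beats naive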

We understand you may not favor auto_arima, but using it provides the same type of systematic approach we used for the LSTM models. Those were also based on in-sample data, and we did not use any of the testing data. Here is the auto_arima output:

ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=7550.699, Time=1.51 sec
ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=7566.960, Time=0.09 sec
ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=7562.955, Time=0.16 sec
ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=7561.771, Time=0.31 sec
ARIMA(0,1,0)(0,0,0)[0]             : AIC=7564.962, Time=0.07 sec
ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=7548.965, Time=0.94 sec
ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=7548.255, Time=0.39 sec
ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=7549.466, Time=0.55 sec
ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=7553.713, Time=0.94 sec
ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=7550.925, Time=1.61 sec
ARIMA(0,1,2)(0,0,0)[0]             : AIC=7546.257, Time=0.17 sec
ARIMA(0,1,1)(0,0,0)[0]             : AIC=7559.773, Time=0.14 sec
ARIMA(1,1,2)(0,0,0)[0]             : AIC=7546.967, Time=0.40 sec
ARIMA(0,1,3)(0,0,0)[0]             : AIC=7547.468, Time=0.36 sec
ARIMA(1,1,1)(0,0,0)[0]             : AIC=7551.715, Time=1.33 sec
ARIMA(1,1,3)(0,0,0)[0]             : AIC=7548.927, Time=3.01 sec

We added the following text to better support the ARIMA model choice, while again respecting your concerns about focusing primarily on ARIMA:

The ARIMA model was selected using the Auto ARIMA procedure in Python, which automates the process of identifying the best model configuration by minimizing the Akaike Information Criterion (AIC). Auto ARIMA evaluates multiple combinations of model parameters, including the autoregressive (AR) terms, differencing (I), and moving average (MA) terms, as well as the inclusion of intercepts. In this study, the model configurations ranged from simple ARIMA(0,1,0) to more complex models like ARIMA(2,1,2). Each configuration was assessed based on its AIC value, with lower values indicating better model fit while balancing complexity.

The best-performing model, ARIMA(0,1,2), achieved the lowest AIC of 7546.257, demonstrating a superior balance between goodness-of-fit and model complexity compared to alternatives such as ARIMA(2,1,2) (AIC=7550.699) and ARIMA(0,1,3) (AIC=7547.468). The total fit time for the procedure was approximately 12 seconds, reflecting the efficiency of the Auto ARIMA algorithm in exploring the parameter space. Additionally, the diagnostic statistics for the selected model indicate its robustness: the Ljung-Box test for autocorrelation returned a non-significant p-value of 0.94, suggesting the residuals are uncorrelated, while the Jarque-Bera test confirmed non-normality, potentially due to the presence of heavy tails in the distribution.

The ARIMA(0,1,2) model implies a non-stationary process that was made stationary through first-order differencing (I = 1) and exhibits a dependence structure captured by the two moving average (MA) terms. The parameter estimates for MA(L1) and MA(L2) were statistically significant (p < 0.001), with coefficients of 0.0531 and -0.0857, respectively, further validating the model. The small residual variance (σ² = 1.6294) indicates that the model captures the underlying data trends effectively, making it a robust choice for forecasting.

  3. What you refer to as MSE (line 212) is actually RMSE (root mean squared error).

Authors’ Response: Thank you for noting this. We corrected all references from MSE to RMSE in the revised manuscript. Cumulative RMSE is reported to illustrate how forecast accuracy evolves over time, providing a dynamic perspective on model performance for the 5-, 10-, and 15-day horizons, including how the averaging process can at times result in decreasing RMSE values.
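
For clarity, the cumulative RMSE through test day $T$ is

$$\mathrm{RMSE}_T = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( y_t - \hat{y}_t \right)^2},$$

a running average of squared errors, which is why a stretch of accurate forecasts late in the test window can pull the reported value down even after earlier error spikes.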

 

  4. The paper could have been much shorter. The detailed description of NN models should be replaced by references to proper literature. Also, the results are discussed in an overly lengthy and repetitive way. On the other hand, the specifics of the cotton market may be interesting to potential readers, and you may think of extending this section.

Response: We kept the text since the other reviewers did not raise this concern, but the repetition is noted should future revisions be required. We included detailed descriptions of the NN models to provide context for readers less familiar with LSTM architectures; however, we recognize the potential for conciseness in future revisions. Expanding the discussion of the cotton market's specifics, as suggested, is an excellent idea for future work, particularly in applied settings where these forecasts guide real-world marketing decisions.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Accepted.

Author Response

Dear Editor, As far as we are aware, there are no comments to address. 

Reviewer 3 Report

Comments and Suggestions for Authors

Since my main concerns (1&2) are not addressed in the revised paper (the explanations are far from convincing), I stand by my original recommendation. 

Author Response

Dear Editor, As far as we are aware, there are no comments to address. 
