1. Introduction
Crude oil is a vital global commodity whose price fluctuations significantly impact countries, organizations, businesses, and individuals worldwide. Accurate prediction of crude oil prices is, therefore, of paramount importance for economic planning and decision-making. Numerous factors influence crude oil prices, each varying in its degree of impact. In this paper, we thoroughly investigate these factors, assessing their relative significance. Utilizing these insights, we then employ advanced machine learning techniques to predict crude oil prices. Our model also incorporates the dollar index as a critical variable.
In the past, researchers have attempted to predict crude oil prices using statistical models, econometric time series models, and machine learning models. Random walk-based methods are among the statistical models that have been utilized for oil price prediction (Murat and Tokat 2009). Econometric time series models, which use historical data for forecasting, include quantitative approaches like the Autoregressive Integrated Moving Average (ARIMA) models. Fernandez (2006) compared the accuracy of different models for long- and short-term horizon oil price forecasting and showed that ARIMA models were only suitable for short-term predictions. As a linear model, ARIMA did not work well for highly volatile oil prices and was unable to capture the nonlinear features of oil price time series. Generalized autoregressive conditional heteroskedasticity (GARCH) models have also been employed for oil price prediction. GARCH is an econometric approach to estimating volatility in financial markets (Kenton 2020).
To address the limitations of econometric algorithms, several machine learning techniques, including neural networks (NNs) (Chen et al. 2017; Gupta and Nigam 2020), have been proposed. Refenes et al. (1994) examined the use of NNs for stock performance prediction and found that NNs outperform the classical statistical methods.
Recently, the application of deep learning techniques in economics and finance has been increasing, since they can capture complex patterns. The commonly used deep learning approaches include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their extensions like long short-term memory (LSTM). Li et al. (2020) introduced an innovative method for analyzing and text-mining online media using a CNN. Wang and Wang (2016) forecasted crude oil indices using an RNN. Cen and Wang (2019) and Bristone et al. (2020) proposed LSTM-based models to predict the fluctuating behaviors of crude oil prices. Jahanshahi et al. (2022) utilized LSTM and bidirectional LSTM (Bi–LSTM) models to forecast crude oil prices impacted by the Russia–Ukraine war and the COVID-19 pandemic. Daneshvar et al. (2022) explored LSTM and Bi–LSTM models to predict Brent crude oil prices.
Sen and Choudhury (2024) utilized LSTM and gated recurrent unit (GRU) networks to model and predict crude oil prices. The authors conducted a comparative study and found that GRU exhibited better performance than LSTM. Aldabagh et al. (2023) proposed a deep learning model for one-step- and multi-step-ahead crude oil price forecasting. The model extracts important features that impact crude oil prices and uses them to predict future prices. The prediction model combines two deep learning models, namely a CNN and LSTM. The CNN–LSTM model had better accuracy than the other models considered, including LSTM, CNN, SVM, and ARIMA.
Kakade et al. (2023) focused on the impact of macroeconomic variables on crude oil prices. Their research employed a hybrid ensemble learning approach to improve the prediction efficiency of crude oil prices. It combines LSTM with factors that influence the price of crude oil. The study highlighted the significant influence of explanatory fundamental and technical variables, such as the dollar index and various macroeconomic variables.
Lu et al. (2021) developed a new research framework for core influence factors selection and forecasting. They applied an LSTM model to forecast crude oil prices, incorporating a range of influencing factors such as supply and demand, global economic development, and financial and technology factors. Finally, they compared the LSTM model with six different forecasting techniques. The LSTM model demonstrated superior performance in capturing the intricate dependencies between these factors and oil prices. Dai et al. (2021) provided new evidence that the U.S. dollar index (USDX) has significant out-of-sample forecasting power on oil price.
The above papers consider few or no macroeconomic factors influencing crude oil prices, omit the dollar index, or do not use hybrid models. In this paper, we fill the gap by adopting hybrid deep learning models with core factors and the dollar index. By integrating insights from these studies, our research aims to advance the field of crude oil price prediction through the application of cutting-edge machine learning techniques. We particularly focus on the dollar index’s role and its interplay with other critical factors influencing crude oil prices. This paper not only identifies and evaluates the key factors affecting crude oil prices but also demonstrates the efficacy of machine learning models in enhancing prediction accuracy. The integration of the dollar index into our predictive model further contributes to a comprehensive understanding of the determinants of crude oil prices.
Inspired by the above observations, this paper develops a new method that forecasts the change in crude oil prices by incorporating multiple effects and the dollar index. The main contributions of this work are twofold. (1) By using multiple features including the supply, demand, financial markets, and spot prices of crude oil, the method can achieve better prediction accuracy. (2) By integrating the dollar index into the model, the method can produce better prediction results. 
The rest of this paper is organized as follows: In Section 2, the factors affecting crude oil prices are discussed in detail. In Section 3, the theoretical background of the machine and deep learning models used in this paper is briefly explained. Section 4 describes the datasets that we used, the evaluation metrics, the preprocessing steps, and the results of our experiments. Finally, Section 5 summarizes our work, states the advantages and limitations of the proposed method, and discusses some future work.
  2. Factors Affecting Crude Oil Prices
Crude oil is an essential commodity whose price can affect many countries, organizations, businesses, and individuals. As such, the prediction of crude oil prices is of great importance. There are many factors that impact the price of crude oil. Some factors have a huge impact, while others have less impact. In this paper, we study the factors that influence the price of crude oil. First, we analyze the factors and their importance. Then, we predict the price of crude oil using these factors. Also, we consider the dollar index in the model.
  2.1. Influencing Factors and Data Analysis
According to the U.S. Energy Information Administration (EIA) (https://www.eia.gov/finance/markets/crudeoil/, accessed on 30 June 2024), there are seven factors that influence the crude oil markets. 
Figure 1 depicts the seven main factors including the demand (OECD/non-OECD), supply (OPEC/non-OPEC), balance, financial markets, and spot prices that have a direct impact on the crude oil price.
  2.2. OPEC Supply
The production of crude oil from the Organization of the Petroleum Exporting Countries (OPEC) plays a significant role in shaping oil prices. OPEC actively manages oil production across its member countries by implementing production targets. Historical trends reveal that reductions in OPEC production targets often coincide with increases in crude oil prices. OPEC member countries collectively account for approximately 40 percent of global crude oil production. Moreover, OPEC’s oil exports have considerable impact on global prices, representing about 60 percent of total petroleum traded internationally. Given their substantial market share, OPEC’s decisions exert a notable influence on international oil prices. Notably, signals indicating changes in crude oil production from Saudi Arabia, OPEC’s leading producer, frequently prompt fluctuations in oil prices. 
Figure 2 plots the Saudi production change (indicated by the blue bar or blue for short) and the WTI production change (indicated by the red curve or red for short), from 2001 to 2023, plus the first quarter of 2024, with a total of 93 quarters.
The degree to which OPEC countries utilize their existing production capacity serves as a measure for assessing the tightness of global oil markets and the extent of OPEC’s impact on price escalation. Spare capacity, as defined by the EIA, refers to the production volume that can be swiftly initiated within 30 days and maintained for a minimum of 90 days. Notably, Saudi Arabia, the foremost oil producer within OPEC and the largest oil exporter globally, plays a pivotal role in this context. The spare capacity held by OPEC serves as a benchmark for assessing the global oil market’s resilience in the case of potential crises that may disrupt oil supplies. Consequently, when OPEC’s spare capacity diminishes, oil prices often increase as a risk premium. 
Figure 3 plots the OPEC spare capacity (blue) and the WTI prices (red) from 2001 to 2025, with a total of 100 quarters. The last three quarters of 2024 and the four quarters of 2025 are estimated values (the same thereafter).
  2.3. Non-OPEC Supply
Presently, approximately 60 percent of global oil production originates from countries outside OPEC. Prominent hubs of non-OPEC production encompass North America, various territories within the former Soviet Union, and the North Sea region. 
Figure 4 plots the non-OPEC production changes (blue) and the WTI prices (red) from 2001 to 2025, with a total of 100 quarters.
In general, producers in non-OPEC countries are considered price takers, meaning that they react to market prices rather than actively managing production to influence prices. Consequently, non-OPEC producers typically operate close to full capacity, resulting in limited spare capacity. All else being equal, a decrease in non-OPEC supply levels tends to exert upward pressure on prices, by reducing total global supply and increasing the demand for OPEC’s output. This heightened demand for OPEC’s oil enhances its potential influence on prices. 
Figure 5 plots the world production change (blue) and the WTI prices (red) from 2001 to 2025, with a total of 100 quarters. The world production includes OPEC and non-OPEC production.
Further data analysis was carried out by plotting a graph to see the correlation and dependency of the crude oil price, based on Saudi production change, non-OPEC production change, and world production change. We integrated all the changes and prices versus the quarters in one plot. 
Figure 6 shows the world production change (indicated by the orange bar or orange for short), non-OPEC production change (indicated by the green bar or green for short), Saudi production change (blue), and WTI crude oil prices (red) from 2001 to 2023 plus the first quarter of 2024, with a total of 93 quarters. From the figure, we can spot the correlation between the Saudi production trend (blue) and WTI crude oil prices trend (red). This can be seen in the period from quarter 0 to quarter 50 as the two trends move in the same direction. Also, this is observed from quarter 70 to quarter 93.
  2.4. OECD Demand
The Organization for Economic Cooperation and Development (OECD) comprises industrialized countries, such as the United States, a significant portion of Europe, and other countries. 
Figure 7 plots the OECD consumption change (blue) and the WTI prices (red) from 2001 to 2025, with a total of 100 quarters. The last three quarters of 2024 and the four quarters of 2025 are estimated values (the same thereafter).
  2.5. Non-OECD Demand
In recent years, there has been a marked surge in oil consumption in developing countries, which are outside OECD, and they have been pivotal in propelling the surge in global demand for petroleum products. While oil consumption in OECD countries experienced a downturn from 2000 to 2010, non-OECD oil consumption witnessed a robust increase of more than 40 percent. Notably, China, India, and Saudi Arabia exhibited the most substantial growth in oil consumption among non-OECD nations during this time. Collectively, non-OECD countries surpassed OECD counterparts in petroleum product consumption for the first time in 2014. 
Figure 8 plots the non-OECD consumption change (blue) and their gross domestic product (GDP) percentage growth (red) from 2001 to 2025, with a total of 100 quarters.
Figure 9 plots the world consumption change (blue) and WTI prices (red) from 2002 to 2025, with a total of 96 quarters. 
 Moreover, we plot a graph to see the correlation and dependency of the crude oil price, compared with the OECD, non-OECD, and world consumption change. We integrate the changes and prices versus the quarters into one plot. 
Figure 10 shows the world consumption change (orange), OECD consumption change (blue), non-OECD consumption change (green), and WTI crude oil prices (red) from 2002 to 2025, with a total of 96 quarters.
From the figure, we can spot some correlation between the non-OECD consumption changes (green) and the WTI crude oil prices trend (red), especially in the first 20 quarters, 25th to 40th quarter, and the final 20 quarters. For a precise measurement, we calculated the correlation coefficients of the WTI price with the OECD, non-OECD, and world consumption percentage changes. The highest correlation coefficient is between the WTI price and the non-OECD consumption change, as shown in Table 1.
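As an illustration of how such correlation coefficients can be computed, the following minimal sketch implements the Pearson correlation coefficient in plain Python. The paper does not state which coefficient was used, so Pearson is an assumption here, and all numbers below are made-up illustrative values, not the EIA data behind Table 1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


# Hypothetical quarterly values, for illustration only (not the paper's data).
wti_price = [60.0, 62.5, 58.0, 65.0, 70.0, 68.5]
consumption_change = {
    "OECD": [0.5, -0.2, 0.1, -0.4, 0.3, 0.0],
    "non-OECD": [1.0, 1.4, 0.6, 1.8, 2.5, 2.2],
    "world": [0.8, 0.7, 0.4, 0.9, 1.5, 1.2],
}

correlations = {
    name: pearson(wti_price, series)
    for name, series in consumption_change.items()
}
# Pick the series whose correlation with the price is largest in magnitude.
strongest = max(correlations, key=lambda name: abs(correlations[name]))
```

Ranking by absolute value treats strong negative correlations as informative too; ranking by the signed value would be an equally reasonable choice if only co-movement in the same direction is of interest.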
  2.6. U.S. Dollar Index 
The USDX is used to assess the dollar’s worth in comparison to a basket of six foreign currencies (Euro, Swiss franc, Japanese yen, Canadian dollar, British pound, and Swedish krona). It is considered a fair value for the dollar in economic markets.
Figure 11 shows the monthly USDX during the period from 2002 to 2023. When the USDX rises, it signifies an increase in the value of the U.S. dollar relative to other currencies. For example, if the index value increases by 30 from its initial value of 100, it means a 30% appreciation. Likewise, if the index value decreases by 30 from its initial value of 100, it means a 30% depreciation. The USDX serves as an indicator of the overall condition of the U.S. economy, and so traders have the option to utilize it for speculating on changes in the value of the dollar relative to other currencies (Chen 2024).
   3. Theoretical Background
In this section, we provide a brief introduction to the algorithms compared in this paper. We considered eight models used in the literature. We are interested in machine and deep learning algorithms, as they can capture complex patterns such as the nonlinearity and stochasticity of crude oil prices. We first provide some background for decision trees (DTs), random forests (RFs), gradient boosting (GB), NNs, CNNs, LSTM, Bi–LSTM, and CNN–LSTM architectures.
  3.1. Decision Trees
DTs are classification techniques that use multiple covariates to build prediction algorithms for a target variable. This approach categorizes a population into branch-like segments, forming an inverted tree structure with a root node, internal nodes, and leaf nodes. In DTs, questions serve as decision nodes, dividing the data into subsets. Each question aids in reaching a final decision, represented by a leaf node. Instances meeting the criteria follow the “Yes” branch, while others follow an alternate path. DTs aim to identify the optimal split to partition the data.
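To make the idea of an "optimal split" concrete, the sketch below searches for the single threshold on one feature that minimizes the summed squared error when each side is predicted by its mean. Squared error is an assumption chosen because the target in this paper (a price) is continuous; classification trees typically use impurity measures such as Gini or entropy instead. The function name and sample values are illustrative only.

```python
def best_split(xs, ys):
    """Return (threshold, sse) for the one split on xs that minimizes the
    summed squared error of predicting each side by its mean.
    Returns None when xs has fewer than two distinct values."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values,
    # so both sides of every candidate split are non-empty.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical feature/target values with an obvious jump between x=3 and x=4.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.0, 30.0, 31.0, 29.0]
threshold, error = best_split(xs, ys)
```

A full tree applies this search recursively to each resulting subset, and over every available feature, until a stopping rule such as a maximum depth is reached.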
  3.2. Random Forests
RFs are ensemble learning models used for classification and regression problems (Breiman 2001). An RF merges the outcomes from numerous DTs to produce a single result. It comprises a collection of tree predictors, where each decision tree incorporates a random vector as a parameter and randomly selects a subset of attributes. Consequently, a random subset of features is used to partition each node. Additionally, each predictor randomly chooses a training sample. Figure 12 shows the result of a random forest, which is obtained by aggregating the individual DTs (averaging their outputs for regression, or majority voting for classification).
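The two sources of randomness and the final aggregation can be sketched as follows. This is a deliberately reduced illustration, not the configuration used in the experiments: each "tree" is a one-split regression stump, the data have a single feature (so random feature selection is not shown), and for regression the forest output is taken as the average of the tree outputs. All names and values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns a predict(x) function."""
    def mean(values):
        return sum(values) / len(values)

    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        m_left, m_right = mean(left), mean(right)
        err = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, threshold, m_left, m_right)
    if best is None:  # all x identical: fall back to a constant prediction
        constant = mean(ys)
        return lambda x: constant
    _, threshold, m_left, m_right = best
    return lambda x: m_left if x <= threshold else m_right


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Regression forests average the individual tree outputs."""
    return sum(tree(x) for tree in trees) / len(trees)


xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.0, 30.0, 31.0, 29.0]
trees = fit_forest(xs, ys)
low, high = predict_forest(trees, 1.5), predict_forest(trees, 5.5)
```

Because every tree sees a different bootstrap sample, individual trees disagree slightly, and averaging them reduces the variance of the combined prediction.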
  3.3. Gradient Boosting
GB is a method that constructs new models to predict the errors of preceding models, subsequently combining them to produce the forecast. This technique utilizes a gradient descent approach to minimize loss, as new models are iteratively added. eXtreme gradient boosting (XGBoost) is based on gradient boosted decision trees. XGBoost employs ensemble learning techniques, where multiple models are aggregated to increase the prediction accuracy. It utilizes DTs as the base learners, by adding one tree at a time to the model. To predict the target variable $y$, first, an initial model $F_0(x)$ is established. The model is associated with a residual $y - F_0(x)$. In the next phase, these residuals are fitted to a new model $h_1(x)$. The models $F_0(x)$ and $h_1(x)$ are combined to form an improved model $F_1(x)$. This process is repeated until the residuals are minimized (Theerthagiri and Ruby 2023).
$$F_m(x) = F_{m-1}(x) + h_m(x) \tag{1}$$
where $m$ is the number of iterations used to refine the model. 
Figure 13 shows the XGBoost architecture.
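The residual-fitting loop described above can be sketched in a few lines. This is a bare-bones illustration for squared-error loss, where the residuals coincide with the negative gradient; it is not XGBoost itself, which adds regularization, shrinkage, and second-order information. The base learner is a hypothetical one-split regression stump, and the data are made up.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree to targets; returns predict(x)."""
    def mean(values):
        return sum(values) / len(values)

    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        m_left, m_right = mean(left), mean(right)
        err = (sum((t - m_left) ** 2 for t in left)
               + sum((t - m_right) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, threshold, m_left, m_right)
    if best is None:  # all x identical: fall back to a constant prediction
        constant = mean(targets)
        return lambda x: constant
    _, threshold, m_left, m_right = best
    return lambda x: m_left if x <= threshold else m_right


def gradient_boost(xs, ys, n_rounds=10):
    """Start from a constant model, then repeatedly fit a stump to the
    current residuals and add it to the ensemble."""
    f0 = sum(ys) / len(ys)            # initial model: the mean of the target
    learners = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)  # new model fitted to the residuals
        learners.append(h)
        preds = [p + h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + sum(h(x) for h in learners)


# Hypothetical nonlinear data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
model = gradient_boost(xs, ys)
```

In practice each added tree is usually scaled by a learning rate smaller than one, which slows fitting but tends to generalize better; that factor is omitted here to keep the update identical to the additive form in the text.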
  3.4. Neural Networks
NNs are inspired by neural networks observed in the human brain. The goal is to produce a pattern based on input data. NNs have a multitude of nodes (neurons) and connections organized in a parallel manner. The main advantage of these algorithms is their capability to handle nonlinearity, and so they are widely used in forecasting tasks. Due to their proficiency in pattern recognition, NNs have become the predominant technique in the domain.
  3.5. Convolutional Neural Networks
CNNs, initially introduced by LeCun and Bengio (1998) for computer vision applications, mimic the human eye’s perception and learning processes across various tasks, such as image processing, natural language processing, face recognition, classification problems, and recommendation systems. CNNs are particularly effective at automatically extracting and learning features from one-dimensional sequence data, like univariate time series. They consist of several layers, including an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The convolutional layer’s function is to perform a convolution operation on the data, essentially filtering the input to assess its impact. The filter size determines its scope, and each filter uses a shared set of weights for the convolutional operation.
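The following sketch shows what a single convolutional filter and a pooling layer do to a one-dimensional series. It is an illustration only: the kernel weights are fixed by hand rather than learned, there is no activation function, and the helper names and sample series are hypothetical.

```python
def conv1d(series, kernel, bias=0.0):
    """'Valid' 1D convolution as used in deep learning libraries
    (technically cross-correlation): one shared kernel slides over the series."""
    k = len(kernel)
    if k == 0 or k > len(series):
        raise ValueError("kernel must be non-empty and no longer than the series")
    return [
        sum(w * series[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(series) - k + 1)
    ]


def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]


# Hypothetical series; the kernel [-1, 1] acts as a first-difference filter.
series = [1, 2, 4, 7, 11, 16]
features = conv1d(series, kernel=[-1, 1])
pooled = max_pool1d(features, size=2)
```

The same two weights are reused at every position, which is the weight sharing mentioned above: the filter detects the same local pattern wherever it occurs in the series, with far fewer parameters than a fully connected layer.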
  3.6. LSTM and Bi–LSTM
LSTM, a specialized variant of RNNs, was originally introduced by Hochreiter and Schmidhuber (1996). It effectively addresses mathematical challenges in modeling long sequence dependencies. The conventional fully connected RNN encounters the problem of vanishing gradients when modeling long time series data. To address this issue, the LSTM approach introduces a memory cell with a complex internal gate structure to replace the ordinary node in a hidden layer. This feature gives LSTM enhanced learning capabilities, facilitating the automatic extraction of features and integration of external variables. By overcoming the gradient vanishing problem inherent in RNNs, LSTM is particularly well equipped to handle long-term dependency issues.
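The gate structure of the memory cell can be illustrated with a single time step of a scalar LSTM cell. The sketch below uses the now-standard formulation with input, forget, and output gates; the hidden size of one, the parameter layout, and the weight values are assumptions made purely for readability and are not taken from the paper.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    i = sigmoid(pre("input"))    # how much of the candidate to write
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    g = tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g       # additive cell-state update
    h = o * tanh(c)              # new hidden state
    return h, c


# Hypothetical weights: forget gate nearly open, input gate nearly closed,
# so the cell state is carried forward almost unchanged.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

The additive update of the cell state is the key design choice: when the forget gate stays close to one, information and gradients can pass through many time steps without being repeatedly squashed, which is how the cell mitigates vanishing gradients.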
Bi–LSTM combines two independent RNNs, enabling the network to access both backward and forward information about the sequence at each time step. Unlike traditional LSTM, Bi–LSTM processes input data in two directions: one from past to future and the other from future to past. This bidirectional approach preserves information from both directions and merges two hidden states (Jahanshahi et al. 2022).
  3.7. CNN–LSTM
The CNN–LSTM model integrates CNNs and LSTM. The CNN component is effective in identifying and learning new features from time series data, while the LSTM component is adept at capturing long-term dependencies within sequences. This combined model is particularly well suited for temporal analysis. 
Aldabagh et al. (2023) employed this model to predict one-step- and multi-step-ahead crude oil prices.
  4. Empirical Results
In this section, we compare the effectiveness of different forecasting models, using various evaluation metrics.
  4.1. Evaluation Metrics
We use two performance metrics to identify the difference between the actual and predicted oil prices: root mean square error (RMSE) and mean absolute error (MAE). The first metric, RMSE, quantifies the difference between the actual and the predicted prices. If $y_1, \ldots, y_n$ are the actual prices and $\hat{y}_1, \ldots, \hat{y}_n$ are the corresponding predicted prices, then the RMSE is calculated using Equation (2).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{2}$$
The RMSE is a widely used metric for measuring the performance of models predicting commodity prices. It is a criterion that assigns a high weight to large absolute errors.
The second metric, MAE, measures the average magnitude of the errors between predicted and actual values and is used to compare the results of different models. It is calculated as the sum of absolute errors divided by the sample size using Equation (3).
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{3}$$
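Both metrics are straightforward to compute. The sketch below follows the standard definitions of RMSE and MAE; the function names and the sample prices are illustrative and are not results from this paper.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: sum of absolute differences over the sample size."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices; the last prediction misses by a wide margin.
actual = [70.0, 72.0, 75.0, 80.0]
predicted = [71.0, 72.0, 73.0, 84.0]
errors = {"RMSE": rmse(actual, predicted), "MAE": mae(actual, predicted)}
```

Because the errors are squared before averaging, the single large miss dominates the RMSE, whereas the MAE weights all errors linearly. The RMSE is therefore never smaller than the MAE, and a wide gap between the two points to a few large errors rather than many small ones.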
        
  4.2. Dataset
We downloaded recent datasets from the EIA website, which include world GDP change, world consumption change, non-OECD consumption growth, OECD consumption change, non-OECD GDP growth, non-OPEC production change, OPEC spare capacity, Saudi production change, capacity percent change, WTI production change, and crude oil prices. We also downloaded the dollar index from the MarketWatch website (2002–2023). Each dataset was split into two parts, with 70% for training and 30% for testing. 
After that, we followed two preprocessing steps to ensure better data interpretation and manipulation. The two steps are data interpolation and price and dollar index change. Using interpolation, we expanded the data from quarterly to monthly data. This provided us with 264 monthly samples from 88 quarterly samples. The second step calculates the WTI price change and the dollar index change using Equations (4) and (5).
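The two preprocessing steps and the train/test split can be sketched as below. Several details are assumptions, since the text does not specify them: interpolation is taken to be linear with the final quarter held constant (which yields exactly three monthly values per quarter), the "change" is taken to be a percentage change from the previous month, and the 70/30 split is taken to be chronological rather than random. All function names are hypothetical.

```python
def quarterly_to_monthly(quarterly):
    """Linearly interpolate each quarter into three monthly values.
    The final quarter has no successor, so its value is held constant."""
    monthly = []
    for k, value in enumerate(quarterly):
        nxt = quarterly[k + 1] if k + 1 < len(quarterly) else value
        for step in range(3):
            monthly.append(value + (nxt - value) * step / 3)
    return monthly


def pct_change(series):
    """Percentage change from the previous sample (assumes no zero values);
    the result is one element shorter than the input."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(series, series[1:])]


def chronological_split(samples, train_fraction=0.7):
    """Keep the earliest samples for training and the rest for testing."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# 88 hypothetical quarterly prices expand to 264 monthly samples.
quarterly_prices = [50.0 + q for q in range(88)]
monthly_prices = quarterly_to_monthly(quarterly_prices)
monthly_change = pct_change(monthly_prices)
train, test = chronological_split(monthly_prices)
```

A chronological split avoids training on observations that come after the test period, which matters for time series. Note also that interpolated monthly values are smooth by construction, so they carry less independent information than genuinely monthly observations would.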
        
  4.3. Experimental Results
To compare different machine learning algorithms, we included the DTs, RFs, GB, NNs, CNN, LSTM, Bi–LSTM, and CNN–LSTM models in our experiments. The models’ output is the WTI crude oil price. The following factors, which have been explained in Section 2, are input to the models:
- World consumption change: this variable represents the change in the world’s crude oil consumption; 
- World GDP change: this variable represents the change in the world’s GDP; 
- Non-OECD consumption growth: this variable represents the crude oil consumption growth of the non-OECD countries; 
- Non-OECD GDP growth: this variable represents the growth of the GDP of the non-OECD countries; 
- OECD consumption change: this variable represents the crude oil consumption change in the OECD countries; 
- Non-OPEC production change: this variable represents the change in the non-OPEC countries’ crude oil production; 
- Saudi production change: this variable represents the change in Saudi Arabia’s crude oil production; 
- OPEC spare capacity: this variable represents the OPEC spare capacity. It is described as the volume of production that can be initiated within 30 days and maintained for at least 90 days. Saudi Arabia, the largest oil producer within OPEC and the world’s leading oil exporter, has historically maintained the highest spare capacity. Typically, Saudi Arabia keeps over 1.5 to 2 million barrels per day of spare capacity available for market management purposes; 
- Capacity percent change: this variable represents the change in world liquid fuels production. Although natural gas liquids (NGLs) provide significant additional volume to world liquids supply, they are not included in OPEC production allocations; 
- WTI production change: this variable represents the change in the WTI crude oil production; 
- Dollar index: this variable represents the dollar index values. 
We conducted our experiments with three different settings. The first setting excludes the dollar index and the second data preprocessing step, namely the calculation of the price change. The second setting includes the dollar index in the experiments. The third setting includes the dollar index and the second preprocessing step, by calculating the WTI crude oil price and dollar index changes using Equations (4) and (5). Table 2 and Table 3 summarize the results of running the eight models on the three datasets.
From the tables, we can conclude that the best results are obtained using the hybrid CNN–LSTM model. We can observe that CNN–LSTM outperforms the other models for the three settings. The RMSE and MAE of the CNN–LSTM model are the lowest among the eight models. Also, we can notice that the preprocessing step and the dollar index factor improve the results, yielding 3.93 as RMSE and 1.33 as MAE.
It is worth noting that the performance of the deep learning models CNN, LSTM, and Bi–LSTM is close. The RMSE and MAE of the three models are within a close range. Although Bi–LSTM has been adopted in many crude oil price prediction models, as we can see here, it does not offer a significant advantage on our datasets. The RMSE and MAE of RF, GB, and DT also fall within a close range. The simple neural networks have the worst performance, especially when the data are not preprocessed.
Figure 14, Figure 15, and Figure 16 depict the actual oil prices versus the predicted ones for the three datasets, respectively, using the CNN–LSTM model. 
Figure 14 shows good accuracy except around the 50th quarter. 
Figure 15 indicates a better accuracy. 
Figure 16 reveals accurate results except around the 30th quarter.
   4.4. Factors’ Importance
We conducted further analysis to show the relative importance of different factors. We observed that when we integrated the dollar index in the experiments, it affected the prediction results. Each factor included in our experiments contributed a certain percentage to the obtained results. The factors used in our experiments include OPEC and non-OPEC supply, OECD and non-OECD consumption, world GDP, non-OECD GDP, OPEC spare capacity, world liquid fuels production, and the WTI and Saudi oil production factors. The factors that have the largest impact on the prediction results are the following:
- Non-OECD consumption represents the crude oil consumption of the non-OECD countries. This is justified by the marked surge in oil consumption within developing countries outside OECD. China, India, and Saudi Arabia represent the highest growth in oil consumption among non-OECD nations; 
- OECD consumption represents the amount of consumption of the OECD countries. Although the non-OECD demand is higher than the OECD demand, the OECD consumption had a significant role in deciding the price of WTI crude oil. This is due to the fact that OECD countries comprise the United States, much of Europe, and other industrialized countries, which contribute to the determination of WTI crude oil prices. This can be observed when the crude oil prices decreased significantly during the second quarter of COVID-19, due to a modest economic demand of OECD countries for crude oil, followed by a surge in crude oil OECD demand and a jump in crude oil prices in 2021. This observation is revealed in Figure 7.
  5. Conclusions
This paper studies multiple factors affecting crude oil prices. The factors include the demand of the OECD and non-OECD countries and the supply of the OPEC and non-OPEC countries. We also included the dollar index in our work. The models compared include DTs, RFs, GB, NNs, CNNs, LSTM, Bi–LSTM, and CNN–LSTM. Results show that the dollar index and the preprocessing steps, including interpolation and calculating the changes of the crude oil prices and the dollar indices, improve the performance of the models. Also, the hybrid model, i.e., CNN–LSTM, exhibits better performance than other models. In addition, non-OECD consumption and OECD consumption growth have an impact on the prediction of crude oil prices. In recent years, we noticed a sharp rise in oil consumption in developing countries and a decline in oil consumption in the OECD countries. This clearly affects crude oil prices.
This study advances the field of crude oil price forecasting by introducing a hybrid CNN–LSTM model that incorporates the dollar index, an approach not extensively explored in previous studies. Our analysis demonstrates that including the dollar index significantly enhances the predictive accuracy of oil price fluctuations, providing a more nuanced understanding of the economic factors impacting global markets.
This innovative approach offers substantial contributions to risk management practices by enabling more precise and reliable oil price forecasts. For stakeholders in sectors sensitive to oil price volatility, such as energy, transportation, and manufacturing, our model delivers a valuable tool for better anticipating market trends and adjusting risk mitigation strategies accordingly. Furthermore, the integration of macroeconomic indicators like the dollar index helps in identifying and interpreting the broader economic signals that drive market changes, offering stakeholders a comprehensive view of potential financial exposures.
The robustness and innovation of our methodology not only improve forecasting accuracy but also invite further research into the application of similar models across various commodities and financial markets, broadening the implications for risk management across different sectors.
This paper makes two contributions. First, our model considers multiple factors, including OPEC and non-OPEC supply, OECD and non-OECD consumption, world GDP, non-OECD GDP, OPEC spare capacity, world liquid fuels production, and WTI and Saudi oil production. Second, the model considers the dollar index, which improves the results. In the future, we plan to study other factors, including Brent, Mars, Tapis, and Dubai oil prices, that may affect WTI crude oil prices. We also plan to study other optimization algorithms that may increase the prediction accuracy. In addition, the information from online media sources could be integrated into the model. For example, feature extraction natural language processing techniques, including bidirectional encoder representations from transformers (BERT) language models, could be used to process tweet hashtags.