A Model Selection Approach for Time Series Forecasting: Incorporating Google Trends Data in Australian Macro Indicators

This study examined whether the behaviour of Internet search users obtained from Google Trends contributes to the forecasting of two Australian macroeconomic indicators: monthly unemployment rate and monthly number of short-term visitors. We assessed the performance of traditional time series linear regression (SARIMA) against a widely used machine learning technique (support vector regression) and a deep learning technique (convolutional neural network) in forecasting both indicators across different data settings. Our study focused on the out-of-sample forecasting performance of the SARIMA, SVR, and CNN models in forecasting the two Australian indicators. We adopted a multi-step approach to compare the performance of the models built over different forecasting horizons and assessed the impact of incorporating Google Trends data in the modelling process. Our approach supports a data-driven framework, which reduces the number of features prior to selecting the best-performing model. The experiments showed that incorporating Internet search data in the forecasting models improved the forecasting accuracy and that the results were dependent on the forecasting horizon, as well as the technique. To the best of our knowledge, this study is the first to assess the usefulness of Google search data in the context of these two economic variables. An extensive comparison of the performance of traditional and machine learning techniques on different data settings was conducted to enable the selection of an efficient model, including the forecasting technique, horizon, and modelling features.


Introduction
Forecasting the trends of economic indicators is crucial to policy makers and investors to make informed decisions. However, the official release of the indicators suffers from an information time lag because of the time and effort needed to collect the required data. To address this issue, researchers have aimed to nowcast and forecast the economic indicators.
The unemployment rate is one of the key indicators due to its direct connection to the economic cycle and its influence on decision-makers. Several researchers have attempted to improve the forecasting for the unemployment rate for various developed and developing countries. While some authors have applied different machine learning techniques to forecast unemployment [1,2], others have focused on incorporating additional data, in particular online search data, to improve the forecasting accuracy. Ettredge et al. [3] were the first to address such issues and investigated the link between online job searches and the official rates of unemployment in the United States. Additionally, Choi and Varian [4,5] put forward this line of research by describing and illustrating how Internet search data could be used to improve the predictions of several economic indicators such as unemployment claims, retail sales, property demand, and holiday destinations popularity. These two papers have stimulated much recent research in this field. Since none of the researchers investigated the relation between online search data and the unemployment rate in Australia, we chose to assess whether the search behaviour of Australian Internet users can improve the performance of the Australian monthly unemployment-rate-forecasting models.
In addition to the unemployment rate, we selected another indicator for our experiments, the number of short-term travellers visiting Australia. Because Australia is a destination for millions of tourists, its tourism industry is directly linked to its economic wellbeing. Forecasting the number of incoming travellers can assist investors in making their investment decisions and government agencies in allocating their resources to accommodate these travellers. Researchers have used online search data for different applications within the tourism industry. While some have focused on forecasting the hotel demand for particular cities or countries [6][7][8], others such as Feng et al. [9] and Gawlik et al. [10] have assessed the effectiveness of search data in forecasting the number of tourists rather than hotel demand.
The selection of the two indicators analysed in this study, which are released monthly by the Australian Bureau of Statistics, was based on their ability to reflect the behaviour of Google users across different geographical locations. While Google Trends data collected within Australia were used to forecast the monthly unemployment rate, we employed globally searched keywords via the Google engine to forecast the number of travellers visiting Australia. This approach enabled us to assess the applicability of Google Trends data for two distinct settings and evaluate the forecasting horizon associated with the behaviours of both local and international users. Furthermore, we present a novel forecasting framework that selects the optimally performing model from two families of techniques suitable for forecasting time series data, namely traditional linear techniques (SARIMA and SARIMAX) and machine learning techniques (SVR). The framework also incorporates feature selection techniques, which play a crucial role in the forecasting process. It should be noted that prior literature had not extensively explored this aspect to the extent that is presented in this paper.
In our paper, we examined the predictive power of Google Trends data using support vector regression (SVR) and convolutional neural networks (CNNs) against the traditional linear regression techniques such as SARIMA and SARIMAX in forecasting the two selected time series indicators. The paper is organised as follows. Section 2 presents an overview of the literature on using Google Trends data for economic indicators. Section 3 presents our contribution of applying a data-driven approach to forecast both indicators through a description of the methodology covering the data collection, feature engineering and selection techniques, as well as the forecasting models used in our paper, alongside the evaluation metrics. Section 4 describes the experimental setup, in particular a description of each set of experiments and the datasets associated with them. Section 5 evaluates empirically the forecasting performance of our models and provides an indication as to why a data-driven approach to forecasting is necessary. Section 6 contains a discussion on the suitability of using alternative data and non-traditional techniques for forecasting.

Literature
Over the last few years, several attempts have been made to explore the potential benefits of using Internet search data in forecasting economic variables [4]. In this section, we present some of the recent studies that have incorporated Internet search data in unemployment and tourism demand forecasting. To the best of our knowledge, our paper is the first to assess the usefulness of Google search data in the context of these economic variables in Australia and the first to compare the performance of traditional and machine learning techniques on different data settings.

Unemployment Forecasting
Forecasting unemployment has become an area of interest for researchers. There are two areas of focus to improve its accuracy: incorporating additional data sources (mainly Internet search data) and using non-traditional techniques.
These studies did not establish whether Internet data can replace or complement traditional methods. Some authors obtained better results when combining both data in their model [16]. Most researchers have suggested using multiple keywords to improve the prediction accuracy of the forecasting models. In this paper, we compared the models using combined data, as well as search data on their own to address this limitation.
Another body of research has focused on using alternative techniques to forecast unemployment. Researchers have compared several machine learning techniques such as artificial neural networks (ANNs) [2,30,31] and SVR [2], as well as hybrid approaches [1], and found that these techniques yielded better results than ARIMA models.
Considering that researchers have not tested the impact of search data when forecasting unemployment using traditional and machine learning techniques, we were interested in assessing the efficacy of Google Trends in Australia where Google search is widely used. For this forecasting purpose, we employed SARIMA, SVR, and CNNs on an expanded list of search keywords that is related to Australia.

Tourism Forecasting
The real-time characteristics of Internet search data have motivated researchers to examine their predictive power in the tourism and hospitality industry. The scope of past research varied from forecasting hotel demand to the number of visitors to cities and countries.
Several research works have successfully employed Internet data to forecast the demand for hotel rooms and flights for different forecasting horizons [6][7][8][32]. Similar to unemployment forecasting, tourism research has been extended to predict the volume of visitors to cities [33,34] and countries [5,9,10,25,26]. These studies reported higher accuracy when incorporating search data. Only a limited number of these works focused on the volume of incoming visitors regardless of their point of departure, and they did not evaluate the performance of other techniques.
While fewer studies have forecasted the number of visitors on a macro level, there has not been an assessment of the benefit of using search data with historical visitors' data using machine learning techniques. In our paper, we used the same approach applied to the unemployment data to evaluate the SARIMA and SVR results in forecasting the number of short-term visitors coming to Australia. A similar comparison was performed recently by Botta et al. [35], but instead of using SVR, they deployed an ANN to predict the number of visitors to a local museum. We also applied the search keywords used by Feng et al. [9] and tailored them to the Australian context, since they had proven successful in forecasting the number of visitors and they covered different aspects of tourism (food, airline, shopping).

Data Collection
The initial stage of our research involved data collection. We utilised two main sources of data: Australian economic indicators and Google Trends data. For the economic indicators, we extracted the historical data of two key indicators for the Australian economy from the Australian Bureau of Statistics' website: monthly unemployment rate and monthly number of short-term visitors arriving in the country. These indicators represent essential aspects of the economy, and forecasting their future values would offer significant insights for policymakers and economic stakeholders. These figures are often calculated by conducting surveys and collecting data from different agencies, which delays the publication of the most recent numbers. Monthly unemployment rate data are available from February 1978, while the visitor data cover the period starting in January 1991.
Australia has a stable economy. The unemployment rate has not surpassed the 10% mark since 1994. Since then, the Australian unemployment rate has fluctuated between 4% and 6%. As seen in Figure 1, there were two spikes in unemployment in the last two decades: one after the global financial crisis (GFC) and one during COVID-19.

Australian unemployment data are seasonal in nature, with the same pattern repeated each year. For example, there has always been an increase in the unemployment rate after December, and this is expected to continue in the future. Since we used the SARIMA model, the parameter m that indicates the length of the seasonal cycle was set to 12. Figure 2 shows the number of short-term visitors coming to Australia. Australia is becoming a more popular destination over time. The seasonality in the data is visible through the repeated trends.
A closer inspection of Figure 2 shows that the same patterns recur in the same months every year, e.g., an increased number of visitors around Christmas time and during summer. The large drop in the number of visitors on the right-hand side of the chart is due to COVID-19, when Australia had travel restrictions in place.
In parallel, we collected Google Trends data related to the aforementioned economic indicators. Google Trends data, which have been offered by Google since 2014, provide the search frequency of keywords, that is, the ratio of the search volume of a given keyword to the total search volume of all keywords in a given period of time. This frequency is then normalised to the interval [0, 100], so that the series is not distorted by changes in keyword search volume that are due solely to an increase in the number of users. Google Trends represents a rich source of insights about public interest in various topics over time. By selecting search terms related to the economic indicators, we could gauge public interest in these topics and examine the potential predictive power this interest holds for future economic conditions.
In this paper, we searched for keywords related to each of the two target indicators. For unemployment, the process of selecting search keywords began by considering what Internet users would search for if they became or were about to become unemployed [4]. It seemed sensible to focus our searches on two areas: available benefits to the unemployed and particular websites and keywords that unemployed people may use (e.g., job advertisement website, "job and education" topic search). Table 1 shows the data extracted from the Google Trends service. Centrelink is an Australian government service that offers several benefits including unemployment benefits. Additionally, we incorporated an indicator for "job" to accommodate job-related searches that are general in nature and difficult to capture using more specific terms. Seek and Indeed are popular job advertisement websites, which are mainly used to look for job vacancies, so they are also included. Additionally, we added the trend data of searches for the word "unemployment". All the data extracted using Google Trends were restricted to searches within Australia. Figure 3 shows the popularity of four of the Google indicators extracted to forecast the unemployment rate.
There were some limitations associated with selecting search keywords relevant to unemployment. Centrelink offers several services other than unemployment benefits; therefore, a change in its trend does not necessarily reflect the changes in demand for those benefits. Additionally, job vacancies in certain industries, such as construction, might not be posted on the "Seek" website, so an increase in unemployment in the construction industry might not lead to an increase in access to this popular job search website. Furthermore, there are other popular platforms such as LinkedIn that can be accessed via a mobile application or directly through the website to look for job vacancies.
Given the limitation of using Google Trends data, we intended to use the extracted time series data as a proxy for changes in the labour market, rather than an accurate reflection of changes in the Australian unemployment rate.
The Google indicators selected for forecasting the number of short-term visitors are shown in Table 2 alongside the reference names used in our code. These indicators are similar to those used by Feng et al. [9], and they cover different areas of what travellers might need to get to their destination and to facilitate their visit. Terms such as "Australian weather" and "Australian climate" indicate the interest of search engine users in knowing what to wear when visiting Australia. The terms "Australia airline", "Qantas", and "Australian map" indicate an interest in how to get to and navigate Australia; Qantas is the flagship carrier of Australia and its largest airline by fleet size, international flights, and international destinations. The extracted search data cover worldwide searches, in contrast to the data used to forecast unemployment, which were restricted to Australia.
One of the limitations of using keywords searched worldwide is that they include the searches of users within Australia. Searches for those keywords by Australian residents do not contribute to the number of tourists visiting Australia. For this exercise, we assumed that the searches for these terms within Australia did not create any noise, as there were no noticeable changes in the search trends. Additionally, the search results were limited to the Google engine and did not include the usage of people residing in China due to the restriction on using Google in China. Chinese nationals constitute a large proportion of the tourists visiting Australia.

Feature Engineering
After the data collection, we proceeded to the feature-engineering phase. The goal was to transform the collected data into a format that could be more effectively utilised by our predictive models. This involved creating new variables based on our raw data that better represent the underlying trend patterns for the predictive models. This process was applied in our study to increase the predictive performance of our models.
For our dataset from the Australian Bureau of Statistics (ABS) and Google Trends, the original data were augmented by creating time-based features. These features were designed to capture the dynamic behaviour and trends in the data over time. These included lagged values of the indicators themselves and derived statistics such as moving averages.
We created lag features for each dataset, specifically for the 12 previous months. The assumption here was that the current month's value of a given economic indicator (such as the unemployment rate or visitor arrivals) or Google Trends value would have some correlation with its past values. For instance, if the unemployment rate was high last month, it could likely be high in the current month as well, barring any substantial changes in the economic environment.
Lagged features were derived by shifting the time series data by one period (month) to create a new feature (Lag-1), by two periods to create another feature (Lag-2), and so on, up to twelve periods (Lag-12). This was carried out because it is plausible that both the dependent variables and the Google Trends indicators have a monthly seasonality that lasts up to a year, and we wanted our models to capture this potential seasonal effect.
We also created moving average features, which represent the mean of the data points over a specified period. These were calculated for the last 3, 4, ..., 12 months. The rationale for creating these features is that, while individual data points (such as a spike in search interest or a dip in unemployment) can be quite volatile, the average value over a certain period can provide a smoother representation of the underlying trend in the data.
$$X_{\mathrm{average}(n)} = \frac{X_{\mathrm{lag}(1)} + X_{\mathrm{lag}(2)} + \dots + X_{\mathrm{lag}(n)}}{n}$$
In addition to the lag and moving average features, a "month" feature was created to capture any potential seasonal effects. This feature represents the month of the year (a number between 1 and 12) at each data point. This is particularly important for data such as tourism, which can show substantial variation depending on the time of year. Table 3 shows a list of all the features created.
The high-volatility components associated with time series data are often very difficult to model successfully; hence, a scaling and/or transformation process is usually performed on the series prior to implementing the actual experiments [36].
Since we wished to correctly predict the direction of movement of the number of short-term visitors, we applied a transformation to the data series, which can result in better performance [37]. A natural logarithm transformation was applied to the data series prior to fitting the SARIMA(X) and SVR models.
To achieve a logarithmic transformation of our short-term visitors' data, the following equation was applied:
$$y_t = \ln(p_t)$$
where $y_t$ is the transformed number of visitors and $p_t$ is the original value.
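The lag, moving average, and month features described in this section, together with the logarithmic transformation, could be sketched as follows. This is a minimal illustration using pandas, not the code used in the study: the function names `build_features` and `log_transform` and the column names (`lag_k`, `avg_n`, `month`) are hypothetical, and the series is assumed to be monthly with a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd


def log_transform(series: pd.Series) -> pd.Series:
    """Natural-log transform y_t = ln(p_t); invert with np.exp."""
    return np.log(series)


def build_features(series: pd.Series, max_lag: int = 12) -> pd.DataFrame:
    """Create lag, moving-average, and month features from a monthly series."""
    feats = pd.DataFrame(index=series.index)

    # Lag-1 ... Lag-12: the series shifted back by k months.
    for k in range(1, max_lag + 1):
        feats[f"lag_{k}"] = series.shift(k)

    # Moving averages over the last 3 ... 12 months, i.e. the mean of
    # Lag-1 ... Lag-n, so the current month's value is never used.
    for n in range(3, max_lag + 1):
        feats[f"avg_{n}"] = series.shift(1).rolling(n).mean()

    # Month of the year (1-12) to capture seasonal effects.
    feats["month"] = np.asarray(series.index.month)

    # Drop the initial rows that lack a full 12-month history.
    return feats.dropna()
```

Computing the moving averages on the shifted series keeps every feature strictly backward-looking, which matters when the features are later used for out-of-sample forecasting.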

Feature Selection
In recent years, many feature-selection methods have been proposed. These methods can be categorised into three groups [38]: filter, wrapper, and embedded methods.
Filter methods calculate the score of each feature and rank them accordingly without dependency on the model. They are simple to implement, easy to interpret, and work effectively with high-dimensional data. Filter methods are fast strategies that provide good results in classification tasks [39][40][41]. An extensive overview of existing filter methods was presented by Lazar et al. [42].
After engineering a wide range of features from the target variables and Google Trends indicators, we applied different feature-selection methods that incorporated recursive feature elimination (RFE) with mutual information (MI) and the f_test. These methods provided us with a robust and diverse perspective on feature importance. For the exogenous variables derived from Google Trends, we used the Pearson correlation to determine the most-relevant variables, which were used to train the SARIMAX model.
The wrapper method, RFE, uses a machine learning algorithm (in our case, a DecisionTreeRegressor) to rank features by importance and recursively eliminates the least important features. This method can capture interactions between features since it uses a machine learning model for ranking.
The filter methods, the f_test and mutual information, rank features based on their individual predictive power. The f_test checks the correlation between each feature and the target variable, while mutual information measures the dependency between the feature and the target. A higher mutual information means a higher dependency. By using these methods together, we obtained the benefits of both: the power of a machine learning model to capture complex relationships and the speed and simplicity of univariate statistics.
The filter feature selection approach used for the SVR and CNN models is shown in Figure 4 and described in the snippet below.

1. Create a training dataset.
2. Perform RFE using a decision tree as an estimator.
3. Select the top 50% of the features from RFE.
4. Compute the mutual information value (MIV) and f_test for the remaining features.
5. Filter out the features based on the f_test and MIV. Select the top 10% of features based on the f_test and the top 25% based on the MIV.
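The steps above can be sketched with scikit-learn as follows. This is an illustrative sketch rather than the authors' implementation: the function name `select_features` is hypothetical, and taking the union of the top 10% (f_test) and top 25% (MIV) features is an assumption, since the text does not state how the two rankings are combined.

```python
import numpy as np
from sklearn.feature_selection import RFE, f_regression, mutual_info_regression
from sklearn.tree import DecisionTreeRegressor


def select_features(X, y, names, random_state=0):
    """Return the names of the features kept by RFE followed by the f_test/MI filter."""
    # Steps 2-3: RFE with a decision tree, keeping the top 50% of the features.
    rfe = RFE(DecisionTreeRegressor(random_state=random_state),
              n_features_to_select=0.5)
    rfe.fit(X, y)
    kept = np.flatnonzero(rfe.support_)

    # Step 4: univariate scores for the features that survived RFE.
    f_scores, _ = f_regression(X[:, kept], y)
    mi_scores = mutual_info_regression(X[:, kept], y, random_state=random_state)

    def top_fraction(scores, fraction):
        # Positions of the highest-scoring features (at least one).
        k = max(1, int(np.ceil(fraction * len(scores))))
        return set(np.argsort(scores)[::-1][:k])

    # Step 5: top 10% by f_test and top 25% by mutual information.
    # ASSUMPTION: the two sets are combined by union.
    chosen = top_fraction(f_scores, 0.10) | top_fraction(mi_scores, 0.25)
    return [names[kept[i]] for i in sorted(chosen)]
```

Here `X` is a NumPy array of training features only (step 1), so that no information from the test period influences which features are retained.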
The approach of using the Pearson correlation as a feature-selection method for our SARIMAX model is a straightforward, yet effective one given that SARIMAX is not capable of modelling non-linear relationships.
The Pearson correlation coefficient measures the linear relationship between two datasets. It ranges from −1 to 1. A correlation of −1 indicates a perfect negative linear relationship; a correlation of 1 indicates a perfect positive linear relationship; a correlation of 0 indicates no linear relationship.
In our experiment, we selected only those exogenous variables whose correlation with the dependent variable exceeded 0.4 in absolute value (either positive or negative), which we considered to indicate a moderate to strong linear relationship. This could help reduce the dimensionality of our data and might improve the interpretability and performance of our models.
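A minimal sketch of this correlation filter, assuming the candidate exogenous variables are held in a pandas DataFrame aligned with the target series, might look as follows. The function name `select_exogenous` is hypothetical; only the 0.4 threshold comes from the text.

```python
import pandas as pd


def select_exogenous(exog: pd.DataFrame, target: pd.Series, threshold: float = 0.4):
    """Keep the columns whose absolute Pearson correlation with the target exceeds the threshold."""
    # corrwith computes the Pearson correlation of each column with the target.
    corr = exog.corrwith(target)
    return corr[corr.abs() > threshold].index.tolist()
```

Filtering on the absolute value keeps strongly negative as well as strongly positive predictors, in line with the "either positive or negative" criterion.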
In summary, we chose a combination of feature selection and reduction techniques in our experiments to highlight the importance of incorporating such techniques in the modelling process to improve the accuracy of the models. A comparison of different techniques is out of scope for this paper. However, the selected techniques can detect different relationships between the created features and the target variable.

Forecasting Techniques
The seasonal auto-regressive integrated moving average (SARIMA) is an extension of the ARIMA model. ARIMA models are a subset of linear regression models that attempt to use the past observations of the target variable to forecast future values. The "S" in SARIMA stands for seasonal. It adjusts the model to deal with repeated trends. Seasonal data can be easily identified by looking at repetitive spikes over the same period of time. Those spikes are consistently cyclical and easily predictable, which suggests that we should look past the cyclicality to adjust for it.
Since SARIMA can only use the past values of the target series, SARIMAX is used to incorporate exogenous variables. When using SARIMAX, the input data include parallel time series variables that enter the model as weighted inputs.
To find the optimal SARIMA and SARIMAX models, a grid search to determine the value of the parameters for the best model was performed. The best model found will have the lowest Akaike's information criterion (AIC) and Bayesian information criterion (BIC).
SARIMA and SARIMAX were used as the baseline models for the time series forecasting of the two Australian indicators of interest: monthly unemployment rate and monthly number of short-term visitors.
Since the SARIMAX model can only detect linear relationships between the target variable and the past values of the input data, we employed SVR and CNNs to check whether there were non-linear relationships between the input features and the target variable and, therefore, whether the forecasting performance could be improved over that of SARIMAX. SVR, introduced by Drucker et al. [43], is a variant of support vector machines (SVMs), originally introduced by Vapnik [44]. The model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction. A detailed analysis and description of SVR can be found in Basak et al. [45], Sapankevych and Sankar [46], and Smola and Schölkopf [47], and an application to the prediction of unemployment in Stasinakis et al. [48]. SVR has been used widely for time series prediction [46] in many application areas, such as financial forecasting [49].
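The role of the threshold ε can be illustrated with the `SVR` class from scikit-learn. The sketch below is an assumption-laden example rather than the configuration used in the study: the RBF kernel, the value of `C`, and the helper name `fit_svr` are all illustrative choices.

```python
import numpy as np
from sklearn.svm import SVR


def fit_svr(X, y, epsilon=0.1, C=1.0):
    """Fit an RBF-kernel SVR; points within epsilon of the prediction incur no loss."""
    model = SVR(kernel="rbf", C=C, epsilon=epsilon)
    model.fit(X, y)
    return model


# Toy data: one noisy-free sine cycle sampled at 50 points.
X_demo = np.linspace(0.0, 2.0 * np.pi, 50).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel()

wide = fit_svr(X_demo, y_demo, epsilon=0.5)     # wide tube around the fit
narrow = fit_svr(X_demo, y_demo, epsilon=0.01)  # narrow tube around the fit

# The fitted model is defined only by its support vectors, i.e. the
# training points that lie on or outside the epsilon tube.
n_wide, n_narrow = wide.support_.size, narrow.support_.size
```

Comparing `n_wide` and `n_narrow` shows how the tube width controls how many training points end up defining the model.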
Convolutional neural networks (CNNs) were introduced by Yann LeCun, Yoshua Bengio, and others in the 1990s [50]. Initially, CNNs were primarily developed and used for computer vision tasks such as image classification. However, CNNs have also been adapted and applied to other domains, including time series analysis and regression tasks. While CNNs were originally designed for image-based data, their ability to learn hierarchical patterns and capture local dependencies in data makes them suitable for analysing time series data as well. In time series analysis, CNNs can be used as regression techniques by applying them to the input data and predicting the target variable. By leveraging the convolutional layers and pooling operations, CNNs can automatically learn and extract relevant features from the time series data, making them powerful tools for time series forecasting and regression tasks.
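To illustrate how a convolutional layer captures local dependencies in a series, the sketch below slides a fixed kernel over a sequence and then applies max pooling. In a trained CNN, the kernel weights are learned rather than fixed; the kernel `[-1, 1]` (a first-difference detector) and the function names are assumptions chosen for illustration only.

```python
def conv1d(series, kernel):
    """Slide the kernel over the series without padding.

    As in most deep learning libraries, the kernel is not flipped,
    so this is technically a cross-correlation.
    """
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# A kernel of [-1, 1] responds to the month-to-month change in the series.
features = conv1d([1, 2, 4, 7, 11], [-1, 1])   # [1, 2, 3, 4]
pooled = max_pool1d(features, size=2)          # [2, 4]
```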
To carry out non-linear regression using SVR and CNN, it is necessary to create a higher-dimensional feature space from the time series data, as discussed in Section 3.2.
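One common way to build such a feature space is to turn the series into a table of lagged values, where each row holds the most recent observations and the label is the value to be forecast. The sketch below assumes this sliding-window construction; the function name, the number of lags, and the handling of the horizon are illustrative assumptions and may differ from the procedure in Section 3.2.

```python
def make_lagged_features(series, n_lags, horizon=1):
    """Build (X, y) pairs from a univariate series.

    Each row of X holds n_lags consecutive observations; the matching
    entry of y is the observation `horizon` steps after the last lag.
    """
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + horizon - 1])
    return X, y
```

Exogenous series, such as search-interest indices, could be appended to each row in the same way, which is what makes feature selection relevant once the number of columns grows.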

Model Evaluation
In order to evaluate the performance of the SARIMA, SVR, and CNN models on out-of-sample data, we used two different metrics: mean-squared error (MSE) and symmetric mean absolute percentage error (SMAPE) [51]. These two metrics proxy the accuracy of the model since they directly measure the difference between the actual and predicted values. The objective of our experiments was to improve the accuracy of the models; therefore, they seemed appropriate to evaluate the results.
The MSE is a metric corresponding to the expected value of the squared error or loss. If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the MSE estimated over $n$ (number of samples) is defined as:

$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

The SMAPE is an accuracy measure based on percentage (or relative) errors, defined as follows:

$$\mathrm{SMAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{\left| F_t - A_t \right|}{\left( \left| A_t \right| + \left| F_t \right| \right) / 2}$$

where $A_t$ is the actual value and $F_t$ is the forecast value. The absolute difference between $A_t$ and $F_t$ is divided by half the sum of the absolute values of the actual value $A_t$ and the forecast value $F_t$. The value of this calculation is summed for every fit point $t$ and divided again by the number of fit points $n$. A perfect SMAPE score is 0.0, and a higher score indicates a higher error rate.
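The two metrics described above can be computed directly from their definitions, as in the following minimal sketch. The function names are illustrative, and the treatment of a point where both the actual and forecast values are zero (counted as zero error) is an assumption, since the definition leaves that case undefined.

```python
def mse(actual, predicted):
    """Mean-squared error over paired actual and predicted values."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, as a fraction (0.0 is perfect)."""
    n = len(actual)
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        if denom == 0:
            continue  # assumption: a 0/0 term contributes zero error
        total += abs(f - a) / denom
    return total / n
```

For actual values `[100, 200]` and forecasts `[110, 180]`, the MSE is 250.0, and the SMAPE is the mean of 10/105 and 20/190.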
Further statistical significance testing was applied to evaluate the performance of the different techniques and to determine whether there were significant differences among them. One approach is to apply the analysis of variance (ANOVA) to the predicted values generated by multiple models (SARIMA, SARIMAX, SVR, and CNN). ANOVA assesses the variation between the predicted values of different models and compares it to the variation within each model's predictions. The goal was to determine whether there were statistically significant differences in the performance of the models.
After performing ANOVA, if significant differences are detected, further analysis can be conducted using post hoc tests to identify specific pairs of models that significantly differ from each other. One commonly used post hoc test is Tukey's honestly significant difference (HSD) test. The Tukey HSD compares all possible pairs of models and determines if the differences in their predicted values are statistically significant.
The statistical significance approach helps with comparing and ranking the models based on their performance and identifying the models that significantly outperform or underperform others. It provides a quantitative and objective measure to assess the statistical differences between the techniques, allowing for informed decision-making in selecting the most appropriate model for time series forecasting tasks.
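The ANOVA step described above compares between-group and within-group variation through an F statistic. The sketch below computes that statistic for a list of groups, where each group would hold the values associated with one model. It is a minimal illustration under that assumption and omits both the p-value and the Tukey HSD post hoc test, which require the F and studentised range distributions.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F statistic relative to its degrees of freedom suggests that at least one group differs from the others. In SciPy, `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd` provide the full tests, including p-values.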

Experimental Setup
In this paper, we sought to examine the out-of-sample forecast performance of the SARIMA, SVR, and CNN models with a focus on two key Australian indicators: unemployment rate and monthly number of short-term visitors. The methodology, delineated in Section 3, was consistently applied across all our experimental setups. Each setup entailed two distinct data periods, one considering all available data up to December 2022, and another ending in December 2019. This approach allowed us to make an equitable comparison between the models built using the full dataset versus those developed using a reduced data subset.
The design of our experiments was intended to assess the influence of the COVID-19 pandemic on the correlation between our chosen indicators and Google Trends data. By intentionally omitting data from the last three years and focusing on the pre-pandemic period, we evaluated if the dynamics between the indicators and Google Trends were dissimilar during a relatively more economically stable period.
In each experimental setup, we constructed four iterations of each of our 12 models (elaborated further in Tables 4-6). Each iteration was trained and tested on a different dataset corresponding to its unique forecasting horizon. We built two SARIMA models, one that utilised all the historical data and another that incorporated the data from 2005 onwards. The objective here was to assess whether the inclusion of more historical data enhanced the model's performance. Subsequently, two SARIMAX models were developed, one utilising all exogenous variables and another using a subset of selected variables, as outlined in Section 3. This exercise allowed us to juxtapose the performance of SARIMAX with the SARIMA model constructed using more recent data, as well as to discern whether the Google Trends data could bolster the model's accuracy. The SARIMAX model with selected exogenous features served as a comparison point with the original SARIMAX model. Furthermore, we constructed the SVR and CNN models using all features from the target variable and then a subset of these features after implementing the RFE, MI, and f_test. This approach enabled us to contrast the performance of SVR and the CNNs with SARIMA and determine whether the feature selection enhanced the model performance. Later, the SVR and CNN models were constructed using all Google Trends features along with the target variable features for comparison with SARIMAX. The same models were then built using a selected subset of features. This exhaustive comparative analysis enabled us to assess the effectiveness of the machine learning and deep learning models vis-à-vis the conventional ones. It also helped ascertain whether the incorporation of Google Trends data enhanced the predictive accuracy of these models and whether contemporary models more effectively encapsulated the relationship between the variables. Moreover, the utility of feature selection in improving outcomes could be gauged.
Comparing different experimental sets provided insights into the influence of the Google data on the model performance, in particular by contrasting the model outcomes using datasets that include and exclude the COVID-19 period.
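The idea of training and testing each model iteration on a horizon-specific dataset can be sketched as a simple holdout split. The sketch assumes that, for a horizon of h months, the final h observations form the test set and everything before them forms the training set; this splitting rule and the function name are assumptions for illustration, and the horizons are the 3-, 6-, 12-, and 24-month levels reported in the results.

```python
def split_by_horizon(series, horizons=(3, 6, 12, 24)):
    """Map each horizon h to a (train, test) pair with the last h points held out."""
    splits = {}
    for h in horizons:
        if h >= len(series):
            raise ValueError(f"horizon {h} needs more than {h} observations")
        splits[h] = (series[:-h], series[-h:])
    return splits
```

Restricting `series` to observations up to December 2019 or December 2022 before calling the function would correspond to the two data periods considered in each setup.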

Results and Discussion
In this section, we present an overview of the results for each individual set of experiments. Additional comparisons between Experiments 1 and 2 and between Experiments 3 and 4 were conducted to highlight the differences in performance between the models and in the features selected for the different data settings influenced by COVID-19. Given the large number of experiments and comprehensive statistical significance tests for the built models, only the comparison of results using the MSE is presented in Table 7, accompanied by the feature-selection results for each set of experiments in Table 8. The full results along with the data and code used to conduct these experiments are available at the following Git repository: https://github.com/a-abdulkarim/time-series-forecasting-p1/ (accessed on 1 July 2023).
Experiment 1: The first experiment revealed significant differences in performance among the models across the four forecasting horizons. The SARIMAX_ALL model outperformed all others for the 3-, 6-, and 12-month horizon levels, indicating its strong predictive power in the short- to mid-term. Interestingly, the SARIMA_HIST model, utilising historical data without the inclusion of exogenous variables, performed better for the 24-month horizon, hinting at its efficacy in capturing long-term trends and cycles.
Compared to the SARIMA_RECENT, which takes into account only recent data, SARIMA_HIST's superior performance for the 24-month horizon suggested that a broader historical context enhances long-term forecasting. SARIMAX_ALL's outperformance of SARIMA_HIST and SARIMA_RECENT for shorter horizons demonstrated the value of integrating all available features, including exogenous variables, into time series models for short-term forecasts.
Experiment 2: In the second experiment, the superiority of the SARIMAX_ALL model continued for the 6- and 12-month horizons, but it faced competition from the CNN_TARGET_GI_FS model for the 3-month horizon. This indicates that deep learning models like CNN_TARGET_GI_FS can capture intricate data patterns more effectively in the short term. For the 24-month horizon, however, the SARIMA_HIST model again performed best, reaffirming the notion that simpler models utilising a broader historical context fare better in long-term forecasting.
Feature selection models, such as SARIMAX_FS and CNN_TARGET_GI_FS, performed comparably to their all-feature counterparts for shorter horizons, suggesting that narrowing down the feature set does not necessarily impair short-term predictive capacity.
Table 8. Feature-selection results (column headings: Features Selected; Selected Exogenous Variables).
Experiment 3: The third experiment introduced a new dominant model: SVR_TARGET_GI_FS. This machine learning model with feature selection demonstrated the best performance at the 3- and 6-month horizon levels, outperforming both SARIMA variants and SARIMAX_ALL. This suggested that machine learning techniques coupled with feature selection can excel in short-term forecasts. However, the SARIMAX_ALL model still held its ground for the 12-month horizon, and SARIMA_HIST regained superiority for the 24-month horizon.
Again, feature selection models showed strong performance. The SVR_TARGET_GI_FS model's superiority for shorter horizons over SARIMAX_ALL indicated that feature selection can even outperform all-feature models in certain situations.

Experiment 4: In the final experiment, the deep learning model CNN_TARGET_GI_FS excelled for the 3- and 6-month horizons, while SARIMAX_ALL performed best for the 12-month horizon. For the 24-month horizon, the SVR_TARGET_FS model, a machine learning model with feature selection, surpassed other models, affirming the potency of feature selection for longer-term forecasting.
Across all four experiments, the results demonstrated the strengths and weaknesses of each model for different forecasting horizons, the potential advantages of machine learning and deep learning techniques over traditional SARIMA/SARIMAX models, and the possible gains from employing feature selection.
Taken together, these experiments provided nuanced insights into the interplay between traditional models (SARIMA and SARIMAX) and more modern ML and DL techniques. While the former maintained strong performance at medium-term horizons, in particular when supplemented with a complete feature set, the latter, especially when utilising feature selection, appeared more effective for both short- and long-term forecasting. Thus, the decision between ML/DL and traditional methods hinges on the forecasting horizon, underlining the importance of a targeted approach in time series prediction.

Unemployment Forecasting
Comparing Experiments 1 and 2, it was clear that the inclusion or exclusion of the COVID period data significantly influenced the predictive power of the models. In the shorter-term forecasts (3-, 6-, and 12-month horizons), the exclusion of the COVID period seemed to enhance the performance of models such as the CNNs, possibly due to the reduction of unprecedented volatility in the training data.
In contrast, the SARIMAX model, which was the most effective short- and mid-term forecasting model when the COVID data were included, saw its dominance reduced when the COVID data were excluded. This indicated that the SARIMAX model might be particularly effective at accounting for abrupt exogenous shocks such as the COVID pandemic.
For the 24-month horizon, the SARIMA_HIST model remained the superior performer, with or without the COVID data, indicating its robustness in long-term forecasting regardless of drastic economic changes.
These comparisons highlight the importance of considering the stability of the economic environment and the characteristics of the training data when selecting and interpreting forecasting models.

Number of Visitors Forecasting
Comparing Experiments 3 and 4, it became evident that the inclusion or exclusion of the COVID period data significantly impacted the models' predictive performance. In the short-term forecasts (3-, 6-, and 12-month horizons), the exclusion of the COVID period data seemed to improve the performance of the CNN model with feature selection, possibly due to the removal of the unpredictable COVID-induced volatility from the training data.
Conversely, the SARIMAX model using all exogenous variables, which was the most effective short- and mid-term forecasting model with the COVID data included, saw a reduction in its dominance when the COVID data were excluded. This indicated that the model was particularly potent when dealing with abrupt exogenous shocks such as those experienced during the COVID-19 pandemic.
For the long-term 24-month horizon forecast, the SARIMA_HIST model remained the best performer, irrespective of whether the COVID data were included or excluded, highlighting its robustness in long-term forecasting regardless of drastic changes in the economic environment.
These findings underlined the importance of considering both the stability of the economic environment and the nature of the training data when choosing and interpreting forecasting models. They also demonstrated how different models may respond differently to periods of economic volatility, further emphasising the need for careful model selection based on the specific context and forecasting horizon.

Conclusions
This research investigated the efficacy of various traditional and machine learning models in forecasting key economic indicators, namely the monthly unemployment rate and the monthly number of short-term visitors to Australia. It also explored the role of Google Trends data in enhancing the forecasting performance of these models.
Overall, the results indicated that both machine learning (ML) and deep learning (DL) models offer considerable advantages over traditional SARIMA and SARIMAX models in forecasting these indicators, particularly in the shorter-term forecasting horizons. For instance, the SVR model demonstrated superior performance over SARIMA and SARIMAX in predicting the unemployment rate across all forecasting horizons in Experiments 1 and 2. Similarly, the CNN model was more effective than its traditional counterparts in predicting short-term visitor numbers in Experiments 3 and 4, especially in the short- to mid-term forecasting horizons.
These findings align with the growing recognition of ML and DL techniques as valuable tools in economic forecasting, capable of handling complex data structures and identifying intricate patterns in the data. However, the results also underscored the robustness of traditional models such as SARIMA and SARIMAX in long-term forecasting, reminding us of their enduring relevance in certain forecasting contexts.
Importantly, the inclusion of Google Trends data enhanced the forecasting performance of several models. Models incorporating Google Trends data, such as SARIMAX and the CNN with feature selection, consistently outperformed their counterparts that relied solely on historical data, particularly in the short- to mid-term forecasting horizons. These findings affirmed the potential of Google Trends data as a valuable supplement to traditional economic data, particularly in an era where digital information plays an increasingly central role in economic activities.
This study, however, was not without its limitations. The forecasting performance of the models might be sensitive to the inclusion or exclusion of extreme events, such as the COVID-19 pandemic period data. The volatility introduced by such events can impact the predictive capability of different models in various ways, making it difficult to ascertain the most effective model across all possible contexts.
Furthermore, while the study considered a broad range of models and data types, there are still other potentially useful models and data sources that remain unexplored. For instance, other types of ML and DL models, such as recurrent neural networks (RNNs) and transformers, might offer different insights or outperform the models investigated in this study.
Future research should aim to address these limitations and explore these uncharted territories. More comprehensive investigations could consider a broader range of extreme events and their impacts on different models or investigate other types of ML and DL models and their efficacy in forecasting economic indicators. Moreover, future studies could explore other types of auxiliary data, such as social media data or other online data, to gauge their potential in enhancing economic forecasts.
In conclusion, this research underscored the potential of ML and DL techniques in economic forecasting and highlighted the value of integrating Google Trends data into these models. However, it also stressed the importance of model selection based on the specific forecasting context and the need for the continuous exploration of novel models and data sources to enhance our forecasting capabilities.
Author Contributions: Conceptualisation, A.A.K., E.P. and S.M.; methodology, A.A.K., E.P. and S.M.; formal analysis, A.A.K.; writing (original draft preparation), A.A.K.; writing (review and editing), E.P. and S.M.; supervision, E.P. and S.M. All authors have read and agreed to the published version of the manuscript.