Multivariate Forecasting Evaluation: Nixtla-TimeGPT

Karim, S M Ahasanul; Zarrin, Bahram; Lassen, Niels Buus

doi:10.3390/cmsf2025011029

Open AccessProceeding Paper

Multivariate Forecasting Evaluation: Nixtla-TimeGPT^†

by

S M Ahasanul Karim

^1,*,‡,§

,

Bahram Zarrin

^2,§ and

Niels Buus Lassen

^3,*

¹

Microsoft Research Hub, Microsoft Development Center Copenhagen, 2800 Copenhagen, Denmark

²

Microsoft Dynamics 365 AI ERP, Microsoft Development Center Copenhagen, 2800 Copenhagen, Denmark

³

Department of Digitalization, Copenhagen Business School, 2000 Copenhagen, Denmark

^*

Authors to whom correspondence should be addressed.

^†

Presented at the 11th International Conference on Time Series and Forecasting, Canaria, Spain, 16–18 July 2025.

^‡

Current address: Department of Digitalisation, Copenhagen Business School, 2000 Frederiksberg, Denmark.

^§

These authors contributed equally to this work.

Comput. Sci. Math. Forum 2025, 11(1), 29; https://doi.org/10.3390/cmsf2025011029

Published: 26 August 2025

(This article belongs to the Proceedings of The 11th International Conference on Time Series and Forecasting)

Download

Browse Figures

Versions Notes

Abstract

Generative models are being used in all domains. While primarily built for processing texts and images, their reach has been further extended towards data-driven forecasting. Whereas there are many statistical, machine learning and deep learning models for predictive forecasting, generative models are special because they do not need to be trained beforehand, saving time and computational power. Also, multivariate forecasting with the existing models is difficult when the future horizons are unknown for the regressors because they add mode uncertainties in the forecasting process. Thus, this study experiments with TimeGPT(Zeroshot) by Nixtla where it tries to identify if the generative model can outperform other models like ARIMA, Prophet, NeuralProphet, Linear Regression, XGBoost, Random Forest, LSTM, and RNN. To determine this, the research created synthetic datasets and synthetic signals to assess the individual model performances and regressor performances for 12 models. The results then used the findings to assess the performance of TimeGPT in comparison to the best fitting models in different real-world scenarios. The results showed that TimeGPT outperforms multivariate forecasting for weekly granularities by automatically selecting important regressors whereas its performance for daily and monthly granularities is still weak.

Keywords:

generative AI; time series forecasting; multivariate forecasting; exogenous regressors; statistical modelling; machine learning; deep learning

1. Introduction

Generative AI has taken the world by storm. It has shown its ability to undertake a wide variety of tasks like question answering, rational thinking, summarising, sentiment analysis, image analysis, and so forth. Whereas the majority of users are using these for texts and images, recent developments have further extended their use towards forecasting. Models like TTM by IBM [1], Lag-llama by Facebook [2], TimesFM by Google [3], and TimeGPT by Nixtla [4], which are pre-trained with many time series, can forecast within a very short runtime and computation overhead while also achieving great accuracy. However, these models are often benchmarked using univariate forecasts rather than multivariate forecasts. Multivariate forecasting is important because it can take external events into account and provide great context for predicting shocks or structural breaks. But it is a challenge to predict with better accuracy with a multivariate approach, especially when the future horizon is unknown for the exogenous regressor because it adds more uncertainty into the equation, and the errors of the predicted horizon for the regressors often drive more error for predicting the target. Generative AI can be a promising development for dealing with this issue.

Ekambaram et al. analysed exogenous variables with generative models like TTM, Moment, GPT4TS, and Time-LLM in a wide variety of datasets, but TimeGPT and other state-of-the-art models were not considered [1]. At this date, this is possibly the only available research where performance assessment has been carried out with generative models for multivariate forecasting. For this paper, we have chosen to work with the generative model TimeGPT by Nixtla (nixtlats library version “0.5.2” on python) because it has a proper encoder–decoder structure and it is pre-trained with the largest collection of time series ranging up to 100 billion data points from multiple sources like finance, economics, demographics, healthcare, weather, IoT, sensor data, energy, web traffic, sales, transport, and banking, which makes it most suitable among other generative models for forecasting. Garza et al. showed that univariate TimeGPT can surpass the accuracy of models like ZeroModel, HistoricAverage, SeasonalNaive, Theta, DOTheta, ETS, CES, ADIDA, IMAPA, CrostonClassic, LGBM, LSTM, DeepAR, and TFT [4].

However, it is questionable if TimeGPT can outperform the best-fitting non-generative model with the right external variables and make robust predictions based on different scenarios. Thus, this paper attempts to determine if a generative AI model like TimeGPT(Zeroshot) can leverage external events better to forecast results for daily, weekly, and monthly granularity in terms of mean absolute error in comparison toother models like ARIMA, Seasonal ARIMA, Prophet, Linear Regression, XGBoost, LightGBM, Random Forest, LSTM, RNN, and Neural Prophet using different Python libraries (pmdarima “1.8.3”, statsforecast “1.7.5”, prophet “1.1.5”, mlforecast “0.13.2”, pycaret “3.3.1”, nixtlats “0.5.2”, neuralprophet “0.9.0” and neuralforecast “0.1.0”). The mean absolute error metric is chosen because it delivers a correct estimation of the average error while forecasting multiple time series altogether. It is always positive; hence, after averaging, any negative errors do not cancel positive errors. Also, they do not explode in magnitude for outliers or simply for a bad fit. It also gives an exact idea of how much we are underestimating or overestimating while forecasting. The best models were tested in real-world demand forecasting scenarios for assessment and comparison of collective forecasting accuracies and statistical significance.

2. Materials and Methods

First, we will create simulated scenarios and events that we control to discover model performances for different patterns and interpret useful insights from them with controlled experiments. We will find the best-fitting model for each scenario and the best type of regressors to compare TimeGPT’s(nixtlats “0.5.2”) performance with them. For the subsequent case studies, we will collect the data, preprocess it, and evaluate our synthetic findings by multivariate forecasting. The overall procedure of this research is portrayed in Figure 1.

2.1. Dataset Creation and Evaluation

In the simulated dataset creation stage, we will create a synthetic dataset of time series with diverse characteristics and resample it in daily, weekly, and monthly formats. We will also create simulated external variables for testing. Then, we will move to the model fitting stage and fit the models from the mentioned models with different libraries by splitting the synthetic data into training and test sets. We will test the models both with and without external variables to understand the effects of the external variables and assess the impact of each external signal by excluding it from the training set. Then, in the performance evaluation stage, we will evaluate each model’s performance and external variable effects by MAE. Then, we will use the results from the simulated experiments to find out the best performing model and the best external signals to be used to make forecasts for real-world use cases and compare it with a generative AI model to assess their competitive performance.

To simulate a set of time series with different characteristics, we designed a set of 216 time series with a combination of stable, upwards, downwards trends, the presence or absence of seasonality, cycles and noises (non-linearity), additive or multiplicative modes, and from 0 up to 5 structural breaks in random time points using the following equations:

(i) Trend Components:

T (t) = 1 + \frac{t}{N - 1} for t \in [0, N - 1]

(1)

T (t) = 2 - \frac{t}{N - 1} for t \in [0, N - 1]

(2)

T (t) = 1 for all t

(3)

(ii) Seasonal Components:

S (t) = 0.5 * cos 2 * π * t / (365 * 7)

(4)

(iii) Cycle Components:

C (t) = 0.3 * sin 2 * π * t / (365 * 5)

(5)

(iv) Noise Components:

N (t) = Noise, N (t) \sim N (0, 20)

(6)

(v) Modes:

a d d i t i v e : Y (t) = T (t) + S (t) + C (t) + N (t)

(7)

m u l t i p l i c a t i v e : Y (t) = T (t) * (2 + S (t)) * (2 + C (t)) + (1 + N (t) / 100)

(8)

(vi) Structural Breaks: 0 to 5 breaks in random time points.

a d d i t i v e : Y (t) = Y (t) + constant for t \geq t_{break}

(9)

m u l t i p l i c a t i v e : Y (t) = Y (t) * constant for t \geq t_{break}

(10)

Running all these attributes in a loop, we generated a 5-year time series dataset with different trends, cycles, seasonality, noises, modes, and breaks, resulting in 216 time series with diverse characteristics. Each time series was labelled with a unique identifier. The dataset was stored in daily, monthly, and weekly granularity for further characteristic testing of model performances. Adding positive constants in the break points pushed some downward trends upwards and created seasonalities of their own in some situations. Also, changing granularities has completely changed structural aspects for some time series. Therefore, the metadata containing details about individual series characteristics was manually adjusted later. The dataset was then forecasted with all the models by an 80:20 training/test split.

2.2. Simulated External Regressor Creation

For the dataset created for assessment, we considered 6 external regressors to understand their effect on prediction accuracy for the remaining models. The first two signals are static and inspired by the work of [5], we designed exogenous features for the NBEATSx model, which means that these signals remain the same throughout each of the 216 time series. The 3rd, 4th, and 5th signals are dynamic and series-specific, exogenous, and calculated from the target variable and its lagged values to mimic events that have a delayed effect. The last signal is random noise-generated. Below are the equations described for the external variables and their correlations.

trend (t) = 200 \times (a_{0} + a_{1} t^{1} + a_{2} t^{2} + \dots + a_{d} t^{d})

(11)

seasonal (t) = \sum_{k = 1}^{K} (\begin{matrix} 2000 \times sin (\frac{2 π k t}{P}) \\ + & 2000 \times cos (\frac{2 π k t}{P}) \end{matrix})

(12)

exogenous (y) = α \times y + | ϵ |

(13)

{exogenous}_{{lag}_{3}} (t) = exogenous (t + 3)

(14)

{exogenous}_{{lag}_{5}} (t) = exogenous (t + 5)

(15)

noise (t) = \frac{1}{N} \sum_{i = 1}^{N} ϵ_{i} (t)

(16)

These variables were used to assess the competitive performance of TimeGPT with respect to the other models. The average correlation score of the external variables with the target is portrayed in Table 1. The trend, regardless of being static, shows great correlation towards the dependent variable. This can be because most series pushed upwards due to rapid structural breaks. Also, the mathematically modeled y dependent exogenous feature still showed a lower correlation due to the added noise term

ϵ

.

2.3. Case Studies

Case Study 1: This case involves forecasting the UK’s monthly electricity demand in each 30 min interval over the span of 35 months. This dataset is gathered by National Grid ESO, an electricity system operator for Great Britain, which is updated twice per hour, 48 times each day [6]. It contains daily data over 14 years which were aggregated for each 30 min time interval into 48 time series. All the electricity variables are estimated in Megawatts (MW). In the dataset, most of the features are related to supply more than demand, and some are dependent on demand rather than demand depending on them. Thus, for external features, items such as the following were further added from other external sources: disaster-affected days, average temperature (tavg), precipitation (prcp), hours of sunshine (tsun), electricity on gdp, oil prices, CPI, and weather. The data was converted into a monthly granularity to determine the monthly electricity load. After the training/test split, the forecast horizon was 35 months. The patterns of the main series were gradually decreasing over time, with regular seasonal fluctuations that multiplicatively shrunk the demand over time as shown in Figure 2.

Case Study 2: In this case, we will forecast docked bike demand for each hour at each station weekly. Divvy is a popular bike-sharing service in Chicago. It allows users to dock and rent bikes from biking stations across the city. The available dataset in Kaggle contains several datasets, but among those only “Divvy_Trips_1719.csv” and “Chicago_Weather_2017_2022[1].csv” were found suitable, representing trip details and weather details, respectively. There was another dataset of bus and rail boarding called “CTA Bus Rail Daily Totals.csv” but it was found to be missing years of data; therefore, it was omitted and added from another external source [7]. Weather features such as temperature and dew were then aggregated by weekly means. Other external variables like bus rail passengers, disaster-affected days, holidays and working days, and oil prices, were further added from external sources. Among the 14,614 unique_ids, the unique_ids that had 156 instances of data were segmented as a smooth forecasting sample, which left us with a total of 4447 time series. After the training/test split, the forecast horizon was 32 weeks, and they collectively showed a stable trend, and seasonal characteristics as shown in Figure 2.

Case Study 3: In this case, we will forecast demand for each perishable good at each store daily. The dataset is from Brick-and-Mortar grocery stores in Ecuador, which is a perfect dataset for daily demand planning, as a good number of goods are perishable and over-forecasting or under-forecasting too much is an issue for thousands of stores across many locations [8]. Here, store ID and item ID were joined to make a unique ID, which leaves us with a combination of 174,685 stores and products. From the item information, we segment out the perishable items which are crucial for daily demand forecasting and contain 4.5 years of regular daily data, imputing the missing values with zero for unit sales. This leaves us with 1654 unique_ids to be forecasted. External events like local, regional, and national holidays, workdays, payday, oil price, and promotion were further added. After the training/test split, the forecast horizon was 365 days, the common pattern being stable trend with rapid breaks, sudden spikes and drops as shown in Figure 2.

We will later find out the best-performing models in these different kinds of scenarios and granularities with our synthetic dataset and fit them to these cases to compare their multivariate performance with TimeGPT.

3. Results

The results derived from fitting all the models without any exogenous regressors for the synthetic dataset for daily, monthly, and weekly granularities are given in Table 2, where the top three models are in bold with the best model in asterics and the worst model is underlined. Libraries like Nixtla and PyCaret are used for further comparative forecast performance assessment, repeating some models adopting different hyperparameter tuning and grid search approaches. For the machine learning models, feature engineering from [9] was implemented, as the study claimed that it had outperformed the statistical models in demand forecasting scenarios. For comparison among various models, the different MAEs were scaled between 0 and 1. If

e_{\max}

is the maximum amount of error among a set of errors

{e_{1}, e_{2}, \dots, e_{n}}

for different models for a series, then the scaling factor

S_{i}

can be expressed as

S_{i} = \frac{e_{i}}{e_{\max}}, Where e_{i} = {e_{1}, e_{2}, \dots, e_{n}} .

(17)

The results show that, despite being computationally very expensive, AutoLSTM and AutoRNN with 20 were not performing accordingly in collective forecasting scenarios; therefore, they were omitted from further testing. From the experiments on the synthetic dataset, we discovered that almost all the models generally tend to forecast better for weekly granularities. This indicates the number of data points over-representing or under-representing the characteristics, and the forecast horizons being too long or too short for daily and monthly granularities, respectively. This could also be due to the fact that aggregating the data by their weekly mean smoothens and handles the outliers to such an extent that it is neither too abstract nor too over-detailed, thus easier for the algorithms to grasp. We also learned that a machine learning algorithm, such as linear regression is a badly performing model in these kind of forecasting scenarios in both Nixtla’s mlforecast and PyCaret, because of their sensitivity to outliers and multicollinearity, whereas statistical models like ARIMA, Prophet, and TimeGPT tend to perform well overall in all scenarios, and Prophet performs better when structural breaks are present from Figure 3.

The results in Table 3 show that some models are performing better with all the external variables included, while some models are not. TimeGPT, except for the daily scenario, is performing better in monthly and weekly cases. We can also see that, in one case, linear regression exceeded the error value for maximum float64 capacity, and linear regression and Neural Prophet models generally performed badly; therefore, they were further omitted from the study. ARIMA models from Nixtla were also omitted because they were found to be forecasting negative demand.

Removing each external feature added the following positive and negative importance to that variables as shown in Table 4 and Figure 4. The external variables were further evaluated by keeping this all-feature-inclusive MAE as baseline and removing each external feature from the model in turns, where feature importance was calculated as (MAE without a distinct feature—MAE with all features included) so that the positive values indicate the positive impact.

When we tried the same variables with TimeGPT, and tried to calculate the importance for each external variable, we found all feature importances to be 0. Interestingly, it did not show any distinct feature importances, and all the scores were 0 or almost 0 until 7 decimal places for all these experiments. This could either be because of the LLM hallucination or TimeGPT inherently only choosing automatically the feature that is best for the data by automated interpretations. Table 4 further backs up the argument by showing that TimeGPT has performed better by adding all external features used together, whereas for other models, it has been somewhat the opposite, except for the daily granularity. It has a surge in error for daily granularities, but it is to be noted that since linear regression was not present this time with external variables, the averaging metrics were manipulated. This can also be a reason for TimeGPT not having enough daily data in its training corpus. The results mimic those from [4]. However, TimeGPT can also be fine-tuned further, but it is not free and is expensive in usage credit units.

Considering the results and model performances in different scenarios, we chose to fit the best state-of-the-art models and regressors with the three cases to compare them with TimeGPT. Table 5 shows that all the models performed worse than their univariate results when we picked every correlated external event with their best chosen models for multivariate setting, keeping the future values of the extra regressors unknown. We decided to fit TimeGPT with all the external features, as from our synthetic experiment, it seemingly autodetects or ignores the relevant external effects for forecasting. We found TimeGPT performing better than the best state-of-the-art model in the weekly docked bike demand forecasting case, just like it did on Table 4.

4. Discussion

With the results, we can conclude that whether external features would add more accuracy to a model or not is a matter of relativity, and there are no two similar ways. It is more difficult for scenarios where thousands of time series are involved, and the exogenous effects are not uniform among all of them. Also, forecasting exogenous features for the target horizon can make the future values of those regressors more error-prone, which is a threat to the main target forecast. When using multiple regressors as exogenous, the features can have plenty of different kinds of intercorrelations with each other apart from the target variable, which also poses a challenge of overfitting. A generative pre-trained model like TimeGPT can be a solution in this case, but the performance of TimeGPT is still to be improved to surpass the best-fitting model in all aspects.

This paper assumes that TimeGPT is trained with more weekly time series than daily and monthly ones, which can explain its specific efficiency in weekly granularities for multivariate forecasting. However, further tests need to be performed to explain the reasons behind its peak performance. Future works can also focus on evaluating error metrics other than MAE with more robust analysis and analysing more complex correlations.

5. Conclusions

With the case studies, we can conclude that multivariate forecasting is a challenge. It involves many barriers in finding and fitting the best model and incorporating the right exogenous variable. This research attempted to explore these challenges and tested whether they can be solved by using generative AI models. The main research interest was to find the best models and external variables based on different scenarios to see if TimeGPT could surpass their forecasting accuracy. TimeGPT could overcome that challenge for weekly granularities, but it could not perform better for the other cases. This research further suggests exploring results from fine-tuned TimeGPT instead of zero-shot approaches and explaining why TimeGPT performs as it does. More in-depth research should also be carried out on what makes TimeGPT perform better, and more models further need to be compared, like Autotheta, AutoCES, N-Beats, N-Hits, XLSTM, LLama-3, TTM, and TimesFM. Also, in this research, forecasting performance was only evaluated in MAE; thus, other error metrics can also bring about different results. Further research in this area with generative models will lead us to explore more futuristic forecasting.

Author Contributions

Conceptualization, S.M.A.K.; Methodology, S.M.A.K.; Software, S.M.A.K.; Validation, S.M.A.K., B.Z., and N.B.L.; Formal Analysis, S.M.A.K., B.Z., and N.B.L.; Investigation, S.M.A.K., B.Z., and N.B.L.; Resources, S.M.A.K. and B.Z. Data Curation, S.M.A.K.; Writing—Original Draft Preparation, S.M.A.K.; Writing—Review and Editing, N.B.L. and S.M.A.K.; Visualization, S.M.A.K.; Supervision, B.Z. and N.B.L.; Project Administration, S.M.A.K.; Funding Acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received necessary computational resources and funding support from Microsoft Development Center Copenhagen and the Department of Digitalisation, Copenhagen Business School.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The publicly available Data for UK electricity consumption can be found at: https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022 (accessed on 1 July 2024). The bike sharing data can be found at: https://www.kaggle.com/datasets/leonidasliao/divvy-station-dock-capacity-time-series-forecast (accessed on 1 July 2024). The additional public transport data (bus, rail onboarding) can be found at: https://www.transitchicago.com/ridership/ (accessed on 1 July 2024). The Corporación Favorita grocery sales forecasting data can be found at: https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting/overview (accessed on 1 July 2024).

Acknowledgments

A huge thanks to the Microsoft Demand Planning Team for their incredible support and collaboration throughout this research project. Their insights, expertise, and willingness to share knowledge made a huge difference in shaping our work. We truly appreciate the time and effort they put into discussions, feedback, and guidance—it has been an amazing learning experience. We are also grateful to the broader Microsoft team for fostering an environment that encourages curiosity, innovation, and problem-solving. This project would not have been the same without their support.

Conflicts of Interest

Author Bahram Zarrin was employed by the company Microsoft. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding and resource support from Microsoft Development Center Copenhagen and Department of Digitalization, Copenhagen Business School. The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

Ekambaram, V.; Jati, A.; Dayama, P.; Mukherjee, S.; Nguyen, N.H.; Gifford, W.M.; Reddy, C.; Kalagnanam, J. Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series. arXiv 2024, arXiv:2401.03955. [Google Scholar]
Rasul, K.; Ashok, A.; Williams, A.R.; Ghonia, H.; Bhagwatkar, R.; Khorasani, A.; Bayazi, M.J.D.; Adamopoulos, G.; Riachi, R.; Hassen, N.; et al. Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. arXiv 2024, arXiv:2310.08278. [Google Scholar] [CrossRef]
Das, A.; Kong, W.; Sen, R.; Zhou, Y. A decoder-only foundation model for time-series forecasting. arXiv 2024, arXiv:2310.10688. [Google Scholar]
Garza, A.; Challu, C.; Mergenthaler-Canseco, M. TimeGPT-1. arXiv 2024, arXiv:2310.03589. [Google Scholar]
Olivares, K.G.; Challu, C.; Marcjasz, G.; Weron, R.; Dubrawski, A. Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx. Int. J. Forecast. 2023, 39, 884–900. [Google Scholar] [CrossRef]
Kaggle. Electricity Consumption UK 2009–2024. 2024. Available online: https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022 (accessed on 1 July 2024).
Kaggle. Divvy Station Dock Capacity Time Series Forecast. 2022. Available online: https://www.kaggle.com/datasets/leonidasliao/divvy-station-dock-capacity-time-series-forecast (accessed on 1 July 2024).
Kaggle. Corporación Favorita Grocery Sales Forecasting. 2018. Available online: https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting/overview (accessed on 1 July 2024).
Ghiletki, D. Machine Learning-Based Demand Forecasting for Supply Chain Management. Master’s Thesis, Department of Technology, Management and Economics, DTU, Copenhagen, Denmark, 2023. [Google Scholar]

Figure 1. Overview of the process flow for multivariate forecasting research.

Figure 2. Comparison of demand across different sectors (electricity, docked bikes, and groceries).

Figure 3. Model performances on samples of the synthetic dataset.

Figure 4. Individual external feature importance for daily, weekly, and monthly granularities in MAE.

Table 1. Average correlation (linear) with target for simulated external features.

X Feature	Daily	Weekly	Monthly
Static trend	0.51	0.57	0.63
Static seasonal	−0.07	−0.07	−0.04
Exogenous	0.31	0.36	0.43
Exogenous_lag_3	0.27	0.35	0.4
Exogenous_lag_5	0.27	0.35	0.44
Noise	0	0	0

Table 2. Scaled MAEs for different models across time frequencies for univariate forecasting.

Model	Daily	Weekly	Monthly
AutoLSTM (neuralforecast)	0.009652	0.002137	0.08955
AutoRNN (neuralforecast)	0.009892	0.002277	0.090511
Neuralprophet	0.009429	0.002662	0.104538
TimeGPT(zeroshot)	0.006097	0.001962 *	0.090954
LightGBM (Nixtla)	0.00653	0.007487	0.158012
Linear Regression (Nixtla)	0.364913	0.123713	0.320041
Random Forest (Nixtla)	0.011217	0.003226	0.124173
XGBoost (Nixtla)	0.006377	0.004707	0.134841
LightGBM (PyCaret)	0.008808	0.00226	0.101377
Linear Regression (PyCaret)	0.728414	0.906034	0.783822
XGBoost (PyCaret)	0.008687	0.002192	0.100657
Random Forest (PyCaret)	0.008748	0.002243	0.100578
ARIMA (Pmdarima)	0.005902	0.002024	0.088495
ARIMA (Nixtla)	0.005812 *	0.00234	0.077143 *
MSTL ARIMA (Nixtla)	0.005834	0.002298	0.081126
Prophet	0.007215	0.001965	0.098457

Table 3. Scaled MAE for different time series characteristics.

Title	Scenario	AutoLSTM	AutoRNN	NeuralProphet	TimeGPT	LGBM(Nixtla)	RF(Nixtla)	Xgb(Nixtla)	Lgbm(PyC)	Xgb(PyC)	Arima(Pmd)	MSTL Arima(Nixtla)	Prophet	Arima(Nixtla)
trend- upward, linear	Weekly	0.03	0.07	0.01	0.00	1.00	0.11	0.28	0.00	0.00	0.02	0.05	0.00	0.02
	Monthly	0.10	0.12	0.06	0.01	0.59	0.50	0.63	0.00	0.00	0.01	0.12	0.02	0.00
	Daily	0.01	0.01	0.00	0.05	0.66	0.50	0.52	0.00	0.00	0.00	0.05	0.00	0.05
trend- upward with breaks, nonlinear	Weekly	0.04	0.01	1.00	0.05	0.02	0.07	0.05	0.43	0.39	0.03	0.07	0.33	0.04
	Monthly	0.02	0.01	0.18	0.02	0.57	0.63	1.00	0.07	0.09	0.11	0.01	0.07	0.01
	Daily	0.88	0.87	1.00	0.06	0.12	0.06	0.02	0.40	0.41	0.06	0.06	0.78	0.06
trend- upward with breaks, linear, Additive	Weekly	0.03	0.05	0.17	0.01	1.00	0.00	0.44	0.02	0.01	0.00	0.00	0.01	0.01
	Monthly	0.02	0.05	0.33	0.01	1.00	0.14	0.92	0.02	0.01	0.02	0.01	0.09	0.01
	Daily	0.28	0.34	0.42	0.01	1.00	0.01	0.84	0.10	0.07	0.01	0.01	0.00	0.01
trend- upward with breaks, linear	Weekly	0.58	0.59	0.67	0.44	0.69	0.56	0.57	0.56	0.57	0.36	0.32	0.47	0.32
	Monthly	0.34	0.34	0.39	0.33	0.77	0.64	0.46	0.38	0.38	0.30	0.26	0.37	0.19
	Daily	0.53	0.56	0.60	0.28	0.38	0.44	0.41	0.46	0.46	0.23	0.42	0.33	0.25
trend- upward with breaks	Weekly	0.11	0.14	0.89	0.03	1.00	0.01	0.10	0.35	0.13	0.02	0.01	0.22	0.01
	Monthly	0.01	0.01	0.13	0.01	0.96	0.21	1.00	0.01	0.01	0.00	0.01	0.02	0.00
	Daily	0.88	0.89	0.92	0.00	0.05	0.02	0.11	0.66	0.64	0.01	0.01	0.68	0.01
trend- stable with breaks, nonlinear	Weekly	0.21	0.25	0.94	0.10	0.06	0.03	0.10	0.69	0.67	0.00	0.02	0.32	0.00
	Monthly	0.04	0.05	0.12	0.01	1.00	0.49	0.59	0.08	0.08	0.07	0.01	0.06	0.00
	Daily	0.88	0.89	0.92	0.00	0.05	0.02	0.11	0.66	0.64	0.01	0.01	0.68	0.01
trend- stable with breaks, linear, Multiplicative	Weekly	0.98	0.99	1.00	0.96	0.96	0.96	0.96	0.98	0.98	0.96	0.96	0.94	0.96
	Monthly	0.99	1.00	1.00	0.96	0.76	0.68	0.66	0.98	0.98	0.92	0.95	0.97	0.96
	Daily	1.00	1.00	1.00	0.95	0.95	0.89	0.94	0.97	0.97	0.95	0.95	0.91	0.95
trend- stable with breaks, linear	Weekly	0.23	0.28	0.95	0.19	0.29	0.21	0.29	0.40	0.40	0.13	0.14	0.34	0.12
	Monthly	0.11	0.15	0.30	0.09	0.79	0.52	0.69	0.13	0.14	0.08	0.08	0.13	0.07
	Daily	0.74	0.72	0.41	0.12	0.28	0.19	0.20	0.19	0.19	0.07	0.08	0.24	0.06
trend- downward, linear	Weekly	0.09	0.06	0.34	0.03	0.74	0.28	0.47	0.12	0.03	0.06	0.02	0.02	0.02
	Monthly	0.26	0.35	0.33	0.08	0.84	0.16	0.72	0.06	0.03	0.10	0.05	0.07	0.00
	Daily	0.30	0.25	0.36	0.06	0.93	0.07	0.90	0.11	0.08	0.09	0.06	0.11	0.06
trend- downward with breaks, nonlinear	Weekly	0.40	0.39	0.48	0.50	0.76	0.03	1.00	0.61	0.62	0.39	0.15	0.52	0.15
	Monthly	0.03	0.03	0.36	0.42	1.00	0.87	0.63	0.51	0.51	0.38	0.12	0.48	0.04
	Daily	0.43	0.43	0.27	0.35	0.96	0.49	0.18	0.38	0.38	0.04	1.00	0.36	0.04
trend- downward with breaks, linear	Weekly	0.35	0.38	0.83	0.32	0.64	0.39	0.61	0.61	0.60	0.28	0.24	0.43	0.24
	Monthly	0.29	0.32	0.51	0.29	0.68	0.57	0.46	0.35	0.35	0.26	0.23	0.35	0.20
	Daily	0.60	0.49	0.70	0.22	0.39	0.45	0.30	0.40	0.42	0.21	0.32	0.35	0.21
trend- downward with breaks	Weekly	0.22	0.26	1.00	0.02	0.06	0.05	0.19	0.32	0.33	0.04	0.01	0.42	0.03
	Monthly	0.01	0.01	0.12	0.00	0.32	1.00	0.61	0.02	0.01	0.01	0.02	0.01	0.01
	Daily	0.30	0.23	1.00	0.21	0.29	0.68	0.21	0.31	0.28	0.19	0.19	0.17	0.22
trend- downward	Weekly	0.56	0.57	0.69	0.56	1.00	0.48	0.71	0.41	0.34	0.61	0.46	0.48	0.59
	Monthly	0.54	0.79	0.56	0.46	1.00	0.54	0.95	0.50	0.61	0.51	0.49	0.60	0.50
	Daily	0.78	0.88	0.81	0.81	0.79	0.98	0.81	1.00	0.98	0.87	0.79	0.76	0.89
multiple seasonalities, upward, linear, Additive	Weekly	0.29	0.30	0.58	0.24	0.54	0.29	1.00	0.26	0.49	0.39	0.21	0.16	0.23
	Monthly	0.35	0.48	0.68	0.22	0.85	0.24	0.60	0.18	0.18	0.29	0.27	0.26	0.24
	Daily	0.43	0.58	0.57	0.15	1.00	0.40	0.91	0.25	0.24	0.21	0.15	0.28	0.15
multiple seasonalities, stable, linear, Multiplicative	Weekly	0.79	0.77	1.00	0.82	0.75	0.83	0.81	0.87	0.91	0.83	0.81	0.76	0.81
	Monthly	0.25	0.24	0.33	0.27	1.00	0.26	0.16	0.29	0.30	0.28	0.26	0.29	0.27
	Daily	0.75	0.76	1.00	0.82	0.79	0.44	0.69	0.87	0.87	0.82	0.82	0.67	0.82
multiple seasonalities, downward, linear, Multiplicative	Weekly	0.91	0.90	1.00	0.95	0.77	0.89	0.30	0.94	0.96	0.93	0.94	0.92	0.94
	Monthly	0.41	0.41	0.45	0.42	1.00	0.20	0.16	0.42	0.41	0.40	0.42	0.42	0.42
	Daily	0.76	0.77	1.00	0.91	0.84	0.91	0.77	0.92	0.90	0.95	0.92	0.86	0.92
cycles- upward, nonlinear, Additive	Weekly	0.26	0.46	0.23	0.13	0.88	0.46	0.92	0.36	0.37	0.14	0.14	0.21	0.04
	Monthly	0.30	0.35	0.37	0.15	0.84	0.63	0.65	0.34	0.31	0.22	0.25	0.12	0.09
	Daily	0.25	0.35	0.98	0.02	0.18	1.00	0.09	0.17	0.34	0.20	0.02	0.15	0.02
cycles- upward, linear, Multiplicative	Weekly	0.07	0.04	1.00	0.11	0.74	0.03	0.15	0.14	0.13	0.16	0.02	0.15	0.03
	Monthly	0.20	0.26	1.00	0.15	0.13	0.14	0.19	0.38	0.48	0.22	0.06	0.30	0.02
	Daily	0.25	0.35	0.98	0.02	0.18	1.00	0.09	0.17	0.34	0.20	0.02	0.15	0.02
cycles- upward, linear, Additive	Weekly	0.00	0.05	0.35	0.01	1.00	0.22	0.31	0.08	0.01	0.00	0.01	0.01	0.00
	Monthly	0.15	0.18	0.51	0.02	1.00	0.21	0.96	0.18	0.28	0.20	0.04	0.29	0.03
	Daily	1.00	0.97	0.31	0.06	0.43	0.49	0.34	0.31	0.23	0.03	0.04	0.17	0.04
cycles- upward with breaks, linear, Multiplicative	Weekly	0.27	0.20	1.00	0.09	0.16	0.41	0.05	0.13	0.02	0.10	0.21	0.29	0.12
	Monthly	0.09	0.11	0.31	0.05	0.23	0.69	0.55	0.05	0.10	0.08	0.07	0.12	0.07
	Daily	0.15	0.15	0.79	0.14	1.00	0.15	0.96	0.16	0.16	0.14	0.14	0.50	0.15
cycles- upward with breaks, linear, Additive	Weekly	0.37	0.50	0.66	0.38	0.59	0.38	0.57	0.47	0.44	0.40	0.40	0.31	0.39
	Monthly	0.34	0.48	0.57	0.27	0.49	0.64	0.34	0.35	0.37	0.36	0.32	0.27	0.26
	Daily	0.66	0.61	0.44	0.30	0.76	0.22	0.69	0.23	0.24	0.24	0.27	0.28	0.27
cycles- stable, linear, Additive	Weekly	0.27	0.31	0.99	0.26	0.71	0.53	0.51	0.29	0.35	0.27	0.25	0.30	0.27
	Monthly	0.44	0.51	1.00	0.23	0.37	0.30	0.38	0.35	0.38	0.34	0.30	0.37	0.28
	Daily	0.86	0.92	0.61	0.22	0.32	0.39	0.35	0.34	0.32	0.22	0.22	0.46	0.22
cycles- stable with breaks, linear, Additive	Weekly	0.21	0.34	0.92	0.17	0.73	0.37	0.44	0.27	0.23	0.15	0.07	0.38	0.03
	Monthly	0.34	0.45	0.90	0.22	0.53	0.46	0.34	0.32	0.28	0.43	0.20	0.42	0.08
	Daily	0.52	0.51	0.76	0.02	0.10	0.27	0.07	0.16	0.08	0.06	0.02	0.25	0.02
cycles- downward with breaks, linear, Multiplicative	Weekly	0.20	0.25	1.00	0.10	0.19	0.52	0.39	0.29	0.25	0.12	0.11	0.47	0.14
	Monthly	0.07	0.07	0.27	0.03	0.51	0.88	0.62	0.06	0.08	0.30	0.10	0.06	0.08
	Daily	0.53	0.55	0.89	0.28	0.17	0.30	0.29	0.20	0.21	0.12	0.24	1.00	0.26
cycle- upward, linear, Additive	Weekly	0.86	1.00	0.46	0.91	0.40	0.91	0.94	0.82	0.98	0.77	0.07	0.70	0.20
	Monthly	0.44	0.45	0.37	0.73	0.72	0.49	0.59	0.59	0.72	1	0.13	0.6	0.13
	Daily	0.53	0.55	0.89	0.28	0.17	0.3	0.29	0.2	0.21	0.12	0.24	1	0.26

Table 4. MAE for different models with and without all external variables.

Model Name	Daily MAE		Weekly MAE		Monthly MAE
	Without X	With X	Without X	With X	Without X	With X
LightGBM (Nixtla)	0.006865	0.013344	0.063523	0.063384	0.063523	0.07267
Linear Regression (Nixtla)	0.352763	0.081512	0.067139	0.095601	0.067139	0.171675
Random Forest (Nixtla)	0.011633	0.007133	0.062887	0.016148	0.062887	0.057416
XGBoost (Nixtla)	0.006671	0.006597	0.062794	0.04739	0.062794	0.060364
LightGBM (PyCaret)	0.009663	0.009685	0.068635	0.000482	0.068635	0.068837
Linear Regression (PyCaret)	0.731683	inf	0.36375	0.83549	0.36375	0.685998
Random Forest (PyCaret)	0.009592	0.009491	0.068217	0.000486	0.068217	0.068322
XGBoost (PyCaret)	0.009515	0.009698	0.068399	0.000486	0.068399	0.068336
TimeGPT	0.00645	0.064104	0.066995	0.000489	0.066995	0.064504
ARIMA (pmdarima)	0.006228	0.0066	0.069404	0.000519	0.069404	0.069404
Neuralprophet	0.010263	0.010255	0.070215	0.000676	0.070215	0.070183
Prophet	0.007307	0.008082	0.071083	0.000472	0.071083	0.067343

Table 5. Case study results.

Case Study	Best Model Used	Granularity	Forecast Horizon	MAE Univariate	MAE with All Corr. External Features	MAE TimeGPT
Electricity Demand Forecasting	XGBoost	Monthly	35 months	47,731.78 (MW)	91,854.41 (MW)	73,488.35 (MW)
Docked Bike Demand Forecasting	ARIMA	Weekly	32 weeks	1.112	1.7445	0.8898
Perishable Goods Demand Forecasting	Prophet	Daily	365 days	6.56	6.729	42.098

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Karim, S.M.A.; Zarrin, B.; Lassen, N.B. Multivariate Forecasting Evaluation: Nixtla-TimeGPT. Comput. Sci. Math. Forum 2025, 11, 29. https://doi.org/10.3390/cmsf2025011029

AMA Style

Karim SMA, Zarrin B, Lassen NB. Multivariate Forecasting Evaluation: Nixtla-TimeGPT. Computer Sciences & Mathematics Forum. 2025; 11(1):29. https://doi.org/10.3390/cmsf2025011029

Chicago/Turabian Style

Karim, S M Ahasanul, Bahram Zarrin, and Niels Buus Lassen. 2025. "Multivariate Forecasting Evaluation: Nixtla-TimeGPT" Computer Sciences & Mathematics Forum 11, no. 1: 29. https://doi.org/10.3390/cmsf2025011029

APA Style

Karim, S. M. A., Zarrin, B., & Lassen, N. B. (2025). Multivariate Forecasting Evaluation: Nixtla-TimeGPT. Computer Sciences & Mathematics Forum, 11(1), 29. https://doi.org/10.3390/cmsf2025011029

Article Menu

Multivariate Forecasting Evaluation: Nixtla-TimeGPT^†

Abstract

1. Introduction

2. Materials and Methods

2.1. Dataset Creation and Evaluation

2.2. Simulated External Regressor Creation

2.3. Case Studies

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Multivariate Forecasting Evaluation: Nixtla-TimeGPT †

Abstract

1. Introduction

2. Materials and Methods

2.1. Dataset Creation and Evaluation

2.2. Simulated External Regressor Creation

2.3. Case Studies

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Multivariate Forecasting Evaluation: Nixtla-TimeGPT^†