Proceeding Paper

Benchmarking Foundation Models for Time-Series Forecasting: Zero-Shot, Few-Shot, and Full-Shot Evaluations †

by Frédéric Montet *, Benjamin Pasquier and Beat Wolf
Institute of AI and Complex Systems (iCoSys), School of Engineering and Architecture of Fribourg (HEIA-FR), HES-SO University of Applied Sciences and Arts Western Switzerland, 1700 Fribourg, Switzerland
* Author to whom correspondence should be addressed.
Presented at the 11th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 16–18 July 2025.
These authors contributed equally to this work.
Comput. Sci. Math. Forum 2025, 11(1), 32; https://doi.org/10.3390/cmsf2025011032
Published: 8 September 2025

Abstract

Recently, time-series forecasting foundation models trained on large, diverse datasets have demonstrated robust zero-shot and few-shot capabilities. Given the ubiquity of time-series data in IoT, finance, and industrial applications, rigorous benchmarking is essential to assess their forecasting performance and overall value. In this study, our objective is to benchmark foundational models from Amazon, Salesforce, and Google against traditional statistical and deep learning baselines on both public and proprietary industrial datasets. We evaluate zero-shot, few-shot, and full-shot scenarios using metrics such as sMAPE and NMAE on fine-tuned models, ensuring reliable comparisons. All experiments are conducted with onTime, our dedicated open-source library that guarantees reproducibility, data privacy, and flexible configuration. Our results show that foundation models often outperform traditional methods with minimal dataset-specific tuning, underscoring their potential to simplify forecasting tasks and bridge performance gaps in data-scarce settings. Additionally, we address non-performance criteria, such as integration ease, model size, and inference/training time, which are critical for real-world deployment.

1. Introduction

The ability to forecast is one of the most prevalent uses of modeling. From weather forecasting to anomaly detection on industrial machines, accurately predicting the next few data points can significantly impact outcomes. Choosing the right predictor is therefore of critical importance.
In this paper, we explore some of the most recent modeling methods used to predict time series. These models, known as foundation models, are becoming increasingly prevalent in the literature; notable examples include Chronos by Amazon, Moirai by Salesforce, and TimesFM by Google [1,2,3]. They are based on methods similar to those used in foundation models for other applications, such as NLP.
Today, data scientists can rely on existing benchmarks such as GIFT-Eval or ProbTS to guide the selection of foundation models [4,5]. However, when applying these models to their own datasets, the results may differ from those reported. Furthermore, such benchmarks often consider only the zero-shot scenario, whereas these models can also be fine-tuned and used in few-shot or full-shot scenarios, leaving practitioners with an incomplete view of their true potential.
Therefore, our research question revolves around the use of time series with foundation models and is as follows: “What is the performance of foundation models for time series forecasting and analysis on industrial datasets, particularly in the context of few-shot and zero-shot learning for specific use cases?”
Because these models are so recent, this task involves several challenges. Each model is rather large, requires significant computational power, has no unified interface, and may treat time series differently (e.g., as univariate or multivariate); in addition, appropriate evaluation metrics are difficult to choose.
In summary, the ability to position the performance of a given model on a specific dataset is of utmost importance. It enables practitioners to leverage these models and accelerates the technological transfer from academia to industry.
The paper follows the structure of the scientific method. First, we explain our method, which is based on a benchmarking tool that we developed. Second, we present the results of our benchmark. Finally, we discuss the implications of these results.

2. Materials and Methods

Our work’s objective is to evaluate the performance of recent foundation models in comparison with traditional methods. The following sections introduce our methodology, which aims to be as fair as possible. First, we present the choice of datasets, models, and metrics. Then, we describe the training and evaluation scenarios.

2.1. Datasets

Our dataset selection, presented in Table 1, aims to represent both reference data frequently used in academia and industrial data that no model has ever seen during training. All academic datasets were sourced through the Darts Python library [6]. For consistency, a context window of 512 time steps is used across all datasets. We evaluate forecasting performance at horizons of 24, 48, 96, and 192 time steps, with a particular focus on the 96-step horizon in most experiments.
The energy dataset covers hourly energy generation and weather in Spain from 2015 to 2018; eight constant features were removed. ETTh1 and ETTm1 contain hourly and 15-minute multivariate data from an electricity transformer, used to forecast oil temperature based on power load features. ExchangeRate contains daily exchange rates for eight countries from 1990 to 2016. Weather includes 21 weather indicators, such as temperature and humidity, recorded every 10 min in Germany during 2020. ZurichElectricity reports 15-minute resolution electricity consumption in Zurich up to 2022, combining household and business usage with interpolated weather data. HEIA reports hourly electricity consumption from eight buildings at our engineering school in Fribourg, Switzerland. Finally, MeteoSwiss provides 10 min meteorological data from Fribourg/Grangeneuve, with features like pressure, wind, and humidity.
The datasets are split in a standard way, with 80% used for training and 20% for testing. For the deep learning models presented in the next section, the data are standardized to zero mean and unit variance using z-score normalization. In contrast, no normalization is applied to the statistical models, while for the foundation models, data normalization is handled internally by their respective implementations.
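As an illustration of this preprocessing, the snippet below sketches how one of the academic datasets can be loaded and prepared with the Darts library [6]; the `ETTh1Dataset` loader, the chronological 80/20 split, and the z-score scaling mirror the steps described above, while the variable names are our own and the sketch is not the exact code of our framework.

```python
from darts.datasets import ETTh1Dataset
from darts.dataprocessing.transformers import Scaler
from sklearn.preprocessing import StandardScaler

# Load the multivariate ETTh1 series (hourly electricity transformer data).
series = ETTh1Dataset().load()

# Chronological split: the first 80% for training, the last 20% for testing.
train, test = series.split_after(0.8)

# Deep learning models receive z-score-normalized data (zero mean, unit variance).
# Statistical models use the raw series; foundation models normalize internally.
scaler = Scaler(StandardScaler())
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
```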

2.2. Models

Table 2 presents all models we considered for our benchmark. The 16 models can be split into three categories: (1) statistical, (2) deep learning, and (3) foundational. Those models have different characteristics: some are able to model multivariate data, while others focus on univariate series; the number of parameters varies; and one statistical model is included as a naive baseline for reference.
The hyperparameters of the deep learning models are optimized using the Optuna library in Python [20]. While each model has its own specific set of tunable parameters, several common hyperparameters—such as output chunk length, learning rate and its scheduler, dropout rate, and batch size—were consistently optimized across all models.
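The sketch below illustrates such a tuning loop for one model, assuming the Darts `TiDEModel` and the series prepared in the previous sketch; the search spaces, the number of trials, and the simple hold-out validation are illustrative choices rather than the exact configuration of our benchmark.

```python
import optuna
from darts.metrics import mae
from darts.models import TiDEModel

def objective(trial: optuna.Trial) -> float:
    # Common hyperparameters tuned across all deep learning models.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])

    model = TiDEModel(
        input_chunk_length=512,   # context window used for all datasets
        output_chunk_length=96,   # main prediction horizon
        dropout=dropout,
        batch_size=batch_size,
        optimizer_kwargs={"lr": lr},
        n_epochs=20,
    )

    # Hold out the last 96 steps of the training series as a validation target.
    fit_part, val_part = train_scaled[:-96], train_scaled[-96:]
    model.fit(fit_part)
    forecast = model.predict(n=96)
    return mae(val_part, forecast)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```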

2.3. Metrics

The different metrics chosen for this benchmark must enable comparison across datasets and model characteristics (for instance, uni- vs. multivariate output). Therefore, scale-independent metrics such as sMAPE and NMAE were chosen over traditional metrics like MSE or MAE.
sMAPE (Symmetric Mean Absolute Percentage Error) evaluates the relative error as a percentage, symmetrically penalizing over- and under-predictions. sMAPE is calculated as shown in Equation (1). Unlike traditional MAPE, sMAPE avoids division by zero and ensures that the metric remains bounded, making it well-suited for datasets with values near zero.
NMAE (Normalized Mean Absolute Error) expresses the total absolute error relative to the total magnitude of the true values. It provides an interpretable, scale-independent measure of forecast accuracy, which is particularly useful for comparing performance across datasets with different units or scales. NMAE is computed as shown in Equation (2).
\mathrm{sMAPE} = \frac{200}{T} \sum_{t=1}^{T} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|} \quad (1)

\mathrm{NMAE} = \frac{\sum_{t=1}^{T} |y_t - \hat{y}_t|}{\sum_{t=1}^{T} |y_t|} \quad (2)
By normalizing the absolute error by the total observed magnitude, NMAE avoids scale-related issues and allows a fair evaluation across different datasets or targets.
For multivariate forecasting, metrics are computed per component and averaged to yield a single aggregated performance score across all target variables.
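A minimal NumPy sketch of both metrics, including the per-component averaging described above, is given below; the function names and array shapes are our own illustration and do not correspond to a specific implementation in our framework.

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE in percent (Equation (1)); arrays of shape (T,) or (T, n_components)."""
    num = np.abs(y_true - y_pred)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Guard against 0/0 when both the observation and the forecast are zero.
    ratio = np.divide(num, denom, out=np.zeros_like(num, dtype=float), where=denom != 0)
    # Mean over time per component, then mean over components.
    return float(200.0 * ratio.mean(axis=0).mean())

def nmae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized MAE (Equation (2)): total absolute error over total observed magnitude."""
    per_component = np.abs(y_true - y_pred).sum(axis=0) / np.abs(y_true).sum(axis=0)
    # For multivariate series, average the per-component scores into one value.
    return float(np.mean(per_component))
```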

2.4. Scenarios

Once the trio of models, datasets, and metrics is defined, different training scenarios are tested along two dimensions: (1) the amount of training data and (2) the prediction length.
The amount of training data varies from none (zero-shot), to a small portion (few-shot), to most of the dataset (full-shot). Statistical and deep learning models are trained in a full-shot setting on the datasets from Section 2.1. Foundation models are first evaluated in a zero-shot setting, then fine-tuned in few-shot scenarios with varying proportions of training data, and finally in a full-shot scenario. This dimension allows us to understand how effectively a model can capture patterns or characteristics within the data.
Regarding the prediction length, the lengths mentioned in Section 2.1 are considered (24, 48, 96, and 192 time-steps). This allows us to evaluate the ability of a model to capture dependencies as a function of time.
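To make the scenario grid concrete, the sketch below enumerates the combinations of training-data proportion and prediction horizon, reusing the training split from the earlier sketch; the assumption that the few-shot portion is taken from the most recent part of the training data is ours for illustration, as is the dictionary structure.

```python
# Scenario grid: training-data proportions (zero-, few-, full-shot) x horizons.
proportions = [0.0, 0.33, 0.67, 1.0]
horizons = [24, 48, 96, 192]

scenarios = []
for p in proportions:
    # Use the most recent fraction p of the training split for fine-tuning;
    # p == 0.0 corresponds to the zero-shot setting (no fine-tuning data).
    n_finetune = int(len(train) * p)
    finetune_series = train[-n_finetune:] if n_finetune > 0 else None
    for h in horizons:
        scenarios.append({"proportion": p, "horizon": h, "series": finetune_series})
```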

2.5. Evaluation Framework and Infrastructure

To create this benchmark, we developed an evaluation framework based on the onTime library [21]. This tool eases the development of all scenarios and ensures that experiments are executed in exactly the same way. An experiment is specified in a single Python file as a trio of datasets, models, and metrics. The framework is extensible via abstract classes, allowing a practitioner to provide a custom implementation.
While training is handled by each model’s implementation, evaluation is performed in a unified way across all models to ensure fair comparison. Specifically, a sliding window with a stride equal to the prediction horizon is used to generate multiple input–target samples from the test set. These are provided to the model after training, or directly in the case of zero-shot foundation models. Figure 1 illustrates this sampling process.
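A minimal sketch of this sampling scheme, operating on a NumPy array, is shown below; the function name and default values are our own, and the actual implementation is part of the onTime evaluation code.

```python
import numpy as np

def sliding_window_samples(test: np.ndarray, context: int = 512, horizon: int = 96):
    """Split a test series of shape (T, n_features) into (input, target) pairs.

    Each window of `context` steps is followed by `horizon` target steps, and the
    window advances with a stride equal to the prediction horizon (see Figure 1).
    """
    samples = []
    start = 0
    while start + context + horizon <= len(test):
        x = test[start : start + context]                       # model input (context)
        y = test[start + context : start + context + horizon]   # ground truth
        samples.append((x, y))
        start += horizon                                         # stride = horizon
    return samples
```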
In terms of infrastructure, all experiments are performed on a Slurm cluster (version 24.05.3) with the following NVIDIA GPUs (NVIDIA Corporation, Santa Clara, CA, USA): (1) RTX A6000 with 48 GB, (2) A40 with 48 GB, and (3) TITAN RTX with 24 GB. Once all models are trained, a single GPU (A40) is used to compute all inference times, allowing a fair comparison.

3. Results

In this section, we present the results of the benchmark described in Section 2. First, we report the performance of the baseline models and compare them to the foundation models in a zero-shot setting (Section 3.1). Next, we analyze the performance of the foundation models across varying prediction horizons (Section 3.2). Finally, we evaluate the foundation models in few-shot settings, considering different proportions of the training data (Section 3.3).

3.1. Predictions Across Models

The baseline results from Table 3 feature deep learning and statistical models. The results are consistent across metrics, and TiDE provides the best predictions in most cases, followed by Naïve Seasonal and AutoARIMA.
In Table 4, we observe that the foundation models generally perform better than the best baselines from Table 3. The best foundation models are Chronos Bolt Base and Chronos Large, followed by Chronos Tiny and TimesFM. For some datasets, the ranking is slightly unstable across the two metrics.
Figure 2 illustrates the trade-off between inference efficiency and forecasting error on the HEIA dataset. The bar chart (on the right) shows large differences in inference time across models: deep learning models are generally faster than foundation models, although lightweight variants like Chronos Bolt Tiny and Moirai Small also exhibit low inference times. In contrast, larger models such as Chronos Large and Moirai MoE Base are slower, as are AutoARIMA and Exponential Smoothing.
The scatter plot (on the left) highlights that foundation models offer the best accuracy while maintaining competitive latency. Chronos Bolt and Moirai models represent a favorable trade-off, achieving both low error and fast inference.

3.2. Prediction Horizons

In terms of prediction horizon, Table 5 shows the models’ ability to predict at different horizons. The best models align with those from Table 4. The expected result is that the longer the prediction horizon, the more difficult the prediction. However, the scores do not always confirm this statement, and the deviations can be interpreted as instabilities that are difficult to characterize.

3.3. Few-Shot Learning with Various Data Proportions

Finally, we measure the gain in score as a function of the proportion of the training data seen by the model. Table 6 reports forecasting performance for the 0% (zero-shot), 33%, 67%, and 100% (full-shot) scenarios. The results indicate that fine-tuning provides added value in most cases. However, Chronos Large appears to benefit less from fine-tuning than the other models. Additionally, fine-tuning does not yield improvements when the data has a more random character, which could indicate a data issue.

4. Discussion

In the context of this benchmark, we asked ourselves the following question: “What is the performance of foundation models for time series forecasting and analysis on industrial datasets, particularly in the context of few-shot and zero-shot learning for specific use cases?”. To answer it, we performed an extensive analysis of the models listed in Table 2 with an evaluation procedure covering multiple datasets, prediction lengths, and training data quantities.
The results presented in the previous section show the superiority of foundation models over traditional methods in various contexts. First, as seen in Table 4, foundation models mostly provide better metrics on the prediction tasks. Second, as seen in Figure 2, when their size remains small, their inference speed is on par with the fastest traditional models. Third, when used in zero-shot settings, they do not require model-specific retraining.
However, such models are not a solution to all predictive needs. With sizes ranging from roughly 8 M to over 700 M parameters, their larger versions are computationally intensive and slow, limiting applications in resource-constrained contexts. Nevertheless, the smallest versions provide strong predictive performance compared to most baseline models (see Figure 2) in a very portable package, allowing for numerous applied usages.
When benchmarking, we noted interesting features of foundation models regarding (1) their prediction stability, (2) their prediction length performance, and (3) their fine-tuning process.
Regarding prediction stability, when looking at the metrics across datasets, foundation models appear to be more frequently specialized for certain datasets (see Chronos Bolt Base), whereas baseline models (see TiDE and NaiveSeasonal) obtained strong performance on about half of the datasets. This seems slightly counterintuitive, since one of the basic premises of foundation models is to perform well across different datasets.
Concerning prediction length, irregular behavior has been observed across different horizons. Longer predictions are usually harder to make, yet in Table 5 some errors actually decreased at longer horizons. The weather dataset is a particularly special case, which may be due to the sampling length resulting in samples that are difficult to predict.
With respect to fine-tuning, such a process requires quite extensive knowledge and computation, and can deliver a wide range of results depending on the chosen dataset; e.g., marginal gains on the HEIA dataset and significant ones on the MeteoSwiss and ZurichElectricity datasets. Therefore, it should be conducted carefully and is, for now, limited to practitioners with access to substantial computational resources.
Finally, we cannot say that one foundation model is better than all others, but given our quantitative evaluation, Chronos generally seems to perform better. However, the testing of such models should also be conducted with a qualitative mindset and visual analysis, for instance, to validate their performance together with applied experts.

5. Conclusions

This study benchmarks foundation models for industrial time-series forecasting, highlighting their superior zero-shot accuracy compared to traditional methods. Compact models, such as Chronos Bolt Tiny, provide an optimal balance of speed and performance suitable for resource-constrained scenarios. However, larger models require substantial computational resources, and their performance varies across datasets and prediction horizons. Future work should focus on improving model stability, simplifying fine-tuning processes, and incorporating qualitative assessments to enhance practical usability.

Author Contributions

Conceptualization, B.P., F.M. and B.W.; methodology, B.P. and F.M.; software, B.P. and F.M.; validation, F.M. and B.W.; formal analysis, B.P.; investigation, B.P. and F.M.; resources, B.P., F.M. and B.W.; data curation, B.P.; writing—original draft preparation, F.M. and B.P.; writing—review and editing, F.M., B.P., B.W.; visualization, B.P.; supervision, B.W., F.M.; project administration, B.W.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded through a research grant of the University of Applied Sciences and Arts of Western Switzerland, HES-SO.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Section 2.1 presents all datasets used in this study. The datasets Energy, ETTh1, ETTm1, ExchangeRate, Weather, and ZurichElectricity are available in the Darts Python library [6]. The MeteoSwiss dataset is available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Shchur, O.; Rangapuram, S.S.; Pineda Arango, S.; Kapoor, S.; et al. Chronos: Learning the Language of Time Series. arXiv 2024, arXiv:2403.07815.
  2. Liu, X.; Liu, J.; Woo, G.; Aksu, T.; Liang, Y.; Zimmermann, R.; Liu, C.; Savarese, S.; Xiong, C.; Sahoo, D. Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts. arXiv 2024, arXiv:2410.10469.
  3. Das, A.; Kong, W.; Sen, R.; Zhou, Y. A decoder-only foundation model for time-series forecasting. arXiv 2024, arXiv:2310.10688.
  4. Aksu, T.; Woo, G.; Liu, J.; Liu, X.; Liu, C.; Savarese, S.; Xiong, C.; Sahoo, D. GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation. arXiv 2024, arXiv:2410.10393.
  5. Zhang, J.; Wen, X.; Zhang, Z.; Zheng, S.; Li, J.; Bian, J. ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons. arXiv 2024, arXiv:2310.07446.
  6. Herzen, J.; Lässig, F.; Piazzetta, S.G.; Neuer, T.; Tafti, L.; Raille, G.; Pottelbergh, T.V.; Pasieka, M.; Skrodzki, A.; Huguenin, N.; et al. Darts: User-Friendly Modern Machine Learning for Time Series. J. Mach. Learn. Res. 2022, 23, 5442–5447.
  7. Energy Consumption, Generation, Prices and Weather. 2019. Available online: https://www.kaggle.com/datasets/nicholasjhana/energy-consumption-generation-prices-and-weather (accessed on 17 November 2024).
  8. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, 2–9 February 2021; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 11106–11115.
  9. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104.
  10. Max Planck Institute for Biogeochemistry. Weather Data from the Max Planck Institute for Biogeochemistry, Jena, Germany. 2025. Available online: https://www.bgc-jena.mpg.de/wetter/ (accessed on 16 May 2025).
  11. Umwelt- und Gesundheitsschutz Zürich. Stündlich aktualisierte Meteodaten, seit 1992. 2025. Available online: https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte (accessed on 16 May 2025).
  12. Elektrizitätswerk der Stadt Zürich. Viertelstundenwerte des Stromverbrauchs in den Netzebenen 5 und 7 in der Stadt Zürich, seit 2015. 2025. Available online: https://data.stadt-zuerich.ch/dataset/ewz_stromabgabe_netzebenen_stadt_zuerich (accessed on 16 May 2025).
  13. MeteoSwiss. Federal Office of Meteorology and Climatology. 2024. Available online: https://www.meteoswiss.admin.ch/ (accessed on 18 November 2024).
  14. Hyndman, R.J.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw. 2008, 27, 1–22.
  15. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
  16. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term Forecasting with TiDE: Time-series Dense Encoder. arXiv 2024, arXiv:2304.08424.
  17. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764.
  18. Chen, S.A.; Li, C.L.; Yoder, N.; Arik, S.O.; Pfister, T. TSMixer: An All-MLP Architecture for Time Series Forecasting. arXiv 2023, arXiv:2303.06053.
  19. Woo, G.; Liu, C.; Kumar, A.; Xiong, C.; Savarese, S.; Sahoo, D. Unified Training of Universal Time Series Forecasting Transformers. arXiv 2024, arXiv:2402.02592.
  20. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
  21. ontime.re. onTime: Your Library to Work with Time Series. GitHub Repository. 2024. Available online: https://github.com/ontime-re/ontime (accessed on 18 November 2024).
Figure 1. Sliding window sampling on the test set. Each sample includes an input (context) and a target (ground truth for the prediction horizon).
Figure 2. Comparison of forecasting models’ trade-off between inference time (log scale) and forecasting accuracy measured by NMAE on the HEIA dataset.
Table 1. All datasets used to train the models. These datasets cover a wide range of use cases.

Dataset | Type | # Features | Resolution | # Target Features | Size
Energy [7] | Academic | 20 | 1 h | 1 (Total load) | 35,064
ETTh1 [8] | Academic | 7 | 1 h | 1 (Oil temp.) | 17,420
ETTm1 [8] | Academic | 7 | 15 min | 1 (Oil temp.) | 69,680
ExchangeRate [9] | Academic | 8 | 1 day | 8 (All) | 7588
Weather [10] | Academic | 21 | 10 min | 21 (All) | 52,704
ZurichElectricity [11,12] | Academic | 10 | 15 min | 2 (Consumption) | 93,409
HEIA | Industrial | 8 | 1 h | 8 (All) | 11,664
MeteoSwiss [13] | Industrial | 8 | 10 min | 24 (All) | 105,264
Table 2. Overview of all models used for the benchmark.

Model | Type | Prediction Type | # Parameters
NaïveSeasonal | Statistical | Univariate | Not applicable
AutoARIMA [14] | Statistical | Univariate | <100
ExponentialSmoothing | Statistical | Univariate | Not applicable
GRU [15] | Deep learning | Multivariate | 3–160 K
TiDE [16] | Deep learning | Multivariate | 285 K–8.5 M
TFT [17] | Deep learning | Multivariate | 3–36 K
TSMixer [18] | Deep learning | Multivariate | 19–931 K
Chronos Tiny † [1] | Foundation | Univariate | 8 M
Chronos Large † [1] | Foundation | Univariate | 710 M
Chronos Bolt Small [1] | Foundation | Univariate | 48 M
Chronos Bolt Base [1] | Foundation | Univariate | 205 M
Moirai Small † [19] | Foundation | Multivariate | 14 M
Moirai Large † [19] | Foundation | Multivariate | 311 M
Moirai MoE Small [2] | Foundation | Multivariate | 117 M
Moirai MoE Base [2] | Foundation | Multivariate | 935 M
TimesFM [3] | Foundation | Univariate | 500 M
† These models are also fine-tuned.
Table 3. Forecasting performance of baseline models on a prediction horizon of 96 time steps. The best score per row is shown in bold; the second-best is underlined.

Dataset | Metric | GRU | TFT | TiDE | TSMixer | NaiveSeasonal | AutoARIMA | ES
Energy | sMAPE | 13.49 | 15.25 | 8.102 | 9.724 | 20.56 | 13.34 | 22.67
Energy | NMAE | 0.1332 | 0.1507 | 0.0825 | 0.0952 | 0.1946 | 0.1318 | 0.2215
ETTh1 | sMAPE | 35.00 | 47.15 | 40.11 | 38.13 | 34.49 | 33.60 | 35.80
ETTh1 | NMAE | 0.4003 | 0.6310 | 0.4097 | 0.5261 | 0.3627 | 0.3526 | 0.3781
ETTm1 | sMAPE | 32.11 | 41.53 | 55.02 | 26.05 | 22.27 | 23.59 | 24.55
ETTm1 | NMAE | 0.3190 | 0.6568 | 1.040 | 0.2950 | 0.2325 | 0.2369 | 0.2462
ExchangeRate | sMAPE | 15.55 | 18.06 | 11.45 | 15.34 | 2.336 | 2.514 | 2.537
ExchangeRate | NMAE | 0.1428 | 0.1631 | 0.1049 | 0.1416 | 0.0234 | 0.0249 | 0.0252
Weather | sMAPE | 79.09 | 81.48 | 62.53 | 65.67 | 54.58 | 61.77 | 66.82
Weather | NMAE | 250.0 | 223.5 | 50.21 | 39.41 | 1.073 | 10.85 | 38.16
ZurichElectricity | sMAPE | 14.20 | 21.87 | 5.395 | 8.260 | 18.53 | 18.49 | 23.39
ZurichElectricity | NMAE | 0.1401 | 0.2160 | 0.0548 | 0.0835 | 0.1871 | 0.1858 | 0.2416
HEIA | sMAPE | 42.26 | 52.47 | 29.56 | 39.48 | 39.99 | 41.65 | 31.28
HEIA | NMAE | 0.5380 | 0.6706 | 0.3240 | 0.5422 | 0.4182 | 0.4445 | 0.3606
MeteoSwiss | sMAPE | 75.64 | 89.08 | 64.64 | 80.11 | 68.98 | 71.96 | 72.76
MeteoSwiss | NMAE | 1.641 | 2.170 | 1.085 | 1.810 | 2.152 | 1.768 | 2.664
ES = Exponential Smoothing.
Table 4. Forecasting performance of foundation models on a prediction horizon of 96 time steps, evaluated in a zero-shot setting. The best baseline scores from Table 3 are also reported. The best score per row is in bold; the second-best is underlined.

Dataset | Metric | Moirai Small | Moirai Large | Moirai-MoE Small | Moirai-MoE Base | Chronos Tiny | Chronos Large | Chronos Bolt Tiny | Chronos Bolt Base | TimesFM | Best baseline
Energy | sMAPE | 7.360 | 7.134 | 7.088 | 7.062 | 7.508 | 4.854 | 6.183 | 5.095 | 7.035 | 8.102
Energy | NMAE | 0.0744 | 0.0718 | 0.0719 | 0.0718 | 0.0751 | 0.0491 | 0.0622 | 0.0511 | 0.0708 | 0.0825
ETTh1 | sMAPE | 32.64 | 34.61 | 33.25 | 34.04 | 31.15 | 30.68 | 31.95 | 30.56 | 31.40 | 33.60
ETTh1 | NMAE | 0.3382 | 0.3344 | 0.3333 | 0.3370 | 0.3242 | 0.3217 | 0.3408 | 0.3334 | 0.3231 | 0.3526
ETTm1 | sMAPE | 23.66 | 24.64 | 24.78 | 23.64 | 22.95 | 21.61 | 21.58 | 22.40 | 22.45 | 22.27
ETTm1 | NMAE | 0.2461 | 0.2658 | 0.2569 | 0.2488 | 0.2347 | 0.2303 | 0.2346 | 0.2329 | 0.2521 | 0.2325
ExchangeRate | sMAPE | 2.505 | 2.687 | 2.470 | 2.535 | 2.714 | 2.583 | 2.412 | 2.552 | 2.565 | 2.336
ExchangeRate | NMAE | 0.0251 | 0.0273 | 0.0248 | 0.0255 | 0.0270 | 0.0260 | 0.0242 | 0.0255 | 0.0256 | 0.0234
Weather | sMAPE | 64.20 | 64.43 | 62.01 | 59.47 | 63.47 | 61.83 | 62.40 | 61.81 | 45.05 | 54.58
Weather | NMAE | 2.663 | 9.052 | 11.05 | 5.192 | 13.03 | 0.4987 | 6.455 | 4.629 | 1.243 | 1.073
ZurichElectricity | sMAPE | 17.92 | 18.06 | 17.30 | 15.63 | 8.368 | 6.119 | 5.635 | 4.177 | 7.292 | 5.395
ZurichElectricity | NMAE | 0.1769 | 0.1804 | 0.1727 | 0.1540 | 0.0872 | 0.0639 | 0.0592 | 0.0440 | 0.0768 | 0.0548
HEIA | sMAPE | 22.74 | 20.89 | 23.31 | 20.20 | 23.33 | 20.73 | 20.37 | 19.71 | 20.67 | 29.56
HEIA | NMAE | 0.2570 | 0.2423 | 0.2670 | 0.2321 | 0.2687 | 0.2411 | 0.2343 | 0.2294 | 0.2343 | 0.3240
MeteoSwiss | sMAPE | 67.72 | 66.98 | 66.14 | 66.49 | 67.99 | 62.95 | 67.58 | 65.18 | 62.74 | 64.64
MeteoSwiss | NMAE | 0.8146 | 1.488 | 2.442 | 1.173 | 1.657 | 1.662 | 0.9932 | 1.377 | 0.9777 | 1.085
Table 5. Forecasting performance of foundation models on different prediction horizons, evaluated in a zero-shot setting. NMAE is reported. The best score per row is in bold; the second-best is underlined.

Dataset | Horizon | Moirai Small | Moirai Large | Moirai-MoE Small | Moirai-MoE Base | Chronos Tiny | Chronos Large | Chronos Bolt Tiny | Chronos Bolt Base | TimesFM
Energy | 24 | 0.0643 | 0.0592 | 0.0591 | 0.0528 | 0.0563 | 0.0338 | 0.0503 | 0.0385 | 0.0589
Energy | 48 | 0.0719 | 0.0678 | 0.0678 | 0.0628 | 0.0678 | 0.0414 | 0.0587 | 0.0442 | 0.0676
Energy | 96 | 0.0744 | 0.0718 | 0.0719 | 0.0718 | 0.0751 | 0.0491 | 0.0622 | 0.0511 | 0.0708
Energy | 192 | 0.0773 | 0.0755 | 0.0746 | 0.0802 | 0.0754 | 0.0529 | 0.0646 | 0.0556 | 0.0734
ETTh1 | 24 | 0.2048 | 0.2104 | 0.1970 | 0.2021 | 0.2011 | 0.2074 | 0.1999 | 0.1944 | 0.2077
ETTh1 | 48 | 0.2619 | 0.2713 | 0.2503 | 0.2547 | 0.2447 | 0.2490 | 0.2495 | 0.2517 | 0.2536
ETTh1 | 96 | 0.3382 | 0.3344 | 0.3333 | 0.3370 | 0.3242 | 0.3217 | 0.3408 | 0.3334 | 0.3231
ETTh1 | 192 | 0.2883 | 0.2969 | 0.3063 | 0.2905 | 0.2849 | 0.2838 | 0.3070 | 0.2870 | 0.2784
ETTm1 | 24 | 0.1616 | 0.1967 | 0.1752 | 0.1755 | 0.1796 | 0.1446 | 0.1479 | 0.1443 | 0.1643
ETTm1 | 48 | 0.2515 | 0.2742 | 0.2574 | 0.2571 | 0.2382 | 0.2166 | 0.2397 | 0.2307 | 0.2594
ETTm1 | 96 | 0.2461 | 0.2658 | 0.2569 | 0.2488 | 0.2347 | 0.2303 | 0.2346 | 0.2329 | 0.2521
ETTm1 | 192 | 0.2917 | 0.3094 | 0.3055 | 0.3007 | 0.2950 | 0.2870 | 0.3008 | 0.3017 | 0.2977
ExchangeRate | 24 | 0.0138 | 0.0133 | 0.0128 | 0.0130 | 0.0140 | 0.0136 | 0.0131 | 0.0136 | 0.0132
ExchangeRate | 48 | 0.0177 | 0.0179 | 0.0171 | 0.0174 | 0.0186 | 0.0184 | 0.0170 | 0.0182 | 0.0180
ExchangeRate | 96 | 0.0251 | 0.0273 | 0.0248 | 0.0255 | 0.0270 | 0.0260 | 0.0242 | 0.0255 | 0.0256
ExchangeRate | 192 | 0.0350 | 0.0472 | 0.0394 | 0.0397 | 0.0406 | 0.0375 | 0.0340 | 0.0343 | 0.0346
Weather | 24 | 0.8605 | 0.4102 | 0.5990 | 0.8083 | 2.260 | 0.3976 | 2.760 | 0.6189 | 0.6271
Weather | 48 | 2.915 | 1.317 | 6.538 | 7.237 | 1.779 | 0.9095 | 6.655 | 1.919 | 1.834
Weather | 96 | 2.663 | 9.052 | 11.05 | 5.192 | 13.03 | 0.4987 | 6.455 | 4.629 | 1.243
Weather | 192 | 0.5276 | 0.6359 | 0.8007 | 0.7821 | 0.7816 | 0.5519 | 0.6735 | 0.5559 | 0.6298
ZurichElectricity | 24 | 0.0883 | 0.0758 | 0.0721 | 0.0596 | 0.0339 | 0.0244 | 0.0292 | 0.0233 | 0.0316
ZurichElectricity | 48 | 0.1576 | 0.1413 | 0.1403 | 0.1076 | 0.0501 | 0.0297 | 0.0348 | 0.0289 | 0.0441
ZurichElectricity | 96 | 0.1769 | 0.1804 | 0.1727 | 0.1540 | 0.0872 | 0.0639 | 0.0592 | 0.0440 | 0.0768
ZurichElectricity | 192 | 0.1757 | 0.1838 | 0.1752 | 0.1621 | 0.1056 | 0.0825 | 0.0667 | 0.0501 | 0.0926
HEIA | 24 | 0.2220 | 0.1959 | 0.2131 | 0.1924 | 0.2123 | 0.1837 | 0.1998 | 0.1937 | 0.2013
HEIA | 48 | 0.2459 | 0.2177 | 0.2479 | 0.2111 | 0.2322 | 0.2086 | 0.2173 | 0.2098 | 0.2206
HEIA | 96 | 0.2570 | 0.2423 | 0.2670 | 0.2321 | 0.2687 | 0.2411 | 0.2343 | 0.2294 | 0.2343
HEIA | 192 | 0.2687 | 0.2659 | 0.2807 | 0.2467 | 0.2762 | 0.2540 | 0.2437 | 0.2408 | 0.2507
MeteoSwiss | 24 | 0.6949 | 0.8155 | 0.8457 | 0.6222 | 0.7558 | 0.7167 | 0.6576 | 0.8176 | 0.6057
MeteoSwiss | 48 | 1.153 | 1.441 | 1.771 | 1.016 | 1.457 | 1.297 | 1.045 | 1.346 | 0.9871
MeteoSwiss | 96 | 0.8146 | 1.488 | 2.442 | 1.173 | 1.657 | 1.662 | 0.9932 | 1.377 | 0.9777
MeteoSwiss | 192 | 1.226 | 1.530 | 1.850 | 1.898 | 1.584 | 1.374 | 1.295 | 1.231 | 1.336
Table 6. Forecasting performance of fine-tuned foundation models using different proportions of available training data, from 0% (zero-shot) to 100% (full-shot). NMAE is reported. The best score for a given dataset and model is shown in bold. The overall best score for each dataset across all models is highlighted in bold red, and the second-best is underlined in red.

Dataset | Proportion | Moirai Small | Moirai Large | Chronos Tiny | Chronos Large
Energy | 0% | 0.0744 | 0.0718 | 0.0751 | 0.0491
Energy | 33% | 0.0727 | 0.0704 | 0.0649 | 0.0549
Energy | 67% | 0.0667 | 0.0658 | 0.0594 | 0.0495
Energy | 100% | 0.0687 | 0.0661 | 0.0599 | 0.0445
ETTh1 | 0% | 0.3382 | 0.3344 | 0.3242 | 0.3217
ETTh1 | 33% | 0.3146 | 0.3307 | 0.3315 | 0.3161
ETTh1 | 67% | 0.3173 | 0.3167 | 0.3157 | 0.3272
ETTh1 | 100% | 0.3127 | 0.3346 | 0.3099 | 0.3460
ETTm1 | 0% | 0.2461 | 0.2658 | 0.2347 | 0.2303
ETTm1 | 33% | 0.2404 | 0.3616 | 0.2459 | 0.2540
ETTm1 | 67% | 0.2686 | 0.3693 | 0.2484 | 0.2754
ETTm1 | 100% | 0.2437 | 0.2558 | 0.2198 | 0.2308
ExchangeRate | 0% | 0.0251 | 0.0273 | 0.0270 | 0.0260
ExchangeRate | 33% | 0.0282 | 0.0694 | 0.0319 | 0.0319
ExchangeRate | 67% | 0.0250 | 0.0360 | 0.0285 | 0.0299
ExchangeRate | 100% | 0.0285 | 0.0779 | 0.0251 | 0.0302
Weather | 0% | 2.663 | 9.052 | 13.03 | 0.4987
Weather | 33% | 3.793 | 6.041 | 2.053 | 0.9532
Weather | 67% | 6.670 | 2.425 | 1.670 | 3.274
Weather | 100% | 2.459 | 3.703 | 8.644 | 5.433
ZurichElectricity | 0% | 0.1769 | 0.1804 | 0.0872 | 0.0639
ZurichElectricity | 33% | 0.0482 | 0.0475 | 0.0326 | 0.0295
ZurichElectricity | 67% | 0.0419 | 0.0526 | 0.0323 | 0.0277
ZurichElectricity | 100% | 0.0499 | 0.0535 | 0.0333 | 0.0265
HEIA | 0% | 0.2570 | 0.2423 | 0.2687 | 0.2411
HEIA | 33% | 0.2631 | 0.2736 | 0.3271 | 0.2574
HEIA | 67% | 0.2991 | 0.2920 | 0.2981 | 0.2488
HEIA | 100% | 0.2653 | 0.2728 | 0.2494 | 0.2420
MeteoSwiss | 0% | 0.8146 | 1.488 | 1.657 | 1.662
MeteoSwiss | 33% | 0.5650 | 0.6690 | 0.6111 | 0.9706
MeteoSwiss | 67% | 0.4954 | 0.4532 | 1.446 | 0.7739
MeteoSwiss | 100% | 0.5396 | 0.5316 | 0.7382 | 0.6285