4.1. Experimental Settings
To validate the effectiveness of this framework, we conducted comprehensive experiments on six datasets covering mechanical systems (ETT), energy (Electricity), traffic (Traffic), weather (Weather), economics (Exchange), and disease (ILI). These datasets are publicly available and have been extensively used for TSF [2]. Each dataset was partitioned into training, validation, and test sets in a 7:1:2 ratio, following the chronological order of the time series.
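The chronological 7:1:2 partition can be sketched as follows. This is a minimal illustration rather than the authors' pipeline; the function name `chronological_split` and the integer rounding of the boundaries are assumptions made for the example.

```python
def chronological_split(series, ratios=(7, 1, 2)):
    """Split a sequence into train/val/test blocks without shuffling.

    The boundaries are computed with integer division (an assumption);
    earlier points go to training, the most recent ones to the test set.
    """
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


train, val, test = chronological_split(list(range(100)))
print(len(train), len(val), len(test))  # → 70 10 20
```

Because no shuffling takes place, every test point lies later in time than every training point, which avoids leaking future information into training.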
In our implementation of StreamTS, we adopted deepseek-math-7b-rl as the trend backbone and a fully connected layer as the linear model for seasonal series prediction. The deepseek-math-7b-rl model is a relatively small LLM that served as the base model for the top-4 teams at the Artificial Intelligence Mathematical Olympiad (AIMO).
We compared StreamTS with the latest models for TSF, including the linear-based model DLinear [32], the CNN-based models TimesNet [39] and MICN [40], the Transformer-based models PatchTST [27], Autoformer [2], and Nonstationary [41], and the frequency-based model FiLM [15]. To ensure a fair comparison, we applied the same experimental configuration to all models within a unified training pipeline. All deep learning models were implemented in PyTorch 2.2.2 and trained on NVIDIA A100 80 GB. The Adam optimizer was used with decay rates
, and initial learning rates were selected from
. We adopted a cosine annealing learning rate schedule with
and
. Batch size was set to 16. In our model, the patch size of Patch Reprogramming was set to 16, and the window size for decomposition was fixed at 25, following the setting in Autoformer to ensure fairness in experimental comparisons.
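The moving-average decomposition with a window of 25 can be sketched as below. This is a hedged illustration, not the StreamTS code: the edge-replication padding mirrors the commonly used Autoformer-style decomposition and is an assumption here, as are the function names.

```python
import math


def moving_average(x, window=25):
    """Centered moving average; edges are padded by repeating the end values."""
    half = (window - 1) // 2  # assumes an odd window, e.g. 25
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(x))]


def decompose(x, window=25):
    """Return (trend, seasonal) with trend + seasonal equal to x."""
    trend = moving_average(x, window)
    seasonal = [v - t for v, t in zip(x, trend)]
    return trend, seasonal


# Synthetic series: slow linear drift plus a 24-step cycle.
series = [0.05 * t + math.sin(2 * math.pi * t / 24) for t in range(96)]
trend, seasonal = decompose(series)
print(len(trend), len(seasonal))
```

In this sketch the trend would be passed to the LLM branch and the seasonal remainder to the linear branch, matching the division of labor described in the text.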
The experiments were designed from six perspectives: (1) full-shot forecasting, assessing model performance over different prediction lengths when the entire training set is used; (2) few-shot forecasting, verifying model performance when only a fraction of the training set is used; (3) zero-shot forecasting, demonstrating the potential of StreamTS in TSF without training any parameters; (4) comparisons with LLM time series models, analyzing accuracy and time efficiency against advanced LLM-based time series forecasting models; (5) model analysis, analyzing the attention mechanism underlying the LLM and assessing the effects of time series decomposition; and (6) parametric analysis, exploring the impact of input length. Each experiment was repeated twice on the test sets to check the reliability of the results. MAE and MSE were used as the evaluation metrics; lower values indicate better performance.
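For reference, the two evaluation metrics can be computed as in the following minimal sketch. The function names and the toy values are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(mae(y_true, y_pred))  # → 0.5
print(round(mse(y_true, y_pred), 4))  # → 0.4167
```

MSE penalizes large deviations more heavily than MAE, so reporting both gives a view of typical error as well as sensitivity to outliers.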
4.2. Full-Shot Forecasting
All of the models take the input time series length L = 512 and the prediction length for the ILI dataset and for other datasets.
The experimental results are presented in Table 1. Overall, StreamTS achieves the best aggregate performance across the six datasets, winning 34 out of 50 comparative evaluations. In terms of average MAE/MSE, our model ranks first or second on most datasets, the Electricity dataset being an exception. Specifically, among the Transformer-based models, PatchTST achieves SOTA performance in TSF, while our model’s average MSE across the six datasets is lower than that of PatchTST by
. Similarly, compared to the linear-based model DLinear and CNN-based model TimesNet, our model achieves a reduction of
and
, respectively, on the average MSE. FiLM achieves the best result in eight of the comparative evaluations.
On the ILI and Exchange datasets, where the temporal dynamics are highly volatile, our current sliding-average decomposition with a fixed window size struggles to effectively capture the periodic patterns, leading to weaker forecasting performance. FiLM adopts a frequency-domain perspective and incorporates denoising strategies, which allow it to better extract essential temporal patterns and achieve more robust long-term forecasts under such challenging conditions. Inspired by this, exploring frequency-domain decomposition as an alternative or complementary approach represents a promising direction for our future work. In Figure 3, we visualize the full-shot forecasting results by StreamTS on four datasets with different data characteristics. The visualizations reveal that the predicted values closely match the ground truth, suggesting high accuracy across diverse scenarios.
4.4. Zero-Shot Forecasting
Zero-shot forecasting refers to the task of applying a forecaster trained on a source dataset directly to an unseen target dataset, without any additional training or fine-tuning [13]. Recent research on LLM-based TSF has, to some extent, explored the potential of LLMs as effective zero-shot reasoners. These studies often evaluate zero-shot forecasting ability in transfer settings where the source and target datasets share similar temporal or domain characteristics; e.g., Time-LLM [16] was trained on ETTh1 and tested on ETTh2. Such a setting involves only a minor shift in data distribution and embeds implicit domain knowledge, and thus offers limited insight into the true generalization capabilities of these models.
To address this limitation, we propose a more rigorous and practical setting for zero-shot time series forecasting that directly leverages pre-trained LLMs without any task-specific training. Our approach relies purely on the pretraining knowledge embedded in LLMs and does not involve gradient-based learning on time series data. This design is particularly valuable in real-world scenarios where access to reliable, relevant, and well-structured domain-specific data is limited. Building on this idea, we further introduce a hybrid method, LLM-ARIMA, for zero-shot time series prediction. As illustrated in Figure 1, the overall architecture remains unchanged; we only replace the Linear module with the traditional statistical method ARIMA, while all other components (Instance Norm, Series Decomp, BC-Prompt, LLM, Out Projection, and Reverse Instance Norm) are kept intact. LLM-ARIMA thus combines the language understanding capabilities of LLMs with the inductive bias of classical statistical models. To ensure full automation and maintain the inference-only nature of the pipeline, we apply an off-the-shelf statistical selection process to automatically determine the optimal ARIMA parameters (p, d, q) based on the Akaike Information Criterion (AIC) [3]. The pipeline therefore requires no manual configuration or downstream fine-tuning and introduces no trainable parameters. Our method pushes the boundary of zero-shot forecasting and explores the intersection of foundation models and traditional time series analysis.
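The idea of AIC-based order selection can be illustrated with a deliberately simplified stand-in. The sketch below searches only the autoregressive order p (no differencing d and no moving-average order q), fits each candidate with the Yule-Walker equations via the Levinson-Durbin recursion, and keeps the order with the lowest AIC. It is not the off-the-shelf procedure used in the paper; a full (p, d, q) search would normally rely on a statistics library. All function names are assumptions.

```python
import math
import random


def autocovariance(x, lag):
    """Biased sample autocovariance at the given lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n


def select_ar_order(x, max_p=5):
    """Return (best_p, coefficients) minimizing AIC over AR(0..max_p).

    AIC is approximated as n * log(sigma2) + 2 * p, where sigma2 is the
    innovation variance given by the Levinson-Durbin recursion.
    Assumes x is not constant (sigma2 > 0).
    """
    n = len(x)
    r = [autocovariance(x, k) for k in range(max_p + 1)]
    sigma2 = r[0]  # innovation variance of the AR(0) model
    phi = []
    best_aic, best_p, best_phi = n * math.log(sigma2), 0, []
    for p in range(1, max_p + 1):
        # Reflection coefficient for order p.
        k = (r[p] - sum(phi[j] * r[p - 1 - j] for j in range(p - 1))) / sigma2
        # Update the lower-order coefficients, then append the new one.
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2 *= 1.0 - k * k
        aic = n * math.log(sigma2) + 2 * p
        if aic < best_aic:
            best_aic, best_p, best_phi = aic, p, phi
    return best_p, best_phi


# Synthetic AR(1) series with coefficient 0.8 (illustrative data only).
rng = random.Random(0)
series = [0.0]
for _ in range(499):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

order, coeffs = select_ar_order(series)
print(order, [round(c, 2) for c in coeffs])
```

The penalty term 2p is what keeps the search from always choosing the largest order: an extra coefficient is accepted only if it reduces the innovation variance enough to offset the penalty.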
To validate the feasibility of our proposed method, we conduct a series of experiments on widely recognized benchmark datasets in the field of TSF. Specifically, we select ETTh1, Electricity, and Traffic, which represent diverse temporal patterns and complexity levels, to ensure comprehensive evaluation across different real-world scenarios. For each dataset, we adopt a consistent evaluation protocol: the last 100 time steps are used as historical input, and the model is required to forecast the subsequent 20 steps (short horizons) and 50 steps (long horizons). This setting enables a fair comparison across models and highlights the robustness of our zero-shot framework under various data conditions.
Our method is compared against several strong baselines. First, we include GPT4S [1], an LLM-based method that applies prompt engineering for zero-shot forecasting. Second, we include classical time series forecasting models, ARIMA and Prophet. These comparisons provide a comprehensive understanding of how our LLM-based approach performs across traditional, neural, and hybrid paradigms under truly zero-shot conditions.
Table 3 shows the performance of each model for both short- and long-horizon prediction across the datasets. The results demonstrate that traditional statistical learning methods, when combined with pre-trained LLMs, can still achieve competitive results. In Figure 5, we present a visualization of the LLM-ARIMA predictions (i.e., the LLM equipped with ARIMA), illustrating this synergy. These findings highlight the feasibility of applying LLMs in zero-shot time series forecasting scenarios.
4.5. Comparisons with LLM-Based Time Series Forecasting
Effect of LLM Variants. To assess whether the strong performance of the proposed StreamTS framework stems primarily from the choice of the underlying LLM, we conducted experiments using different pre-trained LLMs, paired with the same input and output modules. The compared models include DeepSeek-Math-7B-RL (the base LLM in our framework), DeepSeek-Math-7B-Instruct (a variant from the DeepSeek-Math family designed for broader knowledge coverage; we also replaced the base model in our framework with this variant), and Qwen2-0.5B-Instruct (a lightweight model from the Qwen2 family). Because the real-world series we aim to predict often change drastically, this comparison is most informative on datasets with irregular changes. We therefore chose Exchange, a dataset that collects the daily exchange rates of 8 countries from 1990 to 2016. Each LLM, paired with a standard input–output module, forecasts horizons of 96, 192, 336, and 720 from an input length of 512. The results are presented in Table 4, where Deepseek-rl refers to DeepSeek-Math-7B-RL, Deepseek-instruct corresponds to DeepSeek-Math-7B-Instruct, and Qwen2-instruct denotes Qwen2-0.5B-Instruct. StreamTS (Instruct) represents our proposed framework with DeepSeek-Math-7B-Instruct as the base LLM.
The performances of DeepSeek-Math-7B-RL and DeepSeek-Math-7B-Instruct are nearly identical, whether they are used as standalone LLM predictors or as the backbone within our framework. In contrast, Qwen2-0.5B-Instruct has a significantly different architecture from the DeepSeek-Math family. Nevertheless, all three LLM-only models are consistently outperformed by the full StreamTS framework.
Figure 6 shows that both the DeepSeek and Qwen models deviate substantially from the true values in the second half of the prediction. We conclude that a pre-trained LLM alone struggles to produce good predictions for TSF tasks. Complex time series are often composed of multiple sub-sequences, which are difficult for an LLM to predict jointly. For this reason, decomposing the time series into smoother sequences reduces the prediction difficulty for the LLM.
The consistent results between the math-focused model DeepSeek-Math-7B-RL and the more general-purpose DeepSeek-Math-7B-Instruct can be attributed to our use of Patch Reprogramming in the encoding stage, which maps time series data into natural language representations. As a result, the mathematical reasoning capabilities of DeepSeek-Math-7B-RL are not fully utilized in this context.
LLM-based TSF models. We compared StreamTS with some of the leading LLM-based TSF models: TIME-LLM [16], AutoTimes [13], and One Fits All [12]. TIME-LLM stands out as a prominent example of these mainstream temporal LLMs. It does not fine-tune any layer of the LLM but introduces two learnable modules, Patch Reprogramming and Output Projection, for input and output processing. AutoTimes uses the LLM as an autoregressive time series predictor, which allows arbitrary-length predictions given sufficiently long inputs, and introduces LLM-embedded textual timestamps to enhance forecasting. One Fits All has a simple structure; however, it requires fine-tuning the positional embeddings and layer normalization layers of the pre-trained language or image model.
For fair comparisons, TIME-LLM, AutoTimes, and One Fits All were also implemented with Deepseek-math-7b-rl as the base LLM.
Table 5 shows the prediction performance of all the models. It clearly indicates that our StreamTS model achieves more accurate predictions on the ETTh1, Exchange, and Traffic datasets for various forecasting tasks. One Fits All is sometimes even worse than an LLM alone (e.g., on the Exchange dataset). This may be caused by fine-tuning the LLM layers with insufficient data. AutoTimes is unable to predict an output length of 720 from an input length of 512 due to its autoregressive design, which leaves the corresponding entries blank in Table 5. TIME-LLM performs much better than AutoTimes and One Fits All but consistently worse than our model.
The number of model parameters and the training speed (seconds per iteration, s/iter) are shown in Table 6. Since the LLM backbone accounts for most of the parameters, the total number of parameters does not vary much across models. One Fits All and AutoTimes change the original model only slightly and add few modules, so their training speed is relatively fast. Compared to TIME-LLM, which has better prediction results than those two models, our model is much more efficient in training, with an average speed of 0.59 s/iter, which is about 68.6% of the time used by TIME-LLM. In StreamTS, we freeze the inference backbone without additional training and only need to train the linear forecasting component and the patch embedding. Despite having fewer trainable parameters, the model retains its prediction accuracy.
4.6. Model Analysis
In this section, we examine each module carefully, from the analysis of the original LLM prediction to the final overall prediction performance.
Model Interpretation. From Table 5, we can observe that using a pre-trained LLM directly for time series prediction is not as effective as our proposed LLM with time series decomposition. This motivates us to analyze the underlying reason. An LLM is a self-attention-based network that generates each prediction autoregressively, so earlier values strongly influence later ones. Ref. [12] has shown that self-attention performs a function closely related to principal component analysis (PCA). However, PCA is known to be sensitive to data distribution and outliers. We take a time series from the ETTh1 dataset as an example and demonstrate the effect of earlier values on later ones in a sequence in Figure 7.
The left side of Figure 7 displays the original sequence and its decomposed trend component from top to bottom, while the right side shows the corresponding attention maps generated in the 1st, 3rd, and 6th layers, as well as those of heads 0, 6, and 10 in the 6th layer. We can observe dramatic changes in the original sequence (highlighted by a red circle in Figure 7), and these changes are reflected in the attention maps, most intensively in the 6th layer (highlighted by a green box in Figure 7). Meanwhile, the trend sequence is smooth owing to the moving average operation. The corresponding attention maps show that the trend component largely eliminates the influence of local noise.
Time series decomposition and its effects. In Figure 8, we visualize the results of time series decomposition for three datasets. The Electricity dataset features short and relatively regular change cycles, the Weather dataset shows periodicity on larger timescales, and the ETTh1 dataset lacks clear patterns of change. After the complex time series is decomposed into trend and seasonal components, the LLM is used for long-term trend prediction, and the linear model is used for local seasonal fitting. The time series in the Electricity dataset exhibits inherent regularity; thus, the decomposed seasonal series is more consistent in magnitude than the original series. The time series in the Weather dataset contains several cyclic peaks. After decomposition, the periodicity is well captured by the trend, and the remaining seasonal series is confined to a small range, which minimizes its interference in the prediction process. The decomposition of the ETTh1 time series also produces a seasonal series within a small range around zero, which is better predicted by considering short-term influences.
Figure 9 shows that the linear model fits the seasonal series well. For the Electricity dataset, the linear model accurately captures the waveforms, leading to a small MSE of 0.1355 for the seasonal part. For the Weather and ETTh1 datasets, although the linear model does not fit some points perfectly, it learns the data patterns rather than overreacting to outliers. According to [19], simple linear layers have outperformed complex self-attention mechanisms in learning temporal dependencies. This can be explained by the fact that time series variations are strongly influenced by the sequence order, while attention mechanisms are permutation invariant along the temporal dimension.
To examine the LLM capability for long-term trend prediction, we compare it with three other models, which represent linear, CNN-based, and Transformer-based models, respectively. Figure 10 provides a qualitative comparison of the four models by visualizing their predictions for 720 steps on the Weather dataset. In the case of TimesNet, the error for trend prediction is high (MSE = 0.2341), leading to a higher overall prediction error (MSE = 0.2365). In particular, TimesNet predicts a final downward trend (in the orange highlighted part), which contradicts the true upward trend.
DLinear and MICN, although better at following long-term variations than TimesNet, struggle to capture high-frequency patterns effectively. Our model, on the other hand, captures both long-term variations and high-frequency patterns. This advantage can be attributed to the design of our model’s architecture, which is tailored to capitalize on both seasonal component and trend component predictions. From the results in Table 5, it is clear that the adapted model predicts significantly more effectively than the LLM alone.
Ablation study
To validate the effectiveness of the StreamTS components, namely BC-Prompt and time series decomposition, we conducted ablation studies on the ETTh1 and Exchange datasets. The results are shown in Table 7. The prediction performance consistently declines when only the forecasting task information is kept in the prompts (w/o BC). The absence of decomposition (w/o DT) also reduces the overall performance of the model. For the Exchange dataset, which is more difficult to predict, the average drop is approximately 33% in MSE and 16% in MAE. The removal of both BC-Prompt and decomposition (w/o P&DT) has the most severe impact, preventing the LLM from effectively capturing the short-term changes of the time series. For long-term forecasting (e.g., pred-720), however, the prediction errors are less severe. This further confirms that the LLM is more capable of learning the general trend of a time series.
To examine the robustness of prompts to noisy or incomplete background information, we introduced perturbations into the BC-Prompt background (w/NoiseP). Specifically, we randomly masked 10% of the background tokens in BC-Prompt with the symbol “#”. The results on the ETTh1 and Exchange datasets show only minor fluctuations, indicating that the model remains stable under such perturbations. In addition, we replaced the decomposition module with Prophet decomposition (w/Prophet) to evaluate the adaptability of our framework. On the ETTh1 dataset, this substitution leads to a slight performance decline, whereas on the Exchange dataset, we observe a performance gain. These results suggest that while our design is flexible to different decomposition choices, the effectiveness may vary across datasets.
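The prompt perturbation can be sketched as follows. This is an illustration under stated assumptions: tokens are approximated by whitespace-separated words (the paper may mask tokenizer-level tokens instead), and the function name `mask_tokens`, the seed handling, and the sample text are hypothetical.

```python
import random


def mask_tokens(text, ratio=0.10, symbol="#", seed=0):
    """Replace a random `ratio` of whitespace-separated tokens with `symbol`."""
    tokens = text.split()
    n_mask = round(len(tokens) * ratio)
    rng = random.Random(seed)
    # Sample distinct positions so that exactly n_mask tokens are masked.
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = symbol
    return " ".join(tokens)


# Hypothetical background text with 20 words, so 2 of them are masked.
background = (
    "This dataset records hourly measurements collected from a transformer "
    "device and the task is to forecast the next steps of the series"
)
print(mask_tokens(background))
```

Fixing the seed makes the perturbation reproducible, so the same masked prompt can be reused across prediction horizons when comparing against the unperturbed setting.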
In addition, we conducted a deeper investigation into the influence of the sliding window size on the decomposition process. On the ETTh1 dataset, we evaluated window sizes of 13, 25 (default in StreamTS), 49, and 65. The results are summarized in Table 8 and underscore the sensitivity of decomposition effectiveness to the window size. Notably, since ETTh1 is collected hourly from an electricity transformer device, it exhibits a periodicity of 24. Based on the results in Table 8, for relatively short-horizon forecasting (Pred-96), the window size of 25 achieves the best performance. For long-horizon forecasting, in contrast, the most favorable performance arises when the window size (49) is approximately twice the underlying period. One can reasonably expect that choosing a window size aligned with the data cycle, or with twice the cycle, may further boost the predictive performance of our method. These results indicate that an appropriately chosen window size is crucial for enhancing both decomposition fidelity and predictive accuracy and thus represents a promising avenue for future research.
Impact of input length. Figure 11 shows the model performance when predicting 96, 192, 336, and 720 future steps on the ETTh1 dataset with input lengths of 96, 192, 336, and 512, respectively. The prediction error decreases as the input length grows up to 336, but beyond that point, the error starts to increase with further increases in input length. This indicates that, within a certain range, a longer input brings more auxiliary information to the prediction, whereas too much history interferes with the model. Based on these findings, we conclude that excessively long history inputs do not enhance prediction accuracy and may, in fact, increase computational overhead. Thus, identifying an optimal history length is essential for effective model performance.
Statistical Analysis. We conducted 10 bootstrap evaluations on the test sets of the ETTh1 and Traffic datasets, each time randomly sampling 90% of the test data, and computed the MSE and MAE for StreamTS, TIME-LLM, and AutoTimes, reporting the results as mean ± standard deviation (Table 9).
The results show that all three LLM-based methods achieve highly stable predictions with very small standard deviations (typically on the order of ). According to a paired t-test, our proposed StreamTS significantly outperforms the baseline models across all prediction horizons. The statistical analysis supports the reliability of the experimental conclusions, confirming that StreamTS is effective in adapting frozen LLMs for time series forecasting.
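The resampling procedure and the paired comparison can be sketched as below. Several details are assumptions for illustration: the 90% subsets are drawn without replacement, both models are scored on identical subsets so that the scores are paired, and the data are synthetic rather than results from the paper. Only the t statistic is computed; obtaining a p-value would require the t distribution with rounds − 1 degrees of freedom.

```python
import math
import random
import statistics


def bootstrap_mse(y_true, y_pred, rounds=10, fraction=0.9, seed=0):
    """MSE on `rounds` random subsets, each holding `fraction` of the points."""
    rng = random.Random(seed)
    n = len(y_true)
    k = round(n * fraction)
    scores = []
    for _ in range(rounds):
        idx = rng.sample(range(n), k)  # subset drawn without replacement
        scores.append(sum((y_true[i] - y_pred[i]) ** 2 for i in idx) / k)
    return scores


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Synthetic ground truth and two synthetic "models" with different noise levels.
noise = random.Random(1)
y_true = [math.sin(0.1 * t) for t in range(200)]
pred_a = [v + noise.gauss(0.0, 0.1) for v in y_true]
pred_b = [v + noise.gauss(0.0, 0.3) for v in y_true]

# The same seed yields the same subsets for both models, which pairs the scores.
scores_a = bootstrap_mse(y_true, pred_a, seed=0)
scores_b = bootstrap_mse(y_true, pred_b, seed=0)
t_stat = paired_t_statistic(scores_a, scores_b)

print(f"A: {statistics.mean(scores_a):.4f} ± {statistics.stdev(scores_a):.4f}")
print(f"B: {statistics.mean(scores_b):.4f} ± {statistics.stdev(scores_b):.4f}")
print(f"paired t = {t_stat:.2f}")
```

A negative t statistic in this sketch means the first model has the lower error on the shared subsets; pairing removes the variation caused by which points happen to be sampled in each round.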