Peer-Review Record

StreamTS: A Streamline Solution Towards Zero-Shot Time Series Forecasting with Large Language Models

Electronics 2025, 14(20), 4088; https://doi.org/10.3390/electronics14204088
by Wei Song 1, Yi Fang 1,*, Xinyu Gu 2, Wenbo Zhang 1, Zhixiang Liu 1, Yu Cheng 3 and Mario Di Mauro 4
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 6 September 2025 / Revised: 11 October 2025 / Accepted: 15 October 2025 / Published: 17 October 2025
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript proposes a time-series forecasting framework that decomposes each series into a trend and a residual component, predicts the trend with a large language model guided by a background-plus-reasoning prompt, and predicts the residual with a lightweight model; in a zero-shot variant, ARIMA replaces the residual head. Experiments on several public benchmarks report competitive accuracy with lower training cost than some recent LLM-based baselines. While the manuscript presents several findings, a number of issues should be addressed before the work can be reliably evaluated and reproduced.

 

  • Data & preprocessing – leakage risk from decomposition.

The paper derives the “trend” via moving-average smoothing and feeds this trend to the LLM, but it does not specify whether the filter is strictly causal (trailing) or centered/two-sided. This is problematic because a centered smoother leaks future observations into the input features used to predict the future, creating optimistic, non-reproducible performance. Even a trailing smoother can introduce phase lag that, if not compensated, will misalign the trend and target during training and inference. The consequences include inflated accuracy, misleading ablations, and conclusions that may not hold in a live setting where only past data are available. Readers also cannot replicate results or assess sensitivity to the smoothing window without the exact kernel, stride, padding, and alignment rules. The authors should specify a strictly causal design, document window length and alignment, and report an ablation over causal window sizes to show robustness. They should also explain how any phase lag is handled at recombination (e.g., shift correction) so the forecasted trend and residual align temporally.

 

  • Data & preprocessing – normalization protocol under-specified.

The manuscript mentions instance normalization but does not state when and over which segment statistics are computed, nor how the inverse transform is applied for evaluation. This is a concern because computing normalization using information from the forecast horizon or the validation/test partitions can silently leak target distribution information. In decomposed pipelines, inconsistent scaling between the trend and residual channels can also distort recombination and error metrics. The downstream effect is that reported errors could be biased low, and third parties cannot faithfully reproduce results or compare baselines under the same scaling. A clear description should include whether statistics are fit on training data only, whether rolling/online statistics are used at inference, numerical stability constants (epsilon), and the precise de-normalization step before metric computation. The authors should also clarify whether zero-shot experiments use the same normalization as the trained setting, since ARIMA’s differencing and scaling interact. Finally, providing a small sensitivity study (e.g., per-dataset z-score vs min-max vs instance norm) would help demonstrate robustness.

 

  • Study design & methods – missing details for BC-Prompt and Patch Reprogramming.

The core method relies on a “BC-Prompt” that injects background knowledge and reasoning, and on patch-based reprogramming to adapt series tokens to an LLM, but the exact templates and hyperparameters are not provided. This is an issue because prompt content, token budget, and patch shapes are first-order drivers of LLM behavior and can materially change accuracy, latency, and memory use. Without these details, the results are not independently reproducible and the community cannot judge whether gains arise from architectural choices or from careful prompt engineering. The lack of specifics also prevents fair comparison because baselines cannot be upgraded with equivalent prompting to isolate the effect of decomposition. The practical consequence is reduced credibility and diminished utility for practitioners who might otherwise adopt the method. The authors should supply full prompt templates with placeholders, token counts, and at least one filled example per dataset, plus patch length/stride, number of prototypes/heads, projection dimensions, and any positional encoding used. They should also describe how prompts differ between training, validation, and zero-shot inference (if at all), so others can replicate the full protocol.

 

  • Study design & methods – zero-shot ARIMA configuration and scaling.

In zero-shot mode, the linear residual head is replaced by ARIMA with automatic order selection, but the search ranges, information criterion (AIC/AICc/BIC), seasonal terms, and differencing strategy are not specified. This matters because ARIMA performance depends strongly on (p, d, q, P, D, Q, m) choices and on whether the series is scaled or differenced before or after decomposition. If the auto-selection inspects the evaluation horizon or benefits from scaling fit on the full series, the zero-shot claim would be overstated. The consequence is that readers cannot replicate the zero-shot results, and the reported advantage may reflect favorable configuration rather than inherent model generality. The paper should disclose the exact auto-selection routine (including grid limits and fallback rules), how scaling is handled in zero-shot, and how decomposition interacts with ARIMA differencing. It should also provide numeric zero-shot tables for all datasets, not just visuals, and compare against additional classical zero-shot baselines (e.g., ETS/Prophet) to contextualize the gains. Clarifying these points would make the zero-shot evaluation credible and portable.

 

  • Model design & validation – baseline coverage and fairness.

The evaluation compares against several known models, yet it is unclear whether recent strong forecasters (e.g., iTransformer, tuned DLinear variants, ModernTCN, properly tuned PatchTST) were included with competitive settings. This is problematic because state-of-the-art claims require both breadth (covering the strongest contemporaries) and depth (tuning baselines comparably). Without transparent configs and search budgets, differences in results might stem from under-tuned baselines rather than genuine architectural advantages. The practical risk is over-claiming SOTA and giving readers an incomplete picture of where the method excels or lags (e.g., some Electricity horizons). To resolve this, the authors should document hyperparameter ranges, early-stopping rules, and data preprocessing for every baseline and align search budgets. They should also highlight cases where the proposed approach underperforms and analyze why (e.g., periodic vs stochastic residuals), which would strengthen the paper’s scientific value. Finally, releasing baseline scripts alongside the main code would enable community verification.

 

  • Statistical analysis – variance, seeds, and significance not reported.

The manuscript appears to report averages over a small number of runs without fixed random seeds, confidence intervals, or paired statistical tests. This is an issue because time-series benchmarks can be noisy, and improvements within a few percent may not be statistically meaningful. Without dispersion metrics and hypothesis testing, it is difficult to assess whether differences are robust across series and replication. The downstream consequence is reduced confidence in the headline claims and limited guidance for practitioners deciding whether to switch models. The authors should fix seeds, report mean ± standard deviation (or 95% confidence intervals) over a reasonable number of runs, and consider paired tests on per-series errors to control for series-level heterogeneity. They should also state early-stopping criteria, gradient clipping, precision settings, and any variance-reducing tricks used. Providing such statistical discipline will make the comparative story much more persuasive.

 

  • Reproducibility & resources – code/data availability and compute reporting.

Although the work uses public datasets, the paper does not provide links, scripts, or an open repository to regenerate the tables and figures end-to-end. This is a problem because small choices in data loaders, splits, normalization, and tokenization can yield materially different results. Practically, reviewers and readers cannot verify claims, and practitioners cannot adopt the method with confidence. Furthermore, compute details such as wall-clock times per epoch, total training time, peak VRAM, and mixed-precision usage are not reported, which obscures the true efficiency trade-offs versus baselines. The authors should release code with pinned library versions, dataset fetching scripts, and one-click recipes to reproduce each main table, together with logs or metadata for the reported runs. They should also add a short section on computational cost and memory footprint to contextualize the claimed efficiency. These steps would substantially strengthen the paper’s transparency and impact.

Author Response

Comments 1: Data & preprocessing – leakage risk from decomposition.

The paper derives the “trend” via moving-average smoothing and feeds this trend to the LLM, but it does not specify whether the filter is strictly causal (trailing) or centered/two-sided. This is problematic because a centered smoother leaks future observations into the input features used to predict the future, creating optimistic, non-reproducible performance. Even a trailing smoother can introduce phase lag that, if not compensated, will misalign the trend and target during training and inference. The consequences include inflated accuracy, misleading ablations, and conclusions that may not hold in a live setting where only past data are available. Readers also cannot replicate results or assess sensitivity to the smoothing window without the exact kernel, stride, padding, and alignment rules. The authors should specify a strictly causal design, document window length and alignment, and report an ablation over causal window sizes to show robustness. They should also explain how any phase lag is handled at recombination (e.g., shift correction) so the forecasted trend and residual align temporally.

Response 1: Thank you for raising this important concern. We realize that our original description of the decomposition step was not sufficiently clear and could lead to misunderstandings. In the revised manuscript, we clarify the following points:

  1. Filtering design. We adopt a centered moving-average smoothing filter with replication padding applied at both ends of the sequence. Specifically, before applying the centered moving average with a window size of w, we replicate the boundary values to pad both ends of the sequence, and the smoothing is then performed on this extended sequence. As a result, for any position inside the original time sequence, no true future observations beyond the prediction horizon are included in the model input. This avoids the risk of information leakage highlighted by the reviewer.
  2. Phase alignment. Because the smoothing window is symmetric, the extracted trend is inherently zero-phase and thus remains temporally aligned with the original series. Consequently, there is no need for additional shift correction when recombining the trend and residual components.
  3. Ablation study. Following the reviewer’s suggestion, we have added a sensitivity analysis on different window sizes in Section 4.6 Model Analysis. The results show that the model is robust across a range of window lengths, with the best performance achieved when the window size is close to the dominant periodicity of the dataset.

We have updated Section 3.2 Decomposition to include these implementation details, and Section 4.6 Model Analysis now reports the ablation results. We believe these clarifications improve the transparency and reproducibility of our method. We sincerely thank the reviewer again for pointing out this critical issue.
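
For concreteness, a minimal NumPy sketch of the padding-and-smoothing scheme described above is given below; the function name, variable names, and the example window size are illustrative and are not taken from the released code.

    import numpy as np

    def decompose(x, w):
        # Centered moving-average decomposition with replication padding.
        # x: 1-D array containing only the observed input window
        #    (no values beyond the forecast origin are used).
        # w: smoothing window length.
        left = w // 2
        right = w - 1 - left
        padded = np.concatenate([np.repeat(x[0], left), x, np.repeat(x[-1], right)])
        kernel = np.ones(w) / w
        trend = np.convolve(padded, kernel, mode="valid")  # same length as x
        residual = x - trend                               # seasonal/residual part
        return trend, residual

    # Example: decompose a 96-step input window with an illustrative w = 25.
    x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.1 * np.random.randn(96)
    trend, residual = decompose(x, w=25)

Because the kernel is symmetric, the extracted trend is zero-phase, which corresponds to point 2 above.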

 

Comments 2: Data & preprocessing – normalization protocol under-specified.

The manuscript mentions instance normalization but does not state when and over which segment statistics are computed, nor how the inverse transform is applied for evaluation. This is a concern because computing normalization using information from the forecast horizon or the validation/test partitions can silently leak target distribution information. In decomposed pipelines, inconsistent scaling between the trend and residual channels can also distort recombination and error metrics. The downstream effect is that reported errors could be biased low, and third parties cannot faithfully reproduce results or compare baselines under the same scaling. A clear description should include whether statistics are fit on training data only, whether rolling/online statistics are used at inference, numerical stability constants (epsilon), and the precise de-normalization step before metric computation. The authors should also clarify whether zero-shot experiments use the same normalization as the trained setting, since ARIMA’s differencing and scaling interact. Finally, providing a small sensitivity study (e.g., per-dataset z-score vs min-max vs instance norm) would help demonstrate robustness.

Response 2: Thank you for your questions.

First of all, in our implementation, Instance Normalization (IN) is computed strictly using statistics derived from the training set only. Unlike the more commonly used Batch Normalization (BN), which normalizes each feature across all samples and time steps within a batch, IN is applied independently to each feature of each individual sample. In other words, IN does not retain the statistics calculated during training.

Consequently, during inference, IN calculates the mean and variance from each individual input sample as given. Therefore, no rolling or online updates are applied, ensuring that there is no risk of target distribution leakage. Following standard practice, a small constant is added to the denominator during normalization to avoid division by zero.

Secondly, after both components (trend and seasonal) are obtained, we apply the inverse instance normalization operation to restore the series back to its original scale before evaluation. This ensures that the recombination is consistent and that all reported error metrics are computed in the original data space.

Thirdly, zero-shot experiments adopt the same normalization strategy. Since normalization is applied prior to the decomposition step, it has no effect on the ARIMA component.

Finally, both Instance Normalization and Batch Normalization are structural components within neural networks. They are built upon Z-score standardization, augmented with learnable parameters, with the purpose of reshaping the data distribution to facilitate gradient propagation and improve model training. In contrast, Min-Max normalization is typically applied as a preprocessing step on the raw input data, aiming to rescale features of different magnitudes to a common range for easier handling by the model. It is generally used during data preparation rather than as a trainable layer within the network.

To clarify these distinctions, we have added the corresponding details in Section 3.0 Method. The description of reverse Instance Normalization has been added in Figure 1 and Section 3.4 LLM Forecasting.

 

Comments 3: Study design & methods – missing details for BC-Prompt and Patch Reprogramming.

The core method relies on a “BC-Prompt” that injects background knowledge and reasoning, and on patch-based reprogramming to adapt series tokens to an LLM, but the exact templates and hyperparameters are not provided. This is an issue because prompt content, token budget, and patch shapes are first-order drivers of LLM behavior and can materially change accuracy, latency, and memory use. Without these details, the results are not independently reproducible and the community cannot judge whether gains arise from architectural choices or from careful prompt engineering. The lack of specifics also prevents fair comparison because baselines cannot be upgraded with equivalent prompting to isolate the effect of decomposition. The practical consequence is reduced credibility and diminished utility for practitioners who might otherwise adopt the method. The authors should supply full prompt templates with placeholders, token counts, and at least one filled example per dataset, plus patch length/stride, number of prototypes/heads, projection dimensions, and any positional encoding used. They should also describe how prompts differ between training, validation, and zero-shot inference (if at all), so others can replicate the full protocol.

Response 3: Thank you for raising this important point. We have revised Figure 2, providing a full BC-Prompt template with placeholders and one instantiated example (Traffic dataset). Prompts are identical across training, validation, and zero-shot inference, and the complete set for all datasets is available in our released code (https://github.com/kobe-heimanba/StreamTS).

For patch reprogramming, detailed hyperparameters—including patch length/stride, number of prototypes/heads, projection dimensions, and positional encoding—are documented in code. Baselines were given equivalent input structures to ensure fairness.

Some key parameters are also reported in Section 4.1 Experimental Settings. Thus, together with the released resources, our results are fully reproducible and directly usable by the community.
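
As a rough illustration of the patching step discussed above, the sketch below splits an input window into overlapping patches and projects them into the LLM embedding space. All values (PATCH_LEN, STRIDE, D_MODEL) and the simple linear projection are hypothetical placeholders, not the paper’s implementation; the actual BC-Prompt template and patch-reprogramming hyperparameters are those documented in Figure 2 and the released repository.

    import torch

    # Hypothetical settings; the real values are documented in the released code.
    PATCH_LEN, STRIDE, D_MODEL = 16, 8, 768

    def patchify(x):
        # x: tensor of shape (batch, seq_len) holding the trend component.
        # Returns (batch, num_patches, PATCH_LEN) overlapping patches (tokens).
        return x.unfold(dimension=-1, size=PATCH_LEN, step=STRIDE)

    # Patch reprogramming then maps each patch into the frozen LLM's embedding
    # space; a plain linear projection stands in for that module here.
    project = torch.nn.Linear(PATCH_LEN, D_MODEL)

    x = torch.randn(4, 96)            # batch of 4 input windows of length 96
    tokens = project(patchify(x))     # shape: (4, 11, 768)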

 

Comments 4: Study design & methods – zero-shot ARIMA configuration and scaling.

In zero-shot mode, the linear residual head is replaced by ARIMA with automatic order selection, but the search ranges, information criterion (AIC/AICc/BIC), seasonal terms, and differencing strategy are not specified. This matters because ARIMA performance depends strongly on (p, d, q, P, D, Q, m) choices and on whether the series is scaled or differenced before or after decomposition. If the auto-selection inspects the evaluation horizon or benefits from scaling fit on the full series, the zero-shot claim would be overstated. The consequence is that readers cannot replicate the zero-shot results, and the reported advantage may reflect favorable configuration rather than inherent model generality. The paper should disclose the exact auto-selection routine (including grid limits and fallback rules), how scaling is handled in zero-shot, and how decomposition interacts with ARIMA differencing. It should also provide numeric zero-shot tables for all datasets, not just visuals, and compare against additional classical zero-shot baselines (e.g., ETS/Prophet) to contextualize the gains. Clarifying these points would make the zero-shot evaluation credible and portable.

Response 4: Thank you for your suggestions. We have added the related information in Section 4.4 Zero-shot Forecasting and Table 3 of the revised manuscript. In our zero-shot experiments, to maintain a fully inference-only automated pipeline, we adopt AIC-based automatic order selection for the ARIMA parameters (p, d, q). Specifically, we search over bounded ranges of the orders and select the configuration with the lowest AIC; if the search fails, we fall back to a simple default order to ensure stability. Differencing is applied after decomposition, operating only on the residual component. We use Instance Normalization (IN) for scaling; IN is computed strictly using statistics derived from the training set only, and predictions are mapped back through inverse Instance Normalization during evaluation to ensure fair comparison.

In Table 3, we provide the complete zero-shot results, and additionally include Prophet as a classical baseline to better contextualize the effectiveness of our approach. This design ensures that our zero-shot evaluation is both credible and reproducible, aligning with best practices in the field.
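
A minimal sketch of such an AIC-based selection loop on the residual component is shown below, using the statsmodels ARIMA interface; the grid bounds and the (1, 1, 1) fallback order here are illustrative only, whereas the actual limits are those stated in Section 4.4 of the revised manuscript.

    import itertools
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def forecast_residual(residual, horizon, max_p=3, max_d=1, max_q=3):
        # AIC-based (p, d, q) selection on the residual series.
        # The bounds and the fallback order below are illustrative placeholders.
        best_aic, best_fit = np.inf, None
        for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
            try:
                fit = ARIMA(residual, order=(p, d, q)).fit()
            except Exception:
                continue                        # skip non-convergent configurations
            if fit.aic < best_aic:
                best_aic, best_fit = fit.aic, fit
        if best_fit is None:                    # fallback if every candidate failed
            best_fit = ARIMA(residual, order=(1, 1, 1)).fit()
        return best_fit.forecast(steps=horizon)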

 

Comments 5: Model design & validation – baseline coverage and fairness.

The evaluation compares against several known models, yet it is unclear whether recent strong forecasters (e.g., iTransformer, tuned DLinear variants, ModernTCN, properly tuned PatchTST) were included with competitive settings. This is problematic because state-of-the-art claims require both breadth (covering the strongest contemporaries) and depth (tuning baselines comparably). Without transparent configs and search budgets, differences in results might stem from under-tuned baselines rather than genuine architectural advantages. The practical risk is over-claiming SOTA and giving readers an incomplete picture of where the method excels or lags (e.g., some Electricity horizons). To resolve this, the authors should document hyperparameter ranges, early-stopping rules, and data preprocessing for every baseline and align search budgets. They should also highlight cases where the proposed approach underperforms and analyze why (e.g., periodic vs stochastic residuals), which would strengthen the paper’s scientific value. Finally, releasing baseline scripts alongside the main code would enable community verification.

Response 5: We respectfully clarify that in Table 1, we present a comprehensive comparison against several representative and competitive forecasting models. In line with prior works such as Autoformer and Time-LLM, which report aggregate comparisons across multiple datasets, we also adopt the same evaluation protocol. Importantly, our comparisons include recent strong baselines such as tuned DLinear variants and PatchTST, as detailed in Section 4.2 Full-shot Forecasting.

To ensure fairness, we describe the hyperparameter ranges, early-stopping rules, and preprocessing strategies for all baselines in Section 4.1 Experimental setting, and align search budgets across methods. The released code also contains the full baseline configurations, enabling reproducibility.

To avoid over-claiming state-of-the-art performance, we have revised the paper’s wording to present results more objectively in Section 4.2 Full-shot Forecasting. In addition, we provide dedicated analyses of weaker cases such as ILI and Exchange, and discuss possible reasons. These revisions ensure that the evaluation is both credible and balanced, while the released scripts support transparent community verification.

 

Comments 6: Statistical analysis – variance, seeds, and significance not reported.

The manuscript appears to report averages over a small number of runs without fixed random seeds, confidence intervals, or paired statistical tests. This is an issue because time-series benchmarks can be noisy, and improvements within a few percent may not be statistically meaningful. Without dispersion metrics and hypothesis testing, it is difficult to assess whether differences are robust across series and replication. The downstream consequence is reduced confidence in the headline claims and limited guidance for practitioners deciding whether to switch models. The authors should fix seeds, report mean ± standard deviation (or 95% confidence intervals) over a reasonable number of runs, and consider paired tests on per-series errors to control for series-level heterogeneity. They should also state early-stopping criteria, gradient clipping, precision settings, and any variance-reducing tricks used. Providing such statistical discipline will make the comparative story much more persuasive.

Response 6: We fully acknowledge the reviewer’s valuable suggestion that reporting mean ± standard deviation (or confidence intervals) is an effective way to better capture the variability and stability of model performance, and such results would undoubtedly strengthen the persuasiveness of the conclusions.

In the current version of our work, however, we have strictly followed the experimental protocol adopted by most mainstream time series forecasting models (e.g., Autoformer, Time-LLM), where only the averaged MSE and MAE are reported. We did conduct multiple runs under the same experimental settings to check consistency, although the detailed records were not retained. Given that we compare 8 different methods across 6 datasets and 4 prediction horizons, repeating all experiments in a “multiple runs + statistical reporting” format at this stage would require enormous computational resources and time, and such large-scale repetition is therefore impractical at present. In addition, LLM-based approaches are already significantly more resource-intensive than conventional forecasting models, which further amplifies the difficulty of conducting extensive repetitions.

We have nonetheless fixed random seeds in all reported runs and adopted stable training practices, including early-stopping with a patience of 3. These details are now explicitly described in Section 4.1 Experimental setting.

To further address reproducibility, we have released complete scripts and interfaces, enabling researchers to re-run experiments under different seeds. Finally, in our future extensions, we aim to incorporate more comprehensive statistical reporting (mean ± standard deviation, confidence intervals, and paired statistical tests) to further strengthen the robustness and credibility of our findings.

 

Comments 7: Reproducibility & resources – code/data availability and compute reporting.

Although the work uses public datasets, the paper does not provide links, scripts, or an open repository to regenerate the tables and figures end-to-end. This is a problem because small choices in data loaders, splits, normalization, and tokenization can yield materially different results. Practically, reviewers and readers cannot verify claims, and practitioners cannot adopt the method with confidence. Furthermore, compute details such as wall-clock times per epoch, total training time, peak VRAM, and mixed-precision usage are not reported, which obscures the true efficiency trade-offs versus baselines. The authors should release code with pinned library versions, dataset fetching scripts, and one-click recipes to reproduce each main table, together with logs or metadata for the reported runs. They should also add a short section on computational cost and memory footprint to contextualize the claimed efficiency. These steps would substantially strengthen the paper’s transparency and impact.

Response 7: We greatly appreciate the reviewer’s comments on the importance of transparency and reproducibility. In response, we have made our code publicly available (https://github.com/kobe-heimanba/StreamTS), together with scripts for dataset acquisition, preprocessing, and one-click recipes to reproduce all main tables. To ensure consistency, the repository also includes pinned library versions (requirements.txt). Detailed instructions are provided in the README.md, covering data loading, splitting, normalization, and tokenization, so that both reviewers and practitioners can replicate our pipeline end-to-end with confidence.

We thank the reviewer for highlighting the importance of reporting computational cost and efficiency. Unfortunately, we did not record wall-clock times, training duration, or memory footprints during our initial experiments, and reproducing all runs would require substantial additional resources. Nevertheless, we have released our full codebase, including dataset scripts, preprocessing routines, and one-click recipes, so that interested researchers can reproduce the experiments and collect such statistics if needed. In future work, we plan to incorporate systematic reporting of computational cost to provide a more complete picture of efficiency trade-offs.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper tackles a very timely and important problem: adapting large language models (LLMs) for time series forecasting. The authors propose StreamTS, a decomposition-based framework that separates trend and seasonal components, assigning trend forecasting to a frozen pre-trained LLM with BC-Prompting and the seasonal component to a linear/stochastic model (e.g., ARIMA for zero-shot). The study positions itself as a lightweight, efficient, and generalizable solution for TSF, especially in zero- and few-shot scenarios.

The manuscript is well-structured, clearly written, and experimentally comprehensive, covering multiple datasets, training regimes (full, few, zero-shot), and comparisons with both classical and modern LLM-based baselines.

Here are my suggestions for improving the paper:

  • While the decomposition-based design is intuitive, the paper lacks a formal theoretical explanation of why separating seasonal and trend components improves LLM forecasting. A variance-reduction or attention-stability argument would strengthen the methodological grounding.
  • The proposed approach underperforms on highly volatile datasets such as ILI and Exchange, but this is only briefly noted. A more explicit discussion of failure cases and potential remedies would improve transparency.
  • The BC-Prompt is well-motivated, yet its generality is unclear. The robustness of prompts to noisy or incomplete background information should be evaluated. Comparisons with other prompting strategies (e.g., prefix-tuning, retrieval-augmented generation) could clarify its relative merits.
  • Current zero-shot tests focus on short horizons. Extending experiments to longer horizons would validate scalability. Including transfer-based zero-shot settings (train on one dataset, test on another) could further demonstrate generalization.

  • Although efficiency is emphasized, the analysis primarily reports training speed (s/iter). Providing additional metrics such as GPU hours, FLOPs, and memory usage would give a more complete picture. Comparisons with strong baselines (e.g., DLinear, TimesNet, FiLM) should be expanded.

  • Ablation experiments are somewhat limited. Additional analyses should include:
    • Performance without decomposition vs. with decomposition.
    • Comparison across different decomposition methods (STL, EMD, Prophet-style).
    • Sensitivity to decomposition window size.
    • Results with and without BC-Prompt.
  • The practical deployment aspects are underdeveloped. A dedicated section on challenges in real-world use—such as robustness to distribution shifts, interpretability, and data privacy—would add depth. Addressing responsible AI concerns (e.g., fairness, explainability) would also strengthen the impact discussion.

 

Author Response

Comments 1: While the decomposition-based design is intuitive, the paper lacks a formal theoretical explanation of why separating seasonal and trend components improves LLM forecasting. A variance-reduction or attention-stability argument would strengthen the methodological grounding.

Response 1: We thank the reviewer for this insightful comment. In Section 3.0 Method, we have provided a theoretical justification from the perspective of attention stability, showing that decomposition into trend and seasonal components reduces the entropy of attention distributions, thereby enhancing their robustness. Furthermore, in Section 4.6 Model Analysis, we present empirical evidence that directly confirms this effect. Together, these results establish both theoretical grounding and experimental validation for the effectiveness of our design.
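
One simple way such an attention-entropy measurement can be made is sketched below (PyTorch, purely illustrative; this is not the derivation given in Section 3.0 Method).

    import torch

    def attention_entropy(attn):
        # attn: tensor of shape (..., num_queries, num_keys) whose last
        # dimension sums to 1 (softmax output). Lower values indicate
        # sharper, more stable attention distributions.
        eps = 1e-9
        return -(attn * (attn + eps).log()).sum(dim=-1).mean()

    # Illustrative usage: the entropy of attention computed on a raw input
    # window would be compared against that of its smoother trend component.
    attn = torch.softmax(torch.randn(8, 96, 96), dim=-1)
    print(attention_entropy(attn))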

 

 

Comments 2: The proposed approach underperforms on highly volatile datasets such as ILI and Exchange, but this is only briefly noted. A more explicit discussion of failure cases and potential remedies would improve transparency.

Response 2: Thank you for raising this important point. We acknowledge that the proposed method underperforms on highly volatile datasets such as ILI and Exchange. In the revised version, we have included a more detailed discussion of these failure cases, together with an analysis of potential remedies in Section 4.2 Full-shot Forecasting. We believe that a transparent examination of such limitations will provide valuable insights into the scope of our approach and guide future improvements.

 

Comments 3: The BC-Prompt is well-motivated, yet its generality is unclear. The robustness of prompts to noisy or incomplete background information should be evaluated. Comparisons with other prompting strategies (e.g., prefix-tuning, retrieval-augmented generation) could clarify its relative merits.

Response 3: We appreciate the reviewer’s suggestion regarding the generality of BC-Prompt. In the revised manuscript, we have expanded our ablation studies to include robustness evaluations of BC-Prompt under noisy or incomplete background information in Table 7.

Prefix-tuning is a lightweight prompting strategy that prepends a small set of trainable continuous vectors (“prefixes”) to the input sequence, guiding the model’s generation without updating the full set of model parameters.

Retrieval-augmented generation (RAG), on the other hand, dynamically retrieves relevant external knowledge from a database or corpus and integrates it into the prompt, enabling the model to ground its predictions on retrieved evidence rather than relying solely on parametric memory.

Regarding comparisons with these strategies, our proposed BC-Prompt indeed shares conceptual similarities with existing prompting approaches. Specifically, in the initial stage it retrieves relevant contextual information (e.g., background knowledge), which is reminiscent of retrieval-augmented generation (RAG). Furthermore, through the patch reprogramming operation, our prompt vectors are prepended to the input sequence in a way that resembles prefix-tuning. Considering both aspects jointly, we designed BC-Prompt as a unified prompting mechanism tailored to the time-series forecasting scenario.

 

Comments 4: Current zero-shot tests focus on short horizons. Extending experiments to longer horizons would validate scalability. Including transfer-based zero-shot settings (train on one dataset, test on another) could further demonstrate generalization.

Response 4: We thank the reviewer for this constructive suggestion. In our work, zero-shot evaluation is defined as directly applying the model to new datasets without further training. We have already attempted to extend the forecasting horizon in this setting and have added the corresponding experimental results in Section 4.4 Zero-shot Forecasting.

Regarding transfer-based zero-shot (train on one dataset, test on another), while this is indeed an interesting direction to further assess generalization, it still requires training on a source dataset. This makes it unsuitable for data-scarce domains, which are the primary focus of our design philosophy. Therefore, such transfer-based settings fall beyond the intended scope of our work.

 

Comments 5: Although efficiency is emphasized, the analysis primarily reports training speed (s/iter). Providing additional metrics such as GPU hours, FLOPs, and memory usage would give a more complete picture. Comparisons with strong baselines (e.g., DLinear, TimesNet, FiLM) should be expanded.

Response 5: We appreciate the reviewer’s suggestion to provide a more comprehensive efficiency analysis, including GPU hours, FLOPs, memory usage, and extended comparisons with strong baselines such as DLinear, TimesNet, and FiLM. Unfortunately, these evaluations were not collected in our initial experiments, and re-running the full pipeline would require prohibitive computational resources and time. To support reproducibility and further benchmarking, we have released our full code (https://github.com/kobe-heimanba/StreamTS) and experimental configurations, enabling interested researchers to evaluate these metrics on their own.

 

Comments 6: Ablation experiments are somewhat limited. Additional analyses should include:

Performance without decomposition vs. with decomposition.

Comparison across different decomposition methods (STL, EMD, Prophet-style).

Sensitivity to decomposition window size.

Results with and without BC-Prompt.

Response 6: We thank the reviewer for the valuable suggestion. In the revised manuscript, we have expanded the ablation study (Section 4.6, Model Analysis – Ablation Study) to include: (1) performance comparison with and without decomposition, (2) comparisons across different decomposition methods (STL, EMD, Prophet-style), (3) sensitivity analysis with respect to decomposition window size, and (4) results with and without BC-Prompt. These additions provide a more comprehensive evaluation of the contributions and robustness of each component.
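
For the decomposition-method comparison, a drop-in alternative such as STL can be obtained from statsmodels as sketched below; this is illustrative only (the period value is a placeholder), and the ablation settings reported in Section 4.6 are authoritative.

    import numpy as np
    from statsmodels.tsa.seasonal import STL

    # Swap the moving-average split for STL while keeping the rest of the
    # pipeline unchanged; period=24 is a placeholder (e.g., hourly data).
    x = np.sin(np.linspace(0, 8 * np.pi, 384)) + 0.1 * np.random.randn(384)
    result = STL(x, period=24).fit()
    trend, seasonal, resid = result.trend, result.seasonal, result.resid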

 

Comments 7: The practical deployment aspects are underdeveloped. A dedicated section on challenges in real-world use—such as robustness to distribution shifts, interpretability, and data privacy—would add depth. Addressing responsible AI concerns (e.g., fairness, explainability) would also strengthen the impact discussion.

Response 7: We sincerely thank the reviewer for highlighting the importance of practical deployment considerations. Our model is designed for strong generalization, enabling zero-shot forecasting, which is particularly valuable in data-scarce or privacy-sensitive domains where large-scale training is impractical. Its plug-and-play nature further supports direct applicability in such constrained scenarios. In the revised manuscript, we have also expanded Section 5.0 Conclusion to explicitly discuss these deployment aspects, together with current limitations and directions for future research.

Regarding interpretability, we provide theoretical justifications for the decomposition (in Section 3.0 Method) as well as empirical visualizations of attention distributions after decomposition (in Section 4.6 Model Analysis), offering transparent insights into model behavior. These design choices collectively enhance robustness to distribution shifts and align with responsible AI principles, balancing predictive performance with explainability and practical feasibility.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I appreciate the authors’ detailed responses and the substantial revisions made to the manuscript. My earlier comments have been largely addressed, and the updated submission is significantly improved in both clarity and reproducibility.

That said, two issues remain where further revision is strongly recommended. First, the statistical analysis still lacks variance reporting and significance testing. While I understand the authors’ concerns about computational cost, providing even minimal standard deviation or confidence intervals (on one or two datasets, or via bootstrapped per-series errors) would considerably strengthen the reliability of the results. Second, the normalization protocol should be clarified more precisely, as the current explanation includes conflicting descriptions of how instance normalization statistics are computed and applied, especially in the zero-shot setting.

If the authors choose to address these remaining concerns, I do not request another round of review and defer to the editorial team regarding final acceptance.

Author Response

Comments 1: That said, two issues remain where further revision is strongly recommended. First, the statistical analysis still lacks variance reporting and significance testing. While I understand the authors’ concerns about computational cost, providing even minimal standard deviation or confidence intervals (on one or two datasets, or via bootstrapped per-series errors) would considerably strengthen the reliability of the results. Second, the normalization protocol should be clarified more precisely, as the current explanation includes conflicting descriptions of how instance normalization statistics are computed and applied, especially in the zero-shot setting.

Response 1: Thank you for the reviewer’s constructive suggestions. To address the remaining issues, we have conducted additional experiments and provided further explanations accordingly.

(1) Statistical Analysis:
In response to the reviewer’s first recommendation, we have supplemented additional statistical analysis in Section 4.6 Model Analysis. Specifically, we now report the variance of MSE and MAE across multiple runs on the ETTh1 and Traffic datasets, and further compare the results with AutoTimes and Time-LLM. These additional statistical analyses by paired t-test strengthen the reliability and robustness of our findings by demonstrating consistent performance trends under repeated experiments.

(2) Normalization Protocol Clarification:

In our implementation, Instance Normalization (IN) is computed strictly based on the input sequence. Specifically, for each sample and each feature, IN independently calculates the mean and variance, thereby performing normalization in a feature-wise and instance-wise manner. This design ensures that no information from the prediction (future) phase or other features is involved, effectively preventing data leakage. Moreover, during the denormalization phase, the model restores the output sequence to its original scale using the statistics derived from the input sequence. In addition, to ensure numerical stability, we add a small constant ε = 10⁻⁵ to the denominator in variance normalization, following standard practice.

We have further clarified this process in Section 3.1 Input Preprocessing, Section 3.5 LLM-Forecasting, and Section 4.4 Zero-shot Forecasting.
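
The procedure just described corresponds to a per-sample normalization of the following form (a minimal PyTorch sketch; tensor shapes and helper names are illustrative and not taken from the released code).

    import torch

    EPS = 1e-5  # stability constant, as stated above

    def instance_norm(x):
        # x: tensor of shape (batch, seq_len, num_features); statistics are
        # computed per sample and per feature over the input window only.
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True, unbiased=False)
        return (x - mean) / (std + EPS), mean, std

    def inverse_instance_norm(y_hat, mean, std):
        # Restore model outputs to the original scale with the same statistics.
        return y_hat * (std + EPS) + mean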

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review the revised version of the manuscript. I have carefully evaluated the updated submission and would like to confirm that all of my previous comments and suggestions have been thoroughly addressed by the authors. The manuscript has significantly improved in clarity, structure, and scientific rigor as a result.

Author Response

Comments 1: Thank you for the opportunity to review the revised version of the manuscript. I have carefully evaluated the updated submission and would like to confirm that all of my previous comments and suggestions have been thoroughly addressed by the authors. The manuscript has significantly improved in clarity, structure, and scientific rigor as a result.

Response 1: We sincerely thank the reviewer for the encouraging comments and for acknowledging the improvements in the revised version. We are grateful for the reviewer’s constructive feedback, which has been invaluable in strengthening the clarity, organization, and scientific quality of the manuscript.
