1. Introduction
Multi-step time series forecasting in industrial settings is critical to maintaining quality, optimizing performance, and reducing operational risk. For example, accurately predicting the silicon content of molten iron in a blast furnace can enable early corrective actions to stabilize production. However, traditional models are typically trained using scalar-based loss functions like mean squared error (MSE), which may not capture the qualitative or domain-specific preferences that practitioners care about.
Recent advances in transformer architectures, especially decoder-only GPT-style models, offer new possibilities for autoregressive forecasting (GPT stands for Generative Pre-Trained Transformer [1]). Currently, these models still rely on metric-based supervision (e.g., MSE) and miss important information about how “good” a predicted sequence appears overall. This is especially important in the early prediction steps, where human intervention may matter the most.
In this paper, we propose an enhancement from scalar loss optimization to learning from preferences. We introduce a new type of training corpus consisting of preferences over predicted time-series trajectories. These preferences can be derived either from human judgments or from automated heuristics, such as comparing predicted and real trajectories using metrics like R2 or MSE. Using these preference annotations, we apply Score-based Preference Optimization (SPO), a contrastive loss method adapted from Reinforcement Learning from Human Feedback (RLHF), to fine-tune transformer models for time-series forecasting.
Our main hypothesis is that preference-based training may provide a richer training signal for time-series models operating in noisy prediction domains. We show empirically that SPO-trained models can improve forecasts across the prediction steps.
We validate our approach on both a proprietary dataset from a Midwestern steel manufacturer and a public benchmark (UCI Appliances Energy). The results demonstrate improvement over traditional fine-tuning using MSE loss alone.
In the following sections, we describe the background motivating preference-based learning for time series, our corpus design and annotation scheme, the SPO algorithm, and the empirical results.
LSTMs and XGBoost are two common approaches to time-series problems, but in this work we use GPTs. While transformers and GPTs are traditionally associated with text processing, we adapt the transformer decoder architecture to directly ingest numerical tabular time-series data.
ChatGPT exists in its current form largely thanks to RLHF and the decoder-only GPT architecture; preference alignment was one of the key techniques that helped solve many of its early challenges. It is only natural that this framework be adapted to other domains, such as time-series forecasting. The core contribution of this work is fine-tuning through preference annotations. While LSTMs and XGBoost are widely used for tabular data, they do not have the same track record as GPT models in auto-regressive generation, scalable parallelization, or alignment training using preference-based supervision. Our Score-based Preference Optimization (SPO) approach directly benefits from training on a preference corpus: SPO provides “contrastive” training over annotated time-series output sequences.
While LSTMs and XGBoost remain common in time-series forecasting, we deliberately focus on GPT-style transformers because they align with the current state of scalable, preference-driven machine learning. Although our annotation process is manual in this initial work, it can be fully automated in future iterations. This may, in the future, enable the generation of massive quantities of preference-labeled time-series data. At that scale, the benefits of GPT parallelism and architecture-level scaling become obvious, as demonstrated by large language models. Our goal is not only to introduce a new corpus, but also to establish a foundation for large-scale preference-aligned time-series annotation and modeling.
The key idea in converting a text-based GPT into a time-series GPT is understanding the shape and meaning of the tensors that serve as input and output of the model. In a text-based GPT, tokens are converted to embeddings: a tensor of shape [32, 40, 1] containing tokens is projected via an embedding layer to a tensor of shape [32, 40, 512], so that each token is represented by a vector of size 512. To adapt this for time series, a simple approach is to remove the embedding layer and feed the time-series tensor directly (e.g., of shape [32, 40, 28]). Here, the last dimension (i.e., 28) represents the features at a given time step “t”. This time-series tensor ([32, 40, 28]) can then be projected to another size such as [32, 40, 512]. The same applies to the output of the time-series GPT (Figure 1).
Figure 1 illustrates how a sequence of time-series vectors is passed as input to the GPT model. The output is a shifted version of the same sequence, where the final vector (highlighted in blue) represents the next-step prediction.
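To make the tensor shapes concrete, the following minimal sketch (our own illustration, not the exact implementation; the layer names `input_proj` and `output_proj` are assumed) shows the token-embedding lookup of a text GPT next to the linear projection used for time-series features:

```python
import torch
import torch.nn as nn

batch_size, seq_len, n_features, d_model = 32, 40, 28, 512

# Text GPT: integer tokens -> embedding vectors
tokens = torch.randint(0, 1000, (batch_size, seq_len))   # token ids, [32, 40]
token_emb = nn.Embedding(1000, d_model)(tokens)          # [32, 40, 512]

# Time-series GPT: real-valued feature vectors -> projected vectors
x = torch.randn(batch_size, seq_len, n_features)         # [32, 40, 28]
input_proj = nn.Linear(n_features, d_model)              # replaces the embedding table
h = input_proj(x)                                        # [32, 40, 512]

# The output head mirrors this: project back to the feature space
output_proj = nn.Linear(d_model, n_features)
y_hat = output_proj(h)                                   # [32, 40, 28]
print(token_emb.shape, h.shape, y_hat.shape)
```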
We have conducted many runs, and the results are promising; however, consistency has been an issue. Some runs produce strong results, while others are noisier.
Additionally, unlike the domain of text and images, there are not many pre-trained models for time-series industrial data prediction. This is in part because modeling time-series data may not be as intuitive as using text or images. Our approach outlines simple and clear methods for building time-series GPTs using the familiar steps of pre-training, fine-tuning, and preference optimization.
In this paper we discuss the advantages of decoder-only transformers for time-series forecasting, the challenges encountered, GPU operational conditions, and potential future improvements. Our findings suggest that transformer-based architectures could play a significant role in time-series modeling for industrial applications.
Steel blast furnaces require real-time control of process variables to maintain desired output quality. Predicting silicon content 1–9 steps into the future allows for early corrective action. The question, then, is whether a GPT-style transformer can be trained to forecast multi-step silicon levels from raw input sequences.
A GPT of this type is usually trained by minimizing an MSE loss. Standard MSE loss may not capture qualitative aspects of the signal, such as what looks better or operates better. Human evaluators or custom heuristics (e.g., R2), on the other hand, can guide learning via preferences over time series. In this work, we therefore consider two types of training. The first training approach is based on MSE loss optimization using a standard GPT architecture adapted for time-series data (Figure 1). The second training approach is based on preference optimization using a preference loss inspired by both Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The preference data annotation scheme is presented and discussed.
Text and image datasets are abundant, whereas industrial and tabular data remain limited. Building powerful models in these domains often requires more and better-quality data. Our work proposes a new methodology for generating preference-labeled data for time-series forecasting to address this gap. In particular, it seeks to capture human preferences in industrial settings. Our contention is that human preferences may provide a stronger signal than what can be obtained from standard scalar losses such as MSE. As such, this work contributes not only a new dataset, but more importantly, an annotation scheme for generating time series preference data that can begin with manual labeling and scale to automated processes.
The resulting corpus and learning method are reproducible, extensible, and designed to support broader research in time-series forecasting. Therefore, this work supports the goal of developing new data resources that enable new sequential machine learning approaches.
3. Materials and Methods
This work began with, and remains primarily focused on, silicon time-series forecasting. The silicon data used was collected from real furnace sensors at a Midwestern steel blast furnace. Many of the methods presented in this paper were developed on this dataset. However, the dataset is proprietary and we are not able to release it. For reproducibility, we also ran the methods, in parallel, on an open-source dataset from the UCI repository that is comparable in many respects to the proprietary silicon blast furnace data; our methods can therefore be replicated using this UCI dataset. For the rest of this paper, we refer to these two datasets as (1) the silicon blast furnace data and (2) the UCI appliances data [17]. The silicon blast furnace dataset has about 1300 samples, and the UCI appliances dataset has about 19,000 samples. While the initial datasets are small, the appeal of our proposed method is that we can greatly increase the data via preference annotation.
3.1. Features
The main features in the silicon blast furnace dataset include delta silicon, moving average silicon, silicon, hot blast moisture, hot blast temperature, natural gas injection, windrate, high purity oxygen, coal flow, cast average Mn, slag Fe, top gas CO, top gas CO2, top gas H2, top gas N2, slag SiO2, slag CaO, slag MgO, snort valve position, top pressure, hot blast pressure, taphole, hot metal temperature, cokerate, etc. The target variable that we tracked was silicon. A correlation matrix of some of the most important features can be seen in Figure 4. A correlation matrix is a common tool in data science that shows the relationships between features and outputs. For example, in this graph, the correlation between silicon (SI) and the coke rate is relatively high, with a value of 0.54.
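As a brief illustration of how such a matrix can be produced (a generic pandas sketch on synthetic data; the column names are taken from the feature list above for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-cast feature table (column names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1300, 4)),
                  columns=["SI", "cokerate", "hot_blast_temp", "coal_flow"])

corr = df.corr()        # Pearson correlation matrix, as plotted in Figure 4
print(corr.round(2))    # e.g., the paper reports corr(SI, cokerate) ~ 0.54 on real data
```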
The main features in the UCI appliances dataset include lights, T1, appliances, RH_1, T2, RH_2, T3, RH_3, T4, RH_4, T5, RH_5, T6, RH_6, T7, RH_7, T8, RH_8, T9, RH_9, T_out, Press_mm_hg, RH_out, windspeed, visibility, dewpoint, rv1, rv2, etc. The target variable that we tracked was appliances.
3.2. Time Series GPT Architecture
We employ a decoder-only GPT-style model adapted for multivariate time-series forecasting. The input is a 3D tensor of shape [batch_size, sequence_length, num_features], where each time step contains a feature vector of observed process variables (35 for the silicon data and 28 for UCI). The model is trained to auto-regressively predict the next step given a sequence (Figure 1). For the most part, we have not deviated much from the original architecture first proposed in [18]. During training, we randomly sample a sequence from the training set (xb) and then shift xb by 1 time step to create yb. These xb and yb tensors are then used to train the GPT using Teacher Forcing (Figure 5).
Figure 5 illustrates “Teacher Forcing,” a fundamental and well-established technique used to train GPT models. This is the same approach used in [18].
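A minimal sketch of this sampling-and-shifting step is shown below (names such as `get_batch`, `data`, and `block_size` are our assumptions, not the exact training code):

```python
import torch

def get_batch(data: torch.Tensor, block_size: int, batch_size: int):
    """Sample random windows and build teacher-forcing targets.

    data: [N, n_features] time-ordered tensor of feature vectors.
    Returns xb, yb of shape [batch_size, block_size, n_features],
    where yb is xb shifted forward by one time step.
    """
    ix = torch.randint(0, data.shape[0] - block_size - 1, (batch_size,))
    xb = torch.stack([data[i : i + block_size] for i in ix])
    yb = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return xb, yb

# Example: 1300 casts with 35 features each
data = torch.randn(1300, 35)
xb, yb = get_batch(data, block_size=40, batch_size=32)
print(xb.shape, yb.shape)   # torch.Size([32, 40, 35]) torch.Size([32, 40, 35])
```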
More specifically, the architecture consists of the following:
An initial projection layer mapping input features (e.g., 35) to an embedding dimension (e.g., 512);
Positional encodings, which are learned rather than fixed sinusoidal;
Multiple transformer decoder blocks (e.g., 6 layers with 8 attention heads each);
A final linear projection back to the original feature space.
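The following is a minimal PyTorch sketch of such a model; the module names, the use of `nn.TransformerEncoderLayer` blocks with a causal mask to obtain decoder-only behavior, and the hyperparameter defaults are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class TimeSeriesGPT(nn.Module):
    """Decoder-only transformer for multivariate time-series forecasting (sketch)."""

    def __init__(self, n_features=35, d_model=512, n_heads=8, n_layers=6, block_size=40):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)                   # features -> embedding dim
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)               # decoder-style via causal mask
        self.output_proj = nn.Linear(d_model, n_features)                  # back to feature space

    def forward(self, x):                                                  # x: [B, T, n_features]
        T = x.shape[1]
        h = self.input_proj(x) + self.pos_emb[:, :T, :]
        # Causal mask so each step only attends to past and current positions
        mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.output_proj(h)                                         # [B, T, n_features]

model = TimeSeriesGPT()
out = model(torch.randn(32, 40, 35))
print(out.shape)   # torch.Size([32, 40, 35])
```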
At inference time, predictions are generated auto-regressively, using the model’s own outputs as inputs for subsequent steps (Figure 1). Defining the time step “t” in a time-series dataset is very important. For the silicon data, a cast (casting process) defines the time step “t”, which occurs every few hours. The time step “t” in the UCI appliances data is every 10 min.
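A sketch of this autoregressive rollout, reusing the `TimeSeriesGPT` sketch above and assuming a 9-step horizon as in the silicon use case:

```python
import torch

@torch.no_grad()
def forecast(model, context, n_steps=9):
    """Autoregressive inference: each predicted step is fed back as input.

    context: [1, T, n_features] observed history.
    Returns [1, n_steps, n_features] of predicted feature vectors.
    """
    block_size = model.pos_emb.shape[1]        # crop to the window the model was trained on
    x = context.clone()
    preds = []
    for _ in range(n_steps):
        out = model(x[:, -block_size:, :])
        next_step = out[:, -1:, :]             # last position = next-step prediction
        preds.append(next_step)
        x = torch.cat([x, next_step], dim=1)   # feed the prediction back in
    return torch.cat(preds, dim=1)

y9 = forecast(model, torch.randn(1, 40, 35))   # `model` from the sketch above
print(y9.shape)                                # torch.Size([1, 9, 35])
```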
3.3. Training Chunk Generation
Small datasets were used for initial training; this is one of the drivers for augmenting the data via preference annotation. A large amount of per-minute data was available, but it was consolidated on a per-cast basis. The final dataset consists of 1300 samples of real data collected at the blast furnace site. Because of operational requirements, a data sub-sampling technique is used: to predict the next 9 casts, the model must be trained on only the previous 100, 200, 300, or 400 samples, which further reduces the available training data. This requirement led us to a fine-tuning approach based on preference annotation and augmentation.
3.4. Preference Annotation Pipeline
As is common in the literature, we generate training data in the form of triplets: (input, preferred, rejected). Each triplet corresponds to a time-series input and two different predicted output sequences.
Preferences are stored in JSON or CSV format and consist of floating-point arrays representing sequences of time-series data. Dropout and Gaussian noise were applied during trajectory generation to amplify the contrast between trajectories; without this measure, the two generated trajectories are very similar and difficult to distinguish. Another approach commonly used in the literature is to generate the two trajectories from two different GPT checkpoints. A sketch of this triplet-generation step is shown below.
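The sketch below shows one way such a triplet could be generated by perturbing two rollouts of the same checkpoint with dropout and Gaussian noise (it reuses the `forecast` sketch above; the noise level, file name, and JSON layout are illustrative assumptions, and the preferred/rejected assignment would come from the annotation step, human or heuristic):

```python
import json
import torch

def generate_candidate(model, context, n_steps=9, noise_std=0.05):
    """Roll out a trajectory with dropout active and Gaussian noise added,
    so that two calls produce visibly different candidates."""
    model.train()                                    # keep dropout on to diversify rollouts
    traj = forecast(model, context, n_steps)
    return traj + noise_std * torch.randn_like(traj)

context = torch.randn(1, 40, 35)
traj_a = generate_candidate(model, context)
traj_b = generate_candidate(model, context)

# Annotation (human judgment or a heuristic such as MSE/R2 against the realized
# trajectory) decides which candidate is preferred; here we simply store the triplet.
triplet = {
    "input": context.squeeze(0).tolist(),
    "preferred": traj_a.squeeze(0).tolist(),
    "rejected": traj_b.squeeze(0).tolist(),
}
with open("preference_triplets.jsonl", "a") as f:    # hypothetical file name
    f.write(json.dumps(triplet) + "\n")
```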
3.5. Score-Based Preference Optimization (SPO)
We apply Score-based Preference Optimization (SPO) to fine-tune the model using the preference triplet data. We originally drew inspiration from Group Relative Policy Optimization (GRPO) as introduced in the DeepSeekMath paper [16]; however, our implementation is a greatly simplified version of it. In our context, GRPO would use several generations (groups) to optimize the preference loss, whereas we only have two groups and therefore perform pairwise preference optimization. Our comparison metric is MSE, whereas GRPO uses a reward function.
Specifically, in our SPO approach, we compute a softmax over two scalar scores (e.g., MSE-based scalar) representing preferred and rejected outputs. We then optimize a cross-entropy loss to align the model’s preference with the lower-error prediction. This introduces a classification approach over score-based outputs. While simplified, our approach is inspired by GRPO in its use of score-based and probabilistic preference modeling.
We considered two preference optimization approaches: one that computes the difference between preferred and rejected scores, and one that combines elements of DPO and GRPO, which we refer to as SPO. The two approaches differ in how they handle the comparison between preferred and rejected outputs. The first approach uses a difference-based objective, applying a softplus function to the difference between the preferred and rejected scores.
In contrast, the GRPO-inspired approach (SPO) poses the problem as a type of classification, where the higher probability in a softmax distribution acts as a ranking. We apply a softmax over the two scores (a tensor of shape [1, 2]) and optimize a cross-entropy loss to assign higher probability to the preferred output (e.g., [1, 0]). This probabilistic treatment (via softmax) yields a score-based classification approach.
In summary, our method does not implement full Group Relative Policy Optimization (GRPO), but it is inspired by two of its key elements: (1) score-based logits scaled by a temperature parameter (T), and (2) a cross-entropy loss over softmaxed preferences to model probabilistic selection (i.e., a ranking). Our formulation, however, operates in a pairwise setting (two options: preferred and rejected) using regression-based scores (MSE) rather than group ranking. GRPO uses a more dynamic scaling than our fixed “T”, based on the mean and standard deviation of samples in the given group; we view those group-relative rewards as weights that quasi-rank outputs as better or worse, whereas we use MSEs and a softmax to obtain a type of probability ranking. And unlike GRPO, our SPO formulation does not, for now, include PPO-style policy ratios. The two loss variants are sketched side by side below.
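A minimal sketch of the two variants follows (`pref_score` and `rej_score` are the MSE-based scalars defined in Section 3.7, and the softplus form is our reading of the difference-based objective):

```python
import torch
import torch.nn.functional as F

def difference_loss(pref_score, rej_score):
    """Difference-based objective: push the preferred score (lower MSE) below the rejected one."""
    return F.softplus(pref_score - rej_score)

def spo_loss(pref_score, rej_score, temperature=0.1):
    """SPO: treat the pair as a 2-class problem; class 0 is the preferred output."""
    logits = torch.stack([-pref_score / temperature,
                          -rej_score / temperature]).unsqueeze(0)   # shape [1, 2]
    label = torch.tensor([0])                                       # index of the preferred class
    return F.cross_entropy(logits, label)

pref_score = torch.tensor(0.2)
rej_score = torch.tensor(0.8)
print(difference_loss(pref_score, rej_score), spo_loss(pref_score, rej_score))
```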
More formally and concretely, we present our formulation as follows. The core idea of SPO is to treat preference supervision as a type of classification problem.
Given a predicted output $y^{+}$ (preferred) and a predicted output $y^{-}$ (rejected), the model computes scores as follows, where $\hat{y}$ is the trajectory generated by the current model:

$$s^{+} = \mathrm{MSE}(\hat{y}, y^{+}), \qquad s^{-} = \mathrm{MSE}(\hat{y}, y^{-}).$$

These scores form the logits: they are scaled by the temperature $T$, negated, and softmaxed. We will refer to the resulting logit vector as $z = [-s^{+}/T, \; -s^{-}/T]$.

Generally speaking, the softmax function for a vector $z$ is defined as follows:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.$$

Additionally, the classic cross-entropy loss formulation for a 2-class classification is represented as follows:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i} t_i \log p_i,$$

where $t$ is the one-hot label vector (indicating that the preferred class is index 0). Given $t = [1, 0]$, the cross-entropy simplifies to the following for $i = 0$:

$$\mathcal{L}_{\mathrm{CE}} = -\log p_0,$$

where $p_0$ is the soft-maxed logit of the preferred output. Put together, the cross-entropy loss updates the weights of the model to produce trajectories that are more similar to the preferred sequence. In the next equation, the term inside the log is the classic softmax function for two values, and the full term

$$\mathcal{L}_{\mathrm{pref}} = -\log \frac{e^{-s^{+}/T}}{e^{-s^{+}/T} + e^{-s^{-}/T}}$$

is the full cross-entropy formulation. Notice that it resembles the Bradley–Terry formulation [19] used in the DPO paper.
To prevent overfitting or excessive drift from the base model, we add a KL-like penalty:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pref}} + \beta \, d\big(\pi_{\theta}(x), \pi_{\mathrm{ref}}(x)\big),$$

where $\pi_{\theta}$ is the current model and $\pi_{\mathrm{ref}}$ is the base (preference-free) model; essentially, this is the initial copy that was not fine-tuned. We tune $\beta$ (typically 0.1–0.5) and the temperature $T$ (e.g., 0.1) to control the contrastiveness and regularization strength. The distance term $d$ can be a simple function such as MSE or the Kullback–Leibler divergence.
3.6. Implementation Details
Training is performed using PyTorch (1.13.1) on NVIDIA GPUs. Models are optimized using AdamW, with the learning rate chosen according to the phase (pre-training vs. preference tuning). Training batches contain randomly sampled triplets from the preference corpus, and the GPT model is fine-tuned for 3–5 epochs depending on convergence (Figure 3). Metrics are reported across 1–9 forecast steps, with a focus on steps 1–4 and 5–9.
3.7. Score-Based Preference Optimization (SPO) Example
The total loss used during training combines the preference loss with a KL-like regularization term to prevent excessive deviation from the base model. Specifically, the total loss at each step is defined as follows:

$$\mathrm{total\_loss} = \mathrm{pref\_loss} + \beta \cdot \mathrm{kl\_term},$$

where the preference loss is computed using cross-entropy over score-based (MSE) logits:

$$\mathrm{pref\_loss} = \mathrm{CrossEntropy}(\mathrm{logits}, \; \mathrm{label}).$$
The regularization term kl_term is computed using the mean squared error (MSE) between the predictions of the fine-tuned model and the base model. We use this to constrain the model and keep it from drifting too far from its original behavior. The coefficient $\beta$ is a key hyperparameter for controlling drift; in our experiments, using too low a value of $\beta$ (e.g., 0.05) led to excessive drift.
To compute the preference scores, we proceed as follows. Preference data triplets include [input, preferred trajectory, rejected trajectory]. In simple terms, we feed the input to the GPT, and the GPT generates a new trajectory. This new trajectory (pred_new) is then compared to the preferred and rejected trajectories using MSE as follows:

$$\mathrm{pref\_score} = \mathrm{MSE}(\mathrm{pred\_new}, \; \mathrm{preferred}), \qquad \mathrm{rej\_score} = \mathrm{MSE}(\mathrm{pred\_new}, \; \mathrm{rejected}).$$

In our setup, a lower MSE indicates a more preferred output; given the predicted output pred_new, these are the two scalar scores we compute. A temperature parameter $T$ is applied to scale these scores before converting them into logits for softmax classification. The preference scores are scaled and negated to form the logits, which are computed as follows:

$$\mathrm{logits} = \left[-\frac{\mathrm{pref\_score}}{T}, \; -\frac{\mathrm{rej\_score}}{T}\right].$$

Lower scores indicate better predictions, and the negation ensures that the preferred output receives a higher probability under softmax. The label tensor is set as follows:

$$\mathrm{label} = [1, \; 0],$$

where class 0 corresponds to the preferred output (i.e., [1, 0]). Finally, the preference loss is computed using cross-entropy:

$$\mathrm{pref\_loss} = \mathrm{CrossEntropy}(\mathrm{logits}, \; \mathrm{label}).$$

The logits form a tensor of shape [2], which is unsqueezed to shape [1, 2] to match the expected input format of the cross-entropy loss function, representing one sample with two class scores.
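Putting these pieces together, a single SPO fine-tuning step might look like the sketch below (the gradient-enabled `rollout` helper, the batch layout, and the value β = 0.3 are assumptions for illustration, not the exact training code):

```python
import torch
import torch.nn.functional as F

def rollout(model, context, n_steps, block_size=40):
    """Autoregressive rollout that keeps gradients (needed for the preference loss)."""
    x = context
    preds = []
    for _ in range(n_steps):
        nxt = model(x[:, -block_size:, :])[:, -1:, :]   # next-step prediction
        preds.append(nxt)
        x = torch.cat([x, nxt], dim=1)
    return torch.cat(preds, dim=1)

def spo_step(model, base_model, triplet, optimizer, temperature=0.1, beta=0.3):
    """One SPO update on an (input, preferred, rejected) triplet (sketch)."""
    x, preferred, rejected = triplet
    n_steps = preferred.shape[1]

    pred_new = rollout(model, x, n_steps)               # trajectory from the current model
    with torch.no_grad():
        pred_base = rollout(base_model, x, n_steps)     # frozen, preference-free reference

    # Score-based logits: lower MSE = better, so negate before the softmax
    pref_score = F.mse_loss(pred_new, preferred)
    rej_score = F.mse_loss(pred_new, rejected)
    logits = torch.stack([-pref_score / temperature,
                          -rej_score / temperature]).unsqueeze(0)   # shape [1, 2]
    label = torch.tensor([0], device=logits.device)     # class 0 = preferred output

    pref_loss = F.cross_entropy(logits, label)
    kl_term = F.mse_loss(pred_new, pred_base)           # MSE stand-in for the KL-like penalty
    total_loss = pref_loss + beta * kl_term

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```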
As further illustration, consider pref_score = 0.2 and rej_score = 0.8, with a temperature T = 0.1 (the example value given above). After scaling, the scores become

$$\frac{\mathrm{pref\_score}}{T} = 2.0, \qquad \frac{\mathrm{rej\_score}}{T} = 8.0.$$

The logits are then constructed as follows:

$$\mathrm{logits} = [-2.0, \; -8.0].$$

Applying the softmax function produces

$$\mathrm{softmax}(\mathrm{logits}) \approx [0.9975, \; 0.0025],$$

and the log-softmax values are

$$\log \mathrm{softmax}(\mathrm{logits}) \approx [-0.0025, \; -6.0025].$$

With the label set to class 0 (indicating the preferred output), the one-hot target vector is $[1, 0]$, and the cross-entropy loss is computed as follows:

$$\mathcal{L}_{\mathrm{pref}} = -\log(0.9975) \approx 0.0025.$$
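These numbers can be checked in a few lines (assuming T = 0.1, as in the example above):

```python
import torch
import torch.nn.functional as F

pref_score, rej_score, T = 0.2, 0.8, 0.1
logits = torch.tensor([[-pref_score / T, -rej_score / T]])   # [[-2.0, -8.0]]
probs = F.softmax(logits, dim=-1)                            # ~[0.9975, 0.0025]
loss = F.cross_entropy(logits, torch.tensor([0]))            # ~0.0025
print(probs, loss)
```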
We provide this as a simple proof of concept for preference optimization and stress that many other approaches using fully realized DPO, GRPO, etc., can be used. Inspired by DPO, we do not wish to use a separate transformer-based reward model. However, reward models based on neural networks or heuristics could also be added for preference feedback.