1. Introduction
Multi-step time series forecasting in industrial settings is critical to maintaining quality, optimizing performance, and reducing operational risk. For example, accurately predicting the silicon content of molten iron in a blast furnace can enable early corrective actions to stabilize production. However, traditional models are typically trained using scalar-based loss functions like mean squared error (MSE), which may not capture the qualitative or domain-specific preferences that practitioners care about.
Recent advances in transformer architectures, especially decoder-only GPT-style models, offer new possibilities for autoregressive forecasting (GPT stands for Generative Pre-Trained Transformer [1]). Currently, these models still rely on metric-based supervision (e.g., MSE) and miss important information about how “good” a predicted sequence appears overall. This is especially important in the early prediction steps, where human intervention may matter the most.
In this paper, we propose an enhancement from scalar loss optimization to learning from preferences. We introduce a new type of training corpus consisting of preferences over predicted time-series trajectories. These preferences can be derived either from human judgments or from automated heuristics, such as comparing predicted and real trajectories using metrics like R2 or MSE. Using these preference annotations, we apply Score-based Preference Optimization (SPO), a contrastive loss method adapted from Reinforcement Learning from Human Feedback (RLHF), to fine-tune transformer models for time-series forecasting.
Our main hypothesis is that preference-based training may provide a richer training signal for time-series models operating in noisy prediction domains. We show empirically that SPO-trained models can improve forecasts across the prediction steps.
We validate our approach on both a proprietary dataset from a Midwestern steel manufacturer and a public benchmark (UCI Appliances Energy). The results demonstrate improvement over traditional fine-tuning using MSE loss alone.
In the following sections, we describe the background motivating preference-based learning for time series, our corpus design and annotation scheme, the SPO algorithm, and the empirical results.
LSTMs and XGBoost are two common approaches to time-series problems, but in this work we use GPTs. While transformers and GPTs are traditionally associated with text processing, we adapt the transformer decoder architecture to directly ingest numerical tabular time-series data.
ChatGPT exists in its current form largely thanks to RLHF and the decoder-only GPT architecture; preference alignment was one of the key techniques that helped solve many of its early challenges. It is only natural that this framework be adapted to other domains, such as time-series forecasting. The core contribution of this work is fine-tuning through preference annotations. While LSTMs and XGBoost are widely used for tabular data, they do not have the same track record as GPT models in auto-regressive generation, scalable parallelization, or alignment training using preference-based supervision. Our Score-based Preference Optimization (SPO) approach directly benefits from training on a preference corpus: SPO provides “contrastive” training over annotated time-series output sequences.
While LSTMs and XGBoost remain common in time-series forecasting, we deliberately focus on GPT-style transformers because they align with the current state of scalable, preference-driven machine learning. Although our annotation process is manual in this initial work, it can be fully automated in future iterations. This may, in the future, enable the generation of massive quantities of preference-labeled time-series data. At that scale, the benefits of GPT parallelism and architecture-level scaling become obvious, as demonstrated by large language models. Our goal is not only to introduce a new corpus, but also to establish a foundation for large-scale preference-aligned time-series annotation and modeling.
The key idea in converting a text-based GPT into a time-series GPT is understanding the shape and meaning of the tensors that serve as input and output of the model. In a text-based GPT, tokens are converted to embeddings: a tensor of shape [32, 40, 1] containing tokens is projected via an embedding layer to a tensor of shape [32, 40, 512], so that each token is represented by a vector of size 512. To adapt this for time series, a simple approach is to remove the embedding layer and feed the time-series tensor directly (e.g., of shape [32, 40, 28]). Here, the last dimension (i.e., 28) represents the features at a given time step “t”. This time-series tensor ([32, 40, 28]) can then be projected to another size such as [32, 40, 512]. The same applies to the output of the time-series GPT (Figure 1).
Figure 1 illustrates how a sequence of time-series vectors is passed as input to the GPT model. The output is a shifted version of the same sequence, where the final vector (highlighted in blue) represents the next-step prediction.
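To make the tensor shapes concrete, the following minimal sketch (our own illustration, not the exact implementation; the layer names `input_proj` and `output_proj` are assumed) shows the token-embedding lookup of a text GPT next to the linear projection used for time-series features:

```python
import torch
import torch.nn as nn

batch_size, seq_len, n_features, d_model = 32, 40, 28, 512

# Text GPT: integer tokens -> embedding vectors
tokens = torch.randint(0, 1000, (batch_size, seq_len))   # token ids, [32, 40]
token_emb = nn.Embedding(1000, d_model)(tokens)          # [32, 40, 512]

# Time-series GPT: real-valued feature vectors -> projected vectors
x = torch.randn(batch_size, seq_len, n_features)         # [32, 40, 28]
input_proj = nn.Linear(n_features, d_model)              # replaces the embedding table
h = input_proj(x)                                        # [32, 40, 512]

# The output head mirrors this: project back to the feature space
output_proj = nn.Linear(d_model, n_features)
y_hat = output_proj(h)                                   # [32, 40, 28]
print(token_emb.shape, h.shape, y_hat.shape)
```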
We have conducted many runs, and the results are promising; however, consistency has been an issue. Some runs produce strong results, while others are noisier.
Additionally, unlike the domain of text and images, there are not many pre-trained models for time-series industrial data prediction. This is in part because modeling time-series data may not be as intuitive as using text or images. Our approach outlines simple and clear methods for building time-series GPTs using the familiar steps of pre-training, fine-tuning, and preference optimization.
In this paper we discuss the advantages of decoder-only transformers for time-series forecasting, the challenges encountered, GPU operational conditions, and potential future improvements. Our findings suggest that transformer-based architectures could play a significant role in time-series modeling for industrial applications.
Steel blast furnaces require real-time control of process variables to maintain desired output quality. Predicting silicon content 1–9 steps into the future allows for early corrective action. The question, then, is whether a GPT-style transformer can be trained to forecast multi-step silicon levels from raw input sequences.
A GPT of this type is usually trained by minimizing an MSE loss. Standard MSE loss may not capture qualitative aspects of the signal, such as what looks better or operates better. Human evaluators or custom heuristics (e.g., R2), on the other hand, can guide learning via preferences over time series. In this work, we therefore consider two types of training. The first training approach is based on MSE loss optimization using a standard GPT architecture adapted for time-series data (Figure 1). The second training approach is based on preference optimization using a preference loss inspired by both Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The preference data annotation scheme is presented and discussed.
Text and image datasets are abundant, whereas industrial and tabular data remain limited. Building powerful models in these domains often requires more and better-quality data. Our work proposes a new methodology for generating preference-labeled data for time-series forecasting to address this gap. In particular, it seeks to capture human preferences in industrial settings. Our contention is that human preferences may provide a stronger signal than what can be obtained from standard scalar losses such as MSE. As such, this work contributes not only a new dataset, but more importantly, an annotation scheme for generating time series preference data that can begin with manual labeling and scale to automated processes.
The resulting corpus and learning method are reproducible, extensible, and designed to support broader research in time-series forecasting. Therefore, this work supports the goal of developing new data resources that enable new sequential machine learning approaches.
3. Materials and Methods
This work began with, and remains primarily focused on, silicon time-series forecasting. The silicon data used was collected from real furnace sensors at a Midwestern steel blast furnace. Many of the methods presented in this paper were developed on this dataset. However, the dataset is proprietary and we are not able to release it. For reproducibility, we also ran the methods, in parallel, on an open-source dataset from the UCI repository that is comparable in many respects to the proprietary silicon blast furnace data; our methods can therefore be replicated using this UCI dataset. For the rest of this paper, we refer to these two datasets as (1) the silicon blast furnace data and (2) the UCI appliances data [17]. The silicon blast furnace dataset has about 1300 samples, and the UCI appliances dataset has about 19,000 samples. While the initial datasets are small, the appeal of our proposed method is that we can greatly increase the data via preference annotation.
3.1. Features
The main features in the silicon blast furnace dataset include delta silicon, moving average silicon, silicon, hot blast moisture, hot blast temperature, natural gas injection, windrate, high purity oxygen, coal flow, cast average Mn, slag Fe, top gas CO, top gas CO2, top gas H2, top gas N2, slag SiO2, slag CaO, slag MgO, snort valve position, top pressure, hot blast pressure, taphole, hot metal temperature, cokerate, etc. The target variable that we tracked was silicon. A correlation matrix of some of the most important features can be seen in Figure 4. A correlation matrix is a common tool in data science that shows the relationships between features and outputs. For example, in this graph, the correlation between silicon (SI) and the coke rate is relatively high, with a value of 0.54.
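As a brief illustration of how such a matrix can be produced (a generic pandas sketch on synthetic data; the column names are taken from the feature list above for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-cast feature table (column names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1300, 4)),
                  columns=["SI", "cokerate", "hot_blast_temp", "coal_flow"])

corr = df.corr()        # Pearson correlation matrix, as plotted in Figure 4
print(corr.round(2))    # e.g., the paper reports corr(SI, cokerate) ~ 0.54 on real data
```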
The main features in the UCI appliances dataset include lights, T1, appliances, RH_1, T2, RH_2, T3, RH_3, T4, RH_4, T5, RH_5, T6, RH_6, T7, RH_7, T8, RH_8, T9, RH_9, T_out, Press_mm_hg, RH_out, windspeed, visibility, dewpoint, rv1, rv2, etc. The target variable that we tracked was appliances.
3.2. Time Series GPT Architecture
We employ a decoder-only GPT-style model adapted for multivariate time-series forecasting. The input is a 3D tensor of shape [batch_size, sequence_length, num_features], where each time step contains a feature vector of observed process variables (35 for the silicon data and 28 for UCI). The model is trained to auto-regressively predict the next step given a sequence (Figure 1). For the most part, we have not deviated much from the original architecture first proposed in [18]. During training, we randomly sample a sequence from the training set (xb) and then shift xb by 1 time step to create yb. These xb and yb tensors are then used to train the GPT using Teacher Forcing (Figure 5).
Figure 5 illustrates “Teacher Forcing,” a fundamental and well-established technique used to train GPT models. This is the same approach used in [18].
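A minimal sketch of this sampling-and-shifting step is shown below (names such as `get_batch`, `data`, and `block_size` are our assumptions, not the exact training code):

```python
import torch

def get_batch(data: torch.Tensor, block_size: int, batch_size: int):
    """Sample random windows and build teacher-forcing targets.

    data: [N, n_features] time-ordered tensor of feature vectors.
    Returns xb, yb of shape [batch_size, block_size, n_features],
    where yb is xb shifted forward by one time step.
    """
    ix = torch.randint(0, data.shape[0] - block_size - 1, (batch_size,))
    xb = torch.stack([data[i : i + block_size] for i in ix])
    yb = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return xb, yb

# Example: 1300 casts with 35 features each
data = torch.randn(1300, 35)
xb, yb = get_batch(data, block_size=40, batch_size=32)
print(xb.shape, yb.shape)   # torch.Size([32, 40, 35]) torch.Size([32, 40, 35])
```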
More specifically, the architecture consists of the following:
An initial projection layer mapping input features (e.g., 35) to an embedding dimension (e.g., 512);
Positional encodings, which are learned rather than fixed sinusoidal;
Multiple transformer decoder blocks (e.g., 6 layers with 8 attention heads each);
A final linear projection back to the original feature space.
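The following is a minimal PyTorch sketch of such a model; the module names, the use of `nn.TransformerEncoderLayer` blocks with a causal mask to obtain decoder-only behavior, and the hyperparameter defaults are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class TimeSeriesGPT(nn.Module):
    """Decoder-only transformer for multivariate time-series forecasting (sketch)."""

    def __init__(self, n_features=35, d_model=512, n_heads=8, n_layers=6, block_size=40):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)                   # features -> embedding dim
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)               # decoder-style via causal mask
        self.output_proj = nn.Linear(d_model, n_features)                  # back to feature space

    def forward(self, x):                                                  # x: [B, T, n_features]
        T = x.shape[1]
        h = self.input_proj(x) + self.pos_emb[:, :T, :]
        # Causal mask so each step only attends to past and current positions
        mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.output_proj(h)                                         # [B, T, n_features]

model = TimeSeriesGPT()
out = model(torch.randn(32, 40, 35))
print(out.shape)   # torch.Size([32, 40, 35])
```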
At inference time, predictions are generated auto-regressively, using the model’s own outputs as inputs for subsequent steps (Figure 1). Defining the time step “t” in a time-series dataset is very important. For the silicon data, a cast (casting process) defines the time step “t”, which occurs every few hours. The time step “t” in the UCI appliances data is every 10 min.
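A sketch of this autoregressive rollout, reusing the `TimeSeriesGPT` sketch above and assuming a 9-step horizon as in the silicon use case:

```python
import torch

@torch.no_grad()
def forecast(model, context, n_steps=9):
    """Autoregressive inference: each predicted step is fed back as input.

    context: [1, T, n_features] observed history.
    Returns [1, n_steps, n_features] of predicted feature vectors.
    """
    block_size = model.pos_emb.shape[1]        # crop to the window the model was trained on
    x = context.clone()
    preds = []
    for _ in range(n_steps):
        out = model(x[:, -block_size:, :])
        next_step = out[:, -1:, :]             # last position = next-step prediction
        preds.append(next_step)
        x = torch.cat([x, next_step], dim=1)   # feed the prediction back in
    return torch.cat(preds, dim=1)

y9 = forecast(model, torch.randn(1, 40, 35))   # `model` from the sketch above
print(y9.shape)                                # torch.Size([1, 9, 35])
```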
3.3. Training Chunk Generation
Small datasets were used for initial training; this is one of the drivers for augmenting the data via preference annotation. A large amount of per-minute data was available, but it was consolidated on a per-cast basis. The final dataset consists of 1300 samples of real data collected at the blast furnace site. Because of operational requirements, a data sub-sampling technique is used: to predict the next 9 casts, the model must be trained on only the previous 100, 200, 300, or 400 samples, which further reduces the available training data. This requirement led us to a fine-tuning approach based on preference annotation and augmentation.
3.4. Preference Annotation Pipeline
As is common in the literature, we generate training data in the form of triplets: (input, preferred, rejected). Each triplet corresponds to a time-series input and two different predicted output sequences.
Preferences are stored in JSON or CSV format and consist of floating-point arrays representing sequences of time-series data. Dropout and Gaussian noise were applied during trajectory generation to amplify the contrast between trajectories; without this measure, the two generated trajectories are very similar and difficult to distinguish. Another approach commonly used in the literature is to generate the two trajectories from two different GPT checkpoints. A sketch of this triplet-generation step is shown below.
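The sketch below shows one way such a triplet could be generated by perturbing two rollouts of the same checkpoint with dropout and Gaussian noise (it reuses the `forecast` sketch above; the noise level, file name, and JSON layout are illustrative assumptions, and the preferred/rejected assignment would come from the annotation step, human or heuristic):

```python
import json
import torch

def generate_candidate(model, context, n_steps=9, noise_std=0.05):
    """Roll out a trajectory with dropout active and Gaussian noise added,
    so that two calls produce visibly different candidates."""
    model.train()                                    # keep dropout on to diversify rollouts
    traj = forecast(model, context, n_steps)
    return traj + noise_std * torch.randn_like(traj)

context = torch.randn(1, 40, 35)
traj_a = generate_candidate(model, context)
traj_b = generate_candidate(model, context)

# Annotation (human judgment or a heuristic such as MSE/R2 against the realized
# trajectory) decides which candidate is preferred; here we simply store the triplet.
triplet = {
    "input": context.squeeze(0).tolist(),
    "preferred": traj_a.squeeze(0).tolist(),
    "rejected": traj_b.squeeze(0).tolist(),
}
with open("preference_triplets.jsonl", "a") as f:    # hypothetical file name
    f.write(json.dumps(triplet) + "\n")
```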
3.5. Score-Based Preference Optimization (SPO)
We apply Score-based Preference Optimization (SPO) to fine-tune the model using the preference triplet data. We originally drew inspiration from Group Relative Policy Optimization (GRPO) as introduced in the DeepSeekMath paper [16]; however, our implementation is a greatly simplified version of it. In our context, GRPO would use several generations (groups) to optimize the preference loss, whereas we only have two groups and therefore perform pairwise preference optimization. Our comparison metric is MSE, whereas GRPO uses a reward function.
Specifically, in our SPO approach, we compute a softmax over two scalar scores (e.g., MSE-based scalar) representing preferred and rejected outputs. We then optimize a cross-entropy loss to align the model’s preference with the lower-error prediction. This introduces a classification approach over score-based outputs. While simplified, our approach is inspired by GRPO in its use of score-based and probabilistic preference modeling.
We considered two preference optimization approaches: one that computes the difference between preferred and rejected scores, and one that combines elements of DPO and GRPO, which we refer to as SPO. The two approaches differ in how they handle the comparison between preferred and rejected outputs. The first approach uses a difference-based objective, applying a softplus function to the difference between the preferred and rejected scores.
In contrast, the GRPO-inspired approach (SPO) poses the problem as a type of classification, where the higher probability in a softmax distribution acts as a ranking. We apply a softmax over the two scores (a tensor of shape [1, 2]) and optimize a cross-entropy loss to assign higher probability to the preferred output (e.g., [1, 0]). This probabilistic treatment (via softmax) yields a score-based classification approach.
In summary, our method does not implement full Group Relative Policy Optimization (GRPO), but it is inspired by two of its key elements: (1) score-based logits scaled by a temperature parameter (T), and (2) a cross-entropy loss over softmaxed preferences to model probabilistic selection (i.e., a ranking). Our formulation, however, operates in a pairwise setting (two options: preferred and rejected) using regression-based scores (MSE) rather than group ranking. GRPO uses a more dynamic scaling than our fixed “T”, based on the mean and standard deviation of samples in the given group; we view those group-relative rewards as weights that quasi-rank outputs as better or worse, whereas we use MSEs and a softmax to obtain a type of probability ranking. And unlike GRPO, our SPO formulation does not, for now, include PPO-style policy ratios. The two loss variants are sketched side by side below.
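A minimal sketch of the two variants follows (`pref_score` and `rej_score` are the MSE-based scalars defined in Section 3.7, and the softplus form is our reading of the difference-based objective):

```python
import torch
import torch.nn.functional as F

def difference_loss(pref_score, rej_score):
    """Difference-based objective: push the preferred score (lower MSE) below the rejected one."""
    return F.softplus(pref_score - rej_score)

def spo_loss(pref_score, rej_score, temperature=0.1):
    """SPO: treat the pair as a 2-class problem; class 0 is the preferred output."""
    logits = torch.stack([-pref_score / temperature,
                          -rej_score / temperature]).unsqueeze(0)   # shape [1, 2]
    label = torch.tensor([0])                                       # index of the preferred class
    return F.cross_entropy(logits, label)

pref_score = torch.tensor(0.2)
rej_score = torch.tensor(0.8)
print(difference_loss(pref_score, rej_score), spo_loss(pref_score, rej_score))
```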
More formally and concretely, we present our formulation as follows. The core idea of SPO is to treat preference supervision as a type of classification problem.
Given a predicted output $y^{+}$ (preferred) and a predicted output $y^{-}$ (rejected), the model computes scores as follows, where $\hat{y}$ is the trajectory generated by the current model:

$$s^{+} = \mathrm{MSE}(\hat{y}, y^{+}), \qquad s^{-} = \mathrm{MSE}(\hat{y}, y^{-}).$$

These scores form the logits: they are scaled by the temperature $T$, negated, and softmaxed. We will refer to the resulting logit vector as $z = [-s^{+}/T, \; -s^{-}/T]$.

Generally speaking, the softmax function for a vector $z$ is defined as follows:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.$$

Additionally, the classic cross-entropy loss formulation for a 2-class classification is represented as follows:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i} t_i \log p_i,$$

where $t$ is the one-hot label vector (indicating that the preferred class is index 0). Given $t = [1, 0]$, the cross-entropy simplifies to the following for $i = 0$:

$$\mathcal{L}_{\mathrm{CE}} = -\log p_0,$$

where $p_0$ is the soft-maxed logit of the preferred output. Put together, the cross-entropy loss updates the weights of the model to produce trajectories that are more similar to the preferred sequence. In the next equation, the term inside the log is the classic softmax function for two values, and the full term

$$\mathcal{L}_{\mathrm{pref}} = -\log \frac{e^{-s^{+}/T}}{e^{-s^{+}/T} + e^{-s^{-}/T}}$$

is the full cross-entropy formulation. Notice that it resembles the Bradley–Terry formulation [19] used in the DPO paper.
To prevent overfitting or excessive drift from the base model, we add a KL-like penalty:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pref}} + \beta \, d\big(\pi_{\theta}(x), \pi_{\mathrm{ref}}(x)\big),$$

where $\pi_{\theta}$ is the current model and $\pi_{\mathrm{ref}}$ is the base (preference-free) model; essentially, this is the initial copy that was not fine-tuned. We tune $\beta$ (typically 0.1–0.5) and the temperature $T$ (e.g., 0.1) to control the contrastiveness and regularization strength. The distance term $d$ can be a simple function such as MSE or the Kullback–Leibler divergence.
3.6. Implementation Details
Training is performed using PyTorch (1.13.1) on NVIDIA GPUs. Models are optimized using AdamW, with the learning rate chosen according to the phase (pre-training vs. preference tuning). Training batches contain randomly sampled triplets from the preference corpus, and the GPT model is fine-tuned for 3–5 epochs depending on convergence (Figure 3). Metrics are reported across 1–9 forecast steps, with a focus on steps 1–4 and 5–9.
3.7. Score-Based Preference Optimization (SPO) Example
The total loss used during training combines the preference loss with a KL-like regularization term to prevent excessive deviation from the base model. Specifically, the total loss at each step is defined as follows:

$$\mathrm{total\_loss} = \mathrm{pref\_loss} + \beta \cdot \mathrm{kl\_term},$$

where the preference loss is computed using cross-entropy over score-based (MSE) logits:

$$\mathrm{pref\_loss} = \mathrm{CrossEntropy}(\mathrm{logits}, \; \mathrm{label}).$$
The regularization term kl_term is computed using the mean squared error (MSE) between the predictions of the fine-tuned model and the base model. We use this to constrain the model and keep it from drifting too far from its original behavior. The coefficient $\beta$ is a key hyperparameter for controlling drift; in our experiments, using too low a value of $\beta$ (e.g., 0.05) led to excessive drift.
To compute the preference scores, we proceed as follows. Preference data triplets include [input, preferred trajectory, rejected trajectory]. In simple terms, we feed the input to the GPT, and the GPT generates a new trajectory. This new trajectory (pred_new) is then compared to the preferred and rejected trajectories using MSE as follows:

$$\mathrm{pref\_score} = \mathrm{MSE}(\mathrm{pred\_new}, \; \mathrm{preferred}), \qquad \mathrm{rej\_score} = \mathrm{MSE}(\mathrm{pred\_new}, \; \mathrm{rejected}).$$

In our setup, a lower MSE indicates a more preferred output; given the predicted output pred_new, these are the two scalar scores we compute. A temperature parameter $T$ is applied to scale these scores before converting them into logits for softmax classification. The preference scores are scaled and negated to form the logits, which are computed as follows:

$$\mathrm{logits} = \left[-\frac{\mathrm{pref\_score}}{T}, \; -\frac{\mathrm{rej\_score}}{T}\right].$$

Lower scores indicate better predictions, and the negation ensures that the preferred output receives a higher probability under softmax. The label tensor is set as follows:

$$\mathrm{label} = [1, \; 0],$$

where class 0 corresponds to the preferred output (i.e., [1, 0]). Finally, the preference loss is computed using cross-entropy:

$$\mathrm{pref\_loss} = \mathrm{CrossEntropy}(\mathrm{logits}, \; \mathrm{label}).$$

The logits form a tensor of shape [2], which is unsqueezed to shape [1, 2] to match the expected input format of the cross-entropy loss function, representing one sample with two class scores.
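Putting these pieces together, a single SPO fine-tuning step might look like the sketch below (the gradient-enabled `rollout` helper, the batch layout, and the value β = 0.3 are assumptions for illustration, not the exact training code):

```python
import torch
import torch.nn.functional as F

def rollout(model, context, n_steps, block_size=40):
    """Autoregressive rollout that keeps gradients (needed for the preference loss)."""
    x = context
    preds = []
    for _ in range(n_steps):
        nxt = model(x[:, -block_size:, :])[:, -1:, :]   # next-step prediction
        preds.append(nxt)
        x = torch.cat([x, nxt], dim=1)
    return torch.cat(preds, dim=1)

def spo_step(model, base_model, triplet, optimizer, temperature=0.1, beta=0.3):
    """One SPO update on an (input, preferred, rejected) triplet (sketch)."""
    x, preferred, rejected = triplet
    n_steps = preferred.shape[1]

    pred_new = rollout(model, x, n_steps)               # trajectory from the current model
    with torch.no_grad():
        pred_base = rollout(base_model, x, n_steps)     # frozen, preference-free reference

    # Score-based logits: lower MSE = better, so negate before the softmax
    pref_score = F.mse_loss(pred_new, preferred)
    rej_score = F.mse_loss(pred_new, rejected)
    logits = torch.stack([-pref_score / temperature,
                          -rej_score / temperature]).unsqueeze(0)   # shape [1, 2]
    label = torch.tensor([0], device=logits.device)     # class 0 = preferred output

    pref_loss = F.cross_entropy(logits, label)
    kl_term = F.mse_loss(pred_new, pred_base)           # MSE stand-in for the KL-like penalty
    total_loss = pref_loss + beta * kl_term

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```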
As further illustration, consider pref_score = 0.2 and rej_score = 0.8, with a temperature T = 0.1 (the example value given above). After scaling, the scores become

$$\frac{\mathrm{pref\_score}}{T} = 2.0, \qquad \frac{\mathrm{rej\_score}}{T} = 8.0.$$

The logits are then constructed as follows:

$$\mathrm{logits} = [-2.0, \; -8.0].$$

Applying the softmax function produces

$$\mathrm{softmax}(\mathrm{logits}) \approx [0.9975, \; 0.0025],$$

and the log-softmax values are

$$\log \mathrm{softmax}(\mathrm{logits}) \approx [-0.0025, \; -6.0025].$$

With the label set to class 0 (indicating the preferred output), the one-hot target vector is $[1, 0]$, and the cross-entropy loss is computed as follows:

$$\mathcal{L}_{\mathrm{pref}} = -\log(0.9975) \approx 0.0025.$$
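These numbers can be checked in a few lines (assuming T = 0.1, as in the example above):

```python
import torch
import torch.nn.functional as F

pref_score, rej_score, T = 0.2, 0.8, 0.1
logits = torch.tensor([[-pref_score / T, -rej_score / T]])   # [[-2.0, -8.0]]
probs = F.softmax(logits, dim=-1)                            # ~[0.9975, 0.0025]
loss = F.cross_entropy(logits, torch.tensor([0]))            # ~0.0025
print(probs, loss)
```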
We provide this as a simple proof of concept for preference optimization and stress that many other approaches using fully realized DPO, GRPO, etc., can be used. Inspired by DPO, we do not wish to use a separate transformer-based reward model. However, reward models based on neural networks or heuristics could also be added for preference feedback.