Article

Memory-Efficient Batching for Time Series Transformer Training: A Systematic Evaluation

1 Informatics Innovation Center of Excellence, School of Informatics, Walailak University, Nakhon Si Thammarat 80160, Thailand
2 Capital One, New York, NY 10171, USA
3 IBM Research India, Bangalore 560045, India
4 IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(6), 350; https://doi.org/10.3390/a18060350
Submission received: 3 May 2025 / Revised: 27 May 2025 / Accepted: 28 May 2025 / Published: 5 June 2025
(This article belongs to the Section Parallel and Distributed Algorithms)

Abstract

Transformer-based models are increasingly employed for time series data analysis. However, their training remains memory intensive, especially with high-dimensional data and extended look-back windows. While model-level memory optimizations are well studied, the batch formation process remains an underexplored source of performance inefficiency. This paper introduces a memory-efficient batching framework based on view-based sliding windows operating directly on GPU-resident tensors. This approach eliminates redundant data materialization caused by tensor stacking and reduces data transfer volumes without modifying model architectures. We present two variants of our solution: (1) per-batch optimization for datasets exceeding GPU memory, and (2) dataset-wise optimization for in-memory workloads. We evaluate the proposed batching framework systematically, using peak GPU memory consumption and epoch runtime as efficiency metrics across varying batch sizes, sequence lengths, feature dimensions, and model architectures. Results show consistent memory savings, averaging 90%, and runtime improvements of up to 33% across multiple transformer-based models (Informer, Autoformer, Transformer, and PatchTST) and a linear baseline (DLinear) without compromising model accuracy. We extensively validate our method using synthetic and standard real-world benchmarks, demonstrating accuracy preservation and practical scalability in distributed GPU environments. The proposed method highlights the batch formation process as a critical component for improving training efficiency.

1. Introduction

Transformer-based models have become increasingly popular and continue to gain traction across several industries. These large-scale pre-trained models have the ability to learn general representations from vast amounts of data and generalize across a wide range of downstream tasks. Applications of these models in the natural language processing (NLP) domain are emerging at an unprecedented rate, ranging from conversational agents [1,2,3] and machine translation [4,5,6] to content generation [7,8,9] and sentiment analysis [10,11], proving the usability of these models in real-world settings. Beyond NLP, transformer architectures have been adopted by closely related fields such as computer vision [12,13,14,15], code [16,17,18], and, increasingly, time series [19,20,21].
In time series data analysis, transformer-based models have emerged as a powerful tool for extracting meaningful patterns and insights from sequential data. These models exhibit remarkable capabilities in tasks such as forecasting [20,22], anomaly detection [21,23], and classification [19,24]. However, as dataset sizes grow, training time series transformer models becomes increasingly memory intensive, particularly when using sliding-window approaches to expose temporal dependencies. Although these methods are commonly used because of their effectiveness in capturing sequential patterns, they introduce substantial memory and computational overhead during batch preparation. Conventional training pipelines generate overlapped sliding windows on the CPU and then transfer stacked tensors to the GPU, resulting in redundant memory copies and inefficient data transfer. Although asynchronous data loaders can partially mask transfer costs, they do not eliminate data duplication. As batch sizes, sequence lengths, and feature dimensions increase, these inefficiencies become significant bottlenecks for scalability, especially in multivariate forecasting tasks [25,26,27,28].
To address this limitation, this paper revisits the often-overlooked issue of batching overhead and proposes an effective solution: view-based sliding-window batching. By replacing default tensor stacking operations with zero-copy tensor views, we eliminate redundant data materialization without changing model architectures or affecting downstream accuracy. While view-based operations have long existed in frameworks like PyTorch [29], their integration into large-scale training pipelines for time series models has not been systematically explored.
We develop a data preparation pipeline (https://github.com/psinthong/ME_Batch_Formation (accessed on 3 May 2025)) that applies view-based sliding-window optimizations to GPU-resident data for training time series transformer models. While our focus is on transformer architectures, we also validate the method’s broader applicability using a simple linear forecasting model. Although existing memory optimization methods, such as quantization, tensor rematerialization, and asynchronous CPU–GPU data loading, have shown significant progress, they primarily target model-level optimization or computational overlap techniques. Our work optimizes the batching pipeline itself by performing GPU-based, zero-copy sliding-window formation using tensor views. By eliminating redundant data materialization, our approach reduces memory usage and runtime without sacrificing model performance. Through extensive empirical analysis, we demonstrate that view-based batching yields non-trivial runtime and memory savings. We evaluate its impact across varying batch sizes, sequence lengths, feature dimensions, and model architectures. This evaluation offers a comprehensive view of its practical effectiveness. Our contributions are the following.
  • Memory-Optimized Batching Framework: We present a memory-efficient batching framework for training time series foundation models on GPUs, utilizing a view-based sliding-window technique to eliminate redundant data replications.
  • Comprehensive Analysis Across Multiple Training Dimensions: We provide an in-depth evaluation of our approach under varying sequence lengths, batch sizes, feature dimensions, and model architectures. This is the first study to systematically examine the end-to-end impact of batching strategies across the time series training stack.
  • Scalable Optimization for Distributed Training: We demonstrate the scalability of our memory optimization techniques in distributed training environments, enabling efficient training of large-scale time series models on GPU clusters.
The rest of this paper is organized as follows: Section 2 provides background, and Section 3 reviews related work. Section 4 provides an overview of our system. Section 5 details a set of performance experiments and results from running our optimizations against the traditional batching baseline, including tests on both transformer-based and linear models. Section 6 analyzes the experimental results, and Section 7 describes future work.

2. Background

In general, the training process of deep learning models involves several repetitive steps, from data preparation to training and evaluation. In the following subsections, we outline different time series models and time series-specific sample preparation, explain the batching process, and detail the data-loading pipeline used for facilitating model training.

2.1. Time Series Transformer Models

Transformer-based time series models have emerged as a promising approach for analyzing and extracting patterns from sequential data, drawing inspiration from the success of foundation models in natural language processing. These models are pre-trained on massive datasets, enabling them to capture complex temporal dependencies and generalize to various downstream tasks. The architectures employed in time series data analysis are diverse and rapidly evolving; transformer-based models are the most common. They leverage a self-attention mechanism to capture long-range dependencies and have shown remarkable success in time series forecasting. Examples include Informer [22], which addresses efficiency in processing long sequences; Autoformer [26], designed to improve model interpretability and adaptability; and PatchTST [30], which optimizes performance for high-dimensional time series data. Alongside transformers, architectures like the temporal fusion transformer (TFT) [31] have emerged, combining elements of recurrent and convolutional layers to handle both temporal dynamics and static covariates. Additionally, simpler architectures like the DLinear [32] model employ linear transformations, providing computational efficiency at the expense of expressive power. These models are often trained on large amounts of prepared data using powerful computational hardware such as GPUs to accelerate the training process. We will describe the process of time series data preparation and the data pipeline required to facilitate the training process in the next subsections.

2.2. Time Series Sample Preparation

While data preparation shares some common principles across time series foundation models, large language models (LLMs), and vision transformers, there are notable differences in the preprocessing steps, data representations, and batch formation strategies tailored to the characteristics of each domain. Understanding these differences is essential for designing efficient training pipelines and optimizing memory usage in deep learning applications.
Data preparation for training time series models diverges significantly from the processes employed for LLMs and vision transformers. While all three leverage large-scale pre-training strategies, the inherent nature of the data they handle necessitates distinct approaches. LLMs primarily operate on tokenized text data, where sentences or documents are converted into sequences of tokens. The emphasis lies on preparing sequential text data while preserving its inherent order. Vision transformers’ data preparation, on the other hand, involves partitioning images into non-overlapping patches and flattening them into sequences of vectors. In contrast, time series data preparation primarily focuses on segmenting sequential data into fixed-length subsequences using techniques like sliding windows to capture temporal dependencies. This involves storing overlapping segments, leading to increased memory usage. The issue intensifies when dealing with high-resolution time series data, small window sizes for capturing fine-grained temporal patterns, or small incremental steps to increase sample density. These notable differences reflect the unique characteristics of each domain.
Figure 1 shows the sliding-window operation performed on univariate time series data. We assume that these time series data are already sorted by timestamp in increasing order. The left side of Figure 1 displays an example of applying a sliding window of size six and a stride of one to time series data containing one variable. The sliding window is moved over one step at a time to create the next sequence; the step value is often referred to as the “stride”. Usually, the stride value is less than the window size, allowing for overlapping values across sequences. In the current implementation of sample preparation, each sequence of time series data is converted to tensors on the CPU before being transferred to GPU memory, resulting in data duplication across all sequences. This duplication becomes increasingly apparent when dealing with multivariate time series data, as demonstrated on the right side of Figure 1. It is important to note that both the window size and stride are hyperparameters that need to be configured. Domain-specific data characteristics (e.g., sampling frequency, seasonal characteristics, patterns, etc.) are often taken into account when selecting optimal values for these parameters. Hyperparameter optimization is not the focus of this paper, as these choices primarily affect model accuracy rather than the batching efficiency studied here.
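As a minimal illustration of this duplication (the values and window settings here are illustrative, not drawn from the paper’s datasets), consider the conventional CPU-side preparation:

```python
import torch

# Illustrative univariate series of 12 time steps, assumed sorted by timestamp.
series = torch.arange(12, dtype=torch.float32)

window, stride = 6, 1
# Conventional preparation: slice every window and stack the slices into a new tensor.
samples = [series[i:i + window] for i in range(0, len(series) - window + 1, stride)]
batch = torch.stack(samples)   # shape (7, 6); freshly allocated memory
# 7 windows x 6 steps = 42 stored values versus 12 in the original series,
# because every overlapping value is physically copied into the stacked result.
```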

2.3. Batching and Memory Usage

After all data samples are prepared according to the domain-specific data preparation steps, the system will start the batch formation process. Batching plays a fundamental role in the training of transformer-based models. It refers to the process of grouping training data into smaller subsets called batches, which are then sequentially fed to the model during training iterations. This technique offers several advantages, including computational efficiency during training by leveraging the parallel processing capabilities of modern GPUs, improving model convergence through gradient averaging, and mitigating noise inherent in individual samples. However, the batching process requires careful consideration of memory usage. The preparation and loading of each batch from the CPU to the GPU’s dedicated memory creates a temporary increase in peak memory consumption. This increase, coupled with the model’s memory requirements, can create a bottleneck, especially when working with large transformer-based models or substantial batch sizes.
Figure 2 outlines the default batching process. Step 1, sequence preparation, involves various data preparation steps; its outputs are sequences of tensors. Step 2 is the batch assembly process. For time series data analysis, batch assembly entails grouping windowed sequences of ordered data. By default, batch assembly is performed on the CPU, after the data preparation (sliding-window operation) has been applied to the ordered data, based on a pre-configured batch size. As a result, unnecessary data copies accumulate in memory, as shown on the right side of Figure 1. The amount of memory usage increases drastically with large window sizes and especially with high-dimensional (multivariate) data. During each training iteration, each batch of tensors goes through the following steps: data transfer from CPU to GPU (labeled 3 in Figure 2), batch processing during the forward pass (labeled 4 in Figure 2), and unloading during the backward pass. After the model processes each batch, the processed batch is released from GPU memory, and the system starts transferring the next batch of tensors. This data movement is facilitated by a data-loading pipeline, described in detail in the next subsection.

2.4. Data-Loading Pipeline

In order to leverage GPUs for model training, data have to be prepared and grouped into batches to feed to the model during training iterations. The data-loading pipeline ingests raw data, preprocesses them through transformations, and then iteratively feeds batches of data to the deep learning model for training. The pipeline also monitors the data flow and logs relevant statistics throughout the training process.
PyTorch’s data-loading pipeline exhibits a modular design, providing flexibility in how datasets are accessed and prepared for deep learning models. Due to the expressive and flexible nature of this design, we use PyTorch to implement our approach. The data-loading pipeline in PyTorch comprises three core components that interact to achieve scalable training iterations:
  • __getitem__: Implemented within a custom Dataset subclass, the __getitem__ method is responsible for retrieving a single data sample (often a feature–label pair) and performing any necessary preprocessing on the raw data.
  • __len__: Also defined within the Dataset class, __len__ returns the total number of samples within the dataset. This information is crucial for the DataLoader to determine how many iterations are needed per epoch.
  • collate_fn: The collate_fn, a function passed to the DataLoader, dictates how individual samples retrieved via __getitem__ are aggregated into batches. Default stacking behavior can be overridden with custom collate_fn implementations to introduce optimizations or accommodate unique data structures.
Collectively, these components orchestrate the seamless transition from raw data within a dataset to model-ready batches, enabling efficient training of PyTorch deep learning models. This modular design also allows for scalability across devices via the distributed training techniques supported by the framework, which we leverage in our evaluation (Section 5.3.3). A minimal sketch of how these components fit together in a default pipeline is shown below.
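In this sketch, the dataset contents and shapes are illustrative rather than taken from the paper; the default collate_fn stacks the samples returned by __getitem__ into a new batch tensor, which is the behavior our method later replaces:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WindowedSeriesDataset(Dataset):
    """Default-style dataset: one sample per sliding window (illustrative)."""

    def __init__(self, series: torch.Tensor, window: int):
        self.series = series        # (time_steps, features), CPU-resident
        self.window = window

    def __len__(self):
        # Number of samples; the DataLoader derives iterations per epoch from this.
        return self.series.shape[0] - self.window + 1

    def __getitem__(self, idx):
        # One windowed sample; overlapping windows are returned as separate tensors.
        return self.series[idx:idx + self.window]

loader = DataLoader(WindowedSeriesDataset(torch.randn(10_000, 8), window=96),
                    batch_size=32)   # default collate_fn stacks samples into (32, 96, 8)
for batch in loader:
    batch = batch.to("cuda")         # the stacked copy is then transferred to the GPU
```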

3. Related Works

In terms of memory optimizations for transformer models, the innovations in this field fall into three main categories: quantization, layer optimization, and tensor rematerialization.
Quantization: Quantization reduces the precision of weights and activations within a model, typically from 32-bit floating-point numbers (FP32) to lower precision formats like 16-bit (FP16) or even 8-bit integers (INT8). This compression significantly reduces memory footprint without sacrificing significant accuracy. Post-training quantization techniques (PTQ) [33,34] achieve this by analyzing the distribution of weights and activations and identifying suitable lower-precision representations. Alternatively, quantization-aware training (QAT) [35,36] integrates quantization into the training process itself, allowing the model to adapt to the lower precision format.
Layer Optimization: Certain layers in transformer models are inherently more memory-intensive than others. For instance, self-attention layers, with their quadratic complexity in sequence length, can consume a substantial portion of memory. Layer optimization techniques focus on reducing the memory footprint of these specific layers. One approach involves using techniques like sparse attention [37,38], where attention weights are concentrated on a smaller subset of relevant tokens, leading to sparser activation tensors. Another strategy involves exploiting low-rank approximations for attention matrices [39,40], effectively capturing the essential information while reducing the number of parameters to be trained.
Tensor Swapping and Rematerialization: Transformer models often produce large intermediate tensors during the forward and backward passes. Tensor swapping and rematerialization techniques aim to optimize the management of these tensors in memory. Tensor swapping [41] leverages techniques like checkpointing [42] to temporarily store intermediate activations on secondary storage (e.g., disks) when they are not actively needed. This technique frees up precious GPU memory for computations involving other tensors. Rematerialization [43,44], on the other hand, focuses on reducing the memory overhead associated with creating these intermediate tensors altogether. Recomputing intermediate values rather than retaining them in memory is often accompanied by techniques like operation fusion, which combines multiple operations into a single kernel, minimizing the number of temporary tensors generated during the computation.
Existing batching techniques for time series transformer models typically rely on CPU-based preprocessing followed by GPU data transfers, resulting in redundant tensor replication, memory bottlenecks, and limited scalability. Even asynchronous loading methods, which overlap computation and data transfer, fail to fully eliminate data duplication. In contrast, our approach optimizes the batching pipeline itself rather than modifying model architectures or numerical precision. By employing GPU-resident, view-based sliding windows through a custom data-loading pipeline, it entirely prevents redundant tensor materialization, complements other memory optimizations, and integrates readily into distributed training workflows.

4. System Design

The combination of large model sizes, long sequence lengths, and batching creates significant memory constraints for efficient training of transformer models. Because of the limited set of operations available on GPUs, batching is typically performed on CPUs, which can accommodate user-defined functions for batch formation. This is especially true in the NLP domain, where tokenization requires string-manipulation operations that are currently available only on CPU devices. In time series, however, the batching process involves window-based operations that can be greatly accelerated by modern GPUs.
We propose a GPU-based batch formation process for time series data that pre-calculates the pre-batch data size and performs view-based sliding-window techniques directly on the GPU prior to creating batches of training data. This technique avoids re-materializing the redundant data in each window, yielding a two-fold benefit: users can increase batch sizes, which in turn reduces the overall training time, and models can attend to a longer history by increasing the size of the look-back window. It is important to note that identifying the most effective look-back window is an active area of research that is not covered in this paper. In resource-constrained development environments, the zero-copy sliding-window technique significantly reduces the memory requirement and allows training to take place on devices with limited memory. The novelty of our proposed method lies in two critical aspects. First, we leverage view-based operations directly on GPU-resident tensors to generate zero-copy sliding-window views without duplicating data. Second, we introduce a dual-mode operational strategy, “Optimize” (batch-wise GPU loading) and “Optimize in memory” (full GPU pre-loading), to adapt flexibly to available GPU resources. While the view-based operation itself is not new, applying it in this GPU-native, view-only manner for memory-efficient batching in time series transformer models represents a previously unexplored approach.
In the following subsections, we explain our system design by first outlining the two variants of our optimized batching approach in Section 4.1. Then, we elaborate on each of the components required to enable our optimizations on time series datasets during training in Section 4.2.

4.1. Solution Flowchart

After the raw data have been pre-processed, they are assembled into mini-batches of tensors for model training. As opposed to executing batch formation on the CPU (default mode) and then offloading the batches to GPU memory, we propose disabling the automatic batching process on the CPU and offloading the pre-batched data to the GPU as a first step. In this case, there are two scenarios to consider, as shown in Figure 3. First, if the size of the pre-batched data is larger than the available GPU memory, we apply sharding techniques to split the raw data into partitions that fit the memory. Second, if the pre-batched data are smaller than the available GPU memory, we can directly transfer all the raw data from the CPU to the GPU and start the batching process there. Before transferring the pre-batched data to the GPU, the data are converted from floating point numbers to tensors. It is important to note that we assume the raw data have already been sorted by time before being converted to tensors, because tensors do not support datetime as a data type.
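A minimal sketch of this mode decision is shown below; estimating the pre-batched size from the tensor shape and reserving an 80% safety margin for model state are illustrative assumptions, not values taken from the paper.

```python
import torch

def choose_mode(num_timesteps: int, num_features: int,
                dtype: torch.dtype = torch.float32, margin: float = 0.8) -> str:
    """Pick a batching mode from the size of the pre-batched (already sorted) data."""
    free_bytes, _total = torch.cuda.mem_get_info()            # free / total device memory
    elem = torch.empty((), dtype=dtype).element_size()        # bytes per value
    data_bytes = num_timesteps * num_features * elem
    if data_bytes < margin * free_bytes:                      # leave headroom for the model
        return "optimize_in_memory"   # preload the whole dataset onto the GPU
    return "optimize"                 # shard: transfer one batch's worth of data per iteration
```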
Once the pre-batched tensor is in GPU memory, we enable the sliding-window technique using memory pointers. After the sliding window is applied, the tensor batch is ready to be processed by the model. This batching process will then be repeated for subsequent batches of data for each training iteration. The batching process is often overlapped with the model’s backward pass for latency-hiding purposes. During the backward pass, the first batch will be released from the GPU memory and the second batch will be transferred into the available memory.
Our two operational modes are designed to optimally utilize the available GPU memory according to the size of the input data. However, the same sliding-window optimization is applied to the data in both variants of our design. In the experimental section, we refer to our per-batch optimization variant as “Optimize” and the data pre-loading variant as “Optimize in memory”. In the next subsection, we will describe our sliding-window optimization in detail.

4.2. Zero-Copy Sliding Window

In a default PyTorch workflow, even if view-based windowing operations are applied on the CPU side to generate sliding windows, those windowed tensors are re-materialized during batching, because PyTorch’s default batching (e.g., stacking individual samples with collate_fn) ends up copying or concatenating the windowed data into a new tensor. In other words, the result of window-based operations on the CPU does not remain a “zero-copy” view once individual samples are collected into a batch. Instead, PyTorch’s standard batching code (the default collate_fn) creates new contiguous blocks of memory that hold the stacked batch.
Compared to the default data preparation method depicted in Figure 1, we propose a modified sample preparation workflow. Instead of applying the sliding-window operation to the time series data residing on the CPU, we transfer a batch of data to the GPU and apply the sliding-window operation there, as shown in Figure 4. Our framework defers the windowing until the data are on the GPU and performs the operation via views, thus forming batches without triggering a second memory copy. Overlap redundancy is efficiently addressed through the use of tensor views, where overlapping windows reference the same underlying data in GPU memory, avoiding explicit duplication of data and significantly reducing memory consumption. The size of the pre-batched data is the batch size plus the sequence length minus one (B + S − 1) time steps. For instance, when preparing batches of five samples with a sequence length of six and a stride of one, we only need to copy pre-batched data containing ten time steps (B + S − 1) to the GPU. This contrasts with the original batch preparation workflow, which requires copying five samples, each with a sequence length of six. As indicated by the dotted line on the GPU side in Figure 4, the view-based sliding-window operation employs pointers instead of replicating data for each window, resulting in reduced memory usage and computational time for batch formation. Our batching strategy accommodates variable-length time series while maintaining fixed input shapes. For sequences longer than or equal to the window size, we apply view-based sliding-window generation directly using GPU-resident tensor views, without any padding. For sequences shorter than the window size, we apply minimal zero-padding to extend the sequence and yield one valid window. This selective padding ensures that no data are discarded.
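Continuing the example above (five samples, sequence length six, stride one), the zero-copy property can be checked directly in PyTorch; Tensor.unfold is used here as one view-producing operation, an illustrative choice rather than the paper’s exact call.

```python
import torch

chunk = torch.randn(10, 3, device="cuda")          # B + S - 1 = 10 time steps, 3 features
windows = chunk.unfold(0, 6, 1).permute(0, 2, 1)   # 5 windows of length 6 -> (5, 6, 3)
print(windows.shape)                               # torch.Size([5, 6, 3])
# The windows are views sharing storage with `chunk`: no per-window copies exist.
print(windows.data_ptr() == chunk.data_ptr())      # True
```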
In order to effectively utilize a view-based sliding window as part of the sample preparation workflow, we employ a customized Dataset class with a predefined batch-size calculation. An example code snippet for such a custom dataset is displayed in Figure 5. The data-partitioning logic that calculates the pre-batched sequence length (line 13) is part of the class initialization step, ensuring sufficient samples to create each batch of training data. In the code snippet, the variable “seq_len” stores the sequence length value. The __len__ method also has to be updated to return the number of batches instead of the number of samples (line 24). If padding is enabled on the sequences, this method has to be updated to reflect an additional last batch (if any). To avoid redundant data copies, the automatic batching mode has to be disabled by setting a corresponding flag in the DataLoader class.
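Since Figure 5 is not reproduced here, the following sketch shows one possible rendering consistent with that description; the variable names and the handling of the final partial batch are assumptions, not the authors’ exact code.

```python
import math
import torch
from torch.utils.data import Dataset, DataLoader

class PreBatchedSeriesDataset(Dataset):
    """Returns one pre-batched slice of raw data per index instead of one sample."""

    def __init__(self, series: torch.Tensor, seq_len: int, batch_size: int):
        self.series = series                         # (time_steps, features), sorted by time
        self.seq_len = seq_len
        self.batch_size = batch_size
        # A slice of batch_size + seq_len - 1 time steps yields batch_size windows.
        self.chunk_len = batch_size + seq_len - 1
        self.num_windows = series.shape[0] - seq_len + 1

    def __len__(self):
        # Number of batches, not number of samples (padding would add one more batch).
        return math.ceil(self.num_windows / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.series[start:start + self.chunk_len]   # raw, un-windowed slice

# batch_size=None disables PyTorch's automatic batching; each item already is a batch.
loader = DataLoader(PreBatchedSeriesDataset(torch.randn(10_000, 8),
                                            seq_len=96, batch_size=128),
                    batch_size=None)
```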
In addition to offloading sample preparation to the GPU, another equally important modification to the data pipeline is the custom sliding-window collation function.
The default behavior of PyTorch’s collate_fn, which involves stacking individual tensors into batches, can introduce memory overheads and redundant data transfers from host to device memory. To address this, we implemented a custom collate_fn (Algorithm 1, shown below) that leverages a sliding-window view mechanism on GPU-resident data. Our approach extracts overlapping sliding windows directly from the raw tensor residing on the GPU, thereby minimizing memory copies and preserving data locality. By employing views, we avoid the creation of intermediate tensors for each window. Furthermore, by operating directly on the GPU, we take advantage of the computational benefits of hardware acceleration. This view-based, GPU-centric collate_fn has the potential to streamline the training process, particularly for large-scale multivariate time series datasets. The reduction in memory footprint and enhanced device-side computations can lead to improved resource utilization and faster training iterations, as demonstrated later in our experimental section.
Algorithm 1: GPU-optimized sliding window collate_fn.
Data: Tensor of time series samples $B = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^{T \times F}$; window size $w$; stride $s$
Result: Tensor of batched windows $X \in \mathbb{R}^{n \times w \times F}$, corresponding labels $y \in \mathbb{R}^{n}$
$T \leftarrow \mathrm{move\_to\_device}(B, \mathrm{cuda})$   // Move input batch to GPU
$X \leftarrow \mathrm{generate\_sliding\_windows}(T, w, s)$   // Apply windowing using view-based transformation
$y \leftarrow \mathrm{extract\_labels}(B)$   // Collect labels from original batch
return $X, y$
For the input data, $\mathbb{R}^{T \times F}$ denotes the space of real-valued matrices in which $T$ corresponds to the number of time steps and $F$ represents the number of features measured at each time step. For the result, $\mathbb{R}^{n \times w \times F}$ has dimensions in which $n$ corresponds to the number of windowed samples per batch and $w$ represents the window size.
Based on a given batch size value, we assemble the corresponding number of sliding-window views with a predefined look-back window and a stride. Since the raw data reside in GPU memory, we can apply a time-based zero-copy sliding window on the tensors. This method creates sliding-window views with memory pointers to the actual raw data. In this way, sliding-windowed data are only a view, and data duplication across windows is greatly reduced as we only use pointers and not the actual data.
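One possible Python rendering of Algorithm 1 is sketched below, again using Tensor.unfold for the view-based windowing; the label construction (the step immediately following each window) is an illustrative single-step-ahead choice, not a detail specified in the paper.

```python
import torch

def gpu_sliding_window_collate(chunk: torch.Tensor, window: int, stride: int = 1):
    """Move a pre-batched slice to the GPU and expose its windows as views (Algorithm 1 sketch).

    chunk: (T, F) slice of the raw series; with single-step-ahead labels the slice needs
    one extra time step beyond the B + S - 1 discussed above (an assumption of this sketch).
    """
    chunk = chunk.to("cuda", non_blocking=True)                 # move input batch to GPU
    # View-based windowing: overlapping windows reference the same GPU storage.
    x = chunk[:-1].unfold(0, window, stride).permute(0, 2, 1)   # (n, window, F) views
    y = chunk[window::stride]                                   # (n, F) one-step-ahead labels
    return x, y

# Typical wiring (automatic batching disabled; num_workers=0, because CUDA transfers
# inside DataLoader worker processes require extra care):
# loader = DataLoader(dataset, batch_size=None,
#                     collate_fn=lambda c: gpu_sliding_window_collate(c, window=96))
```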
Figure 6 displays the modified multivariate time series data batch formation process. This approach differs from the traditional accumulation of windowed data on the CPU followed by replication to GPU memory. Instead, our method prioritizes efficiency by preparing only a batch’s worth of raw data on the CPU. These pre-batched data are subsequently tensorized and transferred to the GPU. The critical step of our approach is the zero-copy sliding-window operation executed directly on the pre-batched tensor within the GPU. This strategy significantly reduces memory usage, a known bottleneck in time series processing. Furthermore, by streamlining data handling and minimizing redundant data transfers, we achieve computational performance improvements, as shown later in Section 5.
Theoretical Justification: The substantial memory savings from our zero-copy sliding-window technique stem directly from a fundamental reduction in memory complexity. Traditional CPU-based batching and tensor stacking operations replicate tensor data, incurring a memory complexity of O(B × S × F), where B is the batch size, S is the sequence length, and F is the number of features. In contrast, our view-based sliding windows avoid actual data replication, using tensor views with pointer references. This effectively reduces the memory complexity to O((B + S) × F), with only minor overhead for the pointers. This analysis explains the observed empirical improvements in memory efficiency, which yield substantial savings, especially when B and S are large, as can be calculated using the formula displayed in Figure 7.
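As an illustrative calculation (using the batch-size-experiment configuration from Section 5.3.1: B = 500, S = 1000, F = 500, and 4-byte floats), the two complexities translate into roughly the following per-batch footprints:

```python
# O(B*S*F): stacked windows materialize every overlapping value
stacked = 500 * 1000 * 500 * 4          # = 1.0e9 bytes, about 0.93 GB per batch
# O((B+S)*F): the pre-batched slice holds B + S - 1 time steps only once
viewed = (500 + 1000 - 1) * 500 * 4     # = 3.0e6 bytes, about 2.9 MB per batch
print(f"reduction: {1 - viewed / stacked:.2%}")   # ~99.7%, in line with Section 5.3.1
```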

5. Experimental Setup and Evaluation

To the best of our knowledge, there is no existing open-source library or well-documented method explicitly addressing memory-efficient batching via sliding windows for time series models. Existing approaches typically rely on the default PyTorch tensor-stacking workflow, incurring substantial data replication. Thus, we compare our proposed system against PyTorch’s standard data pipeline as the baseline.
To comprehensively evaluate the effectiveness of our zero-copy sliding-window technique in optimizing memory usage for time series models, we conducted a set of experiments designed to evaluate memory consumption and runtime performance. We implemented our design in Python (v3.10.16, Python Software Foundation, DE, USA) with PyTorch (v2.5.1, PyTorch Foundation, CA, USA) and evaluated a custom PyTorch Dataset, collate_fn, and DataLoader against the default PyTorch data-loading pipeline to provide a comprehensive comparison.

5.1. Dataset Generation

We evaluate our approach using both synthetic and real-world time series datasets. Synthetic data were generated to systematically control dimensionality and sequence characteristics. For real-world validation, we utilized publicly available electricity transformer temperature (ETT) datasets (ETTh1, ETTh2, ETTm1, ETTm2), widely recognized benchmarks for multivariate forecasting. These datasets vary in sequence length, seasonal patterns, and feature dimensionality, ensuring comprehensive evaluation of the proposed batching method’s effectiveness. Detailed dataset statistics and source code are provided in our public GitHub repository for reproducibility.
For synthetic data, we employed a standard Python random floating point number generator to simulate multivariate time series datasets with controllable characteristics. This approach allowed us to tailor aspects like the number of features (dimensionality) and sequence length for a diverse experimental benchmark. Our simulations ensured dataset sizes were representative of real-world time series problems.

5.2. Experimental Setup

Our experiments were implemented in Python, leveraging the PyTorch framework for deep learning model construction and GPU acceleration. We disabled PyTorch’s automatic batching functionality to enforce the use of our custom collate_fn. For baseline results, all models without optimizations utilized the standard PyTorch data-loading pipeline for comparison.
We evaluated our view-based batching optimization using the following performance metrics (a minimal measurement sketch is provided after the list):
  • Memory Usage: We measured the peak GPU memory consumption during batching with special attention to memory behavior during the execution of collate_fn.
  • Runtime Performance: We measured (i) the processing time of each batch and (ii) the full end-to-end runtime per training epoch, averaged across multiple iterations.
  • Forecasting Accuracy: This was evaluated using standard mean squared error (MSE) and mean absolute error (MAE) metrics on the ETT benchmark datasets to confirm that our batching optimization does not affect model performance.
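The sketch below shows one way the memory and runtime metrics can be collected with PyTorch’s built-in utilities; it is illustrative instrumentation, not the authors’ exact benchmarking code.

```python
import time
import torch

def measure_loader(loader):
    """Peak GPU memory (GB) and mean per-batch preparation time (s) over one pass."""
    torch.cuda.reset_peak_memory_stats()
    times, it = [], iter(loader)
    while True:
        torch.cuda.synchronize()                 # time only finished GPU work
        start = time.perf_counter()
        try:
            batch = next(it)                     # batch formation / transfer under test
        except StopIteration:
            break
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    return peak_gb, sum(times) / len(times)
```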

5.2.1. Parameter Variations

To assess the generality and robustness of our method, we systematically varied the following training parameters:
  • Window Size: Different sliding-window lengths were tested to assess their impact on performance.
  • Batch Size: The relationship between batch size and memory/runtime efficiency was explored.
  • Feature Dimension: We varied the number of input features to test scalability in multivariate settings.
  • Model Architecture: We selected five time series model architectures to examine how our optimization interacts with varying model complexity.
We focused on four transformer-based architectures—Informer, Autoformer, PatchTST, and a vanilla Transformer—as well as a linear forecasting model (DLinear) for lightweight comparison. Each model was trained using its default configuration. The number of parameters in each model and the evaluation setup details are summarized in Table 1.

5.2.2. Error Metrics

We evaluate forecasting accuracy using two widely accepted metrics: mean absolute error (MAE) and mean squared error (MSE).
Mean absolute error (MAE) measures average absolute deviations:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Mean squared error (MSE) measures average squared deviations, penalizing larger errors more significantly:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Both metrics evaluate how closely the predictions ($\hat{y}_i$) match the actual values ($y_i$).
Cluster setup: All experiments were conducted on a computing cluster with A100 40 GB GPUs and a network connection speed of 200 Gbps. We ran each experiment 15 times with different random seeds and took the average of the last 10 runs to mitigate randomness and provide robust conclusions.

5.3. Results and Analysis

This section presents a comprehensive analysis of our view-based batching technique across a diverse set of training configurations. We evaluate its end-to-end impact on runtime and accuracy, considering factors such as model architecture, sequence length, batch size, and number of features. This evaluation provides the first systematic study of its practical benefits beyond micro-benchmarking.
We conducted several memory and runtime benchmarks using time series datasets of varying sizes generated by the random number generator. We compare our results against the standard PyTorch data pipeline with stacked tensors. The results are grouped into two main categories, memory optimization evaluation and runtime performance analysis, presented below.

5.3.1. Memory Optimization Evaluation

To understand the effect of our view-based sliding-window batching techniques on different dataset parameters, we measured peak memory usage per batch while varying these dataset parameters. We conducted memory usage comparisons when increasing the number of features, batch sizes, and sequence lengths.
(1) Number of features: For this particular experiment, we fixed the batch size to 500 samples and the sequence length to 1000 time steps. We compare the per-batch memory usage of the two variants of our GPU-optimized sliding-window technique against the original PyTorch tensor-stacking baseline. We varied the number of features to create seven datasets with 10 to 5000 features, mimicking real-world time series data from different industries and demonstrating the benefit and scalability of our approach. The experiment results and memory reduction percentages are displayed in Figure 8. Figure 8a displays the GPU memory usage to accommodate each batch of tensors for the training iterations. The x-axis represents each dataset with an increasing number of features. The y-axis shows the peak memory usage measured in gigabytes (GB). The results are plotted on a logarithmic scale for visualization purposes.
As the number of features increases, the memory usage for each batch also increases for all measured approaches, as shown in Figure 8a. However, both variants of our GPU-optimized approach (labeled “Optimized” and “Optimized in memory”) used less GPU memory than the default PyTorch tensor stacking approach (labeled “Original”) across all seven benchmarked datasets. The “preload” variant of our optimization (labeled “Optimized in memory”) consumes more memory than the “per-batch” variant (labeled “Optimized”) because all the raw data are transferred into GPU memory for the “preload” variant, whereas only a batch’s worth of data is transferred for the “per-batch” variant. Our optimizations reduced the memory usage by 98% and 99% on average for the “preload” and “per-batch” variants, respectively, as shown in Figure 8b.
(2) Batch size: We varied the size of each batch of data from 100 to 1000 samples. The sequence length was 1000 time steps and the dataset contained 500 time series features. The experiment results and memory reduction percentages are displayed in Figure 9. Figure 9a displays the GPU memory usage to accommodate each batch of tensors for the training process. The x-axis represents each dataset with increasing batch sizes. The y-axis shows the memory usage measured in gigabytes (GB). The results are plotted on a logarithmic scale.
As the number of samples per batch increases, the memory usage for each batch also increases linearly for the baseline and our “per-batch” optimization variant (labeled “Optimize”), because the system has to transfer an increasing amount of data per batch. However, the difference in memory usage between these two approaches is evident. In contrast, our optimization variant labeled “Optimize in memory” has a constant memory usage throughout the experiment even as the batch size increases, because this variant preloads all the data into GPU memory and only creates views of this data object. The size of the raw data was the same across all measured batch sizes; therefore, changes in the batch size have a negligible effect during batch formation. Our optimizations reduced the memory usage by 98% and 99% on average for the “preload” and “per-batch” variants, respectively, as shown in Figure 9b.
(3) Sequence length: Similar to the previous experiment, here we varied the sequence length (look-back window) from 100 to 5000 time steps per data sample and measured the memory usage. The batch size was 1000 samples and the dataset contained 500 time series features. The results of the average memory usage measurements with varying sequence lengths and the memory reduction percentages are displayed in Figure 10. Figure 10a displays the GPU memory usage to accommodate each batch of tensors with varying sequence lengths for the training process. The x-axis represents each dataset with an increasing sequence length. The y-axis shows the memory usage measured in gigabytes (GB). The results are plotted on a logarithmic scale to help with visualization.
The experimental results for the setting of varying sequence length are similar to the previous experiment of increasing the batch size. As the size (sequence length) of each sample increases, the memory usage for the baseline (labeled “Original”) and our “per-batch” optimization variant (labeled “Optimize”) increase linearly. However, the size of the raw data that was preloaded into GPU memory was exactly the same for the “preload” variant (labeled “Optimize in memory”) resulting in a constant memory usage across all evaluated sequence lengths. Our optimizations reduced the memory usage by 94.5% and 99% for the “preload” and “per-batch” variants, respectively, as shown in Figure 10b.

5.3.2. Runtime Performance Analysis

Modifications made to the data sample preparation and batch formation directly affect not only memory usage on both CPU and GPU devices but also data transfer time and batch processing time during the model’s forward pass. By offloading pre-batched data to the GPU, we reduce the amount of data typically transferred between the devices, thereby reducing the transfer time. We can also leverage the GPU’s computing power to perform the sliding-window operation for the batch formation process. We therefore conducted several runtime analyses to measure the impact and understand the efficacy of our approach.
To evaluate our hypotheses, we divide this set of evaluations into two parts. First, we benchmark the average time required for the system to prepare each batch of tensor data and have it available in GPU memory for the training process. We conducted three main experiments to understand the influence of our approach across different batch sizes, sequence lengths, and increasing numbers of features; this part of the evaluation demonstrates the runtime effect of the data size reduction. Second, we measure the end-to-end runtime to load and process each batch of data during the models’ training iterations. To understand the effect of our approach on different time series models, we also measured the average epoch training time for each model architecture. This latter part of the evaluation illustrates the effect of utilizing the GPU-optimized sliding-window operation for batch formation and how models of different complexity benefit from this optimization.
(1) Number of features: In this experiment, we compare the average time required to load each batch of data consisting of various numbers of features. We used the same seven datasets as in Section 5.3.1, with the number of features ranging from 10 to 5000, to measure the average per-batch runtime. The experiment results and runtime reduction percentages are displayed in Figure 11. Figure 11a displays the average time for each batch of tensors to become available in GPU memory for training. This includes the data transfer time from CPU to GPU and applying the GPU-optimized sliding-window operation to the data residing in GPU memory. The x-axis represents each dataset with an increasing number of features. The y-axis shows the time duration measured in seconds (s). For visualization, the values are plotted on a logarithmic scale.
As the number of features increases across datasets, the runtime for all approaches also increases accordingly, because increasing amounts of pre-batched data must be transferred from CPU to GPU and the sliding-window operation must be applied to datasets with more input features. However, our GPU-optimized sliding-window approach (both variants) has a much lower runtime across all dataset sizes. Across the seven datasets, our “per-batch” optimization approach (labeled “Optimized”) and the “preload” variant (labeled “Optimized in memory”) reduced the runtime by 99% on average, as shown in Figure 11b. The significant runtime reduction is a result of avoiding unnecessary materialization of windowed data across the time series dataset, and it grows with the number of features in the dataset. When applying our optimization under the assumption that the dataset size is smaller than the GPU memory, the entire dataset was loaded into GPU memory prior to the batching process, resulting in a much lower runtime.
(2) Batch size: For this experiment, we increased the number of samples per batch to explore the effect of runtime reduction on different batch sizes. The experiment results and runtime reduction percentages are displayed in Figure 12.
In this experiment, we used the same datasets as in Section 5.3.1 with batch sizes ranging from 100 to 1000 samples. The average runtime results are displayed in Figure 12a. The x-axis represents the dataset with increasing batch sizes. The y-axis shows the time duration per batch measured in seconds (s). Note that the values are plotted on a logarithmic scale for visualization.
As the number of samples per batch increases, the runtime for the baseline (labeled “Original”) increases linearly, because the system has to transfer and assemble an increasing amount of windowed data per batch. On the other hand, both of our optimized variants show a much lower increase in runtime. For our “preload” variant, with all data pre-loaded into GPU memory (labeled “Optimized in memory”), there is no increase in the time duration even when the batch size increases. This is because the entire dataset was already available in GPU memory and the same sliding-window settings were applied to it, making the effect of the batch size increase on the overall runtime negligible. The sliding windows applied to the dataset only resulted in views, with no additional data copies. For our “per-batch” variant (labeled “Optimized”), the increase in the time duration across different batch sizes is still significantly smaller than that of the baseline. On average, across the five different batch sizes, both of our optimized variants reduced the runtime by 99%, as shown in Figure 12b.
(3) Sequence length: For this experiment, we increased the number of time steps per sample to explore the effect of runtime reduction across different sample sequence lengths (window sizes). The experiment results and runtime reduction percentages are displayed in Figure 13.
In this experiment, we used the same datasets as the experiment in Section 5.3.1 with varied sequence lengths ranging from 100 to 5000 time steps to measure the average per-batch runtime. The results are displayed in Figure 13a. The x-axis represents the dataset with an increasing number of time steps (window size). The y-axis shows the time duration per batch measured in seconds (s). Note that the values are plotted on a logarithmic scale for visualization purposes.
Similar to the batch size evaluation results, as the number of time steps per window increases, the runtime for the baseline (labeled “Original”) increases linearly, because the system has to assemble an increasing volume of data as the sequence length per sample grows, even with a fixed number of features and batch size. On the other hand, our “per-batch” optimized variant (labeled “Optimized”) shows a significantly lower increase in runtime than the baseline. For our “preload” optimized variant (labeled “Optimized in memory”), there was no increase in the time duration even when the window size increased, because the entire dataset was already available in GPU memory. The changes in the window size only affect the sliding-window settings, and since the sliding-window operation in our implementation only creates views, there were no additional data copies as the sequence length changed. On average, across the five different sequence lengths, both of our optimized variants reduced the runtime by 99%, as shown in Figure 13b.
(4) Time-per-batch: In the second part of our runtime experiment, we measure the end-to-end runtime across multiple batches of data. The purpose of this experiment is to give us an insight into the performance of each data-loading pipeline when it is operating. In this evaluation, we compare the average runtime between our two variants of GPU-optimized sliding-window techniques and the default PyTorch’s tensor stacking workflow (labeled ‘Original’).
Figure 14 shows the runtime comparisons between our GPU-optimized technique in its two operational modes and the default PyTorch data-loading pipeline. The x-axis values indicate batch indices and the y-axis values show the time duration measured in seconds (s). The evaluation results are plotted on a logarithmic scale to help with visualization, but the actual value differences are much larger. In this evaluation, we utilized one of the datasets from our earlier experiments, containing 500 features. We set the batch size to 500 and the sequence length to 1000 to be consistent with the other experiments. We took the average runtime for each batch of data to become available in GPU memory, ready to go through the training iteration. In all of the evaluated systems, the time required to prepare and load each batch is consistent across batches because of the static batch size and sequence length. However, only the “Optimized in memory” variant of our GPU-optimized sliding-window approach shows a distinct runtime difference between the first batch and the subsequent batches. This mode assumes that the raw data fit entirely within the available GPU memory, so the entire dataset is loaded from the CPU into GPU memory before the batch formation process. Since all batches are created once the raw data are already in GPU memory, the largest cost is the data loading incurred when creating the first batch; for subsequent batches, the tensor data are already in GPU memory, resulting in significantly lower batching time.

5.3.3. End-to-End Training and Accuracy Evaluation

To evaluate the practical impact of our memory-efficient batching strategy on the full training workflow, we conducted end-to-end runtime measurements across several model architectures. In addition to performance, we validated that our proposed optimization method preserves model accuracy using real-world benchmarks.
(1) End-to-End Training: For this experiment, we apply our optimization approaches to four different transformer-based model architectures and one linear model to explore the effect of our optimizations on architectures of different complexity. To demonstrate both the practicality and scalability of our approach, we applied our optimizations in a distributed training mode. PyTorch’s Distributed Data Parallel (DDP) enables parallel training by partitioning the data and replicating the model architecture across the GPU devices. We applied DDP training on 4 GPUs and measured the average runtime per training iteration (epoch). The runtime reporting was conducted only on rank 0 of the training processes.
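For context, a minimal skeleton of such a DDP training loop is sketched below (launched with torchrun); the model constructor, per-rank data loader, loss, and optimizer settings are placeholders rather than the paper’s exact configuration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_ddp(build_model, build_loader, epochs: int = 1):
    """Each rank trains on its own data partition; gradients are all-reduced by DDP."""
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(build_model().cuda(rank), device_ids=[rank])
    loader = build_loader(rank)                       # rank-specific partition of the data
    optimizer = torch.optim.Adam(model.parameters())

    for _ in range(epochs):
        for x, y in loader:                           # batches prepared by the optimized pipeline
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
    dist.destroy_process_group()
```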
Figure 15 displays the average time duration for each training iteration (epoch). The x-axis represents the different model architecture names. The y-axis shows the time duration per epoch measured in seconds (s). Each training iteration includes the processes of loading each batch of data, applying the model, computing the loss and gradients, and adjusting the learning weights; this sequence of operations is repeated for all batches of data. We utilized our generated dataset from the previous experiment (Section 5.3.1) with 100 time series features. The sequence length was fixed to 300 time steps, the batch size was 100 samples, and the total number of time steps was 10,000. For each evaluated model architecture, we trained the model on a fixed multivariate time series dataset using both our optimized data-loading pipeline (both variants) and the standard PyTorch pipeline. We monitored and recorded the end-to-end runtime for each training epoch across multiple training runs, ensuring that all hyperparameters remained consistent across both the optimized and baseline configurations.
Our experimental results indicate a consistent trend across all evaluated architectures: our GPU-optimized sliding-window approach (both variants) outperforms the PyTorch baseline in terms of runtime efficiency. The magnitude of the speedup varied slightly between architectures, but in all cases, the reduction in runtime was statistically significant. We summarize the percentage of runtime reduction for each model architecture in Table 2. For the transformer-based architectures except PatchTST, our “Optimized” and “Optimized in memory” variants reduced the runtime on average by 20% and 33%, respectively. For PatchTST, the runtime reductions were 7% and 8% for the two optimized variants. This is due to PatchTST’s default extra pre-processing (patching) of input data: the tensor data had to be materialized, and the model itself consumed more memory and computational power than the other evaluated transformer-based models for time series forecasting. For DLinear, the runtime reductions were 15% for the “Optimized” variant and 95% for the “Optimized in memory” variant.
(2) Accuracy Evaluation: To verify that our batching optimizations do not impact model quality, we ran additional accuracy tests using the PatchTST/42 model (42 input patches with sequence length = 336) on four benchmark ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). These datasets have been extensively utilized for benchmarking and are publicly available in the Autoformer online repository [26]. The results are shown in Table 3. Our accuracy metrics are consistent with those reported in prior work [30], confirming that the optimization is transparent to model performance.

6. Discussion

In this section, we will go through each of the experimental results and analyze the behaviors we observed.
Our investigation into batch memory optimizations via view-based sliding windows for GPU-resident multivariate time series data has revealed compelling insights. The experimental results on both memory and runtime benchmarks demonstrate the superiority of our proposed optimizations over the conventional PyTorch tensor-stacking baseline.
The first set of experiments, examining memory consumption with varying feature counts, batch sizes, and sequence lengths (Figure 8, Figure 9 and Figure 10), reveals the advantages of our two optimization variants. The “preload” variant, suitable when the entire dataset fits within GPU memory, exhibits a stable memory footprint regardless of batch size or sequence length. This behavior is expected because the data is transferred once, upfront. The “per-batch” optimization, designed for larger datasets, shows a linear relationship between memory usage and each of the three varied dataset parameters, in line with its batch-wise data-loading mechanism. Even in this constrained setting, our approach consistently outperforms the baseline, particularly when handling high-dimensional time series data, showing its effectiveness in mitigating memory pressure.
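To make the distinction between the two variants concrete, the sketch below shows how view-based sliding windows can be formed with torch.Tensor.unfold, assuming the raw series is a (T, F) float tensor. The function names are illustrative and the forecast targets are omitted for brevity, so this is a simplified rendering of the idea rather than the exact implementation.

```python
# Minimal sketch of the two batching variants using view-based sliding windows.
import torch

def preload_batches(series_cpu, seq_len, batch_size, device="cuda"):
    """'Preload' variant: a single upfront transfer, then view-based batches."""
    series = series_cpu.to(device)              # (T, F) copied to the GPU once
    windows = series.unfold(0, seq_len, 1)      # view: (T - seq_len + 1, F, seq_len)
    windows = windows.transpose(1, 2)           # view: (num_windows, seq_len, F)
    for i in range(0, windows.size(0), batch_size):
        yield windows[i:i + batch_size]         # still a view; no stacking or copying

def per_batch_batches(series_cpu, seq_len, batch_size, device="cuda"):
    """'Per-batch' variant: transfer only the rows needed for the current batch."""
    num_windows = series_cpu.size(0) - seq_len + 1
    for i in range(0, num_windows, batch_size):
        span = series_cpu[i:i + batch_size + seq_len - 1].to(device)  # contiguous slice
        yield span.unfold(0, seq_len, 1).transpose(1, 2)              # GPU-resident views
```

In the preload variant the only host-to-device transfer is the single .to(device) call, which explains its flat memory footprint; in the per-batch variant only the rows covering the current batch cross the bus, and the overlapped windows are still expressed as views rather than stacked copies.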
The subsequent runtime analyses (Figure 11, Figure 12 and Figure 13) across varying numbers of features, batch sizes, and sequence lengths further solidify the efficacy of our optimizations. In all scenarios, both our optimization variants consistently surpass the PyTorch baseline in terms of runtime performance. The preloading variant, as anticipated, delivers the most substantial speedup due to the elimination of repetitive CPU-to-GPU data transfers. This result strongly advocates for preloading data onto the GPU whenever feasible. Notably, even the per-batch optimization variant, despite the unavoidable data transfers, maintains a consistent speed advantage over the baseline, emphasizing the efficiency of view-based operations in reducing computational overhead.
Applying our optimization techniques to five distinct model architectures (PatchTST, Autoformer, Informer, Transformer, and DLinear) operating in a distributed environment for the task of time series forecasting further demonstrates the generalizability and scalability of our approach. The consistent runtime improvements observed across all models, regardless of their complexity, highlight the broad applicability of our memory optimizations. However, the application of our optimizations to the PatchTST architecture revealed a distinctive behavior: while both variants still improved runtime compared to the baseline, the magnitude of the reduction was notably smaller than that observed for the other transformer-based models (Figure 15). This can be attributed to PatchTST’s default internal mechanism, which pads and patches the input tensors; this process requires the view-based inputs to be materialized. Since we have not modified any model architecture to take direct advantage of the view-based tensors, doing so is a future improvement that could significantly enhance the impact of our optimizations. Despite this limitation, our approach still offers a measurable performance boost even for PatchTST, suggesting its broad applicability.
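The materialization effect can be reproduced with a small snippet. This is not PatchTST’s actual code, and the patch length and stride values are arbitrary, but it illustrates how a patching step that produces a non-contiguous view and then calls .contiguous() (or an equivalent reshape) allocates a fresh tensor, returning part of the memory saved during batch formation.

```python
# Illustration (assumed shapes and patch parameters): patching a view-based batch forces a copy.
import torch

series = torch.randn(10_000, 100, device="cuda")         # (T=10,000, F=100)
batch = series.unfold(0, 300, 1)[:100].transpose(1, 2)   # view: (B=100, S=300, F=100)

before = torch.cuda.memory_allocated()
patches = batch.unfold(1, 16, 8).contiguous()            # patching materializes the data
after = torch.cuda.memory_allocated()
print(f"extra memory allocated by patching: {(after - before) / 2**20:.1f} MiB")
```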
A particularly distinct runtime reduction was observed for the DLinear model. The “preload” optimization variant, which pre-emptively transfers the entire dataset to GPU memory, achieved a remarkable 95% reduction in training iteration runtime. This substantial improvement is attributed to the relatively small size of the DLinear model and its consequently lower computational demands compared to other evaluated transformer-based architectures. In this context, the elimination of repeated data transfers between CPU and GPU, facilitated by the preload strategy, becomes a dominant factor in accelerating the overall runtime. Our proposed batching optimization operates at the data-loading level. Gradient computation and accumulation remain unaltered, ensuring compatibility with existing optimization strategies and maintaining the effective batch size and model convergence properties.
Although we have not modified any model architectures or altered dataset values, as a sanity check we ran the same ETT dataset benchmarks provided by PatchTST [45] and Autoformer [26] to confirm that our optimizations do not affect model accuracy. As shown in Section 5.3.3, the results are consistent with the values reported in the original paper [30].

7. Conclusions and Future Work

This study has demonstrated the significant potential of view-based sliding-window techniques for optimizing memory usage and accelerating training in time series transformer models. By intelligently leveraging views on GPU-resident data and bypassing the default PyTorch tensor-stacking behavior, we have achieved substantial reductions in memory consumption and runtime across various model architectures and data characteristics. The two variants of our optimization approach cater to different scenarios: preloading the entire dataset when it fits within GPU memory and applying optimizations per batch for larger datasets. Both variants consistently outperform the baseline, particularly in scenarios with high-dimensional data.
The consistent runtime improvements observed across diverse architectures, including transformer-based models (Autoformer, Informer, and Transformer), indicate the broad applicability of our approach. The experimental results on PatchTST highlight the interplay between model architecture and memory optimization strategies, emphasizing the need for tailored solutions.
Additionally, our use of distributed model training showcased the scalability of our approach, an essential aspect of training large-scale models on multiple GPUs. The consistent performance gains observed in the distributed environment reinforce the practical applicability of our optimizations for real-world scenarios.
While our findings are promising, several avenues remain for future exploration. To fully exploit the memory reduction achieved through our batch construction strategy, a compelling direction is to modify model architectures to operate directly on view-based tensors. This could yield further runtime and memory gains by minimizing unnecessary data materializations and transformations within the model. In addition, a more in-depth investigation into the impact of our optimizations on operator fusion and graph-based models could reveal further opportunities for performance improvement and more efficient, scalable training pipelines.
In conclusion, our work lays a solid foundation for future research into memory-efficient training strategies for time series transformer models. By integrating our findings with advancements in model architecture design and execution optimization, we envision a future where these powerful models can be trained more efficiently, enabling broader adoption and accelerating research progress in this rapidly evolving field.

Author Contributions

Conceptualization, P.S.; methodology, P.S.; validation, V.E. and A.J.; resources, N.N.; data curation, P.S. and N.N.; writing—original draft preparation, P.S. and P.K.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported in part by Walailak University’s research grant (grant number WU67219). This work was started in collaboration with IBM Research, Yorktown Heights, USA.

Data Availability Statement

All materials related to our study, including the benchmark scripts, source code, and datasets, are publicly accessible via our dedicated GitHub repository: https://github.com/psinthong/ME_Batch_Formation (accessed on 3 May 2025).

Conflicts of Interest

Author Dr. Nam Nguyen was employed by Capital One (United States); Mr. Vijay Ekambaram, Dr. Arindam Jati, and Dr. Jayant Kalagnanam were employed by IBM Research. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Authors P. Sinthong, N. H. Nguyen, V. Ekambaram, A. Jati, and J. R. Kalagnanam are listed inventors on the published U.S. Patent 18/377,564, titled “Sliding Window Memory Optimizations for Time-series Foundation Models,” which covers techniques related to the batching method presented in this paper. The authors declare no other conflicts of interest.

References

  1. Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239. [Google Scholar]
  2. Adiwardana, D.; Luong, M.T.; So, D.R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. Towards a human-like open-domain chatbot. arXiv 2020, arXiv:2001.09977. [Google Scholar]
  3. Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; Dolan, B. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv 2019, arXiv:1911.00536. [Google Scholar]
  4. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res. 2021, 22, 4839–4886. [Google Scholar]
  5. Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; Maillard, J.; et al. No language left behind: Scaling human-centered machine translation. arXiv 2022, arXiv:2207.04672. [Google Scholar]
  6. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 6–11 June 2021; pp. 483–498. [Google Scholar] [CrossRef]
  7. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 1877–1901. [Google Scholar]
  8. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  9. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models (2023). arXiv 2023, arXiv:2302.13971. [Google Scholar]
  10. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  11. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, p. 2. [Google Scholar]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  13. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  14. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  15. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 568–578. [Google Scholar]
  16. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. Codebert: A pre-trained model for programming and natural languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1536–1547. [Google Scholar]
  17. Ahmad, W.U.; Chakraborty, S.; Ray, B.; Chang, K.W. Unified pre-training for program understanding and generation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4171–4186. [Google Scholar]
  18. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 7–11 November 2021. [Google Scholar]
  19. Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Singapore, 14–18 August 2021; pp. 2114–2124. [Google Scholar]
  20. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
  21. Xu, J.; Wu, H.; Wang, J.; Long, M. Anomaly transformer: Time series anomaly detection with association discrepancy. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  22. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  23. Tuli, S.; Casale, G.; Jennings, N.R. TranAD: Deep transformer networks for anomaly detection in multivariate time series data. Proc. VLDB Endow. 2022, 15, 1201–1214. [Google Scholar] [CrossRef]
  24. Yang, C.H.H.; Tsai, Y.Y.; Chen, P.Y. Voice2series: Reprogramming acoustic models for time series classification. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 11808–11819. [Google Scholar]
  25. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A 2021, 379, 20200209. [Google Scholar] [CrossRef]
  26. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  27. Chowdhury, S.P.; Solomou, A.; Dubey, A.; Sachan, M. On learning the transformer kernel. arXiv 2021, arXiv:2110.08323. [Google Scholar]
  28. Andoorveedu, M.; Zhu, Z.; Zheng, B.; Pekhimenko, G. Tempo: Accelerating transformer-based model training through memory footprint reduction. Adv. Neural Inf. Process. Syst. 2022, 35, 12267–12282. [Google Scholar]
  29. Li, S.; Zhao, Y.; Varma, R.; Salpekar, O.; Noordhuis, P.; Li, T.; Paszke, A.; Smith, J.; Vaughan, B.; Damania, P.; et al. Pytorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow. 2020, 13, 2150–8097. [Google Scholar] [CrossRef]
  30. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  32. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128. [Google Scholar]
  33. Liu, Z.; Wang, Y.; Han, K.; Zhang, W.; Ma, S.; Gao, W. Post-training quantization for vision transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 28092–28103. [Google Scholar]
  34. Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 27168–27183. [Google Scholar]
  35. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  36. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. arXiv 2019, arXiv:1902.08153. [Google Scholar]
  37. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  38. Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv 2023, arXiv:2307.08691. [Google Scholar]
  39. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  40. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLORA: Efficient finetuning of quantized LLMs. Adv. Neural Inf. Process. Syst. 2024, 36, 10088–10115. [Google Scholar]
  41. Huang, C.C.; Jin, G.; Li, J. Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 16–20 March 2020; pp. 1341–1355. [Google Scholar]
  42. Chen, T.; Xu, B.; Zhang, C.; Guestrin, C. Training deep nets with sublinear memory cost. arXiv 2016, arXiv:1604.06174. [Google Scholar]
  43. Kirisame, M.; Lyubomirsky, S.; Haan, A.; Brennan, J.; He, M.; Roesch, J.; Chen, T.; Tatlock, Z. Dynamic tensor rematerialization. arXiv 2020, arXiv:2006.09616. [Google Scholar]
  44. Jain, P.; Jain, A.; Nrusimha, A.; Gholami, A.; Abbeel, P.; Gonzalez, J.; Keutzer, K.; Stoica, I. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proc. Mach. Learn. Syst. 2020, 2, 497–511. [Google Scholar]
  45. Nie, Y. PatchTST. 2023. Available online: https://github.com/yuqinie98/PatchTST (accessed on 3 March 2025).
Figure 1. Time-based sliding window.
Figure 2. Default batch formation steps.
Figure 3. System flowchart.
Figure 4. Modified sliding-window data preparation.
Figure 5. Custom dataset example.
Figure 6. Improved batch formation.
Figure 7. Memory saving by view-based batching vs. traditional stacking-based batching. B is the batch size, S is the sequence length, and F is the number of features.
Figure 8. Memory usage—increasing number of features. (a) Memory usage per batch; (b) memory reduction percentage.
Figure 9. Memory usage—increasing batch size. (a) Memory usage per batch; (b) memory reduction percentage.
Figure 10. Memory usage—increasing sequence length. (a) Memory usage per batch; (b) memory reduction percentage.
Figure 11. Runtime—increasing number of features. (a) Runtime per batch; (b) runtime reduction percentage.
Figure 12. Runtime—increasing batch size. (a) Runtime per batch; (b) runtime reduction percentage.
Figure 13. Runtime—increasing sequence length. (a) Runtime per batch; (b) runtime reduction percentage.
Figure 14. Multiple batches runtime comparison.
Figure 15. Model runtime comparison.
Table 1. Summary of experimental parameters and model configurations.

Controlled Parameters
Optimizer: Adam
Number of Training Epochs: 15 (runtime tests), 50 (accuracy tests with early stopping)
Optimizer Settings: Default settings (PyTorch)
Loss Function: MSELoss (PyTorch)
Hardware Platform: 4 × NVIDIA A100 GPUs (40 GB each)
Distributed Setup: PyTorch DDP with NCCL backend
Dataloader Threads: 4
GPU Memory Tracking: PyTorch CUDA memory API
Accuracy Metrics: MSE, MAE (see Section 5.2.2)

Varied Parameters
Feature Dimension: {10, 50, 100, 500, 1000, 3000, 5000}
Batch Size: {100, 300, 500, 700, 1000}
Sequence Length: {100, 500, 1000, 3000, 5000}
Model Architectures: Transformer, Informer, Autoformer, PatchTST, DLinear
Batching Mode: PyTorch Default, Optimized, Optimized In Memory
Datasets: Synthetic, ETTh1, ETTh2, ETTm1, ETTm2

Model Size (Number of Parameters)
Transformer: 10 million
Informer: 11 million
Autoformer: 11 million
PatchTST: 8 million
DLinear: 36,000
Table 2. Average percentage reduction in end-to-end epoch wall-clock time achieved by the proposed batching methods compared to the PyTorch baseline.

Method              | PatchTST | Informer | Autoformer | DLinear | Transformer
Optimized           | 6.97%    | 21.86%   | 19.20%     | 15.40%  | 21.99%
Optimized in memory | 8.48%    | 38.01%   | 30.51%     | 95.36%  | 31.50%
Table 3. Supervised PatchTST on ETT datasets with prediction length = 96, 192, 336, and 720.

Prediction Length | ETTh1 MSE | ETTh1 MAE | ETTh2 MSE | ETTh2 MAE | ETTm1 MSE | ETTm1 MAE | ETTm2 MSE | ETTm2 MAE
96                | 0.378     | 0.403     | 0.276     | 0.337     | 0.295     | 0.349     | 0.167     | 0.258
192               | 0.414     | 0.422     | 0.343     | 0.380     | 0.340     | 0.377     | 0.226     | 0.297
336               | 0.425     | 0.430     | 0.332     | 0.384     | 0.371     | 0.397     | 0.281     | 0.332
720               | 0.437     | 0.457     | 0.379     | 0.419     | 0.415     | 0.421     | 0.372     | 0.385