Article

TisLLM: Temporal Integration-Enhanced Fine-Tuning of Large Language Models for Sequential Recommendation †

1 School of Automation and Electrical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China
2 Tianjin Key Laboratory of Information Sensing and Intelligent Control, Tianjin 300222, China
3 The Technology Innovation Center of Cultural Tourism Big Data of Hebei Province, Langfang 065399, China
4 Information Technology Center, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled Time Series Large Language Model for Recommendation System, which was presented at RAIIC 2025, held from 4–6 July 2025, at Wangjiang Campus, Sichuan University, Chengdu, China.
Information 2025, 16(9), 818; https://doi.org/10.3390/info16090818
Submission received: 15 August 2025 / Revised: 7 September 2025 / Accepted: 18 September 2025 / Published: 21 September 2025

Abstract

In recent years, the remarkable versatility of large language models (LLMs) has spurred considerable interest in leveraging their capabilities for recommendation systems. Critically, we argue that the intrinsic aptitude of LLMs for modeling sequential patterns and temporal dynamics renders them uniquely suited for sequential recommendation tasks, a foundational premise explored in depth later in this work. This potential, however, is tempered by significant hurdles: a discernible gap exists between the general competencies of conventional LLMs and the specialized needs of recommendation tasks, and their capacity to uncover complex, latent data interrelationships often proves inadequate, potentially undermining recommendation efficacy. To bridge this gap, our approach centers on adapting LLMs through fine-tuning on dedicated recommendation datasets, enhancing task-specific alignment. Further, we present the Temporal Integration-Enhanced Fine-Tuning of Large Language Models for Sequential Recommendation (TisLLM) framework. TisLLM specifically targets the deeper excavation of implicit associations within recommendation data streams. Its core mechanism involves partitioning sequential user interaction data using temporally defined sliding windows. These chronologically segmented slices are then aggregated to form enriched contextual representations, which subsequently drive the LLM fine-tuning process. This methodology explicitly strengthens the model’s compatibility with the inherently sequential nature of recommendation scenarios. Rigorous evaluation on benchmark datasets provides robust empirical validation, confirming the effectiveness of the TisLLM framework.

Graphical Abstract

1. Introduction

In the era of information explosion, recommendation systems have become indispensable engines driving user engagement and satisfaction in critical domains such as e-commerce and online video platforms. Traditional recommendation models, including sequence-aware models that leverage users’ temporal behavioral sequences to capture dynamic preferences, are heavily reliant on extensive historical user–item interaction data and consequently face significant limitations. Their accuracy is often constrained by data sparsity and the curse of high dimensionality [1]. More fundamentally, these models are typically designed for and confined to specific domains, exhibiting poor transferability when confronted with cross-domain challenges like zero-shot or few-shot learning scenarios [2,3]. This inherent limitation severely hampers their effectiveness in handling the pervasive cold-start problem, where new users or items lack sufficient interaction history.
The advent of large language models (LLMs), pre-trained on massive and diverse corpora, presents a transformative opportunity. These models possess not only vast world knowledge but also demonstrate remarkable reasoning and contextual understanding capabilities [1]. Consequently, LLMs are emerging as a powerful supplement, or even alternative, to traditional recommendation paradigms. Recent research has illuminated two primary roles for LLMs within recommendation systems: acting as direct recommendation engines and serving as sophisticated information augmenters. As recommendation engines, LLMs leverage their generative power and ability to follow instructions. Through carefully crafted prompts, they enable flexible and natural interactions, capable of directly generating personalized recommendations without the explicit need for complex embedding layers or intricate model architectures [4,5,6]. This approach shows promise particularly in zero-shot settings [6]. For information augmentation, LLMs utilize their profound reasoning abilities to enrich raw input data. They can generate synthetic interactions, infer latent user preferences from textual descriptions, extract complex features, or build knowledge graphs, thereby providing higher-order semantic representations that enhance downstream recommendation models [7,8,9]. However, while powerful, the “black-box” nature of LLM reasoning poses significant challenges for explainability and trust in recommendation outcomes, an area demanding further exploration.
Beyond these established capabilities of direct generation and information enrichment, we identify a deeper, intrinsic structural alignment between large language models and sequential recommendation tasks, which underpins our view that LLMs are naturally suited for this domain. First, the chain rule of language modeling:
P(w_{1:T}) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1})
Here, P(w_{1:T}) is the joint probability of the entire token sequence (w_1, w_2, \ldots, w_T). This equation decomposes the joint probability into a product of conditional probabilities P(w_t \mid w_{1:t-1}), each predicting the next token w_t given all previous tokens in the sequence w_{1:t-1}.
P(i_{1:T} \mid u) = \prod_{t=1}^{T} P(i_t \mid i_{1:t-1}, u)
Similarly, in sequential recommendation, P(i_{1:T} \mid u) represents the joint probability of a user’s entire interaction sequence (i_1, i_2, \ldots, i_T) given a user u. It is factorized into a product of conditional probabilities P(i_t \mid i_{1:t-1}, u), each predicting the next item i_t based on the user’s historical interactions i_{1:t-1} and their identity u.
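To make this isomorphism concrete, the following minimal Python sketch (an illustration, not the authors’ code) scores a sequence under any autoregressive next-element model. The same loop applies whether the elements are tokens or item IDs, since both factorizations share the chain-rule form; the `next_prob` model here is a hypothetical stand-in.

```python
import math

def sequence_log_prob(sequence, next_prob):
    """Score a sequence under the chain-rule factorization:
    log P(x_{1:T}) = sum over t of log P(x_t | x_{1:t-1}).
    next_prob(prefix, x) may be a language model over tokens or a
    sequential recommender over item IDs; the decomposition is identical."""
    total = 0.0
    for t in range(len(sequence)):
        prefix, x = sequence[:t], sequence[t]
        total += math.log(next_prob(prefix, x))
    return total

# Toy stand-in model: a uniform distribution over a 100-item catalog.
uniform = lambda prefix, x: 1.0 / 100
print(sequence_log_prob(["item_7", "item_42", "item_3"], uniform))  # 3 * log(0.01)
```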
Furthermore, the joint probability formulation of user behavior prediction is mathematically isomorphic to the chain-rule factorization of language modeling, which allows the self-attention mechanism of the Transformer architecture to be seamlessly adapted to recommendation scenarios. The self-attention mechanism computes relevance scores between elements in the sequence through the following operation:
\frac{QK^\top}{\sqrt{d_k}}
where Q and K represent the Query and Key matrices, respectively, and \sqrt{d_k} scales the dot products to avoid extreme values. This mechanism effectively captures global dependencies across the entire sequence. The weight computation in Transformers overcomes the local perception limitations of traditional RNNs by leveraging a multi-head parallel mechanism to simultaneously capture short-term behavioral patterns and long-term interest evolution. Compared to the fixed inductive biases of CNNs and RNNs, this data-driven global dependency modeling significantly enhances the representation of complex behavioral patterns. Second, language models inject strict temporal sensitivity into sequences through sinusoidal positional encoding:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Here, pos denotes the position of an element in the sequence, i is the dimension index, and d_{model} is the embedding dimension. The value 10,000 was chosen empirically to create a wavelength progression that provides unique and smooth positional representations across dimensions, enabling the model to generalize to sequences longer than those seen during training [10].
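For illustration, a short NumPy sketch of these two encoding formulas (the standard construction from [10], not specific to TisLLM):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # positions, shape (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)  # one row per interaction position
```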
The masked attention in the decoder enforces the causal constraints of the recommendation system, ensuring that predictions at time t rely solely on historical behaviors i_{1:t-1}. The stacked architecture of Transformer layers enables dynamic evolution of user representations, with each attention layer reconstructing feature interactions across different temporal scales. This mechanism overcomes the static representation limitations of traditional matrix factorization methods. Empirical evidence from SASRec [11], which directly adopts a language-model architecture for sequential recommendation, validates the architecture’s effectiveness in modeling time-decay effects and interest drift, supporting the view that large language models offer a general-purpose temporal modeling framework.
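To make the causal constraint concrete, here is a minimal PyTorch sketch of decoder-style masking (a generic illustration of the mechanism described above, with arbitrary scores standing in for the scaled dot products):

```python
import torch

T = 5                                                  # sequence length
scores = torch.randn(T, T)                             # stand-in for Q K^T / sqrt(d_k)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # position t sees only positions <= t
scores = scores.masked_fill(~mask, float("-inf"))      # block attention to the future
weights = torch.softmax(scores, dim=-1)                # future positions receive zero weight
```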
To close the gap identified above between general LLM competencies and recommendation needs, we propose TisLLM, a time-series LLM-based recommendation model. TisLLM uses a temporal sliding window to segment user interaction sequences, preserving rich descriptive information while uncovering potential correlations between items. This approach enhances recommendation accuracy and user experience. Our experiments across multiple datasets demonstrate the effectiveness of TisLLM compared to existing models. The main contributions of this paper are as follows:
  • We highlight the importance of time-series patterns in user preferences and argue that LLMs are inherently suitable for time-series recommendation tasks.
  • We introduce a sliding window method to segment user interaction sequences, enriching training samples for model fine-tuning.
  • We validate the effectiveness of TisLLM through extensive experiments on multiple datasets.

2. Related Work

In recent years, large language models have demonstrated exceptional interaction capabilities, driven by extensive data foundations and powerful inference abilities. These models hold significant potential for transforming traditional recommendation paradigms, particularly in addressing cold-start and collaborative reasoning challenges. However, a key challenge lies in bridging the semantic gap between LLMs and recommendation systems (RSs), as recommended items are often represented by discrete identifiers not present in LLMs’ vocabularies [12,13]. This gap arises because LLMs primarily capture linguistic semantics, while RSs implicitly encode collaborative semantics, limiting the full utilization of LLMs for recommendations [2]. To integrate LLMs with RSs, researchers have explored various methods, described below.

2.1. Zero-Shot Recommendation via Prompt Design

Researchers have leveraged LLMs’ inference capabilities and prior knowledge through carefully designed prompt templates to generate recommendations without fine-tuning. For example, InstructRec [4] structures recommendations as instruction–question–answer tasks, enabling LLMs to address recommendation goals through crafted questions [14,15]. InstructRec’s key innovation is formalizing recommendation tasks into explicit instructions, specific questions, and desired answer formats. It leverages LLMs’ instruction-following capability via meticulously designed prompt templates to guide the LLM in understanding the recommendation intent and generating results, requiring no additional training data.
DLCRec [16] achieves fine-grained diversity control in recommendation through a task decomposition paradigm—sequentially executing genre prediction, genre filling, and item prediction subtasks—which dynamically encodes user requirements into structured prompts. This subtask-level prompt engineering distinguishes it from single-instruction approaches like InstructRec [4], enabling explicit diversity modulation beyond traditional zero-shot limitations.
Similarly, Wu et al. [5] treat the LLM as a conversational recommendation agent: user profiles and behavioral data are supplied directly as natural language descriptions, and the LLM is expected to generate recommendations and justifications conversationally. This method focuses on utilizing the LLM’s conversational generation capabilities for natural, interactive recommendation. Zhang et al. [4], by contrast, frame recommendation as an instruction-following task (this refers to InstructRec), emphasizing structured instructions that explicitly tell the LLM what action to perform.
Other methods, such as Wang et al.’s zero-shot next-item recommendation [6], use prompts to reorder recommendation lists or generate candidate items. Wang et al.’s work focuses specifically on the next-item recommendation task. They propose using LLMs to re-rank an initial candidate list generated by traditional recommendation models, or directly prompting the LLM to generate candidate item names. Their prompt design emphasizes enabling the LLM to understand the context of the user’s historical sequence and the requirement to predict the next item.
However, these methods often rely on static prompts, limiting their adaptability to dynamic user preferences and resulting in suboptimal recommendation performance. The fundamental limitation of zero-shot methods is their reliance on fixed or heavily manually engineered prompt templates. They struggle to dynamically capture real-time shifts in user interest, and the generated recommendations often lack personalized depth, leading to performance typically inferior to specially trained models.

2.2. Fine-Tuning LLMs with User Interaction Data

By converting user–item interaction data into natural language text, LLMs can be fine-tuned to learn user preferences. For instance, Bao et al. [14] propose TALLRec, which converts users’ historical interaction sequences into instructional text and fine-tunes LLMs to directly predict next-item titles, teaching the mapping from sequences to recommendation targets. Separately, Cao et al. [17] enhance recommendation knowledge alignment by supplementing fine-tuning with auxiliary task data alongside the main recommendation objective. This approach broadens the LLM’s understanding of item features and user preferences, improving both recommendation performance and knowledge alignment.
SLMRec [18] reveals redundancy in intermediate layers of large-scale LLMs for sequential recommendation by distilling them into smaller models, retaining only 13% of the parameters, directly challenging the “larger is better” hypothesis. Its distillation strategy is compatible with existing fine-tuning frameworks like TALLRec [14], providing an efficient alternative for resource-constrained scenarios—though still dependent on user interaction data for fine-tuning.
Ji et al. [19] introduce a two-stage generative paradigm that first fine-tunes LLMs to produce textual descriptions of target items from user interaction sequences, then maps these descriptions back to item IDs via text similarity or retrieval for recommendations, bridging textual semantics and item identification. Separately, Yang et al. [20] leverage LLMs’ semantic understanding to deeply analyze item names, extracting commonsense features that enrich knowledge graphs for enhanced recommendation. This knowledge graph augmentation improves the performance of KG-based recommendation models by deepening item feature representation.
Despite their effectiveness, these methods rely heavily on manual prompt engineering, leading to inefficiency and poor adaptability to dynamic environments. Although fine-tuning methods often yield better results, their data construction process typically relies on manually designed templates. Furthermore, during inference, carefully crafted prompts are often still required for optimal performance, reducing the method’s efficiency and flexibility in new scenarios.

2.3. Leveraging LLMs’ Prior Knowledge for Recommendation

LLMs’ inherent commonsense knowledge can supplement recommendation systems. Yang et al. [20] use LLMs’ commonsense to construct knowledge graphs for recommendations. As mentioned above, Yang et al. [20] leverage LLMs to mine implicit commonsense semantic relationships and features from item names/descriptions, which are used to build or enrich the knowledge graph employed by the recommendation system. This injection of external knowledge aims to compensate for the limitations of traditional collaborative signals in understanding the deep attributes and relationships of items.
DeepRec [21] introduces a multi-round autonomous interaction paradigm between LLMs and traditional recommendation models (TRMs), optimized via reinforcement learning. LLMs generate user preference hypotheses using world knowledge, while TRMs retrieve candidate item sets with recommendation expertise; their alternating iterations enable deep item space exploration. This approach dynamically transforms the static knowledge graph framework of Yang et al. [20] into an interactive process, significantly enhancing knowledge utilization efficiency.
Luo et al. [22] employ prompts to generate concise summaries of target items in their TRAWL framework. These summaries distill the core features or selling points of an item. The LLM-generated summaries are then fed into traditional sequential recommendation models, replacing or supplementing the original item IDs or titles, with the goal of providing richer semantic information to help the model better understand user interests and item relevance.
However, these approaches depend on predefined knowledge graphs and prompts, limiting their ability to adapt to diverse domains and user preferences. The main challenges for knowledge utilization methods are: (1) For KG construction methods, effectiveness depends on the initial KG structure and the quality/coverage of knowledge mined by the LLM. (2) For summary generation methods, prompt design directly impacts the usefulness of the summaries. (3) Both struggle to adjust the utilized knowledge in a dynamic and personalized way to fit different domains or capture subtle shifts in user preferences.

2.4. Learning User-Item Interactions with LLMs

To address hallucination issues and capture structured knowledge, researchers integrate graph structures with LLMs. Du et al. [9] propose a graph-aware convolutional LLM method to enhance descriptions by exploring multi-hop neighbors. Du et al. integrate the concept of graph convolutional networks into the LLM fine-tuning process. They construct a user–item interaction graph and design a graph-aware convolutional layer. This allows the LLM, when generating or interpreting item text descriptions, to aggregate information from its multi-hop neighbors. This method aims to infuse the generated textual descriptions with richer collaborative filtering signals.
Guo et al. [8] design prompts to align textual information with graph nodes, focusing on graph neural network-based session recommendation. Their prompts encode both the textual information and the graph structure of the items within a session sequence for input to the LLM, with the goal of enabling the LLM to understand the session context and graph structure well enough to predict the next interacted item.
Zhang et al. [7] argue that existing LLM4Rec methods struggle to capture collaborative signals from user-item co-occurrence patterns, a critical limitation. To address this, they first train a traditional recommendation model to extract user and item ID embeddings that encode collaborative information. These embeddings are then incorporated as additional inputs alongside textual item descriptions during LLM fine-tuning, enabling the model to leverage both semantic and collaborative signals for recommendation tasks.
Reason4Rec [23] introduces “deliberative alignment,” requiring LLMs to explicitly reason through user preferences via step-by-step explanations and collaborative expert modules trained in stages, contrasting with Zhang et al. [7], who implicitly fuse collaborative signals into LLM inputs. While Zhang et al. focus on implicit representation integration, Reason4Rec specifically enhances LLM understanding of user behavior structures through explicit reasoning chains.
In summary, LLMs’ inference capabilities and prior knowledge make them promising for recommendation tasks, particularly in addressing cold-start and zero-shot challenges [24]. However, LLMRec methods often underperform traditional models in recommendation accuracy [25], highlighting the need for further exploration to enhance their effectiveness and adaptability.

3. Methodology

To bridge the gap between general-purpose LLMs and personalized sequential recommendation, we propose TisLLM, a time-series-based framework that leverages the natural temporal order of user interactions. Figure 1 illustrates the architecture of our proposed time-series model (TisLLM), which organizes users’ interaction records into chronological order, thereby formatting disorganized data for use in fine-tuning large language models. After fine-tuning, the LLMs can more accurately predict the next interaction outcomes of users.
As illustrated in Figure 1, the proposed framework encompasses four key stages: initial data processing, where raw user interaction data are cleaned and organized; subsequent data enhancement, which formats the processed data into sequential structures appropriate for fine-tuning; followed by LLM fine-tuning, employing a customized strategy to adapt a pre-trained language model for recommendation; and finally, inference verification, wherein the fine-tuned model is assessed on structured test data to verify its predictive accuracy.

3.1. Data Processing

Given the large language model’s powerful capability in world knowledge, we associate user, item, and interaction information within the dataset, retaining only the names of interacted items, the ratings or reviews of those items, and the timestamp of each interaction. Each interaction is represented as a quadruple R = (userID, itemID, rating, timestamp). We adopt two different processing methods: the first directly uses the rating score value [26]; the second converts the rating into a binary classification problem, mapping each rating to Like or Dislike via a threshold
\text{threshold} = \left\lfloor \frac{\max(\text{ratings})}{2} \right\rfloor + 1
where ratings is the set of all ratings in the dataset. If rating ≥ threshold, the rating is mapped to “Like”; otherwise, to “Dislike”. This yields the preprocessed dataset format.
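A minimal sketch of this binarization rule (the function name is ours; it simply implements the threshold formula above):

```python
def binarize_ratings(ratings):
    """Map raw scores to Like/Dislike with threshold = int(max(ratings) / 2) + 1."""
    threshold = int(max(ratings) / 2) + 1
    return ["Like" if r >= threshold else "Dislike" for r in ratings]

# With a maximum observed rating of 5, the formula gives threshold = 3:
print(binarize_ratings([5.0, 1.0, 4.0, 3.0, 2.0]))
# ['Like', 'Dislike', 'Like', 'Like', 'Dislike']
```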

3.2. Data Enhancement

Given the historical interaction information R_u = \{r_1, r_2, \ldots, r_m\} of user u, we adopt a sliding window method with window length l and step size s for data segmentation, ultimately dividing R_u into multiple samples. Assuming l = 2 and s = 1, we can use r_1 and r_2 to predict r_3, and r_2, r_3 to predict r_4, obtaining the fine-tuning dataset R_u' = \{(r_1, r_2, r_3), (r_2, r_3, r_4), \ldots, (r_{m-2}, r_{m-1}, r_m)\}. For a user interaction record of length m, a sliding window strategy of length l and step size s generates \lfloor (m - (l+1))/s \rfloor + 1 training samples. To avoid wasting historical information, each interaction r_x must participate in the generation of training samples; thus, the step size must satisfy s < l + 1. The specific processing approach is illustrated in Algorithm 1 as follows:
Algorithm 1 Pseudocode for data processing.

function process_user_ratings(csv_file)
    Load CSV into reader; initialize users dict to store ratings
    for row in reader do
        userID, rating, timestamp, movieID, pref = extract from row
        if userID not in users then
            users[userID] = []
        end if
        Append {movieID, rating, timestamp, pref} to users[userID]
    end for
    json_output = []
    for user_id, movies in users.items() do
        Sort movies by timestamp
        if len(movies) < 16 then
            watched_movies = [movies[:-1]]
            next_movie = [movies[-1]]
        else
            watched_movies = [movies[i:i+15] for i in 0 to len(movies)-16]
            next_movie = [movies[i+15] for i in 0 to len(movies)-16]
        end if
        for i from 0 to len(watched_movies)-1 do
            input_str = create string with watched_movies[i] and prefs
            output_str = create string for next_movie[i]
            Append {'Input': input_str, 'Output': output_str} to json_output
        end for
    end for
    return json_output
end function

function write_json(json_data, json_file)
    Write json_data to json_file in JSON format
end function

Call process_user_ratings(csv_file)
Call write_json(returned_data, json_file)
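For readers who prefer runnable code, the following Python sketch mirrors Algorithm 1 with a general window length l and step s. The field names (userID, itemName, pref, timestamp) and the prompt template are illustrative assumptions modeled on the Algorithm 2 example, not the authors’ released implementation.

```python
import json
from collections import defaultdict

def build_samples(interactions, l=15, s=1):
    """Turn per-user interaction logs into {'Input', 'Output'} fine-tuning pairs
    using a sliding window of length l and step s, as in Algorithm 1."""
    users = defaultdict(list)
    for row in interactions:                        # group interactions by user
        users[row["userID"]].append(row)

    samples = []
    for user_id, events in users.items():
        events.sort(key=lambda e: e["timestamp"])   # chronological order
        if len(events) < l + 1:                     # short history: one sample from all of it
            windows = [(events[:-1], events[-1])]
        else:                                       # events[i:i+l] predicts events[i+l]
            windows = [(events[i:i + l], events[i + l])
                       for i in range(0, len(events) - l, s)]
        for history, target in windows:
            names = ", ".join(f"'{e['itemName']}'" for e in history)
            prefs = ", ".join(f"'{e['pref']}'" for e in history)
            samples.append({
                "Input": (f"In chronological order, User {user_id} has successively "
                          f"watched the films {names} and provided the respective "
                          f"evaluations {prefs}. Please judge whether user likes "
                          f"'{target['itemName']}'. Please output only the results "
                          f"'Like' or 'Dislike'."),
                "Output": target["pref"],
            })
    return samples

def write_json(samples, json_file):
    with open(json_file, "w") as f:
        json.dump(samples, f, indent=2)
```

With m = 20 interactions, l = 15, and s = 1, this produces ⌊(20 − 16)/1⌋ + 1 = 5 samples, matching the count formula in Section 3.2.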

3.3. LLM Fine Tuning

Methods for fine-tuning large language models include prefix tuning, prompt tuning, and LoRA (Low-Rank Adaptation), all of which aim to adapt the model to new tasks while maintaining its generalization ability. LoRA enhances performance significantly with a small number of additional parameters and is computationally efficient and resource-friendly, making it especially suitable for resource-constrained environments [27]. Therefore, LoRA is chosen for fine-tuning the LLaMA2-7B model. In LoRA, assuming the original weight matrix of the LLaMA2-7B model is W, two low-rank matrices A and B are introduced so that the fine-tuned weights can be expressed as W + AB. Now, let D = \{(X_i, y_i)\} be the fine-tuning dataset obtained from the formatting process described in Section 3.2, where each data point contains an input X_i and a label y_i:
D = \{(X_i, y_i) \mid i = 1, 2, 3, \ldots, N\}
We obtain the optimally fine-tuned model by minimizing the loss function L. With LoRA, instead of directly updating the weights W, we update the low-rank matrices A and B. Let \Theta_A and \Theta_B denote the parameters of A and B, respectively; the objective function for LoRA fine-tuning can then be expressed as:
\min_{\Theta_A, \Theta_B} L\big(y, f(X; W + AB)\big)
Here, f(X; W + AB) denotes the model’s predicted output after applying the LoRA method, and L represents the loss function, which measures the discrepancy between the model’s predictions and the true labels. Specifically, if masked language modeling (MLM) is employed, the loss function L_{MLM} can be defined as the average negative log-likelihood over all masked positions:
L_{MLM} = -\frac{1}{|M|} \sum_{m \in M} \log P(x_m \mid x_{\setminus m}; W + AB)
where x_m represents the masked tokens, x_{\setminus m} represents the unmasked context, and M is the set of masked positions. If causal language modeling (CLM) is used, the loss function L_{CLM} is the cumulative negative log-likelihood over all positions in the sequence:
L_{CLM} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; W + AB)
where x_t is the token at time step t, and x_{<t} refers to all tokens before time step t. Finally, the self-attention mechanism in the Transformer architecture is computed as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
Here, Q (Query), K (Key), and V (Value) are matrices derived from linear transformations of the input embeddings, and d_k is the dimension of the key vectors. This formulation ensures that the model maintains effective information flow when processing long texts and can capture relationships between different parts of the input. Through the aforementioned formulas and strategies, it is clear that the computational process of large language models aligns well with our TisLLM framework, which is one of the reasons we can effectively fine-tune the LLaMA2-7B model, enabling it to better adapt to recommendation tasks and improve its performance in specific application scenarios.
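As a minimal numerical sketch of the W + AB update (the dimensions are illustrative, chosen to match LLaMA2-7B’s 4096 hidden size and the rank r = 8 and scaling factor 16 reported in Section 4.4; this is not the training code):

```python
import torch

d, k, r = 4096, 4096, 8   # weight shape and LoRA rank
alpha = 16                # scaling factor for the low-rank update

W = torch.randn(d, k)                             # frozen pre-trained weight
A = (torch.randn(d, r) * 0.01).requires_grad_()   # trainable low-rank factor
B = torch.zeros(r, k, requires_grad=True)         # zero-init: W + AB = W at step 0

def lora_forward(x):
    """f(X; W + AB): the frozen path plus the scaled low-rank correction.
    Only A and B receive gradients (2 * 4096 * 8 parameters per layer here)."""
    return x @ (W + (alpha / r) * (A @ B)).T
```

Because only A and B are updated, the gradients and optimizer state cover a tiny fraction of the 7B base parameters, which is what makes single-GPU fine-tuning feasible.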

3.4. Inference Verification

As shown in Algorithm 2, we establish instruction rules defining input and output formats for the dataset generated in Section 3.2. To comprehensively validate our framework’s effectiveness, Section 4.1 employs four distinct datasets with two data processing strategies: the first converts ratings into binary Like/Dislike labels for evaluation, while the second retains original ratings for rating prediction tasks. Since two datasets used in rating evaluation lack explicit item names, they can only be referenced through item IDs within the datasets.
Algorithm 2 The fine-tuning data format of TisLLM for recommendation tasks is based on the user’s interaction history and the time series of interacted items, comprising two parts: input and output. The MovieLens and Beauty datasets are used as examples [28,29].
Recommendation Prompt Example Movie
Input: “In chronological order, User 1 has successively watched the films ‘All Dogs Go to Heaven 2 (1996)’, ‘Operation Dumbo Drop (1995)’, ‘Sneakers (1992)’, ‘Disclosure (1994)’, ‘Doom Generation, The (1995)’, ‘Batman Forever (1995)’, ‘Wizard of Oz, The (1939)’, ‘Indiana Jones and the Last Crusade (1989)’, ‘Patton (1970)’, ‘Evil Dead II (1987)’ and provided the respective evaluations of ‘Dislike’, ‘Dislike’, ‘Like’, ‘Like’, ‘Dislike’, ‘Dislike’, ‘Like’, ‘Like’, ‘Dislike’, ‘Dislike’. Please judge whether user likes the ‘Young Frankenstein (1974)’. Please output only the results ‘Like’ or ‘Dislike’.”
Output: “Like”
Recommendation Prompt Example Beauty
Input: “In chronological order, User 1 has successively watched the products ‘B001DYLHJA’, ‘B0089JVEPO’, ‘B001G2LWDK’, ‘B005Z41P28’, ‘B0055MYJ0U’ and provided the respective ratings ‘5.0’, ‘1.0’, ‘5.0’, ‘3.0’, ‘4.0’ (with a maximum score of 5 and a minimum score of 1). Please predict the rating (within the range of 1 to 5) that the user will give to the product ‘B00117CH5M’. Please output only the score as the result.”
Output: “4”

4. Experiment

In this section, we evaluate our proposed TisLLM framework through four commonly used datasets in the recommendation systems domain. Additionally, we conduct ablation studies to demonstrate the improvements brought by our proposed TisLLM framework. To verify the superiority of our framework, we will address the following questions:
  • How does the performance of the TisLLM framework compare to traditional methods?
  • What is the impact of the time series component on the performance of the TisLLM framework?
  • How does the sliding window length of the time series in the TisLLM framework affect the experimental results?
  • What implications does the TisLLM framework have for the interpretability analysis of large language models?

4.1. Dataset

Firstly, we divide the four datasets—MovieLens, Amazon-Book, Beauty, and Toys-and-Games—into two groups: Group A consists of MovieLens and Amazon-Book, while Group B includes Beauty and Toys-and-Games. Different evaluation metrics are applied to each group to comprehensively validate the superiority of our approach from multiple perspectives. For these four datasets, we only retain users with at least 10 interaction records.
To better characterize the data distribution, we provide the sparsity of each dataset, calculated as the proportion of missing user–item interactions. As shown in Table 1, all datasets exhibit high sparsity, exceeding 93% and even 99% in some cases, which is common in recommendation scenarios. This high sparsity highlights the challenge of extracting meaningful signals from limited interactions and underscores the necessity of robust recommendation methods.
The source datasets in Group A contain user information as well as ratings given by users to movies and books, along with specific item name information. The datasets are split into train and test sets in an 8:2 ratio, with 10% of the training set further allocated as a validation set. As described in Section 3.1, interactions where user ratings are greater than or equal to 4 are treated as Likes, whereas those with ratings less than 4 are considered Dislikes.
The source datasets in Group B contain user information and ratings given by users to cosmetics and toy products. However, they lack specific product name information and only provide item IDs. This limitation somewhat restricts the generalization capability of the LLM’s inherent world knowledge, but this setup is suitable for testing the capabilities of our TisLLM model under specific constraints. This group’s datasets are also split into train and test sets in an 8:2 ratio, with 10% of the training set further allocated as a validation set. To adopt different evaluation metrics than Group A, we conduct experiments using predictive ratings for the Group B datasets.
The existing datasets, after being cleaned and split, were processed through the workflow described in Section 3, Methodology, resulting in datasets tailored to the requirements of sequence recommendation systems. The final format of the datasets is shown in Table 1.

4.2. Evaluation Metrics

To comprehensively and objectively evaluate the performance of TisLLM, we have selected widely recognized evaluation metrics in the recommendation systems field: for Group A, Area Under the Curve (AUC) is selected as the core assessment standard; for Group B, Mean Absolute Error (MAE) is selected as the core assessment standard.
AUC quantifies a model’s ability to distinguish between positive and negative samples—specifically predicting whether a user will like a recommended item—by calculating the area under the Receiver Operating Characteristic Curve (ROC). The trapezoidal integration method computes this area through the formula:
AUC = \sum_{k=1}^{K-1} \frac{\left(TPR_{k+1} + TPR_k\right)\left(FPR_{k+1} - FPR_k\right)}{2}
In this formula, K denotes the total number of threshold points selected along the ROC curve. Each TPR_k term represents the true positive rate at the k-th threshold, calculated as true positives divided by the sum of true positives and false negatives. Similarly, each FPR_k term corresponds to the false positive rate at the k-th threshold, calculated as false positives divided by the sum of true negatives and false positives. The summation iterates through consecutive threshold pairs, calculating the trapezoidal area between each adjacent pair of points on the ROC curve. A higher resulting AUC value signifies superior performance in distinguishing user preferences.
MAE measures prediction accuracy by calculating the average absolute deviation between predicted values and actual observations. Its computational formula is expressed as:
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
Here, n signifies the total number of evaluation samples. Each sample contributes an actual observed value y_i that captures true user responses such as ratings, alongside a corresponding predicted value \hat{y}_i generated by the recommendation model. The calculation first determines the absolute error for every individual sample, then sums these deviations across the entire dataset, and finally normalizes the total by dividing by the sample count. A smaller resulting MAE value indicates reduced disparity between predictions and reality, demonstrating better predictive performance.
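Both metrics are straightforward to compute; the following NumPy sketch implements the two formulas above (illustrative, not the evaluation harness used in the paper):

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Trapezoidal AUC: sum of (TPR_{k+1} + TPR_k) * (FPR_{k+1} - FPR_k) / 2,
    assuming the (FPR, TPR) pairs are sorted by increasing FPR."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    return np.sum((tpr[1:] + tpr[:-1]) * np.diff(fpr) / 2)

def mae(y_true, y_pred):
    """Mean absolute error: (1/n) * sum of |y_i - y_hat_i|."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

print(auc_trapezoid([0.0, 0.2, 0.5, 1.0], [0.0, 0.6, 0.9, 1.0]))  # 0.76
print(mae([4, 2, 5], [3.5, 2.5, 4.0]))                            # ~0.667
```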

4.3. Baseline Model Comparison Analysis

Given that our TisLLM framework utilizes users’ historical interaction records to predict subsequent interaction outcomes, similar to sequential recommendations, we consider the following sequential recommendation models for comparison: (1) The GRU4Rec model, a sequence recommender based on RNNs, which uses gated recurrent units to capture complex dynamics in users’ historical interaction sequences [30]. (2) The Caser model, which employs CNNs to encode these historical interaction sequences [31]. (3) SASRec, a classic model based on the Transformer architecture, which can deeply understand long-term dependencies within sequences through self-attention mechanisms [11]. (4) DROS, a state-of-the-art sequential recommendation model introducing distributionally robust optimization strategies aimed at providing more robust and reliable recommendation outcomes [32] and (5) TALLRec, a model fine-tuned using Alpaca tuning and Rec-tuning, effectively aligning large language models with recommendation tasks. For fairness in experimentation, the base model of TALLRec has been replaced with the same LLaMA2-7B used in this paper [14].
These five methods are used to compare results from Group A, while the results for Group B are compared using the following methods: (1) HFT, a classic model combining rating data and review texts, fusing topics from reviews with latent factors obtained via matrix factorization for rating prediction [33]. (2) SATMCF, which leverages sentiment analysis of review texts to pre-fill the rating matrix, optimizing matrix factorization methods with implicit topic models [34]. (3) DeepCoNN, the first deep recommendation model to simultaneously leverage user and resource reviews for modeling users and resources, exhibiting superior performance [35]. (4) NARRE, which introduces a comment-level attention mechanism to filter review texts, combining CNN to capture features from review texts, further enhancing recommendation performance [36] and (5) DeepSAMI, integrating shallow and deep features using multi-layer neural networks to model nonlinear interactions between users and resources, predicting rating values for resource recommendations [37].

4.4. Specific Details

The TisLLM framework is applied to the LLaMA2-7B model within the LoRA architecture, ensuring it can be fine-tuned on an NVIDIA GeForce RTX 3090 (24 GB) GPU (NVIDIA Corporation, Santa Clara, CA, USA). The experimental setup also utilized an Intel Core i9-12900K CPU (Intel Corporation, Santa Clara, CA, USA) and 128 GB RAM configured as four 32 GB Kingston KF3200C16D4/32GX DDR4 memory modules (Kingston Technology Corporation, Fountain Valley, CA, USA). The TisLLM framework uses the Adam optimizer, with fine-tuning conducted over multiple epochs, a total batch size of 128, and a learning rate of 1 × 10−4, based on common practices in this task and similar recommendation scenarios. The LoRA rank of 8 and the scaling factor of 16 for gradient updates follow established configurations that balance parameter efficiency and model capacity [38,39]. In preliminary experiments, we conducted a grid search over several candidate values (learning rates of 1 × 10−5, 5 × 10−5, 1 × 10−4, and 5 × 10−4, as well as LoRA ranks of 4, 8, and 16) and observed that performance exhibited minimal fluctuation around the optimal values, indicating model stability across this range. We apply LoRA to the down_proj, up_proj, v_proj, o_proj, k_proj, q_proj, and gate_proj layers, ensuring that LoRA operates on these critical modules. The learning rate scheduler is cosine annealing. The hardware information for other devices can be viewed in Table 2.
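A hedged sketch of this LoRA setup using the Hugging Face PEFT library follows; the paper’s exact training script is not released, so this only mirrors the stated hyperparameters:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                    # LoRA rank, as stated above
    lora_alpha=16,          # scaling factor for gradient updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```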

4.5. Resource Utilization and Time Efficiency Analysis

We conducted detailed measurements of training time, inference latency, and GPU utilization during the fine-tuning and inference stages of TisLLM based on LLaMA2-7B under the LoRA configuration. All experiments were performed on a system equipped with two NVIDIA GeForce RTX 3090 GPUs.

4.5.1. Training Phase Analysis

The resource consumption during the training phase for each dataset is summarized in Table 3. The results demonstrate substantial but efficient utilization of GPU resources. A clear trend can be observed: the training time for MovieLens and Amazon-Book significantly exceeds that of Beauty and Toy-and-Game. This is primarily attributed to the longer average sequence length of samples in the MovieLens and Amazon-Book datasets. Since the self-attention mechanism in the Transformer architecture has a computational complexity that grows quadratically with the sequence length, processing longer sequences requires substantially more computations during both the forward and backward passes, directly leading to increased training time. Notably, the power consumption consistently approached the thermal design power (TDP) limit of 350 W for both GPUs, indicating the hardware was engaged at near maximum capacity. VRAM usage was also high, approaching the full 24 GB available on each device, which is expected for fine-tuning a large model like LLaMA2-7B.

4.5.2. Inference Phase Analysis

The performance and resource utilization during the inference phase are presented in Table 4. Compared to training, inference requires significantly less time but maintains high GPU utilization to minimize latency. A key observation is the reduction in VRAM usage during inference, as the process does not require storing gradients and optimizer states. This makes the framework suitable for deployment scenarios where rapid response is critical.
The analysis confirms that the TisLLM framework achieves practical training times on consumer-grade hardware and provides efficient inference latency. The high and stable resource utilization metrics demonstrate the effectiveness of our implementation for large-scale recommendation tasks.

4.6. Experimental Results

4.6.1. Performance Comparison (RQ1)

The results presented in Table 5 demonstrate that the TisLLM framework achieves a substantial leap in recommendation accuracy, surpassing conventional methods by significant margins. On MovieLens, TisLLM attains a dominant mean AUC of 0.7009 ± 0.0019—exceeding the strongest traditional baseline DROS at 0.5502 by 27.4% and outperforming random-guessing-level models such as SASRec at 0.5225 by 34.2%. This performance gap expands further on Amazon-Book, where TisLLM’s mean AUC of 0.7053 ± 0.0032 outperforms TALLRec’s 0.6484 by 8.8% while leaving all non-LLM methods below 0.5021 far behind. These results are averaged over three independent runs, with standard deviations indicating consistent model performance. Critically, the consistent achievement of above 0.7 AUC across diverse domains signifies TisLLM’s superior knowledge generalization capability, a direct outcome of its fine-tuning mechanism that effectively transfers semantic understanding to recommendation tasks. The empirically verified over 20% advantage relative to non-LLM baselines, calculated from their 0.50x-level performance to TisLLM’s 0.70x-level dominance, confirms the framework’s capacity to transcend heuristic patterns through deep language reasoning for preference modeling.
Beyond AUC metrics, TisLLM exhibits exceptional robustness as evidenced by its MAE performance under sparse metadata conditions. For the Beauty dataset, it achieves a record-low mean MAE of 0.7773 ± 0.0030, surpassing advanced deep learning baselines including DeepSAMI at 0.8364 and NARRE at 0.8488. Notably, even when operating solely on anonymized item identifiers without descriptive names in the Toy-and-Game dataset, TisLLM maintains competitive efficacy with a mean MAE of 0.6269 ± 0.0022—outperforming specialized rating predictors HFT at 0.6638 by 5.6% and SATMCF at 0.6499 by 3.5% despite the framework’s primary design focus on semantic recommendation. Although formal statistical testing was not feasible due to the lack of released source code for baseline models and the temporal dependency in sequential recommendation data that prohibits standard cross-validation, the low standard deviations across multiple runs confirm the stability of TisLLM’s performance. We further note that on the Toy-and-Game dataset, TisLLM performs slightly worse than DeepSAMI. We hypothesize that this is due to the dataset’s relatively sparse and temporally condensed interaction history, which may not fully leverage TisLLM’s strength in modeling long-range semantic dependencies and nuanced user interest evolution. To mitigate this, we propose incorporating auxiliary temporal segmentation or designing adaptive session windows that can better align with the dataset’s characteristics, thereby allowing TisLLM to capitalize on its sequential modeling capacity even in shorter interaction sequences. Although DeepSAMI achieves a slightly lower 0.6127 in this specific scenario due to TisLLM’s reduced dependency on descriptive features, the latter’s overall superiority in metadata-scarce environments validates its exceptional adaptability. This dual strength across both semantic-rich AUC-driven contexts and feature-constrained MAE-driven scenarios positions TisLLM as a uniquely versatile solution for heterogeneous recommendation systems. The experimental results can be seen more intuitively in Figure 2.
Although the present study utilizes public datasets and is constrained by computational resources, precluding actual deployment and application, the observed improvements in recommendation accuracy suggest potential positive impacts in real-world scenarios. Prior industry research has shown that even modest gains in recommendation precision can significantly enhance user engagement and revenue generation. For example, in e-commerce platforms, Zhou et al. [40] reported that Alibaba’s improved recommendation algorithm—built upon an approximately 0.6% AUC increase—directly contributed to a 3.5% rise in user conversion rates and a 5.2% growth in gross merchandise volume. These empirical findings confirm that the AUC improvements demonstrated in our experiments could translate into meaningful business value in practical e-commerce recommendation scenarios.

4.6.2. Time Series Impact (RQ2)

To rigorously evaluate the efficacy of the time series component within our TisLLM framework for enhancing recommendation performance, we executed an ablation study under identical experimental conditions. Specifically, TisLLM was benchmarked against two modified variants: Time Series-removed LLMRec termed TSRL, which omits the entire time series module, and Sliding Window-removed LLMRec termed SWRL, which replaces the sliding window strategy with random sampling of historical data. As visually summarized in Figure 3, TisLLM achieved superior AUC scores of 0.7018 on the MovieLens dataset and 0.7033 on the Amazon-Book dataset. This represents significant improvements over the baselines, compared to TSRL with 0.5952 and 0.6563, and SWRL with 0.5855 and 0.6671. The results clearly demonstrate that both the incorporation of time series data and the structured sliding window approach are critical, with the absence of either component leading to substantial performance degradation.
To further substantiate the robustness of the time series integration, supplementary experiments were conducted utilizing the MAE metric across additional datasets, namely Beauty and Toy-and-Game. In this comparative analysis, TisLLM was systematically contrasted with TSRL and SWRL to quantify error reduction. As illustrated in Figure 4, TisLLM achieved substantially lower MAE scores of 0.6245 and 0.7794 on Beauty and Toy-and-Game, respectively. This corresponds to error reductions compared to TSRL with 0.6545 and 0.7953, and similarly outperforms the SWRL variant with 0.6499 and 0.7968. These findings underscore that the temporal component within TisLLM, particularly when structured through the sliding window mechanism, consistently minimizes prediction inaccuracies, thereby elevating recommendation precision across diverse domains.

4.6.3. Sliding Window Analysis on Time Series (RQ3)

To investigate the impact of Sliding Window Length (SWL) on recommendation performance, we conducted comparative experiments using five distinct sliding window lengths—2, 5, 10, 15, and 20—denoted as SWL. An overly short SWL, such as 2 or 5, provides insufficient contextual information, preventing LLMs from uncovering latent interaction patterns and resulting in suboptimal recommendations with low AUC values of 0.6846 and 0.6907 for MovieLens, and 0.6887 and 0.6983 for Amazon-Book. Conversely, an excessively long SWL introduces data redundancy, distorting relationship mining and increasing computational overhead. Our results in Figure 5 reveal a nuanced trend: While AUC generally rises with SWL extension—peaking at 0.712 for MovieLens at SWL-20 and 0.7148 for Amazon-Book at SWL-15—overextension beyond optimal lengths degrades accuracy. This is starkly evident in Amazon-Book, where AUC drops to 0.7001 at SWL-20 despite higher computational costs, confirming the critical balance between information sufficiency and noise reduction.
To assess the influence of sliding window length on prediction precision, we evaluated the mean absolute error across SWL settings 3 to 7. Our results in Figure 6 reveal a nuanced trend: excessively short windows, exemplified by SWL-3, yield high MAE due to sparse interaction context, as seen in Beauty and Toy-and-Game with errors of 0.6544 and 0.7953, respectively. Conversely, overly extended windows amplify noise interference: Beyond the optimal SWL-5—where MAE reaches its minimum at 0.6245 for Beauty and 0.7794 for Toy-and-Game—performance deteriorates markedly. At SWL-7, MAE rebounds to 0.6419 for Beauty and 0.7994 for Toy-and-Game, exceeding initial error levels. This inverse-U relationship underscores that while moderate SWL expansion refines pattern capture, excessive lengths introduce irrelevant interactions, distorting predictions and nullifying computational efficiency. Thus, an intermediate SWL-5 optimally balances contextual depth with noise mitigation.
The specific values from this ablation experiment are shown in Table 6.
While our sliding window length analysis established its critical role in balancing contextual sufficiency against noise, that investigation primarily addressed quantitative context volume. To further examine how temporal interaction density within fixed SWLs influences interest modeling, we conducted extended experiments using the previously identified optimal configurations: SWL-15 for MovieLens and Amazon-Book, and SWL-5 for Beauty and Toy-and-Game. We implemented two additional filtration layers: First, retaining only users with a minimum of twenty interactions to ensure robust behavioral histories. Second, segmenting sequences based on actual time spans between initial and final interactions within each window. Specifically, for any prediction window such as item1-item5 predicting item6, we calculated Δ T as the temporal interval between interaction item1 and item5. Test instances were then grouped by minimum Δ T thresholds including 30, 45, 60, 75, 90, and 120 days for MovieLens/Amazon-Book, and 10–60 day increments for Beauty/Toy-and-Game. Performance was evaluated exclusively on sequences meeting each temporal span criterion, thereby isolating how interaction dispersion duration affects preference modeling.
As Figure 7 demonstrates, enforcing minimum temporal spans within fixed SWLs reveals domain-dependent relationships between interaction dispersion and recommendation accuracy. MovieLens maintains stable AUC values between 0.6975 and 0.7036 across all tested spans, with only marginal improvement emerging at the maximum 120-day span. This stability suggests user preferences in movie domains evolve gradually, where even compact temporal windows sufficiently capture interest patterns. Conversely, Amazon-Book exhibits a strong positive correlation where AUC rises consistently from 0.6849 at 30 days to 0.7035 at 120 days. This 2.7% improvement indicates book preferences benefit substantially from observing interactions dispersed over longer periods, which better distills lasting interests from ephemeral engagements. Notably, both datasets achieve their peak AUC values under extended temporal constraints, surpassing baseline performances in Table 7 and confirming that temporal filtering enhances signal quality within optimal window lengths.
Figure 8 reveals divergent temporal sensitivity patterns in MAE across domains under constrained interaction spans. Beauty experiences continuous accuracy degradation as temporal spans lengthen, with MAE rising from 0.8138 at 10 days to 0.8852 at 60 days. This 8.8% deterioration indicates beauty products exhibit strong recency dependencies, where extended intervals introduce historically irrelevant context that severely distorts predictions. Conversely, Toy-and-Game demonstrates an inverse-U relationship: MAE initially improves to 0.6943 at the 20-day span before progressively worsening to 0.7189 at 60 days. This non-monotonic pattern suggests moderate temporal dispersion benefits toy recommendations by filtering transient noise, but beyond optimal thresholds, sparse interactions compromise predictive fidelity. These contrasting behaviors underscore fundamental domain differences: Beauty recommendations demand recent interaction contexts, while toy preferences tolerate intermediate temporal dispersion but deteriorate under extreme fragmentation. All specific information about the data from the experiment is in Table 8.

4.6.4. Interpretive Analysis of the TisLLM Framework (RQ4)

The previous experimental validation confirmed the effectiveness of the TisLLM framework, and in order to analyze the interpretability, we designed appropriate prompts for testing. The framework employs temporal representation learning to accurately capture dynamic variations in users’ reading behaviors. Taking the reading case illustrated in Figure 9 as an example, the user initially demonstrated mixed preferences in the early stage, gradually developed enhanced appreciation for books during the mid-term phase, and recently exhibited a generalized preference for diverse literary works. This temporal-aware deep mining enables the model to interpret user preference evolution from a dynamic evolutionary perspective, thereby establishing a robust foundation for determining user affinity toward specific books.
Concurrently, the TisLLM framework effectively integrates book metadata (including genres and themes) with user evaluations. Through meticulous categorization and interpretation of user preferences across different literary categories, the model achieves a comprehensive analysis of formative factors underlying reading preferences. Building upon this foundation, the framework synthesizes genre-specific characteristics with users’ recent reading inclinations to generate persuasive recommendation rationales. This methodology fully leverages the model’s explanatory capabilities, ultimately delivering well-substantiated recommendations through systematic consideration of multifaceted preference dimensions.

5. Conclusions

In this study, we explored the impact of temporal changes on individual preferences. The essence of LLMs lies in predicting the most likely forthcoming words or character sequences based on input prompts after being trained on vast amounts of text data. Building upon this foundation, we constructed the novel TisLLM framework, an LLM-based recommender that sorts interaction data into time series and then divides them using a sliding window approach. Theoretically, the recommendation data obtained from time-based sliding window segmentation aligns well with the generative structure of LLM textual responses. By finely segmenting user interaction data, we expand the content of user data, allowing LLMs to more accurately understand individual user preferences. Experiments have shown that our TisLLM framework can effectively uncover latent associations within recommendation data, thereby achieving superior recommendation outcomes.
In the future, research will focus on enhancing the overall performance of the recommendation system and the large language model’s ability to adapt to user preference shifts caused by data changes. This will be achieved by implementing plug-in databases to eliminate the need for repeated LLM fine-tuning. Additionally, critical challenges such as the substantial computational resource demands and latency issues inherent to large models must be addressed to further optimize model performance.

Author Contributions

Methodology: X.Z. and W.L.; writing—review and editing: X.Z.; funding acquisition: X.Z.; conceptualization: X.Z.; software: W.L.; data curation: W.L.; writing—original draft: W.L.; resources: B.Z.; conceptualization: B.Z.; visualization: B.Z.; project administration: L.G.; formal analysis: L.G.; validation: L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62172352); Tianjin Education Commission Research Plan (2023KJ203), “Research and Application of Causal Machine Learning”; and the Open Project of Hebei Technology Innovation Center of Cultural Tourism Big Data (SG2019036-zd202206), “Research on Bias Problem in Recommendation System Based on Causal Inference”, Tianjin Science and Technology Project 23YDTPJC00320.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study (MovieLens, Amazon-Book, Beauty, and Toys-and-Games) are open-source benchmarks that are publicly available online.

Acknowledgments

We would like to thank Jiaqi Ji from Hebei Minzu Normal University for his continuous attention, support, and assistance with this project. This article is a revised and expanded version of a paper entitled Time Series Large Language Model for Recommendation System [41], which was presented at RAIIC 2025, held from 4–6 July 2025, at Wangjiang Campus, Sichuan University, Chengdu, China.

Conflicts of Interest

The authors declare no competing interests relevant to the content of this article.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Zheng, B.; Hou, Y.; Lu, H.; Chen, Y.; Zhao, W.X.; Chen, M.; Wen, J.R. Adapting large language models by integrating collaborative semantics for recommendation. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  3. Lin, J.; Shan, R.; Zhu, C.; Du, K.; Chen, B.; Quan, S.; Tang, R.; Yu, Y.; Zhang, W. ReLLa: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
  4. Zhang, J.; Xie, R.; Hou, Y.; Zhao, X.; Lin, L.; Wen, J.R. Recommendation as instruction following: A large language model empowered recommendation approach. ACM Trans. Inf. Syst. 2023, 43, 1–37. [Google Scholar] [CrossRef]
  5. Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A survey on large language models for recommendation. World Wide Web 2024, 27, 60. [Google Scholar] [CrossRef]
  6. Wang, L.; Lim, E.P. Zero-shot next-item recommendation using large pretrained language models. arXiv 2023, arXiv:2304.03153. [Google Scholar]
  7. Zhang, Y.; Feng, F.; Zhang, J.; Bao, K.; Wang, Q.; He, X. CoLLM: Integrating collaborative embeddings into large language models for recommendation. arXiv 2023, arXiv:2310.19488. [Google Scholar] [CrossRef]
  8. Guo, N.; Cheng, H.; Liang, Q.; Chen, L.; Han, B. Integrating Large Language Models with Graphical Session-Based Recommendation. arXiv 2024, arXiv:2402.16539. [Google Scholar] [CrossRef]
  9. Du, Y.; Wang, Z.; Sun, Z.; Chua, H.; Liu, H.; Wu, Z.; Ma, Y.; Zhang, J.; Sun, Y. Large Language Model with Graph Convolution for Recommendation. arXiv 2024, arXiv:2402.08859. [Google Scholar] [CrossRef]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2017; Volume 30. [Google Scholar]
  11. Kang, W.C.; McAuley, J. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
  12. Luo, S.; Yao, Y.; He, B.; Huang, Y.; Zhou, A.; Zhang, X.; Xiao, Y.; Zhan, M.; Song, L. Integrating large language models into recommendation via mutual augmentation and adaptive aggregation. arXiv 2024, arXiv:2401.13870. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Bao, K.; Yan, M.; Wang, W.; Feng, F.; He, X. Text-like Encoding of Collaborative Information in Large Language Models for Recommendation. arXiv 2024, arXiv:2406.03210. [Google Scholar] [CrossRef]
  14. Bao, K.; Zhang, J.; Zhang, Y.; Wang, W.; Feng, F.; He, X. TallRec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023. [Google Scholar]
  15. Geng, S.; Liu, S.; Fu, Z.; Ge, Y.; Zhang, Y. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems, Seattle, WA, USA, 18–23 September 2022. [Google Scholar]
  16. Chen, J.; Gao, C.; Yuan, S.; Liu, S.; Cai, Q.; Jiang, P. DLCRec: A novel approach for managing diversity in LLM-based recommender systems. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, Hannover, Germany, 10–14 March 2025. [Google Scholar]
  17. Cao, Y.; Mehta, N.; Yi, X.; Keshavan, R.; Heldt, L.; Hong, L.; Chi, E.H.; Sathiamoorthy, M. Aligning Large Language Models with Recommendation Knowledge. arXiv 2024, arXiv:2404.00245. [Google Scholar] [CrossRef]
  18. Xu, W.; Wu, Q.; Liang, Z.; Han, J.; Ning, X.; Shi, Y.; Lin, W.; Zhang, Y. SLMRec: Distilling large language models into small for sequential recommendation. arXiv 2025, arXiv:2405.17890. [Google Scholar]
  19. Ji, J.; Li, Z.; Xu, S.; Hua, W.; Ge, Y.; Tan, J.; Zhang, Y. GenRec: Large language model for generative recommendation. In European Conference on Information Retrieval; Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
  20. Yang, S.; Ma, W.; Sun, P.; Zhang, M.; Ai, Q.; Liu, Y.; Cai, M. Common sense enhanced knowledge-based recommendation with large language model. In International Conference on Database Systems for Advanced Applications; Springer Nature: Singapore, 2024. [Google Scholar]
  21. Zheng, B.; Wang, X.; Liu, E.; Wang, X.; Hongyu, L.; Chen, Y.; Zhao, W.X.; Wen, J.R. DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation. arXiv 2025, arXiv:2505.16810. [Google Scholar] [CrossRef]
  22. Luo, W.; Song, C.; Yi, L.; Cheng, G. TRAWL: External Knowledge-Enhanced Recommendation with LLM Assistance. arXiv 2024, arXiv:2403.06642. [Google Scholar]
  23. Fang, Y.; Wang, W.; Zhang, Y.; Zhu, F.; Wang, Q.; Feng, F.; He, X. Reason4Rec: Large Language Models for Recommendation with Deliberative User Preference Alignment. arXiv 2025, arXiv:2502.02061. [Google Scholar]
  24. Wang, L.; Hu, H.; Sha, L.; Xu, C.; Wong, K.F.; Jiang, D. RecInDial: A unified framework for conversational recommendation with pretrained language models. arXiv 2021, arXiv:2110.07477. [Google Scholar]
  25. Ren, X.; Wei, W.; Xia, L.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Representation learning with large language models for recommendation. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
  26. Liu, J.; Liu, C.; Zhou, P.; Lv, R.; Zhou, K.; Zhang, Y. Is ChatGPT a good recommender? A preliminary study. arXiv 2023, arXiv:2304.10149. [Google Scholar] [CrossRef]
  27. Zheng, Z.; Chao, W.; Qiu, Z.; Zhu, H.; Xiong, H. Harnessing large language models for text-rich sequential recommendation. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
  28. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. (TiiS) 2015, 5, 1–19. [Google Scholar] [CrossRef]
  29. McAuley, J.; Targett, C.; Shi, Q.; van den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’15), Santiago, Chile, 9–13 August 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 43–52. [Google Scholar]
  30. Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; Tikk, D. Session-based recommendations with recurrent neural networks. arXiv 2015, arXiv:1511.06939. [Google Scholar]
  31. Tang, J.; Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Los Angeles, CA, USA, 5–9 February 2018. [Google Scholar]
  32. Yang, Z.; He, X.; Zhang, J.; Wu, J.; Xin, X.; Chen, J.; Wang, X. A generic learning framework for sequential recommendation with distribution shifts. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023. [Google Scholar]
  33. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013. [Google Scholar]
  34. Sun, P.; Li, J.; Li, G. Research on Collaborative Filtering Recommendation Algorithm Based on Sentiment Analysis and Topic Model. In Proceedings of the 4th International Conference on Big Data and Computing, Guangzhou, China, 10–12 May 2019. [Google Scholar]
  35. Zheng, L.; Noroozi, V.; Yu, P.S. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017. [Google Scholar]
  36. Chen, C.; Zhang, M.; Liu, Y.; Ma, S. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018. [Google Scholar]
  37. Li, H.; Lv, Y.; Wang, X.; Huang, J. A Deep Recommendation Model with Multi-Layer Interaction and Sentiment Analysis. Data Anal. Knowl. Discov. 2023, 7, 43–57. [Google Scholar]
  38. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2022, arXiv:2106.09685. [Google Scholar]
  39. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  40. Zhou, G.; Mou, N.; Fan, Y.; Pi, Q.; Bian, W.; Zhou, C.; Zhu, X.; Gai, K. Deep Interest Evolution Network for Click-Through Rate Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33. [Google Scholar]
  41. Li, W.; Zhu, X.; Zhang, B.; Geng, L.; Ji, J. Time Series Large Language Model for Recommendation System. In Proceedings of the 2025 4th International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC), Chengdu, China, 4–6 July 2025. [Google Scholar]
Figure 1. Structure diagram of a time-series-based large language model recommendation system.
Figure 2. The left side shows the AUC comparison of different methods on MovieLens and Amazon-Book datasets, and the right side shows the MAE comparison on Beauty and Toy-and-Game datasets.
Figure 3. The difference in AUC values between RTSL and TisLLM.
Figure 4. The difference in MAE values between RTSL and TisLLM.
Figure 5. The final AUC results for different SWL lengths on two datasets.
Figure 6. The final MAE results for different SWL lengths on two datasets.
Figure 7. The impact of varied temporal spans on the final AUC results under a fixed SWL-15 length across two datasets.
Figure 8. The impact of varied temporal spans on the final MAE results under a fixed SWL-15 length across two datasets.
Figure 9. TisLLM framework interpretability testing.
Table 1. Basic dataset information.

Dataset          Number of Users   Number of Ratings   Sparsity
MovieLens                    943             100,000     0.9363
Amazon-Book                  651              99,979     0.9966
Beauty                      5123              91,824     0.9985
Toys-and-Games              4188              74,423     0.9985
Table 2. Device configuration specifications.

Equipment Setting      Device Information
CPU                    12th Gen Intel(R) Core(TM) i9-12900K
GPU                    NVIDIA GeForce RTX 3090 24G
RAM                    128G KF3200C16D4/32GX DDR4
Operating System       Linux ubuntu 5.15.0-134-generic
Programming Language   Python 3.10.14
CUDA                   CUDA 11.8
Table 3. Resource consumption during the training phase.

Dataset        Time (hh:mm:ss)   GPU Util. (%)   Power (W)   VRAM Usage (MiB)
MovieLens             41:14:42             100         348             23,399
Amazon-Book           40:41:40             100         347             22,281
Beauty                12:16:05              83         348             16,679
Toy-and-Game          09:30:20              73         348             16,364
Table 4. Latency and resource consumption during the inference phase.

Dataset        Time (mm:ss)   GPU Util. (%)   Power (W)   VRAM Usage (MiB)
MovieLens             59:22              99         349             23,436
Amazon-Book           57:34              99         349             22,147
Beauty                41:27             100         348             14,741
Toy-and-Game          32:43             100         348             14,618
Table 5. Experimental results for AUC and MAE metrics. The optimal result in each row is marked with an asterisk (*).

AUC
Data\Methods   DROS     Gru4Rec   SasRec   Caser    TallRec   TisLLM
MovieLens      0.5502   0.5341    0.5225   0.5420   0.6866    0.7009 ± 0.0019 *
Amazon-Book    0.5021   0.4988    0.4991   0.4959   0.6484    0.7053 ± 0.0032 *

MAE
Data\Methods   HFT      SATMCF    DeepConn   NARRE    DeepSami   TisLLM
Beauty         0.9114   0.8733    0.8550     0.8488   0.8364     0.7773 ± 0.0030 *
Toy-and-Game   0.6638   0.6499    0.6435     0.6264   0.6127 *   0.6269 ± 0.0022
Table 6. Performance comparison of TisLLM framework across different sliding window lengths. The optimal result in each row is marked with an asterisk (*).

AUC
Dataset        SWL-2    SWL-5    SWL-10   SWL-15     SWL-20
MovieLens      0.6846   0.6907   0.7018   0.7033     0.7120 *
Amazon-Book    0.6887   0.6983   0.7104   0.7148 *   0.7001

MAE
Dataset        SWL-3    SWL-4    SWL-5      SWL-6    SWL-7
Beauty         0.7953   0.7913   0.7794 *   0.7854   0.7994
Toy-and-Game   0.6544   0.6442   0.6245 *   0.6373   0.6419
Table 7. The specific numerical values of AUC and MAE. The optimal result in each column is marked with an asterisk (*).

             AUC                        MAE
Model    MovieLens   Amazon-Book   Beauty     Toy-and-Game
TisLLM    0.7018 *      0.7033 *   0.6245 *       0.7794 *
TSRL      0.5952        0.6563     0.6545         0.7953
SWRL      0.5855        0.6671     0.6499         0.7968
Table 8. Performance of TisLLM framework under minimum temporal span constraints.

AUC
Dataset        30 d     45 d     60 d     75 d     90 d     120 d
MovieLens      0.6997   0.6975   0.6993   0.7012   0.7018   0.7036
Amazon-Book    0.6849   0.6881   0.6891   0.6939   0.6969   0.7035

MAE
Dataset        10 d     20 d     30 d     40 d     50 d     60 d
Beauty         0.7052   0.6943   0.7118   0.7087   0.7174   0.7189
Toy-and-Game   0.8138   0.8395   0.8581   0.8672   0.8534   0.8852