Article

Context-Driven Recommendation via Heterogeneous Temporal Modeling and Large Language Model in the Takeout System

1 School of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China
2 Department of Computer Science and Technology, Guizhou University of Finance and Economics, Guiyang 550025, China
3 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
4 University of Chinese Academy of Sciences, Beijing 100190, China
5 The Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Systems 2025, 13(8), 682; https://doi.org/10.3390/systems13080682
Submission received: 28 June 2025 / Revised: 28 July 2025 / Accepted: 7 August 2025 / Published: 11 August 2025

Abstract

On food delivery platforms, user decisions are often driven by dynamic contextual factors such as time, intent, and lifestyle patterns. Traditional context-aware recommender systems struggle to capture such implicit signals, especially when user behavior spans heterogeneous long- and short-term patterns. To address this, we propose a context-driven recommendation framework that integrates a hybrid sequence modeling architecture with a Large Language Model for post hoc reasoning and reranking. Specifically, the solution tackles several key issues: (1) integration of multimodal features to achieve explicit context fusion through a hybrid fusion strategy; (2) introduction of a context capture layer and a context propagation layer to enable effective encoding of implicit contextual states hidden in heterogeneous long- and short-term behaviors; (3) cross attention mechanisms to facilitate context retrospection, which allows implicit contexts to be uncovered; and (4) leveraging the reasoning capabilities of DeepSeek-R1 as a post-processing step to perform open knowledge-enhanced reranking. Extensive experiments on a real-world dataset show that our approach significantly outperforms strong baselines in both prediction accuracy and Top-K recommendation quality. Case studies further demonstrate the model’s ability to uncover nuanced, implicit contextual cues—such as family roles and holiday-specific behaviors—making it particularly effective for personalized, dynamic recommendations in high-frequency scenarios.

1. Introduction

Sequential Recommendation (SR) is a recommendation approach that predicts users’ potential next item of interest based on their historical behavior sequences [1]. While SR captures user preferences through sequence modeling, it often overlooks the influence of contextual factors. In real-world scenarios, besides personal preferences and needs, an individual’s consumption behavior is usually influenced by contextual factors [2], such as work schedule, lifestyle, and other situational conditions.
Context is defined as any information that can be used to describe the state of an entity, where the entity may refer to users, items, and so forth [3]. Specifically, context is shaped by factors such as time and environment [4]. Yao et al. categorized context into three types: user context, item context, and decision context [5]. Various context-aware recommendation systems (CARSs) have been proposed to incorporate such contextual information into recommendations [6]—for example, time-aware recommendation systems [7,8] and session-based recommendation systems [9,10]. Typically, context-aware recommendation frameworks integrate contextual data as explicit auxiliary input during computation [11,12].
However, it is important to note that some contextual factors are implicit and cannot be directly observed [13]. Consequently, many context-aware recommendation systems lack the ability to learn implicit context-driven patterns, which has led to a shift in research focus from context-aware recommendations to context-driven recommendations.
The paradigm of context-driven recommendation focuses on inferring a user’s contextual state through mining historical user–item interactions, enabling personalized context modeling. In this paradigm, the item content helps the system understand the user’s contextual state [14]. Since user behavior constantly evolves, the corresponding contextual state also changes dynamically [15], which is often reflected in the evolution of user preferences over time.
To integrate contextual information into the recommendation model, this work begins by analyzing user preferences. As user preferences are inherently dynamic, capturing their long-term dependencies is crucial for understanding changing contextual states [16]. In addition, short-term behaviors can also influence long-term contextual dynamics, making it important to account for time intervals and the tailing effect within historical interactions. Furthermore, some contextual patterns exhibit temporal periodicity, which implies that the corresponding state representations may also follow cyclical patterns.
When it comes to consumer goods, items on e-commerce platforms often have multimodal representations such as comments and images, which can be regarded as an explicit item context. Product names are sometimes crafted with fancy wording to attract users, while images usually provide the most intuitive and direct visual representation of a product. Features such as color and shape are also crucial visual factors influencing purchase decisions. In fact, humans perceive the world through multiple modalities, including vision, touch, and taste [17]. Due to the inherent limitations of online shopping, users often rely primarily on visual cues and personal preferences to make context-driven purchasing decisions. Therefore, applying fusion strategies to integrate multimodal features for complementary and collaborative representation is of great importance.
Fusion strategies can be categorized into early fusion, intermediate fusion, late fusion, and hybrid fusion, depending on the stage at which features from different modalities are integrated [18]. In early fusion, features are extracted separately from each modality and then combined. Intermediate fusion involves modeling and merging features through layers or progressively staged fusion. Late fusion combines decisions obtained from independently trained models for each modality. Hybrid fusion blends the other three strategies to maximize their strengths [19].
Based on these observations, this paper proposes a Time-Aware Long- and Short-Term Recommendation (TLSR) model, which leverages the hybrid fusion strategy to enable personalized recommendations in context-driven scenarios. TLSR consists of a behavior extraction layer, a context capture layer, a context propagation layer, and a prediction layer, and it frames the recommendation task as a binary classification problem through sequential modeling.
First, TLSR leverages multimodal features to integrate explicit contextual information and incorporates contextual states from both long- and short-term user preferences. It utilizes cross attention mechanisms to compute the probability that a user will purchase a given item under the current contextual state. Based on the predicted probabilities, items are ranked in descending order to generate a Top-K recommendation list.
Furthermore, the TLSR model is validated on a real-world food delivery dataset. Through case studies, it is demonstrated that TLSR is capable of learning implicit context-driven patterns. Combined with the reasoning ability of DeepSeek-R1, the model further enhances recommendation quality by reranking based on open knowledge. In addition, an ablation study is conducted to verify the effectiveness of the hybrid fusion strategy employed by TLSR. In summary, the main contributions of this paper are as follows.
1. For multimodal features, TLSR employs a hybrid fusion strategy to integrate explicit contextual information. Specifically, its behavior extraction layer embeds multimodal data to comprehensively represent item content attributes and extract explicit context, achieving early fusion. During intermediate fusion, the context capture layer and context propagation layer model long- and short-term user preferences, enabling cross-temporal local feature fusion and dynamic feature fusion for personalized context state representation. Subsequently, user features are progressively fused as part of the final representation.
2. Empirical analysis and comparative experiments on the real-world food delivery dataset show that TLSR consistently outperforms baseline models both in overall prediction accuracy and in the quality of the generated Top-K recommendation lists.
3. Through case studies, it is demonstrated that TLSR is capable of learning implicit contextual signals. Leveraging cross attention mechanisms, TLSR performs context retrospection to uncover implicit contextual states. After integrating DeepSeek-R1’s reasoning capabilities for open knowledge-based enhancements, the reranked recommendation lists better align with the current context.
4. The effectiveness of TLSR’s hybrid fusion strategy is validated through ablation studies. The results confirm that modality fusion, layer fusion, and combination fusion all play significant roles in helping the model learn contextual states and improving its overall performance.
The rest of the paper is organized as follows. Section 2 discusses relevant literature. Section 3 provides a detailed introduction to the methodology used in this paper. Section 4 shows the experimental procedure and result analysis. Section 5 provides an in-depth discussion. Finally, Section 6 concludes the paper.

2. Literature Review

This section first describes the studies related to multimodal fusion and then provides a detailed introduction to various recommendation systems.

2.1. Multimodal Fusion

The complete representation of an item is often distributed across multiple data modalities. Therefore, effectively extracting and integrating key information from different modalities enhances feature complementarity and facilitates collaborative learning [20]. Multi-modal data can generally be categorized into two types based on their sources: multi-sensor data and multi-type data.
Multi-sensor data refers to information collected from different sensors or devices. For example, Li et al. proposed a focusing information integration framework to improve the fusion of multi-sensor images, enhancing their representational quality [21]. Krishna et al. designed a device directedness speech detection (DDSD) model that incorporates non-verbal prosody to augment verbal features, achieving improved performance through the combination of intermediate and late fusion strategies [22]. In the field of medical imaging, Li et al. introduced a Laplacian redecomposition (LRD) framework to mitigate the effects of color distortion, blurring, and noise in multi-modal medical image fusion [23].
Multi-type data refers to information represented in different forms, such as text, images, and videos. To address the problem of text descriptions failing to capture background details—which often leads to degraded image generation quality—Zhao et al. proposed a multi-modal fusion generative adversarial network that significantly improves text-to-image generation performance [24]. For the task of fake news detection, Zhu et al. introduced IFIS, a multi-modal detection method that leverages intra-modality feature aggregation and inter-modality semantic fusion, achieving strong and balanced performance across multiple benchmarks [25]. In the context of vision-textual sentiment analysis, Gan et al. developed a multi-modal fusion network (MFN) equipped with a multi-head self-attention mechanism, which effectively handles both the heterogeneity and the homogeneity of different modalities. Their results demonstrated that intermediate fusion leads to the best performance in the MFN architecture [26]. In this paper, multi-type data is employed to express explicit contextual information across different modalities, thereby contributing to the realization of personalized recommendations.

2.2. Recommendation System

With the rapid development of e-commerce, social networks, and related fields, recommendation systems have become an indispensable component of modern digital platforms and have drawn significant attention from researchers. Based on their core methodologies, recommendation approaches can be broadly divided into two categories: traditional models and deep learning-based models [27].
In traditional models, classical approaches include collaborative filtering, logistic regression, and factorization machines. Collaborative filtering generates recommendations by calculating similarities between users or items [28]. Logistic regression formulates the recommendation task as a binary classification problem to predict user preferences for items [29]. Factorization machines extend this further by capturing higher-order interactions through feature cross-interactions [30].
In contrast, deep learning-based models have introduced powerful techniques such as feature crossing, graph neural networks, and attention mechanisms. For example, the Deep & Cross Network (DCN) enhances recommendation performance by learning complex feature interactions at both shallow and deep levels [31]. Graph neural networks model user–item interactions as graph structures, learning embeddings for nodes and edges while framing the recommendation as a link prediction problem [32]. Meanwhile, attention mechanisms dynamically prioritize key features in the input data, allowing models to focus on the most relevant signals for more accurate recommendations [33].
A particularly important subfield of recommendation research is Sequential Recommendation. One of the earliest methods in this domain, the Markov chain, models the probability of future user actions based on a transition probability matrix. However, its focus on short-term dependencies limits its ability to capture long-term preferences [34]. The Self-Attention-based Sequential Recommendation model (SASRec) addresses this limitation by leveraging self-attention mechanisms to model both long-term semantic dependencies and sparse behavioral patterns in user sequences [35]. Building upon this, TiSASRec incorporates temporal interval information into the self-attention framework, further improving the accuracy of Sequential Recommendation [36].
In the e-commerce context, two core recommendation tasks are Click-Through Rate (CTR) prediction and ranking-based recommendation. Zhou et al. proposed the Deep Interest Network (DIN), which adaptively captures user interests from historical behavior sequences. Online A/B testing on Alibaba’s display advertising system showed that DIN increased CTR by 10% [37]. Han et al. introduced SRP4CTR, a Sequential Recommendation pre-training framework specifically designed for CTR prediction, which achieved a 0.7% CTR improvement when deployed on Meituan’s takeaway recommender system [38]. Wang et al. proposed FTAPR, a Fulfillment Time-Aware Personalized Ranking model that integrates order fulfillment cycle time prediction into the recommendation process. Online experiments on the Ele.me platform demonstrated that FTAPR improved CTR by 1.3% [39].
An increasing number of studies are focusing on recommendation systems based on Large Language Models (LLMs). They can be broadly categorized into four types: LLM-based direct recommendation, LLM-based representation learning recommendation, LLM-based generative learning recommendation, and LLM-based prompt learning recommendation [40]. Dai et al. explored the recommendation capabilities of LLMs and analyzed them from three ranking perspectives, demonstrating that ChatGPT (gpt-3.5-turbo) consistently outperformed baseline models across all ranking tasks [41]. Du et al. proposed LGIR, an LLM-based interactive recommendation approach that integrates generative adversarial networks (GANs) for job recommendation, effectively addressing both few-shot and fabricated generation issues [42]. Wang et al. utilized ChatGPT in collaboration with the AutoDisenSeq model, combining the pretrained knowledge of the LLM with the task-specific knowledge of the sequential model, which can benefit cold-start scenarios [43].
Unlike other models, the TLSR model proposed in this paper is capable of capturing the dynamic evolution of contextual states through sequence modeling, thereby enabling the learning of implicit contextual information.

3. Methodology

3.1. TLSR

To achieve context-driven personalized recommendations, this paper proposes a Time-Aware Long- and Short-Term Recommendation (TLSR) model that adopts the hybrid fusion strategy. TLSR consists of four main components: a behavior extraction layer, context capture layer, context propagation layer, and prediction layer. It formulates item recommendations as a binary classification prediction problem through sequential modeling, with its architecture illustrated in Figure 1.
Specifically, TLSR first divides the user’s historical behavior sequence into long-term and short-term behavior sequences in the behavior extraction layer, integrating the multimodal information of items to achieve early fusion. Next, it leverages the context capture layer and context propagation layer to model long- and short-term preferences, achieving intermediate fusion and integrating personalized contextual states within preferences. Then, the prediction layer utilizes cross-attention mechanisms to interact with candidate items. By progressively incorporating user features, it predicts the purchase probability of each candidate item under the current context.
Finally, items are ranked in descending order based on their predicted probabilities, generating a Top-K recommendation list.

3.1.1. Behavior Extraction Layer

In the TLSR model, based on the current timestamp of the candidate item, this paper first defines the historical behavior sequence of the current week as the user’s short-term behaviors. Then, the historical behavior sequences from the nearest four non-empty weeks are used as the user’s long-term behaviors. The historical behavior sequence consists of previously purchased items sorted in chronological order and is further organized by day and week. Since the daily historical behavior sequences vary in length, this paper applies zero-padding to standardize them. Specifically, positions in the sequence where no item was purchased are masked to ensure consistency in sequence length.
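To make the sequence construction concrete, the following Python sketch organizes a user’s history by ISO week and day, pads each day to a fixed item count, and splits out the current week and the nearest non-empty past weeks. It assumes orders arrive as (item_id, timestamp) pairs; the helper name and the per-day capacity are illustrative, not the authors’ code.

```python
from collections import defaultdict
from datetime import datetime, timezone

def split_long_short(orders, t0, num_long_weeks=4, max_items_per_day=5):
    # Group purchases by ISO (week, weekday); weekday runs 1 (Mon) .. 7 (Sun).
    # ISO-year boundaries are ignored for brevity (the dataset spans one year).
    by_week = defaultdict(lambda: defaultdict(list))
    for item_id, ts in sorted(orders, key=lambda o: o[1]):
        iso = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
        by_week[iso.week][iso.weekday].append(item_id)

    def pad_week(days):
        # 7 days x max_items_per_day grid; 0 is the mask for "no purchase"
        return [(days.get(d, []) + [0] * max_items_per_day)[:max_items_per_day]
                for d in range(1, 8)]

    cur = datetime.fromtimestamp(t0, tz=timezone.utc).isocalendar().week
    short_term = pad_week(by_week.get(cur, {}))
    # the nearest non-empty earlier weeks form the long-term sequence
    past = sorted(w for w in by_week if w < cur)[-num_long_weeks:]
    long_term = [pad_week(by_week[w]) for w in past]
    return long_term, short_term
```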
Secondly, to comprehensively represent the content attributes of items, this paper integrates multimodal information for feature embedding and concatenation. For the image modality, a pre-trained ResNet50 [44] model is used as a feature extractor, and the extracted dense features $F_p$ are directly incorporated into the model calculations. For the numerical modality of time intervals, the time interval $\Delta t = t_0 - t$ (unit: second) is calculated from the timestamp $t_0$ of the candidate item and the timestamps $t$ of historical purchases. The time interval encoding [45] is then used to embed this feature, as follows:

$$ F_t = \cos\left( W \Delta t + b \right), \tag{1} $$

where $W$ represents the weight and $b$ is the bias. Similarly, the time interval feature of the candidate item is encoded and embedded accordingly. For the sparse features in the numerical modality, this paper employs embedding encoding, denoted as $F_s$. Likewise, embedding encoding $F_l$ is used for the text modality. In summary, the item features $F_f^t$ at timestamp $t$ are the concatenation of the multimodal embedding features:

$$ F_f^t = \left[ F_p;\, F_t;\, F_s;\, F_l \right]. \tag{2} $$
As a result, explicit contextual information is integrated into the multimodal features of the items, making them suitable for representing user preferences.
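As an illustration, the following PyTorch sketch implements Equations (1) and (2). The merchant and label vocabulary sizes follow Section 4.1.1; the embedding dimensions and sample values are assumptions.

```python
# Minimal sketch of the cosine time-interval encoding and early fusion.
import torch
import torch.nn as nn

class TimeIntervalEncoding(nn.Module):
    """F_t = cos(W * dt + b): maps a scalar time interval to a dense vector."""
    def __init__(self, dim=32):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, dt):                      # dt: (batch,) seconds
        return torch.cos(dt.unsqueeze(-1) * self.w + self.b)  # (batch, dim)

# Early fusion: F_f^t = [F_p ; F_t ; F_s ; F_l]
enc = TimeIntervalEncoding(dim=32)
F_p = torch.randn(4, 2048)                      # ResNet50 image features
F_t = enc(torch.tensor([30.0, 600.0, 86400.0, 0.0]))
F_s = nn.Embedding(2398, 16)(torch.tensor([5, 7, 9, 11]))   # merchant IDs
F_l = nn.Embedding(10, 8)(torch.tensor([1, 3, 2, 9]))       # text labels 1..9
F_f = torch.cat([F_p, F_t, F_s, F_l], dim=-1)   # (4, 2048 + 32 + 16 + 8)
```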
Finally, for the long- and short-term behavior sequences that contain multimodal features, dimension permutation is applied. Specifically, this involves permuting the length dimension of the daily behavior sequences from the long-term and short-term sequences into the channel dimension, allowing them to be compatible with the subsequent convolutional operations in the context capture layer.

3.1.2. Context Capture Layer

Influenced by context-driven consumption, user preferences often exhibit time periodicity and tail effects, such as purchasing on similar weekdays across different weeks or making purchases on consecutive days within the same week. Additionally, considering the real-world consumption scenario, a user can make multiple purchases within a single day, and each purchase may involve multiple items. This results in daily historical behavior sequences typically containing multiple items, with some items having the same purchase time.
Based on this, the TLSR model designs a context capture layer to achieve cross-temporal local feature fusion within the long- and short-term preferences. This layer is mainly composed of convolutional neural networks (CNNs) [46]. Specifically, in the Conv3d layer, an $S \times S \times S$ convolutional kernel aggregates neighboring information across $S$ weeks and $S$ days, applied along the channel dimension. This allows the multimodal features of multiple items in the sequence to be integrated into a single feature vector. Similarly, the Conv2d layer uses an $S \times S$ convolutional kernel to aggregate neighboring information across $S$ days within the current week, likewise applied along the channel dimension to obtain the integrated feature vector. The operation diagram is shown in Figure 2. The feature vector is then processed using Maxpool1d to comprehensively represent the user’s long- and short-term preferences. The calculation steps of the context capture layer are as follows.
Step 1: A feature fusion matrix $Fusion_{c3d}^t$ is generated by applying a Conv3d layer to the historical behavior sequences representing long-term preferences. Likewise, a Conv2d layer is applied to the historical behavior sequences representing short-term preferences to obtain another feature fusion matrix $Fusion_{c2d}^t$, as illustrated below:

$$ Fusion_{c3d}^t = \mathrm{Conv3d}\left( F_{long_1}^t, F_{long_2}^t, \ldots, F_{long_N}^t \right), \tag{3} $$

$$ Fusion_{c2d}^t = \mathrm{Conv2d}\left( F_{short_1}^t, F_{short_2}^t, \ldots, F_{short_N}^t \right), \tag{4} $$

where $N$ represents the length of the daily behavior sequence (which is equal to the number of channels). This means that the maximum number of items purchased per day in the user’s historical behaviors is $N$. Additionally, both the Conv3d layer and the Conv2d layer apply padding of $\lfloor S/2 \rfloor$ in each dimension, i.e., $(\lfloor S/2 \rfloor, \lfloor S/2 \rfloor, \lfloor S/2 \rfloor)$ for Conv3d and $(\lfloor S/2 \rfloor, \lfloor S/2 \rfloor)$ for Conv2d, with a stride of 1 along the channel dimension, thus achieving the cross-temporal local feature fusion for both long-term and short-term preferences. Moreover, the $s$-th feature matrix in the sequence is:

$$ F_{long_s}^t = \begin{bmatrix} F_{f_{w_1 d_1 s}}^t & F_{f_{w_1 d_2 s}}^t & \cdots & F_{f_{w_1 d_7 s}}^t \\ F_{f_{w_2 d_1 s}}^t & F_{f_{w_2 d_2 s}}^t & \cdots & F_{f_{w_2 d_7 s}}^t \\ F_{f_{w_3 d_1 s}}^t & F_{f_{w_3 d_2 s}}^t & \cdots & F_{f_{w_3 d_7 s}}^t \\ F_{f_{w_4 d_1 s}}^t & F_{f_{w_4 d_2 s}}^t & \cdots & F_{f_{w_4 d_7 s}}^t \end{bmatrix}, \tag{5} $$

$$ F_{short_s}^t = \begin{bmatrix} F_{f_{w_c d_1 s}}^t & F_{f_{w_c d_2 s}}^t & \cdots & F_{f_{w_c d_7 s}}^t \end{bmatrix}, \tag{6} $$

where $w_1, w_2, w_3, w_4$ denote the four weeks involved in the long-term preferences, $w_c$ denotes the current week involved in the short-term preferences, and $d_1, d_2, \ldots, d_7$ correspond to the seven days of the week (from Monday to Sunday). The entries are the comprehensive item features $F_f^t$ at timestamp $t$, as obtained from Equation (2).
Step 2: After dividing $Fusion_{c3d}^t$ into weekly segments, Maxpool1d is applied for further processing. The corresponding $Fusion_{c2d}^t$ is directly refined through Maxpool1d, as described by the following formulas:

$$ \left[ Fusion_{c3d_1}^t, Fusion_{c3d_2}^t, Fusion_{c3d_3}^t, Fusion_{c3d_4}^t \right] = Fusion_{c3d}^t, \tag{7} $$

$$ Fusion_{pool_g}^t = \mathrm{Maxpool1d}\left( Fusion_{c3d_g}^t \right), \tag{8} $$

$$ Fusion_{pool_c}^t = \mathrm{Maxpool1d}\left( Fusion_{c2d}^t \right), \tag{9} $$

where $g = 1, 2, 3, 4$ indexes the four weeks involved in the long-term preferences; $Fusion_{pool_g}^t$ denotes the contextual state captured after integrating long-term preferences; and $Fusion_{pool_c}^t$ denotes the contextual state captured after integrating short-term preferences.
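The following PyTorch sketch shows one possible reading of this layer: the per-day item count $N$ acts as the channel dimension, Conv3d/Conv2d with kernel size $S$ and padding $\lfloor S/2 \rfloor$ fuse neighboring weeks and days into a single vector per day, and Maxpool1d compresses the per-week results. All tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

N, D, S = 5, 64, 3                       # items/day, feature dim, kernel size
conv3d = nn.Conv3d(N, 1, kernel_size=S, padding=S // 2)   # long-term fusion
conv2d = nn.Conv2d(N, 1, kernel_size=S, padding=S // 2)   # short-term fusion
pool = nn.MaxPool1d(kernel_size=2)

long_seq = torch.randn(8, N, 4, 7, D)    # (batch, items, weeks, days, dim)
short_seq = torch.randn(8, N, 7, D)      # (batch, items, days, dim)

fusion_c3d = conv3d(long_seq).squeeze(1)     # (8, 4, 7, D)
fusion_c2d = conv2d(short_seq).squeeze(1)    # (8, 7, D)

# split the long-term fusion by week, then Maxpool1d over the feature axis
weekly = fusion_c3d.unbind(dim=1)            # four (8, 7, D) tensors
fusion_pool_g = [pool(w) for w in weekly]    # each (8, 7, D // 2)
fusion_pool_c = pool(fusion_c2d)             # (8, 7, D // 2)
```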

3.1.3. Context Propagation Layer

Due to the dynamic nature of user preferences, the contextual states inferred from them also change over time. Capturing long-term dependencies among these states is therefore essential for accurate modeling [16]. To this end, the TLSR model employs Gated Recurrent Units (GRUs) [47], which are specifically designed to handle sequential data with temporal dependencies. GRUs maintain and update hidden states through gated mechanisms, enabling them to remember important information over long periods. This makes them particularly effective at capturing time-related patterns such as trends, intervals, and periodic behaviors in user preferences. Accordingly, TLSR introduces a context propagation layer with dual recurrent operations—namely intra-week and inter-week recurrence—to reflect the periodic nature of user behaviors. In contrast, architectures like the Transformer [48], originally developed for natural language processing, rely heavily on positional encoding, which primarily reflects token positions within a sequence. This approach lacks an explicit mechanism to model actual time intervals or evolving temporal patterns, limiting its ability to capture complex temporal dynamics in recommendation tasks [49]. The operation diagram is shown in Figure 3.
As a result, after arranging the user’s long-term preferences in chronological order by day and week, the dual recurrent operations effectively learn their dynamic changes. This allows for the propagation of the embedded contextual states, achieving dynamic feature fusion. The specific calculation steps are as follows.
Step 1: Four GRU stacks are used to perform the intra-week recurrence calculations, each containing $L$ GRU layers with a dropout of 0.1 applied between the layers:

$$ Fusion_{gru_g}^t = \mathrm{GRUs}_g\left( Fusion_{pool_g}^t \right). \tag{10} $$
Step 2: After the intra-week recurrence, the four fused features are concatenated and flattened according to the weekly time sequence. Then, another GRU stack performs the inter-week recurrence calculation:

$$ Fusion_{grus}^t = \mathrm{GRUs}\left( \mathrm{Flatten}\left( Fusion_{gru_1}^t, Fusion_{gru_2}^t, Fusion_{gru_3}^t, Fusion_{gru_4}^t \right) \right). \tag{11} $$
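A minimal PyTorch sketch of the dual recurrence, assuming the weekly states produced by the context capture layer; the layer count $L = 2$ and the inter-layer dropout of 0.1 follow the text, while the hidden size is arbitrary.

```python
import torch
import torch.nn as nn

D, L = 32, 2
intra_grus = nn.ModuleList(
    nn.GRU(D, D, num_layers=L, dropout=0.1, batch_first=True) for _ in range(4)
)
inter_gru = nn.GRU(D, D, num_layers=L, dropout=0.1, batch_first=True)

# fusion_pool_g: four weekly states (batch, day, dim) from the capture layer
fusion_pool_g = [torch.randn(8, 7, D) for _ in range(4)]

# Step 1: intra-week recurrence, one GRU stack per week
fusion_gru = [gru(x)[0] for gru, x in zip(intra_grus, fusion_pool_g)]

# Step 2: concatenate the weeks chronologically, then inter-week recurrence
flat = torch.cat(fusion_gru, dim=1)           # (8, 4 * 7, D)
fusion_grus, _ = inter_gru(flat)              # (8, 28, D)
```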

3.1.4. Prediction Layer

To enable interaction between the candidate items and the user’s long-term and short-term contextual states, the TLSR model uses two cross attention mechanisms [48,49], each designed to handle the interaction of layer-normalized contextual states. Specifically, a multi-head cross attention mechanism is employed for the interaction between the long-term contextual states and the candidate items, where the number of attention heads is set to H, allowing each head to focus on different parts of the long-term contextual states.
In contrast, a single-head cross attention mechanism is used for the interaction between the short-term contextual states and the candidate items, enabling the model to concentrate on the user’s contextual state within the current week.
The outputs of the two cross attention mechanisms are then enhanced through residual connections to retain implicit information. Afterward, the enhanced features are fused via concatenation and layer normalization. Finally, the fused features are concatenated with the embedding representation of the user ID and passed through a linear layer, followed by a Sigmoid activation function, to predict the probability of the user purchasing the candidate items under the current contextual state.
The detailed computation steps of the prediction layer are as follows.
Step 1: Normalize the long- and short-term contextual states through layer normalization.
$$ Fusion_{plong}^t = \mathrm{LayerNorm}\left( Fusion_{grus}^t \right), \tag{12} $$

$$ Fusion_{pshort}^t = \mathrm{LayerNorm}\left( Fusion_{pool_c}^t \right). \tag{13} $$
Step 2: Interaction between candidate items and long-term and short-term contextual states.
$$ Fusion_{attnlong}^t = W_{long}\, \mathrm{Concat}\left( Attn_1^t, \ldots, Attn_H^t \right), \tag{14} $$

$$ Fusion_{attnshort}^t = W_{short}\, Attn_0^t, \tag{15} $$

where

$$ Attn_H^t = \mathrm{Attention}_H\left( F_f^t, Fusion_{plong}^t, Fusion_{plong}^t \right), \tag{16} $$

$$ Attn_0^t = \mathrm{Attention}\left( F_f^t, Fusion_{pshort}^t, Fusion_{pshort}^t \right). \tag{17} $$
Step 3: After residual connections, concatenation and layer normalization are performed.
$$ Fusion_{flong}^t = Fusion_{attnlong}^t + F_f^t, \tag{18} $$

$$ Fusion_{fshort}^t = Fusion_{attnshort}^t + F_f^t, \tag{19} $$

$$ Fusion_{fls}^t = \mathrm{LayerNorm}\left( \mathrm{Concat}\left( Fusion_{flong}^t, Fusion_{fshort}^t \right) \right). \tag{20} $$
Step 4: After fusing the user features, a linear layer is applied, followed by a Sigmoid activation:

$$ Fusion_{fuls}^t = \mathrm{Linear}\left( \mathrm{Concat}\left( Fusion_{fls}^t, F_u \right) \right), \tag{21} $$

$$ p_{fu}^t = \mathrm{Sigmoid}\left( Fusion_{fuls}^t \right), \tag{22} $$

where $p_{fu}^t$ represents the probability that user $u$ will purchase item $f$ at the current timestamp $t$.
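Putting the four steps together, a condensed PyTorch sketch of the prediction layer might look as follows; $H = 4$ heads matches the configuration used later in the sensitivity analysis, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

D, H, U = 32, 4, 1000
attn_long = nn.MultiheadAttention(D, num_heads=H, batch_first=True)
attn_short = nn.MultiheadAttention(D, num_heads=1, batch_first=True)
ln_long, ln_short, ln_out = nn.LayerNorm(D), nn.LayerNorm(D), nn.LayerNorm(2 * D)
user_emb = nn.Embedding(U, 16)
head = nn.Linear(2 * D + 16, 1)

F_f = torch.randn(8, 1, D)                  # candidate item as a 1-token query
long_state = ln_long(torch.randn(8, 28, D))     # Eq. (12): 4 weeks x 7 days
short_state = ln_short(torch.randn(8, 7, D))    # Eq. (13): current week

out_long, _ = attn_long(F_f, long_state, long_state)      # Eqs. (14), (16)
out_short, _ = attn_short(F_f, short_state, short_state)  # Eqs. (15), (17)
# residuals, concat, LayerNorm: Eqs. (18)-(20)
fused = ln_out(torch.cat([out_long + F_f, out_short + F_f], dim=-1)).squeeze(1)

u = user_emb(torch.randint(0, U, (8,)))     # user-ID embedding
p = torch.sigmoid(head(torch.cat([fused, u], dim=-1)))    # Eqs. (21)-(22)
```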
In summary, the TLSR model is a Sequential Recommendation model that utilizes a hybrid fusion strategy. During early fusion, it extracts and concatenates features from various modalities to comprehensively represent the content attributes of items. In the intermediate fusion, the context capture layer uses convolution, while the context propagation layer employs a dual recurrence mechanism to fuse multiple feature vectors within the historical behavior sequences, which then interact with candidate items through the cross attention mechanisms. In the subsequent stages, user features are progressively integrated via concatenation.

3.2. DeepSeek-R1

DeepSeek-R1 is a large-scale language model built upon the Transformer architecture and designed to enhance both language understanding and text generation capabilities, particularly excelling in Chinese and multilingual tasks. The core model leverages Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) techniques [50], enabling cost-effective training and efficient inference. Furthermore, DeepSeek-R1 refines its language reasoning abilities through reinforcement learning. To support the research community, DeepSeek has open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1, based on the Qwen and Llama architectures. According to the official technical report, DeepSeek-R1 demonstrates strong performance across several authoritative benchmarks, including MMLU, AIME, and Codeforces, showcasing its effectiveness in language understanding, knowledge-based question answering, and mathematical reasoning tasks [51].

4. Experiments and Results

In this section, we provide a detailed description of the experimental setup and the analysis of the results. Our experiments aim to answer the following research questions.
RQ1: How do the performance and effectiveness of TLSR compare to those of other baseline models?
RQ2: How does the TLSR model focus on and balance the user’s long-term and short-term contextual states?
RQ3: Does TLSR have the ability to learn implicit context?
RQ4: How do different components affect the results of TLSR?

4.1. Experimental Settings

4.1.1. Dataset and Preprocessing

To validate the prediction effectiveness and recommendation performance of the TLSR model, we first collected a food delivery dataset from the LifePlus platform covering the period from 4 March 2024 to 2 June 2024. These 13 weeks were labeled 2024-W10 to 2024-W22 according to the ISO 8601 week-based format [52]. After filtering out irrelevant data, we retained the historical orders of the top 1000 users with the highest purchase frequency for empirical analysis, ensuring that their historical behavior sequences were non-sparse. Consequently, the data used for the experiment involved a total of 75,274 orders, 2398 merchants, 31,295 food items, and 1000 users.
Currently, LifePlus is an On-demand Food Delivery (OFD) platform that has been operating for 6 years, with a user base of 1.08 million and over 25 million service orders. Its primary service location is in Guizhou Province, China [53].
(1). Context-Driven Characteristics of the Data
Takeout food, which is a special type of item, typically has a short lifecycle and low consumption levels, resulting in relatively high purchase frequency. Moreover, considering the impact of context-driven consumption, users’ time preferences for purchases often exhibit periodicity and continuity. Based on these, we visualize the context-driven characteristics implicit in the data, as shown in Figure 4 and Figure 5.
The heatmap in Figure 4 represents the percentage of users purchasing takeout food each day, and this heat distribution visually demonstrates the time periodicity of purchasing preferences. Specifically, users tend to purchase more takeout food near the weekends. On the other hand, Figure 5 illustrates the distribution of the maximum consecutive days for users to purchase from the same merchant, showcasing the continuity of their purchasing preferences. In other words, most users purchase takeout food from the same merchant for no more than three consecutive days.
(2). Random Negative Sampling
Since all the sample data in the orders are positive samples, this study uses a random negative sampling method by food ID with a 1:20 ratio to obtain negative samples. It is important to note that users can purchase multiple food items from the same merchant at the same time when placing an order, so the number of positive samples purchased by a user varies across timestamps. Consequently, the number of negative samples obtained by random negative sampling at different timestamps is also unequal, meaning that the samples used in this study are dynamically imbalanced. Accordingly, all samples from the first 12 weeks (2024-W10 to 2024-W21) are randomly divided into training and validation sets in an 8:2 ratio, while all samples from the 13th week (2024-W22) are used as the test set. In total, the training set contains 2,015,546 samples, the validation set 503,887, and the test set 299,229.
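A simple sketch of this sampling scheme (the helper below is illustrative, not the authors’ implementation): because an order can contain several foods, the negative count per timestamp scales with the positive count, which is what makes the samples dynamically imbalanced.

```python
import random

def sample_negatives(order_items, all_food_ids, ratio=20, seed=None):
    """order_items: food IDs bought in one order (positives at one timestamp)."""
    rng = random.Random(seed)
    positives = set(order_items)
    candidates = [f for f in all_food_ids if f not in positives]
    return rng.sample(candidates, min(len(candidates), ratio * len(positives)))

# e.g., a 3-item order yields 60 negatives; a 1-item order yields 20
negs = sample_negatives([101, 102, 103], range(31295), ratio=20, seed=42)
```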
(3). Multimodal Data Preprocessing
For the image modality, each image is cropped to 224 × 224 pixels, and the RGB channel values are standardized. A pre-trained ResNet50 model [44] is then used to extract dense features, which are directly incorporated into the model’s computations. For the numerical modality, two features are used: the timestamp and the merchant ID. The timestamp is encoded and embedded as shown in Equation (1), while the merchant ID representing the food’s affiliation is embedded via embedding encoding. As for the text modality, although the food name serves as an iconic attribute, in real-world food delivery scenarios the semantics may carry noise and inconsistencies, so directly extracting semantic features from the names may not accurately characterize the corresponding food. To address this, the study adopts a labeling approach that combines both the name and the image to assign each food item a unified and unique label. These labels fall into nine categories, such as Chinese food, Western food, snacks, beverages, and fruits. Finally, the text labels are converted to numeric labels (1–9) and embedded as features.
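The image branch of this preprocessing can be sketched as follows. The ImageNet normalization constants and the "food.jpg" path are assumptions; the classifier head of ResNet50 is dropped so that the 2048-dimensional pooled features serve as $F_p$.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                      # 224 x 224 crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], # RGB standardization
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop fc head

with torch.no_grad():
    img = preprocess(Image.open("food.jpg").convert("RGB")).unsqueeze(0)
    F_p = extractor(img).flatten(1)                  # (1, 2048) dense features
```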

4.1.2. Experimental Details

Since the model output is the probability of purchase, this study treats personalized recommendation as a binary classification task. Therefore, the Binary Cross-Entropy (BCE) function [54] is used as the loss function for model training. The BCE function is as follows.
$$ \mathrm{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + \left( 1 - y_i \right) \log\left( 1 - p_i \right) \right], \tag{23} $$

where $p_i$ is the purchase probability $p_{fu}^t$ of the $i$-th sample, calculated by Equation (22), and $y_i$ is the true label of the $i$-th sample (1 for positive samples and 0 for negative samples). Additionally, this paper sets the optimizer to Adam [55], the learning rate to 0.0001, the number of training epochs to 30, and the batch size to 100. The loss value on the validation set (i.e., Validation Logloss) is used as the monitoring metric, and the model that achieves the minimum Validation Logloss during training is retained as the optimal model.
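A compact sketch of this training procedure, assuming `model`, `train_loader`, and `val_loader` are already constructed (they are placeholders here, not provided by the paper):

```python
import copy
import torch
import torch.nn as nn

criterion = nn.BCELoss()                    # Equation (23) on sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
best_val, best_state = float("inf"), None

for epoch in range(30):
    model.train()
    for x, y in train_loader:               # batches of size 100
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(-1), y.float())
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                   # Validation Logloss as the monitor
        val = sum(criterion(model(x).squeeze(-1), y.float()).item()
                  for x, y in val_loader) / len(val_loader)
    if val < best_val:                      # keep the minimum-Logloss model
        best_val, best_state = val, copy.deepcopy(model.state_dict())
```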
Then, considering the key parameters of TLSR, the kernel size S in the context capture layer and the number of GRU layers L in the context propagation layer are particularly important. The value of S determines the ability of the context capture layer to perform local feature fusion across temporal sequences. However, an excessively large kernel size may lead to model overfitting. The value of L determines the dynamic feature fusion capability of the dual recurrence layer, but too many GRU layers may cause gradient vanishing or explosion.
To identify the optimal combination of these key parameters, the search space for S is defined as {3, 5} and for L as {1, 2, 3}. The minimum validation loss (i.e., Validation Logloss) is used as the evaluation metric. The grid search results are visualized as a heatmap in Figure 6.
As shown in Figure 6, the smaller Validation Logloss in the heatmap indicates a better-performing parameter combination. The optimal combination of key parameters is identified as (S = 3, L = 2), which corresponds to the best TLSR model configuration: the convolution kernel size of S = 3 and GRU layers L = 2. Based on this result, the TLSR model with the optimal parameter combination is selected for the subsequent experiments.
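The grid search itself reduces to a few lines; `train_and_validate` below is an assumed helper (stubbed here) that would train TLSR with a given (S, L) and return its minimum Validation Logloss.

```python
from itertools import product

def train_and_validate(kernel_size, gru_layers):
    """Assumed helper: trains TLSR with (S, L) and returns the minimum
    Validation Logloss; replaced by a stub for illustration."""
    return 0.05 + 0.01 * abs(kernel_size - 3) + 0.005 * abs(gru_layers - 2)

search_space = {"S": [3, 5], "L": [1, 2, 3]}
results = {(s, l): train_and_validate(s, l)
           for s, l in product(search_space["S"], search_space["L"])}
best = min(results, key=results.get)        # expected (3, 2), as in Figure 6
print(f"best (S, L) = {best}, Validation Logloss = {results[best]:.4f}")
```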
In addition, the pseudo-code for the TLSR training process and a summary of its model architecture are provided in Appendix A (Algorithm A1).

4.1.3. Evaluation Metrics

To comprehensively evaluate the model’s prediction effectiveness and recommendation performance, this paper selects multiple metrics used for confusion matrix evaluation [56,57] and Top-K recommendation list evaluation [58,59].
(1). Confusion Matrix Evaluation Metrics
In order to assess the overall prediction effectiveness of the test set, this paper first sets the threshold at 0.5 to calculate the confusion matrix’s accuracy, precision, recall, and F1 score. Then, by combining different thresholds, the average precision (Ap) and Area Under Curve (AUC) are calculated and used as evaluation metrics. The specific calculation formulas are as follows.
$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \tag{24} $$

$$ \mathrm{Precision} = \frac{TP}{TP + FP}, \tag{25} $$

$$ \mathrm{Recall} = \frac{TP}{TP + FN}, \tag{26} $$

$$ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{27} $$

$$ \mathrm{Ap} = \sum_{i=1}^{n} \left( R_i - R_{i-1} \right) P_i, \tag{28} $$

$$ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\left( \mathrm{FPR} \right) \, d\,\mathrm{FPR}, \tag{29} $$

where TP denotes True Positives, FN False Negatives, FP False Positives, and TN True Negatives. Ap is the weighted average of precision $P_i$ at different recall levels $R_i$. AUC refers to the area under the Receiver Operating Characteristic (ROC) curve, commonly used to assess the performance of binary classification models. The ROC curve plots the True Positive Rate (TPR) on the vertical axis against the False Positive Rate (FPR) on the horizontal axis.
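All six metrics can be computed with scikit-learn, as in this toy sketch: the 0.5 threshold yields the confusion-matrix metrics, while Ap and AUC integrate over all thresholds.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])            # toy labels
p_hat = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1]) # predicted probabilities
y_pred = (p_hat >= 0.5).astype(int)              # thresholded at 0.5

print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(average_precision_score(y_true, p_hat), roc_auc_score(y_true, p_hat))
```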
(2). Top-K Recommendation Evaluation Metrics
For evaluating the recommendation performance of the model, this paper ranks the predicted candidate food purchase probabilities in descending order and recommends the top K items, generating a Top-K food recommendation list. However, considering that in actual food delivery scenarios the OFD platform primarily recommends merchants rather than food items, this paper further groups the foods belonging to the same merchant, selects the highest probability as the merchant-level purchase probability, and then ranks and selects the Top-K merchants for recommendation, generating a Top-K merchant recommendation list. Accordingly, to evaluate the Top-K recommendation list, this paper adapts the formula for the Hit Ratio (HR) metric to suit the actual food delivery scenario and combines it with Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) metrics. The specific calculation formulas are as follows.
$$ HR_{Food}@K = \frac{1}{TN} \sum_i \sum_t \sum_j \frac{hits_{Food}\left( i, t, j \right)}{JN_{it}}, \tag{30} $$

$$ HR_{Sell}@K = \frac{1}{TN} \sum_i \sum_t hits_{Sell}\left( i, t \right), \tag{31} $$

$$ MRR_{Sell}@K = \frac{1}{TN} \sum_i \sum_t \frac{1}{p_{it}^{Sell}}, \tag{32} $$

$$ NDCG_{Sell}@K = \frac{1}{TN} \sum_i \sum_t \frac{1}{\log_2\left( p_{it}^{Sell} + 1 \right)}, \tag{33} $$

where $hits_{Food}(i, t, j)$ is an indicator function that equals 1 if food $j$ purchased by user $i$ at timestamp $t$ appears on the food recommendation list, and 0 otherwise. Similarly, $hits_{Sell}(i, t)$ indicates whether the merchant from which user $i$ made a true purchase at timestamp $t$ is present on the merchant recommendation list (1 if present, 0 otherwise). $p_{it}^{Sell}$ represents the position of that merchant ID in the user’s recommendation list; if the merchant does not appear on the list, its value is $+\infty$. $JN_{it}$ denotes the number of foods purchased simultaneously, while $TN$ refers to the total number of purchases made by all users. Considering that in food delivery a user can only place an order from one merchant at a time but can buy multiple foods from that merchant, the MRR@K and NDCG@K metrics in this paper are used solely to evaluate the Top-K merchant recommendation list.
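A small sketch of these adapted metrics on a toy record (log base 2 is assumed for NDCG; each record holds one purchase event with its bought foods, true merchant, and the two ranked lists):

```python
import math

def topk_metrics(records, k):
    hr_food = hr_sell = mrr = ndcg = 0.0
    for bought_foods, true_sell, food_list, sell_list in records:
        topf, tops = food_list[:k], sell_list[:k]
        hr_food += sum(f in topf for f in bought_foods) / len(bought_foods)
        if true_sell in tops:
            p = tops.index(true_sell) + 1        # 1-based merchant position
            hr_sell += 1
            mrr += 1 / p
            ndcg += 1 / math.log2(p + 1)
    tn = len(records)                            # total purchases
    return hr_food / tn, hr_sell / tn, mrr / tn, ndcg / tn

# one purchase: foods {3, 7} from merchant 12, against the model's rankings
rec = [({3, 7}, 12, [7, 1, 3, 5, 9], [4, 12, 8, 2, 6])]
print(topk_metrics(rec, k=5))                    # (1.0, 1.0, 0.5, ~0.63)
```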

4.2. Results and Analysis

To answer RQ1~RQ4 proposed earlier, this paper conducts three studies, followed by a supplementary sensitivity analysis.

4.2.1. Comparative Study (RQ1)

To comprehensively compare the prediction effectiveness and recommendation performance of TLSR, this paper selects other baseline models for training and testing.
(1). Context-aware recommendation models: LR, DCNv2, and DIN are classic models based on context awareness. Among them, LR uses a linear model to weight and sum various features, and then maps them to probabilities [29]. DCNv2 further enhances the ability of feature crossover on the basis of DCN [60]. DIN can adaptively learn users’ dynamic interests [37].
(2). Sequential Recommendation models: GRU4Rec, SASRec, STAMP, CORE, NARM, and NPE are efficient models based on sequence modeling. Among them, GRU4Rec models users’ dynamic preferences through the GRU layer [61]. SASRec is a recommendation model based on Transformer, and it uses a self-attention mechanism to capture the dependency relationships between sequences [35]. STAMP prioritizes modeling short-term memory while balancing long-term interests [62]; CORE has achieved efficient recommendation through representation consistency encoding and robust distance measurement [63]; NARM models user behavior sequences through a hybrid encoder and captures their primary purpose [64]; and NPE improves recommendation performance by integrating the relationships between items and user preferences [65].
Moreover, the implementation of these baseline models mainly refers to the RecBole library [66,67]. In order to ensure fairness in comparison, their main training hyperparameters are kept consistent with TLSR, while a few hyperparameters unique to different models refer to the default values of RecBole. In addition, based on the predicted results, recommendation lists for Top-5 and Top-10 are generated for comparison. The specific evaluation results are shown in Table 1, Figure 7, Table 2, and Table 3, and the bolded results in the table represent the optimal performance for the corresponding metrics.
According to the comparison results, TLSR performs significantly better across the board, both in prediction effectiveness and in the evaluation of the Top-K recommendation lists. In particular, the Hit Ratio of the Top-5 merchant recommendation list reaches about 0.94. In the actual takeout scenario, the window of a recommendation page on an OFD platform often displays at most five merchant cards at a time, so the Top-5 merchant recommendation list corresponds to the recommended content on the platform homepage. This means that models with a higher $HR_{Sell}@5$ will contribute greatly to improving the user experience. LR is structurally and functionally simple, which limits its recommendation capability. Since the AUC of DCNv2 is close to 0.5, the model is essentially uninformative, indicating that DCNv2 is not suitable for this task. As for DIN, although it has an attention mechanism, it does not take dynamic preferences into consideration, so its performance is poor. GRU4Rec relies solely on the temporal modeling capability of the GRU, which limits its recommendations. SASRec and CORE both use Transformer-based encoders, which often have difficulty recognizing complex temporal patterns. For STAMP, NARM, and NPE, there is still room for improvement in their ability to model long- and short-term preferences. In summary, TLSR delivers higher prediction effectiveness and more accurate personalized recommendations.
Then, to test model performance more rigorously, this paper calculates the absolute error of the models’ predictions, $AE = \left| y_i - p_i \right|$, where $p_i$ is the purchase probability and $y_i$ is the true label, corresponding to Equation (23). Furthermore, $AE_{\mathrm{TLSR}}$ is paired with the absolute errors of the other baseline models, and two-sided Wilcoxon signed-rank tests are performed to determine whether the differences in model performance are significant. The null hypothesis (H0) is that the paired models perform similarly.
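The test itself is a one-liner with SciPy; the arrays below are synthetic stand-ins for the paired absolute errors of TLSR and one baseline.

```python
import numpy as np
from scipy.stats import wilcoxon

y = np.random.randint(0, 2, 500)                  # toy labels
ae_tlsr = np.abs(y - np.random.beta(2, 5, 500))   # stand-in AE for TLSR
ae_base = np.abs(y - np.random.beta(2, 3, 500))   # stand-in AE for a baseline

stat, p_value = wilcoxon(ae_tlsr, ae_base, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p_value:.2e}",
      "reject H0" if p_value < 0.05 else "fail to reject H0")
```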
As shown in the Wilcoxon test results in Table 4, the p-values of all paired samples are less than the significance level α = 0.05 , which means that all tests reject H0. In other words, the differences in model performance within all paired samples are significant, which once again proves the superiority of TLSR over other models.
Finally, this paper compares the computational overhead of the models, as shown in Table 5. Although TLSR has more total params than most baseline models, the GPU memory required to train it remains moderate. Combined with the recommendation performance of TLSR, its overall computational efficiency is good.

4.2.2. Case Study (RQ2 and RQ3)

To further analyze the focus and balance of TLSR in long- and short-term contexts, this paper selected four cases to present the results, as shown in Table 6.
In the recommendation lists, the bolded food IDs and merchant IDs represent the food or merchant that the user actually purchased at the corresponding timestamp, visually showing that the recommendation lists generated by the TLSR model are close to the user’s actual purchasing behavior. Then, to visually demonstrate the focus and balance of long- and short-term contexts for the candidate foods, this article takes food IDs 24238, 9516, 16901, and 14459 from Table 6 as examples for visualization. The multi-head cross attention weight coefficients in TLSR display the focus of the long-term context, while the single-head cross attention weight coefficients display the focus of the short-term context. The normalization result of the concatenated feature layers, represented by $Fusion_{fls}^t$ in Equation (20), intuitively demonstrates the balance of the long- and short-term contexts. The specific visualization results of the four cases are shown in Figure 8 and Appendix B.
For food ID = 14459 in case 4 (shown in Figure 8), the long-term context focuses more on the locally convolved features centered around “W18” and “W21,” while the short-term context focuses more on the features of the current week (W22) locally convolved around “Saturday.” This indicates a high correlation between food ID = 14459 and these features.
Through contextual retrospection, it was found that the name of food ID = 14459 indicates that it is intended for children, and the user also purchased similar children’s food in the historical behavior sequences of W18. This suggests that there may be children among the family members of user ID = 423. However, unlike the other cases, food ID = 14459 and its affiliated merchant do not rank first in the Top-10 recommendation list generated by TLSR. Timestamp = 1717243462 converts to “2024-06-01 12:04:22,” which is Children’s Day. This indicates that although TLSR has the ability to learn implicit contexts, it still lacks understanding of open knowledge such as festivals. Based on this, this article combines the Large Language Model DeepSeek-R1 (671B) to achieve open knowledge enhancement. Specifically, its reasoning ability is used to complete contextual reasoning about festivals and to rerank the recommendation list. The specific process is shown in Figure 9.
Leveraging the strong reasoning ability of DeepSeek-R1, this paper adopts a zero-shot prompting approach with the three dialogue roles, namely system, user, and assistant, accessed through DeepSeek’s official API. For the system role, we provide a precise prompt and standardize the output format. For the user role, we provide relevant information on the current timestamp and the recommendation list. The assistant then undergoes a detailed chain-of-thought reasoning process to output standardized reranking results. On this basis, we combine TLSR and DeepSeek-R1 into a recommendation pipeline, with specific details shown in Figure 10.
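Since DeepSeek’s official API is OpenAI-compatible, the reranking call can be sketched as below; the prompt wording and output format are illustrative, not the exact prompt used in this study.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

system_prompt = (
    "You rerank food recommendations. Given the current datetime and a Top-10 "
    "food list (ID: name), reorder it to fit the context (e.g., holidays). "
    "Output only the reranked food IDs, comma-separated."
)
user_prompt = (
    "Current time: 2024-06-01 12:04:22 (Saturday).\n"
    "Top-10 foods: 14459: kids' meal set, 24238: spicy noodles, ..."
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",               # DeepSeek-R1
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": user_prompt}],
)
reranked_ids = [int(i) for i in resp.choices[0].message.content.split(",")]
```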
After using DeepSeek-R1 to generate a new Top-10 food recommendation list, some foods may belong to the same merchant. In response to this situation, this paper groups them according to their affiliated merchants and then uses their highest ranking in the food recommendation list as the merchant recommendation ranking. When there are fewer than 10 merchants in the new Top-10 merchant recommendation list, the old Top-10 merchant recommendation list is used to supplement it. As a result, the comparison results in Table 7 are obtained after reranking.
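A sketch of this grouping-and-backfill logic (the IDs are made up):

```python
def rerank_merchants(reranked_foods, food_to_merchant, old_merchants, k=10):
    new_merchants = []
    for food in reranked_foods:              # a merchant's best food rank wins
        m = food_to_merchant[food]
        if m not in new_merchants:
            new_merchants.append(m)
    for m in old_merchants:                  # backfill from the old list
        if len(new_merchants) >= k:
            break
        if m not in new_merchants:
            new_merchants.append(m)
    return new_merchants[:k]

f2m = {14459: 88, 9516: 88, 16901: 42, 24238: 17}
print(rerank_merchants([14459, 9516, 24238, 16901], f2m, [17, 42, 88, 5, 9], k=4))
# -> [88, 17, 42, 5]
```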
According to Table 7, after the reranking adjustment by DeepSeek-R1, food ID = 14459 and its affiliated merchant are placed at the top of the recommendation lists. As a result, the rankings of the new recommendation lists become more consistent with the current “Children’s Day” context while maintaining the same Top-10 content.
Finally, we rerank all recommendation lists dated 1 June 2024 using the same method and evaluate them, as shown in Table 8. Although the DeepSeek-R1 adjustment lowers the scores on several metrics, the resulting lists are more consistent with the Children’s Day context.
In summary, the results of the case study indicate that the weight coefficients of cross attention not only intuitively reflect the focus of long- and short-term contexts but also, combined with contextual retrospection, identify users’ implicit contexts. This means that TLSR can learn implicit contexts and thus achieve personalized recommendations. Through the reasoning ability of DeepSeek-R1, open knowledge-enhanced reranked recommendation lists are obtained that better fit the current context.

4.2.3. Ablation Study (RQ4)

Finally, regarding the role of the hybrid fusion strategy in the TLSR model, this paper designs ablation experiments [68,69] to verify the effectiveness of modality fusion, layer fusion, and combination fusion. The specific evaluation results after ablation are shown in Table 9, Figure 11, Table 10, and Table 11.
(1). Modality ablation: By removing the image modality from the food features, an “image modality ablation” experiment is conducted, and the resulting model is referred to as TLSR-i. In addition, the food features corresponding to the text labels are removed to conduct a “text modality ablation” experiment, and the resulting model is referred to as TLSR-t.
(2). Convolutional layer ablation: The role of convolutional layers in the TLSR model is to convolve multiple food features in the historical behavior sequences into one feature vector through cross-temporal local feature fusion. Therefore, all Conv3d and Conv2d layers in TLSR are removed for the “layer ablation” experiment, and the resulting model is referred to as TLSR-c. In addition, to demonstrate the effectiveness of cross-temporal fusion, the convolution kernels in TLSR are replaced with those without cross-temporal effects. Specifically, the convolution kernel of Conv3d is replaced with 1 × 1 × 1 and no padding, while the convolution kernel of Conv2d is replaced with 1 × 1 and no padding. The resulting model is referred to as TLSR-k.
(3). Dual recurrence layer ablation: The role of the dual recurrence layer in the TLSR model is to explore the temporal dependencies in users’ long-term contexts in order to achieve dynamic feature fusion. Therefore, the dual recurrence layer in TLSR is removed for the “layer ablation” experiment, and the resulting model is referred to as TLSR-r. In addition, to demonstrate that dual recurrence fuses features better than single recurrence, the inter-week recurrence is removed and only the intra-week recurrence is retained; specifically, the input time sequence is arranged only by day. The resulting model is referred to as TLSR-g.
(4). Combination ablation: When gradually integrating the embedding features of the user ID, concatenation combines and fuses the features while preserving the independence between user features and candidate food features. Therefore, the concatenation is replaced with a Hadamard product for a “combination ablation” experiment, and the resulting model is referred to as TLSR-u.
According to the ablation results, the image modality ablation deprives the model of important food features, resulting in a significant decrease in prediction and recommendation performance. This means that images, by visually displaying food content, have a significant impact on increasing the purchase probability. The text modality ablation yields excellent precision, but its recall drops significantly, hurting recommendation performance; the other metrics likewise show that its overall performance is inferior to TLSR. Moreover, replacing the convolutional kernels or using only single recurrence degrades model performance, while completely removing the convolutional and dual recurrence layers causes a significant decline. This means that both the cross-temporal local feature fusion of the convolutional layer and the dynamic feature fusion of the dual recurrence layer are indispensable in the TLSR model. In addition, the Hadamard product has commonly been used to combine user and item representations in previous studies. However, the ablation comparison shows that this method may undermine the independence between features, thereby weakening the model’s combination fusion. At the same time, this indicates that the concatenation-based combination fusion used in this article yields better results.

4.2.4. Sensitivity Analysis

Finally, this article conducts a sensitivity analysis on the hyperparameters of TLSR, namely the convolution kernel size S, the number of GRU layers L, and the number of heads H in the multi-head cross attention. The recommendation performance of different combinations of these parameters is shown in Table 12 and Table 13. Among them, (S = 3, L = 2, H = 4) is the parameter combination used in the preceding main experiments. When S = 5, the convolution kernel size of Conv3d in TLSR is $5 \times 5 \times 5$ and that of Conv2d is $5 \times 5$, with padding of (2, 2, 2) and (2, 2), respectively. When S = 1, the configuration is consistent with TLSR-k in the ablation study. When L = 1, each GRU stack contains only one layer, and no dropout is applied between layers.
Across the different combinations of recommendation performance metrics, the sensitivity of the TLSR model to its hyperparameters follows the order S > L > H. Cross-temporal local feature fusion has the greatest impact, followed by dynamic feature fusion, while the number of heads in the multi-head cross attention has the smallest impact.

5. Discussion

The experimental results demonstrate that the TLSR model achieves strong predictive effectiveness and recommendation performance. The comparative study confirms its superiority over baseline models, largely due to its effective integration of multimodal information and sequential patterns.
The ablation study reveals the critical role of the image modality, as its removal leads to consistent performance drops across all metrics. The text modality also contributes, particularly by improving precision, though sometimes at the cost of reduced recall—highlighting a trade-off between recommendation specificity and coverage. Each architectural component, including the convolution, recurrence, and cross attention mechanisms, proves essential, with any omission resulting in notable performance degradation.
Despite these strengths, the sensitivity analysis indicates that TLSR's performance depends strongly on its key hyperparameters, revealing limited robustness under varying configurations, an issue that could affect stability in real-world applications.
Moreover, the current framework does not account for external factors such as weather, promotions, and delivery time, which are known to influence user behavior. Incorporating such contextual signals could further enhance recommendation relevance.
Finally, while an LLM-based reranking mechanism is included, its integration remains basic. Future work should explore more adaptive strategies—such as dynamic weighting or semantic-aware fusion—to better leverage the capabilities of Large Language Models and improve overall recommendation quality.
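As an illustration of what a tighter integration could build on, the following is a minimal sketch of an LLM reranking call, assuming DeepSeek's OpenAI-compatible API; the prompt wording, context string, and output parsing are illustrative and not the exact prompt used in this study:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

def rerank(candidate_ids, context):
    """Ask DeepSeek-R1 to rerank a Top-K candidate list for the given context."""
    prompt = (
        f"Current context: {context}\n"
        f"Candidate food IDs ranked by the base recommender: {candidate_ids}\n"
        "Rerank these IDs so that the list best fits the current context. "
        "Return only a comma-separated list of the IDs."
    )
    resp = client.chat.completions.create(
        model="deepseek-reasoner",   # DeepSeek-R1
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    # Naive parsing; a production system would validate the returned IDs.
    return [int(tok) for tok in answer.replace(" ", "").split(",")]

# Illustrative call with the case-4 candidates from Table 7
reranked = rerank(
    [2548, 14459, 9198, 8502, 14936, 15237, 9831, 15377, 997, 16846],
    "Children's Day, around lunch time",
)
```

A dynamic-weighting variant would blend the LLM ordering with the TLSR scores instead of replacing the order outright, for example by interpolating the two rank positions per item.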

6. Conclusions

Motivated by the context-driven characteristics of takeout consumption, this article first proposes a Time-Aware Long- and Short-Term Recommendation (TLSR) model based on a hybrid fusion strategy. TLSR treats recommendation as a binary prediction problem and integrates personalized contextual states into user preferences through sequence modeling, thereby mining users’ implicit contexts from multimodal explicit contexts. Its behavior extraction layer performs early fusion of explicit contextual information in multimodal data; the context capture layer performs cross-temporal local feature fusion of long- and short-term preferences to capture implicit contextual states; the context propagation layer achieves dynamic feature fusion by capturing temporal dependencies in the long-term context; and the prediction layer uses cross attention to let contextual states interact with candidate item features. Secondly, this article conducts an empirical analysis on a real-world food delivery dataset and evaluates the model comprehensively with multiple indicators. The comparative study shows that TLSR achieves significantly higher prediction effectiveness and recommendation performance than the baseline models. The case study shows that the cross-attention weight coefficients in TLSR intuitively display the emphasis and balance between users’ long- and short-term contexts, demonstrating that TLSR can learn implicit contexts. Combined with the reasoning ability and open-knowledge enhancement of DeepSeek-R1, the reranked recommendation list better matches the current context. Finally, the ablation study indicates that modality fusion, layer fusion, and combination fusion are all indispensable in TLSR, and each has a significant impact on its prediction effectiveness and recommendation performance.

Author Contributions

Conceptualization, W.D. and D.H.; methodology, W.D. and D.H.; writing—original draft preparation, W.D. and D.H.; writing—review and editing, Z.J., P.Z. and Y.S.; supervision, W.D. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 72231010, 71932008, and 72362004), the Guizhou Provincial Science and Technology Projects (grant numbers qiankehejichu[2024]Youth183, qiankehejichu-ZK[2022]yiban019 and qiankehezhichengDXGA[2025]yiban014), and the Research Start-up Project for Recruited Talents of Guizhou University of Finance and Economics [2022] (grant number 2020YJ045).

Data Availability Statement

The dataset presented in this paper cannot be freely accessed because it was obtained from the Life Plus platform of Guizhou Sunshine HaiNa Eco-Agriculature Co., Ltd. (Xingyi, China), and the company does not allow authors to share its dataset.

Acknowledgments

The authors acknowledge the data support provided by the Life Plus platform for this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SR: Sequential Recommendation
CARS: Context-Aware Recommendation System
TLSR: Time-Aware Long- and Short-Term Recommendation
LLM: Large Language Model
CNN: Convolutional Neural Network
Conv3d: Three-Dimensional Convolution
Conv2d: Two-Dimensional Convolution
GRU: Gated Recurrent Unit
BCE: Binary Cross-Entropy
RQ: Research Question
OFD: On-demand Food Delivery
RGB: Red, Green, Blue
Ap: Average Precision
AUC: Area Under Curve
ROC: Receiver Operating Characteristic
HR: Hit Ratio
MRR: Mean Reciprocal Rank
NDCG: Normalized Discounted Cumulative Gain

Appendix A

Appendix A.1

Algorithm A1: The Training Process of TLSR
Input: Image feature matrix extracted by ResNet50; user behavior matrix.
Output: Optimized model parameters θ_best.
1:  Initialize θ; θ_best ← θ; Logloss_val_min ← +∞
2:  for epoch = 1 to E do // E = 30
3:      Logloss_train ← 0; Logloss_val ← 0
    // Training (gradient backpropagation). K_train is the number of training batches.
4:      for k = 1 to K_train do
5:          (X_train_k, y_train_k) ← TrainLoader(k)
6:          p_train_k ← TLSR_θ(X_train_k, …)
7:          Logloss_train ← BCE(p_train_k, y_train_k)
8:          θ ← Adam(θ, ∇θ Logloss_train, η) // learning rate η = 0.0001
9:      end for
    // Validation (no gradient backpropagation). M_val is the number of validation batches.
10:     for m = 1 to M_val do
11:         (X_val_m, y_val_m) ← ValLoader(m)
12:         p_val_m ← TLSR_θ(X_val_m, …)
13:         Logloss_val ← Logloss_val + BCE(p_val_m, y_val_m)
14:     end for
15:     Logloss_val ← Logloss_val / M_val
    // Model selection
16:     if Logloss_val < Logloss_val_min then
17:         Logloss_val_min ← Logloss_val
18:         θ_best ← θ
19:     end if
20: end for
21: return θ_best
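For readers who prefer code to pseudocode, the following is a compact PyTorch rendering of Algorithm A1; the model, the data loaders, and the input unpacking are placeholders, since the exact interfaces of the implementation are not reproduced here:

```python
import copy
import math
import torch
import torch.nn as nn

def train_tlsr(model, train_loader, val_loader, epochs=30, lr=1e-4, device="cuda"):
    """Algorithm A1: Adam + BCE training, keeping the parameters
    with the lowest validation logloss."""
    criterion = nn.BCELoss()    # the network ends in a Sigmoid, so plain BCE applies
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_state, best_val = None, math.inf

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                  # lines 4-9: gradient backpropagation
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(-1), y.float())
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():                      # lines 10-15: no backpropagation
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x).squeeze(-1), y.float()).item()
                n_batches += 1
        val_loss /= n_batches

        if val_loss < best_val:                    # lines 16-19: model selection
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)              # line 21: return the best parameters
    return model
```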

Appendix A.2

A summary of the TLSR model architecture is shown in Table A1, which illustrates the output shapes of each component and corresponds to Figure 1.
Table A1. Summary of TLSR.
Layer | Component | Output Shape
Behavior extraction layer | Multimodal feature embedding and concatenation | (100, 4, 7, 48, 2110); (100, 7, 48, 2110); (100, 1, 2110)
Behavior extraction layer | Dimension permutation | (100, 48, 4, 7, 2110); (100, 48, 7, 2110)
Context capture layer | Conv3d | (100, 1, 4, 7, 2110)
Context capture layer | Conv2d | (100, 1, 7, 2110)
Context capture layer | Maxpool-1, Maxpool-2, Maxpool-3, Maxpool-4 | (100, 7, 1055)
Context capture layer | Maxpool-5 | (100, 7, 422)
Context propagation layer | GRUs-1, GRUs-2, GRUs-3, GRUs-4 | (100, 7, 200)
Context propagation layer | GRUs-5 | (100, 4, 400)
Prediction layer | Multi-head cross attention | (100, 1, 2110)
Prediction layer | Single-head cross attention | (100, 1, 2110)
Prediction layer | LayerNorm-1 | (100, 4, 400)
Prediction layer | LayerNorm-2 | (100, 7, 422)
Prediction layer | LayerNorm-3 | (100, 1, 2110)
Prediction layer | LayerNorm-4 | (100, 4220)
Prediction layer | Embedding | (100, 100)
Prediction layer | Linear | (100, 1)
Prediction layer | Sigmoid | (100, 1)
Finally, the time complexity of TLSR is analyzed to be O(DHW · S^3 · C_in · C_out + n·d^2 + n^2·d), where D, H, and W denote the depth, height, and width of the convolution input, S the convolution kernel size, C_in and C_out the numbers of input and output channels, n the sequence length, and d the feature dimension. Training TLSR with the optimal parameter combination for 30 epochs took 68,885.58 s (about 19.1 h).

Appendix B

Appendix B.1

Regarding food ID = 24238 in case 1, shown in Figure A1, the long-term context focuses more on the features centered around “W18” and “W21” after local convolution, while the short-term context focuses more on the features of the current week (W22) locally convolved with “Thursday” as the core. This indicates a high correlation between food ID = 24238 and these features.
Through contextual retrospection, it is found that timestamp = 1717099526 in case 1 falls in the 20th hour of the day, and the user had purchased takeout food around 20:00 in the historical behavior sequences of both W18 and W21. On Thursday of W22, around 19:00 (close to 20:00), the user had also purchased takeout food. It follows that the consumption behavior of user ID = 3 is driven by the time context, and the TLSR model learns this implicit context through features such as time intervals.
Figure A1. The focus and balance of the contextual state in case 1 (time context-driven).
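The retrospection in this appendix reads the cross-attention weights directly. As a minimal sketch (the weight values and shape are placeholders, not outputs of the released model), the focused weeks can be extracted from a single sample’s attention row as follows:

```python
import torch

# Cross-attention weights of one sample: the candidate food (one query)
# attending over the four long-term weekly context states W18-W21.
attn = torch.tensor([0.41, 0.07, 0.09, 0.43])   # placeholder values summing to 1

weeks = ["W18", "W19", "W20", "W21"]
top = attn.topk(2)
for weight, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{weeks[idx]}: attention weight {weight:.2f}")
# -> W21: 0.43 and W18: 0.41, the two focused weeks in case 1
```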

Appendix B.2

Regarding food ID = 9516 in case 2, shown in Figure A2, the long-term context focuses more on the features centered around “W18” and “W21” after local convolution, while the short-term context focuses more on the features of the current week (W22) locally convolved with “Wednesday” as the core. This indicates a high correlation between food ID = 9516 and these features.
Figure A2. The focus and balance of the contextual state in case 2 (family context-driven).
Through contextual retrospection, it is found that the name of food ID = 9516 indicates a children’s package, and the user had also purchased a children’s package in the historical behavior sequences of W18. This suggests that there may be children among the family members of user ID = 185, so the consumption behavior is driven by the family context, which the TLSR model learns through features such as the image modality.

Appendix B.3

Regarding food ID = 16901 in case 3, shown in Figure A3, the long-term context focuses more on the features centered around “W18” and “W21” after local convolution, while the short-term context focuses more on the features of the current week (W22) locally convolved with “Thursday” as the core. This indicates a high correlation between food ID = 16901 and these features.
Figure A3. The focus and balance of the contextual state in case 3 (target context-driven).
Through contextual retrospection, it is found that the name of food ID = 16901 indicates a healthy, low-fat item, and the user had also purchased such low-fat food in the historical behavior sequences of Wednesday and Thursday in W22. This suggests that user ID = 405 was pursuing fat reduction and a healthy lifestyle at timestamp = 1717111357, so the consumption behavior is driven by the target context, which the TLSR model learns through features such as the image modality and the affiliated merchant.

References

  1. Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q.Z.; Orgun, M. Sequential Recommender Systems: Challenges, Progress and Prospects. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; International Joint Conferences on Artificial Intelligence Organization: Montreal, QC, Canada, 2019; pp. 6332–6338. [Google Scholar]
  2. Thøgersen, J. The Importance of the Context. In Concise Introduction to Sustainable Consumption; Edward Elgar Publishing: Cheltenham, UK, 2023; pp. 76–87. [Google Scholar]
  3. Dey, A.K. Understanding and Using Context. Pers. Ubiquitous Comput. 2001, 5, 4–7. [Google Scholar] [CrossRef]
  4. Sanne, C. Willing Consumers—Or Locked-in? Policies for a Sustainable Consumption. Ecol. Econ. 2002, 42, 273–287. [Google Scholar] [CrossRef]
  5. Yao, W.; He, J.; Huang, G.; Cao, J.; Zhang, Y. A Graph-Based Model for Context-Aware Recommendation Using Implicit Feedback Data. World Wide Web 2015, 18, 1351–1371. [Google Scholar] [CrossRef]
  6. Hartatik; Winarko, E.; Heryawan, L. Context-Aware Recommendation System Survey: Recommendation When Adding Contextual Information. In Proceedings of the 2022 6th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE), Yogyakarta, Indonesia, 13–14 December 2022; pp. 7–13. [Google Scholar]
  7. Yuen, M.-C.; King, I.; Leung, K.-S. Temporal Context-Aware Task Recommendation in Crowdsourcing Systems. Knowl.-Based Syst. 2021, 219, 106770. [Google Scholar] [CrossRef]
  8. Ma, Y.; Narayanaswamy, B.; Lin, H.; Ding, H. Temporal-Contextual Recommendation in Real-Time. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 23–27 August 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 2291–2299. [Google Scholar]
  9. Wang, Z.; Wei, W.; Zou, D.; Liu, Y.; Li, X.-L.; Mao, X.-L.; Qiu, M. Exploring Global Information for Session-Based Recommendation. Pattern Recognit. 2024, 145, 109911. [Google Scholar] [CrossRef]
  10. Kumar, C.; Kumar, M. Session-Based Recommendations with Sequential Context Using Attention-Driven LSTM. Comput. Electr. Eng. 2024, 115, 109138. [Google Scholar] [CrossRef]
  11. Adomavicius, G.; Tuzhilin, A. Context-Aware Recommender Systems. In Recommender Systems Handbook; Ricci, F., Rokach, L., Shapira, B., Kantor, P.B., Eds.; Springer US: Boston, MA, USA, 2011; pp. 217–253. [Google Scholar]
  12. Meng, X.; Du, Y.; Zhang, Y.; Han, X. A Survey of Context-Aware Recommender Systems: From an Evaluation Perspective. IEEE Trans. Knowl. Data Eng. 2023, 35, 6575–6594. [Google Scholar] [CrossRef]
  13. Dourish, P. What We Talk about When We Talk about Context. Pers. Ubiquitous Comput. 2004, 8, 19–30. [Google Scholar] [CrossRef]
  14. Pagano, R.; Cremonesi, P.; Larson, M.; Hidasi, B.; Tikk, D.; Karatzoglou, A.; Quadrana, M. The Contextual Turn: From Context-Aware to Context-Driven Recommender Systems. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 249–252. [Google Scholar]
  15. Nimchaiyanan, W. A Hotel Hybrid Recommendation Method Based On Context-Driven Using Latent Dirichlet Allocation. Master’s Thesis, Chulalongkorn University, Bangkok, Thailand, 2018. [Google Scholar]
  16. Zhou, G.; Mou, N.; Fan, Y.; Pi, Q.; Bian, W.; Zhou, C.; Zhu, X.; Gai, K. Deep Interest Evolution Network for Click-Through Rate Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 17 July 2019; Volume 33, pp. 5941–5948. [Google Scholar]
  17. Xue, Z.; Marculescu, R. Dynamic Multimodal Fusion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2575–2584. [Google Scholar]
  18. Zhao, F.; Zhang, C.; Geng, B. Deep Multimodal Data Fusion. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  19. Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal Deep Learning for Biomedical Data Fusion: A Review. Brief. Bioinform. 2022, 23, bbab569. [Google Scholar] [CrossRef]
  20. Dai, Y.; Yan, Z.; Cheng, J.; Duan, X.; Wang, G. Analysis of Multimodal Data Fusion from an Information Theory Perspective. Inf. Sci. 2023, 623, 164–183. [Google Scholar] [CrossRef]
  21. Li, X.; Li, X.; Ye, T.; Cheng, X.; Liu, W.; Tan, H. Bridging the Gap between Multi-Focus and Multi-Modal: A Focused Integration Framework for Multi-Modal Image Fusion. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1617–1626. [Google Scholar]
  22. Krishna, G.; Dharur, S.; Rudovic, O.; Dighe, P.; Adya, S.; Abdelaziz, A.H.; Tewfik, A.H. Modality Drop-Out for Multimodal Device Directed Speech Detection Using Verbal and Non-Verbal Features. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 8240–8244. [Google Scholar]
  23. Li, X.; Guo, X.; Han, P.; Wang, X.; Li, H.; Luo, T. Laplacian Redecomposition for Multimodal Medical Image Fusion. IEEE Trans. Instrum. Meas. 2020, 69, 6880–6890. [Google Scholar] [CrossRef]
  24. Zhao, L.; Hu, Q.; Li, X.; Zhao, J. Multimodal Fusion Generative Adversarial Network for Image Synthesis. IEEE Signal Process. Lett. 2024, 31, 1865–1869. [Google Scholar] [CrossRef]
  25. Zhu, P.; Hua, J.; Tang, K.; Tian, J.; Xu, J.; Cui, X. Multimodal Fake News Detection through Intra-Modality Feature Aggregation and Inter-Modality Semantic Fusion. Complex Intell. Syst. 2024, 10, 5851–5863. [Google Scholar] [CrossRef]
  26. Gan, C.; Fu, X.; Feng, Q.; Zhu, Q.; Cao, Y.; Zhu, Y. A Multimodal Fusion Network with Attention Mechanisms for Visual–Textual Sentiment Analysis. Expert Syst. Appl. 2024, 242, 122731. [Google Scholar] [CrossRef]
  27. Yu, M.; He, W.; Zhou, X.; Cui, M.; Wu, K.; Zhou, W. Review of Recommendation System. J. Comput. Appl. 2022, 42, 1898–1913. Available online: https://www.joca.cn/CN/10.11772/j.issn.1001-9081.2021040607 (accessed on 27 July 2025). (In Chinese).
  28. Su, X.; Khoshgoftaar, T.M. A Survey of Collaborative Filtering Techniques. Adv. Artif. Intell. 2009, 2009, 421425. [Google Scholar] [CrossRef]
  29. Richardson, M.; Dominowska, E.; Ragno, R. Predicting Clicks: Estimating the Click-through Rate for New Ads. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; Association for Computing Machinery: New York, NY, USA, 2007; pp. 521–530. [Google Scholar]
  30. Rendle, S. Factorization Machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, NSW, Australia, 13–17 December 2010; pp. 995–1000. [Google Scholar]
  31. Wang, R.; Fu, B.; Fu, G.; Wang, M. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, 14 August 2017; ACM: Halifax, NS, Canada, 2017; pp. 1–7. [Google Scholar]
  32. Wu, S.; Sun, F.; Zhang, W.; Xie, X.; Cui, B. Graph Neural Networks in Recommender Systems: A Survey. ACM Comput. Surv. 2022, 55, 1–37. [Google Scholar] [CrossRef]
  33. Gao, G. Review of Research on Neural Network Combined with Attention Mechanism in Recommendation System. Comput. Eng. Appl. 2024, 60, 47–60. (In Chinese) [Google Scholar] [CrossRef]
  34. Rendle, S.; Freudenthaler, C.; Schmidt-Thieme, L. Factorizing Personalized Markov Chains for Next-Basket Recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 811–820. [Google Scholar]
  35. Kang, W.-C.; McAuley, J. Self-Attentive Sequential Recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; IEEE: Singapore, 2018; pp. 197–206. [Google Scholar]
  36. Li, J.; Wang, Y.; McAuley, J. Time Interval Aware Self-Attention for Sequential Recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 322–330. [Google Scholar]
  37. Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; Gai, K. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1059–1068. [Google Scholar]
  38. Han, R.; Li, Q.; Jiang, H.; Li, R.; Zhao, Y.; Li, X.; Lin, W. Enhancing CTR Prediction through Sequential Recommendation Pre-Training: Introducing the SRP4CTR Framework. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 3777–3781. [Google Scholar]
  39. Wang, H.; Li, Z.; Liu, X.; Ding, D.; Hu, Z.; Zhang, P.; Zhou, C.; Bu, J. Fulfillment-Time-Aware Personalized Ranking for On-Demand Food Recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, QLD, Australia, 1–5 November 2021; ACM: New York, NY, USA, 2021; pp. 4184–4192. [Google Scholar]
  40. Wu, G.; Qin, H.; Hu, Q.; Wang, X.; Wu, Z. Research on Large Language Models and Personalized Recommendation. CAAI Trans. Intell. Syst. 2024, 19, 1351–1365. (In Chinese) [Google Scholar] [CrossRef]
  41. Dai, S.; Shao, N.; Zhao, H.; Yu, W.; Si, Z.; Xu, C.; Sun, Z.; Zhang, X.; Xu, J. Uncovering ChatGPT’s Capabilities in Recommender Systems. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1126–1132. [Google Scholar]
  42. Du, Y.; Luo, D.; Yan, R.; Wang, X.; Liu, H.; Zhu, H.; Song, Y.; Zhang, J. Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks. Proc. AAAI Conf. Artif. Intell. 2024, 38, 8363–8371. [Google Scholar] [CrossRef]
  43. Wang, X.; Chen, H.; Pan, Z.; Zhou, Y.; Guan, C.; Sun, L.; Zhu, W. Automated Disentangled Sequential Recommendation with Large Language Models. ACM Trans. Inf. Syst. 2025, 43, 1–29. [Google Scholar] [CrossRef]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  45. Xu, D.; Ruan, C.; Korpeoglu, E.; Kumar, S.; Achan, K. Inductive Representation Learning on Temporal Graphs. arXiv 2020, arXiv:2002.07962. [Google Scholar]
  46. Zhang, T.; Zhang, J.; Guo, C.; Chen, H.; Zhou, D.; Wang, Y.; Xu, A. A Survey of Image Object Detection Algorithm Based on Deep Learning. Telecommun. Sci. 2020, 36, 92–106. (In Chinese) [Google Scholar]
  47. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  49. Hu, D.; Deng, W.; Jiang, Z.; Shi, Y. A Study on Predicting Key Times in the Takeout System’s Order Fulfillment Process. Systems 2025, 13, 457. [Google Scholar] [CrossRef]
  50. DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  51. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  52. Briney, K.A. The Problem with Dates: Applying ISO 8601 to Research Data Management. J. eScience Librariansh. 2018, 7, e1147. [Google Scholar] [CrossRef]
  53. Guizhou Sunshine HaiNa Eco-Agriculature Co., Ltd. Available online: https://www.yangguanghaina.com/ (accessed on 13 June 2025). (In Chinese).
  54. Kweon, W.; Kang, S.; Jang, S.; Yu, H. Top-Personalized-K Recommendation. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 3388–3399. [Google Scholar]
  55. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  56. Ong, K.; Haw, S.-C.; Ng, K.-W. Deep Learning Based-Recommendation System: An Overview on Models, Datasets, Evaluation Metrics, and Future Trends. In Proceedings of the 2019 2nd International Conference on Computational Intelligence and Intelligent Systems, Bangkok, Thailand, 23–25 November 2019; Association for Computing Machinery: New York, NY, USA, 2020; pp. 6–11. [Google Scholar]
  57. Li, M.; Ma, W.; Chu, Z. KGIE: Knowledge Graph Convolutional Network for Recommender System with Interactive Embedding. Knowl.-Based Syst. 2024, 295, 111813. [Google Scholar] [CrossRef]
  58. Zhang, Y.; Wang, Z.; Yu, W.; Hu, L.; Jiang, P.; Gai, K.; Chen, X. Soft Contrastive Sequential Recommendation. ACM Trans. Inf. Syst. 2024, 42, 1–28. [Google Scholar] [CrossRef]
  59. Valcarce, D.; Bellogín, A.; Parapar, J.; Castells, P. Assessing Ranking Metrics in Top-N Recommendation. Inf. Retr. J. 2020, 23, 411–448. [Google Scholar] [CrossRef]
  60. Wang, R.; Shivanna, R.; Cheng, D.; Jain, S.; Lin, D.; Hong, L.; Chi, E. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; ACM: Ljubljana, Slovenia, 2021; pp. 1785–1797. [Google Scholar]
  61. Tan, Y.K.; Xu, X.; Liu, Y. Improved Recurrent Neural Networks for Session-Based Recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 17–22. [Google Scholar]
  62. Liu, Q.; Zeng, Y.; Mokhosi, R.; Zhang, H. STAMP: Short-Term Attention/Memory Priority Model for Session-Based Recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1831–1839. [Google Scholar]
  63. Hou, Y.; Hu, B.; Zhang, Z.; Zhao, W.X. CORE: Simple and Effective Session-Based Recommendation within Consistent Representation Space. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1796–1801. [Google Scholar]
  64. Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; Ma, J. Neural Attentive Session-Based Recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1419–1428. [Google Scholar]
  65. Nguyen, T.; Takasu, A. NPE: Neural Personalized Embedding for Collaborative Filtering. arXiv 2018, arXiv:1805.06563. [Google Scholar]
  66. Zhao, W.X.; Mu, S.; Hou, Y.; Lin, Z.; Chen, Y.; Pan, X.; Li, K.; Lu, Y.; Wang, H.; Tian, C.; et al. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event Queensland, Australia, 1–5 November 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4653–4664. [Google Scholar]
  67. Zhao, W.X.; Hou, Y.; Pan, X.; Yang, C.; Zhang, Z.; Lin, Z.; Zhang, J.; Bian, S.; Tang, J.; Sun, W.; et al. RecBole 2.0: Towards a More Up-to-Date Recommendation Library. arXiv 2022, arXiv:2206.07351. [Google Scholar]
  68. Liu, C.; Li, Y.; Lin, H.; Zhang, C. GNNRec: Gated Graph Neural Network for Session-Based Social Recommendation Model. J. Intell. Inf. Syst. 2023, 60, 137–156. [Google Scholar] [CrossRef]
  69. Xuan, H.; Liu, Y.; Li, B.; Yin, H. Knowledge Enhancement for Contrastive Multi-Behavior Recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 195–203. [Google Scholar]
Figure 1. TLSR model architecture.
Figure 2. Illustration of the convolution operation. (a) Three-dimensional convolution when S = 3; (b) Two-dimensional convolution when S = 3.
Figure 3. Illustration of the dual recurrence.
Figure 4. User distribution visualization.
Figure 5. Visualization of the distribution of maximum consecutive days.
Figure 6. Grid search.
Figure 7. Comparison of models for the AUC metric.
Figure 8. The focus and balance of the contextual state in case 4 (festival context-driven). (a) The focus heat map for long-term context; (b) The focus heat map for short-term context; (c) The balance heat map for long- and short-term contexts.
Figure 9. DeepSeek-R1 context reasoning.
Figure 10. Recommendation pipelines consisting of TLSR and DeepSeek-R1.
Figure 11. Comparison of models for the AUC metric after ablation.
Table 1. Comparison of models for prediction effectiveness.
Models | Accuracy | Precision | Recall | F1 | Ap
TLSR | 0.9776 | 0.8221 | 0.6749 | 0.7412 | 0.7951
LR | 0.9538 | 0.5976 | 0.0926 | 0.1604 | 0.2714
DCNv2 | 0.4967 | 0.0478 | 0.5057 | 0.0873 | 0.0477
DIN | 0.9533 | 0.5632 | 0.0881 | 0.1524 | 0.2463
GRU4Rec | 0.9224 | 0.2011 | 0.2122 | 0.2065 | 0.1405
SASRec | 0.9537 | 0.5822 | 0.0947 | 0.1629 | 0.2679
STAMP | 0.7531 | 0.0637 | 0.3057 | 0.1055 | 0.0751
CORE | 0.9614 | 0.7953 | 0.2560 | 0.3873 | 0.4484
NARM | 0.9440 | 0.3888 | 0.3070 | 0.3431 | 0.2786
NPE | 0.9588 | 0.7828 | 0.1854 | 0.2998 | 0.3239
The bold numbers indicate the optimal performance for the corresponding metrics.
Table 2. Comparison of models for the HR@K metric.
Models | HR_Food@5 | HR_Food@10 | HR_Sell@5 | HR_Sell@10
TLSR | 0.8578 | 0.9290 | 0.9398 | 0.9700
LR | 0.5163 | 0.6597 | 0.6769 | 0.8003
DCNv2 | 0.1374 | 0.2812 | 0.2277 | 0.4375
DIN | 0.5018 | 0.6456 | 0.6686 | 0.7985
GRU4Rec | 0.4138 | 0.5521 | 0.5654 | 0.7203
SASRec | 0.5085 | 0.6501 | 0.6754 | 0.7964
STAMP | 0.3938 | 0.5257 | 0.5239 | 0.6668
CORE | 0.5983 | 0.7068 | 0.7513 | 0.8370
NARM | 0.5154 | 0.6352 | 0.6758 | 0.7894
NPE | 0.5871 | 0.7046 | 0.7434 | 0.8363
The bold numbers indicate the optimal performance for the corresponding metrics.
Table 3. Comparison of models for the MRR@K and NDCG@K metrics.
Models | MRR_Sell@5 | MRR_Sell@10 | NDCG_Sell@5 | NDCG_Sell@10
TLSR | 0.8730 | 0.8771 | 0.8898 | 0.8997
LR | 0.4671 | 0.4839 | 0.5193 | 0.5596
DCNv2 | 0.1055 | 0.1328 | 0.1355 | 0.2026
DIN | 0.4569 | 0.4746 | 0.5096 | 0.5520
GRU4Rec | 0.3670 | 0.3875 | 0.4163 | 0.4662
SASRec | 0.4639 | 0.4801 | 0.5166 | 0.5558
STAMP | 0.3412 | 0.3601 | 0.3866 | 0.4326
CORE | 0.6043 | 0.6160 | 0.6410 | 0.6689
NARM | 0.5103 | 0.5256 | 0.5517 | 0.5885
NPE | 0.5736 | 0.5862 | 0.6160 | 0.6463
The bold numbers indicate the optimal performance for the corresponding metrics.
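For reference, the single-ground-truth forms of HR@K, MRR@K, and NDCG@K reported in Tables 2 and 3 can be computed as follows; this is a standard formulation rather than the exact evaluation code of this study:

```python
import math

def rank_metrics(ranked_ids, true_id, k):
    """HR@K, MRR@K, and NDCG@K for one query with a single relevant item."""
    top_k = ranked_ids[:k]
    if true_id not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(true_id) + 1                       # 1-based position of the hit
    return 1.0, 1.0 / rank, 1.0 / math.log2(rank + 1)     # ideal DCG is 1

# Example with the case-1 food list from Table 6: the true food 24238 is at rank 2
print(rank_metrics([24200, 24238, 24215, 24241, 24243], 24238, k=5))
# -> (1.0, 0.5, 0.6309297535714575)
```

Dataset-level values are obtained by averaging these per-query scores over all test interactions.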
Table 4. Wilcoxon test results.
Paired Sample | W-Statistic | p-Value
AE_TLSR and AE_LR | 5,742,352,616.0 | 0.0000
AE_TLSR and AE_DCNv2 | 11,161,300,372.0 | 0.0000
AE_TLSR and AE_DIN | 6,293,013,366.5 | 0.0000
AE_TLSR and AE_GRU4Rec | 7,308,931,681.0 | 0.0000
AE_TLSR and AE_SASRec | 6,979,763,575.0 | 0.0000
AE_TLSR and AE_STAMP | 5,474,380,013.0 | 0.0000
AE_TLSR and AE_CORE | 15,083,317,753.0 | 0.0000
AE_TLSR and AE_NARM | 14,264,308,243.0 | 0.0000
AE_TLSR and AE_NPE | 19,812,001,285.5 | 0.0000
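The test above pairs the per-sample absolute errors of TLSR with those of each baseline. A minimal SciPy sketch, using synthetic placeholder predictions since the real model outputs are not reproduced here:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000).astype(float)        # placeholder labels
p_tlsr = np.clip(y_true + rng.normal(0, 0.1, 10_000), 0, 1)   # placeholder predictions
p_lr   = np.clip(y_true + rng.normal(0, 0.3, 10_000), 0, 1)

ae_tlsr = np.abs(p_tlsr - y_true)   # per-sample absolute errors
ae_lr   = np.abs(p_lr - y_true)

stat, p_value = wilcoxon(ae_tlsr, ae_lr)   # paired two-sided signed-rank test
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```

A p-value of 0.0000 (to four decimals) for every pair in Table 4 indicates that the error differences between TLSR and each baseline are statistically significant.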
Table 5. Comparison of computational overhead.
Models | Total Params | GPU Memory
TLSR | 11,620,449 | 7691 MiB
LR | 3,229,804 | 669 MiB
DCNv2 | 3,256,780 | 693 MiB
DIN | 32,258,483 | 8503 MiB
GRU4Rec | 6,333,096 | 2417 MiB
SASRec | 6,560,596 | 18,283 MiB
STAMP | 6,340,896 | 1443 MiB
CORE | 6,560,697 | 18,329 MiB
NARM | 6,343,046 | 2413 MiB
NPE | 9,520,096 | 1447 MiB
Table 6. Top-10 recommendation list cases of the TLSR model.
Case | User ID | Timestamp | Top-10 Food Recommendation List | Top-10 Merchant Recommendation List
1 | 3 | 1717099526 | (24200, 24238, 24215, 24241, 24243, 24239, 24267, 8745, 24902, 23684) | (1921, 175, 1972, 1874, 209, 927, 241, 1268, 408, 21)
2 | 185 | 1717098093 | (9516, 9166, 581, 21230, 28172, 15431, 18818, 2688, 20533, 6309) | (887, 63, 1697, 1523, 1299, 1528, 207, 1646, 2, 203)
3 | 405 | 1717111357 | (26873, 20625, 30572, 16901, 11286, 170, 22440, 2606, 3391, 2641) | (788, 25, 1785, 231, 253, 234, 457, 195, 111, 692)
4 | 423 | 1717243462 | (2548, 14459, 9198, 8502, 14936, 15237, 9831, 15377, 997, 16846) | (180, 1244, 887, 15, 269, 1287, 622, 97, 1377, 1980)
The bold numbers indicate the food ID or merchant ID that are actually purchased.
Table 7. Comparison of reranking results for case 4.
Case 4 | Old | New
Top-10 food recommendation list | (2548, 14459, 9198, 8502, 14936, 15237, 9831, 15377, 997, 16846) | (14459, 15377, 16846, 8502, 15237, 9831, 997, 2548, 9198, 14936)
Top-10 merchant recommendation list | (180, 1244, 887, 15, 269, 1287, 622, 97, 1377, 1980) | (1244, 622, 1377, 15, 1287, 887, 97, 180, 269, 1980)
The bold numbers indicate the food ID or merchant ID that are actually purchased.
Table 8. Comparison of reranking results in the context of Children’s Day.
Metrics | Old | New
HR_Food@5 | 0.8599 | 0.4539
HR_Food@10 | 0.9226 | 0.9226
HR_Sell@5 | 0.9328 | 0.6486
HR_Sell@10 | 0.9564 | 0.9564
MRR_Sell@5 | 0.8690 | 0.3362
MRR_Sell@10 | 0.8723 | 0.3798
NDCG_Sell@5 | 0.8853 | 0.4129
NDCG_Sell@10 | 0.8930 | 0.5150
Table 9. Comparison of models for prediction effectiveness after ablation.
Models | Accuracy | Precision | Recall | F1 | Ap
TLSR | 0.9776 | 0.8221 | 0.6749 | 0.7412 | 0.7951
TLSR-i | 0.9515 | 0.4690 | 0.1432 | 0.2195 | 0.2153
TLSR-t | 0.9685 | 0.8744 | 0.3954 | 0.5446 | 0.6460
TLSR-k | 0.9649 | 0.7468 | 0.3993 | 0.5204 | 0.5374
TLSR-c | 0.9541 | 0.6947 | 0.0631 | 0.1157 | 0.2519
TLSR-g | 0.9682 | 0.7660 | 0.4786 | 0.5891 | 0.6386
TLSR-r | 0.9530 | 0.6514 | 0.0270 | 0.0519 | 0.2080
TLSR-u | 0.9589 | 0.5922 | 0.4415 | 0.5059 | 0.4732
The bold numbers indicate the optimal performance for the corresponding metrics.
Table 10. Comparison of models for the HR@K metric after ablation.
Models | HR_Food@5 | HR_Food@10 | HR_Sell@5 | HR_Sell@10
TLSR | 0.8578 | 0.9290 | 0.9398 | 0.9700
TLSR-i | 0.4767 | 0.6313 | 0.5174 | 0.6628
TLSR-t | 0.7688 | 0.8738 | 0.8609 | 0.9243
TLSR-k | 0.6340 | 0.7685 | 0.7295 | 0.8354
TLSR-c | 0.4514 | 0.6348 | 0.5460 | 0.7214
TLSR-g | 0.7798 | 0.8766 | 0.8706 | 0.9286
TLSR-r | 0.3982 | 0.5838 | 0.4817 | 0.6583
TLSR-u | 0.6467 | 0.7672 | 0.7313 | 0.8312
The bold numbers indicate the optimal performance for the corresponding metrics.
Table 11. Comparison of models for the MRR@K and NDCG@K metrics after ablation.
Models | MRR_Sell@5 | MRR_Sell@10 | NDCG_Sell@5 | NDCG_Sell@10
TLSR | 0.8730 | 0.8771 | 0.8898 | 0.8997
TLSR-i | 0.3460 | 0.3654 | 0.3885 | 0.4355
TLSR-t | 0.7350 | 0.7436 | 0.7666 | 0.7872
TLSR-k | 0.5875 | 0.6014 | 0.6228 | 0.6568
TLSR-c | 0.3493 | 0.3729 | 0.3979 | 0.4549
TLSR-g | 0.7518 | 0.7596 | 0.7816 | 0.8004
TLSR-r | 0.2924 | 0.3160 | 0.3392 | 0.3963
TLSR-u | 0.5961 | 0.6094 | 0.6298 | 0.6621
The bold numbers indicate the optimal performance for the corresponding metrics.
Table 12. Sensitivity analysis of the HR@K metric.
S | L | H | HR_Food@5 | HR_Food@10 | HR_Sell@5 | HR_Sell@10
1 | 2 | 4 | 0.6340 | 0.7685 | 0.7295 | 0.8354
3 | 1 | 4 | 0.8205 | 0.9055 | 0.9179 | 0.9597
3 | 2 | 2 | 0.7858 | 0.8833 | 0.8754 | 0.9344
3 | 2 | 4 | 0.8578 | 0.9290 | 0.9398 | 0.9700
3 | 2 | 8 | 0.8060 | 0.8917 | 0.9022 | 0.9455
3 | 3 | 4 | 0.7296 | 0.8456 | 0.8314 | 0.9046
5 | 2 | 4 | 0.8467 | 0.9207 | 0.9306 | 0.9628
The bold numbers indicate the optimal combination of key parameters and their evaluation metrics.
Table 13. Sensitivity analysis of the MRR@K and NDCG@K metrics.
S | L | H | MRR_Sell@5 | MRR_Sell@10 | NDCG_Sell@5 | NDCG_Sell@10
1 | 2 | 4 | 0.5875 | 0.6014 | 0.6228 | 0.6568
3 | 1 | 4 | 0.8269 | 0.8325 | 0.8498 | 0.8633
3 | 2 | 2 | 0.7577 | 0.7657 | 0.7872 | 0.8064
3 | 2 | 4 | 0.8730 | 0.8771 | 0.8898 | 0.8997
3 | 2 | 8 | 0.8027 | 0.8086 | 0.8276 | 0.8418
3 | 3 | 4 | 0.6900 | 0.7000 | 0.7254 | 0.7492
5 | 2 | 4 | 0.8506 | 0.8550 | 0.8708 | 0.8813
The bold numbers indicate the optimal combination of key parameters and their evaluation metrics.