TLFormer: Scalable Taylor Linear Attention in Transformer for Collaborative Filtering

Hao, Dongdong; Yu, Dongxiao; Hou, Xiaowen

doi:10.3390/electronics15040759


by Dongdong Hao ¹,², Dongxiao Yu ¹ and Xiaowen Hou ³,*

¹ Institute of Intelligent Computing, School of Computer Science and Technology, Shandong University, Qingdao 266237, China
² University of Health and Rehabilitation Sciences, Qingdao 266113, China
³ Shandong Hi-Speed Qingdao Investment Co., Ltd., Qingdao 266100, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(4), 759; https://doi.org/10.3390/electronics15040759

Submission received: 20 January 2026 / Revised: 3 February 2026 / Accepted: 4 February 2026 / Published: 11 February 2026

(This article belongs to the Special Issue Wireless Artificial Intelligent Computing Systems and Applications (WASA))


Abstract

Graph Neural Networks (GNNs) have become foundational models in recommender systems due to their ability to propagate information over user–item bipartite graphs via neighborhood aggregation. Despite their empirical success, GNNs are inherently constrained by their reliance on local connectivity, which limits their ability to capture global interaction patterns, particularly in large-scale recommendation scenarios characterized by severe data sparsity. To address these challenges, we propose the Taylor Linear attention in Transformer (TLFormer), which enhances recommendation performance by enabling global attention across all user–item pairs while preserving graph structural information. Unlike existing Transformer-based recommendation approaches that focus on local attention patterns, TLFormer introduces a novel linear attention mechanism derived from the first-order Taylor approximation, allowing efficient computation of all-pair interactions. TLFormer integrates spatial topology as positional encoding while maintaining linear complexity, effectively balancing computational efficiency with model expressiveness for large-scale recommendation scenarios. Extensive experiments across multiple datasets demonstrate that TLFormer significantly outperforms state-of-the-art methods, particularly in scenarios with sparse interactions and long-tail distributions.

Keywords: collaborative filtering; graph transformer; linear attention

1. Introduction

With the explosive growth of the Internet, recommender systems [1,2,3,4,5,6,7] have become essential tools for information filtering and discovery, with applications spanning e-commerce [8], movies [9], music platforms [10], and increasingly multimodal scenarios [11]. At their core, recommender systems aim to present users with items that match their preferences by analyzing user behavior data [9]. Collaborative filtering [12,13,14] represents one of the most effective approaches in recommender systems, using patterns in user–item interaction data to identify underlying similarities and predict user preferences. GNNs [15] have emerged as powerful models for recommendation tasks [16] through their neighbor-based message passing mechanisms, demonstrating both architectural simplicity and empirical effectiveness [17,18,19].

Despite their promising performance, GNN-based recommendation models suffer from several key limitations that restrict their effectiveness. First, their message-passing mechanism confines information propagation to local neighborhoods [12,16,17], creating an inherent bottleneck in capturing long-range dependencies between users and items. This architectural constraint [20,21] proves particularly problematic in recommendation scenarios characterized by extreme sparsity, where meaningful connections often exist beyond immediate neighborhoods. Second, GNNs tend to amplify noise [22,23] through multi-hop propagation, as irrelevant signals accumulate when traversing multiple neighborhood layers. Third, these models frequently exacerbate the popularity bias problem by over-emphasizing frequently occurring (popular) items while suppressing representations of items in the long tail [24,25]. The aggregate effect of these limitations manifests in suboptimal recommendation quality for diverse user preferences, particularly for niche interests and less popular items that comprise the majority of most item catalogs [26].

Graph Transformers (GTs) have emerged as promising alternatives to GNNs across various graph-based tasks, demonstrating superior performance in capturing complex patterns and long-range dependency of node relationships [27,28,29,30,31]. The key advantage of GT architectures stems from their attention mechanism, which computes relationships between all pairs of nodes regardless of their position in the graph structure, effectively addressing the limitations of local message passing in GNNs. However, adapting these architectures to recommender systems presents several fundamental challenges. First, computational scalability remains a critical barrier. The quadratic time and space complexity of standard attention mechanisms renders global attention practically inapplicable for large-scale recommendation datasets with millions of users and items [32,33]. This limitation is particularly severe in sparse interaction graphs, where a substantial portion of attention weights correspond to non-existent connections, resulting in computational inefficiency. While recent adaptations such as GraphGPS [27], Nodeformer [34], and Knowformer [35] have made significant strides in reducing computational complexity through linear attention mechanisms, these advances have primarily benefited generic graph learning tasks rather than recommender systems. Second, preserving graph structural information presents unique challenges in recommendation contexts. Even on smaller recommendation datasets, global attention often collapses to near-zero weights, failing to effectively capture topological structures in user–item interaction graphs. Although various structural encoding approaches, including random walk-based encodings [36], Laplacian positional encodings [37], and centrality-based spatial encodings [31], have been proposed, these methods exhibit limited scalability in large-scale recommendation scenarios. 
Unlike sequential data in Natural Language Processing [38], user–item graphs exhibit permutation invariance, making the definition of positional encodings non-trivial in recommendation contexts. Furthermore, the fundamental distinction in task objectives—where most GT architectures [39] target node or graph classification while recommender systems focus on Top-K link prediction in bipartite graphs—creates a misalignment between architectural design and recommendation-specific requirements, limiting the effectiveness of existing attention mechanisms despite their computational advantages. Although some prior work [21,40] has attempted to leverage Transformer-style architectures to mitigate the locality of message passing and better capture long-range dependencies in recommender systems, their substantial algorithmic complexity and engineering overhead have limited their scalability and broad deployment.

To address the challenges of computational complexity and long-range dependencies in recommender systems, we propose TLFormer, a deliberately simple and transparent Transformer with linear-time complexity designed for recommendation tasks. Our approach leverages a global attention mechanism that enables each user to consider all items while maintaining linear time and space complexity. To effectively handle large-scale sparse interaction data, TLFormer implements three complementary components. The Token Lookup Table component serves as the foundation, enabling efficient retrieval of representation vectors by treating users and items as tokens in the system. Building upon this representation, the structural encoding module integrates spatial positional information to capture the underlying graph structure of user–item interactions. The Linear Attention Mechanism ties these elements together through a single-layer, single-head attention model, processing the tokenized representations and structural information to learn comprehensive user and item embeddings. We conducted extensive experiments on four real-world recommendation datasets with varying scales and sparsity levels. In the large-scale Amazon-Kindle dataset with hundreds of thousands of nodes, TLFormer achieves consistent improvements over state-of-the-art (SOTA) methods, surpassing MGFormer by 9.13% on average in terms of Recall and NDCG. Similar performance gains are observed on smaller datasets, with an average 5.72% improvement over GONet on Amazon-CD. Our implementation is publicly available at https://github.com/hdd1009/TLFormer, accessed on 19 January 2026.

2. Preliminaries

2.1. Problem Formulation

Let $U = \{u_{1}, u_{2}, \dots, u_{M}\}$ and $I = \{i_{1}, i_{2}, \dots, i_{N}\}$ denote the user set and the item set, respectively, so that $M = |U|$ and $N = |I|$. Let $V = U \cup I$ represent the set of all nodes, including both users and items. The user–item interaction matrix is defined as $R \in \mathbb{R}^{M \times N}$, where $R_{ui}$ is set to 1 if user $u$ has previously interacted with item $i$, and to 0 for unobserved user–item pairs. The objective of recommendation is to predict interaction scores for unobserved user–item pairs, which indicate the likelihood that a user will interact with a specific item. Based on these predicted scores, the items with the Top-K highest scores are recommended to each user.

Typically, an encoder $f(\cdot)$ maps users and items to low-dimensional representation vectors $f(u), f(i) \in \mathbb{R}^{d}$, where $d$ denotes the dimensionality of the representation space. The encoder can be implemented using various techniques, including Matrix Factorization [41,42], Variational AutoEncoders [43], and GNNs [17,18]. Among these, GNN-based encoders [16] have gained significant attention due to their ability to leverage the graph structure in user–item interaction data. For GNNs, the representation learning process follows a layer-wise recursive update rule. At the $l$-th layer, the representation of a node $u$ is updated by aggregating the embeddings of its neighboring nodes $N_{u}$ from the $(l - 1)$-th layer and combining them with its own representation:

$f^{(l)}(u) = f_{combine}\left(f^{(l-1)}(u), f_{aggregate}\left(\{f^{(l-1)}(i) \mid i \in N_{u}\}\right)\right).$

Here, $f_{aggregate}(\cdot)$ aggregates the features of neighboring nodes, while $f_{combine}(\cdot)$ integrates the aggregated features with the node's own representation. Various designs for these functions have been proposed in the literature [16,17]. However, this layer-wise aggregation process restricts the receptive field after $l$ layers to $l$-hop neighbors, making it challenging to capture long-range dependencies or potential unobserved interactions in the graph. After $L$ layers of propagation, a readout function $f_{readout}(\cdot)$ is typically applied to generate the final node representations:

$f(u) = f_{readout}\left(\{f^{(l)}(u) \mid l = 0, \dots, L\}\right).$

Common implementations of the readout function include mean or sum operations [18].

Finally, a prediction layer is constructed to estimate the likelihood of interaction between a user $u$ and an item $i$. This is commonly achieved by computing the dot product of their final representations, $f(u)^{\top} f(i)$, which facilitates efficient retrieval in large-scale recommender systems.
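The propagate, readout, and score pipeline described above can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited models: mean aggregation, an equal-weight combine step, a mean readout, and the function names `propagate` and `recommend_top_k` are all assumptions chosen for brevity.

```python
import numpy as np

def propagate(R, E_user, E_item, num_layers):
    """Layer-wise propagation on the bipartite graph (illustrative choices only)."""
    deg_u = np.maximum(R.sum(axis=1, keepdims=True), 1.0)    # |N_u| per user
    deg_i = np.maximum(R.sum(axis=0, keepdims=True).T, 1.0)  # |N_i| per item
    users, items = [E_user], [E_item]
    for _ in range(num_layers):
        agg_u = (R @ items[-1]) / deg_u          # f_aggregate: mean over neighbors
        agg_i = (R.T @ users[-1]) / deg_i
        users.append(0.5 * (users[-1] + agg_u))  # f_combine: average with own vector
        items.append(0.5 * (items[-1] + agg_i))
    # f_readout: mean over layers 0..L
    return np.mean(users, axis=0), np.mean(items, axis=0)

def recommend_top_k(R, f_user, f_item, k):
    """Score all pairs by dot product and return Top-K unobserved items per user."""
    scores = f_user @ f_item.T
    scores = np.where(R > 0, -np.inf, scores)    # never re-recommend observed items
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
R = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
E_user = rng.normal(size=(3, 8))
E_item = rng.normal(size=(4, 8))
f_user, f_item = propagate(R, E_user, E_item, num_layers=2)
top_k = recommend_top_k(R, f_user, f_item, k=2)
```

With two layers, each user representation only mixes information from nodes at most two hops away, which is the locality constraint the text refers to.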

2.2. Transformer Architecture

The Transformer architecture [38,44,45] introduced the self-attention mechanism to process sequential data, which has since become fundamental to Large Language Models [46], computer vision [47], and multimodal scenarios [44,45,48]. The self-attention mechanism computes pairwise interactions between all positions in an input sequence, enabling the model to capture long-range dependencies. The architecture consists of two primary components: the self-attention module for computing contextual relationships and the feedforward neural network module for processing the attended representations. For graph-structured data, GTs [27,34] adapt the self-attention mechanism to operate on nodes rather than sequence positions.

Formally, let $X \in \mathbb{R}^{n \times d}$ denote the feature matrix of the input nodes in GTs, where $n$ is the number of nodes and $d$ is the dimension of the features. In the self-attention module, $X$ is projected into three matrices through linear transformations: the query matrix $Q = X W_{Q}$, the key matrix $K = X W_{K}$, and the value matrix $V = X W_{V}$. Here, $W_{Q} \in \mathbb{R}^{d \times d_{q}}$, $W_{K} \in \mathbb{R}^{d \times d_{k}}$, and $W_{V} \in \mathbb{R}^{d \times d_{v}}$ are trainable weight matrices, where $d_{q}$, $d_{k}$, and $d_{v}$ represent the dimensions of the query, key, and value, respectively. The self-attention computation is defined as follows:

$\mathrm{Attn}(X) = \mathrm{Attention}(Q, K, V) := \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$

(1)

where $\sqrt{d}$ is the scaling factor. In practice, multi-head attention is commonly used, where the outputs of multiple heads, each computed as in Equation (1), are concatenated. This approach has been shown to be highly effective in real-world applications [46]. The output of the self-attention mechanism is then passed through LayerNorm [49], Dropout [50], and a residual connection [51] before reaching the feedforward network. These components together form the Transformer block, as shown in the following equation:

$\begin{aligned} X' &= X + \mathrm{Dropout}(\mathrm{LayerNorm}(\mathrm{Attn}(X))), \\ X'' &= \mathrm{FFN}(X') := \mathrm{ReLU}(X' W_{1} + b_{1}) W_{2} + b_{2}, \end{aligned}$

(2)

where $W_{1}, W_{2}$ are $d \times d$ matrices and $b_{1}, b_{2}$ are $d$-dimensional vectors. By stacking multiple blocks of Equation (2), a Transformer model can be constructed, ultimately providing node representations that capture global information. Moreover, the self-attention mechanism is equivariant to permutations of the input nodes. The Transformer generates identical representation vectors for nodes with the same neighboring nodes in a graph, irrespective of the broader graph structure, effectively addressing the challenge of long-range dependencies in GNNs [27].
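As a concrete reading of Equations (1) and (2), the following NumPy sketch implements a single-head block and checks the permutation equivariance mentioned above. It is an illustration under simplifying assumptions: dropout is omitted (inference mode), LayerNorm has no learnable scale or shift, $d_{q} = d_{k} = d_{v} = d$, and the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Equation (1): returns the attended values and the n x n attention matrix."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))   # quadratic in the number of nodes
    return A @ V, A

def layer_norm(H, eps=1e-5):
    mu = H.mean(axis=-1, keepdims=True)
    var = H.var(axis=-1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def transformer_block(X, W_q, W_k, W_v, W_1, b_1, W_2, b_2):
    """Equation (2) without dropout: residual attention followed by the FFN."""
    attn, _ = self_attention(X, W_q, W_k, W_v)
    X1 = X + layer_norm(attn)
    return np.maximum(X1 @ W_1 + b_1, 0.0) @ W_2 + b_2

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_1, W_2 = (rng.normal(size=(d, d)) for _ in range(5))
b_1, b_2 = rng.normal(size=d), rng.normal(size=d)

_, A = self_attention(X, W_q, W_k, W_v)
out = transformer_block(X, W_q, W_k, W_v, W_1, b_1, W_2, b_2)
perm = rng.permutation(n)
out_perm = transformer_block(X[perm], W_q, W_k, W_v, W_1, b_1, W_2, b_2)
```

Reordering the input rows reorders the output rows in the same way, and the explicit $n \times n$ matrix `A` is the source of the quadratic cost.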

However, this global attention mechanism introduces computational challenges [32], as calculating attention scores between all pairs of nodes results in quadratic complexity with respect to the number of nodes. This limitation becomes particularly significant when dealing with large-scale graphs common in recommender systems.

3. The Proposed TLFormer

In this section, we introduce Taylor Linear Attention in Transformer (TLFormer), a novel approach to address the challenges of global dependency modeling in graph-based recommender systems that can capture all-pair node relationships with linear computational complexity. As illustrated in Figure 1, TLFormer consists of three essential components: TokenTable, Structural Encoding and Linear Attention Module.

3.1. Representation Vector Lookup

Following the architectural principles of Transformers [38], we implement a TokenTable module that maps users and items to their corresponding representation vectors. For any user u and item i, their representations are obtained through:

$e_{u} = \mathrm{TokenTable}(u), \quad e_{i} = \mathrm{TokenTable}(i),$

(3)

where $e_{u} \in \mathbb{R}^{d}$ and $e_{i} \in \mathbb{R}^{d}$ represent the $d$-dimensional embedding vectors in the latent space. These embeddings are stored in a unified token embedding matrix $E \in \mathbb{R}^{(M + N) \times d}$, where $M$ and $N$ denote the number of users and items, respectively.

3.2. Spatial Structural Information

While attention mechanisms effectively weight input information, they inherently lack sequence-order awareness, which is traditionally addressed through positional encoding [38] in language models. However, in graph-based contexts [15], the primary consideration shifts from sequential ordering to structural relationships. Unlike sequence models [38], graphs exhibit equivariance properties and encode rich topological information rather than positional ordering. Recent research has shown that incorporating structural information significantly enhances the performance of the Transformer model in graph-based tasks [27,52].

For our bipartite recommender systems, we leverage structural topology as a form of positional encoding through Singular Value Decomposition (SVD) [52] of the interaction matrix $R$:

$\mathrm{SVD}(R) \to \hat{U} \Sigma \hat{V}^{\top} = (\hat{U} \sqrt{\Sigma}) (\hat{V} \sqrt{\Sigma})^{\top}.$

(4)

The resulting matrices $\hat{U} \sqrt{\Sigma} \in \mathbb{R}^{M \times d}$ and $\hat{V} \sqrt{\Sigma} \in \mathbb{R}^{N \times d}$, obtained by keeping the top $d$ singular values, capture the structural representations for users and items, respectively. We consolidate these representations into a unified structural encoding matrix $S \in \mathbb{R}^{(M + N) \times d}$. The final input representation combines this structural information with the token embeddings:

$X = \mathrm{Concat}(E, S).$

(5)

The resulting matrix $X \in \mathbb{R}^{(M + N) \times 2d}$ integrates both embedding and structural information. Importantly, since node positions in the structural layout are determined solely by inter-node interactions, the structural encoding $S$ remains fixed after graph construction and is therefore maintained as a non-trainable component.
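The construction in Equations (4) and (5) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it uses a dense `np.linalg.svd` on a toy matrix, whereas a large sparse interaction matrix would more plausibly call for a truncated sparse solver such as `scipy.sparse.linalg.svds`. The function name `structural_encoding` and the random stand-in for the token table $E$ are assumptions.

```python
import numpy as np

def structural_encoding(R, d):
    """Rank-d SVD factors of R, with sqrt(Sigma) split between users and items."""
    U_hat, sigma, Vt_hat = np.linalg.svd(R, full_matrices=False)
    root = np.sqrt(sigma[:d])                    # singular values come sorted descending
    user_enc = U_hat[:, :d] * root               # (M, d)
    item_enc = Vt_hat[:d, :].T * root            # (N, d)
    return user_enc, item_enc

R = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
M, N = R.shape
d = 3                                            # full rank here, so R is recovered exactly

user_enc, item_enc = structural_encoding(R, d)
S = np.vstack([user_enc, item_enc])              # fixed, non-trainable encoding

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(M + N, d))       # stand-in for the trainable token table
X = np.concatenate([E, S], axis=1)               # Equation (5): shape (M + N, 2d)
```

With $d$ smaller than the rank of $R$, the product of the two factors is only the best rank-$d$ approximation of $R$ rather than an exact reconstruction.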

3.3. Taylor Linear Attention

While the structure encoding strengthens the topological information in our bipartite user–item graph, the standard self-attention mechanism in Equation (1) still faces computational challenges due to its quadratic complexity. To address this limitation, we first observe that the attention operation between users and items can be reformulated through kernel methods [53]. This reformulation enables us to develop a more efficient attention mechanism that maintains the global modeling capacity while reducing computational overhead. Specifically, for any user u in our recommendation graph, the attention operation can be expressed as

$\mathrm{Attn}(x_{u}) = \sum_{i \in V} \frac{\kappa_{\exp}(x_{u}, x_{i})}{\sum_{w \in V} \kappa_{\exp}(x_{u}, x_{w})} f(x_{i}),$

(6)

where $x_{u}$ represents the query vector of node $u$ computed as $Q = X W_{Q}$, $x_{i}$ denotes the key vector of node $i$ obtained through $K = X W_{K}$, and $f(x_{i})$ corresponds to the value representation $V = f_{v}(X)$. The parameters $W_{Q}, W_{K} \in \mathbb{R}^{2d \times 2d}$ learn the query and key transformations. The exponential kernel $\kappa_{\exp}$ quantifies node similarity:

$\kappa_{\exp}(x_{u}, x_{i}) := \exp\left(\frac{\langle x_{u}, x_{i} \rangle}{\sqrt{2d}}\right),$

(7)

where $\langle \cdot, \cdot \rangle$ computes the dot product between transformed vectors.

Following attention-based architectures [27,38], we apply Frobenius normalization to both query and key functions. Although this normalization traditionally helps with gradient flow and training stability in Transformer models, we observe that in our recommendation context, it consistently produces values close to zero. This numerical characteristic allows us to further simplify Equation (7) by eliminating the scaling factor $\sqrt{2d}$:

$\kappa_{\exp}(x_{u}, x_{i}) := \exp(\langle \bar{x}_{u}, \bar{x}_{i} \rangle),$

(8)

where $\bar{x}_{u} = x_{u} / \|x_{u}\|_{F}$ and $\bar{x}_{i} = x_{i} / \|x_{i}\|_{F}$ represent the normalized embeddings.

Building upon the normalized kernel, we derive a computationally efficient approximation through a first-order Taylor expansion of the exponential function around zero, written with its Lagrange remainder:

$\exp(\langle \bar{x}_{u}, \bar{x}_{i} \rangle) = 1 + \langle \bar{x}_{u}, \bar{x}_{i} \rangle + \frac{\exp(\epsilon)}{2} \langle \bar{x}_{u}, \bar{x}_{i} \rangle^{2},$

(9)

where $\epsilon$ is an intermediate point between $0$ and $\langle \bar{x}_{u}, \bar{x}_{i} \rangle$. Dropping the second-order remainder term, we propose a simplified linear kernel function:

$\kappa_{linear}(\bar{x}_{u}, \bar{x}_{i}) = 1 + \langle \bar{x}_{u}, \bar{x}_{i} \rangle.$

(10)

To establish the validity of this approximation, we provide the following theoretical guarantee.

Theorem 1.

For any normalized user and item embeddings $\bar{x}_{u}$ and $\bar{x}_{i}$, the approximation error between the exponential kernel $\kappa_{\exp}$ and the linear kernel $\kappa_{linear}$ is bounded:

$|\kappa_{\exp}(\bar{x}_{u}, \bar{x}_{i}) - \kappa_{linear}(\bar{x}_{u}, \bar{x}_{i})| \leq \frac{e}{2},$

(11)

where $e$ denotes the natural constant. This theoretical bound ensures that our linear approximation maintains controlled error while reducing the computational complexity of the attention mechanism from $O(N^{2})$ to $O(N)$, where $N$ here denotes the total number of users and items.
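The bound in Theorem 1 can be sanity-checked numerically. The sketch below assumes each embedding is normalized to unit length on its own, so that the inner product $s$ lies in $[-1, 1]$ by the Cauchy–Schwarz inequality; the second-order remainder in Equation (9) is then at most $\frac{e}{2} s^{2} \leq \frac{e}{2}$. The helper name `kernel_gap` is hypothetical.

```python
import numpy as np

def kernel_gap(x_u, x_i):
    """|exp(<x_u, x_i>) - (1 + <x_u, x_i>)| after unit-normalizing both vectors."""
    x_u = x_u / np.linalg.norm(x_u)
    x_i = x_i / np.linalg.norm(x_i)
    s = float(x_u @ x_i)
    return abs(np.exp(s) - (1.0 + s))

rng = np.random.default_rng(0)
gaps = [kernel_gap(rng.normal(size=16), rng.normal(size=16)) for _ in range(1000)]

# Identical directions give s = 1, the largest possible gap: e - 2.
worst_case = kernel_gap(np.ones(4), np.ones(4))
bound = np.e / 2
```

The extreme case $s = 1$ gives a gap of $e - 2$, which is below $e/2$, so the stated bound holds with room to spare under this normalization.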

Matrix Notation for Linear Attention. For practical implementation, we express our approach using matrix notation. The concatenated token embeddings for users and items are represented by $X \in \mathbb{R}^{(M + N) \times 2d}$. Following standard attention mechanisms, we compute the query and key matrices through linear projections followed by Frobenius normalization:

$Q = \frac{X W_{Q}}{\|X W_{Q}\|_{F}}, \quad K = \frac{X W_{K}}{\|X W_{K}\|_{F}}, \quad V = X.$

(12)

Motivated by the empirical findings in LightGCN [17], we omit feature transformation on node embeddings and set $V = X$, which not only simplifies the model architecture but also empirically improves performance. The linear attention computation can then be formulated as a two-step process:

$\begin{aligned} D &= \mathrm{diag}\left(\mathbf{1} + \frac{Q (K^{\top} \mathbf{1}) + |V| \mathbf{1}}{|V|}\right), \\ X &= D^{-1} \left(V + \frac{\mathbf{1} (\mathbf{1}^{\top} V) + Q (K^{\top} V)}{|V|}\right), \end{aligned}$

(13)

where $\mathbf{1}$ is the all-ones column vector and $|V| = M + N$ is the number of nodes. This formulation maintains linear computational complexity in both time and space with respect to the number of users and items by computing the matrix products between keys and values before the interaction with the queries.
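To make the ordering argument concrete, the sketch below compares a linear-time evaluation against an explicit quadratic reference. It rests on an assumption about how to read Equation (13): each node aggregates values with the linear kernel $1 + \langle q_{u}, k_{i} \rangle$ of Equation (10), averaged over all nodes, plus a self term. The function names are hypothetical, and this is not the authors' released code.

```python
import numpy as np

def normalized_qk(X, W_q, W_k):
    Q, K = X @ W_q, X @ W_k
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return Q / np.linalg.norm(Q), K / np.linalg.norm(K)

def taylor_linear_attention(X, W_q, W_k):
    """Linear in the number of nodes: K^T V and K^T 1 are formed before touching Q."""
    n = X.shape[0]
    Q, K = normalized_qk(X, W_q, W_k)
    V = X
    kv = K.T @ V                                 # (2d, 2d), independent of n
    k_sum = K.sum(axis=0)                        # K^T 1
    v_sum = V.sum(axis=0)                        # 1^T V, broadcast to every row
    denom = 1.0 + (n + Q @ k_sum) / n            # diagonal of D
    numer = V + (v_sum[None, :] + Q @ kv) / n
    return numer / denom[:, None]                # D^{-1} applied row-wise

def taylor_quadratic_reference(X, W_q, W_k):
    """Same quantity via the explicit n x n kernel matrix, for checking only."""
    n = X.shape[0]
    Q, K = normalized_qk(X, W_q, W_k)
    A = 1.0 + Q @ K.T                            # kappa_linear for all pairs
    denom = 1.0 + A.sum(axis=1) / n
    numer = X + (A @ X) / n
    return numer / denom[:, None]

rng = np.random.default_rng(0)
n, two_d = 50, 8
X = rng.normal(size=(n, two_d))
W_q = rng.normal(size=(two_d, two_d))
W_k = rng.normal(size=(two_d, two_d))
fast = taylor_linear_attention(X, W_q, W_k)
slow = taylor_quadratic_reference(X, W_q, W_k)
```

Both functions return the same matrix up to floating-point error; only the order of the matrix products differs, which is what removes the $n \times n$ intermediate.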

3.4. Model Training

In this work, we adopt a loss function with two complementary terms: alignment and uniformity [54,55]. The alignment loss minimizes the distance between representations of interacting user–item pairs:

$\ell_{align} = \mathbb{E}_{(u, i) \sim p_{pos}} \|x_{u} - x_{i}\|^{2},$

(14)

where $p_{pos}$ denotes the distribution of observed user–item interactions. The uniformity loss does not require explicit negative sampling and encourages user and item representations to spread uniformly over the unit hypersphere by maximizing the distance between representations within each mini-batch:

$\ell_{uniform} = \log \mathbb{E}_{(u, u') \sim p_{user}} e^{-\|x_{u} - x_{u'}\|^{2}} + \log \mathbb{E}_{(i, i') \sim p_{item}} e^{-\|x_{i} - x_{i'}\|^{2}},$

(15)

where the exponential form of the loss naturally bounds the repulsion between dissimilar pairs. The complete objective combines these terms:

$\mathcal{L} = \ell_{align} + \gamma \cdot \ell_{uniform},$

(16)

where $\gamma$ balances the trade-off between pulling similar pairs together and pushing dissimilar pairs apart. By operating directly on normalized embeddings and leveraging batchwise comparisons, this formulation provides richer learning signals compared to traditional pairwise approaches.
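Equations (14) to (16) translate almost directly into NumPy. The sketch below is an illustration rather than the training code: it evaluates the losses on a single batch without gradients, estimates each expectation by an average over the batch, and uses hypothetical function names and an arbitrary value for $\gamma$.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment_loss(x_u, x_i):
    """Equation (14): mean squared distance between matched user and item rows."""
    return float(np.mean(np.sum((x_u - x_i) ** 2, axis=1)))

def uniformity_loss(x):
    """One log-mean-exp term of Equation (15), over distinct pairs in the batch."""
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    upper = np.triu_indices(x.shape[0], k=1)     # each unordered pair once
    return float(np.log(np.mean(np.exp(-sq_dist[upper]))))

def total_loss(x_u, x_i, gamma):
    """Equation (16) with the user and item uniformity terms summed."""
    return alignment_loss(x_u, x_i) + gamma * (uniformity_loss(x_u) + uniformity_loss(x_i))

rng = np.random.default_rng(0)
x_u = normalize(rng.normal(size=(32, 16)))
x_i = normalize(rng.normal(size=(32, 16)))
loss = total_loss(x_u, x_i, gamma=0.5)

collapsed = normalize(np.ones((32, 16)))         # every row identical
spread_uniformity = uniformity_loss(x_u)
collapsed_uniformity = uniformity_loss(collapsed)
```

A fully collapsed batch gives a uniformity value of 0, its maximum, while well-spread representations give a negative value, so minimizing the term pushes representations apart.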

3.5. Complexity Analysis

The computational cost of TLFormer comes from two primary components: the structural encoding computation via SVD and the linear attention mechanism. The SVD-based structural encoding is computed once during preprocessing and remains fixed during training, so it does not contribute to the per-iteration cost of the model. The main computational cost comes from the linear attention mechanism in Equation (13), whose time complexity is $O(N d)$, where $N$ denotes the total number of nodes (users and items combined) and $d$ represents the embedding dimension. The space complexity is also $O(N d)$, corresponding to the storage of node embeddings and attention parameters. We validate these theoretical complexity bounds through empirical runtime analysis in Section 4.3, demonstrating that TLFormer maintains its efficiency advantages as the dataset size increases. Algorithm 1 shows the pseudocode of TLFormer.

Algorithm 1 Pseudocode of TLFormer

Input: user–item interaction matrix $R$, embedding dimension $d$, balance weight $γ$, query matrix $W_{q}$, key matrix $W_{k}$.
Output: node representation matrix $H$
1: Initialize TokenTable $E$ using Xavier initialization
2: Compute structural encoding $S = SVD(R)$ and freeze parameters
3: Concatenate embeddings $X = Concat(E, S)$
4: for each batch of user–item pairs $(u, i) \in R$ do
5:     Compute normalized attention matrices:
6:     $Q = \frac{X W_{q}}{\|X W_{q}\|_{F}}$
7:     $K = \frac{X W_{k}}{\|X W_{k}\|_{F}}$
8:     $V = X$
9:     Compute diagonal matrix:
10:     $D = \mathrm{diag}\left(\mathbf{1} + \frac{Q (K^{\top} \mathbf{1}) + |V| \mathbf{1}}{|V|}\right)$
11:     Apply Taylor linear attention:
12:     $H = D^{-1} \left(V + \frac{\mathbf{1} (\mathbf{1}^{\top} V) + Q (K^{\top} V)}{|V|}\right)$
13:     Extract and normalize embeddings from the rows $h_{u}$, $h_{i}$ of $H$:
14:     $x_{u} = h_{u} / \|h_{u}\|$, $x_{i} = h_{i} / \|h_{i}\|$
15:     Compute loss:
16:     $L = ℓ_{align}(x_{u}, x_{i}) + γ (ℓ_{uniform}(x_{u}) + ℓ_{uniform}(x_{i})) / 2$
17:     Update parameters $E$, $W_{q}$, $W_{k}$ via gradient descent
18: end for
19: return $H$

4. Experiment

In this section, we conduct extensive experiments to evaluate the effectiveness of TLFormer against SOTA baselines. We first describe our experimental setup, including datasets, baseline methods, and evaluation metrics. We then present comprehensive results analyzing model performance, followed by detailed ablation studies examining the impact of different components and hyperparameters. Through these experiments, we demonstrate that TLFormer achieves superior performance while maintaining computational efficiency across diverse recommendation scenarios.

4.1. Experimental Settings

Datasets. Our experiments utilize four real-world public datasets that have been widely adopted in recommendation research [18,56]:

Pinterest [56]: A large-scale image recommendation dataset collected from the Pinterest social media platform, containing implicit user–image interactions derived from user behavior data.
Amazon Datasets (https://jmcauley.ucsd.edu/data/amazon/ accessed on 15 March 2024): Three large-scale product review datasets from Amazon’s e-commerce platform. We select Movies, Kindle, and CDs categories for their substantial size and diverse interaction patterns.

The statistical characteristics of these datasets are summarized in Table 1. Following standard practice in recommender systems [55,57], we utilize 5-core datasets that filter out users and items with fewer than five interactions during preprocessing to ensure reliable model learning. This filtering step helps maintain data quality by removing sparse interactions that could potentially introduce noise into the learning process.

Baselines. We compare TLFormer against a diverse set of established recommendation algorithms, including traditional matrix factorization approaches, generative models, GNN-based models, and recent self-supervised learning methods.

BPRMF [42] is a typical negative sampling method, which optimizes a ranking loss using maximum a posteriori estimation.
RecVAE [43] builds upon Variational AutoEncoders (VAEs) and enhances Mult-VAE with a novel $β$-VAE hyperparameter setting, a composite prior for latent codes, and an alternating update training strategy.
LightGCN [17] is a simplified GCN that removes activation functions and feature transformation to facilitate large-scale recommender systems.
CLRec [58] is a contrastive learning method that adopts the InfoNCE loss to reduce exposure bias through inverse propensity weighting in recommender systems.
SGL [26] applies contrastive learning to recommender systems by generating multiple graph views through edge dropping, node dropping, and random walks, thereby enhancing the robustness of learned user and item representations.
CONet [59] captures both local and nonlocal messages in graphs by performing k-Means clustering on nodes’ GNN embeddings to obtain graph-level representations (e.g., centroids).
DirectAU [55] introduces the concepts of alignment and uniformity into recommender systems, eliminating negative sampling and ensuring closer user–item pairs, while making the distribution of user and item sets within the same batch more uniform.
MGFormer [21] can capture all-pair interactions among nodes with linear complexity and incorporate learnable relative degree information to appropriately reweigh the attentions.
BIGCF [13] revisits user–item interactions from a causal perspective by disentangling them into collective and individual intents, and reconstructs the interaction graph through a bilateral intent modeling framework grounded in generative graph learning.

Evaluation Metrics. For evaluation, we employ two standard metrics in recommender systems: Recall and Normalized Discounted Cumulative Gain (NDCG) at different cutoff values $K \in \{10, 20, 50\}$. We use NDCG@20 as the validation metric for model selection during training [57]. Each dataset is split into training, validation, and test sets with a ratio of 8:1:1. To ensure comprehensive evaluation, we adopt the full-ranking protocol where the model ranks all items for each user rather than using a sampled subset. All baselines are tuned to their optimal performance through multiple experimental runs, with the best results reported for fair comparison.
Experimental Setup. We implement all methods using the RecBole [57] framework for fair comparison. For optimization, we use Adam [60] with a learning rate of 1 × 10⁻³ and train for a maximum of 300 epochs. We employ early stopping with a patience of 10 epochs to prevent overfitting. All models use embedding dimensions of either 64 or 128, with a batch size of 1024. For TLFormer, the default encoder consists of a linear transformation mapping user/item IDs from the embedding table, and we conduct extensive parameter tuning for the trade-off weight between uniformity and alignment objectives. For the baselines, we follow the hyperparameter settings specified in the original papers [13,17,21,26,42,43,55,58,59]. All parameters are initialized using the Xavier initialization [61].
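For reference, the two metrics and the full-ranking protocol described under Evaluation Metrics can be sketched for a single user as follows. This is a plain NumPy illustration with binary relevance, not the RecBole implementation, and the function names are hypothetical.

```python
import numpy as np

def full_ranking(scores, train_items):
    """Rank every item for one user, pushing items seen in training to the end."""
    scores = scores.astype(float).copy()
    seen = np.array(sorted(train_items), dtype=int)
    scores[seen] = -np.inf
    return np.argsort(-scores)

def recall_at_k(ranked_items, relevant, k):
    """Fraction of the user's held-out items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

scores = np.array([0.9, 0.1, 0.8, 0.3])
ranking = full_ranking(scores, train_items={0})          # item 0 was seen in training

recall_example = recall_at_k([3, 1, 4, 0, 2], relevant={1, 2}, k=3)
ndcg_perfect = ndcg_at_k([1, 2, 3], relevant={1, 2}, k=3)
ndcg_partial = ndcg_at_k([3, 1, 4, 0, 2], relevant={1, 2}, k=3)
```

In the partial example only one of the two held-out items reaches the top 3, so Recall@3 is 0.5, and NDCG@3 is lower still because that hit sits at rank 2 rather than rank 1.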

4.2. Accuracy Evaluation

Table 2 presents the performance comparison between TLFormer and SOTA baseline methods across different Top-K recommendation metrics ($K \in \{10, 20, 50\}$). The relative improvements over the strongest baseline are reported. Based on these experimental results, we make several key observations regarding model performance and effectiveness.

The experimental results demonstrate that Transformer-based architectures (TLFormer and MGFormer) achieve superior performance compared to conventional approaches. This performance gap highlights the effectiveness of global attention mechanisms in capturing complex user–item relationships, surpassing traditional collaborative filtering methods. Notably, TLFormer achieves these improvements with a single-layer linear attention mechanism, indicating that architectural simplicity combined with global context modeling can enhance recommendation performance. While GNN-based models like LightGCN show advantages over traditional matrix factorization through message-passing mechanisms, their performance remains constrained by the local nature of graph convolutions. This limitation is most evident on the Amazon-Kindle dataset, where LightGCN achieves an NDCG@20 of 0.1009, significantly below TLFormer’s 0.1460. This 45% performance gap demonstrates the clear advantage of our global attention approach over localized message passing for capturing user–item relationships in sparse recommendation scenarios.

4.3. Efficiency Comparison

To evaluate computational efficiency, we compare the per-epoch runtime of TLFormer against LightGCN and DirectAU across all datasets. As shown in Figure 2, TLFormer demonstrates consistently faster execution speeds across all four datasets. The improvements are particularly notable on the Pinterest dataset, where TLFormer achieves a 47% reduction in per-epoch runtime compared to baseline methods. These empirical results validate that our linear attention design translates to substantial practical efficiency gains in large-scale recommendation scenarios.

4.4. Convergence Analysis

We further analyze the convergence behavior of TLFormer against LightGCN [17] and DirectAU [55] on the Amazon-Movies and Pinterest datasets. Figure 3 tracks the training loss progression across epochs, while Figure 4 monitors the validation Recall@20 performance. These metrics provide complementary views of model convergence dynamics.

TLFormer demonstrates superior convergence efficiency compared to baseline models on both datasets. Specifically, TLFormer reaches optimal performance at the 40th and 46th epochs for Amazon-Movies and Pinterest, respectively, while DirectAU requires 41 and 62 epochs, and LightGCN needs 202 and 229 epochs. These results quantify substantial reductions in training time, with TLFormer achieving both faster convergence and improved performance metrics. The accelerated convergence of TLFormer stems from two architectural advantages. First, the Alignment and Uniformity loss normalizes representations to the unit hypersphere, avoiding the representation collapse often observed with LightGCN’s BPR loss and single negative sampling. Second, the global attention mechanism enables comprehensive all-pair message passing, allowing nodes to attend to the entire graph structure rather than just local neighborhoods. This broader receptive field facilitates more efficient learning of user–item relationships than DirectAU’s local message-passing approach.
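A minimal sketch of an alignment and uniformity objective on the unit hypersphere is given below, following the commonly cited formulation of [54,55]. The temperature `t = 2`, the equal averaging of user and item uniformity, the batch construction, and all function names are assumptions of this sketch; `gamma` plays the balancing role described in Section 4.8.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each embedding onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def alignment(user_emb, item_emb):
    """Mean squared distance between embeddings of positive (user, item) pairs."""
    return np.mean(np.sum((user_emb - item_emb) ** 2, axis=1))


def uniformity(emb, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs in the batch."""
    sq_dists = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(emb.shape[0], k=1)   # count each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dists[upper])))


def alignment_uniformity_loss(user_emb, item_emb, gamma):
    """Alignment plus gamma-weighted uniformity, computed on normalized embeddings."""
    u, i = l2_normalize(user_emb), l2_normalize(item_emb)
    uniform = 0.5 * (uniformity(u) + uniformity(i))
    return alignment(u, i) + gamma * uniform


rng = np.random.default_rng(0)
users = rng.normal(size=(8, 16))   # toy batch of user embeddings (hypothetical)
items = rng.normal(size=(8, 16))   # embeddings of the items paired with those users
print(alignment_uniformity_loss(users, items, gamma=1.0))
```

Because the uniformity term rewards embeddings that spread out over the sphere, it penalizes the degenerate solution in which all representations collapse to a single point.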

Figure 4 reveals that DirectAU achieves higher initial Recall@20 performance than TLFormer. This early advantage stems from DirectAU’s GNN-based architecture, which restricts information flow to local neighborhoods, naturally filtering noise in the initial training stages. Conversely, TLFormer’s global attention mechanism processes all user–item relationships simultaneously, initially incorporating more noise but ultimately learning more comprehensive representations as training progresses. This pattern illustrates the fundamental trade-off between immediate local precision and long-term global comprehension in recommendation models.

4.5. Noise Robustness Analysis

To evaluate the robustness of TLFormer, we conduct systematic experiments with adversarial perturbations in the training data. Following the methodology in [26], we introduce controlled levels of noise by contaminating the training set with negative user–item interactions at ratios of 5%, 10%, 15%, and 20%. The validation and test sets remain unmodified to maintain evaluation integrity. Figure 5 presents comparative results on the Amazon-CDs and Amazon-Movies datasets, demonstrating performance variations under different noise conditions.
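The contamination step can be sketched as follows. This is one plausible reading of the protocol, under the assumption that "negative" interactions are (user, item) pairs sampled uniformly from those absent from the training set; the function name and toy data are hypothetical, and a full pipeline would likely also exclude validation and test pairs from the candidates.

```python
import random


def inject_noise(train_pairs, num_users, num_items, ratio, seed=0):
    """Return the training pairs plus round(ratio * len(train_pairs)) fake interactions.

    Fake pairs are drawn uniformly from (user, item) combinations that do not
    occur in train_pairs (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = set(train_pairs)
    num_fake = int(round(ratio * len(train_pairs)))
    if num_fake > num_users * num_items - len(observed):
        raise ValueError("not enough unobserved pairs to sample from")
    fake = set()
    while len(fake) < num_fake:
        pair = (rng.randrange(num_users), rng.randrange(num_items))
        if pair not in observed:
            fake.add(pair)
    return list(train_pairs) + sorted(fake)


# Toy training set: 10 users, 10 items, 20 observed interactions.
train = [(u, u) for u in range(10)] + [(u, (u + 1) % 10) for u in range(10)]
for ratio in (0.05, 0.10, 0.15, 0.20):
    noisy = inject_noise(train, num_users=10, num_items=10, ratio=ratio)
    print(ratio, len(noisy) - len(train))
```

Only the training pairs are modified, which matches the requirement that the validation and test sets stay clean.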

Analysis of the experimental results reveals that while both TLFormer and DirectAU experience performance degradation with increasing noise, the impact differs significantly between the two models. DirectAU exhibits steeper performance decline as noise levels increase, whereas TLFormer maintains more stable performance characteristics, particularly evident in the Amazon-Movies dataset where it demonstrates notably smoother degradation patterns. This resilience can be attributed to TLFormer’s all-pair global attention mechanism, which enables effective filtering of spurious interactions through comprehensive modeling of long-range dependencies. The global receptive field provides TLFormer with enhanced capability to distinguish genuine user preferences from noise-induced patterns, resulting in more consistent interpretation of user–item relationships even under noisy conditions.

4.6. Effectiveness in Long-Tail Recommendation

As highlighted in prior work [26], GNN-based recommendation models are prone to the long-tail problem. To evaluate TLFormer’s capability in handling this challenge, we divide the items into ten equally-sized groups based on interaction frequency, with higher group indices corresponding to items with greater popularity. To quantitatively assess performance across these groups, we decompose the Recall@20 metric into group-specific components:

\mathrm{Recall} = \sum_{g=1}^{10} \frac{1}{M} \sum_{u=1}^{M} \frac{\left| (l_{rec}^{u})^{(g)} \cap l_{test}^{u} \right|}{\left| l_{test}^{u} \right|} = \sum_{g=1}^{10} \mathrm{Recall}^{(g)},

where $M$ denotes the total user count, $l_{rec}^{u}$ represents the top-K recommendations for user $u$, $(l_{rec}^{u})^{(g)}$ is the subset of those recommendations that belong to group $g$, and $l_{test}^{u}$ contains the user’s relevant items in the test set. The term $\mathrm{Recall}^{(g)}$ measures the recommendation accuracy for group $g$.
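The decomposition can be sketched in a few lines. The helper names, the tie-breaking rule used when sorting items by popularity, and the toy data are assumptions of this sketch; it also assumes every user has at least one test item.

```python
def build_popularity_groups(item_counts, num_groups=10):
    """Map each item to a group index 1..num_groups by ascending interaction count."""
    ordered = sorted(item_counts, key=lambda item: (item_counts[item], item))
    group_size = len(ordered) / num_groups
    return {
        item: min(int(pos / group_size) + 1, num_groups)
        for pos, item in enumerate(ordered)
    }


def grouped_recall(recommendations, test_items, item_group, num_groups=10):
    """Return [Recall^(1), ..., Recall^(G)]; the entries sum to the overall Recall@K."""
    per_group = [0.0] * num_groups
    num_users = len(recommendations)
    for user, rec_list in recommendations.items():
        relevant = test_items[user]
        for item in rec_list:
            if item in relevant:
                # Each hit contributes 1 / (|l_test^u| * M) to its item's group.
                per_group[item_group[item] - 1] += 1.0 / (len(relevant) * num_users)
    return per_group


# Toy data: 20 items whose interaction counts increase with the item ID.
item_counts = {item: item + 1 for item in range(20)}
item_group = build_popularity_groups(item_counts)
recommendations = {"u1": [19, 0, 5], "u2": [18, 1, 7]}
test_items = {"u1": {19, 5}, "u2": {1, 2}}
per_group = grouped_recall(recommendations, test_items, item_group)
print(per_group, sum(per_group))
```

Summing the per-group values recovers the ordinary Recall@K, so the groups can be compared on a common scale.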

Experimental results shown in Figure 6 reveal distinct patterns across datasets. On Amazon-CDs, TLFormer demonstrates uniform performance improvements across all popularity groups. In contrast, for Amazon-Movies, the gains are concentrated in groups with sparse interactions. These results indicate that the global attention mechanism in TLFormer more effectively captures item representations compared to DirectAU’s local message-passing approach, particularly for items with limited interaction data.

4.7. Ablation Study

The architecture of TLFormer presented in Figure 1 emphasizes structural simplicity through its core components. We conduct ablation experiments to evaluate the contributions of the SVD-based structural encoding module and of Taylor Linear Attention. Table 3 presents comparative results on two representative datasets, Amazon-Movies and Pinterest, reporting both Recall and NDCG with and without each component. The results show that adding SVD-based structural encoding and Taylor Linear Attention yields substantial gains, with pronounced improvements under sparse interaction regimes. These findings indicate that both components are crucial for capturing latent interaction patterns, with Taylor Linear Attention contributing the larger effect. The benefit is most apparent on datasets with varying density characteristics and complex user–item relationships.
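As a rough illustration of what an SVD-based structural encoding can look like, the sketch below derives rank-limited user and item encodings from a toy interaction matrix and concatenates the user encodings with stand-in token embeddings, as described for Figure 1. The square-root scaling of singular values, the chosen rank, and all names are assumptions; the exact encoding used by TLFormer is defined elsewhere in the paper.

```python
import numpy as np


def svd_structural_encoding(interactions, rank):
    """Rank-limited user/item encodings from an SVD of the interaction matrix."""
    u, s, vt = np.linalg.svd(interactions, full_matrices=False)
    scale = np.sqrt(s[:rank])               # split singular values between both sides
    user_enc = u[:, :rank] * scale          # (num_users, rank)
    item_enc = vt[:rank, :].T * scale       # (num_items, rank)
    return user_enc, item_enc


# Toy binary interaction matrix R (4 users x 5 items), hypothetical data.
R = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

user_enc, item_enc = svd_structural_encoding(R, rank=2)
tokens = np.random.default_rng(0).normal(size=(R.shape[0], 8))  # stand-in for TokenTable rows
x_users = np.concatenate([tokens, user_enc], axis=1)            # concatenation as in Figure 1
print(x_users.shape)
```

For interaction matrices at the scale of Table 1, a dense SVD would be impractical; a sparse truncated solver such as `scipy.sparse.linalg.svds` would be the more realistic choice.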

4.8. Parameter Sensitivity

The practical advantage of TLFormer extends beyond its architectural design to its hyperparameter configuration, requiring only the tuning of $\gamma$, which controls the balance between alignment and uniformity objectives in the loss function. To assess the model’s robustness to this parameter, we evaluate performance metrics NDCG@20 and Recall@20 across different values of $\gamma$ on four benchmark datasets, as shown in Figure 7.

The performance curves exhibit a consistent pattern of initial improvement followed by gradual degradation, with the optimal $\gamma$ value varying across datasets. This behavior suggests a stable relationship between the alignment and uniformity components of the loss function, making the model relatively robust to parameter selection. For practical implementation, we recommend an initial coarse search within the range [0.01, 10], followed by localized fine-tuning around promising values based on validation performance. This straightforward approach to parameter selection enhances the model’s usability while maintaining robust performance across different recommendation scenarios.
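The recommended coarse-to-fine procedure could be organized as in the sketch below. The log-spaced grids, the number of grid points, and the `evaluate` callback are assumptions of this sketch; in practice `evaluate` would train the model with a given value and return a validation metric such as Recall@20, whereas here a synthetic single-peaked score stands in for it.

```python
import math


def coarse_to_fine_search(evaluate, low=0.01, high=10.0, coarse_points=7, fine_points=5):
    """Log-spaced coarse sweep over [low, high], then a finer sweep around the best value."""
    ratio = (high / low) ** (1.0 / (coarse_points - 1))
    coarse = [low * ratio ** i for i in range(coarse_points)]
    scores = {value: evaluate(value) for value in coarse}
    best = max(scores, key=scores.get)

    # The fine grid spans one coarse step on each side of the best coarse value.
    fine_low, fine_high = max(low, best / ratio), min(high, best * ratio)
    step = (fine_high / fine_low) ** (1.0 / (fine_points - 1))
    for i in range(fine_points):
        value = fine_low * step ** i
        if value not in scores:
            scores[value] = evaluate(value)   # each candidate is evaluated only once
    return max(scores, key=scores.get), scores


def synthetic_validation_score(value):
    """Stand-in for a validation metric: a single peak at 0.5 on a log scale."""
    return -(math.log10(value) - math.log10(0.5)) ** 2


best_value, all_scores = coarse_to_fine_search(synthetic_validation_score)
print(best_value, len(all_scores))
```

Searching on a logarithmic scale is a natural fit here because the suggested range spans three orders of magnitude.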

5. Related Work

5.1. GNNs for Recommender Systems

GNNs [62,63,64] have become a cornerstone of recommender systems, utilizing a message-passing mechanism to aggregate information from adjacent nodes [13,17,18,20,26,65,66,67,68]. This approach has proven effective for modeling user–item interactions in implicit feedback bipartite graphs and has inspired numerous modified frameworks. For instance, NGCF [16] learns node embeddings using a graph convolutional network enhanced with feature transformation and activation functions. Building on NGCF, LightGCN [17] simplifies the framework by removing feature transformation and activation functions, resulting in a more lightweight encoder module. Similarly, SimCE [18] employs a lightweight graph convolutional module as the encoder but incorporates Sampled Softmax Cross-Entropy, comparing one positive sample with multiple negative samples to enhance learning effectiveness. Despite extensive research on using GNNs as encoders, the challenge of data sparsity in recommender system datasets persists. Self-supervised learning (SSL) has been widely adopted in this domain to alleviate data sparsity. SGL [26] pioneered the integration of SSL into recommender systems, proposing three techniques (edge dropping, node dropping, and random walks) to enhance the learning performance of GNNs. NCL [12] introduced the concept of semantic neighborhoods and combined it with structure-contrastive learning; by employing the Expectation-Maximization (EM) algorithm for optimization, it enhanced the robustness of the model. DirectAU [55] retains the lightweight GNN architecture but optimizes the loss function by introducing Alignment and Uniformity to ensure well-distributed representations on the unit hypersphere.

Despite their success, GNN-based models face several inherent challenges. Their limited receptive fields [26], compounded by data sparsity, noisy interactions, and the long-tail effect, restrict their ability to capture long-range dependencies [39]. These limitations prevent them from fully exploiting global structural information in large-scale graphs, which remains a critical bottleneck in improving recommendation performance.

5.2. Graph Transformers

GTs compute fully connected weight matrices for all nodes [27,28,29,30,34,69,70] and are capable of operating even without explicit graph inputs [39]. Compared to GNNs, GTs effectively address the pervasive issue of data sparsity in graph datasets by leveraging their global receptive field and fully connected design. Several notable GT architectures have been proposed in recent years, inspired by advances in Transformers [71,72,73,74]. GraphGPS [27] introduced a general architecture that achieves linear complexity with respect to the number of nodes and edges by decoupling local real edge aggregation from the fully connected Transformer layers. Nodeformer [34] employs a kernelized Gumbel-Softmax operator to reduce algorithmic complexity to linear in the number of nodes, enabling the learning of latent graph structures in a differentiable manner for large, potentially fully connected graphs. SGFormer [39] simplifies the attention mechanism, employing a lightweight attention model that efficiently propagates information among arbitrary nodes in a single layer. In the domain of recommender systems, Transformer-based architectures have shown significant promise. LightGT [75] introduces a modality-specific embedding and a layer-wise position encoder to improve similarity measurements, while employing a lightweight self-attention block to improve computational efficiency. GFormer [40] combines a graph Transformer with parameterized collaborative rationale discovery to selectively enhance user–item relationships while preserving global contextual information. MGFormer [21], designed with linear complexity, treats all nodes of users and items as independent tokens, augments them with positional embeddings, and processes them through a kernelized attention mechanism. Additionally, MGFormer incorporates learnable relative degree information to appropriately reweight attention scores, further improving recommendation performance.

Despite their advances, methods like LightGT and GFormer retain GNN components, and may therefore inherit challenges such as sensitivity to data sparsity. Recent progress in lightweight Transformers [21,35,39] has focused on addressing these challenges both theoretically and practically. Building on this direction, our work proposes an effective and efficient Transformer framework tailored for recommender systems, specifically designed to address data sparsity in large-scale datasets.

6. Conclusions

This work advances recommender systems by introducing TLFormer, which addresses key efficiency challenges in attention mechanisms for large-scale recommender systems. By reformulating attention computation through Taylor expansion and integrating structural graph information, our approach achieves linear complexity while preserving global dependency modeling capabilities. Extensive experiments across diverse datasets demonstrate consistent performance improvements over SOTA methods, along with significant reductions in training time and enhanced robustness to noise and long-tail scenarios. Future work will explore adaptive structural encoding mechanisms and investigate the application of our linear attention framework to multi-modal recommendation scenarios, particularly focusing on the integration of heterogeneous information sources while maintaining computational efficiency.

Author Contributions

Conceptualization, D.H. and D.Y.; methodology, D.H.; software, D.H.; validation, D.H., D.Y. and X.H.; formal analysis, D.H.; investigation, D.H.; resources, D.Y.; data curation, D.H.; writing—original draft preparation, D.H.; writing—review and editing, D.Y. and X.H.; visualization, D.H.; supervision, D.Y.; project administration, X.H.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Major Basic Research Program of the Shandong Provincial Natural Science Foundation (Grant No. ZR2025ZD18), and by the Joint Key Funds of the National Natural Science Foundation of China (Grant No. U23A20302).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to sincerely thank Dongsheng Luo from Florida International University, Huiyuan Chen from Amazon, Fan Liu from Southeast University, Zhenyang Li from City University of Hong Kong, and Shunkang Zhang from NVIDIA, for their valuable discussions and insightful suggestions.

Conflicts of Interest

Author Xiaowen Hou was employed by the company Shandong Hi-Speed Qingdao Investment Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Peng, J.; Gong, J.; Zhou, C.; Zang, Q.; Fang, X.; Yang, K.; Yu, J. Kgcfrec: Improving collaborative filtering recommendation with knowledge graph. Electronics 2024, 13, 1927.
2. Li, P.; Zhan, W.; Gao, L.; Wang, S.; Yang, L. Multimodal Recommendation System Based on Cross Self-Attention Fusion. Systems 2025, 13, 57.
3. Wang, L.; Jin, D. A time-sensitive graph neural network for session-based new item recommendation. Electronics 2024, 13, 223.
4. Lu, H.; Chen, Z. SocialJGCF: Social Recommendation with Jacobi Polynomial-Based Graph Collaborative Filtering. Appl. Sci. 2024, 14, 12070.
5. Duan, Z.; Wang, C.; Zhong, W. Ssgcl: Simple social recommendation with graph contrastive learning. Mathematics 2024, 12, 1107.
6. Chen, H.; Lin, Y.; Pan, M.; Wang, L.; Yeh, C.C.M.; Li, X.; Zheng, Y.; Wang, F.; Yang, H. Denoising self-attentive sequential recommendation. In Proceedings of the 16th ACM Conference on Recommender Systems, Seattle, WA, USA, 18–23 September 2022; pp. 92–101.
7. Chen, H.; Li, J. Adversarial tensor factorization for context-aware recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 363–367.
8. Bawack, R.E.; Wamba, S.F.; Carillo, K.D.A.; Akter, S. Artificial intelligence in E-Commerce: A bibliometric study and literature review. Electron. Mark. 2022, 32, 297–338.
9. Goyani, M.; Chaurasiya, N. A review of movie recommendation system: Limitations, Survey and Challenges. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2020, 19, 18–37.
10. Schedl, M.; Knees, P.; McFee, B.; Bogdanov, D. Music recommendation systems: Techniques, use cases, and challenges. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2021; pp. 927–971.
11. Li, Z.; Liu, F.; Wei, Y.; Cheng, Z.; Nie, L.; Kankanhalli, M.S. Attribute-driven Disentangled Representation Learning for Multimodal Recommendation. In Proceedings of the 32nd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2024; pp. 9660–9669.
12. Lin, Z.; Tian, C.; Hou, Y.; Zhao, W.X. Improving graph collaborative filtering with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2320–2329.
13. Zhang, Y.; Sang, L.; Zhang, Y. Exploring the individuality and collectivity of intents behind interactions for graph collaborative filtering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1253–1262.
14. Liu, F.; Zhao, S.; Cheng, Z.; Nie, L.; Kankanhalli, M. Cluster-based graph collaborative filtering. ACM Trans. Inf. Syst. 2024, 42, 1–24.
15. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
16. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174.
17. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the SIGIR, Xi’an, China, 25–30 July 2020.
18. Yang, X.; Chen, H.; Yan, Y.; Tang, Y.; Zhao, Y.; Xu, E.; Cai, Y.; Tong, H. SimCE: Simplifying Cross-Entropy Loss for Collaborative Filtering. arXiv 2024, arXiv:2406.16170.
19. Cheng, Z.; Han, S.; Liu, F.; Zhu, L.; Gao, Z.; Peng, Y. Multi-behavior recommendation with cascading graph convolution networks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1181–1189.
20. Huang, J.; Cao, Q.; Xie, R.; Zhang, S.; Xia, F.; Shen, H.; Cheng, X. Adversarial learning data augmentation for graph contrastive learning in recommendation. In Proceedings of the International Conference on Database Systems for Advanced Applications, Tianjin, China, 17–20 April 2023; Springer: New York, NY, USA, 2023; pp. 373–388.
21. Chen, H.; Xu, Z.; Yeh, C.C.M.; Lai, V.; Zheng, Y.; Xu, M.; Tong, H. Masked Graph Transformer for Large-Scale Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2502–2506.
22. Jain, K.; Jindal, R. Sampling and noise filtering methods for recommender systems: A literature review. Eng. Appl. Artif. Intell. 2023, 122, 106129.
23. Wang, S.; Zhang, X.; Wang, Y.; Ricci, F. Trustworthy recommender systems. ACM Trans. Intell. Syst. Technol. 2024, 15, 84.
24. Sreepada, R.S.; Patra, B.K. Enhancing long tail item recommendation in collaborative filtering: An econophysics-inspired approach. Electron. Commer. Res. Appl. 2021, 49, 101089.
25. Xu, Z.; Chai, Z.; Xu, C.; Yuan, C.; Yang, H. Towards effective collaborative learning in long-tailed recognition. IEEE Trans. Multimed. 2023, 26, 3754–3764.
26. Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 726–735.
27. Rampášek, L.; Galkin, M.; Dwivedi, V.P.; Luu, A.T.; Wolf, G.; Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022.
28. Geisler, S.; Li, Y.; Mankowitz, D.J.; Cemgil, A.T.; Günnemann, S.; Paduraru, C. Transformers meet directed graphs. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023.
29. Ma, L.; Lin, C.; Lim, D.; Romero-Soriano, A.; Dokania, P.K.; Coates, M.; Torr, P.H.; Lim, S.N. Graph inductive biases in transformers without message passing. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023.
30. Shirzad, H.; Velingker, A.; Venkatachalam, B.; Sutherland, D.J.; Sinop, A.K. Exphormer: Sparse transformers for graphs. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023.
31. Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.Y. Do transformers really perform badly for graph representation? In Proceedings of the NeurIPS, Virtual, 6–14 December 2021.
32. Liu, C.; Yao, Z.; Zhan, Y.; Ma, X.; Pan, S.; Hu, W. Gradformer: Graph Transformer with Exponential Decay. arXiv 2024, arXiv:2404.15729.
33. Xing, Y.; Wang, X.; Li, Y.; Huang, H.; Shi, C. Less is more: On the over-globalizing problem in graph transformers. arXiv 2024, arXiv:2405.01102.
34. Wu, Q.; Zhao, W.; Li, Z.; Wipf, D.P.; Yan, J. Nodeformer: A scalable graph structure learning transformer for node classification. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022.
35. Liu, J.; Mao, Q.; Jiang, W.; Li, J. KnowFormer: Revisiting Transformers for Knowledge Graph Reasoning. arXiv 2024, arXiv:2409.12865.
36. Dwivedi, V.P.; Luu, A.T.; Laurent, T.; Bengio, Y.; Bresson, X. Graph Neural Networks with Learnable Structural and Positional Representations. In Proceedings of the ICLR, Vienna, Austria, 3–7 May 2021.
37. Choromanski, K.M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.Q.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the ICLR, Vienna, Austria, 3–7 May 2021.
38. Vaswani, A. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
39. Wu, Q.; Zhao, W.; Yang, C.; Zhang, H.; Nie, F.; Jiang, H.; Bian, Y.; Yan, J. Simplifying and empowering transformers for large-graph representations. Adv. Neural Inf. Process. Syst. 2024, 36, 2826.
40. Li, C.; Xia, L.; Ren, X.; Ye, Y.; Xu, Y.; Huang, C. Graph transformer for recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 1680–1689.
41. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37.
42. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618.
43. Shenbin, I.; Alekseev, A.; Tutubalina, E.; Malykh, V.; Nikolenko, S.I. Recvae: A new variational autoencoder for top-n recommendations with implicit feedback. In Proceedings of the 13th International Conference on Web search and DATA Mining, Houston, TX, USA, 3–7 February 2020; pp. 528–536.
44. Li, Z.; Guo, Y.; Wang, K.; Chen, X.; Nie, L.; Kankanhalli, M.S. Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 5634–5644.
45. Li, Z.; Guo, Y.; Wang, K.; Liu, F.; Nie, L.; Kankanhalli, M.S. Learning to Agree on Vision Attention for Visual Commonsense Reasoning. IEEE Trans. Multimed. 2024, 26, 1065–1075.
46. OpenAI. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
47. Koresh, E.; Gross, R.D.; Meir, Y.; Tzach, Y.; Halevi, T.; Kanter, I. Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi. Phys. A Stat. Mech. Its Appl. 2025, 666, 130529.
48. Li, Z.; Guo, Y.; Wang, K.; Wei, Y.; Nie, L.; Kankanhalli, M.S. Joint Answering and Explanation for Visual Commonsense Reasoning. IEEE Trans. Image Process. 2023, 32, 3836–3846.
49. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
50. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 July 2016; pp. 770–778.
52. Hussain, M.S.; Zaki, M.J.; Subramanian, D. Global self-attention as a replacement for graph convolution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 655–665.
53. Liu, F.; Huang, X.; Chen, Y.; Suykens, J.A. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7128–7148.
54. Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 9929–9939.
55. Wang, C.; Yu, Y.; Ma, W.; Zhang, M.; Chen, C.; Liu, Y.; Ma, S. Towards representation alignment and uniformity in collaborative filtering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 1816–1825.
56. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182.
57. Zhao, W.X.; Mu, S.; Hou, Y.; Lin, Z.; Chen, Y.; Pan, X.; Li, K.; Lu, Y.; Wang, H.; Tian, C.; et al. Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 4653–4664.
58. Zhou, C.; Ma, J.; Zhang, J.; Zhou, J.; Yang, H. Contrastive learning for debiased candidate generation in large-scale recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3985–3995.
59. Chen, H.; Yeh, C.C.M.; Wang, F.; Yang, H. Graph neural transport networks with non-local attentions for recommender systems. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1955–1964.
60. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
61. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
62. Luo, D.; Cheng, W.; Xu, D.; Yu, W.; Zong, B.; Chen, H.; Zhang, X. Parameterized explainer for graph neural network. Adv. Neural Inf. Process. Syst. 2020, 33, 19620–19631.
63. Luo, D.; Zhao, T.; Cheng, W.; Xu, D.; Han, F.; Yu, W.; Liu, X.; Chen, H.; Zhang, X. Towards inductive and efficient explanations for graph neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5245–5259.
64. Chen, H.; Wang, L.; Lin, Y.; Yeh, C.C.M.; Wang, F.; Yang, H. Structured graph convolutional networks with stochastic masks for recommender systems. In Proceedings of the SIGIR, Virtual, 11–15 July 2021.
65. Yu, J.; Yin, H.; Xia, X.; Chen, T.; Cui, L.; Nguyen, Q.V.H. Are graph augmentations necessary? simple graph contrastive learning for recommendation. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, Padua, Italy, 13–18 July 2022; pp. 1294–1303.
66. Wu, Y.; Zhang, L.; Mo, F.; Zhu, T.; Ma, W.; Nie, J.Y. Unifying Graph Convolution and Contrastive Learning in Collaborative Filtering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 3425–3436.
67. Wu, W.; Wang, C.; Shen, D.; Qin, C.; Chen, L.; Xiong, H. AFDGCF: Adaptive Feature De-correlation Graph Collaborative Filtering for Recommendations. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1242–1252.
68. Zhao, J.; Wenjie, W.; Xu, Y.; Sun, T.; Feng, F.; Chua, T.S. Denoising diffusion recommender model. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1370–1379.
69. Chen, D.; O’Bray, L.; Borgwardt, K. Structure-aware transformer for graph representation learning. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2022; pp. 3469–3489.
70. Wu, Q.; Yang, C.; Zhao, W.; He, Y.; Wipf, D.; Yan, J. Difformer: Scalable (graph) transformers induced by energy constrained diffusion. arXiv 2023, arXiv:2301.09474.
71. Qiu, Y.; Zhang, K.; Wang, C.; Luo, W.; Li, H.; Jin, Z. Mb-taylorformer: Multi-branch efficient transformer expanded by taylor formula for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12802–12813.
72. Babiloni, F.; Marras, I.; Deng, J.; Kokkinos, F.; Maggioni, M.; Chrysos, G.; Torr, P.; Zafeiriou, S. Linear Complexity Self-Attention with 3rd-Order Polynomials. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12726–12737.
73. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 5156–5165.
74. Nauen, T.C.; Palacio, S.; Dengel, A. Taylorshift: Shifting the complexity of self-attention from squared to linear (and back) using taylor-softmax. In Proceedings of the International Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2025; pp. 1–16.
75. Wei, Y.; Liu, W.; Liu, F.; Wang, X.; Nie, L.; Chua, T.S. Lightgt: A light graph transformer for multimedia recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 1508–1517.

Figure 1. The proposed TLFormer architecture, illustrated with its data flow, takes the interaction matrix R as input, processing it through two pathways: one generates representation embeddings via the TokenTable, while the other extracts structural encodings using SVD. These components are concatenated to form X, a combined embedding fed into TLFormer’s single-layer global attention mechanism, which is shown in Section 4.3 to operate without constructing an attention weight matrix. The resulting output, $\bar{X}$, captures representative features for all users and items, enabling a comprehensive view of user–item interactions.

Figure 2. Per-epoch runtime comparison of methods across different datasets.

Figure 3. Comparison of training loss on the Amazon-Movies and Pinterest datasets.

Figure 4. Comparison of Recall@20 across Amazon-Movies and Pinterest datasets.

Figure 5. Model performance w.r.t. the noise ratio. Each line shows the trend of Recall@20 as the ratio of noise varies.

Figure 6. Performance comparison over different item groups on Amazon-CDs and Amazon-Movies (y-axis in log scale).

Figure 7. Parameter sensitivity with regard to the tradeoff between $l_{alignment}$ and $l_{uniform}$ in TLFormer.

Table 1. Statistics of the four benchmark datasets.

Dataset	#User	#Item	#Inter.	Density
Pinterest	55.2k	9.9k	1500.8k	0.274%
Amazon-CDs	43.2k	35.6k	777.4k	0.051%
Amazon-Kindle	138.9k	98.7k	1910.0k	0.014%
Amazon-Movies	44.4k	25.9k	1070.9k	0.096%
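As a quick reading aid, the Density column can be recomputed from the other three columns, assuming the usual definition: interactions divided by the number of possible user–item pairs. The helper name below is hypothetical, and because the counts in Table 1 are rounded to 0.1k, the recomputed values may differ slightly from the printed ones.

```python
def density_percent(n_users, n_items, n_interactions):
    """Share of observed user-item pairs, expressed as a percentage."""
    return 100.0 * n_interactions / (n_users * n_items)

# (users, items, interactions), copied from Table 1 (rounded to 0.1k).
stats = {
    "Pinterest": (55.2e3, 9.9e3, 1500.8e3),
    "Amazon-CDs": (43.2e3, 35.6e3, 777.4e3),
    "Amazon-Kindle": (138.9e3, 98.7e3, 1910.0e3),
    "Amazon-Movies": (44.4e3, 25.9e3, 1070.9e3),
}

for name, (users, items, interactions) in stats.items():
    print(f"{name}: {density_percent(users, items, interactions):.3f}%")
```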

Table 2. Top-K recommendation performance on four datasets. The best results are in bold, and the runner-up results are underlined. Superscript ** denotes $p \leq 0.01$ in a paired t-test comparing TLFormer with the runner-up. Relative improvements are denoted as Improv.

Setting		Baseline Methods									Ours
Dataset	Metric	BPRMF	RecVAE	LightGCN	CLRec	SGL	GONet	DirectAU	MGFormer	BIGCF	TLFormer	Improv.
Pinterest	Recall@10	0.0840	0.1099	0.1028	0.1050	0.1070	0.1114	0.1170	0.1173	0.1163	0.1220 **	4.01%
	Recall@20	0.1393	0.1703	0.1616	0.1687	0.1687	0.1767	0.1834	0.1820	0.1831	0.1899 **	3.54%
	Recall@50	0.2568	0.2921	0.2800	0.3021	0.2948	0.3113	0.3136	0.3118	0.3135	0.3181 **	1.43%
	NDCG@10	0.0551	0.0755	0.0685	0.0704	0.0708	0.0746	0.0796	0.0754	0.0783	0.0818 **	2.76%
	NDCG@20	0.0730	0.0951	0.0876	0.0910	0.0908	0.0958	0.1011	0.0962	0.0999	0.1039 **	2.76%
	NDCG@50	0.1029	0.1261	0.1178	0.1249	0.1229	0.1310	0.1343	0.1340	0.1332	0.1365 **	1.64%
Amazon-CDs	Recall@10	0.0677	0.0830	0.0886	0.1154	0.0980	0.1165	0.1097	0.1152	0.1093	0.1253 **	7.55%
	Recall@20	0.1037	0.1189	0.1319	0.1634	0.1455	0.1655	0.1594	0.1615	0.1599	0.1741 **	5.20%
	Recall@50	0.1764	0.1865	0.2088	0.2489	0.2275	0.2525	0.2440	0.2465	0.2471	0.2547 **	0.87%
	NDCG@10	0.0391	0.0502	0.0526	0.0706	0.0587	0.0714	0.0661	0.0711	0.0654	0.0776 **	8.68%
	NDCG@20	0.0486	0.0598	0.0641	0.0832	0.0713	0.0843	0.0792	0.0828	0.0789	0.0904 **	7.24%
	NDCG@50	0.0642	0.0743	0.0806	0.1015	0.0889	0.1029	0.0974	0.1012	0.0976	0.1078 **	4.76%
Amazon-Kindle	Recall@10	0.0453	0.1613	0.1416	0.1486	0.1348	0.1360	0.1597	0.1800	0.1457	0.1982 **	10.11%
	Recall@20	0.0720	0.2037	0.1859	0.1995	0.1810	0.1881	0.2154	0.2314	0.2023	0.2496 **	7.87%
	Recall@50	0.1261	0.2692	0.2553	0.2818	0.2549	0.2667	0.3027	0.3112	0.2846	0.3270 **	5.08%
	NDCG@10	0.0252	0.1081	0.0891	0.0945	0.0828	0.0856	0.0994	0.1202	0.0868	0.1323 **	10.07%
	NDCG@20	0.0322	0.1195	0.1009	0.1080	0.0950	0.0953	0.1142	0.1302	0.1012	0.1460 **	12.14%
	NDCG@50	0.0438	0.1337	0.1158	0.1257	0.1108	0.1084	0.1330	0.1488	0.1192	0.1629 **	9.48%
Amazon-Movies	Recall@10	0.0532	0.0668	0.0654	0.0837	0.0708	0.0805	0.0789	0.0875	0.0866	0.0929 **	6.17%
	Recall@20	0.0857	0.1008	0.1003	0.1251	0.1084	0.1211	0.1194	0.1281	0.1267	0.1346 **	5.07%
	Recall@50	0.1525	0.1653	0.1656	0.2031	0.1812	0.1988	0.1944	0.2004	0.2023	0.2081 **	2.46%
	NDCG@10	0.0313	0.0407	0.0394	0.0514	0.0424	0.0487	0.0480	0.0538	0.0522	0.0573 **	6.51%
	NDCG@20	0.0400	0.0499	0.0488	0.0624	0.0526	0.0596	0.0589	0.0673	0.0661	0.0685 **	1.78%
	NDCG@50	0.0547	0.0641	0.0633	0.0796	0.0686	0.0766	0.0755	0.0801	0.0781	0.0847 **	5.74%
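The Improv. column is consistent with the relative gain of TLFormer over the best baseline in each row. A minimal sketch under that assumption follows; the function name is hypothetical, and the value pairs are copied from Table 2.

```python
def relative_improvement(ours, runner_up):
    """Relative gain of `ours` over `runner_up`, in percent."""
    return (ours - runner_up) / runner_up * 100.0

# (TLFormer, best baseline) pairs taken from Table 2.
rows = {
    ("Pinterest", "Recall@10"): (0.1220, 0.1173),      # runner-up: MGFormer
    ("Amazon-CDs", "NDCG@10"): (0.0776, 0.0714),       # runner-up: GONet
    ("Amazon-Kindle", "Recall@10"): (0.1982, 0.1800),  # runner-up: MGFormer
    ("Amazon-Movies", "Recall@50"): (0.2081, 0.2031),  # runner-up: CLRec
}

for (dataset, metric), (ours, runner_up) in rows.items():
    gain = relative_improvement(ours, runner_up)
    print(f"{dataset} {metric}: {gain:.2f}%")
```

Note that the runner-up changes from row to row, so the baseline value has to be picked per metric rather than per method.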

Table 3. Ablation results of TLFormer on Amazon-Movies and Pinterest. NoALL removes both the SVD-based structural encoding and the Taylor Linear Attention; NoSVD removes the SVD component; NoFormer removes the Taylor Linear Attention. The best results are shown in bold.

Models	Amazon-Movies		Pinterest
Models	R@20	N@20	R@20	N@20
NoALL	0.1159	0.0593	0.1757	0.0973
NoSVD	0.1253	0.0627	0.1814	0.0998
NoFormer	0.1247	0.0625	0.1772	0.0975
TLFormer	0.1346	0.0685	0.1899	0.1039
