This section surveys solutions that utilize various data types to predict the CTR. These solutions can be classified into graph-based approaches, feature-interaction-based methods, customer behavior techniques, and cross-domain approaches.
5.2.1. CTR Prediction Utilizing Graph Models
This subsection presents solutions that utilize graph models to predict the CTR. A comparison of these approaches using different criteria is shown in
Table 3. Liu et al. [
89] introduce an approach based on graph convolutional neural networks named graph convolutional network interaction (GCN-int). This approach facilitates the learning of the hard-to-comprehend interaction between various features, offers a decent interaction representation across high-order features, and enhances the explainability of feature interaction. The method is evaluated on two public datasets (i.e., Criteo and Avazu) and a customized dataset comprising internet protocol television (IPTV) movie recommendation records. The experimental results demonstrate the effectiveness of the proposed method in terms of accuracy and efficiency compared with existing methods, such as the attentional factorization machine (AFM) [
90] and DeepCrossing [
67]. However, the proposed method omits the weights of interactions between features and instead performs feature interactions with identical weight values. Zhang et al. [
91] propose a graph fusion reciprocal recommender (GFRR) approach, which can learn reciprocal information circulation across customers to predict pair matching. This approach can also learn structural information about customers’ historical behaviors and is based on a graph neural network (GNN). Compared with previous reciprocal recommender systems (RRSs) that concentrate only on reply prediction, this approach focuses on both transmit and response signals. Additionally, the authors present negative instance mining to investigate the impact of various kinds of instances on recommendation precision in real-world settings. The authors validate their approach on a real-world dataset, yielding solid results compared with those of previous works, such as latent factor for reciprocal recommender (LFRR) systems [
92] and deep feature interaction embedding (DFIE) [
93], with prediction results of 73.15% for the AUC and 26.01% for the average precision, good response prediction results of 68.95% for the AUC and 23.02% for the average precision, and proper fusion reciprocal prediction results of 71.26% for the AUC and 23.95% for the average precision. However, the only information utilized is the user’s profile and historical behavior; hence, user embeddings could be enriched with more information, such as social networks and interest features, to improve the recommender system.
Existing methods usually overcome the cold start and sparsity issues in collaborative filtering using side information such as knowledge graphs and social networks. Yang et al. [
94] address these limitations similarly: they introduce a knowledge-enhanced user multi-interest modeling (KEMIM) approach to act as a recommender system. The authors initially use the historical interactions between customers and items, which serve as the main component of the knowledge graph. They then model a customer’s explicit interests and use connection paths within the knowledge graph to broaden the customer’s potential interests. They analyze changes in customer interest and employ an attention mechanism to weigh the customer’s attention to each past interaction and potential interest. The authors subsequently concatenate the customer’s interests with the attribute features to address the cold start issue in an effective manner. The framework consists of structured data from a knowledge graph, which can describe the user’s characteristics in detail and provide understandable recommendation results to users. The framework is evaluated extensively on three publicly available datasets for two distinctive research problems: top-k recommendation and CTR prediction. The experimental results show that the method outperforms the state-of-the-art models on two datasets (i.e., Book-crossing [
102] and Last.FM), including knowledge-enhanced recommendation with feature interaction and intent-aware attention networks (FIRE) [
103], hierarchical knowledge and interest propagation networks (HKIPNs) [
104], and collaborative guidance for personalized recommendation (CG-KGR) [
105]. However, the method’s performance could be improved, particularly the knowledge extraction step of the knowledge graph-based recommender system, and an explainable recommendation model could be introduced. Li et al. [
95] introduce a graph factorization machine (GraphFM) that represents features in a graph structure. Specifically, the authors create a mechanism that selects meaningful feature interactions and designates them as edges between features. The framework then incorporates the FM’s interaction function into the GNN via a feature aggregation mechanism, which stacks layers to model arbitrary-order feature interactions on the graph-structured features. The authors validate their method with three public datasets (i.e., Criteo, Avazu, and MovieLens-1M) and compare it with several previous methods, such as higher-order FMs (HOFMs) [
106], adaptive factorization networks (AFNs) [
107] and FM2 [
108]. The proposed method outperforms the other techniques in terms of logloss and AUC.
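To make this concrete, the following PyTorch sketch illustrates the core GraphFM idea of aggregating FM-style pairwise interactions only along selected graph edges; the tensor shapes and the random edge mask are illustrative stand-ins for the paper's learned interaction-selection mechanism, not the authors' implementation.

```python
import torch

def graphfm_layer(feat_emb: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
    """One illustrative GraphFM-style aggregation step.

    feat_emb:  (num_fields, dim) embeddings, one node per feature field.
    adjacency: (num_fields, num_fields) 0/1 mask of selected interactions.
    Returns updated node embeddings of the same shape.
    """
    # FM-style pairwise interaction between every pair of feature nodes:
    # element-wise product v_i * v_j, shape (F, F, dim).
    pairwise = feat_emb.unsqueeze(1) * feat_emb.unsqueeze(0)
    # Keep only the interactions selected as graph edges.
    masked = pairwise * adjacency.unsqueeze(-1)
    # Aggregate each node's neighborhood (mean over selected neighbors).
    degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
    aggregated = masked.sum(dim=1) / degree
    # Residual update so that stacking layers models higher-order interactions.
    return feat_emb + aggregated

# Usage: three stacked layers capture progressively higher-order interactions.
emb = torch.randn(10, 16)                     # 10 feature fields, dim 16
adj = (torch.rand(10, 10) > 0.5).float()      # stand-in for learned edge selection
for _ in range(3):
    emb = graphfm_layer(emb, adj)
```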
The surge of multimedia sharing platforms such as TikTok has ignited heightened interest in online microvideos. These concise videos contain diverse multimedia elements, including visual, textual, and auditory components. Thus, merchants can enhance the user experience by incorporating microvideos into their advertising strategies. In many CTR prediction studies, item representations rely on unimodal content. A few studies concentrate on feature representations in a multimodal fashion; one of these approaches is hypergraph CTR (HyperCTR), which was proposed by [96]. The approach is inspired by hypergraph neural networks. A hypergraph generalizes the notion of an edge in graph theory [109,110]: a hyperedge can link more than two vertices. HyperCTR guides feature representation learning in a multimodal manner (i.e., textual, acoustic, and frame features) by leveraging temporal user-item interactions to comprehend user preferences.
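As a toy illustration (the users and hyperedges below are made up), a hypergraph can be stored as a node-hyperedge incidence matrix, and a simplified, unnormalized hypergraph convolution then propagates signals through the shared hyperedges:

```python
import numpy as np

# Nodes: users u0..u3. Hyperedges: groups of users who watched the same microvideo.
users = ["u0", "u1", "u2", "u3"]
hyperedges = {"v0": ["u0", "u1", "u2"],   # three users share microvideo v0
              "v1": ["u2", "u3"]}          # an ordinary pairwise edge is a special case

# Incidence matrix H: H[i, j] = 1 if user i belongs to hyperedge j.
H = np.zeros((len(users), len(hyperedges)))
for j, members in enumerate(hyperedges.values()):
    for u in members:
        H[users.index(u), j] = 1.0

# A hypergraph convolution propagates features as X' = H @ W_e @ H.T @ X
# (degree normalization omitted for brevity); W_e weights each hyperedge.
X = np.random.randn(len(users), 8)          # user feature vectors
W_e = np.eye(len(hyperedges))               # uniform hyperedge weights
X_new = H @ W_e @ H.T @ X                   # users sharing a video exchange signals
```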
Figure 4 depicts an example of applying the proposed method, where users u1 and u2 have interacted with various microvideos, for example, videos v1 and v2. A microvideo (e.g., v1) might be watched by more than one user (e.g., u1, u2, and u3) because of its exciting soundtrack. A group-aware hypergraph can be created from these signals, consisting of the various users interested in the same item. This interaction enables the proposed framework to connect multiple item nodes on a single edge via hyperedges. The hypergraph’s unique ability to utilize degree-free hyperedges allows it to capture pairwise connections and high-order data correlations effectively. This ability facilitates CTR prediction for microvideo items, as it can generate model-specific representations of users and microvideos to capture user preferences efficiently. The authors also develop a mutual network for time-aware user-item pairs to learn the correlation of intrinsic data [
111] (this approach is inspired by the success of self-supervised learning (SSL) [
112]), which addresses multimodal information. This process enriches each user-item representation with the generated interest-based user and item hypergraphs. The authors validate their proposed technique with three publicly available datasets: Kuaishou [
113], Micro-Video 1.7 [
114], and MovieLens-20M. The proposed method is compared with several state-of-the-art techniques, such as user behavior retrieval for CTR prediction (UBR4CTR) [
115] and automatic feature interaction selection (AutoFIS) [
116]. The results demonstrate its superiority over these methods.
As shown in
Figure 5a, Ariza-Casabona et al. [
97] propose a multidomain graph-based recommender (MAGRec), which uses graph neural networks to learn a multidomain representation of sequential customer interactions. Specifically, the customer c, the chosen user history representation, the target item’s domain, and the target item itself are fed as inputs into the model. The graph comprises edge features, denoted by target and source domains, and node features, denoted by item embeddings. The authors equip their method with temporal intradomain and interdomain interaction capabilities that act as contextual information. In a specific multidomain environment, the relationships are efficiently captured via two graph-based sequential representations that work simultaneously: a general sequence representation for long-term interest and a domain-guided representation for recent user interest. The proposed method effectively addresses the negative knowledge transfer issue and improves the sequential representation. The method is evaluated on the Amazon review dataset [
78], on which it outperforms baseline approaches such as the full graph neural network (FGNN) [
117] and multigate mixture-of-experts (MMoE) [
118].
Sang et al. [
98] introduce a framework named the adaptive graph interaction network (AdaGIN), which consists of three mechanisms: a multisemantic feature interaction module (MFIM), a graph neural network-based feature interaction module (GFIM), and a negative feedback-based search (NFS). The purpose of the MFIM is to obtain information from various semantic domains, while the purpose of integrating the GFIM is to combine information across features and evaluate their significance explicitly. The framework uses the NFS capability, which employs negative feedback to search for the optimal model complexity. The proposed method is validated on four publicly available datasets: Avazu, Frappe (which can be found at
http://baltrunas.info/research-menu/frappe, accessed on 31 July 2025), Criteo, and MovieLens-1M (can be found at
https://grouplens.org/datasets/movielens, accessed on 31 July 2025). The extensive evaluation proves that the proposed approach is more effective than previous methods in terms of logloss and AUC.
Shih et al. [
99] introduce a cluster-aware ranking-based bidding strategy (CARBS). This strategy evaluates the worth of each bid request by comparing it to a cluster of similar bid requests via a measure called the cluster expected win rate (CEWR). Bid requests with similar predicted CTRs are grouped into clusters using a two-step clustering mechanism to consolidate matching information. CARBS sets a clear affordability threshold, prioritizes spending, and ranks clusters to spend the budget wisely and efficiently. The evaluation shows that the CEWR correlates with empirical performance better than inaccurate individual CTR predictions do. The authors also introduce a bidding strategy based on reinforcement learning to modify the bid request expected win rate (BEWR); it is a hybrid mechanism that combines the CEWR and the dynamic market to derive the final bid prices. The authors evaluate their method with three real advertising campaigns, confirming its effectiveness. In
Figure 6a, the correlations between the average predicted CTR after utilizing the clustering techniques proposed by [
119,
120] are depicted alongside their empirical CTR counterparts for three advertising campaigns (1458, 3386, and 215). Ideally, if the average predicted CTR equals its empirical counterpart, the data points indicating clusters will lie on the diagonal dashed line. However, it is evident that in most cases, the predicted CTRs differ significantly from their empirical counterparts, suggesting that the CTR predictions are not correlated with the actual environments. In
Figure 6b, the correlations between the average predicted CTR after applying the proposed clustering method are shown alongside their empirical CTR counterparts on the same advertisement campaigns, which proves the effectiveness of the method, as most data points lie on the ideal line. Even in a hard-to-predict campaign with an exceptionally tight budget, the AUC is 0.73, representing an improvement of approximately 33% and indicating the effectiveness of this approach.
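The clustering idea can be sketched as follows on synthetic data; the use of KMeans and a single predicted-CTR feature are simplifying assumptions, as the paper's two-step mechanism is more elaborate.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pred_ctr = rng.beta(2, 50, size=10_000)          # predicted CTRs of bid requests
clicks = rng.random(10_000) < pred_ctr * 0.8     # synthetic click outcomes

# Step 1: group bid requests with similar predicted CTRs into clusters.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(
    pred_ctr.reshape(-1, 1))

# Step 2: per cluster, compare the mean predicted CTR with the empirical CTR;
# well-calibrated clusters fall on the diagonal of Figure 6.
for c in range(10):
    mask = labels == c
    print(f"cluster {c}: predicted={pred_ctr[mask].mean():.4f} "
          f"empirical={clicks[mask].mean():.4f}")
```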
Conventional CTR models utilize deep learning to train the model statically, and the network architecture parameters are identical across all the samples. Hence, these models face challenges in characterizing each sample, as they may stem from diverse underlying distributions. This limitation significantly impacts the CTR model’s representation capability, resulting in suboptimal outcomes. Yan et al. [
101] developed a new universal module known as adaptive parameter generation (APG), which aims to address this issue by dynamically generating parameters for CTR models based on different samples. As shown in
Figure 7a, when the authors add certain parameters, the model captures specific patterns for distinctive samples, particularly long-tailed samples. This figure analyzes the effects of different samples when these parameters are used. The participants were divided into ten groups of equal size based on frequency, with the frequency increasing from the first group to the tenth group. More formally, the basic version of the method dynamically generates the parameters W_i conditioned on the input-aware condition z_i; hence, W_i = G(z_i), where G denotes the adaptive parameter generation network. The produced parameters are subsequently fed into the deep CTR model, which is represented as ŷ = F(x; W_i), where F is the neural network and x represents the input features. The authors introduce three types of techniques to design the condition z_i (i.e., groupwise, mixedwise, and selfwise). Once the condition is obtained, the framework utilizes an MLP as G to produce the condition-dependent parameters, W_i = reshape(MLP(z_i)), where W_i are the adaptive parameters, z_i is the input-aware condition, and the reshape operation converts the vectors generated by the multilayer perceptron (MLP) into matrix form. Consequently, a layer of the CTR model that utilizes APG can be represented as h = σ(W_i x), where σ denotes the activation function. This basic version is time and memory inefficient and not particularly effective in pattern recognition. Thus, the authors present three refinements to solve these issues: low-rank parameterization, shared parameters, and overparameterization. Low-rank parameterization exploits a low-rank subspace: the authors posit that the adaptive parameters have a low intrinsic rank and therefore represent the weight matrix W_i as the product of three matrices, W_i = U_i S_i V_i, where S_i is a small K × K matrix. In the shared-parameters version, the framework decomposes the weight matrix into the three submatrices U, S_i, and V, sharing U and V across all samples and generating only S_i per sample. Finally, the authors introduce the overparameterized version, which enlarges the capacity of the model by increasing the number of shared parameters: each shared matrix of the i-th hidden layer is replaced by the product of two larger matrices, e.g., U = U_l U_r, and likewise for V.
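A minimal PyTorch sketch of the low-rank, shared-parameter variant is given below; the dimensions and the condition encoder are placeholders rather than the production implementation. Only the small K × K core is generated per sample, and the product U S V x is computed right to left, mirroring the decomposed feed-forwarding idea.

```python
import torch
import torch.nn as nn

class APGLinear(nn.Module):
    """Adaptive layer: W_i = U @ S(z_i) @ V, y = act(W_i x)."""
    def __init__(self, n_in: int, n_out: int, k: int, cond_dim: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_out, k) * 0.02)  # shared across samples
        self.V = nn.Parameter(torch.randn(k, n_in) * 0.02)   # shared across samples
        self.gen = nn.Linear(cond_dim, k * k)                # generates S_i per sample
        self.k = k

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_in) features; z: (batch, cond_dim) input-aware condition.
        S = self.gen(z).view(-1, self.k, self.k)             # (batch, k, k)
        # Decomposed feed-forward: apply V, then S, then U, never
        # materializing the full per-sample weight matrix W_i.
        h = torch.einsum("ki,bi->bk", self.V, x)             # V x
        h = torch.bmm(S, h.unsqueeze(-1)).squeeze(-1)        # S (V x)
        return torch.relu(torch.einsum("ok,bk->bo", self.U, h))

layer = APGLinear(n_in=64, n_out=32, k=8, cond_dim=16)
y = layer(torch.randn(4, 64), torch.randn(4, 16))            # (4, 32)
```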
Figure 7b depicts the final APG framework without the decomposed feed-forwarding mechanism. The authors then evaluate the performance via AUC and CTR gains for distinctive groups. As shown in
Figure 8a,b, the authors observe that group nine, the participants with the highest frequency, generates more than half of the total samples even though it contains only 10% of all participants. The adaptive parameters contribute more to the performance of low-frequency participants (for example, participants in group zero), as they result in higher CTR and AUC gains. Therefore, these parameters allow low-frequency samples to adequately represent their features, leading to improved performance. This approach adapts model parameters to better fit the characteristics of diverse data samples, potentially improving the model’s performance across various scenarios. The authors conducted multiple experiments to evaluate their proposed technique and incorporated it as a capability in several deep learning models. The evaluation demonstrated that the technique significantly improves the CTR performance of the deep models. Additionally, the proposed method reduced time costs by 38.7% and memory usage by 96.6% compared with a standard deep CTR model. Furthermore, the model was deployed in a real environment, resulting in a 3% increase in CTR and a 1% gain in revenue per mille (RPM).
Graph-based approaches (e.g., GCN-int [
89], GraphFM [
95], HyperCTR [
96]) represent the current state-of-the-art in handling large-scale structured data. They are highly effective in capturing high-order feature interactions, modeling user-item relations in non-Euclidean spaces, and delivering strong performance on industrial datasets such as Criteo and Avazu. Their scalability and predictive accuracy make them highly deployable, though preprocessing pipelines and graph construction can be complex. A clear trend is the integration of multimodal signals into graph frameworks, enabling richer representations of users and items.
Table 4 illustrates representative graph-based CTR methods. Graph-based CTR models are effective for large, structured datasets, such as Criteo and Avazu, where feature interactions significantly impact prediction accuracy. They are ideal for industrial applications that require high AUC and the handling of sparse features. However, practitioners must weigh their computational cost and engineering complexity against simpler baselines. Graph-based models have some limitations. They require extensive preprocessing and graph construction, with some models, such as HyperCTR, necessitating a substantial amount of GPU hours, which limits their scalability for real-time use. Additionally, many of these models assume uniform or static weights for feature interactions. This assumption may not be valid in the context of dynamic industrial environments, where conditions can change rapidly.
The rise of multimodal data—such as text, images, and behavioral signals—is transforming CTR prediction. Traditional text-only approaches struggle to fully capture user intent. Recent studies demonstrate that integrating different data types enables models to learn more effective representations by communicating sentiment through text, showcasing product attractiveness with images, and representing interaction dynamics through behavioral logs. For example, ASKAT [
76] leverages graph attention networks to combine textual sentiment features with user interaction data, whereas BAHE [
77] aggregates multimodal behavioral patterns (e.g., search logs, mini-program visits, and item titles) at an industrial scale, significantly reducing redundancy in representation learning. Similarly, HyperCTR [
96] uses hypergraph neural networks to merge textual, acoustic, and visual frame-level features for microvideo CTR prediction, attaining considerable improvements in AUC and log loss on datasets like Kuaishou and MovieLens.
Multimodal frameworks show that combining different modalities offers complementary benefits. They enhance content understanding by merging visual and textual embeddings, while sequential behavior models effectively manage temporal dependencies. This holistic approach strengthens CTR models, making them more robust against data sparsity, addressing cold-start challenges, and enhancing personalization in recommendations. However, the integration of these modalities also introduces computational complexity, requiring extensive parallel training and sophisticated fusion techniques, such as attention-based late fusion and cross-modal transformers. Future directions point toward end-to-end multimodal representation learning with eXplainable AI (XAI) components to enhance interpretability, scalability, and industrial deployability.
5.2.2. Cross-Domain CTR Prediction Methods
This subsection introduces the approaches that transfer knowledge across domains to predict the CTR.
Table 5 compares these approaches via different assessment measures. Liu et al. [
121] introduced a groundbreaking approach to continual transfer learning (CTL), a field that has received relatively limited attention from researchers. CTL focuses on transferring knowledge from a source domain that evolves over time to a target domain that also changes dynamically. By addressing this underexplored aspect of transfer learning, this work (i.e., CTNet) has the potential to significantly advance how knowledge can be effectively conveyed and utilized in evolving environments. The main idea of this approach is to process the representations of the source domain as transferred knowledge for target domain CTR prediction. Thus, the target and source domain parameters are continuously reused and retained during knowledge transfer. This approach outperforms other methods, such as the knowledge extraction and plugging (KEEP) [
122] method and progressive layered extraction (PLE) [
123]. It has been evaluated via extensive offline experiments, where it yielded significant enhancements. It is now utilized online at Taobao (a Chinese e-commerce platform).
An et al. [
124] introduce the disentangle-based distillation framework for cross-domain recommendation (DDCDR), a cutting-edge approach operating at the representational level. This approach is based on the teacher-student knowledge distillation theory. The proposed method first creates a teacher model that operates across different domains. This model undergoes adversarial training side by side with a domain discriminator. Then, a student model is constructed for the target domain. The trained domain discriminator detaches the domain-shared representations from the domain-specific representations. The teacher model effectively directs the domain-shared feature learning process, whereas contrastive learning approaches significantly enrich the domain-specific features. The method is evaluated thoroughly on two publicly available datasets (i.e., Douban and Amazon) and a real-world dataset (i.e., Ant Marketing). The evaluation phase demonstrates the method’s effectiveness, which achieved a new state-of-the-art performance compared with previous methods such as the collaborative cross-domain transfer learning (CCTL) framework [
125] and disentangled representations for cross-domain recommendation (DisenCDR) [
126]. The deployment of the technique on an e-commerce platform proves the efficiency of the method, which yields improvements of 0.33% and 0.45% compared with the baseline models in terms of unique visitor CTRs in two different recommendation scenarios.
Table 5.
Comparison of the cross-domain approaches to CTR prediction using multivariate data.
| Model | Key Idea | Dataset | Performance | Advantages (+)/Limitations (−) |
|---|---|---|---|---|
| CTNet [121] | Processes the source-domain representations as transferred knowledge for the target domain; the source and target domain parameters are thus continuously reused and retained during knowledge transfer | Taobao production: three domains (A, B, and C); the aim is to validate the framework’s transfer effectiveness from A to B and from A to C. Dataset A contains 150B samples, B contains 2B, and C contains 1B | 0.7474 AUC and 0.6888 GAUC from domain A to B; 0.7451 AUC and 0.7040 GAUC from domain A to C | − Assumes homogeneous features, so heterogeneous input features might affect performance in practice (i.e., two domains with different feature fields; e.g., an image retrieval technique relies on image features while a text retrieval method relies on features preprocessed from text) |
| DDCDR [124] | Based on teacher-student knowledge distillation; constructs a teacher model that executes across different domains, trains it adversarially with a domain discriminator, and creates a student model for the target domain | Douban (1.5 M samples), Amazon (1.9 M samples), and Ant Marketing (40 M samples) | 0.6350, 0.6602, and 0.8096 AUC on Douban, Amazon, and Ant Marketing, respectively | + Filters useful information for transfer and strengthens domain-specific representation and sampling, leading to superior performance in practice |
| DASL [127] | Transfers information between two relevant domains iteratively until the learning process stabilizes, utilizing dual attention and dual embedding mechanisms | Imhonet (223 M book records and 51 M movie records), Amazon (2.3 M toy records and 1.3 M video game records), and Youku (11.6 M TV show records and 19.2 M short video records) | 0.8375 and 0.8380 AUC on Imhonet (books and movies, respectively); 0.8520 and 0.8511 AUC on Amazon (toys and video games, respectively); 0.8825 and 0.8635 AUC on Youku (TV shows and short videos, respectively) | − Applicable only to domain pairs; could be extended to supply recommendations across various domains |
| MAN [128] | Utilizes global and local encoding layers to capture cross-domain and domain-specific sequential patterns; applies a mixed attention layer to obtain local/global item similarity, integrate item sequences, and capture customer groups in different domains | Micro Video (A and B) and Amazon (video games and toys) | 0.8285 and 0.8094 AUC and 0.6167 and 0.5756 MRR on Micro Video A and B, respectively; 0.6559 and 0.6712 AUC and 0.4755 and 0.6385 MRR on Amazon video games and toys, respectively | − Needs further evaluation in online A/B tests to prove its effectiveness |
| Park et al. [129] | Maintains gradient flows across domains with significant negative transfer by dynamically assigning the estimated negative transfer as a weight factor to the prediction loss | Amazon [130] (105,364 users) and Telco (99,936 users) | 0.3398 MRR@10 and 0.3838 NDCG@10 on Amazon; 0.7366 MRR@10 and 0.7802 NDCG@10 on Telco | + Deployed in a personal assistant app service and outperformed previous works with a 21.4% increase in CTR prediction |
| MACD [131] | Develops an architecture that considers users’ varying interests, including a capability that investigates potential customers’ interests, and uses a contrastive information regularizer to filter out background noise | Amazon dataset and A/B test | Average exposure enhanced by about 10%, CVR by about 1.5%, and conversion rate by about 6% | + Tested on a financial platform for fourteen days, proving its effectiveness |
| DCN [132] | Utilizes a DNN with FTRL augmentation to predict CTR; uses SMOTE oversampling to balance the dataset and improve performance | Four iPinyou sub-datasets comprising 10 M training samples and 2 M testing samples | 0.8338, 0.8969, 0.8040, and 0.8574 AUC and 0.448, 0.4697, 0.5891, and 0.4564 logloss on the 1st to 4th sub-datasets, respectively | − Could be enhanced by adding sequential features in the feature engineering phase and utilizing advanced networks such as transformers to extract high-order feature combinations |
Li et al. [
127] introduce a cross-domain sequential recommendation approach that conveys information between two relevant domains iteratively until the learning process stabilizes. This approach uses a dual-learning capability called dual attentive sequential learning (DASL), which comprises two elements: dual attention and dual embedding. These two components work together to create a two-phase learning process. First, they create dual latent embeddings that capture customer preferences from both domains. Then, they utilize these embeddings to provide cross-domain recommendations by matching them with suggested items. To evaluate their method, the authors conducted extensive experiments utilizing three datasets (i.e., Imhonet [
133], Amazon [
78], and Alibaba-Youku datasets). The proposed method is demonstrated to be superior to baseline models such as the mixed interest network (MiNet) [
134] and collaborative cross network (CoNet) [
135] on the three datasets.
Well-known cross-domain sequential recommendation solutions such as DASL [
127] and π-Net [136] share a common limitation: they rely heavily on overlapping customers in distinct domains, which makes them difficult to deploy in practical recommender systems. Therefore, Lin et al. [
128] introduce a mixed attention network (MAN) with global and local attention capabilities that obtains cross-domain and domain-specific information. The authors employ a global/local encoding layer to extract the cross-domain and specific-domain sequential patterns. Additionally, to obtain the local/global item similarity, integrate the item sequence, and capture the customer groups in distinct domains, the authors leverage a mixed attention layer that consists of sequence-fusion attention, item similarity attention, and group-prototype attention. Cross-domain and specific-domain interests are incorporated via a global/local prediction layer. Two datasets are used to validate the proposed method; each dataset contains information from the two domains. The experimental results demonstrate the effectiveness of the proposed method compared with other similar methods.
Park et al. [
129] introduce a cross-domain sequential recommendation approach to address the negative transfer issue. The newly introduced technique involves estimating the level of negative transfer to preserve gradient flows across domains characterized by significant negative transfer. This estimation is achieved by dynamically assigning the negative transfer as a weight factor in the prediction loss. The authors assess the performance of a model trained on cross-domains to investigate the negative transfer of two distinct domain settings to demonstrate the effectiveness of the proposed asymmetric cooperative network. They then compare its performance with that of a different model trained on a specific domain. The authors also present an auxiliary loss capability to maximize the collective information between the representation entities in a per-domain setting; in this way, the transfer of meaningful signals between cross-domain and specific-domain sequential recommendations is facilitated. This process of collective learning, involving specific domains and cross-domains, is similar to the cooperative dynamics between pacers and runners in long-distance races. Thorough experiments with two real-world datasets across multiple service domains indicate that the proposed model outperforms other methods, highlighting its superiority and efficacy. The proposed method has been deployed in a personal assistant app service to demonstrate its effectiveness for recommendation systems and showed a 21.4% CTR increase over other methods such as the context and attribute-aware recommender (CARCA) model [
137] and mixed information flow network (MIFN) [
138].
Xu et al. [
131] introduce model-agnostic contrastive denoising (MACD) to efficiently predict the CTR. This approach implements an auxiliary behavior sequence information capability to investigate conceivable customers’ interests. Researchers have created a specialized architecture that considers users’ varying interests, combined with a contrastive information regularizer, to effectively filter out background noise from secondary behaviors and gain insights into customers’ diverse interests. The authors rigorously conduct experiments on real-world datasets to affirm their method’s effectiveness unequivocally. The proposed method outperforms state-of-the-art methods such as the self-attention-based sequential model (SASRec) [
139] and Bert4rec [
140] on more than one performance metric.
Huang et al. [
132] introduce a DNN-based approach to enhance CTR prediction performance. The authors specifically utilize the deep and cross-network (DCN), supplemented with an optimization technique known as follow the regularized leader (FTRL). To balance the dataset and address noise, the authors use an oversampling technique known as SMOTE to increase the number of minority-class samples and improve performance. The authors conduct extensive experiments using five subsets of the iPinYou (can be found at
https://contest.ipinyou.com/ accessed on 31 July 2025) dataset. The results demonstrate the effectiveness of the proposed method compared with other methods.
Figure 9 compares the performance of the proposed method on one of the subsets with that of the deep and cross-network (DCN) [
27] model, DCN with the FTRL mechanism, and the complete framework (i.e., FO-FTRL-DCN), including feature optimization (FO).
Figure 9a compares the effectiveness of the proposed framework with that of the other methods for 40 iterations in terms of logloss. In contrast,
Figure 9b compares the performance of the proposed framework with that of the other methods for 40 iterations in terms of the AUC. This figure demonstrates that the proposed framework converged more quickly than the other methods due to the optimization mechanism and the oversampling technique. One of the most challenging issues in the CTR prediction task is class imbalance: nonclicked samples are significantly more numerous than clicked samples.
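The balancing step can be sketched with the imbalanced-learn library on synthetic data (the feature matrix and click rate below are illustrative):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # dense features of impressions
y = (rng.random(10_000) < 0.03).astype(int)       # ~3% clicks: heavy imbalance

print("before:", np.bincount(y))                  # minority class is tiny
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", np.bincount(y_bal))              # both classes equal in size
# X_bal/y_bal would then feed the FO-FTRL-DCN training stage.
```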
Cross-domain approaches (e.g., CTNet [
121], DDCDR [
124]) emphasize the need for continual learning and knowledge transfer to address sparsity and dynamic environments. Multimodal frameworks such as HyperCTR [
96] and ASKAT [
76] demonstrate the benefits of fusing text, images, and behavior logs, which not only improve accuracy but also enhance robustness against cold-start scenarios. A major direction emerging from these studies is the integration of multimodal and cross-domain approaches as the foundation for next-generation CTR prediction.
Table 6 shows representative cross-domain CTR methods. These methods are most valuable for multi-service platforms (e.g., Taobao, Amazon) where sparsity in one domain can be mitigated by knowledge transfer. They are less suitable when domains are highly heterogeneous or user overlap is minimal. However, cross-domain models often rely heavily on overlapping users across domains (DASL, MAN), face negative transfer (CTNet), and require large labeled datasets (DDCDR).
5.2.3. Customer Behavior-Based Approaches
This subsection presents the methods used to study user behavior to predict CTR via multivariate data. A comparison of these approaches using different evaluation metrics is shown in
Table 7. Guo et al. [
141] present a multi-interest self-supervised learning (MISS) technique to improve feature embedding via designated signals called interest-level self-supervision. The authors employ two extractors based on a convolutional neural network to explore self-supervision signals, considering various interest representations either unionwise or pointwise, long- and short-range interest dependencies, and inter- and intraitem interest correlations. Then, the authors use contrastive learning losses to enhance feature representation learning by augmenting the views of interest representations. The proposed method can also be added to existing methods as a plug-in capability to improve their effectiveness. The authors evaluate the framework via three publicly available datasets, which demonstrates its superiority over state-of-the-art techniques such as the search-based interest model (SIM) [
142] and the deep match-to-rank (DMR) model [
143] (i.e., it improves the AUC by 13.55%).
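The contrastive component can be illustrated with a standard InfoNCE loss over two augmented views of the same interest representation; this is a generic sketch rather than the paper's exact extractor design.

```python
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, tau: float = 0.1):
    """view_a, view_b: (batch, dim) two augmented views of users' interests.
    Matching rows are positives; all other rows in the batch act as negatives."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / tau                       # (batch, batch) similarities
    targets = torch.arange(a.size(0))              # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))
```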
Lin et al. [
144] introduce the sparse attentive memory (SAM) approach to address the complexity of modeling lengthy sequential customer behavior. The method is designed to be highly efficient for training and real-time inference on user behavior sequences with lengths on the order of thousands, implying that it can handle large-scale user behavior data without significant computational bottlenecks. The proposed method adopts a strategy where the specific item of interest is regarded as the query and the lengthy sequence is utilized as the knowledge database. This design enables the item of interest to consistently trigger the extraction of valuable and relevant information from the lengthy sequence. The authors conduct comprehensive experiments demonstrating the method’s effectiveness on both long and short user behavior sequences. The proposed method is applied to an international e-commerce platform that uses sequences of length 1000; its efficiency is high, with an inference time within 30 ms when deployed on GPU clusters, and prediction performance improves significantly, by 7.30%.
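In simplified form, the retrieval strategy reduces to single-query attention in which the candidate item queries the long behavior sequence (shapes illustrative; SAM's memory design adds more machinery):

```python
import torch
import torch.nn.functional as F

def target_query_attention(target: torch.Tensor, seq: torch.Tensor) -> torch.Tensor:
    """target: (batch, dim) candidate-item embedding used as the query.
    seq: (batch, seq_len, dim) long user-behavior sequence as the knowledge base.
    Returns a (batch, dim) summary of behaviors relevant to the target item."""
    scores = torch.einsum("bd,bld->bl", target, seq) / seq.size(-1) ** 0.5
    weights = F.softmax(scores, dim=1)             # attention over ~1000 behaviors
    return torch.einsum("bl,bld->bd", weights, seq)

summary = target_query_attention(torch.randn(8, 32), torch.randn(8, 1000, 32))
```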
Previous works that utilize customers’ interests, such as the DIN [
23], deep session interest network (DSIN) [
150], and deep interest evolution network (DIEN) [
22], have accomplished decent results in practice. Nevertheless, these approaches rely excessively on filtering customers’ historical behavior sequences while omitting context features, leading to decreased recommendation effectiveness. Yu et al. [
145] introduce a deep filter context network (DFCN) approach to address this challenge. This approach employs an attention capability to integrate a filter that refines data related to the customer’s historical sequence that varies significantly from the target ad. The proposed framework is attentive to the context features that alternate across two local activation units. The authors validated their work by utilizing Taobao user and Amazon user datasets. The experimental results demonstrate the approach’s effectiveness compared with the authors’ previously proposed technique (i.e., deep interest context network (DICN) [
151]) in terms of the AUC. Wei et al. [
146] present a deep adaptive interest network (DAIN) to predict CTR in the global view and local view. The authors first create a local attention capability to adaptively compute customer interest representations and obtain customer interest from candidate advertisements and customer behaviors. Then, they design a feature interaction extractor, including FM and multilayer perceptron (MLP) mechanisms, which are responsible for obtaining low- and high-order feature interactions. The authors subsequently utilize a linear-based global attention capability attached to the feature interaction extractor to adaptively learn the effect of low- and high-order feature interactions concerning the target item. The proposed framework is evaluated with three subsets of the Amazon dataset, namely, electronics, beauty, and office_products. The results demonstrate its effectiveness compared with various baseline models, such as the graph intention network (GIN) approach [
152].
Xue et al. [
147] introduce an interactive attention-based capsule (IACaps) architecture to explore complex and varying click information for customer behavior representation. The model’s core is an interactive attention dynamic routing capability utilized to mine the conceivable linkages across various browsing behaviors. Thus, this capability enables the extraction and interpretability of apparently irrelevant information invisible in enormous amounts of click data. To ensure the method’s deployability in real applications, the authors evaluate it with three subdatasets from the Amazon dataset. The proposed method is compared with various techniques, such as the deep user match network (DUMN) [
153], deep multi-interest network (DMIN) [
154], and DRINK [
148]. The results demonstrate its superiority over these methods on four performance metrics: accuracy, F1 score, logloss, and AUC.
Advertising data comprise many features, and the volume of data is expanding at a remarkable pace. This issue can be addressed by implementing customer segmentation according to shared interests. Kim et al. [
43] suggest that it is possible to forecast a customer’s changing interests based on the changing interests of other customers. Customers with shared interests are likely to change their interests in a similar direction. Based on this assumption, the authors present a deep user segment interest network (DUSIN) approach to enhance CTR prediction. The proposed framework consists of three layers: customer and segment interest extractors and segment interest activation. The purpose of these layers is to capture each customer’s hidden interests and create a comprehensive interest profile for the segment by combining the interests of each customer. The authors perform a random undersampling technique because the dataset is imbalanced (i.e., the number of nonclicked instances is greater than the number of click instances). The authors evaluated their framework via the TaoBao dataset (i.e., real industrial data). The experiments demonstrate the effectiveness of the framework, which improved CTR prediction compared with two baseline models (i.e., the deep interest network (DIN) [
23] and DIN with dynamic time warping (DTW) [
155]). As shown in
Figure 10, as the behavior sequence length increases above 30, the proposed framework outperforms the baseline models in terms of the area under the curve (AUC), achieving an AUC gain of 0.0029 at a behavior sequence length of 100. Compared with the baseline approaches, the framework’s performance demonstrates its effectiveness, making it potentially useful for business deployment. Zhang et al. [
148] introduce a deep multirepresentational item network (DRINK) to predict the CTR. To address the sparse customer behavior issue, the authors represent the target item as a sequence of interacting customers and timestamps. Additionally, the authors present a transformer-based item architecture comprising multiclass and global item representation minimodules. The authors also introduce a mechanism to disassemble the item behavior and the time information to avert overwhelming the information. The mechanism outputs are combined and input into an MLP layer to train the CTR model. The proposed method is evaluated through extensive experiments using the Amazon subdatasets (i.e., grocery, beauty, and sports) and outperforms other methods, such as the deep time-aware item evolution network (TIEN) [
80].
An innovative research method known as automatic feature interaction learning (AutoInt) [
34] creates a mechanism based on multihead attention that merges features. However, this method does not fully capture meaningful high-order features and neglects customer privacy preservation. To address these challenges, Tian et al. [
149] introduce a differential privacy bidirectional long short-term memory (DP-Bi-LSTM) approach to enhance AutoInt. The proposed framework comprises an embedded layer and the Bi-LSTM. The Bi-LSTM captures the nonlinear connection across customer click behaviors and creates high-order features. Additionally, the authors utilize a differential privacy mechanism to preserve customer privacy. The authors also adopt a Gaussian capability to randomly perturb the gradient descent model used in the framework. The authors evaluated their framework via a publicly available dataset called Criteo. The proposed method showed higher effectiveness compared with AutoInt, improving the performance by 0.65%. The proposed method is also highly secure and reliable compared with the AutoInt approach.
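The privacy mechanism follows the familiar Gaussian-perturbation pattern sketched below: bound the gradient norm, then add calibrated Gaussian noise before the update. This is a generic DP-SGD-style sketch, not the paper's exact calibration; note that full DP-SGD clips per-example gradients rather than the aggregate gradient shown here.

```python
import torch

def dp_sgd_step(params, lr: float = 0.01, clip: float = 1.0, sigma: float = 0.5):
    """One differentially private update applied to already-computed gradients."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            # Bound the gradient norm so no single update dominates.
            scale = min(1.0, clip / (p.grad.norm().item() + 1e-12))
            p.grad.mul_(scale)
            # Gaussian mechanism: randomly perturb the clipped gradient.
            p.grad.add_(torch.randn_like(p.grad) * sigma * clip)
            p.add_(p.grad, alpha=-lr)

w = torch.randn(16, requires_grad=True)      # stand-in model parameter
loss = (w ** 2).sum()
loss.backward()
dp_sgd_step([w])
```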
Table 8 summarizes representative behavior-based methods. These methods share common limitations: they struggle with very sparse users and evolving interests, incur high memory costs when sequences are long, and often omit multimodal context beyond clicks. The key ideas can be summarized as follows: behavior-based methods are most useful on platforms with rich sequential logs (e.g., Amazon, Taobao). They are best applied to personalization tasks where modeling user history is key, but they require scalable architectures (e.g., transformer- or capsule-based) to manage long sequences.
5.2.4. Feature Interaction-Based Methods
This subsection surveys the approaches that take advantage of feature interactions to improve the effectiveness of CTR prediction methods.
Table 9 compares these approaches on various criteria. Li et al. [
156] introduce an innovative approach to address a challenge encountered in previous research. Their method focuses on overcoming the performance bottleneck of implicit feature interactions without relying on explicit feature interactions. Well-known deep CTR models that present parallel architectures obtain information from various semantic spaces. The subcomponents of parallel architecture-based models encounter difficulties because they lack supervision and communication signals. This limitation makes it challenging to capture meaningful multiview feature interaction information effectively across various semantic spaces. To solve this issue, the authors present the contrast-enhanced through network (CETN), which captures valuable multiview feature interaction information across multiple semantic spaces. The approach is rooted in a sociological concept that harnesses the synergy between diversity and homogeneity to enhance the model’s ability to acquire more refined and high-quality feature interaction information. The illustration on the left-hand side of
Figure 11 shows that when the feature interaction information learned in different semantic spaces varies too much, excessive diversity results: distinct subspaces acquire significantly different structures, producing excessively large angles between their representation vectors. The CETN leverages product-based feature interactions and the concept of augmentation from contrastive learning to segment the semantic spaces, each with its own distinct activation functions. Specifically, this approach enhances the diversity of the feature interaction information obtained by the model. Furthermore, each semantic space is equipped with self-supervised signals and connections to guarantee the uniformity of the captured feature interaction information. The authors validated their model via four datasets, demonstrating its superiority over baseline methods such as MaskNet [
157] and the model-agnostic contrastive learning for CTR (CL4CTR) method [
158] in terms of the logloss and AUC.
Lyu et al. [
159] present a new approach named optimizing feature set (OptFS), which unifies feature selection and its corresponding interaction. As shown in
Figure 5b, the authors separate each feature selection interaction into two correlated feature selections to analyze the relationships between different features comprehensively. This separation enables the model to be trained from end to end via several feature interaction procedures. The authors use a feature-level search space to allow a learnable gate to determine whether a given feature
f should be included in the feature set
F. The experimental results demonstrate the ability of the proposed model to create a feature set that consists of features that enhance the prediction performance. The authors evaluated their approach on three public datasets (i.e., Criteo (can be downloaded from
https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/, accessed on 1 August 2025), Avazu (can be downloaded from
https://www.kaggle.com/c/avazu-ctr-prediction/data, accessed on 1 August 2025), and KDD12 (can be found at
https://www.kdd.org/kdd-cup/view/kdd-cup-2012-track-2, accessed on 1 August 2025)), demonstrating its high performance in terms of prediction, computational cost, and storage.
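The learnable-gate idea can be sketched as one sigmoid gate per feature field, keeping the selection end-to-end differentiable (a simplified reading of OptFS; its gating and retraining procedure is more involved):

```python
import torch
import torch.nn as nn

class GatedFeatureSet(nn.Module):
    """One learnable gate per feature field; near-zero gates drop the feature."""
    def __init__(self, num_fields: int):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(num_fields))

    def forward(self, field_emb: torch.Tensor) -> torch.Tensor:
        # field_emb: (batch, num_fields, dim). Sigmoid keeps each gate in (0, 1);
        # a gate pushed toward 0 removes the feature, and with it every
        # interaction that would have involved it, from the feature set.
        g = torch.sigmoid(self.gates).view(1, -1, 1)
        return field_emb * g

gated = GatedFeatureSet(num_fields=24)(torch.randn(4, 24, 16))
```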
Conventional CTR techniques attempt to enhance prediction via extensive feature engineering. Although these methods have shown some success, they are time-consuming, and it is difficult to deploy them in industrial environments. It is vital to take full advantage of minimal features and extract efficient feature interactions to overcome the drawback of the learning process of either sparse or high-dimensional features. Wang et al. [
160] present a method called mutual information and feature interaction (MiFiNN) to solve these issues. Each sparse feature weight is computed from the mutual information of that feature and the click result. Then, the authors utilize an interactive mechanism that merges the inner and outer products to extract the feature interaction. The extracted feature interactions and the original input-dense features are subsequently fed into the DNN as inputs. The authors compared their model with well-known models such as FiBiNET [
33] using four datasets. The results show that their method outperforms the other approaches.
Wang and Dong [
161] introduce a framework to quantify the proposed model uncertainty that generates reliable and precise results. The framework merges feature interaction and selection capabilities based on Bayesian deep learning, which is named FiBDL. The authors utilize the DNN parallel mechanism and squeeze network for prediction, and the Monte Carlo dropout capability is employed to extract the approximate posterior parameter distribution of the model. Two types of uncertainties, aleatoric and epistemic, are identified, and information entropy is adopted to compute the aggregate of these types. Mutual information can be utilized to estimate epistemic uncertainty. The proposed framework is evaluated via three publicly available datasets (i.e., Taobao (can be found at
http://www.comp.hkbu.edu.hk/~lichen/download/TaoBao_Serendipity_Dataset.html, accessed on 1 August 2025), Avazu, and ICME). Its superiority compared with previous methods, such as deep field-embedded factorization machine (DeepFEFM) [
168] and extreme cross network (XCrossNet) [
169], is demonstrated in terms of the prediction performance and efficient uncertainty quantification. Lin et al. [
162] introduce a model-agnostic framework (MAP) that consists of two feature detection algorithms, namely, replaced feature detection (RFD) and masked feature prediction (MFP), to recover and break down multifield categorical data. On the one hand, MFP explores feature interactions within each sample by masking and identifying input features and presents noise contrastive estimation (NCE) to address large feature spaces. On the other hand, RFD transforms MFP into a binary classification task utilizing input features to apply replacement and detection transformations, making it more straightforward and more adequate for CTR pretraining. The proposed method is evaluated via two widely used public datasets (Criteo and Avazu). The results demonstrate its efficiency and effectiveness in predicting CTR via backbone approaches such as DeepFM [
170] and DCNv2 [
171].
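The MFP objective can be sketched as BERT-style masking over categorical feature fields (simplified; the mask token ID and field sizes below are placeholders, and the paper adds noise contrastive estimation for large feature spaces):

```python
import torch

def mask_fields(field_ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """field_ids: (batch, num_fields) integer categorical features.
    Returns the corrupted input plus the positions and labels to reconstruct."""
    mask = torch.rand_like(field_ids, dtype=torch.float) < p
    corrupted = field_ids.masked_fill(mask, mask_id)
    return corrupted, mask, field_ids[mask]        # labels at masked positions

ids = torch.randint(0, 1000, (8, 24))
corrupted, mask, labels = mask_fields(ids, mask_id=1000)
# A CTR backbone (e.g., DeepFM) would embed `corrupted` and be pretrained to
# predict `labels`; RFD instead replaces fields and classifies replaced-or-not.
```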
Sahllal and Souidi [
40] compare the performance of 19 resampling techniques, including four ensemble techniques, four oversampling methods, seven undersampling methods, and four hybrid methods, to identify the most effective method for CTR prediction. The data constructed by these methods are fed into four well-known machine learning algorithms to investigate the effects of resampling on CTR prediction performance. The authors evaluate these techniques extensively on a public dataset; the experiments show that resampling can improve a model’s performance by approximately 20%. Their findings indicate that oversampling is more effective than the other resampling strategies and that undersampling performs well when paired with ensemble methods such as random forest.
Yuan et al. [
163] introduce a feature-interaction-enhanced sequence (FESeq) approach that combines a sequential recommendation mechanism with a feature interaction capability. The framework includes an interacting layer that performs the feature engineering required by the transformer architecture. The framework uses a linear time interval embedding layer to retain the time intervals and a positional embedding layer to obtain the position information from the customer’s sequence behaviors. The authors also create an attention-based sequence pooling layer capable of reshaping the connection between the target ad representation and the customer’s historical behaviors, leveraging bilinear attention. The authors evaluate the proposed method via public (i.e., Alibaba (a.k.a., Ele.me) (can be found at
https://tianchi.aliyun.com/dataset/131047, accessed on 1 August 2025)) and real-world (i.e., Bundle) datasets. The proposed framework outperforms the baseline models, such as a joint CTR prediction (JointCTR) framework [
172] and time interval aware self-attention-based sequential recommendation (TiSASRec) [
173], on both datasets in terms of the logloss and AUC.
Yang et al. [
164] introduce a learning adaptively sparse structure (AdaSparse) framework, as shown in
Figure 12, to learn an adaptively sparse structure for each domain and hence accomplish decent generalization across domains while maintaining low computational complexity. The framework measures the neurons’ significance through domain-aware neuron-level weighting factors. Thus, the proposed framework can prune redundant neurons in each domain to promote generalization. Moreover, the framework incorporates adaptable sparsity regularization to control the sparsity ratio of the acquired structures effectively. The most important part of the framework is the domain-aware pruner, which generates neuron-level weighting factors capable of trimming redundant neurons. The framework uses an n-layered fully connected neural network as its core to present AdaSparse. Once the framework transforms the features into embeddings, it concatenates the domain-aware embeddings e_d and the domain-agnostic embeddings e_a, which form the input of the model (i.e., x = [e_d; e_a]). The learnable matrix of the n-th fully connected layer is denoted W^n, and the input neuron vector of the n-th layer is denoted h^n. To train the model on the CTR task, the authors use the cross-entropy loss over the CTR instances. For each domain o, the pruner generates a weighting-factor vector π_o^n (the authors propose three kinds of weighting factors: binarization, scaling, and fusion) that trims the redundant neurons of each layer. This procedure continues for each layer, and the sparse structure is eventually acquired. The framework is evaluated via two datasets (i.e., a public dataset called IAAC [
101] and a customized dataset named Production). Its superiority to multi-domain CTR state-of-the-art models such as the star topology adaptive recommender (STAR) [
174] and gradient-based meta-learning method (MAML) [
175] is demonstrated.
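A minimal sketch of the domain-aware pruner's scaling variant follows; the domain-embedding input and the sigmoid projection are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DomainAwarePruner(nn.Module):
    """Emits neuron-level weighting factors for one hidden layer."""
    def __init__(self, domain_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(domain_dim, hidden_dim)

    def forward(self, h: torch.Tensor, domain_emb: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) layer activations; domain_emb: (batch, domain_dim).
        pi = torch.sigmoid(self.proj(domain_emb))    # "scaling" factors in (0, 1)
        # Factors near zero prune the corresponding neurons for this domain;
        # a sparsity regularizer (e.g., an L1 term on pi) controls the ratio.
        return h * pi

pruner = DomainAwarePruner(domain_dim=8, hidden_dim=64)
pruned = pruner(torch.randn(4, 64), torch.randn(4, 8))
```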
Several two-stream interaction approaches incorporate MLP with a customized architecture to improve CTR prediction. The MLP mechanism implicitly learns feature interactions, and the customized architecture is used to learn feature interactions explicitly. Mao et al. [
166] propose an approach named FinalMLP that compensates for the customized architecture with another well-tuned MLP mechanism. Thus, the authors merge two streams of the MLP architecture to learn the implicit and explicit feature interactions. The authors also introduce a feature selection mechanism and an interaction aggregation layer to facilitate the feed of differentiated features and integrate stream-level interactions through two streams. The authors evaluate their method via four publicly available datasets: Criteo, Avazu, MovieLens, and Frappe; the evaluation metrics demonstrate the method’s effectiveness in terms of AUC.
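The two-stream design reduces to the skeleton below; the feature-selection module is omitted, and the bilinear term stands in for the paper's stream-level interaction aggregation.

```python
import torch
import torch.nn as nn

class TwoStreamMLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.stream1, self.stream2 = mlp(), mlp()   # two independently tuned MLPs
        self.w1 = nn.Linear(hidden, 1)
        self.w2 = nn.Linear(hidden, 1)
        self.bilinear = nn.Bilinear(hidden, hidden, 1)  # stream-level interaction

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # x1/x2: the two differentiated feature views fed to each stream.
        o1, o2 = self.stream1(x1), self.stream2(x2)
        logit = self.w1(o1) + self.w2(o2) + self.bilinear(o1, o2)
        return torch.sigmoid(logit)                 # predicted CTR

model = TwoStreamMLP(in_dim=128)
p_click = model(torch.randn(4, 128), torch.randn(4, 128))
```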
Tian et al. [
167] introduce an adaptive feature interaction learning approach called EulerNet. The framework learns feature interactions in a complex vector space, where each embedding a + bi consists of a real part a and an imaginary part b, and performs the space mapping via Euler’s formula (e^{iθ} = cos θ + i sin θ). Specifically, the model transforms the exponential powers in multiplicative feature interactions into simple linear aggregations of the complex features’ phases θ and moduli r, enabling the model to adaptively and efficiently learn high-order feature interactions. Additionally, as shown in Figure 13, the proposed framework combines explicit and implicit feature interactions over the complex feature embeddings in a unified architecture, which accomplishes the required reciprocal improvement and significantly increases the model performance on three datasets compared with various baseline models such as the deep interaction machine (DeepIM) [
176].
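The underlying trick can be checked numerically: in polar form, multiplying complex features corresponds to adding phases and log-moduli, so multiplicative interactions become linear aggregations (the numbers below are illustrative):

```python
import numpy as np

# Two "features" as complex numbers a + bi, in polar form r * e^(i*theta).
z1, z2 = 1.5 * np.exp(1j * 0.4), 0.8 * np.exp(1j * 1.1)

direct = z1 * z2                                   # explicit 2nd-order interaction
# Euler view: add phases, add log-moduli -> a linear aggregation.
r = np.exp(np.log(np.abs(z1)) + np.log(np.abs(z2)))
theta = np.angle(z1) + np.angle(z2)
reconstructed = r * np.exp(1j * theta)

assert np.allclose(direct, reconstructed)
# EulerNet learns the aggregation weights, so fractional and high orders
# come at the cost of a linear layer rather than explicit enumeration.
```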
Feature-interaction-based methods (e.g., CETN [
156], OptFS [
159]) improve CTR prediction performance by automatically identifying meaningful feature interactions, often outperforming manually engineered baselines. Meanwhile, customer behavior models demonstrate that long sequential dependencies and evolving interests are essential for personalization. However, issues remain regarding sequence length and memory efficiency. The key insight is that combining feature-level interactions with temporal behavior modeling yields superior personalization capabilities.
Table 10 demonstrates representative feature-interaction CTR methods. These models perform well with structured datasets that have complex features, like those in Criteo and Avazu, and are useful for capturing higher-order interactions. However, they require careful management of accuracy and computational costs. Key issues include parameter explosion (e.g., CETN), overfitting in high-dimensional spaces, and limited interpretability, with some approaches requiring manual filtering of irrelevant features.
We find that graph-based and multimodal methods are the top approaches for predicting CTR. These methods consistently outperform older deep learning models on large benchmarks. However, challenges remain in modeling user behavior over time and in learning how different features interact; models like CETN offer potential solutions. Cross-domain and transfer learning improve prediction accuracy and address cold-start issues when new users or items have limited data. In industrial settings, there is a growing emphasis on distributed architectures that use edge computing, aiming to balance strong predictions with efficient computation. These findings summarize the current literature and outline a roadmap for future CTR research, linking existing methods to the broader challenges discussed in the next section.