
Deep Double Towers Click Through Rate Prediction Model with Multi-Head Bilinear Fusion

College of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(2), 159; https://doi.org/10.3390/sym17020159
Submission received: 23 December 2024 / Revised: 8 January 2025 / Accepted: 18 January 2025 / Published: 22 January 2025
(This article belongs to the Special Issue Symmetry/Asymmetry in Image Processing and Computer Vision)

Abstract

Click-through rate (CTR) prediction is one of the mainstream research directions in recommender systems, especially for online advertising. The multilayer perceptron (MLP) has been widely used as the cornerstone of deep CTR prediction models. However, current neural network-based CTR prediction models commonly employ a single MLP to capture nonlinear interactions between high-order features while disregarding the interaction among differentiated features, resulting in poor model performance. Although studies such as DeepFM have proposed dual-branch interaction models to learn complex features, they still fall short of fine-grained feature fusion. To address these challenges, we propose a novel model, the Deep Double Towers model (DDT), which improves the accuracy of CTR prediction through multi-head bilinear fusion while incorporating symmetry in its architecture. Specifically, the DDT model leverages symmetric parallel MLP networks to capture the interactions between differentiated features in a structured and balanced manner. Furthermore, a multi-head bilinear fusion layer enables refined feature fusion through symmetry-aware operations, ensuring that feature interactions are aligned and symmetrically integrated. Experimental results on publicly available datasets such as Criteo and Avazu show that DDT surpasses existing models in CTR prediction accuracy, with symmetry contributing to more effective and balanced feature fusion.

1. Introduction

Click-through rate (CTR) prediction is a crucial component of recommendation systems, playing a key role in e-commerce and internet advertising [1,2]. The effectiveness of a CTR prediction model directly influences user experience and satisfaction, and has a substantial impact on the revenue of advertising companies [3]. Therefore, improving the precision of CTR prediction models is a major focus within the industry [4].
Thoughtful feature engineering can significantly enhance the efficacy of CTR prediction models [5]. Since a substantial portion of the features in CTR prediction tasks are categorical (such as name, sex, age, occupation, and interests), effectively modeling the intricate relationships among these features has become a pivotal challenge [6], and doing so is essential for improving the model's generalization capability throughout the interaction process. Earlier studies were often limited in their ability to learn higher-order interactions and thus richer feature information, leading to poor model performance [7,8]. With the advancement of deep learning, its application to CTR prediction has become an emerging research trend [9,10]. The MLP, a crucial element of deep learning, has likewise evolved into an essential module within diverse CTR prediction models. Existing research usually combines MLP networks with other shallow models, enhancing expressive capability by modifying the hidden-layer structure to obtain a deep model that learns more complex nonlinear interactions between features [11,12]. Nevertheless, not every feature interaction is beneficial, and learning useless interactions may introduce redundant information that increases model complexity. Consequently, researchers have proposed various models that select appropriate feature interactions and eliminate noisy information; representative models include AFM [13] and HoAFM [14], which add an attention mechanism to the FM [15] model to improve the quality of feature interactions, thereby significantly boosting predictive accuracy [16]. Although these methods improve the performance of CTR prediction models to some extent, they mainly focus on important second- or higher-order interactions and ignore the varying contributions of individual features to the final prediction objective [17]. In the CTR prediction task, not all features carry valuable information for the prediction goal: different feature interactions often exhibit distinct predictive capabilities, and interactions involving irrelevant features can be viewed as noise that does not contribute to the prediction and may even undermine model performance. To address this issue, a model for learning feature importance (FiBiNET) has been proposed, which dynamically adjusts feature weights through a Squeeze-and-Excitation Network (SENET) [18,19]. Although this approach proves that learning feature interactions with an MLP is feasible, it ignores the variability of features in different spaces and fails to model differentiated feature interactions; it therefore becomes crucial to learn feature interactions from different perspectives for features of different levels of importance [20,21].
Inspired by the concept of bilinear fusion in image classification, we design a parallel structure with two MLP networks and combine it with SENET to focus on more meaningful feature interactions through dynamic weighting, thereby enhancing the interactions between different features. In this paper, we propose a deep double-tower click-through rate prediction model with multi-head bilinear fusion. The model adopts a parallel deep tower structure and, combined with SENET, adaptively re-weights features so that the two MLP networks separately learn features of different significance for differentiated feature interactions, addressing the limitations traditional models face in feature selection. This enhances the model's generalization and its ability to learn complex features [22]. Additionally, to better fuse the feature outputs from different subspaces of the deep towers, this paper introduces an interaction aggregation layer for multi-head bilinear fusion [23]. This design keeps the computational complexity low while enabling efficient and fine-grained feature fusion. The key contributions of this study include:
  • The SENET mechanism is introduced to dynamically learn important features and to enhance the performance of predictive models by improving the quality of feature interactions.
  • Drawing inspiration from the concept of bilinear pooling in image classification, a parallel deep double-tower structure is proposed to learn differentiated feature interactions.
  • Features from different subspaces are aggregated through a multi-head bilinear fusion layer for more fine-grained feature fusion.
  • Experimental results on four publicly available datasets validate that the proposed DDT model surpasses existing CTR prediction models in effectiveness.

2. Related Work

2.1. Conventional CTR Prediction Models

The traditional CTR prediction models are mainly logistic regression (LR) and factorization machines (FM). Earlier, the CTR prediction problem was usually solved by constructing an LR model, but LR is a simple linear model that cannot construct complex feature interactions and is therefore unsuitable for high-dimensional, sparse advertisement features. To this end, some researchers introduced FM models into the CTR prediction task; these manage high-dimensional sparse features effectively by converting pairwise interactions into vector inner products, as the worked equation below recalls. However, the expressive power of the FM model is insufficient because it has difficulty capturing nonlinear relationships between features. Subsequent works therefore proposed improved versions such as Field-aware Factorization Machines (FFM) [24], which handle more complex feature interactions and high-dimensional data by introducing feature field awareness. This enhancement significantly improves expressive capability compared to FM, resulting in better performance in tasks such as CTR prediction and recommendation.
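For reference, the second-order FM prediction mentioned above has the standard textbook form due to Rendle [15]; this is general background, not an equation taken from this paper:

$$ \hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j $$

where $m$ is the number of input features and $\mathbf{v}_i \in \mathbb{R}^d$ is the latent vector of feature $i$. The inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ stands in for a dedicated weight on every feature pair, which is what keeps the model tractable on sparse data.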

2.2. Deep CTR Prediction Models

As deep learning technology rapidly advances, more and more researchers are introducing it into CTR prediction tasks to build deep learning-based CTR prediction models. In this process, effectively modeling feature interactions becomes a key factor in most neural network-based models. As a result, many novel deep models with MLP at their core have emerged in recent years, aiming to further improve CTR prediction performance by changing the structure of neural networks and introducing innovative techniques [25]. These models try to overcome the limitations of traditional CTR prediction models and better adapt to changing advertising display scenarios [26,27]. The classical Deep Crossing model automatically learns feature interactions by increasing the depth and structural complexity of neural networks, producing new ideas and methods for CTR prediction. On this basis, to integrate the advantages of various models, some scholars combined deep learning networks with different features so that they complement each other's strengths and improve the overall capability of the model; representative models include Wide&Deep and its variant Deep&Cross [28]. The Wide&Deep model can capture linear relationships between features while learning nonlinear feature interactions, giving it strong generalization ability. However, its wide component employs a broad linear model and can learn only low-order feature interactions; for high-dimensional data with numerous features, the wide part may be too limited to adequately capture more complex relationships between features [29]. To overcome this limitation, Deep&Cross furthers the model's capacity for feature interaction modeling by introducing the CrossNet structure, which learns multi-order feature crosses and automatically discovers more complex feature combinations. The subsequent DeepFM model, to reduce complex feature engineering, replaces the linear part of the Wide&Deep model with an FM, reducing the dependence on feature engineering and improving training speed. Meanwhile, NFM, another variant of the FM model, replaces the deep part with a bilinear interaction layer to enhance the second-order feature interactions of FM [30]. The above works verified, through a large number of interaction-based designs, the improvements that different combinations bring to feature engineering; however, such models still suffer from inefficiency and complexity [31]. Subsequently, researchers discovered that not all interactions among features are significant [32]; in fact, irrelevant interactions can introduce unnecessary noise and escalate training complexity. To this end, they proposed the AFM model, which dynamically learns feature interaction weights by incorporating an attention mechanism, enabling more flexible modeling of relationships between features and the weighting of crucial features [33]. Although the AFM model adapts to different types of data distributions and feature combinations better than traditional FM and other linear models, it ignores the fact that different types of features contribute differently to the target task [34]. For this reason, Huang et al. proposed the feature importance and bilinear feature interaction model (FiBiNET), which introduces SENET, originally developed for image-domain networks, together with a bilinear function for modeling important feature interactions, effectively improving the performance of CTR prediction models. Building on FiBiNET, some researchers explored further feature modeling, applied gating mechanisms on top of MLP structures, and proposed the MaskNet model to improve CTR prediction accuracy [35]. In addition, recent studies introduced residual units for improvement, proposing the GDCN [36] model and verifying its superiority and effectiveness, and EulerNet, proposed by Tian et al. [37], also learns feature importance for feature learning. These studies focus on selecting meaningful features to improve CTR prediction, and these new deep models are expected to provide useful insights for problems in other areas such as recommender systems.

3. The Proposed Model

The objectives of this study are to identify differentiated features and achieve more fine-grained interaction fusion by designing a parallel structure. Therefore, we propose a novel model (DDT) with multi-head bilinear fusion for CTR prediction. In this section, we present an overview of the model's overall structure. The input layer, deep layer, fusion layer, and output layer are the primary components of DDT, as seen in Figure 1. First, the original features in the input layer are mapped from high-dimensional, sparse representations to corresponding low-dimensional compact vectors. Second, the significant features are learned by a Squeeze-and-Excitation Network (SENET), and each feature is converted into a meaningful embedding vector. Third, the original feature embeddings and the important feature embeddings are passed to two MLPs, which learn different feature interactions. Finally, the deep output is used as the input to the fusion layer, which uses enhanced bilinear fusion to combine feature information from distinct representation spaces at a finer granularity for complex feature interactions. The result is a predicted value that reflects the likelihood of a user clicking on a product or advertisement. Each component is described in greater detail below.

3.1. Input Layer

3.1.1. Primitive Feature Embedding

Embedding serves as a prevalent method for transforming high-dimensional, sparse raw features into low-dimensional compact vectors. In the CTR prediction task, the input data primarily consist of sparse categorical features, e.g., {gender = male, occupation = businessman, week = sunday, category = finance}. To represent these features, it is customary to first group features of the same class into the same field and then transform them into high-dimensional sparse vectors using one-hot encoding. This categorical feature representation is illustrated in Figure 2, where 1 marks the feature indicated by the one-hot encoding and 0 represents the other features. Since the combined vector is sparse and difficult to handle, sparse features are embedded into low-dimensional, compact real-valued vectors as the raw feature input, reducing the feature dimension. In this paper, $E = [emb_1, emb_2, \ldots, emb_k]$ denotes the original feature embedding, where $k$ indicates the number of fields.
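As a concrete illustration of this embedding step, below is a minimal sketch in PyTorch (the framework used in Section 4.1.4); the field sizes and embedding dimension are hypothetical placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

class FieldEmbedding(nn.Module):
    """Maps k one-hot categorical fields to dense vectors emb_1..emb_k."""
    def __init__(self, field_sizes, d):
        super().__init__()
        # One embedding table per field (e.g., gender, occupation, week, category).
        self.tables = nn.ModuleList(nn.Embedding(n, d) for n in field_sizes)

    def forward(self, x):
        # x: (batch, k) integer category ids, one column per field.
        # Returns E with shape (batch, k, d).
        return torch.stack([table(x[:, i]) for i, table in enumerate(self.tables)], dim=1)

# Toy usage: 4 fields with hypothetical vocabulary sizes, embedding size d = 16.
emb = FieldEmbedding(field_sizes=[2, 50, 7, 20], d=16)
E = emb(torch.tensor([[0, 3, 6, 11]]))  # -> shape (1, 4, 16)
```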

3.1.2. Significant Feature Embedding

Consider the user's intent regarding various attribute features; for instance, a user may click on an ad based on the ad category and the time it was published. Specifically, when estimating the probability of a person clicking on a financial advertisement, occupational characteristics are significantly more important than gender characteristics and should therefore be assigned greater weight. To attain this objective, we introduce SENET, originally applied in the image domain, which can be leveraged to amplify the weights of crucial features in a particular CTR prediction task. In this model, the original feature embeddings are used as inputs; the important features are selected by SENET and converted into corresponding embedding vectors, denoted by $SE = [SE_{emb_1}, SE_{emb_2}, \ldots, SE_{emb_k}]$, where $k$ indicates the number of fields. As illustrated in Figure 3, SENET comprises three stages: the squeeze phase, the excitation phase, and the reweighting phase. These stages are explained in detail as follows:
Squeeze phase: This step learns the compression vector. The input vector $E = [emb_1, emb_2, \ldots, emb_k]$ is compressed by average pooling into a statistical vector $S = [s_1, s_2, \ldots, s_k]$ for feature aggregation, where $s_i$ is a scalar, $i \in (1, \ldots, k)$, reflecting the overall information of the associated feature. The squeeze phase is expressed as follows:

$$ s_i = F_{sq}(emb_i) = \frac{1}{N} \sum_{k=1}^{N} emb_i(k) $$

where $N$ represents the dimension of $emb_i$ and $k$ signifies the $k$-th dimension of $emb_i$.
Excitation phase: In this step, the compressed statistical vector $S$ is taken as input, and two fully connected layers are used to learn the vector weights. The first fully connected layer performs dimensionality reduction with a weight matrix $W_1$, a reduction ratio $r$, and a nonlinear activation function $\sigma_1$. The second fully connected layer uses $W_2$ to restore the dimensionality to its value before the reduction and is then activated by the activation function $\sigma_2$ to obtain a weight vector $A$. The specific formula is as follows:

$$ A = F_{ex}(S) = \sigma_2(W_2\, \sigma_1(W_1 S)) = [a_1, a_2, \ldots, a_k] $$

where $A \in \mathbb{R}^k$, $W_1 \in \mathbb{R}^{k \times \frac{k}{r}}$, $W_2 \in \mathbb{R}^{\frac{k}{r} \times k}$, and $\sigma_1$, $\sigma_2$ are both relu activation functions.
Reweight phase: The last phase introduces the reweighting step for recalculating the input feature vector. By multiplying the input feature vector $E = [emb_1, emb_2, \ldots, emb_k]$ and the weight vector $A = [a_1, a_2, \ldots, a_k]$ element by element, a new feature vector $SE = [SE_{emb_1}, SE_{emb_2}, \ldots, SE_{emb_k}]$ is obtained as the output of the SENET, given by the following formula:

$$ SE = F_{Rew}(A, E) = [a_1 \cdot emb_1, a_2 \cdot emb_2, \ldots, a_k \cdot emb_k] = [SE_{emb_1}, SE_{emb_2}, \ldots, SE_{emb_k}] $$

where $a_i \in \mathbb{R}$, $emb_i \in \mathbb{R}^d$, and $d$ denotes the embedding size.
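The three phases above condense into a short PyTorch sketch; this is a minimal reading of the squeeze, excitation, and reweight formulas, with the reduction ratio $r$ set to a hypothetical default rather than a tuned value:

```python
import torch
import torch.nn as nn

class SENETLayer(nn.Module):
    """Re-weights k field embeddings by learned importance, as described above."""
    def __init__(self, k, r=3):
        super().__init__()
        reduced = max(1, k // r)
        self.W1 = nn.Linear(k, reduced, bias=False)  # dimensionality reduction
        self.W2 = nn.Linear(reduced, k, bias=False)  # dimensionality restoration
        self.relu = nn.ReLU()                        # sigma_1 and sigma_2 are both relu

    def forward(self, E):
        # E: (batch, k, d) original field embeddings.
        S = E.mean(dim=2)                              # squeeze: average pooling over d
        A = self.relu(self.W2(self.relu(self.W1(S))))  # excitation: weights a_1..a_k
        return E * A.unsqueeze(-1)                     # reweight: SE_emb_i = a_i * emb_i

SE = SENETLayer(k=4, r=3)(torch.randn(8, 4, 16))       # -> shape (8, 4, 16)
```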

3.2. Deep Layer

The MLP serves as a crucial technique for learning implicit relationships between features and has evolved into a fundamental module of diverse CTR prediction models. Most existing studies add a single MLP module to the model to capture complex feature interactions and verify its effectiveness, but no study has added two MLP modules simultaneously to a double-tower model in a differentiated, parallel manner. Therefore, we use one tower to learn common feature interactions and another tower to capture important feature interactions, so that the two complement each other by learning different features for better performance. The deep layer of the model can be represented as:
$$ O_1 = \mathrm{MLP}_1(E) $$

$$ O_2 = \mathrm{MLP}_2(SE) $$

where $O_1$ and $O_2$ are the two outputs of the deep layer, and $\mathrm{MLP}_1$, $\mathrm{MLP}_2$ are two MLP networks, each consisting of several fully connected hidden layers:

$$ Z_1 = \mathrm{relu}(W_1^T Z_0 + b_1), \quad \ldots, \quad Z_n = \mathrm{relu}(W_n^T Z_{n-1} + b_n) $$

where $Z_0$ is the input of the MLP network, $W_n^T$ and $b_n$ are the weight matrix and bias term of the $n$-th layer, respectively, $\mathrm{relu}$ is the activation function, and $Z_n$ is the output of the hidden layer.
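A sketch of the resulting deep layer follows, again assuming PyTorch; the flattened embeddings feed two structurally identical towers, and the hidden sizes mirror the defaults reported in Section 4.1.4 (three layers of 400 units), though all sizes here are illustrative:

```python
import torch
import torch.nn as nn

def make_tower(in_dim, hidden=(400, 400, 400), dropout=0.1):
    """One MLP tower: stacked Linear -> relu -> Dropout blocks, as in Z_n above."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
        prev = h
    return nn.Sequential(*layers)

k, d = 4, 16                      # hypothetical field count and embedding size
mlp1 = make_tower(k * d)          # O1 = MLP1(E):  common feature interactions
mlp2 = make_tower(k * d)          # O2 = MLP2(SE): important feature interactions

E, SE = torch.randn(8, k * d), torch.randn(8, k * d)  # flattened embeddings
O1, O2 = mlp1(E), mlp2(SE)        # each of shape (8, 400)
```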

3.3. Fusion Layer

Most existing studies use concatenation or dot products for fusion but cannot capture differentiated feature interactions well. FinalMLP [22] proposed a bilinear interaction aggregation layer and extended it to a multi-head version for fusing stream outputs with stream-level feature interactions. Inspired by its success, we use it in our model to fuse feature interactions across different representation spaces. As illustrated in Figure 4, we first take the two deep-layer outputs $O_1$ and $O_2$ to represent the outputs from the two sub-networks (or 'streams'). To better capture the interaction between features, we divide each of these two outputs into $n$ subspaces. Specifically, assume:
$$ O_1 = [O_{11}, O_{12}, \ldots, O_{1n}] $$

$$ O_2 = [O_{21}, O_{22}, \ldots, O_{2n}] $$

where $O_1$ and $O_2$ are the outputs of the deep layer, partitioned into $n$ subspaces of dimensions $l_1$ and $l_2$, and $O_{1j}$ and $O_{2j}$ are the representations of $O_1$ and $O_2$ in the $j$-th subspace, respectively. Feature interactions in each subspace are modeled using bilinear fusion, whose goal is to capture the interrelationships between $O_{1j}$ and $O_{2j}$ by weighting their combinations. The bilinear fusion is formulated as follows:

$$ BF_j = W_1^T O_{1j} + W_2^T O_{2j} + O_{1j}^T W_3 O_{2j} + b_j $$

where $W_1 \in \mathbb{R}^{l_1 \times 1}$ and $W_2 \in \mathbb{R}^{l_2 \times 1}$ are the weight vectors applied to $O_{1j}$ and $O_{2j}$, $W_3 \in \mathbb{R}^{l_1 \times l_2}$ is the weight matrix used to model the bilinear interaction, and $b_j \in \mathbb{R}$ is the bias term.
The above equation shows how, within each subspace $j$, $O_{1j}$ and $O_{2j}$ are fused in a weighted, bilinear way. Notably, this bilinear fusion operation captures the interaction between the two vectors, making it more flexible and expressive than traditional dot-product or concatenation methods. Similar to the multi-head attention mechanism, we perform bilinear fusion within each subspace and aggregate the results across all heads by sum pooling. Specifically, the fusion results $BF_j$ from each subspace $j$ are aggregated into the final fusion output $BF$ with the following formula:

$$ BF = \sum_{j=1}^{n} \left( W_1^T O_{1j} + W_2^T O_{2j} + O_{1j}^T W_3 O_{2j} + b_j \right) $$

where $n$ is the number of subspaces, indicating that we perform a multi-head fusion operation. By sum pooling the output of each head, the information from different subspaces can be fully integrated to obtain more expressive fusion features. Ultimately, the fusion result obtained through sum pooling aggregates information from different subspaces, resulting in a more comprehensive feature representation:

$$ y = \sum_{j=1}^{n} BF(O_{1j}, O_{2j}) $$
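Putting the subspace split, the per-head bilinear term, and the sum pooling together, a minimal PyTorch sketch of this fusion layer might look as follows. Sharing $W_1$, $W_2$, and $W_3$ across heads follows our reading of the formulas above, where only the bias $b_j$ carries a head index:

```python
import torch
import torch.nn as nn

class MultiHeadBilinearFusion(nn.Module):
    def __init__(self, dim1, dim2, n_heads):
        super().__init__()
        assert dim1 % n_heads == 0 and dim2 % n_heads == 0
        self.n = n_heads
        l1, l2 = dim1 // n_heads, dim2 // n_heads
        self.w1 = nn.Parameter(torch.randn(l1) * 0.01)      # W1: linear term on O1j
        self.w2 = nn.Parameter(torch.randn(l2) * 0.01)      # W2: linear term on O2j
        self.W3 = nn.Parameter(torch.randn(l1, l2) * 0.01)  # W3: bilinear interaction
        self.b = nn.Parameter(torch.zeros(n_heads))         # b_j: per-head bias

    def forward(self, O1, O2):
        # Split each stream output into n subspaces: (batch, n, l).
        O1 = O1.view(O1.size(0), self.n, -1)
        O2 = O2.view(O2.size(0), self.n, -1)
        lin = O1 @ self.w1 + O2 @ self.w2                      # linear terms, (batch, n)
        bil = torch.einsum('bnl,lm,bnm->bn', O1, self.W3, O2)  # O1j^T W3 O2j per head
        return (lin + bil + self.b).sum(dim=1)                 # sum pooling over heads

y = MultiHeadBilinearFusion(400, 400, n_heads=20)(torch.randn(8, 400), torch.randn(8, 400))
```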

3.4. Output Layer

Following the fusion layer, we obtain the final output, and the predicted value is computed as follows:

$$ \hat{y}_{CTR} = \sigma(y) = \sigma\!\left( \sum_{j=1}^{n} BF(O_{1j}, O_{2j}) \right) $$

where $\hat{y}_{CTR} \in (0, 1)$ is the final predicted value and $\sigma$ is the sigmoid function. In addition, to better learn the weights and parameters of the model, we train it with the common cross-entropy loss function:

$$ Logloss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log \hat{y}_{CTR_i} + (1 - y_i) \log(1 - \hat{y}_{CTR_i}) \right) $$

where $y_i$ and $\hat{y}_{CTR_i}$ denote the true click label and the predicted click probability of sample $i$, respectively, and $N$ denotes the total number of samples.
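For completeness, a toy PyTorch sketch of the output layer and the Logloss above; random scores and labels stand in for real model outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y = torch.randn(8)                          # fused scores from the fusion layer (toy batch)
labels = torch.randint(0, 2, (8,)).float()  # y_i: real click labels in {0, 1}
y_hat = torch.sigmoid(y)                    # predicted CTR in (0, 1)
loss = F.binary_cross_entropy(y_hat, labels)  # the Logloss defined above
print(float(loss))
```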

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets

In this section, experiments are conducted on four publicly available datasets: Criteo, Avazu, Frappe, and MovieLens, where we reuse the data pre-processed by Cheng et al. [38]. We follow the same data pre-processing settings and randomly divide each dataset into three segments with a ratio of 8:1:1 for training, validation, and testing, respectively. Table 1 summarizes the field information of the four datasets, as described below:
(1) Criteo: This dataset is an anonymized record of online advertising transactions, publicly released on the Kaggle platform. It contains more than 40 million user click records, each consisting of 39 anonymized fields about display advertisements. It is widely used in research on recommendation algorithms and click-through rate prediction, using users' historical click data to predict whether they will click on a particular advertisement in the future.
(2) Avazu: This dataset contains user mobile behaviour, including whether displayed mobile ads were clicked by the user, with 23 feature fields covering user/item features and ad attribute features; the task is to predict whether a user will click on a particular ad from their contextual features.
(3) Frappe: This is a context-aware app recommendation dataset that records logs of app usage by users in different contexts (e.g., time of day, location). It contains 90,203 usage logs, and each log contains eight contextual features.
(4) Movielens: This dataset is created and maintained by the GroupLens research group. It contains users' ratings of movies and is widely used in recommender-system research. It is the classic movie-rating dataset, with 1,000,209 records covering ratings of 3900 movies by 6040 anonymous users.

4.1.2. Evaluation Metrics

In the experiments, the evaluation metrics employed to assess the performance of DDT are AUC (Area Under the ROC Curve) and LogLoss (cross-entropy). Furthermore, a change of 0.1 percentage points in AUC or LogLoss is considered a noteworthy improvement in the CTR prediction task.
AUC: AUC is commonly used to evaluate classification problems. It has a maximum value of 1, with higher values indicating better performance.
LogLoss: In binary classification problems, LogLoss is a commonly employed metric that measures the disparity between predicted scores and actual labels. LogLoss has a lower bound of 0, and smaller values indicate better performance.
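Both metrics can be computed, for example, with scikit-learn (our choice here for illustration; the paper does not name its evaluation library):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 0, 1, 1])            # toy click labels
y_pred = np.array([0.8, 0.2, 0.4, 0.7, 0.9])  # toy predicted CTRs
print("AUC:", roc_auc_score(y_true, y_pred))  # higher is better, max 1
print("LogLoss:", log_loss(y_true, y_pred))   # lower is better, min 0
```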

4.1.3. Baseline Models

This section compares the DDT with different baseline models, which are described below:
LR [16]: Logistic regression, which uses simple linear combinations to construct feature interactions, applied to CTR prediction.
FM [15]: Second-order feature modeling using a factorization machine, which alleviates the data sparsity problem of traditional models.
NFM [30]: Adds neural networks to FM, constructing feature interactions via a bilinear interaction layer and MLPs.
AFM [13]: Introduces an attention mechanism into FM to distinguish the weights of second-order feature interactions and capture important features.
DeepFM [11]: Extracts feature interactions through a factorization machine and the deep part through three MLP layers.
FiBiNet [19]: Combines SENet and bilinear interaction to extract important features, enable fine-grained feature interaction, and predict CTR.
MaskNet [35]: Applies feature gating to the MLP to refine the feature importance module.
GDCN [36]: Better captures explicit higher-order feature interactions by dynamically filtering important feature interactions through a gated cross network.
EulerNet [37]: Explicitly models feature interactions using Euler's formula and combines them with a linear layer to adaptively learn feature interactions of arbitrary order.

4.1.4. Implementation Details

Our experiments were carried out in a Windows 11 environment with an AMD Ryzen 7 CPU and an NVIDIA RTX 2080 Ti graphics card, using Python 3.8, PyTorch with CUDA 10.2 for the deep learning components, and PyCharm 2020 as the development environment. The batch size for all datasets was set to 4096, with a learning rate of 0.001. By default, each MLP consists of three hidden layers, each containing 400 neurons. To avoid overfitting, the dropout rates for the Criteo and Avazu datasets were set to 0.1 and 0.3, respectively, and early stopping was performed based on the AUC value of the validation set. We use Adam to update the parameters in the training phase and tune the hyper-parameters dynamically to obtain the best performance.

4.2. Comprehensive Performance

In this section, we compare the proposed model with various classical models. The overall performance of all models on the four datasets is shown in Table 2. To provide a comprehensive comparison, representative models for first-order, second-order, and higher-order interactions are listed. As can be seen from the table, the DDT model outperforms the other baseline models on all four datasets. The AUC of DDT increased by an average of 0.49% to 1.62% across the four datasets, and LogLoss decreased by an average of 0.27% to 5.22% compared to the other models. These results strongly suggest that a deep twin-tower model with multi-head bilinear fusion is effective in improving CTR prediction performance. Specifically, the DDT model improves the quality of feature interactions and enhances the expressive power of the model through its twin-tower structure and multi-head bilinear fusion layer. In addition, LR, the traditional logistic regression model with simple linear interactions, performed worst, while the models containing complex feature interactions significantly outperformed it. This indicates that learning second-order or higher-order interactions effectively enhances the performance of CTR prediction models, and models with higher-order interactions are usually better than those relying on lower-order interactions, in line with our observation that higher-order models exhibit more robust and more accurate predictive capabilities. Notably, models such as FiBiNET exhibit superior performance compared to DeepFM, which is attributed to their ability to select and learn important features. This observation highlights that the quality and accuracy of feature interactions directly affect the predictive effectiveness of models in CTR prediction tasks.

4.3. Parametric Analysis

In this subsection, we investigate the effect of the hyperparameters, including the number of network layers, the dimensionality reduction ratio, and the number of heads.

4.3.1. Impact of the Number of Network Layers

In this subsection, we conduct experiments to investigate the effect of the number of MLP layers on the effectiveness of the DDT model. The performance of the model on the different datasets is illustrated in Figure 5. As the number of network layers increases, model performance improves significantly in the initial phase. However, a further increase in depth leads to a decrease in AUC, because additional layers increase model complexity and encourage overfitting. Notably, on the Avazu dataset, we observed a sharp increase in LogLoss when the number of layers increased from 3 to 4. This can be attributed to the larger number of contextual features in that dataset, which challenges the model's ability to capture contextual information and model diversity. Based on our findings, we recommend setting the number of MLP layers in the model to three for the best results.

4.3.2. Impact of Dimensionality Reduction Ratio

This section discusses the effect of the dimensionality reduction ratio on model performance. This ratio is a key parameter in SENET, whose role mainly involves learning feature importance weights. As shown in Figure 6, on the Avazu, Frappe, and Movielens datasets, the model performs best when the reduction ratio is 3, but when the ratio is increased to 4, performance drops below its best value in all cases. This is due to overly aggressive dimensionality reduction, which hinders the model's ability to learn. It is worth noting that on the Criteo dataset, the model achieves optimal performance when the reduction ratio is set to 4, which can be attributed to the effect of dataset size on feature dimensionality. The results on the four datasets show that choosing a moderate reduction ratio can enhance the model's ability to capture important features.

4.3.3. Impact of Number of Heads

In this section, we study the practical effect of the subspace grouping technique in bilinear fusion. Table 3 shows how model performance changes with the number of subspaces (i.e., heads $k$) in bilinear fusion. We observe that the effect of the number of heads differs across datasets. On the Avazu, Frappe, and Movielens datasets, performance tends to degrade as the number of heads increases, which we believe may be due to the size and type of the dataset, suggesting that the model does not rely on many contextual features when learning feature interactions. In contrast, on the Criteo dataset, a smaller number of heads does not always improve performance, because an appropriate number of subspaces helps the model acquire different feature interactions from different perspectives while reducing redundant interactions, similar to the multi-head attention mechanism. In practice, the most appropriate hyperparameters for optimal performance can be determined through experimentation.

4.4. Ablation Experiment

In this section, to evaluate the contribution of each improved component to model performance, we selected two datasets and conducted ablation experiments on each part of the DDT model. The results are summarized in Table 4. The experimental results show that each module improves CTR prediction performance to some extent. First, Model 3 performs worst on both datasets: eliminating higher-order interactions on important features in the deep layer reduces the quality of the feature interactions and consequently reduces model performance. Furthermore, eliminating higher-order interactions from the left tower in Model 2 had the least impact on performance, because the feature embeddings extracted by the left tower are sparser than those of the right tower and contain redundant information that contributes less to the target task. In addition, the configurations of Model 1 and Model 4 also affected performance to some extent, further validating the impact of each module on the overall performance of the DDT prediction model.

5. Conclusions

To address the shortcomings of existing CTR prediction models, this paper proposes a DDT model with multi-head bilinear fusion, which aims to learn differentiated feature interactions and achieve more fine-grained feature fusion. In the DDT model, the SENET mechanism first assigns different weights to features of varying importance, enabling the extraction of important features. Second, the model's deeper expressive ability is enhanced by incorporating two MLP networks that learn feature interactions of different importance in a parallel structure. Finally, to fuse the interacting features from different spaces more efficiently, a multi-head bilinear fusion layer is introduced to achieve more fine-grained feature fusion. Our experimental results fully validate the effectiveness of the proposed model, which outperforms existing state-of-the-art models, and also demonstrate that even a simple MLP network, with appropriate tuning, can achieve surprising results. Overall, this study addresses the problem of limited feature interaction in CTR prediction scenarios, enhancing both the interpretability and the accuracy of the CTR prediction model. Beyond CTR prediction, the proposed DDT model shows great potential in other fields that involve complex feature interactions and large-scale data. For instance, it can be applied to recommendation systems, where it can enhance personalized recommendations by better capturing the interactions between users and items, and to advertising optimization, where it can improve ad targeting and bidding strategies. Furthermore, the model's ability to handle sparse, high-dimensional features makes it suitable for social network analysis, fraud detection, and even financial risk prediction, where accurate predictions are crucial. The next phase of work will focus on exploring the importance of contextual features in the interaction process to further improve click-through rate prediction and to extend the model's applicability to other domains.

Author Contributions

All of the authors contributed to the study conception and design. Conceptualization, X.C. and Y.Z.; methodology, X.C.; software, W.W.; validation, X.C., W.W. and Y.M.; formal analysis, Y.Z.; investigation, X.C.; resources, Y.Z.; data curation, W.W.; writing—original draft preparation, X.C.; writing—review and editing, X.C.; visualization, X.C.; supervision, Y.M.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) (62001004), the Anhui Natural Science Foundation (2008085MF218), Anhui Provincial University Excellent Talent Support Program Key Projects (gxyqZD2021124), and the Major Natural Science Research Project of Anhui Provincial Universities (2024AH040039).

Data Availability Statement

The data that support the findings of this study are openly available at https://www.kaggle.com/datasets/mrkmakr/criteo-dataset, http://www.kaggle.com/c/avazu-ctr-prediction, http://baltrunas.info/research-menu/frappe, and https://grouplens.org/datasets/movielens/ (accessed on 15 January 2025).

Conflicts of Interest

The authors have no relevant financial or non-financial interests to disclose and no competing interests relevant to the content of this article.

References

  1. Zhang, J.; Ma, C.; Zhong, C.; Zhao, P.; Mu, X. Multi-scale and multi-channel neural network for click-through rate prediction. Neurocomputing 2022, 480, 0925–2312.
  2. Lyu, Z.; Dong, Y.; Huo, C.; Ren, W. Deep match to rank model for personalized click-through rate prediction. Proc. AAAI Conf. Artif. Intell. 2020, 34, 156–163.
  3. Yang, Y.; Zhai, P. Click-through rate prediction in online advertising: A literature review. Inf. Process. Manag. 2022, 59, 102853.
  4. Guan, F.; Qian, C.; He, F. A knowledge distillation-based deep interaction compressed network for CTR prediction. Knowl.-Based Syst. 2023, 275, 110704.
  5. Yang, X.; Liu, Q.; Su, R. Click-through rate prediction using transfer learning with fine-tuned parameters. Inf. Sci. 2022, 612, 188–200.
  6. Lauriola, I.; Lavelli, A.; Aiolli, F. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 2022, 470, 443–456.
  7. Yang, L.; Zheng, W.; Xiao, Y. Exploring Different Interaction Among Features for CTR Prediction. Soft Comput. 2022, 26, 6233–6243.
  8. Luo, Y.; Peng, W.W.; Fan, Y.; Xu, X.; Wu, X. Explicit sparse self-attentive network for CTR prediction. Procedia Comput. Sci. 2021, 183, 690–695.
  9. Jiang, D.; Xu, R.; Xu, X.; Xie, Y. Multi-view feature transfer for click-through rate prediction. Inf. Sci. 2021, 546, 961–976.
  10. Jun, X.; Zhao, X.; Xu, X.; Han, X.; Ren, J.; Li, X. DRIN: Deep Recurrent Interaction Network for click-through rate prediction. Inf. Sci. 2022, 604, 210–225.
  11. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1725–1731.
  12. Wang, R.; Fu, B.; Fu, G.; Wang, M. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17, Halifax, NS, Canada, 14 August 2017.
  13. Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; Chua, T.-S. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3119–3125.
  14. Tao, Z.; Wang, X.; He, X.; Huang, X.; Chua, T. HoAFM: A High-order Attentive Factorization Machine for CTR Prediction. Inf. Process. Manag. 2020, 57, 102076.
  15. Rendle, S. Factorization Machines. In Proceedings of the IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010; pp. 995–1000.
  16. Sidahmed, H.; Prokofyeva, E.; Blaschko, M.B. Discovering predictors of mental health service utilization with k-support regularized logistic regression. Inf. Sci. 2016, 329, 937–949.
  17. Shan, Y.; Hoens, T.R.; Jiao, J.; Wang, H.; Yu, D.; Mao, J.C. Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 255–262.
  18. Zou, D.; Wang, Z.; Zhang, L.; Zou, J.; Li, Q.; Chen, Y.; Sheng, W. Deep Field Relation Neural Network for click-through rate prediction. Inf. Sci. 2021, 577, 128–139.
  19. Huang, T.; Zhang, Z.; Zhang, J. FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-through Rate Prediction. In Proceedings of the Thirteenth ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 169–177.
  20. Chen, Y.; Wang, Y.; Ren, P.; Wang, M.; de Rijke, M. Bayesian feature interaction selection for factorization machines. Artif. Intell. 2022, 302, 103589.
  21. Zhang, X.; Wang, Z.; Du, B. Deep Dynamic Interest Learning With Session Local and Global Consistency for Click-Through Rate Predictions. IEEE Trans. Ind. Inform. 2022, 18, 3306–3315.
  22. Mao, K.; Zhu, J.; Su, L.; Cai, G.; Li, Y.; Dong, Z. FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 4552–4560.
  23. Song, K.; Huang, Q.; Zhang, F.-E.; Lu, J. Coarse-to-fine: A dual-view attention network for click-through rate prediction. Knowl.-Based Syst. 2021, 216, 106767.
  24. Juan, Y.; Zhuang, Y.; Chin, W.S.; Lin, C. Field-aware factorization machines for CTR prediction. In Proceedings of the Tenth ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 43–50.
  25. Zheng, J.; Chen, S.; Du, Y.; Song, P. A multiview graph collaborative filtering by incorporating homogeneous and heterogeneous signals. Inf. Process. Manag. 2021, 59, 103072.
  26. Liu, M.; Cai, S.; Lai, Z.; Qiu, L.; Hu, Z.; Ding, Y. A joint learning model for click-through prediction in display advertising. Neurocomputing 2021, 445, 206–219.
  27. Liu, B.; Zhu, C.; Li, G.; Zhang, W.; Lai, J.; Tang, R.; He, X.; Li, Z.; Yu, Y. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, 6–10 July 2020; pp. 2636–2645.
  28. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Shah, H. Wide & Deep Learning for Recommender Systems. In Proceedings of the DLRS 2016: Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
  29. Liu, B.; Xue, N.; Guo, H.; Tang, R.; Zafeiriou, S.; He, X.; Li, Z. AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 199–208.
  30. He, X.; Chua, T.-S. Neural Factorization Machines for Sparse Predictive Analytics. In Proceedings of SIGIR '17, Tokyo, Japan, 7–11 August 2017; pp. 355–364.
  31. Jing, C.; Qiu, L.; Sun, C.; Yang, Q. ICE-DEN: A click-through rate prediction method based on interest contribution extraction of dynamic attention intensity. Knowl.-Based Syst. 2022, 250, 109135.
  32. Bian, W.; Wu, K.; Ren, L.; Pi, Q.; Zhang, Y.; Xiao, C.; Sheng, X.R.; Zhu, Y.N.; Chan, Z.; Mou, N.; et al. CAN: Feature Co-Action Network for Click-Through Rate Prediction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event, 21–25 February 2022; pp. 57–65.
  33. Li, X.; Sun, L.; Ling, M.; Peng, Y. A survey of graph neural network based recommendation in social networks. Neurocomputing 2023, 549, 126441.
  34. Huang, J.; Han, Z.; Xu, H.; Liu, H. Adapted transformer network for news recommendation. Neurocomputing 2022, 469, 119–129.
  35. Wang, Z.; She, Q.; Zhang, J. MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask. arXiv 2021, arXiv:2102.07619.
  36. Wang, F.; Gu, H.; Li, D.; Lu, T.; Zhang, P.; Gu, N. Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023.
  37. Tian, Z.; Bai, T.; Zhao, W.X.; Wen, J.R.; Cao, Z. EulerNet: Adaptive Feature Interaction Learning via Euler's Formula for CTR Prediction. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, 23–27 July 2023.
  38. Cheng, W.; Shen, Y.; Huang, L. Adaptive factorization network: Learning adaptive-order feature interactions. Proc. AAAI Conf. Artif. Intell. 2020, 34, 3609–3616.
Figure 1. The framework of the DDT model.
Figure 2. Illustration of one-hot encoding.
Figure 3. The SENET layer.
Figure 4. Detail of the fusion layer.
Figure 5. Effect of the number of network layers on model performance on the four datasets.
Figure 6. Effect of the dimensionality reduction ratio on model performance on the four datasets.
Table 1. Statistics of the datasets.

Dataset     Instances    Fields  Features
Criteo      45,840,422   39      2,083,736
Avazu       40,428,365   22      1,534,240
Frappe      288,609      10      5328
Movielens   2,006,859    3       90,445
Table 2. Comprehensive performance of all models on the different datasets.

Model        Criteo           Avazu            Frappe           Movielens
             AUC     LogLoss  AUC     LogLoss  AUC     LogLoss  AUC     LogLoss
LR           0.7832  0.4656   0.7510  0.3735   0.9532  0.2841   0.9315  0.3458
FM           0.8018  0.4493   0.7587  0.3702   0.9644  0.2215   0.9414  0.2818
AFM          0.8005  0.4505   0.7540  0.3720   0.9652  0.2243   0.9475  0.2761
NFM          0.8042  0.4470   0.7596  0.3703   0.9730  0.2142   0.9465  0.2848
DeepFM       0.8114  0.4404   0.7605  0.3696   0.9729  0.2076   0.9485  0.2946
FiBiNET      0.8118  0.4401   0.7599  0.3699   0.9732  0.2101   0.9515  0.2358
MaskNet      0.8119  0.4408   0.7600  0.3696   0.9728  0.2177   0.9664  0.2199
GDCN         0.8124  0.4391   0.7613  0.3675   0.9737  0.2202   0.9682  0.2107
EulerNet     0.8136  0.4372   0.7624  0.3681   0.9739  0.1991   0.9678  0.2138
DDT (ours)   0.8143  0.4376   0.7635  0.3674   0.9765  0.1984   0.9683  0.2104
Table 3. Effect of the number of heads on model performance on the four datasets.

Heads (k)    Criteo           Avazu            Frappe           Movielens
             AUC     LogLoss  AUC     LogLoss  AUC     LogLoss  AUC     LogLoss
1            0.8138  0.4382   0.7630  0.3675   0.9765  0.1984   0.9683  0.2104
5            0.8142  0.4378   0.7618  0.3680   0.9752  0.1987   0.9678  0.2137
10           0.8140  0.4380   0.7599  0.3715   0.9749  0.1997   0.9674  0.2135
20           0.8143  0.4376   0.7601  0.3684   0.9731  0.2004   0.9669  0.2144
40           0.8140  0.4379   0.7624  0.3693   0.9735  0.2001   0.9667  0.2146
50           0.8141  0.4378   0.7609  0.3690   0.9737  0.2000   0.9664  0.2157
Table 4. Comparison between DDT and its ablated versions; each of Models 1–4 disables one of the four components (the input-layer SENet, the deep-layer MLP1 or MLP2, or the fusion module).

Model        Criteo           Avazu
             AUC     LogLoss  AUC     LogLoss
Model 1      0.8139  0.4380   0.7604  0.3804
Model 2      0.8141  0.4378   0.7606  0.3693
Model 3      0.8132  0.4387   0.7555  0.3713
Model 4      0.8138  0.4382   0.7596  0.3704
DDT (ours)   0.8143  0.4376   0.7635  0.3674
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
