Click-through Rate Prediction and Uncertainty Quantification Based on Bayesian Deep Learning

Click-through rate (CTR) prediction is a key research topic for evaluating recommendation systems and estimating advertising traffic. Existing studies have shown that deep learning performs very well in prediction tasks, but most of them are based on deterministic models, leaving a large gap in capturing uncertainty. Modeling uncertainty is a major challenge when using machine learning solutions to solve real-world problems in various domains. To quantify the uncertainty of the model and achieve accurate and reliable predictions, this paper designs a CTR prediction framework combining feature selection and feature interaction. Within this framework, a CTR prediction model based on Bayesian deep learning is proposed to quantify the uncertainty of the prediction model. On top of a parallel squeeze network and DNN prediction framework, the approximate posterior distribution of the model parameters is obtained using Monte Carlo dropout, which also yields the integrated prediction results. Epistemic and aleatoric uncertainty are defined, and information entropy is adopted to calculate the sum of the two kinds of uncertainty. Epistemic uncertainty is measured by mutual information. Experimental results show that the proposed model outperforms other models in prediction performance and is able to quantify uncertainty.


Introduction
With the rise of multimedia applications, CTR prediction tasks have appeared in many scenarios, such as precise advertising placement and product recommendations on e-commerce platforms. Since every user's click behavior can be regarded as a display of their preferences, CTR prediction is the process of learning user preferences from stored historical user data and then using them to predict future behavior.
The development of CTR prediction models can be divided into two aspects: automatic feature engineering and improving model capabilities. Since the features in CTR prediction usually belong to specific fields, how to effectively learn rich information from feature interactions becomes a key challenge. Early CTR prediction models, limited by computing capacity, relied on manual feature engineering and simple machine learning models. The methods used were linear models, such as logistic regression (LR) [1], which is simple and efficient but depends heavily on feature design. Models that emerged later, such as the factorization machine (FM) [2] and the attentional factorization machine (AFM) [3], can learn pairwise feature interactions. Because feature engineering in CTR prediction largely relies on expert experience, an important research direction is to design new models that automatically construct feature combinations and uncover the information hidden in feature interactions.
Subsequently, significant progress has been made in CTR prediction owing to advances in deep neural networks (DNNs) for representation learning. Models such as the product-based neural network (PNN) [4], deep cross network (DCN) [5], and factorization-machine based neural network (DeepFM) [6] can capture higher-order feature interactions. Much work in recent years has combined explicit feature interaction with DNNs. However, these methods assume that all features interact and that the interaction of each feature should be modeled to the same extent. In contrast, the deep interest network (DIN) [7], deep interest evolution network (DIEN) [8], and JointCTR [9] incorporate historical user behavior to model personalized interests. The power of deep learning lies in its ability to construct complex functions and perform non-linear transformations on data [10]. All of the above methods focus only on important second- or higher-order feature interactions, while ignoring the importance of each feature to the prediction target [11]. Worthless features introduce noise and complicate the feature interaction procedure. Therefore, a few methods learn feature importance during feature learning, such as the feature importance and bilinear feature interaction network (FiBiNET) [12] and the dual input-aware factorization machine (DIFM) [13]. However, deep learning usually cannot handle the uncertainty that arises in practical application scenarios.
Because training samples are limited and incoherent noise is present, uncertainty quantification should be an integral part of any prediction system. Although machine learning tools have made great progress, model uncertainty remains a large gap worth exploring. In essence, most existing research is based on deterministic models and lacks the ability to capture uncertainty. Bayesian theory is commonly used to describe modeling errors and also provides mathematical tools for reasoning about model uncertainty. Deep learning and Bayesian statistical theory are combined as Bayesian deep learning to offer uncertainty estimates for deep architectures. In contrast to traditional deep learning models that produce deterministic predictions, Bayesian machine learning provides not only predictions but also their uncertainty, which is estimated from the probability density of the outcome.
Gal et al. [14] proposed a simpler method for deep learning uncertainty estimation: train a network with dropout and keep dropout active at test time to obtain Monte Carlo (MC) samples of the prediction. Optimizing a neural network with dropout can be interpreted as a form of variational inference in a Bayesian machine learning model. Such a model characterizes the uncertainty of its prediction for an input sample through the variance of the output probability distribution. The outputs obtained by applying standard dropout to the DNN during the test phase are regarded as MC samples drawn from the posterior distribution of the model. MC dropout involves two steps. First, the DNN is trained with dropout. Second, for each input, T stochastic forward passes with dropout are performed during the test phase. This theory has been successfully applied in wind energy, power grids, climate prediction, and other fields, showing that the method can capture the uncertainty related to the model output. Inspired by these studies, we propose a CTR prediction model composed of feature selection and feature interaction. In this framework, uncertainty is divided into epistemic uncertainty and aleatoric uncertainty, which account for the uncertainty caused by the model parameters and structure and by data noise, respectively.
The innovations of this work can be summarized as follows:
(1) We propose a CTR prediction model based on Bayesian deep learning, which learns high-order feature interaction information and models two kinds of uncertainty. The model achieves efficient and reliable prediction.
(2) A framework combining feature selection and feature interaction is proposed. In the feature selection module, CNN and MLP are used to generate meaningful feature interactions. In the feature interaction module, a squeeze network and a DNN model feature interactions in parallel.
(3) Within this framework, Monte Carlo dropout is used to perform approximate posterior inference on the model parameters and to obtain the integrated prediction results. The uncertainty of the model is quantified by information entropy.
(4) Experiments on three datasets show that the proposed model outperforms the most advanced models in prediction performance. A reliable uncertainty quantification is also carried out for the model.
The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 describes the proposed model. Section 4 presents detailed experimental results and analyses. Finally, Section 5 summarizes the work of this paper.

Related Works
(1) CTR Prediction
In recent years, CTR prediction has developed rapidly and achieved major breakthroughs. Research in this area mainly uses machine learning methods to predict users' subsequent clicks on advertisements based on their historical behavior data and attribute information.
Initially, shallow models were used. Combinations of the LR model and manual feature engineering, such as the mixture of logistic regression (MLR) [15] and gradient boosting decision tree + logistic regression (GBDT + LR) [16], were widely used. Their advantage lies in efficiency and ease of deployment. Then, the FM model realized pairwise combinations of features. Following the idea of the FM model, a series of upgraded models were optimized for different emphases, such as the field-aware factorization machine (FFM) [17], higher-order factorization machine (HOFM) [18], field-weighted factorization machine (FwFM) [19], and multi-order interactive features aware factorization machine (MoFM) [20].
The factorization machine deep neural network (FNN) [21] was the earliest proposed model based on deep learning. The PNN model introduced a product layer into the neural network, with two ways of computing feature crosses: the inner product and the outer product. The wide and deep learning (WDL) [22] model adopted a dual-path structure that combines the generalization of the deep model with the memorization of the wide linear model. DeepFM combined neural networks with FM and could simultaneously model high- and low-order feature interactions. The DCN model proposed a CrossNet to explicitly learn finite-order feature crosses while implicitly learning cross features through a DNN. The extreme deep factorization machine (xDeepFM) [23] proposed a compressed interaction network (CIN) for vector-wise feature interaction that could capture explicit and implicit high-order feature interactions simultaneously. Sina proposed the FiBiNET model, which adopted the squeeze-and-excitation network (SENET) to learn the importance of features and learned feature interactions through a bilinear function. Liu et al. [24] introduced a neural attention network to study the significance of second-order interactions and employed two different residual networks to explore feature interactions automatically.
The DIN model was the first method to introduce attention into user behavior modeling. It uses an attention module to assign different weights to a user's historical behaviors according to their relevance to the current item. On this basis, the DIEN and deep session interest network (DSIN) [25] models were derived. The automatic feature interaction learning (AutoInt) [26] model draws on the multi-head self-attention mechanism of the Transformer from natural language processing (NLP), and the learned attention weights also give the model a certain degree of interpretability. The DIFM utilized fully connected layers and multi-head self-attention networks to obtain feature interactions at the bit-wise and vector-wise levels, respectively. The dual-view attention network (DVAN) [27] proposed a selection mechanism to extract feature interactions from the item view and the user view, respectively. The attentive capsule network (ACN) [28] model employed Transformers for feature interaction and adaptively captured multiple interests from user behavior history using capsule networks.
(2) Uncertainty in the Models
Deep learning models with uncertainty measurements have been widely used in many applications. For example, Wang et al. [29] proposed an ensemble probabilistic prediction learning system to quantify the uncertainty of crude oil prices by combining five commonly used machine learning methods with an improved optimizer. Gal et al. showed that dropout could be used to place Bernoulli distributions over the weights of convolutional network filters in the test phase in order to assess model uncertainty and prediction. In subsequent studies, Liu et al. [30] combined variational Bayesian inference and spatial-temporal neural networks to predict spatial-temporal wind speed. Sun et al. [31] combined Bayesian probability theory and deep long short-term memory (LSTM) networks for grid load prediction. Xiao et al. [32] used neural network structures to quantify model uncertainty and data uncertainty for various NLP tasks. Hernández and López [33] applied Bayesian deep learning to quantify the uncertainty of plant disease detection. Ghoshal et al. [34] proposed a Bayesian CNN based on weights for estimating uncertainty in the COVID-19 chest X-ray dataset to increase the performance of diagnostics. Abdar et al. [35] designed a Bayesian neural network model for biomedical image segmentation to quantify uncertainties in classification. Wang et al. [36] modeled uncertainty through Bayesian deep learning to improve personalized recommendations. Zheng et al. [37] presented semantic uncertainty to explain incomplete data acquisition and inaccurate data labeling. Jin et al. [38] proposed a variational Bayesian deep prediction network with a self-screening layer to address the loss of prediction accuracy caused by heavy noise and conflicting or inconsistent data. These studies have demonstrated the practical value of such techniques for model prediction and uncertainty estimation.
where $Y$ is a set of class labels for binary classification, representing the click behavior for a given ad. The target of CTR prediction is to construct a model $\hat{y} = \text{CTR Model}(X)$ that predicts the probability $P(y|x;\theta)$ that the user clicks on the recommended advertisement, where $\theta$ denotes the parameters of the model;
(2) Epistemic uncertainty measures the uncertainty of the model parameters. It derives from the limited representation ability of the observed values, which is reflected in the difference between the predicted value $\hat{y}$ and the true value $y$. It is quantified by mutual information;
(3) Aleatoric uncertainty measures the noise intrinsic to the data. This uncertainty cannot be eliminated. It is expressed as the output of a function and is computed as the difference between the predictive entropy and the mutual information;
(4) Objective function. Logloss is used as the loss function for the binary classification problem. We train our model by minimizing the following objective function:
$$\text{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right) \tag{1}$$
where $\hat{y}_i$ is the predicted CTR, $y_i$ is the true label, and $N$ is the number of samples.
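As an illustration of the Logloss objective, the following minimal Python sketch computes the mean binary cross-entropy for a small batch. The function name `logloss`, the clipping constant `eps`, and the toy labels and predictions are assumptions made for this example and are not taken from the experiments.

```python
import math


def logloss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between labels in {0, 1} and predicted CTRs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so that log() stays finite
        # (eps is an illustrative choice, not a value from the paper).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]          # hypothetical click labels
preds = [0.9, 0.2, 0.6, 0.4]   # hypothetical predicted CTRs
print(logloss(labels, preds))
```

Confident and correct predictions contribute values close to zero, while confident but wrong predictions are penalized heavily, which is why the metric is sensitive to poorly calibrated probabilities.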

Basic Model
The design of the Bayesian deep learning model is shown in Figure 1. It mainly includes the feature engineering phase, the decision phase, and the Monte Carlo reasoning phase; the feature engineering phase is carried out by the feature selection and feature interaction modules. Bayesian probabilistic modeling is applied to fully connected neural networks. In this section, Bayesian modeling is approximated by dropout regularization.
In this paper, we first exploit the strengths of the CNN to extract the neighbor feature interactions, while complementing it with a recombination layer to extract the global feature interactions. Then the squeeze network and DNN are applied to learn the high-order feature interactions in parallel. The diagram of the FiBDL is displayed in Figure 2.
(1) Feature selection.
The feature selection module aims to identify useful local and global patterns to generate meaningful feature interactions. In the original feature space, meaningful feature interactions are always sparse. CNN has parameter sharing and pooling mechanisms, which greatly reduce the number of parameters needed to discover important local patterns and thus provide an idea for feature selection. However, since a CNN only produces neighbor feature interactions, many meaningful global feature interactions are lost. We therefore use an MLP to recombine them and learn the global feature interactions. Let $X = [x_1, \ldots, x_i]$ represent the input, and let the output of the first convolutional layer be $C^1$. Neighbor feature interactions are obtained by convolving the input with the weight matrix of the first convolutional layer and applying a non-linear activation function. The convolutional layer is written as:

where $C^1_{:,:,i}$ is the $i$-th feature map of the first convolutional layer, $h^1$ is the height of the convolution kernel, and $p$, $q$ denote the row and column indices of the feature map.
A max-pooling layer is applied after the convolutional layer to retain the most important feature interactions. The output can be denoted as:
where $h_p$ is the height of the pooling layer. It is worth noting that the input of the $(i+1)$-th convolutional layer is the pooling result of the $i$-th pooling layer: $S^i = x^{i+1}$. $S^1$ contains the neighbor feature interactions produced by the first convolutional layer and pooling layer. To obtain the global non-neighbor feature interactions, important new features are produced by recombining the local neighbor feature interactions through a fully connected layer,
where $B$ denotes the bias and $W$ denotes the weight matrix.
Performing the above process multiple times can generate new features.
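The convolution, max-pooling, and recombination steps described above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the kernel has height $h$ and width 1 and produces a single feature map, the activation is taken to be tanh, pooling windows do not overlap, and all names (`conv_height`, `max_pool_height`, `recombine`) and random values are hypothetical rather than the authors' implementation.

```python
import math
import random


def conv_height(E, kernel, act=math.tanh):
    """Valid convolution along the field (row) axis with an h x 1 kernel."""
    h, n, d = len(kernel), len(E), len(E[0])
    return [[act(sum(kernel[j] * E[p + j][q] for j in range(h)))
             for q in range(d)] for p in range(n - h + 1)]


def max_pool_height(C, h_p):
    """Non-overlapping max-pooling of height h_p along the row axis."""
    rows = len(C) // h_p
    return [[max(C[p * h_p + j][q] for j in range(h_p))
             for q in range(len(C[0]))] for p in range(rows)]


def recombine(S, W, B, act=math.tanh):
    """Fully connected layer on the flattened pooled map: act(W s + B)."""
    s = [v for row in S for v in row]
    return [act(sum(w * v for w, v in zip(w_row, s)) + b)
            for w_row, b in zip(W, B)]


rng = random.Random(0)
# Hypothetical embedding matrix: 6 feature fields, embedding size 4.
E = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
kernel = [rng.uniform(-1, 1) for _ in range(3)]   # kernel height h = 3
C = conv_height(E, kernel)                        # 4 x 4 neighbor interactions
S = max_pool_height(C, h_p=2)                     # 2 x 4 after pooling
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
B = [0.0] * 4
R = recombine(S, W, B)                            # 4 recombined values
print(len(C), len(S), len(R))
```

Because every output of the fully connected layer depends on all pooled rows, the recombined features can mix fields that were never adjacent in the convolution window, which is the role the text assigns to the recombination layer.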
(2) Squeeze network. We have designed a squeeze network that is inspired by the CIN module in the xDeepFM model. The structure is shown in Figure 3. Its core is the summation and pooling of single hidden layer feature vectors, and the complexity of the model will not increase exponentially with the increase in the degree of interaction.
The features of the input and of the hidden layers in the squeeze network are each organized into a matrix, denoted as $x^0$ and $x^{k+1}$, respectively. The neurons in each layer of the squeeze network are calculated from the original input feature vectors and the previous layer. The formula is:
where $x^{k-1}$ represents the state of the previous layer and $m$ is the number of feature fields. The calculation of a hidden layer can be divided into two steps:
(1) The first step takes the outer product of the row vectors of $x^0$ and $x^k$, producing a new vector that is taken as the intermediate result $z^{k+1}$. The calculation process is shown in Figure 4a.
(2) The second step performs a convolution on the result of the first step to obtain a new feature map. Each convolution is applied to the entire feature map, as shown in Figure 4b.
A characteristic of the squeeze network is that the order of the learned feature interactions is determined by the number of network layers, and each hidden layer is connected to the output through a pooling operation, which ensures that the output unit sees feature interactions of different orders.
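The two-step computation of a squeeze network layer can be sketched as below. Since the layer is described as inspired by CIN, this sketch assumes a CIN-style formulation: the outer product is taken separately for each embedding dimension, and each "convolution" filter is a weighted sum over the entire interaction map. The function names, the weight layout `W[f][h][i]`, and the toy values are assumptions for illustration only.

```python
def squeeze_layer(x0, xk, W):
    """One CIN-style layer.

    x0: m x D input field embeddings.
    xk: H x D state of the previous layer.
    W:  F x H x m weights, one H x m filter per output feature map.
    Returns an F x D matrix.
    """
    D = len(x0[0])
    # Step 1: outer product of the rows of xk and x0, per embedding dimension.
    z = [[[xk[h][d] * x0[i][d] for d in range(D)]
          for i in range(len(x0))] for h in range(len(xk))]
    # Step 2: each filter takes a weighted sum over the whole H x m map.
    return [[sum(W[f][h][i] * z[h][i][d]
                 for h in range(len(xk)) for i in range(len(x0)))
             for d in range(D)] for f in range(len(W))]


def sum_pool(x):
    """Sum each feature map over the embedding dimension."""
    return [sum(row) for row in x]


x0 = [[1, 2], [3, 4]]                                # 2 fields, embedding size 2
W1 = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]            # 2 filters (toy values)
W2 = [[[1, 1], [1, 1]]]                              # 1 filter (toy values)
x1 = squeeze_layer(x0, x0, W1)                       # second-order interactions
x2 = squeeze_layer(x0, x1, W2)                       # third-order interactions
output = sum_pool(x1) + sum_pool(x2)                 # every layer feeds the output
print(output)
```

Stacking one more layer raises the interaction order by one, and pooling every layer into the output is what lets the final unit see interactions of several orders at once.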
(3) Deep neural network. This module is composed of fully connected layers, i.e., a multi-layer perceptron. There is a non-linear activation function between layers. The hidden layers, which lie between the input and the output, are densely connected, and the strength of these connections is determined by the weights of the network. The formula is as follows:
where $w_z$ and $b_z$ are the weight and bias, $\beta_i$ denotes the $i$-th hidden layer, and $\sigma$ is the activation function.

Uncertainty Quantification
Building on traditional Bayesian theory, a Bayesian neural network places a prior distribution over the network parameters. To obtain the uncertainty estimate of the FiBDL model, a prior distribution is placed on the parameters of the deep neural network, and the posterior distribution of the parameters is then determined to achieve the best prediction. Epistemic uncertainty measures the uncertainty of the model parameters. It derives from the limited representation ability of the observed values, which is reflected in the difference between the predicted value $\hat{y}$ and the true value $y$. The uncertainty of the model is quantified by placing a prior distribution over the model parameters $\omega$ and approximating the posterior with an inference algorithm.
Consider a BDL model with $L$ layers and parameters $\omega = \{w_l, b_l\}_{l=1}^{L}$. The prior distribution is the probability distribution of the model parameters independent of any observation. In Bayesian neural networks, it is usually assumed that the prior distribution $p(\omega)$ of the parameters $\omega$ is a standard Gaussian: $\omega \sim \mathcal{N}(0, 1)$.
Upon determining an appropriate prior distribution based on prior knowledge, the model likelihood is $p(Y|X,\omega)$. According to Bayesian inference, the posterior of the parameters $\omega$ can be calculated as:
$$p(\omega|X,Y) = \frac{p(Y|X,\omega)\,p(\omega)}{p(Y|X)}$$
Under this posterior, if we have a new input instance $x^*$, the predicted output $y^*$ is given by
$$p(y^*|x^*,X,Y) = \int p(y^*|x^*,\omega)\,p(\omega|X,Y)\,d\omega$$
The target of Bayesian inference is to obtain the posterior $p(\omega|X,Y)$ of the model parameters for the given dataset. However, because of the complex non-conjugacy and non-linearity in deep models, the posterior distribution is intractable. For this reason, different inference methods, such as Markov chain Monte Carlo (MCMC) [39] and variational inference, have been proposed to approximate it. Gal et al. proposed an approximate inference method that does not change the existing model architecture and proved that the posterior distribution can be approximated by Monte Carlo dropout.
Specifically, this method defines an easy-to-evaluate distribution $q(\omega)$ to approximate the true posterior $p(\omega|D)$. It then optimizes the parameters of $q(\omega)$ to make it as close as possible to $p(\omega|D)$. This is achieved by minimizing the KL divergence between $q(\omega)$ and $p(\omega|D)$, which is denoted as:
$$\mathrm{KL}\left(q(\omega)\,\|\,p(\omega|D)\right) = \int q(\omega)\log\frac{q(\omega)}{p(\omega|D)}\,d\omega$$
It has been proved that minimizing the KL divergence is equivalent to maximizing the evidence lower bound. Maximizing it yields a variational distribution $q(\omega)$ that stays close to the prior distribution while explaining the data well. This approach has the advantage of replacing the integration problem with an optimization problem that maximizes a parameterized function. For a dataset of $N$ samples, the evidence lower bound is equal to
$$\mathcal{L}_{VI} = \sum_{n=1}^{N}\int q(\omega)\log p(y_n|x_n,\omega)\,d\omega - \mathrm{KL}\left(q(\omega)\,\|\,p(\omega)\right) \tag{11}$$
The first term represents the expected log-likelihood. It can be approximated by MC integration with a single sample $\hat{\omega}_n \sim q(\omega)$, which gives the unbiased estimate $\log p(y_n|x_n,\hat{\omega}_n)$. The basic concept of MC integration is to replace the integral with a summation. The second term can be approximated by regularization, and its role is to prevent the model from over-fitting. Then, Equation (11) can be rewritten as:
$$\hat{\mathcal{L}}_{VI} = \sum_{n=1}^{N}\log p(y_n|x_n,\hat{\omega}_n) - \mathrm{KL}\left(q(\omega)\,\|\,p(\omega)\right) \tag{12}$$
According to the approximate posterior $q(\omega)$, the approximate predictive probability distribution of a sample $x^*$ is:
$$q(y^*|x^*) = \int p(y^*|x^*,\omega)\,q(\omega)\,d\omega \tag{13}$$
Next, the prediction is approximated by MC integration of Equation (13). This is equivalent to performing $T$ stochastic forward passes through the neural network and then averaging the results.
For CTR prediction, the general practice is to squash the model output with the Softmax function. The prediction can be obtained by approximating the predictive mean as follows:
$$p(y^*=c|x^*,D) \approx \frac{1}{T}\sum_{t=1}^{T}\mathrm{Softmax}\left(f^{\hat{\omega}_t}(x^*)\right)_c$$
where $c \in \{0, 1\}$, $\hat{\omega}_t \sim q_\theta(\omega)$, $q_\theta(\omega)$ represents the dropout distribution, and $f^{\omega}(\cdot)$ denotes the prediction of the model.
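The averaging over $T$ stochastic forward passes can be illustrated with a small Python sketch. It assumes a toy network with one hidden ReLU layer, inverted dropout on the hidden units, and hand-picked weights; the names `forward` and `mc_dropout_predict`, the dropout rate, and all numeric values are hypothetical and serve only to show the mechanism, not the FiBDL architecture.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2, p_drop, rng):
    """One stochastic forward pass; dropout stays active at prediction time."""
    hidden = []
    for w_row, b in zip(W1, b1):
        a = max(0.0, sum(w * v for w, v in zip(w_row, x)) + b)   # ReLU
        keep = rng.random() >= p_drop                            # Bernoulli mask
        hidden.append(a / (1.0 - p_drop) if keep else 0.0)       # inverted dropout
    logits = [sum(w * h for w, h in zip(w_row, hidden)) + b
              for w_row, b in zip(W2, b2)]
    return softmax(logits)


def mc_dropout_predict(x, params, T=100, p_drop=0.5, seed=0):
    """Average the class probabilities of T stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, *params, p_drop, rng) for _ in range(T)]
    n_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(n_classes)]
    return mean, samples


# Hypothetical weights: 3 inputs -> 4 hidden units -> 2 classes (no click, click).
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[0.7, -0.3, 0.2, 0.1], [-0.4, 0.6, 0.5, -0.2]]
b2 = [0.0, 0.0]
params = (W1, b1, W2, b2)
x = [1.0, 0.5, -0.3]

mean, samples = mc_dropout_predict(x, params, T=50)
print(mean)
```

The spread of the individual entries in `samples` around `mean` reflects how much the prediction changes when different dropout masks, i.e. different parameter samples, are used.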
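In classification settings, uncertainty can be measured by the predictive entropy $H(y^*|x^*,D)$, which combines epistemic uncertainty and aleatoric uncertainty:
$$H(y^*|x^*,D) = -\sum_{c} p(y^*=c|x^*,D)\log p(y^*=c|x^*,D)$$
For a new input $x^*$, the predictive entropy is approximated as:
$$\hat{H}(y^*|x^*,D) = -\sum_{c}\left(\frac{1}{T}\sum_{t=1}^{T} p(y^*=c|x^*,\hat{\omega}_t)\right)\log\left(\frac{1}{T}\sum_{t=1}^{T} p(y^*=c|x^*,\hat{\omega}_t)\right)$$
In a Bayesian neural network, the entropies of the $T$ MC samples for a single input are averaged to capture the aleatoric uncertainty related to the data. The mutual information (MI) between the parameters $\omega$ and the output $y^*$ is used to quantify the epistemic uncertainty.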
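The decomposition described above can be sketched in Python. The sketch assumes that the input is a list of $T$ class-probability vectors from stochastic forward passes, that the aleatoric part is the average entropy of the individual passes, and that the epistemic part (mutual information) is the predictive entropy minus that average, in line with the definitions given earlier. The function names, the constant `eps`, and the toy samples are assumptions for illustration.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


def decompose_uncertainty(samples):
    """samples: T class-probability vectors from T stochastic forward passes."""
    T, n_classes = len(samples), len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(n_classes)]
    total = entropy(mean)                               # predictive entropy
    aleatoric = sum(entropy(s) for s in samples) / T    # average per-pass entropy
    epistemic = total - aleatoric                       # mutual information
    return total, aleatoric, epistemic


# Passes agree but are unsure: the uncertainty is attributed to the data.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Passes are individually confident but disagree: attributed to the model.
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(decompose_uncertainty(agree))
print(decompose_uncertainty(disagree))
```

Both toy inputs have the same predictive entropy, yet the split between the two components differs, which is the distinction that the total entropy alone cannot express.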

Experimental Results
Our main work is to construct a prediction model and measure the uncertainty. In this section, we validate the FiBDL model from two aspects. The first one is to illustrate the superiority of the model in CTR prediction by comparing it with several benchmark models. The second is to quantify the uncertainty of the proposed model in its predictions.

Experimental Settings
(1) Datasets
(1) The Taobao dataset is a CTR prediction dataset of display ads shown on the Taobao website. It consists of two parts: user_profile and ad_features.
(2) The Avazu dataset was published for the CTR prediction contest on Kaggle in 2014. It contains four types of attributes: user, website, advertisement, and time. Each click sample has 24 data fields, including 22 categorical features.
(3) The ICME dataset is used to predict whether users will finish watching and like videos. Each click sample has two dense features and nine sparse fields.
After shuffling, each dataset is randomly divided into a training set and a test set: 80% of the samples are placed in the training set and 20% in the test set.
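A shuffled 80/20 split of this kind can be sketched as follows. The function name `shuffle_split`, the fixed seed, and the stand-in sample list are assumptions for illustration; the text does not state which seed or tooling was used.

```python
import random


def shuffle_split(samples, train_frac=0.8, seed=42):
    """Shuffle a copy of the samples and split it into train and test parts."""
    rng = random.Random(seed)      # fixed seed (an assumption) for repeatability
    shuffled = list(samples)       # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))            # stand-in for 100 click samples
train, test = shuffle_split(data)
print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, so that differences between models are not caused by different partitions of the data.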
(2) Evaluation metrics
In this paper, we use two metrics, Logloss and RMSE, to measure the prediction performance of all models.
Logloss is a criterion widely applied to classification problems. It measures the difference between the predicted value and the true value. Its definition is given in Equation (1).
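RMSE measures the deviation between the predicted value and the true value. As a metric for uncertainty in this article, the RMSE is denoted as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ is the true label, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples.
(3) Baselines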
The FiBDL model is compared with the following models, each of which is briefly described below:
• FiBiNET. The FiBiNET adopted SENET to learn the importance of features and learned feature interactions through a bilinear function;
• DIFM. The DIFM model utilized fully connected layers and multi-head self-attention networks to obtain feature interactions at the bit-wise and vector-wise levels, respectively;
• DeepFEFM [40]. The DeepFEFM model employed a field-embedding factorization machine to learn a symmetric matrix embedding for each field pair, as well as a common single-vector embedding for each feature;
• DCN V2 [41]. The DCN V2 reduced the dimension of the parameter matrix by low-rank matrix decomposition and effectively learned feature interactions;
• MoFM. The MoFM model combined LR, FM, and self-attention networks to obtain both low- and high-order feature interactions;
• XCrossNet [42]. The extreme cross network (XCrossNet) is a three-stage model that models the interactions of dense and sparse features separately, lets them interact with each other at the concatenation layer, and finally uses an MLP to capture the feature interactions.

Performance Comparison
We compare the prediction performance of the FiBDL model with the classical and efficient models described in the previous section. The performance of all models is shown in Table 1, and the following observations can be made. LR performs worse than the other models, which illustrates the capability of both factorization models and deep neural network-based models. FM and AFM can model second-order feature interactions to improve prediction accuracy. AFM significantly outperforms FM on all datasets, which is probably attributable to the attention mechanism in feature learning. Models such as FNN and xDeepFM achieve further performance improvements by combining FM and neural networks in one framework. The reason is that neural networks help FM-based models extract information that is helpful for the prediction task. Furthermore, both FiBiNET and DIFM can measure feature importance in the feature learning stage. Combining core architectural components that learn explicit or implicit feature interactions with deep neural network components is beneficial for improving prediction performance.
Experimental results show that compared with the classical shallow prediction methods (LR, FM), the FiBDL has an advantageous position in prediction performance. The improvement stems from the fact that the FiBDL model learns high-order feature interaction information. The performance of the FiBDL model significantly outperforms the best and most advanced models. The performance metric Logloss is decreased by about 3.25%, 11.99%, and 1.81% compared to the best baseline method on all datasets, respectively. The reason may be that we completed the feature selection in the initial stage, which pruned redundant feature information. This is useful and positive for feature learning. Learning both implicit and explicit higher-order feature interactions simultaneously plays an important role in prediction tasks and is useful for improving performance.
Of the three datasets, FiBDL is about 1.95%, 9.77%, and 0.55% lower than the best performing model on the RMSE criterion, respectively. According to the results of Logloss and RMSE on the three datasets, the superiority of the proposed FiBDL model in the CTR prediction task is verified.
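The two criteria and the relative reductions quoted above can be illustrated with a short sketch. This is only an illustration under assumptions: the text does not state how the percentages were computed, so the relative-reduction formula below is an assumed convention, and the function names and toy values are hypothetical rather than taken from the experiments.

```python
import numpy as np

def logloss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy (Logloss), averaged over all samples."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def rmse(y_true, p_pred):
    """Root mean squared error between labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline` (assumed convention)."""
    return 100.0 * (baseline - proposed) / baseline

# Toy labels and predictions (hypothetical values, for illustration only).
y = np.array([1, 0, 0, 1])
p_base = np.array([0.6, 0.4, 0.3, 0.7])   # stand-in for a baseline model
p_new = np.array([0.7, 0.3, 0.2, 0.8])    # stand-in for the proposed model

print("Logloss reduction (%):", relative_reduction(logloss(y, p_base), logloss(y, p_new)))
print("RMSE reduction (%):", relative_reduction(rmse(y, p_base), rmse(y, p_new)))
```

For both criteria a lower value is better, so a positive reduction means that the second model improves on the first.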

Feasibility of Using MC Dropout
To verify the effectiveness of using Monte Carlo (MC) dropout to model uncertainty, we compare the FiBDL model with the basic model without dropout. According to Table 2, the model using MC dropout performs well on the three datasets in terms of prediction performance. In terms of Logloss, which reflects the quality of the predicted probabilities, the model with MC dropout improves on the basic model by 0.42%, 0.22%, and 0.19% on the three datasets, respectively. For the RMSE metric, FiBDL with MC dropout achieves improvements of 0.26%, 0.1%, and 0.07%, respectively. Among them, the improvement on the ICME dataset is comparatively minor. These results indicate that MC dropout can improve prediction accuracy while also reducing network overfitting.
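The idea behind MC dropout prediction can be sketched as follows: dropout stays active at test time, several stochastic forward passes are run, and their outputs are averaged. This is a minimal NumPy sketch under assumptions, not the FiBDL implementation: the one-hidden-layer network, the random weights, the number of samples, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_forward(x, W1, b1, W2, b2, drop_ratio, rng):
    """One forward pass with dropout kept active (inverted-dropout scaling)."""
    h = np.maximum(0.0, x @ W1 + b1)            # ReLU hidden layer
    keep = rng.random(h.shape) >= drop_ratio    # Bernoulli keep mask
    h = h * keep / (1.0 - drop_ratio)           # rescale the surviving units
    return sigmoid(h @ W2 + b2).ravel()         # predicted click probabilities

def mc_dropout_predict(x, params, drop_ratio=0.05, n_samples=50, seed=0):
    """Average `n_samples` stochastic passes; also return the raw samples."""
    rng = np.random.default_rng(seed)
    samples = np.stack([
        stochastic_forward(x, *params, drop_ratio, rng)
        for _ in range(n_samples)
    ])                                          # shape: (n_samples, n_inputs)
    return samples.mean(axis=0), samples

# Toy network and inputs (random, hypothetical; stands in for a trained model).
init_rng = np.random.default_rng(42)
params = (init_rng.normal(size=(8, 16)), np.zeros(16),
          init_rng.normal(size=(16, 1)), np.zeros(1))
x = init_rng.normal(size=(4, 8))

mean_p, samples = mc_dropout_predict(x, params)
print("integrated prediction:", mean_p)
print("spread across passes:", samples.std(axis=0))
```

The averaged output plays the role of the integrated prediction, while the spread of the individual passes carries the information later used to quantify uncertainty.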

Ablation Study
To verify the validity of each part of the FiBDL model and to better understand their relative importance, in this set of experiments one part is removed at a time while the rest are kept unchanged.
(1) Prediction Performance
We observe the two metrics that measure the CTR prediction performance, as can be seen in the corresponding figure. When the feature selection module is removed, the performance of the model on the Avazu dataset degrades significantly: Logloss and RMSE increase by about 3.67% and 3.12%, respectively. On the other two datasets, Logloss increases by 0.38% and 0.75% (RMSE by 0.23% and 0.67%). This demonstrates that selecting the input data before feature interaction can improve the prediction performance.
Once the DNN was removed, the Logloss performance of the model deteriorated on all datasets, by about 1.28%, 2.23%, and 3.43%, respectively. The performance degradation was significant on all three datasets, which shows that the implicit modeling of high-order feature interactions by the DNN has a considerable influence on the model.
After the squeeze network is removed, the performance of the model on the Avazu dataset falls markedly, whereas the change in Logloss on the remaining two datasets is not significant. The RMSE on the three datasets increases by approximately 0.25%, 2.26%, and 0.64%, respectively. This demonstrates the effectiveness of the squeeze network in learning explicit higher-order feature interactions. Overall, FiBDL outperforms all ablated variants, which validates that each component of the proposed FiBDL model is critical for prediction performance.
(2) Uncertainty Quantification
As Figure 6 shows, the ability of the model to quantify uncertainty fluctuates when any component is removed from the FiBDL model. On the three datasets, the quantified aleatoric uncertainty shows only a small deviation, which stems from the fact that the size of the experimental dataset is constant; aleatoric uncertainty is data-dependent. Epistemic uncertainty is tightly correlated with the model parameters, so the ability to quantify it is affected if any component of the model is removed.
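The two kinds of uncertainty discussed here can be computed from MC dropout samples roughly as follows. The sketch assumes the common entropy-based decomposition, which is consistent with the paper's statement that information entropy gives the sum of the two uncertainties and mutual information gives the epistemic part; taking the aleatoric part as the expected entropy of the individual passes is an assumption, and the function names and toy values are hypothetical.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with click probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def decompose_uncertainty(samples):
    """Split predictive uncertainty given MC samples of shape (T, n).

    total     : entropy of the averaged prediction (sum of both kinds)
    aleatoric : average entropy of the individual passes (assumed definition)
    epistemic : mutual information = total - aleatoric
    """
    samples = np.asarray(samples, dtype=float)
    total = binary_entropy(samples.mean(axis=0))
    aleatoric = binary_entropy(samples).mean(axis=0)
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Toy samples (hypothetical): three stochastic passes for two impressions.
toy = np.array([[0.50, 0.10],
                [0.50, 0.90],
                [0.50, 0.50]])
total, aleatoric, epistemic = decompose_uncertainty(toy)
print("total:", total)
print("aleatoric:", aleatoric)
print("epistemic:", epistemic)
```

In the toy example, the first impression receives the same prediction in every pass, so its epistemic part vanishes and all of its uncertainty is aleatoric; the second impression receives conflicting predictions across passes, which shows up as a positive mutual information.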

Influence of the Training Data Size
To examine how the predictive performance and the two types of uncertainty of the FiBDL model vary as the training data increase, we train the FiBDL model with 20%, 40%, 60%, 80%, and 100% of the training data, respectively.
As shown in Figure 7, as the training data increase, the prediction performance of the FiBDL model on the Taobao and Avazu datasets first increases and then decreases. On the Taobao dataset, the prediction performance is best when 80% of the training data is used; on the Avazu dataset, it is optimal when 60% is used. On the ICME dataset, the performance of the FiBDL model initially fluctuates only slightly, and when the training data reach 100%, the model is sufficiently trained and achieves its best prediction performance.
Figure 8 shows the relationship between the two classes of uncertainty and the percentage of the training data used during training. Epistemic uncertainty arises from the parameters of the model and from incomplete training. Aleatoric uncertainty originates from noise in the data and cannot be eliminated. As the training data increased, the aleatoric uncertainty of the FiBDL model on the Taobao dataset decreased, while the epistemic uncertainty fluctuated less. On the Avazu dataset, both uncertainties declined. When the proportion of the training data increased from 80% to 100%, the rate of change was much faster, which may be due to the homogeneous nature of the Avazu dataset itself.

Hyper-Parameter Study
(1) Effect of the Dropout Ratio
Dropout is a regularization technique that prevents overfitting. It refers to temporarily dropping neurons from a deep learning network with a given probability during training. In this paper, we conduct experiments with dropout ratios chosen from {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}. Figure 9 displays the influence of the dropout ratio on the prediction performance of FiBDL. As the dropout ratio grows, the performance of FiBDL on the three datasets decreases considerably. On the Avazu dataset, the values of Logloss and RMSE increase steadily, meaning that the prediction performance decreases by 3.33% and 3.01%, respectively, as the dropout ratio increases. On the other two datasets, the values of Logloss and RMSE fluctuate. On the Taobao dataset, the performance in terms of Logloss and RMSE declines sharply as the dropout ratio increases and then fluctuates slightly. On the ICME dataset, the optimal performance on both metrics is achieved when the dropout ratio is 0.1. In this experiment, the dropout ratio is fixed at 0.05.
Figure 10 shows the relationship between the uncertainty of the prediction model and the dropout ratio. On the Avazu dataset, the quantified epistemic uncertainty decreases steadily as the dropout ratio increases, while the aleatoric uncertainty increases steadily. The reason is that a larger dropout value makes the model more robust, which is equivalent to dropping some features. Epistemic uncertainty is related to the model parameters, whereas aleatoric uncertainty is data-based. On the Taobao dataset, the quantification of aleatoric uncertainty is sensitive to the dropout ratio. This may be because dropout sampling during training prevents the model from relying too much on certain features, even if they are real. On the ICME dataset, both uncertainties first trend upward and then drop sharply as the dropout ratio increases, reaching the optimal state when the dropout value is 0.25.
Figure 10. Uncertainty with different dropout ratios.
(2) Effect of the Network Depth
In the deep part, adding hidden layers can better separate the features of the data. We vary the number of network layers and observe how the model performance and the uncertainties change as the number of layers increases. Figure 11 shows the influence of the network depth in the FiBDL model. In particular, we study network depths in the range {1, 2, 3, 4, 5, 6}. On the Taobao dataset, the FiBDL model has the best prediction performance when the network depth is 3. When the number of layers is less than 3, the model performance is positively correlated with the network depth; beyond that point, the predictive performance decreases as the network depth increases. On the ICME and Avazu datasets, the FiBDL model has the best prediction performance when the number of layers is 1 or 2. The results illustrate that excessive hidden layers do not bring great improvement to the FiBDL model, so it is essential to select a reasonable network depth.
Figure 11. Prediction performance w.r.t. different depths in the DNN.
Increasing the depth of the neural network also increases the number of MC dropout layers in the FiBDL model. The results are shown in Figure 12. On the ICME dataset, both uncertainties of the model decrease while the network depth is less than 3 and reach their minimum when the depth is 3. On the Taobao dataset, the fluctuation of epistemic uncertainty is small, and the aleatoric uncertainty achieves its minimum value when the number of layers is 3. The deviation of the aleatoric uncertainty of the FiBDL model on the Avazu dataset is very small. Based on comprehensive considerations, a 3-layer network is appropriate.
(3) Effect of the Training Epoch
This section examines the effect of the number of training epochs. All experiments were performed on the same randomly generated train-test splits of the data. Figure 13 shows that the prediction performance of the FiBDL model on the three datasets changes significantly as the training epoch increases. On the Taobao dataset, the best prediction performance is achieved when the training epoch is 30. On the Avazu dataset, the two prediction performance metrics of the model do not reach their optimum simultaneously as the training epoch increases. On the ICME dataset, the number of training epochs used in the original experiment is the setting at which the performance is optimal.
The uncertainty results are shown in Figure 14. For the Taobao dataset, the two uncertainties show an overall increasing trend as the epoch increases. When the training epoch is 40, the two uncertainties are the largest, and the aleatoric uncertainty in particular increases greatly. For the Avazu dataset, when the training epoch increases from 10 to 20, the epistemic uncertainty decreases rapidly, while the aleatoric uncertainty increases and then remains unchanged. For the Taobao and ICME datasets, the uncertainty is minimal when the training epoch is set to 1; at this setting, the best model with respect to each type of uncertainty is obtained for both datasets.

Conclusions
To address the uncertainty problem in CTR prediction models, we propose a CTR prediction model based on Bayesian deep learning. To improve the representation of the features used in the model, we build a prediction structure that combines feature selection and feature interaction. The feature selection module is used to select meaningful global feature interactions, and the squeeze network and the DNN are used in parallel to model higher-order feature interactions. We train a dropout network, use dropout at test time to obtain Monte Carlo samples of the prediction, and then use information entropy to quantify the uncertainty. The Bayesian deep learning architecture can reliably estimate the uncertainty and improve the robustness of the model by introducing uncertainty into the parameters of the neural network. Experimental results show that this model can provide reliable predictions and effective uncertainty estimation.