A New Click-Through Rates Prediction Model Based on Deep&Cross Network

Abstract: With the development of e-commerce, online advertising has thrived and gradually developed into a new mode of business, of which Click-Through Rate (CTR) prediction is the essential driving technology. Given a user, a commodity, and a scenario, a CTR model predicts the probability that the user clicks on an online advertisement. Recently, great progress has been made by introducing Deep Neural Networks (DNN) into CTR prediction. To further advance DNN-based CTR prediction models, this paper introduces a new model, FO-FTRL-DCN, based on the well-known Deep&Cross Network (DCN) augmented with the Follow The Regularized Leader (FTRL) optimization technique for DNN. Extensive comparative experiments on the iPinYou datasets show that the proposed model outperforms other state-of-the-art baselines, with better generalization across the different datasets in the benchmark.


• The modeling of feature interactions is still limited. When a neural network learns high-order features, it cannot effectively identify the importance and association of combined features, which further restricts the prediction accuracy of the model.
• Noise can be introduced in the embedding phase, which can then skew the final prediction results.
• The data for CTR prediction is imbalanced, so a classic deep model cannot predict accurately, since it needs a large amount of data of the same category to learn discriminative features.
• The optimization algorithm used to train the DNN may not work well, because the underlying distribution of the data may be more complex than assumed.
This paper addresses these issues. In summary, the main contributions of this paper are as follows:
• A new end-to-end model, FO-FTRL-DCN, has been introduced, based on the Deep&Cross Network (DCN) model [18] augmented with FTRL [11].
• Preprocessing techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) [23], have been applied to optimize FO-FTRL-DCN, aiming to denoise and balance the data and leading to better performance.

• The proposed FO-FTRL-DCN model has been evaluated on multiple datasets of the prestigious iPinYou benchmark, showing better performance than state-of-the-art baselines and better generalization across different datasets.
The paper is structured as follows. Section 2 presents some preliminaries, and Section 3 elaborates the proposed FO-FTRL-DCN model. Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.

Preliminaries
In this section, we review some fundamental concepts concerning DNN and CTR, as well as some training and optimization techniques.

Embedding
Embedding is an essential approach in DNN to learn latent representations of items, capturing similarities by embedding them in a low-dimensional space. Item2vec [24] is an embedding method for recommendation systems that represents items by embedded vectors. In a recommendation system, the sequence of items is equivalent to a sequence of words, and each user's behavior sequence constitutes a set. By subsampling, we can model the difference between popular items and unpopular ones: an item $w$ in the input sequence is discarded with a certain probability, calculated as Formula (1):

$P(\mathrm{discard} \mid w) = 1 - \sqrt{\rho / f(w)}$, (1)

where $f(w)$ is the frequency of the item $w$ in the sequence, and $\rho$ is the preset threshold of the algorithm. The truncated normal distribution is a special normal distribution that limits the range of variable values [25]. In the embedding stage, using the truncated normal distribution instead of the ordinary normal distribution can reduce some noise interference.
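As a minimal illustration of this subsampling rule and of truncated-normal embedding initialization (the threshold value, the two-standard-deviation cutoff, and all names below are illustrative assumptions, not the paper's exact settings):

import numpy as np
from scipy.stats import truncnorm

def discard_probability(freq, rho=1e-5):
    # Formula (1): P(discard | w) = 1 - sqrt(rho / f(w)), clipped at 0.
    return max(0.0, 1.0 - np.sqrt(rho / freq))

def subsample(sequence, freqs, rho=1e-5, rng=np.random.default_rng(0)):
    # Keep each item w with probability 1 - P(discard | w).
    return [w for w in sequence if rng.random() >= discard_probability(freqs[w], rho)]

# Truncated-normal initialization for an embedding table: values beyond
# two standard deviations are re-drawn, removing extreme initial weights
# (one source of the noise interference mentioned above).
embedding_table = truncnorm.rvs(-2, 2, scale=0.05, size=(10000, 16))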

Oversampling
In binary classification problems, data imbalance is commonly seen, where there is a large gap between the number of positive and negative samples. There are usually three ways to address this problem: (1) threshold adjustment, which makes the classifier more inclined toward the minority classes; (2) undersampling, which reduces the number of majority-class samples to make the data more balanced; (3) oversampling, which increases the number of minority-class samples. Generally, oversampling and undersampling work better than threshold adjustment, but they sometimes carry a risk of overfitting, so appropriate regularization should be used together with them.
Generally speaking, oversampling works better than undersampling. Random oversampling is a simple oversampling method: samples are drawn at random from the minority classes and added to the original data to generate more balanced training data [26].
Synthetic Minority Oversampling Technique (SMOTE) [23] is an improved and well-known sampling technique based on random oversampling. Because random oversampling balances the data by simply copying minority samples at random, it easily leads to overfitting. The basic idea of SMOTE is to analyze the minority samples, synthesize new ones, and add them to the original data, balancing the ratio of positive and negative training samples. The main flow of SMOTE is given in Algorithm 1.
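In practice, SMOTE is available off the shelf. The snippet below uses the imbalanced-learn library (an assumption for illustration; the paper does not state its implementation), with a comment recalling the core interpolation step of Algorithm 1:

import numpy as np
from imblearn.over_sampling import SMOTE

# Core idea of Algorithm 1: a synthetic minority sample interpolates
# between a minority point x and one of its k nearest minority
# neighbors x_nn:  x_new = x + u * (x_nn - x), with u ~ Uniform(0, 1).
X = np.random.rand(1000, 8)            # toy features
y = np.array([0] * 950 + [1] * 50)     # 19:1 class imbalance
X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))              # both classes now have 950 samples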

Optimization for DNN
Compared with traditional machine learning approaches, DNN-based CTR prediction has made significant improvements. These models are composed of multi-layer neural networks, or combine multiple DNNs for prediction. However, training DNNs is very tricky, and some essential optimization techniques must be involved to build a better CTR model.

Batch Normalization
Batch Normalization (BN) was proposed to solve the Internal Covariate Shift (ICS) problem [27]. BN standardizes the input of each hidden layer, forcing the input distribution of each layer back toward the standard normal distribution. In this way, the gradient is enlarged, which avoids gradient vanishing, and the larger gradient accelerates learning convergence [28]. The main algorithm flow is as follows (Algorithm 2).
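A minimal sketch of the BN transform in Algorithm 2 (training-time forward pass only; the running statistics used at inference are omitted):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); gamma and beta are learnable.
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift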

Dropout
Neural networks are prone to overfitting [34]. In statistical machine learning, model ensembles can be used to deal with this problem, while in DNN, dropout can effectively alleviate overfitting and play the role of regularization.
In the training of feed-forward neural networks, dropout [27] simply sets the activation of each hidden neuron to 0 with a certain probability p, so that the learning of neurons is more flexible, the model learns more effective features, and the generalization of the model becomes stronger.
An averaging effect applies after dropout: different network structures produce different overfitting, and averaging makes these errors cancel each other out. In effect, dropout can be seen as a powerful ensemble of models, leading to better generalization and avoiding overfitting to a certain extent.
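A minimal sketch of the (inverted) dropout variant commonly used in practice; the original formulation instead rescales the weights at test time, but the expected activation is the same:

import numpy as np

def dropout_forward(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    # Zero each activation with probability p during training; scaling by
    # 1/(1-p) keeps the expected activation unchanged, so no extra
    # averaging step is needed at test time.
    if not training:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask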

Activation Functions
The DNN model has a strong nonlinear fitting ability, which comes from the activation function. In the DNN-based model, it is very important to select the appropriate activation function to capture the non-linearity inside the data [35,36].
Generally speaking, each activation function has its own advantages and disadvantages, and no single activation function is suitable for all neural network models. When designing the DNN-based CTR prediction model in this paper, we tried each activation function separately and carried out extensive comparative experiments. We found that the sigmoid and tanh functions perform best for our model and the baselines.
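For reference, these two activations are defined as

$\sigma(x) = \dfrac{1}{1 + e^{-x}}, \qquad \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1,$

both of which squash their input into a bounded range ((0, 1) and (−1, 1), respectively).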

Training Techniques
In the task of DNN-based CTR prediction, training the neural network model essentially amounts to minimizing the fitting loss. After defining the loss function, the key to learning is to use an optimization algorithm to minimize it. Optimization techniques therefore play an important role, iterating to approach the optimal solution as closely as possible. Especially in the complex CTR prediction task, it is very important to choose an optimization algorithm that matches the network structure and the loss function [42,43].
In general, each optimization algorithm has its own strengths and weaknesses. In this paper, when training the deep CTR prediction model, we tried each of the optimization algorithms above and found that FTRL-Proximal and Adam are the most suitable for our proposed model and the baselines.

The Proposed Model
As we can see from recent advancements, the key factors for boosting CTR performance are more advanced and effective feature combinations, and more state-of-the-art DNN techniques for better training and capturing non-linearity. Therefore, based on Follow The Regularized Leader (FTRL) [11] and the Deep&Cross Network (DCN) [18], this paper proposes an integrated model with feature optimization (FO), namely FO-FTRL-DCN, which is jointly empowered by DCN-based embedding, SMOTE for the data imbalance, and the FTRL optimization algorithm [11]. More specifically, the major contributions of the model are as follows.

• First, in the embedding phase, the truncated normal distribution is introduced, which is a special normal distribution limiting the value range of variables. Replacing the ordinary normal distribution with the truncated one in the embedding stage reduces some noise interference.
• Secondly, SMOTE is applied to tackle the data imbalance: it analyzes the minority-class data, synthesizes new minority samples, adds them to the original data, and balances the ratio of positive and negative samples.
• Thirdly, the advanced FTRL optimization algorithm is applied, which can greatly improve the optimization of the CTR model in complex situations. FTRL is an efficient online learning and optimization algorithm in information retrieval and recommendation, combining the power of forward backward splitting (FOBOS) [12] and regularized dual averaging (RDA) [13]. It uses both L1 and L2 regularization terms in the iterative process, which greatly improves the prediction of the model. The FTRL-Proximal version [8] is adopted in the proposed model (a sketch of its per-coordinate update follows this list) and has yielded substantial performance improvements.
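As referenced above, a minimal sketch of the per-coordinate FTRL-Proximal update (the hyperparameter names alpha, beta, l1, l2 follow the usual presentation; the class structure and values are illustrative assumptions):

import numpy as np

class FTRLProximal:
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated adjusted gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def weights(self):
        # Closed-form proximal step: the L1 term zeroes small coordinates,
        # giving exact sparsity, which suits high-dimensional CTR features.
        return np.where(
            np.abs(self.z) <= self.l1,
            0.0,
            -(self.z - np.sign(self.z) * self.l1)
            / ((self.beta + np.sqrt(self.n)) / self.alpha + self.l2),
        )

    def update(self, grad):
        # Called with the gradient computed at the current weights().
        sigma = (np.sqrt(self.n + grad ** 2) - np.sqrt(self.n)) / self.alpha
        self.z += grad - sigma * self.weights()
        self.n += grad ** 2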
The general structure of FO-FTRL-DCN is shown in Figure 1. The backbone of the FO-FTRL-DCN model can be divided into two parts: one is the feed-forward deep network; the other is the cross network. The main ideas of FO-FTRL-DCN are detailed as follows.

The Pipeline
The input data of the proposed model mainly includes three types: user information, advertisement information, and context information. After the input data is processed by SMOTE to handle the imbalance, it enters the embedding layer for further processing and then flows into the two branches of the deep network layer and the cross network layer for training. The outputs of the two branches are fused in the combination layer and finally mapped into a value in the range [0, 1] through the sigmoid function of the output layer, yielding the estimated Click-Through Rate, as in Formula (7):

$p = \sigma(w_{\mathrm{logit}}^{\top} x_{\mathrm{stack}} + b_{\mathrm{logit}})$, with $\sigma(x) = \frac{1}{1 + e^{-x}}$, (7)

where $x_{\mathrm{stack}}$ is the output of the combination layer, $w_{\mathrm{logit}}$ is the weight vector, and $b_{\mathrm{logit}}$ is the bias.
For the input data, after SMOTE is used to deal with the imbalance, the augmented data is embedded in a unified way, including embedding optimization, stacking, and concatenation, and is then sent in different forms to the two branches of the deep network layer and the cross network layer for training. In the cross layer, explicit feature crossing is applied fully and effectively, while the mutual interactions between high-order features are learned in the deep network layer. The input of each hidden layer goes through batch normalization and an activation function; the outputs of the two branches then enter the combination layer, where they are merged into a single vector, and finally the CTR prediction is obtained through the sigmoid function of the output layer.
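A schematic, runnable sketch of this pipeline (the weights here are drawn randomly purely to make the data flow concrete; SMOTE, batch normalization, and the trained parameters are omitted, and all dimensions are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d = 16  # dimension of the stacked embedding/dense input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x0, n_cross=2, hidden=(32, 32)):
    # Cross branch: explicit bounded-degree feature crossing (Formula (8)).
    x = x0
    for _ in range(n_cross):
        w, b = rng.normal(size=d), rng.normal(size=d)
        x = x0 * (x @ w) + b + x
    # Deep branch: implicit high-order interactions (Formula (9)).
    h = x0
    for m in hidden:
        W, b = rng.normal(size=(m, h.shape[0])), rng.normal(size=m)
        h = np.tanh(W @ h + b)
    # Combination layer, then the sigmoid output layer (Formula (7)).
    x_stack = np.concatenate([x, h])
    w_logit, b_logit = rng.normal(size=x_stack.shape[0]), 0.0
    return sigmoid(w_logit @ x_stack + b_logit)

print(forward(rng.normal(size=d)))  # one predicted CTR in (0, 1)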

The Feature Optimization
The feature optimization (FO) is implemented by the two branches of the deep network and the cross network. The cross network branch carries out effective explicit feature crossing, learning certain bounded-degree feature interactions.
The cross network consists of several cross layers, each computed as Formula (8):

$x_{l+1} = x_0 x_l^{\top} w_l + b_l + x_l$, (8)

The calculation of one crossing is shown in Figure 2, where $x_0$ is the original input from the embedding layer, $x_l$ and $x_{l+1}$ are column vectors denoting the input and output of the current cross layer, and $w_l$, $b_l$ are the weight and bias parameters of that layer. Each cross layer adds back its input after the feature crossing, so the mapping function only fits the residual $x_{l+1} - x_l$. From this special structure, we can see that the degree of feature crossing increases with the depth of the network. Because the small number of parameters may limit the capacity of the cross network, the model further introduces a parallel deep network to better capture the high-order non-linear interactions of features; it is a fully connected feed-forward neural network whose layers are computed as Formula (9):

$h_{l+1} = f(W_l h_l + b_l)$, (9)

where $f$ is the activation function, $h_l$ is the l-th hidden layer, and $W_l$, $b_l$ are its weight matrix and bias.
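A one-function sketch of Formula (8), highlighting why the cross network is so cheap in parameters:

import numpy as np

def cross_layer(x0, xl, w, b):
    # x_{l+1} = x0 * (xl^T w) + b + xl: since xl^T w is a scalar, each
    # cross layer adds only 2d parameters (w and b), and the residual
    # connection adds back the layer input xl.
    return x0 * (xl @ w) + b + xl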
The complexity of the deep network can be estimated by its number of parameters. Assuming the input $x_0$ of the deep network has dimension $d$, the number of deep layers is $L_d$, and each deep layer has $m$ neurons, the total number of parameters is given by Formula (10):

$d \times m + m + (L_d - 1) \times (m^2 + m)$. (10)

For example, with $d = 128$, $m = 64$, and $L_d = 3$, this gives $128 \times 64 + 64 + 2 \times (64^2 + 64) = 16{,}576$ parameters.
The model finally uses the cross-entropy loss with regularization terms as Formula (11):

$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] + \lambda \sum_{l} \|w_l\|_2^2$, (11)

where $p_i$ is the predicted CTR of sample $i$, $y_i$ is its true label, $N$ is the number of samples, and $\lambda$ is the regularization coefficient.
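A direct transcription of Formula (11) (the argument names and the value of lam are illustrative):

import numpy as np

def regularized_log_loss(y, p, weights, lam=1e-4):
    # Cross entropy over N samples plus an L2 penalty on all layer weights.
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + lam * sum(np.sum(w ** 2) for w in weights)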
In summary, in the input layer, SMOTE is used to synthesize minority samples to deal with the data imbalance. In the embedding and stacking layer, sparse features are transformed into embedded vectors, with embedding optimization applied, and the embedding vectors and dense vectors are then concatenated. In the cross layer, explicit feature crossing is learned effectively, while the parallel deep layer fully learns the interactions between higher-order features. The combination layer integrates the outputs of the cross layer and the deep layer, and the output layer produces the predicted CTR through a sigmoid function. During training, the FTRL-Proximal algorithm [8] is adopted, greatly improving prediction performance.

The Datasets
The proposed FO-FTRL-DCN model for CTR prediction was evaluated on the famous iPinYou benchmark (https://contest.ipinyou.com/), and we carried out a large number of comparative studies and analyses. The iPinYou dataset consists of real advertising logs released by the iPinYou company in 2013, which include four types of log records: bidding, exposure opportunity, click, and conversion. Among them, the exposure opportunity and click logs can be used for CTR prediction. In the empirical studies, each data record is a corresponding exposure record, including features such as the user's own information, the relevant information of the advertising context and the advertisement, and the final click data. In these experiments, the exposure samples of the first 7 days were used as the training set, and the exposure samples of the next 3 days as the test set. We used the advertising and click logs of four advertisers (1458, 3358, 3386, and 3427) in the iPinYou data to establish four datasets, respectively. Table 1 shows the sample statistics of these four datasets.

Experimental Settings
All the experiments were carried out on a Linux server running Ubuntu 18.04 LTS (64-bit), with an Intel(R) Xeon(R) CPU @ 2.10 GHz, an NVIDIA TITAN RTX GPU with 24 GB of memory, and 64 GB of RAM. In these experiments, Area Under the Curve (AUC) and logloss are used as the main evaluation metrics of the model, which are the most commonly used indicators for measuring the performance of a CTR model. AUC is the area under the ROC curve: the larger the AUC value, the better the prediction performance of the model. Logloss is the cross-entropy loss: the smaller the logloss value, the smaller the error and the better the performance of the CTR prediction model.
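Both metrics are standard; for instance, with scikit-learn (toy values for illustration):

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1])              # ground-truth clicks
y_pred = np.array([0.1, 0.3, 0.8, 0.2, 0.6])    # predicted CTRs
print("AUC:", roc_auc_score(y_true, y_pred))    # higher is better
print("logloss:", log_loss(y_true, y_pred))     # lower is better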

Results and Analysis
The CTR prediction performance of the FO-FTRL-DCN model has been evaluated by six comparative experiments as follows.

The First Experiment
In the first experiment, the performance of several single models is compared for CTR prediction on the dataset of ID 1458. The results are shown in Figure 3, with logloss and AUC as the evaluation metrics.
It can be seen that, among these typical CTR prediction models, the logloss of the DCN model is the lowest and its AUC is the best, which verifies the fundamental strength of DCN, upon which our FO-FTRL-DCN model is based. In Figure 5, we can see that the FO-FTRL-DCN model converges faster because, with embedding optimization and SMOTE, the learning speed and performance of the model are boosted, and its logloss is the lowest after convergence. Figure 5 also shows that the AUC of the FO-FTRL-DCN model is better than that of the other two models and remains the best after final convergence. The results of these ablation studies justify the efficacy of the techniques proposed in this paper.

The Third Experiment
The third experiment is to compare different activation functions in DCN model, on the dataset of ID 1458. The results are presented in Figure 6, with metrics of logloss and AUC.
It can be seen that different activation functions lead to different performance: the logloss of the DCN-sigmoid model is the lowest and its AUC is the best, which verifies that the DCN model with the sigmoid activation function is the best option.

The Fourth Experiment
The fourth experiment illustrates the logloss and AUC of DCN models with different optimization algorithms on the dataset of ID 1458.
It can be seen in Figure 7 that the logloss of the DCN-FTRL model is the lowest and its AUC is the best, indicating that DCN with the FTRL optimization algorithm is the most competitive. Among these models, the logloss and AUC of the proposed FO-FTRL-DCN model are the best.

The Sixth Experiment
In the sixth experiment, the comprehensive performance comparisons between the proposed model and other state-of-the-art baselines across four datasets of iPinYou are exhibited in Table 2. The datasets of iPinYou are of ID 1458, 3358, 3386, and 3427, with metrics of logloss and AUC.
From Table 2, it can be seen that, across the four iPinYou datasets, the logloss and AUC of the FO-FTRL-DCN model are always the best, showing the robustness and generalization of our model across different datasets.

Summarization
It can be summarized from the above experimental studies that:
• The base model of DCN can be empowered with the integration of FO and FTRL.
• The sigmoid activation function works best for the proposed FO-FTRL-DCN model.
• The overall performance of the proposed FO-FTRL-DCN model is better than that of prestigious state-of-the-art counterparts across different datasets of iPinYou, implying good generalization.

Conclusions and Future Research
In this paper, we introduced the FO-FTRL-DCN model for CTR prediction, which is based on the Deep&Cross Network (DCN) augmented with the prestigious FTRL algorithm. Extensive comparative experiments on the iPinYou datasets show that the proposed model outperforms state-of-the-art baselines and the vanilla DCN itself, and its comprehensive performance is the best across the different datasets, presenting satisfactory robustness and generalization.
This work can be further advanced, for example by incorporating sequential features during feature engineering and by exploring deeper network structures to capture higher-order feature combinations.