Deep Filter Context Network for Click-Through Rate Prediction

The growth of e-commerce has led to the widespread use of DeepCTR technology. Among the various types, the deep interest network (DIN), deep interest evolution network (DIEN), and deep session interest network (DSIN) developed by Alibaba have achieved good results in practice. However, these models apply only rudimentary filtering to the user's own historical behavior sequences and make insufficient use of context features, which reduces recommendation effectiveness. To address these issues, this paper proposes a novel model: the deep filter context network (DFCN). It improves the efficiency of the attention mechanism by adding a filter that removes data in the user's historical behavior sequence that differ greatly from the target advertisement. The DFCN also pays attention to the context features through two local activation units. The resulting model is more expressive and adapts better to environment-related attributes, with an improvement of up to 0.0652 in the AUC metric when compared with our previously proposed DICN under different datasets.


Introduction
With the increasing popularity of the Internet and the continuous development of computer science and technology, Internet finance has become an important part of the country's economic and financial life. Among the various effects, the rapid development of the e-commerce industry has provided new opportunities for the e-commerce operation of Internet finance [1]. In the information age, although people have more options for shopping or browsing information, the sheer volume of information makes it impossible for people to select products that meet their needs or preferences, and they can only shop by searching precisely. This greatly reduces the efficiency of shopping and the user's shopping experience. As an information filtering system, recommender systems have been introduced to the e-commerce industry, learning from users' personal preferences and historical behavior to predict users' preferences and make effective filtering recommendations. This not only saves advertising costs for e-commerce platforms but also improves the shopping experience for users [2,3]. Initially, recommender systems were divided into three categories, based on recommendation mechanisms: content-based recommender systems [4], collaborative filtering-based recommender systems [5], and hybrid recommender systems that combine both these systems [6]. With the advent and development of deep learning, recommender systems were upgraded to incorporate deep learning into the recommender system. Deep learning can further mine and analyze data to find hidden relationships and patterns between the data, helping recommender systems to make recommendations more accurately and efficiently [7].
Click-through rate (CTR) prediction models are a crucial group of recommendation systems. CTR prediction estimates the probability that a user will click on a recommended advertisement or target by analyzing the users' historical clicking behavior and known context features, enabling more accurate targeting of advertisements and, thus, saving costs [8,9]. Common CTR models include the factorization machine (FM) [10] and logistic regression (LR) [8] as base models. Such models mainly construct feature crosses, manually or automatically, and apply a weighted summation to obtain click-through rate predictions. However, these basic CTR prediction models can only capture surface-level relationships between features, i.e., they can only model low-order feature interactions and cannot capture the deep hidden relationships and regularities among higher-order features. With the combination of deep learning and recommender systems, the CTR prediction model has been upgraded to DeepCTR, which adds a deep learning component to address the inability of traditional CTR prediction models to model higher-order feature interactions and extract hidden vectors. Initially, DeepCTR models were basically improvements on traditional CTR models with a deep learning component, such as DeepFM [11], NFM [12], xDeepFM [13], and PNN [14]. These methods incorporate deep learning components; however, they mostly follow the approach of first compressing and embedding high-dimensional user features into a fixed-length representation vector, and then feeding it into a multilayer perceptron (MLP). Because user behavior features are high-dimensional, reflecting, for example, the wide variety of interests exhibited by each individual, this compression in the embedding stage can lose information and may not fully utilize the user behavior features [15].
As neuroscience research continues to advance, the idea of attention mechanisms has been proposed [16] and applied to recommender systems. Combining the attention mechanism with DeepCTR, the Alibaba Group introduced the deep interest network (DIN) [15]. Subsequently, the deep interest evolution network (DIEN) [17] and deep session interest network (DSIN) [18] were presented as improvements on DIN. Although these models have performed well in practice and have advanced the DeepCTR model, they still have problems. Specifically, they ignore the influence of context feature vectors on the historical behavioral characteristics of users and on users' click-through rates. Moreover, the models only treat the target items with strong context-related attributes as ordinary vectors for compressed embedding, meaning that they are not fully utilized and, thus, limit the expressive ability of the model. To address this problem, we previously proposed the deep interest context network (DICN) [19], which addresses the problem of ignoring contextual vectors. However, since DICN is based on the DIN model, its filtering of the users' historical behavior features is still at a primary stage. That is, the historical behavior features are directly embedded and compressed and are then fed into the attention mechanism without much processing, resulting in a great deal of redundant information. Although the attention mechanism network can assign weights, this redundancy still limits the expressive ability of the model.
In response to this issue, this paper makes the following contributions:
• This paper presents a new and simple filtering mechanism for users' historical behavior features. It makes full use of the characteristics of the target advertisement to filter the users' historical behavior, helping the attention mechanism process the input feature vector more efficiently and expressively.
• This paper designs two sets of comparative experiments to verify that the filtering layer can effectively enhance the ability of the attention mechanism to capture relevant features and help the model improve its predictive ability. In addition, the importance of the newly added local activation unit for context features is demonstrated.
• This paper highlights the fact that the newly developed filtering layer is better suited to the pre-processing of users' historical behavior feature data, which means it cannot replace the weighting performed by the attention mechanism.
The rest of the paper is organized as follows: Section 2 discusses related works in the literature. Section 3 describes the filters, attention mechanisms, and structure of the components of the overall DFCN model. Section 4 describes the setup and analysis of the results of related experiments. Section 5 demonstrates some details of the models and makes comparisons with some inadequate models that were used during the experiment. Finally, Section 6 concludes the paper and presents ideas for future work.

Attention Mechanism and DICN
The attention mechanism presents a model that simulates the attention paid by the human brain. The core principle is to use the probability distribution of attention to capture the effect of a key input on the output [20]. With the emergence and continuous development of deep learning, people have added the attention mechanism to deep learning and have proposed many deep learning algorithms about the attention-based mechanism [21]. This is a good thing for the recommendation system when used as a commercial data mining algorithm; it can push the recommendation system to continue to progress, improve the model's expressiveness and self-adaptive ability, and improve data mining. In our previous paper, we proposed a new model called the deep interest context network (DICN) which makes full use of attention mechanisms and deep learning. The model takes the DIN proposed by the Alibaba Group as the base model, optimizes the attention mechanism of the DIN, and uses those context features that are not valued by the DIN model to operate the attention mechanism to empower the users' historical behavior features. The DICN operates on the attention mechanism by adding a new local activation unit that takes the timestamped feature vector from the users' historical behavior features and the context features of the target advertisement, resulting in an attention weight matrix called the contextual attention weight matrix. At the same time, the original local activation unit operates the attention mechanism, using the feature of the users' historical behavior and the target advertisement to obtain the users' attention weight matrix. 
After multiplying these two weight matrices to obtain the total attention weight matrix, it is multiplied with the original users' historical behavior features matrix to achieve weighting of the users' historical behavior features matrix, helping the model to focus on those features with high relevance to the target advertising attributes and context, and improving the expressive and predictive power of the model.

Bandpass Filter
In the beginning, the term bandpass filter was used in radio communication systems. In such systems, the superimposition of other frequencies of noise in the channel with the modulated signal produces distortion, which can convey incorrect information and affect the quality of communication [22]. As can be seen, signal filtering is a very important part of the process, ensuring the reliability and accuracy of the signal [23]. As one of the most common types of filters, bandpass filters are used to select signals within a certain frequency range and suppress signals at other frequencies [24]. With the rise and continuous development of digital image-processing techniques, filters are used in the field of digital image processing. An image can be represented as a discrete function of pixel values versus the plane coordinates and can be viewed as two-dimensional signal data [25]. When filters are used for image processing, they allow the image to be enhanced or restored to avoid the distortion caused by interference from other noisy signals [26]. The effect is shown in Figure 1.
In the DeepCTR model, the users' historical behavior data can be seen as multi-dimensional vector signals. In a large database of users' historical behavior, there must be fluctuations and deviations in users' interests at different times or variations in users' interests in the different types of items, which can be treated as noise signals that have a negative effect on the prediction model. In the DIN and DICN models, users' historical behavior data is manipulated directly, without much processing, by the attention mechanism.
Although the attention mechanism itself can be understood as a filtering operation, the complexity and size of the input data lead to a decrease in the effectiveness of the attention mechanism filtering process, which makes it necessary to pre-process the users' historical behavior data. In this paper, we introduce the bandpass filter principle used in electronic communication systems into the DeepCTR model and formulate the passband and blocking band of the bandpass filter algorithm according to the target advertising vector, so as to achieve the initial screening process of the users' historical behavior data and help the attention mechanism to further complete the weight allocation.
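To make the signal-processing idea concrete, the following is a minimal sketch of a bandpass filter applied in the frequency domain with NumPy. The function name `bandpass_fft`, the sampling rate, and the test frequencies are illustrative assumptions and are not taken from the paper; the sketch only shows the general principle of keeping a passband and suppressing everything else.

```python
import numpy as np

def bandpass_fft(signal, sample_rate, low_hz, high_hz):
    """Keep only the frequency components inside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    passband = (freqs >= low_hz) & (freqs <= high_hz)
    # Components outside the passband are set to zero before inverting.
    return np.fft.irfft(spectrum * passband, n=len(signal))

# Hypothetical example: a 5 Hz signal of interest plus 50 Hz noise.
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
wanted = np.sin(2 * np.pi * 5 * t)
noise = 0.5 * np.sin(2 * np.pi * 50 * t)
filtered = bandpass_fft(wanted + noise, sample_rate, low_hz=3, high_hz=10)
```

In the DFCN setting, the target advertisement plays the role of the passband, and historical behaviors far from it play the role of the suppressed noise.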

Model Structure
The structural flow of the deep filter context network (DFCN) model is as follows: the input layer pre-processes the data in the dataset according to its characteristics. The embedding layer then transforms the user features, users' historical behavior features, payment activity, and context features in the dataset into sparse vectors, variable-length sparse vectors, and dense vectors, classified according to the characteristics of data length. Specifically, user features and context features are converted to sparse vectors since they are fixed-length sparse features. The variable-length sparse features of users' historical behavior are converted to variable-length sparse vectors. Payment activity is converted to a dense vector. After conversion into vectors, the users' historical behavior features enter the filtering layer. In the filtering layer, the target advertisement vector is expanded into a tensor of the same shape as the tensor of the users' historical behavior features, after which the historical behavior tensor is subtracted from the target advertisement tensor to obtain a bandstop filter with the target advertisement as the blocking band. After this, the target advertisement tensor is again subtracted from the bandstop filter tensor to obtain a bandpass filter, with the target advertisement tensor as the passband. The Hadamard product of the bandpass filter and the original users' historical behavior sequence tensor gives the filtered bandpass historical behavior sequence tensor. In the attention layer, the bandpass historical behavior sequence tensor and the context feature tensor are fed into the two local activation units, resulting in the users' historical weight matrix and the context weight matrix, respectively. These are first multiplied to obtain the total weight matrix, which is then multiplied with the bandpass users' historical behavior tensor before the sum pooling operation.
Ultimately, the result of the sum pooling operation, the user feature sparse vector, the target advertisement sparse vector, and the context feature sparse vector are passed through the MLP layer to obtain the final result. The overall block diagram is shown in Figure 2.

Input Layer
In this layer, the raw data that are fed into the model are pre-processed. As the input data are sparse and high-dimensional and cannot be used directly by the model, this layer encodes them in preparation for the embedding layer.
For sparse features of low dimensionality and fixed length, such as user features, target advertisements, and contextual features, we pre-processed the data using one-hot encoding [27]:
$$e_i \in \{0, 1\}^{K_i}, \quad \sum_{j=1}^{K_i} e_i[j] = 1$$
where $e_i$ is the i-th feature group in the dataset D, and $K_i$ denotes the dimensionality of this feature group. The equation indicates that only one element in a feature group is coded as 1, while the rest of the elements are all coded as 0. However, one-hot encoding has an obvious problem: if the feature length is not fixed, the one-hot encodings are of varying lengths, which is very detrimental to the model's embedding compression operation. Therefore, for users' historical behavior features with data of variable lengths, we use label encoding to encode the discrete text and numbers. Here, $N_i$ denotes the number of different categories in feature group i, and each element $e_i[j]$ is encoded using consecutive integers, which avoids the excessive dimensionality produced by one-hot encoding.
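The difference between the two encodings can be illustrated with a short sketch. The helper names `one_hot` and `label_encode`, the example categories, and the choice to start the integer codes at 0 are assumptions made for illustration only.

```python
import numpy as np

def one_hot(index, num_values):
    """Fixed-length encoding: exactly one element is 1, the rest are 0."""
    vec = np.zeros(num_values, dtype=np.int64)
    vec[index] = 1
    return vec

def label_encode(sequence):
    """Map each distinct category to a consecutive integer (starting at 0, assumed)."""
    vocab = {value: i for i, value in enumerate(sorted(set(sequence)))}
    return [vocab[value] for value in sequence], vocab

# A fixed-length feature group with three possible values (hypothetical).
gender = one_hot(index=1, num_values=3)

# A variable-length behavior history (hypothetical item categories).
history = ["shoes", "coat", "shoes", "scarf"]
codes, vocab = label_encode(history)
```

The one-hot vector has as many elements as the feature group has values, whereas the label-encoded history has one integer per behavior, so its length follows the history rather than the vocabulary size.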

Embedding Layer
After the input layer has pre-processed the data, the embedding layer takes the pre-processed data and compresses them for embedding. Since the high dimensionality of the vectors after pre-processing is not conducive to model learning, the sparse vectors are mapped from the high-dimensional to a low-dimensional vector space in the embedding layer and are then converted into fixed-length embedding vectors to facilitate the learning of non-linear relationships between features in the fully connected layer. The embedding matrix $G_i$ of the i-th feature group stitches together the embedding vectors $g^i_j$ for $j$ from 1 to $K_i$:
$$G_i = [g^i_1, g^i_2, \dots, g^i_{K_i}]$$
where each embedding vector $g^i_j$ is real-valued. If the feature group is encoded using one-hot encoding, the embedding of the feature group is represented as a single embedding vector. If the feature group is encoded using label encoding, the embedding of feature group i is represented as a tensor, i.e., a list of embedding vectors.
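The two cases of the embedding lookup can be sketched as follows. The sizes `K_i` and `D`, the random embedding matrix, and the row-wise storage of the embedding vectors are assumptions for illustration; in a trained model the matrix would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
K_i, D = 5, 4                      # hypothetical vocabulary size and embedding dimension
G_i = rng.normal(size=(K_i, D))    # row j holds the embedding vector for value j

# One-hot input: multiplying by the embedding matrix selects a single row.
e_onehot = np.zeros(K_i)
e_onehot[2] = 1.0
single_vector = e_onehot @ G_i     # shape (D,)

# Label-encoded sequence: indexing returns a list (tensor) of embedding vectors.
e_labels = np.array([4, 0, 2])
vector_list = G_i[e_labels]        # shape (3, D)
```

The one-hot case always yields one vector of length `D`, while the label-encoded case yields one vector per element of the sequence, which is why the historical behavior features form a tensor.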

Filtering Layer
After the data have been compressed by the embedding layer, they move to the filtering layer, where the users' historical behavior features are processed again. The core idea of this layer is based on the bandpass filter found in radio communication systems; here, we construct a bandpass filter with the target advertisement as the passband to re-filter the users' historical behavior features data, helping the attention layer to better implement the attention mechanism and assign higher weights to those features of the users' historical behavior that are similar to the target advertisement.
Since the users' historical behavior feature is a tensor and the target advertisement is a vector, the target advertisement vector first needs to be expanded into a tensor, so that the target advertisement tensor takes the shape of the users' historical behavior tensor:
$$T_a = [G_{ad_1}, G_{ad_2}, \dots, G_{ad_n}]$$
Here, n represents the length of the second dimension of the users' historical behavior profile tensor.
The users' historical behavior features tensor, $T_h$, is subtracted from the target advertising tensor, $T_a$, to obtain a bandstop filter, $H_s$, with the target advertising tensor as the stop band. The target advertising tensor, $T_a$, is then subtracted from the bandstop filter, $H_s$, to obtain a bandpass filter, $H_p$, with the target advertising tensor as the gain.
Multiplying the bandpass filter $H_p$ element-wise with the original users' historical behavior feature $T_h$ yields the final users' historical behavior feature tensor $T_p$, which has passed through the bandpass filter:
$$T_p = H_p \odot T_h$$
After the bandpass filter, the values of those elements in the original users' historical behavior profile tensor with low relevance to the target advertisement will be reduced, while the values of elements with high relevance to the target advertisement will be retained. This layer effectively filters the users' historical behavior features, helping the attention layer to give greater weight to users' historical behavior features that are highly relevant to the target advertisement and to reduce the weight of other non-relevant features. The structure of the bandpass filter is shown in Figure 3.
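The behavior described above can be sketched in a few lines of NumPy. This is only one possible reading, not the paper's exact filter: the sketch assumes that the stop-band response is the absolute difference between the expanded advertisement tensor and the history tensor, scaled to [0, 1], and that the pass-band response is its complement. The function name `bandpass_filter_history` and the toy values are hypothetical.

```python
import numpy as np

def bandpass_filter_history(T_h, ad_vector):
    """Illustrative filter: attenuate history entries that are far from the target ad.

    Assumptions (not from the paper): the stop-band response is the absolute
    difference scaled to [0, 1], and the pass-band response is its complement.
    """
    T_a = np.broadcast_to(ad_vector, T_h.shape)   # expand the ad vector to the history shape
    diff = np.abs(T_a - T_h)
    H_s = diff / (diff.max() + 1e-12)             # large where the history differs from the ad
    H_p = 1.0 - H_s                               # large where the history matches the ad
    return H_p * T_h                              # Hadamard product with the history

ad = np.array([1.0, 2.0])
T_h = np.array([[1.0, 2.0],    # identical to the ad
                [1.0, 6.0],    # second feature far from the ad
                [0.5, 2.0]])   # close to the ad
T_p = bandpass_filter_history(T_h, ad)
```

Under these assumptions, the row identical to the advertisement passes unchanged, the element that differs most is driven to almost zero, and nearby elements are only slightly attenuated, which matches the qualitative behavior the filtering layer is meant to have.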

Attention Layer
This layer mainly performs the attention weighting operations. It is centered on the attention unit, which contains two local activation units that learn the attention weight matrices for the bandpass-filtered users' historical behavior and the context features, respectively. In what follows, $G_g$, $G_p$, and $G_c$ represent the item, behavior, and item-type embedding matrices in the users' historical behavior features, and $G_s$ and $G_w$ represent the month and date embedding matrices in the users' historical behavior features, respectively, while $t_g$, $t_p$, and $t_c$ correspond to the item, behavior, and type embedding vectors of the target advertisement, respectively. $t_s$ and $t_w$ denote the embedding vectors of the current context features.
Taking the attention weight matrix for the bandpass users' historical behavior feature as an example, first, the target advertisement matrix is expanded into a tensor $T_a = [G_{ad_1}, G_{ad_2}, \dots, G_{ad_n}]$ with the same shape as the bandpass users' historical behavior feature tensor. Then, the Hadamard product and the difference of the bandpass historical behavior feature tensor and the target advertisement tensor are computed.
The results are then spliced with the bandpass historical behavior feature matrix and the target advertisement matrix and fed through the Dice activation function [15] and the linear part of the DNN model, which yields the weight matrix $\omega_h$ of the bandpass historical behavior features.
The structure of the attention unit is shown in Figure 4.
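A local activation unit of this kind can be sketched as a small feed-forward network. The sketch below is an assumption-laden illustration rather than the paper's implementation: the weights are random instead of learned, the Dice parameter `alpha` is fixed, the Dice statistics are computed over the history positions as a stand-in for mini-batch statistics, and the hidden sizes 80 and 40 are taken from the parameter settings reported later.

```python
import numpy as np

def dice(s, alpha=0.25, eps=1e-8):
    """Dice activation: a data-adaptive blend of s and alpha * s."""
    mean = s.mean(axis=0, keepdims=True)
    var = s.var(axis=0, keepdims=True)
    p = 1.0 / (1.0 + np.exp(-(s - mean) / np.sqrt(var + eps)))
    return p * s + (1.0 - p) * alpha * s

def local_activation_unit(T_p, ad_vector, params):
    """Return one attention weight per history position, shape (n, 1)."""
    T_a = np.broadcast_to(ad_vector, T_p.shape)
    # Splice the history, the ad, their difference, and their Hadamard product.
    x = np.concatenate([T_p, T_a, T_p - T_a, T_p * T_a], axis=-1)
    for W, b in params[:-1]:
        x = dice(x @ W + b)
    W_out, b_out = params[-1]
    return x @ W_out + b_out          # linear output layer

rng = np.random.default_rng(0)
n, d = 6, 8                           # hypothetical history length and embedding size
sizes = [4 * d, 80, 40, 1]
params = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
omega_h = local_activation_unit(rng.normal(size=(n, d)), rng.normal(size=d), params)
```

The same function could be called with the context features in place of the advertisement vector to obtain the second weight matrix.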
The attention weight matrix of the context features is obtained using the same steps as above, resulting in an attention weight matrix of context features, $\omega_e$.
Ultimately, the Hadamard product of $\omega_h$ and $\omega_e$ gives the total attention weight matrix $\Omega$; then, the outer product with the bandpass users' historical behavior tensor $T_p$ gives the bandpass-weighted users' historical behavior tensor, $T_{att}$:
$$T_{att} = T_p \times \Omega \quad (21)$$
After the local activation unit yields the bandpass-weighted users' historical behavior tensor, the sum pooling operation is performed. This effectively addresses the problem that a fixed-length representation of user interests reduces the model's learning efficiency.
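The weighting and pooling steps can be sketched as follows. The shapes are assumptions: each weight matrix is taken to hold one weight per history position, so the product with the history tensor is implemented as a broadcast over the embedding dimension. The function name `weighted_sum_pool` and the random inputs are hypothetical.

```python
import numpy as np

def weighted_sum_pool(T_p, omega_h, omega_e):
    """Combine the two weight matrices, weight the history, and sum-pool it."""
    Omega = omega_h * omega_e        # Hadamard product of the two weight matrices
    T_att = T_p * Omega              # each history position scaled by its total weight
    return T_att.sum(axis=0)         # sum pooling over the history positions

rng = np.random.default_rng(1)
d = 3                                # hypothetical embedding size
pooled = {}
for n in (4, 7):                     # two users with different history lengths
    T_p = rng.normal(size=(n, d))
    omega_h = rng.uniform(size=(n, 1))
    omega_e = rng.uniform(size=(n, 1))
    pooled[n] = weighted_sum_pool(T_p, omega_h, omega_e)
```

Whatever the history length, the pooled result has the same length `d`, so it can be concatenated with the other fixed-length embeddings before the MLP layer.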

MLP Layer
This layer is the deep learning part of the model, which learns the non-linear relationships between features by using a fully connected neural network (DNN) model. The multi-layer perceptron layer first concatenates and then flattens the user feature embedding matrix, the target advertisement embedding matrix, the context feature embedding matrix, and the sum-pooled users' historical behavior feature matrix with bandpass weights, turning the multi-dimensional input into a one-dimensional vector and facilitating fully connected neural network learning. After flattening, the results are fed into the DNN model, with ReLU [28] selected as the activation function of the DNN. Finally, the normalization operation is completed using the SoftMax function [29] to output the click-through rate prediction results.
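The forward pass of this layer can be sketched with NumPy. The weights are random rather than learned, the input shapes are hypothetical, the hidden sizes 256, 128, and 64 follow the parameter settings reported later, and the two-way output (no click, click) is an assumption about how the SoftMax output is arranged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(features, params):
    """Concatenate and flatten the inputs, apply ReLU hidden layers, then SoftMax."""
    x = np.concatenate([f.ravel() for f in features])
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W_out, b_out = params[-1]
    return softmax(x @ W_out + b_out)

rng = np.random.default_rng(2)
# Hypothetical embeddings: user, target ad, context, and pooled history.
user, ad, context, pooled_history = (
    rng.normal(size=s) for s in [(2, 4), (4,), (2, 4), (4,)]
)
sizes = [8 + 4 + 8 + 4, 256, 128, 64, 2]
params = [(rng.normal(scale=0.05, size=(i, o)), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
probs = mlp_forward([user, ad, context, pooled_history], params)
```

Under the two-way output assumption, the second entry of `probs` would be read as the predicted click-through probability.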

Experiments and Analysis
In order to validate the performance and learning ability of the new model that is proposed in this paper and prove the superiority of the new model, this experiment uses TensorFlow-2.1.0 as the learning framework and Python 3.7 as the running environment, using them to compare the various classical DeepCTR models.

Datasets
To prevent the occurrence of coincidences, two datasets were chosen for this paper: the Alibaba Taobao user history behavior dataset and the Amazon clothing, shoes, and jewelry dataset. At the same time, the items in both datasets have a high degree of contextual relevance to the environment, i.e., clothing, shoes, etc., have a strong seasonal relevance, with different clothing choices for different seasons. The Alibaba Taobao user history behavior dataset contains information on the users, items, item types, users' historical behavior (click, favorite, add to cart, or purchase), and behavior timestamps [30][31][32]. The Amazon clothing, shoes, and jewelry dataset contains users, items, users' ratings, and behavior timestamps [33][34][35]. Details of the datasets are given in Table 1.

Evaluation Indicators
In order to accurately and effectively evaluate the learning and prediction performance of the DFCN model, the experiment in this paper divides the datasets into a training group and a test group, according to a certain ratio. Meanwhile, the AUC (area under the curve), the log loss function, and RelaImpr-DIN [15,19] are used as the evaluation indicators in this paper, where the loss function follows that of the DIN model [15]:
$$L = -\frac{1}{N} \sum_{(x, y) \in S} \left( y \log p(x) + (1 - y) \log(1 - p(x)) \right)$$
where S is the training set of size N, x is the network input, y is the click label (0 or 1), and p(x) is the predicted click probability. RelaImpr-DIN is a modified version of the RelaImpr formula. Originally, the RelaImpr formula was designed to reflect the gap between the DIN model and BaseModel [15] more intuitively. In this paper, the AUC of the embedding and MLP paradigm was replaced with the AUC of the DIN model in order to further reflect the performance gap between the DFCN model and the DIN and DICN models. It is expressed as:
$$\text{RelaImpr-DIN} = \left( \frac{\text{AUC}(\text{measured model}) - 0.5}{\text{AUC}(\text{DIN}) - 0.5} - 1 \right) \times 100\%$$
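The three indicators can be computed with a short sketch. The function names and the toy labels and predictions are hypothetical; `rela_impr` takes the AUC of the baseline as an argument, so passing the DIN model's AUC gives RelaImpr-DIN as described above, that is, the relative improvement over the baseline after subtracting the 0.5 AUC of a random guesser.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary click labels."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def auc(y_true, p_pred):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y = np.asarray(y_true)
    p = np.asarray(p_pred, dtype=float)
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))

def rela_impr(auc_model, auc_base):
    """Relative improvement (in percent) over a baseline AUC, with 0.5 as the random level."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

# Toy labels and predicted click probabilities (illustrative values only).
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6]
toy_auc = auc(y, p)
toy_loss = log_loss(y, p)
toy_gain = rela_impr(auc_model=0.75, auc_base=0.70)
```

Because the denominator is the baseline's margin over 0.5, a small absolute AUC gain over a weak baseline translates into a large relative improvement.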

Comparison Models
To verify that the DFCN model in this paper performs well in all the above metrics, we use the following widely used CTR prediction models for comparison to make the results more intuitive. The new model proposed in this paper, described in Section 3, introduces a filtering layer to process the compressed embeddings of the users' historical behavior features, reducing the values of those elements with little similarity to the target advertisement and helping the local activation unit to perform the weight assignment more accurately and efficiently.

Parameter Settings
This paper runs the different models with the same parameters as a way to verify the superiority of the new DFCN model. When using the Taobao user history behavior dataset, the number of training epochs is set to 10, while the model batch size is set to 256. The training set contains 8958 data units, of which 7166 data units are for training and 1792 data units are for validation after training, and the test set contains 2240 data units. The final ratio of the data training set:validation set:test set was 14:6:5. The DNN hidden layers in the MLP have 256, 128, and 64 units, and the DNN hidden layers in the local activation unit have 80 and 40 units. When using the Amazon clothing, shoes, and jewelry dataset, the number of epochs was increased to 15, and the remaining parameters were kept constant to verify that overfitting would not occur in the case of larger datasets. However, since the total amounts of data in the two datasets are not the same, the sizes of the training and test sets are also different. In the Amazon clothing, shoes, and jewelry dataset, the training set comprised 72,965 data units, of which 58,372 data units were used to train the model, representing 56% of the total dataset, while 14,197 data units were used to validate the model, representing 24% of the total dataset. The final 18,241 data units were used to test the model in the test set, representing 20% of the total dataset.

Analysis of Results
This section presents the experimental results in tables and figures to verify the superiority of the model.

AUC and the RelaImpr-DIN
This subsection uses a table and bar charts to present the metric values of the comparison models and the DFCN model; the values are listed in Table 2. From Table 2, it can be seen that the DFCN model proposed in this paper outperforms DIN and DICN and far outperforms the remaining mainstream CTR models, both on the Taobao user history behavior dataset and on the Amazon clothing, shoes, and jewelry dataset. On the Taobao dataset, DFCN improves on the DIN model by 0.1706 in AUC and by 106.16% in RelaImpr-DIN. On the Amazon dataset, the AUC improves by 0.0012 and RelaImpr-DIN by 0.89% compared to DIN, which is quite significant. Compared with DICN, the AUC improves by 0.0652 and 0.0005 on the two datasets, respectively, and RelaImpr-DIN improves by 40.57% and 0.37%, respectively. These results illustrate the superiority of the new DFCN model presented in this paper. A visual comparison of the DFCN model and the comparison models on these two evaluation metrics is shown in Figure 5.
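For readers who want to reproduce the RelaImpr-DIN column, the metric can be sketched as follows. This assumes the usual definition of RelaImpr, in which the 0.5 AUC of a random guesser is subtracted from both the measured model and the base model (here DIN). The function name is hypothetical, and the AUC values in the usage line are illustrative only; they are not taken from Table 2.

```python
def rela_impr(auc_model: float, auc_base: float) -> float:
    """Relative improvement (in percent) of a model's AUC over a base model's AUC.

    Both AUC values are measured against the 0.5 AUC of a random guesser.
    """
    if auc_base == 0.5:
        raise ValueError("base AUC of 0.5 leaves nothing to improve on")
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


# Illustrative values only: a base AUC of 0.625 and a model AUC of 0.75.
example = rela_impr(auc_model=0.75, auc_base=0.625)
```

Under this definition, an absolute AUC gain over the base model translates into a RelaImpr gain of that amount divided by the base model's margin over 0.5, which is why a small absolute gain can still correspond to a visible relative change.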

Figure 5. Histogram comparing the AUC metrics of each model, using the Taobao dataset and Amazon dataset: (a) comparison of the AUC of various models using the Taobao dataset; (b) comparison of the AUC with the Amazon dataset.

Test and Log Loss
In this subsection, we plot the test log loss of the different CTR prediction models on the two datasets as a line graph, with the test log loss on the vertical axis and the number of experimental iterations on the horizontal axis. This is intended to explore the loss of each model that is tested. See Figure 6 for the line chart.

From the line graph, we can clearly see that the test log loss of DFCN is the lowest for most numbers of iterations. On the Taobao dataset, the test log loss decreases as the number of epoch iterations increases, while on the Amazon dataset it fluctuates somewhat, but not significantly, and remains essentially stable at the lowest value. The test log-loss comparison on these two datasets illustrates that the DFCN model can effectively avoid over-fitting and under-fitting, ensuring the accuracy of the model's recommendations.
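As a reminder of what the vertical axis measures, the following minimal sketch computes binary log loss (cross-entropy) for a set of click labels and predicted click probabilities. The function name and the clipping constant are assumptions made for illustration; the paper does not state which implementation was used.

```python
import math
from typing import Sequence


def binary_log_loss(labels: Sequence[int], probs: Sequence[float],
                    eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# A model that always predicts 0.5 scores ln(2) regardless of the labels.
uninformed = binary_log_loss([1, 0], [0.5, 0.5])
```

Lower values indicate that the predicted click probabilities are closer to the observed clicks, which is the sense in which the curves in the line graph are compared.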

Comparisons and Contributions
The method proposed in this paper is a refinement and development of both the classical model and the model previously proposed by our team.

Comparison to the Classical Models
Compared with the classical models, the DFCN model proposed in this paper first retains the local activation unit introduced by the Alibaba team in the DIN model, as well as the variable-length sparse vector. The user's historical behavior features are therefore no longer embedded as a fixed-length vector but instead interact with the target advertisement in the local activation unit, and the weight matrix is obtained according to relevance. Second, an obvious shortcoming of the DIN model is that although Zhou et al. introduced a context feature variable, they did not make good use of it. For this reason, we retain the strength of our previous model, namely, a second local activation unit that investigates the relevance of the context features in the user's historical behavior to the context features of the target advertisement. This improvement is well suited to cases where the target advertisement is closely related to its environmental context, and it greatly improves the accuracy of the recommendation system.
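The idea of weighting a variable-length behavior sequence by its relevance to the target advertisement can be sketched as below. This is a simplified illustration, not the DIN or DFCN implementation: a plain dot product stands in for the small feed-forward network that scores relevance in a local activation unit, the weights are left unnormalized, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relevance_weighted_pooling(
    behaviors: Sequence[Sequence[float]], target: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Score each behavior against the target, then sum the weighted behaviors.

    Assumption: the dot product replaces the learned scoring network.
    The sequence may have any length; the pooled vector has a fixed size.
    """
    weights = [dot(vec, target) for vec in behaviors]
    pooled = [0.0] * len(target)
    for weight, vec in zip(weights, behaviors):
        for d, value in enumerate(vec):
            pooled[d] += weight * value
    return weights, pooled


# Three behaviors with 2-dimensional embeddings and one target advertisement.
weights, pooled = relevance_weighted_pooling(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]], [1.0, 0.0]
)
```

In this toy example the second behavior is orthogonal to the target, so it receives zero weight and contributes nothing to the pooled representation.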

Comparison to the DICN Model
Compared to the DICN model, the DFCN model proposed in this paper incorporates an additional filtering layer. The core role of this layer is to filter the sequence of users' historical behavior features according to the target advertisement. The filtering layer shares its main purpose with the attention mechanism: to ignore data that are less relevant to the target advertisement and to emphasize data that are more relevant to it. However, the filtering layer is structurally simpler than the local activation unit and does not require a separate deep learning network to learn the non-linear relationships between vectors, so the historical behavior features that are highly relevant to the target advertisement can be selected in less time and at a lower cost. The filtering layer also has certain drawbacks. Because of its simple structure, it cannot mine the non-linear relationship between the user's historical behavior feature vector and the target advertisement vector; therefore, it cannot replace the local activation unit in producing the weight matrix. What the filtering layer can do is pre-process data with simple linear relationships, excluding some variables with very low correlation so that the attention mechanism of the local activation unit can concentrate on learning non-linear relations, which improves the reliability of the weight matrix output by the local activation unit. We also designed a DFCN model without any local activation unit and verified through comparison experiments that the filtering layer cannot replace the local activation unit. However, the expressiveness of the model with the filtering layer and the local activation unit together is higher than that of the model with only the local activation unit.
The DFCN model without any local activation units was compared with a DIN model that has only local activation units and no filtering layer, and with the complete DFCN model. All three models were evaluated on the Amazon dataset, and the conclusions above were verified by comparing their AUC values. A visual comparison is presented in Figure 7. It is clear from Figure 7 that the expressive ability of the model is greatly reduced if only the filtering layer is added without the local activation unit that implements the attention mechanism. The filtering layer is, therefore, the proverbial icing on the cake for the overall model, but it cannot directly replace the local activation unit in expressing the non-linear relationship between the target advertisement and the user's historical behavior.
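The pre-filtering idea discussed in this subsection can be illustrated with a parameter-free sketch. The actual filter is the one defined in Section 3; the version below is only an assumption made for illustration, using element-wise subtraction followed by an absolute sum as the similarity measure and zeroing the least similar behaviors. The function names and the keep_ratio parameter are hypothetical.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute element-wise differences; no learned parameters."""
    return sum(abs(x - y) for x, y in zip(a, b))


def prefilter(
    behaviors: Sequence[Sequence[float]],
    target: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[List[float]]:
    """Zero out the behaviors that are farthest from the target advertisement.

    Assumption: a fixed fraction of the sequence is kept. Behaviors are zeroed
    rather than dropped so that positions in the sequence are preserved for
    the local activation unit that runs afterwards.
    """
    if not behaviors:
        return []
    distances = [l1_distance(vec, target) for vec in behaviors]
    keep_count = max(1, int(len(behaviors) * keep_ratio))
    ranked = sorted(range(len(behaviors)), key=lambda i: distances[i])
    kept = set(ranked[:keep_count])
    return [
        list(vec) if i in kept else [0.0] * len(vec)
        for i, vec in enumerate(behaviors)
    ]


history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
filtered = prefilter(history, target=[1.0, 0.0], keep_ratio=0.5)
```

Because the filter involves no training, it is cheap to apply, but it also cannot capture non-linear relevance, which is consistent with the ablation result that it complements rather than replaces the local activation unit.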

Importance of the Context Feature Attention Unit
In the course of our experiments, we also designed a DFCN model without the local activation unit for context features, i.e., we added the new filtering layer proposed in this paper directly to the original DIN model. The aim was to investigate whether adding more local activation units would cause overfitting and thus reduce the predictive power of the model. The results demonstrate that the complete DFCN model has better expressive ability than the DFCN variant that lacks the local activation unit for context features, with better AUC values on both the Taobao user history behavior dataset and the Amazon dataset, as well as a smaller test log-loss value. A visual comparison is presented in Figure 8.

From Figure 8, we can clearly see that, once the filtering layer proposed in this paper is added, the AUC results are also significantly better than those of the original DIN model and of the previously proposed DICN model. However, without a local activation unit for context features, no weight matrix can be generated from the correlation between the context features and the user's historical behavior features, which reduces the expressive and predictive power of the model to some extent.

Contributions
The main contribution of this paper is a method for processing the user's historical behavior sequence feature data, i.e., a filtering layer. Through a linear operation between the target advertisement data and the user's historical behavior sequence data, the sequence is pre-filtered before it enters the local activation unit, where the relevance weights are then obtained. The filtering layer improves the representation of the user's historical behavior features in two ways: first, its linear operations avoid a complex and time-consuming deep learning module at this intermediate step; second, it selects the historical behavior feature vectors that are highly relevant to the target advertisement by means of simple operations. At the same time, this paper also demonstrates that the filtering layer cannot completely replace the local activation unit, because its simplicity prevents it from learning the non-linear relationships between vectors. This also illustrates the importance of local activation units from another perspective.

Conclusions
In an era of diversified market economies, the Internet economy accounts for an increasing share of the overall economy [38], and the rise and continuous development of e-commerce promote the progress of recommendation systems. As a data mining model, a recommendation system can analyze data in detail to help e-commerce companies improve their decision-making, increase operational efficiency, and provide a better service to their customers [39]. The DFCN model proposed in this paper, in the context of e-commerce display advertising, filters the huge volume of users' historical behavior feature data, effectively suppressing the interference of irrelevant historical features with the relevant ones and helping the attention mechanism to assign weights to each element more precisely. At the same time, this paper carries over the design advantages of DICN, focusing on the context variables in the users' historical behavior features and making full use of the context features. Several comparison tests show that the DFCN model proposed in this paper significantly improves recommendation accuracy and lowers the test log loss, and that it achieves greater learning ability and more accurate and efficient recommendations.
The limitations of this study mainly stem from the newly proposed filtering layer and the local activation unit. For the filtering layer, this study simply performs a linear addition and subtraction operation between the users' historical behavior features and the target advertisement sequence. The resulting filter is used directly to select the historical behavior features that have a high correlation with the target advertisement. This saves time and cost but, to some extent, ignores the non-linear relationship between the user's historical behavior features and the target advertisement, which places some limitations on the local activation unit. In future work, we intend to continue developing the filtering layer, balancing its time cost against its ability to filter the data. For the local activation unit, more meaningful units can be added to mine the correlations of other vectors in the user's historical behavior features and further improve the model.