An Attention-Based Latent Information Extraction Network (ALIEN) for High-Order Feature Interactions

Abstract: One of the primary tasks for commercial recommender systems is to predict the probabilities of users clicking items, e.g., advertisements, music and products. This is because such predictions have a decisive impact on profitability. The classic recommendation algorithm, collaborative filtering (CF), still plays a vital role in many industrial recommender systems. However, although straight CF is good at capturing similar users' preferences for items based on their past interactions, it falls short in modeling (1) the influences of users' sequential patterns from their individual history interaction sequences and (2) the relevance of users' and items' attributes. In this work, we developed an attention-based latent information extraction network (ALIEN) for click-through rate prediction, to integrate (1) implicit user similarity in terms of click patterns (analogous to CF), (2) low and high-order feature interactions and (3) historical sequence information. The new model is based on deep learning, which goes beyond the capabilities of econometric approaches, such as matrix factorization (MF) and k-means. In addition, the approach provides explainability to the recommendation by interpreting the contributions of different features and historical interactions. We have conducted experiments on real-world datasets that demonstrate considerable improvements over strong baselines.


Introduction
In recent years, recommender systems have improved substantially and are now widely adopted by many online services in domains such as news, e-commerce and social media, among many others.
The key to a personalized recommendation for a target user is in modeling similar users' preferences for items in a domain based on those similar users' past interactions and the similarity of their patterns to those of the target user (e.g., in ratings and clicks). In a broader sense, any use of such user similarity (sometimes called "nearest neighbors") is an umbrella concept known as collaborative filtering [1,2].
With the inclusion of time-stamp and sequence, collaborative filtering is being merged with history sequences to play a vital role in many industrial recommender systems. The most well-known collaborative filtering technique, matrix factorization [3][4][5], projects users and items into a shared latent space and utilizes a vector of latent dimensions to represent a user or an item. Thereafter a user's interaction with an item is modeled as the inner product of their latent vectors. Recently, researchers have been embracing deep-learning neural architectures that can learn very complicated functions from data, to replace the inner product applied in matrix factorization [6,7] and also include information from history sequences [8,9]. However, even with (a) user-user similarity (based on, e.g., clicks), and (b) history sequences, there is still another source of information which emerges from (c) the "attributes" (sometimes called "features") of users and items, and the interactions of these attributes, originally at the main effect level, but now moving to the second, third and even higher-order interactions. This is the landscape on which our model, the attention-based latent information extraction network (ALIEN) for high-order feature interactions, operates.
The user's attributes include demographic factors, e.g., age, gender, occupation and educational background [7,10-14]. In addition, item attributes such as the category of a product, the genre of a movie and the release date of an album, not only render the basic information about the item, but also provide clues as to why the user is interested in it [10,15]. For example, it is reasonable to recommend Toy Story, a famous cartoon movie, to an eight-year-old boy Peter when he enters a video streaming website. Therefore, a third-order feature interaction, <gender = male, age = 8, movie's genre = (animation, children's, comedy)>, can be an informative description of this scenario for prediction. Existing works have focused on modeling low-order feature interactions from user and item attributes for click-through rate (CTR) prediction [16][17][18][19], whose primary task is to predict the probabilities of users clicking items, e.g., advertisements, music and products. However, although feature engineering has been automated beyond manual feature selection, these models still lack the capability of extracting latent information from high-order feature interactions. Modeling such interactions usually increases the dimensions and sparsity of the input features exponentially, leading to a more serious problem of model overfitting [20].
In addition, Peter may be curious about the reason why Toy Story was recommended to him. A possible assumption is, on the one hand, that it was recommended because he watched Lion King last week, whose genres are (animation, children's, musical), similar to Toy Story's, and welcomed by children as well. On the other hand, the third-order feature interaction <gender = male, age = 8, movie's genre = (animation, children's, musical)> has a greater impact on the recommendation of Toy Story than any other feature interactions, e.g., <zipcode = 48067 and movie's year of release = 1994>. To meet his expectations, it is appropriate to design a recommender system which not only provides a precise recommendation, but also is capable of finding out the exact feature interactions and history items which have greater influences on this recommendation. Thus, we propose a novel model, the attention-based latent information extraction network (ALIEN), to solve the CTR problem mentioned above. It has been demonstrated that the self-attention mechanism [15,21,22] is able to investigate the internal relationships between words within a sentence in the natural language processing task. Similarly, it enables ALIEN to provide explanations of recommendations by capturing comprehensive relationships between feature interactions from user and item attributes. In order to resolve the issues of modeling high-order feature interactions and providing explainability at the same time in a unified way, we built two attention-based layers from both micro and macro perspectives. The main contributions of this paper include:

• An attention-based latent information extraction network (ALIEN) is proposed which takes user and item attributes and the user's history interactions as features; two attention-based layers are applied from the macro and micro perspectives: (1) the macro layer learns the latent information by modeling the low and high-order feature interactions, and interprets the contribution of each feature interaction; (2) the micro layer investigates the different impact which each history interaction has on the candidate item.
• Dice [8] is introduced as the activation function, to standardize the input data and place the mean at the inflection point of the sigmoid.
• We conducted empirical evaluations and validated the effectiveness of the model on two real-world datasets.
• We demonstrate the effectiveness of the approach in providing the explainability of the model.

Related Work
CTR prediction models have been successfully developed in both academia and industry [8,16,23-27]. Until fairly recently, feature selection from attributes was mainly hand-crafted by experts [28]. However, it was a tedious task that relied heavily on experts' experience and expertise [28]. Therefore, several works [16][17][18][19] were proposed to model feature interactions automatically. Among these works, factorization machines (FM) [19] is a representative model, which was built to capture the first and second-order feature interactions in a linear way, and its effectiveness has been demonstrated in many recommendation tasks [29,30]. However, these models only focused on modeling low-order feature interactions, and lacked the capability of extracting latent information from high-order feature interactions, whose number tends to be combinatorially explosive.
To solve the problem of modeling high-order feature interactions, approaches were made which utilized feed-forward neural networks. He et al. [31] proposed neural factorization machines (NFM) to seamlessly combine the linearity of FM and the non-linearity of neural networks in modeling second-order and higher-order feature interactions, respectively. Qu et al. [32] proposed a product-based neural network (PNN), wherein a product layer is used to explore high-order feature interactions after an embedding layer. Lian et al. [33] proposed a novel compressed interaction network (CIN), which aimed to generate feature interactions at the vector-wise level. In addition to models for academia, Internet companies proposed several representative deep models which aimed to learn non-linear feature interactions from large-scale data. Cheng et al. [23] from Google proposed "Wide & Deep" for app recommendation, wherein the multi-layer perceptron (MLP) was used on the concatenation of feature embedding vectors, to learn feature interactions. Shan et al. [34] from Microsoft proposed DeepCross which utilized a deep residual MLP [35] to learn feature interactions. However, these methods were not capable of interpreting the contribution of each feature interaction.
To deal with the problem of explainability, Xiao et al. [36] applied the attention mechanism [15,22] to learn the importance of each feature interaction. Song et al. [20] moved a step further by combining a multi-head self-attentive neural network with residual connections, to model the importance of feature interactions with different orders. To solve these two issues at the same time in a unified way, we make the following contributions based on improvements over the existing techniques: (1) For the macro domain, we developed a novel self-attention-based layer named the attributes-driven latent information extraction layer (which is denoted USER × CANDIDATE), to learn the latent information from the user and item attributes by modeling the low and high-order feature interactions, and to interpret the contribution of each feature interaction. Our approach evolved from the existing techniques of the synthesizer [37] and the compressed interaction network (CIN) [33]. In contrast to the role the synthesizer played in the natural language processing (NLP) task, and for the first time to the best of our knowledge, we transformed it into an approach which is able to model feature interactions and interpret their contributions for the recommendation task. (2) In addition, by adding the capability of modeling a user's history interactions from the micro perspective, we solve the issue of CIN only modeling static feature interactions without having the ability to capture the user's diverse interests. Therefore, another novel attention-based layer named the user behavior-driven latent information extraction layer (HISTORY) was developed to learn the latent information from the user's historical behavior by modeling the items in the user's history interactions, and investigating the different impact which each history interaction has on the candidate item. Our approach evolved from the existing technique of the deep interest network (DIN) [8].
Unlike the items' embedding method utilized in DIN, we propose a different approach described in Section 3.3, to better suit ALIEN's architecture. Moreover, since DIN lacks the modeling of feature interactions, the attributes-driven latent information extraction layer (USER × CANDIDATE) described above compensates for that disadvantage.

The ALIEN Architecture
In this section, we first present the main idea of the attention-based latent information extraction network (ALIEN) proposed in this paper, and formulate the description of the problem, as explained above. Afterwards, we elaborate with a diagram of the architecture of ALIEN.

Main Idea
Our notation is summarized in Table 1. In the ALIEN model, there are three layers: (1) The attributes-driven latent information extraction layer (USER × CANDIDATE); (2) The user behavior-driven latent information extraction layer (HISTORY); (3) The latent information digestion and prediction layer (PREDICTION).

Table 1. Notation.

Notation | Description
$U$, $V$ | the sets of users and items
$u \in \mathbb{R}^{m \times k}$, $v \in \mathbb{R}^{n \times k}$ | user $u$'s and item $v$'s $k$-dimensional embeddings
$S_u^{v_c}$ | the embedding set of $u$'s history items related to $u$'s candidate item $v_c$
$u_f$, $v_f$ | $u$'s and $v$'s high-dimensional and sparse one-hot encodings which contain all the fields of attributes
$W_{u^{(i)}} \in \mathbb{R}^{k \times |u_f^{(i)}|}$, $W_{v^{(j)}} \in \mathbb{R}^{k \times |v_f^{(j)}|}$ | the weights assigned to the $i$-th field of $u_f$ and the $j$-th field of $v_f$ for latent space projection
$F^{u,v_c}$ (synthesizer output), $\in \mathbb{R}^{\cdots \times k}$ | the output vector of the Synthesizer module
$X_h^l$ | the $h$-th feature vector of the $l$-th layer in the CIN module
$\alpha_{syn}$ | the feature interaction-level attention score matrix for $F^{u,v_c}$
$\psi_{query}$, $\psi_{key}$, $\psi_{value}$ | the Multi-Layer Perceptron units in the Synthesizer module
$\sigma(\cdot)$ | the Dice activation function
… | the randomly initialized matrix utilized in the Synthesizer module
$I_u^{v_c}$ | the intensity of $u$'s interest in $v_c$
$w_{ci}$ | the correlation score between $v_c$ and $v_i$
$P_u^{v_c}$ | $u$'s preference vector for $v_c$

Our model is illustrated in Figure 1. ALIEN performs two tasks in the macro and micro perspectives, and then merges those in the PREDICTION layer:
Attributes-driven latent information extraction layer (USER × CANDIDATE): This layer performs a macro task of learning the latent information from modeling the low and high-order feature interactions, thereby establishing the contribution of each feature interaction to the user u's interest. Modeling low and high-order feature interactions from user and item attributes is overlooked by many recommendation models [7,10]. In addition, in those models, user and item embeddings u and v are initialized only with the indices of u and v, which carry little meaning and make the model converge slowly. However, user and item attributes are very important and provide key features which describe u's and v's basic characteristics. In our model, user and item attributes are used to initialize u and v via an embedding method, in order to render a clearer meaning with the side information and make the model converge faster. The embedding method is discussed in Section 3.3.
Moreover, taking into consideration the exhaustive volume of calculations for generating high-order feature interactions, which grows exponentially when the order of interactions increases, we propose a novel architecture to reduce the computational complexity and parameter costs.
User behavior-driven latent information extraction layer (HISTORY): This layer performs the micro task and investigates the different impact which each history interaction has on the candidate item. Stationary attributes alone are not enough to build a user profile. User behavior not only implies u's historical interests, but also provides clues as to why u will choose to click the candidate item v c (or not). Given v c , a user's interest related to v c can be locally activated, and therefore its intensity is available for measurement, which is regarded as one important factor to improve the CTR prediction in our work.
Latent information digestion and prediction layer (PREDICTION): This layer concatenates the outputs of the first two layers and feeds them into a fully connected layer with the Dice activation function to generate the CTR prediction of v c for u. Dice standardizes the input data and puts the mean at the inflection point of the sigmoid, which is important when the inputs of each layer follow different distributions.
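The behavior of Dice described above can be illustrated with a minimal sketch. It assumes the formulation given in the DIN paper [8]: a gate p(x) that is a sigmoid of the standardized input, blended with a leaky slope `alpha`. The fixed value `alpha=0.25`, the per-batch statistics and the function name `dice` are assumptions for illustration; in [8] the slope is a learned parameter.

```python
import math

def dice(xs, alpha=0.25, eps=1e-8):
    """Dice activation for one unit over a mini-batch of scalar inputs.

    alpha is fixed here for illustration (assumption); eps follows the
    1e-8 value stated in the text.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    out = []
    for x in xs:
        # Gate: sigmoid of the standardized input, so its inflection
        # point sits at the batch mean rather than at zero.
        p = 1.0 / (1.0 + math.exp(-(x - mean) / math.sqrt(var + eps)))
        out.append(p * x + (1.0 - p) * alpha * x)
    return out

# An input equal to the batch mean gets a gate value of exactly 0.5.
activated = dice([-1.0, 0.0, 1.0])
```

Because the gate is centered on the batch mean, inputs drawn from differently shifted distributions are treated alike, which is the property the PREDICTION layer relies on.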
The detailed work flow of these three layers is discussed in Section 3.4. The main idea of our approach is to: (1) Build a basic user profile by constructing the low and high-order feature interactions from user and item attributes, and assign them different weights, to automatically learn the different influences on the candidate item. (2) Enrich the user's profile comprehensively by locally activating u's interest related to v c with u's corresponding history items S v c u , and measure the intensity of u's locally activated interest, to enhance the performance of CTR prediction. ...

Problem Formulation
We formulate the problem of CTR prediction. Let $U$ denote a set of users and $V$ denote a set of items, where $|U|$ and $|V|$ are the total numbers of users and items, respectively. The user-item interaction matrix $Y \in \mathbb{R}^{|U| \times |V|}$ is defined according to users' implicit feedback, where $y_{uv} = 1$ indicates that there is an implicit interaction between user $u$ and item $v$, e.g., behaviors of clicking, watching and browsing. Otherwise, $y_{uv} = 0$ and there is no interaction between $u$ and $v$. Take the example of user $u \in U$ and item $v \in V$. There are $m$ and $n$ fields of features in $u$'s and $v$'s attributes, respectively. The $k$-dimensional embeddings $u = [u^{(1)}, u^{(2)}, \ldots, u^{(m)}] \in \mathbb{R}^{m \times k}$ and $v = [v^{(1)}, v^{(2)}, \ldots, v^{(n)}] \in \mathbb{R}^{n \times k}$ denote the concatenations of $u$'s and $v$'s feature embeddings, where the $k$-dimensional $u^{(i)}$ and $v^{(j)}$ are encoded from the $i$-th and $j$-th fields in $u$'s and $v$'s attributes, respectively. The procedure of embedding will be discussed in Section 3.3. The problem of click-through rate prediction is to predict the probability of $u$ clicking the candidate item $v_c$. Our goal is to make the CTR prediction precise by modeling $u$ and $v_c$'s low and high-order feature interactions and $u$'s history interactions. $u$'s history interactions consist of $u$'s history items before he or she interacts with $v_c$, and the embedding set of $u$'s history items is denoted as $S_u^{v_c}$.

Embedding Procedure
Differently from the embedding procedure used in [8], u and v are built from their attributes. We follow the feature embedding method applied in [32]. To conserve space, we take the construction of item embeddings as the example. It consists of three steps. Firstly, all the numerical and categorical fields of item attributes are transformed into high-dimensional and sparse one-hot encodings, i.e., $v_f = [v_f^{(1)}, v_f^{(2)}, \ldots, v_f^{(n)}]$. Since each field $v_f^{(i)}$ is a different type of attribute, its one-hot encoding has a different dimension, denoted as $|v_f^{(i)}|$. Secondly, to reduce and unify dimensions, each encoding is projected into the latent space with its weight matrix $W_{v^{(i)}} \in \mathbb{R}^{k \times |v_f^{(i)}|}$:

$$v^{(i)} = W_{v^{(i)}} \, v_f^{(i)},$$

where the $k$-dimensional dense vector $v^{(i)} \in \mathbb{R}^{k}$ is the representation of $v_f^{(i)}$. Finally, the item embedding $v = [v^{(1)}, v^{(2)}, \ldots, v^{(n)}] \in \mathbb{R}^{n \times k}$ is generated by concatenating all fields of vectors. The generation of $u = [u^{(1)}, u^{(2)}, \ldots, u^{(m)}] \in \mathbb{R}^{m \times k}$ is similar to that of $v$.
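The three embedding steps can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the field sizes, the value of k and the helper names (`one_hot`, `project`, `embed_item`) are hypothetical, and the weights are random rather than learned.

```python
import random

def one_hot(index, size):
    """Step 1: sparse one-hot encoding of a categorical field value."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(W, x):
    """Step 2: W is k x |field|, x is a |field|-dim one-hot vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def embed_item(field_indices, field_sizes, weights):
    """Step 3: stack the k-dim field vectors into an n x k embedding."""
    return [project(W, one_hot(idx, size))
            for idx, size, W in zip(field_indices, field_sizes, weights)]

rng = random.Random(0)
k = 4
field_sizes = [18, 10]  # hypothetical fields, e.g. genre and release decade
weights = [[[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(k)]
           for size in field_sizes]
v = embed_item([3, 7], field_sizes, weights)  # n = 2 rows, k = 4 columns
```

Multiplying a one-hot vector by the weight matrix simply selects one column of it, which is why practical implementations usually replace the product with a table lookup.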

Model Description
In this section, we discuss the ALIEN model's entire recommendation process and the main system components with the adopted technologies we introduced.
As illustrated in Figure 1, ALIEN integrates the modeling of attributes-driven and user behavior-driven latent information into one single architecture. The user behavior-driven latent information extraction layer (HISTORY) evolved from the deep interest network (DIN) model used in [8], and the model proposed in it is regarded as a baseline to compare with ALIEN model. The latent information digestion and prediction layer (PREDICTION) receives the outputs of the attributes-driven latent information extraction layer (USER × CANDIDATE) and the user behavior-driven latent information extraction layer (HISTORY), and generates the final prediction.

Attributes-Driven Latent Information Extraction Layer (USER × CANDIDATE)
In terms of building a comprehensive user profile, making good use of user and item attributes is an indispensable task. However, issues such as the enormous sparsity and high dimensions of one-hot feature vectors hinder recommender systems from precisely modeling a user's characteristics. In addition, extracting local dependencies and hierarchical structures among fields has been challenging work. Qu et al. [32] tried to solve these issues by applying a product layer to capture feature interactions from multi-field categorical data after the embedding layer, but the model has the limitation that it only deals with 1 and 2-order feature interactions, and lacks the ability to model higher-order feature interactions, i.e., the 3-order and higher ones. Moreover, the complexity of the n-order inner product grows exponentially with the order of interactions, which results from the exhaustive calculations of inter-field feature interactions. To solve this issue, we follow [32] and make the following improvements in the attributes-driven latent information extraction layer (USER × CANDIDATE):
• We introduce a compressed interaction network (CIN) [33] to model higher-order feature interactions, which solves the issue of high complexity and time consumption by compressing the high-order interaction vectors to a fixed value.
• Taking both low and high-order feature interactions into consideration, we utilize synthesizer [37] to interpret the contribution of each feature interaction to the candidate item. To the best of our knowledge, our model is the first one proposed to apply synthesizer in the field of recommender systems.
The graph inside the dark-green box in Figure 1 illustrates the work flow of the attributes-driven latent information extraction layer (USER × CANDIDATE). In the following paragraphs, we state the functionality and the working procedure of each component in the attributes-driven latent information extraction layer (USER × CANDIDATE).
The generation of feature interactions. Let $f^{u,v_c} = [f_1^{u,v_c}, f_2^{u,v_c}, \ldots, f_{m+n}^{u,v_c}]$ be the concatenation of $u$ and $v_c$. There are two procedures of generating feature interactions, i.e., the generations of (1) low-order feature interactions, and (2) high-order feature interactions. We define low-order as 1-order and 2-order. The representation of 1-order feature interactions $F^{u,v_c}_{(1)}$ consists of the field embeddings themselves:

$$F^{u,v_c}_{(1)} = [f_1^{u,v_c}, f_2^{u,v_c}, \ldots, f_{m+n}^{u,v_c}]. \quad (1)$$

The representation of 2-order feature interactions $F^{u,v_c}_{(2)}$ is defined as:

$$F^{u,v_c}_{(2)} = [\, f_i^{u,v_c} \circ f_j^{u,v_c} \,], \quad (2)$$

where $\circ$ denotes the Hadamard product, and $i \in [1, m+n-1]$, $j \in [i+1, m+n]$.

In addition to low-order feature interactions, we utilize a compressed interaction network (CIN) approach to generate higher-order feature interactions, whose order is defined to be higher than two in this paper. In CIN, there are multiple layers, each of which has different feature vectors. The $h$-th feature vector of the $l$-th layer in CIN is:

$$X_h^l = \sum_{i=1}^{H_{l-1}} \sum_{j=1}^{m+n} w_{ij}^{l,h} \left( X_i^{l-1} \circ X_j^0 \right), \quad (3)$$

where $1 \le h \le H_l$ and $X_j^0 = f_j^{u,v_c}$; and $w^{l,h} \in \mathbb{R}^{H_{l-1} \times (m+n)}$, with entries $w_{ij}^{l,h}$, is the parameter matrix for the $h$-th feature vector. $H_{l-1}$ denotes the number of feature interactions in the $l$-th layer, as specified in Equation (4), where $1 \le l \le O$, and $O$ denotes the highest order of feature interactions to be modeled in this paper. Equation (3) implies that the order of interactions increases with the growth of the layer depth of CIN. Take as an example the generation of 3-order feature interactions. From Equation (3), we find that the 3-order feature interactions $F^{u,v_c}_{(3)}$ are the output vectors $X_h^2$ of the second layer in CIN (Equation (5)). Similarly, the higher-order feature interactions can be obtained from the deeper layers (Equation (6)). With $|F^{u,v_c}_{(i)}|$ denoting the number of $i$-order feature interactions (Equation (7)), the complete representation vector of low and high-order feature interactions is:

$$F^{u,v_c} = [F^{u,v_c}_{(1)}, F^{u,v_c}_{(2)}, \ldots, F^{u,v_c}_{(O)}]. \quad (8)$$

Thus, the number of low-order and high-order feature interactions $|F^{u,v_c}|$ is:

$$|F^{u,v_c}| = \sum_{i=1}^{O} |F^{u,v_c}_{(i)}|. \quad (9)$$

From Equations (7) and (9), it follows that $F^{u,v_c}_{(i)} \in \mathbb{R}^{|F^{u,v_c}_{(i)}| \times k}$ and $F^{u,v_c} \in \mathbb{R}^{|F^{u,v_c}| \times k}$.
The synthesizer. Synthesizer is an extension of the self-attention mechanism.
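Before turning to the synthesizer details, the generation of feature interactions described above can be made concrete with a small sketch. It assumes the standard CIN layer form from [33], a weighted sum of Hadamard products between the previous layer's vectors and the base field embeddings; the function names and the all-ones weights are illustrative assumptions, not the trained parameters.

```python
from itertools import combinations

def hadamard(a, b):
    """Element-wise product of two k-dimensional vectors."""
    return [x * y for x, y in zip(a, b)]

def second_order(fields):
    """2-order interactions: f_i * f_j (element-wise) for all i < j."""
    return [hadamard(fields[i], fields[j])
            for i, j in combinations(range(len(fields)), 2)]

def cin_layer(prev, base, weights):
    """One CIN layer.

    weights[h][i][j] scales (prev[i] * base[j]); the result holds one
    k-dimensional vector per h, so the layer width stays fixed no matter
    how many pairwise products are formed.
    """
    k = len(base[0])
    out = []
    for w in weights:
        vec = [0.0] * k
        for i, p in enumerate(prev):
            for j, b in enumerate(base):
                for d in range(k):
                    vec[d] += w[i][j] * p[d] * b[d]
        out.append(vec)
    return out

fields = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # m + n = 3 fields, k = 2
pairs = second_order(fields)                     # three 2-order vectors
layer1 = cin_layer(fields, fields, [[[1.0] * 3 for _ in range(3)]])
layer2 = cin_layer(layer1, fields, [[[1.0] * 3]])  # 3-order, second layer
```

Each additional layer multiplies in the base embeddings once more, so the second layer yields 3-order terms while the output size is governed only by the number of weight matrices per layer.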
The advantage of the synthesizer mechanism is that it replaces the inner product $QK^\top$ in the vanilla self-attention mechanism with the synthesizing function $\phi(\cdot)$, which results in reduced computational complexity, parameter costs which are approximately 10% lower than those of the vanilla self-attention mechanism, and performance competitive with the vanilla one [37]. Among its several variants, we utilized a mixed version of synthesizer, i.e., synthesizer (random + vanilla), which is the mixture of the synthesizer (random) and vanilla self-attention models. The architecture of synthesizer (random + vanilla) is illustrated in Figure 2. The objective of synthesizer in the attributes-driven latent information extraction layer (USER × CANDIDATE) is to learn the different influences of each feature interaction on the candidate item. We achieve this by applying $\alpha_{syn} \in \mathbb{R}^{|F^{u,v_c}_{(1)}| \times |F^{u,v_c}_{(1)}|}$ as the feature interaction-level attention score matrix for $F^{u,v_c}_{(1)}$. We discuss the influences of feature interactions in Section 5.3.1. $F^{u,v_c}_{(1)}$ is used as the input, and the output vector of synthesizer is defined by Equations (10) and (11), in which

$$\mathrm{Softmax}_{syn}(\cdot) = \frac{\exp(\cdot)}{\sum \exp(\cdot)}. \quad (12)$$

$\psi_{query}(\cdot)$, $\psi_{key}(\cdot)$ and $\psi_{value}(\cdot)$ are multi-layer perceptron (MLP) units that are analogous to Q (Query), K (Key) and V (Value) in the vanilla self-attention model, respectively:

$$\psi(x) = \sigma(Wx + b), \quad (13)$$

where $W$, $b$ and $\sigma(\cdot)$ are the weight, bias and non-linear activation function, respectively. Dice is utilized as the activation function in our model (Equation (14)), where $E[x]$ and $\mathrm{Var}[x]$ are the mean and variance of the input, respectively, and $\epsilon$ is set to $10^{-8}$ following [8].
A randomly initialized matrix is applied to replace $QK^\top$ in the synthesizer (random) method. Therefore, given Equations (10) and (11), the output of synthesizer in random + vanilla mode mixes this matrix with the vanilla attention scores before the softmax is applied.
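As a rough illustration of the random + vanilla mixture, the sketch below follows the mixture form described in the Synthesizer paper [37]: the random score matrix and the scaled dot-product scores are combined by a weighted sum before one shared softmax. Several things here are assumptions made for brevity: single linear maps stand in for the MLP units ψ, the mixing weights are fixed at 0.5 each, and all names (`mixed_synthesizer`, `rand_matrix`, `R`) are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of A (n x p) and B (p x q)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def mixed_synthesizer(X, Wq, Wk, Wv, R, mix=(0.5, 0.5)):
    """X: N x k feature interactions; R: N x N random score matrix."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(Q[0]))
    K_t = [list(col) for col in zip(*K)]
    dot = matmul(Q, K_t)  # vanilla scores Q K^T, N x N
    scores = [[mix[0] * r + mix[1] * d / scale for r, d in zip(r_row, d_row)]
              for r_row, d_row in zip(R, dot)]
    alpha = softmax_rows(scores)  # attention over feature interactions
    return alpha, matmul(alpha, V)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

N, k = 3, 2
X = rand_matrix(N, k)
alpha, out = mixed_synthesizer(X, rand_matrix(k, k), rand_matrix(k, k),
                               rand_matrix(k, k), rand_matrix(N, N))
```

Row i of `alpha` can be read as how strongly feature interaction i attends to every other interaction, which is the quantity the explainability analysis inspects.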

User Behavior-Driven Latent Information Extraction Layer (HISTORY)
The attributes-driven latent information extraction layer (USER × CANDIDATE) learns the scope of the user's interest and tries to narrow it down by extracting latent information from feature interactions in the macro perspective. However, a user's interests are diverse [8]. To precisely and comprehensively capture a user's diverse interests, similar to the activation unit applied in [8], we propose the user behavior-driven latent information extraction layer (HISTORY) to learn the latent information from the user's historical behavior by modeling the user's history interactions, and investigating the different impact which each history item $v_i \in S_u^{v_c}$ has on the candidate item $v_c$. With these impacts, the intensity of $u$'s interest in $v_c$ can be measured, which plays a decisive role in determining the probability of clicking $v_c$. Differently from the self-attention-based synthesizer mechanism applied in the attributes-driven latent information extraction layer (USER × CANDIDATE), the attention mechanism utilized in this layer learns to assign a different attentive weight to each history item, to compute the importance of each item to the candidate item.
The graph inside the purple box in Figure 1 illustrates the architecture of the user behavior-driven latent information extraction layer (HISTORY). For each candidate item $v_c$ of user $u$, $v_c$ and $S_u^{v_c}$ are used as the inputs of the layer. The intensity of $u$'s interest in $v_c$, i.e., $I_u^{v_c}$, is calculated with a fully-connected neural network layer $M$, as shown in Equation (16), where

$$\mathrm{MLP}_L(\cdot) = \mathrm{MLP}(\mathrm{MLP}(\cdots \mathrm{MLP}(\cdot))). \quad (19)$$

$w_{ci}$ is treated as the correlation score between $v_c$ and $v_i$. We discuss the influence of $v_i$ on $v_c$ in Section 5.3.2. $[A; B]$ denotes the concatenation of vector $A$ and vector $B$. The working procedure of $w_{ci}$ is illustrated in Figure 3. $I_u^{v_c}$ is provided as the output of the user behavior-driven latent information extraction layer (HISTORY).
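The idea of scoring each history item against the candidate can be sketched as follows. The exact form of Equations (16) to (18) is not reproduced here; this sketch assumes a DIN-style activation unit [8] in which the score w_ci comes from a small MLP applied to the concatenation [v_c; v_i], followed by a weighted sum over history items. The single hidden layer, the ReLU standing in for Dice, the omission of the layer M and all names are assumptions.

```python
def mlp_score(x, w_hidden, w_out):
    """Tiny MLP: one hidden layer (ReLU stands in for Dice), scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def interest_intensity(v_c, history, w_hidden, w_out):
    """Return (correlation scores w_ci, weighted sum of history embeddings)."""
    # v_c + v_i is list concatenation, i.e. the [A; B] operation in the text.
    scores = [mlp_score(v_c + v_i, w_hidden, w_out) for v_i in history]
    k = len(v_c)
    pooled = [sum(w * v_i[d] for w, v_i in zip(scores, history)) for d in range(k)]
    return scores, pooled

v_c = [1.0, 0.0]                      # candidate item embedding, k = 2
history = [[1.0, 0.0], [0.0, 1.0]]    # two history item embeddings
w_hidden = [[1.0, 0.0, 1.0, 0.0]]     # illustrative weights (assumption)
w_out = [1.0]
scores, pooled = interest_intensity(v_c, history, w_hidden, w_out)
```

With these illustrative weights the history item identical to the candidate receives the larger score, so it dominates the pooled vector.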

Latent Information Digestion and Prediction Layer (PREDICTION)
The latent information digestion and prediction layer (PREDICTION) receives the outputs of the first two layers, i.e., $F^{u,v_c}$, the output vector of synthesizer and $I_u^{v_c}$, and generates $P_u^{v_c}$, i.e., $u$'s preference vector for $v_c$. Next, $P_u^{v_c}$ is sent through a fully-connected layer to calculate the probability of $u$ clicking $v_c$. Finally, $v_c$'s prediction score $y$ for $u$ is obtained with the sigmoid function, where $y \in \{0, 1\}$ and $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$.
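A minimal sketch of this final step is given below, assuming the three inputs have already been flattened into plain vectors and that a single linear unit replaces the fully-connected Dice layer. The function name `predict_ctr`, the weights and the input values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_ctr(feature_vec, attended_vec, interest_vec, w, b):
    """Concatenate the layer outputs into a preference vector, then score it."""
    preference = feature_vec + attended_vec + interest_vec  # concatenation
    logit = sum(wi * pi for wi, pi in zip(w, preference)) + b
    return sigmoid(logit)

# Illustrative inputs only; a trained model would supply w and b.
prob = predict_ctr([0.2, 0.1], [0.4], [0.3], w=[1.0, -1.0, 0.5, 0.5], b=0.0)
clicked = 1 if prob >= 0.5 else 0
```

The probability is what the AUC metric ranks, while a thresholded label such as `clicked` is what an accuracy metric compares with the ground truth.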

Experimental Setup
In this section, we present our experiments in detail, including datasets, baselines, evaluation metrics and hyperparameters. Experiments were conducted on two public datasets with user and item attributes and user behavior to investigate the effectiveness of our model.
MovieLens-1M dataset. The MovieLens-1M dataset contains approximately one million explicit movie ratings (ranging from 1 to 5), users' demographic information and movies' basic information from the MovieLens website. We used both users' and items' attributes as the input features. To make it suitable for CTR prediction task, we followed [38] and transformed ratings into implicit feedback; each entry that was marked with 1 indicated that the user had rated the item positively, and we sampled negative examples from an unwatched set marked as 0 for each user, which had the same numbers as the rated ones. The threshold of positive ratings was set to 4. For the dataset splitting procedure, we split data based on userID; therefore, 3622 (60%), 1207 (20%) and 1207 (20%) users among 6036 users were randomly sampled into the training set (441,134 samples), validation set (149,128 samples) and test set (151,438 samples).
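The preprocessing described above can be sketched as follows. The sketch makes several assumptions that the text does not settle: a rating counts as positive when it is at least the threshold, ratings below the threshold are dropped, and each user receives as many sampled negatives as positives. The function names and the toy data are hypothetical.

```python
import random

def to_implicit(ratings, all_items, threshold=4, seed=0):
    """ratings: {user: {item: rating}} -> list of (user, item, label)."""
    rng = random.Random(seed)
    samples = []
    for user in sorted(ratings):
        rated = ratings[user]
        positives = sorted(i for i, r in rated.items() if r >= threshold)
        # Negatives come only from items the user never rated.
        unwatched = sorted(set(all_items) - set(rated))
        negatives = rng.sample(unwatched, min(len(positives), len(unwatched)))
        samples += [(user, i, 1) for i in positives]
        samples += [(user, i, 0) for i in negatives]
    return samples

def split_users(users, fractions=(0.6, 0.2), seed=0):
    """Split by user ID so that no user appears in more than one set."""
    users = sorted(users)
    random.Random(seed).shuffle(users)
    n_train = round(fractions[0] * len(users))
    n_valid = round(fractions[1] * len(users))
    return (users[:n_train],
            users[n_train:n_train + n_valid],
            users[n_train + n_valid:])

toy_ratings = {"u1": {"a": 5, "b": 3, "c": 4}}
toy_samples = to_implicit(toy_ratings, {"a", "b", "c", "d", "e", "f"})
train_u, valid_u, test_u = split_users([f"u{i}" for i in range(10)])
```

Splitting on user IDs rather than on individual samples keeps every user's history inside a single partition, which matches the user-based split reported above.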
Taobao dataset. The Taobao dataset contains over 26 million ad display/click logs from 8 days, together with users' and items' basic information from the Taobao website. We used both users' and items' attributes as the input features. Samples whose "clk" field is 1 were treated as positive samples, and all others as negative samples. Users with low activity, i.e., those with fewer than 5 positive samples, were filtered from the dataset. Similar to the unwatched set generated for Movielens-1M, an unclicked set was sampled as negative samples for the Taobao dataset, with the same number of samples as the positive ones. In addition to the log data, attribute values of the numerical field, i.e., price, were normalized to the range [0, 1]. As with Movielens-1M, we split the Taobao dataset based on userID; thus, 30,295 (60%), 10,097 (20%) and 10,097 (20%) users among 50,489 users were randomly sampled into the training set (443,640 samples), validation set (148,264 samples) and test set (148,534 samples).
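Two of the Taobao preprocessing steps, the activity filter and the price normalization, are simple enough to sketch directly. The log layout `(user, ad, clk)` and the function names are assumptions for illustration, and min-max scaling is assumed as the normalization to [0, 1].

```python
from collections import Counter

def filter_active_users(logs, min_positives=5):
    """Keep only rows of users with at least `min_positives` clicks."""
    clicks = Counter(user for user, _ad, clk in logs if clk == 1)
    return [row for row in logs if clicks[row[0]] >= min_positives]

def min_max(values):
    """Scale numeric values (e.g. price) linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

toy_logs = [("a", i, 1) for i in range(5)] + [("b", 0, 1), ("b", 1, 0)]
kept = filter_active_users(toy_logs)
prices = min_max([10.0, 20.0, 30.0])
```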

Baselines
We compared our model with the following baseline algorithms.
IPNN [32]. IPNN is the PNN model with an inner product layer. PNN applies a product layer to capture 1-order and 2-order feature interactions from multi-field categorical data after the embedding layer. We used the same configurations as in [32], except that the mini-batch size was set to 50.
OPNN [32]. OPNN is the PNN model with an outer product layer. We used the same configurations as in [32], except that the mini-batch size was set to 50.
DeepFM [39]. DeepFM combines the power of factorization machines for recommendation and deep learning for feature learning in a neural network architecture. We used the same configurations as in [39], except that the mini-batch size was set to 50.
DIN [8]. DIN uses a local activation unit to adaptively learn the representations of user interests from historical behaviors with respect to a certain item. We used the same configurations as in [8], except that the mini-batch size was set to 50.

Evaluation Metrics and Hyperparameters
To evaluate the performance of each method, we employed two commonly used metrics, AUC (area under the ROC curve) [40] and ACC (accuracy). AUC is a widely used metric for evaluating classification problems. Reference [25] validated AUC as a good measurement in CTR estimation. Formally, the AUC of a classifier $C$ is the probability that $C$ ranks a randomly drawn positive sample $x^+$ higher than a randomly drawn negative sample $x^-$:

$$\mathrm{AUC}(C) = P\left( C(x^+) > C(x^-) \right).$$

The second metric, ACC, calculates the fraction of correctly classified samples:

$$\mathrm{ACC} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}.$$

For hyperparameters, we experimented with different ones to find the best configurations of our method for each dataset. The configurations of our model are summarized in Table 5.
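Both metrics follow directly from their definitions, as the short sketch below shows. It computes AUC by enumerating every positive-negative pair, which is quadratic and only meant for illustration; counting ties as half a win and using a decision threshold of 0.5 for ACC are assumptions.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(1 for s, y in zip(scores, labels)
                  if (1 if s >= threshold else 0) == y)
    return correct / len(labels)

toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)       # 3 of 4 pairs ranked correctly
toy_acc = accuracy(toy_scores, toy_labels)  # 2 of 4 samples correct
```

AUC depends only on the ordering of the scores, whereas ACC depends on where the threshold falls, which is why the two can move differently.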

Results and Discussion
In this section, the performances of the proposed models and baselines are shown, according to the experimental settings stated in the previous section. In addition, we discuss the superiority of the proposed models through comparisons of performance and demonstrate their effectiveness in explainability. Table 6 shows the performances of all methods under the metrics of AUC and ACC. Each experiment was repeated three times and the averaged results are reported. Relative scores are given compared to the strongest baselines, whose results are underlined. Note that an AUC improvement as small as 0.001 is regarded as significant for CTR prediction [23,39,41]. Therefore, for both datasets, ALIEN outperformed all baselines by large margins.

Comparison of Performances
OPNN was the strongest baseline on both the Movielens-1M and Taobao datasets. ALIEN outperformed OPNN with improvements of 1.14% and 2.41% in AUC, and 1.49% and 2.50% in ACC, on the Movielens-1M and Taobao datasets, respectively. On both datasets, although DIN learns the representation of user interests from historical behavior with respect to the candidate item, it had the worst performance, which we attribute to its lack of feature-interaction modeling. Among the approaches that model low-order feature interactions, IPNN and OPNN led DeepFM on Movielens-1M, and the advantage of the two PNN variants over DeepFM became more obvious on the Taobao dataset. Of the two PNN variants, OPNN outperformed IPNN on both datasets. Despite not modeling high-order feature interactions or the user's historical behavior, they still provided competitive results. The good performance of ALIEN is attributed mainly to the fact that it considers two types of latent information simultaneously, i.e., attributes-driven and user-behavior-driven latent information, which are extracted from low and high-order feature interactions and from the user's history items, respectively, to construct a comprehensive user profile and ultimately provide a more precise prediction. Please note that for the Movielens-1M dataset, as shown in Table 2, there were abundant interactions for each user but few feature fields, whereas for the Taobao dataset, interactions for each user were scarce but the number of feature fields was adequate. These results suggest that ALIEN can still achieve better performance than the baselines when either feature fields or the user's history interactions are scarce.
Please note that in [8,39], the authors reported the experimental results of PNN, DeepFM and DIN on different datasets, i.e., Movielens-20M, Amazon, Alibaba, Criteo and Company, which were not utilized in this study. On the Movielens-20M, Amazon and Alibaba datasets, DIN was reported to outperform IPNN and DeepFM, and DeepFM performed better than IPNN. On the Criteo and Company datasets, DeepFM was reported to outperform the two variants of PNN, i.e., IPNN and OPNN. However, the analysis of the baselines' performances on these datasets is beyond the scope of this paper, since valuable user and item attributes and the timestamps of each user's history interactions are not both available in any of these five datasets, so they fall short of the requirements of our experiment.

Effectiveness on the Explainability Problem (Case Study)
We conducted a real-world case study of user #7 and candidate item #861 in the Movielens-1M dataset, to present the capability of ALIEN to explain the recommendation results from two perspectives: (1) feature interactions, and (2) a user's historical behavior. Figure 4 illustrates the relevance between different fields of attributes from the attention score $\alpha_{syn}$ obtained by the synthesizer module in the attributes-driven latent information extraction layer (USER × CANDIDATE). From the red and black dashed rectangles, we can see that the user's age "35-44" and the movie's genre "action and thriller" contribute more than other fields of attributes and act as good ingredients when forming feature interactions with others. Specifically, the pair <age = "35-44", movie's genre = "action and thriller"> (i.e., the red square) is identified as the most influential feature interaction for the candidate item. This is plausible, since middle-aged adults, especially men, are very likely to prefer action and thriller movies.

Figure 5 illustrates the relevance between different history items $v_i$ and the candidate item $v_c$ from the attention score $w_{ci}$ obtained by the attention module in the user behavior-driven latent information extraction layer (HISTORY). The user discussed in Figure 5 is the same person discussed in Section 5.3.1. Note that the Movielens-1M dataset suffers a severe shortage of attributes, and there exists latent information which is not implied by the attributes. Nevertheless, history items that have a similar genre and release year to the candidate item are generally given a high relevance score. It is apparent from the figure that this male user loves action and thriller movies. One plausible explanation for the very low weights of $v_{13}$ and $v_{14}$ is that he has just watched two romance movies and is currently tired of that genre. He now wants to get back to action and thriller movies and chooses to watch $v_c$.
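To make the reading of the history attention scores concrete, the sketch below computes normalized relevance weights between a candidate item and a few history items. It is an illustration under stated assumptions, not ALIEN's attention module: the scoring function (a plain dot product followed by a softmax), the function name and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def attention_weights(history, candidate):
    """Softmax over dot products between each history embedding and the candidate.

    The dot-product scoring is an assumed stand-in for the model's attention module.
    """
    logits = [sum(h_k * c_k for h_k, c_k in zip(h, candidate)) for h in history]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-d embeddings: the first axis loosely stands for "action/thriller",
# the second for "romance". These values are invented for illustration.
history = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
candidate = [1.0, 0.0]

weights = attention_weights(history, candidate)
for i, w in enumerate(weights, start=1):
    print(f"history item {i}: weight {w:.3f}")
```

Because the weights are normalized to sum to one, a history item that is dissimilar to the candidate (the third item here) receives a small share, which is the kind of pattern the case study reads off Figure 5.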

Conclusions and Future Research
In this work, we developed a model called the attention-based latent information extraction network (ALIEN) for CTR prediction. ALIEN is designed for common scenarios, such as the recommendation of movies on a video streaming website or products on an e-commerce shopping site, where the user and item attributes and the user's past clicking decisions are typically available. Based on the experimental results, ALIEN addresses two issues: (1) modeling high-order feature interactions, and (2) the explainability of the prediction. To achieve these two objectives, ALIEN (i) constructs the low and high-order feature interactions from user and item attributes, via the vector inner-product approach combined with a compressed interaction network (CIN) module; (ii) extracts latent information from feature interactions and the user's history interactions with two attention-based layers to enhance the performance of CTR prediction; and simultaneously (iii) provides explainability by interpreting the contributions of different feature and history interactions. We conducted experiments on two real-world datasets and demonstrated considerable improvements over strong baselines. Moreover, our proposed model can provide reasonable explanations, even when attributes are quite scarce. For future research, we aim to model other sources of attributes, particularly knowledge graphs, to better characterize users and items and therefore make even more precise predictions.

Conflicts of Interest:
The authors declare no conflict of interest.