A Dual Adaptive Interaction Click-Through Rate Prediction Based on Attention Logarithmic Interaction Network

Click-through rate (CTR) prediction is crucial for computational advertising and recommender systems. The key challenge of CTR prediction is to accurately capture user interests and deliver suitable advertisements to the right people. However, CTR prediction datasets contain an immense number of features, and individual features alone are rarely predictive enough. To address this problem, feature interaction, which combines several features via an operation, is introduced to enhance prediction performance. Many factorization machine-based models and deep learning methods have been proposed to capture feature interactions for CTR prediction. They follow an enumeration-filter pattern that cannot determine the appropriate order of feature interactions or identify the useful ones. This paper presents the attention logarithmic network (ALN), which uses logarithmic neural networks (LNN) to model feature interactions and the squeeze-excitation (SE) mechanism to adaptively model the importance of higher-order feature interactions. First, the embedding vector of the input is made positive for the LNN by taking absolute values and adding a very small positive number to its zero entries. Then, adaptive-order feature interactions are learned through the logarithmic and exponential transformations in the LNN. Finally, SE is applied to adaptively model the importance of high-order feature interactions and enhance CTR performance. On this basis, the attention logarithmic interaction network (ALIN) is proposed for the effectiveness and accuracy of CTR prediction, integrating Newton's identity into ALN. ALIN compensates for the information loss caused by taking absolute values and by adding a small positive value to the embedding vector. Experiments are conducted on two datasets, and the results prove that ALIN is efficient and effective.


Introduction
The performance of a recommendation system goes hand in hand with the interests of advertisers, publishers, and users. The cost-per-click (CPC) advertisement charging pattern [1], based on the number of clicks, has become popular for online advertisements. In other words, the more clicks, the higher the publisher's revenue, and the better the promotion effect and the greater the potential revenue that the advertiser can obtain. Moreover, good recommendation performance leads to suitable items being recommended to users in specific contexts [2], which further enhances user satisfaction. Among many recommender systems, such as those for online advertisements [3], news displays [4][5][6], and shopping recommendations [7][8][9], the click-through rate (CTR) plays an important role. The goal of CTR prediction is to predict the probability that a user clicks a specific item based on information about user profiles, item attributes, and contextual scenarios. CTR prediction involves multi-field categorical features, in contrast to computer vision and natural language processing, which involve many continuous features. Since there are a large number of features in CTR datasets, the performance of CTR prediction is limited when only individual features are applied. Modeling complex feature interactions thus plays a key role in CTR prediction. The main contributions of this work are summarized as follows:
1. We design a novel attention logarithmic network (ALN) to model adaptive-order feature interactions and distinguish the importance of different high-order feature interactions through the squeeze and excitation network (SENet);
2. The input of ALN must be positive, which can cause a loss of information. Thus, we integrate Newton's identity for modeling feature interactions with ALN to propose a new model called ALIN;
3. Comprehensive experiments on two datasets show that our proposed model outperforms state-of-the-art methods.
The rest of this article is organized as follows: Section 2 summarizes the related work on CTR prediction and the knowledge underlying our proposed model. Section 3 describes our proposed model in detail. Section 4 presents elaborate experiments demonstrating the superiority of our model and the effects of the hyper-parameters on two datasets, together with several ablation experiments verifying the effectiveness of the proposed components. Finally, conclusions are drawn.

Related Materials
As described above, great efforts have been made by researchers in both industry and academia to improve the performance of CTR prediction. CTR prediction has gradually developed from FM-based shallow models to DNN-based deep models. In this section, we briefly review past methods of feature interaction in CTR prediction. Knowledge relating to the proposed model, including the LNN and Newton's identity, is also briefly introduced.

Feature Interaction in CTR Prediction
Modeling feature interactions is an important task in CTR prediction, which has attracted great attention in both academia and industry. Logistic regression (LR) [21] is a linear approach that only models first-order feature interactions by weighted summation. FM learns second-order feature interactions in the form of vector inner products, which further improves the performance of modeling feature interactions. The field-aware factorization machine (FFM) considers field-aware information and introduces field-aware embedding, modeling multiple embeddings for one feature. The attentional factorization machine (AFM) takes the weights of second-order feature interactions into consideration, learning their importance through an attention mechanism. High-order factorization machines (HOFM) model high-order feature interactions, but they apply iterative computation to obtain them, which consumes computational power and takes a lot of time.
Recently, many deep neural network-based approaches have been applied to CTR prediction. For example, Google proposed Wide & Deep [20], which combines LR and DNNs, although it retains manual feature engineering in the LR component. Wide & Deep combines the advantages of memorization and generalization through the wide part and deep part, respectively. Deep & Cross replaces the wide component of Wide & Deep with a novel component called CrossNet [31], which increases the degree of interaction between features. Similarly, DeepFM improves the wide part of Wide & Deep by using the FM module to model explicit second-order feature interactions. PNN conducts product operations in a product layer to capture high-order feature interactions [32]. As in other models, a DNN layer is stacked on the product layer to learn implicit feature interactions. NFM is a neural network version of FM, which follows the second-order feature interaction with a DNN layer [33]. Nevertheless, these approaches enumerate all feature interactions, which can produce redundant information. Therefore, AFN was proposed, employing an LNN to learn adaptive feature interactions. However, the input of the LNN must be positive, so the absolute values of the embeddings are taken and a small value is added to zero entries. This can disturb the information in the raw embeddings and introduce noisy information into the feature interactions. Additionally, the importance of feature interactions should be considered, since feature interactions play different roles in CTR prediction. In this work, we propose an attention logarithmic network (ALN), which considers the importance of high-order feature interactions and learns the adaptive order of feature interactions. Then, to compensate for the information loss caused by the LNN, Newton's identity is used as a complementary feature interaction component, yielding the attention logarithmic interaction network (ALIN).

Squeeze and Excitation Network
Hu proposed the squeeze and excitation network (SENet) [34], which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. SENet consists of two phases: the squeeze phase and the excitation phase. The two phases are described in Section 3.3 in detail. SENet has many applications in the field of computer vision. Moreover, we apply SENet as a discriminator to distinguish the importance of feature interactions in this paper to achieve better CTR performance.

Logarithmic Neural Network (LNN)
There are many ways to approximate nonlinear functions [35]. Since the multi-layer perceptron (MLP) cannot sufficiently approximate unbounded nonlinear functions, requiring a large number of parameters while achieving limited accuracy, the LNN was proposed to fit unbounded nonlinear functions [27]. An LNN consists of multiple logarithmic neurons. The structure of a logarithmic neuron is shown in Figure 1, where the original input is first transformed into logarithmic space, and the output is then obtained after weighted summation and an exponential operation. Formally, a logarithmic neuron can be formulated as:

y = exp(∑_{i=1}^{m} w_i ln x_i) = x_1^{w_1} x_2^{w_2} · · · x_m^{w_m} (1)

Since the MLP is not suited to multiplication, division, and exponential operations, it does not fit unbounded nonlinear functions well. The logarithmic transformation converts multiplication to addition, division to subtraction, and powers to multiplication by a constant. Therefore, the LNN can fit unbounded nonlinear functions much better than the MLP.
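To make the logarithmic neuron concrete, the following minimal sketch (plain Python, scalar inputs; the function name and the `eps` guard are our illustrative choices, not the paper's code) evaluates the exp-of-weighted-log-sum form and shows how integer weights recover an ordinary product:

```python
import math

def log_neuron(xs, ws, eps=1e-5):
    # exp(sum_i w_i * ln(x_i)) == prod_i x_i ** w_i; eps guards against log(0)
    return math.exp(sum(w * math.log(max(x, eps)) for x, w in zip(xs, ws)))

# With integer weights the neuron reduces to a plain product of selected inputs:
y = log_neuron([2.0, 3.0, 4.0], [1.0, 1.0, 0.0])
# y == 6.0: the weight 0 removes the third input from the interaction

# Fractional weights yield fractional-order interactions, e.g. sqrt(x1 * x2):
y_half = log_neuron([2.0, 3.0, 4.0], [0.5, 0.5, 0.0])
```

Because the weights are continuous, the order of the interaction is learnable rather than fixed in advance, which is the property ALN exploits.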


Newton's Identity
Previous FM-based feature interaction models have modeled feature interactions in the form of symmetric polynomials whose complexity increases with the order of feature interactions. However, this complexity could reduce to linear time complexity by applying Newton's identity [29,30] to feature interactions in the form of power sums.
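As a sketch of why this reduction works, the snippet below (plain Python; the function name is ours) computes the elementary symmetric polynomials a_1..a_r, i.e., the sums of order-r products, from the power sums p_i = Σ_j x_j^i via the standard Newton recurrence r·a_r = Σ_{i=1}^{r} (−1)^{i−1} a_{r−i} p_i, and checks the result against brute-force enumeration:

```python
from itertools import combinations
from math import prod

def elem_sym_via_newton(xs, r_max):
    """Elementary symmetric polynomials a_1..a_r_max from power sums
    p_i = sum_j xs[j]**i, using r * a_r = sum_{i=1..r} (-1)**(i-1) * a_{r-i} * p_i."""
    p = [0.0] + [sum(x ** i for x in xs) for i in range(1, r_max + 1)]
    a = [1.0] + [0.0] * r_max  # a_0 = 1
    for r in range(1, r_max + 1):
        a[r] = sum((-1) ** (i - 1) * a[r - i] * p[i] for i in range(1, r + 1)) / r
    return a[1:]

a1, a2, a3 = elem_sym_via_newton([1.0, 2.0, 3.0], 3)
# a2 == 11.0 (1*2 + 1*3 + 2*3) and a3 == 6.0 (1*2*3)

# brute-force check against the direct symmetric-polynomial definition
brute_a2 = sum(prod(c) for c in combinations([1.0, 2.0, 3.0], 2))
```

The recurrence needs only the r power sums rather than an enumeration over all order-r subsets, which is the source of the linear-time claim.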

Methods
Before presenting the model in detail, we briefly summarize the proposed model, named the attention logarithmic interaction network (ALIN). First, the categorical features and the discretized continuous numerical features in the dataset are encoded into one-hot vectors by a one-hot encoding technique for easy input into the computer. Next, the sparse one-hot features are converted to low-dimensional dense vectors by an embedding technique. Then, the embedding vectors are passed into the attention logarithmic network (ALN) and the Newton's identity component simultaneously. To further learn feature interactions, we stack multiple hidden layers on top of ALN. Finally, the outputs of the hidden layers and the Newton's identity component are combined and fed into a sigmoid function to predict the click probability. The structure of ALIN is shown in Figure 2; the left half of the figure is the structure of ALN, and the right half is the Newton's identity component. The specific implementation of each module is described in detail in the following sections.


Problem Definition
The CTR task is dedicated to predicting the click-through rate, which is the probability that a user will click on a specific item. Specifically, we are given a dataset D = {(x_1, y_1), . . . , (x_s, y_s)} containing s samples, where x_i indicates the vector of user, item, and context information, and i indexes the samples. y_i ∈ {0, 1} is the ground truth of the i-th sample. y_i = 1 means a positive response from the user, such as clicking a specific advertisement or purchasing goods. Conversely, y_i = 0 indicates that the user makes a negative response. The CTR task builds an efficient feature interaction model f_CTR to take full advantage of the user and item features x to predict the click probability ŷ. The definition is given in Equation (2).

Input Layer
Since the dataset for CTR prediction contains a large number of discrete features, one-hot encoding of the raw features is required for input into the neural network. Suppose there are m different fields; each field may contain multiple features, but each feature belongs to only one field. For example, one input instance could be coded as q = [q_1, q_2, . . . , q_m], where there are m one-hot features and only one bit is set to one in every one-hot feature.
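The per-field encoding above can be sketched as follows (plain Python; the field names, vocabularies, and helper name are illustrative assumptions, not taken from the datasets):

```python
def one_hot_fields(instance, vocabs):
    """Encode one instance as per-field one-hot vectors. `instance` maps a
    field name to its value and `vocabs` maps it to the list of possible
    values; both names are our illustrative choices."""
    encoded = []
    for field, vocab in vocabs.items():
        vec = [0] * len(vocab)
        vec[vocab.index(instance[field])] = 1  # exactly one bit set per field
        encoded.append(vec)
    return encoded

vocabs = {"gender": ["male", "female"], "weekday": ["mon", "tue", "wed"]}
q = one_hot_fields({"gender": "female", "weekday": "tue"}, vocabs)
# q == [[0, 1], [0, 1, 0]]: m = 2 one-hot features, one bit set in each
```

Note that the total dimensionality grows with the vocabulary sizes, which is why the embedding layer below is needed.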

Embedding Layer
The data in CTR prediction always contain multi-field categorical features, which are usually sparse and high-dimensional after one-hot encoding. As in previous works, an embedding technique is introduced to map these one-hot features into low-dimensional, dense embedding vectors. As depicted in Figure 3, there are m feature fields, which are independent of each other. First, the features in every field are transformed into high-dimensional and sparse feature vectors through one-hot encoding in the input layer. Then, the embedding layer is applied to the one-hot encoding. Specifically, the embedding layer learns an embedding matrix for each field, and the embedding vector is queried by the one-hot encoding. For example, if the one-hot encoding of field i is q_i and the corresponding embedding matrix is V_i, the embedding e_i is obtained as:

e_i = V_i q_i

where i indexes the fields. Similarly, the embedding vectors of all fields can be derived:

E = [e_1, e_2, . . . , e_m]

where e_i ∈ R^d represents the embedding of field i, and d indicates the dimension of embedding.
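Since multiplying an embedding matrix by a one-hot vector simply selects one row, the lookup is typically implemented with `nn.Embedding`. A minimal PyTorch sketch (field sizes and dimension are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

# Per-field embedding lookup e_i = V_i q_i. With a one-hot q_i this reduces to
# an index lookup, so nn.Embedding is the standard implementation.
field_sizes = [2, 3, 7]  # vocabulary size of each of the m = 3 fields
d = 10                   # embedding dimension

embeddings = nn.ModuleList(nn.Embedding(n, d) for n in field_sizes)
feature_ids = torch.tensor([1, 2, 4])  # index of the active feature per field

E = torch.stack([emb(feature_ids[i]) for i, emb in enumerate(embeddings)])
# E has shape (m, d): one dense d-dimensional vector per field
```

Keeping one embedding table per field mirrors the figure's independent fields and keeps lookups O(1) per field.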


Attention Logarithmic Network
To learn the adaptive order of feature interactions and model the importance of high-order feature interactions adaptively, we propose an attention logarithmic network (ALN). As depicted on the left of Figure 2, the input of the logarithmic neuron must be positive. Therefore, the absolute value function is applied to the embedding vectors and a small positive value (e.g., 1e-5) is added to the zeros in the embedding vectors. Consequently, the positive embedding can be represented as E = [e_1, e_2, e_3, . . . , e_m], which is used in the successive layers.
The logarithmic neural network (LNN) and the squeeze and excitation network (SENet) are the important components of ALN: the former learns the powers of the logarithmic neurons, also known as the orders of each feature in a feature interaction, and the latter learns the importance of the feature interactions. Unlike in a traditional LNN, the input of the logarithmic neurons in ALN consists of vectors. To be more specific, the input of ALN comprises the positive vectors E. The positive vectors E are first transformed into logarithmic space by the logarithmic transformation layer. Then, weighted summation is conducted on the logarithmically transformed vectors at a vector-wise level. Finally, the result of the weighted summation is converted back by the exponential transformation layer. The above operations can be formalized as follows:

y_j = exp(∑_{i=1}^{m} w_ij ln e_i) = e_1^{w_1j} ⊙ e_2^{w_2j} ⊙ · · · ⊙ e_m^{w_mj} (4)

where j indexes the logarithmic neurons, w_ij is the order of the i-th field in the j-th neuron, and ⊙ denotes the element-wise product. According to Equation (4), ALN can learn feature interactions of an arbitrary order. For example, if w_1j and w_2j are equal to 1 and the other weighting coefficients are equal to 0, then the feature interaction y_j = e_1 ⊙ e_2 is learned in the j-th logarithmic neuron. However, the importance of feature interactions is not considered in Equation (4). Therefore, SENet, stacked on top of the exponential transformation layer, is proposed to model the importance of the feature interactions. The output of the SENet layer can be formulated as follows:

I = [s_1 · y_1, s_2 · y_2, . . . , s_k · y_k]

where s_i is the importance of the i-th feature interaction, which is calculated by the squeeze-excitation mechanism, and k is the number of feature interactions as well as the number of logarithmic neurons. Next, we describe in detail how to calculate the feature importance factor s_i via SENet. SENet contains two main stages, the squeeze stage and the excitation stage, which are depicted in Figure 4.
In the squeeze step, all feature interactions are squeezed into a summary information vector Z = [z_1, z_2, · · · , z_k] by a squeeze function F_sq(·), such as max pooling or mean pooling. If mean pooling is selected for calculating the summary information, z_i can be calculated as follows:

z_i = F_sq(y_i) = (1/d) ∑_{t=1}^{d} y_i^{(t)}

where i ∈ [1, . . . , k] and d is the dimension of the embedding vector. In the excitation step, the summary information vector Z is passed into a two-layer perceptron, where the dimensionality is reduced in the first layer and the original dimensionality is restored in the second layer, yielding the importance scores. Formally, the importance scores can be formulated as:

S = F_ex(Z) = σ_2(W_2 σ_1(W_1 Z))

where σ_1 and σ_2 are activation functions, W_1 ∈ R^{k×(k/r)} and W_2 ∈ R^{(k/r)×k} are the weight matrices of the two layers, and r is the reduction ratio. Finally, the importance scores are multiplied by the original feature interactions to obtain the new feature interactions that reflect their levels of importance:

i_i = s_i · y_i

where s_i is a scalar value, y_i ∈ R^d, and i_i ∈ R^d.

To further enhance the feature interaction effect, a multi-layer perceptron (MLP) is employed on top of the importance-aware feature interactions. First, all the importance-aware feature interactions are concatenated as the input of the MLP:

H_0 = i_1 ⊕ i_2 ⊕ · · · ⊕ i_k

where ⊕ denotes the concatenation operation. Then, H_0 is fed into the MLP with L hidden layers:

H_l = σ(W_{H_l} H_{l−1} + b_l)

where H_l, W_{H_l}, and b_l are the output, weight matrix, and bias vector of the l-th hidden layer, respectively, and L is the number of hidden layers.
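The squeeze-excitation reweighting described above can be sketched as a small PyTorch module (a sketch only: the class name, mean pooling, and the ReLU/sigmoid activation choices are our assumptions; the paper treats the activations as tunable hyper-parameters):

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-excitation over k feature-interaction vectors."""
    def __init__(self, k, r=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k, k // r), nn.ReLU(),    # reduce dimensionality to k/r
            nn.Linear(k // r, k), nn.Sigmoid()  # restore k importance scores
        )

    def forward(self, y):              # y: (batch, k, d) interaction vectors
        z = y.mean(dim=2)              # squeeze: mean-pool each vector to a scalar
        s = self.mlp(z)                # excitation: scores s_i in (0, 1)
        return y * s.unsqueeze(2)      # reweight: i_i = s_i * y_i

se = SEWeight(k=6, r=3)
out = se(torch.randn(2, 6, 10))  # batch of 2, k = 6 interactions, d = 10
# out keeps shape (2, 6, 10); each interaction vector is rescaled by its score
```

The bottleneck (k → k/r → k) forces the network to summarize which interactions matter before assigning per-interaction weights.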
The ALN module can stand independently as a CTR prediction model, but it still has some shortcomings that can be improved, which are described in detail below.

Newton's Identity Component
Since the operations of taking absolute values and adding a small positive value to the zeros of the embedding vectors introduce noise into the embedded information, Newton's identity is used to further model feature interactions in this paper. Analogous to FM, the r-order feature interaction is modeled as follows:

a_r = ∑_{j_1 < j_2 < · · · < j_r} e_{j_1} ⊙ e_{j_2} ⊙ · · · ⊙ e_{j_r}

where j_i is the i-th index of the r-order feature interaction. In this paper, Newton's identity is applied to model high-order feature interactions in the form of power sums. Formally, the identity can be formulated as follows:

r · a_r = ∑_{i=1}^{r} (−1)^{i−1} a_{r−i} p_i

where p_i is the sum of the i-th powers of the feature vectors. With a_0 = 1, this recurrence yields the concrete feature interactions from the first order up to the R-th order (e.g., the fifth order) in linear time, where R denotes the highest order of feature interaction. Through Equation (18), the feature interactions of all orders are concatenated as the output of the Newton's identity component:

N = a_1 ⊕ a_2 ⊕ · · · ⊕ a_R (18)

where ⊕ denotes the concatenation operation.
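A vectorized sketch of this component in PyTorch (the function name and layout are ours; the element-wise power sums over field embeddings follow the recurrence above, and no positivity of the input is required):

```python
import torch

def newton_interactions(E, R):
    """Element-wise power-sum feature interactions via Newton's identity.
    E: (m, d) field embeddings; returns a_1 ⊕ ... ⊕ a_R with shape (R * d,).
    Uses O(R^2) vector ops instead of enumerating feature combinations."""
    m, d = E.shape
    p = [None] + [(E ** i).sum(dim=0) for i in range(1, R + 1)]  # power sums p_i
    a = [torch.ones(d)] + [torch.zeros(d) for _ in range(R)]     # a_0 = 1
    for r in range(1, R + 1):
        acc = torch.zeros(d)
        for i in range(1, r + 1):
            acc = acc + (-1) ** (i - 1) * a[r - i] * p[i]
        a[r] = acc / r                                           # Newton's identity
    return torch.cat(a[1:])

E = torch.tensor([[1.0], [2.0], [3.0]])  # m = 3 fields, d = 1 for easy checking
out = newton_interactions(E, R=3)
# out = [6., 11., 6.]: sum, pairwise-product sum, and triple product of {1, 2, 3}
```

Because the recurrence works directly on the raw embeddings, it supplies the interaction signal that the absolute-value preprocessing in ALN can distort.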

Prediction Layer
In this layer, we first concatenate the outputs of ALN and the Newton's identity component, and then the sigmoid function is employed to predict the click-through rate:

ŷ = σ(w_o^T (H_L ⊕ N) + b_o)

where ŷ ∈ (0, 1) is the predicted label of CTR, w_o and b_o are the weight vector and bias of the prediction layer, respectively, and σ is the sigmoid function.

Optimization and Training
The CTR prediction problem is essentially a binary classification task. As in the work of previous researchers, the cross-entropy loss function is adopted in our work. The cross-entropy loss function measures the distance between the ground truth and the predicted value as follows:

L = −(1/N) ∑_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where N is the number of training samples, y_i is the true label of the i-th sample, and ŷ_i is the prediction value of the i-th sample.
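A minimal reference implementation of this loss (plain Python; the `eps` clamp is our addition to avoid log(0) for saturated predictions):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over N samples."""
    n = len(y_true)
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(y_true, y_pred)) / n

loss = cross_entropy([1, 0, 1], [0.9, 0.2, 0.8])
# every prediction is on the correct side of 0.5, so the loss is small (~0.184)
```

In practice this is `torch.nn.BCELoss` applied to the sigmoid output of the prediction layer.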

Results
In this section, we describe extensive experiments conducted to verify the effectiveness of our model. First, a brief overview of the datasets and experimental setup is presented. Then, comparative experiments demonstrate the effectiveness of the proposed model, followed by hyper-parameter experiments observing the effects of the hyper-parameters. Finally, several ablation experiments verify the effectiveness of the individual components. In this paper, the numerical features are converted to categorical features: a numerical value z is transformed to z = ⌊(ln z)^2⌋ if z > 2, and z = 1 otherwise. For example, for the numerical feature z = 10.5, the value is first logarithmically transformed, then squared, and finally floored, i.e., z = ⌊(ln 10.5)^2⌋ = 5. Conversely, for the numerical feature z = 1.5 (z ≤ 2), z is transformed to 1. Furthermore, all datasets are split 8:1:1 into the training, validation, and test sets, respectively. The details of the datasets are shown in Table 1.
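The discretization rule above is a one-liner; a sketch (the function name is ours) with the two worked examples from the text:

```python
import math

def discretize(z):
    """Log-square discretization for numerical features:
    z -> floor(ln(z)**2) if z > 2, else 1."""
    return math.floor(math.log(z) ** 2) if z > 2 else 1

a = discretize(10.5)  # ln(10.5)**2 ≈ 5.53, floored to 5
b = discretize(1.5)   # z <= 2, mapped to 1
```

This collapses long-tailed numeric ranges into a small number of categories before one-hot encoding.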

Evaluation Metrics
Following previous work [10], we use three metrics, AUC (area under the ROC curve), log loss, and relative improvement (RI), to evaluate the proposed model.

1. AUC: The AUC metric is widely used in CTR prediction and measures the probability that a positive sample ranks higher than a randomly chosen negative sample [36]. The larger the AUC value, the better the CTR effect, and the upper limit of AUC is 1. The definition of AUC is as follows:

AUC = (1 / (M N)) ∑_{i=1}^{M} ∑_{j=1}^{N} δ(r_i > r_j)

where M and N denote the numbers of positive and negative instances, respectively; r_i and r_j indicate the prediction values of positive and negative instances, respectively; and δ denotes the indicator function: δ = 1 when the condition is satisfied, and δ = 0 otherwise;
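The pairwise definition above can be computed directly for small score lists (plain Python; the tie-splitting convention is our addition, and the function name is ours):

```python
def auc(pos_scores, neg_scores):
    """Pairwise AUC: the fraction of (positive, negative) score pairs
    ranked correctly, counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

score = auc([0.9, 0.7], [0.6, 0.8])
# 3 of the 4 pairs are ordered correctly -> 0.75
```

Production code typically uses `sklearn.metrics.roc_auc_score`, which avoids the O(MN) pair enumeration.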

2. Log loss: Log loss, defined in Equation (22), measures the distance between real labels and prediction scores. A lower log loss indicates better CTR prediction performance. It should be noted that even a slight improvement in AUC or decrease in log loss, e.g., at the 0.001 level, is regarded as a significant improvement in CTR prediction;
3. RI: Relative improvement (RI) measures the improvement of our proposed model over other models. RI can be formulated as:

RI = |X_M − X_B| / X_B × 100%

where X denotes the AUC or log loss in this paper, M represents the proposed method, and B represents the compared models.

We compare our models with the following nine baselines:
1. LR [21]: LR models first-order feature interactions and weights individual features for CTR prediction;
2. FM [11]: FM learns a hidden representation for every feature and then models the second order by the inner product;
3. AFM [13]: Based on FM, AFM employs an attention network to model the importance of second-order feature interactions;
4. NFM [33]: NFM is a neural network version of FM. NFM utilizes a bi-interaction pooling layer for modeling second-order feature interactions; then, MLPs are stacked on the layer to learn high-order feature interactions;
5. PNN [32]: PNN models product feature interactions in a product neural network, which is capable of modeling complex feature interactions;
6. Wide & Deep [20]: Wide & Deep combines LR and DNNs for modeling low-order and high-order feature interactions, respectively;
7. DeepFM [22]: DeepFM replaces the wide component of Wide & Deep with FM to learn more informative feature interactions;
8. AFN [28]: AFN implements adaptive-order feature interactions via a logarithmic neural network;
9. AFN+ [28]: AFN+ is an ensemble model that combines AFN and DNNs to learn feature interactions.

Implementation Detail
We implement the proposed model in PyTorch. The optimization method is Adam [37], which is widely used for CTR prediction. The learning rate is 0.001 and 0.0001 for the Criteo dataset and the Criteo_600K dataset, respectively. The embedding dimension is 10 for all models, and the batch size is 1024 for all datasets. The dropout rate is 0.5 for all models, including DNNs. The number of hidden layers is three for all datasets. Further, the number of hidden units is 64 and 400 for the Criteo_600K and Criteo datasets, respectively. For ALIN, the number of logarithmic neurons is 2000 and 1500 for Criteo_600K and Criteo, respectively. The SENet reduction ratio in the proposed ALIN is three and five for Criteo_600K and Criteo, respectively.
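One training step under these settings can be sketched as follows (the Criteo_600K values lr = 1e-4 and batch size 1024 are used; the linear model is only a stand-in for ALIN, and the random batch is synthetic):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder for the full ALIN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.BCELoss()  # the cross-entropy loss described earlier

x = torch.randn(1024, 10)                   # one batch of 1024 inputs
y = torch.randint(0, 2, (1024, 1)).float()  # binary click labels

optimizer.zero_grad()
y_hat = torch.sigmoid(model(x))             # predicted click probability
loss = criterion(y_hat, y)
loss.backward()
optimizer.step()                            # one Adam update
```

The remaining hyper-parameters (dropout, hidden sizes, number of logarithmic neurons, SENet reduction ratio) are set per dataset as listed above.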

Effectiveness Comparison
In this section, we compare nine baselines with our proposed models. These comparison models can be divided into four categories: first-order, second-order, high-order, and ensemble models. The first-order feature interaction model (LR) is a linear model that only uses first-order information for feature interaction. Second-order models capture the interactions between pairs of features. Higher-order models capture higher-order feature interactions by various means. Ensemble models combined with DNNs or other modules capture more complex feature interactions. A comparison of model performance is shown in Table 2, from which the following conclusions can be drawn:
1. The performance of LR is the worst among all the comparison models, indicating that first-order interaction is inadequate for CTR prediction;
2. The higher-order models outperform the second-order models, which shows that finer-grained feature interactions can improve model performance;
3. ALN achieves the best performance among all the high-order models, indicating that it is necessary to consider the importance and the order of feature interactions;
4. ALIN performs best among all the ensemble models, indicating that combining Newton's identity can reduce the noisy information caused by logarithmic neurons.

Efficiency Comparison
In this section, the proposed models are compared with several baselines on the Criteo dataset in terms of efficiency. The results are shown in Table 3, which compares the average running time over 20 epochs and the number of parameters. Although AFN is the most efficient, it achieves relatively poor AUC and log loss results compared to the other models. As observed in Table 2, the AUC of AFN is 0.8087, the worst among the four comparison models, i.e., AFN, ALN, AFN+, and ALIN; its log loss is also the worst among the four. ALN improves the AUC by 0.001 over AFN at the cost of some additional runtime, with little growth in the number of parameters. Table 3 also shows that AFN+ has the longest running time and the largest number of parameters. ALIN reduces the running time by 30 s and requires far fewer parameters compared to AFN+. Moreover, from Table 2, ALIN still slightly improves on AFN+ in terms of AUC and log loss. In summary, compared with the best baseline model AFN+, ALIN has fewer parameters, faster speed, and better performance.

Hyper-Parameter Experiments
Firstly, we conducted extensive hyper-parameter experiments on ALN to observe the effects of its hyper-parameters. The optimal parameter settings of ALIN were then found based on the ALN settings.
The activation function in SENet, the reduction ratio in SENet, and the number of logarithmic neurons were tuned on ALN. Additionally, the order of Newton's identity was tuned on ALIN.

The Number of Logarithmic Neurons in ALN
For this section, we performed hyper-parameter experiments only on the Criteo_600K dataset; for the Criteo dataset, we used the parameter setting recommended for AFN, where the best value was 1500. As shown in Figure 5, the performance of ALN gradually improves as the number of logarithmic neurons increases, and is optimal when the number of neurons reaches 2000. We observe a significant improvement at 2000 logarithmic neurons compared with a few dozen, which suggests that more logarithmic neurons can, to some extent, better fit the patterns in the data.
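As a rough illustration of how logarithmic neurons realize adaptive-order interactions, the sketch below (a framework-agnostic NumPy version; the function name, shapes, and weight initialization are our illustrative assumptions, not the authors' exact implementation) takes the absolute value of the embeddings, adds a small positive constant to keep the input positive, and applies the log/weighted-sum/exp pipeline, so each neuron computes a product of fields raised to learned powers:

```python
import numpy as np

def logarithmic_neurons(emb, orders, eps=1e-7):
    """Forward pass of a bank of logarithmic neurons (LNN sketch).

    emb:    (batch, num_fields, emb_dim) field embeddings
    orders: (num_neurons, num_fields) learnable per-field exponents,
            i.e., the adaptive interaction order of each neuron
    """
    # Make the input strictly positive, as described in the text:
    # absolute value plus a very small positive constant.
    pos = np.abs(emb) + eps
    log_emb = np.log(pos)                                # (B, F, D)
    # A weighted sum in log space equals a product of fields
    # raised to the learned powers in the original space.
    combined = np.einsum('nf,bfd->bnd', orders, log_emb)  # (B, N, D)
    return np.exp(combined)                               # (B, N, D)
```

In a trained model, `orders` would be a learnable parameter; here it is passed in explicitly so the transformation itself is easy to inspect.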

The Type of Activation Functions in SENet
The activation function is a key component of a neural network. As shown in Figure 6, different activation functions behave differently on the two datasets: in Figure 6a, ReLU performs better on the Criteo_600K dataset, whereas in Figure 6b, sigmoid performs better on the Criteo dataset. This indicates that datasets with different characteristics need to be fitted with different nonlinear activation functions.

The Reduction Ratio in SENet
The two subfigures of Figure 7 demonstrate the effects of different reduction ratios on CTR performance. The results on Criteo_600K are shown in Figure 7a: performance is optimal when the reduction ratio is three and gradually degrades at larger ratios, which shows that compressing too much information can, to some extent, cause information loss. In contrast, CTR performance on the Criteo dataset improves until the reduction ratio reaches five and then deteriorates as the ratio increases toward nine. In conclusion, based on the above results, a small reduction ratio works better.
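To make the role of the reduction ratio concrete, here is a minimal NumPy sketch of a squeeze-excitation step over the interaction vectors. The helper names, the choice of mean pooling for the squeeze, and the weight shapes are illustrative assumptions rather than the paper's exact SENet configuration; the point is that the ratio r shrinks the bottleneck width of the excitation MLP:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(z, w1, w2, activation=relu):
    """Squeeze-excitation over N interaction vectors.

    z:  (batch, N, dim) higher-order interaction vectors
    w1: (N, N // r)     bottleneck weights; r is the reduction ratio
    w2: (N // r, N)     expansion weights back to N importance scores
    """
    squeezed = z.mean(axis=2)            # squeeze: (batch, N)
    hidden = activation(squeezed @ w1)   # excitation bottleneck, width N // r
    weights = sigmoid(hidden @ w2)       # per-interaction importance in (0, 1)
    return z * weights[:, :, None]       # reweight each interaction vector
```

A larger r compresses the N importance signals through a narrower hidden layer, which is where the information loss discussed above can occur.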

The Order in Newton's Identity
The results in Figure 8 indicate that different datasets need different supplementary feature interactions: the Criteo_600K dataset benefits from fourth-order interactions, while the Criteo dataset needs only second-order ones. In Figure 8a, ALIN achieves its best performance on Criteo_600K when the order of feature interactions is four; on the Criteo dataset, performance improves up to order two and then decreases, as shown in Figure 8b. These results suggest that a small dataset requires high-order, complex feature interactions to improve performance, while a large dataset requires only low-order interactions, because large datasets carry more sufficient information.
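As a sketch of how Newton's identity can supply the sum of all k-th-order interactions of distinct fields, the recursion e_k = (1/k) * sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i builds the order-k elementary symmetric sum from element-wise power sums p_i of the field embeddings. The function name and the element-wise formulation below are our illustrative assumptions, not the authors' exact construction:

```python
import numpy as np

def kth_order_interactions(emb, k):
    """Element-wise sum of all k-th-order interactions of distinct fields.

    emb: (num_fields, dim) field embeddings
    k:   the interaction order supplied by Newton's identity

    Uses p_i = sum_f emb_f ** i (power sums) and the recursion
    e_k = (1/k) * sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i.
    """
    _, dim = emb.shape
    # p[i] is the element-wise i-th power sum; p[0] is unused.
    p = [None] + [np.sum(emb ** i, axis=0) for i in range(1, k + 1)]
    e = [np.ones(dim)]  # e_0 = 1
    for j in range(1, k + 1):
        acc = np.zeros(dim)
        for i in range(1, j + 1):
            acc += ((-1) ** (i - 1)) * e[j - i] * p[i]
        e.append(acc / j)
    return e[k]
```

For three scalar fields with values 1, 2, and 3, the order-2 result is 1*2 + 1*3 + 2*3 = 11 and the order-3 result is 1*2*3 = 6, matching the pairwise and triple products one would enumerate directly; the recursion avoids that exponential enumeration.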

Ablation Study
To verify the effectiveness of the individual modules, we designed several ablation experiments, including ALN w/o SE and ALIN w/o NI, which remove the attention mechanism (SENet) from ALN and Newton's identity from ALIN, respectively. The performance of these variants is shown in Table 4, from which the following can be observed:
1. The variant ALIN w/o NI (i.e., ALN) outperforms ALN w/o SE, indicating that SENet is beneficial for improving CTR performance and that it is necessary to model the importance of feature interactions;
2. ALIN outperforms ALIN w/o NI, which indicates that Newton's identity can further complement feature interactions and reduce the noise caused by the LNN;
3. The comparison between ALN w/o SE and ALIN shows that the two strategies proposed in this paper significantly enhance CTR performance on both datasets.

Conclusions
In this paper, we first pointed out the shortcomings of previous CTR models: they cannot model adaptive-order feature interactions, do not consider the importance of higher-order feature interactions, and introduce noise into the embedded features during feature interaction. To overcome these drawbacks, we proposed the ALIN model, which uses a logarithmic neural network to model adaptive-order feature interactions and a squeeze-excitation mechanism to model the importance of higher-order feature interactions. Newton's identity is incorporated to complement the feature interactions and compensate for the noise that the LNN introduces into the embeddings. Extensive experiments on two datasets show that the proposed model outperforms previous models. Furthermore, hyper-parameter experiments were conducted to observe the effects of the hyper-parameters, and several ablation studies demonstrate the effectiveness of the individual components.