Adversarial learning for product recommendation

Product recommendation can be considered as a problem in data fusion: estimation of the joint distribution between individuals, their behaviors, and goods or services of interest. This work proposes a conditional, coupled generative adversarial network (RecommenderGAN) that learns to produce samples from a joint distribution between (view, buy) behaviors found in extremely sparse implicit feedback training data. User interaction is represented by two matrices having binary-valued elements. In each matrix, nonzero values indicate whether a user viewed or bought a specific item in a given product category, respectively. By encoding actions in this manner, the model is able to represent entire, large-scale product catalogs. Conversion rate statistics computed on trained GAN output samples ranged from 1.323 to 1.763 percent. These statistics are found to be significant in comparison to null hypothesis testing results, and are shown to be comparable to published conversion rates aggregated across many industries and product types. Our results are preliminary; however, they suggest that the recommendations produced by the model may provide utility for consumers and digital retailers.


Introduction
Product recommendation can be considered as a problem in data fusion, that is, estimation of the joint distribution between individuals, their behaviors, and goods or services of relevance or interest [1]. This distribution is used to create a list of recommended items to present to a consumer.
Business impact of recommendation. Online retailing revenue continues to expand each year. The largest online provider of goods and services (Amazon) reported 2019 gross revenue of $280.5B, an increase of 20.4% over the previous year. Most sizable e-commerce companies use some type of recommendation algorithm to suggest additional items to their customers. The Long Tail proposition asserts that by making consumers aware of rarely noticed products via recommendation, demand for these obscure items would increase, shifting the distribution of demand away from popular items, and potentially creating a larger market overall [2]. The goal of personalized recommendation is to produce marginal profit from each customer. These incremental sales are non-trivial, accounting for approximately 35% additional revenue for Amazon, and 75% for Netflix by some estimates [3]. Operating efficiencies within a digital enterprise can also be significantly improved. Netflix saves $1B per year in cost due to churn by employing personalization and recommendation [4].
Recommender systems. Recommendation algorithms act as filters to distill very large amounts of data down to a select group of products personalized to match a user's preferences. Filtering and ranking the recommendations is extremely important; marketing studies have suggested that too many choices can decrease consumer satisfaction and suppress sales [5]. These algorithms can be categorized into a few basic strategies: (1) item- or content-based (return lists of popular items with similar attributes); (2) collaborative (recommend items based on preferences or behaviors of similar users); or (3) some hybrid combination of the first two.
Deep learning-based recommendation systems are abundantly represented in the research literature (see reviews in [6] and [7]). Compared to classical techniques based on linear matrix decomposition [8], deep models have the capacity to incorporate greater volumes of data of mixed types, extract features, and express user-item-score statistical relationships. Examples of deep algorithms applied to recommendation tasks include multilayer perceptrons [9]; autoencoders [10]; recurrent neural networks [11]; graph neural networks [12]; and generative adversarial networks [13]. These models aim to predict a user's preference for new or unseen items from mappings relating user-item ([9], [10]), item-feature ([12]) or item-item sequences ([11], [13]).

Contribution of present research.
In this work, we apply a conditional, coupled generative adversarial network (GAN) to a new domain of application: product recommendation in an online retail setting. In the context of previous GAN research [14], and specifically in terms of recommender systems, there are several novel aspects of the model and approach advanced in this research. These include:
• Mapping: Direct modeling of the joint distribution between product views and buys for a user segment;
• Data structure & semantics: Inputs to the trained generative model are (1) user segment and (2) noise vectors; the outputs are matrices of coupled (view, buy) predictions;
• Coverage: Complete, large-scale product catalogs are represented in each generated distribution;
• Data compression: Application of a linear encoding algorithm to very high-dimensional data vectors, enabling computation and ultimate decoding to product space;
• Commercial focus on transaction (versus rating) for recommended products by design.
The RecommenderGAN addresses many ongoing challenges found in recommender systems. Novelty and sparsity of recommendations are not issues; samples representing the entire distribution of products in a high-cardinality catalog can be generated. Cold start issues are mitigated as the system learns to generate a joint (view, buy) distribution for user segments with minimal identifying information, and is designed to include additional behavioral or demographic data if available.

Background-Generative adversarial networks
Generative adversarial networks (GANs) are deep models that learn to generate samples representing a data distribution $p_{\text{data}}(x)$ [14]. A GAN consists of two functions: a generator $G$ that converts samples $z$ from a prior distribution $p_z(z)$ into candidate examples $G(z)$; and a discriminator $D$ that looks at real samples from $p_{\text{data}}(x)$ and those synthesized by $G$, and estimates the probability that a particular example is authentic or fake. $G$ is trained to fool $D$ with artificial samples that appear to be from $p_{\text{data}}(x)$. The functions $G$ and $D$ therefore have adversarial objectives, which are described by the minimax function used to adapt their respective parameters:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$
In the game defined in Eqn. 1, the discriminator tries to make $D(G(z))$ approach 0; the generator tries to make this quantity approach unity [14].
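Conditional GANs. Additional information about the input data can be used to condition the GAN model. To learn different segments of the target distribution, an auxiliary input signal $y$ is presented to both the generator and discriminator functions. The objective function for the conditional GAN [15], [16] becomes
$$\min_G \max_D V(D, G) = \mathbb{E}_{x, y \sim p_{\text{data}}(x, y)}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z),\, y \sim p_y(y)}[\log(1 - D(G(z, y), y))] \qquad (2)$$
where the joint density to be learned by the model is $p_{\text{data}}(x, y)$ and $p_y(y)$ is the distribution of the conditioning signal.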
Coupled GANs. Further generalization of the GAN idea was developed to learn a joint data distribution $p_{\text{data}}(x_1, x_2)$ over multiple domains within an input space [17]. The coupled GAN includes two paired GAN models [$f_1$: ($G_1$, $D_1$), $f_2$: ($G_2$, $D_2$)], each of which is trained only on marginal distributions $p_{\text{data}}(x_1)$, $p_{\text{data}}(x_2)$ from the constituent domains of the joint distribution. The coupling mechanism shares weights between the lower layers of $G_1$ and $G_2$, and the upper layers of $D_1$ and $D_2$, respectively. The architectural location of the shared weights in each case corresponds to the proximity of the greatest degree of abstraction in the data (for $G$, nearest to the input latent space $z$; for $D$, near the encoded semantics of class membership). By constraining weights in this manner, the joint distribution $p(v, b)$ between view and buy behaviors can be learned from training data.
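The weight-sharing mechanism can be illustrated with a toy sketch in plain Python. This is not the architecture of the present model: the layer sizes, the single shared layer, the activation choices, and the names `dense`, `shared_lower`, `upper_view` and `upper_buy` are all hypothetical and chosen only to show that both generators read the same lower-layer parameters while keeping separate upper layers.

```python
import math
import random


def dense(weights, x, activation):
    """Apply one fully connected layer: activation(W x), without bias."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def random_weights(rng, n_out, n_in):
    """Draw an n_out x n_in weight matrix from a Gaussian."""
    return [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def relu(a):
    return max(a, 0.0)


rng = random.Random(0)
LATENT, HIDDEN, OUT = 4, 6, 3  # hypothetical toy sizes

# One weight object, referenced by both generators (the coupling).
shared_lower = random_weights(rng, HIDDEN, LATENT)
# Domain-specific upper layers: one for views, one for buys.
upper_view = random_weights(rng, OUT, HIDDEN)
upper_buy = random_weights(rng, OUT, HIDDEN)


def g1(z):
    """Toy 'view' generator: shared lower layer, then its own upper layer."""
    return dense(upper_view, dense(shared_lower, z, relu), math.tanh)


def g2(z):
    """Toy 'buy' generator: same shared lower layer, different upper layer."""
    return dense(upper_buy, dense(shared_lower, z, relu), math.tanh)


z = [rng.gauss(0.0, 1.0) for _ in range(LATENT)]
view_sample, buy_sample = g1(z), g2(z)
```

Because `shared_lower` is a single object, any update applied to it during training would affect both generators at once, which is the constraint that ties the two marginal models together.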

Model architecture
The present model design combines elements of both conditional and coupled GANs as described above. The constituent networks of the coupled GAN recommender were realized in software using the TensorFlow and Keras [18], [19] deep learning libraries. The GAN models were developed by adaptation of baseline models provided in an open source repository [20].
A schematic view of the RecommenderGAN model used in the current work is presented in Figure 1.
The generators (left) receive input latent vectors $z$ and user segment $y$, and output matrices $G_1$, $G_2$. Discriminators (right) are trained alternately with these artificial arrays and real samples $X_1$, $X_2$, and try to discern the difference. The error in this decision is backpropagated to update weights in the generators. Architectural details of the network were settled upon after iterative experimentation and observation of results from different model design configurations and hyperparameter sets. Specifics of the layered configuration and dimensions of the final model appear in Table 1.
For the present dataset, more extensive (layer-wise) coupling of the weights within the generator networks proved necessary to obtain useful statistical results upon analysis. This is in contrast to the existing literature on coupled GANs, in which the main application domain is computer vision [17].
It was determined that the use of dropout layers [21] in $G_1$, $G_2$ improved convergence during training, but such layers had negligible positive effect when included in the discriminator sub-models.
The final model included a total of 326,614,406 parameters, of which 257,407,020 were trainable. Three different schemes for joint distribution learning can be constructed from the behaviors view, addtocart, and buy in the dataset: (view, add), (add, buy), and (view, buy). The (view, buy) scheme was studied here because it directly connects viewing a raw recommendation and a corresponding purchase.

Data preparation
User segmentation. In [22], models of user engagement with digital services were developed based on metrics covering three aspects of observable variables: popularity, activity and loyalty. Each of these areas and metrics suggests a means for grouping users in an implicit feedback situation.
In the current study, users were segmented based on counts of the number of interactions made within a session (this is referred to as "click depth" in [22]). Each visitor/session in the dataset was assigned to one of five behavioral segments according to click depth, which increases in proportion to bin number. Visitor counts by segment for the (view, buy) scheme are summarized in Table 2.
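The binning step can be sketched as follows. The bin edges are hypothetical (the actual click-depth thresholds are not stated here), as are the function name `assign_segment` and the toy session counts; only the idea of mapping a session's interaction count to one of five ordered segments comes from the text.

```python
from bisect import bisect_right

# Hypothetical click-depth bin edges (assumption, not taken from the study).
# Segment 0: fewer than 2 interactions, 1: 2-4, 2: 5-9, 3: 10-19, 4: 20 or more.
BIN_EDGES = [2, 5, 10, 20]


def assign_segment(click_depth, edges=BIN_EDGES):
    """Map a session's interaction count to a segment index in 0..len(edges)."""
    return bisect_right(edges, click_depth)


# Toy sessions: session id -> number of interactions in that session.
sessions = {"s1": 1, "s2": 3, "s3": 7, "s4": 12, "s5": 40}
segments = {sid: assign_segment(depth) for sid, depth in sessions.items()}
```

With four edges there are five segments, and a larger click depth never maps to a lower segment index, matching the ordering described above.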
Note that the first segment is empty in this table. User segmentation was based on the entire sample; for the (view, buy) group, no paired data fell into the lowest click depth count bin. The total sample size considered for the four segments was 10,591 visitors.
To address the high cardinality of the product catalog (417,053 items), a compressed representation of the product data was created using an arithmetic coding algorithm [23]. For each visit, training data matrices were constructed having fixed dimensions of 1,669 rows (one product category per row) and 300 columns (encoded bit strings representing items in the corresponding category). This was the maximum encoded column dimension; for row encodings of lesser length, the bit string is prepended with zeros. The decoded result is identical regardless of the length of this prefix, which is important for subsequently identifying the specific items recommended by the system.
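The zero-prefix property can be illustrated with a minimal sketch. It does not implement the arithmetic coding algorithm of [23]; it assumes, purely for illustration, that a row's code word is read as an unsigned binary integer, so that leading zeros leave its value unchanged. The helper names `pad_bits` and `bits_to_int` are hypothetical.

```python
WIDTH = 300  # fixed encoded column dimension reported in the text


def pad_bits(bits, width=WIDTH):
    """Left-pad a bit list with zeros up to the fixed row width."""
    if len(bits) > width:
        raise ValueError("code word longer than the fixed row width")
    return [0] * (width - len(bits)) + list(bits)


def bits_to_int(bits):
    """Read a bit list (most significant bit first) as an unsigned integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


code = [1, 0, 1, 1, 0, 1]   # toy code word for one category row
padded = pad_bits(code)     # same code word stored in a 300-column row
```

Under this assumption, `bits_to_int(padded)` equals `bits_to_int(code)`, so every row can be stored at the same width without changing what it decodes to.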
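The encoded, sparse data matrices profiling visitor behavior for Views ($V$) and Buys ($B$) can be expressed symbolically as:
$$V = \begin{bmatrix} v_{1,1} & \cdots & v_{1,c} \\ \vdots & \ddots & \vdots \\ v_{r,1} & \cdots & v_{r,c} \end{bmatrix}, \qquad B = \begin{bmatrix} b_{1,1} & \cdots & b_{1,c} \\ \vdots & \ddots & \vdots \\ b_{r,1} & \cdots & b_{r,c} \end{bmatrix} \qquad (3)$$
where elements $v_{m,n}, b_{m,n} \in \{0, 1\}$ are indicator variables denoting whether a visitor interacts with category $m$ and item $n$, and ($r = 1{,}669$) × ($c = 300$) is the encoded matrix dimension.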
GAN training is carried out in the compressed data space, and final recommendations are read out after decoding to the full column dimension of all items comprising the product catalog.

Evaluation metrics
Assessments of recommendation systems in academic research often include utility statistics of the returned results (such as novelty, precision, sensitivity), or overall system computational efficiency [24], [25], [26]. To estimate the business value derived from deployed systems, effectiveness measures may be direct (conversion rates, click-through rates), or inferred (increased subscriptions or overall revenue lift, for example) [27].
In the implicit feedback situation considered here, recommendations are created from sampling a joint distribution of (view, buy) behaviors. Consider the potential paths for system evaluation as suggested in Figure 2. In the left-hand column are the paired training data ($V_x$, $B_x$); on the right, the generated recommendations ($V_z$, $B_z$).
Without knowledge of the relevance of recommendations by human ratings or via purchasing behavior, evaluation in this preliminary work is based on objective metrics of similarity between the generated (view, buy) lists (path #4).
Contents of the overlapping recommendation sets are taken to signify the highest likelihood for completion of a transaction within the context of a given visitor session.
Two metrics of evaluation are proposed:
1. Specific items contained within the overlapping category sets that are both viewed and "bought" (a putative conversion rate);
2. Coherence between categories in the paired (view, buy) recommendations.
Estimation of conversion rate is the most important statistic considered here; it is crucial for evaluation and optimization of recommender systems in terms of user utility, as well as for targeted advertising [28].
Category overlap is a prerequisite to demonstrating the feasibility of the current approach to product recommendation.
Product conversion rate. Define the conversion rate as the ratio of the number of items recommended and bought to the count of all items recommended, conditioned on the set of overlapping product categories returned by the system:
$$\mathrm{CVR}(y) = \frac{1}{N} \sum_{k=1}^{N} \frac{\#\left(i_v^{(k)} \cap i_b^{(k)} \mid c_v^{(k)} \cap c_b^{(k)},\, y\right)}{\#\left(i_v^{(k)} \mid c_v^{(k)} \cap c_b^{(k)},\, y\right)} \qquad (4)$$
where ($i_v$, $i_b$) are items, ($c_v$, $c_b$) are product categories, $N$ is the number of GAN realizations (indexed by $k$), $y$ denotes the user segment and "#()" denotes the cardinality of its argument. Note that in the current analysis, it is assumed that all recommended items $i_v$ are viewed by a visitor.
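The verbal definition of the conversion rate can be sketched in a few lines of Python. The sketch assumes each realization is given as two sets of (category, item) pairs and that the per-realization ratios are averaged; the function name `conversion_rate`, the data layout, and the convention of scoring a realization with no overlapping categories as 0 are illustrative assumptions rather than the analysis code used in this work.

```python
def conversion_rate(realizations):
    """Average over realizations of (# items viewed and bought) / (# items viewed),
    counting only items whose category appears in both the view and buy sets."""
    rates = []
    for views, buys in realizations:
        # Categories present in both the view and the buy predictions.
        shared = {cat for cat, _ in views} & {cat for cat, _ in buys}
        viewed = {pair for pair in views if pair[0] in shared}
        bought = {pair for pair in buys if pair[0] in shared}
        # Assumption: a realization with no overlapping categories scores 0.
        rates.append(len(viewed & bought) / len(viewed) if viewed else 0.0)
    return sum(rates) / len(rates) if rates else 0.0


# Two toy realizations of (views, buys) as sets of (category, item) pairs.
toy = [
    ({(1, 10), (1, 11), (2, 20), (3, 30)}, {(1, 10), (2, 21), (4, 40)}),
    ({(5, 50), (5, 51)}, {(5, 50), (5, 51)}),
]
cvr = conversion_rate(toy)
```

In the first toy realization the shared categories are 1 and 2, three viewed items fall in them and one of those is also bought (ratio 1/3); in the second, both viewed items are bought (ratio 1), so the average is 2/3.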
Category similarity. The average Jaccard similarity between recommended categories ($c_v$, $c_b$) is given by
$$J_C(y) = \frac{1}{N} \sum_{k=1}^{N} \frac{\#\left(c_v^{(k)} \cap c_b^{(k)} \mid y\right)}{\#\left(c_v^{(k)} \cup c_b^{(k)} \mid y\right)} \qquad (5)$$
where the average is taken over the GAN realizations for user segment $y$.
Training distribution statistics. Summary statistics comparing the distributions $V_x$, $V_z$ (Figure 2, path #1) are observed to provide qualitative information about the effectiveness of target distribution learning.
Null hypothesis tests. A legitimate question to ask upon analyzing the current results is this: "Are the generator realizations samples of the target joint distribution, or do they simply represent random noise?"
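For the category similarity metric, a minimal sketch of the Jaccard computation is given below. It assumes each realization supplies a set of viewed categories and a set of bought categories; the function names and the convention that two empty sets have similarity 0 are illustrative choices.

```python
def category_jaccard(view_categories, buy_categories):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two category sets."""
    union = view_categories | buy_categories
    if not union:
        return 0.0  # assumption: two empty sets are scored as 0
    return len(view_categories & buy_categories) / len(union)


def mean_category_jaccard(pairs):
    """Average the Jaccard similarity over a list of (c_v, c_b) set pairs."""
    scores = [category_jaccard(cv, cb) for cv, cb in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy realizations: category sets for views and buys.
toy_pairs = [({1, 2, 3}, {2, 3, 4}), ({1}, {1})]
j_c = mean_category_jaccard(toy_pairs)
```

The first pair shares two of four distinct categories (0.5) and the second pair is identical (1.0), giving a mean of 0.75.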
To address this question, the analysis includes statistics estimated from simulation trials (n=500) in which randomly selected elements from the (V, B) matrices are set equal to unit value, while maintaining the average sparsity observed in the decoded GAN predictions.
The random trials are meant to test the null hypothesis that there is no correlation between paired (view, buy) elements in the generator output.
The alternative hypothesis is that the recommendations contain relevant information that may provide utility to the system user.
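The randomization procedure can be sketched as a small Monte Carlo simulation. The sketch is a simplification: it uses toy matrix dimensions, draws the view and buy cells independently at a fixed density, and reports the raw fraction of "viewed" cells that are also "bought" rather than the full metrics of this section. The function name `null_overlap_rate`, the dimensions, and the density value are hypothetical.

```python
import random


def null_overlap_rate(rows, cols, density, trials, seed=0):
    """Mean fraction of randomly 'viewed' cells that are also randomly 'bought'
    when both binary matrices are filled independently at the same density."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    n_active = max(1, round(rows * cols * density))
    rates = []
    for _ in range(trials):
        # Independent draws: no relationship between views and buys (the null).
        views = set(rng.sample(cells, n_active))
        buys = set(rng.sample(cells, n_active))
        rates.append(len(views & buys) / len(views))
    return sum(rates) / trials


# Toy setting: 50 x 100 matrix, 1% of cells active, 500 random trials.
null_rate = null_overlap_rate(rows=50, cols=100, density=0.01, trials=500)
```

Under independence the expected overlap fraction is close to the density itself, so for very sparse matrices the null rate is tiny; a trained model whose paired outputs overlap far more often than this is evidence against the null hypothesis.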

Recommendation experiments
Training. The system was trained on the encoded data for 1,100 epochs, in randomly-selected batches of 16 examples each.
Statistics of training data for all networks comprising the model were observed during the training iterations.
The $G_1$, $G_2$ statistics monotonically approached those of the true distribution until around epoch number 1,110, at which point the GAN output began to diverge. One possible explanation is that the representational capacity of the networks on this abstract learning task had become exhausted [29]. Examples of this statistical evolution during training are shown in Figure 3. Note that training data matrix values were scaled onto the range [-1, +1], where the value "-1" corresponds to a zero-valued element in the sparse raw data arrays.
Label smoothing on the positive ground truth examples presented to the discriminators was used to regularize the learning procedure [30].
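The input scaling and the one-sided label smoothing can be sketched as below. The smoothing value 0.9 is a commonly used default and is an assumption here, since the value used in training is not stated; the function names are likewise hypothetical.

```python
def scale_to_pm_one(matrix):
    """Map binary {0, 1} entries onto {-1, +1}, so a raw zero becomes -1."""
    return [[2 * v - 1 for v in row] for row in matrix]


def discriminator_targets(batch_size, smooth=0.9):
    """Targets for one discriminator update with one-sided label smoothing:
    real examples get `smooth` (assumed 0.9) instead of 1; fakes stay at 0."""
    real_targets = [smooth] * batch_size
    fake_targets = [0.0] * batch_size
    return real_targets, fake_targets


raw = [[0, 1, 0], [1, 0, 0]]          # toy sparse indicator matrix
scaled = scale_to_pm_one(raw)          # values now in {-1, +1}
real_t, fake_t = discriminator_targets(batch_size=16)
```

Softening only the positive targets discourages the discriminator from becoming overconfident on real samples, which is the regularizing effect referred to above.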
At training stoppage, the observed discriminator accuracies were consistently in the ≈ 45-55% range, indicating that these models were unable to differentiate between the real distribution and the fake distributions produced by the generators [14].
Testing. Testing a machine learning model refers to evaluation of its output predictions obtained using out-of-sample (unseen) input data. This provides an estimate of the quality and generalization error of the model. Techniques such as cross-validation are often used to assess generalization potential. In contrast to many other learning algorithms, GANs do not have an objective function that directly measures the quality of generated samples, rendering performance comparison of different models difficult [31].
For a concrete illustration, imagine that a GAN has been trained on a distribution of real images of human faces, and generates synthetic face samples after sufficient training iterations [32].
The degree to which the sampled distribution has learned to approximate the target distribution can be estimated by qualitative scoring; the assessment is subjectively accomplished by human observers, who easily measure how "facelike" these artificial faces appear to the eye.
Alternatively, objective metrics based on training and generated image data can be applied in some cases. Example objective metrics are proposed in [31].
In the present application, the generated "images" are abstract representations of consumer activity, not concrete objects. An out-of-sample test in the conventional sense is not possible. The metrics of generative model performance and null hypothesis tests as described in Section 2.4 constitute the testing of the model developed in this work.

GAN predictions.
After training, the model was stimulated with a noise vector and a user segment conditioning signal, producing a series of coupled (view, buy) predictions ($G_1(z, y)$, $G_2(z, y)$), as depicted in Figure 1. The discriminators $D_1$, $D_2$ serve only to guide training, and are disabled for the inference procedure.
A total of 2,500 generated realizations were produced for each user segment. The recommendation matrices $V_z$, $B_z$ were decoded onto the full-dimensional category × item space ($\mathbb{R}^{1669 \times 417053}$) by the inverse arithmetic coding algorithm used in data preparation.
An 8% sub-sample of these realizations was taken to compute key recommender evaluation metrics (Equations 4, 5). These statistics were also calculated in null hypothesis tests, the results of which were averaged to obtain an estimate of the distribution expected under this hypothesis.
The main experimental results of this paper are summarized in Table 3. CVR is the conversion rate (Eqn. 4) and $J_C$ is the category similarity (Eqn. 5). Here, $y$ represents the user segment; #I, #C are the predicted item and category counts, respectively. Row data in the table are averages over (1) 200 realizations from the GAN, and (2) 500 random trials (columns marked "rn"). The conversion rates (CVR) calculated from the GAN output range from 1.323 to 1.763%. The corresponding randomized values (CVR$_{rn}$) are at least 3 orders of magnitude smaller. Based on this result, the null hypothesis of no relationship between paired (view, buy) samples in the generator output is rejected.
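As an illustration of how the inference inputs might be assembled, the sketch below draws Gaussian noise vectors and pairs each with a segment indicator. The latent dimension of 100, the one-hot encoding of the segment, and the names `one_hot` and `generator_inputs` are assumptions made for the example; the trained generators themselves are not reproduced.

```python
import random

N_SEGMENTS = 5     # number of behavioral segments described in the text
LATENT_DIM = 100   # hypothetical latent dimension (assumption)


def one_hot(segment, n_segments=N_SEGMENTS):
    """Encode a segment index as a one-hot conditioning vector (assumed format)."""
    if not 0 <= segment < n_segments:
        raise ValueError("segment index out of range")
    return [1.0 if i == segment else 0.0 for i in range(n_segments)]


def generator_inputs(segment, n_samples, seed=0):
    """Build (z, y) input pairs: fresh noise per sample, same segment signal."""
    rng = random.Random(seed)
    y = one_hot(segment)
    return [([rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)], y)
            for _ in range(n_samples)]


# Three toy realizations requested for segment index 2.
inputs = generator_inputs(segment=2, n_samples=3)
```

Each (z, y) pair would be passed to both generators, so that the view and buy outputs of one realization are driven by the same noise vector and conditioning signal.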
On this key statistical metric, the joint distribution of (view, buy) behaviors sampled from the trained GAN model is characterized by non-trivial signal to noise. The conversion rates observed from the sampled recommendations suggest utility of the system for consumers and digital retailers.
The category similarity for the randomized cases is around 50% for all user segments, while the GAN similarity is between 6.13 and 8.19%. The random procedure shows much greater similarity between recommended (view) categories and potential transactions; however, given the negligible precision (CVR$_{rn}$), this similarity has little practical utility.

Benchmark comparison results.
To place the current results into commercial context, experimental conversion rates were compared against benchmarks observed by digital retailers across industries and product types. In an online resource [33], conversion rates are estimated for 11 different industries and 9 product types. For a concise comparison, average values of these conversion rates are shown along with the mean GAN conversion rate obtained here. The data are summarized in Table 4.
It is seen from Table 4 that the GAN conversion rates are slightly less than, but on the order of, aggregated industrial and product type values. It is noted in [33] that precise definitions of "conversion rate" may vary and the one used in this research may be slightly different than industrial convention. Nevertheless, the RecommenderGAN produces intriguing rates of conversion given the paucity of information about user segments that was used to develop this model.

Discussion
The results obtained from evaluation of the RecommenderGAN generated samples suggest that this approach may be useful for online recommendation systems. This section discusses difficulties with direct comparison to other deep models, identifies areas for improvement of the model presented in this preliminary work, and other assorted notes.
Comparison with deep recommenders. A direct comparison of the present results with deep recommender systems found in the literature is problematic.
Commonly used benchmarking datasets are based upon entities including user, item, rating and some attributes. In order to compare against other deep models directly, it would be necessary to either (i) refactor benchmark datasets from a rating scale (e.g., 1-5 for the MovieLens dataset [34]) onto the current binary-valued training data elements and re-interpret the result as a buy / no buy outcome, or (ii) do the reverse (arbitrarily convert binary values to a numeric scale). Such modification would shift the meaning of the data and the model, and it is not clear how to do this in a rigorously correct manner. This might be explored in future work, but is considered out of scope in the current investigation.
A comparison of the RecommenderGAN against selected deep recommendation models on the basis of their respective core algorithms, inputs and outputs is presented in Table 5.
Numerical efficiency. A limitation of the approach to recommendation as presented here is the numerical efficiency of the decoding process. The arithmetic coding algorithm used to decode the binary data matrices after training the model involves iteration and is not easily parallelizable. The dimensionality of the full catalog of products is extremely high; decoding compute times are consequently large. This mandates offline processing before deployment.
Future development should focus on numerical optimization of the decoding algorithm, to perhaps include compilation to another language, such as C++.
Ranking of results. There is no ranking of recommendation results in the current scheme, as the GAN produces binary-valued information upon decoding. Inherent filtering is accomplished by limiting the presented results to those contained within the category intersection set $\{c_v \cap c_b\}$ as seen in the operational definition of conversion rate (Equation 4). This set is interpreted as representing the greatest likelihood for completing a transaction. On average over user segments, 13% of all categories are returned; of these, 0.46% of all catalog items are represented.
Conditioning signal. The current conditioning signal y is simply based on user dwell time. The information contained in this signal is relatively weak, as indicated by the variation of statistics across segments in Table 3. It is reasonable to anticipate more stringent filtering, and consequent precision and relevance of results, upon the introduction of more robust demographic or behavioral data in the conditioning signal input to the model. This would facilitate a more personalized recommendation experience. The model architecture considered here directly supports such segmentation, and is an important topic to be explored in extensions to this research.
Selection bias and scalability. The estimation of conversion rates is problematic because of two related, key issues: training sample selection bias and data sparsity [28]. Sample selection bias refers to discrepancies in data distribution between model training and inference in conventional recommenders-i.e., training data often comprise only "clicked" samples, while inference is made on all impression samples. Selection bias is said to limit the accuracy of inference assuming the user proceeds through the sequence (impression → click → buy) [28].
As clicked examples are a small fraction of views, a highly imbalanced training set results, biased towards sparse positive examples [35].
This issue is partially avoided in the present research, where the training data are constructed from all viewed items, extracted from the user sequence (view → addtocart → buy). Model recommendations are produced on items having semantic correspondence to the first and third actions in this sequence.
Intertwined with the data sparsity situation in implicit recommendation is the issue of scalability. In collaborative filtering algorithms, the consideration of all paired user-item data points as input to a model is infeasible; the number of those pairs can be exceedingly large, and as noted, each user provides feedback on a very small fraction of the available items [35].
The present system provides scalability to the full product catalog (417,053 items) by virtue of the arithmetic coding compression scheme, albeit at the cost of numerical performance upon decoding.
Open question. Has the true joint distribution been learned? Making inferences about the joint distribution of viewing and buying behavior to inform marketing decisions is the motivation behind this analysis. Investigators have previously shown that GANs may not adequately approximate the target distribution, as the support of the generated distribution was low due to so-called mode collapse [37], where the generator learns to mimic certain modes in the training data in order to trick the discriminator during training.
It is debatable whether mode collapse is an issue in the present setting, given that consumer behavior is represented abstractly by binary indicator matrices.
This is an open question that may be addressed in further research.

Conclusion
We have shown that a coupled, conditional generative adversarial network can learn to generate samples from a joint distribution of online user behavior of (view, buy) item pairs. These samples can be used to make product recommendations for specific user segments.
Conversion rate statistics computed from the trained GAN output samples ranged from 1.323 to 1.763%. These statistics were shown to be significant in comparison to null hypothesis testing results. This suggests that the GAN recommendations may provide utility for consumers and digital retailers.
A comparison of GAN-predicted conversion rates against benchmarks from digital retailers representing many industries and product types showed the GAN conversion rates to approximate these aggregated commercial rates.
The capacity of the system scales to the full item catalog dimension (417,053 items) by the use of an arithmetic coding compression algorithm.
Inversion to product space by the decoding algorithm used here is slow; numerical optimization should be addressed in extensions of this work.