Article

Explainable Reciprocal Recommender System for Affiliate–Seller Matching: A Two-Stage Deep Learning Approach

Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
*
Author to whom correspondence should be addressed.
Information 2026, 17(1), 101; https://doi.org/10.3390/info17010101
Submission received: 21 November 2025 / Revised: 7 January 2026 / Accepted: 9 January 2026 / Published: 19 January 2026

Abstract

This paper presents a two-stage explainable recommendation system for reciprocal affiliate–seller matching that applies machine learning and data science to handle voluminous data and generate personalized ranking lists for each user. In the first stage, a representation learning model was trained to create dense embeddings for affiliates and sellers, enabling efficient identification of relevant pairs. In the second stage, a learning-to-rank approach was applied to refine the recommendation list based on user suitability and relevance. Diversity-enhancing reranking (maximal marginal relevance/explicit query aspect diversification) and popularity penalties were also implemented, and their effects on accuracy and provider-side diversity were quantified. Model interpretability techniques were used to identify which features affect a recommendation. The system was evaluated on a fully synthetic dataset that mimics the high-level statistics generated by affiliate platforms, and the results were compared against classical baselines (ALS, Bayesian personalized ranking) and ablated variants of the proposed model. While the reported ranking metrics (e.g., normalized discounted cumulative gain at 10 (NDCG@10)) are close to 1.0 under controlled conditions, potential overfitting, synthetic data limitations, and the need for further validation on real-world datasets are discussed. Attributions based on Shapley additive explanations were computed offline for the ranking model and excluded from the online latency budget, which was dominated by approximate nearest neighbor (ANN) retrieval and listwise ranking. Our work demonstrates that high top-K accuracy, diversity-aware reranking, and post hoc explainability can be integrated within a single recommendation pipeline.
While initially validated under synthetic evaluation, the pipeline was further assessed on a public real-world behavioral dataset, highlighting deployment challenges in affiliate–seller platforms and revealing practical constraints related to incomplete metadata.

1. Introduction

Affiliate marketing is a rapidly growing industry. An affiliate promotes a seller’s products or services and earns commissions for clicks or completed sales, and success in affiliate marketing requires mutual support from both parties. In large systems, however, finding the right affiliate–seller match is challenging because these systems involve numerous stakeholders. Affiliates and sellers can be paired manually, but this approach is biased, slow, and unscalable, highlighting the need for intelligent pairing recommendations.
On a small scale, some straightforward guidelines can help, but these cannot support substantial expansion. Reliance on popularity lists tends to promote the same dominant sellers, leaving lesser-known items (the long tail) hidden. Standard collaborative filtering techniques, such as matrix factorization, work well when interactions are plentiful but fail under sparse data, new users, or empty interaction histories [1,2]. More recent neural recommenders address certain issues around data sparsity, but challenges such as the cold start, fairness, response time, and other practical problems remain unresolved [3,4].
Accordingly, we formulated affiliate–seller matching as a reciprocal recommendation problem in which one side recommends a ranked list of sellers for each affiliate (affiliate→seller), while the other recommends a ranked list of affiliates for each seller (seller→affiliate). We used a fully synthetic dataset comprising 20,000 affiliates, 2000 sellers, 26,200 transactions, 60,000 implicit interactions, temporal features, and text profiles embedded using Sentence-BERT (SBERT) [5]. The dataset was designed to approximate high-level categorical, geographical, and interactional distributions of typical affiliate platforms. We do not, however, claim that it reproduces a specific real-world marketplace, and this should be treated as a key limitation when interpreting the results.
We observed five considerable challenges:
  • Cold start: About 40% of the affiliates in our test split have no past transactions. Therefore, pure collaborative signals are missing, and classic MF models cannot learn useful embeddings for them [2,3].
  • Multimodal fusion: Effectively combining structured metrics (click-through rate (CTR), earnings per click (EPC), commissions), categorical values (language, category, platform), temporal contexts (season, hour), graph priors (affiliate–affiliate and seller–seller similarity), as well as bios and program description text embeddings is complicated [6,7,8].
  • Bias and diversity: Without sufficient countermeasures against popularity bias, head sellers are prioritized, resulting in uneven exposure and downstream discovery, thereby causing systems to miss many valuable niche matches [8].
  • Latency and scalability: Scalable real-time prediction systems require response times on the order of 100 ms. Although heavier models can yield improved accuracy, they sacrifice speed.
  • Trust and explainability: Business users typically ask why certain options are recommended to them. Black-box scores are insufficient, and clear reasons must be provided [9].
To resolve these challenges, we implemented a typical two-stage structure in the design of our system, similar to modern industrial recommenders [8]. Stage 1 focuses on retrieval to maximize speed, while Stage 2 involves reranking to optimize order and business utility. In Stage 1, the system learns meaningful representations of affiliates and sellers, enabling efficient matching even when users are new and have no historical data [8,10]. This design helps address the cold-start problem by relying on available profile information to create useful representations. Like real-world industry recommendation systems, the retrieval layer of our system is optimized for fast searches and scales to massive matching situations [8,11].
In Stage 2, top retrieved candidates are optimized through a learning-to-rank approach that enhances ordering quality [12,13]. Our model considers behavioral signals and profile-level characteristics to generate rankings that better reflect user intent and relevance [6,7]. Time-aware training is applied to avoid data leakage and ensure realistic evaluation [14].
Our final system achieved strong performance, showing major improvements in ranking quality compared with systems grounded in retrieval alone. The results demonstrate large gains in accuracy and coverage, verifying the effectiveness of a two-stage recommendation pipeline [8]. The system also maintains low latency suitable for real-time production environments, confirming its practicality [8].
The system includes reranking strategies that balance accuracy and diversity to ensure fairness and variety. This enables the platform to identify a wider range of relevant sellers rather than repeatedly promoting only the most popular ones. Experiments show a measurable trade-off between diversity and ranking precision, which is expected and acceptable for improved fairness and discovery [4,15,16]. Cold-start users are also evaluated, and the system can deliver meaningful initial suggestions even without prior interaction histories. As users engage more frequently with the system, the quality of its recommendations continues to improve [10]. Finally, explainability features are included to improve transparency and trust for business users. The platform provides simple, nontechnical explanations of why a recommendation appears, helping stakeholders understand and develop confidence in the system [9,17].
Beyond controlled synthetic settings, we also consider evaluation under real-world behavioral data, acknowledging the trade-offs between realism and metadata completeness that commonly arise in public datasets.
The contributions of this study are summarized as follows:
  • Designing a two-stage reciprocal recommender for affiliate–seller matching with multimodal inputs, evaluated in both affiliate→seller and seller→affiliate directions.
  • Training two-tower encoders with contrastive (InfoNCE) learning for fast approximate nearest neighbors (ANN) retrieval and documenting the architecture, loss, and negative sampling strategy in detail for reproducibility [8,11,18].
  • Using listwise ranking (ListNet) and cross-feature modeling (deep and cross network V2 (DCN-V2)) to rerank retrieved candidates and analyzing why the resulting normalized discounted cumulative gain at 10 (NDCG@10) scores are close to 1.0 under our synthetic evaluation setting, including checks for data leakage and pair overlap [7,12].
  • Implementing diversity-aware reranking (maximal marginal relevance (MMR)/explicit query aspect diversification (xQuAD)) and popularity penalties as well as quantifying their effects on provider-side coverage and long-tail seller exposure relative to NDCG/mean average precision (MAP) [4,15,16].
  • Using SBERT text embeddings and metadata filters to partially mitigate cold-start cases, with stratified performance for cold versus non-cold entities reported and the remaining gap addressed [5,10].
  • Adding post hoc attributions based on Shapley additive explanations (SHAP) for the ranking model (offline) and clarifying the scope and computational cost of explainability [9,19].
  • Conducting ablation studies, statistical significance tests, and latency measurements on a commodity server to support methodological choices while explicitly acknowledging the limitations of using synthetic data [8,14].
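The contrastive (InfoNCE) objective mentioned in the contributions can be illustrated with a minimal NumPy sketch. The unit basis vectors below are hypothetical stand-ins for learned affiliate and seller embeddings, not the paper's actual encoder outputs:

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE for a single anchor: pull the positive embedding close
    while pushing sampled negatives away. All vectors are L2-normalized."""
    pos_sim = anchor @ positive / tau
    neg_sims = negatives @ anchor / tau
    logits = np.concatenate([[pos_sim], neg_sims])
    # -log of the softmax probability assigned to the positive candidate
    return -pos_sim + np.log(np.sum(np.exp(logits)))

# Toy check with unit basis vectors (hypothetical embeddings):
e = np.eye(4)
negatives = np.stack([e[2], e[3]])
good = info_nce_loss(e[0], e[0], negatives)  # positive aligned with anchor
bad = info_nce_loss(e[0], e[1], negatives)   # positive orthogonal to anchor
# An aligned positive yields a much lower loss than an orthogonal one.
```

In training, minimizing this loss over many (affiliate, seller) pairs drives matched pairs together in the shared embedding space used for retrieval.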

2. Background and Related Work

2.1. Background

2.1.1. Recommendation Paradigms

As a critical component of digital platforms, recommender systems help users narrow their choices down from a plethora of options. There are three main paradigms for affiliate recommenders. First, collaborative filtering (CF) analyzes the interaction history between users and items. When two users demonstrate similar behaviors, the system assumes that they like the same items [2]. CF is a powerful paradigm under sufficient data, but it suffers from sparsity and cold-start challenges [1]. Second, content-based (CB) filtering relies on item and user attributes to recommend similar items, as discussed in general recommender system surveys [20,21]. CB is especially advantageous for new items with rich metadata, although it may overspecialize and limit diversity [20,21]. Third, hybrid models integrate CF and CB to mitigate the weaknesses of each paradigm [22]. Such models are preferred and widely deployed in industrial applications due to their robustness [8,22].

2.1.2. Two-Stage Pipelines

Many contemporary systems use two-stage recommendations [8]. The first step is retrieval, in which a small candidate set is quickly shortlisted from a considerably larger set. The second step is ranking, in which a candidate list is reordered according to a more sophisticated model. The speed of retrieval is critical since a system deals with massive sets of items. Retrieval methods include the popular embedding-based approaches that map users and items onto a common vector space [8]. Other examples include approximate nearest neighbor (ANN) methods such as hierarchical navigable small world (HNSW) graphs [11].
The ranking phase occurs after retrieval and describes the process of rescheduling the items in a top-K list. Common algorithms in this space include ranking by LambdaRank, LambdaMART, and deep & cross networks (DCNs) [7,13]. Although ranking models entail the slowest processing of all stages, this step is carried out on a small number of items, making more features readily available [8].
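As a reference point for the retrieval stage, the sketch below performs exact top-K cosine retrieval over toy embeddings; at production scale, an ANN index such as HNSW approximates exactly this search in sublinear time. All names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sellers = 64, 1000
seller_vecs = rng.normal(size=(n_sellers, dim)).astype(np.float32)   # item-tower outputs
affiliate_vec = rng.normal(size=(dim,)).astype(np.float32)           # user-tower output

def top_k_cosine(query, corpus, k=10):
    """Exact top-K by cosine similarity; an HNSW index would replace this at scale."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n
    top = np.argpartition(-sims, k)[:k]          # unordered top-k candidates
    return top[np.argsort(-sims[top])], sims     # sorted by descending similarity

candidates, sims = top_k_cosine(affiliate_vec, seller_vecs, k=10)
```

The candidate list produced here is what the subsequent ranking model reorders with richer features.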

2.1.3. Cold-Start Solutions

The cold-start problem, which occurs in most systems, pertains to the first set of users and items lacking a system history. Solutions include embeddings from text or metadata, such as the use of pretrained models like Sentence-BERT (SBERT) that offer rich semantic representations for users and items [5]; graph priors, which involve the construction of user and item similarity graphs that allow new nodes to inherit signals from neighboring nodes [11,23]; and meta-learning or transfer learning approaches that train models to adapt quickly to new entities [10].

2.1.4. Debiasing and Fairness

Recommender systems tend to be biased, with popular items gaining disproportionate exposure while niche items remain hidden. This lack of fairness and reduced visibility diminish diversity [20]. The problem is mitigated through diversity-aware approaches: the MMR framework [15] balances relevance and novelty, while xQuAD [16] distributes relevance across different facets of items. Popularity penalties are also effective in curbing dominant head items [4].
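The MMR idea can be sketched as a greedy loop that, at each step, selects the candidate maximizing a weighted difference between relevance and redundancy with already-selected items. The relevance scores and similarity matrix below are made-up inputs for illustration:

```python
import numpy as np

def mmr_rerank(relevance, sim, k, lam=0.7):
    """Greedy maximal marginal relevance reranking.

    relevance: (n,) candidate scores; sim: (n, n) item-item similarity in [0, 1];
    lam trades off relevance (lam=1) against diversity (lam=0)."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Items 0 and 1 are near-duplicates; with lam = 0.5, MMR demotes item 1.
relevance = np.array([1.0, 0.95, 0.6])
sim = np.array([[1.0, 0.99, 0.0],
                [0.99, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
order = mmr_rerank(relevance, sim, k=3, lam=0.5)
```

Pure relevance ranking would return [0, 1, 2]; the diversity term instead promotes the dissimilar item 2 to second place.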

2.1.5. Explainability in Recommenders

Explainability is crucial for trust, as stakeholders require explanations of why items are recommended. It also provides a means for researchers to scrutinize a model and verify that it uses the right signals. Explanations can be provided using SHAP (Shapley additive explanations), which measures the contributions of features [19]; integrated gradients (IGs), which explain a prediction by integrating the gradient of an output with respect to each input feature along a path from a baseline (e.g., zero input) to the actual input [24]; and permutation importance, which estimates the decrease in accuracy when each feature is shuffled [25].
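Permutation importance, the simplest of these techniques, can be sketched directly. The toy model and R² metric below are illustrative assumptions, not the paper's actual ranking model:

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in a score metric when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and the target
            drops.append(baseline - metric(y, predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Toy check: only feature 0 drives the target, so it should dominate.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0]
predict = lambda M: 2.0 * M[:, 0]
r2 = lambda t, p: 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)
importances = permutation_importance(predict, X, y, r2)
```

Features the model ignores show zero importance, which is exactly the property that makes the method useful for sanity-checking which signals a recommender relies on.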

2.2. Literature Review

2.2.1. Collaborative Filtering and Extensions

Early collaborative filtering (CF) approaches were based on user–item matrices. Koren et al. [1] introduced matrix factorization, which became a standard approach in recommender systems. Subsequently, Hu et al. [2] extended CF to implicit feedback datasets. Bayesian personalized ranking (BPR) [26] was proposed as a pairwise optimization framework for implicit feedback. Other extensions include temporal CF and context-aware CF, which incorporate temporal dynamics and contextual information such as location, device, or session data.

2.2.2. Content-Based and Hybrid Approaches

CB methods use attributes such as genres, tags, or descriptions. Lops et al. [27] surveyed CB techniques, which, however, tend to lack diversity. As previously stated, hybrid approaches combine CF and CB, and evidence suggests that hybrid models are more resilient to the cold-start challenge [22,28]. Burke [22] proposed a taxonomy of hybrid recommender systems, outlining different hybridization strategies. Hybrid recommendation systems are now widely adopted in industrial platforms, including companies such as Netflix [29].

2.2.3. Two-Stage Architectures

Two-stage systems are used by large platforms such as YouTube and TikTok. Covington et al. [8] described YouTube’s candidate generation and ranking networks in which the retrieval stage applies embeddings and ANN searches, and the ranking stage uses wide & deep models with rich features.
ANN retrieval methods aimed at large datasets include the HNSW method by Malkov and Yashunin [11], which enables efficient approximate searches. Burges et al. [13] proposed listwise optimization using LambdaRank and LambdaMART for ranking.
Wang et al. [6] discussed the development of deep and cross networks (DCNs) for ad click prediction, and further improvements were proposed by Wang et al. [7] to create DCN-V2, an enhanced deep and cross network that improves feature interaction learning and demonstrates strong effectiveness and scalability for large-scale learning-to-rank tasks in web systems.

2.2.4. Cold-Start Strategies

The introduction of SBERT by Reimers and Gurevych [5] has enabled the embedding of text into vectors and has become foundational for many applications. This is particularly useful in a cold start with a text profile as the only data. Vartak et al. [10] applied meta-learning to recommendation, showing that even a small number of interactions makes fast adaptation possible.

2.2.5. Bias, Fairness, and Diversity

The literature on recommender bias is extensive. For example, Abdollahpouri et al. [4] studied the impact of popularity bias on user satisfaction, and Carbonell and Goldstein [15] introduced the MMR framework for diversification, which has since been adopted in recommender systems. Santos et al. [16] suggested using xQuAD to address web search diversification, and Abdollahpouri et al. [4] considered long-tail exposure and popularity penalties as remedies. The recent literature on fairness and diversity states that exposure fairness with respect to different providers is a key factor in trust in a platform [28].

2.2.6. Explainability

Studies on explainability continue to grow given that in recommender systems, explainability serves the end-user, platform owner, and regulator [17,28]. Lundberg and Lee [19] proposed SHAP as a means of ensuring model interpretation, Sundararajan et al. [24] developed IGs for deep models, and Fisher et al. [25] discussed permutation importance.
Tintarev and Masthoff [17] presented explanation types and how they are related to trust and satisfaction, while SHAP has recently been used in ranking models to explain reason codes for recommendations.
All these components were integrated into our system to demonstrate how affiliate–seller recommenders can be developed with the precision of an industry solution, consistent with relevant scholarly work.

3. Materials and Methods

3.1. Exploratory Data Analysis (EDA)

We conducted a focused examination of our chosen dataset to identify structural patterns relevant to the model design, including distributional imbalance, feature relationships, and indicators that influence retrieval and ranking performance. The dataset comprises several tables, such as those on affiliates, sellers, transactions, and implicit interactions. These tables jointly describe the ecosystem and inform which features can meaningfully support reciprocal matching, especially under imbalance and cold-start conditions.

3.1.1. Distribution of Affiliate Categories

Figure 1 presents the distribution of affiliate categories in the dataset.
The category distribution in the dataset reveals a strong imbalance, with technology, fashion, and sport comprising the majority of affiliates. Long-tail groups (e.g., finance, education, travel) are underrepresented, which has direct implications for model fairness and exposure. Such an imbalance must be accounted for in both retrieval (embedding space density) and ranking (mitigating majority-class dominance).

3.1.2. Distribution of Affiliate Languages and Locations

Figure 2 summarizes the distribution of affiliate languages and geographical locations.
Languages and locations exhibit a similar imbalance, with English and Western regions dominating. This skew is relevant because embedding density and interaction patterns may favor these groups, potentially disadvantaging underrepresented languages and geographies. Accordingly, language and location are incorporated into feature engineering to mitigate geographic and linguistic bias in downstream recommendation stages.

3.1.3. Seller Category and Location Distribution

Figure 3 shows the distribution of seller categories and locations across the platform.
Seller distribution mirrors affiliate trends, with technology and fashion being dominant and long-tail seller groups remaining sparse. Since reciprocal matching requires balanced exposure between affiliate and seller sides, such concentration motivates the use of fairness-oriented reranking (MMR/xQuAD) to avoid reinforcing popularity-biased outcomes. Additionally, geographic skew toward Western markets signals potential cultural and demand-side biases that must be controlled during ranking.

3.1.4. Distribution of Company Tiers

Figure 4 illustrates the distribution of sellers across different company tiers.
The sellers in the dataset are mostly small businesses and startups, constituting close to 70% of the sample. Mid-market companies account for around 25%, and enterprise sellers account for only 100 cases.

3.1.5. Distribution of Commission Rates

Figure 5 depicts the distribution of commission rates among sellers.
Commission values show moderate variability but no extreme skew. Because commission strongly influences conversion likelihood, it is retained as a primary engineered feature in both the retrieval and ranking stages.

3.1.6. Distribution of Cookie Periods

Figure 6 displays the distribution of cookie periods observed in the dataset.
This feature was removed from downstream modeling because it showed minimal predictive signal in interaction-based learning.
The Performance Score Graph
Figure 7 presents the performance scores computed for affiliates and sellers.
This manually constructed score exhibited redundancy with other behavioral features (CTR, engagement) and was therefore excluded to avoid model overfitting and feature multicollinearity.

3.1.7. Correlation Heatmap

The correlation heatmap in Figure 8 shows the relationship between numeric features, namely, amount, click count, CTR, and engagement rate. Strong positive relationships are marked in red, and weak or zero relationships are denoted in lighter colors.

3.1.8. Kruskal–Wallis Test

The Kruskal–Wallis test compares multiple groups. It is a nonparametric test, meaning it does not assume that the data follow a normal distribution, unlike ANOVA, which requires normality and equal variance. The test assigns ranks to all values and then checks whether the ranks differ across groups. If the differences are substantial, at least one group is judged to differ from the others; if they are minimal, the groups are considered equivalent.
The first relationship tested was that between marketing platform and CTR. The test yielded a statistic of 3.0015 (p = 0.6998). The groups included in the analysis were the Blog, YouTube, Instagram, TikTok, Twitter, and Facebook platforms. The large p-value (0.70 > 0.05) provides no evidence that CTR differs across platforms, suggesting that CTR is similar across them, with no strong platform-specific effect.
The second relationship tested was that between seller category and transaction amount. The test yielded a statistic of 39.8544 (p = 0.00000341). The groups considered were technology, fashion, sport, health and wellness, home goods, jewelry, education, travel, and finance. The p-value (<0.05) indicates a statistically significant difference, suggesting that transaction amounts are not equivalent across seller categories. Some categories, such as technology and fashion, might generate higher amounts than categories such as travel or finance.
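The test itself is available in SciPy; the sketch below reproduces the qualitative pattern described above on hypothetical CTR samples (group names and distributions are illustrative, not the paper's data):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(7)
# Hypothetical CTR samples: three platforms drawn from the same distribution,
# plus one clearly different group.
blog = rng.normal(0.05, 0.01, 200)
youtube = rng.normal(0.05, 0.01, 200)
instagram = rng.normal(0.05, 0.01, 200)
high_ctr = rng.normal(0.20, 0.01, 200)

stat_same, p_same = kruskal(blog, youtube, instagram)  # expect a large p-value
stat_diff, p_diff = kruskal(blog, youtube, high_ctr)   # expect a tiny p-value
```

The first call mirrors the non-significant platform/CTR result, while the second mirrors the significant category/amount result.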

3.1.9. Revenue, Engagement Rate, and Composite Utility Graphs

Figure 9 presents revenue, engagement rate, and composite utility prior to model training.
Revenue and engagement histograms provide comprehensive descriptive insights but do not offer modeling-critical information beyond confirming the skewness already captured through numeric features. These visual descriptions were therefore omitted for conciseness. The insights below illustrate how the various features of the system interact and function before model training.
Success Probability Based on Signal Strength
With regard to engagement rate, success probability approaches 0.35 and remains relatively stable. With seller commission, success probability increases with stronger signals, reaching a maximum of 0.40. This finding suggests that commissions increase affiliate attraction and aid in conversions. Success probability with respect to CTR is unstable, as clicks are highly variable. Commission is the most influential variable for increasing success rates, as confirmed by the analyses.
Figure 10 depicts the increase in the likelihood of conversions as signal strength is heightened. The engagement rate, represented by the blue line in the figure, mostly stays within the range of 0.34 to 0.37. While there is some minor fluctuation as it moves up and down, there is no consistent upward movement. Even when the engagement rate rises, this does not mean that the chances of success improve; it simply stays consistent.
The orange line, which represents seller commission, begins at a low level of around 0.30 when commissions are low before gradually rising to approximately 0.40 when the commission rate is high. After 20 bins, commission slightly declines, indicating that large commissions tend to engender greater conversions, although excessively high levels may not lead to additional gains. The green line represents the CTR, which fluctuates randomly between 0.34 and 0.37, suggesting that this variable inconsistently and unreliably predicts success. Overall, commission is the most informative of the three signals, while engagement rate and CTR provide unstable conversion predictions.

3.1.10. Dominance and Strong Pairs

While interesting from a network analysis perspective, dominance patterns do not directly inform the proposed two-stage modeling pipeline and were therefore removed for brevity.

3.1.11. Text Embedding Visualization with t-SNE

Figure 11 presents a t-SNE visualization of the BERT-based text embeddings.
BERT embeddings were projected using t-SNE primarily to verify whether semantic similarity was preserved in low-dimensional spaces. Distinct clusters emerge for major thematic groups (e.g., technology, fashion), indicating that embeddings capture domain-level structures even in the absence of interaction histories. Moderate overlap between related domains (e.g., sport and health) is expected and reflects natural semantic proximity. This visualization qualitatively confirms that embeddings offer a viable mechanism for mitigating affiliate cold starts by enabling similarity-based retrieval independent of historical transactions.

3.1.12. Cold-Start Analysis for Affiliates and Sellers

Cold-start sparsity is a central modeling challenge: Approximately 95% of the affiliates and 10% of the sellers in our dataset lack historical interactions. Because this synthetic distribution intentionally reflects sparsity in real-world marketplaces, the system relies heavily on semantic features (SBERT embeddings, metadata, pairwise indicators) during retrieval. Detailed subgroup breakdowns (e.g., by category or location) were omitted for conciseness, as the model addresses the cold-start problem globally through embedding-based generalization rather than per-segment heuristics.
Utility by Hour of Day
Figure 12 presents the change in utility score across hours of the day, highlighting periods of more effective affiliate–seller interaction.
Certain hours, such as 3 to 4 and 10 to 11, exhibit utility peaks of almost 0.357. Hours such as 7 and 12 to 13 exhibit utility dips, dropping to between 0.327 and 0.337. This shows that platform activity is more productive at certain times of day than at others.
These insights inform timing-related recommendations: if a particular hour attracts more activity, the system can time the promotion of new items or campaigns accordingly.
Modeling EDA Summary
Here, we present a summary of what we learned during the exploration of features, the cold-start problem, and seller distribution before we launch into full model training.
Top Features for Utility: Engagement rate tops the list (0.8473) in terms of utility, reflecting that if an affiliate engages well with an audience, the chances of success are excellent. The second most useful variable is the commission provided (0.4353), indicating that sellers who offer better commissions have more utility. Other features also matter, but to a lesser extent: follower count, EPC, dwell time, CTR, average order value, network density score, average CTR, and temporal trend score. Both user actions (CTR, dwell time, and engagement) and seller actions (commission and order value) drive success.
Significance Test (p-values): The Kruskal–Wallis test indicates that the marketing platform minimally affects the CTR (p = 0.6998, a high value), meaning that the CTR distribution varies only slightly across the Blog, YouTube, Instagram, TikTok, Twitter, and Facebook platforms. Thus, platform choice has no statistically significant effect on CTR.
Cold-Start Problem: The rate of cold starts for affiliates is extremely high (95.35%), indicating that the majority of the affiliates on record are new, with limited prior data. This is in stark contrast to the case of the sellers, whose cold-start rate is low (9.9%), denoting an abundance of data for most of them. These findings imply that our model should address the cold-start problem for affiliates through strategies such as using embeddings and other zero-shot methods.
Seller Category Distribution (Top 10): The largest selling categories are technology (20.9%) and fashion (19.7%), followed by sport (14.8%), health and wellness (10.4%), home goods (9.5%), and jewelry (9.3%). Categories with smaller market shares are education (5.4%), travel (5.0%), and finance (4.6%). This distribution points to the tendency of the marketplace to lean more toward tech and fashion sellers, which requires affiliate recommendations that balance this bias.
To complement controlled synthetic experiments, we additionally report results derived from a public behavioral dataset, following a preprocessing and mapping strategy aligned with the proposed affiliate–seller abstraction.

3.2. Real-World Behavioral Dataset Construction and Preprocessing

We validated our reciprocal matching pipeline using a public real-world behavioral dataset, namely the RetailRocket event logs, rather than relying solely on synthetic interaction data. The objective was to assess the pipeline under conditions that reflect actual user behavior. From the raw event logs, we derived two primary signal types. The first consisted of transaction-level positive interactions used to form ground-truth affiliate–seller pairs. The second comprised temporal context features extracted directly from event timestamps, including month, hour, weekday, and seasonal indicators. This design improves external validity while maintaining a feature set that is simple, interpretable, and open to independent scrutiny. We additionally report split-level coverage limitations, as public datasets often lack complete affiliate or seller profile information for all identifiers.
Using the RetailRocket logs, we constructed a real interaction dataset by mapping visitor–item behavior to the affiliate–seller abstraction. An implicit interaction table was created using views, add-to-cart events, and confirmed transactions, with fixed and interpretable weights assigned to each interaction type (view = 0.1, add-to-cart = 0.6, transaction = 1.0). In parallel, a separate transaction table containing only confirmed purchase events was generated. Because the dataset does not consistently provide price information across all events, transaction amounts were set to a constant value of 1.0, allowing the analysis to focus on behavioral signals rather than monetary magnitude.
From transaction timestamps, several temporal attributes were computed, including month, day of week, hour, quarter, season, and a weekend indicator. This process resulted in a temporal feature table containing 22,427 records and a transaction-level table with 22,749 records. These outputs were subsequently aggregated into pair-level collaborative priors, such as transaction counts per pair, affiliate and seller totals, a simple density proxy, and a recency-weighted trend indicator. Overall, this setup enabled evaluation of the full recommendation pipeline on real behavioral data, despite the known limitations of public datasets.

3.3. Model Architecture

Our recommendation framework operates in three main stages. The first entails feature assembly using structured affiliate, seller, and pair-interaction signals derived from the engineered dataset.
In the second stage, a candidate generation module is deployed. This stage narrows the focus to a smaller pool of potential matches for each affiliate using embedding-based retrieval, which is optimized for scalability and recall (i.e., for filtering) rather than fine-grained ranking accuracy. The model then progresses to the ranking stage, which uses richer features and fine-tuned algorithms, such as ListNet or DCN, to optimize listwise ordering under NDCG- and MAP-based objectives.
The last stage involves postprocessing the results, which includes explicitly defining diversity reranking, coverage balancing, and popularity controls to ensure that the output lists are equitable and varied. The seamless flow of the system is summarized in Figure 13.
This modular design supports real-time retrieval, interpretable ranking, and controlled fairness adjustments in the final list.

3.4. Stage A: Setup and Splits

The first step in the methodology was to determine an optimal design for the dataset. To ensure unbiased evaluation, the data split strategy must prevent temporal leakage and pair reuse. In this case, we used strict time-aware splitting, which separates the training, validation, and testing data according to timestamps. Older data were used for training, while newer data were set aside for validation and testing to reflect realistic future prediction situations. To prevent leakage, we ensured that no pairs of affiliates and sellers overlapped between the training and test sets. This prevents the memorization of specific affiliate–seller links and avoids artificially inflated performance scores.
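The splitting procedure can be sketched as follows, assuming (affiliate, seller, timestamp) records and fixed boundary timestamps (both the tuple layout and the thresholds are illustrative assumptions):

```python
def time_aware_split(interactions, t_valid, t_test):
    """Split (affiliate, seller, timestamp) records by time, then drop any
    validation/test record whose (affiliate, seller) pair already appears in
    an earlier split, so specific links cannot be memorized."""
    train = [r for r in interactions if r[2] < t_valid]
    valid = [r for r in interactions if t_valid <= r[2] < t_test]
    test = [r for r in interactions if r[2] >= t_test]
    seen = {(a, s) for a, s, _ in train}
    valid = [r for r in valid if (r[0], r[1]) not in seen]
    seen |= {(a, s) for a, s, _ in valid}
    test = [r for r in test if (r[0], r[1]) not in seen]
    return train, valid, test
```

Cold-start tagging then follows directly: any affiliate or seller identifier in `valid`/`test` that never occurs in `train` is flagged as cold-start.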
Another critical part of the process was cold-start tagging as illustrated in Figure 14. When a test set contains affiliates or sellers missing in training, they are considered cold-start users or items. Cold-start tagging allowed us to separately evaluate the performance of unseen affiliates or sellers, which is necessary given the high affiliate cold-start rate noted earlier. Cold-start situations are prevalent in real-world applications, so explicit segmentation ensures the transparent reporting of warm-start versus cold-start performance. This design supports fair, leakage-free evaluation and forms the basis for reproducible benchmarking.

3.5. Stage B: Labels and Sampling

This section outlines the labels and the sampling approach taken to train the model. Labels formalize the supervision signal for both retrieval and ranking models and ensure reproducible training across experiments.

3.5.1. Hard Labels: Conversions

Hard labels represent the most powerful feedback, and in our work, we considered a conversion to be a hard label. In affiliate marketing, a conversion occurs when an affiliate takes an action that results in a sale or transactional outcome with a seller. There is either conversion or no conversion; hence, the label is binary. Conversion events are extremely sparse, which requires complementary supervision to avoid overfitting to rare events. In many cases, an impression or a click occurs without a transaction.

3.5.2. Soft Labels: Weighted Implicit Signals

A soft label, in contrast, is triggered by a click, share, save, or view, signals that together indicate graded interest. Signals are assigned weights by intent strength. We formalized implicit labels as a monotonic intent hierarchy (view < click < share < save) with calibrated weights to avoid overstating noisy signals. These implicit labels serve as auxiliary training signals used primarily by the ranker to improve robustness when conversion labels are insufficient.

3.5.3. Hard Negatives: Semantic and Popularity Control

Training requires negatives, but in this case, training focused on hard negatives, which are necessary for contrastive learning to prevent the trivial separation of positive and negative examples. In the first selection, semantic criteria are used; that is, pairs share high-level attributes (category, language, region) but have no interaction history, making them more informative than random negatives.
The goal of popularity regulation is not restricted to assigning unpopular sellers as negatives. To prevent reinforcing popularity bias, we mixed negative sampling across popularity strata. Together, semantic and popularity control produce harder examples that teach a model meaningful distinctions and are beneficial for contrastive learning and retrieval.
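A sketch combining the semantic filter with popularity-stratified sampling; the two-stratum median split and the field names (`category`, `popularity`) are illustrative assumptions, not the exact scheme:

```python
import random

def sample_hard_negatives(affiliate, sellers, positives, n, rng=None):
    """Pick sellers that share the affiliate's category (semantic hard
    negatives) but have no interaction history, alternating between low- and
    high-popularity strata to avoid reinforcing popularity bias.

    `sellers` maps seller_id -> {"category": ..., "popularity": ...};
    `positives` is the set of seller_ids the affiliate interacted with."""
    rng = rng or random.Random(0)
    candidates = [s for s, f in sellers.items()
                  if f["category"] == affiliate["category"] and s not in positives]
    # Two popularity strata split at the median (an illustrative choice).
    candidates.sort(key=lambda s: sellers[s]["popularity"])
    mid = len(candidates) // 2
    strata = [candidates[:mid], candidates[mid:]]
    negatives = []
    for i in range(n):
        stratum = strata[i % len(strata)]
        if stratum:
            negatives.append(rng.choice(stratum))
    return negatives
```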
Collectively, hard labels, structured implicit-feedback weights, and controlled negative sampling advance balanced supervision for retrieval and ranking while mitigating data sparsity, popularity bias, and leakage risks.

3.6. Stage C: Feature Engineering

Feature engineering transforms an unrefined dataset into organized signals for machine learning models. We defined three structured feature sets—affiliate, seller, and pair level—chosen on the basis of prior work in reciprocal and two-sided matching systems. We designed features in accordance with the aforementioned levels and added embeddings from pretrained language models.

3.6.1. Affiliate Features

These were classified into numeric and categorical data.
Numeric Features
Engagement rate refers to the rate of interaction an audience has with affiliate content, with high levels signifying more high-quality followers. Follower count represents the size of the audience that an affiliate has. Average CTR denotes the ratio of clicks to impressions. Content quality score reflects the rating given to affiliate content, with better content increasing trust. Total affiliate transactions indicate the total transactions that an affiliate has. Network density score captures the strength of an affiliate’s connection in the network graph, with increased density resulting in more relationships with sellers. Category popularity score measures the acclaim that an affiliate’s category receives. Temporal trend score is used to monitor changing activity trends over various periods.
Categorical Features
Affiliate category refers to the domain (e.g., technology, fashion, health). Affiliate language denotes the primary tongue. Marketing platform includes examples such as YouTube, Instagram, Blogs, TikTok, and Twitter. Affiliate location represents the country of operation. Socially and commercially, these features represent affiliates’ attributes [30].

3.6.2. Seller Features

Seller features include entrepreneurial and operational attributes.
Numeric Features
Commission provided refers to the affiliate commission rate, with higher commissions potentially attracting more affiliates. Average order value (AOV) represents the average value per transaction. EPC denotes earnings per click. Total seller transactions indicate the cumulative previous transactions.
Categorical Features
Seller category refers to the product domain (e.g., technology, fashion, jewelry, sport). Seller location represents the geographical region. International shipping indicates worldwide shipping availability (yes/no). These features can be used to examine seller products and may hold value for affiliates.

3.6.3. Pair Features

Pair features are interaction-level signals between affiliate and seller. These features are critical because recommendations entail matching two sides. Category match indicates identical product categories (1/0). Location match denotes identical geographical areas. Language match represents identical languages. CTR prior refers to the prior average CTR. Conversion prior captures the mean conversion prior. Transaction count indicates the total past transactions. Affiliate conversion mean represents the average affiliate conversion for all sellers. Seller conversion mean denotes the average seller conversion for all affiliates. Pair features allow for learning the strength of interaction [2,26].

3.6.4. Embedding Integration

Alongside handcrafted features, text models provide embeddings. An embedding is a numeric, low-dimensional, semantic representation of text. For our purposes, we used SBERT (all MiniLM L6 v2), which produces 384-dimensional embeddings. We created affiliate text embeddings with dimensions (5441, 384) for training, (1358, 384) for validation, and (1327, 384) for testing. Seller text embeddings were created with dimensions (2000, 384) for training, (1922, 384) for validation, and (1906, 384) for testing. These embeddings were combined with the numeric and categorical features to form richer profiles.

3.6.5. Final Feature Pack

This feature representation supports candidate generation by incorporating individual and relational information [8], and supports ranking through learned feature interactions [6,7].

3.7. Stage D: Candidate Generation

As the first stage in our pipeline, candidate generation involves assembling a small set of potential sellers for each affiliate, and vice versa. In the following stage, the ranker performs a detailed analysis. Candidate generation needs to be carried out quickly and be scalable to prevent a bottleneck in the system [8].

3.7.1. Two-Tower Architecture

We implemented a two-tower model that encodes affiliate and seller features into a shared 128-dimensional vector space where closeness to each other indicates a good match. Precomputed seller embeddings decrease latency by ensuring that affiliate queries are processed much faster. Seller embeddings are computed in advance and stored for retrieval [31]. The outputs of our training are as follows. Affiliate embeddings were produced with dimensions 5441 × 128 for training, 1358 × 128 for validation, and 1327 × 128 for testing. Seller embeddings were produced with dimensions 2000 × 128 for training, 1922 × 128 for validation, and 1906 × 128 for testing.
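A structural sketch of the two towers follows; it is randomly initialized and untrained, and the hidden width of 256, ReLU activation, and input dimensions are assumptions for illustration (only the 128-dimensional L2-normalized output matches the text):

```python
import numpy as np

class Tower:
    """One tower of the retrieval model: a small MLP followed by L2
    normalization, so dot products equal cosine similarity."""
    def __init__(self, in_dim, out_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(in_dim, 256))
        self.W2 = rng.normal(scale=0.1, size=(256, out_dim))

    def __call__(self, x):
        h = np.maximum(x @ self.W1, 0.0)  # ReLU hidden layer
        z = h @ self.W2
        return z / np.linalg.norm(z, axis=1, keepdims=True)

# Affiliate and seller towers map different feature spaces into the
# shared 128-dimensional embedding space; input dims are hypothetical.
affiliate_tower = Tower(in_dim=32, seed=1)
seller_tower = Tower(in_dim=24, seed=2)
```

Because seller embeddings are precomputed and stored, serving a query reduces to one affiliate-tower forward pass plus a nearest-neighbor lookup.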

3.7.2. InfoNCE Contrastive Learning

The two-tower model was trained with InfoNCE loss, a contrastive loss function that pulls together the embeddings of real affiliate–seller pairs and pushes apart the embeddings of negative pairs [18]. Each positive pair is contrasted with a set of negative pairs in the same batch, forcing the model to learn more meaningful and discriminative representations [18]. The training results (validation InfoNCE) are as follows. The best validation InfoNCE was 6.9234 at epoch 8. Early stopping was applied at epoch 18, with the best weights restored. The resulting embeddings were stable and avoided overfitting.
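The in-batch InfoNCE objective can be sketched as follows, assuming L2-normalized embeddings; the temperature of 0.1 is an assumption, not a reported hyperparameter:

```python
import numpy as np

def info_nce(affiliate_emb, seller_emb, temperature=0.1):
    """In-batch InfoNCE: row i of each matrix forms the positive pair, and
    every other seller in the batch serves as a negative for affiliate i."""
    logits = (affiliate_emb @ seller_emb.T) / temperature
    # Numerically stable row-wise log-softmax; the diagonal holds positives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Aligned positive pairs drive the diagonal log-probabilities toward zero, so the loss falls as matched affiliates and sellers move together in the embedding space.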

3.7.3. ANN Retrieval

Following the training of embeddings, we proceeded with the candidate search using ANN. ANN search is much faster than brute force and allowed us to retrieve the top 200 candidates per query in under 1 ms. For a single index, the build time was also under 1 ms, while the total query time over the evaluation split was 67 ms in the forward (A→S) direction and 89 ms in the backward (S→A) direction. For K = 50, the exact-to-ANN overlap was 100%, reflecting the near-exactness of the ANN results [11].
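For reference, the exact brute-force top-K retrieval against which the ANN overlap is measured can be sketched as follows; a production deployment would replace this with an ANN library such as FAISS or hnswlib:

```python
import numpy as np

def topk_exact(query_emb, index_emb, k=200):
    """Exact top-K retrieval over L2-normalized embeddings (dot product =
    cosine). This is the brute-force reference used to compute the
    exact-to-ANN overlap reported above."""
    scores = query_emb @ index_emb.T
    k = min(k, index_emb.shape[0])
    # argpartition finds the K best in O(n); then order those K by score.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    rows = np.arange(query_emb.shape[0])[:, None]
    order = np.argsort(-scores[rows, top], axis=1)
    return top[rows, order]
```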

3.7.4. Training Curves

The training curve shows fast improvement in the first few epochs, as illustrated in Figure 15. The model converges steadily until it reaches a plateau, with the best validation InfoNCE achieved at epoch 8. Early stopping was applied at epoch 18 after no further improvement was observed, indicating stable training behavior without overfitting.

3.7.5. Evaluation Metrics

Table 1 summarizes the retrieval performance on the validation set for both affiliate-to-seller and seller-to-affiliate directions.
Retrieval metrics are expected to be low at this stage, as the primary objective is broad coverage rather than precision.

3.8. Stage E: Ranking

At this stage, we focused on reranking the sellers for each affiliate after candidate generation. This stage is crucial as it was intended to reduce the initial 50 candidate pool and rank the top 10 with considerable precision. We compared three models:

3.8.1. Bayesian Personalized Ranking (BPR)

BPR is a pairwise learning method; that is, it learns pairwise preferences (e.g., affiliate prefers seller A over B) from implicit data (clicks, views) without explicit ratings [26]. This model served as a baseline, although it was expected to underperform compared with listwise methods because it compares only pairs rather than entire lists.
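The pairwise BPR objective over a batch of (positive, negative) score pairs can be sketched as a minimal example (the numerically stable `logaddexp` form of the negative log-sigmoid is an implementation choice):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Pairwise BPR objective: maximize the log-sigmoid of the margin
    between a preferred (positive) and a non-preferred (negative) seller.
    -log(sigmoid(m)) is computed stably as log(1 + exp(-m))."""
    margin = np.asarray(pos_scores) - np.asarray(neg_scores)
    return float(np.mean(np.logaddexp(0.0, -margin)))
```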

3.8.2. ListNet and LambdaRank (Listwise Models)

LambdaRank is a listwise learning model that focuses on optimizing an entire ranked list rather than individual item pairs [13]. It is an enhancement of RankNet and incorporates a lambda gradient that places greater emphasis on the top of the ranked list, thereby pushing more relevant items upward in the ranking. ListNet was also evaluated as a listwise ranking model in our experiments. ListNet produced an NDCG@10 of approximately 0.9956 and a MAP@10 of around 0.9956, indicating a good fit with our feature-rich dataset.
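ListNet's top-one approximation can be sketched as the cross-entropy between the softmax distributions induced by the true and predicted scores of one list (a minimal single-list version, not the full training loop):

```python
import numpy as np

def listnet_loss(pred_scores, true_scores):
    """ListNet top-one loss for a single ranked list: cross-entropy between
    the softmax distributions of true and predicted scores."""
    def softmax(x):
        x = np.asarray(x, dtype=float)
        e = np.exp(x - x.max())  # shift for numerical stability
        return e / e.sum()
    p_true, p_pred = softmax(true_scores), softmax(pred_scores)
    return float(-np.sum(p_true * np.log(p_pred)))
```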

3.8.3. DCN-V2

DCN-V2 is a cross-feature neural model that not only examines individual features but also integrates and crosses features (e.g., engagement × commission, or location × language). This is useful in identifying complex interactions [7]. The cross-layer helps the generalization of a model when features are sparse or in high dimensions.

3.8.4. Results from Our Data

Training Results
ListNet-Lite achieved convergence in 5 epochs, with validation NDCG@10 held constant at 0.9956. The MAP@10 was 0.9956, which also indicates solid ordering and relevant marked items. Top-10 seller coverage was 72.74%, indicating that the models did not fixate on a handful of sellers but rather spread exposure more broadly. The HHI concentration was approximately 0.0056, which reflects fairness.
In this comparison, BPR underperformed, whereas ListNet and DCN-V2 showed strong results, with ListNet somewhat more stable. Based on these observations, we selected the hybrid ListNet + DCN-V2 model for the final ranker pipeline.

3.8.5. Why Ranking Is Important

While candidate generation brings in many sellers, ranking determines who the top 10 sellers are. In this stage, we deployed sophisticated algorithms to ensure that the list is not only correct but also diverse and equitable.

3.8.6. Training Objectives: NDCG and MAP

The ranking model is intended to predict which items are relevant and arrange them in the order that best matches the ground truth. For this purpose, we used two of the most popular ranking metrics in training and evaluation. NDCG is a listwise metric that checks whether relevant items are present and rewards a model heavily when they appear at high positions [13]. MAP computes the average precision over all relevant items for each user and then averages across users to obtain the final score [12]; it is particularly stable when relevance labels are binary. ListNet and DCN-V2 obtained NDCG@10 = 0.9956 and MAP@10 = 0.9956 in our experiments, indicating that they reliably surface relevant sellers.
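Both metrics can be sketched for a single ranked list of relevance labels (binary or graded, listed in ranked order):

```python
import numpy as np

def ndcg_at_k(relevance, k=10):
    """NDCG@K for one ranked list: DCG with log2 position discounts,
    normalized by the DCG of the ideal (sorted) ordering."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:k]
    idcg = np.sum(ideal * discounts[:ideal.size])
    return float(dcg / idcg) if idcg > 0 else 0.0

def ap_at_k(relevance, k=10):
    """Average precision@K for one list with binary labels: mean of the
    precision values at each relevant position."""
    rel = np.asarray(relevance)[:k]
    hits, precisions = 0, []
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0
```

MAP@K is then the mean of `ap_at_k` across all evaluated affiliates.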

3.8.7. Ranking Pipeline: Retrieve 50 → Rerank 10

The ranking stage follows a two-layer pipeline:
  • First, Candidate Generation (Stage D): From the entire catalog, the system first retrieves around 200 candidates using a two-tower and ANN search. Out of these, the top 50 candidates are sent to the ranking step.
  • Second, Ranking (Stage E): The ranking models (BPR, ListNet, DCN-V2) then reorder the 50 candidates. The final output is the top 10 sellers displayed to the affiliate.
This pipeline works well: The first step speeds up the process by reducing the pool, and the second step ensures accuracy by using a deep learning model. Hence, we ensured a good tradeoff between speed and accuracy, both of which are crucial in a recommendation system [8].

3.9. Stage F: Postprocessing

Postprocessing improves the fairness and diversity of a ranked list, guaranteeing that popular sellers do not monopolize the list and that underserved sellers receive their fair share of exposure.

3.9.1. MMR Reranking (Diversity)

MMR involves the selection of items on the basis of relevance while considering how different they are from items that have already been chosen [15]. In our case, the system avoids showing 10 technology sellers together even if all of these have high scores. It also includes sellers from other categories, such as fashion or sports, so the list appears more diverse. Thus, MMR improves categorical diversity in the recommendation list.
However, there was a small drop in the NDCG score: as λ varied from 0.3 to 1.0, NDCG@10 decreased from 0.9956 to 0.919. This reflects the expected trade-off between accuracy and diversity.
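Greedy MMR selection can be sketched as follows, assuming a precomputed item-item similarity matrix (hypothetical; e.g., 1.0 when two sellers share a category, 0.0 otherwise):

```python
import numpy as np

def mmr_rerank(relevance, similarity, k=10, lam=0.7):
    """Maximal marginal relevance: greedily pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to already-chosen items).
    `similarity` is a symmetric item-item matrix."""
    relevance = np.asarray(relevance, dtype=float)
    chosen, remaining = [], list(range(len(relevance)))
    while remaining and len(chosen) < k:
        def mmr_score(i):
            redundancy = max((similarity[i][j] for j in chosen), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Setting `lam = 1.0` recovers pure relevance ordering; lowering `lam` pushes dissimilar categories up the list, which is the mechanism behind the NDCG-versus-diversity trade-off observed above.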

3.9.2. xQuAD Coverage

xQuAD is another reranking method [16] that ensures multiple aspects (categories) are covered in the final list. For each item, it balances the rank score against how much the item improves coverage of a not-yet-represented category. xQuAD increased the average category coverage for long-tail sellers from 4.15 to 9, at the cost of an ~7.3% drop in NDCG@10, similar to MMR.
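A simplified, illustrative version of this coverage-aware selection follows; the full xQuAD formulation also models aspect probabilities, whereas this sketch reduces each aspect to a binary covered/uncovered flag:

```python
def xquad_rerank(items, k=10, lam=0.3):
    """Simplified xQuAD: greedily trade off the base rank score against the
    marginal coverage of categories not yet represented in the list.
    `items` maps item_id -> (score, category)."""
    chosen, covered = [], set()
    remaining = dict(items)
    while remaining and len(chosen) < k:
        def xquad_score(i):
            score, cat = remaining[i]
            novelty = 0.0 if cat in covered else 1.0
            return (1.0 - lam) * score + lam * novelty
        best = max(remaining, key=xquad_score)
        chosen.append(best)
        covered.add(remaining[best][1])
        del remaining[best]
    return chosen
```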

3.9.3. Popularity Penalty λ-Sweep

Popularity bias is a significant issue in recommenders, causing popular sellers to appear everywhere while small-scale sellers acquire almost no exposure [4]. To solve this problem, we applied a penalty controlled using λ. When λ = 0, no penalty was applied and popular sellers dominated, but when λ = 1.0, the exposure of popular sellers was reduced, increasing the visibility of unpopular sellers. At λ = 0, NDCG@10 was 0.9956, but diversity was very low. At λ = 0.5, diversity increased, while NDCG@10 dropped to approximately 0.919. At λ = 1.0, strong diversity was achieved, but with an almost 7.8% decline in utility. This means that λ requires careful selection to ensure both accuracy and fairness.
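The λ-controlled penalty can be sketched as below; normalizing popularity to [0, 1] before subtracting is an illustrative choice:

```python
import numpy as np

def popularity_penalized(scores, popularity, lam=0.5):
    """Subtract a lambda-weighted normalized-popularity term from each
    seller's score; as lam -> 1, exposure shifts toward long-tail sellers."""
    pop = np.asarray(popularity, dtype=float)
    pop = pop / pop.max() if pop.max() > 0 else pop
    return np.asarray(scores, dtype=float) - lam * pop
```

Sweeping `lam` from 0 to 1 and re-measuring NDCG@10 and Diversity@10 at each value reproduces the trade-off curve described above.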

3.9.4. Summary of Postprocessing

MMR maintains a relevance–diversity balance. xQuAD guarantees coverage across multiple categories. The popularity penalty limits the dominance of popular sellers. Together, these postprocessing methods trade a small amount of accuracy for fairness and inclusivity.

3.10. Stage G: Explainability

Explainability in recommender systems is essential. A model may be accurate, yet users and clients still need a rationale for the recommendations they are shown. We aimed for transparency through both local and global explainability strategies [28].

3.10.1. SHAP Global Importance

SHAP is a game-theoretic method that explains a prediction by quantifying how much each feature contributes to the predicted score. Global importance ranks features by aggregated SHAP values, as illustrated by the following results [19]. Engagement rate showed the strongest contribution at approximately 0.847, followed by commission provided at around 0.435. Features such as follower count, EPC, CTR, and AOV contributed at a moderate level, with values in the range of approximately 0.13–0.15. Temporal trend score and network density score provided additional matching signals. These findings suggest that affiliate activity and seller incentives drive recommendations, consistent with business intuition.
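The aggregation behind this global ranking can be sketched as follows, given a precomputed matrix of per-prediction SHAP values (in practice produced by the shap library; this sketch covers only the mean-absolute-value aggregation step):

```python
import numpy as np

def global_importance(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across predictions, the
    standard aggregation behind global-importance bar charts."""
    magnitude = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(-magnitude)
    return [(feature_names[i], float(magnitude[i])) for i in order]
```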

3.10.2. Visualizations (Summary and Force Plots)

SHAP visualization enhances explainability in the following ways. Summary plots display the most important features and the distributions of their effects across multiple predictions [19], providing a clear picture of the most influential features. Force plots illustrate, for a single recommendation, how individual features raise or lower the prediction score, serving as a clarification for one user–item pair. In our experiments, summary plots confirmed that engagement rate and commission dominate globally, while force plots clarified recommendations for cold-start affiliates, where seller-selection signals are scarce.

3.11. Experimental Setup

For evaluation, we constructed positive pairs based on observed affiliate–seller transactions and sampled negative pairs by selecting sellers with whom the same affiliate had no prior interaction [2]. A negative sampling ratio of 3:1 was adopted, whereby three negative samples were generated for each positive pair. This strategy provided a more balanced learning signal without introducing artificial labels.
Data splitting was performed at the affiliate level using a 70/15/15 split for the training, validation, and test sets, respectively. This design was chosen to reduce data leakage and to better approximate a realistic cold-start scenario, in which an affiliate’s interactions do not appear across multiple splits simultaneously.
Under this setting, the real-world dataset yielded 60,952 training pairs, including 15,238 positive interactions. The validation set comprised 12,844 pairs with 3211 positives, while the test set contained 11,284 pairs, of which 2821 were positive. The corresponding transaction tables included 16,292 rows for training, 3431 rows for validation, and 3026 rows for testing.

3.12. Model

We trained a lightweight two-tower retrieval model using InfoNCE contrastive learning [18], with the objective of embedding affiliates and sellers into a shared 64-dimensional latent space. Each tower consisted of a small multilayer perceptron (MLP) followed by L2 normalization, such that dot-product similarity directly corresponded to cosine similarity during retrieval.
Because the number of affiliate profiles and seller profiles is not necessarily equal, embeddings were exported in a tower-wise manner, with each side handled independently. This approach ensured cardinality-safe inference and prevented mismatch issues during retrieval.
In our experiments, the exported embedding matrices for affiliates and sellers were represented in a 64-dimensional latent space across training, validation, and test splits. For affiliates, the embedding sizes were 133, 25, and 17 entities for the training, validation, and test sets, respectively. For sellers, the corresponding sizes were 50, 30, and 29 entities. During training, the model exhibited stable convergence behavior, with the InfoNCE loss decreasing steadily and reaching a value of 0.1272 by epoch 25.

4. Results

This section discusses the performance of the trained system in terms of item retrieval and ranking, diversity enhancement, decision explainability, and overall effectiveness under varying configurations. The near-perfect ranking metrics reported in this section should be interpreted in light of the controlled experimental setup, synthetic data generation, and strict pair-level splitting used to prevent data leakage.

4.1. Candidate Generation

This phase was intended to identify suitable candidate sellers for each affiliate before the final ranking. We used the following three key metrics in model evaluation: Recall@K was used to assess model coverage by determining the fraction of pertinent sellers retrieved in the top-K results; Precision@K was deployed to evaluate the relevance of the top-K recommendations by determining pertinent proportions; and HitRate@K, a dichotomous metric, was employed to assess the relevance of sellers in the top-K results and account for retrieval success if at least one pertinent seller is present [20].
In the experiments, the two-tower model performed favorably compared to the baseline approaches. In the A→S (affiliate to seller) direction, the model attained a Recall@10 of 0.0037 and a HitRate@10 of 0.0037. In the reverse S→A (seller to affiliate) direction, the results were a Recall@10 of 0.0116 and a HitRate@10 of 0.0168. These outcomes suggest that even in a cold-start situation, the retrieval network identifies relevant matches.
By relying only on user–item co-occurrences, the comparative ALS model encountered challenges surrounding new users and sellers, and, as a result, performed poorly [1,2].

4.2. Ranking

In this final stage, the candidate sellers were reordered using three different models: BPR ranked by pairwise preference differences, LambdaRank/ListNet used a listwise approach optimizing NDCG, and DCN-V2 exploited higher-order feature interactions [7,13,26].
The final ranker, which is a hybrid of ListNet and DCN-V2, yielded an NDCG@10 of 0.9956, a MAP@10 of 0.9956, and a Recall@10 of 0.9956, which are almost perfect results. These metrics show that the hybrid significantly outperformed baselines such as BPR and ALS, which speaks to the model’s capability to learn seller prioritization relative to affiliate interest, conversion rate, and content quality [1,2,26].

Segment-Wise Performance

The model also showed accurate ranker performance across the affiliate segments of fashion, technology, and health and wellness, as well as the languages English, Arabic, and Spanish. The model maintained a recall value close to 0.995, even in cold-start affiliate situations. The consistency in ranking performance across all groups was driven by the cross-feature fusion of textual, numerical, and categorical features.

4.3. Postprocessing

For additional diversity and fairness postprocessing adjustments after ranking, we used the MMR and xQuAD algorithms. MMR combines relevance and diversity by penalizing candidate items for being too similar to already selected ones [15]. Its formulation is score = λ × relevance − (1 − λ) × similarity, where λ ∈ [0, 1] balances relevance against redundancy with previously selected items. xQuAD ensures that retrieval does not concentrate on dominant or repeated subcategories [32].

Popularity Penalty λ-Sweep

Diversity in the top 10 items was enhanced by varying λ between 0 and 1. This resulted in increased Diversity@10 alongside a reduced NDCG score from 0.995 to 0.918. These results indicate that the potential gains in utility diminish as diversity increases, but a fairer and more balanced recommendation list is produced.

4.4. Explainability

Explainability was considered here to foster trust and ensure transparency in our model’s recommendations. This was achieved using SHAP for both global and local analyses.

4.4.1. Global SHAP

The primary contributing features were engagement rate (0.84), commissions provided (0.43), follower count (0.15), and EPC (0.15). These results show that models depend mainly on engagement and seller incentives in deciding on recommendations.

4.4.2. Local Explanation

Reason codes were also generated for individual predictions. For example, if seller 11001 was recommended to affiliate 50010, the top positive contributions were derived from transaction count, average CTR, and seller conversion means. This enabled us to justify the display of each seller, thereby increasing the transparency of the system [19].

4.5. Statistical Validation

This section describes the statistical tests performed to verify that the performance improvements were not due to chance. To compare the baseline and final models, we conducted a paired t-test and the Wilcoxon signed-rank test [14]. NDCG@10 improved from 0.0877 to 0.9947, while Recall@10 increased from 0.2021 to 0.9945. The paired t-test yielded p < 0.001, corroborated by the Wilcoxon test with a p-value of approximately 1.35 × 10⁻²⁴⁸. The 95% confidence interval for the NDCG@10 improvement, (0.8958, 0.9169), excludes zero, confirming that the gain is genuine.
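The paired t-statistic over per-query metric values can be sketched as below; in practice, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` provide both tests together with p-values (the sample values in the test are hypothetical, not our measurements):

```python
import math

def paired_t(x, y):
    """Paired t-test statistic for per-query metric values of two models:
    t = mean(d) / sqrt(var(d) / n), where d are the paired differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```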

4.6. Ablation Studies

Here, we explain the reasoning behind the design of the model. Each component is removed one at a time to elucidate which portions of the model are most important.
The ablation results are summarized in Table 2.
Hence, the text + numeric + pair combination generates the strongest result [33].

4.7. Scalability and Latency

A practical system must respond quickly in deployment. The latency and scalability results are summarized in Table 3.
Index size for retrieval was compact. The seller index (1922 × 128 dimensions) occupied 0.94 MB, while the affiliate index (1358 × 128 dimensions) occupied 0.66 MB. These results demonstrate that the system is lightweight, scalable, and ready for deployment even in a CPU-only environment.

4.8. Evaluation

We evaluated retrieval performance in the affiliate-to-seller (A→S) setting using standard retrieval metrics, including Recall@K, HitRate@K, and Mean Reciprocal Rank (MRR) [12,20]. The value of K was dynamically constrained based on the number of available candidates for each query (denoted as topk_used*), in order to avoid misleading evaluations when the candidate pool was small.
Due to incomplete profile coverage in the public dataset, some affiliates or sellers appearing in the split interaction pairs did not have corresponding profile records. As a result, after enforcing split-local identifier constraints, the number of valid evaluation queries with at least one ground-truth positive was substantially reduced.
Under the current data split, the validation set contained only a single affiliate with at least one valid ground-truth positive (N = 1), while the test split contained none. Consequently, retrieval metrics for the test split are reported as NaN by design. These NaN values do not indicate model failure; rather, they reflect structural limitations arising from incomplete metadata coverage in the public dataset.

5. Discussion

5.1. Practical Insights

Based on the results, the two-tower + DCN-V2 pipeline offers a reasonable trade-off between retrieval speed and ranking accuracy. Retrieval using ANN completed in under 2 ms, and ranking using DCN-V2 took between 75 and 130 ms, which is near real-time. Although larger models such as DCN add complexity, early stopping and small batch sizes kept inference fast. This suits affiliate recommendation, where timely responses are imperative, and the combination of high accuracy (NDCG@10 ≈ 0.9956) and low latency is practical.

5.2. Cold Start and Novelty

Cold-start problems appear with new users or new sellers that are entirely without historical data. Incorporating SBERT embeddings into the model allowed utilization of additional structured metadata (i.e., category, language, and location) to help alleviate the cold-start issue [5]. Even with 40% cold-start affiliates, the HitRate@10 = 0.016 indicates that the system can predict the relational value of previously unseen profiles.
The real-world evaluation further illustrates a common tension in recommender system research between behavioral realism and metadata completeness, particularly under cold-start and sparse-profile conditions.
From a methodological standpoint, our model brings together components that are often studied separately, namely utility-based ranking, diversity-aware reranking, and post hoc explainability, and applies them to a stylized affiliate–seller setting on synthetic data. Whereas the literature in our field treats these elements in a siloed fashion, our pipeline synthesizes them into a robust, cohesive system.

5.3. Explainability and Fairness

The reranking based on MMR and xQuAD increased recommendation diversity and improved result coverage [15]. Such diversity-aware mechanisms can help mitigate popularity bias and improve the visibility of less mainstream sellers [4]. Together with explainability components, these properties contribute to building recommender systems that better support accuracy, fairness, and transparency. Explainability was applied specifically at the ranking stage, where feature semantics are well-defined and directly interpretable.

5.4. Limitations and Future Work

Our dataset was small (approximately 26,000 samples), so the learned embeddings may be under-trained, which limits the generalizability of our results. Future studies should explore online retraining with reinforcement learning as an additional means to improve flexibility. As suggested in the literature, future research can also incorporate image and audio embeddings to make the system multimodal and enhance its human-like capabilities [34].

6. Conclusions

This research developed a unified two-stage recommendation system in which candidate generation, ranking, and postprocessing are cohesively integrated. Low-latency candidate generation is achieved through the two-tower model trained with InfoNCE loss, while the ListNet and DCN-V2 rankers refine the shortlist to enhance personalization and fairness. Integrating MMR- and xQuAD-based diversity reranking yields meaningful gains in recall, NDCG, and system-wide diversity.
By examining the proposed pipeline under both synthetic and real-world behavioral settings, this work highlights not only performance characteristics but also practical deployment considerations arising from data availability constraints.
Embedding textual, behavioral, and categorical data helps address the cold-start problem with new users. This research brings together the components of explainable, utility-driven, and diverse recommendations in one system, paving the way for the development of advanced intelligent recommender systems.

Author Contributions

Conceptualization, H.A.; methodology, H.A.; software, H.A.; validation, H.A. and M.Y.; formal analysis, H.A.; investigation, H.A.; resources, H.A.; data curation, H.A.; writing—original draft preparation, H.A.; writing—review and editing, H.A. and M.Y.; visualization, H.A.; supervision, M.Y.; project administration, H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Koren, Y.; Bell, R.; Volinsky, C. Matrix Factorization Techniques for Recommender Systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  2. Hu, Y.; Koren, Y.; Volinsky, C. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; IEEE: New York, NY, USA, 2008; pp. 263–272. [Google Scholar] [CrossRef]
  3. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2017; pp. 173–182. [Google Scholar] [CrossRef]
  4. Abdollahpouri, H.; Burke, R.; Mobasher, B. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, Como, Italy, 27–31 August 2017; ACM: New York, NY, USA, 2017; pp. 42–46. [Google Scholar] [CrossRef]
  5. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3980–3990. [Google Scholar] [CrossRef]
  6. Wang, R.; Fu, B.; Fu, G.; Wang, M. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, 14 August 2017; ACM: New York, NY, USA, 2017; pp. 1–7. [Google Scholar] [CrossRef]
  7. Wang, R.; Shivanna, R.; Cheng, D.; Jain, S.; Lin, D.; Hong, L.; Chi, E. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; ACM: New York, NY, USA, 2021; pp. 1785–1797. [Google Scholar] [CrossRef]
  8. Covington, P.; Adams, J.; Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the RecSys 2016—Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 191–198. [Google Scholar] [CrossRef]
  9. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  10. Vartak, M.; Thiagarajan, A.; Miranda, C.; Bratman, J.; Larochelle, H. A Meta-Learning Perspective on Cold-Start Recommendations for Items. In Advances in Neural Information Processing Systems (NeurIPS 2017); NeurIPS: Long Beach, CA, USA, 2017; pp. 785–793. [Google Scholar] [CrossRef]
  11. Malkov, Y.A.; Yashunin, D.A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 824–836. [Google Scholar] [CrossRef] [PubMed]
  12. Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; Li, H. Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; ACM: New York, NY, USA, 2007; pp. 129–136. [Google Scholar] [CrossRef]
  13. Burges, C.J.C. From RankNet to LambdaRank to LambdaMART: An Overview; Microsoft Research Technical Report MSR-TR-2010-82; Microsoft Research: Redmond, WA, USA, 2010. [Google Scholar]
  14. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  15. Carbonell, J.; Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 335–336. [Google Scholar] [CrossRef]
  16. Santos, R.L.T.; Macdonald, C.; Ounis, I. Exploiting query reformulations for web search result diversification. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; ACM: New York, NY, USA, 2010; pp. 881–890. [Google Scholar] [CrossRef]
  17. Tintarev, N.; Masthoff, J. A Survey of Explanations in Recommender Systems. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop, Istanbul, Turkey, 17–20 April 2007; IEEE: New York, NY, USA, 2007; pp. 801–810. [Google Scholar] [CrossRef]
  18. Henaff, O.J.; Srinivas, A.; Fauw, J.D.; Razavi, A.; Doersch, C.; Eslami, S.M.A.; Oord, A.v.d. Data-Efficient Image Recognition with Contrastive Predictive Coding. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Online, 13–18 July 2020; pp. 4182–4192. [Google Scholar]
  19. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar] [CrossRef]
  20. Adomavicius, G.; Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 2005, 17, 734–749. [Google Scholar] [CrossRef]
  21. Ricci, F.; Rokach, L.; Shapira, B. Recommender Systems: Techniques, Applications, and Challenges. In Recommender Systems Handbook; Springer: New York, NY, USA, 2022; pp. 1–35. [Google Scholar] [CrossRef]
  22. Burke, R. Hybrid Recommender Systems: Survey and Experiments. User Model. User-Adapt. Interact. 2002, 12, 331–370. [Google Scholar] [CrossRef]
  23. van den Berg, R.; Kipf, T.N.; Welling, M. Graph Convolutional Matrix Completion. arXiv 2017, arXiv:1706.02263. [Google Scholar] [CrossRef]
  24. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3319–3328. [Google Scholar]
  25. Fisher, A.; Rudin, C.; Dominici, F. All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. J. Mach. Learn. Res. 2019, 20, 1–81. [Google Scholar]
  26. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618. [Google Scholar]
  27. Lops, P.; de Gemmis, M.; Semeraro, G. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook; Springer: Boston, MA, USA, 2011. [Google Scholar]
  28. Zhang, Y.; Chen, X. Explainable Recommendation: A Survey and New Perspectives. Found. Trends® Inf. Retr. 2020, 14, 1–101. [Google Scholar] [CrossRef]
  29. Amatriain, X.; Basilico, J. Recommender Systems in Industry: A Netflix Case Study. In Recommender Systems Handbook; Springer: Boston, MA, USA, 2015; pp. 385–419. [Google Scholar]
  30. Mathur, A.; Narayanan, A.; Chetty, M. Endorsements on social media: An empirical study of affiliate marketing disclosures on YouTube and pinterest. Proc. ACM Hum. Comput. Interact. 2018, 2, 119. [Google Scholar] [CrossRef]
  31. Yi, X.; Yang, J.; Hong, L.; Cheng, D.Z.; Heldt, L.; Kumthekar, A.; Zhao, Z.; Wei, L.; Chi, E. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; ACM: New York, NY, USA, 2019; pp. 269–277. [Google Scholar] [CrossRef]
  32. Santos, R.L.T.; Peng, J.; Macdonald, C.; Ounis, I. Explicit Search Result Diversification through Sub-queries. In Proceedings of the 32nd European Conference on IR Research, ECIR 2010, Milton Keynes, UK, 28–31 March 2010; pp. 87–99. [Google Scholar] [CrossRef]
  33. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. 2019, 52, 5. [Google Scholar] [CrossRef]
  34. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6558–6569. [Google Scholar] [CrossRef]
Figure 1. Distribution of affiliate categories.
Figure 2. Affiliates’ languages and locations.
Figure 3. Distribution of sellers’ categories and locations.
Figure 4. Sellers’ company tiers.
Figure 5. Distribution of commission rates.
Figure 6. Distribution of cookie periods.
Figure 7. Performance scores: Affiliates and sellers.
Figure 8. Correlation heatmap.
Figure 9. Three-part histogram showing the distribution of revenue, engagement rate, and composite utility.
Figure 10. Success probability based on signal strength.
Figure 11. t-SNE plot of text embeddings. Each blue point represents one text embedding.
Figure 12. Line graph showing the change in utility score across 24 h.
Figure 13. Process flow.
Figure 14. Data flow.
Figure 15. Model training (InfoNCE loss). Dashed green: LR schedule; dashed red vertical: LR decay start.
Table 1. Retrieval performance on the validation set.

| Direction | K  | Recall@K | Precision@K | HitRate@K |
|-----------|----|----------|-------------|-----------|
| A→S       | 10 | 0.0037   | 0.0004      | 0.0037    |
| A→S       | 50 | 0.0361   | 0.0008      | 0.0376    |
| S→A       | 10 | 0.0116   | 0.0017      | 0.0168    |
| S→A       | 50 | 0.0644   | 0.0017      | 0.0842    |
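For reference, the per-user quantities reported in Table 1 follow the standard definitions below (variable names are illustrative); the table values are these metrics averaged over all users in the given direction.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of a user's relevant items that appear in the top k."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

def hit_rate_at_k(recommended, relevant, k):
    """1 if at least one relevant item appears in the top k, else 0."""
    return 1.0 if set(recommended[:k]) & set(relevant) else 0.0
```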
Table 2. Ablation study results showing the impact of removing individual model components.

| Ablation Setup       | Δ NDCG | Δ Recall | Observation                                  |
|----------------------|--------|----------|----------------------------------------------|
| Text Embeddings      | −0.412 | −0.380   | Model loses its understanding of semantics   |
| Pair Features        | −0.263 | −0.201   | Less personalized matching                   |
| Structured (numeric) | −0.318 | −0.289   | Missing CTR and EPC weakens precision        |
| Only-Text            | −0.524 | −0.488   | Semantics alone insufficient                 |
| Only-Pair            | −0.469 | −0.422   | Context lost                                 |
| Full Model           | 0.9956 | 0.9956   | Best overall (absolute NDCG/Recall)          |
Table 3. Latency and scalability results across retrieval and ranking stages.

| Stage           | p50 (ms) | p95 (ms) |
|-----------------|----------|----------|
| Retrieval A→S   | 1.28     | 1.65     |
| Retrieval S→A   | 0.86     | 1.12     |
| Ranker (Top-K)  | 75.09    | 128.55   |
| End-to-End      | 76.36    | 130.20   |
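The p50/p95 figures in Table 3 are standard latency percentiles over per-request timings. One common nearest-rank formulation is sketched below; the sample values are made up, and this is not necessarily the exact estimator used in the evaluation.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q% of all samples are less than or equal to it."""
    xs = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(xs)))  # 1-indexed rank
    return xs[rank - 1]

latencies_ms = [76.0, 78.5, 75.2, 130.1, 80.0]
p50 = percentile(latencies_ms, 50)  # -> 78.5
p95 = percentile(latencies_ms, 95)  # -> 130.1
```

Reporting p95 alongside p50 matters for serving systems because tail latency, not the median, determines the worst user-perceived response times.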
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Almutairi, H.; Ykhlef, M. Explainable Reciprocal Recommender System for Affiliate–Seller Matching: A Two-Stage Deep Learning Approach. Information 2026, 17, 101. https://doi.org/10.3390/info17010101
