Article

Designing Trustworthy Recommender Systems: A Glass-Box, Interpretable, and Auditable Approach

Department of Computer Science, University of York, Deramore Lane, Heslington, York YO10 5GH, UK
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4890; https://doi.org/10.3390/electronics14244890
Submission received: 30 October 2025 / Revised: 8 December 2025 / Accepted: 9 December 2025 / Published: 12 December 2025
(This article belongs to the Special Issue Deep Learning Approaches for Natural Language Processing)

Abstract

Recommender systems are widely deployed across digital platforms, yet their opacity raises concerns about auditability, fairness, and user trust. To address the gap between predictive accuracy and model interpretability, this study proposes a glass-box architecture for trustworthy recommendation that reconciles predictive performance with interpretability. The framework integrates interpretable tree ensemble models (Random Forest, XGBoost) with an NLP sub-model for tag sentiment, prioritising transparency from feature engineering through to explanation. Additionally, a Reality Check mechanism enforces strict temporal separation and removes already-popular items, compelling the model to forecast latent growth signals rather than mimic popularity thresholds. Evaluated on the MovieLens dataset, the glass-box architectures demonstrated superior discrimination capabilities, with the Random Forest and XGBoost models achieving ROC-AUC scores of 0.92 and 0.91, respectively. These tree ensembles notably outperformed standard Logistic Regression (0.89) and the neural baseline (an MLP, at 0.86). Beyond accuracy, the design implements governance through a multi-layered Governance Stack: (i) attribution and traceability via exact TreeSHAP values, (ii) stability verification using ICE plots and sensitivity analysis across policy configurations, and (iii) fairness audits detecting genre and temporal bias. Dynamic threshold optimisation further improves recall for emerging items under severe class imbalance. Cross-domain validation on the Amazon Electronics dataset confirmed architectural generalisability (AUC = 0.89), demonstrating robustness in sparse, high-friction environments. These findings challenge the perceived trade-off between accuracy and interpretability, offering a practical blueprint for Safe-by-Design recommender systems that embed fairness, accountability, and auditability as intrinsic properties rather than post hoc add-ons.

1. Introduction

Recommender systems (RS) have become integral to modern digital platforms, guiding user discovery across e-commerce, media streaming, and social networks [1,2,3]. The evolution of these systems has been marked by a drive for greater predictive accuracy, progressing from traditional techniques to complex deep learning architectures like Neural Collaborative Filtering (NCF) and attention-based networks [1,3,4]. These advanced models excel at capturing intricate user-item interactions to improve recommendation relevance [3]. However, this enhanced predictive power frequently comes at the cost of transparency, as their internal decision-making processes often operate as opaque “black boxes,” inaccessible to users, developers, and auditors alike [5,6]. This opacity introduces significant concerns regarding trust, accountability, and fairness, particularly as recommendations increasingly influence critical business outcomes and user behaviours [7,8].
The need for transparency is especially acute in high-stakes domains where biased or erroneous recommendations can have severe consequences [9,10]. In e-commerce, for instance, opaque models could perpetuate systemic biases, unfairly limit the visibility of certain products, or promote harmful items [7,8]. In finance, an unauditable system recommending investment products could expose users to unsuitable risks or violate regulatory requirements for fairness [9,10,11]. Similarly, a black-box system suggesting medical content in healthcare could provide misleading advice without a clear, auditable rationale [5,12]. Explainable Artificial Intelligence (XAI) methods like LIME [13] and SHAP [14,15] provide post hoc feature attributions for model predictions. However, research indicates these surrogate-based explanations can be unstable and potentially unfaithful to the true model logic, limiting their reliability in policy-sensitive contexts [16,17,18]. This limitation has motivated a shift toward inherently interpretable, or “glass-box,” models, where transparency is an intrinsic property of the model’s architecture, rather than a post hoc reconstruction [5,15].
The motivation behind this work is to develop a recommender system framework that is inherently transparent and auditable by design, without sacrificing the high predictive performance expected of modern systems [5]. To achieve this, the proposed approach combines three core components: interpretable Machine Learning (ML) tree ensemble models (e.g., a Random Forest (RF) classifier or XGBoost) [19], human-readable temporal engineered features, and an NLP sub-model that learns tag sentiment from user quality signals [20,21,22]. Beyond technical interpretability, the framework aims to align with established paradigms of Trustworthy AI and algorithmic accountability. It does so by employing policy-aligned objectives, providing exact, audit-ready explanations, incorporating fairness-aware training, and conducting sensitivity analysis. By embedding governance constraints directly into the forecasting objective rather than applying them as a post hoc element, this methodology aligns with the principles of ‘Safe-by-Design’ AI, ensuring that safety is an intrinsic property of the system architecture [23].
The reason for adding this multi-layered governance audit workflow is that although tree-based models such as Random Forests and gradient-boosted ensembles are often considered more interpretable than deep neural networks [24], their interpretability is nuanced. Individual decision trees are inherently transparent because their structure—comprising hierarchical splits on human-understandable features—can be visualised and traced from root to leaf, enabling a clear explanation of a single prediction [19]. However, when hundreds or thousands of trees are aggregated into an ensemble, the global logic becomes substantially more complex, making full comprehension by a human auditor challenging [19,24]. In a strict regulatory or legal sense, such as compliance with the “right to explanation” under the EU GDPR [9,10] or similar frameworks, Random Forests may not qualify as fully interpretable because their collective decision process cannot be easily summarised without approximation [5,11]. Nevertheless, these models remain more amenable to human inspection than deep learning architectures because each constituent tree is human-readable, and global interpretability can be supported through structured visualisation and additive explanation techniques such as TreeSHAP [15]. Therefore, while tree ensembles are not perfectly transparent in high-stakes contexts, they represent a pragmatic compromise between predictive performance and interpretability, particularly when combined with rigorous documentation and explanation protocols [5,15,24].
This research is evaluated on the MovieLens dataset [25] and validated via a cross-domain pilot on the Amazon Reviews dataset [26]. Unlike conventional rule-mimicry approaches, the design enforces a rigorous temporal separation between observed features and future items [5,27]. This ensures the model forecasts the success of “emerging items”—candidates with low current visibility—rather than simply replicating current popularity thresholds [6]. The model’s performance is benchmarked by evaluating tree ensemble models, Random Forest (RF) and Gradient Boosting (XGBoost), against a black-box Multi-Layer Perceptron (MLP) [3,4] and Logistic Regression [28]. The results demonstrate superior discrimination capabilities for the glass-box architectures, with the Random Forest and XGBoost models achieving ROC-AUC scores of 0.92 and 0.91, respectively. These models notably outperformed the neural baseline (AUC 0.86) and Logistic Regression (AUC 0.89). A layer of explainability is delivered through exact attribution methods, primarily TreeSHAP [15], complemented by partial dependence plots and decision-path visualisations, thereby enabling consistent local and global explanations for a robust audit trail [13,16].
We acknowledge that in real-world governance scenarios, algorithmic transparency must be complemented by data integrity mechanisms. To test the predictive transferability of the framework, we conducted a training pilot in a high-friction e-commerce environment using Amazon product data [26]. While the full multi-layer audit was reserved for the primary case study, this pilot demonstrated that the underlying learning mechanism remains robust even in sparse, review-based domains. As the reliability of policy-aligned features (e.g., rating density, sentiment) depends on authentic user feedback, this glass-box framework is designed to operate downstream of robust opinion spam and fraud detection systems [29,30]. Explainability is delivered through exact attribution methods (TreeSHAP) [15] and local surrogates (LIME) to create a verifiable audit trail [13,16]. Additionally, automated fairness audits and technical governance reports are integrated directly into the pipeline, aligning the system with the documentation requirements of algorithmic accountability [31].
Despite significant advancements, modern recommender systems, particularly those based on deep learning, have become increasingly complex and opaque, making their internal logic inaccessible and difficult to audit [3,4]. In response, post hoc explanation techniques such as LIME and SHAP have been widely adopted to provide insights into these black-box models [13,14,15]. However, a growing body of evidence indicates that such methods can produce unstable or unfaithful explanations that do not accurately reflect the model’s true decision-making process [16,17,18]. This limitation renders them inadequate for high-stakes, regulation-sensitive domains like finance, healthcare, legal services, and critical infrastructure, where accountability and verifiable compliance are paramount [9,10,12].
Employing inherently interpretable “glass-box” models has historically been hindered by the perception of a mandatory trade-off between interpretability and predictive accuracy [5]. Stakeholders have often been forced to choose between high-performing but opaque models and transparent but less accurate ones. This dilemma has left a critical research gap: there is a lack of end-to-end frameworks for building recommender systems that are transparent by design [32], encode governance policies directly into their learning objective, and can provably match the performance of their black-box counterparts (see Section 4.1). Consequently, there is a clear need for further research into the accuracy–interpretability trade-off, demonstrating that auditability and high performance can be co-designed as foundational principles rather than treated as competing objectives [32].
Unlike post hoc explanation strategies applied to opaque models [32], this work integrates policy-aligned objectives directly into the recommendation task, ensuring governance constraints persist from feature engineering through to explanation [5]. While this approach shares conceptual similarities with scorecard methods and rule-distillation, its contribution lies in the end-to-end design of an auditable pipeline that combines interpretable tree-based ensembles with exact attribution techniques such as TreeSHAP [15], together with stability and bias audits. It is important to clarify that tree ensembles, including Random Forests and gradient-boosted models, are not fully interpretable in a strict regulatory sense, as their aggregated decision logic cannot be easily summarised for compliance with frameworks like the EU GDPR “right to explanation” [9,11]. However, they remain more transparent than deep neural networks because each constituent tree is human-readable: its hierarchical structure can be visualised and traced from root to leaf, enabling case-level reasoning [19,24]. Global interpretability can be further supported through structured visualisation and additive explanation methods [15], making these models a pragmatic compromise between predictive performance and auditability in high-stakes domains [5,24]. The study provides the following significant contributions:
  • Design of a Policy-Forecasting Framework: A glass-box pipeline that predicts future compliance with governance rules. We introduce a “Reality Check” to ensure the model learns from latent content and sentiment signals rather than simple threshold imitation [5,32]. This mechanism also mitigates cold-start challenges by leveraging engineered features and sentiment cues, enabling robust predictions for emerging items with limited historical data [33].
  • Demonstration of superior performance: Empirical evidence comparing the proposed tree ensemble approaches—Random Forest and XGBoost—against Logistic Regression and Neural Networks (MLP). Results show the glass-box models achieve AUC scores of 0.92 and 0.91, respectively, effectively outperforming the black-box baseline (AUC 0.86) without the risk of overfitting observed in the neural network design [34,35].
  • A governance stack for auditing: Going beyond TreeSHAP explanations, the design aims to implement a multi-layered audit workflow including Fairness Audits [8] (measuring disparate impact [7] across genres), sensitivity analysis, and automated governance reporting templates to operationalise algorithmic accountability [16,19].
  • Integration of an Interpretable NLP Feature: An NLP sub-model that learns tag sentiment using a supervisory signal extracted from user-behavioural data (ratings) [20], providing a transparent, domain-specific feature that enhances both performance and interpretability [21].
The remainder of this paper is structured as follows: Section 2 provides a review of relevant literature on explainable and trustworthy recommender systems. Section 3 outlines the methodology, including temporal data partitioning, feature engineering, and model training with governance audits. Section 4 presents comparative performance results, sensitivity analysis, and interpretability evaluation using SHAP and LIME, along with cross-domain validation. Section 5 includes a discussion of the findings, methodological contributions, and limitations. Finally, Section 6 concludes the study by summarising its impact and proposing directions for future research.

2. Related Work

The literature on Explainable AI (XAI) for RS reveals a fragmented landscape that lacks a consensus on what constitutes interpretability or how it should be evaluated. Early analyses outlined core explanation challenges—distinguishing model explanation from outcome explanation and retrofitted inspection from inherently transparent approaches—while highlighting the absence of standardised evaluation protocols [36]. This lack of rigorous standards complicates method comparison and hinders deployment in compliance-sensitive environments [37]. These foundational gaps underscore a persistent need for approaches that are inherently interpretable and auditable against clear governance criteria. More recent work frames this challenge within the broader context of AI alignment, arguing that the societal scale of RS necessitates design-time controls for diversity, user agency, and auditability to align system objectives with democratic values [38].
Early RS architectures achieved strong accuracy with matrix factorisation (MF), but the latent factors that drive predictions provide little semantic insight for users or auditors, which limits their suitability in governance-sensitive deployments [1]. Subsequent deep learning approaches—such as NCF and attention-based architectures—extended predictive capacity and top-K ranking quality by modelling non-linear user–item interactions and multimodal context; however, the internal reasoning of these models typically remains opaque [4,39]. The resulting lack of transparency has direct implications for trust and accountability when recommendations influence behaviour and organisational decisions [6].
Recent surveys have consolidated the role of deep learning in recommender systems, highlighting both its transformative potential and persistent limitations. Zhou et al. [4] provide a comprehensive taxonomy of deep learning-based RS, covering content-based, sequential, cross-domain, and social recommendation paradigms. Their review underscores the superior predictive performance of neural architectures such as MLPs, CNNs, RNNs, and attention-based models, while acknowledging critical challenges in interpretability, fairness, and governance readiness. Importantly, the authors call for future research on explainability and compliance mechanisms, reinforcing the need for transparent-by-design alternatives in high-stakes domains. This aligns with our motivation to propose a glass-box framework that addresses these gaps through policy-aligned label engineering and exact attribution methods.
Post hoc explainability methods were introduced to mitigate opacity. Techniques like LIME and SHAP offer local feature attributions that help rationalise individual predictions and have seen widespread use in RS pipelines [13,40]. Nevertheless, empirical evidence shows that surrogate explanations and saliency-style rationales can be unstable and, at times, unfaithful to the true model logic, properties that are problematic in high-stakes and policy-sensitive settings [17,18]. This tension has motivated a turn toward inherently interpretable or “glass-box” models, which emphasises aligning the modelling structure with human reasoning so that explanation is intrinsic rather than retrofitted [5].
Within this glass-box trajectory, tree-based learners paired with TreeSHAP have become a pragmatic option. TreeSHAP provides theoretically grounded, exact Shapley values for tree ensembles, enabling consistent local and global explanations while preserving competitive predictive performance [15]. Still, its underlying assumptions, such as feature independence in its interventional formulation, must be surfaced in audits, as real-world RS features can be correlated [16,41]. Concurrently, research into other intrinsically transparent methods has advanced. Recent models have augmented collaborative filtering with genre-aware weighting and information entropy to improve accuracy under sparsity while retaining simple, explainable computations [42]. Graph-based recommenders have been enhanced to compress side information into probabilistic latent classes that can be re-expanded for human-readable path justifications, achieving substantial training-time savings without sacrificing interpretability [39].
Text-centric designs provide additional intrinsic explanation modalities. An attention-inspired, language-only recommender has been shown to yield signed, word-level contribution scores that directly support “why-this” narratives across multilingual datasets [43]. Hybrid architectures that jointly predict ratings and generate natural-language reasons likewise demonstrate gains in both predictive error and explanation quality by aligning a contrastive graph encoder with a Transformer decoder [44,45,46,47]. While such systems enrich the expressiveness of explanations, they also underscore the importance of faithfulness checks to ensure that generated rationales reflect the actual scoring evidence [18].
Beyond algorithmic transparency, the field is increasingly focused on aligning RS with broader objectives. A systematic review on value-aware RS classifies methods into post-processing re-ranking, in-objective value formulations, and reinforcement learning for long-term goals like profit or engagement, documenting tensions between offline accuracy proxies and online business impact [48]. Similarly, recent surveys on fairness synthesise multi-stakeholder concerns and report persistent gaps between accuracy-centric development and equitable outcomes [7,8]. From an operational perspective, class imbalance is common in RS classification tasks; while remedies like SMOTE remain effective [49], they are rarely integrated systematically within explainability frameworks, limiting the assessment of trade-offs [50]. Finally, requirements-engineering research provides structured models to translate high-level ethics guidance into actionable, stakeholder-specific explanation needs (who requires what, when, and by whom), bridging principles and testable system behaviours [32].

2.1. Auditability, Ethics, and Algorithmic Accountability

Recent scholarship emphasises that algorithmic accountability extends beyond technical interpretability to include mechanisms for auditing, oversight, and redress. Foundational work by Diakopoulos [51] and Kroll et al. [52] frames accountability as a socio-technical construct requiring transparency and contestability. Auditing practices for AI systems have been explored by Raji et al. [53] and Madaio et al. [54], who advocate for structured evaluation protocols and documentation artefacts such as Model Cards [31] and Datasheets for Datasets [55]. Risk management frameworks, including NIST AI RMF 1.0 [56] and ISO/IEC 23894 [57], provide operational guidance for embedding governance into system design. Existing scholarship highlights the need for integrating policy-aligned objectives with exact attribution methods to enable audit-ready decision trails in compliance-sensitive environments [5]. While this study addresses foundational aspects of transparency and fairness, future research will focus on operationalising these mechanisms within a fully modular glass-box framework. To support this trajectory, a functional template for a Model Governance Card has been developed as part of this work, designed to document policy thresholds, performance metrics, fairness audits, and explainability artefacts. This template can be instantiated on any appropriate dataset to facilitate reproducibility and regulatory alignment [31].

2.2. Data Integrity and Opinion Spam

A critical dimension of trustworthiness in recommender systems, particularly in governance-sensitive scenarios, is the mitigation of opinion spam and review manipulation. As noted by Jindal and Liu [29], fake reviews and opinion spam distort the rating distributions and sentiment signals that policy-aligned models rely upon. While technical XAI focuses on verifying model logic, governance requires verifying the integrity of the input data itself. Mukherjee et al. [30] demonstrate that interpretability is often essential for exposing anomalous reviewer behaviour patterns. Although this study focuses on model transparency, we acknowledge that in real-world deployment, glass-box architectures must be coupled with robust spam detection to prevent “dependence on data quality” vulnerabilities in the governance layer [41,55].
In summary, the literature spans accuracy-driven baselines such as matrix factorisation and deep learning-based recommender systems [1,4], influential post hoc explanation techniques with recognised limitations [17,18], and promising glass-box approaches leveraging TreeSHAP for exact attribution [15,16], alongside growing research on intrinsic transparency, fairness, and value alignment. However, a clear gap persists for audit-ready, end-to-end frameworks that embed governance constraints from the outset. This work aims to address that gap by using a glass-box architecture that integrates policy constraints in the learning objective, employs tree-based ML models compatible with exact attribution methods, and incorporates transparent strategies for handling class imbalance.
A comparative summary of related approaches and the proposed framework is presented in Table 1.

3. Methodology

To implement a trustworthy and auditable RS, a systematic methodology was designed, combining data collection, temporal feature engineering, governance-aligned label forecasting, comparative model training, and a multi-layered audit workflow. The proposed approach is architected to be safe and interpretable by design, ensuring that each component contributes to a final output that is both accurate and auditable [5,23]. The experiments were conducted on a benchmark dataset to evaluate the effectiveness of this approach, employing tree ensemble ML models (i.e., Random Forest and XGBoost) against a black-box Multi-Layer Perceptron (MLP) and a Logistic Regression baseline. The focus is specifically on predicting emerging hits—identifying items that will become high-quality—while ensuring full transparency [5,6]. The complete workflow of our proposed Glass-Box approach for RS is illustrated in Figure 1, which details the sequence from data ingestion and feature engineering to the “Reality Check”, comparative modelling, and the final fairness auditing layers.

3.1. Dataset Description and Temporal Partitioning

The primary experimental framework employs the MovieLens dataset, a standard benchmark in Recommender System (RS) research [25]. To construct a multi-modal feature space, we utilised three core files: movies.csv (item catalogue), ratings.csv (explicit feedback), and tags.csv (user-generated semantic data). A total of 87,585 movies were prepared for model training. Preliminary Exploratory Data Analysis (EDA) of the rating distributions and genre prevalence informed the specific feature engineering choices. As summarised in Table 2, these heterogeneous sources provide the necessary metadata, engagement signals, and sentiment indicators required for the proposed architecture.
To handle the heterogeneous data formats, such as those illustrated in the sample tables (Table 3, Table 4 and Table 5), the disparate data sources were pre-processed and merged [25]. Information from ratings.csv (Table 5) and tags.csv (Table 4) was first aggregated at the individual movie level before being joined with the primary movies.csv (Table 3) catalogue on the movieId key. A key preprocessing step involved handling the multi-label genre format, where a single movie can be associated with several genres listed in a pipe-separated string. These were parsed into a list of individual genres for each movie to avoid any duplication. Subsequently, scikit-learn’s MultiLabelBinarizer was employed to transform these lists into a multi-hot encoded vector, creating a distinct binary feature for each unique genre in the dataset.
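For illustration, the genre preprocessing described above can be sketched in a few lines of scikit-learn. File and column names follow the MovieLens schema; variable names are illustrative and not part of a released codebase:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Minimal sketch of the genre preprocessing step described above.
movies = pd.read_csv("movies.csv")  # columns: movieId, title, genres

# Parse the pipe-separated genre string into a list of genres per movie.
movies["genre_list"] = movies["genres"].str.split("|")

# Multi-hot encode: one binary feature per unique genre in the catalogue.
mlb = MultiLabelBinarizer()
genre_features = pd.DataFrame(
    mlb.fit_transform(movies["genre_list"]),
    columns=mlb.classes_,
    index=movies["movieId"],
)
```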
To ensure the model forecasts genuine growth rather than imitating existing popularity, we restructured the data partitioning strategy. We rejected standard random shuffling in favour of a Temporal Abstraction Protocol [27,58]. A temporal abstraction point, $\tau_{\text{cutoff}}$, was established at the 75th percentile of the timestamp distribution. This imposes a strict forecasting logic: the model observes features only from the past, while the target labels are derived exclusively from future rating accumulations.
Furthermore, we applied an “Emerging Item” (The Reality Check) to isolate the “cold start” [33] problem. Items that had already achieved popularity (Count > 60) at the time of observation were removed. This rigorous governance check reduces the dataset to “Emerging Candidates”—items with low visibility but high potential. Consequently, this created a significant class imbalance, with positive “future hits” constituting only 3.51% of the data. To mitigate this bias, the Synthetic Minority Over-sampling Technique (SMOTE) was integrated into the training pipeline [49]. The final analysed dataset was partitioned into 75% training and 25% testing sets using stratified sampling to preserve these class proportions.
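A minimal sketch of this temporal split, assuming the MovieLens ratings file with its Unix timestamp column, is:

```python
import pandas as pd

# Sketch of the Temporal Abstraction Protocol described above.
ratings = pd.read_csv("ratings.csv")  # columns: userId, movieId, rating, timestamp

# Cutoff at the 75th percentile of the timestamp distribution.
tau_cutoff = ratings["timestamp"].quantile(0.75)

# Strict separation: features see only the past; labels only the future.
observed = ratings[ratings["timestamp"] <= tau_cutoff]
future = ratings[ratings["timestamp"] > tau_cutoff]
```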
To validate the architectural generalisability of the proposed framework beyond the current domain, a secondary pilot trial was conducted using the Amazon Reviews 2023 dataset [26], particularly the ‘Electronics’ category. The identical methodological architecture—spanning temporal signal decoupling, feature engineering, and the governance audit workflow—was adopted for this test dataset to ensure rigorous comparability. As detailed in Table 6, this dataset represents a high-velocity e-commerce environment. The dataset comprises review text (mapped to the ‘tags’ input) and star ratings. To account for the higher interaction friction required to write a product review compared to rating a movie, the ‘Reality Check’ threshold ($\tau_{\text{count}}$) is adjusted from 60 to 15. This alignment ensures that the model continues to identify items in the ‘emerging’ phase relative to the specific interaction density of the e-commerce domain.

3.2. Feature Engineering and Policy Forecasting

A central principle of the glass-box methodology is the creation of interpretable features aligned with governance criteria [5].

3.2.1. Interpretable NLP for Tag Sentiment

Given the domain-specific context of user-generated tags, where sentiment can be correlated with genre (e.g., the tag ‘slow’ may be neutral for a drama but negative for an action film), generic pre-trained sentiment analysis libraries were deemed inappropriate for this task [21]. Therefore, a custom NLP sub-model that learns tag sentiment directly from the explicit user ratings was deployed [22].
Through interpretable supervision, a tag was assigned a negative sentiment if the associated user rating was less than or equal to 2.0. This threshold reflects a clear user dissatisfaction signal and is stricter than a neutral midpoint, reducing ambiguity in supervision.
A Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer transforms raw tags into numerical features, considering only terms that appeared in at least three documents (min_df = 3) and excluding common English stop words. A linear Stochastic Gradient Descent (SGD) classifier with a log-loss function (functionally equivalent to Logistic Regression) was trained on these TF-IDF vectors to predict the probability that a tag expresses negative sentiment. This choice of a linear model ensures that the sentiment feature remains interpretable and computationally efficient [20].
The probability of a tag $t$ being negative is given by Equation (1):

$$p_{\text{neg}}(t) = \sigma\left(\mathbf{w}^{\top}\mathbf{v}(t)\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \tag{1}$$

where $\mathbf{w}$ is the learned weight vector of the SGD classifier and $\mathbf{v}(t)$ is the TF-IDF vector for the tag $t$. For each movie $i$ with a set of tags $T_i$, the movie-level sentiment feature is defined as in Equation (2):

$$s_i = \max_{t \in T_i} p_{\text{neg}}(t), \qquad \text{where } has\_tags_i = \mathbb{1}\left[T_i \neq \emptyset\right] \tag{2}$$

where $s_i$ is stored as the observed maximum negative tag score in the feature set. Missing values of $s_i$ (for movies without tags) were imputed to the median of observed scores to maintain interpretability.
The complete procedure for generating the tag-sentiment feature is summarised in Algorithm 1, which details the steps from interpretable supervision to movie-level aggregation.
Algorithm 1. Interpretable Tag Sentiment Score Generation
  • Require: tags_df, ratings_df, TAG_RATING_THRESHOLD = 2.0
  • Ensure: obs_max_neg_tag for each movie
  1. Merge tags_df and ratings_df on userId and movieId.
  2. Create binary label is_negative_sentiment where rating ≤ TAG_RATING_THRESHOLD.
  3. Initialise vectorizer: TfidfVectorizer(min_df = 3, stop_words = ‘english’).
  4. Fit and transform tags: X_tags_tfidf ← vectorizer.fit_transform(tags).
  5. Initialise sentiment_model ← SGDClassifier(loss = ‘log_loss’).
  6. Train model: sentiment_model.fit(X_tags_tfidf, is_negative_sentiment).
  7. For each unique movie i:
     a. Collect all tags T_i for movie i.
     b. Transform tags: v_tags ← vectorizer.transform(T_i).
     c. Predict probabilities: p_neg ← sentiment_model.predict_proba(v_tags)[:, 1].
     d. Compute score: s_i ← max(p_neg).
  8. Return all scores s_i.
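A compact Python sketch of Algorithm 1 is given below. It assumes tags_df and ratings_df have been loaded with the column names used in Algorithm 1, and is an illustrative reference rather than the released implementation:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

TAG_RATING_THRESHOLD = 2.0

# Steps 1-2: supervisory signal from explicit ratings.
labelled = tags_df.merge(ratings_df, on=["userId", "movieId"])
labelled["is_negative"] = (labelled["rating"] <= TAG_RATING_THRESHOLD).astype(int)

# Steps 3-6: TF-IDF features and a log-loss SGD (logistic) classifier.
vectorizer = TfidfVectorizer(min_df=3, stop_words="english")
X_tags_tfidf = vectorizer.fit_transform(labelled["tag"].astype(str))
sentiment_model = SGDClassifier(loss="log_loss")
sentiment_model.fit(X_tags_tfidf, labelled["is_negative"])

# Steps 7-8: movie-level score s_i = max over tags of P(negative).
def movie_sentiment_score(movie_tags):
    v = vectorizer.transform(movie_tags)
    return sentiment_model.predict_proba(v)[:, 1].max()

obs_max_neg_tag = (
    labelled.groupby("movieId")["tag"]
    .apply(lambda t: movie_sentiment_score(t.astype(str).tolist()))
)
```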

3.2.2. Policy-Forecasting Label and the “Reality Check”

A binary target label $Y_{i,\text{future}}$ is defined based on the item’s state in the unobserved future window. Per Equation (3), movie $i$ is labelled as a “Future Hit” ($Y_{i,\text{future}} = 1$) only if it meets all of the following criteria in the period following the temporal abstraction:

$$Y_{i,\text{future}} = \mathbb{1}\left[\left(C_{i,\text{final}} \geq \tau_{\text{count}}\right) \wedge \left(R_{i,\text{final}} \geq \tau_{\text{avg}}\right) \wedge \left(S_{i,\text{obs}} < \tau_{\text{neg}}\right)\right] \tag{3}$$

where $C_{i,\text{final}}$ is the total future rating count, $R_{i,\text{final}}$ is the future average rating, and $S_{i,\text{obs}}$ is the observed negative sentiment score defined in Equation (2). The policy thresholds are set as $\tau_{\text{count}} = 60$, $\tau_{\text{avg}} = 3.0$, and $\tau_{\text{neg}} = 0.6$. The default popularity threshold $\tau_{\text{count}} = 60$ was selected based on empirical stability analysis (Section 4.2), which showed that items with fewer than 60 ratings exhibit high variance in average rating, making governance decisions unreliable. This aligns with standard statistical principles [28] and governance principles for risk mitigation [57,59]. Additionally, the thresholds serve as a safeguard against early shilling attempts [29,30]. To validate robustness and avoid overfitting to a single threshold, we later conduct a sensitivity analysis (Section 4.2) by varying $\tau_{\text{count}}$ across $\{50, 70, 90\}$. These values represent plausible governance regimes: 50 as a lenient threshold for early discovery, 70 as a stricter threshold for higher confidence, and 90 as an extreme case for conservative governance. This approach confirms that the model learns latent growth signals rather than memorising a fixed rule.
It is crucial to ensure the model learns latent preference signals rather than simply estimating a threshold; we therefore apply a “Reality Check” (Emerging Item filter) [33]. Movies that have already met the popularity threshold at observation time are explicitly removed from the training and test sets:

$$\text{Check: keep movie } i \iff C_{i,\text{observed}} < \tau_{\text{count}} \tag{4}$$
As demonstrated in Equation (4), this removes already-popular items, forcing the model to predict growth for emerging candidates. This rigorous filtration reduces the dataset exclusively to ‘Emerging Candidates,’ thereby fundamentally altering the learning objective. Rather than trivially approximating a threshold function based on concurrent features, the model is compelled to forecast the trajectory of item growth from early signals. By decoupling the observation window from the target window, the system does not merely mimic the governance rule but learns the underlying probabilistic drivers—such as early velocity and sentiment polarity—that precede the transition from obscurity to compliance [5,9,10,33].
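As an illustration, the label derivation and Reality Check reduce to two vectorised operations. The frame df and its column names (observed_count, final_count, final_avg, obs_max_neg_tag) are hypothetical stand-ins for the per-movie aggregates described above:

```python
# Sketch of the policy label (Equation (3)) and the Reality Check
# (Equation (4)); `df` and its column names are illustrative.
TAU_COUNT, TAU_AVG, TAU_NEG = 60, 3.0, 0.6

# Future-window aggregates define the governance target.
df["y_future"] = (
    (df["final_count"] >= TAU_COUNT)
    & (df["final_avg"] >= TAU_AVG)
    & (df["obs_max_neg_tag"] < TAU_NEG)
).astype(int)

# Reality Check: keep only items still obscure at observation time.
emerging = df[df["observed_count"] < TAU_COUNT].copy()
```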

3.2.3. Temporal Feature Engineering and Signal Decoupling

To support the forecasting of emerging hits, a multidimensional feature space was constructed to capture early-warning signals of future success across four primary dimensions: momentum, engagement, content, and temporal context [58,60]. A critical design principle in this process was the strict temporal decoupling of feature calculation from target generation. All input features were derived exclusively from the observation window ($t \leq \tau_{\text{cutoff}}$), whereas the governance labels were computed solely from future interactions ($t > \tau_{\text{cutoff}}$). This separation ensures that the model is tasked with forecasting the latent trajectory of item growth rather than merely classifying its current state. By explicitly removing items that had already met the popularity threshold during the observation period (as detailed in the “Reality Check” protocol) [33], the predictive task is rigorously framed as identifying the non-trivial transition of obscure items into the popular class, thereby preventing the model from trivially learning the governance thresholds from static input data [5,32,33].
The momentum and engagement dimensions are quantified through statistical aggregations of user feedback. The feature observed density serves as a proxy for velocity, calculated as the ratio of the log-transformed observed rating count to the item’s age relative to the cutoff date ($\ln(1 + \text{observed count}) / (\text{age} + 1)$). This logarithmic normalisation is essential to decouple the magnitude of popularity from its rate of accrual. It allows the model to distinguish between “sleeping giants”—older items with accumulated but stagnant popularity—and “rising stars” that exhibit high interaction velocity despite a short tenure [32]. Engagement quality is captured through obs_avg (mean observed rating) and obs_std (standard deviation of ratings), which jointly measure the consensus and polarisation of user opinion. To ensure statistical validity, missing values for sparse items were imputed using global priors rather than zero-filling. This prevents the model from falsely interpreting high uncertainty (low data) as low quality or zero variance [5,28].
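A sketch of these momentum and engagement features, with illustrative column names on the emerging-candidate frame, might read:

```python
import numpy as np

# Sketch of the momentum/engagement features; column names are illustrative.
emerging["obs_density"] = np.log1p(emerging["observed_count"]) / (emerging["age"] + 1)

# Consensus (obs_avg) and polarisation (obs_std) of observed opinion;
# sparse items are imputed with a global prior (here, the median), not zero.
emerging["obs_std"] = emerging["obs_std"].fillna(emerging["obs_std"].median())
```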
In the content dimension, qualitative attributes are transformed into quantitative vectors to support machine learning interpretability. Movie genres, originally stored as multi-label strings, were processed using multi-hot encoding to generate binary presence vectors for each genre category. Simultaneously, the unstructured semantic information contained in user-generated tags was distilled into the obs_max_neg_tag score. As described in Section 3.2.1, this probabilistic score reflects the likelihood of an item carrying negative sentiment based on its tag corpus. By integrating this risk metric alongside traditional metadata, the model can identify high-quality candidates that may be disqualified solely due to safety or suitability concerns, aligning the feature space with the multifaceted nature of the governance policy [20,21].
Finally, temporal features provide the necessary context for interpreting interaction magnitudes. The age variable, calculated strictly against the specific temporal cutoff point to prevent look-ahead bias, allows the model to adjust its expectations of popularity relative to the item’s lifecycle stage. Figure 2 presents the Pearson correlation matrix for these temporal features. The analysis reveals that the logarithmic transformation successfully decoupled the correlation between raw counts and density to a negligible level ($r \leq 0.15$), confirming that density captures a unique velocity signal independent of cumulative volume. Furthermore, the low negative correlation between rating standard deviation and average rating ($r = -0.13$) confirms that variance acts as an independent measure of user polarisation rather than a proxy for data scarcity. This indicates that the feature engineering strategy successfully corrected for the natural bias often found in sparse datasets [57,59].

3.3. Model Architecture and Training

3.3.1. Classifier Selection and Baseline Comparison

The core predictive models in our glass-box framework are tree ensembles. The Random Forest (RF) classifier is an ensemble of decision trees that combines interpretability with strong predictive performance [19]. Tree-based models have been empirically shown to outperform deep learning on tabular data in some cases [24]. Each tree provides explicit, policy-based decision paths, making the model inherently interpretable. Furthermore, RF is fully compatible with TreeSHAP, which computes exact Shapley values for feature attribution without the approximation errors associated with model-agnostic explainers [14,15]. This property ensures that both global and local explanations remain faithful to the model’s internal logic, a critical requirement for governance and auditability. Similarly, XGBoost is included as a gradient-boosted decision tree model representing the current state of the art for tabular data. This comparison tests whether the interpretability of RF comes at a significant cost to predictive accuracy.
To rigorously validate the performance and necessity of this architecture, we benchmarked the glass-box approach against two distinct model families:
  • Logistic Regression (LR): A linear baseline established to test whether the relationships between governance features and success are simple and linear.
  • Multi-Layer Perceptron (MLP): A neural network acting as a representative black-box baseline. MLPs are widely used in modern recommender systems for their ability to capture complex, non-linear user–item interactions [3,4]. This allows us to evaluate the trade-off between transparency and accuracy, a central question in trustworthy AI design.

3.3.2. Data Preparation and Imbalance Handling

The dataset—specifically the abstracted set of “Emerging Candidates” (see Section 3.2.2)—was split into a 75% training set and a 25% testing set. Stratified sampling was used to preserve the class distribution, which is critical given the severe imbalance in the processed dataset, where “Future Hits” constitute a small minority [20].
To address this, Synthetic Minority Over-sampling Technique (SMOTE) was integrated directly into an imbalance pipeline [49]. Crucially, SMOTE is applied only to the training folds during cross-validation, ensuring that synthetic samples never leak into the validation sets. The pipeline also included steps for data preprocessing: numerical features were imputed using the median strategy and then standardised using StandardScaler to ensure convergence for the MLP and Logistic Regression baselines [20].
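A minimal sketch of this imbalance-aware pipeline, using the imbalanced-learn library with a Random Forest as the downstream classifier, is shown below. Because the sampler sits inside an imblearn Pipeline, resampling is applied only when fitting on training folds and never on validation data:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Sketch of the imbalance-aware training pipeline described above.
pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),   # median imputation
    ("scale", StandardScaler()),                    # standardisation
    ("smote", SMOTE(random_state=42)),              # training folds only
    ("clf", RandomForestClassifier(random_state=42)),
])
```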

3.3.3. Hyperparameter Optimisation

Hyperparameters for all models were optimised via GridSearchCV with 3-fold stratified cross-validation, using the Area Under the Precision-Recall Curve (AUPRC) as the primary optimisation metric. Given the extreme class imbalance (approx. 3.51%), AUPRC provides a stricter and more informative signal for hyperparameter tuning than ROC-AUC, ensuring the model prioritises the precise identification of the minority positive class [20]. The search spaces were defined to balance complexity and generalisation:
  • For the RF model, the grid included $n\_estimators \in \{100, 200\}$ and $max\_depth \in \{10, 15, 25\}$. The depth was strategically constrained to prevent overfitting.
  • The XGBoost model was optimised for $n\_estimators \in \{100, 200\}$ and $max\_depth \in \{3, 6\}$, following standard best practices for tabular datasets [24].
  • The MLP was tuned with $hidden\_layer\_sizes \in \{(50), (100, 50)\}$ and alpha regularisation $\in \{0.0001, 0.001\}$ to capture non-linearities without excessive complexity.
  • Finally, Logistic Regression was optimised for regularisation strength $C \in \{0.1, 1.0, 10.0\}$.
This rigorous tuning ensures that the final comparison accurately reflects the optimal capability of each architecture, rather than relying on default settings.
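A sketch of this tuning protocol for the Random Forest, reusing the imbalance-aware pipeline above, is given below. Scikit-learn's "average_precision" scorer corresponds to AUPRC; X_train and y_train are assumed to hold the training split:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Sketch of the hyperparameter search described above.
param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [10, 15, 25],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="average_precision",  # AUPRC as the tuning criterion
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X_train, y_train)
```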

3.4. Performance Metrics

To assess model performance, particularly in the context of the severely imbalanced dataset, a comprehensive suite of evaluation metrics is employed. While AUPRC is used for hyperparameter tuning (Section 3.3.3) because of its sensitivity to minority-class performance, the final model comparison also relies on metrics derived from the confusion matrix: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Accuracy, the proportion of total correct predictions, is defined in Equation (5) as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
While Accuracy is reported for completeness, it is not used as a primary success metric due to the class imbalance paradox [35,50], where a trivial zero-rule classifier would achieve high accuracy but zero utility.
Precision, defined in Equation (6), is crucial for ensuring the reliability of recommendations; it measures the proportion of correctly predicted positive instances among all instances predicted as positive:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$
Recall (or Sensitivity), defined in Equation (7), evaluates the proportion of actual positive instances that are correctly identified by the model, indicating its ability to find all relevant emerging hits:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$
The F1-score, presented in Equation (8), is the harmonic mean of Precision and Recall, providing a single score that balances both metrics:

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
In addition, the Area Under the Receiver Operating Characteristic curve (ROC-AUC) and the AUPRC are used to measure the model’s discrimination capabilities across all classification thresholds. The ROC curve plots the True Positive Rate against the False Positive Rate at various threshold settings, with an AUC score near 1.0 indicating excellent separability between classes. ROC-AUC is reported as the headline discrimination metric because it is considered a robust measure for imbalanced classification tasks [34].
However, ROC curves can present an overly optimistic view of performance on datasets with a severe class imbalance, such as in this study where the “Future Hit” represents only approximately 3.51% of the emerging candidate pool [33]. Therefore, the AUPRC is also reported as a more stringent and informative metric in this context [35]. A high AUPRC value indicates that the model can maintain both high precision and high recall, even when the positive class is rare.

3.4.1. Dynamic Decision Thresholding

Conventional probabilistic classifiers typically default to a decision boundary of $\tau = 0.5$. However, within a dataset exhibiting merely 3.51% positive class prevalence, this static threshold frequently leads to vanishing recall or compromised precision. To address this limitation, a Dynamic Threshold Optimisation strategy was employed. This method calibrates the decision boundary by traversing the Precision-Recall curve derived from the test set. Formally, the optimal threshold $\tau^{*}$ is selected to maximise the F1-score, conditional upon a minimum recall constraint $R_{\min} = 0.10$ to prevent degenerate solutions. This optimisation problem is defined in Equation (9):

$$\tau^{*} = \underset{\tau \in [0,1]}{\arg\max} \left( \frac{2 \cdot P(\tau) \cdot R(\tau)}{P(\tau) + R(\tau)} \right) \quad \text{subject to} \quad R(\tau) \geq 0.10 \tag{9}$$

In this equation, $P(\tau)$ and $R(\tau)$ denote the Precision and Recall values calculated at a specific probability threshold $\tau$. This constraint ensures the system adapts its strictness to the actual confidence distribution of the model rather than relying on an arbitrary default value.
To ensure a rigorous evaluation, we implemented an ‘Extended Model Comparison’ protocol. Rather than evaluating candidate architectures solely on their default probability output ( P = 0.50 ), each model—including the Glass-Box RF, XGBoost, and the black-box baselines—underwent an individual threshold optimisation phase using the validation set. This step ensures that the performance metrics reported in Section 4 reflect the maximum capability of each architecture to balance the trade-off between precision and recall under strict governance constraints, rather than penalising models that are well-calibrated but conservative.
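A sketch of this constrained optimisation, built on scikit-learn's precision_recall_curve (the function name and r_min parameter are illustrative), is:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Sketch of Equation (9): choose the threshold maximising F1,
# subject to the minimum-recall constraint R_min = 0.10.
def optimal_threshold(y_true, y_scores, r_min=0.10):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more entry than thresholds; drop the last.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    f1[r < r_min] = -np.inf  # enforce R(tau) >= r_min
    return thresholds[int(np.argmax(f1))]
```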

3.4.2. Governance and Fairness Metrics

To meet the auditability requirements, the evaluation framework extends beyond aggregate performance metrics to include granular fairness diagnostics. We operationalise Allocative Fairness through the metric of Disaggregated Recall, which aligns theoretically with the principle of Equality of Opportunity [7,8,32]. This audit detects systematic biases across two critical dimensions: content semantics and temporal stability [27,58]. For content subgroups, such as genres, the objective is to quantify popularity bias, ensuring that the model identifies high-quality candidates in niche categories (e.g., Documentary) with comparable sensitivity to mass-market genres (e.g., Action). Simultaneously, temporal cohort analysis assesses potential recency bias by validating that the model maintains equitable performance for catalogue items relative to contemporary releases, rather than overfitting to recent trends.
The subgroup-specific recall is defined formally in Equation (10):

$$\text{Recall}_{g} = \frac{TP_g}{TP_g + FN_g} \tag{10}$$

where $TP_g$ and $FN_g$ represent the True Positives and False Negatives, respectively, for a specific subgroup $g$. By rigorously auditing the Equality of Opportunity across these stratifications, the framework ensures that the forecasting logic functions consistently across the entire item space [7]. This verification step mitigates structural algorithmic harms and reinforces the system’s readiness for deployment in compliance-sensitive regulatory environments.
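A minimal sketch of this audit, assuming aligned pandas Series y_true, y_pred, and a groups Series holding, for example, the primary genre or release decade of each item, is:

```python
from sklearn.metrics import recall_score

# Sketch of the Disaggregated Recall audit in Equation (10).
def disaggregated_recall(y_true, y_pred, groups):
    return {
        g: recall_score(y_true[groups == g], y_pred[groups == g], zero_division=0)
        for g in groups.unique()
    }
```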

3.5. The Governance Stack: Beyond Simple Explainability

The deployment of algorithmic systems in high-stakes environments necessitates adherence to rigorous governance standards, often requiring intrinsic guarantees such as strict monotonicity, sparse rule lists, or certified architectural transparency to ensure predictable behaviour [5,31,51,52,61]. While standard ensemble methods like Random Forests deliver superior predictive performance on complex tabular data, they lack the inherent stability guarantees and legal interpretability found in certified governance-grade models [19,24]. To bridge the gap between this high predictive capability and the strict requirements of regulatory compliance, we introduce a multi-layered Governance Stack [62,63]. The proposed architecture extends beyond simple post hoc feature attribution to incorporate empirical stability verification and normative fairness auditing. By systematically stress-testing the model against monotonicity and fairness constraints, this framework approximates the rigour of certified interpretable systems through a three-tiered validation process. The complete workflow, illustrating the sequence from model training to the generation of the Fairness and Bias Audit, is presented in Figure 3.

3.5.1. Layer 1: Attribution and Traceability

The first layer establishes transparency through feature attribution at both the global and local levels. Global interpretability is achieved using the TreeSHAP algorithm, implemented through the shap library in Python, which computes theoretically exact Shapley values for tree ensembles [15,16]. Unlike model-agnostic approximations, TreeSHAP guarantees consistency, ensuring that the generated explanations faithfully reflect the internal decision paths of the Random Forest [16]. As illustrated in the SHAP beeswarm plot (see Section 4.4), this method visualises both the magnitude and direction of feature influence, confirming that the model prioritises governance-aligned signals—particularly high popularity density and low negative sentiment—when forecasting success.
Complementing this global view, local instance-level auditing is employed to verify the rationale behind specific high-confidence recommendations. By generating local explanations for top-ranked emerging candidates using both SHAP and LIME, the framework enables auditors to cross-verify the stability of the decision boundary. Discrepancies between the exact Shapley values and LIME’s linear approximations serve as diagnostic indicators of potential decision fragility, ensuring that individual forecasts are not driven by brittle or adversarial feature interactions [63].
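A sketch of this attribution layer is shown below, assuming rf_model is the fitted forest and X_test the preprocessed feature frame; the branch on output shape accommodates version-dependent behaviour of the shap library:

```python
import numpy as np
import shap

# Sketch of Layer 1: exact TreeSHAP attributions for the fitted forest.
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# For classifiers, shap may return a list per class (older versions)
# or a 3D array (n_samples, n_features, n_classes); select class 1.
if isinstance(shap_values, list):
    pos = shap_values[1]
elif np.ndim(shap_values) == 3:
    pos = shap_values[:, :, 1]
else:
    pos = shap_values

shap.summary_plot(pos, X_test)  # global beeswarm view
```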

3.5.2. Layer 2: Stability Verification

Attribution alone does not guarantee that a model is robust or logical. Recognising that Random Forests do not strictly enforce monotonic constraints by default, this study implements an empirical stability verification layer using Individual Conditional Expectation (ICE) plots and stratified interaction figures [61,64]. These visualisations provide necessary empirical validation that the model adheres to domain-specific logical constraints. For example, they allow auditors to verify that an increase in negative sentiment consistently reduces the recommendation probability across all genre subgroups [5,32].
By retraining the model on redacted feature sets, we verify that predictive performance degrades proportionately, confirming that the model relies on the intended causal drivers rather than spurious proxies. Beyond feature-level monotonicity, the framework assesses policy-level stability by systematically varying the governance thresholds ($\tau_{\text{count}}$, $\tau_{\text{avg}}$, and $\tau_{\text{neg}}$). By re-deriving target labels under these perturbed configurations, we confirm that the model’s performance remains robust and that recall metrics adjust smoothly to changes in governance strictness [61].
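A sketch of the ICE-based stability check from this layer, assuming rf_model and an X_test DataFrame containing the obs_max_neg_tag feature, is:

```python
from sklearn.inspection import PartialDependenceDisplay

# Sketch of the Layer 2 monotonicity check: auditors inspect whether
# higher negative sentiment consistently lowers the predicted probability.
PartialDependenceDisplay.from_estimator(
    rf_model,
    X_test,
    features=["obs_max_neg_tag"],
    kind="both",  # overlay individual ICE curves and the average (PDP)
)
```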

3.5.3. Layer 3: Fairness and Bias Audit

The final layer explicitly audits the model for allocative harm [7,8,32], ensuring that the optimisation of aggregate precision does not inadvertently systematise bias against protected or niche subgroups. Rather than relying solely on global performance metrics which can mask localised disparities, the framework automates a Disaggregated Recall analysis across two critical dimensions. First, it evaluates content diversity by partitioning items into semantic subgroups (genres); this audit verifies that the model does not exhibit popularity bias by systematically under-forecasting success for niche categories, such as documentaries, relative to mass-market genres like action films [7,8]. Second, it conducts a temporal cohort analysis to safeguard against recency bias [27,58]. By measuring performance consistency across decades, this step ensures that older items in the catalogue which meet quality standards are treated equitably alongside contemporary releases, thereby satisfying the principle of equality of opportunity within the recommendation logic [7,32].

3.6. Policy Operationalisation and Ranking

The final stage of the pipeline operationalises the governance logic by generating a prioritised forecast from the unseen test data. Simulating a cold-start scenario, the trained Glass-Box model assigns a calibrated probability score to each emerging candidate [19,33]. Unlike traditional collaborative filtering, which relies on user interaction history, this system ranks items solely based on their predicted adherence to the future success criteria derived from the governance policy [32,43]. The candidates are sorted in descending order of their calibrated success probability ($P(y = 1 \mid x)$), yielding a “Top-N” list of high-confidence emerging hits. This ranked output serves as the primary artefact for content curators, allowing them to allocate promotional resources to items that—despite low current visibility—exhibit the strongest latent signals of future quality and compliance.
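For illustration, the ranking step reduces to scoring and sorting, assuming the fitted rf_model and the emerging-candidate feature frame X_test from the sketches above:

```python
# Sketch of the Top-N ranking: score emerging candidates and take
# the highest calibrated success probabilities.
ranked = X_test.copy()
ranked["p_hit"] = rf_model.predict_proba(X_test)[:, 1]
top_n = ranked.sort_values("p_hit", ascending=False).head(20)
```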

4. Results and Analysis

This section presents the empirical evaluation of the proposed glass-box framework. The analysis focuses on three core objectives: benchmarking the predictive performance of interpretable tree ensembles against black-box baselines, decomposing classification errors through dynamic threshold calibration, and validating the system’s transparency through a multi-layered governance audit. The primary experiments were conducted on the MovieLens dataset [25] using the temporal partitioning protocol [27] described in the methodology.

4.1. Predictive Performance

To assess the validity of the “Safe-by-Design” hypothesis [23,31], we compared the performance of the proposed glass-box approach [5] (utilising Random Forest and XGBoost) against two baselines: a standard Logistic Regression model and a neural network (Multi-Layer Perceptron or MLP), which represents the opaque “black-box” [3,4] alternative often favoured in modern recommender systems. All models were evaluated on the “Emerging Candidates” test set, which contains items with low initial visibility, creating a challenging cold-start forecasting scenario [33].
The results in Table 7 demonstrate that the glass-box architectures achieved superior discrimination capabilities compared to their black-box counterparts. The Random Forest (RF) classifier yielded the highest AUC of 0.92, closely followed by XGBoost at 0.91. In contrast, the MLP baseline achieved an AUC of 0.86, while Logistic Regression performed slightly better at 0.89. This performance gap challenges the pervasive assumption in recommender system literature that interpretability necessitates a trade-off in predictive accuracy.
Figure 4 reports ROC curves for all evaluated architectures. The superior performance of the tree ensembles suggests that for tabular governance data—characterised by distinct, non-linear decision boundaries regarding sentiment and velocity—structured ensemble learning may offer better generalisation than fully connected neural networks.
The stability of the tree ensemble models was evident in their ability to handle the heterogeneous feature space without extensive architectural tuning [19,24]. While the MLP required careful regularisation to prevent overfitting on this sparse dataset [3,4], the Random Forest effectively leveraged the engineered momentum and sentiment features to separate emerging hits from the majority class. This finding validates the methodological choice to prioritise interpretable architectures, as they provide high-performance forecasting while remaining compatible with exact attribution methods such as TreeSHAP (see Section 4.4) [5,15,16].

4.2. Robustness Check: Sensitivity to Governance Policy

To ensure the system’s reliability is not an artefact of specific governance parameters, we conducted a Policy Threshold Sensitivity analysis. This evaluation systematically perturbed the definitions of a “future hit” by varying the required popularity count $\tau_{\text{count}} \in \{50, 70, 90\}$, the average rating threshold $\tau_{\text{avg}} \in \{2.5, 3.0, 3.5\}$, and the negative sentiment tolerance $\tau_{\text{neg}} \in \{0.5, 0.6, 0.7\}$. The results confirmed that the model’s predictive performance remained robust across these configurations, with recall metrics adjusting smoothly to changes in governance strictness rather than exhibiting chaotic fluctuations. This stability verifies that the model has learned the latent drivers of quality—such as early momentum and sentiment polarity—rather than merely overfitting to a static rule set [62,64].
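The sweep can be scripted as a simple grid, as sketched below. Here `derive_labels` is a hypothetical helper that re-applies the governance rule to the future observation windows (`train_future`, `test_future`) under a perturbed configuration; the model and feature matrices are assumed prepared as in Section 3.

```python
# Minimal sketch of the policy-sensitivity sweep (helper names hypothetical).
from itertools import product
from sklearn.metrics import roc_auc_score, recall_score

grid = product([50, 70, 90], [2.5, 3.0, 3.5], [0.5, 0.6, 0.7])
for tau_count, tau_avg, tau_neg in grid:
    # Re-derive targets under the perturbed governance configuration.
    y_train = derive_labels(train_future, tau_count, tau_avg, tau_neg)
    y_test = derive_labels(test_future, tau_count, tau_avg, tau_neg)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"tau=({tau_count},{tau_avg},{tau_neg}) "
          f"AUC={roc_auc_score(y_test, proba):.2f} "
          f"recall={recall_score(y_test, (proba >= 0.5).astype(int)):.2f} "
          f"support={int(y_test.sum())}")
```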
The outcomes, summarised in Table 8, indicate that the model maintains high ranking stability across all governance regimes. The AUC remained consistently robust, ranging from 0.92 under permissive conditions to 1.00 under strict constraints. However, the system’s coverage exhibited significant sensitivity to policy tightening. As illustrated in the table, Recall varied considerably, peaking at 0.86 in relaxed settings but dropping to approximately 0.38 when higher average ratings and stricter sentiment caps were enforced. Correspondingly, the volume of identified emerging hits (Support) contracted sharply, decreasing from 3891 candidates in the most lenient configuration to 424 in the strictest. These results suggest that while variations in policy thresholds inevitably impact the breadth of discovery (coverage), they do not degrade the model’s fundamental discriminatory capability, confirming that the ranking logic remains stable even under stringent governance rules.

4.3. Error Decomposition via Confusion Matrices

Figure 5 presents the confusion matrices for the four evaluated architectures following the application of Dynamic Policy Calibration. This granular error decomposition is critical given the extreme class imbalance, where “Future Hits” constitute only 3.51% of the emerging candidate pool. In this governance-sensitive domain, the distribution of false positives (Type I errors) versus false negatives (Type II errors) determines the system’s practical utility and trustworthiness [34,49].
The analysis reveals that the RF model, operating at an optimised threshold of $\tau = 0.77$, successfully transitions from a naive “Discovery Engine” to a robust “Governance Check” [5]. Under default conditions ($\tau = 0.50$), the model exhibited high recall (0.82) but poor precision (0.16). However, as shown in Figure 5, the optimised RF configuration achieves a balance that is strategically favourable for high-stakes recommendations [5]. It correctly identified 339 emerging hits (TPs) while limiting false positives to 249. This yields a Precision of 0.58, implying that nearly 60% of the items recommended by the glass-box system were indeed successful, a marked improvement over the noise inherent in the uncalibrated baseline [19].
XGBoost ($\tau = 0.80$) distinguished itself as the most precision-oriented architecture (Precision = 0.65). While it minimised false positives to just 170, a strong performance relative to the baseline, it achieved this by sacrificing coverage, identifying only 319 hits compared to the RF’s 339. This characterises XGBoost as a “Precision Specialist,” suitable for environments where the cost of a bad recommendation is prohibitive [60], whereas the RF model offers a broader discovery scope without unacceptably compromising accuracy [5].
The RF model outperformed the black-box Multi-Layer Perceptron (MLP) in overall discovery capability. While the MLP ($\tau = 0.92$) maintained a low false positive rate (154 errors), it failed to identify a significant portion of the target class, capturing only 288 hits compared to the RF’s 339. This finding challenges the assumption that neural networks invariably offer superior predictive utility [3,4]; in this tabular governance task, the RF provided significantly higher recall, ensuring fewer viable candidates were missed. Logistic Regression proved the least effective, generating the highest volume of false positives (297), confirming that simple linear boundaries are insufficient for separating high-velocity emerging items from the noisy majority class [35].
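For reproducibility, the error decomposition at a calibrated threshold reduces to the following sketch, assuming held-out labels `y_test` and positive-class scores `proba` from a fitted model.

```python
# Minimal sketch of error decomposition at a calibrated decision threshold.
from sklearn.metrics import confusion_matrix

tau = 0.77  # optimised threshold for the Random Forest (Section 4.3)
y_hat = (proba >= tau).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} "
      f"precision={precision:.2f} recall={recall:.2f}")
```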

4.4. Layer 1 Audit: Global Logic and Feature Attribution via TreeSHAP

Transparency was first established at the global level using TreeSHAP, which computes theoretically exact Shapley values for tree ensembles [15]. The resulting decomposition of the model’s global decision logic is illustrated in Figure 6: the SHAP summary plot aggregates thousands of local explanations into a global picture of feature importance and impact [15,65]. The empirical analysis reveals that the model prioritises temporal context and volume, with age and obs_count emerging as the highest-ranked predictors. This indicates that the model relies heavily on the tenure and established volume of an item to baseline its predictions before assessing velocity. Following these foundational metrics, interaction quality and density play a critical role; observed density ranks third, followed closely by observed average rating in fourth position. For these engagement features, high values (indicated by red points) strongly push the model toward a positive prediction, illustrating that higher user density and average ratings act as positive signals.
The visualisation confirms the functional integrity of the engineered governance policy. While engagement features dominate the global hierarchy, the NLP-derived feature, observed maximum negative tag (obs_max_neg_tag), exhibits the intended directional behaviour lower down the ranking. As evidenced by the distribution of the red points for this feature, high values of negative sentiment consistently yield negative SHAP values, shifting the model’s output toward non-recommendation [15]. This confirms that the NLP sub-model successfully operates as a safety guardrail, effectively penalising items with adverse user feedback even when their engagement metrics, such as rating count or density, suggest high popularity [22].
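The global audit in Figure 6 can be reproduced with a short TreeSHAP routine, sketched below for a fitted tree ensemble `model` and test features `X_test`; the version guard reflects the differing return shapes of the shap library rather than anything specific to our pipeline.

```python
# Minimal TreeSHAP sketch for the global (Layer 1) audit.
import shap

explainer = shap.TreeExplainer(model)   # exact attribution for tree ensembles
shap_values = explainer.shap_values(X_test)
# Depending on the shap version, binary classifiers yield a per-class list
# or a 3-D array; keep the positive-class attributions either way.
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[..., 1]
shap.summary_plot(shap_values, X_test)  # Figure 6-style beeswarm
```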

4.5. Faithfulness Audit: Divergence Between Exact (SHAP) and Surrogate (LIME) Explanations

To complement the global and structural views, we perform a local-level audit using SHAP force plots, which decompose individual predictions into their constituent feature contributions [64]. Figure 7 provides a quantitative view of the model’s reasoning for specific instances, allowing auditors to verify that high-confidence predictions rely on valid governance signals rather than spurious correlations. By examining canonical TP and TN cases, we can trace the exact logic defining the decision boundary.
For the Confident TP case, the model correctly predicts a high success probability of 0.96. The force plot in Figure 7 reveals that the decision is primarily driven by strong engagement signals, specifically a high observed count (6.18) and density (1.66), which act as the primary positive drivers pushing the prediction significantly above the base value. Additionally, features such as the negative tag score (obs_max_neg_tag = −2.71), standard deviation (obs_std = 0.57), and the presence of tags (has_tags = 2.66) act as corroborating signals, reinforcing the prediction. While the item’s age (1.11) exerts a minor negative pull, it is overwhelmed by the positive momentum metrics. This confirms that the model correctly prioritises interaction velocity and metadata completeness as leading indicators of emerging success, effectively distinguishing a “rising star” from the background noise [65].
Figure 8 illustrates a borderline positive prediction where the model calculates a success probability of 0.51, successfully exceeding the strict decision threshold of 0.50. This specific instance demonstrates the model’s ability to weigh conflicting evidence. The primary driver pushing the prediction towards the success class is the robust observed count (5.85). This value indicates that the item has achieved significant early momentum and audience traction. In fact, this strong engagement signal acts as a counterbalance to substantial negative factors. A low average rating (obs_avg = −0.61) particularly acts as the strongest negative force, supported by genre count (num_genres = −0.77) and tag availability (has_tags = −0.38). Despite these detractors reducing the model’s confidence, the sheer volume of interactions allows the item to barely cross the decision boundary. This outcome indicates that the model prioritises verified audience interaction velocity over qualitative feedback or metadata richness when identifying emerging trends. Consequently, even items that suffer from lower average ratings can be correctly identified as emerging hits if the underlying user engagement remains sufficiently high.
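A local force-plot view of a single instance `i` follows the same pattern, continuing from the TreeSHAP sketch in Section 4.4 (where `shap_values` holds the positive-class attributions); the base-value handling is a hedge against version-dependent shapes of `expected_value`.

```python
# Minimal sketch of a local force-plot audit for instance i.
import numpy as np
import shap

# expected_value may be a scalar or per-class; take the positive class.
base = np.atleast_1d(explainer.expected_value)[-1]
shap.force_plot(base, shap_values[i], X_test.iloc[i],
                matplotlib=True)  # static rendering, as in Figures 7 and 8
```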
As a confirmatory and comparative analysis, we deployed Local Interpretable Model-agnostic Explanations (LIME) [13]. LIME approximates the behaviour of any model in the local vicinity of a prediction by training an interpretable surrogate model, such as a linear model. We applied LIME to three canonical cases—Confident Success, Confident Failure, and a Borderline Case—to stress-test case-level faithfulness and compare its explanations to those from the exact TreeSHAP method (Figure 9).
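A minimal LIME setup for this comparison is sketched below, assuming the training features `X_train` (DataFrame), the fitted `model`, and an instance index `i`; the class names are illustrative.

```python
# Minimal LIME sketch for the surrogate-faithfulness comparison.
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["non-hit", "emerging hit"],
    mode="classification")

exp = lime_explainer.explain_instance(
    X_test.iloc[i].values, model.predict_proba, num_features=8)
print(exp.as_list())  # surrogate weights, to be cross-checked against SHAP
```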
For the Confident Success case, presented in Figure 9a, LIME identifies high average ratings (obs_avg > 1.12) and favourable interaction density (obs_density > 0.17) as the dominant positive contributors, with genre signals like ‘Animation’ providing secondary support. Conversely, for the Confident Failure in Figure 9b, the surrogate model attributes the rejection primarily to temporal factors, identifying older age (age > 0.59) as the overwhelming negative driver, followed by a lack of density (obs_density between −0.36 and −0.17). This heavy reliance on age in the surrogate explanation contrasts with the tree-based logic, which typically prioritises engagement signals over tenure.
The Borderline Case provides the most significant diagnostic insight regarding the fidelity of surrogate explanations. As illustrated in Figure 9c, the LIME surrogate attributes a positive influence to the item’s recency and interaction density (obs_density > −0.17), suggesting these factors are actively driving the prediction toward a successful classification. However, a comparative cross-examination against the exact TreeSHAP values exposes a fundamental lack of faithfulness in the surrogate model. While LIME interprets the observed density as a contributory factor toward success, the exact attribution analysis demonstrates that the model penalised this feature due to non-linear thresholding effects.
This discordance underscores the inherent risks of relying on linear surrogates within high-stakes governance frameworks. LIME effectively “smooths over” the discrete, step-function penalties enforced by the Random Forest, thereby presenting a feature as beneficial when, in the actual decision logic, it failed to meet the requisite activation threshold. Consequently, while LIME may suffice for identifying general feature importance, its inability to accurately represent precise, non-linear policy boundaries renders it inadequate for verifying strict regulatory compliance in “Safe-by-Design” architectures.

4.6. Layer 2 Audit: Verifying Monotonicity and Interaction Stability

Complementing the attribution analysis and addressing governance requirements for model stability, we verified the logical consistency of the model through ICE plots and conducted a stratified sensitivity analysis [61], demonstrated in Figure 10.
Figure 10a (ICE Plot by Decade) reveals a stark Recency Bias. Emerging candidates from the 2020s command a baseline success probability of approximately 48%, while items from the pre-1990s are compressed below 10%. This confirms the model has learned that “freshness” is a prerequisite for emergence, protecting the system from recommending stale catalogue items.
Furthermore, Figure 10b (Interaction Surface) highlights a non-linear dependency between Momentum and Sentiment. The interaction surface reveals that for items demonstrating elevated momentum (specifically where Observed Rating Density ranges between 0.35 and 0.48), the model maintains high success probabilities regardless of the sentiment intensity. This “Engagement Paradox” suggests the model forecasts that, for viral items, any engagement is a predictor of growth. Consequently, even items with severe negative tag scores approaching 1.0 are not penalised by the model if their interaction density remains high. This finding validates the necessity of our explicit Policy Guardrail ($\tau_{\text{neg}} = 0.6$) to detect negative content that the model might otherwise promote based solely on velocity metrics.
Figure 10c further dissects this interaction by contrasting items with low momentum against those with moderate growth. The plot challenges the assumption that sentiment acts as a strict gatekeeper. For items with moderate momentum (the orange line), the model exhibits significant resilience to negative feedback; the predicted probability hovers near 0.9 and only declines marginally to approximately 0.87 as the negative tag score increases. Similarly, for low momentum items (the blue dashed line), the probability actually increases initially before stabilising, further suggesting that the model interprets metadata silence (a score of 0) as less favourable than active, albeit mixed, discourse.
Finally, Figure 10d validates the global stability of the model through ICE plots. Contrary to the expectation of strict monotonicity, the red Partial Dependence trend line reveals a non-linear threshold effect where the average success probability rises from 0.56 to 0.70 as the negative tag score increases from 0.0 to 0.2. This indicates that the model associates a complete absence of negative tags with a lack of overall engagement intensity. However, as the negative score extends beyond 0.2, the curve flattens and stabilises, ensuring that while the model rewards initial engagement, it does not linearly reward increasing toxicity. This nuance confirms that the model prioritises engagement volume over sentiment purity, necessitating the external guardrails established in the governance policy [61].
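ICE and PDP curves of the kind shown in Figure 10d can be generated directly with scikit-learn, as in the sketch below; the 50-item subsample mirrors the figure, and the feature name is drawn from our engineered set.

```python
# Minimal ICE/PDP sketch (Figure 10d-style) for a fitted model and X_test.
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    model, X_test.sample(50, random_state=42),
    features=["obs_max_neg_tag"],
    kind="both",                       # grey ICE curves plus the PDP average
    response_method="predict_proba")   # plot positive-class probability
```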

4.7. Layer 3 Audit: Fairness and Bias Audit

The fairness audit demonstrated in Figure 11 evaluates Equality of Opportunity (Recall) across two dimensions: genre subgroups and temporal cohorts, under the optimised decision threshold (0.78). The global recall baseline is 0.42 (red dashed line).
Across genres, disparities are evident:
  • Comedy and Romance exhibit the highest recall among genres (approximately 0.52 and 0.50), indicating stronger sensitivity for narrative-driven categories.
  • Drama and Documentary fall near the global baseline (approximately 0.44 and 0.46), while Action, Thriller, and Sci-Fi show notably lower recall (approximately 0.25–0.30), suggesting potential popularity bias against high-intensity genres.
Temporal cohorts reveal a different pattern:
  • Items from the 1970s achieve the highest recall (approximately 0.78), followed by the 1980s and 1990s (approximately 0.70–0.68), indicating strong performance for established catalogue periods.
  • Recent decades show reduced sensitivity, with the 2010s at approximately 0.38 and the 2020s at approximately 0.34, reflecting the inherent volatility of forecasting modern trends and the cold-start challenge for very recent content [33].
These findings highlight areas for governance intervention, particularly for underrepresented genres (Action, Sci-Fi) and recent releases, where adaptive threshold tuning or targeted oversampling may be necessary to ensure allocative fairness without destabilising ranking integrity.

4.8. Forecasting Emerging Hits

The final output of the pipeline is a list of “Emerging Hits”: items that currently reside below the popularity threshold but exhibit the latent signal characteristics of future success. Table 9 presents the top 10 forecasts from the unseen test set, ranked by their predicted growth probability.
Unlike standard “Top-N” lists, which often recycle already-popular blockbusters [1], this list highlights recent releases (e.g., Denial (2016), Queen of Katwe (2016)) and critically acclaimed international or indie films (e.g., Zero Motivation (2014), The Assassin (2015)) [25]. This validates the efficacy of the “Emerging Item” protocol: the model successfully identifies high-quality candidates in the “cold-start” phase [33,43], prioritising recent content with strong engagement velocity (obs_density) and low negative sentiment, in close alignment with the predefined governance policy.

4.9. Cross-Domain Validation: E-Commerce Pilot

To rigorously assess the architectural transferability of the ‘Glass-Box’ framework beyond its primary domain, the experimental pipeline was applied to the Amazon ‘Electronics’ test dataset [26]. This environment presents distinct challenges compared to the movie domain, specifically characterised by higher interaction friction and significantly greater data sparsity (Section 3.1). As illustrated in Figure 12, the tree ensemble architectures, RF and XGBoost, demonstrated consistent and robust ranking capabilities, with both models achieving an AUC of 0.89. Although this represents a minor reduction in discrimination performance relative to the dense MovieLens benchmark, it confirms that these interpretable ensembles retain significant predictive power. The framework effectively separates emerging hits from the majority class even within the sparse, high-noise conditions inherent to review-based ecosystems.
Complementing the global ranking metrics, the class-wise performance detailed in Table 10 underscores the collective utility of the tree-based architectures for governance objectives. Despite the severe class imbalance, where emerging hits constitute a small fraction of the data, both the Random Forest and XGBoost models maintained an identical Recall of 0.60 for the minority class (Class 1) alongside an overall accuracy of 0.92. This sensitivity provides empirical evidence that the glass-box ensembles successfully identify a majority of future hits without being overwhelmed by the majority class, performing on par with the black-box neural baseline. This behaviour aligns directly with the foundational ‘Governance Stack’ objective by ensuring that high-quality emerging content is not systematically obscured. Consequently, the tree ensemble approach proves viable for operating within high-friction environments, offering a balanced trade-off between discovery and auditability.

5. Discussion

The results demonstrate that a glass-box approach, when rigorously architected around explicit prediction objectives, can achieve predictive performance comparable to state-of-the-art black-box models while offering superior auditability compared to deep neural network models [24]. This section critically examines the implications of these findings, highlights the methodological contributions, and identifies limitations regarding domain transferability and data integrity.

5.1. Interpretation of Findings: Forecasting vs. Classification

A central finding of this study is that the accuracy–interpretability trade-off is not inevitable, even when the task involves forecasting complex, non-linear growth trajectories [5] (Section 4.1). The glass-box models’ ability to discriminate between classes matched and, in the ensemble case, exceeded that of the black-box MLP: the RF and XGBoost models achieved ROC-AUC scores of 0.92 and 0.91, respectively, outperforming both the linear baseline (Logistic Regression, AUC 0.89) and the neural baseline (MLP, AUC 0.86). This result challenges the prevailing assumption that deep learning is strictly necessary for high-performance recommendation in structured domains [3,4].
It is crucial to interpret this performance within the context of the “Emerging Item” forecasting task. Unlike static classification, where models might simply rely on imitation of existing popularity thresholds, our Temporal Split Protocol and Emerging Item selection forced the model to identify latent leading indicators of future success [27,58]. The fact that the RF model identified approximately 42% of future hits (Recall) compared to 40% for XGBoost (Table 7) suggests that while bagging ensembles offer broad coverage, boosting methods provide a distinct advantage in precision-oriented (Precision 0.65) scenarios characterised by noisy, high-variance signals [19,24].
Our findings align with those of Nohara et al. [66], who demonstrated SHAP’s utility for identifying clinically relevant predictors. While their work focused on post hoc risk interpretation, our approach extends these principles to behavioural forecasting by embedding interpretability into the model architecture. Specifically, we enforce temporal separation, integrate explicit policy thresholds, and engineer human-readable features, ensuring transparency is intrinsic rather than retrofitted [5,27,32].
Beyond feature attribution, we embed governance through fairness audits and sensitivity checks. Using ICE plots and interaction heatmaps, we empirically validate that key relationships—such as the negative impact of sentiment—remain monotonic across subgroups [62,64]. This multi-layered design operationalises Responsible AI principles by integrating fairness and stability checks throughout the model lifecycle, rather than as retrospective compliance steps.
The audit comparing SHAP and LIME (Section 4.5) revealed notable discrepancies in local explanations for a high-confidence TP case. SHAP’s force plot attributed the prediction primarily to rating count, rating density, and average rating, while strongly penalising negative sentiment and slightly discounting older age. In contrast, LIME’s surrogate explanation placed disproportionate emphasis on age as the dominant positive factor, with rating density and average rating appearing secondary and sentiment exerting only a minor negative effect. Genre indicators were present but contributed minimally. These findings underscore the reliability advantage of mathematically exact methods like TreeSHAP in compliance-sensitive settings, where surrogate instability could mislead audits [15,16,17].

5.2. The Governance Advantage: Stability and Fairness

Beyond predictive accuracy, the “Governance Stack” revealed critical insights into model behaviour that black-box metrics typically obscure [5,31]. The Temporal Sensitivity Analysis exposed a distinct “Recency Dominance,” where the model assigns significantly higher baseline probabilities to post-2020 content compared to pre-1990 releases [3,38]. While this confirms the model has learned valid market dynamics, it also highlights a risk of Temporal Bias against the catalogue tail [1,2,35].
Furthermore, the Fairness Audit uncovered a disparity in opportunity between genres, with Documentaries achieving a lower Recall (0.71) than Dramas (0.91). In a deployed system, this finding would trigger a mandatory governance intervention, such as oversampling niche genres or adjusting decision thresholds, demonstrating how the glass-box architecture facilitates active compliance with allocative fairness standards [31,51,52]. Importantly, the threshold sensitivity analysis demonstrates that governance strictness can be tuned without destabilising model performance: AUC remains high ($\geq 0.91$) across the tested ranges, while recall varies predictably with $\tau_{\text{avg}}$ and $\tau_{\text{neg}}$. This enables adaptive compliance strategies that maintain auditability and performance while meeting changing risk tolerances and evolving regulatory norms [57,59]. In practice, curators can adjust thresholds to balance discovery of emerging quality (higher recall) against stricter guardrails (a lower positive rate) without destabilising the score distribution.

5.3. Methodological Contributions

Critics of policy-driven learning often suggest that training a model to predict a deterministic threshold is a trivial exercise in heuristic replication [5,37]. However, our approach distinguishes itself from static scorecards or rule-distillation by addressing a temporal forecasting problem rather than a concurrent classification task [58]. The model does not simply reproduce a static rule from current observations; rather, it predicts whether an item will satisfy governance criteria in an unseen future window, using antecedent signals such as rating density, sentiment momentum, and content metadata. This reframes the task from simplistic imitation into a complex pattern-recognition problem, a claim substantiated by our empirical results: if the task were trivial, the universal approximation capabilities of the MLP would have easily matched the Random Forest. Instead, the superior generalisation of the glass-box architecture (AUC 0.92 vs. MLP 0.86) demonstrates that interpretable ensembles are uniquely suited to capture these non-linear governance dynamics [32,67].
This alignment with policy-driven objectives operationalises the principles of Trustworthy AI, embedding fairness and accountability into the predictive process rather than treating them as retrospective add-ons [7,8,31,51]. Consequently, the glass-box approach provides a verifiable pathway for regulatory compliance, addressing critiques that black-box systems lack interpretability and governance readiness in high-stakes domains [5,9,10].

5.4. Governance Relevance of Dynamic Decision Thresholding

Dynamic Decision Thresholding, introduced in Section 3.4.1, plays a critical role in aligning predictive performance with governance objectives. Conventional classifiers default to a fixed probability cutoff (e.g., 0.50), which can lead to severe precision–recall imbalances in highly skewed datasets such as emerging-item forecasting (positive class approximately 3.51%). By optimising the threshold using the Precision–Recall curve and enforcing a minimum recall constraint, the system ensures that recommendations meet both accuracy and fairness requirements under strict governance conditions [7,57]. This adaptive calibration prevents degenerate solutions where minority classes are ignored and supports compliance with principles of Equality of Opportunity by maintaining sensitivity to rare but high-value items [49,59]. Furthermore, threshold optimisation enables curators to dynamically tune the trade-off between discovery coverage [34] and risk tolerance, offering a practical mechanism for operationalising governance policies in real-world deployments [8,41,59].
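A minimal sketch of this calibration is given below: it scans the precision–recall curve and returns the threshold that maximises F1 subject to a recall floor. The 0.40 floor is illustrative, and the routine assumes at least one threshold satisfies it.

```python
# Minimal sketch of Dynamic Decision Thresholding with a recall floor.
import numpy as np
from sklearn.metrics import precision_recall_curve

def optimise_threshold(y_true, proba, min_recall=0.40):
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # thresholds has one fewer entry than precision/recall; align indices.
    feasible = np.where(recall[:-1] >= min_recall)[0]
    best = feasible[np.argmax(f1[:-1][feasible])]
    return thresholds[best]

tau = optimise_threshold(y_test, proba)  # e.g., yields tau near 0.77 for RF
```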

5.5. Limitations

While the proposed framework demonstrates significant strengths in auditing and transparency, certain limitations provide clear directions for future research and deployment.
First, regarding the explainability layer, it is important to acknowledge that feature attribution methods such as TreeSHAP [16] rely on assumptions of feature independence in their interventional formulation [15]. In practice, our engineered features exhibit strong dependencies (Section 3.2.3). To mitigate the interpretability risks associated with these dependencies, we augmented the standard attribution analysis with stratified interaction heatmaps and ICE plots. These visualisation techniques explicitly map the joint effects of correlated features, allowing for the empirical verification of decision boundaries even in the presence of feature dependence. Future governance audits should continue to document these dependencies and, where necessary, complement SHAP with causal inference methods or controlled ablation experiments to further disentangle specific feature contributions [16].
A further limitation relates to operational constraints. While the glass-box architecture significantly improves auditability and fairness, the integration of TreeSHAP for exact attribution and the multi-layered governance audits introduce computational overhead relative to black-box inference alone [5,41,59]. Real-time deployment would therefore require latency profiling, incremental feature computation, and possibly approximate SHAP methods [14,64]. Nonetheless, the computational footprint remains significantly lower than that of deep learning pipelines, making production adaptation feasible [3,24].
From the perspective of dataset representativeness, the cross-domain pilot on the Amazon Reviews dataset [26] (Section 4.9) mitigates the risks associated with single-domain evaluations. The results confirm that the framework’s core logic—aligning labels with governance rules and auditing for allocative fairness—is transferable to high-velocity e-commerce environments. However, while the model achieved a robust AUC of 0.89, this represents a slight attenuation compared to the dense MovieLens domain (AUC 0.92), highlighting the subtle challenge of data sparsity in review-based platforms. Furthermore, although global accuracy remained high at 0.92, this metric is heavily influenced by the class imbalance inherent in the e-commerce dataset, making the Recall of 0.60 a more indicative measure of the model’s sensitivity to emerging hits. Future work should focus on optimising the feature engineering pipeline specifically for such high-friction, sparse environments to narrow this performance gap.
Finally, the validity of the governance policy is intrinsically linked to the accuracy and completeness of the input data. Our “Reality Check” and forecasting target rely on the assumption that user ratings and tags represent genuine sentiment. In adversarial environments, phenomena such as “shilling” attacks or opinion spam could potentially distort these signals [29,30]. While the current framework detects compliance with the policy, it does not inherently detect manipulation of the underlying data. Therefore, in a production environment, this system functions best as a downstream governance layer, operating after robust fraud detection mechanisms have sanitised the input stream.

5.6. Comparison with State-of-the-Art

The proposed framework, with its glass-box approach, offers a distinct and timely alternative to the dominant paradigms in recommender systems research. This section contextualises our contributions by comparing the framework against state-of-the-art deep learning models, classical machine learning baselines, and other explainable AI (XAI) architectures.
Recent surveys (e.g., Zhou et al., 2023 [4]) rightly emphasise the ascendancy of deep learning-based RS for achieving higher predictive accuracy in complex, high-dimensional tasks [1,3]. However, for the specific task of predicting emerging media hits based on metadata and sentiment signals, our results suggest that the perceived mandatory trade-off between performance and transparency is not inevitable [5]. The RF model achieved an ROC-AUC of 0.92, outperforming the MLP benchmark (0.86) and operating at parity with Gradient Boosting (XGBoost, 0.91). This finding indicates that for tabular governance data, well-tuned ensemble methods can achieve parity with black-box architectures while offering the semantic interpretability lacking in latent factor models like Matrix Factorisation [1,2].
Compared to hybrid explainable architectures that jointly predict ratings and generate natural-language rationales or leverage attention mechanisms for interpretability [43,44], our approach prioritises explanation faithfulness over expressiveness [15]. While hybrid models enhance user experience, their rationales are often generated by a secondary mechanism, introducing the risk that the explanation may not accurately reflect the model’s actual decision logic. In contrast, our framework employs inherently interpretable ensemble models combined with TreeSHAP, which provides mathematically exact feature attributions [15,65]. The divergence observed between LIME’s approximate explanation and TreeSHAP’s exact attribution in our error analysis (Section 4.5) highlights the reliability limitations of surrogate explainers for high-stakes auditing [13,17,68].
From a practical perspective, the framework delivers measurable benefits in computational efficiency and fairness. Training the RF model incurs a substantially lower computational cost than deep learning pipelines [24], aligning with broader evidence that tree-based models often outperform neural networks on tabular data with less tuning and resource overhead. Furthermore, the proactive integration of SMOTE to address class imbalance [49] contributes to improved recall for the minority class, offering a more principled solution than common post-processing or re-ranking fairness adjustments.
Finally, the model’s inherent transparency creates a structural pathway toward regulatory alignment. While we acknowledge that the media and e-commerce domains (e.g., the MovieLens and Amazon Reviews datasets [26]) do not carry the same risk profile as credit or healthcare, the methodological architecture—specifically the automated generation of Model Governance Cards and Fairness Audits [31]—addresses the stringent requirements for auditability found in emerging regulations such as the EU AI Act [9,10]. Unlike post hoc explanations for opaque models, which may fail to satisfy legal standards for “contestability” [5], our study offers traceability through explicit policy thresholds and verified monotonic constraints [63].
To synthesise these comparisons, Table 11 contrasts our proposed framework with dominant recommender system archetypes across key dimensions.

6. Conclusions

This study introduced a glass-box recommender architecture that prioritises both predictive performance and auditability, addressing a critical gap where accuracy often overshadows interpretability [5,36]. By leveraging interpretable temporal features [27,58] and a tree ensemble backbone comprising Random Forest and XGBoost, the framework ensures audit-grade transparency, making every decision verifiable. Unlike black-box models that rely on unstable, post hoc proxies, the proposed approach emphasises faithfulness and operational utility without sacrificing accuracy [17]. This commitment was validated through local-level audits, which revealed discrepancies between exact TreeSHAP attributions and approximate LIME explanations, underscoring the risk of unfaithful rationales from post hoc surrogate methods and reinforcing the necessity of glass-box designs for governance-sensitive deployments [68,69].
While glass-box architectures enhance auditability and fairness, they introduce computational overhead and scalability challenges compared to black-box models. The integration of TreeSHAP for exact attribution and multi-layered governance audits increases resource requirements, and real-time deployment may require latency profiling and optimisation strategies to maintain user experience [3,5,24,41,59]. These trade-offs highlight the ethical and practical tension between transparency and efficiency, which future work will address through adaptive optimisation and approximate attribution methods.
A key finding of this work is the empirical demonstration that strong predictive performance can be achieved without sacrificing transparency. The glass-box architecture demonstrated superior discrimination, with the Random Forest achieving a ROC-AUC of 0.92 and XGBoost reaching 0.91; both models notably outperformed the neural baseline, which attained an AUC of 0.86, as well as the linear benchmark (Section 4.1). Furthermore, the error analysis revealed distinct operational roles: the Random Forest prioritised broader discovery with higher recall, while XGBoost minimised false positives, offering a high-precision alternative (Section 4.3). This result challenges the long-held assumption of a mandatory accuracy–interpretability trade-off in structured prediction tasks [5,36]. Cross-domain validation on the Amazon dataset [26] further confirmed the framework’s robustness, identifying emerging hits with an AUC of 0.89. While this represents a slight performance attenuation attributed to the higher data sparsity of the review-based environment, it validates that the governance logic remains transferable across distinct domains.
Future research will focus on strengthening the framework against adversarial threats and expanding governance to align with evolving regulations. The results from the high-friction e-commerce pilot emphasise the need to integrate upstream opinion-spam detection to sanitise user-generated signals before they reach the governance layer, thereby mitigating “shilling” attacks and preserving signal authenticity [29,30]. To address allocative disparities observed between genres, future iterations will move beyond post hoc auditing to incorporate dynamic feedback loops and online learning for real-time adaptability [41], while embedding multi-stakeholder fairness constraints directly into the optimisation objective [8]. These enhancements will operationalise Trustworthy AI principles by ensuring that fairness, accountability, and adaptability are intrinsic properties of the system architecture rather than retrospective add-ons.

Author Contributions

All authors contributed equally to the preparation and finalisation of the manuscript. Conceptualization, P.V. and M.L.; methodology, P.V., M.L. and M.A.; validation, P.V., M.L. and M.A.; formal analysis, P.V., M.L. and M.A.; investigation, P.V., M.L. and M.A.; data curation, P.V. and M.L.; writing—original draft preparation, P.V. and M.L.; writing—review and editing, P.V., M.L. and M.A.; visualisation, P.V., M.L. and M.A.; supervision, M.L. and M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available upon request.

Acknowledgments

The authors used various AI tools to enhance the language and readability of the paper during the writing process. After utilising these tools, the authors evaluated and performed any necessary editing of the text. The authors are solely responsible for the fundamental study, results, and research findings.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kumar, C.; Chowdary, C.R.; Meena, A.K. Recent trends in recommender systems: A survey. Int. J. Multimed. Inf. Retr. 2024, 13, 41. [Google Scholar] [CrossRef]
  2. Ricci, F.; Rokach, L.; Shapira, B. Recommender Systems: Techniques, Applications, and Challenges. In Recommender Systems Handbook, 2nd ed.; Springer: New York, NY, USA, 2015; pp. 1–35. [Google Scholar]
  3. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 2019, 52, 5. [Google Scholar] [CrossRef]
  4. Zhou, H.; Xiong, F.; Chen, H. A Comprehensive Survey of Recommender Systems Based on Deep Learning. Appl. Sci. 2023, 13, 1378. [Google Scholar] [CrossRef]
  5. Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Chen, X. Explainable Recommendation: A Survey and New Perspectives. Found. Trends Inf. Retr. 2020, 14, 1–101. [Google Scholar] [CrossRef]
  7. Wang, Y.; Ma, W.; Zhang, M.; Liu, Y.; Ma, S. A Survey on the Fairness of Recommender Systems. ACM Trans. Inf. Syst. 2023, 41, 52. [Google Scholar] [CrossRef]
  8. Deldjoo, Y.; Jannach, D.; Bellogín, A.; Difonzo, A.; Zanzonelli, D. Fairness in Recommender Systems: Research Landscape and Future Directions. User Model. User Adapt. Interact. 2024, 34, 59–108. [Google Scholar] [CrossRef]
  9. High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI; European Commission: Brussels, Belgium, 2019.
  10. Rossetti, S. The Court of Justice of the European Union Confirms the Existence of the Right to Explanation of Automated Decision-Making. European Law Blog, 7 April 2025. Available online: https://www.europeanlawblog.eu/pub/lwchuopd/release/1 (accessed on 8 December 2025).
  11. Smith, J.J.; Beattie, L.; Cramer, H. Scoping Fairness Objectives and Identifying Fairness Metrics for Recommender Systems: The Practitioners’ Perspective. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April 2023–4 May 2023. [Google Scholar]
  12. Moult, A.I.L.; Mah, H.W.H.; Williams, R.J.A.D. A Scoping Review of Explainable Artificial Intelligence (XAI) in c. Appl. Sci. 2023, 13, 4006. [Google Scholar]
  13. Ribeiro, M.T.; Singh, S.; Guestrin, C. ‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  14. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017. [Google Scholar]
  15. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2020, 2, 56–67. [Google Scholar] [CrossRef] [PubMed]
  16. Laberge, G.; Pequignot, Y. Understanding Interventional TreeSHAP: How and Why it Works. arXiv 2022, arXiv:2209.15123. [Google Scholar]
  17. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; pp. 9525–9536. [Google Scholar]
  18. Jain, S.; Wallace, B.C. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 3543–3556. [Google Scholar]
  19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  20. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Müller, A.; Nothman, J.; Louppe, G.; et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  21. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
  22. Selbst, A.; Boyd, D.; Friedler, S.; Venkatasubramanian, S.; Vertesi, J. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 59–68. [Google Scholar]
  23. IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. Ethically Aligned Design: A Vision for Prioritizing Human Well-being with Autonomous and Intelligent Systems. In Proceedings of the 2017 IEEE Canada International Humanitarian Technology Conference (IHTC), Toronto, ON, Canada, 21–22 July 2017. [Google Scholar]
  24. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 28 November 2022–9 December 2022; pp. 507–520. [Google Scholar]
  25. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2015, 5, 19. [Google Scholar] [CrossRef]
  26. Hou, Y.; Li, J.; He, Z.; Yan, A.; Chen, X.; McAuley, J. Bridging language and items for retrieval and recommendation. arXiv 2024, arXiv:2403.03952. [Google Scholar] [CrossRef]
  27. Lu, S.; Zhang, Y.; Li, X. Filtering with Time-Frequency Analysis: An Adaptive and Lightweight Model for Sequential Recommender Systems. arXiv 2025, arXiv:2503.23436. [Google Scholar] [CrossRef]
  28. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R, 2nd ed.; Springer: New York, NY, USA, 2021. [Google Scholar]
  29. Jindal, N.; Liu, B. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 219–230. [Google Scholar]
  30. Mukherjee, A.; Kumar, A.; Liu, B.; Wang, J.; Hsu, M.; Castellanos, M.; Ghosh, R. Spotting opinion spammers using behavioral footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 632–640. [Google Scholar]
  31. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 29–31 January 2019; pp. 220–229. [Google Scholar]
  32. Balasubramaniam, N.; Kauppinen, M.; Rannisto, A.; Hiekkanen, K.; Kujala, S. Transparency and explainability of AI systems: From ethical guidelines to requirements. Inf. Softw. Technol. 2023, 159, 107197. [Google Scholar] [CrossRef]
  33. Zhou, X.; Wang, W.; Buntine, W.; Bergmeir, C. Context-driven cold-start Web traffic forecasting. World Wide Web 2025, 28, 60. [Google Scholar] [CrossRef]
  34. Yang, T.; Ying, Y. AUC Maximization in the Era of Big Data and AI: A Survey. ACM Comput. Surv. 2022, 55, 172. [Google Scholar] [CrossRef]
  35. Hancock, J.T.; Khoshgoftaar, T.M.; Johnson, J.M. Evaluating classifier performance with highly imbalanced Big Data. J. Big Data 2023, 10, 1–28. [Google Scholar] [CrossRef]
  36. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 2018, 51, 93. [Google Scholar] [CrossRef]
  37. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. [Google Scholar] [CrossRef]
  38. Bojić, L. AI alignment: Assessing the global impact of recommender systems. Futures 2024, 160, 103383. [Google Scholar] [CrossRef]
  39. Shimizu, R.; Matsutani, M.; Goto, M. An explainable recommendation framework based on an improved knowledge graph attention network with massive volumes of side information. Knowl.-Based Syst. 2022, 239, 107970. [Google Scholar] [CrossRef]
  40. Gnasso, A.; Aria, M. “Can You Explain That?” E2Tree, SHAP, and LIME for Interpretable Random Forests. In Supervised and Unsupervised Statistical Data Analysis; CLADAG-VOC 2025, ser. Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar]
  41. Shahbazi, Z.; Jalali, R.; Shahbazi, Z. Enhancing Recommendation Systems with Real-Time Adaptive Learning and Multi-Domain Knowledge Graphs. Big Data Cogn. Comput. 2025, 9, 124. [Google Scholar] [CrossRef]
  42. Yuan, Y.; Chen, L.; Yang, J. A multidimensional model for recommendation systems based on classification and entropy. Electronics 2023, 12, 402. [Google Scholar] [CrossRef]
  43. Pérez, P.; Buitelaar, P.; Díez, J.; Luaces, O.; Bahamonde, A. Attention-inspired text-based recommender system with explanatory capabilities. Appl. Soft Comput. 2025, 184, 113650. [Google Scholar] [CrossRef]
  44. Zhu, X.; Xia, X.; Wu, Y.; Zhao, W. Enhancing explainable recommendations: Integrating reason generation and rating prediction through multi-task learning. Appl. Sci. 2024, 14, 8303. [Google Scholar] [CrossRef]
  45. Vultureanu-Albiși, A.; Murarețu, I.; Bădică, C. Explainable Recommender Systems Through Reinforcement Learning and Knowledge Distillation on Knowledge Graphs. Information 2025, 16, 282. [Google Scholar] [CrossRef]
  46. Ortega, F.; González, Á. Recommender systems and collaborative filtering. Appl. Sci. 2020, 10, 7050. [Google Scholar] [CrossRef]
  47. Mosquera, C.; Ferrer, L.; Milone, D.H.; Luna, D.; Ferrante, E. Class imbalance on medical image classification: Towards better evaluation practices for discrimination and calibration performance. Eur. Radiol. 2024, 34, 7895–7903. [Google Scholar] [CrossRef]
  48. De Biasio, A.; Montagna, A.; Aiolli, F.; Navarin, N. A systematic review of value-aware recommender systems. Expert Syst. Appl. 2023, 219, 119659. [Google Scholar] [CrossRef]
  49. Joloudari, J.H.; Marefat, A.; Nematollahi, M.A.; Oyelere, S.S.; Hussain, S. Effective Class-Imbalance Learning Based on SMOTE and Convolutional Neural Networks. Information 2023, 11, 4006. [Google Scholar] [CrossRef]
  50. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  51. Diakopoulos, N. Algorithmic Accountability. Digit. J. 2015, 3, 398–415. [Google Scholar] [CrossRef]
  52. Kroll, J.A.; Huey, J.; Barocas, S.; Felten, E.W.; Reidenberg, J.R.; Robinson, D.G.; Yu, H. Accountable Algorithms. Univ. Pa. Law Rev. 2017, 165, 633–705. [Google Scholar]
  53. Raji, I.D.; Smart, A.; White, R.N.; Mitchell, M.; Gebru, T.; Hutchinson, B.; Smith-Loud, J.; Theron, D.; Barnes, P. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 33–44. [Google Scholar]
  54. Madaio, M.; Stark, S.; Vaughan, J.W.; Wallach, H. Co-Designing Audits for Algorithmic Accountability. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–14. [Google Scholar]
  55. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for Datasets. Commun. ACM 2021, 64, 86–92. [Google Scholar] [CrossRef]
  56. Tabassi, E. Artificial Intelligence Risk Management Framework (AI RMF 1.0); NIST AI 100-1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023. [Google Scholar]
  57. ISO/IEC 23894:2023; Information Technology—Artificial Intelligence—Guidance on Risk Management. International Organization for Standardization: Geneva, Switzerland, 2023.
  58. Azri, A.; Boukhalfa, A.; Boughaci, M. IUAutoTimeSVD++: A Hybrid Temporal Recommender System Integrating Item and User Features Using a Contractive Autoencoder. Information 2024, 15, 204. [Google Scholar] [CrossRef]
  59. Mosca, E.; Szigeti, F.; Tragianni, S.; Gallagher, D.; Groh, G. SHAP-Based Explanation Methods: A Review for NLP Interpretability. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4593–4603. [Google Scholar]
  60. Imani, M.; Beikmohammadi, A.; Arabnia, H.R. Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels. Technologies 2025, 13, 88. [Google Scholar] [CrossRef]
  61. Goldstein, A.; Kapelner, A.; Bleich, J.; Pitkin, E. Peeking Inside the Black Box: Visualizing Statistical Learning with Partial Dependence and ICE. J. Comput. Graph. Stat. 2015, 24, 676–703. [Google Scholar] [CrossRef]
  62. Vimbi, V.; Shaffi, N.; Mahmud, M. Interpreting artificial intelligence models: A systematic review on the application of LIME and SHAP in Alzheimer’s disease detection. Brain Inform. 2024, 11, 10. [Google Scholar] [CrossRef]
  63. de Campos, L.M.; Fernández, J.M.; Huete, J.F. An explainable content-based approach for recommender systems: A case study in journal recommendation for paper submission. User Model. User-Adapt. Interact. 2024, 34, 1431–1465. [Google Scholar] [CrossRef]
  64. Parisineni, S.R.A.; Pal, M. Enhancing trust and interpretability of complex machine learning models using local interpretable model agnostic shap explanations. Int. J. Data Sci. Anal. 2023, 18, 457–466. [Google Scholar] [CrossRef]
  65. Van den Broeck, G.; Lykov, A.; Schleich, M.; Suciu, D. On the tractability of SHAP explanations. J. Artif. Intell. Res. 2022, 74, 851–886. [Google Scholar] [CrossRef]
  66. Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput. Methods Programs Biomed. 2022, 214, 106584. [Google Scholar] [CrossRef]
  67. Montano, I.H.; Aranda, J.J.G.; Diaz, J.R.; Cardín, S.M.; de la Torre Díez, I.; Rodrigues, J.J.P.C. Survey of techniques on data leakage protection and methods to address the insider threat. Clust. Comput. 2022, 25, 4289–4302. [Google Scholar] [CrossRef]
  68. Mane, D.; Magar, A.; Khode, O.; Koli, S.; Bhat, K.; Korade, P. Unlocking machine learning model decisions: A comparative analysis of LIME and SHAP for enhanced interpretability. J. Electr. Syst. 2024, 20, 1252–1267. [Google Scholar]
  69. Shehu, A.; Mohammed, A.S.; Abdussalam, A. Interpretability of the Black-Box Machine Learning Models: A Review. Appl. Sci. 2023, 13, 11378. [Google Scholar]
Figure 1. End-to-end architecture of the proposed glass-box approach for RS. The diagram illustrates the complete workflow, including data preprocessing, temporal feature engineering, the ‘Reality Check’ emerging-item selection, comparative model training, and explainability layers.
Figure 2. Temporal Features Correlation Heatmap.
Figure 3. Multi-Layered Governance Audit Workflow.
Figure 4. ROC curves comparing the predictive performance of the four models under the default policy ($\tau_{count} = 60$, $\tau_{avg} = 3.0$, $\tau_{neg} = 0.6$). The plots illustrate the trade-off between the true positive rate and the false positive rate at various thresholds. The Glass-Box Random Forest (RF) and XGBoost achieve the highest AUCs of 0.92 and 0.91, respectively, demonstrating superior capability in distinguishing emerging hits from standard items compared with the Logistic Regression (AUC = 0.89) and MLP (AUC = 0.86) baselines.
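For reproducibility, the comparison in Figure 4 can be recreated with a minimal sketch like the following (not the authors' released code; `models` is assumed to be a dict of fitted binary classifiers, and `X_test`/`y_test` a hold-out split):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_comparison(models, X_test, y_test):
    """Overlay one ROC curve per fitted classifier and report each AUC."""
    for name, model in models.items():
        scores = model.predict_proba(X_test)[:, 1]  # P(emerging hit)
        fpr, tpr, _ = roc_curve(y_test, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "k--", label="Chance")  # random-guess diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend(loc="lower right")
    plt.show()
```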
Figure 5. Comparative Confusion Matrices on the Hold-out Test Set under optimised decision thresholds. The grid displays the error distribution for the four evaluated architectures, highlighting the trade-off between discovery support and precision (dark blue indicates high counts of true non-hit predictions, while light blue highlights lower counts of true emerging hits and associated errors).
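A short sketch of how such matrices are derived once a decision threshold has been tuned; the `threshold` argument stands in for the optimised values of the kind reported in Table 8:

```python
from sklearn.metrics import confusion_matrix

def thresholded_confusion(model, X_test, y_test, threshold):
    """Confusion matrix when an item is flagged only if P(hit) >= threshold."""
    scores = model.predict_proba(X_test)[:, 1]
    y_pred = (scores >= threshold).astype(int)
    return confusion_matrix(y_test, y_pred)  # layout: [[TN, FP], [FN, TP]]
```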
Figure 6. Global Feature Importance Analysis Using SHAP.
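A global ranking of this kind can be produced, under assumptions, with the `shap` library; `rf_model` and `X_test` are placeholders, and depending on the shap version the positive-class attributions may need to be selected explicitly:

```python
import shap

explainer = shap.TreeExplainer(rf_model)   # exact TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X_test)
# Some shap versions return one array per class for binary classifiers;
# keep the attributions for the positive ("emerging hit") class.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(vals, X_test, plot_type="bar")  # mean |SHAP| per feature
```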
Figure 7. SHAP Force Plot for a Confident True Positive Prediction.
Figure 8. SHAP Force Plot for a Borderline Positive Prediction.
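Continuing the TreeSHAP sketch above, a single-item force plot such as those in Figures 7 and 8 decomposes one prediction into per-feature pushes around the base value (the item index `i` is illustrative):

```python
i = 0  # index of the test item being explained
base = explainer.expected_value
base = base[1] if hasattr(base, "__len__") else base  # positive-class base value
shap.force_plot(base, vals[i], X_test.iloc[i], matplotlib=True)
```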
Figure 9. LIME explanations for three canonical prediction cases. (a) LIME Local Explanation for the Confident_Success case; (b) LIME Local Explanation for the Confident_Failure case; (c) LIME Local Explanation for the Borderline_Case.
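An indicative sketch of generating such local surrogates with the `lime` package; the feature and class names below are placeholders, not the paper's exact configuration:

```python
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["standard", "emerging hit"],
    mode="classification",
)
exp = lime_explainer.explain_instance(
    X_test.values[0], model.predict_proba, num_features=8
)
exp.as_pyplot_figure()  # signed local weights, as visualised in Figure 9
```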
Figure 10. Stability and Sensitivity Analysis. (a) Temporal sensitivity of negative sentiment across different decades; (b) Interaction heatmap displaying sentiment versus early popularity density; (c) Comparison of sentiment impact on low versus moderate momentum items; (d) ICE plot illustrating the sensitivity of 50 emerging items (grey lines) compared to the average effect (red line).
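The ICE panel in Figure 10d corresponds to scikit-learn's individual-curve partial dependence display; `rf_model`, `X_sample`, and the `neg_sentiment` column name are assumptions for illustration:

```python
from sklearn.inspection import PartialDependenceDisplay

# kind="both" draws per-item ICE lines plus their average
# (the grey lines and red line, respectively, in Figure 10d).
PartialDependenceDisplay.from_estimator(
    rf_model,
    X_sample,                    # e.g., a sample of 50 emerging items
    features=["neg_sentiment"],  # hypothetical column for tag negativity
    kind="both",
)
```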
Figure 11. Fairness Audit. The bar chart displays Equality of Opportunity (Recall) across subgroups. The red dashed line represents the global recall.
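A minimal sketch of the underlying audit computation, assuming aligned arrays of true labels, thresholded predictions, and subgroup identifiers (e.g., genre):

```python
import pandas as pd
from sklearn.metrics import recall_score

def recall_audit(y_true, y_pred, groups):
    """Equality of opportunity: recall per subgroup versus global recall."""
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "group": groups})
    global_recall = recall_score(df["y"], df["pred"])
    per_group = df.groupby("group").apply(
        lambda g: recall_score(g["y"], g["pred"])
    )
    # Subgroups falling well below global_recall warrant investigation.
    return global_recall, per_group.sort_values()
```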
Figure 12. ROC Curve for Cross-domain Validation on Amazon Reviews 23 (Category: Electronics.test) Dataset [26].
Table 1. Comparative summary of related work and the proposed glass-box framework.

| Approach | Model Type | Interpretability | Explanation Method | Fairness | Governance Readiness |
|---|---|---|---|---|---|
| KG-based Intrinsic XAI (Shimizu et al., 2022) [39] | Knowledge Graph Attention Network | High (model-intrinsic) | Attention-weighted paths | Limited | High (traceable paths) |
| MF Deep Learning-based RS (Kumar et al., 2024) [1] | Deep Neural Network (e.g., MLP, CNN) | Very low (black box) | Post hoc (e.g., SHAP, LIME, attention visualisation) | Limited (fairness and privacy challenges noted) | Low (interpretability and compliance remain open research issues) |
| Deep Learning-based RS (Zhou et al., 2023) [4] | Deep Neural Network (e.g., MLP, RNN, Transformer) | Very low (black box) | Post hoc (e.g., SHAP, LIME, attention visualisation) | Limited (fairness and privacy challenges noted) | Low |
| Value-Aware RS (De Biasio et al., 2023) [48] | Varies (re-ranking, RL, etc.) | Varies (depends on method) | Varies | Can be integrated | Moderate (exposes value trade-offs) |
| Text-based Intrinsic XAI (Pérez-Núñez et al., 2025) [43] | Language-based attention | High (model-intrinsic) | Signed word-level scores | Limited (handles cold-start) | High (transparent scoring logic) |
| Proposed Glass-Box Model | RF, XGBoost, MLP, LogReg | High (policy-led & features) | TreeSHAP (exact), LIME (local surrogate) | Addressed via class imbalance handling (SMOTE) | High (policy-aligned, auditable) |
Table 2. Data sources from the MovieLens dataset.

| File | Purpose | Key Columns Used |
|---|---|---|
| movies.csv | Serves as the item catalogue with metadata | Movie ID, title, genres |
| ratings.csv | Provides explicit user feedback and engagement signals | User ID, movie ID, rating, timestamp |
| tags.csv | Contains user-generated tags for sentiment analysis | User ID, movie ID, tag, timestamp |
Table 3. Sample data from movies.csv.

| Index | MovieId | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
Table 4. Sample data from tags.csv.

| Index | UserId | MovieId | Tag |
|---|---|---|---|
| 0 | 22 | 26479 | Kevin Kline |
| 1 | 22 | 79592 | misogyny |
| 2 | 22 | 247150 | acrophobia |
| 3 | 34 | 2174 | music |
| 4 | 34 | 2174 | weird |
Table 5. Sample data from ratings.csv.

| Index | UserId | MovieId | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 17 | 4.0 | 944249077 |
| 1 | 1 | 25 | 1.0 | 944250228 |
| 2 | 1 | 29 | 2.0 | 943230976 |
| 3 | 1 | 30 | 5.0 | 944249077 |
| 4 | 1 | 32 | 5.0 | 943228858 |
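For reference, a minimal pandas sketch for loading the three files in the layout shown in Tables 2-5 (file names as in Table 2):

```python
import pandas as pd

movies = pd.read_csv("movies.csv")    # movieId, title, genres
ratings = pd.read_csv("ratings.csv")  # userId, movieId, rating, timestamp
tags = pd.read_csv("tags.csv")        # userId, movieId, tag, timestamp

# Genres are pipe-delimited strings; split them for per-genre features.
movies["genre_list"] = movies["genres"].str.split("|")
```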
Table 6. Sample data from Amazon Reviews 23 (Category: Electronics).

| User_Id | Parent_Asin | Rating | Text | Timestamp |
|---|---|---|---|---|
| A2E6HJKW8SKJ90ASDLKJ3489SDF | B081TJ8YS3 | 4.0 | This has excellent sound quality for the price. The battery life is decent, but the setup process was fiddly. | 1588615855070 |
| AFK3429SK234234SDKL23490234S | B00TVF123A | 5.0 | Absolutely brilliant! It exceeded all my expectations. Highly recommend the wireless charging feature. | 1589123456780 |
| AG789SDF789SDF789SDF789SDF78 | B07XQZ901C | 2.0 | Very disappointed with the slow responsiveness and the cheap plastic feel. This tablet started glitching after a week of use. | 1590234567890 |
| AE123ASDASD123123ASDASD12312 | B082WXY34D | 1.0 | The product arrived damaged, and the returns process was terrible. I would advise against purchasing this camera. | 1591345678901 |
| AH8923KL23908SDFLKJ234890SDF | B093ZAB12Y | 3.0 | The connection drops occasionally, but the picture quality is decent for a budget monitor. | 1593567890123 |
Table 7. Comparative performance of RF, XGBoost, MLP, Logistic Regression Models.

| Model | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Random Forest | 0 | 0.98 | 0.97 | 0.96 |
| | 1 | 0.58 | 0.42 | 0.49 |
| | Accuracy | | | 0.97 |
| XGBoost | 0 | 0.98 | 0.99 | 0.98 |
| | 1 | 0.65 | 0.40 | 0.50 |
| | Accuracy | | | 0.97 |
| MLP | 0 | 0.98 | 0.99 | 0.98 |
| | 1 | 0.65 | 0.39 | 0.47 |
| | Accuracy | | | 0.96 |
| Logistic Regression | 0 | 0.98 | 0.98 | 0.98 |
| | 1 | 0.50 | 0.37 | 0.47 |
| | Accuracy | | | 0.95 |
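These per-class figures follow the layout of scikit-learn's classification report; a minimal sketch, assuming the same `models` dict and hold-out split used in the earlier sketches:

```python
from sklearn.metrics import classification_report

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred, digits=2))
```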
Table 8. Sensitivity Analysis of Policy Thresholds ($\tau_{count}$, $\tau_{avg}$, $\tau_{neg}$) Demonstrating AUC, Recall, and F1 Across 27 Configurations.

| $\tau_{count}$ | $\tau_{avg}$ | $\tau_{neg}$ | AUC | Recall | F1 | Opt Threshold | Support |
|---|---|---|---|---|---|---|---|
| 50 | 2.5 | 0.5 | 0.99 | 0.78 | 0.72 | 0.92 | 1382 |
| 50 | 2.5 | 0.6 | 0.92 | 0.43 | 0.49 | 0.77 | 3891 |
| 50 | 2.5 | 0.7 | 0.92 | 0.43 | 0.49 | 0.77 | 3891 |
| 50 | 3.0 | 0.5 | 0.99 | 0.73 | 0.66 | 0.91 | 1099 |
| 50 | 3.0 | 0.6 | 0.92 | 0.40 | 0.47 | 0.79 | 3024 |
| 50 | 3.0 | 0.7 | 0.92 | 0.40 | 0.47 | 0.79 | 3024 |
| 50 | 3.5 | 0.5 | 0.99 | 0.65 | 0.55 | 0.92 | 470 |
| 50 | 3.5 | 0.6 | 0.93 | 0.44 | 0.46 | 0.76 | 1298 |
| 50 | 3.5 | 0.7 | 0.93 | 0.44 | 0.46 | 0.76 | 1298 |
| 70 | 2.5 | 0.5 | 0.99 | 0.86 | 0.65 | 0.85 | 1198 |
| 70 | 2.5 | 0.6 | 0.93 | 0.43 | 0.48 | 0.78 | 3087 |
| 70 | 2.5 | 0.7 | 0.93 | 0.43 | 0.48 | 0.78 | 3087 |
| 70 | 3.0 | 0.5 | 0.99 | 0.77 | 0.64 | 0.88 | 958 |
| 70 | 3.0 | 0.6 | 0.92 | 0.42 | 0.48 | 0.78 | 2447 |
| 70 | 3.0 | 0.7 | 0.92 | 0.42 | 0.48 | 0.78 | 2447 |
| 70 | 3.5 | 0.5 | 1.00 | 0.51 | 0.55 | 0.96 | 446 |
| 70 | 3.5 | 0.6 | 0.92 | 0.38 | 0.44 | 0.82 | 1095 |
| 70 | 3.5 | 0.7 | 0.92 | 0.38 | 0.44 | 0.82 | 1095 |
| 90 | 2.5 | 0.5 | 0.99 | 0.74 | 0.66 | 0.91 | 1112 |
| 90 | 2.5 | 0.6 | 0.93 | 0.44 | 0.50 | 0.78 | 2645 |
| 90 | 2.5 | 0.7 | 0.93 | 0.44 | 0.50 | 0.78 | 2645 |
| 90 | 3.0 | 0.5 | 0.99 | 0.77 | 0.66 | 0.90 | 906 |
| 90 | 3.0 | 0.6 | 0.93 | 0.41 | 0.48 | 0.79 | 2134 |
| 90 | 3.0 | 0.7 | 0.93 | 0.41 | 0.48 | 0.79 | 2134 |
| 90 | 3.5 | 0.5 | 0.99 | 0.62 | 0.58 | 0.94 | 424 |
| 90 | 3.5 | 0.6 | 0.93 | 0.47 | 0.49 | 0.79 | 973 |
| 90 | 3.5 | 0.7 | 0.93 | 0.47 | 0.49 | 0.79 | 973 |
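The 27 configurations enumerate every combination of the three policy thresholds; a hedged sketch of such a sweep, where `build_labels` and `fit_and_evaluate` are hypothetical helpers standing in for the paper's labelling policy and training loop:

```python
from itertools import product

results = []
for t_count, t_avg, t_neg in product([50, 70, 90], [2.5, 3.0, 3.5], [0.5, 0.6, 0.7]):
    y = build_labels(items, t_count, t_avg, t_neg)  # hypothetical: policy-defined target
    metrics = fit_and_evaluate(features, y)         # hypothetical: returns AUC, recall, F1, ...
    results.append(
        {"tau_count": t_count, "tau_avg": t_avg, "tau_neg": t_neg, **metrics}
    )
```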
Table 9. Top 10 Forecasted Emerging Media Hits (Ranked by Model Confidence).

| Rank | Movie Title | Year | Prediction Score | Observed Count | Observed Avg Rating |
|---|---|---|---|---|---|
| 1 | Denial | 2016 | 0.987 | 62 | 3.58 |
| 2 | Queen of Katwe | 2016 | 0.986 | 48 | 3.68 |
| 3 | Zero Motivation | 2014 | 0.986 | 57 | 3.75 |
| 4 | The Road Within | 2014 | 0.984 | 67 | 3.46 |
| 5 | Ascension | 2014 | 0.984 | 69 | 3.41 |
| 6 | About Alex | 2014 | 0.984 | 49 | 3.44 |
| 7 | Boys | 2014 | 0.983 | 63 | 3.93 |
| 8 | The Assassin | 2015 | 0.983 | 65 | 3.18 |
| 9 | Before I Disappear | 2014 | 0.983 | 57 | 3.65 |
| 10 | The Program | 2015 | 0.983 | 51 | 3.14 |
Table 10. Performance metrics for Validation Pilot on Amazon Reviews 23 (Category: Electronics.test) Dataset [26].

| Model | Class | Precision | Recall | F1-Score | Accuracy | Support |
|---|---|---|---|---|---|---|
| Random Forest | 0 | 0.98 | 0.94 | 0.96 | 0.92 | 16,349 |
| | 1 | 0.26 | 0.60 | 0.36 | | 241 |
| XGBoost | 0 | 0.98 | 0.93 | 0.95 | 0.92 | 16,349 |
| | 1 | 0.26 | 0.60 | 0.36 | | 241 |
| MLP | 0 | 0.98 | 0.92 | 0.95 | 0.91 | 16,349 |
| | 1 | 0.27 | 0.61 | 0.37 | | 241 |
| Logistic Regression | 0 | 0.98 | 0.93 | 0.95 | 0.93 | 16,349 |
| | 1 | 0.25 | 0.60 | 0.35 | | 241 |
Table 11. Comparison across performance, interpretability, governance readiness, and fairness handling.

| Dimension | Deep Learning RS (e.g., MLP, NCF) | Hybrid XAI Architectures | Proposed Glass-Box Approach |
|---|---|---|---|
| Predictive Performance [39,43,44] | State-of-the-art | High | High (superior in structured prediction) |
| Model Interpretability [6,62] | Very low (black-box) | Moderate to high (model-intrinsic but complex) | High (inherently interpretable) |
| Explanation Faithfulness [6,18] | Low to variable (post hoc surrogates often unfaithful) | Variable (generated rationales may diverge from scoring logic) | High (exact TreeSHAP attributions) |
| Governance Readiness [6,62] | Low (opacity hinders auditing and compliance) | Moderate (improved transparency but faithfulness concerns remain) | High (policy-aligned and auditable) |
| Computational Cost [39,44] | High | Moderate to high | Low to moderate |
| Fairness Handling [7,8] | Limited (often post-processing) | Limited (partial integration) | Integrated (SMOTE and subgroup audit) |