1. Introduction
The global food system is under mounting pressure to simultaneously ensure human health and environmental sustainability [1]. Within this context, food allergies have emerged as a pervasive public health concern worldwide, affecting individuals across all demographics. For affected populations, strict avoidance of allergen exposure is an absolute medical necessity, reinforced by stringent regulatory frameworks, including mandatory labeling policies enforced by the Japan Consumer Affairs Agency (CAA) [2]. These compounded challenges underscore the critical need for intelligent systems that support safe, health-conscious, and nutritionally adequate dietary decision-making.
Despite ongoing public health efforts, managing allergy-restricted diets remains highly complex. Beyond the mere avoidance of explicit allergens, individuals must also navigate the risks of cross-reactivity, wherein structurally homologous proteins can trigger unintended and potentially severe immune responses [3]. Concurrently, dietary safety must not come at the expense of nutritional balance, a priority emphasized by national public health initiatives such as “Health Japan 21” and the Dietary Reference Intakes (DRIs) [4,5]. This creates an inherent and persistent tension among safety, nutritional adequacy, and user satisfaction, which represents a trilemma that remains insufficiently addressed in existing computational dietary approaches.
Despite extensive study, food recommender systems often fail to meet the stringent requirements of users with food allergies. Conventional systems typically optimize for either general user preferences or broad health objectives, lacking rigorous allergen risk modeling that accounts for cross-reactivity and cooking-induced biochemical transformations [6,7,8]. Conversely, systems focused on nutritional optimization frequently disregard the visual and hedonic factors that profoundly influence human food choices [9,10]. Existing mobile applications for allergy management similarly suffer from limited clinical reliability and a lack of adaptive personalization [11,12]. A critical point of divergence in current research is how to treat safety: as a preference-based, fully compensatory objective or as a safety-prioritized ranking mechanism. Although some argue that flexible, preference-aware filtering improves user satisfaction, we hypothesize that, in health-critical contexts, structural safety implicitly takes precedence over aesthetic or hedonic factors, regardless of explicitly stated preferences. Furthermore, current algorithmic approaches rely heavily on explicitly stated user preferences, neglecting the well-documented discrepancies between stated preferences and actual interaction behaviors, particularly in risk-sensitive contexts. Consequently, there remains no unified, principled framework that jointly models allergen safety, nutritional balance, and real-world user behavior.
To bridge these gaps, this study proposes an allergen-aware recipe recommendation framework that integrates safety constraints, nutritional optimization, and user-centered aesthetic preferences. We formulate the recommendation problem as a structured multi-objective ranking task. Crucially, allergen safety is operationalized as a primary safety-aware ranking objective rather than a secondary preference, with nutritional balance and visual appeal incorporated as complementary optimization objectives within the safe feasible region. This paper is a significant extension of our preliminary work presented in [13]. While our previous study demonstrated the feasibility of a multi-objective scoring framework, it treated allergen safety as a simple additive goal rather than thoroughly investigating its role as a safety-prioritized ranking component. In this paper, we provide a deeper theoretical re-characterization of safety and a substantially expanded empirical evaluation. Specifically, we introduce a systematic weight robustness analysis to identify stable performance regions, a standardized usability evaluation using the System Usability Scale (SUS), and a novel behavioral analysis that identifies the “decision–action discrepancy” in risk-sensitive scenarios, providing critical insights that remained unexplored in our initial study. Through these advancements, we provide the following contributions:
Multi-Objective Integration with Safety-Aware Ranking: We conceptualize and model allergen safety not merely as a binary filter or an additive scoring component, but as a foundational safety-aware ranking component using a penalty-based mechanism. This enables nuanced, graded risk-aware recommendations while stabilizing the overall preference space.
Robustness Analysis of the Unified Scoring Framework: We develop a comprehensive scoring architecture that seamlessly integrates safety risk evaluation, nutritional profiling, and visual appeal assessment. Furthermore, we present an extensive sensitivity analysis that maps the performance landscape, identifying a robust high-performance plateau for these integrated objectives.
Behavioral Insight through Empirical Evaluation: We conduct a controlled user study to evaluate the performance of multi-objective ranking. The analysis uncovers the “decision–action discrepancy,” which refers to systematic divergences between explicit user preferences and actual implicit decision-making behaviors in risk-sensitive scenarios. This finding provides new design implications for health-critical recommendations.
Ultimately, these findings provide compelling empirical evidence for the necessity of incorporating safety-aware prioritization mechanisms and behavior-aware modeling into the design of trustworthy, health-critical recommender systems.
The remainder of this paper is organized as follows. Section 2 reviews related work in health and allergen-aware recommender systems, as well as visual aesthetics and multi-objective evaluation. Section 3 details the proposed Allergen-Aware Cooking Recipe Recommender System and its methodology, which includes the mathematical modeling of the three scoring modules. Section 4 describes the experimental design and evaluation metrics. Section 5 presents the quantitative results, while Section 6 provides an in-depth discussion on the decision–action discrepancy observed in user evaluations. Finally, Section 7 concludes the paper and outlines future research directions.
3. Allergen-Aware Cooking Recipe Recommender System
This section details the proposed allergen-aware recipe recommendation framework and the data-driven methodology employed to evaluate its performance. We first provide a high-level overview of the system architecture, followed by a rigorous mathematical formulation of the three core scoring modules: Allergen Safety, Nutritional Balance, and Visual Appeal. Finally, we define the multi-objective ranking strategies utilized to aggregate these dimensions into optimized recommendation lists.
3.1. System Architecture and Workflow
Figure 1 illustrates the proposed system architecture. It consists of four primary components designed to facilitate reproducible dietary decision-making:
Recipe Repository: A centralized database containing recipe metadata, including precise ingredient quantities, standardized nutritional profiles, and high-resolution culinary images.
Scoring Modules: Three independent computational modules that quantify Allergen Safety (S), Nutritional Balance (N), and Visual Appeal (V).
Multi-Objective Re-ranker: An aggregation engine that applies dynamic weighting strategies to prioritize health-critical safety factors while maintaining hedonic appeal.
Web-Based Interface: An interactive platform that enables users to browse recommendations and adjust preferences while simultaneously facilitating the collection of explicit preference data and implicit behavioral logs.
The workflow follows a two-stage process: (1) Hard Filtering, which excludes recipes containing a user’s primary allergens, and (2) Multi-Objective Ranking, where the remaining candidates are ranked using the scoring modules described below.
3.2. Safety-Aware Risk Modeling
Although the initial filtering removes immediate health risks, a binary “safe/unsafe” classification is insufficient for high-risk users. Therefore, we introduce a safety-aware risk modeling mechanism that combines hard allergen filtering with graded penalty-based ranking. Rather than acting as a strict optimization constraint, the safety module serves as a safety-aware ranking component that strongly suppresses recipes carrying residual allergen risks.
The Safety Score (S) is defined as a penalty-based function that captures varying degrees of risk rather than a binary safe/unsafe distinction:
The components of the Safety Score (S) are defined as follows:
3.2.1. Allergen and Cross-Reactivity Penalties
The allergen penalty reflects the intrinsic risk of ingredients based on the “8 Specified Ingredients” (mandatory labeling) and “20 Recommended Ingredients” defined by the Japan Consumer Affairs Agency (CAA) [2]. In the first stage, the system filters out the user’s explicitly declared primary allergens. This term then introduces a soft penalty for the presence of any remaining high-risk ingredients from this 28-item list. This mechanism ensures the system favors generally hypoallergenic recipes, preventing it from casually substituting one common allergen for another potentially risky ingredient.
The cross-reactivity penalty accounts for cases where proteins in non-allergen ingredients structurally resemble allergens [3]. For instance, a user allergic to shrimp (crustacean) might react to crab or lobster. We utilize cross-reactivity groups (e.g., Group G2 for crustaceans, G3 for latex-fruit syndrome) to assign penalties. If an ingredient belongs to the same cross-reactivity group as the user’s allergen, a fixed penalty (e.g., 5 points) is applied [19]. This specific penalty value is calibrated as a high-magnitude threshold relative to the preference scoring range. By setting the penalty at this magnitude, the system ensures that cross-reactive ingredients are strongly penalized and effectively demoted below the baseline preferences, preventing accidental exposure based on clinical risk categories documented in medical literature [3].
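The cross-reactivity rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's implementation: the group table, its member ingredients, and the function name `cross_reactivity_penalty` are hypothetical, and summing the penalty per ingredient is an assumed aggregation. Only the fixed 5-point penalty and the G2/G3 group labels come from the text.

```python
# Hypothetical cross-reactivity table; only the G2/G3 labels come from the text.
CROSS_REACTIVITY_GROUPS = {
    "G2": {"shrimp", "crab", "lobster"},           # crustaceans
    "G3": {"latex", "banana", "avocado", "kiwi"},  # latex-fruit syndrome (illustrative members)
}
CROSS_PENALTY = 5.0  # fixed penalty per cross-reactive ingredient, as in the text


def cross_reactivity_penalty(ingredients, user_allergens):
    """Add a fixed penalty for each ingredient sharing a group with a user allergen."""
    user_allergens = set(user_allergens)
    # Groups that contain at least one of the user's declared allergens.
    risky_groups = [members for members in CROSS_REACTIVITY_GROUPS.values()
                    if members & user_allergens]
    penalty = 0.0
    for ingredient in ingredients:
        if ingredient in user_allergens:
            continue  # primary allergens are handled by the hard filter, not here
        if any(ingredient in members for members in risky_groups):
            penalty += CROSS_PENALTY
    return penalty
```

For a shrimp-allergic user, a recipe containing crab and lobster would accumulate two penalties in this sketch, while a recipe with no crustaceans would accumulate none.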
3.2.2. Heating Requirement Penalty
Certain ingredients, such as eggs or shellfish, may pose risks if undercooked, even if not strictly allergenic. We assign penalties based on the Tokyo Metropolitan Government’s food hygiene standards [8]. Ingredients requiring higher internal temperatures for sterilization are assigned higher penalties to reflect the increased cooking burden and risk. To avoid heuristic scaling, the penalty scores were systematically derived from biologically motivated thermal-processing thresholds. A conservative baseline of 45 °C was adopted as the computational lower bound, below which substantial structural protein denaturation and the inactivation of non-thermophilic microorganisms are generally negligible. As summarized in Table 1, target temperatures within the practically relevant processing spectrum of 60 °C to 85 °C were then mapped onto a discrete penalty scale from 1.0 to 4.0 via a linear scaling function equipped with a deterministic floor operator.
By integrating this penalty into the Safety Score (S) through Equation (1), the system ensures that the recommendation ranking is sensitive not only to biochemical allergen risks but also to the practical, safety-critical requirements of the cooking process itself.
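As an illustration of the temperature-to-penalty mapping, the sketch below shows one linear-with-floor function that reproduces the stated endpoints (60 °C to 1.0, 85 °C to 4.0) from the 45 °C baseline. The 10 °C step width, the zero penalty below 60 °C, and the function name `heating_penalty` are assumptions made for this example. Table 1 remains the authoritative source for the actual values.

```python
import math

BASELINE_C = 45.0  # computational lower bound stated in the text
STEP_C = 10.0      # assumed step width, chosen so that 60 °C -> 1 and 85 °C -> 4


def heating_penalty(target_temp_c):
    """Map a target internal temperature (°C) to a discrete penalty between 1.0 and 4.0."""
    if target_temp_c < 60.0:
        return 0.0  # assumption: no heating penalty below the processing range
    raw = math.floor((target_temp_c - BASELINE_C) / STEP_C)
    return float(min(max(raw, 1), 4))  # clamp to the 1.0-4.0 scale
```

Under these assumptions, 65 °C and 75 °C fall on the intermediate levels 2.0 and 3.0.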
3.2.3. Substitution Bonus
To promote safe adaptation, the system provides a bonus when high-risk ingredients are substituted. The bonus is calculated by Equation (2) from three quantities: the positive continuous penalty value of the replaced ingredient, a discrete safety coefficient (2.0 if the substitute is completely safe, 1.0 otherwise), and a non-negative integer count of the remaining risky ingredients. This formulation rewards risk reduction [18]. The safety coefficient acts as a binary scaling factor designed to double the adaptation reward only when a verified hypoallergenic alternative replaces a hazardous ingredient. To ensure the robustness of the recommendation, the denominator, which grows with the count of remaining risky ingredients, models the overall risk density of the recipe, expanding upon the concept of safe dietary substitution mining proposed in recent health informatics literature [18]. This density-dependent dampening mechanism ensures that the substitution bonus is adaptively suppressed if the surrounding recipe environment still contains multiple unaddressed allergen risks, thereby preventing the system from over-rewarding a single substitution in an otherwise hazardous dish.
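The behavior of the bonus can be illustrated with a short sketch. The functional form used here, coefficient times replaced penalty divided by one plus the count of remaining risky ingredients, is an assumption that matches the verbal description (doubling for a fully safe substitute, dampening as risk density grows). It should not be read as the paper's exact Equation (2), and the function and parameter names are hypothetical.

```python
def substitution_bonus(replaced_penalty, substitute_is_safe, remaining_risky):
    """Reward a substitution, dampened by the number of risks still in the recipe."""
    if replaced_penalty <= 0:
        raise ValueError("replaced_penalty must be positive")
    if remaining_risky < 0:
        raise ValueError("remaining_risky must be a non-negative integer")
    coefficient = 2.0 if substitute_is_safe else 1.0  # discrete safety coefficient
    # Assumed denominator: 1 + count, so the bonus stays finite when no risks remain.
    return coefficient * replaced_penalty / (1 + remaining_risky)
```

With a replaced penalty of 5.0 and a fully safe substitute, the bonus in this sketch is 10.0 when no risky ingredients remain and drops to 2.0 when four remain, which is the dampening effect the text describes.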
3.3. Nutritional Optimization and DRI Alignment
The Nutrition Score (N) quantifies the degree to which a recipe aligns with the Dietary Reference Intakes (DRIs) for the Japanese population (2025 Edition) [5]. In alignment with the strategic framework of “Health Japan 21” (Phase III) by the Ministry of Health, Labour and Welfare (MHLW) [4], our framework prioritizes ten essential nutrients critical for preventing lifestyle-related diseases.
To adapt daily DRI standards to a single-meal context, we assume a balanced three-meal dietary structure. Thus, the target intake and the tolerable upper limit for the i-th nutrient correspond to the daily recommended value divided by three. The individual nutrient score, computed by Equation (3) from the actual content of the i-th nutrient in a given recipe, penalizes both insufficient and excessive intake.
The aggregate Nutrition Score (N) is computed by Equation (4) as a weighted average of the individual nutrient scores, with a dedicated weight assigned to nutrient i to reflect the varying public health significance of each nutrient. As detailed in Table 2, these weights are categorized into three strategic tiers:
High Importance: Energy and fat, given their critical roles in managing obesity and chronic metabolic disorders.
Medium Importance: Essential micronutrients, including sodium (salt), iron, and specific vitamins, selected for their broad public health impact.
Standard Importance: Nutrients such as dietary fiber, which are vital for long-term health maintenance but present less acute risk profiles.
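A small sketch may help make the nutrition scoring concrete. The per-meal conversion (daily value divided by three) and the weighted-average aggregation follow the text. The piecewise per-nutrient score below is an assumed shape that penalizes both shortfall and excess, not the paper's Equation (3). The tier weights in the usage note and all function names are hypothetical.

```python
def per_meal(daily_value):
    """Convert a daily reference value to a single-meal value (three-meal assumption)."""
    return daily_value / 3.0


def nutrient_score(actual, target, upper):
    """Score one nutrient on [0, 100]; full marks between the target and the upper limit."""
    if actual < target:
        return 100.0 * actual / target          # shortfall penalty (assumed linear)
    if actual <= upper:
        return 100.0                            # within the acceptable band
    excess = (actual - upper) / upper
    return max(0.0, 100.0 * (1.0 - excess))     # excess penalty, floored at zero


def nutrition_score(recipe, targets, uppers, weights):
    """Weighted average of the per-nutrient scores."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * nutrient_score(recipe[name], targets[name], uppers[name])
                   for name in weights)
    return weighted / total_weight
```

For example, with illustrative weights of 3 for energy and 1 for fiber, a recipe that supplies half of the energy target (score 50) and an in-range amount of fiber (score 100) would receive (3 × 50 + 1 × 100) / 4 = 62.5.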
3.4. Visual Appeal Assessment via Neural Image Assessment (NIMA)
To address the “visual hunger” phenomenon, the framework implements a deep perceptual scoring module to quantify the aesthetic quality of recipe images, providing an objective measure of their visual appeal.
The evaluation leverages the Neural Image Assessment (NIMA) architecture, instantiated via the open-source PyTorch (version 2.12.0) Image Quality Assessment (IQA-PyTorch) toolbox. Specifically, the visual scoring path utilizes an Inception-ResNet-V2 backbone pre-trained on ImageNet for structural feature extraction, which was subsequently optimized on the Aesthetic Visual Analysis (AVA) dataset. The AVA dataset contains approximately 250,000 photographic images annotated with human aesthetic rating distributions, providing a strong foundation for generalized aesthetic evaluation.
To rigorously evaluate the domain generalization performance of the scoring module without introducing target-domain annotation bias, the NIMA network is intentionally evaluated in a zero-shot setting without fine-tuning on food-specific imagery. Instead, it relies on cross-domain feature transfer from the off-the-shelf weights to capture fundamental, universal photographic attributes, including optimal illumination, chromatic contrast, and compositional alignment.
During inference, the model processes the input recipe image X and returns a mean aesthetic preference score natively scaled between 1 and 10. To align this metric with the uniform scale of our multi-objective framework, this raw score is deterministically mapped into the final Visual Score (V) via a standard decimal multiplier (Equation (5)):
In our safety-critical framework, this visual appeal score does not act as a primary hard filter. Instead, it serves strictly as a hedonic refinement signal that distinguishes between recipe candidates that have already satisfied the rigorous safety and nutritional thresholds.
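The score mapping can be sketched as follows. The sketch assumes that the "standard decimal multiplier" is a factor of 10, which takes the native 1 to 10 range to 10 to 100, and it also shows the min-max normalization to [0, 100] applied later in Section 3.5. Both function names are hypothetical, and the NIMA inference call itself is deliberately left out because its exact API usage is not given in the text.

```python
def visual_score(mean_nima_score):
    """Scale a NIMA mean score in [1, 10] to a raw Visual Score in [10, 100]."""
    if not 1.0 <= mean_nima_score <= 10.0:
        raise ValueError("NIMA mean score is expected to lie in [1, 10]")
    return 10.0 * mean_nima_score  # assumed multiplier of 10


def normalize_visual(raw_score):
    """Min-max normalize a raw Visual Score from [10, 100] to [0, 100]."""
    return (raw_score - 10.0) / (100.0 - 10.0) * 100.0
```

Under this assumption the lowest possible NIMA score maps to a normalized value of exactly 0, which is the "absolute zero baseline" behavior the normalization is meant to provide.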
3.5. Multi-Objective Ranking Strategy
The final recommendation ranking results from a Composite Objective Score (O). To ensure commensurability across diverse metrics, we normalized the raw scores for safety, nutrition, and visual appeal to a uniform scale of [0, 100]. Specifically, for the visual dimension, the raw score is min-max normalized so that complete aesthetic collapse maps to an absolute zero baseline. We then define the Composite Objective Score (O) as a linear weighted sum of the three normalized scores (Equation (6)).
In our primary configuration, the weighting coefficients are chosen to reflect a deliberate prioritization of allergen safety. Although the overall ranking score is formulated as a linear weighted sum, the framework is designed to exhibit safety-aware non-compensatory behavior through the penalty mechanism described in Section 3.2. While primary explicit allergens are eliminated via a strict hard structural constraint at the first stage (the absolute exclusion mechanism in Algorithm 1 described in Section 3.6), remaining implicit risks such as cross-reactivity and heating requirements are processed within the scoring engine. Recipes associated with these identified secondary risks receive substantial safety penalties, causing their normalized safety score to approach the lower end of the scoring range.
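Written out, a linear weighted sum of this kind takes the following general form. The notation is assumed for illustration only: the weights $w_S, w_N, w_V$ and the normalized scores $\hat{S}, \hat{N}, \hat{V}$ are labels introduced here, and the weight normalization is an assumption consistent with the constraint mentioned later in this section.

```latex
O = w_S\,\hat{S} + w_N\,\hat{N} + w_V\,\hat{V},
\qquad w_S + w_N + w_V = 1,
\qquad \hat{S},\ \hat{N},\ \hat{V} \in [0, 100].
```

Because each normalized score lies in [0, 100] and the weights sum to one, the composite score also lies in [0, 100].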
Algorithm 1. Multi-Objective Recipe Ranking Engine
Require: Set of candidate recipes; user profile (primary allergens, cross-reactivity groups, nutritional targets, upper limits, nutrient weights); global weighting coefficients
Ensure: Ranked recommendation list
1: Initialize an empty list of scored recipes
2: for each candidate recipe r do
3:     if r contains the user’s primary explicit allergen then
4:         continue {Absolute exclusion of explicit allergens}
5:     end if
6:     Calculate the allergen and cross-reactivity penalties using CAA guidelines and cross-reactivity matrices {Implicit risk modeling}
7:     Calculate the heating penalty using linear normalization of sterilization temperatures from Table 1
8:     Compute Substitution Bonus using Equation (2)
9:     Compute Raw Safety Score using Equation (1)
10:    Normalize Raw Safety Score to [0, 100]
11:    for each nutrient i do
12:        Compute individual nutrient score using actual content via Equation (3)
13:    end for
14:    Compute Aggregate Nutrition Score using Equation (4)
15:    Normalize Raw Nutrition Score to [0, 100]
16:    Fetch Pre-calculated Visual Score using Equation (5)
17:    Normalize Raw Visual Score to [0, 100]
18:    Compute Composite Ranking Score using Equation (6)
19:    Append r and its scores to the list of scored recipes
20: end for
21: Sort the list of scored recipes with CustomSort {Sort via hierarchical fallback rules}
22: return the ranked recommendation list
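The control flow of Algorithm 1 can be condensed into a short runnable sketch. To keep it brief, the sketch takes already-normalized sub-scores as input rather than recomputing the penalties, and it omits the hierarchical tie-breaking. The `Recipe` dataclass, the function name `rank_recipes`, and the default weights (0.5, 0.3, 0.2) are illustrative assumptions, not the paper's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    recipe_id: int
    allergens: frozenset  # allergen labels present in the recipe
    safety: float         # normalized sub-scores on [0, 100]
    nutrition: float
    visual: float


def rank_recipes(candidates, primary_allergens, weights=(0.5, 0.3, 0.2)):
    """Hard-filter primary allergens, then rank by a linear weighted sum."""
    w_s, w_n, w_v = weights  # illustrative values only
    scored = []
    for recipe in candidates:
        if recipe.allergens & primary_allergens:
            continue  # stage 1: absolute exclusion of explicit allergens
        composite = w_s * recipe.safety + w_n * recipe.nutrition + w_v * recipe.visual
        scored.append((composite, recipe))
    # Stage 2: highest composite score first (tie-breaking omitted in this sketch).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [recipe for _, recipe in scored]
```

A recipe containing a declared allergen never reaches the scoring step in this sketch, no matter how high its sub-scores are, which mirrors the two-stage design.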
To preserve safety dominance in practical deployment scenarios, we recommend operating within a safety-prioritized region of the weight space, in which the safety weight dominates the nutrition and visual weights. Under this configuration, substantial safety degradation cannot be fully compensated for by improvements in nutritional balance and visual appeal. For the default weight setting, the theoretical lower bound of a completely safe recipe with respect to secondary risks occurs when its normalized safety score is maximal and its normalized nutrition and visual scores are minimal. Conversely, the theoretical upper bound of a heavily penalized risky recipe occurs when its normalized safety score is minimal and its normalized nutrition and visual scores are maximal. Therefore, except for the extreme boundary condition where these two bounds coincide, safe recipes remain preferentially ranked. In the rare event of an exact numerical tie, the deterministic fallback procedure via the CustomSort function detailed in Section 3.6 (Algorithm 1) is invoked, evaluating sub-scores sequentially (safety, then nutrition, then visual appeal) to ensure safety-oriented ranking consistency.
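The boundary argument can be made explicit with a short worked derivation. The notation is assumed for illustration: $w_S, w_N, w_V$ are the weights, taken to sum to one, and the normalized sub-scores are taken to lie in [0, 100].

```latex
\begin{aligned}
O_{\text{safe}}^{\min}  &= w_S \cdot 100 + w_N \cdot 0 + w_V \cdot 0 = 100\,w_S,\\
O_{\text{risky}}^{\max} &= w_S \cdot 0 + w_N \cdot 100 + w_V \cdot 100
                         = 100\,(w_N + w_V) = 100\,(1 - w_S),\\
O_{\text{safe}}^{\min} \ge O_{\text{risky}}^{\max}
  &\iff w_S \ge w_N + w_V \iff w_S \ge \tfrac{1}{2}.
\end{aligned}
```

Under these assumptions, the worst-case safe recipe and the best-case risky recipe tie only when the safety weight equals exactly one half, at a composite score of 50. Any larger safety weight keeps the safe recipe strictly ahead.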
It should be noted that the above condition serves as a deployment guideline rather than a hard optimization constraint for the entire framework. To evaluate the robustness of the framework and explore diverse user-preference patterns, the sensitivity analysis and weight optimization experiments reported in Section 5.5 intentionally examined the full feasible weight space under the normalization constraint on the three weights. Consequently, empirically optimal weight combinations may fall outside the safety-prioritized region while still providing valuable insight into user preference structures. The recommended deployment configuration, however, remains anchored in the safety-dominant region to preserve the intended behavior of the recommender system.
3.6. Algorithmic Implementation and Reproducibility
To guarantee full technical reproducibility and eliminate any mathematical ambiguity regarding how these distinct objectives are programmatically integrated, Algorithm 1 outlines the sequential operational pipeline of our multi-objective scoring and ranking engine.
To systematically support the formalized procedure in Algorithm 1, we define its core operational parameters below:
Computational Complexity: Let N denote the total number of candidate recipes. The scoring phase iterates through the filtered repository in a single pass, yielding O(N) complexity. Within this loop, implicit safety penalties and nutrient sub-scores are calculated in constant time due to the fixed dimensions of cross-reactivity matrices and nutrient categories (10 items). To optimize online inference efficiency, visual aesthetic scores (V) are pre-computed offline via the CNN-based NIMA model and indexed in the database, reducing the runtime visual score acquisition to a lightweight table lookup per recipe. The final re-ranking phase utilizes a customized sorting algorithm equipped with a multi-key hierarchical comparison operator (derived from dual-pivot Quicksort), which operates at O(N log N) time complexity on average. Thus, the overall computational complexity of the engine is bounded at O(N log N), making the framework suitable for real-time recommendation scenarios involving moderate-to-large recipe collections.
Stopping Criteria: The ranking engine operates under a deterministic batch-processing stopping criterion. The execution automatically terminates without any early-stopping thresholds or heuristic convergence checks once all N elements in the candidate repository have completed the scoring loop and the fully sorted recommendation queue is compiled.
Tie-Breaking Mechanism: In the event that multiple recipes achieve identical composite scores, a non-compensatory hierarchical fallback mechanism is programmatically triggered to protect user safety. As modeled at the boundary conditions in Section 3.5 (where a safe but nutritionally empty recipe intersects with an unsafe but otherwise perfect recipe at the same composite score), ties are resolved deterministically by evaluating sub-scores in sequential order of health-critical importance: (1) higher Safety Score, (2) higher Nutrition Score, and (3) higher Visual Score. If a tie persists across all dimensions, the recipe with the lower unique database entry index (Recipe ID) is prioritized to ensure absolute algorithmic determinism.
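The fallback order maps naturally onto a multi-key sort. The sketch below is an assumption-laden illustration: the name `custom_sort` echoes the CustomSort function mentioned in the text, but the dictionary keys are hypothetical, and Python's built-in `sorted` (Timsort) stands in for the dual-pivot Quicksort variant described above.

```python
def custom_sort(scored_recipes):
    """Order recipes by composite score, then safety, nutrition, visual, and recipe ID."""
    return sorted(
        scored_recipes,
        key=lambda r: (
            -r["O"],         # higher composite score first
            -r["S"],         # then higher Safety Score
            -r["N"],         # then higher Nutrition Score
            -r["V"],         # then higher Visual Score
            r["recipe_id"],  # finally the lower Recipe ID
        ),
    )
```

In the boundary case discussed in Section 3.5, a safe but nutritionally empty recipe and an unsafe but otherwise perfect recipe share the same composite score, and this ordering places the safe recipe first.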
4. Experimental Design and Evaluation
To evaluate the efficacy of the proposed system, we conducted a controlled user study comparing diverse ranking strategies. The primary objective of this evaluation is not to measure the system’s absolute retrieval recall from a global database, but to assess its ability to align with nuanced user preferences within a safety-prioritized recommendation space. By providing participants with a pre-filtered set of high-quality, safe candidates, we isolate and observe how different multi-objective weighting strategies reflect the subtle trade-offs between safety, nutrition, and visual appeal. This section details the dataset construction, the user study protocol, and the evaluation metrics used.
4.1. Dataset Construction and Curation
We constructed a comprehensive, proprietary recipe dataset by systematically scraping and curating web data from the “Table for All” platform [
23], a specialized Japanese portal dedicated to allergy-friendly diets. Crucially, the culinary contents, dietary guidelines, and ingredient substitution rules on this platform were developed under the strict clinical supervision and expert advice of Dr. Yukihiro Ohya, Director of the Allergy Center at the National Center for Child Health and Development, alongside certified food allergy registered dietitians [
24]. This expert-backed foundation provides a high level of clinical credibility and reliability for the source data. Since the platform does not offer an official academic API or a downloadable dataset, we developed a dedicated web-scraping pipeline to extract content from 606 individual recipe pages. By cleaning and structuring this raw web data, we built a tailored database for our study. This self-compiled dataset is particularly suitable for allergen-aware recommendation research due to the source platform’s highly reliable allergen annotations and well-structured nutritional metadata.
Total Recipes: 606 distinct, validated recipes.
Cuisine Categories: The dataset covers Desserts (217), Japanese (201), Western (159), and Chinese/Korean (29).
Metadata Attributes: Each entry includes a list of ingredients, 28-type allergen labels, detailed nutritional profiles (energy, protein, fat, salt, etc.), step-by-step instructions, and high-resolution culinary imagery.
The dataset was pre-processed to standardize unit measurements, normalize ingredient names, and remove incomplete or duplicate records to ensure experimental integrity.
4.2. User Study Protocol
We recruited 20 participants (13 regular home cooks and 7 occasional cooks) to evaluate the system’s recommendations. To maintain privacy and adhere to ethical standards, all interactions with participants were anonymous, and no personally identifiable information (PII) was collected. The study was conducted in a within-subject design so that each participant experienced several ranking configurations to reduce individual variance.
4.2.1. Experimental Procedure
The experiment utilized a custom web-based interface, following a four-stage protocol:
Profile Initialization: Participants registered an anonymous ID and declared their dietary restrictions.
Session Interaction: The system presented recommendation lists generated under four core interactive evaluation configurations: the integrated Multi-Objective (Overall) strategy and three single-objective baselines (SafetyOnly, NutritionOnly, and VisualOnly). For each configuration, the top-10 recommended items were displayed to the participant (Figure 2). Across these live interactive sessions, each participant reviewed and manually re-ranked a total of 40 recipe recommendations. To prevent cognitive fatigue and ensure evaluation quality during the user study, the remaining dual-objective and external baselines were excluded from the live interface and reserved for retroactive algorithmic evaluation.
Interactive Re-ranking: Participants reviewed the recommended list and manually reordered the items to reflect their true personal preference. This user-modified order serves as the ground truth for our evaluation metrics (Figure 3).
Data Logging: The system automatically logged the initial system-generated rank, the user’s final rank, and any click-through behavior for detailed recipe views.
4.2.2. Ranking Strategies
To rigorously analyze the influence of each scoring dimension, we evaluated 7 distinct ranking configurations:
Multi-Objective (Overall): The proposed integrated strategy, using the baseline weight configuration described in Section 3.5, which prioritizes safety while balancing other factors.
Single-Objective Benchmarks: SafetyOnly (S), NutritionOnly (N), and VisualOnly (V).
Dual-Objective Benchmarks: Safety+Nutrition, Safety+Visual, and Nutrition+Visual.
Although seven ranking configurations were evaluated in total, only four core configurations were exposed during the live user sessions. The remaining configurations were assessed retrospectively using the collected preference data.
4.3. Evaluation Metrics
We evaluated the quality of the recommendations using standard Information Retrieval (IR) metrics, comparing the system’s initial ranking with the user’s re-ranked list, which serves as a proxy for the ground truth preference.
Mean Reciprocal Rank (MRR): Measures the effectiveness of the system in placing the user’s top-choice item at the top of the list.
Mean Average Precision (MAP): Evaluates the overall precision across the ranked list, where user interaction signals (e.g., clicks and selections) serve as implicit relevance indicators.
Normalized Discounted Cumulative Gain (nDCG@5): Measures the ranking quality within the top-5 positions, giving more weight to the relevance of higher-ranked items.
Spearman’s Rank Correlation Coefficient (SRC): Assesses the monotonic relationship between the system’s ranking and the user’s preference ranking, indicating how well the model captures relative user priorities [14].
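Three of these metrics can be sketched for a single session, comparing the system's list with the participant's re-ranked list. The relevance definitions are assumptions made for this example: the reciprocal rank uses the participant's top choice as the only relevant item, and nDCG@5 derives graded relevance from the participant's order. MAP is omitted because it depends on the click and selection signals, whose exact encoding is not specified here. All function names are hypothetical.

```python
import math


def reciprocal_rank(system, user):
    """1 / (1-based position of the user's top choice in the system list)."""
    return 1.0 / (system.index(user[0]) + 1)


def ndcg_at_k(system, user, k=5):
    """nDCG@k with graded relevance derived from the user's order (assumption)."""
    n = len(user)
    relevance = {item: n - pos for pos, item in enumerate(user)}  # top item gets n

    def dcg(order):
        return sum(relevance[item] / math.log2(i + 2)
                   for i, item in enumerate(order[:k]))

    return dcg(system) / dcg(user)  # the user's own order is the ideal ranking


def spearman(system, user):
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    n = len(user)
    user_pos = {item: pos for pos, item in enumerate(user)}
    d_squared = sum((i - user_pos[item]) ** 2 for i, item in enumerate(system))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Averaging `reciprocal_rank` over all sessions gives the MRR. The Spearman formula used here is the standard closed form for rankings without ties, which holds when each participant orders the same ten items.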
4.4. Evaluation Paradigm and Data Utilization
This study utilizes a user-centric Information Retrieval (IR) evaluation paradigm rather than an offline machine learning training loop. The 606 validated recipes serve as the frozen retrieval repository [24]. Crucially, because our multi-objective ranking framework relies on an axiomatic scoring formulation rather than parameterized learning, the primary weighting coefficients were defined a priori based on clinical safety priorities, avoiding any training-level dependency on user data.
The explicit human preferences harvested from the interactive sessions form an independent evaluation dataset. To maintain full comparative completeness, the three dual-objective combinations (detailed in Section 4.2.2) and the external state-of-the-art baselines (detailed in Section 5.1) were evaluated retroactively offline by replaying them against these empirical user profiles. Furthermore, the continuous weight sensitivity analysis explored in Section 5.5 was simulated post hoc against this frozen baseline, serving strictly as a system robustness check rather than an iterative hyperparameter tuning procedure.
5. Results
This section bridges the proposed formulation with empirical validation by presenting a comprehensive evaluation of the allergen-aware recipe recommender system based on data collected from the user study (n = 20). We analyze the effectiveness of different ranking strategies from four perspectives: ranking effectiveness, alignment with user preferences, parameter sensitivity, and subjective usability.
5.1. Comparative Performance Analysis
To validate the effectiveness of our multi-objective approach, we compared the proposed Overall strategy against single-objective and dual-objective baselines. Table 3 summarizes the performance across four metrics: MRR, MAP, nDCG@5, and SRC.
The results favor the multi-objective formulation. The Overall strategy achieved the highest scores in MRR and MAP (Table 3), indicating its high precision in surfacing relevant, safe, and nutritious recipes at the top of the recommendation list. While the Safety+Nutrition strategy showed a marginally higher nDCG@5, the Overall strategy maintained the highest SRC, suggesting superior global ranking consistency with users’ holistic preferences.
To validate the multi-objective framework within the constraints of the completed user study, we extended our comparative analysis by retroactively evaluating two representative external baselines offline, utilizing the explicit user preference profiles collected during the interactive sessions: (1) Hard-Filtering Safety Recommendation (HF-SR), which executes the binary exclusion of primary allergens as evaluated in mobile health applications [11,12] and subsequently retains the natural retrieval order of the original dataset, and (2) Linear Multi-Objective Optimization (LMO), representing the conventional compensatory scalarization paradigm widely categorized in healthy dietary recommendation surveys [9], which standardizes safety, nutrition, and visuals as equivalent linear objectives without applying the proposed safety-aware filtering and ranking mechanism or penalty-based risk modulations. Table 3 summarizes the retroactive empirical performance.
The offline simulated results suggest that the proposed Overall strategy maintains competitive advantages over both external baselines. Notably, it shows statistically significant improvements in the ranking-sensitive metrics (nDCG@5 and SRC), as verified by the post hoc tests detailed in Section 5.2. Within these simulated bounds, the HF-SR baseline exhibits low global ranking alignment (SRC = 0.2765), indicating that a binary exclusion of primary allergens, while ensuring safety, largely fails to align the ordering of the remaining recipes with user preferences because it ignores secondary health and hedonic utilities. The LMO baseline yields more competitive top-k results (MRR = 0.6563, nDCG@5 = 0.8278), but its global rank correlation remains notably lower than that of our model (SRC = 0.5077 vs. 0.6952). This discrepancy supports our theoretical premise: traditional compensatory multi-objective systems allow critical safety risks to be offset by high nutritional or visual scores, which produces unstable rankings in risk-sensitive scenarios. By treating safety as an explicit allergen filter combined with a high-magnitude penalty-based ranking mechanism rather than a fully compensatory objective, our framework yields more stable and personalized recommendation lists.
5.2. Statistical Significance Analysis
To evaluate the statistical robustness of the ranking strategies, we conducted a non-parametric Friedman test, followed by Wilcoxon signed-rank tests with Holm–Bonferroni correction for post hoc pairwise comparisons. As summarized in Table 4, the analysis revealed a highly significant main effect across strategies for the ranking-sensitive metrics, nDCG@5 and SRC. Conversely, the macro-level retrieval metrics (MRR and MAP) did not reach statistical significance, indicating that the baseline ability to surface relevant items remains consistent across configurations.
Post hoc pairwise comparisons indicate that the proposed Overall strategy achieves statistically significant improvements over both external baselines: it significantly outperforms the HF-SR baseline and the conventional LMO paradigm on nDCG@5 and SRC.
The post hoc analysis uncovers a critical structural dynamic. While the proposed Overall strategy significantly outperforms the visually driven baselines (e.g., Overall vs. Nutrition+Visual, for nDCG@5), it does not yield a statistically significant improvement over the dual-objective Safety+Nutrition baseline ( for nDCG@5; for SRC). This statistical parity suggests that in a health-constrained search space, the visual dimension mathematically fails to override safety and nutritional objectives in macro-level ranking.
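The Holm–Bonferroni step-down correction applied to the post hoc comparisons can be sketched as follows. This is a generic implementation of the procedure; the p-values in the example are hypothetical placeholders, not results from this study. In practice, the raw p-values could come from a routine such as `scipy.stats.wilcoxon`, which is one possible choice rather than a statement about the tooling used here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p, reject) lists in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject

# Hypothetical raw p-values for three pairwise comparisons.
adjusted, reject = holm_bonferroni([0.01, 0.04, 0.03])
```

With these placeholder inputs only the first comparison survives the correction, which shows how a raw p-value below 0.05 can lose significance once multiple comparisons are accounted for.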
5.3. Ablation Study: Component-Wise Contributions
We conducted an ablation study to understand the isolated effect of safety, nutrition, and visual cues on the quality of recommendations.
Synergy of Safety and Nutrition: The high performance of the Safety+Nutrition strategy (MRR: 0.7176; nDCG@5 marginally above that of the Overall strategy) indicates that these two factors form the structural backbone of the system.
Visual Cues as a Refinement Signal: Incorporating visual information does not consistently improve top-k accuracy; however, the Overall strategy achieves the highest SRC (0.6952), indicating improved global ranking coherence. This suggests that visual cues primarily function as a “tie-breaker,” allowing users to differentiate between multiple safe and nutritious candidates, thereby smoothing the overall preference curve.
5.4. Impact of Safety Modeling on Ranking Coherence
The SRC measures how well the system ranking aligns with users’ overall preference ordering. A critical finding emerged from the Nutrition+Visual baseline. Despite a competitive MRR (), it yielded an exceptionally low SRC (). This result suggests that, in the absence of safety-oriented risk modeling, the model’s ranking is noisy and fails to accurately reflect the user’s relative ranking preferences across the entire list. This phenomenon demonstrates the importance of safety for allergy users, who regard it as a requirement rather than a preference when developing a coherent decision-making logic.
5.5. Parameter Sensitivity and Weight Optimization
To validate the rationality of the proposed weighting scheme and ensure its robustness across varying user priorities, we conducted a systematic sensitivity analysis alongside data-driven weight optimization.
5.5.1. Safety Weight Sensitivity
We first examined the impact of the safety weight (
) on ranking performance by varying its value from 0 to 1 with a step size of 0.05, while maintaining a fixed ratio between the nutrition (
) and visual (
) weights (
). As illustrated in
Figure 4, the resulting performance trends are evident.
The results reveal a distinct performance plateau within the range of , where all evaluation metrics remain consistently high. When is too small (e.g., ), we observe a significant decline in SRC, suggesting that insufficient prioritization of safety compromises ranking coherence. Conversely, when becomes excessively large (e.g., ), we observe a marginal but consistent decline in performance, as the over-prioritization of safety begins to overshadow the nuanced contributions of nutritional and visual factors.
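The sweep described above can be sketched as a simple generator of weight triples. The variable names (`w_s`, `w_n`, `w_v`), the assumption that the three weights sum to 1, and the ratio value used in the example are illustrative; the fixed nutrition-to-visual ratio actually used in the study is not reproduced here.

```python
def safety_sweep(ratio_nutrition_to_visual, step=0.05):
    """Return (w_s, w_n, w_v) triples for w_s = 0, step, ..., 1.

    Assumes the weights sum to 1 and w_n / w_v stays fixed at the given ratio.
    """
    n_steps = round(1.0 / step)
    triples = []
    for i in range(n_steps + 1):
        w_s = i * step
        remainder = 1.0 - w_s                       # mass left for the other two
        w_v = remainder / (1.0 + ratio_nutrition_to_visual)
        w_n = remainder - w_v
        triples.append((round(w_s, 2), round(w_n, 4), round(w_v, 4)))
    return triples

# Hypothetical ratio of 2:1 between nutrition and visual weights.
sweep = safety_sweep(2.0)
```

Each triple would then be passed to the ranking function and scored with the evaluation metrics to trace curves like those in Figure 4.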
5.5.2. Optimal Weight Discovery
To further validate the empirical observations from the sensitivity analysis, we performed a grid search over the full weight space under the constraint
.
Table 5 shows the optimal weight combination
with SRC as optimization objective. Notably, this learned value of
falls squarely within the high-performance plateau identified in the sensitivity analysis, confirming the stability of our initial weighting strategy.
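A grid search of this kind can be sketched as an exhaustive enumeration of the weight simplex. We assume here that the constraint is that the three weights are non-negative and sum to 1, and that the grid step is 0.05; both are assumptions for illustration, as is the toy objective, which stands in for the mean SRC that would be computed from the study data.

```python
def simplex_grid(step=0.05):
    """All (w_s, w_n, w_v) on a regular grid with non-negative entries summing to 1."""
    n = round(1.0 / step)
    return [
        (i / n, j / n, (n - i - j) / n)
        for i in range(n + 1)
        for j in range(n + 1 - i)
    ]

def best_weights(objective, step=0.05):
    """Return the grid point that maximizes objective(weights)."""
    return max(simplex_grid(step), key=objective)

def toy_objective(weights):
    """Placeholder objective peaking at an arbitrary point (not a study result)."""
    w_s, w_n, w_v = weights
    return -((w_s - 0.5) ** 2 + (w_n - 0.3) ** 2 + (w_v - 0.2) ** 2)

optimum = best_weights(toy_objective)
```

Evaluating the objective at every grid point also yields the values needed to draw a heatmap over the weight space, so the same enumeration serves both the optimum search and the visualization.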
5.5.3. Weight Space Visualization
Figure 5 shows the SRC distribution heatmap, which provides further insight into the structure of the weight space. This visualization confirms that high-performance configurations do not correspond to a single isolated point, but rather form a contiguous high-value plateau within the parameter space (specifically,
and
, where
). This result underscores the structural stability of the proposed ranking mechanism, indicating that the model maintains robust alignment with user preferences despite minor variations in weighting strategies.
Interestingly, the heatmap reveals an asymmetric sensitivity, in which the performance drops sharply when the safety weight is below , while the variation in the nutrition () and visual () weights within the flat area has relatively little impact on the overall ranking quality. This observation empirically validates our architectural design, confirming that while nutritional and aesthetic factors are essential for refinement, a foundational prioritization of allergen safety is the primary determinant of ranking coherence in health-critical contexts.
5.6. Usability and User Experience Assessment
To evaluate the subjective user experience and interface intuitiveness, we administered the System Usability Scale (SUS), a widely validated and reliable industry standard for assessing perceived usability [
25]. A total of 20 participants evaluated the proposed allergen-aware recipe recommender system using the standardized 10-item questionnaire.
Each item was rated on a five-point Likert scale, ranging from 1 (strongly disagree) to 5 (strongly agree). To ensure reliability and mitigate response bias, the questionnaire utilized alternating positive (Q1, Q3, Q5, Q7, Q9) and negative (Q2, Q4, Q6, Q8, Q10) statements. The administered SUS questionnaire items were as follows:
Q1. I think that I would like to use this system frequently.
Q2. I found the system unnecessarily complex.
Q3. I thought the system was easy to use.
Q4. I think that I would need the support of a technical person to be able to use this system.
Q5. I found the various functions in this system were well integrated.
Q6. I thought there was too much inconsistency in this system.
Q7. I would imagine that most people would learn to use this system very quickly.
Q8. I found the system very cumbersome to use.
Q9. I felt very confident using the system.
Q10. I needed to learn a lot of things before I could get going with this system.
Following the standard SUS scoring procedure, we first converted the raw ratings into normalized scores ranging from 0 to 4. For positively worded items (Q1, Q3, Q5, Q7, Q9), we calculated the score by subtracting 1 from the user’s rating. For negatively worded items (Q2, Q4, Q6, Q8, Q10), we calculated the score as 5 minus the rating. The final composite SUS score was obtained by summing these normalized scores and multiplying the total by 2.5, resulting in a standardized value between 0 and 100.
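The scoring procedure just described translates directly into code. The following is a minimal sketch for a single respondent; the function name and input format (a list of ten ratings in questionnaire order) are our own choices.

```python
def sus_score(ratings):
    """Compute the SUS score (0-100) from ten ratings on a 1-5 scale, ordered Q1..Q10."""
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    total = 0
    for item, rating in enumerate(ratings, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        if item % 2 == 1:
            total += rating - 1   # positively worded items: Q1, Q3, Q5, Q7, Q9
        else:
            total += 5 - rating   # negatively worded items: Q2, Q4, Q6, Q8, Q10
    return total * 2.5

neutral = sus_score([3] * 10)  # a respondent answering 3 on every item scores 50.0
```

The study-level score is the mean of the per-participant scores.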
Table 6 summarizes the raw average ratings and standard deviations for each questionnaire item.
The system achieved an overarching SUS score of
. According to established industry benchmarks [
26], this score places the system in the “Good” to “Excellent” (Grade B) level of usability, exceeding the average baseline of 68. Particularly high scores in ease of use (Q3:
) and learnability (Q7:
) indicate that the system successfully translates its complex multi-objective ranking logic into an intuitive interface. These results confirm that the proposed framework is not only technically robust but also highly accessible for users managing allergen-aware diets in daily domestic contexts.
6. Discussion
In this section, we discuss the implications of the empirical findings, interpreting how multi-objective optimization balances potentially competing factors and how this trade-off influences system performance, user behavior, and perceived trust.
6.1. The Asymmetric Role of Visual Aesthetics: A Hedonic Tie-Breaker
The statistical parity observed between the Overall and Safety+Nutrition strategies presents a counterintuitive yet vital behavioral insight for health-aware recommendation. Mainstream food recommender systems typically conceptualize visual appeal as a primary, independent driver of user preference. However, our statistical analysis (Table 4) shows that in an allergen-restricted context, incorporating visual aesthetics does not systematically alter the macro-level ranking distribution as measured by SRC.
Rather than indicating a flaw in the multi-objective formulation, this statistical phenomenon exposes the highly asymmetric nature of risk-sensitive decision-making. Specifically, the marginal difference between the Overall strategy and the Safety+Nutrition baseline (e.g., Overall MRR of 0.7242 vs. Safety+Nutrition MRR of 0.7176), and the corresponding absence of a statistically significant difference in macro-level retrieval metrics such as MRR, are mathematically expected results of our safety-prioritized ranking mechanism. Because the safety component receives the highest weighting and incorporates substantial risk penalties, both strategies consistently assign lower composite scores to high-risk recipes, reducing variance among unsafe candidates and homogenizing the macro-retrieval space.
We argue that visual appeal in health-critical systems should not be treated as a parallel ranking engine, but strictly as a hedonic tie-breaker. From a dual-process cognitive perspective, safety and nutritional considerations operate as dominant System 2 evaluation criteria that strongly influence user judgment regarding risk and healthfulness. Once most candidate recipes achieve acceptable safety and nutritional levels, macro-ranking metrics struggle to capture further variance. It is precisely within this local, pre-filtered neighborhood that visual aesthetics engage System 1 (fast, emotional) processing to trigger the final click decision. Consequently, the Visual Score functions as a localized optimization mechanism, preventing the presentation of unappetizing safe options without destabilizing the fundamental, survival-oriented safety hierarchy.
6.2. Stability and Dominance in Multi-Objective Landscapes
The experimental results confirm that the optimal configuration for allergen-aware recommendation operates within a contiguous and stable performance plateau (). The existence of this plateau implies that safety and nutritional value jointly form the structural backbone of the decision-making process.
In the Nutrition+Visual strategy, which lacks explicit safety prioritization, ranking coherence collapsed (low SRC). This demonstrates that safety functions as the primary organizing factor of user decision-making rather than merely another preference dimension. Without it, users must make high-risk decisions based on inconsistent heuristics, leading to decision ambiguity. This aligns with prior work suggesting that constrained search spaces can paradoxically improve user satisfaction by reducing cognitive friction [
22]. However, we also observe that overprioritizing safety (
) leads to “objective dominance,” in which the system loses its capacity to differentiate items based on secondary yet important factors such as visual appeal or nutrition.
6.3. The Gap Between Rational Preference and Risk-Averse Behavior
Our analysis reveals that different evaluation metrics favor distinct weight configurations. Ranking-based metrics (e.g., nDCG@5 and SRC) achieved optimal performance around , while interaction-based metrics (e.g., MAP) continued to improve and reach their peak around .
This shift indicates “decision–action discrepancy,” which represents a systematic gap between the articulation of explicit preferences (i.e., manual re-ranking) and implicit interaction behavior (i.e., clicks or detailed views). From a behavioral perspective, this shift can be interpreted as a form of risk-sensitive decision-making. While participants rationally balanced safety, nutrition, and visuals when asked to re-order the list, their actual interaction behavior was significantly more conservative. In a health-critical context, the “cost” of an unsafe choice is perceived as far higher during active selection than during a reflective ranking task. This phenomenon reinforces the necessity of incorporating risk-sensitive algorithmic biases in health-aware systems to bridge the gap between what users say they want and how they actually behave under perceived risk.
This behavioral divergence aligns with established psychological and behavioral economics frameworks. Specifically, it instantiates Kahneman and Tversky’s concept of “loss aversion” [
27], where the psychological pain of a negative outcome, such as consuming an allergen, far outweighs the pleasure of a positive utility, such as visual appeal or nutritional balance, during real-time choices. In a reflective ranking task, users operate under a more abstract preference-matching mindset. However, during an active selection task, such as clicking a recipe, the perceived immediacy of risk triggers a conservative cognitive bias, amplifying the fear of making an unsafe choice.
Furthermore, this phenomenon is consistent with the “intention–behavior gap” documented in the health-critical domain [28,29]. While users explicitly articulate a balanced intention, their final actions shift toward survival-oriented, risk-averse constraints.
6.4. Human-Centered Evaluation: Trust, Usability, and Cognitive Load
The consistently high performance of ranking strategies incorporating the Safety module underscores the foundational role of trust in health-critical recommender systems [
9]. Our findings suggest that participants utilized the safety score as a reliable cognitive heuristic, enabling them to prune the decision space efficiently. Unlike traditional binary filtering approaches, which merely categorize items as safe or unsafe, the proposed penalty-based formulation enables more granular differentiation between “safe” and “safer” options (e.g., distinguishing between highly processed ingredients and raw components with higher cross-contamination risks). This transparency likely functioned as a primary heuristic, empowering users to make confident, risk-aware decisions without the cognitive burden of exhaustive ingredient verification.
This psychological sense of security is reflected in the SUS outcomes. The overall score () indicates a “Good” to “Excellent” (Grade B) level of usability. Specifically, high scores in ease of use (Q3: ) and learnability (Q7: ) confirm that the interface successfully translates complex multi-objective ranking logic into an intuitive user experience. Furthermore, the positive rating on confidence in use (Q9: ) reinforces the link between transparent safety scoring and user trust.
However, the moderate scores for Q4 (need for technical support) and Q10 (prior learning) reveal a subtle “multi-objective friction.” Weighing safety, nutrition, and aesthetics simultaneously is inherently more cognitively demanding than traditional single-criterion search. These scores suggest that while the system is highly effective, the initial interaction requires users to mentally visualize a three-dimensional trade-off space. Our findings suggest that presenting safety as the default priority within the ranking process can mitigate this load, ensuring that advanced functionality does not cognitively overwhelm the user during long-term domestic use.
6.5. Design Implications for Health-Aware Recommender Systems
Based on the synthesized empirical findings and behavioral analyses, we distill several key design implications for building robust, risk-sensitive recommendation frameworks:
Safety as a Stabilizing Decision Anchor: In health-critical domains, safety should function as a stabilizing anchor rather than a mere additive score. Our results indicate that prioritizing safety stabilizes the preference space. This safety-aware approach is essential for maintaining ranking consistency and fostering user trust in high-stakes scenarios.
Prioritize Robust Regions over Singular Optimal Points: System designers should aim for a “robust performance plateau” rather than searching for a fragile, singular optimal point. Operating within a stable weight region (e.g., the identified plateau of ) enhances system generalizability and provides a “forgiving” foundation for future personalization and adaptive learning algorithms.
Addressing the Discrepancy Between Rational Preference and Interaction: Recommender systems must explicitly account for the gap between explicit preference articulation and implicit interaction behavior. In risk-sensitive contexts, we recommend introducing a “safety-biased” algorithmic nudge to align with users’ implicit risk aversion, even when their explicit profiles suggest a more balanced approach.
Operationalizing Visual Aesthetics as “Top-Rank Nudges”: Perceptual features should be considered as refinement signals rather than primary drivers. Following the “nudging” principle [
17], visual appeal is most effective when applied at the final stage of a ranking pipeline to assist users in selecting among candidates that have already satisfied safety and nutritional thresholds.
Mitigating Cognitive Overload through Scaffolding: To prevent “multi-objective friction,” systems must provide cognitive support mechanisms. This includes transparent safety scoring to reduce verification effort and well-designed default configurations that balance complex trade-offs, ensuring that the system remains accessible for daily, long-term domestic use.
7. Conclusions
This study addresses the critical challenge of dietary planning for individuals with food allergies by proposing a multi-objective recipe recommender system. Moving beyond traditional single-objective optimization, the proposed framework integrates Allergen Safety, Nutritional Balance, and Visual Appeal into a unified, interpretable scoring model.
While the empirical scale of our user study is exploratory, this research significantly advances our preliminary findings [
13] by resolving key methodological and evaluative limitations. Specifically, we have reconceptualized allergen safety, transforming it from a simple additive objective into a safety-aware ranking component with explicit risk prioritization that enhances ranking stability. Through rigorous weight sensitivity analysis, a formal System Usability Scale (SUS) evaluation, and a behavioral analysis of the “decision–action discrepancy,” we have validated the effectiveness and robustness of the framework in risk-sensitive scenarios.
The principal contributions and findings of this research are summarized as follows:
Safety as a Stabilizing Decision Anchor: We demonstrate that in health-critical domains, safety must be prioritized within the ranking process rather than being treated as a fully compensatory objective. Experimental results show that excluding safety leads to a collapse of ranking consistency (SRC), whereas including it establishes a stable preference space that supports further refinement via nutritional and visual factors.
Empirical Identification of a Robust Performance Plateau: Using systematic grid searches and heatmap visualizations, we demonstrated that optimal performance does not exist at a single, fragile point, but rather within a stable region (). This robustness is a critical property for real-world deployment, as it reduces the system’s sensitivity to precise parameter tuning.
The Gap Between Rational Preference and Interaction Behavior: We identified a systematic discrepancy between explicit ranking (optimized at ) and implicit interaction (optimized at ). This finding reveals that users are more risk-averse in active decision-making than in reflective preference articulation, highlighting the necessity of dual-modality feedback modeling in health-aware systems.
7.1. Limitations
Despite the promising results, this study has several limitations that should be acknowledged:
Sample Size and Diversity: Although the within-subject design with 20 participants yielded statistically significant insights, a larger and more demographically diverse cohort is needed to improve the generalizability of the findings. Given the exploratory nature of this usability-oriented evaluation, the observed behavioral discrepancies should be interpreted as localized user trends rather than deterministic behavioral laws.
Platform-Specific Bias: The dataset originates from a single website (“Table for All”) specializing in allergy-friendly recipes [
24]. To formally quantify this source bias and evaluate its statistical implications, we analyzed the descriptive statistics of the raw Visual Score across all 606 recipes. The raw visual appeal distribution exhibits a mean of
and a notably narrow standard deviation of
, spanning from a minimum of
to a maximum of
. This empirical variance directly confirms that the specialized platform’s standardized food presentation artificially compresses the raw aesthetic distribution. However, to ensure commensurability across diverse metrics, our framework applies a standard Min-Max normalization to project these scores onto a uniform
continuous scale. The resulting normalized Visual Score (
) exhibits an expanded mean (
) of
and a standard deviation (
) of
. Because Min-Max normalization is a linear rescaling, it preserves the relative ordering of recipes while mapping the compressed raw scores onto the full 100-point range. The normalized visual score therefore remains commensurable with the other components and can act as a distinct secondary refinement signal within the safe feasible region.
Static vs. Dynamic Personalization: The current model employs a globally optimized configuration. While robust, it does not yet dynamically adapt to individual variations in risk sensitivity or real-time dietary goals.
Lack of Clinical Validation: Although the mathematical formulation of our safety model is grounded in expert-supervised datasets, this study evaluates system performance through an online user study focused on cognitive and behavioral metrics. It lacks direct clinical trials on actual food-allergy patients within a controlled medical environment, which is critical since physiological sensitivity can vary dynamically across individuals.
Evaluation Paradigm Limitations: Our comparative evaluation of external baselines was conducted retroactively via offline replay based on interaction logs from the user study. Because our re-ranking engine relies on an unsupervised, axiomatic scoring function rather than a parameterized statistical model requiring machine learning training, conventional offline train–validation–test data splits are conceptually inapplicable. While this hybrid paradigm prevents participant fatigue by limiting live evaluation to 40 core recipes per user, offline simulations inherently face limitations in perfectly capturing dynamic, context-dependent user choices when presented with entirely new recommendation lists. Future studies should evaluate these external models through live, randomized A/B testing to capture real-time behavioral nuances.
Proxy Nature of Visual Assessment: The visual appeal module utilizes an off-the-shelf NIMA model pre-trained on the general-purpose AVA dataset without food-specific fine-tuning. While this effectively captures universal photographic quality (e.g., lighting and composition) to act as a hedonic tie-breaker, it is fundamentally a proxy metric. It does not natively capture domain-specific culinary semantics, such as perceived freshness or “deliciousness”. Future iterations should incorporate visual models fine-tuned on dedicated food aesthetic datasets.
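The Min-Max projection mentioned under Platform-Specific Bias can be sketched as follows. The target range of 0 to 100 is inferred from the reference to a 100-point scale and should be read as an assumption, and the raw scores in the example are hypothetical values rather than statistics from the dataset.

```python
def min_max_normalize(values, lo=0.0, hi=100.0):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: all scores identical, so no spread can be recovered.
        return [lo for _ in values]
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]

# Hypothetical raw aesthetic scores with a narrow spread.
raw_scores = [4.0, 5.0, 6.0]
normalized = min_max_normalize(raw_scores)
```

Because the transform is monotone, the ordering of recipes by visual score is unchanged; only the numeric range is stretched so that the visual component is on the same scale as the other scores.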
7.2. Future Work
Building upon the empirical findings of this study, future research will focus on four primary directions:
Personalized Learning-to-Rank (LTR): Our goal is to implement LTR algorithms that dynamically calibrate module weights based on implicit signals, such as dwell time and click sequences. This allows for highly personalized trade-offs between safety and visual appeal.
Generative Recipe Adaptation: Beyond recommendation, we plan to leverage Large Language Models (LLMs) and Generative AI (GenAI) to proactively modify recipes, suggesting safe, nutritionally superior ingredient substitutions to turn “unsafe” recipes into viable options.
Contextual Awareness via Computer Vision: We plan to extend the system’s awareness to include dynamic, real-time contexts. Computer vision would be used to automatically monitor a user’s inventory of available ingredients, eliminating the need for manual data entry. Integrating this with temporal context, such as specific meal times or seasonal constraints [
16], will enable truly situational and risk-aware dietary guidance that adapts to the immediate realities of the user’s domestic environment.
Longitudinal Clinical Evaluation: To address the lack of clinical validation, we aim to collaborate with healthcare providers and clinical nutritionists. We plan to deploy the system in real-world dietary interventions to validate its long-term clinical efficacy, physiological safety, and overall impact on the quality of life of food-allergy patients.
In conclusion, this study provides valuable exploratory evidence that in risk-sensitive personalized recommendation, safety acts as a primary prerequisite for ranking coherence. By demonstrating the intricate interplay between safety-aware prioritization and human decision behavior within a controlled, sample-limited setting, this work offers a preliminary yet foundational blueprint for the design of next-generation health-aware dietary assistance systems. Further large-scale validation remains necessary to confirm these behavioral trends across broader demographics.