by Hristo Andreev, Petros Kosmas, Antonios D. Livieratos, et al.

Reviewer 1: Anonymous Reviewer 2: Anonymous Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

First, I'd like to emphasize that, in my opinion, the manuscript is carefully prepared, clearly written, and substantively sound. I'm not inclined to find flaws simply to suggest "forced" changes, nor to "improve" what is already well-developed. Therefore, my comments are intended as reflections, not absolute improvements ("which absolutely must be implemented"), and should be treated by the authors as optional suggestions.

Reading the manuscript, my first impression is that I disagree with the following statement by the authors: "Notably, issues such as selection bias, data bias, and algorithmic design bias can lead to unfair outcomes, limiting access to travel options for marginalized groups and reinforcing existing stereotypes...", but at the same time I agree with what follows later: "For instance, studies have shown that certain AI systems can misrepresent the travel aspirations of users based on flawed data inputs, which may hinder inclusivity and limit the diversity of experiences offered to travelers (53-54)".

My initial conclusion was that the problem wasn't the LLM AI systems themselves, but rather the user-generated input in the form of prompts, meaning the content of the prompts. However, upon further reflection and reading, I realized that the authors had recognized this and anticipated it in their research model.

1) The prompts in this study were controlled and standardized. Flawed or biased user input can generate erroneous recommendations, but in this experiment, the prompts were neutral, yet significant differences and patterns emerged suggesting that the models themselves had built-in biases resulting from their training and design. This, in simple terms, means that the training data were not neutral(?). If the authors used neutral, standardized prompts in their study, and yet the models exhibited reproducible patterns of bias, then the source of these biases lies within the models themselves. Therefore, I quickly concluded that the bias occurred regardless of the user and that the models were not "empty" neutral filters, i.e., they had "built-in" preferences resulting from what and how they were fed during the training phase (the content of the training corpus).

My conclusion is therefore as follows:

a) In my humble opinion, the authors correctly identify the presence of various forms of bias in the recommendations generated by the LLMs studied. However, I believe that the discussion should more clearly distinguish between the two sources of this phenomenon: bias resulting from user-provided input (prompt) and bias built into the model itself during the training and construction stages. Optional suggestion for consideration. Please comment.

b) In the "Future Research" section, I recommend that the authors consider separate testing of the interaction between prompt content and bias, which would allow for a better understanding of the relative influence of the user and the model itself on the shape of recommendations. Optional suggestion for consideration. Please comment.

2) Although the research model is refined, there are some in-depth questions that are worth asking, namely:

a) In my opinion, the conclusions from this study are statistically significant, but for an individual user, the effect (the effect of LLM) is subtle and difficult to identify, or even completely invisible or unnoticeable. The effect of LLM is only visible when comparing multiple interactions or different profiles. Cultural or demographic bias requires awareness of context and a reference point, while stereotypical bias can be perceived more as natural marketing language than a problem. Therefore, can we assume that, to a large extent, many of the statistically identified biases will be invisible to the average user? Please comment.

3) The differences in biases observed between DeepSeek and ChatGPT in the results may largely be due to the characteristics of the training corpora and the priorities used in data selection and filtering. If DeepSeek was trained more on Chinese or Asian sources (language, media, forums, guidebooks), its recommendations may naturally reflect the perspectives and popularity hierarchies present in that cultural area. If ChatGPT has an overrepresentation of English-language and Western (especially American) sources in its corpus, it will more often recommend destinations and tourist narratives consistent with that context.  

This leads to a situation in which some of the observed biases are not a result of the model's current performance per se, but a legacy of the training data (!) I therefore conclude that the study results may reflect regional, cultural, and linguistic biases built into the corpora, which is significant for interpreting the results. How does this impact the conclusions of this study? Please comment.  

4) If we want to reduce bias at the source, in an ideal world, we should start with neutral, balanced, and representative training data. However, in practice, we have no direct influence over the training data. Access to it, its selection, and filtering are the responsibility of the model manufacturer and are usually protected by trade secrets.  

I think the problem is that "neutrality" is a relative concept, i.e. every source of knowledge is embedded in a certain cultural and linguistic context, so absolute neutrality is impossible. Please comment.

Technical notes:

1) Important note: For some reason, the authors did not include references to tables and figures in the text (I don't see them). This should be corrected.

2) Figures, e.g., Figure 1 and Figure 2, are captioned twice, meaning they have two titles – one above the figure and one below the figure. In my opinion, this is unnecessary. The title above the figure should be removed, even if it differs from the title below the figure. The title should be single and unified.

3) For example, Table 8, in addition to numerical values, also uses color – the table footnote doesn't explain the meaning of the individual colors. It's worth adding a legend to the table explaining the meaning of the individual colors. Similarly, Table 7, etc.

4) The authors use abbreviations (acronyms, etc.) in tables and figures, e.g., PDI, IDV, MAS, US, FR, CN. Despite the explanation in the text, in my opinion, the explanation of the abbreviations should also be included in the table footer, i.e. in the table/figure legend.

Author Response

Dear Reviewer, thank you for taking the time to review our manuscript and for providing constructive feedback.

Reading the manuscript, my first impression is that I disagree with the following statement by the authors: "Notably, issues such as selection bias, data bias, and algorithmic design bias can lead to unfair outcomes, limiting access to travel options for marginalized groups and reinforcing existing stereotypes...", but at the same time I agree with what follows later: "For instance, studies have shown that certain AI systems can misrepresent the travel aspirations of users based on flawed data inputs, which may hinder inclusivity and limit the diversity of experiences offered to travelers (53-54)".

 

My initial conclusion was that the problem wasn't the LLM AI systems themselves, but rather the user-generated input in the form of prompts, meaning the content of the prompts. However, upon further reflection and reading, I realized that the authors had recognized this and anticipated it in their research model.

 

1) The prompts in this study were controlled and standardized. Flawed or biased user input can generate erroneous recommendations, but in this experiment, the prompts were neutral, yet significant differences and patterns emerged suggesting that the models themselves had built-in biases resulting from their training and design. This, in simple terms, means that the training data were not neutral(?). If the authors used neutral, standardized prompts in their study, and yet the models exhibited reproducible patterns of bias, then the source of these biases lies within the models themselves. Therefore, I quickly concluded that the bias occurred regardless of the user and that the models were not "empty" neutral filters, i.e., they had "built-in" preferences resulting from what and how they were fed during the training phase (the content of the training corpus).

 

My conclusion is therefore as follows:

 

a) In my humble opinion, the authors correctly identify the presence of various forms of bias in the recommendations generated by the LLMs studied. However, I believe that the discussion should more clearly distinguish between the two sources of this phenomenon: bias resulting from user-provided input (prompt) and bias built into the model itself during the training and construction stages. Optional suggestion for consideration. Please comment.

Response:

Thank you for highlighting the need to distinguish sources of bias (prompt vs. model). We have added a clarification in the Introduction distinguishing prompt-induced from model-induced bias and explaining why our neutral, standardised prompts focus the analysis on model-induced effects (Edit 1A.1). We also added a reinforcing sentence in the Prompting Protocol (Methods) confirming neutral prompts and session controls (Edit 1A.2), and a closing sentence in the Discussion that interprets the observed patterns accordingly (Edit 1A.3).

Edit 1A.1: Added to introduction (76-88): When conducting audits, it is important to distinguish bias arising from prompts from bias arising from the model. Prompt-based bias occurs when a user’s question includes stereotypes, loaded wording, or constraints that steer the model toward a particular answer. Model-based bias originates in the training data, labelling practices, or design choices, and it can persist even when prompts are neutral [13]. Absolute neutrality in training data is unattainable because web-scale corpora reflect uneven cultural, linguistic, and economic representation. In this article, neutrality is defined operationally as the use of neutral prompts and symmetric execution procedures, not as a claim that either the model or the corpus is neutral [14]. This study kept prompts consistent and neutral across all personas and used session controls to limit personalisation and carry-over effects. Procedures were designed to be repeatable and to support independent replication. The patterned differences observed are therefore best interpreted as model-based bias that reflects each system’s training rather than idiosyncrasies of user input.

Edit 1A.3: Added to Discussion (749-751): Because input prompts and execution controls were standardised, these patterns are most parsimoniously explained as model-induced rather than prompt-induced bias.

 

 

 

b) In the "Future Research" section, I recommend that the authors consider separate testing of the interaction between prompt content and bias, which would allow for a better understanding of the relative influence of the user and the model itself on the shape of recommendations. Optional suggestion for consideration. Please comment.

 

Response: 

We agree this is a valuable extension. We now propose a concrete Prompt × Bias experimental design in Future Research to quantify the interaction between prompt framing and model-induced tendencies across our six bias families (Edit 1B.1). We also note this limitation explicitly in the Limitations section (Edit 1B.2).

Edit 1B.1 (932-938): Another direction for future research is to understand how prompt wording interacts with model-based bias. A factorial design could test variations in prompt framing (neutral, stereotype-laden, or de-biased), constraint specificity (general, thematic, or budget/safety related), and language (English vs. other major languages). Bias outcomes could be measured using the same metrics applied here. Mixed-effects models could then estimate both main effects and interactions, clarifying how much bias stems from user input versus the model, and whether prompts amplify, reduce, or reverse underlying biases.
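As an illustration of how such a Prompt × Bias factorial analysis might be fitted in practice, a minimal sketch in Python using statsmodels is given below; the column names (framing, constraint, language, model, bias_score, persona_id) and the input file are hypothetical placeholders, not part of the manuscript.

```python
# Illustrative sketch of the proposed Prompt x Bias factorial analysis.
# Column names and the CSV file are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Each row: one persona-model-prompt run with a bias outcome already computed.
runs = pd.read_csv("prompt_bias_runs.csv")  # hypothetical input file

# Mixed-effects model: fixed effects for prompt framing, constraint specificity,
# language, and model, plus the framing x model interaction; random intercepts
# per persona to account for repeated measures on the same persona.
model = smf.mixedlm(
    "bias_score ~ C(framing) * C(model) + C(constraint) + C(language)",
    data=runs,
    groups=runs["persona_id"],
)
result = model.fit()
print(result.summary())  # main effects and interactions, as described above
```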

Edit 1B.2 (781-783): Sixth, the study did not manipulate prompt framing; determining how biased or stereotype-laden wording interacts with baseline model tendencies remains an open question for future research.

 

2) Although the research model is refined, there are some in-depth questions that are worth asking, namely:

 

a. In my opinion, the conclusions from this study are statistically significant, but for an individual user, the effect (the effect of LLM) is subtle and difficult to identify, or even completely invisible or unnoticeable. The effect of LLM is only visible when comparing multiple interactions or different profiles. Cultural or demographic bias requires awareness of context and a reference point, while stereotypical bias can be perceived more as natural marketing language than a problem. Therefore, can we assume that, to a large extent, many of the statistically identified biases will be invisible to the average user? Please comment.

 

Response: 

We agree with your comment and believe that users can benefit from more transparency. In fact, the models do briefly explain why they recommend each destination, because the prompts asked for this (1st prompt: “…give reasons for each”; 2nd prompt: “…explain why”). However, given the quantitative design of this study, we did not analyse these outputs qualitatively this time. This is something we plan to do in future research by analysing the outputs qualitatively with thematic or sentiment analysis. Here are the changes we made: in the Suggestions for Future Research section we added how qualitative analysis can expand the depth of this research (Edit 2A.3). While our analyses reveal statistically significant patterns at the aggregate level, many differences will be subtle or imperceptible to individual users in a single session. We have added a clarifying paragraph in the Discussion (Edit 2A.1) explaining the distinction between individual perceptibility and aggregate, system-level impact, and extended the Implications to suggest lightweight transparency guidelines that address this invisibility in practice (Edit 2A.2).

Edit 2A.1 (722-730): Although the reported effects are statistically significant at the corpus level, they are often too subtle for an individual user to detect in a single interaction. Bias becomes most consequential when small differences accumulate across many users and sessions, influencing which content or regions are highlighted at the system level. Cultural and demographic patterns usually require a broader context to recognize, and stereotype-laden language can appear similar to conventional travel marketing rather than an explicit fairness concern. These results should therefore be interpreted as evidence of structural, aggregate-level tendencies rather than as differences that individual users will consistently perceive in each interaction.

 

Edit 2A.2 (889-895): Because many biases are not immediately visible at the point of use, interfaces should incorporate lightweight transparency features and guardrails that function effectively at scale. Examples include “why this was recommended” explanations, simple diversity or novelty indicators, and optional user controls that permit minor adjustments within safe limits. Such measures do not require users to conduct audits but make systemic safeguards more transparent. They also help align the objectives of the re-ranking layer with user understanding and trust.

Edit 2A.3 (922-931): The prompt design in this study elicited short rationales for each recommended destination (“…give reasons for each”; “…explain why”). In the present analysis, these texts were used only to extract destinations and were not analysed qualitatively. Future work should apply qualitative content analysis (e.g., reflexive thematic analysis or structured coding) and sentiment/valence analysis to these rationales to examine tone, persuasiveness, hedging, and cliché usage alongside the stereotype lexicon. Such analyses would indicate whether the explanations themselves embed promotional or demographic/cultural skews, how this framing interacts with destination exposure, and whether patterns differ across languages.
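Purely as an illustration of the sentiment/valence analysis proposed for future work, the following is a minimal sketch using NLTK’s VADER scorer; the rationale strings are invented examples and the choice of lexicon is an assumption, not something used in the study.

```python
# Illustrative sketch only: scoring recommendation rationales for valence.
# The example texts are invented; VADER is one possible off-the-shelf scorer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
sia = SentimentIntensityAnalyzer()

rationales = [
    "A vibrant city with breathtaking views and unforgettable nightlife.",
    "A quiet region known for its museums and traditional cuisine.",
]
for text in rationales:
    scores = sia.polarity_scores(text)
    # 'compound' ranges from -1 (negative) to +1 (positive); consistently high
    # values across rationales would suggest promotional framing.
    print(f"{scores['compound']:+.2f}  {text}")
```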

 

3) The differences in biases observed between DeepSeek and ChatGPT in the results may largely be due to the characteristics of the training corpora and the priorities used in data selection and filtering. If DeepSeek was trained more on Chinese or Asian sources (language, media, forums, guidebooks), its recommendations may naturally reflect the perspectives and popularity hierarchies present in that cultural area. If ChatGPT has an overrepresentation of English-language and Western (especially American) sources in its corpus, it will more often recommend destinations and tourist narratives consistent with that context. 

This leads to a situation in which some of the observed biases are not a result of the model's current performance per se, but a legacy of the training data (!) I therefore conclude that the study results may reflect regional, cultural, and linguistic biases built into the corpora, which is significant for interpreting the results. How does this impact the conclusions of this study? Please comment. 

Response: 

Indeed, cross-model differences likely reflect legacies of training data composition, filtering priorities, and alignment procedures. To address this, the Discussion now explicitly interprets divergences as model-induced legacies rather than intrinsic performance differences (Edit 3.1). The Limitations section acknowledges the opacity of proprietary corpora and alignment and clarifies how this constrains causal attribution and generalisability while not undermining user-facing significance (Edit 3.2).

 

Edit 3.1 (752-759): Divergences between the two systems are likely shaped by differences in training corpora, data filtering, and alignment procedures. For example, if one model relies more on English-language Western sources and the other on Asian sources, their recommendations will mirror the narrative frames common in those contexts. These effects should be understood as model-induced outcomes of training data and design choices rather than as inherent differences in quality. The audit therefore evaluates models as deployed, capturing the aggregate tendencies that users encounter without attributing causality to specific corpus components.

 

Edit 3.2 (784-794): The training data, pre-processing pipelines, and alignment steps are proprietary and not documented in detail; therefore, corpus and policy influences can only be inferred from outputs. Safety and preference tuning may nudge language toward positive or generic promotion and away from sensitive topics. Outputs also depend on model architecture and default decoding choices such as sampling temperature, nucleus probability, and length constraints; we used default chat-only generation without tools or retrieval, so behavior may differ under other settings or architectures. The scope of evaluation is bounded: two models and one build each, a single collection window, English prompts only, eight origins, three ages, three gender identities, three interest themes, and a fixed top-five recommendation task. These constraints limit causal claims and the generalizability of results across languages, time periods, deployment modes, and model families.

 

4) If we want to reduce bias at the source, in an ideal world, we should start with neutral, balanced, and representative training data. However, in practice, we have no direct influence over the training data. Access to it, its selection, and filtering are the responsibility of the model manufacturer and are usually protected by trade secrets. 

I think the problem is that "neutrality" is a relative concept, i.e. every source of knowledge is embedded in a certain cultural and linguistic context, so absolute neutrality is impossible. Please comment.

 

Response: 

We agree that absolute neutrality is unattainable and a relative concept, because all corpora are culturally and linguistically situated, and that access to proprietary training data limits source-level interventions. At present, the training data of commercial (non-open-source) LLMs lack transparency. To address your point, we added a clarifying paragraph in the Introduction framing neutrality as a relative ideal and restricting our use of “neutral” to operational controls (Edit 4.1). Training data composition is not something our study can control. It might be possible to speak of neutrality if training data were held in a public repository, but even then it would be very hard to process all of it due to the sheer volume of data; it would not be feasible with classical content analysis and without any assistance from AI.

Edit 4.1 (80-88): Absolute neutrality in training data is unattainable because web-scale corpora reflect uneven cultural, linguistic, and economic representation. In this article, neutrality is defined operationally as the use of neutral prompts and symmetric execution procedures, not as a claim that either the model or the corpus is neutral [14]. This study kept prompts consistent and neutral across all personas and used session controls to limit personalisation and carry-over effects. Procedures were designed to be repeatable and to support independent replication. The patterned differences observed are therefore best interpreted as model-based bias that reflects each system’s training rather than idiosyncrasies of user input.

 

Technical notes:

 

1) Important note: For some reason, the authors did not include references to tables and figures in the text (I don't see them). This should be corrected.

 

Action taken: 

In-text references to the tables and figures were added as suggested. All tables and figures are now referenced and explained in the text.

 

2) Figures, e.g., Figure 1 and Figure 2, are captioned twice, meaning they have two titles – one above the figure and one below the figure. In my opinion, this is unnecessary. The title above the figure should be removed, even if it differs from the title below the figure. The title should be single and unified.

 

Action taken:

Thank you for your constructive observation. The top captions were removed. We decided to keep the captions underneath, as this allows for clearer presentation and larger figures.

 

3) For example, Table 8, in addition to numerical values, also uses color – the table footnote doesn't explain the meaning of the individual colors. It's worth adding a legend to the table explaining the meaning of the individual colors. Similarly, Table 7, etc.

 

Action taken:

These tables were turned into heatmaps for easier comprehension of differences. The colour scheme of all heatmap tables is now explained in each table’s caption for easier understanding and readability.

 

4) The authors use abbreviations (acronyms, etc.) in tables and figures, e.g., PDI, IDV, MAS, US, FR, CN. Despite the explanation in the text, in my opinion, the explanation of the abbreviations should also be included in the table footer, i.e. in the table/figure legend.

 

Action taken: 

The full names and acronyms for each dimension were added to the tables for easier readability.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript addresses an important and timely topic related to bias and fairness in travel recommendations generated by large language models (LLMs). The research problem is well motivated and has clear practical implications. The methodology, based on analyzing distributions of recommendations generated by LLMs, is clearly explained and systematically applied.

The manuscript is interesting and has potential, but it requires further development before it can be considered for publication. 
1. The literature review is primarily descriptive and lacks a more critical discussion that would clearly highlight the research gap.
2. The proposed methodology allows only for a static assessment of bias; it does not capture the sequential or iterative effects that may arise from users’ previous interactions with the model.
3. The discussion section remains relatively limited, with insufficient reflection on the assumptions, limitations, and generalizability of the findings.
4. No sensitivity analysis or exploration of alternative experimental settings is provided, which would strengthen the robustness of the conclusions.

Overall, this is a promising contribution. With the above improvements, the manuscript would provide a stronger and more comprehensive basis for advancing research on fairness and bias in LLM-based recommendation systems.

Author Response

Dear Reviewer, thank you for taking the time to review our manuscript and for providing constructive feedback.

The manuscript addresses an important and timely topic related to bias and fairness in travel recommendations generated by large language models (LLMs). The research problem is well motivated and has clear practical implications. The methodology, based on analyzing distributions of recommendations generated by LLMs, is clearly explained and systematically applied.

The manuscript is interesting and has potential, but it requires further development before it can be considered for publication. 
1. The literature review is primarily descriptive and lacks a more critical discussion that would clearly highlight the research gap.

 

Response: 

Thank you for your helpful observation. We revised the introduction by adding critical synthesis that highlights the gap. The new text explains that most studies look at only one bias, one user group, or one model, which leaves four blind spots: cross-bias comparability, system-level effects, controlled comparisons of current models to separate prompt from model effects, and a link from audits to governance and re-ranking. It also sums up what the field already agrees on (popularity and geographic clustering, WEIRD-leaning cultural framings, and demographic disparities) and points out the lack of shared public benchmarks and standard metrics that would allow fair comparison. These additions add critique and highlight our holistic approach: the first audit across six bias families and the proposed public-interest re-ranking layer.


2. The proposed methodology allows only for a static assessment of bias; it does not capture the sequential or iterative effects that may arise from users’ previous interactions with the model.

Response: 

We agree and now state this limitation clearly in Section 4.1. The added text explains that our audit is cross-sectional with fresh sessions, so it does not capture path dependence, carryover, or preference shaping from multi-turn or repeat interactions. It clarifies that our reinforcement tests check within-session novelty and should not be read as evidence on longer-run reinforcement or filter-bubble effects, and notes that bias accumulating over time may be underestimated. In the propositions for future research we also outline concrete next steps: longitudinal, multi-turn protocols that reuse prior answers, vary conversation depth, and track changes across sessions and model updates, along with a Prompt × Bias factorial design to test interactions of wording, constraint specificity, and language. The point of this methodology was precisely to eliminate all elements that could add extra bias: cookies, conversation history, and ‘memory’ were all disabled, and the use of a VPN and IP changes before each persona were important steps to ensure the experiment was conducted in a controlled environment. We also suggest qualitative assessments (such as human evaluations) in the propositions for future research, which can consider longer conversational chains that are more representative of real-world use. The nature of this study is holistic and quantitative rather than qualitative and deep; ideally, future research should combine both approaches.


3. The discussion section remains relatively limited, with insufficient reflection on the assumptions, limitations, and generalizability of the findings.

Response: 

Thank you for this helpful comment. We have expanded the Discussion to make the assumptions, limits, and scope clear. At the start of Section 4 we now state that the effects are aggregate and may be subtle in a single interaction. In Section 4.1 we added a paragraph explaining the static, cross-sectional design, why this can miss path dependence and longer-run reinforcement, and that our use of Hofstede scores, Western-oriented tourism lists, English prompts, and two specific model builds limits coverage. In Section 4.2 we note that the findings are a time-stamped snapshot under chat-only settings and should not be assumed to hold for other languages, tool-enabled modes, or later versions. In Section 4.4 we call for periodic re-audits before generalizing. Together these additions provide the reflection on assumptions, limitations, and generalizability that you requested.

 


4. No sensitivity analysis or exploration of alternative experimental settings is provided, which would strengthen the robustness of the conclusions.

 

Response:

Thank you for this helpful point. We agree that formal sensitivity analysis would strengthen robustness. In this revision we made our design choice explicit and tightened the limitations. In Methods we now state that we fixed all settings to vendor defaults and did not vary temperature, top-p, system prompts, safety filters, tools, retrieval, or language, which improves internal consistency but means we did not test alternative experimental settings. In Limitations we added that we did not vary popularity list cutoffs, distance metrics, cultural frameworks, decoding parameters, or language, so our estimates should be read as baseline values for this configuration and a full robustness study will require reruns that change these choices one at a time. Where possible we reported both mean and rank-order associations to reduce dependence on a single statistic, but the paper’s goal is a broad, cross-bias audit rather than parameter sweeping. We will pursue the suggested sensitivity analyses in follow-up work.
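To make the envisaged one-at-a-time robustness reruns concrete, here is a minimal sketch; collect_recommendations is a hypothetical wrapper around whichever chat interface is used, and the parameter grid is illustrative only, not the configuration reported in the manuscript.

```python
# Illustrative sketch of a one-at-a-time sensitivity rerun.
# collect_recommendations() is a hypothetical wrapper, not an existing API.
from typing import Callable

DEFAULTS = {"temperature": 1.0, "top_p": 1.0, "language": "en"}

# Vary one setting at a time while keeping all others at the vendor defaults.
VARIATIONS = {
    "temperature": [0.2, 0.7, 1.0],
    "top_p": [0.5, 0.9, 1.0],
    "language": ["en", "es", "zh"],
}

def sensitivity_sweep(collect_recommendations: Callable[..., list], prompt: str):
    results = {}
    for setting, values in VARIATIONS.items():
        for value in values:
            config = dict(DEFAULTS, **{setting: value})
            # Re-run the same neutral prompt under the altered configuration.
            results[(setting, value)] = collect_recommendations(prompt, **config)
    return results  # downstream: recompute each bias metric per configuration
```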

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for this interesting contribution.

The manuscript develops a critical and relevant issue by critically examining the bias in the AI-driven traveling recommendation systems through large language models.

Basically, it examines biases in large language models (LLMs) applied to AI-based travel recommendation systems that involve the use of such models as ChatGPT and DeepSeek. They define six major types of bias, namely popularity, geographic, cultural, stereotype, demographic, and reinforcement bias and demonstrate how travel recommendations are often loaded with them in favor of mainstream, culturally close and stereotyped destinations.

DeepSeek shows more recommended places that are domestic and cultural, with more intense stereotype and demographic bias, whereas ChatGPT shows quite the opposite, with suggestions that are more regionally varied and diverse.

The study highlights the scale-up of biases in society by LLMs and considers transparent, inclusive and sustainable AI design (such as stakeholder involvement, bias mitigation throughout, and layers of public-interest re-ranking) to enhance fairness and equality in the travel suggestions.

 To make the paper stronger, the authors are suggested to take into consideration the following:

1. Explain Bias Definitions and Measures: Generally, six forms of bias are discussed, although each needs more explicit operational definitions and standard metrics to enhance reproducibility and interpretability by the reader. It might be helpful to give instances of the ways positive or negative biases express themselves through recommendations.

2. Provide More Methodological Information: The experimental design using a controlled persona-based study is sound but could stand to give more information on the selection of personas, guidance on prompt use, and statistics so that everything can be replicated.

3. Improve Reporting Results: Certain numbers and tables might be clarified to make them easier to see and understand. It may be worth using visual summaries or heat maps to depict the pattern of geographic and cultural biasness.

4. Further Overview of Limitations: The limitations section in the paper recognizes the limitations of data access but needs to elaborate more on possible biases brought out by the training data, model architecture, and the scope of the evaluation. Limiting the effect of these limitations on the generalizability would make the study stronger.

5. Extend to Future Research Directions: Recommendations on future work may be extended to multilingual bias recognition, a study of bias in the real world, and inclusion of intersectional demographic aspects so as to represent conflicting relationships between biases.

6. Stress Practical implications: The idea of a layer of public-interest re-ranking is a good one, and explaining its design, plausibility, and pitfalls, would be useful to most practitioners (I would wager).

7. Better Language and Flow: There are parts where the language can be more streamlined; more precise to ensure easier reading and better interest level in the introduction and discussion.

Taken together, such enhancements would make the manuscript more rigorous, clear, and relevant, making it more capable of informing the building of fairer and more inclusive travel recommendation systems with artificial intelligence.

Kindest regards

Comments on the Quality of English Language

The general quality of the English language is professional and clear and is fit to an academic audience. Nevertheless, there are certain complex sentences and they can be simplified to have a better flow and readability:

  • There is appropriate use of jargon and some technical terms used occasionally and these can be complimented with some brief explanation to reach a wider spectrum.  
  • Some minor grammatical inaccuracy and clumsy wording can be found every here and there, which could be fixed through thorough proof-reading.  
  • The breaks between parts are occasionally harsh; there could be a greater use of smoother transition sentences.  
  • Monotonicity of the use of some specific terminology (e.g., bias, recommendations) can be changed to increase interest.  
  • The abstract and introduction can use some trimming and focus in order to make readers interested within a shorter period of time.  
  • Captions of figures and tables could be described and explained more clearly.

 In general, the manuscript would profit by being given a comprehensive language edit to smooth style, teach and flow, without making technical changes.

Author Response

Dear Reviewer, thank you for taking the time to review our manuscript and for providing constructive feedback.

To make the paper stronger, the authors are suggested to take into consideration the following:

  1. Explain Bias Definitions and Measures: Generally, six forms of bias are discussed, although each needs more explicit operational definitions and standard metrics to enhance reproducibility and interpretability by the reader. It might be helpful to give instances of the ways positive or negative biases express themselves through recommendations.

Response: 

Thank you for your comment. We rewrote Section 2.5 Data Analysis Procedure so that each bias has a clear operational definition, literature references for the different metrics used for each bias, and a detailed and precise description of how we measure it. These changes make the constructs concise, the metrics explicit, and the findings easier to interpret at a glance. They also make the study easier to replicate.

 

 

2. Provide More Methodological Information: The experimental design using a controlled persona-based study is sound but could stand to give more information on the selection of personas, guidance on prompt use, and statistics so that everything can be replicated.

 

Response: 

Thank you for the suggestion. Within the available space we strengthened three parts of Methods to support replication. In 2.1 (Research Design) we added a brief justification for the controlled persona-based experiment, noting that LLM outputs are sensitive to user attributes and prompt framing, and that using personas with a standard three-prompt chain and fresh sessions improves internal validity and replicability while reflecting typical user behavior. In 2.2 (Persona Construction) we clarified how personas were selected: origins based on outbound volume and cultural coverage, age bands (25/45/65), inclusion of female/male/non-binary identities for fairness probes, and three interest themes; we also state that prompts were issued in English to avoid translation effects. In 2.3 (Prompting Protocol) we specify the exact three-step sequence, fresh session per persona–model pair, a single regeneration if fewer than five destinations are returned, and simple refusal handling. Our statistical approach in 2.5 already identifies the estimands and models for each bias family; given length constraints we kept this concise but sufficient for replication.

3. Improve Reporting Results: Certain numbers and tables might be clarified to make them easier to see and understand. It may be worth using visual summaries or heat maps to depict the pattern of geographic and cultural biasness.

Response: 

Tables and figures were clarified to make them easier to understand. No additional visual summaries were created, but further guidance on reading the tables was added to the Results section to help readers interpret the data. Some tables and figures were improved for easier comprehension, and colour-coding explanations, with thresholds for each colour, were added to the heatmap tables.

4. Further Overview of Limitations: The limitations section in the paper recognizes the limitations of data access but needs to elaborate more on possible biases brought out by the training data, model architecture, and the scope of the evaluation. Limiting the effect of these limitations on the generalizability would make the study stronger.

Response: 

We agree and have expanded the Limitations to address potential biases from training data and alignment, sensitivity to model architecture and decoding, and the scope boundaries of our evaluation. We also state the steps taken to limit over-generalization and recommend targeted replications across languages, time, and deployment settings. These additions clarify where bias may originate and how our design constrains claims about external validity.

 

5. Extend to Future Research Directions: Recommendations on future work may be extended to multilingual bias recognition, a study of bias in the real world, and inclusion of intersectional demographic aspects so as to represent conflicting relationships between biases.

Response: 

Thank you for this helpful suggestion. We added three extensions exactly as requested: (1) Multilingual bias: we now call for repeating the audit in multiple languages (including right-to-left scripts) and comparing native-language, English, and simple machine-translated prompts to see whether the same patterns hold beyond English. (2) Real-world study: we now propose small live pilots or A/B tests with tourism boards or travel sites to observe what people actually click and choose, alongside simple outcomes like visitor flow, CO₂, and crowding. (3) Intersectional aspects: we now suggest looking at combinations of gender, age, origin, and interest theme to find where disparities are strongest and to make any trade-offs between bias types explicit.

 

6. Stress Practical implications: The idea of a layer of public-interest re-ranking is a good one, and explaining its design, plausibility, and pitfalls, would be useful to most practitioners (I would wager).

Response: 

Thank you for the suggestion. We expanded our Implications section and added brief, practitioner-oriented details on design: a simple two-step flow with a public scoring sheet and plain-language explanations, built on existing parameters and data, and aligned with how platforms already sort results. These additions explain how a public-interest re-ranking layer can be built and governed in a more practical way.
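As a purely illustrative companion to this response, a minimal sketch of what such a public-interest re-ranking step might look like is shown below; the field names, weights, and scoring formula are our own hypothetical placeholders under stated assumptions, not the scheme proposed in the manuscript.

```python
# Illustrative sketch of a public-interest re-ranking step.
# Weights, field names, and the scoring formula are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    relevance: float         # model's original relevance/ranking signal, 0..1
    novelty: float           # higher if rarely recommended (off-list), 0..1
    crowding_penalty: float  # higher for overtouristed destinations, 0..1

def public_interest_score(c: Candidate, w_rel=0.6, w_nov=0.3, w_crowd=0.1) -> float:
    # Blend the platform's relevance signal with public-interest terms.
    return w_rel * c.relevance + w_nov * c.novelty - w_crowd * c.crowding_penalty

def rerank(candidates: list[Candidate], top_k: int = 5) -> list[Candidate]:
    # Keep the top-k candidates after re-scoring, as in a five-item list.
    return sorted(candidates, key=public_interest_score, reverse=True)[:top_k]
```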

 

7. Better Language and Flow: There are parts where the language can be more streamlined; more precise to ensure easier reading and better interest level in the introduction and discussion.

Response: 

Language was improved in many parts of the article to make the flow better and more streamlined.

 

Taken together, such enhancements would make the manuscript more rigorous, clear, and relevant, making it more capable of informing the building of fairer and more inclusive travel recommendation systems with artificial intelligence.

 

Kindest regards

 

 

Comments on the Quality of English Language

The general quality of the English language is professional and clear and is fit to an academic audience. Nevertheless, there are certain complex sentences and they can be simplified to have a better flow and readability:

Response: English was improved in the revised version of the manuscript

 

There is appropriate use of jargon and some technical terms used occasionally and these can be complimented with some brief explanation to reach a wider spectrum. 

Response: Jargon use was noted and re-checked.

Some minor grammatical inaccuracy and clumsy wording can be found every here and there, which could be fixed through thorough proof-reading. 

Response: Proofreading was done to address some of the inaccuracies and clumsy wording.

The breaks between parts are occasionally harsh; there could be a greater use of smoother transition sentences. 

Response: Smoother transition sentences were used in some cases.

Monotonicity of the use of some specific terminology (e.g., bias, recommendations) can be changed to increase interest. 

Response: Some synonyms were included to break monotonicity.

The abstract and introduction can use some trimming and focus in order to make readers interested within a shorter period of time. 

Response: The introduction was trimmed to accommodate this.

Captions of figures and tables could be described and explained more clearly.

Response: Explanations were added to the figure and table captions, and in-text references to the tables were included.

 In general, the manuscript would profit by being given a comprehensive language edit to smooth style, teach and flow, without making technical changes.

Response: Language was edited in certain cases to improve the style and flow.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

After reviewing the revised manuscript, I confirm that the changes declared by the authors have indeed been implemented, for which I am grateful. In my opinion, the revised manuscript presents a solid and methodologically sound study of bias in travel recommendations generated by LLM. The results appear consistent, well-documented, and grounded in the literature. However, in my opinion, there is still some narrative verbosity and limited discussion of practical implications. Despite this, the article makes a valuable contribution to the audit and evaluation of the ethics of AI recommendations in tourism.

Technical notes

1) I still don't see certain references to tables in the text, for example, there's no reference in the text to Table 1, Table 2, etc. I'm referring to the MDPI guidelines: All figures, schemes, and tables should be inserted into the main text close to their first citation and must be numbered in order of appearance (e.g., Figure 1, Scheme 1, Figure 2, Scheme 2, Table 1, etc.). Example in the text: "As summarized in this analysis (Table 1)..."

Is this an oversight on my part, or are these references actually missing? Please comment.

2) For example, lines 406-417 contain formulas (mathematical formulas)—the meaning of the individual components (elements) of these formulas is not explained.

Is this an oversight on my part, or are these explanations actually missing? Please comment.

 

Author Response

1) I still don't see certain references to tables in the text, for example, there's no reference in the text to Table 1, Table 2, etc. I'm referring to the MDPI guidelines: All figures, schemes, and tables should be inserted into the main text close to their first citation and must be numbered in order of appearance (e.g., Figure 1, Scheme 1, Figure 2, Scheme 2, Table 1, etc.). Example in the text: "As summarized in this analysis (Table 1)..."

Is this an oversight on my part, or are these references actually missing? Please comment.

Reply:

Thank you for your attention to detail. We forgot to add the in-text references in the theoretical and methodological parts of the paper. References and explanations in accordance with MDPI guidelines were added for the missing tables and figures. Specifically:

In-text reference to Table 1 was added: “Table 1 categorises and summarises the most common biases in AI systems.”

In-text reference to Table 2 was added: “see Table 2 for a summary of persona characteristics.”

In-text reference to Table 3 was added: “Table 3 presents prompt templates that provide a synopsis of the prompt chains.”

In-text reference to Table 4 was added: “Table 4 presents the bias assessment framework, detailing variability, metrics, and secondary data for each bias.”

These tables and figures, which are in the Results section of the paper, already had in-text references, and as a result they were not changed:

Table 5, Table 6, Table 7, Figure 1, Table 8, Figure 2, Figure 3, Figure 4, Figure 5, Table 9, Table 10, Figure 6, and Figure 7.

----------

2) For example, lines 406-417 contain formulas (mathematical formulas)—the meaning of the individual components (elements) of these formulas is not explained. Is this an oversight on my part, or are these explanations actually missing? Please comment.

Thank you for noting that we were presenting formulas without defining all components. We have revised this passage to explicitly define every symbol and describe each calculation step in both the cultural and reinforcement bias sections. In line with Reviewer 2’s recommendation to make the paper accessible to math-averse readers and appealing to a broader audience, we also relocated the full technical exposition, including the symbol glossary, derivations, and normalization details, to Appendix 2: Metric Definitions and Formulas, while keeping a concise, intuitive explanation in the main text with clear cross-references. This way, readers interested in the math behind the analysis can find all formulas fully documented in the Appendix, while the methodology section does not overwhelm readers who are less technical or averse to mathematical formulas.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper addresses a very important and timely topic: bias and fairness in travel recommendations generated by large language models. The study is ambitious and impressive, with more than six thousand recommendations tested across 216 traveler profiles. The experimental design is careful, the results are convincing, and the idea of a public-interest re-ranking layer gives the work strong practical relevance.
1. At the same time, the paper is too long (over 30 pages), which makes it harder to read. The presentation could be improved by moving detailed formulas, technical definitions, and long metric descriptions into an appendix. In the main text, it would be better to keep only short explanations supported by clear figures and tables. The introduction and literature review contain useful information, but they are still too descriptive. They should finish with a short summary that explains what is already known, what is missing, and how this paper fills the gap.

2. The methodology is one of the strengths of the paper, but the level of detail makes it difficult for readers who are less technical. A simplified description would make the paper more accessible. The results section provides many tables and statistical tests, but the narrative should highlight the key findings more clearly.

3. The discussion and conclusions are interesting, especially the link to sustainability goals. However, the limitations are not discussed in enough detail. The authors should note that the study includes only two models (ChatGPT-4o and DeepSeek-V3), that it was done only in English, and that the findings might not be the same for other models, prompts, or languages.

Author Response

  1. At the same time, the paper is too long (over 30 pages), which makes it harder to read. The presentation could be improved by moving detailed formulas, technical definitions, and long metric descriptions into an appendix. In the main text, it would be better to keep only short explanations supported by clear figures and tables. The introduction and literature review contain useful information, but they are still too descriptive. They should finish with a short summary that explains what is already known, what is missing, and how this paper fills the gap.

Reply:

Thank you for these helpful suggestions. We agree that the paper reads better when the main text is lighter and carries less technical detail.

What we changed:

Introduction and literature review: Descriptive parts were trimmed. Each section now ends with a short summary that states what is known, what is missing, and how this study fills the gap.

Methods (Sections 2.1–2.4): The study design and persona description were kept concise. The prompting and control text was shortened. Full session controls, parsing rules, and execution settings are now in Appendix A1 Session controls and execution settings. Table 3 shows the prompt templates, and Table 4 gives a simple overview of the metrics. All cited sources were retained.

Section 2.5 (Data Analysis Procedure): This section was simplified to a plain-language summary for each bias family. The formal equations, symbol glossary, smoothing and missing-data rules, deduplication details, and robustness checks were moved to Appendix 2: “Metric Definitions and Formulas”. Each subsection in 2.5 includes a brief intuition and a pointer to Appendix 2 for readers who want the full technical detail. This keeps the section accessible for less technical readers while preserving replicability and deeper understanding, as requested by Reviewer 1. All citations remain in place.

The manuscript is shorter where it was most descriptive. Methods are clearer in the main text, and full technical detail is available in the appendices. Minor typographical issues and table headings were also corrected for consistency. Moreover, the methodology section, and the paper overall, is now more accessible to mathematically averse readers.

-------------

2. The methodology is one of the strengths of the paper, but the level of detail makes it difficult for readers who are less technical. A simplified description would make the paper more accessible. The results section provides many tables and statistical tests, but the narrative should highlight the key findings more clearly.

Reply:

Thank you for this suggestion. The manuscript was revised to make the methods and results easier to read while keeping full transparency for replication.

Methodology (simplified main text, full detail in the appendix).

As mentioned previously, Section 2.5 was rewritten in plain language so that each bias family has a short, non-technical description. All equations, symbol definitions, smoothing choices, missing-data rules, deduplication steps, and robustness checks were moved to Appendix 2: “Data Analysis Details: Equations, Notation, and Robustness.” Sections 2.1–2.4 were tightened, and operational controls and parsing rules were relocated to the appendices with clear pointers in the text.

Results (clearer storyline and lighter prose).

The results narrative now leads with the main takeaways and keeps statistical detail in the tables and figures. Redundant test output in the prose was reduced, captions were made more interpretive, and long matrices remain in the paper only where they aid interpretation. Theme-specific patterns tied to Table 6 are summarized compactly. Also, a new Section 3.7 “Summary of results” was added. It provides a single paragraph that synthesizes the findings across all six bias families and draws concise conclusions.

----------------------

3. The discussion and conclusions are interesting, especially the link to sustainability goals. However, the limitations are not discussed in enough detail. The authors should note that the study includes only two models (ChatGPT-4o and DeepSeek-V3), that it was done only in English, and that the findings might not be the same for other models, prompts, or languages.

Reply:

Thank you for highlighting the need to improve the limitations. We agree and have strengthened Section 4.1 Limitations to (a) explicitly enumerate the study’s scope (two models: ChatGPT-4o and DeepSeek-V3, one build each; English-only prompts; fixed prompt template and default decoding; chat-only mode), and (b) make clear that results may differ for other model families, prompts, languages, builds, and deployment modes. We also added a clarifying sentence in 4.4 Final Remarks reinforcing the time-stamped and language-specific nature of our findings, and cross-referenced 4.3 Suggestions for Future Research for multilingual and prompt-design replications.

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has improved considerably after the first round of reviews. The authors have addressed the main methodological and conceptual issues: the literature review now clearly highlights the research gap, the methodology is better structured, the limitations are explicitly discussed, and the results section has been made more accessible. These changes have strengthened the paper significantly.

At this stage, the remaining issues are primarily editorial and related to presentation. The manuscript is still very long (over 30 pages), which makes it harder to read. The introduction and literature review could be further condensed, with descriptive passages trimmed and only key points retained. The results section should highlight the main findings more clearly, preferably by stating a few key takeaways at the beginning of the section and synthesizing them in the conclusions. The discussion, although improved, could still engage more critically with previous audits and better explain the practical significance of observed differences between models. Finally, while the English is generally good, the text would benefit from another round of editing for clarity and conciseness.

Overall, the paper is close to being ready for publication. I recommend minor revision, with the expectation that the authors will implement these editorial and presentation improvements in the next version. I would like to see the revised manuscript again to verify that these changes have been made before final acceptance.

Author Response

Dear reviewer, 

Thank you for your feedback and your help in making the paper better. 

Comment 1 )At this stage, the remaining issues are primarily editorial and related to presentation. The manuscript is still very long (over 30 pages), which makes it harder to read. The introduction and literature review could be further condensed, with descriptive passages trimmed and only key points retained.

 

Response:

Thank you for this helpful suggestion. We agreed that the opening section was too long and have tightened it substantially for readability. The introduction and literature review sections were cut significantly, from 2733 words to 2240. We removed descriptive passages and repeated definitions, consolidated overlapping background, and kept only the key claims. Extended context that was useful but not essential to the narrative has been removed so that the main text reads faster without losing sources or transparency. We also added a short signpost paragraph at the end of the introduction that states the problem, gap, purpose, and contributions in compact form. All original references are retained and citation numbering is unchanged. We hope the revised section now delivers the core ideas clearly and sets up the results more directly. In Section 1.1.1 the points were reduced significantly, excluding descriptive details that were not essential for presenting the background of each bias; the rationale for each bias family is presented in Table 1 in any case.

Comment 2) The results section should highlight the main findings more clearly, preferably by stating a few key takeaways at the beginning of the section and synthesizing them in the conclusions.

Response:

Thank you for this constructive suggestion. We have added a short claim-first paragraph at the beginning of the Results that states the key takeaways, and we rewrote each subsection opener to foreground the main finding before the supporting details and tables. We also added a concise synthesis to the Conclusions that ties the results to practical implications. These changes aim to make the core findings immediately visible while preserving the full statistical detail in the tables and figures.

 

Comment 3) The discussion, although improved, could still engage more critically with previous audits and better explain the practical significance of observed differences between models.

Response:

Thank you for this helpful suggestion. We revised the Discussion to engage prior audits more directly and to clarify the practical significance of model differences. We now situate our results alongside reported popularity and geographic bias in recommender settings, including recent LLM evidence [15,16,36], WEIRD-centred cultural alignment and tourism stereotypes [17,14,18], and gender-conditioned advice [19]. Where our popularity findings show more exploration, we explain likely reasons such as broader persona coverage, controlled prompt chains, and task differences compared with itinerary-style agents [20,43]. To make impact tangible, we translate percentage gaps into user-level counts per five-item list; for example, +7.9 points off-list at city level is about 0.4 additional off-list items per list, and the 11.8-point domestic gap is about 0.6 additional domestic items per list. We also clarify our contribution relative to prior work: unlike studies that probe a single bias in depth, our audit tests six biases under one controlled protocol to provide a holistic, comparable view of how biases co-occur and trade off [21,45,48]. Building on this breadth, we link findings to actionable governance grounded in the literature, including diversity-aware re-ranking and popularity calibration [15,35], geographic spread and novelty safeguards [21,53], lexical quality checks to reduce promotional cliché density [21,53], and acceptance-aware routing using reliable indicators such as the Global Acceptance Index, with monitoring to avoid exclusionary portfolios [54]. Limitations are stated candidly and aligned with calls for multilingual, multi-model, and longitudinal benchmarks [45,48,53]. All edits preserve the existing citation numbering.
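The per-list conversion quoted above is simple arithmetic (gap in percentage points multiplied by a list length of five); a tiny sketch is shown for clarity, with the helper function being our own illustrative addition rather than anything in the manuscript.

```python
# Convert an aggregate percentage-point gap into expected items per 5-item list.
def extra_items_per_list(gap_in_points: float, list_length: int = 5) -> float:
    return gap_in_points / 100 * list_length

print(extra_items_per_list(7.9))   # ~0.40 additional off-list items per list
print(extra_items_per_list(11.8))  # ~0.59 additional domestic items per list
```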

 

Comment 4) Finally, while the English is generally good, the text would benefit from another round of editing for clarity and conciseness.

Response:

We did another round of edits in the parts that needed improvement.

 

Final comments by the authors: The paper is still about 30 pages long, but we think this is normal for this sort of research. The paper examines six types of bias that, in all other papers we are aware of, are each treated as the subject of an entire paper. Although we understand the depth factor, this approach is more holistic, and in our humble opinion, if we reduce the methodology or results sections any further, the paper will start to lose important details that could impact readers’ comprehension.