1. Introduction
Travelers increasingly rely on the Internet to plan their trips, yet they face an overwhelming volume of fragmented and dynamic information. This phenomenon, characterized by data dispersion across destination guides, blogs, and social media, forces tourists to constantly switch contexts during information seeking (
Xiang et al., 2015). From the perspective of Cognitive Load Theory, this process imposes a high mental cost on the user’s limited working memory (
Sweller, 2011), often resulting in decision paralysis and a suboptimal planning experience (
Grundner & Neuhofer, 2021). To mitigate these challenges, the Smart Tourism Destinations paradigm emphasizes the need to integrate real-time data and contextual services to enhance the on-site experience (
Buhalis & Amaranggana, 2013). This complexity motivates the need for intelligent systems capable of orchestrating heterogeneous sources to provide context-aware, trustworthy, and easily navigable tourism information.
Tourist-oriented AI systems have been explored extensively in both academic and commercial contexts. Foundational research investigated the use of recommender systems and intelligent agents to support trip planning and destination exploration (
Gretzel, 2011;
Gretzel et al., 2012). In parallel, commercial platforms such as TripAdvisor and Google Travel offer context-aware recommendations. However, these systems generally rely on static retrieval and ranking methods, lacking advanced multi-step reasoning and the ability to dynamically combine contextual data sources. These limitations highlight the need for intelligent assistants that can integrate reasoning with external information tools to deliver more personalized and cognitively transparent support.
From a theoretical perspective, travel planning is a complex, multi-criteria decision-making process involving high cognitive load due to the fragmentation of information sources. The concept of Smart Tourism Destinations emphasizes the need for real-time, context-aware data integration to enhance the on-site experience. While standard LLMs offer fluency, they lack the dynamic adaptability required for this ‘smart’ integration. Therefore, the architectural choice of ReAct (Reason + Act) is not merely technical but responds to the theoretical need for a system that can mimic the iterative, non-linear information search behavior of human travelers.
Traditional search and recommendation systems struggle to reconcile heterogeneous and temporally dynamic data—e.g., static attraction descriptions, live event schedules, and geospatial queries—particularly in urban tourism contexts where timely, location-aware suggestions are crucial. Recent efforts in geospatial question-answering systems have sought to address similar challenges through retrieval and structured query generation. For instance, the MapQA framework (
Z. Li et al., 2025) proposes a retrieval-based approach combined with SQL generation to answer map-related queries, offering a relevant foundation for spatial reasoning in tourism contexts. Retrieval-Augmented Generation (RAG) is a pragmatic approach for such applications because it grounds language model outputs in curated, domain-specific knowledge sources, thereby improving factual consistency and reducing hallucinations (
Oche et al., 2025). Recent research in domains such as manufacturing and pharmaceuticals has shown that agentic RAG systems—which combine retrieval with reasoning architectures like ReAct to chain multiple retrieval and action steps (
J. Wang et al., 2025)—significantly improve accuracy and multi-step query resolution. Despite these advances, the systematic application of ReAct-augmented RAG in tourism remains largely unexplored.
Motivated by this gap, we developed an agentic RAG assistant for Valencia (Spain) that orchestrates three specialized agents: (1) a Retrieval Agent for static and semi-structured documents using hybrid search; (2) an Events Agent for near-real-time event ingestion; and (3) a Geospatial Agent that answers location-based queries using OpenStreetMap. The orchestration follows the LangChain (
LangChain, n.d.) ReAct paradigm to interleave reasoning and tool use, enabling the system to handle temporal, spatial, and semantic aspects of travel queries within a single conversational interface—akin to a knowledgeable local guide offering personalized advice.
This study is framed as a real-world case study within the CitCom.ai (
CITCOMTEF, n.d.) Testing and Experimentation Facility (TEF), a European project that emphasizes the use of trustworthy AI in urban environments. Accordingly, our contribution focuses on practical engineering lessons and empirical observations from the development and prototyping of a RAG-based tourism assistant, rather than exhaustive benchmarking. Concretely, the core contributions are as follows:
Impact of Coreference Resolution on Grounding: We demonstrate that applying coreference resolution to heterogeneous tourism data is a critical step to mitigate semantic ambiguity and hallucinations in RAG systems.
Evaluation of the System: We provide empirical evidence that an agentic ReAct architecture, while slightly trading off strict faithfulness, significantly enhances ‘Answer Relevancy’ compared to static RAG, aligning better with the dynamic nature of tourist inquiries.
Operational Framework: We propose a scalable, open source framework (using Mistral Small) that successfully integrates static (cultural), temporal (events), and spatial (OSM) data, offering a blueprint for sovereign AI deployment in Smart Cities.
The remainder of the paper is organized as follows:
Section 1.1,
Section 1.2 and
Section 1.3 provide a concise background on language models, RAG systems, and agent orchestration.
Section 2 describes the methodology, including data collection, preprocessing, model and tool choices, and implementation details.
Section 3 presents the RAGAS-based evaluation and qualitative examples from the prototyping.
Section 4 discusses engineering lessons, limitations, ethical considerations, and operational trade-offs.
Section 5 concludes and outlines concrete next steps for expanding evaluation and reproducibility.
1.1. Brief History and Development of Language Models
Figure 1 illustrates the chronological evolution of language models. Early approaches were based on statistical methods that estimated word and phrase probabilities from large corpora (
Jelinek, 1998;
Rosenfeld, 2000;
Stolcke, 2004), but were limited by context length. Neural Language Models later employed neural networks to capture more complex representations and semantic relationships in language (
Bengio et al., 2003;
Kombrink et al., 2011;
Mikolov et al., 2010). The Transformer architecture (
Vaswani et al., 2023) enabled Pretrained Language Models (PLMs), which first learn general linguistic structures from large unlabeled text corpora and are subsequently fine-tuned for specific NLP tasks (
Devlin et al., 2019;
Radford et al., 2019). Modern Large Language Models (LLMs), such as GPT-3 (
Brown et al., 2020) and GPT-4 (
OpenAI et al., 2024a), leverage billions of parameters and extensive training data, combining large-scale pretraining with alignment to human instructions to achieve strong adaptability across diverse tasks (
Radford et al., 2019). Recent advances explore sparse Mixture of Experts (MoE) architectures, such as Mixtral 8 × 7B (
A. Q. Jiang et al., 2024), which activate only a subset of feedforward experts per token, improving efficiency while maintaining high performance. Collectively, these models represent major milestones in language technology, supporting advanced reasoning and generation capabilities, as utilized in the proposed RAG-based agent.
1.2. RAG Systems: Principles and Practical Choices
The Retrieval-Augmented Generation (RAG) methodology enhances LLMs by integrating external knowledge, overcoming their limitations in accessing current information. RAG retrieves relevant document fragments from external sources, which are combined with the original query to formulate enriched questions, enabling the model to generate informed responses (
Figure 2). This approach synergistically merges information retrieval with in-context learning, providing crucial context without requiring model fine-tuning, and has become foundational for many conversational systems. The workflow involves three stages: corpus segmentation and indexing via an encoder, retrieval of fragments based on similarity to the query, and synthesis of responses conditioned on the retrieved context.
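The three-stage workflow can be sketched end to end on a toy corpus. In this illustration the bag-of-words "encoder" stands in for the neural embedding model, and the chunk texts are invented for the example:

```python
import math
import re
from collections import Counter

# Toy stand-ins for the segmented corpus (contents invented for illustration).
CHUNKS = [
    "The City of Arts and Sciences is a cultural complex in Valencia.",
    "Paella valenciana is the traditional rice dish of the region.",
    "The Turia Gardens are a park built in the old riverbed of the Turia.",
]

def embed(text: str) -> Counter:
    """Stage 1 encoder stand-in: bag-of-words counts (real systems use a neural encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: index the segmented corpus.
INDEX = [(chunk, embed(chunk)) for chunk in CHUNKS]

def retrieve(query: str, k: int = 2) -> list:
    """Stage 2: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    """Stage 3: enrich the query with retrieved context before generation."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The enriched prompt produced by `build_prompt` is what the generator finally receives in place of the bare user query.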
Technological advancements have refined RAG along three axes: what, when, and how to retrieve and utilize information. Retrieval has evolved from individual tokens (
Khandelwal et al., 2020) and entities (
B. Li et al., 2022) to more structured representations, including text chunks (
Ram et al., 2023) and knowledge graphs (
Kang et al., 2023), balancing granularity with precision and efficiency. Strategies for when to retrieve range from single retrievals (
Shi et al., 2023;
Y. Wang et al., 2023) to adaptive (
Huang et al., 2025;
Z. Jiang et al., 2023) and multiple retrieval methods (
Izacard et al., 2022), trading information richness for computational efficiency. Techniques for how to integrate retrieved data span input-level (
Khattab et al., 2023), intermediate (
Hoffmann et al., 2022), and output-level (
L. Wang et al., 2025) integration, with trade-offs in effectiveness, training complexity, and efficiency.
The evolution of RAG can be divided into four phases. The initial phase, starting in 2017 with the Transformer architecture (
Vaswani et al., 2023), focused on incorporating additional knowledge into pre-trained models (PTMs) to enhance language modeling, emphasizing improvements in pretraining methods.
1.3. Agents and Orchestration Frameworks
In artificial intelligence, an agent is an autonomous system that perceives its environment and acts toward defined goals. Within LLM-based systems, agents operate through a reasoning-acting cycle, invoking external tools such as APIs, databases, or search engines as needed. This paradigm, in which the language model learns to select and utilize external tools to enhance its capabilities, is foundational to tool-augmented LLMs (
Schick et al., 2023). Complex applications often employ multi-agent frameworks, where specialized components collaborate to solve multi-step queries (
Park et al., 2023). LangChain (
LangChain, n.d.) is a widely used framework that provides modular components—including chains, agents, and tools—for orchestrating such pipelines. Building on these orchestration frameworks, ReAct (Reasoning and Acting) (
S. Yao et al., 2023) introduces an advanced methodology that combines logical reasoning with tool-based actions, enabling iterative problem-solving for more accurate and complete responses. The ReAct process involves (1) receiving a user query, (2) reasoning to determine necessary information, (3) acting using external tools to search, retrieve, and process data, (4) iterating reasoning and actions if the initial attempt is insufficient, and (5) generating a final, precise response.
Figure 3 illustrates this workflow, which integrates reasoning traces and actions to refine the internal context while responding adaptively to external observations. ReAct RAG extends traditional RAG systems by embedding a ReAct agent within the retrieval-augmented generation loop, enhancing response accuracy through stepwise reasoning and interaction with multiple documents and tools, thereby producing more precise and detailed outputs than single-step retrieval methods.
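The five-step ReAct cycle can be reduced to a minimal control loop. In the sketch below the reasoning step is stubbed with keyword rules and the tools return canned strings, purely to expose the flow; in the real system both the reasoning and the tools are delegated to the LLM and external services:

```python
from typing import Callable, Dict, List, Optional

# Canned tools standing in for retrieval, events, and geospatial lookups.
TOOLS: Dict[str, Callable[[str], str]] = {
    "retrieval": lambda q: "The Central Market is a Modernist landmark.",
    "events": lambda q: "Las Fallas runs 15-19 March.",
    "geo": lambda q: "Malvarrosa Beach lies 4 km from the centre.",
}

def reason(query: str, observations: List[str]) -> Optional[str]:
    """Step 2: decide which tool is still needed; None means we can answer."""
    if "event" in query and not any("March" in o for o in observations):
        return "events"
    if "beach" in query and not any("Beach" in o for o in observations):
        return "geo"
    if not observations:
        return "retrieval"
    return None

def react(query: str, max_steps: int = 4) -> str:
    """Steps 1-5: receive the query, then iterate reason -> act -> observe."""
    observations: List[str] = []
    for _ in range(max_steps):
        tool = reason(query, observations)
        if tool is None:
            break
        observations.append(TOOLS[tool](query))   # Step 3: act via a tool
    return " ".join(observations)                 # Step 5: final synthesis (stubbed)
```

The `max_steps` bound corresponds to step (4): the agent may iterate several reason–act rounds before it deems the gathered observations sufficient.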
2. Materials and Methods
As mentioned in the Introduction (Section 1), all the data handled in this work pertain to tourist information about the Municipality of Valencia (Spain). Consequently, the optimization of the algorithm is based on the relevant data collected about this city.
The entire project has been developed taking advantage of the open source libraries and frameworks available today. To create a custom application driven by a language model, we leverage LangChain (
Topsakal & Akinci, 2023), an open source library that enables developers to connect language models with external data sources and applications. Each of the libraries used is described in the section corresponding to its use.
2.1. Data Collection
In this section, we describe the data collection process and the sources used to collect information for the development of our tool.
Offering tourists useful, reliable, and up-to-date advice requires, as a first step, a detailed search for informational material on the city’s official channels. The most important of these is Visit Valencia, a non-profit foundation in which the Valencia City Council, the Chamber of Commerce, Feria Valencia, the Valencian Business Confederation, Turismo Comunitat Valenciana, and the Tourist Board of the Provincial Council participate, together with most local tourism companies. Its objective is the strategic management and promotion of the city of Valencia in the tourism sector, with a professional approach that combines public and private interests.
The official Visit Valencia portal (
Visit Valencia, 2025) provides access to downloadable resources in Portable Document Format (PDF), including five different official tourist guides, each targeting a different type of traveler. In addition to the general city guide, more specific guides are available, such as “Guide to Valencia in three days” for short itineraries, “Valencia with your family” for trips with children, or guides focused on romantic getaways, sports activities, and nightlife. Leveraging these authoritative documents allows us to incorporate updated and verified content into our tool, enhancing the credibility of the information provided to users.
In parallel, we employed web scraping techniques to extract data from official and reputable sources, including Wikipedia articles on historical references to the city, selected blogs, and event agendas. All data collection was performed after obtaining the necessary permissions and in strict compliance with ethical and legal standards. In particular, we adhered to the corresponding robots.txt protocols and the Terms of Service of all indexed websites, ensuring that data extraction was both legal and non-intrusive. By combining heterogeneous data from these authorized sources, we built a comprehensive dataset comprising 84 documents covering attractions, monuments, cultural heritage, gastronomy, shopping, nature, sports activities, tourist recommendations, practical Valencia travel tips, and upcoming events. This approach aligns with our main objective of unifying diverse, reliable, and officially sanctioned sources to provide tourists with accurate and trustworthy guidance when planning or exploring Valencia’s vibrant cultural landscape.
Finally, the knowledge base is composed entirely of texts in Spanish, ensuring maximal relevance from local sources and standardization for the integration phase. Each document is stored separately according to its origin, facilitating traceable and reliable retrieval. The dataset comprises a total of 54,400 tokens (approximately 100 pages), representing a pragmatic compromise for the CitCom.ai prototyping environment. It encompasses 443 dining establishments and restaurant chains, 715 distinct tourist attractions, and more than 10 event categories curated by the Events Agent. Coverage is geographically balanced, with a strong concentration in key historical districts such as Ciutat Vella, Eixample and Russafa, while also including peripheral areas such as El Saler and El Palmar within the Albufera Natural Park. This distribution reflects a deliberate and focused scope for the initial evaluation phase.
As noted above, RAG combines the power of an LLM with external data. If the dataset contains conflicting or redundant information, retrieval may struggle to provide the correct context, potentially leading to suboptimal generation by the LLM. To mitigate this, rigorous data integration and cleaning procedures were applied across all sources to harmonize content and eliminate redundancies or inconsistencies, ensuring a reliable and high-quality knowledge base for downstream processing.
2.2. Preprocessing
The pipeline continues through the cleaning process to remove noise, encoding errors, and duplicates. Text is normalized via case folding, punctuation standardization, and number handling, then tokenized using subword methods (e.g., BPE, WordPiece). Optional linguistic steps, such as lemmatization or stop-word removal, may further refine context.
The first step was to convert all documents from heterogeneous sources (e.g., databases, PDFs) to plain text (TXT). Paragraphs were delimited by carriage returns to organize file contents, yielding human-readable text represented as a simple character sequence.
Preprocessing applied to guides differed slightly from that applied to scraped text. Titles, subtitles, indexes, page numbers, legends, hyperlinks, and paragraph summaries next to illustrations were removed, retaining only the main body of the text. Each paragraph is separated by a carriage return, and blank lines were removed. For extracted text, additional normalization was performed: for instance, typical numbers and words in parentheses were removed from Wikipedia articles, and tables and references were excluded, leaving only pure text.
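A compressed version of these cleaning rules can be written as a single normalization pass. The regular expressions below are illustrative approximations of the steps described (reference markers, parenthesized asides, collapsed whitespace, blank-line removal), not the exact production pipeline:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Illustrative normalization pass over scraped text (e.g., Wikipedia extracts)."""
    text = unicodedata.normalize("NFC", raw)   # repair decomposed accented characters
    text = re.sub(r"\[\d+\]", "", text)        # strip reference markers like [12]
    text = re.sub(r"\([^()]*\)", "", text)     # drop parenthesized numbers and asides
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    text = re.sub(r"\n{2,}", "\n", text)       # remove blank lines, keep paragraph breaks
    return text.strip()
```

Case folding and stop-word removal, mentioned earlier as optional steps, would be applied after this pass if the downstream encoder benefits from them.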
To better integrate information from all sources, Coreference Resolution (CR) (
Ng, 2010) was applied. CR identifies all linguistic expressions (mentions) referring to the same real-world entity, replacing pronouns with noun phrases to avoid ambiguities that could lead to hallucinations in the RAG system.
Given the lack of an open source coreference resolution tool capable of processing Spanish efficiently, this task was delegated to GPT-4o. The following prompt was used to perform CR on tourism-related textual data prior to their use in the RAG system:
You are a highly capable NLP assistant specializing in coreference resolution. Given documents containing information about Valencia’s tourist attractions, restaurants, events, and cultural heritage, identify all expressions that refer to the same entity (including pronouns, definite descriptions, and repeated mentions) and rewrite the text so that each entity is consistently referenced with a single canonical form. Preserve the original meaning, context, and factual details, including locations, events, and services. Do not introduce new information or modify existing facts.
This approach significantly streamlined the preprocessing workflow: GPT-4o was able to generate a reformulated version of the original texts (
Figure 4) within seconds. Although several open source CR tools exist, most are either designed for English or show limited performance on heterogeneous Spanish corpora, such as tourism documents with mixed registers. GPT-4o was chosen for its high-quality, context-aware resolution across diverse sources, minimizing errors that could propagate to the RAG system. While using a closed model introduces additional computational cost, the substantial gains in speed, reliability, and preprocessing consistency justified this trade-off for the prototyping phase.
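Operationally, the prompt above can be sent to GPT-4o through the OpenAI chat completions client. The wrapper below is a sketch: the function names are ours, the system prompt is abbreviated to the text given earlier, and an OPENAI_API_KEY is assumed for the actual call:

```python
# Abbreviated form of the coreference-resolution prompt quoted in the text.
CR_SYSTEM_PROMPT = (
    "You are a highly capable NLP assistant specializing in coreference "
    "resolution. ... Do not introduce new information or modify existing facts."
)

def build_cr_messages(document: str) -> list:
    """Assemble the chat payload; kept separate so it can be tested offline."""
    return [
        {"role": "system", "content": CR_SYSTEM_PROMPT},
        {"role": "user", "content": document},
    ]

def resolve_coreferences(document: str) -> str:
    """Send one document to GPT-4o (requires OPENAI_API_KEY; not run at import)."""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_cr_messages(document),
        temperature=0,   # deterministic rewriting, no creative drift
    )
    return response.choices[0].message.content
```

Setting the temperature to zero keeps the rewriting deterministic, which simplifies the human review step described below.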
Repeating this procedure across all documents enhanced the consistency and quality of the text fragments, thereby improving retrieval performance. As a final step, a rigorous human review was conducted to resolve discrepancies and validate content accuracy, ensuring reliable data for the LLM and increasing confidence in generated responses.
2.3. Embeddings and Storage
Documents were first split into semantically coherent chunks, with overlap to preserve contextual continuity, enabling retrieval of specific tourist information such as attractions, restaurants, or events. Each chunk was converted into dense vector representations using pretrained encoders and indexed in ChromaDB along with metadata (e.g., source, timestamps) for efficient semantic search. Query preprocessing mirrored the corpus normalization, ensuring that user queries could retrieve relevant fragments accurately. Validation included both chunk coherence checks and retrieval accuracy tests, guaranteeing robustness for downstream RAG-based tourism recommendations.
We employed INSTRUCTOR (
Su et al., 2023) to generate 1024-dimensional embeddings for each text fragment, guided by task and domain instructions. Unlike traditional encoders, INSTRUCTOR produces flexible embeddings suitable for diverse tourism content without additional training.
To store the documents as dense vector embeddings, we utilized the open source ChromaDB (
Try Chroma, n.d.) embedding database, which supports nuanced semantic retrieval and facilitates fast, context-aware integration with the RAG pipeline for tourism applications.
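The chunking-with-overlap step can be sketched as follows. The window sizes are illustrative hyperparameters (in production, the ~54,400-token corpus was segmented into 220 chunks), whitespace tokens stand in for subword tokens, and `index_chunks` shows the corresponding ChromaDB insertion without being executed here:

```python
def chunk_with_overlap(text: str, chunk_size: int = 250, overlap: int = 50) -> list:
    """Sliding-window chunking; the overlap preserves context across boundaries."""
    tokens = text.split()                 # whitespace stand-in for subword tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks

def index_chunks(chunks: list):
    """Sketch of the ChromaDB insertion with metadata (requires chromadb)."""
    import chromadb
    client = chromadb.Client()
    collection = client.create_collection("valencia_tourism")
    collection.add(
        documents=chunks,
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        metadatas=[{"source": "visit_valencia"} for _ in chunks],  # illustrative metadata
    )
    return collection
```

At query time the same normalization and embedding are applied to the user question, so that query and chunk vectors live in the same space.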
2.4. Agent Architecture
To create a system resistant to hallucinations and capable of handling complex user queries, we implemented an agent-based architecture (
Figure 5) using the LangChain library. LangChain is a widely used framework that enables the construction of modular, scalable NLP pipelines through the orchestration of LLMs and external tools. It also provides seamless integration with a variety of APIs for retrieving external information. In our implementation, LangChain was used to connect the agent with the Events database and OpenStreetMap, allowing the system to access temporal and geospatial data.
For example, a ReAct agent may receive a query such as “What are the best beaches in Valencia?” The agent first reasons about which tools or strategies to use (Reason), then invokes the appropriate retrieval tools to gather relevant information (Act), and finally generates a detailed, contextually accurate response based on the retrieved data.
The decision to implement a multi-agent architecture is grounded in the principle of ‘Separation of Concerns’. In the tourism domain, data possesses distinct modalities: cultural heritage information is static and semantic, events are temporal and volatile, and locations are geospatial and structured. A monolithic model struggles to reason across these modalities simultaneously. By isolating these capabilities into specialized tools orchestrated by a ReAct agent, we mimic a human concierge who consults a map, a calendar, and a guidebook independently before synthesizing an answer.
For the language model, we used Mistral Small 3.1 (
Mistral AI, 2025), a state-of-the-art open source LLM that powers the reasoning and generation capabilities of the agent. The model comprises approximately 24 billion parameters and is distributed under the Apache 2.0 license, which facilitates its integration into research and experimental environments. It is publicly available through the Hugging Face model hub under the tag mistralai/Mistral-Small-3.1-24B-Base-2503, ensuring transparent access and reproducibility for the research community.
In its latest release, Mistral Small 3.1 introduces an extended context window of up to 128,000 tokens, enabling the model to process long text sequences—such as full documents or concatenated knowledge sources—while maintaining reasoning coherence. Furthermore, this version includes multimodal capabilities (text + image), broadening its applicability to tasks that require the joint interpretation of textual and visual information. From a methodological standpoint, and in alignment with the retrieval-augmented generation (RAG) architecture adopted in this study, the combination of a large context window and strong reasoning abilities is particularly advantageous. It allows the agent to integrate retrieved knowledge chunks, build coherent reasoning chains, and generate contextualized outputs that effectively combine external information with the model’s own inference capabilities.
Additionally, the European origin of Mistral strengthens its relevance to this investigation. Since the present study was conducted within the framework of a European research project, the use of a model developed in Europe aligns with the project’s objectives of promoting technological sovereignty and supporting the regional AI ecosystem.
2.5. Tools
In the context of a ReAct agent, a tool refers to an external capability or function that the agent invokes during its reasoning and acting cycle. While the language model provides reasoning abilities, tools extend its functionality by enabling access to structured data sources, APIs, or specialized operations such as retrieval, geospatial queries, or event management. In this way, tools act as bridges between the agent’s abstract reasoning and concrete actions in the real world, ensuring that responses are grounded, accurate, and contextually enriched.
In our system, we implemented three specialized tools to manage different types of queries and enhance agent responses:
Retrieval Tool obtains relevant context from the document knowledge base to support accurate and informative answers;
Event Tool retrieves current events relevant to the user’s query, ensuring temporal awareness and up-to-date recommendations;
Geospatial Tool filters and ranks results based on geographic location, enabling personalized, location-aware suggestions for tourists.
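These three tools can be represented as a small registry whose natural-language descriptions feed the agent’s reasoning step. The stub implementations and names below are illustrative; the real tools wrap ChromaDB, the events database, and OSM:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str              # read by the agent when choosing an action
    run: Callable[[str], str]

# Stubs only; production tools wrap ChromaDB, the events database, and OSM.
TOOLS = [
    Tool("retrieval", "Static tourism knowledge: attractions, gastronomy, history.",
         lambda q: "retrieved context"),
    Tool("events", "Current and upcoming city events with dates and venues.",
         lambda q: "event listings"),
    Tool("geospatial", "Distances, routes, and nearby points of interest.",
         lambda q: "nearby places"),
]

def tool_manifest() -> str:
    """Tool list injected into the ReAct prompt so the model can pick one by name."""
    return "\n".join(f"{t.name}: {t.description}" for t in TOOLS)
```

Because tool selection is driven by these descriptions, keeping them short and mutually exclusive noticeably reduces routing errors.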
2.5.1. Retrieval Tool
Our main agent tool implements a RAG system to obtain information from the documents described above (
Figure 6). The system consists of several key components:
Vector Database: Documents are stored as dense vector embeddings in the open source ChromaDB (
Chroma Core, 2025), which supports semantic retrieval and efficient access to relevant content.
Embedding Model: We employed INSTRUCTOR XL (
Su et al., 2023) to generate embeddings for both documents and query instructions. Each chunk is paired with task and domain-specific guidelines to clarify its intended application. Unlike earlier specialized encoders, INSTRUCTOR XL produces versatile embeddings across multiple tasks and domains without additional fine-tuning, while remaining parameter-efficient.
Ensemble Retriever: The Ensemble Retriever (
Hambarde & Proença, 2023;
Kuzi et al., 2020) is a hybrid information retrieval approach that combines the BM25 sparse retriever, which excels at exact keyword matching, with a dense retriever that captures semantic nuances and contextual relationships. This hybrid strategy leverages the strengths of both techniques, improving overall retrieval performance.
The dense retriever operates in the embedding space, grouping semantically similar documents to capture contextually relevant information. To efficiently manage tourist guides, we employed a Contextual Compression Retriever. Often, the most relevant information is embedded within lengthy documents containing a lot of irrelevant text. Passing the entire document to the LLM increases computational cost and can reduce response quality. Contextual compression mitigates this by filtering documents based on the query context, returning only the most pertinent chunk. Specifically, we implemented a pipeline applied to all ChromaDB embeddings, which compresses each document and retains only the most relevant fragments using a similarity threshold filter.
By combining these two complementary retrieval mechanisms, the hybrid search achieves a more holistic understanding of relevance, improving both the accuracy and robustness of the retrieval process. Although the Ensemble Retriever introduces additional parameters and hyperparameters that require careful tuning, its advantages in handling both exact keyword matches and semantic nuances make it a highly effective approach for optimizing RAG performance.
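One common way to fuse the two rankings is weighted reciprocal-rank fusion, sketched below. The weight is one of the tunable hyperparameters mentioned above, and LangChain’s EnsembleRetriever applies a similar fusion internally; this standalone version is for illustration:

```python
from typing import Dict, List

def ensemble_rank(sparse: List[str], dense: List[str],
                  w_sparse: float = 0.5) -> List[str]:
    """Fuse BM25 (sparse) and embedding (dense) rankings by weighted
    reciprocal rank; documents ranked high in either list surface near the top."""
    scores: Dict[str, float] = {}
    for ranking, weight in ((sparse, w_sparse), (dense, 1.0 - w_sparse)):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + weight / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document appearing in both lists accumulates score from each, which is why hybrid retrieval rewards passages that match the query both lexically and semantically.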
2.5.2. Geospatial Tool
The Geospatial Tool is based on OpenStreetMap (OSM) (
OpenStreetMap Contributors, 2025) and was developed to enrich the RAG model with up-to-date and detailed geographic information. OSM is a collaborative and continuously updated mapping platform that provides comprehensive data on locations, routes, points of interest, and geographic features worldwide. Integrating OSM enables the model to access accurate and current spatial data, such as location, directions, and topological features, which is essential for tourism-oriented applications.
Furthermore, within the ReAct architecture, the agent can not only retrieve specific data from OSM but also reason about it both before and after each query. This allows spatial information to be incorporated contextually and adaptively, which is critical for complex tasks such as generating personalized routes, describing environments in detail, or answering questions about accessibility, distances, and travel logistics.
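As a concrete illustration, an OSM lookup of this kind can be issued through the public Overpass API. The query builder below is a sketch: the tag/value pair and radius are example parameters, and the network call is isolated in its own function so the builder can be tested offline:

```python
def overpass_query(lat: float, lon: float, radius_m: int = 1000,
                   tag: str = "tourism", value: str = "attraction") -> str:
    """Build an Overpass QL query for points of interest around a coordinate
    (tag/value defaults are examples; OSM tags POIs as e.g. tourism=attraction)."""
    return (
        "[out:json][timeout:25];"
        f'node["{tag}"="{value}"](around:{radius_m},{lat},{lon});'
        "out body;"
    )

def fetch_pois(lat: float, lon: float) -> list:
    """POST the query to a public Overpass endpoint (network call; not run offline)."""
    import requests
    r = requests.post("https://overpass-api.de/api/interpreter",
                      data={"data": overpass_query(lat, lon)}, timeout=30)
    r.raise_for_status()
    return r.json()["elements"]
```

Within the ReAct loop, the agent chooses the coordinate and tag before the call and then reasons over the returned elements, e.g., to rank attractions by walking distance.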
2.5.3. Event Tool
To address tourists’ demand for discovering ongoing and upcoming activities within the city, an event tool was developed. The system integrates information from the official Visit Valencia event calendar, which is regularly updated and publicly available. The collected data are stored in a structured database that allows the model to execute parameterized queries based on attributes such as geographic location, event timing, and category, among others. This approach ensures that the model maintains access to reliable, up-to-date, and contextually relevant information, accommodating the dynamic and time-sensitive nature of urban events.
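A minimal version of such a parameterized query over a structured events store can be sketched with SQLite. The schema and sample rows below are invented for illustration; the production database ingests the Visit Valencia calendar:

```python
import sqlite3
from typing import List, Optional

def build_events_db() -> sqlite3.Connection:
    """In-memory stand-in for the events database (schema illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE events (
        name TEXT, category TEXT, district TEXT, start_date TEXT, end_date TEXT)""")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        [("Fallas", "festival", "Ciutat Vella", "2025-03-15", "2025-03-19"),
         ("Jazz al Palau", "concert", "Eixample", "2025-03-18", "2025-03-18")],
    )
    return conn

def events_on(conn: sqlite3.Connection, date: str,
              category: Optional[str] = None) -> List[str]:
    """Parameterized lookup by date with an optional category filter."""
    sql = "SELECT name FROM events WHERE start_date <= ? AND end_date >= ?"
    params = [date, date]
    if category:
        sql += " AND category = ?"
        params.append(category)
    return [row[0] for row in conn.execute(sql, params)]
```

Using bound parameters rather than string interpolation also guards the tool against injection through adversarial user queries.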
2.6. Evaluation with RAGAS
The evaluation of Retrieval-Augmented Generation (RAG) systems has recently attracted increased attention, leading to the development of several benchmarking frameworks that assess different aspects of retrieval and generation performance. For instance, the BEIR benchmark (
Thakur et al., 2021) provides a standardized suite for evaluating retrieval methods across a wide range of domains and datasets, while the RGB benchmark (
Chen et al., 2023) extends this paradigm by jointly assessing retrieval, grounding, and generation quality in RAG-based large language models. These frameworks offer valuable foundations for comparative evaluation and reproducibility in the field. In this study, we employ the RAGAS (Retrieval-Augmented Generation Assessment) framework (
Es et al., 2025) due to its flexibility, reference-free design, and widespread adoption as the de facto standard for evaluating RAG systems in contemporary research (
Gao et al., 2024). RAGAS allows for efficient, data-driven assessment of both the retriever and generator components, evaluating them independently as well as within an integrated pipeline. This component-wise analysis provides a granular understanding of system performance. Furthermore, RAGAS introduces several key metrics for holistic evaluation, including faithfulness, which measures the factual accuracy of the generated answer against the retrieved context; answer relevancy, which quantifies how well the output answer addresses the user’s query; and context precision and context recall, which assess the quality and completeness of the retrieved documents.
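In code, a RAGAS run over a set of evaluation records looks roughly as follows. The record schema matches what RAGAS expects; the evaluation call itself requires the ragas and datasets packages plus a judge-model API key, so it is kept in a separate function. This is a sketch, not our exact harness:

```python
def make_eval_record(question: str, answer: str,
                     contexts: list, ground_truth: str) -> dict:
    """One evaluation sample in the field layout RAGAS consumes."""
    return {"question": question, "answer": answer,
            "contexts": contexts, "ground_truth": ground_truth}

def run_ragas(records: list):
    """Score faithfulness, answer relevancy, and context precision/recall
    (requires ragas, datasets, and judge-model credentials; not run offline)."""
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (faithfulness, answer_relevancy,
                               context_precision, context_recall)
    dataset = Dataset.from_list(records)
    return evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                      context_precision, context_recall])
```

Each record pairs a question with the generated answer, the retrieved contexts, and (optionally) a ground-truth answer, allowing the four metrics to be computed per sample and then averaged.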
3. Results
We developed a Retrieval-Augmented Generation (RAG) agent based on the Mistral Small 3.1 language model, equipped with the three tools described in
Section 2.5. The central retrieval module employs a vector store containing approximately 54,400 tokens, segmented into 220 chunks, and embedded in a 1024-dimensional space to enable efficient semantic search. Complementary geospatial and event tools provide continuously updated geographic and structured cultural data, while the ReAct-based reasoning loop integrates retrieval and generation to ensure answer relevancy.
3.1. Assessment
Evaluating RAG architectures is inherently challenging, as it requires assessing multiple aspects simultaneously: the retrieval system’s ability to locate relevant and focused contextual passages, the LLM’s capacity to effectively leverage this information, and the overall quality of the generated responses. Following the approach of previous studies (
Niu et al., 2024), we constructed a ground truth (GT) dataset to systematically evaluate our RAG system. Using the Google Gemini API (gemini-2.5-flash model), we automatically generated 994 question–answer pairs from 84 documents containing official tourism information about Valencia. The dataset covers diverse topics such as trip planning, historical and cultural references, traditions, monuments, accommodations, gastronomy, and nightlife. It also includes multi-hop questions that require integrating information from multiple sources. All generated answers were subsequently reviewed and validated by two human experts to ensure factual accuracy, coherence, and consistency. Both the document collection and the question–answer dataset are publicly available in the Zenodo repository at
https://zenodo.org/records/17384690 (accessed on 18 October 2025). Although the RAGAS framework enables evaluation without requiring manually annotated ground truth data, the inclusion of this expert-validated GT dataset enhances the robustness of our assessment and provides an independent benchmark for cross-validating and interpreting the automated RAGAS metrics.
Figure 7 presents several representative examples from the question dataset.
The following section reports the quantitative metrics obtained when evaluating the retriever with the RAGAS framework, using GPT-4o as the judge model for response assessment.
3.2. Retrieval Evaluation
The retrieval component was evaluated in terms of its ability to return relevant contextual passages that directly support the GT answers. As shown in
Table 1, the retriever achieved a Context Precision Mean of 0.714 and a Context Recall Mean of 0.563. These values indicate that, on average, more than 70% of the top-
K retrieved passages are relevant to the query, while the system is able to cover slightly more than half of the reference claims with the retrieved context. For this evaluation, the parameter
K was set to 6, meaning that the first six retrieved passages were considered. These results suggest that the retriever effectively prioritizes relevant documents but still leaves room for improvement in terms of recall, i.e., ensuring that all relevant information is consistently included in the retrieved set.
The metrics reported in
Table 1 were computed for each query in the test dataset and then averaged across all samples.
Context Precision@K: Quantifies the weighted precision of the top-K retrieved items, accounting for their relevance:

$$\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left( \text{Precision@}k \times v_k \right)}{\text{Total number of relevant items in the top } K}$$

where $v_k \in \{0, 1\}$ indicates whether the item at rank $k$ is relevant.
Precision@k: Measures the proportion of true positives among the top-
k retrieved items.
Context Recall: Evaluates the coverage of relevant claims in the reference that are supported by the retrieved context.
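To make these formulas concrete, the sketch below implements them over binary relevance judgments. These are simplified stand-ins for the LLM-judged versions computed by RAGAS, shown only for illustration; the function names and example values are our own.

```python
# Illustrative implementations of the retrieval metrics defined above,
# using binary relevance judgments (1 = relevant, 0 = not relevant).

def precision_at_k(relevance: list[int], k: int) -> float:
    """Proportion of relevant items among the top-k retrieved items."""
    top = relevance[:k]
    return sum(top) / len(top)

def context_precision_at_k(relevance: list[int], k: int) -> float:
    """Rank-weighted precision: Precision@k averaged over relevant ranks."""
    num = sum(precision_at_k(relevance, i + 1) * v
              for i, v in enumerate(relevance[:k]))
    total_relevant = sum(relevance[:k])
    return num / total_relevant if total_relevant else 0.0

def context_recall(reference_claims: list[str], supported: set[str]) -> float:
    """Fraction of reference claims supported by the retrieved context."""
    return sum(c in supported for c in reference_claims) / len(reference_claims)

# Example: ranks 1, 2, and 4 of the top-6 retrieved passages are relevant.
rel = [1, 1, 0, 1, 0, 0]
cp = context_precision_at_k(rel, k=6)  # (1/1 + 2/2 + 3/4) / 3 = 11/12
```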
3.3. Response Evaluation
The goal of this step is to evaluate the response generated by the LLM to ensure that the information returned is accurate, contextually faithful, and appropriate to the user’s query. Using the ground truth question dataset described above, each query was passed to the RAG-based chatbot to generate answers, which were then compared with the GT answers. The comparison process was automated using the RAGAS evaluation framework, enabling a systematic assessment of response quality and facilitating fine-tuning of system parameters and hyperparameters. The specific configuration of hyperparameters used in this study is summarized in
Table 2.
This evaluation process allowed us to detect and analyze hallucinations and factual inaccuracies at the level of individual model responses. For our main experiments, we selected Mistral Small 3.1 (
Mistral AI, 2025) as the primary model and conducted a detailed analysis of its outputs. To contextualize its performance, we also evaluated several alternative setups, including architectural variations and a commercial model (GPT-4o Mini (
OpenAI et al., 2024b)). The architectural variants included: (1) a baseline without retrieval (No RAG), and (2) a standard RAG implementation without the ReAct mechanism.
Table 3 summarizes the results in terms of faithfulness and answer relevancy, computed using the RAGAS evaluation framework. For each query in the test set, these metrics were calculated and then averaged. The results illustrate the comparative performance of the models and configurations, confirming the effectiveness of Mistral Small 3.1 (particularly in its RAG + ReAct architecture) as the core model in our system.
To determine the statistical significance of these differences, we performed an ANOVA analysis on the metric scores. The results indicate significant variation across models for both faithfulness (F = 27.97, p < 0.001) and answer relevancy (F = 137.58, p < 0.001). A Tukey HSD post-hoc analysis further reveals that for faithfulness, the RAG + ReAct configuration with Mistral Small 3.1 underperforms compared to both RAG and GPT-4o Mini, with the latter two being statistically indistinguishable. In terms of answer relevancy, GPT-4o Mini significantly outperforms both No RAG and RAG, but shows no significant difference compared to RAG+ReAct with Mistral Small 3.1. These findings suggest that while the ReAct mechanism may slightly reduce factual accuracy in Mistral Small 3.1, it enhances answer relevancy, and GPT-4o Mini delivers consistently strong performance across both evaluation dimensions.
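The significance analysis above can be reproduced in a few lines. The sketch below uses synthetic per-query scores (group means loosely based on Table 3; spreads and sample sizes are assumptions) rather than the study's actual data, with scipy's `f_oneway` performing the one-way ANOVA; Tukey HSD post-hoc comparisons can then be run with, e.g., statsmodels' `pairwise_tukeyhsd`.

```python
# Hypothetical re-creation of the significance testing described above, on
# SYNTHETIC per-query answer-relevancy scores (not the study's data).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Group means are illustrative values based on Table 3; spreads are assumed.
no_rag = rng.normal(0.643, 0.05, 100)
rag = rng.normal(0.831, 0.05, 100)
rag_react = rng.normal(0.897, 0.05, 100)

# One-way ANOVA across the three configurations
f_stat, p_value = f_oneway(no_rag, rag, rag_react)
significant = p_value < 0.001
# Pairwise Tukey HSD would follow, e.g. via
# statsmodels.stats.multicomp.pairwise_tukeyhsd(scores, group_labels).
```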
Faithfulness Score: Evaluates the degree to which the claims made in a generated response are substantiated by the retrieved context.
Answer Relevancy: Represents the average semantic similarity between the embeddings of the generated answers and those of the reference (ground truth) answer, computed via cosine similarity.
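A minimal sketch of the cosine-similarity computation underlying this score follows; the three-dimensional vectors are illustrative placeholders for real embedding-model outputs.

```python
# Cosine similarity between two embedding vectors, as used for the answer
# relevancy score above. The vectors here are toy placeholders, not real
# embedding-model outputs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

generated = np.array([0.9, 0.1, 0.3])  # embedding of a generated answer
reference = np.array([1.0, 0.0, 0.4])  # embedding of the reference answer
score = cosine_similarity(generated, reference)
```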
3.4. Qualitative Case Study: Complex Decision Support
To demonstrate the user value beyond quantitative metrics, we analyze a complex query scenario reflecting a real-world tourist need: “Esta mañana voy a las Torres de Serranos, quiero comer una paella auténtica cerca y me gustaría saber también si hay algún concierto de jazz esta noche.” (“This morning I’m going to the Torres de Serranos, I want to eat an authentic paella nearby, and I’d also like to know if there’s a jazz concert tonight.”)
A standard RAG system might retrieve generic paella restaurants and a list of jazz clubs, likely failing to link location and time. In contrast, our ReAct agent executes the following trace:
Geospatial Reasoning: Identifies “Torres de Serranos” coordinates via the Geospatial Tool and filters restaurants with the tag “paella” within a 500 m radius.
Temporal Reasoning: Queries the Event Tool filtering by Category = “Music/Jazz” and Date = “Today”.
Synthesis: The agent combines these outputs. It might respond: “Near Torres de Serranos, you can enjoy authentic paella at [Restaurant A] or [Restaurant B] (located 300 m away). Regarding jazz, there is a concert tonight at [Venue X] at 20:00, which is a 15-min walk from the restaurant.”
This scenario validates the system’s ability to reduce decision paralysis by performing cross-domain inference (Space + Time + Gastronomy), a key requirement for intelligent tourism assistance.
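The trace above can be re-created schematically with stubbed tools standing in for the system's geospatial and event components. All coordinates, venues, and events below are illustrative placeholders, not real data or the system's actual tool implementations.

```python
# Schematic re-creation of the ReAct trace: geospatial filtering by radius,
# temporal filtering by category/date, then synthesis. All data are stubs.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Stubbed tool backends (placeholder data)
LANDMARKS = {"Torres de Serranos": (39.4793, -0.3764)}
RESTAURANTS = [
    {"name": "Restaurant A", "tags": {"paella"}, "pos": (39.4805, -0.3770)},
    {"name": "Restaurant B", "tags": {"tapas"}, "pos": (39.4700, -0.3900)},
]
EVENTS = [{"title": "Jazz Night", "category": "Music/Jazz", "date": "Today"}]

def geospatial_tool(landmark: str, tag: str, radius_m: float) -> list[str]:
    lat, lon = LANDMARKS[landmark]
    return [r["name"] for r in RESTAURANTS
            if tag in r["tags"] and haversine_m(lat, lon, *r["pos"]) <= radius_m]

def event_tool(category: str, date: str) -> list[str]:
    return [e["title"] for e in EVENTS
            if e["category"] == category and e["date"] == date]

# The ReAct loop interleaves these observations before synthesizing an answer.
nearby = geospatial_tool("Torres de Serranos", "paella", 500)
concerts = event_tool("Music/Jazz", "Today")
```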
4. Discussion
The present study demonstrates the practical implementation of an agentic RAG system for urban tourism information. The analysis focuses on the impact of dataset design, preprocessing strategies, and model selection on retrieval and generation performance, providing insights into best practices for developing scalable, context-aware AI assistants in real-world scenarios.
The first aspect to consider is the dataset. As discussed in
Section 2, its size reflects a deliberate compromise: although relatively compact, the dataset was carefully curated to maximize coverage of essential tourism information while remaining manageable for thorough manual verification. This choice was intentional, maintaining feasibility and data quality during prototyping and ensuring that the dataset prioritizes high-quality, reliable, and contextually relevant information for the RAG system. The primary goal at this stage was to validate the technical viability and operational coherence of the agentic RAG system in a controlled setting. Once all parameters and hyperparameters are optimized for retrieval and generation, the system can integrate additional sources, such as detailed accommodation offerings, further enriching the knowledge base.
Building upon the curated dataset, preprocessing procedures, including PDF document handling and integration of data from official tourism sources, must be adapted to the structure and format of each information source. Incorporating a diverse set of content helped expand coverage beyond central areas. Notably, the removal of titles and subtitles from each section enhanced retrieval performance by reducing redundancy, enabling the hybrid search to focus solely on relevant chunks. This improvement in context quality simultaneously increases the accuracy, relevance, and factual integrity of responses generated by the language model. Coreference resolution (CR) was another critical step for mitigating hallucinations. In tourism-related texts, frequent use of pronouns and ambiguous references often results in text chunks lacking clear context, which can lead the LLM to incorrectly associate facts with the wrong landmark. By systematically replacing these references with the specific entities they denote, each chunk became a self-contained, factually precise unit, further improving retrieval accuracy and ensuring reliable outputs.
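The effect of this step can be shown with a deliberately simplified substitution. The study performs coreference resolution with an LLM; the hand-written mapping below is only a toy stand-in, with hypothetical chunk text, illustrating why resolved chunks become self-contained retrieval units.

```python
# Toy illustration of coreference resolution making a chunk self-contained.
# The real pipeline uses an LLM for this step; the substitution table below
# is a hand-crafted stand-in and the chunk text is illustrative.
chunk = ("It was completed in 1398 and it is one of the best-preserved "
         "gates of the medieval wall.")

# Hypothetical mapping from ambiguous references to the entities they denote
resolutions = {
    "It was": "The Torres de Serranos gate was",
    "it is": "the Torres de Serranos gate is",
}

resolved = chunk
for pronoun, entity in resolutions.items():
    resolved = resolved.replace(pronoun, entity)
# resolved now names the landmark explicitly, so the chunk can be retrieved
# and grounded without its surrounding document context.
```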
We evaluated different configurations of the Mistral Small 3.1 language model to assess their impact on RAG system performance.
Table 3 summarizes the results, highlighting notable differences in answer relevancy and faithfulness across configurations:
- Without RAG, the model exhibits low answer relevancy (0.643), reflecting its limited ability to retrieve relevant information without external context.
- With simple RAG, the system achieves high faithfulness (0.918) and good relevancy (0.831), demonstrating that retrieved chunks effectively anchor generation and reduce hallucinations.
- With ReAct + RAG, answer relevancy improves (0.897) while faithfulness decreases somewhat (0.858). This configuration produces more direct and contextually aligned answers; however, it introduces a small amount of content not fully supported by the retrieved sources.
These observations are consistent with prior work on reasoning architectures such as chain-of-thought (CoT) and Reason and Act (ReAct) (
Z. Yao et al., 2025), which can enhance fluency and multi-hop reasoning but may complicate the anchoring of outputs to retrieved evidence. From our perspective, the ReAct + RAG configuration represents the most suitable choice for a tourism assistant: its higher
Answer Relevancy ensures responses are useful and contextually aligned, while the slight decrease in faithfulness remains acceptable for providing coherent, actionable guidance.
Beyond these technical results, deploying agentic RAG systems in urban tourism raises important operational and ethical considerations. Maintaining up-to-date information requires continuous dataset updates, particularly for dynamic events and seasonal activities, to ensure reliability. The system’s recommendations may influence tourist flows and local businesses, highlighting the importance of balancing efficiency with sustainable urban tourism practices. Ethical aspects, including potential biases in recommendations, transparency of AI-generated guidance, and equitable treatment of underrepresented groups, must be addressed to ensure trustworthy user experiences. It is important to highlight that the system relies on official tourism sources, which pay particular attention to inclusivity, helping to minimize biases against underrepresented populations.
5. Conclusions
In this work, we present an AI-based conversational system that allows users to query travel information using natural language, while the intelligent agent generates personalized travel plans respecting user-specific constraints. By integrating multiple sources of information into a single interactive interface, the system reduces the need to consult multiple platforms, streamlines trip planning, and enhances the overall traveler experience.
The primary contribution of this study is the development of an agentic architecture integrated within a RAG framework, which orchestrates specialized tools for text, geospatial, and event retrieval. This design enables cohesive semantic, spatial, and temporal reasoning and demonstrates the potential of combining retrieval-augmented generation techniques with task-specific agents to address complex information needs in the tourism domain.
Our evaluation highlights several key insights. The use of ReAct + RAG with Mistral Small 3.1 provides a strong balance between Answer Relevancy and Faithfulness: while the reasoning–action loop slightly reduces strict adherence to retrieved context, it significantly improves the relevance and contextual alignment of responses, making the system more useful for real-world queries. In contrast, configurations without RAG or without ReAct exhibited lower relevance or required stronger grounding to maintain faithfulness, underscoring the importance of combining retrieval and reasoning mechanisms in agentic systems.
The modular and scalable design of the system facilitates the integration of additional types of information, such as local events, seasonal activities, and emerging points of interest, supporting flexible deployment in other cities or domains. Furthermore, the approach provides a foundation for iterative improvements, including dataset expansion, enhanced preprocessing, multilingual support, and the incorporation of smaller, efficient language models with targeted hallucination control to optimize computational efficiency without compromising response quality.
Implications for the Academic Community: Theoretically, this study bridges the gap between Cognitive Load Theory and Agentic AI architectures. While ReAct has been validated in static domains, our results confirm its efficacy in the dynamic, fragmented environment of tourism. We provide empirical evidence that separating concerns via specialized agents (Spatial, Temporal, and Semantic) is a valid architectural pattern to reduce information overload, offering a blueprint for researchers aiming to move beyond monolithic RAG systems in complex decision-support scenarios.
Implications for Managers and Society: From a managerial and social perspective, this system demonstrates how Smart Tourism Destinations can deploy “sovereign AI”. Unlike generic commercial LLMs, the proposed architecture allows public administrations (like the Valencia City Council) to maintain strict control over the knowledge base, ensuring that recommendations align with sustainable tourism goals—such as dispersing crowds from saturated areas to peripheral zones like L’Albufera. Furthermore, by democratizing access to complex city data through natural language, the tool serves as a digital public good, enhancing the autonomy and experience of visitors and citizens alike.
5.1. Limitations
Despite the promising results, this study has several limitations. First, the inference latency of the ReAct loop (sequential tool use) is higher than that of standard RAG, which may affect the user experience in real-time mobile scenarios. Second, the system’s accuracy is heavily dependent on the metadata quality of the underlying sources (e.g., if OpenStreetMap tags are missing, the geospatial agent fails). Third, the reliance on a commercial model (GPT-4o) for the preprocessing step (coreference resolution) introduces a cost barrier for fully open-source deployments.
5.2. Future Work
Future research will focus on three concrete directions:
Real-world User Validation: We will conduct an A/B test with 50 tourists in Valencia to measure perceived usefulness and trust compared to a traditional tourist office app.
Multimodality: We plan to integrate computer vision capabilities into the agent, allowing users to upload a photo of a monument and ask, “What is this and what are the visiting hours?”
Latency Optimization: We aim to fine-tune a smaller model (e.g., Mistral 7B) specifically for tool-calling to replace the generic ReAct prompting, thereby reducing inference time and computational costs.
Overall, this study demonstrates that intelligent agent-driven travel assistants, powered by retrieval-augmented generation, ReAct reasoning, and specialized tools, can provide context-aware, practical guidance in urban tourism, bridging the gap between AI research and real-world user experiences.