1. Introduction
The emergence of Generative Artificial Intelligence (GenAI), capable of comprehending and reasoning with human language, is driving significant shifts across all sectors of society. Whereas earlier deep learning-based AI technologies primarily focused on predicting or classifying outcomes from data, GenAI, based on Large Language Models (LLMs), refers to systems that actively generate context-aware outputs in response to user requests [
1]. Such models have rapidly advanced to a level where they can autonomously handle not only everyday conversation but also tasks that require complex reasoning in specialized domains—such as programming, financial analysis, and solving complex mathematical or physics problems—as well as the generation of images and videos.
In the tourism industry, GenAI is driving a paradigm shift through advantages such as personalization and efficiency. Tourists often face difficulties in planning trips due to the need to navigate vast amounts of fragmented information, and traditional chatbot services, which rely on predefined responses, have been limited in their ability to address this issue. Song et al. [
2] report that GenAI can be used to support personalized travel planning, including destination recommendation, itinerary design, and route guidance to tourist attractions, and that its multilingual capabilities and 24/7 availability can contribute to improving customer satisfaction. Consequently, research on the utilization of GenAI in the tourism domain has become increasingly active.
From a sustainability perspective, the quality of mobility information is consequential. Foundational work on smart tourism emphasizes that the integration of technology into physical infrastructure should aim for resource efficiency and sustainability [
3]. In urban destinations, accurate public transport guidance is essential to facilitating this goal by supporting low-carbon travel choices and reducing congestion [
4]. Conversely, unreliable guidance—especially for last-mile navigation between attractions and transit stations—can trigger missed trips and induce travelers to shift toward taxis or private vehicles, undermining sustainable mobility goals pursued through smart technologies [
5]. Therefore, assessing the reliability of GenAI-generated public transport information is a practical sustainability question, not merely a technical one.
However, the application of GenAI in tourism still presents notable limitations. Low answer accuracy can reduce user satisfaction, and the models’ ability to handle complex problems remains constrained [
2]. In practice, reported cases indicate that GenAI often provides inaccurate information when addressing queries related to geographic information [
6,
7,
8]. This phenomenon, often referred to as “hallucination,” directly impacts the reliability and effectiveness of tourism information. In particular, a lack of spatial reasoning capability—such as misidentifying the location of tourist attractions and access routes—or regional biases in training data can cause considerable inconvenience to tourists. Therefore, there is a clear need for studies that rigorously assess the accuracy of tourism-related geographic information provided by GenAI.
Public transportation accessibility is a key determinant of tourist behavior, as the use of public transport is closely related to the trip type, travel distance, and accommodation choices [
9]. Accordingly, this study examines whether GenAI can accurately provide information on metro stations that offer access to tourist attractions. More specifically, we focus on the task of identifying the nearest metro station to each attraction, where “nearest” should be defined by walking network shortest-path distance rather than simple straight-line proximity.
In addition, as GenAI becomes embedded in public-facing information services, recent evaluation scholarship emphasizes that AI outputs should be assessed with transparent criteria, explicit ground-truth data where feasible, and attention to trust, oversight, and context-specific risks. Work on digitalization in evaluation practice highlights the need to reconsider evaluation standards and ethical safeguards when AI tools are used in decision-relevant domains [
10]. Similarly, conceptual frameworks for evaluating AI-enhanced infrastructure systems argue that evaluation must be systematic and aligned with stakeholder values, including reliability and trust [
11]. In line with these perspectives, the present study treats GenAI-based mobility guidance as an evaluable information service and benchmarks its spatial accuracy against network-based ground truth.
This research aims to (i) benchmark how accurately GenAI identifies the nearest metro station to each tourist attraction and (ii) analyze which factors influence that accuracy. The study queries GenAI models for the name of the nearest metro station from each tourist attraction, compares the responses with actual shortest-path results calculated via network analysis, and employs a binary logistic regression model to explore the determinants affecting the correctness of the AI-generated answers.
To clarify the studied problem and research contributions, we address the following research questions:
RQ1: How accurately do major GenAI models identify the nearest metro station to each tourist attraction when “nearest” is defined by walking network shortest-path distance?
RQ2: What types of errors occur (e.g., Hallucination, Non-response, Near miss, Far miss), and how do error patterns differ across model families?
RQ3: Which factors explain the correctness of GenAI responses (e.g., name similarity, geographic attributes, and information visibility), and what do they imply about the mechanisms of spatial reasoning?
The remainder of this paper is organized as follows. Section 2 reviews prior research on GenAI and situates the present study within this body of work. Section 3 describes the study area and the data used in the analysis, followed by a detailed explanation of the research methodology. Section 4 reports the answer accuracy of each GenAI model and analyzes the determinants of that accuracy, and Section 5 discusses the findings and their implications.
2. Literature Review
Tourism is intrinsically a geospatial activity, and the quality of mobility information is consequential for sustainability. When making decisions regarding destination choice and travel routes, tourists weigh spatial factors such as distance and accessibility. In urban destinations, reliable public transport guidance is particularly critical because it supports low-carbon travel choices and reduces congestion [
4]. Smart technologies, including smartphones and AI, play a pivotal role in mediating these travel experiences and facilitating dynamic decision-making [
12]. However, if this digital mediation fails due to inaccurate location data or access routes, it undermines the goal of sustainable mobility. Accordingly, the application of Generative AI (GenAI) in tourism necessitates a rigorous verification of whether such models can provide accurate geographic information—specifically the location of attractions and access routes—and whether they possess adequate spatial reasoning capabilities to support last-mile connectivity.
In the field of geography, attempts to incorporate artificial intelligence into spatial analysis have evolved under the concept of Geospatial AI (GeoAI). The emergence of Large Language Models (LLMs) has dramatically expanded these possibilities, giving rise to the paradigm of Autonomous GIS [
13,
14]. Autonomous GIS refers to next-generation systems in which AI agents autonomously collect data, design analytical workflows, and generate and execute code based on user prompts, thereby lowering the barrier to complex spatial tasks. Within this context, recent studies have begun to empirically assess the spatial analytical capabilities of GenAI. Renshaw et al. [
15] benchmarked the topological reasoning accuracy of GPT-3.5, GPT-4, and Gemini 1 Pro using county-level adjacency relationships. Ji et al. [
16] evaluated the ability of foundation models to comprehend topological spatial relations by providing geometric information in WKT (Well-Known Text) format. Zhang and Gao [
17] demonstrated that non-expert GIS users could successfully generate ArcPy scripts for ArcGIS Pro using natural-language prompts, achieving an overall task success rate of 80.5%. Sherman et al. [
18] further highlighted that fine-tuning GenAI using domain-specific spatial data significantly improved model accuracy, suggesting the necessity of domain adaptation.
Meanwhile, several studies have investigated geographic bias and imbalances in the spatial knowledge of GenAI. Liu et al. [
6] conducted a “geo-guessing” experiment with masked location names, revealing that GPT-4 exhibited uneven geographic knowledge, performing significantly better on Western regions than on other cultural zones. Kim et al. [
8] found that ChatGPT provided detailed responses regarding environmental issues in major metropolitan areas but produced generalized or incomplete information for rural regions, indicating a data exposure bias. Jang et al. [
7] analyzed place identity in 64 global cities, reporting that Western cities were described with distinct characteristics, whereas non-Western cities were often portrayed with generalized or less distinctive cultural traits.
In the tourism domain, the potential applications of GenAI have drawn considerable attention, yet reliability regarding spatial context remains a concern. Karlović et al. [
19] suggest that retrieving accurate contextual information can be challenging for LLM-based systems and that the risk of generating plausible but incorrect outputs (i.e., hallucinations) may persist in tourism recommendations. Chen et al. [
20] examined how the quality of AI-generated tourism information affects perceived usefulness and expectation confirmation, which are critical for overall satisfaction. Lee et al. [
21] proposed a framework combining ChatGPT’s text-based information with social big data to mitigate limitations related to visualization and experiential detail. Their case study demonstrated that errors in AI-generated travel information could be corrected using actual visitor behavior data, emphasizing the need for reliable ground truth.
Furthermore, as GenAI becomes embedded in public-facing information services, recent evaluation scholarship emphasizes that AI outputs should be assessed with transparent criteria. Potluka et al. [
10] argue that the rapid digitalization of services necessitates new evaluation standards to ensure technologies contribute to societal goals. Pudney et al. [
11] further emphasize that when AI is integrated with critical infrastructure systems, evaluations must be systematic and aligned with stakeholder values, including reliability and trust. A synthesis of the prior literature indicates that while GenAI offers automated information generation that can enhance tourist satisfaction, research specifically evaluating the spatial accuracy of nearest metro station information produced by GenAI remains limited. Therefore, this study aims to assess the reliability of GenAI in providing last-mile connectivity information by comparing AI-generated results with network-based shortest-path analysis outputs. Building on insights from prior work, we also consider potential factors associated with response correctness and discuss their implications for sustainable tourism applications.
3. Data and Methodology
3.1. Study Area
This study selected Busan Metropolitan City as the focal study area. As the second-largest metropolis in South Korea, Busan features a diverse array of tourism resources, ranging from coastal destinations like Haeundae to cultural heritage sites such as Gamcheon Culture Village, Haedong Yonggungsa, and Beomeosa. Driven by these regional attributes, the city attracts a significant volume of both domestic and international tourists, with visitor numbers increasing steadily each year. According to Kim and Kwon [
22], the number of international tourists visiting Busan between January and July 2025 reached 2,003,466, a 23% year-on-year increase, and the full-year total for 2025 is expected to surpass the 3 million international visitors recorded in 2024.
Importantly, this tourism growth is supported by a highly developed public transportation infrastructure, anchored by the Busan Metro system. As the second-largest urban rail network in South Korea, the Busan Metro comprises six interconnected lines: three heavy rail lines (Lines 1, 2, and 3), two light rail transit systems (Line 4 and the Busan–Gimhae Light Rail Transit, BGLRT), and one commuter rail line (the Donghae Line). As summarized in
Table 1, this extensive network facilitates the daily movement of approximately 1.9 million passengers, ensuring seamless connectivity between major transportation hubs and key tourist destinations. Consequently, the convergence of rapidly growing international tourism demand and a mature, accessible public transit system positions Busan as an optimal testbed for evaluating GenAI’s spatial reasoning capabilities regarding last-mile connectivity to tourist attractions.
3.2. Data
The data used in this study can be categorized into three groups, as summarized with their primary sources in
Table 2.
The first dataset is the Busan Metropolitan City Tourist Attractions dataset, which provides information on the locations of major points of interest across Busan (
Figure 1). This dataset includes attributes such as attraction name, address, and geographic coordinates. In this study, the list of attraction names was used to collect search engine-based indicators, to identify the nearest public transportation facilities in terms of pedestrian network distance, and to query generative AI models for their estimates of the nearest metro station.
The second dataset pertains to transportation facilities, consisting of metro station information and the road network of Busan. The metro station dataset was used to construct the ground truth reference for evaluating model accuracy. It includes station information for all metro and metropolitan railway lines operating in Busan and the surrounding municipalities—Gimhae, Yangsan, and Ulsan—namely Busan Metro Lines 1–4, the Busan–Gimhae Light Rail Transit, and the Donghae Line (
Figure 2a,b). Meanwhile, the Busan road network dataset was used to identify the metro station reachable from each attraction via the shortest pedestrian path through network analysis.
The final dataset comprises search engine result counts. Prior studies have suggested that the performance of generative AI models is influenced by the volume of data they have been trained on [
23], and that high-quality, well-curated datasets can sometimes be more effective than large quantities of unfiltered data [
24]. To examine factors affecting AI-generated outputs, this study collected web document and blog search counts as proxies for the quantity of training data, while news articles, encyclopedia entries, and image search results were used as proxies for high-quality information. These variables were utilized to analyze determinants influencing the AI models’ accuracy.
3.3. Analysis Methods
3.3.1. Network Analysis for Ground Truth Construction
This study aims to compare the nearest metro station for each attraction identified by generative AI models with the ground truth station derived from network analysis based on walking distances, and to examine the factors influencing the correctness of generative AI responses. The list of tourist attractions used in the analysis was obtained from the “Busan Attractions Information” dataset provided by the Busan Metropolitan City government. Several attractions lie near Busan’s administrative boundary (
Figure 1). Therefore, we collected the OSM road network for Busan together with a 5 km buffer (
Figure 3). This buffer ensures that pedestrian network paths are not artificially severed at the city limits and allows the nearest station to be identified even when it lies in an immediately adjacent municipality (e.g., Gimhae, Yangsan, or Ulsan). The 5 km margin was selected as a conservative compromise: wide enough to capture nearby stations across municipal borders, yet narrow enough to limit unnecessary network expansion.
To approximate pedestrian-accessible routes, the OSM network was retrieved using a custom filter that retains roads tagged with the highway key while excluding high-speed road classes where pedestrian access is typically restricted, specifically motorway and trunk. This choice reflects the local context in which the remaining road classes often provide pedestrian space (e.g., sidewalks), allowing for walking distance computation while avoiding unrealistic shortest paths along high-speed facilities. While this filter does not explicitly encode sidewalk presence or all access restrictions, it provides a consistent and reproducible approximation for pedestrian shortest-path evaluation across the study area.
To determine the accuracy of the generative AI responses, ground truth data was established by identifying the nearest metro station to each attraction through network analysis using the OSM road network. The network analysis was implemented in Python (3.13.7) using the OSMnx package (2.0.6). Shortest paths were computed using physical edge lengths (in meters). Furthermore, to prevent routing failures caused by disconnected paths or one-way vehicular restrictions that typically do not apply to pedestrians, the extracted network was converted into an undirected graph to ensure bidirectional connectivity. To handle potential distance ties (i.e., multiple stations having the exact same shortest network distance from an attraction), the system was configured to select the station that appears first in the spatial index. In this process, the geographic coordinates provided in the source datasets were utilized to define the routing endpoints. Specifically, the point coordinates of tourist attractions provided in the dataset served as origins, while station node coordinates (representing the station center) served as destinations. Although it is acknowledged that spatially extensive attractions (e.g., beaches, parks) may have multiple access points affecting the perceived nearest station, this study consistently employed a point-to-point network distance to ensure a standardized and reproducible evaluation metric across the entire dataset.
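The core of the ground-truth step described above can be sketched as follows. The toy adjacency list (edge lengths in metres) stands in for the OSM pedestrian network, and all node names and distances are illustrative rather than taken from the study data; the deterministic tie rule mirrors the one described in the text.

```python
import heapq

def shortest_dists(graph, origin):
    """Dijkstra over an undirected adjacency dict {node: [(nbr, length), ...]}."""
    dist = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in graph.get(node, []):
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def nearest_station(graph, origin, stations):
    """Pick the station minimising network distance; ties resolve to the
    earlier station in the input order, mirroring the tie rule above."""
    dist = shortest_dists(graph, origin)
    reachable = [s for s in stations if s in dist]
    if not reachable:
        return None, float("inf")
    best = min(reachable, key=lambda s: (dist[s], stations.index(s)))
    return best, dist[best]

# Toy undirected network: each edge is stored in both directions.
edges = [("attraction", "a", 200), ("a", "station_A", 300),
         ("attraction", "b", 150), ("b", "station_B", 600)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))
```

In the actual pipeline, the graph would instead be retrieved via OSMnx with the pedestrian custom filter described above, converted to an undirected graph, and queried with edge `length` attributes as the shortest-path weight.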
AI responses were classified as correct only if they strictly matched the ground truth in terms of both the station name and the specific subway line; otherwise, they were recorded as incorrect. Prior to matching, station names were standardized—for example, by removing common suffixes such as “Station” and normalizing minor formatting variations—to ensure consistent comparison between AI outputs and ground truth labels. Metro station data were extracted from the Urban Railway Station Information dataset, which includes station location coordinates for urban and metropolitan rail lines operating in Busan and neighboring cities (Gimhae, Yangsan, and Ulsan)—specifically Busan Metro Lines 1–4, the Busan-Gimhae Light Rail Transit, and the Donghae Line (see
Figure 2a,b).
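The name standardization step can be sketched as below; the specific suffix and formatting rules shown here are illustrative assumptions, not an exact reproduction of the study's normalization table.

```python
import re

def normalize_station(name: str) -> str:
    """Illustrative normalisation: lowercase, collapse whitespace,
    drop a trailing 'Station' suffix, and unify hyphenation."""
    s = name.strip().lower()
    s = re.sub(r"\s+", " ", s)          # collapse internal whitespace
    s = re.sub(r"\s*station$", "", s)   # remove the common "Station" suffix
    s = s.replace("-", " ")             # normalise minor formatting variants
    return s
```

A matched response would then require both the normalized station name and the stated subway line to agree with the ground-truth label.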
3.3.2. Prompt Engineering
Figure 4 illustrates the workflow of prompt input and AI response generation. The input prompt used in this study was written in Korean, and the full prompt template is provided in
Appendix A. Following prior studies, the prompt was organized into four components—Task, Logical Steps, Rules, and Execution—to improve consistency across iterative queries and to enable automated post-processing.
First, the Task component explicitly defined the objective: identifying the single nearest metro station to each tourist attraction based on walking network distance. By specifying the target output and the interpretation of “nearest” in advance, the prompt reduces variability stemming from ambiguous instructions and aligns model responses with the research objective, consistent with established prompt design principles [
17].
Second, the Logical Steps component provided an explicit reasoning sequence for completing the task. This design draws on the Chain-of-Thought (CoT) principle [
16,
25], which decomposes complex problems into smaller subtasks to support structured reasoning. In contrast to approaches that elicit verbose reasoning traces, the present study was designed for large-scale quantitative evaluation: the prompt encourages stepwise reasoning while requiring the model to return only the final answer in a structured format suitable for automated analysis.
Third, the Rules component constrained the output format to ensure stable parsing and direct comparability with network analysis results. Each response was required to be a single structured object with three fields: (i) name (tourist attraction name; string), (ii) subway_station (predicted nearest station name; string), and (iii) subway_line (serving line(s); array of strings). To enforce strict adherence to this structure, model family-specific API features were used. For GPT series, a strict JSON Schema was applied to require all fields and disallow additional properties. For Gemini, the same structure was enforced via a typed response model with identical field definitions. These constraints reduce ambiguity in free-form text responses and enable reliable extraction and evaluation within computational workflows [
17,
18]. These output constraints, enforced through the providers’ API features, yield consistent, machine-parseable responses across all queries.
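The enforced response structure can be illustrated as a JSON Schema of the kind accepted by strict structured-output modes. The schema below mirrors the three fields named in the text; the small `conforms` helper is a simplified local stand-in for the provider-side validation, not the API itself.

```python
# Three-field response structure described in the Rules component.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "subway_station": {"type": "string"},
        "subway_line": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "subway_station", "subway_line"],
    "additionalProperties": False,
}

def conforms(obj: dict) -> bool:
    """Minimal local check of the structure (not a full JSON Schema validator)."""
    if set(obj) != set(RESPONSE_SCHEMA["required"]):
        return False  # missing or additional properties
    return (isinstance(obj["name"], str)
            and isinstance(obj["subway_station"], str)
            and isinstance(obj["subway_line"], list)
            and all(isinstance(x, str) for x in obj["subway_line"]))

sample = {"name": "Haeundae Beach", "subway_station": "Haeundae",
          "subway_line": ["Line 2"]}
```

The station and line values in `sample` are illustrative placeholders rather than model outputs from the study.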
Finally, the Execution component separated instructional content from the input data for each iteration. This structure helps ensure that the model, after interpreting the task and rules, focuses solely on generating the structured output for the specific attraction name provided in the current query [
17,
18]. Overall, this prompt design enables consistent, machine-parseable outputs across iterative queries, supporting large-scale quantitative evaluation.
3.3.3. Model Selection and Configuration
This study utilized two generative AI model families that provide API access and represent the leading foundation models in the current AI ecosystem: OpenAI’s GPT series and Google’s Gemini series. To reflect both inference performance-oriented and lightweight models, the flagship model and its lightweight variant from each family were selected: Gemini 3 Pro and Gemini 3 Flash (Google’s Gemini series), and GPT-5.2 and GPT-5 mini (OpenAI’s GPT series). All input/output processes were automated using the respective API services, and outputs were constrained to the predefined structured format described in
Section 3.3.2. No model parameters were manually tuned beyond enforcing the JSON output constraint; all other parameters were left at the providers’ API default settings at the time of querying.
3.3.4. Logistic Regression Modeling
Following the collection of GenAI responses via the constructed prompts, each output was binarized by comparing it with the ground truth derived from network analysis based on the OpenStreetMap road network; responses matching the ground truth were coded as correct (1), while mismatches were coded as incorrect (0). To empirically explore the determinants of GenAI response correctness, a binary logistic regression model was employed. This analytical framework aligns with methodologies adopted in recent studies, such as [
15], which examined spatial question-answering accuracy, and [
8], which assessed geographic biases in AI-generated information.
Regarding the independent variables, two textual cue measures—Jaccard similarity and Levenshtein distance—were included under the assumption that GenAI may infer metro station names based on lexical similarity between entity names. Jaccard similarity quantifies the overlap between two strings by dividing the size of the intersection by the size of the union, whereas Levenshtein distance measures the minimum number of single-character edits required to transform one string into the other.
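The two textual measures can be computed as follows. The character-level granularity used for Jaccard similarity here is an assumption (the unit of comparison is not fixed above); the Levenshtein routine is the standard dynamic-programming formulation of the edit distance just defined.

```python
def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```

In the regression context, a small Levenshtein distance between an attraction name and a station name signals strong lexical overlap that a model could exploit.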
Variables representing the geographic context included the distance between facilities, the number of metro stations within a 1000 m radius, the number of map search results, and urbanization status. The distance between facilities was defined as the shortest walking distance (km) from each attraction to its nearest metro station derived from the network analysis. The count of metro stations within a 1000 m radius was included to test the hypothesis that a higher density of nearby stations may increase the likelihood of misidentifying the optimal station. The number of map search results was used to account for potential ambiguity caused by place name duplication; this variable represents the count of results returned by a web map search for the attraction name and was natural log-transformed to mitigate skewness. Finally, the urbanization variable indicates whether the attraction is located within an urbanized area based on Statistics Korea data.
Information visibility measures were captured using search volume variables—counts of web documents, blogs, news articles, images, and encyclopedia entries—based on the premise that GenAI performance is influenced by the volume and quality of training data available for a specific entity [
23,
24]. Due to substantial variance in raw counts, all search volume variables were natural log transformed. These data were acquired via the Naver Search API by querying the official name of each tourist attraction. Descriptive statistics for all variables included in the regression model are presented in
Table 3. To obtain a parsimonious and interpretable specification, we performed stepwise model selection using the Akaike Information Criterion (AIC), starting from the full set of candidate predictors.
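The stepwise selection can be sketched as a backward-elimination loop, one common instantiation of AIC-based stepwise selection starting from the full predictor set. The `aic_of` callback abstracts away the refitting step (in practice, refitting the logistic model on each candidate subset and reading off its AIC); the toy AIC function and variable names below are illustrative assumptions.

```python
def backward_stepwise(features, aic_of):
    """Drop one predictor at a time as long as doing so lowers the AIC."""
    current = list(features)
    best_aic = aic_of(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for f in list(current):
            candidate = [x for x in current if x != f]
            a = aic_of(candidate)
            if a < best_aic:
                best_aic, current, improved = a, candidate, True
                break  # restart the scan from the reduced set
    return current, best_aic

# Toy AIC: rewards two "truly useful" predictors, penalises model size,
# so the selection should retain exactly those two.
def toy_aic(subset):
    useful = {"levenshtein", "log_news"}
    fit = -10 * len(useful & set(subset))  # crude stand-in for -2 log L
    return fit + 2 * len(subset)           # AIC-style complexity penalty

sel, aic = backward_stepwise(
    ["levenshtein", "log_news", "log_blogs", "urbanized"], toy_aic)
```

A real analysis would pass an `aic_of` that refits the binary logistic model per subset rather than this synthetic scoring function.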
4. Results
4.1. Accuracy of Generative AI Models
The performance of generative AI models evaluated in this study is based on responses retrieved on 30 December 2025, using structured prompts to identify the nearest metro station to each tourist attraction based on walking network distance. Accuracy was calculated as the ratio of correct responses to the total number of evaluated cases, and the comparative results are presented in
Table 4. Among all tested models, Google’s Gemini 3 Pro achieved the highest accuracy.
A provider-level comparison shows that Google’s Gemini models achieved higher accuracy than OpenAI’s GPT models in this evaluation. The flagship Gemini 3 Pro recorded the highest accuracy (65.0%), followed by the lightweight Gemini 3 Flash (60.0%). In contrast, GPT-5.2 and GPT-5 mini recorded 35.7% and 11.4%, respectively.
We hypothesized that higher overall intelligence—encompassing logical reasoning, instruction following, and problem-solving capabilities—would be associated with a model’s accuracy in identifying the nearest stations and adhering to geographic constraints. To test this hypothesis,
Figure 5 illustrates the relationship between the Intelligence Index and the accuracy of each generative AI model. The intelligence scores were obtained from the Artificial Analysis Intelligence Index v3.0, a composite benchmark [
This index integrates results from ten evaluation suites: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, and τ²-Bench Telecom. While scores may vary depending on inference settings, this study used the maximum score reported for each model for visualization.
4.2. Error Typology and Distribution
In the previous accuracy assessment, responses were classified as correct only when both the station name and the corresponding subway line perfectly matched the ground truth derived from the network analysis. However, strictly adhering to a single “nearest” station can be overly rigid for open-space attractions with multiple access points, such as beaches or parks, where the optimal station may vary depending on the specific entrance used. Consequently, an AI response classified as incorrect might not necessarily represent a complete failure but rather a suboptimal answer. To systematically distinguish between these types of inaccuracies, incorrect responses were further categorized into four classes: Hallucination, Non-response, Near Miss, and Far Miss (
Table 5). Hallucination refers to cases where the AI generated non-existent stations, stations located in different regions, or correct station names associated with incorrect subway lines, whereas Non-response indicates cases where the AI failed to return any result. Regarding spatial proximity errors, Near Miss denotes cases where the AI’s answer was not the single nearest station but fell within the top-3 nearest stations based on network analysis, while Far Miss refers to cases where the answer was outside this top-3 threshold.
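The typology can be expressed as a simple decision rule over an incorrect response; the station registry and top-3 list below are illustrative stand-ins for the network-analysis outputs, and the correct case (the top-ranked station with the matching line) is assumed to be handled upstream.

```python
def classify_error(answer, known_stations, top3):
    """Classify an incorrect response per the four-class typology.
    known_stations: valid (station, line-consistent) labels in the region.
    top3: the three network-nearest stations for the attraction."""
    if not answer:
        return "Non-response"          # model returned no result
    if answer not in known_stations:
        return "Hallucination"         # non-existent or wrong-line label
    if answer in top3:
        return "Near Miss"             # plausible alternative access station
    return "Far Miss"                  # real station, outside the top-3

# Illustrative example, not study data.
stations = {"Haeundae", "Jungdong", "Seomyeon", "Centum City"}
top3 = ["Haeundae", "Jungdong", "Centum City"]
```

This rule ordering matters: existence is checked before proximity, so a fabricated name is never softened into a spatial miss.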
The distribution of error types by model is visualized in
Figure 6. Regarding Hallucinations, Gemini 3 Pro recorded zero cases and Gemini 3 Flash recorded one case. As shown in
Table 6, the single case in Gemini 3 Flash involved an orthographic error for Samnak Ecological Park, where it returned “Kwaebeop Renecite” instead of the correct “Gwaebeop Renecite” (a single consonant difference in Korean). In contrast, GPT-5.2 and GPT-5 mini recorded 7 and 10 Hallucination cases, respectively. In terms of Non-responses, only GPT-5 mini exhibited this behavior, accounting for 52 cases.
With respect to spatial proximity errors, Near Miss counts were 34 and 38 for Gemini 3 Pro and Gemini 3 Flash, respectively, and 27 for both GPT-5.2 and GPT-5 mini. Far Miss counts were 15 and 17 for Gemini 3 Pro and Gemini 3 Flash, respectively, compared to 56 for GPT-5.2 and 35 for GPT-5 mini.
4.3. Determinants of Correctness (Binary Logistic Regression)
Based on the accuracy classification (correct vs. incorrect) derived in the previous section, a binary logistic regression model was employed to identify the determinants influencing the correctness of GenAI responses. To investigate potential disparities across model providers, the regression analysis was applied specifically to the highest-performing models from each family: Gemini 3 Pro and GPT-5.2.
Given the vast parameter space of GenAI models—ranging from hundreds of millions to trillions—isolating specific causal factors is inherently challenging. Nevertheless, to empirically examine variables associated with response correctness, this study first estimated an initial model incorporating factors commonly considered relevant to AI performance (
Table 7). The independent variables included textual similarity between attraction and station names, geographic factors (distance between facilities, number of metro stations within a 1000 m radius, urbanization status, and potential place name ambiguity), and information visibility measures (search volume metrics such as web documents, news, blogs, images, and encyclopedia entries).
Subsequently, a stepwise selection process based on the Akaike Information Criterion (AIC), starting from the full set of candidate predictors, was performed to derive a parsimonious final model specification (
Table 8).
In the regression tables, values enclosed in parentheses denote the standard errors of the coefficients. The likelihood ratio test p-values for the initial and final models were below 0.001 for Gemini 3 Pro (0.00007 and <0.001, respectively) and were 0.00595 and 0.00012 for GPT-5.2, indicating that all estimated models were statistically significant relative to the null model.
For Gemini 3 Pro, four variables were statistically significant in the final specification: Levenshtein distance, distance between facilities, number of metro stations within 1000 m, and news search volume. In contrast, for GPT-5.2, only news search volume was statistically significant in the final specification, while the remaining geographic and textual variables were not significant.
Figure 7 visualizes the distributions of the retained predictors by correctness for Gemini 3 Pro: (a) textual similarity (Levenshtein distance), (b) distance between facilities, (c) number of metro stations within a 1000 m radius, and (d) the natural log of news search results.
Figure 8 presents the distribution of the natural log of news search results by correctness for GPT-5.2. These results provide an empirical basis for the subsequent discussion of cross-model differences in accuracy, error patterns, and associated predictors.
5. Discussion
The findings of this study provide comprehensive answers to the three research questions regarding the accuracy, error typology, and determinants of GenAI performance in identifying the nearest metro stations. First, regarding overall accuracy (RQ1), the results indicate substantial variation across model families and sizes. Gemini 3 Pro achieved the highest accuracy, followed by Gemini 3 Flash, GPT-5.2, and GPT-5 mini. This pattern suggests that general-purpose benchmark performance does not necessarily translate into reliable performance for specific, place-based spatial queries about tourist attractions. Second, concerning error typology (RQ2), the analysis shows that “incorrect” outputs are heterogeneous, ranging from Near Misses (plausible alternatives within the network-derived top-3 candidates) to Far Misses, Hallucinations (e.g., fabricated station labels or station–line mismatches), and, in some cases, Non-responses. The distribution of these error types varies by model, with lightweight variants exhibiting a higher frequency of severe failure modes in this benchmark. Third, regarding determinants of correctness (RQ3), the regression results indicate that information visibility—proxied by news search volume and reflecting how prominently a given tourist attraction appears in widely disseminated sources—is positively associated with correctness across models, while geographic context variables (e.g., distance to the nearest station and station density) are additionally retained as significant predictors for Gemini 3 Pro. Collectively, these findings establish a reproducible benchmark for evaluating GenAI-based mobility guidance and underscore the need for case-specific verification when such models are used for last-mile navigation.
The observed error patterns offer insights into the limitations of current LLMs in place-based mobility queries. The prevalence of Near Misses suggests that models often retrieve geographically plausible candidates while still failing to identify the single nearest station under a strict walking network definition. More critically, Hallucinations—such as fabricated station labels or incorrect station–line pairings—represent qualitatively different failures because they produce outputs that are not grounded in the transit system at all. Non-responses likewise constitute a distinct failure mode in which the model cannot provide a usable output for downstream evaluation or service delivery. The representative cases illustrate that the practical consequences of “wrong” answers differ substantially across error types: orthographic variants of an existing station name are materially different from fabricated station labels that do not correspond to any station, and both differ from near-optimal selections that remain within the network-derived candidate set.
The determinant analysis provides complementary evidence on the conditions under which models tend to succeed or fail. In the final regression specification for Gemini 3 Pro, multiple predictors remained significant, including textual similarity (Levenshtein distance), network-based distance to the nearest station, station density within 1000 m, and news visibility. In contrast, for GPT-5.2, news visibility was the only predictor retained as statistically significant in the final specification. These patterns should be interpreted as associations rather than causal effects, and they may be subject to confounding. For instance, an attraction’s news visibility is often inherently correlated with its urban centrality, local density, or visitor volume, making it difficult to fully isolate the effect of media exposure from the underlying spatial structure. Nevertheless, these results help characterize a key regularity: correctness is linked to how prominently a tourist attraction is featured in authoritative, widely circulated sources, and—at least for Gemini 3 Pro—also varies with the geographic context captured by network-derived distance and local station density. Accordingly, attractions that receive less media attention or are less prominently featured online may be more prone to incorrect outputs, pointing to uneven coverage and exposure effects that can affect the reliability of AI-based tourism services.
From the perspective of sustainable tourism mobility, these technical limitations have important implications. Public transport accessibility is a key mediator of low-carbon travel behaviors among tourists, particularly in dense urban destinations. If GenAI-based services provide inaccurate last-mile guidance—especially Far Misses that direct users to spatially irrelevant stations or hallucinations that introduce station labels or line assignments not aligned with the network—travelers may experience increased uncertainty, missed connections, or reduced confidence in public transport guidance. Such reliability issues can plausibly discourage public transit use and increase reliance on taxis or private vehicles, thereby contributing to congestion and emissions. In this sense, spatial reliability is not only a technical performance metric but also a service quality attribute relevant to sustainability-oriented smart tourism.
To mitigate these risks, several practical recommendations follow for the deployment and evaluation of GenAI in tourism services. First, raw GenAI outputs should be externally validated whenever feasible—for example, by verifying that a returned station exists and that the station–line pairing is consistent with an official transit database, and by checking walking-network proximity through routing services. Retrieval-Augmented Generation (RAG) architectures that ground outputs in curated transit and geospatial databases, or map API-integrated approaches that compute candidate stations from authoritative sources, can help reduce hallucinations and improve consistency. Second, given the frequency of Near Misses, interfaces may benefit from presenting multiple candidates (e.g., the top-3 nearest stations) along with route information, rather than asserting a single answer without qualification; this can better accommodate multiple access points and user preferences, especially for open-space attractions. Third, disambiguation strategies—such as requesting administrative context, alternative names, or nearby landmarks—may reduce errors caused by place name ambiguity. Finally, continuous benchmarking remains important because model behavior can change with updates; evaluation should therefore be treated as an ongoing process rather than a one-time validation.
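The first recommendation above—validating a returned station against an authoritative transit database before delivery—can be sketched as a simple guard layer. The station registry, function name, and error-type labels below are hypothetical illustrations (the toy registry lists only three Busan Metro stations), but the checks mirror the error typology used in this study: station existence, station–line consistency, and rank within the network-derived top-3 candidates.

```python
# Hypothetical registry mapping each station to the set of lines serving it.
# In deployment this would be loaded from an official transit database.
TRANSIT_DB = {
    "Seomyeon": {"Line 1", "Line 2"},
    "Haeundae": {"Line 2"},
    "Jagalchi": {"Line 1"},
}

def validate_answer(station: str, line: str, top3: list[str]) -> str:
    """Classify a GenAI station answer before it reaches the user.

    top3 is the network-derived candidate list, nearest first.
    """
    if station not in TRANSIT_DB:
        return "hallucination: station not in network"
    if line not in TRANSIT_DB[station]:
        return "hallucination: station-line mismatch"
    if station != top3[0]:
        return "near miss" if station in top3 else "far miss"
    return "correct"
```

Answers flagged as hallucinations can be suppressed or re-grounded via a RAG lookup, while near misses can be surfaced alongside the other top-3 candidates rather than asserted as the single nearest station.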
Despite these contributions, this study has limitations that affect generalizability. The evaluation was confined to a single metropolitan area (Busan) and focused on the metro system, excluding bus networks and multimodal transfers. Ground truth was established using a centroid-to-centroid network analysis based on OpenStreetMap; while this ensures consistency and reproducibility, it may not fully capture pedestrian experience for spatially extensive attractions where specific entrances determine practical access. The benchmark reflects a snapshot of model behavior and visibility signals at a specific period, and prompts were issued in Korean, leaving cross-lingual differences as an open question. Additionally, the regression analysis is designed to characterize associations and cannot by itself identify underlying causal mechanisms.
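The centroid-to-centroid network ground truth mentioned above amounts to a shortest-path computation over a pedestrian graph. The following stdlib-only sketch illustrates the idea on a toy graph; in the study the nodes, edges, and edge lengths (in metres) would come from the OpenStreetMap walking network, and all node names and distances here are invented for illustration.

```python
import heapq

def dijkstra(adj, origin):
    """Shortest-path distances from origin over an undirected weighted graph.

    adj maps each node to a list of (neighbor, edge_length_m) pairs.
    """
    dist = {origin: 0}
    pq = [(0, origin)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def nearest_stations(adj, origin, stations, k=3):
    """Rank candidate stations by walking-network distance from origin."""
    dist = dijkstra(adj, origin)
    ranked = sorted((s for s in stations if s in dist), key=dist.get)
    return [(s, dist[s]) for s in ranked[:k]]

# Toy walking network: two paths from an attraction centroid to two stations.
edges = [("attraction", "a", 120), ("a", "b", 200), ("b", "station_X", 150),
         ("attraction", "c", 300), ("c", "station_Y", 90)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))

top = nearest_stations(adj, "attraction", ["station_X", "station_Y"])
```

The top-3 ranking produced this way is exactly what separates Near Misses (answer within the candidate list) from Far Misses (answer outside it); refining the ground truth with entrance-specific access points, as suggested above, would change the origin and destination nodes but not the procedure.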
In the next section, we consolidate the main findings in relation to RQ1–RQ3 and summarize the study’s implications, limitations, and directions for future research.
6. Conclusions
This study benchmarked the reliability of Generative AI (GenAI) in identifying the nearest Busan Metro station to tourist attractions under a walking network definition of “nearest.” By comparing AI-generated responses with network analysis ground truth, the study provides empirical evidence on how current LLM-based systems perform in a last-mile tourism mobility task.
First, regarding overall accuracy (RQ1), performance varied substantially across the evaluated models. In this benchmark, Gemini 3 Pro achieved the highest accuracy, followed by Gemini 3 Flash, GPT-5.2, and GPT-5 mini, indicating notable differences across model families and between flagship and lightweight variants. Second, for error typology (RQ2), incorrect outputs were heterogeneous, including Near Misses (plausible alternatives within the network-derived top-3 candidates), Far Misses (outside the top 3), Hallucinations (e.g., fabricated station labels or station–line mismatches), and, in some cases, Non-responses. Third, regarding determinants of correctness (RQ3), the regression results indicate that information visibility—proxied by news search volume and reflecting how prominently a tourist attraction appears in widely disseminated sources—was positively associated with correctness in both flagship models. For Gemini 3 Pro, additional predictors related to textual cues and geographic context were retained in the final specification, whereas for GPT-5.2 news visibility was the only retained significant predictor.
From a sustainability perspective, the findings underscore that the spatial reliability of GenAI is a practical prerequisite for smart tourism applications that aim to support low-carbon mobility. Accurate last-mile guidance can facilitate tourists’ use of public transport, while severe errors (e.g., Far Misses and Hallucinations) may increase uncertainty and reduce confidence in public transport guidance, potentially undermining efforts to promote transit-based travel. Accordingly, for tourism authorities and service providers deploying GenAI within tourism GIS applications, the results support the need for verification mechanisms that prevent ungrounded outputs from being delivered as definitive guidance. In practice, this can be addressed by grounding responses using Retrieval-Augmented Generation (RAG) with authoritative transit databases, enforcing consistency checks (e.g., station existence and station–line validity), and adopting interface designs that present multiple candidates (e.g., top-3 nearest stations) for open-space attractions where access points vary.
This study is subject to several limitations. The evaluation was conducted in a single metropolitan context and focused on metro accessibility, excluding multimodal transport and transfer-based routing. The attraction set was derived from the official “Busan Attractions Information” dataset; thus, coverage and representativeness may depend on the provider’s curation and update practices and may not fully reflect popularity- or demand-driven destination choices. Ground truth was established using point representations of attractions and stations and a walking network approximation derived from OpenStreetMap, which may not fully capture entrance-specific access for spatially extensive sites. In addition, the current evaluation reflects a snapshot of model behavior and visibility signals at a specific time.
To address these limitations, future research should extend the analytical scope to multimodal transport networks across diverse cities and languages. Furthermore, compiling alternative attraction inventories from varied tourism platforms (e.g., VISITKOREA and TripAdvisor) will help test the robustness of model outputs against differently curated data sources. To advance beyond simple last-mile accessibility, future studies should adopt constraint-based network evaluation metrics, drawing upon prior research [
27] that integrates realistic constraints—such as maximum allowable travel times and service weights—into network analyses. In addition, ground truth definitions can be refined by incorporating precise access points and station entrances, alongside the application of differentiated walking time impedances that account for specific road typologies and pedestrian environments. These methodological enhancements will effectively complement local proximity-based assessments with broader, network-wide connectivity perspectives. Finally, systematic comparisons between baseline models and retrieval- or map-grounded variants will be instrumental in assessing the extent to which external knowledge integration mitigates hallucinations and spatial errors over time.