Article

Mobile-AI-Based Docent System: Navigation and Localization for Visually Impaired Gallery Visitors

1 Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Republic of Korea
2 Department of Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea
3 Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5161; https://doi.org/10.3390/app15095161
Submission received: 29 March 2025 / Revised: 30 April 2025 / Accepted: 3 May 2025 / Published: 6 May 2025
(This article belongs to the Special Issue IoT in Smart Cities and Homes, 2nd Edition)

Abstract

Smart guidance systems in museums and galleries are now essential for delivering quality user experiences. Visually impaired visitors face significant barriers when navigating galleries due to existing smart guidance systems’ dependence on visual cues like QR codes, manual numbering, or static beacon positioning. These traditional methods often fail to provide adaptive navigation and meaningful content delivery tailored to their needs. In this paper, we propose a novel Mobile-AI-based Smart Docent System that seamlessly integrates real-time navigation and depth of guide services to enrich gallery experiences for visually impaired users. Our system leverages camera-based on-device processing and adaptive BLE-based localization to ensure accurate path guidance and real-time obstacle avoidance. An on-device object detection model reduces delays from large visual data processing, while BLE beacons, fixed across the gallery, dynamically update location IDs for better accuracy. The system further refines positioning by analyzing movement history and direction to minimize navigation errors. By intelligently modulating audio content based on user movement—whether passing by, approaching for more details, or leaving mid-description—the system offers personalized, context-sensitive interpretations while eliminating unnecessary audio clutter. Experimental validation conducted in an authentic gallery environment yielded empirical evidence of user satisfaction, affirming the efficacy of our methodological approach in facilitating enhanced navigational experiences for visually impaired individuals. These findings substantiate the system’s capacity to enable more autonomous, secure, and enriched cultural engagement for visually impaired individuals within complex indoor environments.

1. Introduction

Because visual senses are crucial for perceiving environmental changes and interacting with one's surroundings, individuals with visual disabilities face significant challenges when engaging in outdoor physical activities [1]. According to the 2023 statistics on disability in Korea, the disabled population stands at 2,652,860 as of 2022, with a registered disability rate of 5.12%. Among them, visually impaired individuals number 251,000, representing 9.5% of the total disabled population. The registered disabled population has increased 10.9-fold since 1992. However, data from the Ministry of Health and Welfare’s survey on disabled individuals’ participation in cultural activities reveal a cultural engagement rate of only 2% (https://www.peoplepower21.org/welfarenow/1940360 (accessed on 2 May 2025)). This restriction is especially noticeable in the enjoyment of art and culture, because it is difficult for people with visual disabilities to obtain visual information. Moreover, art exhibitions can pose even greater challenges, as they often lack assistive equipment to facilitate smooth navigation for viewers [2,3].
Prior research addressing the needs of visually impaired individuals has predominantly concentrated on route guidance through assistive technologies, the development of supplementary models facilitating physical engagement with museum artifacts, or the provision of audio-descriptive services via specialized equipment. Notably, modules for localization, algorithmic path planning, continuous path tracking, and real-time obstacle detection constitute essential functional components. These elements represent critical infrastructure in the broader assistive technology ecosystem. However, obstacle avoidance during route tracking remains a challenging task due to the complexity, variability, and dynamism of indoor environments [4]. To address these challenges, methods such as smart sticks, electronic travel aids (ETAs), and mobile indoor assistance have been utilized [4]. However, smart sticks and ETAs are additional devices that visually impaired people must carry alongside the equipment they already own, which is inconvenient. The adoption of vision cameras to address safety concerns has become increasingly popular, owing to their capacity to offer detailed information in well-lit environments and their cost-effectiveness in comparison to alternative sensors such as sonar, radar, and lidar [5]. Moreover, these solutions hold promise for detection systems because they can be deployed on devices with limited resources, such as smartphones.
Museum guide systems can employ various methods for delivering artwork information, with manual entry of exhibit numbers and QR code scanning representing the predominant approaches. These techniques allow visitors to access relevant content by simply inputting the corresponding artwork number or capturing the associated QR code with their device, thereby efficiently retrieving comprehensive exhibit information. However, these conventional methods present significant accessibility barriers for individuals with visual impairments, as they intrinsically rely on visual cues such as numbers and QR codes. Consequently, alternative approaches that do not depend on visual cues are needed to deliver an equivalent level of informational access to visually impaired visitors. Beacon-based guide systems do not require visual information. These systems provide information through broadcast packets [6], proximity detection [6], or localization methods [7,8], and they can determine when information should be delivered to the user. However, they do not consider whether an explanation should continue or stop. Situations in which voice guidance that the user no longer needs keeps playing are inconvenient for visually impaired people who rely on guide audio. Therefore, a guide system for visually impaired people needs to consider how to provide information adaptively.
Therefore, we propose a smart docent system designed to address existing limitations in museum accessibility while significantly enhancing art interpretation capabilities among visually impaired patrons. The proposed AI-based navigation system demonstrates several distinct advantages over existing technologies, detailed as follows:
  • Integrated assistance service: Our service unifies multiple assistive functionalities, overcoming the compartmentalization prevalent in conventional solutions that separately provide directional guidance, hazard detection, or audio descriptions. Instead, this approach integrates navigational assistance and in-depth audio interpretation within a single cohesive platform, ensuring consistent and convenient user engagement throughout the museum experience.
  • Adaptive information provision: Utilizing real-time behavioral analysis algorithms, the system monitors visitor engagement patterns—including proximity to exhibits, dwell time, departure cues, and ambulatory velocity—to dynamically adjust content delivery. This approach mitigates the passive information delivery and unnecessary repeated information characteristic of existing solutions, thereby providing visually impaired visitors with a more immersive and enriched experience beyond superficial descriptions.
  • Real-time obstacle detection and alerts: Utilizing an on-device object detection model, our system actively identifies obstacles in real-time and promptly alerts users, enhancing visitor safety and facilitating a secure and comfortable museum environment.
  • Accurate and reliable positioning: The solution employs a dual-method positioning approach that synthesizes BLE beacon technology with continuous analysis of movement trajectories. This integration effectively mitigates the localization inaccuracies inherent in conventional indoor positioning systems, particularly those resulting from signal interference or unpredictable movement patterns.
To implement these functionalities effectively, the proposed system operates in two distinct modes: a Navigation Mode, which incorporates real-time obstacle detection to guide visitors safely through exhibition spaces, and a Docent Guide Mode, which offers more in-depth, context-sensitive audio descriptions, significantly enhancing the overall visitor experience and artwork comprehension.
The remainder of the paper is organized as follows: Section 2 discusses the related work. Section 3 examines the structural configuration and methodological principles underlying the Navigation, Localization, and Detection Modules. These modules form the technological foundation required to deliver spatial orientation assistance—navigation services—and contextually enriched interpretive content—guide services—throughout the museum environment. Additionally, the experimental environment is described. Section 4 presents and discusses the experimental results and delves into the quality of user experience, and finally, Section 5 concludes the paper and outlines directions for future research.

2. Background of Analysis

2.1. Navigation Service

With the advancement of technology, various services are being developed to assist visually impaired individuals in achieving independent mobility. These services provide support through a variety of means, including smart canes and mobile applications.
The technologies utilized in navigation for visually impaired individuals can be classified into three main categories, as shown in Table 1. This taxonomical framework encompasses wireless communication technologies (including Wi-Fi, Bluetooth Low Energy beacons, and Ultra-Wideband), sensor-based technologies (such as Global Positioning System, RGB-D cameras, and Inertial Measurement Units), and marker-based approaches (primarily two-dimensional codes like QR codes strategically positioned within environments) [9]. Table 1 presents a structured comparison of these specific technologies, detailing their respective benefits and features alongside their limitations and implementation challenges in a tabular format. The remainder of this section provides a more detailed examination of these technologies.
GPS functionality is severely compromised in indoor settings, so Wi-Fi-based solutions have emerged as a viable alternative: they exploit existing access points (APs) and perform localization and proximity detection without requiring line of sight (LOS). Techniques such as lateration and RSS fingerprinting are commonly employed to estimate user positions based on the signal strength of surrounding APs [10,11]. Wi-Fi-based methods can be implemented with minimal configuration on the client side. However, their accuracy is often limited, particularly in estimating the distance between users and exhibits. Furthermore, RSS fingerprinting techniques require a labor-intensive offline phase to build a signal map, which may be impractical for large-scale indoor environments such as museums.
To enhance the accuracy of indoor positioning systems, recent research has applied machine learning (ML) and neural networks to beacon technology [10,12,13,14]. These approaches employ neural networks to analyze RSS features collected from BLE beacons. However, BLE signals are susceptible to significant noise and attenuation due to building materials, presenting challenges to achieving high positioning accuracy. Filtering methods, including Kalman filters, Bayesian filtering, and fingerprinting techniques, have been proposed to mitigate these issues. These filtering approaches improve accuracy for positioning, proximity detection, and zone classification compared to conventional RSS-based methods. Specifically, they help manage and reduce RSS variability caused by noise, attenuation, and BLE advertising channel fluctuations. Nevertheless, inherent BLE signal characteristics, such as sensitivity to environmental attenuation, remain limiting factors. Moreover, the effectiveness of deep learning models is highly dependent on the availability and quality of training datasets. Without sufficient data, model accuracy is constrained. Training these models also demands substantial computational resources.
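To make the role of such filtering concrete, the following is a minimal sketch (not taken from the cited systems) of a one-dimensional Kalman filter applied to a BLE RSSI stream; the noise parameters and the sample readings are illustrative assumptions.

class RssiKalmanFilter:
    """Minimal 1-D Kalman filter for smoothing noisy BLE RSSI readings."""

    def __init__(self, process_noise=0.05, measurement_noise=4.0):
        self.q = process_noise      # how quickly the true RSSI is allowed to drift
        self.r = measurement_noise  # expected variance of raw RSSI samples
        self.x = None               # current RSSI estimate (dBm)
        self.p = 1.0                # estimate uncertainty

    def update(self, rssi):
        if self.x is None:          # initialize with the first sample
            self.x = float(rssi)
            return self.x
        self.p += self.q                    # predict: uncertainty grows over time
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (rssi - self.x)       # correct toward the new measurement
        self.p *= (1.0 - k)                 # shrink uncertainty after the update
        return self.x


if __name__ == "__main__":
    kf = RssiKalmanFilter()
    for raw in [-68, -75, -62, -70, -66, -90, -67]:   # illustrative noisy samples
        print(f"raw {raw:4d} dBm -> filtered {kf.update(raw):6.1f} dBm")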
Table 1. Navigation technologies.

| Technology Type | Specific Technology | Benefits & Features | Limitations & Challenges | Refs. |
|---|---|---|---|---|
| Wireless Communication | Wi-Fi + Lateration | Uses existing APs; no line-of-sight required; low setup cost | Low distance accuracy; sensitive to signal reflections | [10] |
| Wireless Communication | Wi-Fi + RSS fingerprinting | Higher accuracy in mapped areas; no special setup on user devices | Time-consuming offline mapping; performance varies with environment; hard to scale in large spaces | [10] |
| Wireless Communication | BLE beacon + Neural networks | Enhanced localization accuracy; adaptability to environmental variations; performance enhancement | Intrinsic limitations of BLE signals; data and training requirements; device and environment dependency | [12,13,14] |
| Wireless Communication | BLE beacon + Filtering | Accuracy improvement and error reduction; noise and variability handling | Implementation and computational complexity; intrinsic limitations of BLE signals | [6,10] |
| Wireless Communication | UWB | High accuracy; robust signal penetration; extended range | Infrastructure costs; energy consumption | [10,15] |
| Sensor | GPS + Transformer or ML | Outdoor precision; enhanced portability and usability; improved accessibility | Limited positional accuracy; inaccurate distance estimation | [16,17] |
| Sensor | RGB-D camera + ML | Data fusion; provision of enriched environmental information; real-time processing | Data noise and distortion; external environmental constraints | [15,18] |
| Sensor | IMU | Posture estimation; motion tracking | Cumulative drift error | [11,16,18] |
| Marker-based | Two-dimensional code | High positioning accuracy; low cost; fast information retrieval | Low distance estimation accuracy; requires environmental installation | [19,20,21] |
UWB technology is widely used for indoor localization and is particularly known for its high accuracy: it can estimate locations to within a few centimeters, with certain cases demonstrating positional errors under 20 cm. Compared to other RF technologies, UWB offers a wider transmission range and may require fewer devices than RFID-based systems. However, effective operation of UWB systems requires that obstacles between devices be avoided. Although UWB is currently considered the most accurate indoor positioning technology available, it requires extensive infrastructure deployment involving numerous UWB anchors. Furthermore, compatible end-user devices have yet to achieve widespread adoption [10,15].
Several navigation methods integrating deep reinforcement learning and neural networks with GPS-based technologies have been explored. Smart assistive systems integrate ML algorithms like YOLO for real-time object recognition combined with ultrasonic sensors and GPS navigation, to enhance user support [17]. ASSISTER uses imitation learning to generate natural-language navigation guidance based on visual input and GPS/IMU localization [16]. These methods offer advantages including improved positional accuracy through multi-sensor fusion, detailed environmental information from deep learning models, real-time object detection, customized guidance, portability, integrated functionality, and enhanced accessibility. However, challenges remain, such as limited positional accuracy in complex urban settings, limitations in object detection accuracy, challenges in distance estimation, dataset acquisition and labeling difficulties, system complexity, real-time performance issues, privacy concerns, environmental constraints, user interaction complexities, and power efficiency considerations.
Visual Simultaneous Localization and Mapping (VSLAM) systems utilize RGB-D cameras to estimate device position and orientation without prior mapping, integrating deep learning techniques to enhance path planning capabilities. RGB-D sensors combine visual and depth information, supporting real-time egomotion estimation, mapping, and obstacle-aware path planning [15]. This approach offers several advantages, including improved positional accuracy through multi-sensor fusion, enriched environmental data provided by RGB-D sensors, real-time object detection capabilities, and enhanced portability, usability, and accessibility [15,18]. Nevertheless, several challenges persist, such as limited positional accuracy, inaccuracies in distance estimation, restricted operational range, sensor noise and distortion inherent in RGB-D data, environmental constraints (e.g., interference from sunlight), and complexities associated with sensor data fusion and calibration.
IMU sensors are frequently employed as critical inputs for ML models, either directly or through fusion with other sensors [16]. Accurate navigation often relies on integrating IMU data, especially indoors. To enhance the positional accuracy and robustness of navigation systems for visually impaired users, advanced sensor fusion algorithms combining data from GNSS and inertial sources, including IMU, are commonly utilized [11,18].
Two-dimensional codes are also a popular method for indoor navigation [19,20,21]. However, they require attaching printed codes to surfaces such as ceilings or floors. In practice, administrators of art galleries are often reluctant to install such markers, as they can detract from the gallery’s aesthetic appearance. Despite this limitation, two-dimensional codes offer high positioning accuracy, low cost, and fast information retrieval [19,20,21].
Object detection constitutes a critical component in assistive navigation systems for individuals with visual impairments [22]. Various technologies demonstrate distinctive efficacy profiles when implemented in detection frameworks: Ultrasonic sensors offer pragmatic obstacle detection capabilities with favorable cost-efficiency ratios, though they exhibit operational range limitations that constrain their utility in complex navigational scenarios [23,24]. Near Field Communication (NFC) technology presents advantages in market penetration and implementation economics but suffers from proximity constraints and performance degradation in densely populated environments [12]. LiDAR technology demonstrates superior performance metrics in spatial resolution, continuous environmental scanning, and three-dimensional data acquisition. However, significant limitations manifest in temporal processing constraints, operational deficiencies in complex environmental topographies, and form factor integration challenges [11]. RGB-D camera implementations excel in generating semantic three-dimensional environmental mappings and facilitating real-time obstacle detection. These advantages are counterbalanced by substantial graphical processing requirements and computational dependencies on convolutional neural network architectures [18]. Traditional camera-based detection systems provide precision in object recognition, multi-object classification capabilities, and model specialization opportunities, while simultaneously requiring extensive training datasets, significant computational resources, and exhibiting vulnerability to detection anomalies [18,22,25]. Mobile device-based detection applications leverage integrated cameras for environmental sensing, typically transmitting captured data to server infrastructure for ML processing and subsequent generation of auditory feedback [26]. This methodology introduces potential communication latency issues, highlighting the critical importance of on-device processing capabilities to achieve real-time object detection performance necessary for effective assistive navigation.

2.2. Guide Service

The manual entry of numbers represents a common and straightforward method of accessing artwork information, whereby visitors input the numerical codes displayed on adjacent exhibit labels to retrieve complete details of the corresponding pieces [27]. For information delivery in cultural spaces, multiple approaches have been developed. QR code-based methods have gained popularity in various fields, with applications scanning these codes to display collection information [28]. These marker-based QR code technologies offer significant benefits, including low implementation cost, information-rich content delivery, and widespread smartphone compatibility [12,28]. However, they present substantial limitations for visually impaired users, including visual obtrusiveness in gallery settings, institutional policy restrictions regarding their placement, and the interaction burden they impose on users, who must first locate and then accurately scan these codes.
Wireless communication technologies, particularly BLE beacons, provide alternative approaches for visually impaired individuals who cannot easily access visual information, such as artwork numbers or QR code locations. These systems deliver proximity-based information passively and can offer personalized content [12,29]. For example, users can obtain artwork information via beacon advertising messages [29] or receive increasingly detailed information based on proximity to exhibits [6]. Despite these advantages, beacon systems face technical challenges including RSSI (Received Signal Strength Indicator) variability affecting accuracy, requirements for extensive infrastructure deployment, and the necessity for device-specific calibration [12]. Additionally, a significant limitation exists: information about artworks remains available even after users have left their vicinity, with no processing method to address this issue—creating potentially uncomfortable situations for visually impaired users receiving unnecessary audio information. Current assistive technologies for visually impaired individuals in cultural spaces predominantly offer fragmented functionalities—isolated route guidance, basic obstacle detection, or rudimentary audio descriptions—lacking comprehensive integration for meaningful engagement with artistic content.
To address these limitations, we propose a new docent system offering comprehensive and in-depth guide services. This system utilizes real-time information about the user’s location and direction to provide personalized audio guidance, simultaneously detecting obstacles along the route and dynamically modulating audio content delivery according to the user’s circumstances. By integrating location awareness, obstacle detection, and contextual content delivery within a unified framework, our proposed system aims to transcend the fragmented functionality of existing solutions, thereby facilitating more meaningful and autonomous engagement with cultural spaces for visually impaired individuals.

3. Material and Methods

3.1. System Design

The overview of the proposed guide system is shown in Figure 1 and Figure 2. The system provides a comprehensive navigation and guidance service for visually impaired individuals, combining real-time location tracking, adaptive audio guidance, and obstacle detection to ensure safe and informative exploration of the art gallery environment. The navigation service enables people to move along a predefined route while avoiding obstacles, including other people. As shown in Figure 2, the localization module determines the user’s location by utilizing ML models based on the RSSI from BLE beacons. BLE beacons are installed on the ceiling of the art gallery and set to advertising mode. A mobile device continuously scans the RSSI values from these beacons and transmits the data to the server. The server then processes the data using pre-trained machine-learning models to estimate people’s location. Since RSSI values can be affected by environmental noise and movement, the system includes a localization error calibration process to detect and correct localization errors. The calibration process refines the estimated location by comparing it with the previous position. The system provides various navigation instructions based on people’s location. For example, when they reach a corner, a voice guide instructs them to change direction. Similarly, when they approach an artwork, an audio announcement informs them about the surrounding exhibits. Additionally, the system continuously refines localization accuracy using movement history.
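The paper does not specify which machine-learning model the server uses for this step; as a hedged sketch of the server-side localization stage, the snippet below trains a scikit-learn random-forest classifier on zone-specific RSSI vectors and maps a new scan to a location number. The beacon count, RSSI values, and location labels are illustrative assumptions, not measurements from this study.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each training sample is a vector of RSSI values (dBm), one entry per beacon in the
# current zone; the label is the location number of the data collection point.
X_train = np.array([
    [-58, -72, -80],   # collected at location 1 (illustrative values)
    [-60, -70, -82],   # location 1
    [-75, -55, -68],   # location 2
    [-77, -57, -66],   # location 2
    [-83, -69, -52],   # location 3
    [-81, -71, -54],   # location 3
])
y_train = np.array([1, 1, 2, 2, 3, 3])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

def estimate_location(rssi_scan):
    """Map one RSSI scan sent by the mobile device to a location number."""
    return int(model.predict(np.array([rssi_scan]))[0])

print(estimate_location([-59, -71, -81]))   # expected: location 1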
Algorithm 1 describes the error calibration and the guide comments that vary depending on the measured location. For example, if the previous location number is 3, the measured location number is 5, and Th is 1, the measured location is treated as an error and ignored. However, if Th is 2, the measured location is not an error, and the user’s current location changes to location number 4. Th is set to an appropriate value during the experiment. When the user arrives at the location of an artwork, the system notifies them that the artwork is nearby. If the user looks at the artwork, the system provides its information as a voice guide. When the user moves away from the artwork, the system automatically stops the voice guidance to prevent unwanted audio.
Algorithm 1 Audio guide algorithm of navigation service with localization error calibration
1: Input: PL, ML
2:   PL: Previous Location Number
3:   ML: Measured Location Number
4:   CL: Current Location Number
5:   Th: Threshold
6: if |ML - CL| ≤ Th then
7:   if ML != PL then
8:     CL = CL + 1
9:     if CL == Loc_art then
10:       Notify(Artwork)
11:       if User looks at the artwork then
12:         ExplainArtwork(Loc_art)
13:       end if
14:     end if
15:     if PL == Loc_art and CL != Loc_art then
16:       StopExplain()
17:     end if
18:     if CL == Loc_corner then
19:       Notify(Corner)
20:     end if
21:   end if
22:   PL = CL
23: else
24:   Localization Error
25: end if
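For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1; the callback names (notify, explain, stop_explain, user_faces_artwork) are illustrative placeholders for the system’s audio and orientation interfaces, not the actual implementation.

def process_measurement(pl, ml, cl, th, artwork_locs, corner_locs,
                        user_faces_artwork, notify, explain, stop_explain):
    """One step of the audio guide with localization error calibration.

    pl: previous location number, ml: measured location number,
    cl: current location number, th: error threshold.
    Returns the updated (pl, cl) pair.
    """
    if abs(ml - cl) > th:            # measurement too far from the tracked position
        return pl, cl                # treat it as a localization error and ignore it
    if ml != pl:                     # the user has actually moved to a new location
        cl += 1                      # advance one step along the predefined route
        if cl in artwork_locs:
            notify("artwork nearby")
            if user_faces_artwork(cl):
                explain(cl)          # start the detailed audio description
        if pl in artwork_locs and cl not in artwork_locs:
            stop_explain()           # the user walked away mid-description
        if cl in corner_locs:
            notify("corner ahead")
    return cl, cl                    # the previous location becomes the current one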
The depth of the guide service is a novel feature that adaptively provides an audio guide based on people’s behavior, as shown on the left side of Figure 1. First, when a person approaches an artwork, the system announces that there are artworks nearby. Second, if they look at the artwork, the system provides detailed information about it. Third, when they move away, the audio guide stops automatically. This process is managed by the localization module and the orientation module, which utilize the mobile device’s magnetometer and gyroscope sensors. As a result, the system dynamically adjusts the audio guide, ensuring that it provides relevant information while avoiding unnecessary sounds. Notably, the initiation and termination of the descriptions occur automatically.
The Detection Module implements the YOLOv5 deep learning architecture for efficient image analysis and obstacle identification. For individuals with visual disabilities, the capacity to promptly identify obstacles is vital for ensuring safety and independence while navigating unfamiliar environments [1]. This capability is particularly relevant in art gallery settings, which present unique navigational challenges.
Introduced in 2020, YOLOv5 has established its reputation for exceptional computational efficiency and performance speed, making it particularly appropriate for deployment in edge computing environments with resource constraints—a critical consideration for mobile applications. While subsequent iterations (YOLOv8, YOLOv10) have introduced incremental improvements, YOLOv5 continues to demonstrate robust performance across diverse operational scenarios [30]. Empirical evaluation by [31] indicates that YOLOv5 exhibits comparable or superior detection capabilities relative to newer variants in specific object detection tasks and general classification contexts.
Among the available YOLOv5 variants (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x), YOLOv5m was ultimately selected for this project due to its optimal balance of performance characteristics [22]. Specifically, YOLOv5m demonstrates a favorable balance between precision and computational efficiency based on COCO benchmark evaluations. It provides higher accuracy than lightweight models such as YOLOv5s while requiring fewer computational resources than heavier models like YOLOv5l or YOLOv5x. Its CPU inference latency is also acceptable for real-time applications that demand efficient performance [30]. Additionally, YOLOv5 supports multiple export formats, such as ONNX, CoreML, and TFLite, facilitating deployment across various platforms [30]. For optimal real-time performance on mobile devices, the final implementation was converted to TFLite format, enabling rapid detection through direct camera streaming on standard mobile phones.
Although art galleries typically feature fewer diverse obstacles compared to other public spaces, significant human traffic is common as visitors navigate through exhibition areas. Therefore, the system was trained to prioritize the detection of people as primary obstacles. In crowded situations, the system dynamically identifies and alerts based on the closest detected individual, ensuring relevant and timely obstacle notifications to users.
Figure 3 illustrates an example of the docent service mentioned above, where the system detects the user’s position and surrounding obstacles in real time and delivers relevant audio guidance accordingly. When a person enters the art gallery, the navigation guidance for viewing begins. For example, the system provides directional instructions such as “You are at the entrance of Exhibition Room 1” or “Please proceed straight”. The system also continuously monitors the distance between individuals based on a predefined Detection Threshold Distance, which is established as a fixed value during system configuration. If a user approaches a stationary person and the distance falls within the Detection Threshold Distance, the system issues a warning to both parties. For the moving user, the system may announce, “There is a person on your right side in the direction of movement”, while the stationary person may receive a warning such as, “A person is approaching from your left side”. These alerts are delivered through both voice guidance and phone vibration to ensure effective communication.
Additionally, while the user is navigating the gallery, the system provides contextual information about nearby artworks. For instance, it may announce, “Artwork No. 1 is located to your left”. When the user turns toward the mentioned direction, the system automatically delivers information about the artwork. The level of detail in this information varies according to the user’s movement speed: when the user is moving quickly, the system provides only basic information such as “The title of this piece is XYZ, created by artist ABC”, but when the user slows down or stops in front of an artwork, more comprehensive explanations including historical context, artistic technique, and significance are automatically provided. This ensures that the guidance service dynamically responds to the user’s orientation, movement, and engagement level, facilitating an enhanced and interactive docent experience within the art gallery.
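As an illustration of how this content depth could be selected, the sketch below chooses among a passing-by announcement, a basic description, and a detailed description based on the user's walking speed and facing direction; the speed threshold, angular tolerance, and content strings are assumptions for illustration rather than values defined by the system.

def select_description(artwork, speed_mps, heading_deg, artwork_bearing_deg,
                       slow_speed=0.5, facing_tolerance=30):
    """Pick what to say about an artwork from the user's motion and orientation.

    speed_mps: walking speed derived from recent location updates.
    heading_deg / artwork_bearing_deg: device heading and direction to the artwork.
    The numeric thresholds here are illustrative, not values from the paper.
    """
    # signed angular difference folded into [0, 180] degrees
    facing = abs((heading_deg - artwork_bearing_deg + 180) % 360 - 180) <= facing_tolerance
    if not facing:
        return f"{artwork['title']} is located nearby."          # passing-by announcement
    if speed_mps > slow_speed:
        return f"The title of this piece is {artwork['title']}, created by {artwork['artist']}."
    return artwork["detailed_description"]                        # user stopped in front of it

piece = {"title": "XYZ", "artist": "ABC",
         "detailed_description": "XYZ was painted by ABC in ... (historical context, technique)."}
print(select_description(piece, speed_mps=0.2, heading_deg=92, artwork_bearing_deg=100))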

3.2. Experiment Environment

We conducted experimental evaluations of our system at the Wumin Art Center. The system architecture integrates BLE beacons, a mobile device, and a server for indoor positioning. Figure 4 depicts the spatial configuration of the installed BLE beacons. Raspberry Pi 4B devices, functioning as BLE beacons, were strategically mounted on the gallery ceiling. Table 2 provides detailed information regarding the system specifications, including both the hardware and software components utilized in our experimental setup. To preserve the aesthetic appearance as much as possible, the Raspberry Pi devices were installed on the ceiling and enclosed in white cases matching the ceiling color. According to the museum curator, visitors did not noticeably pay attention to the Raspberry Pi installations. The exhibition space was partitioned into three discrete zones, with RSSI data collected exclusively from beacons within the user’s current zone to enhance positional accuracy. The circles in the figure represent data collection points used for training the ML model, with red circles specifically denoting corner positions. Real-time location estimation was performed through the application of the trained ML model to current RSSI measurements. For validation purposes, accuracy assessment data were collected at the same locations as those used for the training dataset.
Raspberry Pi boards used as BLE beacon hardware can be a cost-effective solution, even when requisite peripherals such as power supplies, BLE dongles, enclosures, and networking components are included [32]. However, a comprehensive cost-benefit analysis evaluating total hardware investment and long-term maintenance feasibility is recommended prior to large-scale deployment in library or gallery environments. While Raspberry Pi units necessitate a continuous power supply and network connectivity, public exhibition spaces such as galleries and museums can leverage pre-existing electrical infrastructure originally established for lighting installations, thereby minimizing or eliminating the requirement for additional wiring to support IoT hardware.
As demonstrated in [33], increasing the density of Raspberry Pi devices typically exhibits correlation with enhanced location estimation accuracy, though the marginal improvement diminishes significantly beyond a threshold of approximately five to six units. Empirical evidence from this study indicates optimal performance when Raspberry Pi devices are positioned at the corners and center of a defined space, effectively establishing a bounding box that ensures triangulation coverage by a minimum of three beacons for each target location. Research presented in [34] demonstrates that the implementation of just three BLE beacons reduced positional error from 9.09 m to 2.94 m, while additional beacons contributed only marginally to performance enhancement. This finding substantiates the conclusion that increasing device quantity does not necessarily yield proportional improvements in system performance, underscoring the importance of strategic beacon placement to optimize system efficiency while minimizing hardware requirements.
To detect humans within the art gallery, we utilized Python version 3.9.17, an NVIDIA GeForce RTX 3070 GPU, and Torch version 1.12.1_cu116. For training, we employed the YOLOv5m weights with 128 epochs, a batch size of 4, and images resized to 416 × 416 resolution. For human detection, the model was trained on image data from the CrowdHuman dataset [35]. The trained model was then converted to a TFLite model and integrated into a smartphone application. A Galaxy S21 smartphone acts as the intermediary between the user and the system: it scans beacon RSSI values, measures the user’s orientation, communicates with the server, and runs the on-device AI model on the camera stream to detect other people. In the experiment, the mobile device was hung around the user’s neck. The application was developed using Android Studio.
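A hedged reproduction of this training and conversion pipeline using the Ultralytics YOLOv5 repository's command-line scripts might look like the following; the dataset definition file crowdhuman.yaml and the output weight path are assumptions, and flag names can vary slightly across YOLOv5 releases.

import subprocess

# Run from the root of the Ultralytics YOLOv5 repository.
# Fine-tune YOLOv5m on a CrowdHuman-format dataset (crowdhuman.yaml is an assumed
# dataset definition file), using the settings reported above.
subprocess.run([
    "python", "train.py",
    "--weights", "yolov5m.pt",
    "--data", "crowdhuman.yaml",
    "--img", "416",
    "--batch-size", "4",
    "--epochs", "128",
], check=True)

# Export the trained weights to TFLite for on-device inference on the smartphone.
subprocess.run([
    "python", "export.py",
    "--weights", "runs/train/exp/weights/best.pt",   # assumed output path of the training run
    "--include", "tflite",
    "--imgsz", "416",
], check=True)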

4. Research and Analysis

4.1. Localization

Based on localization, the system provides both a navigation service and a depth of guide service. For the navigation service, the system guides people along a predefined route using their location. For the depth of guide service, it provides information about artworks based on their location. Therefore, localization accuracy is a crucial performance indicator. Figure 5a presents the accuracy of the machine-learning models used in different areas, showing accuracy rates of 91.28%, 87%, and 95%. In area 2, the accuracy is lower compared to other areas due to specific misclassifications. Figure 6 shows the confusion matrix for the results in area 2, illustrating these errors. In particular, location 5 was frequently misclassified as location 6. This suggests that the data collected at locations 5 and 6 are similar, leading to decreased accuracy. If location 6 is excluded, the accuracy increases to 95%. These results were obtained while the person remained stationary. However, when the person moves, RSSI values can become noisy, potentially affecting localization accuracy. To evaluate this, we conducted an experiment where a person moved along the predefined route. The experiment included two scenarios, with the second scenario involving actual artworks as obstacles within the art gallery.
The established routes for each scenario are shown in Figure 7a and Figure 8a. The results of the first scenario are presented in Figure 7b, where red text indicates localization errors. These errors are corrected using the calibration method described in Algorithm 1. The blue text represents the adjusted locations after applying the filtering algorithm. In Area 1, the threshold (Th) value is set to 2, while in other areas, it is set to 1. Similarly, Figure 8b shows the results of the second scenario. The findings demonstrate that navigation along the predefined route can be completed using error calibration, regardless of the presence of obstacles.
Corner recognition in localization is a crucial aspect of the navigation service. Visually impaired people rely on the system to detect corners and change direction accordingly. If a corner is recognized too late, they may risk colliding with a wall. To evaluate this, we measured the distance between the person and the wall at the moment a corner was detected. Figure 5b presents these measurements. The results confirm that corners were detected at a minimum distance of 1 m from the wall, ensuring safe navigation.

4.2. Human Detection

The relationship between facial measurements, height, and head height is significant, with the vertical face value changing proportionally to the individual’s height [36]. Building on this insight, we proceeded to conduct tests to assess how the size of human heads varies with distance and height. This was done based on the principle that when an object is closer to the camera, it occupies a greater pixel count compared to when it is distant, enabling the measurement of the object’s distance [37]. To conduct relevant tests, we collected data from a total of 5 individuals, each with heights of 153, 158, 170, 178, and 183 cm, respectively. Tests were conducted at three predetermined locations. At the first location, individuals stood at distances of 180 cm, 270 cm, and 360 cm away. They stood in an upright position while we adjusted the height of the smartphone hanging from the neck, measuring it at distances of approximately 100 cm and 150 cm above the ground level, respectively. While the positions of the individuals were fixed, their movements were not constrained. This decision was made to account for the fact that the smartphone, worn by individuals with visual impairments, would not be fixed in place, and thus, we aimed to conduct tests considering this movement.
Initially, photos were taken at each location; subsequently, using the system implemented on Android devices, the size of human heads was measured at each of the predetermined locations. The results of a test to determine the effect of height on human head size, conducted at a photographing distance of 180 cm and a photographing position of 150 cm above the ground, are depicted in Figure 9.
As observed in Figure 9, while height does not have an absolute influence on facial size, it does appear to exert a certain degree of impact. Additionally, Figure 10 illustrates the effects on facial pixel size when the photographing position is varied at a distance of 180 cm. This test was conducted to understand how changes in the photographing position affect facial pixel size. The data labeled with “_Low” and “_High” in the legend were collected at ground heights of 100 cm and 150 cm, respectively. Although the movements were not constrained, resulting in variations in detection size, a similar trend to Figure 9 can be observed.
Finally, we examined changes in face size by measurement location. In Figure 11, while there are overlapping pixel ranges to some extent depending on the distance, it is noticeable that the pixel size is more sensitive to distance than to height or facial size. This result can be seen more clearly in the average values in Table 3. Therefore, our system sets the face-recognition threshold to a pixel size of 2000 or more, so that people encountered as obstacles during route guidance are recognized within a certain range and announced to the visually impaired user through voice guidance.
In high-density crowd scenarios where multiple individuals are detected, the system employs a proximity-based prioritization algorithm to focus on the nearest person. Due to the inverse relationship between perceived head size and distance in the visual field, the effective detection threshold is constrained to approximately 10 m. The system implements a comparative analysis of detected individuals and initiates an alert protocol based on the position of the nearest subject. The sequence diagram illustrated in Figure 12 delineates the system’s operational framework. The image capture process commences simultaneously with the application initialization. Subsequently, the YOLO module conducts image analysis to determine the presence of multiple human subjects within the captured frame. In instances of multiple detections, the system measures the cranial dimensions of all detected individuals and calculates the distance based on the subject exhibiting the largest head size. When the calculated proximity falls below a predetermined threshold, the system verifies the status of any ongoing alerts before activating the device’s haptic feedback mechanism to notify the user. The system incorporates a mutex-like protocol to prevent alert redundancy by suppressing additional notifications when the haptic feedback module is already engaged. This methodological approach ensures that obstacle notifications maintain contextual relevance and efficacy, thereby enhancing navigational safety for visually impaired users in crowded environments.
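A simplified sketch of this prioritization and alert-suppression logic is shown below; the vibrate and speak callbacks, the generic warning text, and the one-second suppression window are illustrative placeholders, while the 2000-pixel head-size threshold follows Section 4.2.

import threading
import time

HEAD_PIXEL_THRESHOLD = 2000        # minimum detected head size (pixels) treated as "close"
_alert_active = threading.Lock()   # mutex-like guard against overlapping alerts

def handle_detections(detections, vibrate, speak):
    """Alert on the nearest person in the current frame.

    detections: list of dicts with a 'head_pixels' field from the detector;
    vibrate/speak: callbacks into the phone's feedback interface (placeholders).
    """
    if not detections:
        return
    nearest = max(detections, key=lambda d: d["head_pixels"])   # largest head = closest person
    if nearest["head_pixels"] < HEAD_PIXEL_THRESHOLD:
        return                                                   # too far away to warn about
    if not _alert_active.acquire(blocking=False):
        return                                                   # an alert is already playing
    try:
        vibrate()
        speak("A person is ahead of you.")    # direction-specific wording omitted for brevity
        time.sleep(1.0)                       # simple suppression window
    finally:
        _alert_active.release()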

4.3. Comparison of Object Detection Performance

This study compares the performance of two approaches for YOLO-based object detection on Android devices. The first approach is the On-Device method, which directly executes the YOLO model on the Android device. The second approach is the Edge Server-based method, which transmits images from the device to an edge server, performs object detection through the YOLO model on the server, and then returns the results to the device for processing.
The performance evaluation for the Edge Server-based approach involved measuring the image transmission time and model inference time across various image sizes. To directly compare the On-Device and Edge Server-based approaches, the total processing time was measured using images of identical dimensions.
To accurately evaluate the performance differences between the Android device and the edge server, the experiments were conducted under consistent hardware and network conditions. The Android device used was a Galaxy S21 running Android 14, and object detection was performed using the YOLOv5 model. The network environment was configured with a 5 GHz Wi-Fi connection, and each measurement was repeated 10 times to ensure consistency and reliability in the experimental results.
For performance evaluation, three key metrics were defined. The first metric was image transmission time, which measures the time taken to transmit images from the device to the edge server. The second metric was inference time, which measures the time required for the YOLO model to perform object detection. Lastly, the total processing time was defined as the sum of image transmission time, inference time, and result transmission time when using images of the same size, providing a comprehensive measure of overall system performance.
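As an example of how the device-side measurements could be taken, the snippet below times one round trip to the edge server and averages repeated runs; the server URL and the JSON response format are assumptions, not part of the described setup.

import time
import requests

SERVER_URL = "http://edge-server.local:8000/detect"   # placeholder endpoint

def measure_edge_total_time(image_bytes):
    """Round-trip time for the edge-server path: upload + inference + result return."""
    start = time.perf_counter()
    resp = requests.post(SERVER_URL, files={"image": image_bytes}, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, resp.json()       # assumes the server replies with JSON detections

def average_over_runs(image_bytes, runs=10):
    """Repeat the measurement to average out network variability (10 runs in the paper)."""
    times = [measure_edge_total_time(image_bytes)[0] for _ in range(runs)]
    return sum(times) / len(times)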
The specific characteristics of the various image sizes tested during the evaluation are detailed in Table 4. As shown in Figure 13, the time required to transmit images from the device to the edge server in the Edge Server-based approach significantly varied depending on the image size. The shortest transmission time was recorded for Img7, with an average of 153 ms, while the longest transmission time was observed for Img2, averaging 821 ms. This variation suggests that transmission time is influenced by both the transmitted image size and network conditions.
Next, the inference time of the YOLO model on the edge server is shown in Figure 14. The shortest inference time was 118 ms for Img7, while the longest inference time was 242 ms for Img2. This implies that inference time can also vary depending on the complexity and resolution of the input image.
Finally, the comparison of total processing time between the Edge Server-based and On-Device approaches is shown in Figure 15. This comparison was conducted using Img1, which had identical dimensions for both methods. The Edge Server-based approach had an average total processing time of 323 ms, while the On-Device approach measured an average of 101 ms. The Edge Server-based approach exhibited performance variability due to transmission time and network conditions, whereas the On-Device approach demonstrated more consistent processing speeds since no transmission was involved.

4.4. Quality of User Experience

The system was tested with seven visually impaired people in the art gallery, where several artworks were exhibited. The locations of the artworks are shown in Figure 16a. Figure 16b shows a photo of the participants using the system. The red boxes indicate the artworks for which the system provides information. The route in Figure 8a was used as the established route. The location numbers of Area 1 (1, 4, 8, 10), Area 2 (2, 7, 6, 8), and Area 3 (1, 3, 6) were used as the locations where the artworks were exhibited. To account for the possibility of system errors, the authors stood near the participants while they used the system. The participants moved only by following the system’s navigation guidance. The participants were asked to fill in the questionnaire presented in Table 5 using a Likert scale. The scale of each question ranges from “1: strongly disagree” to “5: strongly agree”.
Figure 17 shows the average scores for each assessment. Red boxes indicate the negatively worded questions. For the navigation service, the participants felt that locating the artworks and following the established route worked reasonably well, but the overall localization accuracy was unsatisfactory. Overall, participants were generally satisfied with receiving vibration alerts for obstacle detection while moving. Additionally, the survey results indicated that limiting warnings for the same obstacle to one per minute, to avoid repetitive alerts, did not significantly affect participants’ psychological comfort regarding safety. For the depth of guide service, the participants were satisfied that the information was provided only when they looked at the artwork. The overall satisfaction with the system was positive, and the intention to reuse the system on a return visit was especially high.

5. Conclusions

In this paper, we presented an innovative smart docent system specifically engineered for visually impaired individuals. The system offers dual functionality: a navigation service that integrates both localization and object detection modules, and a comprehensive guide service that utilizes the localization module in conjunction with the orientation module.
With these integrated functions, the localization module achieved an accuracy ranging from 91.28% to 95% in stationary tests. The localization errors that occurred during movement were effectively handled by an error correction algorithm, allowing participants to complete navigation along a preset path. Our experimental methodology established optimal parameter values for object detection based on comprehensive testing results. The system was empirically and quantitatively evaluated with seven visually impaired participants at the Wumin Art Center, followed by a qualitative assessment through structured feedback. Although the results indicated minor concerns regarding localization precision, the participants expressed general satisfaction with the system’s performance and utility.
Despite technological advancements, visually impaired individuals continue to face significant barriers to meaningful cultural participation in museum and gallery settings. We intend to collaborate with the Wumin Art Center and other cultural institutions to provide sustained access to enriching experiences for visually impaired visitors. Through these efforts toward iterative improvement and the expansion of accessibility, we aim to transition cultural engagement for visually impaired individuals from isolated experimental implementations to established and widely accessible services.
In future research directions, we aim to overcome the technical limitations identified in this study by implementing more accurate obstacle detection algorithms and improving localization accuracy. Furthermore, we seek to establish a framework for long-term service provision and continuous enhancement of the user experience. In addition, we plan to enhance the system by integrating user-centered evaluations to further improve real-time performance and the overall user experience.

Author Contributions

Conceptualization, W.P., H.A., P.L. and S.P.; methodology, W.P., H.A., P.L. and S.P.; software, W.P. and H.A.; validation, W.P., H.A., P.L. and S.P.; formal analysis, W.P., H.A., P.L. and S.P.; investigation, W.P., H.A., P.L. and S.P.; resources, W.P. and H.A.; data curation, W.P. and H.A.; writing—original draft preparation, W.P. and H.A.; writing—review and editing, P.L. and S.P.; visualization, W.P., H.A., P.L. and S.P.; supervision, P.L. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors also gratefully acknowledge the Wumin Art Center for providing us with the platform necessary to conduct the experiments and tests outlined in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaif, M.; Chauhan, U.; Chauhan, R.; Rana, R.; Chauhan, P. Designing Sensor Based Smart Stick for Handicapped Person. In Proceedings of the 2022 4th ICAC3N, Greater Noida, India, 16–17 December 2022; pp. 1399–1402. [Google Scholar]
  2. Quero, L.C.; Bartolomé, J.I.; Cho, J. Accessible visual artworks for blind and visually impaired people: Comparing a multimodal approach with tactile graphics. Electronics 2021, 10, 297. [Google Scholar] [CrossRef]
  3. Vaz, R.; Freitas, D.; Coelho, A. Blind and visually impaired visitors’ experiences in museums: Increasing accessibility through assistive technologies. Int. J. Incl. Mus. 2020, 13, 57–80. [Google Scholar] [CrossRef]
  4. Bai, J.; Lian, S.; Liu, Z.; Wang, K.; Liu, D. Virtual-Blind-Road Following-Based Wearable Navigation Device for Blind People. IEEE Trans. Consum. Electron. 2018, 64, 136–143. [Google Scholar] [CrossRef]
  5. Li, B.; Munoz, J.P.; Rong, X.; Chen, Q.; Xiao, J.; Tian, Y.; Arditi, A.; Yousuf, M. Vision-Based Mobile Indoor Assistive Navigation Aid for Blind People. IEEE Trans. Mob. Comput. 2019, 18, 702–714. [Google Scholar] [CrossRef] [PubMed]
  6. Spachos, P.; Plataniotis, K.N. BLE Beacons for Indoor Positioning at an Interactive IoT-Based Smart Museum. IEEE Syst. J. 2020, 14, 3483–3493. [Google Scholar] [CrossRef]
  7. Ahriz, I.; Douin, J.-M.; Lemoine, F. Location-based Service Sharing for Smart Museum. In Proceedings of the 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 19–21 September 2019; pp. 1–6. [Google Scholar]
8. Alletto, S.; Cucchiara, R.; Del Fiore, G.; Mainetti, L.; Mighali, V.; Patrono, L.; Serra, G. An Indoor Location-Aware System for an IoT-Based Smart Museum. IEEE Internet Things J. 2016, 3, 244–253.
9. Tyagi, N.; Sharma, D.; Singh, J.; Sharma, B.; Narang, S. Assistive Navigation System for Visually Impaired and Blind People: A Review. In Proceedings of the 2021 AIMV, Gandhinagar, India, 24–26 September 2021; pp. 1–5.
10. Ayaz, Z. Digital Advertising and Customer Movement Analysis Using BLE Beacon Technology and Smart Shopping Carts in Retail. J. Theor. Appl. Electron. Commer. Res. 2025, 20, 55.
11. Feghali, J.M.; Feng, C.; Majumdar, A.; Ochieng, W.Y. Comprehensive Review: High-Performance Positioning Systems for Navigation and Wayfinding for Visually Impaired People. Sensors 2024, 24, 7020.
12. Barsocchi, P.; Girolami, M.; La Rosa, D. Detecting Proximity with Bluetooth Low Energy Beacons for Cultural Heritage. Sensors 2021, 21, 7089.
13. Verde, D.; Romero, L.; Faria, P.M.; Paiva, S. Indoor Content Delivery Solution for a Museum Based on BLE Beacons. Sensors 2023, 23, 7403.
14. Girolami, M.; La Rosa, D.; Barsocchi, P. Bluetooth Dataset for Proximity Detection in Indoor Environments Collected with Smartphones. Data Brief 2024, 53, 110215.
15. Abidi, M.H.; Siddiquee, A.N.; Alkhalefah, H.; Srivastava, V. A Comprehensive Review of Navigation Systems for Visually Impaired Individuals. Heliyon 2024, 10, e31825.
16. Huang, Z.; Shangguan, Z.; Zhang, J.; Bar, G.; Boyd, M.; Ohn-Bar, E. Assister: Assistive Navigation via Conditional Instruction Generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 271–289.
17. Okolo, G.I.; Althobaiti, T.; Ramzan, N. Smart Assistive Navigation System for Visually Impaired People. J. Disabil. Res. 2025, 4, 20240086.
18. Chen, Z.; Liu, X.; Kojima, M.; Huang, Q.; Arai, T. A Wearable Navigation Device for Visually Impaired People Based on the Real-Time Semantic Visual SLAM System. Sensors 2021, 21, 1536.
19. Zhou, C.; Liu, X. The Study of Applying the AGV Navigation System Based on Two Dimensional Bar Code. In Proceedings of the 2016 International Conference on Industrial Informatics—Computing Technology, Intelligent Technology, Industrial Information Integration (ICIICII), Wuhan, China, 3–4 December 2016; pp. 206–209.
20. Liu, D.; Zhang, W.; Qi, Y.; Liu, F.; Liu, J.; Ren, J.; Wang, Y.; Wang, Z. Auxiliary Smart Glasses for Visually Impaired People Based on Two-Dimensional Code Positioning. In Proceedings of the 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), Beijing, China, 17–19 November 2021; pp. 279–283.
21. Lv, H.; Feng, L.; Yang, A.; Lin, B.; Huang, H.; Chen, S. Two-Dimensional Code-Based Indoor Positioning System with Feature Graphics. IEEE Photonics J. 2018, 11, 1–15.
22. Arifando, R.; Eto, S.; Wada, C. Improved YOLOv5-Based Lightweight Object Detection Algorithm for People with Visual Impairment to Detect Buses. Appl. Sci. 2023, 13, 5802.
23. Khan, S.; Nazir, S.; Khan, H.U. Analysis of Navigation Assistants for Blind and Visually Impaired People: A Systematic Review. IEEE Access 2021, 9, 26712–26734.
24. Dilli, H.; Katragadda, K.; Sang, V.M. Smart Blind Stick Using Node MCU with Voice Alert. In Proceedings of the 2022 OTCON, Raigarh, Chhattisgarh, India, 8–10 February 2023; pp. 1–6.
25. Wang, H.; Song, C.; Li, H. Application of Social Media Communication for Museum Based on the Deep Mediatization and Artificial Intelligence. Sci. Rep. 2024, 14, 28661.
26. Costales, J.A.; Vida, A.K.S.; Albino, M.G. Mobile Based Navigation System for Visually Impaired Person Using Image Detection with Voice Translation. In Proceedings of the 2023 3rd ICICSE, Chongqing, China, 7–9 April 2023; pp. 32–36.
27. Birmingham Museum of Art. Smart Guide. Available online: https://www.artsbma.org/art/smartguide/ (accessed on 10 March 2024).
28. Siwalette, R.; Suyoto. IoT-Based Smart Gallery to Promote Museum in Ambon. In Proceedings of the 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), London, UK, 27–28 July 2020; pp. 528–532.
29. Ng, P.C.; She, J.; Park, S. Notify-and-Interact: A Beacon-Smartphone Interaction for User Engagement in Galleries. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1069–1074.
30. Hussain, M. YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-Time Vision. arXiv 2024, arXiv:2407.02988.
31. Geetha, A.S.; Hussain, M. A Comparative Analysis of YOLOv5, YOLOv8, and YOLOv10 in Kitchen Safety. arXiv 2024, arXiv:2407.20872.
32. Gebru, K.; Rapelli, M.; Rusca, R.; Casetti, C.; Chiasserini, C.F.; Giaccone, P. Edge-Based Passive Crowd Monitoring Through WiFi Beacons. Comput. Commun. 2022, 192, 163–170.
33. De Brouwer, R.; Torres-Sospedra, J.; Trilles-Oliver, S.; Berkvens, R. Optimal Receivers Location for Passive Indoor Positioning Based on BLE. In Proceedings of the IPIN-WiP, Lloret de Mar, Spain, 29 November–2 December 2021.
34. Saghafi, S.; Kiarashi, Y.; Rodriguez, A.D.; Levey, A.I.; Kwon, H.; Clifford, G.D. Indoor Localization Using Multi-Bluetooth Beacon Deployment in a Sparse Edge Computing Environment. Digit. Twins Appl. 2025, 2, e70001.
35. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv 2018, arXiv:1805.00123.
36. Baume, R.M.; Buschang, P.H.; Weinstein, S. Stature, Head Height, and Growth of the Vertical Face. Am. J. Orthod. 1983, 83, 477–484.
37. Zuckerman, M.; Kolberg, E. Distance Estimation to Image Objects Using Adapted Scale. Int. J. Eng. Sci. 2017, 6, 39–50.
Figure 1. System architecture.
Figure 2. Framework of the system, consisting of the navigation service and the adaptive depth-of-guidance service.
Figure 3. Navigation for the docent service.
Figure 4. Beacon deployment locations in the art gallery.
Figure 5. Localization model performance: (a) accuracy of the models in each area; (b) distance between a wall and the user when the system detects a corner.
Figure 6. Confusion matrix for the results in area 2.
Figure 7. Scenario 1: (a) the established route; (b) localization results when the user moves along the established route.
Figure 8. Scenario 2: (a) the established route; (b) localization results when the user moves along the established route.
Figure 9. Comparison of face sizes relative to height.
Figure 10. The relationship between height and facial size at different photographing heights.
Figure 11. The relationship between measured distance and facial size.
Figure 12. Sequence diagram of the visual proximity detection and alert system.
Figure 13. Image transmission time by image size in the edge-server-based approach.
Figure 14. Inference time by image size in the edge-server-based approach.
Figure 15. Comparison of total processing time between the on-device and edge-server-based approaches.
Figure 16. (a) Overview of the art gallery while the artworks were exhibited; (b) participants using the proposed system.
Figure 17. Likert scores from the proposed system experiment.
Table 2. System implementation specifications: hardware and software components.
Category | Component | Details
Hardware | BLE beacons | Raspberry Pi 4B units
Hardware | Mobile device | Galaxy S21 smartphone
Hardware | GPU (training) | NVIDIA GeForce RTX 3070
Software | YOLOv5 | YOLOv5m architecture; epochs: 128; batch size: 4; image resolution: 416 × 416 pixels; dataset: CrowdHuman
Software | Python | Version 3.9.17
Software | Library | TFLite, TensorFlow
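For context, the YOLOv5m configuration in Table 2 corresponds to a standard Ultralytics-style training setup whose weights are exported to TFLite for on-device use. The snippet below is a minimal sketch of running such a converted model in Python; the model file name, input image, and confidence threshold are illustrative assumptions, not the exact pipeline of the deployed app.

```python
# Minimal sketch: running a YOLOv5m model converted to TFLite at 416x416,
# mirroring the configuration in Table 2. The model path, input image, and
# threshold are hypothetical; the deployed app runs the equivalent
# interpreter on the handset itself.
import numpy as np
import tensorflow as tf
from PIL import Image

MODEL_PATH = "yolov5m-crowdhuman-fp16.tflite"  # hypothetical file name
IMG_SIZE = 416                                 # matches Table 2

interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess: resize to 416x416 and normalize to [0, 1], NHWC float32.
img = Image.open("frame.jpg").convert("RGB").resize((IMG_SIZE, IMG_SIZE))
x = np.asarray(img, dtype=np.float32)[None] / 255.0

interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()

# The YOLOv5 TFLite export emits [1, num_boxes, 5 + num_classes]:
# (cx, cy, w, h, objectness, class scores...). Keep confident candidates.
pred = interpreter.get_tensor(output_details[0]["index"])[0]
boxes = pred[pred[:, 4] > 0.4]
print(f"{len(boxes)} candidate detections above threshold")
```

On the smartphone, the equivalent step would typically use the TensorFlow Lite runtime for Android rather than the Python interpreter shown here.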
Table 3. Average facial size of each individual based on photographing position.
Height (cm) | Face size at 180 cm | Face size at 270 cm | Face size at 360 cm
153 | 2968.451 | 1275.108 | 600.8036
158 | 2179.694 | 959 | 559.5476
170 | 3261.687 | 1612.492 | 814.3333
178 | 2478.534 | 1338.682 | 762.1111
183 | 3269.375 | 1717.545 | 900.9821
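The calibration in Table 3 shows face size shrinking monotonically as the photographing distance grows from 180 cm to 360 cm, which is what makes detected face size usable as a proximity cue. As a rough illustration, and not the paper's exact model, the sketch below interpolates distance from a detected face size using the 170 cm row of the table.

```python
# Illustrative sketch: estimating subject distance from detected face size by
# interpolating the Table 3 calibration for one height (170 cm). It assumes a
# simple monotonic face-size-to-distance mapping; it is not the paper's model.
import numpy as np

# Calibration for a 170 cm subject: photographing distance (cm) -> face size.
distances_cm = np.array([180.0, 270.0, 360.0])
face_sizes = np.array([3261.687, 1612.492, 814.3333])

def estimate_distance(face_size: float) -> float:
    """Interpolate distance from face size (face size decreases with distance)."""
    # np.interp needs increasing x values, so interpolate over reversed arrays.
    return float(np.interp(face_size, face_sizes[::-1], distances_cm[::-1]))

print(estimate_distance(2400.0))  # roughly between 180 and 270 cm
```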
Table 4. Specifications of test images used for performance evaluation of YOLO-based object detection methods.
Image ID | Dimensions | Resolution (MP) | File Size (MB)
Img1 | 3024 × 4032 | 12.00 | 4.08
Img2 | 6936 × 9248 | 64.00 | 17.23
Img3 | 2268 × 4032 | 9.00 | 2.56
Img4 | 5204 × 9248 | 48.00 | 12.29
Img5 | 3024 × 3024 | 9.00 | 2.63
Img6 | 6928 × 6928 | 48.00 | 12.27
Img7 | 1816 × 4032 | 7.00 | 2.35
Img8 | 4164 × 9248 | 39.00 | 11.20
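Table 4 lists the test images behind the transmission and inference time comparisons in Figures 13–15. As a back-of-the-envelope illustration of why file size dominates the edge-server path, the sketch below estimates per-image upload time under an assumed uplink speed; the bandwidth and the fixed on-device inference budget are hypothetical values, not measurements from the experiment.

```python
# Rough illustration (not measured data): per-image upload time for the test
# images in Table 4 under an assumed 50 Mbit/s uplink, compared with a
# hypothetical fixed on-device inference time. Measured results are in
# Figures 13-15.
UPLINK_MBIT_PER_S = 50.0          # assumed Wi-Fi uplink speed
ON_DEVICE_INFERENCE_S = 0.25      # hypothetical on-device budget per frame

test_images_mb = {
    "Img1": 4.08, "Img2": 17.23, "Img3": 2.56, "Img4": 12.29,
    "Img5": 2.63, "Img6": 12.27, "Img7": 2.35, "Img8": 11.20,
}

for name, size_mb in test_images_mb.items():
    upload_s = size_mb * 8.0 / UPLINK_MBIT_PER_S  # MB -> Mbit -> seconds
    print(f"{name}: upload ~ {upload_s:.2f} s "
          f"(on-device budget {ON_DEVICE_INFERENCE_S:.2f} s)")
```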
Table 5. Likert assessments for quality of user experience.
ID | Assessment | Type
1 | Was a smartphone suitable as a device for providing the system? | P
2 | Was the system complicated to use? | N
3 | Was the navigation inaccurate? | N
4 | Was the location of the artwork recognized by the system suitable for listening to information about it? | P
5 | Were you dissatisfied with traveling along the recommended route? | N
6 | Was the audio navigation helpful when touring the museum? | P
7 | Was it convenient to have information provided when looking in the direction of the artwork? | P
8 | Was the audio guide for the artwork adequate for obtaining information about it? | P
9 | Was it inconvenient that obstacle detection alarms were provided while information about an artwork was playing? | N
10 | Did you find the obstacle detection alarm frequency of once per second too slow and inconvenient? | N
11 | Did you feel that obstacle detection was working well? | P
12 | Did the obstacle detection service give you psychological stability while moving around the art gallery? | P
13 | Were you dissatisfied with using the system at the art gallery? | N
14 | Would you consider using the system again when you return to the art gallery? | P
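Table 5 mixes positively (P) and negatively (N) phrased items. A common way to aggregate such responses, assuming a 5-point scale, is to reverse-code the negative items so that a higher score always indicates a better experience; the sketch below illustrates that convention and is not necessarily the scoring procedure used in the study.

```python
# Illustrative reverse-coding of the Table 5 Likert items (assumed 5-point
# scale). Negative (N) items are flipped so that higher always means a better
# experience; this is a common convention, not necessarily the paper's method.
ITEM_TYPES = {1: "P", 2: "N", 3: "N", 4: "P", 5: "N", 6: "P", 7: "P",
              8: "P", 9: "N", 10: "N", 11: "P", 12: "P", 13: "N", 14: "P"}
SCALE_MAX = 5  # assumed 5-point Likert scale

def satisfaction_score(responses: dict[int, int]) -> float:
    """Average score after reverse-coding negatively phrased items."""
    adjusted = [
        (SCALE_MAX + 1 - score) if ITEM_TYPES[item] == "N" else score
        for item, score in responses.items()
    ]
    return sum(adjusted) / len(adjusted)

# Example: one hypothetical participant's raw answers.
example = {i: 4 if ITEM_TYPES[i] == "P" else 2 for i in range(1, 15)}
print(f"Mean adjusted score: {satisfaction_score(example):.2f}")
```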
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
