Article

Smartphone-Based Seamless Scene and Object Recognition for Visually Impaired Persons

Graduate School of Science and Engineering, Saga University, Saga 840-8502, Japan
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 808; https://doi.org/10.3390/info16090808
Submission received: 26 July 2025 / Revised: 28 August 2025 / Accepted: 3 September 2025 / Published: 17 September 2025

Abstract

This study introduces a mobile application designed to assist visually impaired persons (VIPs) in navigating complex environments, such as supermarkets. Recent assistive tools often identify objects in isolation without providing contextual awareness. In contrast, our proposed system uses seamless scene and object recognition to help users efficiently locate target items and understand their surroundings. Employing a human-in-the-loop approach, users control the direction of their smartphone camera to explore the space. Experiments conducted in a simulated shopping environment show that the system enhances object-finding efficiency and improves user orientation. This approach not only increases independence but also promotes inclusivity by enabling VIPs to perform everyday tasks with greater confidence and autonomy.

1. Introduction

1.1. Background

Visually impaired persons (VIPs) often face persistent challenges when performing daily activities that rely heavily on visual information, such as shopping. Locating specific items in a supermarket, especially when navigating through multiple shelves, sections, or tightly packed products, can be particularly difficult without assistance. This often forces people to rely on sighted companions or store staff, limiting their autonomy and potentially decreasing confidence in navigating public spaces. These problems are amplified in complex environments, where tasks like switching between aisles or scanning vertically stacked items (typically arranged on five or six tiers) require continuous visual awareness.
While prior work has advanced indoor navigation and object detection, it has not fully addressed the unique challenges of supermarket settings, where visually impaired users must identify small items within visually cluttered and homogeneous shelf layouts. Commercial applications such as Envision AI and Google Lookout [1,2] offer real-time recognition but operate as general-purpose tools with limited adaptation to complex environments and task-specific scenarios. These systems often impose a cognitive load on users, as they require the user to select the object they want to find and can only search for a single type of object at a time. Such limitations are especially problematic for tasks that require speed, sequence, or spatial reasoning, such as locating items on multi-tiered supermarket shelves.
To bridge this gap, we propose a system that combines lightweight, mobile-friendly object detection with scene-level reasoning to provide contextual guidance, enabling more efficient and independent item search. By narrowing the gap between high-level environmental awareness and item-level detail, our system aims to better support the nuanced, visually complex tasks that blind users encounter in everyday life.
Our system integrates object and scene recognition within a human-in-the-loop framework, enabling more efficient and context-aware object finding without requiring external hardware. We introduce a mobile system that supports scene-object awareness to help blind users find specific items in structured environments, such as supermarkets. Rather than relying solely on object labels, our system integrates scene context (e.g., identifying a fruit shelf or beverage section) with object recognition, providing higher-level cues that guide the user’s search.
The system is designed around a human-in-the-loop interaction model, where users actively control the camera’s orientation to explore their surroundings while receiving real-time auditory feedback. Our proposed method advances previous object recognition systems [3,4], which are limited to small-object detection and do not utilize spatial context. By introducing seamless scene-object integration, our approach enhances usability for visually impaired individuals in complex supermarkets with many shelves.
The contributions of this study are fourfold: (1) we design and implement an Android-based assistive system that supports blind users in locating specific objects using contextual cues from both scenes and objects; (2) we train a model on a combined dataset of shelf scenes and target objects to enable seamless context-aware detection; (3) we evaluate the usability of the system through exploratory user testing; and (4) we demonstrate that contextual guidance improves efficiency and reduces cognitive effort during search tasks.

1.2. Related Work

Endsley’s situational awareness framework [5] defines three levels of awareness: perceiving environmental elements, understanding their meaning, and projecting future states. For visually impaired users, contextual guidance via scene and object recognition can support these processes by supplementing missing visual information, helping users better understand their surroundings and plan navigation actions. Rather than directing visually impaired users to specific locations, our system aims to enhance their understanding of the environment and support confident, independent navigation decisions.
Existing assistive technologies offer only partial support for the complex needs of visually impaired users. RFID and beacon-based systems [6,7] provide indoor localization but lack visual perception, making them unsuitable for environments where item-level recognition is critical. Camera-based solutions [8] assist in interpreting surroundings but often fail to deliver contextual or scene understanding, which is necessary in visually cluttered spaces like supermarkets. Marker- and depth-sensor-based solutions [9] improve scene recognition but require additional hardware, limiting their scalability in daily shopping scenarios. Wearable and 3D-based systems [10,11] offer richer descriptions, but introduce usability and portability challenges in crowded public settings.
Lightweight, mobile-optimized models, such as small-object detection with YOLO NAS Small [12] and efficient Vision Transformers for scene text recognition and visual-impairment assistance on mobile devices [13,14], demonstrate strong performance in detecting small objects. However, these methods still operate at the level of isolated detection or recognition and do not incorporate the contextual reasoning needed to guide object search in structured environments.
Pratap et al. (2025) provide a comprehensive evaluation of real-time detection algorithms for indoor navigation [15]; however, these algorithms are typically optimized for general indoor spaces and do not explicitly address the challenges of detecting small, densely arranged items on supermarket shelves. Mixed-reality navigation systems that fuse scene, object, depth, and NLP, like MR.NAVI [16], demonstrate the benefits of multi-modal context integration, but their high computational and hardware requirements limit their practical deployment on portable mobile devices in everyday shopping contexts.
Several mobile applications have been designed to support visually impaired individuals in daily tasks. For instance, Norkhalid et al. [17] developed a voice command navigation interface, but without visual scene recognition, and Sargsyan et al. [18] introduced a mobile map offering audio and haptic feedback for spatial orientation, but without object identification. Similarly, a map designed for visually impaired users enables them to understand their surroundings and navigate effectively using a portable smartphone- or tablet-based system [19], but it remains disconnected from real-time object recognition.

2. System Overview

This study presents a seamless recognition system that enables VIPs to navigate complex environments using a single mobile device. As shown in Figure 1, the system operates in two distinct spatial interaction modes: (1) distal positioning, where the user scans the environment from a distance, and (2) proximal positioning, where the user moves closer to examine specific items. These two modes leverage a shared recognition model to interpret both scene and object information based on the relative scale of visual input. The system continuously analyzes the camera feed and delivers real-time auditory feedback, announcing the names of detected scenes and objects as users move through the environment. This dynamic feedback loop helps users build spatial awareness, identify relevant items, and make informed navigational decisions without requiring additional hardware or infrastructure.

2.1. Human-in-the-Loop Design

Our system adopts a human-in-the-loop approach, in which users actively control the orientation of their smartphone camera to explore the environment and search for target objects (Figure 2). As users move through the space, the system continuously analyzes the camera feed and delivers real-time auditory feedback corresponding to detected scenes and objects. This interactive loop enables users to iteratively refine their search strategy and maintain situational awareness, granting them greater agency and precision during navigation. By embedding users directly into the recognition process, the system supports adaptive exploration and facilitates goal-directed behavior in visually complex environments.

2.2. Seamless Scene-Object Recognition

At the core of our system is a seamless real-time recognition pipeline, enabled by training a YOLOv11 model on the Roboflow platform [20]. After training, the model was converted to the TensorFlow Lite format to support efficient inference on mobile devices [21]. The dataset includes a diverse set of images reflecting real-world supermarket environments, featuring common items such as apples, bottles, and cans under varying lighting conditions and backgrounds. A total of 1135 labeled images were used, spanning 26 scene and object classes. Scene labels included categories such as fruit shelf, beverage shelf, vegetable shelf, food shelf, and item shelf, chosen to reflect the typical physical structure of a grocery store and to provide contextual grounding for object recognition tasks.
During dataset construction, annotations were selectively applied to objects and scenes occupying a moderate proportion of the image frame. Specifically, visual targets that were either too small (e.g., distant objects occupying less than 20% of the image area) or excessively large (e.g., shelves captured at close range exceeding the frame boundary) were intentionally excluded. This annotation strategy was adopted to reduce ambiguity in object boundaries and to improve detection accuracy for targets of practical salience in typical use scenarios.
This constraint directly shaped the recognition capabilities of the trained YOLOv11 model, which demonstrated optimal performance when visual targets appeared within a defined spatial scale relative to the image frame. Consequently, the model exhibited sensitivity to camera-to-object distance, with recognition accuracy dependent on the relative size of the object within the camera view.
This behavior is evident in real-world usage. When the user is positioned at a distance, such as at the midpoint of a supermarket aisle, the system reliably identifies macro-level contextual elements (e.g., entire shelves), which fall within the optimal detection range. In contrast, fine-grained object recognition (e.g., cans or apples) is not activated under these conditions due to insufficient visual resolution. Conversely, when the user approaches a shelf, the recognition focus shifts to individual items, while large scene-level elements are excluded due to exceeding the effective detection window. Rather than being a limitation, this distance-sensitive behavior enables a context-aware recognition pipeline, facilitating smooth transitions between scene classification and object detection as the user dynamically explores the environment.
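The sketch below illustrates this behavior in simplified form. The class names are taken from the dataset labels described above, while the detection format is an assumption made for illustration: a single detector output naturally yields scene-level cues from a distance and object-level cues up close, without any explicit mode switch.

```python
# Illustrative sketch: one detector, two kinds of cues. Class names follow the
# dataset labels in Section 2.2; the detection format is an assumption.
SCENE_CLASSES = {"Beverage Shelf", "Food Shelf", "Fruit Shelf",
                 "Vegetable Shelf", "Items Shelf"}

def group_detections(detections):
    """Split detections into scene cues and object cues.

    Each detection is assumed to be a dict with 'label' and 'confidence' keys.
    From mid-aisle, only shelf-scale classes are typically detected; near a
    shelf, item-scale classes dominate, so the announced cue shifts seamlessly.
    """
    scenes = [d for d in detections if d["label"] in SCENE_CLASSES]
    objects = [d for d in detections if d["label"] not in SCENE_CLASSES]
    return scenes, objects

# Example: a frame captured from the middle of an aisle.
print(group_detections([{"label": "Fruit Shelf", "confidence": 0.92}]))
```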
To support deployment, the application was developed using Android Studio with the Flutter framework [22,23], enabling cross-platform compatibility for Android devices. System testing was conducted on an Android smartphone with API level 33. A diagram of the system pipeline is shown in Figure 3. This configuration ensures that the system remains lightweight and responsive in real-time, while also highlighting the dynamic relationship between camera distance, recognition capability, and audio feedback.
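As a rough sketch of the conversion step mentioned above (the weight path is a placeholder, and the exact export settings used in this study are not reported), trained YOLO weights can be exported to TensorFlow Lite with the Ultralytics toolkit before being bundled into the mobile application:

```python
# Sketch of converting trained YOLO weights to TensorFlow Lite for on-device
# inference. The weight path is a placeholder; the exported .tflite file is
# what the mobile application loads at run time.
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical trained weights
model.export(format="tflite", imgsz=640)           # produces a .tflite model file
```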

2.3. Multiple Scenes and Object Recognition

In order to minimize cognitive load and improve usability, the system implements an object prioritization mechanism that selects the most relevant object in each video frame, as illustrated in Figure 2. The system continuously recognizes multiple scenes and objects to dynamically understand the context. To ensure that users receive only the most pertinent information, the object selection mechanism evaluates both the proximity of detected objects to the center of the screen and their detection confidence scores.
For each detected object $i$, the system obtains the confidence score $C_i$, which corresponds to YOLO’s detection confidence, and computes the Euclidean distance $d_i$ from the center of the frame, defined as

$$d_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2} \qquad (1)$$

where $(x_i, y_i)$ is the center of the object’s bounding box and $(x_c, y_c)$ is the center coordinate of the video frame. These two factors are then combined into a single selection score:

$$S_i = \alpha \cdot C_i - \beta \cdot d_i \qquad (2)$$

In Equation (2), the weighting parameters are fixed to $\alpha = 0.5$ and $\beta = 1$. This configuration balances the detection confidence against the object’s distance from the screen center, prioritizing objects near the center while still accounting for the detector’s confidence scores. The weights were chosen empirically from validation experiments to achieve this balance, reflecting the practical need to help users identify nearby objects efficiently. The object with the highest combined score $S_i$ is selected as the final detection output for that frame, ensuring that only one object is communicated to the user at a time to prevent cognitive overload.
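A minimal sketch of this prioritization rule is given below. The detection format and the normalization of $d_i$ to the [0, 1] range are assumptions for illustration, since the units of the distance term are not stated in the text.

```python
# Sketch of the per-frame object prioritization, S_i = alpha*C_i - beta*d_i.
# The detection format and the distance normalization are illustrative assumptions.
import math

ALPHA, BETA = 0.5, 1.0           # weights from Equation (2)
FRAME_W, FRAME_H = 640, 640      # input frame size
X_C, Y_C = FRAME_W / 2, FRAME_H / 2

def select_object(detections):
    """Return the single detection with the highest score S_i."""
    max_dist = math.hypot(X_C, Y_C)   # used to normalize d_i to [0, 1]
    best, best_score = None, -math.inf
    for det in detections:
        x_i, y_i = det["center"]      # bounding-box center (x_i, y_i)
        d_i = math.hypot(x_i - X_C, y_i - Y_C) / max_dist
        s_i = ALPHA * det["confidence"] - BETA * d_i
        if s_i > best_score:
            best, best_score = det, s_i
    return best  # only this object's name is announced to the user

# Example frame with two candidate detections; the more central one wins.
print(select_object([
    {"label": "Red Apple", "confidence": 0.90, "center": (320, 300)},
    {"label": "Banana", "confidence": 0.95, "center": (600, 100)},
]))
```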
After a scene is recognized and the user approaches the relevant area, the system detects specific nearby objects (e.g., apples, bottles, or cans) within the frame. Real-time auditory feedback is then provided, allowing users to hear the name of the detected object.
Unlike conventional systems that treat scene and object recognition as independent tasks and often require complex user interfaces with multiple controls, the proposed system integrates both functions seamlessly. By training the model on a unified dataset that includes both scene and object labels, the system is capable of detecting multiple scenes and relevant objects simultaneously within a single frame. This integrated approach contributes to a more intuitive and efficient user experience.

3. Tested Smartphone Application for Supporting Visually Impaired Persons

As a baseline study, we investigated mobile applications available on the Android platform that are designed to support visually impaired persons (VIPs). A comparative analysis of their performance is presented in Table 1. The first application, Envision AI, was developed by a socially driven AI company. Envision AI is a free, award-winning application that utilizes artificial intelligence and optical character recognition (OCR) to convert visual information, such as text and objects, into spoken feedback, thereby promoting greater independence for VIPs. The application provides auditory feedback using a frequency tone upon detecting a predefined target object. However, it does not offer any feedback when the object is not detected. Furthermore, the configuration of the target object is limited to a single class, meaning that other objects present in the environment are ignored.
The second application evaluated was Lookout, developed by Google. Lookout is a free Android-based application that leverages the device’s camera and advanced AI technologies to deliver real-time auditory descriptions of objects and text, thereby assisting VIPs in navigating their surroundings. In object search mode, the application emits a frequency tone and announces the object’s name when a predefined target is detected. Similarly to Envision AI, Lookout only identifies the specified target, ignoring other objects in the environment.
Both Envision AI and Lookout are limited in conveying the broader context in which objects are situated. For instance, when using Envision AI in a cluttered or complex environment, users may become disoriented due to the absence of feedback when the target object is not detected. This lack of continuous feedback offers no directional cues, leaving users without guidance. Likewise, although Lookout can estimate the position and distance to a target object, its performance degrades in complex environments, where identifying small or specific items becomes inefficient and time-consuming. These limitations often result in unnecessary movement and prolonged search durations, especially in maze-like environments such as supermarkets.
The third application was a custom-built baseline system implementing object recognition alone. We refer to this as the object-based application. It provided auditory feedback by announcing the class names of all detected objects in real time and employed a prioritization mechanism to manage multiple detections within a frame. This application was developed specifically for this study and does not correspond to prior work [3,4]. Because the existing applications use different object recognition models and feedback methods, they are difficult to compare directly with our proposed scene-based approach. To enable a fair evaluation, we therefore developed a custom application that adopts the object-based approach of the existing applications while using the same object recognition model and feedback method as our scene-based application.
The fourth application represents the proposed system, which constitutes the core contribution of this study. We refer to this as the scene-based application. It integrates both scene and object recognition to deliver a comprehensive understanding of the environment. The system features seamless scene–object integration and employs a prioritization mechanism to convey the most contextually relevant information. Unlike previous applications, it does not require users to predefine target objects and ensures that surrounding objects are also considered. This results in a more flexible, informative, and user-friendly experience.
To evaluate the performance of the scene-based application, we conducted a paired samples t-test comparing it against the object-based application. Outlier data were excluded from these analyses to ensure the validity of the statistical results.
By enhancing contextual awareness, reducing cognitive load, and delivering intuitive auditory feedback, the proposed system aims to improve navigation efficiency. This makes it particularly well-suited for VIPs operating in complex environments such as supermarkets.

4. Experiments

4.1. Training Process

The system input consists of still images captured with a smartphone camera at 640 × 640 pixels; because the detector processes individual frames, no fixed frame rate is specified. The object detection model was trained on annotated images using an 80/20 training–validation split. For each training example, three augmented variations were generated: random noise was added to up to 1.49% of the pixels, and the bounding box size was randomly adjusted between −15% and +15% of the original dimensions. Using YOLOv11, hyperparameters such as image size, learning rate, and batch size were tuned to balance accuracy and real-time performance (Table 2). Model performance was monitored via mean average precision (mAP) to prevent overfitting (Table 3). The network outputs include the predicted object class, bounding box coordinates, and confidence score. The final trained model was deployed in the tested mobile applications. Although no fixed frame rate is specified, the system captures real-time video smoothly enough to reliably observe motion.
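As a rough illustration, the Table 2 hyperparameters map onto a standard Ultralytics training call as shown below. The dataset YAML file and the specific YOLO11 variant are placeholders; the actual training in this study was performed on the Roboflow platform.

```python
# Illustrative training sketch using the Table 2 hyperparameters.
# The dataset configuration and model variant are placeholder assumptions.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # a lightweight YOLO11 variant suited to mobile inference
model.train(
    data="supermarket_scene_object.yaml",  # hypothetical dataset configuration
    imgsz=640,    # image size 640 x 640
    lr0=0.01,     # learning rate
    batch=16,     # batch size
    epochs=100,   # epochs
)
metrics = model.val()  # reports mAP@50 and mAP@50:95, as summarized in Table 3
```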

4.2. Participants and Instructions

The experiment was conducted with ten blindfolded participants, each equipped with a smartphone running the developed application and a white cane. To simulate the experience of a visually impaired individual visiting a supermarket for the first time, each participant tested a maximum of two applications per day, with the order of applications varied across sessions. To minimize familiarity bias, the arrangement of shelves and object placement were randomized for each session.
Participants received three main instructions prior to the experiment. First, they were trained on the basic technique of using a white cane, simulating typical navigation behavior of visually impaired users. Second, a list of target objects was introduced along with their English names to ensure label recognition, as all participants were non-native English speakers. Third, participants were instructed to locate and retrieve five different target objects within the test environment.
All participants were thoroughly briefed on the purpose and procedures of the experiment, and written informed consent was obtained from each participant before participation. This study was conducted with the approval of the Saga University Ethics Committee (Approval No. R7-6).

4.3. Experimental Conditions

The experiment was conducted to evaluate the effectiveness of the seamless recognition system in a supermarket-like environment. As illustrated in Figure 4, the test was performed in a 4 m × 4 m indoor space designed to simulate a typical shopping scenario, featuring multiple shelves and a variety of common supermarket items. The shelves used in the environment measured 90 cm × 300 cm and were divided into five levels; only levels 1 to 3 were utilized for object placement.
Five types of shelves were employed, each labeled with a distinct color to represent different product sections. A mix of artificial food items and real objects was used to create a realistic and diverse testing scenario.
Four mobile applications were evaluated during the testing phase, as summarized in Table 1.

4.4. Evaluation Metrics

The performance of the tested applications was evaluated using two primary metrics: (1) the time to complete the object retrieval tasks, measured in minutes, and (2) the number of objects successfully retrieved by each participant. For the time metric, the distribution of task completion times across participants was visualized using box-and-whisker plots, with points beyond 1.5 times the interquartile range plotted as outliers. This visualization captures both the central tendency and the variability of each application, enabling comparison of consistency and spread in task completion times.
A paired samples t-test was conducted to determine whether task completion time differed significantly between the object-based and scene-based applications. The dependent variable was each participant’s completion time in seconds, and the independent variable was the type of application. Each application was tested by ten participants, with each participant tasked to find five objects per application. The analysis examined whether the mean completion times differed significantly between the two applications.
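A minimal sketch of this analysis is given below; the completion-time arrays hold placeholder values rather than the measured data, and the 1.5 × IQR rule mirrors the outlier criterion used for the box plots.

```python
# Sketch of the outlier screening and paired t-test described above.
# The arrays hold placeholder values, NOT the measured completion times.
import numpy as np
from scipy import stats

object_based = np.array([610, 540, 480, 450, 590, 500, 520, 700, 560, 470], float)
scene_based  = np.array([520, 300, 260, 380, 360, 290, 440, 270, 410, 350], float)

def iqr_outliers(x):
    """Boolean mask of points beyond 1.5 x IQR (the box-plot outlier rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Keep only participant pairs where neither condition is an outlier.
keep = ~(iqr_outliers(object_based) | iqr_outliers(scene_based))
res = stats.ttest_rel(object_based[keep], scene_based[keep])
print(f"t({keep.sum() - 1}) = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```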
Additionally, object retrieval accuracy is presented as a percentage of the total number of target objects. These measures provide a clear and quantitative basis for comparing the effectiveness and efficiency of each application.

5. Result and Evaluation

The evaluation focused on two key metrics: the time taken by each participant to retrieve five target objects, and the number of objects retrieved by the users based on the system’s feedback.

5.1. Time Consumption

The completion times for each application are illustrated in Figure 5, showing the performance of the four tested applications. Overall, the results indicate that Envision AI and Lookout yielded the longest task completion times. This outcome was primarily attributed to participants’ uncertainty about their own location and the position of shelves in the spacious environment. Both applications lacked contextual feedback, such as the identification of nearby objects or shelf sections, forcing participants to rely heavily on the white cane to interpret their surroundings. Extended completion times were particularly evident when participants searched within three-level shelves or mistakenly navigated to the wrong shelf.
The object-based and scene-based applications showed moderately improved performance. Both systems provided feedback on the current shelf section and nearby objects. For instance, when a participant searching for a donut was informed that the nearest object was an apple, they correctly inferred that they were at the wrong shelf and adjusted their navigation accordingly. However, since the object-based system was designed to detect only specific items, participants still had to explore the entire environment to locate the correct shelf. This contributed to moderate task durations, though the added contextual awareness enhanced users’ decision-making during the search process.
The scene-based application achieved the best performance, with an average completion time of 7 min 48 s (SD = 6 min 19 s). This standard deviation indicates some variation across participants, but overall, the approach enabled faster task completion. The improved performance can be attributed to the system’s prioritization of larger, scene-relevant objects in wide spaces, which facilitated more effective identification of the target shelf, in line with the human strategy of attending to prominent areas when entering a supermarket.
A data anomaly occurred during the test with Participant 10. On the first day, the participant used the scene-based and Lookout applications. Due to hurried movements during the task, the system’s recognition accuracy was compromised, resulting in a prolonged completion time. The Lookout application displayed a “move slowly” notification, but because the notification was text-based and the participant was simulating blindness, it went unnoticed. This error was classified as a human factor, and the resulting data were identified as an outlier.
Experimental conditions and participant movement patterns are illustrated in Figure 6; the points were plotted every five seconds, showing the process of locating a single target object. Completion times for all participants are summarized in Table 4.
Participants using Envision AI and Lookout frequently showed confusion regarding spatial layout and self-location. In contrast, participants using the object-based application tended to search around the room’s perimeter, while those using the scene-based system more often navigated directly toward the target area, exhibiting minimal disorientation.
The t-test result confirmed that the mean completion times differed significantly between object-based and scene-based applications, t(9) = 4.07, p < 0.05.

5.2. Number of Recognized Targets

The number of correctly recognized targets was based on the detection results and the participants’ ability to retrieve the intended items. The evaluation results for all participants are presented in Figure 7.
In the evaluation results, Envision AI and Lookout exhibited the highest rates of incorrect detection. This was primarily attributed to participants’ difficulty in distinguishing visually similar items, such as ‘mineral water’ and ‘cans.’ In some cases, participants abandoned the search after extended attempts and instead relied on their sense of touch. These instances were categorized as accuracy failures. In contrast, both the object-based and scene-based applications demonstrated good accuracy in identifying the target objects.

6. Limitation and Future Work

Our study and system design have several limitations. Variations in room layout and occlusions between objects may affect the system’s ability to consistently recognize scenes or objects. Additionally, although the system provides real-time auditory feedback, unstructured or excessive information delivery can increase the user’s cognitive load.
Looking ahead, further development should aim to create systems capable of adapting to dynamic environments, where object positions and spatial layouts frequently change, as commonly found in public spaces such as supermarkets. Enhancing the system’s robustness against environmental disturbances will be a critical step toward realizing a more reliable, flexible, and inclusive navigation aid for visually impaired individuals in daily life.
Future work will explore optimizing multimodal feedback modalities that convey information more selectively and intuitively, such as vibration or tactile cues, which can reduce cognitive load and provide users with complementary channels of interaction beyond auditory feedback.
In this study, participant recruitment was limited, as all participants were sighted science and engineering students who were blindfolded during the study. Consequently, the results may not fully generalize to the visually impaired population.

7. Conclusions

In complex environments, object finding and independent navigation become significantly more challenging for visually impaired users due to the spatial layout complexity and the abundance of visual stimuli that are not directly accessible.
Traditional object detection methods often fail to provide sufficient contextual information, leaving users uncertain about their position within the environment and forcing them to explore the area extensively through trial and error. The proposed method, which integrates visual context understanding through scene recognition, enabled participants to better recognize their location in the test environment, as demonstrated by the significantly lower task completion time with the scene-based application. This was achieved by identifying key contextual cues, such as shelf types and nearby objects, which serve as spatial references. The improved environmental awareness allowed participants to navigate directly to the target area, such as a specific shelf type, rather than indiscriminately scanning the entire space.

Author Contributions

Conceptualization, F.A.R.; methodology, F.A.R.; software, F.A.R.; supervision, W.L.Y. and O.F.; validation, F.A.R., W.L.Y. and O.F.; writing—original draft, F.A.R.; writing—review and editing, F.A.P. and O.F. All authors have read and agreed to the published version of the manuscript.

Funding

A part of this work was funded by the JSPS KAKENHI grant number 23H03440.

Institutional Review Board Statement

This study was conducted with the approval of the Saga University Ethics Committee (Approval No. R7-6, approved on 1 March 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. The raw data supporting the conclusions of this article will be made available by the author on request.

Acknowledgments

During the preparation of this work, the authors composed all words and phrases of the text. ChatGPT (https://chatgpt.com/) and Quillbot (https://quillbot.com/) were then used to improve grammatical accuracy. All content was subsequently reviewed and edited by the authors, who take full responsibility for the final publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VIP    Visually impaired person
YOLO   You Only Look Once
OCR    Optical Character Recognition
SD     Standard deviation

References

  1. Envision Technologies. Envision AI. Available online: https://www.letsenvision.com (accessed on 3 February 2025).
  2. Google. Lookout—Assisted Vision. Available online: https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.reveal (accessed on 3 February 2025).
  3. Rahman, F.A.; Handayani, A.N.; Takayanagi, M.; He, Y.; Fukuda, O.; Yamaguchi, N.; Okumura, H. Assistive Device for Visual Impaired Person Based on Real Time Object Detection. In Proceedings of the 2020 4th International Conference on Vocational Education and Training (ICOVET), Malang, Indonesia, 19–19 September 2020; pp. 190–194. [Google Scholar] [CrossRef]
  4. Rahman, F.A.; Yamaguchi, N.; Handayani, A.N.; Okumura, H.; Yeoh, W.L.; Fukuda, O. The development of a smartphone application based on object detection and indoor navigation to assist visually impaired. In Proceedings of the 2023 8th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 23–25 November 2023; Volume 8, pp. 111–116. [Google Scholar] [CrossRef]
  5. Endsley, M.R. Toward a Theory of Situation Awareness in Dynamic Systems. Hum. Factors 1995, 37, 32–64. [Google Scholar] [CrossRef]
  6. Plikynas, D.; Žvironas, A.; Budrionis, A.; Gudauskis, M. Indoor Navigation Systems for Visually Impaired Persons: Mapping the Features of Existing Technologies to User Needs. Sensors 2020, 20, 636. [Google Scholar] [CrossRef] [PubMed]
  7. Kandalan, R.N.; Namuduri, K. Techniques for Constructing Indoor Navigation Systems for the Visually Impaired: A Review. IEEE Trans. Hum.-Mach. Syst. 2020, 50, 492–506. [Google Scholar] [CrossRef]
  8. Khan, S.; Nazir, S.; Khan, H.U. Analysis of Navigation Assistants for Blind and Visually Impaired People: A Systematic Review. IEEE Access 2021, 9, 26712–26734. [Google Scholar] [CrossRef]
  9. Yu, X.; Saniie, J. Visual Impairment Spatial Awareness System for Indoor Navigation and Daily Activities. J. Imaging 2025, 11, 9. [Google Scholar] [CrossRef] [PubMed]
  10. Brilli, D.D.; Georgaras, E.; Tsilivaki, S.; Melanitis, N.; Nikita, K. AIris: An AI-Powered Wearable Assistive Device for the Visually Impaired. In Proceedings of the 2024 10th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob), Heidelberg, Germany, 1–4 September 2024; pp. 1236–1241. [Google Scholar] [CrossRef]
  11. Liu, H.; Liu, R.; Yang, K.; Zhang, J.; Peng, K.; Stiefelhagen, R. HIDA: Towards Holistic Indoor Understanding for the Visually Impaired via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor. arXiv 2021, arXiv:2107.03180. [Google Scholar]
  12. BN, R.; Guru, R.; A, A.M. Small Object Detection for Indoor Assistance to the Blind using YOLO NAS Small and Super Gradients. arXiv 2024, arXiv:2409.07469. [Google Scholar]
  13. Luan, X.; Zhang, J.; Xu, M.; Silamu, W.; Li, Y. Lightweight Scene Text Recognition Based on Transformer. Sensors 2023, 23, 4490. [Google Scholar] [CrossRef] [PubMed]
  14. Ng, S.C.; Kwok, C.P.; Chung, S.H.; Leung, Y.Y.; Pang, H.S.; Lam, C.Y.; Lau, K.C.; Tang, C.M. An Intelligent Mobile Application for Assisting Visually Impaired in Daily Consumption Based on Machine Learning with Assistive Technology. Int. J. Artif. Intell. Tools 2021, 30, 2140002. [Google Scholar] [CrossRef]
  15. Pratap, A.; Kumar, S.; Chakravarty, S. Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms. arXiv 2025, arXiv:2501.18444. [Google Scholar] [CrossRef]
  16. Pfitzer, N.; Zhou, Y.; Poggensee, M.; Kurtulus, D.; Dominguez-Dager, B.; Dusmanu, M.; Pollefeys, M.; Bauer, Z. MR.NAVI: Mixed-Reality Navigation Assistant for the Visually Impaired. arXiv 2025, arXiv:2506.05369. [Google Scholar]
  17. Said, Y.; Atri, M.; Albahar, M.A.; Ben Atitallah, A.; Alsariera, Y.A. Scene Recognition for Visually-Impaired People’s Navigation Assistance Based on Vision Transformer with Dual Multiscale Attention. Mathematics 2023, 11, 1127. [Google Scholar] [CrossRef]
  18. Norkhalid, A.M.; Faudzi, M.A.; Ghapar, A.A.; Rahim, F.A. Mobile Application: Mobile Assistance for Visually Impaired People—Speech Interface System (SIS). In Proceedings of the 2020 8th International Conference on Information Technology and Multimedia (ICIMU), Selangor, Malaysia, 24–26 August 2020; pp. 329–333. [Google Scholar] [CrossRef]
  19. Sargsyan, E.; Oriola, B.; Serrano, M.; Jouffrais, C. Audio-Vibratory You-Are-Here Mobile Maps for People with Visual Impairments. Proc. ACM Hum.-Comput. Interact. 2024, 8, 624–648. [Google Scholar] [CrossRef]
  20. Dwyer, B.; Nelson, J.; Hansen, T. Roboflow (Version 1.0) [Software]. 2024. Available online: https://roboflow.com (accessed on 26 July 2025).
  21. Google. Android Studio. Available online: https://developer.android.com/studio (accessed on 3 January 2025).
  22. Flutter Team. Flutter (Version 3.16.4). Cross-Platform Mobile Application Framework. Available online: https://flutter.dev (accessed on 3 January 2025).
  23. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, D.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org (accessed on 26 July 2025).
Figure 1. System overview illustrating two recognition modes: red triangles indicate distant positions, while blue triangles indicate near positions. Symbols include Japanese text reflecting the supermarket setting.
Figure 2. Recognition based on the human-in-the-loop system. Different box edge colors indicate various detected classes, while red colored boxes highlight objects located near the center point, which are prioritized.
Figure 3. System configuration showing dataset division (scene and object), training pipeline, display screen, and participant. Red boxes indicate the detected scenes and objects.
Figure 4. Experimental setup.
Figure 5. Task completion times for each smartphone application. Data points from ten participants are shown as box plots. t-test shows significant difference between object-based and scene-based applications. The colors represent: grey indicates the baseline study, yellow indicates the custom-built study, and green indicates the proposed study.
Figure 6. Trial example: from the top-left, the demonstration results of four assistive applications in a simulated environment: (a) Envision AI, (b) Lookout, (c) object-based method, and (d) scene-based method. The dots indicate the timing of participant movements. Blue dots represent moments when the system successfully recognized the target scene, yellow dots indicate successful object recognition, and white dots show when the participant successfully grabbed the intended object.
Figure 7. Objects retrieved (%) for each tested application. Each application was evaluated by ten participants, with each participant tasked to locate five target objects. Accuracy is calculated as the number of correctly retrieved objects divided by the total number of target objects, multiplied by 100%. The colors represent: grey indicates the baseline study, yellow indicates the custom-built study, and green indicates the proposed study.
Table 1. Comparison of tested applications: Envision AI (existing app), Lookout (existing app), object-based (custom baseline), and scene-based (proposed method).

Smartphone App | Envision AI | Lookout | Object-Based | Scene-Based
Search target | One object specified in advance | One object specified in advance | Any multiple objects | Any multiple objects and scenes
Feedback type | Detection sound (no feedback for undetected items) | Detection sound and object name (no feedback for undetected items) | Object names | Object and scene names
UI/UX | Many buttons for setting targets | Many buttons for setting targets | Ready to recognize multiple targets | Ready to recognize multiple targets
Table 2. Training parameters.

Parameter | Value
Image size | 640 × 640
Learning rate | 0.01
Batch size | 16
Epochs | 100
Table 3. Performance evaluation of the trained object and scene recognition model. Scene classes are marked with an asterisk (*).

Class | mAP@50 | mAP@50:95 | Precision | Recall
Beverage Shelf * | 0.991 | 0.896 | 0.939 | 0.965
Coffee | 0.981 | 0.854 | 0.922 | 0.947
Juice | 0.978 | 0.862 | 0.873 | 0.955
Tea | 0.991 | 0.867 | 0.917 | 0.947
Mineral Water | 0.989 | 0.879 | 0.959 | 0.950
Food Shelf * | 0.986 | 0.879 | 0.929 | 0.962
Pizza | 0.990 | 0.892 | 0.951 | 0.965
Donut | 0.990 | 0.887 | 0.933 | 0.967
Sweet | 0.995 | 0.898 | 0.946 | 0.973
Hot Dog | 0.990 | 0.863 | 0.951 | 0.960
Fruit Shelf * | 0.988 | 0.890 | 0.966 | 0.978
Banana | 0.995 | 0.897 | 0.975 | 0.986
Green Apple | 0.994 | 0.913 | 0.938 | 0.964
Red Apple | 0.988 | 0.880 | 0.933 | 0.985
Orange | 0.993 | 0.930 | 0.957 | 0.992
Vegetable Shelf * | 0.992 | 0.921 | 0.968 | 0.970
Carrot | 0.993 | 0.911 | 0.977 | 0.980
Eggplant | 0.983 | 0.897 | 0.958 | 0.977
Garlic | 0.990 | 0.920 | 0.967 | 0.969
Items Shelf * | 0.993 | 0.923 | 0.979 | 0.984
Flashlight | 0.993 | 0.921 | 0.975 | 0.969
Glove | 0.991 | 0.923 | 0.970 | 0.978
Hammer | 0.993 | 0.925 | 0.969 | 0.981
Measuring Tape | 0.990 | 0.926 | 0.974 | 0.991
Scissors | 0.991 | 0.921 | 0.980 | 0.982
Tape | 0.991 | 0.920 | 0.953 | 0.981
Table 4. Time consumption of all participants (min:s). The table shows the task completion times for ten participants across four tested applications: Envision AI, Lookout, object-based, and scene-based.

Subject | Envision AI | Lookout | Object-Based | Scene-Based
Subject 1 | 16:10 | 14:45 | 10:21 | 09:15
Subject 2 | 15:40 | 15:36 | 08:43 | 04:07
Subject 3 | 08:02 | 16:20 | 07:06 | 03:39
Subject 4 | 25:49 | 20:09 | 07:12 | 06:06
Subject 5 | 40:15 | 30:53 | 09:14 | 06:14
Subject 6 | 17:30 | 12:06 | 08:54 | 04:33
Subject 7 | 14:45 | 13:55 | 08:18 | 07:05
Subject 8 | 14:46 | 42:44 | 12:07 | 04:12
Subject 9 | 09:42 | 36:12 | 09:01 | 06:43
Subject 10 | 15:55 | 20:35 | 08:10 | 26:07

