1. Introduction
Visual impairment represents a significant global health challenge, affecting approximately 285 million people worldwide, with 39 million experiencing complete blindness [1]. These individuals face substantial barriers in navigating their environments, accessing textual information, and maintaining social connections [2]. Recent advances in artificial intelligence and edge computing have opened new possibilities for developing sophisticated assistive technologies that can operate independently of cloud infrastructure [3,4].
Traditional assistive technologies, including screen readers, guide canes, and magnification devices, provide basic functionality but lack the intelligent interpretation capabilities offered by modern AI systems [5]. Commercial solutions such as Microsoft Seeing AI and Google Lookout leverage cloud-based artificial intelligence to provide advanced object recognition and scene description [6]. While cloud platforms offer substantial computational resources and high recognition accuracy, they come with critical drawbacks: data privacy concerns, communication latency, constant internet dependency, and higher operational costs.
The privacy implications of cloud-based assistive systems are particularly concerning, as they often require transmission of sensitive personal data, including images of the user’s environment and biometric information [7]. Furthermore, the reliance on constant internet connectivity limits their applicability in rural areas, developing regions, and situations where network access is unreliable [8].
To overcome these limitations, this research proposes a fully offline assistive system built on a Raspberry Pi 5. The system integrates core AI functionalities—including object detection, optical character recognition (OCR), face recognition, and voice-command processing—executed entirely on-device. This architecture supports real-time interaction without requiring cloud access, thereby enhancing user privacy, reducing latency, and extending accessibility to remote or low-resource settings. The approach bridges the gap between affordability and autonomy in modern assistive technologies, addressing the critical need for privacy-preserving, accessible solutions [9].
The primary aim of this study is to design and implement a Python-based assistive platform that provides real-time visual interpretation and voice interaction capabilities for visually impaired users, all while operating independently of internet connectivity. The specific objectives of the system are as follows:
Develop a lightweight object detection module to identify and localize items in the user’s environment.
Integrate an OCR engine to extract printed text and convert it to speech, enabling access to textual content.
Implement a face recognition system that can identify pre-registered individuals and communicate their identity to the user.
Design a voice-command interface that enables hands-free control over the system’s functionalities.
Ensure optimized performance on a Raspberry Pi 5, maintaining responsiveness without relying on external computing resources.
Demonstrate a privacy-first approach with comprehensive local data processing and zero external data transmission.
This work contributes to the field of assistive technology by introducing an affordable, portable, and privacy-conscious solution tailored for visually impaired users. The key contributions include (1) a novel, comprehensive offline multimodal assistive system integrating object detection, OCR, face recognition, and voice control on a single edge device; (2) a systematic evaluation of open-source AI models for embedded assistive applications; (3) optimization strategies for achieving real-time performance on resource-constrained hardware; and (4) a privacy-preserving architecture that eliminates dependence on cloud services [10]. Unlike commercial systems that require internet-based AI services, the proposed system functions autonomously, making it particularly useful in rural or underdeveloped regions with limited connectivity.
The scope of this project includes the development and evaluation of a real-time, offline assistive system that provides voice-guided support based on visual inputs. The following core functionalities are implemented:
Object Detection: Identifies objects in the camera’s field of view and delivers positional feedback to the user via audio.
Optical Character Recognition (OCR): Reads printed or displayed text from scenes and documents, converting it into spoken words.
Face Recognition: Detects and identifies individuals from a stored database of known faces, announcing their names when recognized.
Voice Command Interface: Empowers the user to control the system’s operation, toggle features, and switch modes through spoken commands.
Privacy-First Architecture: Ensures all processing occurs locally with zero data transmission to external servers.
While the system demonstrates effective performance in controlled indoor settings, current limitations include processing constraints of the Raspberry Pi 5, particularly during simultaneous execution of multiple AI models. These constraints may impact real-time performance under high computational loads. Additionally, performance degrades in challenging environmental conditions such as poor lighting or high background noise [11].
The rest of the paper is organized as follows:
Section 2 reviews recent advancements and limitations in assistive technologies, providing comprehensive comparison tables and research gap analysis.
Section 3 describes the system architecture, hardware components, software stack, and operational workflow, including detailed model selection justification, threading architecture, and privacy-first implementation.
Section 4 presents comprehensive testing methodology, evaluation metrics, dynamic scene analysis, and comparative performance analysis with existing systems.
Section 5 concludes with key contributions, current limitations, and future research directions.
3. System Design and Methodology
3.1. System Overview
The conceptual framework of our offline assistive system is illustrated in Figure 1.
The proposed system is a fully offline, Python-based assistive platform designed to empower visually impaired individuals by facilitating interactive engagement with their surroundings. It integrates real-time object detection, optical character recognition (OCR), face recognition, and voice-based control—all implemented locally on a Raspberry Pi 5 without any dependence on cloud infrastructure.
The architecture adopts a modular structure, enabling each core function to operate independently or in combination depending on the selected mode. Voice commands serve as the primary user interface, while auditory feedback ensures seamless and intuitive interaction. The overall workflow is optimized for minimal latency and high usability, allowing the user to switch modes dynamically without manual intervention—ideal for hands-free operation in real-world settings.
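As an illustration of this modular, voice-driven structure, the minimal sketch below maps recognized command phrases to independent mode handlers; the phrases and handler names are hypothetical stand-ins for the actual module entry points.

```python
# Illustrative sketch of a voice-driven mode dispatcher (handler names are hypothetical).
from typing import Callable, Dict

def start_object_detection() -> None:
    print("Object detection mode activated")   # placeholder for the detection loop

def start_text_reading() -> None:
    print("OCR mode activated")                # placeholder for the OCR loop

def register_new_face() -> None:
    print("Face registration started")         # placeholder for the enrollment flow

# Spoken phrases map to independent modules, so modes can be added or removed
# without touching the rest of the pipeline.
COMMAND_HANDLERS: Dict[str, Callable[[], None]] = {
    "activate": start_object_detection,
    "read": start_text_reading,
    "register": register_new_face,
}

def dispatch(command_text: str) -> None:
    handler = COMMAND_HANDLERS.get(command_text.strip().lower())
    if handler is not None:
        handler()
    else:
        print("Unrecognized command")          # would be spoken back to the user
```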
3.2. Hardware Components
The physical implementation of the system is based on low-cost, energy-efficient hardware components that are readily available and easy to integrate.
Table 4 summarizes the main hardware modules used in building the prototype.
3.3. Software Stack and Main Libraries Used
The entire system is developed using Python 3.11.2, which offers high flexibility and support for numerous open-source libraries. The software stack incorporates a range of tools tailored for computer vision, speech processing, multithreading, and hardware interfacing.
Table 5 provides a detailed overview of the main libraries and tools used across various system functionalities.
To achieve real-time performance on a resource-constrained platform like the Raspberry Pi 5, the system integrates lightweight, open-source AI models selected through an iterative design and testing process. The model selection process involved comprehensive benchmarking of various architectures under different computational constraints, as detailed in Section 3.4.
While Tesseract OCR was initially considered for broader visual interpretation, its capabilities are limited to text recognition and do not extend to general object detection. Moreover, its integration with hardware accelerators such as the Hailo AI module in the Raspberry Pi AI Kit is not natively supported and requires intermediate translation layers or model conversion steps, which introduce additional complexity. In contrast, YOLOv8 provides a robust and flexible object detection framework with direct compatibility for hardware acceleration pipelines.
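To make this division of labor concrete, the minimal sketch below runs YOLOv8n for object detection and Tesseract (via pytesseract) for text extraction on the same frame; the model file name and confidence threshold are illustrative assumptions rather than the deployed configuration.

```python
# Minimal sketch: YOLOv8n handles object detection while Tesseract (via pytesseract)
# handles text extraction on the same BGR camera frame.
import cv2
import pytesseract
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # pre-trained nano model (assumed file name)

def describe_frame(frame):
    """Return detected object labels and any recognized text for one frame."""
    results = model(frame, conf=0.5, verbose=False)[0]
    labels = [model.names[int(box.cls[0])] for box in results.boxes]

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray).strip()
    return labels, text

# Usage: labels, text = describe_frame(camera_frame)
```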
3.4. Model Selection and Justification
The selection of AI models for the offline assistive system required careful consideration of the trade-offs between accuracy, computational efficiency, and real-time performance constraints. This section provides detailed justification for each model choice based on systematic evaluation and benchmarking.
3.4.1. Object Detection Model Selection
The comparison of object detection models on Raspberry Pi 5, as shown in Table 6, demonstrates the following:
From an embedded vision perspective, the choice between YOLOv8’s anchor-free, CSP-based architecture and MobileNet’s depthwise separable convolutions requires careful analysis. While MobileNet architectures traditionally excel in mobile deployment due to their lightweight design, YOLOv8’s architectural innovations provide several advantages on the Raspberry Pi 5 platform:
Memory Access Patterns: YOLOv8’s unified architecture reduces memory fragmentation compared to MobileNet’s sequential depthwise and pointwise convolutions.
Cache Efficiency: The Raspberry Pi 5’s ARM Cortex-A76 architecture benefits from YOLOv8’s optimized tensor operations and reduced memory bandwidth requirements.
Computational Complexity: While MobileNet reduces FLOPs through separable convolutions, YOLOv8’s anchor-free design eliminates post-processing overhead, resulting in better overall performance on ARM architectures.
Empirical testing confirmed that YOLOv8n achieves superior end-to-end performance (800 ms vs. 950 ms for MobileNet-SSD) despite slightly higher theoretical computational requirements.
3.4.2. Face Recognition Architecture
The face recognition encoding optimization analysis, presented in Table 7, shows the following:
The optimal threshold for face encoding storage was determined through systematic analysis balancing recognition accuracy, storage cost, and matching efficiency. Mathematical analysis shows that recognition accuracy follows a saturating improvement curve with diminishing returns,

$$A(n) = A_{\max}\left(1 - e^{-\lambda n}\right),$$

where $n$ is the number of encodings, $A_{\max}$ is the theoretical maximum accuracy, and $\lambda$ is the learning rate parameter. The optimal threshold of five encodings represents the point where marginal accuracy gains no longer justify the linear increase in storage and computational overhead.
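As a quick illustration of this trade-off, the sketch below evaluates the saturation curve and its marginal gains for increasing encoding counts; the values of $A_{\max}$ and $\lambda$ are illustrative assumptions, not fitted parameters.

```python
# Sketch of the diminishing-returns analysis behind the five-encoding threshold.
import math

A_MAX = 0.95   # assumed asymptotic recognition accuracy
LAM = 0.6      # assumed learning-rate parameter

def accuracy(n: int) -> float:
    """Saturating accuracy curve A(n) = A_max * (1 - exp(-lam * n))."""
    return A_MAX * (1.0 - math.exp(-LAM * n))

for n in range(1, 9):
    marginal_gain = accuracy(n) - accuracy(n - 1)
    print(f"{n} encodings: accuracy={accuracy(n):.3f}, marginal gain={marginal_gain:.3f}")

# Storage and matching cost grow linearly with n, so the threshold is chosen where
# the marginal gain no longer outweighs the cost of one extra encoding.
```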
3.5. System Threading Architecture
The system threading architecture and priority management are illustrated in Figure 2.
The system employs a sophisticated multi-threaded architecture with priority-based task scheduling to ensure responsive interaction while managing computational constraints. The threading system is designed based on the real-time interaction needs of visually impaired users, where immediate response to voice commands is critical for system usability.
Priority Allocation Rationale:
Priority 1—Voice Commands: Highest priority ensures immediate system responsiveness to user instructions, critical for hands-free operation.
Priority 2—Face Recognition: High priority for social interaction support, enabling timely identification of approaching individuals.
Priority 3—Object Detection: Moderate priority for environmental awareness, providing continuous but non-critical spatial information.
Priority 4—OCR Processing: Lower priority for text reading tasks, which can tolerate slight delays without affecting user experience.
Queue Management: The system implements a priority queue with 50-command capacity. When the queue reaches capacity, the oldest low-priority commands are discarded to prevent system overload while preserving critical user interactions.
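A minimal sketch of this bounded priority-queue policy is given below. The eviction rule (drop the oldest entry among those with the lowest priority) and the single-threaded form are simplifications of the actual thread-safe implementation.

```python
# Sketch of the bounded priority queue described above.
# Priorities follow the allocation: 1 = voice command (highest) ... 4 = OCR (lowest).
import heapq
import itertools

class BoundedPriorityQueue:
    """Single-threaded sketch; the real system would guard this with a lock."""

    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self._heap = []                    # entries: (priority, sequence, task)
        self._counter = itertools.count()  # insertion order, used as a tie-breaker

    def put(self, priority: int, task: str) -> None:
        if len(self._heap) >= self.capacity:
            # Evict the oldest entry among those with the worst (largest) priority value.
            victim = max(self._heap, key=lambda entry: (entry[0], -entry[1]))
            self._heap.remove(victim)
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def get(self):
        priority, _, task = heapq.heappop(self._heap)
        return priority, task

commands = BoundedPriorityQueue(capacity=50)
commands.put(3, "announce objects ahead")
commands.put(1, "stop")
print(commands.get())   # -> (1, 'stop'): voice commands are served first
```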
3.6. Data Processing and Optimization
Given the limited computational resources of the Raspberry Pi 5, a series of optimization techniques were carefully implemented to ensure that the assistive system delivers responsive, accurate, and real-time performance. These strategies encompass intelligent workload distribution, algorithmic simplification, and efficient resource utilization across various subsystems, including speech recognition, computer vision, and audio synthesis.
3.6.1. Managing Processing Load
To prevent system slowdowns and maintain smooth operation under multitasking conditions, several mechanisms were employed to manage computational load:
Multithreading: The system leverages concurrent threads to enable the parallel execution of critical functions such as voice command processing, image acquisition, and AI inference. This allows for responsive interaction and seamless switching between modes without significant delays.
Queue-Based Command Handling: A first-in-first-out (FIFO) queueing mechanism ensures that user commands are processed in the order they are received. This structured handling avoids command overlaps and potential system bottlenecks, particularly under high-demand scenarios.
Optimized Model Execution: AI models used for object detection (YOLOv8) and face recognition are configured to run at lower input resolutions. This significantly reduces the computational burden while preserving acceptable levels of detection accuracy and robustness in practical use cases.
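The sketch below illustrates this resolution-reduction strategy for the two heaviest models; the 320-pixel inference size and the 0.25 downscaling factor for face detection are illustrative assumptions rather than the exact deployed values.

```python
# Sketch of reduced-resolution inference: downscale frames before the heavy models
# and map face locations back to full-resolution coordinates.
import cv2
import face_recognition
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

def detect_objects(frame):
    # Ultralytics resizes internally; imgsz controls the inference resolution.
    return model(frame, imgsz=320, verbose=False)[0]

def locate_faces(frame):
    small = cv2.resize(frame, (0, 0), fx=0.25, fy=0.25)
    rgb_small = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)
    locations = face_recognition.face_locations(rgb_small)   # HOG detector by default
    # Scale bounding boxes back up to the original frame size.
    return [(top * 4, right * 4, bottom * 4, left * 4)
            for (top, right, bottom, left) in locations]
```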
3.6.2. Improving Speech Recognition Accuracy
Voice interaction is a central feature of the system, requiring precise recognition even in less-than-ideal acoustic environments. To this end, audio input is pre-processed using the following enhancements:
Noise Reduction: The integration of the SpeexDSP library allows for real-time suppression of background noise, which is critical for achieving clarity in user speech input, especially in dynamic or noisy settings.
Audio Pre-Processing: At system initialization, a sample of ambient noise is recorded to serve as a reference. This allows the system to better differentiate between user commands and background sounds, improving speech-to-text conversion accuracy during runtime.
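The following sketch illustrates the calibration idea in its simplest form, deriving an energy threshold from the start-up ambient sample and gating later audio chunks against it. The margin factor is an assumption, and the deployed system applies SpeexDSP suppression rather than this simple energy gate.

```python
# Sketch of ambient-noise calibration: record a short silence sample at start-up,
# derive an energy threshold, and skip chunks that fall below it.
import numpy as np

def rms_energy(chunk: np.ndarray) -> float:
    """Root-mean-square energy of a 16-bit PCM audio chunk."""
    samples = chunk.astype(np.float64)
    return float(np.sqrt(np.mean(samples ** 2)))

def calibrate_noise_floor(ambient_chunks, margin: float = 1.5) -> float:
    """Average energy of the start-up ambient recording, scaled by a safety margin."""
    return margin * float(np.mean([rms_energy(c) for c in ambient_chunks]))

def is_speech(chunk: np.ndarray, noise_floor: float) -> bool:
    # Chunks below the calibrated floor are treated as background noise and are
    # never passed to the speech recognizer.
    return rms_energy(chunk) > noise_floor
```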
3.7. Privacy-First Architecture and Implementation
The system implements a comprehensive privacy-first approach that ensures complete data sovereignty and eliminates external dependencies. This architecture addresses growing concerns about biometric data privacy and personal information security in assistive technologies.
3.7.1. Privacy Implementation Details
The privacy-first implementation metrics are detailed in Table 8.
3.7.2. Data Flow Security Analysis
The privacy-first data flow architecture is shown in Figure 3.
The privacy-first architecture ensures that all sensitive data remains within the user’s control:
Zero External Communication: The system is designed with no network interfaces active during operation, preventing any accidental data transmission.
Ephemeral Processing: Camera frames and audio samples are processed in memory and immediately discarded, leaving no persistent traces.
Encrypted Local Storage: Face recognition encodings are stored using AES-256 encryption with user-controlled keys.
Audit Trail: Complete system operation logging enables users to verify privacy compliance.
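A minimal sketch of the encrypted-storage step is shown below, using AES-256-GCM from the cryptography package with a user-controlled 256-bit key; the file name and the use of pickle for serialization are illustrative assumptions.

```python
# Sketch of AES-256 encrypted local storage for face recognition encodings.
import os
import pickle
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def save_encodings(encodings: dict, key: bytes, path: str = "faces.enc") -> None:
    aesgcm = AESGCM(key)          # key must be 32 bytes for AES-256
    nonce = os.urandom(12)        # fresh nonce for every write
    ciphertext = aesgcm.encrypt(nonce, pickle.dumps(encodings), None)
    with open(path, "wb") as f:
        f.write(nonce + ciphertext)

def load_encodings(key: bytes, path: str = "faces.enc") -> dict:
    with open(path, "rb") as f:
        blob = f.read()
    nonce, ciphertext = blob[:12], blob[12:]
    return pickle.loads(AESGCM(key).decrypt(nonce, ciphertext, None))

# The key remains under the user's control (e.g., derived from a local passphrase).
key = AESGCM.generate_key(bit_length=256)
save_encodings({"alice": [0.1, 0.2, 0.3]}, key)
print(load_encodings(key))
```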
3.8. System Workflow and Operational Methodology
The offline assistive system is designed to facilitate seamless interaction between visually impaired users and their surroundings by leveraging voice commands and AI-based perception. The system executes a well-structured sequence of operations to deliver real-time feedback and support. A graphical representation of the overall workflow is provided in Figure 4, and the detailed steps are explained below.
Explanation of System Workflow
The assistive system operates in a structured sequence of steps to enable visually impaired users to interact with their environment through voice commands.
Step 1: System Initialization. Upon startup, the system initializes essential hardware components and software models:
- (i)
Activating the Raspberry Pi 5’s camera module and microphone for continuous multimedia input.
- (ii)
Loading pre-trained AI models, including YOLOv8 for object detection and Tesseract OCR for text extraction.
- (iii)
Running SpeexDSP to capture baseline ambient noise for dynamic noise filtering.
Step 2: Voice Command Monitoring and Mode Activation. The system continuously monitors audio input via the microphone, using the VOSK speech-to-text engine to process incoming speech and interpret predefined voice commands such as “Activate”, “Register”, or “Exit”.
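A minimal sketch of this monitoring loop is shown below, using the VOSK Python API with PyAudio capture; the model directory, audio parameters, and command set are illustrative assumptions.

```python
# Sketch of the offline command-monitoring loop with VOSK.
import json
import pyaudio
from vosk import Model, KaldiRecognizer

COMMANDS = {"activate", "register", "exit"}

model = Model("vosk-model-small-en-us-0.15")   # small offline English model (assumed path)
recognizer = KaldiRecognizer(model, 16000)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=4000)

while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        text = json.loads(recognizer.Result()).get("text", "")
        if text in COMMANDS:
            print(f"Command recognized: {text}")   # would be routed to the mode dispatcher
            if text == "exit":
                break

stream.stop_stream()
stream.close()
pa.terminate()
```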
Step 3: Combined Detection Mode. In combined mode, the system executes multiple detection tasks in parallel, including object detection with YOLOv8, optical character recognition for text extraction, and face recognition against stored encodings.
Step 4: Face Registration Workflow. The face registration feature enables users to enroll new individuals by prompting for a name via voice interaction and capturing face images for encoding storage.
Step 5: Real-Time Audio Feedback. All results are communicated through Piper text-to-speech synthesis, with pyttsx3 as a lightweight fallback option when system resources are constrained.
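The sketch below illustrates this fallback behavior; the Piper command-line invocation is an assumption about a local installation, while the pyttsx3 calls follow its documented API.

```python
# Sketch of the audio-feedback path: try Piper first, fall back to pyttsx3 when
# Piper is unavailable or fails.
import shutil
import subprocess
import pyttsx3

def speak(text: str) -> None:
    if shutil.which("piper"):
        try:
            # Assumed invocation: pipe text into the piper binary, then play the result.
            subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                            "--output_file", "feedback.wav"],
                           input=text.encode(), check=True, timeout=10)
            subprocess.run(["aplay", "feedback.wav"], check=True)
            return
        except (subprocess.SubprocessError, OSError):
            pass                                   # fall through to the lightweight engine
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak("Person detected ahead")   # example usage
```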
4. Testing and Evaluation
This section presents a comprehensive evaluation of the proposed offline assistive system in terms of its accuracy, responsiveness, and overall usability. The system was tested under varying conditions, and its performance was benchmarked against key metrics. Additionally, a comparative analysis was conducted with cloud-based AI solutions to highlight the strengths and limitations of an offline deployment.
4.1. Testing Conditions and Methodology
To ensure a realistic and rigorous evaluation, the system was subjected to various operational scenarios replicating practical usage by visually impaired individuals:
The system was tested indoors across different lighting environments, including well-lit and dim settings, to assess the robustness of vision-based tasks.
Voice command performance was measured in both quiet and noisy conditions to simulate real-world acoustic variability.
System performance was analyzed under varying computational loads—ranging from the execution of a single AI process to concurrent execution of multiple tasks (e.g., object detection, face recognition, and OCR simultaneously).
4.2. Dynamic Scene Evaluation
To address the limitation of static scene testing, comprehensive dynamic scene evaluation was conducted to assess system performance under realistic movement conditions.
Dynamic Testing Methodology
Dynamic testing involved recording real-time videos of users walking at different speeds while the system performed object detection, OCR, and face recognition tasks. The testing protocol included the following:
- (i)
Speed Variations: Testing at 0.5 m/s (slow walking) and 1.0 m/s (normal walking) to simulate typical user movement patterns.
- (ii)
Motion Blur Analysis: Evaluating the impact of camera shake and object motion on detection accuracy (a measurement sketch follows this list).
- (iii)
Tracking Performance: Assessing the system’s ability to maintain object identification across consecutive frames.
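One simple way to support the motion-blur analysis in item (ii) is a per-frame sharpness score based on the variance of the Laplacian, sketched below; the threshold value is an illustrative assumption rather than a value used in the protocol.

```python
# Sketch of a per-frame sharpness check: the variance of the Laplacian drops as
# blur increases, so low values flag frames affected by camera or object motion.
import cv2

BLUR_THRESHOLD = 100.0   # assumed cut-off; would be tuned on the recorded test videos

def is_blurred(frame) -> bool:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness < BLUR_THRESHOLD

# Frames flagged as blurred can be excluded or down-weighted when computing
# detection and OCR accuracy at each walking speed.
```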
The static vs. dynamic performance comparison is presented in Table 9.
The dynamic testing revealed that while performance degrades with movement speed, the system maintains acceptable functionality for typical user scenarios. Motion blur primarily affects OCR accuracy, while object detection shows greater robustness to movement.
4.3. Performance Metrics and Statistical Analysis
Performance assessment focused on key metrics including detection accuracy, recognition rates, and response time across each major functionality. Statistical significance testing was conducted using paired t-tests with a p < 0.05 threshold. Table 10 and Table 11 summarize the system’s quantitative evaluation results.
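For reference, a minimal sketch of the paired-test computation is shown below; the arrays are random placeholders standing in for the measured per-trial values and are not the experimental results.

```python
# Sketch of the paired t-test used for significance analysis (threshold p < 0.05).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
baseline = rng.normal(loc=1.2, scale=0.1, size=30)   # placeholder: per-trial response times, condition A
proposed = rng.normal(loc=0.9, scale=0.1, size=30)   # placeholder: per-trial response times, condition B

t_stat, p_value = ttest_rel(baseline, proposed)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 0.05 level")
```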
4.4. Training and Validation Analysis
The training and validation curves for YOLOv8 fine-tuning on our assistive dataset are shown in Figure 5.
The training process involved fine-tuning YOLOv8n on a custom dataset of 2500 images relevant to assistive scenarios, including indoor objects, text documents, and human faces. The convergence analysis shows stable training with minimal overfitting, validating the model’s suitability for the target application.
4.5. Confusion Matrix Analysis
The confusion matrix for object detection performance is presented in Figure 6.
The confusion matrix analysis reveals strong diagonal dominance, indicating good class separation with minimal cross-class confusion. The primary confusion occurs between structurally similar objects (chair/table), which is expected given the resolution constraints of the embedded system.
4.6. Comparison with Previous Work
Statistical significance testing (paired t-test, p < 0.05) confirms that our system achieves significantly better performance compared to previous embedded assistive systems, particularly in terms of integrated functionality and response time.
The performance comparison with previous work is presented in Table 12.
4.7. High-Load Performance Analysis
To evaluate system performance under demanding conditions, high-load scenarios were simulated where object detection, OCR, and face recognition were triggered simultaneously. The analysis compared response delays with and without priority queue scheduling, as shown in Table 13.
The priority queue scheduling demonstrates significant improvements, particularly for time-critical voice command processing, validating the threading architecture design.
4.8. Comparison with Cloud-Based Systems
To further contextualize system performance, a qualitative comparison was conducted between the proposed offline solution and standard cloud-based AI systems. This analysis considered aspects such as computational efficiency, latency, user privacy, and deployment flexibility.
The performance comparison between cloud-based and offline systems is shown in Table 14.
The comparative analysis reveals distinct advantages for each approach. The proposed offline system demonstrates complete privacy preservation (10/10) by processing all data locally without external transmission, ensuring full offline capability (10/10) that maintains functionality regardless of internet connectivity. Additionally, the system offers low-cost implementation (9/10) through efficient use of readily available hardware components and easy deployment (9/10) with minimal technical expertise required for setup and maintenance.
The comparative evaluation reveals several critical considerations that inform the selection between cloud-based and offline assistive technologies:
- (i)
Cloud-Based Systems: Benefit from substantial computational resources, enabling the use of larger and more complex AI models, which enhances accuracy in tasks such as face and object recognition. However, they are inherently dependent on stable internet connectivity, introducing latency and posing privacy concerns when transmitting user data to remote servers.
- (ii)
Offline Raspberry Pi System: Prioritizes low-latency, real-time interaction and enhanced user privacy by processing all data locally. While it is limited by hardware constraints, it remains operational without internet access—an essential feature for deployment in rural, low-resource, or privacy-sensitive environments.
- (iii)
Voice Command Limitations: Offline voice recognition is comparatively less accurate than cloud-based solutions, particularly in acoustically challenging environments. This is due to the limited size and scope of the onboard language models available for offline use.
- (iv)
Deployment Flexibility: The offline solution excels in scenarios where infrastructure is lacking or internet reliability is low, offering a viable, cost-effective alternative to cloud-based assistive technologies.
4.9. Implementation Challenges and Solutions
The implementation of a fully offline assistive system on the Raspberry Pi 5 introduced several hardware and software-related challenges. These challenges stem from the need to balance computational demands of deep learning models with real-time performance requirements, all while maintaining usability and robustness in practical environments.
Hardware Limitations and Solutions
Processing Constraints: The limited computational capacity of the Raspberry Pi 5 makes it difficult to simultaneously execute resource-intensive AI models.
Solutions Implemented:
Reduced input resolution for computationally intensive models;
Employed multithreading to manage independent tasks concurrently;
Introduced queue-based command handling with priority management;
Optimized model architectures for ARM processors.
Camera and Audio Limitations: Standard Raspberry Pi peripherals showed reduced performance under challenging conditions.
Solutions Implemented:
Applied image preprocessing techniques including brightness enhancement;
Integrated SpeexDSP noise suppression library;
Captured baseline noise profiles for adaptive filtering;
Implemented automatic gain control for audio input.
The implementation challenges and solutions summary is presented in Table 15.
5. Conclusions
5.1. Key Contributions and Findings
This study presents the development of an offline Python-based assistive system designed to enhance the autonomy and accessibility of visually impaired individuals. By integrating object detection, optical character recognition, face recognition, and voice-command capabilities into a compact and affordable Raspberry Pi 5 platform, the system offers a comprehensive, privacy-focused alternative to cloud-dependent assistive technologies.
The key contributions of this research include the following:
Integrated Multimodal System: First comprehensive offline system combining object detection, OCR, face recognition, and voice control on a single edge device with sub-second response times.
Privacy-First Architecture: Complete elimination of cloud dependencies with 100% local data processing, addressing critical privacy concerns in assistive technology.
Systematic Optimization: Novel approach to concurrent AI model execution on resource-constrained hardware through priority-based threading and queue management.
Real-World Validation: Comprehensive evaluation including dynamic scene testing and statistical significance analysis, demonstrating practical viability.
Open-Source Implementation: Fully reproducible system using exclusively open-source tools, promoting accessibility and further research.
Through the use of open-source libraries and careful optimization strategies—including multithreading, queue-based task management, and resolution adjustments—the system achieves functional real-time performance within the constraints of limited hardware resources. Evaluation results demonstrate promising accuracy and usability across all core functionalities, particularly in controlled indoor environments. Notably, the system maintains high levels of data privacy and responsiveness without the need for internet connectivity, making it especially suitable for deployment in low-resource or remote settings.
Dynamic scene testing revealed that while performance degrades with user movement (15–18% accuracy reduction at normal walking speed), the system maintains acceptable functionality for typical use scenarios. The priority-based threading architecture demonstrated significant improvements in system responsiveness, with 71% faster voice command processing under high-load conditions.
5.2. Limitations and Future Work
Despite its strengths, the current implementation faces several limitations that represent opportunities for future enhancement:
Current Limitations:
Hardware Constraints: Processing limitations of the Raspberry Pi 5 affect performance during simultaneous execution of multiple AI models.
Environmental Sensitivity: Reduced accuracy in challenging lighting conditions and noisy environments.
Language Support: Currently limited to English voice commands and text recognition.
Dynamic Performance: Accuracy degradation in moving scenarios due to motion blur and tracking limitations.
User Study Limitations: Evaluation primarily conducted in controlled settings with limited real-world user testing.
Future Research Directions:
Hardware Acceleration: Integration of AI accelerators (Coral TPU, Raspberry Pi AI Kit) to improve inference speed and enable more complex models.
Advanced AI Techniques: Implementation of attention mechanisms and transformer-based models optimized for edge deployment.
Multimodal Enhancement: Integration of additional sensors (LiDAR, ultrasonic) for improved spatial awareness and navigation assistance.
Adaptive Learning: Development of personalized models that adapt to individual user preferences and environmental conditions.
Comprehensive User Studies: Large-scale evaluation with visually impaired participants in real-world scenarios.
Multilingual Support: Extension to multiple languages and cultural contexts for broader accessibility.
Addressing these limitations through hardware acceleration, advanced noise reduction algorithms, and multilingual support represents a vital direction for future work. The integration of the Raspberry Pi AI Kit and Coral TPU accelerators could potentially achieve 3–5× performance improvements based on preliminary testing, enabling more sophisticated AI models and better real-time performance.
In conclusion, this project contributes meaningfully to the field of assistive technology by demonstrating that reliable and user-friendly support for the visually impaired can be achieved using cost-effective, offline, and open-source solutions. The system represents a significant step toward democratizing assistive technology through privacy-preserving, affordable solutions that can operate independently of cloud infrastructure. Continued development and user-centered refinement hold the potential to further expand its impact and adoption in real-world settings.
Future work will focus on conducting comprehensive user studies with visually impaired participants to validate the system’s real-world effectiveness and gather feedback for user-centered improvements. Additionally, exploration of federated learning approaches could enable model improvements while maintaining privacy principles, and integration with existing assistive devices could provide a more comprehensive support ecosystem.