Vocal-Eyes: AI-Powered Smart Glasses for the Blind Using Transformer-Based Architecture and Scene Graph Generation

Shabbir, Amna; Afsheen, Uzma; Shirazi, Muhammad Faizan; Rauf, Abdul; Abbas, Syed Muhammad Meesam; Saeed, Shahid; Khan, Abdul Samad; Rizvi, Safdar; Saaludin, Nurashikin

doi:10.3390/technologies14070384

Open Access Article

by Amna Shabbir ¹,*, Uzma Afsheen ², Muhammad Faizan Shirazi ¹, Abdul Rauf ¹, Syed Muhammad Meesam Abbas ¹, Shahid Saeed ¹, Abdul Samad Khan ¹, Safdar Rizvi ³,⁴,* and Nurashikin Saaludin ⁴

¹ Department of Electronic Engineering, NED University of Engineering and Technology, Karachi 75270, Pakistan

² Department of Telecommunication Engineering, NED University of Engineering and Technology, Karachi 75270, Pakistan

³ Department of Computer Science, Bahria University, Karachi Campus, Karachi 75260, Pakistan

⁴ Malaysian Institute of Information Technology (UniKL MIIT), Universiti Kuala Lumpur, 1016, Jalan Sultan Ismail, Kuala Lumpur 50250, Malaysia

* Authors to whom correspondence should be addressed.

Technologies 2026, 14(7), 384; https://doi.org/10.3390/technologies14070384 (registering DOI)

Submission received: 11 May 2026 / Revised: 17 June 2026 / Accepted: 19 June 2026 / Published: 24 June 2026

(This article belongs to the Special Issue Wearable Vital Signs and Activities Detection: Systems and Applications)


Abstract

Visually impaired individuals face significant challenges in autonomous mobility and situational awareness. Most existing assistive technologies address isolated tasks, such as object recognition or text reading, while failing to capture broader environmental context. This work addresses this limitation by proposing a scene-sensitive, low-cost assistive system that delivers holistic situational information. We present Vocal-Eyes, an intelligent smart glasses platform that provides periodic audio descriptions of the surrounding environment. The system employs a cloud-based neural processing pipeline in which visual features are extracted using a Transformer-based architecture. Relational context is modeled through scene graph generation, and scene graphs are translated into natural language via a graph-to-text module. A lightweight hardware prototype captures visual data locally, while computationally intensive processing is offloaded to the cloud to reduce power consumption. The experimental results show that relational, scene-based narration produces more coherent and informative descriptions than object-centric approaches while maintaining acceptable periodic latency. Cost analysis further indicates that Vocal-Eyes is significantly more affordable than comparable commercial smart glasses solutions. These results demonstrate that Transformer-based scene understanding with cloud-assisted processing is an effective and practical approach for developing accessible, context-aware assistive technologies for visually impaired users.

Keywords:

assistive technology; visual impairment; scene graph generation; transformer-based architecture; smart glasses; audio narration

1. Introduction

Visual impairment significantly limits a person's ability to move around, perceive their surroundings, and interact independently with the physical world. Demand for assistive technologies is growing as the number of individuals with partial or complete visual impairment continues to rise. Traditional mobility aids, e.g., white canes and guide dogs, remain important, but they provide only partial information about a complex and dynamic environment, leaving the user with incomplete situational awareness. Modern digital assistive technologies, including smartphone apps and smart glasses, have tried to alleviate these limitations through object recognition [1], text extraction, and basic scene description. However, existing systems represent scenes in a fragmented way: objects are recognized individually, and the spatial or semantic relations between them are not encoded. For visually impaired people, contextual understanding, such as knowing that one object rests on another or that an object is next to a roadway, is often far more useful than a flat list of detected objects. This lack of holistic scene perception is a basic deficiency of current solutions [2]. Moreover, more complex scene-understanding systems are rarely integrated into wearable devices that narrate periodically, owing to concerns about computational requirements, latency [3], hardware cost, and the design trade-offs between on-board and cloud-based processing. The primary contributions of this work are the following:

Design of a low-cost, scene-aware smart glasses assistive system that emphasizes contextual and relational interpretation rather than individual object recognition. This contribution introduces a cost-effective wearable platform that foregrounds scene-level understanding using an integrated multimodal inference engine [4], shifting the focus from single-object recognition to a holistic spatial and semantic view.
Integration of structured scene interpretation with natural language audio description to give visually impaired users a meaningful and coherent account of their environment. We describe a pipeline that combines explicit scene graphs with generative language models [5] to produce contextually rich audio narration tailored to the needs of visually impaired users.
A cloud-assisted system architecture that achieves periodic performance while reducing wearable hardware complexity and cost. This design offloads computationally heavy inference to a remote server, conserving battery life and reducing the physical size of the smart glasses hardware without affecting latency [6].
An empirical and cost-based comparison showing better descriptive quality and economic viability than current commercial assistive smart glasses solutions. We present quantitative measurements from user research and a cost study that demonstrate better descriptive performance and a lower overall cost of ownership than existing market products.

2. System Model

This section presents the equipment, software modules, models, datasets, and experimental procedures used in the implementation and evaluation of Vocal-Eyes. The description is intended to be detailed enough for other researchers to reproduce or extend the work. The models, datasets, and software frameworks used are publicly available unless otherwise specified.

Our system is based on a cloud-assisted pipeline architecture that processes raw visual inputs into hierarchical semantic forms and eventually translates them into natural language descriptions with low end-to-end latency [3]. The system includes three main steps.

The three steps are image acquisition and transmission using a wearable embedded camera, scene understanding and relation extraction using a scene graph generation (SGG) model [2], and natural language generation (NLG) using a Small Language Model (SLM) [5]. The data flow between these steps is shown in Figure 1. The process is sequential: the ESP32-CAM captures a visual frame and sends it wirelessly to the Hugging Face inference backend, which processes the image into a scene description. The description is then passed to the Raspberry Pi, where it is converted to speech locally (text-to-speech, TTS).

All the essential components in this workflow, including the ESP32-CAM integration, the cloud deployment, and the edge audio unit, have been fully developed and tested in the field. During initial algorithmic validation, local CPU inference experiments were conducted with the aim of reducing edge latency and power consumption in the deployed system; these have been set aside for now, and only the cloud backend is used. A shift to a fully offline inference model running locally on the Raspberry Pi is left as future work. The details of how this distributed deployment works through the REST API are described in Section 3.4.

2.1. Imaging and Communication Module

The wearable hardware platform is based on the ESP32-CAM, which combines the ESP32 system-on-a-chip with the OV2640 CMOS image sensor. This module handles local image capture and wireless transfer to the cloud processing pipeline. The base hardware comprises a 2.4 GHz Wi-Fi and Bluetooth Classic radio, 520 KB of internal SRAM, 4 MB of external PSRAM, and a dual-core Xtensa LX6 CPU, as shown in Figure 1.

The main goal of the input module is to balance image quality against the hard memory and bandwidth constraints of the ESP32-CAM. Although the camera supports resolutions up to UXGA, transmitting uncompressed high-resolution frames over standard wireless protocols carries a high risk of latency and memory overload. Images are therefore compressed dynamically and captured at lower resolutions and lower frame rates (around 13 fps) to keep the transmission pipeline stable and avoid exhausting the PSRAM.

The trade-off between resolution, memory footprint, and transmission reliability is shown in Table 1. Limiting the capture parameters reduces buffer overflows during continuous Wi-Fi transmission, which is important for ensuring that the cloud-based Transformer architecture receives a steady stream of data.
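The memory pressure described above can be made concrete with a short calculation. The sketch below is illustrative only: it assumes an uncompressed RGB565 buffer (2 bytes per pixel), which is an assumption rather than a setting reported in this work, and it compares the raw size of one frame at several common resolutions with the 4 MB of PSRAM stated above. The function and constant names are hypothetical.

```python
# Standard resolutions; the 2 bytes/pixel figure assumes RGB565 (an assumption).
FRAME_SIZES = {
    "QVGA": (320, 240),
    "VGA": (640, 480),
    "SVGA": (800, 600),
    "XGA": (1024, 768),
    "UXGA": (1600, 1200),
}
PSRAM_BYTES = 4 * 1024 * 1024  # 4 MB external PSRAM, as stated in the text
BYTES_PER_PIXEL = 2            # assumption: uncompressed RGB565


def raw_frame_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


def psram_fraction(name):
    """Fraction of PSRAM that a single uncompressed frame would occupy."""
    width, height = FRAME_SIZES[name]
    return raw_frame_bytes(width, height) / PSRAM_BYTES


if __name__ == "__main__":
    for name in FRAME_SIZES:
        print(f"{name:5s} {psram_fraction(name):6.1%} of PSRAM per raw frame")
```

Under these assumptions, a single raw UXGA frame would fill most of the PSRAM, which is consistent with the decision to compress frames and capture at lower resolutions.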

2.2. Audio Output Configuration

Delivering periodic and coherent text-to-speech (TTS) narration requires adequate memory and processing capacity at the edge. An ESP32 Development Board with an external MAX98357A I2S DAC was first tested as a local audio synthesizer. In this setup, text payloads sent from the cloud were synthesized into audio on the microcontroller. This approach proved insufficient: the strict SRAM constraints of the ESP32 caused buffer underruns during audio generation, so the system stalled and produced fragmented audio (in most cases, it could articulate only single words).

To get around these hardware bottlenecks, the audio processing was moved to a Raspberry Pi 4 (8 GB RAM). The Raspberry Pi provides enough memory and computational headroom for continuous, uninterrupted TTS generation without buffer overflows. In addition, it reduces the hardware footprint, because its built-in analog audio output removes the need for an external I2S DAC module, as noted in Table 2. In the final design, the text descriptions produced by the cloud processing pipeline are sent directly to the Raspberry Pi, as shown in Figure 2. The audio is synthesized using a lightweight TTS engine and played through the onboard audio interface to a wired hands-free earpiece. This architecture is designed to provide the visually impaired user with prompt and highly coherent environmental narration.

2.3. Software Environment

At the current stage of development, all initial experiments were conducted in a local desktop environment without deployment to a cloud-based or GPU-enabled server. The system was implemented and evaluated as a proof-of-concept prototype focused on validating the functional pipeline, after which it was deployed on Hugging Face Spaces for periodic inference on a CPU server. The software environment used in this study is summarized as follows: operating system, Microsoft Windows; programming language, Python 3.11; deep learning framework, PyTorch 1.6.0; Transformer libraries, Hugging Face Transformers [7]; demonstration framework, Hugging Face Spaces; and development environment, local execution without model deployment or containerization. All deep learning model experiments were run locally using CPU inference. Training and fine-tuning were performed offline, and the model outputs were compared under controlled experimental conditions. At this stage, no optimization frameworks (such as ONNX Runtime or TensorRT) were used.

At this proof-of-concept stage, the link between image acquisition and the processing pipeline was not exercised under operational latency conditions. Rather, experimental verification was performed on manually transferred captured images or through simulated input pipelines.

2.4. Multi-Stage Pipeline

2.4.1. Scene Graph Generation Model

(a): RelTR Architecture

Scene graph generation is performed using the Relation Transformer (RelTR) model [2], a single-stage, end-to-end framework based on the DETR architecture [1]. RelTR formulates scene graph generation as a direct set prediction problem, eliminating the need for hand-crafted post-processing steps. The model consists of the following components:

Backbone Feature Extractor: Initially ResNet-101; later optimized using Transformer-based architecture backbones [4].
Transformer Encoder: Encodes global visual context using self-attention.
Entity Decoder: Predicts object entities within the scene.
Triplet Decoder: Infers subject–predicate–object relationships using coupled self-attention.
Matching and Loss: Hungarian matching with a set-prediction loss function.

The final output is an unordered set of relational triplets describing object interactions in the scene.
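To illustrate the matching step listed above, the following sketch assigns predictions to ground-truth items so that the total cost is minimal. It is not the RelTR implementation: for brevity it uses an exhaustive search over assignments instead of the Hungarian algorithm, which gives the same optimal assignment on tiny inputs but does not scale. The cost values and the function name are hypothetical.

```python
from itertools import permutations


def match_predictions(cost):
    """Return ({prediction_index: ground_truth_index}, total_cost).

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Assumes at least as many predictions (rows) as ground truths (columns).
    Exhaustive search is used here only as a stand-in for Hungarian matching.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    if n_pred < n_gt:
        raise ValueError("need at least as many predictions as ground truths")
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        # perm[j] is the prediction assigned to ground truth j
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_total:
            best_total, best_perm = total, perm
    return {pred: gt for gt, pred in enumerate(best_perm)}, best_total


if __name__ == "__main__":
    # Three predictions, two ground-truth items (hypothetical costs).
    cost = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
    assignment, total = match_predictions(cost)
    print(assignment, round(total, 3))
```

In DETR-style set prediction, predictions left unmatched (prediction 2 in this example) are trained toward a "no object" target, and the set-prediction loss is then computed over the matched pairs.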

(b): Backbone Configuration

The scene graph generation module in the current implementation uses the original RelTR architecture with its convolutional backbone, as provided by the original open-source repository. The backbone is used without any architectural or performance optimization so that the correct functioning of the end-to-end pipeline can be verified. No backbone replacement, quantization, pruning, knowledge distillation, or inference acceleration has been applied at this point; all inference uses the baseline model configuration to establish a reference performance level. As a result, the system shows an average inference time of about eight seconds per image under the current experimental conditions. This performance limitation is acknowledged and documented in Section 7. Accordingly, the current study focuses on the feasibility of the pipeline and the quality of its output rather than on achieving periodic performance. Possible optimizations, such as lightweight Transformer backbones, model compression, and faster inference runtimes, are identified as future work and have not been integrated into the present experimental evaluation.

(c): Triplet Filtering and Serialization

The raw output of the RelTR model may contain low-confidence or redundant relations. A confidence-based filtering stage was applied to retain only semantically meaningful triplets. The retained triplets were then serialized into a textual format compatible with Transformer-based language models. An example serialized input is WebNLG: man|wear|hat && man|stand on|sidewalk. This serialized representation serves as the input to the text generation module.
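A minimal sketch of this filtering and serialization stage is given below. The confidence threshold of 0.3, the deduplication rule (keep the highest-scoring copy of each triplet), and all names are assumptions made for illustration; only the serialized format follows the example above.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: str
    score: float  # model confidence in [0, 1]


def filter_triplets(triplets, threshold=0.3):
    """Drop low-confidence triplets and keep one copy of each relation.

    The threshold value is a hypothetical choice, not a reported setting.
    """
    best = {}
    for t in triplets:
        if t.score < threshold:
            continue
        key = (t.subject, t.predicate, t.obj)
        if key not in best or t.score > best[key].score:
            best[key] = t
    return sorted(best.values(), key=lambda t: t.score, reverse=True)


def serialize(triplets, prefix="WebNLG: "):
    """Join triplets as subject|predicate|object separated by ' && '."""
    return prefix + " && ".join(
        f"{t.subject}|{t.predicate}|{t.obj}" for t in triplets
    )


if __name__ == "__main__":
    raw = [
        Triplet("man", "wear", "hat", 0.91),
        Triplet("man", "stand on", "sidewalk", 0.78),
        Triplet("man", "wear", "hat", 0.55),  # redundant duplicate
        Triplet("hat", "on", "man", 0.12),    # low confidence
    ]
    print(serialize(filter_triplets(raw)))
```

With the hypothetical scores shown, the duplicate and the low-confidence relation are removed and the remaining two triplets reproduce the serialized example from the text.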

(d): Identification of Critical Latency Bottleneck

During the initial experiments, the unoptimized RelTR model proved functionally correct, but testing on a local CPU-based system revealed a major limitation. Processing a single image took about 8.16 s, which is unsuitable for real-time assistive applications that generally require response times under 200 ms (about 5 frames per second) [3]. This latency bottleneck is attributed largely to the following:

The ResNet-101 backbone, which is computationally expensive for feature extraction [4].
The self-attention operations in the Transformer encoder, which scale quadratically with the number of input tokens [8].
The use of full 32-bit floating-point (FP32) precision, which increases computation time and memory footprint on resource-constrained hardware [9].

These observations show that the model must be optimized and its architecture refined before it can be deployed in real time. The specific optimization plans and their quantitative impact are not discussed in this section and are left to future work.
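The size of the gap between the measured latency and the real-time target can be worked out directly from the two figures quoted above. The short sketch below does only that arithmetic; the function names are hypothetical.

```python
MEASURED_LATENCY_S = 8.16  # per-image CPU latency reported in the text
TARGET_LATENCY_S = 0.200   # real-time response target cited in the text


def required_speedup(measured_s, target_s):
    """Factor by which inference must be accelerated to reach the target."""
    return measured_s / target_s


def frames_per_second(latency_s):
    """Throughput when frames are processed one after another."""
    return 1.0 / latency_s


if __name__ == "__main__":
    speedup = required_speedup(MEASURED_LATENCY_S, TARGET_LATENCY_S)
    print(f"required speedup: {speedup:.1f}x")
    print(f"current throughput: {frames_per_second(MEASURED_LATENCY_S):.3f} fps")
    print(f"target throughput: {frames_per_second(TARGET_LATENCY_S):.1f} fps")
```

In other words, the baseline would need to run roughly forty times faster to meet a 200 ms budget, which is why the bottlenecks listed above matter for any real-time variant.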

2.4.2. Natural Language Generation Model

A T5-based Small Language Model (SLM) was used for structured-to-text generation [5]. The T5 model reformulates all tasks into a text-to-text format, making it particularly suitable for converting serialized relational triplets into natural language sentences. Smaller variants (T5-small/T5-base) were selected to ensure low latency while maintaining adequate generation quality, as shown in Figure 3.

The model was fine-tuned using the WebNLG dataset [10], which pairs RDF-style triplets with human-written natural language descriptions. The dataset addresses challenges such as aggregation, lexical choice, and sentence coherence, aligning closely with scene graph narration requirements. The WebNLG dataset is publicly available, and no proprietary data were used. These results indicate that the fundamental design goal, i.e., transforming visual input into structured relational representations that can be used in assistive reasoning, has been reached. A quantitative analysis of detection accuracy and relation recall is outside the scope of the current stage and is deferred to later work, after the models have been optimized.

2.5. System Scope: Proof-of-Concept vs. Real-Time Processing

To define the scope of this work clearly, it is important to distinguish the Vocal-Eyes framework from a real-time assistive system. The key difference is that a real-time assistive navigation system needs continuous sub-second inference and often must run entirely on a local edge AI platform (e.g., with dedicated NPUs) rather than in the cloud, to avoid network lag.

The present version of Vocal-Eyes, on the other hand, is only a proof-of-concept prototype. Its primary goal is not limited to obstacle avoidance; rather, it is to demonstrate that Transformer-based scene graph generation can inform a visually impaired user about the environment in a broad and comprehensive manner. The extraction of heavy semantic structures is offloaded to the cloud, and, to avoid sensory overload, the prototype has been implemented with an intentional 5 s latency cycle. The system is thus designed not as a navigational lifeline but as a tool for asynchronous environmental awareness.

3. Qualitative Analysis

3.1. Transformer Attention Mechanisms

To explore the inner workings of the RelTR model, attention visualizations produced from the triplet decoder were studied [2]. Heatmaps of subject and object attention were generated to analyze which parts of the input image the model attends to during relational inference.

The attention maps, shown in Figure 4, indicate that the coupled subject and object queries effectively select semantically significant image regions. This behavior provides qualitative evidence that the Transformer architecture uses global contextual information to reason about spatial and semantic relationships among objects [8]. Such visualizations strengthen the interpretability of the model and support the view that relational predictions are based on meaningful visual cues rather than spurious correlations.

3.2. System Deployment and Reproducibility

To ensure the transparency and reproducibility of our findings, the inference pipeline was deployed as an interactive web application on Hugging Face Spaces. The interface was developed using the Gradio library, providing a high-level UI that bridges the gap between the complex backend Transformer architecture and end-user interaction. The deployment serves two primary purposes.

3.2.1. Accessibility

It allows researchers and practitioners to test the model on diverse datasets without the need for local environment configuration. The interactive demonstration is publicly available at the following URL: https://huggingface.co/spaces/ABDRauf/Vocal-Eyes (accessed on 28 February 2026).

3.2.2. Verification

We provide three curated example images within the UI that represent different challenging scenarios (e.g., varying illumination or complex backgrounds) to demonstrate the model’s robustness under controlled yet difficult conditions.

Users can also perform custom inference by uploading their own images, as shown in Figure 5.

3.3. Representative Inference Example

To illustrate the practical utility of the deployed pipeline, a representative test case is provided within the interactive interface. This example was specifically selected to demonstrate the model’s performance in a challenging, real-world scenario. Upon processing this sample through the Gradio-based pipeline, the system successfully identified the primary features, as shown in Figure 6.

3.4. Hardware-Compatible REST API Deployment

To extend the accessibility of the Vocal-Eyes pipeline beyond browser-based interaction, the inference backend was additionally exposed as a lightweight RESTful API using the FastAPI framework, as shown in Figure 1. This deployment strategy is motivated by the practical requirements of edge hardware integration, specifically embedded systems, such as the ESP32-CAM or similar microcontrollers, that cannot run full inference locally but can transmit image data over a network connection.

The API follows a straightforward request–response paradigm: a client device sends a POST request to the “/predict” endpoint with an image attached as “multipart/form-data” and receives the scene description as plain text wrapped in a JSON response. This design minimizes the computational burden on the hardware side, offloading all model inference to the server. The auto-generated OpenAPI documentation, accessible via /openapi.json and rendered through Swagger UI, confirms the contract of the endpoint, as illustrated in Figure 7.
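The request described above can be sketched from the client side using only the Python standard library. This is an illustration under stated assumptions rather than the deployed client code: the form field name "file", the file name "frame.jpg", and the helper names are hypothetical, and the JSON body is returned as parsed because its exact schema is not specified here.

```python
import json
import urllib.request
import uuid


def build_multipart(field, filename, payload, content_type="image/jpeg",
                    boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def request_description(base_url, image_bytes, field="file", timeout=30):
    """POST an image to <base_url>/predict and return the parsed JSON reply.

    The field name "file" is an assumption; check the OpenAPI schema.
    """
    body, content_type = build_multipart(field, "frame.jpg", image_bytes)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would pass the base URL of the hosted service and the JPEG bytes of a captured frame to request_description; the correct field name and response keys should be taken from the published /openapi.json document.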

As verified through the root health check endpoint, the service responds with a status confirmation indicating that the API is live and ready to accept image submissions. This confirms the feasibility of a real-world deployment scenario in which a camera-equipped embedded device captures a scene, transmits the image to the hosted API, and receives an audio-ready scene description, completing the assistive loop intended by the Vocal-Eyes system.

3.5. System Latency and Cognitive Overload

On local hardware, initial testing gave an inference latency of about 8.16 s per frame. With the full end-to-end pipeline deployed, however, the system settled at a latency of roughly 5 s per cycle. Reducing this 5 s cycle toward sub-second continuous tracking is possible; however, the interval was deliberately retained as a key user experience parameter to control the frequency of audio output.

In assistive navigation, continuous, ultra-low-latency feedback can actually be harmful [11]. Ambient acoustic cues, including the sounds of approaching and passing vehicles, nearby footsteps, and echolocation cues produced with white canes, are critically important for the spatial orientation of visually impaired people. A user who constantly receives frequent scene graph updates (such as object detection summaries every 1–2 s) is often overwhelmed by the information [11]. The brain requires an adequate processing window to interpret the output, map it to the physical environment, and make a safe navigational decision. When updates arrive faster than this window allows, the user ends up more confused about what they are hearing rather than able to navigate more quickly, and the narration can dangerously obscure important environmental sounds.

The system delivers audio feedback every 5 s, acting as a deliberate sensory filter. This pacing allows the user to receive accurate and relevant relational information without exceeding their cognitive capacity or masking the ambient cues from their physical surroundings that matter to them. Ultimately, the system design is oriented toward safety and actionable understanding rather than output frequency alone.
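One simple way to enforce such pacing is a small gate that lets a description through only when the chosen interval has elapsed since the last spoken one. The sketch below is an assumption about how this could be done, not the deployed logic; the class name and the injectable clock are hypothetical, and only the 5 s interval comes from the text.

```python
import time


class NarrationPacer:
    """Pass a description through at most once per interval."""

    def __init__(self, interval_s=5.0, clock=time.monotonic):
        self.interval_s = interval_s
        self._clock = clock        # injectable so the pacing can be tested
        self._last_spoken = None   # time of the last description passed on

    def offer(self, description):
        """Return the description if it may be spoken now, else None."""
        now = self._clock()
        if (self._last_spoken is not None
                and now - self._last_spoken < self.interval_s):
            return None  # too soon: drop the update to keep the channel quiet
        self._last_spoken = now
        return description


if __name__ == "__main__":
    pacer = NarrationPacer(interval_s=5.0)
    print(pacer.offer("man standing on sidewalk"))  # spoken
    print(pacer.offer("man wearing hat"))           # suppressed: None
```

Dropping intermediate updates, rather than queueing them, keeps the narration current and leaves quiet gaps in which ambient sounds remain audible.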

4. Experimental Setup and Metrics

The evaluation strategy departs from conventional image captioning evaluation, which prioritizes human-like phrasing and closeness to human-written captions. Instead, the metrics were selected to reflect the needs of assistive technology for blind or visually impaired users, prioritizing strict factual grounding and semantic accuracy over stylistic prose.

To evaluate the proposed Vocal-Eyes pipeline, we randomly selected a subset of 250 images from the MS COCO 2017 validation set. The images cover a variety of real-world situations, including indoor living areas, street scenes, and common pedestrian obstacles, that are most likely to present difficulties for a visually impaired individual navigating.

4.1. Internal Factual Reliability Metrics

4.1.1. Triplet Coverage

Triplet coverage measures the share of unique, meaningful subject–predicate–object triplets from the pruned scene graph that the language model incorporated into the final text.

4.1.2. Hallucination Rate

The hallucination rate is the percentage of descriptive tokens in the generated description that are not supported by the intermediate scene graph. For assistive navigation, a near-zero hallucination rate is a critical safety constraint.
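The two internal metrics can be sketched as follows. The exact matching rules are not spelled out here, so this example makes simplifying assumptions: a triplet counts as covered when every word of its subject and object appears in the text, and a token counts as hallucinated when it is not a stopword and does not occur anywhere in the scene graph. The stopword list and function names are hypothetical, and because no lemmatization is applied, inflected forms such as "wearing" would be counted as unsupported.

```python
import re

# Small illustrative stopword list (an assumption, not the list used here).
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "at",
             "to", "with"}


def tokens(text):
    """Lowercase alphabetic tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())


def triplet_coverage(triplets, text):
    """Share of unique triplets whose subject and object both appear."""
    words = set(tokens(text))
    unique = set(triplets)
    if not unique:
        return 0.0
    covered = sum(
        1 for subj, _, obj in unique
        if set(tokens(subj)) <= words and set(tokens(obj)) <= words
    )
    return covered / len(unique)


def hallucination_rate(triplets, text):
    """Share of non-stopword tokens that do not occur in the scene graph."""
    graph_vocab = {w for t in triplets for part in t for w in tokens(part)}
    content = [w for w in tokens(text) if w not in STOPWORDS]
    if not content:
        return 0.0
    unsupported = [w for w in content if w not in graph_vocab]
    return len(unsupported) / len(content)


if __name__ == "__main__":
    graph = [("man", "wear", "hat"), ("man", "stand on", "sidewalk"),
             ("dog", "near", "bench")]
    caption = "A man with a hat is on the sidewalk near a car."
    print(round(triplet_coverage(graph, caption), 3))
    print(round(hallucination_rate(graph, caption), 3))
```

In this toy case, two of the three triplets are covered, and one of the five content tokens ("car") has no support in the graph, so the caption would be flagged as partly hallucinated.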

4.2. External Semantic and Stylistic Metrics

4.2.1. BERTScore (Semantic Fidelity)

BERTScore uses pre-trained contextual embeddings to evaluate the deep semantic similarity between the generated output and human ground-truth annotations, without depending on exact word matching.

4.2.2. BLEU-4 and ROUGE-L (Stylistic Overlap)

These are standard n-gram matching metrics, used here to quantify the structural and stylistic divergence between the model’s rigid, factual outputs and the casual text of human annotators.
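To show why terse, factual output scores low on overlap metrics, the sketch below computes a simplified ROUGE-L between a candidate and a single reference. It is a reduced illustration, not the evaluation code used in this study: it uses whitespace tokenization, one reference, and a plain F1 (equal weighting of precision and recall), and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    candidate = "man wears hat on sidewalk"
    reference = "a man in a hat stands on the sidewalk"
    print(round(rouge_l_f1(candidate, reference), 3))
```

Here the candidate shares four in-order words with the reference (man, hat, on, sidewalk), so precision is high but recall is low, which is the pattern expected when compact relational text is compared against conversational captions.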

4.2.3. CIDEr (Consensus)

CIDEr measures how well the generated text aligns with the consensus of multiple human reference captions.

4.3. End-to-End Evaluation Strategy

This study does not assess the intermediate scene graph generation (SGG) module in isolation with standard benchmark metrics (e.g., Recall@K); instead, it assesses the pipeline end to end. Because synthesized speech is the final output delivered to the user, the efficacy of the system is best judged by the quality and factual accuracy of that output. Consequently, we focus on natural language generation metrics. Stylistic overlap and consensus are measured with BLEU-4, ROUGE-L, and CIDEr, and deep semantic fidelity is measured with BERTScore [12]. Importantly, we use triplet coverage and hallucination rate to validate the system’s “factual bottleneck”, as these directly measure whether the structural relations mapped by the SGG module are retained in the final output without the addition of unverified spatial artifacts.

5. Small Language Model Results for Triplet-to-Text Generation

The second stage of the pipeline focuses on converting structured scene graph triplets into fluent natural language descriptions. A T5-based Small Language Model (SLM), as shown in Figure 8, was fine-tuned on the WebNLG 2020 [10] dataset to perform structured data-to-text generation [5].

This approach enables low-latency narration while avoiding the computational overhead associated with large language models, making it suitable for assistive systems operating under strict resource constraints [13,14,15].

6. Dataset and Training Behavior

The WebNLG 2020 dataset was used for fine-tuning, consisting of 35,212 instances mapping RDF-style triplets to human-written reference sentences [10]. Each input instance contains a variable number of triplets serialized into a textual format. Training loss curves, shown in Figure 9, indicate rapid early convergence followed by stabilization, suggesting effective transfer learning and training stability without evidence of overfitting. The fine-tuned SLM was evaluated using standard automatic metrics on held-out WebNLG samples. The model achieved the following scores: BLEU, 0.358 [16]; ROUGE-1, 0.727; and ROUGE-2, 0.469.

Inference time per instance ranged between 0.25 and 0.45 s, which satisfies the latency requirements for periodic narration in assistive applications [3]. These results indicate strong semantic fidelity between generated outputs and reference sentences, while maintaining low inference latency.

Table 3 presents representative examples comparing generated sentences with ground-truth references. The results show that the model consistently preserves key entities, attributes, and relationships from the input triplets. Minor variations in phrasing and word order were observed; however, these did not affect semantic correctness. In several cases, the model demonstrated paraphrasing capabilities, producing natural and human-like descriptions rather than rigid template-based outputs.

Despite its effectiveness, several limitations were identified in the SLM component. Autoregressive decoding introduces latency that increases with sentence length and the number of triplets. The T5-base model size (220 M parameters) still poses memory and computation challenges for highly constrained edge devices [7]. Generated text is strictly conditioned on input triplets and lacks higher-level commonsense inference. Dependence on the WebNLG domain may introduce domain shift issues when applied to real-world scene graph data [10]. Nonetheless, compared to large language models, the SLM provides a favorable balance between quality, latency, and deployability.

7. Results and Analysis

The aim of this study was to examine the feasibility of a low-cost, cloud-based assistive vision pipeline based on scene graph generation (SGG) and structured-to-text natural language generation (NLG) to support periodic narration. The results above shed light on the feasibility of the proposed architecture and on the challenges likely to arise when deploying state-of-the-art Transformer-based models in resource-limited settings.

7.1. Multi-Stage Pipeline Results Interpretation

The successful implementation of the Relation Transformer (RelTR) shows that end-to-end scene graph generation, formulated as set prediction, can recover rich relational semantics from single images [2]. The qualitative findings indicate that the model reliably produces meaningful subject–predicate–object triplets, consistent with the findings of Cong et al. [2], who reported that Transformer-based architectures are more effective than traditional two-stage pipelines because they model objects and relationships jointly. The attention visualizations also support this conclusion, showing that RelTR successfully uses global contextual reasoning. The correspondence between the attention heatmaps and semantically relevant regions of the input images shows that the coupled self-attention mechanism of the triplet decoder works as intended [8]. This behavior is consistent with previous research stressing the importance of global receptive fields for modeling long-range interactions between objects, which are prevalent in complex real-world scenes.

However, despite the qualitative strength of the relational output, the inference latency observed on the deployed CPU platform has inherent limitations. A cycle time of about 5 s per image was observed and was intentionally kept at this level to avoid cognitive overload.

7.2. Internal Factual Reliability

The main objective of the two-stage architecture is to enforce strict factual grounding. As shown in Figure 11, the pipeline achieves a triplet coverage of 51.73% and an exceptionally low hallucination rate of 0.40%.

The coverage rate of 51.73% reflects the effect of the heuristic graph pruning module. By filtering out structural noise (e.g., “person has nose”) before text generation, the module lets the language model devote its token budget entirely to important environmental and object-interaction information. The 0.40% hallucination rate, in turn, indicates that 99.6% of the generated output is grounded in the tested visual data. In assistive technology for blind and visually impaired people, a hallucinated obstacle or a hallucinated clear path poses a direct physical danger, so this near-zero error rate supports the proposed safety architecture.
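The pruning and grounding checks described above can be sketched roughly as follows. This is a minimal illustration only: the body-part stop-list, the function names, and the word-matching definitions of coverage and hallucination are assumptions made here and are not the exact procedure or formulas used in this study.

```python
import re

# Hypothetical stop-list of body-part objects treated as structural noise.
TRIVIAL_PARTS = {"nose", "ear", "eye", "hair", "head", "hand", "leg", "arm"}
# Function words ignored when checking grounding (illustrative, not exhaustive).
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "and", "of", "on", "in", "at", "with"}


def prune(triplets):
    """Drop 'X has <body part>' triplets such as ('person', 'has', 'nose')."""
    return [
        (s, p, o) for (s, p, o) in triplets
        if not (p == "has" and o in TRIVIAL_PARTS)
    ]


def tokens(text):
    """Return the lowercase alphabetic word tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())


def triplet_coverage(triplets, sentence):
    """Fraction of triplets whose subject and object words all appear in the sentence."""
    if not triplets:
        return 0.0
    words = set(tokens(sentence))
    covered = sum(
        1 for (s, _, o) in triplets
        if set(tokens(s)) <= words and set(tokens(o)) <= words
    )
    return covered / len(triplets)


def hallucination_rate(triplets, sentence):
    """Fraction of content words in the sentence that occur in no input triplet."""
    grounded = set()
    for triplet in triplets:
        for part in triplet:
            grounded.update(tokens(part))
    content = [w for w in tokens(sentence) if w not in FUNCTION_WORDS]
    if not content:
        return 0.0
    return sum(1 for w in content if w not in grounded) / len(content)


# Triplets in the style of Table 4, plus one anatomical triplet to be pruned.
raw = [("man", "holding", "bottle"), ("man", "sitting on", "chair"), ("man", "has", "nose")]
facts = prune(raw)
output = "A man is sitting on a chair holding a bottle."
print(facts)
print(triplet_coverage(facts, output), hallucination_rate(facts, output))
```

Under these simplified definitions, a sentence that mentions an attribute absent from the triplets (for example, a colour) raises the hallucination rate, which is the behaviour a grounding metric of this kind is meant to flag.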

7.3. Semantic Fidelity vs. Stylistic Overlap

Figure 12 shows a trade-off in the text generated by the pipeline. The system records a high BERTScore F1 of 89.54%, in sharp contrast to a BLEU-4 score of 1.60% and a CIDEr score of 18.51%.

Traditional end-to-end vision–language models maximize BLEU by imitating conversational human phrasing, but such prose is often of little use to visually impaired users, who need environmental information quickly and unambiguously. The large gap between BERTScore and the n-gram overlap metrics suggests that the Vocal-Eyes pipeline discards ornamental, human-like phrasing (hence the low BLEU-4) while preserving the core meaning of the scene with high semantic fidelity (the 89.54% BERTScore), as shown in Figure 12.

7.4. Factual Conciseness vs. Human Stylistic Variation

To illustrate the practical implications of these metrics, Table 4 provides a qualitative comparison of the raw intermediate facts, the output generated by Vocal-Eyes, and standard MS COCO human annotations.

As shown in Table 4, human annotators routinely add subjective adjectives (“beautiful”, “casually dressed”, “playful”) and literary variation to their descriptions. The Vocal-Eyes pipeline deliberately avoids this behavior. Because generation is driven strictly by the extracted facts, the output stays concise. For a visually impaired user who depends on screen readers or audio responses to understand their surroundings, this conciseness limits cognitive load and speeds up delivery of the important spatial information, which favors the semantic over the stylistic approach.
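To make the low n-gram overlap concrete, the sketch below computes clipped n-gram precision, the core ingredient of BLEU, for the second example in Table 4. It is a simplified illustration: it uses a single reference, whitespace tokenization, and no brevity penalty or smoothing, so it is not the full BLEU-4 computation, and the function names are chosen here for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = Counter(ngrams(candidate.lower().rstrip(".").split(), n))
    ref = Counter(ngrams(reference.lower().rstrip(".").split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


generated = "A man is sitting on a chair holding a bottle."
human = "A casually dressed man relaxes in a seat with a beverage."

for n in (1, 2):
    print(n, clipped_precision(generated, human, n))
```

For this pair, the only shared unigrams are "a" and "man", and no bigram is shared, even though both sentences describe the same scene. Any score built on a product of n-gram precisions therefore collapses toward zero, which is the pattern a low BLEU-4 alongside a high embedding-based score reflects.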

7.5. Latency: A System-Level Bottleneck

The latency measurements indicate that computational feasibility, rather than algorithmic correctness, is currently the main limitation of the system. The latency stems from a combination of factors: the depth and complexity of the Transformer backbone, the quadratic cost of self-attention, and the use of FP32 precision during inference [9].

This finding confirms an important lesson from recent assistive vision studies: perception models must be optimized aggressively before deployment [3]. Alternative architectures such as EGTR also aim to reduce the cost of relational decoding, but the current work deliberately builds on a well-known baseline (RelTR) to show that architectural refinement alone is not sufficient; systematic optimization strategies, including quantization, backend replacement, and distillation [17], are required to close the gap between research prototypes and operational systems. The latency discussion here does not imply a failure of the approach; rather, it is a diagnostic result that directly informs the next stages of development. The earlier this bottleneck is identified, the sooner optimization effort can be directed to where it will have the most effect.
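The quadratic term can be illustrated with a back-of-the-envelope estimate. The sketch below assumes a DETR-style setup in which the backbone downsamples the image by a stride of 32 and the attention layers use a hidden size of 256; both values, and the assumption that frames are fed at the capture resolutions listed in Table 1, are illustrative and were not measured on the deployed model.

```python
def attention_cost(height, width, stride=32, d_model=256, bytes_per_value=4):
    """Estimate token count, multiply-accumulates, and attention-map size
    for one self-attention layer over a downsampled feature map."""
    n = (height // stride) * (width // stride)  # tokens after downsampling
    macs = 2 * n * n * d_model                  # QK^T plus attention-weighted sum of V
    attn_bytes = n * n * bytes_per_value        # one n x n attention map (FP32 by default)
    return n, macs, attn_bytes


# Capture resolutions from Table 1 (height, width).
for h, w in [(240, 320), (600, 800), (1200, 1600)]:
    n, macs, attn_bytes = attention_cost(h, w)
    print(f"{w}x{h}: tokens={n}, MACs={macs:,}, attention map={attn_bytes / 1024:.1f} KiB")
```

Because the token count grows linearly with pixel count while the attention cost grows with its square, doubling both image dimensions multiplies the attention cost by roughly sixteen. Passing `bytes_per_value=2` or `1` shows how lower-precision inference shrinks the memory term, although it leaves the quadratic scaling unchanged.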

7.6. Effectiveness of SLM

The findings indicate that structured-to-text generation works well with SLMs. The second pipeline stage, which translates structured triplets into natural language, was significantly more manageable. The fine-tuned T5-based small language model achieved high semantic fidelity and low inference latency, generating fluent descriptions with a response time of less than a second [5].

The BLEU and ROUGE scores are comparable to and, in certain cases, higher than those reported in previous WebNLG-based studies that used similarly sized models [16,18]. Qualitative examples also show that the model can aggregate, paraphrase, and structure content grammatically without heavy reliance on templates. This observation is in line with literature showing that encoder–decoder Transformers are especially well suited to data-to-text tasks in which semantic preservation is paramount [19].

From a system design perspective, these findings justify the decision to separate perception and language generation into two modules. Although the vision stage currently dominates latency, the NLG stage imposes a minimal computational burden and can therefore support low-latency narration once the upstream perception model is streamlined.
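Before a text-to-text model can verbalize a scene graph, the triplets have to be flattened into a single source string. The following sketch shows one common way to do this. The task prefix, the `<S>`/`<P>`/`<O>` markers, and the `max_triplets` cap are hypothetical choices made for illustration; the exact input format used to fine-tune the model in this study is not specified here.

```python
def linearize(triplets, max_triplets=5):
    """Flatten (subject, predicate, object) triplets into one source string.

    Capping the number of triplets bounds the input length, since decoding
    latency grows with the amount of content to verbalize.
    """
    kept = triplets[:max_triplets]
    parts = [f"<S> {s} <P> {p} <O> {o}" for (s, p, o) in kept]
    return "translate graph to text: " + " ".join(parts)


# Pruned facts for the first example in Table 4.
source = linearize([("car", "parked on", "street"), ("car", "is", "red")])
print(source)
```

The resulting string would then be tokenized and passed to the fine-tuned encoder–decoder model, which generates the sentence autoregressively.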

7.7. Limits and Design Trade-Offs

Although the outcomes are encouraging, several limitations should be acknowledged. First, training the NLG module on the WebNLG dataset creates a possible domain gap when the model is applied to scene graphs produced from real-world images [10]. Even though WebNLG offers a strong reference point for structured-to-text generation, its predicates and entity distributions do not necessarily match those of Visual Genome-based SGG outputs. Second, the autoregressive nature of the NLG module limits scalability for continuous, high-frequency narration. Although the current inference time is reasonable, larger and more intricate sets of triplets may introduce additional delay, especially on resource-limited devices. Lastly, the system has not yet been tested under full end-to-end deployment conditions, such as GPU-accelerated inference or complete hardware-in-the-loop testing. The results therefore represent prototype-level validation rather than the final performance of the system.

7.8. Deployed End-to-End Latency Breakdown

Timing measurements were taken throughout operation to characterize the efficiency of the deployed edge-to-cloud infrastructure. On the local, non-optimized hardware setup, the initial baseline evaluation of the RelTR model yielded an inference latency of ~8.16 s, whereas offloading the core computation to cloud-hosted CPU resources reduced the raw computational cost considerably.

The final architecture includes an intentional buffer step that controls how often the user hears feedback, instead of playing a continuous, fast-moving loop that could mask ambient sounds and leave the user vulnerable. Table 5 provides the latency breakdown across the individual execution stages.

The empirical results show that the total overhead of the core processing pipeline (from local hardware capture through cloud inference to speech synthesis) is about 3.50 s. Adding the 1.50 s intentional cognitive pacing delay brings the end-to-end system latency to the 5.00 s target. This intentional delay ensures a low and stable update rate, enabling users to absorb the updates without feeling fatigued or overwhelmed while still receiving situational cues.
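The latency budget in Table 5 can be reproduced with a few lines of arithmetic. In the sketch below, the stage names are shortened labels for the rows of Table 5; computing the pacing delay as the remainder of a fixed target period is an assumption made for illustration, since the prototype is described as using a fixed 1.50 s delay.

```python
# Average per-stage latencies from Table 5, in seconds.
STAGES_S = {
    "capture_and_routing": 0.50,
    "uplink_and_inference": 2.50,
    "downlink_and_tts_init": 0.50,
}
TARGET_CYCLE_S = 5.00  # target end-to-end period per narration cycle


def pacing_delay(processing_s, target_s=TARGET_CYCLE_S):
    """Idle time needed to pad one cycle up to the target period (never negative)."""
    return max(0.0, target_s - processing_s)


processing = sum(STAGES_S.values())
print(f"processing: {processing:.2f} s, pacing delay: {pacing_delay(processing):.2f} s")
```

With the Table 5 values, processing takes 3.50 s and the remaining 1.50 s is idle time. Deriving the delay from the target period would keep the narration rate stable if network conditions made a cycle slower, at the cost of no pause when processing alone exceeds the target.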

7.9. Discussions

Vocal-Eyes shows great promise for producing coherent, scene-based narrations, but cascading errors are a challenge because the cloud-based neural pipeline is sequential. Errors from the Vision Transformer used for initial feature extraction and from the scene graph generation stage are inevitably carried over to the final graph-to-text module. This natural language generation (NLG) component is the final destination of any upstream anomalies, which makes it susceptible to exposure bias. During training, graph-to-text models are usually trained with teacher forcing on ground-truth graphs, whereas at inference they must handle the noisy, imperfect graphs produced by the pipeline. If the NLG module has not learned to cope with such errors, these upstream prediction errors can cause it to generate incorrect audio descriptions, which is an important safety concern for visually impaired users.

Future versions of Vocal-Eyes could adopt more sophisticated training paradigms in the graph-to-text module to withstand these compounding errors. For instance, recent research [19] demonstrates that imitation learning is an effective method for addressing exposure bias in LLM distillation. Exposing the text generator to noisy, pipeline-produced graphs during training could make the system more resilient to errors made by the upstream feature extractors and relation predictors.
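One simple way to expose a graph-to-text model to imperfect inputs is to perturb the training graphs. The sketch below is plain noise augmentation, not the imitation-learning method of [19], and it is restricted to meaning-preserving noise (duplicated and reordered triplets) so that the reference sentence remains a valid target. The function name and the duplication probability are hypothetical.

```python
import random


def perturb(triplets, p_duplicate=0.2, rng=None):
    """Return a shuffled copy of the triplets with some entries duplicated.

    Duplicates and arbitrary ordering are typical artifacts of set-style
    predictions; neither changes the facts that the graph expresses.
    """
    rng = rng or random.Random()
    noisy = []
    for triplet in triplets:
        noisy.append(triplet)
        if rng.random() < p_duplicate:
            noisy.append(triplet)  # simulate a repeated prediction
    rng.shuffle(noisy)
    return noisy


clean = [("dog", "catching", "frisbee"), ("dog", "in", "air")]
print(perturb(clean, p_duplicate=0.5, rng=random.Random(0)))
```

Each perturbed graph would be paired with the unchanged reference sentence during fine-tuning. Noise that alters the facts, such as a wrong predicate, is deliberately left out, because pairing it with the original sentence would teach the model to state content that is absent from its input.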

8. Cost Analysis and Economic Viability

One of the goals of the Vocal-Eyes project is to make assistive technology, which is often prohibitively expensive, available to everyone. To quantify this, we provide a cost analysis covering both the initial Bill of Materials (BOM) for the lightweight hardware prototype and the estimated recurring cloud processing costs.

8.1. Hardware Bill of Materials (BOM)

To keep manufacturing costs down, the hardware prototype was designed using commercial off-the-shelf (COTS) components. The processing unit is a Raspberry Pi 4 (8 GB RAM), chosen for its edge processing capabilities and power efficiency. Table 6 shows the estimated cost of the hardware prototype BOM.

8.2. Recurring Operational and Cloud Costs

Cloud connectivity is necessary because Vocal-Eyes offloads the heavy Transformer and scene graph generation workloads. For a typical user with an average of 3 h of active inference per day, the estimated operating costs in our deployment architecture (FastAPI backend, cloud GPU inference) comprise:

Cloud GPU/Inference Hosting: ~$12.00–$18.00/month.
Connectivity: Standard mobile tethering (smartphone hotspot)—no dedicated cellular modem in glasses so as to keep weight and cost low.
Maintenance: Open-source software is updated OTA for free; hardware is replaced using standard modular parts.

8.3. Comparison with Commercial Alternatives

To assess the economic feasibility of Vocal-Eyes, the costs of our prototype were compared with those of leading commercially available assistive glasses and smartphone applications, as shown in Table 7. Smartphone apps (such as Seeing AI) are free but require the user to hold the device and point it at a specific object, and they are not as continuously available as a head-mounted wearable.
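The figures in Tables 6 and 7 can be combined into a simple cumulative cost comparison. The sketch below uses only the prices listed in those tables; the function names and the 36-month horizon are illustrative, and the calculation ignores hardware replacement, price changes, and mobile data costs.

```python
# Component prices from Table 6 (USD).
BOM_USD = {
    "Raspberry Pi 4 Model B (8 GB)": 75.0,
    "8 MP Pi Camera Module V2": 30.0,
    "Bone conduction headphones": 45.0,
    "10,000 mAh power bank": 25.0,
    "3D-printed chassis": 15.0,
}
MONTHLY_CLOUD_USD = 15.0  # recurring cost listed in Table 7


def cumulative_cost(upfront_usd, monthly_usd, months):
    """Total spend after a given number of months."""
    return upfront_usd + monthly_usd * months


def break_even_months(upfront_usd, monthly_usd, competitor_price_usd):
    """Months until cumulative spend reaches a competitor's one-time price."""
    if monthly_usd <= 0:
        raise ValueError("monthly_usd must be positive")
    return (competitor_price_usd - upfront_usd) / monthly_usd


upfront = sum(BOM_USD.values())
print(upfront, cumulative_cost(upfront, MONTHLY_CLOUD_USD, 36))
for name, price in [("OrCam MyEye 3 Pro", 4490.0), ("Envision Glasses (Pro)", 3499.0)]:
    print(name, round(break_even_months(upfront, MONTHLY_CLOUD_USD, price), 1))
```

Under these figures, three years of use costs about $730 in total, and the recurring cloud fee would need more than 18 years to close the gap with the Envision price and nearly 24 years for the OrCam price.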

9. Conclusions

The results of this paper show that combining scene graph generation with small language models is a promising and conceptually sound approach for assistive vision systems. The work shows that semantic richness is attainable without large language models and therefore without compromising low latency and computational efficiency. Future work will be directed at:

Optimizing the SGG model through quantization, backbone replacement, and knowledge distillation.
Moving inference to GPU or edge runtimes.
Measuring end-to-end performance in continuous streaming applications.
Strengthening semantics by aligning scene graph predicates with downstream NLG training data.

Overall, the architectural foundations and intermediate results indicate that the proposed pipeline is a promising basis for next-generation assistive technology.

Data and Code Availability

This study employs a multi-stage pipeline involving both vision-based scene understanding and natural language generation, each utilizing established public datasets. For scene graph generation (SGG), the system is based on the RelTR architecture [2], which is trained and evaluated using the Visual Genome dataset. Visual Genome is a large-scale, publicly available dataset containing real-world images annotated with object categories and pairwise relationships. For the natural language generation (NLG) stage, the WebNLG dataset was utilized [10]. WebNLG is a publicly available benchmark dataset designed for graph-to-text generation tasks, where structured relational data are converted into coherent natural language descriptions. In this work, WebNLG is used to guide and evaluate the text generation process from structured scene graph representations. No custom dataset was created for training in the current study. The implementations are based on publicly available open-source frameworks, with additional scripts developed for triplet formatting, pipeline integration, and output visualization. These supplementary components can be made available upon reasonable request.

This work reports preliminary experimental results for the proposed AI pipeline for assistive scene understanding and narration. The findings target qualitative validation of both the scene graph generation (SGG) and natural language generation (NLG) modules and identification of the key computational bottlenecks that currently constrain real-time performance; the system should therefore be regarded as a proof-of-concept assistive prototype. Large-scale benchmarking and quantitative latency optimization are left to future work. The main goal of the initial experimental stage was to confirm the functionality of the Relational Transformer (RelTR)-based scene graph generation pipeline [2]. The implemented model processes a single RGB input image and generates structured relational outputs as subject–predicate–object triplets. An example output is shown in Figure 1, in which the model produces a correct relational description, such as a person sitting on a chair. This qualitative finding supports the hypothesis that the end-to-end set-prediction formulation of RelTR can extract meaningful relational semantics beyond single-object detection [1].

Author Contributions

Conceptualization, A.R., A.S.K., S.M.M.A., and S.S.; methodology, A.R., A.S.K., S.M.M.A., and S.S.; software, A.R., A.S.K., and S.M.M.A.; validation, A.R.; formal analysis, A.R., A.S.K., and S.M.M.A.; investigation, A.R., A.S.K., and S.M.M.A.; resources, M.F.S. and A.S.; data curation, A.R., A.S.K., and S.M.M.A.; writing – original draft preparation, A.R., A.S.K., and S.M.M.A.; writing – review and editing, A.S., A.R., A.S.K., S.M.M.A., U.A., and S.R.; visualization, A.R., A.S.K., and S.M.M.A.; supervision, A.S., M.F.S., and N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in the COCO dataset repository at https://cocodataset.org (accessed on 1 May 2026) and in the WebNLG dataset repository on GitLab at https://gitlab.com/shimorina/webnlg-dataset (accessed on 1 May 2026).

Acknowledgments

The authors would like to thank NED University of Engineering and Technology, Karachi, Pakistan, for providing a supportive academic environment and access to research facilities that helped in completing this work. The authors also acknowledge Universiti Kuala Lumpur (UniKL), Kuala Lumpur, Malaysia, as a collaborating institution for its support in facilitating the APC for this publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229.
2. Cong, Y.; Yang, M.Y.; Rosenhahn, B. RelTR: Relation Transformer for Scene Graph Generation. arXiv 2022, arXiv:2201.11460.
3. Kasoju, A.; Vishwakarma, T.C. Optimizing Transformer Models for Low-Latency Inference: Techniques, Architectures, and Code Implementations. Int. J. Sci. Res. 2025, 14, 857–866.
4. Essam, M.; Khaflab, D.; Shedeed, H.; Tolba, M. Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis. Int. J. Intell. Comput. Inf. Sci. 2024, 24, 1–10.
5. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
6. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT: A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Vancouver, BC, Canada, 4 August 2019; pp. 36–40.
7. Hugging Face. Text Generation with T5 Models. Available online: https://huggingface.co/docs (accessed on 13 January 2026).
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
9. Colin, E.; Gardent, C.; Perez-Beltrachini, L. The WebNLG Challenge: Generating Text from RDF Data. In Proceedings of the 9th International Natural Language Generation Conference, Edinburgh, UK, 5–8 September 2016; pp. 1–5.
10. Klein, B.; Rahman, K.R.; Ghose, S. AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance. arXiv 2026, arXiv:2604.23909.
11. Hirota, Y.; Li, B.; Hachiuma, R.; Wu, Y.-H.; Ivanovic, B.; Nakashima, Y.; Pavone, M.; Choi, Y.; Wang, Y.-C.F.; Yang, C.-H.H. LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences. arXiv 2025, arXiv:2507.19362.
12. Sudhakaran, G.; Dhami, D.S.; Kersting, K.; Roth, S. Vision Relation Transformer for Unbiased Scene Graph Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023.
13. Shehzad, A.; Xia, F.; Abid, S.; Peng, C.; Yu, S.; Dongyu, Z. Graph Transformers: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2026, 1–20.
14. He, T.; Hu, X.; Wu, T.; Zhang, D.; Li, M.; Li, Y.-F.; Yu, F.R. Lifelong Scene Graph Generation. Pattern Recognit. 2026, 176, 113132.
15. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
16. Chang, J.; Wang, S.; Xu, H.; Chen, Z.; Yang, C.; Zhao, F. DETRDistill: A Universal Knowledge Distillation Framework for DETR-Families. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 14755–14764.
17. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81.
18. Gardent, C.; Shimorina, A.; Narayan, S.; Perez-Beltrachini, L. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 179–188.
19. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating Exposure Bias in Large Language Model Distillation: An Imitation Learning Approach. Neural Comput. Appl. 2025, 37, 12013–12029.

Figure 1. Complete Vocal-Eyes pipeline.

Figure 2. Block diagram of the ESP32-CAM input module and wireless transmission pipeline.

Figure 3. Block diagram of the edge-based Text-to-Speech (TTS) output pipeline utilizing Raspberry Pi 4.

Figure 4. RelTR scene graph generation pipeline showing predicted subject–predicate–object triplets.

Figure 5. Subject and object attention heatmaps produced by the triplet decoder, illustrating relational focus regions.

Figure 6. The interactive Gradio interface deployed on Hugging Face Spaces.

Figure 7. A screenshot of the live inference result using the provided representative example.

Figure 8. FastAPI-powered REST endpoint for the Vocal-Eyes inference pipeline, showing the POST/predict interface with multipart/form data support.

Figure 9. Encoder–decoder architecture of the T5 model.

Figure 10. Training loss convergence of the T5-based SLM fine-tuned on the WebNLG dataset.

Figure 11. Pipeline grounding evaluated on a 250-image subset. The triplet coverage of 51.73% indicates high factual density after pruning, while the near-zero hallucination rate (0.40%) supports safety for visually impaired users.

Figure 12. Comparison of external performance metrics. The high BERTScore F1 (89.54%) confirms the preservation of core semantics, while the low BLEU-4 and CIDEr scores reflect the intentional removal of literary human phrasing in favor of concise, factual descriptions.

Table 1. Performance trade-offs for ESP32-CAM image capture.

Resolution	Frame Rate	Approx. Memory/Frame	Transmission Stability
UXGA (1600 × 1200)	<2 fps	~384 KB (JPEG)	High latency, frequent buffer drops
SVGA (800 × 600)	~5–8 fps	~60 KB (JPEG)	Moderate latency, occasional drops
QVGA (320 × 240)	13 fps	~15 KB (JPEG)	Stable, minimal packet loss

Table 2. Performance comparison of audio output configurations.

Hardware Configuration	Memory Resources	Hardware Complexity	Audio Synthesis Performance
ESP32 Dev Board + MAX98357A DAC	520 KB SRAM	High (requires external I2S wiring)	Failed (severe buffer drops, single-word output)
Raspberry Pi 4 (8 GB)	8 GB LPDDR4	Low (utilizes onboard audio jack)	Excellent (smooth, continuous narration)

Table 3. Comparison between different models for text generation.

Model	Parameters	Architecture Type	Pre-Training Objective	Latency
GPT-2	124 M–1.5 B	Decoder Only	Next-token prediction	High
BART-base	~139 M	Encoder–Decoder	Denoising autoencoder	Medium
T5-small	~60 M	Encoder–Decoder	Text-to-text	Low–Medium
T5-base	220 M	Encoder–Decoder	Text-to-text	Low–Medium
T5-large	~770 M	Encoder–Decoder	Text-to-text	High

Table 4. Qualitative comparison of generated outputs.

Input Image	Extracted Pruned Facts	Vocal-Eyes Output	MS COCO Ground Truth
(Image 1)	[‘car’, ‘parked on’, ‘street’], [‘car’, ‘is’, ‘red’]	A red car is parked on the street.	A beautiful, shiny red car rests quietly on the sunny street.
(Image 2)	[‘man’, ‘holding’, ‘bottle’], [‘man’, ‘sitting on’, ‘chair’]	A man is sitting on a chair holding a bottle.	A casually dressed man relaxes in a seat with a beverage.
(Image 3)	[‘dog’, ‘catching’, ‘frisbee’], [‘dog’, ‘in’, ‘air’]	A dog is in the air catching a frisbee.	A playful puppy leaps high into the air to grab a flying disc.

Table 5. End-to-end system latency breakdown (per cycle).

Pipeline Phase	Description/Architectural Components	Average Latency (Seconds)
Image Capture and Routing	Local image acquisition via ESP32-CAM and local routing via FastAPI gateway	~0.50 s
Network Uplink and Inference	Payload transmission to Hugging Face Spaces and Transformer-based triplet extraction	~2.50 s
Network Downlink and Processing	Text payload return and local Text-to-Speech (TTS) synthesis initialization	~0.50 s
Intentional Cognitive Pacing	Managed system idle delay to govern audio output frequency	~1.50 s
Total End-to-End Latency	Complete closed-loop execution cycle	~5.00 s

Table 6. Vocal-Eyes hardware prototype BOM.

Component	Specification	Estimated Cost (USD)
Microcomputer	Raspberry Pi 4 Model B (8 GB)	$75.00
Vision Sensor	8 MP Pi Camera Module V2	$30.00
Audio Output	Open-ear Bone Conduction Headphones	$45.00
Power Supply	10,000 mAh Portable Power Bank	$25.00
Frame/Mounting	Custom 3D-Printed Chassis	$15.00
Connectivity	Wi-Fi/Bluetooth Module	(Integrated)
Total Upfront Hardware Cost		~$190.00

Table 7. Economic comparison of assistive vision devices.

Platform	Upfront Cost (USD)	Recurring Cost	Form Factor
Vocal-Eyes (Ours)	~$190	~$15/month (cloud)	Wearable (Smart Glasses)
OrCam MyEye 3 Pro	~$4490	None	Wearable (Magnetic Mount)
Envision Glasses (Pro)	~$3499	None	Wearable (Smart Glasses)
Smartphone Apps	$0	None	Handheld

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shabbir, A.; Afsheen, U.; Shirazi, M.F.; Rauf, A.; Abbas, S.M.M.; Saeed, S.; Khan, A.S.; Rizvi, S.; Saaludin, N. Vocal-Eyes: AI-Powered Smart Glasses for the Blind Using Transformer-Based Architecture and Scene Graph Generation. Technologies 2026, 14, 384. https://doi.org/10.3390/technologies14070384

