Vocal-Eyes: AI-Powered Smart Glasses for the Blind Using Transformer-Based Architecture and Scene Graph Generation
Abstract
1. Introduction
- Design of a low-cost, scene-aware smart-glasses assistive system that emphasizes contextual and relational interpretation rather than individual object recognition. This contribution introduces a cost-effective wearable platform that foregrounds scene-level understanding through an integrated multimodal inference engine [4], shifting the focus from single-object recognition to a holistic spatial and semantic view of the scene.
- Integration of structured scene interpretation with natural-language audio description, giving visually impaired users a meaningful and coherent account of their environment. We describe a pipeline that combines explicit scene graphs with generative language models [5] to produce contextually rich audio descriptions tailored to the needs of visually impaired users.
- A cloud-assisted architecture that supports periodic operation while reducing wearable hardware complexity and cost. The design offloads computationally heavy inference to a remote server, conserving battery life and reducing the physical size of the smart-glasses hardware without affecting latency [6].
- An empirical and cost-based comparison showing better descriptive quality and economic viability than current commercial assistive smart-glasses solutions. We present quantitative measurements from user research and a cost study that demonstrate better descriptive performance and lower total cost of ownership than existing market products.
2. System Model
2.1. Imaging and Communication Module
2.2. Audio Output Configuration
2.3. Software Environment
2.4. Multi-Stage Pipeline
2.4.1. Scene Graph Generation Model
- (a) RelTR Architecture
  - Backbone Feature Extractor: Initially ResNet-101; later optimized with Transformer-based backbones [4].
  - Transformer Encoder: Encodes global visual context using self-attention.
  - Entity Decoder: Predicts object entities within the scene.
  - Triplet Decoder: Infers subject–predicate–object relationships using coupled self-attention.
  - Matching and Loss: Hungarian matching with a set-prediction loss function.
- (b) Backbone Configuration
- (c) Triplet Filtering and Serialization
- (d) Identification of Critical Latency Bottleneck
  - A ResNet-101 backbone that is computationally expensive for feature extraction [4].
  - Self-attention operations in the Transformer encoder, whose cost scales quadratically with the number of input tokens [8].
  - Full 32-bit floating-point (FP32) precision, which increases compute time and memory footprint on resource-constrained hardware [9].
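The quadratic-attention and FP32 bottlenecks listed above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 800 × 600 input size and the stride-32 feature map are assumptions chosen for the example, not figures reported for the deployed model, and the function names are hypothetical. It counts the encoder tokens, then the bytes needed to hold one self-attention score matrix at 4 bytes (FP32) and 2 bytes (FP16) per value.

```python
import math


def encoder_tokens(width: int, height: int, stride: int = 32) -> int:
    """Feature-map positions passed to the encoder (stride is an assumption)."""
    return math.ceil(width / stride) * math.ceil(height / stride)


def attention_matrix_bytes(tokens: int, bytes_per_value: int) -> int:
    """Bytes for one tokens x tokens attention score matrix (single head)."""
    return tokens * tokens * bytes_per_value


n = encoder_tokens(800, 600)             # 25 * 19 = 475 tokens
fp32 = attention_matrix_bytes(n, 4)      # 4 bytes per FP32 value
fp16 = attention_matrix_bytes(n, 2)      # 2 bytes per FP16 value
doubled = attention_matrix_bytes(2 * n, 4)

print(f"tokens: {n}")
print(f"FP32 score matrix: {fp32} bytes, FP16: {fp16} bytes")
print(f"doubling the tokens multiplies the matrix size by {doubled // fp32}")
```

Halving the precision halves the memory for this matrix, while doubling the token count quadruples it, which is why input resolution and backbone stride matter as much as numeric precision.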
2.4.2. Natural Language Generation Model
2.5. System Scope: Proof-of-Concept vs. Real-Time Processing
3. Qualitative Analysis
3.1. Transformer Attention Mechanisms
3.2. System Deployment and Reproducibility
3.2.1. Accessibility
3.2.2. Verification
3.3. Representative Inference Example
3.4. Hardware-Compatible REST API Deployment
3.5. System Latency and Cognitive Overload
4. Experimental Setup and Metrics
4.1. Internal Factual Reliability Metrics
4.1.1. Triplet Coverage
4.1.2. Hallucination Rate
4.2. External Semantic and Stylistic Metrics
4.2.1. BERTScore (Semantic Fidelity)
4.2.2. BLEU-4 and ROUGE-L (Stylistic Overlap)
4.2.3. CIDEr (Consensus)
4.3. End-to-End Evaluation Strategy
5. Small Language Model Results for Triplet-to-Text Generation
6. Dataset and Training Behavior
7. Results and Analysis
7.1. Multi-Stage Pipeline Results Interpretation
7.2. Internal Factual Reliability
7.3. Semantic Fidelity vs. Stylistic Overlap
7.4. Factual Conciseness vs. Human Stylistic Variation
7.5. Latency: A System-Level Bottleneck
7.6. Effectiveness of SLM
7.7. Limits and Design Trade-Offs
7.8. Deployed End-to-End Latency Breakdown
7.9. Discussions
8. Cost Analysis and Economic Viability
8.1. Hardware Bill of Materials (BOM)
8.2. Recurring Operational and Cloud Costs
- Cloud GPU/Inference Hosting: ~$12.00–$18.00/month.
- Connectivity: Standard mobile tethering (smartphone hotspot); the glasses carry no dedicated cellular modem, which keeps weight and cost low.
- Maintenance: Open-source software is updated over the air (OTA) at no cost; hardware is replaced using standard modular parts.
8.3. Comparison with Commercial Alternatives
9. Conclusions
- Optimization of the SGG model through quantization, backbone replacement, and knowledge distillation.
- Migration of inference to GPU or edge runtimes.
- Measurement of end-to-end performance in continuous streaming applications.
- Strengthening semantics by aligning scene graph predicates with downstream NLG training data.
Data and Code Availability
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229.
- Cong, Y.; Yang, M.Y.; Rosenhahn, B. RelTR: Relation Transformer for Scene Graph Generation. arXiv 2022, arXiv:2201.11460.
- Kasoju, A.; Vishwakarma, T.C. Optimizing Transformer Models for Low-Latency Inference: Techniques, Architectures, and Code Implementations. Int. J. Sci. Res. 2025, 14, 857–866.
- Essam, M.; Khaflab, D.; Shedeed, H.; Tolba, M. Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis. Int. J. Intell. Comput. Inf. Sci. 2024, 24, 1–10.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT: A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Vancouver, BC, Canada, 4 August 2019; pp. 36–40.
- Hugging Face. Text Generation with T5 Models. Available online: https://huggingface.co/docs (accessed on 13 January 2026).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Colin, E.; Gardent, C.; Perez-Beltrachini, L. The WebNLG Challenge: Generating Text from RDF Data. In Proceedings of the 9th International Natural Language Generation Conference, Edinburgh, UK, 5–8 September 2016; pp. 1–5.
- Klein, B.; Rahman, K.R.; Ghose, S. AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance. arXiv 2026, arXiv:2604.23909.
- Hirota, Y.; Li, B.; Hachiuma, R.; Wu, Y.-H.; Ivanovic, B.; Nakashima, Y.; Pavone, M.; Choi, Y.; Wang, Y.-C.F.; Yang, C.-H.H. LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences. arXiv 2025, arXiv:2507.19362.
- Sudhakaran, G.; Dhami, D.S.; Kersting, K.; Roth, S. Vision Relation Transformer for Unbiased Scene Graph Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023.
- Shehzad, A.; Xia, F.; Abid, S.; Peng, C.; Yu, S.; Dongyu, Z. Graph Transformers: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2026, 1–20.
- He, T.; Hu, X.; Wu, T.; Zhang, D.; Li, M.; Li, Y.-F.; Yu, F.R. Lifelong Scene Graph Generation. Pattern Recognit. 2026, 176, 113132.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
- Chang, J.; Wang, S.; Xu, H.; Chen, Z.; Yang, C.; Zhao, F. DETRDistill: A Universal Knowledge Distillation Framework for DETR-Families. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 14755–14764.
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81.
- Gardent, C.; Shimorina, A.; Narayan, S.; Perez-Beltrachini, L. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 179–188.
- Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating Exposure Bias in Large Language Model Distillation: An Imitation Learning Approach. Neural Comput. Appl. 2025, 37, 12013–12029.

| Resolution | Frame Rate | Approx. Memory/Frame | Transmission Stability |
|---|---|---|---|
| UXGA (1600 × 1200) | <2 fps | ~384 KB (JPEG) | High latency, frequent buffer drops |
| SVGA (800 × 600) | ~5–8 fps | ~60 KB (JPEG) | Moderate latency, occasional drops |
| QVGA (320 × 240) | 13 fps | ~15 KB (JPEG) | Stable, minimal packet loss |
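The frame rates and per-frame sizes in the table above imply a sustained data rate for each resolution. The sketch below multiplies the two; it treats 2 fps as an upper bound for UXGA (reported as "<2 fps") and uses both ends of the reported SVGA range. The dictionary and function names are hypothetical, introduced only for this illustration.

```python
# (frames per second, approximate JPEG size in KB), taken from the table above.
# UXGA is reported as "<2 fps", so 2.0 is used here as an upper bound.
STREAMS = {
    "UXGA (upper bound)": (2.0, 384),
    "SVGA (low)": (5.0, 60),
    "SVGA (high)": (8.0, 60),
    "QVGA": (13.0, 15),
}


def throughput_kb_per_s(fps: float, kb_per_frame: float) -> float:
    """Sustained payload rate in KB/s for a given frame rate and frame size."""
    return fps * kb_per_frame


for name, (fps, size_kb) in STREAMS.items():
    print(f"{name}: {throughput_kb_per_s(fps, size_kb):.0f} KB/s")
```

Under these figures QVGA delivers the highest frame rate at the lowest sustained data rate (195 KB/s), which is consistent with its stable transmission in the table.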

| Hardware Configuration | Memory Resources | Hardware Complexity | Audio Synthesis Performance |
|---|---|---|---|
| ESP32 Dev Board + MAX98357A DAC | 520 KB SRAM | High (requires external I2S wiring) | Failed (severe buffer drops, single-word output) |
| Raspberry Pi 4 (8 GB) | 8 GB LPDDR4 | Low (utilizes onboard audio jack) | Excellent (smooth, continuous narration) |

| Model | Parameters | Architecture Type | Pre-Training Objective | Latency |
|---|---|---|---|---|
| GPT-2 | 124 M–1.5 B | Decoder Only | Next-token prediction | High |
| BART-base | ~139 M | Encoder–Decoder | Denoising autoencoder | Medium |
| T5-small | ~60 M | Encoder–Decoder | Text-to-text | Low–Medium |
| T5-base | 220 M | Encoder–Decoder | Text-to-text | Low–Medium |
| T5-large | ~770 M | Encoder–Decoder | Text-to-text | High |

| Input Image | Extracted Pruned Facts | Vocal-Eyes Output | MS COCO Ground Truth |
|---|---|---|---|
| (Image 1) | ['car', 'parked on', 'street'], ['car', 'is', 'red'] | A red car is parked on the street. | A beautiful, shiny red car rests quietly on the sunny street. |
| (Image 2) | ['man', 'holding', 'bottle'], ['man', 'sitting on', 'chair'] | A man is sitting on a chair holding a bottle. | A casually dressed man relaxes in a seat with a beverage. |
| (Image 3) | ['dog', 'catching', 'frisbee'], ['dog', 'in', 'air'] | A dog is in the air catching a frisbee. | A playful puppy leaps high into the air to grab a flying disc. |
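The pruned facts in the table above are subject–predicate–object triplets that must be flattened into a single string before a text-to-text model can consume them. The sketch below shows one plausible way to do that. The pruning rule (case-insensitive de-duplication) and the `subject | predicate | object` delimiter format are assumptions made for illustration; the actual filtering rules and prompt format used by the system are not shown in this section.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]


def dedupe_triplets(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Drop repeated triplets (case-insensitive), keeping first-seen order."""
    seen = set()
    kept: List[Triplet] = []
    for subj, pred, obj in triplets:
        key = (subj.strip().lower(), pred.strip().lower(), obj.strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def serialize(triplets: Iterable[Triplet], sep: str = " ; ") -> str:
    """Flatten triplets into one string (hypothetical delimiter format)."""
    return sep.join(f"{s} | {p} | {o}" for s, p, o in triplets)


raw = [
    ("car", "parked on", "street"),
    ("Car", "parked on", "street"),  # duplicate that differs only in case
    ("car", "is", "red"),
]
print(serialize(dedupe_triplets(raw)))
# prints: car | parked on | street ; car | is | red
```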

| Pipeline Phase | Description/Architectural Components | Average Latency (Seconds) |
|---|---|---|
| Image Capture and Routing | Local image acquisition via ESP32-CAM and local routing via FastAPI gateway | ~0.50 s |
| Network Uplink and Inference | Payload transmission to Hugging Face Spaces and Transformer-based triplet extraction | ~2.50 s |
| Network Downlink and Processing | Text payload return and local Text-to-Speech (TTS) synthesis initialization | ~0.50 s |
| Intentional Cognitive Pacing | Managed system idle delay to govern audio output frequency | ~1.50 s |
| Total End-to-End Latency | Complete closed-loop execution cycle | ~5.00 s |
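The total in the table above is the sum of its four phases. The short sketch below reproduces that sum and separates the intentional pacing delay from the time actually spent on capture, transmission, and inference. The phase keys are hypothetical labels for the table rows.

```python
# Average latency per phase in seconds, taken from the table above.
PHASES = {
    "capture_and_routing": 0.50,
    "uplink_and_inference": 2.50,
    "downlink_and_tts_init": 0.50,
    "cognitive_pacing": 1.50,
}

total = sum(PHASES.values())
processing_only = total - PHASES["cognitive_pacing"]
inference_share = PHASES["uplink_and_inference"] / total

print(f"total end-to-end latency: {total:.2f} s")
print(f"latency excluding intentional pacing: {processing_only:.2f} s")
print(f"uplink and inference share of total: {inference_share:.0%}")
```

On these figures the uplink and inference phase accounts for half of the 5.00 s cycle, and 1.50 s of the total is a deliberate delay rather than processing time.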

| Component | Specification | Estimated Cost (USD) |
|---|---|---|
| Microcomputer | Raspberry Pi 4 Model B (8 GB) | $75.00 |
| Vision Sensor | 8 MP Pi Camera Module V2 | $30.00 |
| Audio Output | Open-ear Bone Conduction Headphones | $45.00 |
| Power Supply | 10,000 mAh Portable Power Bank | $25.00 |
| Frame/Mounting | Custom 3D-Printed Chassis | $15.00 |
| Connectivity | Wi-Fi/Bluetooth Module | (Integrated) |
| Total Upfront Hardware Cost | | ~$190.00 |

| Platform | Upfront Cost (USD) | Recurring Cost | Form Factor |
|---|---|---|---|
| Vocal-Eyes (Ours) | ~$190 | ~$15/month (cloud) | Wearable (Smart Glasses) |
| OrCam MyEye 3 Pro | ~$4490 | None | Wearable (Magnetic Mount) |
| Envision Glasses (Pro) | ~$3499 | None | Wearable (Smart Glasses) |
| Smartphone Apps | $0 | None | Handheld |
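Because Vocal-Eyes trades a low upfront price for a recurring cloud fee, the comparison above depends on how long the device is used. The sketch below computes cumulative cost over time and the number of months after which Vocal-Eyes would have cost as much as each commercial device, using the table's figures (~$190 upfront, ~$15/month). It is a simplified model that ignores mobile data charges, hardware replacement, and price changes; the function names are hypothetical.

```python
def cumulative_cost(upfront: float, monthly: float, months: int) -> float:
    """Total spend after a number of months of use."""
    return upfront + monthly * months


def break_even_months(upfront: float, monthly: float, alt_upfront: float) -> float:
    """Months until cumulative cost equals a one-off alternative's price."""
    return (alt_upfront - upfront) / monthly


VOCAL_EYES_UPFRONT = 190.0   # from the table above
VOCAL_EYES_MONTHLY = 15.0    # from the table above (~$15/month)

print(f"cost after 1 year: ${cumulative_cost(VOCAL_EYES_UPFRONT, VOCAL_EYES_MONTHLY, 12):.2f}")
for name, price in [("Envision Glasses (Pro)", 3499.0), ("OrCam MyEye 3 Pro", 4490.0)]:
    months = break_even_months(VOCAL_EYES_UPFRONT, VOCAL_EYES_MONTHLY, price)
    print(f"{name}: break-even after {months:.1f} months ({months / 12:.1f} years)")
```

Under these assumptions the break-even point falls roughly 18 years out for Envision Glasses and roughly 24 years out for OrCam MyEye 3 Pro.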
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Shabbir, A.; Afsheen, U.; Shirazi, M.F.; Rauf, A.; Abbas, S.M.M.; Saeed, S.; Khan, A.S.; Rizvi, S.; Saaludin, N. Vocal-Eyes: AI-Powered Smart Glasses for the Blind Using Transformer-Based Architecture and Scene Graph Generation. Technologies 2026, 14, 384. https://doi.org/10.3390/technologies14070384

