Human–AI Collaboration for Remote Sighted Assistance: Perspectives from the LLM Era
Abstract
1. Introduction
2. Background and Related Works
2.1. Navigational Aids for People with VI
2.2. RSA Services for People with VI
2.3. Use of CV in Navigation for People with VI
2.4. Collaboration between Human and AI
3. Identifying Challenges in RSA: Literature Review
3.1. Challenges in Localization and Orientation
3.1.1. Unfamiliarity with the Environment
3.1.2. Scarcity and Limitation of Maps
3.1.3. Inaccurate GPS
3.2. Challenges in Acquiring and Detecting Obstacle and Surrounding Information
3.2.1. Narrow View of the Camera
3.2.2. Limitation of Using Video Feed
3.3. Challenges in Delivering Information and Interacting with Users
3.4. Network and External Issues
4. Identifying Challenges in RSA: User Interview Study
4.1. Common Navigation Scenarios
4.2. Challenging Outdoor Navigation Experiences
4.3. Challenging Indoor Navigation Experiences
4.4. Users’ Understanding of Problems: Insufficient Maps, RSA Agents’ Unfamiliarity with the Area, and Limited Camera View
4.5. Users Helping and Collaborating with RSA
5. Identified Computer Vision Problems in RSA
6. Emerging Human–AI Collaboration Problems in RSA
6.1. Emerging Problem 1: Making Object Detection and Obstacle Avoidance Algorithms Blind Aware
6.2. Emerging Problem 2: Localizing Users under Poor Networks
6.3. Emerging Problem 3: Recognizing Digital Content on Digital Displays
6.4. Emerging Problem 4: Recognizing Texts on Irregular Surfaces
6.5. Emerging Problem 5: Predicting the Trajectories of Out-of-Frame Pedestrians or Objects
6.6. Emerging Problem 6: Expanding the Field-of-View of Live Camera Feed
6.7. Emerging Problem 7: Stabilizing Live Camera Feeds for Task-Specific Needs
6.8. Emerging Problem 8: Reconstructing High-Resolution Live Video Feeds
6.9. Emerging Problem 9: Relighting and Removing Unwanted Artifacts on Live Video
6.10. Emerging Problem 10: Describing Hierarchical Information of Live Camera Feeds
7. Integrating RSA with Large Language Models (LLMs)
7.1. AI-Powered Visual Assistance and LLMs
7.2. Opportunities for Human–AI Collaboration in RSA with LLMs
7.2.1. Human Agent Supporting LLM-Based AI
7.2.2. LLM-Based AI Supporting Human Agent
7.3. Human-Centered AI: Future of Visual Prosthetics
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
| --- | --- |
| RSA | remote sighted assistance |
| VI | visual impairments |
| PVI | people with visual impairments |
| LLM | large language model |
| CV | computer vision |
| AI | artificial intelligence |
| AR | augmented reality |
| SLAM | simultaneous localization and mapping |
| DOF | degrees of freedom |
| LiDAR | light detection and ranging |
| OCR | optical character recognition |
| IMU | inertial measurement unit |
| FOV | field of view |
| O&M | orientation and mobility |
| LCD | liquid crystal display |
| HDR | high dynamic range |
| VQA | visual question answering |
References
- Lee, S.; Reddie, M.; Tsai, C.; Beck, J.; Rosson, M.B.; Carroll, J.M. The Emerging Professional Practice of Remote Sighted Assistance for People with Visual Impairments. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–12. [Google Scholar] [CrossRef]
- Bigham, J.P.; Jayant, C.; Miller, A.; White, B.; Yeh, T. VizWiz::LocateIt—Enabling blind people to locate objects in their environment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 65–72. [Google Scholar] [CrossRef]
- Holton, B. BeSpecular: A new remote assistant service. Access World Mag. 2016, 17. Available online: https://www.afb.org/aw/17/7/15313 (accessed on 2 June 2024).
- Holton, B. Crowdviz: Remote video assistance on your iphone. AFB Access World Mag. 2015. Available online: https://www.afb.org/aw/16/11/15507 (accessed on 2 June 2024).
- TapTapSee—Assistive Technology for the Blind and Visually Impaired. 2024. Available online: https://taptapseeapp.com (accessed on 15 May 2024).
- Be My Eyes—See the World Together. 2024. Available online: https://www.bemyeyes.com (accessed on 15 May 2024).
- Aira, a Visual Interpreting Service. 2024. Available online: https://aira.io (accessed on 15 May 2024).
- Petrie, H.; Johnson, V.; Strothotte, T.; Raab, A.; Michel, R.; Reichert, L.; Schalt, A. MoBIC: An aid to increase the independent mobility of blind travellers. Br. J. Vis. Impair. 1997, 15, 63–66. [Google Scholar] [CrossRef]
- Bujacz, M.; Baranski, P.; Moranski, M.; Strumillo, P.; Materka, A. Remote guidance for the blind—A proposed teleassistance system and navigation trials. In Proceedings of the Conference on Human System Interactions, Krakow, Poland, 25–27 May 2008; pp. 888–892. [Google Scholar] [CrossRef]
- Baranski, P.; Strumillo, P. Field trials of a teleassistance system for the visually impaired. In Proceedings of the 8th International Conference on Human System Interaction, Warsaw, Poland, 25–27 June 2015; pp. 173–179. [Google Scholar] [CrossRef]
- Scheggi, S.; Talarico, A.; Prattichizzo, D. A remote guidance system for blind and visually impaired people via vibrotactile haptic feedback. In Proceedings of the 22nd Mediterranean Conference on Control and Automation, Palermo, Italy, 16–19 June 2014; pp. 20–23. [Google Scholar] [CrossRef]
- Kutiyanawala, A.; Kulyukin, V.; Nicholson, J. Teleassistance in accessible shopping for the blind. In Proceedings of the International Conference on Internet Computing, Hong Kong, China, 17–18 September 2011; p. 1. [Google Scholar]
- Kamikubo, R.; Kato, N.; Higuchi, K.; Yonetani, R.; Sato, Y. Support Strategies for Remote Guides in Assisting People with Visual Impairments for Effective Indoor Navigation. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–12. [Google Scholar] [CrossRef]
- Lee, S.; Yu, R.; Xie, J.; Billah, S.M.; Carroll, J.M. Opportunities for Human-AI Collaboration in Remote Sighted Assistance. In Proceedings of the 27th International Conference on Intelligent User Interfaces, Helsinki, Finland, 21–25 March 2022; pp. 63–78. [Google Scholar] [CrossRef]
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A Survey on Multimodal Large Language Models. arXiv 2023, arXiv:2306.13549. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Announcing ‘Be My AI’, Soon Available for Hundreds of Thousands of Be My Eyes Users. 2024. Available online: https://www.bemyeyes.com/blog/announcing-be-my-ai (accessed on 15 May 2024).
- Tversky, B. Cognitive maps, cognitive collages, and spatial mental models. In Proceedings of the European Conference on Spatial Information Theory; Springer: Berlin/Heidelberg, Germany, 1993; pp. 14–24. [Google Scholar] [CrossRef]
- Rafian, P.; Legge, G.E. Remote Sighted Assistants for Indoor Location Sensing of Visually Impaired Pedestrians. ACM Trans. Appl. Percept. 2017, 14, 1–14. [Google Scholar] [CrossRef]
- Real, S.; Araujo, Á. Navigation Systems for the Blind and Visually Impaired: Past Work, Challenges, and Open Problems. Sensors 2019, 19, 3404. [Google Scholar] [CrossRef] [PubMed]
- OpenStreetMap. 2024. Available online: https://www.openstreetmap.org (accessed on 15 May 2024).
- BlindSquare. 2024. Available online: https://www.blindsquare.com (accessed on 15 May 2024).
- Sendero Group: The Seeing Eye GPS App. 2024. Available online: https://www.senderogroup.com/products/shopseeingeyegps.html (accessed on 15 May 2024).
- Microsoft Soundscape—A Map Delivered in 3D Sound. 2024. Available online: https://www.microsoft.com/en-us/research/product/soundscape (accessed on 15 May 2024).
- Autour. 2024. Available online: http://autour.mcgill.ca (accessed on 15 May 2024).
- Saha, M.; Fiannaca, A.J.; Kneisel, M.; Cutrell, E.; Morris, M.R. Closing the Gap: Designing for the Last-Few-Meters Wayfinding Problem for People with Visual Impairments. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 222–235. [Google Scholar] [CrossRef]
- GPS Accuracy. 2024. Available online: https://www.gps.gov/systems/gps/performance/accuracy (accessed on 15 May 2024).
- Sato, D.; Oh, U.; Naito, K.; Takagi, H.; Kitani, K.M.; Asakawa, C. NavCog3: An Evaluation of a Smartphone-Based Blind Indoor Navigation Assistant with Semantic Features in a Large-Scale Environment. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, New York, NY, USA, 29 October–1 November 2017; pp. 270–279. [Google Scholar] [CrossRef]
- Legge, G.E.; Beckmann, P.J.; Tjan, B.S.; Havey, G.; Kramer, K.; Rolkosky, D.; Gage, R.; Chen, M.; Puchakayala, S.; Rangarajan, A. Indoor navigation by people with visual impairment using a digital sign system. PLoS ONE 2013, 8, e76783. [Google Scholar] [CrossRef] [PubMed]
- Ganz, A.; Schafer, J.M.; Tao, Y.; Wilson, C.; Robertson, M. PERCEPT-II: Smartphone based indoor navigation system for the blind. In Proceedings of the 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014; pp. 3662–3665. [Google Scholar] [CrossRef]
- Ganz, A.; Gandhi, S.R.; Schafer, J.M.; Singh, T.; Puleo, E.; Mullett, G.; Wilson, C. PERCEPT: Indoor navigation for the blind and visually impaired. In Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Boston, MA, USA, 30 August–3 September 2011; pp. 856–859. [Google Scholar] [CrossRef]
- Dokmanić, I.; Parhizkar, R.; Walther, A.; Lu, Y.M.; Vetterli, M. Acoustic echoes reveal room shape. Proc. Natl. Acad. Sci. USA 2013, 110, 12186–12191. [Google Scholar] [CrossRef] [PubMed]
- Guerreiro, J.; Ahmetovic, D.; Sato, D.; Kitani, K.; Asakawa, C. Airport Accessibility and Navigation Assistance for People with Visual Impairments. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; p. 16. [Google Scholar] [CrossRef]
- Rodrigo, R.; Zouqi, M.; Chen, Z.; Samarabandu, J. Robust and Efficient Feature Tracking for Indoor Navigation. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 658–671. [Google Scholar] [CrossRef] [PubMed]
- Li, K.J.; Lee, J. Indoor spatial awareness initiative and standard for indoor spatial data. In Proceedings of the IROS Workshop on Standardization for Service Robot, Taipei, Taiwan, 18–22 October 2010; Volume 18. [Google Scholar]
- Elmannai, W.; Elleithy, K.M. Sensor-Based Assistive Devices for Visually-Impaired People: Current Status, Challenges, and Future Directions. Sensors 2017, 17, 565. [Google Scholar] [CrossRef]
- Gleason, C.; Ahmetovic, D.; Savage, S.; Toxtli, C.; Posthuma, C.; Asakawa, C.; Kitani, K.M.; Bigham, J.P. Crowdsourcing the Installation and Maintenance of Indoor Localization Infrastructure to Support Blind Navigation. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–25. [Google Scholar] [CrossRef]
- Fallah, N.; Apostolopoulos, I.; Bekris, K.E.; Folmer, E. The user as a sensor: Navigating users with visual impairments in indoor spaces using tactile landmarks. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 425–432. [Google Scholar] [CrossRef]
- Bai, Y.; Jia, W.; Zhang, H.; Mao, Z.H.; Sun, M. Landmark-based indoor positioning for visually impaired individuals. In Proceedings of the 12th International Conference on Signal Processing, Hangzhou, China, 19–23 October 2014; pp. 668–671. [Google Scholar] [CrossRef]
- Pérez, J.E.; Arrue, M.; Kobayashi, M.; Takagi, H.; Asakawa, C. Assessment of Semantic Taxonomies for Blind Indoor Navigation Based on a Shopping Center Use Case. In Proceedings of the 14th Web for All Conference, Perth, WA, Australia, 2–4 April 2017; pp. 1–4. [Google Scholar] [CrossRef]
- Carroll, J.M.; Lee, S.; Reddie, M.; Beck, J.; Rosson, M.B. Human-Computer Synergies in Prosthetic Interactions. IxD&A 2020, 44, 29–52. [Google Scholar] [CrossRef]
- Garaj, V.; Jirawimut, R.; Ptasinski, P.; Cecelja, F.; Balachandran, W. A system for remote sighted guidance of visually impaired pedestrians. Br. J. Vis. Impair. 2003, 21, 55–63. [Google Scholar] [CrossRef]
- Holmes, N.; Prentice, K. iPhone video link facetime as an orientation tool: Remote O&M for people with vision impairment. Int. J. Orientat. Mobil. 2015, 7, 60–68. [Google Scholar] [CrossRef]
- Lasecki, W.S.; Wesley, R.; Nichols, J.; Kulkarni, A.; Allen, J.F.; Bigham, J.P. Chorus: A crowd-powered conversational assistant. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, St. Andrews, Scotland, UK, 8–11 October 2013; pp. 151–162. [Google Scholar] [CrossRef]
- Chaudary, B.; Paajala, I.J.; Keino, E.; Pulli, P. Tele-guidance Based Navigation System for the Visually Impaired and Blind Persons. In Proceedings of the eHealth 360°— International Summit on eHealth; Springer: Berlin/Heidelberg, Germany, 2016; Volume 181, pp. 9–16. [Google Scholar] [CrossRef]
- Lasecki, W.S.; Murray, K.I.; White, S.; Miller, R.C.; Bigham, J.P. Real-time crowd control of existing interfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 23–32. [Google Scholar] [CrossRef]
- Zhong, Y.; Lasecki, W.S.; Brady, E.L.; Bigham, J.P. RegionSpeak: Quick Comprehensive Spatial Descriptions of Complex Images for Blind Users. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Republic of Korea, 18–23 April 2015; pp. 2353–2362. [Google Scholar] [CrossRef]
- Avila, M.; Wolf, K.; Brock, A.M.; Henze, N. Remote Assistance for Blind Users in Daily Life: A Survey about Be My Eyes. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Island, Greece, 29 June–1 July 2016; p. 85. [Google Scholar] [CrossRef]
- Brady, E.L.; Bigham, J.P. Crowdsourcing Accessibility: Human-Powered Access Technologies. Found. Trends Hum. Comput. Interact. 2015, 8, 273–372. [Google Scholar] [CrossRef]
- Burton, M.A.; Brady, E.L.; Brewer, R.; Neylan, C.; Bigham, J.P.; Hurst, A. Crowdsourcing subjective fashion advice using VizWiz: Challenges and opportunities. In Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, Boulder, CO, USA, 22–24 October 2012; pp. 135–142. [Google Scholar] [CrossRef]
- Nguyen, B.J.; Kim, Y.; Park, K.; Chen, A.J.; Chen, S.; Van Fossan, D.; Chao, D.L. Improvement in patient-reported quality of life outcomes in severely visually impaired individuals using the Aira assistive technology system. Transl. Vis. Sci. Technol. 2018, 7, 30. [Google Scholar] [CrossRef] [PubMed]
- Budrionis, A.; Plikynas, D.; Daniušis, P.; Indrulionis, A. Smartphone-based computer vision travelling aids for blind and visually impaired individuals: A systematic review. Assist. Technol. 2020, 34, 178–194. [Google Scholar] [CrossRef] [PubMed]
- Tekin, E.; Coughlan, J.M. A Mobile Phone Application Enabling Visually Impaired Users to Find and Read Product Barcodes. In Proceedings of the International Conference on Computers for Handicapped Persons; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6180, pp. 290–295. [Google Scholar] [CrossRef]
- Ko, E.; Kim, E.Y. A Vision-Based Wayfinding System for Visually Impaired People Using Situation Awareness and Activity-Based Instructions. Sensors 2017, 17, 1882. [Google Scholar] [CrossRef] [PubMed]
- Elgendy, M.; Herperger, M.; Guzsvinecz, T.; Sik-Lányi, C. Indoor Navigation for People with Visual Impairment using Augmented Reality Markers. In Proceedings of the 10th IEEE International Conference on Cognitive Infocommunications, Naples, Italy, 23–25 October 2019; pp. 425–430. [Google Scholar] [CrossRef]
- Manduchi, R.; Kurniawan, S.; Bagherinia, H. Blind guidance using mobile computer vision: A usability study. In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility, Orlando, FL, USA, 25–27 October 2010; pp. 241–242. [Google Scholar] [CrossRef]
- McDaniel, T.; Kahol, K.; Villanueva, D.; Panchanathan, S. Integration of RFID and computer vision for remote object perception for individuals who are blind. In Proceedings of the 1st International ICST Conference on Ambient Media and Systems, ICST, Quebec, QC, Canada, 11–14 February 2008; p. 7. [Google Scholar] [CrossRef]
- Kayukawa, S.; Higuchi, K.; Guerreiro, J.; Morishima, S.; Sato, Y.; Kitani, K.; Asakawa, C. BBeep: A Sonic Collision Avoidance System for Blind Travellers and Nearby Pedestrians. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019; p. 52. [Google Scholar] [CrossRef]
- Presti, G.; Ahmetovic, D.; Ducci, M.; Bernareggi, C.; Ludovico, L.A.; Baratè, A.; Avanzini, F.; Mascetti, S. WatchOut: Obstacle Sonification for People with Visual Impairment or Blindness. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 402–413. [Google Scholar] [CrossRef]
- Liu, Y.; Stiles, N.R.; Meister, M. Augmented reality powers a cognitive assistant for the blind. eLife 2018, 7, e37841. [Google Scholar] [CrossRef]
- Guerreiro, J.; Sato, D.; Asakawa, S.; Dong, H.; Kitani, K.M.; Asakawa, C. CaBot: Designing and Evaluating an Autonomous Navigation Robot for Blind People. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 68–82. [Google Scholar] [CrossRef]
- Banovic, N.; Franz, R.L.; Truong, K.N.; Mankoff, J.; Dey, A.K. Uncovering information needs for independent spatial learning for users who are visually impaired. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, Bellevue, WA, USA, 21–23 October 2013; pp. 1–8. [Google Scholar] [CrossRef]
- ARKit 6. 2024. Available online: https://developer.apple.com/augmented-reality/arkit (accessed on 15 May 2024).
- ARCore. 2024. Available online: https://developers.google.com/ar (accessed on 15 May 2024).
- Yoon, C.; Louie, R.; Ryan, J.; Vu, M.; Bang, H.; Derksen, W.; Ruvolo, P. Leveraging Augmented Reality to Create Apps for People with Visual Disabilities: A Case Study in Indoor Navigation. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 210–221. [Google Scholar] [CrossRef]
- Aldas, N.D.T.; Lee, S.; Lee, C.; Rosson, M.B.; Carroll, J.M.; Narayanan, V. AIGuide: An Augmented Reality Hand Guidance Application for People with Visual Impairments. In Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility, Virtual Event, Greece, 26–28 October 2020; pp. 1–13. [Google Scholar] [CrossRef]
- Rocha, S.; Lopes, A. Navigation Based Application with Augmented Reality and Accessibility. In Proceedings of the Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–9. [Google Scholar] [CrossRef]
- Verma, P.; Agrawal, K.; Sarasvathi, V. Indoor Navigation Using Augmented Reality. In Proceedings of the 4th International Conference on Virtual and Augmented Reality Simulations, Sydney, NSW, Australia, 14–16 February 2020; pp. 58–63. [Google Scholar] [CrossRef]
- Fusco, G.; Coughlan, J.M. Indoor localization for visually impaired travelers using computer vision on a smartphone. In Proceedings of the 17th Web for All Conference, Taipei, Taiwan, 20–21 April 2020; pp. 1–11. [Google Scholar] [CrossRef]
- Xie, J.; Reddie, M.; Lee, S.; Billah, S.M.; Zhou, Z.; Tsai, C.; Carroll, J.M. Iterative Design and Prototyping of Computer Vision Mediated Remote Sighted Assistance. ACM Trans. Comput. Hum. Interact. 2022, 29, 1–40. [Google Scholar] [CrossRef]
- Naseer, M.; Khan, S.H.; Porikli, F. Indoor Scene Understanding in 2.5/3D for Autonomous Agents: A Survey. IEEE Access 2019, 7, 1859–1887. [Google Scholar] [CrossRef]
- Jafri, R.; Ali, S.A.; Arabnia, H.R.; Fatima, S. Computer vision-based object recognition for the visually impaired in an indoors environment: A survey. Vis. Comput. 2014, 30, 1197–1222. [Google Scholar] [CrossRef]
- Brady, E.L.; Morris, M.R.; Zhong, Y.; White, S.; Bigham, J.P. Visual challenges in the everyday lives of blind people. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, Paris, France, 27 April–2 May 2013; pp. 2117–2126. [Google Scholar] [CrossRef]
- Branson, S.; Wah, C.; Schroff, F.; Babenko, B.; Welinder, P.; Perona, P.; Belongie, S.J. Visual Recognition with Humans in the Loop. In Proceedings of the 11th European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314, pp. 438–451. [Google Scholar] [CrossRef]
- Sinha, S.N.; Steedly, D.; Szeliski, R.; Agrawala, M.; Pollefeys, M. Interactive 3D architectural modeling from unordered photo collections. ACM Trans. Graph. 2008, 27, 159. [Google Scholar] [CrossRef]
- Kowdle, A.; Chang, Y.; Gallagher, A.C.; Chen, T. Active learning for piecewise planar 3D reconstruction. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 929–936. [Google Scholar] [CrossRef]
- Alzantot, M.; Youssef, M. CrowdInside: Automatic construction of indoor floorplans. In Proceedings of the 2012 International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, USA, 6–9 November 2012; pp. 99–108. [Google Scholar] [CrossRef]
- Pradhan, S.; Baig, G.; Mao, W.; Qiu, L.; Chen, G.; Yang, B. Smartphone-based Acoustic Indoor Space Mapping. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–26. [Google Scholar] [CrossRef]
- Chen, S.; Li, M.; Ren, K.; Qiao, C. Crowd Map: Accurate Reconstruction of Indoor Floor Plans from Crowdsourced Sensor-Rich Videos. In Proceedings of the 35th IEEE International Conference on Distributed Computing Systems, Columbus, OH, USA, 29 June–2 July 2015; pp. 1–10. [Google Scholar] [CrossRef]
- Hara, K.; Azenkot, S.; Campbell, M.; Bennett, C.L.; Le, V.; Pannella, S.; Moore, R.; Minckler, K.; Ng, R.H.; Froehlich, J.E. Improving Public Transit Accessibility for Blind Riders by Crowdsourcing Bus Stop Landmark Locations with Google Street View: An Extended Analysis. ACM Trans. Access. Comput. 2015, 6, 1–23. [Google Scholar] [CrossRef]
- Saha, M.; Saugstad, M.; Maddali, H.T.; Zeng, A.; Holland, R.; Bower, S.; Dash, A.; Chen, S.; Li, A.; Hara, K.; et al. Project Sidewalk: A Web-based Crowdsourcing Tool for Collecting Sidewalk Accessibility Data At Scale. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019; p. 62. [Google Scholar] [CrossRef]
- Miyata, A.; Okugawa, K.; Yamato, Y.; Maeda, T.; Murayama, Y.; Aibara, M.; Furuichi, M.; Murayama, Y. A Crowdsourcing Platform for Constructing Accessibility Maps Supporting Multiple Participation Modes. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Extended Abstracts, Yokohama, Japan, 8–13 May 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Guy, R.T.; Truong, K.N. CrossingGuard: Exploring information content in navigation aids for visually impaired pedestrians. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 405–414. [Google Scholar] [CrossRef]
- Budhathoki, N.R.; Haythornthwaite, C. Motivation for open collaboration: Crowd and community models and the case of OpenStreetMap. Am. Behav. Sci. 2013, 57, 548–575. [Google Scholar] [CrossRef]
- Murata, M.; Ahmetovic, D.; Sato, D.; Takagi, H.; Kitani, K.M.; Asakawa, C. Smartphone-based Indoor Localization for Blind Navigation across Building Complexes. In Proceedings of the 2018 IEEE International Conference on Pervasive Computing and Communications, Athens, Greece, 19–23 March 2018; pp. 1–10. [Google Scholar] [CrossRef]
- Barros, A.M.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A Comprehensive Survey of Visual SLAM Algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
- Wu, Y.; Tang, F.; Li, H. Image-based camera localization: An overview. Vis. Comput. Ind. Biomed. Art 2018, 1, 1–13. [Google Scholar] [CrossRef]
- Magliani, F.; Fontanini, T.; Prati, A. Landmark Recognition: From Small-Scale to Large-Scale Retrieval. In Recent Advances in Computer Vision—Theories and Applications; Springer: Berlin/Heidelberg, Germany, 2019; Volume 804, pp. 237–259. [Google Scholar] [CrossRef]
- Yasuda, Y.D.V.; Martins, L.E.G.; Cappabianco, F.A.M. Autonomous Visual Navigation for Mobile Robots: A Systematic Literature Review. ACM Comput. Surv. 2021, 53, 1–34. [Google Scholar] [CrossRef]
- Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text Recognition in the Wild: A Survey. ACM Comput. Surv. 2022, 54, 1–35. [Google Scholar] [CrossRef]
- Wang, D.; Liu, Z.; Shao, S.; Wu, X.; Chen, W.; Li, Z. Monocular Depth Estimation: A Survey. In Proceedings of the 49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, Singapore, 16–19 October 2023; pp. 1–7. [Google Scholar] [CrossRef]
- Ham, C.C.W.; Lucey, S.; Singh, S.P.N. Absolute Scale Estimation of 3D Monocular Vision on Smart Devices. In Mobile Cloud Visual Media Computing; Springer: Berlin/Heidelberg, Germany, 2015; pp. 329–353. [Google Scholar] [CrossRef]
- Yu, R.; Wang, J.; Ma, S.; Huang, S.X.; Krishnan, G.; Wu, Y. Be Real in Scale: Swing for True Scale in Dual Camera Mode. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, Sydney, Australia, 16–20 October 2023; pp. 1231–1239. [Google Scholar] [CrossRef]
- Hunaiti, Z.; Garaj, V.; Balachandran, W. A remote vision guidance system for visually impaired pedestrians. J. Navig. 2006, 59, 497–504. [Google Scholar] [CrossRef]
- Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
- Rudenko, A.; Palmieri, L.; Herman, M.; Kitani, K.M.; Gavrila, D.M.; Arras, K.O. Human motion trajectory prediction: A survey. Int. J. Robot. Res. 2020, 39. [Google Scholar] [CrossRef]
- Yu, R.; Zhou, Z. Towards Robust Human Trajectory Prediction in Raw Videos. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 28–30 September 2021; pp. 8059–8066. [Google Scholar] [CrossRef]
- Ma, L.; Georgoulis, S.; Jia, X.; Gool, L.V. FoV-Net: Field-of-View Extrapolation Using Self-Attention and Uncertainty. IEEE Robot. Autom. Lett. 2021, 6, 4321–4328. [Google Scholar] [CrossRef]
- Yu, R.; Liu, J.; Zhou, Z.; Huang, S.X. NeRF-Enhanced Outpainting for Faithful Field-of-View Extrapolation. arXiv 2023, arXiv:2309.13240. [Google Scholar]
- Guilluy, W.; Oudre, L.; Beghdadi, A. Video stabilization: Overview, challenges and perspectives. Signal Process. Image Commun. 2021, 90, 116015. [Google Scholar] [CrossRef]
- Lee, S.; Reddie, M.; Gurdasani, K.; Wang, X.; Beck, J.; Rosson, M.B.; Carroll, J.M. Conversations for Vision: Remote Sighted Assistants Helping People with Visual Impairments. arXiv 2018, arXiv:1812.00148. [Google Scholar]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Lin, X.; Ren, P.; Xiao, Y.; Chang, X.; Hauptmann, A. Person Search Challenges and Solutions: A Survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Virtual/Montreal, Canada, 19–27 August 2021; pp. 4500–4507. [Google Scholar] [CrossRef]
- Yu, R.; Du, D.; LaLonde, R.; Davila, D.; Funk, C.; Hoogs, A.; Clipp, B. Cascade Transformers for End-to-End Person Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7257–7266. [Google Scholar] [CrossRef]
- Jain, V.; Al-Turjman, F.; Chaudhary, G.; Nayar, D.; Gupta, V.; Kumar, A. Video captioning: A review of theory, techniques and practices. Multim. Tools Appl. 2022, 81, 35619–35653. [Google Scholar] [CrossRef]
- Liu, H.; Ruan, Z.; Zhao, P.; Dong, C.; Shang, F.; Liu, Y.; Yang, L.; Timofte, R. Video super-resolution based on deep learning: A comprehensive survey. Artif. Intell. Rev. 2022, 55, 5981–6035. [Google Scholar] [CrossRef]
- Einabadi, F.; Guillemaut, J.; Hilton, A. Deep Neural Models for Illumination Estimation and Relighting: A Survey. Comput. Graph. Forum 2021, 40, 315–331. [Google Scholar] [CrossRef]
- Hunaiti, Z.; Garaj, V.; Balachandran, W.; Cecelja, F. Use of remote vision in navigation of visually impaired pedestrians. In Proceedings of the International Congress; Elsevier: Amsterdam, The Netherlands, 2005; Volume 1282, pp. 1026–1030. [Google Scholar] [CrossRef]
- Garaj, V.; Hunaiti, Z.; Balachandran, W. The effects of video image frame rate on the environmental hazards recognition performance in using remote vision to navigate visually impaired pedestrians. In Proceedings of the 4th International Conference on Mobile Technology, Applications, and Systems and the 1st International Symposium on Computer Human Interaction in Mobile Technology, Singapore, 10–12 September 2007; pp. 207–213. [Google Scholar] [CrossRef]
- Garaj, V.; Hunaiti, Z.; Balachandran, W. Using Remote Vision: The Effects of Video Image Frame Rate on Visual Object Recognition Performance. IEEE Trans. Syst. Man Cybern. Part A 2010, 40, 698–707. [Google Scholar] [CrossRef]
- Baranski, P.; Polanczyk, M.; Strumillo, P. A remote guidance system for the blind. In Proceedings of the 12th IEEE International Conference on e-Health Networking, Applications and Services, Lyon, France, 1–3 July 2010; pp. 386–390. [Google Scholar] [CrossRef]
- Xie, J.; Yu, R.; Lee, S.; Lyu, Y.; Billah, S.M.; Carroll, J.M. Helping Helpers: Supporting Volunteers in Remote Sighted Assistance with Augmented Reality Maps. In Proceedings of the Designing Interactive Systems Conference, Virtual Event, Australia, 13–17 June 2022; pp. 881–897. [Google Scholar] [CrossRef]
- Ham, C.C.W.; Lucey, S.; Singh, S.P.N. Hand Waving Away Scale. In Proceedings of the 13th European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8692, pp. 279–293. [Google Scholar] [CrossRef]
- Yu, R.; Yuan, Z.; Zhu, M.; Zhou, Z. Data-driven Distributed State Estimation and Behavior Modeling in Sensor Networks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 8192–8199. [Google Scholar] [CrossRef]
- Bai, X.; Yang, M.; Huang, T.; Dou, Z.; Yu, R.; Xu, Y. Deep-Person: Learning discriminative deep features for person Re-Identification. Pattern Recognit. 2020, 98, 107036. [Google Scholar] [CrossRef]
- Yu, R.; Dou, Z.; Bai, S.; Zhang, Z.; Xu, Y.; Bai, X. Hard-Aware Point-to-Set Deep Metric for Person Re-identification. In Proceedings of the 15th European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11220, pp. 196–212. [Google Scholar] [CrossRef]
- Yu, R.; Zhou, Z.; Bai, S.; Bai, X. Divide and Fuse: A Re-ranking Approach for Person Re-identification. In Proceedings of the British Machine Vision Conference, London, UK, 4–7 September 2017. [Google Scholar]
- Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Fischer, G.; Giaccardi, E.; Ye, Y.; Sutcliffe, A.G.; Mehandjiev, N. Meta-Design: A Manifesto for End-User Development. Commun. ACM 2004, 47, 33–37. [Google Scholar] [CrossRef]
- Ahmetovic, D.; Manduchi, R.; Coughlan, J.M.; Mascetti, S. Zebra Crossing Spotter: Automatic Population of Spatial Databases for Increased Safety of Blind Travelers. In Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility, Lisbon, Portugal, 26–28 October 2015; pp. 251–258. [Google Scholar] [CrossRef]
- Ahmetovic, D.; Manduchi, R.; Coughlan, J.M.; Mascetti, S. Mind Your Crossings: Mining GIS Imagery for Crosswalk Localization. ACM Trans. Access. Comput. 2017, 9, 1–25. [Google Scholar] [CrossRef]
- Hara, K.; Sun, J.; Chazan, J.; Jacobs, D.W.; Froehlich, J. An Initial Study of Automatic Curb Ramp Detection with Crowdsourced Verification Using Google Street View Images. In Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing, Palm Springs, CA, USA, 7–9 November 2013; Volume WS-13-18. [Google Scholar] [CrossRef]
- Hara, K.; Sun, J.; Moore, R.; Jacobs, D.W.; Froehlich, J. Tohme: Detecting curb ramps in Google Street View using crowdsourcing, computer vision, and machine learning. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, Honolulu, HI, USA, 5–8 October 2014; pp. 189–204. [Google Scholar] [CrossRef]
- Sun, J.; Jacobs, D.W. Seeing What is Not There: Learning Context to Determine Where Objects are Missing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1234–1242. [Google Scholar] [CrossRef]
- Weld, G.; Jang, E.; Li, A.; Zeng, A.; Heimerl, K.; Froehlich, J.E. Deep Learning for Automatically Detecting Sidewalk Accessibility Problems Using Streetscape Imagery. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 196–209. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Williams, M.A.; Hurst, A.; Kane, S.K. “Pray before you step out”: Describing personal and situational blind navigation behaviors. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, Bellevue, WA, USA, 21–23 October 2013; pp. 1–8. [Google Scholar] [CrossRef]
- Oster, G.; Nishijima, Y. Moiré patterns. Sci. Am. 1963, 208, 54–63. [Google Scholar] [CrossRef]
- Tekin, E.; Coughlan, J.M.; Shen, H. Real-time detection and reading of LED/LCD displays for visually impaired persons. In Proceedings of the IEEE Workshop on Applications of Computer Vision, Kona, HI, USA, 5–7 January 2011; pp. 491–496. [Google Scholar] [CrossRef]
- Morris, T.; Blenkhorn, P.; Crossey, L.; Ngo, Q.; Ross, M.; Werner, D.; Wong, C. Clearspeech: A Display Reader for the Visually Handicapped. IEEE Trans. Neural Syst. Rehabil. Eng. 2006, 14, 492–500. [Google Scholar] [CrossRef]
- Fusco, G.; Tekin, E.; Ladner, R.E.; Coughlan, J.M. Using computer vision to access appliance displays. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, Rochester, NY, USA, 20–22 October 2014; pp. 281–282. [Google Scholar] [CrossRef]
- Guo, A.; Kong, J.; Rivera, M.L.; Xu, F.F.; Bigham, J.P. StateLens: A Reverse Engineering Solution for Making Existing Dynamic Touchscreens Accessible. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, New Orleans, LA, USA, 20–23 October 2019; pp. 371–385. [Google Scholar] [CrossRef]
- Liu, X.; Meng, G.; Pan, C. Scene text detection and recognition with advances in deep learning: A survey. Int. J. Document Anal. Recognit. 2019, 22, 143–162. [Google Scholar] [CrossRef]
- Yan, R.; Peng, L.; Xiao, S.; Yao, G. Primitive Representation Learning for Scene Text Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 284–293. [Google Scholar] [CrossRef]
- Wang, Y.; Xie, H.; Fang, S.; Wang, J.; Zhu, S.; Zhang, Y. From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14174–14183. [Google Scholar] [CrossRef]
- Bhunia, A.K.; Sain, A.; Kumar, A.; Ghose, S.; Chowdhury, P.N.; Song, Y. Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14920–14929. [Google Scholar] [CrossRef]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.K.; Bagdanov, A.D.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on Robust Reading. In Proceedings of the 13th International Conference on Document Analysis and Recognition, Nancy, France, 23–26 August 2015; pp. 1156–1160. [Google Scholar] [CrossRef]
- Wang, M.; Deng, W. Deep visual domain adaptation: A survey. Neurocomputing 2018, 312, 135–153. [Google Scholar] [CrossRef]
- Ye, J.; Qiu, C.; Zhang, Z. A survey on learning-based low-light image and video enhancement. Displays 2024, 81, 102614. [Google Scholar] [CrossRef]
- Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. VizWiz Grand Challenge: Answering Visual Questions From Blind People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3608–3617. [Google Scholar] [CrossRef]
- Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2035–2048. [Google Scholar] [CrossRef]
- Mei, J.; Wu, Z.; Chen, X.; Qiao, Y.; Ding, H.; Jiang, X. DeepDeblur: Text image recovery from blur to sharp. Multimed. Tools Appl. 2019, 78, 18869–18885. [Google Scholar] [CrossRef]
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar] [CrossRef]
- Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social GAN: Socially Acceptable Trajectories With Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264. [Google Scholar] [CrossRef]
- Yagi, T.; Mangalam, K.; Yonetani, R.; Sato, Y. Future Person Localization in First-Person Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7593–7602. [Google Scholar] [CrossRef]
- Malla, S.; Dariush, B.; Choi, C. TITAN: Future Forecast Using Action Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11183–11193. [Google Scholar] [CrossRef]
- Mohanan, M.G.; Salgoankar, A. A survey of robotic motion planning in dynamic environments. Robot. Auton. Syst. 2018, 100, 171–185. [Google Scholar] [CrossRef]
- Pulli, K.; Baksheev, A.; Kornyakov, K.; Eruhimov, V. Real-time computer vision with OpenCV. Commun. ACM 2012, 55, 61–69. [Google Scholar] [CrossRef]
- Baudisch, P.; Good, N.; Bellotti, V.; Schraedley, P.K. Keeping things in context: A comparative evaluation of focus plus context screens, overviews, and zooming. In Proceedings of the CHI 2002 Conference on Human Factors in Computing Systems, Minneapolis, MN, USA, 20–25 April 2002; pp. 259–266. [Google Scholar] [CrossRef]
- Haris, M.; Shakhnarovich, G.; Ukita, N. Space-Time-Aware Multi-Resolution Video Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2856–2865. [Google Scholar] [CrossRef]
- Li, W.; Tao, X.; Guo, T.; Qi, L.; Lu, J.; Jia, J. MuCAN: Multi-correspondence Aggregation Network for Video Super-Resolution. In Proceedings of the 16th European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12355, pp. 335–351. [Google Scholar] [CrossRef]
- Chan, K.C.K.; Zhou, S.; Xu, X.; Loy, C.C. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5962–5971. [Google Scholar] [CrossRef]
- Debevec, P.E. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Proceedings of the International Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, Los Angeles, CA, USA, 11–15 August 2008; pp. 1–10. [Google Scholar] [CrossRef]
- Wu, Y.; He, Q.; Xue, T.; Garg, R.; Chen, J.; Veeraraghavan, A.; Barron, J.T. How to Train Neural Networks for Flare Removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2219–2227. [Google Scholar] [CrossRef]
- Li, X.; Zhang, B.; Liao, J.; Sander, P.V. Let’s See Clearly: Contaminant Artifact Removal for Moving Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1991–2000. [Google Scholar] [CrossRef]
- Makav, B.; Kılıç, V. A new image captioning approach for visually impaired people. In Proceedings of the 11th International Conference on Electrical and Electronics Engineering, Bursa, Turkey, 28–30 November 2019; pp. 945–949. [Google Scholar] [CrossRef]
- Makav, B.; Kılıç, V. Smartphone-based image captioning for visually and hearing impaired. In Proceedings of the 11th International Conference on Electrical and Electronics Engineering, Bursa, Turkey, 28–30 November 2019; pp. 950–953. [Google Scholar] [CrossRef]
- Brick, E.R.; Alonso, V.C.; O’Brien, C.; Tong, S.; Tavernier, E.; Parekh, A.; Addlesee, A.; Lemon, O. Am I Allergic to This? Assisting Sight Impaired People in the Kitchen. In Proceedings of the International Conference on Multimodal Interaction, Montréal, QC, Canada, 18–22 October 2021; pp. 92–102. [Google Scholar] [CrossRef]
- Chen, C.; Anjum, S.; Gurari, D. Grounding Answers for Visual Questions Asked by Visually Impaired People. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19076–19085. [Google Scholar] [CrossRef]
- Ahmetovic, D.; Sato, D.; Oh, U.; Ishihara, T.; Kitani, K.; Asakawa, C. ReCog: Supporting Blind People in Recognizing Personal Objects. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–12. [Google Scholar] [CrossRef]
- Hong, J.; Gandhi, J.; Mensah, E.E.; Zeraati, F.Z.; Jarjue, E.; Lee, K.; Kacorri, H. Blind Users Accessing Their Training Images in Teachable Object Recognizers. In Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility, Athens, Greece, 23–26 October 2022; pp. 1–18. [Google Scholar] [CrossRef]
- Morrison, C.; Grayson, M.; Marques, R.F.; Massiceti, D.; Longden, C.; Wen, L.; Cutrell, E. Understanding Personalized Accessibility through Teachable AI: Designing and Evaluating Find My Things for People who are Blind or Low Vision. In Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, New York, NY, USA, 22–25 October 2023; pp. 1–12. [Google Scholar] [CrossRef]
- Penuela, R.E.G.; Collins, J.; Bennett, C.L.; Azenkot, S. Investigating Use Cases of AI-Powered Scene Description Applications for Blind and Low Vision People. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–21. [Google Scholar] [CrossRef]
- Seeing AI—Talking Camera for the Blind. 2024. Available online: https://www.seeingai.com (accessed on 15 May 2024).
- Zhao, Y.; Zhang, Y.; Xiang, R.; Li, J.; Li, H. VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models. arXiv 2024, arXiv:2402.01735. [Google Scholar]
- Yang, B.; He, L.; Liu, K.; Yan, Z. VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments. arXiv 2024, arXiv:2404.02508. [Google Scholar]
- Xie, J.; Yu, R.; Zhang, H.; Billah, S.M.; Lee, S.; Carroll, J.M. Emerging Practices for Large Multimodal Model (LMM) Assistance for People with Visual Impairments: Implications for Design. arXiv 2024, arXiv:2407.08882. [Google Scholar]
- Bendel, O. How Can Generative AI Enhance the Well-being of Blind? arXiv 2024, arXiv:2402.07919. [Google Scholar] [CrossRef]
- Xie, J.; Yu, R.; Cui, K.; Lee, S.; Carroll, J.M.; Billah, S.M. Are Two Heads Better than One? Investigating Remote Sighted Assistance with Paired Volunteers. In Proceedings of the ACM Designing Interactive Systems Conference, Pittsburgh, PA, USA, 10–14 July 2023; pp. 1810–1825. [Google Scholar] [CrossRef]
- Midjourney. 2024. Available online: https://www.midjourney.com (accessed on 15 May 2024).
- OpenAI. DALL-E 2. 2024. Available online: https://openai.com/index/dall-e-2 (accessed on 15 May 2024).
- OpenAI. Sora. 2024. Available online: https://openai.com/index/sora (accessed on 15 May 2024).
- Salomoni, P.; Mirri, S.; Ferretti, S.; Roccetti, M. Profiling learners with special needs for custom e-learning experiences, a closed case? In Proceedings of the 2007 International Cross-Disciplinary Conference on Web Accessibility (W4A), Banff, AB, Canada, 7–8 May 2007; Volume 225, pp. 84–92. [Google Scholar] [CrossRef]
- Sanchez-Gordon, S.; Aguilar-Mayanquer, C.; Calle-Jimenez, T. Model for Profiling Users with Disabilities on e-Learning Platforms. IEEE Access 2021, 9, 74258–74274. [Google Scholar] [CrossRef]
- Zaib, S.; Khusro, S.; Ali, S.; Alam, F. Smartphone based indoor navigation for blind persons using user profile and simplified building information model. In Proceedings of the 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Swat, Pakistan, 24–25 July 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Xie, J.; Yu, R.; Zhang, H.; Lee, S.; Billah, S.M.; Carroll, J.M. BubbleCam: Engaging Privacy in Remote Sighted Assistance. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–16. [Google Scholar] [CrossRef]
- Akter, T.; Ahmed, T.; Kapadia, A.; Swaminathan, M. Shared Privacy Concerns of the Visually Impaired and Sighted Bystanders with Camera-Based Assistive Technologies. ACM Trans. Access. Comput. 2022, 15, 1–33. [Google Scholar] [CrossRef]
| | Challenges | CV Problems |
|---|---|---|
| G1. | Localization and Orientation | |
| (1) | Scarcity of indoor maps [1,13,19,85] | Visual SLAM [86] |
| (2) | Unable to localize the user on the map in real time [1,10,42,85] | Camera localization [87] |
| (3) | Difficulty in orienting the user in their current surroundings [9,19,42] | Camera localization [87] |
| (4) | Lack of landmarks or annotations on the map [1,62] | Landmark recognition [88] |
| (5) | Outdated landmarks on the map [1,62] | Landmark recognition [88] |
| (6) | Unable to change scale or resolution in indoor maps [1] | |
| (7) | Last-few-meters navigation (e.g., guiding the user to the final destination) [26,85] | Visual navigation [89] |
| G2. | Obstacle and Surrounding Information Acquirement and Detection | |
| (1) | Difficulty in reading signage and text in the user’s camera feed [43] | Scene text recognition [90] |
| (2) | Difficulty in estimating depth from the user’s camera feed and in conveying distance information [13] | Depth estimation [91]; scale estimation [92,93] |
| (3) | Difficulty in detecting and tracking moving objects (e.g., cars and pedestrians) [43,94] | Multiple object tracking [95]; human trajectory prediction [96,97] |
| (4) | Unable to project or estimate out-of-frame objects, people, or obstacles from the user’s camera feed [8,9,10,11,13,42,94] | Human trajectory prediction [96,97]; FOV extrapolation [98,99] |
| (5) | Motion sickness due to unstable camera feed [43] | Video stabilization [100] |
| G3. | Delivering Information and Understanding User-Specific Situation | |
| (1) | Difficulty in providing various kinds of information (direction, obstacles, and surroundings) in a timely manner [1,101] | Object detection [102]; visual search [103,104] |
| (2) | Adjusting the pace and level of detail of descriptions through communication [1,43] | Video captioning [105] |
| (3) | Cognitive overload | |
| G4. | Network and External Issues | |
| (1) | Losing connection and low-quality video feed [1,13,42,43,45,94] | Online visual SLAM [86] |
| (2) | Poor quality of the video feed | Video super-resolution [106]; video relighting [107] |
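The G2 and G3 rows above couple detection with information delivery: detections must be ranked before they are spoken, or the user is overloaded (challenge G3-(3)). The following minimal Python sketch illustrates one such prioritization policy; the `Detection` record, its fields, and the ranking rules are illustrative assumptions, not a system described in the paper.

```python
from dataclasses import dataclass

# Hypothetical detection record; in a real RSA pipeline these values would
# come from an upstream object detector and depth estimator (G2-(2), G3-(1)).
@dataclass
class Detection:
    label: str          # e.g., "car", "door"
    distance_m: float   # estimated distance to the user, in meters
    in_path: bool       # whether the object lies in the user's walking path
    moving: bool        # whether the object is moving (car, pedestrian)

def prioritize(detections, max_items=3):
    """Rank detections for speech output: moving in-path hazards first,
    then static in-path obstacles, then nearby context. The list is capped
    to limit cognitive load (challenge G3-(3))."""
    def urgency(d):
        return (not (d.moving and d.in_path),  # moving in-path hazards first
                not d.in_path,                 # then other in-path obstacles
                d.distance_m)                  # nearer before farther
    ranked = sorted(detections, key=urgency)
    return [f"{d.label}, {d.distance_m:.0f} meters" for d in ranked[:max_items]]

feed = [
    Detection("bench", 2.0, in_path=False, moving=False),
    Detection("car", 8.0, in_path=True, moving=True),
    Detection("door", 4.0, in_path=True, moving=False),
    Detection("pedestrian", 3.0, in_path=True, moving=True),
]
print(prioritize(feed))
```

The cap on `max_items` is the point of the sketch: even a perfect detector still needs a delivery policy, which is exactly where the table pairs CV output with human-mediated description.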
| Outdoor Scenarios | Indoor Scenarios |
|---|---|
| 1. Going to the mailbox | 1. Finding trash cans or vending machines |
| 2. Taking a walk around a familiar area (e.g., park, campus) | 2. Finding architectural features (e.g., stairs, elevators, doors, exits, or washrooms) |
| 3. Walking to the closest coffee shop | 3. Finding a point of interest in indoor navigation (e.g., a room number, an office) |
| 4. Finding the bus stop | 4 *. Navigating malls, hotels, conference venues, or similarly large establishments |
| 5 *. Crossing noisy intersections without veering | 5. Finding the correct train platform |
| 6. Calling a ride share and going to the pick-up location | 6. Navigating an airport (e.g., security to gate, gate to gate, or gate to baggage claim) |
| 7 *. Navigating from a parking lot or drop-off point to the interior of a business | 7. Finding an empty seat in theaters or an empty table in restaurants |
| 8 *. Navigating through parking lots or construction sites | |
| | Emerging Problems in RSA | Current Status of Research | Proposed Human–AI Collaboration |
|---|---|---|---|
| 1 | Motivated by the identified challenges | | |
| (1) | Making object detection and obstacle avoidance algorithms blind aware | • Existing object detection algorithms [119,120] are not blind aware | • Human annotation of blind-aware objects for training and updating AI models |
| (2) | Localizing users under poor networks | • No prior work on large delays or breakdowns of video transmission [13,94,111] | • Using audio and camera pose in 3D maps • Interactive verification of camera pose |
| (3) | Recognizing digital content on digital displays | • No recognition systems for digital text • OCR [135] suffers from domain shift [140] | • AI-guided adjustment of camera view • Manual selection of AI recognition region |
| (4) | Recognizing texts on irregular surfaces | • No OCR systems for irregular surfaces • Existing systems [135,136,137,138] read text on flat surfaces | • AI-based rectification for humans • AI-guided movement/rotation of objects |
| (5) | Predicting the trajectories of out-of-frame pedestrians or objects | • No such prediction systems • Existing models [147,148] only predict in-frame objects in pixels | • User-centered out-of-frame prediction • Agents mark the directions of interest • AI-guided camera movements |
| (6) | Expanding the field-of-view of live camera feed | • No prior work on real-time FOV expansion | • Task-specific use of fisheye lenses • Human-customized view rendering |
| (7) | Stabilizing live camera feeds for task-specific needs | • Existing video stabilization methods [100] are developed for general purposes | • Task-oriented and adjustable video stabilization based on human inputs |
| (8) | Reconstructing high-resolution live video feeds | • Existing models [152,153,154] are limited by computational resources for live video | • Customized video super-resolution on certain parts based on human inputs |
| (9) | Relighting and removing unwanted artifacts on live video | • Existing models [107,156,157] are developed for general purposes (e.g., HDR [155]) | • Human-guided custom relighting • Interactive artifact detection and removal |
| (10) | Describing hierarchical information of live camera feeds | • Captioning tools [158,159] are not designed for PVI • VQA for PVI [160,161] performs poorly | • AI helps agents organize information • Joint assistance by agents and AI |
| | Common human–AI collaboration strategies for the emerging problems: • AI-guided adjustment of camera views • Human-designated region for AI processing • Task-specific AI driven by human inputs | | |
| 2 | Integrating RSA with LLMs | | |
| (1) | Human agents enhancing LLM-based AI | • No prior work | • Human leading AI in intricate tasks • Human verifying AI in simple tasks |
| (2) | LLM-based AI supporting human agents | • No prior work | • Accelerating target localization with AI • AI-driven specialized knowledge support |
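The two LLM collaboration modes in part 2 of the table (humans verifying AI on simple tasks, humans leading AI on intricate ones) can be sketched as a simple routing policy. The task categories, confidence threshold, and role labels below are illustrative assumptions for exposition, not a protocol defined in the paper.

```python
# Hypothetical triage of an RSA request between an LLM-based assistant and
# a human agent, following the two collaboration modes in the table.

SIMPLE_TASKS = {"read_sign", "identify_object", "describe_scene"}
INTRICATE_TASKS = {"street_crossing", "indoor_navigation", "last_few_meters"}

def route(task, ai_confidence):
    """Return (leader, follower) roles for a request.
    Assumed policy: the AI leads simple tasks when confident, with the
    human agent verifying its output; the human leads intricate or
    low-confidence tasks, with the AI supplying suggestions (e.g.,
    localization hints, OCR results)."""
    if task in SIMPLE_TASKS and ai_confidence >= 0.8:
        return ("ai", "human_verifies")
    return ("human", "ai_assists")

print(route("read_sign", 0.93))        # high-confidence simple task
print(route("street_crossing", 0.95))  # intricate task: human leads regardless
print(route("read_sign", 0.40))        # low confidence: human takes over
```

The asymmetry is deliberate: an intricate, safety-critical task stays human-led even at high model confidence, which matches the table’s framing of the human as the final authority in RSA.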
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yu, R.; Lee, S.; Xie, J.; Billah, S.M.; Carroll, J.M. Human–AI Collaboration for Remote Sighted Assistance: Perspectives from the LLM Era. Future Internet 2024, 16, 254. https://doi.org/10.3390/fi16070254