1. Introduction
This paper focuses on navigation assistance for visually-impaired people, specifically on terrain awareness, a technical term that was originally coined for commercial aircraft. In aviation, a Terrain Awareness and Warning System (TAWS) is generally an on-board module aimed at preventing unintentional impacts with the ground [1]. Within a different context, namely blind assistance, the task of terrain awareness involves traversable ground parsing and navigation-related scene understanding, both of which are widely desired within the visually-impaired community [2,3].
According to the World Health Organization (WHO), an estimated 253 million people live with vision impairment, 36 million of whom are totally blind [4]. Over the past decade, the striking improvement of Computer Vision (CV) has been of enormous benefit to the Visually-Impaired (VI), allowing individuals with blindness or visual impairments to access, understand and explore surrounding environments [3,5,6]. These trends have accelerated the proliferation of monocular detectors and cost-effective RGB-Depth (RGB-D) sensors [5], providing essential prerequisites for aiding the perception and navigation of visually-impaired individuals by leveraging robotic vision [7]. Along this line, a broad variety of navigational assistive technologies has been developed to accomplish specific goals, including avoiding obstacles [8,9,10,11,12,13,14,15,16,17], finding paths [18,19,20,21,22,23,24,25,26,27,28,29], locating sidewalks [30,31,32,33], ascending stairs [34,35,36,37,38], descending steps [39,40] and negotiating water hazards [41].
Each of these navigational tasks has been tackled well by its respective solutions, and the mobility of visually-impaired people has been enhanced as a result. However, along with the increasing demand for everyday independent navigation [2,3], assistance faces the challenge of juggling multiple tasks simultaneously and coordinating all of the perception needs efficiently. In response to these observations, the research community has been motivated to offer more independence by integrating different detectors on the basis of traversability perception, which is considered the backbone of any VI-dedicated navigational assistive tool [26].
However, most of this processing follows a sequential pipeline rather than a unified one, detecting different navigation-related scene elements separately. Running multiple detectors together is therefore computationally intensive, and the resulting latency makes it infeasible within the blind assistance context. For illustration, one of the pioneering works [23,35,38] performed two main tasks for its personal guidance system: it runs full floor segmentation at approximately 0.3 Frames Per Second (FPS), with an additional stair detection iteration time ranging from 50 ms to 150 ms [35]. In spite of being precise in staircase modeling, this approach requires further optimization to provide assistance at normal walking speed. A more recent example is the Sound of Vision system [16,17,29], which aims to support visually-impaired people in navigating complex environments autonomously. While its fusion-based imaging is visually appealing, a long latency is incurred when identifying elements of interest such as the ground, walls and stairs: it takes more than 300 ms to compute stereo correspondences and detect negative obstacles [17], let alone the other processing components, which makes the system non-ideal for real-time assistance on embedded platforms. Such a system would be enhanced by avoiding significant delays in its main processing pipeline. Towards this objective, multi-threading is an effective way to reduce latency while sharing the computational burden between cores. The commercial version of the smart glasses from KR-VISION [42] has shown satisfactory performance for the detection of obstacles and hazardous curbs across different processing threads: it continuously receives images from the sensors and multi-tasks at different frame rates. Alternatively, a unified feedback design was proposed to complement the discrete detection of traversable areas and water puddles within a polarized RGB-Depth (pRGB-D) framework [41]. However, the corresponding user study revealed a higher demand for discerning terrain information.
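To make this design pattern concrete, the following is a minimal sketch, not taken from any of the cited systems, of how a multi-threaded perception loop could share the computational burden between cores by running two hypothetical detectors on the most recent camera frame at different frame rates; the capture, detector and feedback callables as well as the rates are placeholders.

```python
# Minimal sketch of a multi-threaded perception loop (illustrative only):
# two hypothetical detectors consume the latest camera frame at their own
# rates instead of being chained in a single sequential pipeline.
import threading
import time

latest_frame = None
frame_lock = threading.Lock()

def camera_loop(capture_fn, fps=30):
    """Continuously grab frames and keep only the most recent one."""
    global latest_frame
    while True:
        frame = capture_fn()
        with frame_lock:
            latest_frame = frame
        time.sleep(1.0 / fps)

def detector_loop(detect_fn, feedback_fn, fps):
    """Run one detector on the newest frame at its own frame rate."""
    while True:
        with frame_lock:
            frame = latest_frame
        if frame is not None:
            feedback_fn(detect_fn(frame))
        time.sleep(1.0 / fps)

# The callables below (grab_rgbd_frame, detect_obstacles, detect_curbs,
# sonify, vibrate) are placeholders for a concrete system.
# threading.Thread(target=camera_loop, args=(grab_rgbd_frame,), daemon=True).start()
# threading.Thread(target=detector_loop, args=(detect_obstacles, sonify, 10), daemon=True).start()
# threading.Thread(target=detector_loop, args=(detect_curbs, vibrate, 2), daemon=True).start()
```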
In the literature, a number of systems [43,44,45,46] rely on sensor fusion to understand more of the surrounding scenes. Along this line, proofs-of-concept were also investigated in [47,48,49,50] that use highly integrated radars to warn against collisions with pedestrians and cars, taking into consideration that fast-moving objects are response-time critical. Arguably, for navigation assistance, an even greater concern lies in the depth data from almost all commercial 3D sensors, which suffer from a limited depth range and cannot maintain robustness across various environments [22,26,29,37]. Inevitably, approaches based on a stereo camera or a light-coding RGB-D sensor generally perform range expansion [13,14] or depth enhancement [22], or depend on visual and depth information complementing each other [23]. Beyond the time consumed in these steps, underlying assumptions are frequently made, such as: the ground plane is the biggest area [9,10]; the area directly in front of the user is accessible [18,19]; and variants of the flat-world [24,36], Manhattan-world [23,27,35,38] or stixel-world assumptions [15,25,41]. These factors all limit the flexibility and applicability of navigational assistive technologies.
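To make the first of these assumptions concrete, the sketch below illustrates, in a generic form that does not reproduce any cited method, how a depth-based approach might extract the ground by fitting the single largest plane to a point cloud with RANSAC; the function name and thresholds are chosen for illustration. When the floor is not the dominant plane, for instance on staircases or in cluttered scenes, this assumption breaks down.

```python
# Illustrative RANSAC plane fit under the "ground is the biggest plane"
# assumption (generic sketch, not the method of any cited work).
import numpy as np

def fit_ground_plane(points, n_iters=200, dist_thresh=0.03):
    """points: (N, 3) array in meters. Returns ((normal, d), inlier mask)."""
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        # Sample 3 points and compute the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal /= norm
        d = -normal.dot(p0)
        # Count points close to the candidate plane.
        dist = np.abs(points @ normal + d)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane, best_inliers
```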
Nowadays, unlike the traditional approaches mentioned above, Convolutional Neural Networks (CNNs) learn and discriminate between different features directly from the input data through deeper abstraction across representation layers [51]. More precisely, recent advances in deep learning have achieved breakthrough results in most vision-based tasks, including object classification [52], object detection [53], semantic segmentation [54] and instance segmentation [55]. Semantic segmentation, one of the most challenging of these tasks, aims to partition an image into several coherent, semantically-meaningful parts. As depicted in Figure 1, because traditional approaches detect different targets independently [56], assistive feedback to the user is generated separately. Intuitively, it is beneficial to cover the tasks of the perception module of a VI-dedicated navigational assistive system in a unified manner, because this allows many problems to be solved at once while exploiting their inter-relations and spatial relationships (contexts), creating reasonably favorable conditions for a unified feedback design. Semantic segmentation fulfills exactly this purpose: it classifies a wide spectrum of scene classes directly, yielding pixel-wise understanding that constitutes a very rich source of processed information for upper-level navigational assistance for visually-impaired individuals. Additionally, the continuous growth of large-scale scene parsing datasets [57,58,59] and of affordable computational resources has also built the momentum for CNN-based semantic segmentation to become the key enabler covering navigation-related perception tasks [56].
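As a concrete illustration of what pixel-wise understanding provides, the sketch below assumes a generic segmentation network `net` that maps an RGB image tensor to per-pixel class logits, together with a hypothetical label set; a single forward pass then yields separate masks for several navigation-related classes at once.

```python
# Illustrative use of a pixel-wise segmentation output (generic sketch).
# `net` is assumed to map a (1, 3, H, W) image tensor to (1, C, H, W) logits.
import numpy as np
import torch

CLASSES = ["road", "sidewalk", "obstacle", "stairs", "water",
           "pedestrian", "vehicle"]  # hypothetical label set

def terrain_masks(net, image):
    """image: (H, W, 3) uint8 RGB. Returns one boolean mask per class."""
    x = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = net(x)                       # (1, C, H, W)
    labels = logits.argmax(dim=1)[0].numpy()  # (H, W) per-pixel class ids
    return {name: labels == i for i, name in enumerate(CLASSES)}
```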
Based on these notions, we propose to leverage pixel-wise semantic segmentation to provide terrain awareness in a unified way. Until very recently, pixel-wise semantic segmentation was not usable in terms of speed. In response to the surge in demand, efficient semantic segmentation has become a heavily researched topic over the past two years, spanning a diverse range of application domains with the emergence of architectures that reach near real-time segmentation [60,61,62,63,64,65,66,67,68]. These advances have made the use of full scene segmentation possible in time-critical cases like blind assistance. However, to the best of our knowledge, approaches that customize real-time semantic segmentation to assist visually-impaired pedestrians are scarce in the state of the art. In this regard, our unified framework is a pioneering attempt that goes much further than simply identifying the most traversable direction [28,41], and it differs from efforts made to aid navigation in prosthetic vision [27,69,70] because our approach can be used and accessed by both blind and partially-sighted individuals.
We have already presented some preliminary studies related to our approach [22,41]. This paper considerably extends these previously-established proofs-of-concept with novel contributions and results in the following main aspects:
A unification of terrain awareness regarding traversable areas, obstacles, sidewalks, stairs, water hazards, pedestrians and vehicles.
A real-time semantic segmentation network to learn both global scene contexts and local textures without imposing any assumptions, while reaching higher performance than traditional approaches.
A real-world navigational assistance framework on a wearable prototype for visually-impaired individuals.
A comprehensive set of experiments on a large-scale public dataset, as well as an egocentric dataset captured with the assistive prototype. The real-world egocentric dataset can be accessed at [71].
A closed-loop field test involving real visually-impaired users, which validates the effectiveness and versatility of our solution, as well as giving insightful hints on how to reach a higher level of safety and offer more independence to the users.
The remainder of this paper is structured as follows. Section 2 reviews related work that has addressed both traversability-related terrain awareness and real-time semantic segmentation. Section 3 elaborates the framework in terms of the wearable navigation assistance system, the semantic segmentation architecture and the implementation details. Section 4 evaluates and discusses the approach with regard to real-time and real-world performance by comparing it to traditional algorithms and state-of-the-art networks. Section 5 fully describes a closed-loop field test that aims to validate the effectiveness and versatility of our approach. Section 6 draws the conclusions and gives an outlook on future work.
6. Conclusions and Future Work
Navigational assistance for the Visually Impaired (VI) is undergoing a monumental boom thanks to developments in Computer Vision (CV). However, monocular detectors or depth sensors are generally applied to separate tasks. In this paper, we demonstrate that these perception tasks become achievable within a single pipeline by utilizing real-time semantic segmentation. The proposed framework, based on deep neural networks and depth-based segmentation, not only supports essential traversability perception at both short and long ranges, but also covers the needs of terrain awareness in a unified way.
We present a comprehensive set of experiments and a closed-loop field test to demonstrate that our approach strikes an excellent trade-off between reliability and speed, achieving high effectiveness and versatility for navigation assistance in terms of unified environmental perception.
In the future, we aim to continuously improve our navigational assistive approach. Specifically, pixel-wise polarization estimation and multi-modal sensory awareness will be incorporated to make the framework more robust in cross-season scenarios. Deep learning-based depth interpolation would be beneficial for enhancing RGB-D perception in highly dynamic environments and for expanding the minimum/maximum detectable range. Intersection-centered scene elements, including zebra crosswalks and traffic lights, will be covered in the road-crossing context. Hazardous curbs and water puddles will be addressed to further enhance traversability-related semantic perception with our advanced version of CNNs using hierarchical dilation. In addition, we are interested in panoramic semantic segmentation, which we expect to provide superior assistive awareness.
Moreover, it is necessary to run a larger study with visually-impaired participants to test this approach, in which different sonification methods and audio output settings could be compared in a more general usage scenario with semantics-aware visual localization.