According to the World Health Organization, an estimated 285 million people worldwide were visually impaired in 2014, 39 million of whom were blind [1]. It is very difficult for visually impaired people (VIP) to find their way through obstacles and wander in real-world scenarios. Recently, RGB-Depth (RGB-D) sensors have revolutionized research on assistance for VIP because of their versatility, portability, and cost-effectiveness. Compared with traditional assistive tools, such as a white cane, RGB-D sensors provide a great deal of information to the VIP. Typical RGB-D sensors, including light-coding sensors, time-of-flight (ToF) cameras, and stereo cameras, are able to acquire color information and perceive the environment in three dimensions at video frame rates. Mature commercial products exist for each of these depth-sensing technologies, but each type has its own limitations and requires certain working environments to perform well, which brings not only new opportunities but also challenges to overcome.
Light-coding sensors, such as PrimeSense [2
] (developed by PrimeSense based in Tel Aviv, Israel), Kinect [3
] (developed by Microsoft based in Redmond, WA, USA), Xtion Pro [4
] (developed by Asus based in Taipei, Taiwan), MV4D [5
] (developed by Mantis Vision based in Petach Tikva, Israel), and the Structure Sensor [6
] (developed by Occipital based in San Francisco, CA, USA), project near-IR laser speckles to code the scene. Since the distortion of the speckles depends on the depth of objects, an IR CMOS image sensor captures the distorted speckles and a depth map is generated through triangulation. However, they fail to return a usable depth map in sunny environments because the projected speckles are submerged by sunlight. As a result, approaches for VIP with light-coding sensors are merely proofs of concept or are feasible only in indoor environments [7
ToF cameras, such as CamCube [16
] (developed by PMD Technologies based in Siegen, Germany), DepthSense [17
] (developed by SoftKinetic based in Brussels, Belgium), and SwissRanger (developed by Heptagon based in Singapore) [18
] resolve distance based on the known speed of light, measuring the time of flight of a light signal between the camera and the subject independently for each pixel of the image sensor. However, they are susceptible to ambient light. As a result, ToF camera-based approaches for VIP show poor performance in outdoor environments [19
Stereo cameras, such as the Bumblebee [22
] (developed by PointGrey based in Richmond, BC, Canada), ZED [23
] (developed by Stereolabs based in San Francisco, USA), and DUO [24
] (developed by DUO3D based in Henderson, NV, USA), estimate the depth map through stereo matching of images from two or more lenses. Points in one image are matched to the corresponding points in another image, and depth is calculated from the shift (disparity) between each pair of corresponding points. Stereo matching is a passive and texture-dependent process. As a result, stereo cameras return sparse depth images in textureless indoor scenes, such as a blank wall. This explains why solutions for VIP with stereo cameras focus mainly on highly textured outdoor environments [25
The RealSense R200 (developed by Intel based in Santa Clara, CA, USA) uses a combination of active projecting and passive stereo matching [29
]. An IR laser projector projects static, non-visible near-IR patterns onto the scene, which is then acquired by the left and right IR cameras. The image processor generates a depth map through an embedded stereo-matching algorithm. In textureless indoor environments, the projected patterns enrich textures. As shown in Figure 1
b,c, many near-IR patterns are projected onto the textureless white wall, which benefits stereo matching and the generation of depth information. In sunny outdoor environments, although the projected patterns are submerged by sunlight, the near-IR component of sunlight illuminates the scene to form well-textured IR images, as shown in Figure 1g. Because abundant texture makes stereo matching robust, this combination allows the RealSense R200 to work in both indoor and outdoor circumstances, although the depth images it delivers contain considerable noise, mismatched pixels, and black holes. In addition, denser depth maps could be attained with new algorithms. As illustrated in Figure 1, the RealSense R200 is quite suitable for navigational assistance thanks not only to its environmental adaptability, but also to its small size.
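The triangulation that underlies stereo matching, whether passive or assisted by projected patterns, can be illustrated with a minimal sketch. It assumes an idealized, rectified pinhole stereo pair; the function name and the focal length and baseline values are hypothetical placeholders, not calibration data of the R200.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_mm):
    """Depth (mm) of a matched point in a rectified stereo pair: z = f * B / d."""
    if disparity_px <= 0:
        # No valid match: such pixels appear as black holes in the depth image.
        return None
    return focal_px * baseline_mm / disparity_px


# Hypothetical calibration values, chosen only for illustration.
FOCAL_PX = 600.0
BASELINE_MM = 70.0

for disparity in (60.0, 30.0, 15.0):
    # Halving the disparity doubles the estimated depth.
    print(disparity, depth_from_disparity(disparity, FOCAL_PX, BASELINE_MM))
```

Because depth is inversely proportional to disparity, distant points map to very small disparities, which is one reason why textureless or poorly matched regions translate directly into missing depth.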
However, the depth range of RGB-D sensors is generally short. For light-coding sensors, the speckles in the distance are too dark to be sensed. For ToF cameras, light signals are overwhelmed by ambient light in the distance. For stereo cameras, since the depth error increases with the depth value, measurements are prone to be unreliable in the distance [30]. For the RealSense R200, on the one hand, since the power of the IR laser projector is limited, the speckles on a distant object are too dark and sparse to enhance stereo matching. On the other hand, depth information in the distance is much less accurate than that within the normal working distance of 650–2100 mm [31
]. As shown in Figure 2
, the original depth image is sparse beyond a few meters. In addition, the depth field angle of RGB-D sensors is generally small. For the RealSense R200, the horizontal field angle of each IR camera is 59°. The depth image is generated through stereo matching within the overlapping field angles of the two IR cameras. As illustrated in Figure 3, although both the red and green light are within the horizontal field angle of the left IR camera, only the green light is within the overlapping field angle of the two IR cameras. Thus, the effective horizontal field angle of the depth image is smaller than 59°, the horizontal field angle of a single IR camera. Consequently, as depicted in Figure 2, both the distance and the angular range of ground plane detection with the original depth image are small, which hampers longer and broader traversable area awareness for VIP.
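Both limitations can be made concrete with a small geometric sketch. It assumes an idealized pair of parallel pinhole cameras; the first-order error model, the function names, and the focal length and baseline values are assumptions introduced for illustration, and only the 59° field angle is taken from the text.

```python
import math


def depth_error_mm(depth_mm, focal_px, baseline_mm, disparity_error_px=1.0):
    """First-order depth uncertainty of stereo matching: dz ~= z^2 / (f * B) * dd."""
    return depth_mm ** 2 / (focal_px * baseline_mm) * disparity_error_px


def overlap_angle_deg(depth_mm, fov_deg, baseline_mm):
    """Angle, seen from the left camera, that both cameras cover at a given depth."""
    half = math.radians(fov_deg / 2.0)
    # Boundary of the right camera's view, expressed as an angle from the left camera.
    inner = math.atan(math.tan(half) - baseline_mm / depth_mm)
    return math.degrees(half + inner)


# Hypothetical focal length and baseline; 59 degrees is the single-camera field angle.
for z in (1000.0, 2000.0, 4000.0):
    print(z, depth_error_mm(z, 600.0, 70.0), overlap_angle_deg(z, 59.0, 70.0))
```

Under these assumptions the depth error grows with the square of the distance, while the shared field angle always stays below that of a single camera and only approaches it at long range.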
In this paper, an effective approach to expand the traversable area detection is proposed. Since the original depth image is poor and sparse, two IR images are large-scale matched to generate a dense depth image. Additionally, the quality of the depth image is enhanced with RGB image-guided filtering, which performs functions such as de-noising and hole-filling, and can estimate the depth map from the perspective of the RGB camera, whose horizontal field angle is wider than that of the depth camera. The preliminary traversable area is obtained with RANdom SAmple Consensus (RANSAC) segmentation [32
]. In addition to the RealSense R200, an attitude sensor, InvenSense MPU6050 [33
], is employed to adjust the point cloud from the camera coordinate system to the world coordinate system. This helps to eliminate sample errors in preliminary traversable area detection. Through estimating surface normal vectors of depth image patches, salient parts are removed from preliminary detection results. The key step of the traversable area detection is to extend preliminary results to broader and longer ranges, fully combining depth and color images. On the one hand, short-range depth information is enhanced with long-range RGB information. On the other hand, depth information adds a dimension of restrictions to the expansion stage based on the seeded region growing algorithm [34]. The approach proposed in this paper is integrated with a wearable prototype, containing a bone-conduction headphone, which provides a non-semantic stereophonic interface. Unlike most navigational assistance approaches, which are not tested by VIP, our approach has been tried out by eight visually impaired volunteers, three of whom suffer from total blindness.
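The adjustment of the point cloud from the camera coordinate system to the world coordinate system can be sketched as a rotation built from the measured attitude angles. This is a minimal illustration rather than the implementation used in the paper: the axis convention, the rotation order, the function names, and the example angle are all assumptions.

```python
import math


def rotation_from_attitude(pitch_rad, roll_rad):
    """Rotation matrix R = Rx(pitch) * Rz(roll), returned as nested lists.

    Assumed convention (not taken from the paper): camera x points right,
    y points down, z points forward; roll is about z, pitch is about x.
    """
    cp, sp = math.cos(pitch_rad), math.sin(pitch_rad)
    cr, sr = math.cos(roll_rad), math.sin(roll_rad)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return [[sum(rx[i][k] * rz[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def camera_to_world(points, rotation):
    """Apply the rotation to every (x, y, z) point of the cloud."""
    return [tuple(sum(rotation[i][k] * p[k] for k in range(3)) for i in range(3))
            for p in points]


# Hypothetical reading: the camera is pitched by 20 degrees, with no roll.
R = rotation_from_attitude(math.radians(20.0), 0.0)
cloud_world = camera_to_world([(0.0, 1.0, 2.0), (0.5, 1.2, 3.0)], R)
```

Once the cloud is expressed in world coordinates, the ground is expected to be roughly horizontal, so planes with a large inclination can be rejected before they are mistaken for the floor.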
This paper is organized as follows: in Section 2
, related work that has addressed both traversable area detection and expansion is reviewed; in Section 3
, the presented approach is elaborated in detail; in Section 4
, extensive tests on indoor and outdoor scenarios demonstrate its effectiveness and robustness; in Section 5
, the approach is validated by a user study conducted with real VIP; and in Section 6, relevant conclusions are drawn and future work is outlined.
2. Related Work
In the literature, a lot of approaches have been proposed with respect to ground plane segmentation, access section detection, and traversable area awareness with RGB-D sensors.
In some approaches, ground plane segmentation is the first step of obstacle detection, which aims to separate feasible ground area from hazardous obstacles. Wang adopted meanshift segmentation to separate obstacles based on the depth image from a Kinect, in which planes are regarded as feasible areas if two conditions are met: the angle between the normal vector of the fitting plane and vertical direction of the camera coordinate system is less than a threshold; and the average distance and the standard deviation of all 3D points to the fitting plane are less than thresholds [35
]. Although the approach achieved good robustness in certain environments, it relies heavily on thresholds and assumptions. Cheng put forward an algorithm to detect the ground with a Kinect based on seeded region growing [15]. Instead of focusing on growing thresholds, edges of the depth image and boundaries of the region are adequately considered. However, the algorithm is unduly dependent on the depth image, and the seed pixels are selected according to a random number, causing fluctuations between frames, which is intolerable for assistance because unstable results would confuse VIP. Rodríguez simply estimated the outdoor ground plane based on RANSAC plus filtering techniques, and used a polar grid representation to account for potential obstacles [25]. The approach is one of the few that have involved real VIP participation. However, it yields a ground plane detection error in more than ten percent of the frames, a problem that is resolved in our work.
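The plane-based ideas reviewed above, namely a RANSAC plane fit combined with a check of the angle between the plane normal and the vertical direction, can be sketched as follows. This is an illustrative sketch and not the implementation of any cited work; the function names, the `up` vector, and all thresholds are assumptions.

```python
import math
import random


def plane_from_points(p1, p2, p3):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if collinear."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-9:
        return None
    n = [c / norm for c in n]
    return n, -sum(n[i] * p1[i] for i in range(3))


def ransac_ground(points, up=(0.0, 1.0, 0.0), dist_tol=0.02,
                  max_tilt_deg=10.0, iterations=200, seed=0):
    """Largest-consensus plane whose normal stays close to the vertical direction."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    cos_limit = math.cos(math.radians(max_tilt_deg))
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        # Reject steep planes (for example walls) before counting inliers.
        if abs(sum(n[i] * up[i] for i in range(3))) < cos_limit:
            continue
        inliers = [p for p in points
                   if abs(sum(n[i] * p[i] for i in range(3)) + d) <= dist_tol]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Without the inclination check, a large wall can gather more inliers than the floor and be returned as the ground, which is the kind of sample error that the thresholds and assumptions discussed above try to prevent.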
In some approaches, the problem of navigable ground detection is addressed in conjunction with localization tasks. Perez-Yus used the RANSAC algorithm to segment planes in human-made indoor scenarios, relying on dense 3D point clouds. The approach is able to extract not only the ground but also ascending or descending stairs, and to determine the position and orientation of the user with visual odometry [36]. Lee also incorporated visual odometry and feature-based metric-topological simultaneous localization and mapping (SLAM) [37
] to perform traversability analysis [26
]. The navigation system extracts the ground plane to reduce the drift imposed by the head-mounted RGB-D sensor, and the paper demonstrated that the traversability map works more robustly with a light-coding sensor than with a stereo pair in low-textured environments. In another indoor localization application, Sánchez detected floor and navigable areas to efficiently reduce the search space, thereby achieving real-time performance in both place recognition and tracking [39
In some approaches, surface normal vectors on the depth map have been used to determine the accessible section. Koester detected the accessible section by calculating the gradients and estimating surface normal vector directions of real-world scene patches [40
]. The approach allows for fast and effective accessible section detection, even in crowded scenes. However, its overreliance on the quality of the 3D reconstruction process and its adherence to constraints, such as the assumption that the area directly in front of the user is accessible, hinder practical application in user studies. Bellone defined a novel descriptor to measure the unevenness of a local surface based on the estimation of normal vectors [41
]. The index gives an enhanced description of the traversable area which takes into account both the inclination and roughness of the local surface. It is possible to perform obstacle avoidance and terrain traversability assessments simultaneously. However, the descriptor computation is complex and also relies on the sensor to generate dense 3D point clouds. Chessa derived the normal vectors to estimate surface orientation for collision avoidance and scene interpretation [42
]. The framework uses a disparity map as a powerful cue to validate the computation from optical flow, but it suffers from sensitivity to errors in the optical flow estimates.
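The use of surface normal vectors on a depth map can be sketched with a simple cross-product estimate over a small patch. This is a generic illustration, not the method of any cited work; the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the patch step, and the convention that a depth of zero marks a hole are assumptions.

```python
import math


def backproject(u, v, z, fx, fy, cx, cy):
    """Pixel (u, v) with depth z to a 3D point in the camera frame (pinhole model)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def patch_normal(depth, u, v, step, fx, fy, cx, cy):
    """Unit normal of the patch anchored at (u, v), or None if depth is missing."""
    z0, zx, zy = depth[v][u], depth[v][u + step], depth[v + step][u]
    if min(z0, zx, zy) <= 0:
        return None  # holes in the depth image yield no normal estimate
    p0 = backproject(u, v, z0, fx, fy, cx, cy)
    px = backproject(u + step, v, zx, fx, fy, cx, cy)
    py = backproject(u, v + step, zy, fx, fy, cx, cy)
    a = [px[i] - p0[i] for i in range(3)]
    b = [py[i] - p0[i] for i in range(3)]
    n = [a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    norm = math.sqrt(sum(c * c for c in n))
    return None if norm == 0 else tuple(c / norm for c in n)


def angle_to_axis_deg(normal, axis):
    """Unsigned angle between a unit normal and a unit reference axis."""
    dot = abs(sum(normal[i] * axis[i] for i in range(3)))
    return math.degrees(math.acos(min(1.0, dot)))
```

The early return for missing depth makes the dependence on dense point clouds explicit: wherever the depth image has holes, no normal, and hence no accessibility decision, is available.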
In some approaches, range extension is addressed to tackle the limitations imposed by RGB-D sensors. Muller presented a self-supervised learning process to accurately classify long-range terrain as traversable or not [43]. It continuously receives images, generates supervisory labels, trains a classifier, and classifies the long-range portion of the images, completing one full cycle every half second. Although the system classifies the traversable area of the image up to the horizon, the feature extraction requires large, distant image patches within fifteen meters, limiting its utility in general applications with commercial RGB-D sensors, whose range is much shorter. Reina proposed a self-learning framework to automatically train a ground classifier with multi-baseline stereovision [44]. Two distinct classifiers are used: one based on geometric data, which detects the broad class of ground, and one based on color data, which further segments the ground into subclasses. The approach makes predictions based on past observations, and the only underlying assumption is that the sensor is initialized from an area free of obstacles, which is typically violated in VIP assistance applications. Milella featured a radar-stereo system to address terrain traversability assessment in the context of outdoor navigation [45
]. The combination produces reliable results in the short range and trains a classifier operating on distant scenes. Damen also presented an unsupervised approach towards automatic video-based guidance in miniature and in fully-wearable form [47
]. These self-learning strategies make navigation feasible in long-range and long-duration applications, but they ignore the fact that most traversable pixels or image patches are connected rather than detached, a fact that is fully considered in our approach and that also supports an expanded detection range. Aladrén combined depth information with image intensities to robustly expand the range-based indoor floor segmentation [9]. However, the method comprises complex processes and runs at approximately 0.3 frames per second, which is too slow to assist VIP at normal walking speed.
Although much related work has analyzed the traversable area with RGB-D sensors, most of it is overly dependent on the depth image or causes intolerable side effects in navigational assistance for VIP. Compared with these works, the main advantages of our approach can be summarized as follows:
The 3D point cloud generated from the RealSense R200 is adjusted from the camera coordinate system to the world coordinate system with a measured sensor attitude angle, such that the sample errors are decreased to a great extent and the preliminary plane is segmented correctly.
The seeded region growing adequately considers the traversable area as connected parts, and expands the preliminary segmentation result to broader and longer ranges with RGB information.
The seeded region growing starts from preliminarily segmented pixels rather than from pixels chosen according to a random number; thus, the expansion is inherently stable between frames, which means the output will not fluctuate and confuse VIP. The seeded region growing does not rely on a single threshold: edges of the RGB image and depth differences are also considered to restrict growing into non-traversable areas.
The approach does not require the depth image from the sensor to be accurate or dense in the long-range area; thus, most consumer RGB-D sensors meet the requirements of the algorithm.
The sensor outputs valid IR image pairs under both indoor and outdoor circumstances, ensuring the practical usability of the approach.
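The expansion idea summarized in the points above can be sketched as a breadth-first seeded region growing that starts from the preliminarily segmented pixels and is restricted by color similarity, image edges, and depth differences. This is a simplified illustration under stated assumptions, not the authors' implementation: a single intensity channel stands in for the RGB image, the edge map is assumed to be precomputed, a depth of zero is assumed to mark missing data, and all names and thresholds are hypothetical.

```python
from collections import deque


def grow_traversable(seeds, gray, edges, depth, color_tol=10, depth_tol=0.1):
    """Grow a traversable mask from seed pixels given as (row, col) pairs."""
    h, w = len(gray), len(gray[0])
    grown = [[False] * w for _ in range(h)]
    queue = deque()
    for r, c in seeds:  # seeds come from the preliminary segmentation, not a random draw
        grown[r][c] = True
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < h and 0 <= nc < w) or grown[nr][nc]:
                continue
            if edges[nr][nc]:
                continue  # never grow across an image edge
            if abs(gray[nr][nc] - gray[r][c]) > color_tol:
                continue  # appearance changes too abruptly
            both_valid = depth[nr][nc] > 0 and depth[r][c] > 0
            if both_valid and abs(depth[nr][nc] - depth[r][c]) > depth_tol:
                continue  # depth jump where depth is available
            grown[nr][nc] = True
            queue.append((nr, nc))
    return grown
```

Because the depth test applies only where both pixels carry valid depth, the growth can continue into long-range regions where depth is missing, and because the seeds are deterministic, identical inputs always give the same mask.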
In this section, experimental results are presented to validate our approach for traversable area detection. The approach is tested on a score of indoor and outdoor scenarios including offices, corridors, roads, playgrounds, and so on.
Figure 8 shows a number of traversable area detection results in indoor environments. The largely expanded traversable area provides two advantages: firstly, the longer range allows high-level path planning in advance; and, secondly, the broader range allows early awareness of various bends and corners. For special situations, such as color image blurring and under-exposure, the approach still detects and expands the traversable area correctly, as shown in Figure 8
g,h. Additionally, the approach is robust regardless of continuous movement of the cameras as the user wanders in real-world scenes.
Figure 9 shows several traversable area detection results under outdoor circumstances. It can be seen that the traversable area has been enlarged greatly, out to the horizon. Compared with the short-range ground plane, the expanded traversable area frees the VIP to wander in the environment.
To compare the performance of traversable area detection with respect to other works in the literature, the results of several traversable detection approaches on a typical indoor scenario and outdoor scenario are shown in Figure 10
. Given the depth image, the approach proposed by Rodríguez estimated the ground plane based on RANSAC plus filtering techniques [25
]. Figure 10
n is a correct result of detecting the local ground, but the wall is wrongly detected as the ground plane in Figure 10
e, which is one type of sample error mentioned in that paper. This kind of error is resolved in our work by considering the inclination angle of the plane. The approach proposed by Cheng detected the ground with seeded region growing of depth information [15
]. The approach in [15
] projects RGB information onto the valid pixels of depth map, so the detecting result shown in Figure 10
f,o has much noise and many black holes, and the detection range is restricted since the depth information is discrete and prone to inaccuracy at long range. However, the main problem of the algorithm is that the seed pixels are selected randomly, thereby causing intolerable fluctuations that confuse VIP. In our previous works, we only employed depth information delivered by the light-coding sensor of the Microsoft Kinect [14]. However, the sensor outputs a dense 3D point cloud (ranging from 0.8 m to 5 m) indoors and fails in sunny outdoor environments. As a result, the algorithms are unable to perform well when the sensor cannot generate a dense map. In Figure 10
g,p, the idea of using surface normal vectors to segment ground presented in [14
] is able to segment the local ground plane but fails to segment the long-range traversable area robustly, as the estimation of normal vectors requires the sensor to produce dense and accurate point clouds. In this paper, we fully combine RGB information and depth information to expand the local ground plane segmentation to long range. In the process, IR image large-scale matching and RGB image-guided filtering are incorporated to enhance the depth images. The computing time increases from 280 ms to 610 ms per frame on a 1.90 GHz Intel Core Processor, within which the RGB image-guided filtering is hardware accelerated with the HD4400 integrated graphics. In exchange, the range of traversable area detection is expanded to a great extent, and the added computing time enables VIP to perceive traversability at long range and plan routes in advance, so the traversing time eventually declines. Figure 10
h,q shows the results of traversable area detection without IR image large-scale matching and RGB image-guided filtering. The seeded region growing process is unable to enlarge the local ground segmentation based on RANSAC to long-range as the depth map is still discrete and sparse in the distance. Comparatively, in Figure 10
i,r, after IR image large-scale matching and RGB image-guided filtering, the segmented local ground plane largely grows to a longer and broader traversable area. The set of our images is available online at Kaiwei Wang Team [52
The approach is implemented as a multithreaded program including a thread for image acquisition and depth enhancement, a thread for traversable area detection and expansion, and a thread for audio interface generation for the VIP. In total, the average processing time of a single frame is 610 ms on a 1.90 GHz Intel Core 5 processor, so the audio feedback for the VIP refreshes about 1.6 times per second. In addition, the detection rate and expansion error for indoor and outdoor scenarios are presented to demonstrate the robustness and reliability of the approach. Indoor scenarios, including a complicated office room and a corridor, are analyzed, while outdoor scenarios, including school roads and a playground, are evaluated. Typical results of the four scenarios are depicted in Figure 8
a,c and Figure 9
c,i. As depicted in Figure 11
, part of the car has been classified as traversable area, which is a typical example of expansion error.
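The three-thread structure and the refresh rate mentioned above can be sketched with standard queues. This is only a structural sketch: the stage functions passed in are placeholders, the names are hypothetical, and the actual program may organize its threads differently.

```python
import queue
import threading

STOP = object()  # sentinel that shuts the stages down in order


def stage(work, inbox, outbox):
    """Run one pipeline stage: apply `work` to each item until STOP arrives."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))


def run_pipeline(frames, enhance, detect, sonify):
    """Acquisition/enhancement -> detection/expansion -> audio interface."""
    q_in, q_mid, q_audio, q_out = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=stage, args=(enhance, q_in, q_mid)),
        threading.Thread(target=stage, args=(detect, q_mid, q_audio)),
        threading.Thread(target=stage, args=(sonify, q_audio, q_out)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        q_in.put(frame)
    q_in.put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = q_out.get()
        if item is STOP:
            return results
        results.append(item)


def refresh_rate_hz(frame_time_ms):
    """Feedback refresh rate implied by the per-frame processing time."""
    return 1000.0 / frame_time_ms


print(round(refresh_rate_hz(610.0), 1))  # about 1.6 updates per second
```

Each stage is a single thread reading from a first-in, first-out queue, so frames leave the pipeline in the order in which they entered it.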
In order to provide a quantitative evaluation of the approach, the detection rate (DR) is defined in Equation (8) as the number of frames in which the ground has been detected correctly (GD) divided by the number of frames with ground (G). Meanwhile, the expansion error (EE) is defined in Equation (9) as the number of frames in which the traversable area has been expanded to non-ground areas (ENG) divided by the number of frames with ground (G):
$$DR = \frac{GD}{G} \quad (8)$$
$$EE = \frac{ENG}{G} \quad (9)$$
As shown in Table 1
, detection rates of the four scenarios are all above 90%, demonstrating the robustness of the approach. For the scene of the corridor, it yields an expansion error of 15.9%. This is mainly due to inadequate lighting on the corners in the corridor, so the edges of the color image are fuzzy and the traversable area may be grown to the wall. Overall, the average expansion error is 7.8%, illustrating the reliability of the approach, which seldom recognizes hazardous obstacles as safe traversable area.
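The detection rate and expansion error defined above can be computed from per-frame annotations as in the following sketch. The tuple layout, the function name, and the example labels are hypothetical, and counting expansion errors only among frames that contain ground is an assumption; the example data are not results from the paper.

```python
def evaluate(frames):
    """Return (DR, EE) for a list of per-frame annotations.

    Each frame is a tuple (has_ground, ground_detected, expanded_to_non_ground).
    """
    g = sum(1 for has_ground, _, _ in frames if has_ground)
    gd = sum(1 for has_ground, detected, _ in frames if has_ground and detected)
    eng = sum(1 for has_ground, _, expanded in frames if has_ground and expanded)
    if g == 0:
        raise ValueError("no frames with ground")
    return gd / g, eng / g


# Made-up annotations for illustration only.
example = ([(True, True, False)] * 8 + [(True, False, False)]
           + [(True, True, True)] + [(False, False, False)] * 2)
dr, ee = evaluate(example)
print(dr, ee)
```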
Additionally, the average density of depth images of four different scenarios is calculated to show that IR image large-scale matching and RGB image-guided filtering remarkably improve the density of the original depth image from the RealSense sensor. The density of the depth image is defined as the number of valid pixels divided by the resolution. As shown in Table 2
, the average density of the large-scale matched depth image is much higher than the original depth image and the guided-filtered depth image achieves 100% density.
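The density measure defined above reduces to a short computation, sketched below. The convention that invalid pixels are stored as zero is an assumption, and the function name and toy image are hypothetical.

```python
def depth_density(depth):
    """Fraction of valid pixels in a depth image given as a list of rows."""
    total = sum(len(row) for row in depth)
    valid = sum(1 for row in depth for z in row if z > 0)  # zero marks a hole
    return valid / total


# Toy 2x2 depth image in meters with two holes.
print(depth_density([[0.0, 1.2], [0.8, 0.0]]))
```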
RGB-D sensors are a ubiquitous choice to provide navigational assistance for visually impaired people, with good portability, functional diversity, and cost-effectiveness. However, most assisting solutions, such as traversable area awareness, suffer from the limitations imposed by RGB-D sensor ranging, which is short, narrow, and prone to failure. In this paper, an effective approach is proposed to expand ground detection results to a longer and broader range with a commercial RGB-D sensor, the Intel RealSense R200, which is compatible with both indoor and outdoor environments. Firstly, the depth image of the RealSense is enhanced with large-scale matching and color-guided filtering. Secondly, preliminary ground segmentation is obtained by the RANSAC algorithm. The segmentation is combined with an attitude sensor, which eliminates many sample errors and improves the robustness of the preliminary result. Lastly, the preliminary ground detection is expanded with seeded region growing, which fully combines depth, attitude, and color information. The horizontal field angle of the traversable area has been increased from 59° to 70°. Additionally, the expansion gives VIP the ability to predict traversability and plan paths in advance, since the range has been greatly enlarged. The approach is able to see smoothly to the horizon, being acutely aware of the traversable area at distances far beyond 10 m. Empirical evidence from both indoor and outdoor scenarios is provided to demonstrate the robustness of the approach, in terms of image processing results, detection rate, and expansion error. In addition, a user study is described in detail, which shows the approach to be usable and reliable.
In the future, we aim to continuously enhance our navigational assistance approach for the visually impaired. In particular, the implementation of the algorithm is not yet optimized, so we look forward to speeding it up. Additionally, a cross-modal stereo-matching scheme between IR images and RGB images would also be interesting and useful to inherently improve the detecting range and ranging accuracy of the camera.