Wearable Travel Aid for Environment Perception and Navigation of Visually Impaired People

This paper presents a wearable assistive device in the shape of a pair of eyeglasses that allows visually impaired people to navigate safely and quickly in unfamiliar environments, and to perceive complicated environments so that decisions on the direction to move can be made automatically. The device uses a consumer Red, Green, Blue and Depth (RGB-D) camera and an Inertial Measurement Unit (IMU) to detect obstacles. Because the device leverages the continuity of the ground height among adjacent image frames, it is able to segment the ground from obstacles accurately and rapidly. Based on the detected ground, the optimal walkable direction is computed and the user is informed via beep sounds. Moreover, by utilizing deep learning techniques, the device can semantically categorize the detected obstacles to improve the user's perception of the surroundings. It combines a Convolutional Neural Network (CNN) deployed on a smartphone with depth-image-based object detection to decide what type an object is and where it is located, and then notifies the user of this information via speech. We evaluated the device's performance in experiments in which 20 visually impaired people were asked to wear the device and move in an office, and found that they were able to avoid obstacle collisions and find their way in complicated scenarios.


I. INTRODUCTION
ACCORDING to the World Health Organization (WHO), there are about 285 million visually impaired people all over the world [1]. Visual impairment makes it challenging for them to perform autonomous navigation and environment perception in unfamiliar environments [2], [3]. Over the past years, visually impaired people have usually relied on white canes or guide dogs to detect obstacles. However, they may still suffer injuries from hanging obstacles, such as scaffolding and portable ladders, since white canes and guide dogs can only detect obstacles at heights up to their chests [4], [5]. Recently, Electronic Travel Aids (ETAs) [6] utilizing advanced sensing techniques have greatly improved the travelling experience of visually impaired people. However, ultrasonic sensor based ETAs are poor at obstacle identification due to the wide beam angle of ultrasonic sensors, and laser sensor based ETAs are expensive, heavy and have high power consumption, which makes them unsuitable for wearable applications [4], [7]. Although vision based ETAs (e.g. monocular camera based [8], stereo camera based [9], [10] and Red, Green, Blue and Depth (RGB-D) camera based [11], [12]) have been widely used for assisting visually impaired people in avoiding obstacles and finding obstacle-free paths, some problems still exist. For example, a chair may be considered an obstacle when a blind person is looking for a seat, and no path will be found to enter a room whose door is closed. If object recognition techniques were adopted, however, visually impaired people could find the chair to sit on or open the door to enter the room. Therefore, there is a need to develop new assistive devices that help visually impaired people with navigation and environment perception. This paper proposes a wearable assistive device (see Fig. 1) for visually impaired people's navigation and environment perception.
The presented device detects the ground accurately and rapidly by leveraging an adaptive ground height segmentation algorithm and the ground height continuity among adjacent frames. With the detected ground, the optimal walkable direction can be computed and the corresponding beep sounds are played for blind navigation. Meanwhile, we adopt a lightweight Convolutional Neural Network (CNN) to semantically categorize the detected objects in the RGB image, and then extract the object contours in the depth image to identify where each object is located. All the semantic information is finally converted to speech to inform the user about the perceived surroundings.

II. RELATED WORK
This research builds on important related work on ground detection and object recognition. Because vision based assistive devices have advantages over ultrasonic or laser sensor based devices, as mentioned above, this section focuses on relevant works on vision based ground detection and object recognition. General ground detection algorithms can be divided into scene segmentation based algorithms and surface normal vector estimation based algorithms.
A. Ground Detection
Segmentation: Saitoh et al. [10] proposed a mean-shift algorithm to fit the ground plane. If the angle between the fitted ground and the horizontal plane is less than a threshold, the ground is taken as a traversable area. However, the algorithm is too sensitive to the threshold. Rodriguez et al. [14] proposed a Random Sample Consensus (RANSAC) [15] based ground plane detection method in which potential obstacles are represented by a polar grid. The ground detection error of this method reaches more than ten percent, leading to high obstacle detection errors. The wearability of the system and the integration of its main components into smaller devices should also be improved.
Surface normal vector estimation: Koester et al. [16] presented a gradient and surface normal vector based detection algorithm that computes the accessible sections effectively even in crowded scenes. However, the success rate of detection heavily relies on the quality of the 3-Dimensional (3-D) reconstruction process. Bellone et al. [17] estimated the normal vectors of a local surface via Principal Component Analysis (PCA) and generated an unevenness point descriptor to detect traversable and non-traversable regions. The performance of the method is greatly affected by the search radius, which limits its applications in practice. Aladren et al. [7] used depth and color images to detect the ground at longer distances. Although their system achieved high ground detection accuracy, it is too computationally expensive (2 frames per second) for visually impaired people to navigate in real time. Similar to our work, Imai et al. [18] proposed a ground detection approach that considers both the ground height and normal vectors. However, since the ground height is computed from only one column of the depth image, it is prone to error if an obstacle happens to exist in this column. In contrast, the ground height in our approach is computed more robustly and rapidly by weighting the ground heights in the previous and current frames.

B. Object Recognition
Tapu et al. [21] presented a real-time obstacle detection and classification system for the safe navigation of visually impaired people. The system used Scale Invariant Feature Transform (SIFT) and Features from Accelerated Segment Test (FAST) features to extract points of interest, and used a Support Vector Machine (SVM) with Bag of Visual Words (BoVW) to classify them. Although the system achieved 90% classification precision, information about object distance is not available, which can misguide blind individuals. Lee et al. [11] proposed a robust depth-based obstacle detection system that obtains obstacle information including distance, but it did not employ object recognition techniques, leading to poor environment perception.
Since AlexNet [22] won the ImageNet Challenge ILSVRC 2012 [23], CNN based object detection methods have become unprecedentedly popular. Although these methods achieve higher detection accuracy, they usually have large network sizes and high hardware performance requirements. Kaur et al. [24] proposed a Faster Region-based Convolutional Neural Network (R-CNN) based scene perception system for visually impaired individuals. The system is able to provide the obstacle category and distance information. However, the distance information is obtained only through a single-line laser, making it unavailable or prone to error for some obstacles. Tapu et al. [25] proposed a DEEP-SEE framework that uses both computer vision algorithms and CNNs to detect objects encountered during navigation. Although its recognition accuracy is satisfactory, the heavy computation load makes it difficult to implement on a smartphone.
Lightweight CNN based 2-D object detection algorithms [26]–[30] have made great progress in recent years. They usually provide object category and location information in 2-D images, but distance information is still lacking. As a result, an object painted on the ground may be considered an obstacle, and visually impaired people will be misled.
To overcome the above limitations, we propose a 2.5-D object detection method that provides the object category, distance and orientation information to help visually impaired people travel more easily.

III. SYSTEM DESIGN
As shown in Fig. 2, the system first acquires an RGB image and a depth image from an RGB-D camera, and also obtains the camera attitude angle from an Inertial Measurement Unit (IMU) attached to the camera. Based on these sensor data, a time-dependent adaptive ground detection algorithm is performed to detect the ground. Next, an optimal walkable direction search algorithm is employed to find a direction that visually impaired people can follow. This direction is then converted to beep sounds that give the user instructions. When the user wants to know the surroundings, he can double-tap the smartphone screen to trigger the 2.5-D object detection function. The surrounding information, including the category, distance and orientation of all obstacles, is provided to the user through speech. These algorithms are described in detail in the following sections.
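The control flow above can be sketched as follows. All module names (detect_ground, search_direction, beep, detect_objects, speak) are placeholders standing in for the components described in the following subsections, not the actual implementation:

```python
def assist_step(rgb, depth, attitude, detect_ground, search_direction, beep,
                object_query=False, detect_objects=None, speak=None):
    """One iteration of the assistive pipeline (cf. Fig. 2).

    The concrete modules are injected as callables; all names here are
    hypothetical placeholders for the algorithms of Sections III-A to III-D.
    """
    ground = detect_ground(depth, attitude)   # time-dependent ground detection
    if ground is None:
        beep()                                # cannot move on: keep beeping
        return None
    direction = search_direction(ground)      # optimal walkable direction
    if direction is None:
        beep()                                # blocked: user should turn
    if object_query and detect_objects and speak:
        for obj in detect_objects(rgb, depth):  # 2.5-D object detection
            speak(obj)                          # category, distance, orientation
    return direction
```

The detection modules run every frame, while the object-detection branch runs only when the user double-taps the screen, matching the on-demand design described above.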

A. Time-Dependent Adaptive Ground Detection
Pointcloud reconstruction: As shown in Fig. 3, the camera coordinate system X_cY_cZ_c is centered at the camera, and the positive Z_c-axis, Y_c-axis and X_c-axis are defined as the camera's facing direction, up direction and left direction respectively. The world coordinate system X_wY_wZ_w shares the same origin, and its positive Z_w-axis, Y_w-axis and X_w-axis are the user's facing direction, the vertically upward direction and the left direction respectively. Both coordinate systems are left-handed Cartesian coordinate systems. The pixel value of point p(u, v) in the depth image represents the distance z between point P(x, y, z) and the camera. With the camera attitude angle measured by the IMU, the corresponding 3-D pointcloud in the world coordinate system can be calculated through:

(x_w, y_w, z_w)^T = R(α) R(γ) · z · K^{-1} · (u, v, 1)^T    (1)

where K is the camera intrinsic parameter matrix, (x_w, y_w, z_w) is the reconstructed point in the world coordinate system, and R(α), R(γ) are the rotation matrices corresponding to the camera pitch angle α and the camera roll angle γ. Fig. 3. Coordinate system transformation.
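A minimal back-projection sketch of this step is given below. The intrinsic matrix values and the rotation order (pitch about the X axis, then roll about the Z axis) are illustrative assumptions, as the paper does not list them:

```python
import numpy as np

def pixel_to_world(u, v, z, K, pitch, roll):
    """Back-project depth pixel (u, v) with depth z into the world frame.

    The rotation order (pitch about X, then roll about Z) is an assumption
    for illustration; the device's exact convention may differ.
    """
    p_cam = z * np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-frame point
    a, g = np.radians(pitch), np.radians(roll)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0],
                   [np.sin(g),  np.cos(g), 0],
                   [0, 0, 1]])
    return Rx @ Rz @ p_cam

# Hypothetical intrinsics (fx, fy, cx, cy) for a 640x480 depth camera.
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
```

For a pixel at the principal point with zero attitude, the result lies on the optical axis at the measured depth, which is a quick sanity check on the intrinsics.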
Coarse ground fitting: As shown in Fig. 4, the initial ground height threshold T_Yroi is calculated adaptively using the OTSU algorithm [31] on the current frame. Since the change of the ground height between two adjacent frames is usually limited, the ground height T_Ypre of the previous frame is used to reduce the perturbation of other planes (e.g. desk, sofa) (see Fig. 9). The final ground height threshold is computed as:

T_Y = λ · T_Yroi + µ · T_Ypre    (2)

where λ and µ are the weights.
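The adaptive thresholding step can be sketched as follows. The Otsu implementation below operates on a 1-D histogram of candidate ground heights, and the weights lam and mu are illustrative values, not those used by the authors:

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method on a 1-D sample (e.g. candidate ground heights)."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float)
    total = hist.sum()
    sum_all = (hist * edges[:-1]).sum()
    best_t, best_var = edges[0], -1.0
    w0, sum0 = 0.0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += hist[i] * edges[i]
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, edges[i + 1]
    return best_t

def ground_height_threshold(t_roi, t_pre, lam=0.4, mu=0.6):
    """Blend the current Otsu estimate with the previous frame's height.

    lam and mu are assumed weights; the paper does not give their values.
    """
    return lam * t_roi + mu * t_pre
```

Weighting in the previous frame's height damps sudden jumps of the threshold when another plane (a desk or sofa surface) dominates the current histogram.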
Due to the inherent limitations of the depth camera, the depth accuracy drops as the distance increases. Besides, obstacles that are too far away from the person do not need to be considered. Therefore, only the points within a distance threshold T_Z are used, in order to reduce the computation cost. Using the ground height threshold T_Y and the distance threshold T_Z, the 3-D points for fitting the coarse ground are selected as:

P_init = {p(x_w, y_w, z_w) | y_w ≤ T_Y, z_w ≤ T_Z}    (3)

Then the coarse ground is fitted with the RANSAC algorithm [15], and represented as:

A·x_w + B·y_w + C·z_w + D = 0    (4)

Ground refinement: The normal vector n_ground of the coarse ground plane can be obtained directly from Eq. 4, and the ground pitch angle φ is then computed through:

φ = arccos( (n_ground · n_xoz) / (‖n_ground‖ ‖n_xoz‖) )    (5)

where n_xoz is the normal vector of the plane X_wOZ_w. According to the ground pitch angle and the empirical slope angle, the coarse ground is classified as one of four types: horizontal, upslope, downslope and non-ground. If it is non-ground, the visually impaired person is directly informed that he cannot move on; otherwise, the coarse ground is refined with the unevenness tolerance σ through:

F = {p ∈ P_init | dist(p, F_init) ≤ σ}    (6)

where dist(p, F_init) is the distance from point p to the coarse ground, and F is the final 3-D point cloud of the refined ground.
Finally, the refined ground height H is obtained as the mean height of the refined ground points:

H = (1/|F|) Σ_{p(x_w, y_w, z_w) ∈ F} y_w    (7)

and it will be used in the next frame.
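The coarse fitting and refinement steps above can be sketched with a plain RANSAC plane fit. The inlier tolerance, the unevenness tolerance sigma and the use of the mean y-coordinate as the refined height H are assumptions for illustration:

```python
import numpy as np

def fit_plane_ransac(points, iters=200, tol=0.02, rng=None):
    """RANSAC plane fit over an (N, 3) point array.

    Returns (n, d) with unit normal n such that n . p + d = 0 on the plane.
    tol is an assumed inlier distance in metres.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < tol
        if inliers.sum() > best_inliers:
            best_inliers, best_plane = inliers.sum(), (n, d)
    return best_plane

def refine_ground(points, plane, sigma=0.05):
    """Keep points within the unevenness tolerance sigma of the coarse plane."""
    n, d = plane
    F = points[np.abs(points @ n + d) < sigma]
    H = F[:, 1].mean()           # refined ground height = mean y_w (assumption)
    return F, H
```

With a mostly planar point set, the fitted normal is (close to) vertical and the refined height converges to the true ground elevation despite outliers from low obstacles.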

B. Optimal Walkable Direction Search
If no ground is detected, the system directly informs the visually impaired person and skips the optimal walkable direction search. Otherwise, the optimal walkable direction is calculated as follows.
Since the walkable direction relies on the detected ground, only the 3-D points relevant to the ground plane need to be considered for the walkable direction search. These points are selected based on F, the set of 3-D points on the ground plane, the ground height H, the plane parameters A, B, C and D defined in Eq. 4, and a constant ε that prevents the person from colliding with overhanging obstacles.
Then the 3-D points P are projected onto the plane X_wOZ_w (see Fig. 5), and the nearest point in each sector (each sector is a sub-region in Fig. 5 with an angle of 0.5°) can be easily obtained. The award of each sector is computed from the sector angle θ, the passable width w_sw (greater than the person's body width), the z-axis coordinate z_i of the nearest point in sector i, the weights α and β, and the total number of sectors N. This award function ensures that the person moves toward the direction with a smaller turn angle and a longer traversable distance.
Next, the optimal walkable direction is obtained from the sector with the maximum award, where τ is a distance threshold, z_argmax_i(award[i]) is the nearest distance in the sector with the maximum award, i_max is the index of that sector, and N is the total number of sectors. If z_argmax_i(award[i]) is less than the small value τ, the risk of colliding with obstacles is considered very high. In that case, the optimal walkable direction does not exist, and the system informs the user to turn left or right with a large angle to search for a walkable direction. If z_argmax_i(award[i]) is larger than τ, the optimal walkable direction is the one corresponding to the sector with the maximum award. If the turning angle is very small (e.g. |γ| ≤ 5°), the system directly informs the user to go straight, to prevent him from vacillating to the left and right.
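A simplified sketch of the sector-based search is shown below. The award combines a turn-angle term and a free-distance term as the text describes, but the exact weighting, the field of view and the passable-width check w_sw are simplifications of ours:

```python
import numpy as np

def walkable_direction(points_xz, n_sectors=120, fov=60.0,
                       alpha=1.0, beta=0.5, tau=0.8, max_range=5.0):
    """Pick the sector with the best award; returns an angle in degrees
    (0 = straight ahead) or None when every candidate is too close.

    points_xz: (N, 2) obstacle points projected onto the X_w O Z_w plane.
    All parameter values are illustrative assumptions.
    """
    half = fov / 2.0
    angles = np.degrees(np.arctan2(points_xz[:, 0], points_xz[:, 1]))
    dists = np.linalg.norm(points_xz, axis=1)
    nearest = np.full(n_sectors, max_range)       # free up to the max range
    idx = ((angles + half) / fov * n_sectors).astype(int)
    ok = (idx >= 0) & (idx < n_sectors)
    for i, d in zip(idx[ok], dists[ok]):
        nearest[i] = min(nearest[i], d)           # nearest obstacle per sector
    centers = -half + (np.arange(n_sectors) + 0.5) * (fov / n_sectors)
    # Award: smaller turn angle and longer traversable distance score higher.
    award = alpha * (1 - np.abs(centers) / half) + beta * nearest / max_range
    i_max = int(np.argmax(award))
    if nearest[i_max] < tau:                      # too close: no safe direction
        return None
    return centers[i_max]
```

When the path ahead is free the chosen angle stays near zero, matching the rule that near-zero turn angles are reported as "go straight".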

C. 2.5-D Object Detection
CNN based object detection: As MobileNet V2 [32] achieves relatively better results than other networks [26], [28]–[30] by using depth-wise separable convolutions, and has lower computational complexity and a smaller model size, it is utilized here for object detection. Training is performed on the COCO dataset, which includes 91 classes, such as person, car, bus and chair. These classes are sufficient to give visually impaired people a general perception of their surroundings. However, 2-D object detection cannot be directly used for blind navigation due to the lack of object distance information. For example, if an object painted on the ground is recognized as an obstacle, it will lead visually impaired people to make an incorrect decision. Therefore, we also use depth-image-based object detection to solve this problem.
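The complexity advantage of depth-wise separable convolutions can be illustrated by counting multiply-accumulate operations (MACs): a k × k depth-wise pass plus a 1 × 1 point-wise pass costs roughly 1/c_out + 1/k² of a standard convolution. The layer sizes below are arbitrary examples:

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

def dw_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depth-wise (k x k per channel) + point-wise (1 x 1) pair."""
    return h * w * c_in * k * k + h * w * c_in * c_out
```

For a 3 × 3 layer with 128 output channels, the separable form needs about 12% of the standard cost, which is why such networks fit a smartphone's compute budget.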
Depth image based object detection: The detected ground (see Section III-A) is first removed from the depth image. Then a morphological closing operation is performed to merge the small objects. Next, the external contours of the obstacles are extracted and their areas are computed. If an area is less than the threshold S, the corresponding obstacle is merged into its nearest obstacle or treated as noise; otherwise, the obstacle location is obtained as follows: 1) compute the moments of each contour, from which the contour centroid can be obtained using the zero-order and first-order moments; 2) from the contour centroid and the camera intrinsic parameter matrix K in Eq. 1, represent the obstacle location as (θ_pitch, θ_yaw, z) via Eq. 11, where center(x, y) is the contour centroid and z_center is the depth value at the contour centroid.
Combination: The objects detected by MobileNet V2 in the RGB image can be easily mapped to the depth image according to the camera calibration matrix [R, t]. Then the intersection area C between the mapped area A and the detected contour B is calculated (see Fig. 6). If S_C / max(S_A, S_B) (S_i represents the area of region i) is greater than a threshold ζ (e.g. 0.7), the mapped area and the detected contour are considered the same object. The object distance is then the minimum non-zero depth value within the intersection area C, and the object orientation relative to the user can be obtained with Eq. 11. In this way, the key surrounding information, including the obstacle category, distance and orientation, can be provided.
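The combination step can be sketched as follows, with the mapped CNN area and the depth contour represented as boolean masks. The threshold ζ = 0.7 follows the text, while the mask representation is our simplification:

```python
import numpy as np

def fuse_detection(box_mask, contour_mask, depth, zeta=0.7):
    """Fuse a CNN bounding-box mask (A) with a depth-contour mask (B).

    Returns the object distance (minimum non-zero depth inside the
    intersection C) when S_C / max(S_A, S_B) exceeds zeta, else None.
    """
    inter = box_mask & contour_mask
    s_a, s_b, s_c = box_mask.sum(), contour_mask.sum(), inter.sum()
    if s_a == 0 or s_b == 0 or s_c / max(s_a, s_b) <= zeta:
        return None                     # not the same object
    d = depth[inter]
    d = d[d > 0]                        # ignore invalid (zero) depth pixels
    return float(d.min()) if d.size else None
```

Requiring a large overlap ratio discards accidental alignments, such as a painted marking overlapping an unrelated depth contour.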

D. Audio Feedback
Navigation feedback: Beep sounds are used to provide the navigation information to the user. When the visually impaired person encounters an obstacle that blocks his way, the system keeps beeping to warn him that he cannot go straight. In that case, he should turn left or right to search for the optimal walkable direction. When the beep sound stops, the user can continue moving.
2.5-D object detection feedback: When the visually impaired person walks in a relatively complicated environment, the 2.5-D object detection function can be activated by double-tapping the smartphone screen, and the results are converted to speech via a text-to-speech module and broadcast to the user. An example of the 2.5-D object detection feedback is shown in Fig. 7, in which the object category (e.g. chair), distance and orientation are spoken to the user.
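A possible rendering of the spoken message is sketched below. The wording and the 10° "ahead" band are illustrative choices of ours, and positive yaw is assumed to mean the user's left (the X_w axis points left):

```python
def feedback_sentence(category, distance_m, yaw_deg):
    """Compose the text-to-speech string for one detected object.

    The phrasing is illustrative, not the paper's exact wording; the sign
    convention (positive yaw = left) is an assumption.
    """
    if abs(yaw_deg) < 10:
        side = "ahead"
    elif yaw_deg > 0:
        side = "to your left"
    else:
        side = "to your right"
    return f"{category}, {distance_m:.1f} meters, {side}"
```

Keeping the message short (category, distance, side) matters because each spoken sentence delays the next navigation beep.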

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
The performance of the proposed system is evaluated in real scenarios. First, the ground detection is tested, since it plays an important role in the whole system. Then, 20 visually impaired persons are recruited to test the navigation and object detection functions. We follow the protocol approved by the Beijing Fangshan District Disabled Persons' Federation for recruitment and experiments. All participants consented that the results (including data, images and videos) could be published anonymously. Finally, the computational cost is evaluated to test the real-time performance.

A. Experiments on Ground Detection
To evaluate the proposed ground detection method quantitatively, we manually labelled a random sample of 1000 images captured in indoor scenarios to obtain the ground truth. Fig. 8 shows the precision of the detected ground at different distances and different Intersection over Union (IOU) percentages. The IOU percentage is computed as:

IOU = N_∩ / (N_detected + N_groundtruth − N_∩)

where N_∩ is the number of overlapping pixels between the detected ground and the ground truth, N_detected is the number of detected ground pixels, and N_groundtruth is the number of ground truth pixels. As shown in Fig. 8, compared with the RANSAC algorithm, the proposed method has a higher precision in all scenarios. Some comparative examples are shown in Fig. 9. By taking advantage of the ground height continuity among adjacent frames, the proposed method is more robust to the interference of other planes (such as sofas and tables) than the RANSAC algorithm. It provides a solid base for the subsequent object detection.
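The IOU above can be computed directly on binary masks; this sketch assumes the detected ground and the ground truth are given as boolean images of equal size:

```python
import numpy as np

def iou(detected, truth):
    """Pixel-level IOU between a detected-ground mask and the ground truth."""
    inter = np.logical_and(detected, truth).sum()
    union = detected.sum() + truth.sum() - inter   # N_detected + N_truth - N_inter
    return inter / union if union else 0.0
```

A perfect detection yields 1.0, and the measure penalizes both missed ground pixels and spurious ones, which is why it is preferred over raw pixel accuracy here.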

B. Experiments on Real Scenarios
Navigation task: Three paths in an office (see Fig. 10) are selected to evaluate the navigation performance. These paths contain many obstacles, both static and moving. Then 20 visually impaired people who were not familiar with these environments are asked to navigate along these paths with the help of either a white cane or the proposed device. They are trained for about 10 minutes to get familiar with our system. The average walking time and the number of collisions with obstacles are recorded (see TABLE I). The visually impaired people spend less walking time using our proposed system than using the white cane, which shows that the proposed system has higher navigation efficiency in unfamiliar environments. As shown in TABLE I, the users have more collisions with obstacles when using a white cane. This is because they usually use the white cane to detect the obstacles on the ground rather than those hanging in mid-air, and it is observed that almost all users collided with the desk or the hanging objects on the paths C→D and E→F (see Fig. 11). With the proposed device, in contrast, they are able to avoid collisions with those obstacles. This shows that the proposed system is more secure for visually impaired people.
2.5-D object detection task: Some complicated scenarios are designed to test whether the 2.5-D object detection can help visually impaired people travel more efficiently. Examples include a scenario in which the way is blocked by two chairs (see Fig. 12) and crowded with other obstacles. With only the navigation function of our proposed system, the user cannot find the moving direction after multiple searches and will turn back. However, if he activates the 2.5-D object detection and perceives more information about the surroundings, he is able to move the chair and find the moving direction. This prevents the visually impaired people from taking detours and improves their environmental perception. It shows that the 2.5-D object detection indeed helps visually impaired people travel more efficiently and brings them a better travel experience. Fig. 11. Collisions when the visually impaired people use a white cane.

C. Computational Cost
All algorithms are implemented on a smartphone with a Qualcomm Snapdragon 820 CPU at 2.0 GHz and 4 GB of RAM. The average computational time of the proposed system is shown in TABLE II. The image acquisition, ground detection, optimal walkable direction search and 2.5-D object detection cost about 0.66 ms, 13.53 ms, 7.19 ms and 114.13 ms respectively. The total time of all algorithms excluding the 2.5-D object detection is about 27.17 ms on average. Since the 2.5-D object detection is only activated when the visually impaired people want to know the surrounding information or enter a complicated scenario, the proposed system is able to provide real-time assistance for users' daily traveling.

V. CONCLUSION
This paper presents a wearable device which provides navigation and object detection assistance for visually impaired people. To provide reliable navigation, a ground detection algorithm that uses the ground height continuity between two adjacent frames is presented. An optimal walkable direction search method is then developed to determine the moving direction. To improve the environmental perception ability of visually impaired people, a 2.5-D object detection function is presented. Audio feedback is used to inform the visually impaired people of both the navigation instructions and the object detection results. Experimental results show that the proposed system can help visually impaired people travel efficiently and brings them a better traveling experience. In the future, small-size obstacle detection will be considered.