A CNN-Based Wearable Assistive System for Visually Impaired People Walking Outdoors

Abstract: In this study, we propose an assistive system for helping visually impaired people walk outdoors. The system consists of an embedded computer, the Jetson AGX Xavier (manufactured by Nvidia, Santa Clara, CA, USA), and a binocular depth camera, the ZED 2 (manufactured by Stereolabs, San Francisco, CA, USA). Based on the CNN FAST-SCNN and the depth map obtained by the ZED 2, the image of the environment in front of the visually impaired user is split into seven equal divisions. A walkability confidence value is computed for each division, and a voice prompt is played to guide the user toward the most appropriate direction, so that the visually impaired user can navigate a safe path on the sidewalk, avoid obstacles, or cross the crosswalk safely. Furthermore, the obstacle in front of the user is identified by the YOLOv5s network proposed by Jocher, G. et al. Finally, we provided the proposed assistive system to a visually impaired person and experimented around an MRT station in Taiwan. The visually impaired person indicated that the proposed system indeed helped him feel safer when walking outdoors. The experiment also verified that the system could effectively guide the visually impaired person to walk safely on sidewalks and crosswalks.


Introduction
Scholars estimate that the number of visually impaired people worldwide will increase from 38.5 million in 2020 to more than 115 million by 2050 [1]. Consequently, increased societal and governmental attention will be needed in the future. In particular, the visually impaired need assistive tools when they walk outdoors. White canes and guide dogs are currently the most well-known assistive tools for the visually impaired [2]. Although white canes are cheap and easy to use, they cannot provide important visual information, such as obstacle location, type, and proximity. Visual information is indispensable for environmental perception and movement safety during outdoor navigation [3]. Guide dogs can assist the visually impaired in avoiding obstacles; however, their life span is about 8 to 12 years [4], and the cost of breeding and training a guide dog is very high. Many researchers have developed various assistive devices for the visually impaired [5][6][7]. To safely guide the visually impaired while walking, an assistive tool should be able to recognize the surrounding environment quickly and accurately.
A wearable device for visually impaired people was developed in [5], in which the device used different types of sounds to inform users whether there were obstacles in front of them, and different frequencies and decibels to indicate the locations of the obstacles. In [6], the authors combined wearable glasses with augmented reality technology and integrated a traversable-direction visual enhancement function to help partially sighted people walk safely.

The Main Method
This study designs and implements an assistive system for the visually impaired to walk safely on sidewalks and crosswalks without hitting obstacles. In more detail, the assistive system helps the visually impaired user recognize the surrounding environment, such as sidewalks, crosswalks, and obstacles, and then guides the user to walk in a safe and correct direction.

The Hardware System Configuration
The whole assistive system is wearable and contains an embedded device, the Jetson AGX Xavier launched by Nvidia [22], and a binocular camera, the ZED 2. Additionally, a voice prompt manager is installed on the AGX, and a sound card is connected between the AGX and the headphones to produce voice prompts. The AGX features a Tegra System on Chip (Xavier SoC) comprising 8 NVIDIA Carmel CPU cores (architecture compatible with ARMv8.2) and an integrated GPU based on the NVIDIA Volta architecture with 512 CUDA cores. In addition, it adds Tensor Cores, an NVIDIA deep learning accelerator, a VLIW (very long instruction word) video processor, and an image signal processor (ISP) to enhance AI computing performance. The ZED 2 is a stereoscopic camera designed for tracking vehicles and people, offering excellent depth detection with a 120-degree wide-angle field of view. These functions provide a great sense of safety to visually impaired people. The aim here is to combine depth-of-field detection (up to 20 m) with object detection. The whole assistive system is shown in Figure 1 and is powered by a lithium battery with a capacity of 5200 mAh (Figure 1a). A fully charged battery (16.8 V) can power the system for about 2.5 h. The battery and AGX are installed on a backboard worn by the user, and the ZED 2 is fixed on the brim of the user's cap (see Figure 1b). Moreover, a headset is connected to the sound card, through which the voice instructions are output to the user.

The software structure of the system is shown in Figure 2, in which two deep learning models, Fast-SCNN and YOLOv5s, perform the environment recognition task on the AGX. The result of FAST-SCNN, together with the depth map provided by the ZED 2, identifies the walkable area in front of the user and suggests a walking direction. YOLOv5s with the depth map detects the obstacle ahead and calculates its distance from the user. Moreover, the voice prompt manager produces audio instructions about obstacles and walkability for the visually impaired person. Appl. Sci. 2021, 11, x FOR PEER REVIEW 4 of 17 Figure 2. The software structure of the assistive system.
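The data flow between the two models and the depth map can be sketched as follows. This is a minimal illustration of the per-frame flow in Figure 2, not the authors' code: `segment` and `detect` are hypothetical stand-ins for the Fast-SCNN and YOLOv5s inference calls that would run on the AGX.

```python
import numpy as np

def segment(rgb):
    """Stand-in for Fast-SCNN inference: 1 marks a walkable pixel."""
    return np.ones(rgb.shape[:2], dtype=np.uint8)

def detect(rgb):
    """Stand-in for YOLOv5s inference: (label, x1, y1, x2, y2) boxes."""
    return [("car", 40, 30, 120, 90)]

def process_frame(rgb, depth):
    """One recognition step: mask + obstacles with distance estimates."""
    seg = segment(rgb)                     # walkable-area mask
    boxes = detect(rgb)                    # obstacle bounding boxes
    # Attach a simple distance estimate to each obstacle: the median of
    # the depth values inside its bounding box.
    obstacles = [(label, float(np.median(depth[y1:y2, x1:x2])))
                 for label, x1, y1, x2, y2 in boxes]
    # The mask and obstacle list together drive the voice prompt manager.
    return seg, obstacles
```

In the real system, the segmentation mask feeds the walking-direction selection and the obstacle list feeds the voice prompts, as described in the following subsections.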

The Functions Required by the Visually Impaired People
In this section, we present the main configurations of the system in detail and explain how the assistive system helps visually impaired people walk outdoors. Since the most common areas in which visually impaired people walk are sidewalks and crosswalks, recognition of sidewalks and crosswalks should be done first. Second, we should provide the user with information about the environment ahead, whether it is a completely open area or an area with objects. The third task is to select the correct direction for the visually impaired user to walk forward on the sidewalk or crosswalk. Finally, we recognize the types of obstacles so that the user knows what obstacle is in front of them. The following subsections describe these four tasks in more detail.

Recognition of Outdoor Environment and the Training of FAST-SCNN
To lead the visually impaired user to walk safely on sidewalks and crosswalks, we designed the assistive system to identify the sidewalk or crosswalk in front of the user. The Fast Segmentation Convolutional Neural Network (Fast-SCNN) [23] is a real-time semantic segmentation model for high-resolution image data that offers efficient computation on embedded devices with limited memory. In addition, Fast-SCNN has fewer parameters than other well-known segmentation networks [24]. Therefore, we employed Fast-SCNN as the segmentation model to recognize the outdoor environment, including sidewalks, crosswalks, stairs, and asphalt roads. Here we suppose that only sidewalks and crosswalks are walkable for visually impaired people.
To train the Fast-SCNN, we collected a significant number of photos captured from the perspective of pedestrians on the streets around the test field to form the training dataset. The most common outdoor environments for pedestrians, including sidewalks, crosswalks, stairs, and roads, are all represented in the dataset. To be precise, we collected the training data (available via the link https://github.com/kev72806/TW-NCU-ICIP-Lab-dataset (accessed on 24 October 2021)) on sunny days during daylight hours across four different seasons. The training set includes 22,798 photos, of which 3673 are original captures and the rest are augmented data. Since the images were collected continuously from ZED 2 video, we randomly adopted one-tenth of the total images as the validation set and screened out similar images in the set. Furthermore, we used MATLAB Image Labeler to label our data at the pixel level. The total data distribution is shown in Table 1. The following methods were used for data augmentation:
1. Randomly rotate the image anti-clockwise or clockwise within 5 degrees, crop 80% of the original size from the center, then enlarge it back to the original size;
2. Change the image's brightness by an arbitrary bounded degree;
3. Flip the image horizontally.
The Fast-SCNN network was then trained with these training data. Table 2 shows the data and parameters used in the network training.
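The three augmentation steps can be sketched as below. This is a minimal sketch using Pillow, assuming "80% of the original size" means 80% along each axis and assuming brightness factors in a bounded range around 1.0; the authors' exact parameters beyond the stated 5-degree and 80% limits are not published.

```python
import random
from PIL import Image, ImageEnhance

def augment(img):
    """Return the three augmented variants described in the text."""
    w, h = img.size
    out = []
    # 1. Rotate within +/-5 degrees, crop the central 80%, resize back.
    rotated = img.rotate(random.uniform(-5.0, 5.0))
    cw, ch = int(w * 0.8), int(h * 0.8)
    left, top = (w - cw) // 2, (h - ch) // 2
    out.append(rotated.crop((left, top, left + cw, top + ch)).resize((w, h)))
    # 2. Change brightness by a bounded random factor (range assumed here).
    out.append(ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4)))
    # 3. Horizontal flip.
    out.append(img.transpose(Image.FLIP_LEFT_RIGHT))
    return out
```

Applying these three transforms to each of the 3673 original captures, possibly repeatedly with different random draws, yields the augmented portion of the 22,798-photo training set.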

After training, the Fast-SCNN outputs a segmentation result for the image in front of the user to identify sidewalks, crosswalks, stairs, and asphalt roads in their walking path. Figure 3 shows two examples of semantic segmentation results, where the left and right parts are the original capture and its segmentation result, respectively. The green, blue, grey, and red colors represent crosswalks, sidewalks, asphalt roads, and stairs, respectively. Table 3 lists the performance of Fast-SCNN with pixel accuracy and Intersection over Union (IoU).
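The two metrics in Table 3 can be computed as follows. This is a generic sketch, not the authors' evaluation code, assuming the ground truth and prediction are integer class maps of the same shape.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float((pred == gt).mean())

def class_iou(pred, gt, cls):
    """Intersection over Union for one class of the segmentation."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    if union == 0:
        return float("nan")          # class absent from both maps
    return float(np.logical_and(p, g).sum() / union)
```

Per-class IoU values such as those in Table 3 are obtained by calling `class_iou` once per class label (sidewalk, crosswalk, stairs, asphalt road).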

Depth Map and the Openness Values
Although the FAST-SCNN can recognize the above street environments in front of visually impaired users, it cannot give the distance to the environment ahead of the user. Therefore, the ZED 2 provides a depth map of the image ahead to remedy this deficiency. Take Figure 4 as an example, in which images (a) and (b) are the original color image and its corresponding depth map (output by the ZED SDK), respectively.

Suppose M_d(i, j) is the value corresponding to the pixel (i, j), where (i, j) is the position of the pixel in the depth map, defined as

M_d(i, j) = −∞, if d < 1.6; d, if 1.6 ≤ d ≤ 20; 20, if d > 20,   (1)

where d is the distance (in meters) of the pixel (i, j) in the depth map from the user. Since the ZED 2 is mounted on the user's cap, the ground within 1.6 m in front is invisible; however, if an object located within 1.6 m is tall enough to be seen by the ZED 2, the pixels (i, j) on that object, with distance d < 1.6 m, have M_d(i, j) = −∞. If a pixel has distance 1.6 ≤ d ≤ 20, then M_d(i, j) is the distance itself; if a pixel is more than 20 m away, then M_d(i, j) is set to 20.

Next, we consider an entirely open area, as in the image in Figure 4c, in which there is no object except the ground. Let M_o(i, j) be the value calculated from (1) in this entirely open area (such as image (c)). The darker a pixel in image (c), the closer it is to the user; hence the bottom pixels of the image are close to the user, and the white color of the upper part denotes pixels more than 20 m from the user. We define the openness value of pixel (i, j) as

Open(i, j) = M_d(i, j) / M_o(i, j).   (2)

It is seen that ∑_i ∑_j Open(i, j) equals the total number of pixels when the area ahead is entirely open, since Open(i, j) = 1 at every pixel in that case.
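The depth-value clipping and the openness ratio can be sketched as follows. This is a minimal sketch assuming Open(i, j) = M_d(i, j)/M_o(i, j), i.e., the clipped depth at each pixel relative to the depth seen at the same pixel in an entirely open area.

```python
import numpy as np

def depth_value(d):
    """M_d(i, j) from a raw depth map d (metres), per the clipping rule:
    -inf below 1.6 m, the distance itself in [1.6, 20], and 20 beyond."""
    m = np.minimum(d, 20.0)              # beyond 20 m is treated as 20 m
    return np.where(d < 1.6, -np.inf, m)

def openness(m_d, m_o):
    """Open(i, j) = M_d(i, j) / M_o(i, j); stays -inf for blocked pixels."""
    return m_d / m_o
```

A pixel of a fully open scene thus has openness 1, a distant obstacle yields a value between 0 and 1, and an object closer than 1.6 m propagates −∞ into the later confidence calculation.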

The Selection of Walking Direction
In this study, the semantic segmentation marks the sidewalk and crosswalk, colored blue and green, respectively, as walkable areas. The ZED 2 is mounted on the user's cap; suppose the user's height is H (unit: m). Then the bottom width of the captured image's field of view is about 3.52 × H meters (when the user looks forward). We split the image evenly into seven divisions from left to right (see Figure 5) and number them from 0 to 6. For example, when H is 1.7 m, each division has a width of about 3.52 × H/7 = 0.86 m, which is wide enough for a person to go through. We plan to select the most suitable of these seven divisions as the suggested direction for guiding the visually impaired user.

The Fast-SCNN identifies the walkable areas, and the ZED 2 provides the depth map for the user. Hence, based on the seven divisions, the next task is to guide the visually impaired user to walk in a safe direction. We define two variables, Conf_s and Conf_d, to determine the walkability confidence index of the seven divisions: Conf_s is the confidence from the segmentation result, and Conf_d is the confidence from the depth map. Let us introduce the calculation of Conf_s first. Taking Figure 5 as an instance, we binarize the pixel values in the walkable area to 1 and all others to 0. The closer a position is to the bottom of the image, the closer it is to the user in the actual field. We therefore give pixels closer to the bottom a higher weight, with the weight decreasing linearly from 2 at the bottom of the image to 0 at the top, so that the pixel values are weighted by spatial distance.
Since the image is divided into seven equal divisions, as shown in Figure 5, the average weighted pixel value over all pixels in each division lies within [0, 1] and is regarded as the confidence degree Conf_s of that division, computed from (3):

Conf_s = (1/N) ∑_(i,j) w(i, j) p(i, j),   (3)

where w(i, j) is the weight and p(i, j) is the pixel value at position (i, j) in a division, and N is the total number of pixels in the division. For instance, the third division from the right in Figure 5 contains large green and blue areas, so its confidence degree of 0.79 is the largest among the seven divisions. The rightmost division has only a tiny green area, so its Conf_s = 0.04 is very low. In other words, the direction shown in the third division from the right is the most walkable.
Next, let us consider the calculation of Conf_d. Based on the openness value Open(i, j) in (2), we can calculate a confidence degree from the depth map. In each division, the average weighted openness value is computed from (4) and is regarded as the confidence degree Conf_d of the depth map:

Conf_d = (1/N) ∑_(i,j) w(i, j) Open(i, j).   (4)
If any pixel in a division has M_d(i, j) = −∞, then Conf_d of that division must be −∞. For instance, Figure 6 shows the Conf_d value of each division, and Conf_d = −∞ in the middle division because there is a tree very close to the user. Once we have the values of Conf_s and Conf_d for each division, the walkability confidence degree Conf of the division is computed from (5):

Conf = minimum(Conf_s, Conf_d).   (5)
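The per-division confidence calculation can be sketched as below, assuming Equations (3) and (4) are the weighted averages described in the text, with row weights decreasing linearly from 2 at the bottom of the image to 0 at the top. `seg` is the binarized walkability map and `open_map` holds the openness values Open(i, j).

```python
import numpy as np

def confidences(seg, open_map, divisions=7):
    """Walkability confidence Conf for each of the seven divisions."""
    h, w = seg.shape
    wgt = np.linspace(0.0, 2.0, h)[:, None]  # row weights: 0 at top row, 2 at bottom
    conf = []
    for k in range(divisions):
        cols = slice(k * w // divisions, (k + 1) * w // divisions)
        conf_s = float(np.mean(wgt * seg[:, cols]))          # Equation (3)
        if np.isneginf(open_map[:, cols]).any():
            conf_d = -np.inf     # an object closer than 1.6 m blocks the division
        else:
            conf_d = float(np.mean(wgt * open_map[:, cols])) # Equation (4)
        conf.append(min(conf_s, conf_d))                     # Equation (5)
    return conf
```

For a fully walkable, fully open scene every division scores 1, while a division containing a very close object is forced to −∞, matching the tree example of Figure 6.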

Walking Guide Strategy
After the confidences for seven divisions are obtained, we give a walking guide strategy for the visually impaired user as follows.
Case 1: If Conf > 0.5 in the middle division, then the middle division is chosen, and the voice prompt gives the instruction "go straight." Otherwise, check Conf in all divisions and go to Case 2.
Case 2: Find the division with the highest Conf larger than 0.2 among all divisions, then guide the user toward that division; the voice prompt gives instructions such as "slightly right/left," "right/left," or "go straight."
Case 3: If all Conf values are smaller than 0.2, there is a dead end ahead, and the user must turn around to find another way. This condition also occurs when the user looks up or down. The voice prompt then plays the instruction "dead."
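The three cases can be sketched as a single decision function. This is a minimal sketch for seven confidence values indexed 0 (leftmost) to 6 (rightmost); the mapping from division index to prompt wording is our assumption about how the divisions relate to the "left"/"slightly left" style prompts, since the exact mapping is not spelled out.

```python
# Assumed mapping from division index to voice prompt (division 3 is middle).
PROMPTS = ["left", "left", "slightly left", "go straight",
           "slightly right", "right", "right"]

def guide(conf):
    """Return the voice prompt for a list of 7 division confidences."""
    # Case 1: the middle division is confident enough -> go straight.
    if conf[3] > 0.5:
        return "go straight"
    # Case 2: steer toward the best division if it exceeds 0.2.
    best = max(range(7), key=lambda k: conf[k])
    if conf[best] > 0.2:
        return PROMPTS[best]
    # Case 3: dead end ahead (or the user is looking up or down).
    return "dead"
```

For example, a high middle confidence yields "go straight," a best confidence in division 2 yields "slightly left," and uniformly low confidences yield "dead."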

Remark 1:
If the middle division is selected repeatedly in Case 1 or Case 2, the voice prompt will not play "go straight" again and again, to avoid disturbing the user. For instance, the walking-direction selections for the images in Figure 7 are shown in Table 4, together with the corresponding voice prompts.
The above walking guide strategy is summarized as Algorithm 1 and Figure 8 below. Finally, one more remark about the voice prompts: although both crosswalks and sidewalks are walkable for the visually impaired user, they are different environments that the user should be able to distinguish. When walking on a crosswalk, the user should pay more attention to the surrounding environment. Therefore, one more function is added to the walking guide strategy: when the middle division is chosen and the user walks from the sidewalk (or crosswalk) onto the crosswalk (or sidewalk), the voice prompt alerts the user that the environment has changed.

Table 4. The selected divisions and voice prompts for the images in Figure 7:
Figure 7b: the third division from the left ("slightly left");
Figure 7c: the leftmost division ("left");
Figure 7d: the third division from the right ("slightly right");
Figure 7e: the second division from the right ("right");
Figure 7f: the middle division ("go straight").

Inputs:
The confidence score based on segmentation, Conf_s ∈ R^(1×7). The confidence score based on the depth map obtained from the ZED 2, Conf_d ∈ R^(1×7).

Obstacle Detection and the Training of YOLOv5
However, obstacles may still appear along the walking path even when the visually impaired user is under the guidance of our assistive system, and they should be detected to keep the user out of danger. Compared with the white cane, the proposed system has the capability of identifying obstacles. In this work, a frequently used object detector, YOLOv5 [25], is adopted to perform obstacle detection and recognition, since YOLOv5 is lightweight and computes in real time. In particular, the YOLOv5s model is 27 MB, much smaller than the 245 MB of YOLOv4. Considering the most common objects on the road, we pre-determined six categories of objects that must be detected: motorcycle, bicycle, person, car, truck, and bus.

The training data for YOLOv5s comes from a part of the COCO dataset [26] and street-view captures near the Taoyuan High-Speed Rail station, Taiwan. The distribution of our dataset is shown in Table 5. We used 10,000 images from the COCO dataset as training data; the other 3122 images were collected by ourselves, of which 2534 were added to the training set and the remaining 588 were used for validation. The last column denotes the label count for each specific object: for instance, 38,095 "person" labels means that some pictures have multiple objects labeled "person," so there are 38,095 "person" labels across the 12,534 images. No data augmentation was performed during the training stage because the number of samples in the training data was sufficient. Table 6 shows the data and parameters used in network training. Table 7 shows the object detection results of YOLOv5s, where AP 50:95 in the first column takes IoU threshold values from 0.5 to 0.95, and the best results occur when the threshold is 0.5. The output of YOLOv5s is a bounding box in which the object is identified.
For instance, the bounding boxes shown in Figure 9 are obtained from YOLOv5s, and the number at the top of each bounding box is the confidence value of the object identification. The distance between the user and the object is obtained from the ZED 2. The bounding box obtained from YOLOv5s is cropped to 64% of its original size by shrinking each side of the box by 10% (see Figure 10). We take the median of all pixel depths in the cropped box as the object's distance ahead of the user, and the vertical line through the center of the bounding box as the obstacle's position relative to the user. The walking guide strategy of Section 2.2.4 is then combined with the obstacle detection result to cover both obstacle identification and avoidance. If the obstacle appears in the middle division of the image in front of the user and is close enough, the walking guide strategy guides the user around it. At the same time, a voice prompt announces the type of obstacle being approached; the prompt is played when the obstacle is within 3 m of the user. For instance, the car located in the third and fourth divisions from the right in Figure 11 is 6.43 m away from the user. When the user walks forward until the car is within 3 m, the system plays the prompt "car" to remind the visually impaired user to avoid it. Even when the user encounters an obstacle outside the six trained categories, or one hung at the height of the upper body that YOLOv5s cannot recognize, the system can still avoid it by the method described in Section 2.2.4.
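The distance and position estimation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the NumPy representation of the depth map, and the 0-indexed division numbering (counted from the left) are our assumptions; the 10% per-side crop, the median-depth rule, the seven divisions, and the 3 m prompt range come from the text.

```python
import numpy as np

N_DIVISIONS = 7          # the image is split into seven equal vertical divisions
PROMPT_RANGE_M = 3.0     # announce the obstacle type once it is within 3 m

def locate_obstacle(depth_map, box):
    """Estimate an obstacle's distance and which image division it occupies.

    depth_map: 2-D array of per-pixel depth in metres (e.g. from the ZED 2).
    box: (x1, y1, x2, y2) bounding box from the object detector.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Shrink each side of the box by 10%, keeping the central 64% of its area.
    crop = depth_map[int(y1 + 0.1 * h):int(y2 - 0.1 * h),
                     int(x1 + 0.1 * w):int(x2 - 0.1 * w)]
    distance = float(np.nanmedian(crop))   # median depth = obstacle distance
    center_x = (x1 + x2) / 2.0             # vertical line through the box centre
    division = int(center_x * N_DIVISIONS / depth_map.shape[1])
    return distance, division

def obstacle_prompt(label, distance):
    """Return the label to speak, or None while the obstacle is still far away."""
    return label if distance <= PROMPT_RANGE_M else None
```

With a uniform 6.43 m depth patch, as in the Figure 11 example, `obstacle_prompt("car", 6.43)` returns `None`; the prompt fires only once the measured distance drops to 3 m or less.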
Figure 11. The car appears in the third and fourth divisions from the right.

Experiments
We experimented with the proposed assistive system to guide a visually impaired person walking on sidewalks and crosswalks. Figure 12 shows the visually impaired participant in the experiment. Notably, the prototype of our proposed system was carried on his back. The experiment was performed outdoors around the MRT station near the Taoyuan High-Speed Rail station in Taiwan. The visually impaired user only needs to set the destination, and Google Maps plans the route from his departure point to the destination. In this experiment, the planned walking path is shown by the small blue dots in Figure 13; the route length is about 280 m. The experimental results show that, ignoring the time consumed by the voice prompts, the processing speed of the system is about 6 FPS (frames per second), comprising 0.09 s for segmentation, 0.06 s for object detection and depth measurement, and 0.0145 s for confidence calculation and division selection. This meets the real-time requirement of use.

Each experiment with the visually impaired person took 3-4 h and included at least 7 or 8 rounds. The walking path for an experiment was about 300-1200 m and always included sidewalks and crosswalks. We performed the experiments on three different paths, all around the MRT station. Moreover, even on the same path, we might place different obstacles in different positions in front of the user. One of the results is shown at https://www.youtube.com/watch?v=G5QkhQY7h5M (accessed on 24 October 2021), in which the voice prompts are "right," "left," "slightly left," "slightly right," "go straight," and "poly." The prompt "poly" is played when the user arrives at a specific point in the planned route, and it always appears at the curved section (see Figure 13).
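The reported per-stage times are consistent with the stated frame rate. A quick sanity check of the arithmetic, using the exact figures from the text:

```python
# Reported per-frame processing times from the experiment (seconds).
t_segmentation = 0.09    # Fast-SCNN semantic segmentation
t_detection    = 0.06    # YOLOv5s object detection and depth measurement
t_guidance     = 0.0145  # confidence calculation and division selection

frame_time = t_segmentation + t_detection + t_guidance   # 0.1645 s per frame
fps = 1.0 / frame_time
print(f"{fps:.1f} FPS")  # about 6 FPS, matching the reported processing speed
```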
In the middle of the video, the visually impaired person traversed a street by walking on a crosswalk without a traffic light. In Taiwan, a company called International Integrated Systems, Inc. provides a traffic light platform, "Invignal" [27], which reports the traffic light status and timing for each intersection in northern Taiwan. The experiment using the Invignal platform is still in progress; we will publish the study result once we have a more stable outcome.

Discussions
Having performed many rounds of experiments, the visually impaired user gave us the following feedback: 1. The system is helpful, and the user appreciates that we care about the needs of visually impaired people; 2. The user likes being informed about objects around him, but the prompts should not play so often that they become bothersome; 3. The walking guide strategy gives the user a great sense of security and reduces the number and duration of his orientation training sessions; 4. The user would prefer the system to be lighter and smaller, in case he still needs the white cane; 5. The user needs some time to become familiar with the system and hopes it can become friendlier to use without lengthy training. We appreciate the user's positive feedback and suggestions, which will guide our future work.
In Section 2.2.1, we mentioned that the training data were collected on sunny days during daylight hours to train Fast-SCNN. We have also run some experiments with the trained Fast-SCNN on rainy days and at night, with the following results. The system still works well in cloudy weather with sufficient brightness. However, it almost always fails on rainy days: stagnant water degrades the segmentation of the sidewalk, and even without stagnant water, recognition of the sidewalk and crosswalk is unstable. The system also did not work well at night, since the environment is too dark, and streetlights of varying brightness can confuse the segmentation result. Therefore, for the sake of safety, we suggest that the visually impaired use the proposed system in the daytime and not on rainy days. Making the system feasible on rainy days and at night would require collecting a large amount of extra training data under those conditions; we need more time and effort to study this issue in the future.
We must admit that we did not recruit enough visually impaired participants for the experiments, since it is hard to find a large number of visually impaired people to test within a short time in Taiwan. However, we should add that before each experiment with the visually impaired person, we had several (about 5-6) students in our lab wear blindfolds and perform many experiments in advance (see Figure 14). This paper provides a prototype of the assistive system for visually impaired people and presents a preliminary result. We will continue recruiting more visually impaired users to verify the effectiveness of the proposed system.


Conclusions
This study has proposed a wearable assistive system to help the visually impaired walk safely on sidewalks and crosswalks without hitting obstacles. The semantic segmentation model Fast-SCNN was trained to recognize the user's surrounding environment, and the depth map created by the ZED 2 was used to measure the distance of objects in front of the user. Combining these two sources of environmental information, we developed a walking guide strategy for the visually impaired. Moreover, the object detection model YOLOv5s was trained to detect and identify obstacles. With the aid of the proposed assistive system, the experiment showed that a visually impaired user can walk on sidewalks and crosswalks safely without hitting any obstacles. However, we must admit that if an object appears suddenly in the camera's blind spot within 0.2-1 m in front of the user, the user may not have time to avoid it. In this situation, the white cane is still helpful, so we suggest that the visually impaired person continue using the white cane to maximize safety when using the assistive system. In addition, the study of visually impaired people traversing an intersection with traffic lights is still in progress. The traffic light platform "Invignal," which provides the traffic light status and timing for each intersection in northern Taiwan, will be used in our subsequent experiment, and we will publish the study result once we have a more stable outcome. Furthermore, we will continue experimenting with more visually impaired users to verify the effectiveness of our proposed system. With the limited testers so far, we have completed different experiments many times, and the performance has been nearly consistent.