3.1. Hardware Components
Our prototype has very few hardware parts; see Figure 3. The perception component, based on the KR-Vision glasses (KR-Vision Technology, Hangzhou, China), combines an RGB-D camera (Intel, Santa Clara, CA, USA) with bone-conducting headphones (AfterShokz, East Syracuse, NY, USA), which we use to provide feedback. The computing component is a lightweight laptop (Lenovo, Beijing, China) carried in a backpack. The glasses are connected to the laptop through a single USB 3.0 connection. The reduced number of components and cables makes the system ergonomic and easy to use.
Our key perception component, the RGB-D camera, is an Intel RealSense device, model LR200 [44]. RGB-D stands for red, green, blue and depth, and denotes a camera that provides both a color estimate and a distance estimate for each pixel, usually aided by some form of active infrared illumination.
The depth camera technology used in the LR200 employs a hybrid design that combines classical stereo triangulation with the projected-pattern technique. The LR200 uses two infrared cameras to triangulate depth in a classical stereo setup. The camera includes hardware to solve the correspondence problem and directly delivers a per-pixel distance estimate in millimeters.
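For reference, the per-pixel distance follows the standard stereo triangulation relation, where $f$ denotes the focal length of the infrared cameras, $B$ their baseline and $d$ the disparity of a matched pixel:

$$Z = \frac{f \cdot B}{d}$$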
The LR200 also incorporates a laser projector that illuminates the scene with a pseudo-random pattern of dots, in a way analogous to the original Kinect cameras. For purely projected-pattern-based cameras, the pattern is required to solve the correspondence problem and triangulate the distance. In the LR200, however, the projected pattern only plays an assisting role, increasing the amount of texture in the image. This means that the LR200 is able to provide good depth estimates at distances and under illumination conditions in which the projected pattern would not be visible (i.e., outdoors), albeit at a reduced precision. Furthermore, the LR200 suffers no interference if more than one camera is observing the scene.
Regarding the drawbacks of the LR200, its RGB camera has a small lens aperture, which yields poor image quality in low-light situations. We have also observed that the dots projected by the camera can appear as specular freckles in the RGB image, especially in low-light conditions.
We provide feedback to the user by means of bone-conducting headphones integrated into the glasses. These transmit sound to the inner ear through the skull, with the transducer placed on the zygomatic bone (also known as the cheek bone). While their sound quality is generally deemed lower than that of standard headphones, bone-conducting headphones do not obstruct the ears, allowing users to hear the environment around them.
Our software does not have high performance requirements and needs no specific hardware other than an Nvidia GPU to run the deep learning model. For our user tests, we used a 1.9 kg notebook equipped with a Core i7 5500U CPU and a GT840M GPU. This system combines processing power and a battery in one unit and allows for a compact, ergonomic and robust solution for experimentation. However, we expect this system to be deployed on dedicated embedded hardware; to this end, we have also tested the system and evaluated its performance on an Nvidia Xavier [45], which is a compact system powered by an ARM CPU and a powerful Nvidia GPU.
3.2. Software Components
Our prototype uses Ubuntu 20.04 Focal Fossa as the operating system that houses our software components and the Robot Operating System (ROS) [6] to connect them.
These can be divided into three main components, as seen in Figure 4. The module that captures data from the RGB-D camera is implemented in C++, using the librealsense library [46]. The key perception algorithm is based on deep learning and is implemented in Python using PyTorch [47]. The camera interface, post-processing and audio output feedback are implemented in C++ and use OpenCV [48] and OpenAL [49]. To communicate between these components, we use the Robot Operating System (ROS) [6].
The Robot Operating System (ROS) is a communication framework used to connect several software components through a message-passing architecture. ROS is ideal for our use case, as it provides native message types to communicate both RGB images and depth fields. Furthermore, ROS messages provide translation between the C++ data structures obtained from librealsense and the Python data structures required by the PyTorch deep learning module. In addition, ROS handles buffering and synchronization, allowing the system to run as quickly as possible in real time. By using ROS, we avoided the need to translate the original PyTorch model in Python into a C++ equivalent.
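As an illustration of how these messages are consumed, a minimal segmentation node written with the rospy Python client could look as follows; the topic names and the segment() stub are illustrative placeholders, not the actual interfaces of our prototype:

```python
import numpy as np
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()

def segment(rgb):
    # Placeholder for the SwaftNet forward pass: returns a per-pixel class map.
    return np.zeros(rgb.shape[:2], dtype=np.uint8)

def on_color_image(msg):
    # Convert the ROS Image message into a NumPy array usable by PyTorch.
    rgb = bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
    labels = segment(rgb)
    out = bridge.cv2_to_imgmsg(labels, encoding="mono8")
    out.header = msg.header  # keep the capture timestamp for later matching
    label_pub.publish(out)

rospy.init_node("semantic_segmentation")
label_pub = rospy.Publisher("segmentation/labels", Image, queue_size=1)
rospy.Subscriber("camera/color/image_raw", Image, on_color_image, queue_size=1)
rospy.spin()
```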
The data capture module uses the librealsense library to access the LR200 camera and capture two of its streams: the color stream and the depth_aligned_to_color stream. The color stream provides 640 × 480 pixels of RGB data at 30 frames per second, while the depth_aligned_to_color stream provides per-pixel depth estimates, in millimeters, as a 640 × 480 field of 16-bit values, also at 30 frames per second. In this case, the depth field is already aligned on a per-pixel basis to the color image, so no extra translation is needed. The data capture module timestamps both the captured RGB and depth images and sends the RGB image to the semantic segmentation module.
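The capture module itself is written in C++ against librealsense; the following sketch reproduces the same capture loop using the pyrealsense2 Python bindings for illustration only (the alignment helper follows the librealsense 2.x API, which differs from the legacy API used with the LR200):

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.rgb8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align_to_color = rs.align(rs.stream.color)  # per-pixel alignment to the color image

try:
    while True:
        frames = align_to_color.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())  # uint8 RGB
        depth = np.asanyarray(frames.get_depth_frame().get_data())  # uint16, millimeters
        # Both fields are timestamped here and the color image is forwarded
        # to the semantic segmentation module.
finally:
    pipeline.stop()
```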
We use the real-time SwaftNet model, which was previously developed in the DS-PASS system [2], to sense the surroundings; this model is capable of predicting high-resolution semantic segmentation maps both swiftly and accurately. As shown in Figure 5, the SwaftNet architecture is built on an efficient U-shaped structure with channel-wise attention connections based on squeeze-and-excite operations [50]. In this way, the attention-augmented lateral connections help to spotlight spatially rich features from the downsampling path, which enhances the detail sensitivity of the semantic segmentation, a property that is critical for social-distancing detection. Besides this, the spatial pyramid pooling (SPP) module serves to enlarge the receptive field before features are passed through the shallow, lightweight upsampling path for the final pixel-wise classification [2].
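A schematic sketch of such an attention-augmented lateral connection is given below; the channel size and reduction factor are illustrative, not the actual SwaftNet configuration:

```python
import torch
import torch.nn as nn

class SEAttentionConnection(nn.Module):
    """Channel-wise attention (squeeze-and-excite) applied to a lateral skip
    connection before it is fused with the upsampling path."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, skip, decoder):
        b, c, _, _ = skip.shape
        # Squeeze: global average pooling over the spatial dimensions.
        weights = self.fc(skip.mean(dim=(2, 3))).view(b, c, 1, 1)
        # Excite: reweight the skip features channel-wise, then fuse them
        # with the features coming from the upsampling path.
        return decoder + skip * weights

attention = SEAttentionConnection(channels=128)
fused = attention(torch.randn(1, 128, 60, 80), torch.randn(1, 128, 60, 80))
```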
SwaftNet is trained on Mapillary Vistas [3], a street-scene dataset that includes many images captured by pedestrians on sidewalks. In addition, we use a heterogeneous set of data augmentation techniques that are of critical relevance to the generalization capacity in unseen domains [51]. Thereby, the semantic segmentation module performs robustly on the images captured by the glasses for blind people.
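As an illustration only (the concrete augmentation set and its parameters follow [51]), such a pipeline typically combines geometric and photometric transformations applied jointly to the image and its label map:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as F

def augment(image, labels):
    """Illustrative joint augmentation of a PIL image and its label map; the
    transforms and parameters shown here are examples, not the exact set
    used for training."""
    if random.random() < 0.5:  # random horizontal flip
        image, labels = F.hflip(image), F.hflip(labels)
    # Random scaled crop, applied identically to image and labels.
    i, j, h, w = T.RandomResizedCrop.get_params(image, scale=(0.5, 1.0),
                                                ratio=(1.0, 1.0))
    image = F.resized_crop(image, i, j, h, w, (512, 1024))
    labels = F.resized_crop(labels, i, j, h, w, (512, 1024),
                            interpolation=T.InterpolationMode.NEAREST)
    image = T.ColorJitter(0.3, 0.3, 0.3)(image)  # photometric jitter
    return image, labels
```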
The post-processing module receives a timestamped field with labels from the semantic segmentation module and retrieves the depth field with the corresponding timestamp from the data capture module.
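One way to realize this matching with standard ROS tooling is a time synchronizer from the message_filters package; the sketch below is illustrative (topic names are placeholders) rather than the exact mechanism of our implementation:

```python
import message_filters
import rospy
from sensor_msgs.msg import Image

def on_pair(label_msg, depth_msg):
    # Both messages carry the timestamp of the original capture; the
    # post-processing described below starts from this matched pair.
    pass

rospy.init_node("post_processing")
labels = message_filters.Subscriber("segmentation/labels", Image)
depth = message_filters.Subscriber("camera/depth_aligned_to_color/image_raw", Image)
sync = message_filters.ApproximateTimeSynchronizer([labels, depth],
                                                   queue_size=10, slop=0.05)
sync.registerCallback(on_pair)
rospy.spin()
```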
Each processed image produces a single beeping signal. Based on prior work, we fix the signal shape to a pure sinusoidal tone of 20 ms in length. We found that this length is sufficient to be perceived but short enough not to mask ambient noise.
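For illustration, such a beep can be synthesized as a short sine burst; the sample rate and 16-bit PCM format below are assumptions, and in the prototype the resulting buffer is played back through OpenAL:

```python
import numpy as np

def make_beep(freq_hz, volume, duration_s=0.02, sample_rate=44100):
    """Synthesize the 20 ms sinusoidal beep as 16-bit PCM samples."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    tone = volume * np.sin(2.0 * np.pi * freq_hz * t)
    return (tone * 32767.0).astype(np.int16)
```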
As we emit one beep for each processed image, the beeping rate ultimately depends on the processing power of the computing device and is limited to a maximum of 10 beeps per second.
The three parameters we use to modulate the beeping signal are its frequency, its volume and its spatial location. To obtain the corresponding values for these parameters, we apply a light post-processing step. We discard pixels that are not classified as persons, pixels whose distance is not provided by the depth camera, pixels closer than a minimum distance set to 50 cm and pixels further away than a maximum distance set to 150 cm. Of the remaining pixels, we only retain a fixed fraction of those closest to the camera; thus, we focus on the closest person visible. The system is not very sensitive to this threshold, and a wide range of values performs well for the purpose of focusing on the closest person.
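A sketch of this filtering step is given below; the person class id and the keep_fraction threshold are illustrative placeholders, not the values used in the prototype:

```python
import numpy as np

def select_person_pixels(labels, depth_mm, person_id=19,
                         d_min=0.5, d_max=1.5, keep_fraction=0.25):
    """Return a boolean mask of the pixels used for sonification.

    labels   : (H, W) semantic class ids (person_id is illustrative).
    depth_mm : (H, W) uint16 depth in millimeters (0 means no estimate).
    """
    depth_m = depth_mm.astype(np.float32) / 1000.0
    valid = (labels == person_id) & (depth_mm > 0) & \
            (depth_m >= d_min) & (depth_m <= d_max)
    if not valid.any():
        return valid
    # Keep only (roughly) the closest keep_fraction of the remaining pixels.
    cutoff = np.quantile(depth_m[valid], keep_fraction)
    return valid & (depth_m <= cutoff)
```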
The volume is proportional to the number of retained pixels and reaches its maximum level when a sufficiently large fraction of the image pixels is still retained. The stereoscopic sound allows us to render the beep as if it came from a specific direction. The direction of the sonification is calculated by averaging the horizontal image coordinates of all remaining pixels.
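A sketch of this mapping follows; the saturation fraction is an illustrative value, and the pan convention (left to right in [-1, 1]) is an assumption:

```python
import numpy as np

def beep_parameters(mask, max_fraction=0.05):
    """Map the selected-pixel mask to a volume in [0, 1] and a stereo pan
    in [-1, 1]; max_fraction is the (illustrative) share of image pixels at
    which the volume saturates."""
    h, w = mask.shape
    count = int(mask.sum())
    if count == 0:
        return 0.0, 0.0
    volume = min(1.0, count / (max_fraction * h * w))
    xs = np.nonzero(mask)[1]               # horizontal coordinates of retained pixels
    pan = 2.0 * xs.mean() / (w - 1) - 1.0  # image center maps to pan = 0
    return volume, pan
```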
Finally, the frequency of the tone is mapped to indicate urgency. Our system aims to be unobtrusive during most daily activities, but intrusive (even to the point of being annoying) if it finds a person in front of the user who is too close to them. High frequencies are known to be more annoying than low frequencies; thus, we increase the frequency when we consistently detect a person in front of the user for longer periods of time, thereby pushing the user to take action and increase their physical distance. By starting the beeping at a lower frequency, we prevent spurious false detections from being overly inconvenient.
The frequency mapping works as follows. Each selected pixel whose location was not selected in the previous frame starts with a frequency of 220 Hz; this frequency then increases exponentially, at a rate that doubles each second. The frequency saturates at 1760 Hz, which is reached 7 s after a person is first found within the warning range. The final notification tone simply averages the frequencies of all selected pixels.
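A sketch of one possible reading of this ramp, in which the frequency grows exponentially from 220 Hz and saturates at 1760 Hz after 7 s, is given below; the exact growth rate used by the prototype may differ:

```python
def beep_frequency(seconds_tracked, f_start=220.0, f_max=1760.0, t_max=7.0):
    """Exponential frequency ramp for a tracked pixel, clamped at f_max."""
    t = min(max(seconds_tracked, 0.0), t_max)
    return f_start * (f_max / f_start) ** (t / t_max)
```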
An example of the post-processing step can be seen in Figure 6.