Towards Safe Visual Navigation of a Wheelchair Using Landmark Detection

This article presents a method for extracting high-level semantic information through successful landmark detection using 2D RGB images. In particular, the focus is placed on the presence of particular labels (open path, humans, staircase, doorways, obstacles) in the encountered scene, which can be a fundamental source of information enhancing scene understanding and paving the path towards the safe navigation of the mobile unit. Experiments are conducted using a manual wheelchair to gather image instances from four indoor academic environments consisting of multiple labels. Afterwards, the fine-tuning of a pretrained vision transformer (ViT) is conducted, and the performance is evaluated through an ablation study versus well-established state-of-the-art deep architectures for image classification such as ResNet. Results show that the fine-tuned ViT outperforms all other deep convolutional architectures while achieving satisfactory levels of generalization.


Introduction
Estimating traversable paths is of crucial importance for the safe and precise indoor navigation of mobile units. Numerous applications in robotics treat traversability estimation as the cornerstone of extracting semantic information for motion planning. Deciding about the navigability of an area depends not only on the terrain's physical properties, such as slope, roughness, and surface condition, but also on the mechanical characteristics of the mobile unit traversing it [1]. Since different environments exhibit varying degrees of uncertainty, the effort to collect and interpret data from various sensor modalities can lead to further difficulties as a result of the type and the volume of data acquired.
Determining traversable paths has an immediate application in building navigation systems for smart and powered wheelchairs. This is because wheelchair users often face maneuvering difficulties [2] when accomplishing daily tasks due to the presence of uneven and rough terrains [3], small corridors and doorways [4], and environments that are described by various levels of stochasticity, e.g., due to the presence of humans. Additionally, staircases have traditionally been problematic due to the geometric threats they exhibit and also because of the difficulties they pose to 3D laser scanners [5].
This work's primary aim was to perform some preliminary experiments to extract high-level semantic information regarding the scene's traversability, based on the landmarks' relative position with respect to the vicinity of a manual wheelchair. The proposed multilabel classification system, using RGB images as input, aimed to efficiently detect the presence of particular labels (open path, humans, staircase, doorways, obstacles). This can be a fundamental source of information enhancing scene understanding. Hence, the data collection process takes into account all the characteristics associated with the object's appearance (geometrical features, volume, environment's illumination, etc.) but also the objects' relative position with respect to the proximity of the wheelchair.
Moreover, the suggested method can be an indispensable component alongside any sensing or control modules that compose the navigation system of the mobile unit. Specifically, we exploit the strengths of a wide-lens camera that can provide valuable insight about whether an object is an obstacle or not, since it captures more angles of the surroundings than a standard lens does. Leveraging the concept of transfer learning, a vision transformer (ViT) [6] is fine-tuned towards performing a multilabel classification on a small dataset with a small number of labels. An initial framework is proposed, which, through the prism of multilabel image classification using wide-lens images, detects important landmarks for safe wheelchair navigation. The focus of the approach is on the relative position of an encountered landmark with regard to the proximity of the mobile unit. The rest of the paper is structured as follows: In Section 2, the related work revolving around the paper's axes of interest is discussed, Section 3 gives an overview of the implemented method, Sections 4 and 5 outline the experimental setup and the performed ablation study, respectively, and ultimately, Sections 6 and 7 discuss the results and the conclusions derived.

Related Work
Mounting sensors at the right locations on a wheelchair's body is of paramount importance for detecting obstacles and performing simultaneous localization and mapping (SLAM) [2]. Vision sensors on wheelchairs have been utilized in an interconnected fashion along with various modalities such as laser [7], ultrasound [8], and tactile sensors [9]. Using wide-lens cameras on wheelchairs has been associated with endeavors in navigation and assistance [10][11][12], as well as object detection and localization [13]. Ultra wide-lens images, such as those obtained from fisheye cameras, have been used in people detection methods [14], robot traversability estimation [15], SLAM [16], pedestrian/vehicle detection and tracking [17], and autonomous driving [18].
Unsupervised learning has shown great potential with transfer learning due to its capacity to learn specific features that can be proven advantageous for the final tuning on the downstream task [19,20].
Contrastive learning approaches are able to create representations that separate similar from dissimilar images in an unsupervised fashion. Thus, they facilitate the task of distinguishing between images, and they have been employed in research works on determining traversable regions [21] and designing local traversability models [22]. It has been shown that transfer learning approaches initially require a dataset of considerable size for the initial training (KITTI [23], ImageNet [24], etc.) before transferring features from a new domain to initialize an existing trained network and thus enhance the levels of generalization performance on new unseen data [25]. Research efforts in exploiting transfer learning involving wheelchairs have explored tasks such as surface detection while using different wheelchair units [26] and sidewalk classification [19].
Using pretrained transformers [27,28] acts as a vital tool in creating rich feature representations that can be utilized for fine-tuning with respect to the pertinent downstream tasks. In the field of mobile robotics, ViTs have shown remarkable performance in extracting semantic information for applications that include terrain classification [29], navigation [30], recognition [31], bird's eye view segmentation [32], and object detection [33]. Furthermore, vision transformers have shown remarkable results on image classification tasks [34][35][36] over methods such as convolutional neural networks (CNNs), as described by Raghu et al. [37]. An important property of a ViT is that it can preserve input spatial information at its higher layers. This is what makes a ViT a more promising direction than ResNet, which is less spatially discriminative. Because the relative position of landmarks in the dataset is the main source of semantic information about the encountered scene, this ability to retain spatial information motivated the choice of the ViT as the backbone of the method.

Method Description
Mobile units that operate according to the traditional sense-plan-control loop rely on vision systems that can accurately understand the environment to detect traversable paths and objects [38]. In this study, a methodology that detects meaningful landmarks and subsequently affects the mobile unit's decision-making is presented. Thus, this approach enhances safe navigation by providing scene information that accounts for the scene's traversability by indicating the presence (or not) of hazardous objects.
The gist of the proposed method relies on the use of a ViT encoder that consists of a sequence of self-attention and feed-forward layers. Specifically, we employed a ViT pretrained on ImageNet-21k using the generative, self-supervised learning method of masked autoencoders (MAE) [39], which has been shown to generalize effectively. The MAE process included the following steps:
• An input image was masked at random locations at a high masking ratio, roughly 75%;
• An encoder (ViT) was applied on the visible parts of the image;
• The decoder operated on both the encoded patches and the masked tokens;
• Missing pixels were reconstructed.
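The random masking step listed above can be sketched in a few lines. This is only an illustrative sketch, not the pretraining code used in this work: the helper `random_patch_mask` is a hypothetical name, and the 16 × 16 patch size is an assumption borrowed from the ViT-base-patch16-224 naming.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists by uniform random sampling."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# Assumption: a 224 x 224 image cut into 16 x 16 patches gives a 14 x 14 grid.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)

# Only the visible patches would be passed to the encoder; the masked ones
# are replaced by mask tokens on the decoder side.
print(len(visible), len(masked))  # 49 147
```

With a 75% ratio the encoder sees only a quarter of the patches, which is what makes this style of pretraining comparatively cheap per image.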
After the pretraining process was complete, the decoder was discarded and the encoder was used for image classification tasks. Masked autoencoders exhibit the potential to learn visual scene semantics in a holistic manner, and thus they can act as a powerful pretraining method for this article's multilabel classification task. They have also shown substantial efficiency in transfer learning tasks such as object detection, instance segmentation, etc. We also experimented with the ViT-base-patch16-224 base model, which was pretrained on ImageNet-21k. This standard ViT was chosen since it could be supported by the available computational resources and could provide a comparison against ViT MAE.
For the supervised fine-tuning, a projection head was used, consisting of two fully connected layers. It was trained on both positive and negative data. The size of the output feature vector of the ViT was 768 × 1, and it was subsequently passed to the projection head that eventually classified the encountered scene with respect to the candidate classes (open path, doorways, staircase, humans, obstacle) (Figure 1). This simple network structure was used to prevent overfitting, given that only a small quantity of annotated data was used. The BCEWithLogitsLoss loss function was employed, which combines a sigmoid layer and the BCELoss in a single class. The reason for selecting this particular version of BCELoss was that its use of the log-sum-exp trick offers improved numerical stability. Since a multilabel classification task was considered, the decision threshold value for each label needed to be carefully selected; the predicted probability of each class label was compared against this threshold to decide whether the encountered scene included that label or not. For the rest of the paper this threshold hyperparameter is denoted as τ. This threshold directly determined how conservative the method was towards the prediction of a certain label.
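To illustrate why computing the loss directly from logits is numerically safer, the following minimal sketch implements a standard stable rewriting of the binary cross-entropy. It is written in plain Python for clarity and is not claimed to mirror PyTorch's internal implementation; the function name `bce_with_logits` and the example values are illustrative assumptions.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Algebraically equal to -[y*log(s(x)) + (1-y)*log(1-s(x))] with
        # s the sigmoid, rearranged so exp() only sees non-positive inputs.
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Hypothetical logits for the five labels of one image, with 0/1 targets.
logits = [2.0, -1.0, 0.5, -3.0, 1.2]
targets = [1.0, 0.0, 1.0, 0.0, 0.0]
print(bce_with_logits(logits, targets))

# An extreme logit stays finite here, whereas applying a sigmoid first and
# then taking log(1 - p) would evaluate log(0).
print(bce_with_logits([1000.0], [0.0]))  # 1000.0
```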

Hardware
Throughout the experimental process, a human operator navigated a standard wheelchair in four different buildings around the University of Texas, Arlington (UTA) campus. Data were recorded using a GoPro HERO10 camera, which recorded at 60 frames per second and was mounted on the wheelchair seat (Figure 2). For each building, the wheelchair was navigated in safe areas such as hallways and doorways, while encountering static (chairs, bins, tables, lockers) or dynamic (humans) obstacles. Moreover, ascending and descending staircases were targeted as additional areas of interest. Despite the fact that the environment was consistently academic, there were some distinct differences among the four buildings appearing in the dataset (Figure 3).

Data Collection and Processing
Data were recorded for approximately 150 min, yielding a dataset of 2704 images. The initial image size was 1920 × 1080 pixels before being resized to 224 × 224 pixels to match the input resolution used during pretraining. All images were manually labeled. The dataset included 2119 single-labeled images and 585 instances that comprised various combinations of the labels (open path, humans, staircase, doorways, obstacles). Among the multilabeled images, 367 instances were described by two labels and 218 instances by three labels. Sets 1, 2, 3, and 4 included 678, 697, 659, and 670 image instances, respectively.

Fine-Tuning
For the conducted experiments, PyTorch (https://pytorch.org/, accessed on 13 December 2022) was used as the backbone framework. Training was done on a machine with two Titan RTX (24 GB GDDR6 RAM, 4608 CUDA cores) GPUs. Horizontal flips were performed as a means to augment the dataset. Training took place for 50 epochs using the BCE loss function, unless an early stopping callback terminated the trial upon observed convergence. Furthermore, the training parameters used were: batch size = 16, learning rate = 0.01, and weight decay = 5 × 10^−4. For the fine-tuning part, all of the transformer's deeper layers were frozen, and the classifier was replaced with two fully connected layers; the last one performed the classification. Layers were fine-tuned using stochastic gradient descent (SGD).

Ablation Study
To evaluate the performance of the proposed fine-tuned method on the custom dataset, an ablation study was conducted. A four-fold cross-validation was performed with three buildings selected for training and the remaining one for testing. The rationale behind folding on the buildings was to exploit the visual dissimilarity between semantically equivalent classes between buildings. This comparison helped us evaluate the ability of the proposed method to generalize beyond learning the visual representations of specific landmarks. Utilizing the same architecture for the projection head, a deep residual network (ResNet) [40] (ResNet50) that had been pretrained on ImageNet-21k was fine-tuned. The classifier was replaced with the projection head for the classification.
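The building-wise folding can be sketched as follows. This is a minimal illustration that assumes each of Sets 1 to 4 corresponds to one building and reuses the per-set image counts reported in the data collection section; the helper name `leave_one_building_out` is hypothetical.

```python
# Image counts per set, as reported in the data collection section.
SET_SIZES = {"Set 1": 678, "Set 2": 697, "Set 3": 659, "Set 4": 670}

def leave_one_building_out(set_names):
    """Yield (train_sets, test_set) pairs, holding out one building per fold."""
    for held_out in set_names:
        train = [name for name in set_names if name != held_out]
        yield train, held_out

for train, test in leave_one_building_out(list(SET_SIZES)):
    n_train = sum(SET_SIZES[name] for name in train)
    print(f"train on {train} ({n_train} images), "
          f"test on {test} ({SET_SIZES[test]} images)")
```

Because no building contributes images to both the training and the test split of a fold, the test score reflects performance on an unseen environment rather than on unseen frames of a familiar one.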
Additionally, a GAN ensemble network was trained following the methodology described by Hirose et al. in [15]. We used the GO Stanford (https://cvgl.stanford.edu/gonet/dataset, accessed on 13 December 2022) dataset and pretrained the network on approximately 75k unlabeled fisheye images. Finally, a small convolutional network was trained, comprising four convolutional and two fully connected layers, each followed by a ReLU activation function except for the final layer. The Hamming loss was chosen as the performance metric (as suggested in [41]) since it only penalizes the individual labels that are predicted incorrectly, and we experimented with different values for τ. For both the fine-tuned ViT and ResNet, the sets that presented the highest (Set 4) and lowest (Set 3) Hamming loss after performing four-fold cross-validation were chosen. The results are shown in Figure 4.
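For reference, the Hamming loss for multilabel predictions is the fraction of individual label slots that are wrong, averaged over all images and labels. The sketch below is a plain-Python illustration with made-up label vectors; the label order in the comment is an assumption.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)
            total += 1
    return wrong / total

# Assumed label order: open path, humans, staircase, doorways, obstacles.
y_true = [[1, 0, 0, 1, 0],
          [0, 1, 0, 0, 1]]
y_pred = [[1, 0, 0, 0, 0],   # misses the doorway, other four slots correct
          [0, 1, 0, 0, 1]]   # fully correct
print(hamming_loss(y_true, y_pred))  # 0.1 (1 wrong slot out of 10)
```

Unlike exact-match accuracy, which would count the first image as entirely wrong, this metric gives credit for the four labels that were predicted correctly.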

Results
This paper's approach depended heavily on landmark detection, as this is crucial to ensure safe wheelchair navigation. The detection of staircases, humans, and miscellaneous static obstacles was prioritized by assigning a lower value for τ, so that these labels are reported even at low predicted probability. Since human motion is governed by uncertainty and it is crucial to act in a conservative manner, given that predictions must err on the side of safety, the best results in terms of human detection were achieved when τ_humans = 0.15. Similarly, the best detection results for staircases, static obstacles, doorways, and open paths were achieved when τ_stairs = 0.17, τ_obstacles = 0.18, τ_doorways = 0.15, and τ_open = 0.80, respectively. Figure 4 presents the results of the ablation study. The fine-tuned ViT MAE outperformed all other networks while displaying high levels of consistency. This was in agreement with results from the literature [6,37], in which a ViT can significantly outperform CNNs in image classification tasks. This argument was also supported by the fact that MAE training includes the notion of learning visual semantics holistically. The ViT-base-patch16-224 network did not demonstrate a significant improvement compared to ResNet50. The GAN's performance was lower, due to the difficulty in training the ensemble's networks with an adequate amount of data, whereas the custom fully supervised CNN did not perform well enough for practical tasks.
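The effect of per-label thresholds can be illustrated with a short sketch. The τ values are those reported above; the logits, the helper names, and the use of a "greater than or equal" comparison are illustrative assumptions rather than details taken from the experiments.

```python
import math

# Per-label decision thresholds reported in the text.
TAU = {"humans": 0.15, "staircase": 0.17, "obstacles": 0.18,
       "doorways": 0.15, "open path": 0.80}

def sigmoid(x):
    # Adequate for the moderate logits used in this illustration.
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, tau=TAU):
    """Return the labels whose sigmoid probability reaches its own threshold."""
    return [label for label, x in logits.items() if sigmoid(x) >= tau[label]]

# Hypothetical logits for one image (illustrative values only).
logits = {"humans": -1.5, "staircase": -3.0, "obstacles": 0.2,
          "doorways": -2.5, "open path": 1.0}
print(predict_labels(logits))  # ['humans', 'obstacles']
```

In this example a human is reported at a probability of roughly 0.18, while an open path at roughly 0.73 is not, which is the conservative behavior the low hazard thresholds and the high open-path threshold are meant to produce.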
The lowest values of the Hamming loss, implying high levels of performance, were observed for Set 3. This was due to the fact that Set 3 displayed a considerable amount of balance with respect to varying illumination and object features. In contrast, Set 4 presented the highest Hamming loss because it was the one with the most distinct features in terms of visual information. Compared to the other sets, Set 4 was significantly more differentiated, including the darkest illumination as well as areas with a dense concentration of bulky objects. The best performance of ViT MAE was achieved when using τ values = (0.15, 0.17, 0.18, 0.15, 0.80) for humans, staircases, static obstacles, doorways, and open paths, respectively. Figure 5 displays a comparison between the Hamming loss obtained by fine-tuning the MAE and ResNet50 on Set 3, which exhibited the best performance. Specifically, the fine-tuned ViT MAE convincingly outperformed a fine-tuned ResNet50, with the performance margin, described by the Hamming loss, widening as the fraction of training data increased. Additionally, it was noticed that even for a small quantity of training data available, ViT MAE's Hamming loss was smaller than that of ResNet50. This showed that ViT MAE could be largely beneficial in scenarios where only a small number of training instances is available. In Figure 6, the recall was examined as observed in Set 3 for the images that included the "humans" label. ViT MAE consistently achieved a recall of around 86% for training sets larger than 40%, while ResNet50 achieved lower performance. Hence, it can be inferred that ViT MAE could sufficiently address the presence of humans in the scene. Overall, the attribute of our dataset that construed an object as an obstacle given its relative position seemed to be exploited to the full extent with the use of a vision transformer pretrained with MAE.
The confusion matrices depicted in Figure 7 provide an illustrative representation of ViT MAE's best performance as noted on Set 3. Overall, the detection performance was high. In addition, the results were consistent across the various labels irrespective of the notable differences among the sets, which were collected in different buildings. This can be attributed to the presence of pretrained self-attention layers along with the property of masked autoencoders to learn visual scene semantics in a comprehensive manner. The aforementioned arguments reinforced the claim that ViTs provide generalizable solutions to the multilabel classification problem for small datasets.
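For clarity, the per-label recall discussed above is the share of images that truly contain a label for which that label was also predicted. The sketch below uses made-up 0/1 vectors for the "humans" label and a hypothetical helper name; it does not reproduce the reported numbers.

```python
def label_recall(y_true, y_pred):
    """Recall for one label: true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Illustrative ground truth and predictions for the "humans" label only.
humans_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
humans_pred = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]
print(round(label_recall(humans_true, humans_pred), 3))  # 0.857 (6 of 7 found)
```

The false positive at the sixth position does not affect recall, which is why a low τ for humans can raise recall at the cost of extra false alarms.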

Conclusions and Future Work
A method that extracted high-level semantic information regarding the scene's navigability through landmark detection was proposed. Experiments were conducted in different indoor environments using a manually driven wheelchair and a wide-lens camera. The results indicated that our multilabel classification method achieved high performance without loss of generalization and enriched scene understanding. Therefore, the proposed approach can act as a preceding step before designing the motion planning (autonomous or not) of a manual wheelchair.
Furthermore, the results showed that fine-tuning a vision transformer could act as a powerful tool for multilabel classification tasks in small datasets. We showed that fine-tuning a vision transformer pretrained with MAE led to stronger performance compared to state-of-the-art deep architectures for image classification such as ResNet. Avenues for further research and improvement involve the utilization and fusion of additional modalities (depth, laser), which, along with RGB images, can lead to a deeper evaluation and understanding of the semantics of the predicted scene labels.