Multimodal Navigation Systems for Users with Visual Impairments—A Review and Analysis

Multimodal interaction refers to situations where users are provided with multiple modes for interacting with systems. Researchers are working on multimodality solutions in several domains. The focus of this paper is within the domain of navigation systems for supporting users with visual impairments. Although several literature reviews have covered this domain, none has synthesized the research on multimodal navigation systems. This paper provides a review and analysis of multimodal navigation solutions aimed at people with visual impairments. This review also puts forward recommendations for effective multimodal navigation systems. Moreover, this review also presents the challenges faced during the design, implementation and use of multimodal navigation systems. We call for more research to better understand the users' evolving modality preferences during navigation.


Introduction
Navigation is an essential activity in human life. Montello [1] describes navigation as "coordinated and goal-directed movement through the environment by a living entity or intelligent machines." Navigation requires both planning and execution of movements. Several works [2][3][4][5] divide navigation into two components: orientation and mobility. Orientation refers to the process of keeping track of position and wayfinding, while mobility refers to obstacle detection and avoidance. Hence, effective navigation involves both mobility and orientation skills.
Several studies have documented that people with visual impairments often find navigation challenging [6][7][8]. These challenges may include issues with cognitive mapping, lack of access, spatial inference and updating [3,5,6], to mention a few. More complex spatial behaviors, such as integrating local information into a global understanding of layout configuration (e.g., a cognitive map); determining detours or shortcuts; and re-orienting if lost, can be performed only by a person with good mobility and orientation skills. These skills are critical for accurate wayfinding, which involves planning and determining routes through an environment. People with visual impairments may lack those skills and consequently may struggle to navigate successfully [5]. Obstacles can be avoided effectively using conventional navigation aids, such as a white cane or a guide dog [9]. However, these aids do not provide vital information about the surrounding environment. Giudice [6] describes that it is difficult to gain access to environmental information without vision, yet it is essential for effective decision making, environmental learning, spatial updating and cognitive map development. Moreover, visual experiences play a critical role in accurate spatial learning, for the development of spatial representations and for guiding spatial behaviors [3]. Giudice [6] claims that spatial knowledge acquisition is slower and less accurate without visual experience.
The studies conducted by Al-Ammar et al. [10] show that to improve the navigation accessibility of users with visual impairment, one needs to enable navigation as an independent and safe activity. Conventional navigation aids such as white canes and guide dogs have a long history [11,12]. Studies have also shown that there are limitations associated with such conventional tools [11]. To improve upon conventional navigation aids, several navigation systems have been proposed that use different technologies [13][14][15]. They are designed to work indoors, outdoors or both [15] and rely on certain technologies [2,16]. The World Health Organization (WHO) defines such tools collectively as "assistive technology" [17]. WHO further points out that assistive technology products maintain or improve an individual's functioning and independence by nurturing their well-being. Hersh and Johnson [11] elaborated that assistive navigation tools for users with visual impairments have the potential to describe the environment such that obstacles can be avoided. Different devices and systems have been proposed for navigation support to users with visual impairments. These devices and techniques can be divided into three categories [9]: electronic travel aids (ETAs), electronic orientation aids (EOAs) and position locator devices (PLDs) [18]. ETAs are general devices to help people with visual impairments avoid obstacles. ETAs may have sensing inputs such as depth cameras, general cameras, radio frequency identification (RFID), ultrasonic sensors and infrared sensors. EOAs help visually impaired people navigate in unknown environments. These systems provide guiding directions and obstacle warnings. PLDs help determine the precise position of a device, and use technologies such as the Global Positioning System (GPS) and geographic information systems (GISs). Lin et al. [18] give a detailed explanation of these categories.
In this paper, the term "navigation system" is used to denote any tool, aid or device that provides navigation support to users with visual impairments. In addition, we use the terms "users" and "target users" to denote "users with visual impairments".
Researchers have been exploring the navigation support applications of emerging technologies for several decades [19]. Advancements in computer vision, wearable technology, multisensory research and medicine have led to the design and development of various assistive technology solutions, particularly in the domain of navigation systems for supporting users with visual impairments [14,20]. Ton et al. [21] observed that research had explored a wide range of technology-mediated sensory substitution to compensate for vision loss. The developments in artificial intelligence (AI), such as object detection using machine learning algorithms and location identification using sensors, can be exploited to understand the environment during navigation. The developments in smartphone technologies have also opened up new possibilities in navigation system design [22][23][24]. One key challenge is how to communicate the information in a simple and understandable form to the user, especially as other senses (touch, hearing, smell and taste) have lower bandwidths than vision [13]. Therefore, effective communication of relevant information to the users is a major requirement for such navigation systems.
Bernsen and Dybkjaer [25] define the term modality in the human-computer interaction (HCI) domain as a way of representing or communicating information in some medium. The term "multimodality" refers to the use of different modalities together to perform a task or a function [26,27]. Modalities are typically visual, aural or haptic [28]. Navigation systems that use different modes to communicate with the user are called multimodal navigation systems [29]. Several multimodal systems have been proposed to assist users with navigation [30][31][32]. Many prototypes have been reported without much practical evaluation involving target users [30]. It is therefore uncertain whether these proposals offer any actual benefits to users. A few studies have also been published with convincing validation involving the users [33,34].
Several surveys have addressed navigation systems designed for users with visual impairments [13][14][15][35]. Some focused on the types of devices or technology, while others focused on the environments of use. To the best of our knowledge, no systematic surveys have addressed multimodal navigation systems. This paper, therefore, provides an overview of the major advances in multimodal navigation systems for supporting users with visual impairments. This paper is organized as follows. Section 2 presents the general theory of multimodality. Section 3 gives a brief overview of the application of multimodality in navigation systems and also describes the methodology we used for this review. Section 4 discusses the multimodal navigation systems and their affiliated studies. Section 5 summarizes the challenges and presents a set of recommendations for the design of a multimodal navigation system for people with visual impairments. The paper concludes in Section 6.

Multimodality in Human-Computer Interaction
In the context of HCI, a modality can be considered as a single sensory channel of input and output between a computer and a human [28,36]. A unimodal system uses one modality, whereas a multimodal system relies on several modalities [36]. Studies have shown that multimodal systems can provide more flexibility and reliability compared to unimodal systems [37,38]. Oviatt et al. [39] elaborate on the possible advantages of a multimodal interaction system, such as freedom to use a combination of modalities or to switch to a better-suited modality. Designers and developers working with HCI have also tried to utilize different modalities to provide complementary solutions to a task that may be redundant in function but convey information more robustly to the user [40,41]. Based on the perception of information, modalities can be generally defined in two forms: human-computer (input) and computer-human (output) [27]. During the interaction, the available input modalities are utilized by the user to communicate with the system, and the system uses several output modalities to communicate back to the user [42].
Computers utilize multiple modalities to communicate and send information to users [43]. Vision is the most frequently used modality, followed by audio and haptic. Haptic communication occurs through vibrations or other tactile sensations. Examples of touch-based (haptic) modality channels include smartphone vibrations. The other modalities such as smell, taste and heat are less used in interactive systems [44]. Audio offers the benefits of rich interaction experiences depending on the context of use and helps provide more robust systems when used in combination with other modalities [45]. Such redundancies are used when a user wants to communicate with a system via voice while driving a car without taking the hands off the steering wheel.
Epstein [46] and Grifoni et al. [47] have shown that with the increasing use of smartphones and other mobile devices, users are becoming more comfortable in experimenting with different new modalities. After the introduction of voice assistants such as Siri, Alexa, Cortana and Google Home, some users began to use voice assistants as an alternative way to communicate with computers and other digital devices [48,49]. This epitomizes how certain modalities with contrasting strengths are useful in various situations [50]. Some other modalities such as computer vision can be utilized to capture three-dimensional gesticulations using depth cameras, such as Microsoft Kinect [44].
Multimodal systems have the potential to increase accessibility to users by relying on different modalities. Due to the benefits of using multimodal inputs and outputs, multimodal fusion is also used in various applications to support user needs [51]. The process of integrating information from multiple input modalities and combining them into a specific format for further processing is termed multimodal fusion [52,53]. To allow their interpretation, a multimodal system must recognize different input modalities and combine them according to temporal and contextual constraints [47,54,55]. An example of a multimodal human-computer interaction system is illustrated in Figure 1. This two-level flow of modalities (action and perception) explains how a user and a system interact with each other and also the different steps involved in the process [56].
Figure 1. Multimodal human-computer interaction (adapted from [56]).
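The temporal constraint in multimodal fusion can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the `InputEvent` type, the `fuse_events` function and the 1.5 s window are hypothetical choices made only for illustration. Events from different input modalities are merged into one command when they arrive close together in time.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    modality: str     # e.g. "speech" or "touch" (illustrative labels)
    payload: str      # recognized content of the input
    timestamp: float  # arrival time in seconds


def fuse_events(events, window=1.5):
    """Group events that start within `window` seconds of the first
    event of a group, then merge each group into one dictionary.

    The window length is an assumed value, not one from the literature.
    If a modality occurs twice in a group, the later payload wins.
    """
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][0].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return [{e.modality: e.payload for e in group} for group in groups]


events = [
    InputEvent("speech", "take me there", 10.0),
    InputEvent("touch", "map:library", 10.6),
    InputEvent("speech", "repeat", 15.0),
]
fused = fuse_events(events)
```

In this example the spoken request and the touch on the map are fused into a single command, while the later "repeat" utterance stays separate. A real system would also apply contextual constraints, which this sketch omits.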

Multimodality in Navigation Systems
Multimodal navigation systems have several advantages compared to unimodal navigation systems. Vainio [57] explained that multimodal navigation systems allow the user the flexibility to give inputs or receive outputs in a preferred modality. He also emphasized the need for developing multimodal navigation systems to assist mobile users. Brock et al. [58] also showed that navigation systems proposed for users with visual impairments could not be considered effective if the inputs and outputs depend upon only a single mode of interaction. Multimodality helps improve system robustness [45]. This is helpful in situations where one of the modalities fails, and a different modality can be used instead [55]. This is mostly applicable in a navigation system with different redundant modalities which serve a similar function in the system [59]. Jacko [43] argued that multimodal navigation systems allow for greater accessibility and flexibility for users, who can perform tasks much better than with unimodal systems. Sears and Jacko [60] affirmed that different combinations of modalities have the possibility to enhance user comfort in human-computer interactions. For example, in a noisy environment, vibratory feedback may be more effective than aural feedback when receiving directions. Alternatively, audio may be a more suitable choice if the user wants to get more details about the environment of navigation, such as landmarks and traffic signs.
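The fallback behavior described above, where a failed or unsuitable modality is replaced by another one, can be sketched as follows. This is an illustrative assumption rather than the logic of any reviewed system: the function name, the modality labels and the 75 dB noise limit are all hypothetical.

```python
def choose_output_modality(preferred, available, ambient_noise_db,
                           noise_limit_db=75.0):
    """Return the first usable output modality, or None if none is usable.

    The user's preferred modality is tried first, then the remaining
    ones in a fixed order. Audio is skipped when the ambient noise
    exceeds `noise_limit_db` (an assumed threshold).
    """
    fallback_order = ("audio", "vibration", "visual")
    candidates = [preferred] + [m for m in fallback_order if m != preferred]
    for modality in candidates:
        if modality not in available:
            continue  # channel missing or failed
        if modality == "audio" and ambient_noise_db > noise_limit_db:
            continue  # audio would be masked by the environment
        return modality
    return None


# Noisy street: the preferred audio channel is replaced by vibration.
noisy = choose_output_modality("audio", {"audio", "vibration"}, 85.0)
# Quiet corridor: the preference is honored.
quiet = choose_output_modality("audio", {"audio", "vibration"}, 50.0)
```

A deployed system would more plausibly learn such thresholds and preferences from the user than hard-code them.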
Although there are a vast number of documented studies on assistive navigation systems for people with visual impairments, we were unable to find many studies exploring multimodality. We used the major publication databases ACM Digital Library, IEEEXplore, ScienceDirect and Google Scholar to find the relevant publications matching the inclusion criteria. We used the keywords "navigation systems + visually impaired" and "navigation systems + blind". After reviewing the abstracts, we excluded those that were outside the scope. The papers selected for this review were not limited to those documenting complete and functioning systems, but also included those at prototyping stages.
The reviewed papers have been categorized and discussed in the sub-sections based on how the multimodality concepts were utilized. Papers describing navigation systems which use the multimodal interaction were placed in one group. Next, papers describing interactive map-based multimodal systems constituted the second category. The third category included papers that document multimodal interfaces. Papers which focus on virtual environments for training the users for using multimodal navigation systems belonged to the fourth category.

Multimodal Navigation Systems
Multimodal navigation systems hold the potential for enhancing the accessibility for users with visual impairments. In a multimodal system, the user has the flexibility to give instructions and to receive the guidance in their most preferred modality.
The EyeBeacons system [61] is a framework for multimodal wayfinding communication using wearable devices. The framework uses three different modalities for passing navigation instructions: aural, tactile and visual. The system has three main components: a bone conduction headset, a smartphone and a smartwatch. Bone-conduction headphones rely on the sound being transmitted through vibrations on the bones of the head and jaw, instead of eardrums as in traditional headsets. This is particularly useful for improved situational awareness [62]. A smartwatch was used to sense the wayfinding messages in the form of vibrations. The participants who tested the system reported that both vibrations and audio tunes were difficult to distinguish. The Assistive Sensor Solutions for Independent and Safe Travel (ASSIST) indoor navigation system [33] was also designed to give three types of sensory feedback to the users similar to those reported in [61], namely, visual, aural and tactile. The system's usability testing was carried out with users, and they expressed favorable opinions about the system. The participants also suggested offering options to turn on or off certain features.
Tyflos [30] is a multimodal assistive piece of technology designed for reading and navigation. A stochastic Petri-net model is used to drive its multimodal interaction. A camera captures visual information from the environment. This visual information is transformed into either vibratory or aural feedback. The user communicates with the system via a speech recognition interface. The feedback information is communicated to the user through a vibration array vest attached to the abdomen. The authors did not report any user evaluation of the system.
Gallo et al. [63] proposed a system which can be integrated with a conventional white cane and thus provides multimodal augmented haptic feedback. The multimodal feedback system consisted of a shock system to simulate the behavior of a long cane, a vibrotactile interface to display obstacle distance information and an auditory alarm system for head-level obstacles. The device is triggered when a distant obstacle is detected, and the user experiences a sensation in the cane handle. The auditory feedback is mainly used as an emergency handler to alert the users. User evaluation showed that object detection and distance information were helpful and easy to understand. However, the users expressed that they needed training with the device to get better estimations of the distances to the obstacles. The Range-IT system [64] is a similar system which uses a white cane. After detecting the obstacle using a 3D depth camera, Range-IT provides information such as type of object, distance and direction in relation to the user, using an aural-vibrotactile interface. The output from the vibrotactile belt and the sonification messages, along with the verbal messages from a bone conduction headset, helped participants to perceive multimodal feedback during navigation in a laboratory setup. Weight was an issue with the prototype, as the user had to carry a laptop and 3D cameras.
HapAR [31] is a mobile augmented reality application which was introduced to guide users around a university campus. The user can activate the application by giving a Siri voice command. The system processes the request and tries to find the location of interest. When the user is close to the destination or any point of interest, both aural feedback and haptic feedback are triggered. User feedback showed that sound feedback was masked by outdoor environment noises such as wind and people talking. Additionally, the intensity of the haptic feedback varied with different smartphone models, negatively affecting the system's performance. Another similar system which provided both aural and tactile feedback is Personal Radar [65]. This indoor system performs obstacle detection, provides the current location and gives directions.
NavCog [66] is a smartphone-based navigation system for blind users. The system uses a network of Bluetooth low energy (BLE) beacons. The NavCog interaction was designed to avoid overloading the user with cognitively demanding messages. NavCog uses simple sounds and verbal cues to give turn-by-turn instructions. Users interact with the system through a simple touch interface. NavCog also informs users about nearby points of interest (POI) and possible accessibility issues. The system needs to be improved in terms of localization accuracy to avoid confusion when making small turns.
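The text above does not describe how NavCog turns beacon signals into a position, and the sketch below does not attempt to reproduce it. It only illustrates one generic, widely used idea behind beacon-based localization: estimating range from received signal strength with the log-distance path-loss model. The calibration values (`tx_power_dbm`, the RSSI expected at 1 m, and `path_loss_exponent`) and the beacon names are assumptions.

```python
def rssi_to_distance(rssi_dbm, tx_power_dbm=-59.0, path_loss_exponent=2.0):
    """Estimate the distance in meters to a beacon from its RSSI.

    Log-distance path-loss model: d = 10 ** ((P_1m - RSSI) / (10 * n)),
    where P_1m is the RSSI measured at 1 m and n is the path-loss
    exponent. Both defaults are assumed, uncalibrated values.
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))


def nearest_beacon(readings):
    """Return the beacon id with the strongest (least negative) RSSI."""
    return max(readings, key=readings.get)


readings = {"lobby": -70.0, "elevator": -55.0}  # hypothetical scan result
closest = nearest_beacon(readings)
```

Indoor RSSI fluctuates strongly because of reflections and obstructing bodies, which is one plausible reason why small turns are hard to resolve from signal strength alone.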
iASSIST [67] is an iOS-based indoor navigation application for both sighted and visually impaired users. Hybrid indoor models were created with Wi-Fi/cellular data connectivity, beacon signals and a 3D spatial model. During the navigation stage, the user with the mobile application is localized within the floor plan using the connected data network to give an optimal route to the destination. The system uses visual, aural and haptic feedback to provide turn-by-turn navigation instructions to the user. The limitations of the system include dependability on data connectivity in delivering services and the absence of obstacle and scene understanding features. The authors did not report any user evaluation results.
Fusiello et al. [32] proposed a navigation system which used a combination of stereo vision and sonification. The user would hear the sound in the environment with a stereo headset. Visual processing includes the segmentation of objects detected and corresponding three-dimensional (3D) reconstruction. The aural processing includes the experiential enhancement of the 3D scene through artificially created sounds. The system provides auditory cues to help the user to identify the position and distance of the pointed object or surface. The system is strenuous to use, as the user has to continuously listen to audio signals. Similarly, Sound of Vision [34] also provides a three-dimensional representation of the environment through sound and tactile modalities.
The Personal Guidance System [68,69] consists of different components, such as a module for determining the traveler's position and orientation in space, a GIS comprising a detailed database for route planning and a user interface. The system has different display modes such as spatialized sound from a virtual acoustic display, and verbal commands issued by a synthetic speech display. Compared to verbal commands, the virtual display showed the highest effect in terms of both guidance performance and user preferences. Disadvantages of this system include partially occluded external sounds which are essential in echolocation, and a high system weight, making it impractical to carry around. There is also an additional cost and complexity associated with virtual acoustic hardware.
The system proposed by Wang et al. [70] included a camera and an embedded computer with three feedback modes: vibration, braille and audio. The system used techniques from computer vision and motion planning to identify walkable space, and to recognize and locate specific types of objects such as chairs. These descriptions are communicated through vibrations. The user also receives feedback via a braille display and audio that is synthesized using text-to-speech. The system was evaluated with blind participants. The user evaluations showed that the haptic obstacle feedback was more comfortable. Braille displays offered richer high-level feedback but had longer reaction times due to sweeping of the fingers on the braille cells. Audio feedback was considered undesirable because of the low refresh rate and long latency, and due to the potential obstruction of other sounds from the environment.
None of the systems reviewed here fully utilized the multimodality concept. Moreover, only a small fraction of the studies conducted convincing user evaluations. A consolidated summary of the different multimodal navigation systems is given in Table 1. The table categorizes the reviewed systems with the main software and hardware components, localization technologies and modalities involved.

Interfaces
Several computer interfaces have been designed and developed to enhance the interaction between humans and computers. The usage of multimodal interfaces in navigational systems allows users to interact with systems using several communication modes. Diaz and Payandeh [71] argue that multimodal interfaces enable powerful, flexible and feature-rich interactive experiences.
ActiVis [72] was implemented with the main objective of giving necessary directions to the users by perceiving their surroundings. Through its multimodal user interface, ActiVis is designed to help users receive navigational information more effectively in the form of aural and vibration cues. This multimodal interface was implemented on Google's Project Tango device and developed using the Android and Tango SDKs. The interface includes a co-adaptive module that learns user behavior over time and adapts the feedback parameters to improve user performance.
Bellotto et al. [73] proposed a concept for a multimodal interface for an active vision system to control a smartphone camera orientation, using a combination of verbal messages, 3D sounds and vibrations. It was implemented as a smartphone application. Usability tests were conducted with several blindfolded users to identify the accuracy, success rate and user response times. Users reported difficulty in interpreting the sound signals correctly.
TravelMan [74] was introduced as a multimodal mobile application for serving public transport information in Finland. It also provides pedestrian guidance for users with visual impairments. The application supports several output modalities, including synthesized speech, small display-based graphical elements using fisheye techniques, non-speech sounds and haptics. The input modalities of the system consist of text input, speech recognition, physical gestures and positioning information. The camera-based movement detection was reported to be less robust, and the physical gestures feature needed to be expanded. Another drawback of the system was with the graphical interface, which is language-dependent and thus requires much display space.
The systems reviewed in this section give an overview of how multimodality can be used in navigation application interfaces. Table 2 provides summarized information about the papers on multimodal interfaces.

Maps
Visual maps have several advantages as a tool for navigation, as they can give an overview of an environment and possess high information density. Over the last decades, several promising technologies have emerged that replace visual maps, such as point and sweep gestures, spatial sound, tactile information and other multimodal options. Tactile maps give users access to geographical representations. Although those maps serve as useful tools for the acquisition of spatial knowledge, they have some limitations, such as the need to read braille. Ducasse et al. [75] did an exhaustive review of interactive map prototypes. The authors compared the maps based on cost, availability, technological limitations, content, comprehension and interactivity. They suggested improving the accessibility of digital maps using wearable technologies and designing interaction techniques that provide users with more interactive functions, such as zooming and panning, for map exploration. In addition to several interactive digital maps, different multimodal maps have been proposed.
Brock et al. [76] proposed an interactive multimodal map prototype, which relies on a tactile paper map, a multi-touch screen and aural output. Four steps were involved in the design of the interactive map. The first step involved drawing and printing the tactile paper map. The second step concerned the choice of multi-touch technology. The third step included the selection of output interaction technology. The final step dealt with the selection of the software architecture for the prototype. The prototype was made with different software modules interconnected with middleware. The authors claimed that the prototype could be used as a platform for advanced interactions in spatial learning. User evaluations showed that some users found multi-touch and double-tapping difficult.
An instant tactile-aural map prototype was proposed by Wang et al. [77] that automatically created interactive tactile-aural maps from local visual maps. The multimodal maps generated by the system could be used for navigation. The first step in the system is to extract text from local map images. The second step involves the recreation of tactile graphics. The third step comprises multimodal integration and rendering. The final results are multimodal tactile-aural representations of the original map images. The users get instant aural annotations associated with the map graphics by pressing certain symbols in the generated map. Some of the shortcomings reported with the system include issues with graphics conversion, which may lead to broken navigation paths in the tactile map.
Talking TMAP [78] was a system which was designed to help with the automated generation of aural-tactile maps using Smith-Kettlewell's TMAP software. It combines Internet content, a geographic information system, braille embossers and a touch tablet to create aural-tactile street maps of neighborhood areas. There is an extra device called Talking Tactile Tablet (TTT) connected to the system which acts as a tactile graphics viewer.
The Vibro-Audio map (VAM) proposed by [79] supported environmental learning, cognitive map development and wayfinding behavior. VAM used a low-cost touchscreen-based multimodal interface of a commercial tablet. VAM was an example of a digital interactive map (DIM) that was rendered using vibrotactile and auditory information. The built-in vibration motor of the tablet device was used to provide haptic (vibrotactile) output. Evaluations conducted with target users showed that VAM performed similarly to the traditional tactile map overlays. The findings from the study were limited to indoor building environments only.
The TouchOver map study [80] investigated whether vibration and speech feedback can be used to make a digital map on a touchscreen device accessible. The prototype consisted of an Android map navigation application. When the user touched the map where there were underlying roads, the device vibrated and read the name of the road. The results indicated that it is indeed possible to get a basic overview of the map layout, even if a person does not have access to the visual presentation. Shortcomings include the inability to detect whether roads are close and whether they cross. It is also hard to determine the directions of short roads.
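The interaction described above, where touching a road triggers vibration and speech, depends on deciding whether a touch point lies on a road. A minimal sketch of such a hit test is given below. It is not the TouchOver implementation: the polyline road representation, the function names and the 20-pixel tolerance are assumptions made for illustration.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest distance from point p to the line segment a-b (pixels)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def road_under_finger(touch, roads, tolerance=20.0):
    """Return the name of the closest road within `tolerance`, else None.

    `roads` maps a road name to a list of (x, y) polyline points.
    """
    best_name, best_dist = None, tolerance
    for name, points in roads.items():
        for a, b in zip(points, points[1:]):
            d = distance_to_segment(touch, a, b)
            if d <= best_dist:
                best_name, best_dist = name, d
    return best_name


roads = {  # hypothetical screen coordinates
    "Main Street": [(0, 100), (300, 100)],
    "Oak Avenue": [(150, 0), (150, 300)],
}
hit = road_under_finger((60, 110), roads)    # 10 px from Main Street
miss = road_under_finger((250, 250), roads)  # no road within 20 px
```

When a road name is returned, the application would vibrate and speak it. Because only the single closest road is reported, a test like this cannot convey that two roads are near each other or cross, which matches the shortcomings noted above.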
The audio-tactile you-are-here (YAH) map system [81] presented map elements and updated location on a mobile pin-matrix display. The system consists of a set of tactile map symbols with raised and lowered pins representing varying map elements. Users can input map operation commands (panning, zooming, etc.) via either a mobile phone or an electronic cane. A field test was conducted with both visually impaired and blindfolded users who did not have experience with tactile maps and braille. Conclusions were that the system needed higher location accuracy, improved portability and a one-hand map exploration method.
Touch It, Key It, Speak It (Tikisi) [82] was a software framework for the accessible exploration of graphical information. Tikisi facilitated multimodal input through multi-touch gestures, keystrokes and spoken commands, together with aural output. The system was used by moving a finger across a geographical map and issuing commands to go to specific locations such as cities or states. Tikisi was tested with target users, and the feedback was positive. However, Tikisi used a standard tablet, so shape recognition was not possible, unlike with tactile displays. Additionally, because of the lack of tactile feedback, it was not easy to estimate the relative size of two objects.
SpaceSense [83] was a map application that ran on an iPhone. It was used for representing geographical information and also included custom spatial tactile feedback hardware. SpaceSense uses multiple vibration motors attached to different locations on the mobile touchscreen device. It offers high-level details on the distance and direction towards a destination and bookmarked locations. Through vibrotactile and sound feedback, the application helps users to maintain the spatial relationships between points. However, the system was only tested in one neighborhood. More work is needed to understand how the number of places and the route instructions affect the spatial relationship learning capability of users.
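The distance and direction towards a destination can be computed from two coordinates with the haversine and initial-bearing formulas, sketched below. The formulas are standard. The mapping from a bearing to one of four vibration motors is a hypothetical layout for illustration and is not the documented SpaceSense design.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Return (distance in meters, initial bearing in degrees from north)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine great-circle distance.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    distance = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    # Initial bearing, normalized to the range [0, 360).
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    return distance, bearing


def motor_for_bearing(bearing, motors=("top", "right", "bottom", "left")):
    """Pick the motor whose sector contains the bearing (assumed layout)."""
    sector = 360 / len(motors)
    index = int(((bearing + sector / 2) % 360) // sector)
    return motors[index]


# One degree of longitude due east along the equator.
dist_m, bearing_deg = distance_and_bearing(0.0, 0.0, 0.0, 1.0)
motor = motor_for_bearing(bearing_deg)
```

The bearing here is relative to north; a handheld device would subtract the user's current heading before choosing a motor.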
This section discussed how digital tactile multimodal maps could enhance navigation accessibility. We observed that only a few of the works mentioned here had explored the developments in sensor-based technologies in their research. A summary of the works is shown in Table 3.

Virtual Learning Environments
Virtual navigation environments allow users to experience navigating unknown locations in safety. The device setup can be used to simulate a real navigation experience by receiving feedback through different modalities. Moreover, the virtual environment can be controlled with parameters such as complexity level, and users can analyze the effects of various modalities [84]. Different multimodal virtual environments have been proposed to meet users' navigation needs.
Haptic and audio multimodality to explore and recognise the environment (HOMERE) [85] was a virtual reality (VR)-based multimodal system for exploring and navigating inside virtual environments. The system provided four types of feedback to simulate the feedback experienced in a real environment. Force feedback complemented the cane simulation, thermal feedback complemented the sun simulation and auditory feedback conveyed the ambient atmosphere. Visual feedback was implemented so that partially sighted or sighted people could follow the navigation guidance from the virtual environment.
NAV-VIR [86] was a multimodal virtual environment designed to help users discover and explore unknown areas. The system used aural-tactile feedback. The NAV-VIR system comprised two parts: (1) an interactive tactile interface called the Force Feedback Tablet (F2T), which output spatial information about possible paths for navigation; and (2) a dynamic audio environment that provided a realistic, orientation-aware 3D simulation of the audio cues during the actual journey of a user with visual impairments. A preliminary evaluation with target users showed that NAV-VIR was capable of generating convincing tactile stimuli.
A virtual environment platform for the development of assistive devices was proposed by Khoo et al. [87]. The environment was designed to evaluate multimodal sensors to be used in navigation and orientation tasks. The main focus of the work was to help in the design of the sensor interfaces and simulators in the virtual environment for future experimentation.
Canetroller [88] was designed to simulate white cane interactions to help transfer cane skills into the virtual world. A VR headset was used for 3D sound. The user could experience three types of multimodal feedback. First, the controller generated physical resistance when the virtual cane came in contact with virtual objects. Second, vibrotactile feedback simulated the vibrations felt when a white cane touches objects. Third, spatial 3D audio simulated sound from the real world. The system was designed to work in both indoor and outdoor virtual environments.
The BlindAid system [89] was equipped with a haptic device and stereo headphones which provided multimodal feedback during the interaction. The system assisted the users with exploring the virtual environment based on their prior real space orientation skills. Additionally, the system provided spatial landmarks through haptic and auditory cues. Three modalities of operation (visual, aural and haptic) provided spatial information.
The system proposed by Kunz et al. [84] helped users build a cognitive navigation map of the surroundings. The virtual environment was controllable and could map objects such as walls and stairs to real-world entities. The user received acoustic and/or haptic feedback when an obstacle was detected in the environment.
Most of the papers discussed here employed recent technological advancements such as VR and 3D sound. A summary of the works on virtual learning environments for navigation is given in Table 4.

Discussion
Behavioral and cognitive neuroscience studies conducted by Ho et al. [90], Stanney et al. [91] and Calvert et al. [92] show that systems with multiple modalities can maximize user information processing. Moreover, systems designed with multimodal preferences can provide various combinations of signals from different sensory modalities, which can in turn benefit user performance with a particular system. Lee and Spence [93] reported that presenting multimodal feedback to users enhanced performance and yielded more pronounced benefits than unimodal systems. However, other studies [94][95][96] claim that multimodal feedback modes can confuse users. Confusion may occur when several modalities serve the same function and the user faces a dilemma over which one to choose and which one is better. When designing a multimodal navigation system, it is therefore important to consider how effectively and easily users can make use of each multimodal feedback method.
Common modalities used in almost all multimodal navigation systems are aural and tactile (see Table 1). Some systems enhance the audio with spatial or 3D audio. Many systems tested their prototypes with the target users, while some reported tests with blindfolded users. Some authors did not document any user evaluation of their systems.
The multimodal interfaces discussed in this review mostly utilized mobile systems such as tablets and smartphones (see Table 2). Different modalities such as aural (in the form of messages and non-speech alerts) and vibration cues were used in the systems for interaction. The multimodal maps mainly used the aural and tactile modalities (see Table 3). In cases in which systems have not been evaluated with users, it is impossible to conclude whether the intentions of the systems were met.
The different hardware components in the virtual-navigation laboratory environment can simulate different modalities in real time (see Table 4). The common modalities, such as audio and haptic feedback and its variants (vibration and force feedback), could be simulated in a virtual environment setup. These virtual training platforms were claimed to help users experience multimodal navigation systems before proceeding to navigation in real environments.
It is interesting to note that almost all multimodal systems reviewed here employed the aural modality. Some user-interaction studies support this choice by pointing out that speech cues can be useful in providing mobility feedback for users [97,98]. The audio modality can be a suitable choice when users want to hear information about the environment during navigation. However, empirical results show that users are uncomfortable when using audio feedback in public environments [99]. In noisy environments such as public places, it can be challenging to hear audio feedback. Additionally, users might feel stigmatized when they think that the auditory feedback is audible to the public as well. There are also privacy and security concerns in using audio feedback in public environments.
The advantage of haptic feedback is that users can use it anywhere, anytime, without disturbing others. At the same time, vibration patterns are often similar to each other and harder to distinguish than auditory feedback, which might create confusion among users [99].

Recommendations
Multimodal technology is a promising candidate for human-machine interfaces and may improve accessibility in user environments such as mobile devices and navigation systems. The study by Giudice [6] showed the importance of developing useful learning strategies to remedy the travel-related problems faced by users and argued that research must be redirected to consider spatial information from all sensory modalities. Wentzel et al. [100] likewise argued that multimodality is a key requirement for accessibility for a broad audience. Moreover, the authors stated that in an accessible system, different modalities of interaction should be available and should be equivalent to each other. The European Telecommunications Standards Institute (ETSI) guidelines [101] prioritize the use of multimodal presentation and interaction for accessible systems. The guidelines mention two multimodality principles. The first is the multimodal presentation of information, which allows users with different preferences and abilities to use information in their desired manner. The second is multimodal interaction, which allows users to interact with a system according to their individual needs and preferences. In addition, Wentzel et al. [100] suggested that a system or an application should be able to provide relevant multimodal feedback on user behavior. Since different users have different preferences, it is appropriate to provide customizable system settings for input/output modalities and frequency of feedback.
Based on the review, analysis and discussion presented herein, we put forward the following recommendations.
• Multimodality: multiple modalities should be available, and among them, audio feedback is always expected.
• Customisability: flexible customisation options should be available for user-preferred settings.
• Extendibility: it should be possible to add a new feature or a new modality at a later stage.
• Portability: the whole system should be portable and should not burden the user with many devices.
• Simplicity: adding modalities should not make users feel that the system is complex or create confusion in selecting them.
• Dynamic mode selection: the system should allow users to dynamically select the most appropriate mode of interaction for their current needs and environments.
• Adaptability: using machine learning techniques, multimodal systems can be designed to adapt to varying environments.
• Privacy and security: the system should address both the privacy and the security of the user.
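To make the customisability and dynamic mode selection recommendations more concrete, the following is a minimal illustrative sketch rather than a design taken from any of the reviewed systems. All names (`Modality`, `Context`, `UserSettings`, `select_modality`) and the 75 dB noise limit are assumptions introduced for illustration only. The sketch picks the first user-preferred modality that suits the current context, skipping audio when the surroundings are too noisy or when the user has asked to avoid audio in public.

```python
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    AUDIO = "audio"
    VIBRATION = "vibration"


@dataclass
class Context:
    noise_db: float   # ambient noise estimate (hypothetical sensor reading)
    in_public: bool   # whether other people are likely within earshot


@dataclass
class UserSettings:
    # Modalities in the user's own order of preference (customisability).
    preferred: list = field(
        default_factory=lambda: [Modality.AUDIO, Modality.VIBRATION]
    )
    noise_limit_db: float = 75.0          # assumed limit above which audio is hard to hear
    avoid_audio_in_public: bool = False   # user-controlled privacy option


def select_modality(ctx: Context, settings: UserSettings) -> Modality:
    """Return the first preferred modality that suits the current context."""
    for modality in settings.preferred:
        if modality is Modality.AUDIO:
            if ctx.noise_db > settings.noise_limit_db:
                continue  # too noisy for audio feedback
            if ctx.in_public and settings.avoid_audio_in_public:
                continue  # user prefers not to use audio in public
        return modality
    # Every option was ruled out: fall back to the user's top preference.
    return settings.preferred[0]
```

For example, `select_modality(Context(noise_db=90, in_public=False), UserSettings())` would fall through to vibration, whereas a quiet context would keep the user's first choice. Keeping the rules in user-editable settings, rather than hard-coding them, is one possible way to respect the simplicity recommendation while still allowing dynamic selection.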

Challenges
Adoption of multimodality in navigation system design also introduces some challenges. These include system implementation-level challenges and challenges associated with user adaptability to a new system. System-level challenges may occur during stages such as data acquisition, transfer, fusion and processing, and in the final delivery of information in a suitable form and environment. The limited availability of multimodal datasets for navigation purposes is a barrier to related research [102]. Even though some multimodal datasets have been reported [103][104][105], contributions in this area are scarce. Developers can face challenges in training and implementing a system with these limited options.
Vainio [57] showed that users exhibit different patterns when they interact with a multimodal navigation system. For instance, some users interact with a system simultaneously, while others integrate their interaction sequentially. Designing a system for varying user preferences could be challenging for developers. The processing stage in a multimodal system involves additional hurdles. Finding a suitable fusion level and fusion algorithm for the multimodal data is one of them. Caspo et al. [106] pointed out the trade-offs between the storage and the generation of different modalities when designing a multimodal system. Even though generation offers some flexibility in terms of real-time parametrization, it requires more complex processing. This review does not go into detail about the technical aspects of multimodal fusion and fission, but the complexity involved in the implementation is high [102].
Another challenge in multimodal system design is the delivery of information in a user-preferred form. Different users have different interests and preferences, and these can change depending on their navigation environments. Implementing a system that adapts to user preferences and learns them according to situations and environmental conditions could be difficult [106]. Developers cannot easily decide in advance which modality is most appropriate and most favorable for a particular user. Moreover, if developers integrate too many multimodal feedback options, the user can get confused and distracted. As stated by Liljedahl et al. [107], good user experiences do not require cutting-edge technology; careful design of a multimodal user-centered system could provide better results. The appropriate multimodal feedback methods for changing environmental conditions should be determined before designing a system. One possible approach would be to use machine learning to understand user preferences and make suitable recommendations. Research shows that it is possible to implement a self-learning system that enables adaptive settings based on user preferences in varying environments [108,109].
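As a rough illustration of how such preference learning might look, the sketch below uses a simple epsilon-greedy strategy that keeps a running average of user feedback for each pair of environment and modality. This is not the method of the cited works [108,109]. It is a deliberately simple stand-in, and the class name `PreferenceLearner`, the environment labels and the reward scale (0 for disliked, 1 for liked) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class PreferenceLearner:
    """Epsilon-greedy estimate of which modality a user prefers per environment."""

    def __init__(self, modalities, epsilon=0.1, seed=None):
        self.modalities = list(modalities)
        self.epsilon = epsilon            # probability of trying a random modality
        self.rng = random.Random(seed)
        # (environment, modality) -> running mean of feedback and sample count
        self.mean = defaultdict(float)
        self.count = defaultdict(int)

    def choose(self, environment):
        # Occasionally explore so that changing preferences can be noticed.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.modalities)
        # Otherwise pick the modality with the best average feedback so far.
        return max(self.modalities, key=lambda m: self.mean[(environment, m)])

    def update(self, environment, modality, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        key = (environment, modality)
        self.count[key] += 1
        self.mean[key] += (reward - self.mean[key]) / self.count[key]
```

In use, the system would call `choose("street")` to pick an output modality, deliver the cue, and then call `update("street", chosen, reward)` with whatever feedback signal is available, for example an explicit rating or whether the user switched modality manually. A non-zero `epsilon` keeps the system sampling alternatives occasionally, which is one way of tracking preferences that evolve over time.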
Yet another challenge is how to make a multimodal system comfortable for users. No user with a visual impairment should experience difficulty in using the system. Moreover, adding multiple modalities adds an extra layer of complexity to the system. Experiences of difficulty may lead users to abandon the technology [110].

Conclusions
Multimodal technology is a promising option for the design of effective and accessible navigation systems for users with visual impairments. This paper reviewed multimodal navigation solutions proposed for people with visual impairments. The primary modalities used in almost every multimodal navigation system discussed in this review are aural and tactile. Even though many multimodal navigation solutions have been proposed, there is little evidence of the degree to which the target users continued to use these technologies in practice.
Studies concerning the extent to which multimodal systems are helpful for people with visual impairments in real-life navigation contexts are an important avenue to consider. Challenges are associated with designing, implementing and using multimodal navigation systems. Exploring the effectiveness of recent advancements in artificial intelligence and related technologies to help with tackling the different challenges in multimodality is an important area of future research. Moreover, we argue that more studies are needed to better understand the evolving preferences in modalities among users with visual impairments.