Challenges and Solutions for Autonomous Ground Robot Scene Understanding and Navigation in Unstructured Outdoor Environments: A Review

: The capabilities of autonomous mobile robotic systems have been steadily improving due to recent advancements in computer science, engineering, and related disciplines such as cognitive science. In controlled environments, robots have achieved relatively high levels of autonomy. In more unstructured environments, however, the development of fully autonomous mobile robots remains challenging due to the complexity of understanding these environments. Many autonomous mobile robots use classical, learning-based or hybrid approaches for navigation. More recent learning-based methods may replace the complete navigation pipeline or selected stages of the classical approach. For effective deployment, autonomous robots must understand their external environments at a sophisticated level according to their intended applications. Therefore, in addition to robot perception, scene analysis and higher-level scene understanding (e.g., traversable/non-traversable, rough or smooth terrain, etc.) are required for autonomous robot navigation in unstructured outdoor environments. This paper provides a comprehensive review and critical analysis of these methods in the context of their applications to the problems of robot perception and scene understanding in unstructured environments and the related problems of localisation, environment mapping and path planning. State-of-the-art sensor fusion methods and multimodal scene understanding approaches are also discussed and evaluated within this context. The paper concludes with an in-depth discussion regarding the current state of the autonomous ground robot navigation challenge in unstructured outdoor environments and the most promising future research directions to overcome these challenges.


Introduction
The use of autonomous mobile robotic systems is rapidly expanding in many scientific and commercial fields, and the capabilities of these robots have been growing due to continuous research and industrial efforts.The range of different research application areas in autonomous mobile robotics is wide, including areas such as surveillance, space exploration, defence, petrochemical, industrial automation, disaster management, construction, marine, personal assistance, extreme environments, sports entertainment, agriculture, transportation logistics, and many other industrial and non-industrial applications [1].Many different types of robot platforms have been and continue to be developed for these applications, and autonomous mobile robotics is a global and continuously evolving scientific field.Robots that move in contact with ground surfaces are commonly referred to as mobile ground robots, and these robots may be deployed in different working environments, including indoor or outdoor, structured or unstructured, and in proximity to static or dynamic actors.Each of these environments creates various challenges for robot applications.The overall system configuration of a robot mostly depends on the relevant operating environment; for example, a robot designed for a static environment may not effectively adapt to dynamic situations that arise within the environment [2].The development of ground robotic systems is particularly challenging when the intended application area of autonomous vehicles/robots falls into the unstructured off-road category, making this a very active area of research.The difficulties mainly arise due to the weak scene understanding of robots in these unstructured environment scenarios.Mobile ground robots have recently shown increasingly sophisticated scene-understanding capabilities in indoor environments (e.g., Amazon Robotics warehouse robots [3]).However, in outdoor robot navigation, the environments are generally much less structured and more unpredictable (e.g., lighting, shadows, occlusions and texture) than in indoor robot navigation application scenarios.Due to these particular challenges and the more limited level of progress achieved to date to address them, this work seeks to provide a thorough review of existing research works in this field in order to provide directions for advancing robot navigation in unstructured outdoor environments.The agriculture industry is one example of a relevant application area to deploy scene understanding based on off-road autonomous robotic systems.In recent times, the agriculture industry has shown a growing adoption of robotics technologies, but in general, the involvement of novel technologies depends on economic sustainability.Ground mobile robots and manipulators are already used in precision farming to pick fruits, harvest vegetables, and remove undesirable vegetation, but their application areas are generally fairly narrow due to limited scene understanding capability.In the past few years, self-driving vehicles have gradually grown in technological capability and market size within the automobile industry [3,4].Despite these developments, autonomous driving is still a very challenging problem, and in complex scenarios, the performance level remains below that of an average human operator.This is particularly true when the intended application area of autonomous vehicles/robots includes off-road areas.This low performance mainly arises due to the weak external environment understanding of robots.In general, application areas such as disaster management, environment exploration, defence, mining and transportation are associated with complex and unstructured environments.Therefore, these fields are also driving further research on mobile robot scene understanding as it relates to the broader topic of autonomous robot navigation.The following challenges for the classical autonomous robot navigation pipeline remain for current robotic systems: • Perception; • Localisation and mapping; and • Scene understanding.
In robot perception, robots sense environments using different sensors and extract actionable information via the sensor data.Perception plays an important role in the realisation of autonomous robots.For robots to perform effectively in unstructured outdoor environments in real time, it is essential that they possess accurate, reliable, and robust perception capabilities.To achieve these characteristics, in general, autonomous robots in complex scenarios are equipped with several sensor modalities (that can be exteroceptive or proprioceptive) [5,6].Different modalities such as sound, pressure, temperature, light, and contact have been used in robot environmental perception applications [7].Sensor fusion (combining different sensor modalities) has been applied in many recent autonomous mobile robot/self-driving applications [8].Multimodal sensor fusion brings the complementary properties of different sensors together to achieve better environment perception across a range of conditions [9].Many recent deep learning-based sensor fusion methods have shown higher robustness in perception than conventional mono-sensor methodologies [6,10].Camera, Light Detection and Ranging (LiDAR), radar, ultrasonic, Global Navigation Satellite System (GNSS), inertial measurement unit (IMU) and odometry sensors are used in many mobile ground robot perception applications.The most frequently used robot vision-based sensor fusion methods combine camera images with LiDAR point clouds.Research into sensor fusion techniques remains an important component of achieving better sensing capabilities for unstructured outdoor environments.
In the second stage of the autonomous robot navigation pipeline, the information retrieved from perception sensors is used for robot localisation and to map their external environments.Autonomous mobile robots require accurate and reliable environmental mappings and localisation at a sophisticated level based on the application context.The Simultaneous Localisation and Mapping (SLAM) approach is a commonly used technique in autonomous mobile robot systems to represent the robot positions and the map of their external environments when both the robot pose and environmental map are previously unknown.Many SLAM systems use LiDAR sensors and vision-based sensors such as Red-Green-Blue (RGB)/RGB-Depth (RGB-D) cameras that can retrieve visual or proximity information, and the fusion of these sensors has achieved better robustness than using either camera or LiDAR sensors alone [11].LiDAR sensors use highly accurate Timeof-Flight laser range finding for the detection of objects.An example of a lightweight and low-cost semantic environment classification approach for indoor robot navigation is introduced in [12].Semantic classification becomes challenging with the variation of the lighting, shadows, occlusions, texture and the number of unseen semantic classes.Dynamic agents present another problem, [13] has proposed a LiDAR-based concept to segment large areas with dynamic agents.It can use LiDAR range view representations and further categorise objects in the images into semantic classes.Several research works have incorporated the fusion of camera and LiDAR data for improving scene understanding of outdoor environments [14].
A fundamental aim of robotic vision is to interpret the semantic information present in a scene to provide scene understanding.Scene understanding goes beyond object detection and requires analysis and elaboration of the data retrieved by the sensors [15].This concept is used in many practical applications, such as self-driving vehicles, surveillance, transportation, mobile robot navigation, and activity recognition.Understanding scenes using images or videos is a complex problem, however, and requires more steps than just recording and extracting features [16].Scene understanding can be aided by taking advantage of multiple sensor modalities, which is usually termed multimodal scene understanding [17].One of the requirements for autonomous robots operating in unstructured environments is the capacity to understand the surrounding environment.Scene understanding comprises subtasks such as depth estimation, scene categorisation, object detection, object tracking, and event categorisation [18].These scene-understanding sub-tasks can describe different aspects of a scene acquired by perception sensors.In scene understanding, a representation is given to a scene by carrying out some of the above tasks jointly to obtain a holistic understanding of the retrieved scene.To generate this overall representation, the information observed from the above-mentioned scene understanding subtasks must be combined meaningfully.Some of the early methods used to obtain scene understanding included using a block world assumption [19] or bottom-up, top-down inference [20].Many of these early works depended on heuristics rather than learning-based methods and, thus, were not suitable for generalisation to unstructured real-world scenarios [21].The most used recent methods try to acquire information from deep learning-based (i.e., convolutional neural networks (CNNs), graph neural networks, vision transformers) approaches.Scene understanding in unstructured outdoor environments using multimodal scene understanding concepts is a challenging task.Many modern scene understanding methods use featurebased high-level representations of environments.In unstructured environments, however, the detection of useful object features is challenging.Therefore, it is difficult to reliably interpret visual information from unstructured or dynamic environments [22].However, to attain real-world effectiveness, robots should understand their operating environments up to a level that is accurate enough to execute real-time and goal-oriented decisions.
One of the key capabilities required for autonomous robot operation is autonomous path planning for robot navigation.Many ground robot navigation applications in unstructured environments require traversal of uneven terrains.Some challenges in robot unstructured environment navigation, including traversal of previously unseen unstructured grounds, can be addressed by adequately designing robot structure, path planning and control [23,24].Path planning is generally separated into global and local planning.For global planning, previous knowledge of the operating environment is necessary, and this planning method is also identified as an off-line mode for robot path planning.Robot local path planning, also known as online robot path planning, allows for real-time decisions to be made by the robot in response to the perception of the local environment.Autonomous robot local path planning for optimal terrain traversal in unstructured outdoor environments is an important challenge to solve for robots operating in off-road conditions.This is due to the limitations of standard path planning algorithms, which are incapable of performing the desired tasks in dynamic or unstructured environments where the system lacks prior knowledge and/or already existing maps [25].Artificial potential fields, simulated annealing, fuzzy logic, artificial neural networks, and dynamic window [26] approaches are some of the algorithms that have been used in robot local path planning [21].An optimal local path planning approach should enable robots to adaptably deal with their environments, such as assisting in avoiding dynamic obstacles or identifying traversable routes through varying terrain conditions.In unstructured environments, classical path planning concepts such as Rapidly exploring Random Tree (RRT) variants, Batch Informed Tree (BIT), D* algorithm variants, artificial potential field methods, A* algorithm variants and learning-based path planning methods have been used [27].Over the past years, deep learning has been used to enhance the performance of sensor fusion [28], multimodal scene understanding [29], and robot local path planning [30] techniques.
For autonomous navigation, a robot must have the ability to understand both its pose and enough about the external environment to determine an optimal and traversable path to safely reach a goal position without human assistance in the robot control loop.However, fully autonomous navigation in unstructured environments has not yet been achieved despite significant advancements in computing and engineering technology.Autonomous vehicle navigation on urban roads has been of great interest due to its emerging commercial applications around the world.As a result, autonomous perception technologies are developing continuously due to the competitiveness of this industry [31].Compared to the level of development of methods for autonomous navigation in urban settings [32], however, autonomous navigation in off-road scenarios has not been studied to the same extent, meaning there are significant opportunities for new research in this area.A large number of autonomous navigation techniques have been explored, and these can be broadly divided into two subsections according to the approach used to execute control commands following the processing of the input sensory data.
As shown in Figure 1, autonomous navigation methods can be categorised as the classical modular pipeline or end-to-end learning-based approaches.The modular pipeline architecture includes intermediate representations that humans can interpret and that can provide information related to the failure modes of the approach.Furthermore, the system development can be parallelised among several expert teams because of its modular nature.Perception, localisation and mapping, path planning and robot control modules may consist of classical, learning-based or hybrid methods [8,33].Due to algorithmic and modelling limitations, however, the modular approach may not be optimal for general autonomous navigation applications.In addition, machine learning-based modules must be separately trained and validated using auxiliary loss functions, which can create suboptimality across the system as a whole.In contrast, end-to-end learning-based systems learn policies from observations of the outcomes that result from the actions of autonomous systems [18,34].In these systems, Deep Reinforcement Learning (DRL) is used to refine the control algorithms used to determine autonomous actions.However, commonly employed methods like imitation learning can suffer from overfitting and cause problems with the poor generalisation of the system behaviour when deploying in different environment scenarios [11].In addition, when errors occur in an end-to-end system, it can be complex to investigate because the origin of these errors is hidden in the holistic neural network-based architectures [8,35].It is clear, however, that future development of autonomous robot navigation will rely on further advancements in robot environmental perception and machine learning.This paper provides a comprehensive review of the current state of research and the most promising advancements relevant to the field of ground robot navigation in unstructured environments, including perception and sensor fusion, scene understanding, and path planning.The most promising state-of-the-art methods used in classical outdoor robot navigation pipelines are discussed and evaluated, along with emerging learningbased approaches; promising future research directions are then highlighted.Readers of this paper will gain insights and knowledge regarding current state-of-the-art capabilities, as well as trends and emerging opportunities in the autonomous robot outdoor scene understanding and navigation research field.The remainder of this paper is structured in the following manner.A discussion of robot vision and the types of active ranging sensors that are used for robot environment perception is presented in Section 2. This section also discusses deep learning-based camera and LiDAR sensor fusion methods for depth estimation, object detection, and semantic and instance segmentation.Section 3 describes modern mobile ground robot scene understanding techniques and scene representations that are utilised in robot navigation.Robot path planning algorithms at the global and local levels are discussed in Section 4. In Section 5, the details of robot vision and ranging sensors, fusion methods, scene understanding concepts, and local navigation approaches are summarised.Section 6 provides an overview of the research challenges related to the topic.Additionally, it offers some potential future research directions.Finally, the conclusion of the review can be found in Section 7.

Robot Environment Perception for Navigation
Robot external and internal environment sensing by extraction of raw sensor data and their interpretation is the basic principle of robot perception.In the modular or end-to-end robot navigation approach, sensors play a critical role in capturing the environment or internal robot attributes for robot perception.Therefore, to achieve better perception, a wide range of different sensor modalities have been investigated by researchers.A sensor modality represents a sensor that inputs a particular form of energy and processes the signal using similar methods.Modalities include raw input types for sensors like sound, pressure, light (infrared, visible), or magnetic fields.Robot perception sensor modalities commonly include cameras (infrared, RGB or depth), LiDAR, radar, sonar, GNSS, IMU, and odometry sensors.

Vision and Ranging Sensors
Camera sensors are usually incorporated in vision applications by researchers to retrieve environmental information for the use of mobile robots.However, LiDAR sensors have shown more reliability in low-light environmental conditions than cameras and produce highly accurate depth measurements.LiDAR sensors come with 2D or 3D mapping capability [36], and these sensors can generate high-fidelity point clouds of outdoor environments.However, these unstructured point clouds tend to become increasingly sparse as the sensor range is increased.

• Vision-based Sensor Types
Vision is crucial for a robot to navigate in unknown environments.Robots are equipped with vision-based sensors like cameras to understand environmental information.Cameras are generally cheaper than LiDAR sensors and are a type of passive sensor (although some depth cameras use active sensing).The monocular configuration is highly used in standard RGB cameras, for example, the GoPro Hero camera series.They are compact passive sensors that consume low power, approximately 5 watts, depending on resolution and mode settings.Monocular SLAM has been explored in the research literature due to its simple hardware design [32,37].However, the algorithms that are used for it are very complex because depth measurements cannot be directly retrieved from static monocular images.Monocular cameras also suffer from pose estimation problems [38].The pose of the camera is obtained by referencing previous poses.Hence, errors in pose estimation propagate through the process, and this phenomenon is called scale drift [28,29,39].
Stereo cameras are inspired by human eyes and use two lenses and separate passive image sensors to obtain two perspectives of the same scenes.They have been used in indoor and outdoor SLAM applications.These camera types use the disparity of the two images to calculate the depth information.Stereo camera vision does not suffer from scale drift.Available popular stereo cameras are Bumblebee 2, Bumblebee XB3, Surveyor stereo vision system, Capella, Minoru 3D Webcam, Ensenso N10, ZED 2, and PCI nDepth vision system.Stereo camera power consumption is generally between 2 to 15 watts.The maximum range of stereo cameras varies from 5 to 40 m at different depth resolution values.The accuracy of these sensors varies from around a few millimetres to 5 cm at the maximum range [40].The cost of these sensors varies from one hundred to several thousand Australian dollars.RGB-D camera sensors consist of monocular or stereo cameras coupled to infrared transmitters and receivers.Kinect camera from Microsoft is a relatively inexpensive RGB-D sensor that provides colour images and depth information of image pixels.The Kinect sensor is mainly used for indoor robot applications due to the saturation of infrared receivers in outdoor scenarios from sunlight [41].The Kinect sensor has three major versions: Kinect 1, Kinect 2, and Azure Kinect (the latest version).Kinect 1 uses structured light for depth measurement, and the other models use Time-Of-Flight (TOF) as the depth measuring principle.The newest Kinect model, Azure Kinect, also generates substantial noise in outdoor bright light conditions with a practical measuring range below 1.5 m [42].In general, RGB-D sensors utilise three depth measurement principles: structured light, TOF, and active infrared stereo techniques.The structured light RGB-D sensors underperform compared to TOF techniques in measuring the range of distant objects.The structured light technique is vulnerable to multi-device interference.The TOF methods suffer from multi-path interference and motion artefacts.The active infrared stereo principle has drawbacks due to common stereo matching problems such as occluded pixels and flying pixels near contour edges [43].Active infrared stereo cameras are identified as an extension of passive stereo cameras.They offer more reliable performance in indoor and some outdoor scenarios.However, they require high computation capability.Table 1 shows the main depth measurement methods that are used in RGB-D sensors.
Event cameras are asynchronous sensors that use different visual information acquisition principles than the standard frame-based image sensors.This camera type samples light based on the changes in the scene dynamics (asynchronously measuring brightness per pixel).These cameras currently cost several thousands of dollars.Event cameras have advantages like very high temporal resolution, high dynamic range, low latency (in microseconds), and lower power consumption than standard cameras.However, this camera type is not suitable for the detection of static objects.The main burden is the requirement of new methods (algorithms and hardware, e.g., neuromorphic-based approaches or spiking neural networks) to process event camera outputs to acquire data and information because traditional image processing methods are not directly applicable [44].Omnidirectional cameras are utilised in robotic applications where more information is needed about surrounding environments.These cameras provide a wider-angle view than conventional cameras, which typically have a limited field of view.Due to the lens configuration of omnidirectional cameras, the obtained images have distortions.Therefore, these cameras require different mathematical models, like the unified projection model, to correct image distortions.A summary of these different camera types and their advantages and disadvantages is shown in Table 2. Active sensors emit energy to the environment and measure the return signal from the environment.There are several types of active-ranging sensors, such as reflectivity, ultrasonic, laser rangefinder (e.g., LiDAR), optical triangulation (1D), and structured light (2D).The commonly used LiDAR sensor is an active-ranging sensor that shows improved performance compared to ultrasonic TOF sensors for perception in autonomous navigation systems [46].LiDAR imaging is one of the most studied topics in the optoelectronics field.The TOF measurement principle is used by LiDAR sensors for obtaining depth measure-ments in different environments.Rotating LiDAR imagers were the first type to successfully achieve acceptable performance using a rotating-mirror mechanical configuration with multiple stacked laser detectors [47].New trends in LiDAR technology are in the development of low-cost, compact, solid-state commercial LiDAR sensors [48,49].The three most widely used LiDAR techniques utilise pulsed, amplitude-modulated, and frequency-modulated laser beams.The most common commercially available method is to use a pulsed laser beam, which directly measures the time taken by the pulsed signal to return to the sensor.In such sensors, time measurements require resolutions in picoseconds (high-speed photon detectors).Therefore, the cost of pulsed LiDAR sensors is comparably higher than the other two methods for equivalent range and resolution [50].The range of a pulsed LiDAR is limited by the signal-to-noise ratio (SNR) of the sensor.Pulsed LiDAR sensors are suitable in indoor and outdoor environments due to their instantaneous higher peak pulse contrast to ambient irradiance noise.However, Amplitude Modulated Continuous Wave (AMCW) LiDAR sensors with the same average signal power have lower continuous wave peaks and, hence, are vulnerable to solar irradiance.AMCW LiDAR sensors are popular in indoor robot applications due to low SNR.In general, there is no significant depth resolution difference between pulsed LiDAR and AMCW LiDAR, but pulsed LiDAR may outperform AMCW LiDAR accuracy at the same optical power levels because of higher SNR.Increasing the modulation frequency of AMCW LiDAR sensors improves depth resolution but reduces ambiguity distance.Thus, pulsed sensors have higher ranges compared to AMCW sensors.AMCW sensors also have slow frame rates relative to pulsed sensors and usually underperform in dynamic environments.Frequency Modulated Continuous Wave (FMCW) sensors have higher depth resolution in comparison to pulsed and AMCW LiDAR sensors.These sensors can measure the depth and velocity of targets simultaneously, a highly advantageous feature for the autonomous vehicle industry.These sensors can also avoid interference from other LiDAR sensors because of the frequency modulation.FMCW sensors, however, require accurate frequency sweeps to generate emitter signals, which is a challenging task.FMCW sensor technology has been in continuous development but has not yet established itself at the commercial level.A summary of these three LiDAR technologies is provided in Table 3.The concept of robotic perception pertains to various robotics applications that utilise sensory information and modern deep learning methods.These applications include identifying objects, creating representations of the environment, comprehending scenes, detecting humans/pedestrians, recognising activities, categorising locations based on meaning, scene reconstruction, and others.Towards the goal of fully autonomous robot navigation, robust and accurate environmental perception is a necessity.Passive RGB cameras are relatively low cost and can capture rich environment details.However, their perception abilities are vulnerable to environmental illumination changes and occlusions.Therefore, sensor fusion methods are important to gain robust perception.

LiDAR and Camera Data Fusion
Achieving a reliable real-time understanding of external 3D environments is very important for safe robot navigation.Visual perception using cameras is commonly employed in many mobile robotic systems.Camera images can be efficiently and often effectively processed by CNN-based deep learning architectures.However, relying on one sensor can lead to robustness challenges, particularly in applications like self-driving vehicles, autonomous field robots, etc.Therefore, different sensor modalities are often combined to achieve better reliability and robustness for autonomous systems.The fusion of LiDAR and camera data is one of the most investigated sensor fusion areas in the multimodal perception literature [51].Camera and LiDAR fusion has been applied in various engineering fields and has shown better robustness than camera-only robot navigation approaches [52].This fusion strategy is more effective and popular than other sensor fusion approaches such as radar-camera, LiDAR-radar, and LiDAR-thermal camera.Still, the technical challenges and cost of sensors and processing power requirements have constrained the application of these methods in more general human activities.Recent deep-learning algorithms have significantly improved the performance of camera-LiDAR fusion methods [53], with monocular and stereo cameras mainly used with LiDAR sensors to fuse images and point cloud data [54].
Deep learning-based LiDAR and camera sensor fusion methods have been applied in depth completion, object detection, and semantic and instance segmentation.Image and point cloud scene representations include volumetric, 2D multi-view projection, graphbased and point-based.In general, most of the early methods fuse LiDAR and camera image semantics using 2D CNNs.Many 2D-based networks project LiDAR points onto respective image planes to process feature maps through 2D convolutions [55][56][57][58].Several works have used point cloud processing techniques such as PointNet [59] to extract features or 3D convolutions [60] to detect objects in volumetric representations [61].Some other LiDAR and image fusion methods use 2D LiDAR representations for feature fusion and then cluster and segment 3D LiDAR points to generate 3D region proposals [62].Voxel-based representations and multi-view camera-LiDAR fusion approaches are utilised to generate 3D proposals in object detection.State-of-the-art camera-LiDAR semantic segmentation methods employ feature fusion methods to obtain 2D and 3D voxel-based segmentation results.Multi-view approaches map RGB camera images onto the LiDAR Bird's-Eye-View (BEV) plane to align respective features from the RGB image plane to the BEV plane [63][64][65][66], and several other methods propose to combine LiDAR BEV features with RGB image features directly [51,54].These direct mapping methods use trained CNNs to align image features with LiDAR BEV features from different viewpoints.
Computer vision has been a rapidly growing field in the past decade, and the development of machine learning methods has only accelerated this.Recently, deep learning strategies have influenced the rapid advancement of various computer vision algorithms.Computer vision includes subtopics like object detection, depth estimation, semantic segmentation, instance segmentation, scene reconstruction, motion estimation, object tracking, scene understanding and end-to-end learning [41].Computer vision methods have been applied to a great extent in emerging autonomous navigation applications.However, these vision techniques may be less effective in previously unseen or complex environments and highly rely on the trained domain.Therefore, continuous improvements are being made towards the development of fully autonomous systems.Several state-of-the-art benchmarking datasets have been utilised to compare the performance of different autonomous driving vision methods.KITTI [67], Waymo [68], A2D2 [69], nuScenes [70], and Cityscapes [71] are some examples of these state-of-the-art autonomous driving datasets.

• Dense Depth Prediction
Dense depth completion is a technique that estimates dense depth images from sparse depth measurements.Achieving depth perception is of significant importance in many different engineering industries and research applications such as autonomous robotics, self-driving vehicles, augmented reality, and 3D map construction.LiDAR sensors, monocular cameras, stereo cameras, and RGB-D cameras have been the most utilised in dense depth estimation applications, but these sensors have specific limitations in the estimation process.LiDAR sensors with high accuracies are costly to use in large-scale applications.Three main challenges have been identified in LiDAR depth completion [72].The first relates to the fact that, in general, even expensive LiDAR sensors produce sparse measurements for distant objects.In addition, LiDAR points are irregularly spaced compared to monocular RGB images.Therefore, it is non-trivial to increase the depth prediction accuracy using the corresponding colour image.Secondly, there are difficulties in combining multiple sensor modalities.The third challenge is that depth estimation using deep learning-based approaches has limitations with regard to the availability of pixel-level ground truth depth labels for training networks.Another possible approach to depth estimation is by using stereo cameras.Stereo cameras, however, require accurate calibration, demand high computational requirements, and fail in featureless or uniformly patterned environments.RGB-D cameras are capable of depth sensing but have a limited measuring range and poor performance in outdoor environments.A technique called depth inpainting can be used as a depth completion method for structured light sensors like the Microsoft Kinect 1, and these sensors produce relatively dense depth measurements but are generally only usable in indoor environments.Dense depth estimation techniques generally up-sample sparse and irregular LiDAR depth measurements to dense and regular depth predictions.Depth completion methods, however, still have a variety of problems that need to be overcome.These challenges are primarily sensor-dependent, and solutions should overcome respective difficulties at the algorithm development stage.
Many state-of-the-art dense depth prediction networks combine relatively dense depth measurements or sparse depth maps with RGB images to assist the prediction process.In general, retrieving dense depth detail from relatively dense depth measurements is easier than from sparse depth maps.In relatively dense depth images, a higher percentage of pixels (typically over 80%) have observed depth values.Therefore, in similar scenarios, predicting dense depth is relatively less complex.However, in autonomous navigation applications, 3D LiDAR sensors account for only approximately 4% of pixels when depth measurements are mapped onto the camera image space, which creates challenges in generating reliable dense depth images [72].

• Dense Depth from Monocular Camera and LiDAR Fusion
Depth estimation based solely on monocular images is not reliable or robust.Therefore, to address these monocular camera limitations, the LiDAR-monocular camera fusion-based depth estimation process has been proposed by researchers.Using monocular RGB images and sparse LiDAR depth maps, a residual network learning-based autoencoder decoder network was introduced by [73] to estimate dense maps.However, this method needs a ground truth depth image when retrieving sparse depth images during the network training process.In practice, obtaining such ground truth images is not simple or easily scalable [72].To mitigate the requirement of a ground truth depth image, ref. [72] presented a selfsupervised model-based network that only requires a monocular RGB image sequence and LiDAR sparse depth images in the network training step.This network consists of a deep regression model to identify a one-to-one transformation from a sparse LiDAR depth map to a dense map.This method achieved state-of-the-art performance on the KITTI dataset and considers the pixel-level depth estimation as a deep regression problem in machine learning.LiDAR sparse depth maps use per-pixel depth, and pixels without measured depth are set to zero.The proposed network follows an encoder-decoder architecture.The encoder has a sequence of convolutions, and the decoder has a set of transposed convolutions to up sample feature spatial resolutions.Convolved sparse depth data and colour images are concatenated into a single tensor and input to residual blocks of ResNet-34 [74].The self-supervised training framework requires only colour/intensity images from monocular cameras and sparse depth images from LiDAR sensors.In the network training step, a separate RGB supervision signal is used (a nearby frame).LiDAR sparse depth can be used as a supervision signal.However, this framework requires a static environment to be able to warp the second RGB frame to the first one.
With this implementation, the root-mean-square of the depth prediction error reduces as a power function with the increment of the resolution of the LiDAR sensor (i.e., the number of scan lines).One of the limitations of this approach is that the observed environment needs to be stationary.If not, the network will not generate accurate results.Large moving objects and surfaces that have specular reflectance can cause the process of training the network to fail.These reasons reduce the applicability of this method in dynamic situations that are often present in outdoor environments.In addition, this network training process may become stuck in the local minimums of the photometric loss function due to improper network weight initialisation.This effect may result in output depth images that are not close enough to the ground truth because of the erroneous training process.
In [75], a real-time sparse-to-dense map is constructed by using a Convolutional Spatial Propagation Network (CSPN).The propagation process preserves LiDAR sparse depth input values in the final depth map.This network aims to extract the affinity matrix for respective images.The introduced method learns the affinity matrix by using a deep convolution neural network.The training process of the network model is achieved by incorporating a stochastic gradient descent optimiser.This network implementation showed memory paging cost as a dominant factor when larger images were fed into the PyTorch-based network.The CSPN network has shown good performance in real-time and thus is well-suited for applications such as robotics and autonomous driving.CSPN++ [76] is an improved version of the CSPN network with adaptively learning convolutional kernel sizes and numbers of iterations for propagation.The network training experiments were carried out using four NVIDIA Tesla P40 Graphic Processing Units (GPUs) on the KITTI dataset.This research shows that hyper-parameter learning from weighted assembling can lead to significant accuracy improvements, and weighted selection could reduce the computational resource with the same or better accuracy compared to the CSPN network.

• Dense Depth from Stereo Camera and LiDAR Fusion
Estimating depth using stereo cameras provides more reliable results compared to monocular cameras.LiDAR sensors can produce depth measurements with improved accuracy over increased ranges and varying lighting conditions.The fusion of LiDAR and stereo camera sensors produces more accurate 3D mappings of environments than LiDARmonocular camera depth completion.However, stereo cameras commonly have shorter detection ranges, and depth estimation becomes challenging in textureless environments and high occlusion scenarios.One of the significant works in LiDAR-stereo camera fusion is presented in [77].This paper presents the first unsupervised LiDAR-stereo fusion network.The network does not require dense ground truth maps, and training is done in an end-toend manner that shows a broad generalisation capability in various real-world scenarios.The sparsity of LiDAR depth measurements can vary in real-world applications.One advantage of the network proposed in this work is that it handles a high range of sparsity up to the point where the LiDAR sensor has no depth measurements.A feedback loop has been incorporated into the network to connect outputs with inputs to compensate for noisy LiDAR depth measurements and misalignments between the LiDAR and stereo sensors.This network is currently regarded as one of the state-of-the-art methods for LiDAR-stereo camera fusion.

• Multimodal Object Detection
Reliable object detection is a vital part of the autonomous navigation of robots/vehicles.Object detection in autonomous navigation is described as identifying and locating various objects in an environment scene in the form of bounding boxes, including dynamic objects and static objects.Object detection may become difficult due to sensor accuracy, lighting conditions, occlusions, shadows, reflections, etc.One major challenge in object detection is occlusion, which consists of different types.The main occlusion types are self-occlusion, inter-object occlusion, and object-to-background occlusion [78].Early image-based object detection algorithms commonly included two steps.The first stage was dividing the image into multiple smaller sections.Then, these sections were conveyed into an image classifier to identify whether the image section contained an object or not.If any object was detected in an image section, the respective portion of the original image was marked with the relevant object label.The sliding window approach is one way of achieving the above-mentioned first step [79].
A different set of algorithms uses a technique in contrast to the sliding window approach by grouping similar pixels of an image to form a region.These regions are then fed to a classifier to identify semantic classes (with grouping done using image segmentation methods).Further improved image segmentation can be achieved by using the selective search algorithm [80].The selective search algorithm emphasises a hierarchical grouping-based segmentation algorithm.In this method, initially detected image regions are merged in a stepwise manner by selecting the most similar segments until the whole image represents a single region.These regions resulting from each step are added to the region proposals and fed to a classifier.The classifier performance depends on the used region proposal method.This object detection approach does not produce real-time performance suitable for autonomous navigation applications.However, advances such as SSPnet [81], Fast Regional-based Convolutional Neural Networks (R-CNNs) [82], and Faster R-CNN [83] were introduced to address this issue.The Faster R-CNN network generates a feature map by utilising the CNN layer output, and the region proposal generation is achieved by sliding a window (comprising three different aspect ratios and sizes) over the feature map.Each sliding window is mapped to a vector and fed to two parallel classification and regression networks.The classification network calculates the probability of region proposals containing objects, and the regression network indicates the coordinates of each of the proposals.
Object detection research has been mainly employed in the autonomous vehicle industry (for vehicle and pedestrian detection [84]) and mobile robotics.In contrast to camera-only object detection, sensor fusion has been implemented in different real-world applications to obtain more accurate and robust detection results.As previously discussed, LiDAR and camera sensor fusion are some of the most used and highest-performing sensor fusion methods.LiDAR and camera sensor fusion object detection approaches consist of two main techniques.These are the sequential and one-step models [54].Sequential models use 2D proposal-based methods or direct 3D proposals to detect objects.In the sequential approach, 2D/3D regions are proposed in the first stage, and then 3D bounding box regression is done in the second stage.The 2D/3D region proposal stage incorporates fused image-point cloud regions that may contain objects.In the bounding box regression stage, feature extraction from region proposals and bounding box prediction is done.One-step models generate region proposals and achieve bounding box regression in parallel in a single step.The 2D proposal-based sequential approach uses 2D image semantics to generate a 2D proposal and point cloud processing methods to detect dynamic objects.This approach utilises already developed image processing models to identify 2D object proposals and then project these proposals to LiDAR 3D point cloud space for object detection.
Two approaches are mainly used to manipulate image-based 2D object proposals and irregular 3D LiDAR data.In the first method, image-based 2D bounding boxes are projected to the LiDAR 3D point cloud to implement 3D point cloud object detection algorithms.The second approach utilises point cloud projections on the 2D images and applies 2D semantic segmentation techniques to achieve point-wise semantic labels of the points within the semantic regions [54].LiDAR-camera 2D sequential object detection methods include result, feature, and multi-level fusion strategies.These 2D proposal-based result-level fusion methods incorporate image object detection algorithms to retrieve 2D region proposals.These retrieved 2D object bounding boxes are then mapped onto 3D LiDAR point clouds.The enclosing points in frustum proposals are transferred into a point cloud-based 3D object detection algorithm [57].The overall performance of this object detection approach depends on the modular behaviour of the 2D detection architecture.
Sequential fusion may lose complementary data in LiDAR point clouds due to initial 2D image object proposal detection.Two-dimensional proposal-based feature-level fusion uses a mapping from 3D LiDAR point clouds onto the respective 2D images and employs image-based techniques for image feature extraction and object detection.One of these approaches appends per-point 2D semantic details as additional channels of LiDAR 3D point clouds and uses an existing LiDAR-based object detection method [85].However, this approach is not optimal for identifying objects in a three-dimensional world because 3D details in the point clouds may be lost due to the projection.
Multi-level fusion combines 2D result-level fusion with feature-level fusion.This approach uses already available 2D object detectors and generates 2D bounding boxes.Then, points within these bounding boxes are detected.Subsequently, image and point cloud features are combined within the bounding boxes to estimate 3D objects.LiDAR and camera object detection using 3D proposal-based sequential models avoid 2D to 3D proposal transformations and directly generate 3D proposals from 2D or 3D data.This technique consists of two approaches, namely multi-view and voxel-based.MV3D [66] is a multi-view object detection network that uses LiDAR and camera data to predict the full 3D envelope of objects in the 3D environment.A deep fusion scheme was proposed to fuse features from multiple views in respective regions.The detection network comprises two networks: the 3D proposal network and the region-based fusion network.As the inputs, the LiDAR BEV, LiDAR front view and the RGB camera image are fed to the network.The LiDAR BEV is fed to the 3D proposal network to retrieve 3D box proposals.These proposals are used to extract features from the LiDAR front-view and camera RGB image inputs.Then, using these extracted features, the deep fusion network predicts object size, orientation, and location in the 3D space.The network was built on the 16-layer VGGnet [86], and the KITTI dataset was used for the training process.
One of the drawbacks of the multi-level fusion method is the loss of small objects in the detection stage due to feature map down-sampling.Combining image and point cloud feature maps by RoI (Regions of Interest)-pooling decreases the fine-grained spatial details.MVX-Net [65] introduces a method to fuse point cloud and image data voxel-wise or point-wise.Two-dimensional CNNs are used for the image feature extraction process, and a VoxelNet [87] based network detects 3D objects in the voxel representation.The input 3D LiDAR point cloud is mapped to the 2D image for image feature extraction in the point-wise fusion method, and then voxelisation and processing are done using VoxelNet.In voxel-wise fusion, the point cloud is firstly voxelised and then projected onto the imagebased 2D feature representation to extract features.This sequential approach achieved state-of-the-art performance for 3D object detection at the time of its publication.Object detection utilising one-stage models performs object proposal retrieval and bounding box prediction in a single process.These detection models are suitable for real-time autonomous robot decision-making scenarios.State-of-the-art single-stage object detection methods, such as [88], simultaneously process depth images and RGB images to fuse points with image features, and then the generated feature map is used for bounding box prediction.The introduced method [88] utilises, two CNN-based networks to parallelly process point cloud and RGB front-view images.One CNN identifies LiDAR features, and the other CNN extracts RGB image features.Then, these RGB image features are mapped into the LiDAR range view.Finally, mapped RGB and LiDAR image features are concatenated and fed into LaserNet [89] for object detection and semantic segmentation.This network has been trained in an end-to-end manner.The network training was done for 300K iterations with a batch size of 128, distributed over 32 GPUs.The image fusion, object detection, and semantic segmentation process took 38 milliseconds on an Nvidia Titan Xp GPU.

• Multimodal Semantic Segmentation
Scene semantic segmentation assigns a semantic category label to each pixel in a scene image, and it can be regarded as a refinement of object detection [90].A scene can incorporate obstacles, free space, and living creatures (not necessarily limited to these categories).The complete semantic segmentation of images applies these semantic categories to all pixels across an image.Many recent computer vision methods rely on CNN architectures.These networks favour dense depth image data over sparse data.In general, LiDAR sensors produce irregular, sparse depth measurements.Ref. [73] introduced a technique to utilise LiDAR sparse data and RGB images to achieve depth completion and semantic segmentation in a 2D view.This network can work with sparse depth measurement densities as low as 0.8%, and at the time of its publication, this method showed state-of-the-art performance on the KITTI benchmark dataset.The base network of this prior work was adopted from NASNet [91], which has an encoder and decoder architecture.Using LiDAR and RGB image feature fusion, ref. [92] proposed a novel method to achieve 2D semantic segmentation in 2D images.This method introduced a self-supervised network that suits different object categories, geometric locations, and environmental contexts.This selfsupervised network uses two sensor modality-specific encoder streams, which concatenate to a single intermediate encoder and then connect into a decoder to fuse the complementary features.The segmentation part is achieved with a network termed AdapNet++ [92] that consists of an encoder-decoder architecture.All these network models were implemented using the deep learning TensorFlow library.Another high-performing deep learning-based LiDAR-Camera 2D semantic segmentation method was presented in [93].In this method, the generated 3D LiDAR point data is mapped to the 2D image and up-sampled to retrieve a 2D image set that consists of spatial information.Then, fully convolutional networks are used to segment the image using three approaches: signal level, feature, and cross-fusion.In the cross-fusion method, the network was designed to learn directly from the input data.
The techniques discussed up to now have been 2D semantic segmentation methods.In contrast to 2D methods, 3D semantic segmentation approaches provide a realistic 3D inference of environments.An early approach for a 3D scene semantic segmentation network is presented in [94] termed 3DMV.This method requires relatively dense depth scans along with RGB images, and it was developed to map indoor scenarios.Voxelised 3D geometries are built by using LiDAR depth scans.Two-dimensional feature maps are extracted from the RGB images using CNNs, and these image feature maps are mapped in a voxel-wise manner with the 3D grid geometry.This fused 3D geometry is then fed into 3D CNNs to obtain a per-voxel semantic prediction.The overall performance of the approach depends on the voxel resolution, and real-time processing is challenging for higher voxel resolutions.Therefore, this dense volumetric grid becomes impractical for high resolutions.The system was implemented using PyTorch and utilised 2D and 3D convolution layers already provided by the application programming interface.Semantic segmentation of point clouds is challenging for structureless and featureless regions [95].A point-based 3D semantic segmentation framework has been introduced by [95].This approach effectively optimises geometric construction and pixel-level features of outdoor scenes.The network projects features of detected RGB images into LiDAR space and learns 2D surface texture and 3D geometric attributes.These multi-viewpoint features are extracted by implementing a semantic segmentation network and then fused point-wise in the point cloud.Then, this point data is passed to a PointNet++ [96] based network to identify per-point semantic label predictions.A similar approach was followed by Multi-view Point-Net [97] to fuse RGB semantics with 3D geometry to obtain per LiDAR point semantic labels.
Instead of localised or point cloud representations, ref. [98] have used a high dimensional lattice representation for LiDAR and camera data processing.This representation reduces memory usage and computational cost by utilising bilateral convolutional layers.Then, these layers employ convolutions for unoccupied sections in the generated lattice representation.Firstly, the identified features of point clouds are mapped to a highdimensional lattice, and then convolutions are used.Following this, CNNs are applied to detect image features from multi-view images, and these features are projected to the lattice representation to combine with three-dimensional lattice features.The generated lattice feature map was assessed by using 3D CNNs to obtain point-wise labels.A spatialaware and hierarchical learning strategy has been incorporated to learn 3D features.The introduced network was capable of training in an end-to-end manner.

• Multimodal Instance Segmentation
Instance segmentation identifies individual instances within a semantic category.It is considered a more advanced semantic segmentation method.This method not only provides per-pixel semantic categories but also distinguishes object instances, which is more advantageous for robot scene understanding.However, instance segmentation in autonomous navigation introduces more challenges than semantic segmentation.Instance segmentation approaches based on fused LiDAR-camera sensor data show proposal-based and proposal-free architectures.A voxel-wise 3D instance segmentation approach was introduced by [99] that consists of two-stage 3D CNN networks.A feature map was extracted from the low-resolution voxel grid by implementing 3D CNNs.Another feature map was obtained from the RGB multi-view images using 2D CNNs and projected onto the associated voxels in the volumetric grid to append with respective 3D geometry features.Then, object classes and 3D bounding boxes are predicted by feeding these fused features to a 3D CNN architecture.In the second phase, another 3D CNN estimates the per-pixel object instances using already identified features, object classes and bounding boxes.
These voxel-based segmentation methods are constrained by the voxel resolution and require increased computation capabilities with higher grid resolutions.The application of instance segmentation in LiDAR-camera fusion for real-time systems is challenging.Some research studies had limitations, such as the system developed in [100], which does not support dynamic environments.A proposal-free deep learning framework that jointly realises 3D semantic and instance segmentation is presented in [101].This method performs 3D instance segmentation in the BEV of point clouds.However, this approach is less effective in identifying vertically oriented objects because of the BEV segmentation approach.This method first extracts a 2D semantic and instance map from a 2D BEV representation of the observed point cloud.Then, using the mean shift algorithm [102], and semantic features of the 2D BEV, instance segmentation is achieved by propagating 3D features onto the point cloud.It should be noted that the instance segmentation approaches discussed have been developed to segment 3D point clouds from static indoor environments, and these methods have not shown any segmentation capabilities in dynamic environments.
Overall, while significant progress has been made in the perception capabilities of autonomous robots, particularly in regard to object detection and scene segmentation, many of the existing approaches have only been tested in relatively structured indoor environments, and substantial additional work may be required to adapt these techniques to be used in unstructured outdoor environments.Nonetheless, some very useful research directions have been identified that have the potential to significantly advance this field.

Robot Scene Understanding for Navigation Planning
For effective navigation at local and global scales, robots build maps to represent their external environments and to assist with making safe and accurate decisions at local levels.SLAM refers to the real-time combination of robot localisation and external environment mapping.SLAM is highly focused on mapping techniques.Early mobile ground robot map-building approaches comprised feature-based and photometric error-based methods.Robots could efficiently localise their positions in sparse or dense maps by implementing these mapping techniques.However, autonomous robots require more than basic 2D or 3D maps for effective task planning.Therefore, semantic SLAM techniques have been developed and investigated.Topological maps represent the environment as abstract graphs in contrast to classical metric maps.Classical metric maps are used in robotics to visualise the environment as a two-dimensional grid of cells, and each of these cells represents a specific location or area.The occupancy of a cell is typically represented by a binary value.However, these maps have limitations, such as their assumption of a flat and static environment and their inability to illustrate sensor data uncertainty.
To overcome these limitations, researchers have developed probabilistic mapping techniques, such as occupancy grid mapping, which use probabilities to represent occu-pancy and can handle more complex and dynamic environments.The metric maps include details such as visited locations by a robot, proximity to features and general arrangement of external environments.One of the challenges in classical 3D geometric mapping is the accumulation of position errors over time at the global map level.Topological maps are not directly affected by these error-driven shifts in the geometrical maps, which means that topological mapping is more adaptable to different environmental contexts than geometric mapping.Classical and topological mapping methods have their strengths and weaknesses.Hybrid methods combine metric and topological mapping techniques to leverage advantages of each method.Many hybrid representations have been utilised by indoor robots to generate environmental mappings.Hybrid mapping representations from recent research works generally consist of two or more layers in a hierarchy (e.g., places, categories, ontology).The research work presented in [103] uses depth measurements, semantic segmentation and scene flow to map a 3D environment.This concept can detect the dynamic behaviour of objects.Neural implicit scalable encoding for SLAM [104] introduces a SLAM framework that utilises a hierarchical feature representation.It uses a hierarchical feature grid with encoder-decoder networks to predict occupancies.The geometric maps of indoor environments are developed by using extracted coarse, mid, and fine-level feature grids.The middle and fine-level feature grids represent observed scene geometry.The feature grid in the coarse level detects indoor geometries such as the floor and walls.The coarse level occupancy is used in the scene reconstruction of unobserved environment regions.In the reconstruction process, coarse-level occupancy is optimised in the mid-level feature grid and refined at the fine level.
Kimera [105] presents a dynamic scene graph that consults a five-layer hierarchical representation.The method combines a metric semantic mesh with a topological spatial representation.It is capable of segmenting indoor structures and is robust in overcrowded environments.The five layers of the graph include metric-semantic mesh, objects and agents, places and structures, rooms, and buildings.To effectively implement these hybrid hierarchical maps in real-world applications, the representations need to be updated as the environment changes and should accurately map the interconnections between environment attributes.These hierarchical mapping techniques were implemented in indoor environments where semantic information extraction is less complex, however, and have not yet been applied to unstructured outdoor environments.Kimera-multi [106] is a state-of-the-art metric semantic SLAM system for multi-robot systems.The method is an extension of Kimera [105] and was tested in computer-simulated environments based on SLAM datasets and outdoor datasets collected by ground robots.This concept creates a 3D mesh that labels outdoor environments using semantic labels to gain high-level scene understanding.Kimera-multi outperforms the accuracy of Kimera [105] visual-inertial odometry SLAM.Overall, Robot semantic understanding requires the classification of both objects and places.To extract these predefined object models, conditional random field models, CNN-based scene and place classification algorithms, and scene graphs (e.g., robot scene graphs [107], dynamic scene graphs [105]) are utilised.
SLAM techniques can produce sparse or dense geometric or semantic maps of outdoor environments.However, these maps may not have adequate detail for safe local robot navigation where terrains consist of abrupt deviations and vegetation.Therefore, deep learning and image-based scene understanding are necessary directions to identify terrain properties and suitable environment regions for robot navigation in challenging outdoor unstructured environments.Recent robot navigation trends have emerged to segment camera egocentric images using computer vision approaches such as visual attention mechanisms [108].Ref. [109] presents a camera-only approach for segmenting egocentric images to assist robot navigation.The segmentation procedure consists of soft labelling of the images according to three levels of driveability.For every pixel, soft labels are mapped by assessing original semantic classes.The SegNet [110] based deep convolutional encoderdecoder architecture was used for the pixel classification.Another paper [111] introduces an RGB-based method to classify terrains into navigability levels by extracting multi-scale features.This work achieved state-of-the-art performance on the RUGD [112] and RELLIS-3D [113] datasets at the time of publication.A multi-head self-attention network architecture was used to classify the terrain into smooth, rough, bumpy, forbidden, obstacle, and background regions.This network requires further improvements to identify drastic terrain geometry changes (i.e., slopes) to gain higher accuracy in navigability segmentation.
In general, 2D scene understanding can lose essential details from the mapping process of 2D data to the 3D real world.Therefore, 3D scene understanding-based methods are used with the most recent robot navigation applications.Ref. [114] presented a 3D semantic segmentation method using a CNN encoder-decoder architecture.This approach inputs RGB images and depth images for the fusion of feature maps, with the FuseNet [115] architecture taken as the base for the CNN.Visual dynamic object-aware SLAM [116] combines SLAM with dynamic modelling of scenes, enabling robots to operate effectively in dynamic environments.This work introduces instance segmentation of dynamic objects with the camera trajectory.The model can identify and track stationary structures and dynamic objects and integrate them into the traditional SLAM framework.This framework does not need prior knowledge of objects and their shapes or models.The method uses monocular depth estimation to achieve instance-level segmentation and optical flow.Then, static and dynamic object features are tracked to trace the trajectories.The local map is updated in each consecutive frame.Environmental scene understanding using only monocular cameras can be challenging in mapping applications.As a result, semantic segmentation using LiDAR, radar, and stereo cameras has been investigated.The development of deep learning techniques has increased the usage of point cloud data in robot 3D scene understanding applications.VoxNet [61] is a 3D CNN architecture that was designed to obtain voxel grid representations of the environment using point cloud data.Although this network is capable of detecting objects, object detection alone is not sufficient for scene understanding.Ref. [117] used a LeNet [118] based architecture for semantic segmentation of point clouds.This approach could segment outdoor scenes into seven object categories by voxelisation of input point cloud data.Ref. [119] proposed a method for semantic segmentation of indoor scenes using only depth images.This network jointly predicts the 3D occupancy grid and semantic categories.However, some of the objects were missing during the inference due to a lack of depth details for those objects.In general, point cloud data is irregular; therefore, only representing point cloud data using uniform voxel grids might not accurately interpret real-world information.To address this adverse effect, ref. [120] proposed the OctNet architecture to represent point cloud data in a hierarchical partitioning of space using voxel representations.One of the problems of using voxel grid representations is the requirement of high computational power for varying sparsity of irregular point cloud data.Multi-view representations of 3D point clouds can be employed to reduce the adverse effects of volumetric grid representations.Ref. [121] attempted to generate 2D views of point cloud data, which were then utilised to analyse large urban point cloud data in 3D scenes using deep CNNs.Then, "tangent convolutions" were applied to the point cloud data to obtain an image by projecting local geometry to a 2D plane tangent to a point.
In general, 2D projective methods are more efficient and scalable than 3D volumetric representations.However, there may be a degree of information loss in the final 3D structure due to inaccuracies in those projections.Instead of converting point cloud data into uniform volumetric representations, processing these point clouds as unstructured data is also a feasible approach [122].Ref. [59] proposed PointNet, one of the significant point cloud semantic segmentation works.This work shows a degree of permutation invariance to the input point cloud data.It can classify objects by allocating them to categories (e.g., TV, table) and carry out semantic segmentation in scenes.To try to improve on these point-level processing techniques, several works have been carried out that use deep learning methods [123][124][125][126][127]. Computer memory requirements for 3D voxel representations tend to vary according to the level of voxel resolution.In the presence of sensor noise, the estimation of the occupancy of voxels might be challenging.Instead of using 3D volumetric representations, graph and tree-based representations have been proposed to recover 3D structures from point cloud data.Ref. [128] introduces a 3D graph neural network to semantically segment the environment using 2D RGB image and point cloud data.This method uses both CNNs and graph neural networks to predict semantic classes, with the CNN used to extract features from 2D RGB images.The Superpoints graph method was introduced by [129] to learn scene contextual information and geometry.Superpoints are interest points in a point cloud that are detected and described using a neural network.They are densely sampled and designed to be efficient and robust for use in SLAM applications.First, the input point cloud is separated into geometric elements (named Superpoints), and then these Superpoints are combined considering their mutual features.Finally, graph convolutions are implemented to generate contextual information and semantic labels.This method was shown to be appropriate for the segmentation of large point clouds.Another significant work that processes orderless point cloud data using a permutation invariant deep learning architecture is So-Net [130].This network extracts features from point clouds hierarchically.The receptive field of the CNN is controlled systematically by performing a point-to-node k-nearest neighbour search to retrieve the point cloud spatial information.This network conserves the topology attributes of the point representations.
Overall, a number of approaches show promise for improving the scene understanding capabilities of autonomous robots, although questions remain regarding how effectively they can be applied in outdoor unstructured environments, particularly with dense vegetation or other complex terrain features.Improvements in both sensor capabilities and algorithms are likely to be necessary to achieve significant progress in this area, but some promising research directions have been highlighted.

Mobile Robot Local Path Planning
Approaches for path planning of robots in local environments can be broadly classified into either classical methods or learning-based methods that attempt to modify the pathplanning based on factors such as environmental conditions and prior experience [131,132].The classical methods follow a modular architecture with environmental perception, planning of paths relative to generated global maps, and trajectory following.Many classical techniques are applicable for robot navigation in static environments.However, their suitability can substantially diminish in unstructured or dynamic environments [133].In general, these methods can be used effectively in indoor mobile robot applications but may not be suitable for outdoor off-road navigation conditions (e.g., terrain with grass might be traversable despite appearing blocked, and ground with mud or sand may not be suitable for pathing, despite appearing clear), hence the requirement to investigate machine learning-based navigation approaches.Several global and local motion planning methods for ground robot navigation are compared in Tables 4 and 5.
The A* and Dijkstra algorithms have been well-researched in the past decades, and these methods have shown their potential by extensively being applied with the Robot Operating System (ROS) in many real-world robotics applications.Combining these two path-planning methods with heuristic searching is effective in relatively low-complexity 2D environments.However, these methods require heavy computational capabilities in high-dimensional environments or may struggle in unstructured and dynamic working environments.
Random sampling-based path planning algorithms generally consist of BITs and RRTs.Regionally Accelerated Batch Informed Trees (RABIT) are more widely used and perform better in high dimensional and dynamic environments [134] in comparison to graph search-based path planning algorithms.Bionic-based intelligent robot path planning methods simulate the behaviours of insects to generate evolving paths.These evolutionary methods include the ant colony, particle swarm optimisation, and genetic and artificial bee colony algorithms.Many other optimised versions of these algorithms have been proposed to improve calculation efficiency and avoid local minimum problems.A welding robot system developed in [135] had a combination of genetic and particle swarm optimisation algorithms to solve for the shortest route while avoiding obstacles.The artificial potential field method, ant colony optimisation, and geometry optimisation path methods were combined in another research study [136] to search the globally optimal path in 2D scenarios using computer simulations.This method showed fast solutions and reduced the risk of trapping robots in local minimum points.A multi-objective hierarchical particle swarm optimisation method has been proposed by [137] to plan global optimal path trajectories in cluttered environments.This method utilises three layers to generate the robot navigation trajectories.The triangular decomposition method [138] is applied in the first layer, the Dijkstra algorithm is used in the second, and a modified particle swarm optimisation algorithm is applied in the last layer.In general, the local path planning strategies that have been discussed use the available sensor data of robots regarding their surroundings to map, understand, and generate local paths while avoiding obstacles.These local path planning methods are effective in mobile robotic applications because the data captured by sensors varies in real-time in response to the dynamics of environments.In comparison to global path planning strategies, local path planning is more critical for practical robot operation and usually serves as a bridge between global planning and the direct control of robots.However, local path planners have one notable drawback in that they often lead robots to local minimum points.Many classical local path planning algorithms can generate optimal paths in the local environments while avoiding local minimum problems.These methods include techniques such as the Fuzzy Logic algorithm, artificial potential field method, and simulated annealing algorithm.In general, these methods do not evaluate the relative velocities between robots and dynamic environment objects, however, which can lead to difficulties.In many worst-case scenarios, even the velocity profiles of these obstacles can be hard to acquire for the robots.Visual-inertial odometry methods show success in outdoor environment navigation scenarios but are not suitable for navigating in off-road conditions without pre-built maps or GPS assistance.
Machine learning methods have been applied for mobile robot navigation to learn semantic information [139-141] and statistical patterns [142,143] of environments.Several other research works [144][145][146][147] have used machine learning to achieve robustness in path following.In recent years, many Reinforcement Learning (RL) and imitation learning methods and approaches based on self-supervised learning [148][149][150][151] have been applied in mobile robot navigation policy design and for training supervision.Many classical modular and deep learning-based approaches have been utilised in the navigation modules of outdoor mobile robots [152].RL uses strategies to learn optimal robot decisions from experiences.The interconnection between environments and robots is modelled as a Markov Decision Process (MDP).Robots receive rewards as feedback signals for training while traversing different environments.The basic RL process is illustrated in Figure 2. RL methods can be separated into model-free and model-based learning scenarios.In model-free RL, the robots are not required to evaluate the MDP model rewards or policies directly and can obtain these directly through what the robot experiences.Model-free RL approaches have several subcategories such as value-based, policy-based, and Actor-Critic (improved versions of the policy-based algorithms) [153,154].
In value-based methods, optimal policies are obtained by iteratively updating the value functions.Policy gradient-based methods directly approximate a policy network and update the policy parameters to obtain an optimal policy that maximises the reward value.Deep Q network (DQN) and Double DQN are the two main value-based DRL methods.A DQN-based end-to-end navigation method has been introduced in [155].In this work, a feature-extracting network used an edge segmentation method to improve the efficiency of the network training process.The simulated models were transferred to the real world without significant performance loss.Discrete robot actions were implemented based on a grid map.The robot exploration framework is divided into decision, planning, and mapping modules.The learning-based decision module has shown good performance, efficiency, and adaptability in novel environments.A graph-based technique has been implemented for the mapping module.The applied path planning module includes the A* algorithm as the global planner along with a timed elastic band [156] local planner.Value-based DRL methods produce discrete actions, meaning they are not appropriate for the continuous robot action space.The policy-based DRL methods provide continuous motion commands for robots.Combining the policy gradient with the value function creates the Actor-Critic RL method.In general, Actor-Critic algorithms are well-suited for the continuous motion space of autonomous mobile robots.The Actor approximates the policy used to generate actions in the environment.The Critic architecture is accountable for using a reward function to evaluate robot actions iteratively and guides the Actor in successive iterations.The Critic uses deep learning value functions such as DQN and Double DQN to evaluate each iteration step.The Actor-Critic approach includes learning methods such as Depth Deterministic Policy Gradient algorithm (DDPG), Trust Region Policy Optimisation (TRPO), Proximal Policy Optimisation (PPO), Asynchronous Advantage Actor-Critic (A3C), and Soft Actor-Critic (SAC) [21].An extended Actor-Critic algorithm was proposed in [157].The visual navigation module of the deep neural network consisted of depth map prediction and semantic segmentation auxiliary tasks.The proposed learning network requires an image of the target and an observed image as inputs.This network architecture was proposed to obtain a visual navigation policy for indoor environments.A summary of DRL motion planning methods is included in Table 6.Model-based robot RL methods generate models of external environments using supervised training and implement value functions to learn actions that maximise the return.Model-based RL has faster convergence and high sample efficiency.Ref. [158] presents a biped robot that learns on a rotating platform by combining model-based and model-free machine-learning methods.This paper has addressed the overfitting problem of model-based robots.The research has simulated the robot in a 2D scenario and has shown a reduction in learning time compared to model-free RL algorithms.Imitation learning is another related approach that uses demonstrations by experts operating robots for specific tasks to obtain policy functions.Mobile robots learn mappings between observations of these demonstrations and appropriate robot actions.Path trajectories can be rapidly generated by manipulating learned policies from expert demonstrations.However, the capacity of the robot to practically record these demonstration results from real-world experiments can be challenging.Ref. [159] has introduced a method to acquire a mapping from actions to states using ego-centric videos collected by a human demonstrator using a mobile phone camera.The policy learning step was executed within a simulation platform; then, the developed navigation policy was tested on a Clearpath Jackal wheeled robot in an indoor environment.The robot and the trained videos had view-point mismatches, but the model was robust to those changes.The system was successfully able to map camera sensor inputs to actuator commands using the developed imitation learning policy.
Inverse RL is another approach capable of learning reward functions from expert demonstrations.However, inverse RL is more appropriate for exploring novel environments because imitation learning tries to directly follow a demonstrator rather than improve beyond that knowledge level.The creation of an inverse RL network by employing a visionbased imitation learning method was presented by [160].This method approximates a value function from one middle layer of a policy trained by imitation learning.The related experiments were carried out on a real-world setup and using the ROS Gazebo simulation platform.This proposed method has shown the capability of generalising the system for unseen environments by producing cost maps.Traditional geometric-based SLAM procedures lack the capacity to capture dynamic objects in external environments and are thus only suitable for planning robot actions in static environments.RL-based techniques, however, can learn policies by considering dynamic obstacles in outdoor environments, although the sample efficiency is lower compared to imitation learning and model-based learning.The implementation of learning-based maps and traditional path planners is one way to improve sample efficiency.A mobile robot affordance map generation using the RL policy-based method was introduced by [161].This research used metric cost maps (attaching semantics and geometry) for robot navigation.The A* classical path planner was implemented for path generation.This learning-based SLAM approach was implemented using simulation software.Semantic, dynamic, and behavioural attributes of unseen environments have been learnt by the model using the simulation scheme.Local path planning remains a highly challenging task, particularly in environments with dynamic actors or challenging terrain features such as unpredictably varying topography or varying ground conditions.Modern machine learning approaches, coupled with advanced SLAM techniques, show promise for enabling effective navigation planning, but methods that can effectively handle all of these challenges do not yet exist, and considerable further work will be required to enable fully autonomous navigation in all conditions.

Summary of the Current State-of-the-Art Techniques
Autonomous navigation can be broadly categorised into the classical modular pipeline methods and end-to-end learning approaches.The modular pipeline approach has limitations due to the requirement for high levels of human intervention in designing the modules, the loss of information through each module, and the overall lack of robustness when conditions vary beyond anticipated limits.The end-to-end learning approach suffers from problems such as over-fitting and poor generalisation.Both modular and end-to-end robot navigation approaches rely on sensors to capture information about the environment or internal robot attributes.Researchers have investigated different sensor modalities to improve robot perception, including raw input types such as sound, pressure, light, and magnetic fields, as well as common robot perception sensor modalities such as cameras, LiDAR, radar, sonar, GNSS, IMU, and odometry sensors.It is crucial to have a reliable real-time understanding of external 3D environments to ensure safe robot navigation.While cameras are commonly used in mobile robotics, relying on a single sensor is not always robust.Different sensor modalities are often combined in robotic applications to improve perception reliability and robustness.Camera and LiDAR fusion has been a popular approach in the robotics community.This fusion method has shown better performance than other vision-based dual sensor fusion approaches, and recent advances in deep learning algorithms have further improved its performance.Monocular and stereo cameras are commonly used with LiDAR sensors to fuse images and point cloud data.However, the technical challenges and cost of sensors and processing power requirements still limit the widespread application of these methods for regular use.Computer vision has experienced rapid growth in the past decade, with deep learning methods accelerating its development.Object detection, depth estimation, semantic segmentation, instance segmentation, scene reconstruction, motion estimation, object tracking, scene understanding, and end-to-end learning are among the subtopics of computer vision.These methods have been widely applied in autonomous navigation, but their accuracy and reliability can be limited, prompting ongoing efforts to improve their quality.Benchmarking datasets, such as KITTI, Waymo, A2D2, nuScenes, Cityscapes, RELLIS-3D, RUGD, Freiburg [29], and WildDash [162], have been used to compare the performance of different vision methods in autonomous urban driving and off-road driving applications.
Semantic SLAM techniques provide geometric and semantic details of external environments.However, existing semantic SLAM techniques may not be adequate for the safe navigation of robots in outdoor unstructured environments due to challenges in feature detection, uneven terrain conditions (which can lead to robot localisation errors, detection errors, etc.) and effects of wind on vegetation, which can lead to sensor noise and ambiguous results.Therefore, terrain traversability analysis and improved scene understanding using deep learning methods have been developed that are showing promise in assisting robot navigation in outdoor unstructured environments.
Many classical robot path-planning approaches are used for navigation in local environments.These classical approaches rely on a modular architecture to perceive the environment, plan paths relative to generated maps, and follow trajectories.However, these classical methods are often challenged in dynamic or deformable environments and are not well-suited for off-road navigation conditions where terrains may be unpredictable.Learning-based methods have been developed to improve robot path-planning abilities under different environmental conditions and have shown promise in addressing these challenges.In particular, learning-based navigation methods have been shown to be better suited for off-road navigation conditions, where terrains can be highly variable and unpredictable.

Research Challenges and Future Directions in Unstructured Outdoor Environment Navigation
Despite the significant progress that has been made in improving the abilities of robots to perceive their environment, localise themselves within it, and plan paths to navigate through it; significant challenges remain with the application of these methods to unstructured outdoor environments.Many limitations exist with respect to perception capability, particularly when sensor and computation costs are a factor, and effective methods for robust scene understanding within such environments have not yet been demonstrated, substantially hindering the ability to deploy fully autonomous robots into such environments.This section of the paper explores the current state of these significant research challenges and provides suggestions for the most promising directions for future research efforts to overcome them.

Research Challenges
SLAM methods have been extensively used for robot navigation applications in indoor environments.Urban autonomous navigation systems also implement SLAM algorithms successfully for sophisticated motion planning.Only a few applications have demonstrated the use of SLAM for dirt roads and off-road uneven terrain navigation.Many visual SLAM techniques are unstable in off-road environments with trees, bushes, uneven terrain and cluttered objects.Application of classical SLAM approaches in off-road environments with vegetation is challenging because of the irregular and dynamically varying structure of the natural environment.LiDAR-based methods may detect traversable regions as geometric obstacles.Therefore, existing SLAM research on robot navigation in off-road unstructured environments is limited.Several feature-based SLAM algorithms fail due to reasons such as the absence of specific features or the loss of detected features caused by the dynamic movement of vegetation due to wind.In general, mobile robots in off-road environments in the presence of vegetation use GNSS with IMUs to estimate the real-time position of robots.In relatively open tree canopies, GNSS sensors can establish a consistent communication link with satellites.The GNSS positioning accuracy in sparse tree canopy environments can be decimeter level [163].However, under dense tree canopies, GNSS may lose location data due to potential signal interruptions.
Visual SLAM (VSLAM) uses features or pixel intensity in images to generate information about environments.Many visual SLAM methods employ feature-based (e.g., points, lines, corners, planes) algorithms for detecting objects or developing a higher-level understanding of environments.In general, monocular, stereo and RGB-D cameras are used in VSLAM.RGB-D camera SLAM is restricted to indoor applications because of its sensitivity to ambient light conditions.SLAM approaches that use event cameras are still not well established.Many point-feature-based SLAM methods are not robust in low-textured scenes such as plane surfaces and are vulnerable to illumination changes.Line features are less sensitive to changes in lighting conditions than point features.Plane features are often found in the artificial environment but are not regularly present in natural environments.
Using pixel intensities instead of features in SLAM is more robust to changes in lighting conditions and results in more scene information.Modern semantic VSLAM methods are mostly restricted to specific environments such as indoor or urban driving.Many VSLAM approaches focus on feature-based techniques to generate maps and localise robots.In an off-road environment with vegetation, feature extraction is challenging.The number of distinguishable features in these environments might be low due to the unstructured nature of natural environments.The presence of occlusions, shadows and illumination variation in off-road unstructured environments can obstruct useful feature detection in VSLAM.Therefore, obtaining a higher level of semantic understanding is more complex than understanding urban driving scenarios.In order to improve VSLAM performance in vegetated, outdoor, and unstructured environments, one can train deep learning models to understand features using a large dataset.However, there are difficulties in adapting these networks to unseen environments [164].
LiDAR SLAM is more robust than VSLAM in varying illumination conditions and under shadows.Point cloud registration and feature-based methods are the most common LiDAR SLAM approaches [165].The LiDAR SLAM point cloud registration approach is more robust in vegetated natural environment mapping than the feature-based method.An adverse effect of LiDAR SLAM is the loss of semantic details in the environment.Researchers have introduced camera and LiDAR sensor fusion-based SLAM concepts to compensate for these issues.There are two separate LiDAR-camera fused SLAM directions, loosely coupled and tightly coupled.In loosely coupled SLAM systems, the modalities use two independent approaches for extracting features and high-level information to combine the two information streams.Computations in loosely coupled systems are less complex and generate faster results than in tightly coupled systems.However, these systems may produce low accuracies due to complementary information losses in independent sensor data processing steps.Tightly-coupled models fuse the two forms of sensor data into one framework to generate feature detection or higher-level semantic understanding.Therefore, this approach is more robust in challenging environments.However, these methods are more complex and need high computer memory and GPU requirements.
In unstructured outdoor environments, robots require perception, scene understanding and identification of terrain regions suitable for robot navigation.If robot working environments are dynamic, achieving these tasks will be more challenging.Robot scene understanding requires a robot to be able to perceive, analyse, and interpret its working environment.Scene understanding relies on subtopics such as object recognition, semantic segmentation, instance segmentation and scene representations of images or videos.Robot scene understanding is a continuous process, and robots need to be able to learn and adapt to new environments and situations over time.Implementation of machine learning algorithms in scene understanding has increased the possibilities of generating more sophisticated learning-based architectures.However, in general, these methods have high computational requirements.Object recognition in unstructured environments is challenging due to the higher frequency of object occlusions, similar features in different objects, indistinct object boundaries and varying lighting conditions.Semantic segmentation of an unstructured environment is another essential part of understanding the environment, and it is more complex than segmentation in urban environment scenarios due to the variation of different semantic classes.It is challenging to extract fine-grained semantic details from unstructured outdoor environments.These environments have wide semantic diversity, and training robots using large, annotated datasets is challenging due to the limited availability and difficulty of creating such datasets.
Researchers use instance segmentation algorithms to acquire a semantically rich understanding of environments, relationships between objects, and contexts.However, limited research has been conducted on robot instance segmentation of urban road environments, and these methods require high computational requirements and are not robust in off-road scenarios.Robots should learn how to robustly interpret environments over time using scene-understanding information to overcome these limitations.Point-based, graph-based, tree-based, multi-view projections, image-based, and volumetric representations have been developed to achieve scene understanding beyond basic perception.These techniques can comprehend environments beyond what can be inferred from a single image.Further investigation of the above-mentioned scene representations is necessary to learn and evolve scene interpretations over time.Novel concepts need to be developed and investigated to extract information from scenes at the object and semantic levels.
Robot domain identification and adaptation are valuable for real-world applications to improve performance and achieve more sophisticated navigation capability.In general, robots require the adaptability of learning-based navigation models.Therefore, networks that are trained on one domain (e.g., simulated environment [166], urban city environment) need to identify and adapt to changes in the application domain (e.g., real world, rural off-road environment).These domain gaps can be expressed as one challenge for robot navigation in unstructured environments.The domain gap refers to the significant deviations in environment attributes of the known or trained domains relative to the actual operating domains of the robot.These differences in application domain can occur due to variations in weather, texture, previously unseen terrain conditions, environment context changes (e.g., static to dynamic) or different factors in the robot operating environments.Deep learning architectures such as reinforcement learning can generate robust performance in one target domain but may subsequently fail due to an unforeseen drastic change in that environment.
Robots can be designed to be more adaptable if they are trained on large and diverse real-world data.However, the collection of real-world data to achieve such diversity can be challenging.Thus, there is a requirement for the development of more efficient data collection and data augmentation techniques.One option to address this is to train robots in computer simulators using data collected synthetically instead of training on real-world data.This method has benefits such as a reduction of the number of real-world experiments and a more economical and safer deployment in comparison to real-world systems.However, the trained models might have low adaptation to the real world due to overfitting to the simulation environment conditions.If the physics and the environments of simulators are significantly different from the target domain, the learned policies and loss functions in the simulator domains might not be directly applicable to the real world, thus leading to low performance.Novel techniques are needed to ensure that learningbased models, policies, and loss functions are transferable to the application domain to bridge this gap.Real-time performance is a critical aspect of robots working effectively in unstructured environments.The models trained in simulators might be unnecessarily complex for application in the real world.Therefore, techniques of domain adaptation must be transferable, economically sustainable, and computationally efficient in making decisions.In safety-critical applications, domain adaptation must be performed while ensuring safety, which requires careful validation and testing to ensure that the adapted model behaves safely in the real world.

Future Research Directions
Multimodal sensor fusion for robot vision provides more robust and reliable environment perception and scene understanding results than mono-sensor robot vision strategies in unstructured environments.Cameras and LiDARs are usually employed to retrieve RGB and depth data of external environments, respectively.However, high-resolution LiDAR sensors are costly, and their depth data sparsity increases when the observation range is increased.In contrast to mechanical LiDARs, solid-state LiDARS can provide highly accurate and higher-resolution depth data.Their costs are also likely to be significantly lower than mechanical spinning LiDARs, but these sensors are still at the research stage.The development and commercialisation of solid-state LiDARs will likely be pivotal for future advancements in robotic unstructured environment perception.
In robot navigation, domain change can involve training a model on simulated data and then fine-tuning it on real-world data to improve its performance in the real world.Robot domain adaptability can be enhanced by using techniques such as transfer learning, adversarial training, or domain adversarial neural networks.Researchers have proposed techniques such as data augmentation, feature alignment, or domain adaptation loss functions to address the domain gap.However, domain adaptation remains an active area of research in robotics, and there is still much to be done to improve the performance of machine-learning models in real-world settings.End-to-end learning architectures for robot navigation are appropriate for understanding robot tasks in the local motion planning context.For long-range navigation and in different environment domains, learning-based methods in combination with the classical hierarchical navigation pipeline show promise to produce more robust and safe results.
Robot environment perception, analysis and scene understanding are critical for robots to generate useful maps and make intelligent decisions.Recent research works such as Kimera [105] and Kimera-multi [106] are able to generate high-level geometric and hierarchical semantic environment understanding for robots.Robot systems that consist of geometric and hierarchical scene representations combined with incremental scene understanding will be advantageous in future robot outdoor unstructured navigation applications.However, the degree of scalability and adaptability of these approaches in unstructured environments with vegetation remains an open research question.In vegetated unstructured environments, obtaining feature-based scene understanding is challenging.Many SLAM methods provide rich geometric data but may incorrectly identify grass-covered terrains as not being traversable and muddy areas as traversable.Therefore, the combination of LiDAR-based geometric/semantic maps with the assistance of reinforcement learning of ego-centric images for robot local path planning could be a promising direction for more robust outdoor unstructured environment navigation.This approach will combine robot SLAM with image-based scene understanding to generate higher levels of spatial understanding of local environments for robots to be able to generate more accurate and safe movements.Overall, our comprehensive review of the literature has indicated that further research into multimodal sensor fusion techniques and deep learning-based scene understanding and task planning methods provides the most promise towards achieving the goal of fully autonomous navigation in outdoor unstructured environments.

Conclusions
This paper has provided a comprehensive review of the current state-of-the-art in autonomous mobile ground robot navigation, identified research gaps and challenges, and suggested promising future research directions for improved autonomous navigation in outdoor unstructured terrains.A broad review of robot sensing, camera-LiDAR sensor fusion, robot scene understanding, and local path planning techniques has been provided to deliver a comprehensive discussion of their essential methodologies and current capabilities and limitations.The use of deep learning, multimodal sensor fusion, incremental scene understanding concepts, scene representations that preserve input data topology and spatial geometry, and learning-based hierarchical path planning concepts are identified as promising research domains to investigate in order to realise fully autonomous navigation in unstructured outdoor environments.Our review has indicated that applying these

Figure 1 .
Figure 1.Classical and end-to-end autonomous navigation approaches.

Table 4 .
Global path planning algorithms.

Table 5 .
Local path planning algorithms.