1. Introduction
Autonomous mobile robots are increasingly being used in logistics, smart cities, security, and outdoor delivery services. Recently, several countries have allowed mobile robots to operate in pedestrian spaces such as sidewalks and crosswalks, expanding their operation from roads to shared spaces and mixed traffic zones. This expansion implies that the concept of a drivable area for mobile robots is no longer limited to roads. Instead, it refers to any continuous planar surface that the robot is currently navigating. These environments are more diverse and less predictable than structured roads, containing dynamic obstacles such as pedestrians, bicycles, and e-scooters, as well as static obstacles. Changing conditions, such as lighting and weather, add further challenges to perception. To navigate safely in such scenarios, autonomous mobile robots must accurately recognize drivable areas in real time and adapt to variations in their surroundings.
Drivable area recognition methods predominantly rely on RGB images. Lane-detection-based approaches [1,2] effectively extract lanes and boundaries to identify driving paths but cannot account for dynamic obstacles such as surrounding vehicles or pedestrians. To address this limitation, pixel-level semantic segmentation methods [3] have been introduced, enabling fine-grained partitioning of road regions and direct recognition of drivable areas. Although such RGB-based segmentation approaches achieve high performance on structured roads such as highways and urban streets, their effectiveness decreases significantly in unstructured environments such as unpaved roads or off-road terrain. In these settings, boundaries are often ambiguous and surface materials are irregular, leading to degraded recognition performance. To mitigate these issues, Zhang et al. [4] proposed segmentation-based recognition in complex unstructured areas such as parks, and Lee et al. [5] attempted drivable area detection on forest roads. However, their methods still suffered performance degradation in multi-path terrains and required additional training under varying illumination conditions. Therefore, despite leveraging rich visual cues, RGB-based drivable area recognition has fundamental limitations, including insufficient handling of dynamic obstacles, degraded performance in unstructured environments, and vulnerability to illumination changes and shadows.
To overcome these challenges, LiDAR-based approaches have also been widely studied and applied because of their robustness against illumination changes, ability to capture precise spatial geometry, and effectiveness at obstacle detection. Asvadi et al. [6] proposed a method for detecting both static and moving obstacles in driving environments, whereas Xie et al. [7] developed a technique for identifying negative obstacles such as ditches and depressions, effectively excluding non-drivable areas. Steinke et al. [8] introduced GroundGrid, a LiDAR-based method for ground segmentation and terrain estimation, which provides a reliable foundation for drivable area recognition. However, this approach primarily focuses on separating ground from non-ground regions and lacks a semantic understanding of surface-level features such as lane markings, curb boundaries, or pavement patterns. To overcome this limitation, Ali et al. [9] proposed a 3D-LiDAR-based road segmentation framework that exploits angular road structures to identify drivable areas in deteriorating road regions. Nevertheless, LiDAR-based methods have inherent limitations: they struggle with low-reflectivity or irregularly shaped objects, and low-channel sensors often produce sparse point clouds that hinder accurate geometric reconstruction.
Several large-scale semantic datasets have been developed to overcome these limitations and support robust LiDAR-based learning. SemanticKITTI [10] offers densely annotated LiDAR sequences collected across various driving scenes, enabling detailed semantic learning. Leveraging such resources, Cylinder3D [11] adopts a cylindrical projection strategy that improves segmentation performance under sparse point cloud conditions. Other datasets [12,13,14] and models [15,16,17,18,19,20] have also emerged, continuously advancing 3D segmentation capabilities in complex outdoor environments.
Considering the limitations of relying solely on LiDAR data, recent studies have explored multimodal approaches that integrate RGB imagery with LiDAR data. By leveraging the complementary strengths of visual and spatial cues, these methods have achieved robust performance under varying illumination, dynamic obstacles, and complex scene conditions [20]. Xue et al. [21] addressed the challenge of drivable area detection on dark, unstructured roads by combining CNN-based fusion with surface normal estimation. Candan et al. [22] applied a U-Net-based fusion framework that integrates RGB images with LiDAR range representations, thereby improving drivable area recognition in structured road environments. Despite these advancements, performance still depends heavily on the sensor configuration, such as the LiDAR channel count and resolution.
Most RGB–LiDAR fusion models, as well as LiDAR-based learning models, are designed and developed under fixed sensor setups, assuming a consistent resolution and channel count. When these parameters change, most existing methods require retraining before deployment in a new environment [23]. This retraining requirement creates a critical dependency and becomes especially problematic in real-world scenarios where sensor configurations vary widely, such as autonomous delivery robots, mobile platforms, or other systems deployed across diverse environments. Therefore, there is a growing need for a flexible and robust perception framework that can operate consistently across varying sensor configurations without additional model tuning. This need is especially critical for systems that demand real-time performance and scalable deployment.
To overcome these challenges, this study proposes a practical framework for drivable area perception and mapping that is both sensor-agnostic and modular in architecture. The system separates RGB and LiDAR preprocessing prior to fusion, allowing seamless adaptation to various sensor configurations without retraining. In addition, it achieves the computational efficiency required for real-time robotic navigation while maintaining recognition performance comparable to that of learning-based methods. The key contributions of this study are summarized as follows:
- 1.
Robust drivable area recognition via modular RGB–LiDAR fusion
We present a robust drivable area recognition method based on modular RGB–LiDAR fusion. The system processes RGB segmentation and LiDAR-based ground estimation independently, enabling reliable performance across various sensor types and resolutions. A LiDAR scaling technique is introduced to address point cloud sparsity and enhance the geometric continuity of detected drivable surfaces. Additionally, the perception pipeline is fully modular: each component can be adapted or replaced independently according to the available hardware. This flexibility supports seamless deployment across diverse robotic platforms and improves both scalability and integration efficiency.
- 2.
Real-time global drivable area mapping
We propose a method for generating a global drivable area map in real time. The local drivable area identified in each frame is transformed into a global coordinate frame through SLAM-based localization, and the system incrementally integrates local recognition results over time. This process constructs a global drivable area map that captures both the surrounding environment and the accumulated perception results. The resulting map enables stable and reliable path planning in dynamic environments.
2. Methods
2.1. System Overview
The proposed system processes camera images and LiDAR point clouds to identify drivable areas where a robot can safely operate and to generate a drivable area map. As illustrated in Figure 1, the overall architecture comprises four main stages. In Figure 1(a-1), which represents camera-based drivable area mask generation, the RGB image obtained from the camera is processed to generate a drivable area mask. In Figure 1(a-2), for LiDAR-based ground point cloud generation, the point cloud obtained from the LiDAR is processed to produce an interpolated ground point cloud. In Figure 1(a-3), which shows camera–LiDAR fusion, the interpolated ground point cloud is projected onto the image plane, and only the points intersecting the drivable area mask are retained to produce a high-confidence drivable area point cloud. Finally, in Figure 1b, which represents global drivable area map generation, the drivable area point cloud is transformed from the LiDAR coordinate frame into the global coordinate frame using SLAM-estimated robot poses and incrementally accumulated to construct the global drivable area map. Steps (a-1) to (a-3) collectively form the local drivable area recognition module, and step (b) corresponds to the global drivable area map generation module.
2.2. RGB-Image-Based Drivable Area Mask Generation
The RGB image obtained from the camera is processed to generate a drivable area mask. At each time step $t$, the image is denoted as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ represent the image height and width, respectively, and the last dimension corresponds to the three color channels. The object detection function $f_{\mathrm{det}}$ uses the YOLOv11s model to process $I_t$ as an input and detects regions corresponding to drivable areas, outputting bounding boxes $B_t$. The segmentation function $f_{\mathrm{seg}}$ then takes $I_t$ and $B_t$ as inputs to generate the final drivable area mask $M_t$ for the detected regions. Several models have been employed for this segmentation step, including SAM 2.1 Tiny [24], FastSAM-x [25], MobileSAM [26], and YOLOv11-seg-l.
Here, $M_t$ denotes a binary image mask, where $M_t(u, v) = 1$ indicates that pixel $(u, v)$ belongs to the drivable area, and a value of zero is assigned otherwise, as defined in Equation (1).
$$M_t(u, v) = \begin{cases} 1, & \text{if pixel } (u, v) \text{ belongs to the drivable area} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
The complete process of camera-based drivable area mask generation is presented in Algorithm 1, and the corresponding pipeline is illustrated in Figure 2.
Algorithm 1: Camera-based Drivable Area Mask Generation Process
Input: $I_t$: RGB image at time $t$
Output: $M_t$: drivable area mask at time $t$
1: $B_t \leftarrow f_{\mathrm{det}}(I_t)$;
2: $M_t \leftarrow f_{\mathrm{seg}}(I_t, B_t)$;
3: return $M_t$
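To make Algorithm 1 concrete, the following Python sketch chains a YOLOv11s detector and a box-prompted SAM 2.1 segmenter through the Ultralytics API. The weight file names (yolo11s.pt, sam2.1_t.pt), the drivable-area class index, and the assumption that the returned masks match the input resolution are ours for illustration only; this is a minimal sketch, not the exact implementation used in this work.

```python
# Minimal sketch of Algorithm 1: detect drivable-area regions, then segment them.
# Assumes Ultralytics weights "yolo11s.pt" / "sam2.1_t.pt" and class index 0 as
# placeholders; masks are assumed to be returned at the input image resolution.
import numpy as np
from ultralytics import YOLO, SAM

detector = YOLO("yolo11s.pt")      # f_det: bounding-box detector
segmenter = SAM("sam2.1_t.pt")     # f_seg: box-prompted segmentation

def drivable_mask(image: np.ndarray, drivable_class: int = 0) -> np.ndarray:
    """Return a binary H x W mask M_t (Equation (1)) for one RGB frame I_t."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)

    det = detector(image, verbose=False)[0]
    boxes = det.boxes.xyxy.cpu().numpy()
    classes = det.boxes.cls.cpu().numpy().astype(int)
    boxes = boxes[classes == drivable_class]            # B_t: drivable-area boxes
    if len(boxes) == 0:
        return mask

    seg = segmenter(image, bboxes=boxes.tolist(), verbose=False)[0]
    if seg.masks is not None:
        for m in seg.masks.data.cpu().numpy():           # one mask per box prompt
            mask |= (m > 0.5).astype(np.uint8)           # union of per-box masks
    return mask
```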
2.3. LiDAR-Based Ground Point Cloud Extraction
The raw point cloud obtained from the LiDAR sensor is processed to extract and interpolate ground points. At each time step $t$, the point cloud is denoted as $P_t = \{ p_i \}_{i=1}^{N_t}$, where each point $p_i = (x_i, y_i, z_i)$ represents a 3D coordinate in Cartesian space. The ground point extraction function $f_{\mathrm{ground}}$ separates ground and non-ground points in $P_t$ based on non-uniform polar-coordinate partitioning and region-wise plane fitting. This operation is implemented using the Patchwork algorithm [27].
The resulting ground point cloud $G_t$ is sparse owing to the low number of LiDAR channels. The function $f_{\mathrm{interp}}$, based on the spherical range image interpolation method described in [28], is applied to address this sparsity and represent the ground surface more accurately. Each ground point $p = (x, y, z)$ obtained from $f_{\mathrm{ground}}$ is first converted into spherical coordinates as defined in Equation (2).
$$r = \sqrt{x^2 + y^2 + z^2}, \qquad \theta = \operatorname{atan2}(y, x), \qquad \phi = \arcsin\!\left(\frac{z}{r}\right) \quad (2)$$
Here, $r$ is the range, $\theta$ is the azimuth (horizontal angle), and $\phi$ is the elevation (vertical angle). The azimuth–elevation domain is then discretized with angular resolutions $\Delta\theta$ and $\Delta\phi$, and each point is mapped to a spherical range image (SRI) pixel defined by Equation (3).
$$u = \left\lfloor \frac{\theta}{\Delta\theta} \right\rfloor, \qquad v = \left\lfloor \frac{\phi}{\Delta\phi'} \right\rfloor \quad (3)$$
Here, $\Delta\theta$ and $\Delta\phi$ denote the angular step sizes in the horizontal (azimuth) and vertical (elevation) directions, respectively, and $\lfloor \cdot \rfloor$ denotes the floor operator. The vertical step is refined by a factor $k$, i.e., $\Delta\phi' = \Delta\phi / k$, which increases the number of elevation bins by a factor of $k$ and thus densifies the vertical sampling.
Missing values between LiDAR beams are estimated by bilinear interpolation as defined in Equation (4).
$$\hat{r}(u, v) = \sum_{i} \sum_{j} w_{ij}\, r(u_i, v_j) \quad (4)$$
Here, $u_i$ and $v_j$ denote the indices of the neighboring grid points around $(u, v)$ in the SRI, and $w_{ij}$ are the interpolation weights. In the standard bilinear setting, only the four closest neighbors contribute non-zero weights, and their contributions are summed to obtain $\hat{r}(u, v)$.
Finally, the interpolated spherical coordinates are back-projected into Cartesian space as defined in Equation (5).
$$\hat{x} = \hat{r}\cos\hat{\phi}\cos\hat{\theta}, \qquad \hat{y} = \hat{r}\cos\hat{\phi}\sin\hat{\theta}, \qquad \hat{z} = \hat{r}\sin\hat{\phi} \quad (5)$$
Here, $\hat{\theta} = u\,\Delta\theta$ and $\hat{\phi} = v\,\Delta\phi'$.
The interpolated ground points are denoted as $\hat{p} = (\hat{x}, \hat{y}, \hat{z})$, and the final ground point cloud following interpolation is defined by Equation (6).
$$\hat{G}_t = G_t \cup \{ \hat{p} \} \quad (6)$$
The complete process of LiDAR-based ground point cloud extraction is summarized in Algorithm 2, and the corresponding pipeline is illustrated in Figure 3.
Algorithm 2: LiDAR-based Ground Point Cloud Extraction Process
Input: $P_t$: LiDAR point cloud at time $t$
Output: $\hat{G}_t$: interpolated ground point cloud at time $t$
1: $G_t \leftarrow f_{\mathrm{ground}}(P_t)$;
2: $\hat{G}_t \leftarrow f_{\mathrm{interp}}(G_t)$;
3: return $\hat{G}_t$
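The NumPy sketch below illustrates Equations (2)–(5): ground points are binned into a spherical range image, empty cells are filled by interpolation, and the densified image is back-projected to 3D. The bin resolutions and the per-column one-dimensional interpolation are simplifying assumptions made here for brevity; the method of [28] uses full bilinear interpolation over the SRI.

```python
# Sketch of Equations (2)-(5): spherical range image (SRI) densification of a
# sparse ground point cloud. Resolutions and the per-column 1-D interpolation
# are simplifying assumptions; the referenced method uses bilinear interpolation.
import numpy as np

def densify_ground(points: np.ndarray, d_theta=np.deg2rad(0.2),
                   d_phi=np.deg2rad(2.0), k=4) -> np.ndarray:
    """points: (N, 3) ground points; returns a densified (M, 3) point cloud."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                       # Eq. (2)
    theta = np.arctan2(y, x)
    phi = np.arcsin(z / np.maximum(r, 1e-9))

    d_phi_ref = d_phi / k                                    # refined vertical step
    u = np.floor((theta + np.pi) / d_theta).astype(int)      # Eq. (3)
    v = np.floor((phi - phi.min()) / d_phi_ref).astype(int)

    W = int(np.ceil(2 * np.pi / d_theta))
    H = int(v.max()) + 1
    sri = np.full((H, W), np.nan)
    sri[v, u % W] = r                                        # fill known ranges

    # Eq. (4), simplified: interpolate missing ranges along each azimuth column.
    rows = np.arange(H)
    for col in range(W):
        known = ~np.isnan(sri[:, col])
        if known.sum() >= 2:
            sri[:, col] = np.interp(rows, rows[known], sri[known, col])

    # Eq. (5): back-project every valid SRI cell to Cartesian coordinates.
    vv, uu = np.nonzero(~np.isnan(sri))
    rr = sri[vv, uu]
    th = uu * d_theta - np.pi
    ph = vv * d_phi_ref + phi.min()
    return np.stack([rr * np.cos(ph) * np.cos(th),
                     rr * np.cos(ph) * np.sin(th),
                     rr * np.sin(ph)], axis=1)
```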
2.4. Camera–LiDAR Fusion
The drivable area mask
from Equation (1) and interpolated ground point cloud
from Equation (6) are used as inputs to generate a colored drivable area point cloud, retaining only the points located at the intersection of the mask and ground point cloud. Each ground point
is projected onto the image plane using the
function. This function first applies the extrinsic transformation
, which represents the relative rotation and translation between the LiDAR and the camera, to convert the 3D point into the camera coordinate system. The transformed point is then multiplied by the intrinsic matrix
, which encodes the camera’s focal lengths and principal point, to map it onto the image plane in homogeneous coordinates. Finally, normalization by the depth
yields the 2D pixel coordinates
, as defined in Equation (7).
If the projected pixel coordinates
lie inside mask
, the corresponding point is classified as drivable. To avoid false positives near ambiguous mask boundaries, all points projected onto the mask edges were discarded. The RGB color
at the same pixel location in image
is then assigned to that point. Through this process, a colored drivable area point cloud
is generated, as defined in Equation (8).
The generation of $D_t$ is summarized in Algorithm 3 and illustrated in Figure 4.
Algorithm 3: Camera–LiDAR Fusion Process
Input: $I_t$: RGB image at time $t$; $M_t$: drivable area mask at time $t$; $\hat{G}_t$: interpolated ground point cloud at time $t$
Output: $D_t$: colored drivable area point cloud at time $t$
1: $D_t \leftarrow \emptyset$;
2: for each $\hat{p} \in \hat{G}_t$ do
3:   $(u, v) \leftarrow \pi(\hat{p})$;
4:   if $M_t(u, v) = 1$ then
5:     $c \leftarrow I_t(u, v)$;
6:     $D_t \leftarrow D_t \cup \{(\hat{p}, c)\}$;
7:   end if
8: end for
9: return $D_t$
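A NumPy sketch of Equations (7)–(8) and Algorithm 3 is given below. The intrinsic and extrinsic parameters K, R, and t are placeholders supplied by the caller, and the fixed erosion kernel used to drop mask-edge pixels is our own simplification of the boundary-rejection step; the sketch is illustrative rather than the system's actual implementation.

```python
# Sketch of Equations (7)-(8) / Algorithm 3: project interpolated ground points
# into the image and keep only those landing on the (edge-eroded) drivable mask.
# K, R, t are placeholder calibration values, not the paper's calibration.
import numpy as np
import cv2

def fuse(points, image, mask, K, R, t, edge_margin=3):
    """points: (N,3) LiDAR ground points; image: (H,W,3); mask: (H,W) in {0,1}.
    Returns an (M,6) array of [x, y, z, r, g, b] drivable points (Equation (8))."""
    h, w = mask.shape
    # Discard mask edges to avoid false positives near ambiguous boundaries.
    eroded = cv2.erode(mask.astype(np.uint8),
                       np.ones((edge_margin, edge_margin), np.uint8))

    cam = points @ R.T + t                  # extrinsic transform (LiDAR -> camera)
    valid = cam[:, 2] > 0                   # keep points in front of the camera
    cam, pts3d = cam[valid], points[valid]

    pix = cam @ K.T                         # intrinsic projection, Eq. (7)
    uv = (pix[:, :2] / pix[:, 2:3]).round().astype(int)

    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv, pts3d = uv[inside], pts3d[inside]

    drivable = eroded[uv[:, 1], uv[:, 0]] == 1         # mask intersection
    colors = image[uv[drivable, 1], uv[drivable, 0]]   # per-point RGB, Eq. (8)
    return np.hstack([pts3d[drivable], colors.astype(np.float32)])
```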
2.5. Global Drivable Area Map Generation
To construct the global drivable area map $\mathcal{M}$, the colored drivable area point clouds $D_t$ obtained from Algorithm 3 are transformed into a global coordinate frame and incrementally accumulated over time. LiDAR-based SLAM is employed to estimate the pose of the robot for each frame. Only the results of keyframes selected according to the internal criteria of the SLAM system are incorporated into the global map.
The function $f_{\mathrm{slam}}$ uses the LIO-SAM [29] algorithm to estimate the robot's global pose $T_t = (R_t, t_t)$ from $P_t$ at time $t$. Although SLAM continuously updates pose and map information, this process is simplified as a single function call in the pseudocode.
A colored drivable area point $d = (\hat{p}, c) \in D_t$, defined in Equation (8), consists of a coordinate component and a color component. These components can be accessed using the functions $\mathrm{coord}(\cdot)$ and $\mathrm{color}(\cdot)$, which return the coordinate and RGB color of a point, respectively.
The $\mathrm{isKeyframe}$ function decides whether the current frame should be selected as a keyframe based on two criteria: the sensor has moved more than 1 m in translation compared to the last keyframe, or the orientation change relative to the previous frame exceeds 0.2 rad. If either of these conditions is satisfied, the current frame is designated as a keyframe, and the corresponding $D_t$ is transformed into the global coordinate frame by applying the function $f_{\mathrm{global}}$ to the coordinate component $\hat{p}$ of each point $d \in D_t$, as defined in Equation (9).
$$p^{g} = R_t\, \hat{p} + t_t \quad (9)$$
Here, $R_t$ is the rotation matrix and $t_t$ is the translation vector of the estimated pose $T_t$. The transformed global point is denoted as $p^{g}$, and the color information of each point is preserved during the transformation.
Finally, the transformed colored drivable area point clouds from all keyframes are accumulated to generate the global colored drivable area map $\mathcal{M}$. The complete procedure is summarized in Algorithm 4 and illustrated in Figure 5.
Algorithm 4: Global Drivable Area Map Generation Process
Input: $\{P_t\}$: sequence of LiDAR point clouds; $\{D_t\}$: sequence of colored drivable area point clouds, as defined in Equation (8)
Output: $\mathcal{M}$: global colored drivable area map
1: $\mathcal{M} \leftarrow \emptyset$;
2: for each time step $t$ do
3:   $T_t \leftarrow f_{\mathrm{slam}}(P_t)$;
4:   if $\mathrm{isKeyframe}(T_t)$ then
5:     $D_t^{g} \leftarrow \emptyset$;
6:     for each $d \in D_t$ do
7:       $p^{g} \leftarrow f_{\mathrm{global}}(\mathrm{coord}(d), T_t)$;
8:       $D_t^{g} \leftarrow D_t^{g} \cup \{(p^{g}, \mathrm{color}(d))\}$;
9:     end for
10:    $\mathcal{M} \leftarrow \mathcal{M} \cup D_t^{g}$;
11:   end if
12: end for
13: return $\mathcal{M}$
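The following Python sketch mirrors Algorithm 4 under two assumptions: the per-frame pose (R, t) is supplied externally (e.g., by a LIO-SAM node over ROS), and SciPy's Rotation helper is used only to measure the relative rotation angle for the keyframe test. The 1 m and 0.2 rad thresholds follow the text; everything else is illustrative.

```python
# Sketch of Algorithm 4: keyframe gating (1 m / 0.2 rad thresholds from the text)
# and accumulation of colored drivable points into a global map (Equation (9)).
# Poses are assumed to be provided externally (e.g., by LIO-SAM).
import numpy as np
from scipy.spatial.transform import Rotation

class GlobalDrivableMap:
    def __init__(self, trans_thresh=1.0, rot_thresh=0.2):
        self.trans_thresh = trans_thresh      # metres
        self.rot_thresh = rot_thresh          # radians
        self.last_key = None                  # (R, t) of the last keyframe
        self.points = []                      # accumulated [x, y, z, r, g, b]

    def _is_keyframe(self, R, t):
        if self.last_key is None:
            return True
        R_prev, t_prev = self.last_key
        moved = np.linalg.norm(t - t_prev) > self.trans_thresh
        angle = np.linalg.norm(Rotation.from_matrix(R_prev.T @ R).as_rotvec())
        return moved or angle > self.rot_thresh

    def add_frame(self, R, t, colored_points):
        """R: (3,3), t: (3,), colored_points: (N,6) [x,y,z,r,g,b] in LiDAR frame."""
        if not self._is_keyframe(R, t):
            return
        self.last_key = (R, t)
        xyz_global = colored_points[:, :3] @ R.T + t      # Eq. (9)
        self.points.append(np.hstack([xyz_global, colored_points[:, 3:]]))

    def map(self):
        return np.vstack(self.points) if self.points else np.empty((0, 6))
```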
4. Discussion
Herein, we proposed a drivable area recognition system that fuses an RGB camera and a LiDAR sensor. The proposed system is designed to be robust to domain differences between sensors and is applicable to diverse environments without requiring retraining. The system adopts a sensor-independent architecture capable of processing inputs regardless of sensor type or resolution and is implemented within the ROS framework, enabling easy integration into real-world robotic platforms. Furthermore, the system is modular in design, allowing easy replacement of individual components such as the RGB segmentation model or the LiDAR preprocessing algorithm. This modularity offers flexibility and scalability, enabling future improvements in overall system performance through simple model upgrades.
In quantitative evaluations, existing LiDAR-based semantic segmentation models achieved high accuracy, with IoU scores ranging from approximately 94% to 95%, whereas the proposed system achieved a slightly lower IoU of 89%. In terms of inference speed, which reflects real-time capability, the proposed system reached 15.82 FPS, significantly outperforming existing models that ranged between 4 and 10 FPS. This balance highlights the suitability of the method for real-time deployment in mobile robotics and autonomous driving. The lower IoU can be partly explained by the evaluation setting, in which the LiDAR-only baselines were trained on SemanticKITTI [10] sequences 00–07 and 09–10 and evaluated on sequence 08; the similarity between the training and test environments may have led to dataset-specific overfitting and inflated performance. Our pipeline, by contrast, does not depend on such dataset-specific training but fuses RGB segmentation and LiDAR ground extraction in a modular manner, enabling robustness to changes in sensors and operating conditions.
This modularity not only enhances adaptability but also facilitates performance improvements through simple component upgrades. As demonstrated in Table 2, simply replacing the RGB segmentation module with a more advanced model, such as YOLOv11-seg-l, improved both IoU and speed compared with earlier models such as SAM 2.1 Tiny, FastSAM-x, and MobileSAM. This demonstrates that the overall framework can be strengthened without structural changes by updating individual modules. For the RGB module, although our system already employs a foundation-model-based segmentation approach, it can readily incorporate future foundation models that offer higher accuracy or stronger real-time performance, as well as more advanced transformer-based segmentation models as they are developed. For the LiDAR module, the same modular principle allows the substitution of more advanced and efficient ground segmentation algorithms, together with interpolation strategies that run in real time while better preserving fine-grained structural features of the scene. From this perspective, the proposed system is not limited by its current configuration but rather provides a scalable platform that can continuously benefit from progress in RGB segmentation, LiDAR ground segmentation, and interpolation techniques.
We also examined the influence of LiDAR channel reduction to assess the framework’s generalizability. When the resolution was reduced from 64 to 8 channels, IoU decreased by 17.53% and precision decreased by 2.15%, even though no retraining was applied. These results suggest that the proposed system can still be applied across different LiDAR configurations without the need for retraining, while acknowledging that performance degradation is inevitable under extremely sparse settings.
The system demonstrated strong performance in recognizing drivable areas while effectively filtering out both static and dynamic obstacles. As shown in Figure 13, the system successfully excluded various static obstacles, such as concrete blocks, raised curbs, and trees, from the recognized drivable areas. These results indicate that the system can maintain accurate drivable area segmentation even in environments with complex and irregular fixed structures.
Figure 14 illustrates the ability of the system to handle dynamic obstacles. In environments with moving objects such as pedestrians, vehicles, and motorcycles, the system reliably excludes these regions from the drivable area. These results confirm that the proposed method not only operates efficiently in static scenarios but also maintains safe drivable area recognition under dynamic conditions, which is essential for real-world deployment.
However, despite these promising results, certain limitations were observed in complex geometric configurations. Figure 15a shows a road environment in which the drivable area is not recognized over a speed bump. Although this region should be considered drivable, the surface discontinuity leads to an overly conservative result. Figure 15b presents a curved sidewalk environment in which the predicted drivable area incorrectly extends into the adjacent road. This misclassification likely arises from the absence of a distinct boundary or from the strong curvature of the path. These examples highlight the system's sensitivity to local geometric variations, such as bumps and curb edges, which can affect segmentation accuracy and boundary alignment.
As illustrated in Figure 9a and Figure 12a, the drivable areas recognized by the proposed system were accumulated into globally consistent maps, showing smooth connectivity and spatial coherence throughout the entire trajectory in both road and sidewalk environments. Figure 9c and Figure 12c further validate this spatial consistency by overlaying the generated maps onto high-resolution satellite imagery, clearly demonstrating their precise geometric alignment with actual road and sidewalk structures. These qualitative results confirm that the proposed fusion approach effectively generates accurate and coherent global drivable area maps.
While dynamic objects are excluded from the drivable mask on a per-frame basis, the global map accumulation currently does not incorporate occupancy lifetimes or temporal consistency checks. As a result, spurious "ghost" free space can still appear when dynamic obstacles appear and disappear across consecutive frames. In practice, such empty regions are often filled by subsequent frames as the map continues to accumulate, but the lack of explicit temporal filtering remains a limitation. This highlights the need for future extensions of the system that integrate temporal consistency mechanisms to further improve robustness in highly dynamic environments.
As illustrated in Figure 16, the comparison emphasizes the importance of interpolation in providing a clearer representation of ground features. While the overall recognition pipeline does not depend on this step, interpolation contributes to generating more continuous and visually coherent global drivable area maps, thereby improving the interpretability of the results.
Although the proposed system exhibits promising results in various structured environments, it has several limitations. The performance of the system is influenced by the accuracy of sensor calibration, particularly when aligning LiDAR point clouds with RGB images. Inaccurate extrinsic parameters can lead to misaligned projections, which may affect recognition quality in certain scenarios. To further quantify this effect, we conducted an additional experiment in which the extrinsic parameters were intentionally perturbed. Specifically, yaw rotations of 5°, 10°, and 15° were introduced to simulate miscalibration. The results showed that IoU decreased by 25.3%, 46.7%, and 62.3% at 5°, 10°, and 15° miscalibrations, respectively, while precision decreased by 11.5%, 23.3%, and 34.3%. These findings demonstrate that even small angular miscalibrations can substantially degrade recognition performance, underscoring the importance of accurate initial calibration. Nevertheless, once an accurate initial calibration is established, our system performs projection on a frame-by-frame basis, so cumulative errors do not occur, ensuring stable operation in practice.
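For readers who wish to reproduce this kind of stress test, the short sketch below shows one way to inject a yaw miscalibration into the LiDAR-to-camera extrinsics before projection. The rotation composition, the reuse of the hypothetical fuse() helper from the earlier fusion sketch, and the evaluation loop are our own assumptions, not the authors' evaluation script.

```python
# Sketch of a yaw-miscalibration stress test: add a yaw offset to the extrinsic
# rotation before projecting points. Uses the hypothetical fuse() from the
# camera-LiDAR fusion sketch above; this is not the authors' evaluation code.
import numpy as np

def yaw_perturbed_extrinsics(R, t, yaw_deg):
    """Return extrinsics with an extra yaw rotation about the LiDAR z-axis."""
    a = np.deg2rad(yaw_deg)
    R_yaw = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0,        0.0,       1.0]])
    return R @ R_yaw, t            # perturb rotation only; translation unchanged

# Pseudo-usage: re-run fusion at each miscalibration level and score the result.
# for yaw in (5, 10, 15):
#     Rp, tp = yaw_perturbed_extrinsics(R, t, yaw)
#     D = fuse(ground_points, image, mask, K, Rp, tp)
#     ... compare D against ground truth to obtain IoU / precision ...
```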
Additionally, the current evaluation was conducted in structured environments such as roads and sidewalks, and the robustness of the system in unstructured or complex terrains such as construction sites or uneven natural surfaces has not yet been validated. The system also depends on RGB-based segmentation in its front-end processing, meaning its overall performance can be affected by the quality of the segmentation model, especially under challenging visual conditions such as shadows, reflections, or poor lighting. These aspects present important directions for future improvements and generalization. In particular, conducting systematic evaluations under adverse conditions such as nighttime scenes, glare, rain, or shadowed environments, and constructing a failure taxonomy of segmentation outcomes, would represent valuable extensions of this work. However, such efforts require datasets that provide synchronized RGB and LiDAR inputs along with pixel-level ground-truth annotations under various stress conditions. These types of resources are currently scarce. As future work, we intend to develop custom data collection strategies and evaluation protocols to address this limitation.