Review

Enhancing Autonomous Truck Navigation in Underground Mines: A Review of 3D Object Detection Systems, Challenges, and Future Trends

by Ellen Essien * and Samuel Frimpong
Department of Mining and Explosives Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA
* Author to whom correspondence should be addressed.
Drones 2025, 9(6), 433; https://doi.org/10.3390/drones9060433
Submission received: 7 May 2025 / Revised: 8 June 2025 / Accepted: 10 June 2025 / Published: 14 June 2025

Abstract

Integrating autonomous haulage systems into underground mining has revolutionized safety and operational efficiency. However, deploying 3D detection systems for autonomous truck navigation in such an environment faces persistent challenges due to dust, occlusion, complex terrains, and low visibility. These conditions affect detection reliability and real-time processing. While existing reviews have discussed object detection techniques and sensor-based systems, providing valuable insights into their applications, only a few have addressed the unique underground challenges that affect 3D detection models. This review synthesizes the current advancements in 3D object detection models for underground autonomous truck navigation. It assesses deep learning algorithms, fusion techniques, multi-modal sensor suites, and the limited datasets available for underground detection systems. This study uses systematic database searches with selection criteria for relevance to underground perception. The findings of this work show that the mid-level fusion method for combining different sensor suites enhances robust detection. Though YOLO (You Only Look Once)-based detection models provide superior real-time performance, challenges persist in small object detection, computational trade-offs, and data scarcity. This paper concludes by identifying research gaps and proposing future directions for a more scalable and resilient underground perception system. The main novelty of this work lies in its dedicated review of 3D detection systems for autonomous trucks in underground mines.

1. Introduction

The evolution of autonomous driving systems in the mining sector has brought significant interest in enhancing computer vision for accurate and real-time 3D object detection. Unlike urban [1,2,3] or surface mining [4,5,6] driving settings, underground mining environments present unique constraints such as limited visibility, dust, confined spaces, and uneven terrain that pose significant challenges for 3D object detection. These object detection systems are the perceptual backbone of autonomous truck haulage for obstacle recognition, object classification, and navigation in real time.
Recent advancements in computer vision/image processing powered by deep learning (DL) and artificial intelligence (AI) have significantly improved situational awareness in autonomous driving systems [7]. AI/ML-based object detection and tracking models have seen applications in robotics [8,9,10], urban autonomous driving [4,11,12], collision avoidance systems [13,14], and security systems for monitoring and surveillance [15]. AI/ML techniques, particularly those involving deep learning, play a significant role in these systems [7,16]. They have addressed the challenges of machine–human interactions, particularly in preventing collisions, injuries, and fatalities. In the underground environment, research has increasingly implemented AI/ML architectures with LiDAR (Light Detection and Ranging), thermal infrared (IR), and RGB (Red, Green, Blue) cameras to detect pedestrians, machinery, and hazards [5,6,17,18,19,20]. Three-dimensional object detection models, which give richer spatial information than 2D models, have become a critical area of research for safe autonomous truck navigation [1,21,22,23]. These DL algorithms include Convolutional Neural Networks (CNNs) and YOLO (You Only Look Once), which have become efficient at identifying and classifying objects in real time [6,24,25].
These current models have demonstrated effectiveness in improving detection capabilities under diverse and harsh conditions [15,26,27]. However, they are limited by inaccuracies in detecting small and occluded objects amid noise and obstructions. They also lack robustness when applied to specific environments such as underground mines, necessitating more innovative solutions [15,28].
As the mining industry transitions to Industry 4.0 and 5.0, which prioritize human–machine interactions, it is crucial to continually develop and innovate new systems, methods, and solutions for object detection and anti-collision systems in underground mining environments. Deploying autonomous vehicles into the underground mining industry is a paradigm shift toward operational efficiency and safety enhancement. These autonomous haulage trucks are designed to navigate dynamic and intricate mining environments, enhancing rapid decision-making accuracy and situational awareness. To ensure this, they heavily depend on sophisticated 3D object detection systems. However, the capabilities of these models are currently limited, which hinders the exploitation of their full potential in autonomous mining operations. Current detection systems frequently encounter limitations like trade-offs between speed and accuracy, especially under adverse conditions such as variable terrains, fluctuating lighting, and diverse objects [6,12,29]. Individual sensor modalities such as LiDAR, IR, and RGB cameras often yield suboptimal performance, particularly with dynamic obstacles and rapidly changing settings. Also, the real-time capabilities of these detection systems remain insufficient, resulting in latency issues that could compromise operational efficiency and safety. By analyzing how the conditions mentioned above shape model and sensor choices, this survey provides a uniquely focused perspective on underground-specific object detection design requirements, an area often under-represented in the existing literature.
Many review articles such as [2,16,18,30,31,32,33,34,35,36] have explored the advancements of object detection models in autonomous vehicles and mining environments. Imam et al. [18] reviewed anti-collision systems based on computer vision in underground mines. The study examined machine learning algorithms, including CNNs, Fast R-CNN, and the YOLO series, for real-time object detection, as well as the sensors employed in autonomous trucks to reduce accidents. Tang et al. [37] studied multi-sensor fusion detection methods for 3D object detection in urban autonomous driving. They developed a taxonomy that categorized fusion approaches and assessed their efficacy in improving detection accuracy and safe driving in autonomous vehicle driving scenarios. Cui et al. [31] reviewed navigation and positioning technologies in underground coal mines. The study investigated multiple techniques, like visual image feature-based systems, inertial navigation, visible light communication (VLC), and sensor fusion methods, to improve the accuracy and robustness of the detection systems.
Patrucco et al. [32] underscored the importance of multi-sensor systems by integrating different sensors to address the limitations of individual sensors. The study also investigated anti-collision technologies like cameras, LiDAR, and radar, discussing their principles, advantages, limitations, and costs. Wang et al. [33] also reviewed detection system advancements in unmanned driving technology for coal mine transportation systems. They discussed multi-sensor fusion strategies to address the limitations of single sensors. Nevertheless, these reviews give limited attention to the complex and dynamic conditions of underground settings and to the scarcity of standardized datasets for benchmarking detection models.
To contextualize the focus and novelty of this study, a comparative analysis was conducted against existing review papers and the literature. Table 1 presents a summary of key survey works related to object detection, underground navigation, and collision prevention across mining environments. Each study is evaluated based on its focus, identified gaps, and how this current review builds upon or diverges from those contributions. This detailed comparison highlights the distinct emphasis of this study on DL-based 3D object detection, multi-sensor fusion, and dataset challenges tailored for underground autonomous truck deployment.
The underground mining environment requires robust sensor fusion techniques, low-latency processing, and models resilient to noise and occlusion, which conventional surveys overlook. There is also no standardized dataset capturing underground conditions for practical model evaluation and performance benchmarking. This review paper addresses these gaps by synthesizing the most recent developments in 3D object detection, sensor fusion strategies, and dataset challenges within underground autonomous haulage systems. It highlights the strengths, limitations, and suitability of various 3D detection systems for autonomous truck navigation in an underground environment. The objectives of this review are as follows:
  • To categorize and evaluate the various sensor modalities employed in underground autonomous haulage 3D detection systems.
  • To explore multi-sensor fusion approaches, their performance, and trade-offs.
  • To analyze and synthesize the deep learning architectures used in object detection models, particularly YOLO variants and CNNs.
  • To identify key underground dataset limitations for object detection models.
  • To identify significant challenges in underground autonomous truck object detection deployment and propose future directions for developing scalable and reliable detection systems.
This paper presents a comprehensive review to evaluate the current literature on 3D object detection systems for underground autonomous trucks. The process commenced with clearly defining the research questions, followed by a comprehensive search of pertinent studies in reputable academic databases, such as SpringerLink, Multidisciplinary Digital Publishing Institute (MDPI), ResearchGate, Institute of Electrical and Electronics Engineers (IEEE) Xplore, and Google Scholar. The search was conducted using keywords such as “3D object detection”, “autonomous trucks”, and “mining”, in conjunction with relevant terms such as “pedestrian detection”, “underground”, and “object detection”. Defined inclusion and exclusion criteria guided the selection process. The studies considered for inclusion were restricted to peer-reviewed articles and conference papers published in reputable academic sources. Only recent research, published within the last 5–10 years and specifically addressing 3D object detection for autonomous vehicles in underground mining environments, was included. Data extracted from the selected studies were categorized and analyzed in key areas, including sensor modalities, 3D detection systems, multi-sensor fusion, fusion strategies, underground datasets, and specific mining challenges. This categorization enabled a comparative analysis of the various approaches and their performance in real-world scenarios. The review followed established frameworks to guarantee consistency, transparency, data synthesis, and reporting comprehensiveness.
The review aims to enhance the reliability and robustness of autonomous systems in the underground mining industry by comprehensively analyzing 3D object detection methodologies in autonomous vehicles. The study employs a pragmatic approach, acknowledging that underground mining environments present distinct requirements that substantially differ from those in more controlled industrial environments. It is, therefore, essential to comprehensively evaluate and compare current 3D object detection techniques in such environments to ascertain their strengths, limitations, and application-specific challenges. This investigation helps identify limitations in the current literature and technology, informing the development of more efficient detection systems that deliver real-time, precise, and resilient performance in underground mining environments. The study will emphasize recent advances and establish a basis for recommending future directions, particularly for optimizing algorithms for rugged, resource-limited, and unpredictable mining environments. The study contributes to the advancement of autonomous haulage truck technology consistent with the mining industry’s zero-fatality safety protocols and operational requirements, thereby fostering a more efficient and secure future for the sector.
The structure of the paper is organized as follows: Section 2 provides a comprehensive review of the key sensor-based 3D detection systems for autonomous truck navigation, multi-sensor detection systems, and their strengths and limitations. Section 3 delves into multi-sensor fusion strategies, grouping them into early, mid, and decision-level fusion, and assesses their impact on detection accuracy. Section 4 discusses current deep learning architectures, their deployments, and state-of-the-art underground 3D detection model comparisons. Section 5 synthesizes underground dataset challenges related to autonomous object detection. Section 6 identifies the key challenges detection systems face in underground settings and proposes future directions. Finally, Section 7 synthesizes the findings of this review. Figure 1 shows the structure of the entire review report, starting with the background and motivation, followed by sections on 3D sensor techniques, object detection models, sensor fusion strategies, case studies, underground dataset challenges, and future research directions. It provides a clear roadmap for understanding the paper’s flow and focus.

2. Overview of Sensor Modalities for 3D Object Detection in Underground Autonomous Haulage Trucks

Three-dimensional (3D) object detection is pivotal to autonomous haulage truck operations in underground mining environments where safety, situational awareness, and navigation are critical. These detection systems rely on different sensor modalities to provide spatial awareness, track moving and static objects such as workers, equipment, and structural features, and detect hazards. Unlike controlled urban environments, underground settings have variable lighting, dust, occlusions, and uneven terrains, which necessitate the integration of complementary and robust sensors for detection.
This section provides a detailed analysis of key sensor-based detection perception systems used in underground autonomous trucks, which include IR, RGB cameras, and LiDAR systems. Each system is evaluated for its operational principles, integration with deep learning algorithms, underground applicability, and performance trade-offs.

2.1. Infrared (IR) Thermal Systems

Infrared (IR) thermal sensors (Figure 2) detect heat signals from objects and convert them into thermal images or temperature maps. IR’s ability to perceive heat from objects or obstacles makes it significantly valuable for underground mining environments, where smoke, low light, and dust often compromise traditional optical systems. Unlike RGB cameras that depend on ambient lighting, thermal sensors can function in complete darkness and thermally unstable conditions.
These sensor modalities are effective in the following:
  • Pedestrian or worker detection by identifying humans’ heat signatures.
  • Collision avoidance in situations where RGB cameras and LiDAR sensors struggle.
IR imagery can be segmented and classified in real time when integrated with YOLO-based or CNN architectures. Many studies have demonstrated the successful application of different YOLO and two-stage CNN algorithms on thermal imagery for object detection in underground mining environments [18,41,42,43,44,45]. Figure 3 and Figure 4 illustrate the application of IR with DL architectures in underground environments for pedestrian detection and navigation. Keza et al. [44] presented a pedestrian detection system that enhances safety in underground mines by integrating thermal imaging with 3D sensors. Using an FLIR thermal camera and depth sensors (TOF and Kinect), the system classifies regions using four methods and segments thermal images based on temperature thresholds. However, the model demonstrated susceptibility to motion distortion and mist and lacked enough underground datasets for model development. In a related effort by Szrek et al. [42], their AMICOS project utilized infrared thermography on an unmanned ground vehicle (UGV) to support search and rescue operations in underground mines. This system demonstrated the effectiveness of thermal imaging for detecting humans and navigating low-visibility environments. Its real-time temperature-based classification shows strong potential for adaptation in autonomous safety and perception modules, especially where visual occlusion and low lighting are major challenges.
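To make the temperature-threshold idea concrete, the minimal sketch below flags image regions whose apparent temperature falls in a human-like range. The threshold values, frame size, and synthetic data are illustrative assumptions and do not reproduce the calibrated pipelines of the cited systems.

```python
import numpy as np

def segment_warm_regions(thermal_frame: np.ndarray,
                         min_temp_c: float = 30.0,
                         max_temp_c: float = 40.0) -> np.ndarray:
    """Return a binary mask of pixels in a human-like temperature band.
    Thresholds are illustrative, not calibrated values."""
    mask = (thermal_frame >= min_temp_c) & (thermal_frame <= max_temp_c)
    return mask.astype(np.uint8)

# Synthetic 240x320 thermal frame in degrees Celsius.
frame = np.full((240, 320), 18.0)        # cool tunnel background
frame[100:160, 140:170] = 35.0           # warm blob approximating a person
mask = segment_warm_regions(frame)
print("warm pixels flagged:", int(mask.sum()))
```

In practice, such a mask would be passed to a classifier or a YOLO/CNN detector rather than used on its own, since hot machinery can produce similar signatures.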
Key Features and Advantages of IR Systems
  • Effective in Low- or No-Light Conditions: IR systems rely on detecting heat signatures rather than visible light, making them highly effective in poorly lit or completely dark environments in underground mines.
  • Thermal Object Detection: These sensors distinguish objects based on their heat signatures to identify equipment, vehicles, and workers, even in smoke, fog, or dust.
  • Long-Range Detection Capabilities: Certain IR sensors, like long-wave infrared (LWIR), can detect objects over significant distances, enabling the early identification of hazards or obstacles.
  • Insensitive to Ambient Light Variations: Unlike RGB cameras, IR systems are unaffected by ambient light changes, ensuring consistent performance in dynamic lighting conditions.
  • Compact and Durable Designs: IR sensors are lightweight, built to endure the harsh mining environment, and can withstand hot temperatures, vibrations, and dust.
Despite these strengths of IR sensor modalities, there are notable limitations:
  • Limited spatial resolution often affects fine-grained object classification;
  • High purchase cost: IR sensors such as long-wave IR (LWIR) units are often costly;
  • Interference from reflective equipment or hot surfaces with similar thermal profiles degrades their performance.
Despite these limitations, fusing IR sensors into multi-sensor systems is invaluable, as it enhances system robustness and reliability. When integrated with LiDAR (for depth information) and RGB cameras (for texture), IR sensors contribute uniquely to real-time perception and classification.

2.2. RGB-Based 3D Detection Systems

The RGB cameras in Figure 5 remain one of the foundational sensor types utilized in 3D object detection systems because of their ability to capture high-resolution visual data with texture and color details [43,45,47,48]. They are deployed in underground mining autonomous trucks for object classification, scene understanding, and equipment recognition. The RGB camera sensors illustrated in Figure 5 capture 2D images comprising red, green, and blue color channels that combine to form a full-color image and, when paired with depth sensing, a per-pixel depth map. In 3D object detection systems, RGB cameras are often used alone or with active stereo or time-of-flight sensing technology [18]. They are integrated with other sensors, such as LiDAR or IR sensors, in multi-sensor systems to enhance the perception model’s scene understanding and robustness. Current detection algorithms, such as YOLO and CNNs, have significantly improved the accuracy and speed of RGB camera models [19,25,48,49]. These algorithms use features such as pattern recognition and color differentiation for precise detections and classifications.
RGB sensor detection systems are vital in autonomous mining trucks because they leverage visual data from cameras to identify, classify, and track objects in the environment. The RGB camera has various configurations as follows:
  • The monocular cameras, which are lightweight and cost-effective, do not have depth perception.
  • Stereo cameras estimate depth by triangulating differences between two lenses.
  • Depth cameras incorporate active sensors for near-field 3D imaging.
These sensor types are highly compatible with DL models such as YOLO and CNNs. Zhang et al. [50] proposed an LDSI-YOLOv8 framework to address missed detections and low recognition rates for multiple targets in underground coal mines, leveraging an RGB camera. The work reported an accuracy of 91.4%, a 4.3% increase over the original YOLOv8 algorithm. Imam et al. [51] also developed an anti-collision system for underground mines, focusing on pedestrian detection via RGB cameras with a YOLOv5 DL-based algorithm to enhance pedestrian detection accuracy in low-visibility conditions.
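As a concrete illustration of a camera-only pipeline of this kind, the minimal sketch below runs a pretrained detector through the open-source Ultralytics YOLOv8 API on a single RGB frame. The weight file, image path, and confidence threshold are illustrative assumptions; the customized LDSI-YOLOv8 and YOLOv5 systems cited above are not reproduced here.

```python
# Minimal RGB inference sketch with the open-source Ultralytics YOLOv8 API.
# The weight file, image path, and confidence threshold are illustrative.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # small pretrained detector
results = model("tunnel_frame.jpg", conf=0.4)   # hypothetical RGB frame

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f}), "
          f"conf={float(box.conf):.2f}")
```

A mine-ready system would fine-tune such a model on site-specific imagery (helmets, haul trucks, roof bolts) rather than rely on generic pretrained classes.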
Philion and Fidler [52] also presented the Lift–Splat–Shoot model, which encodes multi-camera images into bird’s-eye-view (BEV) representations for autonomous vehicle driving applications. The key objective was to create an end-to-end architecture that directly transforms multi-camera data into a unified BEV frame for semantic perception, understanding, and motion planning. The research demonstrated that the model could segment vehicles, drivable areas, and lanes by combining frustum-based depth inference and pooling techniques. However, there was reduced performance in low-light conditions and reliance on single-frame data, which affects depth estimation accuracy compared to LiDAR-based models.
Versatility and affordability make camera-based models attractive for underground applications. They are less expensive than LiDAR and can be applied in areas such as identifying potential hazards and miner helmets for safety [17,53,54]. The key features and advantages of RGB cameras include the following:
  • High-Resolution Imaging. RGB cameras can capture rich and detailed visual information, providing the necessary resolution to identify and classify objects based on color and texture.
  • Cost-Effectiveness. They are relatively inexpensive compared to other sensors, such as LiDAR sensors, making them a popular choice for cost-effective object detection models. Their affordability enables widespread applicability in autonomous vehicles.
  • Color-Based Object Recognition. An additional layer of information is provided by the ability of RGB cameras to perceive colors, to distinguish between similar-shaped objects, and to identify warning signals.
  • Lightweight Design. RGB cameras are compact and easy to integrate into autonomous vehicle platforms, enabling more flexibility in sensor placement, system design, and integration.
However, there are significant challenges encountered as follows:
  • High reliance on lighting makes them unsuitable for low-light environments.
  • Affected by visual occlusions from debris and dust, which reduce detection accuracy.
  • Lack of depth perception unless integrated with a stereo or LiDAR sensor.
  • High computational load for processing high-resolution image frames in real time.
Multi-sensor fusion systems integrating RGB cameras with LiDAR or thermal sensor data address these limitations [24,27,28,54,55]. Fusion provides enhanced object classification and detection by combining color from RGB cameras with depth and range information from other sensors such as LiDAR. Autonomous systems achieve a more comprehensive and accurate perception of their environments by fusing these data from complementary sensor sources. Additionally, advancements in image technologies and algorithms have been introduced to address some of these challenges. These include high-dynamic range (HDR) imaging to improve the camera’s ability to handle extreme variations in brightness and enhance visibility in poor and variable lighting conditions [56,57,58]. ML algorithms are increasingly trained on augmented datasets to handle noisy and distorted visual data inputs better, improving model robustness in adverse conditions [25,28].

2.3. LiDAR (Light Detection and Ranging) System

The capability of LiDAR technology to generate precise depth measurements of the encompassing environment has made it a pivotal sensor type for detecting objects in autonomous vehicles. As shown in Figure 6 (solid-state LiDAR), these sensor types can accurately measure the distance to objects within a scanning range of up to about 50 m [32]. The technology of LiDAR sensors has made significant strides since their inception, making it a critical component in detecting 3D objects for autonomous vehicle navigation. LiDAR technology was initially developed for military and atmospheric applications [59]. It has since improved to accommodate the needs of commercial and industrial sectors, such as underground mining. LiDAR is the primary solution when incorporated with deep learning algorithms in mining, where the transition to autonomous operations requires real-time and precise detection. Solid-state designs, multi-beam configurations, and scanning mechanisms have significantly enhanced detection range, resolution, and robustness. This has addressed the limitations of early systems that depended on single-point lasers in challenging environments. LiDAR sensors emit laser pulses that travel through the atmosphere and bounce off objects; the sensor measures their return time, generating high-density point clouds representing 3D information about the object. This information includes object size, position, and orientation. This ability allows autonomous trucks to detect nearby equipment, obstacles, workers, and infrastructure and navigate in GPS-denied underground tunnels.
Modern LiDAR systems employ solid-state designs or rotating mirrors to perform rapid scans across 360 degrees of targeted areas. Data from these scans are processed in real time to create dense point clouds, which serve as the foundation for object detection, tracking, and classification algorithms.
Key advantages of LiDAR systems in underground mining environments include the following:
  • High-Precision Depth Measurement: LiDAR sensors’ time-of-flight ranging ensures the precise localization of objects. This makes it particularly useful in underground mining, where exact spatial awareness is critical for safety and overall operational efficiency.
  • Resilience to Environmental Interference: LiDAR is highly resistant to environmental interferences, such as vibration and glare. Advanced multi-echo and solid-state designs have built-in glare filters, enabling reliable performance in some underground mining environments.
  • Wide Field of View and Real-Time Mapping: Several LiDAR sensors offer a full 360° coverage view. This ensures comprehensive detection of objects, obstacles, and workers around mining vehicles, enhancing situational awareness.
  • Efficient and Reliable Performance: LiDAR sensors are unaffected by an object’s color or texture, allowing consistent detection regardless of the visual appearance of objects. Unlike RGB cameras, LiDAR systems are effective in low-light conditions as they solely rely on active laser emissions rather than ambient light.
  • Compatibility with SLAM systems: LiDAR systems support simultaneous localization and mapping (SLAM), enabling autonomous haulage trucks to track their positions while dynamically building maps of the environment.
LiDAR produces precise depth measurements in the harsh and complex underground mining environment, making it significant. Advanced processing algorithms applied to LiDAR data can classify, detect, and predict objects in the environment for autonomous truck navigation. Integrating these algorithms with LiDAR can distinguish between stationary walls, moving equipment, and other objects, which is crucial for dynamic decision-making in real-time detection operations.
LiDAR sensors are essential for recent 3D object detection systems in autonomous vehicles in underground mining environments. LiDAR point cloud data comprises millions of points (Figure 7) that map the objects’ surroundings in 3D (positions, size, and shape), providing a precise view beyond 2D images or videos, which is critical for accurately identifying objects in a scene. LiDAR-based systems’ resilience in harsh environments is a significant advantage for their application. Underground mines are characterized by conditions impairing the performance of other sensor-based systems, like cameras or radar sensors. LiDAR systems, however, are less affected by some of these conditions, such as moisture. Combined with DL algorithms, LiDAR data can improve object detection accuracy and provide a robust obstacle avoidance mechanism for safe autonomous truck navigation. Compared to RGB data, LiDAR 3D point clouds are critical in providing structural and spatial information with precise depth. However, the 3D point clouds are unordered, sparse, and sensitive to local variations, which makes raw LiDAR data processing challenging [60].
Generally, LiDAR object detection systems can be categorized into traditional and DL systems.

2.3.1. Traditional Methods

Traditional methods often depend on geometric and clustering techniques to detect objects. Algorithms such as DBSCAN [61,62,63] assume that objects have a higher point density than their surroundings. This method groups nearby points into clusters and then applies shape fitting to detect objects based on geometric assumptions. Though this approach is computationally efficient and practical in structured environments, it often struggles in noisy, complex, and unstructured environments. It usually fails to detect dynamic elements such as pedestrians or moving vehicles. These methods are rule-based and best suited for controlled, simpler environments.
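To make the clustering step concrete, the sketch below groups a synthetic point cloud with scikit-learn’s DBSCAN and fits an axis-aligned box to each cluster. The synthetic points, eps, and min_samples values are illustrative assumptions rather than parameters taken from the cited studies.

```python
# Traditional (non-learning) LiDAR detection sketch: density-based clustering
# followed by a simple axis-aligned bounding box per cluster.
import numpy as np
from sklearn.cluster import DBSCAN

# Two synthetic "objects" plus scattered background returns (x, y, z in metres).
rng = np.random.default_rng(0)
truck = rng.normal([10.0, 2.0, 1.0], 0.4, size=(300, 3))
person = rng.normal([6.0, -1.5, 0.9], 0.2, size=(80, 3))
noise = rng.uniform(-5, 25, size=(200, 3))
points = np.vstack([truck, person, noise])

labels = DBSCAN(eps=0.6, min_samples=10).fit_predict(points)
for cluster_id in sorted(set(labels) - {-1}):            # -1 marks noise points
    cluster = points[labels == cluster_id]
    lo, hi = cluster.min(axis=0), cluster.max(axis=0)     # axis-aligned box
    print(f"object {cluster_id}: extent {np.round(hi - lo, 2)}, {len(cluster)} points")
```

As the text notes, such rule-based pipelines detect blobs of points but cannot by themselves distinguish a pedestrian from a pillar, which is where learned models take over.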

2.3.2. Deep Learning-Based Methods

Recent technological advancements have enabled more accurate detection of complex real-world environments. These models automatically learn features from large, annotated datasets and are robust to occlusions and noisy inputs, providing accurate and more flexible detection in dynamic environments. LiDAR-based DL systems are generally classified into view-based, voxel-based, point-based, and hybrid point-voxel-based methods [21,64,65,66].
1. Voxel-Based Methods: Voxel-based models transform raw and irregular 3D point clouds into structured grids known as voxels for processing with 3D convolutional neural networks [65,66,67,68]. This generates simplified data, allowing the use of mature CNN architectures for object detection. Popular techniques include the following:
    VoxelNet [67]: This end-to-end framework partitions LiDAR point clouds into voxel grids and encodes object features using stacked voxel feature encoding (VFE) layers. This process is followed by 3D convolution to extract geometric and spatial features. The method streamlines the detection pipeline by eliminating separate feature extraction stages. Figure 8 shows the application of voxelization for autonomous truck navigation and detection, aiming to improve safety and operational efficiency
    SECOND (Sparsely Embedded Convolutions Detection) [66]: This is an optimized version of VoxelNet that employs sparse 3D convolutions, which improves computational efficiency without sacrificing accuracy [66].
These models simplify point clouds’ irregular and sparse nature and have high localization accuracy due to precise spatial encoding in the voxel grid. Additionally, their ability to process volumetric data directly makes them suitable for underground mine environments. However, voxel-based methods also have limitations: they lack semantic richness, which limits classification accuracy when implemented alone; fine-grained voxelization can lead to high computational demand; and they struggle with small object detection due to the limited resolution of sparse voxel grids. A minimal voxelization sketch follows this list.
2. Point-Based Methods: Point-based methods directly process raw LiDAR point clouds without converting them into voxel grids (voxelization) or 2D projections. This approach preserves fine-grained geometric details of the environment by operating directly on the raw 3D points, which makes it highly effective at detecting partially occluded or irregularly shaped objects. The pioneering family of point-based methods is PointNet [69], which uses symmetric functions to learn global shape features from unordered point sets. PointNet++ [70] extends this by using hierarchical feature learning to capture local context in clustered regions. PointPillars [71], shown in Figure 9, is a widely adopted method in autonomous driving. It partitions point clouds into vertical columns known as “pillars” to offer a balance between detail preservation and computational efficiency, converting spatial features into a pseudo-image format that enables fast detection through 2D convolution while retaining meaningful 3D structure.
Despite their advantages in handling detailed geometry, point-based methods face limitations in computational cost, especially in cluttered environments, limited scalability in real time, and the need for substantial GPU memory for training and inference.
3. Hybrid-Based Methods: Hybrid approaches integrate the strengths of voxel-based and point-based techniques to enhance the accuracy and efficiency of 3D object detection. These methods utilize voxelization for structured data representation while leveraging fine local feature extraction of point-based models. A hybrid model typically starts by extracting local geometric object features from raw point clouds using point-based encoders. The features are then embedded into voxel grids, where the voxel-based backbone utilizes convolutions to learn global context and make predictions. This design helps the hybrid model to balance robustness, precision, and processing speed, making it suitable for cluttered and complex underground environments.
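As the voxelization sketch referenced in the list above, the snippet below converts a point cloud into a binary occupancy grid, the kind of structured representation on which voxel-based detectors such as VoxelNet and SECOND build richer feature encodings. The grid extent, resolution, and synthetic points are illustrative assumptions.

```python
# Minimal voxelization sketch: map raw points into a fixed occupancy grid.
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.2,
             extent: float = 50.0) -> np.ndarray:
    """Return a binary occupancy grid for points inside [0, extent)^3."""
    grid_dim = int(extent / voxel_size)
    grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=np.uint8)
    idx = np.floor(points / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < grid_dim), axis=1)   # drop out-of-range points
    grid[tuple(idx[valid].T)] = 1
    return grid

cloud = np.random.rand(5000, 3) * 50.0        # placeholder LiDAR sweep (metres)
occupancy = voxelize(cloud)
print("occupied voxels:", int(occupancy.sum()))
```

Real detectors replace the binary occupancy value with learned per-voxel feature vectors (e.g., VFE layers), but the discretization trade-off shown here, resolution versus memory, is the same one discussed above.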
Despite the fast adaptation of LiDAR-based systems, several limitations impact their deployment in underground mining environments:
  • Lack of Object Identification: LiDAR systems can detect the presence and position of objects, but do not provide detailed information about object types or characteristics, which is essential for some mining applications.
  • Computational Demand: Processing large volumes of point cloud data in real time imposes a significant computational load.
  • Environmental Sensitivity: Accuracy is impacted by environmental conditions, such as dust, fog, water vapor, and snow. These factors degrade signal quality and require additional hardware and software processing to mitigate detection failures.
  • Sensor Surface Contamination Challenges: Dust and debris accumulating on the sensor surface impair detection functionality, necessitating regular cleaning or protective mechanisms to maintain reliability.
  • High Cost of High-Resolution Units: High-resolution LiDAR units are expensive, which often hinders their large-scale deployment in budget-sensitive underground operations.
  • Range and Field of View Limitations: The typical range of LiDAR sensors is limited to around 50 m [28], restricting their effectiveness in larger underground mining operations. Furthermore, planar scanning systems may miss obstacles above or below the scanning plane, posing safety risks.
  • Energy Consumption and Infrastructure Requirements: LiDAR systems consume more energy than other sensors and demand robust infrastructure for effective operation, complicating deployment in confined underground mining spaces.
While models such as 3DSG [5,68] have demonstrated enhanced 3D object detection for surface mining trucks using LiDAR sensors, their direct applicability to underground environments remains limited due to differences in spatial constraints, lighting conditions, and operational challenges.

2.4. Multi-Sensor Fusion in Underground Mining: Perception Enhancement Through Integration

Multi-sensor fusion models [72,73,74] are at the forefront of recent 3D detection systems for autonomous vehicle navigation in underground mining environments. These systems integrate data from multiple sensors such as cameras, LiDAR, radar, or thermal sensors to develop a comprehensive perception of the environment. Each sensor type has its strengths and limitations, and by combining them, multi-sensor fusion systems capitalize on their complementary capabilities to enhance detection accuracy and reliability for the safe navigation of autonomous trucks. LiDAR provides precise 3D spatial information, excelling at depth measurement, but often struggles in particulate-heavy environments. Cameras offer rich visual data, capturing color and texture information for more detailed feature extraction, but their performance degrades in low-light or obscured conditions. Radar is reliable in adverse conditions and excels at long-range detection, but is less precise in resolution. Thermal sensors capture heat signatures from objects and surroundings, making them capable of detecting objects in completely dark and challenging conditions, such as fog, smoke, and dust. By fusing these data streams, multi-sensor models address the individual limitations of each sensor to create a robust perception framework suitable for challenging conditions in underground mines. As sensor technologies and fusion algorithms continue to evolve, they promise a future where autonomous trucks achieve even greater safety, efficiency, and adaptability in mining operations.
Szrek et al. [42] evaluated a UGV-based human detection system in underground mines using RGB and IR cameras alongside YOLOv3 and HOG algorithms. While RGB imagery provided visual context, it struggled in low-light and cluttered environments, especially for non-standing individuals. The study showed that RGB detection alone was often insufficient, but combining it with IR data improved reliability. A key limitation was using pre-trained models that were not optimized for underground settings, which affected accuracy. Xu et al. [75] proposed an autonomous vehicle localization method for underground coal mine tunnels based on fusing vision and ultrasonic sensors. They use infrared cameras to detect wall-mounted barcodes and ultrasonic sensors to measure distances, enabling geometric calculation of vehicle position without relying on complex SLAM. The method achieves sub-meter accuracy but is limited by dependence on manual barcode deployment and potential occlusion in dynamic tunnel environments.
Zhang et al. [76] also developed a real-time underground mining vehicle localization method by fusing YOLOv5-based object detection with high-precision laser distance measurement. The system identifies mining trucks visually using YOLOv5s and calculates exact positioning via laser sensors. However, the system had limitations, including its sensitivity to environmental conditions such as humidity and dust, which reduced detection robustness for small and fast-moving objects, with future improvements suggested through upgrading to YOLOv7/YOLOv8 and enhancing multi-object tracking capability. While recent advances have explored multi-sensor fusion using RGB, LiDAR, and thermal imagery with CNN or YOLO-based algorithms in environments such as search-and-rescue tunnels, urban settings, and surface mine environments, applications specifically targeting underground autonomous haulage trucks remain extremely limited. Most existing work focuses on drones, indicating a critical research gap for truck-scale 3D object detection and navigation in confined mining conditions.
Key advantages of multi-sensor fusion include the following:
  • Enhanced Detection Accuracy: Combining the complementary strengths of different sensors, like LiDAR’s depth information with RGB camera’s texture data and thermal imaging’s heat signatures, will improve overall object detection performance.
  • Robustness in Harsh Conditions: Multi-sensor models maintain improved detection capabilities in challenging underground conditions where individual sensors may fail.
  • Redundancy for Safety: These models provide multiple sources of information, allowing the system to continue functioning safely if one sensor fails or becomes unreliable.
  • Improved Localization and Mapping: Fusing vision (RGB cameras) with depth measurements (LiDAR) strengthens localization precision and supports robust mapping in GPS-denied underground environments.
  • Adaptability to Dynamic Conditions: Adaptability provides dynamic sensor prioritization, where the system can rely more heavily on the most reliable sensors depending on environmental changes, like thermal sensors in low-visibility conditions and LiDAR for depth information.
With these capabilities of multi-sensor systems, notable challenges associated with them include the following:
  • High Computational Demand: Processing large volumes of synchronized data from different sensors in real time requires powerful and often costly computational hardware.
  • Complex Synchronization and Calibration: Multi-sensor fusion requires precise calibration and alignment across different sensors, which is technically challenging, especially in underground settings with vibrations and environmental noise.
  • Scarcity of Datasets for Training: There is a lack of large, annotated, multi-sensor datasets that specifically capture underground environment conditions, limiting the ability to train robust deep learning models
  • Increased System Cost and Weight: Adding multiple high-end sensors such as LiDAR, thermal, and RGB cameras raises the overall cost and maintenance complexity and may impact autonomous truck payload or energy efficiency.

3. Sensor Data Fusion Methods

Fusing data from multi-modal sensors is vital in 3D object detection for underground autonomous trucks. The harsh underground conditions require robust perception systems capable of overcoming occlusions, dust, and low-visibility environments. Fusion methods are generally classified into early-stage (raw data), mid-stage (feature-level), and late-stage (decision-level) fusions, depending on when the fusion occurs in the data preprocessing pipeline, as illustrated in Figure 10 and Figure 11. Figure 10 illustrates the key sensor fusion types and their respective modes of combining sensor data, aimed at enhancing detection model accuracy and robustness. Figure 11 demonstrates key architectural differences between early, mid-level, and late fusion strategies. These diagrams show where sensor data, such as LiDAR, cameras, and radar, is combined within the object detection pipeline. These stages help clarify how different models balance accuracy and flexibility in perception systems. Figure 12 illustrates a comparative view of the relative accuracy and computational complexity of the three fusion strategies. It places each method within a performance-efficiency spectrum, visualizing how the choice of fusion impacts both model quality and hardware demands. Each strategy has distinct strengths and significant trade-offs, as described below. For instance, though early fusion offers high accuracy, it needs powerful processors and precise calibration. Mid-level fusion strikes a balance between accuracy and efficiency, making it well-suited for underground haulage trucks operating in real time. Late fusion trades some accuracy for simplicity and speed, which may be acceptable for non-critical tasks or as a fallback system.

3.1. Multi-Fusion-Level Methods

3.1.1. Early-Stage Fusion

Early-stage fusion, or sensor-level fusion, integrates raw, unprocessed data streams from multiple sensors into a single dataset before performing feature extraction [78]. This method is valuable for creating robust detection systems in environments requiring fine-grained detection details, as it captures the full signal fidelity of individual sensor modalities.
For instance, LiDAR point clouds can be registered onto the RGB camera image pixel grid to generate depth-colored visual maps. Thermal sensor outputs can also be overlaid on visual data to identify heat-based anomalies. This strategy significantly improves the detection of small, obscured objects that a single sensor type may not fully recognize. It maximizes the amount of extracted information, enabling detailed feature extraction, and the richness of the raw data captured ensures that critical details are not lost during the fusion process.
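The sketch below shows the registration step in its simplest form: LiDAR points are projected into the camera image plane with a pinhole model so that each projected point carries a depth value for the corresponding pixel. The intrinsic matrix, extrinsics, and synthetic points are illustrative placeholders; a deployed system would obtain them from sensor calibration.

```python
# Minimal early-fusion sketch: project LiDAR points onto the RGB pixel grid.
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                    # assumed camera intrinsics
R, t = np.eye(3), np.zeros(3)                      # assumed LiDAR-to-camera extrinsics

points = np.random.rand(1000, 3) * [10, 5, 30]     # placeholder LiDAR points (metres)
cam = (R @ points.T).T + t                         # transform into the camera frame
in_front = cam[:, 2] > 0.5                         # keep points ahead of the lens
uvw = (K @ cam[in_front].T).T
pixels = uvw[:, :2] / uvw[:, 2:3]                  # pixel coordinates (u, v)
depth = cam[in_front, 2]                           # depth value paired with each pixel
print(pixels.shape, float(depth.min()), float(depth.max()))
```

The resulting (pixel, depth) pairs can be rasterized into a depth channel and stacked with the RGB image, which is the "depth-colored map" input described above.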
The primary advantage of early-stage fusion is its ability to retain rich and detailed information across multiple sensor modalities. This gives an expressive input for DL models, enhancing detection precision and accuracy. However, this imposes significant computational demands, as raw sensor data requires huge memory and processing power. This computational burden causes a bottleneck to systems requiring real-time processing, such as the underground mine.
Moreover, temporal misalignment of data from different sensors, resolution disparities, and distinct sensor frame rates introduce another challenge. This can be addressed by applying advanced calibration techniques and synchronization protocols for adequate alignment [2,37,79,80]. Misaligned data can lead to inaccuracies in the final fused dataset, undermining the system’s reliability, accuracy, and effectiveness [2]. Despite these limitations, early fusion remains essential in environments where precise spatial understanding is required and computational overhead is not the primary constraint.

3.1.2. Mid-Level Fusion

Mid-level fusion involves the integration of independent extraction features from individual sensor modalities to create a more unified and comprehensive representation of the environment. Rather than performing a fusion of raw sensor data, each sensor undergoes preliminary processing via a Convolutional Neural Network (CNN) or other feature encoders to extract high-level object representations such as contours, depth features, temperature gradients, and motion cues. These extracted features are consequently aligned and integrated into a shared feature space.
This fusion strategy provides an effective balance between perception richness and computational efficiency. By processing feature representations with low dimensionality, memory and bandwidth are reduced instead of the significant volume of raw data, while capturing rich information from each sensor modality. PointPainting by [81] presents a sequential (mid-level) sensor fusion approach for 3D object detection that combines LiDAR point clouds with semantic information from RGB images. The core idea is to first run a 2D semantic segmentation network on RGB images to produce class probabilities (or “paint”) for each pixel. These semantic labels are then projected onto the corresponding LiDAR points, effectively “painting” each point with class information. The enriched point cloud is then fed into a 3D detection network, such as PointRCNN or SECOND. This mid-level fusion strategy enables the model to leverage both geometric precision from LiDAR and semantic context from images. The studies demonstrated improved detection, especially for small or partially occluded objects, while maintaining modularity and flexibility in network design.
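A stripped-down sketch of the painting step is given below: per-pixel class scores from a 2D segmentation network are looked up for each LiDAR point and concatenated to its coordinates. The projection function, score map, and array sizes are hypothetical placeholders, not the actual PointPainting implementation.

```python
# Simplified sketch of the PointPainting idea: append per-pixel semantic scores
# to the LiDAR points that project onto those pixels.
import numpy as np

num_classes = 4
seg_scores = np.random.rand(480, 640, num_classes)   # placeholder 2D semantic scores
points = np.random.rand(2000, 3) * 30.0              # placeholder LiDAR points (metres)

def project_to_pixels(pts: np.ndarray) -> np.ndarray:
    """Hypothetical projection; a real system uses calibrated intrinsics/extrinsics."""
    u = np.clip((pts[:, 0] / 30.0) * 639, 0, 639).astype(int)
    v = np.clip((pts[:, 1] / 30.0) * 479, 0, 479).astype(int)
    return np.stack([v, u], axis=1)

pix = project_to_pixels(points)
painted = np.hstack([points, seg_scores[pix[:, 0], pix[:, 1]]])   # shape (N, 3 + C)
print(painted.shape)   # the painted cloud then feeds a 3D detector such as SECOND
```

The appeal for underground trucks is modularity: the 2D segmenter and the 3D detector can be retrained independently as mine-specific data becomes available.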
Mid-level fusion provides several advantages, including data efficiency: the reduced computational requirements yield a more efficient system for real-time operations. Mid-level fusion is particularly relevant for real-time applications in underground autonomous trucks, where edge deployment and split-second decision-making are critical.
Nevertheless, mid-level fusion introduces its own challenges. Effectively aligning extracted features from heterogeneous sensors requires spatial and temporal synchronization and is often non-trivial. Misalignments can degrade fusion quality and introduce detection errors. Additionally, fusion performance depends strongly on the design and quality of the feature extraction modules as well as the fusion architecture, such as cross-modal attention networks, which require careful design and tuning.

3.1.3. Late-Stage Fusion

Late-level (decision-level) fusion integrates the final outputs of independently processed sensor data, such as bounding boxes, to make high-level decisions. Each sensor operates autonomously in this approach, and the outputs are combined using techniques like weighted averaging or voting schemes to obtain final decisions. In [82], the study employed a decision fusion method utilizing camera and LiDAR sensors for mine track object detection. This strategy is favored for its simplicity and modularity, enabling sensors to be trained and maintained separately, simplifying the model design. The decision fusion approach is computationally lightweight, making it suitable for embedded systems or as a fallback in redundant safety layers [79]. As individual sensors process their data autonomously, the fusion occurs at the decision level, which does not require handling extensive raw data. This makes late-stage fusion suitable for resource-constrained environments and real-time processing applications.
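The sketch below illustrates one simple decision-level scheme of the kind described above: per-sensor 2D detections are matched by overlap and merged with a reliability-weighted confidence vote. The box format, sensor weights, and IoU threshold are illustrative assumptions, not the method of [82].

```python
# Minimal decision-level fusion sketch: merge per-sensor detections by IoU
# matching and a reliability-weighted confidence vote.
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float, float]   # x1, y1, x2, y2, confidence

def iou(a: Box, b: Box) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse(rgb_dets: List[Box], lidar_dets: List[Box],
         w_rgb: float = 0.4, w_lidar: float = 0.6,
         iou_thr: float = 0.5) -> List[Box]:
    fused: List[Box] = []
    for r in rgb_dets:
        match: Optional[Box] = max(lidar_dets, key=lambda l: iou(r, l), default=None)
        if match is not None and iou(r, match) >= iou_thr:
            conf = w_rgb * r[4] + w_lidar * match[4]      # weighted vote
            fused.append((*match[:4], conf))              # trust LiDAR geometry
    return fused

print(fuse([(10, 10, 50, 80, 0.7)], [(12, 11, 52, 82, 0.9)]))
```

Because only final boxes are exchanged, each sensor pipeline can run on its own hardware, which is the modularity and fallback benefit noted in the text.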
Despite these advantages, there are limitations, such as over-reliance on high-level decisions. Additionally, late fusion lacks access to raw or intermediate data and cannot refine ambiguities introduced earlier in the pipeline [83]. If one sensor misclassifies an object, the fusion process cannot correct it. This often reduces overall detection accuracy and system performance, particularly in complex scenarios with occlusions and multiple overlapping objects [84,85,86].
Primarily developed for autonomous surface vehicles, the hybrid fusion framework proposed by [84] demonstrates a scalable strategy that combines mid-level and late-level fusion for robust perception. It integrated LiDAR, radar, and camera inputs while addressing sensor misalignment through asynchronous fusion. SCANet [85] also introduced a spatial-channel attention network that fuses LiDAR and RGB data at multiple stages to enhance 3D object detection. Although designed for surface vehicles, its use of attention mechanisms offers an alternative to traditional late-stage fusion by selectively emphasizing informative features across modalities. This enhancement improves detection in unstructured environments, making it a promising candidate for adaptation in underground autonomous vehicles, where scene complexity and sensor noise are significant challenges. Roy et al. [86] proposed a concurrent spatial and channel “squeeze and excitation” (scSE) mechanism that enhances feature refinement by jointly recalibrating spatial and channel-wise responses. Its ability to improve semantic clarity makes it valuable for strengthening the interpretability and reliability of decision-level sensor fusion frameworks.
These innovations are especially relevant for the adaptation of underground haulage systems, where real-time performance and environmental noise pose significant challenges. While such frameworks are yet to be widely deployed in underground mining, they offer a blueprint for developing more resilient perception systems in unstructured, low-visibility environments. Table 2 presents the three sensor fusion strategies commonly applied in multi-sensor systems. It outlines the advantages of each fusion level and discusses its key features. This summary serves as a guide for researchers in selecting a suitable fusion framework that balances accuracy and real-time performance in an underground mining environment.
This section and the comparison below demonstrate that each fusion strategy has unique strengths and limitations depending on the deployment context. In complex underground mining environments, where safety-critical decisions are required in real time, mid-level fusion often provides the best balance between detection accuracy and real-time decision-making. However, hybrid strategies that combine early and late fusion may also provide good results for fail-safe designs. Table 2 above gives a comparative summary of the advantages and limitations of early, mid-level, and late sensor fusion approaches used in 3D object detection. While these strategies have gained wide applications in autonomous driving and surface mining systems [72,88,89,90], their deployment in underground autonomous truck navigation remains significantly underexplored. To the best of our knowledge, there is a lack of existing studies that have systematically implemented or benchmarked these fusion levels within the complex, sensor-degraded environments typical of underground mining. This lack of targeted evaluation highlights a critical gap. It underscores the necessity for underground-specific sensor fusion designs that can overcome dust, occlusion, GNSS/GPS signal denial, and dynamic underground layouts.

3.2. Sensor Fusion Advantages and Limitations

3.2.1. Advantages of Sensor Fusion in the Underground Mine Environment

Sensor fusion significantly enhances perception performance in autonomous haulage systems, especially within complex and dynamic underground mine environments. By integrating different sensor modalities such as LiDAR, camera, and IR, sensor fusion improves decision-making accuracy by overcoming individual sensor limitations. The key advantages include the following:
  • Resilience in Harsh Conditions: Sensor fusion compensates for individual sensor weaknesses. For example, LiDAR often degrades in dense particulate environments, cameras struggle in low light, and thermal sensors may lose range in open spaces. The fusion of these sensors provides a more consistent and robust perception across such extreme conditions [87].
  • Enhanced Object Detection and Classification: Multi-modal sensor integration improves object detection precision. Combining LiDAR and thermal or camera data enhances the recognition of equipment, workers, and structural features, especially in low-visibility or occluded scenes [81].
  • Improved Operational Safety: Integrating multiple sensors increases system reliability and robustness, providing better object identification and more reliable obstacle avoidance. This ensures workers’ and equipment safety in confined underground settings [72].
  • Real-Time Decision-Making: Fusion systems enable faster and more informed decision-making, allowing real-time analysis of extreme environmental conditions. This enables autonomous trucks to respond promptly to dynamic changes in the environment [90].
  • Model Reliability and Redundancy: Fusion systems enhance fault tolerance and robustness by ensuring that the failure of one sensor modality does not compromise the entire perception system.

3.2.2. Challenges in Sensor Data Fusion

Despite the transformative benefits of fusion modalities, sensor fusion in underground autonomous haulage systems faces several technical and environmental challenges that complicate its implementation. These challenges must be addressed to ensure accurate, scalable, reliable, and real-time detection systems.
  • Sensor Signal Reliability and Noise: Underground environmental conditions, such as noise, vibrations, dust, and electromagnetic interference, can significantly introduce noise into sensor outputs, degrading performance. LiDAR sensors may produce inaccurate point clouds due to reflective surfaces, while RGB cameras may struggle in low-light conditions. Filtering this noise without losing essential features requires advanced adaptive filtering and denoising techniques [91].
  • Heterogeneous Sensor Integration: Each sensor type, such as LiDAR, cameras, radar, and IR, produces data in distinct resolutions, formats, and operational ranges. Integrating this heterogeneous input data requires overcoming challenges associated with data pre-processing, feature extraction, and data representation. For instance, fusing LiDAR point cloud data with pixel-based data from cameras involves significant computational effort and advanced algorithms to ensure meaningful integration and representation [2].
  • Data Synchronization and Latency Management: Precise synchronization of data streams from various sensors is crucial for real-time sensor fusion performance. Differences in signal delays, sampling rates, and processing times can lead to temporal data mismatches, resulting in inaccurate fusion outputs. For example, data from a camera capturing images at 30 frames per second (fps) must be aligned with LiDAR data that operates at a different frequency (e.g., 15 Hz). Sophisticated temporal alignment algorithms or interpolation methods are required to ensure all sensor inputs contribute meaningful data to the fused output [2,37,91]. A minimal timestamp-matching sketch is shown after this list.
  • Computational Complexity: The large volume of high-dimensional data that multiple sensors generate presents significant computational challenges. Real-time processing, critical for applications such as autonomous truck navigation, demands high-performance hardware and optimized algorithms. Resource constraints, including limited processing power and energy availability in mining vehicles, further complicate this task. Efficient techniques that balance computational load with fusion accuracy are crucial for deployment in such environments. Refs. [83,91] discuss the computational challenges in sensor fusion, emphasizing the need for efficient distributed processing strategies and the use of techniques like data compression and dimensionality reduction to manage computational and resource requirements.
  • Environmental Factors: Underground conditions pose severe challenges to sensor performance. Poor lighting, dust, smoke, noise, and fog can obscure camera, radar, and LiDAR data, while extreme temperatures can impact sensor calibration and accuracy. Additionally, confined spaces and irregular terrains can cause occlusions or reflections that lead to distorted sensor readings. Designing fusion algorithms that can compensate for these environmental factors is a critical area of research. Addressing these challenges requires interdisciplinary solutions drawn from signal processing and machine learning.
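To make the synchronization challenge above concrete, the following is a minimal sketch of nearest-timestamp matching between a 30 fps camera stream and a lower-rate LiDAR stream. The function name, sensor rates, and tolerance are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def nearest_frame_indices(lidar_ts, camera_ts, max_offset=0.02):
    """For each LiDAR sweep timestamp, return the index of the closest
    camera frame, or -1 when the time gap exceeds max_offset seconds."""
    cam = np.asarray(camera_ts)
    idx = np.clip(np.searchsorted(cam, lidar_ts), 1, len(cam) - 1)
    left, right = cam[idx - 1], cam[idx]
    best = np.where((lidar_ts - left) < (right - lidar_ts), idx - 1, idx)
    gaps = np.abs(cam[best] - lidar_ts)
    return np.where(gaps <= max_offset, best, -1)

# Illustrative clocks: camera at 30 fps, LiDAR at 10 Hz with a small fixed latency.
camera_ts = np.arange(0.0, 2.0, 1 / 30)
lidar_ts = np.arange(0.0, 2.0, 1 / 10) + 0.004
matches = nearest_frame_indices(lidar_ts, camera_ts)
```

Deployed systems typically go further, using interpolation, Kalman filtering, or hardware triggering, but the same timestamp bookkeeping underlies all of these approaches.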

4. Algorithms for 3D Object Detection in Autonomous Trucks

The rapid evolution of ML algorithms has revolutionized numerous sectors, including autonomous vehicles and industrial robotics. These algorithms enable such systems to learn from experience, process complex sensory data, adapt, and make real-time decisions. Over the past decade, ML techniques have transitioned from basic pattern recognition to sophisticated algorithms that address complex, real-world challenges. AI/ML technologies provide automation, safety, and productivity advancements in the mining industry, particularly in underground operations.
Historically, traditional object detection techniques such as Histogram of Oriented Gradients (HOGs) and Support Vector Machines (SVMs) struggled to perform under harsh underground conditions because of occluded or noisy environments. Efforts to develop more robust and adaptable DL algorithms, particularly Convolutional Neural Networks (CNNs) and YOLO, have increased to address many of these challenges [92]. These algorithms allow for more precise, reliable, and adaptive perception in complex underground environments.
This section synthesizes major ML algorithms and their integration for 3D object detection in underground autonomous haulage navigation. The section emphasizes their architectural principles, recent applications in underground operations, and performance trade-offs.

4.1. Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are foundational to modern computer vision because they can learn spatial hierarchies of features from image data [93]. CNNs have proven resilient to the poor lighting, dust, and occlusion that characterize underground environments.
Classical CNN-based object detection models follow a two-stage process [94]. Girshick et al. [95] introduced the original region-based R-CNN, which uses selective search to propose regions of interest, followed by CNN-based classification. Though accurate, this method is computationally expensive and unsuitable for real-time applications. Subsequent models improved on this: Fast R-CNN [96] shared convolutional features across proposals, and Faster R-CNN [97] integrated the region proposal stage into the network itself, minimizing latency and enabling near real-time operation. Mask R-CNN [98] further extended this line of work by adding instance segmentation alongside detection and classification, which is valuable for cluttered environments such as underground mines.
Several studies have demonstrated the use of CNNs for vehicle and personnel detection in underground tunnels, validating the robustness of hierarchical feature extraction [94,97]. However, CNN algorithms remain challenged by their high computational load, local receptive fields that may miss global context, and limitations in processing 3D point clouds [23,99]. Lightweight CNN variants that employ quantization, pruning, and multispectral imaging have been explored to overcome these challenges. These techniques reduce model size and energy consumption, making CNNs suitable for edge deployment in underground autonomous haulage. However, challenges remain in meeting real-time performance targets and adapting to complex underground scenarios.
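As an illustration of the two-stage pipeline discussed above, the snippet below runs a pretrained Faster R-CNN from torchvision on a single image tensor. This is a generic sketch (assuming torchvision ≥ 0.13 and a pretrained COCO model), not the implementation used in the cited underground studies.

```python
import torch
import torchvision

# Two-stage detector: region proposal network + per-region classification head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Stand-in for one low-light tunnel frame: 3 x H x W tensor scaled to [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    prediction = model([image])[0]        # dict with "boxes", "labels", "scores"

keep = prediction["scores"] > 0.5         # illustrative confidence threshold
boxes, labels = prediction["boxes"][keep], prediction["labels"][keep]
```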

4.2. YOLO (You Only Look Once) Series Algorithm

YOLO (You Only Look Once) algorithms have gained prominence for their single-stage detection framework. This design enables simultaneous object localization and classification in one network pass, delivering the real-time inference critical for dynamic underground environments [13,100,101,102]. Redmon et al. [103] introduced YOLOv1, recasting the detection task as a regression problem and significantly enhancing inference speed. Later versions, such as YOLOv3, YOLOv4, and YOLOv5, introduced features including multi-scale detection (Darknet-53), cross-stage partial networks (CSPDarknet), and enhanced model flexibility [104]. More recent versions, such as YOLOv8, include lightweight designs, transformer-based enhancements, and an improved feature pyramid for a better accuracy–speed balance. Figure 13 shows the original architecture of the YOLO framework. It processes a 448 × 448 × 3 input image through a DarkNet-based backbone to produce a 7 × 7 × 1024 feature map. This is flattened and passed through fully connected layers, resulting in a 7 × 7 × 30 output tensor. Each of the 7 × 7 grid cells predicts two bounding boxes and class probabilities, enabling real-time object detection. This architecture laid the foundation for later YOLO versions by combining localization and classification in a single pass.
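To make the 7 × 7 × 30 output tensor concrete, the sketch below decodes it under the original configuration of two boxes per cell and 20 classes (five values per box plus shared class probabilities). The random tensor and score threshold are purely illustrative.

```python
import numpy as np

S, B, C = 7, 2, 20                           # grid size, boxes per cell, classes
output = np.random.rand(S, S, B * 5 + C)     # stand-in for the 7 x 7 x 30 prediction

detections = []
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]           # per-cell conditional class probabilities
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            cx, cy = (col + x) / S, (row + y) / S   # cell offsets -> image-relative coords
            score = conf * class_probs.max()        # class-specific confidence
            if score > 0.2:
                detections.append((cx, cy, w, h, score, int(class_probs.argmax())))
```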
Research has leveraged YOLO algorithms to develop robust 3D object detection models [21]. Li et al. [19] proposed LDSI-YOLOv8 to address multi-target detection in the coal mine excavation environment. The work leveraged the YOLOv8 architecture and achieved an mAP improvement of 4.3% over the original YOLOv8 model, as shown in Figure 14. The study achieved 91.4% mAP and 82.2 FPS, demonstrating strong adaptability to dusty, low-light, and occluded underground conditions. Despite its strong detection performance, the model’s training on a specialized excavation dataset may limit its generalization to different mining sites without retraining or domain adaptation. Ní et al. [100] also developed a YOLOv8-based pedestrian and hazard detection model for underground mining environments, which achieves real-time capability and improved accuracy. Zhang et al. [105] proposed YOLO-UCM, an enhanced YOLOv5 model, to improve pedestrian detection in underground coal mines. It integrated Vision Transformers (ViTs) and Meta-AconC for enhanced feature extraction and detection accuracy.
The DDEB-YOLOv5s model [106] incorporates a C3-Dense feature extraction module to enhance the perception of occluded and small-scale objects in coal mining environments. It builds upon the Cross-Stage Partial (CSP) architecture by introducing dense connections that enable multiple convolutional layers to share and reuse feature maps. It also introduces a weighted BiFPN (Bidirectional Feature Pyramid Network) to merge image features from different layers more effectively, improving accuracy for objects of various sizes. Finally, a decoupled detection head separates the classification and localization tasks, yielding more precise results for both object type and position. The model achieved a multi-target tracking accuracy of 91.7% and stable performance in mining environments.
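The weighted BiFPN idea of merging feature maps with learned, normalized weights can be sketched as follows. This is a generic fast-normalized-fusion module rather than the DDEB-YOLOv5s code, and the tensor shapes are assumed for illustration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion:
    out = sum_i (w_i / (eps + sum_j w_j)) * x_i, with learned non-negative weights."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)                 # keep fusion weights non-negative
        w = w / (self.eps + w.sum())
        return sum(wi * x for wi, x in zip(w, inputs))

# Two pyramid levels assumed already resampled to a common resolution upstream.
fuse = WeightedFusion(num_inputs=2)
p4_in, p4_td = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
p4_out = fuse([p4_in, p4_td])
```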
Li et al. [107] presented an improved YOLOv11-based miner detection model for underground coal mines, enhancing feature extraction with Efficient Channel Attention (ECA) and refining localization with a weighted CIoU loss. The model achieves 95.8% mAP@50 and 59.6 FPS on a custom underground dataset, outperforming existing detectors. However, it mainly focuses on personnel detection, and future work is needed for broader underground object recognition.
While YOLO models have achieved considerable success, they still struggle to detect objects in highly occluded or cluttered conditions, and mitigating this often requires additional computational resources as model scale increases. The trade-off between detection speed, accuracy, and model size remains a key challenge when deploying on embedded GPU systems in underground haulage trucks. Therefore, ongoing research aims to optimize YOLO’s performance in such settings, ensuring that real-time, accurate detection remains feasible even with limited computational power. A performance comparison of YOLO and CNN algorithms is presented in Table 3. The table highlights the key advantages and limitations of CNN-based and YOLO-based object detection frameworks under the constraints of underground settings, characterized by low lighting, occlusion, computational overhead, and the need for real-time inference.
In summary, two-stage detection frameworks such as the R-CNN family provide high detection accuracy but introduce computational overhead, making them less suitable for real-time underground applications. On the other hand, YOLO-based models provide a viable direction for fast and efficient object detection, with recent versions demonstrating adaptability to underground conditions. However, they often suffer in small object detection and face trade-offs between accuracy and inference speed. Continued efforts to enhance these models for edge computing and to improve their robustness and real-time performance under occlusion and dynamic conditions will be crucial to advancing autonomous haulage safety and efficiency in underground mining operations.
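One common route to the edge-deployment trade-off noted above is post-training compression. The sketch below applies magnitude pruning to convolutional layers and dynamic quantization to fully connected layers using standard PyTorch utilities; the toy layers are placeholders, not any of the reviewed detectors.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder detection head; the same calls apply layer by layer to a real backbone.
head = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.Conv2d(256, 255, 1))

# Magnitude pruning: zero out the 30% smallest-magnitude weights in each conv layer.
for m in head.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")              # bake the sparsity into the weights

# Post-training dynamic quantization of fully connected layers to int8.
classifier = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
classifier_int8 = torch.quantization.quantize_dynamic(classifier, {nn.Linear}, dtype=torch.qint8)
```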

4.3. Detection Model Comparison for Underground Autonomous Haulage Systems

Current 3D object detection models designed for underground autonomous haulage navigation differ in structure, sensor dependency, and performance. The performance of 3D object detection systems is crucial for ensuring the safety and efficiency of autonomous trucks, particularly in complex and dynamic environments such as underground mines.
Table 4 below evaluates and summarizes the performance of the leading 3D object detection models used in underground autonomous truck applications. It highlights their capabilities to handle the unique challenges posed by underground environments, detection approaches, advantages, limitations, and key performance metrics.

4.4. Real World Case Scenarios of 3D Object Detection Systems in Underground Autonomous Truck Navigation

Selected case studies are presented to demonstrate the real-world application and relevance of advanced 3D object detection methods. These case studies highlight deployment scenarios, performance benchmarks, architectural innovations, and limitations. As the examples illustrate, recent developments have sought to address practical obstacles to perception in intricate underground environments.
Peng et al. [6] employed 3D LiDAR to develop an efficient obstacle detection method specifically designed for real-time vehicle navigation in underground mining environments. Their methodology involved reducing the 3D LiDAR point cloud density through voxel downsampling, followed by the extraction of object boundaries using Euclidean clustering and an optimized RANSAC (Random Sample Consensus) algorithm for ground segmentation. Within a 50 m range, the system demonstrated a detection accuracy of over 95% and processed frames at a rate of 0.14 s per scan, confirming its suitability for dynamic environments. Despite the strong performance, distinguishing between stationary apparatus and transient hazards remains a significant limitation, suggesting the potential for integrating semantic classification models.
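The processing chain reported by Peng et al. [6] can be approximated with open-source tooling as shown below. This is a rough sketch using Open3D, with DBSCAN standing in for Euclidean clustering; the file name and all parameter values are chosen for illustration only.

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("tunnel_scan.pcd")        # hypothetical LiDAR sweep

# 1. Voxel downsampling to thin out the point cloud.
pcd = pcd.voxel_down_sample(voxel_size=0.10)

# 2. RANSAC plane fit to segment the roadway floor from potential obstacles.
_, floor_idx = pcd.segment_plane(distance_threshold=0.15, ransac_n=3, num_iterations=1000)
obstacles = pcd.select_by_index(floor_idx, invert=True)

# 3. Cluster the remaining points into candidate obstacles.
labels = np.array(obstacles.cluster_dbscan(eps=0.5, min_points=10))
num_obstacles = int(labels.max()) + 1 if labels.size else 0
```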
Hanif et al. [13] presented SOD-YOLOv5s-4L, an enhanced object detection model designed for underground mining environments. Building on the YOLOv5s architecture, the study introduced enhancements including a decoupled detection head to better separate classification and localization tasks, a dedicated small-object detection layer for accurately identifying minor hazards, and the SIoU loss function to improve bounding box precision. The system demonstrated strong performance, achieving a mean Average Precision (mAP) of up to 98% across multiple object categories and 99% for small targets, outperforming existing YOLO variants, such as YOLOv7 and YOLOv8, under the same conditions. The results highlight the model’s suitability for complex underground settings with poor visibility and cluttered scenes. However, the study noted key limitations. The added architectural complexity resulted in longer training times and slightly slower inference speeds, which may impact real-time deployment on resource-constrained edge devices. In addition, because the model was trained on a specific dataset, its generalization to different underground mine layouts may require further fine-tuning and validation.
A YOLOv5-based object detection model was developed and deployed in multiple active underground mining sites for real-time pedestrian detection near mobile machinery [51]. The model was trained on an augmented dataset of 1944 images collected under challenging underground conditions, including poor lighting, dust, and motion blur. It achieved promising results with a mean Average Precision (mAP) of 71.6%, indicating its feasibility for underground safety applications. The system’s performance was validated through real-world testing, demonstrating its ability to detect pedestrians even in the presence of motion blur. However, limitations were observed in detecting overlapping pedestrians and achieving optimal performance under all underground conditions. The study noted that improvements in hardware, such as industrial-grade cameras, and further tuning of model parameters could enhance reliability. Additionally, the system currently stops at detection and does not integrate reactive actions such as stopping machinery or triggering alerts.
This case study implemented LDSI-YOLOv8, a YOLOv8 variant enhanced with DenseNet-inspired modules and transformer attention mechanisms, in active excavation environments. The coal mine scenes were characterized by variable illumination, irregular tunnel geometries, and overlapping objects. Focusing on the multi-class detection of miners and equipment, the system was trained and tested on video footage from real-world excavation sites. LDSI-YOLOv8 attained 91.4% mAP at 82.2 FPS, demonstrating its ability to maintain high precision in challenging visual conditions [19]. Although the system’s performance under extreme occlusion still required refinement, the use of attention modules substantially enhanced detection accuracy around object edges.
Zhang et al. [76] also presented a sensor fusion-based tracking method for underground mining vehicles, which leveraged YOLOv5 for visual object detection and combined it with laser sensors for distance measurement. The system used a high-resolution camera to detect mine cars and a laser rangefinder to determine their position in real time. These components were mounted on a motorized platform, allowing dynamic alignment with moving targets. By fusing image and range data, the system calculates the spatial position of vehicles with centimeter-level precision, overcoming traditional limitations of monocular vision in dark and confined underground environments.
The model was tested in underground tunnel scenarios with low visibility, complex geometries, and occlusions. It demonstrated reliable performance in detecting and tracking moving mine cars, proving its value for enhancing autonomous navigation and collision avoidance. However, key limitations were noted, including critical misalignment in timing, which can degrade positional accuracy, as well as system complexity and increased cost due to the integration of precision hardware. Calibration drift over time can impact accuracy, requiring routine maintenance. Additionally, the approach, while accurate, may be less scalable in fleets with numerous vehicles due to the demands of data fusion processing.
Major Original Equipment Manufacturers (OEMs) have developed and deployed Autonomous Haulage Systems (AHSs), primarily in surface mines. For instance, Caterpillar’s MineStar Command for hauling has been implemented at several large-scale operations, including Fortescue Metals Group’s Solomon Hub and BHP’s Jimblebar mines in Australia and several mines in the United States [109]. These applications enable autonomous trucks to navigate without an operator, utilizing onboard GPS, radar, and LiDAR systems integrated into a central control platform. Similarly, Komatsu’s AHS has been extensively utilized in the Pilbara region of Western Australia, notably at Rio Tinto’s Yandicoogina and Nammuldi mines, where over 300 autonomous trucks operate across multiple sites [110]. These systems rely on high-precision satellite navigation and obstacle detection technologies. Hitachi’s AHS, integrated with Wenco’s Fleet Management platform, employs LiDAR, GPS, and vision sensors in surface operations across North America [111,112]. These systems exemplify the successful deployment of sensor fusion in real-world industrial settings.
While designed for surface environments, they show the scalability and reliability of multi-sensor fusion in harsh mining conditions. Their success highlights the feasibility and industrial interest in adapting such technologies to underground environments, where perception challenges are even more severe due to GPS denial, limited visibility, and dynamic tunnel layouts. These case studies underscore the transition from conventional sensor-based obstacle avoidance to sophisticated DL-driven perception systems that can function in the challenging and dynamic environment of underground mining.

5. Dataset Analysis, Challenges, and Proposed Strategies

Annotated datasets tailored for underground environments are limited. Existing datasets focus heavily on surface mine and urban driving conditions and offer little coverage of the diverse lighting, dust interference, and confined-space conditions found underground. While synthetic datasets for underground applications have been used as a cost-effective alternative, they often lack the complexities of real-world situations. Dynamic underground conditions create challenges that synthetic data fail to replicate accurately, leading to poor model generalization and reduced performance when applied in actual mining environments. Real-world datasets are crucial for models to handle the varied and unpredictable nature of underground mining. The better a dataset reflects real underground scenarios, the more accurately and reliably the model detects objects.
For instance, Li et al. [19] proposed an LDSI-YOLOv8 model trained on a custom dataset of 4800 annotated images from coal mine excavation faces, covering key targets such as workers and equipment under low-light and occlusion conditions. While it achieved 91.4% mAP and 82.2 FPS, the model’s main limitation is its validation on a single site with a limited dataset, which does not fully represent underground settings and conditions and may affect generalizability across diverse underground environments.
Trybala et al. [74] developed the MIN3D dataset, which features multi-sensor 3D mapping data collected using an unmanned ground vehicle equipped with RGB-D cameras, LiDAR, IMU (Inertial Measurement Unit), and odometry sensors. The dataset was explicitly designed to capture the spatial, geometric, and visual complexity of underground mining tunnels. It includes multiple trajectories, annotated object categories, and scenarios involving dust interference, poor illumination, and irregular structures, making it valuable for evaluating sensor fusion and 3D perception models. MIN3D supports benchmarking for SLAM, object segmentation, and fusion-based 3D object detection under underground constraints. Despite its significance, the dataset is limited in terms of diversity, particularly in object classes, tunnel layouts, and dynamic underground scenes. Additionally, the lack of high-frequency sequences and labeled occlusion scenarios presents challenges for training deep learning (DL) models in real-time tracking. Nevertheless, MIN3D remains a foundational dataset for validating perception frameworks in underground environments, highlighting the importance of domain-specific data for autonomous truck navigation and safety.
Zhang et al. [105] also introduced a specialized object detection framework, known as YOLO-UCM, based on YOLOv5, tailored for pedestrian safety in underground coal mines. Due to the limitations of publicly available underground datasets, they developed the SDUCM-dataset, which simulates underground mining scenes under conditions such as poor lighting, smoke, dust, and occlusion. The model integrates advanced techniques, including Vision Transformer modules, Merge-NMS, and Meta-AconC activation, to enhance detection accuracy and generalizability. It achieved a maximum of 93.5% mAP and a recall of 99%, outperforming traditional YOLO versions and Faster R-CNN in both accuracy and inference speed. Despite its effectiveness, limitations remain, such as the dataset’s lack of diversity across mining sites and the YOLO-UCMx model’s size (87M+ parameters), which makes real-time deployment on embedded systems challenging. The authors suggest that future work should involve knowledge distillation and GAN-based dataset expansion to enhance model compactness and robustness.
Zhao et al. [106] also introduced a real-time perception approach based on an improved YOLOv5s model for detecting key targets such as shearers and mine workers in coal mine excavation faces. The model enhances spatial feature extraction through attention mechanisms and multi-scale fusion, thereby improving accuracy under occlusion and dim lighting conditions. The dataset used consists of actual images collected from underground coal mines, representing realistic operational scenarios and object categories. However, the dataset is highly task-specific, focusing only on machinery and layouts, which may limit the model’s generalizability to different underground mining settings or broader detection applications.
Table 5 outlines the key underground-specific datasets and their characteristics. This section presents a comprehensive breakdown of challenges relating to datasets in underground 3D object detection and maps each challenge to implementable solutions. Addressing real-world underground constraints like dust, occlusion, sensor misalignments, and data imbalance will advance the frontier of autonomous truck navigation in an underground mine environment. This fills a significant gap in the literature and provides actionable insights for future system model designs.
Recent detection frameworks in underground mining environments have achieved impressive results using RGB imagery, LiDAR, and thermal sensors. However, frameworks that rely on a single sensing modality face limitations under dust, fog, smoke, or complete darkness. Multi-sensor fusion offers a promising direction, integrating LiDAR for depth perception, thermal infrared for heat-based detection, and cameras for rich texture information. Future research should prioritize developing fusion-based models that combine complementary sensor data, enhancing situational awareness and resilience for autonomous truck operations in the highly variable and constrained underground mining environment.

5.1. Dataset Challenges in the Underground Environment

The development of scalable and effective 3D object detection systems for autonomous underground haulage trucks is hindered by significant dataset-related challenges:
  • Environmental Complexity: Underground environmental conditions pose significant challenges that hinder effective dataset collection for detection models. These include poor lighting, as underground settings lack natural light; the resulting images have low contrast and clarity, making it difficult for optical sensors such as cameras to capture objects accurately. Additionally, dust and smoke from drilling, blasting, and transportation activities often scatter light and obscure sensors, limiting data quality. Furthermore, uneven terrain, cluttered backgrounds, and waste materials make it challenging to distinguish objects of interest from irrelevant clutter.
  • Data Annotation Challenges: Annotating 3D data such as LiDAR point clouds for detection models is particularly challenging, as it is labor-intensive and requires expert knowledge of mining scenarios. Unlike 2D image annotation, identifying objects in 3D, especially in scenes involving overlapping objects or partial occlusions, demands significant effort. The annotation process is prone to human error and can significantly impact model performance [2,37].
  • Suboptimal Dataset Representation: Mining datasets often over-represent common objects, such as trucks, while under-representing rare but safety-critical classes, such as pedestrians and workers. This imbalance biases models toward frequently encountered classes at the expense of rarer, safety-critical objects, degrading model performance and limiting generalizability.
  • Inefficient Model Generalization: A significant challenge in underground mining is the variability in mine layout and equipment types across different mines. This makes models trained in one mine less effective in another. The dynamic nature of mining operations necessitates continuous model updates that can adapt to variable scenarios.
  • High Computational Complexity and Real-Time Constraints: High-resolution sensors generate terabytes of data daily, which require massive infrastructure for storage, robust computational efficiency, and real-time processing. Managing such large datasets while maintaining efficiency is a persistent issue [74]. Latency in data processing can potentially compromise the system’s effectiveness and lead to unsafe critical decisions [82,87].
  • Temporal Synchronization in Multi-sensor Models: Temporal synchronization is a significant challenge in multi-sensor 3D object detection models. Variability in sampling rates, operating modes, and data speed from different sensors causes a misaligned data stream and impairs fusion quality and detection accuracy. Delays in transmission and hardware limitations worsen synchronization challenges. The study by [113] highlights the critical issue of temporal misalignment in sensor fusion, particularly in 3D object detection scenarios. Their study demonstrates that even minor synchronization errors between sensors, such as LiDAR and cameras, can result in significant inaccuracies in object detection, underscoring the need for robust synchronization mechanisms.

5.2. Proposed Strategies for Dataset Optimization and Model Robustness in Underground

The following strategies are proposed to address these challenges:
  • Enhance Sensor Capabilities: Deploy robust sensors that withstand uneven terrain, dust, extreme vibrations, and temperature fluctuations to ensure consistent detection system performance. Integrate multi-sensor configurations by combining complementary sensing modalities. This will provide a robust and reliable detection system to enhance safety in underground mining operations.
  • Advanced Data Preprocessing and Augmentation Techniques: Effective preprocessing and data augmentation should be employed to simulate dust, noise, occlusion, and lighting variability [7,9,75]. Adding synthetic dust clouds, altering textures, and adjusting brightness can help models adapt to variable surface types and equipment, ensuring robust object detection under challenging underground conditions (a minimal augmentation sketch is given after this list).
  • Improved Data Synchronization: Solving temporal synchronization issues in multi-modal 3D object detection demands a combination of advanced computational methods and real-time data management strategies. Software-based interpolation, Kalman filtering, or deep learning-driven alignment frameworks offer flexible solutions for proper data synchronization. Caching strategies can also compensate for data transmission delays.
  • Effective Data Annotation: Leverage data annotation tools such as semi-automated tools, active learning, and pre-trained models [83]. This will reduce manual annotation effort and improve labeling accuracy by minimizing human error and maximizing efficiency. Additionally, domain-specific annotation guidelines can ensure consistency.
  • Optimize Dataset Representation: Accurate object detection requires balanced datasets. Applying class weighting, oversampling, and Generative Adversarial Network (GAN)-based synthetic data generation will balance rare and common object instances in datasets. This approach ensures that objects like pedestrians and/or rare but critical events receive more attention during training. This improves model robustness in accurately detecting safety-critical objects in real-world scenarios.
  • Improvement in Model Generalization: Employ domain adaptation techniques such as transfer learning and adversarial training to improve models’ generalizability in cross-site applications. Consistent and continuous fine-tuning based on environmental feedback and characteristics enhances model adaptation.
  • Edge Processing and Optimized Data Handling: Efficient data handling is crucial in managing large data volumes. Use efficient compression techniques to reduce the size of datasets without sacrificing critical features and integrity. Employing edge computing to optimize data will enable real-time and on-site data preprocessing. This reduces latency and bandwidth usage, enhancing the system’s ability to operate in real time. It will reduce data transmission time and minimize the load on central systems, allowing for quicker decision-making.
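As a concrete example of the augmentation strategy listed above, the following sketch roughly approximates dust haze, dim lighting, and sensor noise on an RGB frame. The blend weights and noise level are illustrative and not tuned to any real mine.

```python
import numpy as np

def simulate_underground_conditions(image, dust=0.3, brightness=0.6, noise_sigma=8.0):
    """Apply a grey dust veil, dim the scene, and add Gaussian sensor noise."""
    img = image.astype(np.float32)
    img = (1 - dust) * img + dust * 180.0        # blend toward a grey haze value
    img = img * brightness                       # emulate dim artificial lighting
    img = img + np.random.normal(0.0, noise_sigma, img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)   # stand-in image
augmented = simulate_underground_conditions(frame)
```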

6. Key Challenges and Future Directions in 3D Detection Systems for Underground Mines

6.1. Challenges in the Underground Environment

The application of 3D object detection systems in underground mining environments is complex due to the unique and extreme conditions. These challenges can be categorized into environmental constraints, real-time operational intricacies, and computational limitations. Overcoming these barriers, especially with sensor fusion, real-time processing, detection accuracy, and pedestrian safety, will require further advancements in AI/ML, sensor technology, and edge computing. Future 3D detection systems in autonomous trucks must be intelligent, robust, and adaptive enough to operate safely and efficiently in unpredictable, diverse underground mining environments.
  • Environmental Challenges. Underground environments present near-zero natural light, which requires reliance on artificial illumination. This condition introduces variable lighting, low-light noise, shadows, and glare, adversely impacting vision-based systems. The prevalence of low-light imaging noise and incomplete data under artificial lighting conditions affects the performance of object detectors. IR modalities improve visibility but are limited in range and resolution. Dust, airborne particulates, and smoke generated by drilling, blasting, and other unit mining operations interfere with sensor signals [4]. These conditions degrade the quality of LiDAR point cloud data, leading to degraded object detection. Although radar sensors are less affected by these environmental factors, they lack resolution, necessitating multi-sensor fusion techniques to ensure robust detection. Creating fail-safes, redundancy in sensor data, and methods for ensuring continued detection capability are essential for reliable performance.
  • Dynamic and Irregular Obstacles. Continuous movement of machinery and other objects creates a dynamic and unpredictable detection environment. Debris, uneven terrain, and narrow layouts can lead to misclassifications due to their resemblance to natural geological features in sensor data. Advanced semantic segmentation and computationally efficient ML-based classifiers are needed to address these challenges.
  • Equipment Blind Spots and Sensor Occlusion. Large mining trucks have extensive blind spots, particularly around corners and confined spaces, and occlusions from structures or materials often obscure key objects. This increases the risk of undetected obstacles, raising the likelihood of collisions causing fatalities. Multi-view and re-identification methods for occluded objects have shown improved continuity in detection. However, these solutions often introduce significant computational overheads, limiting real-time performance implementation in underground environments.
  • Sensor Data Integration and Overload. High-resolution LiDAR generates millions of data points per second, requiring advanced data fusion algorithms to integrate and process information from multiple sensors efficiently. Real-time data fusion across LiDAR, IR, and camera presents a substantial computational load. Managing data overload is critical to ensuring timely and accurate object detection.
  • Real-Time Latency Challenges. The need for split-second decision-making in autonomous trucks operating in high-risk environments means delays are unacceptable. Edge computing systems, which process data locally to reduce latency, have demonstrated effectiveness in mitigating this issue. However, challenges remain in the trade-offs between model complexity and detection accuracy. While YOLO algorithms have shown potential for enhanced 3D object detection, they are often challenged by low light, clutter, small objects, and dynamic underground mine settings.
  • Efficiency vs. Accuracy Trade-Offs. Large models give high accuracy but are computationally intensive and unsuitable for resource-constrained environments. Lightweight models such as YOLO-Nano have high speed but may lack robustness for detecting small or partially occluded objects [114].
  • Generalization Across Mining Sites. Every underground mining environment has unique variability in tunnel geometry, layout, infrastructure, machinery, and operational processes. Detection models trained in one site may not be scalable in another, which poses a significant obstacle to generalizing 3D object detection systems. Due to these variations, domain shifts cause models to fail to perform effectively on another site.

6.2. Future Research Directions

Three-dimensional object detection remains crucial for both safety and efficiency. While significant progress has been made, challenges remain in optimizing algorithms, improving sensor fusion, and enabling real-time decision-making in dynamic underground environments. The following research directions are proposed to push the frontier of autonomous haulage detection systems:
  • Multi-Sensor Fusion and Edge Computing: Integrating data from sensors like LiDAR, IR, and cameras with edge computing is needed to reduce latency and improve real-time processing. Enhanced fusion techniques that combine high-resolution LiDAR data with camera visual information could provide more detailed and accurate object detection. Additionally, by processing data locally, autonomous trucks can make real-time decisions without relying on external servers, improving response times and operational efficiency.
  • Collaborative Detection Systems: Future autonomous trucks may not operate in isolation but as part of a larger network of autonomous systems, including other trucks, drones, and support equipment. Collaboration between these systems to share detection data and build shared scene understanding can improve object detection. Multi-agent communication and coordination could improve detection capabilities, object tracking, awareness, and cooperative navigation.
  • Design Lightweight and Real-time DL Models: Research into the development of optimized lightweight models for embedded GPUs in autonomous trucks. Advancements in pruning and quantization can minimize model size without significantly compromising detection accuracy.
  • Real-Time Object Classification and Localization: Improvements should be made in detection, real-time object classification, and 3D localization. Improved situational awareness enhances safe navigation and operational efficiency in underground mining operations.
  • Development of Standardized, Open-Access Datasets: More effort is needed to develop large-scale, labeled, open-access datasets that reflect underground mining conditions, which will support research into more robust models for safe autonomous truck navigation. Simulation data combined with real underground mining footage will provide efficient training benchmarks.
  • Advanced AI and ML Models: Leveraging cutting-edge AI techniques, such as transfer learning, reinforcement learning, or deep reinforcement learning, could significantly enhance the adaptability and robustness of 3D object detection systems. These techniques could enable models to learn from diverse datasets and improve detection capabilities, even in challenging environments such as underground mines. More sophisticated and newer versions of YOLO, such as YOLOv8, can be integrated to improve detection speed and accuracy for maximum system performance. Developing domain-adaptive, site-specific fine-tuning models capable of learning and generalizing across different environments requires sustained research focus.
  • Field Validation of Object Detection Systems: Validation of detection systems in operational underground sites to evaluate the model’s real-world performance. Additionally, safety-critical metrics should be incorporated into the evaluation process to assess the developed models’ real-world capabilities and safety impact.
  • Regulatory and Ethical Compliance: As autonomous trucks become more recognized in the mining industry, ensuring that object detection systems meet safety standards and regulatory requirements will be essential. Developing frameworks that align detection systems with safety regulations and ethical standards is crucial. This will ensure transparency in model design and performance benchmarking, which is key to operational and public trust.
  • Long-Term Robustness and Reliability: Long-term deployment of autonomous trucks in underground mines will require systems that can withstand harsh conditions, such as exposure to dust, vibration, and moisture. Ensure long-term stability of systems through real-time health monitoring, regular sensor updates, and robustness against sensor drift and wear. This will be essential to ensure continued safety and model performance.
Addressing these challenges and leveraging emerging research directions will enable the mining industry to accelerate the safe and efficient implementation of autonomous truck haulage in underground operations. This will mark a transformative growth in automation in the industry.

7. Discussions

The application of 3D detection systems in underground autonomous trucks presents critical challenges and opportunities distinct from other domains such as urban driving and indoor robotics. The underground mining environment is characterized by complex terrain, occlusion, high particulate matter, limited visibility, and dynamic equipment–worker interactions. It therefore necessitates specialized sensor configurations and robust perception models capable of operating in these extreme conditions. This paper reveals that multi-modal sensor fusion, especially combining LiDAR and RGB cameras, provides the most robust perception capabilities. While the camera contributes contextual texture data, LiDAR offers high-resolution depth information. When fused effectively, they compensate for each other’s limitations, especially under low-visibility or high-dust conditions. Mid-level fusion is demonstrated to be the most balanced fusion approach, providing efficient feature integration while maintaining sufficient data richness for real-time object detection. Deep learning algorithms, particularly YOLO-based frameworks, have shown strong performance in object detection tasks. Architectures such as YOLOv5 and YOLOv8 demonstrate real-time detection capabilities with significant mean Average Precision (mAP). However, small object detection and a lack of large-scale annotated datasets often hinder their performance in the underground environment. The following key trade-offs were identified:
  • Accuracy vs. Speed: It was noticed that high-accuracy models like YOLOv8 often demonstrate slower inference times. This can be problematic for real-time underground navigation.
  • Sensor Cost vs. Redundancy: Multi-sensor models improve robustness. However, they also increase hardware costs and integration complexity.
  • Fusion Complexity vs. Benefit: The early fusion approach provides detailed insights and features but is highly computationally expensive. Late fusion is computationally efficient but is less accurate.
Benchmarking across diverse datasets shows limited consistency in model evaluation, making it difficult to compare models directly. This underscores the need for standardized, mining-specific datasets and benchmarking protocols for efficient performance evaluation.

8. Conclusions

This review has comprehensively surveyed 3D object detection systems tailored explicitly for underground mining autonomous truck navigation. The review uniquely bridges the research gap between urban autonomous systems and the specific challenges of underground truck navigation. This study evaluates the current state of sensor modalities, detection algorithms, and fusion techniques. Additionally, it highlights the unique constraints posed by underground complex conditions. Though technologies such as LiDAR, RGB cameras, and thermal sensors have proven individual strengths, integrating these sensors provides a promising path forward. Deep learning architecture, particularly YOLO-based object detectors, has demonstrated strong potential in real-time detection. However, challenges persist, including occlusion handling, limited underground datasets, computational overhead, and performance trade-offs. While significant strides have been made in 3D object detection for autonomous trucks in mining, ongoing innovation and research are essential to overcoming the persisting challenges. These advancements will enhance operational efficiency and play a crucial role in safeguarding the well-being of workers.

Author Contributions

S.F.: Proposed the idea, supervision, and final review of the draft. E.E.: Wrote the manuscript and edited it. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Centers for Disease Control and Prevention and National Institute for Occupational Safety and Health (CDC-NIOSH-U60OH012685-01-00).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this study.

Acknowledgments

Institutions and individuals’ invaluable contributions and support made the comprehensive assessment possible. I am profoundly grateful for the unwavering guidance, encouragement, and insightful feedback my advisor, Samuel Frimpong, provided during this research. I also recognize the contributions of the research community, whose pioneering work in artificial intelligence, machine learning, and object detection has established the foundation for this study. I particularly appreciate my colleagues and peers in the mining engineering program for their constructive critiques and thoughtful discussions, which have substantially enhanced this review. Finally, I profoundly appreciate the exceptional opportunity that the Centers for Disease Control and Prevention and National Institute of Occupational Safety and Health (CDC-NIOSH) has afforded me to finance my PhD program. Their commitment to advancing research will improve the safety of mining operations. This work aims to enhance the safety and efficiency of autonomous haulage systems in the underground mining sector. I am continually motivated by the collaborative efforts of researchers and practitioners striving to achieve this shared goal.

Conflicts of Interest

The authors have no conflicts of interest to declare. The funder did not play a role in designing the study, collecting, analyzing, or interpreting data, writing the manuscript, or making the decision to publish the findings.

References

  1. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D Object Detection for Autonomous Driving: A Comprehensive Survey. arXiv 2022, arXiv:2206.09474. [Google Scholar] [CrossRef]
  2. Tang, L.; Bi, J.; Zhang, X.; Li, J.; Wang, L.; Yang, L.; Song, Z.; Wei, H.; Zhang, G.; Zhao, L.; et al. Multi-Modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798. [Google Scholar] [CrossRef]
  3. Yuan, Q.; Chen, X.; Liu, P.; Wang, H. A review of 3D object detection based on autonomous driving. Vis. Comput. 2024, 41, 1757–1775. [Google Scholar] [CrossRef]
  4. Xu, T.; Tian, B.; Wang, G.; Wu, J. 3D Vehicle Detection with RSU LiDAR for Autonomous Mine. IEEE Trans. Veh. Technol. 2021, 70, 344–355. [Google Scholar] [CrossRef]
  5. Yu, G.; Chen, P.; Zhou, B.; Zhao, F.; Wang, Z.; Li, H.; Gong, Z. 3DSG: A 3D LiDAR-Based Object Detection Method for Autonomous Mining Trucks Fusing Semantic and Geometric Features. Appl. Sci. 2022, 12, 12444. [Google Scholar] [CrossRef]
  6. Peng, P.; Pan, J.; Xi, M.; Chen, L.; Zhao, Z. A Novel Obstacle Detection Method in Underground Mines Based on 3D LiDAR. IEEE Access 2024, 12, 106685–106694. [Google Scholar] [CrossRef]
  7. Garg, S.; Rajaram, G.; Sundar, R.; Murugan, P.; Dakshinamoorthy, P.; Manimaran, A. Artificial Intelligence Algorithms for Object Detection and Recognition In video and Images. Multimed. Tools Appl. 2025, 83, 1–18. [Google Scholar] [CrossRef]
  8. Lee, K.; Dai, Y.; Kim, D. An Advanced Approach to Object Detection and Tracking in Robotics and Autonomous Vehicles Using YOLOv8 and LiDAR Data Fusion. Electronics 2024, 13, 2250. [Google Scholar] [CrossRef]
  9. Roghanchi, P.; Shahmoradi, J.; Talebi, E.; Hassanalian, M. A comprehensive review of applications of drone technology in the mining industry. Drones 2020, 4, 34. [Google Scholar] [CrossRef]
  10. Awuah-Offei, K.; Nadendla, V.S.S.; Addy, C. YOLO-Based Miner Detection Using Thermal Images in Underground Mines. Min. Met. Explor. 2025, 1–18. [Google Scholar] [CrossRef]
  11. Somua-Gyimah, G.; Frimpong, S.; Gbadam, E. A Computer Vision System for Terrain Recognition and Object Detection Tasks in Mining and Construction Environments. 2018. Available online: https://www.researchgate.net/publication/330130008 (accessed on 15 December 2024).
  12. Rao, T.; Xu, H.; Pan, T. Pedestrian Detection Model in Underground Coal Mine Based on Active and Semi-supervised Learning. In Proceedings of the 2023 8th International Conference on Signal and Image Processing (ICSIP), Wuxi, China, 8–10 July 2023; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 104–108. [Google Scholar]
  13. Yu, Z.; Li, Z.; Sana, M.U.; Bashir, R.; Hanif, M.W.; Farooq, S.A. A new network model for multiple object detection for autonomous vehicle detection in mining environment. IET Image Process. 2024, 18, 3277–3287. [Google Scholar] [CrossRef]
  14. Benz, P.; Montenegro, S.; Gageik, N. Obstacle detection and collision avoidance for a UAV with complementary low-cost sensors. IEEE Access 2015, 3, 599–609. [Google Scholar] [CrossRef]
  15. Pobar, M.; Ivasic-Kos, M.; Kristo, M. Thermal Object Detection in Difficult Weather Conditions Using YOLO. IEEE Access 2020, 8, 125459–125476. [Google Scholar] [CrossRef]
  16. Tang, K.H.D. Artificial Intelligence in Occupational Health and Safety Risk Management of Construction, Mining, and Oil and Gas Sectors: Advances and Prospects. J. Eng. Res. Rep. 2024, 26, 241–253. [Google Scholar] [CrossRef]
  17. Tripathy, D.P.; Ala, C.K. Identification of safety hazards in Indian underground coal mines. J. Sustain. Min. 2018, 17, 175–183. [Google Scholar] [CrossRef]
  18. Imam, M.; Baïna, K.; Tabii, Y.; Ressami, E.M.; Adlaoui, Y.; Benzakour, I.; Abdelwahed, E.H. The Future of Mine Safety: A Comprehensive Review of Anti-Collision Systems Based on Computer Vision in Underground Mines. Sensors 2023, 23, 4294. [Google Scholar] [CrossRef]
  19. Li, C.; Wang, H.; Li, J.; Yao, L.; Zhang, Z.; Tao, L. LDSI-YOLOv8: Real-time detection method for multiple targets in coal mine excavation scenes. IEEE Access 2024, 12, 132592–132604. [Google Scholar] [CrossRef]
  20. Li, P.; Li, C.; Yao, G.; Long, T.; Yuan, X. A Novel Method for 3D Object Detection in Open-Pit Mine Based on Hybrid Solid-State LiDAR Point Cloud. J. Sens. 2024, 2024, 5854745. [Google Scholar] [CrossRef]
  21. Velastin, S.A.; Salmane, P.H.; Velázquez, J.M.R.; Khoudour, L.; Mai, N.A.M.; Duthon, P.; Crouzil, A.; Pierre, G.S. 3D Object Detection for Self-Driving Cars Using Video and LiDAR: An Ablation Study. Sensors 2023, 23, 3223. [Google Scholar] [CrossRef]
  22. Zhang, P.; He, L.; Lin, X. A New Literature Review of 3D Object Detection on Autonomous Driving. J. Artif. Intell. Res. 2025, 82, 973–1015. [Google Scholar] [CrossRef]
  23. Wang, Y.; Wang, S.; Li, Y.; Liu, M. A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions. arXiv 2024, arXiv:2408.16530. [Google Scholar]
  24. Jiang, P.; Wang, J.; Song, L.; Li, J.; Xu, X.; Dong, S.; Ding, L.; Xu, T. FusionRCNN: LiDAR-Camera Fusion for Two-Stage 3D Object Detection. Remote. Sens. 2023, 15, 1839. [Google Scholar] [CrossRef]
  25. Fu, Z.; Ling, J.; Yuan, X.; Li, H.; Li, H.; Li, Y. Yolov8n-FADS: A Study for Enhancing Miners’ Helmet Detection Accuracy in Complex Underground Environments. Sensors 2024, 24, 3767. [Google Scholar] [CrossRef] [PubMed]
  26. Agrawal, P.; Das, B. Object Detection for Self-Driving Car in Complex Traffic Scenarios. MATEC Web Conf. 2024, 393, 04002. [Google Scholar] [CrossRef]
  27. Ogunrinde, I.; Bernadin, S. Deep Camera–Radar Fusion with an Attention Framework for Autonomous Vehicle Vision in Foggy Weather Conditions. Sensors 2023, 23, 6255. [Google Scholar] [CrossRef]
  28. Ren, Z. Enhanced YOLOv8 Infrared Image Object Detection Method with SPD Module. 2024. Available online: https://woodyinternational.com/index.php/jtpet/article/view/21 (accessed on 29 March 2025).
  29. Ruiz-Del-Solar, J.; Parra-Tsunekawa, I.; Inostroza, F. Robust Localization for Underground Mining Vehicles: An Application in a Room and Pillar Mine. Sensors 2023, 23, 8059. [Google Scholar] [CrossRef]
  30. Chahal, M.; Poddar, N.; Rajpurkar, A.; Joshi, G.P.; Kumar, N.; Cho, W.; Parekh, D. A Review on Autonomous Vehicles: Progress, Methods and Challenges. Electronics 2022, 11, 2162. [Google Scholar] [CrossRef]
  31. Liu, Q.; Cui, Y.; Liu, S. Navigation and positioning technology in underground coal mines and tunnels: A review. J. South Afr. Inst. Min. Met. 2021, 121, 295–303. [Google Scholar] [CrossRef]
  32. Pira, E.; Sorlini, A.; Patrucco, M.; Pentimalli, S.; Nebbia, R. Anti-collision systems in tunneling to improve effectiveness and safety in a system-quality approach: A review of the state of the art. Infrastructures 2021, 6, 42. [Google Scholar] [CrossRef]
  33. Bao, J.; Yin, Y.; Wang, M.; Yuan, X.; Khalid, S. Research Status and Development Trend of Unmanned Driving Technology in Coal Mine Transportation. Energies 2022, 15, 9133. [Google Scholar] [CrossRef]
  34. Liang, L.; Du, Y.; Zhang, H.; Song, B.; Zhang, J. Applications of Machine Vision in Coal Mine Fully Mechanized Tunneling Faces: A Review. IEEE Access 2023, 11, 102871–102898. [Google Scholar] [CrossRef]
  35. Wang, K.; Ren, F.; Zhou, T.; Li, X. Performance and Challenges of 3D Object Detection Methods in Complex Scenes for Autonomous Driving. IEEE Trans. Intell. Veh. 2022, 8, 1699–1716. [Google Scholar] [CrossRef]
  36. Banerjee, A.; Contreras, M.; Jain, A.; Bhatt, N.P.; Hashemi, E. A survey on 3D object detection in real time for autonomous driving. Front. Robot. AI 2024, 11, 1212070. [Google Scholar] [CrossRef]
  37. Mao, Z.; Tang, Y.; Wang, H.; Wang, Y.; He, H. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
  38. Hao, G.; Zhang, K.; Li, Z.; Zhang, R. Unmanned aerial vehicle navigation in underground structure inspection: A review. Geol. J. 2023, 58, 2454–2472. [Google Scholar] [CrossRef]
  39. Jiao, W.; Li, L.; Xu, X.; Zhang, Q. Challenges of Autonomous Navigation and Perception Technology for Unmanned Special Vehicles in Underground Mine. In Proceedings of the 2023 6th International Symposium on Autonomous Systems (ISAS), Nanjing, China, 23–25 June 2023; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 1–6. [Google Scholar]
  40. Tau 2|Teledyne FLIR. Available online: https://www.flir.fr/products/tau-2/ (accessed on 26 April 2025).
  41. IEEE Xplore Full-Text PDF. Available online: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6072167 (accessed on 4 May 2025).
  42. Worsa-Kozak, M.; Szrek, J.; Wodecki, J.; Zimroz, R.; Góralczyk, M.; Michalak, A. Application of the infrared thermography and unmanned ground vehicle for rescue action support in underground mine—The AMICOS project. Rem. Sens. 2020, 13, 69. [Google Scholar] [CrossRef]
  43. Wei, X.; Yuan, X.; Dai, X. TIRNet: Object detection in thermal infrared images for autonomous driving. Appl. Intell. 2020, 51, 1244–1261. [Google Scholar] [CrossRef]
  44. Parasar, D.; Kazi, N. Human identification using thermal sensing inside mines. In Proceedings of the 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 6–8 May 2021; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 608–615. [Google Scholar]
  45. Green, J.J.; Dickens, J.S.; van Wyk, M.A. Pedestrian detection for underground mine vehicles using thermal images. In Proceedings of the IEEE AFRICON Conference, Livingstone, Zambia 13–15 September 2011; pp. 1–6. [Google Scholar]
  46. Papachristos, C.; Khattak, S.; Mascarich, F. Autonomous Navigation and Mapping in Underground Mines Using Aerial Robots; IEEE: Piscateway, NJ, USA, 2019. [Google Scholar]
  47. Wang, S.; Liu, Q.; Ye, H.; Xu, Z. YOLOv8-CB: Dense Pedestrian Detection Algorithm Based on In-Vehicle Camera. Electronics 2024, 13, 236. [Google Scholar] [CrossRef]
  48. Apoorva, M.; Shanbhogue, N.M.; Hegde, S.S.; Rao, Y.P.; Chaitanya, L. RGB Camera Based Object Detection and Object Co-ordinate Extraction. In Proceedings of the 2022 IEEE 7th International Conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2022; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 1–5. [Google Scholar]
  49. Rahul; Nair, B.B. Camera-based object detection, identification and distance estimation. In Proceedings of the 2018 2nd International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE), Ghaziabad, India, 20–21 September 2018; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 203–205. [Google Scholar]
  50. Gu, J.; Guo, J.; Liu, H.; Lou, H.; Duan, X.; Bi, L.; Chen, H. DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  51. Baïna, K.; Imam, M.; Tabii, Y.; Benzakour, I.; Adlaoui, Y.; Ressami, E.M.; Abdelwahed, E.H. Anti-Collision System for Accident Prevention in Underground Mines using Computer Vision. In Proceedings of the ICAAI 2022: 2022 The 6th International Conference on Advances in Artificial Intelligence, Birmingham, UK, 21–23 October 2022; Association for Computing Machinery: New York, NY, USA; pp. 94–101. [Google Scholar]
  52. Fidler, S.; Philion, J. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. arXiv 2020, arXiv:2008.05711. [Google Scholar]
  53. Huang, K.; Zhou, R.; Cai, F.; Li, S. Detection of Large Foreign Objects on Coal Mine Belt Conveyor Based on Improved. Processes 2023, 11, 2469. [Google Scholar] [CrossRef]
  54. Baltes, R.; Clausen, E.; Uth, F.; Polnik, B.; Kurpiel, W.; Kriegsch, P. An innovative person detection system based on thermal imaging cameras dedicate for underground belt conveyors. Min. Sci. 2019, 26, 263–276. [Google Scholar] [CrossRef]
  55. Yin, G.; Geng, K.; Wang, Z.; Li, S.; Qian, M. MVMM: Multiview Multimodal 3-D Object Detection for Autonomous Driving. IEEE Trans. Ind. Inform. 2023, 20, 845–853. [Google Scholar] [CrossRef]
  56. Chung, M.; Seo, S.; Ko, Y. Evaluation of Field Applicability of High-Speed 3D Digital Image Correlation for Shock Vibration Measurement in Underground Mining. Rem. Sens. 2022, 14, 3133. [Google Scholar] [CrossRef]
  57. Zhou, Z.; Geng, Z.; Xu, P. Safety monitoring method of moving target in underground coal mine based on computer vision processing. Sci. Rep. 2022, 12, 17899. [Google Scholar]
  58. Mitsunaga, T.; Nayar, S. High Dynamic Range Imaging: Spatially Varying Pixel Exposures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000, Hilton Head, SC, USA, 15 June 2000; pp. 472–479. [Google Scholar]
  59. Wang, X.; Pan, H.; Guo, K.; Yang, X.; Luo, S. The evolution of LiDAR and its application in high precision measurement. IOP Conf. Ser. Earth Environ. Sci. 2020, 502, 012008. [Google Scholar] [CrossRef]
  60. Jia, J.; Shen, X.; Yang, Z.; Sun, Y.; Liu, S. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019–2 November 2019; pp. 1951–1960. [Google Scholar]
  61. Nguyen, H.T.; Lee, E.-H.; Bae, C.H.; Lee, S. Multiple object detection based on clustering and deep learning methods. Sensors 2020, 20, 4424. [Google Scholar] [CrossRef]
  62. An, S.; Lee, S.E.; Oh, J.; Lee, S.; Kim, R. Point Cloud Clustering System with DBSCAN Algorithm for Low-Resolution LiDAR. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA; pp. 1–2. [Google Scholar]
  63. Abudayyeh, O.; Awedat, K.; Chabaan, R.C.; Abdel-Qader, I.; El Yabroudi, M. Adaptive DBSCAN LiDAR Point Cloud Clustering For Autonomous Driving Applications. In Proceedings of the 2022 IEEE International Conference on Electro Information Technology (eIT), Mankato, MN, USA, 19–21 May 2022; pp. 221–224. [Google Scholar]
  64. Zhu, M.; Tian, C.; Gong, Y.; Zhu, Z. A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends. Drones 2024, 8, 412. [Google Scholar] [CrossRef]
  65. Urtasun, R.; Luo, W.; Yang, B. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  66. Li, B.; Mao, Y.; Yan, Y. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  67. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  68. Moon, J.; Park, G.; Koh, J.; Kim, J.; Choi, J.W. LiDAR-Based 3D Temporal Object Detection via Motion-Aware LiDAR Feature Fusion. Sensors 2024, 24, 4667. [Google Scholar] [CrossRef]
  69. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  70. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  71. Yang, J.; Lang, A.H.; Zhou, L.; Beijbom, O.; Caesar, H.; Vora, S. PointPillars: Fast Encoders for Object Detection from Point Clouds. 2019. Available online: https://github.com/nutonomy/second.pytorch (accessed on 8 March 2025).
  72. Yuan, X.; Liu, H.; Hu, Y.; Li, C.; Pan, W.; Long, T. A Detection and Tracking Method Based on Heterogeneous Multi-Sensor Fusion for Unmanned Mining Trucks. Sensors 2022, 22, 5989. [Google Scholar] [CrossRef] [PubMed]
  73. Wei, P.; Cagle, L.; Reza, T.; Ball, J.; Gafford, J. LiDAR and camera detection fusion in a real-time industrial multi-sensor collision avoidance system. Electronics 2018, 7, 84. [Google Scholar] [CrossRef]
  74. Remondino, F.; Zimroz, R.; Szrek, J.; Wodecki, J.; Blachowski, J.; Kujawa, P.; Trybała, P. MIN3D Dataset: MultI-seNsor 3D Mapping with an Unmanned Ground Vehicle. PFG-J. Photogramm. Rem. Sens. Geoinf. Sci. 2023, 91, 425–442. [Google Scholar] [CrossRef]
  75. Yang, W.; You, K.; Kim, Y.-I.; Li, W.; Xu, Z. Vehicle autonomous localization in local area of coal mine tunnel based on vision sensors and ultrasonic sensors. PLoS ONE 2017, 12, e0171012. [Google Scholar] [CrossRef]
  76. Li, X.; Sun, Y.; Zhang, L.; Liu, J.; Xu, Y. Research on Positioning and Tracking Method of Intelligent Mine Car in Underground Mine Based on YOLOv5 Algorithm and Laser Sensor Fusion. Sustainability 2025, 17, 542. [Google Scholar] [CrossRef]
  77. Nabati, M.R. Sensor Fusion for Object Detection and Tracking in Autonomous Vehicles. Available online: https://trace.tennessee.edu/utk_graddiss (accessed on 8 March 2025).
  78. Glowacz, A.; Haris, M. Navigating an Automated Driving Vehicle via the Early Fusion of Multi-Modality. Sensors 2022, 22, 1425. [Google Scholar] [CrossRef]
  79. Li, K.; Chehri, A.; Wang, X. Multi-Sensor Fusion Technology for 3D Object Detection in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2023, 25, 1148–1165. [Google Scholar] [CrossRef]
  80. Xu, Y.; Chen, Q.; Dai, Z.; Guan, Z.; Sun, F. Enhanced Object Detection in Autonomous Vehicles through LiDAR—Camera Sensor Fusion. World Electr. Veh. J. 2024, 15, 297. [Google Scholar] [CrossRef]
  81. Lang, A.H.; Helou, B.; Beijbom, O.; Vora, S. Point painting: Sequential fusion for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4603–4611. [Google Scholar]
  82. Tian, B.; Liu, B.; Qiao, J. Mine track obstacle detection method based on information fusion. J. Phys. Conf. Ser. 2022, 2229, 012023. [Google Scholar] [CrossRef]
  83. Barry, J.; Yeong, D.J.; Velasco-Hernandez, G.; Walsh, J. Sensor and sensor fusion technology in autonomous vehicles: A review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef]
  84. Jahromi, B.S.; Tulabandhula, T.; Cetin, S. Real-time hybrid multi-sensor fusion framework for perception in autonomous vehicles. Sensors 2019, 19, 4357. [Google Scholar] [CrossRef]
  85. Zhou, R.; Li, X.; Jiang, W. SCANet: A Spatial and Channel Attention based Network for Partial-to-Partial Point Cloud Registration. In Proceedings of the ICASSP: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary Telus Convention Center, Calgary, AB, Canada, 15–20 April 2018; Institute of Electrical and Electronics Engineers: New York, NY, USA, 2018. [Google Scholar]
  86. Navab, N.; Wachinger, C.; Roy, A.G. Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks. arXiv 2018, arXiv:1803.02579. [Google Scholar]
  87. Qiu, Z.; Martínez-Sánchez, J.; Arias-Sánchez, P.; Rashdi, R. External multi-modal imaging sensor calibration for sensor fusion: A review. Inf. Fusion 2023, 97, 101806. [Google Scholar] [CrossRef]
  88. Tao, F.; Fu, Z.; Wang, J.; Han, L.; Li, C. A Novel Multi-Object Tracking Framework Based on Multi-Sensor Data Fusion for Autonomous Driving in Adverse Weather Environments. IEEE Sens. J. 2025, 25, 16068–16079. [Google Scholar] [CrossRef]
  89. Ou, Y.; Qin, T.; Cai, Y.; Wei, R. Intelligent Systems in Motion. Int. J. Semantic Web Inf. Syst. 2023, 19, 1–35. [Google Scholar] [CrossRef]
  90. Zhao, G.; Yang, J.; Huang, Q.; Ge, S.; Gui, T.; Zhang, Y. Enhancement Technology for Perception in Smart Mining Vehicles: 4D Millimeter-Wave Radar and Multi-Sensor Fusion. IEEE Trans. Intell. Veh. 2024, 9, 5009–5013. [Google Scholar] [CrossRef]
  91. Raja, P.; Kumar, R.K.; Kumar, A. An Overview of Sensor Fusion and Data Analytics in WSNs. Available online: www.ijfmr.com (accessed on 5 June 2025).
  92. Zimmer, W.; Ercelik, E.; Zhou, X.; Ortiz, X.J.D.; Knoll, A. A Survey of Robust 3D Object Detection Methods in Point Clouds. arXiv 2022, arXiv:2204.00106. [Google Scholar]
  93. Hussain, M. YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  94. Stemmer, M.R.; Schneider, D.G. CNN-Based Multi-Object Detection and Segmentation in 3D LiDAR Data for Dynamic Industrial Environments. Robotics 2024, 13, 174. [Google Scholar] [CrossRef]
  95. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  96. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  97. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  98. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  99. Lu, D.; Xie, Q.; Wei, M.; Gao, K.; Xu, L.; Li, J. Transformers in 3D Point Clouds: A Survey. arXiv 2023, arXiv:2205.07417. [Google Scholar]
  100. Ni, Y.; Huo, J.; Hou, Y.; Wang, J.; Guo, P. Detection of Underground Dangerous Area Based on Improving YOLOV8. Electronics 2024, 13, 623. [Google Scholar] [CrossRef]
  101. Ouyang, Y.; Li, Y.; Gao, X.; Zhao, Z.; Zhang, X.; Zheng, Z.; Deng, X.; Ye, T. An adaptive focused target feature fusion network for detection of foreign bodies in coal flow. Int. J. Mach. Learn. Cybern. 2023, 14, 2777–2791. [Google Scholar] [CrossRef]
  102. Song, Z.; Zhou, M.; Men, Y.; Qing, X. Mine underground object detection algorithm based on TTFNet and anchor-free. Open Comput. Sci. 2024, 14. [Google Scholar] [CrossRef]
  103. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  104. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  105. Zhou, Y.; Zhang, Y. YOLOv5 Based Pedestrian Safety Detection in Underground Coal Mines. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; pp. 1700–1705. [Google Scholar]
  106. Zhao, D.; Wang, P.; Chen, W.; Cheng, G.; Yang, Y.; Su, G. Research on real-time perception method of key targets in the comprehensive excavation working face of coal mine. Meas. Sci. Technol. 2023, 35, 015410. [Google Scholar] [CrossRef]
  107. Li, Y.; Yan, H.; Li, D.; Wang, H. Robust Miner Detection in Challenging Underground Environments: An Improved YOLOv11 Approach. Appl. Sci. 2024, 14, 11700. [Google Scholar] [CrossRef]
  108. Wang, Z.; Guan, Y.; Liu, J.; Xu, T.; Chen, W.; Mu, H. Slim-YOLO-PR_KD: An efficient pose-varied object detection method for underground coal mine. J. Real-Time Image Process. 2024, 21, 160. [Google Scholar] [CrossRef]
  109. Cat® MineStarTM Command for Hauling Manages the Autonomous Ecosystem|Cat|Caterpillar. Available online: https://www.cat.com/en_US/news/machine-press-releases/cat-minestar-command-for-hauling-manages-the-autonomous-ecosystem.html (accessed on 5 June 2025).
  110. Autonomous Haulage System|Komatsu. Available online: https://www.komatsu.com/en-us/technology/smart-mining/loading-and-haulage/autonomous-haulage-system (accessed on 5 June 2025).
  111. Mining Fleet Management|Fleet Management System (FMS)|Production Monitoring and Control|Wenco Mining Systems. Available online: https://www.wencomine.com/our-solutions/mining-fleet-management (accessed on 5 June 2025).
  112. Autonomous Haulage System (AHS—Hitachi) Construction Machinery. Available online: https://www.hitachicm.com/global/en/solutions/solution-linkage/ahs/ (accessed on 5 June 2025).
  113. Lin, C.-C.; Chen, J.-J.; Von Der Bruggen, G.; Gunzel, M.; Teper, H.; Kuhse, D.; Holscher, N. Sync or Sink? The Robustness of Sensor Fusion against Temporal Misalignment. In Proceedings of the 2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS), Hong Kong, 13–16 May 2024; pp. 122–134. [Google Scholar]
  114. Jegham, N.; Koh, C.Y.; Abdelatti, M.; Hendawi, A. YOLO Evolution: A Comprehensive Benchmark and Architectural Review of YOLOv12, YOLO11, and Their Previous Versions. arXiv 2024, arXiv:2411.00201. [Google Scholar]
Figure 1. Structure of the review report.
Figure 2. Thermal infrared camera [40].
Figure 3. Thermal sensor for pedestrian detection in underground operations: (a) navigation system [46]; (b) pedestrian detection [18].
Figure 4. Thermal imagery using YOLO algorithm for underground pedestrian detection [42].
Figure 5. RGB camera [18].
Figure 6. Three-dimensional LiDAR sensor [18].
Figure 7. LiDAR point clouds for 3D detection [6].
Figure 8. VoxelNet processing of raw point cloud providing 3D detection results in underground environment [67].
Figure 9. PointPillars network framework [71].
Figure 10. Sensor-level fusion stages in multi-sensor fusion [77].
Figure 11. Multi-fusion-level methods [3].
Figure 12. Fusion strategy performance comparison.
Figure 13. YOLO architecture [93].
Figure 14. YOLOv8 detection model in an underground mining environment [19].
Table 1. Comparative review of prior literature on perception and navigation systems in underground environments.
| Review Paper | Focus Area | Limitations | Key Contributions in This Study |
| --- | --- | --- | --- |
| Imam et al. [18] | General object detection systems in underground mining | Focuses only on anti-collision frameworks | Covers DL-based perception systems and underground-specific design limitations |
| Cui et al. [31] | Underground mine positioning systems and algorithms | Limited discussion on fusion architecture comparison and underground constraints | Provides a comparative analysis across fusion levels with an underground mining focus |
| Patrucco et al. [32] | Anti-collision technologies and safety protocols in tunneling and underground construction | No discussion of dataset limitations or challenges in confined spaces | Highlights dataset gaps and 3D perception limitations in underground scenarios |
| Shahmoradi et al. [9] | Applications of drones across surface and underground mining operations | Not focused on object detection challenges or underground autonomous truck navigation | Narrows down to AHS and object perception in underground environments |
| Zhang et al. [38] | UAV-based navigation and mapping in underground mining inspections | Focuses only on UAVs for mapping and inspection | Addresses underground AHS detection models and dataset challenges |
| Xu et al. [39] | Sensing and navigation technologies for underground vehicle navigation | Does not delve into DL-based object detection | Discusses integration of DL-based 3D object detection and fusion strategies for underground autonomous trucks |
Table 2. Sensor fusion strategies in 3D object detection.
| Fusion Approach | Level | Advantages | Limitations |
| --- | --- | --- | --- |
| Early-Level Fusion | Raw data | Rich joint data representation [87]; fine-grained for small and obscured object detection; detailed spatial modeling; high-resolution outputs | Complex calibration; sensitive to noise, distortions, and misalignments; high memory and computation demands; requires precise spatial and temporal data alignment |
| Mid-Level Fusion | Feature level | Balances accuracy and efficiency; lower data volume [83]; adaptable to real-time models; works well with diverse sensor modalities | Requires accurate feature alignment and resolution compatibility; loses raw data details; complex architecture tuning |
| Late-Stage Fusion | Decision level | No raw data alignment needed; simple to implement; computationally efficient; effective as a redundancy layer; suitable for real-time detection applications | Cannot resolve early-stage mistakes [83]; heavy reliance on accurate individual sensor performance; not suitable for applications needing detailed raw-data integration in complex scenes |
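To make the distinctions in Table 2 concrete, the minimal Python sketch below shows where each strategy combines LiDAR and camera information in a toy pipeline. The feature extractor, detector head, and input arrays are illustrative stand-ins (not drawn from any cited framework), so the example only demonstrates the structural difference between raw-data, feature-level, and decision-level fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy inputs: a projected LiDAR depth channel and an RGB frame of the same size.
lidar_depth = rng.random((64, 64, 1))   # stand-in for a projected point-cloud channel
rgb_image   = rng.random((64, 64, 3))   # stand-in for a camera frame

def extract_features(x):
    """Stand-in feature extractor: global average per channel."""
    return x.mean(axis=(0, 1))

def detect(features):
    """Stand-in detector head: returns a pseudo 'object present' score."""
    return float(1 / (1 + np.exp(-features.sum())))

# Early (raw-data) fusion: concatenate raw modalities before any processing.
early_input = np.concatenate([rgb_image, lidar_depth], axis=-1)  # (64, 64, 4)
early_score = detect(extract_features(early_input))

# Mid-level (feature) fusion: extract features per modality, then combine them.
mid_features = np.concatenate([extract_features(rgb_image),
                               extract_features(lidar_depth)])
mid_score = detect(mid_features)

# Late (decision) fusion: run independent detectors, then merge their decisions.
late_score = 0.5 * (detect(extract_features(rgb_image)) +
                    detect(extract_features(lidar_depth)))

print(f"early={early_score:.3f}  mid={mid_score:.3f}  late={late_score:.3f}")
```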
Table 3. Comparison of YOLO and CNN DL models for object detection in underground mining environments.
| DL Approach | Advantages | Limitations |
| --- | --- | --- |
| CNN | Extracts hierarchical features effectively from image data; effective in noisy and low-light underground conditions; enhanced detection accuracy with multispectral data; effective for segmentation and complex object shapes | Computationally expensive, limiting real-time deployment; requires large, labeled datasets and long training times; difficult to deploy on embedded edge computers; prone to overfitting in complex and dynamic mine layouts |
| YOLO Series | Single-stage architecture enables fast inference for real-time autonomous navigation; well suited to tracking overlapping and moving objects; lightweight versions are suitable for deployment on GPU edge platforms; easy to fine-tune across variable mining datasets; latest YOLOv8 variants support transformer-based attention | Trade-offs in accuracy for small or occluded objects; reduced robustness without data augmentation or custom tuning; improved performance requires extensive hyperparameter tuning; limited generalization to unseen mining data; persistent challenges with sensor misalignment and fusion latency |
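As a concrete companion to the YOLO row in Table 3, the sketch below fine-tunes a lightweight YOLOv8 checkpoint on an underground dataset and runs inference on a tunnel frame using the Ultralytics Python API. The dataset file underground.yaml, the image path, the class names, and the training settings are hypothetical placeholders rather than configurations from the reviewed studies.

```python
# Minimal fine-tuning sketch with the Ultralytics YOLO API (pip install ultralytics).
# Dataset config, file paths, and hyperparameters below are illustrative placeholders.
from ultralytics import YOLO

# Start from a lightweight pretrained checkpoint suited to edge deployment.
model = YOLO("yolov8n.pt")

# Fine-tune on a hypothetical YOLO-format underground dataset (classes such as
# pedestrian, truck, and loader would be declared in underground.yaml).
model.train(data="underground.yaml", epochs=100, imgsz=640, batch=16)

# Run inference on a sample tunnel frame, keeping only confident detections.
results = model.predict(source="tunnel_frame.jpg", conf=0.4)
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```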
Table 4. Comparison of recent 3D detection models for underground applications, where x means no information.
| Model/Framework | Detection Algorithm | Sensor Modalities | mAP (%) | FPS | Limitations |
| --- | --- | --- | --- | --- | --- |
| LDSI-YOLOv8 [19] | YOLOv8n | RGB Camera | 91.4 | 82.2 | Limited scalability in other mining environments |
| YOLOv8 for Hazard Detection [100] | YOLOv8-based | RGB Camera | 99.5 | 45 | Limited robustness and generalization due to reliance on a small, self-constructed dataset |
| YOLO-UCM [105] | YOLOv5 | RGB Camera | 93.5 | 15 | Model trained on a simulated dataset |
| DDEB-YOLOv5s + StrongSORT [106] | YOLOv5s with StrongSORT | RGB Camera | 91.7 | 98 | High model complexity; requires significant computational resources; reduced speed (98 FPS) compared to lighter models |
| YOLOv11-based Model [107] | YOLOv11 | RGB Camera | 95.8 | 59.6 | Focuses mainly on personnel detection |
| Pedestrian Detection Model [51] | YOLOv5 (Deep Learning) | RGB Camera | 71.6 | x | Challenges with occlusion and detection in crowded scenes |
| Slim-YOLO-PR_KD [108] | YOLOv8s | RGB Camera | 92.4 | 67 | Scope limited to pedestrian detection |
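The mAP and FPS values in Table 4 were reported on different hardware and datasets, so they are best read as indicative rather than directly comparable. The short sketch below shows the usual way FPS is estimated for any detector: timing repeated inference on a fixed frame after a warm-up phase. The dummy detector and frame are placeholders to keep the example self-contained, not a re-measurement of the models listed above.

```python
import time
import numpy as np

def measure_fps(detector, frame, warmup=10, runs=100):
    """Estimate frames per second by timing repeated inference on one frame."""
    for _ in range(warmup):              # warm-up passes exclude start-up cost
        detector(frame)
    start = time.perf_counter()
    for _ in range(runs):
        detector(frame)
    elapsed = time.perf_counter() - start
    return runs / elapsed

# Placeholder detector and frame; substitute a real model and camera image.
dummy_detector = lambda img: img.mean()
dummy_frame = np.zeros((640, 640, 3), dtype=np.float32)

print(f"Estimated FPS: {measure_fps(dummy_detector, dummy_frame):.1f}")
```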
Table 5. Underground-specific datasets and their characteristics.
| Dataset Name | Sensor Type(s) | Objects Annotated | Environment | Limitations |
| --- | --- | --- | --- | --- |
| LDSI-YOLOv8 excavation scenes [19] | RGB Camera | Pedestrians | Underground coal mine | Limited scalability across diverse mining environments; scene specific |
| Thermal image set [54] | Thermal IR | Workers, conveyor loads | Real coal mine | Lacks scalability |
| YOLO-UCM [105] | RGB Camera | Pedestrians | Underground mines | Model trained on a simulated dataset; real underground variability may affect model performance |
| Real-time perception excavation dataset [106] | RGB Camera | Miners, equipment | Excavation working faces in coal mines | Generalization to highly dynamic or new tunnel layouts is untested |
| MANAGEM pedestrian detection model [51] | RGB Camera | Pedestrians | Underground coal mines | Sensitive to occlusion and crowded scenes |