1. Introduction
In diverse applications including robotic manipulation [1,2,3], augmented reality (AR) [4], and autonomous driving [5,6], 6D object pose estimation, which simultaneously estimates an object's 3D translation and rotation, plays a critical role. For robots to physically interact with surrounding objects, they must accurately perceive the translation and rotation of target objects. Accurate pose estimation is particularly important for precise manipulation in disaster-response sites or hazardous areas where direct human access is difficult and visual information is insufficient. For example, in critical tasks such as operating valves during a nuclear facility accident involving steam or leaking vapor, the success of robotic operations depends heavily on the reliability of pose estimation.
However, prior research on 6D pose estimation [7,8,9,10,11,12,13,14,15] has primarily been trained on data collected in environments with clear visibility [16,17]. These methods rely on sharp multi-modal cues such as texture, color, and patterns in RGB images, along with accurate depth measurements to infer geometric relationships. The scenarios described above, in contrast, involve localized aerosol conditions within the robot manipulation workspace, where these assumptions break down. Aerosols cause light scattering and absorption that degrade both RGB and depth sensors simultaneously: RGB images lose semantic information, while depth measurements suffer severe noise and data sparsity. Critically, aerosol concentration is often spatially non-uniform, producing varying degradation levels across regions and making it difficult to identify which measurements are reliable. Despite this significant impact on RGB-D perception, research in this domain remains limited due to the difficulty of collecting real-world datasets and synthesizing realistic sensor degradation.
To address the challenges of aerosol environments, Son et al. [18] proposed an approach that restores corrupted sensor data before performing pose estimation. Their step-by-step restoration pipeline first recovers degraded RGB images through a dehazing module and then employs depth completion to restore depth measurements. However, this restoration-based approach suffers from two fundamental limitations. First, due to the scarcity of aerosol training data, their method relies on large-scale foundation models for depth completion at every input frame, requiring substantial computational resources during inference. This severely constrains real-time robotic applications requiring rapid response, particularly in emergency response or dynamic manufacturing environments. Second, while depth completion generates dense depth maps, the restored depth lacks spatial reliability information: it remains unclear which regions are trustworthy and which contain significant noise. Since aerosol concentration is spatially non-uniform, distinguishing reliable regions from noisy ones is crucial for accurate pose estimation. To overcome these limitations, we propose a method that quantifies depth reliability and enables accurate, fast pose estimation in highly non-uniform aerosol environments without foundation model-based depth completion.
Our work is based on two key insights. First, we observe that the feature-level attention map generated during the dehazing process can serve as an indicator of depth reliability. The Color Attenuation Prior [19] establishes that aerosol concentration is related to color attenuation, and the dehazing module learns attention maps based on this luminance deviation to identify degraded regions. Since RGB and depth sensors capture the same scene through the same aerosol medium, regions with high aerosol concentration exhibit both RGB degradation and depth measurement errors. Thus, the attention map not only aids RGB restoration but also serves as a depth reliability indicator. We empirically validate this relationship in Section 3.1. Based on this insight, we integrate attention information into a standard point cloud to construct an Attention-Guided Point cloud (AGP). By distilling features from a clean network trained under normal conditions, the model learns to extract robust feature representations even in regions where depth reliability is degraded.
Second, aerosol environments pose a fundamental limitation in data availability, as realistically simulating aerosol-induced RGB degradation and depth sensor noise is physically difficult. Consequently, large-scale synthetic data generation [20,21] becomes challenging, restricting training to a small amount of real aerosol data. To address this constraint, we adopt a feature-level distillation strategy that transfers discriminative representations from a clean network (trained with large-scale synthetic and real data under normal conditions) to an aerosol network. This allows the aerosol network to learn effectively despite limited domain-specific data, and only the aerosol network is used during inference.
Experimental results demonstrate that RA6D effectively mitigates the impact of RGB and depth degradation in aerosol environments by leveraging reliability-aware depth signals to guide the network toward clean-domain representations, resulting in improved accuracy and faster inference. Comprehensive ablation studies further validate the contribution of each component. Moreover, the proposed reliability-guided adaptation enables RA6D to operate robustly even under severe sensor degradation and limited domain-specific data, highlighting its suitability for practical deployment in time-critical robotic applications.
The contributions of this work are summarized as follows:
- We propose RA6D, a reliability-aware framework for robust 6D pose estimation in aerosol environments by integrating depth reliability into the estimation process.
- We introduce an Attention-Guided Point cloud (AGP) representation that leverages attention maps from the dehazing module to encode reliability cues reflecting spatial degradation.
- We develop a feature distillation framework that transfers knowledge from a clean network trained on large-scale synthetic data to an aerosol network trained on limited real data, enabling efficient adaptation to aerosol environments.
2. Related Work
2.1. Instance-Level 6D Pose Estimation
Instance-level 6D object pose estimation predicts the 3D translation and rotation of a specific object with a given CAD model, playing a critical role in robotic manipulation and augmented reality. Methods are broadly categorized into RGB-based [7,8,9,10,11] and RGB-D-based [12,13,14,15] approaches depending on the input modality. RGB-based methods are relatively challenging, as they must estimate 3D pose from 2D images alone without depth information. These methods can be further divided into correspondence-based and direct regression approaches. Correspondence-based methods [7,8,9,10] establish 2D–3D correspondences between images and 3D models and then compute the pose via the Perspective-n-Point (PnP) algorithm. Direct regression methods [11] predict 6D pose parameters directly from RGB features but face limitations in generalization due to the nonlinearity of the rotation space. In contrast, RGB-D-based methods [12,13,14,15] achieve more robust and accurate pose estimation by converting depth information into point clouds and directly leveraging 3D geometric constraints. DenseFusion [12] and PVN3D [13] fuse RGB and point cloud features to predict pose, while FFB6D [15] effectively combines complementary information from both modalities through bidirectional fusion, achieving fast and robust pose estimation without ICP-based postprocessing. Our work builds upon the bidirectional fusion architecture of FFB6D.
However, existing RGB-D methods were designed assuming high-quality sensor data in normal environments. In aerosol environments, light scattering and absorption simultaneously degrade both RGB and depth sensors. Critically, aerosol concentration is spatially non-uniform, resulting in varying degradation levels across different regions and making it difficult to identify which regions are reliable and which are severely corrupted. Such simultaneous degradation of both modalities presents significant challenges for accurate pose estimation. While prior work has addressed individual challenges such as occlusion [22,23] or illumination variation [24], research on aerosol environments where RGB and depth degrade simultaneously and non-uniformly remains underexplored. To address these challenges, we integrate dehazing-based RGB restoration with attention-guided depth reliability quantification and feature distillation.
2.2. Image Dehazing
Recent studies have shown that visual perception systems relying on RGB observations suffer severe performance degradation in real-world environments under adverse weather conditions such as aerosols, raising concerns about robustness in practical deployment [25,26]. To restore RGB image degradation caused by aerosols, image dehazing techniques have been developed [27,28]. Dehazing research is broadly divided into physics-based and learning-based methods. Physics-based methods [19,29] are based on the following Atmospheric Scattering Model:

$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr),$$

where $I(x)$ is the observed degraded image, $J(x)$ is the clean image to be restored, $t(x)$ is the transmission map, and $A$ is the atmospheric light. These methods estimate $t(x)$ and $A$ to recover $J(x)$, assuming a uniform aerosol distribution in which the atmospheric parameters are treated as global constants. However, real-world aerosol environments exhibit non-uniform distributions, where aerosol concentration varies spatially, making global parameter-based modeling challenging.
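To make the roles of $t(x)$ and $A$ concrete, the following minimal sketch synthesizes a hazy image from a clean one under the scattering model; the exponential transmission $t(x) = e^{-\beta d(x)}$ with scattering coefficient `beta` is the usual homogeneous-medium assumption, not part of our pipeline.

```python
import numpy as np

def synthesize_haze(clean_rgb, depth, atmospheric_light=0.9, beta=1.2):
    """Apply the Atmospheric Scattering Model I = J*t + A*(1 - t).

    clean_rgb : (H, W, 3) float array in [0, 1], the clean image J.
    depth     : (H, W) float array of scene depth in meters.
    beta      : scattering coefficient; t(x) = exp(-beta * d(x)) is the
                standard homogeneous-medium assumption.
    """
    t = np.exp(-beta * depth)[..., None]              # transmission map t(x)
    hazy = clean_rgb * t + atmospheric_light * (1.0 - t)
    return np.clip(hazy, 0.0, 1.0)
```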
To overcome these limitations, recent learning-based methods [30,31,32,33,34] leverage deep neural networks to generate clean images directly in an end-to-end manner. Among these approaches, attention mechanisms have proven particularly effective. FFA-Net [31] combines multi-scale information through feature fusion attention, while DehazeFormer [32] captures global context with transformer-based attention. SCANet [33] effectively handles non-uniform aerosol distributions through attention mechanisms, and the generated attention maps can identify regions with varying degradation levels. Notably, SCANet learns these attention maps using Y-channel luminance deviation in the YCbCr color space. We adopt SCANet as our dehazing module, as this color-based attention learning serves as a key cue for our depth reliability quantification.
2.3. Clean-to-Aerosol Feature Distillation
Feature distillation transfers feature representations learned in one domain to another, mitigating domain shift problems [35,36]. It typically minimizes differences between intermediate layer feature maps of one network and corresponding feature maps of another, enabling transfer of learned discriminative representations. Recent work has demonstrated the effectiveness of leveraging clean-condition features to improve performance under degraded conditions in adverse weather environments. AWARDistill [37] improved 3D object detection in foggy conditions using clear-weather features, while Sunshine-to-Rainy [38] transferred sunny LiDAR features to rainy conditions to enhance detection robustness. More recent studies further extend this paradigm by incorporating degradation-aware and multi-modal distillation strategies, enabling more effective knowledge transfer across heterogeneous degradation types and sensing modalities [39,40]. These approaches successfully alleviate limited-data problems in target domains by transferring knowledge from clean to degraded domains.
While these feature distillation approaches have proven effective in adverse weather conditions, most focus on tasks such as 3D object detection [37,38,41] or semantic segmentation [42]. In aerosol environments, 6D pose estimation remains a relatively under-explored area. Unlike typical 6D pose estimation, which can be trained on large-scale synthetic data, generating realistic aerosol RGB-D training data is extremely challenging due to the difficulty of simulating complex physical phenomena such as light scattering and absorption. Moreover, aerosol concentration is distributed spatially non-uniformly, resulting in varying reliability across regions. To address these challenges, we transfer discriminative representations learned from large-scale synthetic data under clean conditions to aerosol environments through feature distillation, leveraging the attention generated during the dehazing process as spatial weights to adaptively handle non-uniform degradation.
3. Method
We propose RA6D, a framework for robust 6D object pose estimation in aerosol environments that integrates dehazing and pose estimation. Our framework addresses two key challenges: (1) overcoming the difficulty of generating realistic aerosol RGB-D training data through clean-to-aerosol feature distillation and (2) explicitly accounting for the degradation of depth quality under aerosol conditions through Attention-Guided Point cloud (AGP) and feature distillation.
Figure 1 illustrates the overall architecture of our method.
The framework consists of a dehazing module for RGB restoration, a feature distillation module, and an attention-guided RGB-D 6D pose estimation module. The RGB dehazing module restores RGB images, and the attention map generated during this process quantifies depth reliability and is combined with the depth point cloud to construct an AGP. The RGB-D 6D pose estimation module predicts 6D pose through bidirectional feature fusion from the restored RGB and AGP. To enhance robustness with limited aerosol data, the feature distillation module uses AGP during training to align aerosol network features with clean network features.
Unlike conventional preprocessing-based dehazing, our method quantifies depth reliability through the dehazing attention map and integrates it as an additional channel in the point cloud. The key idea is that attention values indicate the degree of aerosol degradation: higher attention values signify severe aerosol degradation in the region and low reliability of depth information. This AGP enables adaptive learning for spatially non-uniform aerosol distributions. The feature distillation process utilizes this quantified attention information as spatial weights. In heavily degraded regions, we leverage more discriminative representations from the clean network, while in less degraded regions, we utilize the aerosol network’s own learning capability, enabling adaptive learning for such environments.
3.1. Dehazing Module and Depth Reliability
To remove RGB image degradation caused by aerosols, we adopt an attention-based dehazing network [33] that effectively handles non-uniform aerosol distributions as our RGB dehazing module. This network identifies the spatial distribution of aerosols through attention mechanisms, forming the basis for our core idea of depth reliability quantification. Given an aerosol-degraded image $I_h$, the dehazing module simultaneously generates a restored image $\hat{J}$ and a spatial aerosol attention map $\mathcal{A}$:

$$(\hat{J}, \mathcal{A}) = F_{\text{dehaze}}(I_h),$$

where $F_{\text{dehaze}}$ is the dehazing network. The generated spatial aerosol attention map $\mathcal{A}$ represents the degree of aerosol degradation in each region. Regions with high attention values indicate areas where aerosol degradation is severe and the dehazing network performed intensive restoration.
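In code, the module's interface can be summarized as follows; the class and attribute names are hypothetical stand-ins for the adopted SCANet-style network, which we assume exposes the attention map alongside the restored image.

```python
import torch
import torch.nn as nn

class DehazingModule(nn.Module):
    """Hypothetical wrapper around an attention-based dehazing network:
    maps a hazy image to a restored image plus a spatial attention map
    interpreted as per-pixel degradation severity."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # assumed to return (restored, attention)

    def forward(self, hazy: torch.Tensor):
        # hazy: (B, 3, H, W) in [0, 1]
        restored, attention = self.backbone(hazy)
        # attention: (B, 1, H, W); higher values = heavier degradation
        return restored, attention
```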
Our key insight is that the attention map generated during dehazing can serve as an indicator of depth reliability. This is grounded in the Color Attenuation Prior (CAP) [19], which establishes that aerosol concentration is directly related to luminance and saturation changes, as follows:

$$c(x) \propto v(x) - s(x),$$

where $c(x)$ is aerosol concentration, $v(x)$ is brightness, and $s(x)$ is saturation. This relationship arises because aerosol scattering increases brightness while reducing saturation in affected regions. Building upon this principle, the dehazing network learns attention maps supervised by luminance deviation in the YCbCr color space, as follows:

$$\mathcal{A}_{gt}(x) = \operatorname{norm}\!\left(\lvert Y_h(x) - Y_c(x)\rvert\right),$$

where $Y_h$ and $Y_c$ denote the Y (luminance) channels of the aerosol-degraded and clean images, respectively, and $\operatorname{norm}(\cdot)$ rescales the deviation to $[0,1]$. Since luminance deviation directly reflects the brightness change caused by aerosol scattering, the learned attention captures aerosol concentration: $\mathcal{A}(x) \propto c(x)$.
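As a concrete illustration, the sketch below computes such a luminance-deviation target from a paired hazy/clean image; the BT.601 luma coefficients and the min–max form of norm(·) are standard choices we assume here, not details taken from SCANet.

```python
import numpy as np

def attention_target(hazy_rgb, clean_rgb, eps=1e-8):
    """Luminance-deviation supervision for the dehazing attention map.

    Both inputs: (H, W, 3) float arrays in [0, 1]. Returns an (H, W)
    map in [0, 1]; larger values mark more heavily degraded regions.
    """
    # Y channel of YCbCr (ITU-R BT.601 luma weights).
    w = np.array([0.299, 0.587, 0.114])
    y_hazy = hazy_rgb @ w
    y_clean = clean_rgb @ w
    dev = np.abs(y_hazy - y_clean)          # luminance deviation
    # Min-max normalization to [0, 1] (assumed form of norm(.)).
    return (dev - dev.min()) / (dev.max() - dev.min() + eps)
```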
Given that attention captures aerosol concentration, we further examine whether it can indicate depth reliability. Since RGB and depth sensors capture the same scene through the same aerosol medium, regions with high aerosol concentration affect both modalities. We therefore hypothesize that regions with high aerosol concentration exhibit both RGB degradation and depth measurement errors. The top of Figure 2 qualitatively illustrates this relationship: attention maps correspond well with regions showing severe depth noise and missing data. To validate this quantitatively, we computed the Spearman rank correlation between attention values and depth errors. As shown in Figure 2a, the Spearman correlation between attention values and depth errors across entire scenes is $\rho = 0.9091$. We further analyzed object regions separately, and Figure 2b shows consistent correlations across tested object categories ($\rho = 0.9394$ for valves, $\rho = 0.8500$ for household objects).
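The correlation analysis itself is straightforward to reproduce. A minimal sketch, assuming paired aerosol/clean depth maps of the same static scene and using absolute deviation from the clean reference as the depth error:

```python
import numpy as np
from scipy.stats import spearmanr

def attention_depth_correlation(attention, depth_aerosol, depth_clean):
    """Spearman rank correlation between dehazing attention and depth error.

    attention     : (H, W) attention map from the dehazing module.
    depth_aerosol : (H, W) depth captured under aerosol conditions.
    depth_clean   : (H, W) reference depth of the same static scene.
    """
    # Exclude missing returns (0) for simplicity, although depth dropout
    # is itself a symptom of aerosol degradation.
    valid = (depth_clean > 0) & (depth_aerosol > 0)
    depth_error = np.abs(depth_aerosol - depth_clean)[valid]
    rho, p_value = spearmanr(attention[valid], depth_error)
    return rho, p_value
```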
3.2. Attention-Guided Point Cloud
This section explains how we combine the dehazing attention map with the depth point cloud to construct the Attention-Guided Point cloud (AGP). Building upon the established relationship between luminance deviation and depth reliability (Section 3.1), the attention map serves as a guide for identifying reliable and unreliable regions of depth measurements, enabling the network to adaptively focus on regions based on their reliability.
Based on this relationship, we construct the AGP by integrating attention information into the point cloud representation. Given a depth image $D$, a restored RGB image $\hat{J}$, and an attention map $\mathcal{A}$, we construct a point cloud. While traditional RGB-D point clouds use 9-dimensional vectors (3D coordinates, RGB colors, and surface normals), we extend this to 10 dimensions by adding attention values as an additional channel, as follows:

$$p_i = \left[\,x_i,\; y_i,\; z_i,\; r_i,\; g_i,\; b_i,\; \mathbf{n}_i,\; a_i\,\right] \in \mathbb{R}^{10},$$

where $(x_i, y_i, z_i)$ are 3D coordinates, $(r_i, g_i, b_i)$ are colors, $\mathbf{n}_i \in \mathbb{R}^3$ is the normal vector, and $a_i$ is the attention value at the corresponding pixel location. Points with high attention values indicate regions with severe aerosol degradation and low depth reliability. By directly encoding this spatially non-uniform information into the point cloud, the subsequent feature extraction network can adaptively learn based on the reliability of each point. Finally, we obtain an AGP $P_{\text{AGP}} \in \mathbb{R}^{N \times 10}$.
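A minimal sketch of the AGP construction, assuming pinhole intrinsics (fx, fy, cx, cy) and precomputed surface normals; the variable names are ours:

```python
import numpy as np

def build_agp(depth, rgb, normals, attention, fx, fy, cx, cy):
    """Assemble an Attention-Guided Point cloud of shape (N, 10).

    depth     : (H, W) depth in meters (0 = missing).
    rgb       : (H, W, 3) restored colors in [0, 1].
    normals   : (H, W, 3) precomputed surface normals.
    attention : (H, W) dehazing attention in [0, 1].
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    # Pinhole back-projection to camera coordinates.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    xyz = np.stack([x, y, z], axis=1)                  # (N, 3)
    agp = np.concatenate(
        [xyz, rgb[valid], normals[valid], attention[valid][:, None]],
        axis=1,
    )                                                   # (N, 10)
    return agp
```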
3.3. 6D Pose Estimation Module
Given the restored RGB image $\hat{J}$ and the AGP $P_{\text{AGP}}$, the RGB-D 6D pose estimation module predicts the target object's 6D pose. We build upon the FFB6D [15] architecture, which bidirectionally fuses RGB and point cloud features. Unlike the original FFB6D, we use the Attention-Guided Point cloud as input to incorporate aerosol degradation information. The network consists of an RGB encoder $E_{\text{rgb}}$ [43] and a point cloud encoder $E_{\text{pcd}}$ [44], with bidirectional fusion modules added at each encoder layer for complementary fusion of RGB and point cloud features, as follows:

$$F_{\text{rgb}} = E_{\text{rgb}}(\hat{J}), \qquad F_{\text{pcd}} = E_{\text{pcd}}(P_{\text{AGP}}).$$
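To convey the flavor of the fusion step, the sketch below illustrates one direction (pixel-to-point): each point gathers the RGB feature at its projected pixel and the two features are fused; the symmetric point-to-pixel direction is analogous. The 1×1 convolution fusion and the precomputed pixel indices are simplifications of ours, not the exact FFB6D blocks.

```python
import torch
import torch.nn as nn

class PixelToPointFusion(nn.Module):
    """Simplified pixel-to-point fusion: each point gathers the RGB
    feature at its projected pixel index, then both are fused."""

    def __init__(self, c_rgb: int, c_pcd: int):
        super().__init__()
        self.fuse = nn.Conv1d(c_rgb + c_pcd, c_pcd, kernel_size=1)

    def forward(self, f_rgb, f_pcd, pix_idx):
        # f_rgb  : (B, C_rgb, H*W) flattened RGB feature map
        # f_pcd  : (B, C_pcd, N)   per-point features
        # pix_idx: (B, N) long tensor of each point's flattened pixel index
        idx = pix_idx.unsqueeze(1).expand(-1, f_rgb.size(1), -1)  # (B,C_rgb,N)
        gathered = torch.gather(f_rgb, dim=2, index=idx)          # (B,C_rgb,N)
        return self.fuse(torch.cat([gathered, f_pcd], dim=1))     # (B,C_pcd,N)
```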
Through these fusion modules, complementary information is exchanged between the RGB and point cloud modalities at every encoder layer. Using the fused features, we detect 3D keypoints of the target object for 6D pose estimation. The keypoint detection module predicts 3D offsets from each point to the $K$ predefined keypoints and is trained with an L1 loss, as follows:

$$\mathcal{L}_{kp} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} \left\lVert \mathit{of}_i^{\,j} - \mathit{of}_i^{\,j*} \right\rVert_1,$$

where $\mathit{of}_i^{\,j}$ is the predicted offset and $\mathit{of}_i^{\,j*}$ is the ground-truth offset. This keypoint loss is later utilized as task-specific supervision in the feature distillation of Section 3.4. The predicted offsets are aggregated into final keypoint locations through clustering. The rotation $R$ and translation $t$ are then computed through least-squares fitting [45] between the detected keypoints and the CAD model keypoints to obtain the final 6D pose $[R \mid t]$.
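The least-squares fitting step [45] admits a compact closed-form SVD solution. The following is a standard transcription of the Arun et al. procedure (not code from our pipeline), recovering R and t from corresponding keypoint sets:

```python
import numpy as np

def fit_rigid_transform(model_kps, detected_kps):
    """Closed-form least-squares fit of R, t such that
    detected ~ R @ model + t (Arun et al., 1987).

    model_kps, detected_kps: (K, 3) corresponding keypoint sets.
    """
    mu_m = model_kps.mean(axis=0)
    mu_d = detected_kps.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (model_kps - mu_m).T @ (detected_kps - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # correct a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_m
    return R, t
```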
3.4. Feature Distillation from Clean to Aerosol
RGB-D data collected in aerosol environments are limited, making it difficult to train robust 6D pose estimation networks. To address this, we propose a feature distillation strategy that transfers feature representations from a clean network trained on large-scale data to an aerosol network.
Figure 3 illustrates the overall structure of our feature distillation. Our method utilizes paired clean RGB-D images $(I_c, D_c)$ and aerosol RGB-D images $(I_h, D_h)$ of the same scene. While acquiring such perfectly aligned pairs in the wild can be challenging, it is feasible in controlled settings (e.g., capturing static scenes before and after aerosol injection) or through synthetic data generation, where atmospheric effects can be toggled. Crucially, this paired requirement is strictly limited to the training phase; during inference, the model relies solely on the aerosol-domain input, ensuring practical deployability. Given these inputs, the aerosol RGB image $I_h$ is first restored through the dehazing module described earlier. Subsequently, a clean point cloud $P_c$ and an Attention-Guided Point cloud $P_{\text{AGP}}$ are constructed from the respective conditions. The clean and aerosol networks employ separate point cloud encoders to extract intermediate layer features, as follows:

$$F_c = E_c(P_c), \qquad F_h = E_h(P_{\text{AGP}}),$$
where $E_c$ and $E_h$ are the point cloud encoders of the clean and aerosol networks, respectively, and $F_c$ and $F_h$ are features extracted from the intermediate layers of each encoder. The two features represent the same scene under different conditions (clean vs. aerosol). The key to feature distillation is aligning aerosol features with clean features while adaptively responding to the spatially non-uniform aerosol distribution. We employ cosine similarity as the distillation loss, as it aligns normalized feature directions without being affected by the magnitude differences that frequently arise under aerosol degradation, as follows:

$$\mathcal{L}_{\text{distill}} = 1 - \frac{F_c \cdot F_h}{\lVert F_c \rVert \,\lVert F_h \rVert}.$$
This loss aligns aerosol features with clean features, while AGP provides spatial reliability cues that weight the distillation process. Low-reliability regions, therefore, receive stronger guidance from clean-domain representations, producing a complementary and more robust learning process.
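A per-point version of this loss, with the attention channel of the AGP used as a spatial weight, might look as follows; the exact weighting scheme is a plausible reading of the attention-guided design rather than the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(f_clean, f_aerosol, attention, eps=1e-8):
    """Attention-weighted cosine-similarity distillation loss.

    f_clean, f_aerosol : (B, C, N) intermediate point features from the
                         clean and aerosol encoders (same scene).
    attention          : (B, N) per-point attention (degradation severity).
    """
    # 1 - cosine similarity per point, over the channel dimension; the
    # clean teacher features are detached so only the student is updated.
    cos = F.cosine_similarity(f_clean.detach(), f_aerosol, dim=1)  # (B, N)
    per_point = 1.0 - cos
    # Heavily degraded points (high attention) receive stronger guidance.
    w = attention / (attention.sum(dim=1, keepdim=True) + eps)
    return (w * per_point).sum(dim=1).mean()
```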
We perform spatially adaptive distillation by leveraging the AGP constructed in Section 3.2. Specifically, we sample points from regions with high attention values for intensive feature distillation. Since points with high attention values correspond to regions with severe aerosol degradation, this sampling strategy enables the network to effectively learn discriminative features from the clean network in the most challenging regions. Additionally, through keypoint detection task supervision, we guide the distilled features to learn task-specific representations that directly contribute to 6D pose estimation. The final training loss combines the keypoint loss and the distillation loss, as follows:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{kp} + \lambda_2 \mathcal{L}_{\text{distill}},$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the two losses. This attention-guided feature distillation enables the aerosol network to learn discriminative representations from the clean network, even with limited aerosol data.
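The sampling and loss combination can be sketched as below; `top_frac` and the use of `torch.topk` to select high-attention points are our assumptions about how sampling from high-attention regions is realized.

```python
import torch
import torch.nn.functional as F

def training_loss(kp_loss, f_clean, f_aerosol, attention,
                  lam_kp=1.0, lam_distill=1.0, top_frac=0.25):
    """Combine keypoint and distillation losses, distilling only on the
    most degraded points (highest attention). lam_* are placeholders."""
    n = attention.size(1)
    k = max(1, int(top_frac * n))
    # Indices of the k most degraded points per sample.
    _, idx = torch.topk(attention, k, dim=1)                     # (B, k)
    idx_feat = idx.unsqueeze(1).expand(-1, f_clean.size(1), -1)  # (B, C, k)
    fc = torch.gather(f_clean, 2, idx_feat)
    fh = torch.gather(f_aerosol, 2, idx_feat)
    cos = F.cosine_similarity(fc.detach(), fh, dim=1)            # (B, k)
    distill = (1.0 - cos).mean()
    return lam_kp * kp_loss + lam_distill * distill
```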
4. Experiments
4.1. Experiment Setup
4.1.1. Dataset
Standard benchmarks for 6D pose estimation are typically captured in clear environments, and datasets containing ground-truth pose annotations under degraded optical conditions are rare due to the inherent difficulty of data acquisition. We therefore evaluate our method on the aerosol RGB-D dataset [46], one of the few benchmarks tailored to this problem setting. This dataset provides paired observations of the same scenes under normal and aerosol conditions using an Intel RealSense D435i sensor. The dataset comprises 4 objects: 2 household objects from LINEMOD [17] (Cat, Glue) and 2 valves used in industrial sites (Ball valve, Globe valve). For each object, 200 pairs of RGB-D images were captured from various viewpoints using a turntable and a height-adjustable camera. Aerosol conditions were generated using a steam generator to simulate challenging situations with limited visibility. We use this real-world dataset for both training and evaluation, since accurately simulating non-homogeneous aerosols and their impact on active IR depth sensing remains challenging. The dataset was split into training and test sets: training scenes feature simple backgrounds, while test scenes feature cluttered environments to evaluate robustness.
4.1.2. Evaluation Metrics
We use the ADD and ADD-S metrics to evaluate 6D pose estimation performance. ADD (Average Distance) applies the predicted and ground-truth poses to the 3D model and computes the average Euclidean distance between corresponding vertices, as follows:

$$\text{ADD} = \frac{1}{|M|} \sum_{x \in M} \left\lVert (Rx + t) - (R^{*}x + t^{*}) \right\rVert,$$

where $M$ is the vertex set of the 3D model, and $[R \mid t]$ and $[R^{*} \mid t^{*}]$ are the predicted and ground-truth poses, respectively. For symmetric objects, we use ADD-S, computing the average distance to the closest corresponding vertex, as follows:

$$\text{ADD-S} = \frac{1}{|M|} \sum_{x_1 \in M} \min_{x_2 \in M} \left\lVert (Rx_1 + t) - (R^{*}x_2 + t^{*}) \right\rVert.$$

Accuracy is calculated by considering predictions with ADD(-S) values within 10% of the object diameter as correct.
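For completeness, a direct NumPy transcription of both metrics (using a brute-force nearest-vertex search for ADD-S, which is quadratic in the number of vertices and typically applied to subsampled models):

```python
import numpy as np

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed vertices."""
    p = model_pts @ R_pred.T + t_pred
    g = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(p - g, axis=1).mean()

def adds_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean distance to the closest ground-truth vertex
    (for symmetric objects)."""
    p = model_pts @ R_pred.T + t_pred               # (V, 3)
    g = model_pts @ R_gt.T + t_gt                   # (V, 3)
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)  # (V, V)
    return d.min(axis=1).mean()

def is_correct(add_value, diameter, thresh=0.1):
    """Standard 0.1 * diameter acceptance threshold."""
    return add_value < thresh * diameter
```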
4.1.3. Implementation Details
We build our framework on FFB6D [15] with SCANet [33] as the dehazing module. The clean network processes standard 9D point clouds, while the aerosol network processes the 10D AGP, where the additional dimension represents attention values quantifying aerosol degradation. Point clouds consist of N = 12,800 points, with K = 8 keypoints for pose estimation. We train for 50 epochs using the Adam optimizer. The loss weights $\lambda_1$ and $\lambda_2$ were selected based on validation experiments to balance the different numerical characteristics of the two loss terms. In particular, the distillation loss is derived from cosine similarity and thus has a bounded dynamic range, whereas the keypoint regression loss is an L1 loss on 3D offsets whose magnitude is sensitive to the data scale. Empirically, this weighting prevents either loss from dominating the optimization and leads to stable convergence across all object categories. Evaluation uses the ADD/ADD-S metrics following the LINEMOD protocol [16]. All experiments are conducted on an NVIDIA A100 GPU.
4.2. Comparison with Representative Methods
One consideration on the aerosol benchmark is that many RGB-D 6D pipelines are ROI-based [47,48,49,50] and rely on external object detectors or bounding-box proposals; under severe aerosol-induced degradation, these upstream modules can fail, and performance may become dominated by detection/ROI errors rather than the pose estimator itself, confounding a fair comparison. To avoid this confounding factor, we use FFB6D [15] as the primary scene-level, end-to-end RGB-D baseline that jointly performs instance segmentation and pose estimation without external ROI proposals. In addition, we include Son et al. [18] as a representative restoration-first pipeline (restoration followed by pose estimation). Under normal conditions, we evaluate ROPE [10] (RGB-only) and FFB6D (RGB-D) to establish baseline performance. Under aerosol conditions, we compare against FFB6D trained on aerosol data and Son et al. [18]. RA6D achieves 58.0% average accuracy under aerosol conditions, outperforming all baselines.
4.2.1. Quantitative Results
Table 1 compares accuracies under normal and aerosol conditions. Under normal conditions, the RGB-D-based FFB6D achieves 69.4% average accuracy, significantly outperforming the RGB-only ROPE (52.2%). While ROPE is designed as an occlusion-robust RGB-based method, it shows limited performance compared with RGB-D approaches due to the absence of depth information, demonstrating the critical role of geometric depth cues in accurate pose estimation. Under aerosol conditions, all methods experience significant performance degradation. The RGB-only ROPE drops sharply to 20.8% (−31.4%), while FFB6D decreases to 27.9% (−41.5%). Interestingly, FFB6D shows greater degradation than ROPE despite using depth information, which indicates that severely degraded depth sensors in aerosol environments produce noisy depth that negatively impacts network learning. Examining object-specific performance, Ball valve and Globe valve show the lowest accuracies of 12.5% and 11.2%, followed by Glue at 28.7%, while Cat stays at 59.5% thanks to rich textures that enable sufficient feature extraction even from degraded RGB.
Son et al. applies dehazing and depth completion in a sequential manner and achieves an average accuracy of 55.0%, representing an approximately twofold improvement over FFB6D under aerosol conditions. While the method performs well on Cat (73.8%) and Glue (57.5%), it shows relatively limited performance on Ball valve (46.3%) and Globe valve (42.5%). This limitation stems from the fact that depth completion, even when producing dense depth maps from noisy measurements, does not explicitly quantify the spatial reliability of depth, making it difficult for the network to distinguish between reliable and unreliable regions. In contrast, our RA6D achieves an average accuracy of 58.0%, outperforming all compared baselines. Rather than relying on depth restoration, RA6D explicitly incorporates spatially varying depth reliability through Attention-Guided Point cloud and leverages feature distillation during training to mitigate the influence of unreliable depth information. As a result, RA6D attains the highest accuracy of 80.9% on Cat and outperforms Son et al. on Ball valve (47.5% vs. 46.3%) and Globe valve (47.5% vs. 42.5%), while showing comparable performance on Glue (56.2% vs. 57.5%). Overall, these results suggest that explicitly accounting for spatially varying depth reliability provides a more effective strategy than depth restoration alone in aerosol-degraded environments.
4.2.2. Computational Efficiency
Table 2 presents a comparison of inference times and computational complexity across methods. While the FFB6D baseline achieves the fastest runtime by employing only the 6D pose module, its practical usability is limited by poor robustness under aerosol conditions. In contrast, Son et al. incurs a substantial computational burden due to depth completion, resulting in prohibitively slow inference and making real-time deployment infeasible.
However, our RA6D achieves significantly faster inference while maintaining competitive accuracy under aerosol conditions by avoiding computationally expensive depth completion and relying solely on an efficient RGB dehazing module and a 6D pose estimation network. This efficiency comes from replacing high-overhead depth reconstruction with attention-guided feature distillation, which effectively transfers discriminative representations from the clean domain without introducing additional heavy components. These results demonstrate that attention-based reliability modeling provides a practical alternative to depth completion for efficient 6D pose estimation in degraded environments.
4.2.3. Qualitative Results
Figure 4 compares the qualitative results of each method. Under aerosol conditions, severe visual degradation makes even object contours difficult to distinguish, particularly for Ball valve and Glue, where heavy aerosols render objects nearly invisible. These extreme degradation conditions allow the verification of each method’s robustness.
Son et al. estimates poses reasonably close to the GT for Ball valve, Cat, and Globe valve. However, noticeable rotation errors persist for Ball valve, suggesting that a restoration-first pipeline (including depth completion) may not fully mitigate aerosol-induced depth degradation on specular/metallic surfaces. In contrast, RA6D yields more accurate and consistent pose estimates across these instances. We attribute this improvement to our attention-guided representation, which uses the dehazing attention map as a spatial reliability cue to downweight low-confidence regions and emphasize more reliable geometry. Specifically, for reflective objects such as Ball and Globe valves, RA6D appears less affected by depth noise than Son et al. For the Cat object, RA6D achieves high accuracy, indicating that feature distillation helps transfer discriminative representations in texture-rich regions. Overall, these qualitative results support that RA6D remains robust under aerosol conditions without relying on explicit depth completion, even across diverse surface properties.
4.3. Ablation Study
We conducted an ablation study to confirm the effectiveness of the dehazing module and each loss component used for feature distillation.
Table 3 indicates how much each part enhances the performance. The full model combining all components achieved the highest average accuracy of 58.0%.
When the dehazing module was removed, the accuracy dropped significantly to 36.0%, demonstrating that the degradation caused by aerosol substantially affects the performance of 6D pose estimation. In particular, metallic objects such as Ball valve and Globe valve showed a performance improvement of more than 30% when the dehazing module was applied. This improvement can be attributed to the restoration of structural boundaries and shape information that were severely damaged by light scattering and absorption, allowing for a more stable feature extraction. Consistent improvements were also observed for non-metallic objects such as Cat and Glue, confirming that the dehazing module provides robust visual representations across diverse materials and shapes.
Next, we analyzed the effect of feature distillation. When applied individually, the distillation loss and keypoint loss achieved accuracies of 50.8% and 54.3%, respectively, while jointly applying both losses led to additional performance gains. This is because $\mathcal{L}_{\text{distill}}$ guides the model to learn the feature representation of the clean network, thereby improving feature quality degraded by aerosols, while $\mathcal{L}_{kp}$ directly provides the supervision signal for accurate pose estimation. Thus, the two losses complement model learning from different perspectives. This trend persists even when the dehazing module is not used, confirming that both losses contribute to robust feature learning.
5. Conclusions
This paper presented RA6D, a framework for robust 6D object pose estimation in aerosol environments where both RGB and depth sensors degrade simultaneously. Our results suggest that explicitly quantifying sensor reliability provides a promising direction for robust perception under degraded sensing conditions. Our approach is built on the insight that the attention mechanism used for RGB dehazing inherently captures the spatial distribution of degradation, which directly corresponds to depth reliability. By integrating this attention information into an Attention-Guided Point cloud (AGP), we provide the network with explicit knowledge of region-wise reliability, and through feature distillation from networks trained under clean conditions, we address the challenge of limited aerosol training data.
Experimental results on aerosol benchmarks validate this direction, demonstrating that reliability-aware representations enable effective learning even when sensor quality varies spatially. This suggests that, even under limited aerosol-domain data and challenging real-world conditions, robust perception is better achieved by quantifying and leveraging depth reliability rather than attempting to restore it. By modeling sensor reliability directly instead of relying on restoration or depth completion, our approach achieves real-time performance and higher accuracy under severe degradation, making it practical for deployment in disaster-response and industrial environments. Future research should extend these principles across diverse sensing modalities and environmental conditions, and develop architectures where reliability information naturally propagates through the entire perception pipeline.
6. Limitations and Future Work
Despite the promising performance of RA6D, this work is limited by data availability and scalability. Our proposed feature distillation framework requires paired clean–aerosol RGB-D data during training, which must be collected through manual procedures in controlled real-world acquisition (e.g., capturing the same static scene before and after aerosol generation under a fixed setup). This requirement constrains dataset scale and diversity, although inference does not require paired data. To our knowledge, very few benchmarks simultaneously cover non-homogeneous aerosol conditions, RGB-D sensor degradation, and ground-truth 6D pose annotations; simulating such interactions remains challenging and may introduce a sim-to-real gap, making real-world capture the most suitable option for realistic evaluation at present. In future work, we plan to extend the framework to unpaired settings via domain adaptation or self-supervised alignment to improve scalability and generalization, and to evaluate the method on objects with more diverse material properties, while preserving robustness to realistic aerosol-induced sensor degradation.
Author Contributions
Conceptualization, Y.C.; methodology, W.S., S.L. and T.K.; software, W.S. and G.S.; validation, T.K., Y.C. and S.L.; formal analysis, W.S.; investigation, W.S. and S.L.; resources, Y.C.; data curation, W.S. and S.L.; writing—original draft preparation, W.S.; writing—review and editing, W.S., S.L., T.K., G.S. and Y.C.; visualization, S.L.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. RS-2022-00144385, 40%), an Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korean government (MSIT) under the metaverse support program to nurture the best talents (No. IITP-2025-RS-2023-00254529, 30%), and a Technology Innovation Program (No. RS-2024-00421828, development of artificial intelligence software for unseen object manipulation that integrates prompt and situation-specific unseen object recognition and arbitrary gripper shape analysis through gripper self-observation, 30%) grant funded by the Ministry of Trade Industry and Energy (MOTIE, Korea).
Data Availability Statement
Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Wen, B.; Lian, W.; Bekris, K.; Schaal, S. You only demonstrate once: Category-level manipulation from single visual demonstration. Proceedings of Robotics: Science and Systems (RSS), New York City, NY, USA, 27 June–1 July 2022. [Google Scholar]
- Kappler, D.; Meier, F.; Issac, J.; Mainprice, J.; Garcia Cifuentes, C.; Wüthrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; Bohg, J. Real-time perception meets reactive motion generation. IEEE Robot. Autom. Lett. 2018, 3, 1864–1871. [Google Scholar] [CrossRef]
- Wen, B.; Lian, W.; Bekris, K.; Schaal, S. CatGrasp: Learning category-level task-relevant grasping in clutter from simulation. In Proceedings of the International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 6401–6408. [Google Scholar]
- Marchand, E.; Uchiyama, H.; Spindler, F. Pose estimation for augmented reality: A hands-on survey. IEEE Trans. Vis. Comput. Graph. 2015, 22, 2633–2651. [Google Scholar] [CrossRef] [PubMed]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
- Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4561–4570. [Google Scholar]
- Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proceedings of the CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7677–7686. [Google Scholar]
- Hodan, T.; Barath, D.; Matas, J. Epos: Estimating 6d pose of objects with symmetries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11504–11513. [Google Scholar]
- Chen, B.; Chin, T.J.; Klimavicius, M. Occlusion-robust object pose estimation with holistic representation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 2929–2939. [Google Scholar]
- Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In Proceedings of the Robotics: Science and Systems (RSS), Pittsburgh, PA, USA, 26–30 June 2018. [Google Scholar]
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3343–3352. [Google Scholar]
- He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; Sun, J. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11632–11641. [Google Scholar]
- Jiang, X.; Li, D.; Chen, H.; Zheng, Y.; Zhao, R.; Wu, L. Uni6D: A Unified CNN Framework without Projection Breakdown for 6D Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11174–11183. [Google Scholar]
- He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3003–3013. [Google Scholar]
- Calli, B.; Singh, A.; Walsman, A.; Srinivasa, S.; Abbeel, P.; Dollar, A.M. The ycb object and model set: Towards common benchmarks for manipulation research. In Proceedings of the International Conference on Advanced Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015; pp. 510–517. [Google Scholar]
- Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the CVF International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 858–865. [Google Scholar]
- Son, W.; Son, G.; Kim, T.; Choi, Y. The Study of Object 6D Pose Estimation for High-Density Aerosol Environment. In Proceedings of the Transactions of the Korean Nuclear Society Autumn Meeting (KNS), Changwon, Republic of Korea, 30–31 October 2025. [Google Scholar]
- Zhu, Q.; Mai, J.; Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar] [CrossRef] [PubMed]
- Zakharov, S.; Ambruș, R.; Guizilini, V.; Kehl, W.; Gaidon, A. Photo-realistic neural domain randomization. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 310–327. [Google Scholar]
- Sundermeyer, M.; Durner, M.; Puang, E.Y.; Marton, Z.C.; Vaskevicius, N.; Arras, K.O.; Triebel, R. Multi-path learning for object pose estimation across domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 13916–13925. [Google Scholar]
- Wang, G.; Manhardt, F.; Liu, X.; Ji, X.; Tombari, F. Occlusion-Aware Self-Supervised Monocular 6D Object Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1788–1803. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Luo, L.; Liang, W.; Yang, Z.X. OA-Pose: Occlusion-aware monocular 6-DoF object pose estimation under geometry alignment for robot manipulation. Pattern Recognit. 2024, 154, 110576. [Google Scholar] [CrossRef]
- Han, Y.; Yoon, T.; Woo, D.; Kim, S.; Kim, H.S. SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations. arXiv 2025, arXiv:2507.05751. Available online: https://arxiv.org/abs/2507.05751 (accessed on 12 November 2025).
- Hao, X.; Wei, M.; Yang, Y.; Zhao, H.; Zhang, H.; Zhou, Y.; Wang, Q.; Li, W.; Kong, L.; Zhang, J. Is your hd map constructor reliable under sensor corruptions? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 12–15 December 2024; pp. 22441–22482. [Google Scholar]
- Zhao, H.; Zhang, J.; Chen, Z.; Zhao, S.; Tao, D. Unimix: Towards domain adaptive and generalizable lidar semantic segmentation in adverse weather. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14781–14791. [Google Scholar]
- Gui, J.; Cong, X.; Cao, Y.; Ren, W.; Zhang, J.; Zhang, J.; Cao, J.; Tao, D. A comprehensive survey and taxonomy on single image dehazing based on deep learning. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
- Ancuti, C.O.; Ancuti, C.; Vasluianu, F.-A.; Timofte, R.; Liu, Y.; Wang, X.; Zhu, Y.; Shi, G.; Lu, X.; Fu, X. NTIRE 2024 dense and non-homogeneous dehazing challenge report. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 6453–6468. [Google Scholar]
- Shu, Q.; Wu, C.; Xiao, Z.; Liu, R.W. Variational regularized transmission refinement for image dehazing. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2781–2785. [Google Scholar]
- Liu, R.W.; Guo, Y.; Lu, Y.; Chui, K.T.; Gupta, B.B. Deep network-enabled haze visibility enhancement for visual iot-driven intelligent transportation systems. IEEE Trans. Ind. Inform. 2022, 19, 1581–1591. [Google Scholar] [CrossRef]
- Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 11908–11915. [Google Scholar]
- Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
- Guo, Y.; Gao, Y.; Liu, W.; Lu, Y.; Qu, J.; He, S.; Ren, W. SCANet: Self-Paced Semi-Curricular Attention Network for Non-Homogeneous Image Dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 1885–1894. [Google Scholar]
- Fu, M.; Liu, H.; Yu, Y.; Chen, J.; Wang, K. DW-GAN: A discrete wavelet transform GAN for non-homogeneous dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 203–212. [Google Scholar]
- Tang, S.; Su, W.; Ye, M.; Zhu, X. Source-free domain adaptation with frozen multimodal foundation model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 23711–23720. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. Available online: https://arxiv.org/abs/1503.02531 (accessed on 13 November 2025).
- Liu, Y.; Zhang, Y.; Lan, R.; Cheng, C.; Wu, Z. AWARDistill: Adaptive and robust 3D object detection in adverse conditions through knowledge distillation. Expert Syst. Appl. 2025, 266, 126032. [Google Scholar] [CrossRef]
- Huang, X.; Wu, H.; Li, X.; Fan, X.; Wen, C.; Wang, C. Sunshine to rainstorm: Cross-weather knowledge distillation for robust 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 2409–2416. [Google Scholar]
- Zhao, H.; Zhang, Q.; Zhao, S.; Chen, Z.; Zhang, J.; Tao, D. Simdistill: Simulated multi-modal distillation for bev 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; pp. 7460–7468. [Google Scholar]
- Lu, X.; Xiao, J.; Zhu, Y.; Fu, X. Continuous adverse weather removal via degradation-aware distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28113–28123. [Google Scholar]
- Chae, Y.; Kim, H.; Yoon, K.J. Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15162–15172. [Google Scholar]
- Yang, X.; Yan, W.; Yuan, Y.; Mi, M.B.; Tan, R.T. Semantic segmentation in multiple adverse weather conditions with domain knowledge retention. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; pp. 6558–6566. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11108–11117. [Google Scholar]
- Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-squares fitting of two 3-d point sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 5, 698–700. [Google Scholar] [CrossRef] [PubMed]
- Yang, H.; Lee, S.; Kim, T.; Choi, Y. 6-DoF Object Pose Estimation Under Aerosol Conditions: Benchmark Dataset and Baseline. J. Inst. Control. Robot. Syst. 2024, 30, 614–620. [Google Scholar] [CrossRef]
- Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; Tombari, F. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 6738–6748. [Google Scholar]
- Lin, Y.; Su, Y.; Nathan, P.; Inuganti, S.; Di, Y.; Sundermeyer, M.; Manhardt, F.; Stricker, D.; Rambach, J.; Zhang, Y. Hipose: Hierarchical binary surface encoding and correspondence pruning for rgb-d 6dof object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10148–10158. [Google Scholar]
- Hong, Z.; Hung, Y.; Chen, C. RDPN6D: Residual-based dense point-wise network for 6Dof object pose estimation based on RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5251–5260. [Google Scholar]
- Wang, Y.; Hu, M.; Li, H.; Luo, C. HccePose (BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 7166–7175. [Google Scholar]