Author Contributions
Conceptualization, C.S.; Methodology, C.L.; Software, C.S.; Validation, C.L. and H.F.; Formal Analysis, C.S.; Investigation, Q.G.; Resources, Q.G.; Data Curation, B.O.; Writing—Original Draft Preparation, C.L. and R.W.; Writing—Review & Editing, Q.G., H.F. and all authors; Visualization, C.S. and R.W.; Supervision, H.F.; Project Administration, Q.G. and H.F.; Funding Acquisition, R.W. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The overall architecture of the proposed DGR-MAE framework. The model adopts a teacher-student dual-branch structure with posterior semantic-guided differential reconstruction for cloud-occluded aircraft recognition. Different colored arrows denote the mask generation and information flow associated with the teacher and student branches, respectively.
Figure 2.
Overview of the proposed ASRAir benchmark. (a) ASRAir-Clean: cloud-free aircraft subset; (b) ASRAir-Occ: cloud-occluded robustness subset; (c) ASRAir-Sev: occlusion-severity evaluation subset stratified into 10 cloud coverage levels.
Figure 3.
Class distribution comparison across ASRAir subsets. (a) ASRAir-Clean. (b) ASRAir-Occ. (c) ASRAir-Sev. (d) Box plot summarizing the distribution statistics of the three subsets.
Figure 4.
Per-class image distribution comparison between ASRAir-Clean and ASRAir-Occ.
Figure 5.
Cloud synthesis statistics. (a) Alpha distribution. (b) Brightness distribution. (c) Correlation between alpha and brightness. (d) Box plot of alpha values.
Figure 6.
Cloud occlusion level distribution in ASRAir-Sev. (a) Number of samples per level. (b) Proportion distribution across severity levels.
Figure 7.
Top-1 classification performance comparison of representative vision models (ViT, iBOT, BEiT, MAE, MCMAE, and the proposed DGR-MAE) across different cloud occlusion levels (Level 1–Level 10) on the ASRAir-Sev dataset.
Figure 8.
Visualization of attention maps during the pre-training stage. Columns correspond to different model categories, including ViT trained from scratch, iBOT based on contrastive self-supervised learning, masked image modeling methods (MAE, CrossMAE, MCMAE, and DMAE), and the proposed DGR-MAE. Warmer colors indicate higher attention responses, while cooler colors indicate lower attention responses.
Figure 9.
Visualization of the masking–reconstruction process during the pre-training stage. The student branch generates masked views via random masking, while the teacher branch produces attention-guided masked views. Reconstruction results from both branches are compared with MAE and CrossMAE. The proposed DGR-MAE achieves more accurate reconstruction in teacher-attended regions, demonstrating improved structural recovery under missing information conditions. The semi-transparent orange overlay indicates semantically important regions identified and preserved by the teacher branch through the attention-guided masking process.
Figure 10.
Visualization of attention maps during the fine-tuning stage under cloud occlusion, with different methods arranged from left to right as ViT, iBOT, MAE, CrossMAE, MCMAE, DMAE, and the proposed DGR-MAE. Warmer colors indicate higher attention responses, while cooler colors indicate lower attention responses.
Figure 11.
Comprehensive confusion matrix visualization of all compared methods listed in Table 1 on the ASRAir-Sev evaluation subset, covering three representative paradigms: from-scratch supervised training, contrastive self-supervised learning, and masked image modeling.
Figure 12.
t-SNE visualization of learned feature embeddings across different methods for multi-scale aircraft recognition under cloud-occluded remote sensing conditions, illustrating intra-class compactness and inter-class separability.
Table 1.
Performance comparison of representative vision models (from-scratch, contrastive, and masked image modeling methods) for cloud-occluded aircraft recognition on the ASRAir-Occ benchmark under the fine-tuning protocol. All methods are evaluated using the same ViT-Base backbone, and computational cost metrics (Params, FLOPs, and inference time) are reported under identical model architecture and inference settings.
| Method | Params (M) | FLOPs (G) | Inference (ms) | Top-1 | Top-5 |
|---|---|---|---|---|---|
| From-scratch training |
| DeiT [38] | 86.6 | 17.6 | 1.76 | 66.73 | 79.74 |
| ViT [18] | 86.6 | 17.6 | 1.76 | 59.53 | 73.26 |
| Contrastive learning methods |
| AttMask [36] | 86.6 | 17.6 | 1.76 | 65.95 | 78.48 |
| MoCo v3 [39] | 86.6 | 17.6 | 1.76 | 63.73 | 79.02 |
| iBOT [40] | 86.6 | 17.6 | 1.76 | 50.42 | 74.22 |
| Masked image modeling methods |
| BEiT [6] | 86.6 | 17.6 | 1.76 | 73.14 | 82.25 |
| MAE [8] | 86.6 | 17.6 | 1.76 | 73.08 | 81.95 |
| SimMIM [9] | 86.6 | 17.6 | 1.76 | 47.00 | 73.68 |
| CrossMAE [26] | 86.6 | 17.6 | 1.76 | 65.17 | 78.96 |
| MCMAE [41] | 86.6 | 17.6 | 1.76 | 73.02 | 83.03 |
| MixMIM [42] | 86.6 | 17.6 | 1.76 | 65.17 | 79.44 |
| BEiT V2 [43] | 86.6 | 17.6 | 1.76 | 71.04 | 80.94 |
| CAE [44] | 86.6 | 17.6 | 1.76 | 72.90 | 83.93 |
| DMAE [37] | 86.6 | 17.6 | 1.76 | 60.85 | 77.10 |
| DGR-MAE (Ours) | 86.6 | 17.6 | 1.76 | 74.28 | 82.79 |
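The Top-1 and Top-5 columns above are top-k accuracies expressed as percentages. As a minimal illustrative sketch (not the authors' evaluation code), the metric can be computed as follows; the function name `top_k_accuracy` and the toy scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the k best.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example with 4 samples and 3 classes (hypothetical values).
scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.3, 0.1],  # true class 0: ranked first
]
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"Top-1: {top1:.2f}  Top-2: {top2:.2f}")
```

Because a Top-1 hit is always also a Top-5 hit, Top-5 can never be lower than Top-1 for the same model, which is consistent with every row of the table.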
Table 2.
Analysis of model robustness under progressive cloud occlusion: Top-1 accuracy comparison on the ASRAir-Sev dataset.
| Method | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 | Level 7 | Level 8 | Level 9 | Level 10 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| From-scratch training |
| DeiT [38] | 96.77 | 97.10 | 96.77 | 98.71 | 99.35 | 97.10 | 93.87 | 75.67 | 40.13 | 9.76 | 63.63 |
| ViT [18] | 95.81 | 97.74 | 97.10 | 95.48 | 98.39 | 93.87 | 86.13 | 55.64 | 27.02 | 6.59 | 57.19 |
| Contrastive learning methods |
| AttMask [36] | 96.13 | 97.10 | 97.42 | 97.10 | 99.68 | 97.10 | 91.61 | 70.03 | 31.23 | 8.22 | 61.01 |
| MoCo v3 [39] | 94.19 | 95.81 | 96.77 | 96.45 | 97.10 | 95.81 | 88.71 | 65.88 | 27.83 | 6.02 | 58.64 |
| iBOT [40] | 86.13 | 91.61 | 89.68 | 86.13 | 87.10 | 81.94 | 69.03 | 41.25 | 18.93 | 6.51 | 49.22 |
| Masked image modeling methods |
| BEiT [6] | 96.13 | 93.23 | 94.52 | 90.00 | 93.23 | 85.48 | 67.42 | 30.42 | 13.43 | 5.70 | 48.60 |
| MAE [8] | 98.39 | 97.74 | 99.03 | 99.35 | 100.00 | 98.06 | 95.16 | 77.30 | 32.69 | 10.25 | 63.55 |
| SimMIM [9] | 68.39 | 77.42 | 80.97 | 75.48 | 74.52 | 72.90 | 57.42 | 37.09 | 15.86 | 6.51 | 42.63 |
| CrossMAE [26] | 95.81 | 96.45 | 96.77 | 97.10 | 98.71 | 96.77 | 89.03 | 60.24 | 25.24 | 8.38 | 58.49 |
| MCMAE [41] | 97.42 | 98.71 | 98.39 | 98.06 | 98.71 | 98.39 | 93.55 | 71.36 | 29.77 | 8.38 | 61.52 |
| MixMIM [42] | 96.45 | 96.77 | 97.42 | 97.10 | 97.74 | 96.77 | 96.13 | 56.82 | 20.55 | 7.73 | 57.07 |
| BEiT V2 [43] | 97.74 | 98.71 | 98.06 | 97.74 | 98.06 | 96.77 | 87.42 | 49.41 | 19.74 | 8.46 | 56.49 |
| CAE [44] | 98.39 | 98.06 | 98.39 | 99.35 | 100.00 | 97.74 | 95.81 | 80.86 | 40.29 | 8.79 | 64.68 |
| DMAE [37] | 90.32 | 93.55 | 95.48 | 93.55 | 93.55 | 90.97 | 78.71 | 47.77 | 14.72 | 6.02 | 52.42 |
| DGR-MAE (Ours) | 99.03 | 99.35 | 99.35 | 99.35 | 100.00 | 99.68 | 97.74 | 86.20 | 46.28 | 11.07 | 67.28 |
Table 3.
Ablation study on pretraining dataset.
| Pretrain | Finetune | Top-1 |
|---|---|---|
| ASRAir-Clean | ASRAir-Occ | 74.28 |
| ASRAir-Occ | ASRAir-Occ | 73.26 |
Table 4.
Ablation study on teacher branch.
| Teacher Branch | Top-1 |
|---|---|
| w/o teacher | 72.84 |
| w/ teacher | 74.28 |
Table 5.
Ablation study on student masking strategy.
| Masking Strategy | Top-1 |
|---|---|
| Random | 74.28 |
| Grid | 73.08 |
| Block | 72.54 |
Table 6.
Ablation study on different mask ratios.
| Mask Ratio | Description | Top-1 |
|---|---|---|
| 0.25 | Low masking intensity | 72.90 |
| 0.50 | Moderate masking intensity | 73.08 |
| 0.75 | Standard masking intensity | 74.28 |
| 0.80 | High masking intensity | 73.50 |
| 0.85 | Very high masking intensity | 72.66 |
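To make the mask ratios above concrete, the following is a minimal sketch of MAE-style random patch masking; it is not the authors' implementation. The helper name `random_patch_mask` is hypothetical, and the count of 196 patches assumes a 224×224 input split into 16×16 patches, which the tables do not state.

```python
import random
from typing import List, Optional


def random_patch_mask(num_patches: int,
                      mask_ratio: float,
                      seed: Optional[int] = None) -> List[bool]:
    """Return one flag per patch: True means the patch is masked (hidden)."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    # Number of visible patches, truncated toward zero.
    num_keep = int(num_patches * (1 - mask_ratio))
    rng = random.Random(seed)
    # Sample the visible patch indices uniformly without replacement.
    kept = set(rng.sample(range(num_patches), num_keep))
    return [i not in kept for i in range(num_patches)]


# Assumption: 196 patches (14 x 14 grid), as for a 224x224 image with 16x16 patches.
NUM_PATCHES = 196
for ratio in (0.25, 0.50, 0.75):
    mask = random_patch_mask(NUM_PATCHES, ratio, seed=0)
    visible = NUM_PATCHES - sum(mask)
    print(f"mask ratio {ratio:.2f}: {visible} visible, {sum(mask)} masked")
```

Under this assumption, a ratio of 0.75 leaves 49 visible patches for the encoder, while 0.25 leaves 147, so lower ratios make reconstruction easier but increase encoder cost during pre-training.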
Table 7.
Ablation study on C2CF module components. ✓ indicates the component is used, and × indicates it is not used.
| Depthwise Conv | Cross-Attn | Top-1 |
|---|---|---|
| ✓ | ✓ | 74.28 |
| ✓ | × | 72.72 |
| × | ✓ | 73.08 |
| × | × | 72.72 |
Table 8.
Ablation study on mask reweighting strategy.
| a | b | Description | Top-1 |
|---|---|---|---|
| 0.5 | 0.5 | Balanced weighting | 73.02 |
| 0.6 | 0.4 | Focus on key regions | 74.28 |
| 0.7 | 0.3 | Emphasize key regions | 73.14 |
| 0.8 | 0.2 | Over-emphasize key regions | 73.08 |
| 0.9 | 0.1 | Extreme emphasis on key regions | 73.56 |